Statistical Methods for Machine Learning

阿新 • • 發佈：2018-05-25

AS n-2 cal 元素 n) pan size AC 情況

機器學習中的統計學方法。

統計學是機器學習的一個支柱。

原始觀察僅僅是數據, 但它們不是信息或知識。數據引發問題, 例如:

什麽是最常見的或預期的觀察？
觀察的限制是什麽？
數據是什麽樣子的？

哪些變量最相關？
兩個實驗的區別是什麽？
這些差異是真實的還是數據中噪音的結果？

眾數(mode)、平均數(Mean)和中位數(Median)

眾數、平均數和中位數在某些情況下測量的都是數據的中心。

下面兩個公式分別計算的是sample和population的平均數：

技術分享圖片

要想找出數據的中位數，我們首先要給數據排序。假設我們有n個已經排好序的數，它們是x1,x2,x3,…,xn。下面是找出它們中位數的公式：

技術分享圖片

Q1、Q3、IQR、方差和標準差

參見Boxplot。請看下圖：

技術分享圖片

上圖中已經很明白地說明Q1、Q3和IQR各自的含義了。從上圖我們也看到了小於Q1?1.5?IQR或大於Q3+1.5?IQR是可能存在的異常值。在一些情況下，統計學家用這樣的方法去掉異常值。

下面，介紹一個找Q1、Q3的方法。

找到Q2，也就是數據集的Median，因此把數據集分成兩部分

找上半部分的Median，即Q3
找下半部分的Median，即Q1

方差和標準差度量的是數據的分散程度。計算方差和標準差的公式如下：

技術分享圖片

但是絕對值不是更簡單明了嗎，它也可以度量數據的分散程度啊？為什麽我們要費這麽大功夫去平方然後在開根號求標準差？這是因為在統計分析中，標準差有一些很Cool的性質。

技術分享圖片

從上圖我們可以看出，在正態分布中，有大約68%的數據落在距離平均值1個標準差的範圍內，有大約95%的數據落在距離平均值2個標準差的範圍內，等等。實際上，我們可以求出任意百分比的數據落在什麽樣的標準差範圍內。因此，求出標準差至關重要。

如果我們的數據集是整個population，那麽求標準差的公式和上面的一樣。但是如果我們的數據集僅僅是從population中抽取的sample，我們的公式如下：

技術分享圖片

把它叫做Sample standard deviation. 直觀上來講，population中數據大多數都分布在中心，因此我們的Sample中的數據基本上都來自於中心，這樣所計算出的標準差要比真實的標準差要小，因為它的數據分散程度要小。因此我們要用N-1來求解（叫做Bessel’s Correction），這樣會使我們求出的標準差更加接近真實的標準差。Sample standard deviation也就是population標準差 $σ$

σ的估算。

$σ$

$X ：$
$μ ：$
$σ ：$

$σ ：$ 當我們standardization正態分布時（即z-score過程），我們將得到一個標準的正態分布，即平均值為0，標準差為1的正態分布。

技術分享圖片

在上圖中的正態分布中，X軸上隨機選擇一個小於x的概率等於負無窮到x與曲線形成的面積。

可以用微積分的知識求出任意兩點與曲線之間形成的面積。我們也可以用Z-Table來求出小於某個x值的面積。但是，在用Z-Table之前，我們必須要把正態分布standardization，也就是求出對應x值的z-score。

中心極限定理（Central limit theorem）

假設一個sample包含很多的observations，每個observation是隨機生成的並且它們之間是相互獨立的，計算這個sample的平均值。重復計算這樣sample的平均值，中心極限定理告訴我們這些平均值服從正態分布。

在概率理論中，中心極限定理的定義為：在特定的條件下，不管潛在的population分布是什麽樣的，大量重復地計算獨立隨機變量的算術平均值，這些平均值將服從正態分布。

抽樣分布（Sampling Distribution）

維基百科上給出抽樣分布的定義為：In statistics, a sampling distribution or finite-sample distribution is the probability distribution of a given statistic based on a random sample.

舉個例子，假設我們有一個mean為μ，方差為σ²的正態分布。我們重復地從這個population中取出samples，然後分別計算每個sample的平均值，這個統計值叫做sample mean.

每個sample都有一個平均值，這些平均值的分布叫做sampling distribution of the sample mean.

由於population的分布是正態分布，這個分布也是正態分布，它服從N(μ, σ²/n)，這裏n為sample size. 根據中心極限定理，即使population分布不是正態的，sampling distribution也通常接近於正態分布。

例子

以下是應用機器學習項目中使用統計方法的10個例子。

問題框架: 需要使用探索性數據分析和數據挖掘。
數據理解: 需要使用摘要統計信息和數據可視化。
數據清理。需要使用異常檢測、歸一化等。
數據選擇。需要使用數據取樣和特征選擇方法。
數據準備。需要使用數據轉換、縮放、編碼等等。

模型計算。需要實驗設計和重新取樣方法。
模型配置。需要使用統計假設測試和估計統計。
模型選擇。需要使用統計假設測試和估計統計。
模型表示。需要使用估計統計信息, 如置信區間。
模型預測。需要使用估計統計信息, 如預測間隔。

Statistical Methods for Machine Learning

AS n-2 cal 元素 n) pan size AC 情況機器學習中的統計學方法。統計學是機器學習的一個支柱。原始觀察僅僅是數據, 但它們不是信息或知識。數據引發問題, 例如: 什麽是最常見的或預期的觀察？觀察的限制是什麽？數據是什麽樣子的？

斯坦福大學公開課機器學習：machine learning system design | data for machine learning（數據量很大時，學習算法表現比較好的原理）

ali 很多好的 info 可能斯坦福大學公開課數據 div http 下圖為四種不同算法應用在不同大小數據量時的表現，可以看出，隨著數據量的增大，算法的表現趨於接近。即不管多麽糟糕的算法，數據量非常大的時候，算法表現也可以很好。數據量很大時，學習算法表現比

U25%(1,16) and U25%(1,168)on《C4.5:programs for machine learning》

when calculating U C

《C4.5: Programs for Machine Learning》chaper4實驗結果重現

使用自帶的ｖｏｔｅ資料集：實驗結果如下：剪枝前： physician fee freeze = n: | adoption of the budget resolution = y: democrat (151.0) | adoption of the budget resolution

the resource for machine learning

Questions and Answers What's matrix dot product in Deep Learning? Deep Neural Network with Matrices https://matrices.io/deep-neural-network-from-scrat

[Infographic] The Best Tools for Machine Learning Gengo AI

Machine learning projects can range from small datasets and standard algorithms, to much larger projects that use neural networks engines with massive data

The 50 Best Public Datasets for Machine Learning

The 50 Best Public Datasets for Machine LearningWhat are some open datasets for machine learning? After scrapping the web for hours after hours, we have cr

Facebook's PyTorch plans to light the way to speedy workflows for Machine Learning • DEVCLASS

Facebook's development department has finished a first release candidate for v1 of its PyTorch project – just in time for the first conference dedicated to

Essential libraries for Machine Learning in Python

Python is often the language of choice for developers who need to apply statistical techniques or data analysis in their work. It is also used by data scie

Top 10 Open Image Datasets for Machine Learning Research

This article would succinctly describe the best ten image datasets used for certain fundamental computer vision problems such as classification, detecti

Why Data Normalization is necessary for Machine Learning models

Why Data Normalization is necessary for Machine Learning modelsNormalization is a technique often applied as part of data preparation for machine learning.

NXP Owns the Stage for Machine Learning in Edge Devices

SAN JOSE, Calif. and BARCELONA, Spain, Oct. 16, 2018 (GLOBE NEWSWIRE) -- (ARMTECHCON and IoT World Congress Barcelona) - Mathematical advances that are dri

NXP's New Development Platform for Machine Learning in the IoT

NXP Semiconductors has launched a new machine learning toolkit. Called "eIQ", it's a software development platform that supports popular neural network fra

Free Online Course: Neural Networks for Machine Learning from Coursera Class Central

I honestly can't understand the multiple 5 star reviews presented on this site about the course. I'm giving it a 1 star which is a bit harsh I know but I'm

Marginally Interesting: Slides for Machine Learning on Streams

Tweet Yesterday I gave a talk at the Big Data Beers meetup in Berlin on

Using Amazon’s Mechanical Turk for Machine Learning Data

How to build a model from Mechanical Turk resultsAmazon Mechanical Turk will notify you when your results are ready and you will finally have a labelled da

Abdul Latif Jameel Clinic for Machine Learning in Health at MIT aims to revolutionize disease prevention, detection, and treatme

Today, MIT and Community Jameel, the social enterprise organization founded and chaired by Mohammed Abdul Latif Jameel ’78, launched the Abdul Latif Jameel

Statistical Methods for Machine Learning

眾數(mode)、平均數(Mean)和中位數(Median)

Q1、Q3、IQR、方差和標準差

$σ$

中心極限定理（Central limit theorem）

抽樣分布（Sampling Distribution）

例子

Statistical Methods for Machine Learning

斯坦福大學公開課機器學習：machine learning system design | data for machine learning（數據量很大時，學習算法表現比較好的原理）

U25%(1,16) and U25%(1,168)on《C4.5:programs for machine learning》

《C4.5: Programs for Machine Learning》chaper4實驗結果重現

the resource for machine learning

[Infographic] The Best Tools for Machine Learning Gengo AI

The 50 Best Public Datasets for Machine Learning

Facebook's PyTorch plans to light the way to speedy workflows for Machine Learning • DEVCLASS

Essential libraries for Machine Learning in Python

Top 10 Open Image Datasets for Machine Learning Research

Why Data Normalization is necessary for Machine Learning models

NXP Owns the Stage for Machine Learning in Edge Devices

NXP's New Development Platform for Machine Learning in the IoT

Free Online Course: Neural Networks for Machine Learning from Coursera Class Central

Marginally Interesting: Slides for Machine Learning on Streams

Using Amazon’s Mechanical Turk for Machine Learning Data

Abdul Latif Jameel Clinic for Machine Learning in Health at MIT aims to revolutionize disease prevention, detection, and treatme

Assessing Annotator Disagreements in Python to Build a Robust Dataset for Machine Learning

Setting Up Python for Machine Learning on Windows

Differential geometry for machine learning （微分幾何在機器學習中的應用）

Statistical Methods for Machine Learning

眾數(mode)、平均數(Mean)和中位數(Median)

Q1、Q3、IQR、方差和標準差

Z-Score和正態分布

中心極限定理（Central limit theorem）

抽樣分布（Sampling Distribution）

例子

相關推薦

$σ$