scikit-learn: 3.5. Validation curves: plotting scores to evaluate models

阿新 • Published: 2017-05-30


Reference: http://scikit-learn.org/stable/modules/learning_curve.html

An estimator's generalization error can be decomposed in terms of bias, variance and noise. The bias of an estimator is its average error for different training sets. The variance of an estimator indicates how sensitive it is to varying training sets. Noise is a property of the data.

First some background, which leads into the topic of this section. The background is:

Take the function cos(1.5πx) and fit it with different estimators: linear regression with polynomial features of degree 1, 4 and 15. The results are shown below:

[Figure: three panels showing the degree 1, 4 and 15 polynomial fits to cos(1.5πx)]

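The first panel shows high bias, the second is about right, and the third shows high variance. But that is not the main point.

The main point is this: for the one-dimensional cosine function we can tell bias from variance by plotting the fit. For high-dimensional examples we cannot identify them by plotting, and that is where the material below becomes helpful.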
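The degree 1, 4 and 15 comparison can also be read from scores rather than from a plot. The following is a minimal sketch, not the original figure's script: the sample size (30), the noise level (0.1), the random seed and the 10-fold cross-validation are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def true_fun(x):
    # The target function from the text: cos(1.5 * pi * x)
    return np.cos(1.5 * np.pi * x)


rng = np.random.RandomState(0)  # assumed seed, for reproducibility only
n_samples = 30                  # assumed sample size
X = np.sort(rng.rand(n_samples))
y = true_fun(X) + rng.randn(n_samples) * 0.1  # assumed noise level

cv_mse = {}
for degree in (1, 4, 15):
    # Linear regression on polynomial features of the given degree
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        LinearRegression(),
    )
    scores = cross_val_score(
        model, X[:, np.newaxis], y,
        scoring="neg_mean_squared_error", cv=10,
    )
    cv_mse[degree] = -scores.mean()  # flip the sign back to a plain MSE
    print(f"degree={degree:2d}  cross-validated MSE={cv_mse[degree]:.3g}")
```

A cross-validated error that is high for both the simplest and the most flexible model, and lowest in between, is the numeric counterpart of the three panels, and it works in any number of dimensions.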

1. Validation curve

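To validate a model we need a scoring function (see Model evaluation: quantifying the quality of predictions; translated article: http://blog.csdn.net/mmc2015/article/details/47121611). To find a good combination of hyperparameters we often use grid search or a similar method (see Grid Search: Searching for estimator parameters; translated article: http://blog.csdn.net/mmc2015/article/details/47100091). During grid search we look for the hyperparameter combination that gives the highest score on the validation sets. (Note that once the validation sets have been used this way, their score is biased in the model's favour, so to assess generalization you must evaluate on a separate, independent test set.)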
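As a rough illustration of that workflow, the sketch below runs a small grid search and then scores the selected model on a test set that played no part in the search. The digits dataset, the SVC estimator, the parameter grid and the 25% split are assumptions made for this example, not choices taken from the original article.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Hold out an independent test set before any hyperparameter search.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Assumed, deliberately small grid.
param_grid = {"gamma": [1e-4, 1e-3, 1e-2], "C": [1, 10]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

# best_score_ is the mean validation score of the winning combination;
# it was used for selection, so it is optimistically biased.
print("best params:", search.best_params_)
print("best cross-validated score:", search.best_score_)

# The held-out score is the estimate to report for generalization.
test_score = search.score(X_test, y_test)
print("held-out test score:", test_score)
```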

That, however, is not the main point either.

The main point is that we want to plot the influence of a single hyperparameter on the training score and the validation score, which helps determine whether the estimator is overfitting or underfitting.

[Figure: training score and validation score plotted against the parameter gamma]

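If the training score and the validation score are both low, the estimator is underfitting. If the training score is high but the validation score is low, it is overfitting. If both are high, the estimator is working well (the figure above suggests that gamma is best chosen between 0.0001 and 0.001). A low training score with a high validation score is unlikely.

(In practice this method is not very useful, because a model is affected not by one parameter alone but by the combined effect of the others, so grid search is more reliable. If there is only one parameter, this method works well.)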
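A minimal sketch of computing such a curve with `sklearn.model_selection.validation_curve` follows. The digits dataset, the SVC estimator, the gamma range and the 5-fold cross-validation are assumptions for illustration and may differ from what produced the figure.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Assumed range for the single hyperparameter being varied.
param_range = np.logspace(-6, -1, 6)

# Each returned array has shape (len(param_range), n_cv_folds).
train_scores, valid_scores = validation_curve(
    SVC(), X, y,
    param_name="gamma", param_range=param_range,
    cv=5, scoring="accuracy",
)

train_mean = train_scores.mean(axis=1)
valid_mean = valid_scores.mean(axis=1)
for gamma, tr, va in zip(param_range, train_mean, valid_mean):
    print(f"gamma={gamma:.0e}  train={tr:.3f}  validation={va:.3f}")
```

Reading the rows with the rules above: both means low indicates underfitting, a high training mean with a low validation mean indicates overfitting. The two mean arrays can be passed to a plotting call such as matplotlib's `plt.semilogx(param_range, train_mean)` to draw the curve.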

2. Learning curve

A learning curve shows the validation and training score of an estimator for varying numbers of training samples.

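[Figure: learning curve in which the training score and the validation score converge to a low value]

As the figure above shows, if both the validation score and the training score converge to a value that is too low with increasing size of the training set, we will not benefit much from more training data. In that case, consider switching to a different estimator or tuning the parameters.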

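[Figure: learning curve in which the training score stays well above the validation score]

As the figure above shows, if the training score is much greater than the validation score for the maximum number of training samples, adding more training samples will most likely increase generalization. In that case, consider collecting more samples.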
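The scores behind a learning curve can be computed with `sklearn.model_selection.learning_curve`. This is a minimal sketch under assumptions: the digits dataset, the GaussianNB estimator, five training sizes and 5-fold cross-validation are illustrative choices, not necessarily those behind the figures.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)

# train_sizes given as fractions of the largest available training fold.
train_sizes, train_scores, valid_scores = learning_curve(
    GaussianNB(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)

# train_sizes comes back as absolute sample counts; the score arrays
# have shape (len(train_sizes), n_cv_folds).
for n, tr, va in zip(train_sizes,
                     train_scores.mean(axis=1),
                     valid_scores.mean(axis=1)):
    print(f"n_train={n:4d}  train={tr:.3f}  validation={va:.3f}")
```

If the last rows show both scores levelling off at a low value, more data is unlikely to help. A large remaining gap between them at the biggest training size suggests that more samples would.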

For the programs that generated the figures above, see:

Underfitting vs. Overfitting
Plotting Validation Curves
Plotting Learning Curves
