Python Data Mining and Machine Learning: Telecom Credit Risk Assessment in Practice (4), Model Training and Tuning

阿新 • Published: 2019-01-01

Splitting the training data

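Split the training data into a training set and a hold-out validation set at a ratio of 7:3. x_train and y_train are used to fit the model; x_test and y_test are used to validate it.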

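from sklearn.model_selection import train_test_split

# Index by user ID; the first remaining column is the label, the rest are features.
data_train = data_train.set_index('UserI_Id')
y = data_train[data_train.columns[0]]
x = data_train[data_train.columns[1:]]
# 70% of the rows for training, 30% held out for validation.
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=10)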
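As a quick sanity check on the 7:3 split, the sketch below applies the same call to a synthetic 7,000-row frame. The row count is inferred from the 4,900 and 2,100 supports reported later in this article, and the column names `label`, `f1` and `f2` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for data_train: the first column is the label,
# the remaining columns are features (all values are random).
rng = np.random.RandomState(10)
demo = pd.DataFrame({
    "label": rng.randint(0, 2, size=7000),
    "f1": rng.normal(size=7000),
    "f2": rng.normal(size=7000),
})

y_demo = demo[demo.columns[0]]
x_demo = demo[demo.columns[1:]]
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, train_size=0.7, random_state=10
)

# Row counts of the two splits and the share of positives in each.
print(x_tr.shape, x_te.shape)
print(y_tr.mean(), y_te.mean())
```

If the class shares differ noticeably between the two splits, passing `stratify=y_demo` to `train_test_split` keeps them aligned; the original code does not do this.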

Random forest with default parameters

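Start with a random forest that uses the default parameters, and judge the model by its out-of-bag score. In each round of bagging, roughly 36.8% of the training rows are never drawn into the bootstrap sample. These rows are called out-of-bag (OOB) data. Because they take no part in fitting that round's tree, they can be used to check how well the model generalizes.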
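The 36.8% figure follows from sampling n rows with replacement: a given row is missed in every draw with probability (1 - 1/n)^n, which tends to 1/e ≈ 0.368 as n grows. A minimal sketch that checks this both analytically and by drawing one bootstrap sample (n = 4900 matches the training split in this article; the seed is arbitrary):

```python
import math
import numpy as np

n = 4900  # size of the training split used in this article

# Probability that one given row is never picked in n draws with replacement.
p_never_drawn = (1 - 1 / n) ** n
print(p_never_drawn, math.exp(-1))

# Empirical check: draw one bootstrap sample and measure the share of rows left out.
rng = np.random.default_rng(0)
sample = rng.integers(0, n, size=n)
oob_fraction = 1 - np.unique(sample).size / n
print(oob_fraction)
```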

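from sklearn.ensemble import RandomForestClassifier

# Note: the run shown here used the then-default n_estimators=10; newer scikit-learn releases default to 100.
rf = RandomForestClassifier(oob_score=True, random_state=0)
rf.fit(x_train, y_train)
print(rf)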

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn import metrics

# Metrics on the training set
y_train_pred = rf.predict(x_train)
y_train_predprob = rf.predict_proba(x_train)[:, 1]
print('Training set OOB score: ', rf.oob_score_)
print("Training set AUC Score: %f" % metrics.roc_auc_score(y_train, y_train_predprob))
print('Training set accuracy: ', accuracy_score(y_train, y_train_pred))
print('Training set precision: ', precision_score(y_train, y_train_pred))
print('Training set recall: ', recall_score(y_train, y_train_pred))
print('Training set F1 score: ', f1_score(y_train, y_train_pred))
print(metrics.classification_report(y_train, y_train_pred))
print(metrics.confusion_matrix(y_train, y_train_pred))

# Metrics on the held-out validation set
y_test_pred = rf.predict(x_test)
y_test_predprob = rf.predict_proba(x_test)[:, 1]
print('Test set accuracy: ', accuracy_score(y_test, y_test_pred))
print('Test set precision: ', precision_score(y_test, y_test_pred))
print('Test set recall: ', recall_score(y_test, y_test_pred))
print('Test set F1 score: ', f1_score(y_test, y_test_pred))
print(metrics.classification_report(y_test, y_test_pred))
print(metrics.confusion_matrix(y_test, y_test_pred))

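With the default parameters the OOB score is about 0.77, but the F1 score drops from 0.99 on the training set to 0.80 on the validation set. A gap this large means the model overfits and generalizes poorly, so the next step is to tune the parameters with grid search.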

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=True, random_state=0,
            verbose=0, warm_start=False)
Training set OOB score:  0.775510204082
Training set AUC Score: 0.999003
Training set accuracy:  0.98693877551
Training set precision:  0.988970588235
Training set recall:  0.984947111473
Training set F1 score:  0.986954749287
             precision    recall  f1-score   support

          0       0.98      0.99      0.99      2442
          1       0.99      0.98      0.99      2458

avg / total       0.99      0.99      0.99      4900

[[2415   27]
 [  37 2421]]
Test set accuracy:  0.805714285714
Test set precision:  0.810176125245
Test set recall:  0.79462571977
Test set F1 score:  0.802325581395
             precision    recall  f1-score   support

          0       0.80      0.82      0.81      1058
          1       0.81      0.79      0.80      1042

avg / total       0.81      0.81      0.81      2100

[[864 194]
 [214 828]]
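The training and validation reports above repeat the same metric calls. One way to make the train/validation gap explicit is to wrap those calls in a helper and print both splits side by side. The sketch below is only an illustration: `summarize` is a hypothetical name, and the data comes from `make_classification` rather than the credit-risk set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split


def summarize(model, x, y):
    """Return the metrics reported in this article for one data split."""
    pred = model.predict(x)
    prob = model.predict_proba(x)[:, 1]
    return {
        "auc": roc_auc_score(y, prob),
        "accuracy": accuracy_score(y, pred),
        "precision": precision_score(y, pred),
        "recall": recall_score(y, pred),
        "f1": f1_score(y, pred),
    }


# Synthetic, noisy binary data standing in for the real features (assumption).
x_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     flip_y=0.2, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, train_size=0.7,
                                          random_state=10)

# n_estimators=10 mirrors the default of the scikit-learn version used above.
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(x_tr, y_tr)
train_scores = summarize(model, x_tr, y_tr)
valid_scores = summarize(model, x_te, y_te)

# Columns: metric, training value, validation value, gap.
# A large positive gap suggests overfitting.
for name in train_scores:
    gap = train_scores[name] - valid_scores[name]
    print(name, round(train_scores[name], 3), round(valid_scores[name], 3), round(gap, 3))
```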

Grid search over n_estimators, the number of weak learners (trees)

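Search n_estimators from 10 to 100 in steps of 10, scoring by roc_auc with 5-fold cross-validation. The oob_score flag controls whether out-of-bag samples are used to evaluate the model. The best value found is {'n_estimators': 90}, with a score of 0.907140520132.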

from sklearn.model_selection import GridSearchCV

param_test1 = {'n_estimators': range(10, 101, 10)}
gsearch1 = GridSearchCV(
    estimator=RandomForestClassifier(max_depth=8, max_features='sqrt', oob_score=True, random_state=10),
    param_grid=param_test1, scoring='roc_auc', cv=5)
gsearch1.fit(x_train, y_train)
print(gsearch1.cv_results_, gsearch1.best_params_, gsearch1.best_score_)
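Printed raw, `cv_results_` is a large dictionary of arrays and is hard to read. A minimal sketch of one way to inspect it is to load it into a pandas DataFrame and keep only the columns of interest. The data here is synthetic (`make_classification`) and the grid is shortened so the example runs quickly; neither comes from the original article.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption); small so the search finishes quickly.
x_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

search = GridSearchCV(
    estimator=RandomForestClassifier(max_depth=8, max_features="sqrt", random_state=10),
    param_grid={"n_estimators": [10, 30, 50]},
    scoring="roc_auc",
    cv=5,
)
search.fit(x_demo, y_demo)

# One row per candidate, best mean cross-validated AUC first.
results = pd.DataFrame(search.cv_results_)[
    ["param_n_estimators", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results.to_string(index=False))
print(search.best_params_, search.best_score_)
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the difference between two candidates is larger than the fold-to-fold noise.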

Grid search over max_depth (maximum tree depth) and min_samples_split (minimum samples required to split an internal node)

The search returns {'min_samples_split': 45, 'max_depth': 8} with a score of 0.90777502455. Refitting with these values raises the OOB score from 0.77 to 0.828979591837.

param_test2 = {'max_depth': range(3, 14, 1), 'min_samples_split': range(5, 51, 5)}
# The original call passed iid=False; that parameter was removed in scikit-learn 0.24,
# and current versions always average the fold scores the way iid=False did.
gsearch2 = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=90, max_features='sqrt', oob_score=True, random_state=10),
    param_grid=param_test2, scoring='roc_auc', cv=5)
gsearch2.fit(x_train, y_train)
print(gsearch2.cv_results_, gsearch2.best_params_, gsearch2.best_score_)

# Refit with the best values found so far and check the OOB score.
rf1 = RandomForestClassifier(n_estimators=90, max_depth=8, min_samples_split=45, max_features='sqrt', oob_score=True, random_state=10)
rf1.fit(x_train, y_train)
print(rf1.oob_score_)
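For a two-parameter search like the one above, the mean scores are easier to compare as a max_depth by min_samples_split table. The sketch below pivots `cv_results_` into that shape. It runs on synthetic data with a reduced grid, fewer trees and 3 folds, all chosen only to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption), not the credit-risk set.
x_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, max_features="sqrt", random_state=10),
    param_grid={"max_depth": [3, 5, 8], "min_samples_split": [5, 25, 45]},
    scoring="roc_auc",
    cv=3,
)
search.fit(x_demo, y_demo)

results = pd.DataFrame(search.cv_results_)
# Cast the parameter columns to int so they sort numerically as table labels.
results["max_depth"] = results["param_max_depth"].astype(int)
results["min_samples_split"] = results["param_min_samples_split"].astype(int)
table = results.pivot(index="max_depth", columns="min_samples_split",
                      values="mean_test_score")
print(table.round(4))
```

A flat region in the table suggests the score is insensitive to that parameter, in which case the simpler setting is usually the safer choice.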

Tuning min_samples_split and min_samples_leaf (minimum samples per leaf) together

The value of min_samples_split cannot be fixed yet, because it interacts with other decision tree parameters. So min_samples_split and min_samples_leaf are tuned together next.
The search returns {'min_samples_leaf': 10, 'min_samples_split': 70} with a score of 0.90753607636810929.

param_test3 = {'min_samples_split': range(30, 150, 20), 'min_samples_leaf': range(10, 60, 10)}
# iid=False from the original call is dropped (removed in scikit-learn 0.24).
gsearch3 = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=90, max_depth=8, max_features='sqrt', oob_score=True, random_state=10),
    param_grid=param_test3, scoring='roc_auc', cv=5)
gsearch3.fit(x_train, y_train)
print(gsearch3.cv_results_, gsearch3.best_params_, gsearch3.best_score_)

Tuning max_features (maximum number of features per split)

The search returns {'max_features': 5} with a score of 0.90721677436061976.

param_test4 = {'max_features': range(3, 20, 2)}
# iid=False from the original call is dropped (removed in scikit-learn 0.24).
gsearch4 = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=90, max_depth=8, min_samples_split=70, min_samples_leaf=10, oob_score=True, random_state=10),
    param_grid=param_test4, scoring='roc_auc', cv=5)
gsearch4.fit(x_train, y_train)
print(gsearch4.cv_results_, gsearch4.best_params_, gsearch4.best_score_)
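The four searches above tune one or two parameters at a time instead of all five at once. The article does not state a reason, but the cost difference is easy to illustrate by counting model fits. The sketch below counts candidates for the staged grids as written here and for a hypothetical single joint grid (an assumption for comparison, using the stage-3 range for min_samples_split); the refit at the end of each search is ignored.

```python
from math import prod

folds = 5

# Grids copied from the four search stages in this article.
stages = {
    "n_estimators": [range(10, 101, 10)],
    "max_depth + min_samples_split": [range(3, 14, 1), range(5, 51, 5)],
    "min_samples_split + min_samples_leaf": [range(30, 150, 20), range(10, 60, 10)],
    "max_features": [range(3, 20, 2)],
}

# Each stage searches the Cartesian product of its own axes only.
staged_candidates = sum(prod(len(axis) for axis in axes) for axes in stages.values())
print("staged fits:", staged_candidates * folds)

# Hypothetical joint grid over all five parameters at once.
joint_axes = [
    range(10, 101, 10),   # n_estimators
    range(3, 14, 1),      # max_depth
    range(30, 150, 20),   # min_samples_split
    range(10, 60, 10),    # min_samples_leaf
    range(3, 20, 2),      # max_features
]
joint_candidates = prod(len(axis) for axis in joint_axes)
print("joint fits:", joint_candidates * folds)
```

The staged approach is far cheaper, at the price of possibly missing combinations that only work well together.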

Model validation results

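The tuned model's OOB score on the training set is 0.82, about 5 percentage points higher than with the default parameters. The F1 scores on the training and validation sets are now close (0.84 and 0.83), both above 0.8, so the model generalizes better.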

# Final model with all tuned parameters
rf2 = RandomForestClassifier(n_estimators=90, max_depth=8, min_samples_split=70, min_samples_leaf=10, max_features=5, oob_score=True, random_state=10)
rf2.fit(x_train, y_train)
print(rf2)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn import metrics

# Metrics on the training set
y_train_pred = rf2.predict(x_train)
y_train_predprob = rf2.predict_proba(x_train)[:, 1]
print('Training set OOB score: ', rf2.oob_score_)
print("Training set AUC Score: %f" % metrics.roc_auc_score(y_train, y_train_predprob))
print('Training set accuracy: ', accuracy_score(y_train, y_train_pred))
print('Training set precision: ', precision_score(y_train, y_train_pred))
print('Training set recall: ', recall_score(y_train, y_train_pred))
print('Training set F1 score: ', f1_score(y_train, y_train_pred))
print(metrics.classification_report(y_train, y_train_pred))
print(metrics.confusion_matrix(y_train, y_train_pred))

# Metrics on the held-out validation set
y_test_pred = rf2.predict(x_test)
y_test_predprob = rf2.predict_proba(x_test)[:, 1]
print('Test set accuracy: ', accuracy_score(y_test, y_test_pred))
print('Test set precision: ', precision_score(y_test, y_test_pred))
print('Test set recall: ', recall_score(y_test, y_test_pred))
print('Test set F1 score: ', f1_score(y_test, y_test_pred))
print(metrics.classification_report(y_test, y_test_pred))
print(metrics.confusion_matrix(y_test, y_test_pred))

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=8, max_features=5, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=10,
            min_samples_split=70, min_weight_fraction_leaf=0.0,
            n_estimators=90, n_jobs=1, oob_score=True, random_state=10,
            verbose=0, warm_start=False)
Training set OOB score:  0.82612244898
Training set AUC Score: 0.922081
Training set accuracy:  0.838367346939
Training set precision:  0.817938931298
Training set recall:  0.871847030106
Training set F1 score:  0.844033083891
             precision    recall  f1-score   support

          0       0.86      0.80      0.83      2442
          1       0.82      0.87      0.84      2458

avg / total       0.84      0.84      0.84      4900

[[1965  477]
 [ 315 2143]]
Test set accuracy:  0.824761904762
Test set precision:  0.806363636364
Test set recall:  0.851247600768
Test set F1 score:  0.828197945845
             precision    recall  f1-score   support

          0       0.84      0.80      0.82      1058
          1       0.81      0.85      0.83      1042

avg / total       0.83      0.82      0.82      2100

[[845 213]
 [155 887]]
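The reported validation metrics can be re-derived by hand from the last confusion matrix above. In scikit-learn's layout the rows are true classes and the columns are predicted classes, so the matrix reads [[TN, FP], [FN, TP]]. A short worked check:

```python
# Validation-set confusion matrix of the tuned model:
# rows are true classes, columns are predicted classes.
tn, fp = 845, 213
fn, tp = 155, 887

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all rows classified correctly
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of true positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

The four values agree with the test set accuracy, precision, recall and F1 score printed above (0.8248, 0.8064, 0.8512 and 0.8282 after rounding).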
