
Python Data Mining and Machine Learning: Telecom Credit Risk Assessment in Practice (4) - Model Training and Tuning


Splitting the Training Data

Split the training data into a training set and a cross-validation set at a 7:3 ratio. x_train and y_train are used to fit the model; x_test and y_test are used for cross-validation.

from sklearn.model_selection import train_test_split

# Use the user ID as the index; the first column is the label, the rest are features.
data_train = data_train.set_index('UserI_Id')
y = data_train[data_train.columns[0]]
x = data_train[data_train.columns[1:]]

# 70% for training, 30% for cross-validation.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.7, random_state=10)
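
If the classes were imbalanced, a stratified split would keep the positive rate consistent across both sets. A minimal variant (stratify is a standard train_test_split argument; the original post does not use it):

from sklearn.model_selection import train_test_split

# Variant: stratify on the label so both sets keep the same class
# proportions (shown as an alternative, not used in the original).
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.7, random_state=10, stratify=y)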

Random Forest with Default Parameters

First we fit the random forest with its default parameters and use the out-of-bag (OOB) score to judge the model. In each round of bootstrap sampling in bagging, roughly 36.8% of the training samples are never drawn. These unsampled samples are called out-of-bag (OOB) data. Because they take no part in fitting the model, they can be used to estimate its generalization ability.
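
The 36.8% figure comes from the bootstrap itself: the chance that a given sample is never drawn in n draws with replacement is (1 - 1/n)^n, which tends to 1/e ≈ 0.368 as n grows. A quick numeric check:

import math

# Probability that a given sample is never picked in n bootstrap draws:
# (1 - 1/n) ** n, which converges to 1/e as n grows.
for n in (10, 100, 1000, 10000):
    print(n, (1 - 1.0 / n) ** n)
print('1/e =', 1 / math.e)  # ~0.3679, i.e. ~36.8% left out-of-bag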

from sklearn.ensemble import RandomForestClassifier

# Random forest with default parameters; keep OOB samples for scoring.
rf = RandomForestClassifier(oob_score=True, random_state=0)
rf.fit(x_train, y_train)
print(rf)

from sklearn import metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Training-set metrics.
y_train_pred = rf.predict(x_train)
y_train_predprob = rf.predict_proba(x_train)[:, 1]
print('Training OOB score: ', rf.oob_score_)
print('Training AUC score: %f' % metrics.roc_auc_score(y_train, y_train_predprob))
print('Training accuracy: ', accuracy_score(y_train, y_train_pred))
print('Training precision: ', precision_score(y_train, y_train_pred))
print('Training recall: ', recall_score(y_train, y_train_pred))
print('Training F1: ', f1_score(y_train, y_train_pred))
print(metrics.classification_report(y_train, y_train_pred))
print(metrics.confusion_matrix(y_train, y_train_pred))

# Cross-validation-set metrics.
y_test_pred = rf.predict(x_test)
y_test_predprob = rf.predict_proba(x_test)[:, 1]
print('Test accuracy: ', accuracy_score(y_test, y_test_pred))
print('Test precision: ', precision_score(y_test, y_test_pred))
print('Test recall: ', recall_score(y_test, y_test_pred))
print('Test F1: ', f1_score(y_test, y_test_pred))
print(metrics.classification_report(y_test, y_test_pred))
print(metrics.confusion_matrix(y_test, y_test_pred))

With the default parameters, the OOB score is 0.77, but the gap between the training-set and cross-validation-set F1 scores is large: the model does not generalize well. We therefore tune the parameters with a grid search.

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=True, random_state=0,
            verbose=0, warm_start=False)
Training OOB score:  0.775510204082
Training AUC score: 0.999003
Training accuracy:  0.98693877551
Training precision:  0.988970588235
Training recall:  0.984947111473
Training F1:  0.986954749287
             precision    recall  f1-score   support

          0       0.98      0.99      0.99      2442
          1       0.99      0.98      0.99      2458

avg / total       0.99      0.99      0.99      4900

[[2415   27]
 [  37 2421]]
Test accuracy:  0.805714285714
Test precision:  0.810176125245
Test recall:  0.79462571977
Test F1:  0.802325581395
             precision    recall  f1-score   support

          0       0.80      0.82      0.81      1058
          1       0.81      0.79      0.80      1042

avg / total       0.81      0.81      0.81      2100

[[864 194]
 [214 828]]
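
The scalar AUC summarizes the whole ROC curve; plotting the curve itself shows where the trade-off lies. A minimal sketch, assuming matplotlib is available and reusing y_test and y_test_predprob from the evaluation above:

import matplotlib.pyplot as plt
from sklearn import metrics

# Compute and plot the test-set ROC curve from the predicted probabilities.
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_test_predprob)
plt.plot(fpr, tpr, label='AUC = %.3f' % metrics.auc(fpr, tpr))
plt.plot([0, 1], [0, 1], linestyle='--')  # chance diagonal
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()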

Grid Search over the Number of Weak Learners, n_estimators

We search from 10 to 100 in steps of 10, score with roc_auc, and use 5-fold cross-validation; oob_score=True keeps the out-of-bag samples available for assessing the model. The best number of weak learners is {'n_estimators': 90} with a CV score of 0.907140520132.

from sklearn.model_selection import GridSearchCV

# Search n_estimators from 10 to 100 in steps of 10, scored by 5-fold CV AUC.
param_test1 = {'n_estimators': range(10, 101, 10)}
gsearch1 = GridSearchCV(
    estimator=RandomForestClassifier(max_depth=8, max_features='sqrt',
                                     oob_score=True, random_state=10),
    param_grid=param_test1, scoring='roc_auc', cv=5)
gsearch1.fit(x_train, y_train)
print(gsearch1.cv_results_, gsearch1.best_params_, gsearch1.best_score_)
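
To see how the score varies across the whole grid rather than just at the best point, the mean CV scores in cv_results_ can be lined up with the candidate values; a small sketch reusing param_test1 and gsearch1 from above:

# Pair each candidate n_estimators with its mean cross-validated AUC.
for n, score in zip(param_test1['n_estimators'],
                    gsearch1.cv_results_['mean_test_score']):
    print(n, score)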

Grid Search over max_depth and min_samples_split

This gives {'min_samples_split': 45, 'max_depth': 8} with a CV score of 0.90777502455. Refitting with these values, the OOB score improves noticeably, from 0.77 to 0.828979591837.

# Jointly search the maximum tree depth and the minimum samples required to split.
param_test2 = {'max_depth': range(3, 14, 1), 'min_samples_split': range(5, 51, 5)}
gsearch2 = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=90, max_features='sqrt',
                                     oob_score=True, random_state=10),
    param_grid=param_test2, scoring='roc_auc', cv=5)  # the original passed iid=False, an argument removed in scikit-learn 0.24+
gsearch2.fit(x_train, y_train)
print(gsearch2.cv_results_, gsearch2.best_params_, gsearch2.best_score_)
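
The raw cv_results_ dump is hard to scan for a two-parameter grid. A small sketch that tabulates the mean CV AUC by the two searched parameters, assuming pandas is available (this tabulation is not in the original post):

import pandas as pd

# Tabulate mean CV AUC with max_depth as rows and min_samples_split as columns.
res = pd.DataFrame(gsearch2.cv_results_)
table = res.pivot_table(index='param_max_depth',
                        columns='param_min_samples_split',
                        values='mean_test_score')
print(table.round(4))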

# Refit with the chosen depth and split threshold and check the OOB score.
rf1 = RandomForestClassifier(n_estimators=90, max_depth=8, min_samples_split=45,
                             max_features='sqrt', oob_score=True, random_state=10)
rf1.fit(x_train, y_train)
print(rf1.oob_score_)

Tuning min_samples_split Together with min_samples_leaf

We cannot fix min_samples_split on its own yet, because it interacts with the other tree parameters. So we next tune min_samples_split together with min_samples_leaf, the minimum number of samples required at a leaf node.
This gives {'min_samples_leaf': 10, 'min_samples_split': 70} with a CV score of 0.90753607636810929.

# Jointly search the split threshold and the minimum leaf size.
param_test3 = {'min_samples_split': range(30, 150, 20), 'min_samples_leaf': range(10, 60, 10)}
gsearch3 = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=90, max_depth=8, max_features='sqrt',
                                     oob_score=True, random_state=10),
    param_grid=param_test3, scoring='roc_auc', cv=5)
gsearch3.fit(x_train, y_train)
print(gsearch3.cv_results_, gsearch3.best_params_, gsearch3.best_score_)

Tuning the Maximum Number of Features, max_features

This gives {'max_features': 5} with a CV score of 0.90721677436061976.

# Search the number of features considered at each split.
param_test4 = {'max_features': range(3, 20, 2)}
gsearch4 = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=90, max_depth=8, min_samples_split=70,
                                     min_samples_leaf=10, oob_score=True, random_state=10),
    param_grid=param_test4, scoring='roc_auc', cv=5)
gsearch4.fit(x_train, y_train)
print(gsearch4.cv_results_, gsearch4.best_params_, gsearch4.best_score_)

Cross-Validation Performance of the Tuned Model

The training-set OOB score is now 0.82, about five percentage points higher, and the training and cross-validation F1 scores are close, both above 0.8: the model's generalization ability has improved.

# Final model with all tuned parameters.
rf2 = RandomForestClassifier(n_estimators=90, max_depth=8, min_samples_split=70,
                             min_samples_leaf=10, max_features=5,
                             oob_score=True, random_state=10)
rf2.fit(x_train, y_train)
print(rf2)

from sklearn import metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Training-set metrics for the tuned model.
y_train_pred = rf2.predict(x_train)
y_train_predprob = rf2.predict_proba(x_train)[:, 1]
print('Training OOB score: ', rf2.oob_score_)
print('Training AUC score: %f' % metrics.roc_auc_score(y_train, y_train_predprob))
print('Training accuracy: ', accuracy_score(y_train, y_train_pred))
print('Training precision: ', precision_score(y_train, y_train_pred))
print('Training recall: ', recall_score(y_train, y_train_pred))
print('Training F1: ', f1_score(y_train, y_train_pred))
print(metrics.classification_report(y_train, y_train_pred))
print(metrics.confusion_matrix(y_train, y_train_pred))

# Cross-validation-set metrics for the tuned model.
y_test_pred = rf2.predict(x_test)
y_test_predprob = rf2.predict_proba(x_test)[:, 1]
print('Test accuracy: ', accuracy_score(y_test, y_test_pred))
print('Test precision: ', precision_score(y_test, y_test_pred))
print('Test recall: ', recall_score(y_test, y_test_pred))
print('Test F1: ', f1_score(y_test, y_test_pred))
print(metrics.classification_report(y_test, y_test_pred))
print(metrics.confusion_matrix(y_test, y_test_pred))
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=8, max_features=5, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=10,
            min_samples_split=70, min_weight_fraction_leaf=0.0,
            n_estimators=90, n_jobs=1, oob_score=True, random_state=10,
            verbose=0, warm_start=False)
Training OOB score:  0.82612244898
Training AUC score: 0.922081
Training accuracy:  0.838367346939
Training precision:  0.817938931298
Training recall:  0.871847030106
Training F1:  0.844033083891
             precision    recall  f1-score   support

          0       0.86      0.80      0.83      2442
          1       0.82      0.87      0.84      2458

avg / total       0.84      0.84      0.84      4900

[[1965  477]
 [ 315 2143]]
Test accuracy:  0.824761904762
Test precision:  0.806363636364
Test recall:  0.851247600768
Test F1:  0.828197945845
             precision    recall  f1-score   support

          0       0.84      0.80      0.82      1058
          1       0.81      0.85      0.83      1042

avg / total       0.83      0.82      0.82      2100

[[845 213]
 [155 887]]
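
Beyond the headline metrics, it is worth checking which features the tuned forest actually relies on. A short sketch using the standard feature_importances_ attribute (not part of the original post), reusing rf2 and the feature frame x from above:

import pandas as pd

# Rank features by the forest's impurity-based importance scores.
importances = pd.Series(rf2.feature_importances_, index=x.columns)
print(importances.sort_values(ascending=False).head(10))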

WeChat public account 資料分析 (Data Analysis): sharing the craft of the data scientist. Since we have met, let us grow together.