Python Data Mining and Machine Learning: Telecom Credit Risk Assessment in Practice (4), Model Training and Tuning
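Splitting the training data
The training data is split into a training set and a hold-out validation set at a ratio of 7:3. x_train and y_train are used to fit the model; x_test and y_test are used to validate it.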
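from sklearn.model_selection import train_test_split

# Index rows by user ID so that the ID is not treated as a feature
data_train = data_train.set_index('UserI_Id')
y = data_train[data_train.columns[0]]   # first column: target label
x = data_train[data_train.columns[1:]]  # remaining columns: features
# 70% for training, 30% held out for validation
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=10)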
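To make the 7:3 split concrete, here is a minimal sketch on a synthetic frame. The frame `demo` and its column names are hypothetical stand-ins for `data_train`; only the row count (7000, the sum of the support counts reported later in this post) is taken from the article.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for data_train: 7000 rows, label in the first column.
rng = np.random.default_rng(10)
demo = pd.DataFrame({
    "label": rng.integers(0, 2, size=7000),
    "feature_a": rng.normal(size=7000),
    "feature_b": rng.normal(size=7000),
})

y_demo = demo[demo.columns[0]]    # first column is the target
x_demo = demo[demo.columns[1:]]   # every other column is a feature

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, train_size=0.7, random_state=10
)

# Row counts of the two parts and the share of positives in each.
print(x_tr.shape, x_te.shape)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

The split is random rather than stratified, so the class shares in the two parts can differ slightly. Passing `stratify=y_demo` would keep them equal; that is an optional variation and not something the original code does.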
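Random forest with default parameters
We start with a random forest that uses the default parameters and judge it by its out-of-bag score. In each round of random sampling during bagging, roughly 36.8% of the training rows are never drawn into the bootstrap sample. These rows are called out-of-bag (OOB) data. Because they take no part in fitting that round's tree, they can be used to estimate how well the model generalizes.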
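The 36.8% figure follows from sampling with replacement: when n rows are drawn from n, the chance that a given row is never picked is (1 - 1/n)^n, which tends to 1/e ≈ 0.368 as n grows. The sketch below checks this numerically. The helper `oob_fraction` is illustrative and not part of scikit-learn.

```python
import numpy as np

def oob_fraction(n_samples: int, seed: int = 0) -> float:
    """Fraction of rows never drawn in one bootstrap sample of size n_samples."""
    rng = np.random.default_rng(seed)
    drawn = rng.integers(0, n_samples, size=n_samples)  # sample with replacement
    return 1.0 - np.unique(drawn).size / n_samples

n = 4900  # size of the training split used in this post
theoretical = (1 - 1 / n) ** n  # probability that a given row is never drawn
simulated = np.mean([oob_fraction(n, seed) for seed in range(50)])

print(f"theoretical: {theoretical:.4f}")
print(f"simulated:   {simulated:.4f}")
print(f"1/e:         {np.exp(-1):.4f}")
```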
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn import metrics

# Default parameters, with the out-of-bag estimate enabled
rf = RandomForestClassifier(oob_score=True, random_state=0)
rf.fit(x_train, y_train)
print(rf)

# Metrics on the training set
y_train_pred = rf.predict(x_train)
y_train_predprob = rf.predict_proba(x_train)[:, 1]  # probability of class 1
print('Training OOB score: ', rf.oob_score_)
print("Training AUC score: %f" % metrics.roc_auc_score(y_train, y_train_predprob))
print('Training accuracy: ', accuracy_score(y_train, y_train_pred))
print('Training precision: ', precision_score(y_train, y_train_pred))
print('Training recall: ', recall_score(y_train, y_train_pred))
print('Training F1: ', f1_score(y_train, y_train_pred))
print(metrics.classification_report(y_train, y_train_pred))
print(metrics.confusion_matrix(y_train, y_train_pred))

# Metrics on the hold-out set
y_test_pred = rf.predict(x_test)
y_test_predprob = rf.predict_proba(x_test)[:, 1]
print('Test accuracy: ', accuracy_score(y_test, y_test_pred))
print('Test precision: ', precision_score(y_test, y_test_pred))
print('Test recall: ', recall_score(y_test, y_test_pred))
print('Test F1: ', f1_score(y_test, y_test_pred))
print(metrics.classification_report(y_test, y_test_pred))
print(metrics.confusion_matrix(y_test, y_test_pred))
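With default parameters the OOB score is 0.77, but the F1 scores on the training set and the validation set are far apart (0.987 versus 0.802 in the output below). The model generalizes poorly, so the next step is to tune it with grid search.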
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=10, n_jobs=1, oob_score=True, random_state=0,
verbose=0, warm_start=False)
Training OOB score:  0.775510204082
Training AUC score: 0.999003
Training accuracy:  0.98693877551
Training precision:  0.988970588235
Training recall:  0.984947111473
Training F1:  0.986954749287
             precision    recall  f1-score   support

          0       0.98      0.99      0.99      2442
          1       0.99      0.98      0.99      2458

avg / total       0.99      0.99      0.99      4900

[[2415   27]
 [  37 2421]]
Test accuracy:  0.805714285714
Test precision:  0.810176125245
Test recall:  0.79462571977
Test F1:  0.802325581395
             precision    recall  f1-score   support

          0       0.80      0.82      0.81      1058
          1       0.81      0.79      0.80      1042

avg / total       0.81      0.81      0.81      2100

[[864 194]
 [214 828]]
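The reported test metrics can be recomputed by hand from the test-set confusion matrix above, which shows how accuracy, precision, recall and F1 relate to the four cell counts. This is a small worked example using only the numbers printed in the output; rows are true labels and columns are predicted labels, following scikit-learn's convention.

```python
import numpy as np

# Test-set confusion matrix from the default-parameter run.
cm = np.array([[864, 194],
               [214, 828]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the true positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.6f}")
print(f"precision: {precision:.6f}")
print(f"recall:    {recall:.6f}")
print(f"F1:        {f1:.6f}")
```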
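Grid search over n_estimators, the number of weak learners
The search covers 10 to 100 in steps of 10, scored by roc_auc with 5-fold cross-validation. oob_score=True is also set, so each fitted forest uses its out-of-bag samples for evaluation. The best number of weak learners was {'n_estimators': 90}, with a score of 0.907140520132.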
from sklearn.model_selection import GridSearchCV

param_test1 = {'n_estimators': range(10, 101, 10)}
gsearch1 = GridSearchCV(
    estimator=RandomForestClassifier(max_depth=8, max_features='sqrt', oob_score=True, random_state=10),
    param_grid=param_test1, scoring='roc_auc', cv=5)
gsearch1.fit(x_train, y_train)
print(gsearch1.cv_results_, gsearch1.best_params_, gsearch1.best_score_)
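Printing cv_results_ directly produces a large dictionary that is hard to read. One possible way to inspect it is to load it into a pandas DataFrame and sort by rank, as sketched below. The data comes from `make_classification` and the grid is shortened, so both are assumptions for illustration and the scores will not match the ones in this post.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for x_train / y_train (not the article's data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=10
)

search = GridSearchCV(
    estimator=RandomForestClassifier(max_depth=8, max_features="sqrt", random_state=10),
    param_grid={"n_estimators": [10, 30, 50]},  # shortened grid for the demo
    scoring="roc_auc",
    cv=5,
)
search.fit(X_demo, y_demo)

# One row per candidate, best candidate first.
columns = ["param_n_estimators", "mean_test_score", "std_test_score", "rank_test_score"]
results = (
    pd.DataFrame(search.cv_results_)[columns]
    .sort_values("rank_test_score", kind="stable")
    .reset_index(drop=True)
)
print(results)
print(search.best_params_, round(search.best_score_, 4))
```

The mean_test_score column is the cross-validated roc_auc averaged over the five folds, and best_score_ is simply its maximum.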
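Grid search over max_depth (maximum tree depth) and min_samples_split (minimum samples required to split an internal node)
The search returns {'min_samples_split': 45, 'max_depth': 8} with a score of 0.90777502455. Refitting the model with these values shows a clear gain in the OOB score, from 0.77 to 0.828979591837.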
param_test2 = {'max_depth': range(3, 14, 1), 'min_samples_split': range(5, 51, 5)}
# The original code passed iid=False; that argument has been removed from recent scikit-learn releases
gsearch2 = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=90, max_features='sqrt', oob_score=True, random_state=10),
    param_grid=param_test2, scoring='roc_auc', cv=5)
gsearch2.fit(x_train, y_train)
print(gsearch2.cv_results_, gsearch2.best_params_, gsearch2.best_score_)

# Refit with the selected values and check the OOB score
rf1 = RandomForestClassifier(n_estimators=90, max_depth=8, min_samples_split=45, max_features='sqrt', oob_score=True, random_state=10)
rf1.fit(x_train, y_train)
print(rf1.oob_score_)
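Tuning min_samples_split together with min_samples_leaf
The value of min_samples_split cannot be fixed yet, because it interacts with other tree parameters. We therefore tune min_samples_split (the minimum number of samples an internal node needs before it can be split) jointly with min_samples_leaf (the minimum number of samples in a leaf).
The search returns {'min_samples_leaf': 10, 'min_samples_split': 70} with a score of 0.90753607636810929.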
param_test3 = {'min_samples_split': range(30, 150, 20), 'min_samples_leaf': range(10, 60, 10)}
gsearch3 = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=90, max_depth=8, max_features='sqrt', oob_score=True, random_state=10),
    param_grid=param_test3, scoring='roc_auc', cv=5)
gsearch3.fit(x_train, y_train)
print(gsearch3.cv_results_, gsearch3.best_params_, gsearch3.best_score_)
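To see why the two parameters interact, it can help to look at a single decision tree. min_samples_split decides whether a node may be split at all, while min_samples_leaf rejects any split that would leave a child with too few rows. The sketch below fits one tree per setting on synthetic data (an assumption, not the article's data) and reports the leaf count and smallest leaf. The helper `leaf_sizes` is illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=8, random_state=10
)

def leaf_sizes(min_samples_split: int, min_samples_leaf: int) -> np.ndarray:
    """Fit one tree and return the number of training rows in each leaf."""
    tree = DecisionTreeClassifier(
        max_depth=8,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        random_state=10,
    ).fit(X_demo, y_demo)
    is_leaf = tree.tree_.children_left == -1  # leaves have no children
    return tree.tree_.n_node_samples[is_leaf]

for split, leaf in [(2, 1), (70, 1), (70, 10), (70, 50)]:
    sizes = leaf_sizes(split, leaf)
    print(f"min_samples_split={split:>3}, min_samples_leaf={leaf:>2}: "
          f"{sizes.size:>3} leaves, smallest leaf has {sizes.min()} rows")
```

Raising either value makes the trees coarser, which is why the grid search has to consider the two together rather than fixing one first.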
Tuning max_features, the maximum number of features considered per split
The search returns {'max_features': 5} with a score of 0.90721677436061976.
param_test4 = {'max_features': range(3, 20, 2)}
gsearch4 = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=90, max_depth=8, min_samples_split=70, min_samples_leaf=10, oob_score=True, random_state=10),
    param_grid=param_test4, scoring='roc_auc', cv=5)
gsearch4.fit(x_train, y_train)
print(gsearch4.cv_results_, gsearch4.best_params_, gsearch4.best_score_)
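Validation results for the tuned model
The OOB score on the training set is now 0.82, an improvement of 5 percentage points over the default model. The F1 scores on the training set and the validation set are also close (0.844 and 0.828), both above 0.8, so the model generalizes better.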
# Final model with all tuned parameters
rf2 = RandomForestClassifier(n_estimators=90, max_depth=8, min_samples_split=70, min_samples_leaf=10, max_features=5, oob_score=True, random_state=10)
rf2.fit(x_train, y_train)
print(rf2)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn import metrics

# Metrics on the training set
y_train_pred = rf2.predict(x_train)
y_train_predprob = rf2.predict_proba(x_train)[:, 1]  # probability of class 1
print('Training OOB score: ', rf2.oob_score_)
print("Training AUC score: %f" % metrics.roc_auc_score(y_train, y_train_predprob))
print('Training accuracy: ', accuracy_score(y_train, y_train_pred))
print('Training precision: ', precision_score(y_train, y_train_pred))
print('Training recall: ', recall_score(y_train, y_train_pred))
print('Training F1: ', f1_score(y_train, y_train_pred))
print(metrics.classification_report(y_train, y_train_pred))
print(metrics.confusion_matrix(y_train, y_train_pred))

# Metrics on the hold-out set
y_test_pred = rf2.predict(x_test)
y_test_predprob = rf2.predict_proba(x_test)[:, 1]
print('Test accuracy: ', accuracy_score(y_test, y_test_pred))
print('Test precision: ', precision_score(y_test, y_test_pred))
print('Test recall: ', recall_score(y_test, y_test_pred))
print('Test F1: ', f1_score(y_test, y_test_pred))
print(metrics.classification_report(y_test, y_test_pred))
print(metrics.confusion_matrix(y_test, y_test_pred))
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=8, max_features=5, max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=10,
min_samples_split=70, min_weight_fraction_leaf=0.0,
n_estimators=90, n_jobs=1, oob_score=True, random_state=10,
verbose=0, warm_start=False)
Training OOB score:  0.82612244898
Training AUC score: 0.922081
Training accuracy:  0.838367346939
Training precision:  0.817938931298
Training recall:  0.871847030106
Training F1:  0.844033083891
             precision    recall  f1-score   support

          0       0.86      0.80      0.83      2442
          1       0.82      0.87      0.84      2458

avg / total       0.84      0.84      0.84      4900

[[1965  477]
 [ 315 2143]]
Test accuracy:  0.824761904762
Test precision:  0.806363636364
Test recall:  0.851247600768
Test F1:  0.828197945845
             precision    recall  f1-score   support

          0       0.84      0.80      0.82      1058
          1       0.81      0.85      0.83      1042

avg / total       0.83      0.82      0.82      2100

[[845 213]
 [155 887]]
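The comparison made in this post, training F1 versus validation F1 before and after tuning, can be wrapped in a small helper. The sketch below runs it on synthetic data from `make_classification` with some label noise, which is an assumption, so the numbers will differ from the ones above. The helper `f1_gap` is illustrative, and the constrained model only reuses the parameter values found in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 10% label noise.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=8, flip_y=0.1, random_state=10
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=10
)

def f1_gap(model):
    """Fit the model and return (train F1, validation F1, difference)."""
    model.fit(X_tr, y_tr)
    train_f1 = f1_score(y_tr, model.predict(X_tr))
    valid_f1 = f1_score(y_va, model.predict(X_va))
    return train_f1, valid_f1, train_f1 - valid_f1

default_rf = RandomForestClassifier(random_state=10)
constrained_rf = RandomForestClassifier(
    n_estimators=90, max_depth=8, min_samples_split=70,
    min_samples_leaf=10, max_features=5, random_state=10,
)

scores = {"default": f1_gap(default_rf), "constrained": f1_gap(constrained_rf)}
for name, (train_f1, valid_f1, gap) in scores.items():
    print(f"{name:>11}: train F1={train_f1:.3f}  validation F1={valid_f1:.3f}  gap={gap:.3f}")
```

A large gap suggests the forest is memorizing the training rows. Limiting depth and leaf size lowers the training score, and a smaller gap is the signal this post uses to conclude that generalization has improved.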