
Building Models for Financial Loan Delinquency 4: Model Tuning


I. Task

Use grid search to tune the seven models (with five-fold cross-validation during tuning), evaluate each model, and show the output of the code.

II. Overview

Almost every machine learning model involves hyperparameter tuning, and different parameter combinations produce different results:

  • If the data set is not very large (so running time is acceptable), use GridSearchCV to pick the best combination of the candidate parameters automatically.
  • If the data set is large and the model is expensive in compute and time, GridSearchCV may cost too much; you then need a deeper understanding of the model, or more hands-on experience, and tune the parameters manually (or sample the search space randomly, as in the sketch after this list).
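
As a middle ground between an exhaustive grid and manual tuning, scikit-learn also provides RandomizedSearchCV, which evaluates only a fixed number of sampled parameter combinations. A minimal sketch, not from the original post (the distribution and n_iter here are illustrative):

from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression

search = RandomizedSearchCV(
    estimator=LogisticRegression(),
    param_distributions={'C': uniform(0.001, 100)},  # sample C uniformly from [0.001, 100.001)
    n_iter=20,           # evaluate only 20 random combinations
    scoring='roc_auc',
    cv=5,
    random_state=2018)
# search.fit(x_train_stand, y_train) then behaves like a fitted GridSearchCV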

1. Parameters

class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score='warn')

(1) estimator
The classifier to tune, e.g. estimator=RandomForestClassifier(min_samples_split=100, min_samples_leaf=20, max_depth=8, max_features='sqrt', random_state=10); pass in every parameter except the ones being searched. Each estimator needs either a scoring parameter or a score method.
(2) param_grid
A dict (or a list of dicts) mapping parameter names to the candidate values, e.g. param_test1 = {'n_estimators': range(10, 71, 10)} and then param_grid=param_test1.
(3) scoring
The evaluation metric. Defaults to None, in which case the estimator's own score method is used; otherwise pass a string such as scoring='roc_auc' (the appropriate metric depends on the model) or a callable with the signature scorer(estimator, X, y). The available scoring strings are listed here:
http://scikit-learn.org/stable/modules/model_evaluation.html
(4) cv
The cross-validation setting. Defaults to None, which means 3-fold cross-validation; pass an integer to choose the number of folds, or a generator that yields train/test splits.
(5) refit
Defaults to True: after the search finishes, the estimator is refitted on the whole training set using the best parameters found by cross-validation, so the grid object itself can serve as the final, best-parameter model for evaluation.
(6) iid
Defaults to True: the samples are assumed identically distributed across folds, and the reported score is the total loss over all samples rather than the average of the per-fold scores.
(7) verbose
Logging verbosity (int). 0: no output during training; 1: occasional progress messages; >1: a message for every sub-model.
(8) n_jobs
Number of parallel jobs (int). -1 means one job per CPU core; the default is 1.
(9) pre_dispatch
The total number of jobs dispatched for parallel execution. When n_jobs > 1 the data is copied at each execution point, which can cause out-of-memory errors; setting pre_dispatch caps the number of pre-dispatched jobs so that the data is copied at most pre_dispatch times.

2. Common Methods and Attributes

grid.fit(): run the grid search;
grid_scores_: the evaluation results for each parameter combination (removed in scikit-learn 0.20 in favour of cv_results_);
best_params_: the parameter combination that achieved the best result;
best_score_: the best score observed during the search.

A short sketch of how these fit together follows.
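
A minimal sketch, assuming grid has already been fitted as in the sections below; cv_results_ is the modern replacement for grid_scores_:

import pandas as pd

# After grid.fit(...), the winning combination and its CV score are available:
print(grid.best_params_)
print(grid.best_score_)
# cv_results_ holds one row per parameter combination; a compact view:
results = pd.DataFrame(grid.cv_results_)
print(results[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']])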

III. Implementation

1. Imports

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
import xgboost as xgb
import numpy as np
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import warnings
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
warnings.filterwarnings(action='ignore', category=DeprecationWarning)

2. Model Evaluation Function

## Model evaluation: print accuracy, precision and recall on the held-out set
def model_metrics(clf, y_target, y_predict):
    # clf is accepted for interface consistency but is not used here
    accuracy = accuracy_score(y_target, y_predict)
    print('The accuracy is ', accuracy)
    precision = precision_score(y_target, y_predict)
    print('The precision is ', precision)
    recall = recall_score(y_target, y_predict)
    print('The recall is ', recall)
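
The imports also bring in f1_score and roc_auc_score, which this function never uses. A hedged sketch of an extended variant (model_metrics_full is a hypothetical name, not from the original code); AUC needs the positive-class probabilities, so it takes the feature matrix instead of ready-made predictions:

def model_metrics_full(clf, x, y_target):
    y_predict = clf.predict(x)            # hard 0/1 predictions
    y_proba = clf.predict_proba(x)[:, 1]  # probability of the positive class
    print('The accuracy is ', accuracy_score(y_target, y_predict))
    print('The precision is ', precision_score(y_target, y_predict))
    print('The recall is ', recall_score(y_target, y_predict))
    print('The f1 score is ', f1_score(y_target, y_predict))
    print('The auc is ', roc_auc_score(y_target, y_proba))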

3. Loading the Data

## Load the data
data = pd.read_csv("data_all.csv")
x = data.drop(labels='status', axis=1)
y = data['status']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=2018)

## Standardize the features (the scaler is fitted on the training set only)
scaler = StandardScaler()
scaler.fit(x_train)
x_train_stand = scaler.transform(x_train)
x_test_stand = scaler.transform(x_test)
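
One caveat: the scaler is fitted on the full training set before the grid searches below, so each cross-validation fold is scaled with statistics that include its own validation part. A stricter alternative, sketched here as an assumption-laden example rather than the original method, wraps scaler and model in a Pipeline so scaling is re-fitted inside every fold (parameter names gain the step prefix lr__):

from sklearn.pipeline import Pipeline

pipe = Pipeline([('scale', StandardScaler()),
                 ('lr', LogisticRegression())])
param = {'lr__C': [0.01, 0.1, 1, 10]}  # '<step name>__<parameter name>'
grid = GridSearchCV(estimator=pipe, param_grid=param, scoring='roc_auc', cv=5)
# grid.fit(x_train, y_train)  # note: raw x_train; the pipeline handles scaling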

4、Logistic Regression

(1) Parameter Tuning

lr = LogisticRegression()
# parameters to search
param = {'C': [1e-3, 0.01, 0.1, 1, 10, 100, 1e3], 'penalty': ['l1', 'l2']}
grid = GridSearchCV(estimator=lr, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))

==> Best parameters: {'C': 0.1, 'penalty': 'l1'}

(2) Model Evaluation

lr = LogisticRegression(C=0.1, penalty='l1')
lr.fit(x_train_stand, y_train)
y_pre_lr = lr.predict(x_test_stand)
model_metrics(lr, y_test, y_pre_lr)

Output:

The accuracy is  0.7890679747722494
The precision is  0.6746987951807228
The recall is  0.31197771587743733

5、SVM

(1) Parameter Tuning

svm = SVC(random_state=2018, probability=True)
param = {'C': [0.01, 0.1, 1]}
grid = GridSearchCV(estimator=svm, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))

==> Best parameters: {'C': 0.1}

(2) Model Evaluation

svm = SVC(C=0.1, random_state=2018, probability=True)
svm.fit(x_train_stand, y_train)
y_pre_svm = svm.predict(x_test_stand)
model_metrics(svm, y_test, y_pre_svm)

Output:

The accuracy is  0.7575332866152769
The precision is  0.8823529411764706
The recall is  0.04178272980501393

6、Decision Tree

(1) Parameter Tuning

# The parameters were tuned in stages: each dict below was searched in a separate
# run, with the best values from earlier rounds fixed in the estimator.
dt = DecisionTreeClassifier(max_depth=9, min_samples_split=50, min_samples_leaf=90, max_features='sqrt', random_state=2018)
# Round 1: param = {'max_depth': range(3, 14, 2), 'min_samples_split': range(100, 801, 200)}
# Best parameters: {'max_depth': 9, 'min_samples_split': 300}
# Round 2: param = {'min_samples_split': range(50, 1000, 100), 'min_samples_leaf': range(60, 101, 10)}
# Best parameters: {'min_samples_leaf': 90, 'min_samples_split': 50}
# Round 3:
param = {'max_features': range(7, 20, 2)}
# Best parameters: {'max_features': 9}
grid = GridSearchCV(estimator=dt, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))

(2) Model Evaluation

dt = DecisionTreeClassifier(max_depth=9, min_samples_split=50, min_samples_leaf=90, max_features=9, random_state=2018)
dt.fit(x_train_stand, y_train)
y_pre_dt = dt.predict(x_test_stand)
model_metrics(dt, y_test, y_pre_dt)

Output:

The accuracy is  0.7561317449194114
The precision is  0.5578947368421052
The recall is  0.14763231197771587

7、Random Forest

## Random Forest: the grid search was run once and then commented out
# param = {'n_estimators': range(1, 200, 5), 'max_features': ['log2', 'sqrt', 'auto']}
# Best parameters: {'max_features': 'sqrt', 'n_estimators': 171}
rf = RandomForestClassifier(n_estimators=171, max_features='sqrt', random_state=2018)
rf.fit(x_train_stand, y_train)
y_pre_rf = rf.predict(x_test_stand)
model_metrics(rf, y_test, y_pre_rf)

Output:

The accuracy is  0.7848633496846531
The precision is  0.6857142857142857
The recall is  0.26740947075208915

8、GBDT

# The grid search was run once and then commented out:
# gbdt = GradientBoostingClassifier(random_state=2018)
# param = {'n_estimators': range(1, 100, 10), 'learning_rate': np.arange(0.1, 1, 0.1)}
# grid = GridSearchCV(estimator=gbdt, param_grid=param, scoring='roc_auc', cv=5)
# grid.fit(x_train_stand, y_train)
# print('Best parameters:', grid.best_params_)
# print('Best score on the training set:', grid.best_score_)
# print('Score on the test set:', grid.score(x_test_stand, y_test))
# Best parameters: {'learning_rate': 0.1, 'n_estimators': 41}
gbdt = GradientBoostingClassifier(learning_rate=0.1, n_estimators=41, random_state=2018)
gbdt.fit(x_train_stand, y_train)
y_pre_gbdt = gbdt.predict(x_test_stand)
model_metrics(gbdt, y_test, y_pre_gbdt)

9、XGBoost

## Parameter tuning: the dicts below were searched one stage at a time,
## fixing the best values from earlier stages in the estimator.
param = {'n_estimators': range(20, 200, 20)}
# param = {'max_depth': range(3, 10, 2), 'min_child_weight': range(1, 12, 2)}
# param = {'gamma': [i / 10 for i in range(1, 6)]}
# param = {'subsample': [i / 10 for i in range(5, 10)], 'colsample_bytree': [i / 10 for i in range(5, 10)]}
# param = {'reg_alpha': [1e-5, 1e-2, 0.1, 0, 1, 100]}
xgboost = xgb.XGBClassifier(learning_rate=0.1, n_estimators=40, max_depth=3, min_child_weight=11, reg_alpha=0.01, gamma=0.1, subsample=0.7, colsample_bytree=0.7, objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=2018)
# note: pass the classifier instance, not the xgb module, as the estimator
grid = GridSearchCV(estimator=xgboost, param_grid=param, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))
# Best parameters: {'n_estimators': 40}
# Best score on the training set: 0.8028110571725202
# Score on the test set: 0.7770857458817146

## Model evaluation
xgboost = xgb.XGBClassifier(learning_rate=0.1, n_estimators=40, max_depth=3, min_child_weight=11, reg_alpha=0.01,
                        gamma=0.1, subsample=0.7, colsample_bytree=0.7, objective='binary:logistic',
                        nthread=4, scale_pos_weight=1, seed=2018)
xgboost.fit(x_train_stand, y_train)
y_pre_xgb = xgboost.predict(x_test_stand)
model_metrics(xgboost, y_test, y_pre_xgb)

Output:

The accuracy is  0.7876664330763841
The precision is  0.6521739130434783
The recall is  0.3342618384401114

10、LightGBM

## Parameter tuning
gbm = lgb.LGBMClassifier(seed=2018)
param = {'learning_rate': np.arange(0.1, 0.5, 0.1), 'max_depth': range(1, 6, 1),
         'n_estimators': range(30, 50, 5)}
grid = GridSearchCV(estimator=gbm, param_grid=param, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))
# Best parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 40}
# Best score on the training set: 0.8007228827289531
# Score on the test set: 0.7729296422647178

## Model evaluation
gbm = lgb.LGBMClassifier(learning_rate=0.1, max_depth=3, n_estimators=40, seed=2018)
gbm.fit(x_train_stand, y_train)
y_pre_gbm = gbm.predict(x_test_stand)
model_metrics(gbm, y_test, y_pre_gbm)

Output:

The accuracy is  0.7932725998598459
The precision is  0.6839080459770115
The recall is  0.33147632311977715

IV. Problems Encountered

1. UnboundLocalError: local variable 'xxx' referenced before assignment

Error:
UnboundLocalError: local variable 'xxx' referenced before assignment

A variable n was defined outside a function and then modified inside it; running the code raised this error.

The root cause is that the interpreter cannot tell whether the name is global or local: any assignment to a name inside a function makes that name local, so reading it before the assignment fails.

Solution: rename the local variable so the names no longer clash, or declare it global inside the function, as in the sketch below.
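
A minimal sketch of the failure and the global fix (n stands in for the variable from the description):

n = 0

def broken():
    n = n + 1   # UnboundLocalError: the assignment makes n local to the function

def fixed():
    global n    # tell the interpreter that n is the module-level variable
    n = n + 1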

2. ImportError: [joblib] Attempting to do parallel computing without protecting

Error:
ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if __name__ == '__main__'". Please see the joblib documentation on Parallel for more information.

Solution: guard the entry point with if __name__ == '__main__':, as shown below.
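
A minimal sketch of the guard (main is a hypothetical wrapper around the training code above). On platforms without fork, joblib starts its workers by re-importing the script, and the guard keeps those workers from re-running the parallel search themselves:

def main():
    # load the data, build the grids and fit the models here
    ...

if __name__ == '__main__':
    main()  # only the original process runs the search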

3. Recall

Why is recall generally low across all of the models?
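
A likely explanation, not spelled out in the original post: the positive class (assuming status = 1 marks overdue loans) is the minority, and predict applies a fixed 0.5 probability threshold, so every model favours the majority class and misses many positives. A hedged sketch of trading precision for recall by lowering the threshold (the 0.3 value is illustrative):

y_proba = lr.predict_proba(x_test_stand)[:, 1]  # positive-class probabilities
y_pre_low = (y_proba >= 0.3).astype(int)        # lower threshold -> more predicted positives
model_metrics(lr, y_test, y_pre_low)            # recall rises, precision usually drops
# Alternatively, class_weight='balanced' (LogisticRegression, SVC, the tree models)
# or scale_pos_weight (XGBoost) re-weights the minority class during training.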