
Building Models for Financial Loan Delinquency 4: Model Tuning


I. Task

Use grid search to tune the seven models (with five-fold cross-validation during tuning), evaluate each model, and show the output of the code.

II. Overview

Almost every machine learning model involves hyperparameter tuning, and different parameter combinations produce different results:

  • If the data set is not very large (so running time is acceptable), use GridSearchCV to pick the best combination of the candidate parameters automatically.
  • If the data set is large and the model is expensive in compute and time, GridSearchCV may cost too much; you then need a deeper understanding of the model, or more hands-on experience, and tune the parameters manually (or sample the search space randomly, as in the sketch after this list).
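
As a middle ground between an exhaustive grid and manual tuning, scikit-learn also provides RandomizedSearchCV, which evaluates only a fixed number of sampled parameter combinations. A minimal sketch, not from the original post (the distribution and n_iter here are illustrative):

from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression

search = RandomizedSearchCV(
    estimator=LogisticRegression(),
    param_distributions={'C': uniform(0.001, 100)},  # sample C uniformly from [0.001, 100.001)
    n_iter=20,           # evaluate only 20 random combinations
    scoring='roc_auc',
    cv=5,
    random_state=2018)
# search.fit(x_train_stand, y_train) then behaves like a fitted GridSearchCV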

1. Parameters

class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score='warn')

(1) estimator
The classifier to tune, e.g. estimator=RandomForestClassifier(min_samples_split=100, min_samples_leaf=20, max_depth=8, max_features='sqrt', random_state=10); pass in every parameter except the ones being searched. Each estimator needs either a scoring parameter or a score method.
(2) param_grid
A dict (or a list of dicts) mapping parameter names to the candidate values, e.g. param_test1 = {'n_estimators': range(10, 71, 10)} and then param_grid=param_test1.
(3) scoring
The evaluation metric. Defaults to None, in which case the estimator's own score method is used; otherwise pass a string such as scoring='roc_auc' (the appropriate metric depends on the model) or a callable with the signature scorer(estimator, X, y). The available scoring strings are listed here:
http://scikit-learn.org/stable/modules/model_evaluation.html
(4) cv
The cross-validation setting. Defaults to None, which means 3-fold cross-validation; pass an integer to choose the number of folds, or a generator that yields train/test splits.
(5) refit
Defaults to True: after the search finishes, the estimator is refitted on the whole training set using the best parameters found by cross-validation, so the grid object itself can serve as the final, best-parameter model for evaluation.
(6) iid
Defaults to True: the samples are assumed identically distributed across folds, and the reported score is the total loss over all samples rather than the average of the per-fold scores.
(7) verbose
Logging verbosity (int). 0: no output during training; 1: occasional progress messages; >1: a message for every sub-model.
(8) n_jobs
Number of parallel jobs (int). -1 means one job per CPU core; the default is 1.
(9) pre_dispatch
The total number of jobs dispatched for parallel execution. When n_jobs > 1 the data is copied at each execution point, which can cause out-of-memory errors; setting pre_dispatch caps the number of pre-dispatched jobs so that the data is copied at most pre_dispatch times.

2. Common Methods and Attributes

grid.fit(): run the grid search;
grid_scores_: the evaluation results for each parameter combination (removed in scikit-learn 0.20 in favour of cv_results_);
best_params_: the parameter combination that achieved the best result;
best_score_: the best score observed during the search.

A short sketch of how these fit together follows.
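
A minimal sketch, assuming grid has already been fitted as in the sections below; cv_results_ is the modern replacement for grid_scores_:

import pandas as pd

# After grid.fit(...), the winning combination and its CV score are available:
print(grid.best_params_)
print(grid.best_score_)
# cv_results_ holds one row per parameter combination; a compact view:
results = pd.DataFrame(grid.cv_results_)
print(results[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']])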

III. Implementation

1. Imports

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
import xgboost as xgb
import numpy as np
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import warnings
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
warnings.filterwarnings(action='ignore', category=DeprecationWarning)

2. Model Evaluation Function

## Model evaluation: print accuracy, precision and recall on the held-out set
def model_metrics(clf, y_target, y_predict):
    # clf is accepted for interface consistency but is not used here
    accuracy = accuracy_score(y_target, y_predict)
    print('The accuracy is ', accuracy)
    precision = precision_score(y_target, y_predict)
    print('The precision is ', precision)
    recall = recall_score(y_target, y_predict)
    print('The recall is ', recall)
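
The imports also bring in f1_score and roc_auc_score, which this function never uses. A hedged sketch of an extended variant (model_metrics_full is a hypothetical name, not from the original code); AUC needs the positive-class probabilities, so it takes the feature matrix instead of ready-made predictions:

def model_metrics_full(clf, x, y_target):
    y_predict = clf.predict(x)            # hard 0/1 predictions
    y_proba = clf.predict_proba(x)[:, 1]  # probability of the positive class
    print('The accuracy is ', accuracy_score(y_target, y_predict))
    print('The precision is ', precision_score(y_target, y_predict))
    print('The recall is ', recall_score(y_target, y_predict))
    print('The f1 score is ', f1_score(y_target, y_predict))
    print('The auc is ', roc_auc_score(y_target, y_proba))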

3. Loading the Data

## Load the data
data = pd.read_csv("data_all.csv")
x = data.drop(labels='status', axis=1)
y = data['status']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=2018)

## Standardize the features (the scaler is fitted on the training set only)
scaler = StandardScaler()
scaler.fit(x_train)
x_train_stand = scaler.transform(x_train)
x_test_stand = scaler.transform(x_test)
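
One caveat: the scaler is fitted on the full training set before the grid searches below, so each cross-validation fold is scaled with statistics that include its own validation part. A stricter alternative, sketched here as an assumption-laden example rather than the original method, wraps scaler and model in a Pipeline so scaling is re-fitted inside every fold (parameter names gain the step prefix lr__):

from sklearn.pipeline import Pipeline

pipe = Pipeline([('scale', StandardScaler()),
                 ('lr', LogisticRegression())])
param = {'lr__C': [0.01, 0.1, 1, 10]}  # '<step name>__<parameter name>'
grid = GridSearchCV(estimator=pipe, param_grid=param, scoring='roc_auc', cv=5)
# grid.fit(x_train, y_train)  # note: raw x_train; the pipeline handles scaling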

4、Logistic Regression

(1) Parameter Tuning

lr = LogisticRegression()
# parameters to search
param = {'C': [1e-3, 0.01, 0.1, 1, 10, 100, 1e3], 'penalty': ['l1', 'l2']}
grid = GridSearchCV(estimator=lr, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))

==> Best parameters: {'C': 0.1, 'penalty': 'l1'}

(2) Model Evaluation

lr = LogisticRegression(C=0.1, penalty='l1')
lr.fit(x_train_stand, y_train)
y_pre_lr = lr.predict(x_test_stand)
model_metrics(lr, y_test, y_pre_lr)

Output:

The accuracy is  0.7890679747722494
The precision is  0.6746987951807228
The recall is  0.31197771587743733

5、SVM

(1) Parameter Tuning

svm = SVC(random_state=2018, probability=True)
param = {'C': [0.01, 0.1, 1]}
grid = GridSearchCV(estimator=svm, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))

==> Best parameters: {'C': 0.1}

(2) Model Evaluation

svm = SVC(C=0.1, random_state=2018, probability=True)
svm.fit(x_train_stand, y_train)
y_pre_svm = svm.predict(x_test_stand)
model_metrics(svm, y_test, y_pre_svm)

Output:

The accuracy is  0.7575332866152769
The precision is  0.8823529411764706
The recall is  0.04178272980501393

6、Decision Tree

(1) Parameter Tuning

# The parameters were tuned in stages: each dict below was searched in a separate
# run, with the best values from earlier rounds fixed in the estimator.
dt = DecisionTreeClassifier(max_depth=9, min_samples_split=50, min_samples_leaf=90, max_features='sqrt', random_state=2018)
# Round 1: param = {'max_depth': range(3, 14, 2), 'min_samples_split': range(100, 801, 200)}
# Best parameters: {'max_depth': 9, 'min_samples_split': 300}
# Round 2: param = {'min_samples_split': range(50, 1000, 100), 'min_samples_leaf': range(60, 101, 10)}
# Best parameters: {'min_samples_leaf': 90, 'min_samples_split': 50}
# Round 3:
param = {'max_features': range(7, 20, 2)}
# Best parameters: {'max_features': 9}
grid = GridSearchCV(estimator=dt, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))

(2) Model Evaluation

dt = DecisionTreeClassifier(max_depth=9, min_samples_split=50, min_samples_leaf=90, max_features=9, random_state=2018)
dt.fit(x_train_stand, y_train)
y_pre_dt = dt.predict(x_test_stand)
model_metrics(dt, y_test, y_pre_dt)

Output:

The accuracy is  0.7561317449194114
The precision is  0.5578947368421052
The recall is  0.14763231197771587

7、Random Forest

## Random Forest: the grid search was run once and then commented out
# param = {'n_estimators': range(1, 200, 5), 'max_features': ['log2', 'sqrt', 'auto']}
# Best parameters: {'max_features': 'sqrt', 'n_estimators': 171}
rf = RandomForestClassifier(n_estimators=171, max_features='sqrt', random_state=2018)
rf.fit(x_train_stand, y_train)
y_pre_rf = rf.predict(x_test_stand)
model_metrics(rf, y_test, y_pre_rf)

Output:

The accuracy is  0.7848633496846531
The precision is  0.6857142857142857
The recall is  0.26740947075208915

8、GBDT

# The grid search was run once and then commented out:
# gbdt = GradientBoostingClassifier(random_state=2018)
# param = {'n_estimators': range(1, 100, 10), 'learning_rate': np.arange(0.1, 1, 0.1)}
# grid = GridSearchCV(estimator=gbdt, param_grid=param, scoring='roc_auc', cv=5)
# grid.fit(x_train_stand, y_train)
# print('Best parameters:', grid.best_params_)
# print('Best score on the training set:', grid.best_score_)
# print('Score on the test set:', grid.score(x_test_stand, y_test))
# Best parameters: {'learning_rate': 0.1, 'n_estimators': 41}
gbdt = GradientBoostingClassifier(learning_rate=0.1, n_estimators=41, random_state=2018)
gbdt.fit(x_train_stand, y_train)
y_pre_gbdt = gbdt.predict(x_test_stand)
model_metrics(gbdt, y_test, y_pre_gbdt)

9、XGBoost

## Parameter tuning: the dicts below were searched one stage at a time,
## fixing the best values from earlier stages in the estimator.
param = {'n_estimators': range(20, 200, 20)}
# param = {'max_depth': range(3, 10, 2), 'min_child_weight': range(1, 12, 2)}
# param = {'gamma': [i / 10 for i in range(1, 6)]}
# param = {'subsample': [i / 10 for i in range(5, 10)], 'colsample_bytree': [i / 10 for i in range(5, 10)]}
# param = {'reg_alpha': [1e-5, 1e-2, 0.1, 0, 1, 100]}
xgboost = xgb.XGBClassifier(learning_rate=0.1, n_estimators=40, max_depth=3, min_child_weight=11, reg_alpha=0.01, gamma=0.1, subsample=0.7, colsample_bytree=0.7, objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=2018)
# note: pass the classifier instance, not the xgb module, as the estimator
grid = GridSearchCV(estimator=xgboost, param_grid=param, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))
# Best parameters: {'n_estimators': 40}
# Best score on the training set: 0.8028110571725202
# Score on the test set: 0.7770857458817146

## Model evaluation
xgboost = xgb.XGBClassifier(learning_rate=0.1, n_estimators=40, max_depth=3, min_child_weight=11, reg_alpha=0.01,
                        gamma=0.1, subsample=0.7, colsample_bytree=0.7, objective='binary:logistic',
                        nthread=4, scale_pos_weight=1, seed=2018)
xgboost.fit(x_train_stand, y_train)
y_pre_xgb = xgboost.predict(x_test_stand)
model_metrics(xgboost, y_test, y_pre_xgb)

Output:

The accuracy is  0.7876664330763841
The precision is  0.6521739130434783
The recall is  0.3342618384401114

10、LightGBM

## Parameter tuning
gbm = lgb.LGBMClassifier(seed=2018)
param = {'learning_rate': np.arange(0.1, 0.5, 0.1), 'max_depth': range(1, 6, 1),
         'n_estimators': range(30, 50, 5)}
grid = GridSearchCV(estimator=gbm, param_grid=param, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))
# Best parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 40}
# Best score on the training set: 0.8007228827289531
# Score on the test set: 0.7729296422647178

## Model evaluation
gbm = lgb.LGBMClassifier(learning_rate=0.1, max_depth=3, n_estimators=40, seed=2018)
gbm.fit(x_train_stand, y_train)
y_pre_gbm = gbm.predict(x_test_stand)
model_metrics(gbm, y_test, y_pre_gbm)

Output:

The accuracy is  0.7932725998598459
The precision is  0.6839080459770115
The recall is  0.33147632311977715

IV. Problems Encountered

1. UnboundLocalError: local variable 'xxx' referenced before assignment

Error:
UnboundLocalError: local variable 'xxx' referenced before assignment

A variable n was defined outside a function and then modified inside it; running the code raised this error.

The root cause is that the interpreter cannot tell whether the name is global or local: any assignment to a name inside a function makes that name local, so reading it before the assignment fails.

Solution: rename the local variable so the names no longer clash, or declare it global inside the function, as in the sketch below.
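
A minimal sketch of the failure and the global fix (n stands in for the variable from the description):

n = 0

def broken():
    n = n + 1   # UnboundLocalError: the assignment makes n local to the function

def fixed():
    global n    # tell the interpreter that n is the module-level variable
    n = n + 1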

2. ImportError: [joblib] Attempting to do parallel computing without protecting

Error:
ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if __name__ == '__main__'". Please see the joblib documentation on Parallel for more information.

Solution: guard the entry point with if __name__ == '__main__':, as shown below.
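
A minimal sketch of the guard (main is a hypothetical wrapper around the training code above). On platforms without fork, joblib starts its workers by re-importing the script, and the guard keeps those workers from re-running the parallel search themselves:

def main():
    # load the data, build the grids and fit the models here
    ...

if __name__ == '__main__':
    main()  # only the original process runs the search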

3. Recall

Why is recall generally low across all of the models?
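
A likely explanation, not spelled out in the original post: the positive class (assuming status = 1 marks overdue loans) is the minority, and predict applies a fixed 0.5 probability threshold, so every model favours the majority class and misses many positives. A hedged sketch of trading precision for recall by lowering the threshold (the 0.3 value is illustrative):

y_proba = lr.predict_proba(x_test_stand)[:, 1]  # positive-class probabilities
y_pre_low = (y_proba >= 0.3).astype(int)        # lower threshold -> more predicted positives
model_metrics(lr, y_test, y_pre_low)            # recall rises, precision usually drops
# Alternatively, class_weight='balanced' (LogisticRegression, SVC, the tree models)
# or scale_pos_weight (XGBoost) re-weights the minority class during training.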