
Learning Notes (7): A Simple Application of Grid Search and Cross-Validation for Model Parameter Tuning


The data is financial data, and the task is to predict whether a loan user will become overdue. In the table, status is the label: 0 means not overdue, 1 means overdue.
Mission 1 - Build a logistic regression model for prediction
Mission 2 - Build SVM and decision tree models for prediction
Mission 3 - Build xgboost and lightgbm models for prediction
Mission 4 - Record a score table of precision, recall, f1, and auc for the five models, and plot the ROC curve and AUC
Mission 5 - Handle data type conversion and missing values (try different imputations and compare), plus any data exploration worth borrowing
Mission 6 - Simple parameter tuning with grid search and cross-validation.

Data Overview

For the earlier data preprocessing, see the previous article.
The only difference is that here the data only needs to be split into two parts: the labels and the training features.


import pandas as pd
from sklearn.preprocessing import StandardScaler

# Merge the processed feature tables, then split off the label column
datafinal = pd.concat([datanew, date_temp], axis=1)
data_train = datafinal['status']                 # label: 0 = not overdue, 1 = overdue
datafinal.drop(["status"], axis=1, inplace=True)

X_train = datafinal
y_train = data_train
standardScaler = StandardScaler()
X_train_fit = standardScaler.fit_transform(X_train)   # standardized feature matrix

Cross-Validation

The basic idea of cross-validation is to split the original dataset into groups in some way: one part is used as the training set and the other as the validation set (or test set). The classifier is first trained on the training set, and the trained model is then evaluated on the validation set; the resulting score serves as the performance metric of the classifier.

1. Cross-Validation

Advantages of cross-validation: the original train_test_split approach makes a single split of the data, so the result depends on chance. Cross-validation splits the data many times, which greatly reduces the randomness introduced by a single split; and because the model is trained on several different splits, it sees a wider variety of data, which improves its generalization ability (a quick sketch of the single-split baseline follows below).
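For comparison, here is a minimal sketch of the single train_test_split baseline mentioned above; the test_size and random_state values are illustrative and not taken from the original experiment. The score from one such split can swing noticeably depending on which rows happen to land in the test set.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# a single random split: the score depends on which rows fall into the test set
X_tr, X_te, y_tr, y_te = train_test_split(datafinal, data_train,
                                           test_size=0.3, random_state=2018)
clf = LogisticRegression().fit(X_tr, y_tr)
print("single-split accuracy:", accuracy_score(y_te, clf.predict(X_te)))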
First, initialize the models:

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

log_reg = LogisticRegression()
lsvc = SVC()
dtc = DecisionTreeClassifier()
xgbc_model = XGBClassifier()
lgbm_model = LGBMClassifier()
models = [log_reg, lsvc, dtc, xgbc_model, lgbm_model]

Using cross-validation:

from sklearn.model_selection import cross_val_score

for model in models:
    score = cross_val_score(model, datafinal, data_train, cv=5)
    print("\n{} scores: {}".format(model, score))
    print("\n{} mean score: {}".format(model, score.mean()))

2. K-fold cross-validation (KFold)

K-fold cross-validation: sklearn.model_selection.KFold(n_splits=3, shuffle=False, random_state=None)
Idea: partition the dataset into n_splits mutually exclusive subsets. Each time, one subset is used as the validation set and the remaining n_splits-1 subsets as the training set; this is repeated n_splits times, yielding n_splits results.

from sklearn.model_selection import KFold

for model in models:
    kf = KFold(n_splits=5, shuffle=False)
    score = cross_val_score(model, datafinal, data_train, cv=kf)
    print("\n{} scores: {}".format(model, score))
    print("\n{} mean score: {}".format(model, score.mean()))
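To make the splitting mechanism described above concrete, here is a minimal sketch that iterates over the five folds manually instead of going through cross_val_score; it assumes datafinal and data_train are the pandas objects built earlier.

kf = KFold(n_splits=5, shuffle=False)
for fold, (train_idx, val_idx) in enumerate(kf.split(datafinal)):
    # each sample appears in exactly one validation fold
    X_tr, X_val = datafinal.iloc[train_idx], datafinal.iloc[val_idx]
    y_tr, y_val = data_train.iloc[train_idx], data_train.iloc[val_idx]
    print("fold {}: {} training rows, {} validation rows".format(
        fold, len(train_idx), len(val_idx)))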

3. Leave-one-out cross-validation

Leave-one-out cross-validation is a special case of cross-validation. As the name implies, if the sample size is n, then k = n: n-fold cross-validation is performed and a single sample is held out for validation each time. It is mainly used for small datasets.

from sklearn.model_selection import LeaveOneOut

for model in models:
    loout = LeaveOneOut()
    score = cross_val_score(model, datafinal, data_train, cv=loout)
    print("\n{} scores: {}".format(model, score))
    print("\n{} mean score: {}".format(model, score.mean()))

4. Shuffle-split cross-validation

More flexible control: you can set the number of split iterations and the proportions of the test and training sets in each split. This also means some samples may end up in neither the training set nor the test set; with train_size=0.5 and test_size=0.4 below, about 10% of the samples are left unused in each split.

from sklearn.model_selection import ShuffleSplit

for model in models:
    shufspl = ShuffleSplit(train_size=.5, test_size=.4, n_splits=5)
    score = cross_val_score(model, datafinal, data_train, cv=shufspl)
    print("\n{} scores: {}".format(model, score))
    print("\n{} mean score: {}".format(model, score.mean()))

Grid Search

Exhaustive search: loop over every candidate parameter combination and take the one that performs best as the final result. The principle is like finding the maximum value in an array. (Why is it called grid search? Take a model with two parameters as an example: if parameter a has 3 possible values and parameter b has 4, listing all combinations gives a 3*4 table in which each cell is a grid point, and the loop traverses and searches every cell.) Here we use Grid Search with Cross Validation.
Results before tuning, for reference: https://blog.csdn.net/zhangyunpeng0922/article/details/84257426

1. Grid search for logistic regression

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score

param_grid = [
    {
        'C': [0.0001, 0.001, 0.01, 0.1, 1, 10],
        'penalty': ['l2'],
        'tol': [1e-4, 1e-5, 1e-6]
    },
    {
        'C': [0.0001, 0.001, 0.01, 0.1, 1, 10],
        'penalty': ['l1'],   # note: 'l1' requires solver='liblinear' or 'saga' in newer scikit-learn versions
        'tol': [1e-4, 1e-5, 1e-6]
    }
]
score = make_scorer(accuracy_score)
print("\n{} Predicting with logistic regression {}".format("*"*20, "*"*20))
kf = KFold(n_splits=5, shuffle=False)
grid_search = GridSearchCV(log_reg, param_grid, scoring=score, cv=kf)
grid_search = grid_search.fit(X_train, y_train)

The results:
grid_search.best_score_: 0.8019236549443943
grid_search.best_params_: {'C': 0.1, 'penalty': 'l1', 'tol': 1e-05}

Final model evaluation:
predict: 0.6811492641906096
roc_auc_score: 0.7046975097838823
precision_score: 0.44631901840490795
recall_score: 0.7558441558441559
f1_score: 0.6554107807490491

(Not sure why this is slightly worse than last time; the parameters may not have been tuned well.)
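For reference, a minimal sketch of how such evaluation numbers can be produced from the fitted search. It assumes a held-out split X_test, y_test like the one used in the previous post; those variable names are not defined in this article.

from sklearn.metrics import (accuracy_score, roc_auc_score, precision_score,
                             recall_score, f1_score)

best_lr = grid_search.best_estimator_          # refit with the best parameters
y_pred = best_lr.predict(X_test)               # hard labels for precision/recall/f1
y_prob = best_lr.predict_proba(X_test)[:, 1]   # probability of class 1 for AUC

print("accuracy:", accuracy_score(y_test, y_pred))
print("roc_auc_score:", roc_auc_score(y_test, y_prob))
print("precision_score:", precision_score(y_test, y_pred))
print("recall_score:", recall_score(y_test, y_pred))
print("f1_score:", f1_score(y_test, y_pred))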

2. Grid search for the decision tree

"""
2 使用決策樹預測
"""
param_grid_tree = [
    {
        'max_depth': [m for m in range(5,10)],
        'class_weight': ['balanced',None]
    }
]
decisions = {}
print("\n{}使用決策樹預測{}".format("*"*20, "*"*20))
score = make_scorer(accuracy_score)
kf = KFold(n_splits=5,shuffle=False)
grid_search_tree = GridSearchCV(dtc,param_grid_tree,score,cv=kf)
grid_search_tree.fit(X_train_fit, y_train)

The results:
grid_search_tree.best_score_: 0.7664562669071235
grid_search_tree.best_params_: {'class_weight': None, 'max_depth': 5}

Model evaluation results:
DecisionTreeClassifier accuracy: 0.7561317449194114
roc_auc_score: 0.580806142034549
precision_score: 0.6581196581196581
recall_score: 0.2
f1_score: 0.30677290836653387

(Some improvement.)

3. Grid search for SVM

"""
2 使用SVC預測
"""
param_grid_svc = [
    {
        'C':  [4.5,5,5.5,6],
        'gamma': [0.0009,0.001,0.0011,0.002],
        'class_weight': ['balanced',None] 
    }
]
print("\n{}使用SVC預測{}".format("*"*20, "*"*20))
score = make_scorer(accuracy_score)
kf = KFold(n_splits=5,shuffle=False)
grid_search_svc = GridSearchCV(lsvc,param_grid_svc,score,cv=kf)
grid_search_svc.fit(X_train_fit, y_train)

The results:
grid_search_svc.best_score_: 0.8001202284340246
grid_search_svc.best_params_: {'C': 4.5, 'class_weight': None, 'gamma': 0.001}

Model evaluation results:
linear_svc accuracy: 0.7757533286615277
roc_auc_score: 0.60652466535384
precision_score: 0.773109243697479
recall_score: 0.23896103896103896
f1_score: 0.3650793650793651

4. Grid search for XGBoost

"""
2 使用xgboost預測
   
"""
param_grid_xgb = [
    {
        "max_depth": [10,30,50],
        "min_child_weight" : [1,3,6],
        "n_estimators": [200],
        "learning_rate": [0.05, 0.1,0.16]
         }
]
decisions = {}
print("\n{}使用xgboost預測{}".format("*"*20, "*"*20))
score = make_scorer(accuracy_score)
kf = KFold(n_splits=5,shuffle=False)
grid_search_xgb = GridSearchCV(xgbc_model,param_grid_xgb,score,cv=kf)
# sgd = SGDClassifier()
grid_search_xgb.fit(X_train_fit, y_train)

The results:
grid_search_xgb.best_score_: 0.7923053802224226
grid_search_xgb.best_params_: {'learning_rate': 0.05, 'max_depth': 30, 'min_child_weight': 6, 'n_estimators': 200}

Model evaluation results:
xgbc_model accuracy: 0.7785564120532585
roc_auc_score: 0.6420170999825511
precision_score: 0.6751269035532995
recall_score: 0.34545454545454546
f1_score: 0.45704467353951883

(The parameters probably weren't set well either; there is no real improvement.)

5. Grid search for LightGBM

"""
2 使用lightgmb預測
"""
param_grid_lgb = [
    {        
        "max_depth": [5,10, 15],
        "learning_rate" : [0.01,0.05,0.1],
        "num_leaves": [30,90,120],
        "n_estimators": [20]        
    }
]
decisions = {}
print("\n{}使用lightgmb預測{}".format("*"*20, "*"*20))
score = make_scorer(accuracy_score)
kf = KFold(n_splits=5,shuffle=False)
grid_search_lgb = GridSearchCV(lgbm_model,param_grid_lgb,score,cv=kf)
# sgd = SGDClassifier()
grid_search_lgb.fit(X_train_fit, y_train)

The results:
grid_search_lgb.best_score_: 0.7953110910730388
grid_search_lgb.best_params_: {'learning_rate': 0.1, 'max_depth': 15, 'n_estimators': 20, 'num_leaves': 30}

Model evaluation results:
lgbm_model accuracy: 0.7813594954449895
roc_auc_score: 0.6283782436373607
precision_score: 0.7354838709677419
recall_score: 0.2961038961038961
f1_score: 0.4222222222222222
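When the best score barely moves, it can help to inspect the whole grid rather than only best_params_. Here is a minimal sketch using the cv_results_ attribute of the search fitted above; sorting by rank is just one convenient way to read it.

import pandas as pd

# every parameter combination that was tried, with its cross-validated score
cv_results = pd.DataFrame(grid_search_lgb.cv_results_)
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(cv_results[cols].sort_values("rank_test_score").head(10))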

Problems

  1. Good parameter ranges for each model are not easy to find.
  2. After many rounds of tuning, the results were sometimes worse than before tuning.
  3. Poorly chosen parameter ranges waste time and still give poor results.
  4. This is both a technical job and a test of patience.
