
A Tuning Essential: Grid Search


What is Grid Search?

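Grid Search is a hyperparameter tuning technique based on exhaustive search: loop over every candidate combination of parameter values, try each one, and keep the combination that performs best. The principle is the same as finding the maximum value in an array. (Why is it called grid search? Take a model with two parameters: if parameter a has 3 candidate values and parameter b has 4, listing every combination gives a 3*4 table. Each cell of the table is one grid point, and the loop visits and evaluates every cell in turn, hence the name grid search.)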
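The 3*4 table described above can be sketched with the standard library alone. The candidate values and the `toy_score` function below are hypothetical stand-ins (a real search would train and evaluate a model for each cell); the point is only to show that the search is a loop over all 12 cells followed by taking the maximum.

```python
from itertools import product

# Hypothetical candidate values: parameter a has 3 options, parameter b has 4.
a_values = [1, 2, 3]
b_values = [10, 20, 30, 40]

# Every (a, b) pair is one cell of the 3*4 grid.
cells = list(product(a_values, b_values))
print(len(cells))  # 3 * 4 = 12 cells


def toy_score(a, b):
    """Made-up score that peaks at a=2, b=30; stands in for model accuracy."""
    return -(a - 2) ** 2 - (b - 30) ** 2


# "Finding the maximum in an array": visit each cell and keep the best one.
best = max(cells, key=lambda cell: toy_score(*cell))
print(best)  # (2, 30)
```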

Simple Grid Search

Take the tuning of two parameters as an example:

from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train,X_test,y_train,y_test = train_test_split(iris.data,iris.target,random_state=0)
print("Size of training set:{} size of testing set:{}".format(X_train.shape[0],X_test.shape[0]))

####   grid search start
best_score = 0
for gamma in [0.001,0.01,0.1,1,10,100]:
    for C in [0.001,0.01,0.1,1,10,100]:
        svm = SVC(gamma=gamma,C=C)  # train once for each possible parameter combination
        svm.fit(X_train,y_train)
        score = svm.score(X_test,y_test)
        if score > best_score:  # keep the best-performing parameters
            best_score = score
            best_parameters = {'gamma':gamma,'C':C}
####   grid search end

print("Best score:{:.2f}".format(best_score))
print("Best parameters:{}".format(best_parameters))

Output:

Size of training set:112 size of testing set:38
Best score:0.97
Best parameters:{'gamma': 0.001, 'C': 100}

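The problem:

After the original dataset is split into a training set and a test set, the test set is used both to tune the parameters and to measure how good the model is. As a result, the final score is better than the real-world performance would be. (The test set influenced the choice of parameters during tuning, whereas the goal is to apply the trained model to unseen data.)

The solution:

Split the training set once more, into a training set and a validation set. The original data then ends up in three parts: a training set, a validation set, and a test set. The training set is used to fit the model, the validation set to tune the parameters, and the test set to measure how well the model performs.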

X_trainval,X_test,y_trainval,y_test = train_test_split(iris.data,iris.target,random_state=0)
X_train,X_val,y_train,y_val = train_test_split(X_trainval,y_trainval,random_state=1)
print("Size of training set:{} size of validation set:{} size of testing set:{}".format(X_train.shape[0],X_val.shape[0],X_test.shape[0]))

best_score = 0.0
for gamma in [0.001,0.01,0.1,1,10,100]:
    for C in [0.001,0.01,0.1,1,10,100]:
        svm = SVC(gamma=gamma,C=C)
        svm.fit(X_train,y_train)
        score = svm.score(X_val,y_val)
        if score > best_score:
            best_score = score
            best_parameters = {'gamma':gamma,'C':C}
svm = SVC(**best_parameters)  # build a new model with the best parameters
svm.fit(X_trainval,y_trainval)  # train on the training and validation sets together; more data usually helps performance
test_score = svm.score(X_test,y_test)  # evaluate the model on the test set
print("Best score on validation set:{:.2f}".format(best_score))
print("Best parameters:{}".format(best_parameters))
print("Best score on test set:{:.2f}".format(test_score))

Output:

Size of training set:84 size of validation set:28 size of testing set:38
Best score on validation set:0.96
Best parameters:{'gamma': 0.001, 'C': 10}
Best score on test set:0.92

However, the final result of this simple grid search depends heavily on how the initial data happened to be split. To deal with this, we use cross-validation to reduce the effect of chance.
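To make the idea of cross-validation concrete, here is a minimal sketch of what `cross_val_score` returns for a single parameter setting. The values `gamma=0.1` and `C=10` are picked only for illustration, and the whole iris dataset is used for brevity. Each fold gives its own score, and the scores typically differ from fold to fold, which is why the mean is used instead of the score from one arbitrary split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

iris = load_iris()

# Illustrative parameter values, not tuned.
svm = SVC(gamma=0.1, C=10)

# cv=5: the data is split into 5 folds; each fold is held out once for scoring
# while the model is trained on the other 4.
fold_scores = cross_val_score(svm, iris.data, iris.target, cv=5)

print(fold_scores)         # one accuracy value per fold
print(fold_scores.mean())  # the average used to compare parameter settings
print(fold_scores.std())   # spread across folds
```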

Grid Search with Cross Validation

from sklearn.model_selection import cross_val_score

best_score = 0.0
for gamma in [0.001,0.01,0.1,1,10,100]:
    for C in [0.001,0.01,0.1,1,10,100]:
        svm = SVC(gamma=gamma,C=C)
        scores = cross_val_score(svm,X_trainval,y_trainval,cv=5)  # 5-fold cross-validation
        score = scores.mean()  # take the mean over the folds
        if score > best_score:
            best_score = score
            best_parameters = {"gamma":gamma,"C":C}
svm = SVC(**best_parameters)
svm.fit(X_trainval,y_trainval)
test_score = svm.score(X_test,y_test)
print("Best score on validation set:{:.2f}".format(best_score))
print("Best parameters:{}".format(best_parameters))
print("Score on testing set:{:.2f}".format(test_score))

Output:

Best score on validation set:0.97
Best parameters:{'gamma': 0.01, 'C': 100}
Score on testing set:0.97

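Cross-validation is often combined with grid search as the way of evaluating each parameter setting; the combination is called grid search with cross validation. sklearn provides the class GridSearchCV for this purpose. It implements fit, predict, score and other methods, so it can be used like an estimator. Calling fit does two things: (1) it searches for the best parameters; (2) it instantiates an estimator with those best parameters and, with the default refit=True, fits it on all the data passed to fit.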
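The two steps performed by fit can be inspected through attributes of the fitted object. The sketch below uses a deliberately small grid (an assumption made to keep it short) and checks that `predict` and `score` on the search object behave like those of an ordinary classifier.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

# Small illustrative grid; the values are assumptions for this sketch.
small_grid = {"gamma": [0.01, 0.1, 1], "C": [1, 10, 100]}
search = GridSearchCV(SVC(), small_grid, cv=5)
search.fit(X_train, y_train)

# Step (1): the best parameter combination found by cross-validation.
print(search.best_params_)

# Step (2): an SVC built with those parameters and refit on all of X_train.
best_model = search.best_estimator_
print(type(best_model).__name__)

# predict and score on the search object are delegated to best_estimator_.
predictions = search.predict(X_test)
manual_accuracy = accuracy_score(y_test, predictions)
print(manual_accuracy, search.score(X_test, y_test))
```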

from sklearn.model_selection import GridSearchCV

# list the parameters to tune together with their candidate values
param_grid = {"gamma":[0.001,0.01,0.1,1,10,100],
             "C":[0.001,0.01,0.1,1,10,100]}
print("Parameters:{}".format(param_grid))

grid_search = GridSearchCV(SVC(),param_grid,cv=5)  # instantiate a GridSearchCV object
X_train,X_test,y_train,y_test = train_test_split(iris.data,iris.target,random_state=10)
grid_search.fit(X_train,y_train)  # search for the best parameters, then fit a new SVC estimator with them
print("Test set score:{:.2f}".format(grid_search.score(X_test,y_test)))
print("Best parameters:{}".format(grid_search.best_params_))
print("Best cross-validation score:{:.2f}".format(grid_search.best_score_))

Output:

Parameters:{'gamma': [0.001, 0.01, 0.1, 1, 10, 100], 'C': [0.001, 0.01, 0.1, 1, 10, 100]}
Test set score:0.97
Best parameters:{'C': 10, 'gamma': 0.1}
Best cross-validation score:0.98

The common drawback of grid search is that it is time-consuming: the more parameters there are and the more candidate values each one has, the longer it takes. So in general, start with a wide range and then refine it.
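The cost and the "wide range first, then refine" advice can be sketched as a two-stage search. This is one possible way to do it, not a prescribed recipe: the `refine` helper and the choice of one decade on either side of the coarse optimum are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

# Stage 1: coarse grid, one candidate per power of ten.
coarse_values = [0.001, 0.01, 0.1, 1, 10, 100]
coarse_grid = {"gamma": coarse_values, "C": coarse_values}

# Cost of the coarse search: 6 * 6 combinations, each fitted on 5 folds
# (plus one final refit with the best parameters).
n_fits = len(coarse_values) ** 2 * 5
print(n_fits)  # 180

coarse = GridSearchCV(SVC(), coarse_grid, cv=5).fit(X_train, y_train)


def refine(center, num=5):
    """Hypothetical helper: num log-spaced values from center/10 to center*10."""
    exponent = np.log10(center)
    return np.logspace(exponent - 1, exponent + 1, num).tolist()


# Stage 2: a finer grid centred on the best coarse values.
fine_grid = {name: refine(value) for name, value in coarse.best_params_.items()}
fine = GridSearchCV(SVC(), fine_grid, cv=5).fit(X_train, y_train)

print(coarse.best_params_, coarse.best_score_)
print(fine.best_params_, fine.best_score_)
```

Two stages of 36 and 25 combinations cover the refined region far more cheaply than a single grid at the fine resolution over the whole range would. GridSearchCV also accepts n_jobs=-1 to run the fits in parallel on all available cores.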


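In summary

  • Grid Search: a tuning method that performs an exhaustive search over lists of parameter values, trains a model for every combination, and picks the best parameters. It follows that its main drawback is that it is time-consuming.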
