
Simple usage of GridSearch in sklearn, Python's machine learning package

cross-validation

A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k "folds":

1. A model is trained using k-1 of the folds as training data;
2. the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary test set), which is a major advantage in problems such as inverse inference where the number of samples is very small.

The passage above is quoted from the sklearn documentation's description of CV. It describes the standard rule in cross-validation: when solving a real problem, we split the whole dataset into a train_set (e.g., 70%) and a test_set (30%), run cross-validation on the train_set, take the average, and only then use the test_set to measure the model's accuracy. We do not run cross-validation directly on the whole dataset (this was one of my misconceptions about CV).
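
A minimal sketch of this workflow (the iris data and SVC settings below are stand-ins, not part of the quoted docs): split first, cross-validate only on the training part, and touch the test set once at the end.

from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

iris = datasets.load_iris()
# Hold out 30% as the final test set; CV happens on the remaining 70% only.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

clf = SVC(kernel='linear', C=1)
# Cross-validate on the training set only.
scores = cross_val_score(clf, X_train, y_train, cv=5)
print("CV accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))

# Only now is the held-out test set used, to check generalization.
clf.fit(X_train, y_train)
print("test accuracy: %0.3f" % clf.score(X_test, y_test))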

k-fold

I hadn't planned to write anything about cross-validation, but I realized I had quite a few misconceptions about it myself, so I'm writing it down; if anyone reads this and spots a problem, please point it out.

1. A model is trained using k-1 of the folds as training data;
2. the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).
The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop.

Premise: the whole dataset has been split into a training set D (70%) and a test set T (30%).

The quoted passage above is the whole k-fold process (only the training set D is involved at this point):
1. Split the training set D into k equally sized subsets, then pick k-1 of them as the training data and train a model, model1.
2. Use the remaining subset Di as the validation set (it plays exactly the same role as a test set) to measure model1's accuracy. For evaluation metrics, you can refer to the scoring methods implemented in sklearn.
3. Repeat the procedure above k times, making sure each subset serves as the validation set exactly once, then average the k accuracies; that average is the accuracy attributed to this classification method.
Some may ask: which model weights θ does the averaged accuracy correspond to? To answer that, you have to be clear about what the goal of machine learning is. The goal is not to find the weights of some particular model, but to select, for the actual problem, a suitable model (e.g., a support vector machine) and suitable hyperparameters (e.g., the kernel function, C, and so on). The averaged accuracy above corresponds to the model + hyperparameters.
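
A minimal sketch of those three steps with sklearn's KFold (the iris data stands in for the training set D; the model and hyperparameters are arbitrary):

import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data, iris.target  # stand-in for the training set D

kf = KFold(n_splits=5, shuffle=True, random_state=0)
accuracies = []
for train_idx, val_idx in kf.split(X):
    # Step 1: train on k-1 of the folds.
    model = SVC(kernel='rbf', gamma=0.001, C=100).fit(X[train_idx], y[train_idx])
    # Step 2: validate on the remaining fold Di.
    accuracies.append(model.score(X[val_idx], y[val_idx]))

# Step 3: the averaged accuracy belongs to this model + hyperparameter choice.
print("mean accuracy: %0.3f" % np.mean(accuracies))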

GridSearch

Once you understand k-fold, we can talk about GridSearch, because GridSearch defaults to 3-fold cross-validation (newer sklearn versions default to 5-fold); without understanding cross-validation it is hard to understand GridSearch.

What is it for

GridSearch exists to solve the parameter-tuning problem. For example, the commonly tuned parameters of an SVM include kernel, gamma, C, and so on. Tuning them by hand is too slow, and a hand-written loop can only run the combinations sequentially, not in parallel. Hence GridSearch: with it, you can find the optimal parameters directly.
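
As a quick sketch of that point (the variable names are mine, not from the original post): GridSearchCV's n_jobs parameter is what gives you the parallelism a hand-written loop lacks.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]}
# n_jobs=-1 evaluates the candidate parameter sets on all CPU cores in parallel.
search = GridSearchCV(SVC(), param_grid, n_jobs=-1)
# search.fit(X_train, y_train) would then try every combination and keep the best.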

How the parameters are tuned

The parameters are given as dicts; GridSearch takes every combination of the fields inside each dict and feeds it into the classifier to be run.

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
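
As a side note (not in the original post), sklearn's ParameterGrid helper can enumerate exactly which combinations such a list of dicts expands to; repeating the list above for self-containment:

from sklearn.model_selection import ParameterGrid

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

grid = list(ParameterGrid(tuned_parameters))
print(len(grid))  # 12 candidates: 2*4 from the rbf dict plus 4 from the linear dict
print(grid[0])    # e.g. {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}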

How the candidates are evaluated

Once the parameters are fed in, the predictive power of the model for each parameter combination has to be evaluated. GridSearch runs k-fold on the data, computes the mean accuracy of the model for each parameter combination, selects the optimal parameters, and returns them.
In general, GridSearch runs k-fold only on the training set and does not use the test set. The test set is held back until the end: once GridSearch has selected the best model, the test set is used to measure that model's generalization ability.

Here is an example from sklearn:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC

# Loading the Digits dataset
digits = datasets.load_digits()

# To apply a classifier on this data, we need to flatten the images, to
# turn the data into a (samples, features) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Set up the parameter grid for GridSearch
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

# Set the scoring methods used to evaluate the models; if unclear, see the k-fold section above
scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    # Build the GridSearchCV classifier, using 5-fold CV
    clf = GridSearchCV(SVC(), tuned_parameters, cv=5,
                       scoring='%s_weighted' % score)
    # Run k-fold only on the training set; the best parameters found are kept
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    # Print the best parameter combination found
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    # cv_results_ holds the mean/std test score for every parameter combination
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    # Test the generalization ability of the best model on the held-out test set.
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()

The example above follows the usual pattern. The SVC in the example supports multi-class classification; by default it uses the one-vs-one (ovo) scheme. If you need to change that, set the parameter decision_function_shape='ovr'; see the SVC API documentation for details.
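
A trivial sketch of where that parameter goes (note that in recent sklearn versions 'ovr' is already the default, and it only changes the shape of decision_function's output; SVC still trains one-vs-one internally):

from sklearn.svm import SVC

# decision_function_shape controls the shape of decision_function's output.
clf = SVC(kernel='rbf', decision_function_shape='ovr')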

A few points worth noting

1. Does GridSearch support multi-class classification?
GridSearch merely assembles the parameter combinations, feeds the data into the model in k-fold fashion, and evaluates the model's accuracy. It is not itself a new classification method, so as long as the estimator you pick can handle multi-class problems, so can GridSearch; the handwritten-digit recognition above is exactly such a multi-class problem. The scoring method you choose also has to support multi-class problems: when you evaluate the model with roc_auc, for instance, you need to pay attention to the data format, as in the sketch below.
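
A hedged sketch of roc_auc on the multi-class digits data, assuming a recent sklearn where roc_auc_score accepts per-class probabilities via multi_class='ovr':

from sklearn.datasets import load_digits
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# probability=True makes SVC expose predict_proba, which AUC needs.
clf = SVC(probability=True).fit(X_train, y_train)
proba = clf.predict_proba(X_test)  # shape (n_samples, n_classes)
print(roc_auc_score(y_test, proba, multi_class='ovr'))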

2. GridSearch estimators are sometimes nested: in ensemble learning with AdaBoost, for example, GridSearch needs to support nested parameters. A double underscore __ marks a parameter as nested, i.e., belonging to the inner estimator. (I haven't tried this myself; I've only seen others say so...) GridSearch also has dedicated APIs for ensemble learning, of course. A sketch follows right below.
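
A minimal sketch of the double-underscore convention with AdaBoost (assuming a recent sklearn, where the inner model is the estimator parameter; older versions named it base_estimator):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

ada = AdaBoostClassifier(estimator=DecisionTreeClassifier())
param_grid = {
    'n_estimators': [50, 100],           # a parameter of AdaBoost itself
    'estimator__max_depth': [1, 2, 3],   # nested: max_depth of the inner tree
}
search = GridSearchCV(ada, param_grid, cv=3)
# search.fit(X, y) would tune the outer and the nested parameters together.
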
This blog post also has an example of nested parameters:
———2017.4.18