
Machine Learning - Cross-Validation

Concept

"Cross-validation" is a reliable and accurate method for estimating a model's performance. It first partitions the dataset D into k mutually exclusive subsets of similar size, i.e. D = D1 ∪ D2 ∪ … ∪ Dk, with Di ∩ Dj = ∅ for i ≠ j. Each subset Di is constructed to preserve the overall data distribution as far as possible, i.e. it is obtained from D by stratified sampling. Then, in each round, the union of k−1 subsets is used as the training set and the remaining subset as the test set; this yields k training/test pairs, so k rounds of training and testing can be run, and the final result is the mean of the k test scores. This procedure is commonly called "k-fold cross-validation", and k is typically set to 10.

  • Advantage: k-fold CV effectively guards against both overfitting and underfitting, and the resulting performance estimate is fairly convincing.
  • Disadvantage: the choice of k matters; a small k gives a higher-variance estimate, while a large k is computationally expensive.
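The partition described above can be sketched in plain Python (a minimal illustration without shuffling or stratification; the helper name `kfold_indices` is hypothetical): indices 0..n−1 are split into k disjoint folds, each fold serves as the test set exactly once, and the remaining folds form the training set.

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k disjoint, nearly equal-sized folds."""
    # The first n % k folds get one extra sample so sizes differ by at most 1.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = kfold_indices(10, 3)
for i, test_fold in enumerate(folds):
    # Training set = union of all other folds (the k-1 remaining subsets).
    train_fold = [idx for j, f in enumerate(folds) if j != i for idx in f]
    print('round %d: test=%s train=%s' % (i, test_fold, train_fold))
```

Every sample lands in the test set exactly once across the k rounds, which is why averaging the k scores uses all of D for evaluation.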

Python implementation

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.3, random_state=42)

# Define the GBDT model
gbdt = GradientBoostingClassifier(
    init=None, learning_rate=0.05, loss='deviance', max_depth=5,
    max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
    min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=500,
    random_state=None, subsample=1.0, verbose=0, warm_start=False)

# Fit the model
gbdt.fit(X_train, y_train)
importances = gbdt.feature_importances_

# Predict and evaluate with AUC (X_test is a sparse matrix in this example)
y_pred_gbdt = gbdt.predict_proba(X_test.toarray())[:, 1]
gbdt_auc = roc_auc_score(y_test, y_pred_gbdt)
print('The AUC of GBDT: %.5f' % gbdt_auc)

# Cross-validation: the first argument must be the estimator itself, not a score
print('--------------------------------cross_validation----------------------------')
scores = cross_val_score(gbdt, X_all, y_all, cv=5, scoring='roc_auc')
print('GBDT mean AUC:')
print(scores.mean())
print('Per-fold AUC:')
print(scores)

The example above uses a GBDT model.

  • X_all: training data
  • y_all: labels
  • cv: number of folds
  • scoring: evaluation metric; a custom scorer can be passed, and there are many built-in options. For example, 'accuracy' returns classification accuracy. The built-in choices include:

['accuracy', 'adjusted_rand_score', 'average_precision', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss', 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc']
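What `cross_val_score` does internally can be sketched in plain Python, so the mechanics are visible without scikit-learn installed. This is only an illustration under simplifying assumptions: the toy `MajorityClassifier` (a hypothetical stand-in that always predicts the most frequent training label) replaces the GBDT, the folds are taken by simple striding rather than stratified sampling, and the metric is accuracy rather than AUC.

```python
class MajorityClassifier:
    """Toy model: always predicts the most frequent label seen in training."""
    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label] * len(X)

def cross_val_accuracy(model, X, y, cv=5):
    """Hand-rolled k-fold loop: fit on k-1 folds, score on the held-out fold."""
    n = len(X)
    scores = []
    for i in range(cv):
        test_idx = set(range(i, n, cv))  # every cv-th sample forms the test fold
        X_tr = [x for j, x in enumerate(X) if j not in test_idx]
        y_tr = [t for j, t in enumerate(y) if j not in test_idx]
        X_te = [x for j, x in enumerate(X) if j in test_idx]
        y_te = [t for j, t in enumerate(y) if j in test_idx]
        pred = model.fit(X_tr, y_tr).predict(X_te)
        scores.append(sum(p == t for p, t in zip(pred, y_te)) / len(y_te))
    return scores

X = [[i] for i in range(20)]
y = [0] * 12 + [1] * 8
scores = cross_val_accuracy(MajorityClassifier(), X, y, cv=5)
print('per-fold accuracy:', scores)
print('mean accuracy:', sum(scores) / len(scores))
```

The real `cross_val_score` follows the same fit/score loop over folds but adds stratified splitting for classifiers, pluggable scorers, and parallelism.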