Udacity Machine Learning: Cross-Validation


Exercise: Train/Test Split in Sklearn


The API of train_test_split changed: the function moved from sklearn.cross_validation to
sklearn.model_selection in the update from version 0.17 to 0.18. The old sklearn.cross_validation module was removed entirely in version 0.20.
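One way to keep a script working on both sides of that change is to try the new import location first and fall back to the old one. The following is a minimal sketch rather than part of the quiz; the fallback branch only matters on sklearn 0.17 or earlier.

```python
# Try the new location first (sklearn >= 0.18), then the old one (<= 0.17).
try:
    from sklearn.model_selection import train_test_split
except ImportError:
    from sklearn.cross_validation import train_test_split

from sklearn import datasets

iris = datasets.load_iris()

# Same settings the quiz asks for: 40% held out for testing, fixed seed.
features_train, features_test, labels_train, labels_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

# iris has 150 samples, so this leaves 90 for training and 60 for testing.
print(features_train.shape, features_test.shape)
```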

The correct documentation for this quiz is here: 
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
features = iris.data
labels = iris.target

###############################################################
### YOUR CODE HERE
###############################################################

### import the relevant code and make your train/test split
### name the output datasets features_train, features_test,
### labels_train, and labels_test

# PLEASE NOTE: The import here changes depending on your version of sklearn
# For version 0.17:
# from sklearn.cross_validation import train_test_split
# For version 0.18 and later:
from sklearn.model_selection import train_test_split

### set the random_state to 0 and the test_size to 0.4 so
### we can exactly check your result
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.4, random_state=0)

###############################################################
# DONT CHANGE ANYTHING HERE
clf = SVC(kernel="linear", C=1.)
clf.fit(features_train, labels_train)

print(clf.score(features_test, labels_test))

##############################################################
def submitAcc():
    return clf.score(features_test, labels_test)


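  • A split can end up containing samples from only one class, for example when the data is sorted by label
  • GridSearchCV uses cross-validation to choose parameters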
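Both points above can be seen on the iris data, whose labels are stored sorted by class. The sketch below is an illustration under assumptions, not course code: the parameter grid values are arbitrary choices, and StratifiedKFold is one possible remedy among several (shuffling before splitting is another).

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, KFold, StratifiedKFold
from sklearn.svm import SVC

iris = datasets.load_iris()
features, labels = iris.data, iris.target

# iris labels are sorted (50 samples per class), so three unshuffled folds
# put exactly one class into each test fold.
plain_folds = KFold(n_splits=3, shuffle=False)
classes_per_test_fold = [
    len(np.unique(labels[test_idx]))
    for _, test_idx in plain_folds.split(features)
]
print("classes per test fold, unshuffled KFold:", classes_per_test_fold)

# Stratified folds keep the class proportions, so every test fold
# contains all three classes.
stratified = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
classes_per_stratified_fold = [
    len(np.unique(labels[test_idx]))
    for _, test_idx in stratified.split(features, labels)
]
print("classes per test fold, StratifiedKFold:", classes_per_stratified_fold)

# GridSearchCV scores every parameter combination with cross-validation
# and keeps the best one. The grid values here are illustrative only.
param_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1.0, 10.0]}
search = GridSearchCV(SVC(), param_grid, cv=stratified)
search.fit(features, labels)
print(search.best_params_, search.best_score_)
```

A classifier tested on a fold whose single class never appeared in its training folds cannot predict it, which is why the unshuffled split is misleading here.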


labels, features = targetFeatureSplit(data)
labels = [1,0,1,1]

Exercise: Your First (Overfit) POI Identifier


  • validate_poi.py

    Starter code for the validation mini-project.
    The first step toward building your POI identifier!

    Start by loading/formatting the data

    After that, it's not our code anymore--it's yours!

import pickle
import sys
from feature_format import featureFormat, targetFeatureSplit

# open in binary mode so that pickle.load also works on Python 3
data_dict = pickle.load(open("../final_project/final_project_dataset.pkl", "rb"))

### first element is our labels, any added elements are predictor
### features. Keep this the same for the mini-project, but you'll
### have a different feature list when you do the final project.
features_list = ["poi", "salary"]

data = featureFormat(data_dict, features_list)
labels, features = targetFeatureSplit(data)

### it's all yours from here forward!  
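from sklearn import tree
clf = tree.DecisionTreeClassifier()
# train on ALL the data, then score on that same data (this is the overfit setup)
clf.fit(features, labels)
print(clf.score(features, labels))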



  • validate_poi.py

    Starter code for the validation mini-project.
    The first step toward building your POI identifier!

    Start by loading/formatting the data

    After that, it's not our code anymore--it's yours!

import pickle
import sys
from feature_format import featureFormat, targetFeatureSplit

# open in binary mode so that pickle.load also works on Python 3
data_dict = pickle.load(open("../final_project/final_project_dataset.pkl", "rb"))

### first element is our labels, any added elements are predictor
### features. Keep this the same for the mini-project, but you'll
### have a different feature list when you do the final project.
features_list = ["poi", "salary"]

data = featureFormat(data_dict, features_list)
labels, features = targetFeatureSplit(data)

### it's all yours from here forward!  

from sklearn.model_selection import train_test_split

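# hold out 30% of the data for testing
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.3, random_state=42)

from sklearn import tree
clf = tree.DecisionTreeClassifier()
# fit on the training split only, so the test split stays unseen
clf.fit(features_train, labels_train)

result = clf.predict(features_test)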

from sklearn.metrics import accuracy_score

print(accuracy_score(labels_test, result))

# equivalent shortcut:
# print(clf.score(features_test, labels_test))
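A single train/test split gives one accuracy figure that depends on which samples happen to land in the test set. As a rough sketch of how k-fold cross-validation averages over several splits, the example below compares training accuracy, one hold-out score, and 5-fold scores for a decision tree. It does not use the Enron data: the arrays are a hypothetical synthetic stand-in (100 samples, one numeric feature, 15 positives), so the printed numbers say nothing about the real POI identifier.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data (NOT the Enron dataset): one numeric feature,
# 85 negatives and 15 positives, with the positives shifted slightly upward.
rng = np.random.RandomState(42)
labels = np.array([0] * 85 + [1] * 15)
features = rng.normal(loc=labels.astype(float), scale=1.0).reshape(-1, 1)

clf = DecisionTreeClassifier(random_state=0)

# 1) Score on the same data used for training (the overfit setup).
train_accuracy = clf.fit(features, labels).score(features, labels)

# 2) Score on a single 30% hold-out split.
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.3, random_state=42)
holdout_accuracy = clf.fit(f_train, l_train).score(f_test, l_test)

# 3) Score with 5-fold cross-validation; each sample is tested exactly once.
cv_scores = cross_val_score(clf, features, labels, cv=5)

print("training accuracy:", train_accuracy)
print("hold-out accuracy:", holdout_accuracy)
print("5-fold accuracies:", cv_scores, "mean:", cv_scores.mean())
```

An unconstrained tree can memorise every distinct training sample, so the first figure is optimistic by construction; the hold-out and cross-validated figures are the ones that estimate performance on unseen data.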