
Udacity Machine Learning: Cross-Validation

Exercise: Train/Test Split in sklearn

#!/usr/bin/python

""" 
PLEASE NOTE:
The api of train_test_split changed and moved from sklearn.cross_validation to
sklearn.model_selection(version update from 0.17 to 0.18)

The correct documentation for this quiz is here: 
http://scikit-learn.org/0.17/modules/cross_validation.html
"""
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
features = iris.data
labels = iris.target

###############################################################
### YOUR CODE HERE
###############################################################

### import the relevant code and make your train/test split
### name the output datasets features_train, features_test,
### labels_train, and labels_test

# PLEASE NOTE: The import here changes depending on your version of sklearn
from sklearn import cross_validation  # for version 0.17
# For version 0.18
# from sklearn.model_selection import train_test_split

### set the random_state to 0 and the test_size to 0.4 so
### we can exactly check your result
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)

###############################################################
# DONT CHANGE ANYTHING HERE
clf = SVC(kernel="linear", C=1.)
clf.fit(features_train, labels_train)
print clf.score(features_test, labels_test)

##############################################################
def submitAcc():
    return clf.score(features_test, labels_test)
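For sklearn 0.18 and later, where sklearn.cross_validation was removed, the same split would look roughly like this (a minimal sketch assuming the sklearn.model_selection API, shown with Python 3 print syntax):

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = datasets.load_iris()

# Same split as above: 40% of the data held out for testing, fixed random seed.
features_train, features_test, labels_train, labels_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

clf = SVC(kernel="linear", C=1.)
clf.fit(features_train, labels_train)
print(clf.score(features_test, labels_test))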

K-Fold Cross-Validation

  • Without shuffling, a K-fold split can produce folds in which all the examples belong to the same class (for instance, when the data is sorted by label)
  • GridSearchCV uses exactly this kind of cross-validation to choose its parameters (see the sketch below)
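A minimal sketch of both points, assuming sklearn's model_selection API (KFold, GridSearchCV) and the iris data from the exercise above: shuffling prevents single-class folds, and GridSearchCV evaluates every parameter combination by cross-validation.

from sklearn import datasets
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.svm import SVC

iris = datasets.load_iris()

# iris.target is sorted by class, so un-shuffled folds would each contain
# only one or two classes; shuffle=True avoids that problem.
kf = KFold(n_splits=3, shuffle=True, random_state=0)

# GridSearchCV refits the estimator for every parameter combination,
# scores each one with cross-validation, and keeps the best.
param_grid = {"kernel": ["linear", "rbf"], "C": [1, 10]}
grid = GridSearchCV(SVC(), param_grid, cv=kf)
grid.fit(iris.data, iris.target)
print(grid.best_params_)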

Note: the behavior of targetFeatureSplit, a helper function written by Udacity

labels, features = targetFeatureSplit(data)
data is a two-dimensional array, for example:
[
    [1,12.1],
    [0,14.1],
    [1,13.1],
    [1,15.2]
]
By default, the function's first return value is the first column, which is used as the labels. The return value looks like this:
labels = [1,0,1,1]
The second return value is the second column, which is used as the training features. It looks like this:
features=
[
    [12.1],
    [14.1],
    [13.1],
    [15.2]
]
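For reference, a minimal sketch of what targetFeatureSplit presumably does (the real implementation ships with the course in ../tools/feature_format.py):

def targetFeatureSplit(data):
    # Assumed behavior, matching the example above: the first column of each
    # row becomes a label, the remaining columns become the feature vector.
    target = []
    features = []
    for item in data:
        target.append(item[0])
        features.append(item[1:])
    return target, features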

Exercise: Your First (Overfit) POI Identifier

Answer: 0.989473684211 (inflated, because the classifier is scored on the same data it was trained on)

  • validate_poi.py
#!/usr/bin/python


"""
    Starter code for the validation mini-project.
    The first step toward building your POI identifier!

    Start by loading/formatting the data

    After that, it's not our code anymore--it's yours!
"""

import pickle
import sys
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit

data_dict = pickle.load(open("../final_project/final_project_dataset.pkl", "r") )

### first element is our labels, any added elements are predictor
### features. Keep this the same for the mini-project, but you'll
### have a different feature list when you do the final project.
features_list = ["poi", "salary"]

data = featureFormat(data_dict, features_list)
labels, features = targetFeatureSplit(data)

### it's all yours from here forward!  
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf.fit(features, labels)
print clf.score(features, labels)

Exercise: Deploying a Training/Testing Regime

Answer: 0.724137931034 (the accuracy drops once the classifier is scored on a held-out test set instead of its own training data)

  • validate_poi.py
#!/usr/bin/python


"""
    Starter code for the validation mini-project.
    The first step toward building your POI identifier!

    Start by loading/formatting the data

    After that, it's not our code anymore--it's yours!
"""

import pickle
import sys
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit

data_dict = pickle.load(open("../final_project/final_project_dataset.pkl", "r") )

### first element is our labels, any added elements are predictor
### features. Keep this the same for the mini-project, but you'll
### have a different feature list when you do the final project.
features_list = ["poi", "salary"]

data = featureFormat(data_dict, features_list)
labels, features = targetFeatureSplit(data)

### it's all yours from here forward!  

from sklearn.model_selection import train_test_split

features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.3, random_state=42)

from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf.fit(features_train, labels_train)

result = clf.predict(features_test)

from sklearn.metrics import accuracy_score

print accuracy_score(labels_test, result)

#print clf.score(features_test,labels_test)
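A single 70/30 split depends on one particular random partition. As a follow-up, the K-fold idea from earlier can score the same classifier more robustly; a minimal sketch, assuming sklearn.model_selection.cross_val_score and the features/labels produced by featureFormat/targetFeatureSplit above:

from sklearn import tree
from sklearn.model_selection import cross_val_score

clf = tree.DecisionTreeClassifier()

# Score the classifier on 5 different train/test folds and average,
# instead of relying on a single random 70/30 split.
scores = cross_val_score(clf, features, labels, cv=5)
print(scores.mean())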