
Python Assignment: sklearn

Scikit-Learn: Machine Learning in Python


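Learning objectives:

Learn the sklearn library for Python and get comfortable with three classification methods: Naive Bayes, SVM, and Random Forest. Complete the assignment, compare and analyze the results, and briefly summarize the training outcome.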


Assignment :

In the second ML assignment you have to compare the performance of three different classification algorithms, namely Naive Bayes, SVM, and Random Forest. For this assignment you need to generate a random binary classification problem, and then train and test (using 10-fold cross validation) the three algorithms. For some algorithms, inner cross validation (5-fold) is needed for choosing the parameters. Then, show the classification performance (per-fold and averaged) in the report, and briefly discuss the results.


Note:

The report must also contain a short description of the methodology used to obtain the results.


Steps:

1 Create a classification dataset (n_samples >= 1000, n_features >= 10)
2 Split the dataset using 10-fold cross validation
3 Train the algorithms
      GaussianNB
      SVC (possible C values [1e-02, 1e-01, 1e00, 1e01, 1e02], RBF kernel)
      RandomForestClassifier (possible n_estimators values [10, 100, 1000])
4 Evaluate the cross-validated performance
      Accuracy
      F1-score
      AUC ROC
5 Write a short report summarizing the methodology and the results


Step1:

Create a classification dataset (n samples >=1000, n features >= 10)
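scikit-learn bundles the Iris dataset for classification, but Iris has only 150 samples, 4 features, and 3 classes, so it does not meet the requirements of this step. The data is therefore generated with make_classification.
from sklearn import datasets

# Bundled dataset, loaded for reference only; it is not used below.
iris = datasets.load_iris()

# Artificial data generator. Returns:
#   X: array of shape [n_samples, n_features], the generated samples
#   y: array of shape [n_samples], the integer class label of each sample
X, y = datasets.make_classification(n_samples=1000, n_features=10,
                                    n_informative=2, n_redundant=2,
                                    n_repeated=0, n_classes=2)
print(X)
print(y)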
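Printing the full arrays is hard to read, so a quick sanity check on shapes and class balance may be more useful. The following is a minimal sketch; `random_state=0` is an assumption added here for reproducibility and is not part of the original code.

```python
import numpy as np
from sklearn.datasets import make_classification

# random_state=0 is an assumption added so the data is reproducible.
X, y = make_classification(
    n_samples=1000, n_features=10,
    n_informative=2, n_redundant=2, n_repeated=0,
    n_classes=2, random_state=0,
)

print(X.shape)         # (n_samples, n_features)
print(y.shape)         # (n_samples,)
print(np.bincount(y))  # number of samples in each class
```

If the two class counts are far apart, accuracy alone becomes a misleading metric, which is one reason the assignment also asks for F1 and AUC.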



Step2 :

Split the dataset using 10-fold cross validation
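# sklearn.cross_validation was removed in scikit-learn 0.20;
# KFold now lives in sklearn.model_selection.
from sklearn.model_selection import KFold
kf = KFold(n_splits=10, shuffle=True)
for train_index, test_index in kf.split(X):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]
# Note: after the loop, these variables hold only the last fold.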
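Because the loop reassigns the train and test variables on every pass, any code placed after it sees only the final fold. A sketch of iterating over all ten folds and doing the work inside the loop follows. It uses `StratifiedKFold` instead of plain `KFold`, which is a substitution made here (not the original's choice) so that each fold keeps roughly the same class ratio; the `random_state` values are likewise assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Stratified splitting keeps the class ratio similar in every fold.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

fold_sizes = []
for fold, (train_index, test_index) in enumerate(skf.split(X, y)):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]
    fold_sizes.append((len(train_index), len(test_index)))
    # Per-fold work (fit, predict, score) belongs here, inside the loop.
    print(fold, len(train_index), len(test_index), y_test.mean())
```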


Step3:

Train the algorithms
      GaussianNB
      SVC (possible C values [1e-02, 1e-01, 1e00, 1e01, 1e02], RBF kernel)
      RandomForestClassifier (possible n_estimators values [10, 100, 1000])

GaussianNB:

from sklearn.naive_bayes import GaussianNB
model1 = GaussianNB()
model1.fit(X_train, y_train)
# Predict with the fitted model (the name clf was never defined).
predict = model1.predict(X_test)
print(predict)
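GaussianNB has no hyperparameters to tune in this assignment, so the whole 10-fold evaluation can be delegated to `cross_val_score`. This is a sketch under the assumption that accuracy is the metric of interest at this point; the `random_state` values are added for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
# One accuracy value per fold; the model is refit from scratch on each fold.
nb_scores = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="accuracy")

print(nb_scores)         # per-fold accuracy
print(nb_scores.mean())  # averaged accuracy
```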



SVC:

from sklearn.svm import SVC
for num in [1e-02, 1e-01, 1e00, 1e01, 1e02]:
    # C must be passed by keyword in current scikit-learn releases.
    model2 = SVC(C=num, kernel='rbf', gamma=0.1)
    model2.fit(X_train, y_train)
    predict2 = model2.predict(X_test)
    print(predict2)
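The loop fits one SVC per C value but does not pick among them. The assignment asks for an inner 5-fold cross validation to make that choice, which `GridSearchCV` can do on the training part alone. In this sketch a single `train_test_split` stands in for one outer fold; that stand-in and the `random_state` values are assumptions made to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
# Stand-in for one outer fold: 90% train, 10% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

param_grid = {"C": [1e-02, 1e-01, 1e00, 1e01, 1e02]}
# Inner 5-fold CV on the training data only; the test data is never seen here.
search = GridSearchCV(SVC(kernel="rbf", gamma=0.1), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)           # C chosen by the inner CV
test_accuracy = search.score(X_test, y_test)
print(test_accuracy)                 # accuracy of the refit best model
```

Choosing C on the test fold itself would leak information into the reported score, which is why the selection is kept inside the training data.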




RandomForestClassifier:

from sklearn.ensemble import RandomForestClassifier
for n_estimators in [10, 100, 1000]:
    # Use the loop variable; a hard-coded n_estimators=6 would ignore it.
    model3 = RandomForestClassifier(n_estimators=n_estimators)
    model3.fit(X_train, y_train)
    predict3 = model3.predict(X_test)
    print(predict3)
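The full procedure described in the assignment, an outer 10-fold loop with an inner 5-fold parameter search, can be sketched by passing a `GridSearchCV` object to `cross_val_score`. The grid below is deliberately reduced to `[10, 50]` so the example runs quickly; that reduction and the `random_state` values are assumptions, and the assignment's grid `[10, 100, 1000]` can be substituted at the cost of a much longer run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Reduced grid (assumption, to keep the run short).
param_grid = {"n_estimators": [10, 50]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=inner_cv, scoring="accuracy")

# For each outer fold, the inner search runs on the training part only,
# then the refit best model is scored on the held-out part.
rf_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")

print(rf_scores)         # per-fold accuracy
print(rf_scores.mean())  # averaged accuracy
```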



Step4:

Evaluate the cross-validated performance
      Accuracy
      F1-score
      AUC ROC

GaussianNB:

from sklearn import metrics
accuracy = metrics.accuracy_score(y_test, predict)
print(accuracy)
# Score the GaussianNB predictions (the name pred was never defined).
F1_score = metrics.f1_score(y_test, predict)
print(F1_score)
auc_roc = metrics.roc_auc_score(y_test, predict)
print(auc_roc)
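`roc_auc_score` accepts hard 0/1 predictions, but the resulting ROC curve then has a single operating point. Passing a continuous score, such as the positive-class probability from `predict_proba`, is the more usual input. The sketch below computes both for comparison; the single `train_test_split` and the `random_state` values are assumptions made for brevity.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

model = GaussianNB().fit(X_train, y_train)

labels = model.predict(X_test)              # hard 0/1 predictions
scores = model.predict_proba(X_test)[:, 1]  # probability of class 1

auc_from_labels = metrics.roc_auc_score(y_test, labels)
auc_from_scores = metrics.roc_auc_score(y_test, scores)
print(auc_from_labels, auc_from_scores)
```

For SVC without probability estimates, `decision_function` plays the same role as `predict_proba` here.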

SVC:

for num in [1e-02, 1e-01, 1e00, 1e01, 1e02]:
    # C must be passed by keyword in current scikit-learn releases.
    model2 = SVC(C=num, kernel='rbf', gamma=0.1)
    model2.fit(X_train, y_train)
    predict2 = model2.predict(X_test)

    accuracy = metrics.accuracy_score(y_test, predict2)
    print(accuracy)
    F1_score = metrics.f1_score(y_test, predict2)
    print(F1_score)
    auc_roc = metrics.roc_auc_score(y_test, predict2)
    print(auc_roc)


RandomForestClassifier:

for n_estimators in [10, 100, 1000]:
    # Use the loop variable; a hard-coded n_estimators=6 would ignore it.
    model3 = RandomForestClassifier(n_estimators=n_estimators)
    model3.fit(X_train, y_train)
    predict3 = model3.predict(X_test)

    accuracy = metrics.accuracy_score(y_test, predict3)
    print(accuracy)
    F1_score = metrics.f1_score(y_test, predict3)
    print(F1_score)
    auc_roc = metrics.roc_auc_score(y_test, predict3)
    print(auc_roc)
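To report all three metrics per fold and averaged, as the assignment requires, `cross_validate` can score each model over the same ten folds. In this sketch the hyperparameters are fixed at illustrative values (`C=1.0`, `n_estimators=100`) rather than tuned; those values, the stratified splitter, and the `random_state` settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Illustrative, untuned settings (assumptions).
models = {
    "GaussianNB": GaussianNB(),
    "SVC": SVC(C=1.0, kernel="rbf", gamma=0.1),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scoring = ["accuracy", "f1", "roc_auc"]

results = {}
for name, model in models.items():
    cv_result = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # cross_validate stores each metric under the key "test_<metric>".
    results[name] = {m: cv_result["test_" + m] for m in scoring}
    for m in scoring:
        fold_values = results[name][m]
        print(name, m, fold_values.round(3), "mean=%.3f" % fold_values.mean())
```

Using the same `cv` object with a fixed seed for all three models means every model is evaluated on identical folds, which makes the per-fold numbers directly comparable.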


Step5:

Write a short report summarizing the methodology and the results

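 Conclusion 1: Ranked from worst to best, the three models are GaussianNB < SVC < RandomForestClassifier.

 Conclusion 2: For SVC, C = 1e00 gave the best results.

 Conclusion 3: For RandomForestClassifier, smaller values of n_estimators gave better results.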


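(Most of the time spent on this assignment went into an Anaconda problem: sklearn could not be imported in Spyder, although importing it directly in IPython worked fine.)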
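When a package imports in one environment but not another, a common cause is that the two are running different Python interpreters. The source does not confirm that this was the cause here, so the following is only a diagnostic sketch: run it in both the IDE console and IPython and compare the output.

```python
import importlib.util
import sys

# Path of the interpreter running this code.
print(sys.executable)

# Where this interpreter would load sklearn from, if it can find it at all.
spec = importlib.util.find_spec("sklearn")
print(spec.origin if spec is not None else "sklearn not found by this interpreter")
```

If the two interpreter paths differ, installing scikit-learn into the environment the IDE uses, or pointing the IDE at the other interpreter, is the usual remedy.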

sklearn provides many datasets and training methods that are worth studying further.