
Machine Learning in Action, Basics: Classification

The two most important kinds of tasks in machine learning are classification and regression. The book I am working through is 《Python機器學習及實踐——從零開始通往Kaggle競賽之路》 (Python Machine Learning and Practice: From Zero to Kaggle), published by Tsinghua University Press. Its basics section does not spend much time on the theory behind the algorithms; as the title suggests, it emphasizes practice. I will study the algorithms in detail in later posts, using Zhou Zhihua's 《機器學習》 (Machine Learning) as the reference book and Andrew Ng's Machine Learning course on NetEase Open Courses as the video material.

Overall, the book introduces the following classification algorithms:

  • Linear classifiers (Linear Classifier)
  • Support vector machines (SVM, Support Vector Classifier)
  • Naive Bayes (Naive Bayes Classifier)
  • K-Nearest Neighbors (KNeighborsClassifier)
  • Decision trees (DecisionTreeClassifier)

The book supplies a dataset to practice each of these algorithms on. After coding along with all of them, I found that if you only want to apply the algorithms, the following steps are enough:


The figure above was clipped from my OneNote notes. Some of its steps may not be entirely uniform, but on the whole this is the sequence that gets the job done. To make it easier to follow, part of the source code is pasted below.
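The workflow from the figure (load data, clean, split, standardize, train, evaluate) can be sketched end to end as follows. This is my own minimal sketch, using a made-up two-blob dataset rather than any of the book's downloads:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Step 1: obtain data (here: two Gaussian blobs as stand-in classes)
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Step 2: split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33)

# Step 3: standardize features (fit the scaler on training data only)
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

# Step 4: fit a model; Step 5: evaluate on the held-out data
clf = LogisticRegression().fit(X_train, y_train)
print(clf.score(X_test, y_test))
```

Every example below follows this same skeleton; only the dataset and the estimator change.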

  • Linear classification
# import pandas and numpy
import pandas as pd
import numpy as np

# construct the attribute list
column_names = ['Sample code number', 'Clump Thickness',
                'Uniformity of Cell Size', 'Uniformity of Cell Shape',
                'Marginal Adhesion', 'Single Epithelial Cell Size',
                'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli',
                'Mitoses', 'Class']
# read the dataset with pandas.read_csv
data = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data',
    names=column_names)
# replace the missing-value marker '?' with NaN
data = data.replace(to_replace='?', value=np.nan)
# drop rows that contain missing values
data = data.dropna(how='any')
print(data.shape)  # print the dimensions of the data

# split the data for training and testing
# (train_test_split now lives in sklearn.model_selection; the old
# sklearn.cross_validation module has been removed)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    data[column_names[1:10]], data[column_names[10]],
    test_size=0.25, random_state=33)
# inspect the class distribution of the split
print(y_train.value_counts())
print(y_test.value_counts())

# prediction
# First standardize so that every feature has zero mean and unit variance;
# otherwise features with large values dominate the prediction.
from sklearn.preprocessing import StandardScaler
# import LogisticRegression and SGDClassifier for training
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

# standardize the data
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

# LogisticRegression
lr = LogisticRegression()         # initialize
lr.fit(X_train, y_train)          # train with fit
lr_y_predict = lr.predict(X_test)

# SGDClassifier
sgdc = SGDClassifier()
sgdc.fit(X_train, y_train)
sgdc_y_predict = sgdc.predict(X_test)

# performance assessment with classification_report
from sklearn.metrics import classification_report
# accuracy via the score function
print('Accuracy of LR classifier:', lr.score(X_test, y_test))
# recall and precision
print(classification_report(y_test, lr_y_predict,
                            target_names=['Benign', 'Malignant']))

# the stochastic gradient descent classifier
print('Accuracy of SGD classifier:', sgdc.score(X_test, y_test))
print(classification_report(y_test, sgdc_y_predict,
                            target_names=['Benign', 'Malignant']))
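The list above also includes a Naive Bayes classifier, for which no code was pasted. Here is my own minimal sketch (not taken from the book), using scikit-learn's bundled breast-cancer dataset so that no download is needed:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# load the bundled dataset and split it as in the examples above
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.25, random_state=33)

# GaussianNB models each feature as normally distributed within a class
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_predict = gnb.predict(X_test)

print('Accuracy of GaussianNB:', gnb.score(X_test, y_test))
print(classification_report(y_test, y_predict,
                            target_names=['malignant', 'benign']))
```

No standardization step is needed here, since GaussianNB estimates a per-feature mean and variance itself.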
  • Decision tree (DecisionTreeClassifier)
# read the data
import pandas as pd
titanic = pd.read_csv('C:/Users/Zhipeng Wang/Downloads/train.csv')
# print(titanic.head())
# print(titanic.info())

# select features (copy to avoid a SettingWithCopyWarning on the slice)
X = titanic[['Pclass', 'Age', 'Sex']].copy()
y = titanic['Survived']
# print(X.info())
X['Age'] = X['Age'].fillna(X['Age'].mean())  # fill missing ages with the mean
X.info()

# split the data (train_test_split moved to sklearn.model_selection)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33)

# feature transformation: turn each row dict into a numeric vector
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
X_test = vec.transform(X_test.to_dict(orient='records'))

# import and initialize the decision tree classifier
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_predict = dtc.predict(X_test)

# performance assessment (true labels come first in classification_report)
from sklearn.metrics import classification_report
print(dtc.score(X_test, y_test))
print(classification_report(y_test, y_predict,
                            target_names=['Died', 'Survived']))
  • SVM

# import the digits loader from sklearn.datasets and assign it to digits
from sklearn.datasets import load_digits
digits = load_digits()
print(digits.data.shape)

# split the dataset: 25% of the data for testing
# (train_test_split moved to sklearn.model_selection)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=33)
# check the data scale
print(y_train.shape)
print(y_test.shape)

# standardize the data and use LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
# train and predict
lsvc = LinearSVC()
lsvc.fit(X_train, y_train)
y_predict = lsvc.predict(X_test)

# performance assessment
print('The Accuracy of Linear SVC is', lsvc.score(X_test, y_test))
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predict,
                            target_names=digits.target_names.astype(str)))
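K-Nearest Neighbors is also on the list above but had no pasted code. Here is my own minimal sketch (not from the book), reusing the same digits data and split as the SVM example:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=33)

# standardize, since KNN's distances are sensitive to feature scale
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

# classify each test digit by a vote among its 5 nearest training points
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print('Accuracy of KNN:', knn.score(X_test, y_test))
```

Note that KNN has no real training phase; `fit` just stores the training set, and all the work happens at prediction time.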

None of these code snippets is long; they are for reference only. The next article, on regression, will follow a similar style.