sklearn in Practice: Building Classification and Regression Models on the Bundled Datasets (Boston House Prices, Iris, Diabetes, etc.)
阿新 • Published: 2018-12-08
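Almost anyone who has worked with machine learning has heard of sklearn. It is a real workhorse for getting machine learning work done quickly, and it is what I used in graduate school to build the common machine learning models. Today, prompted by some needs in my department, I am writing up a short summary of how to use it, and I will keep adding to it later. This post only uses the datasets bundled with sklearn to practice classification and regression models. The models are simple, so I will not explain them. The main point is to see which datasets sklearn provides out of the box. The hands-on part follows: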
#!/usr/bin/env python
# encoding: utf-8
'''
__Author__: 沂水寒城
Purpose: explore the datasets bundled with sklearn.

sklearn ships with the following datasets for practicing algorithms:
load_boston([return_X_y])            Boston house prices; regression
load_iris([return_X_y])              iris; classification
load_diabetes([return_X_y])          diabetes; regression
load_digits([n_class, return_X_y])   handwritten digits; classification
load_linnerud([return_X_y])          linnerud; multivariate (multi-output) regression
'''
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

# Dataset loaders
from sklearn.datasets import load_iris
from sklearn.datasets import load_boston
from sklearn.datasets import load_diabetes
from sklearn.datasets import load_linnerud
# Models and utilities
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression


def split_data(data_list, y_list, ratio=0.30):
    '''
    Split the samples into training and test sets.
    ratio: fraction of the samples held out for testing
    '''
    X_train, X_test, y_train, y_test = train_test_split(
        data_list, y_list, test_size=ratio, random_state=42)
    print '--------------------------------data shape-----------------------------------'
    print len(X_train), len(y_train)
    print len(X_test), len(y_test)
    return X_train, X_test, y_train, y_test


def fit_linear_regression(data, target):
    '''
    Print the dataset shapes, fit a LinearRegression on the training split,
    and print the fitted coefficients and intercept.
    '''
    print data.shape
    print target.shape
    X_train, X_test, y_train, y_test = split_data(data, target)
    model = LinearRegression()
    model.fit(X_train, y_train)
    print u"Coefficient matrix:"
    print model.coef_.tolist()
    print u"Intercept"
    print model.intercept_


def regressionModels():
    '''
    Regression models
    '''
    # Boston house prices
    boston = load_boston()
    fit_linear_regression(boston.data, boston.target)
    print '-----------------------------------------------------------------'
    # Diabetes dataset
    diabetes = load_diabetes()
    fit_linear_regression(diabetes.data, diabetes.target)
    print '-----------------------------------------------------------------'
    # Linnerud dataset (three targets, so coef_ is a 3x3 matrix)
    linnerud = load_linnerud()
    fit_linear_regression(linnerud.data, linnerud.target)


def classificationModels():
    '''
    Classification models
    '''
    # Iris dataset
    iris = load_iris()
    data = iris.data
    target = iris.target
    print data.shape
    print target.shape
    X_train, X_test, y_train, y_test = split_data(data, target)
    model = svm.SVC()
    # NOTE: this fits on the full dataset, so the model has already seen the
    # test rows it is scored on below. Use model.fit(X_train, y_train) for an
    # honest held-out accuracy.
    model.fit(data, target)
    y_predict = model.predict(X_test)
    print "Accuracy:"
    print accuracy_score(y_test, y_predict)


if __name__ == '__main__':
    regressionModels()
    classificationModels()
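One caveat about classificationModels in the script: it calls model.fit(data, target), so the SVC is trained on the same 45 rows it is later scored on, which makes the reported accuracy optimistic. A minimal sketch of the leak-free variant is below. It is written for Python 3 and a recent scikit-learn, which is an assumption on my part (the script itself targets Python 2.7), and the helper name evaluate_iris_svc is made up for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def evaluate_iris_svc(test_ratio=0.30, seed=42):
    """Fit an SVC on the training split only and score it on the held-out split.

    Hypothetical helper, not part of the original script.
    """
    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=test_ratio, random_state=seed)
    model = SVC()
    model.fit(X_train, y_train)      # training rows only, no test rows
    y_pred = model.predict(X_test)   # predictions for rows the model never saw
    return len(X_train), len(X_test), accuracy_score(y_test, y_pred)


n_train, n_test, acc = evaluate_iris_svc()
print("train/test sizes:", n_train, n_test)
print("held-out accuracy:", acc)
```

The split sizes should match the 105/45 split that split_data reports for iris. The accuracy value depends on the installed scikit-learn version and its SVC defaults, so no expected number is given here.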
The full script (regressionModels followed by classificationModels) was tested under Python 2.7. Its output is shown below:
# Regression output
(506L, 13L)
(506L,)
--------------------------------data shape-----------------------------------
354 354
152 152
Coefficient matrix:
[-0.13347010285294442, 0.03580891359322994, 0.04952264522005112, 3.119835116285431, -15.417060895306475, 4.057199231645387, -0.010820835184929944, -1.3859982431608757, 0.24272733982224273, -0.008702234365661983, -0.9106852081102892, 0.011794115892572796, -0.547113312823961]
Intercept
31.63108403569312
-----------------------------------------------------------------
(442L, 10L)
(442L,)
--------------------------------data shape-----------------------------------
309 309
133 133
Coefficient matrix:
[29.250345824146294, -261.70768052669956, 546.2973726341081, 388.4007725749296, -901.9533870552892, 506.76114900102954, 121.14845947917183, 288.0293249509, 659.2713384575223, 41.375369011084985]
Intercept
151.00818273080338
-----------------------------------------------------------------
(20L, 3L)
(20L, 3L)
--------------------------------data shape-----------------------------------
14 14
6 6
Coefficient matrix:
[[0.30287324212545036, -0.37960681782019234, 0.16074123821975367], [-0.09726556450063326, -0.047093687917992795, 0.027083148889626204], [-0.5869461483881073, 0.053272106843233316, -0.004283316055441206]]
Intercept
[213.69337737 41.05782266 54.38303224]

# Classification output
(150L, 4L)
(150L,)
--------------------------------data shape-----------------------------------
105 105
45 45
Accuracy:
1.0
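To make the printed coefficients and intercepts concrete, here is a small sketch of how a fitted linear regression turns them into a prediction: each target is the dot product of one coefficient row with the input, plus that target's intercept. The numbers are copied from the linnerud part of the output. The function name predict_one and the sample input are hypothetical, and I am assuming the usual linnerud ordering (features Chins, Situps, Jumps; targets Weight, Waist, Pulse).

```python
# Coefficient matrix and intercepts copied from the linnerud output:
# one row per target, one column per input feature.
COEF = [
    [0.30287324212545036, -0.37960681782019234, 0.16074123821975367],
    [-0.09726556450063326, -0.047093687917992795, 0.027083148889626204],
    [-0.5869461483881073, 0.053272106843233316, -0.004283316055441206],
]
INTERCEPT = [213.69337737, 41.05782266, 54.38303224]


def predict_one(x, coef=COEF, intercept=INTERCEPT):
    """Return one value per target: dot(coef_row, x) + intercept.

    Hypothetical helper that mirrors what a fitted linear model computes.
    """
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(coef, intercept)]


# Hypothetical input: 5 chin-ups, 162 sit-ups, 60 jumps.
sample = [5, 162, 60]
prediction = predict_one(sample)
print(prediction)
```

With an all-zero input the prediction is just the intercept vector, which is a quick way to check that the rows and intercepts are paired correctly.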
Questions and discussion are welcome!