
sklearn in Practice: Building Classification and Regression Models on the Built-in Datasets (Boston Housing, Iris, Diabetes, etc.)

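Almost anyone who has worked with machine learning has heard of sklearn; it is a remarkably handy tool for getting machine learning work done quickly. I used it in graduate school to build the common machine learning models. Today, in response to some requests from my department, I put together a short summary of how to use it, and I will keep updating it later. This post only uses the datasets bundled with sklearn to practice classification and regression models. The code is simple enough that I will not explain it further; the main point is to see which datasets sklearn ships with and how conveniently they can be used. The practice code follows: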

#!/usr/bin/env python
#encoding:utf-8


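'''
__Author__: 沂水寒城
Purpose: explore the sklearn datasets
sklearn ships with the following datasets for practicing algorithms.
load_boston([return_X_y]) loads the Boston house-prices data; for regression
load_iris([return_X_y]) loads the iris dataset; for classification
load_diabetes([return_X_y]) loads the diabetes dataset; for regression
load_digits([n_class, return_X_y]) loads the handwritten digits dataset; for classification
load_linnerud([return_X_y]) loads the linnerud dataset; for multivariate regression
'''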


import sys
reload(sys)
sys.setdefaultencoding('utf-8')
# Load the datasets
from sklearn.datasets import load_iris
from sklearn.datasets import load_boston
from sklearn.datasets import load_diabetes
from sklearn.datasets import load_linnerud
# Load the models, the metric, and the split utility
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression


def split_data(data_list, y_list, ratio=0.30):
    '''
    Split the samples into training and test sets at the given ratio
    ratio: fraction of the samples held out for testing
    '''
    X_train, X_test, y_train, y_test = train_test_split(data_list, y_list, test_size=ratio, random_state=42)
    print '--------------------------------data shape-----------------------------------'
    print len(X_train), len(y_train)
    print len(X_test), len(y_test)
    return X_train, X_test, y_train, y_test


def regressionModels():
    '''
    Regression models
    '''
    # Boston house-prices data
    boston=load_boston()
    data=boston.data
    target=boston.target
    print data.shape
    print target.shape
    X_train, X_test, y_train, y_test=split_data(data,target)
    model=LinearRegression()
    model.fit(X_train,y_train)
    print u"Coefficient matrix:"
    print model.coef_.tolist()
    print u"Intercept"
    print model.intercept_
    print '-----------------------------------------------------------------'
    # Diabetes dataset
    diabetes=load_diabetes()
    data=diabetes.data
    target=diabetes.target
    print data.shape
    print target.shape
    X_train, X_test, y_train, y_test=split_data(data,target)
    model=LinearRegression()
    model.fit(X_train,y_train)
    print u"Coefficient matrix:"
    print model.coef_.tolist()
    print u"Intercept"
    print model.intercept_
    print '-----------------------------------------------------------------'
    # Linnerud dataset (three targets, so coef_ is a matrix and intercept_ a vector)
    linnerud=load_linnerud()
    data=linnerud.data
    target=linnerud.target
    print data.shape
    print target.shape
    X_train, X_test, y_train, y_test=split_data(data,target)
    model=LinearRegression()
    model.fit(X_train,y_train)
    print u"Coefficient matrix:"
    print model.coef_.tolist()
    print u"Intercept"
    print model.intercept_

    

def classificationModels():
    '''
    Classification models
    '''
    # Iris dataset
    iris=load_iris()
    data=iris.data
    target=iris.target
    print data.shape
    print target.shape
    X_train, X_test, y_train, y_test=split_data(data,target)
    model=svm.SVC()
    # Fit on the training split only, so the test samples stay unseen
    model.fit(X_train,y_train)
    y_predict=model.predict(X_test)
    print "Accuracy:"
    print accuracy_score(y_test,y_predict)


if __name__=='__main__':
    regressionModels()
    classificationModels()
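
For readers on Python 3, here is a minimal sketch (not part of the original script) of the `return_X_y` and `n_class` options listed in the docstring above. It assumes a recent scikit-learn; `load_boston` is left out because it was removed in scikit-learn 1.2. The helper name `dataset_shapes` is hypothetical.

```python
from sklearn.datasets import load_iris, load_diabetes, load_digits, load_linnerud


def dataset_shapes():
    """Return {name: (X.shape, y.shape)} for several bundled datasets."""
    loaders = {
        "iris": lambda: load_iris(return_X_y=True),
        "diabetes": lambda: load_diabetes(return_X_y=True),
        "linnerud": lambda: load_linnerud(return_X_y=True),
        # n_class=3 keeps only the samples for the digits 0, 1 and 2
        "digits_0_to_2": lambda: load_digits(n_class=3, return_X_y=True),
    }
    shapes = {}
    for name, load in loaders.items():
        X, y = load()  # return_X_y=True yields a (data, target) tuple
        shapes[name] = (X.shape, y.shape)
    return shapes


if __name__ == "__main__":
    for name, (x_shape, y_shape) in dataset_shapes().items():
        print(name, x_shape, y_shape)
```

With `return_X_y=True` the loaders skip the `Bunch` object and hand back the arrays directly, which saves the separate `.data` and `.target` lookups used in the script above.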

The code above was tested under Python 2.7. The output is shown below:

#Regression output
(506L, 13L)
(506L,)
--------------------------------data shape-----------------------------------
354 354
152 152
Coefficient matrix:
[-0.13347010285294442, 0.03580891359322994, 0.04952264522005112, 3.119835116285431, -15.417060895306475, 4.057199231645387, -0.010820835184929944, -1.3859982431608757, 0.24272733982224273, -0.008702234365661983, -0.9106852081102892, 0.011794115892572796, -0.547113312823961]
Intercept
31.63108403569312
-----------------------------------------------------------------
(442L, 10L)
(442L,)
--------------------------------data shape-----------------------------------
309 309
133 133
Coefficient matrix:
[29.250345824146294, -261.70768052669956, 546.2973726341081, 388.4007725749296, -901.9533870552892, 506.76114900102954, 121.14845947917183, 288.0293249509, 659.2713384575223, 41.375369011084985]
Intercept
151.00818273080338
-----------------------------------------------------------------
(20L, 3L)
(20L, 3L)
--------------------------------data shape-----------------------------------
14 14
6 6
Coefficient matrix:
[[0.30287324212545036, -0.37960681782019234, 0.16074123821975367], [-0.09726556450063326, -0.047093687917992795, 0.027083148889626204], [-0.5869461483881073, 0.053272106843233316, -0.004283316055441206]]
Intercept
[213.69337737  41.05782266  54.38303224]

#Classification output
(150L, 4L)
(150L,)
--------------------------------data shape-----------------------------------
105 105
45 45
Accuracy:
1.0
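
The regression functions above fit each model but never score it on `X_test`. As a hedged sketch of how the held-out split could be put to use, the Python 3 snippet below fits on the training portion only and reports R² and mean squared error for the diabetes data, plus test accuracy for iris. The function names `evaluate_regression` and `evaluate_classifier` are hypothetical, and the same 30% split with `random_state=42` is assumed.

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def evaluate_regression(X, y, ratio=0.30, seed=42):
    """Fit LinearRegression on the training split; return (R^2, MSE) on the test split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=ratio, random_state=seed)
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return r2_score(y_test, y_pred), mean_squared_error(y_test, y_pred)


def evaluate_classifier(X, y, ratio=0.30, seed=42):
    """Fit an SVC on the training split; return accuracy on the test split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=ratio, random_state=seed)
    model = SVC().fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


if __name__ == "__main__":
    r2, mse = evaluate_regression(*load_diabetes(return_X_y=True))
    print("diabetes R^2: %.3f, MSE: %.1f" % (r2, mse))
    acc = evaluate_classifier(*load_iris(return_X_y=True))
    print("iris test accuracy: %.3f" % acc)
```

Scoring only on samples the model has not seen gives a more honest picture than the coefficients alone, and it keeps the classification accuracy from being inflated by training on the test rows.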

Questions and discussion are welcome!