
Implementing a Linear Regression Algorithm with scikit-learn

Linear regression is used mainly for regression problems, and it is the foundation of many powerful nonlinear models. Whether it is simple linear regression or multiple linear regression (where each sample's several features form a vector), the idea is the same. Suppose we have found the best-fit equation y = ax + b; then for every sample x_i, the line predicts ŷ_i = a·x_i + b while the true value is y_i, and we want the gap between y_i and ŷ_i to be as small as possible.
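To make "as small as possible" concrete: simple linear regression minimizes the sum of squared errors Σ(y_i − ŷ_i)², which has a closed-form solution. Here is a minimal NumPy sketch with toy data (the values are made up purely for illustration):

import numpy as np

x = np.array([1., 2., 3., 4., 5.])
y = np.array([1., 3., 2., 3., 5.])

x_mean, y_mean = x.mean(), y.mean()
# closed-form least squares: a = Σ(x_i - x̄)(y_i - ȳ) / Σ(x_i - x̄)², b = ȳ - a·x̄
a = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - a * x_mean
y_hat = a * x + b  # predictions of the fitted line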

Next, let's implement the linear regression algorithm with scikit-learn. As usual, we start by importing the common libraries:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

Here we use the Boston housing dataset and remove the extreme value 50. In practice, a capped value like this often means the true value could not be measured, perhaps because of environmental factors or instrument limits, so we drop every sample whose target equals 50:

boston = datasets.load_boston()

X = boston.data
y = boston.target

# drop the samples whose target hit the 50.0 cap
X = X[y < 50.0]
y = y[y < 50.0]
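A caveat for newer environments: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the loader above will fail on a recent install. In that case, the workaround suggested in scikit-learn's own deprecation notice is to read the data from the original CMU source (requires pandas and network access), after which the same filter applies:

import pandas as pd

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# each record spans two physical rows in this file; stitch them back together
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
y = raw_df.values[1::2, 2]

X = X[y < 50.0]
y = y[y < 50.0]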

Next comes the train/test split; then we instantiate a LinearRegression estimator and fit it on the training set:

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

Before evaluating the regression, we can take a look at the coefficients and intercept of the fitted equation ŷ:

lin_reg.coef_
lin_reg.intercept_
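The sign and relative size of each coefficient hint at how a feature relates to the predicted price. As a quick illustrative check (this relies on load_boston exposing feature_names):

# feature names ordered from the most negative to the most positive coefficient
print(boston.feature_names[np.argsort(lin_reg.coef_)])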

Finally, let's check the linear regression's score and its predictions on the test set:

lin_reg.score(X_test, y_test)
lin_reg.predict(X_test)
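For regressors, score returns the R² coefficient of determination. A quick sanity check that it matches computing R² on the predictions directly:

from sklearn.metrics import r2_score

y_predict = lin_reg.predict(X_test)
print(r2_score(y_test, y_predict))  # same value as lin_reg.score(X_test, y_test)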

With that, a multiple linear regression is complete. On my machine the score is 0.80089168995191, and your result should land in the same ballpark. In the previous post we used the kNN algorithm to solve a classification problem; kNN can also handle regression, so let's see how it performs on the same dataset.
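For regression, kNN predicts by averaging the target values of the k nearest training samples. A hand-rolled sketch of the idea for a single query point (uniform weights, Euclidean distance):

import numpy as np

def knn_regress_one(x, X_train, y_train, k=5):
    # Euclidean distance from x to every training sample
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    nearest = np.argsort(dists)[:k]  # indices of the k closest samples
    return y_train[nearest].mean()   # uniform-weight kNN regression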

First, import the relevant classes (continuing from the code above, so libraries that were already imported are not repeated):

from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

Train the model and check the result:

knn_reg = KNeighborsRegressor()
knn_reg.fit(X_train, y_train)
knn_reg.score(X_test, y_test)

On my machine the score is 0.60, quite a bit worse than linear regression. But we may not be using the best hyperparameters here, so next we run a grid search:

param_grid = [
    {
        "weights": ["uniform"],
        "n_neighbors": [i for i in range(1, 11)]
    },
    {
        "weights": ["distance"],
        "n_neighbors": [i for i in range(1, 11)],
        "p": [i for i in range(1, 6)]
    }
]

knn_reg = KNeighborsRegressor()
grid_search = GridSearchCV(knn_reg, param_grid, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

Check the best hyperparameters:

grid_search.best_params_
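The grid search also exposes best_score_, the mean cross-validation score over the training folds. It is computed differently from the test-set score we check next, so the two numbers are not directly comparable:

print(grid_search.best_params_)
print(grid_search.best_score_)  # CV score on the training set, not the test-set R²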

Check the score:

grid_search.best_estimator_.score(X_test, y_test)

On my machine the score is 0.73. That is still somewhat worse than linear regression, but at least it is in the same ballpark. The full code is below.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# load the Boston housing data and drop the capped target values
boston = datasets.load_boston()

X = boston.data
y = boston.target

X = X[y < 50.0]
y = y[y < 50.0]

# train/test split, then fit a linear regression on the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

lin_reg.coef_
lin_reg.intercept_

lin_reg.score(X_test, y_test)

from sklearn.neighbors import KNeighborsRegressor

# kNN regression with default hyperparameters
knn_reg = KNeighborsRegressor()
knn_reg.fit(X_train, y_train)
knn_reg.score(X_test, y_test)

from sklearn.model_selection import GridSearchCV

param_grid = [
    {
        "weights": ["uniform"],
        "n_neighbors": [i for i in range(1, 11)]
    },
    {
        "weights": ["distance"],
        "n_neighbors": [i for i in range(1, 11)],
        "p": [i for i in range(1, 6)]
    }
]

# grid-search the kNN hyperparameters
knn_reg = KNeighborsRegressor()
grid_search = GridSearchCV(knn_reg, param_grid, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

grid_search.best_params_

grid_search.best_estimator_.score(X_test, y_test)
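One more experiment worth trying, which is not part of the original walkthrough: kNN is distance-based, so standardizing the features often changes, and frequently improves, its score. A sketch using a Pipeline with StandardScaler, reusing the hyperparameters found by the grid search:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

scaled_knn = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsRegressor(**grid_search.best_params_)),
])
scaled_knn.fit(X_train, y_train)
scaled_knn.score(X_test, y_test)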