1. 程式人生 > >用線性迴歸模型預測房價

用線性迴歸模型預測房價

本文使用sklearn 中自帶的波士頓房價資料集來訓練模型,然後利用模型來預測房價。這份收據中共收集了13個特徵。

1.輸入特徵

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_boston

boston = load_boston()
X = boston.data
y = boston.target
X.shape

輸出為:“(506,13)”共有506個樣本 13個特徵。


print(X[0])
輸出結果為:
array([  6.32000000e-03,   1.80000000e+01,   2.31000000e+00,
         0.00000000e+00,   5.38000000e-01,   6.57500000e+00,
         6.52000000e+01,   4.09000000e+00,   1.00000000e+00,
         2.96000000e+02,   1.53000000e+01,   3.96900000e+02,
         4.98000000e+00])

可以通過boston.feature_names 來檢視這些特徵的標籤

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], 
      dtype='|S7')

2.模型訓練

LinearRegression 類實現了線性迴歸演算法。在訓練之前先把資料集分為兩份。

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)

訓練模型並測試模型準確性

import time
from sklearn.linear_model import LinearRegression

model = LinearRegression()

start = time.clock()
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
cv_score = model.score(X_test, y_test)
print('elaspe: {0:.6f}; train_score: {1:0.6f}; cv_score: {2:.6f}'.format(time.clock()-start, train_score, cv_score))

統計了模型的訓練時間,統計模型對訓練樣本的準確性得分(即對訓練樣本的擬合好壞程度)train_score,還統計了模型對測試樣本的得分 cv_score
執行結果如下:

elaspe: 0.002447; train_score: 0.723941; cv_score: 0.794958

從結果可以看出模型擬合效果一般。

3.模型優化

模型優化的方式
1.觀察特徵的變化範圍從10310^{-3}級別到10210^2,先將資料進行歸一化的處理,可以加快演算法的收斂速度。
2.增加多項式特徵

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

def polynomial_model(degree=1):
    polynomial_features = PolynomialFeatures(degree=degree,
                                             include_bias=False)
    linear_regression = LinearRegression(normalize=True)
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    return pipeline

接著我們使用二階多項式來你和資料:

model = polynomial_model(degree=2)

start = time.clock()
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
cv_score = model.score(X_test, y_test)
print('elaspe: {0:.6f}; train_score: {1:0.6f}; cv_score: {2:.6f}'.format(time.clock()-start, train_score, cv_score))

輸出結果是:

elaspe: 0.016412; train_score: 0.930547; cv_score: 0.860465

訓練樣本分數和測試樣本分數都提高了,模型得到了優化。
把多項式特徵改為三階檢視效果。
執行結果為 :

elaspe: 0.133220; train_score: 1.000000; cv_score: -105.517016

針對訓練樣本的分數達到了1 ,而對測試樣本的分數卻是負數,產生了過擬合。

4.學習曲線

通過畫出學習曲線,來對模型的狀態及優化的方向直觀的觀察。

from common.utils import plot_learning_curve
from sklearn.model_selection import ShuffleSplit

cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
plt.figure(figsize=(18, 4), dpi=200)
title = 'Learning Curves (degree={0})'
degrees = [1, 2, 3]

start = time.clock()
plt.figure(figsize=(18, 4), dpi=200)
for i in range(len(degrees)):
    plt.subplot(1, 3, i + 1)
    plot_learning_curve(plt, polynomial_model(degrees[i]), title.format(degrees[i]), X, y, ylim=(0.01, 1.01), cv=cv)

print('elaspe: {0:.6f}'.format(time.clock()-start))

學習曲線