Python數據分析--Kaggle共享單車項目實戰

阿新 • • 發佈：2018-12-25

sso 交流沒有統計 wrap info mon oat sele

前言

上面一節我們介紹了一元線性回歸和多元線性回歸的原理, 又通過一個案例對多元線性回歸模型進一步了解, 其中談到自變量之間存在高度相關, 容易產生多重共線性問題, 對於多重共線性問題的解決方法有: 刪除自變量, 改變數據形式, 添加正則化項, 逐步回歸, 主成分分析等. 今天我們來看看其中的添加正則化項.

正文

添加正則化項, 是指在損失函數上添加正則化項, 而正則化項可分為兩種: 一種是L1正則化項, 另一種是L2正則化. 我們把帶有L2正則化項的回歸模型稱為嶺回歸, 帶有L1正則化項的回歸稱為Lasso回歸.

1. 嶺回歸

引用百度百科定義.

嶺回歸(英文名：ridge regression, Tikhonov regularization)是一種專用於共線性數據分析的有偏估計回歸方法，實質上是一種改良的最小二乘估計法，通過放棄最小二乘法的無偏性，以損失部分信息、降低精度為代價獲得回歸系數更為符合實際、更可靠的回歸方法，對病態數據的擬合要強於最小二乘法。

通過定義可以看出, 嶺回歸是改良後的最小二乘法, 是有偏估計的回歸方法, 即給損失函數加上一個正則化項, 也叫懲罰項(L2範數), 那麽嶺回歸的損失函數表示為

技術分享圖片

其中, m是樣本量, n是特征數, 技術分享圖片是懲罰項參數(其取值大於0), 加懲罰項主要為了讓模型參數的取值不能過大. 當

趨於無窮大時, 對應技術分享圖片

趨向於0, 而

表示的是因變量隨著某一自變量改變一個單位而變化的數值(假設其他自變量均保持不變), 這時, 自變量之間的共線性對因變量的影響幾乎不存在, 故其能有效解決自變量之間的多重共線性問題, 同時也能防止過擬合.

2. Lasso回歸

嶺回歸的正則化項是對技術分享圖片求平方和, 既然能求平方也就能取絕對值, 而Lasso回歸的L1範數正是對取絕對值, 故其損失函數可以表示為

技術分享圖片

當只有兩個自變量時, L1範數在二維上對應的圖形是矩形(頂點均在坐標軸上, 即其中一個回歸系數為0), 對於這樣的矩形來說其頂點更容易與同心橢圓(等值線)相交, 而相交的點則為最小損失函數的最優解. 也就是說Lasso會出現回歸系數為0的情況. 對於L2範數來說則是圓形,其不會相交於坐標軸上的點, 自然也就不會出現回歸系數為0的情況. 當然多個自變量也是同樣的道理

3. 嶺回歸和Lasso回歸對比

相同點:

1. 嶺回歸和Lasso回歸均是加了正則化項的線性回歸模型, 本質上它們都是線性回歸模型.

2. 兩者均能在一定程度上解決多重共線性問題, 並且可以有效避免過擬合.

3. 回歸系數均受正則化參數的影響, 均可以用圖形表示回歸系數和正則化參數的關系, 並可以通過該圖形進行變量以及正則化參數的篩選.

不同點:

1. 嶺回歸的回歸系數均不為0, Lasso回歸部分回歸系數為0.

4. 實際案例應用

1. 數據來源及數據背景

數據來源: https://www.kaggle.com/c/bike-sharing-demand/data, 數據有訓練集和測試集, 在訓練集中包含10886個樣本以及12個字段, 通過訓練集上自行車租賃數據對美國華盛頓自行車租賃需求進行預測.

2. 數據概覽

1. 讀取數據

import pandas as pd
df = pd.read_csv(r‘D:\Data\bike.csv‘)
pd.set_option(‘display.max_rows‘,4 )
df

技術分享圖片

通過以上可以得知數據維度10886行X12列, 除了第一列其它均顯示為數值, 具體的格式還要進一步查看, 對於各列的解釋也放入下一環節.

2. 查看數據整體信息

df.info()

<class ‘pandas.core.frame.DataFrame‘>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
datetime      10886 non-null object     #時間和日期
season        10886 non-null int64      #季節,  1 =春季，2 =夏季，3 =秋季，4 =冬季  
holiday       10886 non-null int64      #是否是假期, 1=是, 0=否
workingday    10886 non-null int64      #是否是工作日, 1=是, 0=否
weather       10886 non-null int64      #天氣,1:晴朗，很少有雲，部分多雲，部分多雲; 2:霧+多雲，霧+碎雲，霧+少雲，霧; 3:小雪，小雨+雷雨+散雲，小雨+散雲; 4:大雨+冰塊+雷暴+霧，雪+霧
temp          10886 non-null float64    #溫度
atemp         10886 non-null float64    #體感溫度
humidity      10886 non-null int64      #相對濕度
windspeed     10886 non-null float64    #風速
casual        10886 non-null int64      #未註冊用戶租賃數量
registered    10886 non-null int64      #註冊用戶租賃數量
count         10886 non-null int64      #所有用戶租賃總數
dtypes: float64(3), int64(8), object(1) 
memory usage: 1020.6+ KB

除了datetime為字符串型, 其他均為數值型, 且無缺失值.

3. 描述性統計

df.describe()

溫度, 體表溫度, 相對濕度, 風速均近似對稱分布, 而非註冊用戶, 註冊用戶,以及總數均右邊分布.

技術分享圖片

4. 偏態, 峰態

for i in range(5, 12):
    name = df.columns[i]
    print(‘{0}偏態系數為 {1}, 峰態系數為 {2}‘.format(name, df[name].skew(), df[name].kurt()))

temp偏態系數為 0.003690844422472008, 峰態系數為 -0.9145302637630794
atemp偏態系數為 -0.10255951346908665, 峰態系數為 -0.8500756471754651
humidity偏態系數為 -0.08633518364548581, 峰態系數為 -0.7598175375208864
windspeed偏態系數為 0.5887665265853944, 峰態系數為 0.6301328693364932
casual偏態系數為 2.4957483979812567, 峰態系數為 7.551629305632764
registered偏態系數為 1.5248045868182296, 峰態系數為 2.6260809999210672
count偏態系數為 1.2420662117180776, 峰態系數為 1.3000929518398334

temp, atemp, humidity低度偏態, windspeed中度偏態, casual, registered, count高度偏態

temp, atemp, humidity為平峰分布, windspeed,casual, registered, count為尖峰分布.

3. 數據預處理

由於沒有缺失值, 不用處理缺失值, 看看有沒有重復值.

1. 檢查重復值

print(‘未去重: ‘, df.shape)
print(‘去重: ‘, df.drop_duplicates().shape)

未去重:  (10886, 12)
去重:  (10886, 12)

沒有重復項, 看看異常值.

2. 異常值

通過箱線圖查看異常值

import seaborn as sns
import matplotlib.pyplot as plt
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 6))
#繪制箱線圖
sns.boxplot(x="windspeed", data=df,ax=axes[0][0])
sns.boxplot(x=‘casual‘, data=df, ax=axes[0][1])
sns.boxplot(x=‘registered‘, data=df, ax=axes[1][0])
sns.boxplot(x=‘count‘, data=df, ax=axes[1][1])
plt.show()

租賃數量會受小時的影響, 比如說上班高峰期等, 故在這裏先不處理異常值.

技術分享圖片

3. 數據加工

轉換"時間和日期"的格式, 並提取出小時, 日, 月, 年.

#轉換格式, 並提取出小時, 星期幾, 月份
df[‘datetime‘] = pd.to_datetime(df[‘datetime‘])
df[‘hour‘] = df.datetime.dt.hour
df[‘week‘] = df.datetime.dt.dayofweek
df[‘month‘] = df.datetime.dt.month
df[‘year_month‘] = df.datetime.dt.strftime(‘%Y-%m‘)
df[‘date‘] = df.datetime.dt.date
#刪除datetime
df.drop(‘datetime‘, axis = 1, inplace = True)
df

技術分享圖片

4. 特征分析

1) 日期和總租賃數量

import matplotlib
#設置中文字體
font = {‘family‘: ‘SimHei‘}
matplotlib.rc(‘font‘, **font)
#分別計算日期和月份中位數
group_date = df.groupby(‘date‘)[‘count‘].median()
group_month = df.groupby(‘year_month‘)[‘count‘].median()
group_month.index = pd.to_datetime(group_month.index)
plt.figure(figsize=(16,5))
plt.plot(group_date.index, group_date.values, ‘-‘, color = ‘b‘, label = ‘每天租賃數量中位數‘, alpha=0.8)
plt.plot(group_month.index, group_month.values, ‘-o‘, color=‘orange‘, label = ‘每月租賃數量中位數‘)
plt.legend()
plt.show()

2012年相比2011年租賃數量有所增長, 且波動幅度相類似.

技術分享圖片

2) 月份和總租賃數量

import seaborn as sns
plt.figure(figsize=(10, 4))
sns.boxplot(x=‘month‘, y=‘count‘, data=df)
plt.show()

與上圖的波動幅度基本一致, 另外每個月均有不同程度的離群值.

技術分享圖片

3) 季節和總租賃數量

plt.figure(figsize=(8, 4))
sns.boxplot(x=‘season‘, y=‘count‘, data=df)
plt.show()

就中位數來說, 秋季是最多的, 春季最少且離群值較多.

技術分享圖片

4) 星期幾和租賃數量

fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(12, 8))
sns.boxplot(x="week",y=‘casual‘ ,data=df,ax=axes[0])
sns.boxplot(x=‘week‘,y=‘registered‘, data=df, ax=axes[1])
sns.boxplot(x=‘week‘,y=‘count‘, data=df, ax=axes[2])
plt.show()

就中位數來說, 未註冊用戶周六和周日較多, 而註冊用戶則周內較多, 對應的總數也是周內較多, 且周內在總數的離群值較多(0代表周一, 6代表周日)

技術分享圖片

5) 節假日, 工作日和總租賃數量

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(9, 7))
sns.boxplot(x=‘holiday‘, y=‘casual‘, data=df, ax=axes[0][0])
sns.boxplot(x=‘holiday‘, y=‘registered‘, data=df, ax=axes[1][0])
sns.boxplot(x=‘holiday‘, y=‘count‘, data=df, ax=axes[2][0])
sns.boxplot(x=‘workingday‘, y=‘casual‘, data=df, ax=axes[0][1])
sns.boxplot(x=‘workingday‘, y=‘registered‘, data=df, ax=axes[1][1])
sns.boxplot(x=‘workingday‘, y=‘count‘, data=df, ax=axes[2][1])
plt.show()

未註冊用戶: 在節假日較多, 在工作日較少

註冊用戶: 在節假日較少, 在工作日較多

總的來說, 節假日租賃較少, 工作日租賃較多, 初步猜測多數未註冊用戶租賃自行車是用來非工作日出遊, 而多數註冊用戶則是工作日用來上班或者上學.

技術分享圖片

6) 小時和總租賃數量的關系

#繪制第一個子圖
plt.figure(1, figsize=(14, 8))
plt.subplot(221)
hour_casual = df[df.holiday==1].groupby(‘hour‘)[‘casual‘].median()
hour_registered = df[df.holiday==1].groupby(‘hour‘)[‘registered‘].median()
hour_count = df[df.holiday==1].groupby(‘hour‘)[‘count‘].median()
plt.plot(hour_casual.index, hour_casual.values, ‘-‘, color=‘r‘, label=‘未註冊用戶‘)
plt.plot(hour_registered.index, hour_registered.values, ‘-‘, color=‘g‘, label=‘註冊用戶‘)
plt.plot(hour_count.index, hour_count.values, ‘-o‘, color=‘c‘, label=‘所有用戶‘)
plt.legend()
plt.xticks(hour_casual.index)
plt.title(‘未註冊用戶和註冊用戶在節假日自行車租賃情況‘)
#繪制第二個子圖
plt.subplot(222)
hour_casual = df[df.workingday==1].groupby(‘hour‘)[‘casual‘].median()
hour_registered = df[df.workingday==1].groupby(‘hour‘)[‘registered‘].median()
hour_count = df[df.workingday==1].groupby(‘hour‘)[‘count‘].median()
plt.plot(hour_casual.index, hour_casual.values, ‘-‘, color=‘r‘, label=‘未註冊用戶‘)
plt.plot(hour_registered.index, hour_registered.values, ‘-‘, color=‘g‘, label=‘註冊用戶‘)
plt.plot(hour_count.index, hour_count.values, ‘-o‘, color=‘c‘, label=‘所有用戶‘)
plt.legend()
plt.title(‘未註冊用戶和註冊用戶在工作日自行車租賃情況‘)
plt.xticks(hour_casual.index)
#繪制第三個子圖
plt.subplot(212)
hour_casual = df.groupby(‘hour‘)[‘casual‘].median()
hour_registered = df.groupby(‘hour‘)[‘registered‘].median()
hour_count = df.groupby(‘hour‘)[‘count‘].median()
plt.plot(hour_casual.index, hour_casual.values, ‘-‘, color=‘r‘, label=‘未註冊用戶‘)
plt.plot(hour_registered.index, hour_registered.values, ‘-‘, color=‘g‘, label=‘註冊用戶‘)
plt.plot(hour_count.index, hour_count.values, ‘-o‘, color=‘c‘, label=‘所有用戶‘)
plt.legend()
plt.title(‘未註冊用戶和註冊用戶自行車租賃情況‘)
plt.xticks(hour_casual.index)
plt.show()

查看代碼

在節假日, 未註冊用戶和註冊用戶走勢相接近, 不過未註冊用戶最高峰在14點, 而註冊用戶則是17點

在工作日, 註冊用戶呈現出雙峰走勢, 在8點和17點均為用車高峰期, 而這正是上下班或者上下學高峰期.

對於註冊用戶來說, 17點在節假日和工作日均為高峰期, 說明部分用戶在節假日可能未必休假.

技術分享圖片

7) 天氣和總租賃數量

fig, ax = plt.subplots(3, 1, figsize=(12, 6))
sns.boxplot(x=‘weather‘, y=‘casual‘, hue=‘workingday‘,data=df, ax=ax[0])
sns.boxplot(x=‘weather‘, y=‘registered‘,hue=‘workingday‘, data=df, ax=ax[1])
sns.boxplot(x=‘weather‘, y=‘count‘,hue=‘workingday‘, data=df, ax=ax[2])

就中位數而言未註冊用戶和註冊用戶均表現為: 在工作日和非工作日租賃數量均隨著天氣的惡劣而減少, 特別地, 當天氣為大雨大雪天(4)且非工作日均沒有自行車租賃.

技術分享圖片

從圖上可以看出, 大雨大雪天只有一個數據, 我們看看原數據.

df[df.weather==4]

只有在2012年1月9日18時為大雨大雪天, 說明天氣是突然變化的, 部分用戶可能因為沒有看天氣預報而租賃自行車, 當然也有其他原因.

技術分享圖片

另外, 發現1月份是春季, 看看它的季節劃分規則.

sns.boxplot(x=‘season‘, y=‘month‘,data=df)

123為春季, 456為夏季, 789為秋季...

技術分享圖片

季節的劃分通常和緯度相關, 而這份數據是用來預測美國華盛頓的租賃數量, 且美國和我國的緯度基本一樣, 故按照345春節, 678夏季..這個規則來重新劃分.

import numpy as np
df[‘group_season‘] = np.where((df.month <=5) & (df.month >=3), 1,
                        np.where((df.month <=8) & (df.month >=6), 2,
                                 np.where((df.month <=11) & (df.month >=9), 3, 4)))
fig, ax = plt.subplots(2, 1, figsize=(12, 6))
#繪制氣溫和季節箱線圖
sns.boxplot(x=‘season‘, y=‘temp‘,data=df, ax=ax[0])
sns.boxplot(x=‘group_season‘, y=‘temp‘,data=df, ax=ax[1])

第一個圖是調整之前的, 就中位數來說, 春季氣溫最低, 秋季氣溫最高

第二個圖是調整之後的, 就中位數來說, 冬季氣溫最低, 夏季氣溫最高

技術分享圖片

顯然第二張的圖的結果較符合常理, 故刪除另外那一列.

df.drop(‘season‘, axis=1, inplace=True)
df.shape

(10886, 16)

8) 其他變量和總租賃數量的關系

這裏我直接使用利用seaborn的pairplot繪制剩余的溫度, 體感溫度, 相對濕度, 風速這四個連續變量與未註冊用戶和註冊用戶的關系在一張圖上.

sns.pairplot(df[[‘temp‘, ‘atemp‘, ‘humidity‘, ‘windspeed‘, ‘casual‘, ‘registered‘, ‘count‘]])

為了方便縱覽全局, 我將圖片尺寸縮小, 如下圖所示. 縱軸從上往下依次是溫度, 體感溫度, 相對濕度, 風速, 未註冊用戶, 註冊用戶, 所有用戶, 橫軸從左往右是同樣的順序.

技術分享圖片

從圖上可以看出, 溫度和體感溫度分別與未註冊用戶, 註冊用戶, 所有用戶均有一定程度的正相關, 而相對濕度和風速與之呈現一定程度的負相關. 另外, 其他變量之間也有不同程度的相關關系.

另外, 第四列(風速)在散點圖中間有明顯的間隙. 需要揪出這一塊來看看.

df[‘windspeed‘]

0         0.0000
1         0.0000
2         0.0000
          ...   
10883    15.0013
10884     6.0032
10885     8.9981
Name: windspeed, Length: 10886, dtype: float64

風速為0, 這明顯不合理, 把其當成缺失值來處理. 我這裏選擇的是向後填充.

df.loc[df.windspeed == 0, ‘windspeed‘] = np.nan
df.fillna(method=‘bfill‘, inplace=True)
df.windspeed.isnull().sum()

9) 相關矩陣

由於多個變量不滿足正態分布, 對其進行對數變換.

#對數轉換
df[‘windspeed‘] = np.log(df[‘windspeed‘].apply(lambda x: x+1))
df[‘casual‘] = np.log(df[‘casual‘].apply(lambda x: x+1))
df[‘registered‘] = np.log(df[‘registered‘].apply(lambda x: x+1))
df[‘count‘] = np.log(df[‘count‘].apply(lambda x: x+1))
sns.pairplot(df[[‘windspeed‘, ‘casual‘, ‘registered‘, ‘count‘]])

技術分享圖片

經過對數變換之後, 註冊用戶和所有用戶的租賃數量和正態還是相差較大, 故在計算相關系數時選擇spearman相關系數.

correlation = df.corr(method=‘spearman‘)
plt.figure(figsize=(12, 8))
#繪制熱力圖
sns.heatmap(correlation, linewidths=0.2, vmax=1, vmin=-1, linecolor=‘w‘,
            annot=True,annot_kws={‘size‘:8},square=True)

均有不同程度的相關程度, 其中, temp和atemp高度相關, count和registered高度相關, 數值均達到0.99.

技術分享圖片

5. 回歸模型

嶺回歸和Lasso回歸是加了正則化項的線性回歸, 下面將分別構造兩個模型.

5.1 嶺回歸

1. 劃分數據集

from sklearn.model_selection import train_test_split
#由於所有用戶的租賃數量是由未註冊用戶和註冊用戶相加而成, 故刪除.
df.drop([‘casual‘,‘registered‘], axis=1, inplace=True)
X = df.drop([‘count‘], axis=1)
y = df[‘count‘]
#劃分訓練集和測試集
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

2. 模型訓練

from sklearn.linear_model import Ridge
#這裏的alpha指的是正則化項參數, 初始先設置為1.
rd = Ridge(alpha=1)
rd.fit(X_train, y_train)
print(rd.coef_)
print(rd.intercept_)

[ 0.00770067 -0.00034301  0.0039196   0.00818243  0.03635549 -0.01558927
  0.09080788  0.0971406   0.02791812  0.06114358 -0.00099811]
2.6840271343740754

通過前面我們知道, 正則化項參數對結果的影響較大, 下一步我們就通過嶺跡圖來選擇正則化參數.

#設置參數以及訓練模型
alphas = 10**np.linspace(-5, 10, 500)
betas = []
for alpha in alphas:
    rd = Ridge(alpha = alpha)
    rd.fit(X_train, y_train)
    betas.append(rd.coef_)
#繪制嶺跡圖
plt.figure(figsize=(8,6))
plt.plot(alphas, betas)
#對數據進行對數轉換, 便於觀察.
plt.xscale(‘log‘)
#添加網格線
plt.grid(True)
#坐標軸適應數據量
plt.axis(‘tight‘)
plt.title(r‘正則化項參數$\alpha$和回歸系數$\beta$嶺跡圖‘)
plt.xlabel(r‘$\alpha$‘)
plt.ylabel(r‘$\beta$‘)
plt.show()

查看代碼

通過圖像可以看出, 當alpha為10⁷時所有變量嶺跡趨於穩定.按照嶺跡法應當取alpha=10⁷.

技術分享圖片

由於是通過肉眼觀察的, 其不一定是最佳, 采用另外一種方式: 交叉驗證的嶺回歸.

from sklearn.linear_model import RidgeCV
from sklearn import metrics
rd_cv = RidgeCV(alphas=alphas, cv=10, scoring=‘r2‘)
rd_cv.fit(X_train, y_train)
rd_cv.alpha_

805.0291812295973

最後選出的最佳正則化項參數為805.03, 然後用這個參數進行模型訓練

rd = Ridge(alpha=805.0291812295973) #, fit_intercept=False
rd.fit(X_train, y_train)
print(rd.coef_)
print(rd.intercept_)

[ 0.00074612 -0.00382265  0.00532093  0.01100823  0.03375475 -0.01582157
  0.0584206   0.09708992  0.02639369  0.0604242  -0.00116086]
2.7977274604845856

4. 模型預測

from sklearn import metrics
from math import sqrt
#分別預測訓練數據和測試數據
y_train_pred = rd.predict(X_train)
y_test_pred = rd.predict(X_test)
#分別計算其均方根誤差和擬合優度
y_train_rmse = sqrt(metrics.mean_squared_error(y_train, y_train_pred))
y_train_score = rd.score(X_train, y_train)
y_test_rmse = sqrt(metrics.mean_squared_error(y_test, y_test_pred))
y_test_score = rd.score(X_test, y_test)
print(‘訓練集RMSE: {0}, 評分: {1}‘.format(y_train_rmse, y_train_score))
print(‘測試集RMSE: {0}, 評分: {1}‘.format(y_test_rmse, y_test_score))

訓練集RMSE: 1.0348076524200298, 評分: 0.46691272323469246
測試集RMSE: 1.0508046977499312, 評分: 0.45801571689420706

5.2 Lasso回歸

1. 模型訓練

from sklearn.linear_model import Lasso
alphas = 10**np.linspace(-5, 10, 500)
betas = []
for alpha in alphas:
    Las = Lasso(alpha = alpha)
    Las.fit(X_train, y_train)
    betas.append(Las.coef_)
plt.figure(figsize=(8,6))
plt.plot(alphas, betas)
plt.xscale(‘log‘)
plt.grid(True)
plt.axis(‘tight‘)
plt.title(r‘正則化項參數$\alpha$和回歸系數$\beta$的Lasso圖‘)
plt.xlabel(r‘$\alpha$‘)
plt.ylabel(r‘$\beta$‘)
plt.show()

查看代碼

通過Lasso回歸曲線, 可以看出大致在10附近所有變量趨於穩定

技術分享圖片

同樣采用交叉驗證選擇Lasso回歸最優正則化項參數

from sklearn.linear_model import LassoCV
from sklearn import metrics
Las_cv = LassoCV(alphas=alphas, cv=10)
Las_cv.fit(X_train, y_train)
Las_cv.alpha_

0.005074705239490466

用這個參數重新訓練模型

Las = Lasso(alpha=0.005074705239490466) #, fit_intercept=False
Las.fit(X_train, y_train)
print(Las.coef_)
print(Las.intercept_)

[ 0.         -0.          0.          0.01001827  0.03467474 -0.01570339
  0.06202352  0.09721864  0.02632133  0.06032038 -0.        ]
2.7808303982442952

對比嶺回歸可以發現, 這裏的回歸系數中有0存在, 也就是舍棄了holiday, workingday, weather和group_season這四個自變量.

#用Lasso分別預測訓練集和測試集, 並計算均方根誤差和擬合優度
y_train_pred = Las.predict(X_train)
y_test_pred = Las.predict(X_test)
y_train_rmse = sqrt(metrics.mean_squared_error(y_train, y_train_pred))
y_train_score = Las.score(X_train, y_train)
y_test_rmse = sqrt(metrics.mean_squared_error(y_test, y_test_pred))
y_test_score = Las.score(X_test, y_test)
print(‘訓練集RMSE: {0}, 評分: {1}‘.format(y_train_rmse, y_train_score))
print(‘測試集RMSE: {0}, 評分: {1}‘.format(y_test_rmse, y_test_score))

訓練集RMSE: 1.0347988070045209, 評分: 0.4669218367318746
測試集RMSE: 1.050818996520012, 評分: 0.45800096674816204

最後, 再用傳統的線性回歸進行預測, 從而對比三者之間的差異.

from sklearn.linear_model import LinearRegression
#訓練線性回歸模型
LR = LinearRegression()
LR.fit(X_train, y_train)
print(LR.coef_)
print(LR.intercept_)
#分別預測訓練集和測試集, 並計算均方根誤差和擬合優度
y_train_pred = LR.predict(X_train)
y_test_pred = LR.predict(X_test)
y_train_rmse = sqrt(metrics.mean_squared_error(y_train, y_train_pred))
y_train_score = LR.score(X_train, y_train)
y_test_rmse = sqrt(metrics.mean_squared_error(y_test, y_test_pred))
y_test_score = LR.score(X_test, y_test)
print(‘訓練集RMSE: {0}, 評分: {1}‘.format(y_train_rmse, y_train_score))
print(‘測試集RMSE: {0}, 評分: {1}‘.format(y_test_rmse, y_test_score))

[ 0.00775915 -0.00032048  0.00391537  0.00817703  0.03636054 -0.01558878
  0.09087069  0.09714058  0.02792397  0.06114454 -0.00099731]
2.6837869701964014
訓練集RMSE: 1.0347173340121176, 評分: 0.46700577529675036
測試集RMSE: 1.0510323073614725, 評分: 0.45778089839236114

總結

就測試集和訓練集均方根誤差之差來說, 線性回歸最大, 嶺回歸最小, 另外回歸在測試集的擬合優度最大, 總體來說, 嶺回歸在此數據集上表現略優.

技術分享圖片

就這個評分來說, 以上模型還不是很好, 還需要學習其他模型, 比如決策樹, 隨機森林, 神經網絡等.

聲明: 本文僅用作學習交流

Python數據分析--Kaggle共享單車項目實戰

sso 交流沒有統計 wrap info mon oat sele 前言上面一節我們介紹了一元線性回歸和多元線性回歸的原理, 又通過一個案例對多元線性回歸模型進一步了解, 其中談到自變量之間存在高度相關, 容易產生多重共線性問題, 對於多重共線性問題的解決方法有: 刪

Python數據分析--Kaggle共享單車項目實戰

前言

目錄

正文

1. 嶺回歸

2. Lasso回歸

3. 嶺回歸和Lasso回歸對比

4. 實際案例應用

1. 數據來源及數據背景

2. 數據概覽

3. 數據預處理

4. 特征分析

5. 回歸模型

5.1 嶺回歸

5.2 Lasso回歸

總結

Python數據分析--Kaggle共享單車項目實戰

Python數據分析+Kaggle案例培訓課程含課件代碼

Python數據分析基礎與實踐超具實戰意義的Python數據分析實踐課程 Python視頻教程

集成druid數據源【JWordPress前臺項目實戰】

python數據分析--KaggleTitanic項目實戰

Python資料分析--Kaggle共享單車專案實戰

Python數據分析－Kobe Bryan生涯數據讀取及分析

Python數據分析必備Anaconda安裝、快捷鍵、包安裝

python數據分析入門學習筆記兒

python數據分析筆記中panda(2)

利用python數據分析panda學習筆記之基本功能

搭建python數據分析平臺

Python數據分析(一): ipython 技巧！

Python數據分析(二): Numpy技巧 (3/4)

Python數據分析(二): Numpy技巧 (4/4)

Python數據分析工具

python 數據分析

[讀書筆記] Python數據分析（一）準備工作

利用Python數據分析-Numpy和Pands篇

CP1621-唐宇迪-python數據分析與機器實戰

Python數據分析--Kaggle共享單車項目實戰

前言

目錄

正文

1. 嶺回歸

2. Lasso回歸

3. 嶺回歸和Lasso回歸對比

4. 實際案例應用

1. 數據來源及數據背景

2. 數據概覽

3. 數據預處理

4. 特征分析

5. 回歸模型

5.1 嶺回歸

5.2 Lasso回歸

總結

相關推薦