Python for Finance, Part 5: Multiple Linear Regression and Residual Analysis
Author: chen_h
WeChat ID & QQ: 862251340
WeChat public account: coderpai
Part 6: Modern Portfolio Theory
Part 7: Market Risk
Part 8: The Fama-French Multi-Factor Model
Introduction
In the previous chapter we introduced simple linear regression, which has only one independent variable. In this chapter we will study linear regression with multiple independent variables.
The simple linear regression model is written in the following form:

Y = β₀ + β₁X + ε

A multiple linear regression model with p variables is given by:

Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + β_pX_p + ε
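The coefficients β are estimated by ordinary least squares, i.e. by minimizing the sum of squared residuals ‖y − Xβ‖². As a minimal sketch of what the fitting step does under the hood (synthetic data and NumPy's `lstsq` here, rather than statsmodels):

```python
import numpy as np

# Synthetic data: 200 observations, 3 explanatory variables
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_beta = np.array([0.5, -1.0, 2.0])
y = 0.1 + X @ true_beta + rng.normal(scale=0.01, size=200)

# Add an intercept column and solve the least-squares problem
X1 = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

print(beta_hat)  # ≈ [0.1, 0.5, -1.0, 2.0]
```

With little noise in the data, the estimated coefficients recover the true intercept and slopes almost exactly.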
Python Implementation
In the previous chapter we used the S&P 500 index to predict Amazon's stock returns. Now we will add more variables to improve the model's predictions. In particular, we will consider Amazon's competitors.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm
from pandas_datareader import data as pdr
import fix_yahoo_finance as yf
yf.pdr_override()  # patch pandas_datareader to use the fixed Yahoo API

# Get stock prices
spy_table = pdr.get_data_yahoo("SPY")
amzn_table = pdr.get_data_yahoo("AMZN")
ebay_table = pdr.get_data_yahoo("EBAY")
wal_table = pdr.get_data_yahoo("WMT")
aapl_table = pdr.get_data_yahoo("AAPL")
Then we take the closing prices for 2016:
spy = spy_table .loc['2016',['Close']]
amzn = amzn_table.loc['2016',['Close']]
ebay = ebay_table.loc['2016',['Close']]
wal = wal_table .loc['2016',['Close']]
aapl = aapl_table.loc['2016',['Close']]
After computing the daily log returns of each stock, we concatenate them into one DataFrame and print the last five rows:
spy_log = np.log(spy.Close) .diff().dropna()
amzn_log = np.log(amzn.Close).diff().dropna()
ebay_log = np.log(ebay.Close).diff().dropna()
wal_log = np.log(wal.Close) .diff().dropna()
aapl_log = np.log(aapl.Close).diff().dropna()
df = pd.concat([spy_log,amzn_log,ebay_log,wal_log,aapl_log],axis = 1).dropna()
df.columns = ['SPY', 'AMZN', 'EBAY', 'WAL', 'AAPL']
df.tail()
| Date | SPY | AMZN | EBAY | WAL | AAPL |
|---|---|---|---|---|---|
| 2016-12-23 | 0.001463 | -0.007531 | 0.008427 | -0.000719 | 0.001976 |
| 2016-12-27 | 0.002478 | 0.014113 | 0.014993 | 0.002298 | 0.006331 |
| 2016-12-28 | -0.008299 | 0.000946 | -0.007635 | -0.005611 | -0.004273 |
| 2016-12-29 | -0.000223 | -0.009081 | -0.001000 | -0.000722 | -0.000257 |
| 2016-12-30 | -0.003662 | -0.020172 | -0.009720 | -0.002023 | -0.007826 |
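Note that `np.log(price).diff()` gives log returns because log(P_t) − log(P_{t−1}) = log(P_t / P_{t−1}). A quick sanity check on made-up prices shows the two formulations agree:

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 101.0, 99.5, 102.0])
log_ret_diff = np.log(prices).diff().dropna()           # diff of log prices
log_ret_ratio = np.log(prices / prices.shift(1)).dropna()  # log of price ratios

# The two formulations are identical
print(np.allclose(log_ret_diff, log_ret_ratio))  # True
```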
As before, we use the statsmodels package to run a simple linear regression (note that the column names in df are upper-case):
import statsmodels.formula.api as sm
simple = sm.ols(formula = 'AMZN ~ SPY', data = df).fit()
print(simple.summary())
OLS Regression Results
==============================================================================
Dep. Variable: amzn R-squared: 0.230
Model: OLS Adj. R-squared: 0.227
Method: Least Squares F-statistic: 74.46
Date: Tue, 09 Oct 2018 Prob (F-statistic): 7.44e-16
Time: 11:55:12 Log-Likelihood: 680.94
No. Observations: 251 AIC: -1358.
Df Residuals: 249 BIC: -1351.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.0002 0.001 0.196 0.845 -0.002 0.002
spy 1.0661 0.124 8.629 0.000 0.823 1.309
==============================================================================
Omnibus: 67.332 Durbin-Watson: 2.018
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2026.389
Skew: -0.074 Prob(JB): 0.00
Kurtosis: 16.919 Cond. No. 121.
==============================================================================
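The "Adj. R-squared" figure in the summary can be reproduced by hand from R² via the formula adj R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of regressors. Plugging in the values from the table above (R² = 0.230, n = 251, p = 1):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p regressors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Values from the simple regression summary above
print(round(adjusted_r2(0.230, 251, 1), 3))  # 0.227
```

This matches the Adj. R-squared of 0.227 reported by statsmodels.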
Similarly, we can build a multiple linear regression model:
import statsmodels.formula.api as sm
model = sm.ols(formula = 'AMZN ~ SPY + EBAY + WAL + AAPL', data = df).fit()
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: amzn R-squared: 0.250
Model: OLS Adj. R-squared: 0.238
Method: Least Squares F-statistic: 20.52
Date: Tue, 09 Oct 2018 Prob (F-statistic): 1.32e-14
Time: 13:23:15 Log-Likelihood: 684.25
No. Observations: 251 AIC: -1358.
Df Residuals: 246 BIC: -1341.
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.0002 0.001 0.229 0.819 -0.002 0.002
spy 1.0254 0.170 6.038 0.000 0.691 1.360
ebay -0.0774 0.058 -1.325 0.186 -0.193 0.038
wal -0.0838 0.089 -0.943 0.346 -0.259 0.091
aapl 0.1576 0.084 1.883 0.061 -0.007 0.322
==============================================================================
Omnibus: 69.077 Durbin-Watson: 1.983
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1890.930
Skew: -0.272 Prob(JB): 0.00
Kurtosis: 16.435 Cond. No. 179.
==============================================================================
From the table above we can see that the p-values of EBAY, Walmart, and AAPL are 0.186, 0.346, and 0.061 respectively, so none of them is significant at the 95% confidence level. The multiple regression model has a higher R² than the simple one: 0.250 vs 0.230. In fact, R² never decreases as the number of variables increases. Why? If we add an extra variable to our regression model and it cannot explain any of the variation in the response (AMZN), its estimated coefficient will simply be zero. It is then as if the variable had never been included in the model, so R² does not change. However, adding hundreds of variables is not always better, a problem we will discuss in later chapters.
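The claim that R² cannot decrease when a regressor is added can be demonstrated directly. A sketch on synthetic data, fitting by NumPy least squares rather than statsmodels: we regress y on one relevant variable, then add a second variable that is pure noise and compare the two R² values.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 250
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

def r_squared(X, y):
    """R² of an OLS fit of y on the columns of X (intercept included)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

noise = rng.normal(size=n)  # a regressor unrelated to y
r2_one = r_squared(x.reshape(-1, 1), y)
r2_two = r_squared(np.column_stack([x, noise]), y)

print(r2_two >= r2_one)  # True: adding a regressor cannot lower R²
```

The larger model always fits at least as well in-sample, which is exactly why a higher R² alone does not justify adding variables.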
Can we improve the model further? Here we try the Fama-French five-factor model, an important model in asset pricing theory which we will introduce in a later tutorial. Data download link:
from datetime import datetime

path = './F-F_Research_Data_5_Factors_2x3_daily.CSV'
fama_table = pd.read_csv(path)
# Convert the time column into the index
fama_table.index = [datetime.strptime(str(x), "%Y%m%d")
                    for x in fama_table.iloc[:, 0]]
# Remove the original time column
fama_table = fama_table.iloc[:, 1:]
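The same date conversion can be done in one vectorized call with `pd.to_datetime`, which is equivalent to the `strptime` loop above (illustrated here on a few hypothetical YYYYMMDD integers like those in the Fama-French file):

```python
import pandas as pd

raw = pd.Series([20160104, 20160105, 20160106])  # dates stored as YYYYMMDD integers
dates = pd.to_datetime(raw.astype(str), format="%Y%m%d")
print(dates.iloc[0])  # 2016-01-04 00:00:00
```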
With this data we can build a Fama-French factor model:
fama = fama_table.loc['2016']
fama = fama.rename(columns = {'Mkt-RF':'MKT'})
fama = fama.apply(lambda x: x/100)  # the factors are quoted in percent
fama_df = pd.concat([fama, amzn_log], axis = 1)
fama_model = sm.ols(formula = 'Close ~ MKT + SMB + HML + RMW + CMA', data = fama_df).fit()
print(fama_model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Close R-squared: 0.387
Model: OLS Adj. R-squared: 0.375
Method: Least Squares F-statistic: 30.96
Date: Tue, 09 Oct 2018 Prob (F-statistic): 2.24e-24
Time: 13:46:31 Log-Likelihood: 709.57
No. Observations: 251 AIC: -1407.
Df Residuals: 245 BIC: -1386.
Df Model: 5
Covariance Type: nonrobust
======================================================