
Python for Finance, Part 5: Multiple Linear Regression and Residual Analysis

Author: chen_h
WeChat ID & QQ: 862251340
WeChat official account: coderpai


Part 1: Calculating Stock Returns, Mean, and Variance

Part 2: Simple Linear Regression

Part 3: Random Variables and Distributions

Part 4: Confidence Intervals and Hypothesis Testing

Part 5: Multiple Linear Regression and Residual Analysis

Part 6: Modern Portfolio Theory

Part 7: Market Risk

Part 8: The Fama-French Multi-Factor Model


Introduction

In an earlier chapter, we introduced simple linear regression, which has only one independent variable. In this chapter, we will study linear regression with multiple independent variables.

The simple linear regression model is written in the following form:

Y = \alpha + \beta X + \epsilon

A multiple linear regression model with p independent variables is given by:

Y = \alpha + \beta_{1}X_{1} + \beta_{2}X_{2} + \beta_{3}X_{3} + \cdots + \beta_{p}X_{p} + \epsilon
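For reference (this is standard OLS algebra, not spelled out in the original): stacking the n observations into a response vector Y and a design matrix X whose first column is all ones, the least-squares estimates of the intercept and slopes minimize the sum of squared residuals and are given by

\hat{\beta} = (X^{T}X)^{-1}X^{T}Y

Up to numerical details, this is what statsmodels computes in the examples below.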

Python Implementation

In the previous chapter, we used the S&P 500 index to predict Amazon's stock returns. Now we will add more variables to improve the model's predictions. In particular, we will consider Amazon's competitors.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm
from pandas_datareader import data as pdr
import fix_yahoo_finance as yf

# Patch pandas_datareader so that get_data_yahoo uses the fixed downloader
yf.pdr_override()

# Get stock prices
spy_table  = pdr.get_data_yahoo("SPY")
amzn_table = pdr.get_data_yahoo("AMZN")
ebay_table = pdr.get_data_yahoo("EBAY")
wal_table  = pdr.get_data_yahoo("WMT")
aapl_table = pdr.get_data_yahoo("AAPL")
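Note that fix_yahoo_finance has since been renamed to yfinance, so the import above may fail in a newer environment. A sketch of the same download using yfinance directly (assuming the package is installed; depending on the version, the returned column layout may differ slightly from what the rest of this tutorial expects):

import yfinance as yf

# Download daily prices covering the 2016 sample used below
spy_table  = yf.download("SPY",  start="2016-01-01", end="2017-01-01")
amzn_table = yf.download("AMZN", start="2016-01-01", end="2017-01-01")
ebay_table = yf.download("EBAY", start="2016-01-01", end="2017-01-01")
wal_table  = yf.download("WMT",  start="2016-01-01", end="2017-01-01")
aapl_table = yf.download("AAPL", start="2016-01-01", end="2017-01-01")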

Then we take the closing prices from 2016:

spy  = spy_table .loc['2016',['Close']]
amzn = amzn_table.loc['2016',['Close']]
ebay = ebay_table.loc['2016',['Close']]
wal  = wal_table .loc['2016',['Close']]
aapl = aapl_table.loc['2016',['Close']]

After computing each stock's log returns, we concatenate them into a single DataFrame and print the last five rows:

spy_log  = np.log(spy.Close) .diff().dropna()
amzn_log = np.log(amzn.Close).diff().dropna()
ebay_log = np.log(ebay.Close).diff().dropna()
wal_log  = np.log(wal.Close) .diff().dropna()
aapl_log = np.log(aapl.Close).diff().dropna()
df = pd.concat([spy_log,amzn_log,ebay_log,wal_log,aapl_log],axis = 1).dropna()
df.columns = ['spy', 'amzn', 'ebay', 'wal', 'aapl']
df.tail()
                 spy      amzn      ebay       wal      aapl
Date
2016-12-23  0.001463 -0.007531  0.008427 -0.000719  0.001976
2016-12-27  0.002478  0.014113  0.014993  0.002298  0.006331
2016-12-28 -0.008299  0.000946 -0.007635 -0.005611 -0.004273
2016-12-29 -0.000223 -0.009081 -0.001000 -0.000722 -0.000257
2016-12-30 -0.003662 -0.020172 -0.009720 -0.002023 -0.007826

As before, we use the statsmodels package to run a simple linear regression:

# Regress Amazon's log returns on the S&P 500's log returns
simple = sm.ols(formula = 'amzn ~ spy', data = df).fit()
print(simple.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   amzn   R-squared:                       0.230
Model:                            OLS   Adj. R-squared:                  0.227
Method:                 Least Squares   F-statistic:                     74.46
Date:                Tue, 09 Oct 2018   Prob (F-statistic):           7.44e-16
Time:                        11:55:12   Log-Likelihood:                 680.94
No. Observations:                 251   AIC:                            -1358.
Df Residuals:                     249   BIC:                            -1351.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0002      0.001      0.196      0.845      -0.002       0.002
spy            1.0661      0.124      8.629      0.000       0.823       1.309
==============================================================================
Omnibus:                       67.332   Durbin-Watson:                   2.018
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2026.389
Skew:                          -0.074   Prob(JB):                         0.00
Kurtosis:                      16.919   Cond. No.                         121.
==============================================================================
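As a quick check on how the numbers in this table fit together (a standard relationship, not spelled out in the original): each t-statistic is the coefficient estimate divided by its standard error, so for spy

t = \frac{\hat{\beta}_{spy}}{SE(\hat{\beta}_{spy})} = \frac{1.0661}{0.124} \approx 8.6

which matches the reported 8.629 up to rounding of the displayed standard error, and explains the p-value of essentially zero.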

Similarly, we can build a multiple linear regression model:

# Add Amazon's competitors (eBay, Walmart, Apple) as additional regressors
model = sm.ols(formula = 'amzn ~ spy + ebay + wal + aapl', data = df).fit()
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   amzn   R-squared:                       0.250
Model:                            OLS   Adj. R-squared:                  0.238
Method:                 Least Squares   F-statistic:                     20.52
Date:                Tue, 09 Oct 2018   Prob (F-statistic):           1.32e-14
Time:                        13:23:15   Log-Likelihood:                 684.25
No. Observations:                 251   AIC:                            -1358.
Df Residuals:                     246   BIC:                            -1341.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0002      0.001      0.229      0.819      -0.002       0.002
spy            1.0254      0.170      6.038      0.000       0.691       1.360
ebay          -0.0774      0.058     -1.325      0.186      -0.193       0.038
wal           -0.0838      0.089     -0.943      0.346      -0.259       0.091
aapl           0.1576      0.084      1.883      0.061      -0.007       0.322
==============================================================================
Omnibus:                       69.077   Durbin-Watson:                   1.983
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1890.930
Skew:                          -0.272   Prob(JB):                         0.00
Kurtosis:                      16.435   Cond. No.                         179.
==============================================================================

From the table above, the p-values for ebay, walmart, and apple are 0.186, 0.346, and 0.061 respectively, so none of them is significant at the 95% confidence level. The multiple regression model has a higher R^2 than the simple one: 0.250 vs. 0.230. In fact, R^2 never decreases as more variables are added. Why? If we add an extra variable that explains none of the variation in the response (amzn), its estimated coefficient will simply be zero; it is as if the variable had never been included in the model, so R^2 does not change. However, adding hundreds of variables is not necessarily better, a problem we will discuss in later chapters.
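This point can be illustrated with a short experiment (not part of the original article; the data and variable names below are synthetic and purely illustrative): adding a pure-noise regressor leaves R^2 no lower than before, while the adjusted R^2, which penalizes extra variables, can fall.

import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Synthetic data: y depends on x only; 'noise' is unrelated to y
np.random.seed(1)
n = 250
demo = pd.DataFrame({'x': np.random.randn(n)})
demo['y'] = 0.5 * demo['x'] + 0.1 * np.random.randn(n)
demo['noise'] = np.random.randn(n)

base     = sm.ols(formula='y ~ x', data=demo).fit()
extended = sm.ols(formula='y ~ x + noise', data=demo).fit()

# R-squared can only stay the same or rise; adjusted R-squared may drop
print(base.rsquared, extended.rsquared)
print(base.rsquared_adj, extended.rsquared_adj)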

Can we improve the model further? Here we try the Fama-French five-factor model, an important model in asset pricing theory, which we will cover in a later tutorial. Data download link:

from datetime import datetime

path = './F-F_Research_Data_5_Factors_2x3_daily.CSV'
fama_table = pd.read_csv(path)

# Convert the date column (YYYYMMDD integers) into a DatetimeIndex
fama_table.index = [datetime.strptime(str(x), "%Y%m%d")
                    for x in fama_table.iloc[:,0]]
# Remove the original date column
fama_table = fama_table.iloc[:,1:]

With this data, we can build a Fama-French factor model:

# Select the 2016 factor data and convert percentages to decimals
fama = fama_table.loc['2016']
fama = fama.rename(columns = {'Mkt-RF':'MKT'})
fama = fama.apply(lambda x: x/100)
# Combine the factors with Amazon's log returns (the Series is named 'Close')
fama_df = pd.concat([fama, amzn_log], axis = 1)
fama_model = sm.ols(formula = 'Close~MKT+SMB+HML+RMW+CMA', data = fama_df).fit()
print(fama_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Close   R-squared:                       0.387
Model:                            OLS   Adj. R-squared:                  0.375
Method:                 Least Squares   F-statistic:                     30.96
Date:                Tue, 09 Oct 2018   Prob (F-statistic):           2.24e-24
Time:                        13:46:31   Log-Likelihood:                 709.57
No. Observations:                 251   AIC:                            -1407.
Df Residuals:                     245   BIC:                            -1386.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
======================================================