Study notes: [Case study] Analysis of factors influencing fiscal revenue and a forecasting model
阿新 • Published: 2018-08-07
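Case source: Chapter 13 of 《Python數據分析與挖掘實戰》 (Python Data Analysis and Mining in Practice)
Case background and mining goals
Input data:
《某市統計年鑒》 (Statistical Yearbook of a certain city), 1995-2014
Mining goals:
- Identify the key features that influence local fiscal revenue, and analyze and choose a model for selecting those features
- Building on the factor analysis from goal 1, forecast the city's total fiscal revenue and the revenue in each category for 2015
Analysis method and process (selection principles)
Earlier analyses of fiscal revenue typically used a multiple linear regression model, with ordinary least squares estimating the regression coefficients. Those estimates depend heavily on the data, and the result is often only a locally optimal solution, so the subsequent tests may lose their intended meaning.
This case therefore uses the Adaptive-Lasso variable selection method instead.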
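Adaptive-Lasso can be sketched as a weighted Lasso: fit an initial estimate, turn it into per-feature penalty weights, then solve an ordinary Lasso on rescaled columns. The block below is a minimal illustration on synthetic data, not the book's implementation. The function name `adaptive_lasso`, the ridge initial estimate, and the values of `alpha`, `gamma`, and `ridge_alpha` are all assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler


def adaptive_lasso(X, y, alpha=0.1, gamma=1.0, ridge_alpha=1.0):
    """Hypothetical helper: Adaptive-Lasso via column rescaling."""
    X_std = StandardScaler().fit_transform(X)   # put features on one scale
    y_c = y - y.mean()                          # center the target
    # Step 1: initial coefficients from ridge regression (an assumption;
    # ordinary least squares is another common choice).
    beta_init = Ridge(alpha=ridge_alpha).fit(X_std, y_c).coef_
    # Step 2: penalty weights, large for features with a small initial effect.
    weights = 1.0 / (np.abs(beta_init) ** gamma + 1e-8)
    # Step 3: a plain Lasso on X / w is equivalent to a weighted L1 penalty.
    lasso = Lasso(alpha=alpha, max_iter=100_000).fit(X_std / weights, y_c)
    # Step 4: map the coefficients back to the standardized feature scale.
    return lasso.coef_ / weights


# Synthetic check: only the first two of five features drive y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.1 * rng.normal(size=200)
coef_demo = adaptive_lasso(X_demo, y_demo)
print(np.round(coef_demo, 3))
```

Features whose initial coefficient is close to zero receive a very large penalty, so they are more likely to be dropped than under a plain Lasso with a single shared penalty.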
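Subtask planning
- Collect the city's fiscal revenue and the revenue in each category from the municipal statistics bureau website and the statistical yearbooks
- Build the Adaptive-Lasso variable selection model
- Feed the selected variables into the trained artificial neural network to obtain the 2015 forecast
Experiment
Goal: get familiar with Adaptive-Lasso variable selection and the neural network forecasting model
- Analyze the data, identify the key features, and filter them with the Adaptive-Lasso variable selection method
- Use the GM(1,1) grey forecasting method to obtain 2014 and 2015 forecasts for the selected key factors
- Feed these into the neural network model to obtain the 2014 and 2015 forecasts
Code archive: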
import numpy as np
import pandas as pd
import os
# Summary statistics of the data
dpath = './demo/data/data1.csv'
input_data = pd.read_csv(dpath)
r = [input_data.min(), input_data.max(), input_data.mean(), input_data.std()]
r = pd.DataFrame(r, index=['Min', 'Max', 'Mean', 'Std'])
r = np.round(r, 2)
print(r)
Stat | x1 | x2 | x3 | x4 | x5 | x6 | x7 |
---|---|---|---|---|---|---|---|
Min | 3831732.00 | 181.54 | 448.19 | 7571.00 | 6212.70 | 6370241.00 | 525.71 |
Max | 7599295.00 | 2110.78 | 6882.85 | 42049.14 | 33156.83 | 8323096.00 | 4454.55 |
Mean | 5579519.95 | 765.04 | 2370.83 | 19644.69 | 15870.95 | 7350513.60 | 1712.24 |
Std | 1262194.72 | 595.70 | 1919.17 | 10203.02 | 8199.77 | 621341.85 | 1184.71 |

Stat | x8 | x9 | x10 | x11 | x12 | x13 | y |
---|---|---|---|---|---|---|---|
Min | 985.31 | 60.62 | 65.66 | 97.50 | 1.03 | 5321.00 | 64.87 |
Max | 15420.14 | 228.46 | 852.56 | 120.00 | 1.91 | 41972.00 | 2088.14 |
Mean | 5705.80 | 129.49 | 340.22 | 103.31 | 1.42 | 17273.80 | 618.08 |
Std | 4478.40 | 50.51 | 251.58 | 5.51 | 0.25 | 11109.19 | 609.25 |
# Compute the Pearson correlation coefficients
np.round(input_data.corr(method='pearson'), 2)
Variable | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 | x13 | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
x1 | 1.00 | 0.95 | 0.95 | 0.97 | 0.97 | 0.99 | 0.95 | 0.97 | 0.98 | 0.98 | -0.29 | 0.94 | 0.96 | 0.94 |
x2 | 0.95 | 1.00 | 1.00 | 0.99 | 0.99 | 0.92 | 0.99 | 0.99 | 0.98 | 0.98 | -0.13 | 0.89 | 1.00 | 0.98 |
x3 | 0.95 | 1.00 | 1.00 | 0.99 | 0.99 | 0.92 | 1.00 | 0.99 | 0.98 | 0.99 | -0.15 | 0.89 | 1.00 | 0.99 |
x4 | 0.97 | 0.99 | 0.99 | 1.00 | 1.00 | 0.95 | 0.99 | 1.00 | 0.99 | 1.00 | -0.19 | 0.91 | 1.00 | 0.99 |
x5 | 0.97 | 0.99 | 0.99 | 1.00 | 1.00 | 0.95 | 0.99 | 1.00 | 0.99 | 1.00 | -0.18 | 0.90 | 0.99 | 0.99 |
x6 | 0.99 | 0.92 | 0.92 | 0.95 | 0.95 | 1.00 | 0.93 | 0.95 | 0.97 | 0.96 | -0.34 | 0.95 | 0.94 | 0.91 |
x7 | 0.95 | 0.99 | 1.00 | 0.99 | 0.99 | 0.93 | 1.00 | 0.99 | 0.98 | 0.99 | -0.15 | 0.89 | 1.00 | 0.99 |
x8 | 0.97 | 0.99 | 0.99 | 1.00 | 1.00 | 0.95 | 0.99 | 1.00 | 0.99 | 1.00 | -0.15 | 0.90 | 1.00 | 0.99 |
x9 | 0.98 | 0.98 | 0.98 | 0.99 | 0.99 | 0.97 | 0.98 | 0.99 | 1.00 | 0.99 | -0.23 | 0.91 | 0.99 | 0.98 |
x10 | 0.98 | 0.98 | 0.99 | 1.00 | 1.00 | 0.96 | 0.99 | 1.00 | 0.99 | 1.00 | -0.17 | 0.90 | 0.99 | 0.99 |
x11 | -0.29 | -0.13 | -0.15 | -0.19 | -0.18 | -0.34 | -0.15 | -0.15 | -0.23 | -0.17 | 1.00 | -0.43 | -0.16 | -0.12 |
x12 | 0.94 | 0.89 | 0.89 | 0.91 | 0.90 | 0.95 | 0.89 | 0.90 | 0.91 | 0.90 | -0.43 | 1.00 | 0.90 | 0.87 |
x13 | 0.96 | 1.00 | 1.00 | 1.00 | 0.99 | 0.94 | 1.00 | 1.00 | 0.99 | 0.99 | -0.16 | 0.90 | 1.00 | 0.99 |
y | 0.94 | 0.98 | 0.99 | 0.99 | 0.99 | 0.91 | 0.99 | 0.99 | 0.98 | 0.99 | -0.12 | 0.87 | 0.99 | 1.00 |
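The results show that only x11 is negatively correlated with the target y (r = -0.12); all other variables are positively correlated with it.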
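Many feature pairs in the matrix above have coefficients of 0.99 or 1.00, which means the features are strongly collinear. A small helper such as the hypothetical `high_corr_pairs` below can list those pairs. The 0.99 threshold and the toy data are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.99):
    """Hypothetical helper: list column pairs with |Pearson r| >= threshold."""
    corr = df.corr(method='pearson')
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):      # upper triangle only
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 4)))
    return pairs


# Toy data: b is an exact linear function of a, c is unrelated noise.
rng = np.random.default_rng(1)
a = rng.normal(size=50)
toy = pd.DataFrame({'a': a, 'b': 2 * a + 1, 'c': rng.normal(size=50)})
pairs_demo = high_corr_pairs(toy)
print(pairs_demo)
```

Applied to `input_data`, the same call would report which of x1 to x13 carry nearly the same information.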
# Adaptive-Lasso step; note that this code fits scikit-learn's plain Lasso
from sklearn import linear_model
model = linear_model.Lasso(alpha=1)
model.fit(input_data.iloc[:, 0:13], input_data['y'])
model.coef_
/Users/januswing/Library/Python/3.6/lib/python/site-packages/sklearn/linear_model/coordinate_descent.py:491: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
ConvergenceWarning)
array([-1.85085555e-04, -3.15519378e-01, 4.32896206e-01, -3.15753523e-02,
7.58007814e-02, 4.03145358e-04, 2.41255896e-01, -3.70482514e-02,
-2.55448330e+00, 4.41363280e-01, 5.69277642e+00, -0.00000000e+00,
-3.98946837e-02])
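The ConvergenceWarning above suggests raising the iteration limit. Because the columns differ in scale by several orders of magnitude (x1 is in the millions, x12 is close to 1), standardizing the features before fitting is another common remedy. The sketch below shows both on synthetic data. The data, the feature names, `alpha=0.1`, and `max_iter=100_000` are assumptions, and this is not a re-run of the experiment above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 100
# Three synthetic features on very different scales.
X = np.column_stack([
    rng.normal(5e6, 1e6, n),   # scale of millions
    rng.normal(1e3, 3e2, n),   # scale of thousands
    rng.normal(1.4, 0.25, n),  # scale of ones
])
# Only the first two features affect the synthetic target.
y = 2e-6 * X[:, 0] + 5e-3 * X[:, 1] + rng.normal(0, 0.1, n)

X_std = StandardScaler().fit_transform(X)            # zero mean, unit variance
lasso = Lasso(alpha=0.1, max_iter=100_000).fit(X_std, y)

names = np.array(['f1', 'f2', 'f3'])                  # hypothetical feature names
selected = names[lasso.coef_ != 0]                    # features the Lasso kept
print(selected, np.round(lasso.coef_, 3))
```

Features with a coefficient of exactly zero are the ones the Lasso discards, which is how the coefficient vector can be turned into a feature list.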
def GM11(x0):  # custom grey forecasting function
    import numpy as np
    x1 = x0.cumsum()  # 1-AGO (accumulated) sequence
    z1 = (x1[:len(x1)-1] + x1[1:])/2.0  # mean sequence of consecutive neighbors
    z1 = z1.reshape((len(z1), 1))
    B = np.append(-z1, np.ones_like(z1), axis=1)
    Yn = x0[1:].reshape((len(x0)-1, 1))
    [[a], [b]] = np.dot(np.dot(np.linalg.inv(np.dot(B.T, B)), B.T), Yn)  # estimate the parameters
    f = lambda k: (x0[0]-b/a)*np.exp(-a*(k-1)) - (x0[0]-b/a)*np.exp(-a*(k-2))  # restored value
    delta = np.abs(x0 - np.array([f(i) for i in range(1, len(x0)+1)]))
    C = delta.std()/x0.std()
    P = 1.0*(np.abs(delta - delta.mean()) < 0.6745*x0.std()).sum()/len(x0)
    return f, a, b, x0[0], C, P  # forecasting function, a, b, first term, posterior variance ratio, small-error probability
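For reference, the quantities computed by `GM11` can be written out as follows. This is a transcription of the code above into standard GM(1,1) notation, with $x^{(0)}$ the input series of length $n$ and $\delta$ the absolute residual. The symbols are chosen here for readability and are not taken from the book.

```latex
\begin{aligned}
x^{(1)}(k) &= \sum_{i=1}^{k} x^{(0)}(i), \qquad
z^{(1)}(k) = \tfrac{1}{2}\bigl(x^{(1)}(k) + x^{(1)}(k-1)\bigr), \quad k = 2,\dots,n \\
x^{(0)}(k) + a\,z^{(1)}(k) &= b \\
\begin{pmatrix} a \\ b \end{pmatrix} &= (B^{\top}B)^{-1}B^{\top}Y_n, \qquad
B = \begin{pmatrix} -z^{(1)}(2) & 1 \\ \vdots & \vdots \\ -z^{(1)}(n) & 1 \end{pmatrix}, \quad
Y_n = \begin{pmatrix} x^{(0)}(2) \\ \vdots \\ x^{(0)}(n) \end{pmatrix} \\
\hat{x}^{(0)}(k) &= \Bigl(x^{(0)}(1) - \tfrac{b}{a}\Bigr)\bigl(e^{-a(k-1)} - e^{-a(k-2)}\bigr) \\
\delta(k) &= \bigl|x^{(0)}(k) - \hat{x}^{(0)}(k)\bigr|, \qquad
C = \frac{\operatorname{std}(\delta)}{\operatorname{std}(x^{(0)})}, \qquad
P = \frac{1}{n}\,\#\bigl\{k : |\delta(k) - \bar{\delta}| < 0.6745\,\operatorname{std}(x^{(0)})\bigr\}
\end{aligned}
```

A smaller $C$ and a $P$ closer to 1 indicate a better fit. The code evaluates $\hat{x}^{(0)}(k)$ with the same expression for every $k$ from 1 to $n$ when computing $\delta$.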
inputfile = './demo/data/data1.csv'  # input data file
outputfile = './demo/tmp/data1_GM11.xls'  # where the grey forecast results are saved
data = pd.read_csv(inputfile)  # read the data
data.index = range(1994, 2014)
data.loc[2014] = np.nan  # placeholder rows for the forecast years
data.loc[2015] = np.nan
l = ['x1', 'x2', 'x3', 'x4', 'x5', 'x7']
for i in l:
    f = GM11(data[i][:20].values)[0]  # .values replaces the deprecated .as_matrix()
    data.loc[2014, i] = f(len(data)-1)  # 2014 forecast
    data.loc[2015, i] = f(len(data))  # 2015 forecast
    data[i] = data[i].round(2)  # keep two decimal places
data[l+['y']].to_excel(outputfile)  # write the results
data[l+['y']]
Year | x1 | x2 | x3 | x4 | x5 | x7 | y |
---|---|---|---|---|---|---|---|
1994 | 3831732.00 | 181.54 | 448.19 | 7571.00 | 6212.70 | 525.71 | 64.87 |
1995 | 3913824.00 | 214.63 | 549.97 | 9038.16 | 7601.73 | 618.25 | 99.75 |
1996 | 3928907.00 | 239.56 | 686.44 | 9905.31 | 8092.82 | 638.94 | 88.11 |
1997 | 4282130.00 | 261.58 | 802.59 | 10444.60 | 8767.98 | 656.58 | 106.07 |
1998 | 4453911.00 | 283.14 | 904.57 | 11255.70 | 9422.33 | 758.83 | 137.32 |
1999 | 4548852.00 | 308.58 | 1000.69 | 12018.52 | 9751.44 | 878.26 | 188.14 |
2000 | 4962579.00 | 348.09 | 1121.13 | 13966.53 | 11349.47 | 923.67 | 219.91 |
2001 | 5029338.00 | 387.81 | 1248.29 | 14694.00 | 11467.35 | 978.21 | 271.91 |
2002 | 5070216.00 | 453.49 | 1370.68 | 13380.47 | 10671.78 | 1009.24 | 269.10 |
2003 | 5210706.00 | 533.55 | 1494.27 | 15002.59 | 11570.58 | 1175.17 | 300.55 |
2004 | 5407087.00 | 598.33 | 1677.77 | 16884.16 | 13120.83 | 1348.93 | 338.45 |
2005 | 5744550.00 | 665.32 | 1905.84 | 18287.24 | 14468.24 | 1519.16 | 408.86 |
2006 | 5994973.00 | 738.97 | 2199.14 | 19850.66 | 15444.93 | 1696.38 | 476.72 |
2007 | 6236312.00 | 877.07 | 2624.24 | 22469.22 | 18951.32 | 1863.34 | 838.99 |
2008 | 6529045.00 | 1005.37 | 3187.39 | 25316.72 | 20835.95 | 2105.54 | 843.14 |
2009 | 6791495.00 | 1118.03 | 3615.77 | 27609.59 | 22820.89 | 2659.85 | 1107.67 |
2010 | 7110695.00 | 1304.48 | 4476.38 | 30658.49 | 25011.61 | 3263.57 | 1399.16 |
2011 | 7431755.00 | 1700.87 | 5243.03 | 34438.08 | 28209.74 | 3412.21 | 1535.14 |
2012 | 7512997.00 | 1969.51 | 5977.27 | 38053.52 | 30490.44 | 3758.39 | 1579.68 |
2013 | 7599295.00 | 2110.78 | 6882.85 | 42049.14 | 33156.83 | 4454.55 | 2088.14 |
2014 | 8142148.24 | 2239.29 | 7042.31 | 43611.84 | 35046.63 | 4600.40 | NaN |
2015 | 8460489.28 | 2581.14 | 8166.92 | 47792.22 | 38384.22 | 5214.78 | NaN |
import pandas as pd
inputfile = './tmp/data1_GM11.xls'  # grey forecast results saved earlier
outputfile = './data/revenue.xls'  # where the neural network forecasts are saved
modelfile = './tmp/1-net.model'  # where the model weights are saved
data = pd.read_excel(inputfile, index_col=0)  # read the data; the first column holds the year
feature = ['x1', 'x2', 'x3', 'x4', 'x5', 'x7']  # feature columns
data_train = data.loc[range(1994, 2014)].copy()  # rows before 2014 are used for training
data_mean = data_train.mean()
data_std = data_train.std()
data_train = (data_train - data_mean)/data_std  # standardize the data
x_train = data_train[feature].values  # feature data
y_train = data_train['y'].values  # label data
from keras.models import Sequential
from keras.layers import Dense, Activation
model = Sequential()  # build the model
model.add(Dense(input_dim=6, units=12))  # Keras 2 API: units replaces output_dim
model.add(Activation('relu'))  # ReLU activation, which the author notes can improve accuracy considerably
model.add(Dense(input_dim=12, units=1))
model.compile(loss='mean_squared_error', optimizer='adam')  # compile the model
model.fit(x_train, y_train, epochs=10000, batch_size=16, verbose=0)  # train for 10,000 epochs (epochs replaces nb_epoch)
model.save_weights(modelfile)  # save the model weights
# Predict, then restore the results to the original scale.
x = ((data[feature] - data_mean[feature])/data_std[feature]).values
data['y_pred'] = model.predict(x).ravel() * data_std['y'] + data_mean['y']  # flatten (n, 1) predictions to 1-D
data.to_excel(outputfile)
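The standardization step deserves a closer look: the network is trained on z-scores, the forecast years are scaled with the training-period mean and standard deviation, and predictions are mapped back with the inverse transform. The toy numbers below are made up purely to illustrate that round trip.

```python
import numpy as np

# Toy target values for the "training years" (made-up numbers).
y_train_raw = np.array([10.0, 20.0, 30.0, 40.0])
mean, std = y_train_raw.mean(), y_train_raw.std(ddof=1)  # ddof=1 matches pandas .std()

y_scaled = (y_train_raw - mean) / std        # what the network is trained on
y_restored = y_scaled * std + mean           # inverse transform applied to predictions

# A forecast-year input is scaled with the *training* statistics, not its own.
x_train_raw = np.array([1.0, 2.0, 3.0, 4.0])
x_mean, x_std = x_train_raw.mean(), x_train_raw.std(ddof=1)
x_future_scaled = (5.0 - x_mean) / x_std
print(y_restored, round(x_future_scaled, 3))
```

Reusing the training statistics keeps the 2014 and 2015 inputs on the same scale the network saw during training.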
import matplotlib.pyplot as plt  # plot the forecast results
p = data[['y', 'y_pred']].plot(subplots=True, style=['b-o', 'r-*'])
plt.show()
Open questions:
What other methods are there for identifying key features? Which of them can be used in PyTorch?