Study Notes: [Case Study] Analysis of Factors Affecting Fiscal Revenue and a Forecasting Model

阿新 • Published: 2018-08-07


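Case source: 《Python數據分析與挖掘實戰》 (Python Data Analysis and Mining in Practice), Chapter 13

Case Background and Mining Objectives

Input data:
Statistical Yearbook of a certain city (1995-2014)

Mining objectives:

1. Sort out the key features that influence local fiscal revenue, and analyze and identify a selection model for those key features
2. Building on the factor analysis from objective 1, forecast the city's total fiscal revenue and the revenue of each category for 2015

Analysis Method and Process (Selection Principles)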

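Earlier analyses of fiscal revenue typically used a multiple linear regression model, with least-squares estimation of the regression coefficients. Results obtained this way depend heavily on the data, and what is found is often only a locally optimal solution, so the subsequent tests may lose their intended meaning.
This case therefore uses the Adaptive-Lasso variable selection method instead.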
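For reference, the Adaptive-Lasso estimator is usually written as a least-squares fit with a weighted L1 penalty. The formulation below is the common textbook form and is given only as a sketch; the symbols (p features, tuning parameters \lambda and \gamma, and an initial estimate \hat{\beta}_j such as the ordinary least-squares coefficient) are notation assumed here, not taken from the original notes.

```latex
\hat{\beta}^{*} = \arg\min_{\beta}
  \left\| y - \sum_{j=1}^{p} x_j \beta_j \right\|^{2}
  + \lambda \sum_{j=1}^{p} \hat{w}_j \,\lvert \beta_j \rvert ,
\qquad
\hat{w}_j = \frac{1}{\lvert \hat{\beta}_j \rvert^{\gamma}}, \quad \gamma > 0
```

Features whose initial coefficient is small receive a large weight and are therefore more likely to be shrunk exactly to zero, which is what makes the method usable for variable selection.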
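Subtask Plan

Collect the city's fiscal revenue and the revenue of each category from the city statistics bureau website and the statistical yearbooks

Build the Adaptive-Lasso variable selection model
Feed the selected features into the constructed artificial neural network model to obtain the forecast for 2015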

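Experiment
Goal: become familiar with Adaptive-Lasso variable selection and neural network forecasting models

Analyze the data and identify the key features, screening them with the Adaptive-Lasso variable selection method
Use the GM(1,1) grey forecasting method to obtain the 2014 and 2015 forecast values of the selected key factors
Feed these values into the neural network model to obtain the 2014 and 2015 forecasts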

Code archive:

import numpy as np
import pandas as pd

# Summarize the data: minimum, maximum, mean and standard deviation of every column
dpath = './demo/data/data1.csv'
input_data = pd.read_csv(dpath)
r = [input_data.min(), input_data.max(), input_data.mean(), input_data.std()]
r = pd.DataFrame(r, index=['Min', 'Max', 'Mean', 'Std'])
r = np.round(r, 2)
print(r)

              x1       x2       x3        x4        x5          x6       x7
Min   3831732.00   181.54   448.19   7571.00   6212.70  6370241.00   525.71
Max   7599295.00  2110.78  6882.85  42049.14  33156.83  8323096.00  4454.55
Mean  5579519.95   765.04  2370.83  19644.69  15870.95  7350513.60  1712.24
Std   1262194.72   595.70  1919.17  10203.02   8199.77   621341.85  1184.71

            x8      x9     x10     x11   x12       x13        y  
Min     985.31   60.62   65.66   97.50  1.03   5321.00    64.87  
Max   15420.14  228.46  852.56  120.00  1.91  41972.00  2088.14  
Mean   5705.80  129.49  340.22  103.31  1.42  17273.80   618.08  
Std    4478.40   50.51  251.58    5.51  0.25  11109.19   609.25

# Compute the Pearson correlation coefficients
np.round(input_data.corr(method='pearson'), 2)

	x1	x2	x3	x4	x5	x6	x7	x8	x9	x10	x11	x12	x13	y
x1	1.00	0.95	0.95	0.97	0.97	0.99	0.95	0.97	0.98	0.98	-0.29	0.94	0.96	0.94
x2	0.95	1.00	1.00	0.99	0.99	0.92	0.99	0.99	0.98	0.98	-0.13	0.89	1.00	0.98
x3	0.95	1.00	1.00	0.99	0.99	0.92	1.00	0.99	0.98	0.99	-0.15	0.89	1.00	0.99
x4	0.97	0.99	0.99	1.00	1.00	0.95	0.99	1.00	0.99	1.00	-0.19	0.91	1.00	0.99
x5	0.97	0.99	0.99	1.00	1.00	0.95	0.99	1.00	0.99	1.00	-0.18	0.90	0.99	0.99
x6	0.99	0.92	0.92	0.95	0.95	1.00	0.93	0.95	0.97	0.96	-0.34	0.95	0.94	0.91
x7	0.95	0.99	1.00	0.99	0.99	0.93	1.00	0.99	0.98	0.99	-0.15	0.89	1.00	0.99
x8	0.97	0.99	0.99	1.00	1.00	0.95	0.99	1.00	0.99	1.00	-0.15	0.90	1.00	0.99
x9	0.98	0.98	0.98	0.99	0.99	0.97	0.98	0.99	1.00	0.99	-0.23	0.91	0.99	0.98
x10	0.98	0.98	0.99	1.00	1.00	0.96	0.99	1.00	0.99	1.00	-0.17	0.90	0.99	0.99
x11	-0.29	-0.13	-0.15	-0.19	-0.18	-0.34	-0.15	-0.15	-0.23	-0.17	1.00	-0.43	-0.16	-0.12
x12	0.94	0.89	0.89	0.91	0.90	0.95	0.89	0.90	0.91	0.90	-0.43	1.00	0.90	0.87
x13	0.96	1.00	1.00	1.00	0.99	0.94	1.00	1.00	0.99	0.99	-0.16	0.90	1.00	0.99
y	0.94	0.98	0.99	0.99	0.99	0.91	0.99	0.99	0.98	0.99	-0.12	0.87	0.99	1.00

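The results show that x11 is the only variable negatively correlated with y (r = -0.12); all the other variables are positively correlated with y, and strongly so (r of 0.87 or higher).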
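As a reminder of what the correlation matrix measures, here is a minimal sketch that computes a single Pearson coefficient by hand and compares it with NumPy's built-in result. The helper name pearson_r is hypothetical; the sample values are the first five rows (1994-1998) of x2 and y from the data table in these notes, so the resulting r describes only those five years, not the full series.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centre both series
    yc = y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

# First five observations of x2 and y (1994-1998) from the notes' data table
x2_sample = [181.54, 214.63, 239.56, 261.58, 283.14]
y_sample = [64.87, 99.75, 88.11, 106.07, 137.32]

r_manual = pearson_r(x2_sample, y_sample)
r_numpy = np.corrcoef(x2_sample, y_sample)[0, 1]
print(r_manual, r_numpy)
```

Both values should agree up to floating-point error, since np.corrcoef implements the same definition.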

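# Variable selection. Note: scikit-learn does not provide an AdaptiveLasso estimator,
# so an ordinary Lasso (L1-penalized linear regression) is fitted here instead.
from sklearn import linear_model
model = linear_model.Lasso(alpha=1)
model.fit(input_data.iloc[:,0:13], input_data['y'])
model.coef_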

/Users/januswing/Library/Python/3.6/lib/python/site-packages/sklearn/linear_model/coordinate_descent.py:491: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
  ConvergenceWarning)

array([-1.85085555e-04, -3.15519378e-01,  4.32896206e-01, -3.15753523e-02,
        7.58007814e-02,  4.03145358e-04,  2.41255896e-01, -3.70482514e-02,
       -2.55448330e+00,  4.41363280e-01,  5.69277642e+00, -0.00000000e+00,
       -3.98946837e-02])
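Since the fit above is a plain Lasso, the following is a minimal sketch of how an adaptive Lasso could be assembled by hand in two stages: a pilot fit supplies per-feature weights, and a weighted Lasso is then solved by coordinate descent. It is a NumPy-only illustration under stated assumptions, not the book's implementation: the function name adaptive_lasso, its parameters (alpha, gamma, ridge, n_iter) and the synthetic demo data are all made up for this example.

```python
import numpy as np

def adaptive_lasso(X, y, alpha=0.05, gamma=1.0, ridge=1e-3, n_iter=200):
    """Two-stage adaptive Lasso sketch on standardized data (hypothetical helper)."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize the features
    y = y - y.mean()                          # centre the target
    n, p = X.shape
    # Stage 1: pilot estimate from a lightly regularized least-squares fit.
    pilot = np.linalg.solve(X.T @ X + ridge * n * np.eye(p), X.T @ y)
    # Adaptive weights: a small pilot coefficient means a large penalty.
    w = 1.0 / (np.abs(pilot) ** gamma + 1e-8)
    # Stage 2: cyclic coordinate descent on
    # (1/2n) * ||y - X b||^2 + alpha * sum_j w_j * |b_j|.
    beta = np.zeros(p)
    resid = y.copy()
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]   # partial residual without feature j
            rho = X[:, j] @ resid / n
            # Soft-thresholding update for coordinate j
            beta[j] = np.sign(rho) * max(abs(rho) - alpha * w[j], 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]   # back to the full residual
    return beta

# Synthetic demo (made-up data, NOT the fiscal data set):
# only the first two of five features actually drive y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.1 * rng.normal(size=200)
coef = adaptive_lasso(X_demo, y_demo)
print(np.round(coef, 3))
```

On this synthetic data the three irrelevant features should be driven exactly to zero while the two real ones keep coefficients close to their true values. The returned coefficients refer to the standardized features, which also sidesteps the scale differences (x1 is in the millions, x12 is around 1) that likely contribute to the convergence warning shown above.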

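def GM11(x0):  # custom GM(1,1) grey forecasting function; x0 is a 1-D NumPy array
  import numpy as np
  x1 = x0.cumsum()  # 1-AGO (accumulated generating operation) sequence
  z1 = (x1[:len(x1)-1] + x1[1:])/2.0  # mean sequence generated from adjacent values
  z1 = z1.reshape((len(z1),1))
  B = np.append(-z1, np.ones_like(z1), axis = 1)
  Yn = x0[1:].reshape((len(x0)-1, 1))
  [[a],[b]] = np.dot(np.dot(np.linalg.inv(np.dot(B.T, B)), B.T), Yn)  # least-squares estimate of the parameters a and b
  # Restored (de-accumulated) value for period k. Note that for k = 1 this expression
  # does not reproduce x0[0]; it is meaningful for k >= 2.
  f = lambda k: (x0[0]-b/a)*np.exp(-a*(k-1))-(x0[0]-b/a)*np.exp(-a*(k-2))
  delta = np.abs(x0 - np.array([f(i) for i in range(1,len(x0)+1)]))  # absolute residuals
  C = delta.std()/x0.std()
  P = 1.0*(np.abs(delta - delta.mean()) < 0.6745*x0.std()).sum()/len(x0)
  return f, a, b, x0[0], C, P  # returns the forecasting function, a, b, the first value, the variance ratio C and the small-residual probability P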

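inputfile = './demo/data/data1.csv'  # input data file
outputfile = './demo/tmp/data1_GM11.xls'  # where the grey-forecast results are saved
data = pd.read_csv(inputfile)  # read the data
data.index = range(1994, 2014)

data.loc[2014] = None
data.loc[2015] = None
# Features to forecast. Note that this list is not the set of non-zero
# Lasso coefficients printed earlier.
l = ['x1', 'x2', 'x3', 'x4', 'x5', 'x7']

for i in l:
  # Fit on the 20 observed years; .values replaces as_matrix(), which newer pandas removed
  f = GM11(data[i][:20].values.astype(float))[0]
  data.loc[2014, i] = f(len(data)-1)  # forecast for 2014
  data.loc[2015, i] = f(len(data))  # forecast for 2015
  data[i] = data[i].round(2)  # keep two decimal places

data[l+['y']].to_excel(outputfile)  # write the results

data[l+['y']]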

	x1	x2	x3	x4	x5	x7	y
1994	3831732.00	181.54	448.19	7571.00	6212.70	525.71	64.87
1995	3913824.00	214.63	549.97	9038.16	7601.73	618.25	99.75
1996	3928907.00	239.56	686.44	9905.31	8092.82	638.94	88.11
1997	4282130.00	261.58	802.59	10444.60	8767.98	656.58	106.07
1998	4453911.00	283.14	904.57	11255.70	9422.33	758.83	137.32
1999	4548852.00	308.58	1000.69	12018.52	9751.44	878.26	188.14
2000	4962579.00	348.09	1121.13	13966.53	11349.47	923.67	219.91
2001	5029338.00	387.81	1248.29	14694.00	11467.35	978.21	271.91
2002	5070216.00	453.49	1370.68	13380.47	10671.78	1009.24	269.10
2003	5210706.00	533.55	1494.27	15002.59	11570.58	1175.17	300.55
2004	5407087.00	598.33	1677.77	16884.16	13120.83	1348.93	338.45
2005	5744550.00	665.32	1905.84	18287.24	14468.24	1519.16	408.86
2006	5994973.00	738.97	2199.14	19850.66	15444.93	1696.38	476.72
2007	6236312.00	877.07	2624.24	22469.22	18951.32	1863.34	838.99
2008	6529045.00	1005.37	3187.39	25316.72	20835.95	2105.54	843.14
2009	6791495.00	1118.03	3615.77	27609.59	22820.89	2659.85	1107.67
2010	7110695.00	1304.48	4476.38	30658.49	25011.61	3263.57	1399.16
2011	7431755.00	1700.87	5243.03	34438.08	28209.74	3412.21	1535.14
2012	7512997.00	1969.51	5977.27	38053.52	30490.44	3758.39	1579.68
2013	7599295.00	2110.78	6882.85	42049.14	33156.83	4454.55	2088.14
2014	8142148.24	2239.29	7042.31	43611.84	35046.63	4600.40	NaN
2015	8460489.28	2581.14	8166.92	47792.22	38384.22	5214.78	NaN
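To see how a GM(1,1) forecast like the 2014 and 2015 rows above is produced, here is a compact, self-contained sketch of the same steps (accumulate, build mean-generated background values, estimate a and b by least squares, then difference the fitted accumulated series). The function name gm11_forecast and the test series are assumptions made for illustration; the series is a made-up sequence growing 10% per step, not data from the yearbook.

```python
import numpy as np

def gm11_forecast(x0, steps=1):
    """Fit GM(1,1) to x0 and forecast the next `steps` values (hypothetical helper)."""
    x0 = np.asarray(x0, dtype=float)
    n = len(x0)
    x1 = np.cumsum(x0)                      # 1-AGO accumulated sequence
    z1 = 0.5 * (x1[:-1] + x1[1:])           # mean of adjacent accumulated values
    B = np.column_stack([-z1, np.ones(n - 1)])
    # Least-squares solution of x0[k] = -a * z1[k] + b
    (a, b), *_ = np.linalg.lstsq(B, x0[1:], rcond=None)
    k = np.arange(n + steps)                # 0-based time index
    x1_hat = (x0[0] - b / a) * np.exp(-a * k) + b / a   # fitted accumulated series
    x0_hat = np.concatenate([[x0[0]], np.diff(x1_hat)]) # restore by differencing
    return x0_hat[n:], a, b

# Made-up series growing by 10% per step
series = 100.0 * 1.1 ** np.arange(10)
pred, a, b = gm11_forecast(series, steps=2)
print(pred)  # compare with the true continuation 100 * 1.1**10 and 100 * 1.1**11
```

For a steadily growing series the development coefficient a comes out negative, and the forecasts should land close to, though not exactly on, the true exponential continuation, because the mean-generated background values only approximate the underlying curve.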

import pandas as pd
inputfile = './tmp/data1_GM11.xls'  # grey-forecast results saved earlier
outputfile = './data/revenue.xls'  # where the neural network predictions are saved
modelfile = './tmp/1-net.model'  # where the model weights are saved
data = pd.read_excel(inputfile, index_col=0)  # read the data, using the saved year column as the index
feature = ['x1', 'x2', 'x3', 'x4', 'x5', 'x7']  # feature columns

data_train = data.loc[range(1994,2014)].copy()  # build the model on the years before 2014
data_mean = data_train.mean()
data_std = data_train.std()
data_train = (data_train - data_mean)/data_std  # standardize the data
x_train = data_train[feature].values  # feature data (.values replaces as_matrix(), which newer pandas removed)
y_train = data_train['y'].values  # label data

from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()  # build the model
model.add(Dense(units=12, input_dim=6))  # Keras 2 API: output_dim was renamed to units
model.add(Activation('relu'))  # ReLU activation, which can substantially improve accuracy
model.add(Dense(units=1))
model.compile(loss='mean_squared_error', optimizer='adam')  # compile the model
model.fit(x_train, y_train, epochs=10000, batch_size=16, verbose=0)  # train for 10,000 epochs (nb_epoch was renamed to epochs)
model.save_weights(modelfile)  # save the model weights

# Predict, then restore the results to the original scale.
x = ((data[feature] - data_mean[feature])/data_std[feature]).values
data['y_pred'] = model.predict(x).ravel() * data_std['y'] + data_mean['y']
data.to_excel(outputfile)

import matplotlib.pyplot as plt  # plot the actual and predicted values
p = data[['y','y_pred']].plot(subplots = True, style=['b-o','r-*'])
plt.show()
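The prediction step works on standardized values and then maps the network output back to the original scale with the training mean and standard deviation. The sketch below isolates that round trip with NumPy; the variable names are illustrative, and the five sample values are the 1994-1998 entries of y from the table in these notes. It uses ddof=1 because pandas' std() computes the sample standard deviation by default.

```python
import numpy as np

# First five values of y (1994-1998) from the notes' data table
y_raw = np.array([64.87, 99.75, 88.11, 106.07, 137.32])

y_mean = y_raw.mean()
y_std = y_raw.std(ddof=1)   # matches the default of pandas' DataFrame.std()

y_scaled = (y_raw - y_mean) / y_std       # what the network is trained on
y_restored = y_scaled * y_std + y_mean    # what is done to the network output

print(np.round(y_scaled, 3))
print(np.allclose(y_restored, y_raw))
```

The same mean and standard deviation must be used in both directions, which is why the notes compute them once on the training years and reuse them when scaling the 2014 and 2015 inputs.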

Open questions:

What other methods are there for identifying key features? Which of them can be used in PyTorch?
