
The Impact of Standardization/Normalization on Classic Machine Learning Models

Normalization

Standardizing (normalizing) data is a basic step in data mining. Different evaluation indicators often have different dimensions and units, which can distort the results of an analysis. To remove these dimensional effects and make indicators comparable, the data must be standardized. After standardization, all indicators sit on the same order of magnitude and can be compared and evaluated together.
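As a quick illustration (a minimal sketch with made-up numbers), when two features live on very different scales, the large-scale feature dominates any distance-based comparison:

import numpy as np

# Two samples: feature 0 ranges around 1, feature 1 around 5000
a = np.array([1.0, 5000.0])
b = np.array([2.0, 5100.0])

# Without scaling, the distance is driven almost entirely by feature 1
print(np.linalg.norm(a - b))        # ~100.005; feature 0 barely matters

# After rescaling both features to [0, 1], each contributes comparably
a_s = np.array([0.0, 0.0])
b_s = np.array([1.0, 1.0])
print(np.linalg.norm(a_s - b_s))    # ~1.414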

Several Normalization Methods

MinMaxScaler

Also known as deviation (min-max) standardization, this is a linear transform of the original data that maps the result into [0, 1]. The transform is:

x^*=\frac{x-min}{max-min}
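A minimal sketch of this transform, both by hand and with sklearn's MinMaxScaler (toy values, not the experiment's data):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[1.0], [3.0], [5.0]])

# By hand: (x - min) / (max - min)
manual = (x - x.min()) / (x.max() - x.min())

# With sklearn: fit learns min/max per feature, transform applies them
scaled = MinMaxScaler().fit_transform(x)
print(np.allclose(manual, scaled))  # True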

MaxAbsScaler

Similar to the method above, but it scales each feature by dividing by its maximum absolute value, mapping the training set into [-1, 1]. It is meant for data that is already centered at zero, or for sparse data containing very many zeros.
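The same idea as a short sketch (toy values): dividing by the maximum absolute value preserves zeros, which is why this scaler suits sparse data:

import numpy as np
from sklearn.preprocessing import MaxAbsScaler

x = np.array([[-4.0], [2.0], [8.0]])

# By hand: divide by the max absolute value -> range [-1, 1], zeros stay zero
manual = x / np.abs(x).max()

scaled = MaxAbsScaler().fit_transform(x)
print(np.allclose(manual, scaled))  # True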

StandardScaler

Computes the mean and standard deviation of the training set (centering to zero mean and unit variance), so that the exact same transform can later be applied to the test set.
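A minimal sketch of that fit/transform pattern (toy values): the statistics come from the training set only and are reused on the test set:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

ss = StandardScaler().fit(X_train)   # learns mean and std from the training set
print(ss.transform(X_test))          # the test set reuses those statistics

# Same result by hand (sklearn uses the population std, ddof=0)
print((X_test - X_train.mean()) / X_train.std())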

Experiment

Experimental Method

We compare the mean squared error (MSE) of four classic machine learning models under each scaling method, and with no scaling at all, to measure the impact of standardization.
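Here MSE on the held-out test set is computed in the usual way:

\text{MSE}=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2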

Experimental Code and Results

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('huodian.csv')
data = data.sort_values(by='time',ascending=True)
data.reset_index(inplace=True,drop=True)

target = data['T1AOMW_AV']  # target is the label y
del data['T1AOMW_AV']       # drop y from the feature frame
# Find columns with missing values
All_NaN = pd.DataFrame(data.isnull().sum()).reset_index()
All_NaN.columns = ['name','times']
All_NaN.describe()
       times
count  170.0
mean     0.0
std      0.0
min      0.0
25%      0.0
50%      0.0
75%      0.0
max      0.0
# Drop features with little variation (std < 1)
feature_describe_T = data.describe().T
unstd_feature = feature_describe_T[feature_describe_T['std']>=1].index
data = data[unstd_feature]
# Drop the irrelevant time column
del data['time']

test_data = data[:5000]
# Slice the dataset
data1 = data[5000:16060]
target1 = target[5000:16060]
data2 = data[16060:]
target2 = target[16060:]

import scipy.stats as stats
dict_corr = {
    'spearman' : [],
    'pearson' : [],
    'kendall' : [],
    'columns' : []
}
# Compute correlation coefficients between each column and the target
for i in data.columns:
    corr_pear,pval = stats.pearsonr(data[i],target)
    corr_spear,pval = stats.spearmanr(data[i],target)
    corr_kendall,pval = stats.kendalltau(data[i],target)
    
    dict_corr['pearson'].append(abs(corr_pear))
    dict_corr['spearman'].append(abs(corr_spear))
    dict_corr['kendall'].append(abs(corr_kendall))
    
    dict_corr['columns'].append(i)
    
# Select the new feature set: keep features whose correlations with the
# target are above the lower thresholds but below 0.93
dict_corr = pd.DataFrame(dict_corr)
new_fea = list(dict_corr[
    (dict_corr['pearson'] > 0.1) & (dict_corr['spearman'] > 0.15) & (dict_corr['kendall'] > 0.15) &
    (dict_corr['pearson'] < 0.93) & (dict_corr['spearman'] < 0.93) & (dict_corr['kendall'] < 0.93)
]['columns'].values)
#new_fea = list(dict_corr[(dict_corr['pearson']<0.63) & (dict_corr['spearman']<0.69) & (dict_corr['kendall']<0.63)]['columns'].values)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression,Lasso,Ridge
from sklearn.preprocessing import MinMaxScaler,StandardScaler,MaxAbsScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error as mse
from sklearn.svm import SVR
import warnings
warnings.filterwarnings("ignore")
## Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data[new_fea],target,test_size=0.25,random_state=12345)



print('without normalization:')

estimator_lr = Lasso(alpha=0.5).fit(X_train,y_train)
predict_lr = estimator_lr.predict(X_test)
print('Lssao:',mse(predict_lr,y_test))

estimator_rg = Ridge(alpha=0.5).fit(X_train,y_train)
predict_rg = estimator_rg.predict(X_test)
print('Ridge:',mse(predict_rg,y_test))

estimator_svr = SVR(kernel='rbf',C=100,epsilon=0.1).fit(X_train,y_train)
predict_svr = estimator_svr.predict(X_test)
print('SVR:',mse(predict_svr,y_test))

estimator_RF = RandomForestRegressor().fit(X_train,y_train)
predict_RF = estimator_RF.predict(X_test)
print('RF:',mse(predict_RF,y_test))


mm = MinMaxScaler()
mm_x_train = mm.fit_transform(X_train)
mm_x_test = mm.transform(X_test)

print('MinMaxScaler:')
estimator_lr = Lasso(alpha=0.5).fit(mm_x_train,y_train)
predict_lr = estimator_lr.predict(mm_x_test)
print('Lssao:',mse(predict_lr,y_test))

estimator_rg = Ridge(alpha=0.5).fit(mm_x_train,y_train)
predict_rg = estimator_rg.predict(mm_x_test)
print('Ridge:',mse(predict_rg,y_test))

estimator_svr = SVR(kernel='rbf',C=100,epsilon=0.1).fit(mm_x_train,y_train)
predict_svr = estimator_svr.predict(mm_x_test)
print('SVR:',mse(predict_svr,y_test))

estimator_RF = RandomForestRegressor().fit(mm_x_train,y_train)
predict_RF = estimator_RF.predict(mm_x_test)
print('RF:',mse(predict_RF,y_test))



ma = MaxAbsScaler()
ma_x_train = ma.fit_transform(X_train)
ma_x_test = ma.transform(X_test)

print('MaxAbsScaler:')
estimator_lr = Lasso(alpha=0.5).fit(ma_x_train,y_train)
predict_lr = estimator_lr.predict(ma_x_test)
print('Lssao:',mse(predict_lr,y_test))

estimator_rg = Ridge(alpha=0.5).fit(ma_x_train,y_train)
predict_rg = estimator_rg.predict(ma_x_test)
print('Ridge:',mse(predict_rg,y_test))

estimator_svr = SVR(kernel='rbf',C=100,epsilon=0.1).fit(ma_x_train,y_train)
predict_svr = estimator_svr.predict(ma_x_test)
print('SVR:',mse(predict_svr,y_test))

estimator_RF = RandomForestRegressor().fit(ma_x_train,y_train)
predict_RF = estimator_RF.predict(ma_x_test)
print('RF:',mse(predict_RF,y_test))



ss = StandardScaler()
ss_x_train = ss.fit_transform(X_train)
ss_x_test = ss.transform(X_test)


print('StandardScaler:')
estimator_lr = Lasso(alpha=0.5).fit(ss_x_train,y_train)
predict_lr = estimator_lr.predict(ss_x_test)
print('Lssao:',mse(predict_lr,y_test))

estimator_rg = Ridge(alpha=0.5).fit(ss_x_train,y_train)
predict_rg = estimator_rg.predict(ss_x_test)
print('Ridge:',mse(predict_rg,y_test))

estimator_svr = SVR(kernel='rbf',C=100,epsilon=0.1).fit(ss_x_train,y_train)
predict_svr = estimator_svr.predict(ss_x_test)
print('SVR:',mse(predict_svr,y_test))

estimator_RF = RandomForestRegressor().fit(ss_x_train,y_train)
predict_RF = estimator_RF.predict(ss_x_test)
print('RF:',mse(predict_RF,y_test))
without normalization:
Lasso: 64.48569344896079
Ridge: 52.32215979123271
SVR: 2562.6181533319277
RF: 11.342877117923145
MinMaxScaler:
Lasso: 110.64816111661362
Ridge: 55.430338750636416
SVR: 37.81036885831256
RF: 10.204243317509082
MaxAbsScaler:
Lasso: 257.7066786267883
Ridge: 63.91979829622576
SVR: 69.74587878254961
RF: 11.721070230746417
StandardScaler:
Lasso: 81.70216554870805
Ridge: 52.5282264448465
SVR: 7.996381635964344
RF: 9.615276857782204

Analysis of the Results

Comparing the results, a few patterns emerge. For the Lasso model, MSE increases noticeably after scaling. For Ridge, the effect of scaling is small except under MaxAbsScaler. For SVR, the MSE without scaling is enormous, and StandardScaler gives by far the best result. For RF, neither the choice of scaling method nor the absence of scaling makes much difference.
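The insensitivity of RF is expected: a tree splits on per-feature thresholds, and linearly rescaling a feature simply moves those thresholds with it. A quick check on synthetic data (not the experiment's data) suggests the predictions barely change:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = 10 * X[:, 0] + rng.rand(200)

rf_raw = RandomForestRegressor(random_state=0).fit(X, y)
rf_scaled = RandomForestRegressor(random_state=0).fit(X * 1000.0, y)

# Thresholds move with the scale, so predictions agree up to float noise
print(np.allclose(rf_raw.predict(X), rf_scaled.predict(X * 1000.0)))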

Why This Happens

An SVM essentially selects the hyperplane that separates the two classes with the largest margin, with misclassified points penalized. Without normalization, large-scale features dominate the distance computation and distort that hyperplane, so the resulting split is inaccurate and performance on the test set is poor. SVR inherits the same scale sensitivity through its kernel distances.
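A toy illustration of this scale sensitivity with SVR's RBF kernel (synthetic data, default hyperparameters): when one feature's scale dwarfs the other, kernel distances are dominated by it and the model effectively ignores the small-scale feature; standardizing restores it:

import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.rand(300, 2)
X[:, 1] *= 10000.0                  # blow up the scale of feature 1
y = X[:, 0] + X[:, 1] / 10000.0     # both features matter equally
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

raw = SVR(kernel='rbf').fit(X_tr, y_tr)
print('raw:', mean_squared_error(y_te, raw.predict(X_te)))

ss = StandardScaler().fit(X_tr)
std = SVR(kernel='rbf').fit(ss.transform(X_tr), y_tr)
print('scaled:', mean_squared_error(y_te, std.predict(ss.transform(X_te))))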