
Feature Enhancement: Feature Selection

A good combination of data features does not need to be large for a model to perform well. Redundant features may not hurt a model's accuracy, but they waste computation during training. PCA, for example, is mainly used to remove linearly correlated feature combinations, since such redundant combinations contribute nothing extra to model training, while genuinely bad features will lower a model's accuracy. Feature selection differs slightly from methods like PCA that rebuild features from principal components: the features PCA reconstructs are often impossible to interpret, whereas feature selection never modifies the feature values themselves, and instead focuses on finding the small set of features that contributes most to model performance.
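To make that contrast concrete, here is a minimal sketch on a toy non-negative matrix (not the Titanic data used in this article, and using SelectKBest rather than the SelectPercentile used below): PCA outputs new linear-combination axes, while feature selection returns a subset of the original columns unchanged.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2

# a toy non-negative feature matrix (4 samples, 3 features) and binary labels
X = np.array([[1., 2., 0.], [2., 4., 1.], [3., 6., 0.], [4., 8., 1.]])
y = np.array([0, 0, 1, 1])

# PCA rebuilds features as linear combinations: the output columns
# are new axes, not any of the original features
X_pca = PCA(n_components=2).fit_transform(X)

# feature selection keeps a subset of the original columns untouched
selector = SelectKBest(chi2, k=2).fit(X, y)
X_sel = selector.transform(X)

print(X_pca)                   # transformed values, hard to interpret
print(X_sel)                   # original column values, just fewer of them
print(selector.get_support())  # boolean mask of which features were kept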

Below we continue with the Titanic dataset and use feature selection to look for the best feature combination, aiming to improve prediction accuracy.

Python source code:

#coding=utf-8
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# sklearn.cross_validation was removed in modern scikit-learn; use model_selection
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn import feature_selection

#-------------load the data (the original Vanderbilt URL; may no longer be reachable)
titanic=pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')
#-------------separate the features and the target
y=titanic['survived']
X=titanic.drop(['row.names','name','survived'],axis=1)
#-------------fill missing ages with the mean, other missing values with a marker
X['age'].fillna(X['age'].mean(),inplace=True)
X.fillna('UNKNOWN',inplace=True)
#-------------split the data, 25% held out for testing
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=33)
#-------------vectorize the features (one-hot encodes the categorical columns)
vec=DictVectorizer()
X_train=vec.fit_transform(X_train.to_dict(orient='records'))
X_test=vec.transform(X_test.to_dict(orient='records'))
#-------------
print('Dimensions of the vectorized features:',len(vec.feature_names_))
#-------------train a decision tree on all features and measure test-set accuracy
dt=DecisionTreeClassifier(criterion='entropy')
dt.fit(X_train,y_train)
print(dt.score(X_test,y_test))
#-------------keep only the top 20% of features (chi-squared ranking), then retrain with the same config and measure accuracy
fs=feature_selection.SelectPercentile(feature_selection.chi2,percentile=20)
X_train_fs=fs.fit_transform(X_train,y_train)
dt.fit(X_train_fs,y_train)
X_test_fs=fs.transform(X_test)
print(dt.score(X_test_fs,y_test))

#-------------evaluate every odd percentile from 1 to 99 with 5-fold cross-validation
percentiles=range(1,100,2)
results=[]

for i in percentiles:
    fs=feature_selection.SelectPercentile(feature_selection.chi2,percentile=i)
    X_train_fs=fs.fit_transform(X_train,y_train)
    scores=cross_val_score(dt,X_train_fs,y_train,cv=5)
    results=np.append(results,scores.mean())
print(results)
#-------------find the feature percentile with the best cross-validated accuracy
# note: np.where returns an array, which raises "TypeError: only integer scalar
# arrays can be converted to a scalar index" when used to index a list,
# so take an integer argmax instead
opt=int(np.argmax(results))
print('Optimal percentile of features:',percentiles[opt])

#-------------retrain with the best percentile (7%) and the same config, then score on the test data
fs=feature_selection.SelectPercentile(feature_selection.chi2,percentile=7)
X_train_fs=fs.fit_transform(X_train,y_train)
dt.fit(X_train_fs,y_train)
X_test_fs=fs.transform(X_test)
print(dt.score(X_test_fs,y_test))

plt.plot(percentiles,results)
plt.xlabel('percentile of features')
plt.ylabel('accuracy')
plt.show()
Result: (the console output and the accuracy-vs-percentile plot are not reproduced here; the numbers are summarized in the analysis below)
Analysis:

1. After the preliminary feature processing, both the training and test data end up with 474 feature dimensions.
2. Training the decision-tree classifier directly on all 474 feature dimensions gives a test-set accuracy of about 81.76%.
3. Selecting only the top 20% of feature dimensions and predicting with the same model configuration raises the test-set accuracy to about 82.37%.
4. Sweeping the percentage of retained features at a fixed interval, the cross-validated accuracy fluctuates considerably, as the plot shows; the model performs best when the top 7% of feature dimensions are selected (an equivalent way to run this search is sketched after this list).
5. With the top 7% of feature dimensions, the final decision-tree model reaches 85.71% accuracy on the test set, nearly 4 percentage points higher than the initial model that used all features.
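As a footnote to point 4, the manual percentile loop above can also be expressed with scikit-learn's Pipeline and GridSearchCV, which refit the selector inside each cross-validation fold. This is a sketch of an equivalent approach, not the code the article ran; it assumes the vectorized X_train/y_train/X_test/y_test built earlier, and the names pipe and search are illustrative.

from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# chain the selector and the classifier so each CV fold fits both steps together
pipe = Pipeline([
    ('fs', SelectPercentile(chi2)),
    ('dt', DecisionTreeClassifier(criterion='entropy')),
])
# search the same odd percentiles (1, 3, ..., 99) with 5-fold cross-validation
search = GridSearchCV(pipe, {'fs__percentile': list(range(1, 100, 2))}, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
print(search.score(X_test, y_test))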