
Feature Enhancement: Feature Selection

A good combination of data features does not need to be large for a model to perform well. Redundant features may not hurt a model's accuracy, but they waste computation during training. PCA, for example, is mainly used to remove linearly correlated feature combinations, since such redundant combinations contribute nothing extra to model training, while genuinely bad features will lower a model's accuracy. Feature selection differs slightly from methods like PCA that rebuild features from principal components: the features PCA reconstructs are often impossible to interpret, whereas feature selection never modifies the feature values themselves, and instead focuses on finding the small set of features that contributes most to model performance.
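To make that contrast concrete, here is a minimal sketch on a toy non-negative matrix (not the Titanic data used in this article, and using SelectKBest rather than the SelectPercentile used below): PCA outputs new linear-combination axes, while feature selection returns a subset of the original columns unchanged.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2

# a toy non-negative feature matrix (4 samples, 3 features) and binary labels
X = np.array([[1., 2., 0.], [2., 4., 1.], [3., 6., 0.], [4., 8., 1.]])
y = np.array([0, 0, 1, 1])

# PCA rebuilds features as linear combinations: the output columns
# are new axes, not any of the original features
X_pca = PCA(n_components=2).fit_transform(X)

# feature selection keeps a subset of the original columns untouched
selector = SelectKBest(chi2, k=2).fit(X, y)
X_sel = selector.transform(X)

print(X_pca)                   # transformed values, hard to interpret
print(X_sel)                   # original column values, just fewer of them
print(selector.get_support())  # boolean mask of which features were kept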

Below we continue with the Titanic dataset and use feature selection to look for the best feature combination, aiming to improve prediction accuracy.

Python source code:

#coding=utf-8
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# sklearn.cross_validation was removed in modern scikit-learn; use model_selection
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn import feature_selection

#-------------load the data (the original Vanderbilt URL; may no longer be reachable)
titanic=pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')
#-------------separate the features and the target
y=titanic['survived']
X=titanic.drop(['row.names','name','survived'],axis=1)
#-------------fill missing ages with the mean, other missing values with a marker
X['age'].fillna(X['age'].mean(),inplace=True)
X.fillna('UNKNOWN',inplace=True)
#-------------split the data, 25% held out for testing
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=33)
#-------------vectorize the features (one-hot encodes the categorical columns)
vec=DictVectorizer()
X_train=vec.fit_transform(X_train.to_dict(orient='records'))
X_test=vec.transform(X_test.to_dict(orient='records'))
#-------------
print('Dimensions of the vectorized features:',len(vec.feature_names_))
#-------------train a decision tree on all features and measure test-set accuracy
dt=DecisionTreeClassifier(criterion='entropy')
dt.fit(X_train,y_train)
print(dt.score(X_test,y_test))
#-------------keep only the top 20% of features (chi-squared ranking), then retrain with the same config and measure accuracy
fs=feature_selection.SelectPercentile(feature_selection.chi2,percentile=20)
X_train_fs=fs.fit_transform(X_train,y_train)
dt.fit(X_train_fs,y_train)
X_test_fs=fs.transform(X_test)
print(dt.score(X_test_fs,y_test))

#-------------evaluate every odd percentile from 1 to 99 with 5-fold cross-validation
percentiles=range(1,100,2)
results=[]

for i in percentiles:
    fs=feature_selection.SelectPercentile(feature_selection.chi2,percentile=i)
    X_train_fs=fs.fit_transform(X_train,y_train)
    scores=cross_val_score(dt,X_train_fs,y_train,cv=5)
    results=np.append(results,scores.mean())
print(results)
#-------------find the feature percentile with the best cross-validated accuracy
# note: np.where returns an array, which raises "TypeError: only integer scalar
# arrays can be converted to a scalar index" when used to index a list,
# so take an integer argmax instead
opt=int(np.argmax(results))
print('Optimal percentile of features:',percentiles[opt])

#-------------retrain with the best percentile (7%) and the same config, then score on the test data
fs=feature_selection.SelectPercentile(feature_selection.chi2,percentile=7)
X_train_fs=fs.fit_transform(X_train,y_train)
dt.fit(X_train_fs,y_train)
X_test_fs=fs.transform(X_test)
print(dt.score(X_test_fs,y_test))

plt.plot(percentiles,results)
plt.xlabel('percentile of features')
plt.ylabel('accuracy')
plt.show()
Result: (the console output and the accuracy-vs-percentile plot are not reproduced here; the numbers are summarized in the analysis below)
Analysis:

1. After the preliminary feature processing, both the training and test data end up with 474 feature dimensions.
2. Training the decision-tree classifier directly on all 474 feature dimensions gives a test-set accuracy of about 81.76%.
3. Selecting only the top 20% of feature dimensions and predicting with the same model configuration raises the test-set accuracy to about 82.37%.
4. Sweeping the percentage of retained features at a fixed interval, the cross-validated accuracy fluctuates considerably, as the plot shows; the model performs best when the top 7% of feature dimensions are selected (an equivalent way to run this search is sketched after this list).
5. With the top 7% of feature dimensions, the final decision-tree model reaches 85.71% accuracy on the test set, nearly 4 percentage points higher than the initial model that used all features.
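As a footnote to point 4, the manual percentile loop above can also be expressed with scikit-learn's Pipeline and GridSearchCV, which refit the selector inside each cross-validation fold. This is a sketch of an equivalent approach, not the code the article ran; it assumes the vectorized X_train/y_train/X_test/y_test built earlier, and the names pipe and search are illustrative.

from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# chain the selector and the classifier so each CV fold fits both steps together
pipe = Pipeline([
    ('fs', SelectPercentile(chi2)),
    ('dt', DecisionTreeClassifier(criterion='entropy')),
])
# search the same odd percentiles (1, 3, ..., 99) with 5-fold cross-validation
search = GridSearchCV(pipe, {'fs__percentile': list(range(1, 100, 2))}, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
print(search.score(X_test, y_test))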