Feature Selection for Machine Learning with R ①

Feature selection, R · Published 2018-10-20 07:02:16

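Feature selection is an important step in practical machine learning. Most datasets carry more features than a model needs, so identifying the useful ones deserves attention.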

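This post uses the caret package and its recursive feature elimination function, rfe. Its main arguments are: x, a matrix or data frame of predictors; y, the vector of outcomes (numeric or factor); sizes, an integer vector of the subset sizes to test; and rfeControl, a list of options that specifies the prediction model and the resampling method.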

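Several predefined function sets can be supplied as rfeControl$functions: linear regression (lmFuncs), random forest (rfFuncs), naive Bayes (nbFuncs), bagged trees (treebagFuncs), and a generic set that works with any model available through caret's train function (caretFuncs).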
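As a quick orientation, the sketch below inspects one of these predefined function sets and builds a control object without fitting any model. It assumes caret is installed; the choice of treebagFuncs and 5-fold cross-validation is purely illustrative and does not come from the original post.

```r
library(caret)

# Each predefined set is a plain list of functions that rfe() calls in turn
names(rfFuncs)

# Swap in a different learner by passing another set to `functions`.
# treebagFuncs and number = 5 are illustrative choices, not the post's settings.
control <- rfeControl(functions = treebagFuncs, method = "cv", number = 5)

# The control object is itself a list, so the chosen settings can be inspected
control$method
control$number
```

Note that actually running rfe with a given set also needs the package behind the learner to be installed (for example randomForest for rfFuncs, as far as I am aware).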

1 Remove redundant features: drop highly correlated features.

set.seed(1234)
library(mlbench)
library(caret)
data(PimaIndiansDiabetes)
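Matrix <- PimaIndiansDiabetes[,1:8]

library(Hmisc)
# flatten the upper triangle of a correlation matrix into (row, column, cor) rows
up_CorMatrix <- function(cor) {
  ut <- upper.tri(cor)
  data.frame(row = rownames(cor)[row(cor)[ut]],
             column = rownames(cor)[col(cor)[ut]],
             cor = cor[ut])
}

# rcorr() returns the correlation matrix in $r
res <- rcorr(as.matrix(Matrix))
cor_data <- up_CorMatrix(res$r)
# keep pairs with correlation above 0.5 (use abs(cor) to also catch strong negative correlations)
cor_data <- subset(cor_data, cor > 0.5)
cor_data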
        row column       cor
22 pregnant    age 0.5443412
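caret also ships a helper, findCorrelation, that turns a correlation matrix into a list of columns to drop. The following is a minimal sketch of that alternative, assuming caret and mlbench are installed; the 0.5 cutoff simply mirrors the threshold used above, and the variable names to_drop and reduced are made up for this example.

```r
library(mlbench)
library(caret)

data(PimaIndiansDiabetes)

# Correlation matrix of the eight predictors (column 9 is the outcome)
cor_matrix <- cor(PimaIndiansDiabetes[, 1:8])

# Columns that caret suggests dropping so that no pairwise absolute
# correlation stays above the cutoff
to_drop <- findCorrelation(cor_matrix, cutoff = 0.5, names = TRUE)
to_drop

# Keep every column that was not flagged (the outcome is kept as well)
reduced <- PimaIndiansDiabetes[, setdiff(names(PimaIndiansDiabetes), to_drop)]
names(reduced)
```

Since pregnant and age are the only pair above the cutoff here, the helper should flag one of the two; which one depends on their average correlation with the remaining predictors.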

2 Rank features by importance

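Feature importance can be estimated by building a model. Some models, such as decision trees, have a built-in mechanism for reporting feature importance. For other models, the importance of each feature is estimated with an ROC curve analysis. The example below loads the Pima Indians Diabetes dataset and builds a Learning Vector Quantization (LVQ) model; varImp is then used to obtain the feature importance. The output shows that glucose, mass and age are the three most important features and that insulin is the least important.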

# ensure results are repeatable
set.seed(1234)
# load the library
library(mlbench)
library(caret)
# load the dataset
data(PimaIndiansDiabetes)
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
model <- train(diabetes~., data=PimaIndiansDiabetes, method="lvq", preProcess="scale", trControl=control)
# estimate variable importance
importance <- varImp(model, scale=FALSE)
# summarize importance
print(importance)
# plot importance
plot(importance)

ROC curve variable importance

         Importance
glucose      0.7881
mass         0.6876
age          0.6869
pregnant     0.6195
pedigree     0.6062
pressure     0.5865
triceps      0.5536
insulin      0.5379
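To make "ROC curve analysis" concrete, here is a rough sketch that computes the area under the ROC curve for each feature on its own, using only base R plus the dataset from mlbench. The helper auc_by_rank is a hypothetical name introduced for this example; it uses the rank-based (Mann-Whitney) formula and is not caret's internal code, although the numbers should land close to the table above.

```r
library(mlbench)
data(PimaIndiansDiabetes)

# Rank-based (Mann-Whitney) estimate of the area under the ROC curve for a
# single numeric feature x against a two-level factor y
auc_by_rank <- function(x, y, positive = levels(y)[2]) {
  is_pos <- y == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(x)  # ties receive average ranks
  auc <- (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
  max(auc, 1 - auc)  # the direction of the feature does not matter
}

features <- PimaIndiansDiabetes[, 1:8]
outcome <- PimaIndiansDiabetes$diabetes

# One AUC per feature, sorted from most to least discriminative
auc <- sapply(features, auc_by_rank, y = outcome)
sort(auc, decreasing = TRUE)
```

An AUC of 0.5 means the feature alone cannot separate the two classes, which is why the weakest features in the table sit only slightly above that value.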

3 Feature selection

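Automatic feature selection builds many models on different feature subsets and identifies which features help produce an accurate model and which do not. A popular automatic method is Recursive Feature Elimination (RFE).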
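The idea behind RFE can be sketched by hand in base R: fit a model, score it, drop the weakest feature, and repeat. The example below is a deliberately simplified illustration under several assumptions that are not in the original post: the data are simulated, the model is a logistic regression, features are ranked by the absolute z-statistic, and a single hold-out split stands in for cross-validation. caret's rfe is more careful than this.

```r
set.seed(42)

# Simulated data: six candidate features, only x1 and x2 drive the outcome
n <- 400
X <- as.data.frame(matrix(rnorm(n * 6), ncol = 6))
names(X) <- paste0("x", 1:6)
y <- rbinom(n, 1, plogis(2 * X$x1 - 1.5 * X$x2))

train_idx <- sample(n, 300)
remaining <- names(X)
steps <- list()

while (length(remaining) > 0) {
  dat <- data.frame(y = y, X[, remaining, drop = FALSE])
  fit <- glm(y ~ ., data = dat[train_idx, ], family = binomial)

  # Hold-out accuracy for the current feature subset
  prob <- predict(fit, newdata = dat[-train_idx, ], type = "response")
  acc <- mean((prob > 0.5) == (y[-train_idx] == 1))

  # Rank the current features by absolute z-statistic; the weakest goes next
  z <- abs(coef(summary(fit))[remaining, "z value"])
  weakest <- remaining[which.min(z)]

  steps[[length(steps) + 1]] <- data.frame(
    size = length(remaining), accuracy = acc, weakest = weakest,
    stringsAsFactors = FALSE
  )
  remaining <- setdiff(remaining, weakest)
}

# One row per subset size: its hold-out accuracy and the feature removed next
history <- do.call(rbind, steps)
history
```

Reading history from top to bottom shows how accuracy changes as features are removed; the noise features would be expected to leave first, with x1 and x2 surviving longest.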

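The example below applies RFE to the Pima Indians Diabetes dataset. A random forest is used to evaluate the model at each iteration, and the algorithm is configured to evaluate every subset size from 1 to 8. The results show that the 5-feature subset reaches the highest cross-validated accuracy (0.7604), with the 7- and 8-feature subsets performing almost identically.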

# ensure the results are repeatable
set.seed(7)
# load the library
library(mlbench)
library(caret)
# load the data
data(PimaIndiansDiabetes)
# define the control using a random forest selection function
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
# run the RFE algorithm
results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8), rfeControl=control)
# summarize the results
print(results)
# list the chosen features
predictors(results)
# plot the results
plot(results, type=c("g", "o"))


Recursive feature selection

Outer resampling method: Cross-Validated (10 fold) 

Resampling performance over subset size:

 Variables Accuracy  Kappa AccuracySD KappaSD Selected
         1   0.6926 0.2653    0.04916 0.10925
         2   0.7343 0.3906    0.04725 0.10847
         3   0.7356 0.4058    0.05105 0.11126
         4   0.7513 0.4435    0.04222 0.09472
         5   0.7604 0.4539    0.05007 0.11691        *
         6   0.7499 0.4364    0.04327 0.09967
         7   0.7603 0.4574    0.04052 0.09838
         8   0.7590 0.4549    0.04804 0.10781

The top 5 variables (out of 5):
glucose, mass, age, pregnant, insulin
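Because several subset sizes perform almost equally well, it can be reasonable to prefer a smaller subset that is only marginally worse than the best. The sketch below uses caret's pickSizeBest and pickSizeTolerance helpers on the accuracies copied from the printed summary above; the 1.5% tolerance and the data frame name perf are assumptions made for this illustration.

```r
library(caret)

# Resampling results copied from the printed rfe summary above
perf <- data.frame(
  Variables = 1:8,
  Accuracy = c(0.6926, 0.7343, 0.7356, 0.7513, 0.7604, 0.7499, 0.7603, 0.7590)
)

# Subset size with the best accuracy
best_size <- pickSizeBest(perf, metric = "Accuracy", maximize = TRUE)

# Smallest subset size whose accuracy is within 1.5% of the best
small_size <- pickSizeTolerance(perf, metric = "Accuracy", tol = 1.5, maximize = TRUE)

c(best = best_size, tolerant = small_size)
```

With these numbers the 4-feature subset loses about 1.2% relative to the best one, (0.7604 - 0.7513) / 0.7604, so a 1.5% tolerance would be expected to favour it over the 5-feature subset.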
