
Feature Selection with the Caret R Package

Selecting the right features in your data can mean the difference between mediocre performance with long training times and great performance with short training times.

The caret R package provides tools to automatically report on the relevance and importance of attributes in your data and even select the most important features for you.

In this post you will discover the feature selection tools in the Caret R package with standalone recipes in R.

After reading this post you will know:

  • How to remove redundant features from your dataset.
  • How to rank features in your dataset by their importance.
  • How to select features from your dataset using the Recursive Feature Elimination method.

Let’s get started.



Remove Redundant Features

Data can contain attributes that are highly correlated with each other. Many methods perform better if highly correlated attributes are removed.

The caret R package provides the findCorrelation function, which analyzes a correlation matrix of your data's attributes and reports on attributes that can be removed.

The following example loads the Pima Indians Diabetes dataset, which contains a number of biological attributes from medical reports. A correlation matrix is created from these attributes and highly correlated attributes are identified; in this case the age attribute is flagged for removal because it correlates highly with the pregnant attribute.

Generally, you want to remove attributes with an absolute correlation of 0.75 or higher; note that the recipe below uses a lower cutoff of 0.5.

Identify highly correlated features with the caret R package:

# ensure the results are repeatable
set.seed(7)
# load the libraries
library(mlbench)
library(caret)
# load the data
data(PimaIndiansDiabetes)
# calculate correlation matrix
correlationMatrix <- cor(PimaIndiansDiabetes[,1:8])
# summarize the correlation matrix
print(correlationMatrix)
# find attributes that are highly correlated (ideally >0.75)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.5)
# print indexes of highly correlated attributes
print(highlyCorrelated)
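
The recipe above only prints the indexes of the attributes to drop. As a minimal follow-up sketch, the flagged columns can be removed with one line of subsetting; the PimaReduced name below is an illustrative choice, not part of the original recipe.

# drop the flagged columns to obtain a reduced predictor set
# (assumes highlyCorrelated was produced by the recipe above and is non-empty)
PimaReduced <- PimaIndiansDiabetes[, -highlyCorrelated]
# confirm which attributes remain
print(names(PimaReduced))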

Rank Features By Importance

The importance of features can be estimated from data by building a model. Some methods, such as decision trees, have a built-in mechanism to report on variable importance. For other algorithms, the importance can be estimated using a ROC curve analysis conducted for each attribute.

The example below loads the Pima Indians Diabetes dataset and constructs a Learning Vector Quantization (LVQ) model. The varImp function is then used to estimate the variable importance, which is printed and plotted. It shows that the glucose, mass and age attributes are the top 3 most important attributes in the dataset and the insulin attribute is the least important.

Rank features by importance using the caret R package:

# ensure results are repeatable
set.seed(7)
# load the libraries
library(mlbench)
library(caret)
# load the dataset
data(PimaIndiansDiabetes)
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
model <- train(diabetes~., data=PimaIndiansDiabetes, method="lvq", preProcess="scale", trControl=control)
# estimate variable importance
importance <- varImp(model, scale=FALSE)
# summarize importance
print(importance)
# plot importance
plot(importance)
Rank of Features by Importance using the Caret R Package
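
The ROC curve analysis mentioned above is also available directly, without training a model, through caret's filterVarImp function. A minimal sketch on the same dataset is shown below; for a two-class outcome it scores each attribute by the area under its ROC curve, and the rocImportance name is an illustrative choice.

# load the libraries and data
library(mlbench)
library(caret)
data(PimaIndiansDiabetes)
# estimate per-attribute importance from a ROC curve analysis (no model required)
rocImportance <- filterVarImp(x=PimaIndiansDiabetes[,1:8], y=PimaIndiansDiabetes$diabetes)
# summarize importance
print(rocImportance)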

Feature Selection

Automatic feature selection methods can be used to build many models with different subsets of a dataset and identify those attributes that are and are not required to build an accurate model.

A popular automatic method for feature selection provided by the caret R package is called Recursive Feature Elimination or RFE.

The example below applies the RFE method to the Pima Indians Diabetes dataset. A Random Forest algorithm is used on each iteration to evaluate the model. The algorithm is configured to explore every subset size from 1 to 8 attributes. All 8 attributes are selected in this example, although the plot of accuracy against attribute subset size shows that just 4 attributes give almost comparable results.

Automatically select features using the caret R package:

# ensure the results are repeatable
set.seed(7)
# load the libraries
library(mlbench)
library(caret)
# load the data
data(PimaIndiansDiabetes)
# define the control using a random forest selection function
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
# run the RFE algorithm
results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8), rfeControl=control)
# summarize the results
print(results)
# list the chosen features
predictors(results)
# plot the results
plot(results, type=c("g","o"))
Feature Selection Using the Caret R Package
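
The fitted rfe object also records the resampled accuracy at every subset size, which helps when deciding whether a smaller subset is good enough, and the chosen predictors can feed straight into a final model. The sketch below is a minimal, illustrative follow-up to the recipe above; the selected, finalData and finalModel names are not part of the original recipe.

# accuracy and kappa for each subset size explored by RFE
print(results$results)
# keep only the chosen predictors plus the outcome
selected <- predictors(results)
finalData <- PimaIndiansDiabetes[, c(selected, "diabetes")]
# train a final random forest on the reduced dataset
finalModel <- train(diabetes~., data=finalData, method="rf", trControl=trainControl(method="cv", number=10))
print(finalModel)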

Summary

In this post you discovered 3 feature selection methods provided by the caret R package. Specifically, searching for and removing redundant features, ranking features by importance and automatically selecting a subset of the most predictive features.

Three standalone recipes in R were provided that you can copy-and-paste into your own project and adapt for your specific problems.

