
Feature Selection with the Caret R Package

Selecting the right features in your data can mean the difference between mediocre performance with long training times and great performance with short training times.

The caret R package provides tools to automatically report on the relevance and importance of attributes in your data and even select the most important features for you.

In this post you will discover the feature selection tools in the Caret R package with standalone recipes in R.

After reading this post you will know:

  • How to remove redundant features from your dataset.
  • How to rank features in your dataset by their importance.
  • How to select features from your dataset using the Recursive Feature Elimination method.

Let’s get started.



Remove Redundant Features

Data can contain attributes that are highly correlated with each other. Many methods perform better if highly correlated attributes are removed.

The caret R package provides the findCorrelation function, which analyzes a correlation matrix of your data's attributes and reports on attributes that can be removed.

The following example loads the Pima Indians Diabetes dataset, which contains a number of biological attributes from medical reports. A correlation matrix is created from these attributes and highly correlated attributes are identified; in this case the age attribute is flagged for removal because it correlates highly with the pregnant attribute.

Generally, you want to remove attributes with an absolute correlation of 0.75 or higher; note that the recipe below uses a lower cutoff of 0.5.

Identify highly correlated features with the caret R package:

# ensure the results are repeatable
set.seed(7)
# load the libraries
library(mlbench)
library(caret)
# load the data
data(PimaIndiansDiabetes)
# calculate correlation matrix
correlationMatrix <- cor(PimaIndiansDiabetes[,1:8])
# summarize the correlation matrix
print(correlationMatrix)
# find attributes that are highly correlated (ideally >0.75)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.5)
# print indexes of highly correlated attributes
print(highlyCorrelated)
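
The recipe above only prints the indexes of the attributes to drop. As a minimal follow-up sketch, the flagged columns can be removed with one line of subsetting; the PimaReduced name below is an illustrative choice, not part of the original recipe.

# drop the flagged columns to obtain a reduced predictor set
# (assumes highlyCorrelated was produced by the recipe above and is non-empty)
PimaReduced <- PimaIndiansDiabetes[, -highlyCorrelated]
# confirm which attributes remain
print(names(PimaReduced))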

Rank Features By Importance

The importance of features can be estimated from data by building a model. Some methods, such as decision trees, have a built-in mechanism to report on variable importance. For other algorithms, the importance can be estimated using a ROC curve analysis conducted for each attribute.

The example below loads the Pima Indians Diabetes dataset and constructs a Learning Vector Quantization (LVQ) model. The varImp function is then used to estimate the variable importance, which is printed and plotted. It shows that the glucose, mass and age attributes are the top 3 most important attributes in the dataset and the insulin attribute is the least important.

Rank features by importance using the caret R package:

# ensure results are repeatable
set.seed(7)
# load the libraries
library(mlbench)
library(caret)
# load the dataset
data(PimaIndiansDiabetes)
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
model <- train(diabetes~., data=PimaIndiansDiabetes, method="lvq", preProcess="scale", trControl=control)
# estimate variable importance
importance <- varImp(model, scale=FALSE)
# summarize importance
print(importance)
# plot importance
plot(importance)
Rank of Features by Importance using the Caret R Package
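
The ROC curve analysis mentioned above is also available directly, without training a model, through caret's filterVarImp function. A minimal sketch on the same dataset is shown below; for a two-class outcome it scores each attribute by the area under its ROC curve, and the rocImportance name is an illustrative choice.

# load the libraries and data
library(mlbench)
library(caret)
data(PimaIndiansDiabetes)
# estimate per-attribute importance from a ROC curve analysis (no model required)
rocImportance <- filterVarImp(x=PimaIndiansDiabetes[,1:8], y=PimaIndiansDiabetes$diabetes)
# summarize importance
print(rocImportance)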

Feature Selection

Automatic feature selection methods can be used to build many models with different subsets of a dataset and identify those attributes that are and are not required to build an accurate model.

A popular automatic method for feature selection provided by the caret R package is called Recursive Feature Elimination or RFE.

The example below applies the RFE method to the Pima Indians Diabetes dataset. A Random Forest algorithm is used on each iteration to evaluate the model. The algorithm is configured to explore every subset size from 1 to 8 attributes. All 8 attributes are selected in this example, although the plot of accuracy against attribute subset size shows that just 4 attributes give almost comparable results.

Automatically select features using the caret R package:

# ensure the results are repeatable
set.seed(7)
# load the libraries
library(mlbench)
library(caret)
# load the data
data(PimaIndiansDiabetes)
# define the control using a random forest selection function
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
# run the RFE algorithm
results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8), rfeControl=control)
# summarize the results
print(results)
# list the chosen features
predictors(results)
# plot the results
plot(results, type=c("g","o"))
Feature Selection Using the Caret R Package
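
The fitted rfe object also records the resampled accuracy at every subset size, which helps when deciding whether a smaller subset is good enough, and the chosen predictors can feed straight into a final model. The sketch below is a minimal, illustrative follow-up to the recipe above; the selected, finalData and finalModel names are not part of the original recipe.

# accuracy and kappa for each subset size explored by RFE
print(results$results)
# keep only the chosen predictors plus the outcome
selected <- predictors(results)
finalData <- PimaIndiansDiabetes[, c(selected, "diabetes")]
# train a final random forest on the reduced dataset
finalModel <- train(diabetes~., data=finalData, method="rf", trControl=trainControl(method="cv", number=10))
print(finalModel)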

Summary

In this post you discovered 3 feature selection methods provided by the caret R package. Specifically, searching for and removing redundant features, ranking features by importance and automatically selecting a subset of the most predictive features.

Three standalone recipes in R were provided that you can copy-and-paste into your own project and adapt for your specific problems.

