How To Estimate Model Accuracy in R Using The Caret Package

阿新 • Published: 2019-01-12

When you are building a predictive model, you need a way to evaluate the capability of the model on unseen data.

This is typically done by estimating accuracy using data that was not used to train the model, such as a test set, or by using cross validation. The caret package in R provides a number of methods to estimate the accuracy of a machine learning algorithm.

In this post you will discover 5 approaches for estimating model performance on unseen data. You will also have access to recipes in R using the caret package for each method that you can copy and paste into your own project right now.

Estimating Model Accuracy

We have considered model accuracy before in the configuration of test options in a test harness. You can read more in the post: How To Choose The Right Test Options When Evaluating Machine Learning Algorithms.

In this post you are going to discover 5 different methods that you can use to estimate model accuracy.

They are as follows and each will be described in turn:

Data Split
Bootstrap
k-fold Cross Validation
Repeated k-fold Cross Validation
Leave One Out Cross Validation

Generally, I would recommend Repeated k-fold Cross Validation, but each method has its features and benefits, especially when the amount of data or space and time complexity are considered. Consider which approach best suits your problem.

Data Split

Data splitting involves partitioning the data into an explicit training dataset used to prepare the model and an unseen test dataset used to evaluate the model's performance on unseen data.

It is useful when you have a very large dataset, so that the test dataset can provide a meaningful estimate of performance, or when you are using slow methods and need a quick approximation of performance.

The example below splits the iris dataset so that 80% is used for training a Naive Bayes model and 20% is used to evaluate the model's performance.

Data Split in R

# load the libraries
library(caret)
library(klaR)
# load the iris dataset
data(iris)
# define an 80%/20% train/test split of the dataset
split=0.80
trainIndex <- createDataPartition(iris$Species, p=split, list=FALSE)
data_train <- iris[ trainIndex,]
data_test <- iris[-trainIndex,]
# train a naive bayes model
model <- NaiveBayes(Species~., data=data_train)
# make predictions
x_test <- data_test[,1:4]
y_test <- data_test[,5]
predictions <- predict(model, x_test)
# summarize results
confusionMatrix(predictions$class, y_test)
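
To make the idea of a stratified split more concrete, here is a minimal sketch in base R that samples 80% of the rows within each species by hand. As I understand it, this is roughly what createDataPartition does for a factor outcome, but treat the code as an illustration under that assumption and not as caret's actual implementation. The seed and the variable names (rows_by_class, train_idx) are arbitrary choices of mine.

```r
# Sketch: an 80%/20% stratified split done by hand in base R.
data(iris)
set.seed(7)  # arbitrary seed, only so the split is repeatable
p <- 0.80

# group the row numbers by class, then sample 80% of the rows within each class
rows_by_class <- split(seq_len(nrow(iris)), iris$Species)
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  sample(rows, size = round(p * length(rows)))
}))

train <- iris[train_idx, ]
test <- iris[-train_idx, ]

# class counts in each partition keep the original class balance
print(table(train$Species))
print(table(test$Species))
```

Because the sampling happens inside each class, both partitions keep the same class proportions as the full dataset, which matters most when classes are imbalanced.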

Bootstrap

Bootstrap resampling involves taking random samples from the dataset (with replacement) against which to evaluate the model. In aggregate, the results provide an indication of the variance of the model's performance. Typically, a large number of resampling iterations are performed (thousands or tens of thousands).

The following example uses a bootstrap with 100 resamples to prepare a Naive Bayes model.

Data Bootstrap in R

# load the library
library(caret)
# load the iris dataset
data(iris)
# define training control
train_control <- trainControl(method="boot", number=100)
# train the model
model <- train(Species~., data=iris, trControl=train_control, method="nb")
# summarize results
print(model)
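
If you want to see what a single bootstrap resample looks like, the sketch below draws one by hand in base R. It samples row numbers with replacement and treats the rows that were never drawn (often called the out-of-bag rows) as the holdout set. I am assuming this mirrors how the resamples above are evaluated; the seed and variable names are my own.

```r
# Sketch: one bootstrap resample and its out-of-bag rows in base R.
data(iris)
set.seed(7)  # arbitrary seed, only so the resample is repeatable
n <- nrow(iris)

# draw n row numbers with replacement; some rows repeat, others are never drawn
boot_idx <- sample(n, size = n, replace = TRUE)

# rows that were never drawn can serve as the holdout set for this resample
oob_idx <- setdiff(seq_len(n), boot_idx)

boot_train <- iris[boot_idx, ]
oob_test <- iris[oob_idx, ]

cat("distinct rows in the resample:", length(unique(boot_idx)), "\n")
cat("out-of-bag rows held out:", length(oob_idx), "\n")
```

Repeating this many times and averaging the holdout accuracy gives the bootstrap estimate, and the spread across resamples gives a feel for its variance.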

k-fold Cross Validation

The k-fold cross validation method involves splitting the dataset into k subsets. Each subset is held out in turn while the model is trained on all other subsets. This process is repeated until accuracy has been determined for each instance in the dataset, and an overall accuracy estimate is provided.

It is a robust method for estimating accuracy, and the size of k can tune the amount of bias in the estimate, with popular values set to 3, 5, 7 and 10.

The following example uses 10-fold cross validation to estimate the accuracy of Naive Bayes on the iris dataset.

k-fold Cross Validation in R

# load the library
library(caret)
# load the iris dataset
data(iris)
# define training control
train_control <- trainControl(method="cv", number=10)
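# fix the parameters of the algorithm
# (recent caret versions also expect an "adjust" column for method="nb";
# it has no effect when usekernel is FALSE)
grid <- expand.grid(fL=0, usekernel=FALSE, adjust=1)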
# train the model
model <- train(Species~., data=iris, trControl=train_control, method="nb", tuneGrid=grid)
# summarize results
print(model)
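
The hold-out-and-train loop described above can also be written out by hand. The sketch below does 10-fold cross validation in base R. To keep it free of extra packages, it swaps in a simple nearest-centroid classifier as a placeholder for Naive Bayes, so the numbers it prints will not match the caret example. The helper name nearest_centroid and the seed are my own choices.

```r
# Sketch: 10-fold cross validation by hand in base R.
data(iris)
set.seed(7)  # arbitrary seed, only so the folds are repeatable
k <- 10
n <- nrow(iris)

# assign every row to one of k folds at random (equal-sized folds here)
fold_id <- sample(rep(seq_len(k), length.out = n))

# placeholder classifier (NOT Naive Bayes): predict the class whose
# mean feature vector is closest to the test row
nearest_centroid <- function(train, test) {
  centroids <- aggregate(. ~ Species, data = train, FUN = mean)
  m <- as.matrix(centroids[, -1])
  x <- as.matrix(test[, names(centroids)[-1]])
  # squared Euclidean distance from every test row to every centroid
  d <- sapply(seq_len(nrow(m)), function(j) rowSums(sweep(x, 2, m[j, ])^2))
  d <- matrix(d, nrow = nrow(x))
  as.character(centroids$Species[apply(d, 1, which.min)])
}

# hold out each fold in turn, train on the rest, score on the held-out fold
fold_accuracy <- numeric(k)
for (i in seq_len(k)) {
  held_out <- fold_id == i
  pred <- nearest_centroid(iris[!held_out, ], iris[held_out, ])
  fold_accuracy[i] <- mean(pred == as.character(iris$Species[held_out]))
}

print(fold_accuracy)        # one accuracy per fold
print(mean(fold_accuracy))  # overall cross validation estimate
```

Every row is used for testing exactly once and for training k-1 times, which is why the averaged estimate is usually more stable than a single train/test split.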

Repeated k-fold Cross Validation

The process of splitting the data into k folds can be repeated a number of times; this is called Repeated k-fold Cross Validation. The final model accuracy is taken as the mean across the repeats.

The following example uses 10-fold cross validation with 3 repeats to estimate the accuracy of Naive Bayes on the iris dataset.

Repeated k-fold Cross Validation in R

# load the library
library(caret)
# load the iris dataset
data(iris)
# define training control
train_control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
model <- train(Species~., data=iris, trControl=train_control, method="nb")
# summarize results
print(model)
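
It can help to see how many resamples this configuration actually produces. The base R sketch below builds the held-out sets for 10 folds repeated 3 times, giving 30 resamples in total, and checks that each row is held out once per repeat. The labels of the form Fold01.Rep1 are only illustrative names, and the seed is arbitrary.

```r
# Sketch: the held-out sets behind 10-fold cross validation with 3 repeats.
data(iris)
set.seed(7)  # arbitrary seed, only so the folds are repeatable
k <- 10
repeats <- 3
n <- nrow(iris)

# one column of fold labels per repeat; each column is a fresh random split
fold_id <- replicate(repeats, sample(rep(seq_len(k), length.out = n)))

# collect the row numbers held out in each of the k * repeats resamples
resamples <- list()
for (r in seq_len(repeats)) {
  for (i in seq_len(k)) {
    name <- sprintf("Fold%02d.Rep%d", i, r)
    resamples[[name]] <- which(fold_id[, r] == i)
  }
}
print(length(resamples))

# count how often each row is held out across all resamples
held_out_count <- table(factor(unlist(resamples), levels = seq_len(n)))
print(all(held_out_count == repeats))
```

A model is trained and scored once per resample, and the reported accuracy is the average over all of those scores.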

Leave One Out Cross Validation

In Leave One Out Cross Validation (LOOCV), a single data instance is held out and a model is constructed on all other data instances in the training set. This is repeated for all data instances, so it is equivalent to k-fold cross validation with k set to the number of instances.

The following example demonstrates LOOCV to estimate the accuracy of Naive Bayes on the iris dataset.

Leave One Out Cross Validation in R

# load the library
library(caret)
# load the iris dataset
data(iris)
# define training control
train_control <- trainControl(method="LOOCV")
# train the model
model <- train(Species~., data=iris, trControl=train_control, method="nb")
# summarize results
print(model)
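
For a compact view of what leaving one instance out means, the sketch below runs LOOCV in base R with a 1-nearest-neighbour classifier as a stand-in for Naive Bayes. I chose that placeholder because its leave-one-out prediction is simply the label of the closest other row, so no explicit loop is needed. The results will differ from the caret example above.

```r
# Sketch: leave one out cross validation with a 1-nearest-neighbour placeholder.
data(iris)
x <- as.matrix(iris[, 1:4])

# pairwise Euclidean distances between all rows
d <- as.matrix(dist(x))

# leaving a row out means it may not be chosen as its own neighbour
diag(d) <- Inf

# each row is predicted from the closest of the remaining 149 rows
nearest <- apply(d, 1, which.min)
pred <- iris$Species[nearest]

# one correct/incorrect flag per held-out row; the mean is the LOOCV accuracy
correct <- pred == iris$Species
print(mean(correct))
```

With most other models, including Naive Bayes, the same idea needs one model fit per instance, which is why LOOCV becomes expensive on large datasets.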

Summary

In this post you discovered 5 different methods that you can use to estimate the accuracy of your model on unseen data.

Those methods were: Data Split, Bootstrap, k-fold Cross Validation, Repeated k-fold Cross Validation, and Leave One Out Cross Validation.

You can learn more about the caret package in R at the caret package homepage and the caret package CRAN page. If you would like to master the caret package, I would recommend the book written by the author of the package, titled: Applied Predictive Modeling, especially Chapter 4 on overfitting models.
