How to Build an Ensemble Of Machine Learning Algorithms in R (ready to use boosting, bagging and stacking)

Ensembles can give you a boost in accuracy on your dataset.

In this post you will discover how you can create three of the most powerful types of ensembles in R.

This case study will step you through Boosting, Bagging and Stacking and show you how you can continue to ratchet up the accuracy of the models on your own datasets.

Let’s get started.

Build an Ensemble Of Machine Learning Algorithms in R
Photo by Barbara Hobbs, some rights reserved.

Increase The Accuracy Of Your Models

It can take time to find well performing machine learning algorithms for your dataset. This is because of the trial and error nature of applied machine learning.

Once you have a shortlist of accurate models, you can use algorithm tuning to get the most from each algorithm.

Another approach that you can use to increase accuracy on your dataset is to combine the predictions of multiple different models together.

This is called an ensemble prediction.

Combine Model Predictions Into Ensemble Predictions

The three most popular methods for combining the predictions from different models are:

  • Bagging. Building multiple models (typically of the same type) from different subsamples of the training dataset (sketched in code below).
  • Boosting. Building multiple models (typically of the same type), each of which learns to fix the prediction errors of a prior model in the chain.
  • Stacking. Building multiple models (typically of differing types) and a supervisor model that learns how to best combine the predictions of the primary models.
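
To make the bagging idea concrete before turning to caret, here is a minimal hand-rolled sketch; the bag_predict helper, the rpart base learner, and the 25 bootstrap samples are all illustrative assumptions, not part of the original recipe:

# A minimal hand-rolled bagging sketch (illustrative only):
# fit one rpart tree per bootstrap sample, then majority-vote the predictions
library(rpart)
bag_predict <- function(train_data, newdata, n_models = 25) {
  # one column of predicted class labels per bootstrap model
  votes <- sapply(seq_len(n_models), function(i) {
    boot <- train_data[sample(nrow(train_data), replace = TRUE), ]  # bootstrap resample
    fit <- rpart(Class ~ ., data = boot, method = "class")
    as.character(predict(fit, newdata, type = "class"))
  })
  votes <- matrix(votes, nrow = nrow(newdata))  # guard the single-row case
  apply(votes, 1, function(row) names(which.max(table(row))))  # majority vote
}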

This post will not explain each of these methods. It assumes you are generally familiar with machine learning algorithms and ensemble methods and that you are looking for information on how to create ensembles with R.


Ensemble Machine Learning in R

You can create ensembles of machine learning algorithms in R.

There are three main techniques for creating an ensemble of machine learning algorithms in R: Boosting, Bagging and Stacking. In this section, we will look at each in turn.

Before we start building ensembles, let’s define our test set-up.

Test Dataset

All of the examples of ensemble predictions in this case study will use the ionosphere dataset.

This is a dataset available from the UCI Machine Learning Repository. It describes high-frequency antenna returns from high-energy particles in the atmosphere and whether or not each return shows structure. It is a binary classification problem with 351 instances and 34 numerical attributes.

Let’s load the libraries and the dataset.

# Load libraries
library(mlbench)
library(caret)
library(caretEnsemble)
# Load the dataset
data(Ionosphere)
dataset <- Ionosphere
dataset <- dataset[,-2]
dataset$V1 <- as.numeric(as.character(dataset$V1))

Note that the first attribute was a factor (0,1) and has been transformed to be numeric for consistency with all of the other numeric attributes. Also note that the second attribute is a constant and has been removed.
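
As a quick sanity check (a small snippet added here for illustration), you can confirm that the remaining predictors are all numeric and the class is a factor:

# Check column types after the clean-up: the predictors should be numeric, Class a factor
table(sapply(dataset, class))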

Here is a sneak-peek at the first few rows of the ionosphere dataset.

> head(dataset)
  V1      V3       V4       V5       V6       V7       V8      V9      V10     V11      V12     V13      V14      V15
1  1 0.99539 -0.05889  0.85243  0.02306  0.83398 -0.37708 1.00000  0.03760 0.85243 -0.17755 0.59755 -0.44945  0.60536
2  1 1.00000 -0.18829  0.93035 -0.36156 -0.10868 -0.93597 1.00000 -0.04549 0.50874 -0.67743 0.34432 -0.69707 -0.51685
3  1 1.00000 -0.03365  1.00000  0.00485  1.00000 -0.12062 0.88965  0.01198 0.73082  0.05346 0.85443  0.00827  0.54591
4  1 1.00000 -0.45161  1.00000  1.00000  0.71216 -1.00000 0.00000  0.00000 0.00000  0.00000 0.00000  0.00000 -1.00000
5  1 1.00000 -0.02401  0.94140  0.06531  0.92106 -0.23255 0.77152 -0.16399 0.52798 -0.20275 0.56409 -0.00712  0.34395
6  1 0.02337 -0.00592 -0.09924 -0.11949 -0.00763 -0.11824 0.14706  0.06637 0.03786 -0.06302 0.00000  0.00000 -0.04572
       V16      V17      V18      V19      V20      V21      V22      V23      V24      V25      V26      V27      V28
1 -0.38223  0.84356 -0.38542  0.58212 -0.32192  0.56971 -0.29674  0.36946 -0.47357  0.56811 -0.51171  0.41078 -0.46168
2 -0.97515  0.05499 -0.62237  0.33109 -1.00000 -0.13151 -0.45300 -0.18056 -0.35734 -0.20332 -0.26569 -0.20468 -0.18401
3  0.00299  0.83775 -0.13644  0.75535 -0.08540  0.70887 -0.27502  0.43385 -0.12062  0.57528 -0.40220  0.58984 -0.22145
4  0.14516  0.54094 -0.39330 -1.00000 -0.54467 -0.69975  1.00000  0.00000  0.00000  1.00000  0.90695  0.51613  1.00000
5 -0.27457  0.52940 -0.21780  0.45107 -0.17813  0.05982 -0.35575  0.02309 -0.52879  0.03286 -0.65158  0.13290 -0.53206
6 -0.15540 -0.00343 -0.10196 -0.11575 -0.05414  0.01838  0.03669  0.01519  0.00888  0.03513 -0.01535 -0.03240  0.09223
       V29      V30      V31      V32      V33      V34 Class
1  0.21266 -0.34090  0.42267 -0.54487  0.18641 -0.45300  good
2 -0.19040 -0.11593 -0.16626 -0.06288 -0.13738 -0.02447   bad
3  0.43100 -0.17365  0.60436 -0.24180  0.56045 -0.38238  good
4  1.00000 -0.20099  0.25682  1.00000 -0.32382  1.00000   bad
5  0.02431 -0.62197 -0.05707 -0.59573 -0.04608 -0.65697  good
6 -0.07859  0.00732  0.00000  0.00000 -0.00039  0.12011   bad

For more information, see the description of the Ionosphere dataset on the UCI Machine Learning Repository.

1. Boosting Algorithms

We can look at two of the most popular boosting machine learning algorithms:

  • C5.0
  • Stochastic Gradient Boosting

Below is an example of the C5.0 and Stochastic Gradient Boosting (via the gbm implementation) algorithms in R. Both algorithms include parameters that are not tuned in this example.

# Example of Boosting Algorithms
control <- trainControl(method="repeatedcv", number=10, repeats=3)
seed <- 7
metric <- "Accuracy"
# C5.0
set.seed(seed)
fit.c50 <- train(Class~., data=dataset, method="C5.0", metric=metric, trControl=control)
# Stochastic Gradient Boosting
set.seed(seed)
fit.gbm <- train(Class~., data=dataset, method="gbm", metric=metric, trControl=control, verbose=FALSE)
# summarize results
boosting_results <- resamples(list(c5.0=fit.c50, gbm=fit.gbm))
summary(boosting_results)
dotplot(boosting_results)

We can see that the C5.0 algorithm produces a more accurate model with an accuracy of 94.58%.

Models: c5.0, gbm
Number of resamples: 30

Accuracy
       Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
c5.0 0.8824  0.9143 0.9437 0.9458  0.9714    1    0
gbm  0.8824  0.9143 0.9429 0.9402  0.9641    1    0
Boosting Machine Learning Algorithms in R

Learn more about caret boosting models here: Boosting Models.
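
Neither boosting algorithm was tuned above. As a sketch of how tuning could look (the grid values and the fit.gbm.tuned name are illustrative assumptions, not from the original recipe), caret's tuneGrid argument lets you search gbm's four tuning parameters:

# Illustrative tuning grid for gbm; the candidate values are assumptions, not recommendations
gbmGrid <- expand.grid(n.trees=c(50, 100, 150),
                       interaction.depth=c(1, 3, 5),
                       shrinkage=0.1,
                       n.minobsinnode=10)
set.seed(seed)
fit.gbm.tuned <- train(Class~., data=dataset, method="gbm", metric=metric,
                       trControl=control, tuneGrid=gbmGrid, verbose=FALSE)
print(fit.gbm.tuned)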

2. Bagging Algorithms

Let’s look at two of the most popular bagging machine learning algorithms:

  • Bagged CART
  • Random Forest

Below is an example of the Bagged CART and Random Forest algorithms in R. Both algorithms include parameters that are not tuned in this example.

# Example of Bagging algorithms
control <- trainControl(method="repeatedcv", number=10, repeats=3)
seed <- 7
metric <- "Accuracy"
# Bagged CART
set.seed(seed)
fit.treebag <- train(Class~., data=dataset, method="treebag", metric=metric, trControl=control)
# Random Forest
set.seed(seed)
fit.rf <- train(Class~., data=dataset, method="rf", metric=metric, trControl=control)
# summarize results
bagging_results <- resamples(list(treebag=fit.treebag, rf=fit.rf))
summary(bagging_results)
dotplot(bagging_results)

We can see that random forest produces a more accurate model with an accuracy of 93.25%.

Models: treebag, rf
Number of resamples: 30

Accuracy
          Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
treebag 0.8529  0.8946 0.9143 0.9183  0.9440    1    0
rf      0.8571  0.9143 0.9420 0.9325  0.9444    1    0
Bagging Machine Learning Algorithms in R

Learn more about caret bagging models here: Bagging Models.
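
The same applies to bagging: as a hedged sketch (the candidate values and the fit.rf.tuned name are illustrative), random forest's single caret tuning parameter mtry can be searched with a grid:

# Illustrative mtry grid for random forest (caret's only tuning parameter for method="rf")
rfGrid <- expand.grid(mtry=c(2, 4, 8, 16))
set.seed(seed)
fit.rf.tuned <- train(Class~., data=dataset, method="rf", metric=metric,
                      trControl=control, tuneGrid=rfGrid)
print(fit.rf.tuned)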

3. Stacking Algorithms

You can combine the predictions of multiple caret models using the caretEnsemble package.

Given a list of caret models, the caretStack() function can be used to specify a higher-order model to learn how to best combine the predictions of sub-models together.

Let’s first look at creating 5 sub-models for the ionosphere dataset, specifically:

  • Linear Discriminant Analysis (LDA)
  • Classification and Regression Trees (CART)
  • Logistic Regression (via Generalized Linear Model or GLM)
  • k-Nearest Neighbors (kNN)
  • Support Vector Machine with a Radial Basis Kernel Function (SVM)

Below is an example that creates these 5 sub-models. Note the new helpful caretList() function provided by the caretEnsemble package for creating a list of standard caret models.

# Example of Stacking algorithms
# create submodels
control <- trainControl(method="repeatedcv", number=10, repeats=3, savePredictions=TRUE, classProbs=TRUE)
algorithmList <- c('lda', 'rpart', 'glm', 'knn', 'svmRadial')
set.seed(seed)
models <- caretList(Class~., data=dataset, trControl=control, methodList=algorithmList)
results <- resamples(models)
summary(results)
dotplot(results)

We can see that the SVM creates the most accurate model with an accuracy of 94.66%.

Models: lda, rpart, glm, knn, svmRadial
Number of resamples: 30

Accuracy
            Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
lda       0.7714  0.8286 0.8611 0.8645  0.9060 0.9429    0
rpart     0.7714  0.8540 0.8873 0.8803  0.9143 0.9714    0
glm       0.7778  0.8286 0.8873 0.8803  0.9167 0.9722    0
knn       0.7647  0.8056 0.8431 0.8451  0.8857 0.9167    0
svmRadial 0.8824  0.9143 0.9429 0.9466  0.9722 1.0000    0
Comparison of Sub-Models for Stacking Ensemble in R

When we combine the predictions of different models using stacking, it is desirable that the predictions made by the sub-models have low correlation. This would suggest that the models are skillful but in different ways, allowing a new classifier to figure out how to get the best from each model for an improved score.

If the predictions from the sub-models were highly correlated (>0.75), then they would be making the same or very similar predictions most of the time, reducing the benefit of combining them.

# correlation between results
modelCor(results)
splom(results)

We can see that all pairs of predictions have generally low correlation. The two methods with the highest correlation between their predictions are Logistic Regression (GLM) and kNN at 0.517, which is still below the 0.75 threshold we would consider high.

                lda     rpart       glm       knn svmRadial
lda       1.0000000 0.2515454 0.2970731 0.5013524 0.1126050
rpart     0.2515454 1.0000000 0.1749923 0.2823324 0.3465532
glm       0.2970731 0.1749923 1.0000000 0.5172239 0.3788275
knn       0.5013524 0.2823324 0.5172239 1.0000000 0.3512242
svmRadial 0.1126050 0.3465532 0.3788275 0.3512242 1.0000000
Correlations Between Predictions Made By Sub-Models in Stacking Ensemble
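
To apply the >0.75 rule of thumb programmatically rather than by eye, a small helper (an illustrative snippet, not part of the original recipe) can flag offending pairs:

# Flag sub-model pairs whose resample accuracies correlate above the 0.75 threshold
cors <- modelCor(results)
cors[upper.tri(cors, diag=TRUE)] <- NA  # keep each pair once
which(cors > 0.75, arr.ind=TRUE)        # empty here: no pair exceeds the threshold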

Let’s combine the predictions of the classifiers using a simple linear model.

# stack using glm
stackControl <- trainControl(method="repeatedcv", number=10, repeats=3, savePredictions=TRUE, classProbs=TRUE)
set.seed(seed)
stack.glm <- caretStack(models, method="glm", metric="Accuracy", trControl=stackControl)
print(stack.glm)
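
The same pattern extends to a nonlinear combiner. As a sketch using caretStack's same interface (the choice of random forest as the meta-model is an assumption here), you can swap glm for rf:

# Stack the same sub-models with a random forest meta-model instead of glm
set.seed(seed)
stack.rf <- caretStack(models, method="rf", metric="Accuracy", trControl=stackControl)
print(stack.rf)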
