How to Build an Ensemble Of Machine Learning Algorithms in R (ready to use boosting, bagging and stacking)

Ensembles can give you a boost in accuracy on your dataset.

In this post you will discover how you can create three of the most powerful types of ensembles in R.

This case study will step you through Boosting, Bagging and Stacking and show you how you can continue to ratchet up the accuracy of the models on your own datasets.

Let’s get started.

Build an Ensemble Of Machine Learning Algorithms in R
Photo by Barbara Hobbs, some rights reserved.

Increase The Accuracy Of Your Models

It can take time to find well performing machine learning algorithms for your dataset. This is because of the trial and error nature of applied machine learning.

Once you have a shortlist of accurate models, you can use algorithm tuning to get the most from each algorithm.

Another approach that you can use to increase accuracy on your dataset is to combine the predictions of multiple different models together.

This is called an ensemble prediction.

Combine Model Predictions Into Ensemble Predictions

The three most popular methods for combining the predictions from different models are:

  • Bagging. Building multiple models (typically of the same type) from different subsamples of the training dataset (sketched in code below).
  • Boosting. Building multiple models (typically of the same type), each of which learns to fix the prediction errors of a prior model in the chain.
  • Stacking. Building multiple models (typically of differing types) and a supervisor model that learns how to best combine the predictions of the primary models.
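
To make the bagging idea concrete before turning to caret, here is a minimal hand-rolled sketch; the bag_predict helper, the rpart base learner, and the 25 bootstrap samples are all illustrative assumptions, not part of the original recipe:

# A minimal hand-rolled bagging sketch (illustrative only):
# fit one rpart tree per bootstrap sample, then majority-vote the predictions
library(rpart)
bag_predict <- function(train_data, newdata, n_models = 25) {
  # one column of predicted class labels per bootstrap model
  votes <- sapply(seq_len(n_models), function(i) {
    boot <- train_data[sample(nrow(train_data), replace = TRUE), ]  # bootstrap resample
    fit <- rpart(Class ~ ., data = boot, method = "class")
    as.character(predict(fit, newdata, type = "class"))
  })
  votes <- matrix(votes, nrow = nrow(newdata))  # guard the single-row case
  apply(votes, 1, function(row) names(which.max(table(row))))  # majority vote
}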

This post will not explain each of these methods. It assumes you are generally familiar with machine learning algorithms and ensemble methods and that you are looking for information on how to create ensembles with R.


Ensemble Machine Learning in R

You can create ensembles of machine learning algorithms in R.

There are three main techniques for creating an ensemble of machine learning algorithms in R: Boosting, Bagging and Stacking. In this section, we will look at each in turn.

Before we start building ensembles, let’s define our test set-up.

Test Dataset

All of the examples of ensemble predictions in this case study will use the ionosphere dataset.

This is a dataset available from the UCI Machine Learning Repository. It describes high-frequency antenna returns from high-energy particles in the atmosphere and whether or not each return shows structure. It is a binary classification problem with 351 instances and 34 numerical attributes.

Let’s load the libraries and the dataset.

# Load libraries
library(mlbench)
library(caret)
library(caretEnsemble)
# Load the dataset
data(Ionosphere)
dataset <- Ionosphere
dataset <- dataset[,-2]
dataset$V1 <- as.numeric(as.character(dataset$V1))

Note that the first attribute was a factor (0,1) and has been transformed to be numeric for consistency with all of the other numeric attributes. Also note that the second attribute is a constant and has been removed.
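
As a quick sanity check (a small snippet added here for illustration), you can confirm that the remaining predictors are all numeric and the class is a factor:

# Check column types after the clean-up: the predictors should be numeric, Class a factor
table(sapply(dataset, class))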

Here is a sneak-peek at the first few rows of the ionosphere dataset.

> head(dataset)
  V1      V3       V4       V5       V6       V7       V8      V9      V10     V11      V12     V13      V14      V15
1  1 0.99539 -0.05889  0.85243  0.02306  0.83398 -0.37708 1.00000  0.03760 0.85243 -0.17755 0.59755 -0.44945  0.60536
2  1 1.00000 -0.18829  0.93035 -0.36156 -0.10868 -0.93597 1.00000 -0.04549 0.50874 -0.67743 0.34432 -0.69707 -0.51685
3  1 1.00000 -0.03365  1.00000  0.00485  1.00000 -0.12062 0.88965  0.01198 0.73082  0.05346 0.85443  0.00827  0.54591
4  1 1.00000 -0.45161  1.00000  1.00000  0.71216 -1.00000 0.00000  0.00000 0.00000  0.00000 0.00000  0.00000 -1.00000
5  1 1.00000 -0.02401  0.94140  0.06531  0.92106 -0.23255 0.77152 -0.16399 0.52798 -0.20275 0.56409 -0.00712  0.34395
6  1 0.02337 -0.00592 -0.09924 -0.11949 -0.00763 -0.11824 0.14706  0.06637 0.03786 -0.06302 0.00000  0.00000 -0.04572
       V16      V17      V18      V19      V20      V21      V22      V23      V24      V25      V26      V27      V28
1 -0.38223  0.84356 -0.38542  0.58212 -0.32192  0.56971 -0.29674  0.36946 -0.47357  0.56811 -0.51171  0.41078 -0.46168
2 -0.97515  0.05499 -0.62237  0.33109 -1.00000 -0.13151 -0.45300 -0.18056 -0.35734 -0.20332 -0.26569 -0.20468 -0.18401
3  0.00299  0.83775 -0.13644  0.75535 -0.08540  0.70887 -0.27502  0.43385 -0.12062  0.57528 -0.40220  0.58984 -0.22145
4  0.14516  0.54094 -0.39330 -1.00000 -0.54467 -0.69975  1.00000  0.00000  0.00000  1.00000  0.90695  0.51613  1.00000
5 -0.27457  0.52940 -0.21780  0.45107 -0.17813  0.05982 -0.35575  0.02309 -0.52879  0.03286 -0.65158  0.13290 -0.53206
6 -0.15540 -0.00343 -0.10196 -0.11575 -0.05414  0.01838  0.03669  0.01519  0.00888  0.03513 -0.01535 -0.03240  0.09223
       V29      V30      V31      V32      V33      V34 Class
1  0.21266 -0.34090  0.42267 -0.54487  0.18641 -0.45300  good
2 -0.19040 -0.11593 -0.16626 -0.06288 -0.13738 -0.02447   bad
3  0.43100 -0.17365  0.60436 -0.24180  0.56045 -0.38238  good
4  1.00000 -0.20099  0.25682  1.00000 -0.32382  1.00000   bad
5  0.02431 -0.62197 -0.05707 -0.59573 -0.04608 -0.65697  good
6 -0.07859  0.00732  0.00000  0.00000 -0.00039  0.12011   bad

For more information, see the description of the Ionosphere dataset on the UCI Machine Learning Repository.

1. Boosting Algorithms

We can look at two of the most popular boosting machine learning algorithms:

  • C5.0
  • Stochastic Gradient Boosting

Below is an example of the C5.0 and Stochastic Gradient Boosting (via the gbm implementation) algorithms in R. Both algorithms include parameters that are not tuned in this example.

# Example of Boosting Algorithms
control <- trainControl(method="repeatedcv", number=10, repeats=3)
seed <- 7
metric <- "Accuracy"
# C5.0
set.seed(seed)
fit.c50 <- train(Class~., data=dataset, method="C5.0", metric=metric, trControl=control)
# Stochastic Gradient Boosting
set.seed(seed)
fit.gbm <- train(Class~., data=dataset, method="gbm", metric=metric, trControl=control, verbose=FALSE)
# summarize results
boosting_results <- resamples(list(c5.0=fit.c50, gbm=fit.gbm))
summary(boosting_results)
dotplot(boosting_results)

We can see that the C5.0 algorithm produces a more accurate model with an accuracy of 94.58%.

Models: c5.0, gbm
Number of resamples: 30

Accuracy
       Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
c5.0 0.8824  0.9143 0.9437 0.9458  0.9714    1    0
gbm  0.8824  0.9143 0.9429 0.9402  0.9641    1    0
Boosting Machine Learning Algorithms in R

Learn more about caret boosting models here: Boosting Models.
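
Neither boosting algorithm was tuned above. As a sketch of how tuning could look (the grid values and the fit.gbm.tuned name are illustrative assumptions, not from the original recipe), caret's tuneGrid argument lets you search gbm's four tuning parameters:

# Illustrative tuning grid for gbm; the candidate values are assumptions, not recommendations
gbmGrid <- expand.grid(n.trees=c(50, 100, 150),
                       interaction.depth=c(1, 3, 5),
                       shrinkage=0.1,
                       n.minobsinnode=10)
set.seed(seed)
fit.gbm.tuned <- train(Class~., data=dataset, method="gbm", metric=metric,
                       trControl=control, tuneGrid=gbmGrid, verbose=FALSE)
print(fit.gbm.tuned)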

2. Bagging Algorithms

Let’s look at two of the most popular bagging machine learning algorithms:

  • Bagged CART
  • Random Forest

Below is an example of the Bagged CART and Random Forest algorithms in R. Both algorithms include parameters that are not tuned in this example.

# Example of Bagging algorithms
control <- trainControl(method="repeatedcv", number=10, repeats=3)
seed <- 7
metric <- "Accuracy"
# Bagged CART
set.seed(seed)
fit.treebag <- train(Class~., data=dataset, method="treebag", metric=metric, trControl=control)
# Random Forest
set.seed(seed)
fit.rf <- train(Class~., data=dataset, method="rf", metric=metric, trControl=control)
# summarize results
bagging_results <- resamples(list(treebag=fit.treebag, rf=fit.rf))
summary(bagging_results)
dotplot(bagging_results)

We can see that random forest produces a more accurate model with an accuracy of 93.25%.

Models: treebag, rf
Number of resamples: 30

Accuracy
          Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
treebag 0.8529  0.8946 0.9143 0.9183  0.9440    1    0
rf      0.8571  0.9143 0.9420 0.9325  0.9444    1    0
Bagging Machine Learning Algorithms in R

Learn more about caret bagging models here: Bagging Models.
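
The same applies to bagging: as a hedged sketch (the candidate values and the fit.rf.tuned name are illustrative), random forest's single caret tuning parameter mtry can be searched with a grid:

# Illustrative mtry grid for random forest (caret's only tuning parameter for method="rf")
rfGrid <- expand.grid(mtry=c(2, 4, 8, 16))
set.seed(seed)
fit.rf.tuned <- train(Class~., data=dataset, method="rf", metric=metric,
                      trControl=control, tuneGrid=rfGrid)
print(fit.rf.tuned)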

3. Stacking Algorithms

You can combine the predictions of multiple caret models using the caretEnsemble package.

Given a list of caret models, the caretStack() function can be used to specify a higher-order model to learn how to best combine the predictions of sub-models together.

Let’s first look at creating 5 sub-models for the ionosphere dataset, specifically:

  • Linear Discriminant Analysis (LDA)
  • Classification and Regression Trees (CART)
  • Logistic Regression (via Generalized Linear Model or GLM)
  • k-Nearest Neighbors (kNN)
  • Support Vector Machine with a Radial Basis Kernel Function (SVM)

Below is an example that creates these 5 sub-models. Note the new helpful caretList() function provided by the caretEnsemble package for creating a list of standard caret models.

# Example of Stacking algorithms
# create submodels
control <- trainControl(method="repeatedcv", number=10, repeats=3, savePredictions=TRUE, classProbs=TRUE)
algorithmList <- c('lda', 'rpart', 'glm', 'knn', 'svmRadial')
set.seed(seed)
models <- caretList(Class~., data=dataset, trControl=control, methodList=algorithmList)
results <- resamples(models)
summary(results)
dotplot(results)

We can see that the SVM creates the most accurate model with an accuracy of 94.66%.

Models: lda, rpart, glm, knn, svmRadial
Number of resamples: 30

Accuracy
            Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
lda       0.7714  0.8286 0.8611 0.8645  0.9060 0.9429    0
rpart     0.7714  0.8540 0.8873 0.8803  0.9143 0.9714    0
glm       0.7778  0.8286 0.8873 0.8803  0.9167 0.9722    0
knn       0.7647  0.8056 0.8431 0.8451  0.8857 0.9167    0
svmRadial 0.8824  0.9143 0.9429 0.9466  0.9722 1.0000    0
Comparison of Sub-Models for Stacking Ensemble in R

When we combine the predictions of different models using stacking, it is desirable that the predictions made by the sub-models have low correlation. This would suggest that the models are skillful but in different ways, allowing a new classifier to figure out how to get the best from each model for an improved score.

If the predictions from the sub-models were highly correlated (>0.75), then they would be making the same or very similar predictions most of the time, reducing the benefit of combining them.

# correlation between results
modelCor(results)
splom(results)

We can see that all pairs of predictions have generally low correlation. The two methods with the highest correlation between their predictions are Logistic Regression (GLM) and kNN at 0.517, which is still below the 0.75 threshold we would consider high.

                lda     rpart       glm       knn svmRadial
lda       1.0000000 0.2515454 0.2970731 0.5013524 0.1126050
rpart     0.2515454 1.0000000 0.1749923 0.2823324 0.3465532
glm       0.2970731 0.1749923 1.0000000 0.5172239 0.3788275
knn       0.5013524 0.2823324 0.5172239 1.0000000 0.3512242
svmRadial 0.1126050 0.3465532 0.3788275 0.3512242 1.0000000
Correlations Between Predictions Made By Sub-Models in Stacking Ensemble
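
To apply the >0.75 rule of thumb programmatically rather than by eye, a small helper (an illustrative snippet, not part of the original recipe) can flag offending pairs:

# Flag sub-model pairs whose resample accuracies correlate above the 0.75 threshold
cors <- modelCor(results)
cors[upper.tri(cors, diag=TRUE)] <- NA  # keep each pair once
which(cors > 0.75, arr.ind=TRUE)        # empty here: no pair exceeds the threshold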

Let’s combine the predictions of the classifiers using a simple linear model.

# stack using glm
stackControl <- trainControl(method="repeatedcv", number=10, repeats=3, savePredictions=TRUE, classProbs=TRUE)
set.seed(seed)
stack.glm <- caretStack(models, method="glm", metric="Accuracy", trControl=stackControl)
print(stack.glm)
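
The same pattern extends to a nonlinear combiner. As a sketch using caretStack's same interface (the choice of random forest as the meta-model is an assumption here), you can swap glm for rf:

# Stack the same sub-models with a random forest meta-model instead of glm
set.seed(seed)
stack.rf <- caretStack(models, method="rf", metric="Accuracy", trControl=stackControl)
print(stack.rf)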
