Get Your Data Ready For Machine Learning in R with Pre-Processing

Preparing data is required to get the best results from machine learning algorithms.

In this post you will discover how to transform your data in order to best expose its structure to machine learning algorithms in R using the caret package.

You will work through 8 popular and powerful data transforms with recipes that you can study or copy and paste into your current or next machine learning project.

Let’s get started.

Pre-Process Your Machine Learning Dataset in R
Photo by Fraser Cairns, some rights reserved.

Need For Data Pre-Processing

You want to get the best accuracy from machine learning algorithms on your datasets.

Some machine learning algorithms require the data to be in a specific form, whereas others can perform better if the data is prepared in a specific way, but not always. Finally, your raw data may not be in the best format to expose the underlying structure and relationships to the predicted variable.

It is important to prepare your data in a way that gives a range of machine learning algorithms the best chance on your problem.

You need to pre-process your raw data as part of your machine learning project.

Data Pre-Processing Methods

It is hard to know which data pre-processing methods to use.

You can use rules of thumb such as:

Instance-based methods are more effective if the input attributes have the same scale.
Regression methods can work better if the input attributes are standardized.

These are heuristics, but not hard and fast laws of machine learning, because sometimes you can get better results if you ignore them.
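To see why scale matters for instance-based methods, here is a minimal sketch in base R. The two instances and the attribute ranges are made-up values chosen purely for illustration; they do not come from any dataset used in this post.

```r
# two made-up instances described by height in metres and income in dollars
a <- c(height=1.60, income=50000)
b <- c(height=1.95, income=50500)
# Euclidean distance on the raw values is almost entirely the income gap
rawDistance <- sqrt(sum((a - b)^2))
print(rawDistance)
# divide each attribute by an assumed plausible range before measuring distance
ranges <- c(height=0.5, income=100000)
scaledDistance <- sqrt(sum(((a - b) / ranges)^2))
print(scaledDistance)
```

On the raw values the large height difference barely registers, while after rescaling it becomes the main contribution to the distance. A method like kNN would see these two cases very differently.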

You should try a range of data transforms with a range of different machine learning algorithms. This will help you discover both good representations for your data and algorithms that are better at exploiting the structure that those representations expose.

It is a good idea to spot check a number of transforms both in isolation as well as combinations of transforms.

In the next section you will discover how you can apply data transforms in order to prepare your data in R using the caret package.

Data Pre-Processing With Caret in R

The caret package in R provides a number of useful data transforms.

These transforms can be used in two ways.

Standalone: Transforms can be modeled from training data and applied to multiple datasets. The model of the transform is prepared using the preProcess() function and applied to a dataset using the predict() function.
Training: Transforms can be prepared and applied automatically during model evaluation. Transforms applied during training are specified by passing the method names to the train() function via its preProcess argument.
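The standalone usage described above can be sketched as follows. This is a rough example rather than a recipe from the rest of the post: the 80/20 split, the seed, and the variable names are assumptions made for illustration.

```r
# load libraries
library(caret)
# load the dataset
data(iris)
# hold back 20% of the rows as a stand-in for new data
set.seed(7)
trainIndex <- createDataPartition(iris$Species, p=0.80, list=FALSE)
trainData <- iris[trainIndex, 1:4]
testData <- iris[-trainIndex, 1:4]
# estimate the transform parameters from the training rows only
preprocessParams <- preProcess(trainData, method=c("center", "scale"))
# apply the same parameters to both datasets
trainTransformed <- predict(preprocessParams, trainData)
testTransformed <- predict(preprocessParams, testData)
# training means are zero by construction, test means are only close to zero
round(colMeans(trainTransformed), 4)
round(colMeans(testTransformed), 4)
```

Estimating the parameters on the training rows alone means the held-back rows are transformed exactly as unseen data would be.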

A number of data pre-processing examples are presented in this section. They use the standalone method, but you can just as easily apply the same transforms during model training.
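For comparison, here is a minimal sketch of applying transforms during training. The choice of kNN, 5-fold cross validation, and the seed are assumptions for illustration only; any algorithm supported by train() could be substituted.

```r
# load libraries
library(caret)
# load the dataset
data(iris)
# evaluate with 5-fold cross validation
set.seed(7)
control <- trainControl(method="cv", number=5)
# center and scale are re-estimated on the training portion of each fold
model <- train(Species~., data=iris, method="knn", preProcess=c("center", "scale"), trControl=control)
# summarize the model, including the pre-processing that was applied
print(model)
```

Used this way, the transform parameters are recalculated inside each resampling fold, so the evaluation does not see the held-out rows when estimating means and standard deviations.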

All of the preprocessing examples in this section are for numerical data. Note that the preprocessing functions will skip over non-numeric data without raising an error.
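A quick way to check this for yourself is to pass the whole iris data frame, including the Species factor. This is a small sketch rather than one of the main recipes.

```r
# load libraries
library(caret)
# load the dataset
data(iris)
# include the non-numeric Species column on purpose
preprocessParams <- preProcess(iris, method=c("center", "scale"))
# the summary should report one ignored variable
print(preprocessParams)
# transform the dataset, the factor column is carried through unchanged
transformed <- predict(preprocessParams, iris)
summary(transformed$Species)
```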

You can learn more about the data transforms provided by the caret package by typing ?preProcess to read the help for the preProcess() function, and by reading the Caret Pre-Processing page.

The data transforms presented are more likely to be useful for algorithms such as regression algorithms, instance-based methods (like kNN and LVQ), support vector machines and neural networks. They are less likely to be useful for tree and rule based methods.

Summary of Transform Methods

Below is a quick summary of the transform methods supported in the method argument of the preProcess() function in caret.

"BoxCox": apply a Box-Cox transform; values must be strictly positive.
"YeoJohnson": apply a Yeo-Johnson transform, like Box-Cox, but values can be zero or negative.
"expoTrans": apply an exponential transform, which like YeoJohnson can handle zero and negative values.
"zv": remove attributes with zero variance (all the same value).
"nzv": remove attributes with near-zero variance (close to the same value).
"center": subtract the mean from values.
"scale": divide values by the standard deviation.
"range": normalize values into the range [0, 1].
"pca": transform data to the principal components.
"ica": transform data to the independent components.
"spatialSign": project data onto a unit circle (a unit sphere in higher dimensions).

The following sections will demonstrate some of the more popular methods.
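The spatial sign transform is not demonstrated in the sections below, so here is a hand-rolled sketch in base R of the idea behind it: each row is divided by its Euclidean length. This is written for illustration only and is not caret's own code; standardizing first is an assumption made so that no single attribute dominates the row length.

```r
# load the dataset
data(iris)
# standardize first so that no attribute dominates the row length
x <- scale(iris[,1:4])
# divide each row by its Euclidean norm
rowNorms <- sqrt(rowSums(x^2))
projected <- x / rowNorms
# every row now has a length of 1
summary(sqrt(rowSums(projected^2)))
```

Because every instance ends up the same distance from the origin, extreme values have less influence on methods that rely on distances.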

1. Scale

The scale transform calculates the standard deviation for an attribute and divides each value by that standard deviation.

# load libraries
library(caret)
# load the dataset
data(iris)
# summarize data
summary(iris[,1:4])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("scale"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris[,1:4])
# summarize the transformed dataset
summary(transformed)

Running the recipe, you will see:

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  

Created from 150 samples and 4 variables

Pre-processing:
  - ignored (0)
  - scaled (4)

  Sepal.Length    Sepal.Width      Petal.Length     Petal.Width    
 Min.   :5.193   Min.   : 4.589   Min.   :0.5665   Min.   :0.1312  
 1st Qu.:6.159   1st Qu.: 6.424   1st Qu.:0.9064   1st Qu.:0.3936  
 Median :7.004   Median : 6.883   Median :2.4642   Median :1.7055  
 Mean   :7.057   Mean   : 7.014   Mean   :2.1288   Mean   :1.5734  
 3rd Qu.:7.729   3rd Qu.: 7.571   3rd Qu.:2.8890   3rd Qu.:2.3615  
 Max.   :9.540   Max.   :10.095   Max.   :3.9087   Max.   :3.2798
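If you want to confirm what the transform did, the same calculation can be done by hand for a single attribute. This is a small check in base R, not part of the original recipe.

```r
# load the dataset
data(iris)
# divide one attribute by its own standard deviation
sepalLength <- iris$Sepal.Length
scaled <- sepalLength / sd(sepalLength)
# the summary should match the Sepal.Length column of the transformed data
summary(scaled)
# the standard deviation of the scaled attribute is 1
sd(scaled)
```

Note that scaling alone does not move the mean to zero, which is why the transformed values above are all still positive.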

2. Center

The center transform calculates the mean for an attribute and subtracts it from each value.

# load libraries
library(caret)
# load the dataset
data(iris)
# summarize data
summary(iris[,1:4])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("center"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris[,1:4])
# summarize the transformed dataset
summary(transformed)

Running the recipe, you will see:

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  

Created from 150 samples and 4 variables

Pre-processing:
  - centered (4)
  - ignored (0)

 Sepal.Length       Sepal.Width        Petal.Length     Petal.Width     
 Min.   :-1.54333   Min.   :-1.05733   Min.   :-2.758   Min.   :-1.0993  
 1st Qu.:-0.74333   1st Qu.:-0.25733   1st Qu.:-2.158   1st Qu.:-0.8993  
 Median :-0.04333   Median :-0.05733   Median : 0.592   Median : 0.1007  
 Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.000   Mean   : 0.0000  
 3rd Qu.: 0.55667   3rd Qu.: 0.24267   3rd Qu.: 1.342   3rd Qu.: 0.6007  
 Max.   : 2.05667   Max.   : 1.34267   Max.   : 3.142   Max.   : 1.3007
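The means that were subtracted are kept inside the prepared transform, which is what allows the same shift to be applied to other datasets later. The sketch below reads them from the mean element of the object returned by preProcess() and compares them with the column means calculated directly.

```r
# load libraries
library(caret)
# load the dataset
data(iris)
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("center"))
# the stored means that will be subtracted from each attribute
print(preprocessParams$mean)
# compare with the column means calculated directly
print(colMeans(iris[,1:4]))
```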

3. Standardize

Combining the scale and center transforms will standardize your data. Attributes will have a mean value of 0 and a standard deviation of 1.

# load libraries
library(caret)
# load the dataset
data(iris)
# summarize data
summary(iris[,1:4])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("center", "scale"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris[,1:4])
# summarize the transformed dataset
summary(transformed)

Notice how we can pass multiple methods in a vector when defining the preProcess procedure in caret. Running the recipe, you will see:

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  

Created from 150 samples and 4 variables

Pre-processing:
  - centered (4)
  - ignored (0)
  - scaled (4)

 Sepal.Length       Sepal.Width       Petal.Length      Petal.Width     
 Min.   :-1.86378   Min.   :-2.4258   Min.   :-1.5623   Min.   :-1.4422  
 1st Qu.:-0.89767   1st Qu.:-0.5904   1st Qu.:-1.2225   1st Qu.:-1.1799  
 Median :-0.05233   Median :-0.1315   Median : 0.3354   Median : 0.1321  
 Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.67225   3rd Qu.: 0.5567   3rd Qu.: 0.7602   3rd Qu.: 0.7880  
 Max.   : 2.48370   Max.   : 3.0805   Max.   : 1.7799   Max.   : 1.7064
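As a cross-check, base R's scale() function performs the same standardization in one call. This sketch is an alternative to the caret recipe, not a replacement for it, because it does not keep a reusable transform object for new data.

```r
# load the dataset
data(iris)
# center and scale every attribute using base R
standardized <- as.data.frame(scale(iris[,1:4], center=TRUE, scale=TRUE))
# each attribute now has a mean of 0 (up to rounding)
round(colMeans(standardized), 4)
# and a standard deviation of 1
sapply(standardized, sd)
```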

4. Normalize

Data values can be scaled into the range [0, 1], which is called normalization.

# load libraries
library(caret)
# load the dataset
data(iris)
# summarize data
summary(iris[,1:4])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("range"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris[,1:4])
# summarize the transformed dataset
summary(transformed)

Running the recipe, you will see:

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  

Created from 150 samples and 4 variables

Pre-processing:
  - ignored (0)
  - re-scaling to [0, 1] (4)


  Sepal.Length     Sepal.Width      Petal.Length     Petal.Width     
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.2222   1st Qu.:0.3333   1st Qu.:0.1017   1st Qu.:0.08333  
 Median :0.4167   Median :0.4167   Median :0.5678   Median :0.50000  
 Mean   :0.4287   Mean   :0.4406   Mean   :0.4675   Mean   :0.45806  
 3rd Qu.:0.5833   3rd Qu.:0.5417   3rd Qu.:0.6949   3rd Qu.:0.70833  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000
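The arithmetic behind normalization is simple enough to sketch by hand in base R. The helper function and the new value of 7.5 below are made up for illustration; the point is that the minimum and maximum come from the data used to prepare the transform, so later values are not guaranteed to stay inside [0, 1].

```r
# load the dataset
data(iris)
# take the minimum and maximum from the data used to prepare the transform
petalLength <- iris$Petal.Length
minValue <- min(petalLength)
maxValue <- max(petalLength)
# hypothetical helper that rescales values using the stored minimum and maximum
normalize <- function(values) {
  (values - minValue) / (maxValue - minValue)
}
# the original values land in [0, 1]
summary(normalize(petalLength))
# a new petal longer than any seen so far maps above 1
normalize(7.5)
```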

5. Box-Cox Transform

When an attribute has a Gaussian-like distribution but is shifted to one side, this is called a skew. The distribution of the attribute can be transformed to reduce the skew and make it more Gaussian. The Box-Cox transform can perform this operation (it assumes all values are positive).

# load libraries
library(mlbench)
library(caret)
# load the dataset
data(PimaIndiansDiabetes)
# summarize pedigree and age
summary(PimaIndiansDiabetes[,7:8])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(PimaIndiansDiabetes[,7:8], method=c("BoxCox"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, PimaIndiansDiabetes[,7:8])
# summarize the transformed dataset (note pedigree and age)
summary(transformed)
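To get a feel for what the transform does, here is a hand-written sketch of the Box-Cox formula in base R. The helper functions, the simulated log-normal data, and the lambda values are all assumptions chosen for illustration. Note that caret estimates lambda from your data for each attribute, whereas this sketch fixes it by hand.

```r
# Box-Cox power transform for a fixed lambda (values must be positive)
boxCox <- function(values, lambda) {
  if (abs(lambda) < 1e-8) {
    # the limit of the formula as lambda approaches 0
    log(values)
  } else {
    (values^lambda - 1) / lambda
  }
}
# simple moment-based measure of skew
skewness <- function(values) {
  mean((values - mean(values))^3) / sd(values)^3
}
# simulate a positive, right-skewed attribute
set.seed(7)
x <- rlnorm(1000)
# skew before and after the transform with two hand-picked lambda values
skewness(x)
skewness(boxCox(x, 0.5))
skewness(boxCox(x, 0))
```

For this simulated data a lambda of 0 (a log transform) should remove most of the skew, because the values were generated from a log-normal distribution.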
