Machine Learning Datasets in R (10 datasets you can use right now)

You need standard datasets to practice machine learning.

In this short post you will discover how you can load standard classification and regression datasets in R.

This post will show you 3 R libraries that you can use to load standard datasets and 10 specific datasets that you can use for machine learning in R.

It is invaluable to load standard datasets in R so that you can test, practice and experiment with machine learning techniques and improve your skill with the platform.

Let’s get started.

Practice On Small Well-Understood Datasets

There are hundreds of standard test datasets that you can use to practice and get better at machine learning.

Most of them are hosted for free on the UCI Machine Learning Repository. These datasets are useful because they are well understood, they are well behaved and they are small.

This last point is critical when practicing machine learning because:

You can download them fast.
You can fit them into memory easily.
You can run algorithms on them quickly.

Access Standard Datasets in R

You can load the standard datasets into R as CSV files.
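
As a minimal sketch of the CSV route, the snippet below writes the built-in iris data to a temporary file and reads it back with read.csv(). The temporary file is only a stand-in for a CSV you would have downloaded yourself, and the variable names are illustrative.

```r
# Write a small dataset to a temporary CSV file (stand-in for a downloaded file)
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

# Read it back; stringsAsFactors = TRUE turns the text label column into a factor
iris_csv <- read.csv(csv_path, header = TRUE, stringsAsFactors = TRUE)

# Confirm the shape and the class labels survived the round trip
dim(iris_csv)
levels(iris_csv$Species)
```

With a real download you would pass the path of your own file to read.csv() and check that the header and column types were detected as you expect.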

There is a more convenient way to load the standard datasets: they have been packaged and are available in third-party R libraries that you can download from the Comprehensive R Archive Network (CRAN).

Which libraries should you use, and which datasets are good to start with?

How To Load Standard Datasets in R

In this section you will discover the libraries that you can use to get access to standard machine learning datasets.

You will also discover specific classification and regression datasets that you can load and use to practice machine learning in R.

Library: datasets

The datasets library ships with base R and is attached by default, which means you do not need to load it explicitly. It includes a large number of datasets that you can use.

You can load a dataset from this library by typing:

data(DataSetName)

For example, to load the very commonly used iris dataset:

data(iris)

To see a list of the datasets available in this library, you can type:

# list all datasets in the package
library(help = "datasets")
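
If you would rather work with the list programmatically than read a help page, a rough alternative is to call data() with a package argument. The sketch below assumes the returned object has a results matrix with Item and Title columns, which is how base R documents it; the variable names are illustrative.

```r
# data() with a package argument returns an object describing its datasets
available <- data(package = "datasets")

# The results matrix has one row per dataset, with Item and Title columns
catalog <- as.data.frame(available$results[, c("Item", "Title")])

# Show the first few entries and the total number of datasets listed
head(catalog)
nrow(catalog)
```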

Some highlight datasets from this library that you could use are listed below.

Iris Flowers Dataset

Description: Predict iris flower species from flower measurements.
Type: Multi-class classification
Dimensions: 150 instances, 5 attributes
Inputs: Numeric
Output: Categorical, 3 class labels
UCI Machine Learning Repository: Description
Published accuracy results: Summary

# iris flowers datasets
data(iris)
dim(iris)
levels(iris$Species)
head(iris)

You will see:

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
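
Once the data is loaded, a common next step when practicing is to check the class balance and hold out some rows for testing. The following is a minimal sketch using only base R; the 80/20 split and the seed value are arbitrary choices for illustration, not something prescribed by this post.

```r
data(iris)

# Check the column types and the class balance before modelling
str(iris)
table(iris$Species)

# Hold out 20% of the rows as a test set (the 80/20 ratio is an arbitrary choice)
set.seed(7)
train_idx <- sample(nrow(iris), size = round(0.8 * nrow(iris)))
iris_train <- iris[train_idx, ]
iris_test <- iris[-train_idx, ]

# Confirm the sizes of the two partitions
dim(iris_train)
dim(iris_test)
```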

Longley’s Economic Regression Data

Description: Predict number of people employed from economic variables
Type: Regression
Dimensions: 16 instances, 7 attributes
Inputs: Numeric
Output: Numeric

# Longley's Economic Regression Data
data(longley)
dim(longley)
head(longley)

You will see:

     GNP.deflator     GNP Unemployed Armed.Forces Population Year Employed
1947         83.0 234.289      235.6        159.0    107.608 1947   60.323
1948         88.5 259.426      232.5        145.6    108.632 1948   61.122
1949         88.2 258.054      368.2        161.6    109.773 1949   60.171
1950         89.5 284.599      335.1        165.0    110.929 1950   61.187
1951         96.2 328.975      209.9        309.9    112.075 1951   63.221
1952         98.1 346.999      193.2        359.4    113.270 1952   63.639
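
To give a feel for how a regression dataset like this can be used for practice, here is a small sketch that fits an ordinary least squares model with lm() from base R. Using every other column as a predictor is an assumption made for brevity, not a recommended model for this data.

```r
data(longley)

# Fit an ordinary least squares model using every other column as a predictor
fit <- lm(Employed ~ ., data = longley)

# Inspect the coefficients and the in-sample fit
summary(fit)

# Compare fitted values with the observed values for the first few years
head(data.frame(observed = longley$Employed, predicted = fitted(fit)))
```

With only 16 rows, the fit is evaluated on the same data it was trained on, so treat the numbers as a smoke test rather than an estimate of predictive skill.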

Library: mlbench

Direct from the manual for the library:

A collection of artificial and real-world machine learning benchmark problems, including, e.g., several data sets from the UCI repository.

You can learn more about the mlbench library on the mlbench CRAN page.

If not installed, you can install this library as follows:

install.packages("mlbench")
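
If you run the same script more than once, you may prefer to install only when the library is missing. The helper below is a hypothetical convenience function (ensure_package is not part of R or mlbench), and the CRAN mirror URL is an assumption you can replace with your preferred mirror.

```r
# Hypothetical helper: install a package from CRAN only if it is missing
ensure_package <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Installation can fail (no network, no write access); report instead of stopping
    tryCatch(
      install.packages(pkg, repos = repos),
      error = function(e) message("Could not install ", pkg, ": ", conditionMessage(e))
    )
  }
  # Report whether the package can now be loaded
  requireNamespace(pkg, quietly = TRUE)
}

ensure_package("mlbench")
```

The function returns TRUE when the library is available afterwards and FALSE otherwise, so you can decide whether to continue.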

You can load the library as follows:

# load the library
library(mlbench)

To see a list of the datasets available in this library, you can type:

# list the contents of the library
library(help = "mlbench")

Some highlight datasets from this library that you could use are listed below.

Boston Housing Data

Description: Predict the house price in Boston from house details
Type: Regression
Dimensions: 506 instances, 14 attributes
Inputs: Numeric
Output: Numeric
UCI Machine Learning Repository: Description

# Boston Housing Data
data(BostonHousing)
dim(BostonHousing)
head(BostonHousing)

You will see:

     crim zn indus chas   nox    rm  age    dis rad tax ptratio      b lstat medv
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7

Wisconsin Breast Cancer Database

Description: Predict whether a cancer is malignant or benign from biopsy details.
Type: Binary Classification
Dimensions: 699 instances, 11 attributes
Inputs: Integer (Nominal)
Output: Categorical, 2 class labels
UCI Machine Learning Repository: Description
Published accuracy results: Summary

# Wisconsin Breast Cancer Database
data(BreastCancer)
dim(BreastCancer)
levels(BreastCancer$Class)
head(BreastCancer)

You will see:

       Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses     Class
1 1000025            5         1          1             1            2           1           3               1       1    benign
2 1002945            5         4          4             5            7          10           3               2       1    benign
3 1015425            3         1          1             1            2           2           3               1       1    benign
4 1016277            6         8          8             1            3           4           3               7       1    benign
5 1017023            4         1          1             3            2           1           3               1       1    benign
6 1017122            8        10         10             8            7          10           9               7       1 malignant

Glass Identification Database

Description: Predict the glass type from chemical properties.
Type: Classification
Dimensions: 214 instances, 10 attributes
Inputs: Numeric
Output: Categorical, 7 class labels (6 of which appear in the data)
UCI Machine Learning Repository: Description
Published accuracy results: Summary

# Glass Identification Database
data(Glass)
dim(Glass)
levels(Glass$Type)
head(Glass)

You will see:

       RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1

Johns Hopkins University Ionosphere database

Description: Predict high-energy structures in the atmosphere from antenna data.
Type: Classification
Dimensions: 351 instances, 35 attributes
Inputs: Numeric
Output: Categorical, 2 class labels
UCI Machine Learning Repository: Description
Published accuracy results: Summary

# Johns Hopkins University Ionosphere database
data(Ionosphere)
dim(Ionosphere)
levels(Ionosphere$Class)
head(Ionosphere)

You will see:

  V1 V2      V3       V4       V5       V6       V7       V8      V9      V10     V11      V12     V13      V14      V15      V16      V17      V18      V19
1  1  0 0.99539 -0.05889  0.85243  0.02306  0.83398 -0.37708 1.00000  0.03760 0.85243 -0.17755 0.59755 -0.44945  0.60536 -0.38223  0.84356 -0.38542  0.58212
2  1  0 1.00000 -0.18829  0.93035 -0.36156 -0.10868 -0.93597 1.00000 -0.04549 0.50874 -0.67743 0.34432 -0.69707 -0.51685 -0.97515  0.05499 -0.62237  0.33109
3  1  0 1.00000 -0.03365  1.00000  0.00485  1.00000 -0.12062 0.88965  0.01198 0.73082  0.05346 0.85443  0.00827  0.54591  0.00299  0.83775 -0.13644  0.75535
4  1  0 1.00000 -0.45161  1.00000  1.00000  0.71216 -1.00000 0.00000  0.00000 0.00000  0.00000 0.00000  0.00000 -1.00000  0.14516  0.54094 -0.39330 -1.00000
5  1  0 1.00000 -0.02401  0.94140  0.06531  0.92106 -0.23255 0.77152 -0.16399 0.52798 -0.20275 0.56409 -0.00712  0.34395 -0.27457  0.52940 -0.21780  0.45107
6  1  0 0.02337 -0.00592 -0.09924 -0.11949 -0.00763 -0.11824 0.14706  0.06637 0.03786 -0.06302 0.00000  0.00000 -0.04572 -0.15540 -0.00343 -0.10196 -0.11575
       V20      V21      V22      V23      V24      V25      V26      V27      V28      V29      V30      V31      V32      V33      V34 Class
1 -0.32192  0.56971 -0.29674  0.36946 -0.47357  0.56811 -0.51171  0.41078 -0.46168  0.21266 -0.34090  0.42267 -0.54487  0.18641 -0.45300  good
2 -1.00000 -0.13151 -0.45300 -0.18056 -0.35734 -0.20332 -0.26569 -0.20468 -0.18401 -0.19040 -0.11593 -0.16626 -0.06288 -0.13738 -0.02447   bad
3 -0.08540  0.70887 -0.27502  0.43385 -0.12062  0.57528 -0.40220  0.58984 -0.22145  0.43100 -0.17365  0.60436 -0.24180  0.56045 -0.38238  good
4 -0.54467 -0.69975  1.00000  0.00000  0.00000  1.00000  0.90695  0.51613  1.00000  1.00000 -0.20099  0.25682  1.00000 -0.32382  1.00000   bad
5 -0.17813  0.05982 -0.35575  0.02309 -0.52879  0.03286 -0.65158  0.13290 -0.53206  0.02431 -0.62197 -0.05707 -0.59573 -0.04608 -0.65697  good
6 -0.05414  0.01838  0.03669  0.01519  0.00888  0.03513 -0.01535 -0.03240  0.09223 -0.07859  0.00732  0.00000  0.00000 -0.00039  0.12011   bad

Pima Indians Diabetes Database

Description: Predict the onset of diabetes in female Pima Indians from medical record data.
Type: Binary Classification
Dimensions: 768 instances, 9 attributes
Inputs: Numeric
Output: Categorical, 2 class labels
UCI Machine Learning Repository: Description
Published accuracy results: Summary

# Pima Indians Diabetes Database
data(PimaIndiansDiabetes)
dim(PimaIndiansDiabetes)
levels(PimaIndiansDiabetes$diabetes)
head(PimaIndiansDiabetes)

You will see:

  pregnant glucose pressure triceps insulin mass pedigree age diabetes
1        6     148       72      35       0 33.6    0.627  50      pos
2        1      85       66      29       0 26.6    0.351  31      neg
3        8     183       64       0       0 23.3    0.672  32      pos
4        1      89       66      23      94 28.1    0.167  21      neg
5        0     137       40      35     168 43.1    2.288  33      pos
6        5     116       74       0       0 25.6    0.201  30      neg
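
Notice the zeros in the triceps and insulin columns above. A reasonable guess, and it is only an assumption here, is that a zero in these medical measurements marks a missing value rather than a real reading. The sketch below counts the zeros in the columns where that seems likely; the choice of columns is illustrative.

```r
# Guarded so the sketch is skipped quietly when mlbench is not installed
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(PimaIndiansDiabetes, package = "mlbench")

  # Columns where a value of 0 is physically implausible (an assumption)
  suspect_cols <- c("glucose", "pressure", "triceps", "insulin", "mass")

  # Count the zeros in each suspect column
  zero_counts <- colSums(PimaIndiansDiabetes[, suspect_cols] == 0)
  print(zero_counts)
}
```

The mlbench library also includes a PimaIndiansDiabetes2 variant in which such values are recorded as NA, which may be a better starting point if you want to practice handling missing data.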
