Machine Learning Datasets in R (10 datasets you can use right now)
You need standard datasets to practice machine learning.
In this short post you will discover how you can load standard classification and regression datasets in R.
This post will show you 3 R libraries that you can use to load standard datasets and 10 specific datasets that you can use for machine learning in R.
It is invaluable to load standard datasets in R so that you can test, practice and experiment with machine learning techniques and improve your skill with the platform.
Let’s get started.
Practice On Small Well-Understood Datasets
There are hundreds of standard test datasets that you can use to practice and get better at machine learning.
Most of them are hosted for free on the UCI Machine Learning Repository. These datasets are useful because they are well understood, they are well behaved and they are small.
This last point is critical when practicing machine learning because:
- You can download them fast.
- You can fit them into memory easily.
- You can run algorithms on them quickly.
Learn more about practicing machine learning using datasets from the UCI Machine Learning Repository in the post:
Access Standard Datasets in R
You can load the standard datasets into R as CSV files.
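As a minimal sketch of the CSV route, the snippet below writes the built-in iris data to a temporary CSV file and reads it back with read.csv(). The temporary file is only a stand-in for a file you would download yourself, and the names csv_path and dataset are illustrative assumptions rather than anything prescribed by this post.

```r
# Stand-in for a downloaded CSV file: write a built-in dataset to a temp file.
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

# Load the CSV file into a data frame.
# stringsAsFactors = TRUE turns text columns (such as class labels) into factors.
dataset <- read.csv(csv_path, header = TRUE, stringsAsFactors = TRUE)

# Confirm the shape and column types of what was loaded.
dim(dataset)
str(dataset)
```

For a real file you would replace csv_path with the location of your own CSV file, and set header = FALSE if the file has no header row.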
There is a more convenient approach to loading the standard datasets. They have been packaged and are available in third-party R libraries that you can download from the Comprehensive R Archive Network (CRAN).
Which libraries should you use, and which datasets are good to start with?
How To Load Standard Datasets in R
In this section you will discover the libraries that you can use to get access to standard machine learning datasets.
You will also discover specific classification and regression datasets that you can load and use to practice machine learning in R.
Library: datasets
The datasets library comes with base R and is attached by default, which means you do not need to explicitly load the library. It includes a large number of datasets that you can use.
You can load a dataset from this library by typing:
    data(DataSetName)
For example, to load the very commonly used iris dataset:
    data(iris)
To see a list of the datasets available in this library, you can type:
    # list all datasets in the package
    library(help="datasets")

Some highlight datasets from this package that you could use are below.
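If you would rather work with the list of datasets as data than read it in a help window, a sketch along the following lines may be useful. It relies on data(package = ...), which returns the package's dataset index; the variable names index and available are assumptions made for illustration.

```r
# data(package = ...) returns the package's dataset index without loading anything.
index <- data(package = "datasets")$results

# Each row describes one dataset; keep just the name and the one-line title.
available <- as.data.frame(index[, c("Item", "Title")], stringsAsFactors = FALSE)

nrow(available)       # how many datasets the package provides
head(available, 10)   # first ten entries
```

Because the result is an ordinary data frame, you can search or filter it, for example with grepl() on the Title column.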
Iris Flowers Dataset
- Description: Predict iris flower species from flower measurements.
- Type: Multi-class classification
- Dimensions: 150 instances, 5 attributes
- Inputs: Numeric
- Output: Categorical, 3 class labels
- UCI Machine Learning Repository: Description
- Published accuracy results: Summary
    # iris flowers dataset
    data(iris)
    dim(iris)
    levels(iris$Species)
    head(iris)

You will see:

      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    1          5.1         3.5          1.4         0.2  setosa
    2          4.9         3.0          1.4         0.2  setosa
    3          4.7         3.2          1.3         0.2  setosa
    4          4.6         3.1          1.5         0.2  setosa
    5          5.0         3.6          1.4         0.2  setosa
    6          5.4         3.9          1.7         0.4  setosa
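Before modeling, it can help to confirm that the data matches the description above. The following is a small sketch of such a check; the names inputs and output are assumptions chosen for illustration.

```r
data(iris)

# Separate the four numeric inputs from the class label.
inputs <- iris[, 1:4]
output <- iris$Species

# All inputs should be numeric, and the output a factor with 3 levels.
sapply(inputs, class)
nlevels(output)

# Count the instances in each class.
table(output)
```

The same pattern (split inputs from output, check types, tabulate the classes) can be applied to the other classification datasets in this post.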
Longley’s Economic Regression Data
- Description: Predict number of people employed from economic variables
- Type: Regression
- Dimensions: 16 instances, 7 attributes
- Inputs: Numeric
- Output: Numeric
    # Longley's Economic Regression Data
    data(longley)
    dim(longley)
    head(longley)

You will see:

         GNP.deflator     GNP Unemployed Armed.Forces Population Year Employed
    1947         83.0 234.289      235.6        159.0    107.608 1947   60.323
    1948         88.5 259.426      232.5        145.6    108.632 1948   61.122
    1949         88.2 258.054      368.2        161.6    109.773 1949   60.171
    1950         89.5 284.599      335.1        165.0    110.929 1950   61.187
    1951         96.2 328.975      209.9        309.9    112.075 1951   63.221
    1952         98.1 346.999      193.2        359.4    113.270 1952   63.639
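To give a sense of how a regression dataset like this might be used for practice, here is a minimal sketch that fits an ordinary least squares model with lm() from base R. The choice of model and the names fit, predictions and rmse are assumptions for illustration, and the error is measured on the training data only, so it is optimistic.

```r
data(longley)

# Fit ordinary least squares: predict Employed from all other columns.
fit <- lm(Employed ~ ., data = longley)

# Inspect coefficients and fit quality.
summary(fit)

# Predictions on the training data, and root mean squared error.
predictions <- predict(fit, longley)
rmse <- sqrt(mean((longley$Employed - predictions)^2))
print(rmse)
```

With only 16 instances, a proper evaluation would use resampling such as cross-validation rather than a single in-sample error.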
Library: mlbench
Direct from the manual for the library:
A collection of artificial and real-world machine learning benchmark problems, including, e.g., several data sets from the UCI repository.
You can learn more about the mlbench library on the mlbench CRAN page.
If not installed, you can install this library as follows:
    install.packages("mlbench")

You can load the library as follows:

    # load the library
    library(mlbench)

To see a list of the datasets available in this library, you can type:

    # list the contents of the library
    library(help="mlbench")

Some highlight datasets from this library that you could use are:
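In a script, you may want to check whether the library is available before using it. The sketch below uses requireNamespace() for that check and lists the dataset names if the library is present; the variable name has_mlbench is an assumption, and installation is deliberately left as a manual step because it needs network access.

```r
# Check for the package without attaching it.
has_mlbench <- requireNamespace("mlbench", quietly = TRUE)

if (has_mlbench) {
  # List the names of the datasets that mlbench provides.
  print(data(package = "mlbench")$results[, "Item"])
} else {
  # Installing needs network access, so it is left as a manual step here.
  message('mlbench is not installed; run install.packages("mlbench") first.')
}
```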
Boston Housing Data
- Description: Predict the house price in Boston from house details
- Type: Regression
- Dimensions: 506 instances, 14 attributes
- Inputs: Numeric
- Output: Numeric
- UCI Machine Learning Repository: Description
    # Boston Housing Data
    data(BostonHousing)
    dim(BostonHousing)
    head(BostonHousing)

You will see:

         crim zn indus chas   nox    rm  age    dis rad tax ptratio      b lstat medv
    1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
    2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
    3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
    4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
    5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2
    6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7
Wisconsin Breast Cancer Database
- Description: Predict whether a cancer is malignant or benign from biopsy details.
- Type: Binary Classification
- Dimensions: 699 instances, 11 attributes
- Inputs: Integer (Nominal)
- Output: Categorical, 2 class labels
- UCI Machine Learning Repository: Description
- Published accuracy results: Summary
    # Wisconsin Breast Cancer Database
    data(BreastCancer)
    dim(BreastCancer)
    levels(BreastCancer$Class)
    head(BreastCancer)

You will see:

           Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses     Class
    1 1000025            5         1          1             1            2           1           3               1       1    benign
    2 1002945            5         4          4             5            7          10           3               2       1    benign
    3 1015425            3         1          1             1            2           2           3               1       1    benign
    4 1016277            6         8          8             1            3           4           3               7       1    benign
    5 1017023            4         1          1             3            2           1           3               1       1    benign
    6 1017122            8        10         10             8            7          10           9               7       1 malignant
Glass Identification Database
- Description: Predict the glass type from chemical properties.
- Type: Classification
- Dimensions: 214 instances, 10 attributes
- Inputs: Numeric
- Output: Categorical, 7 class labels
- UCI Machine Learning Repository: Description
- Published accuracy results: Summary
    # Glass Identification Database
    data(Glass)
    dim(Glass)
    levels(Glass$Type)
    head(Glass)

You will see:

           RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
    1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
    2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
    3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
    4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
    5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
    6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1
Johns Hopkins University Ionosphere database
- Description: Predict high-energy structures in the atmosphere from antenna data.
- Type: Classification
- Dimensions: 351 instances, 35 attributes
- Inputs: Numeric
- Output: Categorical, 2 class labels
- UCI Machine Learning Repository: Description
- Published accuracy results: Summary
    # Johns Hopkins University Ionosphere database
    data(Ionosphere)
    dim(Ionosphere)
    levels(Ionosphere$Class)
    head(Ionosphere)

You will see:

      V1 V2      V3       V4       V5       V6       V7       V8      V9      V10     V11      V12     V13      V14      V15      V16      V17      V18      V19
    1  1  0 0.99539 -0.05889  0.85243  0.02306  0.83398 -0.37708 1.00000  0.03760 0.85243 -0.17755 0.59755 -0.44945  0.60536 -0.38223  0.84356 -0.38542  0.58212
    2  1  0 1.00000 -0.18829  0.93035 -0.36156 -0.10868 -0.93597 1.00000 -0.04549 0.50874 -0.67743 0.34432 -0.69707 -0.51685 -0.97515  0.05499 -0.62237  0.33109
    3  1  0 1.00000 -0.03365  1.00000  0.00485  1.00000 -0.12062 0.88965  0.01198 0.73082  0.05346 0.85443  0.00827  0.54591  0.00299  0.83775 -0.13644  0.75535
    4  1  0 1.00000 -0.45161  1.00000  1.00000  0.71216 -1.00000 0.00000  0.00000 0.00000  0.00000 0.00000  0.00000 -1.00000  0.14516  0.54094 -0.39330 -1.00000
    5  1  0 1.00000 -0.02401  0.94140  0.06531  0.92106 -0.23255 0.77152 -0.16399 0.52798 -0.20275 0.56409 -0.00712  0.34395 -0.27457  0.52940 -0.21780  0.45107
    6  1  0 0.02337 -0.00592 -0.09924 -0.11949 -0.00763 -0.11824 0.14706  0.06637 0.03786 -0.06302 0.00000  0.00000 -0.04572 -0.15540 -0.00343 -0.10196 -0.11575
           V20      V21      V22      V23      V24      V25      V26      V27      V28      V29      V30      V31      V32      V33      V34 Class
    1 -0.32192  0.56971 -0.29674  0.36946 -0.47357  0.56811 -0.51171  0.41078 -0.46168  0.21266 -0.34090  0.42267 -0.54487  0.18641 -0.45300  good
    2 -1.00000 -0.13151 -0.45300 -0.18056 -0.35734 -0.20332 -0.26569 -0.20468 -0.18401 -0.19040 -0.11593 -0.16626 -0.06288 -0.13738 -0.02447   bad
    3 -0.08540  0.70887 -0.27502  0.43385 -0.12062  0.57528 -0.40220  0.58984 -0.22145  0.43100 -0.17365  0.60436 -0.24180  0.56045 -0.38238  good
    4 -0.54467 -0.69975  1.00000  0.00000  0.00000  1.00000  0.90695  0.51613  1.00000  1.00000 -0.20099  0.25682  1.00000 -0.32382  1.00000   bad
    5 -0.17813  0.05982 -0.35575  0.02309 -0.52879  0.03286 -0.65158  0.13290 -0.53206  0.02431 -0.62197 -0.05707 -0.59573 -0.04608 -0.65697  good
    6 -0.05414  0.01838  0.03669  0.01519  0.00888  0.03513 -0.01535 -0.03240  0.09223 -0.07859  0.00732  0.00000  0.00000 -0.00039  0.12011   bad
Pima Indians Diabetes Database
- Description: Predict the onset of diabetes in female Pima Indians from medical record data.
- Type: Binary Classification
- Dimensions: 768 instances, 9 attributes
- Inputs: Numeric
- Output: Categorical, 2 class labels
- UCI Machine Learning Repository: Description
- Published accuracy results: Summary
    # Pima Indians Diabetes Database
    data(PimaIndiansDiabetes)
    dim(PimaIndiansDiabetes)
    levels(PimaIndiansDiabetes$diabetes)
    head(PimaIndiansDiabetes)
Running this prints the dimensions of the dataset, its two class labels and the first six rows.