Machine Learning Datasets in R (10 datasets you can use right now)

You need standard datasets to practice machine learning.

In this short post you will discover how you can load standard classification and regression datasets in R.

This post will show you 3 R libraries that you can use to load standard datasets and 10 specific datasets that you can use for machine learning in R.

It is invaluable to load standard datasets in R so that you can test, practice and experiment with machine learning techniques and improve your skill with the platform.

Let’s get started.

Practice On Small Well-Understood Datasets

There are hundreds of standard test datasets that you can use to practice and get better at machine learning.

Most of them are hosted for free on the UCI Machine Learning Repository. These datasets are useful because they are well understood, they are well behaved and they are small.

This last point is critical when practicing machine learning because:

  • You can download them fast.
  • You can fit them into memory easily.
  • You can run algorithms on them quickly.

Access Standard Datasets in R

You can load the standard datasets into R as CSV files.
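As a minimal sketch of the CSV route, the snippet below writes a small dataset to a temporary CSV file and reads it back with read.csv(). The temporary file is only a stand-in for a CSV you have downloaded yourself; the path and the choice of dataset are assumptions made for illustration.

```r
# Sketch: a temporary CSV file stands in for a downloaded dataset (assumption).
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

# read.csv() returns a data frame. stringsAsFactors = TRUE turns text
# columns (such as a class label) into factors on any R version.
dataset <- read.csv(csv_path, header = TRUE, stringsAsFactors = TRUE)

# Confirm the shape and column types of what was loaded.
dim(dataset)
str(dataset)
```

For a real file you would replace csv_path with the location of your own CSV.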

There is a more convenient approach to loading the standard datasets. They have been packaged and are available in third-party R libraries that you can download from the Comprehensive R Archive Network (CRAN).

Which libraries should you use, and which datasets are good to start with?

How To Load Standard Datasets in R

In this section you will discover the libraries that you can use to get access to standard machine learning datasets.

You will also discover specific classification and regression datasets that you can load and use to practice machine learning in R.

Library: datasets

The datasets library comes with base R which means you do not need to explicitly load the library. It includes a large number of datasets that you can use.

You can load a dataset from this library by typing:

    data(DataSetName)

For example, to load the very commonly used iris dataset:

    data(iris)

To see a list of the datasets available in this library, you can type:

    # list all datasets in the package
    library(help="datasets")
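If you would rather work with the list of datasets as data instead of reading a help page, a small sketch along these lines may help. It relies on data(package = ...) returning an object whose results element is a character matrix with Item and Title columns; the variable name available is just an illustrative choice.

```r
# Sketch: get the dataset index of a package as a matrix instead of a help page.
available <- data(package = "datasets")$results

# Each row describes one dataset: its name (Item) and a short Title.
head(available[, c("Item", "Title")])

# Number of entries listed for the package.
nrow(available)
```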

Some highlight datasets from this package that you could use are below.

Iris Flowers Dataset

  • Description: Predict iris flower species from flower measurements.
  • Type: Multi-class classification
  • Dimensions: 150 instances, 5 attributes
  • Inputs: Numeric
  • Output: Categorical, 3 class labels
  • UCI Machine Learning Repository: Description
  • Published accuracy results: Summary

    # iris flowers dataset
    data(iris)
    dim(iris)
    levels(iris$Species)
    head(iris)

You will see:

      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    1          5.1         3.5          1.4         0.2  setosa
    2          4.9         3.0          1.4         0.2  setosa
    3          4.7         3.2          1.3         0.2  setosa
    4          4.6         3.1          1.5         0.2  setosa
    5          5.0         3.6          1.4         0.2  setosa
    6          5.4         3.9          1.7         0.4  setosa
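Once a dataset is loaded, it can be worth a quick check of the column types and the class balance before you try any algorithms. The following is a small sketch using only base R functions on the iris data.

```r
# Sketch: quick sanity checks on a freshly loaded dataset.
data(iris)

# Column types: four numeric inputs and one factor output.
sapply(iris, class)

# Number of instances per class label.
class_counts <- table(iris$Species)
class_counts

# Basic summary statistics for the numeric inputs.
summary(iris[, 1:4])
```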

Longley’s Economic Regression Data

  • Description: Predict number of people employed from economic variables
  • Type: Regression
  • Dimensions: 16 instances, 7 attributes
  • Inputs: Numeric
  • Output: Numeric

    # Longley's Economic Regression Data
    data(longley)
    dim(longley)
    head(longley)

You will see:

         GNP.deflator     GNP Unemployed Armed.Forces Population Year Employed
    1947         83.0 234.289      235.6        159.0    107.608 1947   60.323
    1948         88.5 259.426      232.5        145.6    108.632 1948   61.122
    1949         88.2 258.054      368.2        161.6    109.773 1949   60.171
    1950         89.5 284.599      335.1        165.0    110.929 1950   61.187
    1951         96.2 328.975      209.9        309.9    112.075 1951   63.221
    1952         98.1 346.999      193.2        359.4    113.270 1952   63.639
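To give a feel for how a regression dataset like this might be used, here is a minimal sketch that fits an ordinary linear model to predict Employed from all other columns. The choice of lm() and of training-set RMSE as the measure is an illustration only, not a recommended evaluation procedure.

```r
# Sketch: fit a simple linear regression on the longley data.
data(longley)

# Employed is the output; every other column is used as an input.
fit <- lm(Employed ~ ., data = longley)

# Root mean squared error on the training data (optimistic by construction).
rmse <- sqrt(mean(residuals(fit)^2))
rmse
```

With only 16 instances and strongly correlated inputs, treat any result on this dataset as practice rather than as a meaningful benchmark.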

Library: mlbench

Direct from the manual for the library:

A collection of artificial and real-world machine learning benchmark problems, including, e.g., several data sets from the UCI repository.

You can learn more about the mlbench library on the mlbench CRAN page.

If not installed, you can install this library as follows:

    install.packages("mlbench")
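If you run the same script more than once, you may prefer to install the package only when it is missing. The sketch below uses requireNamespace() for that check; the CRAN mirror URL is an assumption and you can substitute your preferred mirror.

```r
# Sketch: install mlbench only if it is not already available.
if (!requireNamespace("mlbench", quietly = TRUE)) {
  # The repos value is an assumed mirror; any CRAN mirror should work.
  install.packages("mlbench", repos = "https://cloud.r-project.org")
}

# Attach the package so its datasets can be loaded with data().
library(mlbench)
```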

You can load the library as follows:

    # load the library
    library(mlbench)

To see a list of the datasets available in this library, you can type:

    # list the contents of the library
    library(help="mlbench")

Some highlight datasets from this library that you could use are:

Boston Housing Data

  • Description: Predict the house price in Boston from house details
  • Type: Regression
  • Dimensions: 506 instances, 14 attributes
  • Inputs: Numeric
  • Output: Numeric
  • UCI Machine Learning Repository: Description

    # Boston Housing Data
    data(BostonHousing)
    dim(BostonHousing)
    head(BostonHousing)

You will see:

         crim zn indus chas   nox    rm  age    dis rad tax ptratio      b lstat medv
    1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
    2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
    3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
    4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
    5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2
    6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7

Wisconsin Breast Cancer Database

  • Description: Predict whether a cancer is malignant or benign from biopsy details.
  • Type: Binary Classification
  • Dimensions: 699 instances, 11 attributes
  • Inputs: Integer (Nominal)
  • Output: Categorical, 2 class labels
  • UCI Machine Learning Repository: Description
  • Published accuracy results: Summary

    # Wisconsin Breast Cancer Database
    data(BreastCancer)
    dim(BreastCancer)
    levels(BreastCancer$Class)
    head(BreastCancer)

You will see:

           Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses     Class
    1 1000025            5         1          1             1            2           1           3               1       1    benign
    2 1002945            5         4          4             5            7          10           3               2       1    benign
    3 1015425            3         1          1             1            2           2           3               1       1    benign
    4 1016277            6         8          8             1            3           4           3               7       1    benign
    5 1017023            4         1          1             3            2           1           3               1       1    benign
    6 1017122            8        10         10             8            7          10           9               7       1 malignant
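The Id column is a sample identifier rather than a biopsy measurement, and real-world data like this can contain missing values. The sketch below counts missing values per column and builds a copy without the Id column and without incomplete rows. Dropping rows is only one possible way to handle missing data, and the name complete is hypothetical.

```r
# Sketch: inspect missing values and prepare a cleaned copy of the data.
library(mlbench)
data(BreastCancer)

# Count missing values in each column.
colSums(is.na(BreastCancer))

# Keep only complete rows and drop the Id column (an identifier, not an input).
complete <- BreastCancer[complete.cases(BreastCancer),
                         names(BreastCancer) != "Id"]
dim(complete)
```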

Glass Identification Database

  • Description: Predict the glass type from chemical properties.
  • Type: Classification
  • Dimensions: 214 instances, 10 attributes
  • Inputs: Numeric
  • Output: Categorical, 7 class labels (only 6 of which appear in the data)
  • UCI Machine Learning Repository: Description
  • Published accuracy results: Summary

    # Glass Identification Database
    data(Glass)
    dim(Glass)
    levels(Glass$Type)
    head(Glass)

You will see:

           RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
    1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
    2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
    3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
    4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
    5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
    6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1
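The first rows all belong to type 1, so head() alone tells you little about the other classes. A short sketch like the following shows how many instances each glass type has, which is a useful check for any multi-class problem.

```r
# Sketch: check the class distribution of the Glass data.
library(mlbench)
data(Glass)

# Instances per glass type.
type_counts <- table(Glass$Type)
type_counts

# The same distribution as proportions of the whole dataset.
round(prop.table(type_counts), 3)
```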

Johns Hopkins University Ionosphere database

  • Description: Predict high-energy structures in the atmosphere from antenna data.
  • Type: Classification
  • Dimensions: 351 instances, 35 attributes
  • Inputs: Numeric
  • Output: Categorical, 2 class labels
  • UCI Machine Learning Repository: Description
  • Published accuracy results: Summary

    # Johns Hopkins University Ionosphere database
    data(Ionosphere)
    dim(Ionosphere)
    levels(Ionosphere$Class)
    head(Ionosphere)

You will see:

      V1 V2      V3       V4       V5       V6       V7       V8      V9      V10     V11      V12     V13      V14      V15      V16      V17      V18      V19
    1  1  0 0.99539 -0.05889  0.85243  0.02306  0.83398 -0.37708 1.00000  0.03760 0.85243 -0.17755 0.59755 -0.44945  0.60536 -0.38223  0.84356 -0.38542  0.58212
    2  1  0 1.00000 -0.18829  0.93035 -0.36156 -0.10868 -0.93597 1.00000 -0.04549 0.50874 -0.67743 0.34432 -0.69707 -0.51685 -0.97515  0.05499 -0.62237  0.33109
    3  1  0 1.00000 -0.03365  1.00000  0.00485  1.00000 -0.12062 0.88965  0.01198 0.73082  0.05346 0.85443  0.00827  0.54591  0.00299  0.83775 -0.13644  0.75535
    4  1  0 1.00000 -0.45161  1.00000  1.00000  0.71216 -1.00000 0.00000  0.00000 0.00000  0.00000 0.00000  0.00000 -1.00000  0.14516  0.54094 -0.39330 -1.00000
    5  1  0 1.00000 -0.02401  0.94140  0.06531  0.92106 -0.23255 0.77152 -0.16399 0.52798 -0.20275 0.56409 -0.00712  0.34395 -0.27457  0.52940 -0.21780  0.45107
    6  1  0 0.02337 -0.00592 -0.09924 -0.11949 -0.00763 -0.11824 0.14706  0.06637 0.03786 -0.06302 0.00000  0.00000 -0.04572 -0.15540 -0.00343 -0.10196 -0.11575
           V20      V21      V22      V23      V24      V25      V26      V27      V28      V29      V30      V31      V32      V33      V34 Class
    1 -0.32192  0.56971 -0.29674  0.36946 -0.47357  0.56811 -0.51171  0.41078 -0.46168  0.21266 -0.34090  0.42267 -0.54487  0.18641 -0.45300  good
    2 -1.00000 -0.13151 -0.45300 -0.18056 -0.35734 -0.20332 -0.26569 -0.20468 -0.18401 -0.19040 -0.11593 -0.16626 -0.06288 -0.13738 -0.02447   bad
    3 -0.08540  0.70887 -0.27502  0.43385 -0.12062  0.57528 -0.40220  0.58984 -0.22145  0.43100 -0.17365  0.60436 -0.24180  0.56045 -0.38238  good
    4 -0.54467 -0.69975  1.00000  0.00000  0.00000  1.00000  0.90695  0.51613  1.00000  1.00000 -0.20099  0.25682  1.00000 -0.32382  1.00000   bad
    5 -0.17813  0.05982 -0.35575  0.02309 -0.52879  0.03286 -0.65158  0.13290 -0.53206  0.02431 -0.62197 -0.05707 -0.59573 -0.04608 -0.65697  good
    6 -0.05414  0.01838  0.03669  0.01519  0.00888  0.03513 -0.01535 -0.03240  0.09223 -0.07859  0.00732  0.00000  0.00000 -0.00039  0.12011   bad
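In the rows above, V2 shows the same value in every row. A column that never varies carries no information for a model, so one possible preprocessing step is to look for such columns and set them aside. The sketch below does this generically; the names n_unique, constant_cols and reduced are hypothetical.

```r
# Sketch: find columns with a single unique value and drop them.
library(mlbench)
data(Ionosphere)

# Number of distinct values in each column.
n_unique <- sapply(Ionosphere, function(col) length(unique(col)))

# Names of columns that never vary across the dataset.
constant_cols <- names(n_unique)[n_unique == 1]
constant_cols

# Copy of the data without the constant columns.
reduced <- Ionosphere[, !(names(Ionosphere) %in% constant_cols)]
dim(reduced)
```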

Pima Indians Diabetes Database

  • Description: Predict the onset of diabetes in female Pima Indians from medical record data.
  • Type: Binary Classification
  • Dimensions: 768 instances, 9 attributes
  • Inputs: Numeric
  • Output: Categorical, 2 class labels
  • UCI Machine Learning Repository: Description
  • Published accuracy results: Summary

    # Pima Indians Diabetes Database
    data(PimaIndiansDiabetes)
    dim(PimaIndiansDiabetes)
    levels(PimaIndiansDiabetes$diabetes)
    head(PimaIndiansDiabetes)
