Better Understand Your Data in R Using Descriptive Statistics (8 recipes you can use today)

You must become intimate with your data.

Any machine learning models that you build are only as good as the data that you provide them. The first step in understanding your data is to actually look at some raw values and calculate some basic statistics.

In this post, you will discover how you can quickly get a handle on your dataset with descriptive statistics examples and recipes in R.

These recipes are perfect for you if you are a developer just getting started using R for machine learning.

Let’s get started.

Update Nov/2016: This tutorial assumes you have the mlbench and e1071 R packages installed. They can be installed by typing:

install.packages(c("e1071", "mlbench"))
Understand Your Data in R Using Descriptive Statistics
Photo by Enamur Reza, some rights reserved.

You Must Understand Your Data

Understanding the data that you have is critically important.

You can run techniques and algorithms on your data, but it is not until you take the time to truly understand your dataset that you can fully understand the context of the results you achieve.

Better Understanding Equals Better Results

A deeper understanding of your data will give you better results.

Taking the time to study the data you have will help you in ways that are less obvious. You build an intuition for the data and for the entities that individual records or observations represent. These can bias you towards specific techniques (for better or worse), but you can also be inspired.

For example, examining your data in detail may trigger ideas for specific techniques to investigate:

  • Data Cleaning. You may discover missing or corrupt data and think of various data cleaning operations to perform, such as marking or removing bad data and imputing missing data.
  • Data Transforms. You may discover that some attributes have familiar distributions such as Gaussian or exponential, giving you ideas of scaling, log, or other transforms you could apply (see the sketch after this list).
  • Data Modeling. You may notice properties of the data, such as distributions or data types, that suggest the use (or avoidance) of specific machine learning algorithms.
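
As a small illustration of the first two ideas, here is a minimal sketch. It assumes the PimaIndiansDiabetes dataset used later in this post, where zero values in some attributes are commonly treated as missing; treat it as one example of these operations, not a prescribed workflow.

# load the library
library(mlbench)
data(PimaIndiansDiabetes)
# data cleaning idea: mark biologically impossible zero values as missing (NA)
glucose <- PimaIndiansDiabetes$glucose
glucose[glucose == 0] <- NA
summary(glucose)
# data transform idea: log-transform a right-skewed attribute
summary(log(PimaIndiansDiabetes$age))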

Use Descriptive Statistics

You need to look at your data. And you need to look at your data from different perspectives.

Inspecting your data will help you to build up your intuition and prompt you to start asking questions about the data that you have.

Multiple perspectives will challenge you to think about the data in different ways, helping you to ask more and better questions.

Two methods for looking at your data are:

  1. Descriptive Statistics
  2. Data Visualization

The first and best place to start is to calculate basic summary descriptive statistics on your data.

You need to learn the shape, size, type and general layout of the data that you have.

Let’s look at some ways that you can summarize your data using R.


Summarize Data in R With Descriptive Statistics

In this section, you will discover 8 quick and simple ways to summarize your dataset.

Each method is briefly described and includes a recipe in R that you can run yourself or copy and adapt to your own needs.

1. Peek At Your Data

The very first thing to do is to just look at some raw data from your dataset.

If your dataset is small you might be able to display it all on the screen. Often it is not, so you can take a small sample and review that.

# load the library
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# display first 20 rows of data
head(PimaIndiansDiabetes, n=20)

The head() function will display the first 20 rows of data for you to review and think about.

   pregnant glucose pressure triceps insulin mass pedigree age diabetes
1         6     148       72      35       0 33.6    0.627  50      pos
2         1      85       66      29       0 26.6    0.351  31      neg
3         8     183       64       0       0 23.3    0.672  32      pos
4         1      89       66      23      94 28.1    0.167  21      neg
5         0     137       40      35     168 43.1    2.288  33      pos
6         5     116       74       0       0 25.6    0.201  30      neg
7         3      78       50      32      88 31.0    0.248  26      pos
8        10     115        0       0       0 35.3    0.134  29      neg
9         2     197       70      45     543 30.5    0.158  53      pos
10        8     125       96       0       0  0.0    0.232  54      pos
11        4     110       92       0       0 37.6    0.191  30      neg
12       10     168       74       0       0 38.0    0.537  34      pos
13       10     139       80       0       0 27.1    1.441  57      neg
14        1     189       60      23     846 30.1    0.398  59      pos
15        5     166       72      19     175 25.8    0.587  51      pos
16        7     100        0       0       0 30.0    0.484  32      pos
17        0     118       84      47     230 45.8    0.551  31      pos
18        7     107       74       0       0 29.6    0.254  31      pos
19        1     103       30      38      83 43.3    0.183  33      neg
20        1     115       70      30      96 34.6    0.529  32      pos
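
If you would rather review a random sample than the first rows, a minimal sketch using base R's sample() function (the seed is set only to make the sample reproducible):

# load the library
library(mlbench)
data(PimaIndiansDiabetes)
# display a random sample of 10 rows
set.seed(7)
PimaIndiansDiabetes[sample(nrow(PimaIndiansDiabetes), 10), ]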

2. Dimensions of Your Data

How much data do you have? You may have a general idea, but it is much better to have a precise figure.

If you have a lot of instances, you may need to work with a smaller sample of the data so that model training and evaluation are computationally tractable. If you have a vast number of attributes, you may need to select those that are most relevant. If you have more attributes than instances, you may need to select specific modeling techniques.

# load the libraries
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# display the dimensions of the dataset
dim(PimaIndiansDiabetes)

This shows the rows and columns of your loaded dataset.

[1] 768   9
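
If you prefer the counts separately, base R's nrow() and ncol() functions return the same information directly (assuming the dataset loaded in the recipe above):

nrow(PimaIndiansDiabetes)  # number of rows (instances)
ncol(PimaIndiansDiabetes)  # number of columns (attributes)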

3. Data Types

You need to know the types of the attributes in your data.

This is invaluable. The data types will indicate the kinds of further analysis, visualization, and even machine learning algorithms that you can use.

Additionally, perhaps some attributes were loaded as one type (e.g. integer) and could in fact be represented as another type (a categorical factor). Inspecting the types helps expose these issues and spark ideas early.

# load library
library(mlbench)
# load dataset
data(BostonHousing)
# list types for each attribute
sapply(BostonHousing, class)

This lists the data type of each attribute in your dataset.

     crim        zn     indus      chas       nox        rm       age       dis       rad       tax   ptratio         b
"numeric" "numeric" "numeric"  "factor" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
    lstat      medv
"numeric" "numeric"
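
If an attribute was loaded with the wrong type, you can convert it with as.factor(). A minimal sketch, noting that treating the rad attribute of BostonHousing as categorical is an assumption made here purely for illustration:

# treat an integer-coded attribute as a categorical factor
BostonHousing$rad <- as.factor(BostonHousing$rad)
class(BostonHousing$rad)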

4. Class Distribution

In a classification problem, you must know the proportion of instances that belong to each class value.

This is important because it may highlight an imbalance in the data that, if severe, may need to be addressed with rebalancing techniques. In the case of a multi-class classification problem, it may expose classes with very few or zero instances that may be candidates for removal from the dataset.

# load the libraries
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# distribution of class variable
y <- PimaIndiansDiabetes$diabetes
cbind(freq=table(y), percentage=prop.table(table(y))*100)

This recipe creates a useful table showing the number of instances that belong to each class as well as the percentage that this represents from the entire dataset.

    freq percentage
neg  500   65.10417
pos  268   34.89583

5. Data Summary

There is a most valuable function called summary() that summarizes each attribute in your dataset in turn.

The function creates a table for each attribute and lists a breakdown of values. Factors are described as counts next to each class label. Numerical attributes are described as:

  • Min
  • 25th percentile
  • Median
  • Mean
  • 75th percentile
  • Max

The breakdown also includes an indication of the number of missing values for an attribute (marked NA's).

# load the iris dataset
data(iris)
# summarize the dataset
summary(iris)

You can see that this recipe produces a lot of information for you to review. Take your time and work through each attribute in turn.

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
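
The iris dataset has no missing values. To see how summary() reports them, you could try a base R dataset that does, such as airquality, where the Ozone and Solar.R attributes contain NAs:

# summarize a dataset containing missing values
data(airquality)
summary(airquality)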

6. Standard Deviations

One thing missing from the summary() function above is the standard deviation.

The standard deviation, along with the mean, is useful to know if the data has a Gaussian (or nearly Gaussian) distribution. For example, it can be useful as a quick and dirty outlier removal tool, where any values that are more than three standard deviations from the mean are outside of 99.7% of the data.

# load the libraries
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# calculate standard deviation for all attributes
sapply(PimaIndiansDiabetes[,1:8], sd)

This calculates the standard deviation for each numeric attribute in the dataset.

   pregnant     glucose    pressure     triceps     insulin        mass    pedigree         age
  3.3695781  31.9726182  19.3558072  15.9522176 115.2440024   7.8841603   0.3313286  11.7602315
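
As a small illustration of the three-standard-deviation rule mentioned above, a minimal sketch that flags candidate outliers in a single attribute (the choice of the mass attribute is arbitrary):

# load the library
library(mlbench)
data(PimaIndiansDiabetes)
# flag values more than 3 standard deviations from the mean
v <- PimaIndiansDiabetes$mass
outliers <- abs(v - mean(v)) > 3 * sd(v)
# count and inspect the flagged values
sum(outliers)
v[outliers]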

7. Skewness

If a distribution looks kind-of-Gaussian but is pushed far to the left or right, it is useful to know the skew.

Getting a feeling for the skew is much easier with plots of the data, such as a histogram or density plot. It is harder to tell from looking at means, standard deviations and quartiles.

Nevertheless, calculating the skew up front gives you a reference that you can use later if you decide to correct the skew for an attribute.

# load libraries
library(mlbench)
library(e1071)
# load the dataset
data(PimaIndiansDiabetes)
# calculate skewness for each variable
skew <- apply(PimaIndiansDiabetes[,1:8], 2, skewness)
# display skewness, larger/smaller deviations from 0 show more skew
print(skew)

The further the skew value is from zero, the larger the skew to the left (negative skew value) or right (positive skew value).

  pregnant    glucose   pressure    triceps    insulin       mass   pedigree        age
 0.8981549  0.1730754 -1.8364126  0.1089456  2.2633826 -0.4273073  1.9124179  1.1251880
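
If you later decide to correct the skew for an attribute, a log transform is one common option. A minimal sketch using base R's log1p() (chosen here because it handles the zero values in the insulin attribute); the transform typically pulls the skew value back towards zero:

# compare skewness before and after a log transform
library(mlbench)
library(e1071)
data(PimaIndiansDiabetes)
skewness(PimaIndiansDiabetes$insulin)         # strongly right-skewed
skewness(log1p(PimaIndiansDiabetes$insulin))  # typically closer to zero after log1p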

8. Correlations

It is important to observe and think about how attributes relate to each other.

For numeric attributes, an excellent way to think about attribute-to-attribute interactions is to calculate correlations for each pair of attributes.
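
As a starting point, and in the style of the recipes above, a minimal sketch using base R's cor() function on the numeric attributes:

# load the library
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# calculate a correlation matrix for each pair of numeric attributes
correlations <- cor(PimaIndiansDiabetes[,1:8])
print(correlations)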