What is Statistics (and why is it important in machine learning)?

Posted by 阿新 • Published: 2019-01-12

Statistics is a collection of tools that you can use to get answers to important questions about data.

You can use descriptive statistical methods to transform raw observations into information that you can understand and share. You can use inferential statistical methods to reason from small samples of data to whole domains.

In this post, you will discover why statistics is important, both in general and for machine learning, and the types of methods that are available.

After reading this post, you will know:

Statistics is generally considered a prerequisite to the field of applied machine learning.
We need statistics to help transform observations into information and to answer questions about samples of observations.
Statistics is a collection of tools developed over hundreds of years for summarizing data and quantifying properties of a domain given a sample of observations.

Let's get started.

A Gentle Introduction to Statistics
Photo by Mike Sutherland, some rights reserved.

Statistics is a Required Prerequisite

Machine learning and statistics are two tightly related fields of study. So much so that some statisticians refer to machine learning as “applied statistics” or “statistical learning” rather than the computer-science-centric name.

Machine learning is almost universally presented to beginners on the assumption that the reader has some background in statistics. We can make this concrete with a few cherry-picked examples.

Take a look at this quote from the beginning of a popular applied machine learning book titled “Applied Predictive Modeling”:

… the reader should have some knowledge of basic statistics, including variance, correlation, simple linear regression, and basic hypothesis testing (e.g. p-values and test statistics).

Here’s another example from the popular “Introduction to Statistical Learning” book:

We expect that the reader will have had at least one elementary course in statistics.

Even when statistics is not a prerequisite, some basic prior knowledge is still helpful, as can be seen in this quote from the widely read “Programming Collective Intelligence”:

… this book does not assume you have any prior knowledge of […] or statistics. […] but having some knowledge of trigonometry and basic statistics will help you understand the algorithms.

To understand machine learning, some basic understanding of statistics is required.

To see why this is the case, we must first understand why we need the field of statistics in the first place.


Why Learn Statistics?

Raw observations alone are data, but they are not information or knowledge.

Data raises questions, such as:

What is the most common or expected observation?
What are the limits on the observations?
What does the data look like?

Although they appear simple, these questions must be answered in order to turn raw observations into information that we can use and share.
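As a minimal sketch of how these three questions might be answered in code, the example below uses only the Python standard library. The list of observations is made up for illustration, and the variable names are assumptions rather than anything from a specific dataset.

```python
from collections import Counter
import statistics

# Hypothetical raw observations (made-up values for illustration only).
observations = [4, 7, 5, 6, 5, 9, 5, 3, 6, 5, 8, 6]

# What is the most common or expected observation?
most_common_value, count = Counter(observations).most_common(1)[0]
expected_value = statistics.mean(observations)

# What are the limits on the observations?
lowest, highest = min(observations), max(observations)

# What does the data look like? Quartiles give a rough picture of the shape.
q1, q2, q3 = statistics.quantiles(observations, n=4)

print(f"most common: {most_common_value} (seen {count} times)")
print(f"expected (mean): {expected_value}")
print(f"limits: {lowest} to {highest}")
print(f"quartiles: {q1}, {q2}, {q3}")
```

Even on a dozen values, the summary is easier to share and reason about than the raw list.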

Beyond raw data, we may design experiments in order to collect observations. From these experimental results we may have more sophisticated questions, such as:

What variables are most relevant?
What is the difference in an outcome between two experiments?
Are the differences real or the result of noise in the data?

Questions of this type are important. The results matter to the project, to stakeholders, and to effective decision making.

Statistical methods are required to find answers to the questions that we have about data.

We can see that statistical methods are required both to understand the data used to train a machine learning model and to interpret the results of testing different machine learning models.

This is just the tip of the iceberg as each step in a predictive modeling project will require the use of a statistical method.

What is Statistics?

Statistics is a subfield of mathematics.

It refers to a collection of methods for working with data and using data to answer questions.

Statistics is the art of making numerical conjectures about puzzling questions. […] The methods were developed over several hundred years by people who were looking for answers to their questions.

— Page xiii, Statistics, Fourth Edition, 2007.

Because the field comprises a grab bag of methods for working with data, it can seem large and amorphous to beginners. It can be hard to see the line between methods that belong to statistics and methods that belong to other fields of study. Often a technique can be both a classical method from statistics and a modern algorithm used for feature selection or modeling.

Although a working knowledge of statistics does not require deep theoretical knowledge, some important and easy-to-digest theorems from the relationship between statistics and probability can provide a valuable foundation.

Two examples include the law of large numbers and the central limit theorem; the first aids in understanding why bigger samples are often better and the second provides a foundation for how we can compare the expected values between samples (e.g. mean values).
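Both theorems can be seen in a small simulation. The sketch below assumes a fair six-sided die as the data source (true mean 3.5); the function name roll_mean, the sample sizes, and the seed are all choices made for this illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable


def roll_mean(n):
    """Mean of n simulated rolls of a fair six-sided die (true mean is 3.5)."""
    return statistics.mean(random.randint(1, 6) for _ in range(n))


# Law of large numbers: the sample mean tends toward 3.5 as n grows.
lln = {n: roll_mean(n) for n in (10, 1_000, 100_000)}
for n, value in lln.items():
    print(n, round(value, 3))

# Central limit theorem: the means of many samples cluster around 3.5 with
# a spread close to sigma / sqrt(n), even though single rolls are uniform.
n = 50
sample_means = [roll_mean(n) for _ in range(2_000)]
sigma = statistics.pstdev(range(1, 7))  # standard deviation of one fair roll

print("mean of sample means:", round(statistics.mean(sample_means), 3))
print("spread of sample means:", round(statistics.stdev(sample_means), 3))
print("sigma / sqrt(n):", round(sigma / n ** 0.5, 3))
```

The last two printed values should be close to each other, which is what lets us reason about how far a sample mean is likely to sit from the true mean.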

When it comes to the statistical tools that we use in practice, it can be helpful to divide the field of statistics into two large groups of methods: descriptive statistics for summarizing data and inferential statistics for drawing conclusions from samples of data.

Statistics allow researchers to collect information, or data, from a large number of people and then summarize their typical experience. […] Statistics are also used to reach conclusions about general differences between groups. […] Statistics can also be used to see if scores on two variables are related and to make predictions.

Descriptive Statistics

Descriptive statistics refer to methods for summarizing raw observations into information that we can understand and share.

Commonly, we think of descriptive statistics as the calculation of statistical values on a sample of data in order to summarize its properties, such as the most common or expected value (e.g. the mean or median) and the spread of the data (e.g. the variance or standard deviation).
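A short sketch of these calculations, using Python's standard statistics module on a made-up sample (the values and variable names are assumptions for illustration):

```python
import statistics

# Hypothetical sample (made-up values for illustration only).
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

sample_mean = statistics.mean(sample)            # expected value
sample_median = statistics.median(sample)        # middle value, robust to outliers
sample_variance = statistics.variance(sample)    # sample variance (n - 1 denominator)
sample_std = statistics.stdev(sample)            # sample standard deviation

print("mean:", sample_mean)
print("median:", sample_median)
print("variance:", round(sample_variance, 3))
print("standard deviation:", round(sample_std, 3))
```

Note that variance and stdev use the n - 1 denominator; the module also offers pvariance and pstdev for when the data is the whole population rather than a sample.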

Descriptive statistics may also cover graphical methods that can be used to visualize samples of data. Charts and graphics can provide a useful qualitative understanding of both the shape or distribution of observations and how variables may relate to each other.
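In practice a plotting library such as Matplotlib would normally be used for this. As a dependency-free sketch, the example below draws a rough text histogram of made-up observations so the shape of the distribution can be seen at a glance.

```python
from collections import Counter

# Hypothetical observations (made-up values for illustration only).
observations = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7]

counts = Counter(observations)

# One row per distinct value; the bar length is the number of occurrences.
lines = [f"{value:>2} | {'#' * counts[value]}" for value in sorted(counts)]
print("\n".join(lines))
```

The bars rise to a peak at 4 and fall away on both sides, a shape that is much harder to spot in the raw list.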

Inferential Statistics

Inferential statistics is a fancy name for methods that aid in quantifying properties of the domain or population from a smaller set of obtained observations called a sample.

Commonly, we think of inferential statistics as the estimation of quantities from the population distribution, such as the expected value or the amount of spread.
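To make this concrete, the sketch below estimates a population mean from a small random sample and attaches an approximate 95% interval to the estimate. The population here is simulated (normal, mean 50, standard deviation 10), and the sample size of 200 is an arbitrary choice for illustration.

```python
import random
import statistics

random.seed(7)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 simulated values.
population = [random.gauss(50, 10) for _ in range(100_000)]

# In practice we only get to see a sample, not the whole population.
sample = random.sample(population, 200)

estimate = statistics.mean(sample)
std_error = statistics.stdev(sample) / len(sample) ** 0.5

# Approximate 95% interval using the normal critical value (about 1.96).
z = statistics.NormalDist().inv_cdf(0.975)
lower, upper = estimate - z * std_error, estimate + z * std_error

print("estimated mean:", round(estimate, 2))
print("approx. 95% interval:", round(lower, 2), "to", round(upper, 2))
print("population mean:", round(statistics.mean(population), 2))
```

The interval is a normal approximation, which is reasonable for a sample of this size; smaller samples would usually call for a t-based interval instead.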

More sophisticated statistical inference tools can be used to quantify the likelihood of observing data samples given an assumption. These are often referred to as tools for statistical hypothesis testing, where the base assumption of a test is called the null hypothesis.

There are many examples of inferential statistical methods given the range of hypotheses we may assume and the constraints we may impose on the data in order to increase the power or likelihood that the finding of the test is correct.
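One simple hypothesis test that needs very few assumptions is a permutation test. The sketch below is one possible illustration, not the only way to test a difference: the two groups are made-up results from two hypothetical experiments, and the null hypothesis is that the group labels make no difference to the outcome.

```python
import random
import statistics

random.seed(3)  # fixed seed so the run is repeatable

# Hypothetical outcomes from two experiments (made-up values).
group_a = [12.1, 11.8, 12.6, 13.0, 12.4, 11.9, 12.7, 12.2]
group_b = [12.9, 13.4, 12.8, 13.6, 13.1, 12.7, 13.5, 13.0]

# Test statistic: absolute difference between the two group means.
observed = abs(statistics.mean(group_a) - statistics.mean(group_b))

# Under the null hypothesis the labels are interchangeable, so we shuffle
# them repeatedly and count how often chance alone gives a difference at
# least as large as the one observed.
pooled = group_a + group_b
n_a = len(group_a)
n_permutations = 5_000
extreme = 0
for _ in range(n_permutations):
    random.shuffle(pooled)
    diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
    if diff >= observed:
        extreme += 1

# Add-one correction avoids reporting a p-value of exactly zero.
p_value = (extreme + 1) / (n_permutations + 1)

print("observed difference:", round(observed, 4))
print("p-value:", round(p_value, 4))
```

A small p-value suggests the difference would rarely arise from noise alone if the null hypothesis were true, which is the question "are the differences real or the result of noise?" expressed as a number.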

Summary

In this post, you discovered why statistics is important, both in general and for machine learning, and the types of methods that are available.

Specifically, you learned:

Statistics is generally considered a prerequisite to the field of applied machine learning.
We need statistics to help transform observations into information and to answer questions about samples of observations.
Statistics is a collection of tools developed over hundreds of years for summarizing data and quantifying properties of a domain given a sample of observations.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
