What is Statistics (and why is it important in machine learning)?

Statistics is a collection of tools that you can use to get answers to important questions about data.

You can use descriptive statistical methods to transform raw observations into information that you can understand and share. You can use inferential statistical methods to reason from small samples of data to whole domains.

In this post, you will discover why statistics is important, both in general and for machine learning, and the broad types of statistical methods that are available.

After reading this post, you will know:

  • Statistics is generally considered a prerequisite to the field of applied machine learning.
  • We need statistics to help transform observations into information and to answer questions about samples of observations.
  • Statistics is a collection of tools developed over hundreds of years for summarizing data and quantifying properties of a domain given a sample of observations.

Let’s get started.

A Gentle Introduction to Statistics
Photo by Mike Sutherland, some rights reserved.

Statistics is a Required Prerequisite

Machine learning and statistics are two tightly related fields of study, so much so that statisticians often refer to machine learning as “applied statistics” or “statistical learning” rather than by its computer-science-centric name.

Machine learning is almost universally presented to beginners on the assumption that the reader has some background in statistics. We can make this concrete with a few cherry-picked examples.

Take a look at this quote from the beginning of a popular applied machine learning book titled “Applied Predictive Modeling”:

… the reader should have some knowledge of basic statistics, including variance, correlation, simple linear regression, and basic hypothesis testing (e.g. p-values and test statistics).

Here’s another example from the popular “Introduction to Statistical Learning” book:

We expect that the reader will have had at least one elementary course in statistics.

Even when statistics is not a prerequisite, some basic prior knowledge is required, as can be seen in this quote from the widely read “Programming Collective Intelligence”:

… this book does not assume you have any prior knowledge of […] or statistics. […] but having some knowledge of trigonometry and basic statistics will help you understand the algorithms.

Some basic understanding of statistics is required in order to understand machine learning.

To see why this is the case, we must first understand why we need the field of statistics in the first place.

Why Learn Statistics?

Raw observations alone are data, but they are not information or knowledge.

Data raises questions, such as:

  • What is the most common or expected observation?
  • What are the limits on the observations?
  • What does the data look like?

Although they appear simple, these questions must be answered in order to turn raw observations into information that we can use and share.

Beyond raw data, we may design experiments in order to collect observations. From these experimental results we may have more sophisticated questions, such as:

  • What variables are most relevant?
  • What is the difference in an outcome between two experiments?
  • Are the differences real or the result of noise in the data?

Questions of this type are important. The results matter to the project, to stakeholders, and to effective decision making.

Statistical methods are required to find answers to the questions that we have about data.

We can see that statistical methods are required both to understand the data used to train a machine learning model and to interpret the results of testing different machine learning models.

This is just the tip of the iceberg as each step in a predictive modeling project will require the use of a statistical method.

What is Statistics?

Statistics is a subfield of mathematics.

It refers to a collection of methods for working with data and using data to answer questions.

Statistics is the art of making numerical conjectures about puzzling questions. […] The methods were developed over several hundred years by people who were looking for answers to their questions.

— Page xiii, Statistics, Fourth Edition, 2007.

Because the field is composed of a grab bag of methods for working with data, it can seem large and amorphous to beginners. It can be hard to see the line between methods that belong to statistics and methods that belong to other fields of study. Often a technique can be both a classical method from statistics and a modern algorithm used for feature selection or modeling.

Although a working knowledge of statistics does not require deep theoretical knowledge, some important and easy-to-digest theorems that link statistics and probability can provide a valuable foundation.

Two examples are the law of large numbers and the central limit theorem. The first aids in understanding why bigger samples are often better, and the second provides a foundation for how we can compare expected values between samples (e.g. mean values).
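Both theorems can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the fair six-sided die, the seed, the sample sizes, and the helper name sample_mean are all made up for this example and are not part of the original post. It uses only the Python standard library.

```python
import random
import statistics


def sample_mean(n, rng):
    """Return the mean of n simulated rolls of a fair six-sided die."""
    return statistics.fmean([rng.randint(1, 6) for _ in range(n)])


rng = random.Random(1)  # fixed seed (arbitrary) so the run is repeatable
true_mean = 3.5         # expected value of a fair die

# Law of large numbers: larger samples tend to give estimates closer to 3.5.
for n in (10, 100, 1000, 10000):
    estimate = sample_mean(n, rng)
    print(f"n={n:>5}  sample mean={estimate:.3f}  error={abs(estimate - true_mean):.3f}")

# Central limit theorem: the means of many same-sized samples cluster around
# the true mean, with a spread close to sigma / sqrt(n).
sample_size = 50
means = [sample_mean(sample_size, rng) for _ in range(2000)]
die_sigma = (35 / 12) ** 0.5  # standard deviation of a single fair die roll
print(f"mean of sample means:   {statistics.fmean(means):.3f}")
print(f"spread of sample means: {statistics.stdev(means):.3f}")
print(f"sigma / sqrt(n):        {die_sigma / sample_size ** 0.5:.3f}")
```

The error in the first loop will not necessarily shrink at every step, since each estimate is itself random, but it tends to shrink as the sample grows. The last two printed values should be close to each other, which is what makes comparing sample means tractable.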

When it comes to the statistical tools that we use in practice, it can be helpful to divide the field of statistics into two large groups of methods: descriptive statistics for summarizing data and inferential statistics for drawing conclusions from samples of data.

Statistics allow researchers to collect information, or data, from a large number of people and then summarize their typical experience. […] Statistics are also used to reach conclusions about general differences between groups. […] Statistics can also be used to see if scores on two variables are related and to make predictions.

Descriptive Statistics

Descriptive statistics refer to methods for summarizing raw observations into information that we can understand and share.

Commonly, we think of descriptive statistics as the calculation of statistical values on samples of data in order to summarize properties of the sample, such as the most common or expected value (e.g. the mean or median) and the spread of the data (e.g. the variance or standard deviation).

Descriptive statistics may also cover graphical methods that can be used to visualize samples of data. Charts and graphics can provide a useful qualitative understanding of both the shape or distribution of observations and how variables may relate to each other.
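As a minimal sketch, these summary values can be calculated with Python's standard statistics module. The observations below are made-up numbers chosen only for illustration.

```python
import statistics

# Made-up sample of observations; note the single large value at the end.
observations = [2.1, 2.5, 2.5, 3.0, 3.2, 3.8, 4.1, 9.7]

summary = {
    "count": len(observations),
    "mean": statistics.mean(observations),
    "median": statistics.median(observations),
    "variance": statistics.variance(observations),  # sample variance (divides by n - 1)
    "stdev": statistics.stdev(observations),        # square root of the sample variance
    "min": min(observations),
    "max": max(observations),
}

for name, value in summary.items():
    print(f"{name:>8}: {value:.3f}")
```

In this sample the mean (3.8625) sits above the median (3.1) because the one large observation pulls the mean upward. Reporting both is a cheap way to notice skew or outliers.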

Inferential Statistics

Inferential statistics is a fancy name for methods that aid in quantifying properties of the domain or population from a smaller set of obtained observations called a sample.

Commonly, we think of inferential statistics as the estimation of quantities from the population distribution, such as the expected value or the amount of spread.
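The sketch below illustrates this kind of estimation. It assumes a hypothetical population that is simulated here (in practice the full population is not available), draws a sample from it, and estimates the population mean together with an approximate 95% interval based on the normal approximation. The population parameters, seed, and sample size are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(7)  # arbitrary fixed seed for repeatability

# Hypothetical population; normally we could not observe all of it.
population = [rng.gauss(50.0, 10.0) for _ in range(100_000)]

# The sample is the only data we actually get to work with.
sample = rng.sample(population, 200)

estimate = statistics.mean(sample)                          # estimate of the population mean
std_error = statistics.stdev(sample) / len(sample) ** 0.5   # standard error of that estimate
low = estimate - 1.96 * std_error                           # approximate 95% interval bounds
high = estimate + 1.96 * std_error

print(f"sample mean:      {estimate:.2f}")
print(f"approx. 95% CI:   [{low:.2f}, {high:.2f}]")
print(f"population mean:  {statistics.mean(population):.2f}")
```

An interval built this way is expected to contain the population mean in roughly 95 out of 100 repeated samples, not in every single run.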

More sophisticated statistical inference tools can be used to quantify the likelihood of observing data samples given an assumption. These are often referred to as tools for statistical hypothesis testing, where the base assumption of a test is called the null hypothesis.

There are many examples of inferential statistical methods, given the range of hypotheses we may assume and the constraints we may impose on the data in order to increase the power or likelihood that the finding of the test is correct.
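As one illustration of a hypothesis test, the sketch below implements a simple permutation test using only the standard library. The null hypothesis is that the two groups come from the same distribution, so their labels are interchangeable. The function name permutation_test, the scores, and the two "models" are hypothetical and chosen only for this example; this is one of many possible tests, not a recommended default.

```python
import random
import statistics


def permutation_test(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns an estimated p-value: the proportion of random relabelings
    whose mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign observations to the two groups
        diff = abs(statistics.fmean(pooled[:len(a)]) - statistics.fmean(pooled[len(a):]))
        if diff >= observed:
            count += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (count + 1) / (n_permutations + 1)


# Made-up accuracy scores from ten runs of two hypothetical models.
model_a = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.81, 0.80, 0.82]
model_b = [0.76, 0.78, 0.75, 0.77, 0.79, 0.74, 0.78, 0.76, 0.77, 0.75]

p_value = permutation_test(model_a, model_b)
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value suggests the observed difference would be unlikely if the null hypothesis were true. It does not by itself say how large or how practically important the difference is.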

Summary

In this post, you discovered why statistics is important, both in general and for machine learning, and the broad types of statistical methods that are available.

Specifically, you learned:

  • Statistics is generally considered a prerequisite to the field of applied machine learning.
  • We need statistics to help transform observations into information and to answer questions about samples of observations.
  • Statistics is a collection of tools developed over hundreds of years for summarizing data and quantifying properties of a domain given a sample of observations.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
