10 Examples of How to Use Statistical Methods in a Machine Learning Project

Statistics and machine learning are two very closely related fields.

In fact, the line between the two can be very fuzzy at times. Nevertheless, there are methods that clearly belong to the field of statistics that are not only useful, but invaluable when working on a machine learning project.

It would be fair to say that statistical methods are required to effectively work through a machine learning predictive modeling project.

In this post, you will discover specific examples of statistical methods that are useful and required at key steps in a predictive modeling problem.

After completing this post, you will know:

  • Exploratory data analysis, data summarization, and data visualizations can be used to help frame your predictive modeling problem and better understand the data.
  • Statistical methods can be used to clean and prepare data ready for modeling.
  • Statistical hypothesis tests and estimation statistics can aid in model selection and in presenting the skill and predictions from final models.

Let’s get started.

Overview

In this post, we are going to look at 10 examples of where statistical methods are used in an applied machine learning project.

This will demonstrate that a working knowledge of statistics is essential for successfully working through a predictive modeling problem.

  1. Problem Framing
  2. Data Understanding
  3. Data Cleaning
  4. Data Selection
  5. Data Preparation
  6. Model Evaluation
  7. Model Configuration
  8. Model Selection
  9. Model Presentation
  10. Model Predictions

1. Problem Framing

Perhaps the point of biggest leverage in a predictive modeling problem is the framing of the problem.

This is the selection of the type of problem, e.g. regression or classification, and perhaps the structure and types of the inputs and outputs for the problem.

The framing of the problem is not always obvious. For newcomers to a domain, it may require significant exploration of the observations in the domain.

Domain experts who may be stuck seeing the issues from a conventional perspective may also benefit from considering the data from multiple perspectives.

Statistical methods that can aid in the exploration of the data during the framing of a problem include:

  • Exploratory Data Analysis. Summarization and visualization in order to explore ad hoc views of the data.
  • Data Mining. Automatic discovery of structured relationships and patterns in the data.

2. Data Understanding

Data understanding means having an intimate grasp of both the distributions of variables and the relationships between variables.

Some of this knowledge may come from domain expertise, or require domain expertise in order to interpret. Nevertheless, both experts and novices to a field of study will benefit from actually handling real observations from the domain.

Two large branches of statistical methods are used to aid in understanding data; they are:

  • Summary Statistics. Methods used to summarize the distribution and relationships between variables using statistical quantities.
  • Data Visualization. Methods used to summarize the distribution and relationships between variables using visualizations such as charts, plots, and graphs.
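
As a minimal sketch of the summary statistics idea, the snippet below describes one variable and measures its linear relationship with another using only the Python standard library. The `hours` and `scores` values and the `pearson` helper are made up for illustration; they are not from a real dataset.

```python
import statistics

# Hypothetical observations: hours studied and exam score for eight students.
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
scores = [52.0, 55.0, 61.0, 64.0, 70.0, 74.0, 79.0, 85.0]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Distribution of a single variable.
summary = {
    "mean": statistics.fmean(scores),
    "median": statistics.median(scores),
    "stdev": statistics.stdev(scores),  # sample standard deviation
    "min": min(scores),
    "max": max(scores),
}

# Relationship between two variables.
correlation = pearson(hours, scores)

print(summary)
print(f"correlation(hours, scores) = {correlation:.3f}")
```

In practice the same quantities are usually paired with a histogram and a scatter plot, since a single number can hide skew or nonlinear structure.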

3. Data Cleaning

Observations from a domain are often not pristine.

Although the data is digital, it may be subjected to processes that can damage the fidelity of the data, and in turn any downstream processes or models that make use of the data.

Some examples include:

  • Data corruption.
  • Data errors.
  • Data loss.

The process of identifying and repairing issues with the data is called data cleaning.

Statistical methods are used for data cleaning; for example:

  • Outlier detection. Methods for identifying observations that are far from the expected value in a distribution.
  • Imputation. Methods for repairing or filling in corrupt or missing values in observations.
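
The two methods above can be sketched as follows, assuming a hypothetical list of sensor readings where `None` marks a missing value. The interquartile-range rule and median imputation used here are just one common choice for each method, not the only option.

```python
import statistics

# Hypothetical sensor readings; None marks a missing value.
readings = [10.2, 9.8, None, 10.5, 10.1, 250.0, 9.9, 10.3, None, 10.0]

observed = [r for r in readings if r is not None]

# Outlier detection with the interquartile-range (IQR) rule:
# flag values more than 1.5 * IQR below Q1 or above Q3.
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [r for r in observed if r < lower or r > upper]

# Imputation: fill missing values with the median of the non-outlier values.
# Note that the flagged outlier itself is left in place here.
clean = [r for r in observed if lower <= r <= upper]
fill_value = statistics.median(clean)
imputed = [fill_value if r is None else r for r in readings]

print("outliers:", outliers)
print("fill value:", fill_value)
print("imputed:", imputed)
```

Whether a flagged value is an error or a genuine extreme observation is a domain question; the rule only points at candidates.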

4. Data Selection

Not all observations or all variables may be relevant when modeling.

The process of reducing the scope of data to those elements that are most useful for making predictions is called data selection.

Two types of statistical methods that are used for data selection include:

  • Data Sample. Methods to systematically create smaller representative samples from larger datasets.
  • Feature Selection. Methods to automatically identify those variables that are most relevant to the outcome variable.
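
Both ideas can be illustrated with a small sketch. The dataset below is synthetic and the column names `a`, `b`, and `target` are hypothetical; ranking features by absolute Pearson correlation is one simple filter-style approach, assumed here for illustration.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


random.seed(1)

# Synthetic dataset: the target depends on feature "a" but not on feature "b".
rows = []
for _ in range(200):
    a = random.gauss(0, 1)
    b = random.gauss(0, 1)
    rows.append({"a": a, "b": b, "target": 3.0 * a + random.gauss(0, 0.5)})

# Data sample: a simple random sample of 50 rows, drawn without replacement.
sample = random.sample(rows, k=50)

# Feature selection: rank inputs by absolute correlation with the target.
target = [r["target"] for r in sample]
relevance = {
    name: abs(pearson([r[name] for r in sample], target))
    for name in ("a", "b")
}
ranked = sorted(relevance, key=relevance.get, reverse=True)

print(ranked)
print(relevance)
```

Correlation only captures linear relationships, so a feature with a low score here could still be useful to a nonlinear model.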

5. Data Preparation

Data often cannot be used directly for modeling.

Some transformation is often required in order to change the shape or structure of the data to make it more suitable for the chosen framing of the problem or learning algorithms.

Data preparation is performed using statistical methods. Some common examples include:

  • Scaling. Methods such as standardization and normalization.
  • Encoding. Methods such as integer encoding and one hot encoding.
  • Transforms. Methods such as power transforms like the Box-Cox method.
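
A minimal sketch of scaling and encoding is shown below, using plain Python rather than any particular library. The function names and the `ages` and `colors` values are hypothetical and chosen only for illustration.

```python
import statistics


def standardize(values):
    """Rescale to zero mean and unit (sample) standard deviation."""
    mean, stdev = statistics.fmean(values), statistics.stdev(values)
    return [(v - mean) / stdev for v in values]


def normalize(values):
    """Rescale to the range [0, 1] (min-max normalization)."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def one_hot(labels):
    """Map each label to a binary vector with a single 1."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]


ages = [20.0, 30.0, 40.0, 50.0, 60.0]
colors = ["red", "green", "blue", "green"]

print(standardize(ages))
print(normalize(ages))   # [0.0, 0.25, 0.5, 0.75, 1.0]
print(one_hot(colors))   # columns are ordered blue, green, red
```

In a real project the scaling parameters (mean, standard deviation, minimum, maximum) should be estimated on the training data only and then reused for test data, to avoid leaking information.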

6. Model Evaluation

A crucial part of a predictive modeling problem is evaluating a learning method.

This often requires the estimation of the skill of the model when making predictions on data not seen during the training of the model.

Generally, the planning of this process of training and evaluating a predictive model is called experimental design. This is a whole subfield of statistical methods.

  • Experimental Design. Methods to design systematic experiments to compare the effect of independent variables on an outcome, such as the choice of a machine learning algorithm on prediction accuracy.

As part of implementing an experimental design, methods are used to resample a dataset to make economical use of the available data when estimating the skill of the model. These methods are also a subfield of statistical methods.

  • Resampling Methods. Methods for systematically splitting a dataset into subsets for the purposes of training and evaluating a predictive model.
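
k-fold cross-validation is one widely used resampling method. The sketch below only generates the row indices for each split; `k_fold_indices` is a hypothetical helper written for this example, not a library function.

```python
import random


def k_fold_indices(n_rows, k, seed=1):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of roughly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Each row appears in exactly one test fold and in k - 1 training sets.
for train, test in k_fold_indices(n_rows=10, k=5):
    print("test rows:", sorted(test), "| training rows:", len(train))
```

A model would be fit on each training set and scored on the matching test fold, and the k scores are then summarized, for example by their mean and standard deviation.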

7. Model Configuration

A given machine learning algorithm often has a suite of hyperparameters that allow the learning method to be tailored to a specific problem.

The configuration of the hyperparameters is often empirical in nature, rather than analytical, requiring large suites of experiments in order to evaluate the effect of different hyperparameter values on the skill of the model.

The interpretation and comparison of the results between different hyperparameter configurations is made using one of two subfields of statistics, namely:

  • Statistical Hypothesis Tests. Methods that quantify the likelihood of observing the result given an assumption or expectation about the result (presented using critical values and p-values).
  • Estimation Statistics. Methods that quantify the uncertainty of a result using confidence intervals.
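
As one illustration of a hypothesis test in this setting, the sketch below uses a permutation test to ask whether two hyperparameter configurations differ in mean accuracy. The scores are invented, the runs are assumed to be independent, and the permutation test is just one possible choice of test.

```python
import random
import statistics


def permutation_test(scores_a, scores_b, n_permutations=10000, seed=1):
    """Two-sided p-value for the difference in mean score between two groups."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(scores_a) - statistics.fmean(scores_b))
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable.
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add one to both counts so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical accuracy scores from 10 runs of each configuration.
config_a = [0.81, 0.83, 0.80, 0.82, 0.84, 0.81, 0.83, 0.82, 0.80, 0.83]
config_b = [0.85, 0.86, 0.84, 0.87, 0.85, 0.86, 0.84, 0.85, 0.87, 0.86]

p_value = permutation_test(config_a, config_b)
print(f"p-value = {p_value:.4f}")
```

A small p-value suggests the observed difference would be unlikely if the two configurations had the same underlying skill. Scores from overlapping cross-validation folds are not independent, so this sketch would need adjusting in that case.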

8. Model Selection

Any one of many machine learning algorithms may be appropriate for a given predictive modeling problem.

The process of selecting one method as the solution is called model selection.

This may involve a suite of criteria both from stakeholders in the project and the careful interpretation of the estimated skill of the methods evaluated for the problem.

As with model configuration, two classes of statistical methods can be used to interpret the estimated skill of different models for the purposes of model selection. They are:

  • Statistical Hypothesis Tests. Methods that quantify the likelihood of observing the result given an assumption or expectation about the result (presented using critical values and p-values).
  • Estimation Statistics. Methods that quantify the uncertainty of a result using confidence intervals.
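
For the estimation statistics route, a percentile bootstrap is one way to put a confidence interval on the difference in skill between two candidate models. The sketch below assumes invented scores for two hypothetical models, `model_a` and `model_b`, and treats the scores as independent samples.

```python
import random
import statistics


def bootstrap_diff_ci(scores_a, scores_b, n_boot=5000, alpha=0.05, seed=1):
    """Percentile bootstrap interval for mean(scores_b) - mean(scores_a)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping the group sizes.
        resample_a = rng.choices(scores_a, k=len(scores_a))
        resample_b = rng.choices(scores_b, k=len(scores_b))
        diffs.append(statistics.fmean(resample_b) - statistics.fmean(resample_a))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical accuracy scores from 10 evaluation runs of each model.
model_a = [0.81, 0.83, 0.80, 0.82, 0.84, 0.81, 0.83, 0.82, 0.80, 0.83]
model_b = [0.85, 0.86, 0.84, 0.87, 0.85, 0.86, 0.84, 0.85, 0.87, 0.86]

lower, upper = bootstrap_diff_ci(model_a, model_b)
print(f"95% interval for the difference in mean accuracy: [{lower:.3f}, {upper:.3f}]")
```

If the interval excludes zero, the data are consistent with a real difference in skill, and the width of the interval shows how precisely that difference has been estimated.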

9. Model Presentation

Once a final model has been trained, it can be presented to stakeholders prior to being used or deployed to make actual predictions on real data.

A part of presenting a final model involves presenting the estimated skill of the model.

Methods from the field of estimation statistics can be used to quantify the uncertainty in the estimated skill of the machine learning model through the use of tolerance intervals and confidence intervals.

  • Estimation Statistics. Methods that quantify the uncertainty in the skill of a model via confidence intervals.
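
For classification accuracy, one simple option is the normal-approximation interval for a binomial proportion, sketched below. The counts are hypothetical, and the approximation assumes a reasonably large test set with accuracy not too close to 0 or 1.

```python
from math import sqrt
from statistics import NormalDist


def accuracy_confidence_interval(correct, total, confidence=0.95):
    """Normal-approximation (Wald) interval for a classification accuracy."""
    accuracy = correct / total
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sqrt(accuracy * (1 - accuracy) / total)
    return max(0.0, accuracy - margin), min(1.0, accuracy + margin)


# Hypothetical result: 880 correct predictions on a 1,000-row test set.
lower, upper = accuracy_confidence_interval(correct=880, total=1000)
print(f"accuracy = 0.880, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Reporting the interval alongside the point estimate lets stakeholders see how much the estimated skill might vary with a different test sample.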

10. Model Predictions

Finally, it will come time to start using a final model to make predictions for new data where we do not know the real outcome.

As part of making predictions, it is important to quantify the confidence of the prediction.

Just like with the process of model presentation, we can use methods from the field of estimation statistics to quantify this uncertainty, such as confidence intervals and prediction intervals.

  • Estimation Statistics. Methods that quantify the uncertainty for a prediction via prediction intervals.
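
A rough sketch of a prediction interval for a simple linear regression is given below. It assumes Gaussian residuals with constant variance and, for simplicity, ignores the uncertainty in the fitted coefficients, so the interval is somewhat narrower than the exact one. The data and function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean


def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mean_x, mean_y = fmean(x), fmean(y)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
        (a - mean_x) ** 2 for a in x
    )
    return slope, mean_y - slope * mean_x


def prediction_interval(x, y, x_new, confidence=0.95):
    """Approximate (lower, prediction, upper) for a new observation at x_new."""
    slope, intercept = fit_line(x, y)
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    # Residual standard deviation with n - 2 degrees of freedom.
    stdev = sqrt(sum(r ** 2 for r in residuals) / (len(x) - 2))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    y_hat = intercept + slope * x_new
    return y_hat - z * stdev, y_hat, y_hat + z * stdev


# Hypothetical training data: hours studied and exam score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
scores = [52.0, 55.0, 61.0, 64.0, 70.0, 74.0, 79.0, 85.0]

lower, y_hat, upper = prediction_interval(hours, scores, x_new=5.5)
print(f"prediction = {y_hat:.1f}, 95% interval = [{lower:.1f}, {upper:.1f}]")
```

Unlike a confidence interval on model skill, this interval describes where a single new observation is expected to fall.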

Summary

In this post, you discovered the importance of statistical methods throughout the process of working through a predictive modeling project.

Specifically, you learned:

  • Exploratory data analysis, data summarization, and data visualizations can be used to help frame your predictive modeling problem and better understand the data.
  • Statistical methods can be used to clean and prepare data ready for modeling.
  • Statistical hypothesis tests and estimation statistics can aid in model selection and in presenting the skill and predictions from final models.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
