How to Use Parametric Statistical Significance Tests in Python

The term parametric statistical methods usually refers to those methods that assume the data samples have a Gaussian distribution.

In applied machine learning, we often need to compare data samples, specifically the means of the samples, perhaps to see if one technique performs better than another on one or more datasets. To quantify this question and interpret the results, we can use parametric hypothesis testing methods such as the Student’s t-test and ANOVA.

In this tutorial, you will discover parametric statistical significance tests that quantify the difference between the means of two or more samples of data.

After completing this tutorial, you will know:

  • The Student’s t-test for quantifying the difference between the mean of two independent data samples.
  • The paired Student’s t-test for quantifying the difference between the mean of two dependent data samples.
  • The ANOVA and repeated measures ANOVA for checking the similarity or difference between the means of 2 or more data samples.

Let’s get started.

  • Updated May/2018: Improved language around reject vs fail to reject of statistical tests.

Tutorial Overview

  1. Parametric Statistical Significance Tests
  2. Test Data
  3. Student’s t-Test
  4. Paired Student’s t-Test
  5. Analysis of Variance Test
  6. Repeated Measures ANOVA Test

Parametric Statistical Significance Tests

Parametric statistical tests assume that a data sample was drawn from a specific population distribution.

They often refer to statistical tests that assume the Gaussian distribution. Because it is so common for data to fit this distribution, parametric statistical methods are more commonly used.

A typical question we may have about two or more samples of data is whether they have the same distribution. Parametric statistical significance tests are those statistical methods that assume the data comes from the same Gaussian distribution, that is, a data distribution with the same mean and standard deviation: the parameters of the distribution.

In general, each test calculates a test statistic that must be interpreted with some background in statistics and a deeper knowledge of the statistical test itself. Tests also return a p-value that can be used to interpret the result of the test. The p-value can be thought of as the probability of observing a test statistic at least as extreme as the one calculated from the data samples, given the base assumption (null hypothesis) that the samples were drawn from a population with the same distribution.

The p-value can be interpreted in the context of a chosen significance level called alpha. A common value for alpha is 5%, or 0.05. If the p-value is below the significance level, then the test says there is enough evidence to reject the null hypothesis and that the samples were likely drawn from populations with differing distributions.

  • p <= alpha: reject null hypothesis, different distribution.
  • p > alpha: fail to reject null hypothesis, same distribution.
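The decision rule in the two bullets above can be sketched as a small helper. The function name interpret() and its return strings are illustrative assumptions, not part of NumPy or SciPy.

```python
# minimal sketch of the decision rule; interpret() is a hypothetical helper
def interpret(p, alpha=0.05):
    """Return the decision for a p-value at the chosen significance level."""
    if p <= alpha:
        return 'reject H0'
    return 'fail to reject H0'

# example p-values below, at, and above the default alpha
for p in (0.025, 0.05, 0.300):
    print('p=%.3f: %s' % (p, interpret(p)))
```

Note that a p-value exactly equal to alpha falls on the reject side, matching the p <= alpha rule.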

Test Data

Before we look at specific parametric significance tests, let’s first define a test dataset that we can use to demonstrate each test.

We will generate two samples drawn from different distributions. Each sample will be drawn from a Gaussian distribution.

We will use the randn() NumPy function to generate 100 Gaussian random numbers for each sample, with a mean of 0 and a standard deviation of 1. Observations in the first sample are then scaled to have a mean of 50 and a standard deviation of 5. Observations in the second sample are scaled to have a mean of 51 and a standard deviation of 5.

We expect the statistical tests to discover that the samples were drawn from differing distributions, although the small sample size of 100 observations per sample will add some noise to this decision.

The complete code example is listed below.

```python
# generate gaussian data samples
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import std
# seed the random number generator
seed(1)
# generate two sets of univariate observations
data1 = 5 * randn(100) + 50
data2 = 5 * randn(100) + 51
# summarize
print('data1: mean=%.3f stdv=%.3f' % (mean(data1), std(data1)))
print('data2: mean=%.3f stdv=%.3f' % (mean(data2), std(data2)))
```

Running the example generates the data samples, then calculates and prints the mean and standard deviation for each sample, confirming their different distribution.

```
data1: mean=50.303 stdv=4.426
data2: mean=51.764 stdv=4.660
```

Student’s t-Test

The Student’s t-test is a statistical hypothesis test of whether two independent data samples, each assumed to have a Gaussian distribution, were drawn from the same Gaussian distribution. It is named for William Gosset, who used the pseudonym “Student”.

One of the most commonly used t tests is the independent samples t test. You use this test when you want to compare the means of two independent samples on a given variable.

The assumption or null hypothesis of the test is that the means of two populations are equal. A rejection of this hypothesis indicates that there is sufficient evidence that the means of the populations are different, and in turn that the distributions are not equal.

  • Fail to Reject H0: Sample distributions are equal.
  • Reject H0: Sample distributions are not equal.

The Student’s t-test is available in Python via the ttest_ind() SciPy function. The function takes two data samples as arguments and returns the calculated statistic and p-value.
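To make the statistic less of a black box, the following sketch computes the pooled-variance t statistic by hand and compares it with the value from ttest_ind(). It assumes the default behavior of ttest_ind(), which treats the two samples as having equal variance. The variable names (pooled, se, t_manual, p_manual) are chosen here for illustration.

```python
# sketch: compute the pooled-variance t statistic by hand and compare with SciPy
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import var
from numpy import sqrt
from scipy.stats import ttest_ind
from scipy.stats import t as t_dist

seed(1)
data1 = 5 * randn(100) + 50
data2 = 5 * randn(100) + 51
n1, n2 = len(data1), len(data2)
# pooled sample variance (ddof=1 gives the unbiased estimate for each sample)
pooled = ((n1 - 1) * var(data1, ddof=1) + (n2 - 1) * var(data2, ddof=1)) / (n1 + n2 - 2)
# standard error of the difference between the two sample means
se = sqrt(pooled * (1.0 / n1 + 1.0 / n2))
t_manual = (mean(data1) - mean(data2)) / se
# two-sided p-value from the t distribution with n1 + n2 - 2 degrees of freedom
p_manual = 2 * t_dist.sf(abs(t_manual), n1 + n2 - 2)
# the same quantities from SciPy
stat, p = ttest_ind(data1, data2)
print('manual: t=%.3f, p=%.3f' % (t_manual, p_manual))
print('scipy:  t=%.3f, p=%.3f' % (stat, p))
```

The two lines should agree, which shows that the test statistic is simply the difference in sample means divided by its estimated standard error.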

We can demonstrate the Student’s t-test on the test problem with the expectation that the test discovers the difference in distribution between the two independent samples. The complete code example is listed below.

```python
# Student's t-test
from numpy.random import seed
from numpy.random import randn
from scipy.stats import ttest_ind
# seed the random number generator
seed(1)
# generate two independent samples
data1 = 5 * randn(100) + 50
data2 = 5 * randn(100) + 51
# compare samples
stat, p = ttest_ind(data1, data2)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
	print('Same distributions (fail to reject H0)')
else:
	print('Different distributions (reject H0)')
```

Running the example calculates the Student’s t-test on the generated data samples and prints the statistic and p-value.

The interpretation of the p-value finds that the difference between the sample means is statistically significant at the 5% level.

```
Statistics=-2.262, p=0.025
Different distributions (reject H0)
```
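A single run depends on the random seed. As a rough check of how much noise a sample size of 100 adds to the decision, the sketch below repeats the same experiment many times and counts how often the null hypothesis is rejected. The number of trials (1000) is an arbitrary choice made for this illustration.

```python
# sketch: estimate how often the t-test detects a mean difference of 1
from numpy.random import seed
from numpy.random import randn
from scipy.stats import ttest_ind

seed(1)
alpha = 0.05
trials = 1000  # number of repeated experiments (arbitrary choice)
rejections = 0
for _ in range(trials):
    # fresh samples with the same means and standard deviations as the test data
    data1 = 5 * randn(100) + 50
    data2 = 5 * randn(100) + 51
    stat, p = ttest_ind(data1, data2)
    if p <= alpha:
        rejections += 1
rate = rejections / trials
print('Rejected H0 in %.1f%% of %d trials' % (100 * rate, trials))
```

With a true difference of only one fifth of a standard deviation, the rejection rate can be well below 100%, so a single fail-to-reject result should not be read as proof that the means are equal.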

Paired Student’s t-Test

We may wish to compare the means between two data samples that are related in some way.

For example, the data samples may represent two independent measures or evaluations of the same object. These data samples are repeated or dependent and are referred to as paired samples or repeated measures.

Because the samples are not independent, we cannot use the Student’s t-test. Instead, we must use a modified version of the test that corrects for the fact that the data samples are dependent, called the paired Student’s t-test.

A dependent samples t test is also used to compare two means on a single dependent variable. Unlike the independent samples test, however, a dependent samples t test is used to compare the means of a single sample or of two matched or paired samples.

The test is simplified because it no longer assumes that there is variation between the observations; instead, it assumes that observations were made in pairs, for example before and after a treatment on the same subject or subjects.

The default assumption, or null hypothesis of the test, is that there is no difference in the means between the samples. The rejection of the null hypothesis indicates that there is enough evidence that the sample means are different.

  • Fail to Reject H0: Paired sample distributions are equal.
  • Reject H0: Paired sample distributions are not equal.

The paired Student’s t-test can be implemented in Python using the ttest_rel() SciPy function. As with the unpaired version, the function takes two data samples as arguments and returns the calculated statistic and p-value.
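One way to see what the pairing does: the paired test is equivalent to a one-sample t-test on the pairwise differences, checking whether their mean is zero. The sketch below compares ttest_rel() with SciPy's ttest_1samp() on the differences. The variable names are illustrative.

```python
# sketch: the paired t-test equals a one-sample t-test on the pairwise differences
from numpy.random import seed
from numpy.random import randn
from scipy.stats import ttest_rel
from scipy.stats import ttest_1samp

seed(1)
data1 = 5 * randn(100) + 50
data2 = 5 * randn(100) + 51
# paired test on the two samples
stat_rel, p_rel = ttest_rel(data1, data2)
# one-sample test that the mean of the pairwise differences is zero
diff = data1 - data2
stat_one, p_one = ttest_1samp(diff, 0.0)
print('ttest_rel:   stat=%.3f, p=%.3f' % (stat_rel, p_rel))
print('ttest_1samp: stat=%.3f, p=%.3f' % (stat_one, p_one))
```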

We can demonstrate the paired Student’s t-test on the test dataset. Although the samples are independent, not paired, we can pretend for the sake of the demonstration that the observations are paired and calculate the statistic.

The complete example is listed below.

```python
# Paired Student's t-test
from numpy.random import seed
from numpy.random import randn
from scipy.stats import ttest_rel
# seed the random number generator
seed(1)
# generate two independent samples
data1 = 5 * randn(100) + 50
data2 = 5 * randn(100) + 51
# compare samples
stat, p = ttest_rel(data1, data2)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
	print('Same distributions (fail to reject H0)')
else:
	print('Different distributions (reject H0)')
```

Running the example calculates and prints the test statistic and p-value. The interpretation of the result suggests that the samples have different means and therefore different distributions.

```
Statistics=-2.372, p=0.020
Different distributions (reject H0)
```
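The demonstration above treats independent samples as if they were paired. For contrast, here is a sketch with data that really is paired. The setup is hypothetical: each subject has its own baseline, and the second measurement adds a shift of 1 plus a little measurement noise.

```python
# sketch: independent vs paired t-test on genuinely paired (hypothetical) data
from numpy.random import seed
from numpy.random import randn
from scipy.stats import ttest_ind
from scipy.stats import ttest_rel

seed(1)
# per-subject baseline shared by both measurements (assumed for illustration)
baseline = 5 * randn(100) + 50
before = baseline + randn(100)
after = baseline + 1 + randn(100)
# the independent test has to contend with the variation between subjects
stat_ind, p_ind = ttest_ind(before, after)
# the paired test works on within-subject differences
stat_rel, p_rel = ttest_rel(before, after)
print('independent: stat=%.3f, p=%.3g' % (stat_ind, p_ind))
print('paired:      stat=%.3f, p=%.3g' % (stat_rel, p_rel))
```

Because the between-subject variation cancels out in the differences, one would expect the paired test to report a much smaller p-value on data like this.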

Analysis of Variance Test

There are sometimes situations where we may have multiple independent data samples.

We can perform the Student’s t-test pairwise on each combination of the data samples to get an idea of which samples have different means. This can be onerous if we are only interested in whether all samples have the same distribution or not.

To answer this question, we can use the analysis of variance test, or ANOVA for short. ANOVA is a statistical test whose null hypothesis is that the means across 2 or more groups are equal. If the evidence suggests that this is not the case, the null hypothesis is rejected and at least one data sample has a different distribution.

  • Fail to Reject H0: All sample distributions are equal.
  • Reject H0: One or more sample distributions are not equal.

Importantly, the test can only comment on whether all samples are the same or not; it cannot quantify which samples differ or by how much.

The purpose of a one-way analysis of variance (one-way ANOVA) is to compare the means of two or more groups (the independent variable) on one dependent variable to see if the group means are significantly different from each other.

The test requires that the data samples are drawn from a Gaussian distribution, that the samples are independent, and that all data samples have the same standard deviation.

The ANOVA test can be performed in Python using the f_oneway() SciPy function. The function takes two or more data samples as arguments and returns the test statistic (the F-statistic) and the p-value.
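As a rough illustration of what the statistic measures, the sketch below computes the one-way ANOVA F-statistic by hand as the ratio of between-group to within-group variation, then compares it with f_oneway(). The variable names (ss_between, ss_within, f_manual, p_manual) are chosen here for illustration.

```python
# sketch: compute the one-way ANOVA F-statistic by hand and compare with SciPy
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import concatenate
from scipy.stats import f_oneway
from scipy.stats import f as f_dist

seed(1)
groups = [5 * randn(100) + 50, 5 * randn(100) + 50, 5 * randn(100) + 52]
k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = mean(concatenate(groups))
# variation of the group means around the grand mean
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
# variation of the observations around their own group mean
ss_within = sum(((g - mean(g)) ** 2).sum() for g in groups)
# ratio of the two mean squares
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
# p-value from the F distribution with (k - 1, n_total - k) degrees of freedom
p_manual = f_dist.sf(f_manual, k - 1, n_total - k)
stat, p = f_oneway(*groups)
print('manual: F=%.3f, p=%.3f' % (f_manual, p_manual))
print('scipy:  F=%.3f, p=%.3f' % (stat, p))
```

A large F value means the group means are spread out more than the variation within the groups would explain.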

We can modify our test problem to have two samples with the same mean and a third sample with a slightly different mean. We would then expect the test to discover that at least one sample has a different distribution.

The complete example is listed below.

```python
# Analysis of Variance test
from numpy.random import seed
from numpy.random import randn
from scipy.stats import f_oneway
# seed the random number generator
seed(1)
# generate three independent samples
data1 = 5 * randn(100) + 50
data2 = 5 * randn(100) + 50
data3 = 5 * randn(100) + 52
# compare samples
stat, p = f_oneway(data1, data2, data3)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
	print('Same distributions (fail to reject H0)')
else:
	print('Different distributions (reject H0)')
```

Running the example calculates and prints the test statistic and the p-value.

The interpretation of the p-value correctly rejects the null hypothesis indicating that one or more sample means differ.

```
Statistics=3.655, p=0.027
Different distributions (reject H0)
```
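Since ANOVA cannot say which samples differ, one possible follow-up is the pairwise Student’s t-test mentioned earlier. The sketch below runs it on every pair of samples. Dividing alpha by the number of comparisons (a Bonferroni correction) is an assumption of this sketch and is not covered in this tutorial; it is one common way to limit false positives when running several tests.

```python
# sketch: pairwise t-tests as a follow-up to ANOVA
from itertools import combinations
from numpy.random import seed
from numpy.random import randn
from scipy.stats import ttest_ind

seed(1)
samples = {
    'data1': 5 * randn(100) + 50,
    'data2': 5 * randn(100) + 50,
    'data3': 5 * randn(100) + 52,
}
alpha = 0.05
pairs = list(combinations(samples, 2))
# Bonferroni correction (an assumption of this sketch, not from the tutorial)
adjusted_alpha = alpha / len(pairs)
results = {}
for a, b in pairs:
    stat, p = ttest_ind(samples[a], samples[b])
    results[(a, b)] = p
    verdict = 'reject H0' if p <= adjusted_alpha else 'fail to reject H0'
    print('%s vs %s: p=%.3f (%s)' % (a, b, p, verdict))
```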

Repeated Measures ANOVA Test

We may have multiple data samples that are related or dependent in some way.

For example, we may repeat the same measurements on a subject at different time periods. In this case, the samples will no longer be independent; instead we will have multiple paired samples.

We could repeat the paired Student’s t-test multiple times, once for each pair of samples. Alternatively, we can use a single test to check if all of the samples have the same mean. A variation of the ANOVA test can be used, modified to account for the dependence between the samples. This test is called the repeated measures ANOVA test.

The default assumption or null hypothesis is that all paired samples have the same mean, and therefore the same distribution. If the samples suggest that this is not the case, then the null hypothesis is rejected and one or more of the paired samples have a different mean.

  • Fail to Reject H0: All paired sample distributions are equal.
  • Reject H0: One or more paired sample distributions are not equal.

Repeated-measures ANOVA has a number of advantages over paired t tests. First, with repeated-measures ANOVA, we can examine differences on a dependent variable that has been measured at more than two time points, whereas with a paired t test we can only compare scores on a dependent variable from two time points.

Unfortunately, at the time of writing, there is no version of the repeated measures ANOVA test available in SciPy. Hopefully this test will be added soon.

I mention this test for completeness in case you require it on your project and are able to seek and find an alternate implementation.
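As a starting point, a one-way repeated measures ANOVA can be sketched with NumPy and the F distribution from SciPy. This is a minimal illustration under stated assumptions, not a vetted implementation: the function name rm_anova() is hypothetical, the data layout (rows are subjects, columns are conditions) is chosen here, the test data is made up, and no sphericity correction is applied. The statsmodels library also provides an AnovaRM class that may be worth evaluating.

```python
# sketch: one-way repeated measures ANOVA; rm_anova() is a hypothetical helper
from numpy.random import seed
from numpy.random import randn
from numpy import column_stack
from scipy.stats import f as f_dist
from scipy.stats import ttest_rel

def rm_anova(data):
    """Rows are subjects, columns are repeated conditions. Returns (F, p)."""
    n, k = data.shape
    grand_mean = data.mean()
    ss_total = ((data - grand_mean) ** 2).sum()
    # variation explained by the conditions (column means)
    ss_conditions = n * ((data.mean(axis=0) - grand_mean) ** 2).sum()
    # variation explained by differences between subjects (row means)
    ss_subjects = k * ((data.mean(axis=1) - grand_mean) ** 2).sum()
    # what remains is the error term
    ss_error = ss_total - ss_conditions - ss_subjects
    df_conditions = k - 1
    df_error = (k - 1) * (n - 1)
    stat = (ss_conditions / df_conditions) / (ss_error / df_error)
    p = f_dist.sf(stat, df_conditions, df_error)
    return stat, p

seed(1)
# made-up repeated measures: a per-subject baseline plus a condition effect
baseline = 5 * randn(100)
data = column_stack([
    baseline + 50 + randn(100),
    baseline + 50 + randn(100),
    baseline + 51 + randn(100),
])
stat, p = rm_anova(data)
print('Statistics=%.3f, p=%.3g' % (stat, p))

# sanity check: with two conditions, F equals the squared paired t statistic
two = data[:, [0, 2]]
f2, p2 = rm_anova(two)
t2, pt2 = ttest_rel(two[:, 0], two[:, 1])
print('F=%.3f, t squared=%.3f' % (f2, t2 ** 2))
```

The sanity check at the end relies on the fact that, for exactly two conditions, the repeated measures ANOVA reduces to the paired Student’s t-test.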

Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

  • Update all examples to operate on data samples that have the same distribution.
  • Create a flowchart for choosing each of the three statistical significance tests given the requirements and behavior of each test.
  • Consider 3 cases of comparing data samples in a machine learning project, assume a non-Gaussian distribution for the samples, and suggest the type of test that could be used in each case.

If you explore any of these extensions, I’d love to know.

Summary

In this tutorial, you discovered parametric statistical significance tests that quantify the difference between the means of two or more samples of data.

Specifically, you learned:

  • The Student’s t-test for quantifying the difference between the mean of two independent data samples.
  • The paired Student’s t-test for quantifying the difference between the mean of two dependent data samples.
  • The ANOVA and repeated measures ANOVA for checking the similarity or difference between the means of two or more data samples.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
