
15 Statistical Hypothesis Tests in Python (Cheat Sheet)

Quick-reference guide to the 15 statistical hypothesis tests that you need in applied machine learning, with sample code in Python.

Although there are hundreds of statistical hypothesis tests that you could use, there is only a small subset that you may need to use in a machine learning project.

In this post, you will discover a cheat sheet for the most popular statistical hypothesis tests for a machine learning project with examples using the Python API.

Each statistical test is presented in a consistent way, including:

  • The name of the test.
  • What the test is checking.
  • The key assumptions of the test.
  • How the test result is interpreted.
  • Python API for using the test.

Note, when it comes to assumptions such as the expected distribution of data or sample size, the results of a given test are likely to degrade gracefully rather than become immediately unusable if an assumption is violated.

Generally, data samples need to be representative of the domain and large enough to expose their distribution to analysis.

In some cases, the data can be corrected to meet the assumptions, such as correcting a nearly normal distribution to be normal by removing outliers, or using a correction to the degrees of freedom in a statistical test when samples have differing variance, to name two examples.

Finally, there may be multiple tests for a given concern, e.g. normality. We cannot get crisp answers to questions with statistics; instead, we get probabilistic answers. As such, we can arrive at different answers to the same question by considering the question in different ways. Hence the need for multiple different tests for some questions we may have about data.

Let’s get started.

  • Update Nov/2018: Added a better overview of the tests covered.

Statistical Hypothesis Tests in Python Cheat Sheet
Photo by davemichuda, some rights reserved.

Tutorial Overview

This tutorial is divided into four parts; they are:

  1. Normality Tests
    1. Shapiro-Wilk Test
    2. D’Agostino’s K^2 Test
    3. Anderson-Darling Test
  2. Correlation Tests
    1. Pearson’s Correlation Coefficient
    2. Spearman’s Rank Correlation
    3. Kendall’s Rank Correlation
    4. Chi-Squared Test
  3. Parametric Statistical Hypothesis Tests
    1. Student’s t-test
    2. Paired Student’s t-test
    3. Analysis of Variance Test (ANOVA)
    4. Repeated Measures ANOVA Test
  4. Nonparametric Statistical Hypothesis Tests
    1. Mann-Whitney U Test
    2. Wilcoxon Signed-Rank Test
    3. Kruskal-Wallis H Test
    4. Friedman Test

1. Normality Tests

This section lists statistical tests that you can use to check if your data has a Gaussian distribution.

Shapiro-Wilk Test

Tests whether a data sample has a Gaussian distribution.

Assumptions

  • Observations in each sample are independent and identically distributed (iid).

Interpretation

  • H0: the sample has a Gaussian distribution.
  • H1: the sample does not have a Gaussian distribution.

Python Code

    from scipy.stats import shapiro
    data = ...
    stat, p = shapiro(data)
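
As a worked illustration of the Shapiro-Wilk call, here is a minimal sketch. The sample values and the 0.05 significance level are assumptions made up for this example, not part of the original cheat sheet.

```python
from scipy.stats import shapiro

# Made-up sample values, used only for illustration.
data = [0.87, 2.82, 0.12, -0.95, -0.06, -1.44, 0.36, -1.48, -1.64, -1.87]

stat, p = shapiro(data)

alpha = 0.05  # significance level assumed for this sketch
if p > alpha:
    print(f"stat={stat:.3f}, p={p:.3f}: fail to reject H0 (consistent with Gaussian)")
else:
    print(f"stat={stat:.3f}, p={p:.3f}: reject H0 (not consistent with Gaussian)")
```

A p-value above the chosen level does not prove normality; it only means the sample gives no strong evidence against it.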

More Information

D’Agostino’s K^2 Test

Tests whether a data sample has a Gaussian distribution.

Assumptions

  • Observations in each sample are independent and identically distributed (iid).

Interpretation

  • H0: the sample has a Gaussian distribution.
  • H1: the sample does not have a Gaussian distribution.

Python Code

    from scipy.stats import normaltest
    data = ...
    stat, p = normaltest(data)
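
The following sketch runs D'Agostino's K^2 test on two synthetic samples, one Gaussian and one strongly skewed, to show how the p-value reacts. The seed, sample sizes, and distribution parameters are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import normaltest

# Synthetic data; seed and parameters are arbitrary illustration choices.
rng = np.random.default_rng(seed=0)
gaussian = rng.normal(loc=50.0, scale=5.0, size=200)
skewed = rng.exponential(scale=5.0, size=200)

stat_g, p_g = normaltest(gaussian)
stat_s, p_s = normaltest(skewed)

print(f"gaussian sample: stat={stat_g:.3f}, p={p_g:.3f}")
print(f"skewed sample:   stat={stat_s:.3f}, p={p_s:.3g}")
```

The test combines skewness and kurtosis, so it tends to work best with more than a handful of observations.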

More Information

Anderson-Darling Test

Tests whether a data sample has a Gaussian distribution.

Assumptions

  • Observations in each sample are independent and identically distributed (iid).

Interpretation

  • H0: the sample has a Gaussian distribution.
  • H1: the sample does not have a Gaussian distribution.

Python Code

    from scipy.stats import anderson
    data = ...
    result = anderson(data)
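
Unlike the two tests above, `anderson` does not return a single p-value. A rough sketch of how the result might be read, assuming a SciPy version whose result exposes `statistic`, `critical_values`, and `significance_level` (the synthetic data is an illustration only):

```python
import numpy as np
from scipy.stats import anderson

# Synthetic Gaussian sample; seed and size are arbitrary illustration choices.
rng = np.random.default_rng(seed=0)
data = rng.normal(size=100)

result = anderson(data)  # the reference distribution defaults to the normal

print(f"statistic={result.statistic:.3f}")
# Compare the statistic with the critical value at each significance level (in percent).
for level, critical in zip(result.significance_level, result.critical_values):
    verdict = "reject H0" if result.statistic > critical else "fail to reject H0"
    print(f"{level}%: critical value={critical:.3f} -> {verdict}")
```

H0 is rejected at a given level when the statistic exceeds the matching critical value.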

More Information

2. Correlation Tests

This section lists statistical tests that you can use to check if two samples are related.

Pearson’s Correlation Coefficient

Tests whether two samples have a linear relationship.

Assumptions

  • Observations in each sample are independent and identically distributed (iid).
  • Observations in each sample are normally distributed.
  • Observations in each sample have the same variance.

Interpretation

  • H0: the two samples are independent.
  • H1: there is a dependency between the samples.

Python Code

    from scipy.stats import pearsonr
    data1, data2 = ...
    corr, p = pearsonr(data1, data2)
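
A small worked example of the Pearson call, using made-up values where the second sample is roughly twice the first plus some noise:

```python
from scipy.stats import pearsonr

# Made-up paired values with a near-linear relationship (illustration only).
data1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
data2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

corr, p = pearsonr(data1, data2)
print(f"corr={corr:.3f}, p={p:.3g}")

if p < 0.05:  # significance level assumed for this sketch
    print("reject H0: evidence of a linear relationship")
else:
    print("fail to reject H0")
```

The coefficient ranges from -1 to 1; its sign gives the direction of the relationship and its magnitude the strength.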

More Information

Spearman’s Rank Correlation

Tests whether two samples have a monotonic relationship.

Assumptions

  • Observations in each sample are independent and identically distributed (iid).
  • Observations in each sample can be ranked.

Interpretation

  • H0: the two samples are independent.
  • H1: there is a dependency between the samples.

Python Code

    from scipy.stats import spearmanr
    data1, data2 = ...
    corr, p = spearmanr(data1, data2)

More Information

Kendall’s Rank Correlation

Tests whether two samples have a monotonic relationship.

Assumptions

  • Observations in each sample are independent and identically distributed (iid).
  • Observations in each sample can be ranked.

Interpretation

  • H0: the two samples are independent.
  • H1: there is a dependency between the samples.

Python Code

    from scipy.stats import kendalltau
    data1, data2 = ...
    corr, p = kendalltau(data1, data2)
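
To illustrate how the two rank correlations differ from Pearson's coefficient, this sketch uses a made-up relationship that is monotonic but not linear (the second sample is the cube of the first):

```python
from scipy.stats import kendalltau, pearsonr, spearmanr

# Made-up data: strictly increasing but clearly non-linear.
data1 = list(range(1, 11))
data2 = [x ** 3 for x in data1]

r, _ = pearsonr(data1, data2)
rho, _ = spearmanr(data1, data2)
tau, _ = kendalltau(data1, data2)

print(f"Pearson r={r:.3f}, Spearman rho={rho:.3f}, Kendall tau={tau:.3f}")
```

Because the ordering of the two samples agrees perfectly, both rank coefficients equal 1, while Pearson's coefficient falls below 1 since the relationship is not a straight line.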

More Information

Chi-Squared Test

Tests whether two categorical variables are related or independent.

Assumptions

  • Observations used in the calculation of the contingency table are independent.
  • Each cell of the contingency table has enough examples; a common rule of thumb is an expected count of 5 or more per cell.

Interpretation

  • H0: the two samples are independent.
  • H1: there is a dependency between the samples.

Python Code

    from scipy.stats import chi2_contingency
    table = ...
    stat, p, dof, expected = chi2_contingency(table)
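
A minimal sketch of the chi-squared call on a 2x2 contingency table. The counts are invented for illustration; rows and columns stand for the categories of two hypothetical variables.

```python
from scipy.stats import chi2_contingency

# Invented counts: rows are categories of one variable, columns of the other.
table = [
    [60, 30],
    [30, 60],
]

stat, p, dof, expected = chi2_contingency(table)

print(f"stat={stat:.3f}, p={p:.3g}, dof={dof}")
print("expected counts under independence:")
print(expected)
```

The `expected` array holds the counts that independence would predict, which is also what the cell-size assumption should be checked against.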

More Information

3. Parametric Statistical Hypothesis Tests

This section lists statistical tests that you can use to compare data samples.

Student’s t-test

Tests whether the means of two independent samples are significantly different.

Assumptions

  • Observations in each sample are independent and identically distributed (iid).
  • Observations in each sample are normally distributed.
  • Observations in each sample have the same variance.

Interpretation

  • H0: the means of the samples are equal.
  • H1: the means of the samples are unequal.

Python Code

    from scipy.stats import ttest_ind
    data1, data2 = ...
    stat, p = ttest_ind(data1, data2)
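
The introduction mentions correcting the degrees of freedom when samples have differing variance. One way to do that in SciPy is Welch's variant, selected with `equal_var=False`. The sketch below runs both variants on made-up samples with visibly different spread.

```python
from scipy.stats import ttest_ind

# Made-up samples: similar size, clearly different variance.
data1 = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1, 5.0]
data2 = [6.9, 4.2, 8.1, 5.5, 7.7, 3.9, 9.0, 6.3]

# Student's t-test: assumes equal variance in both samples.
stat_student, p_student = ttest_ind(data1, data2)

# Welch's t-test: drops the equal-variance assumption.
stat_welch, p_welch = ttest_ind(data1, data2, equal_var=False)

print(f"Student: stat={stat_student:.3f}, p={p_student:.3f}")
print(f"Welch:   stat={stat_welch:.3f}, p={p_welch:.3f}")
```

When the equal-variance assumption is in doubt, Welch's variant is usually the more cautious choice.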

More Information

Paired Student’s t-test

Tests whether the means of two paired samples are significantly different.

Assumptions

  • Observations in each sample are independent and identically distributed (iid).
  • Observations in each sample are normally distributed.
  • Observations in each sample have the same variance.
  • Observations across each sample are paired.

Interpretation

  • H0: the means of the samples are equal.
  • H1: the means of the samples are unequal.

Python Code

    from scipy.stats import ttest_rel
    data1, data2 = ...
    stat, p = ttest_rel(data1, data2)
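
A sketch of the paired test on invented before/after measurements of the same eight subjects. It also shows that the paired test amounts to a one-sample t-test on the per-pair differences.

```python
from scipy.stats import ttest_1samp, ttest_rel

# Invented paired measurements: position i in both lists is the same subject.
before = [72, 68, 75, 71, 69, 74, 70, 73]
after = [70, 67, 72, 70, 66, 73, 69, 70]

stat, p = ttest_rel(before, after)
print(f"paired t-test: stat={stat:.3f}, p={p:.4f}")

# Equivalent view: test whether the mean difference is zero.
differences = [b - a for b, a in zip(before, after)]
stat_diff, p_diff = ttest_1samp(differences, 0.0)
print(f"one-sample on differences: stat={stat_diff:.3f}, p={p_diff:.4f}")
```

The order of observations matters here: shuffling one list breaks the pairing and invalidates the test.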

More Information

Analysis of Variance Test (ANOVA)

Tests whether the means of two or more independent samples are significantly different.

Assumptions

  • Observations in each sample are independent and identically distributed (iid).
  • Observations in each sample are normally distributed.
  • Observations in each sample have the same variance.

Interpretation

  • H0: the means of the samples are equal.
  • H1: one or more of the means of the samples are unequal.

Python Code

    from scipy.stats import f_oneway
    data1, data2, ... = ...
    stat, p = f_oneway(data1, data2, ...)
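
A short worked example of one-way ANOVA with three made-up groups whose means are clearly separated:

```python
from scipy.stats import f_oneway

# Made-up measurements for three independent groups.
data1 = [4.8, 5.1, 5.0, 4.9, 5.2]
data2 = [5.9, 6.1, 6.0, 6.2, 5.8]
data3 = [7.1, 6.9, 7.0, 7.2, 6.8]

stat, p = f_oneway(data1, data2, data3)
print(f"F={stat:.3f}, p={p:.3g}")

if p < 0.05:  # significance level assumed for this sketch
    print("reject H0: at least one group mean differs")
else:
    print("fail to reject H0")
```

A significant result says that some mean differs, not which one; identifying the group requires a follow-up comparison.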

More Information

Repeated Measures ANOVA Test

Tests whether the means of two or more paired samples are significantly different.

Assumptions

  • Observations in each sample are independent and identically distributed (iid).
  • Observations in each sample are normally distributed.
  • Observations in each sample have the same variance.
  • Observations across each sample are paired.

Interpretation

  • H0: the means of the samples are equal.
  • H1: one or more of the means of the samples are unequal.

Python Code

Not available in SciPy. The statsmodels library provides an implementation, AnovaRM, in statsmodels.stats.anova.

More Information

4. Nonparametric Statistical Hypothesis Tests

Mann-Whitney U Test

Tests whether the distributions of two independent samples are equal or not.

Assumptions

  • Observations in each sample are independent and identically distributed (iid).
  • Observations in each sample can be ranked.

Interpretation

  • H0: the distributions of both samples are equal.
  • H1: the distributions of both samples are not equal.

Python Code

    from scipy.stats import mannwhitneyu
    data1, data2 = ...
    stat, p = mannwhitneyu(data1, data2)
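
A minimal sketch of the Mann-Whitney U call on two invented samples in which every value of the first lies below every value of the second. The `alternative` argument is passed explicitly because its default has differed between SciPy versions.

```python
from scipy.stats import mannwhitneyu

# Invented samples: all values in data1 are smaller than all values in data2.
data1 = [1.1, 1.4, 0.9, 1.3, 1.0, 1.2, 0.8, 1.5]
data2 = [2.3, 2.0, 2.6, 2.2, 2.8, 2.1, 2.5, 2.4]

stat, p = mannwhitneyu(data1, data2, alternative="two-sided")
print(f"U={stat:.1f}, p={p:.3g}")
```

The test works on ranks only, so it is insensitive to how far apart the values are once their ordering is fixed.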

More Information

Wilcoxon Signed-Rank Test

Tests whether the distributions of two paired samples are equal or not.

Assumptions

  • Observations in each sample are independent and identically distributed (iid).
  • Observations in each sample can be ranked.
  • Observations across each sample are paired.

Interpretation

  • H0: the distributions of both samples are equal.
  • H1: the distributions of both samples are not equal.

Python Code

    from scipy.stats import wilcoxon
    data1, data2 = ...
    stat, p = wilcoxon(data1, data2)
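
A sketch of the Wilcoxon signed-rank call on invented paired data. The values were picked so that every difference is positive and no two differences tie, which keeps the example free of tie-handling details.

```python
from scipy.stats import wilcoxon

# Invented paired measurements: position i in both lists is the same subject.
data1 = [11.2, 10.8, 12.1, 11.5, 10.4, 11.9, 11.1, 10.6, 11.7, 12.7]
data2 = [10.0, 10.1, 9.9, 10.2, 10.0, 9.8, 10.1, 10.0, 9.9, 10.2]

stat, p = wilcoxon(data1, data2)
print(f"W={stat:.1f}, p={p:.4f}")
```

With the default two-sided alternative, the statistic is the smaller of the rank sums of positive and negative differences, so it is zero when all differences share a sign.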

More Information

Kruskal-Wallis H Test

Tests whether the distributions of two or more independent samples are equal or not.

Assumptions

  • Observations in each sample are independent and identically distributed (iid).
  • Observations in each sample can be ranked.

Interpretation

  • H0: the distributions of all samples are equal.
  • H1: the distributions of one or more samples are not equal.

Python Code

    from scipy.stats import kruskal
    data1, data2, ... = ...
    stat, p = kruskal(data1, data2, ...)
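
A worked sketch of the Kruskal-Wallis H test with three made-up groups, the rank-based counterpart of a one-way ANOVA:

```python
from scipy.stats import kruskal

# Made-up measurements for three independent groups.
data1 = [4.8, 5.1, 5.0, 4.9, 5.2]
data2 = [5.9, 6.1, 6.0, 6.2, 5.8]
data3 = [7.1, 6.9, 7.0, 7.2, 6.8]

stat, p = kruskal(data1, data2, data3)
print(f"H={stat:.3f}, p={p:.4f}")

if p < 0.05:  # significance level assumed for this sketch
    print("reject H0: at least one distribution differs")
else:
    print("fail to reject H0")
```

As with ANOVA, a significant result does not say which group differs.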

More Information

Friedman Test

Tests whether the distributions of paired samples are equal or not. Note that SciPy's friedmanchisquare requires at least three samples.

Assumptions

  • Observations in each sample are independent and identically distributed (iid).
  • Observations in each sample can be ranked.
  • Observations across each sample are paired.

Interpretation

  • H0: the distributions of all samples are equal.
  • H1: the distributions of one or more samples are not equal.

Python Code

    from scipy.stats import friedmanchisquare
    data1, data2, ... = ...
    stat, p = friedmanchisquare(data1, data2, ...)
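
A minimal sketch of the Friedman test on invented data: eight subjects, each measured under three conditions, with position i in every list belonging to the same subject.

```python
from scipy.stats import friedmanchisquare

# Invented repeated measurements: one list per condition, one position per subject.
data1 = [1.0, 1.2, 0.9, 1.1, 1.3, 1.0, 0.8, 1.2]
data2 = [1.5, 1.6, 1.4, 1.7, 1.8, 1.5, 1.3, 1.6]
data3 = [2.1, 2.0, 1.9, 2.2, 2.4, 2.0, 1.8, 2.3]

stat, p = friedmanchisquare(data1, data2, data3)
print(f"stat={stat:.3f}, p={p:.4f}")
```

The test ranks the conditions within each subject, so all lists need the same length and the same subject order.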

More Information

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered the key statistical hypothesis tests that you may need to use in a machine learning project.

Specifically, you learned:

  • The types of tests to use in different circumstances, such as normality checking, relationships between variables, and differences between samples.
  • The key assumptions for each test and how to interpret the test result.
  • How to implement the test using the Python API.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Did I miss an important statistical test or key assumption for one of the listed tests?
Let me know in the comments below.

