
小馬哥課堂 (Xiao Ma Ge's Classroom) - Statistics - The t-distribution


T distribution

Definition

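In probability theory and statistics, Student's t-distribution (or simply the t-distribution) is used to estimate the mean of a normally distributed population with unknown variance from a small sample. If the population variance is known, or the sample is large enough that the sample variance is a reliable stand-in for it, the normal distribution should be used to estimate the population mean instead.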

In probability and statistics, Student's t-distribution (or simply the t-distribution) is any member of a family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small and the population standard deviation is unknown.

If we take a sample of n observations from a normal distribution, then the t-distribution with \(\nu = n-1\) degrees of freedom can be defined as the distribution of the location of the sample mean relative to the true mean, divided by the sample standard deviation, after multiplying by the standardizing term \(\sqrt{n}\). In this way, the t-distribution can be used to construct a confidence interval for the true mean.

Probability density function (pdf)

\(f(t)=\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1+\frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}\), where \(\nu\) is the number of degrees of freedom and \(\Gamma\) is the gamma function.
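The density formula can be evaluated directly with the standard library. The following is a minimal sketch, not a reference implementation; the function name `t_pdf` is chosen here for illustration, and the log-gamma form is an implementation choice to avoid overflow for large \(\nu\).

```python
import math

def t_pdf(t: float, nu: float) -> float:
    """Density of Student's t-distribution with nu degrees of freedom at t."""
    # log of Gamma((nu+1)/2) / (sqrt(nu*pi) * Gamma(nu/2)); lgamma avoids overflow
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi))
    # log of (1 + t^2/nu)^(-(nu+1)/2)
    log_kernel = -(nu + 1) / 2 * math.log1p(t * t / nu)
    return math.exp(log_norm + log_kernel)

# For nu = 1 the t-distribution is the Cauchy distribution, whose density at 0 is 1/pi.
print(t_pdf(0.0, 1), 1 / math.pi)
print(t_pdf(0.0, 6))
```

Because the formula depends on t only through \(t^2\), the function returns the same value at t and -t, which is the symmetry of the curve.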

Properties

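The shape of the t-distribution curve depends on n (more precisely, on the degrees of freedom df). Compared with the standard normal curve, the smaller df is, the flatter the t curve: its center is lower and its two tails are higher. The larger df is, the closer the t curve comes to the normal curve, and as df → ∞ the t-distribution converges to the standard normal distribution.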

The t-distribution is symmetric and bell-shaped, like the normal distribution, but has heavier tails, meaning that it is more prone to producing values that fall far from its mean. This makes it useful for understanding the statistical behavior of certain types of ratios of random quantities, in which variation in the denominator is amplified and may produce outlying values when the denominator of the ratio falls close to zero. The Student's t-distribution is a special case of the generalised hyperbolic distribution.
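To make the "lower center, heavier tails" behavior concrete, the sketch below compares the t density with the standard normal density at the center (t = 0) and in the tail (t = 4) for a few degrees of freedom. The helper names `t_pdf` and `normal_pdf` and the chosen evaluation points are illustrative assumptions.

```python
import math

def t_pdf(t, nu):
    """Student's t density with nu degrees of freedom (log-gamma form)."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(t * t / nu))

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Columns: degrees of freedom, density at the center, density in the tail.
for nu in (1, 5, 30, 1000):
    print(nu, round(t_pdf(0.0, nu), 4), round(t_pdf(4.0, nu), 6))
print("normal", round(normal_pdf(0.0), 4), round(normal_pdf(4.0), 6))
```

Reading down the columns, the center value rises toward the normal value and the tail value falls toward it as the degrees of freedom grow.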

Uses

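In probability theory and statistics, the t-distribution is often used to estimate the mean of a normally distributed population. The t-test refines the Z-test and can be applied whether the sample is large or small. When the sample is large (more than 120 observations), the Z-test can be used, but applying the Z-test to a small sample produces large errors, so the t-test must be used when the sample is small.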

The t-distribution plays a role in a number of widely used statistical analyses, including Student's t-test for assessing the statistical significance of the difference between two sample means, the construction of confidence intervals for the difference between two population means, and linear regression analysis. The Student's t-distribution also arises in the Bayesian analysis of data from a normal family.

How the t-distribution arises

Let \(X_1, \ldots, X_n\) be independent and identically distributed as \(N(\mu, \sigma^2)\), i.e. a sample of size n from a normally distributed population with expected value \(\mu\) and variance \(\sigma^2\).

Let \(\overline X = \frac{1}{n}\sum_{i=1}^n X_i\) be the sample mean, and let \(S^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-\overline X)^2\) be the (Bessel-corrected) sample variance. Then the random variable \(\frac{\overline X - \mu}{\sigma/\sqrt n}\) has a standard normal distribution (i.e. normal with expected value 0 and variance 1), and the random variable \(\frac{\overline X - \mu}{S/\sqrt n}\) (where S has been substituted for \(\sigma\)) has a t-distribution with n-1 degrees of freedom.
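A small simulation can illustrate this construction. The sketch below draws repeated normal samples of size 7, computes \(\frac{\overline X - \mu}{S/\sqrt n}\) for each, and counts how often its absolute value exceeds a cutoff. The function names, the population parameters (mean 10, standard deviation 3), the trial count, and the seed are all arbitrary choices for illustration; 2.447 is the tabulated two-sided 95% value for 6 degrees of freedom, and 1.96 is the corresponding standard normal value.

```python
import math
import random

def t_statistic(sample, mu):
    """(sample mean - mu) / (S / sqrt(n)), with S from the Bessel-corrected variance."""
    n = len(sample)
    mean = sum(sample) / n
    s2 = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return (mean - mu) / math.sqrt(s2 / n)

def simulate_exceedance(n, critical, trials=20000, mu=10.0, sigma=3.0, seed=0):
    """Fraction of simulated normal samples of size n with |T| > critical."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        if abs(t_statistic(sample, mu)) > critical:
            hits += 1
    return hits / trials

print(simulate_exceedance(7, 2.447))  # should land near 0.05
print(simulate_exceedance(7, 1.96))   # noticeably above 0.05
```

The second figure shows why the normal cutoff is too narrow for small samples: with S in place of \(\sigma\), values beyond 1.96 occur more often than 5% of the time.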

Computing confidence intervals with the t-distribution

Suppose the number A is chosen so that \(\Pr(-A<T<A)=0.9\) when T has a t-distribution with n-1 degrees of freedom. By symmetry, this is the same as saying that A satisfies \(\Pr(T<A)=0.95\), so A is the 95th percentile of this probability distribution, or \(A=t_{(0.05,n-1)}\), where the subscript 0.05 is the upper-tail probability. Then \(\Pr\left( -A < \frac{\overline X_n-\mu}{S_n/\sqrt n}<A \right)=0.9 \implies \Pr\left( \overline X_n-A\cdot \frac{S_n}{\sqrt n}<\mu<\overline X_n+A\cdot \frac{S_n}{\sqrt n}\right)=0.9\). Therefore, the interval whose endpoints are \(\overline X_n \pm A\cdot \frac{S_n}{\sqrt n}\) is a 90% confidence interval for \(\mu\). Consequently, if we find the mean of a set of observations that we can reasonably expect to be normally distributed, we can use the t-distribution to examine whether the confidence limits on that mean include some theoretically predicted value, such as the value predicted by a null hypothesis.
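The percentile A is normally looked up in a t-table, but it can also be approximated numerically. The sketch below is one possible approach using only the standard library: it integrates the density with Simpson's rule to get \(\Pr(T \le x)\) and then finds A by bisection. The function names, step count, and iteration count are assumptions made for illustration; a statistics library would be the usual choice in practice.

```python
import math

def t_pdf(t, nu):
    """Student's t density with nu degrees of freedom."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(t * t / nu))

def t_cdf(x, nu, steps=2000):
    """Pr(T <= x): 0.5 plus or minus the area over [0, |x|] (Simpson's rule, even steps)."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, nu) + t_pdf(a, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_quantile(p, nu):
    """Value A with Pr(T <= A) = p, for 0 < p < 1."""
    if p < 0.5:
        return -t_quantile(1 - p, nu)  # symmetry about zero
    lo, hi = 0.0, 1.0
    while t_cdf(hi, nu) < p:           # grow the bracket until it contains A
        lo, hi = hi, hi * 2
    for _ in range(50):                # bisection
        mid = (lo + hi) / 2
        if t_cdf(mid, nu) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(t_quantile(0.95, 6), 3))   # A for a 90% interval with 6 degrees of freedom
print(round(t_quantile(0.975, 6), 3))  # A for a 95% interval with 6 degrees of freedom
```

Note how a two-sided 90% interval uses the 95th percentile and a two-sided 95% interval uses the 97.5th, which is the symmetry argument in the paragraph above.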

Example 1

The blood pressure of 7 patients was measured after they had been given a new drug for 3 months. Their blood pressure increases were 1.5, 2.9, 0.9, 3.9, 3.2, 2.1 and 1.9. Construct a 95% confidence interval for the true expected blood pressure increase across all patients in the population.

Sample size: \(n=7\)

Sample mean: \(\overline X=\frac{1.5+2.9+0.9+3.9+3.2+2.1+1.9}{7}=2.34\)

Sample standard deviation: \(S=\sqrt{\frac{(1.5-2.34)^2+(2.9-2.34)^2+(0.9-2.34)^2+(3.9-2.34)^2+(3.2-2.34)^2+(2.1-2.34)^2+(1.9-2.34)^2}{7-1}}=1.04\)

From the t-table, the two-sided 95% critical value of T with 6 degrees of freedom is 2.447.


The endpoints of the confidence interval are therefore \(2.34\pm2.447\cdot\frac{1.04}{\sqrt 7}=2.34\pm0.9618\), i.e. roughly 1.38 to 3.30.
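The same calculation can be checked in a few lines of Python. This is a sketch using the standard `statistics` module; the function name `t_confidence_interval` is hypothetical, and the critical value 2.447 is taken from the t-table as in the text rather than computed.

```python
import math
import statistics

increases = [1.5, 2.9, 0.9, 3.9, 3.2, 2.1, 1.9]
T_CRITICAL = 2.447  # two-sided 95% value for 6 degrees of freedom, from the t-table

def t_confidence_interval(data, t_critical):
    """Endpoints mean -/+ t_critical * S / sqrt(n)."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation, divides by n - 1
    margin = t_critical * s / math.sqrt(n)
    return mean - margin, mean + margin

low, high = t_confidence_interval(increases, T_CRITICAL)
print(f"{low:.2f} to {high:.2f}")
```

Because the code does not round the mean and standard deviation before combining them, its margin (about 0.964) is slightly larger than the hand-computed 0.9618, so the upper endpoint prints as 3.31 rather than 3.30.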

Degrees of freedom

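In statistics, the degrees of freedom of a statistic are the number of values in the sample that are independent, or free to vary, when that statistic is used to estimate a population parameter.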

Interpreting degrees of freedom:

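  1. Suppose there are two variables a and b subject to the condition a+b=1. Once we know one of them (a), the other (b=1-a) is determined by it, so this pair of numbers has 1 degree of freedom.
  2. When estimating the population mean (\(\mu\)), the n values in the sample are mutually independent: a value not yet drawn is unaffected by any value already drawn. The degrees of freedom are therefore n.
  3. When estimating the population variance (\(\sigma^2\)), the statistic used is the sample variance \(S^2\), and computing \(S^2\) requires the sample mean \(\overline X\). Once sampling is complete, \(\overline X\) is fixed, so as soon as n-1 of the n sample values are fixed, only one value of the n-th is consistent with \(\overline X\). In other words, only n-1 values in the sample are free to vary, and once they are fixed, \(S^2\) is fixed as well. Here the mean \(\overline X\) acts as a constraint, and because of this constraint the sample variance \(S^2\) has n-1 degrees of freedom.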

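Consider a sample of 4 values (n=4) whose mean m equals 5, so the sample is constrained by m=5. After freely choosing the three values 4, 2 and 5, the fourth value can only be 9; otherwise \(m\neq5\). The degrees of freedom are therefore df=n-1=4-1=3. More generally, the degrees of freedom of a statistic are df=n-k, where k is the number of constraints.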
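The constraint in this example can be written as a one-line calculation. The helper `last_value` below is a hypothetical name used only to illustrate the idea: with the mean fixed, the final value is forced by the other n-1.

```python
def last_value(free_values, n, mean):
    """The single value forced by the constraint sum(values) == n * mean."""
    if len(free_values) != n - 1:
        raise ValueError("exactly n - 1 values may be chosen freely")
    return n * mean - sum(free_values)

# Three values are chosen freely; the mean constraint m = 5 fixes the fourth.
print(last_value([4, 2, 5], n=4, mean=5))  # 9
```

Changing any of the three free values changes the forced value, but there is never a second free choice, which is why df = 4 - 1 = 3.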

Gamma function

In mathematics, the gamma function (represented by \(\Gamma\), the capital Greek letter gamma) is an extension of the factorial function, with its argument shifted down by 1, to real and complex numbers. If n is a positive integer, \(\Gamma(n)=(n-1)!\).
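Python's standard library exposes the gamma function as `math.gamma`, which makes the identity easy to check. In the sketch below, `generalized_factorial` is a hypothetical helper name that applies the shift by 1 so that non-integer "factorials" can be evaluated.

```python
import math

def generalized_factorial(x):
    """x! extended to non-integer x through Gamma(x + 1)."""
    return math.gamma(x + 1)

# Gamma(n) agrees with (n - 1)! for positive integers n.
for n in range(1, 7):
    print(n, math.gamma(n), math.factorial(n - 1))

# A non-integer "factorial": 2.5! = Gamma(3.5) = 2.5 * 1.5 * 0.5 * sqrt(pi)
print(generalized_factorial(2.5))
```

The last value, about 3.323, sits between 2! = 2 and 3! = 6, as an interpolation of the factorial should.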

Historical background of the gamma function

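In 1728, Goldbach was studying the problem of interpolating sequences, which loosely means extending the general-term formula of a sequence from the integers to the real numbers. For example, the sequence 1, 4, 9, 16, ... is naturally expressed by the formula \(n^2\), and this formula remains well defined when n is a real number. Intuitively, one can find a smooth curve \(y=x^2\) passing through all the integer points \((n, n^2)\), so a formula defined on the integers extends to the reals. One day Goldbach turned to the factorial sequence 1, 2, 6, 24, 120, 720, ...: we can compute 2! and 3!, but can we compute 2.5!? Plotting the first few points \((n, n!)\) on a coordinate plane, it is indeed easy to draw a smooth curve through them. Goldbach, however, could not solve the problem of extending the factorial to the real numbers, so he wrote to Nicolaus Bernoulli and his younger brother Daniel Bernoulli for help. Euler was with Daniel Bernoulli at the time and so learned of the problem. In 1729 Euler solved it completely, which gave rise to the gamma function; he was only 22 years old.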
