The art of A/B testing

阿新 • Published: 2018-12-28

In particular, I will show:

how the Z-test can be applied to test whether the clients experiencing version B spend more time on average
how the χ² test can be used to decide whether or not version B leads to a higher conversion rate
how the Z-test can be adapted to test the conversion rate of version B, and whether it yields the same conclusion as the χ² test

1 | Z-test for average time spent

The hypotheses to test are:

H₀: “the average time spent is the same for the two versions”
H₁: “the average time spent is higher for version B”

The first step is to model H₀

The Z-test uses the Central Limit Theorem (CLT) to do so.

Illustration of the CLT (from Wikipedia)

The CLT establishes the following. Given a random variable (rv) X of expectation μ and finite variance σ², and n independent identically distributed (iid) rvs {X₁,…,Xₙ} ∼ X, the following approximation on their average (also an rv) can be made for large n:

(X₁ + … + Xₙ)/n ≈ N(μ, σ²/n)
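The approximation can be checked numerically. The sketch below is only an illustration under assumed settings: the exponential distribution, the sample size n = 50 and the number of repetitions are arbitrary choices, not taken from the article. It draws many averages and compares their mean and variance with μ and σ²/n.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

n = 50            # number of observations behind each average
n_repeats = 5000  # number of averages we simulate

# Exponential distribution with rate 1: expectation mu = 1, variance sigma^2 = 1
sample_means = []
for _ in range(n_repeats):
    draws = [random.expovariate(1.0) for _ in range(n)]
    sample_means.append(statistics.fmean(draws))

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.variance(sample_means)

print(f"mean of the averages:     {mean_of_means:.3f} (CLT: mu = 1)")
print(f"variance of the averages: {var_of_means:.4f} (CLT: sigma^2 / n = {1 / n:.4f})")
```

Even though a single exponential draw is far from Gaussian, the averages should cluster around μ with a spread close to σ²/n, which is what the Z-test relies on.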

In our context, we model the time spent for each client session i as a realisation:

aᵢ of rv Aᵢ ∼ A, if the client session belongs to the version A split
bᵢ of rv Bᵢ ∼ B, otherwise

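Writing Ā and B̄ for the averages over the n_A sessions of version A and the n_B sessions of version B, we use the approximation provided by the CLT to derive that

Ā ≈ N(μ(A), σ²(A)/n_A) and B̄ ≈ N(μ(B), σ²(B)/n_B)

and, the two splits being independent,

B̄ − Ā ≈ N(μ(B) − μ(A), σ²(A)/n_A + σ²(B)/n_B)

Under H₀, we have equality of the true means, μ(A) = μ(B), and therefore the model

(B̄ − Ā) / √(σ²(A)/n_A + σ²(B)/n_B) ≈ N(0,1)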

Curves of N(0,1), the centred and reduced (standard) Gaussian distribution: probability density function (pdf) and associated p-values

The second step is to see how likely our samples are under H₀

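Note that the true expectations and variances of A and B are unknown. We introduce their respective empirical estimators, the sample mean and the sample variance (written here for A over its n_A sessions; B is analogous):

\hat{μ}(A) = (a₁ + … + a_{n_A}) / n_A
\hat{σ}²(A) = ((a₁ − \hat{μ}(A))² + … + (a_{n_A} − \hat{μ}(A))²) / (n_A − 1)

Our samples yield the following test statistic Z, which needs to be tested against the centred and reduced normal distribution N(0,1):

Z = (\hat{μ}(B) − \hat{μ}(A)) / √(\hat{σ}²(A)/n_A + \hat{σ}²(B)/n_B)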

Conceptually, Z represents the number of standard deviations separating the observed difference of means from 0. The higher this number, the less plausible H₀ is.

Also notice that if the expectations really are different, Z grows as the number of samples increases.

From the formula of Z, you can also get the intuition that the smaller the difference you want to detect, the more samples you need.
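To make this intuition concrete, here is a small sketch under simplifying assumptions that are mine, not the article's: both versions have the same number of sessions n and the same standard deviation σ, so that Z = δ / √(2σ²/n) for an observed gap δ between the means. The function names and the numbers used are hypothetical. Note that `sessions_to_reach` only tells when an observed gap of exactly δ crosses the threshold; it is not a power calculation.

```python
import math


def z_score(delta, sigma, n):
    """Z for an observed mean gap `delta`, common std `sigma`, n sessions per version."""
    return delta / math.sqrt(2 * sigma**2 / n)


def sessions_to_reach(delta, sigma, z_target=1.645):
    """Smallest n per version at which a gap `delta` reaches `z_target`.

    1.645 is approximately the one-sided 5% threshold of N(0, 1).
    """
    return math.ceil(2 * (sigma * z_target / delta) ** 2)


# Same gap, more sessions: Z grows like the square root of n
for n in (100, 1_000, 10_000):
    print(f"n = {n:>6}: Z = {z_score(delta=2.0, sigma=30.0, n=n):.2f}")

# Halving the gap roughly quadruples the number of sessions needed
for delta in (4.0, 2.0, 1.0):
    print(f"gap = {delta}: n per version = {sessions_to_reach(delta, sigma=30.0)}")
```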

In Python, the calculation looks like
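A minimal sketch of this calculation, using only the standard library (`statistics.NormalDist` needs Python 3.8+). The session times are simulated, and the function name `z_test_greater`, the sample sizes and the distributions are illustrative assumptions rather than the article's data or code.

```python
import math
import random
from statistics import NormalDist, fmean, variance


def z_test_greater(a, b):
    """Right-tailed Z-test of H1: mean(B) > mean(A). Returns (Z, p-value)."""
    mean_a, mean_b = fmean(a), fmean(b)
    # Unbiased sample variances (divide by n - 1)
    var_a, var_b = variance(a), variance(b)
    # Estimated standard deviation of the difference of the two averages
    std_err = math.sqrt(var_a / len(a) + var_b / len(b))
    z = (mean_b - mean_a) / std_err
    # Right tail of N(0, 1): probability of a Z at least this large under H0
    p_value = 1.0 - NormalDist().cdf(z)
    return z, p_value


# Simulated session times in seconds (illustrative only)
random.seed(42)
time_a = [random.expovariate(1 / 60) for _ in range(2000)]  # true mean 60 s
time_b = [random.expovariate(1 / 66) for _ in range(2000)]  # true mean 66 s

z, p_value = z_test_greater(time_a, time_b)
print(f"Z = {z:.3f}, right-tailed p-value = {p_value:.4f}")
```

H₀ is rejected at level α when the printed p-value is below α.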

p-value calculation and graphical representation

The p-value is the probability that a result at least as extreme as the one we observed would occur under H₀. With the common α threshold of 5%, we have p-value < α and H₀ can be rejected.

When the sample size is small (as a rule of thumb, fewer than 30 sessions per version) and the CLT approximation may not hold, one may take a look at Student’s t-test.

2 | χ² test for conversion rate

The hypotheses to test are:

H₀: “the conversion rate is the same for the two versions”
H₁: “the conversion rate is higher for version B”

Unlike the previous case, the outcome for each client session is not continuous but binary: either “not converted” or “converted”.

The summary of the observed outcomes is the following

The χ² test compares distributions of multinomial outcomes but we will keep to the binary case in this example.

As before, we will tackle the problem in two steps:

The first step is to model H₀

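Under H₀, conversions in version A and version B follow the same Bernoulli distribution B(1,p). We pool the observations of versions A and B and derive the estimator for the conversion rate (CR)

\hat{p} = (conversions in A + conversions in B) / (sessions in A + sessions in B)

and get \hat{p} = 0.0170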

Thus, under H₀, the theoretical outcome table is

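Let us look at the rv D, defined by

D = Σ (observed count − theoretical count)² / theoretical count

where the sum runs over the four cells of the outcome table (version A or B, converted or not converted).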

D represents a squared relative distance between the theoretical and the observed distributions.

According to Pearson’s theorem, under H₀ and for large samples, D approximately follows a χ² probability law with 1 degree of freedom (df).

The second step is to see how likely our samples are under H₀

This consists of computing the observed D and deriving its corresponding p-value according to the χ² law.

This is how it can be done in Python:
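A minimal sketch with the standard library only. The counts below are hypothetical, chosen so that the pooled rate matches the 0.0170 quoted above; they are not the article's observed table. For 1 degree of freedom the χ² tail has a closed form, P(D > d) = erfc(√(d/2)), which avoids any external dependency.

```python
import math

# Hypothetical observed counts (illustrative only)
observed = {
    "A": {"converted": 154, "not_converted": 9846},
    "B": {"converted": 186, "not_converted": 9814},
}

n_total = sum(sum(row.values()) for row in observed.values())
conversions = sum(row["converted"] for row in observed.values())
p_hat = conversions / n_total  # pooled conversion rate under H0

# Squared relative distance D between observed and theoretical counts
d_stat = 0.0
for row in observed.values():
    n_version = sum(row.values())
    theoretical = {
        "converted": n_version * p_hat,
        "not_converted": n_version * (1 - p_hat),
    }
    for outcome, count in row.items():
        d_stat += (count - theoretical[outcome]) ** 2 / theoretical[outcome]

# Tail of the chi-squared law with 1 df: P(D > d) = erfc(sqrt(d / 2))
p_value = math.erfc(math.sqrt(d_stat / 2))
print(f"p_hat = {p_hat:.4f}, D = {d_stat:.3f}, p-value = {p_value:.4f}")
```

If SciPy is available, `scipy.stats.chi2_contingency(table, correction=False)` should give the same statistic and p-value; the continuity correction it applies by default to 2×2 tables has to be switched off to match D as defined here.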

The p-value is the probability that a result at least as distant from the theoretical distribution as our observation would occur under H₀. With the common α threshold of 5%, we have p-value > α and H₀ cannot be rejected.

3 | Z-test for conversion rate

The Z-test can be adapted to the conversion rate by modelling conversion as an rv whose realisations are in {0,1}:

1 for a conversion
0 otherwise

We keep the same notation as before and model conversion for each client session i as a realisation:

aᵢ ∈ {0,1} of rv Aᵢ ∼ A, if the client session belongs to the version A split
bᵢ ∈ {0,1} of rv Bᵢ ∼ B, otherwise

The first step is to model H₀

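Under H₀, μ(A) = μ(B) and, with Ā and B̄ the averages of the Aᵢ and Bᵢ over the n_A and n_B sessions of each split, we have

(B̄ − Ā) / √(σ²(A)/n_A + σ²(B)/n_B) ≈ N(0,1)

The corresponding test statistic is

Z = (\hat{μ}(B) − \hat{μ}(A)) / √(\hat{σ}²(A)/n_A + \hat{σ}²(B)/n_B)

where \hat{μ}(A) and \hat{μ}(B) are the observed conversion rates. This time, with binary rvs, it can be shown that the estimators for the standard deviations are functions of the expectations:

\hat{σ}(A) = √(\hat{μ}(A)(1 − \hat{μ}(A))) and \hat{σ}(B) = √(\hat{μ}(B)(1 − \hat{μ}(B)))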

The second step is to see how likely our samples are under H₀

To this end, we compute the Z-score and the corresponding right-tailed p-value:
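A minimal sketch of this computation, again with the standard library only and with hypothetical counts (not the article's data). It also evaluates a pooled-variance variant, which is an addition of mine, to show how the right-tailed Z-test relates to Pearson's statistic on a 2×2 table.

```python
import math
from statistics import NormalDist

# Hypothetical counts (illustrative only)
n_a, conv_a = 10_000, 154
n_b, conv_b = 10_000, 186

mu_a, mu_b = conv_a / n_a, conv_b / n_b  # observed conversion rates

# For a {0,1} variable the variance estimate is mu * (1 - mu)
var_a, var_b = mu_a * (1 - mu_a), mu_b * (1 - mu_b)

z = (mu_b - mu_a) / math.sqrt(var_a / n_a + var_b / n_b)
p_value = 1.0 - NormalDist().cdf(z)  # right-tailed

# Variant using the pooled rate for the variance: its square equals Pearson's D
p_pool = (conv_a + conv_b) / (n_a + n_b)
z_pooled = (mu_b - mu_a) / math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
p_pooled = 1.0 - NormalDist().cdf(z_pooled)

# Two-sided p-value of the pooled Z, identical to the chi-squared (1 df) p-value
chi2_p = math.erfc(abs(z_pooled) / math.sqrt(2))

print(f"Z = {z:.3f}, right-tailed p-value = {p_value:.4f}")
print(f"pooled Z = {z_pooled:.3f}, Z^2 = {z_pooled**2:.3f}, right-tailed p-value = {p_pooled:.4f}")
print(f"two-sided p-value of pooled Z = {chi2_p:.4f}")
```

With these counts the right-tailed p-values fall below 5% while the two-sided one stays above it, so the choice of test changes the decision.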

With this modelling, the p-value output is slightly lower than with the χ² test. With the same α=0.05 criterion, we would have rejected the null hypothesis (!!!).

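This difference is mostly explained by the alternatives being tested rather than by the binary nature of the rv. The χ² statistic is non-directional: it measures any departure from H₀, in favour of A or of B, so its p-value is that of a two-sided test, whereas the Z-test here is right-tailed. If the Z-test used the pooled estimate \hat{p} for the variance, Z² would equal D exactly and, with B converting better than A, the right-tailed p-value would be half of the χ² one. The remaining gap comes from estimating the variances of A and B separately.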

Always question your tests

and never make assumptions. A/B testing is indeed a great way to alleviate human bias when deciding on the relevance of new features. However, do not forget that A/B testing still relies on a model of truth and, as we have seen, several models are possible.

In the case of large samples, they tend to converge to similar conclusions. In particular, the CLT approximation holds better than with small sample sizes.

With small samples, one may explore Student’s t-test, Welch’s t-test and Fisher’s exact test. You may also explore the realm of reinforcement learning in order to maximise gains while testing (multi-armed bandits and the exploration vs exploitation dilemma).
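As an illustration of one small-sample alternative, Fisher's exact test for a 2×2 conversion table can be sketched with the standard library. The function name `fisher_exact_greater` and the tiny counts are hypothetical; the test conditions on the table margins, under which the number of conversions in B follows a hypergeometric law under H₀.

```python
from math import comb


def fisher_exact_greater(conv_b, non_b, conv_a, non_a):
    """Right-tailed Fisher exact p-value for H1: version B converts more often."""
    n_b = conv_b + non_b
    total = n_b + conv_a + non_a
    total_conv = conv_a + conv_b
    # Number of ways to choose which sessions belong to version B
    denom = comb(total, n_b)
    upper = min(n_b, total_conv)
    # Sum the probabilities of tables with at least as many conversions in B
    favourable = sum(
        comb(total_conv, k) * comb(total - total_conv, n_b - k)
        for k in range(conv_b, upper + 1)
    )
    return favourable / denom


# Hypothetical tiny experiment: B converts 3 of 4 sessions, A converts 1 of 4
p_value = fisher_exact_greater(conv_b=3, non_b=1, conv_a=1, non_a=3)
print(f"right-tailed p-value = {p_value:.4f}")
```

No normal approximation is involved, which is why this test remains valid when the counts are too small for the CLT or Pearson's theorem. With SciPy, `scipy.stats.fisher_exact(table, alternative="greater")` provides the same kind of test.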

Not only should you be strict in your interpretations of results but also be aware of contextual effects of your A/B test:

time of the year/month/week, the weather, the economic context can affect the nature of your audience
even if after two days of A/B testing your results are significant, they may not be over the course of a week

Main take-home messages

Hypothesis testing is about modelling a null hypothesis H₀ and assessing how likely the samples you got from the A/B test are under it
The key is the H₀ model and, as we have seen, it can be derived from the CLT (Z-test) or Pearson’s theorem (χ² test)
