# The t-test

This notebook implements the examples from Perry Hinton's book chapters about the t-test.

## The Cyadmine example

We have a population with known mean and standard deviation, and a sample with a given mean. The population mean is 3.2 kilograms, and the population standard deviation is 0.9. The sample mean is 3.0, and the sample size is 100.

We cannot compare a sample of babies directly to a population of individual babies, but we can compare it to a population of samples of babies of the same size.

As in the Peter and Hothousing example, we set up a Null Hypothesis, which states that there is nothing to see -- in this case, the sample comes from the general population of samples. 

What do we know about the distribution that is relevant to the null hypothesis, the distribution of samples of babies from the general population (the "sampling distribution")?

* The mean of the sampling distribution is the same as the population mean, and by the Central Limit Theorem the sampling distribution is approximately normal for a sample of this size.

* The standard deviation of the sampling distribution (the standard error) is the standard deviation of the population divided by the square root of the sample size.


In [1]:
import math

population_mean = 3.2
population_sd = 0.9
sample_mean = 3.0
sample_size = 100

samplingd_mean = population_mean
samplingd_sd = population_sd / math.sqrt(sample_size)

samplingd_sd

0.09
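The two facts about the sampling distribution can also be checked empirically. The following is a minimal simulation sketch and not part of Hinton's example: it assumes that birth weights are normally distributed (the text only gives the mean and standard deviation), and the names `rng`, `n_samples`, and `simulated_means` are introduced here purely for illustration.

```python
import numpy as np

# Assumption: birth weights follow a normal distribution with the
# population parameters given in the text.
population_mean = 3.2
population_sd = 0.9
sample_size = 100
n_samples = 10_000  # number of simulated samples (arbitrary choice)

rng = np.random.default_rng(seed=0)

# Draw n_samples samples of size sample_size and take the mean of each.
simulated_means = rng.normal(population_mean, population_sd,
                             size=(n_samples, sample_size)).mean(axis=1)

# Both values should be close to the theoretical 3.2 and 0.09.
print("mean of sample means:", simulated_means.mean())
print("sd of sample means:", simulated_means.std(ddof=1))
```

With more simulated samples, the two printed values should move closer to the population mean and to the standard error computed above.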

We have a *one-tailed* research hypothesis: The hypothesis is that the birth weight of the sample is *lower* than that of babies from the general population.

How can we test whether we have reason to reject the null hypothesis?

Method 1, as in Hinton's book: compute the z-score, and look up the result in the standard normal distribution. Specifically, we determine the probability of seeing a value as low as or lower than the z-score of the sample mean in the sampling distribution.

In [2]:
zscore = (sample_mean  - samplingd_mean) / samplingd_sd

zscore

-2.222222222222224

In [3]:
from scipy import stats

stats.norm.cdf(-2.2222224)

0.013134139686817679

The probability is about 1.3%, which is below the conventional 5% significance level, so we reject the null hypothesis.

Method 2: We directly look up the sample mean in the normal distribution with the mean and standard deviation of the sampling distribution:

In [4]:
stats.norm.cdf(sample_mean, loc = samplingd_mean, scale = samplingd_sd)

0.013134145691021061
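Methods 1 and 2 are the same computation written in two ways. The outputs of cells 3 and 4 differ in the eighth decimal place only because cell 3 uses a hand-typed, rounded z-score. As a sketch, both lookups can be wrapped in one function. The function name `lower_tail_z_test` and its parameter names are hypothetical and chosen for this illustration.

```python
import math
from scipy import stats

def lower_tail_z_test(sample_mean, population_mean, population_sd, sample_size):
    """Return (z, p) for a one-tailed z-test on the lower tail."""
    standard_error = population_sd / math.sqrt(sample_size)
    z = (sample_mean - population_mean) / standard_error
    # Method 1: look up z in the standard normal distribution.
    p_from_z = stats.norm.cdf(z)
    # Method 2: look up the sample mean in the sampling distribution.
    p_direct = stats.norm.cdf(sample_mean, loc=population_mean,
                              scale=standard_error)
    # The two lookups agree up to floating-point rounding.
    assert math.isclose(p_from_z, p_direct, rel_tol=1e-9)
    return z, p_from_z

z, p = lower_tail_z_test(3.0, 3.2, 0.9, 100)
print(z, p)
```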

If we didn't have any insight into whether our sample had a mean that was particularly high or particularly low, and the research hypothesis was just that the sample mean is from a different distribution than the general population, we would have a *two-tailed* hypothesis. In this case, we would have to sum up the probabilities of seeing a value this far from the mean or further out, on either side. 

The probability of seeing a value as far out from the mean as ```sample_mean``` on the *lower* side is the same as the probability of seeing a value as low as 

$-|sample\_mean - samplingd\_mean|$

in a normal distribution with a mean of 0 and the same standard deviation as the sampling distribution:

In [5]:
- abs(sample_mean - samplingd_mean)

-0.20000000000000018

In [6]:
stats.norm.cdf(- abs(sample_mean - samplingd_mean), 
               loc = 0, scale = samplingd_sd)

0.013134145691021061

The probability of this lower tail is the same as the probability of the upper tail that is $|sample\_mean - samplingd\_mean|$ above the mean, so together their probability is:

In [7]:
2 * stats.norm.cdf(- abs(sample_mean - samplingd_mean), 
               loc = 0, scale = samplingd_sd)

0.026268291382042123
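The same two-tailed probability can be obtained from the z-score. The sketch below uses `stats.norm.sf`, the survival function (1 minus the cdf), to read off the upper tail beyond |z| and doubles it. The variable names `standard_error` and `p_two_tailed` are introduced here for illustration.

```python
import math
from scipy import stats

population_mean = 3.2
population_sd = 0.9
sample_mean = 3.0
sample_size = 100

standard_error = population_sd / math.sqrt(sample_size)
z = (sample_mean - population_mean) / standard_error

# sf(|z|) is the upper-tail probability beyond |z|. By symmetry of the
# normal distribution, the lower tail beyond -|z| has the same probability.
p_two_tailed = 2 * stats.norm.sf(abs(z))
print(p_two_tailed)
```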

## The supermarket example

We have a population with known mean but unknown standard deviation, and a sample with given mean and standard deviation.


In [8]:
population_mean = 25
samplingd_mean = population_mean

import pandas as pd

shoppingdata = pd.Series([30, 44, 19, 32, 25, 30, 16, 41, 28, 45, 
                          28, 20, 18, 31, 15, 32, 40, 42, 29, 35, 
                          34, 22, 30, 27, 36, 26, 38, 30, 33, 24, 
                          15, 48, 31, 27, 37, 45, 12, 29, 33, 23, 
                          20, 32, 28, 26, 38, 40, 28, 32, 34, 22])

sample_size = len(shoppingdata)
sample_mean = shoppingdata.mean()
sample_sd = shoppingdata.std()

print("Sample size:", sample_size, "mean:", sample_mean, 
      "standard deviation:", sample_sd)

Sample size: 50 mean: 30.0 standard deviation: 8.429781995389675


We assume that the advertising campaign shifted the mean without changing the standard deviation.

We know that the average number of purchases (pre-campaign) in the supermarket is 25, but we do not have the standard deviation. The closest thing we have is the sample standard deviation. We can use this in place of the population standard deviation and the math remains basically the same, but instead of a z statistic, we then have a t statistic, and we need to consult a t-distribution with ```sample_size - 1``` degrees of freedom to obtain our probabilities. This is known as a one-sample t-test.

Our research hypothesis is that the post-campaign sales were *higher*, so we have a one-tailed test that is about the upper tail.

In [9]:
standarderror = (sample_sd/math.sqrt(sample_size))
t = (sample_mean - samplingd_mean) / standarderror

print("Standard error is", standarderror, "and t-statistic is", t)

1 - stats.t.cdf(t, df = sample_size - 1)

Standard error is 1.192151202572861 and t-statistic is 4.194098860286486


5.7235578734382564e-05
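For very small upper-tail probabilities, `1 - cdf` can lose precision because it subtracts two nearly equal numbers. A hedged alternative is the survival function `stats.t.sf`, which computes the upper tail directly. The sketch below starts from the summary statistics printed in the cell output above rather than from the raw data. The names `p_cdf` and `p_sf` are illustrative.

```python
import math
from scipy import stats

# Summary statistics as printed in the cell output above.
sample_size = 50
sample_mean = 30.0
sample_sd = 8.429781995389675
population_mean = 25

standard_error = sample_sd / math.sqrt(sample_size)
t = (sample_mean - population_mean) / standard_error

# Upper-tail probability, computed in two equivalent ways.
p_cdf = 1 - stats.t.cdf(t, df=sample_size - 1)
p_sf = stats.t.sf(t, df=sample_size - 1)
print(t, p_cdf, p_sf)
```

Here the two results agree to many decimal places. The difference only matters when the probability is close to the limits of floating-point precision.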

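Or we can use scipy's function ```ttest_1samp```, which performs the one-sample t-test. By default, this function does a two-tailed test. To get the one-tailed probability, we divide the two-tailed probability by 2, which is valid here because the t statistic lies in the direction predicted by the research hypothesis.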

In [10]:
ttest_result = stats.ttest_1samp(shoppingdata, population_mean)
print(ttest_result)

Ttest_1sampResult(statistic=4.194098860286486, pvalue=0.00011447115746878296)


In [11]:
print("The t statistic is", ttest_result[0],
      "and the one-tailed probability is", ttest_result[1]/2)

The t statistic is 4.194098860286486 and the one-tailed probability is 5.723557873439148e-05
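Halving the two-tailed probability gives the right one-tailed value only when the t statistic points in the direction of the research hypothesis. If it points the other way, the one-tailed probability is the complement. A minimal sketch of this rule follows. The function `one_tailed_from_two_tailed` and its `direction` argument are hypothetical helpers and not part of SciPy.

```python
def one_tailed_from_two_tailed(t_statistic, p_two_tailed, direction):
    """Convert a two-tailed t-test probability into a one-tailed one.

    direction is "greater" (upper tail) or "less" (lower tail).
    """
    if direction not in ("greater", "less"):
        raise ValueError("direction must be 'greater' or 'less'")
    in_predicted_direction = (t_statistic > 0) == (direction == "greater")
    half = p_two_tailed / 2
    # If t points the wrong way, the one-tailed probability is the complement.
    return half if in_predicted_direction else 1 - half

# Values taken from the ttest_1samp output above.
t_value = 4.194098860286486
p_value = 0.00011447115746878296

p_upper = one_tailed_from_two_tailed(t_value, p_value, "greater")
p_lower = one_tailed_from_two_tailed(t_value, p_value, "less")
print(p_upper, p_lower)
```

For the supermarket data, the upper-tail result matches the halved probability, while a lower-tail hypothesis would give a probability close to 1.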


## The math test example

In this example, the same students were administered two math tests.

Here is the data:

In [12]:
hinton_math = pd.DataFrame({"participant": range(8), 
                            "morning":[6,4,3,5,7,6,5,6],
                            "afternoon":[5,2,4,4,3,4,5,3]})

What we really have here is a sample of *differences between scores*: for each person, we have the difference between their morning and afternoon score.

We add this difference explicitly into the data frame, and compute mean and standard deviation of this sample.

In [13]:
hinton_math["difference"] = hinton_math.morning - hinton_math.afternoon
sample_mean = hinton_math.difference.mean()
sample_sd = hinton_math.difference.std()
sample_size = len(hinton_math.difference)

print("For our sample of differences in scores, we have a mean of",
     sample_mean, "and a standard deviation of", sample_sd)

For our sample of differences in scores, we have a mean of 1.5 and a standard deviation of 1.6035674514745464


The null hypothesis in this case is that the difference in scores is really zero, so the population of samples of score differences should have a mean of zero. We again estimate the standard deviation of the population (of samples of score differences) from the standard deviation of the one sample, and we again compute the t statistic. Here, the mean of the sampling distribution is zero, which is why the numerator is ```sample_mean - 0```.

In [14]:
standarderror = (sample_sd/math.sqrt(sample_size))
t = (sample_mean - 0) / standarderror
t

2.6457513110645907

The research hypothesis is that morning scores are higher than afternoon scores, so we expect the differences between morning and afternoon scores to be *greater* than the population mean. That is, we again have a one-tailed hypothesis, which is about the upper tail. 

So how likely would we be to see the differences between scores that we saw in the sample under the null hypothesis that the difference is really zero? We again ask the t distribution:

In [15]:
1 - stats.t.cdf(t, df=sample_size-1)

0.01657275013188686

Or we can use the ```scipy.stats``` function for *related* samples. (The samples are related because we got scores from the same students for the morning and the afternoon.) By default, this function again does a two-tailed test, so the statistic (the t-value) should be the same as the one we computed above, and the probability should be twice what we got above. This is indeed what we get:

In [16]:
stats.ttest_rel(hinton_math.morning, hinton_math.afternoon)

Ttest_relResult(statistic=2.6457513110645907, pvalue=0.033145500263773664)
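The related-samples test is a one-sample t-test on the differences, tested against a mean of zero. The sketch below checks this equivalence on the math test data. It also shows the `alternative` argument, under the assumption that SciPy 1.6 or later is installed (older versions, such as the one that produced the outputs in this notebook, may not accept it). The variable names are illustrative.

```python
from scipy import stats

morning = [6, 4, 3, 5, 7, 6, 5, 6]
afternoon = [5, 2, 4, 4, 3, 4, 5, 3]
differences = [m - a for m, a in zip(morning, afternoon)]

paired = stats.ttest_rel(morning, afternoon)
one_sample = stats.ttest_1samp(differences, 0)

# Both calls give the same t statistic and two-tailed probability.
print(paired.statistic, paired.pvalue)
print(one_sample.statistic, one_sample.pvalue)

# Assumption: SciPy 1.6 or later, where `alternative` is available.
# "greater" tests whether the mean morning - afternoon difference is above 0.
one_tailed = stats.ttest_rel(morning, afternoon, alternative="greater")
print(one_tailed.pvalue)
```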

## The two advertising campaigns example

Instead of having one shopping sample following one advertising campaign, we might also have two shopping samples following two different advertising campaigns.  We want to determine whether one advertising campaign had a significantly greater impact on sales than the other.

In [17]:
shoppingdata = pd.Series([30, 44, 19, 32, 25, 30, 16, 41, 28, 45, 
                          28, 20, 18, 31, 15, 32, 40, 42, 29, 35, 
                          34, 22, 30, 27, 36, 26, 38, 30, 33, 24, 
                          15, 48, 31, 27, 37, 45, 12, 29, 33, 23, 
                          20, 32, 28, 26, 38, 40, 28, 32, 34, 22])

shoppingdata2 = pd.Series([31, 33, 43, 12, 43, 53, 46, 39, 37, 37, 
                           31, 28, 27, 37, 39, 41, 42, 37, 51, 30, 
                           22, 31, 44, 19, 38, 32, 32, 48, 31, 39, 
                           32, 39, 34, 41, 46, 31, 30, 42, 35, 33, 
                           32, 38, 36, 35, 30, 25, 45, 40, 49, 27])



We can do this using a two-sample t-test. The method we use depends on whether the shopping data was drawn from the same shoppers (paired, or related, samples) or from different shoppers (independent samples).


### Related samples

First, let's assume it was the same sequence of shoppers both times. Then we again have a two-sample related t-test. The test this time is truly two-tailed: we have no idea which of the campaigns worked better, so our research hypothesis is just that they had a different impact.

In [18]:
stats.ttest_rel(shoppingdata, shoppingdata2)

Ttest_relResult(statistic=-3.612164085488365, pvalue=0.0007137928493495321)

### Independent samples

Now we get to the more realistic case: These item counts are from two unrelated groups of shoppers. In that case we need to use a different ```scipy.stats``` method, and the outcome will be different:

In [19]:
stats.ttest_ind(shoppingdata, shoppingdata2)

Ttest_indResult(statistic=-3.552608512015114, pvalue=0.000588444582062685)
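With its default setting (`equal_var=True`), `stats.ttest_ind` assumes that both groups have the same population variance and pools the two sample variances. The sketch below computes the statistic by hand under that assumption and compares it with SciPy's result. The names `campaign1`, `campaign2`, and `pooled_var` are introduced here for illustration.

```python
import math
import numpy as np
from scipy import stats

campaign1 = np.array([30, 44, 19, 32, 25, 30, 16, 41, 28, 45,
                      28, 20, 18, 31, 15, 32, 40, 42, 29, 35,
                      34, 22, 30, 27, 36, 26, 38, 30, 33, 24,
                      15, 48, 31, 27, 37, 45, 12, 29, 33, 23,
                      20, 32, 28, 26, 38, 40, 28, 32, 34, 22])
campaign2 = np.array([31, 33, 43, 12, 43, 53, 46, 39, 37, 37,
                      31, 28, 27, 37, 39, 41, 42, 37, 51, 30,
                      22, 31, 44, 19, 38, 32, 32, 48, 31, 39,
                      32, 39, 34, 41, 46, 31, 30, 42, 35, 33,
                      32, 38, 36, 35, 30, 25, 45, 40, 49, 27])

n1, n2 = len(campaign1), len(campaign2)
var1 = campaign1.var(ddof=1)  # sample variances (n - 1 in the denominator)
var2 = campaign2.var(ddof=1)

# Pooled variance: the two sample variances weighted by degrees of freedom.
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

t = (campaign1.mean() - campaign2.mean()) / standard_error
df = n1 + n2 - 2
p_two_tailed = 2 * stats.t.sf(abs(t), df)

scipy_result = stats.ttest_ind(campaign1, campaign2)
print(t, p_two_tailed)
print(scipy_result.statistic, scipy_result.pvalue)
```

The negative sign of t reflects only the order of the arguments: the first campaign's sample mean is lower than the second's.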