You use the chi-squared test when you have

* frequency data
* for categorical choices

Take Hinton's example of a poll among a group of townspeople on whether to build a sports hall or a theater. The question is: Is there a clear preference for one choice over the other? The null hypothesis is that there is no preference, so both choices are equally popular.


# Hinton's first example: differences in frequency counts

In Hinton's example, there were 62 votes for a sports hall and 38 for a theater. Is that significantly different from a tie? An exact tie would be 50 votes each.

We do a chi-squared test:

In [1]:
from scipy import stats

stats.chisquare([62, 38])

Power_divergenceResult(statistic=5.76, pvalue=0.01639507184919225)
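By default, `stats.chisquare` assumes that all categories are equally likely under the null hypothesis. As a small sketch, the expected counts can also be passed explicitly through the `f_exp` parameter, which should give the same result here. The variable names are chosen for illustration only.

```python
from scipy import stats

observed = [62, 38]   # votes for sports hall, theater
expected = [50, 50]   # an exact tie under the null hypothesis

# Passing f_exp explicitly is equivalent to the default (equal frequencies)
result_default = stats.chisquare(observed)
result_explicit = stats.chisquare(observed, f_exp=expected)

print(result_default.statistic, result_explicit.statistic)
print(result_default.pvalue, result_explicit.pvalue)
```

The `f_exp` form is the one to use when the null hypothesis is something other than an even split.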

We can also compute the chi-squared ($\chi^2$) statistic by hand. Each of the two categories contributes $(O-E)^2 / E$, where $O$ is the observed count and the expected count $E$ is 50 (under the null hypothesis). Summing the contributions of the two categories gives us the chi-squared statistic:

In [2]:
observed1 = 62
value1 = (observed1 - 50)**2 / 50
observed2 = 38
value2 = (observed2 - 50)**2 / 50

chisquared_statistic = value1 + value2
chisquared_statistic

5.76
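To get from the hand-computed statistic to a p-value, one option is the survival function of the chi-squared distribution, `stats.chi2.sf`. The sketch below assumes 1 degree of freedom (2 categories minus 1) and should reproduce the p-value that `stats.chisquare` reported.

```python
from scipy import stats

observed = [62, 38]
expected = 50  # expected count per category under the null hypothesis

# Sum of (O - E)^2 / E over the categories
chisquared_statistic = sum((o - expected) ** 2 / expected for o in observed)

# Degrees of freedom: number of categories minus 1
df = len(observed) - 1

# Probability of a statistic at least this large if the null hypothesis holds
p_value = stats.chi2.sf(chisquared_statistic, df)
print(chisquared_statistic, p_value)
```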

# Test of independence


Hinton's example here is that we have votes for and against a particular measure from both liberals and conservatives. The question is: Is there any pattern in the data, any tendency? Or does being liberal versus conservative vary independently of being for or against?

Here is the data:

In [3]:
import pandas as pd
hinton_chi_df = pd.DataFrame({"for": [78,18], "against":[30,50], "dontknow":[12, 12]},
                             index =["conservative", "liberal"])
hinton_chi_df

| | for | against | dontknow |
|---|---|---|---|
| conservative | 78 | 30 | 12 |
| liberal | 18 | 50 | 12 |


What is the expected value of a cell here? We would not expect the same values in the "for" column for conservatives and liberals because the overall number of conservatives polled is not the same as the overall number of liberals polled. The total number of conservatives polled is the "row total" of the first row, and analogously for the liberals. We also have to take into account how many people overall were "for", how many were "against", and how many answered "don't know". These are the "column totals".

The expected value of a cell is then $\frac{rowTotal \cdot columnTotal}{overallTotal}$.

Here are the row and column totals:

In [4]:
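# Row totals: how many conservatives and how many liberals were polled
rowtotal_conservative = sum(hinton_chi_df.iloc[0])
rowtotal_liberal = sum(hinton_chi_df.iloc[1])
# Column totals: how many people gave each answer
coltotal_for = sum(hinton_chi_df["for"])
coltotal_against = sum(hinton_chi_df["against"])
coltotal_dontknow = sum(hinton_chi_df["dontknow"])
# Overall total: everyone polled
overall_total = rowtotal_conservative + rowtotal_liberal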

print("row totals:", rowtotal_conservative, rowtotal_liberal, 
      "overall total:", rowtotal_conservative + rowtotal_liberal)
print("column totals:", coltotal_for, coltotal_against, coltotal_dontknow, 
      "overall total:", coltotal_for + coltotal_against + coltotal_dontknow)

row totals: 120 80 overall total: 200
column totals: 96 80 24 overall total: 200


And the expected value for the "for" entry for conservatives:

In [5]:
expected_conservative_for = (rowtotal_conservative * coltotal_for) / overall_total
expected_conservative_for

57.6
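The same formula can be applied to all cells at once. The sketch below rebuilds the table and uses `numpy.outer` on the row and column totals. The names `row_totals`, `col_totals`, and `expected_df` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

hinton_chi_df = pd.DataFrame(
    {"for": [78, 18], "against": [30, 50], "dontknow": [12, 12]},
    index=["conservative", "liberal"],
)

row_totals = hinton_chi_df.sum(axis=1)   # one total per row
col_totals = hinton_chi_df.sum(axis=0)   # one total per column
overall_total = row_totals.sum()

# The outer product gives rowTotal * columnTotal for every cell
expected_df = pd.DataFrame(
    np.outer(row_totals, col_totals) / overall_total,
    index=hinton_chi_df.index,
    columns=hinton_chi_df.columns,
)
print(expected_df)
```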

Each cell then contributes $(O-E)^2 / E$ to the chi-squared statistic, as before, and the overall statistic is the sum of these contributions over all six cells. For the very first cell again (conservatives, "for" the measure):

In [6]:
chisq_conservative_for = (hinton_chi_df.iloc[0,0] - expected_conservative_for)**2 / expected_conservative_for
chisq_conservative_for

7.225
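Summing the contributions of all cells gives the overall statistic. A minimal sketch, using `scipy.stats.contingency.expected_freq` for the expected counts; the array `observed` simply repeats the data from the table above.

```python
import numpy as np
from scipy.stats import contingency

# Rows: conservative, liberal; columns: for, against, dontknow
observed = np.array([[78, 30, 12],
                     [18, 50, 12]])

# Expected counts under independence: rowTotal * columnTotal / overallTotal
expected = contingency.expected_freq(observed)

# Contribution of each cell, then the sum over all cells
contributions = (observed - expected) ** 2 / expected
chisq_total = contributions.sum()

print(contributions)
print(chisq_total)
```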

We can again ask Python to do the chi-squared test. This time we need a different function, one that does the chi-squared test of independence of variables in a contingency table. We just hand it the whole data frame:

In [7]:
stats.chi2_contingency(hinton_chi_df)

(35.9375, 1.571343119795212e-08, 2, array([[57.6, 48. , 14.4],
        [38.4, 32. ,  9.6]]))

That is, the overall chi-squared statistic is 35.9375. The p-value is 1.57e-08, which is tiny, so we have reason to reject the null hypothesis that opinion about this measure is independent of political leaning. Then comes the number 2, the degrees of freedom: for a contingency table this is (number of rows - 1) times (number of columns - 1), here $(2-1) \cdot (3-1) = 2$. The array that we were given holds the expected values; note the value 57.6 that we computed above.
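The four returned values can be unpacked into named variables, which makes them easier to work with. The sketch below does that and checks the degrees of freedom against the table shape. The variable names are illustrative.

```python
import pandas as pd
from scipy import stats

hinton_chi_df = pd.DataFrame(
    {"for": [78, 18], "against": [30, 50], "dontknow": [12, 12]},
    index=["conservative", "liberal"],
)

# Unpack statistic, p-value, degrees of freedom, and expected counts
chi2_stat, p_value, dof, expected = stats.chi2_contingency(hinton_chi_df)

# Degrees of freedom for a contingency table: (rows - 1) * (columns - 1)
n_rows, n_cols = hinton_chi_df.shape
dof_by_hand = (n_rows - 1) * (n_cols - 1)

# The p-value is the upper tail of the chi-squared distribution with dof degrees of freedom
p_by_hand = stats.chi2.sf(chi2_stat, dof)

print(chi2_stat, p_value, dof)
print(dof_by_hand, p_by_hand)
```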

# A linguistic example

This is from Harald Baayen's collection of linguistics-related datasets. One of them is about the Dative Alternation: How often do people say "Kim gave Sandy the book" versus "Kim gave the book to Sandy"? One factor that has an influence on which form people choose is whether the recipient is animate. For example, compare "Kim sent Sandy the package", "Kim sent the package to Sandy",  "Kim sent the office the package" and "Kim sent the package to the office": In the office case, does one version sound better to you than the other?

Here is the data from the Baayen dataset for realization of the recipient as a noun phrase (NP), as in "Kim gave Sandy the book", versus as a prepositional phrase (PP), as in "Kim gave the book to Sandy", with recipients that are either animate or not:

In [8]:
dative_df = pd.DataFrame({"animate" : [521, 301], "inanimate":[34, 47]}, index = ["NP", "PP"])
dative_df

| | animate | inanimate |
|---|---|---|
| NP | 521 | 34 |
| PP | 301 | 47 |


It looks as if we get proportionally more PP realizations for inanimate recipients than for animate ones. Is the difference significant?
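Before running the test, the impression can be made concrete by turning the counts into proportions within each column. This is a small sketch; the name `proportions` is illustrative.

```python
import pandas as pd

dative_df = pd.DataFrame(
    {"animate": [521, 301], "inanimate": [34, 47]},
    index=["NP", "PP"],
)

# Divide each column by its total, so each column sums to 1
proportions = dative_df / dative_df.sum(axis=0)
print(proportions.round(3))
```

The PP row then shows the share of prepositional-phrase realizations for animate and for inanimate recipients.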

In [9]:
stats.chi2_contingency(dative_df)

(13.375538209679599,
 0.0002549274629187813,
 1,
 array([[505.21594684,  49.78405316],
        [316.78405316,  31.21594684]]))

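Yes, there is, with a p-value of 0.00025. The chi-squared statistic is 13.376, and a table with 2 rows and 2 columns has $(2-1) \cdot (2-1) = 1$ degree of freedom. Note that for tables with 1 degree of freedom, `chi2_contingency` applies Yates' continuity correction by default, so the statistic is somewhat smaller than the plain sum of $(O-E)^2 / E$ over the cells. The array that we got back again contains the expected values.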
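For a 2x2 table, `chi2_contingency` applies Yates' continuity correction unless told otherwise. As a sketch, the `correction=False` argument switches it off, which gives the uncorrected sum of $(O-E)^2 / E$; the conclusion is the same either way for this data.

```python
import pandas as pd
from scipy import stats

dative_df = pd.DataFrame(
    {"animate": [521, 301], "inanimate": [34, 47]},
    index=["NP", "PP"],
)

# Default: Yates' continuity correction is applied (1 degree of freedom)
corrected = stats.chi2_contingency(dative_df)

# Without the correction: plain sum of (O - E)^2 / E
uncorrected = stats.chi2_contingency(dative_df, correction=False)

# Index 0 is the statistic, index 1 the p-value
print(corrected[0], corrected[1])
print(uncorrected[0], uncorrected[1])
```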