{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "You use the chi-squared test when you have\n", "\n", "* frequency data\n", "* for categorical choices\n", "\n", "Hinton's first example: a poll among a group of townspeople on whether to build a sports hall or a theater. The question is: Is there a clear preference for one choice over the other? The null hypothesis is that there is no clear preference.\n", "\n", "\n", "# Hinton's first example: differences in frequency counts\n", "\n", "In Hinton's example, there were 62 votes for a sports hall and 38 for a theater. Is that significantly different from a tie? An exact tie would be 50 votes each.\n", "\n", "We do a chi-squared test. By default, `stats.chisquare` assumes that all categories are equally likely, so the expected counts here are 50 and 50:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Power_divergenceResult(statistic=5.76, pvalue=0.01639507184919225)" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from scipy import stats\n", "\n", "stats.chisquare([62, 38])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also determine the chi-squared ($\\chi^2$) statistic by hand. The value for each of the two categories is $(O-E)^2 / E$, where $O$ is the observed count and the expected count $E$ is 50 under the null hypothesis. Summing the values for the two categories gives us the chi-squared statistic:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5.76" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Expected count per category under the null hypothesis (an exact tie)\n", "expected = 50\n", "\n", "observed1 = 62\n", "value1 = (observed1 - expected)**2 / expected\n", "observed2 = 38\n", "value2 = (observed2 - expected)**2 / expected\n", "\n", "# The chi-squared statistic is the sum over both categories\n", "chisquared_statistic = value1 + value2\n", "chisquared_statistic" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Test of independence\n", "\n", "\n", "In Hinton's example here, we have votes for and against a particular measure from both liberals and conservatives. 
The question is: Is there any pattern in the data, any tendency? Or does political leaning (liberal versus conservative) vary independently of the vote (for or against)?\n", "\n", "Here is the data:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>for</th><th>against</th><th>dontknow</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>conservative</th><td>78</td><td>30</td><td>12</td></tr>\n", "    <tr><th>liberal</th><td>18</td><td>50</td><td>12</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " for against dontknow\n", "conservative 78 30 12\n", "liberal 18 50 12" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "hinton_chi_df = pd.DataFrame({\"for\": [78,18], \"against\":[30,50], \"dontknow\":[12, 12]},\n", " index =[\"conservative\", \"liberal\"])\n", "hinton_chi_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the expected value of a cell here? We would not expect the same values in the \"for\" column for conservatives and liberals because the overall numbers of conservatives polled is not the same as the overall number of liberals polled. The total number of conservatives polled is the \"row total\" of the first row, and analogously for the liberals. Also, we have to take into account how many people overall were \"for\", and how many \"against\". These are the \"column totals\". \n", "\n", "The expected value of a cell is then $\\frac{rowTotal \\cdot columnTotal}{overallTotal}$.\n", "\n", "Here are the row and column totals:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "row totals: 120 80 overall total: 200\n", "column totals: 96 80 24 overall total: 200\n" ] } ], "source": [ "rowtotal_conservative = sum(hinton_chi_df.iloc[0])\n", "rowtotal_liberal = sum(hinton_chi_df.iloc[1])\n", "coltotal_for = sum(hinton_chi_df[\"for\"])\n", "coltotal_against = sum(hinton_chi_df[\"against\"])\n", "coltotal_dontknow = sum(hinton_chi_df[\"dontknow\"])\n", "overall_total = rowtotal_conservative + rowtotal_liberal\n", "\n", "print(\"row totals:\", rowtotal_conservative, rowtotal_liberal, \n", " \"overall total:\", rowtotal_conservative + rowtotal_liberal)\n", "print(\"column totals:\", coltotal_for, coltotal_against, coltotal_dontknow, \n", " \"overall total:\", coltotal_for + coltotal_against + coltotal_dontknow)" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "And the expected value for the \"for\" entry for conservatives:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "57.6" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "expected_conservative_for = (rowtotal_conservative * coltotal_for) / overall_total\n", "expected_conservative_for" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As before, each cell contributes $(O-E)^2 / E$ to the chi-squared statistic, and the statistic is the sum of these contributions over all cells. Here is the contribution of the very first cell again (conservatives, \"for\" the measure):" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7.225" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chisq_conservative_for = (hinton_chi_df.iloc[0, 0] - expected_conservative_for)**2 / expected_conservative_for\n", "chisq_conservative_for" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can again ask Python to do the chi-squared test. This time we need a different function, one that does the chi-squared test of independence of variables in a contingency table. We just hand it the whole data frame:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(35.9375, 1.571343119795212e-08, 2, array([[57.6, 48. , 14.4],\n", "       [38.4, 32. ,  9.6]]))" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stats.chi2_contingency(hinton_chi_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That is, the overall chi-squared statistic is 35.9375. The p-value is 1.57e-08, which is tiny, so we have reason to reject the null hypothesis that there is no connection between opinion about this measure on the one hand, and political leaning on the other hand. 
Then comes the number 2, the degrees of freedom. For a test of independence this is (number of rows - 1) times (number of columns - 1): we have 2 rows (conservative, liberal) and 3 columns (for, against, don't know), so $(2-1) \\cdot (3-1) = 2$. The array at the end contains the expected values: note the value 57.6 that we computed above.\n", "\n", "# A linguistic example\n", "\n", "This is from Harald Baayen's collection of linguistics-related datasets. One of them is about the Dative Alternation: How often do people say \"Kim gave Sandy the book\" versus \"Kim gave the book to Sandy\"? One factor that influences which form people choose is whether the recipient is animate. For example, compare \"Kim sent Sandy the package\", \"Kim sent the package to Sandy\", \"Kim sent the office the package\" and \"Kim sent the package to the office\": In the office case, does one version sound better to you than the other?\n", "\n", "Here is the data from the Baayen dataset for realization of the recipient as a noun phrase (NP), as in \"Kim gave Sandy the book\", versus as a prepositional phrase (PP), as in \"Kim gave the book to Sandy\", with recipients that are either animate or not:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
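<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>animate</th><th>inanimate</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>NP</th><td>521</td><td>34</td></tr>\n", "    <tr><th>PP</th><td>301</td><td>47</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>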
" ], "text/plain": [ " animate inanimate\n", "NP 521 34\n", "PP 301 47" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dative_df = pd.DataFrame({\"animate\" : [521, 301], \"inanimate\":[34, 47]}, index = [\"NP\", \"PP\"])\n", "dative_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This looks like we get more PP realizations for inanimate recipients. Is there a significant preference?" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(13.375538209679599,\n", " 0.0002549274629187813,\n", " 1,\n", " array([[505.21594684, 49.78405316],\n", " [316.78405316, 31.21594684]]))" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stats.chi2_contingency(dative_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Yes, there is, with a p-value of 0.00025. The chi-squared statistic is 13.376, and we have 2 categories, so 1 degree of freedom. The array that we got back again contains the expected values." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }