{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# IntroStat Week 10\n", "\n", "Welcome to the 10th lecture in IntroStat.\n", "\n", "During the lectures we will present both slides and notebooks. \n", "\n", "This is the notebook used in the lecture in week 10.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import scipy.stats as stats\n", "import statsmodels.api as sm\n", "import statsmodels.formula.api as smf\n", "import statsmodels.stats.power as smp\n", "import statsmodels.stats.proportion as smprop\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Example: Normal approximation of binomial distribution" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "\n", "p = 1/2\n", "\n", "fig, axs = plt.subplots(1, 4, figsize=(20,4))\n", "\n", "# Plot binomial distribution for n = 10\n", "n = 10\n", "axs[0].bar(np.arange(0, n+1, 1), stats.binom.pmf(k=np.arange(0,n+1,1), n=n, p=p), width=0.2, color='red')\n", "\n", "# Plot binomial distribution for n = 20\n", "n = 20\n", "axs[1].bar(np.arange(0, n+1, 1), stats.binom.pmf(k=np.arange(0,n+1,1), n=n, p=p), width=0.3, color='red')\n", "\n", "# Plot binomial distribution for n = 30\n", "n = 30\n", "axs[2].bar(np.arange(0, n+1, 1), stats.binom.pmf(k=np.arange(0,n+1,1), n=n, p=p), width=0.4, color='red')\n", "\n", "# Plot binomial distribution for n = 40\n", "n = 40\n", "axs[3].bar(np.arange(0, n+1, 1), stats.binom.pmf(k=np.arange(0,n+1,1), n=n, p=p), width=0.5, color='red')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the binomial distribution looks more and more like a normal distribution as n grows.\n", "\n", "This changes when p is not 1/2: the binomial distribution is then asymmetric and therefore looks less \"normal\":" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Let's plot some binomial distributions with p = 0.10 and an increasing number of observations (n)\n", "\n", "fig, axs = plt.subplots(1, 4, figsize=(20,4))\n", "\n", "p = 1/10\n", "\n", "# Again for n = 10, 20, 30, 40\n", "n = 10\n", "axs[0].bar(np.arange(0, n+1, 1), stats.binom.pmf(k=np.arange(0,n+1,1), n=n, p=p), width=0.2, color='red')\n", "n = 20\n", "axs[1].bar(np.arange(0, n+1, 1), stats.binom.pmf(k=np.arange(0,n+1,1), n=n, p=p), width=0.3, color='red')\n", "n = 30\n", "axs[2].bar(np.arange(0, n+1, 1), stats.binom.pmf(k=np.arange(0,n+1,1), n=n, p=p), width=0.4, color='red')\n", "n = 40\n", "axs[3].bar(np.arange(0, n+1, 1), stats.binom.pmf(k=np.arange(0,n+1,1), n=n, p=p), width=0.5, color='red')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The shape changes as n increases, but it does not look normal: it is still asymmetric when n = 40 (rightmost plot).\n", "\n", "What if we increase n even more?" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig, axs = plt.subplots(1, 4, figsize=(20,4))\n", "\n", "p = 1/10\n", "\n", "# Plot binomial distributions for n = 25, 50, 100, 150\n", "n = 25\n", "axs[0].bar(np.arange(0, n+1, 1), stats.binom.pmf(k=np.arange(0,n+1,1), n=n, p=p), width=0.2, color='red')\n", "n = 50\n", "axs[1].bar(np.arange(0, n+1, 1), stats.binom.pmf(k=np.arange(0,n+1,1), n=n, p=p), width=0.3, color='red')\n", "n = 100\n", "axs[2].bar(np.arange(0, n+1, 1), stats.binom.pmf(k=np.arange(0,n+1,1), n=n, p=p), width=0.4, color='red')\n", "n = 150\n", "axs[3].bar(np.arange(0, n+1, 1), stats.binom.pmf(k=np.arange(0,n+1,1), n=n, p=p), width=0.5, color='red')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Eventually (when n = 150) the distribution does look much more like a normal distribution.\n", "\n", "Conclusion: the normal distribution is a good approximation if n is large enough, and how large is \"enough\" depends on the value of p." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Confidence interval of proportion for left-handed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. **Calculating Sample Proportion**: Given a sample size of $ n = 100 $ people, with $ x = 10 $ left-handed individuals, what is the sample proportion of left-handed people in this sample?\n", "\n", "$\\hat{p}$ is the sample proportion, calculated as $ \\hat{p} = \\frac{x}{n} $." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.1\n" ] } ], "source": [ "n = 100 # total number of people in the sample\n", "x = 10 # number of left-handed people in the sample\n", "\n", "p_hat = x/n\n", "print(p_hat)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. 
**Calculating Standard Error**: Using the sample proportion calculated, what is the standard error of the proportion for left-handed individuals in this sample?\n", "\n", "$ \\sigma_p $ is the standard error of the sample proportion, calculated as $ \\sigma_p = \\sqrt{\\frac{\\hat{p}(1 - \\hat{p})}{n}} $." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.030000000000000002\n" ] } ], "source": [ "# compute the standard error\n", "se_p_hat = np.sqrt(p_hat*(1-p_hat)/n)\n", "print(se_p_hat)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. **Calculating Confidence Interval** (assuming normal approximation): What is the 95% confidence interval for the proportion of left-handed individuals in the population, based on this sample?\n", "\n", "$\\hat{p} \\pm z_{1-\\alpha/2} \\sigma_p$\n", "\n", "where:\n", "- $ \\hat{p} $ is the sample proportion, calculated as $ \\hat{p} = \\frac{x}{n} $.\n", "- $ z_{1-\\alpha/2} $ is the critical value from the standard normal ($ Z $) distribution for a confidence level of $ 1 - \\alpha $.\n", "- $ \\sigma_p $ is the standard error of the sample proportion, calculated as $ \\sigma_p = \\sqrt{\\frac{\\hat{p}(1 - \\hat{p})}{n}} $.\n", "- The confidence level is 95% (implying $ z_{1-\\alpha/2} \\approx 1.96 $ for a two-sided interval)." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.0412, 0.1588]\n" ] } ], "source": [ "# compute confidence-interval using normal-approximation\n", "print([p_hat - 1.96*se_p_hat, p_hat + 1.96*se_p_hat])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Is it correct to use the normal approximation?\n", "\n", "**Rule of thumb** \n", "Assume $ X \\sim \\text{bin}(n, p) $. The normal distribution is a good approximation for the binomial distribution
\n", "if $ np \\geq 15$ and
\n", "if $ n(1 - p) \\geq 15 $
\n", "that is, the expected numbers of successes ($np$) and failures ($n(1-p)$) are both at least 15." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[10.0, 90.0]\n" ] } ], "source": [ "### is it CORRECT to use the normal approximation?\n", "# is np >= 15 ?\n", "# is n(1-p) >= 15 ?\n", "\n", "print([n*p_hat, n*(1-p_hat)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These numbers are NOT both at least 15: the expected number of successes is only 10.\n", "\n", "We should use another method for small samples.\n", "\n", "**\"Plus 2\" approach (Note 7.7)**\n", "\n", "If the sample is not large, use $\\tilde x= x+2$ and $\\tilde n = n + 4$, so that $\\tilde p = \\tilde x / \\tilde n$.\n", "\n", "In the confidence interval, insert: \n", "$$\n", "\\tilde p \\pm z_{1-\\alpha/2} \\, \\, \\sqrt{\\tilde p(1-\\tilde p)/\\tilde n}\n", "$$" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Confidence Interval (using plus-2 approach): [0.053981472743028336, 0.17678775802620245]\n", "Margin of Error: 0.0614\n" ] } ], "source": [ "# Alternative method for small samples ( Remark 7.7 in the book )\n", "\n", "# \"plus-2\" method: add 2 successes and 2 failures before computing the interval\n", "\n", "p_tilde = (x+2)/(n+4)\n", "\n", "se_p_tilde = np.sqrt(p_tilde*(1-p_tilde)/(n+4))\n", "\n", "print(\"Confidence Interval (using plus-2 approach):\", [p_tilde - 1.96*se_p_tilde, p_tilde + 1.96*se_p_tilde])\n", "\n", "print(f\"Margin of Error: {1.96 * se_p_tilde:.4f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sample Size needed to achieve a given precision\n", "\n", "**Experiment planning: (1) When we have a reasonable guess for population proportion** \n", "How large does the sample size need to be to achieve a given precision?\n", "\n", "**Method 7.13** \n", "If you want an expected (given) margin of error (ME) in a $ (1 - \\alpha) $-confidence interval, the required sample size is:\n", "\n", "$\n", "n = p(1 - p) \\left( \\frac{z_{1 - \\alpha 
/ 2}}{\\text{ME}} \\right)^2\n", "$\n", "\n", "where $ p $ is a reasonable guess for the population proportion.\n", "\n", "**Example Calculation in Python:**\n", "\n", "Suppose we want to **determine the sample size** required for a 95% confidence level with a margin of error of 0.05 and an estimated population proportion $ p = 0.5 $." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Required sample size: 385\n" ] } ], "source": [ "import math\n", "\n", "# Given values\n", "p = 0.5 # estimated population proportion\n", "z = 1.96 # z-value for 95% confidence\n", "ME = 0.05 # margin of error\n", "\n", "# Sample size calculation\n", "n = p * (1 - p) * (z / ME) ** 2\n", "\n", "print(f\"Required sample size: {math.ceil(n)}\")\n", "\n", "# The required sample size is 385 (384.16 rounded up)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Experiment planning: (2) When we DO NOT have a reasonable guess for population proportion** \n", "How large does the sample size need to be to achieve a given precision?\n", "\n", "**Method 7.13** \n", "If you want an expected (given) margin of error (ME) in a $(1 - \\alpha)$-confidence interval but do *not* have a reasonable guess of $ p $, the required sample size is:\n", "\n", "$\n", "n = \\frac{1}{4} \\left( \\frac{z_{1 - \\alpha/2}}{\\text{ME}} \\right)^2,\n", "$\n", "\n", "because the \"worst case\" is $ p = \\frac{1}{2} $, where $ p(1 - p) $ attains its maximum value $ \\frac{1}{4} $.\n", "\n", "**Solution for a 95% confidence level and ME = 0.01, without assuming a guess for $ p $:**\n", "\n", "$\n", "n = \\frac{1}{4} \\left( \\frac{1.96}{0.01} \\right)^2 = 9604\n", "$" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Required sample size: 3458\n", "Required sample size: 9604\n" ] } ], "source": [ "import math\n", "\n", "# Given values\n", "p = 0.1 # estimated population proportion\n", "z = 1.96 # z-value for 95% confidence\n", "ME = 0.01 # margin of 
error\n", "\n", "# Sample size calculation when we have a guess for p\n", "n = p * (1 - p) * (z / ME) ** 2\n", "\n", "print(f\"Required sample size: {math.ceil(n)}\")\n", "\n", "# Sample size calculation without a guess for p (worst case p = 1/2)\n", "n = 1/4 * (z / ME) ** 2\n", "\n", "print(f\"Required sample size: {math.ceil(n)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### Example: Hypothesis test for proportion of left-handed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Are half of all Danish citizens left-handed?**\n", "\n", "We want to test whether the true proportion could be $p_0 = 0.50$ (50:50 left- and right-handed people), that is, $H_0: p = p_0$ against the two-sided alternative $H_1: p \\neq p_0$." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-8.0 1.244192114854348e-15\n" ] } ], "source": [ "z_obs,p_value = smprop.proportions_ztest(count=10, nobs=100, value=0.5, prop_var=0.5) \n", "# prop_var=0.5 bases the variance on the proportion under the null hypothesis (p_0 = 0.5)\n", "# By default (prop_var=False) the function uses the sample proportion to estimate the variance\n", "\n", "print(z_obs, p_value)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$\n", "z_{\\text{obs}} = \\frac{\\hat{p} - p_0}{\\sqrt{p_0(1 - p_0) / n}}\n", "$" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-8.0\n" ] } ], "source": [ "# We can also calculate z_obs *manually*:\n", "\n", "z_obs = (0.10 - 0.50)/np.sqrt(0.50*(1-0.50)/100)\n", "\n", "print(z_obs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "p-value: 0.0000\n", "p-value: 1.244192114854348e-15\n" ] } ], "source": [ "# we can also find the two-sided p-value *manually* as 2 * P(Z <= -|z_obs|):\n", "\n", "print(f\"p-value: {2 * stats.norm.cdf(-abs(z_obs), loc=0, 
scale=1):.4f}\")\n", "\n", "# the same p-value without rounding to four decimal places\n", "print(f\"p-value: {2 * stats.norm.cdf(-abs(z_obs), loc=0, scale=1)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Contraceptive pills and risk of blood clots" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.40350877192982454\n" ] } ], "source": [ "# Group using birth control pills (x1 = number with blood clots, n1 = group size):\n", "x1 = 23\n", "n1 = 23 + 34\n", "p1 = x1/n1\n", "print(p1)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.20958083832335328\n" ] } ], "source": [ "# Group not using birth control pills (control group):\n", "x2 = 35\n", "n2 = 35 + 132\n", "p2 = x2/n2\n", "print(p2)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.19392793360647126\n" ] } ], "source": [ "# difference between groups:\n", "diff = p1-p2\n", "print(diff)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.05239234287574965, 0.33546352433719284]\n" ] } ], "source": [ "# 95% confidence interval for diff (unpooled standard error):\n", "se_diff = np.sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)\n", "\n", "print([diff - 1.96*se_diff, diff + 1.96*se_diff])" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.8859712586466184 0.003902077897925701\n" ] } ], "source": [ "###### Test for equal proportions in the two groups:\n", "# We saw from the interval above that 0 was not in the interval. 
So what do we expect here?\n", "\n", "z_obs,p_value = smprop.proportions_ztest(count = [23, 35], nobs = [57, 167], value=0, prop_var=0)\n", "print(z_obs, p_value)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Hypothesis Test for Two Proportions** \n", "When comparing two proportions (shown here for a two-sided alternative hypothesis):\n", "\n", "$\n", "H_0 : \\; p_1 = p_2,\n", "$\n", "$\n", "H_1 : \\; p_1 \\neq p_2.\n", "$\n", "\n", "**Use the test statistic**\n", "\n", "$\n", "z_{\\text{obs}} = \\frac{\\hat{p}_1 - \\hat{p}_2}{\\sqrt{\\hat{p}(1 - \\hat{p})\\left(\\frac{1}{n_1} + \\frac{1}{n_2}\\right)}},\n", "$\n", "where $\\hat{p} = \\frac{x_1 + x_2}{n_1 + n_2}$." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "p_hat or p_pooled: 0.25892857142857145\n" ] } ], "source": [ "# *Manual* calculations for the same test: \n", "p_pooled = (x1+x2)/(n1+n2)\n", "print(\"p_hat or p_pooled:\", p_pooled)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test statistic or z_obs: 2.8859712586466184\n" ] } ], "source": [ "# test statistic\n", "z_obs = diff / np.sqrt(p_pooled*(1-p_pooled)*(1/n1 + 1/n2))\n", "print(\"Test statistic or z_obs:\", z_obs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "p-value: 0.003902077897925702\n" ] } ], "source": [ "# p-value\n", "print(\"p-value:\", 2 * stats.norm.cdf(-z_obs, loc=0, scale=1))\n", "\n", "# Very Strong Evidence (p < 0.001), Z-score > 3.291 (two-tailed)\n", "# Strong Evidence (0.001 ≤ p < 0.01), Z-score between 2.576 and 3.291 (two-tailed)\n", "# Moderate Evidence (0.01 ≤ p < 0.05), Z-score between 1.96 and 2.576 (two-tailed)\n", "# Weak Evidence (0.05 ≤ p < 0.10), Z-score between 1.645 and 1.96 (two-tailed)\n", "# No Evidence (p ≥ 0.10), Z-score < 1.645 
(two-tailed)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Contraceptive pills with $\\chi^2$" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 23 35]\n", " [ 34 132]]\n" ] }, { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Pill</th>\n", " <th>No pill</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Blood Clot</th>\n", " <td>23</td>\n", " <td>35</td>\n", " </tr>\n", " <tr>\n", " <th>No Clot</th>\n", " <td>34</td>\n", " <td>132</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "            Pill  No pill\n", "Blood Clot    23       35\n", "No Clot       34      132" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# The data in a table:\n", "table_data = np.array([[23,35],[34,132]])\n", "print(table_data)\n", "# With pandas we can make a nicer table:\n", "pill_study = pd.DataFrame(table_data, index=['Blood Clot', 'No Clot'], columns=['Pill', 'No pill'])\n", "display(pill_study)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "# this function can take either a pandas table or the data (so both table_data and pill_study)\n", "chi2, p_val, dof, expected = stats.chi2_contingency(pill_study, correction=False)\n", "# returns test statistic, p-value, degrees of freedom, and expected frequencies" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 14.75892857 43.24107143]\n", " [ 42.24107143 123.75892857]]\n" ] } ], "source": [ "print(expected) # expected frequencies under the null hypothesis" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Chi-square test statistic: 8.328830105734347\n" ] } ], "source": [ "print(\"Chi-square test statistic:\", chi2)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "P-value: 0.0039020778979257016\n" ] } ], "source": [ "print(\"P-value:\", p_val)" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\n" ] } ], "source": [ "print(dof)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Candidate votes over time" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[79 91 93]\n", " [84 66 60]\n", " [37 43 47]]\n" 
] }, { "data": { "text/html": [ "
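<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>4 weeks</th>\n", " <th>2 weeks</th>\n", " <th>1 week</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Cand1</th>\n", " <td>79</td>\n", " <td>91</td>\n", " <td>93</td>\n", " </tr>\n", " <tr>\n", " <th>Cand2</th>\n", " <td>84</td>\n", " <td>66</td>\n", " <td>60</td>\n", " </tr>\n", " <tr>\n", " <th>Undecided</th>\n", " <td>37</td>\n", " <td>43</td>\n", " <td>47</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>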
" ], "text/plain": [ " 4 weeks 2 weeks 1 week\n", "Cand1 79 91 93\n", "Cand2 84 66 60\n", "Undecided 37 43 47" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# First put data into a pandas dataframe\n", "poll = np.array([[79, 91, 93], [84, 66, 60], [37, 43, 47]])\n", "print(poll)\n", "poll_df = pd.DataFrame(poll, index=['Cand1', 'Cand2', 'Undecided'], columns = ['4 weeks', '2 weeks', '1 week'])\n", "display(poll_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Row 1: votes for Candidate 1 (4, 2 and 1 week(s) before the election)
\n", "Row 2: votes for Candidate 2 (4, 2 and 1 week(s) before the election)
\n", "Row 3: undecided votes (4, 2 and 1 week(s) before the election)
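
The rows and columns described here form a contingency table that can be tested with a chi-square test of homogeneity. As a minimal sketch (not part of the original lecture code), the block below computes the expected counts as row total times column total divided by the grand total, and then the test statistic, degrees of freedom and p-value by hand. The names `expected_manual`, `chi2_manual`, `dof_manual` and `p_manual` are introduced for illustration only.

```python
import numpy as np
from scipy import stats

# Poll counts: rows = Cand1, Cand2, Undecided; columns = 4, 2 and 1 week(s) before the election
poll = np.array([[79, 91, 93], [84, 66, 60], [37, 43, 47]])

row_totals = poll.sum(axis=1, keepdims=True)  # shape (3, 1): total per candidate
col_totals = poll.sum(axis=0, keepdims=True)  # shape (1, 3): total per time point
grand_total = poll.sum()

# Expected counts under the null hypothesis of equal vote distributions at all time points
expected_manual = row_totals * col_totals / grand_total

# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected
chi2_manual = ((poll - expected_manual) ** 2 / expected_manual).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof_manual = (poll.shape[0] - 1) * (poll.shape[1] - 1)

# p-value: upper tail of the chi-square distribution
p_manual = stats.chi2.sf(chi2_manual, df=dof_manual)

print(expected_manual)
print(chi2_manual, dof_manual, p_manual)
```

These values should agree with what `stats.chi2_contingency(poll, correction=False)` returns for the same table.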
" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[200 200 200]\n" ] } ], "source": [ "# calculate total number of people asked at every sample / timepoint:\n", "print(np.sum(poll, axis=0))" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[263 210 127]\n" ] } ], "source": [ "# total number for each candidate across all three timepoints:\n", "print(np.sum(poll, axis=1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the overall distribution of votes. \n", "\n", "We want to know if the distributions of votes within each timepoint (sample) differs significantly from the overall distribution." ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "# Now do chi2 test:\n", "# Again, we can use either the data or the pandas dataframe as input \n", "chi2, p_val, dof, expected = stats.chi2_contingency(poll, correction=False)" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[87.66666667 87.66666667 87.66666667]\n", " [70. 70. 70. 
]\n", " [42.33333333 42.33333333 42.33333333]]\n" ] } ], "source": [ "print(expected) # Expected under the assumptions that the null hypothesis is true (all are the same)" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6.961978041718169\n" ] } ], "source": [ "print(chi2)" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.1379112060673381\n" ] } ], "source": [ "print(p_val)" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4\n" ] } ], "source": [ "print(dof)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 2 }