{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# IntroStat Week 10\n", "\n", "Welcome to the 10th lecture in IntroStat\n", "\n", "During the lectures we will present both slides and notebooks. \n", "\n", "This is the notebook used in the lecture in week 10.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import scipy.stats as stats\n", "import statsmodels.api as sm\n", "import statsmodels.formula.api as smf\n", "import statsmodels.stats.power as smp\n", "import statsmodels.stats.proportion as smprop\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Example: Normal approximation of binomialdistribution" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Lets plot some binomialdistributions with p = 0.50 and increasing number of observation (n)\n", "\n", "fig, axs = plt.subplots(1, 4, figsize=(20,4))\n", "\n", "p = 1/2\n", "\n", "n = 10\n", "axs[0].bar(np.arange(0, n+1, 1), stats.binom.pmf(k=np.arange(0,n+1,1), n=n, p=p), width=0.2, color='red')\n", "n = 20\n", "axs[1].bar(np.arange(0, n+1, 1), stats.binom.pmf(k=np.arange(0,n+1,1), n=n, p=p), width=0.3, color='red')\n", "n = 30\n", "axs[2].bar(np.arange(0, n+1, 1), stats.binom.pmf(k=np.arange(0,n+1,1), n=n, p=p), width=0.4, color='red')\n", "n = 40\n", "axs[3].bar(np.arange(0, n+1, 1), stats.binom.pmf(k=np.arange(0,n+1,1), n=n, p=p), width=0.5, color='red')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the binomial for lager n looks more and more like a normal distribution.\n", "\n", "But this is a little different if p not 1/2 - the binomial distribution is non-symmetric and therefore also less \"normal\" looking:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Lets plot some binomialdistributions with p = 0.10 and increasing number of observation (n)\n", "\n", "fig, axs = plt.subplots(1, 4, figsize=(20,4))\n", "\n", "p = 1/10\n", "\n", "n = 10\n", "axs[0].bar(np.arange(0, n+1, 1), stats.binom.pmf(k=np.arange(0,n+1,1), n=n, p=p), width=0.2, color='red')\n", "n = 20\n", "axs[1].bar(np.arange(0, n+1, 1), stats.binom.pmf(k=np.arange(0,n+1,1), n=n, p=p), width=0.3, color='red')\n", "n = 30\n", "axs[2].bar(np.arange(0, n+1, 1), stats.binom.pmf(k=np.arange(0,n+1,1), n=n, p=p), width=0.4, color='red')\n", "n = 40\n", "axs[3].bar(np.arange(0, n+1, 1), stats.binom.pmf(k=np.arange(0,n+1,1), n=n, p=p), width=0.5, color='red')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see the change when n increasis, but is does not look normal - it is still a-symmetric when n = 40 (right most plot)\n", "\n", "What if we increase n even more?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axs = plt.subplots(1, 4, figsize=(20,4))\n", "\n", "p = 1/10\n", "\n", "n = 25\n", "axs[0].bar(np.arange(0, n+1, 1), stats.binom.pmf(k=np.arange(0,n+1,1), n=n, p=p), width=0.2, color='red')\n", "n = 50\n", "axs[1].bar(np.arange(0, n+1, 1), stats.binom.pmf(k=np.arange(0,n+1,1), n=n, p=p), width=0.3, color='red')\n", "n = 100\n", "axs[2].bar(np.arange(0, n+1, 1), stats.binom.pmf(k=np.arange(0,n+1,1), n=n, p=p), width=0.4, color='red')\n", "n = 150\n", "axs[3].bar(np.arange(0, n+1, 1), stats.binom.pmf(k=np.arange(0,n+1,1), n=n, p=p), width=0.5, color='red')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Eventually (when n = 150) the distribution does look much more like a normal distribution.\n", "\n", "Conclusion: the normal distribution is a good approximation is n is large enough - and \"enough\" depends on the value of p." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Confidence interval of proportion for left-handed" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n = 100 # total number of people in the sample\n", "x = 10 # number of lefthanded in the sample\n", "\n", "p_hat = x/n\n", "print(p_hat)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# compute the standard error\n", "se_p_hat = np.sqrt(p_hat*(1-p_hat)/n)\n", "print(se_p_hat)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# compute confidence-interval using normal-approximation\n", "print([p_hat - 1.96*se_p_hat, p_hat + 1.96*se_p_hat])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# is it ok to use the normal approximation?\n", "# is np > 15 ?\n", "# is n(1-p) > 15 ?\n", "\n", "print([n*p_hat, n*(1-p_hat)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These numbers are NOT both > 15\n", "\n", "We should use another method for small samples" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Alternative method for small samples\n", "\n", "# \"plus-2\" method:\n", "\n", "p_tilde = (x+2)/(n+4)\n", "\n", "se_p_tilde = np.sqrt(p_tilde*(1-p_tilde)/(n+4))\n", "\n", "print([p_tilde - 1.96*se_p_tilde, p_tilde + 1.96*se_p_tilde])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(try changing x to 3 and see what happens to the confidence interval - this is an extreme case of the normal assumption not being valid)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Hypothesis test for proportion of left-handed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"Are half of all Danish citizens left-handed?\"\n", "\n", "We want to test if the true proportion could be $p_0 = 0.50$ (50:50 left and right-handed people)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "z_obs,p_value = smprop.proportions_ztest(count=10, nobs=100, value=0.5, prop_var=0.5)\n", "print(z_obs, p_value)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We can also calculate z_obs *manually*:\n", "\n", "z_obs = (0.10 - 0.50)/np.sqrt(0.50*(1-0.50)/100)\n", "\n", "print(z_obs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# we can also find the p-value *manually*:\n", "\n", "print(2*stats.norm.cdf(z_obs, loc=0, scale=1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Contraceptive pills and risk of blood clots" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Group using birth control pills:\n", "x1 = 23\n", "n1 = 23 + 34\n", "p1 = x1/n1\n", "print(p1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Group bot using birth control pills (control group):\n", "x2 = 35\n", "n2 = 35 + 132\n", "p2 = x2/n2\n", "print(p2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# difference:\n", "diff = p1-p2\n", "print(diff)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# confidence interval for diff:\n", "se_diff = np.sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)\n", "\n", "print([diff - 1.96*se_diff, diff + 1.96*se_diff])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Test for equal proportions in the two groups:\n", "\n", "z_obs,p_value = smprop.proportions_ztest(count = [23, 35], nobs = [57, 167], value=0, prop_var=0)\n", "print(z_obs, p_value)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# *Manual* calculations for the test:\n", "p_pooled = (x1+x2)/(n1+n2)\n", "print(p_pooled)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "z_obs = diff / np.sqrt(p_pooled*(1-p_pooled)*(1/n1 + 1/n2))\n", "print(z_obs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(2 * stats.norm.cdf(-z_obs, loc=0, scale=1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Contraceptive pills with $\\chi^2$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "table_data = np.array([[23,35],[34,132]])\n", "print(table_data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "chi2, p_val, dof, expected = stats.chi2_contingency(table_data, correction=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(expected)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(chi2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(p_val)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(dof)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Candidate votes over time" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# First put data into a pandas dataframe\n", "poll = np.array([[79, 91, 93], [84, 66, 60], [37, 43, 47]])\n", "print(poll)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Row 1: votes for Candidate 1 (4, 2 and 1 week(s) before the election)
\n", "Row 1: votes for Candidate 2 (4, 2 and 1 week(s) before the election)
\n", "Row 1: undecided votes (4, 2 and 1 week(s) before the election)
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# calculate total number of people asked at every sample / timepoint:\n", "print(np.sum(poll, axis=0))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# total number for each candidate across all three timepoints:\n", "print(np.sum(poll, axis=1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the overall distribution of votes. \n", "\n", "We want to know if the distributions of votes within each timepoint (sample) differs significantly from the overall distribution." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Now do chi2 test:\n", "chi2, p_val, dof, expected = stats.chi2_contingency(poll, correction=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(expected)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(chi2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(p_val)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(dof)" ] } ], "metadata": { "kernelspec": { "display_name": "pernille", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 2 }