{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# IntroStat Week 2 \n",
"\n",
"Welcome to the second lecture in IntroStat\n",
"\n",
"During the lectures we will present both slides and notebooks. \n",
"\n",
"This is the notebook used in the lecture in week 2.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Continued from week 1: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Data resembling student heights of 10 students:\n",
"x = np.array([168, 161, 167, 179, 184, 166, 198, 187, 191, 179])\n",
"x.sort()\n",
"print(x)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# make a boxplot (median, quartiles + whiskers)\n",
"plt.boxplot(x)\n",
"# also plot actual data values:\n",
"plt.scatter([1.3]*10, x)\n",
"# show plot:\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a histogram\n",
"plt.hist(x)\n",
"plt.scatter(x, [3.1]*10)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Customize the histogram:\n",
"plt.hist(x, bins=np.arange(160,205,5), edgecolor='black', color='red')\n",
"plt.xlabel('x')\n",
"plt.ylabel('Count')\n",
"plt.title('Histogram Example')\n",
"plt.show()\n",
"print(x)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Customize the histogram:\n",
"plt.hist(x, bins=np.arange(160,205,5), edgecolor='black', color='red')\n",
"plt.xlabel('x')\n",
"plt.ylabel('Count')\n",
"plt.title('Histogram Example')\n",
"plt.axvline(np.mean(x), linestyle='-', color='black', ymin=.9, ymax=1)\n",
"plt.plot([np.mean(x)-np.std(x), np.mean(x)+np.std(x)], [3.1, 3.1], '|', linestyle=\"-\", color='grey')\n",
"plt.axvline(np.median(x), linestyle='--', ymin=.9, ymax=1, color='black')\n",
"plt.axvline(np.percentile(x, 0), linestyle='--', ymin=.9, ymax=1, color='black')\n",
"plt.axvline(np.percentile(x, 25), linestyle='--', ymin=.9, ymax=1, color='black')\n",
"plt.axvline(np.percentile(x, 75), linestyle='--', ymin=.9, ymax=1, color='black')\n",
"plt.axvline(np.percentile(x, 100), linestyle='--', ymin=.9, ymax=1, color='black')\n",
"plt.scatter(x, [3.1]*10)\n",
"plt.show()\n",
"print(x)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Histograms are important - they show how the data is **distributed**
\n",
"\n",
"Histograms are **dependent on the choice of binwidth** - try changing to binwidth=1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### \"Cumulative distribution\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# plot the \"empirical cumulated density function\"\n",
"plt.ecdf(x)\n",
"plt.show()\n",
"print(x)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Working with data - dataframes and Pandas"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# import the Pandas library\n",
"import pandas as pd "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Make a DataFrame:\n",
"data = pd.DataFrame({\n",
" 'males': [152, 171, 173, 173, 178, 179, 180, 180, 182, 182, 182, 185, \n",
" 185 ,185, 185, 185 ,186 ,187 ,190 ,190, 192, 192, 197], \n",
" 'females':[159, 166, 168 ,168 ,171 ,171 ,172, 172, 173, 174 ,175 ,175,\n",
" 175, 175, 175, 177, 178, np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]\n",
"})\n",
"print(data.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataframe has ***named columns*** (and indexed rows - rows can also have names)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(type(data))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The DataFrame has a direct method for making a histogram:\n",
"data.hist()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Reading data from an external file**\n",
"\n",
"It is very important to learn how to read data from other files. In practice one will never type all the data into Python by hand!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"csv_data= pd.read_csv(\"studentheights.csv\", sep=';')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(type(csv_data))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"csv_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that this DataFrame is differently structured compared to the one from above (which had columns: \"males\" and \"females\").\n",
"\n",
"If we want to do a histogram for each gender, we need to include the \"by=..\" argument:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"csv_data.hist(by='Gender')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Week 2:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example 1: Simmulation of a stochastic variable (rolling a dice)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"result = np.random.choice(range(1,7), size=1)\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example 2: Simulate n rolls with a fair dice"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Number of simulated realizations (sample size)\n",
"n = 30\n",
"\n",
"# roll dice:\n",
"xFair = np.random.choice([1,2,3,4,5,6], size=n, replace=True)\n",
"print(xFair)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Plot histogram:\n",
"plt.hist(xFair, bins=np.arange(0.05,7,.1),edgecolor='black')\n",
"plt.xlabel('x')\n",
"plt.ylabel('Counts')\n",
"plt.ylim([0,n])\n",
"plt.title('Histogram of xFair')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Count the number of each outcome using the bincount function\n",
"counts = np.bincount(xFair)\n",
"print(counts)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Plot the true/theoretical pdf \n",
"plt.bar(range(1,7), [1/6]*6, color='red', label='True pdf', width=0.1)\n",
"\n",
"# add the empirical pdf (normalised histogram) to the plot\n",
"plt.bar(range(1,7), counts[1:7]/n, width=0.05, label='Histogram / \"Empirical pdf\"')\n",
"\n",
"# add legend to the plot\n",
"plt.legend()\n",
"\n",
"# set limits of y-axis \n",
"plt.ylim([0,1])\n",
"\n",
"# show the plot\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example 3: Simmulate n rolls with an unfair dice"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Number of simulated realizations (sample size)\n",
"n = 30\n",
"\n",
"# roll dice:\n",
"probs = [1/7, 1/7, 1/7, 1/7, 1/7, 2/7]\n",
"xUnFair = np.random.choice([1,2,3,4,5,6], size=n, replace=True, p=probs)\n",
"print(xUnFair)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Count the number of each outcome using the bincount function\n",
"counts = np.bincount(xUnFair)\n",
"print(counts)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Plot the true/theoretical pdf \n",
"plt.bar(range(1,7), probs, color='red', label='True pdf', width=0.1)\n",
"\n",
"# add the empirical pdf to the plot\n",
"plt.bar(range(1,7), counts[1:7]/n, width=0.05, label='Empirical pdf')\n",
"\n",
"\n",
"plt.legend()\n",
"plt.ylim([0,1])\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example 4: ECDF"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Take the data from the unfair dice again\n",
"print(xUnFair)\n",
"print(counts)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Plot empirical pdf and cdf\n",
"\n",
"fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8,4))\n",
"\n",
"ax1.bar(range(1,7), counts[1:7]/n, label='True pdf', width=0.1)\n",
"ax1.set_ylim([0,1])\n",
"ax1.set_title('Empirical pdf')\n",
"\n",
"ax2.ecdf(xUnFair)\n",
"ax2.set_ylim([0,1])\n",
"ax2.set_title('Empirical cdf')\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example 5: The binomial distribution"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# import scipy.stats for theoretical distributions (and much more)\n",
"import scipy.stats as stats"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Probability of \"success\"\n",
"p = 0.1\n",
"\n",
"# Number of repititions\n",
"nRepeat = 30\n",
"\n",
"# Simulate Bernoulli experiment 'nRepeat' times\n",
"tmp = np.random.choice([0,1], p=[1-p, p], size=nRepeat, replace=True)\n",
"print(tmp)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Calculate number of successes\n",
"print(np.sum(tmp))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Or: use the binomial distribution simulation function\n",
"stats.binom.rvs(n=nRepeat, p=p) # \"rvs\" is random variates"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"have a look at documentation for binomial distribution in scipy.stats.binom:\n",
"https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom.html\n",
"\n",
"- .rvs
\n",
"- .pmf (\"pdf\" but for discrete distributions)
\n",
"- .cdf
\n",
"- .ppf
\n",
"
\n",
"and more.."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example 6: Number of 6's with fair dice"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 'nRepeat' independent draws from the set [1,2,3,4,5,6]\n",
"nRepeat = 30\n",
"xFair = np.random.choice([1,2,3,4,5,6], size=nRepeat, replace=True)\n",
"print(xFair)\n",
"\n",
"# Count number of 6's:\n",
"print(np.sum(xFair == 6))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Or: use the binomial distribution simulation function\n",
"stats.binom.rvs(n=nRepeat, p=1/6)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.bar(np.arange(0, nRepeat+1, 1), stats.binom.pmf(k=np.arange(0,nRepeat+1,1), n=nRepeat, p=1/6))\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Callcenter example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# we can calculate the probability using the pdf (\"pmf\") of the binomial:\n",
"print(stats.binom.pmf(k=6, n=6, p=0.70))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# lets plot entire pdf (\"pmf\"):\n",
"plt.bar(np.arange(0,7,1),stats.binom.pmf(k=np.arange(0,7,1), n=6, p=0.70))\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Harddrive example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# we can calculate the probability using the pdf (\"pmf\") of the hypergoemetrical distribution:\n",
"print(1 - stats.hypergeom.pmf(0, 10, 2, 3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Hospital admission example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# we can calculate the probability using the pdf (\"pmf\") of the poisson distribution:\n",
"print(stats.poisson.pmf(0,0.3)+stats.poisson.pmf(1,0.3)+stats.poisson.pmf(2,0.3))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# or we can use the cumulative distribution function (cdf):\n",
"print(stats.poisson.cdf(2,0.3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example 7: Sample mean of n rolls with a fair dice"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Number of simulated realizations (sample size)\n",
"n = 30\n",
"\n",
"# n independent draws from the set (1,2,3,4,5,6) \n",
"# with equal probability of each outcome\n",
"xFair = np.random.choice(range(1, 7), size=n, replace=True)\n",
"print(xFair)\n",
"\n",
"# compute the sample mean:\n",
"print(xFair.mean())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now change n to 1000 and try again (do not print xFair) - what happens?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example 8: Sample variance of n rolls with a fair dice "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Number of simulated realizations (sample size)\n",
"n = 30\n",
"\n",
"# n independent draws from the set (1,2,3,4,5,6) \n",
"# with equal probability of each outcome\n",
"xFair = np.random.choice(range(1, 7), size=n, replace=True)\n",
"print(xFair)\n",
"\n",
"# compute the sample mean:\n",
"print(xFair.var(ddof=1))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "pernille",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}