{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Descriptive statistics in Python\n", "\n", "Welcome to week 2 of 02402 Statistics (PF)\n", "\n", "Today we will start using Python, and in this notebook we will go through some basic descriptive statistics.\n", "\n", "We will also start using the libraries NumPy, Matplotlib and Pandas." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This is the first \"code\" cell in this Jupyter notebook\n", "# all lines that start with a \"#\" are \"commented out\" " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# calculate 2+2\n", "2+2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Store sample data in a variable" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# make a Python \"list\"\n", "my_list = [1,2,3,4]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(my_list)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# multiplying a list repeats it 3 times; it does not multiply each element by 3\n", "my_list*3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(type(my_list))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to be able to work with a data type that behaves like a vector. For this we use NumPy arrays. \n", "\n", "We can store our sample data in a NumPy array. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import the NumPy library\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will now work with a sample consisting of 10 measurements of students' heights. 
\n", "\n", "The 10 observations have the values: 168, 161, 167, 179, 184, 166, 198, 187, 191 and 179" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# store sample data in variable x:\n", "x = np.array([168, 161, 167, 179, 184, 166, 198, 187, 191, 179])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(x)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(type(x))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Calculating simple statistics" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# calculate mean of x (average height of students)\n", "np.mean(x)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# \"mean()\" can also be called as a method\n", "x.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Have a look at the online documentation: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html\n", "\n", "The datatype \"ndarray\" (also called a NumPy array) has many methods." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# let's try some other \"methods\"\n", "x.min()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x.max()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# what about variance? \n", "# NOTE: we need to remember ddof=1 in order to calculate the \"sample variance\"\n", "x.var(ddof=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Why ddof=1? 
Have a look in the documentation for an explanation: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html\n", "\n", "In short: the sum of squared deviations is divided by n - ddof, so ddof=1 gives the sample variance (division by n - 1), while the default ddof=0 divides by n." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# standard deviation (also remember ddof=1 for \"sample standard deviation\")\n", "x.std(ddof=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# what about the median?\n", "# (ndarray has no median() method, so this cell raises an AttributeError)\n", "x.median()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No method called median? \n", "\n", "OK, then we call the median() function directly from NumPy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.median(x)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# we can also get other percentiles (50th percentile is the same as the median)\n", "np.percentile(x, [0,10,25,50,75,90,100], method='averaged_inverted_cdf')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# NumPy has two equivalent functions for calculating quantiles: \"percentile\" (levels from 0 to 100) and \"quantile\" (levels from 0 to 1)\n", "np.quantile(x, [0,0.10,0.25,0.50,0.75,0.90,1.00], method='averaged_inverted_cdf')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# compare with sorted data\n", "sorted_x = np.sort(x)\n", "print(sorted_x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice the method='averaged_inverted_cdf'!
\n", "\n", "There are many different ways to define percentiles!\n", "\n", "See the documentation: https://numpy.org/doc/stable/reference/generated/numpy.percentile.html#numpy.percentile\n", "\n", "In this course (and in the book) we always use the 'averaged_inverted_cdf' method." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### More complex data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now add to the dataset 10 measurements of student weights. We store this data in variable y:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y = np.array([65.5, 58.3, 68.1, 85.7, 80.5, 63.4, 102.6, 91.4, 86.7, 78.9])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(x)\n", "print(y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# calculate covariance:\n", "np.cov(x,y, ddof=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What are the four values?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# calculate correlation\n", "np.corrcoef(x,y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now have a look at Appendix A.1 in the book :)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What are the four values?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How do you interpret a correlation of 0.9656?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "KAHOOT (x1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data visualization\n", "\n", "We use the matplotlib library to produce plots" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import the matplotlib.pyplot package \n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Recall our sample data:\n", "print(sorted_x)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Now make a histogram of the sample data\n", "plt.hist(x)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Customize your histogram\n", "plt.hist(x, bins=8, edgecolor='black', color='red', density=True)\n", "plt.xlabel('x')\n", "plt.ylabel('Density')\n", "plt.title('Histogram Example')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# specifying bin-edges:\n", "plt.hist(x, bins=[160,165,170,175,180,185,190,195,200], edgecolor='black', color='red', density=True)\n", "plt.xlabel('x')\n", "plt.ylabel('Density')\n", "plt.title('Histogram Example')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Histograms are important - they show how the data is **distributed** and are often the first choice of visualising a sample
\n", "\n", "Histograms serve as *empirical distributions* (\"empirical pdf\")
\n", "\n", "Based on the histogram above, what would you guess the height distribution in the *population* looks like?
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# let's try with really small bins, such that the histogram displays all the details in the data:\n", "plt.hist(x, bins=np.arange(160,200,1), edgecolor='black', color='red', density=True)\n", "plt.xlabel('x')\n", "plt.ylabel('Density')\n", "plt.title('Histogram Example')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cumulative distribution\n", "\n", "The \"detailed\" histogram with small bins may not be the nicest way to display data.
\n", "\n", "But histograms depend on the choice of bins, which is (sometimes) not ideal either.
\n", "\n", "An alternative is to make a cumulative plot:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# plot the \"empirical cumulative distribution function\" (empirical cdf)\n", "# (plt.ecdf requires Matplotlib 3.8 or newer)\n", "plt.ecdf(x)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# compare with values \n", "print(sorted_x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the cumulative distribution all the detailed information is kept; it is simply another way to visualise the distribution of the data. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# let's increase the y-range slightly:\n", "plt.ecdf(x)\n", "plt.ylim(-0.1,1.1)\n", "plt.xlabel('x')\n", "plt.ylabel('ecdf(x)')\n", "plt.title('Empirical cumulative distribution function')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The y-range goes from 0 to 1 (0% to 100%)
\n", "\n", "Every vertical line-segment is a datapoint
\n", "\n", "When the plot is \"steep\" there are many datapoints (corresponds to high values in the histogram).
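\n", "\n", "The cumulative plot can be used to understand the \"averaged_inverted_cdf\" method used for percentiles. \n", "Example: If you want to find the 35th percentile, start by finding 0.35 on the y-axis. Then find the corresponding value on the x-axis. This is the value of the 35th percentile. \n", "If the chosen level coincides exactly with a horizontal segment of the plot, the method takes the average of the two data points at the ends of that segment.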
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "KAHOOT (x2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Boxplot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# make a boxplot\n", "plt.boxplot(x)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the *values* are on the **y-axis**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Adding some explanation (each label is placed at the height of the corresponding percentile):\n", "plt.boxplot(x)\n", "plt.text(1.1, np.percentile(x, 0), 'Minimum', color='blue')\n", "plt.text(1.1, np.percentile(x, 25), 'Q1', color='blue')\n", "plt.text(1.1, np.percentile(x, 50), 'Median', color='blue')\n", "plt.text(1.1, np.percentile(x, 75), 'Q3', color='blue')\n", "plt.text(1.1, np.percentile(x, 100), 'Maximum', color='blue')\n", "plt.title(\"Basic box plot\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See the documentation for the definition of the box and whiskers: \n", "\n", "https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.boxplot.html#matplotlib.axes.Axes.boxplot\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Adding an outlier to the data:\n", "print(np.append(x, [235]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.boxplot(np.append(x, [235]))\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the plot above you see that \"extreme values\" are plotted individually. The \"whiskers\" do not extend all the way to min and max by default: they reach only to the most extreme data point within 1.5 times the interquartile range (IQR) from the box. 
\n", "\n", "You can control the whiskers by using the \"whis=..\" option:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8,4)) # start by splitting the figure into two \n", "ax1.boxplot(np.append(x, [235])) # define first plot in the figure - default setting for whiskers\n", "ax2.boxplot(np.append(x, [235]), whis=(0,100)) # define second plot in the figure - set whiskers manually\n", "plt.show() # now show the entire figure" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scatter plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.scatter(x,y)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do you remember the correlation? Does it match the plot?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "KAHOOT (x1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### DataFrames\n", "\n", "For more complex data (many rows and many columns) we will sometimes use \"DataFrames\" from the Pandas library. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import the Pandas library\n", "import pandas as pd " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can put our previous height and weight data into a *DataFrame*:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "student_data = pd.DataFrame({\n", " 'height': x,\n", " 'weight': y\n", "})\n", "student_data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(type(student_data))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# we could also type data directly into a DataFrame:\n", "student_data = pd.DataFrame({\n", " 'height': [168, 161, 167, 179, 184, 166, 198, 187, 191, 179],\n", " 'weight': [65.5, 58.3, 68.1, 85.7, 80.5, 63.4, 102.6, 91.4, 86.7, 78.9]\n", "})\n", "student_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is good practice to have one *observational unit* in each row and different *observational variables* in the different columns. \n", "\n", "(recall Definition 1.1 from chapter 1 in the book)\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The DataFrame has a direct method for making histograms:\n", "student_data.hist()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The DataFrame also has a direct method for making a scatter plot:\n", "student_data.plot.scatter(\"height\", \"weight\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading data from an external file\n", "\n", "It is very important to learn how to read data from other files. In practice one will never type all the data into Python by hand!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# read data from a csv file:\n", "csv_data= pd.read_csv(\"studentheights.csv\", sep=';')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# print the number of rows in the dataset:\n", "print(len(csv_data))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# view the first few rows:\n", "csv_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the data in the two columns?\n", "\n", "What is the type of data in the two columns? (quantitative, qualitative ..?)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "csv_data.describe(include='all')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to do a boxplot by gender, we need to include the \"by=..\" argument:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "csv_data.boxplot(by='Gender')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "csv_data.hist(by=\"Gender\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What happens if we remove the \"by=..\" statement in the plots above?" ] } ], "metadata": { "kernelspec": { "display_name": "pernille", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 2 }