{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# [UNIT 1. Import and Visualization of data with Python](#toc0_)\n", "\n", "This Unit includes some quick shortcuts for representing data in Python, closely following {cite:p}`kroese2020` and the corresponding [GitHub repo](https://github.com/DSML-book/).\n", "\n", "NOTE: since this is a training tool, we assume the user has installed all the needed requirements. If some package is missing, use a terminal to install it separately using your favorite package manager.\n", "\n", "These notebooks have been tested in Linux Ubuntu using Anaconda as the package manager in most cases. The notebooks are self-contained." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Table of contents** \n", "- [UNIT 1. Import and Visualization of data with Python](#toc1_) \n", "    - [Retrieving data](#toc1_1_) \n", "    - [Reformatting tables](#toc1_2_) \n", "    - [Structuring features](#toc1_3_) \n", "    - [Summary tables and statistics](#toc1_4_) \n", "    - [Data visualization](#toc1_5_) \n", "        - [Qualitative variables](#toc1_5_1_) \n", "        - [Quantitative variables](#toc1_5_2_) \n", "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [Retrieving data](#toc0_)\n", "\n", "Typically the files containing data to be analyzed are stored in comma-separated values (CSV) format. To work with them, the first step is to download the data. Before that, we make sure there is a proper folder to download the file into." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "folder datasets exists\n", "folder output exists\n", "File ‘datasets/wine.zip’ already there; not retrieving.\n", "\n" ] } ], "source": [ "import os\n", "\n", "input_directory = \"datasets\"\n", "output_directory = 'output'\n", "\n", "# Create each directory if it does not exist yet\n", "for directory in (input_directory, output_directory):\n", "    if not os.path.exists(directory):\n", "        os.makedirs(directory)\n", "    else:\n", "        print('folder', directory, 'exists')\n", "\n", "# now use wget to download the file into the datasets folder\n", "# (-nc skips the download if the file is already there)\n", "!wget -P $input_directory -nc https://archive.ics.uci.edu/static/public/109/wine.zip" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we extract the content of the zip file with the `zipfile` module and read it with the `pandas` module. Information about the data can be found in the [ML repository at UCI](https://archive.ics.uci.edu/dataset/109/wine)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012345678910111213
0114.231.712.4315.61272.803.060.282.295.641.043.921065
1113.201.782.1411.21002.652.760.261.284.381.053.401050
2113.162.362.6718.61012.803.240.302.815.681.033.171185
3114.371.952.5016.81133.853.490.242.187.800.863.451480
4113.242.592.8721.01182.802.690.391.824.321.042.93735
\n", "
" ], "text/plain": [ " 0 1 2 3 4 5 6 7 8 9 10 11 12 \\\n", "0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 \n", "1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 \n", "2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 \n", "3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 \n", "4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 \n", "\n", " 13 \n", "0 1065 \n", "1 1050 \n", "2 1185 \n", "3 1480 \n", "4 735 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "from zipfile import ZipFile\n", "\n", "# extract the description and data files into the datasets folder\n", "with ZipFile('datasets/wine.zip', 'r') as f:\n", "    f.extractall(input_directory, members=['wine.names', 'wine.data'])\n", "\n", "# the data file has no header row\n", "wine = pd.read_csv('datasets/wine.data', header=None)\n", "wine.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we do not have the names of the columns, we can assign them manually based on the information given in the web link." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
classAlcoholMaliacidAshAlcalinity_of_ashMagnesiumTotal_phenolsFlavanoidsNonflavonoid_phenolsProanthocyaninsColor_intensityHue0D280_0D315_of_diluted_winesProline
0114.231.712.4315.61272.803.060.282.295.641.043.921065
1113.201.782.1411.21002.652.760.261.284.381.053.401050
2113.162.362.6718.61012.803.240.302.815.681.033.171185
3114.371.952.5016.81133.853.490.242.187.800.863.451480
4113.242.592.8721.01182.802.690.391.824.321.042.93735
\n", "
" ], "text/plain": [ " class Alcohol Maliacid Ash Alcalinity_of_ash Magnesium \\\n", "0 1 14.23 1.71 2.43 15.6 127 \n", "1 1 13.20 1.78 2.14 11.2 100 \n", "2 1 13.16 2.36 2.67 18.6 101 \n", "3 1 14.37 1.95 2.50 16.8 113 \n", "4 1 13.24 2.59 2.87 21.0 118 \n", "\n", " Total_phenols Flavanoids Nonflavonoid_phenols Proanthocyanins \\\n", "0 2.80 3.06 0.28 2.29 \n", "1 2.65 2.76 0.26 1.28 \n", "2 2.80 3.24 0.30 2.81 \n", "3 3.85 3.49 0.24 2.18 \n", "4 2.80 2.69 0.39 1.82 \n", "\n", " Color_intensity Hue 0D280_0D315_of_diluted_wines Proline \n", "0 5.64 1.04 3.92 1065 \n", "1 4.38 1.05 3.40 1050 \n", "2 5.68 1.03 3.17 1185 \n", "3 7.80 0.86 3.45 1480 \n", "4 4.32 1.04 2.93 735 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wine.columns = ['class','Alcohol','Maliacid','Ash','Alcalinity_of_ash','Magnesium','Total_phenols','Flavanoids','Nonflavonoid_phenols','Proanthocyanins','Color_intensity','Hue','0D280_0D315_of_diluted_wines','Proline']\n", "wine.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, we can directly read the CSV from its URL, without previously downloading it. We will use [Fisher's `iris` data](https://vincentarelbundock.github.io/Rdatasets/doc/datasets/iris.html). Here we already have the name of the columns/features so we read them directly from the dataset.\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rownamesSepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
015.13.51.40.2setosa
124.93.01.40.2setosa
234.73.21.30.2setosa
344.63.11.50.2setosa
455.03.61.40.2setosa
\n", "
" ], "text/plain": [ " rownames Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "0 1 5.1 3.5 1.4 0.2 setosa\n", "1 2 4.9 3.0 1.4 0.2 setosa\n", "2 3 4.7 3.2 1.3 0.2 setosa\n", "3 4 4.6 3.1 1.5 0.2 setosa\n", "4 5 5.0 3.6 1.4 0.2 setosa" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataname = 'https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv'\n", "\n", "iris = pd.read_csv(dataname)\n", "iris.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the first column only duplicates the row index, we can remove it." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
05.13.51.40.2setosa
14.93.01.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
45.03.61.40.2setosa
..................
1456.73.05.22.3virginica
1466.32.55.01.9virginica
1476.53.05.22.0virginica
1486.23.45.42.3virginica
1495.93.05.11.8virginica
\n", "

150 rows × 5 columns

\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "0 5.1 3.5 1.4 0.2 setosa\n", "1 4.9 3.0 1.4 0.2 setosa\n", "2 4.7 3.2 1.3 0.2 setosa\n", "3 4.6 3.1 1.5 0.2 setosa\n", "4 5.0 3.6 1.4 0.2 setosa\n", ".. ... ... ... ... ...\n", "145 6.7 3.0 5.2 2.3 virginica\n", "146 6.3 2.5 5.0 1.9 virginica\n", "147 6.5 3.0 5.2 2.0 virginica\n", "148 6.2 3.4 5.4 2.3 virginica\n", "149 5.9 3.0 5.1 1.8 virginica\n", "\n", "[150 rows x 5 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# drop returns a new DataFrame, so assign the result back to keep the change\n", "iris = iris.drop('rownames', axis=1)\n", "iris" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [Reformatting tables](#toc0_)\n", "\n", "Sometimes the data comes in odd formats that are not useful for analysis.\n", "\n", "For example, consider the table of scores given here, with values obtained before and after some particular training:\n", "\n", "![scores](../figures/scores.png)\n", "\n", "We want to reformat it in such a way that the score data is in a single column. Let us create a table that uses three features: *Score* (continuous data), *Time* (before or after) and *Student* (integer values from 1 to 5). We will use the `pandas.melt` function, [specifically devoted to this goal](https://pandas.pydata.org/docs/reference/api/pandas.melt.html). Note that `melt` names the score column `value` by default.\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Student Time value\n", "0 1 Before 75\n", "1 2 Before 30\n", "2 3 Before 100\n", "3 4 Before 50\n", "4 5 Before 60\n", "5 1 After 85\n", "6 2 After 50\n", "7 3 After 100\n", "8 4 After 52\n", "9 5 After 65\n" ] } ], "source": [ "import pandas as pd\n", "\n", "# manually create dataframe with data from table\n", "values = [[1, 75, 85], [2, 30, 50], [3, 100, 100], [4, 50, 52], [5, 60, 65]]\n", "df = pd.DataFrame(values, columns=['Student', 'Before', 'After'])\n", "# format dataframe as required: one row per (Student, Time) pair\n", "df = pd.melt(df, id_vars=['Student'], var_name='Time', value_vars=['Before', 'After'])\n", "print(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [Structuring features](#toc0_)\n", "\n", "Types of features:\n", "* Quantitative (with possible discrete or continuous values)\n", "* Qualitative; these can be divided into a fixed number of categories, which is why they are referred to as **categorical** features or **factors**.\n", "\n", "Here we will work with data from {cite:p}`lafaye_de_micheaux_r_2013`. In particular, we will download the file containing nutritional measurements of thirteen features (columns) for 226 elderly individuals (rows). Note the [file](http://www.biostatisticien.eu/springeR/nutrition_elderly.xls) is now an Excel file." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gendersituationteacoffee...raw_fruitcooked_fruit_vegchocolfat
02100...1456
12111...5514
22104...5254
32100...4032
42121...5532
\n", "

5 rows × 13 columns

\n", "
" ], "text/plain": [ " gender situation tea coffee ... raw_fruit cooked_fruit_veg chocol \\\n", "0 2 1 0 0 ... 1 4 5 \n", "1 2 1 1 1 ... 5 5 1 \n", "2 2 1 0 4 ... 5 2 5 \n", "3 2 1 0 0 ... 4 0 3 \n", "4 2 1 2 1 ... 5 5 3 \n", "\n", " fat \n", "0 6 \n", "1 4 \n", "2 4 \n", "3 2 \n", "4 2 \n", "\n", "[5 rows x 13 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# reading .xls files with pandas requires the xlrd package\n", "xls = 'http://www.biostatisticien.eu/springeR/nutrition_elderly.xls'\n", "nutri = pd.read_excel(xls)\n", "\n", "# show at most 8 columns when displaying a dataframe\n", "pd.set_option('display.max_columns', 8)\n", "nutri.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To check the structure of the data we can use the `info` method of the `pandas` dataframe, whose output matches the description given in the original source.\n", "\n", "![description](../figures/nutriage-en.png)\n", "\n", "Note the features can be classified as\n", "\n", "* qualitative\n", "    * ordinal (meat, fish, raw_fruit, cooked_fruit_veg, chocol)\n", "    * nominal (gender, situation, fat)\n", "* quantitative\n", "    * discrete (tea, coffee)\n", "    * continuous (height, weight, age)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 226 entries, 0 to 225\n", "Data columns (total 13 columns):\n", " # Column Non-Null Count Dtype\n", "--- ------ -------------- -----\n", " 0 gender 226 non-null int64\n", " 1 situation 226 non-null int64\n", " 2 tea 226 non-null int64\n", " 3 coffee 226 non-null int64\n", " 4 height 226 non-null int64\n", " 5 weight 226 non-null int64\n", " 6 age 226 non-null int64\n", " 7 meat 226 non-null int64\n", " 8 fish 226 non-null int64\n", " 9 raw_fruit 226 non-null int64\n", " 10 cooked_fruit_veg 226 non-null int64\n", " 11 chocol 226 non-null int64\n", " 12 fat 226 non-null int64\n", "dtypes: int64(13)\n", "memory usage: 23.1 KB\n" ] } ], "source": [ "nutri.info()" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "Let us now change the values and types of several features by means of the `replace` and `astype` methods, and finally save the data in CSV format." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gendersituationteacoffee...raw_fruitcooked_fruit_vegchocolfat
0Femalesingle00...less than once a week4-6 times a weekevery dayMix of vegetable oils
1Femalesingle11...every dayevery dayless than once a weekSunflower oil
2Femalesingle04...every dayonce a weekevery daySunflower oil
3Femalesingle00...4-6 times a weeknever2-3 times a weekMargarine
4Femalesingle21...every dayevery day2-3 times a weekMargarine
\n", "

5 rows × 13 columns

\n", "
" ], "text/plain": [ " gender situation tea coffee ... raw_fruit \\\n", "0 Female single 0 0 ... less than once a week \n", "1 Female single 1 1 ... every day \n", "2 Female single 0 4 ... every day \n", "3 Female single 0 0 ... 4-6 times a week \n", "4 Female single 2 1 ... every day \n", "\n", " cooked_fruit_veg chocol fat \n", "0 4-6 times a week every day Mix of vegetable oils \n", "1 every day less than once a week Sunflower oil \n", "2 once a week every day Sunflower oil \n", "3 never 2-3 times a week Margarine \n", "4 every day 2-3 times a week Margarine \n", "\n", "[5 rows x 13 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# gender, situation, meat, fish, raw_fruit, cooked_fruit_veg, chocol and fat are categorical features\n", "DICT = {1:'Male',2:'Female'}\n", "nutri['gender'] = nutri['gender'].replace(DICT).astype('category')\n", "\n", "DICT = {1:'single',2:'couple',3:'family',4:'other'}\n", "nutri['situation'] = nutri['situation'].replace(DICT).astype('category')\n", "\n", "# the five ordinal features share the same frequency coding\n", "DICT = {0:'never',1:'less than once a week',2:'once a week',3:'2-3 times a week',4:'4-6 times a week', 5:'every day'}\n", "for col in ['meat', 'fish', 'raw_fruit', 'cooked_fruit_veg', 'chocol']:\n", "    nutri[col] = nutri[col].replace(DICT).astype('category')\n", "\n", "DICT = {1:'Butter',2:'Margarine',3:'Peanut oil', 4:'Sunflower oil', 5:'Olive oil', 6:'Mix of vegetable oils', 7:'Colza oil',8:'Duck or goose fat'}\n", "nutri['fat'] = nutri['fat'].replace(DICT).astype('category')\n", "\n", "# tea and coffee are integers\n", "nutri['tea'] = nutri['tea'].astype(int)\n", "nutri['coffee'] = nutri['coffee'].astype(int)\n", "\n", "# height, weight and age are floats\n", "nutri['height'] = nutri['height'].astype(float)\n", "nutri['weight'] = nutri['weight'].astype(float)\n", "nutri['age'] = nutri['age'].astype(float)\n", "\n", "nutri.head()\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 226 entries, 0 to 225\n", "Data columns (total 13 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 gender 226 non-null category\n", " 1 situation 226 non-null category\n", " 2 tea 226 non-null int64 \n", " 3 coffee 226 non-null int64 \n", " 4 height 226 non-null float64 \n", " 5 weight 226 non-null float64 \n", " 6 age 226 non-null float64 \n", " 7 meat 226 non-null category\n", " 8 fish 226 non-null category\n", " 9 raw_fruit 226 non-null category\n", " 10 cooked_fruit_veg 226 non-null category\n", " 11 chocol 226 non-null category\n", " 12 fat 226 non-null category\n", "dtypes: category(8), float64(3), int64(2)\n", "memory usage: 12.4 KB\n" ] } ], "source": [ "nutri.info()\n", "\n", "# save the reformatted data in the output folder created at the start\n", "nutri.to_csv('output/nutri.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [Summary tables and statistics](#toc0_)\n", "\n", "It is extremely important to know your data before any analysis. Apart from the descriptive tools shown above, descriptive statistical measurements can be obtained from pandas with two simple methods: `describe` and `value_counts`. Note that both `describe` and `value_counts` return pandas *Series*, that is, one-dimensional labeled arrays." 
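, "\n", "\n", "As a minimal sketch of this remark, the snippet below applies both methods to a small toy Series and checks the type of the result. The data and the names `fat`, `summary` and `counts` are made up for this illustration and are not the nutrition dataset.\n", "\n", "```python\n", "import pandas as pd\n", "\n", "# Toy data (hypothetical, not the nutrition dataset)\n", "fat = pd.Series(['Butter', 'Olive oil', 'Butter', 'Margarine', 'Butter'], name='fat')\n", "\n", "summary = fat.describe()     # count, unique, top and freq for non-numeric data\n", "counts = fat.value_counts()  # frequency of each category, most frequent first\n", "\n", "# Both results are pandas Series\n", "print(type(summary).__name__, type(counts).__name__)\n", "```\n", "\n", "Because the results are Series, individual entries can be read by label, for example `summary['top']` or `counts['Butter']`."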
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 226\n", "unique 8\n", "top Sunflower oil\n", "freq 68\n", "Name: fat, dtype: object" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nutri = pd.read_csv('output/nutri.csv')\n", "nutri['fat'].describe()\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "fat\n", "Sunflower oil 68\n", "Peanut oil 48\n", "Olive oil 40\n", "Margarine 27\n", "Mix of vegetable oils 23\n", "Butter 15\n", "Duck or goose fat 4\n", "Colza oil 1\n", "Name: count, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nutri['fat'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also generate a contingency table by *cross tabulating* two or more variables:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
situationcouplefamilysingleAll
gender
Female56778141
Male6322085
All119998226
\n", "
" ], "text/plain": [ "situation couple family single All\n", "gender \n", "Female 56 7 78 141\n", "Male 63 2 20 85\n", "All 119 9 98 226" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.crosstab(nutri.gender, nutri.situation, margins=True) # margins=True adds the row/column totals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now turn to descriptive statistics such as the *sample mean*, $\\bar{x}$:\n", "$$\\bar{x}=\\frac{1}{n}\\sum_{i=1}^n x_i$$\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "163.96017699115043" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nutri.height.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and the *p-sample quantile* of $\\mathbf{x}$, being $0

[Data visualization](#toc0_)\n", "\n", "Depending on the data type, data visualization will differ. In order to use the visualization power of Python , first we will import the `matplotlib.pyplot` module, as well as `numpy`, in addition to the already imported `pandas` module." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Qualitative variables](#toc0_)\n", "\n", "here we can show the data in a simple bar plot, taking itno account that the category (x-axis) is not per se a numerical value, so we should manually place the different categories in the plot. " ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAigAAAGdCAYAAAA44ojeAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8pXeV/AAAACXBIWXMAAA9hAAAPYQGoP6dpAAAjCklEQVR4nO3df1BVZeLH8c9V7AoKKJD3eidQLFJbzBLNCW3BVMzSdN3S1DVcs2w1E7XFGNT42ggjrUYjm7u6jdIPs2033aYfJqlp6maIuuWPdDUKMokwFlQQSM73j8a7ewVT9CIP+H7NnJk95zz3nOe0x3x37oVrsyzLEgAAgEFaNPYEAAAAzkegAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADCOT2NP4HLU1NTo22+/lb+/v2w2W2NPBwAAXALLsnTy5Em5XC61aPHzz0iaZKB8++23Cg0NbexpAACAy1BQUKAbbrjhZ8c0yUDx9/eX9NMFBgQENPJsAADApSgrK1NoaKj77/Gf0yQD5dzbOgEBAQQKAABNzKV8PIMPyQIAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwTr0DZevWrRo+fLhcLpdsNpvWrVvn3lddXa05c+aoR48eatOmjVwulx5++GF9++23HseorKzU9OnTFRISojZt2uj+++/XN998c8UXAwAAmod6B8rp06fVs2dPZWZm1tpXXl6u3bt3a968edq9e7feeustHT58WPfff7/HuISEBK1du1Zr1qzRtm3bdOrUKQ0bNkxnz569/CsBAADNhs2yLOuyX2yzae3atRo5cuQFx+Tk5OiOO+7Q119/rbCwMJWWlur666/XK6+8ojFjxkj673frvPfeexoyZMhFz1tWVqbAwECVlpbym2QBAGgi6vP3d4N/BqW0tFQ2m03t2rWTJOXm5qq6ulpxcXHuMS6XS5GRkdqxY0edx6isrFRZWZnHAgAAmq8GDZQzZ87o6aef1rhx49ylVFhYqOuuu07t27f3GOtwOFRYWFjncdLS0hQYGOhe+CZjAACatwYLlOrqaj300EOqqanRiy++eNHxlmVd8MuDkpKSVFpa6l4KCgq8PV0AAGCQBvk24+rqao0ePVp5eXnatGm
Tx/tMTqdTVVVVKikp8XiKUlRUpOjo6DqPZ7fbZbfbG2KqdcrPz1dxcfFVO19zFBISorCwsMaeBgCgifJ6oJyLk3//+9/avHmzgoODPfZHRUWpVatWys7O1ujRoyVJx48f1759+5Senu7t6dRbfn6+unbrrjMV5Y09lSatta+fDn1xkEgBAFyWegfKqVOndOTIEfd6Xl6e9u7dq6CgILlcLj3wwAPavXu33nnnHZ09e9b9uZKgoCBdd911CgwM1COPPKLZs2crODhYQUFBeuqpp9SjRw8NGjTIe1d2mYqLi3WmolzBw2arVTCfdbkc1ScKdOKdxSouLiZQAACXpd6BsmvXLg0YMMC9PmvWLElSfHy8UlJS9Pbbb0uSbrvtNo/Xbd68WbGxsZKk559/Xj4+Pho9erQqKio0cOBArVq1Si1btrzMy/C+VsGhsjtvauxpAABwTap3oMTGxurnfnXKpfxaldatW2vp0qVaunRpfU8PAACuAXwXDwAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDj1DpStW7dq+PDhcrlcstlsWrduncd+y7KUkpIil8slX19fxcbGav/+/R5jKisrNX36dIWEhKhNmza6//779c0331zRhQAAgOaj3oFy+vRp9ezZU5mZmXXuT09P15IlS5SZmamcnBw5nU4NHjxYJ0+edI9JSEjQ2rVrtWbNGm3btk2nTp3SsGHDdPbs2cu/EgAA0Gz41PcFQ4cO1dChQ+vcZ1mWMjIylJycrFGjRkmSsrKy5HA4tHr1ak2ZMkWlpaV66aWX9Morr2jQoEGSpFdffVWhoaH68MMPNWTIkCu4HAAA0Bx49TMoeXl5KiwsVFxcnHub3W5XTEyMduzYIUnKzc1VdXW1xxiXy6XIyEj3mPNVVlaqrKzMYwEAAM2XVwOlsLBQkuRwODy2OxwO977CwkJdd911at++/QXHnC8tLU2BgYHuJTQ01JvTBgAAhmmQn+Kx2Wwe65Zl1dp2vp8bk5SUpNLSUvdSUFDgtbkCAADzeDVQnE6nJNV6ElJUVOR+quJ0OlVVVaWSkpILjjmf3W5XQECAxwIAAJovrwZKeHi4nE6nsrOz3duqqqq0ZcsWRUdHS5KioqLUqlUrjzHHjx/Xvn373GMAAMC1rd4/xXPq1CkdOXLEvZ6Xl6e9e/cqKChIYWFhSkhIUGpqqiIiIhQREaHU1FT5+flp3LhxkqTAwEA98sgjmj17toKDgxUUFKSnnnpKPXr0cP9UDwAAuLbVO1B27dqlAQMGuNdnzZolSYqPj9eqVauUmJioiooKTZ06VSUlJerbt682bNggf39/92uef/55+fj4aPTo0aqoqNDAgQO1atUqtWzZ0guXBOBi8vPzVVxc3NjTaNJCQkIUFhbW2NMAmi2bZVlWY0+ivsrKyhQYGKjS0lKvfx5l9+7dioqKkjM+Q3bnTV499rWisvCICrMSlJubq169ejX2dHCe/Px8de3WXWcqyht7Kk1aa18/HfriIJEC1EN9/v6u9xMUAE1bcXGxzlSUK3jYbLUK5kf2L0f1iQKdeGexiouLCRSggRAowDWqVXAoTwkBGItvMwYAAMYhUAAAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxCBQ
AAGAcAgUAABiHQAEAAMYhUAAAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMYhUAAAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMYhUAAAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxvB4oP/74o+bOnavw8HD5+vqqS5cuWrBggWpqatxjLMtSSkqKXC6XfH19FRsbq/3793t7KgAAoInyeqAsWrRIf/rTn5SZmamDBw8qPT1dzz33nJYuXeoek56eriVLligzM1M5OTlyOp0aPHiwTp486e3pAACAJsjrgfLPf/5TI0aM0H333afOnTvrgQceUFxcnHbt2iXpp6cnGRkZSk5O1qhRoxQZGamsrCyVl5dr9erV3p4OAABogrweKP3799fGjRt1+PBhSdK//vUvbdu2Tffee68kKS8vT4WFhYqLi3O/xm63KyYmRjt27PD2dAAAQBPk4+0DzpkzR6WlperWrZtatmyps2fPauHChRo7dqwkqbCwUJLkcDg8XudwOPT111/XeczKykpVVla618vKyrw9bQAAYBCvP0F544039Oqrr2r16tXavXu3srKy9Ic//EFZWVke42w2m8e6ZVm1tp2TlpamwMBA9xIaGurtaQMAAIN4PVB+//vf6+mnn9ZDDz2kHj16aMKECZo5c6bS0tIkSU6nU9J/n6ScU1RUVOupyjlJSUkqLS11LwUFBd6eNgAAMIjXA6W8vFwtWngetmXLlu4fMw4PD5fT6VR2drZ7f1VVlbZs2aLo6Og6j2m32xUQEOCxAACA5svrn0EZPny4Fi5cqLCwMP3iF7/Qnj17tGTJEk2aNEnST2/tJCQkKDU1VREREYqIiFBqaqr8/Pw0btw4b08HAAA0QV4PlKVLl2revHmaOnWqioqK5HK5NGXKFM2fP989JjExURUVFZo6dapKSkrUt29fbdiwQf7+/t6eDgAAaIK8Hij+/v7KyMhQRkbGBcfYbDalpKQoJSXF26cHAADNAN/FAwAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjNMggXLs2DH95je/UXBwsPz8/HTbbbcpNzfXvd+yLKWkpMjlcsnX11exsbHav39/Q0wFAAA0QV4PlJKSEvXr10+tWrXS+++/rwMHDmjx4sVq166de0x6erqWLFmizMxM5eTkyOl0avDgwTp58qS3pwMAAJogH28fcNGiRQoNDdXKlSvd2zp37uz+35ZlKSMjQ8nJyRo1apQkKSsrSw6HQ6tXr9aUKVO8PSUAANDEeP0Jyttvv63evXvrwQcfVIcOHXT77bdrxYoV7v15eXkqLCxUXFyce5vdbldMTIx27NhR5zErKytVVlbmsQAAgObL64Hy5ZdfatmyZYqIiNAHH3ygxx9/XE8++aRefvllSVJhYaEkyeFweLzO4XC4950vLS1NgYGB7iU0NNTb0wYAAAbxeqDU1NSoV69eSk1N1e23364pU6bo0Ucf1bJlyzzG2Ww2j3XLsmptOycpKUmlpaXupaCgwNvTBgAABvF
6oHTs2FG33HKLx7bu3bsrPz9fkuR0OiWp1tOSoqKiWk9VzrHb7QoICPBYAABA8+X1QOnXr58OHTrkse3w4cPq1KmTJCk8PFxOp1PZ2dnu/VVVVdqyZYuio6O9PR0AANAEef2neGbOnKno6GilpqZq9OjR+vTTT7V8+XItX75c0k9v7SQkJCg1NVURERGKiIhQamqq/Pz8NG7cOG9PBwAANEFeD5Q+ffpo7dq1SkpK0oIFCxQeHq6MjAyNHz/ePSYxMVEVFRWaOnWqSkpK1LdvX23YsEH+/v7eng4AAGiCvB4okjRs2DANGzbsgvttNptSUlKUkpLSEKcHAABNHN/FAwAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAME6DB0paWppsNpsSEhLc2yzLUkpKilwul3x9fRUbG6v9+/c39FQAAEAT0aCBkpOTo+XLl+vWW2/12J6enq4lS5YoMzNTOTk5cjqdGjx4sE6ePNmQ0wEAAE1EgwXKqVOnNH78eK1YsULt27d3b7csSxkZGUpOTtaoUaMUGRmprKwslZeXa/Xq1Q01HQAA0IQ0WKBMmzZN9913nwYNGuSxPS8vT4WFhYqLi3Nvs9vtiomJ0Y4dO+o8VmVlpcrKyjwWAADQfPk0xEHXrFmj3bt3Kycnp9a+wsJCSZLD4fDY7nA49PXXX9d5vLS0NP3f//2f9ycKAACM5PUnKAUFBZoxY4ZeffVVtW7d+oLjbDabx7plWbW2nZOUlKTS0lL3UlBQ4NU5AwAAs3j9CUpubq6KiooUFRXl3nb27Flt3bpVmZmZOnTokKSfnqR07NjRPaaoqKjWU5Vz7Ha77Ha7t6cKAAAM5fUnKAMHDtTnn3+uvXv3upfevXtr/Pjx2rt3r7p06SKn06ns7Gz3a6qqqrRlyxZFR0d7ezoAAKAJ8voTFH9/f0VGRnpsa9OmjYKDg93bExISlJqaqoiICEVERCg1NVV+fn4aN26ct6cDAACaoAb5kOzFJCYmqqKiQlOnTlVJSYn69u2rDRs2yN/fvzGmAwAADHNVAuWjjz7yWLfZbEpJSVFKSsrVOD0AAGhi+C4eAABgHAIFAAAYh0ABAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMYhUAAAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMYhUAAAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMYhUAAAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMYhUAAAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcbweKGlpaerTp4/8/f3VoUMHjRw5UocOHfIYY1mWUlJS5HK55Ovrq9jYWO3fv9/bUwEAAE2U1wNly5YtmjZtmj755BNlZ2frxx9/VFxcnE6fPu0ek56eriVLligzM1M5OTlyOp0aPHiwTp486e3pAACAJsjH2wdcv369x/rKlSvVoUMH5ebm6pe//KUsy1JGRoaSk5M1atQoSVJWVpYcDodWr16tKVOmeHt
KAACgiWnwz6CUlpZKkoKCgiRJeXl5KiwsVFxcnHuM3W5XTEyMduzYUecxKisrVVZW5rEAAIDmq0EDxbIszZo1S/3791dkZKQkqbCwUJLkcDg8xjocDve+86WlpSkwMNC9hIaGNuS0AQBAI2vQQHniiSf02Wef6fXXX6+1z2azeaxbllVr2zlJSUkqLS11LwUFBQ0yXwAAYAavfwblnOnTp+vtt9/W1q1bdcMNN7i3O51OST89SenYsaN7e1FRUa2nKufY7XbZ7faGmioAADCM15+gWJalJ554Qm+99ZY2bdqk8PBwj/3h4eFyOp3Kzs52b6uqqtKWLVsUHR3t7ekAAIAmyOtPUKZNm6bVq1frH//4h/z9/d2fKwkMDJSvr69sNpsSEhKUmpqqiIgIRUREKDU1VX5+fho3bpy3pwMAAJogrwfKsmXLJEmxsbEe21euXKmJEydKkhITE1VRUaGpU6eqpKREffv21YYNG+Tv7+/t6QAAgCbI64FiWdZFx9hsNqWkpCglJcXbpwcAAM0A38UDAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACM49PYEwAAwNvy8/NVXFzc2NNo0kJCQhQWFtZo5ydQAADNSn5+vrp2664zFeWNPZUmrbWvnw59cbDRIoVAAQA0K8XFxTpTUa7gYbPVKji0safTJFWfKNCJdxaruLiYQAEAwJtaBYfK7rypsaeBy8SHZAEAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMYhUAAAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYp1ED5cUXX1R4eLhat26tqKgoffzxx405HQAAYIhGC5Q33nhDCQkJSk5O1p49e3TXXXdp6NChys/Pb6wpAQAAQzRaoCxZskSPPPKIJk+erO7duysjI0OhoaFatmxZY00JAAAYwqcxTlpVVaXc3Fw9/fTTHtvj4uK0Y8eOWuMrKytVWVnpXi8tLZUklZWVeX1up06d+umchUdUU3XG68e/FlT/8I2kn/5ZNsT/R7gy3ONXjnvcbNzjV66h7vFzx7Is6+KDrUZw7NgxS5K1fft2j+0LFy60br755lrjn3nmGUsSCwsLCwsLSzNYCgoKLtoKjfIE5RybzeaxbllWrW2SlJSUpFmzZrnXa2pq9MMPPyg4OLjO8c1ZWVmZQkNDVVBQoICAgMaeDuB13ONo7q7le9yyLJ08eVIul+uiYxslUEJCQtSyZUsVFhZ6bC8qKpLD4ag13m63y263e2xr165dQ07ReAEBAdfcjY1rC/c4mrtr9R4PDAy8pHGN8iHZ6667TlFRUcrOzvbYnp2drejo6MaYEgAAMEijvcUza9YsTZgwQb1799add96p5cuXKz8/X48//nhjTQkAABii0QJlzJgxOnHihBYsWKDjx48rMjJS7733njp16tRYU2oS7Ha7nnnmmVpveQHNBfc4mjvu8Utjs6xL+VkfAACAq4fv4gEAAMYhUAAAgHEIFAAAYBwC5Rpgs9m0bt26xp4GrmETJ07UyJEjvXrMr776SjabTXv37vXqcYG6WJalxx57TEFBQQ16353/ZyU2NlYJCQkNci7TNepvkgVwbXjhhRcu7bs3AEOtX79eq1at0kcffaQuXbooJCSkQc7Dn5X/IlAANLhL/c2RgKmOHj2qjh07NvgvE+XPyn/xFk8Dq6mp0aJFi3TTTTfJbrcrLCxMCxculCR9/vnnuvvuu+Xr66vg4GA99thj7m/hlOp+tDdy5EhNnDjRvd65c2c9++yzGjdunNq
2bSuXy6WlS5f+7JyOHTumMWPGqH379goODtaIESP01VdfeeuScQ3729/+ph49erjv6UGDBun06dN1PrZ+8sknlZiYqKCgIDmdTqWkpHgc64svvlD//v3VunVr3XLLLfrwww8v+nblgQMHdO+996pt27ZyOByaMGGCiouLG+Zicc2YOHGipk+frvz8fNlsNnXu3Fnr169X//791a5dOwUHB2vYsGE6evSo+zXn3oL861//qrvuuku+vr7q06ePDh8+rJycHPXu3Vtt27bVPffco++//97jXBd6O3TBggXq0aNHre1RUVGaP3++16+7sREoDSwpKUmLFi3SvHnzdODAAa1evVoOh0Pl5eW655571L59e+Xk5OjNN9/Uhx9+qCeeeKLe53juued06623avfu3UpKStLMmTNrfY3AOeXl5RowYIDatm2rrVu3atu2be4/JFVVVVd6ubiGHT9+XGPHjtWkSZN08OBBffTRRxo1atQFH1dnZWWpTZs22rlzp9LT07VgwQL3fVtTU6ORI0fKz89PO3fu1PLly5WcnHzR88fExOi2227Trl27tH79en333XcaPXq0168V15YXXnhBCxYs0A033KDjx48rJydHp0+f1qxZs5STk6ONGzeqRYsW+tWvfqWamhqP1z7zzDOaO3eudu/eLR8fH40dO1aJiYl64YUX9PHHH+vo0aOXHBeTJk3SgQMHlJOT49722Wefac+ePR7/4dpsXPT7jnHZysrKLLvdbq1YsaLWvuXLl1vt27e3Tp065d727rvvWi1atLAKCwsty7KsmJgYa8aMGR6vGzFihBUfH+9e79Spk3XPPfd4jBkzZow1dOhQ97oka+3atZZlWdZLL71kde3a1aqpqXHvr6ystHx9fa0PPvjgci8VsHJzcy1J1ldffVVrX3x8vDVixAj3ekxMjNW/f3+PMX369LHmzJljWZZlvf/++5aPj491/Phx9/7s7GyPezkvL8+SZO3Zs8eyLMuaN2+eFRcX53HMgoICS5J16NAhL1whrmXPP/+81alTpwvuLyoqsiRZn3/+uWVZ/70///KXv7jHvP7665Yka+PGje5taWlpVteuXd3rdf1Z+d+/B4YOHWr97ne/c68nJCRYsbGxV3Bl5uIJSgM6ePCgKisrNXDgwDr39ezZU23atHFv69evn2pqanTo0KF6nefOO++stX7w4ME6x+bm5urIkSPy9/dX27Zt1bZtWwUFBenMmTMejyeB+urZs6cGDhyoHj166MEHH9SKFStUUlJywfG33nqrx3rHjh1VVFQkSTp06JBCQ0PldDrd+++4446fPX9ubq42b97svq/btm2rbt26SRL3Nrzu6NGjGjdunLp06aKAgACFh4dLkvLz8z3G/e997nA4JMnjbRqHw+G+7y/Fo48+qtdff11nzpxRdXW1XnvtNU2aNOlKLsVYfEi2Afn6+l5wn2VZstlsde47t71Fixa1Ho9XV1df0rkvdOyamhpFRUXptddeq7Xv+uuvv6RjA3Vp2bKlsrOztWPHDm3YsEFLly5VcnKydu7cWef4Vq1aeazbbDb34/Gf+/NxITU1NRo+fLgWLVpUa1/Hjh3rdSzgYoYPH67Q0FCtWLFCLpdLNTU1ioyMrPVW+f/e5+fu6fO3nf+20MXOa7fbtXbtWtntdlVWVurXv/71FV6NmQiUBhQRESFfX19t3LhRkydP9th3yy23KCsrS6dPn3Y/Rdm+fbtatGihm2++WdJPwXD8+HH3a86ePat9+/ZpwIABHsf65JNPaq2f+y/H8/Xq1UtvvPGGOnTooICAgCu+RuB/2Ww29evXT/369dP8+fPVqVMnrV27tt7H6datm/Lz8/Xdd9+5/6vzf993r0uvXr3097//XZ07d5aPD/9qQ8M5ceKEDh48qD//+c+66667JEnbtm27Kuf28fFRfHy8Vq5cKbvdroceekh+fn5X5dxXG2/xNKDWrVtrzpw5SkxM1Msvv6yjR4/qk08
+0UsvvaTx48erdevWio+P1759+7R582ZNnz5dEyZMcP8L+e6779a7776rd999V1988YWmTp2q//znP7XOs337dqWnp+vw4cP64x//qDfffFMzZsyoc07jx49XSEiIRowYoY8//lh5eXnasmWLZsyYoW+++aYh/3Ggmdu5c6dSU1O1a9cu5efn66233tL333+v7t271/tYgwcP1o033qj4+Hh99tln2r59u/tDshd6sjJt2jT98MMPGjt2rD799FN9+eWX2rBhgyZNmqSzZ89e0bUB/+vcT0AuX75cR44c0aZNmzRr1qyrdv7Jkydr06ZNev/995vt2zsSgdLg5s2bp9mzZ2v+/Pnq3r27xowZo6KiIvn5+emDDz7QDz/8oD59+uiBBx7QwIEDlZmZ6X7tpEmTFB8fr4cfflgxMTEKDw+v9fREkmbPnq3c3FzdfvvtevbZZ7V48WINGTKkzvn4+flp69atCgsL06hRo9S9e3dNmjRJFRUVPFHBFQkICNDWrVt177336uabb9bcuXO1ePFiDR06tN7HatmypdatW6dTp06pT58+mjx5subOnSvpp/Cvi8vl0vbt23X27FkNGTJEkZGRmjFjhgIDA9WiBf+qg/e0aNFCa9asUW5uriIjIzVz5kw999xzV+38ERERio6OVteuXdW3b9+rdt6rzWad/yEHNCmdO3dWQkLCNfurkHHt2L59u/r3768jR47oxhtvbOzpAI3Gsix169ZNU6ZMuapPbq423qgFYKS1a9eqbdu2ioiI0JEjRzRjxgz169ePOME1raioSK+88oqOHTum3/72t409nQZFoAAw0smTJ5WYmKiCggKFhIRo0KBBWrx4cWNPC2hUDodDISEhWr58udq3b9/Y02lQvMUDAACMwyfHAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMYhUAAAgHH+H2Bz3J/f1CN2AAAAAElFTkSuQmCC", "text/plain": [ "

" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "\n", "width = 0.35 # the width of the bars\n", "x = [0, 0.8, 1.6] # position of the bars\n", "situation_counts = nutri.situation.value_counts() # note that situation_counts is a pandas series\n", "plt.bar(x,situation_counts,width,edgecolor='black')\n", "plt.xticks(x,situation_counts.index)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Quantitative variables](#toc0_)\n", "\n", "Such type oif variables allow for more complex graphical representation. We will see some possibilities. We are in particular interested in visualizing the location, dispersion and shae of the data.\n", "\n", "The first example is visualizing data with a boxplot, that gives information on the location and dispersion, showing also outliers (data $x_i$ that is beyond the whiskers of the boxplot, this is, $x_iQ_3+1.5 (Q_3-Q_1)$, being $Q_3-Q_1$ called *interquantile range* or IQR)." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAhYAAAGwCAYAAAD16iy9AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8pXeV/AAAACXBIWXMAAA9hAAAPYQGoP6dpAAAWXklEQVR4nO3dbWzV9fn48aulUoq0bLIBRQoK0bQGJzJnvJu6PdB5Q9yWqFPZYEZnotkUFmGbDjfF6HSJe2DchP0yYyBuWTBmkky8mdOgiTegjmzlVrxhiG46bRWFQD//BwsnfyYIlKs9bX29Eh70fE/P9+rHDz1vv+e01JRSSgAAJKit9gAAwMAhLACANMICAEgjLACANMICAEgjLACANMICAEhT19sn7Orqik2bNkVjY2PU1NT09ukBgG4opURnZ2eMGTMmamv3fF2i18Ni06ZN0dLS0tunBQASvP766zF27Ng9Hu/1sGhsbIyI/w7W1NTU26cHALqho6MjWlpaKs/je9LrYbHz5Y+mpiZhAQD9zN7exuDNmwBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKSpq/YA9A9r166Nzs7Oao/RZ9Rs/yiGvP9afDRsXJS6IdUep09pbGyMI444otpjAFUiLNirtWvXxpFHHlntMfqUY0fXxoorhsWUu9+PFzZ3VXucPmfNmjXiAj6lhAV7tfNKxcKFC6Otra3K0/QNDe+uiXjyili0aFF8+BnRtVN7e3tMmzbN1S34FBMW7LO2traYMmVKtcfoGzbVRjwZ0dbaGjFmcrWnAegzvHkTAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgzYMJiy5YtsWLFitiyZUu1RwGA/TZQnscGTFisWrUqvvjFL8aqVauqPQoA7LeB8jw2YMICAKg+YQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApKmr9gAAwIHbtm1b3HXXXbF+/fqYOHFiXHnllTF48OBen2O/r1g8+eSTMXXq1BgzZkzU1NTEAw880ANjAQD7avbs2XHwwQfHzJkz484774yZM2fGwQcfHLNnz+71WfY7LD744IM45phj4s4
77+yJeQCA/TB79uy4/fbbY8SIEbFgwYJ44403YsGCBTFixIi4/fbbez0u9vulkLPOOivOOuusnpgFANgP27ZtizvuuCNGjRoVGzdujLq6/z6tX3bZZTFjxowYO3Zs3HHHHTFv3rxee1mkx99jsXXr1ti6dWvl446Ojh45z4cffhgREe3t7T3y+J9mO9d05xrDnvh7CN3Xne+1d911V2zfvj3mzZtXiYqd6urq4sYbb4wrrrgi7rrrrrjmmmsyx92jHg+LW265JX7+85/39GnilVdeiYiIadOm9fi5Pq1eeeWVOPnkk6s9Bn2Yv4dw4Pbne+369esjIuLcc8/d7fGdt++8X2/o8bD48Y9/HLNmzap83NHRES0tLennOeywwyIiYuHChdHW1pb++J9m7e3tMW3atMoaw574ewjd153vtRMnToyIiCVLlsRll132seNLlizZ5X69ocfDor6+Purr63v6NNHQ0BAREW1tbTFlypQeP9+n0c41hj3x9xAO3P58r73yyivj2muvjeuvvz5mzJixy8sh27dvj7lz50ZdXV1ceeWVPTHqbvkFWQDQTw0ePDhmzpwZb775ZowdOzbmz58fmzZtivnz58fYsWPjzTffjJkzZ/bq77PY7ysW77//fqxbt67y8YYNG+LFF1+MQw45JMaNG5c6HADwyW677baIiLjjjjviiiuuqNxeV1cX1157beV4b9nvsHj++efjK1/5SuXjne+fmD59etxzzz1pgwEA++a2226LefPm9YnfvLnfYXH66adHKaUnZgEAumnw4MG99iOln8R7LACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgzYMKitbU1li9fHq2trdUeBQD220B5Hqur9gBZhg4dGlOmTKn2GADQLQPleWzAXLEAAKpPWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJCmrtoD0Pdt2bIlIiJWrFhR5Un6joZ310RbRLSvWhUfbu6q9jh9Rnt7e7VHAKpMWLBXq1atioiIyy+/vMqT9B3Hjq6NFVcMi0suuSReEBYf09jYWO0RgCoRFuzV17/+9YiIaG1tjaFDh1Z3mD6iZvtH0f7+a/F/Z4+LUjek2uP0KY2NjXHEEUdUewygSmpKKaU3T9jR0RHDhw+P9957L5qamnrz1ABAN+3r87c3bwIAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJB
GWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJCmrrdPWEqJiIiOjo7ePjUA0E07n7d3Po/vSa+HRWdnZ0REtLS09PapAYAD1NnZGcOHD9/j8Zqyt/RI1tXVFZs2bYrGxsaoqalJe9yOjo5oaWmJ119/PZqamtIeF2vbk6xtz7CuPcfa9py+vrallOjs7IwxY8ZEbe2e30nR61csamtrY+zYsT32+E1NTX3yP8hAYG17jrXtGda151jbntOX1/aTrlTs5M2bAEAaYQEApBkwYVFfXx833HBD1NfXV3uUAcfa9hxr2zOsa8+xtj1noKxtr795EwAYuAbMFQsAoPqEBQCQRlgAAGmEBQCQpt+FxT//+c+YNm1ajBgxIoYOHRqTJ0+O5cuXV47PmDEjampqdvlzwgknVHHi/uGwww772LrV1NTEVVddFRH//Y1rP/vZz2LMmDHR0NAQp59+evz973+v8tT9w97W1p7tnu3bt8f1118fhx9+eDQ0NMSECRPixhtvjK6ursp97Nvu2Ze1tW+7r7OzM6655poYP358NDQ0xEknnRTPPfdc5Xi/37elH3nnnXfK+PHjy4wZM8ozzzxTNmzYUB599NGybt26yn2mT59evva1r5U33nij8uftt9+u4tT9w1tvvbXLmj3yyCMlIsrjjz9eSinl1ltvLY2NjWXx4sVl5cqV5cILLyzNzc2lo6OjuoP3A3tbW3u2e+bNm1dGjBhRlixZUjZs2FD++Mc/lmHDhpVf/epXlfvYt92zL2tr33bfBRdcUI466qjyxBNPlLVr15YbbrihNDU1lY0bN5ZS+v++7VdhMWfOnHLKKad84n2mT59ezjvvvN4ZaAC7+uqry8SJE0tXV1fp6uoqo0ePLrfeemvl+EcffVSGDx9efvOb31Rxyv7p/1/bUuzZ7jrnnHPKpZdeustt3/zmN8u0adNKKcW+PQB7W9tS7Nvu2rJlSxk0aFBZsmTJLrcfc8wx5brrrhsQ+7ZfvRTypz/9KY477rg4//zzY+TIkXHsscfGggULPna/v/71rzFy5Mg48sgj4/LLL4+33nqrCtP2X9u2bYuFCxfGpZdeGjU1NbFhw4bYvHlznHHGGZX71NfXx2mnnRZPP/10FSftf/53bXeyZ/ffKaecEo899lisWbMmIiJeeumlWLZsWZx99tkREfbtAdjb2u5k3+6/7du3x44dO2LIkCG73N7Q0BDLli0bEPu21/8RsgPx8ssvx69//euYNWtW/OQnP4lnn302fvCDH0R9fX185zvfiYiIs846K84///wYP358bNiwIX7605/GV7/61Vi+fHm//21mveWBBx6Id999N2bMmBEREZs3b46IiFGjRu1yv1GjRsWrr77a2+P1a/+7thH2bHfNmTMn3nvvvWhtbY1BgwbFjh074uabb46LLrooIuzbA7G3tY2wb7ursbExTjzxxLjpppuira0tRo0aFffdd18888wzccQRRwyMfVvtSyb746CDDionnnjiLrd9//vfLyeccMIeP2fTpk3loIMOKosXL+7p8QaMM844o5x77rmVj5966qkSEWXTpk273O+yyy4rZ555Zm+P16/979rujj27b+67774yduzYct9995W//e1v5d577y2HHHJIueeee0op9u2B2Nva7o59u+/WrVtXTj311BIRZdCgQeVLX/pSueSSS0pbW9uA2Lf96opFc3NzHHXUUbvc1tbWFosXL/7Ezxk/fnysXbu2p8cbEF599dV49NFH4/7776/cNnr06Ij47/8BNjc3V25/6623PlbV7Nnu1nZ37Nl9c+2118aPfvSj+Na3vhUREUcffXS8+uqrccstt8T06dPt2wOwt7XdHft2302cODGeeOKJ+OCDD6KjoyOam5vjwgsvjMMPP3xA7Nt+9R6Lk08
+OVavXr3LbWvWrInx48fv8XPefvvteP3113f5D8Se/e53v4uRI0fGOeecU7lt52Z/5JFHKrdt27YtnnjiiTjppJOqMWa/tLu13R17dt9s2bIlamt3/RY2aNCgyo9E2rfdt7e13R37dv8dfPDB0dzcHP/5z39i6dKlcd555w2MfVvtSyb749lnny11dXXl5ptvLmvXri2LFi0qQ4cOLQsXLiyllNLZ2Vl++MMflqeffrps2LChPP744+XEE08shx56aL/5MZ1q2rFjRxk3blyZM2fOx47deuutZfjw4eX+++8vK1euLBdddFG/+vGnatvT2tqz3Td9+vRy6KGHVn4k8v777y+f+9znyuzZsyv3sW+7Z29ra98emIceeqj8+c9/Li+//HJ5+OGHyzHHHFOOP/74sm3btlJK/9+3/SosSinlwQcfLJMmTSr19fWltbW1zJ8/v3Jsy5Yt5Ywzziif//zny0EHHVTGjRtXpk+fXl577bUqTtx/LF26tEREWb169ceOdXV1lRtuuKGMHj261NfXl1NPPbWsXLmyClP2T3taW3u2+zo6OsrVV19dxo0bV4YMGVImTJhQrrvuurJ169bKfezb7tnb2tq3B+YPf/hDmTBhQhk8eHAZPXp0ueqqq8q7775bOd7f961/Nh0ASNOv3mMBAPRtwgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgL4RA899FCccsop8ZnPfCZGjBgR5557bqxfv75y/Omnn47JkyfHkCFD4rjjjosHHnggampq4sUXX6zc5x//+EecffbZMWzYsBg1alR8+9vfjn//+99V+GqAniYsgE/0wQcfxKxZs+K5556Lxx57LGpra+Mb3/hGdHV1RWdnZ0ydOjWOPvroWLFiRdx0000xZ86cXT7/jTfeiNNOOy0mT54czz//fDz00EPx5ptvxgUXXFClrwjoSf51U2C//Otf/4qRI0fGypUrY9myZXH99dfHxo0bY8iQIRER8dvf/jYuv/zyeOGFF2Ly5Mkxd+7ceOaZZ2Lp0qWVx9i4cWO0tLTE6tWr48gjj6zWlwL0AFcsgE+0fv36uPjii2PChAnR1NQUhx9+eEREvPbaa7F69er4whe+UImKiIjjjz9+l89fvnx5PP744zFs2LDKn9bW1spjAwNLXbUHAPq2qVOnRktLSyxYsCDGjBkTXV1dMWnSpNi2bVuUUqKmpmaX+//vRdCurq6YOnVq/OIXv/jYYzc3N/fo7EDvExbAHr399tvR3t4ed999d3z5y1+OiIhly5ZVjre2tsaiRYti69atUV9fHxERzz///C6PMWXKlFi8eHEcdthhUVfnWw4MdF4KAfbos5/9bIwYMSLmz58f69ati7/85S8xa9asyvGLL744urq64nvf+160t7fH0qVL45e//GVEROVKxlVXXRXvvPNOXHTRRfHss8/Gyy+/HA8//HBceumlsWPHjqp8XUDPERbAHtXW1sbvf//7WL58eUyaNClmzpwZt99+e+V4U1NTPPjgg/Hiiy/G5MmT47rrrou5c+dGRFTedzFmzJh46qmnYseOHXHmmWfGpEmT4uqrr47hw4dHba1vQTDQ+KkQINWiRYviu9/9brz33nvR0NBQ7XGAXuYFT+CA3HvvvTFhwoQ49NBD46WXXoo5c+bEBRdcICrgU0pYAAdk8+bNMXfu3Ni8eXM0NzfH+eefHzfffHO1xwKqxEshAEAa75wCANIICwAgjbAAANIICwAgjbAAANIICwAgjbAAANIICwAgzf8D2bsRZQItCXYAAAAASUVORK5CYII=", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.boxplot(nutri.age,widths=width,vert=False) # the vert controls the verticality of the plot\n", "plt.xlabel('age')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a slightly more complex setup, we can compare two different categories in the boxplot:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAioAAAGwCAYAAACHJU4LAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8pXeV/AAAACXBIWXMAAA9hAAAPYQGoP6dpAAAzSklEQVR4nO3de3zOdePH8fdl2MwMOWVZVKw5jJ3kkFPkMBJCTsmponvJLHGLiLiHWw4RTYnUI1QOt8ohkWOEMdyOI6diSQ5z2tj2/f1x3+3XbpLG9vlsez0fjz3ursN2ved22WvX9d01l+M4jgAAACyUx/QAAACAP0KoAAAAaxEqAADAWoQKAACwFqECAACsRagAAABrESoAAMBaeU0PuBOpqak6efKkChUqJJfLZXoOAAC4DY7j6OLFi/Lx8VGePLd+zCRbh8rJkyfl6+tregYAAMiAEydOqEyZMre8TrYOlUKFCkn6zyfq7e1teA0AALgdCQkJ8vX1Tfs6fivZOlR+e7rH29ubUAEAIJu5ncM2OJgWAABYi1ABAADWIlQAAIC1CBUAAGAtQgUAAFiLUAEAANYiVAAAgLUIFQAAYC1CBQAAWItQAQAA1jIaKm+88YZcLle6t3vvvdfkJAAAYBHjv+uncuXK+uabb9JOu7m5GVwDAABsYjxU8ubNy6MoAADgpoyHSlxcnHx8fOTu7q4aNWroH//4hx588MGbXjcpKUlJSUlppxMSErJqJv7rypUr2r9/v5Hbvnr1qo4ePapy5cqpQIECRjb4+/vL09PTyG0DmY37N/dvGxkNlRo1amjOnDny8/PTzz//rFGjRql27dras2ePihUrdsP1o6KiNGLECANL8Zv9+/crJCTE9AxjYmJiFBwcbHoGkCm4f3P/tpHLcRzH9IjfXL58WQ899JAGDhyoyMjIGy6/2SMqvr6+unDhgry9vbNyaq5l8juuffv26ZlnntHHH3+sihUrGtnAd1zIybh/c//OKgkJCSpcuPBtff02/tTP7xUsWFABAQGKi4u76eXu7u5yd3fP4lX4PU9PT+PfcVSsWNH4BiAn4v4NG1n1OipJSUnat2+fSpcubXoKAACwgNFQGTBggNauXasjR47o+++/V7t27ZSQkKBu3bqZnAUAACxh9KmfH3/8UZ06ddKZM2dUokQJ1axZU5s3b1bZsmVNzgIAAJYwGirz5s0zefMAAMByVh2jAgAA8HuECgAAsBahAgAArEWoAAAAaxEqAADAWoQKAACwFqECAACsRagAAABrESoAAMBahAoAALAWoQIAAKxFqAAAAGsRKgAAwFqECgAAsBahAgAArEWoAAAAaxEqAADAWoQKAACwFqECAACsRagAAABrESoAAMBahAoAALAWoQIAAKxFqAAAAGsRKgAAwFqECgAAsBahAgAArEWoAAAAaxEqAADAWoQKAACwFqECAACsRagAAABrESoAAMBahAoAALAWoQIAAKxFqAAAAGsRKgAAwFqECgAAsBahAgAArEWoAAAAaxEqAADAWoQKAACw
FqECAACsRagAAABrESoAAMBahAoAALAWoQIAAKxFqAAAAGsRKgAAwFqECgAAsBahAgAArEWoAAAAaxEqAADAWoQKAACwFqECAACsRagAAABrESoAAMBa1oRKVFSUXC6XIiIiTE8BAACWsCJUtm7dqhkzZqhq1aqmpwAAAIsYD5VLly6pS5cueu+991S0aNFbXjcpKUkJCQnp3gAAQM5lPFTCw8PVokULPf7443963aioKBUuXDjtzdfXNwsWAgAAU4yGyrx587R9+3ZFRUXd1vUHDx6sCxcupL2dOHEikxcCAACT8pq64RMnTqhfv376+uuv5eHhcVvv4+7uLnd390xeBgAAbGEsVGJiYnT69GmFhISknZeSkqJ169Zp6tSpSkpKkpubm6l5AADAAsZCpVGjRtq9e3e683r06CF/f38NGjSISAEAAOZCpVChQqpSpUq68woWLKhixYrdcD4AAMidjP/UDwAAwB8x9ojKzaxZs8b0BAAAYBEeUQEAANYiVAAAgLUIFQAAYC1CBQAAWItQAQAA1iJUAACAtQgVAABgLUIFAABYi1ABAADWIlQAAIC1CBUAAGAtQgUAAFiLUAEAANYiVAAAgLUIFQAAYC1CBQAAWItQAQAA1iJUAACAtQgVAABgLUIFAABYi1ABAADWIlQAAIC1CBUAAGAtQgUAAFiLUAEAANYiVAAAgLUIFQAAYC1CBQAAWItQAQAA1iJUAACAtQgVAABgLUIFAABYi1ABAADWIlQAAIC1CBUAAGAtQgUAAFiLUAEAANYiVAAAgLUIFQAAYC1CBQAAWItQAQAA1iJUAACAtQgVAABgLUIFAABYi1ABAADWIlQAAIC1CBUAAGAtQgUAAFiLUAEAANYiVAAAgLUIFQAAYC1CBQAAWItQAQAA1iJUAACAtQgVAABgLUIFAABYy2ioTJ8+XVWrVpW3t7e8vb1Vq1YtLVu2zOQkAABgEaOhUqZMGY0ZM0bbtm3Ttm3b1LBhQ7Vq1Up79uwxOQsAAFgir8kbb9myZbrTo0eP1vTp07V582ZVrlz5husnJSUpKSkp7XRCQkKmb7TRzz//rBUrVshxHNNTstTRo0clSV988YV2795tdkwWCwgIUHBwsOkZAJDljIbK76WkpOizzz7T5cuXVatWrZteJyoqSiNGjMjiZfaZPHmyoqKiTM8w5o033jA9IctVrlxZ//73v03PAIAsZzxUdu/erVq1aikxMVFeXl5atGiRKlWqdNPrDh48WJGRkWmnExIS5Ovrm1VTrXH9+nVVqFBBe/fuNT0lS125ckX79++Xv7+/PD09Tc/JMoMGDdKXX35pegYAGGE8VB5++GHFxsbq/PnzWrBggbp166a1a9feNFbc3d3l7u5uYKV9XC6X8uY1/n9flvL29tYjjzxiekaWy5OHH84DkHsZ/0qXP39+lS9fXpIUGhqqrVu3avLkyYqOjja8DAAAmGbdt2qO46Q7YBYAAOReRh9Ree211xQWFiZfX19dvHhR8+bN05o1a7R8+XKTswAAgCWMhsrPP/+srl276tSpUypcuLCqVq2q5cuXq3HjxiZnAQAASxgNlZkzZ5q8eQAAYDnrjlEBAAD4DaECAACsRagAAABrESoAAMBahAoAALAWoQIAAKxFqAAAAGtlOFSSk5P1zTffKDo6WhcvXpQknTx5UpcuXbpr4wAAQO6WoRd8O3bsmJo1a6bjx48rKSlJjRs3VqFChTRu3DglJibq3Xffvds7AQBALpShR1T69eun0NBQnTt3TgUKFEg7v02bNlq1atVdGwcAAHK3DD2ismHDBm3cuFH58+dPd37ZsmX1008/3ZVhAAAAGXpEJTU1VSkpKTec/+OPP6pQoUJ3PAoAAEDKYKg0btxYkyZNSjvtcrl06dIlDR8+XM2bN79b2wAAQC6Xoad+Jk6cqMcee0yVKlVSYmKiOnfurLi4OBUvXlxz58692xsBAEAulaFQ8fHxUWxsrObOnavt27cr
NTVVvXr1UpcuXdIdXAsAAHAnMhQqklSgQAH17NlTPXv2vJt7AAAA0mT4Bd8++ugj1alTRz4+Pjp27Jik/zwl9K9//euujQMAALlbhkJl+vTpioyMVFhYmM6dO5f2E0BFixZNd5AtAADAnchQqEyZMkXvvfeehgwZorx5///Zo9DQUO3evfuujQMAALlbhkLlyJEjCgoKuuF8d3d3Xb58+Y5HAQAASBkMlQceeECxsbE3nL9s2TJVqlTpTjcBAABIyuBP/bz66qsKDw9XYmKiHMfRli1bNHfuXEVFRen999+/2xsBAEAulaFQ6dGjh5KTkzVw4EBduXJFnTt31n333afJkyerY8eOd3sjAADIpW77qZ8lS5bo+vXraaeff/55HTt2TKdPn1Z8fLxOnDihXr16ZcpIAACQO912qLRp00bnz5+XJLm5uen06dOSpOLFi6tkyZKZMg4AAORutx0qJUqU0ObNmyVJjuPI5XJl2igAAADpLxyj0qdPH7Vq1Uoul0sul0v33nvvH173txeAAwAAuBO3HSpvvPGGOnbsqEOHDunJJ5/UrFmzVKRIkUycBgAAcrvbDpUlS5YoLCxM/v7+Gj58uNq3by9PT8/M3AYAAHK5DB1MO3LkSF26dCmzNgEAAEjiYFoAAGAxDqYFAADW4mBaAABgrb/0Evr+/v4cTAsAALJMhn7Xz/DhwyVJv/zyiw4cOCCXyyU/Pz+VKFHiro4DAAC5220fTPt7V65cUc+ePeXj46N69eqpbt268vHxUa9evXTlypW7vREAAORSGQqV/v37a+3atVqyZInOnz+v8+fP61//+pfWrl2rV1555W5vxP/w8PDQyZMntXbtWtNTkMlOnTqlNWvWyMPDw/QUADAiQ6GyYMECzZw5U2FhYfL29pa3t7eaN2+u9957T59//vnd3oj/0bdvX4WGhqphw4YaMWIEP2WVQy1fvlzVqlXTTz/9pKlTp5qeAwBGZPipn1KlSt1wfsmSJXnqJwuULFlS33zzjYYPH66RI0fq8ccf18mTJ03Pwl1y/fp1DRo0SGFhYQoJCVFsbKzq1q1rehYAGJGhUKlVq5aGDx+uxMTEtPOuXr2qESNGqFatWndtHP6Ym5ubhg0bptWrV+vgwYOqVq2ali1bZnoW7tDRo0dVr149TZgwQePGjdNXX32lkiVLmp4FAMZk6Kd+Jk2apLCwMJUpU0bVqlWTy+VSbGys3N3d9fXXX9/tjbiF+vXrKzY2Vt27d1fz5s01YMAAjR49Wvnz5zc9DX/RwoUL1atXLxUpUkTr169XzZo1TU8CAOMy9IhKQECA4uLiFBUVpcDAQFWtWlVjxozRoUOHVLly5bu9EX+iRIkS+uKLL/TWW29p0qRJqlu3ro4cOWJ6Fm5TYmKiwsPD1bZtWzVq1Eg7duwgUgDgvzL0iEpUVJRKlSql559/Pt35H3zwgX755RcNGjTorozD7cuTJ48iIyNVp04ddezYUUFBQXr//ffVrl0709NwCwcOHFCHDh20f/9+TZs2TX369OH3aAHA72ToEZXo6Gj5+/vfcH7lypX17rvv3vEoZNwjjzyiHTt2qEmTJmrfvr1efPFFXb161fQs3MScOXMUEhKixMREff/993rxxReJFAD4HxkKlfj4eJUuXfqG80uUKKFTp07d8SjcmcKFC2v+/PmKjo7W7NmzVaNGDe3fv9/0LPzXpUuX1K1bN3Xr1k3t2rXTtm3bVK1aNdOzAMBKGQoVX19fbdy48YbzN27cKB8fnzsehTvncrn0wgsvaMuWLbp+/bpCQkI0e/ZsOY5jelqutnPnToWGhmrBggWaM2eOZs+eLS8vL9OzAMBaGQqV5557ThEREZo1a5aOHTumY8eO6YMPPlD//v1vOG4FZgUEBGjbtm3q0KGDevTooWeffVYXL140PSvXcRxH06ZNU40aNeTh4aGYmBh17drV9CwAsF6GDqYdOHCgzp49q7/97W+6du2apP+8rPugQYM0
ePDguzoQd65gwYL64IMP1KhRI/Xp00fff/+95s+fr6CgINPTcoXz58/rueee04IFCxQeHq7x48fzkvgAcJsy9IiKy+XS2LFj9csvv2jz5s3auXOnzp49q2HDht3tfbiLunTpou3bt8vLy0s1a9bU1KlTeSook23evFmBgYFatWqVFixYoKlTpxIpAPAXZChUfuPl5aXq1aurSpUqcnd3v1ubkIkqVKigTZs2qXfv3urbt6/atm2rc+fOmZ6V46SmpmrcuHGqW7euSpcurR07duipp54yPQsAsp07ChVkT+7u7nr77be1aNEirVmzRoGBgfruu+9Mz8oxTp8+rRYtWmjQoEF65ZVXtG7dOpUrV870LADIlgiVXKx169aKjY1VmTJlVK9ePY0ZM0apqammZ2Vrq1evVmBgoGJiYrR8+XKNGTNG+fLlMz0LALItQiWXu//++7VmzRoNGjRIr732msLCwvTzzz+bnpXtJCcna9iwYXr88cdVsWJF7dy5U02bNjU9CwCyPUIFypcvn0aPHq0VK1YoNjY27eBP3J4ff/xRDRs21OjRozVy5Eh9/fXXN31BRADAX0eoIE3jxo21c+dOValSRY0bN9bQoUOVnJxsepbVvvzySwUGBurIkSNas2aNhg4dKjc3N9OzACDHIFSQzr333qsVK1Zo1KhRGjNmjB577DGdOHHC9CzrXLt2TZGRkWrZsqVq166t2NhY1a1b1/QsAMhxjIZKVFSUqlevrkKFCqlkyZJq3bq1Dhw4YHIS9J/fxPzaa69p7dq1OnbsmAIDA7VkyRLTs6xx+PBhPfroo5o6daomTpyof/3rXypWrJjpWQCQIxkNlbVr1yo8PFybN2/WypUrlZycrCZNmujy5csmZ+G/Hn30UcXGxqpOnTpq1aqVIiIilJSUZHqWUfPnz1dwcLDOnTun7777ThEREfzGYwDIRBl6Cf27Zfny5elOz5o1SyVLllRMTIzq1atnaBV+75577tHixYs1ZcoUvfrqq1q/fr3mz5+v8uXLm56Wpa5evaqIiAjNmDFDHTt2VHR0tLy9vU3PQg6UnJysAwcO5LpXjT506FDa/+bPn9/wmqxVpkwZFSlSxPQMaxkNlf914cIFSf/54ngzSUlJ6b6jT0hIyJJduZ3L5dLLL7+se+65Ry/06KqJA7vrnalTTc/KUks+/VRbl7yvZzq01ZxPPuFRFGSaqKioXP3rSDp06GB6QpYLDQ3V1q1bTc+wljWh4jiOIiMjVadOHVWpUuWm14mKitKIESOyeBkcx9GsWbP00ksvqfkj5fVO1d3SjPqmZ2WpDpI69PZScPQiPf/883r77bfl6elpehZyoISEBPn6+uqzzz4zPSVL7d+/X927d9fs2bPl7+9vek6Wefvtt7Vt2zbTM6xmTai89NJL2rVrlzZs2PCH1xk8eLAiIyPTTv92h0bmSUhI0IsvvqhPPvlEzz33nCaPj5Iu/2h6lhGOHEX4btGLL0dq06ZN+vTTT1W5cmXTs5ADFShQQDVq1DA9I0sFBAQoICBA/v7+ueqbAB8fH9MTrGdFqPTt21dLlizRunXrVKZMmT+8nru7O7/8MAvFxMSoY8eO+vnnnzV37lx17NjxPxcULm52mCEuSc/2ClJorbrq0KGDqlevrsmTJ+u5557jqSDgDnl6eio4ONj0DFjI6E/9OI6jl156SQsXLtTq1av1wAMPmJyD/3IcR5MnT1atWrVUuHBhbd++/f8jBapUqZK2bNmirl276oUXXlCnTp04XgoAMonRUAkPD9fHH3+sTz75RIUKFVJ8fLzi4+N19epVk7NytbNnz6p169aKiIhQeHi4Nm7cmOt+wud2FChQQNHR0Zo/f76WLVumoKAgnmcGgExgNFSmT5+uCxcuqEGDBipdunTa2/z5803OyrU2bNigwMBAbdiwQUuWLNHEiRN5qu1PPP3009qxY4eKFSum2rVra+LEibnux0oBIDMZf+rnZm/du3c3OSvX
SUlJ0ejRo9WgQQOVLVtWsbGxatmypelZ2caDDz6oDRs26OWXX1ZkZKSefPJJ/frrr6ZnAUCOwO/6yeXi4+PVtGlTvf766xo8eLC+/fZbfpIqA/Lnz6/x48fryy+/1KZNm1StWjWtX7/e9CwAyPYIlVxs5cqVqlatmvbs2aOVK1fqzTffVN68VvwgWLbVokULxcbG6qGHHlKDBg305ptvKiUlxfQsAMi2CJVc6Pr163rttdfUtGlTBQYGKjY2Vo0aNTI9K8coU6aMVq9erddff13Dhw9XkyZNdOrUKdOzACBbIlRymWPHjqlBgwYaN26coqKitGzZMpUqVcr0rBzHzc1Nb7zxhlatWqV9+/apWrVqWrFihelZAJDtECq5yOLFixUYGKiffvpJ69ev16BBg5QnD38FMtNjjz2m2NhYhYSEqFmzZho0aJCuX79uehYAZBt8lcoFEhMT1bdvX7Vp00aPPfaYduzYoVq1apmelWuULFlSX331lf75z39qwoQJqlevno4ePWp6FgBkC4RKDnfw4EHVrl1bM2bM0NSpU7VgwQIVLVrU9KxcJ0+ePBowYIA2bNig+Ph4BQUFaeHChaZnAYD1CJUc7OOPP1ZISIguX76s77//XuHh4fxOGsNq1KihHTt26PHHH1fbtm0VHh6uxMRE07MAwFqESg50+fJl9ejRQ127dlWbNm0UExOjwMBA07PwX0WKFNGnn36qadOmaebMmapZs6YOHDhgehYAWIlQyWF27dql0NBQffbZZ/rwww81Z84ceXl5mZ6F/+FyufTiiy/q+++/V2JiokJCQjRnzhzTswDAOoRKDuE4jqKjo1WjRg3lz59f27Zt07PPPmt6Fv5EtWrVFBMTo/bt26tbt27q1q2bLl26ZHoWAFiDUMkBzp8/rw4dOqhPnz7q0aOHNm/eLH9/f9OzcJsKFiyoWbNmac6cOVqwYIFCQ0O1c+dO07MAwAqESja3ZcsWBQUF6euvv9bnn3+uadOmqUCBAqZnIQO6du2qmJgYeXh4qEaNGpo2bRq/iRlArkeoZFOpqal666239Oijj6pUqVKKjY1V27ZtTc/CHXr44Ye1efNmPf/88woPD1f79u11/vx507MAwBhCJRs6e/asWrZsqQEDBigyMlLr169XuXLlTM/CXeLh4aEpU6Zo4cKFWrVqlQIDA7VlyxbTswDACEIlG5o8ebLWrFmjZcuWaezYscqXL5/pScgEbdq0UWxsrIoUKaLevXubngMARhAq2dCVK1dUpkwZNWvWzPQUZLKyZcuqcePGunLliukpAGAEoQIAAKxFqAAAAGsRKgAAwFqECgAAsBahAgAArEWoAAAAaxEqAADAWoQKAACwFqECAACsRagAAABrESoAAMBahAoAALAWoQIAAKxFqAAAAGsRKgAAwFqECgAAsBahAgAArEWoAAAAaxEqAADAWoQKAACwFqECAACsRagAAABrESoAAMBahAoAALAWoQIAljl//ryOHz9uegYy2dWrV3Xw4EHTM6xHqACARZ5++mnlyZNH/v7+Gj58uC5fvmx6Eu4yx3H02WefqWLFilq6dKlefPFF05OsRqgAgEWqV6+ugwcPKiIiQmPHjtXDDz+sjz/+WKmpqaan4S7YsWOH6tevr6effloBAQHas2ePIiIiTM+yGqECAJYpVKiQ/vGPf2jfvn2qWbOmunbtqtq1a+v77783PQ0ZFB8fr169eikkJES//vqrVqxYoS+++EJ+fn6mp1mPUAEASz3wwAP6/PPPtWbNGiUmJqZFy48//mh6Gm5TUlKSxo4dKz8/Py1evFhTpkzRzp071aRJE9PTsg1CBQAsV79+fcXExOi9997T119/rYcfflgjR47UlStXTE/DH3AcRwsXLlSlSpU0ZMgQ9ejRQ3FxcQoPD1fevHlNz8tWCBUAyAbc3Nz03HPPpX2xGzVqlPz9/TVv3jw5jmN6Hn5n586datiwodq2bSs/Pz/t3r1bkydP1j33
3GN6WrZEqABANuLt7a1x48Zp7969CgkJUadOnVSnTh1t3brV9LRc7/Tp0+rdu7eCg4MVHx+vpUuXatmyZapYsaLpadkaoQIA2VD58uW1aNEirVq1ShcvXtQjjzyi7t276+TJk6an5TrXrl3T+PHjVaFCBX366aeaMGGCdu3apbCwMNPTcgRCBQCysYYNG2r79u1699139dVXX8nPz0+jR4/W1atXTU/L8RzH0ZIlS1S5cmX9/e9/V9euXRUXF6d+/fopX758puflGIQKAGRzefPmVe/evRUXF6fevXvrjTfeUMWKFfXZZ59x/Eom2b17t5o0aaJWrVrpgQce0M6dOzV16lQVL17c9LQch1ABgByiSJEieuutt7Rnzx5VrVpVTz/9tOrXr6/t27ebnpZjnDlzRn/7298UGBioY8eO6YsvvtCKFStUuXJl09NyLEIFAHIYPz8/LVmyRCtWrNDZs2cVGhqqXr16KT4+3vS0bOvatWuaNGmSKlSooE8++UT//Oc/9e9//1tPPPGEXC6X6Xk5GqECADlUkyZNFBsbqylTpmjx4sXy8/PT2LFjlZiYaHpatuE4jr766isFBATolVdeUYcOHRQXF6fIyEjlz5/f9LxcgVABgBwsb968Cg8PV1xcnHr27KmhQ4eqUqVKWrhwIcev/Im9e/cqLCxMTzzxhMqUKaMdO3bo3XffVYkSJUxPy1WMhsq6devUsmVL+fj4yOVyafHixSbnAECOdc8992jSpEnavXu3KlasqLZt26phw4aKjY01Pc06Z8+eVd++fVW1alUdOnRIixcv1jfffKOqVauanpYrGQ2Vy5cvq1q1apo6darJGQCQa/j7++urr77SsmXLFB8fr+DgYL3wwgs6ffq06WnGXb9+XVOmTFH58uX14YcfKioqSnv27FGrVq04DsUgo6ESFhamUaNG6amnnjI5AwBynWbNmmnXrl2aNGmSPv/8c1WoUEHjx4/XtWvXTE8zYvny5apWrZr69eundu3aKS4uTq+++qrc3d1NT8v1stVvRkpKSlJSUlLa6YSEBINrzBk/frwk5bo7kOM4chxHLpcrV313k5ycLH9/f9MzkAPly5dPL7/8srp06aKePXtq2OBXdXrnSo0bO9b0tCz11dKlev3113XdvbS2b9+uwMBA05PwO9kqVKKiojRixAjTM4wbN26cFi1apC5dupiekqVOnDihsWPHatCgQfL19TU9J0vxDycyy7lz5zRq1CgtXbpUTauW0biHNksz6puelaVaSGrR20vB0ac0YMAATZw4UQEBAaZn4b9cjiWHfbtcLi1atEitW7f+w+vc7BEVX19fXbhwQd7e3lmwEiZt375dISEhiomJUXBwsOk5QLaWnJysGTNmaNiwYUpKStKQIUMUEf6CPC4eNz3NCEeOlm87oohXB+vQoUN64YUXNHLkSH7CJ5MkJCSocOHCt/X1O1s9ouLu7p7rnu4AgLtt5cqV6t+/v/bu3asePXpo1KhRKl269H8uLHSP2XGGuCSFPRmkRs2e0DvvvKMRI0Zo7ty5Gj58uMLDw3nNFIN4HRUAyCXi4uL05JNPqkmTJipatKi2bt2qmTNn/n+kQPnz51f//v0VFxenzp07a8CAAQoICNCXX37J684YYjRULl26pNjY2LSf4z9y5IhiY2N1/HjufOgRADLD+fPnNWDAAFWuXFm7du3S/PnztW7dOoWEhJieZq0SJUpo2rRpio2Nla+vr1q2bKlmzZpp7969pqflOkZDZdu2bQoKClJQUJAkKTIyUkFBQRo2bJjJWQCQI6SkpCg6Olp+fn569913NXz4cO3bt09PP/10rvrJuTsREBCglStXavHixTp8+LCqVq2qvn376tdffzU9LdcwGioNGjRI+5HT37/Nnj3b5CwAyPa+/fZbBQcHq0+fPgoLC9PBgwc1ZMgQFShQwPS0bMflcqlVq1bas2ePxowZozlz5qhChQp6++23df36ddPzcjyOUQGA
HOTw4cN66qmn1LBhQ3l5eWnLli368MMP5ePjY3patufu7q4BAwYoLi5O7dq1U0REhKpVq6bly5ebnpajESoAkAMkJCRo0KBBqlSpkrZu3apPPvlEGzZsUPXq1U1Py3FKliypGTNmaPv27SpVqpTCwsLUvHlz7d+/3/S0HIlQAYBsLCUlRe+//74qVKigKVOmaMiQITpw4IA6derEcSiZLDAwUKtXr9aCBQu0f/9+BQQEKCIiQufOnTM9LUchVAAgm1q3bp2qV6+u559/Xo0bN9bBgwc1bNgweXp6mp6Wa7hcLj311FPau3ev3nzzTc2cOVPly5fXO++8o+TkZNPzcgRCBQCymSNHjqh9+/aqX7++8uXLp02bNunjjz9WmTJlTE/LtTw8PPT3v/9dcXFxat26tfr27avAwECtXLnS9LRsj1ABgGzi4sWLeu2111SxYkVt2rRJH330kTZt2qSaNWuanob/uvfeezVz5kxt3bpVRYsWVZMmTfTkk0/q4MGDpqdlW4QKAFguNTVVs2fPlp+fnyZOnKiBAwfqwIEDeuaZZ5QnD/+M2ygkJETr1q3Tp59+ql27dqlKlSp65ZVXdP78edPTsh3+hgOAxTZs2KBHHnlEPXr0UIMGDXTgwAGNHDlSBQsWND0Nf8Llcql9+/bat2+fhg8frujoaFWoUEHvvvuuUlJSTM/LNggVALDQsWPH1LFjR9WtW1fSf4Jl7ty5uv/++w0vw19VoEABDRkyRAcPHlTz5s314osvKigoSKtXrzY9LVsgVADAIleuXNHrr78uf39/rVu3TrNnz9aWLVv06KOPmp6GO+Tj46MPP/xQW7ZsUaFChdSoUSO1adNGP/zwg+lpViNUAMAikyZN0pgxYxQZGamDBw+qW7duHIeSw1SvXl0bNmzQJ598oi1btqhz586mJ1mNv/0AYJFz587pwQcf1OjRo+Xl5WV6DjKJy+VSp06d1LlzZ14g7k8QKgAAwFqECgAAsBahAgAArEWoAAAAaxEqAADAWoQKAACwFqECAACsRagAAABrESoAAMBahAoAALAWoQIAAKxFqAAAAGsRKgAAwFqECgAAsBahAgAArEWoAAAAaxEqAADAWoQKAACwFqECAACsRagAAABrESoAAMBahAoAALAWoQIAAKxFqAAAAGsRKgAAwFqECgAAsBahAgAArEWoAAAAaxEqAADAWoQKAACwFqECAACsRagAAABrESoAAMBahAoAALAWoQIAAKxFqAAAAGsRKgAAwFqECgAAsBahAgAArEWoAAAAaxEqAADAWoQKAACwFqECAACsRagAAABrGQ+VadOm6YEHHpCHh4dCQkK0fv1605MAAIAljIbK/PnzFRERoSFDhmjHjh2qW7euwsLCdPz4cZOzAACAJYyGyoQJE9SrVy8999xzqlixoiZNmiRfX19Nnz7d5CwAAGCJvKZu+Nq1a4qJidHf//73dOc3adJE33333U3fJykpSUlJSWmnExISMnUjbnTlyhXt37/fyG3v27cv3f+a4O/vL09PT2O3j5xv6tSpSkxMVO3atbP8tlNSUnT16tUsv11bFChQQG5ubll6m0ePHlWhQoWy9DazG2OhcubMGaWkpKhUqVLpzi9VqpTi4+Nv+j5RUVEaMWJEVszDH9i/f79CQkKMbnjmmWeM3XZMTIyCg4ON3T5yvhYtWmjBggXy9/fP8ts+c+aMvvjiiyy/XVu0bNlSxYsXz9Lb9Pf3V506dbL0NrMbY6HyG5fLle604zg3nPebwYMHKzIyMu10QkKCfH19M3Uf0vP391dMTIyR27569aqOHj2qcuXKqUCBAkY2mPjigdzl888/N3bbJh8xtQGPmNrJWKgUL15cbm5uNzx6cvr06RseZfmNu7u73N3ds2Ie/oCnp6fRRxQeffRRY7cN5HSm79/AzRg7mDZ//vwKCQnRypUr052/cuVKI8/NAgAA+xh96icyMlJdu3ZVaGioatWqpRkzZuj48ePq
06ePyVkAAMASRkOlQ4cO+vXXXzVy5EidOnVKVapU0dKlS1W2bFmTswAAgCVcjuM4pkdkVEJCggoXLqwLFy7I29vb9BwAAHAb/srXb+MvoQ8AAPBHCBUAAGAtQgUAAFiLUAEAANYiVAAAgLUIFQAAYC1CBQAAWItQAQAA1iJUAACAtYy+hP6d+u1FdRMSEgwvAQAAt+u3r9u38+L42TpULl68KEny9fU1vAQAAPxVFy9eVOHChW95nWz9u35SU1N18uRJFSpUSC6Xy/QcZLKEhAT5+vrqxIkT/G4nIIfh/p27OI6jixcvysfHR3ny3PoolGz9iEqePHlUpkwZ0zOQxby9vfmHDMihuH/nHn/2SMpvOJgWAABYi1ABAADWIlSQbbi7u2v48OFyd3c3PQXAXcb9G38kWx9MCwAAcjYeUQEAANYiVAAAgLUIFQAAYC1CBdna0aNH5XK5FBsba3oKAAPKlSunSZMmmZ6BTESoIMt1795dLpdLffr0ueGyv/3tb3K5XOrevXvWDwNwS7/dd//37dChQ6anIQcjVGCEr6+v5s2bp6tXr6adl5iYqLlz5+r+++83uAzArTRr1kynTp1K9/bAAw+YnoUcjFCBEcHBwbr//vu1cOHCtPMWLlwoX19fBQUFpZ23fPly1alTR0WKFFGxYsX0xBNP6PDhw7f82Hv37lXz5s3l5eWlUqVKqWvXrjpz5kymfS5AbuLu7q5777033Zubm5u++OILhYSEyMPDQw8++KBGjBih5OTktPdzuVyKjo7WE088IU9PT1WsWFGbNm3SoUOH1KBBAxUsWFC1atVKd/8+fPiwWrVqpVKlSsnLy0vVq1fXN998c8t9Fy5c0AsvvKCSJUvK29tbDRs21M6dOzPtzwOZj1CBMT169NCsWbPSTn/wwQfq2bNnuutcvnxZkZGR2rp1q1atWqU8efKoTZs2Sk1NvenHPHXqlOrXr6/AwEBt27ZNy5cv188//6ynn346Uz8XIDdbsWKFnnnmGb388svau3evoqOjNXv2bI0ePTrd9d588009++yzio2Nlb+/vzp37qzevXtr8ODB2rZtmyTppZdeSrv+pUuX1Lx5c33zzTfasWOHmjZtqpYtW+r48eM33eE4jlq0aKH4+HgtXbpUMTExCg4OVqNGjXT27NnM+wNA5nKALNatWzenVatWzi+//OK4u7s7R44ccY4ePep4eHg4v/zyi9OqVSunW7duN33f06dPO5Kc3bt3O47jOEeOHHEkOTt27HAcx3Fef/11p0mTJune58SJE44k58CBA5n5aQE5Xrdu3Rw3NzenYMGCaW/t2rVz6tat6/zjH/9Id92PPvrIKV26dNppSc7QoUPTTm/atMmR5MycOTPtvLlz5zoeHh633FCpUiVnypQpaafLli3rTJw40XEcx1m1apXj7e3tJCYmpnufhx56yImOjv7Lny/skK1/ezKyt+LFi6tFixb68MMP074TKl68eLrrHD58WK+//ro2b96sM2fOpD2Scvz4cVWpUuWGjxkTE6Nvv/1WXl5eN1x2+PBh+fn5Zc4nA+QSjz32mKZPn552umDBgipfvry2bt2a7hGUlJQUJSYm6sqVK/L09JQkVa1aNe3yUqVKSZICAgLSnZeYmKiEhAR5e3vr8uXLGjFihL788kudPHlSycnJunr16h8+ohITE6NLly6pWLFi6c6/evXqnz5lDHsRKjCqZ8+eaQ/1vvPOOzdc3rJlS/n6+uq9996Tj4+PUlNTVaVKFV27du2mHy81NVUtW7bU2LFjb7isdOnSd3c8kAv9Fia/l5qaqhEjRuipp5664foeHh5p/50vX760/3a5XH943m/fkLz66qtasWKFxo8fr/Lly6tAgQJq167dLe//pUuX1po1a264rEiRIrf3CcI6hAqMatasWdo/Ok2bNk132a+//qp9+/YpOjpadevWlSRt2LDhlh8vODhYCxYsULly5ZQ3L3+9gawQHBysAwcO3BAw
d2r9+vXq3r272rRpI+k/x6wcPXr0ljvi4+OVN29elStX7q5ugTkcTAuj3NzctG/fPu3bt09ubm7pLitatKiKFSumGTNm6NChQ1q9erUiIyNv+fHCw8N19uxZderUSVu2bNEPP/ygr7/+Wj179lRKSkpmfipArjVs2DDNmTNHb7zxhvbs2aN9+/Zp/vz5Gjp06B193PLly2vhwoWKjY3Vzp071blz5z88kF6SHn/8cdWqVUutW7fWihUrdPToUX333XcaOnRo2sG6yH4IFRjn7e0tb2/vG87PkyeP5s2bp5iYGFWpUkX9+/fXP//5z1t+LB8fH23cuFEpKSlq2rSpqlSpon79+qlw4cLKk4e/7kBmaNq0qb788kutXLlS1atXV82aNTVhwgSVLVv2jj7uxIkTVbRoUdWuXVstW7ZU06ZNFRwc/IfXd7lcWrp0qerVq6eePXvKz89PHTt21NGjR9OOiUH243IcxzE9AgAA4Gb4FhMAAFiLUAEAANYiVAAAgLUIFQAAYC1CBQAAWItQAQAA1iJUAACAtQgVAABgLUIFQLbUvXt3tW7d2vQMAJmMUAEAANYiVADkSo7jKDk52fQMAH+CUAFwRy5evKguXbqoYMGCKl26tCZOnKgGDRooIiJCknTt2jUNHDhQ9913nwoWLKgaNWpozZo1ae8/e/ZsFSlSRCtWrFDFihXl5eWlZs2a6dSpU2nXSUlJUWRkpIoUKaJixYpp4MCB+t9fU+Y4jsaNG6cHH3xQBQoUULVq1fT555+nXb5mzRq5XC6tWLFCoaGhcnd31/r16zP1zwbAnSNUANyRyMhIbdy4UUuWLNHKlSu1fv16bd++Pe3yHj16aOPGjZo3b5527dql9u3bq1mzZoqLi0u7zpUrVzR+/Hh99NFHWrdunY4fP64BAwakXf7WW2/pgw8+0MyZM7VhwwadPXtWixYtSrdj6NChmjVrlqZPn649e/aof//+euaZZ7R27dp01xs4cKCioqK0b98+Va1aNZP+VADcNQ4AZFBCQoKTL18+57PPPks77/z5846np6fTr18/59ChQ47L5XJ++umndO/XqFEjZ/DgwY7jOM6sWbMcSc6hQ4fSLn/nnXecUqVKpZ0uXbq0M2bMmLTT169fd8qUKeO0atXKcRzHuXTpkuPh4eF899136W6nV69eTqdOnRzHcZxvv/3WkeQsXrz47nzyALJEXtOhBCD7+uGHH3T9+nU98sgjaecVLlxYDz/8sCRp+/btchxHfn5+6d4vKSlJxYoVSzvt6emphx56KO106dKldfr0aUnShQsXdOrUKdWqVSvt8rx58yo0NDTt6Z+9e/cqMTFRjRs3Tnc7165dU1BQULrzQkND7+RTBpDFCBUAGfZbKLhcrpuen5qaKjc3N8XExMjNzS3ddby8vNL+O1++fOkuc7lcNxyDciupqamSpK+++kr33Xdfusvc3d3TnS5YsOBtf1wA5hEqADLsoYceUr58+bRlyxb5+vpKkhISEhQXF6f69esrKChIKSkpOn36tOrWrZuh2yhcuLBKly6tzZs3q169epKk5ORkxcTEKDg4WJJUqVIlubu76/jx46pfv/7d+eQAWIFQAZBhhQoVUrdu3fTqq6/qnnvuUcmSJTV8+HDlyZNHLpdLfn5+6tKli5599lm99dZbCgoK0pkzZ7R69WoFBASoefPmt3U7/fr105gxY1ShQgVVrFhREyZM0Pnz59PtGDBggPr376/U1FTVqVNHCQkJ+u677+Tl5aVu3bpl0p8AgMxGqAC4IxMmTFCfPn30xBNPyNvbWwMHDtSJEyfk4eEhSZo1a5ZGjRqlV155RT/99JOKFSumWrVq3XakSNIrr7yiU6dOqXv37sqTJ4969uypNm3a6MKFC2nXefPNN1WyZElFRUXphx9+UJEiRRQcHKzXXnvtrn/OALKOy/krTwQDwJ+4fPmy7rvvPr311lvq1auX6TkAsjkeUQFwR3bs2KH9+/frkUce
0YULFzRy5EhJUqtWrQwvA5ATECoA7tj48eN14MAB5c+fXyEhIVq/fr2KFy9uehaAHICnfgAAgLV4CX0AAGAtQgUAAFiLUAEAANYiVAAAgLUIFQAAYC1CBQAAWItQAQAA1iJUAACAtf4PlKAnkKFbs8IAAAAASUVORK5CYII=", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "males = nutri[nutri.gender == 'Male']\n", "females = nutri[nutri.gender == 'Female']\n", "plt.boxplot([males.coffee, females.coffee], notch =True , widths=(0.5 ,0.5))\n", "plt.xlabel ('gender')\n", "plt.ylabel ('coffee')\n", "plt.xticks ([1 ,2] ,['Male','Female'])\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "we can also show the distribution of the data using a histogram, after breaking the data into *bins* or *classes*" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.hist(nutri.age,bins=9,facecolor='green',edgecolor='black',linewidth=1)\n", "plt.xlabel('Age')\n", "plt.ylabel('Quantity')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "alternatively, we can weigth the values to show $\\frac{counts}{total}$ by using the trick of multiplying all values times the quantity $1/266$." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "weigths = np.ones_like(nutri.age)/nutri.age.count()\n", "plt.hist(nutri.age,bins=9,weights=weigths,facecolor='green',edgecolor='black',linewidth=1)\n", "plt.xlabel('Age')\n", "plt.ylabel('Proportion of total')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The *empirical cumulative distribution function*, $F_n$, is a step function that jupms $k/n$ at observation values, where $k$ is the fraction of tied observations at that value:\n", "$$F_n(x)=\\frac{\\mathrm{number \\; of \\; }x_i \\leq x}{n}$$\n", "The function can be defined and plotted in the same way for both discrete and continuous data." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "x = np.sort(nutri.age)\n", "y = np.linspace(0,1,len(nutri.age))\n", "plt.xlabel('Age')\n", "plt.ylabel('$F_n(x)$')\n", "plt.step(x,y)\n", "plt.xlim(x.min(),x.max())\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let us have a look at a plot that takes into account the contingency table shown above. We will make use now of the `seaborn` package, which simplifies the plotting of statistical information.\n", "\n", "TODO: add examples with sns.histplot()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import seaborn as sns\n", "sns.countplot(x='situation',hue='gender',data=nutri,hue_order=['Male','Female'],palette=['green','red'],saturation=1,edgecolor='black')\n", "plt.legend(loc='upper right')\n", "plt.xlabel('Situation')\n", "plt.ylabel('Counts')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Scatter plots* are useful to visualize patterns between quantitative features." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.scatter(nutri.height,nutri.weight,s=12,marker='o')\n", "plt.xlabel('height')\n", "plt.ylabel('weight')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can get more sophisticated plots and fit some lines to the data to visualy compare data. In this example, we will use data from Vincent Arel Bundock repo on birth weights of babies for smoking and non smoking mothers, and plot the data againts the age of the mother. We will deal with significance of the results in upcoming sessions." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>low</th><th>age</th><th>lwt</th><th>race</th><th>...</th><th>ht</th><th>ui</th><th>ftv</th><th>bwt</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>0</td><td>19</td><td>182</td><td>2</td><td>...</td><td>0</td><td>1</td><td>0</td><td>2523</td></tr>\n",
"    <tr><th>1</th><td>0</td><td>33</td><td>155</td><td>3</td><td>...</td><td>0</td><td>0</td><td>3</td><td>2551</td></tr>\n",
"    <tr><th>2</th><td>0</td><td>20</td><td>105</td><td>1</td><td>...</td><td>0</td><td>0</td><td>1</td><td>2557</td></tr>\n",
"    <tr><th>3</th><td>0</td><td>21</td><td>108</td><td>1</td><td>...</td><td>0</td><td>1</td><td>2</td><td>2594</td></tr>\n",
"    <tr><th>4</th><td>0</td><td>18</td><td>107</td><td>1</td><td>...</td><td>0</td><td>1</td><td>0</td><td>2600</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>184</th><td>1</td><td>28</td><td>95</td><td>1</td><td>...</td><td>0</td><td>0</td><td>2</td><td>2466</td></tr>\n",
"    <tr><th>185</th><td>1</td><td>14</td><td>100</td><td>3</td><td>...</td><td>0</td><td>0</td><td>2</td><td>2495</td></tr>\n",
"    <tr><th>186</th><td>1</td><td>23</td><td>94</td><td>3</td><td>...</td><td>0</td><td>0</td><td>0</td><td>2495</td></tr>\n",
"    <tr><th>187</th><td>1</td><td>17</td><td>142</td><td>2</td><td>...</td><td>1</td><td>0</td><td>0</td><td>2495</td></tr>\n",
"    <tr><th>188</th><td>1</td><td>21</td><td>130</td><td>1</td><td>...</td><td>1</td><td>0</td><td>3</td><td>2495</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>189 rows × 10 columns</p>\n",
"</div>
" ], "text/plain": [ " low age lwt race ... ht ui ftv bwt\n", "0 0 19 182 2 ... 0 1 0 2523\n", "1 0 33 155 3 ... 0 0 3 2551\n", "2 0 20 105 1 ... 0 0 1 2557\n", "3 0 21 108 1 ... 0 1 2 2594\n", "4 0 18 107 1 ... 0 1 0 2600\n", ".. ... ... ... ... ... .. .. ... ...\n", "184 1 28 95 1 ... 0 0 2 2466\n", "185 1 14 100 3 ... 0 0 2 2495\n", "186 1 23 94 3 ... 0 0 0 2495\n", "187 1 17 142 2 ... 1 0 0 2495\n", "188 1 21 130 1 ... 1 0 3 2495\n", "\n", "[189 rows x 10 columns]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "urlprefix = 'http://vincentarelbundock.github.io/Rdatasets/csv/'\n", "dataname = 'MASS/birthwt.csv'\n", "bwt = pd.read_csv(urlprefix + dataname)\n", "bwt = bwt.drop('rownames' ,axis=1) #drop unnamed column\n", "bwt\n" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "styles = {0: ['o','red'], 1: ['^','blue']}\n", "for k in styles:\n", " grp = bwt[bwt.smoke == k]\n", " m,b = np.polyfit(grp.age , grp.bwt , 1) # fit a straight line\n", " plt.scatter(grp.age , grp.bwt , c= styles[k][1] , s=15 , linewidth =0,\n", " marker = styles[k][0])\n", " plt.plot(grp.age , m*grp.age + b, '-', color = styles[k][1])\n", "\n", "plt.xlabel('age')\n", "plt.ylabel('birth weight (g)')\n", "plt.legend(['non - smokers','smokers'],prop={'size':8},loc=(0.7 ,0.3))\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{bibliography}\n", ":style: unsrt\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 4 }