{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# [UNIT 1. Import and Visualization of data with Python](#toc0_)\n",
"\n",
"This unit presents some quick ways to represent data in Python, closely following {cite:p}`kroese2020` and the corresponding [GitHub repo](https://github.com/DSML-book/).\n",
"\n",
"NOTE: since this is a training tool, we assume the user has installed all the required packages. If a package is missing, install it separately from a terminal using your favorite package manager.\n",
"\n",
"These notebooks have been tested on Ubuntu Linux, in most cases using Anaconda as the package manager. The notebooks are self-contained."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Table of contents** \n",
"- [UNIT 1. Import and Visualization of data with Python](#toc1_) \n",
"  - [Retrieving data](#toc1_1_) \n",
"  - [Reformatting tables](#toc1_2_) \n",
"  - [Structuring features](#toc1_3_) \n",
"  - [Summary tables and statistics](#toc1_4_) \n",
"  - [Data visualization](#toc1_5_) \n",
"    - [Qualitative variables](#toc1_5_1_) \n",
"    - [Quantitative variables](#toc1_5_2_) \n",
"\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [Retrieving data](#toc0_)\n",
"\n",
"Data files to be analyzed are typically stored in comma-separated values (CSV) format. To work with them, we first need to download the data. Before doing so, we make sure there is a folder to download the file into."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"folder datasets exists\n",
"folder output exists\n",
"File ‘datasets/wine.zip’ already there; not retrieving.\n",
"\n"
]
}
],
"source": [
"import os\n",
"\n",
"input_directory = 'datasets'\n",
"output_directory = 'output'\n",
"\n",
"# Create each directory if it is missing; otherwise report that it exists\n",
"for directory in (input_directory, output_directory):\n",
"    if not os.path.exists(directory):\n",
"        os.makedirs(directory)\n",
"    else:\n",
"        print(f'folder {directory} exists')\n",
"\n",
"# Use wget to download the file into the datasets folder\n",
"# (-nc skips the download if the file already exists)\n",
"!wget -P $input_directory -nc https://archive.ics.uci.edu/static/public/109/wine.zip"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we extract the contents of the zip file with the `zipfile` module and read the data with `pandas`. Information about the data can be found in the [ML repository at UCI](https://archive.ics.uci.edu/dataset/109/wine)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" 0 1 2 3 4 5 6 7 8 9 10 11 12 \\\n",
"0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 \n",
"1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 \n",
"2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 \n",
"3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 \n",
"4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 \n",
"\n",
" 13 \n",
"0 1065 \n",
"1 1050 \n",
"2 1185 \n",
"3 1480 \n",
"4 735 "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"from zipfile import ZipFile\n",
"\n",
"# Extract only the files we need into the datasets folder\n",
"with ZipFile('datasets/wine.zip', 'r') as f:\n",
"    f.extractall(input_directory, members=['wine.names', 'wine.data'])\n",
"\n",
"# The file has no header row, so pandas numbers the columns\n",
"wine = pd.read_csv('datasets/wine.data', header=None)\n",
"wine.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As the file does not include column names, we can assign them manually based on the information on the dataset's web page."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" class Alcohol Maliacid Ash Alcalinity_of_ash Magnesium \\\n",
"0 1 14.23 1.71 2.43 15.6 127 \n",
"1 1 13.20 1.78 2.14 11.2 100 \n",
"2 1 13.16 2.36 2.67 18.6 101 \n",
"3 1 14.37 1.95 2.50 16.8 113 \n",
"4 1 13.24 2.59 2.87 21.0 118 \n",
"\n",
" Total_phenols Flavanoids Nonflavonoid_phenols Proanthocyanins \\\n",
"0 2.80 3.06 0.28 2.29 \n",
"1 2.65 2.76 0.26 1.28 \n",
"2 2.80 3.24 0.30 2.81 \n",
"3 3.85 3.49 0.24 2.18 \n",
"4 2.80 2.69 0.39 1.82 \n",
"\n",
" Color_intensity Hue 0D280_0D315_of_diluted_wines Proline \n",
"0 5.64 1.04 3.92 1065 \n",
"1 4.38 1.05 3.40 1050 \n",
"2 5.68 1.03 3.17 1185 \n",
"3 7.80 0.86 3.45 1480 \n",
"4 4.32 1.04 2.93 735 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wine.columns = ['class','Alcohol','Maliacid','Ash','Alcalinity_of_ash','Magnesium','Total_phenols','Flavanoids','Nonflavonoid_phenols','Proanthocyanins','Color_intensity','Hue','0D280_0D315_of_diluted_wines','Proline']\n",
"wine.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, we can read the CSV directly from its URL, without downloading it first. We will use [Fisher's `iris` data](https://vincentarelbundock.github.io/Rdatasets/doc/datasets/iris.html). Here the file already contains the names of the columns (features), so we read them directly from the dataset.\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n",
"0 5.1 3.5 1.4 0.2 setosa\n",
"1 4.9 3.0 1.4 0.2 setosa\n",
"2 4.7 3.2 1.3 0.2 setosa\n",
"3 4.6 3.1 1.5 0.2 setosa\n",
"4 5.0 3.6 1.4 0.2 setosa\n",
".. ... ... ... ... ...\n",
"145 6.7 3.0 5.2 2.3 virginica\n",
"146 6.3 2.5 5.0 1.9 virginica\n",
"147 6.5 3.0 5.2 2.0 virginica\n",
"148 6.2 3.4 5.4 2.3 virginica\n",
"149 5.9 3.0 5.1 1.8 virginica\n",
"\n",
"[150 rows x 5 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
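],
"source": [
"iris = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv')  # assumed CSV location on the Rdatasets site\n",
"iris.drop('rownames', axis=1)  # drop the row-label column added by Rdatasets"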
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [Reformatting tables](#toc0_)\n",
"\n",
"Sometimes the data comes in formats that are not convenient for analysis.\n",
"\n",
"For example, consider the following table of scores, with values obtained before and after some particular training:\n",
"\n",
"| Student | Before | After |\n",
"|---|---|---|\n",
"| 1 | 75 | 85 |\n",
"| 2 | 30 | 50 |\n",
"| 3 | 100 | 100 |\n",
"| 4 | 50 | 52 |\n",
"| 5 | 60 | 65 |\n",
"\n",
"We want to reformat it so that all the scores are in a single column. Let us create a table with three features: *Score* (continuous data), *Time* (before or after) and *Student* (integer values from 1 to 5). We will use the `pandas.melt` function, [specifically devoted to this goal](https://pandas.pydata.org/docs/reference/api/pandas.melt.html). By default, `melt` names the score column `value`.\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Student Time value\n",
"0 1 Before 75\n",
"1 2 Before 30\n",
"2 3 Before 100\n",
"3 4 Before 50\n",
"4 5 Before 60\n",
"5 1 After 85\n",
"6 2 After 50\n",
"7 3 After 100\n",
"8 4 After 52\n",
"9 5 After 65\n"
]
}
],
"source": [
"import pandas as pd\n",
"\n",
"# Manually create a dataframe with the data from the table\n",
"values = [[1, 75, 85], [2, 30, 50], [3, 100, 100], [4, 50, 52], [5, 60, 65]]\n",
"df = pd.DataFrame(values, columns=['Student', 'Before', 'After'])\n",
"\n",
"# Reshape from wide to long format: one row per (Student, Time) pair\n",
"df = pd.melt(df, id_vars=['Student'], var_name='Time', value_vars=['Before', 'After'])\n",
"print(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [Structuring features](#toc0_)\n",
"\n",
"Types of features:\n",
"* Quantitative (with discrete or continuous values)\n",
"* Qualitative: they can be divided into a fixed number of categories, which is why they are referred to as **categorical** variables or **factors**.\n",
"\n",
"Here we will work with data from {cite:p}`lafaye_de_micheaux_r_2013`. In particular, we will download the file containing nutritional measurements of thirteen features (columns) for 226 elderly individuals (rows). Note that the [file](http://www.biostatisticien.eu/springeR/nutrition_elderly.xls) is now an Excel file."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" gender situation tea coffee ... raw_fruit cooked_fruit_veg chocol \\\n",
"0 2 1 0 0 ... 1 4 5 \n",
"1 2 1 1 1 ... 5 5 1 \n",
"2 2 1 0 4 ... 5 2 5 \n",
"3 2 1 0 0 ... 4 0 3 \n",
"4 2 1 2 1 ... 5 5 3 \n",
"\n",
" fat \n",
"0 6 \n",
"1 4 \n",
"2 4 \n",
"3 2 \n",
"4 2 \n",
"\n",
"[5 rows x 13 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"xls = 'http://www.biostatisticien.eu/springeR/nutrition_elderly.xls'\n",
"nutri = pd.read_excel(xls)\n",
"\n",
"pd.set_option('display.max_columns', 8)\n",
"nutri.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To check the structure of the data we can use the `info` method of the `pandas` dataframe; its output matches the description given in the original source.\n",
"\n",
"Note that the features can be classified as\n",
"\n",
"* qualitative\n",
"  * ordinal (meat, fish, raw_fruit, cooked_fruit_veg, chocol)\n",
"  * nominal (gender, situation, fat)\n",
"* quantitative\n",
"  * discrete (tea, coffee)\n",
"  * continuous (height, weight, age)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 226 entries, 0 to 225\n",
"Data columns (total 13 columns):\n",
" # Column Non-Null Count Dtype\n",
"--- ------ -------------- -----\n",
" 0 gender 226 non-null int64\n",
" 1 situation 226 non-null int64\n",
" 2 tea 226 non-null int64\n",
" 3 coffee 226 non-null int64\n",
" 4 height 226 non-null int64\n",
" 5 weight 226 non-null int64\n",
" 6 age 226 non-null int64\n",
" 7 meat 226 non-null int64\n",
" 8 fish 226 non-null int64\n",
" 9 raw_fruit 226 non-null int64\n",
" 10 cooked_fruit_veg 226 non-null int64\n",
" 11 chocol 226 non-null int64\n",
" 12 fat 226 non-null int64\n",
"dtypes: int64(13)\n",
"memory usage: 23.1 KB\n"
]
}
],
"source": [
"nutri.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us now change the values and types of the different features with the `replace` and `astype` methods, and finally save the data in CSV format."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" gender situation tea coffee ... raw_fruit \\\n",
"0 Female single 0 0 ... less than once a week \n",
"1 Female single 1 1 ... every day \n",
"2 Female single 0 4 ... every day \n",
"3 Female single 0 0 ... 4-6 times a week \n",
"4 Female single 2 1 ... every day \n",
"\n",
" cooked_fruit_veg chocol fat \n",
"0 4-6 times a week every day Mix of vegetable oils \n",
"1 every day less than once a week Sunflower oil \n",
"2 once a week every day Sunflower oil \n",
"3 never 2-3 times a week Margarine \n",
"4 every day 2-3 times a week Margarine \n",
"\n",
"[5 rows x 13 columns]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# gender, situation, meat, fish, raw_fruit, cooked_fruit_veg, chocol and fat are categorical features\n",
"DICT = {1:'Male',2:'Female'}\n",
"nutri['gender'] = nutri['gender'].replace(DICT).astype('category')\n",
"\n",
"DICT = {1:'single',2:'couple',3:'family',4:'other'}\n",
"nutri['situation'] = nutri['situation'].replace(DICT).astype('category')\n",
"\n",
"DICT = {0:'never',1:'less than once a week',2:'once a week',3:'2-3 times a week',4:'4-6 times a week', 5:'every day'}\n",
"nutri['meat'] = nutri['meat'].replace(DICT).astype('category')\n",
"nutri['fish'] = nutri['fish'].replace(DICT).astype('category')\n",
"nutri['raw_fruit'] = nutri['raw_fruit'].replace(DICT).astype('category')\n",
"nutri['cooked_fruit_veg'] = nutri['cooked_fruit_veg'].replace(DICT).astype('category')\n",
"nutri['chocol'] = nutri['chocol'].replace(DICT).astype('category')\n",
"\n",
"DICT = {1:'Butter',2:'Margarine',3:'Peanut oil', 4:'Sunflower oil', 5:'Olive oil', 6:'Mix of vegetable oils', 7:'Colza oil',8:'Duck or goose fat'}\n",
"nutri['fat'] = nutri['fat'].replace(DICT).astype('category')\n",
"\n",
"# tea and coffee are integers\n",
"nutri['tea'] = nutri['tea'].astype(int)\n",
"nutri['coffee'] = nutri['coffee'].astype(int)\n",
"\n",
"# height, weight and age are floats\n",
"nutri['height'] = nutri['height'].astype(float)\n",
"nutri['weight'] = nutri['weight'].astype(float)\n",
"nutri['age'] = nutri['age'].astype(float)\n",
"\n",
"nutri.head()\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 226 entries, 0 to 225\n",
"Data columns (total 13 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 gender 226 non-null category\n",
" 1 situation 226 non-null category\n",
" 2 tea 226 non-null int64 \n",
" 3 coffee 226 non-null int64 \n",
" 4 height 226 non-null float64 \n",
" 5 weight 226 non-null float64 \n",
" 6 age 226 non-null float64 \n",
" 7 meat 226 non-null category\n",
" 8 fish 226 non-null category\n",
" 9 raw_fruit 226 non-null category\n",
" 10 cooked_fruit_veg 226 non-null category\n",
" 11 chocol 226 non-null category\n",
" 12 fat 226 non-null category\n",
"dtypes: category(8), float64(3), int64(2)\n",
"memory usage: 12.4 KB\n"
]
}
],
"source": [
"\n",
"nutri.info()\n",
"\n",
"nutri.to_csv('output/nutri.csv',index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [Summary tables and statistics](#toc0_)\n",
"\n",
"It is extremely important to know your data before any analysis. Apart from the tools shown above, descriptive statistics can be obtained from pandas with two simple methods: `describe` and `value_counts`. Note that, when applied to a single column, both return a pandas *Series*, a one-dimensional array with axis labels."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 226\n",
"unique 8\n",
"top Sunflower oil\n",
"freq 68\n",
"Name: fat, dtype: object"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nutri = pd.read_csv('output/nutri.csv')\n",
"nutri['fat'].describe()\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"fat\n",
"Sunflower oil 68\n",
"Peanut oil 48\n",
"Olive oil 40\n",
"Margarine 27\n",
"Mix of vegetable oils 23\n",
"Butter 15\n",
"Duck or goose fat 4\n",
"Colza oil 1\n",
"Name: count, dtype: int64"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nutri['fat'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also generate a contingency table by *cross tabulating* two or more variables:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"situation couple family single All\n",
"gender \n",
"Female 56 7 78 141\n",
"Male 63 2 20 85\n",
"All 119 9 98 226"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(nutri.gender,nutri.situation,margins=True) # the margins argument adds the row/column totals"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now turn to descriptive statistics such as the *sample mean*, $\\bar{x}$:\n",
"$$\\bar{x}=\\frac{1}{n}\\sum_{i=1}^n x_i$$\n"
]
},
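{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check of the formula, consider a small made-up sample (illustrative values, not taken from the dataset), $\\mathbf{x} = (2, 4, 9)$ with $n = 3$:\n",
"$$\\bar{x}=\\frac{1}{3}(2+4+9)=\\frac{15}{3}=5$$"
]
},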
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"163.96017699115043"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nutri.height.mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"and the *p-sample quantile* of $\\mathbf{x}$, with $0 < p < 1$: a value $x$ such that at least a fraction $p$ of the data is less than or equal to $x$ and at least a fraction $1-p$ of the data is greater than or equal to $x$.\n",
"\n",
"## [Data visualization](#toc0_)\n",
"\n",
"The appropriate visualization depends on the data type. To use Python's plotting capabilities, we first import the `matplotlib.pyplot` module, as well as `numpy`, in addition to the already imported `pandas` module."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### [Qualitative variables](#toc0_)\n",
"\n",
"Here we can show the data in a simple bar plot. Since the category (x-axis) is not a numerical value, we must place the different categories in the plot manually."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAigAAAGdCAYAAAA44ojeAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8pXeV/AAAACXBIWXMAAA9hAAAPYQGoP6dpAAAjCklEQVR4nO3df1BVZeLH8c9V7AoKKJD3eidQLFJbzBLNCW3BVMzSdN3S1DVcs2w1E7XFGNT42ggjrUYjm7u6jdIPs2033aYfJqlp6maIuuWPdDUKMokwFlQQSM73j8a7ewVT9CIP+H7NnJk95zz3nOe0x3x37oVrsyzLEgAAgEFaNPYEAAAAzkegAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADCOT2NP4HLU1NTo22+/lb+/v2w2W2NPBwAAXALLsnTy5Em5XC61aPHzz0iaZKB8++23Cg0NbexpAACAy1BQUKAbbrjhZ8c0yUDx9/eX9NMFBgQENPJsAADApSgrK1NoaKj77/Gf0yQD5dzbOgEBAQQKAABNzKV8PIMPyQIAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwTr0DZevWrRo+fLhcLpdsNpvWrVvn3lddXa05c+aoR48eatOmjVwulx5++GF9++23HseorKzU9OnTFRISojZt2uj+++/XN998c8UXAwAAmod6B8rp06fVs2dPZWZm1tpXXl6u3bt3a968edq9e7feeustHT58WPfff7/HuISEBK1du1Zr1qzRtm3bdOrUKQ0bNkxnz569/CsBAADNhs2yLOuyX2yzae3atRo5cuQFx+Tk5OiOO+7Q119/rbCwMJWWlur666/XK6+8ojFjxkj673frvPfeexoyZMhFz1tWVqbAwECVlpbym2QBAGgi6vP3d4N/BqW0tFQ2m03t2rWTJOXm5qq6ulpxcXHuMS6XS5GRkdqxY0edx6isrFRZWZnHAgAAmq8GDZQzZ87o6aef1rhx49ylVFhYqOuuu07t27f3GOtwOFRYWFjncdLS0hQYGOhe+CZjAACatwYLlOrqaj300EOqqanRiy++eNHxlmVd8MuDkpKSVFpa6l4KCgq8PV0AAGCQBvk24+rqao0ePVp5eXnatGmTx/tMTqdTVVVVKikp8XiKUlRUpOjo6DqPZ7fbZbfbG2KqdcrPz1dxcfFVO19zFBISorCwsMaeBgCgifJ6oJyLk3//+9/avHmzgoODPfZHRUWpVatWys7O1ujRoyVJx48f1759+5Senu7t6dRbfn6+unbrrjMV5Y09lSatta+fDn1xkEgBAFyWegfKqVOndOTIEfd6Xl6e9u7dq6CgILlcLj3wwAPavXu33nnnHZ09e9b9uZKgoCBdd911CgwM1COPPKLZs2crODhYQUFBeuqpp9SjRw8NGjTIe1d2mYqLi3WmolzBw2arVTCfdbkc1ScKdOKdxSouLiZQAACXpd6BsmvXLg0YMMC9PmvWLElSfHy8UlJS9Pbbb0uSbrvtNo/Xbd68WbGxsZKk559/Xj4+Pho9erQqKio0cOBArVq1Si1btrzMy/C+VsGhsjtvauxpAABwTap3oMTGxurnfnXKpfxaldatW2vp0qVaunRpfU8PAACuAXwXDwAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAA
DjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDj1DpStW7dq+PDhcrlcstlsWrduncd+y7KUkpIil8slX19fxcbGav/+/R5jKisrNX36dIWEhKhNmza6//779c0331zRhQAAgOaj3oFy+vRp9ezZU5mZmXXuT09P15IlS5SZmamcnBw5nU4NHjxYJ0+edI9JSEjQ2rVrtWbNGm3btk2nTp3SsGHDdPbs2cu/EgAA0Gz41PcFQ4cO1dChQ+vcZ1mWMjIylJycrFGjRkmSsrKy5HA4tHr1ak2ZMkWlpaV66aWX9Morr2jQoEGSpFdffVWhoaH68MMPNWTIkCu4HAAA0Bx49TMoeXl5KiwsVFxcnHub3W5XTEyMduzYIUnKzc1VdXW1xxiXy6XIyEj3mPNVVlaqrKzMYwEAAM2XVwOlsLBQkuRwODy2OxwO977CwkJdd911at++/QXHnC8tLU2BgYHuJTQ01JvTBgAAhmmQn+Kx2Wwe65Zl1dp2vp8bk5SUpNLSUvdSUFDgtbkCAADzeDVQnE6nJNV6ElJUVOR+quJ0OlVVVaWSkpILjjmf3W5XQECAxwIAAJovrwZKeHi4nE6nsrOz3duqqqq0ZcsWRUdHS5KioqLUqlUrjzHHjx/Xvn373GMAAMC1rd4/xXPq1CkdOXLEvZ6Xl6e9e/cqKChIYWFhSkhIUGpqqiIiIhQREaHU1FT5+flp3LhxkqTAwEA98sgjmj17toKDgxUUFKSnnnpKPXr0cP9UDwAAuLbVO1B27dqlAQMGuNdnzZolSYqPj9eqVauUmJioiooKTZ06VSUlJerbt682bNggf39/92uef/55+fj4aPTo0aqoqNDAgQO1atUqtWzZ0guXBOBi8vPzVVxc3NjTaNJCQkIUFhbW2NMAmi2bZVlWY0+ivsrKyhQYGKjS0lKvfx5l9+7dioqKkjM+Q3bnTV499rWisvCICrMSlJubq169ejX2dHCe/Px8de3WXWcqyht7Kk1aa18/HfriIJEC1EN9/v6u9xMUAE1bcXGxzlSUK3jYbLUK5kf2L0f1iQKdeGexiouLCRSggRAowDWqVXAoTwkBGItvMwYAAMYhUAAAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMYhUAAAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMYhUAAAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMYhUAAAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxvB4oP/74o+bOnavw8HD5+vqqS5cuWrBggWpqatxjLMtSSkqKXC6XfH19FRsbq/3793t7KgAAoInyeqAsWrRIf/rTn5SZmamDBw8qPT1dzz33nJYuXeoek56eriVLligzM1M5OTlyOp0aPHiwTp486e3pAACAJsjrgfLPf/5TI0aM0H333afOnTvrgQceUFxcnHbt2iXpp6cnGRkZSk5O1qhRoxQZGamsrCyVl5dr9erV3p4OAABogrweKP3799fGjRt1+PBhSdK//vUvbdu2Tffee68kKS8vT4WFhYqLi3O/xm63KyYmRjt27PD2dAAAQBPk4+0DzpkzR6WlperWrZtatmyps2fPauHChRo7dqwkqbCwUJLkcDg8XudwOPT111/XeczKykpVVla618vKyrw9bQAAYBCvP0F544039Oqrr2r16tXavXu3srKy9Ic//EFZWVke42w2m8e6ZVm1tp2TlpamwMBA9xIaGurtaQMAAIN4PVB+//vf6+mnn9ZDDz2kHj16aMKECZo5c6bS0tIkSU6nU9J/n6ScU1RUVO
upyjlJSUkqLS11LwUFBd6eNgAAMIjXA6W8vFwtWngetmXLlu4fMw4PD5fT6VR2drZ7f1VVlbZs2aLo6Og6j2m32xUQEOCxAACA5svrn0EZPny4Fi5cqLCwMP3iF7/Qnj17tGTJEk2aNEnST2/tJCQkKDU1VREREYqIiFBqaqr8/Pw0btw4b08HAAA0QV4PlKVLl2revHmaOnWqioqK5HK5NGXKFM2fP989JjExURUVFZo6dapKSkrUt29fbdiwQf7+/t6eDgAAaIK8Hij+/v7KyMhQRkbGBcfYbDalpKQoJSXF26cHAADNAN/FAwAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjNMggXLs2DH95je/UXBwsPz8/HTbbbcpNzfXvd+yLKWkpMjlcsnX11exsbHav39/Q0wFAAA0QV4PlJKSEvXr10+tWrXS+++/rwMHDmjx4sVq166de0x6erqWLFmizMxM5eTkyOl0avDgwTp58qS3pwMAAJogH28fcNGiRQoNDdXKlSvd2zp37uz+35ZlKSMjQ8nJyRo1apQkKSsrSw6HQ6tXr9aUKVO8PSUAANDEeP0Jyttvv63evXvrwQcfVIcOHXT77bdrxYoV7v15eXkqLCxUXFyce5vdbldMTIx27NhR5zErKytVVlbmsQAAgObL64Hy5ZdfatmyZYqIiNAHH3ygxx9/XE8++aRefvllSVJhYaEkyeFweLzO4XC4950vLS1NgYGB7iU0NNTb0wYAAAbxeqDU1NSoV69eSk1N1e23364pU6bo0Ucf1bJlyzzG2Ww2j3XLsmptOycpKUmlpaXupaCgwNvTBgAABvF6oHTs2FG33HKLx7bu3bsrPz9fkuR0OiWp1tOSoqKiWk9VzrHb7QoICPBYAABA8+X1QOnXr58OHTrkse3w4cPq1KmTJCk8PFxOp1PZ2dnu/VVVVdqyZYuio6O9PR0AANAEef2neGbOnKno6GilpqZq9OjR+vTTT7V8+XItX75c0k9v7SQkJCg1NVURERGKiIhQamqq/Pz8NG7cOG9PBwAANEFeD5Q+ffpo7dq1SkpK0oIFCxQeHq6MjAyNHz/ePSYxMVEVFRWaOnWqSkpK1LdvX23YsEH+/v7eng4AAGiCvB4okjRs2DANGzbsgvttNptSUlKUkpLSEKcHAABNHN/FAwAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAME6DB0paWppsNpsSEhLc2yzLUkpKilwul3x9fRUbG6v9+/c39FQAAEAT0aCBkpOTo+XLl+vWW2/12J6enq4lS5YoMzNTOTk5cjqdGjx4sE6ePNmQ0wEAAE1Egw
XKqVOnNH78eK1YsULt27d3b7csSxkZGUpOTtaoUaMUGRmprKwslZeXa/Xq1Q01HQAA0IQ0WKBMmzZN9913nwYNGuSxPS8vT4WFhYqLi3Nvs9vtiomJ0Y4dO+o8VmVlpcrKyjwWAADQfPk0xEHXrFmj3bt3Kycnp9a+wsJCSZLD4fDY7nA49PXXX9d5vLS0NP3f//2f9ycKAACM5PUnKAUFBZoxY4ZeffVVtW7d+oLjbDabx7plWbW2nZOUlKTS0lL3UlBQ4NU5AwAAs3j9CUpubq6KiooUFRXl3nb27Flt3bpVmZmZOnTokKSfnqR07NjRPaaoqKjWU5Vz7Ha77Ha7t6cKAAAM5fUnKAMHDtTnn3+uvXv3upfevXtr/Pjx2rt3r7p06SKn06ns7Gz3a6qqqrRlyxZFR0d7ezoAAKAJ8voTFH9/f0VGRnpsa9OmjYKDg93bExISlJqaqoiICEVERCg1NVV+fn4aN26ct6cDAACaoAb5kOzFJCYmqqKiQlOnTlVJSYn69u2rDRs2yN/fvzGmAwAADHNVAuWjjz7yWLfZbEpJSVFKSsrVOD0AAGhi+C4eAABgHAIFAAAYh0ABAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMYhUAAAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMYhUAAAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMYhUAAAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMYhUAAAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcbweKGlpaerTp4/8/f3VoUMHjRw5UocOHfIYY1mWUlJS5HK55Ovrq9jYWO3fv9/bUwEAAE2U1wNly5YtmjZtmj755BNlZ2frxx9/VFxcnE6fPu0ek56eriVLligzM1M5OTlyOp0aPHiwTp486e3pAACAJsjH2wdcv369x/rKlSvVoUMH5ebm6pe//KUsy1JGRoaSk5M1atQoSVJWVpYcDodWr16tKVOmeHtKAACgiWnwz6CUlpZKkoKCgiRJeXl5KiwsVFxcnHuM3W5XTEyMduzYUecxKisrVVZW5rEAAIDmq0EDxbIszZo1S/3791dkZKQkqbCwUJLkcDg8xjocDve+86WlpSkwMNC9hIaGNuS0AQBAI2vQQHniiSf02Wef6fXXX6+1z2azeaxbllVr2zlJSUkqLS11LwUFBQ0yXwAAYAavfwblnOnTp+vtt9/W1q1bdcMNN7i3O51OST89SenYsaN7e1FRUa2nKufY7XbZ7faGmioAADCM15+gWJalJ554Qm+99ZY2bdqk8PBwj/3h4eFyOp3Kzs52b6uqqtKWLVsUHR3t7ekAAIAmyOtPUKZNm6bVq1frH//4h/z9/d2fKwkMDJSvr69sNpsSEhKUmpqqiIgIRUREKDU1VX5+fho3bpy3pwMAAJogrwfKsmXLJEmxsbEe21euXKmJEydKkhITE1VRUaGpU6eqpKREffv21YYNG+Tv7+/t6QAAgCbI64FiWdZFx9hsNqWkpCglJcXbpwcAAM0A38UDAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACM49PYEwAAwNvy8/NVXFzc2NNo0kJCQhQWFtZo5ydQAADNSn5+vrp2664zFeWNPZUmrbWvnw59cbDRIoVAAQA0K8XFxTpTUa7gYbPVKj
i0safTJFWfKNCJdxaruLiYQAEAwJtaBYfK7rypsaeBy8SHZAEAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMYhUAAAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYp1ED5cUXX1R4eLhat26tqKgoffzxx405HQAAYIhGC5Q33nhDCQkJSk5O1p49e3TXXXdp6NChys/Pb6wpAQAAQzRaoCxZskSPPPKIJk+erO7duysjI0OhoaFatmxZY00JAAAYwqcxTlpVVaXc3Fw9/fTTHtvj4uK0Y8eOWuMrKytVWVnpXi8tLZUklZWVeX1up06d+umchUdUU3XG68e/FlT/8I2kn/5ZNsT/R7gy3ONXjnvcbNzjV66h7vFzx7Is6+KDrUZw7NgxS5K1fft2j+0LFy60br755lrjn3nmGUsSCwsLCwsLSzNYCgoKLtoKjfIE5RybzeaxbllWrW2SlJSUpFmzZrnXa2pq9MMPPyg4OLjO8c1ZWVmZQkNDVVBQoICAgMaeDuB13ONo7q7le9yyLJ08eVIul+uiYxslUEJCQtSyZUsVFhZ6bC8qKpLD4ag13m63y263e2xr165dQ07ReAEBAdfcjY1rC/c4mrtr9R4PDAy8pHGN8iHZ6667TlFRUcrOzvbYnp2drejo6MaYEgAAMEijvcUza9YsTZgwQb1799add96p5cuXKz8/X48//nhjTQkAABii0QJlzJgxOnHihBYsWKDjx48rMjJS7733njp16tRYU2oS7Ha7nnnmmVpveQHNBfc4mjvu8Utjs6xL+VkfAACAq4fv4gEAAMYhUAAAgHEIFAAAYBwC5Rpgs9m0bt26xp4GrmETJ07UyJEjvXrMr776SjabTXv37vXqcYG6WJalxx57TEFBQQ16353/ZyU2NlYJCQkNci7TNepvkgVwbXjhhRcu7bs3AEOtX79eq1at0kcffaQuXbooJCSkQc7Dn5X/IlAANLhL/c2RgKmOHj2qjh07NvgvE+XPyn/xFk8Dq6mp0aJFi3TTTTfJbrcrLCxMCxculCR9/vnnuvvuu+Xr66vg4GA99thj7m/hlOp+tDdy5EhNnDjRvd65c2c9++yzGjdunNq2bSuXy6WlS5f+7JyOHTumMWPGqH379goODtaIESP01VdfeeuScQ3729/+ph49erjv6UGDBun06dN1PrZ+8sknlZiYqKCgIDmdTqWkpHgc64svvlD//v3VunVr3XLLLfrwww8v+nblgQMHdO+996pt27ZyOByaMGGCiouLG+Zicc2YOHGipk+frvz8fNlsNnXu3Fnr169X//791a5dOwUHB2vYsGE6evSo+zXn3oL861//qrvuuku+vr7q06ePDh8+rJycHPXu3Vtt27bVPffco++//97jXBd6O3TBggXq0aNHre1RUVGaP3++16+7sREoDSwpKUmLFi3SvHnzdODAAa1evVoOh0Pl5eW655571L59e+Xk5OjNN9/Uhx9+qCeeeKLe53juued06623avfu3UpKStLMmTNrfY3AOeXl5RowYIDatm2rrVu3atu2be4/JFVVVVd6ubiGHT9+XGPHjtWkSZN08OBBffTRRxo1atQFH1dnZWWpTZs22rlzp9LT07VgwQL3fVtTU6ORI0fKz89PO3fu1PLly5WcnHzR88fExOi2227Trl27tH79en333XcaPXq0168V15YXXnhBCxYs0A033KDjx48rJydHp0+f1qxZs5STk6ONGzeqRYsW+tWvfqWamhqP1z7zzDOaO3eudu/eLR8fH40dO1aJiYl64YUX9PHHH+vo0aOXHBeTJk3SgQMHlJOT49722Wefac+ePR7/4dpsXPT7jnHZysrKLLvdbq1YsaLWvuXLl1vt27e3Tp065d727rvvWi1atLAKCwsty7KsmJgYa8aMGR6vGzFihBUfH+9e79
Spk3XPPfd4jBkzZow1dOhQ97oka+3atZZlWdZLL71kde3a1aqpqXHvr6ystHx9fa0PPvjgci8VsHJzcy1J1ldffVVrX3x8vDVixAj3ekxMjNW/f3+PMX369LHmzJljWZZlvf/++5aPj491/Phx9/7s7GyPezkvL8+SZO3Zs8eyLMuaN2+eFRcX53HMgoICS5J16NAhL1whrmXPP/+81alTpwvuLyoqsiRZn3/+uWVZ/70///KXv7jHvP7665Yka+PGje5taWlpVteuXd3rdf1Z+d+/B4YOHWr97ne/c68nJCRYsbGxV3Bl5uIJSgM6ePCgKisrNXDgwDr39ezZU23atHFv69evn2pqanTo0KF6nefOO++stX7w4ME6x+bm5urIkSPy9/dX27Zt1bZtWwUFBenMmTMejyeB+urZs6cGDhyoHj166MEHH9SKFStUUlJywfG33nqrx3rHjh1VVFQkSTp06JBCQ0PldDrd+++4446fPX9ubq42b97svq/btm2rbt26SRL3Nrzu6NGjGjdunLp06aKAgACFh4dLkvLz8z3G/e997nA4JMnjbRqHw+G+7y/Fo48+qtdff11nzpxRdXW1XnvtNU2aNOlKLsVYfEi2Afn6+l5wn2VZstlsde47t71Fixa1Ho9XV1df0rkvdOyamhpFRUXptddeq7Xv+uuvv6RjA3Vp2bKlsrOztWPHDm3YsEFLly5VcnKydu7cWef4Vq1aeazbbDb34/Gf+/NxITU1NRo+fLgWLVpUa1/Hjh3rdSzgYoYPH67Q0FCtWLFCLpdLNTU1ioyMrPVW+f/e5+fu6fO3nf+20MXOa7fbtXbtWtntdlVWVurXv/71FV6NmQiUBhQRESFfX19t3LhRkydP9th3yy23KCsrS6dPn3Y/Rdm+fbtatGihm2++WdJPwXD8+HH3a86ePat9+/ZpwIABHsf65JNPaq2f+y/H8/Xq1UtvvPGGOnTooICAgCu+RuB/2Ww29evXT/369dP8+fPVqVMnrV27tt7H6datm/Lz8/Xdd9+5/6vzf993r0uvXr3097//XZ07d5aPD/9qQ8M5ceKEDh48qD//+c+66667JEnbtm27Kuf28fFRfHy8Vq5cKbvdroceekh+fn5X5dxXG2/xNKDWrVtrzpw5SkxM1Msvv6yjR4/qk08+0UsvvaTx48erdevWio+P1759+7R582ZNnz5dEyZMcP8L+e6779a7776rd999V1988YWmTp2q//znP7XOs337dqWnp+vw4cP64x//qDfffFMzZsyoc07jx49XSEiIRowYoY8//lh5eXnasmWLZsyYoW+++aYh/3Ggmdu5c6dSU1O1a9cu5efn66233tL333+v7t271/tYgwcP1o033qj4+Hh99tln2r59u/tDshd6sjJt2jT98MMPGjt2rD799FN9+eWX2rBhgyZNmqSzZ89e0bUB/+vcT0AuX75cR44c0aZNmzRr1qyrdv7Jkydr06ZNev/995vt2zsSgdLg5s2bp9mzZ2v+/Pnq3r27xowZo6KiIvn5+emDDz7QDz/8oD59+uiBBx7QwIEDlZmZ6X7tpEmTFB8fr4cfflgxMTEKDw+v9fREkmbPnq3c3FzdfvvtevbZZ7V48WINGTKkzvn4+flp69atCgsL06hRo9S9e3dNmjRJFRUVPFHBFQkICNDWrVt177336uabb9bcuXO1ePFiDR06tN7HatmypdatW6dTp06pT58+mjx5subOnSvpp/Cvi8vl0vbt23X27FkNGTJEkZGRmjFjhgIDA9WiBf+qg/e0aNFCa9asUW5uriIjIzVz5kw999xzV+38ERERio6OVteuXdW3b9+rdt6rzWad/yEHNCmdO3dWQkLCNfurkHHt2L59u/r3768jR47oxhtvbOzpAI3Gsix169ZNU6ZMuapPbq423qgFYKS1a9eqbdu2ioiI0JEjRzRjxgz169ePOME1raioSK+88oqOHTum3/72t409nQZFoA
Aw0smTJ5WYmKiCggKFhIRo0KBBWrx4cWNPC2hUDodDISEhWr58udq3b9/Y02lQvMUDAACMwyfHAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMYhUAAAgHH+H2Bz3J/f1CN2AAAAAElFTkSuQmCC",
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"\n",
"width = 0.35 # the width of the bars\n",
"x = [0, 0.8, 1.6] # position of the bars\n",
"situation_counts = nutri.situation.value_counts() # note that situation_counts is a pandas series\n",
"plt.bar(x,situation_counts,width,edgecolor='black')\n",
"plt.xticks(x,situation_counts.index)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### [Quantitative variables](#toc0_)\n",
"\n",
"Such type oif variables allow for more complex graphical representation. We will see some possibilities. We are in particular interested in visualizing the location, dispersion and shae of the data.\n",
"\n",
"The first example is visualizing data with a boxplot, that gives information on the location and dispersion, showing also outliers (data $x_i$ that is beyond the whiskers of the boxplot, this is, $x_iQ_3+1.5 (Q_3-Q_1)$, being $Q_3-Q_1$ called *interquantile range* or IQR)."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAhYAAAGwCAYAAAD16iy9AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8pXeV/AAAACXBIWXMAAA9hAAAPYQGoP6dpAAAWXklEQVR4nO3dbWzV9fn48aulUoq0bLIBRQoK0bQGJzJnvJu6PdB5Q9yWqFPZYEZnotkUFmGbDjfF6HSJe2DchP0yYyBuWTBmkky8mdOgiTegjmzlVrxhiG46bRWFQD//BwsnfyYIlKs9bX29Eh70fE/P9+rHDz1vv+e01JRSSgAAJKit9gAAwMAhLACANMICAEgjLACANMICAEgjLACANMICAEhT19sn7Orqik2bNkVjY2PU1NT09ukBgG4opURnZ2eMGTMmamv3fF2i18Ni06ZN0dLS0tunBQASvP766zF27Ng9Hu/1sGhsbIyI/w7W1NTU26cHALqho6MjWlpaKs/je9LrYbHz5Y+mpiZhAQD9zN7exuDNmwBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKSpq/YA9A9r166Nzs7Oao/RZ9Rs/yiGvP9afDRsXJS6IdUep09pbGyMI444otpjAFUiLNirtWvXxpFHHlntMfqUY0fXxoorhsWUu9+PFzZ3VXucPmfNmjXiAj6lhAV7tfNKxcKFC6Otra3K0/QNDe+uiXjyili0aFF8+BnRtVN7e3tMmzbN1S34FBMW7LO2traYMmVKtcfoGzbVRjwZ0dbaGjFmcrWnAegzvHkTAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgzYMJiy5YtsWLFitiyZUu1RwGA/TZQnscGTFisWrUqvvjFL8aqVauqPQoA7LeB8jw2YMICAKg+YQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApKmr9gAAwIHbtm1b3HXXXbF+/fqYOHFiXHnllTF48OBen2O/r1g8+eSTMXXq1BgzZkzU1NTEAw880ANjAQD7avbs2XHwwQfHzJkz484774yZM2fGwQcfHLNnz+71WfY7LD
744IM45phj4s477+yJeQCA/TB79uy4/fbbY8SIEbFgwYJ44403YsGCBTFixIi4/fbbez0u9vulkLPOOivOOuusnpgFANgP27ZtizvuuCNGjRoVGzdujLq6/z6tX3bZZTFjxowYO3Zs3HHHHTFv3rxee1mkx99jsXXr1ti6dWvl446Ojh45z4cffhgREe3t7T3y+J9mO9d05xrDnvh7CN3Xne+1d911V2zfvj3mzZtXiYqd6urq4sYbb4wrrrgi7rrrrrjmmmsyx92jHg+LW265JX7+85/39GnilVdeiYiIadOm9fi5Pq1eeeWVOPnkk6s9Bn2Yv4dw4Pbne+369esjIuLcc8/d7fGdt++8X2/o8bD48Y9/HLNmzap83NHRES0tLennOeywwyIiYuHChdHW1pb++J9m7e3tMW3atMoaw574ewjd153vtRMnToyIiCVLlsRll132seNLlizZ5X69ocfDor6+Purr63v6NNHQ0BAREW1tbTFlypQeP9+n0c41hj3x9xAO3P58r73yyivj2muvjeuvvz5mzJixy8sh27dvj7lz50ZdXV1ceeWVPTHqbvkFWQDQTw0ePDhmzpwZb775ZowdOzbmz58fmzZtivnz58fYsWPjzTffjJkzZ/bq77PY7ysW77//fqxbt67y8YYNG+LFF1+MQw45JMaNG5c6HADwyW677baIiLjjjjviiiuuqNxeV1cX1157beV4b9nvsHj++efjK1/5SuXjne+fmD59etxzzz1pgwEA++a2226LefPm9YnfvLnfYXH66adHKaUnZgEAumnw4MG99iOln8R7LACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgjLACANMICAEgzYMKitbU1li9fHq2trdUeBQD220B5Hqur9gBZhg4dGlOmTKn2GADQLQPleWzAXLEAAKpPWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJCmrtoD0Pdt2bIlIiJWrFhR5Un6joZ310RbRLSvWhUfbu6q9jh9Rnt7e7VHAKpMWLBXq1atioiIyy+/vMqT9B3Hjq6NFVcMi0suuSReEBYf09jYWO0RgCoRFuzV17/+9YiIaG1tjaFDh1Z3mD6iZvtH0f7+a/F/Z4+LUjek2uP0KY2NjXHEEUdUewygSmpKKaU3T9jR0RHDhw+P9957L5qamnrz1ABAN+3r87c3bwIAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJ
BGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJBGWAAAaYQFAJCmrrdPWEqJiIiOjo7ePjUA0E07n7d3Po/vSa+HRWdnZ0REtLS09PapAYAD1NnZGcOHD9/j8Zqyt/RI1tXVFZs2bYrGxsaoqalJe9yOjo5oaWmJ119/PZqamtIeF2vbk6xtz7CuPcfa9py+vrallOjs7IwxY8ZEbe2e30nR61csamtrY+zYsT32+E1NTX3yP8hAYG17jrXtGda151jbntOX1/aTrlTs5M2bAEAaYQEApBkwYVFfXx833HBD1NfXV3uUAcfa9hxr2zOsa8+xtj1noKxtr795EwAYuAbMFQsAoPqEBQCQRlgAAGmEBQCQpt+FxT//+c+YNm1ajBgxIoYOHRqTJ0+O5cuXV47PmDEjampqdvlzwgknVHHi/uGwww772LrV1NTEVVddFRH//Y1rP/vZz2LMmDHR0NAQp59+evz973+v8tT9w97W1p7tnu3bt8f1118fhx9+eDQ0NMSECRPixhtvjK6ursp97Nvu2Ze1tW+7r7OzM6655poYP358NDQ0xEknnRTPPfdc5Xi/37elH3nnnXfK+PHjy4wZM8ozzzxTNmzYUB599NGybt26yn2mT59evva1r5U33nij8uftt9+u4tT9w1tvvbXLmj3yyCMlIsrjjz9eSinl1ltvLY2NjWXx4sVl5cqV5cILLyzNzc2lo6OjuoP3A3tbW3u2e+bNm1dGjBhRlixZUjZs2FD++Mc/lmHDhpVf/epXlfvYt92zL2tr33bfBRdcUI466qjyxBNPlLVr15YbbrihNDU1lY0bN5ZS+v++7VdhMWfOnHLKKad84n2mT59ezjvvvN4ZaAC7+uqry8SJE0tXV1fp6uoqo0ePLrfeemvl+EcffVSGDx9efvOb31Rxyv7p/1/bUuzZ7jrnnHPKpZdeustt3/zmN8u0adNKKcW+PQB7W9tS7Nvu2rJlSxk0aFBZsmTJLrcfc8wx5brrrhsQ+7ZfvRTypz/9KY477rg4//zzY+TIkXHsscfGggULPna/v/71rzFy5Mg48sgj4/LLL4+33nqrCtP2X9u2bYuFCxfGpZdeGjU1NbFhw4bYvHlznHHGGZX71NfXx2mnnRZPP/10FSftf/53bXeyZ/ffKaecEo899lisWbMmIiJeeumlWLZsWZx99tkREfbtAdjb2u5k3+6/7du3x44dO2LIkCG73N7Q0BDLli0bEPu21/8RsgPx8ssvx69//euYNWtW/OQnP4lnn302fvCDH0R9fX185zvfiYiIs846K84///wYP358bNiwIX7605/GV7/61Vi+fHm//21mveWBBx6Id999N2bMmBEREZs3b46IiFGjRu1yv1GjRsWrr77a2+P1a/+7thH2bHfNmTMn3nvvvWhtbY1BgwbFjh074uabb46LLrooIuzbA7G3tY2wb7ursbExTjzxxLjpppuira0tRo0aFffdd18888wzccQRRwyMfVvtSyb746CDDionnnjiLrd9//vfLyeccMIeP2fTpk3loIMOKosXL+7p8QaMM844o5x77rmVj5966qkSEWXTpk273O+yyy4rZ555Zm+P16/979rujj27b+67774yduzYct9995W//e1v5d577y2HHHJIueeee0op9u2B2Nva7o59u+/WrVtXTj311BIRZdCgQeVLX/pSueSSS0pbW9uA2Lf96opFc3NzHHXUUbvc1tbWFosXL/7Ezxk/fnysXbu2p8cbEF599dV49NFH4/7776/cNnr06Ij47/8BNjc3V25/6623PlbV7Nnu1nZ37Nl9c+2118aPfvSj+Na3vhUREUcffXS8+uqrccstt8T06dPt2wOwt7XdHft2302cODGeeOKJ+OCDD6KjoyOam5vjwgsvjMMPP3
xA7Nt+9R6Lk08+OVavXr3LbWvWrInx48fv8XPefvvteP3113f5D8Se/e53v4uRI0fGOeecU7lt52Z/5JFHKrdt27YtnnjiiTjppJOqMWa/tLu13R17dt9s2bIlamt3/RY2aNCgyo9E2rfdt7e13R37dv8dfPDB0dzcHP/5z39i6dKlcd555w2MfVvtSyb749lnny11dXXl5ptvLmvXri2LFi0qQ4cOLQsXLiyllNLZ2Vl++MMflqeffrps2LChPP744+XEE08shx56aL/5MZ1q2rFjRxk3blyZM2fOx47deuutZfjw4eX+++8vK1euLBdddFG/+vGnatvT2tqz3Td9+vRy6KGHVn4k8v777y+f+9znyuzZsyv3sW+7Z29ra98emIceeqj8+c9/Li+//HJ5+OGHyzHHHFOOP/74sm3btlJK/9+3/SosSinlwQcfLJMmTSr19fWltbW1zJ8/v3Jsy5Yt5Ywzziif//zny0EHHVTGjRtXpk+fXl577bUqTtx/LF26tEREWb169ceOdXV1lRtuuKGMHj261NfXl1NPPbWsXLmyClP2T3taW3u2+zo6OsrVV19dxo0bV4YMGVImTJhQrrvuurJ169bKfezb7tnb2tq3B+YPf/hDmTBhQhk8eHAZPXp0ueqqq8q7775bOd7f961/Nh0ASNOv3mMBAPRtwgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgL4RA899FCccsop8ZnPfCZGjBgR5557bqxfv75y/Omnn47JkyfHkCFD4rjjjosHHnggampq4sUXX6zc5x//+EecffbZMWzYsBg1alR8+9vfjn//+99V+GqAniYsgE/0wQcfxKxZs+K5556Lxx57LGpra+Mb3/hGdHV1RWdnZ0ydOjWOPvroWLFiRdx0000xZ86cXT7/jTfeiNNOOy0mT54czz//fDz00EPx5ptvxgUXXFClrwjoSf51U2C//Otf/4qRI0fGypUrY9myZXH99dfHxo0bY8iQIRER8dvf/jYuv/zyeOGFF2Ly5Mkxd+7ceOaZZ2Lp0qWVx9i4cWO0tLTE6tWr48gjj6zWlwL0AFcsgE+0fv36uPjii2PChAnR1NQUhx9+eEREvPbaa7F69er4whe+UImKiIjjjz9+l89fvnx5PP744zFs2LDKn9bW1spjAwNLXbUHAPq2qVOnRktLSyxYsCDGjBkTXV1dMWnSpNi2bVuUUqKmpmaX+//vRdCurq6YOnVq/OIXv/jYYzc3N/fo7EDvExbAHr399tvR3t4ed999d3z5y1+OiIhly5ZVjre2tsaiRYti69atUV9fHxERzz///C6PMWXKlFi8eHEcdthhUVfnWw4MdF4KAfbos5/9bIwYMSLmz58f69ati7/85S8xa9asyvGLL744urq64nvf+160t7fH0qVL45e//GVEROVKxlVXXRXvvPNOXHTRRfHss8/Gyy+/HA8//HBceumlsWPHjqp8XUDPERbAHtXW1sbvf//7WL58eUyaNClmzpwZt99+e+V4U1NTPPjgg/Hiiy/G5MmT47rrrou5c+dGRFTedzFmzJh46qmnYseOHXHmmWfGpEmT4uqrr47hw4dHba1vQTDQ+KkQINWiRYviu9/9brz33nvR0NBQ7XGAXuYFT+CA3HvvvTFhwoQ49NBD46WXXoo5c+bEBRdcICrgU0pYAAdk8+bNMXfu3Ni8eXM0NzfH+eefHzfffHO1xwKqxEshAEAa75wCANIICwAgjbAAANIICwAgjbAAANIICwAgjbAAANIICwAgzf8D2bsRZQItCXYAAAAASUVORK5CYII=",
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.boxplot(nutri.age,widths=width,vert=False) # the vert controls the verticality of the plot\n",
"plt.xlabel('age')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In a slightly more complex setup, we can compare two different categories in the boxplot:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"males = nutri[nutri.gender == 'Male']\n",
"females = nutri[nutri.gender == 'Female']\n",
"plt.boxplot([males.coffee, females.coffee], notch =True , widths=(0.5 ,0.5))\n",
"plt.xlabel ('gender')\n",
"plt.ylabel ('coffee')\n",
"plt.xticks ([1 ,2] ,['Male','Female'])\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"we can also show the distribution of the data using a histogram, after breaking the data into *bins* or *classes*"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.hist(nutri.age,bins=9,facecolor='green',edgecolor='black',linewidth=1)\n",
"plt.xlabel('Age')\n",
"plt.ylabel('Quantity')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"alternatively, we can weigth the values to show $\\frac{counts}{total}$ by using the trick of multiplying all values times the quantity $1/266$."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"weigths = np.ones_like(nutri.age)/nutri.age.count()\n",
"plt.hist(nutri.age,bins=9,weights=weigths,facecolor='green',edgecolor='black',linewidth=1)\n",
"plt.xlabel('Age')\n",
"plt.ylabel('Proportion of total')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The *empirical cumulative distribution function*, $F_n$, is a step function that jupms $k/n$ at observation values, where $k$ is the fraction of tied observations at that value:\n",
"$$F_n(x)=\\frac{\\mathrm{number \\; of \\; }x_i \\leq x}{n}$$\n",
"The function can be defined and plotted in the same way for both discrete and continuous data."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"x = np.sort(nutri.age)\n",
"y = np.linspace(0,1,len(nutri.age))\n",
"plt.xlabel('Age')\n",
"plt.ylabel('$F_n(x)$')\n",
"plt.step(x,y)\n",
"plt.xlim(x.min(),x.max())\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let us have a look at a plot that takes into account the contingency table shown above. We will make use now of the `seaborn` package, which simplifies the plotting of statistical information.\n",
"\n",
"TODO: add examples with sns.histplot()"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.scatter(nutri.height,nutri.weight,s=12,marker='o')\n",
"plt.xlabel('height')\n",
"plt.ylabel('weight')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can get more sophisticated plots and fit some lines to the data to visualy compare data. In this example, we will use data from Vincent Arel Bundock repo on birth weights of babies for smoking and non smoking mothers, and plot the data againts the age of the mother. We will deal with significance of the results in upcoming sessions."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"