2. UNIT 1. Import and Visualization of data with Python

This Unit includes some fast shortcuts to representing data in Python, closely following [1] and the corresponding GitHub repo.

NOTE: since this is a training tool, we assume the user has already installed all the required packages. If a package is missing, install it separately from a terminal using your favorite package manager.

These notebooks have been tested on Ubuntu Linux, using Anaconda as the package manager in most cases. The notebooks are self-contained.

Table of contents

  • UNIT 1. Import and Visualization of data with Python

    • Retrieving data

    • Reformatting tables

    • Structuring features

    • Summary tables and statistics

    • Data visualization

      • Qualitative variables

      • Quantitative variables

2.1. Retrieving data

Typically, the files containing the data to be analyzed are stored in comma-separated values (CSV) format. To work with them, the first step is to download the data. Before that, we make sure we have a proper folder to download the file into.

import os

input_directory = "datasets"
output_directory = 'output'

# Check if the directories exist
if not os.path.exists(input_directory):
    # If it doesn't exist, create it
    os.makedirs(input_directory)
else:
    print('folder ',input_directory,' exists')

if not os.path.exists(output_directory):
    # If it doesn't exist, create it
    os.makedirs(output_directory)
else:
    print('folder ',output_directory,' exists')

# now use wget to download the file into the datasets folder
!wget -P $input_directory -nc https://archive.ics.uci.edu/static/public/109/wine.zip 
folder  datasets  exists
folder  output  exists
--2023-09-25 14:48:36--  https://archive.ics.uci.edu/static/public/109/wine.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘datasets/wine.zip’

wine.zip                [ <=>                ]   5,90K  --.-KB/s    in 0s

2023-09-25 14:48:37 (68,9 MB/s) - ‘datasets/wine.zip’ saved [6038]
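On systems where wget is not available, the same download step can be done from Python itself. The following is a minimal sketch using only the standard library; the helper name download_if_missing is our own (hypothetical) choice and not part of any package.

```python
import os
import urllib.request


def download_if_missing(url, folder):
    """Download url into folder unless the file is already there.

    Returns the local path of the file. This mimics `wget -nc -P folder url`.
    """
    os.makedirs(folder, exist_ok=True)           # create the folder if needed
    local_path = os.path.join(folder, os.path.basename(url))
    if not os.path.exists(local_path):           # skip files already downloaded
        urllib.request.urlretrieve(url, local_path)
    return local_path


# Example call (requires network access):
# download_if_missing("https://archive.ics.uci.edu/static/public/109/wine.zip", "datasets")
```

Note that os.makedirs with exist_ok=True replaces the explicit existence check used above.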

Now we read the content of the zip file using the pandas module. Information about the data can be found in the UCI Machine Learning Repository.

import pandas as pd
from zipfile import ZipFile

with ZipFile('datasets/wine.zip', 'r') as f:
    # extract only the files we need into the datasets folder
    f.extractall(input_directory, members=['wine.names', 'wine.data'])

wine = pd.read_csv('datasets/wine.data',header=None)
wine.head()
   0      1     2     3     4    5     6     7     8     9    10    11    12    13
0  1  14.23  1.71  2.43  15.6  127  2.80  3.06  0.28  2.29  5.64  1.04  3.92  1065
1  1  13.20  1.78  2.14  11.2  100  2.65  2.76  0.26  1.28  4.38  1.05  3.40  1050
2  1  13.16  2.36  2.67  18.6  101  2.80  3.24  0.30  2.81  5.68  1.03  3.17  1185
3  1  14.37  1.95  2.50  16.8  113  3.85  3.49  0.24  2.18  7.80  0.86  3.45  1480
4  1  13.24  2.59  2.87  21.0  118  2.80  2.69  0.39  1.82  4.32  1.04  2.93   735

As we do not have the names of the columns, we can assign them manually, based on the information given on the dataset's web page.

wine.columns = ['class','Alcohol','Malic_acid','Ash','Alcalinity_of_ash','Magnesium','Total_phenols','Flavanoids','Nonflavonoid_phenols','Proanthocyanins','Color_intensity','Hue','OD280_OD315_of_diluted_wines','Proline']
wine.head()
   class  Alcohol  Malic_acid   Ash  Alcalinity_of_ash  Magnesium  Total_phenols  Flavanoids  Nonflavonoid_phenols  Proanthocyanins  Color_intensity   Hue  OD280_OD315_of_diluted_wines  Proline
0      1    14.23        1.71  2.43               15.6        127           2.80        3.06                  0.28             2.29             5.64  1.04                          3.92     1065
1      1    13.20        1.78  2.14               11.2        100           2.65        2.76                  0.26             1.28             4.38  1.05                          3.40     1050
2      1    13.16        2.36  2.67               18.6        101           2.80        3.24                  0.30             2.81             5.68  1.03                          3.17     1185
3      1    14.37        1.95  2.50               16.8        113           3.85        3.49                  0.24             2.18             7.80  0.86                          3.45     1480
4      1    13.24        2.59  2.87               21.0        118           2.80        2.69                  0.39             1.82             4.32  1.04                          2.93      735
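The column names can also be given at read time through the names argument of pandas.read_csv, which avoids the separate assignment step. The sketch below uses an in-memory stand-in for wine.data built from the first two rows shown above; the column labels are our own choice, since the file itself has no header.

```python
import io
import pandas as pd

# First two rows shown above, used as an in-memory stand-in for datasets/wine.data
sample = io.StringIO(
    "1,14.23,1.71,2.43,15.6,127,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065\n"
    "1,13.20,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050\n"
)

# Labels chosen by us; the file has no header row
wine_columns = ['class', 'Alcohol', 'Malic_acid', 'Ash', 'Alcalinity_of_ash',
                'Magnesium', 'Total_phenols', 'Flavanoids', 'Nonflavonoid_phenols',
                'Proanthocyanins', 'Color_intensity', 'Hue',
                'OD280_OD315_of_diluted_wines', 'Proline']

# header=None: the first line is data; names=...: labels to use instead
wine_sample = pd.read_csv(sample, header=None, names=wine_columns)
print(wine_sample[['class', 'Alcohol', 'Proline']])
```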

Alternatively, we can read the CSV directly from its URL, without downloading it first. We will use Fisher’s iris data. Here the file already contains the names of the columns/features, so pandas reads them directly from the dataset.

dataname = 'https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv'

iris = pd.read_csv(dataname)
iris.head()
   rownames  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
0         1           5.1          3.5           1.4          0.2   setosa
1         2           4.9          3.0           1.4          0.2   setosa
2         3           4.7          3.2           1.3          0.2   setosa
3         4           4.6          3.1           1.5          0.2   setosa
4         5           5.0          3.6           1.4          0.2   setosa

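As the first column just duplicates the index, we can remove it. Note that drop returns a new DataFrame and leaves the original untouched, so we assign the result back to iris.

iris = iris.drop('rownames', axis=1)
iris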
     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
  0           5.1          3.5           1.4          0.2     setosa
  1           4.9          3.0           1.4          0.2     setosa
  2           4.7          3.2           1.3          0.2     setosa
  3           4.6          3.1           1.5          0.2     setosa
  4           5.0          3.6           1.4          0.2     setosa
...           ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica

150 rows × 5 columns
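Another option, sketched here under the assumption that the first column of the file is just a row counter, is to tell pandas.read_csv to use that column as the index via index_col=0. The two-row CSV text below is an in-memory stand-in for the iris file.

```python
import io
import pandas as pd

# In-memory stand-in for the first lines of iris.csv
csv_text = io.StringIO(
    '"rownames","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"\n'
    '1,5.1,3.5,1.4,0.2,"setosa"\n'
    '2,4.9,3.0,1.4,0.2,"setosa"\n'
)

# index_col=0 turns the first column into the row index instead of a feature
iris_sample = pd.read_csv(csv_text, index_col=0)
print(iris_sample)
```

With this approach the index keeps the values stored in the file (starting at 1), whereas dropping the column keeps the default index starting at 0.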

2.2. Reformatting tables

Sometimes the data comes in odd formats that are not useful for analysis.

For example, consider the table with scores given here, with values obtained before and after some particular training:

[Figure: table of student scores before and after training]

We want to reformat it so that the score data is in a single column. Let us create a table with three features: Score (continuous data, stored by default in a column named value), Time (before or after) and Student (integer values from 1 to 5). We will use the pandas.melt function, specifically devoted to this goal.

import pandas as pd

# manually create dataframe with data from table
values = [[1, 75, 85], [2, 30, 50], [3, 100, 100], [4, 50, 52], [5, 60, 65]]
df = pd.DataFrame(values, columns=['Student', 'Before', 'After'])
# format dataframe as required: one row per (Student, Time) pair
df = pd.melt(df, id_vars=['Student'], var_name='Time', value_vars=['Before', 'After'])
print(df)
   Student    Time  value
0        1  Before     75
1        2  Before     30
2        3  Before    100
3        4  Before     50
4        5  Before     60
5        1   After     85
6        2   After     50
7        3   After    100
8        4   After     52
9        5   After     65
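The opposite transformation, from the long format back to one column per time point, can be sketched with DataFrame.pivot. The long table is rebuilt here by hand so that the example is self-contained; the extra Gain column is our own addition for illustration.

```python
import pandas as pd

# Long-format table as produced by pd.melt above
long_df = pd.DataFrame({
    'Student': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    'Time': ['Before'] * 5 + ['After'] * 5,
    'value': [75, 30, 100, 50, 60, 85, 50, 100, 52, 65],
})

# One row per student, one column per level of Time
wide_df = long_df.pivot(index='Student', columns='Time', values='value')
wide_df = wide_df[['Before', 'After']]              # restore the original column order
wide_df['Gain'] = wide_df['After'] - wide_df['Before']
print(wide_df)
```

The long format is usually more convenient for grouping and plotting, while the wide format makes per-student comparisons such as Gain straightforward.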

2.3. Structuring features

Types of features:

  • Quantitative (with possible discrete or continuous values)

  • Qualitative; these can be divided into a fixed number of categories, which is why they are referred to as categorical variables or factors.

Here we will work with data from []. In particular, we will download the file containing nutritional measurements of thirteen features (columns) for 226 elderly individuals (rows). Note that the file is now an Excel file.

xls = 'http://www.biostatisticien.eu/springeR/nutrition_elderly.xls'
nutri = pd.read_excel(xls)

pd.set_option('display.max_columns', 8)
nutri.head()
   gender  situation  tea  coffee  ...  raw_fruit  cooked_fruit_veg  chocol  fat
0       2          1    0       0  ...          1                 4       5    6
1       2          1    1       1  ...          5                 5       1    4
2       2          1    0       4  ...          5                 2       5    4
3       2          1    0       0  ...          4                 0       3    2
4       2          1    2       1  ...          5                 5       3    2

5 rows × 13 columns

To check the structure of the data we can use the info method of the pandas DataFrame; its result matches the description given in the original source:

[Figure: description of the features in the original source]

Note the features can be classified as

  • qualitative

    • ordinal (meat, fish, raw_fruit, cooked_fruit_veg, chocol)

    • nominal (gender, situation, fat)

  • quantitative

    • discrete (tea, coffee)

    • continuous (height, weight, age)

nutri.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226 entries, 0 to 225
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   gender            226 non-null    int64
 1   situation         226 non-null    int64
 2   tea               226 non-null    int64
 3   coffee            226 non-null    int64
 4   height            226 non-null    int64
 5   weight            226 non-null    int64
 6   age               226 non-null    int64
 7   meat              226 non-null    int64
 8   fish              226 non-null    int64
 9   raw_fruit         226 non-null    int64
 10  cooked_fruit_veg  226 non-null    int64
 11  chocol            226 non-null    int64
 12  fat               226 non-null    int64
dtypes: int64(13)
memory usage: 23.1 KB

Let us now change the values and types of different features by means of the replace and astype methods, and finally save the data in CSV format.

# gender, situation, meat, fish, raw_fruit, cooked_fruit_veg, chocol and fat are categorical features
DICT = {1:'Male',2:'Female'}
nutri['gender'] = nutri['gender'].replace(DICT).astype('category')

DICT = {1:'single',2:'couple',3:'family',4:'other'}
nutri['situation'] = nutri['situation'].replace(DICT).astype('category')

DICT = {0:'never',1:'less than once a week',2:'once a week',3:'2-3 times a week',4:'4-6 times a week', 5:'every day'}
nutri['meat'] = nutri['meat'].replace(DICT).astype('category')
nutri['fish'] = nutri['fish'].replace(DICT).astype('category')
nutri['raw_fruit'] = nutri['raw_fruit'].replace(DICT).astype('category')
nutri['cooked_fruit_veg'] = nutri['cooked_fruit_veg'].replace(DICT).astype('category')
nutri['chocol'] = nutri['chocol'].replace(DICT).astype('category')

DICT = {1:'Butter',2:'Margarine',3:'Peanut oil', 4:'Sunflower oil', 5:'Olive oil', 6:'Mix of vegetable oils', 7:'Colza oil',8:'Duck or goose fat'}
nutri['fat'] = nutri['fat'].replace(DICT).astype('category')

# tea and coffee are integer
nutri['tea'] = nutri['tea'].astype(int)
nutri['coffee'] = nutri['coffee'].astype(int)

# height, weight, age are float
nutri['height'] = nutri['height'].astype(float)
nutri['weight'] = nutri['weight'].astype(float)
nutri['age'] = nutri['age'].astype(float)

nutri.head()
   gender  situation  tea  coffee  ...              raw_fruit  cooked_fruit_veg                 chocol                    fat
0  Female     single    0       0  ...  less than once a week  4-6 times a week              every day  Mix of vegetable oils
1  Female     single    1       1  ...              every day         every day  less than once a week          Sunflower oil
2  Female     single    0       4  ...              every day       once a week              every day          Sunflower oil
3  Female     single    0       0  ...       4-6 times a week             never       2-3 times a week              Margarine
4  Female     single    2       1  ...              every day         every day       2-3 times a week              Margarine

5 rows × 13 columns

nutri.info()

nutri.to_csv('output/nutri.csv',index=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226 entries, 0 to 225
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   gender            226 non-null    category
 1   situation         226 non-null    category
 2   tea               226 non-null    int64   
 3   coffee            226 non-null    int64   
 4   height            226 non-null    float64 
 5   weight            226 non-null    float64 
 6   age               226 non-null    float64 
 7   meat              226 non-null    category
 8   fish              226 non-null    category
 9   raw_fruit         226 non-null    category
 10  cooked_fruit_veg  226 non-null    category
 11  chocol            226 non-null    category
 12  fat               226 non-null    category
dtypes: category(8), float64(3), int64(2)
memory usage: 12.4 KB
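The plain category type does not record that ordinal features such as meat or fish have a natural order. A minimal sketch of how that order could be encoded with pandas.CategoricalDtype follows; the four-value series is made up for illustration and is not taken from the nutri data.

```python
import pandas as pd

# Frequency levels from the dictionary above, from lowest to highest
levels = ['never', 'less than once a week', 'once a week',
          '2-3 times a week', '4-6 times a week', 'every day']
freq_type = pd.CategoricalDtype(categories=levels, ordered=True)

# Made-up answers, only to illustrate the behaviour of an ordered categorical
meat = pd.Series(['every day', 'never', 'once a week', '4-6 times a week']).astype(freq_type)

print(meat.sort_values().tolist())          # sorted by level, not alphabetically
print((meat >= 'once a week').tolist())     # comparisons respect the order
print(meat.min(), '|', meat.max())
```

Keep in mind that the CSV format stores only text, so category information (ordered or not) is lost when the file is written with to_csv and read back with read_csv.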

2.4. Summary tables and statistics

It is extremely important to know your data before any analysis. Apart from the descriptive tools shown above, descriptive statistical measurements can be obtained from pandas with two simple methods: describe and value_counts. Note that, when applied to a single column, both describe and value_counts return a pandas Series, a one-dimensional array with axis labels.

nutri = pd.read_csv('output/nutri.csv')
nutri['fat'].describe()
count               226
unique                8
top       Sunflower oil
freq                 68
Name: fat, dtype: object
nutri['fat'].value_counts()
fat
Sunflower oil            68
Peanut oil               48
Olive oil                40
Margarine                27
Mix of vegetable oils    23
Butter                   15
Duck or goose fat         4
Colza oil                 1
Name: count, dtype: int64

We can also generate a contingency table by cross-tabulating two or more variables:

pd.crosstab(nutri.gender,nutri.situation,margins=True) # margins=True adds the row/column totals
situation  couple  family  single  All
gender
Female         56       7      78  141
Male           63       2      20   85
All           119       9      98  226
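Raw counts are hard to compare when the groups have different sizes (141 women versus 85 men), so it often helps to convert each row to proportions. The sketch below rebuilds the table from the counts shown above; on the original data, passing normalize='index' to pd.crosstab should give the same proportions in one step.

```python
import pandas as pd

# Counts copied from the contingency table above (without the margins)
counts = pd.DataFrame(
    {'couple': [56, 63], 'family': [7, 2], 'single': [78, 20]},
    index=['Female', 'Male'],
)

# Divide every row by its own total, so that each row sums to 1
row_props = counts.div(counts.sum(axis=1), axis=0)
print(row_props.round(3))
```

Read this way, about 55% of the women in the sample live alone, against about 24% of the men.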

We can now turn to descriptive statistics like the sample mean, \(\bar{x}\): \[\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i\]

nutri.height.mean()
163.96017699115043

Another useful statistic is the sample p-quantile of \(\mathbf{x}\), with \(0 < p < 1\), which is a value \(x\) such that at least a fraction \(p\) of the data is less than or equal to \(x\) and at least a fraction \(1-p\) is greater than or equal to \(x\). Thus, the sample median is the sample 0.5-quantile. For our nutri data the first, second and third quartiles (the 25%, 50% and 75% percentiles of the data) can be obtained as:

nutri.height.quantile(q=[0.25,0.5,0.75])
0.25    157.0
0.50    163.0
0.75    170.0
Name: height, dtype: float64

If, in addition to the location of the data, we are interested in its dispersion, we can start by obtaining the range:

nutri.height.max()-nutri.height.min()
48.0

Next, we can calculate the sample variance as \[s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2\]

round(nutri.height.var(),2)
81.06

or the standard deviation, \(s=\sqrt{s^2}\), which pandas provides directly:

round(nutri.height.std(),2)
9.0

or simply make use of the describe method shown above, taking advantage of the quantitative type of the height data:

nutri.height.describe()
count    226.000000
mean     163.960177
std        9.003368
min      140.000000
25%      157.000000
50%      163.000000
75%      170.000000
max      188.000000
Name: height, dtype: float64
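To connect the formulas for \(\bar{x}\), \(s^2\) and \(s\) with the pandas methods, here is a small worked check on the five "Before" scores from Section 2.2. It is only a sketch; the point to notice is that pandas divides by \(n-1\) by default, whereas numpy.var divides by \(n\) unless ddof=1 is given.

```python
import numpy as np
import pandas as pd

x = np.array([75, 30, 100, 50, 60], dtype=float)   # "Before" scores from Section 2.2
n = len(x)

mean = x.sum() / n                           # sample mean
var = ((x - mean) ** 2).sum() / (n - 1)      # sample variance, divisor n - 1
std = np.sqrt(var)                           # sample standard deviation
print(mean, var, round(std, 2))

s = pd.Series(x)
print(s.mean(), s.var(), round(s.std(), 2))  # pandas uses ddof=1 by default

print(np.var(x))                             # numpy default ddof=0: divides by n
print(np.var(x, ddof=1))                     # matches the sample variance
```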

2.5. Data visualization

Depending on the data type, the visualization will differ. To use the visualization power of Python, we first import the matplotlib.pyplot module, as well as numpy, in addition to the already imported pandas module.

import matplotlib.pyplot as plt
import numpy as np

2.5.1. Qualitative variables

Here we can show the data in a simple bar plot, taking into account that the category (x-axis) is not per se a numerical value, so we should manually place the different categories in the plot.

width = 0.35  # the width of the bars
x = [0, 0.8, 1.6]    # position of the bars
situation_counts = nutri.situation.value_counts()  # note that situation_counts is a pandas series
plt.bar(x,situation_counts,width,edgecolor='black')
plt.xticks(x,situation_counts.index)
plt.show()
[Figure: bar plot of the number of individuals in each situation category]

2.5.2. Quantitative variables

This type of variable allows for more complex graphical representations. We will see some possibilities. We are in particular interested in visualizing the location, dispersion and shape of the data.

The first example is visualizing data with a boxplot, which gives information on the location and dispersion and also shows outliers (data \(x_i\) beyond the whiskers of the boxplot, that is, \(x_i<Q_1-1.5 (Q_3-Q_1)\) or \(x_i>Q_3+1.5 (Q_3-Q_1)\), where \(Q_3-Q_1\) is called the interquartile range or IQR).

plt.boxplot(nutri.age,widths=width,vert=False)  # vert=False draws the boxplot horizontally
plt.xlabel('age')
plt.show()
[Figure: horizontal boxplot of age]
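The outlier rule used by the boxplot can also be applied numerically. The helper below, iqr_outliers, is a hypothetical function of our own, and the ages are made up for illustration; quartiles are computed with numpy.percentile and its default linear interpolation.

```python
import numpy as np


def iqr_outliers(values):
    """Return the lower fence, the upper fence and the points outside them."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])    # first and third quartiles
    iqr = q3 - q1                               # interquartile range
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = values[(values < lower) | (values > upper)]
    return lower, upper, outliers


ages = [65, 70, 72, 74, 75, 77, 80, 82, 110]    # made-up sample
lower, upper, outliers = iqr_outliers(ages)
print(lower, upper, outliers)
```

Applied by hand to the height summary of Section 2.4 (\(Q_1=157\), \(Q_3=170\)), the fences are 137.5 and 189.5; since the minimum is 140 and the maximum is 188, the height data has no outliers under this rule.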

In a slightly more complex setup, we can compare two different categories in the boxplot:

males = nutri[nutri.gender == 'Male']
females = nutri[nutri.gender == 'Female']
plt.boxplot([males.coffee, females.coffee], notch=True, widths=(0.5, 0.5))
plt.xlabel('gender')
plt.ylabel('coffee')
plt.xticks([1, 2], ['Male', 'Female'])
plt.show()
[Figure: notched boxplots of coffee consumption for males and females]

We can also show the distribution of the data using a histogram, after breaking the data into bins or classes:

plt.hist(nutri.age,bins=9,facecolor='green',edgecolor='black',linewidth=1)
plt.xlabel('Age')
plt.ylabel('Quantity')
plt.show()
[Figure: histogram of age with 9 bins]

Alternatively, we can weight the values so that the bars show \(\frac{counts}{total}\), using the trick of giving every observation a weight of \(1/226\).

weights = np.ones_like(nutri.age)/nutri.age.count()
plt.hist(nutri.age,bins=9,weights=weights,facecolor='green',edgecolor='black',linewidth=1)
plt.xlabel('Age')
plt.ylabel('Proportion of total')
plt.show()
[Figure: histogram of age showing the proportion of the total in each bin]
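The weighting trick is easy to confuse with the density option of the histogram functions, so a short numerical comparison may help. The sketch uses synthetic, normally distributed data (not the nutri ages) and numpy.histogram, which follows the same binning logic as plt.hist.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=75, scale=5, size=226)    # synthetic stand-in for the ages

# Counts per bin, then proportions of the total (bar heights sum to 1)
counts, edges = np.histogram(data, bins=9)
proportions = counts / counts.sum()

# Same result using one weight of 1/n per observation
weights = np.full(len(data), 1 / len(data))
weighted, _ = np.histogram(data, bins=9, weights=weights)

# density=True rescales so that the total AREA of the bars is 1 instead
density, _ = np.histogram(data, bins=9, density=True)
bin_widths = np.diff(edges)

print(proportions.sum())                 # bar heights sum to 1 (up to rounding)
print((density * bin_widths).sum())      # bar areas sum to 1 (up to rounding)
```

In short, weights of \(1/n\) give the proportion of observations per bin, while density=True gives an estimate of the probability density, whose values depend on the bin width.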

The empirical cumulative distribution function, \(F_n\), is a step function that jumps by \(k/n\) at each observed value, where \(k\) is the number of tied observations at that value: \[F_n(x)=\frac{\mathrm{number \; of \; }x_i \leq x}{n}\] The function can be defined and plotted in the same way for both discrete and continuous data.

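x = np.sort(nutri.age)
n = len(x)
y = np.arange(1, n + 1) / n   # value of F_n at each sorted observation: 1/n, 2/n, ..., 1
plt.xlabel('Age')
plt.ylabel('$F_n(x)$')
plt.step(x, y, where='post')  # jump at each observation, flat until the next one
plt.xlim(x.min(), x.max())
plt.show()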
[Figure: empirical cumulative distribution function of age]
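The behaviour of \(F_n\) at tied observations is easiest to see on a tiny sample. The function ecdf below is a hypothetical helper of our own that returns the distinct observed values together with the value of \(F_n\) at each of them.

```python
import numpy as np


def ecdf(sample):
    """Return the distinct sorted values and F_n evaluated at each of them."""
    sample = np.asarray(sample)
    values, counts = np.unique(sample, return_counts=True)  # k = count of ties per value
    return values, np.cumsum(counts) / sample.size          # cumulative sum of k/n


values, F = ecdf([3, 1, 2, 2])
print(values)   # distinct observations
print(F)        # F_n at those observations
```

With \(n=4\), the value 2 appears twice, so \(F_n\) jumps by \(2/4\) there, going from 0.25 to 0.75.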

Next, let us have a look at a plot based on the contingency table shown above. We will now use the seaborn package, which simplifies the plotting of statistical information.

TODO: add examples with sns.histplot()

import seaborn as sns
sns.countplot(x='situation',hue='gender',data=nutri,hue_order=['Male','Female'],palette=['green','red'],saturation=1,edgecolor='black')
plt.legend(loc='upper right')
plt.xlabel('Situation')
plt.ylabel('Counts')
plt.show()
[Figure: count plot of situation, split by gender]

Scatter plots are useful to visualize patterns between quantitative features.

plt.scatter(nutri.height,nutri.weight,s=12,marker='o')
plt.xlabel('height')
plt.ylabel('weight')
plt.show()
[Figure: scatter plot of weight against height]

We can build more sophisticated plots and fit lines to the data to compare groups visually. In this example, we will use data from Vincent Arel-Bundock's Rdatasets repository on the birth weights of babies of smoking and non-smoking mothers, and plot the data against the age of the mother. We will deal with the significance of the results in upcoming sessions.

urlprefix = 'http://vincentarelbundock.github.io/Rdatasets/csv/'
dataname = 'MASS/birthwt.csv'
bwt = pd.read_csv(urlprefix + dataname)
bwt = bwt.drop('rownames', axis=1)  # drop the redundant row-number column
bwt
     low  age  lwt  race  ...   ht   ui  ftv   bwt
  0    0   19  182     2  ...    0    1    0  2523
  1    0   33  155     3  ...    0    0    3  2551
  2    0   20  105     1  ...    0    0    1  2557
  3    0   21  108     1  ...    0    1    2  2594
  4    0   18  107     1  ...    0    1    0  2600
...  ...  ...  ...   ...  ...  ...  ...  ...   ...
184    1   28   95     1  ...    0    0    2  2466
185    1   14  100     3  ...    0    0    2  2495
186    1   23   94     3  ...    0    0    0  2495
187    1   17  142     2  ...    1    0    0  2495
188    1   21  130     1  ...    1    0    3  2495

189 rows × 10 columns

styles = {0: ['o', 'red', 'non-smokers'], 1: ['^', 'blue', 'smokers']}
for k in styles:
    grp = bwt[bwt.smoke == k]
    m, b = np.polyfit(grp.age, grp.bwt, 1)  # fit a straight line
    # label only the scatter points, so that the legend has one entry per group
    plt.scatter(grp.age, grp.bwt, c=styles[k][1], s=15, linewidth=0,
                marker=styles[k][0], label=styles[k][2])
    plt.plot(grp.age, m*grp.age + b, '-', color=styles[k][1])

plt.xlabel('age')
plt.ylabel('birth weight (g)')
plt.legend(prop={'size': 8}, loc=(0.7, 0.3))
plt.show()
[Figure: birth weight against age of the mother for smokers and non-smokers, with fitted lines]
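To make clear what np.polyfit with degree 1 returns, the following sketch fits a line to five made-up points and compares the result with the closed-form least-squares formulas, \(m = \sum (x_i-\bar{x})(y_i-\bar{y}) / \sum (x_i-\bar{x})^2\) and \(b = \bar{y} - m\bar{x}\). The data are invented for illustration only.

```python
import numpy as np

# Made-up points lying roughly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# np.polyfit returns the coefficients from highest to lowest degree: slope, intercept
m, b = np.polyfit(x, y, 1)

# Closed-form least-squares estimates for comparison
m_formula = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b_formula = y.mean() - m_formula * x.mean()

y_hat = np.polyval([m, b], x)   # fitted values on the line
print(round(m, 3), round(b, 3))
print(round(m_formula, 3), round(b_formula, 3))
```

Both approaches give a slope of about 1.97 and an intercept of about 1.06 for these points.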
[1]

Dirk P. Kroese, Zdravko Botev, Thomas Taimre, and Radislav Vaisman. Data Science and Machine Learning: Mathematical and Statistical Methods. Machine Learning & Pattern Recognition. Chapman & Hall/CRC, 2020. URL: https://acems.org.au/data-science-machine-learning-book-available-download (visited on 2023-08-15).