UNIT 1. Import and Visualization of data with Python

2. UNIT 1. Import and Visualization of data with Python#

This unit collects some quick recipes for importing and representing data in Python, closely following [1] and the corresponding GitHub repository.

NOTE: since this is a training tool, we assume the user has already installed all the required packages. If a package is missing, install it from a terminal with your preferred package manager.

These notebooks have been tested on Ubuntu Linux, in most cases using Anaconda as the package manager. The notebooks are self-contained.

Table of contents

UNIT 1. Import and Visualization of data with Python
- Retrieving data
- Reformatting tables
- Structuring features
- Summary tables and statistics
- Data visualization
  - Qualitative variables
  - Quantitative variables

2.1. Retrieving data#

The files containing the data to be analyzed are typically stored in comma-separated values (CSV) format. To work with them, we first need to download the data. Before doing so, we make sure that a suitable folder exists to hold the downloaded file.

import os

input_directory = "datasets"
output_directory = 'output'

# Check if the directories exist
if not os.path.exists(input_directory):
    # If it doesn't exist, create it
    os.makedirs(input_directory)
else:
    print('folder ',input_directory,' exists')

if not os.path.exists(output_directory):
    # If it doesn't exist, create it
    os.makedirs(output_directory)
else:
    print('folder ',output_directory,' exists')

# now use wget to download the file into the datasets folder
!wget -P $input_directory -nc https://archive.ics.uci.edu/static/public/109/wine.zip 

folder  datasets  exists
folder  output  exists
--2023-09-25 14:48:36--  https://archive.ics.uci.edu/static/public/109/wine.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)… 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443… connected.
HTTP request sent, awaiting response… 200 OK
Length: unspecified
Saving to: ‘datasets/wine.zip’

wine.zip                [ <=>                ]   5,90K  --.-KB/s    in 0s

2023-09-25 14:48:37 (68,9 MB/s) - ‘datasets/wine.zip’ saved [6038]
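For readers who prefer to stay within Python instead of calling wget, the same preparation can be sketched with the standard library alone. This is only a minimal sketch and not part of the original notebook: the helper names ensure_folder and download_if_missing are hypothetical, and the download helper is defined but not called here, so the example runs without network access.

```python
import os
import tempfile
from urllib.request import urlretrieve


def ensure_folder(path):
    """Create the folder if needed; return True if it already existed."""
    existed = os.path.isdir(path)
    os.makedirs(path, exist_ok=True)  # no error if the folder is already there
    return existed


def download_if_missing(url, folder):
    """Download url into folder unless the file is already present (like wget -nc)."""
    ensure_folder(folder)
    target = os.path.join(folder, os.path.basename(url))
    if not os.path.exists(target):
        urlretrieve(url, target)
    return target


# Demonstration on a temporary folder, so nothing is written to the project
with tempfile.TemporaryDirectory() as tmp:
    demo_folder = os.path.join(tmp, "datasets")
    first = ensure_folder(demo_folder)   # folder is created on this call
    second = ensure_folder(demo_folder)  # folder already exists now
    print(first, second)
```

With these helpers, a call such as download_if_missing(url, input_directory) would play the role of the wget line, under the assumption that the URL ends in the file name.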

Now we extract the data files from the zip archive with the zipfile module and read the data with pandas. Information about the data can be found in the UCI Machine Learning Repository.

import pandas as pd
from zipfile import ZipFile

with ZipFile('datasets/wine.zip', 'r') as f:
    # extract only the two needed files into the datasets folder
    f.extractall(input_directory, members=['wine.names', 'wine.data'])

wine = pd.read_csv('datasets/wine.data',header=None)
wine.head()

	0	1	2	3	4	5	6	7	8	9	10	11	12	13
0	1	14.23	1.71	2.43	15.6	127	2.80	3.06	0.28	2.29	5.64	1.04	3.92	1065
1	1	13.20	1.78	2.14	11.2	100	2.65	2.76	0.26	1.28	4.38	1.05	3.40	1050
2	1	13.16	2.36	2.67	18.6	101	2.80	3.24	0.30	2.81	5.68	1.03	3.17	1185
3	1	14.37	1.95	2.50	16.8	113	3.85	3.49	0.24	2.18	7.80	0.86	3.45	1480
4	1	13.24	2.59	2.87	21.0	118	2.80	2.69	0.39	1.82	4.32	1.04	2.93	735

As the file does not include the names of the columns, we assign them manually, based on the information given on the dataset's web page.

wine.columns = ['class','Alcohol','Maliacid','Ash','Alcalinity_of_ash','Magnesium','Total_phenols','Flavanoids','Nonflavonoid_phenols','Proanthocyanins','Color_intensity','Hue','0D280_0D315_of_diluted_wines','Proline']
wine.head()

	class	Alcohol	Maliacid	Ash	Alcalinity_of_ash	Magnesium	Total_phenols	Flavanoids	Nonflavonoid_phenols	Proanthocyanins	Color_intensity	Hue	0D280_0D315_of_diluted_wines	Proline
0	1	14.23	1.71	2.43	15.6	127	2.80	3.06	0.28	2.29	5.64	1.04	3.92	1065
1	1	13.20	1.78	2.14	11.2	100	2.65	2.76	0.26	1.28	4.38	1.05	3.40	1050
2	1	13.16	2.36	2.67	18.6	101	2.80	3.24	0.30	2.81	5.68	1.03	3.17	1185
3	1	14.37	1.95	2.50	16.8	113	3.85	3.49	0.24	2.18	7.80	0.86	3.45	1480
4	1	13.24	2.59	2.87	21.0	118	2.80	2.69	0.39	1.82	4.32	1.04	2.93	735
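The column names can also be supplied while reading, through the names parameter of pandas.read_csv. The sketch below illustrates this on a small in-memory excerpt (the first three fields of the first three rows shown above) rather than on the downloaded file, so the variable names raw, cols and demo are only illustrative.

```python
import io

import pandas as pd

# First three fields of the first rows shown above, as an in-memory "file"
raw = "1,14.23,1.71\n1,13.20,1.78\n1,13.16,2.36\n"
cols = ['class', 'Alcohol', 'Maliacid']

# header=None marks the data as header-less; names= assigns the labels directly
demo = pd.read_csv(io.StringIO(raw), header=None, names=cols)
print(demo)
```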

Alternatively, we can read the CSV directly from its URL, without downloading it first. We will use Fisher’s iris data. In this case the file already contains the names of the columns (features), so they are read directly from the dataset.

dataname = 'https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv'

iris = pd.read_csv(dataname)
iris.head()

	rownames	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
0	1	5.1	3.5	1.4	0.2	setosa
1	2	4.9	3.0	1.4	0.2	setosa
2	3	4.7	3.2	1.3	0.2	setosa
3	4	4.6	3.1	1.5	0.2	setosa
4	5	5.0	3.6	1.4	0.2	setosa

As the first column merely duplicates the index, we can remove it. Note that drop returns a new dataframe, so we assign the result back to iris:

iris = iris.drop('rownames', axis=1)
iris

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa
...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	virginica
146	6.3	2.5	5.0	1.9	virginica
147	6.5	3.0	5.2	2.0	virginica
148	6.2	3.4	5.4	2.3	virginica
149	5.9	3.0	5.1	1.8	virginica

150 rows × 5 columns
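Another option, sketched here under the assumption that the redundant column is wanted as the row index rather than discarded, is to declare it at read time with the index_col parameter of pandas.read_csv. The excerpt below is a shortened, in-memory stand-in for the iris file.

```python
import io

import pandas as pd

# Shortened stand-in for the iris CSV (two rows, two of the original columns)
raw = (
    "rownames,Sepal.Length,Species\n"
    "1,5.1,setosa\n"
    "2,4.9,setosa\n"
)

# index_col turns the 'rownames' column into the index instead of a feature
demo = pd.read_csv(io.StringIO(raw), index_col='rownames')
print(demo)
```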

2.2. Reformatting tables#

Sometimes the data comes in odd formats that are not convenient for analysis.

For example, consider the table with scores given here, with values obtained before and after some particular training:

Student	Before	After
1	75	85
2	30	50
3	100	100
4	50	52
5	60	65

We want to reformat it so that all the scores are in a single column. Let us create a table with three features: Score (continuous data), Time (before or after) and Student (integer values from 1 to 5). We will use the pandas.melt function, which is designed specifically for this purpose.

import pandas as pd

# manually create the dataframe with the data from the table
values = [[1, 75, 85], [2, 30, 50], [3, 100, 100], [4, 50, 52], [5, 60, 65]]
df = pd.DataFrame(values, columns=['Student', 'Before', 'After'])
# reshape to long format: one row per (student, time) pair
df = pd.melt(df, id_vars=['Student'], var_name='Time', value_vars=['Before', 'After'])
print(df)

   Student    Time  value
0        1  Before     75
1        2  Before     30
2        3  Before    100
3        4  Before     50
4        5  Before     60
5        1   After     85
6        2   After     50
7        3   After    100
8        4   After     52
9        5   After     65
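The reshaping can be reversed with DataFrame.pivot, which is a convenient way to check that no information was lost. The following is a minimal sketch on the first three students of the table; the value_name argument is used here so that the melted column is called Score, and the names wide, long_df and back are illustrative.

```python
import pandas as pd

wide = pd.DataFrame({'Student': [1, 2, 3],
                     'Before': [75, 30, 100],
                     'After': [85, 50, 100]})

# wide -> long: one row per (Student, Time) pair
long_df = pd.melt(wide, id_vars=['Student'], value_vars=['Before', 'After'],
                  var_name='Time', value_name='Score')

# long -> wide again: Time values become columns, Score fills the cells
back = long_df.pivot(index='Student', columns='Time', values='Score')
back = back[['Before', 'After']].reset_index()  # restore the column order
back.columns.name = None                        # drop the leftover 'Time' label
print(back)
```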

2.3. Structuring features#

Types of features:

- Quantitative (taking either discrete or continuous values).
- Qualitative: their values can be divided into a fixed number of categories, which is why they are also referred to as categorical features or factors.

Here we will work with data from []. In particular, we will download a file containing nutritional measurements of thirteen features (columns) for 226 elderly individuals (rows). Note that this time the file is an Excel file.

xls = 'http://www.biostatisticien.eu/springeR/nutrition_elderly.xls'
nutri = pd.read_excel(xls)

pd.set_option('display.max_columns', 8)
nutri.head()

	gender	situation	tea	coffee	...	raw_fruit	cooked_fruit_veg	chocol	fat
0	2	1	0	0	...	1	4	5	6
1	2	1	1	1	...	5	5	1	4
2	2	1	0	4	...	5	2	5	4
3	2	1	0	0	...	4	0	3	2
4	2	1	2	1	...	5	5	3	2

5 rows × 13 columns

To check the structure of the data we can use the info method of the pandas dataframe; its output matches the description given in the original source:

[Image: description of the features in the original source]

Note that the features can be classified as

- qualitative
  - ordinal (meat, fish, raw_fruit, cooked_fruit_veg, chocol)
  - nominal (gender, situation, fat)
- quantitative
  - discrete (tea, coffee)
  - continuous (height, weight, age)

nutri.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226 entries, 0 to 225
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   gender            226 non-null    int64
 1   situation         226 non-null    int64
 2   tea               226 non-null    int64
 3   coffee            226 non-null    int64
 4   height            226 non-null    int64
 5   weight            226 non-null    int64
 6   age               226 non-null    int64
 7   meat              226 non-null    int64
 8   fish              226 non-null    int64
 9   raw_fruit         226 non-null    int64
 10  cooked_fruit_veg  226 non-null    int64
 11  chocol            226 non-null    int64
 12  fat               226 non-null    int64
dtypes: int64(13)
memory usage: 23.1 KB

Let us now change the values and types of several features by means of the replace and astype methods, and finally save the data in CSV format.

# gender, situation, meat, fish, raw_fruit, cooked_fruit_veg, chocol and fat are categorical features
DICT = {1:'Male',2:'Female'}
nutri['gender'] = nutri['gender'].replace(DICT).astype('category')

DICT = {1:'single',2:'couple',3:'family',4:'other'}
nutri['situation'] = nutri['situation'].replace(DICT).astype('category')

DICT = {0:'never',1:'less than once a week',2:'once a week',3:'2-3 times a week',4:'4-6 times a week', 5:'every day'}
nutri['meat'] = nutri['meat'].replace(DICT).astype('category')
nutri['fish'] = nutri['fish'].replace(DICT).astype('category')
nutri['raw_fruit'] = nutri['raw_fruit'].replace(DICT).astype('category')
nutri['cooked_fruit_veg'] = nutri['cooked_fruit_veg'].replace(DICT).astype('category')
nutri['chocol'] = nutri['chocol'].replace(DICT).astype('category')

DICT = {1:'Butter',2:'Margarine',3:'Peanut oil', 4:'Sunflower oil', 5:'Olive oil', 6:'Mix of vegetable oils', 7:'Colza oil',8:'Duck or goose fat'}
nutri['fat'] = nutri['fat'].replace(DICT).astype('category')

# tea and coffee are integer
nutri['tea'] = nutri['tea'].astype(int)
nutri['coffee'] = nutri['coffee'].astype(int)

# height, weight and age are float
nutri['height'] = nutri['height'].astype(float)
nutri['weight'] = nutri['weight'].astype(float)
nutri['age'] = nutri['age'].astype(float)

nutri.head()

	gender	situation	tea	coffee	...	raw_fruit	cooked_fruit_veg	chocol	fat
0	Female	single	0	0	...	less than once a week	4-6 times a week	every day	Mix of vegetable oils
1	Female	single	1	1	...	every day	every day	less than once a week	Sunflower oil
2	Female	single	0	4	...	every day	once a week	every day	Sunflower oil
3	Female	single	0	0	...	4-6 times a week	never	2-3 times a week	Margarine
4	Female	single	2	1	...	every day	every day	2-3 times a week	Margarine

5 rows × 13 columns

nutri.info()

nutri.to_csv('output/nutri.csv',index=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226 entries, 0 to 225
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   gender            226 non-null    category
 1   situation         226 non-null    category
 2   tea               226 non-null    int64   
 3   coffee            226 non-null    int64   
 4   height            226 non-null    float64 
 5   weight            226 non-null    float64 
 6   age               226 non-null    float64 
 7   meat              226 non-null    category
 8   fish              226 non-null    category
 9   raw_fruit         226 non-null    category
 10  cooked_fruit_veg  226 non-null    category
 11  chocol            226 non-null    category
 12  fat               226 non-null    category
dtypes: category(8), float64(3), int64(2)
memory usage: 12.4 KB
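The astype('category') conversion used above produces unordered categories. For ordinal features such as meat or fish, one possible refinement is to declare the order explicitly with pandas.CategoricalDtype, so that comparisons and min/max follow the frequency scale. This is a sketch on a few hypothetical coded answers, not a change to the nutri dataframe.

```python
import pandas as pd

# Frequency scale in increasing order, as in the dictionary used for meat, fish, etc.
levels = ['never', 'less than once a week', 'once a week',
          '2-3 times a week', '4-6 times a week', 'every day']
freq_type = pd.CategoricalDtype(categories=levels, ordered=True)

codes = pd.Series([5, 0, 3, 1])  # hypothetical coded answers
labels = codes.map(dict(enumerate(levels))).astype(freq_type)

# With ordered categories, min/max and comparisons respect the scale
print(labels.min(), '|', labels.max())
print((labels >= 'once a week').tolist())
```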

2.4. Summary tables and statistics#

It is extremely important to know your data before carrying out any analysis. Apart from the descriptive tools shown above, descriptive statistics can be obtained from pandas with two simple methods: describe and value_counts. Note that, when applied to a single column, both describe and value_counts return a pandas series, that is, a one-dimensional array with axis labels. Note also that the category type is not stored in the CSV file, so after reloading the data the qualitative columns are read back as plain objects.

nutri = pd.read_csv('output/nutri.csv')
nutri['fat'].describe()

count               226
unique                8
top       Sunflower oil
freq                 68
Name: fat, dtype: object
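The dtype: object in the output above reflects that a CSV file stores only text, so the category type is lost on reloading. If the categorical type is wanted back, it can be requested through the dtype parameter of pandas.read_csv. The sketch below uses a tiny hypothetical dataframe and an in-memory buffer instead of output/nutri.csv.

```python
import io

import pandas as pd

# Tiny hypothetical dataframe with one categorical column
df = pd.DataFrame({'fat': ['Butter', 'Olive oil', 'Butter']}).astype({'fat': 'category'})

buffer = io.StringIO()
df.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)                            # type information is lost
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={'fat': 'category'})  # type requested explicitly

print(plain['fat'].dtype, typed['fat'].dtype)
```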

nutri['fat'].value_counts()

fat
Sunflower oil            68
Peanut oil               48
Olive oil                40
Margarine                27
Mix of vegetable oils    23
Butter                   15
Duck or goose fat         4
Colza oil                 1
Name: count, dtype: int64
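When proportions are more informative than raw counts, value_counts accepts normalize=True. A minimal sketch on a hypothetical four-element series:

```python
import pandas as pd

fats = pd.Series(['Butter', 'Olive oil', 'Olive oil', 'Margarine'])  # hypothetical sample

# normalize=True divides each count by the total number of observations
shares = fats.value_counts(normalize=True)
print(shares)
```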

We can also generate a contingency table by cross-tabulating two or more variables:

pd.crosstab(nutri.gender, nutri.situation, margins=True)  # the margins argument adds the row/column totals

situation	couple	family	single	All
gender
Female	56	7	78	141
Male	63	2	20	85
All	119	9	98	226
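Because the two genders have different totals, row-wise proportions are often easier to compare than counts. pandas.crosstab supports this through its normalize parameter. The sketch below rebuilds a dataframe from the counts in the table above (it does not use the nutri file itself), so the names counts, people and shares are illustrative.

```python
import pandas as pd

# Counts taken from the contingency table above
counts = {('Female', 'couple'): 56, ('Female', 'family'): 7, ('Female', 'single'): 78,
          ('Male', 'couple'): 63, ('Male', 'family'): 2, ('Male', 'single'): 20}

# One row per individual, repeating each (gender, situation) pair n times
rows = [pair for pair, n in counts.items() for _ in range(n)]
people = pd.DataFrame(rows, columns=['gender', 'situation'])

# normalize='index' divides each row by its total, so every row sums to 1
shares = pd.crosstab(people.gender, people.situation, normalize='index')
print(shares.round(3))
```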

We can now turn to descriptive statistics such as the sample mean, $\bar{x}$: $$\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i$$

nutri.height.mean()

163.96017699115043

Another useful statistic is the sample $p$-quantile of $\mathbf{x}$, with $0<p<1$, which is a value $x$ such that at least a fraction $p$ of the data is less than or equal to $x$ and at least a fraction $1-p$ is greater than or equal to $x$. Thus, the sample median is the sample 0.5-quantile. For our nutri data, the first, second and third quartiles (the 25, 50 and 75 percentiles of the data) can be obtained as:

nutri.height.quantile(q=[0.25,0.5,0.75])

0.25    157.0
0.50    163.0
0.75    170.0
Name: height, dtype: float64

If, in addition to the location of the data, we are interested in its dispersion, we can start by computing the range:

nutri.height.max()-nutri.height.min()

48.0

Next, we can calculate the sample variance as $$s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2$$

round(nutri.height.var(),2)

81.06

or the sample standard deviation, $s=\sqrt{s^2}$, which is available directly through the std method:

round(nutri.height.std(),2)

9.0

or simply use the describe method shown above, taking advantage of the quantitative type of the height data:

nutri.height.describe()

count    226.000000
mean     163.960177
std        9.003368
min      140.000000
25%      157.000000
50%      163.000000
75%      170.000000
max      188.000000
Name: height, dtype: float64
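The formulas for the sample mean, variance and standard deviation given above can be checked by hand on a small example. The sketch below uses four hypothetical heights; it also illustrates a common pitfall, namely that pandas divides by $n-1$ by default whereas numpy.var divides by $n$ unless ddof=1 is given.

```python
import numpy as np
import pandas as pd

x = np.array([160.0, 170.0, 165.0, 155.0])  # hypothetical heights in cm
n = len(x)

mean = x.sum() / n                       # sample mean
s2 = ((x - mean) ** 2).sum() / (n - 1)   # sample variance with the n-1 divisor
s = np.sqrt(s2)                          # sample standard deviation

print(mean, s2, s)
print(pd.Series(x).var(), np.var(x, ddof=1), np.var(x))  # the last one divides by n
```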

2.5. Data visualization#

The appropriate visualization depends on the data type. To use the visualization capabilities of Python, we first import the matplotlib.pyplot module, as well as numpy, in addition to the already imported pandas module.

import matplotlib.pyplot as plt
import numpy as np

2.5.1. Qualitative variables#

Qualitative data can be shown in a simple bar plot. Since the category (x-axis) is not a numerical value per se, we have to place the different categories manually in the plot.

width = 0.35  # the width of the bars
x = [0, 0.8, 1.6]    # position of the bars
situation_counts = nutri.situation.value_counts()  # note that situation_counts is a pandas series
plt.bar(x,situation_counts,width,edgecolor='black')
plt.xticks(x,situation_counts.index)
plt.show()

../_images/506619d4c523f0412c455859d9db9e195a45e93852629ef2e7decbc3321f8015.png

2.5.2. Quantitative variables#

This type of variable allows for more complex graphical representations. We will see some of the possibilities. We are particularly interested in visualizing the location, dispersion and shape of the data.

The first example is the boxplot, which gives information on the location and dispersion of the data and also shows outliers, that is, data $x_i$ lying beyond the whiskers of the boxplot: $x_i<Q_1-1.5 (Q_3-Q_1)$ or $x_i>Q_3+1.5 (Q_3-Q_1)$, where $Q_3-Q_1$ is called the interquartile range (IQR).

plt.boxplot(nutri.age, widths=width, vert=False)  # vert=False draws the boxplot horizontally
plt.xlabel('age')
plt.show()

../_images/ac8d5d13cd1b2a9ce51db49e2178e3afcf41403b7f7982a81d6cdf3f54221e80.png
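The outlier rule stated above can be reproduced numerically. The following is a minimal sketch on a small hypothetical sample, assuming numpy's default (linear interpolation) definition of the quartiles; other quantile definitions may give slightly different fences.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])  # hypothetical data with one extreme value

q1, q3 = np.quantile(x, [0.25, 0.75])  # first and third quartiles
iqr = q3 - q1                          # interquartile range
lower = q1 - 1.5 * iqr                 # lower fence
upper = q3 + 1.5 * iqr                 # upper fence

outliers = x[(x < lower) | (x > upper)]
print(q1, q3, iqr, outliers)
```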

In a slightly more complex setup, we can compare two different categories in the boxplot:

males = nutri[nutri.gender == 'Male']
females = nutri[nutri.gender == 'Female']
plt.boxplot([males.coffee, females.coffee], notch=True, widths=(0.5, 0.5))
plt.xlabel('gender')
plt.ylabel('coffee')
plt.xticks([1, 2], ['Male', 'Female'])
plt.show()

../_images/9808ce2064fffb0982d9eaeb92fd29c06a6807b060329679f7118d2cd761d50f.png

We can also show the distribution of the data using a histogram, after breaking the data into bins or classes:

plt.hist(nutri.age,bins=9,facecolor='green',edgecolor='black',linewidth=1)
plt.xlabel('Age')
plt.ylabel('Quantity')
plt.show()

../_images/de58b866c659774a45f452e8661b7bee4e9fde5c0175189478ec14ddc0f02306.png

Alternatively, we can weight the values to show $\frac{counts}{total}$ by multiplying every count by $1/226$, the inverse of the number of observations.

weights = np.ones_like(nutri.age)/nutri.age.count()
plt.hist(nutri.age,bins=9,weights=weights,facecolor='green',edgecolor='black',linewidth=1)
plt.xlabel('Age')
plt.ylabel('Proportion of total')
plt.show()

../_images/8e81a3dc123628df0b758f11a9146248f3f78281c742f7ab9fd27dd55aff7b8c.png
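To see why these weights turn counts into proportions, the bar heights can be computed without plotting through numpy.histogram, which accepts the same bins and weights arguments. The ages below are randomly generated stand-ins, not the nutri data.

```python
import numpy as np

rng = np.random.default_rng(0)
ages = rng.integers(65, 92, size=226).astype(float)  # hypothetical ages

# Each observation contributes 1/n, so the bar heights are proportions
weights = np.ones_like(ages) / ages.size
heights, edges = np.histogram(ages, bins=9, weights=weights)

print(heights.round(3), heights.sum())
```

Since every observation falls in exactly one bin, the heights add up to one.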

The empirical cumulative distribution function, $F_n$, is a step function that jumps by $k/n$ at each observed value, where $k$ is the number of tied observations at that value: $$F_n(x)=\frac{\mathrm{number \; of \; }x_i \leq x}{n}$$ The function can be defined and plotted in the same way for both discrete and continuous data.

x = np.sort(nutri.age)
y = np.linspace(0,1,len(nutri.age))
plt.xlabel('Age')
plt.ylabel('$F_n(x)$')
plt.step(x,y)
plt.xlim(x.min(),x.max())
plt.show()

../_images/5c09c0171d2460204a3c45e1fac44b9fb9b6778d7af953e12ef3224554383f20.png
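The plot above places the sorted data on an evenly spaced grid between 0 and 1, which is a close approximation for large samples. The definition of $F_n$ can also be evaluated exactly, including ties, as in the sketch below; the function name ecdf and the four sample values are hypothetical.

```python
import numpy as np


def ecdf(sample):
    """Return the distinct observed values and F_n evaluated at each of them."""
    sample = np.asarray(sample)
    values, counts = np.unique(sample, return_counts=True)  # sorted distinct values
    return values, np.cumsum(counts) / sample.size          # (number of x_i <= x) / n


values, F = ecdf([70, 72, 72, 75])  # hypothetical ages with one tie
print(values, F)
```

At the tied value 72 the function jumps by $2/4$, as expected. Passing these arrays to plt.step with where='post' would draw the corresponding step function.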

Next, let us have a look at a plot that takes into account the contingency table shown above. We will make use now of the seaborn package, which simplifies the plotting of statistical information.

TODO: add examples with sns.histplot()

import seaborn as sns
sns.countplot(x='situation',hue='gender',data=nutri,hue_order=['Male','Female'],palette=['green','red'],saturation=1,edgecolor='black')
plt.legend(loc='upper right')
plt.xlabel('Situation')
plt.ylabel('Counts')
plt.show()

../_images/e9a26abe38a2259ab26dc8047e62afbdbe5759286d3b586397751f4b4f3d8679.png

Scatter plots are useful to visualize patterns between quantitative features.

plt.scatter(nutri.height,nutri.weight,s=12,marker='o')
plt.xlabel('height')
plt.ylabel('weight')
plt.show()

../_images/7133db2f7c837002dfa12547e2cf32443eee0e79b8af73aa4e13de874966a53f.png

We can produce more sophisticated plots and fit lines to the data to compare groups visually. In this example, we will use data from Vincent Arel-Bundock's Rdatasets repository on the birth weights of babies of smoking and non-smoking mothers, and plot the birth weight against the age of the mother. We will deal with the significance of the results in upcoming sessions.

urlprefix = 'http://vincentarelbundock.github.io/Rdatasets/csv/'
dataname = 'MASS/birthwt.csv'
bwt = pd.read_csv(urlprefix + dataname)
bwt = bwt.drop('rownames', axis=1)  # drop the redundant index column
bwt

	low	age	lwt	race	...	ht	ui	ftv	bwt
0	0	19	182	2	...	0	1	0	2523
1	0	33	155	3	...	0	0	3	2551
2	0	20	105	1	...	0	0	1	2557
3	0	21	108	1	...	0	1	2	2594
4	0	18	107	1	...	0	1	0	2600
...	...	...	...	...	...	...	...	...	...
184	1	28	95	1	...	0	0	2	2466
185	1	14	100	3	...	0	0	2	2495
186	1	23	94	3	...	0	0	0	2495
187	1	17	142	2	...	1	0	0	2495
188	1	21	130	1	...	1	0	3	2495

189 rows × 10 columns

styles = {0: ['o', 'red', 'non-smokers'], 1: ['^', 'blue', 'smokers']}
for k in styles:
    grp = bwt[bwt.smoke == k]
    m, b = np.polyfit(grp.age, grp.bwt, 1)  # fit a straight line
    # label the points explicitly so the legend does not depend on drawing order
    plt.scatter(grp.age, grp.bwt, c=styles[k][1], s=15, linewidth=0,
                marker=styles[k][0], label=styles[k][2])
    plt.plot(grp.age, m*grp.age + b, '-', color=styles[k][1])

plt.xlabel('age')
plt.ylabel('birth weight (g)')
plt.legend(prop={'size': 8}, loc=(0.7, 0.3))
plt.show()

../_images/790eb0bba06ebe0c146cc502a8aa2bd8ac7f26ddd312c655623d05b07e964304.png
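To make the role of np.polyfit in the loop above more concrete, the sketch below fits a degree-1 polynomial to hypothetical points that lie exactly on a known line. The function returns the coefficients from the highest degree down, that is, the slope first and the intercept second, and np.polyval evaluates the fitted line.

```python
import numpy as np

age = np.array([20.0, 25.0, 30.0, 35.0])
weight = 3000.0 - 10.0 * age           # hypothetical points on an exact line

m, b = np.polyfit(age, weight, 1)      # slope and intercept of the least-squares line
fitted = np.polyval([m, b], age)       # same as m*age + b

print(round(m, 6), round(b, 6))
```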

[1]

Dirk P. Kroese, Zdravko Botev, Thomas Taimre, and Radislav Vaisman. Data Science and Machine Learning: Mathematical and Statistical Methods. Machine Learning & Pattern Recognition. Chapman & Hall/CRC, 2020. URL: https://acems.org.au/data-science-machine-learning-book-available-download (visited on 2023-08-15).
