5. UNIT 4. Unsupervised Learning#

This Unit provides an introduction to unsupervised learning, strongly based on [1].

Table of contents

  • UNIT 4. Unsupervised Learning

    • EM: Univariate problem

    • EM: Multivariate case

    • DBSCAN

    • Hierarchical clustering

      • Agglomerative Hierarchical Clustering

      • Divisive Hierarchical Clustering

    • PCA with Singular Value Decomposition

5.1. EM: Univariate problem#

from numpy import hstack
from numpy.random import normal
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt
# generate a sample
X1 = normal(loc=20, scale=5, size=3000)
X2 = normal(loc=40, scale=5, size=7000)
X = hstack((X1, X2))
# plot the histogram
plt.hist(X, bins=50, density=True)
plt.show()
# reshape into a table with one column
X = X.reshape((len(X), 1))
# fit model
model = GaussianMixture(n_components=2, init_params='random')
model.fit(X)
yhat = model.predict(X) # predict latent values
print(yhat[:100])       # check latent value for first few points
print(yhat[-100:])      # check latent value for last few points
../_images/97841a3c7688a9655152d95d96c5c785af317d57e99b7b988897480901178a65.png
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
[0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0
 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0
 0 0 0 0 0 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0]
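
We can also inspect the fitted mixture parameters through the model's weights_, means_ and covariances_ attributes and compare them with the true generating values (weights 0.3/0.7, means 20/40, standard deviation 5). A quick check (the order of the two components is arbitrary):

import numpy as np
print(model.weights_)                       # mixture weights, close to (0.3, 0.7)
print(model.means_.ravel())                 # component means, close to (20, 40)
print(np.sqrt(model.covariances_).ravel())  # component standard deviations, close to (5, 5)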

5.2. EM: Multivariate case#

We will now code a step-by-step example of the EM algorithm.

We start by retaking the data from the K-Means experiment used in UNIT 2. This time, we are going to use the Expectation-Maximization (EM) algorithm.

In this exercise, we use 300 data points that were independently generated from three bivariate normal distributions with the following parameters:

| cluster | mean vector | covariance matrix |
|---------|-------------|-------------------|
| 1 | \(\begin{pmatrix}-4\\0\end{pmatrix}\) | \(\begin{pmatrix}2&1.4\\1.4&1.5\end{pmatrix}\) |
| 2 | \(\begin{pmatrix}0.5\\-1\end{pmatrix}\) | \(\begin{pmatrix}2&-0.95\\-0.95&1\end{pmatrix}\) |
| 3 | \(\begin{pmatrix}-1.5\\-3\end{pmatrix}\) | \(\begin{pmatrix}2&0.1\\0.1&0.1\end{pmatrix}\) |

import numpy as np
from scipy.stats import multivariate_normal
import matplotlib.pyplot as plt

#Xmat = np.genfromtxt('datasets/clusterdata.csv', delimiter=',')

mean = []
cov = []
xmat = []


# define the means and covariances of the three pieces of the dataset
mean.append(np.array([-4, 0]))
cov.append(np.array([[2,1.4],[1.4,1.5]]))

mean.append(np.array([0.5, -1]))
cov.append(np.array([[2,-0.95],[-0.95,1]]))

mean.append(np.array([-1.5, -3]))
cov.append(np.array([[2,0.1],[0.1,0.1]]))

c=['b','r','g']
for i in range(3):
    data=np.random.multivariate_normal(mean[i],cov[i],100)
    xmat.append(data)
    plt.scatter(data[:,0],data[:,1],c=c[i])     # plot each sub dataset

Xmat = np.concatenate((xmat[0],xmat[1],xmat[2]),axis=0)

K = 3                   # number of clusters
n, D = Xmat.shape       # number of points and data dimension

# plot the contours of the three Gaussians

for i in range(3):
    N    = 200
    X    = np.linspace(-8, 4, N)
    Y    = np.linspace(-4, 4, N)
    X, Y = np.meshgrid(X, Y)
    pos  = np.dstack((X, Y))
    rv   = multivariate_normal(mean[i], cov[i])
    Z    = rv.pdf(pos)
    plt.contour(X, Y, Z,levels=2)

plt.show()
../_images/6c964bb7af1f5b91944fd7334f50e8c6762cf2e730139a4d259ee9c91a1af005.png

We will apply the Expectation-Maximization algorithm by introducing auxiliary (latent) variables. We use an initial guess for the mixture weights \(W\):

  1. The input is the data and an initial guess for the parameters: the weights of the Gaussians in the mixture, and the mean and covariance of each component.

  2. In the E-step we compute, for every point, the responsibility of each component; in the M-step we re-estimate the weights \(W\), the means \(\mu\) and the covariances \(\mathrm{Cov}\) from those responsibilities (see the updates below).

  3. We continue iterating until convergence is reached.
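
Concretely, writing \(w_k\), \(\boldsymbol\mu_k\), \(\boldsymbol\Sigma_k\) for the parameters of component \(k\) and \(\varphi\) for the multivariate normal density, each iteration of the code below performs the standard EM updates:

\[ p_{ki} = \frac{w_k\,\varphi(\mathbf{x}_i;\boldsymbol\mu_k,\boldsymbol\Sigma_k)}{\sum_{l=1}^{K} w_l\,\varphi(\mathbf{x}_i;\boldsymbol\mu_l,\boldsymbol\Sigma_l)} \qquad \text{(E-step)} \]

\[ w_k = \frac{1}{n}\sum_{i=1}^{n} p_{ki},\qquad \boldsymbol\mu_k = \frac{\sum_i p_{ki}\,\mathbf{x}_i}{\sum_i p_{ki}},\qquad \boldsymbol\Sigma_k = \frac{\sum_i p_{ki}\,(\mathbf{x}_i-\boldsymbol\mu_k)(\mathbf{x}_i-\boldsymbol\mu_k)^{\top}}{\sum_i p_{ki}} \qquad \text{(M-step)} \]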

# convergence is very sensitive to the choice of initial parameters,
# so we should try different initial possibilities
W = np.array([[1/3,1/3,1/3]])
M  = np.array([[-2.0,-4,0],[-3,1,-1]])  #stores the mu's as columns
C = np.zeros((3,2,2))

# initialize every covariance matrix to the identity
C[:,0,0] = 1
C[:,1,1] = 1

p = np.zeros((3,300))
print(p.shape)

for i in range(0, 100):     # iterate a fixed number of steps

    for k in range(0, K):   # E-step: responsibilities of each component
        mvn = multivariate_normal(M[:,k].T, C[k,:,:])
        p[k,:] = W[0,k]*mvn.pdf(Xmat)

    p = p/np.sum(p, axis=0)             # normalize the responsibilities

    W = np.mean(p, 1).reshape(1, 3)     # M-step: update the weights
    for k in range(0, K):               # update means and covariances
        M[:,k] = (Xmat.T @ p[k,:].T)/np.sum(p[k,:])
        xm = Xmat.T - M[:,k].reshape(2, 1)
        C[k,:,:] = xm @ (xm*p[k,:]).T/np.sum(p[k,:])

print(W)
print(M)
print(C)
(3, 300)
[[0.30811885 0.3417826  0.35009855]]
[[-1.54716006 -4.09550252  0.80081707]
 [-2.98874145 -0.04513584 -1.31832856]]
[[[ 2.42875572  0.16570491]
  [ 0.16570491  0.10175855]]

 [[ 1.80454936  1.12889005]
  [ 1.12889005  1.23348418]]

 [[ 1.94232761 -0.7793724 ]
  [-0.7793724   0.96187334]]]
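
As a sanity check, we can fit the same three-component mixture with scikit-learn's GaussianMixture; the recovered parameters should be close to ours (a sketch — the result depends on the random initialization, and the component order is arbitrary):

from sklearn.mixture import GaussianMixture

gm = GaussianMixture(n_components=3, n_init=10, random_state=0).fit(Xmat)
print(gm.weights_)   # compare with W above
print(gm.means_)     # compare with the columns of M (order may differ)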

5.3. DBSCAN#

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core samples in regions of high density and expands clusters from them. The algorithm works very well for data whose clusters have similar density.

In the example below we run DBSCAN on the data generated above. In doing so, we see that data without a clear density contrast between groups is difficult to cluster with DBSCAN, which, in turn, does a very good job with clearly separated (and grouped) data.

import numpy as np

from sklearn import metrics
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=1, min_samples=10).fit(Xmat)
labels = db.labels_

# Number of clusters in labels, ignoring noise if present (noisy samples are given the label -1)
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print("Estimated number of clusters: %d" % n_clusters_)
print("Estimated number of noise points: %d" % n_noise_)
Estimated number of clusters: 1
Estimated number of noise points: 10
unique_labels = set(labels)
core_samples_mask = np.zeros_like(labels, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True

colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = labels == k

    xy = Xmat[class_member_mask & core_samples_mask]
    plt.plot(
        xy[:, 0],
        xy[:, 1],
        "o",
        markerfacecolor=tuple(col),
        markeredgecolor="k",
        markersize=14,
    )

    xy = Xmat[class_member_mask & ~core_samples_mask]
    plt.plot(
        xy[:, 0],
        xy[:, 1],
        "o",
        markerfacecolor=tuple(col),
        markeredgecolor="k",
        markersize=6,
    )

plt.title(f"Estimated number of clusters: {n_clusters_}")
plt.show()
../_images/d284391c3a0a1651404c62de321cc06e12c53a505f99fce1f258c9bcec6bdc47.png
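
DBSCAN's behaviour depends strongly on eps and min_samples. A common heuristic for choosing eps is the k-distance plot: sort, for each point, the distance to its k-th nearest neighbour (with k equal to min_samples) and look for an "elbow". A minimal sketch of this heuristic:

from sklearn.neighbors import NearestNeighbors

k = 10                                          # take k = min_samples
nn = NearestNeighbors(n_neighbors=k).fit(Xmat)
dist, _ = nn.kneighbors(Xmat)                   # distances to the k nearest neighbours
plt.plot(np.sort(dist[:, -1]))                  # sorted k-th neighbour distance
plt.xlabel("points sorted by k-distance")
plt.ylabel("distance to the k-th nearest neighbour")
plt.show()                                      # the 'elbow' suggests a candidate eps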

5.4. Hierarchical clustering#

In hierarchical clustering we use either an agglomerative or a divisive algorithm to generate clusters in a given dataset. In this very simple implementation, we use scipy to cluster a collection of three clusters plus one single orphan data point. Adapted from this entry in Stackoverflow.

The threshold given as a parameter in the code (thresh) is the distance value used to decide whether points/clusters are merged into another cluster. The distance metric can also be specified.

import matplotlib.pyplot as plt
import numpy
import scipy.cluster.hierarchy as hcluster

# generate 3 clusters of around 100 points each, plus one orphan point
N=100
data = numpy.random.randn(3*N,2)
data[:N] += 5
data[-N:] += 10
data[-1:] -= 20

# clustering
thresh = 1.5
clusters = hcluster.fclusterdata(data, thresh, criterion="distance")

# plotting
plt.scatter(*numpy.transpose(data), c=clusters)
plt.axis("equal")
title = "threshold: %f, number of clusters: %d" % (thresh, len(set(clusters)))
plt.title(title)
plt.show()
../_images/aef486cea8e42a974c1cc86270a6e808126840f46f1a2cce9b3282235dc0e101.png
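
As mentioned, the distance metric (and the linkage method) can also be passed to fclusterdata; for instance, a possible variation using the L1 (cityblock) distance and average linkage:

clusters_l1 = hcluster.fclusterdata(data, thresh, criterion="distance",
                                    metric="cityblock", method="average")
print("number of clusters with L1 distance:", len(set(clusters_l1)))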

5.4.1. Agglomerative Hierarchical Clustering#

Find below a basic implementation of agglomerative hierarchical clustering with the Ward linkage function, using scipy functions. Adapted from [1].
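
The function update_distances below implements the Lance-Williams recurrence for Ward linkage: when clusters \(i\) and \(j\), of sizes \(n_i\) and \(n_j\), are merged, the (squared-Euclidean) distance from the new cluster to any other cluster \(k\) becomes

\[ d(i\cup j,\,k) \;=\; \frac{(n_i+n_k)\,d(i,k) \;+\; (n_j+n_k)\,d(j,k) \;-\; n_k\,d(i,j)}{n_i+n_j+n_k}. \]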

import numpy as np 
from scipy.spatial.distance import cdist

def update_distances(D,i,j, sizes): # calculate distances for merged cluster
    n = D.shape[0]     
    d = np.inf * np.ones(n+1)
    for k in range(n): # Update distances     
       d[k] = ((sizes[i]+sizes[k])*D[i,k] + 
       (sizes[j]+sizes[k])*D[j,k] - 
       sizes[k]*D[i,j])/(sizes[i] + sizes[j] + sizes[k])
                   
    infs =  np.inf * np.ones(n) # array of infinity
    D[i,:],D[:,i],D[j,:],D[:,j] =  infs,infs,infs,infs # deactivate 
    new_D = np.inf * np.ones((n+1,n+1))
    new_D[0:n,0:n] = D # copy old matrix into new_D
    new_D[-1,:], new_D[:,-1] = d,d # add new row and column
    return new_D
            
def agg_cluster(X):  
    n = X.shape[0]
    sizes = np.ones(n)
    D = cdist(X, X,metric = 'sqeuclidean')   # initialize distance matr.
    np.fill_diagonal(D, np.inf * np.ones(D.shape[0]))    
    Z = np.zeros((n-1,4))  # linkage matrix encodes the hierarchy tree
    for t in range(n-1):
        i,j = np.unravel_index(D.argmin(), D.shape) # minimizer pair
        sizes = np.append(sizes, sizes[i] + sizes[j])
        Z[t,:]=np.array([i, j, np.sqrt(D[i,j]), sizes[-1]]) 
        D = update_distances(D, i,j, sizes)  # update distance matr.
    return Z   
    
# MAIN ROUTINE
import scipy.cluster.hierarchy as h 

#X = np.genfromtxt('clusterdata.csv',delimiter=',') # read the data
Z = agg_cluster(Xmat)  # form the linkage matrix

h.dendrogram(Z) # SciPy can produce a dendrogram from Z
# fcluster function assigns cluster ids to all points based on Z 
cl = h.fcluster(Z, criterion = 'maxclust', t=3) 

import matplotlib.pyplot as plt
plt.figure(2), plt.clf()
cols = ['red','green','blue']
colors = [cols[i-1] for i in cl]
plt.scatter(Xmat[:,0], Xmat[:,1],c=colors)
plt.show()
../_images/242c9183548cffdfe98cc9668dbf20e4a1e492227b68d59ab0d2fc8dab8ffdf5.png ../_images/4e9ede66165a3702c32daffc02cc85845b00918c15ebaf88fbf9a2aea6ec6e1e.png
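
The hand-rolled linkage matrix can be cross-checked against SciPy's built-in Ward linkage; the two partitions should essentially coincide (a quick sanity check — cluster ids may be permuted):

Z_ref = h.linkage(Xmat, method='ward')
cl_ref = h.fcluster(Z_ref, criterion='maxclust', t=3)
# compare cluster sizes of the two partitions
print(np.unique(cl, return_counts=True))
print(np.unique(cl_ref, return_counts=True))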

5.4.2. Divisive Hierarchical Clustering#

And now, for the same data, we apply a divisive clustering approach (also taken from [1]): the first split of the dataset is obtained by (approximately) solving a max-cut type problem with the cross-entropy (CE) method.
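
The quality of a candidate split \(\{V_1, V_2\}\), encoded by a binary vector \(x\), is the mean inter-cluster distance computed by the function S(x, D) below:

\[ S(V_1,V_2) \;=\; \frac{1}{|V_1|\,|V_2|}\sum_{i\in V_1}\sum_{j\in V_2} D_{ij}, \]

and the CE routine in maxcut searches for the assignment that maximizes this cut value.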

""" clustCE2.py """
import numpy as np
from numpy import genfromtxt
from scipy.spatial.distance import squareform
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt

def S(x,D):
    V1 = np.where(x==0)[0] # {V1,V2} is the partition 
    V2 = np.where(x==1)[0]
    tmp = D[V1]
    tmp = tmp[:,V2]
    return np.mean(tmp) # the size of the cut

def maxcut(D,N,eps,rho,alpha):
    n = D.shape[1]
    Ne = int(rho*N)
    p = 1/2*np.ones(n) 
    p[0] = 1.0
    while (np.max(np.minimum(p,np.subtract(1,p))) > eps):
        x = np.array(np.random.uniform(0,1, (N,n))<=p, dtype=np.int64)
        sx = np.zeros(N)
        for i in range(N):
            sx[i] = S(x[i],D)
        
        sortSX = np.flip(np.argsort(sx))
        elIds = sortSX[0:Ne]
        elites = x[elIds]
        pnew = np.mean(elites, axis=0)
        p = alpha*pnew + (1.0-alpha)*p

    return np.round(p)

#Xmat = genfromtxt('clusterdata.csv', delimiter=',')
n = Xmat.shape[0]

D = squareform(pdist(Xmat))

N = 1000
eps = 10**-2
rho = 0.1
alpha = 0.9

# CE
pout = maxcut(D, N, eps, rho, alpha)

cutval = S(pout,D)
print("cutvalue ",cutval)
#plot
V1 = np.where(pout==0)[0]
xblue = Xmat[V1]
V2 = np.where(pout==1)[0]
xred = Xmat[V2]
plt.scatter(xblue[:,0],xblue[:,1], c="blue")
plt.scatter(xred[:,0],xred[:,1], c="red")
plt.show()
cutvalue  4.89440804758133
../_images/72a7b7df37d8c85be0f7215ca775e786eca5b6c9ce4d7f353539839d3be17ed5.png

We will apply HCA to the famous Iris dataset by Fisher. We start by retrieving the data:

import requests, zipfile, io

url = "https://archive.ics.uci.edu/static/public/53/iris.zip"
filename = "datasets/iris.data"

r = requests.get(url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("datasets")

# from iris.names
#   1. sepal length in cm
#   2. sepal width in cm
#   3. petal length in cm
#   4. petal width in cm
#   5. class

import pandas as pd
df = pd.read_csv(filename, sep=',',header=None, names=['Sepal.length', 'Sepal.width', 'Petal.length','Petal.width','Species'])
df
     Sepal.length  Sepal.width  Petal.length  Petal.width         Species
0             5.1          3.5           1.4          0.2     Iris-setosa
1             4.9          3.0           1.4          0.2     Iris-setosa
2             4.7          3.2           1.3          0.2     Iris-setosa
3             4.6          3.1           1.5          0.2     Iris-setosa
4             5.0          3.6           1.4          0.2     Iris-setosa
..            ...          ...           ...          ...             ...
145           6.7          3.0           5.2          2.3  Iris-virginica
146           6.3          2.5           5.0          1.9  Iris-virginica
147           6.5          3.0           5.2          2.0  Iris-virginica
148           6.2          3.4           5.4          2.3  Iris-virginica
149           5.9          3.0           5.1          1.8  Iris-virginica

150 rows × 5 columns

Now we build a scatter plot matrix:

from pandas.plotting import scatter_matrix
import numpy as np
colors= np.array(50*['r']+50*['g']+50*['b'])
scatter_matrix(df,alpha=0.2,figsize=(6,6),grid=True,diagonal='hist',color=colors)
array([[<AxesSubplot:xlabel='Sepal.length', ylabel='Sepal.length'>,
        <AxesSubplot:xlabel='Sepal.width', ylabel='Sepal.length'>,
        <AxesSubplot:xlabel='Petal.length', ylabel='Sepal.length'>,
        <AxesSubplot:xlabel='Petal.width', ylabel='Sepal.length'>],
       [<AxesSubplot:xlabel='Sepal.length', ylabel='Sepal.width'>,
        <AxesSubplot:xlabel='Sepal.width', ylabel='Sepal.width'>,
        <AxesSubplot:xlabel='Petal.length', ylabel='Sepal.width'>,
        <AxesSubplot:xlabel='Petal.width', ylabel='Sepal.width'>],
       [<AxesSubplot:xlabel='Sepal.length', ylabel='Petal.length'>,
        <AxesSubplot:xlabel='Sepal.width', ylabel='Petal.length'>,
        <AxesSubplot:xlabel='Petal.length', ylabel='Petal.length'>,
        <AxesSubplot:xlabel='Petal.width', ylabel='Petal.length'>],
       [<AxesSubplot:xlabel='Sepal.length', ylabel='Petal.width'>,
        <AxesSubplot:xlabel='Sepal.width', ylabel='Petal.width'>,
        <AxesSubplot:xlabel='Petal.length', ylabel='Petal.width'>,
        <AxesSubplot:xlabel='Petal.width', ylabel='Petal.width'>]],
      dtype=object)
../_images/9eadf093c533503147dc7d90528c6a874c6a40c7c07689abc1332bb2a4b59a22.png

which, in terms of numerical correlation, yields

matrix = df.corr(numeric_only=True)   # numeric_only skips the non-numeric Species column
print(matrix)
              Sepal.length  Sepal.width  Petal.length  Petal.width
Sepal.length      1.000000    -0.109369      0.871754     0.817954
Sepal.width      -0.109369     1.000000     -0.420516    -0.356544
Petal.length      0.871754    -0.420516      1.000000     0.962757
Petal.width       0.817954    -0.356544      0.962757     1.000000

which can be visualized easily as a heatmap with seaborn:

import seaborn as sns
sns.heatmap(matrix,cmap='YlGnBu')
<AxesSubplot:>
../_images/aab46d7129840a869c2ac7be62b1c4d1f01735f57ea512ae29e175a30c713bd1.png

We can also see the diversity of the data using a Parallel Coordinates plot:

plt.figure(figsize=(16,8))
pd.plotting.parallel_coordinates(df,'Species')
<AxesSubplot:>
../_images/4fb10fa53a081ed9223d4249d7de3a68c89d03ac6cd549c22ea694ddc36880ec.png
from scipy.cluster.hierarchy import dendrogram, linkage


df_dend = df.drop(columns='Species')    # keep only the numeric features

Z = linkage(df_dend,'ward')

plt.figure(figsize=(16,8))
plt.title('Hierarchical clustering dendrogram')
plt.ylabel('sample index')
plt.xlabel('distance (Ward)')

dendrogram(Z,labels=df_dend.index,orientation='left')

plt.show()
../_images/dfc2ccf07cec2a0c95100d0811c97aa67630f20de988ded3497935bf5b9ea407.png
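
To quantify how well the tree recovers the species, we can cut it into three clusters with fcluster and cross-tabulate against the true labels (a quick check):

from scipy.cluster.hierarchy import fcluster

cl_iris = fcluster(Z, t=3, criterion='maxclust')   # cut the dendrogram into 3 clusters
print(pd.crosstab(df['Species'], cl_iris))         # contingency table against the true species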

These visualizations easily demonstrate that the separation of the hierarchical clustering is very good for the “Setosa” species, but it mislabels many “Versicolor” samples as “Virginica”. Using seaborn again, we can combine the HCA with a heatmap:

sns.clustermap(df_dend)
<seaborn.matrix.ClusterGrid at 0x7fa9629c85b0>
../_images/c6fcd9c8eca529966eaf34a5ab171ea5f4aa1d7ab72ef62a034d6fa03615e143.png

Other clustering methods, like the density-based DBSCAN seen above, are also very popular. DBSCAN works well when seeking areas in the data that have a high density of observations, versus areas of the data that are not very dense. DBSCAN can also sort data into clusters of varying shapes, another strong advantage.

5.5. PCA with Singular Value Decomposition#
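
The code below centers the data, forms the scaled Gram matrix \(G = \mathbf{X}^\top \mathbf{X}/n\), extracts the principal directions from its singular value decomposition, and projects the points onto the first principal component.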

""" PCAdat.py """
import numpy as np 
X = np.genfromtxt('datasets/pcadat.csv', delimiter=',')

n = X.shape[0]

# Gram matrix
X = X - X.mean(axis=0)
G = X.T @ X 
U, _ , _ = np.linalg.svd(G/n)

# project the points onto the first principal component
Y = X @ np.outer(U[:,0],U[:,0])

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.xaxis.set_pane_color((0, 0, 0, 0))   # transparent pane (ax.w_xaxis is deprecated)
ax.plot(Y[:,0], Y[:,1], Y[:,2], c='k', linewidth=1)
ax.scatter(X[:,0], X[:,1], X[:,2], c='b')
ax.scatter(Y[:,0], Y[:,1], Y[:,2], c='r')

for i in range(n):
    ax.plot([X[i,0], Y[i,0]], [X[i,1],Y[i,1]], [X[i,2],Y[i,2]], 'b')
    
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')

# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
plt.savefig('../figures/pca1py.pdf')
plt.show()
# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
../_images/def4195413a80b6a219aa95ffa154af60fc6d67f1eeb7f1750f85d8838661c2d.png
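
As a cross-check, scikit-learn's PCA should recover the same first principal direction (up to sign); a minimal sketch, reusing the centered data matrix X from above:

from sklearn.decomposition import PCA

pca = PCA(n_components=3).fit(X)
print(pca.explained_variance_ratio_)   # fraction of the variance captured by each component
print(pca.components_[0])              # first principal axis; compare with U[:,0] (up to sign)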
[1] Dirk P. Kroese, Zdravko Botev, Thomas Taimre, and Radislav Vaisman. Data Science and Machine Learning: Mathematical and Statistical Methods. Machine Learning & Pattern Recognition. Chapman & Hall/CRC, 2020. URL: https://acems.org.au/data-science-machine-learning-book-available-download (visited on 2023-08-15).