
K-means clustering

We will apply K-means clustering to the NCI data, the same data we used for hierarchical clustering last class. The plot below shows the within-cluster sum of squares as a function of the number of clusters. To gauge the variability, we initialized K-means with 5 different sets of random data points for each number of clusters. As you can see, the curve is typically decreasing, and there is some variability across runs.
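
Concretely, if observation $x_i$ carries label $c_i$ and cluster $k$ has centroid $\bar{x}_k$, the within-cluster sum of squares of a clustering into $K$ clusters is

$$ \mathrm{SSW}(K) = \sum_{k=1}^{K} \sum_{i \,:\, c_i = k} \| x_i - \bar{x}_k \|^2. $$

At a global optimum this quantity is non-increasing in $K$; K-means, however, only finds a local optimum from each random start, which is why the curve below need not be perfectly monotone.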

In[2]:

%load_ext rmagic
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster import vq

Given a data matrix, a set of centroids, and the partition labels assigning each observation to a centroid, we want to compute the within-cluster sum of squares:

In[3]:

def within_sum_of_squares(data, centroids, labels):
    """
    Compute the within-cluster sum of squares of a clustering:
    the sum of squared distances from each observation to the
    centroid of the cluster it is assigned to.
    """

    SSW = 0
    for l in np.unique(labels):
        # Residuals of the observations in cluster l about its centroid
        data_l = data[labels == l]
        resid = data_l - centroids[l]
        SSW += (resid**2).sum()
    return SSW
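
As a quick sanity check (a sketch, not part of the original notes), the loop above is equivalent to a one-line vectorized computation, since indexing the centroid array by the label vector lines each observation up with its own centroid:

    # Hypothetical check on synthetic data: both lines compute the same SSW.
    X = np.random.standard_normal((100, 2))
    centroids, labels = vq.kmeans2(X, 3, minit='points')
    print(within_sum_of_squares(X, centroids, labels))
    print(((X - centroids[labels])**2).sum())  # vectorized equivalent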

We load the NCI microarray data from the ElemStatLearn R package through rmagic:

%R -o nci -n library(ElemStatLearn); data(nci)

In[4]:

nci.shape;

Let’s form a number of clusterings of the samples into different numbers of clusters, storing the within-cluster sum of squares for each. The samples are the columns of nci, which is why we cluster the transpose.

In[14]:

niter, kmax = 5, 25
SSW = []
for k in range(1, kmax):
    # For each k, run K-means niter times from random initial points
    # and record the within-cluster sum of squares of each run.
    SSW.append([within_sum_of_squares(nci.T, *vq.kmeans2(nci.T, k=k, minit='points'))
                for _ in range(niter)])
SSW = np.array(SSW)

In[15]:

# Mean and spread of SSW over the 5 random starts, for each number of clusters
plt.errorbar(range(1,kmax), SSW.mean(1), yerr=SSW.std(1), fmt='ro')

Out[15]:

<Container object of 3 artists>
_images/kmeans_fig_00.png

Iris data

We can also examine how stable the clustering is on the iris data by repeating it several times. I found a random instance that split the Iris-setosa cluster and saved the seed. There appear to be a few configurations that the algorithm settles on; in any case, there is definitely more than one.

    set.seed(3)
    iris = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
        sep = ",", header = FALSE)
    names(iris) = c("sepal.length", "sepal.width", "petal.length", "petal.width",
        "iris.type")
    par(mfrow = c(2, 2))
    plot(iris$sepal.length, iris$sepal.width, pch = 23,
        bg = c("red", "blue", "green")[kmeans(iris[, -5], 3)$cluster])
    plot(iris$sepal.length, iris$sepal.width, pch = 23,
        bg = c("red", "blue", "green")[kmeans(iris[, -5], 3)$cluster])
    plot(iris$sepal.length, iris$sepal.width, pch = 23,
        bg = c("red", "blue", "green")[kmeans(iris[, -5], 3)$cluster])
    plot(iris$sepal.length, iris$sepal.width, pch = 23,
        bg = c("red", "blue", "green")[kmeans(iris[, -5], 3)$cluster])

_images/kmeans_fig_01.png

    plot(iris$sepal.length, iris$sepal.width, pch = 23,
        bg = c("red", "blue", "green")[kmeans(iris[, -5], 3)$cluster])
    plot(iris$sepal.length, iris$sepal.width, pch = 23,
        bg = c("red", "blue", "green")[kmeans(iris[, -5], 3)$cluster])
    plot(iris$sepal.length, iris$sepal.width, pch = 23,
        bg = c("red", "blue", "green")[kmeans(iris[, -5], 3)$cluster])
    plot(iris$sepal.length, iris$sepal.width, pch = 23,
        bg = c("red", "blue", "green")[kmeans(iris[, -5], 3)$cluster])

_images/kmeans_fig_02.png

K-medoids (PAM)

The K-medoids algorithm is implemented in the pam function in the cluster package in R; PAM stands for Partitioning Around Medoids. It seems to be more robust than K-means in the sense that, on the iris data, it almost never splits the Iris-setosa cluster into two groups. The call to sample simply shuffles the rows of the dataset, to check that nothing special happens because of the ordering of the rows.
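
The difference from K-means lies in the objective (stated here for reference, not from the original notes): instead of minimizing squared distances to freely chosen centroids, PAM requires each cluster prototype, the medoid $m_k$, to be one of the observations, and minimizes

$$ \sum_{i} d(x_i, m_{c_i}), $$

where $d$ can be any dissimilarity. Unsquared dissimilarities and data-point prototypes are what make the method less sensitive to outliers.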

    library(cluster)
    par(mfrow = c(2, 2))

    s = sample(150)
    plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23,
        bg = c("red", "blue", "green")[pam(iris[s, -5], 3)$cluster])
    s = sample(150)
    plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23,
        bg = c("red", "blue", "green")[pam(iris[s, -5], 3)$cluster])
    s = sample(150)
    plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23,
        bg = c("red", "blue", "green")[pam(iris[s, -5], 3)$cluster])
    s = sample(150)
    plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23,
        bg = c("red", "blue", "green")[pam(iris[s, -5], 3)$cluster])

_images/kmeans_fig_03.png

    s = sample(150)
    plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23,
        bg = c("red", "blue", "green")[pam(iris[s, -5], 3)$cluster])
    s = sample(150)
    plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23,
        bg = c("red", "blue", "green")[pam(iris[s, -5], 3)$cluster])
    s = sample(150)
    plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23,
        bg = c("red", "blue", "green")[pam(iris[s, -5], 3)$cluster])
    s = sample(150)
    plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23,
        bg = c("red", "blue", "green")[pam(iris[s, -5], 3)$cluster])

_images/kmeans_fig_04.png

Finally, the pam function also allows us to produce a silhouette plot:

    plot(pam(iris[, -5], 3), which.plots = 2)

_images/kmeans_fig_05.png
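
For reference (the standard definition, not from the original notes): for observation $i$, let $a(i)$ be its average dissimilarity to the other members of its own cluster, and $b(i)$ the smallest average dissimilarity from $i$ to the members of any other cluster. The silhouette width is then

$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \in [-1, 1], $$

so values near 1 indicate well-clustered points, while values near 0 or below indicate points sitting between clusters.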

As a function of the number of clusters, the average silhouette width reaches its best value at 2 clusters for the iris data. Unfortunately, the silhouette method cannot select a single cluster, since $b(i)$ is undefined when there is no second cluster.

    # Average silhouette width for k = 2, ..., 20 clusters
    ASW = numeric(19)
    k = 2:20
    for (kk in 2:20) {
        ASW[kk - 1] = pam(iris[, -5], kk)$silinfo$avg.width
    }
    plot(k, ASW, pch = 23, bg = "red")

_images/kmeans_fig_06.png

Mixture modelling

A softer version of K-means (or K-medoids) can be obtained from a Gaussian mixture model. Rather than assigning each point to a single cluster, a mixture model assigns each point a probability of belonging to each cluster.

Nevertheless, at the end we can obtain a hard label by choosing the cluster with the highest estimated probability.
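
In symbols (standard mixture-model notation, not from the original notes): with mixture weights $\pi_k$ and Gaussian densities $\mathcal{N}(x \mid \mu_k, \Sigma_k)$, the fitted model assigns point $x_i$ to cluster $k$ with probability

$$ \gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}, $$

and the hard label is $\hat{c}_i = \operatorname{argmax}_k \gamma_{ik}$, which is what the classification field of Mclust reports below.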

    library(mclust)

    ## Warning: package 'mclust' was built under R version 2.15.1

    ## Package 'mclust' version 4.0

    par(mfrow = c(2, 2))
    s = sample(150)
    plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23,
        bg = c("red", "blue", "green")[Mclust(iris[s, -5], G = 3,
            modelNames = c("VVV"))$classification])
    s = sample(150)
    plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23,
        bg = c("red", "blue", "green")[Mclust(iris[s, -5], G = 3,
            modelNames = c("VVV"))$classification])
    s = sample(150)
    plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23,
        bg = c("red", "blue", "green")[Mclust(iris[s, -5], G = 3,
            modelNames = c("VVV"))$classification])
    s = sample(150)
    plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23,
        bg = c("red", "blue", "green")[Mclust(iris[s, -5], G = 3,
            modelNames = c("VVV"))$classification])

_images/kmeans_fig_07.png