
# K-means clustering

We will apply K-means clustering to the NCI data, the same data used for the hierarchical clustering we saw last class. This plot shows the within-cluster sum of squares as a function of the number of clusters. To estimate the variability, we used 5 different random initializations of K-means. As you can see, it is typically decreasing, and there is some variability.

In:

```
%load_ext rmagic
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster import vq
```

Given a data matrix, a set of centroids associated to some partition labels, we want to compute the within cluster sum of squares:

In:

```
def within_sum_of_squares(data, centroids, labels):
    """
    Compute the within-cluster sum of squares of a
    prototype-based clustering.
    """
    SSW = 0
    for l in np.unique(labels):
        data_l = data[labels == l]
        resid = data_l - centroids[l]
        SSW += (resid**2).sum()
    return SSW

%R -o nci -n library(ElemStatLearn); data(nci)
```

In:

```
nci.shape;
```

Let’s form a number of clusterings of the samples with different numbers of clusters, storing the within-cluster sum of squares.

In:

```
niter, kmax = 5, 25
SSW = []
for k in range(1, kmax):
    SSW.append([within_sum_of_squares(nci.T, *vq.kmeans2(nci.T, k=k, minit='points'))
                for _ in range(niter)])
SSW = np.array(SSW)
```

In:

```
plt.errorbar(range(1, kmax), SSW.mean(1), yerr=SSW.std(1), fmt='ro')
```

Out:

`<Container object of 3 artists>` `<matplotlib.figure.Figure at 0x90b12b0>`
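The decreasing pattern can be sanity-checked outside the NCI data. The sketch below uses a hypothetical three-blob example (blob centres, sizes, and restart counts are made up for illustration), reuses the `within_sum_of_squares` computation from above, and keeps the best of several random restarts at each number of clusters:

```
import numpy as np
from scipy.cluster import vq

def within_sum_of_squares(data, centroids, labels):
    # same computation as the function defined earlier
    SSW = 0
    for l in np.unique(labels):
        resid = data[labels == l] - centroids[l]
        SSW += (resid**2).sum()
    return SSW

rng = np.random.RandomState(0)
# three well-separated Gaussian blobs in the plane (hypothetical data)
data = np.vstack([rng.randn(50, 2) + center
                  for center in [(0, 0), (8, 0), (0, 8)]])

np.random.seed(0)  # controls kmeans2's random initial points
ssw = [min(within_sum_of_squares(data, *vq.kmeans2(data, k, minit='points'))
           for _ in range(20))  # best of 20 random restarts
       for k in range(1, 7)]
```

With a good restart at each k, the best within-cluster sum of squares can only go down as k grows; the sharp drop at the true number of blobs is the "elbow" one looks for in the plot above.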

## Iris data

We can also see how stable the clustering is on the iris data by repeating it several times. I found a random instance that split the Iris-setosa cluster, and I have saved the seed. There appear to be a few configurations that K-means chooses; in any case, there is definitely more than one.
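One way to make this stability question quantitative (a sketch, not part of the original analysis) is to rerun K-means and score the agreement between partitions with the Rand index, a standard pair-counting measure in which 1.0 means identical partitions. The overlapping two-blob data here is hypothetical, standing in for the ambiguous structure in the iris data:

```
import numpy as np
from scipy.cluster import vq

def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree
    (pair together in both, or apart in both)."""
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)  # each unordered pair counted once
    return (same_a == same_b)[iu].mean()

rng = np.random.RandomState(1)
# two overlapping blobs: ambiguous enough that restarts can disagree
data = np.vstack([rng.randn(75, 2), rng.randn(75, 2) + (2.5, 0)])

np.random.seed(1)
runs = [vq.kmeans2(data, 3, minit='points')[1] for _ in range(5)]
agreement = [rand_index(runs[0], r) for r in runs[1:]]
```

Agreement values well below 1.0 between restarts indicate exactly the kind of instability the repeated plots below display.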

```
set.seed(3)
iris = read.table(..., sep = ",", header = FALSE)  # file/URL missing in the original notes
names(iris) = c("sepal.length", "sepal.width", "petal.length", "petal.width",
    "iris.type")
par(mfrow = c(2, 2))
plot(iris$sepal.length, iris$sepal.width, pch = 23,
     bg = c("red", "blue", "green")[kmeans(iris[, -5], 3)$cluster])
plot(iris$sepal.length, iris$sepal.width, pch = 23,
     bg = c("red", "blue", "green")[kmeans(iris[, -5], 3)$cluster])
plot(iris$sepal.length, iris$sepal.width, pch = 23,
     bg = c("red", "blue", "green")[kmeans(iris[, -5], 3)$cluster])
plot(iris$sepal.length, iris$sepal.width, pch = 23,
     bg = c("red", "blue", "green")[kmeans(iris[, -5], 3)$cluster])
```

## K-medoid (PAM)

The K-medoid algorithm is implemented in the `pam` function in the `cluster` package in R. PAM stands for Partitioning Around Medoids. It seems to be more robust than K-means in the sense that, for the iris data, it almost never splits the Iris-setosa cluster into two groups. The call to `sample` just shuffles the rows of the dataset, to ensure that nothing special is happening based on their ordering.
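For intuition, here is a minimal numpy sketch of the k-medoids idea. It is a simplified alternate-minimization (assign points to the nearest medoid, then recompute each cluster's medoid), not the exact build/swap search that `pam` performs, and the three-blob data is hypothetical:

```
import numpy as np

def k_medoids(data, k, n_iter=20, seed=0):
    """Simplified k-medoids: the cluster centers are always actual
    data points, which is what makes the method robust."""
    rng = np.random.RandomState(seed)
    # full pairwise Euclidean distance matrix
    d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
    medoids = rng.choice(len(data), size=k, replace=False)
    for _ in range(n_iter):
        labels = d[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:  # guard: empty cluster, keep the old medoid
                continue
            # the medoid minimizes total distance to its cluster's members
            new_medoids[j] = members[d[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, d[:, medoids].argmin(axis=1)

rng = np.random.RandomState(0)
data = np.vstack([rng.randn(50, 2) + c for c in [(0, 0), (8, 0), (0, 8)]])
medoids, labels = k_medoids(data, 3)
```

Because medoids are restricted to observed points and the objective sums distances rather than squared distances, single outliers pull the centers around much less than in K-means.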

```
library(cluster)
par(mfrow = c(2, 2))

s = sample(150)
plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23,
     bg = c("red", "blue", "green")[pam(iris[s, -5], 3)$cluster])
s = sample(150)
plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23,
     bg = c("red", "blue", "green")[pam(iris[s, -5], 3)$cluster])
s = sample(150)
plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23,
     bg = c("red", "blue", "green")[pam(iris[s, -5], 3)$cluster])
s = sample(150)
plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23,
     bg = c("red", "blue", "green")[pam(iris[s, -5], 3)$cluster])

```

Finally, the pam function also allows us to produce a silhouette plot:

```
plot(pam(iris[, -5], 3), which.plots = 2)
```

As a function of the number of clusters, the average silhouette width reaches its best value at 2 for the iris data. Unfortunately, the silhouette method cannot select 1 cluster.
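As a reminder of what the silhouette width measures, here is a hedged numpy sketch (not `pam`'s implementation; singleton clusters are not handled, and the blob data is hypothetical). For each point, `a` is its mean distance to its own cluster, `b` its mean distance to the nearest other cluster, and the width is `(b - a) / max(a, b)`:

```
import numpy as np

def silhouette_widths(data, labels):
    d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
    s = np.empty(len(data))
    for i in range(len(data)):
        own = labels == labels[i]
        # a: mean distance to own cluster, excluding the point itself
        a = d[i, own].sum() / (own.sum() - 1)
        # b: mean distance to the nearest other cluster
        b = min(d[i, labels == l].mean()
                for l in np.unique(labels) if l != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

rng = np.random.RandomState(0)
data = np.vstack([rng.randn(40, 2) + c for c in [(0, 0), (8, 0), (0, 8)]])
labels = np.repeat([0, 1, 2], 40)
widths = silhouette_widths(data, labels)
```

Widths near 1 mean a point sits far inside its own cluster; widths near 0 or below mean it sits between clusters, which is why the average width is used to compare numbers of clusters.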

```
ASW = numeric(19)
k = 2:20
for (kk in 2:20) {
    ASW[kk - 1] = pam(iris[, -5], kk)$silinfo$avg.width
}
plot(k, ASW, pch = 23, bg = "red")
```

## Mixture modelling

A softer version of K-means (or K-medoids) can be found using a Gaussian mixture model. Rather than assigning each point to a single cluster, a mixture model assigns each point to every cluster with some probability.

Nevertheless, at the end we can obtain a hard label by choosing, for each point, the cluster with the highest estimated probability.
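The soft-then-hard assignment can be illustrated with a tiny one-dimensional mixture whose parameters are fixed by hand (a hypothetical example, not a model fit to the iris data). The posterior cluster probabilities follow from Bayes' rule, and the hard labels from their argmax:

```
import numpy as np
from scipy.stats import norm

# hypothetical two-component mixture with known parameters
weights = np.array([0.5, 0.5])                     # mixing proportions
means, sds = np.array([0.0, 4.0]), np.array([1.0, 1.0])

x = np.array([-1.0, 2.0, 5.0])
lik = weights * norm.pdf(x[:, None], means, sds)   # weighted component densities
resp = lik / lik.sum(axis=1, keepdims=True)        # soft assignments (rows sum to 1)
hard = resp.argmax(axis=1)                         # hard labels, as in the text
```

The point at the midpoint `x = 2.0` gets probability 1/2 for each component, which is exactly the ambiguity that hard clustering hides; Mclust estimates the weights, means, and covariances by EM instead of fixing them.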

```
library(mclust)
```

```
## Warning: package 'mclust' was built under R version 2.15.1
```

```
## Package 'mclust' version 4.0
```
```
par(mfrow = c(2, 2))
s = sample(150)
plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23,
     bg = c("red", "blue", "green")[Mclust(iris[s, -5], G = 3, modelNames = c("VVV"))$classification])
s = sample(150)
plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23,
     bg = c("red", "blue", "green")[Mclust(iris[s, -5], G = 3, modelNames = c("VVV"))$classification])
s = sample(150)
plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23,
     bg = c("red", "blue", "green")[Mclust(iris[s, -5], G = 3, modelNames = c("VVV"))$classification])
s = sample(150)
plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23,
     bg = c("red", "blue", "green")[Mclust(iris[s, -5], G = 3, modelNames = c("VVV"))$classification])
```