Hierarchical clustering

Complete linkage

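First, we will cluster using “complete” linkage, which takes the dissimilarity between two clusters to be the maximum pairwise dissimilarity between their members. This is the default method in hclust.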

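    # Read the iris data from the UCI repository; the file has no header row.
    iris = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
        sep = ",", header = FALSE)
    names(iris) = c("sepal.length", "sepal.width", "petal.length", "petal.width",
        "iris.type")
    # Drop the label column, compute Euclidean distances, and cluster
    # (hclust uses complete linkage by default).
    iris_hclust = hclust(dist(iris[, -5]))
    plot(iris_hclust)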

_images/hierarchical_fig_00.png
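Before cutting the tree, it can help to look inside the object that hclust returns. The following is only a sketch: it assumes the copy of the iris data that ships with R (datasets::iris) in place of the UCI download so that it runs offline, and the names iris_builtin and hc_demo are hypothetical. The two copies of the data may differ in a couple of entries, so values need not match the figures on this page exactly.

```r
# Assumption: R's built-in iris data stands in for the UCI file used above.
iris_builtin = datasets::iris
hc_demo = hclust(dist(iris_builtin[, 1:4]))

hc_demo$method          # linkage rule used; hclust defaults to "complete"
hc_demo$dist.method     # distance used by dist(); defaults to "euclidean"
length(hc_demo$height)  # one merge height per merge, n - 1 in total
# In $merge, negative entries are single observations and
# positive entries refer to clusters formed at earlier rows.
head(hc_demo$merge)
```

The heights in hc_demo$height are the values on the vertical axis of the dendrogram, which is what cutree(..., h = ...) cuts against.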

We can cut the tree and look at the resulting clustering. Let’s cut it into the canonical 3 groups. The results are quite similar to the K-means and mixture model results.

    iris_3 = cutree(iris_hclust, k = 3)
    plot(iris$sepal.length, iris$sepal.width, pch = 23, bg = c("red", "blue", "green")[iris_3])

_images/hierarchical_fig_01.png
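One way to go beyond the scatter plot is to cross-tabulate the cluster labels against the recorded species. This is a sketch under the assumption that the built-in datasets::iris (with its Species column) is an acceptable stand-in for the UCI file; iris_builtin, hc_demo, clusters_3 and confusion are hypothetical names.

```r
# Assumption: R's built-in iris data stands in for the UCI file used above.
iris_builtin = datasets::iris
hc_demo = hclust(dist(iris_builtin[, 1:4]))  # complete linkage
clusters_3 = cutree(hc_demo, k = 3)

# Rows are cluster labels, columns are the recorded species.
confusion = table(cluster = clusters_3, species = iris_builtin$Species)
confusion
```

The cluster numbers returned by cutree are arbitrary labels, so a good match shows up as one dominant entry per row, not as a diagonal.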

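From the first plot, the three groups correspond to a height of roughly 4, perhaps a little less. We can also cut the tree by height instead of by number of groups. With complete linkage, cutting at height h means that the maximum dissimilarity between any two observations in the same cluster is at most h; here we use h = 3.9.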

    iris_h = cutree(iris_hclust, h = 3.9)
    plot(iris$sepal.length, iris$sepal.width, pch = 23, bg = c("red", "blue", "green")[iris_h])

_images/hierarchical_fig_02.png

Asking for more groups cuts the tree lower down. Here is the split into 6 groups.

    iris_6 = cutree(iris_hclust, k = 6)
    plot(iris$sepal.length, iris$sepal.width, pch = 23, bg = c("red", "blue", "green",
        "yellow", "orange", "purple")[iris_6])

_images/hierarchical_fig_03.png
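The meaning of a cut by height under complete linkage can be checked numerically: every cluster obtained by cutting at h should have a largest within-cluster distance of at most h. The sketch below assumes the built-in datasets::iris as a stand-in for the UCI file, and all variable names are hypothetical.

```r
# Assumption: R's built-in iris data stands in for the UCI file used above.
iris_builtin = datasets::iris
d = dist(iris_builtin[, 1:4])
hc_demo = hclust(d)  # complete linkage
h = 3.9
clusters_h = cutree(hc_demo, h = h)

# Largest pairwise distance inside each cluster (its diameter).
d_mat = as.matrix(d)
members = split(seq_along(clusters_h), clusters_h)
diameters = sapply(members, function(idx) max(d_mat[idx, idx]))
diameters
all(diameters <= h)  # should hold for complete linkage
```

This bound is specific to complete linkage. With single or average linkage, the cut height does not limit how far apart two members of the same cluster can be.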

Single linkage

Single linkage takes the dissimilarity between two clusters to be the minimum pairwise distance between their members.

    iris_hclust_single = hclust(dist(iris[, -5]), method = "single")
    plot(iris_hclust_single)

_images/hierarchical_fig_04.png

This plot shows the prototypical “chaining” seen in single linkage, where clusters grow by absorbing one nearby observation at a time. The split into 3 groups contains one large group and a very small group of size 2.

    iris_3 = cutree(iris_hclust_single, k = 3)
    plot(iris$sepal.length, iris$sepal.width, pch = 23, bg = c("red", "blue", "green")[iris_3])

_images/hierarchical_fig_05.png
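The imbalance produced by chaining is easiest to see by counting cluster sizes. This is a sketch that assumes the built-in datasets::iris in place of the UCI file, so the counts may differ slightly from those behind the figure; hc_single_demo and sizes are hypothetical names.

```r
# Assumption: R's built-in iris data stands in for the UCI file used above.
iris_builtin = datasets::iris
hc_single_demo = hclust(dist(iris_builtin[, 1:4]), method = "single")

# Number of observations assigned to each of the 3 clusters.
sizes = table(cutree(hc_single_demo, k = 3))
sizes
```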

Group average linkage

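Using method="average" yields the average linkage tree, in which the dissimilarity between two clusters is the average of all pairwise distances between their members. It is usually somewhat intermediate between complete and single linkage.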

    iris_hclust_average = hclust(dist(iris[, -5]), method = "average")
    plot(iris_hclust_average)

_images/hierarchical_fig_06.png

Again we cut the tree into 3 groups.

    iris_3 = cutree(iris_hclust_average, k = 3)
    plot(iris$sepal.length, iris$sepal.width, pch = 23, bg = c("red", "blue", "green")[iris_3])

_images/hierarchical_fig_07.png
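The three linkage rules on this page differ only in how they summarise the pairwise distances between two clusters. The toy example below, which is not part of the iris analysis and uses made-up points, computes all three by hand for two well-separated clusters and compares them with the height of the final merge reported by hclust.

```r
# Two hypothetical clusters on a line: A = {0, 1} and B = {10, 11}.
pts = rbind(c(0, 0), c(1, 0), c(10, 0), c(11, 0))
d = dist(pts)

# Distances between members of A (rows 1-2) and members of B (rows 3-4).
cross = as.matrix(d)[1:2, 3:4]

single_link = min(cross)     # 9
complete_link = max(cross)   # 11
average_link = mean(cross)   # 10

# The last merge joins A and B, so its height is the linkage value.
top_height = function(method) max(hclust(d, method = method)$height)
c(single = top_height("single"),
  complete = top_height("complete"),
  average = top_height("average"))
```

Because the mean of a set of distances always lies between its minimum and maximum, the average linkage value for a given pair of clusters falls between the single and complete linkage values.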