py4sci

Table Of Contents

Previous topic

Distances

Next topic

Multidimensional scaling

This Page

Visual summaries

This example involves various visual summaries of a data set.

The data set we will use is Fisher’s famous iris data set, which we can find at the UCI machine learning database site.

    iris = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
        sep = ",", header = FALSE)
    head(iris)

    ##    V1  V2  V3  V4          V5
    ## 1 5.1 3.5 1.4 0.2 Iris-setosa
    ## 2 4.9 3.0 1.4 0.2 Iris-setosa
    ## 3 4.7 3.2 1.3 0.2 Iris-setosa
    ## 4 4.6 3.1 1.5 0.2 Iris-setosa
    ## 5 5.0 3.6 1.4 0.2 Iris-setosa
    ## 6 5.4 3.9 1.7 0.4 Iris-setosa

This data file does not have variables names, so we attach them to the data as follows.

    names(iris) = c("sepal.length", "sepal.width", "petal.length", "petal.width",
        "iris.type")

Stem and leaf plots

This plot is an ASCII graphic representation of a sample of number’s first few significant digits.

    stem(iris$petal.width)

    ##
    ##   The decimal point is 1 digit(s) to the left of the |
    ##
    ##    1 | 000000
    ##    2 | 0000000000000000000000000000
    ##    3 | 0000000
    ##    4 | 0000000
    ##    5 | 0
    ##    6 | 0
    ##    7 |
    ##    8 |
    ##    9 |
    ##   10 | 0000000
    ##   11 | 000
    ##   12 | 00000
    ##   13 | 0000000000000
    ##   14 | 00000000
    ##   15 | 000000000000
    ##   16 | 0000
    ##   17 | 00
    ##   18 | 000000000000
    ##   19 | 00000
    ##   20 | 000000
    ##   21 | 000000
    ##   22 | 000
    ##   23 | 00000000
    ##   24 | 000
    ##   25 | 000

Histograms

Our first plot will be some histograms. A histogram represents counts within given intervals by the height of the bars. The following histogram has 10 equally spaced breaks, so 11 bins.

    hist(iris$petal.width, breaks = 10)

_images/visualization_fig_00.png

We can try to get finer information by adding more breaks, though the information is less accurate because of fewer data points in each bin.

    hist(iris$petal.width, breaks = 20)

_images/visualization_fig_01.png

If we don’t want to use R‘s automatically chosen breaks, we can supply our own. We can also add color to the plot

    seq(0, 2.5, length = 15)

    ##  [1] 0.0000 0.1786 0.3571 0.5357 0.7143 0.8929 1.0714 1.2500 1.4286 1.6071
    ## [11] 1.7857 1.9643 2.1429 2.3214 2.5000

    hist(iris$petal.width, breaks = seq(0, 2.5, length = 15), col = "blue")

_images/visualization_fig_02.png

Density

A density is a smoothed version of a histogram. Formally, it is a non-negative function that integrates to 1. The area under the curve in any given interval is an estimate of the probability that that variable will fall into that interval.

R has built in tools for estimating density. We will estimate the density for petal.length among the three different types of irises.

    density.setosa = density(iris$petal.length[iris$iris.type == "Iris-setosa"])
    density.versicolor = density(iris$petal.length[iris$iris.type == "Iris-versicolor"])
    density.virginica = density(iris$petal.length[iris$iris.type == "Iris-virginica"])

Now, we will plot all densities, by first plotting one density, then adding the remaining ones with lines

    plot(density.setosa, ylab = "f(length)", xlab = "length (cm)", col = "red",
        main = "Density for petal lengths", xlim = c(0, 8), lwd = 4)
    lines(density.versicolor, col = "blue", lwd = 4)
    lines(density.virginica, col = "green", lwd = 4)
    legend(4, 2.5, c("Setosa", "Virginica", "Versicolor"), lwd = rep(4, 3), col = c("red",
        "green", "blue"))

_images/visualization_fig_03.png

We see from this plot that Setosa has the smallest petal lengths.

Distribution functions

The cumulative area under a density is called a distribution function. Any sample of numbers has a natural distribution function, known as the ECDF (Empirical Cumulative Distribution Function). Note that it is not the cumulative area under the previous densities, it is rougher.

    ecdf.setosa = ecdf(iris$petal.length[iris$iris.type == "Iris-setosa"])
    ecdf.versicolor = ecdf(iris$petal.length[iris$iris.type == "Iris-versicolor"])
    ecdf.virginica = ecdf(iris$petal.length[iris$iris.type == "Iris-virginica"])

Let’s plot these ECDFs, which are always step functions. It has the same information as the density: we see that the red curve is always greater than or equal to the blue or green curves. This means that the irises in the red group (setosa) have the smallest petal lengths.

    plot(ecdf.setosa, ylab = "F(length)", xlab = "length (cm)", col.h = "red", main = "ECDF for petal lengths",
        verticals = T, col.v = "red", do.p = F, lwd = 4, xlim = c(0, 8))
    lines(ecdf.versicolor, col.h = "blue", col.v = "blue", verticals = T, lwd = 4,
        do.p = F)
    lines(ecdf.virginica, col.h = "green", col.v = "green", verticals = T, lwd = 4,
        do.p = F)

_images/visualization_fig_04.png

Quantile plots

The inverse of the ECDF is the quantile function. Plotting the quantile function displays all quantiles of the data sets in a visual fashion.

    ps = seq(0, 1, length = 20)
    quantile.setosa = quantile(iris$petal.length[iris$iris.type == "Iris-setosa"],
        probs = ps)
    quantile.setosa

    ##     0% 5.263% 10.53% 15.79% 21.05% 26.32% 31.58% 36.84% 42.11% 47.37%
    ##  1.000  1.200  1.300  1.300  1.332  1.400  1.400  1.400  1.400  1.500
    ## 52.63% 57.89% 63.16% 68.42% 73.68% 78.95% 84.21% 89.47% 94.74%   100%
    ##  1.500  1.500  1.500  1.500  1.511  1.600  1.600  1.684  1.700  1.900

    quantile.versicolor = quantile(iris$petal.length[iris$iris.type == "Iris-versicolor"],
        probs = ps)
    quantile.virginica = quantile(iris$petal.length[iris$iris.type == "Iris-virginica"],
        probs = ps)

This plot will tell us the same information all the other plots did, that the iris types are separated by petal length.

    plot(ps, quantile.setosa, xlab = "p", ylab = "length (cm)", col = "red", main = "Quantiles for petal lengths",
        type = "l", lwd = 4, ylim = c(0, 8))
    lines(ps, quantile.versicolor, col = "blue", lwd = 4)
    lines(ps, quantile.virginica, col = "green", lwd = 4)

_images/visualization_fig_05.png

Boxplots

We already saw boxplots on Homework 1. Here is an example that shows box plots of each column in the data set.

    boxplot(iris[, -5])

_images/visualization_fig_06.png

As in Homework 1, we can create a set of boxplots to show how a given variable like petal.length varies with iris.type.

    boxplot(iris$petal.length ~ iris$iris.type)

_images/visualization_fig_07.png

Pairwise plots

Another useful summary of the data is a set of pairwise plots of all the continuous variables. Pairwise plots when one or more of the variables are discrete are more difficult to interpret – boxplots would be better in this case, or a table if both variables are discrete.

    pairs(as.matrix(iris[, -5]), pch = 21, bg = c("red", "green", "blue")[unclass(iris$iris.type)])

_images/visualization_fig_08.png

Contour plots & higher dimensional densities

Density estimates can also be produced for higher dimensional data. Here is an example of a 2-dimensional kernel density estimate of the petal lengths and widths. The 2D density estimate is in the MASS package.

    library(MASS)
    petal.dens = kde2d(iris$petal.length, iris$petal.width)
    contour(petal.dens)

_images/visualization_fig_09.png

We can also view it as an image.

    image(petal.dens)

_images/visualization_fig_10.png

Correlation matrices

Here is an example that shows the case-by-case correlation matrix of the iris data. The blocky structure of this matrix reflects the fact that the cases are broken into three groups and the rows were ordered by these three groups.

    library(fields)
    cI = cor(t(iris[, -5]))
    plot(c(0, 1.2), c(0, 1), type = "n", bty = "n", xlab = "", ylab = "")
    image(cI, col = tim.colors(20), add = TRUE)
    colorbar.plot(1.1, 0.5, seq(min(cI), max(cI), length = 51), col = tim.colors(50),
        adj.x = 0, horizontal = FALSE)
    text(1.15, 0.2, round(min(cI), 2))
    text(1.15, 0.8, max(cI))

_images/visualization_fig_11.png