py4sci

Table Of Contents

Previous topic

Features for time series

Next topic

Distances

This Page

PCA of olympic decathlon data

This example is a short introduction to using R for PCA analysis. The data are scores on various olympic decathlon events for 33 athletes. It is contained in the package ade4.

    library(ade4)

    ## Attaching package: 'ade4'

    ## The following object(s) are masked from 'package:base':
    ##
    ## within

    data(olympic)
    summary(olympic$tab)

    ##       100            long           poid           haut
    ##  Min.   :10.6   Min.   :6.22   Min.   :10.3   Min.   :1.79
    ##  1st Qu.:11.0   1st Qu.:7.00   1st Qu.:13.2   1st Qu.:1.94
    ##  Median :11.2   Median :7.09   Median :14.1   Median :1.97
    ##  Mean   :11.2   Mean   :7.13   Mean   :14.0   Mean   :1.98
    ##  3rd Qu.:11.4   3rd Qu.:7.37   3rd Qu.:15.0   3rd Qu.:2.03
    ##  Max.   :11.6   Max.   :7.72   Max.   :16.6   Max.   :2.27
    ##       400            110            disq           perc
    ##  Min.   :47.4   Min.   :14.2   Min.   :34.4   Min.   :4.00
    ##  1st Qu.:48.3   1st Qu.:14.7   1st Qu.:39.1   1st Qu.:4.60
    ##  Median :49.1   Median :15.0   Median :42.3   Median :4.70
    ##  Mean   :49.3   Mean   :15.0   Mean   :42.4   Mean   :4.74
    ##  3rd Qu.:50.0   3rd Qu.:15.4   3rd Qu.:44.8   3rd Qu.:4.90
    ##  Max.   :51.3   Max.   :16.2   Max.   :50.7   Max.   :5.70
    ##       jave           1500
    ##  Min.   :49.5   Min.   :257
    ##  1st Qu.:55.4   1st Qu.:266
    ##  Median :59.5   Median :272
    ##  Mean   :59.4   Mean   :276
    ##  3rd Qu.:64.0   3rd Qu.:286
    ##  Max.   :72.6   Max.   :303

The biplot gives a graphical summary of both cases (athletes) in terms of scores and the variables in terms of loadings.

    pca.olympic = princomp(olympic$tab)
    biplot(pca.olympic)

_images/olympic_fig_00.png

From this plot, we see that the first principal component is positively associated with longer times on the 1500. So, slower runners will have higher value on this component.

    data.frame(olympic$tab[, "1500"], pca.olympic$scores[, 1])

    ##    olympic.tab....1500.. pca.olympic.scores...1.
    ## 1                  268.9                  -6.074
    ## 2                  273.0                  -2.678
    ## 3                  263.2                 -12.350
    ## 4                  285.1                   9.527
    ## 5                  256.6                 -19.567
    ## 6                  274.1                  -2.312
    ## 7                  291.2                  15.048
    ## 8                  265.9                 -10.023
    ## 9                  269.6                  -5.977
    ## 10                 292.2                  16.717
    ## 11                 295.9                  20.036
    ## 12                 256.7                 -18.720
    ## 13                 257.9                 -18.561
    ## 14                 269.0                  -6.525
    ## 15                 267.5                  -8.326
    ## 16                 268.5                  -7.400
    ## 17                 302.4                  27.956
    ## 18                 286.0                  10.557
    ## 19                 262.4                 -14.220
    ## 20                 277.8                   2.264
    ## 21                 266.4                  -9.508
    ## 22                 262.9                 -13.229
    ## 23                 272.7                  -2.996
    ## 24                 277.8                   1.099
    ## 25                 285.6                   8.574
    ## 26                 270.1                  -6.798
    ## 27                 261.9                 -15.095
    ## 28                 303.2                  27.202
    ## 29                 272.1                  -4.895
    ## 30                 293.9                  17.291
    ## 31                 295.0                  19.235
    ## 32                 293.7                  16.966
    ## 33                 270.0                  -7.218

    plot(olympic$tab[, "1500"], pca.olympic$scores[, 1], pch = 23, bg = "red", cex = 2)

_images/olympic_fig_01.png

Also, the second principal component is associated with strength in the form of a long javelin throw.

    plot(olympic$tab[, "jave"], pca.olympic$scores[, 2], pch = 23, bg = "red", cex = 2)

_images/olympic_fig_02.png

Standardizing


In the previous example, we saw that the two variables were based somewhat on speed and strength. However, we did not scale the variables so the 1500 has much more weight than the 400, for instance. We correct this by passing the cor=TRUE argument, which defaults to FALSE, as an argument to princomp.

    pca.olympic = princomp(olympic$tab, cor = TRUE)
    biplot(pca.olympic)

_images/olympic_fig_03.png

This plot reinforces our earlier interpretation and has put the running events on an “even playing field” by standardizing.

    pca.olympic$loadings[, 1]

    ##     100    long    poid    haut     400     110    disq    perc    jave
    ##  0.4159 -0.3941 -0.2691 -0.2123  0.3558  0.4335 -0.1758 -0.3841 -0.1799
    ##    1500
    ##  0.1701

    pca.olympic$loadings[, 2]

    ##      100     long     poid     haut      400      110     disq     perc
    ##  0.14881 -0.15208  0.48354  0.02790  0.35216  0.06957  0.50333  0.14958
    ##     jave     1500
    ##  0.37196  0.42097

Overall score

Since the first variable seems related to speed, we should not be surprised that an increase in this variable decreases the overall decathlon score.

    plot(pca.olympic$scores[, 1], olympic$score, xlab = "Comp. 1", pch = 23, bg = "red",
        cex = 2)

_images/olympic_fig_04.png