This example is a short introduction to using `R` for PCA analysis.
The data are scores on various olympic decathlon events for 33 athletes.
It is contained in the package `ade4`.

library(ade4)

## Attaching package: 'ade4'

## The following object(s) are masked from 'package:base': ## ## within

data(olympic) summary(olympic$tab)

## 100 long poid haut ## Min. :10.6 Min. :6.22 Min. :10.3 Min. :1.79 ## 1st Qu.:11.0 1st Qu.:7.00 1st Qu.:13.2 1st Qu.:1.94 ## Median :11.2 Median :7.09 Median :14.1 Median :1.97 ## Mean :11.2 Mean :7.13 Mean :14.0 Mean :1.98 ## 3rd Qu.:11.4 3rd Qu.:7.37 3rd Qu.:15.0 3rd Qu.:2.03 ## Max. :11.6 Max. :7.72 Max. :16.6 Max. :2.27 ## 400 110 disq perc ## Min. :47.4 Min. :14.2 Min. :34.4 Min. :4.00 ## 1st Qu.:48.3 1st Qu.:14.7 1st Qu.:39.1 1st Qu.:4.60 ## Median :49.1 Median :15.0 Median :42.3 Median :4.70 ## Mean :49.3 Mean :15.0 Mean :42.4 Mean :4.74 ## 3rd Qu.:50.0 3rd Qu.:15.4 3rd Qu.:44.8 3rd Qu.:4.90 ## Max. :51.3 Max. :16.2 Max. :50.7 Max. :5.70 ## jave 1500 ## Min. :49.5 Min. :257 ## 1st Qu.:55.4 1st Qu.:266 ## Median :59.5 Median :272 ## Mean :59.4 Mean :276 ## 3rd Qu.:64.0 3rd Qu.:286 ## Max. :72.6 Max. :303

The biplot gives a graphical summary of both cases (athletes) in terms
of `scores` and the variables in terms of `loadings`.

pca.olympic = princomp(olympic$tab) biplot(pca.olympic)

From this plot, we see that the first principal component is positively associated with longer times on the 1500. So, slower runners will have higher value on this component.

data.frame(olympic$tab[, "1500"], pca.olympic$scores[, 1])

## olympic.tab....1500.. pca.olympic.scores...1. ## 1 268.9 -6.074 ## 2 273.0 -2.678 ## 3 263.2 -12.350 ## 4 285.1 9.527 ## 5 256.6 -19.567 ## 6 274.1 -2.312 ## 7 291.2 15.048 ## 8 265.9 -10.023 ## 9 269.6 -5.977 ## 10 292.2 16.717 ## 11 295.9 20.036 ## 12 256.7 -18.720 ## 13 257.9 -18.561 ## 14 269.0 -6.525 ## 15 267.5 -8.326 ## 16 268.5 -7.400 ## 17 302.4 27.956 ## 18 286.0 10.557 ## 19 262.4 -14.220 ## 20 277.8 2.264 ## 21 266.4 -9.508 ## 22 262.9 -13.229 ## 23 272.7 -2.996 ## 24 277.8 1.099 ## 25 285.6 8.574 ## 26 270.1 -6.798 ## 27 261.9 -15.095 ## 28 303.2 27.202 ## 29 272.1 -4.895 ## 30 293.9 17.291 ## 31 295.0 19.235 ## 32 293.7 16.966 ## 33 270.0 -7.218

plot(olympic$tab[, "1500"], pca.olympic$scores[, 1], pch = 23, bg = "red", cex = 2)

Also, the second principal component is associated with strength in the form of a long javelin throw.

plot(olympic$tab[, "jave"], pca.olympic$scores[, 2], pch = 23, bg = "red", cex = 2)

Standardizing

In the previous example, we saw that the two variables were based
somewhat on speed and strength. However, we did not scale the variables
so the 1500 has much more weight than the 400, for instance. We correct
this by passing the `cor=TRUE` argument, which defaults to `FALSE`,
as an argument to `princomp`.

pca.olympic = princomp(olympic$tab, cor = TRUE) biplot(pca.olympic)

This plot reinforces our earlier interpretation and has put the running events on an “even playing field” by standardizing.

pca.olympic$loadings[, 1]

## 100 long poid haut 400 110 disq perc jave ## 0.4159 -0.3941 -0.2691 -0.2123 0.3558 0.4335 -0.1758 -0.3841 -0.1799 ## 1500 ## 0.1701

pca.olympic$loadings[, 2]

## 100 long poid haut 400 110 disq perc ## 0.14881 -0.15208 0.48354 0.02790 0.35216 0.06957 0.50333 0.14958 ## jave 1500 ## 0.37196 0.42097

Since the first variable seems related to speed, we should not be surprised that an increase in this variable decreases the overall decathlon score.

plot(pca.olympic$scores[, 1], olympic$score, xlab = "Comp. 1", pch = 23, bg = "red", cex = 2)