Next: Confidence Intervals Up: Lectures Previous: Cross Validation

# Bootstrapping a Principal Component Analysis

The scores data are the first example in Chapter 7 of the text. The analysis done there is a principal components analysis; here is a little about the matrix decomposition behind it.

### Description of the Singular Value Decomposition

This is the most important matrix decomposition in statistics.

Here is a first introduction, with a small made-up example in Matlab:

>> u=[3 1 -1 2]'
u =
3
1
-1
2
>> v=(1:4)'
v =
1
2
3
4
>> X=u*v'
X =
3     6     9    12
1     2     3     4
-1    -2    -3    -4
2     4     6     8
>> svd(X)
ans =
21.2132
0.0000
0
0
>> E=10^(-3)*randn(4);
>> XE=X+E
XE =
3.0012    5.9993    9.0003   12.0012
1.0006    2.0017    3.0009    3.9994
-0.9999   -1.9999   -3.0014   -3.9994
2.0004    4.0018    5.9993    7.9996
>> svd(XE)
ans =
21.2143
0.0028
0.0019
0.0005
>> cond(X)
Condition is infinite
ans =
Inf
>> cond(XE)
ans =
4.2567e+04
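For readers following along in Python rather than Matlab, here is the same rank-one experiment in NumPy (a translation for illustration, not part of the original handout): the unperturbed matrix has exactly one nonzero singular value, and a tiny perturbation spreads small mass onto the others while making the condition number finite.

```python
import numpy as np

# Rank-one matrix X = u v': only one nonzero singular value.
u = np.array([3.0, 1.0, -1.0, 2.0])
v = np.array([1.0, 2.0, 3.0, 4.0])
X = np.outer(u, v)
s = np.linalg.svd(X, compute_uv=False)
print(s)  # first value is ||u||*||v|| = sqrt(450) ~ 21.2132, rest are 0

# Perturb by noise of size 1e-3: the three zero singular values
# become small but nonzero, and the condition number drops from
# infinity to something large but finite.
rng = np.random.default_rng(0)
XE = X + 1e-3 * rng.standard_normal((4, 4))
sE = np.linalg.svd(XE, compute_uv=False)
print(sE)                  # largest ~ 21.21, the rest of order 1e-3
print(np.linalg.cond(XE))  # large but finite
```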
%An example you can't see with your bare eyes
%(a rank-one product of two new random vectors; note that
%u'*v would be a scalar, so an outer product u2*v2' is meant):
>> u2=rand(4,1); v2=rand(4,1);
>> X2=u2*v2'
X2 =
0.2468    0.2499    0.0166    0.2487
0.3225    0.3266    0.0218    0.3250
0.1495    0.1514    0.0101    0.1506
0.4367    0.4423    0.0295    0.4401
>> svd(X2)
ans =
1.0730
0.0000
0
0
>> [U,S,V]=svd(X2);
>> U*S*V'
ans =
0.2468    0.2499    0.0166    0.2487
0.3225    0.3266    0.0218    0.3250
0.1495    0.1514    0.0101    0.1506
0.4367    0.4423    0.0295    0.4401
>> 10000*(X2-U*S*V')
ans =
1.0e-11 *
-0.0278    0.0833   -0.0035    0.0555
-0.0555    0.0555         0    0.0555
0    0.0555    0.0017    0.0278
-0.0555    0.1665         0    0.1110

>> A=rand(4)
A =
0.5045    0.4940    0.0737    0.9138
0.5163    0.2661    0.5007    0.5297
0.3190    0.0907    0.3841    0.4644
0.9866    0.9478    0.2771    0.9410
>> flops(0)
>> [L,U,P]=lu(A)
L =
1.0000         0         0         0
0.5233    1.0000         0         0
0.5114   -0.0406    1.0000         0
0.3234    0.9388    0.7363    1.0000
U =
0.9866    0.9478    0.2771    0.9410
0   -0.2298    0.3557    0.0373
0         0   -0.0535    0.4342
0         0         0   -0.1945
>> flops
ans =
34
>> P*A
ans =
0.9866    0.9478    0.2771    0.9410
0.5163    0.2661    0.5007    0.5297
0.5045    0.4940    0.0737    0.9138
0.3190    0.0907    0.3841    0.4644
>> L*U
ans =
0.9866    0.9478    0.2771    0.9410
0.5163    0.2661    0.5007    0.5297
0.5045    0.4940    0.0737    0.9138
0.3190    0.0907    0.3841    0.4644
>> P*A-L*U
ans =    1.0e-15 *
0         0         0         0
-0.1110         0         0         0
0         0         0         0
-0.0555   -0.0139         0         0
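As a cross-check in Python (using SciPy; not part of the original handout): `scipy.linalg.lu` returns P, L, U with A = P L U, whereas Matlab's `[L,U,P]=lu(A)` satisfies P*A = L*U, so the permutation sits on the other side of the equation.

```python
import numpy as np
from scipy.linalg import lu

rng = np.random.default_rng(1)
A = rng.random((4, 4))

# SciPy convention: A = P @ L @ U, with L unit lower triangular
# and U upper triangular (compare Matlab's P*A = L*U).
P, L, U = lu(A)
err = np.abs(A - P @ L @ U).max()
print(err)  # of order machine epsilon, like the 1e-15 residual above
```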


Here is what we need to remember:

Any n x p matrix X can be decomposed as X = U S V', where U (n x p) and V (p x p) have orthonormal columns and S is diagonal with decreasing nonnegative entries s_1 >= s_2 >= ... >= s_p, the singular values. Actually the singular values are the square roots of the eigenvalues of X'X.
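That is, the singular values of X equal the square roots of the eigenvalues of X'X; a quick numerical check in Python/NumPy (an illustration, not part of the handout):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 4))

# Singular values of X ...
s = np.linalg.svd(X, compute_uv=False)

# ... are the square roots of the eigenvalues of X'X.
eigvals = np.linalg.eigvalsh(X.T @ X)   # returned in ascending order
sqrt_eigs = np.sqrt(eigvals)[::-1]      # descending, to match svd
print(np.abs(s - sqrt_eigs).max())      # ~ machine epsilon
```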

## Principal Components

• Start by recentring X: from now on consider X centered, i.e. each column of X has mean 0,
• Cols of XV are new variables,
• Principal Components: C = XV = US,
• Principal axes: the columns of V,
• Distance between two points (rows of X) is preserved in the new coordinates,

• Transition Formulae: C = XV and X = CV',

X centered, all points (observations) given the same weight.

The p variables can be replaced by the columns of XV.

X_k = U_k S_k V_k' (keeping only the k largest singular values) is the best (optimal) approximation of rank k for X.
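The best rank-k approximation property (the Eckart-Young theorem: truncating the SVD to the k largest singular values gives the closest rank-k matrix in Frobenius norm) can be checked numerically; a sketch in Python/NumPy, not part of the original notes:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((10, 5))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
# Truncated SVD: keep only the k largest singular values.
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The Frobenius error of the truncation equals the square root
# of the sum of the discarded squared singular values ...
err = np.linalg.norm(X - Xk, 'fro')
print(err, np.sqrt(np.sum(s[k:] ** 2)))

# ... and no other rank-k matrix does better (spot check
# against random rank-k competitors).
for _ in range(100):
    B = rng.standard_normal((10, k)) @ rng.standard_normal((k, 5))
    assert np.linalg.norm(X - B, 'fro') >= err
```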

The columns of V are the directions along which the variances are maximal. Definition:
Principal components are the coordinates of the observations on the basis of the new variables (namely the columns of XV) and they are the rows of C = US. The components are orthogonal and their lengths are the singular values s_i.

In the same way the principal axes are defined as the rows of the matrix V'. They give the coordinates of the variables on the basis of the columns of U.

However, this decomposition is highly dependent on the units of measurement (scale) of the variables. It is only used, in fact, when the variables are all "of the same order". Usually a weight is assigned to each variable that takes its overall scale into account; this weight is very often inversely proportional to the variable's variance.

So we define a different, weighted metric between observations instead of the ordinary Euclidean one.

The same can be said of the observations; some may be "more important" than others (for instance, a point that is the mean of a larger group).

### Matlab for the Scores Example (in handout 4/27/99)

Score data
>> round(cov(scor88))
ans =
306   127   102   106   117
127   173    85    95    99
102    85   113   112   122
106    95   112   220   156
117    99   122   156   298
>> corrcoef(scor88)
ans =
1.0000    0.5534    0.5468    0.4094    0.3891
0.5534    1.0000    0.6096    0.4851    0.4364
0.5468    0.6096    1.0000    0.7108    0.6647
0.4094    0.4851    0.7108    1.0000    0.6072
0.3891    0.4364    0.6647    0.6072    1.0000
>> mean(scor88)
ans = 38.95   50.59   50.60   46.68   42.3068
>> scorc88=scor88-ones(88,1)*mean(scor88);
>> mean(scorc88)
ans =1.0e-13 *
0.1421   -0.1970   -0.0565   -0.0436   -0.0363
>> [U,S,V] = svd(scorc88,0);
>> S
S =
244.4752         0         0         0         0
0  132.6034         0         0         0
0         0   95.0053         0         0
0         0         0   85.8070         0
0         0         0         0   52.8898
>> V
V =
0.5054   -0.7487   -0.2998    0.2962   -0.0794
0.3683   -0.2074    0.4156   -0.7829   -0.1889
0.3457    0.0759    0.1453   -0.0032    0.9239
0.4511    0.3009    0.5966    0.5181   -0.2855
0.5347    0.5478   -0.6003   -0.1757   -0.1512
>> [VE, LE]=eig(cov(scorc88))
VE =
0.2962   -0.2998    0.0794   -0.7487    0.5054
-0.7829    0.4156    0.1889   -0.2074    0.3683
-0.0032    0.1453   -0.9239    0.0759    0.3457
0.5181    0.5966    0.2855    0.3009    0.4511
-0.1757   -0.6003    0.1512    0.5478    0.5347
LE =
84.6304         0         0         0         0
0  103.7473         0         0         0
0         0   32.1533         0         0
0         0         0  202.1111         0
0         0         0         0  686.9898
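Note the connection between the two outputs above: Matlab's eig does not sort its eigenvalues, but they are exactly the squared singular values of the centered data divided by n-1 (e.g. 244.4752^2/87 is about 686.99), and the columns of VE match those of V up to order and sign. A quick Python/NumPy check of the same identity on made-up data (the scores data are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.standard_normal((88, 5))
n = data.shape[0]

Xc = data - data.mean(axis=0)   # center the columns
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigenvalues of the sample covariance = s^2 / (n - 1).
cov_eigs = np.sort(np.linalg.eigvalsh(np.cov(data, rowvar=False)))[::-1]
print(np.abs(s ** 2 / (n - 1) - cov_eigs).max())  # ~ machine epsilon
```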

Nonparametric bootstrap of the proportion of variance explained by the first few (k) axes.
function out=bpca(B,orig,k)
%Bootstrap of the PCA of a dataset
%Output the proportion
%of variance for the first k axes
out=zeros(1,B);  %preallocate
[n,p]=size(orig);
for b=1:B
newsample=bsample(orig);
out(b)=pca(newsample,n,p,k);
end %---------------------------

function out=bsample(orig)
%Function to create one resample from
%the original sample orig, where
%orig is the original data, and is a matrix
[n,p]=size(orig);
indices=randint(1,n,n)+1;  %n draws from 1..n (or ceil(n*rand(1,n)))
out=orig(indices,:);

%-------------------------------

function out=pca(data,n,p,k)
%Do a principal component analysis
%Return the ratio of the variance
%explained by the first k components
data=data-ones(n,1)*mean(data);
[U,S,V] = svd(data,0);
S2=diag(S).^2;
out=sum(S2(1:k))/sum(S2(1:p));
%---------------------------------

>> resbpca10K1=bpca(10000,scor88,1);
>> resbpca10K=bpca(10000,scor88,2);
>> hist(resbpca10K,100)

[Figures: histograms of the bootstrap distribution of the proportion of variance explained, for one component and for two components.]
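For readers without Matlab, here is a Python/NumPy translation of the three functions above (a sketch following the same logic; the function names echo the handout's):

```python
import numpy as np

def pca_ratio(data, k):
    """Proportion of variance explained by the first k principal axes."""
    Xc = data - data.mean(axis=0)          # center the columns
    s = np.linalg.svd(Xc, compute_uv=False)
    s2 = s ** 2
    return s2[:k].sum() / s2.sum()

def bsample(orig, rng):
    """One nonparametric bootstrap resample of the rows."""
    n = orig.shape[0]
    return orig[rng.integers(0, n, size=n), :]

def bpca(B, orig, k, seed=0):
    """Bootstrap distribution of the variance ratio for the first k axes."""
    rng = np.random.default_rng(seed)
    return np.array([pca_ratio(bsample(orig, rng), k) for _ in range(B)])

# e.g. with made-up data standing in for scor88:
fake = np.random.default_rng(5).standard_normal((88, 5))
res = bpca(200, fake, 2)
print(res.mean())  # all bootstrap ratios lie strictly between 0 and 1
```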

### How to generate a multivariate normal

The Choleski decomposition of a symmetric positive definite matrix Sigma is a factorization of the form Sigma = R'R, where R is upper triangular; this R is what Matlab's chol returns.

>> chl=chol(cov(scor88));
>> round(chl'*chl)
ans =
306   127   102   106   117
127   173    85    95    99
102    85   113   112   122
106    95   112   220   156
117    99   122   156   298

>> xs=randn(88,5);
>> mean(xs)
ans =
0.0480   -0.2056    0.1209    0.1097    0.1051
>> corrcoef(xs)
ans =
1.0000   -0.1623   -0.0664   -0.1947   -0.1042
-0.1623    1.0000    0.1412    0.0361    0.0600
-0.0664    0.1412    1.0000   -0.2334    0.0056
-0.1947    0.0361   -0.2334    1.0000   -0.0734
-0.1042    0.0600    0.0056   -0.0734    1.0000
>> ys=xs*chl;
>> cov(ys)
ans =
225.8649   65.7374   57.5737   29.1913   40.6952
65.7374  149.5844   74.5579   74.1853   83.3744
57.5737   74.5579   93.5467   66.5157   89.7249
29.1913   74.1853   66.5157  152.4465   88.0262
40.6952   83.3744   89.7249   88.0262  210.1678
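The same recipe in Python/NumPy (a sketch, with a made-up covariance standing in for cov(scor88)): if R is the upper-triangular Choleski factor with R'R = Sigma and the rows of xs are iid standard normal, then the rows of xs @ R have covariance approximately Sigma.

```python
import numpy as np

rng = np.random.default_rng(6)

# A made-up positive definite covariance (stand-in for cov(scor88)).
A = rng.standard_normal((5, 5))
Sigma = A @ A.T + 5 * np.eye(5)

# numpy's cholesky returns lower-triangular L with L @ L.T = Sigma;
# R = L.T plays the role of Matlab's chol output (upper triangular).
R = np.linalg.cholesky(Sigma).T
print(np.abs(R.T @ R - Sigma).max())   # ~ 0

# Transform iid standard normals: rows of ys ~ N(0, Sigma).
xs = rng.standard_normal((100000, 5))
ys = xs @ R
print(np.abs(np.cov(ys, rowvar=False) - Sigma).max())  # small; shrinks with n
```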

Parametric Bootstrap Simulations
>> xs=randn(88,5);
>> mean(xs)
ans= 0.013  0.035 -0.151 -0.059 0.033
>> ys=xs*chl;
>> round(cov(ys))
ans =
309   169   147   130   150
169   237   137   160   146
147   137   161   158   149
130   160   158   252   172
150   146   149   172   272
function out=spca(B,orig,k)
%Parametric Bootstrap of the PCA of a dataset
%Output the proportion
%of variance for the first k axes
out=zeros(1,B);
[n,p]=size(orig);
chl=chol(cov(orig));
for b=1:B
rands=randn(n,p);
newsample=rands*chl;
out(b)=pca(newsample,n,p,k);
end
>> r100k=spca(100000,scor88,2);
>> hist(r100k,200);
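And the parametric version in Python/NumPy, mirroring spca (again a sketch, with made-up data standing in for scor88):

```python
import numpy as np

def pca_ratio(data, k):
    """Proportion of variance explained by the first k axes."""
    Xc = data - data.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    return (s[:k] ** 2).sum() / (s ** 2).sum()

def spca(B, orig, k, seed=0):
    """Parametric bootstrap: resample from N(0, cov(orig))."""
    rng = np.random.default_rng(seed)
    n, p = orig.shape
    R = np.linalg.cholesky(np.cov(orig, rowvar=False)).T  # R'R = cov(orig)
    out = np.empty(B)
    for b in range(B):
        out[b] = pca_ratio(rng.standard_normal((n, p)) @ R, k)
    return out

fake = np.random.default_rng(7).standard_normal((88, 5))
r = spca(500, fake, 2)
print(r.mean())
```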


Susan Holmes 2004-05-19