Bootstrapping a Principal Component Analysis

The scores data are the first example in Chapter 7 of the text. The analysis done there is a principal components analysis; here is a little background on the matrix decomposition behind it.

Description of the Singular Value Decomposition

This is the most important matrix decomposition in statistics.

Here is a first introduction, starting with a small made-up example in Matlab:

>> u=[3 1 -1 2]'
u =
     3
     1
    -1
     2
>> v=(1:4)'
v = 
     1
     2
     3
     4
>> X=u*v'
X =
     3     6     9    12
     1     2     3     4
    -1    -2    -3    -4
     2     4     6     8
>> svd(X)
ans =
   21.2132
    0.0000
         0
         0
>> E=10^(-3)*randn(4);
>> XE=X+E 
XE =
    3.0012    5.9993    9.0003   12.0012
    1.0006    2.0017    3.0009    3.9994
   -0.9999   -1.9999   -3.0014   -3.9994
    2.0004    4.0018    5.9993    7.9996
>> svd(XE) 
ans =
   21.2143
    0.0028
    0.0019
    0.0005
>> cond(X)
Condition is infinite
ans =
   Inf 
>> cond(XE)
ans =
   4.2567e+04
%An example you can't see with your bare eyes:
>> X2=u'*v   % note: as typed this would give a scalar; the 4x4 output shown
             % implies an outer product (column vector times row vector) of two new random vectors
X2 =
    0.2468    0.2499    0.0166    0.2487
    0.3225    0.3266    0.0218    0.3250
    0.1495    0.1514    0.0101    0.1506
    0.4367    0.4423    0.0295    0.4401
>> svd(X2) 
ans =
    1.0730
    0.0000
         0
         0
>> [U,S,V]=svd(X2);
>> U*S*V' 
ans =
    0.2468    0.2499    0.0166    0.2487
    0.3225    0.3266    0.0218    0.3250
    0.1495    0.1514    0.0101    0.1506
    0.4367    0.4423    0.0295    0.4401
>> 10000*(X2-U*S*V') 
ans =
    1.0e-11 *
   -0.0278    0.0833   -0.0035    0.0555
   -0.0555    0.0555         0    0.0555
         0    0.0555    0.0017    0.0278
   -0.0555    0.1665         0    0.1110
>> A=rand(4)
A = 
    0.5045    0.4940    0.0737    0.9138
    0.5163    0.2661    0.5007    0.5297
    0.3190    0.0907    0.3841    0.4644
    0.9866    0.9478    0.2771    0.9410
>> flops(0)   % reset the floating-point operation counter (flops exists only in older Matlab versions)
>> [L,U,P]=lu(A) 
L =
     1.0000         0         0         0
    0.5233    1.0000         0         0
    0.5114   -0.0406    1.0000         0
    0.3234    0.9388    0.7363    1.0000
U =
     0.9866    0.9478    0.2771    0.9410
         0   -0.2298    0.3557    0.0373
         0         0   -0.0535    0.4342
         0         0         0   -0.1945
>> flops 
ans =
    34 
>> P*A
ans =
    0.9866    0.9478    0.2771    0.9410
    0.5163    0.2661    0.5007    0.5297
    0.5045    0.4940    0.0737    0.9138
    0.3190    0.0907    0.3841    0.4644
>> L*U
ans =
     0.9866    0.9478    0.2771    0.9410
    0.5163    0.2661    0.5007    0.5297
    0.5045    0.4940    0.0737    0.9138
    0.3190    0.0907    0.3841    0.4644
>> P*A-L*U
ans =    1.0e-15 *
         0         0         0         0
   -0.1110         0         0         0
         0         0         0         0
   -0.0555   -0.0139         0         0

Here is what we need to remember:

\begin{displaymath}
X=USV', \quad V'V=I, \quad U'U=I, \quad S\ \mbox{diagonal with entries}\ s_1\geq s_2\geq\cdots\geq 0.
\end{displaymath}

The singular values of $X$ are the square roots of the eigenvalues of $X'X$.
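
We can check this numerically (a minimal sketch; Y is just a new random matrix used for illustration):

>> Y=randn(6,4);
>> svd(Y)                             % singular values, largest first
>> sqrt(flipud(sort(eig(Y'*Y))))      % same numbers, up to rounding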

Principal Components

Suppose $X$ is centered, with all points (observations) given the same weight; then

\begin{displaymath}
\mathbf{1}'X=0\quad\mbox{and}\quad x_{ij} = \sum^r_{t=1}u_{it}s_t v_{jt},
\end{displaymath}

where $r$ is the rank of $X$, so the $p$ variables can be replaced by the $r$ columns of $V$.

\begin{displaymath}
x^{(k)}_{ij} = \sum^k_{t=1}u_{it}s_t v_{jt}
\end{displaymath}

is the best (least-squares) approximation of $X$ of rank $k$.
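
In Matlab the rank $k$ approximation is obtained by truncating the decomposition (a sketch, where X stands for any data matrix and k is at most its rank):

>> [U,S,V]=svd(X,0);
>> k=2;
>> Xk=U(:,1:k)*S(1:k,1:k)*V(:,1:k)';  % best rank k approximation of X
>> norm(X-Xk)                         % equals the (k+1)st singular value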

The columns of $V$ are the directions along which the variances are maximal.

Definition: the principal components are the coordinates of the observations in the basis of the new variables (namely the columns of $V$); they are the rows of $G=XV=US$. The components are orthogonal and their lengths are the singular values, since $G^\prime G=SU^\prime US=S^2$.

In the same way the principal axes are defined as the rows of the matrix $Z=U^\prime X=SV^\prime$; they are the coordinates of the variables in the basis formed by the columns of $U$.

\begin{displaymath}
Z=S^{-1}G^\prime X \quad\mbox{and}\quad G=XZ^\prime S^{-1}.
\end{displaymath}
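
These identities are easy to verify numerically; a sketch, assuming X has been centered and [U,S,V]=svd(X,0) has been computed:

>> G=X*V;                             % principal components, one row per observation
>> norm(G-U*S)                        % essentially zero, since XV=US
>> Z=U'*X;                            % principal axes, one row per component
>> norm(Z-S*V')                       % essentially zero, since U'X=SV'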

However, this decomposition depends heavily on the units of measurement (scale) of the variables. In fact it is only used when the $X_{\cdot k}$ are all of the same ``order''. Usually a weight is assigned to each variable to account for its overall scale; this weight is very often taken inversely proportional to the variable's variance.

So instead of the ordinary Euclidean distance we define a weighted metric between observations:

\begin{eqnarray*}
d \left( x_{i\cdot}, x_{j\cdot}\right) &=& \left( x_{i\cdot}-x_{j\cdot}\right)
\left(\begin{array}{ccc}
\frac{1}{\sigma^2_1} & & \\
 & \ddots & \\
 & & \frac{1}{\sigma^2_p}
\end{array}\right)
\left( x_{i\cdot}-x_{j\cdot}\right)'.
\end{eqnarray*}

The same can be said of the observations: some may be more ``important'' than others (for example when an observation is the mean of a larger group).
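
The most common such weighting divides each column by its standard deviation before the SVD, which is the same as doing the PCA on the correlation matrix instead of the covariance matrix. A sketch, using the scores data scor88 introduced in the next section:

>> xc=scor88-ones(88,1)*mean(scor88);       % center the columns
>> xcs=xc./(ones(88,1)*std(scor88));        % divide each column by its standard deviation
>> [U,S,V]=svd(xcs,0);
>> (diag(S).^2)/87                          % eigenvalues of corrcoef(scor88)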

Matlab for the Scores Example (in the handout of 4/27/99)

The scores data (the 88 by 5 matrix scor88):
>> round(cov(scor88))
ans =
   306   127   102   106   117
   127   173    85    95    99
   102    85   113   112   122
   106    95   112   220   156
   117    99   122   156   298
>> corrcoef(scor88) 
ans =
    1.0000    0.5534    0.5468    0.4094    0.3891
    0.5534    1.0000    0.6096    0.4851    0.4364
    0.5468    0.6096    1.0000    0.7108    0.6647
    0.4094    0.4851    0.7108    1.0000    0.6072
    0.3891    0.4364    0.6647    0.6072    1.0000
>> mean(scor88)
ans = 38.95   50.59   50.60   46.68   42.3068
>> scorc88=scor88-ones(88,1)*mean(scor88);
>> mean(scorc88)
ans =1.0e-13 *
    0.1421   -0.1970   -0.0565   -0.0436   -0.0363
>> [U,S,V] = svd(scorc88,0);
>> S
S =
  244.4752         0         0         0         0
         0  132.6034         0         0         0
         0         0   95.0053         0         0
         0         0         0   85.8070         0
         0         0         0         0   52.8898
>> V
V =
    0.5054   -0.7487   -0.2998    0.2962   -0.0794
    0.3683   -0.2074    0.4156   -0.7829   -0.1889
    0.3457    0.0759    0.1453   -0.0032    0.9239
    0.4511    0.3009    0.5966    0.5181   -0.2855
    0.5347    0.5478   -0.6003   -0.1757   -0.1512
>> [VE, LE]=eig(cov(scorc88)) 
VE =
    0.2962   -0.2998    0.0794   -0.7487    0.5054
   -0.7829    0.4156    0.1889   -0.2074    0.3683
   -0.0032    0.1453   -0.9239    0.0759    0.3457
    0.5181    0.5966    0.2855    0.3009    0.4511
   -0.1757   -0.6003    0.1512    0.5478    0.5347
LE =
   84.6304         0         0         0         0
         0  103.7473         0         0         0
         0         0   32.1533         0         0
         0         0         0  202.1111         0
         0         0         0         0  686.9898
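
The two decompositions agree: the columns of VE are the columns of V up to sign and ordering, and the eigenvalues in LE are the squared singular values divided by $n-1$. A quick check:

>> (diag(S).^2)/(88-1)                      % reproduces the diagonal of LE, in decreasing order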
Nonparametric Bootstrap of the proportion of variance explained by the first few ($k$) axes.
function out=bpca(B,orig,k)
%Bootstrap of the PCA of a dataset
%Output the proportion
%of variance for the first k axes
[n,p]=size(orig);
out=zeros(1,B);
for b=1:B
    newsample=bsample(orig);          % nonparametric resample of the rows
    out(b)=pca(newsample,n,p,k);
end %---------------------------


function out=bsample(orig)
%Function to create one resample from
%the original sample orig, where
%orig is the original data, and is a matrix
[n,p]=size(orig);
indices=randint(1,n,n)+1;   % n draws with replacement from 1:n
                            % (randint is in the Communications Toolbox;
                            %  randi(n,1,n) does the same in newer Matlab)
out=orig(indices,:);

%-------------------------------

function out=pca(data,n,p,k)
%Do a principal component analysis
%Return the ratio of the variance
%explained by the first k components
data=data-ones(n,1)*mean(data);       % center each column
[U,S,V] = svd(data,0);
S2=diag(S).^2;                        % squared singular values
out=sum(S2(1:k))/sum(S2(1:p));        % proportion of variance in the first k axes
%---------------------------------

>> resbpca10K1=bpca(10000,scor88,1);
>> resbpca10K=bpca(10000,scor88,2);
>> hist(resbpca10K,100)
[Histograms of the bootstrap distribution of the proportion of variance: one component, two components]
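
The bootstrap distribution can also be summarized numerically, for instance:

>> mean(resbpca10K)                         % bootstrap mean of the proportion of variance
>> std(resbpca10K)                          % bootstrap standard error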

How to generate a multivariate normal

The Cholesky decomposition of a positive definite matrix $A$ is the factorization

\begin{displaymath}A=L^\prime L ,\end{displaymath}

where $L$ is upper triangular; this is the factor that Matlab's chol returns.

>> chl=chol(cov(scor88));
>> round(chl'*chl)
ans =
   306   127   102   106   117
   127   173    85    95    99
   102    85   113   112   122
   106    95   112   220   156
   117    99   122   156   298

>> xs=randn(88,5);
>> mean(xs)
ans =
    0.0480   -0.2056    0.1209    0.1097    0.1051
>> corrcoef(xs)
ans =
    1.0000   -0.1623   -0.0664   -0.1947   -0.1042
   -0.1623    1.0000    0.1412    0.0361    0.0600
   -0.0664    0.1412    1.0000   -0.2334    0.0056
   -0.1947    0.0361   -0.2334    1.0000   -0.0734
   -0.1042    0.0600    0.0056   -0.0734    1.0000
>> ys=xs*chl; 
>> cov(ys)
ans =
  225.8649   65.7374   57.5737   29.1913   40.6952
   65.7374  149.5844   74.5579   74.1853   83.3744
   57.5737   74.5579   93.5467   66.5157   89.7249
   29.1913   74.1853   66.5157  152.4465   88.0262
   40.6952   83.3744   89.7249   88.0262  210.1678
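
If the rows of xs are i.i.d. $N(0,I)$, then the rows of ys=xs*chl have covariance chl'*chl, which is cov(scor88). The simulated data have mean zero; this does not matter for the PCA (which recenters), but the sample mean can be added back if simulated scores on the original scale are wanted, for instance:

>> ys=randn(88,5)*chl+ones(88,1)*mean(scor88);   % simulated data with the sample mean and (expected) covariance of scor88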
Parametric Bootstrap Simulations
>> xs=randn(88,5);
>> mean(xs)
ans= 0.013  0.035 -0.151 -0.059 0.033
>> ys=xs*chl; 
>> round(cov(ys))
ans =
   309   169   147   130   150
   169   237   137   160   146
   147   137   161   158   149
   130   160   158   252   172
   150   146   149   172   272
function out=spca(B,orig,k)
%Parametric Bootstrap of the PCA of a dataset
%Output the proportion
%of variance for the first k axes
out=zeros(1,B);
[n,p]=size(orig);
chl=chol(cov(orig));
for b=1:B
    rands=randn(n,p);                 % n x p matrix of iid N(0,1) entries
    newsample=rands*chl;              % rows have covariance chl'*chl = cov(orig)
    out(b)=pca(newsample,n,p,k);
end
>> r100k=spca(100000,scor88,2);
>> hist(r100k,200);
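
The parametric results can then be compared directly with the nonparametric bootstrap from before, for instance:

>> [mean(resbpca10K) mean(r100k)]           % the two bootstrap means
>> [std(resbpca10K) std(r100k)]             % the two bootstrap standard errors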


Susan Holmes 2004-05-19