

Some notation

From an original sample

\begin{displaymath}{\cal X}_n
=(X_1,X_2,\ldots,X_n) \stackrel{iid}{\sim} F\end{displaymath}

draw a new sample of $n$ observations from the original sample with replacement, each observation having the same probability of being drawn ($=\frac{1}{n}$). A bootstrap sample is often denoted

\begin{displaymath}{\cal X}_n^*
=(X_1^*,X_2^*,\ldots,X_n^*) \stackrel{iid}{\sim} F_n,
\mbox{ the empirical distribution }\end{displaymath}

If we are interested in the behaviour of a random variable $\widehat{\theta}=\theta({\cal X}_n,F)$, then we can consider the sequence of $B$ new values obtained by computing the statistic on $B$ new bootstrap samples.

Practically speaking, this requires the generation of $n$ integers between $1$ and $n$, each of these integers having the same probability.

Here is an example of a line of MATLAB that does just that:

indices=randint(1,n,n)+1;

Or, if you have the Statistics Toolbox, you can use:

indices=unidrnd(n,1,n);

If we use S we won't need to generate the new observations one by one; the following command generates an $n$-vector sampled with replacement from the vector of indices $(1,\ldots,n)$:

sample(n,n,replace=T)
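In the mouse example below we will call a helper function bsample that draws one bootstrap resample. The original function file bsample.m is not reproduced in these notes, but a minimal MATLAB sketch consistent with how it is used would be:

function xstar=bsample(x)
% Draw one bootstrap resample of the data vector x: n draws with
% replacement, each index equally likely (a sketch; the original
% bsample.m is not shown in these notes).
n=length(x);
indices=ceil(n*rand(1,n));   % n uniform integers from {1,...,n}
xstar=x(indices);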

An approximation of the distribution of the estimate $\widehat{\theta}=\theta({\cal X}_n,F)$ is provided by the distribution of

\begin{displaymath}\widehat{\theta}^{*b}=
\theta({\cal X}_n^{*b},F_n), \qquad b=1,\ldots,B
\end{displaymath}

$G_n^*(t)=P_{F_n}\left(
\widehat{\theta}^* \leq t \right)$ denotes the bootstrap distribution of $\widehat{\theta}^*$, often approximated by

\begin{displaymath}
\widehat{G}_n^*(t)=
\#\{b:\widehat{\theta}^{*b} \leq t\}/B
\end{displaymath}
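In MATLAB, with the $B$ replicates stored in a vector thetab (as in the mouse example below), this approximation at any point t is a one-liner:

Ghat=sum(thetab<=t)/length(thetab);   % fraction of replicates <= t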

\fbox{The Bootstrap Algorithm}

  1. Compute the original estimate from the original data: $\widehat{\theta}=\theta({\cal X}_n)$.
  2. For b=1 to B do: % B is the number of bootstrap samples
    1. Create a resample ${\cal X}_b^*$.
    2. Compute $\widehat{\theta}^*_b=\theta({\cal X}_b^*)$.
  3. Compare the values $\widehat{\theta}^*_b$ to $\widehat{\theta}$, as in the sketch below.
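For example, here is a direct MATLAB transcription for the median of a data vector x, using the bsample sketch above (a sketch, not the notes' original code; any other statistic can replace median):

B=1000;                             % number of bootstrap samples
thetahat=median(x);                 % 1. original estimate
thetastar=zeros(1,B);
for b=1:B
  thetastar(b)=median(bsample(x));  % 2. resample and recompute
end
hist(thetastar)                     % 3. compare the replicates to thetahat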

Accuracy of the sample mean

Using the linearity of the mean and the fact that the sample is iid, we have

\begin{displaymath}
\widehat{se}(\bar{x})= \sqrt{\frac{s^2}{n}}\end{displaymath}

where $s^2$ is the usual estimate of the variance obtained from the sample.
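In MATLAB, for a data vector x, this is:

se_mean=sqrt(var(x)/length(x));   % estimated standard error of the mean

exactly the var(treat)/7 and sqrt computation carried out in the mouse example below.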

If we were given $B$ true samples, and their associated estimates $\widehat{\theta}^{*b}=s({\cal X}^{*b})$, we could compute the usual standard error estimate for this sample of $B$ values, namely:

\begin{displaymath}\widehat{se}_{boot}(s)=\left\{ \sum_{b=1}^B
\left[s({\cal X}^{*b})-s({\cal X}^{*\cdot})\right]^2/(B-1)
\right\}^{\frac{1}{2}}
\end{displaymath}

where

\begin{displaymath}s({\cal X}^{*\cdot})=\frac{1}{B}\sum_{b=1}^B s({\cal X}^{*b})\end{displaymath}
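With the $B$ replicates stored in a vector thetab, this is computed in MATLAB as

se_boot=sqrt(var(thetab));   % var already uses the 1/(B-1) normalization

which is exactly the sqrt(var(thetab)) computation appearing in the mouse example below.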

Mouse example

Here are some computations for the mouse data (page 11 of the text).

Treatment Group
treat=[94 38 23 197 99 16 141]'
treat =
    94
    38
    23
   197
    99
    16
   141
>> median(treat)         
ans =    94
>> mean(treat)
ans =   86.8571
>> var(treat)
ans =   4.4578e+03
>> var(treat)/7
ans =
  636.8299
>> sqrt(637)
ans =   25.2389
thetab=zeros(1,1000);
for b = 1:1000
thetab(b)=median(bsample(treat));
end
hist(thetab)
>> sqrt(var(thetab))
ans =
   37.7768
>> mean(thetab)
ans =
   80.5110
This is what the histogram looks like: [histogram of the 1000 bootstrap medians for the treatment group]
Control Group
control=[52 104 146 10 51 30 40 27 46]';
>> median(control)
ans =    46
>> mean(control)
ans =   56.2222
>> var(control)
ans =   1.8042e+03
>> var(control)/length(control)
ans =  200.4660
>> sqrt(200.4660)
ans =   14.1586
thetab=zeros(1,1000);
for b = 1:1000
thetab(b)=median(bsample(control));
end
hist(thetab)
>> sqrt(var(thetab))
ans =   11.9218
>> mean(thetab)
ans =   45.4370
This is what the histogram looks like: [histogram of the 1000 bootstrap medians for the control group]

Comparing the two medians, we can use the estimates of the standard errors to ask whether the difference between the two medians is significant.
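As a rough check, treating the two groups as independent, the observed difference of the medians is $94-46=48$, with estimated standard error

\begin{displaymath}\sqrt{37.78^2+11.92^2}\approx 39.6\end{displaymath}

so the ratio $48/39.6\approx 1.21$ is well below $2$, and the difference between the two medians is not significant at the usual $5\%$ level.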

The combinatorics of the bootstrap distribution

As we noted in class, and as the histograms show, the main feature of the bootstrap distribution of the median is that it can take on very few values; in the case of the treatment group, for instance, only $7$. The simple bootstrap will always present this discrete character, even if we know the underlying distribution is continuous. There are ways to fix this, and in many cases it won't matter, but it is an important feature.
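This can be checked in MATLAB by re-running the treatment-group loop above and listing the distinct replicate values:

unique(thetab)   % at most 7 distinct values: a resample median is always one of the 7 observations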

How many different bootstrap samples are there?

By different samples we mean samples that differ as unordered collections: there is no difference between the sample $\{x_1,x_2,\ldots,x_n\}$ and $\{x_2,x_1,\ldots,x_n\}$; that is, the observations are exchangeable, or the statistic of interest is a symmetric function $s$ of the sample: $\hat{\theta}=s({\cal X})$.
Definition:
The sequence $(X_1,X_2,\ldots,X_n)$ of random variables is said to be exchangeable if the distribution of the $n$-vector $(X_1,X_2,\ldots,X_n)$ is the same as that of $(X_{\pi(1)},X_{\pi(2)},\ldots,X_{\pi(n)})$, for any permutation $\pi$ of $n$ elements.

Suppose we condition on the sample of $n$ distinct observations ${\cal X}$. There are as many different resamples as there are ways of choosing $n$ objects out of a set of $n$ possible contenders, repetitions being allowed.

At this point it is interesting to introduce a new notation for a bootstrap resample. Up to now we have written a possible resample as, say, ${\cal X}^{*b}=\{x_1,x_1,x_3,x_4,x_4\}$ (here $n=5$). Because of the exchangeability/symmetry property we can recode this as the $n$-vector counting the number of occurrences of each of the observations. In this recoding we have ${\cal X}^{*b}=(2,0,1,2,0)$, and the set of all bootstrap resamples is the set of integer points of the $n$-dimensional simplex

\begin{displaymath}C_n=\{(k_1,k_2,\ldots,k_n):\ k_i \in {\mathbb N},\ \sum_i k_i=n \}\end{displaymath}

Here is the argument I used in class to explain how big $C_n$ is. Each component of the vector is considered to be a box; there are $n$ boxes to contain $n$ balls in all, and we want to count the number of ways of separating the $n$ balls into the $n$ boxes. Put down $n-1$ separator bars $\vert$ to make the boxes, along with the $n$ balls; there are then $2n-1$ positions from which to choose the $n-1$ bars' positions. For instance, our vector above corresponds to: oo||o|oo| . Thus

\begin{displaymath}\vert C_n\vert={{2n-1}\choose{n-1}}\end{displaymath}

Stirling's formula ( $n!\sim n^ne^{-n}(2\pi n)^{\frac{1}{2}}$) gives the approximation $\vert C_n\vert \sim (n\pi)^{-\frac{1}{2}} 2^{2n-1}$.
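To see where this comes from: by symmetry ${{2n-1}\choose{n-1}}={{2n-1}\choose{n}}$, so $\vert C_n\vert$ is half the central binomial coefficient ${{2n}\choose{n}}$, and Stirling's formula gives ${{2n}\choose{n}}\sim 2^{2n}/\sqrt{\pi n}$; hence

\begin{displaymath}
\vert C_n\vert=\frac{1}{2}{{2n}\choose{n}}
\sim \frac{1}{2}\,\frac{2^{2n}}{\sqrt{\pi n}}
=(n\pi)^{-\frac{1}{2}}\,2^{2n-1}
\end{displaymath}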

Here is the function file approxcom.m

function out=approxcom(n)
% Approximate number of distinct bootstrap resamples of size n (Stirling)
out=round((pi*n)^(-.5)*2^(2*n-1));
that produces the following table of the number of resamples:
\begin{array}{\vert l\vert l\vert l\vert l\vert l\vert l\vert l\vert l\vert l\vert}
\hline
n & 5 & 7 & 10 & 12 & 15 & 20 & 25 & 30\\
\hline
\vert C_n\vert & 129 & 1747 & 93539 & 1366232 & 78207663 & 6.93\times 10^{10}& 6.35\times 10^{13} &
5.94\times 10^{16}\\
\hline
\end{array}
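As a check against the exact count, for the treatment-group size $n=7$ (nchoosek is MATLAB's built-in binomial coefficient):

approxcom(7)         % approximation: 1747
nchoosek(2*7-1,7-1)  % exact count: 1716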

Are all these samples equally likely? Thinking about the probability of drawing the sample made up entirely of $x_1$'s, by choosing the index $1$ all $n$ times in the uniform integer generation, should persuade you that this sample appears only once in $n^{n}$ times, whereas the sample with $x_1$ once and $x_2$ for all the other observations can appear in $n$ out of the $n^{n}$ ways.

Which is the most likely bootstrap sample?

The most likely resample is the original sample ${\cal X}=\{x_1,x_2,\ldots,x_n\}$; the easiest way to see this is to consider the following.


The multinomial distribution

In fact, when we are drawing bootstrap resamples we are just drawing from the multinomial distribution a vector $(k_1,k_2,\ldots,k_n)$, with each of the $n$ categories being equally likely, $p_i=\frac{1}{n}$, so that the probability of a possible vector is

\begin{displaymath}Prob_{boot}(k_1,k_2,\ldots,k_n)=\frac{n!}{k_1!k_2!\cdots k_n!}
\left(\frac{1}{n}\right)^{k_1+k_2+\cdots+k_n}=
{{n}\choose{k_1,k_2,\ldots,k_n}}\, n^{-n}
\end{displaymath}

This will be largest when all the $k_i$'s are $1$: any $k_i\geq 2$ forces some other $k_j=0$ and makes the denominator $k_1!k_2!\cdots k_n!$ larger than $1$. Thus the most likely sample in the bootstrap resampling is the original sample. Here is the table of the corresponding probabilities:
\begin{array}{\vert l\vert l\vert l\vert l\vert l\vert l\vert l\vert l\vert l\vert}
\hline
n & 2 & 3 & 5 & 7 & 10 & 12 & 15 & 20\\
\hline
n!/n^{n} & 0.5 & 0.22 & 3.8\times 10^{-2} & 6.1\times 10^{-3} & 3.6\times 10^{-4} & 5.4\times10^{-5} & 3\times 10^{-6} &
2.3\times 10^{-8}\\
\hline
\end{array}
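These entries are just $n!/n^{n}$, which is easy to check in MATLAB:

n=12;
factorial(n)/n^n     % ans = 5.3723e-05, matching the table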
As long as the statistic is a somewhat smooth function of the observations, we can see that the discreteness of the bootstrap distribution is not a problem.
Susan Holmes 2004-05-19