draw a new sample of $n$ observations from the original sample with replacement, each observation having the same probability of being drawn ($1/n$). A bootstrap sample is often denoted $x^* = (x_1^*, \dots, x_n^*)$.

If we are interested in the behaviour of a random variable $\hat{\theta}$ computed from the sample, then we can consider the sequence of $B$ new values $\hat{\theta}^{*1}, \dots, \hat{\theta}^{*B}$ obtained by recomputing it on $B$ new bootstrap samples.

Practically speaking, this requires generating $n$ integers between 1 and n, each of these integers having the same probability.

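Here is an example of a line of `matlab` that does just that:
`indices=randint(1,n,n)+1;`
(in recent versions of `matlab`, where `randint` is no longer available, `indices=randi(n,1,n);` does the same).
Or if you have the statistics toolbox, you can use:
`indices=unidrnd(n,1,n);`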

If we use S we won't need to generate the new observations one by one; the following command draws an n-vector with replacement from the vector of indices (1...n):

`sample(n,n,replace=T)`
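The same index draw can be sketched in Python with only the standard library. Python is a swap-in here (these notes use `matlab` and S), and the function name `bootstrap_indices` and the small `data` list are illustrative assumptions, not part of the notes.

```python
import random

def bootstrap_indices(n, rng=random):
    """Draw n indices uniformly from 1..n, with replacement."""
    return [rng.randint(1, n) for _ in range(n)]

rng = random.Random(0)                      # fixed seed so the run is repeatable
data = ["a", "b", "c", "d", "e"]            # placeholder observations (assumption)
indices = bootstrap_indices(len(data), rng)
resample = [data[i - 1] for i in indices]   # shift 1-based indices to Python's 0-based lists
print(indices, resample)
```

Indices run from 1 to n to match the convention used in the `matlab` and S one-liners; the shift by one is only needed because Python lists start at 0.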

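An approximation of the distribution of the estimate $\hat{\theta}$ is provided by the distribution of the bootstrap replicates $\hat{\theta}^{*b}$, $b = 1, \dots, B$.

$G_n^*(t) = P_{F_n}(\hat{\theta}^* \le t)$, where $F_n$ is the empirical distribution of the sample, denotes the bootstrap distribution of $\hat{\theta}^*$, often approximated by

$$\hat{G}_n^*(t) = \frac{\#\{b : \hat{\theta}^{*b} \le t\}}{B}.$$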

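- Compute the original estimate $\hat{\theta}$ from the original data.
- For $b = 1$ to $B$ do : %B is the number of bootstrap samples
  - Create a resample $x^{*b}$
  - Compute $\hat{\theta}^{*b}$ from $x^{*b}$

- Compare the $\hat{\theta}^{*b}$ to $\hat{\theta}$.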
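The steps above can be sketched as follows, in Python rather than the `matlab` or S used in these notes. The function names and the small data vector are made up for illustration. `empirical_cdf` returns the proportion of replicates at or below a value `t`, which is the usual Monte Carlo approximation to the bootstrap distribution.

```python
import random
import statistics

def bootstrap_replicates(data, statistic, B, seed=None):
    """Return the original estimate and B bootstrap replicates of `statistic`."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = statistic(data)                 # estimate from the original data
    replicates = []
    for _ in range(B):
        resample = rng.choices(data, k=n)       # n draws with replacement, each with prob. 1/n
        replicates.append(statistic(resample))  # estimate recomputed on the resample
    return theta_hat, replicates

def empirical_cdf(replicates, t):
    """Proportion of replicates that are <= t."""
    return sum(r <= t for r in replicates) / len(replicates)

data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3]           # made-up numbers, for illustration only
theta_hat, reps = bootstrap_replicates(data, statistics.median, B=1000, seed=1)
print(theta_hat, empirical_cdf(reps, theta_hat))
```

The final "compare" step is left open on purpose: a histogram of `reps`, its spread, or `empirical_cdf` evaluated at a few points are all reasonable ways to compare the replicates with `theta_hat`.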

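The standard error of the sample mean $\bar{x}$ is usually estimated by $\sqrt{s^2/n}$, where $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ is the usual estimate of the variance obtained from the sample.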

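If we were given $B$ true samples, and their associated estimates $\hat{\theta}^{*b}$, $b = 1, \dots, B$, we could compute the usual variance estimate for this sample of $B$ values, namely:

$$\widehat{\mathrm{var}}_B = \frac{1}{B-1}\sum_{b=1}^{B}\left(\hat{\theta}^{*b} - \hat{\theta}^{*}(\cdot)\right)^2$$

where

$$\hat{\theta}^{*}(\cdot) = \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}^{*b}.$$

The square root of this quantity is the bootstrap estimate of the standard error.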
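As a minimal sketch of this computation, in Python rather than `matlab`, the function below takes the square root of the variance of the replicates with divisor $B - 1$. The name `bootstrap_se` is an assumption made for this example. The data are the treatment-group values from the example that follows, and the statistic is the mean so that the result can be set beside the formula $\sqrt{s^2/n}$.

```python
import math
import random
import statistics

def bootstrap_se(data, statistic, B=1000, seed=None):
    """Bootstrap standard error: spread of B replicates, with divisor B - 1."""
    rng = random.Random(seed)
    n = len(data)
    reps = [statistic(rng.choices(data, k=n)) for _ in range(B)]
    center = sum(reps) / B                                  # average of the replicates
    var_b = sum((r - center) ** 2 for r in reps) / (B - 1)  # usual variance estimate
    return math.sqrt(var_b)

treat = [94, 38, 23, 197, 99, 16, 141]
se_boot = bootstrap_se(treat, statistics.mean, B=2000, seed=0)
se_formula = math.sqrt(statistics.variance(treat) / len(treat))  # sqrt(s^2 / n)
print(se_boot, se_formula)
```

For the mean the two numbers should be close, with the bootstrap value typically a little smaller, since resampling from the empirical distribution corresponds to the divisor $n$ rather than $n - 1$ in the variance.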

Treatment Group

    >> treat=[94 38 23 197 99 16 141]';
    >> median(treat)
    ans = 94
    >> mean(treat)
    ans = 86.8571
    >> var(treat)
    ans = 4.4578e+03
    >> var(treat)/7
    ans = 636.8299
    >> sqrt(637)
    ans = 25.2389
    >> thetab=zeros(1,1000);
    >> for (b =(1:1000)) thetab(b)=median(bsample(treat)); end
    >> hist(thetab)
    >> sqrt(var(thetab))
    ans = 37.7768
    >> mean(thetab)
    ans = 80.5110

[Figure: histogram of the 1000 bootstrap medians `thetab` for the treatment group]

Control Group

    >> control=[52 104 146 10 51 30 40 27 46]';
    >> median(control)
    ans = 46
    >> mean(control)
    ans = 56.2222
    >> var(control)
    ans = 1.8042e+03
    >> var(control)/length(control)
    ans = 200.4660
    >> sqrt(200.4660)
    ans = 14.1586
    >> thetab=zeros(1,1000);
    >> for (b =(1:1000)) thetab(b)=median(bsample(control)); end
    >> hist(thetab)
    >> sqrt(var(thetab))
    ans = 11.9218
    >> mean(thetab)
    ans = 45.4370

[Figure: histogram of the 1000 bootstrap medians `thetab` for the control group]

To compare the two medians, could we use the estimates of the standard errors to find out whether the difference between them is significant?
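One rough way to approach this question is to divide the observed difference by an estimate of its standard error, obtained by adding the two bootstrap variances. The Python sketch below plugs in the numbers printed in the two `matlab` sessions. Adding the variances assumes the two groups are independent, and reading the ratio like a z-score is a heuristic assumption of this example rather than a method stated in these notes.

```python
import math

median_treat, se_treat = 94, 37.7768       # from the treatment-group session
median_control, se_control = 46, 11.9218   # from the control-group session

diff = median_treat - median_control
se_diff = math.sqrt(se_treat ** 2 + se_control ** 2)   # independence of the groups assumed
ratio = diff / se_diff
print(diff, round(se_diff, 2), round(ratio, 2))
```

The ratio comes out at roughly 1.2, well short of the conventional threshold of about 2, so on this rough check the difference between the medians would not be judged significant.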

Definition:

The sequence of random variables $X_1, \dots, X_n$ is said to be exchangeable if the distribution of the vector $(X_1, \dots, X_n)$ is the same as that of $(X_{\pi(1)}, \dots, X_{\pi(n)})$, for any permutation $\pi$ of $n$ elements.

Suppose we condition on the sample of $n$ distinct observations $x_1, \dots, x_n$. There are as many different samples as there are ways of choosing $n$ objects out of a set of $n$ possible contenders, repetitions being allowed.

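At this point it is interesting to introduce a new notation for a bootstrap resample. Up to now we have written a possible resample as a list of observations, say (for $n = 5$) $x^* = (x_1, x_1, x_3, x_4, x_4)$. Because of the exchangeability/symmetry property we can recode this as the $n$-vector counting the number of occurrences of each of the observations. In this recoding we have $x^* = (2, 0, 1, 2, 0)$, and the set of all bootstrap resamples is the $(n-1)$-dimensional simplex

$$C_n = \left\{(k_1, \dots, k_n) : k_i \in \{0, 1, \dots, n\},\ \sum_{i=1}^{n} k_i = n\right\}.$$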
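The recoding can be sketched in a few lines of Python (a swap-in for the `matlab` and S used in these notes). The helper name `to_counts` and the index lists are illustrative assumptions. The point is that two orderings of the same draws collapse to one count vector whose entries sum to $n$.

```python
from collections import Counter

def to_counts(resample_indices, n):
    """Count how often each of the indices 1..n occurs in a resample."""
    tally = Counter(resample_indices)
    return tuple(tally.get(i, 0) for i in range(1, n + 1))

# Two orderings of the same draws give the same count vector.
a = to_counts([1, 1, 3, 4, 4], n=5)
b = to_counts([4, 1, 3, 4, 1], n=5)
print(a, b, sum(a))
```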

Here is the argument I used in class to explain how big the set of all resamples $C_n$ is. Each component of the count vector is considered to be a box; there are $n$ boxes to contain $n$ balls in all, and we want to count the number of ways of separating the $n$ balls into the $n$ boxes. Put down $n-1$ separators `|` to make the boxes, and $n$ balls `o`; there will be $2n-1$ positions from which to choose the $n-1$ bars' positions. For instance, with $n = 5$ the count vector $(2, 0, 1, 2, 0)$ corresponds to:

`oo||o|oo|`

so that $|C_n| = \binom{2n-1}{n-1}$.

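Stirling's formula ($n! \sim n^n e^{-n}\sqrt{2\pi n}$) gives an approximation to the number of resamples, $\binom{2n-1}{n-1} \approx (\pi n)^{-1/2}\, 2^{2n-1}$,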

here is the function file `approxcom.m`

    function out=approxcom(n)
    out=round((pi*n)^(-.5)*2^(2*n-1));

that produces the following table of the number of resamples:
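As a cross-check, the Python sketch below tabulates the exact count $\binom{2n-1}{n-1}$ beside the Stirling-based value and, for small $n$, confirms the exact count by brute-force enumeration. Only `approx_count` mirrors `approxcom.m`; the other function names are assumptions made for this example.

```python
import math
from itertools import combinations_with_replacement

def exact_count(n):
    """Number of distinct bootstrap resamples: C(2n-1, n-1)."""
    return math.comb(2 * n - 1, n - 1)

def approx_count(n):
    """Stirling-based approximation, mirroring approxcom.m."""
    return round((math.pi * n) ** -0.5 * 2 ** (2 * n - 1))

def brute_force_count(n):
    """Enumerate unordered resamples of size n from n items (small n only)."""
    return sum(1 for _ in combinations_with_replacement(range(n), n))

for n in (5, 10, 15, 20):
    print(n, exact_count(n), approx_count(n))
```

The relative error of the approximation shrinks as $n$ grows, so for the sample sizes met in practice the approximate count is an adequate guide to how large the set of resamples is.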

Are all these samples equally likely? Thinking about the probability of drawing the sample consisting of all $x_1$'s, by choosing the index 1 $n$ times in the uniform integer generation, should persuade you that this sample appears only once in $n^n$ draws, whereas the sample containing $x_1$ once and each of the other observations once can appear in $n!$ out of the $n^n$ ways.

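The multinomial distribution

Drawing a bootstrap resample amounts to drawing its count vector $(k_1, \dots, k_n)$ from a multinomial distribution with $n$ trials and equal cell probabilities $1/n$, so that

$$P(k_1, \dots, k_n) = \frac{n!}{k_1! \cdots k_n!}\, n^{-n}.$$

This will be largest when all the $k_i$'s are 1; thus the most likely sample in bootstrap resampling is the original sample, with probability $n!/n^n$. Here is the table of the most likely values: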
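The claim can be checked numerically for a small $n$. The Python sketch below (a swap-in for `matlab`, with function names chosen for this example) evaluates the multinomial probability with equal cell probabilities $1/n$ for every count vector, then confirms that the probabilities sum to one and that the all-ones vector is the most likely.

```python
import math
from collections import Counter
from itertools import combinations_with_replacement

def resample_probability(counts):
    """Multinomial probability of a count vector (k_1..k_n) with all p_i = 1/n."""
    n = len(counts)
    assert sum(counts) == n, "counts must sum to n"
    ways = math.factorial(n)
    for k in counts:
        ways //= math.factorial(k)      # n! / (k_1! ... k_n!) stays an integer at each step
    return ways / n ** n

def all_count_vectors(n):
    """Yield every count vector with n non-negative entries summing to n."""
    for combo in combinations_with_replacement(range(n), n):
        tally = Counter(combo)
        yield tuple(tally.get(i, 0) for i in range(n))

n = 4
probs = {k: resample_probability(k) for k in all_count_vectors(n)}
most_likely = max(probs, key=probs.get)
print(most_likely, probs[most_likely], sum(probs.values()))
```

Enumerating all count vectors is only feasible for small $n$, but `resample_probability` on its own works for any $n$ and can be used to tabulate $n!/n^n$.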

As long as the statistic is a reasonably smooth function of the observations, we can see that the discreteness of the bootstrap distribution is not a problem.