

# Some notation

From an original sample $x = (x_1, \ldots, x_n)$, draw a new sample of $n$ observations among the original sample with replacement, each observation having the same probability of being drawn ($1/n$). A bootstrap sample is often denoted $x^* = (x_1^*, \ldots, x_n^*)$.

If we are interested in the behaviour of a random variable $\hat{\theta} = s(x)$, then we can consider the sequence of new values $\hat{\theta}^{*1}, \ldots, \hat{\theta}^{*B}$ obtained by computing $s$ on new bootstrap samples.

Practically speaking, this requires generating $n$ random integers between 1 and $n$, each integer having the same probability $1/n$.

Here is an example of a line of MATLAB that does just that:

indices=randint(1,n,n)+1;

Or, if you have the Statistics Toolbox, you can use:

indices=unidrnd(n,1,n);

If we use S, we won't need to generate the indices one by one: the following command draws an $n$-vector with replacement from the vector of indices $(1, \ldots, n)$.

sample(n,n,replace=T)
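The same index generation can also be sketched in Python with NumPy (an assumption outside the document's MATLAB/S; `default_rng` is NumPy's random generator, and the sample size 7 is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 7  # illustrative sample size, as in the treatment group below
# Draw n indices uniformly from 1..n, with replacement
# (integers() excludes its upper endpoint, hence n + 1)
indices = rng.integers(1, n + 1, size=n)
print(indices)
```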

An approximation of the distribution of the estimate $\hat{\theta}$ is provided by the distribution of the bootstrap replicates $\hat{\theta}^* = s(x^*)$.

$\hat{F}^*$ denotes the bootstrap distribution of $\hat{\theta}^*$, often approximated by the empirical distribution of $B$ replicates obtained as follows:

1. Compute the original estimate $\hat{\theta} = s(x)$ from the original data.
2. For b=1 to B do : %B is the number of bootstrap samples
   1. Create a resample $x^{*b}$.
   2. Compute $\hat{\theta}^{*b} = s(x^{*b})$.
3. Compare the replicates $\hat{\theta}^{*1}, \ldots, \hat{\theta}^{*B}$ to $\hat{\theta}$.
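The steps above can be sketched as a short Python function (a sketch, not the course's code; the helper name `bootstrap` and the use of NumPy are my own choices):

```python
import numpy as np

def bootstrap(x, s, B=1000, seed=0):
    """Return the B bootstrap replicates s(x*1), ..., s(x*B)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n = len(x)
    theta_star = np.empty(B)
    for b in range(B):
        # Step 2.1: resample n observations with replacement
        resample = x[rng.integers(0, n, size=n)]
        # Step 2.2: compute the statistic on the resample
        theta_star[b] = s(resample)
    return theta_star

# Step 1: the original estimate; step 3: compare the replicates to it
x = [94, 38, 23, 197, 99, 16, 141]   # treatment data from the mouse example
theta_hat = np.median(x)
replicates = bootstrap(x, np.median, B=1000)
print(theta_hat, replicates.std(ddof=1))
```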

## Accuracy of the sample mean

Using the linearity of the mean and the fact that the sample is iid, we have $\mathrm{var}(\bar{x}) = \sigma^2 / n$, so that $\widehat{se}(\bar{x}) = \hat{\sigma} / \sqrt{n}$,

where $\hat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$ is the usual estimate of the variance obtained from the sample.

If we were given $B$ true samples and their associated estimates $\hat{\theta}^{(1)}, \ldots, \hat{\theta}^{(B)}$, we could compute the usual variance estimate for this sample of values, namely:

$\widehat{se}_B^2 = \frac{1}{B-1} \sum_{b=1}^{B} \left( \hat{\theta}^{(b)} - \bar{\theta} \right)^2$

where $\bar{\theta} = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^{(b)}$.
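As a numerical check of the two formulas, one can compare the plug-in standard error $\hat{\sigma}/\sqrt{n}$ with the bootstrap estimate $\widehat{se}_B$ for the mean. This Python sketch (my own, not from the lectures) uses the treatment data from the mouse example:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.array([94, 38, 23, 197, 99, 16, 141])
n = len(x)

# Closed-form estimate: sqrt(var_hat / n), with the usual (n-1) divisor
se_formula = np.sqrt(x.var(ddof=1) / n)

# Bootstrap estimate: sample SD of B bootstrap means
B = 2000
means = np.array([x[rng.integers(0, n, size=n)].mean() for _ in range(B)])
se_boot = means.std(ddof=1)

# The bootstrap targets the plug-in variance (divisor n), so se_boot
# sits slightly below se_formula, plus Monte Carlo noise.
print(se_formula, se_boot)
```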

### Mouse example

Here are some computations for the mouse data (page 11 of the text).

Treatment Group:

    treat=[94 38 23 197 99 16 141]'
    treat =
        94
        38
        23
       197
        99
        16
       141
    >> median(treat)
    ans = 94
    >> mean(treat)
    ans = 86.8571
    >> var(treat)
    ans = 4.4578e+03
    >> var(treat)/7
    ans = 636.8299
    >> sqrt(637)
    ans = 25.2389
    thetab=zeros(1,1000);
    for b = 1:1000
      thetab(b)=median(bsample(treat));
    end
    hist(thetab)
    >> sqrt(var(thetab))
    ans = 37.7768
    >> mean(thetab)
    ans = 80.5110

This is what the histogram looks like: [figure: histogram of the 1000 bootstrap medians for the treatment group]
Control Group:

    control=[52 104 146 10 51 30 40 27 46]';
    >> median(control)
    ans = 46
    >> mean(control)
    ans = 56.2222
    >> var(control)
    ans = 1.8042e+03
    >> var(control)/length(control)
    ans = 200.4660
    >> sqrt(200.4660)
    ans = 14.1586
    thetab=zeros(1,1000);
    for b = 1:1000
      thetab(b)=median(bsample(control));
    end
    hist(thetab)
    >> sqrt(var(thetab))
    ans = 11.9218
    >> mean(thetab)
    ans = 45.4370

This is what the histogram looks like: [figure: histogram of the 1000 bootstrap medians for the control group]

Comparing the two medians, we can use the estimates of the standard errors to ask whether the difference between the two medians is significant.
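The comparison can be sketched in Python (my own sketch; the helper `boot_se` is hypothetical, the bootstrap numbers will differ from the MATLAB runs above because of random resampling, and combining the two standard errors in quadrature assumes the groups are independent):

```python
import numpy as np

def boot_se(x, stat, B=1000, rng=None):
    """Bootstrap standard error of stat(x)."""
    rng = rng or np.random.default_rng(2)
    x = np.asarray(x)
    n = len(x)
    reps = np.array([stat(x[rng.integers(0, n, size=n)]) for _ in range(B)])
    return reps.std(ddof=1)

treat = np.array([94, 38, 23, 197, 99, 16, 141])
control = np.array([52, 104, 146, 10, 51, 30, 40, 27, 46])

diff = np.median(treat) - np.median(control)   # 94 - 46 = 48
se_t = boot_se(treat, np.median)
se_c = boot_se(control, np.median)
# SEs of independent groups combine in quadrature
se_diff = np.sqrt(se_t**2 + se_c**2)
print(diff, se_diff, diff / se_diff)           # a rough z-like ratio
```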

## The combinatorics of the bootstrap distribution

As we noted in class, and as the histograms show, the main feature of the bootstrap distribution of the median is that it can take on very few values; in the case of the treatment group, for instance, the median of a resample must be one of the 7 original observations. The simple bootstrap will always present this discrete character, even if we know the underlying distribution is continuous. There are ways to fix this, and in many cases it won't matter, but it is an important feature.

### How many different bootstrap samples are there?

By different samples, we mean samples that differ as multisets: there is no difference between the sample $(x_2, x_1, x_3, \ldots)$ and $(x_1, x_2, x_3, \ldots)$, i.e. the observations are exchangeable, or equivalently the statistic of interest is a symmetric function of the sample: $s(x_1, \ldots, x_n) = s(x_{\sigma(1)}, \ldots, x_{\sigma(n)})$.
Definition:
The sequence of random variables $X_1, \ldots, X_n$ is said to be exchangeable if the distribution of the vector $(X_1, \ldots, X_n)$ is the same as that of $(X_{\sigma(1)}, \ldots, X_{\sigma(n)})$, for any permutation $\sigma$ of $n$ elements.

Suppose we condition on the sample of $n$ distinct observations $(x_1, \ldots, x_n)$; there are as many different bootstrap samples as there are ways of choosing $n$ objects out of a set of $n$ possible contenders, repetitions being allowed.

At this point it is interesting to introduce a new notation for a bootstrap resample. Up to now we have written out a possible resample, say $(x_1^*, \ldots, x_n^*)$; because of the exchangeability/symmetry property we can recode this as the vector $(k_1, \ldots, k_n)$ counting the number of occurrences of each of the $n$ observations. In this recoding we have $k_1 + \cdots + k_n = n$, and the set of all bootstrap resamples is the set of integer points of the $(n-1)$-dimensional simplex

$\mathcal{S} = \{ (k_1, \ldots, k_n) : k_i \geq 0, \; k_1 + \cdots + k_n = n \}$

Here is the argument I used in class to explain how big $\mathcal{S}$ is. Each component $k_i$ in the vector is considered to be a box; there are $n$ boxes to contain $n$ balls in all, and we want to count the number of ways of separating the $n$ balls into the $n$ boxes. Put down $n-1$ separators ("bars") to make the $n$ boxes, alongside the $n$ balls; there will be $2n-1$ positions from which to choose the $n-1$ bars' positions. For instance, for $n = 5$ the vector $(2, 0, 1, 2, 0)$ corresponds to: oo||o|oo| . Thus

$\#\mathcal{S} = \binom{2n-1}{n-1}$

Stirling's formula ($n! \simeq \sqrt{2\pi n}\,(n/e)^n$) gives the approximation

$\binom{2n-1}{n-1} \simeq \frac{2^{2n-1}}{\sqrt{\pi n}}$

Here is the function file approxcom.m

function out=approxcom(n)
% Stirling approximation to the number of resamples C(2n-1,n-1)
out=round((pi*n)^(-.5)*2^(2*n-1));

that produces the following table of the number of resamples:
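The exact count and the approximation can be compared side by side; here is a Python sketch (my own, with `approxcom` transcribed from the MATLAB file above):

```python
import math

def approxcom(n):
    """Stirling approximation to C(2n-1, n-1), as in approxcom.m."""
    return round((math.pi * n) ** -0.5 * 2 ** (2 * n - 1))

for n in range(2, 11):
    # Exact number of distinct bootstrap resamples of size n
    exact = math.comb(2 * n - 1, n - 1)
    print(n, exact, approxcom(n))
```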

Are all these samples equally likely? Thinking about the probability of drawing the sample made of $n$ copies of $x_1$, by choosing the index 1 all $n$ times in the integer uniform generation, should persuade you that this sample appears only once in $n^n$ times; whereas the sample containing $x_1$ once, along with each of the other observations once, can appear in $n!$ out of the $n^n$ ways.

### Which is the most likely bootstrap sample?

The most likely resample is the original sample itself, $k_1 = \cdots = k_n = 1$; the easiest way to see this is to consider the multinomial probabilities of the count vectors:

### The multinomial distribution

In fact, when we are drawing bootstrap resamples we are just drawing from the multinomial distribution a vector $(k_1, \ldots, k_n)$, with each of the $n$ categories being equally likely, $p_i = 1/n$, so that the probability of a possible vector $k$ is

$P(k_1, \ldots, k_n) = \frac{n!}{k_1! \, k_2! \cdots k_n!} \, \frac{1}{n^n}$

This will be largest when all the $k_i$'s are 1; thus the most likely sample in the bootstrap resampling is the original sample itself, with probability $n!/n^n$. Here is the table of the most likely values:
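These multinomial probabilities can be checked numerically; a Python sketch (my own helper names):

```python
import math

def prob_original(n):
    """Multinomial probability of k_1 = ... = k_n = 1, i.e. n!/n^n."""
    return math.factorial(n) / n ** n

def prob_resample(k):
    """Probability n!/(k_1! ... k_n!) * n^-n of the count vector k."""
    n = sum(k)
    assert len(k) == n, "k must have one count per observation"
    coef = math.factorial(n)
    for ki in k:
        coef //= math.factorial(ki)
    return coef / n ** n

print(prob_original(7))                       # original sample, n = 7
print(prob_resample([7, 0, 0, 0, 0, 0, 0]))   # all-x1 sample: 1/7^7
```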

As long as the statistic is a somewhat smooth function of the observations, we can see that the discreteness of the bootstrap distribution is not a problem.

Susan Holmes 2004-05-19