
How many different bootstrap samples are there?

By different samples we mean samples that differ as sets (with multiplicities), ie there is no difference between the samples $\{x_1,x_2,\ldots,x_n\}$ and $\{x_2,x_1,\ldots , x_n \}$. This is appropriate when the observations are exchangeable or when the statistic of interest is a symmetric function $s$ of the sample: $\hat{\theta}=s(\mbox{${\cal X}$})$.
Definition:
The sequence $(X_1,X_2,\ldots,X_n)$ of random variables is said to be exchangeable if the distribution of the $n$-vector $(X_1,X_2,\ldots,X_n)$ is the same as that of $(X_{\pi(1)},X_{\pi(2)},\ldots,X_{\pi(n)})$, for any permutation $\pi$ of $n$ elements.

Suppose we condition on the sample $\mbox{${\cal X}$}$ of $n$ distinct observations. Then there are as many different samples as there are ways of choosing $n$ objects out of a set of $n$ possible contenders, repetitions being allowed and order being ignored.

At this point it is useful to introduce a new notation for a bootstrap resample. Up to now we have written a possible resample as, say, $\mbox{${\cal X}$}^{*b}=\{x_1,x_1,x_3,x_4,x_4\}$. Because of the exchangeability/symmetry property we can recode this as the $n$-vector counting the number of occurrences of each of the observations. In this recoding we have $\mbox{${\cal X}$}^{*b}=(2,0,1,2,0)$, and the set of all bootstrap resamples is the set of integer points of a simplex in $n$ dimensions

\begin{displaymath}C_n=\{(k_1,k_2,\ldots,k_n) : k_i \in \mathbb{N},\ \sum_{i=1}^{n} k_i=n \}\end{displaymath}
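The recoding can be illustrated with a short sketch. It is written in Python rather than MATLAB, and the function name `recode` and the string labels standing in for the observations are assumptions made for this example only. It assumes the $n$ original observations are distinct.

```python
from collections import Counter

def recode(resample, sample):
    """Return the count vector (k_1, ..., k_n) of `resample` relative to `sample`.

    Assumes the observations in `sample` are distinct (hypothetical helper).
    """
    counts = Counter(resample)          # a missing observation counts as 0
    return tuple(counts[x] for x in sample)

sample = ["x1", "x2", "x3", "x4", "x5"]
resample = ["x1", "x1", "x3", "x4", "x4"]

k = recode(resample, sample)
print(k)  # (2, 0, 1, 2, 0)
```

Any reordering of the resample gives the same count vector, which is why the count vector is enough for a symmetric statistic.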

Here is the argument I used in class to explain how big $C_n$ is. Each component of the vector is considered to be a box: there are $n$ boxes to contain $n$ balls in all, and we want to count the number of ways of distributing the $n$ balls among the $n$ boxes. Put down $n-1$ separators $\vert$ to make the boxes, and $n$ balls; there are then $2n-1$ positions from which to choose the $n-1$ positions of the bars. For instance, our vector above corresponds to: oo||o|oo| . Thus

\begin{displaymath}\vert C_n\vert={{2n-1}\choose{n-1}}\end{displaymath}
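For small $n$ the formula can be checked by brute force. The sketch below (Python, with the hypothetical helper names `count_resamples` and `to_stars_and_bars`) enumerates all resamples as unordered selections with repetition and compares the count with ${{2n-1}\choose{n-1}}$. It also prints the balls-and-bars coding of the vector used above.

```python
from itertools import combinations_with_replacement
from math import comb

def count_resamples(n):
    """Count distinct resamples of n distinct observations by enumeration."""
    return sum(1 for _ in combinations_with_replacement(range(n), n))

def to_stars_and_bars(k):
    """Encode a count vector as balls ('o') separated by n-1 bars ('|')."""
    return "|".join("o" * ki for ki in k)

# Compare the enumeration with the closed form for a few small n.
for n in range(1, 8):
    assert count_resamples(n) == comb(2 * n - 1, n - 1)

print(count_resamples(5))                  # 126
print(to_stars_and_bars((2, 0, 1, 2, 0)))  # oo||o|oo|
```

Enumeration is only feasible for small $n$, since the count grows roughly like $4^n$.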

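Since ${{2n-1}\choose{n-1}}=\frac{1}{2}{{2n}\choose{n}}$, Stirling's formula ( $n!\sim n^ne^{-n}(2\pi n)^{\frac{1}{2}}$) gives the approximation $\vert C_n\vert \sim (n\pi)^{-\frac{1}{2}} 2^{2n-1}$.

Here is the function file approxcom.m, which evaluates this approximation: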

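function out=approxcom(n)
% APPROXCOM Stirling approximation to the number of distinct bootstrap resamples of n observations.
out=round((pi*n)^(-.5)*2^(2*n-1));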
This function produces the following table of the number of resamples:
\begin{array}{\vert l\vert l\vert l\vert l\vert l\vert l\vert l\vert l\vert}
\hl...
...6232& 78207663 & 6.93 \times 10^{10}& 6.35 \times 10^{13} & 5.94 \times 10^{16}\\
\hline
\end{array}
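To see how close the Stirling approximation is, one can compare it with the exact binomial coefficient. The following is a minimal sketch in Python: `approxcom` is a transcription of the MATLAB function above, `exactcom` is a name introduced here, and the values of $n$ in the loop are chosen for illustration only.

```python
from math import comb, pi

def approxcom(n):
    """Python transcription of approxcom.m: Stirling approximation to |C_n|."""
    return round((pi * n) ** -0.5 * 2 ** (2 * n - 1))

def exactcom(n):
    """Exact number of distinct resamples, |C_n| = C(2n-1, n-1)."""
    return comb(2 * n - 1, n - 1)

# Illustrative values of n (an assumption of this example).
for n in (5, 10, 15, 20):
    exact, approx = exactcom(n), approxcom(n)
    print(n, exact, approx, f"ratio = {approx / exact:.4f}")
```

For these values the approximation is slightly larger than the exact count, and the ratio moves towards 1 as $n$ grows.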

Are all these samples equally likely? No. Thinking about the probability of drawing the sample of all $x_1$'s, by choosing the index $1$ $n$ times in the uniform integer generation, should persuade you that this sample appears only once in $n^{n}$ times, whereas the sample with $x_1$ once and $x_2$ for all the other observations can appear in $n$ out of the $n^{n}$ ways.
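The two counts above are special cases of the multinomial coefficient $n!/(k_1!\cdots k_n!)$, which gives the number of ordered index sequences leading to the count vector $(k_1,\ldots,k_n)$. A small Python sketch follows; the function names are introduced here for illustration.

```python
from math import factorial

def resample_ways(k):
    """Number of ordered index sequences giving count vector k: n!/(k_1!...k_n!)."""
    ways = factorial(sum(k))
    for ki in k:
        ways //= factorial(ki)   # each intermediate quotient is an integer
    return ways

def resample_probability(k):
    """Probability of count vector k when n indices are drawn uniformly with replacement."""
    n = len(k)
    return resample_ways(k) / n ** n

n = 5
print(resample_ways((n, 0, 0, 0, 0)))      # 1: all x_1's
print(resample_ways((1, n - 1, 0, 0, 0)))  # 5: x_1 once, x_2 otherwise
print(resample_ways((2, 0, 1, 2, 0)))      # 30: the resample recoded earlier
print(resample_probability((n, 0, 0, 0, 0)))
```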


Susan Holmes 2004-04-27