
Subsections

Complete Enumeration

Theoretically, we could give a complete enumeration of the bootstrap sampling distribution: all we need to know is how to compute the statistic $T({\cal X}^*_n)$ for every bootstrap resample. There is a way of running through all the resamples, as indexed by the simplex $C_n$ defined below; it is called a Gray code. As an example, consider the law school data used by Efron (1982). One can only hope to enumerate completely for moderate sample sizes ($n\leq
20$ today). For larger sample sizes, partial enumeration through carefully spaced points is discussed in Diaconis and Holmes (1994a). Another idea is to use a small dose of randomness, not as much as Monte Carlo: a random walk between neighboring points still allows the same updating procedures to be used. This is detailed, for the case of exploring the tails of a bootstrap distribution, in Diaconis and Holmes (1994b).

The original Gray code

Let $Z^n_2$ be the set of binary $n$-tuples. This may be identified with the vertices of the usual $n$-cube or with the set of all subsets of an $n$ element set. The original Gray code gives an ordered list of $Z^n_2$ with the property that successive values differ only in a single place. For example, when $n=3$ such a list is

$000,\; 001,\; 011,\; 010,\; 110,\; 111,\; 101,\; 100\; .$

It is easy to give a recursive description of such a list, starting from the list for $n=1$ (namely $0,1$). Given a list $L_n$ of length $2^n$, form $L_{n+1}$ by putting a zero before each entry in $L_n$, and a one before each entry in $L_n$. Concatenate these two lists by writing down the first followed by the second in reverse order. Thus from $0,1$ we get $00, 01, 11, 10$ and then the list displayed above for $n=3$. For $n=4$ the list becomes: $
0000,\;
0001,\;
0011,\;
0010,\;
0110,\;
0111,\;
0101,\;
0100,\;
1100,\;
1101,\;
1111,\;
1110,\;
1010,\;
1011,\;
1001,\;
1000.
$

Gray codes were invented by F. Gray (1939) for sending sequences of bits using a frequency transmitting device. If the ordinary binary indexing of the bit sequence is used, then a small change in reception, between 15 and 16 for instance, has a large impact on the bit string understood. Gray codes enable a coding that minimizes the effect of such an error. A careful description and literature review can be found in Wilf (1989). One crucial feature: there are non-recursive algorithms that provide the successor of a vector in the sequence in a simple way. This is implemented by keeping track of the step number and its divisibility by 2.

One way to express this is as follows: let $m=\sum \epsilon_i 2^i$ be the binary representation of the integer $m$, and let $\cdots e_3 e_2 e_1 e_0$ be the string of rank $m$ in the Gray code list. Then $e_i=\epsilon_i+\epsilon_{i+1} \pmod 2 \;\; (i=0,1,2,\ldots)$ and $\epsilon_i=e_i+e_{i+1}+ \cdots \pmod 2 \;\; (i=0,1,2,\ldots)$. For example, when $n=4$, the list above shows the string of rank $6$ is $0101$; now $6=0110=0 \cdot 1 + 1 \cdot 2 + 1 \cdot 4 + 0 \cdot 8$. So $e_0 = 0+1=1$, $e_1 = 1+1=0$, $e_2 = 1+0=1$, $e_3=0+0=0$. Thus from a given string in the Gray code and its rank one can compute the successor. There is a parsimonious implementation of this in the algorithm given in the appendix. Proofs of these results can be found in Wilf (1989).

Gray Codes for the Bootstrap

Bickel and Freedman (1981) carried out exact enumeration of the bootstrap distribution for the mean using the fast Fourier transform; Bailey (1992) used a similar approach for simple functionals of means. Fisher and Hall (1991) suggest exact enumeration procedures that we will compare to the Gray code approach in Section B below.

Let ${\cal X}_n=\{x_1,x_2,\cdots,x_n\}$ be the original data, assumed to be independent and identically distributed from an unknown distribution $F$ on a space ${\cal X}$. The bootstrap proceeds by supposing that replacing $F$ by $F_n$, the empirical distribution, can provide insight into sampling variability problems.

Practically one proceeds by repeatedly choosing from the $n$ points with replacement. This leads to bootstrap replications ${\cal X}^*_n=\{x^*_1,\cdots,x^*_n\}$. There are $n^n$ such possible replications; however, these are not all different, and by grouping together replications that generate the same resample we can characterize each resample by its weight vector $(k_1,k_2,\cdots,k_n)$, where $k_i$ is the number of times $x_i$ appears in the replication. Thus $k_1+\cdots +k_n=n$.

Let the space of compositions of $n$ into at most $n$ parts be

\begin{displaymath}C_n=\{{\bf k}=(k_1,\cdots,k_n),\; k_1+\cdots +k_n=n,\; k_i\geq 0,\;
k_i\;\hbox{integer}\}.\leqno (3.1)\end{displaymath}

Thus $\vert C_n\vert={2n-1\choose n-1}$. We proceed by running through all compositions in a systematic way. Note that the uniform distribution on ${\cal X}^n_n$ induces a multinomial distribution on $C_n$

\begin{displaymath}m_n({\bf k})={1\over n^n}{n\choose k_1\cdots k_n}.\leqno (3.2)\end{displaymath}

To form the exhaustive bootstrap distribution of a statistic $T({\cal X}_n)$ one need only compute each of the ${2n-1\choose n-1}$ statistics and associate a weight $m_n({\bf k})$ with it. The shift from ${\cal X}^n_n$ to $C_n$ gives substantial savings. For the law school data, $n=15$, $15^{15}\simeq 4.38\times 10^{17}$ while ${29\choose 14} \simeq 7.7\times 10^7$.

Efficient updating avoids multiplying such large numbers by factors of $n$. This is what makes the computation feasible. Gray codes generate compositions by changing two coordinates of the vector ${\bf k}$ by one up and one down. This means that $m_n({\bf k})$ can be easily updated by multiplying and dividing by the new coordinates. Similar procedures, discussed in Section C below allow efficient changes in the statistics of interest.

Gray codes for compositions

Following earlier work by Nijenhuis, Wilf and Knuth, Klingsberg (1982) gave methods of generating Gray codes for compositions. We will discuss this construction briefly here, details can be found in Klingsberg (1982) and Wilf (1989).

For $n=3$, the algorithm produces the 10 compositions of $C_3$ in the following order:

\begin{displaymath}300,\; 210,\; 120,\; 030,\; 021,\; 111,\; 201,\; 102,\; 012,\; 003\; .\end{displaymath}

The easiest way to understand the generation of such a list is recursive: the $n$-compositions of $n$ can be constructed from the $(n-1)$-compositions of $n-i$, for $i=0,1,\cdots,n$.

For any $n$, the list of 2-compositions is just $n0,\;(n-1)1,\;\cdots,\;0n$, which is of length $n+1$. So, taking $n=3$, the 2-compositions of 3 are $L(n,n-1)=L(3,2)=30,\; 21,\; 12,\; 03$,

the 2-compositions of 2 are $L(n-1,n-1)=L(2,2)=20,\; 11,\; 02$,

and the 2-compositions of 1 are $L(n-2,n-1)=L(1,2)=10,\; 01$.

Finally there is only one 2-composition of $0$: $L(0,2)=00$.

The 3-out-of-3 list is obtained by appending a $0$ to the $L(3,2)$ list, a $1$ to the $L(2,2)$ list, a $2$ to the $L(1,2)$ list and a $3$ to the $L(0,2)$ list. These four lists are then concatenated by writing the first, the second in reverse order, the third in its original order, followed by the fourth in reverse order. In symbols:

\begin{displaymath}{\cal L}(3,3)={\cal L}(3,2)\oplus 0,\;\; \overline{{\cal L}(2,2)\oplus 1},\;\;
{\cal L}(1,2)\oplus 2,\;\;\overline{{\cal L}(0,2)\oplus 3}\end{displaymath}

and more generally

\begin{displaymath}{\cal L}(n,n)={\cal L}(n,n-1)\oplus 0,\;\;\overline{{\cal L}(n-1,n-1)\oplus
1},\;\;
{\cal L}(n-2,n-1)\oplus 2,\;\cdots\;.\end{displaymath}

The same procedure leads to the 35 compositions of $n=4$ in the following order:

\begin{displaymath}
\eqalign{&
4000,\; 3100,\; 2200,\; 1300,\; 0400,\; 0310,\; 1210,\; 2110,\; 3010,\; 2020,\; 1120,\; 0220,\cr &
0130,\; 1030,\; 0040,\; 0031,\; 0121,\; 1021,\; 2011,\; 1111,\; 0211,\; 0301,\; 1201,\; 2101,\cr &
3001,\; 2002,\; 1102,\; 0202,\; 0112,\; 1012,\; 0022,\; 0013,\; 0103,\; 1003,\; 0004.\cr }
\end{displaymath}

The lists generated in this way have the property that two successive compositions differ only by $\pm 1$ in two coordinates.

Klingsberg (1982) provides a simple nonrecursive algorithm that generates the successor of any composition in this Gray code. This is crucial for the implementation in the present paper. It requires that one keep track of the whereabouts of the first two non-zero elements and an updating counter. Both an $S$ and a $C$ version of the algorithm are provided in the appendix.

We conclude this subsection by discussing a different algorithm due to Nijenhuis and Wilf (1978, pp. 40-46) which runs through the compositions in lexicographic order (reading from right to left). This algorithm was suggested for bootstrapping by Fisher and Hall (1991).


The N-W algorithm to run through the compositions $C_n$.

(1) Set $k_1=n, k_i=0, 2\leq i\leq n$.

(2) Let $h=$ first $i$ with $k_i\neq 0$. Set $t=k_h$, $k_h=0$,
$k_1=t-1$, $k_{h+1}=k_{h+1}+1$.

(3) Stop when $k_n=n$.


For example, the following list gives all 35 compositions in $C_4$ in the order produced by the N-W algorithm:


\begin{displaymath}
\begin{array}{l}
4000,\; 3100,\; 2200,\; 1300,\; 0400,\; 3010,\; 2110,\; 1210,\; 0310,\; 2020,\; 1120,\; 0220,\\
1030,\; 0130,\; 0040,\; 3001,\; 2101,\; 1201,\; 0301,\; 2011,\; 1111,\; 0211,\; 1021,\; 0121,\\
0031,\; 2002,\; 1102,\; 0202,\; 1012,\; 0112,\; 0022,\; 1003,\; 0103,\; 0013,\; 0004.
\end{array}\end{displaymath}

Complete Bootstrap distribution for the Law School data

The data consists of 15 pairs of numbers (GPA, LSAT) for a sample of American law schools. The correlation coefficient is $\hat{\rho}=.776$.

I gave you in class handouts of the complete bootstrap distribution:
Figure 1.1 Exhaustive Bootstrap for the Correlation Coefficient of the Law School Data

and of a Monte Carlo study with $B=40{,}000$:

Figure 1.2 Monte Carlo Bootstrap for the Law School Data, $B=40{,}000$


Susan Holmes 2004-05-19