The Bootstrap, Permutation Tests, Simulation

Motivation

We want to compute bias, variability, and confidence intervals for estimates for which the theory does not provide closed-form solutions.

Examples :

  1. The parameters in nonparametric regressions (smooths, ppreg,...)
  2. Correlation coefficients for samples that don't necessarily come from a Normal distribution.
  3. Testing when we don't know the distribution of the test statistic under the null hypothesis.

Suppose we are interested in the estimation of an unknown parameter $\theta$ (this doesn't mean that we are in a parametric context). The two main questions asked at that stage are: (1) what estimate $\widehat{\theta}$ of $\theta$ should we use? (2) how accurate is $\widehat{\theta}$ as an estimate of $\theta$?

The second question has to be answered through information on the distribution or at least the variance of the estimator.

Of course there are answers in very simple contexts: for instance, when the parameter of interest is the mean $\mu$, the estimator $\bar{X}$ has a known estimated standard error, sometimes denoted

\begin{displaymath}
\hat{\sigma}=\left[\sum_{i=1}^n(X_i-\bar{X})^2/n^2\right]^{1/2}\end{displaymath}

However, no such simple formula is available for the sample median, for instance.

In maximum likelihood theory question 1 is answered by using the mle, and question 2 can then be answered with an approximate standard error for $\widehat{\theta}$ of

\begin{displaymath}
\frac{1}{\sqrt{\mbox{Fisher Information}}}\end{displaymath}

The bootstrap is a more general way of answering question 2.

If we had several samples from the unknown (true) distribution $F$ then we could consider the variations of the estimator :

\begin{displaymath}
\begin{array}{lllll}
F & \stackrel{\mbox{\scriptsize Random Sample}}{\longrightarrow} & (X_1^1,X_2^1,\ldots,X_n^1) \;=\; {\cal X}_n^1 & \longrightarrow & \hat{\theta}_1\\
F & \stackrel{\mbox{\scriptsize Random Sample}}{\longrightarrow} & (X_1^2,X_2^2,\ldots,X_n^2) \;=\; {\cal X}_n^2 & \longrightarrow & \hat{\theta}_2\\
  & \vdots &  \vdots & & \vdots\\
F & \stackrel{\mbox{\scriptsize Random Sample}}{\longrightarrow} & (X_1^B,X_2^B,\ldots,X_n^B) \;=\; {\cal X}_n^B & \longrightarrow & \hat{\theta}_B
\end{array}\end{displaymath}

Such a situation never occurs in practice, so we replace these new samples by a resampling procedure based on the only information we have about $F$, namely the empirical distribution $\hat{F}_n$: this is what is called bootstrapping.

The bootstrap : the univariate context

In practice

From an original sample

\begin{displaymath}{\cal X}_n
=(X_1,X_2...X_n) \stackrel{iid}{\sim} F\end{displaymath}

draw a new sample of $n$ observations from the original sample with replacement, each observation having the same probability of being drawn ($=\frac{1}{n}$). The bootstrap sample is often denoted

\begin{displaymath}{\cal X}_n^*
=X_1^*,X_2^*...X_n^* \stackrel{iid}{\sim} F_n
\mbox{ the empirical distribution }\end{displaymath}

If we are interested in the behaviour of a random variable $\widehat{\theta}=\theta({\cal X},F)$, then we can consider the sequence of $B$ new values obtained through computation of $B$ new bootstrap samples.

Practically speaking this requires generating an integer between 1 and n, each integer having the same probability; such a function is denoted RANINT(1,n) below.

Here is an example of a line of matlab that does just that:

T= floor((rand(1)*n)+1);

If we use S we won't need to generate the new observations one by one; the following command generates an $n$-vector drawn with replacement from the vector of indices $(1,\ldots,n)$:

sample(n,n,replace=T)

An approximation of the distribution of the estimate $\widehat{\theta}=\theta({\cal X}_n,F)$ is provided by the distribution of

\begin{displaymath}\widehat{\theta}^{*b}=
\theta({\cal X}_n^{*b},F_n), \ \ b=1..B
\end{displaymath}

$G_n^*(t)=P_{F_n}\left(
\widehat{\theta}^* \leq t \right)$ denotes the bootstrap distribution of $\widehat{\theta}^*$, often approximated by

\begin{displaymath}
\widehat{G}_n^*(t)=
\#\{b:\,\widehat{\theta}^{*b}
\leq t
\}/B\end{displaymath}

\fbox{The Bootstrap Algorithm}

  1. For b=1 to B do : /* B is the number of bootstrap samples */
    1. For i=1 to n do : /* Generation of Sample ${\cal X}_b^*$ */
      1. OBS=RANINT(1,n)
      2. Add observation number OBS to Sample ${\cal X}_b^*$
    2. Compute $\widehat{\theta}^*_b=\theta({\cal X}_b^*)$

An example data set is LS (or laws): the law school data consisting of 15 bivariate data points, each corresponding to an American law school, as analysed in EFRON 1982:

\begin{displaymath}
x_i=(LSAT_i,GPA_i)\end{displaymath}

$LSAT_i$ is the class average score on a nationwide exam; $GPA_i$ is the class average undergraduate grade-point average.

In order to repeat the instructions needed to construct $B$ data sets a large number of times, one usually has to use a looping facility. Matlab and S are convenient to program in this circumstance because one can use ordinary loops and user-defined functions, as well as benefit from the vectorized commands.

S examples for simple bootstraps

laws<-scan()
576 3.39
635 3.30
558 2.81
578 3.03 
666 3.44
580 3.07
555 3.00
661 3.43
651 3.36
605 3.13
653 3.12
575 2.74
545 2.76
572 2.88
594 2.96

laws <- matrix(laws, ncol = 2, byrow = T)   # arrange as a 15 x 2 matrix: (LSAT, GPA)

"boot.simple"<-

function(data, fun, nboot, ...)
{
	n <- nrow(as.matrix(data))
	res.boot <- vector("list", length = nboot)
	indb <- matrix(sample(n, size = n * nboot, T), nrow = nboot)
	for(b in (1:nboot)) {
		res <- fun(indb[b,  ], ...)
		plot(res, xlab = "", ylab = "", lab = c(0, 0, 0))
		res.boot[[b]] <- res
	}
	return(res.boot)
}

"boot"<-

function(ind, nboot, fun, ...)
{
# Tibshirani's efficient Bootstrap
# If fun is a function of a matrix or returns multiple values
# it has to be transformed beforehand

	data <- matrix(sample(ind, size = length(ind) * nboot, replace = T),
		nrow = nboot)
	return(apply(data, 1, fun, ...))
}

#Elementary example of using boot

corr<- function(x, xdata)
{
	cor(xdata[x, 1], xdata[x, 2])
}
j1000 <- boot(1:15, 1000, corr, laws)
hist(j1000 - 0.777)
q.bt <- quantile(j1000, probs = c(0.01, 0.99))
#With greek label
hist(j1000 - 0.777, nclass = 40, xlab = "q* - q", font = 13)

#Example of Bootstrapping a Mann-Whitney statistic
"mw"<-
function(x, y)
{
#Function that computes a Mann-Whitney statistic for two samples
        n <- length(x)
	m <- length(y)
	rks <- rank(c(x, y))
	wx <- sum(rks[1:n])
	ux <- n * m + (n * (n + 1))/2 - wx   # wx is the rank sum of the x's; ux counts pairs where y beats x
	mwx <- 1 - ux/(m * n)
	return(mwx)
}

"mwbt"<-
function(x, y, nboot = 100, fun = mw, ...)
{
	bts <- rep(0, nboot)
	for(b in (1:nboot)) {
		xbts <- sample(x, length(x), replace = T)
		ybts <- sample(y, length(y), replace = T)
		bts[b] <- fun(xbts, ybts, ...)
	}
	return(bts)
}

"mwxy"<-
function(xy, n)
{
#	n <- length(x)
	m <- length(xy) - n
	rks <- rank(xy)
	wx <- sum(rks[1:n])
	ux <- n * m + (n * (n + 1))/2 - wx
	mwx <- 1 - ux/(m * n)
	return(mwx)
}

Parametric bootstrap

If one has a parametric distribution $F_\lambda$ which is well defined except for the parameter $\lambda$, instead of drawing with replacement from the original sample, one can draw from the distribution $F_{\widehat{\lambda}}$, where $\widehat{\lambda}$ is the estimate obtained from the original sample.

#Parametric Bootstrap
"boot.par"<-
function(model.sim = law.sim, n = 15, nboot = 66)
{
#Parametric bootstrap of correlation coefficient
#Law sim contains bivariate normal data

	theta <- rep(0, nboot)
	for(b in (1:nboot))
		theta[b] <- corr(((b - 1) * n + 1):(b * n), model.sim)
	return(theta)
}


#Example of using the functions and plotting the results

n1 <- rnorm(1000)
n2 <- rnorm(1000)
law.sim <- cbind(n1, 0.777 * n1 + sqrt(1 - (0.777^2)) * n2)
rbootpar <- boot.par()
hist(rbootpar)
par(new=T)
plot(density(rbootpar, width = 0.15), xlab = "", ylab = "",
 lab = c(0, 0, 0),type = "l")
title("Parametric Bootstrap - Law School corr.", cex = 1.9)

Smoothed Bootstrap

An intermediate solution between parametric and nonparametric bootstrapping is the smoothed bootstrap.

Instead of resampling directly from the empirical distribution $F_n$, one first smooths it; the smoothed version is then used to generate the new samples.

If a kernel smoother is used, for instance a Gaussian kernel, then generation from the smoothed distribution $\widehat{F}_n$ is easy since it is sufficient to resample with replacement and then perturb the sampled point by adding a small gaussian random variable (SILVERMAN B.W. & YOUNG G.A.)

#Smoothed Bootstrap

"boots"_function(x,nboot=100,fun,kernel="norm",h=1)
{
#Function doing a smoothed bootstrap with either
# a normal or an Epanechnikov kernel
#h is the window width (can be tricky to choose)

n_nrow(as.matrix(x))
theta_rep(0,nboot)
for (k in (1:nboot))
{
if(kernel=="norm") eps_rnorm(n,mean=0,sd=1)
else eps_epan(n)
xb_x[sample(n,n,T)]+eps*h
theta[k]_fun(xb)
}
return(theta)
}


"epan"_function(n)
{
eps_rep(0,n)
v_matrix(runif(3*n,-1,1),nrow=n)
for(i in(1:n))
{
if ((abs(v[i,3])>=abs(v[i,2])) && 
(abs(v[i,3])>=abs(v[i,1]))) 
                          eps[i]_v[i,2]
                     else eps[i]_v[i,3]
}
return(eps)
}

Different Uses for bootstrapping

Quality of Estimates

Bias

The bias is defined as:

\begin{displaymath}BIAS=E_F(\theta(F_n))-\theta(F)\end{displaymath}

If we take the statistic $R({\cal X},F)=
\theta(F_n)-\theta(F)$, (where $F_n$ denotes the empirical distribution function) and apply the bootstrap algorithm to its expectation then we obtain the estimate :

\begin{displaymath}
\widehat{BIAS}=E_*R^*=
\frac{1}{B}\sum_{b=1}^B(\widehat{\theta}^{*b}-\widehat{\theta})
=\widehat{\theta}^{*\bullet}-\widehat{\theta}
\end{displaymath}

where the dot indicates averaging : $\widehat{\theta}^{*\bullet}=
\sum_{b=1}^B\widehat{\theta}^{*b}/B$

Variance

The corresponding bootstrap estimate of the standard deviation is:

\begin{displaymath}\widehat{SD}=\left\{
\frac{1}{B-1}\sum_{b=1}^B
[\widehat{\theta}^{*b}-\widehat{\theta}^{*\bullet}]^2 \right\}
^{\frac{1}{2}}\end{displaymath}
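
As a small illustration, here is how these two quantities can be computed in S from bootstrap replicates produced by the boot and corr functions and the laws matrix defined earlier (the object names thetastar, bias.boot and sd.boot are ours, a minimal sketch rather than part of the original code):

thetastar <- boot(1:15, 1000, corr, laws)   # B = 1000 bootstrap replicates theta^{*b}
thetahat <- corr(1:15, laws)                # estimate from the original sample
bias.boot <- mean(thetastar) - thetahat     # theta^{*.} - theta^
sd.boot <- sqrt(sum((thetastar - mean(thetastar))^2)/(length(thetastar) - 1))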

Confidence Intervals : The percentile method

As before we denote by $G_n^*(t)=P_{F_n}\left(
\widehat{\theta}^* \leq t \right)$ the bootstrap distribution of $\widehat{\theta}^*$, approximated by

\begin{displaymath}
\widehat{G}_n^*(t)=
\#\{b:\,\widehat{\theta}^{*b}
\leq t
\}/B\end{displaymath}

The percentile method consists in taking the $1-2\alpha$ confidence interval for $\theta$ as being

\begin{displaymath}\left[
\widehat{G}_n^{*-1}(\alpha),
\widehat{G}_n^{*-1}(1-\alpha)
\right]\end{displaymath}

Theoretically speaking this is equivalent to replacement of the unknown distribution $G(x,F)=P_F(\widehat{\theta}<x)$ by the estimate $G(x,F_n)$.
\fbox{The Percentile Algorithm } :

  1. Input $\alpha$ : the confidence interval will have level $1-2\alpha$
  2. For b=1 to B do : /* B is the number of bootstrap samples */
    1. For i=1 to n do : /* Generation of Sample ${\cal X}_b^*$ */
      1. OBS=RANINT(1,n)
      2. Add observation number OBS to the Sample ${\cal X}_b^*$
    2. Compute $\widehat{\theta}^*_b=\theta({\cal X}_b^*)$
  3. Combine the $\widehat{\theta}^*_b$ into a new data set THETA
  4. Compute the $\alpha$ and $(1-\alpha)$'th quantiles of THETA; these are the endpoints of the interval.
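
In S, once the bootstrap replicates are stored in a vector, the percentile interval is just a call to quantile. A minimal sketch continuing the law school example (alpha and the object name thetastar are our choices):

alpha <- 0.05
thetastar <- boot(1:15, 1000, corr, laws)          # the data set THETA
quantile(thetastar, probs = c(alpha, 1 - alpha))   # percentile confidence limits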

Enhancements

As can be expected, the bigger $B$ (the number of bootstrap resamples), the better the approximations; thus much recent work has been devoted to making the most of the computations, as in the enhancements described in the following subsections.

Preliminaries : Pivotal quantities
Construction of confidence intervals is (ideally) based on a pivotal quantity $R_n$, that is, a function of the sample and the parameter $\theta$ whose distribution does not depend on $\theta$ or on any other unknown parameter. Thus, for instance, no knowledge of the parameters $\mu$ and $\sigma^{2}$ of a normal variable $X \sim N(\mu,\sigma^{2})$ is needed to construct confidence intervals for both of these parameters, because two pivotal quantities are available :

\begin{displaymath}\frac{\bar{X}-\mu}{S/\sqrt{n}} \sim t_{n-1}\end{displaymath}


\begin{displaymath}\frac{(n-1)S^{2}}{\sigma^{2}}\sim \chi^{2}_{n-1} \end{displaymath}

Let $c_{n}(\alpha)$ denote the largest $(1-\alpha)$th quantile of the distribution of the pivot $R_n$, and let $\Theta$ denote the parameter space, the set of all possible values of $\theta$; then

\begin{displaymath}\left\{t \in \Theta : R_{n}(t) \leq
c_{n}(\alpha) \right\}\end{displaymath}

is a confidence set whose level is $1-\alpha$. Thus if $t_{(1-\alpha/2)}$, $\chi^{2}_{\alpha/2}$ and $\chi^{2}_{1-\alpha/2}$ denote the relevant quantiles for the distributions cited above the corresponding $1-\alpha$ confidence intervals will be

\begin{displaymath}\left[\bar{x}-t_{(1-\alpha/2)}s/\sqrt{n},
\bar{x}+t_{(1-\alpha/2)}s/\sqrt{n}
\right]
\end{displaymath}

and

\begin{displaymath}
\left[(n-1)s^{2}/\chi^{2}_{1-\alpha/2},
(n-1)s^{2}/\chi^{2}_{\alpha/2}
\right]\end{displaymath}

Such an ideal situation is of course rare, and one has to make do with approximate pivots (sometimes called roots). Therefore errors will be made on the level ($1-\alpha$) of the confidence sets. The level error is sensitive to how far $R_{n}$ is from being a pivot.

We will present several approaches for finding roots. One approach is to studentize the root, that is, divide it by an estimate of its standard deviation. Another is to use a cdf transform constructed from a first set of bootstrap simulations, as explained in section 2.3.

Bootstrap-t : Studentizing

Suppose that we are interested in a parameter $\theta$, and that $\widehat{\theta}$ is an estimate based on the n-sample ${\cal X}_n$.

We will suppose that the asymptotic variance of $\widehat{\theta}$ is $\sigma^{2}/n$ with a corresponding estimate $\widehat{\sigma}^{2}/n$.

The studentized bootstrap or bootstrap-t method (EFRON 1982, HALL 1988) is based on approximating the distribution function $H_F(x)$ of $n^{\frac{1}{2}}(\widehat{\theta}-\theta)/
\widehat{\sigma}$ by $H_{F_n}(x)$, the distribution of $n^{\frac{1}{2}}(\widehat{\theta}^{*}-\widehat{\theta})
/\widehat{\sigma}^*$.

Denote by $c_{n,\alpha}^*$ the $\alpha$th quantile of the distribution of $n^{\frac{1}{2}}(\widehat{\theta}^{*}-\widehat{\theta})/\widehat{\sigma}^*$,

\begin{displaymath}c_{n,\alpha}^*=H_{F_n}^{-1}(\alpha)\end{displaymath}

Thus the equitailed bootstrap-t confidence interval of coverage $1-2\alpha$ is :

\begin{displaymath}\left[\widehat{\theta}-n^{-\frac{1}{2}}c_{n,(1-\alpha)}^*
\widehat{\sigma},\;\;
\widehat{\theta}-n^{-\frac{1}{2}}c_{n,\alpha}^*\widehat{\sigma}
\right]
\end{displaymath}

One way of finding an estimate $\widehat{\sigma}^{*}$ is to use a second-level bootstrap; this is rather expensive computation-wise, so one could prefer a jackknife estimate as an intermediate solution. \fbox{Bootstrap-t Algorithm}
  1. Compute the original estimate of $\theta$ : THETA(0)
  2. For b=1 to B do :
    1. Generate a sample ${\cal X}_b^*$
    2. Compute the estimate of $\theta$ based on ${\cal X}_b^*$: THETA(b)
    3. Use ${\cal X}_b^*$ as original data and compute an estimate of the standard deviation of $\widehat{\theta}$: SIGMA(b).
    4. Compute $R({\cal X}_b^*)$=(THETA(b)-THETA(0))/ SIGMA(b)
  3. Compute CNA and CNCA, the $\alpha$ and $(1-\alpha)$'th quantiles of the sample formed by all the R's.
  4. LO=THETA(0)-CNCA*SIGMA(0)
  5. HI=THETA(0)-CNA*SIGMA(0)
  (SIGMA(0) denotes the standard deviation estimate computed from the original sample; since the R's are already studentized, no extra $\sqrt{n}$ factor is needed.)

S functions

"bootsd" <-
function(x, theta, nbootsd = 25)
{
	n <- length(x)
	boo <- boot(x, fun = theta, nboot = nbootsd)
	return(sqrt(((n - 1)/n) * var(boo)))
}

"bootst" <- function(x, fun = median, nboot = 100, nbootsd = 25)
{
# Studentized Bootstrap
	boo <- rep(0, nboot)
	for(b in (1:nboot)) {
		boo[b] <- bootsd(x, theta = fun, nbootsd = nbootsd)
	}
	thetab <- boot(x, fun = fun, nboo = nboot)
	theta0 <- fun(x)
	thetasd <- ((thetab - theta0)/boo)
	return(thetasd)
}
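
A hedged sketch of how the roots returned by bootst could be turned into a bootstrap-t interval for the median, following the algorithm above (x stands for any univariate sample; the object names roots, cq, sd0 are ours):

x <- rnorm(50)                                   # any univariate sample
roots <- bootst(x, fun = median, nboot = 1000)   # the studentized roots R
cq <- quantile(roots, probs = c(0.05, 0.95))     # CNA and CNCA
sd0 <- bootsd(x, theta = median)                 # SIGMA(0) from the original sample
c(median(x) - cq[2] * sd0, median(x) - cq[1] * sd0)   # [LO, HI]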

Preliminary Transformations of the Estimate

One of the drawbacks of the above method is that the variance estimate tends to be quite erratic. In the particular case of bootstrapping a correlation coefficient $\rho$ one sometimes ends up with confidence intervals that contain all of $[-1,1]$, which is rather disappointing.

However, experience has shown that a preliminary transformation of $\rho$, such as Fisher's $z$ transform ($z=\mbox{arctanh}\,\rho$), followed by the bootstrap-t and then retransformation of the confidence interval back to the original scale, leads to much more dependable confidence intervals. Unfortunately such a handy transform is not always available, but R. TIBSHIRANI, 1988 has suggested an algorithm for automatically constructing variance stabilizing transformations.

Pre-pivoting or Double Bootstrap

Based on the probability integral transform (inverse CDF) property:

i.e. if $X$ is distributed according to a continuous $F$ then $F(X)$ is uniformly distributed on $[0,1]$, and conversely if $U$ is uniform then $F^{-1}(U)$ has distribution $F$.

Denote by $H_n=H_n(.,F)$ the c.d.f. of our root $R_n(\theta)$ which is supposed to converge weakly to the c.d.f. $H=H(.,F)$ as $n$ increases.

As we have seen, the idea behind the bootstrap is to replace the unknown distribution function $F$ by its empirical counterpart ${F}_{n}$, thus using $\widehat{H}_n=\widehat{H}_n(.,{F}_{n})$ as an estimate of the unknown cumulative distribution. Once validation of the procedure has been ensured by verifying that

\begin{displaymath}
\widehat{H}_n
\stackrel{{\cal L}}{\Rightarrow}
H\end{displaymath}

Then

\begin{displaymath}
\underbrace{\widehat{H}_n(R_n(\theta))}
_{R_{n,1}}
\stackrel{{\cal L}}{\Rightarrow}
U(0,1)\end{displaymath}

In fact it has been shown that the distribution of $R_{n,1}$ is less dependent on $F$ than is that of $R_n(\theta)$, thus providing a better pivot.
\fbox{The prepivoting/double bootstrap algorithm}:
  1. For b=1 to B do
    1. Generate sample ${\cal Y}^{*}_{b}$ a bootstrap sample of size $n$ from ${F}_n$
    2. Compute R(b)= $R_n({\cal Y}^*_b,\widehat{\theta})$
    3. Initialize Z(b)=0
    4. For k=1..K do
      1. Generate sample ${\cal X}^{*}_{k}$ a bootstrap sample of size $n$ from ${F}_{n,b}^*$
      2. Compute $R_n({\cal X}^*_k,\widehat{\theta^*_b})$
      3. If $R_n({\cal X}^*_k,\widehat{\theta^*_b}) \leq
R_n({\cal Y}^*_b,\widehat{\theta})$ add one to Z(b)
    5. Divide Z(b) by $K$.
  2. Compute CN1 :the $(1-\alpha)$'th quantile of the $Z$'s
  3. Compute the CN1'th quantile of the R's.
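
A possible S sketch of this algorithm, with the simple root $R_n({\cal X},\theta)=\theta({\cal X})-\theta$; the function name, the defaults, and this choice of root are ours, not part of the original notes:

"boot.prepivot" <- function(x, fun = mean, nboot = 200, K = 100)
{
# Sketch: root R_n(sample, theta) = estimate(sample) - theta
	theta0 <- fun(x)
	n <- length(x)
	R <- rep(0, nboot)
	Z <- rep(0, nboot)
	for(b in (1:nboot)) {
		yb <- x[sample(n, n, replace = T)]      # first-level sample from F_n
		thetab <- fun(yb)
		R[b] <- thetab - theta0
		Rk <- rep(0, K)
		for(k in (1:K)) {
			xk <- yb[sample(n, n, replace = T)] # second-level sample from F*_{n,b}
			Rk[k] <- fun(xk) - thetab
		}
		Z[b] <- sum(Rk <= R[b])/K               # prepivoted value Z(b)
	}
	return(list(R = R, Z = Z))
}
# Steps 2 and 3 of the algorithm, for a given alpha:
# cn1 <- quantile(out$Z, 1 - alpha);  quantile(out$R, cn1)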

Balanced Bootstrap : saving computations

DAVISON, HINKLEY AND SCHECHTMAN, 1986 introduced the idea that one could reduce the number of simulations ($B$) necessary to attain a given precision by using each of the sample observations equally often over the whole set of resamples.

The algorithm proposed by the authors was however rather expensive timewise ($\propto nB \log{nB}$). A more economical one, proposed by Gleason, is implemented here.
\fbox{The Balanced Bootstrap Algorithm} :

  1. Form the list $L=\left\{I,I,\ldots,I\right\}$ of length $nB$, where $I=(1,2,\ldots,n)$ is repeated $B$ times
  2. For $i=nB$ down to $2$ do
    1. $j=RANINT(1,i)$
    2. Exchange $L_j$ and $L_i$
  3. Read off the $B$ bootstrap samples as the successive blocks of $n$ entries of $L$ (see the S sketch below)
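
A minimal S sketch of this scheme; the explicit exchange loop of step 2 is replaced here by S's sample function, which performs the same random permutation, and the function name is ours:

"boot.balanced" <- function(x, fun = mean, nboot = 100)
{
# Each original observation appears exactly nboot times over all the resamples
	n <- length(x)
	L <- rep(1:n, nboot)                        # the list {I, I, ..., I}
	L <- sample(L)                              # random permutation of L
	ind <- matrix(L, nrow = nboot, byrow = T)   # one row = one balanced resample
	theta <- rep(0, nboot)
	for(b in (1:nboot))
		theta[b] <- fun(x[ind[b,  ]])
	return(theta)
}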

Antithetic Sampling : more on saving computations

The idea comes from standard Monte Carlo techniques, and is detailed in unpublished work by HALL. Like the other enhancements presented here, its objective is to reduce the estimate's variance.

Let $\widehat{\theta}_{1}$ and $\widehat{\theta}_2$ be two different estimates of an unknown parameter $\theta$. If $\widehat{\theta}_{1}$ and $\widehat{\theta}_2$ can be chosen so as to be negatively correlated, then $\widehat{\theta}=\frac{1}{2}(\widehat{\theta}_1+
\widehat{\theta}_2)$ has less than half the variance of either $\widehat{\theta}_{1}$ or $\widehat{\theta}_2$ (assuming they have equal variances).

The bootstrap version of this is based on an antithetic permutation of the sample; $B$ ordinary resampling operations then supply the equivalent of $2B$ resampling operations.
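
A sketch of one possible implementation, pairing each resample with the antithetic resample obtained by reflecting indices in the ordered sample; this particular pairing is our reading of Hall's proposal, not code from the notes:

"boot.antithetic" <- function(x, fun = mean, nboot = 100)
{
# index i in the ordered sample is matched with index n+1-i
	xs <- sort(x)
	n <- length(x)
	th1 <- rep(0, nboot)
	th2 <- rep(0, nboot)
	for(b in (1:nboot)) {
		ind <- sample(n, n, replace = T)
		th1[b] <- fun(xs[ind])           # ordinary resample
		th2[b] <- fun(xs[n + 1 - ind])   # antithetic resample
	}
	return((th1 + th2)/2)                    # pairs of negatively correlated estimates
}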

Bootstrap and Randomization Tests

An example : Infants walking

(from DANIEL W.W., p.216) :
Researchers wished to know if they could conclude that two populations of infants differ with respect to the mean age at which they walked alone. The following data (ages in months) were collected:
Sample from population A:
9.5, 10.5, 9.0, 9.75, 10.0, 13.0, 10.0, 13.5, 10.0, 9.5, 10.0, 9.75
Sample from population B:
12.5, 9.5, 13.5, 13.75, 12.5, 9.5, 12.0, 13.5, 12.0, 12.0
What should we conclude ? ($\alpha=0.05$)
The classical approach : a t-test

The output from the script in the appendix shows:
The mean for population A is : 10.375
The mean for population B is : 12.208
The difference is : 1.833
As the variances can be considered equal, a classical two-sample t-test under the normality assumption gives a t-value of -3.1561, indicating a significant difference since the P-value is 0.0046.
Assumptions :

  1. Random sampling of observations from the population
  2. Equal population standard deviations in the two populations
  3. Normal distributions of the ages within groups
A permutation test approach: this is based on the fact that if there is no difference between the two populations, then the observed result will be comparable to what is obtained by allocating each observation at random to one of the two groups (shuffling). Any statistical measure of the difference could be used here : mean difference, studentised (t) difference, or even a median difference.
\fbox{Randomization test algorithm}
  1. Compute the mean difference between samples A and B: MAB(0).
  2. For p=1 to P do
    1. Randomly allocate 12 observations to sample A and 10 to sample B
    2. Compute the mean difference MAB(p)
  3. Is MAB(0) a typical value from the randomization distribution MAB(p), p=1..P?
    (Compute #{MAB(p)$<$MAB(0)}/P and #{MAB(p)$>$MAB(0)}/P; an S sketch is given below.)
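
An S sketch of this randomization test for the infants data; the object names A, B and rand.test are ours:

A <- c(9.5, 10.5, 9.0, 9.75, 10.0, 13.0, 10.0, 13.5, 10.0, 9.5, 10.0, 9.75)
B <- c(12.5, 9.5, 13.5, 13.75, 12.5, 9.5, 12.0, 13.5, 12.0, 12.0)
rand.test <- function(x, y, P = 2000)
{
	mab0 <- mean(x) - mean(y)
	pooled <- c(x, y)
	mab <- rep(0, P)
	for(p in (1:P)) {
		shuffled <- sample(pooled)               # random allocation to the two groups
		mab[p] <- mean(shuffled[1:length(x)]) - mean(shuffled[-(1:length(x))])
	}
	return(c(sum(mab < mab0), sum(mab > mab0))/P)    # where MAB(0) sits
}
rand.test(A, B)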

A bootstrap approach: following the previous definitions of the bootstrap, a possible bootstrap test could run as follows:

  1. Compute MAB=mean difference between the two original samples
  2. For b=1 to B=500 do :
    1. Draw a bootstrap sample from population A.
    2. Draw a bootstrap sample from population B.
    3. Compute a bootstrap estimate of the mean difference :MD(b).
  3. Construct the bootstrap distribution of MD, possibly smoothed and recentred at 0.
    Situate MAB with respect to this distribution (an S sketch follows).
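
Along the same lines, a hedged sketch of the bootstrap version, reusing the vectors A and B from the previous sketch (boot.test is our name):

boot.test <- function(x, y, nboot = 500)
{
	md <- rep(0, nboot)
	for(b in (1:nboot)) {
		xb <- sample(x, length(x), replace = T)
		yb <- sample(y, length(y), replace = T)
		md[b] <- mean(xb) - mean(yb)
	}
	return(md)
}
md <- boot.test(A, B)
hist(md - mean(md))                                       # recentred at 0
sum(abs(md - mean(md)) >= abs(mean(A) - mean(B)))/length(md)   # where MAB sits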

More generally

Approximate randomisation tests are frequently used to test the independence of two sets of variables (independence being the null hypothesis). The distribution under this hypothesis is estimated by shuffling one variable (or set of variables) with respect to the other variable(s) (see MANLY, 1990, for an easy overview).

However it is important to note that such tests make no assumption that the observations form a random sample from some population; consequently no inference can be carried over automatically to a hypothetical population.

On the contrary, the assumption underlying the bootstrap is that the observations form a random sample whose distribution approximates that of the population. Thus the conclusions drawn from a bootstrap test can be used to make inferences about the population.

NOREEN 1989 suggests using a combination of these two tests when inference is needed for testing whether two variables are stochastically independent in the population. However one must keep in mind that at the same time one is testing whether the marginal distributions of the variables in the sample satisfactorily estimate those of the variables in the population.

Finding out how good a parametric bootstrap procedure is through a simulation experiment:

Confidence Intervals for the expectation of an exponential distribution
  1. Choose n
  2. Choose an expectation for $X$ : $\mu$
  3. For b=1 to B do :
    1. Create a sample ${\cal X}^*_b=X_1^b,..., X_n^b$ having an exponential distribution with mean $\mu$
      ( $P\left\{X < d \right\}= 1 - e^{-d/\mu }$)
      (In fact we use $U_1,U_2,..., U_n$ uniformly distributed on $\left[0,1\right]$
      and then compute $X_i=-\mu\log{(1-U_i)}$)
    2. Compute a bootstrap $0.95$ confidence interval for $\mu$
    3. Count $c_b=1$ if $\mu$ is in the interval and $c_b=0$ otherwise.
  4. $P=\frac{1}{B}\sum_{b=1}^{B}c_b$

For B large enough, $P$ estimates the coverage probability, i.e. the probability that $\mu$ is in the bootstrap confidence interval.

In order for the experiment to be really useful, one ought to repeat it with different sample sizes and different expectations; one could also try all sorts of different distributions.
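
A minimal S sketch of such an experiment, using the percentile interval for the mean of the exponential; the function name, the inner nboot, and the choices n=20 and mu=3 are ours:

cover.exp <- function(n = 20, mu = 3, B = 500, nboot = 200)
{
	c.b <- rep(0, B)
	for(b in (1:B)) {
		x <- - mu * log(1 - runif(n))               # exponential sample with mean mu
		star <- rep(0, nboot)
		for(k in (1:nboot))
			star[k] <- mean(sample(x, n, replace = T))
		ci <- quantile(star, c(0.025, 0.975))       # percentile 0.95 interval
		c.b[b] <- (mu >= ci[1]) & (mu <= ci[2])
	}
	return(sum(c.b)/B)                              # estimated coverage probability
}
cover.exp()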



Using the Bootstrap to build Confidence Intervals

We call the unknown parameter $\theta$ and our estimate $\widehat{\theta}$. Suppose that we had an ideal (unrealistic) situation in which we knew the distribution of $\widehat{\theta}-\theta$; we will be interested especially in its quantiles: denote the $\frac{\alpha}{2}$ quantile by $\underline{\delta}$ and the $1-\frac{\alpha}{2}$ quantile by $\bar{\delta}$. By definition we have: $P(\widehat{\theta}-\theta \leq \underline{\delta})=\frac{\alpha}{2}$ and $P(\widehat{\theta}-\theta \leq \bar{\delta})= 1-\frac{\alpha}{2}$.

It is by writing this out that you can build the correct intervals (see below).
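
Spelling out this step (a short derivation we add for completeness):

\begin{displaymath}
P(\underline{\delta} \leq \widehat{\theta}-\theta \leq \bar{\delta})=1-\alpha
\quad\Longleftrightarrow\quad
P(\widehat{\theta}-\bar{\delta} \leq \theta \leq \widehat{\theta}-\underline{\delta})=1-\alpha
\end{displaymath}

so the correct $1-\alpha$ interval is $\left[\widehat{\theta}-\bar{\delta},\;\widehat{\theta}-\underline{\delta}\right]$; in the bootstrap world the unknown quantiles $\underline{\delta}$ and $\bar{\delta}$ are replaced by the corresponding quantiles of $\widehat{\theta}^*-\widehat{\theta}$.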

Now I am going to compare answers to an estimation problem obtained both through maximum likelihood and through bootstrapping. First let us remind ourselves of the notation:

Maximum Likelihood Estimation

$X_1,X_2, X_3,\ldots X_n$ have joint density denoted

\begin{displaymath}
f_{\theta}(x_1,x_2,\ldots,x_n)=
f(x_1,x_2,\ldots,x_n\vert \theta)\end{displaymath}

Given observed values $X_1=x_1,X_2=x_2,\ldots,X_n=x_n$, the likelihood of $\theta$ is the function

\begin{displaymath}lik(\theta)=f(x_1,x_2,\ldots,x_n\vert \theta)\end{displaymath}

considered as a function of $\theta$.

If the distribution is discrete, $f$ will be the frequency distribution function.

In words: \fbox{
$lik(\theta)$ = probability of observing the given data, viewed as a function of $\theta$.
} Definition:
The maximum likelihood estimate (mle) of $\theta$ is that value of $\theta$ that maximises $lik(\theta)$: it is the value that makes the observed data the ``most probable''.

If the $X_i$ are iid, then the likelihood simplifies to

\begin{displaymath}lik(\theta)=\prod_{i=1}^{n} f(x_i\vert\theta)\end{displaymath}

Rather than maximising this product, which can be quite tedious, we often use the fact that the logarithm is an increasing function, so it is equivalent to maximise the log likelihood:

\begin{displaymath}
l(\theta)=\sum_{i=1}^{n} \log(f(x_i\vert\theta))
\end{displaymath}


Maximum Likelihood of Multinomial Cell Probabilities

$X_1,X_2,\ldots,X_m$ are counts in cells/boxes 1 up to $m$; each box has a different probability (think of the boxes as being bigger or smaller) and we fix the number of balls that fall to be $n$: $X_1+X_2+\cdots +X_m=n$. The probability of each box is $p_i$, with the constraint $p_1+p_2+\cdots +p_m=1$. This is a case in which the $X_i$'s are NOT independent; the joint probability of a vector $x_1,x_2,\ldots,x_m$ is called the multinomial and has the form:

\begin{displaymath}
P(X_1=x_1,\ldots,X_m=x_m)=\frac{n!}{x_1!x_2!\cdots x_m!}\,p_1^{x_1}p_2^{x_2}\cdots p_m^{x_m}
\end{displaymath}

Each box taken separately against all the other boxes is a binomial; the multinomial is an extension thereof.

We study the log-likelihood of this :

\begin{displaymath}
l(p_1,p_2,\ldots,p_m)=\log n!-\sum_{i=1}^m \log x_i!+\sum_{i=1}^m x_i\log p_i
\end{displaymath}

However we can't just go ahead and maximise this: we have to take the constraint into account, so we have to use the Lagrange multipliers again.

We use

\begin{displaymath}L( p_1,p_2,\ldots,p_m,\lambda)=l(p_1,p_2,\ldots,p_m) +\lambda(1-\sum_{i=1}^m p_i)\end{displaymath}
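
Setting the partial derivatives of $L$ to zero (a step we spell out here):

\begin{displaymath}
\frac{\partial L}{\partial p_i}=\frac{x_i}{p_i}-\lambda=0
\;\Rightarrow\; p_i=\frac{x_i}{\lambda},
\qquad
\sum_{i=1}^m p_i=1 \;\Rightarrow\; \lambda=\sum_{i=1}^m x_i=n,
\qquad
\widehat{p}_i=\frac{x_i}{n}.
\end{displaymath}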

The Hardy Weinberg Example

This is a trinomial with three boxes: the probabilities are parametrized by:

\begin{displaymath}(1-\theta)^2 \qquad 2\theta (1- \theta) \qquad \theta^2\end{displaymath}
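
Writing the trinomial log-likelihood for these cell probabilities and dropping the terms that do not involve $\theta$ gives

\begin{displaymath}
l(\theta)=(2X_1+X_2)\log(1-\theta)+(X_2+2X_3)\log\theta + \mbox{constant},
\end{displaymath}

whose first two derivatives are displayed next.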


\begin{displaymath}l'(\theta)=-\frac{2X_1+X_2}{1-\theta}+ \frac{2X_3+X_2}{\theta}\end{displaymath}


\begin{displaymath}l''(\theta)=-\frac{2X_1+X_2}{(1-\theta)^2}- \frac{2X_3+X_2}{\theta^2}\end{displaymath}

Each of the counts is binomially distributed with probabilities as described above so that:

\begin{eqnarray*}
E(X_1)&=&n (1-\theta)^2\\
E(X_2)&=&2n \theta (1-\theta)\\
E(X_3)&=&n\theta^2
\end{eqnarray*}



We can either use asymptotic maximum likelihood theory or the bootstrap; we are going to compare the two using Splus.

After some algebra we find:

\begin{displaymath}E[l''(\theta)]=-\frac{2n}{(1-\theta)\theta}\end{displaymath}

Of course, as usual, we don't know $\theta$, so we have to plug in the estimate:

\begin{displaymath}s_{\widehat{\theta}}=\sqrt{\frac{(1-\widehat{\theta})\widehat{\theta}}{2n}}\end{displaymath}

Example:
$n=1029$, $X_1=342$, $X_2=500$, $X_3=187$. Using the MLE, $\widehat{\theta}=\frac{2X_3+X_2}{2n}=.4247$, and the MLE estimate of the standard deviation is $s_{\widehat{\theta}}=.011$; we will now compare this to what a parametric bootstrap estimate gives:

The bootstrap procedure we are going to follow generates multinomial counts using the estimated probabilities, recomputes the estimate for each set of new counts, and then computes the standard deviation of the resulting bootstrap estimates.

Here is the Splus code I used:

multin <-   
function(pvec, num)
{
#Function that generates the output of 
#putting num balls into length(pvec) boxes,
#with probabilities each given by the vector pvec
        run <- runif(num)
        gener <- cut(run, cumsum(c(0, pvec)))
        return(table(gener))    # return the counts of balls in each box
}

> theta<-
function(xvec)
{
        return((2 * xvec[3] + xvec[2])/(2 * sum(xvec)))
}

bts.ml.mw<-
function(thetahat = 0.4247, nboot = 1000)
{
        res.bt <- rep(0, nboot)
        pv <- c((1 - thetahat)^2, 2 * thetahat * (1 - thetahat), thetahat^2)
        for(b in (1:nboot)) {
                xvec <- multin(pv, 1029)
                res.bt[b] <- theta(xvec)
        }
        return(res.bt)
}

 bt1 <- bts.ml.mw()
quantile(bt1, c(0.025, 0.975))
     2.5%     97.5% 
 0.403292 0.4446186

quantile(bt1 - 0.4247, c(0.025, 0.975))
        2.5%      97.5% 
 -0.02140797 0.01991856

> sqrt( var(bt1))
[1] 0.01074981

The matlab statistics package bootstrap

Preliminaries : The matlab function eval :
 
 EVAL Execute string with MATLAB expression.
    EVAL(s), where s is a string, causes MATLAB to execute
    the string as an expression or statement.
 
    EVAL(s1,s2) provides the ability to catch errors.  It
    executes string s1 and returns if the operation was
    successful. If the operation generates an error,
    string s2 is evaluated before returning. Think of this
    as EVAL('try','catch').  The error string produced by the
    failed 'try' can be obtained with LASTERR.
    
    [X,Y,Z,...] = EVAL(s) returns output arguments from the
    expression in string s.
 
    The input strings to EVAL are often created by 
    concatenating substrings and variables inside square
    brackets. For example:
 
    Generate a sequence of matrices named M1 through M12:
 
        for n = 1:12
           eval(['M' num2str(n) ' = magic(n)'])
        end
 
    Run a selected M-file script.  The strings making up 
    the rows of matrix D must all have the same length.
    
        D = ['odedemo '
             'quaddemo'
             'fitdemo '];
        n = input('Select a demo number: ');
        eval(D(n,:))
 
    See also FEVAL, EVALIN, ASSIGNIN, LASTERR.

%BOOTSTRP Bootstrap statistics.
%   BOOTSTRP(NBOOT,BOOTFUN,D1,...) draws NBOOT bootstrap data samples and
%   analyzes them using the function, BOOTFUN. NBOOT must be a 
%   positive integer.
%   BOOTSTRAP passes the (data) D1, D2, etc. to BOOTFUN.
%
%   [BOOTSTAT,BOOTSAM] = BOOTSTRP(...) Each row of BOOTSTAT contains
%   the results of BOOTFUN on one bootstrap sample. If BOOTFUN returns a matrix
%   then this output is converted to a long vector for storage in BOOTSTAT.
%   BOOTSAM is a matrix of indices into the rows of the original data.
% Initialize matrix to identify scalar arguments to bootfun.
scalard = zeros(nargin-2,1);
 
lp = '(';      % Left parenthesis string variable.
rp = ')';      % Right parenthesis string variable.
c  = ',';      % Comma string variable.
ds = 'd';      % 'd' as a string variable.


 
% Construct argument list for bootfun
for k = 3:nargin
   dk = [ds,num2str(k-2)];
   [row,col] = size(eval(dk));
   if max(row,col) == 1
      scalard(k-2) = 1;
   end
   if row == 1 & scalard(k-2) == 0
      eval([dk,'=',dk,'(:);']);
     row = col;
   end
   if k == 3
      pstring = [lp,dk];
     n = row;
     if nargin == 3
        pstring = [pstring,rp];
      end
   elseif k == nargin & nargin > 3
      pstring = [pstring,c,dk,rp];
   else
      pstring = [pstring,c,dk];
   end
end

 
% Create index matrix of bootstrap samples.
bootsam = unidrnd(n,n,nboot);
 
% Get result of bootfun on actual data and find its size. 
thetafit = eval([bootfun,pstring]);
[ntheta ptheta] = size(thetafit);
 
% Initialize a matrix to contain the results of all the 
%bootstrap calculations.
bootstat = zeros(nboot,ntheta*ptheta);
 
dbs = 'db';
 
% Do bootfun - nboot times.
for bootiter = 1:nboot
 
   for k = 3:nargin
      dk  = [ds,num2str(k-2)];
      dbk = [dbs,num2str(k-2)];
     if scalard(k-2) == 0
         eval([dbk,'=',dk,'(bootsam(:,',num2str(bootiter),'),:);']);
     else
         eval([dbk,'=',dk,';']);
     end
     
      if k == 3
         pstring = [lp,dbk];
        n = row;
        if nargin == 3
           pstring = [pstring,rp];
         end
      elseif k == nargin & nargin > 3
         pstring = [pstring,c,dbk,rp];
      else
         pstring = [pstring,c,dbk];
      end
   end
   evalstr = [bootfun,pstring];
   tmp = eval(evalstr);
   bootstat(bootiter,:) = (tmp(:))';
end

Bootstrapping Regression

There are three ways to bootstrap a regression situation:

The Bootstrap Behaves Badly

Estimation of CART Bias

(see Breiman,Friedman,Olshen,Stone 1984)

Suppose we wanted to estimate the overall risk of a CART tree through the bootstrap:

We will call $d_n$ the rule built from the learning sample $\eta$ of size $n$, with responses $Y_i$ and explanatory variables $X_i$; we call $\hat{R}(d_n)$ the resubstitution estimate

\begin{displaymath}\hat{R}(d_n)=\frac{1}{n}\sum_{i=1}^{n} L(Y_i,d_n(X_i))\end{displaymath}

the sum being taken over the $n$ observations of the learning sample.

The true risk for $d_n$ is:

\begin{displaymath}R(d_n)=\int \int L(y,d_n(x))\, G(dx,dy)\end{displaymath}

Where $G$ is the distribution of $(X,Y)$.

The difference

\begin{displaymath}\hat{R}(d_n)-R(d_n)\end{displaymath}

is a random variable because both elements depend on the learning sample through the constructed rule $d_n$.

\begin{displaymath}
\mbox{ Call }
B(G)=E(R(d_n)-\hat{R}(d_n))\end{displaymath}

the bias that we cannot know exactly, so we could imagine using the bootstrap to estimate it.

Now we are going to show that under certain conditions the bootstrap is going to behave badly.

Suppose that this data is not well adapted to a prediction of $Y$ by $X$ and that the variables are independent.

Say we are trying to classify the observations, that is, the response variable is a categorical variable with, say, $J$ classes; the prior probabilities for each class are unknown, so they are taken to be $\frac{1}{J}$. Every classification rule will in fact have risk $ \frac{J-1}{J}$.

Suppose we construct a tree that goes all the way down to individual leaves, that is, the rule gives exact classification for an observation it has already seen and classifies correctly an observation it is seeing for the first time with probability $\frac{1}{J}$. Thus $\hat{R}(d_n)=0$ and $B(G)=\frac{J-1}{J}$. The expected value of the bias computed for the empirical distribution $G_n$ is $E[B(G_n)]=\frac{J-1}{J} \times P$(a given member of the original sample does not appear at all in the bootstrap sample), thus

\begin{displaymath}E[ B(G_n)]=\frac{J-1}{J} \left[1-\frac{1}{n}\right]^n\end{displaymath}

Recalling that

\begin{displaymath}\lim_{n\longrightarrow \infty} \left[1-\frac{1}{n}\right]^n
=e^{-1} \doteq .37\end{displaymath}

we see that the bootstrap estimate of the bias will be, on average, only about $37\%$ of the true value.

Multivariate Regression

Several papers by Freedman and Peters (JASA 1984, Journal of Business and Economic Statistics 1985) document the fact that the bootstrap does not give the correct answer in multivariate regression situations where the number of variables is of the same order as the number of observations.

These two examples seem to suggest that much care must be used in applying the bootstrap in nonparametric situations!

Dependent Data

When the independence assumption is not fulfilled there are still several ways to use the bootstrap, similar to the choices in ordinary regression.

Nonparametric Tests

Permutation Tests

Univariate: Some reminders and examples

1. Splus example session
 zea_scan()
1:  188 96 168 176 153 172 177 163 146 173 186 168 177 184 96
16: 139 163 160 160 147 149 149 122 132 144 130 144 102 124 144
 zea_matrix(zea,ncol=2)
 zea
      [,1] [,2] 
 [1,]  188  139
 [2,]   96  163
 [3,]  168  160
 [4,]  176  160
 [5,]  153  147
 [6,]  172  149
 [7,]  177  149
 [8,]  163  122
 [9,]  146  132
[10,]  173  144
[11,]  186  130
[12,]  168  144
[13,]  177  102
[14,]  184  124
[15,]   96  144

diff_zea[,2]-zea[,1]
> diff
 [1] -49  67  -8 -16  -6 -23 -28 -41 -14 -29 -56 -24 -75 -60  48
> rank(abs(diff))
 [1] 11 14  2  4  1  5  7  9  3  8 12  6 15 13 10
 sign(diff)
 [1] -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1
> wvec_sign(diff)*rank(abs(diff))
> wvec
 [1] -11  14  -2  -4  -1  -5  -7  -9  -3  -8 -12  -6 -15 -13  10
 W_sum(wvec[wvec>0])
> W
[1] 24
args(wilcox.test)
function(x, y = NULL, alternative = "two.sided", mu = 0, 
paired =F, exact =T,correct=T)
sum(wvec[wvec<0])
[1] -96
> wilcox.test(zea[,1],zea[,2],paired=T)
         Exact Wilcoxon signed-rank test 
data:  zea[, 1] and zea[, 2] 
signed-rank statistic V = 96, n = 15, p-value = 0.0413 
alternative hypothesis: true mu is not equal to 0 
 
> wilcox.test(zea[,1],zea[,2],paired=T,exact=T,alternative="greater")
         Exact Wilcoxon signed-rank test 
data:  zea[, 1] and zea[, 2] 
signed-rank statistic V = 96, n = 15, p-value = 0.0206 
alternative hypothesis: true mu is greater than 0 
 > 15*16*31/24
[1] 310
> sqrt(310)
[1] 17.60682
  (96 - 15*16/4)/17.6
[1] 2.045455

2. The original Gray code, Fisher's test

A. The original Gray code.

Let $Z^n_2$ be the set of binary $n$-tuples. This may be identified with the vertices of the usual $n$-cube or with the set of all subsets of an $n$ element set. The original Gray code gives an ordered list of $Z^n_2$ with the property that successive values differ only in a single place. For example, when $n=3$ such a list is

\begin{displaymath}000,\; 001,\; 011,\; 010,\; 110,\; 111,\; 101,\; 100\;.\end{displaymath}

It is easy to give a recursive description of such a list, starting from the list for $n=1$ (namely 0,1). Given a list $L_n$ of length $2^n$, form $L_{n+1}$ by putting a zero before each entry in $L_n$, and a one before each entry in $L_n$. Concatenate these two lists by writing down the first followed by the second in reverse order. Thus from $0,1$ we get $00, 01, 11, 10$ and then the list displayed above for $n=3$. For $n=4$ the list becomes :

\begin{displaymath}
0000,\;
0001,\;
0011,\;
0010,\;
0110,\;
0111,\;
0101,\;
0100,\;
1100,\;
1101,\;
1111,\;
1110,\;
1010,\;
1011,\;
1001,\;
1000.
\end{displaymath}

Gray codes were invented by F. Gray (1939) for sending sequences of bits using a frequency transmitting device. If the ordinary integer indexing of the bit sequence is used then a small change in reception, between 15 and 16, for instance, has a large impact on the bit string understood. Gray codes enable a coding that minimizes the effect of such an error. A careful description and literature review can be found in Wilf (1989). One crucial feature: there are non-recursive algorithms for providing the successor to a vector in the sequence in a simple way. This is implemented through keeping track of the divisibility by 2 and of the step number.

One way to express this is as follows : let $m=\sum \epsilon_i 2^i$ be the binary representation of the integer $m$, and let $\cdots e_3 e_2 e_1 e_0$ be the string of rank $m$ in the Gray code list. Then $e_i=\epsilon_i+\epsilon_{i+1} \; \; (mod 2) \; \; (i=0,1,2 \ldots)$ and $\epsilon_i=e_i+e_{i+1}+ \cdots (mod 2) \; \; (i=0,1,2 \ldots)$. For example, when $n=4$, the list above shows the string of rank $6$ is $0101$ ; now $6=0110=0 \cdot 1 + 1 \cdot 2 + 1 \cdot 4 +0 \cdot 8$. So $e_0 = 0+1=1, e_1 = 1+1=0, e_2 = 1+0=1,
e_3=0+0=0$. Thus from a given string in the Gray code and the rank one can compute the successor. There is a parsimonious implementation of this in the algorithm given in the appendix. Proofs of these results can be found in Wilf (1989).
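
A small S sketch of the rank-to-codeword formula just given (this is not the parsimonious successor algorithm of the appendix; the function name is ours):

gray.code <- function(m, nbits)
{
# binary digits eps_0, eps_1, ... of the rank m
	eps <- (m %/% 2^(0:nbits)) %% 2
# e_i = eps_i + eps_{i+1}  (mod 2); returned as e_{nbits-1} ... e_1 e_0
	e <- (eps[1:nbits] + eps[2:(nbits + 1)]) %% 2
	return(rev(e))
}
gray.code(6, 4)    # 0 1 0 1, the string of rank 6 in the list above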

We study two uses for such Gray codes here: they provide a way of complete enumeration for Fisher's randomization procedure and for Hartigan's subset procedure. Similar procedures can be used for signed rank statistics and the usual two sample ($m$ out of $n$) tests.

B. Fisher's randomization test.

Fisher (1935) gave a permutation justification for the usual test for $n$ paired observations $(x_1,y_1),\cdots,(x_n,y_n)$. In his example (Darwin's Zea data) $x_i$ and $y_i$ were real numbers representing plant height for treated and untreated plants. Darwin had calculated the mean difference. Fisher gave a way of calibrating this by calculating $d_i=\vert x_i-y_i\vert$ and considering all $2^n$ possible sums $S_n=\epsilon_1
d_1+\cdots + \epsilon_n d_n$ with $\epsilon_i=\pm 1$.

We observe that a Gray code allows us to systematically run through all $2^n$ sign patterns. One need only update the present value of the sum by changing a single term. Furthermore the symmetry of the problem saves a factor 2 of the computations. Figure 2.1 shows a histogram of the randomization distribution based on this procedure applied to the Zea data. The original value of the sum $S_0$ is 39.25 inches. This clearly falls in the tail of the distribution. In fact the exact proportion of sign patterns with a larger absolute sum is $.0518$ (as Fisher also calculated).

[Figure 2.1: Randomization Distribution of $S_n$ for Zea Data]

For the remainder of this section we will use the exact distribution to compare two methods of approximation for the significance level of Fisher's randomization test.

Given $(x_1,y_1),\ldots,(x_n,y_n)$ let $S_{n0}= (x_1 +
\cdots +x_n)-(y_1 + \cdots +y_n)$. Let $d_i=\vert x_i-y_i\vert$ and let $M_n$ be the number of choices of $2^n$ sign values such that $S_n=\epsilon_1d_1+\epsilon_2 d_2+\cdots +
\epsilon_nd_n$ is strictly larger than $S_{n0}$. For Fisher's example $M_{15}=835$ giving a p-value of $M_n/2^n=0.0518$.

The first approximation to $M_n$ uses the fact that $S_n$ is a sum of independent (non identically distributed) random variables. Thus with $\sigma_n^2=d_1^2+ \cdots + d_n^2$,

\begin{displaymath}
P\{ S_n \leq x\} \doteq
\Phi(\frac{x}{\sigma_n})
\eqno{(2.1)}
\end{displaymath}

Figure 2.2a shows a plot of $\{F_{true}(x\sigma_n)-\Phi(x)\}$ versus $x$ for the Zea data. The Normal approximation is quite good. It gives a p-value of 0.0538.

Fisher himself suggested a second approximation : using the studentized statistic

\begin{displaymath}
T_n=\frac{\sum_i X_i/ \sqrt{n}}
{
\left\{\sum_{i=1}^{n}
(X_i-\bar{X})^2/(n-1)
\right\}^{1/2}
}
{\rm\;
where }\; \; X_i=\epsilon_i d_i\end{displaymath}

This $T_n$ is a one-to-one increasing function of $S_n$ (if $f(S)=\sqrt{\frac{n-1}{ n}}\, S \left\{\sum_i d_i^2 -S^2/n\right\}^{-\frac{1}{2}}$, then $T_n=f(S_n)$). So the number of sign patterns such that $T_n$ is strictly larger than the observed $T_{n0}$ equals $M_n$. Thus Fisher's t-approximation is based on :


\begin{displaymath}
P\{ T_n \leq x\} \doteq
F_{n-1}(x)
\eqno{(2.2)}
\end{displaymath}

with $F_{n-1}(x)$ the $t_{n-1}$ distribution function.

Remarks : The Gray code can be useful with any statistic, and the savings increase if efficient updating for single changes is available.

Fisher's test and this data set have been extensively discussed in the statistics literature. Fisher himself avoided complete computation by looking at the tails and coming in carefully. Usual approximations proceed by Monte Carlo, however Pagano and Tritchler (1983) give a fast Fourier algorithm for complete enumeration. We give a different Fourier implementation and further discussion in Section 2D. Lehmann (1975) and Tritchler (1984) discuss exact confidence intervals derived from Fisher's randomization test. Manly (1991) carries out the exact computation without using an updating procedure. Basu (1980) and his discussants debate the logic of such ``permutation'' tests. Maritz (1981) gives a clear presentation of more general two-sample location problems. The Gray code approach can be used for $n \leq 30$. It also adapts to the vector valued $T^2$ statistic studied by Eaton and Efron (1970).

Comparing two distances

Suppose we have two distance matrices $A$ and $B$ between the same entities, computed in two different ways. This statistic was used by F.N. David for comparing disease events to see if they were `signs' of epidemics, by Henry Daniels (1944, Biometrika), and by Knox.

Recently the permutation test based on this statistic has been renamed `Mantel's test' by the ecological community (Manly 1992).

The statistic we look at is $\sum_{i<j} a_{ij} b_{ij}$, which gives an indication of their correlation.

In fact we remark that if each distance matrix were vectorized, this is the equivalent of an uncentered covariance between the vectors.

If the two distances are small in the same places, and big in the same places the statistic will be large indicating a strong link between the distances.

In order to actually test the significance of such a link, a permutation test is performed: the null distribution is obtained by permuting the elements of $A$ to obtain what is denoted by $\Pi A$ , keeping the matrix symmetric, and recomputing the statistic:

\begin{displaymath}\sum_{i<j} (\pi a)_{ij} b_{ij}\end{displaymath}

The Splus program implementing this algorithm can be found below.

What will be useful to us here is the idea of a vectorial covariance between distance matrices, which we will apply to the special case of correlation matrices which are in fact similarities.

We define as in (Escoufier, 1973), the vector covariance between two matrices as their natural inner product:

\begin{displaymath}
CovV(A,B)=Tr(B'A)
\end{displaymath}

and their vectorial correlation as

\begin{displaymath}
RV(A,B)=\frac{CovV(A,B)}{\sqrt{CovV(A,A)CovV(B,B)}}=
\frac{Tr(A'B)}{\sqrt{Tr(A'A)Tr(B'B)}}
\end{displaymath}

Remark:
If the matrices are reduced to centered vectors, this reduces to the classical definitions of covariance and correlation coefficient.

daniel<-
function(mata, matb, nperm)
{
        if(ncol(mata) != ncol(matb))
                stop("matrices should have same dimension")
        if(nrow(mata) != nrow(matb))
                stop("matrices should have same dimension")
        r0 <- stat(mata, matb)
        r <- matrix(0, nperm, 1)
        for(i in (1:nperm)) {
                matp <- permute(mata)
                r[i] <- stat(matp, matb)
        }
        p <- (sum(r >= r0) + 1)/(nperm + 1)   # p-value: proportion of permuted statistics at least as large
        return(p, r0, r)
}
> permute
function(mata)
{
        ap <- matrix(0, nrow(mata), nrow(mata))
        ni <- sample(nrow(mata))
        for(i in 2:nrow(mata)) {   # start at 2: only the strict lower triangle is used by stat()
                for(j in 1:(i - 1)) {
                        ap[i, j] <- mata[ni[i], ni[j]]
                }
        }
        return(ap)
}
> stat<-
function(mata, matb)
{
        r0 <- 0
        for(i in 2:nrow(mata))
                for(j in (1:(i - 1))) {
                        r0 <- r0 + mata[i, j] * matb[i, j]
                }
        return(r0)
}

Wald-Wolfowitz Test on the Line

Pool the data, rank the pooled data, and count the number of `runs': maximal sequences of consecutive observations that come from the same sample.

More precisely, suppose we have two samples of observations; we will call them the $X$'s and the $Y$'s.

Null hypothesis $F_X=F_Y$.

Test statistic is the number of runs $R$, that is the number of consecutive sequences of identical labels. The null distribution of the test statistic can be found by a combinatorial argument:

\begin{displaymath}W=\frac{R-\frac{2mn}{N}-1}{\sqrt{\frac{2mn(2mn-N)}{N^2(N-1)}}} \sim N(0,1)\end{displaymath}

as long as the ratio of the sample sizes stays bounded away from zero (and infinity) as they increase.

Smirnov Test on the line

Sort the pooled univariate observations; $r_i$ is the number of $X$'s that have a rank smaller than or equal to $i$, and $s_i$ is the same but for the $Y$'s. The statistic here is

\begin{displaymath}D=\max_i\vert d_i\vert\mbox{ where }d_i=
\frac{r_i}{m}-\frac{s_i}{n}
\end{displaymath}

Reject for large values of $D$.

Minimal Spanning Tree Based Tests

This method is inspired by the Wald-Wolfowitz, and solves the problem that there is no natural multidimensional `ordering'.

The main reference is Friedman and Rafsky 1979, Ann stat.,pp. 697-717.

One may either use the multidimensional data directly or first make the data two-dimensional by principal components (or another dimension-reducing technique), in order to be able to represent the tree in a graphic.

Minimal Spanning Tree Algorithm

This is not the travelling salesman problem. Algorithms include Prim's and Kruskal's; I have implemented the greedy algorithm, which is recursive and starts from $T_0$, a single point.

Two-sample test

This extends to the case of more than two levels of treatment by doing the following:

Many -sample test

Example: Wine data. As you may remember, this data set consists of 14 physico-chemical composition variables (all continuous), which we are trying to relate to several categorical variables.

acp.vin <- acp.us(vin[, 1:14], q.choix = 2)
#Two components was the choice here
 plot(acp.vin$ind.cords, type = "n")
text(acp.vin$ind.cords[as.numeric(vin[, 15]) == 1,  ], "1")
text(acp.vin$ind.cords[as.numeric(vin[, 15]) == 2,  ], "2")
text(acp.vin$ind.cords[as.numeric(vin[, 15]) == 3,  ], "3")

mst.wine <- mstree(acp.vin$ind.cords, plane = F)
[1] 10 66  6 28  8  4  6 27  7  9  1 14 14 11 14 44 15 25 45 41 39 50 16 76 78
[26] 37 58  5 27 31 62  2 63 31 55 69 40 45 38 18 21 52 51 47 37 26 42 49 76 46
[51] 75 22 54 55 34 53 34 57 33 61 35 33 65 60 78 59 64 77 40 73 72 74 71 43 48
[76] 46 75

segments(acp.vin$ind.cords[seq(mst.wine), 1], acp.vin$ind.cords[seq(mst.wine), 
        2], acp.vin$ind.cords[mst.wine, 1], acp.vin$ind.cords[mst.wine, 2])

title("Wines Minimal Spanning Tree")
> sum(vin[mst.wine,15]==vin[-78,15])
[1] 44

randclass_function(S=1000,data=as.numeric(vin[-78,15]),
compar=as.numeric(vin[mst.wine,15])){
same_rep(0,S)
n_length(compar)
for (i in (1:S))
same[i]_sum(as.numeric(data==compar[sample(n,n)]))
return(same)
}
r1_randclass()
> max(r1)
[1] 39
> max(r2)
[1] 40
 hist(r1,nclass=50)

Matlab MSTREE algorithm:
this doesn't work yet but should be made to work as part of the mstree project.

function out=mst(distm)
%Computes the minimum spanning tree
%form a matrix of distances between n objects
%
n=length(distm);
out=[-n*ones(1,n-1) 0];
distree=distm(n,:);
n1=(1:(n-1));
[a b vectd]=find(tril(distm));
[a b vecti]=find(tril(ones(n,1)* (1:n),-1));
for k =(1:(n-1))
%Find the nearest edge to the tree
    nottree=find(out<0);
    [distmin imin]=min(vectd(out<0));
%Find the index of the minimum
    imin=n1(distm==distmin);
    imin=imin(1);
%Adjoin a new edge
    out(imin)=-out(imin);
%Update list of nearest vertices
    for i = (1:(n-1)) 
      if (out(i)<0 & distm(i,imin)<distm(i,-out(i)))
      out(i)=-imin;
      end;
    end
end

Nonuniform number generation

The `bible' for this is Luc Devroye's book Non-Uniform Random Variate Generation, Springer-Verlag, 1986.

General Methods

Inversion Method

As we saw when we spoke of the double bootstrap, the following fact is very useful: if $U$ is uniformly distributed on $[0,1]$, then $F^{-1}(U)$ has distribution function $F$.

This means that when we have a closed form formula for the inverse of the cdf which is not too expensive to evaluate, we can take a random uniform and transform it so that it has the desired distribution.

Example 1: Exponential

\begin{displaymath}f(x)=\lambda e^{-\lambda x}, x\geq 0, F(x)=1-e^{-\lambda x},
F^{-1}(U)=-\frac{1}{\lambda}log(1-U)\end{displaymath}

One can simplify this by remarking that $U$ and $1-U$ have the same distribution, so one can generate $
-\frac{1}{\lambda}\log(U)$ instead. Other distributions that have nice closed form inverses are the Cauchy, Rayleigh, Triangular, Pareto, ...
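
For instance, in S (the value lambda = 2 is an arbitrary choice of ours):

lambda <- 2
u <- runif(1000)
x <- - log(u)/lambda     # exponentials of rate lambda by inversion
mean(x)                  # should be close to 1/lambda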

Example 2: Correlated Random Variables
$F^{-1}(U)$ and $G^{-1}(U)$ have the maximum possible positive correlation while having the distributions $F$ and $G$; for negative correlation just take $G^{-1}(1-U)$.

Example 3: Maxima
The maximum of $n$ i.i.d. variables with cdf $F$ has cdf $F^n$, so it can be generated directly as $(F^n)^{-1}(U)=F^{-1}(U^{1/n})$.

If closed forms are not available, but a good numerical solver is, then you can use that.

Rejection Methods

Theorem: Suppose $f$ is a non-zero density on $[a,b]$ and zero outside ($a$, $b$ may be infinite). Suppose we know a function $M(x)$ which majorizes $f$: $M(x)\geq f(x)$ on $[a,b]$, and take:

\begin{displaymath}m(x)=\frac{M(x)}{\int_{a}^{b} M(x) dx}\end{displaymath}

so that $m$ is a probability density.

We will choose $M$ to generate from, so it must be easier than $f$ to generate from. For instance, for $[a,b]$ bounded we can take the uniform on $[a,b]$.

Algorithm:

  1. Generate $T$ with density $m$.
  2. Generate $U$ uniform on $[0,1]$, independently of $T$ $\left\{ \begin{array}{ll}
\mbox{If } M(T) \times U \leq f(T) \mbox{accept T, ie } Z=T\\
\mbox{Otherwise reject T, go back to step 1}
\end{array}\right.$

This is why it works:

\begin{eqnarray*}
P(z\leq Z \leq z+dz) &=& P(z\leq T \leq z+dz \;\vert\; \mbox{accept})
= \frac{P(\mbox{accept}\,\vert\,T=z)\,m(z)\,dz}{P(\mbox{accept})}
= \frac{\frac{f(z)}{M(z)}\,m(z)\,dz}{P(\mbox{accept})}=f(z)\,dz,\\
\mbox{since}\quad P(\mbox{accept}) &=&
\int_a^b \frac{f(t)}{M(t)}\,m(t)\, dt=\frac{1}{\int_a^b M(t)\,dt}
\end{eqnarray*}



We need more than one uniform to generate one random variable with the right distribution.

How many are needed? This will depend on $c={\int_a^b M(t)dt}$.

The number of necessary trials before accepting will be a geometric with $p=\frac{1}{c}$.

Example 1:
Suppose $f$ is a bounded density on a compact support $[a,b]$, with $f(x)\leq M$; take $c=M(b-a)$.
Repeat the following: generate $X$ uniform on $[a,b]$ and $U$ uniform on $[0,1]$,

until $UM\leq f(X)$.
Return X.

This is usually very inefficient, it is much better to take a bounding function that `fits'.

Example 2:
Simulating a normal using an exponential as a bounding function: we use the fact that the absolute value of a normal has density:

\begin{displaymath}
f(x)=\frac{2}{\sqrt{2 \pi}}e^{-\frac{x^2}{2}}
\end{displaymath}

$M(x)=\sqrt{2e/\pi}e^{-x}$ majorizes $f$, take $c=\sqrt{2e/\pi}$, we can generate from the density $m(x)$ easily because we know how to generate from an exponential of rate 1.


\begin{displaymath}\frac{f(x)}{M(x)}=exp(-\frac{1}{2}(x-1)^2) \end{displaymath}
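
A sketch of this rejection scheme in S, generating half-normals from unit-rate exponentials and attaching a random sign at the end (the function name and the loop structure are ours):

rnorm.rej <- function(n)
{
	out <- rep(0, n)
	for(i in (1:n)) {
		repeat {
			x <- - log(runif(1))                   # exponential proposal, rate 1
			u <- runif(1)
			if(u <= exp(-0.5 * (x - 1)^2)) break   # accept with probability f(x)/M(x)
		}
		out[i] <- x * sign(runif(1) - 0.5)             # a random sign gives a N(0,1)
	}
	return(out)
}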

Table Lookup

If the probabilities are rational numbers and the variable is discrete, or well approximated by a discrete distribution, you can use table lookup, which is very expensive to set up but can be useful if you have a very large number of random variables to generate:

Algorithm:

Examples : histogram data.
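
A toy S sketch of the set-up, assuming probabilities of the form $k_i/K$ with small integers $k_i$ (the choice $p=(1,2,3)/6$ is an invented illustration):

k <- c(1, 2, 3)                          # p_i = k_i / K, with K = 6
K <- sum(k)
lookup <- rep(1:length(k), k)            # the (expensive) table: outcome i repeated k_i times
x <- lookup[ceiling(K * runif(1000))]    # RANINT(1,K), then one table access per variate
table(x)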

Specialized Methods

Polar methods for the Normal

Box - Muller
Fact:
If $X$ and $Y$ are independent standard normals then their polar coordinates $R=\sqrt{X^2+Y^2}$, $\Theta=\arctan(Y/X)$ are independent, $R^2$ is exponential with mean 2, and $ \Theta \sim U(0,2\pi)$.

Modified to short-cut the cosines and sines (the polar method):
Generate a point uniformly in the unit disk: $(V_1,V_2)$.
Call $S=V_1^2+V_2^2$,

Return $X=\sqrt{\frac{-2\log S}{S}}V_1$, $Y=\sqrt{\frac{-2\log S}{S}}V_2$ (an S sketch follows).
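
A sketch in S of the modified method just described; points outside the unit disk are simply rejected, and the function name is ours:

rnorm.polar <- function(n)
{
	out <- NULL
	while(length(out) < n) {
		v1 <- 2 * runif(1) - 1
		v2 <- 2 * runif(1) - 1
		s <- v1^2 + v2^2
		if(s > 0 && s < 1) {                      # keep only points inside the unit disk
			fac <- sqrt(-2 * log(s)/s)
			out <- c(out, fac * v1, fac * v2) # two independent N(0,1)'s per accepted point
		}
	}
	return(out[1:n])
}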

Exercise for next time (Tuesday April 22nd):
With matlab, do the flop count for both methods:

  1. Which is faster?
  2. What is the ratio (by how much is it faster)?

Susan Holmes 2002-04-03