# The Bootstrap, Permutation Tests, Simulation

## Motivation

We want to compute the bias, the variability, and confidence intervals of estimates for which the theory does not provide closed-form solutions.

Examples:

1. The parameters in nonparametric regressions (smoothers, ppreg, ...).
2. Correlation coefficients for samples that do not necessarily come from a Normal distribution.
3. Testing when we do not know the distribution of the test statistic under the null hypothesis.

Suppose we are interested in the estimation of an unknown parameter (this does not mean that we are in a parametric context). The two main questions asked at that stage are:

• Question 1 What estimator should be used?
• Question 2 Having chosen an estimator, how accurate is it?

The second question has to be answered through information on the distribution, or at least the variance, of the estimator.

Of course there are answers in very simple contexts: for instance, when the parameter of interest is the mean, the sample mean has a known estimated standard error, s/sqrt(n), where s is the sample standard deviation.

However, no such simple formula is available for the sample median, for instance.

In maximum likelihood theory, question 1 is answered by using the mle, and question 2 can then be answered with an approximate standard error derived from the inverse of the Fisher information.

The bootstrap is a more general way to answer question 2, with the following aspects:

• Less parametric modelling, or none at all.
• More computation (a factor of 100 to 1000).
• Automatic, whatever the situation (which can be complex).

If we had several samples from the unknown (true) distribution, we could observe the variation of the estimator across those samples.

Such a situation never arises in practice, so we replace these new samples by a resampling procedure based on the only information we have about the true distribution, namely the empirical distribution of the original sample: this is what is called bootstrapping.

# The bootstrap : the univariate context

## In practice

From an original sample x_1, ..., x_n, draw a new sample of n observations among the original sample with replacement, each observation having the same probability (1/n) of being drawn. The bootstrap sample is often denoted x*_1, ..., x*_n.

If we are interested in the behaviour of a statistic computed from the sample, we can consider the sequence of new values obtained by recomputing it on each bootstrap sample.

Practically speaking, this requires generation of an integer between 1 and n, each integer having the same probability, through a function denoted RANINT(1,n) hereafter.

Here is an example of a line of matlab that does just that:

T= floor((rand(1)*n)+1);


If we use S, we won't need to generate the new observations one by one; the following command generates an n-vector drawn with replacement from the vector of indices (1...n).

sample(n,n,replace=T)


An approximation of the distribution of the estimate is provided by the distribution of its bootstrap replicates: the bootstrap distribution, in turn approximated by the following Monte Carlo algorithm.

1. For b=1 to B do : /* B is the number of bootstrap samples */
   1. For i=1 to n do : /* Generation of bootstrap sample b */
      1. OBS=RANINT(1,n)
      2. Add observation number OBS to the sample
   2. Compute the estimate THETA(b) on bootstrap sample b
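The loop above can be sketched in Python (a minimal illustration, not part of the S code in this chapter; the names `boot_simple` and `stat` are ours):

```python
import random

def boot_simple(data, stat, nboot, seed=None):
    """Draw nboot bootstrap samples (with replacement) and
    return the list of recomputed statistics THETA(1..B)."""
    rng = random.Random(seed)
    n = len(data)
    stats = []
    for _ in range(nboot):
        # RANINT(1, n) repeated n times: uniform indices with replacement
        sample = [data[rng.randrange(n)] for _ in range(n)]
        stats.append(stat(sample))
    return stats
```

The histogram of the returned values approximates the bootstrap distribution of the statistic.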

An example data set is LS (or laws): the law school data, consisting of 15 bivariate data points, each corresponding to an American law school, as analysed in EFRON 1982: one coordinate is the class average score on a nationwide exam, the other is the class average undergraduate grades.

In order to repeat the same instructions a large number of times one usually needs a looping facility; matlab and S are easy to program in this circumstance because one can use ordinary loops and user-created functions, as well as benefit from the vectorized commands.

## S examples for simple bootstraps

laws<-scan()
576 3.39
635 3.30
558 2.81
578 3.03
666 3.44
580 3.07
555 3.00
661 3.43
651 3.36
605 3.13
653 3.12
575 2.74
545 2.76
572 2.88
594 2.96

"boot.simple"<-

function(data, fun, nboot, ...)
{
n <- nrow(as.matrix(data))
res.boot <- vector("list", length = nboot)
indb <- matrix(sample(n, size = n * nboot, T), nrow = nboot)
for(b in (1:nboot)) {
res <- fun(indb[b,  ], ...)
plot(res, xlab = "", ylab = "", lab = c(0, 0, 0))
res.boot[[b]] <- res
}
return(res.boot)
}

"boot"<-

function(ind, nboot, fun, ...)
{
# Tibshirani's efficient Bootstrap
# If fun is function of a matrix or returns multiple values
# it has to be transformed before

data <- matrix(sample(ind, size = length(ind) * nboot, replace = T),
nrow = nboot)
return(apply(data, 1, fun, ...))
}

#Elementary example of using boot

corr<- function(x, xdata)
{
cor(xdata[x, 1], xdata[x, 2])
}
j1000 <- boot(1:15, 1000, corr, laws)
hist(j1000 - 0.777)
q.bt <- quantile(j1000, probs = c(0.01, 0.99))
#With greek label
hist(j1000 - 0.777, nclass = 40, xlab = "q* - q", font = 13)

#Example of Bootstrapping a Mann-Whitney statistic
"mw"<-
function(x, y)
{
#Function that computes a Mann-Whitney statistic for two samples
n <- length(x)
m <- length(y)
rks <- rank(c(x, y))
wx <- sum(rks[1:n])
ux <- n * m + (m * (m + 1))/2 - wx
mwx <- 1 - ux/(m * n)
return(mwx)
}

"mwbt"<-
function(x, y, nboot = 100, fun = mw, ...)
{
bts <- rep(0, nboot)
for(b in (1:nboot)) {
xbts <- sample(x, length(x), replace = T)
ybts <- sample(y, length(y), replace = T)
bts[b] <- fun(xbts, ybts, ...)
}
return(bts)
}

"mwxy"<-
function(xy, n)
{
#	n <- length(x)
m <- length(xy) - n
rks <- rank(xy)
wx <- sum(rks[1:n])
ux <- n * m + (m * (m + 1))/2 - wx
mwx <- 1 - ux/(m * n)
return(mwx)
}


## Parametric bootstrap

If one has a parametric distribution which is well defined except for the parameter, then instead of drawing with replacement from the original sample, one can draw from the fitted distribution, in which the parameter is replaced by its estimate obtained from the original sample.

#Parametric Bootstrap
"boot.par"<-
function(model.sim = law.sim, n = 15, nboot = 66)
{
#Parametric bootstrap of correlation coefficient
#Law sim contains bivariate normal data

theta <- rep(0, nboot)
for(b in (1:nboot))
theta[b] <- corr(((b - 1) * n + 1):(b * n), law.sim)
return(theta)
}

#Example of using the functions and plotting the results

n1 <- rnorm(1000)
n2 <- rnorm(1000)
law.sim <- cbind(n1, 0.777 * n1 + sqrt(1 - (0.777^2)) * n2)
rbootpar <- boot.par()
hist(rbootpar)
par(new=T)
plot(density(rbootpar, width = 0.15), xlab = "", ylab = "",
lab = c(0, 0, 0),type = "l")
title("Parametric Bootstrap - Law School corr.", cex = 1.9)


## Smoothed Bootstrap

An intermediate solution between parametric and nonparametric bootstrapping is the smoothed bootstrap.

Instead of resampling directly from the empirical distribution, one smoothes it out first; the smoothed version is then used to generate the new samples.

If a kernel smoother is used, for instance a Gaussian kernel, then generation from the smoothed distribution is easy: it is sufficient to resample with replacement and then perturb each sampled point by adding a small Gaussian random variable (SILVERMAN B.W. & YOUNG G.A.).

#Smoothed Bootstrap

"boots"_function(x,nboot=100,fun,kernel="norm",h=1)
{
#Function doing a smoothed bootstrap with either
# a normal or an epaneichikov kernel
#h is the window width (can be tricky to choose)

n_nrow(as.matrix(x))
theta_rep(0,nboot)
for (k in (1:nboot))
{
if(kernel=="norm") eps_rnorm(n,mean=0,sd=1)
else eps_epan(n)
xb_x[sample(n,n,T)]+eps*h
theta[k]_fun(xb)
}
return(theta)
}

"epan"_function(n)
{
eps_rep(0,n)
v_matrix(runif(3*n,-1,1),nrow=n)
for(i in(1:n))
{
if ((abs(v[i,3])>=abs(v[i,2])) &&
(abs(v[i,3])>=abs(v[i,1])))
eps[i]_v[i,2]
else eps[i]_v[i,3]
}
return(eps)
}
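A minimal Python sketch of the Gaussian-kernel case (our own function name; the choice of window width h is left to the user, as in the S code):

```python
import random

def smoothed_bootstrap(x, stat, nboot=100, h=1.0, seed=None):
    """Smoothed bootstrap with a Gaussian kernel: resample with
    replacement, then perturb each point by h * N(0, 1) noise."""
    rng = random.Random(seed)
    n = len(x)
    out = []
    for _ in range(nboot):
        xb = [x[rng.randrange(n)] + h * rng.gauss(0, 1) for _ in range(n)]
        out.append(stat(xb))
    return out
```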


## Different Uses for bootstrapping

### Quality of Estimates

#### Bias

If we take the statistic (viewed as a function of the empirical distribution function) and apply the bootstrap algorithm to its expectation, then we obtain the bias estimate

BIAS = THETA(.) - THETA(0)

where the dot indicates averaging over the bootstrap replicates THETA(b), b=1..B, and THETA(0) is the original estimate.

#### Confidence Intervals : The percentile method

As before, the bootstrap distribution of the estimate is approximated by the empirical distribution of the B bootstrap replicates.

The percentile method consists in taking as confidence interval the interval between the alpha/2 and (1 - alpha/2) quantiles of this bootstrap distribution.

Theoretically speaking, this is equivalent to replacing the unknown distribution by its bootstrap estimate.

1. Input the level alpha of the confidence interval
2. For b=1 to B do /* B is the number of bootstrap samples*/
   1. For i=1 to n do /* Generation of bootstrap sample b */
      1. OBS=RANINT(1,n)
      2. Add observation number OBS to the sample
   2. Compute THETA(b)
3. Combine the THETA(b) into a new data set THETA
4. Compute the alpha/2 and (1 - alpha/2)'th quantiles of THETA.
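The percentile algorithm above, sketched in Python (illustrative names only; quantiles are taken crudely as order statistics):

```python
import random

def percentile_ci(data, stat, alpha=0.05, nboot=1000, seed=None):
    """Percentile-method confidence interval: the alpha/2 and
    1 - alpha/2 empirical quantiles of the bootstrap replicates."""
    rng = random.Random(seed)
    n = len(data)
    thetas = sorted(
        stat([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(nboot)
    )
    lo = thetas[int((alpha / 2) * (nboot - 1))]
    hi = thetas[int((1 - alpha / 2) * (nboot - 1))]
    return lo, hi
```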

# Enhancements

As can be predicted, the bigger B, the number of bootstrap resamples, the better the approximations; thus much recent work has been devoted to making the most of the computations:
• By choosing better quantities to bootstrap than the original estimate, through various transforms, some of which require a double-level bootstrap (such as those in sections 2.1, 2.2 and 2.5).
• By using techniques inspired from Monte Carlo methods: importance sampling, antithetic sampling, balanced sampling.
• By transforming the results obtained after an ordinary bootstrap process: bias correction EFRON 1982, accelerated bias correction EFRON 1987, for instance.

Preliminaries : Pivotal quantities
Construction of confidence intervals is based (ideally) on a pivotal quantity, that is, a function of the sample and the parameter whose distribution is independent of the parameter, the sample, or any other unknown quantity. Thus, for instance, no knowledge is needed about the parameters mu and sigma of a normal variable to construct confidence intervals for both of these parameters, because one has two pivotal quantities available:

(xbar - mu)/(s/sqrt(n)), distributed as a Student t with n-1 degrees of freedom, and (n-1)s^2/sigma^2, distributed as a chi-squared with n-1 degrees of freedom.

Bounding the pivot between two of its quantiles and inverting the inequalities yields a confidence set of the announced level. Thus, with the relevant quantiles of the two distributions cited above, the corresponding confidence intervals are

xbar - t(1-alpha/2; n-1) s/sqrt(n) <= mu <= xbar + t(1-alpha/2; n-1) s/sqrt(n)

and

(n-1)s^2/chi2(1-alpha/2; n-1) <= sigma^2 <= (n-1)s^2/chi2(alpha/2; n-1).

Such an ideal situation is of course rare, and one has to make do with approximate pivots (sometimes called roots). Therefore errors will be made on the level of the confidence sets; the level error is sensitive to how far the root is from being a pivot.

We will present several approaches for finding roots. One approach is to studentize the root, that is, divide it by an estimate of its standard deviation. Another is to use a cdf transform constructed from a first set of bootstrap simulations, as explained in section 2.3.

## Bootstrap-t : Studentizing

Suppose that we are interested in a parameter theta, and that thetahat is an estimate based on the n-sample.

We will suppose that an estimate sigmahat of the standard deviation of thetahat is available.

The studentized bootstrap or bootstrap-t method (EFRON 1982, HALL 1988) is based on approximating the distribution of the studentized root (thetahat - theta)/sigmahat by its bootstrap counterpart, the distribution of (thetahat* - thetahat)/sigmahat*.

Denote by t*(alpha) the alpha'th quantile of this bootstrap distribution.

Thus the equitailed bootstrap-t confidence interval of coverage 1 - 2 alpha is:

[thetahat - t*(1-alpha) sigmahat, thetahat - t*(alpha) sigmahat]

One way of finding the standard deviation estimate is to use a second-level bootstrap; this is rather expensive computation-wise: one could prefer, as an intermediate solution, a jackknife estimate.
1. Compute the original estimate THETA(0) and its standard deviation estimate SIGMA(0)
2. For b=1 to B do :
   1. Generate a bootstrap sample
   2. Compute the estimate based on it : THETA(b)
   3. Use the bootstrap sample as original data and compute an estimate of the standard deviation of THETA(b) : SIGMA(b)
   4. Compute R(b)=(THETA(b)-THETA(0))/SIGMA(b)
3. Compute CNA and CNCA, the alpha'th and (1-alpha)'th quantiles of the sample formed by all the R(b)'s
4. LO=THETA(0)-CNCA*SIGMA(0)
5. HI=THETA(0)-CNA*SIGMA(0)
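The algorithm can be sketched in Python, using an inner (second-level) bootstrap for the SIGMA(b)'s as suggested above (names and quantile conventions are ours; a jackknife inner estimate would be cheaper):

```python
import random

def _resample(rng, x):
    n = len(x)
    return [x[rng.randrange(n)] for _ in range(n)]

def _sd(v):
    m = sum(v) / len(v)
    return (sum((u - m) ** 2 for u in v) / (len(v) - 1)) ** 0.5

def bootstrap_t_ci(x, stat, alpha=0.05, nboot=200, nboot_sd=25, seed=None):
    """Bootstrap-t interval: studentize each replicate with an inner
    bootstrap estimate of its standard deviation."""
    rng = random.Random(seed)
    theta0 = stat(x)
    # SIGMA(0): sd of the statistic over bootstrap samples of x
    sigma0 = _sd([stat(_resample(rng, x)) for _ in range(nboot_sd)])
    roots = []
    for _ in range(nboot):
        xb = _resample(rng, x)
        thetab = stat(xb)
        sigmab = _sd([stat(_resample(rng, xb)) for _ in range(nboot_sd)])
        roots.append((thetab - theta0) / sigmab)
    roots.sort()
    upper = roots[int((1 - alpha / 2) * (nboot - 1))]   # CNCA
    lower = roots[int((alpha / 2) * (nboot - 1))]       # CNA
    return theta0 - upper * sigma0, theta0 - lower * sigma0
```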

## S functions

"bootsd" <-
function(x, theta, nbootsd = 25)
{
n <- length(x)
boo <- boot(x, fun = theta, nboot = nbootsd)
return(sqrt(((n - 1)/n) * var(boo)))
}

"bootst" <- function(x, fun = median, nboot = 100, nbootsd = 25)
{
# Studentized Bootstrap
boo <- rep(0, nboot)
for(b in (1:nboot)) {
boo[b] <- bootsd(x, theta = fun, nbootsd = nbootsd)
}
thetab <- boot(x, fun = fun, nboo = nboot)
theta0 <- fun(x)
thetasd <- ((thetab - theta0)/boo)
return(thetasd)
}


## Preliminary Transformations of the Estimate

One of the drawbacks of the above-defined method is that the variance estimate can tend to be quite erratic. In the particular case of bootstrapping a correlation coefficient one sometimes ends up with confidence intervals that contain [-1,1], which is rather disappointing.

However, experience has shown that a preliminary transformation of the estimate, such as Fisher's transform (arctanh), followed by the bootstrap-t and then retransformation of the confidence interval back to the original scale, leads to much more dependable confidence intervals. Unfortunately such a handy transform is not always available, but R. TIBSHIRANI, 1988 has suggested an algorithm for automatically constructing variance-stabilizing transformations.

## Pre-pivoting or Double Bootstrap

Based on the inverse CDF property: if a random variable is distributed according to a continuous c.d.f., then its image by that c.d.f. is uniform on [0,1].

Denote by Hn the c.d.f. of our root, which is supposed to converge weakly to a limiting c.d.f. as n increases.

As we have seen, the idea behind the bootstrap is to replace the unknown distribution function by its empirical counterpart, and thus to estimate Hn by its bootstrap version. The transformed root, obtained by applying this estimated c.d.f. to the root itself, should then be approximately uniform.

In fact it has been shown that the distribution of the transformed root is less dependent on the underlying distribution than is that of the root itself, thus providing a better pivot.
1. For b=1 to B do
   1. Generate a bootstrap sample of size n from the original sample
   2. Compute the root R(b)
   3. Initialize Z(b)=0
   4. For k=1..K do
      1. Generate a second-level bootstrap sample of size n from the sample of step 1
      2. Compute the corresponding root R(b,k)
      3. If R(b,k) <= R(b), add one to Z(b)
   5. Divide Z(b) by K.
2. Compute CN1 : the (1-alpha)'th quantile of the Z(b)'s
3. Compute the CN1'th quantile of the R(b)'s.
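The double loop can be sketched in Python, here with the simple root thetahat* - thetahat (our own names; in practice the nested loop makes this B*K times as expensive as an ordinary bootstrap):

```python
import random

def double_bootstrap(x, stat, nboot=100, ninner=50, seed=None):
    """Prepivoting sketch: for each outer root R(b), estimate its
    bootstrap cdf value Z(b) = #{inner roots <= R(b)} / K."""
    rng = random.Random(seed)
    n = len(x)
    theta0 = stat(x)
    roots, zs = [], []
    for _ in range(nboot):
        xb = [x[rng.randrange(n)] for _ in range(n)]
        thetab = stat(xb)
        r = thetab - theta0
        z = 0
        for _ in range(ninner):
            xbb = [xb[rng.randrange(n)] for _ in range(n)]
            if stat(xbb) - thetab <= r:
                z += 1
        roots.append(r)
        zs.append(z / ninner)
    return roots, zs
```

The Z(b)'s should look roughly uniform when the root is close to pivotal; their quantiles are then used as in steps 2-3 above.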

## Balanced Bootstrap : saving computations

DAVISON, HINKLEY AND SCHECHTMANN, 1986 introduced the idea that one could reduce the number of simulations (=B) necessary to attain a given precision by using each of the sample observations exactly equally often.

The algorithm proposed by the authors was however rather expensive timewise. A more economical one, proposed by Gleason, is implemented here:

1. Form the list of length nB containing B copies of the indices 1,...,n
2. For i = nB down to 2 do
   1. Draw j = RANINT(1,i) and exchange the i'th and j'th entries of the list
3. Read the successive blocks of n entries of the list as the B bootstrap samples.
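A Python sketch of this balanced construction (our own names; `random.shuffle` plays the role of the exchange loop):

```python
import random

def balanced_bootstrap_indices(n, nboot, seed=None):
    """Balanced bootstrap: concatenate nboot copies of the indices
    0..n-1, shuffle, and cut into nboot samples of size n, so that
    every observation is used exactly nboot times overall."""
    rng = random.Random(seed)
    pool = list(range(n)) * nboot
    rng.shuffle(pool)                      # the exchange loop, in one call
    return [pool[b * n:(b + 1) * n] for b in range(nboot)]
```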

## Antithetic Sampling : more on saving computations

The idea comes from standard Monte Carlo techniques, and is detailed in unpublished work by HALL. Like other enhancements presented here, its objective is to reduce the estimate's variance.

Let T1 and T2 be two different estimates of an unknown parameter; if T1 and T2 can be chosen so as to be negatively correlated, then their average (T1+T2)/2 has less than half the variance of either.

The bootstrap version of this is based on an antithetic permutation of the sample: B ordinary resampling operations then supply the equivalent of 2B resampling operations.

# Bootstrap and Randomization Tests

## An example : Infants walking

(from DANIEL W.W., p.216) :
Researchers wished to know if they could conclude that two populations of infants differ with respect to the mean age at which they walked alone. The following data (ages in months) were collected:
Sample from population A:
9.5, 10.5, 9.0, 9.75, 10.0, 13.0, 10.0, 13.5, 10.0, 9.5, 10.0, 9.75
Sample from population B:
12.5, 9.5, 13.5, 13.75, 12.5, 9.5, 12.0, 13.5, 12.0, 12.0
What should we conclude?
The classical approach : a t-test

The software output from the script in the appendix shows:
The mean for population A is : 10.375
The mean for population B is : 12.208
The difference is : 1.833
As the variances can be considered equal, a classical test under the normal assumption gives a t-value of -3.1561, indicating a significant difference since the P-value is 0.0046.
Assumptions :

1. Random sampling of observations from the population
2. Equal population standard deviations in the two groups
3. Normal distributions for test scores within groups
A permutation test approach This is based on the fact that if there is no difference between the two populations, then the observed result will be comparable to a random allocation of each observation to one of the two groups (shuffling). Any statistical measure of the difference could be used here: mean difference, studentized (t) difference, or even a median difference.
1. Compute the mean difference between samples A and B: MAB(0).
2. For p=1 to P do
1. Randomly allocate 12 observations to sample A and 10 to sample B
2. Compute the mean difference MAB(p)
3. Is MAB(0) a typical value from the randomization distribution MAB(p), p=1..P?
(Compute #{MAB(p) >= MAB(0)}/P and #{MAB(p) <= MAB(0)}/P)
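The shuffling procedure above, sketched in Python with the listed data (an approximate randomization test on the mean difference; `perm_test` is our own name, and the observed difference is recomputed from the values as printed):

```python
import random

a = [9.5, 10.5, 9.0, 9.75, 10.0, 13.0, 10.0, 13.5, 10.0, 9.5, 10.0, 9.75]
b = [12.5, 9.5, 13.5, 13.75, 12.5, 9.5, 12.0, 13.5, 12.0, 12.0]

def perm_test(x, y, nperm=2000, seed=None):
    """Shuffle the pooled data, re-split into groups of the original
    sizes, and compare the shuffled mean differences to the observed one."""
    rng = random.Random(seed)
    obs = sum(x) / len(x) - sum(y) / len(y)
    pooled = x + y
    count = 0
    for _ in range(nperm):
        rng.shuffle(pooled)
        xa, xb = pooled[:len(x)], pooled[len(x):]
        d = sum(xa) / len(xa) - sum(xb) / len(xb)
        if abs(d) >= abs(obs):
            count += 1
    return obs, count / nperm   # two-sided approximate p-value
```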

A bootstrap approach Following the previous definitions of the bootstrap, a possible bootstrap test could run as follows:

1. Compute MAB=mean difference between the two original samples
2. For b=1 to B=500 do :
1. Draw a bootstrap sample from population A.
2. Draw a bootstrap sample from population B.
3. Compute a bootstrap estimate of the mean difference :MD(b).
3. Construct the bootstrap distribution of MD, possibly smoothed and recentred at 0.
Situate MAB with respect to this distribution.

## More generally

Approximate randomization tests are frequently used to test independence of two sets of variables (which is the null hypothesis). Estimation of the distribution under this hypothesis is obtained by shuffling one variable (or set of variables) with respect to the other variable(s) (see MANLY, 1990, for an easy overview).

However, it is important to note that there is no assumption that the observations form a random sample from some population. No inference can be carried automatically over to a hypothetical population.

On the contrary, the assumption underlying the bootstrap is that the observations form a random sample whose distribution approximates that of the population. Thus the conclusions drawn from a bootstrap test can be used to make inferences about the population.

NOREEN 1989 suggests using a combination of these two tests when inference is needed for testing whether two variables are stochastically independent in the population. However, one must keep in mind that one simultaneously tests whether the marginal distributions of the variables in the sample satisfactorily estimate those of the variables in the population.

# Finding out how good a parametric bootstrap procedure is through a simulation experiment :

Confidence Intervals for the expectation of an exponential distribution
1. Choose n
2. Choose an expectation theta for the exponential
3. For b=1 to B do :
   1. Create a sample of size n having an exponential distribution with expectation theta
   (In fact we use U uniformly distributed on (0,1) and then compute X = -theta log(U))
   2. Compute a bootstrap confidence interval for the expectation
   3. Add one to COVER if theta is in the interval, and nothing otherwise.

For B large enough, COVER/B estimates the probability that theta is in the bootstrap confidence interval.

In order for the experiment to be really useful, one ought to retry with different sample sizes and different expectations; one could also retry with all sorts of different distributions.
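The experiment can be sketched in Python; for concreteness the inner confidence interval is a simple percentile interval for the mean (our own choice and names; exponential draws use the inversion X = -theta log(U)):

```python
import math
import random

def coverage_exponential(theta=2.0, n=20, nsim=200, nboot=200,
                         alpha=0.10, seed=None):
    """Estimate the actual coverage of a percentile bootstrap interval
    for the expectation of an exponential distribution."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(nsim):
        # inversion: U uniform on (0,1], X = -theta * log(U)
        x = [-theta * math.log(1 - rng.random()) for _ in range(n)]
        thetas = sorted(
            sum(x[rng.randrange(n)] for _ in range(n)) / n
            for _ in range(nboot)
        )
        lo = thetas[int((alpha / 2) * (nboot - 1))]
        hi = thetas[int((1 - alpha / 2) * (nboot - 1))]
        hits += (lo <= theta <= hi)
    return hits / nsim   # COVER / B
```

For skewed distributions and small n, the returned coverage typically falls somewhat below the nominal 1 - alpha.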

### Using the Bootstrap to build Confidence Intervals

We call theta the unknown parameter and thetahat our estimate. Suppose that we had an ideal (unrealistic) situation in which we knew the distribution of thetahat; we would be especially interested in its quantiles.

It is by writing the defining inequalities of these quantiles, and inverting them, that one builds correct intervals.

Now I am going to compare the answers to an estimation problem given by maximum likelihood and by bootstrapping. First let us remind ourselves of the notation:

# Maximum Likelihood Estimation

Suppose the observations have a joint density depending on the parameter.

Given observed values, the likelihood of the parameter is this joint density evaluated at the observed values,

considered as a function of the parameter.

If the distribution is discrete, the density is replaced by the frequency (probability mass) function.

Definition:
The maximum likelihood estimate (mle) of the parameter is the value that maximises the likelihood: in words, it is the value that makes the observed data ``the most probable''.

If the observations are iid, then the likelihood simplifies to a product of the individual densities.

Rather than maximising this product, which can be quite tedious, we often use the fact that the logarithm is an increasing function, so it is equivalent to maximise the log likelihood, a sum.

## Maximum Likelihood of Multinomial Cell Probabilities

The data are counts in cells/boxes 1 up to m; each box has a different probability (think of the boxes as being bigger or smaller), and we fix the total number of balls that fall. The probability of box j is p_j, with the constraint p_1 + ... + p_m = 1; this is a case in which the counts are NOT independent. The joint probability of a vector of counts is called the multinomial, and has the form of a multinomial coefficient times the product of the p_j raised to the observed counts.

Each box taken separately against all the other boxes is a binomial; the multinomial is an extension thereof.

We study the log-likelihood of this:

log L(p) = constant + x_1 log p_1 + ... + x_m log p_m

However we cannot just go ahead and maximise this: we have to take the constraint into account, so we have to use the Lagrange multipliers again.
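Carrying out the constrained maximization with a Lagrange multiplier (a standard derivation, sketched here for completeness):

```latex
% log-likelihood of the multinomial, up to a constant in p
\ell(p) = \mathrm{const} + \sum_{j=1}^{m} x_j \log p_j ,
\qquad \text{subject to } \sum_{j=1}^{m} p_j = 1 .
% Lagrangian
L(p,\lambda) = \sum_{j=1}^{m} x_j \log p_j
             + \lambda \Bigl( 1 - \sum_{j=1}^{m} p_j \Bigr) .
% stationarity in p_j
\frac{\partial L}{\partial p_j} = \frac{x_j}{p_j} - \lambda = 0
\quad\Longrightarrow\quad p_j = \frac{x_j}{\lambda} ;
% the constraint fixes the multiplier
\sum_j p_j = 1 \;\Longrightarrow\; \lambda = \sum_j x_j = n ,
\qquad \hat{p}_j = \frac{x_j}{n} .
```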

## The Hardy Weinberg Example

This is a trinomial with three boxes; the probabilities are parametrized by a single parameter theta:

(1 - theta)^2, 2 theta (1 - theta), theta^2

Each of the counts taken separately is binomially distributed, with the corresponding probability as described above.

We can either use asymptotic maximum likelihood theory or the bootstrap; we are going to compare the two using Splus.

After some algebra we find that the maximum likelihood estimate is thetahat = (2 x_3 + x_2)/(2n), whose variance is theta(1 - theta)/(2n) (the 2n alleles being independent draws with probability theta).

Of course, as usual, we don't know theta: we have to plug in the estimate.

Example
With n = 1029 and thetahat = 0.4247, the estimate for the standard deviation is sqrt(0.4247 x 0.5753 / 2058), approximately 0.0109; we will now compare this to what a parametric bootstrap estimate gives:

The bootstrap procedure we are going to follow generates multinomials with the estimated probabilities, computes for each set of new counts the bootstrap estimate of theta, and then computes the standard deviation of these estimates.

Here is the Splus code I used:

multin <-
function(pvec, num)
{
#Function that generates the counts obtained when
#putting num balls into length(pvec) boxes,
#with probabilities given by the vector pvec
run <- runif(num)
gener <- table(cut(run, cumsum(c(0, pvec))))
return(gener)
}

theta <-
function(xvec)
{
return((2 * xvec[3] + xvec[2])/(2 * sum(xvec)))
}

bts.ml.mw<-
function(thetahat = 0.4247, nboot = 1000)
{
res.bt <- rep(0, nboot)
pv <- c((1 - thetahat)^2, 2 * thetahat * (1 - thetahat), thetahat^2)
for(b in (1:nboot)) {
xvec <- multin(pv, 1029)
res.bt[b] <- theta(xvec)
}
return(res.bt)
}

bt1 <- bts.ml.mw()
quantile(bt1, c(0.025, 0.975))
2.5%     97.5%
0.403292 0.4446186

quantile(bt1 - 0.4247, c(0.025, 0.975))
2.5%      97.5%
-0.02140797 0.01991856

> sqrt( var(bt1))
[1] 0.01074981
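The same parametric bootstrap translates directly to Python (our own helper names; the counts are generated ball by ball with the standard library only):

```python
import random

def multin(pvec, num, rng):
    """Drop num balls into len(pvec) boxes with probabilities pvec;
    return the vector of counts."""
    counts = [0] * len(pvec)
    cum, s = [], 0.0
    for p in pvec:
        s += p
        cum.append(s)
    for _ in range(num):
        u = rng.random()
        for j, c in enumerate(cum):
            if u <= c:
                counts[j] += 1
                break
        else:
            counts[-1] += 1   # guard against rounding of the cumsum
    return counts

def theta(xvec):
    return (2 * xvec[2] + xvec[1]) / (2 * sum(xvec))

def bts_ml_mw(thetahat=0.4247, nboot=500, seed=None):
    """Parametric bootstrap of the Hardy-Weinberg estimate."""
    rng = random.Random(seed)
    pv = [(1 - thetahat) ** 2, 2 * thetahat * (1 - thetahat), thetahat ** 2]
    return [theta(multin(pv, 1029, rng)) for _ in range(nboot)]
```

The standard deviation of the returned replicates should be close to the 0.0107 obtained in the Splus session.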


# The matlab statistics package bootstrap

Preliminaries : The matlab function eval :

EVAL Execute string with MATLAB expression.
EVAL(s), where s is a string, causes MATLAB to execute
the string as an expression or statement.

EVAL(s1,s2) provides the ability to catch errors.  It
executes string s1 and returns if the operation was
successful. If the operation generates an error,
string s2 is evaluated before returning. Think of this
as EVAL('try','catch').  The error string produced by the
failed 'try' can be obtained with LASTERR.

[X,Y,Z,...] = EVAL(s) returns output arguments from the
expression in string s.

The input strings to EVAL are often created by
concatenating substrings and variables inside square
brackets. For example:

Generate a sequence of matrices named M1 through M12:

for n = 1:12
eval(['M' num2str(n) ' = magic(n)'])
end

Run a selected M-file script.  The strings making up
the rows of matrix D must all have the same length.

D = ['odedemo '
'fitdemo '];
n = input('Select a demo number: ');
eval(D(n,:))



%BOOTSTRP Bootstrap statistics.
%   BOOTSTRP(NBOOT,BOOTFUN,D1,...) draws NBOOT bootstrap data samples and
%   analyzes them using the function, BOOTFUN. NBOOT must be a
%   positive integer.
%   BOOTSTRP passes the (data) D1, D2, etc. to BOOTFUN.
%
%   [BOOTSTAT,BOOTSAM] = BOOTSTRP(...) Each row of BOOTSTAT contains
%   the results of BOOTFUN on one bootstrap sample. If BOOTFUN returns a matrix
%   then this output is converted to a long vector for storage in BOOTSTAT.
%   BOOTSAM is a matrix of indices into the row
% Initialize matrix to identify scalar arguments to bootfun.
scalard = zeros(nargin-2,1);

lp = '(';      % Left parenthesis string variable.
rp = ')';      % Right parenthesis string variable.
c  = ',';      % Comma string variable.
ds = 'd';      % 'd' as a string variable.

% Construct argument list for bootfun
for k = 3:nargin
dk = [ds,num2str(k-2)];
[row,col] = size(eval(dk));
if max(row,col) == 1
scalard(k-2) = 1;
end
if row == 1 & scalard(k-2) == 0
eval([dk,'=',dk,'(:);']);
row = col;
end
if k == 3
pstring = [lp,dk];
n = row;
if nargin == 3
pstring = [pstring,rp];
end
elseif k == nargin & nargin > 3
pstring = [pstring,c,dk,rp];
else
pstring = [pstring,c,dk];
end
end

% Create index matrix of bootstrap samples.
bootsam = unidrnd(n,n,nboot);

% Get result of bootfun on actual data and find its size.
thetafit = eval([bootfun,pstring]);
[ntheta ptheta] = size(thetafit);

% Initialize a matrix to contain the results of all the
%bootstrap calculations.
bootstat = zeros(nboot,ntheta*ptheta);

dbs = 'db';

% Do bootfun - nboot times.
for bootiter = 1:nboot

for k = 3:nargin
dk  = [ds,num2str(k-2)];
dbk = [dbs,num2str(k-2)];
if scalard(k-2) == 0
eval([dbk,'=',dk,'(bootsam(:,',num2str(bootiter),'),:);']);
else
eval([dbk,'=',dk,';']);
end

if k == 3
pstring = [lp,dbk];
n = row;
if nargin == 3
pstring = [pstring,rp];
end
elseif k == nargin & nargin > 3
pstring = [pstring,c,dbk,rp];
else
pstring = [pstring,c,dbk];
end
end
evalstr = [bootfun,pstring];
tmp = eval(evalstr);
bootstat(bootiter,:) = (tmp(:))';
end


# Bootstrapping Regression

There are three ways to bootstrap a regression situation:
• Suppose we believe the residual errors to be normally distributed. Then estimate the residual variance from the sample at hand in the classical manner; call the estimate sigmahat^2. Then, for each observation, generate a N(0, sigmahat^2) error and add it to the fitted value to construct a new response. Use the new samples to generate new coefficient estimates, which can then be used to build bootstrap confidence intervals as explained above.
• Suppose we only believe that the residuals are iid and centred. Then the new errors are generated by ordinary resampling, with replacement, from the original sample of estimated residuals.
• The simple nonparametric resampling of the (x, y) pairs from the original pairs may cause difficulties if repeated values cannot be dealt with. Usually one then treats this as a weighted regression in which the weights are the resampling frequencies (composition coefficients from the simplex).
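The second scheme (resampling centred residuals) can be sketched in Python for simple linear regression (our own names and a hand-rolled least-squares fit, kept standard-library only):

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

def boot_residuals(x, y, nboot=200, seed=None):
    """Residual bootstrap: fit once, resample the centred residuals
    with replacement, rebuild y* and refit."""
    rng = random.Random(seed)
    a, b = fit_line(x, y)
    res = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    rbar = sum(res) / len(res)
    res = [r - rbar for r in res]          # centre the residuals
    coefs = []
    for _ in range(nboot):
        ystar = [a + b * xi + res[rng.randrange(len(res))] for xi in x]
        coefs.append(fit_line(x, ystar))
    return coefs                            # list of (a*, b*) pairs
```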

## Estimation of CART Bias

(see Breiman,Friedman,Olshen,Stone 1984)

Suppose we wanted to estimate the overall risk of a CART tree through the bootstrap:

We will call d the rule built from the learning sample, y being the response and x the explanatory variable; we call the resubstitution estimate of the risk the average error of the rule, the average being taken over the learning sample itself.

The true risk for the rule is its expected error under the true distribution of the observations.

The difference between the true risk and the resubstitution estimate is a random variable, because both elements depend on the learning sample through the elaborated rule. This is the bias, which we cannot know exactly, so we could imagine using the bootstrap to estimate it.

Now we are going to show that under certain conditions the bootstrap is going to behave badly.

Suppose that the data are not well adapted to prediction of the response by the explanatory variables: in fact, that the variables are independent.

Say we are trying to classify the observations, that is, the response variable is a categorical variable with, say, J classes; the prior probabilities for each class are unknown, so they are taken to be equal (1/J). Since the explanatory variables carry no information about the response, every classification rule will in fact have risk (J-1)/J.

Suppose we construct a tree that goes all the way down to individual leaves: the rule will give exact classification for an observation it has already seen, and will classify correctly an observation it sees for the first time with probability 1/J. Thus the resubstitution estimate is 0 and the true risk is (J-1)/J. In a bootstrap sample, any member of the original sample is absent with probability approximately e^{-1}, about 0.368; only these absent observations contribute errors when the bootstrapped rule is evaluated on the empirical distribution.

Recalling that the true bias is (J-1)/J, the bootstrap estimate of bias will therefore be only about e^{-1}, roughly 37%, of the true value, on average.

## Multivariate Regression

Several papers by Freedman and Peters (JASA 1984, Journal of Business and Economic Statistics 1985) document the fact that the bootstrap does not give the correct answer in multivariate regression situations where the number of variables is of the same order as the number of observations.

These two examples seem to suggest that much care must be used in applying the bootstrap in nonparametric situations!

## Dependent Data

When the independence assumption is not fulfilled, there are several ways to use the bootstrap, similar to the choices in ordinary regression:
• Bootstrap `blocks' of several consecutive observations (similar to bootstrapping pairs).
• If one has an idea of the right lag-dependence, estimate the errors and then bootstrap them.
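The first option (the moving-block bootstrap) can be sketched in Python (our own names; the choice of block length is the delicate point, analogous to the window width of the smoothed bootstrap):

```python
import random

def block_bootstrap(series, block_len, nboot=100, seed=None):
    """Moving-block bootstrap: rebuild each pseudo-series by pasting
    together randomly chosen blocks of block_len consecutive values."""
    rng = random.Random(seed)
    n = len(series)
    nblocks = -(-n // block_len)           # ceiling division
    starts = n - block_len + 1             # admissible block starts
    samples = []
    for _ in range(nboot):
        pieces = []
        for _ in range(nblocks):
            s = rng.randrange(starts)
            pieces.extend(series[s:s + block_len])
        samples.append(pieces[:n])         # trim to original length
    return samples
```

Within each block the short-range dependence of the original series is preserved, which is what plain resampling destroys.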

# Nonparametric Tests

## Univariate: Some reminders and examples

1. Splus example session
zea <- scan()
1:  188 96 168 176 153 172 177 163 146 173 186 168 177 184 96
16: 139 163 160 160 147 149 149 122 132 144 130 144 102 124 144
zea <- matrix(zea,ncol=2)
zea
[,1] [,2]
[1,]  188  139
[2,]   96  163
[3,]  168  160
[4,]  176  160
[5,]  153  147
[6,]  172  149
[7,]  177  149
[8,]  163  122
[9,]  146  132
[10,]  173  144
[11,]  186  130
[12,]  168  144
[13,]  177  102
[14,]  184  124
[15,]   96  144

diff <- zea[,2]-zea[,1]
> diff
[1] -49  67  -8 -16  -6 -23 -28 -41 -14 -29 -56 -24 -75 -60  48
> rank(abs(diff))
[1] 11 14  2  4  1  5  7  9  3  8 12  6 15 13 10
sign(diff)
[1] -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1
> wvec <- sign(diff)*rank(abs(diff))
> wvec
W <- sum(wvec[wvec>0])
> W
[1] 24
args(wilcox.test)
function(x, y = NULL, alternative = "two.sided", mu = 0,
paired =F, exact =T,correct=T)
sum(wvec[wvec<0])
[1] -96
> wilcox.test(zea[,1],zea[,2],paired=T)
Exact Wilcoxon signed-rank test
data:  zea[, 1] and zea[, 2]
signed-rank statistic V = 96, n = 15, p-value = 0.0413
alternative hypothesis: true mu is not equal to 0

> wilcox.test(zea[,1],zea[,2],paired=T,exact=T,alternative="greater")
Exact Wilcoxon signed-rank test
data:  zea[, 1] and zea[, 2]
signed-rank statistic V = 96, n = 15, p-value = 0.0206
alternative hypothesis: true mu is greater than 0
> 15*16*31/24
[1] 310
> sqrt(310)
[1] 17.60682
(96-120)/17.6
[1] -1.363636


2. The original Gray code, Fisher's test

A. The original Gray code.

Let Bn be the set of binary n-tuples. This may be identified with the vertices of the usual n-cube, or with the set of all subsets of an n-element set. The original Gray code gives an ordered list of Bn with the property that successive values differ only in a single place. For example, when n = 3 such a list is

000, 001, 011, 010, 110, 111, 101, 100

It is easy to give a recursive description of such a list, starting from the list for n = 1 (namely 0,1). Given a list of length 2^{n-1}, form the next list by putting a zero before each entry in it, and a one before each entry in it. Concatenate these two lists by writing down the first followed by the second in reverse order. Thus from 0,1 we get 00,01,11,10, and then the list displayed above for n = 3. For n = 4 the list becomes:

0000, 0001, 0011, 0010, 0110, 0111, 0101, 0100, 1100, 1101, 1111, 1110, 1010, 1011, 1001, 1000
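The recursive construction translates directly into Python (an illustrative helper, not the parsimonious successor algorithm of the appendix):

```python
def gray_code(n):
    """Recursive Gray code: prefix the list for n-1 with 0, its
    reverse with 1, and concatenate."""
    if n == 1:
        return ['0', '1']
    prev = gray_code(n - 1)
    return ['0' + s for s in prev] + ['1' + s for s in reversed(prev)]
```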

Gray codes were invented by F. Gray (1939) for sending sequences of bits using a frequency transmitting device. If the ordinary integer indexing of the bit sequence is used then a small change in reception, between 15 and 16, for instance, has a large impact on the bit string understood. Gray codes enable a coding that minimizes the effect of such an error. A careful description and literature review can be found in Wilf (1989). One crucial feature: there are non-recursive algorithms for providing the successor to a vector in the sequence in a simple way. This is implemented through keeping track of the divisibility by 2 and of the step number.

One way to express this is as follows: let b(m) be the binary representation of the integer m, and let g(m) be the string of rank m in the Gray code list. Then each bit of g(m) is the sum modulo 2 of the corresponding bit of b(m) and the bit to its left. Thus from a given string in the Gray code and its rank one can compute the successor. There is a parsimonious implementation of this in the algorithm given in the appendix. Proofs of these results can be found in Wilf (1989).

We study two uses for such Gray codes here: they provide a way of complete enumeration for Fisher's randomization procedure and for Hartigan's subset procedure. Similar procedures can be used for signed rank statistics and the usual two-sample (m out of m+n) tests.

B. Fisher's randomization test.

Fisher (1935) gave a permutation justification for the usual t test for paired observations (x_i, y_i). In his example (Darwin's Zea data) x_i and y_i were real numbers representing plant height for treated and untreated plants. Darwin had calculated the mean difference. Fisher gave a way of calibrating this by calculating the differences d_i = x_i - y_i and considering all 2^n possible sums e_1 d_1 + ... + e_n d_n with each e_i = +1 or -1.

We observe that a Gray code allows us to systematically run through all 2^n sign patterns: one need only update the present value of the sum by changing a single term. Furthermore, the symmetry of the problem saves a factor of 2 in the computation. Figure 2.1 shows a histogram of the randomization distribution based on this procedure applied to the Zea data. The original value of the sum is 39.25 inches. This clearly falls in the tail of the distribution; Fisher also calculated the exact proportion of sign patterns with a larger absolute sum.

Figure 2.1 Randomization Distribution of the sum for the Zea Data
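The complete enumeration can be sketched in Python. The differences below are Darwin's Zea data in eighths of an inch (so their sum, 314, is 39.25 inches), as reported by Fisher (1935); the single-term update uses the Gray-code fact that the coordinate changing at step m is the lowest set bit of m:

```python
def randomization_sums(d):
    """All 2^n sign-pattern sums, updating a single term per step
    by walking the sign patterns in Gray-code order."""
    n = len(d)
    signs = [1] * n
    s = sum(d)
    sums = [s]
    for m in range(1, 2 ** n):
        i = (m & -m).bit_length() - 1   # coordinate flipped at step m
        signs[i] = -signs[i]
        s += 2 * signs[i] * d[i]        # update instead of recomputing
        sums.append(s)
    return sums

# Darwin's Zea differences (eighths of an inch), Fisher (1935)
zea = [49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48]
sums = randomization_sums(zea)
p = sum(abs(s) >= abs(sum(zea)) for s in sums) / len(sums)
```

Each step costs one addition rather than a full n-term sum, which is the point of the Gray-code ordering.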

For the remainder of this section we will use the exact distribution to compare two methods of approximation for the significance level of Fisher's randomization test.

Given the differences d_1, ..., d_n, let S = d_1 + ... + d_n be the observed sum. For a sign pattern (e_1, ..., e_n) with each e_i = +1 or -1, let S(e) = e_1 d_1 + ... + e_n d_n, and let N be the number of choices of sign values such that |S(e)| is strictly larger than |S|. Fisher's example then gives a p-value of N/2^n.

The first approximation to this p-value uses the fact that S(e) is a sum of independent (non identically distributed) random variables e_i d_i. Thus with sigma^2 = d_1^2 + ... + d_n^2,

P(S(e) > x) is approximately 1 - Phi(x/sigma),

with Phi the standard Normal distribution function.

Figure 2.2a shows a plot of the exact tail probability P(S(e) > x) versus x for the Zea data. The Normal approximation is quite good. It gives a p-value of 0.0538.

Fisher himself suggested a second approximation: using the studentized statistic

t(e) = dbar(e) / (s(e)/sqrt(n)),

where dbar(e) and s(e) are the mean and standard deviation of the e_i d_i.

This is a one-to-one increasing function of S(e), because the sum of the (e_i d_i)^2 equals the sum of the d_i^2 and so does not depend on the signs. So the number of sign patterns such that |t(e)| is strictly larger than the observed |t| equals N. Thus Fisher's t-approximation is based on:

p approximately 2 (1 - F_{n-1}(|t|)),

with F_{n-1} the Student t distribution function on n - 1 degrees of freedom.

Remarks: The Gray code can be useful with any statistic, and the savings increase if efficient updating for single sign changes is available.

Fisher's test and this data set have been extensively discussed in the statistics literature. Fisher himself avoided complete computation by looking at the tails and coming in carefully. Usual approximations proceed by Monte Carlo; however, Pagano and Tritchler (1983) give a fast Fourier algorithm for complete enumeration. We give a different Fourier implementation and further discussion in Section 2D. Lehmann (1975) and Tritchler (1984) discuss exact confidence intervals derived from Fisher's randomization test. Manly (1991) carries out the exact computation without using an updating procedure. Basu (1980) and his discussants debate the logic of such "permutation" tests. Maritz (1981) gives a clear presentation of more general two-sample location problems. The Gray code approach can be used for these problems as well. It also adapts to the vector-valued statistic studied by Eaton and Efron (1970).

## Comparing two distances

Suppose we have two distance matrices D and D' between the same entities, computed in two different ways. This statistic was used by F.N. David for comparing disease events to see if they were 'signs' of epidemics, by Henry Daniels (Biometrika, 1944), and by Knox.

Recently the permutation test based on this statistic has been renamed 'Mantel's test' by the ecological community (Manly 1992).

The statistic we look at is

S = sum over i < j of D_{ij} D'_{ij},

which gives an indication of their correlation.

In fact we remark that if each distance matrix were vectorized, this is the equivalent of an uncentered covariance between the two vectors.

If the two distances are small in the same places and big in the same places, the statistic will be large, indicating a strong link between the distances.

In order to actually test the significance of such a link, a permutation test is performed: the null distribution is obtained by permuting the rows and columns of D by a common random permutation (keeping the matrix symmetric) to obtain D*, and recomputing the statistic:

S* = sum over i < j of D*_{ij} D'_{ij}.

The S-Plus program implementing this algorithm can be found below.

What will be useful to us here is the idea of a vectorial covariance between distance matrices, which we will apply to the special case of correlation matrices, which are in fact similarities.

We define, as in Escoufier (1973), the vectorial covariance between two matrices as their natural inner product:

COVV(A, B) = Tr(A'B)

and their vectorial correlation as

RV(A, B) = Tr(A'B) / sqrt(Tr(A'A) Tr(B'B)).

Remark:
If the matrices are reduced to centered vectors, this reduces to the classical definitions of covariance and correlation coefficient.
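These two definitions can be sketched in numpy (the names covv and rv are mine, not from the course code):

```python
import numpy as np

def covv(a, b):
    """Vectorial covariance: the natural inner product Tr(A'B)."""
    return float(np.trace(a.T @ b))

def rv(a, b):
    """Vectorial (RV) correlation, as in Escoufier (1973)."""
    return covv(a, b) / np.sqrt(covv(a, a) * covv(b, b))
```

Like an ordinary correlation, rv equals 1 when the two matrices are identical and is unchanged when either matrix is rescaled by a positive constant.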

daniel <- function(mata, matb, nperm)
{
        if(ncol(mata) != ncol(matb) || nrow(mata) != nrow(matb))
                stop("matrices should have same dimension")
        r0 <- stat(mata, matb)          # observed statistic
        r <- rep(0, nperm)
        for(i in 1:nperm) {
                matp <- permute(mata)   # same permutation of rows and columns
                r[i] <- stat(matp, matb)
        }
        # proportion of permuted statistics at least as large as the observed one
        p <- (sum(r >= r0) + 1)/(nperm + 1)
        return(list(p = p, r0 = r0, r = r))
}

permute <- function(mata)
{
        # permute rows and columns by a common random permutation,
        # filling only the lower triangle (the matrix is symmetric)
        n <- nrow(mata)
        ap <- matrix(0, n, n)
        ni <- sample(n)
        for(i in 2:n) {
                for(j in 1:(i - 1)) {
                        ap[i, j] <- mata[ni[i], ni[j]]
                }
        }
        return(ap)
}

stat <- function(mata, matb)
{
        # sum of elementwise products over the lower triangle
        r0 <- 0
        for(i in 2:nrow(mata))
                for(j in 1:(i - 1))
                        r0 <- r0 + mata[i, j] * matb[i, j]
        return(r0)
}


## Wald-Wolfowitz Test on the Line

Pool the data, rank the data, and count the number of 'runs': sequences of consecutive observations that come from the same sample.

More precisely, suppose we have two samples of observations; we will call them the x's (m of them) and the y's (n of them).

Null hypothesis: the x's and the y's come from the same continuous distribution.

The test statistic is the number of runs R, that is, the number of consecutive sequences of identical labels. The null distribution of the test statistic can be found by a combinatorial argument:

P(R = 2k) = 2 C(m-1, k-1) C(n-1, k-1) / C(m+n, m),
P(R = 2k+1) = [C(m-1, k-1) C(n-1, k) + C(m-1, k) C(n-1, k-1)] / C(m+n, m),

and the statistic is asymptotically Normal as long as the ratio of the sample sizes stays bounded away from zero as they increase.
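A small Python illustration of the run count and the standard normal approximation to its null mean and variance (the function name and label strings are mine):

```python
from itertools import groupby
from math import sqrt

def runs_test(labels):
    """Wald-Wolfowitz on a pooled, sorted labelling such as 'xxyxyy':
    returns the run count R and its normal-approximation z-score."""
    r = sum(1 for _ in groupby(labels))      # number of runs
    m, n = labels.count("x"), labels.count("y")
    mu = 1 + 2 * m * n / (m + n)             # E(R) under the null
    var = 2 * m * n * (2 * m * n - m - n) / ((m + n) ** 2 * (m + n - 1))
    return r, (r - mu) / sqrt(var)
```

Few runs (a large negative z-score) indicate that the two samples separate on the line.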

## Smirnov Test on the line

Sort the pooled univariate observations. Let F_m(t) be the proportion of x's smaller than or equal to t, and G_n(t) the same for the y's. The statistic here is

D = max over t of |F_m(t) - G_n(t)|.

Reject for large values of D.
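A direct Python sketch of the statistic (quadratic-time, written for clarity rather than speed):

```python
def smirnov(x, y):
    """Two-sample Smirnov statistic: the maximum absolute gap between
    the two empirical cdfs, evaluated at the pooled data points."""
    def ecdf(sample, t):
        return sum(v <= t for v in sample) / len(sample)
    return max(abs(ecdf(x, t) - ecdf(y, t)) for t in list(x) + list(y))
```

When the two samples are completely separated the statistic reaches its maximum value 1.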

## Minimal Spanning Tree Based Tests

This method is inspired by the Wald-Wolfowitz test, and solves the problem that there is no natural multidimensional 'ordering'.

The main reference is Friedman and Rafsky (1979), Annals of Statistics 7, pp. 697-717.

One may either use the multidimensional data directly, or first make the data two-dimensional by principal components (or another dimension-reducing technique), in order to be able to represent the tree in a graphic.

### Minimal Spanning Tree Algorithm

This is not the travelling salesman problem. Algorithms include Prim's and Kruskal's; I have implemented the greedy (Prim) algorithm, which is recursive, starting from a tree T1 consisting of a single point.
• Suppose we have the tree Tk; add an edge to it by finding the minimal edge between a point in the tree and a point that is not, and join these to make Tk+1.

Algorithm:

1. Set near(i) = n and dist(i) = d(i, n), for i = 1, ..., n - 1.
2. Do n - 1 times:
• Find the index i* at which the minimum of dist over unattached points is attained; attach i* to the tree by the edge (i*, near(i*)).
• Update the list of nearest vertices: for each unattached j do:
• if d(j, i*) < dist(j), set dist(j) = d(j, i*) and near(j) = i*.

The output will be a vector of numbers indicating, for each i, the point to which the ith point should be connected.
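The algorithm above can be sketched in Python (a hypothetical helper, not the S-Plus mstree function); the tree is grown from the last point and edges are returned as pairs:

```python
def prim_mst(dist):
    """Prim's algorithm on a full symmetric distance matrix (list of lists).
    Returns the n-1 edges (i, near[i]) of a minimal spanning tree,
    grown from point n-1."""
    n = len(dist)
    in_tree = [False] * (n - 1)              # points 0..n-2 start outside
    near = [n - 1] * (n - 1)                 # nearest tree vertex so far
    best = [dist[i][n - 1] for i in range(n - 1)]
    edges = []
    for _ in range(n - 1):
        i = min((j for j in range(n - 1) if not in_tree[j]),
                key=lambda j: best[j])
        in_tree[i] = True
        edges.append((i, near[i]))
        for j in range(n - 1):               # update nearest-vertex list
            if not in_tree[j] and dist[j][i] < best[j]:
                best[j] = dist[j][i]
                near[j] = i
    return edges
```

For four points on a line at positions 0, 1, 2, 10 the tree is the chain 0-1-2-3, as expected.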

Here are some examples with Splus:

> mstree(vincr[1:10,  ])
$x:
 [1]  4.20959759 -3.70445704  4.54166603  7.22903347  4.89603710  2.28360796
 [7]  0.00000000 -2.00589728  1.03156424  0.04926162
$y:
 [1] -7.0463328 -6.4673190  0.4050449 -0.3888339 -2.2247181  0.0000000
 [7]  0.0000000 -3.2911484 -0.8079203 -1.5704904
$mst:
 [1] 10  8  6  3  3  7 10 10 10

mst:    vector of length nrow(x)-1 describing the edges in the minimal
        spanning tree. The ith value in this vector is an observation
        number, indicating that this observation and the ith observation
        should be linked in the minimal spanning tree.

$order:
[,1] [,2]
[1,]    4    7
[2,]    3   10
[3,]    5    6
[4,]    6    9
[5,]    7    8
[6,]   10    3
[7,]    1    1
[8,]    9    4
[9,]    8    5
[10,]    2    2
order:  matrix of size nrow(x) by 2, giving two types of ordering. The
        first column presents the standard ordering from one extreme of
        the minimal spanning tree to the other. This starts on one end
        of a diameter and numbers the points in a certain order so that
        points close together in Euclidean space tend to be close in
        the sequence. The second column presents the radial ordering,
        based on distance from the center of the minimal spanning tree.
        These can be used to detect clustering. See below for graph
        theory definitions.


plot(x, y)   # plot original data
mst <- mstree(cbind(x, y), plane=F)   # minimal spanning tree
# show tree on plot
segments(x[seq(mst)], y[seq(mst)], x[mst], y[mst])

i <- rbind(iris[,,1], iris[,,2], iris[,,3])
tree <- mstree(i)    # multivariate planing
plot(tree, type="n")  # plot data in plane
text(tree, label=c(rep(1, 50), rep(2, 50), rep(3, 50))) # identify points

# get the absolute value stress e
distp <- dist(i)
dist2 <- dist(cbind(tree$x, tree$y))
e <- sum(abs(distp - dist2))/sum(distp)
e    # 0.1464094


### Two-sample test

• Pool the two samples together.
• Make a minimal spanning tree.
• Count the number of 'pure' edges, i.e. edges of the minimal spanning tree whose vertices come from the same sample.
• An equivalent statistic is provided by taking out all the edges that have mixed colors and counting how many 'separate' trees remain.

This extends to the case of more than two levels of treatment by doing the following:

### Many-sample test

• Pool the samples together.
• Make a minimal spanning tree.
• Count the number of 'pure' edges, i.e. edges of the minimal spanning tree whose vertices come from the same treatment level.

Example: Wine data. As you may remember, this data set has 14 physico-chemical composition variables (all continuous) that we are trying to relate to several categorical variables.

acp.vin <- acp.us(vin[, 1:14], q.choix = 2)
# Two components was the choice here
plot(acp.vin$ind.cords, type = "n")
text(acp.vin$ind.cords[as.numeric(vin[, 15]) == 1,  ], "1")
text(acp.vin$ind.cords[as.numeric(vin[, 15]) == 2,  ], "2")
text(acp.vin$ind.cords[as.numeric(vin[, 15]) == 3,  ], "3")

mst.wine <- mstree(acp.vin$ind.cords, plane = F)
> mst.wine
 [1] 10 66  6 28  8  4  6 27  7  9  1 14 14 11 14 44 15 25 45 41 39 50 16 76 78
[26] 37 58  5 27 31 62  2 63 31 55 69 40 45 38 18 21 52 51 47 37 26 42 49 76 46
[51] 75 22 54 55 34 53 34 57 33 61 35 33 65 60 78 59 64 77 40 73 72 74 71 43 48
[76] 46 75
segments(acp.vin$ind.cords[seq(mst.wine), 1], acp.vin$ind.cords[seq(mst.wine), 2],
    acp.vin$ind.cords[mst.wine, 1], acp.vin$ind.cords[mst.wine, 2])

title("Wines Minimal Spanning Tree")
> sum(vin[mst.wine, 15] == vin[-78, 15])
[1] 44


randclass <- function(S = 1000, data = as.numeric(vin[-78, 15]),
    compar = as.numeric(vin[mst.wine, 15]))
{
    # permutation null distribution of the number of matching labels
    same <- rep(0, S)
    n <- length(compar)
    for(i in 1:S)
        same[i] <- sum(as.numeric(data == compar[sample(n, n)]))
    return(same)
}
r1 <- randclass()
> max(r1)
[1] 39
> max(r2)
[1] 40
hist(r1, nclass = 50)


Matlab MSTREE algorithm:
The version originally distributed here did not work; the corrected sketch below (Prim's algorithm) is a starting point for the mstree project.

function out=mst(distm)
%Computes the minimum spanning tree (Prim's algorithm)
%from a matrix of distances between n objects.
%The tree is grown from point n. While point i is outside the tree,
%-out(i) codes its nearest tree vertex; once attached, out(i) is the
%vertex it is joined to.
n=length(distm);
out=[-n*ones(1,n-1) 0];
distree=distm(n,1:(n-1));   %distance from each point to the tree
for k=(1:(n-1))
%Find the point outside the tree that is nearest to the tree
nottree=find(out<0);
[distmin imin]=min(distree(nottree));
imin=nottree(imin);
out(imin)=-out(imin);       %attach it to its nearest tree vertex
%Update list of nearest vertices
for i=nottree
if (out(i)<0 & distm(i,imin)<distree(i))
distree(i)=distm(i,imin);
out(i)=-imin;
end;
end
end


# Nonuniform number generation

The 'bible' for this is Luc Devroye's Non-Uniform Random Variate Generation, Springer-Verlag, 1986.

## General Methods

### Inversion Method

As we saw when we spoke of the double bootstrap, the following theorem is very useful: if U is uniform on (0, 1) and F is a distribution function with inverse F^{-1}, then X = F^{-1}(U) has distribution function F.

This means that when we have a closed form formula for the inverse of the cdf which is not too expensive, we can take a random uniform and apply the transformation that gives it the correct distribution.

Example 1: Exponential

Here F(x) = 1 - exp(-lambda x), so F^{-1}(u) = -log(1 - u)/lambda. One can simplify this by remarking that U and 1 - U have the same distribution: generate X = -log(U)/lambda. Other distributions that have nice closed form expressions are the Cauchy, Rayleigh, Triangular, Pareto, ...
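In Python (an illustrative sketch; the function name is mine):

```python
import math
import random

def rexp(lam, rng=random.random):
    """Exponential(rate lam) by inversion: X = -log(U)/lam,
    using that U and 1 - U have the same distribution."""
    return -math.log(rng()) / lam
```

The inversion identity F(F^{-1}(u)) = u can be checked directly: for any u in (0, 1), the value x = -log(1 - u)/lam satisfies 1 - exp(-lam x) = u.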

Example 2: Correlated Random Variables

X = F^{-1}(U) and Y = G^{-1}(U) have the maximum positive correlation among pairs with marginal distributions F and G; for the most negative correlation just take Y = G^{-1}(1 - U).

Example 3: Maxima

The maximum of n independent variables with cdf F has cdf F^n, so it can be generated directly as F^{-1}(U^{1/n}).

If closed forms are not available, but a good numerical solver is, then you can invert the cdf numerically.

### Rejection Methods

Theorem: Suppose f is a nonzero density on (a, b) and zero outside (a and b may be infinite). Suppose we know a function g which majorizes f: g(x) >= f(x) on (a, b), and take

c = the integral of g over (a, b), h(x) = g(x)/c,

so that h is a probability density.

We will choose h to generate from, so it must be easier to generate from than f. For instance, for bounded f on a bounded interval we can take g constant, making h the uniform on (a, b).

Algorithm:

1. Generate X with density h.
2. Generate U uniform on (0, 1), independently of X.
3. If U g(X) <= f(X), return X; otherwise start again at step 1.

This is why it works: P(X <= x and X is accepted) is the integral from a to x of [f(t)/g(t)] h(t) dt, which equals (1/c) times the integral of f from a to x, so conditionally on acceptance X has density f.

We need more than one uniform to generate one random variable with the right distribution.

How many are needed? This will depend on c.

The number of trials needed before accepting is geometric with success probability 1/c, so on average c trials are needed.

Example 1:
Suppose f is a density bounded by M on a compact support [a, b]; take g(x) = M, so c = M(b - a).
Repeat the following:

• Generate two independent uniforms, X on [a, b] and U on [0, 1],
until M U <= f(X).
Return X.

This is usually very inefficient; it is much better to take a bounding function that 'fits'.

Example 2:
Simulating a normal using an exponential as a bounding function: we use the fact that the absolute value of a normal has density

f(x) = sqrt(2/pi) exp(-x^2/2), for x >= 0.

The function g(x) = sqrt(2e/pi) exp(-x) majorizes f (since (x - 1)^2 >= 0), with c = sqrt(2e/pi), about 1.32. We can generate from the density h(x) = exp(-x) easily because we know how to generate from an exponential of rate 1; the acceptance probability simplifies to f(X)/g(X) = exp(-(X - 1)^2/2), and a random sign finishes the job.
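A Python sketch of this sampler, under the simplification that the ratio f(X)/g(X) equals exp(-(X-1)^2/2) (the function names are mine):

```python
import math
import random

def half_normal(rng):
    """Rejection sampling of |Z| with an Exp(1) envelope:
    accept X ~ Exp(1) with probability exp(-(X-1)^2/2)."""
    while True:
        x = -math.log(rng.random())                  # Exp(1) by inversion
        if rng.random() <= math.exp(-(x - 1) ** 2 / 2):
            return x

def std_normal(rng):
    """Attach a random sign to get a standard normal."""
    return half_normal(rng) * rng.choice([-1.0, 1.0])
```

The expected number of trials per accepted draw is c = sqrt(2e/pi), about 1.32, so the envelope is quite tight.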

### Table Lookup

If the probabilities are rational numbers and the variable discrete, or well approximated by a discrete distribution, you can use table lookup, which is very expensive to set up but can be useful if you have a very large number of random variables to generate:

Algorithm:

• Given probabilities p_i = k_i/M, generate a table T of length M with k_i entries equal to i.
• Generate a uniform U on (0, 1).
• Return T[ceiling(M U)].
Examples: histogram data.
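For example, with probabilities (1/4, 2/4, 1/4) the table has length M = 4; a Python sketch (the names are mine):

```python
import random

def make_table(counts):
    """counts[i] = k_i; build a table of length M = sum(counts)
    with k_i entries equal to i, so that P(i) = k_i / M."""
    table = []
    for value, k in enumerate(counts):
        table.extend([value] * k)
    return table

def table_draw(table, rng=random.random):
    """One uniform, one array lookup."""
    return table[int(len(table) * rng())]

print(make_table([1, 2, 1]))  # [0, 1, 1, 2]
```

After the setup cost, each draw costs a single uniform and a single index, which is why the method pays off only for a very large number of draws.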

## Specialized Methods

### Polar methods for the Normal

Box - Muller
Fact:
If X and Y are independent standard normals, then their polar coordinates are independent: the angle is uniform on [0, 2 pi) and R^2 = X^2 + Y^2 is exponential with mean 2. So from two independent uniforms U1, U2:

X = sqrt(-2 log U1) cos(2 pi U2), Y = sqrt(-2 log U1) sin(2 pi U2).

Modified to short-cut cosines and sines:
Generate a point uniformly in the unit disk: V1, V2 uniform on (-1, 1), accepted if S = V1^2 + V2^2 < 1.
Call T = sqrt(-2 log(S)/S).

Return X = V1 T, Y = V2 T.
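A Python sketch of the modified (polar) method; roughly a fraction 1 - pi/4 of the candidate points fall outside the disk and are rejected:

```python
import math
import random

def polar_pair(rng):
    """Two independent standard normals from a point uniform in the
    unit disk, avoiding the sine and cosine calls of plain Box-Muller."""
    while True:
        v1 = 2 * rng.random() - 1
        v2 = 2 * rng.random() - 1
        s = v1 * v1 + v2 * v2
        if 0 < s < 1:                        # accept points inside the disk
            t = math.sqrt(-2 * math.log(s) / s)
            return v1 * t, v2 * t
```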

Exercise for next time (Tuesday April 22nd):
With Matlab, do the flop count for both methods:

1. Which is faster?
2. What is the ratio (by how much is it faster)?

Susan Holmes 2002-04-03