
The Smoothed Bootstrap

We have seen how the parametric bootstrap and the nonparametric bootstrap differ in what is plugged into the statistical functional.

We want to estimate $\lambda_n(F)$, and we can use as an estimate either $\lambda_n(F_{\hat{\theta}})$ or $\lambda_n(\hat{F}_n)$. In fact there is an intermediate choice: take the empirical cdf $\hat{F}_n$, smooth it a little, and plug in the smoothed empirical cdf, denoted $\hat{F}_h$.

This is especially useful when the bootstrap distribution is too discrete, most often when the statistic $\hat{\theta}$ is a quantile; the median, as we saw in the mouse data analysis, had that problem.
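For concreteness, here is a minimal R sketch of the smoothed bootstrap for the median; the sample x and the bandwidth h are illustrative choices, not prescriptions. With a Normal kernel, drawing from $\hat{F}_h$ amounts to resampling the data and adding $N(0,h^2)$ noise, which breaks the ties that make the ordinary bootstrap distribution of the median so discrete.

     # Smoothed bootstrap of the median (illustrative sketch).
     # Drawing from F_h = resample from F_n, then add N(0, h^2) noise.
     set.seed(1)
     x <- rexp(25)                          # hypothetical sample
     h <- sd(x) * length(x)^(-1/5)          # a rough bandwidth choice
     B <- 1000
     med.star <- replicate(B, {
       xstar <- sample(x, replace = TRUE) + rnorm(length(x), 0, h)
       median(xstar)
     })
     sd(med.star)                           # smoothed-bootstrap SE of the median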

Smoothing: a crash course

Suppose we have a two-dimensional scatterplot we want to smooth; this could be a histogram or a regression-type context, which are both of the same form. The simplest case to start with is when the $x$ abscissae, although ordinal, are discrete, such as ages rounded to decades. Then the $y$ data appear along vertical lines at the possible $x$'s.

The crosses, which are the conditional averages, are a smooth of the scatterplot in some way.
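In R, with hypothetical vectors age (rounded to decades) and y, these conditional averages are just the group means of y at each distinct x:

     # Conditional averages at each discrete x (the crosses).
     age <- sample(seq(20, 70, by = 10), 100, replace = TRUE)  # hypothetical data
     y   <- 50 + 0.3 * age + rnorm(100, 0, 5)
     xbar <- tapply(y, age, mean)          # one average per decade
     plot(age, y)
     points(as.numeric(names(xbar)), xbar, pch = 3)   # crosses at the means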

Now suppose that the $x$'s can fall anywhere: we window them and take local averages.

The extreme case is when you take the whole $x$ axis as one window: then there is only one average, and if you want you can draw a horizontal line through it.

At the other extreme, when the window is smallest, each point is its own average and there is NO smoothing.

Again we want something gentler, so we reduce the window width and take only local averages. If, within a window, we choose to differentiate the points according to how close they are to the abscissa at which we want to estimate the $y$ value by averaging, we can use a kernel weighting function.

Points that are close are given high weights, points further away are given lighter weights, and points at the boundary of the window don't count at all.

The weighting function is such that the sum of all the weights is 1; with no difference between the weights, they are uniform. In fact the weighting function can be a probability density, and often we take a Normal one.
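A minimal sketch of such a kernel-weighted local average (the Nadaraya-Watson estimator) with a Normal weighting function; the bandwidth h plays the role of the window width and is left as an illustrative argument:

     # Kernel-weighted local average at each grid point x0:
     # Normal weights, normalized so they sum to 1.
     kernsm <- function(x, y, xgrid, h) {
       sapply(xgrid, function(x0) {
         w <- dnorm((x - x0) / h)    # heavy weight near x0, light far away
         sum(w * y) / sum(w)         # normalized weights sum to 1
       })
     }

R's built-in ksmooth(x, y, kernel = "normal") computes the same kind of estimate.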

Here is a nice webpage on smoothing, with Matlab software available.

Curve Fitting Example, Efron & Tibshirani, 7.3

loess.m is available in the course directory, and loess is a built-in function in S-PLUS.

Matlab procedure for bootstrapping the loess curve.

% N is the number of bootstrap replicates.
N=500;
predmat=zeros(N, 101);

datasize=size(cholo,1);

clf;
plot(cholo(:,1), cholo(:,2), '.');   
hold on;

for i=1:N
  xind=unidrnd(datasize, datasize,1);
  x=cholo(xind,:);
  predmat(i,:)=loess(x(:,1), x(:,2), (0:100), .3, 1);     
  plot((0:100), predmat(i,:), '-.');     % Plot a sample bootstrap curve.
end;

% Plot the 95% pointwise confidence lines.
plot((0:100), prctile(predmat, 2.5), 'r-');
plot((0:100), prctile(predmat, 97.5), 'r-');

xlabel('Compliance');
ylabel('Improvement');
axis([-5, 105, -40, 120]);
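For comparison, here is a rough R translation of the same procedure, assuming as above that cholo is a two-column matrix of (compliance, improvement) pairs; span = .3 and degree = 1 mirror the loess arguments in the Matlab call:

     # Bootstrap the loess curve in R: resample pairs, refit, predict on a grid.
     N <- 500
     grid <- 0:100
     predmat <- matrix(0, N, length(grid))
     d <- data.frame(x = cholo[, 1], y = cholo[, 2])
     plot(d$x, d$y)
     for (i in 1:N) {
       db <- d[sample(nrow(d), replace = TRUE), ]
       fit <- loess(y ~ x, data = db, span = .3, degree = 1,
                    control = loess.control(surface = "direct"))
       predmat[i, ] <- predict(fit, newdata = data.frame(x = grid))
       lines(grid, predmat[i, ], lty = 3)  # one bootstrap curve
     }
     # 95% pointwise confidence lines
     lines(grid, apply(predmat, 2, quantile, .025), col = "red")
     lines(grid, apply(predmat, 2, quantile, .975), col = "red")

Here surface = "direct" lets predict() extrapolate when a bootstrap sample does not span the whole 0-100 grid.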

Smoothing for variance stabilization

Pages 164-166, Algorithm (a rough R sketch follows the steps):
  1. Generate $B_1$ bootstrap samples $\mbox{${\cal X}$}_b^*$ and the bootstrap estimates $\{\hat{\theta}_b^*, b=1:B_1\}$, estimating the standard error of each with $B_2$ inner bootstrap resamples.
  2. Fit a smooth curve to the pairs $(\hat{\theta}_b^*, \hat{se}(\hat{\theta}_b^*))$ to produce a smooth estimate of the function $s(u)=se(\hat{\theta}\vert\theta=u)$.
  3. Use $g(x)=\int^x \frac{1}{s(u)}\,du$ as the variance stabilizing transformation; $g$ is usually found through numerical integration.
  4. Compute, with $B_3$ bootstrap resamples, a bootstrap-t interval for $\phi=g(\theta)$ (its SE is approximately one, so no denominator is needed).
  5. Map the endpoints of the interval back through the inverse transformation $g^{-1}$.
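A minimal R sketch of these steps for $\hat{\theta}$ the mean; $B_1$, $B_2$, $B_3$ and the trapezoid-rule integration are illustrative choices (in practice one calls boott(..., VS=TRUE), documented below):

     # Variance-stabilized bootstrap-t, illustrative sketch for theta = mean.
     set.seed(1)
     x <- rchisq(20, 1)
     B1 <- 100; B2 <- 25; B3 <- 1000
     th <- se <- numeric(B1)
     for (b in 1:B1) {                     # step 1: pairs (theta*, se(theta*))
       xb <- sample(x, replace = TRUE)
       th[b] <- mean(xb)
       se[b] <- sd(replicate(B2, mean(sample(xb, replace = TRUE))))
     }
     sfit <- loess(se ~ th)                # step 2: smooth s(u) = se(theta | theta = u)
     u <- seq(min(th), max(th), length = 200)
     s <- pmax(predict(sfit, data.frame(th = u)), 1e-8)
     g <- cumsum(c(0, diff(u) * (1 / s[-1] + 1 / s[-200]) / 2))  # step 3: g = int 1/s
     gfun <- approxfun(u, g, rule = 2)
     phihat <- gfun(mean(x))               # step 4: bootstrap-t on the g scale
     tstar <- replicate(B3, gfun(mean(sample(x, replace = TRUE)))) - phihat
     ci.phi <- phihat - quantile(tstar, c(.975, .025))
     ginv <- approxfun(g, u, rule = 2)     # step 5: map back through g^{-1}
     ginv(ci.phi)                          # interval on the original theta scale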

boott               package:bootstrap               R Documentation

Bootstrap-t Confidence Limits

Description:

     See Efron and Tibshirani (1993) for details on this function.

Usage:

     boott(x,theta, ..., sdfun=MISSING, nbootsd=25, nboott=200,
           VS=FALSE, v.nbootg=100, v.nbootsd=25, v.nboott=200,
           perc=c(.001,.01,.025,.05,.10,.50,.90,.95,.975,.99,.999))

Arguments:

       x: a vector containing the data. Nonparametric bootstrap
          sampling is used. To bootstrap from more complex data
          structures (e.g. bivariate data) see the last example below.

   theta: function to be bootstrapped. Takes 'x' as an argument, and
          may take additional arguments (see below and last example).

     ...: any additional arguments to be passed to 'theta'

   sdfun: optional name of function for computing standard deviation of
          'theta' based on data 'x'. Should be of the form: 'sdmean <-
          function(x,nbootsd,theta,...)' where 'nbootsd'  is a dummy
          argument that is not used. If 'theta' is the mean, for
          example,  'sdmean <- function(x,nbootsd,theta,...)
          {sqrt(var(x)/length(x))}' . If 'sdfun' is missing, then
          'boott' uses an inner bootstrap loop to estimate the 
          standard deviation of 'theta(x)'

 nbootsd: The number of bootstrap samples used to estimate the standard
          deviation of 'theta(x)'

  nboott: The number of bootstrap samples used to estimate the
          distribution of the bootstrap T statistic.  200 is a bare
          minimum and 1000 or more is needed for  reliable  alpha %
          confidence points, alpha > .95 say.  Total number of
          bootstrap samples is  'nboott*nbootsd'.

      VS: If 'TRUE', a variance stabilizing transformation is
          estimated,  and the interval is constructed on the
          transformed scale, and then is mapped back to the original
          theta scale.  This can improve both the statistical
          properties of the intervals and speed up the computation. See
          the reference Tibshirani (1988) given below. If 'FALSE',
          variance stabilization is not performed.

v.nbootg: The number of bootstrap samples used to estimate the variance
         stabilizing transformation g.  Only used if 'VS=TRUE'.

v.nbootsd: The number of bootstrap samples used to estimate the
          standard deviation of 'theta(x)'.  Only used if 'VS=TRUE'.

v.nboott: The number of bootstrap samples used to estimate the
          distribution of  the bootstrap T statistic. Only used if
          'VS=TRUE'. Total number of bootstrap samples is
          'v.nbootg*v.nbootsd + v.nboott'.

    perc: Confidence points desired.

Value:

     list with the following components: 

confpoints: Estimated confidence points

theta, g: 'theta' and 'g' are only returned if 'VS=TRUE' was specified.
          '(theta[i],g[i]),  i=1,length(theta)'  represents the
          estimate of the variance stabilizing transformation 'g' at
          the points 'theta[i]'.

References:

     Tibshirani, R. (1988) Variance stabilization and the bootstrap.
     Biometrika 75(3), 433-444.

     Hall, P. (1988) Theoretical comparison of bootstrap confidence
     intervals. Ann. Statist. 16, 927-953.

     Efron, B. and Tibshirani, R. (1993) An Introduction to the
     Bootstrap. Chapman and Hall, New York, London.

Examples:

     #  estimated confidence points for the mean
     x <- rchisq(20,1)
     theta <- function(x){mean(x)}
     results <- boott(x,theta)
     # estimated confidence points for the mean, 
     #  using variance-stabilization bootstrap-T method
     results <-  boott(x,theta,VS=TRUE)
     results$confpoints          # gives confidence points
     # plot the estimated var stabilizing transformation
     plot(results$theta,results$g) 
     # use standard formula for stand dev of mean
     # rather than an inner bootstrap loop
     sdmean <- function(x, ...) 
         {sqrt(var(x)/length(x))}
     results <-  boott(x,theta,sdfun=sdmean) 

     # To bootstrap functions of more complex data structures,
     # write theta so that its argument x
     # is the set of observation numbers
     # and simply pass to boott as data the vector 1,2,...,n.
     # For example, to bootstrap
     # the correlation coefficient from a set of 15 data pairs:
          
     xdata <- matrix(rnorm(30),ncol=2)
     n <- 15
     theta <- function(x, xdata){ cor(xdata[x,1],xdata[x,2]) }
     results <- boott(1:n,theta, xdata)


Susan Holmes 2004-05-19