
More about the theoretical underpinnings of the Bootstrap

Statistical Functionals

(Reference: Erich Lehmann, 1998, pp. 381-438.)
We often speak of the asymptotic properties of the sample mean $\bar{X}$. These refer to the sequence $\bar{X}_n$: in some sense the same function is being applied for every sample size. The notion of a statistical functional makes this precise.

Suppose we are interested in real-valued parameters. We often have a situation where the parameter of interest is a function of the distribution function $F$; such functions are called statistical functionals.

\begin{displaymath}\theta=s(F)\end{displaymath}

Examples:

\begin{displaymath}\mu=E_F(X),\qquad \mu^{(k)}=E_F(X-E(X))^k , \qquad F^{-1}(p) \end{displaymath}

Goodness of fit statistics:

\begin{displaymath}\mbox{Kolmogorov-Smirnov's } h(F)=\sup_x \vert F(x)-F_0(x)\vert \end{displaymath}

is estimated by:

\begin{displaymath}h(\mbox{$\hat{F}_n$})=\sup_x \vert\mbox{$\hat{F}_n$}(x)-F_0(x)\vert \end{displaymath}

Ratio of two means.

\begin{displaymath}\theta=\frac{\mu_1}{\mu_2}=\frac{E_{F_1}(X)}{E_{F_2}(X)}\end{displaymath}

We use the sample cdf

\begin{displaymath}\hat{F}_n(x)=\frac{\#\{i:\, X_i \leq x\}}{n}=\frac{1}{n}\sum_{i=1}^n
\delta_{\{X_i\leq x\}}\end{displaymath}

as the nonparametric estimate of the unknown distribution $F$.

The usual estimates for these functionals are obtained by simply plugging in the empirical distribution function for the unknown theoretical one.

Thus taking into account that for any function $g$ we have:

\begin{displaymath}\int g(x)d\mbox{$\hat{F}_n$}(x)=\frac{1}{n}\sum_{i=1}^n g(x_i)\end{displaymath}

the plug-in estimate for the mean is $\int x\, d\hat{F}_n(x) =\frac{1}{n}\sum_{i=1}^n x_i$, the usual sample mean; for the variance, it is the biased estimate:

\begin{displaymath}\widehat{\mathrm{var}_F(X)}=\int \bigl(x-E_{\mbox{$\hat{F}_n$}}(X)\bigr)^2\, d\mbox{$\hat{F}_n$}(x)=\frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})^2
\end{displaymath}
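
As a quick sketch, the plug-in formulas can be checked in Matlab on any sample; note that Matlab's built-in var divides by $n-1$, not $n$:

x=randn(20,1);                 % any sample will do
mu_hat=mean(x);                % plug-in mean: integral of x dFhat_n(x)
var_hat=mean((x-mu_hat).^2)    % plug-in variance, divides by n (biased)
var(x)                         % compare: the unbiased version divides by n-1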

Notions of Convergence

Convergence in Law
A sequence of cumulative distribution functions $H_n$ is said to converge in distribution to $H$ iff $H_n(x)\longrightarrow H(x)$ at all continuity points of $H$.

If the random variable $Y_n$ has cdf $H_n$ and the rv $Y$ has cdf $H$, we say that $Y_n$ converges in law to $Y$, and we write

\begin{displaymath}Y_n \stackrel{\cal{ L}}{\longrightarrow} Y\end{displaymath}

This does not mean that $Y_n$ and $Y$ are arbitrarily close; think of the random variables $U \sim U(0,1)$ and $1-U$, which have exactly the same distribution but are not close to each other.

Convergence in Probability

\begin{displaymath}Y_n \stackrel{P}{\longrightarrow} Y \iff
\forall \epsilon>0, \; P(\vert Y_n-Y\vert < \epsilon) \longrightarrow 1\end{displaymath}

Note:
If $k_n Y_n \stackrel{\cal{ L}}{\longrightarrow} H$, where $H$ is a limiting distribution, and $k_n\longrightarrow \infty$, then $Y_n \stackrel{P}{\longrightarrow} 0$.

Why is the empirical cdf $\hat{F}_n$ a good estimator of F?

We showed in class that for fixed real $a$

\begin{displaymath}\sqrt{n}(\hat{F}_n(a)-F(a)) \stackrel{\cal{ L}}{\longrightarrow} {\cal N}(0,F(a)(1-F(a)))\end{displaymath}

Because of the result noted above, this also ensures that $\hat{F}_n(a) \stackrel{P}{\longrightarrow} F(a)$. This is actually true uniformly in $a$ because Kolmogorov's statistic

\begin{displaymath}d(\hat{F}_n,F)=\sup_{x} \vert\hat{F}_n(x)-F(x)\vert =D_n \end{displaymath}

has a distribution that does not depend on $F$.

Definition:
A statistic is said to be pivotal if its distribution does not depend on any unknown parameters.

Example: Student's $t$ statistic.
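
As a small simulation sketch, one can check that $D_n$ is pivotal: with samples of the same size drawn from a uniform and from an exponential distribution, the two histograms of $D_n$ look the same (the variable names below are arbitrary):

n=50; B=2000;
Du=zeros(1,B); De=zeros(1,B);
for b=(1:B)
  u=sort(rand(n,1));        Fu=u;            % uniform(0,1) data, F(x)=x
  e=sort(-log(rand(n,1)));  Fe=1-exp(-e);    % exponential(1) data, F(x)=1-exp(-x)
  i=(1:n)';
  Du(b)=max(max(i/n-Fu,Fu-(i-1)/n));         % D_n = sup_x |Fhat_n(x)-F(x)|
  De(b)=max(max(i/n-Fe,Fe-(i-1)/n));
end
hist(Du,40), figure, hist(De,40)             % the two histograms should agree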

Generalized Statistical Functionals

When we want to evaluate an estimator, construct confidence intervals, etc., we are usually interested in quantities that are functions of the unknown distribution $F$, the empirical $\hat{F}_n$, and the sample size. Here are some examples:
  1. The sampling distribution of the error:

    \begin{displaymath}\lambda_n(F,\hat{F}_n)=P_F\bigl(\sqrt{n}(\theta(\hat{F}_n)-\theta(F))\leq x\bigr)\end{displaymath}

  2. The bias:

    \begin{displaymath}\lambda_n(F,\hat{F}_n)=E_F(\theta(\hat{F}_n))-\theta(F)\end{displaymath}

  3. The standard error:

    \begin{displaymath}\lambda_n(F,\hat{F}_n)=\sqrt{E_F(\theta(\hat{F}_n)-\theta(F))^2}
\end{displaymath}

For each of these examples, what the bootstrap proposes is to replace $F$ by the empirical $\hat{F}_n$.

The bootstrap is said to work if

\begin{displaymath}\lambda_n(\hat{F}_n,\hat{F}_n^*)-\lambda_n(F,\hat{F}_n)\stackrel{P}{\longrightarrow} 0
\end{displaymath}
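
For instance, here is a minimal sketch of this plug-in idea for the standard error of the sample median of a hypothetical sample x (the sample and variable names are arbitrary):

x=5*rand(100,1);                 % a hypothetical original sample
B=1000; n=length(x);
indices=ceil(n*rand(n,B));       % n-by-B resampling indices, drawn with replacement
medstar=median(x(indices));      % B bootstrap replicates of the median
sehat=std(medstar)               % bootstrap estimate of the standard error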

Example and Counterexample

Bootstrap of the maximum

Suppose we have a random variable $X$ uniformly distributed on $(0,\theta)$, where $\theta$ is the unknown parameter that we wish to estimate; we would also like to know the sampling distribution of our estimate.

Theoretical Analysis

We showed in class that if we take the largest value of a sample of size $n$ to be the estimate of $\theta$, $\hat{\theta}=X_{(n)}$, then

\begin{displaymath}
P[\theta-c<X_{(n)}<\theta]=1-P[X_{(n)}<\theta-c]=1-(\frac{\theta-c}{\theta})^n
\end{displaymath}

so that $X_{(n)} \stackrel{P}{\longrightarrow} \theta$.

As for the convergence in law:

\begin{displaymath}
P[n(\theta-X_{(n)})\leq x]=1-\Bigl(1-\frac{x}{\theta n}\Bigr)^n
\end{displaymath}

and

\begin{displaymath}P[n(\theta-X_{(n)})\leq x]\longrightarrow
H(x)=1-e^{-\frac{x}{\theta}}
,\mbox{ as } n\longrightarrow \infty
\end{displaymath}

So the sampling distribution of $n(\theta-X_{(n)})$ tends to an exponential distribution with mean $\theta$, which depends on the unknown parameter.

I backed this up with a simulation experiment: I generated random uniforms on (0,1), multiplied them by 5 to simulate a U(0,5) distribution, then drew many samples of size 15 from this distribution and took their maxima. First, a small example showing that max operates column by column on a matrix:

>> test1=rand(3,10)
test1 =
  Columns 1 through 6
0.5804  0.8833  0.3863  0.2362  0.2711  0.6603   
0.7468  0.1463  0.9358  0.9170  0.8861  0.1625   
0.0295  0.8030  0.7004  0.2994  0.4198  0.0537   
  Columns 7 through 10 
 0.3936   0.6143    0.8899    0.2002
 0.4927   0.4924    0.0578    0.3766
 0.2206   0.8396    0.2538    0.0489
>> [v,i]=max(test1)
v =
Columns 1 through 7 
0.7468 0.8833 0.9358 0.9170 0.8861 0.6603 0.4927
Columns 8 through 10 
 0.8396    0.8899   0.3766
i =
  2   1    2    2     2     1   2   3   1   2

----------------------------------------------
function v=smaxi(B,n,maxi)
%Simulation of the uniform distribution
%on (0,maxi) with estimation of the maximum
%Samples of size n, B simulations
rands=maxi*rand(n,B);    % n-by-B matrix, each column a sample from U(0,maxi)
[v,i]=max(rands);        % column-wise maxima: B simulated values of X_(n)
--------------------------------------------

mm=smaxi(10000,15,5);
hist(mm,40)

This is what the histogram looks like:

As can be seen, although the sample size is not very large, the sampling distribution is already quite close to exponential.

In the bootstrap example, we start with a given sample, sample1, which we resample 10,000 times, computing 10,000 maxima. We do not use loops but rather matrix operations, which are more efficient in both Splus and Matlab.
Nonparametric Bootstrap
sample1=(5*rand(15,1));
  Columns 1 through 7 
1.1892 3.0412 4.4574 4.6107 4.6531 1.6721 3.9533
  Columns 8 through 14 
2.4167 2.7374 1.9671 2.7069 1.6799 2.4536 4.0456
  Column 15 
3.2760
>> [v,i]=max(sample1)
v =    4.6531
i =     5
>> n=15; B=10;
>> indices=randint(n,B,n)+1
indices =
   13    1   13    1    7   5     9   10    9    7
    2    6    1    6    8   3     1    3    2    4
   13   11    1    1    6   1    14   15    1    7
    6   10    1    5    9   7     4    1    5    5
   10   12    3    2    7   4    15   11   15   13
    2    2    3   14   12  10     8    4    9    9
   10   14    3    1    6  15     9   10    6    9
    2    6    6    9    1  11     3   13   12    8
    1    2    9    6   13   4     7    1   14   15
    9    9   14    6    1   9    14    3   15    3
    5   13    5    6   11  11    15    4    4   13
   10    8   12   14   14   7     7    4    9    7
    2    2    4   11    5   1     4    2    3    4
   13    5   10    7    8   4    13    3    9    1
   15    7    2    1    7   9    11   12    6    6
>> [out,i]=max(sample1(indices))
  Columns 1 through 7 
4.6531    4.6531    4.6531  4.6531  4.6531  4.6531  4.6107
  Columns 8 through 10 
4.6107    4.6531    4.6531
function out=bmax(B,orig)
%Function to bootstrap the maximum
%orig is the original sample, B the number of bootstrap samples
     [n,p]=size(orig);               % n is the original sample size
     indices=randint(n,B,n)+1;       % n-by-B indices in 1..n, drawn with replacement
     [out,i]= max(orig(indices));    % column-wise maxima of the B bootstrap samples

This is what the histogram looks like:

This shows a definite point mass at the sample maximum. In fact we can prove that the bootstrap distribution of the maximum has a point mass at $X_{(n)}$ that stays large, whatever the sample size, because:

\begin{displaymath}P(X_{(n)}^*=X_{(n)})=P(X_{(n)} \in
\mbox{${\cal X}$}_n^*)=1-\Bigl(1-\frac{1}{n}\Bigr)^n\simeq 1-e^{-1}\end{displaymath}
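
As a quick numerical check, reusing the function bmax and the sample sample1 from above:

n=15;
1-(1-1/n)^n                   % about 0.645; the limit 1-exp(-1) is about 0.632
mm=bmax(10000,sample1);
mean(mm==max(sample1))        % fraction of bootstrap maxima equal to X_(n)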

There are several ways to fix this. Joe Romano has suggested a $k$-out-of-$n$ bootstrap (a minimal sketch is given below); we could also use the extra information that we supposed we knew the form of the original distribution, i.e. uniform, even though the parameter is unknown. That is also a plug-in method, but it is called the parametric bootstrap.
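
A minimal sketch of the $k$-out-of-$n$ idea (the function name bmaxk and the choice of $k$ are hypothetical): each bootstrap sample contains only $k<n$ observations, which lowers the point mass at the observed maximum.

--------------------------------------------
function out=bmaxk(B,orig,k)
%k-out-of-n bootstrap of the maximum:
%resample only k<n observations with replacement
n=length(orig);
indices=ceil(n*rand(k,B));    % k-by-B indices in 1..n
out=max(orig(indices));       % column-wise maxima of the B bootstrap samples
--------------------------------------------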

Parametric Bootstrap

Knowing that the distribution function $F$ is restricted to a certain parametric family can help a lot.

Maximum

Suppose that we still want to estimate the maximum, knowing now that the sample was drawn from a uniform distribution with unknown upper bound $\theta$. We would be better off generating many samples from the Uniform(0,$\hat{\theta}$) and looking at the distribution of the maximum; there will of course be a slight bias to the left, but the distribution will be of the right form.
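
A minimal sketch of this parametric bootstrap (the function name pbmax is hypothetical; we plug in $\hat{\theta}=X_{(n)}$):

--------------------------------------------
function out=pbmax(B,n,thetahat)
%Parametric bootstrap of the maximum:
%resample from Uniform(0,thetahat) instead of from the empirical cdf
samples=thetahat*rand(n,B);   % n-by-B matrix of U(0,thetahat) draws
out=max(samples);             % column-wise maxima, one per bootstrap sample
--------------------------------------------

mm=pbmax(10000,15,max(sample1));
hist(mm,40)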

Correlation Coefficient

Suppose that in the law school data the random variables $y,z$ are known to be jointly normal, with some unknown covariance structure and correlation coefficient $\rho$.

Instead of creating new data by resampling, we can generate new data from the bivariate normal; however, we have to plug in estimates of the means and the variance/covariance structure obtained from the original data.

Parametric Simulation

>> c=sqrt((1-.776^2)/.776^2)
c =    0.8128
function [ys,zs]=gennorm(B,my,mz,sy,sz,c)
%Simulation of the bivariate normal distribution
%with sy^2,sz^2,rho the variances and correlation,
%(my,mz) the means, B simulations
%c=sqrt((1-rho^2)/rho^2), so that 1/sqrt(1+c^2)=rho
rs=randn(B,2);
r1=rs(:,1);
r2=rs(:,2);
ys=my+sy*r1;
zs=mz+(sz/sqrt(1+c^2))*(r1+c*r2);   % corr(ys,zs)=rho and sd(zs)=sz
>> [ys,zs]=gennorm(1000,my,mz,sy,sz,c);
>> corrcoef(ys,zs)
ans =
    1.0000    0.7558
    0.7558    1.0000

This is what the points look like:

Parametric Bootstrap Simulations

 y=law15(:,1)
 z=law15(:,2)
>> mz=mean(z)
mz =
    3.0947
>> my=mean(y)
my =
  600.2667
>> var(z)
ans =    0.0593
>> var(y)
ans =   1.7468e+03
>> corrcoef(y,z)
ans =
    1.0000    0.7764
    0.7764    1.0000
>> cov(y,z)
ans =
   1.0e+03 *
    1.7468    0.0079
    0.0079    0.0001
>> sy=sqrt(var(y))
sy =   41.7945
>> sz=sqrt(var(z))
sz =
    0.2435
corrs=zeros(1,1000); 
for b=(1:1000)
[ys,zs]=gennorm(15,my,mz,sy,sz,c);
cor=corrcoef(ys,zs);    
corrs(b)=cor(1,2);
end
>> hist(corrs,40)

This is what the histogram looks like:

Here you can compare with the 'true' sampling distribution for the law school data, obtained by sampling without replacement 100,000 times from the original 82-observation population.


Susan Holmes 2004-05-19