# Logistics

Time: MWT 11-12, Sequoia 200.

## Assignments

• Three homeworks to hand in: 20%
• Three labs to hand in: 20%
• Two bigger projects (midterm and final): 60%

## Projects

General instructions:
The main component of the projects will be Matlab or R/S-plus functions that perform certain analyses and produce graphics. These functions should be emailed to me, and hard copies sent to the TAs.

• First part: due Monday, May 3, to be handed in in class.
• Completed project: due Friday, June 4, 2004 at 12:00 pm

### General description

The term project should allow you to see the bootstrap applied to your field of interest, which also means that you can do a theoretical study if that is your primary interest.

Typically the project can be one of the following types, or a combination of elements from each.

1. A case study, if you have some original data and a statistical problem that you want to solve with the bootstrap.
2. A comparative study, in which you compare the performance of the bootstrap with other methods in different situations.
3. Implementation of a new computational procedure. For instance, you could try to write a Gray code program with a clever update for statistics other than the correlation coefficient, use a fancy variance-reducing technique for the Monte Carlo step, or improve on the empirical distribution as the estimator of the underlying distribution.
4. A theoretical study of how fast the bootstrap estimate converges, and how to improve it.
You are encouraged to talk to the instructor or teaching assistants about ideas for a project before you decide on the subject. I will put an appointment calendar on my door; you should sign up for a time to come and talk with me about your project.
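
As an illustration of the third project type, one simple variance-reducing device for the Monte Carlo step is balanced resampling, in which every observation appears equally often across the whole set of resamples. The sketch below is written in Python only for brevity (the projects themselves should use Matlab or R/S-plus). The function names and the simulated data are made up for illustration, and this is just one possible technique, not a prescribed one.

```python
import random
import statistics


def balanced_bootstrap_indices(n, n_boot, rng):
    """Return n_boot index lists of length n in which every index 0..n-1
    appears exactly n_boot times overall (balanced resampling)."""
    pool = list(range(n)) * n_boot   # n_boot copies of each index
    rng.shuffle(pool)                # random permutation of the pool
    return [pool[b * n:(b + 1) * n] for b in range(n_boot)]


def bootstrap_means(data, index_sets):
    """Mean of the data within each resample."""
    return [statistics.fmean(data[i] for i in idx) for idx in index_sets]


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(20)]   # simulated data (assumption)
n_boot = 200

balanced = balanced_bootstrap_indices(len(data), n_boot, rng)
ordinary = [[rng.randrange(len(data)) for _ in data] for _ in range(n_boot)]

# Bootstrap bias estimate of the sample mean under each resampling scheme.
sample_mean = statistics.fmean(data)
bias_balanced = statistics.fmean(bootstrap_means(data, balanced)) - sample_mean
bias_ordinary = statistics.fmean(bootstrap_means(data, ordinary)) - sample_mean
print(f"bias estimate, balanced: {bias_balanced:.2e}")
print(f"bias estimate, ordinary: {bias_ordinary:.2e}")
```

With balanced resampling the bias estimate for the mean is zero up to rounding error, whereas ordinary resampling leaves some Monte Carlo noise in it.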

Some projects will involve considerably more effort than others, and thus have greater potential to earn an outstanding grade. While the complete project counts for 60% of your final grade, the biggest payoff of a more challenging project is the opportunity it gives you to solidify and extend your understanding of the material in the course and to obtain practical experience in applying it to your own research concerns. The first part will count for 20%, the second for 40%.

You may want to do a project using data you have from another course (whether from an experiment or through access to a data set somebody else has collected). This may be a good way to apply statistics to something you have thought about. If you do a project of this sort, you must make very clear which part of the project is done specifically for your statistics course and which part is just a review or copy of work you have done for another course.

### Midterm Project Content

The first part of the project should be about 5 pages long (not counting the bibliography, which should be very complete). Length is not an asset if it is not associated with increased content! The first part should contain:
1. A simple and clear exposition of the question you are addressing.
2. A placement of the problem in the wider context of contemporary statistics, with a review of available methods for such data and a few words on their advantages and disadvantages.
3. A proposed solution to the problem using either the bootstrap or another resampling procedure, with the comparative merits of the bootstrap as opposed to other methods.
4. A flow chart of the various tasks to be undertaken: programming, testing the program on simulated data, and testing the program on real data are all reasonable steps.
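
The "testing the program on simulated data" step can be sketched as follows: run the bootstrap on data drawn from a known distribution and compare the result with a quantity that is known in closed form. This is a minimal sketch in Python for brevity (the project code itself should be in Matlab or R/S-plus). The helper `bootstrap_se`, the sample size, and the simulated normal data are all assumptions chosen for illustration.

```python
import math
import random
import statistics


def bootstrap_se(data, statistic, n_boot, rng):
    """Nonparametric bootstrap standard error of statistic(data)."""
    n = len(data)
    replicates = []
    for _ in range(n_boot):
        # Draw n observations with replacement from the data.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    return statistics.stdev(replicates)


rng = random.Random(1)
n = 50
simulated = [rng.gauss(10.0, 2.0) for _ in range(n)]   # known truth (assumption)

se_boot = bootstrap_se(simulated, statistics.fmean, n_boot=2000, rng=rng)
# For the mean, the ideal bootstrap standard error has a closed form:
# the plug-in standard deviation divided by sqrt(n).
se_exact = statistics.pstdev(simulated) / math.sqrt(n)
print(f"bootstrap SE: {se_boot:.4f}, closed form: {se_exact:.4f}")
```

If the two numbers disagree by much more than Monte Carlo error, the resampling code is suspect and should be debugged before it is applied to real data.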

### Final Project Contents

1. A theoretical part: explanation of the method studied, its properties.
2. A computational part: an algorithm for implementing the method in Matlab or S-plus. This should also be emailed to the TAs so it can be tested. Make sure your code is readable, so that we can do a little troubleshooting if necessary.
3. A data-analysis part: actual data are to be submitted to the method studied, or tables should show comparisons, or theoretical results should be outlined.

Analysis of a data set with your algorithm: Perform a statistical analysis of some data set from an experiment, survey, or secondary data source using Matlab or S-plus. You should pay critical attention to issues concerning how the data were collected as well as to the statistical analysis. (Depending on the nature of the data and your own relationship to it, you may want to give more or less emphasis to explanation of the data collection.) You should make sure that your data set has enough complexity (more than just a couple of variables, and a decent number of observations) to support an interesting analysis.

Computer output should be incorporated in the usual way, i.e. put tables in the text or at the end, but do not hand in a pile of unedited computer output. Tables and figures should be numbered and captioned. No uncommented output will be considered. The quality of presentation (good-quality graphics, careful text processing, no superfluous output) will count toward the final grade. You should put the text of your computer programs in an appendix.

Some ideas according to your area of expertise:

• Education, Psychology, Social Sciences: Methods such as regression analysis, multivariate analyses, and clustering can be bootstrapped usefully.
• Biology: Analysis of DNA: distances and phylogenies are bootstrapped a lot.
• Econometrics: Time series data need special treatment because of the underlying dependence.
You should consult some of the bootstrap books I have put on reserve at the maths and computer science library.
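
To make the remark about time series concrete: resampling individual observations destroys the dependence between neighboring time points, so one common alternative is to resample blocks of consecutive observations. The sketch below shows a moving block bootstrap in Python (for brevity; projects should use Matlab or R/S-plus). The function name, the block length of 10, and the simulated AR(1) series are assumptions made for illustration, and other treatments of dependence are possible.

```python
import random


def moving_block_bootstrap(series, block_length, rng):
    """Resample a series by concatenating randomly chosen overlapping blocks
    of consecutive observations, then trimming to the original length."""
    n = len(series)
    if not 1 <= block_length <= n:
        raise ValueError("block_length must be between 1 and len(series)")
    n_starts = n - block_length + 1          # number of possible block starts
    resampled = []
    while len(resampled) < n:
        start = rng.randrange(n_starts)
        resampled.extend(series[start:start + block_length])
    return resampled[:n]


rng = random.Random(2)
# Simulated AR(1) series: x[t] = 0.8 * x[t-1] + noise (illustrative only)
series = [0.0]
for _ in range(99):
    series.append(0.8 * series[-1] + rng.gauss(0.0, 1.0))

resample = moving_block_bootstrap(series, block_length=10, rng=rng)
print(len(resample), resample[:3])
```

Within each block the original ordering, and hence the short-range dependence, is preserved. The choice of block length is a design decision that a project on this topic would need to study.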

#### Data in publications and books:

• Human Development Report, published annually for the United Nations Development Program. There are a number of other statistical reports from the UN and other international agencies like the International Labor Organization.
• Statistical Abstract of the United States. Full of all sorts of statistical tables.
• On the Net (see below): for instance, the Chance project of Laurie Snell is very interesting.

Some books in other areas that include data sets are the following:
• Data: a Collection of Problems from Many Fields for the Student and Research Worker, by D.F. Andrews and A.M. Herzberg
• Case Studies in Biometry, edited by Lange, Ryan, Billard, Brillinger, Conquest and Greenhouse
Also, articles in books and journals sometimes contain the original data set and you may have an idea for a different analysis than the one which the author did. You should distinguish carefully between what you did and what was in the original article.

#### Data sources on the Internet:

There is an increasing amount of data available on the Internet. As with other Internet materials, there is some gold out there and a lot of pure junk. If you would like to browse around for data on a topic you are interested in, you can start from the Statistics Department home page http://www-stat.stanford.edu/links, and look under Journal, books, etc. There are special databases on the net for each area: GenBank for genetic data, for instance. There are also sports almanacs online.

#### Journals:

There are many journals which include articles with statistical analyses at an accessible level; in some cases the original data sets are also included. Psychology, biology and medicine are areas in which many articles will include at least some statistics. Talk to instructors in your field about what journals make use of statistical methods.
• Population Studies
• Chance (a popularly-oriented statistics magazine)
• Ecology (particularly Volume 74, No. 6, a special issue on statistical methods)
• Journal of Experimental Zoology
• New England Journal of Medicine
• Public Opinion Quarterly
• Journal of Applied Psychology
• Proceedings of the National Academy of Sciences, section Evolution

## Teaching

Instructor: Susan Holmes. Office hours: Wed. at 2:30 and by email appointment to susan@stat.stanford.edu.

TAs: Brit Katzen and Jie-Hua Chen

TAs' office hours:
Brit Katzen (Sequoia 229): Wed. 2:15 - 3:45
Jie-Hua Chen (Sequoia 141): Thur. 4:00 - 5:00

## Course Web Site

This will contain a bulletin board, homeworks, a course summary, a project description list, a reading list, links to useful sites (in particular S-plus and Matlab tutorials), software information, etc.