# Maximum Likelihood Methods for Estimating Trees

For a statistician this is the easiest of the methods to understand. A parametric model with parameters $(\theta, \tau)$ is postulated, where $\theta$ is a $k$-dimensional vector that we explain below and $\tau$ is the tree's topology. Under this model the likelihood of each possible tree is computed separately for each character or site; the independence of sites then allows the total likelihood of the tree for all the data to be computed by taking the product.
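The product over sites can be sketched as follows. This is only an illustration: the function name `total_log_likelihood` and the per-site values are hypothetical, and the sketch works on the log scale because a product of many small probabilities underflows in floating point.

```python
import math

def total_log_likelihood(site_likelihoods):
    """Combine per-site likelihoods under the site-independence assumption."""
    # Independence makes the total likelihood a product over sites;
    # summing logs gives the same quantity on the log scale.
    return sum(math.log(p) for p in site_likelihoods)

# Hypothetical per-site likelihoods for one candidate tree (made-up numbers).
site_likelihoods = [0.02, 0.15, 0.008, 0.11]
log_l = total_log_likelihood(site_likelihoods)
```

Candidate trees can then be compared by their log-likelihoods, since the logarithm preserves the ordering.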

The first part of the vector of parameters comes from the substitution model as explained above. The number of other parameters that have to be specified depends on the complexity of the model. If a molecular clock is postulated, speciation times (splitting events) are the other parameters. Otherwise both the branch lengths and the different rates along those branches have to be parametrized.

The substitution parameters are estimated from the data. A complete model, including the distributions of separation events, is postulated, and the likelihood can be computed for each possible tree by computing the likelihood of the tree given each site.

This actually requires computing the likelihood of all the subtrees, so the method is recursive.
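A minimal sketch of this recursion for a single site is given below. It assumes the Jukes-Cantor substitution model with a uniform distribution at the root, which may not be the model used in these lectures. The tree encoding (a leaf is the observed nucleotide, an internal node is a tuple `(left, t_left, right, t_right)`) and all function names are hypothetical choices made for the illustration.

```python
import math

NUCLEOTIDES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability of going from a to b along a branch of
    length t (expected substitutions per site). Assumed model."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def conditional_likelihoods(node):
    """Return {state: P(tips observed below node | node is in state)}."""
    if isinstance(node, str):  # leaf: the observed nucleotide at this site
        return {s: 1.0 if s == node else 0.0 for s in NUCLEOTIDES}
    left, t_left, right, t_right = node
    # Recurse into both subtrees before combining them at this node.
    l_left = conditional_likelihoods(left)
    l_right = conditional_likelihoods(right)
    result = {}
    for s in NUCLEOTIDES:
        from_left = sum(jc_prob(s, x, t_left) * l_left[x] for x in NUCLEOTIDES)
        from_right = sum(jc_prob(s, x, t_right) * l_right[x] for x in NUCLEOTIDES)
        # Lineages evolve independently, so the two sides multiply.
        result[s] = from_left * from_right
    return result

def site_likelihood(tree):
    """Likelihood of one site: average over the uniform root distribution."""
    cond = conditional_likelihoods(tree)
    return sum(0.25 * cond[s] for s in NUCLEOTIDES)

# Hypothetical three-taxon tree with made-up branch lengths.
tree = (("A", 0.1, "A", 0.1), 0.2, "G", 0.3)
likelihood = site_likelihood(tree)
```

Each node is visited once per site, so the cost of evaluating one fixed tree grows linearly with the number of nodes.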

As the assumptions are essential, I present them here:

1. Each site in the sequence evolves independently.
2. Different lineages evolve independently.
3. Each site undergoes substitution at an expected rate which is chosen from a series of rates with a given distribution.

Fancier versions of the procedure allow different sites to have different evolutionary rates.
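One way to read the third assumption is that the likelihood of a site is an average over a discrete set of rates, with each rate scaling the branch lengths. The sketch below illustrates this for the simplest case of two sequences. It assumes the Jukes-Cantor model, and the rates, weights and function names are hypothetical.

```python
import math

def jc_pair_likelihood(a, b, t):
    """Likelihood of observing nucleotides a and b at one site in two
    sequences separated by total branch length t (Jukes-Cantor, assumed)."""
    e = math.exp(-4.0 * t / 3.0)
    p = 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e
    return 0.25 * p  # 0.25 is the uniform probability of the first state

def mixed_site_likelihood(a, b, t, rates, weights):
    """Average the site likelihood over rate categories."""
    # A site evolving at relative rate r sees an effective branch length r * t.
    return sum(w * jc_pair_likelihood(a, b, r * t) for r, w in zip(rates, weights))

# Hypothetical rate categories; the weights sum to 1 and the mean rate is 1.
rates = [0.5, 1.0, 1.5]
weights = [0.25, 0.5, 0.25]
lik_same = mixed_site_likelihood("A", "A", 0.4, rates, weights)
lik_diff = mixed_site_likelihood("A", "G", 0.4, rates, weights)
```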

Many biologists won't use maximum likelihood because of the computational expense: although the likelihood of a single tree can be computed recursively, the number of candidate topologies grows super-exponentially with the number of taxa, and finding the maximum likelihood tree is NP-hard. This is a surprising exception to the usual rule that parametric methods have the advantage of lower computational needs. Others don't use the MLE because there seems to be little evidence that the assumptions are realistic in real biological applications.
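To give a sense of how many trees "each possible tree" covers, the number of unrooted binary topologies on $n$ labelled taxa is $(2n-5)!! = 1 \cdot 3 \cdot 5 \cdots (2n-5)$. The short sketch below computes it; the function name is hypothetical.

```python
def num_unrooted_topologies(n_taxa):
    """Number of unrooted binary tree topologies on n_taxa labelled tips,
    given by the double factorial (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

counts = {n: num_unrooted_topologies(n) for n in (4, 5, 10)}
# counts == {4: 3, 5: 15, 10: 2027025}
```

With only ten taxa there are already more than two million topologies, which is why an exhaustive evaluation of every tree is rarely practical.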

Prof. Susan Holmes
Tue Mar 24 14:40:11 EST 1998