Next: Representation of trees Up: Lecture Summaries Previous: The Substitution Model

Maximum Likelihood Methods for Estimating Trees

For a statistician this is the easiest of the methods to understand. A parametric model tex2html_wrap_inline1468 is postulated, where tex2html_wrap_inline1140 is a tex2html_wrap_inline1472-dimensional vector of parameters that we explain below and tex2html_wrap_inline1474 is the tree's topology. Under this model, the likelihood of each possible tree tex2html_wrap_inline1474 is computed separately for each character or site; the independence of sites then allows the total likelihood of the tree for all the data to be computed by taking the product over sites.
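The factorization over sites can be written out as a short sketch. The symbols are assumptions chosen for illustration and are not necessarily the notation of the lecture: $\theta$ stands for the parameter vector, $\tau$ for the topology, and $x_1, \dots, x_m$ for the observed characters at the $m$ sites.

```latex
L(\theta, \tau)
  = \prod_{i=1}^{m} P_{\theta,\tau}(x_i),
\qquad
\log L(\theta, \tau)
  = \sum_{i=1}^{m} \log P_{\theta,\tau}(x_i).
```

In practice the sum of logarithms is usually the quantity computed, since a product of many small site likelihoods underflows quickly.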

The first part of the vector of parameters tex2html_wrap_inline1505 comes from the substitution model, as explained above. The number of other parameters that have to be specified depends on the complexity of the model. If a molecular clock is postulated, the speciation times tex2html_wrap_inline1480 (splitting events) are the other parameters. Otherwise, both the branch lengths tex2html_wrap_inline1482 and the different rates along those branches have to be parametrized.


The substitution parameters are estimated from the data. A complete model, including the distribution of the separation events, is postulated, and the likelihood of each possible tree is obtained by computing the likelihood of the tree given each site tex2html_wrap_inline1498:

displaymath1500

This actually requires computing the likelihood of all the subtrees, so the method is recursive.

displaymath1502
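The recursion over subtrees can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the lecture's own formulation: it assumes the Jukes-Cantor substitution model with a uniform distribution at the root, and the tree encoding and the names `jc_prob`, `partial`, `site_likelihood` and `log_likelihood` are hypothetical, chosen only for this example.

```python
import math

STATES = "ACGT"

def jc_prob(i, j, t):
    """Jukes-Cantor probability of going from state i to state j along a
    branch of length t (assumed model, expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def partial(node, site):
    """Likelihood of the data below `node`, conditional on each state at `node`.

    Hypothetical encoding: a leaf is its sequence (a string); an internal
    node is a list of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        # A leaf is compatible only with the state actually observed.
        return [1.0 if s == node[site] else 0.0 for s in STATES]
    result = [1.0] * len(STATES)
    for child, t in node:
        below = partial(child, site)  # recursive call on the subtree
        for a, s in enumerate(STATES):
            result[a] *= sum(jc_prob(s, u, t) * below[b]
                             for b, u in enumerate(STATES))
    return result

def site_likelihood(root, site):
    """Average the root's conditional likelihoods over a uniform root state."""
    return sum(0.25 * p for p in partial(root, site))

def log_likelihood(root, n_sites):
    """Sites are independent, so the log-likelihoods add up."""
    return sum(math.log(site_likelihood(root, i)) for i in range(n_sites))

# Three made-up sequences on the tree ((s1, s2), s3) with made-up branch lengths.
tree = [([("ACGT", 0.1), ("ACGA", 0.1)], 0.05), ("ACGT", 0.15)]
total = log_likelihood(tree, 4)
```

Each subtree is visited once per site, so for a fixed tree and fixed parameters the work grows linearly with the number of leaves and the number of sites.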

As the assumptions are essential, I present them here:

  1. Each site in the sequence evolves independently.
  2. Different lineages evolve independently.
  3. Each site undergoes substitution at an expected rate which is chosen from a series of rates with a given distribution.
Fancier versions of the procedure allow different sites to have different evolutionary rates.
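One common way to read the third assumption is that the likelihood of a site is averaged over a discrete set of rate categories. The sketch below illustrates this for the simplest possible case, two leaves separated by a total branch length `t` under the Jukes-Cantor model. The model, the rate values, the weights and the function names are all assumptions made for illustration.

```python
import math

def jc_pair_likelihood(x, y, t):
    """Likelihood of observing states x and y at two leaves separated by a
    total branch length t, under Jukes-Cantor with a uniform root state."""
    e = math.exp(-4.0 * t / 3.0)
    transition = 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e
    return 0.25 * transition

def mixed_rate_likelihood(x, y, t, rates, weights):
    """Average the site likelihood over rate categories.

    A site evolving at rate r sees the branch as having length r * t;
    `weights` are the probabilities of the categories and must sum to 1.
    """
    return sum(w * jc_pair_likelihood(x, y, r * t)
               for r, w in zip(rates, weights))

# Hypothetical categories: slow, medium and fast sites, equally likely.
rates = [0.5, 1.0, 1.5]
weights = [1.0 / 3.0] * 3
value = mixed_rate_likelihood("A", "G", 0.4, rates, weights)
```

With a single category of rate 1 the function reduces to the plain site likelihood, so the simpler model is a special case.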

Many biologists won't use maximum likelihood because of the computational expense. Although the likelihood of a single tree can be computed recursively, the number of possible trees grows super-exponentially with the number of taxa, and finding the maximum likelihood tree is NP-hard. This is a surprising exception to the usual rule that parametric methods have the advantage of lower computational cost. Others don't use the MLE because there seems to be little evidence that the assumptions actually hold in real biological applications.
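To give a sense of the computational expense of searching over trees, the number of unrooted binary topologies on n labeled leaves is (2n-5)!!, the product of the odd numbers up to 2n-5. A short sketch, with a hypothetical function name:

```python
def num_unrooted_trees(n):
    """Number of unrooted binary tree topologies on n labeled leaves,
    (2n - 5)!! for n >= 3; there is a single topology for fewer leaves."""
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count

for n in (4, 5, 10, 20):
    print(n, num_unrooted_trees(n))
```

Ten leaves already give 2,027,025 topologies, which is why an exhaustive comparison of likelihoods is feasible only for very small data sets.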



Prof. Susan Holmes
Tue Mar 24 14:40:11 EST 1998