
Molecular phylogenetics 3

Level 3 Molecular Evolution and Bioinformatics. Jim Provan. Reading: Page and Holmes, Sections 6.5-6.



  1. Molecular phylogenetics 3
     Level 3 Molecular Evolution and Bioinformatics
     Jim Provan
     Page and Holmes: Sections 6.5-6

  2. Maximum likelihood
     • The principle of likelihood suggests that the explanation that makes the observed outcome most probable is preferred
     • More formally: $L_D = \Pr(D \mid H)$
     • In a phylogenetic context:
       • D is the set of sequences being compared
       • H is a phylogenetic tree
     • The tree that makes the data the most probable evolutionary outcome is the maximum likelihood estimate of the phylogeny

  3. Models, data and hypotheses
     • Maximum likelihood requires three elements:
       • A model of sequence evolution
       • A tree
       • A data set
     • ML methods of tree building must solve two problems:
       • For a given tree topology, what set of branch lengths makes the observed data most likely?
       • Which tree has the greatest likelihood?

  4. Models, data and hypotheses
     • Suppose we have two sequences, 1 and 2, separated by an average of d substitutions per site: $d = \mu t$
     • Given a model of substitution for each site we can compute the probability $P_{ij}(d)$ that two sequences separated by d would have nucleotides i and j:
       • For example, if sequence 1 had nucleotide A then $P_{AG}(d)$ is the probability that sequence 2 has a G in the corresponding position
     • The log likelihood of obtaining the observed sequences is the sum of the log likelihoods of each of the k individual sites: $\ln L = \sum_{i=1}^{k} \ln L_i$
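The site-by-site sum of log likelihoods can be sketched for the simplest case, two sequences under the Jukes-Cantor model. This is only an illustration under stated assumptions: the function names are hypothetical, the base frequencies are taken to be equal (1/4 each), the grid search is a stand-in for a proper optimiser, and the two sequences are the ten Human and Chimp sites from the spectral analysis example, which are far too few for a real estimate.

```python
import math

def jc69_prob(i, j, d):
    """Jukes-Cantor probability P_ij(d) for nucleotides i and j at distance d."""
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay

def log_likelihood(seq1, seq2, d):
    """ln L as the sum over sites of ln L_i, with L_i = (1/4) * P_ij(d)."""
    return sum(math.log(0.25 * jc69_prob(a, b, d)) for a, b in zip(seq1, seq2))

def ml_distance(seq1, seq2, step=0.001, d_max=3.0):
    """Crude grid search for the d that maximises ln L (assumed search range)."""
    grid = [step * n for n in range(1, int(d_max / step) + 1)]
    return max(grid, key=lambda d: log_likelihood(seq1, seq2, d))

human = "GTCATCATCC"
chimp = "ATTACCATTC"

d_hat = ml_distance(human, chimp)

# Closed-form Jukes-Cantor distance from the proportion of differing sites,
# used here only as a cross-check on the grid search.
p = sum(a != b for a, b in zip(human, chimp)) / len(human)
d_closed = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

For this model the grid-search estimate should agree with the closed-form distance to within the grid step, which is a convenient sanity check before moving to models that have no closed form.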

  5. Models, data and hypotheses
     • What model?
       • Transition/transversion ratio
       • Base composition
       • Variation in rate across sites
     • In all but the simplest models (e.g. Jukes-Cantor), differences in transition/transversion rates can be taken into account
     • Keeping other parameters constant, it is possible to calculate ML estimates of individual parameters
     [Figure: plot with vertical axis from -2760 to -2620 and horizontal axis from 1 to 10]

  6. Likelihood ratio tests
     • We can test alternative hypotheses concerning the same data using a likelihood ratio test:
       • The likelihood ratio statistic ($\Delta$) is the ratio of the likelihood of the alternative hypothesis ($H_1$) to that of the null hypothesis ($H_0$)
       • Because likelihoods are often very small, it is more convenient to use log likelihoods: $\Delta = \log L_1 - \log L_0$, where:
         • $L_1$ is the maximum likelihood of the alternative hypothesis $H_1$
         • $L_0$ is the maximum likelihood of the null hypothesis $H_0$
     • Can be used to test various hypotheses, such as whether a particular model of evolution is valid, whether a molecular clock adequately describes the data, or whether one phylogenetic hypothesis is better than another

  7. Testing models
     • A model can be tested to measure how well it fits the observed data by comparing the likelihood that a tree and a model confer on the data ($L_{tree}$) with the theoretical best ($L_{max}$)
     • A likelihood ratio test can be performed to test the adequacy of the HKY85 model to describe the hominid mtDNA data set
     [Figure: distribution of log $L_{max}$ - log $L_{tree}$ (axis from 100 to 180), with the observed value 170.70 marked]

  8. Testing rate variation
     • If sequences are evolving at different rates, then an ultrametric tree will give a poor representation of relationships between taxa: $2\Delta = 2(\log L_{\text{no clock}} - \log L_{\text{clock}})$
     [Figure: two trees for Gibbon, Orang-utan, Gorilla, Chimp and Human. Clock: log L = -2660.61. No clock: log L = -2659.18]
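The clock comparison can be worked through with the two log likelihoods given on the slide. The sketch below rests on assumptions that are not stated in the slides: that twice the log likelihood difference is compared with a chi-square distribution, that the degrees of freedom for a clock test are the number of taxa minus two, and that 7.815 is the 5% critical value for three degrees of freedom. The function name is hypothetical.

```python
def likelihood_ratio_statistic(log_l_alt, log_l_null):
    """Return 2 * (log L1 - log L0) for nested hypotheses."""
    return 2.0 * (log_l_alt - log_l_null)

# Values from the slide: the clock tree is the constrained (null) hypothesis.
LOG_L_CLOCK = -2660.61
LOG_L_NO_CLOCK = -2659.18

# Assumption: a clock tree has n - 2 fewer free branch parameters
# than an unconstrained tree on n taxa.
N_TAXA = 5
degrees_of_freedom = N_TAXA - 2

# Assumption: 5% critical value of the chi-square distribution with 3 df.
CHI2_CRITICAL_5PCT_DF3 = 7.815

statistic = likelihood_ratio_statistic(LOG_L_NO_CLOCK, LOG_L_CLOCK)
reject_clock = statistic > CHI2_CRITICAL_5PCT_DF3

print(f"2*Delta = {statistic:.2f}, df = {degrees_of_freedom}, reject clock: {reject_clock}")
```

Under these assumptions the statistic is about 2.86, well below the critical value, so the clock would not be rejected for this data set.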

  9. Comparing phylogenetic hypotheses
     • If two trees are not significantly different then the sum of these likelihood differences:
       $\Delta = \sum_{i=1}^{k} \left( \log L(i, \text{tree 1}) - \log L(i, \text{tree 2}) \right) = \log L_{\text{tree 1}} - \log L_{\text{tree 2}}$
       will not be significantly different from zero
     [Figure: three alternative trees for Human, Chimp, Gorilla, Orang-utan and Gibbon, with log L = -2659.18, log L = -2663.94 and log L = -2701.36]
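A minimal sketch of how the summed per-site differences might be judged against zero. The slides do not say how significance is assessed; the standard error below follows the usual paired-sites approach (in the style of the Kishino-Hasegawa test) and is an assumption, as are the function name and the per-site values, which are invented purely for illustration. Only the three total log likelihoods come from the slide.

```python
import math

def compare_trees(site_lnl_tree1, site_lnl_tree2):
    """Return (delta, standard_error) for the summed per-site differences."""
    diffs = [a - b for a, b in zip(site_lnl_tree1, site_lnl_tree2)]
    k = len(diffs)
    delta = sum(diffs)
    mean = delta / k
    # Variance of the sum, estimated from the spread of per-site differences.
    variance_of_sum = k / (k - 1) * sum((x - mean) ** 2 for x in diffs)
    return delta, math.sqrt(variance_of_sum)

# Hypothetical per-site log likelihoods (NOT from the hominoid data set).
tree1_sites = [-2.1, -3.4, -1.9, -2.8, -3.0, -2.2]
tree2_sites = [-2.3, -3.3, -2.0, -3.1, -2.9, -2.4]

delta, se = compare_trees(tree1_sites, tree2_sites)
significant = abs(delta) > 1.96 * se  # assumed normal approximation

# Totals from the slide: differences between the best tree and the other two.
delta_tree1_vs_tree2 = -2659.18 - (-2663.94)
delta_tree1_vs_tree3 = -2659.18 - (-2701.36)
```

The totals alone give differences of about 4.76 and 42.18 log likelihood units; deciding whether either is significant needs the per-site values, which the slides do not provide.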

  10. Objections to likelihood
     • Requires an explicit model of evolution:
       • This is a strength, since it makes us aware of the assumptions being made
       • However, dependence on a model raises the question of which model to use
     • Computationally expensive:
       • Finding the best combination of model and tree is technically difficult
       • Computing the likelihood is also time consuming, and there may be more than one likelihood maximum for a given tree
     • It has been suggested that likelihood is better suited to testing models than to use as an all-purpose phylogenetic tool

  11. Splits
     • In the above example, the split {{Gorilla, Orang-utan, Gibbon}, {Human, Chimp}} can be written as 00011 in binary notation, or 3 in decimal notation
     • One advantage is that we can refer to any split by a single number

  12. Spectral analysis
     • Provides a means of visualising support for each split:
       • In simple terms, consists of plotting the frequencies of each split in the data set
       • Straightforward if there are two states for each character

     Taxon        Nucleotides            Binary coding
     Human        G T C A T C A T C C    1  1  0  1  1  0  1  1  0  1
     Chimp        A T T A C C A T T C    0  1  1  1  0  0  1  1  1  1
     Gorilla      G T T G T T A T T A    1  1  1  0  1  1  1  1  1  0
     Orang-utan   A C C A C T C C C A    0  0  0  1  0  1  0  0  0  0
     Gibbon       A C C G C C C C C A    0  0  0  0  0  0  0  0  0  0
     Split                               5  7  6 11  5 12  7  7  6  3
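The binary coding and split numbers in the table can be reproduced with a short script. The coding rule used here is inferred from the table rather than stated in the slides: each site is scored 1 where a taxon differs from Gibbon and 0 where it matches, and the taxa carry the weights Human = 1, Chimp = 2, Gorilla = 4, Orang-utan = 8. The function name is hypothetical.

```python
from collections import Counter

# Taxa in weight order (1, 2, 4, 8); the last taxon is the all-zero reference.
ALIGNMENT = [
    ("Human", "GTCATCATCC"),
    ("Chimp", "ATTACCATTC"),
    ("Gorilla", "GTTGTTATTA"),
    ("Orang-utan", "ACCACTCCCA"),
    ("Gibbon", "ACCGCCCCCA"),
]

def site_split_numbers(alignment):
    """Return the split number for every site of a two-state alignment."""
    sequences = [seq for _, seq in alignment]
    reference = sequences[-1]
    numbers = []
    for site in range(len(reference)):
        column = [seq[site] for seq in sequences]
        if len(set(column)) > 2:
            raise ValueError(f"site {site + 1} has more than two states")
        # Set bit n when taxon n differs from the reference taxon.
        number = sum(1 << bit
                     for bit, state in enumerate(column[:-1])
                     if state != reference[site])
        numbers.append(number)
    return numbers

split_numbers = site_split_numbers(ALIGNMENT)
spectrum = Counter(split_numbers)  # how often each split occurs in these ten sites

print(split_numbers)
print(spectrum.most_common())
```

With only ten sites the counts are just an illustration of the bookkeeping; a real spectrum would be built from the full data set.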

  13. Spectral analysis
     [Figure: spectrum of support for each split; vertical axis from 0.00 to 0.05, horizontal axis labelled with splits of H, C, Go, O and Gi]

  14. Spectral analysis
     • Since not all splits can coexist in the same tree, some method is needed to decide which splits to use to construct the tree:
       • Five “trivial” splits will be in every tree
       • One possible solution is to choose the two mutually compatible, non-trivial splits which have the best support:
         • In this case, the best non-trivial split is {Orang-utan, Gibbon}
         • The next best supported split is {Human, Chimp}, which is compatible with this split
         • This gives the basic topology {{Human, Chimp}, Gorilla, {Orang-utan, Gibbon}}
     • Problems with spectral analysis:
       • Computationally expensive (half a million splits for 20 sequences)
       • Potential for more than two character states
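Compatibility of two splits, and the growth in the number of splits, can both be checked with a few lines of bit arithmetic. This sketch assumes the usual definition (two splits are compatible when at least one of the four pairwise intersections of their sides is empty) and the split numbering Human = 1, Chimp = 2, Gorilla = 4, Orang-utan = 8, Gibbon = 16; neither is spelled out in the slides, and the function names are hypothetical.

```python
def split_sides(split, n_taxa):
    """Return both sides of a split as bit masks over n_taxa taxa."""
    all_taxa = (1 << n_taxa) - 1
    return split, all_taxa ^ split

def compatible(split_a, split_b, n_taxa):
    """True if at least one side of split_a is disjoint from a side of split_b."""
    sides_a = split_sides(split_a, n_taxa)
    sides_b = split_sides(split_b, n_taxa)
    return any((x & y) == 0 for x in sides_a for y in sides_b)

def number_of_splits(n_taxa):
    """Ways to divide n_taxa taxa into two non-empty groups."""
    return 2 ** (n_taxa - 1) - 1

N_TAXA = 5
ORANG_GIBBON = 0b00111   # {Human, Chimp, Gorilla} | {Orang-utan, Gibbon}
HUMAN_CHIMP = 0b00011    # {Human, Chimp} | {Gorilla, Orang-utan, Gibbon}
HUMAN_GORILLA = 0b00101  # {Human, Gorilla} | {Chimp, Orang-utan, Gibbon}

print(compatible(ORANG_GIBBON, HUMAN_CHIMP, N_TAXA))
print(compatible(HUMAN_CHIMP, HUMAN_GORILLA, N_TAXA))
print(number_of_splits(20))
```

The count for 20 sequences comes out just over half a million, which matches the figure quoted on the slide.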

  15. Split decomposition

     Site         1 2 3 4 5 6 7 8 9
     Human        T C C T T A A A A
     Chimp        T T C T A T A A A
     Gorilla      T T A C A A T A A
     Orang-utan   C C A C A A A T A
     Gibbon       C C A C A A A A T

     [Figure: split graphs on the taxa H, C, G, O and B, with edges labelled 1 and 2]

  16. Split decomposition
     [Figure: split graphs on the taxa H, C, G, O and B, with numbered edges]
