BINF6201/8201 Molecular phylogenetic methods 4 11-10-2011

Maximum likelihood methods
• So far we have considered only a single site (one column configuration). If all sites evolve independently, the likelihood for all sites is the product of the likelihoods for each site.
• Suppose there are s homologous sequences, each with N nucleotides. Let $D_n$ be the n-th column of the multiple alignment.
• For a tree T, let $L_n(\theta_1, \theta_2, \ldots, \theta_m) = P(D_n \mid T, \theta_1, \theta_2, \ldots, \theta_m)$ be the likelihood of tree T for the n-th site, where $\theta_1, \theta_2, \ldots, \theta_m$ are the unknown parameters, such as the branch lengths. The single-site likelihood from the previous case is one example of such a function.

Maximum likelihood methods
• For simplicity, assume the sequences are homogeneous, i.e., all sites evolve at the same rate. The likelihood function of the entire alignment for the tree T is then
  $L(\theta_1, \ldots, \theta_m) = \prod_{n=1}^{N} L_n(\theta_1, \ldots, \theta_m)$,
  where $L_n$ is the likelihood of T for the n-th site.
• Here we treat L as a function of the parameters. Given the topology of the tree T, we search for the values of $\theta_1, \theta_2, \ldots, \theta_m$ that maximize L; this maximum is called the ML value of the tree T.
• Finding the ML value can be a slow process.
• We do this for all possible tree topologies and identify the one with the largest ML value as the inferred phylogenetic tree of the s sequences.
• Clearly, different substitution models may result in different trees.
• When the number of OTUs is large, a heuristic tree search algorithm should be used to decide which alternative trees to evaluate.
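The product over sites and the search for the maximizing parameter can be illustrated in the simplest possible setting: two aligned sequences joined by a single branch under the J-C model. This is a minimal sketch under assumptions, not the lecture's method; the function names, the toy sequences, and the ternary search are all hypothetical choices, and a real ML program optimizes many parameters on a full tree.

```python
import math

def jc69_prob(x, y, d):
    """J-C probability that base x becomes base y over a branch of length d
    (d = expected substitutions per site)."""
    e = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def log_likelihood(seq_a, seq_b, d):
    """Sites are assumed independent, so the product of site likelihoods
    becomes a sum of their logarithms (0.25 is the base frequency)."""
    return sum(math.log(0.25 * jc69_prob(x, y, d)) for x, y in zip(seq_a, seq_b))

def ml_branch_length(seq_a, seq_b, lo=1e-6, hi=5.0, iters=200):
    """Ternary search for the branch length that maximizes the log-likelihood.
    Assumes the sequences differ at fewer than 3/4 of the sites."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_likelihood(seq_a, seq_b, m1) < log_likelihood(seq_a, seq_b, m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0

def jc69_distance(p):
    """Closed-form J-C distance for a proportion p of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

if __name__ == "__main__":
    a, b = "ACGTACGTAC", "ACGTACGTTT"   # toy data: 2 of 10 sites differ
    print(ml_branch_length(a, b), jc69_distance(0.2))
```

For this one-parameter case the numerical optimum should agree with the closed-form J-C distance, which is a convenient sanity check before moving to trees where no closed form exists.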

Heuristic tree search using predefined clusters
• Although the tree space can be very large, the majority of trees have extremely low likelihood values for a given set of OTUs.
• We can therefore safely ignore these unpromising trees and focus on the promising ones.
• To reduce the search space, we can supply predefined clusters as input when the relationships within them are already known.
• The problem then becomes examining the 105 possible trees generated by connecting these predefined groups, instead of an astronomically large number of unrooted trees.

Heuristic tree search using predefined clusters
• The ML value is computed for each tree, and the tree with the largest ML value is returned as the inferred tree.
• Because this algorithm examines all possible trees connecting the predefined groups, the global optimum is guaranteed if the predefined groups are correct.
• When the simple J-C model is used and a homogeneous substitution rate is assumed, the resulting ML tree is similar to the NJ and parsimony trees, with the same problem of misplacing the tree shrews inside the primate group.

Maximum likelihood trees for primates
• However, when the more sophisticated HKY substitution model is used, together with six gamma-distributed rate categories and invariant sites, the tree constructed by the ML method places the tree shrews outside the primate group.
• Nevertheless, there are three trifurcations on this tree. At a trifurcation point, any of the three clusters can be the outgroup of the other two, and the three resulting trees have the same ML value.

Comparison of parsimony and maximum likelihood methods
• Parsimony methods make only one assumption: that all changes on the branches are equally likely. This assumption may not hold.
• Because parsimony methods make so few assumptions, their proponents believe that they can be applied to any sequence data.
• Parsimony methods are also relatively fast, so they can be applied to larger data sets.
• ML methods make explicit assumptions about the evolutionary model.
• ML methods must optimize all of the model parameters to find the ML value, so they are computationally intensive and very slow.
• When the evolutionary model is properly selected, ML methods tend to achieve better results than parsimony methods.

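Heuristic tree search using quartet puzzling
• The quartet puzzling algorithm is a very fast heuristic algorithm for exploring the promising trees.
Step 1: Compute the ML values of the three possible trees for every set of four sequences, and record the best ML tree for each quartet.
[Figure: six sequences, 1 to 6. For each possible set of four sequences, e.g. 1, 2, 3, 4, the three quartet trees (1,2 | 3,4), (1,3 | 2,4) and (1,4 | 2,3) are compared and the best ML tree is kept.]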
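Step 1 can be sketched as a loop over all four-sequence subsets, each with its three possible topologies. The sketch below is an assumption-laden illustration: `toy_score` is a hypothetical stand-in for the ML computation (it simply prefers pairs with close numeric labels), and the function names are not from any phylogenetics package.

```python
from itertools import combinations

def quartet_topologies(a, b, c, d):
    """The three unrooted trees for four taxa, written as (pair, pair)."""
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

def toy_score(topology):
    """Hypothetical stand-in for an ML value: higher is better.
    A real implementation would optimize a likelihood here."""
    (a, b), (c, d) = topology
    return -(abs(a - b) + abs(c - d))

def best_quartets(taxa, score):
    """Map every 4-subset of taxa to its best-scoring topology."""
    best = {}
    for quartet in combinations(taxa, 4):
        best[quartet] = max(quartet_topologies(*quartet), key=score)
    return best

if __name__ == "__main__":
    table = best_quartets([1, 2, 3, 4, 5, 6], toy_score)
    print(len(table), "quartets evaluated")
    print(table[(1, 2, 3, 4)])
```

With six sequences there are 15 quartets and therefore 45 small likelihood computations, which is why this step stays cheap compared with scoring full trees.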

Heuristic tree search using quartet puzzling
Step 2: Randomly pick four sequences and place them in the tree according to their best ML quartet tree.
[Figure: starting quartet tree of sequences 1, 2, 3 and 4.]
Step 3: Randomly pick one of the remaining sequences and add it to the tree at the position where the growing tree agrees with the maximum number of best ML quartet trees. Repeat this process until all sequences have been added to the tree.
For example, suppose sequence 5 is picked, and one or both of the quartet trees shown are the best ML quartet trees involving sequences 1, 2, 3, 4 and 5. Sequence 5 is then attached to the tree accordingly.
[Figure: two quartet trees containing sequence 5, and the resulting five-sequence tree.]

Heuristic tree search using quartet puzzling
• The last sequence, 6, is then added to the tree. If the quartet tree shown has the best ML value among all quartet trees containing sequence 6, sequence 6 is attached to the tree accordingly.
[Figure: quartet tree of sequences 6, 3, 1 and 4, and the resulting tree after sequence 6 is added.]
• The whole process is repeated many times with the sequences selected in different orders, because the resulting tree depends on the order of selection.
• The tree that occurs most frequently is chosen as the inferred tree.

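Bayesian phylogenetic methods
• Bayes' theorem: if A and B are two events, then
  $P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)}$.
• If $T_1, T_2, \ldots, T_n$ are events that partition the sample space, and D is an event from the sample space, then
  $P(T_i \mid D) = \dfrac{P(D \mid T_i)\,P(T_i)}{\sum_{j=1}^{n} P(D \mid T_j)\,P(T_j)}$.
[Figure: sample space partitioned into events T1 to T12, with event D drawn inside it.]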

Bayesian phylogenetic methods
[Figure: tree space partitioned into tree1, tree2, …, treen.]
• For N OTUs there are n = (2N-5)!! possible unrooted trees, which form a partition of the tree space. Let D be the alignment of the N OTUs; we do not know which tree is most likely to account for D.
• In the ML method, we compute the probability (likelihood) that D is generated by each tree: $L(\mathrm{tree}_i) = P(D \mid \mathrm{tree}_i)$.
• We find the maximum likelihood, $\mathrm{ML} = \max P(D \mid \mathrm{tree}_i)$, by varying the parameters (branch lengths or substitution rates) on each tree i, and return the tree that has the largest ML.
• In Bayesian methods, we instead compute the probability of each tree given the observed alignment of the N OTUs, $P(\mathrm{tree}_i \mid D)$, which is called the posterior probability.
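The growth of (2N-5)!! is easy to underestimate, so a short calculation may help. The sketch below just multiplies the odd numbers 3, 5, …, 2N-5; the function name is a hypothetical choice.

```python
def num_unrooted_trees(n_otus):
    """Number of unrooted bifurcating trees for n_otus taxa: (2N-5)!!."""
    if n_otus < 3:
        raise ValueError("need at least 3 OTUs for an unrooted tree")
    count = 1
    # Multiply the odd factors 3, 5, ..., 2N-5 (empty product for N = 3).
    for k in range(3, 2 * n_otus - 4, 2):
        count *= k
    return count

if __name__ == "__main__":
    for n in (4, 6, 10, 20):
        print(n, num_unrooted_trees(n))
```

Six taxa already give 105 trees and ten give 2,027,025, which is why exhaustive evaluation is only feasible for very small N or for a handful of predefined groups.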

Bayesian phylogenetic methods
• Using Bayes' theorem, we have
  $P(\mathrm{tree}_i \mid D) = \dfrac{P(D \mid \mathrm{tree}_i)\,P(\mathrm{tree}_i)}{\sum_{j=1}^{n} P(D \mid \mathrm{tree}_j)\,P(\mathrm{tree}_j)}$,
  where $P(\mathrm{tree}_i)$ is called the prior probability.
• Calculating the denominator can be difficult, because we have to enumerate all possible trees together with their branch lengths and substitution rates.
• However, the denominator is the same constant for all trees, so the posterior probability of each tree is proportional to the likelihood of the tree multiplied by its prior probability.
• If we can generate a large sample of trees in which the frequency of each tree is proportional to its likelihood multiplied by its prior probability, then the posterior probability is easily estimated as the fraction of the sample made up of that tree:
  $P(\mathrm{tree}_i \mid D) \approx \dfrac{\text{number of times } \mathrm{tree}_i \text{ occurs in the sample}}{\text{total number of trees in the sample}}$.
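When the number of candidate trees is tiny, the denominator can be summed directly, which makes the role of the prior and the normalization concrete. The likelihood values below are made-up toy numbers and the function name is hypothetical; this is a sketch of the arithmetic only.

```python
def posterior_probabilities(likelihoods, priors=None):
    """Posterior of each tree: likelihood * prior, divided by the sum over all trees.
    With priors=None a uniform prior is assumed."""
    n = len(likelihoods)
    if priors is None:
        priors = [1.0 / n] * n
    weights = [lik * pri for lik, pri in zip(likelihoods, priors)]
    total = sum(weights)          # the denominator in Bayes' theorem
    return [w / total for w in weights]

if __name__ == "__main__":
    toy_likelihoods = [0.002, 0.006, 0.002]   # invented values for three trees
    print(posterior_probabilities(toy_likelihoods))
    # A non-uniform prior shifts the posterior toward the favored tree.
    print(posterior_probabilities(toy_likelihoods, priors=[0.8, 0.1, 0.1]))
```

For realistic numbers of trees this direct sum is not available, which is the motivation for estimating the same quantities from a sample.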

The Markov chain Monte Carlo method for sampling
• Markov chain Monte Carlo (MCMC) is a method for generating a sample from the entire sample space such that the frequency of each tree in the sample is proportional to its likelihood of generating the observed data.
• If we have no preference for any tree before seeing the data, we can use a non-informative uniform prior probability, and therefore
  $P(\mathrm{tree}_i \mid D) \propto P(D \mid \mathrm{tree}_i) = L(\mathrm{tree}_i)$.
• The MCMC method begins with a trial tree T1 and computes its likelihood, L1. A move is then made that changes the tree by a small amount in any of the following:
1. Branch length;
2. Rate of substitution;
3. Topology, by a nearest-neighbor interchange tree move.

The Markov chain Monte Carlo method for sampling
• The likelihood L2 of the new tree T2 is computed; it is usually slightly different from L1.
If L2 > L1, then T2 is accepted and becomes an element of the sample.
If L2 < L1, then T2 is accepted with probability L2 / L1; otherwise T1 is kept and counted again in the sample.
This selection rule is called the Metropolis algorithm.
• The MCMC method therefore favors uphill moves but also allows downhill moves with a certain probability.
• As a result, the equilibrium probabilities of observing the different trees in the sample are proportional to the likelihoods of the trees.
• To see this, suppose that we have only two trees, so that MCMC moves back and forth between them with transition probabilities r12 (from T1 to T2) and r21 (from T2 to T1).
[Figure: T1 and T2 connected by arrows labeled r12 and r21.]
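The acceptance rule can be tried on a toy problem in which the "trees" are just indices with invented likelihoods. This is a sketch under assumptions: real programs propose branch-length, rate, or topology moves, whereas here the proposal is simply a uniformly chosen other state (a symmetric proposal), and all names are hypothetical.

```python
import random
from collections import Counter

def metropolis_sample(likelihoods, n_steps, seed=0):
    """Run a Metropolis chain over tree indices and return sample frequencies."""
    rng = random.Random(seed)
    n = len(likelihoods)
    current = rng.randrange(n)
    counts = Counter()
    for _ in range(n_steps):
        # Symmetric proposal: pick one of the other states uniformly.
        proposal = rng.randrange(n - 1)
        if proposal >= current:
            proposal += 1
        ratio = likelihoods[proposal] / likelihoods[current]
        # Uphill moves are always accepted; downhill with probability L2/L1.
        if ratio >= 1.0 or rng.random() < ratio:
            current = proposal
        counts[current] += 1      # a rejected move records the old state again
    return [counts[i] / n_steps for i in range(n)]

if __name__ == "__main__":
    toy_likelihoods = [1.0, 2.0, 7.0]          # invented values
    freqs = metropolis_sample(toy_likelihoods, 200_000)
    target = [l / sum(toy_likelihoods) for l in toy_likelihoods]
    print("sampled:", freqs)
    print("target: ", target)
```

With enough steps the sampled frequencies should approach the normalized likelihoods, even though the chain never computes the normalizing sum.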

The Markov chain Monte Carlo method for sampling
• Let p1 and p2 be the equilibrium probabilities of the two trees in the sample. At equilibrium, the probabilities of observing these trees during the sampling process must be constant, so the flow from T1 to T2 balances the flow from T2 to T1:
  $p_1 r_{12} = p_2 r_{21}$.
• This property is called detailed balance. For the trees in the sample to be proportional to their likelihoods, we need to set
  $\dfrac{p_1}{p_2} = \dfrac{L_1}{L_2}$.
  Therefore, we have
  $\dfrac{r_{21}}{r_{12}} = \dfrac{p_1}{p_2} = \dfrac{L_1}{L_2}$.
• This means that to generate the desired sample, we should set the ratio of the transition probabilities equal to the ratio of the likelihoods.
• The MCMC algorithm does exactly this:
if L2 > L1, we set r12 = 1 and r21 = L1/L2; therefore, r21/r12 = L1/L2.
if L2 < L1, we set r12 = L2/L1 and r21 = 1; therefore, r21/r12 = L1/L2.
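The two-tree argument can be checked with exact arithmetic. The sketch below uses hypothetical function names and invented likelihoods; it computes the Metropolis transition probabilities, solves the balance condition together with p1 + p2 = 1, and confirms that the equilibrium ratio equals the likelihood ratio.

```python
from fractions import Fraction

def metropolis_rates(l1, l2):
    """Transition probabilities r12 (T1 -> T2) and r21 (T2 -> T1)."""
    l1, l2 = Fraction(l1), Fraction(l2)
    r12 = min(Fraction(1), l2 / l1)
    r21 = min(Fraction(1), l1 / l2)
    return r12, r21

def equilibrium(r12, r21):
    """Solve p1 * r12 = p2 * r21 with p1 + p2 = 1."""
    total = r12 + r21
    return r21 / total, r12 / total

if __name__ == "__main__":
    l1, l2 = 1, 3                     # invented likelihoods
    r12, r21 = metropolis_rates(l1, l2)
    p1, p2 = equilibrium(r12, r21)
    print("r12, r21:", r12, r21)
    print("p1, p2:  ", p1, p2)
    print("p1/p2 equals L1/L2:", p1 / p2 == Fraction(l1, l2))
```

Swapping the two likelihoods exercises the other branch of the rule and gives the mirrored result.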

The top four trees for the Platyrrhini group by MCMC
• To compute the likelihoods, the HKY substitution model plus six gamma-distributed rate categories and invariant sites is used.
• Most parts of the tree are well defined, except for the following groups.
[Figure: the top four trees. Annotations: "The position of Capuchin varies"; "The same as in the tree constructed by NJ and parsimony methods".]

The top seven trees for principal groups by MCMC
• The uncertainty in these trees indicates that more sequences are needed to resolve the problem.
[Figure: the top seven trees. Annotations: "The same as by NJ and parsimony"; "The position of Capuchin varies".]

Popular phylogenetic tree construction programs
PHYLIP
• Developed by Joseph Felsenstein;
• Implements most known distance methods, such as UPGMA and NJ, as well as maximum parsimony and ML methods;
• The most recent release is version 3.69, which contains more than 50 programs;
• Command line interface;
• The package can be freely downloaded at http://evolution.genetics.washington.edu/phylip.html
PAUP (Phylogenetic Analysis Using Parsimony)
• Written by David Swofford;
• Includes parsimony, distance matrix, invariants, and maximum likelihood methods, and many indices and statistical tests;
• Described at http://paup.csit.fsu.edu/
• Unfortunately, it is now commercialized by Sinauer Associates, selling for $85-150/package.

Popular phylogenetic tree construction programs
MEGA (Molecular Evolutionary Genetics Analysis)
• Developed by Sudhir Kumar and colleagues;
• Contains parsimony, distance and likelihood methods for molecular data (nucleic acid sequences and protein sequences);
• Can do bootstrapping, consensus trees, and a variety of data editing tasks;
• Has a sequence alignment function using an implementation of ClustalW;
• A GUI-based program;
• Contains tree display functions.
TREE-PUZZLE
• Written by Korbinian Strimmer;
• A program for maximum likelihood analysis of nucleotide and amino acid alignments;
• Infers phylogenies by quartet puzzling;

Popular phylogenetic tree construction programs
TREE-PUZZLE (continued)
• Supports all popular models of sequence evolution for nucleotides and proteins, and can take rate heterogeneity among sites into account;
• Compatible with PHYLIP files;
• The current version also has features for parallel computation using the MPI message-passing interface, if this is available;
• Freely available at http://www.tree-puzzle.de/.
MrBayes
• A program for the Bayesian estimation of phylogenetic trees;
• Able to analyze nucleotide, amino acid, restriction site, and morphological data;
• Freely available at http://mrbayes.csit.fsu.edu/
TreeView
• A program for visualizing and printing trees;
• Free at http://taxonomy.zoology.gla.ac.uk/rod/treeview.html
