Phylogenetic analysis

Phylogenetic analysis • Selecting sequences • Outgroup sequences • Alignment • Choice of method • Example using one method

Three most important choices • Which sequences to include • Outgroup sequences • Alignment

“Outgroup” sequences should be included • The best outgroup sequences are clearly outside the group being studied, but not too far outside it. • Multiple outgroup sequences should be chosen. • The outgroup sequences are included in the data matrix just like the other sequences. • They will be used to root the tree.

Methods of phylogenetic analysis • Parsimony (Cladistics) • Maximum likelihood • Bayesian • Genetic distance (Neighbor-joining, etc.)

Parsimony (Cladistics) • Willi Hennig. 1950. Grundzüge einer Theorie der phylogenetischen Systematik. • 1966. Phylogenetic systematics. • Evidence comes from characters • Goal: build most parsimonious tree

Finding the most parsimonious tree • Goal: fewest evolutionary steps (optimality criterion) • Fewest a.a. changes • Fewest base changes • Many tree topologies are tested, choosing the best. • Unrooted • Rooting the tree comes later.
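
The step count that parsimony minimizes can be sketched with Fitch's small-parsimony algorithm. This is a minimal illustration, not how PAUP* or any particular program is implemented. The toy alignment, the taxon names A to D, and the function names are assumptions made up for this example. The trees are written as rooted nested tuples only for convenience; the Fitch score does not depend on where the root is placed.

```python
def fitch_score(tree, states):
    """Minimum number of changes for one site on a binary tree.

    tree   : nested 2-tuples of tip names, e.g. (("A", "B"), ("C", "D"))
    states : dict mapping tip name -> character state at this site
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):          # a tip: its state set is known
            return {states[node]}
        left, right = (visit(child) for child in node)
        shared = left & right
        if shared:                         # children agree: no step needed
            return shared
        changes += 1                       # children disagree: one step
        return left | right

    visit(tree)
    return changes


def parsimony_length(tree, alignment):
    """Sum the Fitch score over every column of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_score(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )


# Hypothetical toy alignment (4 taxa, 4 sites), invented for illustration.
alignment = {"A": "AAGT", "B": "AAGT", "C": "CAGC", "D": "CAGC"}

# The three possible topologies for four taxa.
candidate_trees = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "D"), ("B", "C")),
]

lengths = [parsimony_length(t, alignment) for t in candidate_trees]
best_tree = candidate_trees[lengths.index(min(lengths))]
```

With only four taxa every topology can be scored; for realistic data sets the number of topologies grows too quickly, so programs search heuristically.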

Rooting the tree • The outgroup taxa are included in the data matrix just like the other taxa. • Once the best tree is found, it is “rooted” along the branch connecting the outgroup and ingroup taxa.

What to do in case of a tie: consensus • A “strict” consensus tree is one in which the branches not present on all trees are collapsed, resulting in polytomies. • A “50% majority rule” consensus tree is one in which the branches not present on more than 50% of the trees are collapsed, resulting in polytomies. • Trees with many polytomies are said to be less resolved than trees with few or no polytomies.
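
The two consensus rules can be sketched by treating each tree as a set of clades (groups of tips) and counting how often each clade occurs. This is a simplified illustration: the three "equally parsimonious" trees below are invented, and representing a tree as a set of frozensets is an assumption of this sketch rather than the format any program uses.

```python
from collections import Counter


def clade_counts(trees):
    """Count in how many trees each clade appears."""
    return Counter(clade for tree in trees for clade in tree)


def strict_consensus(trees):
    """Keep only the clades present in every tree."""
    counts = clade_counts(trees)
    return {clade for clade, c in counts.items() if c == len(trees)}


def majority_rule_consensus(trees):
    """Keep the clades present in more than 50% of the trees."""
    counts = clade_counts(trees)
    return {clade for clade, c in counts.items() if c / len(trees) > 0.5}


# Hypothetical set of tied trees on taxa A-D, each written as its set of clades.
tied_trees = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABD")},
    {frozenset("AB"), frozenset("ABC")},
]

strict = strict_consensus(tied_trees)
majority = majority_rule_consensus(tied_trees)
```

Here the clade (A,B) is on all three trees and survives both rules, while (A,B,C) is on two of three and survives only the majority rule. The clades that are dropped become polytomies when the consensus tree is drawn.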

Why are Maximum Likelihood and Bayesian methods considered an improvement over parsimony? • + They allow for a model of molecular evolution to be specified. • Not all changes from one base to another (or from one a.a. to another) are equally likely. • Not all positions have the same probability of change. • - They require that the correct model be specified.

What is Maximum Likelihood (ML)? • Just like parsimony, ML examines lots of trees and picks the best one. • However, the optimality criteria differ. • Parsimony -- fewest changes. • ML -- maximizes the probability of observing the data (aligned sequences), given the tree and a model of molecular evolution.
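
The ML criterion can be sketched by computing the likelihood of a toy alignment on each candidate tree with Felsenstein's pruning algorithm. Several simplifying assumptions are made here and are not from the slides: the Jukes-Cantor (JC69) model is used, every branch has the same fixed length, and the alignment is invented. Real ML programs also optimize branch lengths and model parameters for each tree.

```python
import math

BASES = "ACGT"


def jc69_prob(start, end, t):
    """P(end base | start base) after a branch of length t under JC69.

    t is in expected substitutions per site.
    """
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if start == end else 0.25 - 0.25 * decay


def conditional_likelihoods(node, site, t):
    """Likelihood of the tips below `node`, for each possible base at `node`."""
    if isinstance(node, str):              # tip: observed base has likelihood 1
        return [1.0 if base == site[node] else 0.0 for base in BASES]
    result = [1.0] * len(BASES)
    for child in node:
        child_like = conditional_likelihoods(child, site, t)
        for x, parent_base in enumerate(BASES):
            result[x] *= sum(
                jc69_prob(parent_base, child_base, t) * child_like[y]
                for y, child_base in enumerate(BASES)
            )
    return result


def log_likelihood(tree, alignment, t=0.1):
    """Log-likelihood of the alignment on `tree`; all branches have length t (assumed)."""
    n_sites = len(next(iter(alignment.values())))
    total = 0.0
    for i in range(n_sites):
        site = {name: seq[i] for name, seq in alignment.items()}
        root = conditional_likelihoods(tree, site, t)
        total += math.log(sum(0.25 * value for value in root))  # equal base frequencies
    return total


# Hypothetical toy alignment, invented for illustration.
alignment = {"A": "AAGT", "B": "AAGT", "C": "CAGC", "D": "CAGC"}
tree_ab = (("A", "B"), ("C", "D"))
tree_ac = (("A", "C"), ("B", "D"))

lnl_ab = log_likelihood(tree_ab, alignment)
lnl_ac = log_likelihood(tree_ac, alignment)
```

The tree that groups the identical sequences together receives the higher log-likelihood, which is the tree ML would pick among these two.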

Models of molecular evolution • Substitution matrix • For proteins, this is the (observed) probability of one amino acid changing to another. • For DNA, it is the probability of one base changing to another. • Site-to-site variation in rate of change • Some sites don’t vary. • Among those that do, they vary at different rates.
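
As one concrete case of "not all changes are equally likely" for DNA, the sketch below builds the rate matrix of the Kimura two-parameter (K80) model, in which transitions (A<->G, C<->T) occur kappa times faster than transversions. K80 is chosen here only as a simple example; it is not named in the slides, and the function names and the value kappa = 2 are assumptions. Site-to-site rate variation is not shown.

```python
BASES = "ACGT"
PURINES = {"A", "G"}


def is_transition(x, y):
    """True when x -> y stays within purines or within pyrimidines."""
    return (x in PURINES) == (y in PURINES)


def k80_rate_matrix(kappa):
    """Return the K80 rate matrix as a dict of dicts.

    Off-diagonal entries are relative rates of change; each diagonal entry is
    set so that the row sums to zero. Rates are scaled so that each base
    leaves its current state at a total rate of 1.
    """
    scale = kappa + 2.0        # one transition partner plus two transversion partners
    q = {}
    for x in BASES:
        row = {
            y: (kappa if is_transition(x, y) else 1.0) / scale
            for y in BASES
            if y != x
        }
        row[x] = -sum(row.values())
        q[x] = row
    return q


q = k80_rate_matrix(kappa=2.0)
```

With kappa = 2, the A -> G rate is twice the A -> C rate. Programs such as ProtTest and MrModelTest compare many such models against the data rather than assuming one.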

Why is using a correct model of molecular evolution better than using parsimony? • Under some conditions, parsimony chooses the wrong tree (long branch attraction). • Methods using a model are more precise and result in fewer exact ties, generally. • For example, changes between two chemically similar a.a.’s can be used as “similarity”. Under parsimony all differences are simply “different”. • Models usually choose a single best tree, whereas parsimony usually chooses a large set of most parsimonious trees. • Branch length estimates are more accurate with a model.

What is Bayesian phylogenetic analysis? • Just like ML, we search for the best trees that are consistent with both the model and the data. • Optimality criterion: maximizes the probability of the tree, given the data (aligned sequences) and the model of molecular evolution. • Of the methods listed here, Bayesian analysis is the only one that automatically provides confidence estimates (posterior probabilities, used much like bootstrap values) for each node.
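
The difference from ML can be sketched with Bayes' rule applied to a handful of candidate trees: the posterior probability of each tree is proportional to its likelihood times its prior. The log-likelihood values below are invented, and enumerating trees like this is an assumption of the sketch. MrBayes does not enumerate trees; it samples them with Markov chain Monte Carlo.

```python
import math


def posterior_probabilities(log_likelihoods, priors=None):
    """Posterior probability of each tree from its log-likelihood and prior.

    log_likelihoods : dict tree name -> log P(data | tree, model)
    priors          : dict tree name -> prior probability (uniform if omitted)
    """
    names = list(log_likelihoods)
    if priors is None:
        priors = {name: 1.0 / len(names) for name in names}
    log_post = {n: log_likelihoods[n] + math.log(priors[n]) for n in names}
    largest = max(log_post.values())       # subtract the max to avoid underflow
    weights = {n: math.exp(v - largest) for n, v in log_post.items()}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}


# Hypothetical log-likelihoods for three candidate trees.
posterior = posterior_probabilities({"tree1": -20.0, "tree2": -23.0, "tree3": -23.0})
best = max(posterior, key=posterior.get)
```

The posterior probability of a node is then the summed posterior of all trees that contain that node, which is where the per-node support values come from.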

Example - Bayesian analysis of signal transduction proteins • Using ProtTest to find out how the sequences are evolving • Informing MrBayes of the model of molecular evolution • Using MrBayes to get the phylogeny • Making a figure

MrBayes doesn’t know when it has run long enough -- you decide. • Guideline: run until the average standard deviation of split frequencies is < 0.01.
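
The convergence statistic compares how often each split (clade) was sampled in independent runs. A rough sketch of the idea follows. The details are assumptions, not taken from the slides or checked against the MrBayes source: splits rarer than 0.1 in every run are ignored, the sample standard deviation is used, and the split frequencies are invented.

```python
from statistics import stdev


def average_sd_split_frequencies(runs, min_freq=0.1):
    """Average, over splits, of the standard deviation of each split's frequency across runs.

    runs     : list of dicts (at least two), each mapping split -> sampled frequency
    min_freq : splits below this frequency in every run are skipped (assumed cutoff)
    """
    all_splits = set().union(*runs)
    kept = [
        split for split in all_splits
        if any(run.get(split, 0.0) >= min_freq for run in runs)
    ]
    if not kept:
        return 0.0
    deviations = [stdev([run.get(split, 0.0) for run in runs]) for split in kept]
    return sum(deviations) / len(kept)


# Hypothetical split frequencies from two independent runs.
early_runs = [{"AB": 0.9, "ABC": 0.7}, {"AB": 0.5, "ABC": 0.3}]
late_runs = [{"AB": 1.0, "ABC": 0.6}, {"AB": 1.0, "ABC": 0.6}]

asdsf_early = average_sd_split_frequencies(early_runs)
asdsf_late = average_sd_split_frequencies(late_runs)
```

When the runs agree the value approaches zero; values well above 0.01 suggest the runs have not yet converged on the same set of trees.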

[Figure: two tree diagrams, with tips labeled A B C D E and B A E D C]

What is Neighbor-joining (NJ)? • NJ is an algorithm for building a tree. • There is no optimality criterion. • First, a matrix of distances between all pairs of sequences is computed. • A substitution matrix is needed to do this. • Then, one pair is chosen from among all possible pairs, because combining them best minimizes the length of the tree.
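
The pair-selection step can be sketched with the standard NJ criterion: for n taxa, compute Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k) and join the pair with the smallest Q. Only this first step is shown; the full algorithm then replaces the pair with a new node, recomputes distances, and repeats. The distance values and function name are assumptions invented for this example.

```python
from itertools import combinations


def nj_pick_pair(dist, taxa):
    """Return the pair of taxa with the smallest NJ Q value, and that value.

    dist : dict mapping frozenset({i, j}) -> distance between taxa i and j
    taxa : list of taxon names
    """
    n = len(taxa)
    row_sum = {
        i: sum(dist[frozenset((i, k))] for k in taxa if k != i)
        for i in taxa
    }
    q = {
        (i, j): (n - 2) * dist[frozenset((i, j))] - row_sum[i] - row_sum[j]
        for i, j in combinations(taxa, 2)
    }
    best = min(q, key=q.get)
    return best, q[best]


# Hypothetical pairwise distances for five taxa.
taxa = ["a", "b", "c", "d", "e"]
pairwise = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7,
    ("d", "e"): 3,
}
dist = {frozenset(pair): d for pair, d in pairwise.items()}

pair, q_value = nj_pick_pair(dist, taxa)
```

Note that the pair chosen (a, b) is not the pair with the smallest raw distance (d, e): the Q criterion corrects each distance for how far the two taxa are from everything else.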

Neighbor-joining • NJ is very fast. • There is no optimality criterion. • This means there is no way to assess its success. • There is also no way to say whether a “best” tree is significantly better than a set of “next best” trees. (mt Eve) • The tree it chooses is not always the shortest. Distances are estimated from noisy data and early mistakes in NJ can’t be revisited.

Large data sets • If you have over 50 sequences, or if you have very long sequences (hundreds of proteins), ProtTest and MrBayes may take more than a couple of days to finish. • Parsimony is much faster. • It allows node support (bootstrap values) to be calculated. • It doesn’t require a model of molecular evolution. • PAUP* can read NEXUS files. • NJ is faster still. Sometimes it is the only method that is fast enough. • A default model of molecular evolution must be used.
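
Bootstrap values come from repeating the analysis on resampled alignments. The resampling step can be sketched as below: columns are drawn with replacement until the replicate has as many columns as the original. The toy alignment, the seed, and the function name are assumptions for illustration; the tree-building step applied to each replicate is not shown.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns resampled with replacement.

    alignment : dict mapping sequence name -> aligned sequence (equal lengths)
    rng       : a random.Random instance, so that runs can be reproduced
    """
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }


# Hypothetical toy alignment, invented for illustration.
alignment = {"A": "AAGT", "B": "AAGT", "C": "CAGC", "D": "CAGC"}
rng = random.Random(1)
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
```

The bootstrap value for a node is the percentage of replicates whose tree contains that node. Because every replicate needs its own tree search, a fast method such as parsimony or NJ makes this practical for large data sets.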

DNA sequences should be used when sequences are highly similar • Use a very similar procedure. • Use MrModelTest instead of ProtTest.

Summary • Three most important choices • Which sequences to include • Outgroup sequences • Alignment • Choice of method - Bayesian • Example - Look on Ned’s Computational Corner for more details.
