Molecular phylogenetics 1

Level 3 Molecular Evolution and Bioinformatics
Jim Provan
Page and Holmes: Sections 6.1-2

Distances vs. discrete characters
• This division is based on how the data are treated:
  • Distance methods first convert aligned sequences into a pairwise distance matrix, then pass that matrix to a tree-building method
  • Discrete methods consider each nucleotide site (or a function of each site) separately
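The first step of a distance method can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the course: it assumes the simplest possible distance (the proportion of aligned sites that differ, often called the p-distance), and the function names and the toy alignment are made up for the example.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return differences / len(seq_a)


def distance_matrix(alignment):
    """Build a symmetric pairwise distance matrix as a dict of dicts."""
    names = list(alignment)
    matrix = {name: {name: 0.0} for name in names}
    for a, b in combinations(names, 2):
        d = p_distance(alignment[a], alignment[b])
        matrix[a][b] = d
        matrix[b][a] = d
    return matrix


# Hypothetical toy alignment: four sequences, eight sites.
alignment = {
    "seq1": "ACGTACGT",
    "seq2": "ACGTACGA",
    "seq3": "ACCTACGA",
    "seq4": "TCCTACGA",
}
matrix = distance_matrix(alignment)
for name in alignment:
    print(name, [matrix[name][other] for other in alignment])
```

Once the matrix has been built, the individual sites are no longer visible to the tree-building method, which is the essential difference from a discrete method.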

Distances vs. discrete characters
[Figure: Parsimony tree. Labels: Sites, Sequences.]

Distances vs. discrete characters
[Figure: Distance tree. Labels: Sites, Sequences.]

Distances vs. discrete characters
• In this example, the trees obtained by parsimony (a discrete method) and minimum evolution (a distance method) are identical in topology and branch lengths:
  • Parsimony analysis identifies seven substitutions and places them on the five branches of the tree
  • The distance tree apportions the observed distances between sequences over the branches of the tree
  • Under parsimony each site requires one change, which gives a total of seven changes
  • Summing the branch lengths of the distance tree gives the same value: 2 + 1 + 2 + 1 + 1 = 7
• The parsimony tree gives additional information: which site contributes to which branch, plus the ancestral states

Clustering methods vs. search methods
• Clustering methods follow a set of steps (an algorithm) and arrive at a tree:
  • Advantages:
    • Easy to implement, resulting in very fast computer programs
    • Always produce a single tree
  • Disadvantages:
    • Results obtained from simple clustering algorithms often depend on the order in which sequences are added to the growing tree
    • Do not allow evaluation of competing hypotheses: two different trees could explain the data equally well, but there is no way of measuring the fit between tree and data

[Figure: A clustering method. Panels: start tree; decide where to place next sequence; add next sequence to tree. Round 1 adds sequence D to a tree of A, B and C; Round 2 adds sequence E.]
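As a concrete instance of a clustering algorithm, here is a minimal sketch of UPGMA (named later in these slides) in Python. It is an illustration under assumptions, not a reference implementation: the function name, the nested-tuple tree format and the toy distances are invented for the example, and branch lengths are not computed.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster sequences by UPGMA and return the tree as nested tuples.

    names: list of sequence labels
    dist:  dict mapping frozenset({a, b}) -> distance between a and b
    """
    sizes = {name: 1 for name in names}  # cluster -> number of sequences in it
    d = dict(dist)                       # working copy, extended as clusters merge
    while len(sizes) > 1:
        # Step 1: find the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        # Step 2: distance from the merged cluster to each remaining cluster
        # is the size-weighted average of the two old distances.
        for other in sizes:
            if other in (a, b):
                continue
            d[frozenset((merged, other))] = (
                sizes[a] * d[frozenset((a, other))]
                + sizes[b] * d[frozenset((b, other))]
            ) / (sizes[a] + sizes[b])
        # Step 3: replace the two clusters with the merged one and repeat.
        sizes[merged] = sizes[a] + sizes[b]
        del sizes[a], sizes[b]
    return next(iter(sizes))


# Hypothetical toy distances between four sequences.
pairs = {
    ("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4,
    ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6,
}
dist = {frozenset(pair): value for pair, value in pairs.items()}
tree = upgma(["A", "B", "C", "D"], dist)
print(tree)
```

The function returns exactly one tree and never scores an alternative, which is both the advantage and the disadvantage listed for clustering methods.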

Search methods
• Tree-building methods in this class use optimality criteria to choose among the set of all possible trees:
  • The criterion is used to assign a “score” or “rank” to each tree, which is a function of the relationship between the tree and the data
  • Require an explicit function relating tree and data (e.g. a model of how sequences evolve)
  • Allow comparison of how well competing hypotheses of evolutionary relationships fit the data
• The major disadvantage is that optimality methods are computationally very expensive, because two questions must be answered:
  • For a given data set and tree, what is the optimality value?
  • Which of all possible trees has the maximum optimality value?

[Figure: An optimality method. Candidate trees for sequences A to E, numbered 1 to 15, with a score shown for some trees.]
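The first of the two questions (what is the optimality value of a given tree?) can be illustrated with maximum parsimony. The sketch below assumes Fitch's algorithm for counting the minimum number of changes; the slides do not say which algorithm is used, and the function names, tree format and toy alignment are hypothetical. Trees are written as rooted nested pairs purely for convenience, since the parsimony score does not depend on where the root is placed.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    tree:   a leaf label (str) or a pair (left, right) of subtrees
    states: dict mapping leaf label -> nucleotide observed at this site
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch_site(tree[0], states)
    right_set, right_changes = fitch_site(tree[1], states)
    shared = left_set & right_set
    if shared:
        # The two subtrees can agree on a state: no extra change needed here.
        return shared, left_changes + right_changes
    # No state in common: one additional substitution is required.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all sites of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


# Hypothetical toy alignment: four sequences, four sites.
alignment = {"A": "AACG", "B": "AACT", "C": "GTCG", "D": "GTCT"}

# The three possible trees for four sequences.
candidates = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "D"), ("B", "C")),
]
scores = {tree: parsimony_score(tree, alignment) for tree in candidates}
best = min(scores, key=scores.get)
print(scores, best)
```

With four sequences every candidate tree can be scored. The second question becomes the hard part as the number of sequences, and with it the number of candidate trees, grows.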

Non-deterministic polynomial-completeness problems
• Non-deterministic polynomial-complete (NP-complete) problems are a set of problems for which no efficient algorithm is known to exist
• The problem of finding the optimal evolutionary tree under a variety of criteria (e.g. minimum evolution, maximum parsimony) is NP-complete:
  • Even for a moderate number of sequences (e.g. 20) it is not feasible to guarantee that the optimal tree has been found
  • In such cases, we must rely on heuristics to find something approaching the best tree, but the result may be far from optimal
  • Example: for human mitochondrial DNA, different researchers obtained quite different trees using different heuristic searches
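To see why an exhaustive search is out of reach for 20 sequences, it helps to count the candidate trees. The sketch below uses the standard result that n sequences have (2n - 5)!! = 1 × 3 × 5 × ... × (2n - 5) distinct unrooted bifurcating trees; this formula is not stated on the slides, and the function name is made up for the example.

```python
def count_unrooted_trees(n_sequences):
    """Number of distinct unrooted bifurcating trees for n labelled sequences."""
    if n_sequences < 3:
        raise ValueError("need at least three sequences")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (the leading factor 1 is implicit).
    for k in range(3, 2 * n_sequences - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```

Four sequences give 3 trees and five give 15, while 20 sequences give more than 10^20, far too many to score one by one.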

[Figure: An heuristic method]

Subtree methods
• The effectiveness of an heuristic search depends in part on the number of trees examined, which can be computationally demanding
• An alternative approach is to divide the set of sequences into smaller subsets and find optimal trees for these subsets:
  • The smallest unrooted tree is a quartet (four sequences)
  • Each quartet has three possible unrooted trees
• Quartet puzzling follows these two steps:
  1. For each quartet, identify the optimal tree
  2. Take all the four-sequence trees from step 1 and assemble them into a single tree
• Because of homoplasy the quartet trees will not all agree, so the best tree will usually be the one that is consistent with the most quartets (but finding it is an NP-complete problem as well)
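Step 1 of the quartet approach can be sketched as follows. The slides do not say how the optimal tree for each quartet is chosen, so this example swaps in a simple distance-based rule as an assumption: of the three possible groupings, pick the one whose two within-pair distances have the smallest sum (the four-point condition for tree-like distances). All names and the toy distances are hypothetical, and step 2 (assembling the quartets into one tree) is not attempted.

```python
from itertools import combinations


def quartet_topologies(a, b, c, d):
    """The three possible unrooted trees for four sequences, as pairs of pairs."""
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]


def best_quartet(quartet, dist):
    """Choose the grouping with the smallest sum of within-pair distances."""
    def cost(topology):
        (w, x), (y, z) = topology
        return dist[frozenset((w, x))] + dist[frozenset((y, z))]
    return min(quartet_topologies(*quartet), key=cost)


def all_best_quartets(names, dist):
    """Step 1 of quartet puzzling: one chosen tree per set of four sequences."""
    return {q: best_quartet(q, dist) for q in combinations(names, 4)}


# Hypothetical toy distances between five sequences.
pairs = {
    ("A", "B"): 2, ("A", "C"): 3, ("A", "D"): 4, ("A", "E"): 4,
    ("B", "C"): 3, ("B", "D"): 4, ("B", "E"): 4,
    ("C", "D"): 3, ("C", "E"): 3, ("D", "E"): 2,
}
dist = {frozenset(pair): value for pair, value in pairs.items()}
quartets = all_best_quartets(["A", "B", "C", "D", "E"], dist)
for quartet, topology in quartets.items():
    print(quartet, "->", topology)
```

Five sequences give only five quartets, each with three candidate trees, so step 1 is cheap. The expensive part is step 2, where conflicting quartets have to be reconciled.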

Comparing tree-building methods

                                     Type of data
Tree-building method      Distances             Nucleotide sites
Clustering algorithm      UPGMA                 -
                          Neighbour joining
Optimality criterion      Minimum evolution     Maximum parsimony
                                                Maximum likelihood

Comparing tree-building methods
• Efficiency:
  • Effectively the time a computer program takes to find a tree
  • Since finding the best tree is NP-complete for virtually all optimality criteria, efficient tree-searching algorithms that guarantee the best tree are unlikely to exist
  • Some optimality criteria can be evaluated more quickly than others: an heuristic search using parsimony can explore a much larger number of trees than a search using likelihood
• Power:
  • A measure of how much data are needed before we can be reasonably sure of arriving at the correct result
  • A method may be theoretically appealing, but if it requires huge numbers of sites it is not practical

Comparing tree-building methods
• Consistency:
  • Will the method converge on the true tree as more data are added?
  • An inconsistent method will fail to do so even if data are continually added
• Robustness:
  • All tree-building methods make (implicit or explicit) assumptions about evolutionary processes, e.g. the assumption of a molecular clock
  • Robustness is the method's sensitivity to violations of these assumptions: a method that is not robust returns poor estimates of phylogeny when its underlying model is violated
• Falsifiability:
  • The ability to tell whether these assumptions have been violated, i.e. whether we should be using the method at all
