http://creativecommons.org/licenses/by-sa/2.0/
CIS786, Lecture 5 Usman Roshan
Previously • DCM decompositions in detail • DCM1 improved significantly over NJ • DCM2 did not always improve over TNT (for solving MP) • New DCM3 improved over DCM2 but not better than TNT • The DCM story continues…
Disk Covering Methods (DCMs) • DCMs are divide-and-conquer booster methods. They divide the dataset into small subproblems, compute subtrees using a given base method, merge the subtrees, and refine the supertree. • DCMs to date • DCM1: for improving statistical performance of distance-based methods. • DCM2: for improving heuristic search for MP and ML • DCM3: latest, fastest, and best (in accuracy and optimality) DCM
DCM2 technique for speeding up MP searches: 1. Decompose sequences into overlapping subproblems 2. Compute subtrees using a base method 3. Merge subtrees using the Strict Consensus Merge (SCM) 4. Refine to make the tree binary
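The four steps can be read as a generic divide-and-conquer boosting loop. The sketch below only illustrates that flow; the helper functions (decompose, run_base_method, strict_consensus_merge, refine_to_binary) are hypothetical placeholders for the corresponding DCM components, not an actual DCM2 implementation.

```python
# Illustrative sketch of the DCM2-style boosting pipeline. The helper functions
# passed in (decompose, run_base_method, strict_consensus_merge, refine_to_binary)
# are hypothetical placeholders for the four steps, not the actual DCM2 code.

def dcm_boost(sequences, decompose, run_base_method,
              strict_consensus_merge, refine_to_binary):
    # 1. Decompose the sequences into overlapping subproblems.
    subproblems = decompose(sequences)
    # 2. Compute a subtree for each subproblem with the base method (e.g. an MP heuristic).
    subtrees = [run_base_method(sub) for sub in subproblems]
    # 3. Merge the subtrees into a supertree via the Strict Consensus Merge (SCM).
    supertree = strict_consensus_merge(subtrees)
    # 4. The merged supertree may be unresolved; refine it into a binary tree.
    return refine_to_binary(supertree, sequences)
```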
Error as a function of evolutionary rate (plot comparing NJ and DCM1-NJ+MP)
I. Comparison of DCMs (4583 sequences) Base method is the TNT-ratchet. DCM2 takes almost 10 hours to produce a tree and is too slow to run on larger datasets.
DCM2 decomposition on 500 rbcL genes (Zilla dataset) • Blue: separator • Red: subset 1 • Pink: subset 2 • Visualization produced by the graphviz program, which draws the graph according to specified distances • Nodes: species in the dataset • Distances: p-distances (Hamming) between the DNA sequences • Separator is very large • Subsets are very large • Scattered subsets
Approx centroid-edge DCM3 decomposition – example • Locate the centroid edge e (O(n) time) • Set the closest leaves around e to be the separator (O(n) time) • Remaining leaves in subtrees around e form the subsets (unioned with the separator)
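Step 1 above (finding the centroid edge) can be illustrated with a small sketch. This is a minimal, assumed implementation: the guide tree is given as an adjacency list, and the centroid edge is taken to be the edge whose removal splits the leaf set most evenly; building the separator and subsets from the leaves nearest to that edge is not shown.

```python
# Minimal sketch of step 1: locating a centroid edge of the guide tree in O(n) time.
# The tree is an adjacency list {node: [neighbors]}; 'leaves' is the set of leaf labels.
# The centroid edge is taken here to be the edge whose removal splits the leaf set
# most evenly (an assumed definition for this illustration).

def centroid_edge(adj, leaves):
    n = len(leaves)
    root = next(iter(adj))                  # arbitrary start node
    parent, order, stack = {root: None}, [], [root]
    while stack:                            # iterative DFS; parents appear before children
        u = stack.pop()
        order.append(u)
        for v in adj[u]:
            if v != parent[u]:
                parent[v] = u
                stack.append(v)

    # Count the leaves below each node, processing children before parents.
    below = {u: (1 if u in leaves else 0) for u in adj}
    for u in reversed(order):
        if parent[u] is not None:
            below[parent[u]] += below[u]

    # Pick the edge minimizing the larger side of the induced leaf bipartition.
    best_edge, best_size = None, n + 1
    for u in adj:
        if parent[u] is not None:
            larger = max(below[u], n - below[u])
            if larger < best_size:
                best_edge, best_size = (parent[u], u), larger
    return best_edge

# Example on a quartet tree A,B -- x -- y -- C,D: the internal edge (x, y) is returned.
quartet = {'A': ['x'], 'B': ['x'], 'C': ['y'], 'D': ['y'],
           'x': ['A', 'B', 'y'], 'y': ['C', 'D', 'x']}
print(centroid_edge(quartet, {'A', 'B', 'C', 'D'}))   # ('x', 'y')
```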
DCM3 decomposition on 500 rbcL genes (Zilla dataset) • Blue: separator (and subset) • Red: subset 2 • Pink: subset 3 • Yellow: subset 4 • Visualization produced by the graphviz program, which draws the graph according to specified distances • Nodes: species in the dataset • Distances: p-distances (Hamming) between the DNA sequences • Separator is small • Subsets are small • Compact subsets
Comparison of DCMs (plot: average MP score above optimal, as a percentage of the optimal, over 24 hours, for TNT, DCM2, DCM3, and Rec-DCM3) • Dataset: 4583 actinobacteria ssu rRNA sequences from RDP. Base method is the TNT-ratchet. • DCM2 takes almost 10 hours to produce a tree and is too slow to run on larger datasets. • DCM3 followed by TNT-ratchet doesn't improve over TNT • Recursive-DCM3 followed by TNT-ratchet doesn't improve over TNT
Local optima are a problem (diagram: cost landscape over the space of phylogenetic trees, with a local optimum and the global optimum marked)
Local optima are a problem (plot: average MP score above optimal, as a percentage of the optimal, over time in hours)
Iterated local search: escape local optima by perturbation (diagram: local search reaches a local optimum, a perturbation produces a new tree, and local search restarts from the perturbation's output)
Iterated local search: Recursive-Iterative-DCM3 (diagram: local search reaches a local optimum, a Recursive-DCM3 round acts as the perturbation, and local search restarts from its output)
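The two diagrams amount to a simple loop: run the base local search to a local optimum, perturb the tree, and search again from the perturbed tree, keeping the better result. A minimal sketch of that loop, with local_search, rec_dcm3_round, and mp_score as hypothetical placeholders for the base heuristic (e.g. TNT-ratchet), one Recursive-DCM3 pass, and the parsimony score:

```python
# Illustrative iterated-local-search loop in the spirit of Rec-I-DCM3.
# local_search, rec_dcm3_round, and mp_score are hypothetical placeholders for the
# base heuristic (e.g. TNT-ratchet), one Recursive-DCM3 pass, and the MP scoring
# function; lower MP scores are better.

def iterated_search(start_tree, data, local_search, rec_dcm3_round, mp_score,
                    n_iterations=10):
    best = local_search(start_tree, data)            # reach a first local optimum
    for _ in range(n_iterations):
        perturbed = rec_dcm3_round(best, data)       # perturbation: one Rec-DCM3 pass
        candidate = local_search(perturbed, data)    # local search from the perturbed tree
        if mp_score(candidate, data) < mp_score(best, data):
            best = candidate                         # keep the new tree only if it improves
    return best
```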
Comparison of DCMs for solving MP (plot: average MP score above optimal, as a percentage of the optimal, over 24 hours, for TNT, DCM2, DCM3, Rec-DCM3, and Rec-I-DCM3) • Rec-I-DCM3(TNT-ratchet) improves upon unboosted TNT-ratchet
I. Comparison of DCMs (13,921 sequences) Base method is the TNT-ratchet. Note the improvement in DCMs as we move from the default to recursion to iteration to recursion+iteration.
Improving upon TNT • But what happens after 24 hours? • We studied boosting TNT-ratchet. Other TNT heuristics are actually better, and improving upon them may not be possible. What about the default TNT search? • We select some real and large datasets. (Previously we showed that TNT reaches the best known scores on small datasets.) • We run 5 trials of TNT for two weeks and 5 trials of Rec-I-DCM3(TNT) for one week on each dataset.
How to run Rec-I-DCM3 then? • Unanswered question: what about the better TNT heuristics? Can Rec-I-DCM3 improve upon them? • Rec-I-DCM3 improves upon the default TNT search, but we don't know what happens with the better TNT heuristics. • Therefore, for a large-scale analysis, first find the best settings of the software (e.g. TNT or PAUP*) on the dataset, and then use it in conjunction with Rec-I-DCM3 with various subset sizes.
Maximum likelihood • Four problems • Given data, tree, edge lengths, and ancestral states, find the likelihood of the tree: polynomial time • Given data, tree, and edge lengths, find the likelihood of the tree: polynomial time via dynamic programming
Second case • The direct summation over all ancestral state assignments takes exponential time, but it can be solved in polynomial time using dynamic programming, similar to computing MP scores • (Derivation from Ron Shamir's lectures)
Second case (DP): complexity • For each node and each site we do k^2 work, where k is the number of states, so the total is O(mnk^2) for m sites and n nodes
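To make the O(mnk^2) count concrete, here is a minimal single-site sketch of the pruning-style dynamic program for the second case, assuming a rooted binary tree where each child edge already carries its k x k substitution-probability matrix; the node representation (dicts with 'children' and 'state') is invented purely for illustration.

```python
import numpy as np

# Minimal single-site sketch of the pruning dynamic program for the second case
# (data, tree, and edge lengths given). Each node is a dict with 'children', a list
# of (child, P) pairs where P is the k x k substitution-probability matrix on the
# edge to that child; leaves have 'children': [] and an observed 'state' index.
# This node representation is invented purely for illustration.

def conditional_likelihoods(node, k):
    """Length-k vector L[s] = P(observed data below node | state at node is s)."""
    if not node['children']:                 # leaf: indicator vector of the observed state
        vec = np.zeros(k)
        vec[node['state']] = 1.0
        return vec
    vec = np.ones(k)
    for child, P in node['children']:
        child_vec = conditional_likelihoods(child, k)
        vec *= P @ child_vec                 # k^2 work per child: sum over the child's states
    return vec

def site_likelihood(root, prior, k):
    """Likelihood of one site: average the root vector over the prior on root states."""
    return float(prior @ conditional_likelihoods(root, k))
```

Summing the per-site log-likelihoods over all m sites gives the stated mnk^2 total.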
Maximum likelihood • Four problems • Given data, tree, edge lengths, and ancestral states, find the likelihood of the tree: polynomial time • Given data, tree, and edge lengths, find the likelihood of the tree: polynomial time via dynamic programming • Given data and tree, find the likelihood (optimizing over edge lengths): unknown complexity
Third case • Assign arbitrary values to all edge lengths except one, t_rv • Optimize the resulting one-parameter function using EM or Newton-Raphson • Repeat for the other edges • Stop when the improvement in likelihood is less than delta
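A schematic version of this coordinate-wise loop is sketched below; a generic bounded one-dimensional optimizer stands in for EM or Newton-Raphson, and tree_log_likelihood is a hypothetical placeholder for the pruning-based likelihood computation.

```python
from scipy.optimize import minimize_scalar

# Schematic coordinate-wise optimization of edge lengths for the third case (data and
# tree fixed). tree_log_likelihood(tree, edge_lengths) is a hypothetical placeholder;
# edge_lengths maps each edge to its current length. A bounded 1-D optimizer is used
# here in place of EM or Newton-Raphson.

def optimize_edge_lengths(tree, edge_lengths, tree_log_likelihood,
                          delta=1e-6, max_length=10.0):
    prev_ll = tree_log_likelihood(tree, edge_lengths)
    while True:
        for edge in edge_lengths:
            # Hold every other edge length fixed and optimize this single length.
            def neg_ll(t, edge=edge):
                trial = dict(edge_lengths)
                trial[edge] = t
                return -tree_log_likelihood(tree, trial)
            best = minimize_scalar(neg_ll, bounds=(1e-8, max_length), method='bounded')
            edge_lengths[edge] = best.x
        curr_ll = tree_log_likelihood(tree, edge_lengths)
        if curr_ll - prev_ll < delta:        # stop once the improvement drops below delta
            return edge_lengths
        prev_ll = curr_ll
```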
Maximum likelihood • Four problems • Given data, tree, edge lengths, and ancestral states, find the likelihood of the tree: polynomial time • Given data, tree, and edge lengths, find the likelihood of the tree: polynomial time via dynamic programming • Given data and tree, find the likelihood (optimizing over edge lengths): unknown complexity • Given data, find the tree with the best likelihood: unknown complexity
ML is a very hard problem • The number of potential trees grows exponentially with the number of taxa, quickly exceeding 10^80, roughly the number of atoms in the universe
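For a concrete sense of the growth: the number of distinct unrooted binary trees on n labeled leaves is (2n-5)!! = 3 x 5 x ... x (2n-5), which the short computation below makes explicit.

```python
# Number of distinct unrooted binary trees on n labeled leaves: (2n-5)!! for n >= 3.

def num_unrooted_binary_trees(n):
    count = 1
    for odd in range(3, 2 * n - 4, 2):   # multiply 3 * 5 * ... * (2n-5)
        count *= odd
    return count

# The count explodes quickly and passes 10^80 at only a few dozen taxa.
for n in (10, 20, 53):
    print(n, num_unrooted_binary_trees(n))
```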
Local search • Greedy-ML tree followed by local search using TBR moves • Software packages: PAUP*, PHYLIP, PhyML, RAxML • We now look at RAxML in detail • Major RAxML innovations • Good starting trees • Subtree rearrangements • Lazy rescoring