http://creativecommons.org/licenses/by-sa/2.0/
CIS786, Lecture 5 Usman Roshan
Previously • DCM decompositions in detail • DCM1 improved significantly over NJ • DCM2 did not always improve over TNT (for solving MP) • New DCM3 improved over DCM2 but not better than TNT • The DCM story continues…
Disk Covering Methods (DCMs) • DCMs are divide-and-conquer booster methods. They divide the dataset into small subproblems, compute subtrees using a given base method, merge the subtrees, and refine the supertree. • DCMs to date • DCM1: for improving statistical performance of distance-based methods. • DCM2: for improving heuristic search for MP and ML • DCM3: latest, fastest, and best (in accuracy and optimality) DCM
DCM2 technique for speeding up MP searches: 1. Decompose sequences into overlapping subproblems 2. Compute subtrees using a base method 3. Merge subtrees using the Strict Consensus Merge (SCM) 4. Refine to make the tree binary
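The four steps can be read as a generic divide-and-conquer boosting loop. The sketch below only illustrates that flow; the helper functions (decompose, run_base_method, strict_consensus_merge, refine_to_binary) are hypothetical placeholders for the corresponding DCM components, not an actual DCM2 implementation.

```python
# Illustrative sketch of the DCM2-style boosting pipeline. The helper functions
# passed in (decompose, run_base_method, strict_consensus_merge, refine_to_binary)
# are hypothetical placeholders for the four steps, not the actual DCM2 code.

def dcm_boost(sequences, decompose, run_base_method,
              strict_consensus_merge, refine_to_binary):
    # 1. Decompose the sequences into overlapping subproblems.
    subproblems = decompose(sequences)
    # 2. Compute a subtree for each subproblem with the base method (e.g. an MP heuristic).
    subtrees = [run_base_method(sub) for sub in subproblems]
    # 3. Merge the subtrees into a supertree via the Strict Consensus Merge (SCM).
    supertree = strict_consensus_merge(subtrees)
    # 4. The merged supertree may be unresolved; refine it into a binary tree.
    return refine_to_binary(supertree, sequences)
```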
Error as a function of evolutionary rate (plot comparing NJ and DCM1-NJ+MP)
I. Comparison of DCMs (4583 sequences) Base method is the TNT-ratchet. DCM2 takes almost 10 hours to produce a tree and is too slow to run on larger datasets.
DCM2 decomposition on 500 rbcL genes (Zilla dataset) • Blue: separator • Red: subset 1 • Pink: subset 2 • Visualization produced by the graphviz program, which draws the graph according to specified distances • Nodes: species in the dataset • Distances: p-distances (Hamming) between the DNA sequences • Separator is very large • Subsets are very large • Scattered subsets
Approx centroid-edge DCM3 decomposition – example • Locate the centroid edge e (O(n) time) • Set the closest leaves around e to be the separator (O(n) time) • Remaining leaves in subtrees around e form the subsets (unioned with the separator)
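Step 1 above (finding the centroid edge) can be illustrated with a small sketch. This is a minimal, assumed implementation: the guide tree is given as an adjacency list, and the centroid edge is taken to be the edge whose removal splits the leaf set most evenly; building the separator and subsets from the leaves nearest to that edge is not shown.

```python
# Minimal sketch of step 1: locating a centroid edge of the guide tree in O(n) time.
# The tree is an adjacency list {node: [neighbors]}; 'leaves' is the set of leaf labels.
# The centroid edge is taken here to be the edge whose removal splits the leaf set
# most evenly (an assumed definition for this illustration).

def centroid_edge(adj, leaves):
    n = len(leaves)
    root = next(iter(adj))                  # arbitrary start node
    parent, order, stack = {root: None}, [], [root]
    while stack:                            # iterative DFS; parents appear before children
        u = stack.pop()
        order.append(u)
        for v in adj[u]:
            if v != parent[u]:
                parent[v] = u
                stack.append(v)

    # Count the leaves below each node, processing children before parents.
    below = {u: (1 if u in leaves else 0) for u in adj}
    for u in reversed(order):
        if parent[u] is not None:
            below[parent[u]] += below[u]

    # Pick the edge minimizing the larger side of the induced leaf bipartition.
    best_edge, best_size = None, n + 1
    for u in adj:
        if parent[u] is not None:
            larger = max(below[u], n - below[u])
            if larger < best_size:
                best_edge, best_size = (parent[u], u), larger
    return best_edge

# Example on a quartet tree A,B -- x -- y -- C,D: the internal edge (x, y) is returned.
quartet = {'A': ['x'], 'B': ['x'], 'C': ['y'], 'D': ['y'],
           'x': ['A', 'B', 'y'], 'y': ['C', 'D', 'x']}
print(centroid_edge(quartet, {'A', 'B', 'C', 'D'}))   # ('x', 'y')
```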
DCM3 decomposition on 500 rbcL genes (Zilla dataset) • Blue: separator (and subset) • Red: subset 2 • Pink: subset 3 • Yellow: subset 4 • Visualization produced by the graphviz program, which draws the graph according to specified distances • Nodes: species in the dataset • Distances: p-distances (Hamming) between the DNA sequences • Separator is small • Subsets are small • Compact subsets
Comparison of DCMs (plot: average MP score above optimal, as a percentage of the optimal, over 24 hours, for TNT, DCM2, DCM3, and Rec-DCM3) • Dataset: 4583 actinobacteria ssu rRNA sequences from RDP. Base method is the TNT-ratchet. • DCM2 takes almost 10 hours to produce a tree and is too slow to run on larger datasets. • DCM3 followed by TNT-ratchet doesn't improve over TNT • Recursive-DCM3 followed by TNT-ratchet doesn't improve over TNT
Local optima are a problem (diagram: cost landscape over the space of phylogenetic trees, with a local optimum and the global optimum marked)
Local optima are a problem (plot: average MP score above optimal, as a percentage of the optimal, over time in hours)
Iterated local search: escape local optima by perturbation (diagram: local search reaches a local optimum, a perturbation produces a new tree, and local search restarts from the perturbation's output)
Iterated local search: Recursive-Iterative-DCM3 (diagram: local search reaches a local optimum, a Recursive-DCM3 round acts as the perturbation, and local search restarts from its output)
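The two diagrams amount to a simple loop: run the base local search to a local optimum, perturb the tree, and search again from the perturbed tree, keeping the better result. A minimal sketch of that loop, with local_search, rec_dcm3_round, and mp_score as hypothetical placeholders for the base heuristic (e.g. TNT-ratchet), one Recursive-DCM3 pass, and the parsimony score:

```python
# Illustrative iterated-local-search loop in the spirit of Rec-I-DCM3.
# local_search, rec_dcm3_round, and mp_score are hypothetical placeholders for the
# base heuristic (e.g. TNT-ratchet), one Recursive-DCM3 pass, and the MP scoring
# function; lower MP scores are better.

def iterated_search(start_tree, data, local_search, rec_dcm3_round, mp_score,
                    n_iterations=10):
    best = local_search(start_tree, data)            # reach a first local optimum
    for _ in range(n_iterations):
        perturbed = rec_dcm3_round(best, data)       # perturbation: one Rec-DCM3 pass
        candidate = local_search(perturbed, data)    # local search from the perturbed tree
        if mp_score(candidate, data) < mp_score(best, data):
            best = candidate                         # keep the new tree only if it improves
    return best
```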
Comparison of DCMs for solving MP (plot: average MP score above optimal, as a percentage of the optimal, over 24 hours, for TNT, DCM2, DCM3, Rec-DCM3, and Rec-I-DCM3) • Rec-I-DCM3(TNT-ratchet) improves upon unboosted TNT-ratchet
I. Comparison of DCMs (13,921 sequences) Base method is the TNT-ratchet. Note the improvement in DCMs as we move from the default to recursion to iteration to recursion+iteration.
Improving upon TNT • But what happens after 24 hours? • We studied boosting TNT-ratchet. Other TNT heuristics are actually better, and improving upon them may not be possible. What about the default TNT search? • We select some real and large datasets. (Previously we showed that TNT reaches the best known scores on small datasets.) • We run 5 trials of TNT for two weeks and 5 trials of Rec-I-DCM3(TNT) for one week on each dataset.
How to run Rec-I-DCM3 then? • Unanswered question: what about the better TNT heuristics? Can Rec-I-DCM3 improve upon them? • Rec-I-DCM3 improves upon the default TNT search, but we don't know what happens with the better TNT heuristics. • Therefore, for a large-scale analysis, first find the best settings of the software (e.g. TNT or PAUP*) on the dataset, and then use it in conjunction with Rec-I-DCM3 with various subset sizes.
Maximum likelihood • Four problems • Given data, tree, edge lengths, and ancestral states, find the likelihood of the tree: polynomial time • Given data, tree, and edge lengths, find the likelihood of the tree: polynomial time via dynamic programming
Second case • The direct summation over all ancestral state assignments takes exponential time, but it can be solved in polynomial time using dynamic programming, similar to computing MP scores • (Derivation from Ron Shamir's lectures)
Second case (DP): complexity • For each node and each site we do k^2 work, where k is the number of states, so the total is O(mnk^2) for m sites and n nodes
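To make the O(mnk^2) count concrete, here is a minimal single-site sketch of the pruning-style dynamic program for the second case, assuming a rooted binary tree where each child edge already carries its k x k substitution-probability matrix; the node representation (dicts with 'children' and 'state') is invented purely for illustration.

```python
import numpy as np

# Minimal single-site sketch of the pruning dynamic program for the second case
# (data, tree, and edge lengths given). Each node is a dict with 'children', a list
# of (child, P) pairs where P is the k x k substitution-probability matrix on the
# edge to that child; leaves have 'children': [] and an observed 'state' index.
# This node representation is invented purely for illustration.

def conditional_likelihoods(node, k):
    """Length-k vector L[s] = P(observed data below node | state at node is s)."""
    if not node['children']:                 # leaf: indicator vector of the observed state
        vec = np.zeros(k)
        vec[node['state']] = 1.0
        return vec
    vec = np.ones(k)
    for child, P in node['children']:
        child_vec = conditional_likelihoods(child, k)
        vec *= P @ child_vec                 # k^2 work per child: sum over the child's states
    return vec

def site_likelihood(root, prior, k):
    """Likelihood of one site: average the root vector over the prior on root states."""
    return float(prior @ conditional_likelihoods(root, k))
```

Summing the per-site log-likelihoods over all m sites gives the stated mnk^2 total.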
Maximum likelihood • Four problems • Given data, tree, edge lengths, and ancestral states, find the likelihood of the tree: polynomial time • Given data, tree, and edge lengths, find the likelihood of the tree: polynomial time via dynamic programming • Given data and tree, find the likelihood (optimizing over edge lengths): unknown complexity
Third case • Assign arbitrary values to all edge lengths except one, t_rv • Optimize the resulting one-parameter function using EM or Newton-Raphson • Repeat for the other edges • Stop when the improvement in likelihood is less than delta
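A schematic version of this coordinate-wise loop is sketched below; a generic bounded one-dimensional optimizer stands in for EM or Newton-Raphson, and tree_log_likelihood is a hypothetical placeholder for the pruning-based likelihood computation.

```python
from scipy.optimize import minimize_scalar

# Schematic coordinate-wise optimization of edge lengths for the third case (data and
# tree fixed). tree_log_likelihood(tree, edge_lengths) is a hypothetical placeholder;
# edge_lengths maps each edge to its current length. A bounded 1-D optimizer is used
# here in place of EM or Newton-Raphson.

def optimize_edge_lengths(tree, edge_lengths, tree_log_likelihood,
                          delta=1e-6, max_length=10.0):
    prev_ll = tree_log_likelihood(tree, edge_lengths)
    while True:
        for edge in edge_lengths:
            # Hold every other edge length fixed and optimize this single length.
            def neg_ll(t, edge=edge):
                trial = dict(edge_lengths)
                trial[edge] = t
                return -tree_log_likelihood(tree, trial)
            best = minimize_scalar(neg_ll, bounds=(1e-8, max_length), method='bounded')
            edge_lengths[edge] = best.x
        curr_ll = tree_log_likelihood(tree, edge_lengths)
        if curr_ll - prev_ll < delta:        # stop once the improvement drops below delta
            return edge_lengths
        prev_ll = curr_ll
```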
Maximum likelihood • Four problems • Given data, tree, edge lengths, and ancestral states, find the likelihood of the tree: polynomial time • Given data, tree, and edge lengths, find the likelihood of the tree: polynomial time via dynamic programming • Given data and tree, find the likelihood (optimizing over edge lengths): unknown complexity • Given data, find the tree with the best likelihood: unknown complexity
ML is a very hard problem • The number of potential trees grows exponentially with the number of taxa, quickly exceeding 10^80, roughly the number of atoms in the universe
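For a concrete sense of the growth: the number of distinct unrooted binary trees on n labeled leaves is (2n-5)!! = 3 x 5 x ... x (2n-5), which the short computation below makes explicit.

```python
# Number of distinct unrooted binary trees on n labeled leaves: (2n-5)!! for n >= 3.

def num_unrooted_binary_trees(n):
    count = 1
    for odd in range(3, 2 * n - 4, 2):   # multiply 3 * 5 * ... * (2n-5)
        count *= odd
    return count

# The count explodes quickly and passes 10^80 at only a few dozen taxa.
for n in (10, 20, 53):
    print(n, num_unrooted_binary_trees(n))
```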
Local search • Greedy-ML tree followed by local search using TBR moves • Software packages: PAUP*, PHYLIP, PhyML, RAxML • We now look at RAxML in detail • Major RAxML innovations • Good starting trees • Subtree rearrangements • Lazy rescoring