  1. http://creativecommons.org/licenses/by-sa/2.0/

  2. CIS786, Lecture 5 Usman Roshan

  3. Previously • DCM decompositions in detail • DCM1 improved significantly over NJ • DCM2 did not always improve over TNT (for solving MP) • The new DCM3 improved over DCM2 but was still not better than TNT • The DCM story continues…

  5. Disk Covering Methods (DCMs) • DCMs are divide-and-conquer booster methods: they divide the dataset into small subproblems, compute subtrees using a given base method, merge the subtrees, and refine the supertree • DCMs to date: • DCM1: for improving the statistical performance of distance-based methods • DCM2: for improving heuristic search for MP and ML • DCM3: the latest, fastest, and best (in accuracy and optimality) DCM

  6. DCM2 technique for speeding up MP searches: 1. Decompose sequences into overlapping subproblems 2. Compute subtrees using a base method 3. Merge subtrees using the Strict Consensus Merge (SCM) 4. Refine to make the tree binary
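To make the four steps concrete, here is a minimal sketch of the generic DCM pipeline. All the callables passed in are hypothetical stand-ins for the real components (e.g., the DCM2 decomposition, TNT, the Strict Consensus Merge, and a refinement step), not any package's actual API.

```python
# Generic DCM pipeline in four steps.  `decompose`, `base_method`,
# `merge`, and `refine` are hypothetical stand-ins for the real
# components described on the slide above.

def dcm(sequences, decompose, base_method, merge, refine):
    subproblems = decompose(sequences)                 # 1. overlapping subsets
    subtrees = [base_method(s) for s in subproblems]   # 2. solve each subset
    supertree = merge(subtrees)                        # 3. e.g., SCM
    return refine(supertree)                           # 4. make the tree binary
```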

  7. DCM1(NJ)

  8. Computing tree for one threshold

  9. Error as a function of evolutionary rate [plot; series: NJ vs. DCM1-NJ+MP]

  10. I. Comparison of DCMs (4583 sequences) Base method is the TNT-ratchet. DCM2 takes almost 10 hours to produce a tree and is too slow to run on larger datasets.

  11. DCM2 decomposition on 500 rbcL genes (Zilla dataset) • Blue: separator • Red: subset 1 • Pink: subset 2 • Visualization produced by the graphviz program, which draws the graph according to specified distances • Nodes: species in the dataset • Distances: p-distances (Hamming) between the DNA sequences • The separator is very large • The subsets are very large • The subsets are scattered

  12. DCM3 decomposition - example

  13. Approximate centroid-edge DCM3 decomposition – example • Locate the centroid edge e (O(n) time) • Set the closest leaves around e to be the separator (O(n) time) • The remaining leaves in the subtrees around e form the subsets (each unioned with the separator)
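A rough sketch of this decomposition, assuming the guide tree is a networkx graph whose degree-1 nodes are the taxa. The centroid-edge search below is a naive quadratic scan (the slide's O(n) version uses a single traversal), and the subset construction is one reading of the recipe above, not the reference implementation.

```python
import networkx as nx  # assumed: the guide tree is an unrooted nx.Graph

def centroid_edge(tree, leaves):
    """Edge whose removal splits the leaf set most evenly.  Naive O(n^2)
    scan; the slide's O(n) version finds it in one traversal."""
    best_edge, best_gap = None, len(leaves) + 1
    for u, v in tree.edges():
        t = tree.copy()
        t.remove_edge(u, v)
        u_side = sum(1 for x in leaves if nx.has_path(t, u, x))
        gap = abs(2 * u_side - len(leaves))
        if gap < best_gap:
            best_edge, best_gap = (u, v), gap
    return best_edge

def dcm3_decompose(tree, leaves, sep_size):
    u, v = centroid_edge(tree, leaves)
    # Separator: the sep_size leaves topologically closest to edge (u, v).
    dist = {x: min(nx.shortest_path_length(tree, u, x),
                   nx.shortest_path_length(tree, v, x)) for x in leaves}
    separator = set(sorted(leaves, key=dist.get)[:sep_size])
    # Subsets: leaves of the subtrees hanging off the centroid edge,
    # each unioned with the separator.
    t = tree.copy()
    t.remove_nodes_from([u, v])
    leaf_set = set(leaves)
    subsets = []
    for comp in nx.connected_components(t):
        comp_leaves = (comp & leaf_set) - separator
        if comp_leaves:
            subsets.append(comp_leaves | separator)
    return separator, subsets
```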

  14. DCM2 decomposition on 500 rbcL genes (Zilla dataset) (repeat of slide 11, shown for side-by-side comparison with the DCM3 decomposition on the next slide)

  15. DCM3 decomposition on 500 rbcL genes (Zilla dataset) • Blue: separator (included in every subset) • Red: subset 2 • Pink: subset 3 • Yellow: subset 4 • Visualization produced by the graphviz program, which draws the graph according to specified distances • Nodes: species in the dataset • Distances: p-distances (Hamming) between the DNA sequences • The separator is small • The subsets are small • The subsets are compact

  16. Comparison of DCMs [chart: average MP score above optimal, shown as a percentage of the optimal, versus hours (0-24); series: TNT, DCM2, DCM3, Rec-DCM3] • Dataset: 4583 actinobacteria ssu rRNA from RDP. Base method is the TNT-ratchet. • DCM2 takes almost 10 hours to produce a tree and is too slow to run on larger datasets • DCM3 followed by TNT-ratchet doesn't improve over TNT • Recursive-DCM3 followed by TNT-ratchet doesn't improve over TNT

  17. Local optima is a problem [figure: cost landscape over the space of phylogenetic trees, marking a local optimum and the global optimum]

  18. Local optima is a problem [chart: average MP score above optimal, shown as a percentage of the optimal, versus hours]

  19. Iterated local search: escape local optima by perturbation [diagram: local search climbs to a local optimum; a perturbation moves away from it; local search restarts from the output of the perturbation]

  20. Iterated local search: Recursive-Iterative-DCM3 [diagram: same scheme, with Recursive-DCM3 as the perturbation; local search restarts from the output of Recursive-DCM3]
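A minimal sketch of the iterated-local-search loop on these two slides. `local_search` (e.g., TNT-ratchet), `perturb` (e.g., Recursive-DCM3), and `score` (the MP score, where lower is better) are hypothetical stand-ins.

```python
# Iterated local search (slides 19-20).  Rec-I-DCM3 instantiates `perturb`
# with Recursive-DCM3 instead of a random kick.  All callables here are
# hypothetical stand-ins for the real components.

def iterated_local_search(start_tree, data, iterations,
                          local_search, perturb, score):
    best = local_search(start_tree, data)           # reach a local optimum
    for _ in range(iterations):
        restart = perturb(best, data)               # e.g., Recursive-DCM3
        candidate = local_search(restart, data)     # climb from the new start
        if score(candidate, data) < score(best, data):  # MP: lower is better
            best = candidate
    return best
```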

  21. Comparison of DCMs for solving MP [chart: average MP score above optimal, shown as a percentage of the optimal, versus hours (0-24); series: TNT, DCM2, DCM3, Rec-DCM3, Rec-I-DCM3] • Rec-I-DCM3(TNT-ratchet) improves upon unboosted TNT-ratchet

  22. I. Comparison of DCMs (13,921 sequences). Base method is the TNT-ratchet. Note the improvement in DCMs as we move from the default to recursion, to iteration, to recursion+iteration.

  27. Improving upon TNT • But what happens after 24 hours? • We studied boosting TNT-ratchet; other TNT heuristics are actually better, and improving upon them may not be possible. Can we improve upon the default TNT search? • We select some real and large datasets (previously we showed that TNT reaches the best known scores on small datasets) • We run 5 trials of TNT for two weeks and 5 trials of Rec-I-DCM3(TNT) for one week on each dataset

  29. 2000 Eukaryotes rRNA

  30. 6722 3-domain+2-org rRNA

  31. 13921 Proteobacteria rRNA

  32. How to run Rec-I-DCM3 then? • Unanswered question: Rec-I-DCM3 improves upon default TNT, but can it also improve upon the better TNT heuristics? We don't know. • Therefore, for a large-scale analysis, first figure out the best settings of the software (e.g., TNT or PAUP*) on the dataset, and then use those settings in conjunction with Rec-I-DCM3 with various subset sizes

  33. Maximum likelihood

  34. Maximum likelihood • Four problems • Given data, tree, edge lengths, and ancestral states, find the likelihood of the tree: polynomial time • Given data, tree, and edge lengths, find the likelihood of the tree: polynomial time via dynamic programming

  35. Second case: an exponential-time summation! Can be solved in polynomial time using dynamic programming, similar to computing MP scores (from Ron Shamir's lectures)

  38. Second case, DP. Complexity? For each node and each site we do k^2 work, so the total is O(mnk^2) for m sites, n nodes, and k character states
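This DP is Felsenstein's pruning algorithm. Below is a minimal per-site sketch under the Jukes-Cantor model (k = 4); the node interface (`children` as a list of (child, branch length) pairs, `state(site)` at leaves) is an assumed stand-in, not any particular package's API.

```python
import math

def jc_transition(t):
    """Jukes-Cantor transition matrix P(i -> j) for branch length t (k = 4)."""
    same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
    diff = 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def site_likelihood(node, site):
    """Conditional likelihoods L_v(s) for one site, computed bottom-up."""
    if not node.children:                    # leaf: indicator on observed state
        L = [0.0] * 4
        L[node.state(site)] = 1.0
        return L
    L = [1.0] * 4
    for child, t in node.children:
        P = jc_transition(t)
        Lc = site_likelihood(child, site)
        for s in range(4):                   # the k^2 work per node and site
            L[s] *= sum(P[s][sp] * Lc[sp] for sp in range(4))
    return L

def tree_log_likelihood(root, n_sites):
    # Sum the per-site logs; the JC stationary distribution is uniform (1/4).
    return sum(math.log(sum(0.25 * l for l in site_likelihood(root, i)))
               for i in range(n_sites))
```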

  42. Maximum likelihood • Four problems • Given data, tree, edge lengths, and ancestral states, find the likelihood of the tree: polynomial time • Given data, tree, and edge lengths, find the likelihood of the tree: polynomial time via dynamic programming • Given data and tree, find the likelihood (maximizing over edge lengths): unknown complexity

  43. Third case • Assign arbitrary values to all edge lengths except one, t_rv • Now optimize this function of one parameter using EM or Newton-Raphson • Repeat for the other edges • Stop when the improvement in likelihood is less than delta
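A sketch of this coordinate-wise scheme, using a one-dimensional Newton-Raphson step with numerical derivatives. `loglik`, mapping a dict of edge lengths to the log-likelihood (e.g., via the pruning algorithm above), is a hypothetical stand-in.

```python
def newton_1d(f, t, steps=20, h=1e-4, t_min=1e-8):
    """Maximize f(t) in one variable by Newton-Raphson (numeric derivatives)."""
    for _ in range(steps):
        d1 = (f(t + h) - f(t - h)) / (2 * h)             # f'
        d2 = (f(t + h) - 2 * f(t) + f(t - h)) / (h * h)  # f''
        if d2 >= 0:               # not locally concave: stop the Newton steps
            break
        t = max(t_min, t - d1 / d2)
    return t

def optimize_branch_lengths(lengths, loglik, delta=1e-6):
    """Sweep over the edges, optimizing one length at a time while holding
    the others fixed, until a full sweep gains less than delta."""
    prev = loglik(lengths)
    while True:
        for e in lengths:
            def f(t, e=e):
                trial = dict(lengths)
                trial[e] = t
                return loglik(trial)
            lengths[e] = newton_1d(f, lengths[e])
        cur = loglik(lengths)
        if cur - prev < delta:
            return lengths
        prev = cur
```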

  44. Maximum likelihood • Four problems • Given data, tree, edge lengths, and ancestral states, find the likelihood of the tree: polynomial time • Given data, tree, and edge lengths, find the likelihood of the tree: polynomial time via dynamic programming • Given data and tree, find the likelihood (maximizing over edge lengths): unknown complexity • Given data, find the tree with the best likelihood: unknown complexity

  45. ML is a very hard problem • The number of potential trees grows exponentially: there are (2n-5)!! unrooted binary trees on n taxa, which quickly exceeds 10^80, the estimated number of atoms in the universe
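The count behind this claim is the standard (2n-5)!! formula; a quick check of where it first passes 10^80:

```python
def num_trees(n):
    """(2n-5)!! = 3 * 5 * ... * (2n-5): unrooted binary trees on n taxa."""
    count = 1
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count

n = 4
while num_trees(n) < 10**80:
    n += 1
print(n)   # 53: with just 53 taxa the tree count already exceeds 10^80
```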

  46. Local search • Greedy-ML tree followed by local search using TBR moves • Software packages: PAUP*, PHYLIP, PhyML, RAxML • We now look at RAxML in detail • Major RAxML innovations • Good starting trees • Subtree rearrangements • Lazy rescoring

  47. TBR [four build slides illustrating a tree bisection and reconnection move]
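A rough sketch of a single random TBR move, assuming the unrooted tree is a networkx graph. Real implementations also contract the degree-2 nodes left by the bisection and enumerate all reconnection pairs rather than sampling one.

```python
import random
import networkx as nx  # assumed: the unrooted tree is an nx.Graph

def tbr_move(tree, rng=random):
    """One random TBR move: bisect the tree by deleting an edge, then
    reconnect the two pieces by a new edge between a point chosen in each."""
    t = tree.copy()
    u, v = rng.choice(list(t.edges()))
    t.remove_edge(u, v)                         # bisection
    pieces = [set(c) for c in nx.connected_components(t)]
    endpoints = []
    for piece in pieces:
        edges = list(t.edges(piece))            # edges inside this piece
        if edges:                               # subdivide a random edge
            a, b = rng.choice(edges)
            mid = ('mid', a, b)                 # fresh node splitting (a, b)
            t.remove_edge(a, b)
            t.add_edge(a, mid)
            t.add_edge(mid, b)
            endpoints.append(mid)
        else:                                   # the piece is a single leaf
            endpoints.append(next(iter(piece)))
    t.add_edge(*endpoints)                      # reconnection
    return t
```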
