
creativecommons/licenses/by-sa/2.0/

http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 3. Usman Roshan. Maximum Parsimony: character-based method; NP-hard (reduction to the Steiner tree problem); widely used in phylogenetics; slower than NJ but more accurate; faster than ML; assumes i.i.d. sites.


Presentation Transcript


  1. http://creativecommons.org/licenses/by-sa/2.0/

  2. CIS786, Lecture 3 Usman Roshan

  3. Maximum Parsimony • Character-based method • NP-hard (reduction to the Steiner tree problem) • Widely used in phylogenetics • Slower than neighbor joining (NJ) but more accurate • Faster than maximum likelihood (ML) • Assumes sites are i.i.d.

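  4. Maximum Parsimony • Input: Set S of n aligned sequences of length k • Output: A phylogenetic tree T • leaf-labeled by the sequences in S • with additional sequences of length k labeling the internal nodes of T, such that $\sum_{(u,v) \in E(T)} H(u,v)$ is minimized, where $H(u,v)$ is the Hamming distance between the sequences labeling $u$ and $v$.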
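
The quantity being minimized is the total number of changes along the edges of T once every node carries a sequence. A minimal sketch of that score follows; the node names and the edge-list representation are hypothetical choices for illustration, not part of the lecture. The labels used are the ones from the four-sequence example (ACT, ACA, GTT, GTA) with internal nodes labeled ACA and GTA.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def mp_score(labels, edges):
    """Sum of Hamming distances over the edges of a fully labeled tree."""
    return sum(hamming(labels[u], labels[v]) for u, v in edges)


# Hypothetical node names: "a".."d" are leaves, "x" and "y" are internal nodes.
labels = {"a": "ACT", "b": "ACA", "c": "GTT", "d": "GTA",
          "x": "ACA", "y": "GTA"}
edges = [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("d", "y")]

print(mp_score(labels, edges))  # 4
```

Scoring a labeled tree this way takes O(nk) time, since a binary tree on n leaves has O(n) edges and each edge costs O(k).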

  5. Maximum parsimony (example) • Input: Four sequences • ACT • ACA • GTT • GTA • Question: which of the three trees has the best MP score?

  6. Maximum Parsimony [Figure: the three possible unrooted trees on the leaves ACT, ACA, GTT, GTA]

  7. Maximum Parsimony [Figure: the three trees with labeled internal nodes and per-edge costs; MP scores 7, 5, and 4. The tree with internal labels ACA and GTA (edge costs 2, 1, 1) has MP score = 4 and is the optimal MP tree]

  8. Maximum Parsimony: computational complexity • Optimal labeling can be computed in linear time O(nk) • Finding the optimal MP tree is NP-hard [Figure: the optimal tree with internal labels ACA and GTA, edge costs 2, 1, 1, MP score = 4]
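
The slide does not name the labeling method. One standard way to reach the O(nk) bound for a fixed tree is Fitch's algorithm, so the sketch below is an assumption about the intended technique. It roots the tree arbitrarily, represents it as nested 2-tuples with sequence strings at the leaves (a hypothetical encoding), and returns only the optimal score.

```python
def fitch_score(tree):
    """Minimum number of changes on a fixed rooted binary tree (Fitch).

    tree: a sequence string (leaf) or a 2-tuple (left, right) of subtrees.
    """
    def visit(node):
        if isinstance(node, str):
            # Leaf: one singleton state set per site, no changes yet.
            return [{c} for c in node], 0
        left_sets, left_cost = visit(node[0])
        right_sets, right_cost = visit(node[1])
        sets, cost = [], left_cost + right_cost
        for a, b in zip(left_sets, right_sets):
            if a & b:
                sets.append(a & b)   # children agree on some state
            else:
                sets.append(a | b)   # disagreement costs one change
                cost += 1
        return sets, cost

    return visit(tree)[1]


print(fitch_score((("ACT", "ACA"), ("GTT", "GTA"))))  # 4
print(fitch_score((("ACT", "GTT"), ("ACA", "GTA"))))  # 5
```

Each internal node does O(k) set operations over a constant-size alphabet, which gives O(nk) overall. The score does not depend on where the root is placed.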

  9. Local search strategies [Figure: cost plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked]

  10. Local search for MP • Determine a candidate solution s • While s is not a local minimum • Find a neighbor s’ of s such that MP(s’)<MP(s) • If found set s=s’ • Else return s and exit • Time complexity: unknown---could take forever or end quickly depending on starting tree and local move • Need to specify how to construct starting tree and local move
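
The loop above can be sketched generically. In this illustration the candidate solutions are integer positions on a small made-up cost landscape rather than trees. The `neighbors` and `cost` callables are placeholders for a local move (NNI, SPR, TBR) and the MP score.

```python
def local_search(s, neighbors, cost):
    """Move to an improving neighbor until none exists; return the local minimum."""
    while True:
        better = next((t for t in neighbors(s) if cost(t) < cost(s)), None)
        if better is None:
            return s          # s is a local minimum
        s = better


# Toy stand-in for tree space: positions on a line with made-up costs.
landscape = [5, 3, 4, 2, 6, 1, 7, 0, 8]


def cost(i):
    return landscape[i]


def neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(landscape)]


print(local_search(0, neighbors, cost))  # 1
```

Starting at position 0 the search stops at position 1 (cost 3), although position 7 has cost 0. This is why the starting point and the choice of move matter.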

  11. Starting tree for MP • Random phylogeny---O(n) time • Greedy-MP

  12. Greedy-MP • Greedy-MP takes O(n^3k) time
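
The transcript does not spell out Greedy-MP. The sketch below assumes the usual stepwise-addition reading: taxa are added one at a time, each on the edge that gives the lowest MP score, and every candidate is rescored from scratch. The nested-tuple tree encoding, the Fitch scoring helper, and the fixed insertion order are all assumptions made for illustration.

```python
def fitch_score(tree):
    """Fitch parsimony score of a rooted binary tree given as nested 2-tuples."""
    def visit(node):
        if isinstance(node, str):
            return [{c} for c in node], 0
        (ls, lc), (rs, rc) = visit(node[0]), visit(node[1])
        sets, cost = [], lc + rc
        for a, b in zip(ls, rs):
            if a & b:
                sets.append(a & b)
            else:
                sets.append(a | b)
                cost += 1
        return sets, cost
    return visit(tree)[1]


def insertions(tree, leaf):
    """Yield every tree obtained by attaching leaf to one edge of tree."""
    yield (tree, leaf)                      # attach above this subtree
    if isinstance(tree, tuple):
        left, right = tree
        for t in insertions(left, leaf):
            yield (t, right)
        for t in insertions(right, leaf):
            yield (left, t)


def greedy_mp(seqs):
    """Add sequences in the given order, each at its best-scoring position."""
    tree = (seqs[0], seqs[1])
    for s in seqs[2:]:
        tree = min(insertions(tree, s), key=fitch_score)
    return tree


tree = greedy_mp(["ACT", "GTT", "ACA", "GTA"])
print(tree, fitch_score(tree))
```

With n insertions, O(n) candidate edges per insertion, and O(nk) to score each candidate, this naive version matches the O(n^3k) bound quoted on the slide. A randomized variant would shuffle the insertion order.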

  13. Faster Greedy MP: 3-way labeling • If we can assign optimal labels to each internal node rooted in each possible way, we can speed up computation by an order of n • Optimal 3-way labeling • Sort all 3n subtrees using bucket sort in O(n) • Starting from small subtrees, compute optimal labelings • For each subtree rooted at v, the optimal labelings of the child nodes are already computed • Total time: O(nk)

  14. Faster Greedy MP: 3-way labeling • If we can assign optimal labels to each internal node rooted in each possible way, we can speed up computation by an order of n • Optimal 3-way labeling • Sort all 3n subtrees using bucket sort in O(n) • Starting from small subtrees, compute optimal labelings • For each subtree rooted at v, the optimal labelings of the child nodes are already computed • Total time: O(nk) • With an optimal labeling it takes O(k) time (constant per site) to compute the MP score for each edge, so the total Greedy-MP time is O(n^2k)
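
The O(n^3k) and O(n^2k) running times quoted for Greedy-MP can be checked by summing the work per insertion. The arithmetic below is a sketch that assumes the i-th taxon is tried on each of the 2i-5 edges of the current unrooted binary tree on i-1 leaves. It further assumes that a candidate costs O(ik) to rescore from scratch, but only O(k) once the 3-way labeling has been recomputed in O(ik) for that round.

```latex
\begin{align*}
T_{\text{naive}}(n,k) &= \sum_{i=4}^{n} (2i-5)\cdot O(ik)
  = O\Big(k \sum_{i=4}^{n} i^{2}\Big) = O(n^{3}k), \\
T_{\text{3-way}}(n,k) &= \sum_{i=4}^{n} \big[\, O(ik) + (2i-5)\cdot O(k) \,\big]
  = O\Big(k \sum_{i=4}^{n} i\Big) = O(n^{2}k).
\end{align*}
```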

  15. Local moves for MP: NNI • For each internal edge we get two different topologies • Neighborhood size is 2n-6
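
The NNI neighborhood can be enumerated directly. This is a minimal sketch that assumes an unrooted binary tree stored as an adjacency dictionary with string node names (a hypothetical representation). For each internal edge (u, v) it swaps one subtree hanging off u with each of the two subtrees hanging off v.

```python
import copy


def nni_neighbors(adj):
    """Return all trees one NNI move away from the unrooted binary tree adj.

    adj maps each node name to the set of its neighbors; internal nodes
    have degree 3 and leaves have degree 1.
    """
    result = []
    internal = {v for v in adj if len(adj[v]) == 3}
    for u in sorted(internal):
        for v in sorted(adj[u] & internal):
            if not u < v:
                continue                      # visit each internal edge once
            _, b = sorted(adj[u] - {v})       # b: the subtree at u that moves
            for x in sorted(adj[v] - {u}):    # swap b with each subtree at v
                new = copy.deepcopy(adj)
                new[u].remove(b); new[u].add(x)
                new[v].remove(x); new[v].add(b)
                new[b].remove(u); new[b].add(v)
                new[x].remove(v); new[x].add(u)
                result.append(new)
    return result


# Five-leaf caterpillar: leaves a, b at x; c at y; d, e at z.
caterpillar = {
    "x": {"a", "b", "y"}, "y": {"x", "c", "z"}, "z": {"y", "d", "e"},
    "a": {"x"}, "b": {"x"}, "c": {"y"}, "d": {"z"}, "e": {"z"},
}
print(len(nni_neighbors(caterpillar)))  # 4, i.e. 2n-6 for n = 5
```

A tree on n leaves has n-3 internal edges and each contributes two rearrangements, which gives the 2n-6 neighborhood size.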

  16. Local moves for MP: SPR • Neighborhood size is quadratic in the number of taxa • Computing the minimum number of SPR moves between two rooted phylogenies is NP-hard

  17. Local moves for MP: TBR • Neighborhood size is cubic in the number of taxa • Computing the minimum number of TBR moves between two unrooted phylogenies is NP-hard

  18. Tree Bisection and Reconnection (TBR)

  19. Tree Bisection and Reconnection (TBR) • Delete an edge

  20. Tree Bisection and Reconnection (TBR)

  21. Tree Bisection and Reconnection (TBR) • Reconnect the trees with a new edge that bifurcates an edge in each tree

  22. Local optima are a problem

  23. Iterated local search: escape local optima by perturbation [Figure labels: local search, local optimum]

  24. Iterated local search: escape local optima by perturbation [Figure labels: local search, local optimum, perturbation, output of perturbation]

  25. Iterated local search: escape local optima by perturbation [Figure labels: local search, local optimum, perturbation, output of perturbation, local search]

  26. ILS for MP • Ratchet • Iterative-DCM3 • TNT

  27. Iterated local search: escape local optima by perturbation [Figure labels: local search, local optimum, perturbation, output of perturbation, local search]
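
The alternation of local search and perturbation can be sketched as a short loop. The example uses a toy integer landscape in place of tree space. The deterministic `perturb` function and the accept-only-if-better rule are assumptions made for illustration, since the slides do not fix either.

```python
def local_search(s, neighbors, cost):
    """Follow improving neighbors until a local minimum is reached."""
    while True:
        better = next((t for t in neighbors(s) if cost(t) < cost(s)), None)
        if better is None:
            return s
        s = better


def iterated_local_search(start, neighbors, cost, perturb, rounds):
    """Repeat: perturb the best local optimum, search again, keep improvements."""
    best = local_search(start, neighbors, cost)
    for _ in range(rounds):
        candidate = local_search(perturb(best), neighbors, cost)
        if cost(candidate) < cost(best):
            best = candidate
    return best


# Toy stand-in for tree space: positions on a line with made-up costs.
landscape = [5, 3, 4, 2, 6, 1, 7, 0, 8]


def cost(i):
    return landscape[i]


def neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(landscape)]


def perturb(i):
    return min(i + 2, len(landscape) - 1)   # hypothetical "kick" of two steps


print(iterated_local_search(0, neighbors, cost, perturb, rounds=5))  # 7
```

Plain local search from position 0 stops at position 1 (cost 3). With perturbation the search reaches position 7 (cost 0), the global minimum of this toy landscape.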

  28. Ratchet • Perturbation input: alignment and phylogeny • Sample with replacement p% of the sites and reweight them to w • Perform local search on the modified dataset starting from the input phylogeny • Reset the alignment to the original after completion and output the local minimum
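
The data perturbation step can be illustrated with per-site weights. In this sketch the function names are hypothetical. It also assumes that a site drawn more than once simply gets weight w, and that unsampled sites keep weight 1, neither of which the slide specifies. The original alignment is never modified, so "resetting" amounts to discarding the weight vector.

```python
import random


def ratchet_weights(k, p, w, rng):
    """Per-site weights: sites drawn (with replacement) get w, all others 1."""
    draws = max(1, round(k * p / 100))
    weights = [1] * k
    for site in rng.choices(range(k), k=draws):
        weights[site] = w
    return weights


def weighted_hamming(a, b, weights):
    """Hamming distance in which a mismatch at site i costs weights[i]."""
    return sum(wt for x, y, wt in zip(a, b, weights) if x != y)


rng = random.Random(0)                       # seeded for repeatability
weights = ratchet_weights(k=20, p=25, w=2, rng=rng)
print(weights)
print(weighted_hamming("ACT", "GTA", [1, 2, 1]))  # 4
```

A ratchet iteration would run local search with the weighted score, then continue searching from the resulting tree with all weights back at 1.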

  29. Ratchet: escaping local minima by data perturbation [Figure labels: local search, local optimum, ratchet search, output of ratchet, local search]

  30. Ratchet: escaping local minima by data perturbation [Figure labels: local search, local optimum, ratchet search, output of ratchet, local search] • But how well does this perform? We have to examine this experimentally on real data

  31. Experimental methodology for MP on real data • Collect alignments of real datasets • Usually constructed using ClustalW • Followed by manual (eye) adjustments • Must be reliable to get sensible tree! • Run methods for a fixed time period • Compare MP scores as a function of time • Examine how scores improve over time • Rate of convergence of different methods (not sequence length but as a function of time)

  32. Experimental methodology for MP on real data • We use rRNA and DNA alignments • Obtained from researchers and public databases • We run iterative improvement and ratchet each for 24 hours beginning from a randomized greedy MP tree • Each method was run five times and average scores were plotted • We use PAUP*---very widely used software package for various types of phylogenetic analysis

  33. 500 aligned rbcL sequences (Zilla dataset)

  34. 854 aligned rbcL sequences

  35. 2000 aligned Eukaryotes

  36. 7180 aligned 3domain

  37. 13921 aligned Proteobacteria

  38. Comparison of MP heuristics • What about other techniques for escaping local minima? • TNT: a combination of divide-and-conquer, simulated annealing, and genetic algorithms • Sectorial search (random): construct ancestral sequence states using parsimony; randomly select a subset of nodes; compute iterative-improvement trees and if better tree found then replace • Genetic algorithm (fuse): Exchange subtrees between two trees to see if better ones are found • Default search: (1) Do sectorial search starting from five randomized greedy MP trees; (2) apply genetic algorithm to find better ones; (3) output best tree

  39. Comparison of MP heuristics • What about other techniques for escaping local minima? • TNT: a combination of divide-and-conquer, simulated annealing, and genetic algorithms • Sectorial search (random): construct ancestral sequence states using parsimony; randomly select a subset of nodes; compute iterative-improvement trees and if better tree found then replace • Genetic algorithm (fuse): Exchange subtrees between two trees to see if better ones are found • Default search: (1) Do sectorial search starting from five randomized greedy MP trees; (2) apply genetic algorithm to find better ones; (3) output best tree How does this compare to PAUP*-ratchet?

  40. Experimental methodology for MP on real data • We use rRNA and DNA alignments • Obtained from researchers and public databases • We run PAUP*-ratchet, TNT-default, and TNT-ratchet each for 24 hours beginning from randomized greedy MP trees • Each method was run five times on each dataset and average scores were plotted

  41. 500 aligned rbcL sequences (Zilla dataset)

  42. 854 aligned rbcL sequences

  43. 2000 aligned Eukaryotes

  44. 7180 aligned 3domain

  45. 13921 aligned Proteobacteria

  46. Can we do even better? Yes! But first let’s look at Disk-Covering Methods

  47. Disk Covering Methods (DCMs) • DCMs are divide-and-conquer booster methods. They divide the dataset into small subproblems, compute subtrees using a given base method, merge the subtrees, and refine the supertree. • DCMs to date • DCM1: for improving statistical performance of distance-based methods. • DCM2: for improving heuristic search for MP and ML • DCM3: latest, fastest, and best (in accuracy and optimality) DCM

  48. DCM2 technique for speeding up MP searches • 1. Decompose sequences into overlapping subproblems • 2. Compute subtrees using a base method • 3. Merge subtrees using the Strict Consensus Merge (SCM) • 4. Refine to make the tree binary

  49. DCM2 decomposition • DCM2 • Input: distance matrix $d$, threshold $q$, sequences $S$ • Algorithm: • 1a. Compute a threshold graph $G$ using $q$ and $d$ • 1b. Perform a minimum weight triangulation of $G$ • Find a separator $X$ in $G$ which minimizes $\max_i |X \cup A_i|$, where $A_1, \dots, A_r$ are the connected components of $G - X$ • Output the subproblems as $X \cup A_i$, $i = 1, \dots, r$.

  50. Threshold graph • Add edges until the graph is connected • Perform minimum weight triangulation • NP-hard • A graph is triangulated if and only if it has a perfect elimination ordering (PEO) • Maximal cliques of a triangulated graph can be determined in linear time • Use a greedy triangulation heuristic: compute a PEO by adding vertices which minimize the largest edge added • Worst case is O(n^3) but fast in practice
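
The first step, adding edges until the graph is connected, can be sketched with a union-find structure. This illustration assumes that edges are added in order of increasing distance, and that the threshold graph then contains every pair at distance at most q. The function name and the return format are hypothetical.

```python
from itertools import combinations


def threshold_graph(d):
    """Return (q, edges): the smallest connecting threshold and all pairs with d <= q.

    d is a symmetric distance matrix given as a list of lists.
    """
    n = len(d)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    pairs = sorted(combinations(range(n), 2), key=lambda e: d[e[0]][e[1]])
    components, q = n, 0
    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:                        # this edge joins two components
            parent[ri] = rj
            components -= 1
            q = d[i][j]
            if components == 1:
                break
    edges = {(i, j) for i, j in pairs if d[i][j] <= q}
    return q, edges


# Made-up example: four taxa at positions 0, 1, 3, 7 on a line.
points = [0, 1, 3, 7]
d = [[abs(a - b) for b in points] for a in points]
print(threshold_graph(d))
```

For this matrix the graph first becomes connected at q = 4, and the pairs at distance 6 and 7 are left out. Sorting the O(n^2) pairs dominates the running time.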
