
NP-hardness and Phylogeny Reconstruction

Tandy Warnow, Department of Computer Sciences, University of Texas at Austin.


Presentation Transcript


  1. NP-hardness and Phylogeny Reconstruction Tandy Warnow Department of Computer Sciences University of Texas at Austin

  2. Phylogeny (Figure from the Tree of Life website, University of Arizona; leaf labels: Orangutan, Human, Gorilla, Chimpanzee.)

  3. DNA Sequence Evolution (Figure: a tree of sequences over time. -3 mil yrs: AAGACTT. -2 mil yrs: AAGGCCT, TGGACTT. -1 mil yrs: AGGGCAT, TAGCCCT, AGCACTT. Today: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT.)

  4. Evolution informs about everything in biology • Big genome sequencing projects just produce data -- so what? • Evolutionary history relates all organisms and genes, and helps us understand and predict: interactions between genes (genetic networks); drug design; functions of genes; influenza vaccine development; origins and spread of disease; origins and migrations of humans.

  5. Molecular Systematics (Figure: input sequences U = AGGGCAT, V = TAGCCCA, W = TAGACTT, X = TGCACAA, Y = TGCGCTT, and a tree whose leaves are drawn in the order X, U, Y, V, W.)

  6. Major methods for phylogeny reconstruction • Biology: Polynomial time methods (good enough for small datasets), and local search heuristics for NP-hard optimization problems • Linguistics: an exact algorithm for an NP-hard optimization problem

  7. Outline for the rest of the talk • NP-hard and polynomial time problems • Phylogeny reconstruction in biology: the NP-hard maximum parsimony problem, and how we can solve it better • Phylogeny reconstruction in linguistics: the NP-hard perfect phylogeny problem, and how we solve it exactly • An open problem from whole genome phylogeny • Thoughts about computational biology, and the role of mathematics in this field

  8. Polynomial-time problems • Shortest path: Given an edge-weighted graph G = (V,E) and two vertices, v and w, find a shortest path from v to w (O(n^2) time) • 2-colorability: Given a graph G = (V,E), determine if we can assign two colors to the vertices of G so that no edge connects vertices of the same color (O(n+m) time) • 3-clique: Given a graph G = (V,E), determine if G contains a 3-clique (O(n^3) time) For all these, n = |V| and m = |E|.
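To make the O(n+m) bound for 2-colorability concrete, here is a minimal sketch using breadth-first search. The function name `two_color` and the edge-list input format are assumptions made for illustration; they are not from the talk.

```python
from collections import deque


def two_color(n, edges):
    """Return a list of colors (0 or 1) for vertices 0..n-1,
    or None if the graph is not 2-colorable."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    color = [None] * n
    for start in range(n):          # handle every connected component
        if color[start] is not None:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if color[v] is None:
                    color[v] = 1 - color[u]   # opposite color of u
                    queue.append(v)
                elif color[v] == color[u]:
                    return None               # odd cycle found
    return color
```

Each vertex is enqueued once and each edge is examined twice, which is where the O(n+m) running time comes from.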

  9. NP-hard problems Some problems seem “hard” to solve: • Hamiltonian path: Given a graph G, determine if G has a simple path going through every vertex • 3-colorability: Given a graph G, determine if G can be properly 3-colored • Max-clique: Given a graph G, find a largest clique in the graph

  10. Technical definition of NP-hard • NP is the class of decision problems for which “yes” instances can be “proven” in polynomial time. (Example: I can prove to you that a graph has a 3-coloring by presenting that 3-coloring to you, so 3-colorability is in NP.) • Definition: A problem X is NP-hard if every problem in NP can be reduced to X in polynomial time (yes-instances mapped to yes-instances, and no-instances mapped to no-instances). For example, 3-colorability is NP-hard, so 2-colorability (which is in NP) can be reduced to it. • Definition: A decision problem X is in P if it can be solved in polynomial time; every problem in P is also in NP.
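The "proof" for a yes-instance of 3-colorability is just the coloring itself, and checking it takes time linear in the size of the graph. A minimal sketch of such a checker follows; the name `verify_3_coloring` and the dictionary representation of the coloring are assumptions for illustration, and edges are assumed to use only listed vertices.

```python
def verify_3_coloring(vertices, edges, coloring):
    """Check a claimed 3-coloring in time linear in |V| + |E|.

    `coloring` maps each vertex to one of the colors 0, 1, 2.
    """
    # Every vertex must receive one of the three allowed colors.
    if any(coloring.get(v) not in (0, 1, 2) for v in vertices):
        return False
    # No edge may join two vertices of the same color.
    return all(coloring[u] != coloring[v] for u, v in edges)
```

Note that this only verifies a given certificate; it says nothing about how to find one, which is the hard part.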

  11. P vs. NP, continued • The “big” question in theoretical computer science is: is it possible to solve an NP-hard problem in polynomial time? • If the answer is “yes”, then every problem in NP can be solved in polynomial time, so P = NP. This is generally not believed.

  12. Coping with NP-hard problems Since NP-hard problems may not be solvable in polynomial time, the options are: • Solve the problem exactly (but use lots of time on some inputs) • Use heuristics which may not solve the problem exactly (and which might be computationally expensive, anyway)

  13. Example: Maximum Clique • Exact solution: find the largest k so that some subset of size k is a clique. Runs in O(n^k) time. • Heuristic: Pick a vertex at random, and greedily assemble a set which is a clique, and stop when you can’t add any more vertices. Repeat until tired (or bored, or running out of time, or …). How do we evaluate the running time, or accuracy?
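The two strategies on this slide can be sketched side by side. This is an illustrative sketch, not code from the talk: the function names, the adjacency-dictionary input, and the fixed number of restarts are all assumptions.

```python
import random
from itertools import combinations


def is_clique(adj, vertices):
    """True if every pair of the given vertices is adjacent."""
    return all(v in adj[u] for u, v in combinations(vertices, 2))


def max_clique_exact(adj):
    """Exhaustive search: try subset sizes from largest to smallest."""
    nodes = list(adj)
    for k in range(len(nodes), 0, -1):
        for subset in combinations(nodes, k):
            if is_clique(adj, subset):
                return set(subset)
    return set()


def greedy_clique(adj, rng):
    """Visit vertices in random order, keeping each one that is
    adjacent to everything kept so far. The result is a maximal
    clique, but not necessarily a maximum one."""
    nodes = list(adj)
    rng.shuffle(nodes)
    clique = set()
    for v in nodes:
        if all(v in adj[u] for u in clique):
            clique.add(v)
    return clique


def best_of_restarts(adj, restarts=20, seed=0):
    """'Repeat until tired': keep the largest clique over several runs."""
    rng = random.Random(seed)
    return max((greedy_clique(adj, rng) for _ in range(restarts)), key=len)
```

The exact search always returns an optimal answer but takes exponential time in the worst case; the heuristic is fast per run but gives no certificate that its answer is optimal.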

  14. General comments for NP-hard optimization problems • Getting exact solutions may not be possible for some problems on some inputs, without spending a great deal of time. • You may not know when you have an optimal solution, if you use a heuristic. • Sometimes exact solutions may not be necessary, and approximate solutions may suffice. (But this may not be true for biology.)

  15. Major methods for phylogeny reconstruction • Biology: Polynomial time methods (good enough for small datasets), and local search heuristics for NP-hard optimization problems • Linguistics: an exact algorithm for an NP-hard optimization problem

  16. Polynomial time methods • Quartet-based methods: construct trees on all 4-leaf subsets, then combine the quartet trees into a tree on the full dataset • Distance-based methods: estimate a pairwise distance matrix d_ij, then find a tree T and edge-weights w(e) so that the tree distances d^T_ij approximate d_ij • For both methods, if there are no errors (in quartet trees or pairwise distances) then the correct tree can be obtained in polynomial time. Otherwise, the optimization problems are NP-hard. • Polytime heuristics along these lines are popular.
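One way the two families of methods connect: when distances are error-free (additive on a tree), the tree on any four leaves can be read off from the three pairwise sums, a fact known as the four-point condition. The sketch below assumes that setting; the function name `quartet_split` and the `frozenset`-keyed distance dictionary are illustrative choices, not the talk's notation.

```python
def quartet_split(dist, a, b, c, d):
    """Infer the quartet tree on leaves a, b, c, d from pairwise distances.

    For additive distances, the pairing with the smallest sum
    dist(x, y) + dist(z, w) is the split xy|zw of the true tree.
    `dist` maps frozenset({x, y}) to the distance between x and y.
    """
    def get(x, y):
        return dist[frozenset((x, y))]

    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    return min(pairings, key=lambda p: get(*p[0]) + get(*p[1]))


# Hypothetical additive distances for the tree ab|cd with leaf edges
# of length 1 and an internal edge of length 2.
example = {
    frozenset("ab"): 2, frozenset("cd"): 2,
    frozenset("ac"): 4, frozenset("ad"): 4,
    frozenset("bc"): 4, frozenset("bd"): 4,
}
```

With noisy distances the smallest sum can point to the wrong split, which is why combining quartet trees becomes an optimization problem.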

  17. Phylogeny reconstruction • In biology, the most popular approaches for reconstructing phylogenetic trees are heuristics for Maximum Parsimony (NP-hard) or Maximum Likelihood (conjectured to be NP-hard) • In historical linguistics, a new approach based upon exactly solving the NP-hard Perfect Phylogeny problem has been useful.

  18. DNA Sequence Evolution (Figure: a tree of sequences over time. -3 mil yrs: AAGACTT. -2 mil yrs: AAGGCCT, TGGACTT. -1 mil yrs: AGGGCAT, TAGCCCT, AGCACTT. Today: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT.)

  19. Maximum Parsimony • Given a set S of strings of the same length over a fixed alphabet, find a tree T leaf-labelled by S and with all internal nodes labelled by strings of the same length over the same alphabet which minimizes the sum of the edge lengths, where the length of an edge is the number of positions at which the strings at its two endpoints differ. • Motivation: seeks to minimize the total number of point mutations needed to explain the data • NP-hard

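  20. Maximum Parsimony (Figure: the input sequences ACT, ACA, GTT, GTA and candidate trees with these sequences at the leaves.)

  21. Maximum Parsimony (Figure: three trees on the leaves ACT, ACA, GTT, GTA with labelled internal nodes and edge lengths, giving MP score = 7, MP score = 5, and MP score = 4. The tree with MP score = 4, with internal nodes labelled ACA and GTA and edge lengths 1, 2, 1, is the optimal MP tree.)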

  22. Maximum Parsimony: computational complexity • For a fixed tree, an optimal labeling of the internal nodes can be computed in linear time, O(nk). • Finding the optimal MP tree is NP-hard. (Figure: the optimal MP tree from the previous slide, MP score = 4.)
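The slide does not name the linear-time labeling method; a standard choice is Fitch's algorithm, sketched here as an assumption rather than as the method used in the talk. The tree is given as nested pairs with sequences at the leaves, and the unrooted tree is rooted on an arbitrary edge, which does not change the score.

```python
def first_leaf(tree):
    """Return any leaf sequence of a nested-tuple tree."""
    while not isinstance(tree, str):
        tree = tree[0]
    return tree


def fitch_site(tree, site):
    """Fitch's bottom-up pass for one site.

    Returns (candidate state set at this node, changes in the subtree).
    """
    if isinstance(tree, str):                 # leaf: a sequence
        return {tree[site]}, 0
    left_states, left_cost = fitch_site(tree[0], site)
    right_states, right_cost = fitch_site(tree[1], site)
    common = left_states & right_states
    if common:                                # children can agree: no change
        return common, left_cost + right_cost
    # children must differ: one extra change on this site
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree):
    """Minimum number of changes over all labelings of the internal nodes."""
    length = len(first_leaf(tree))
    return sum(fitch_site(tree, site)[1] for site in range(length))
```

On the four sequences from the slides, `parsimony_score((("ACT", "ACA"), ("GTT", "GTA")))` evaluates to 4, which agrees with the optimal MP tree shown. Each site is processed in one pass over the tree, so the work is proportional to the number of leaves times the sequence length for a fixed alphabet.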

  23. Solving MP (maximum parsimony) and ML (maximum likelihood) • Why are MP and ML hard? The search space is huge -- there are (2n-5)!! trees on n leaves, it is easy to get stuck in local optima, and there can be many optimal trees. • Why try to solve MP or ML? Our experimental studies show that polynomial time algorithms don’t do as well as MP or ML when trees are big and have high rates of evolution. • Why solve MP and ML well? Because trees can change in biologically significant ways with small changes in objective criterion. (Figure: MP score plotted over the space of phylogenetic trees, showing a local optimum and the global optimum.)
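To get a feel for how fast (2n-5)!! grows, the count can be computed directly. The function name below is an assumption for illustration; the formula counts unrooted binary trees on n labelled leaves, with (2n-5)!! = 1 · 3 · 5 · … · (2n-5).

```python
def num_unrooted_binary_trees(n):
    """Number of unrooted binary trees on n >= 3 labelled leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for odd in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n-5
        count *= odd
    return count


for n in (4, 5, 10, 20):
    print(n, num_unrooted_binary_trees(n))
```

Already at 10 leaves there are 2,027,025 trees, so exhaustive search is only feasible for very small datasets.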

  24. MP/ML heuristics (Figure, “fake study”: performance of a hill-climbing heuristic; MP score of best trees plotted against time.)

  25. Speeding up MP/ML heuristics (Figure, “fake study”: performance of a hill-climbing heuristic compared with the desired performance; MP score of best trees plotted against time.)

  26. Iterative-DCM3 vs Ratchet

  27. Iterative-DCM3 vs Ratchet

  28. Comments • Developing heuristics with good performance takes mathematical insights, but may not involve proofs. Even so, it’s really important. • Extracting information from the set of optimal (and near-optimal) solutions is a major open problem. • Other types of data (gene orders, morphology) present novel challenges. • Reticulate evolution detection and reconstruction is a major open problem.

  29. Ringe-Warnow Phylogenetic Tree of Indo-European

  30. Phylogenies of Languages • Languages evolve over time, just as biological species do (geographic and other separations induce changes that over time make different dialects incomprehensible -- and new languages appear) • The result can be modelled as a rooted tree • The interesting thing is that many characteristics of languages evolve without back mutation or parallel evolution -- so a “perfect phylogeny” is possible!

  31. Historical Linguistic Data • A character is a function that maps a set of languages, L, to a set of states. • Three kinds of characters: • Phonological (sound changes) • Lexical (meanings based on a wordlist) • Morphological (grammatical features)

  32. “Homoplasy-Free” Evolution (perfect phylogenies) YES NO

  33. The Perfect Phylogeny Problem • Given a set S of taxa (species, languages, etc.) determine if a perfect phylogeny T exists for S. • The problem of determining whether a perfect phylogeny exists is NP-hard (McMorris et al. 1994, Steel 1991).

  34. Triangulated Graphs • A graph is triangulated (chordal) if it has no induced simple cycles of size four or more; equivalently, every cycle of size four or more has a chord.
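Whether a graph is triangulated can be tested in polynomial time. The sketch below uses a standard characterization that is not stated on the slide: a graph is triangulated exactly when its vertices can be removed one at a time so that each removed vertex's remaining neighbors form a clique (a "simplicial" vertex). The function name and adjacency-dictionary input are assumptions.

```python
def is_triangulated(adj):
    """Test chordality by repeatedly deleting a simplicial vertex."""
    adj = {v: set(neighbors) for v, neighbors in adj.items()}  # work on a copy
    while adj:
        simplicial = next(
            (v for v, nbrs in adj.items()
             if all(b in adj[a] for a in nbrs for b in nbrs if a != b)),
            None,
        )
        if simplicial is None:
            return False          # no simplicial vertex: a chordless cycle exists
        for u in adj.pop(simplicial):
            adj[u].discard(simplicial)
    return True
```

A 4-cycle fails the test because no vertex has adjacent neighbors; adding one chord makes it pass. This simple version is far from the fastest known test, but it is short and polynomial.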

  35. Triangulating Colored Graphs: An Example (Figure: a graph that can be c-triangulated.)

  36. Triangulating Colored Graphs: An Example (Figure: a graph that can be c-triangulated.)

  37. Triangulating Colored Graphs: An Example (Figure: a graph that cannot be c-triangulated.)

  38. Triangulating Colored Graphs (TCG) • Given a vertex-colored graph G, determine if G can be c-triangulated, that is, made triangulated by adding edges without ever joining two vertices of the same color.

  39. The PP and TCG Problems • Buneman’s Theorem: A perfect phylogeny exists for a set S if and only if the associated character state intersection graph can be c-triangulated. • The PP and TCG problems are polynomially equivalent and NP-hard.

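  40. Solving the PP Problem Using Buneman’s Theorem “Yes” instance of PP, with taxa s1..s4 as rows and characters (c1, c2, c3) as columns: s1 = (3, 2, 1); s2 = (1, 2, 2); s3 = (1, 1, 3); s4 = (2, 1, 1).

  41. Solving the PP Problem Using Buneman’s Theorem “Yes” instance of PP, with taxa s1..s4 as rows and characters (c1, c2, c3) as columns: s1 = (3, 2, 1); s2 = (1, 2, 2); s3 = (1, 1, 3); s4 = (2, 1, 1).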
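The first step in applying Buneman's theorem is to build the character state intersection graph. The sketch below assumes the usual construction, which the transcript does not spell out: one vertex per (character, state) pair, colored by its character, with an edge between two vertices whenever some taxon exhibits both states. Function and variable names are illustrative.

```python
from itertools import combinations


def state_intersection_graph(matrix):
    """Build the character state intersection graph of a taxon-by-character matrix.

    Vertices are (character index, state) pairs; the character index is
    the vertex color. Two vertices are joined if some row (taxon) has
    both states.
    """
    vertices, edges = set(), set()
    for row in matrix:
        nodes = [(char, state) for char, state in enumerate(row)]
        vertices.update(nodes)
        edges.update(combinations(nodes, 2))
    return vertices, edges


# The "yes" instance from the slide: rows s1..s4, columns c1..c3.
pp_instance = [
    (3, 2, 1),
    (1, 2, 2),
    (1, 1, 3),
    (2, 1, 1),
]
vertices, edges = state_intersection_graph(pp_instance)
```

For this instance the graph has 8 vertices and 12 edges, and no edge joins two vertices of the same color, since each taxon has exactly one state per character. Deciding whether the graph can be c-triangulated is the hard step and is not attempted here.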

  42. Some special cases are easy • Binary character perfect phylogeny solvable in linear time • r-state characters solvable in polynomial time for each r (combinatorial algorithm) • Two character perfect phylogeny solvable in polynomial time (produces 2-colored graph) • k-character perfect phylogeny solvable in polynomial time for each k (produces k-colored graphs -- connections to Robertson-Seymour graph minor theory)
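For binary characters, a simple (though not linear-time) test illustrates why this special case is easy. It relies on a standard fact not stated on the slide: a set of binary characters has a perfect phylogeny exactly when no pair of characters shows all four combinations 00, 01, 10, 11 across the taxa. The function name and 0/1 matrix encoding are assumptions.

```python
from itertools import combinations


def has_binary_perfect_phylogeny(matrix):
    """Pairwise compatibility test for binary characters.

    `matrix` is a list of rows (taxa), each a sequence of 0/1 states.
    Runs in time quadratic in the number of characters, which is slower
    than the linear-time bound quoted on the slide.
    """
    if not matrix:
        return True
    num_chars = len(matrix[0])
    for i, j in combinations(range(num_chars), 2):
        seen = {(row[i], row[j]) for row in matrix}
        if len(seen) == 4:        # all four combinations present: conflict
            return False
    return True
```

The pairwise shortcut is specific to two-state characters; with three or more states per character, pairwise compatibility no longer guarantees a perfect phylogeny.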

  43. The Indo-European (IE) Dataset • 24 languages • 22 phonological characters, 15 morphological characters, and 333 lexical characters • Total number of working characters is 390 (multiple character coding, and parallel development) • A phylogenetic tree T on the IE dataset (Ringe, Taylor and Warnow) • T is compatible with all but 22 characters: 16 (18) monomorphic and 6 polymorphic • Resolves most of the significant controversies in Indo-European evolution; shows however that Germanic is a problem (not treelike)

  44. Phylogenetic Tree of the IE Dataset

  45. An open problem to take home… computing the “transposition” distance between two genomes (important in whole genome phylogeny reconstruction)

  46. Genomes As Signed Permutations 1 -5 3 4 -2 -6 or 6 2 -4 -3 5 -1, etc.

  47. Genomes Evolve by Rearrangements • Original: 1 2 3 4 5 6 7 8 9 10 • Inversion (reversal): 1 2 3 -8 -7 -6 -5 -4 9 10 • Transposition: 1 2 3 9 4 5 6 7 8 10 • Inverted transposition: 1 2 3 9 -8 -7 -6 -5 -4 10
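The three rearrangement operations are easy to state on signed permutations stored as Python lists. The function names and the half-open index convention below are assumptions made for this sketch.

```python
def inversion(perm, i, j):
    """Reverse the segment perm[i:j] and flip the sign of each gene."""
    return perm[:i] + [-g for g in reversed(perm[i:j])] + perm[j:]


def transposition(perm, i, j, k):
    """Move the segment perm[i:j] so it follows perm[j:k] (i < j < k)."""
    return perm[:i] + perm[j:k] + perm[i:j] + perm[k:]


def inverted_transposition(perm, i, j, k):
    """Like a transposition, but the moved segment is also inverted."""
    return perm[:i] + perm[j:k] + [-g for g in reversed(perm[i:j])] + perm[k:]


genome = list(range(1, 11))                  # 1 2 3 4 5 6 7 8 9 10
print(inversion(genome, 3, 8))               # segment 4..8 reversed and negated
print(transposition(genome, 3, 8, 9))        # segment 4..8 moved after 9
print(inverted_transposition(genome, 3, 8, 9))
```

With these indices the three calls reproduce the three rearranged genomes shown on the slide.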

  48. An open problem to play with • Given two permutations on 1, 2, …, n, compute the minimum “transposition” distance (unknown computational complexity) • (The corresponding problem for inversion distances involves very beautiful graph theory and algorithms.)
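For anyone who wants to play with the problem on tiny inputs, the distance can be computed by brute force: a breadth-first search over all permutations reachable by transpositions. This is an exponential-time sketch for experimentation only, with illustrative function names; it says nothing about the complexity of the problem itself.

```python
from collections import deque


def transposition_neighbors(perm):
    """Yield every permutation one transposition away from `perm` (a tuple)."""
    n = len(perm)
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n + 1):
                # swap the adjacent blocks perm[i:j] and perm[j:k]
                yield perm[:i] + perm[j:k] + perm[i:j] + perm[k:]


def transposition_distance(source, target):
    """Minimum number of transpositions turning `source` into `target`.

    Explores up to n! permutations, so it is only usable for small n.
    """
    source, target = tuple(source), tuple(target)
    if sorted(source) != sorted(target):
        raise ValueError("inputs must be permutations of the same elements")
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == target:
            return dist[current]
        for nxt in transposition_neighbors(current):
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
```

For example, `transposition_distance([3, 2, 1], [1, 2, 3])` evaluates to 2: no single block swap sorts the reversed permutation, but two do.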

  49. Summary • NP-hard optimization problems abound in phylogeny reconstruction, and in computational biology in general, and need very accurate solutions • Many real problems have beautiful and natural combinatorial and graph-theoretic formulations

  50. Acknowledgements • NSF and the David and Lucile Packard Foundation (funding) • Collaborators Bernard Moret (UNM CS), Donald Ringe (Penn Linguistics) • Students: Usman Roshan and Luay Nakhleh
