
Solving Phylogenetic Trees

Solving Phylogenetic Trees. Benjamin Loyle March 16, 2004 Cse 397 : Intro to MBIO. Table of Contents. Problem & Term Definitions A DCM*-NJ Solution Performance Measurements Possible Improvements. Phylogeny. From the Tree of the Life Website, University of Arizona. Orangutan. Human.


Presentation Transcript


  1. Solving Phylogenetic Trees Benjamin Loyle March 16, 2004 Cse 397 : Intro to MBIO Benjamin Loyle 2004 Cse 397

  2. Table of Contents • Problem & Term Definitions • A DCM*-NJ Solution • Performance Measurements • Possible Improvements

  3. Phylogeny • From the Tree of Life Website, University of Arizona [Figure: phylogeny of Orangutan, Human, Gorilla and Chimpanzee]

  4. DNA Sequence Evolution [Figure: sequence tree by time level. -3 mil yrs: AAGACTT • -2 mil yrs: AAGGCCT, TGGACTT • -1 mil yrs: AGGGCAT, TAGCCCT, AGCACTT • today: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT]

  5. Problem Definition • The Tree of Life • Connecting all living organisms • All encompassing • Find evolution from simple beginnings • Even smaller relations are tough • Recovering the true history is impossible • Instead, infer a possible ancestral history.

  6. So what... • Genome sequencing provides the entire map of a species, so why link them? • We can understand evolution • Viable drug testing and design • Predict the function of genes • Influenza evolution

  7. Why is that a problem? • Over 8 million organisms • Current approaches must solve NP-hard problems • Computing a few hundred species takes years • Error is a very large factor

  8. What do we want? • Input • A collection of nodes, such as taxa or protein strings, to compare in a tree • Output • A topology linking those nodes to each other • When do we want it? • FAST!

  9. Preparing the input • Create a distance matrix • Collect all of the known pairwise distances into an n x n matrix • n is the number of nodes or taxa • Distances are found by sequence comparison

  10. Distance Matrix • Take 5 separate DNA strings • A : GATCCATGA • B : GATCTATGC • C : GTCCCATTT • D : AATCCGATC • E : TCTCGATAG • The distance between A and B is 2 • The distance between A and C is 4 • Here the distance is the number of positions at which two strings differ; the measure is subjective and depends on your criteria.
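
The two distances quoted on this slide match a simple count of mismatched positions. A minimal sketch, assuming that criterion (the function names `mismatch_distance` and `distance_matrix` are made up for illustration), builds the full n x n matrix for the five strings:

```python
from itertools import combinations

# The five DNA strings from the slide.
SEQS = {
    "A": "GATCCATGA",
    "B": "GATCTATGC",
    "C": "GTCCCATTT",
    "D": "AATCCGATC",
    "E": "TCTCGATAG",
}


def mismatch_distance(s, t):
    """Count the positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s, t))


def distance_matrix(seqs):
    """Return a symmetric nested dict d[x][y] of pairwise distances."""
    names = sorted(seqs)
    d = {x: {y: 0 for y in names} for x in names}
    for x, y in combinations(names, 2):
        d[x][y] = d[y][x] = mismatch_distance(seqs[x], seqs[y])
    return d


D = distance_matrix(SEQS)
print(D["A"]["B"], D["A"]["C"])  # 2 4
```

Other criteria (for example, model-corrected distances) would give different numbers for the same strings.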

  11. Distance Matrix • Let's start with an example matrix [Table: 5 x 5 distance matrix over taxa A, B, C, D, E; the cell values are not preserved in this transcript]

  12. Let's make it simple (constrain the input) • Let's keep the distance between nodes within a certain limit • From F -> G • F and G have the largest distance; they are the most dissimilar of any nodes. • This is called the diameter of the tree • Let's keep the length of the input (length of the strings) polynomial in the number of nodes.

  13. ERROR?! • All trees are inferred, so how do you ever know if you’re right? • How accurate do we have to be? • We can create data sets whose true tree is known, test our methods on them, and assume that what works there will work in the real world

  14. Data Sets • JC (Jukes-Cantor) Model • Sites evolve independently • Sites change with the same probability • Changes are single-character changes • e.g. A -> G or T -> C • The number of changes on an edge e is a Poisson variable with expectation λ(e)
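
A rough sketch of evolving one sequence along one edge under these assumptions. It treats the Poisson mean `lam` as the expected number of substitutions per site on that edge, which is an interpretation rather than something the slide states; `poisson` and `evolve_jc` are illustrative names.

```python
import math
import random

BASES = "ACGT"


def poisson(rng, lam):
    """Draw a Poisson(lam) sample using Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def evolve_jc(seq, lam, rng):
    """Evolve seq along one edge of the tree.

    Each site independently receives Poisson(lam) substitutions, and every
    substitution picks one of the three other bases uniformly at random.
    """
    out = []
    for base in seq:
        for _ in range(poisson(rng, lam)):
            base = rng.choice([b for b in BASES if b != base])
        out.append(base)
    return "".join(out)


rng = random.Random(2004)  # fixed seed so runs are repeatable
child = evolve_jc("AAGACTT", 0.3, rng)
print(child)
```

With `lam = 0` the sequence comes back unchanged, and larger values produce more divergent children.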

  15. More Data Sets • K2P (Kimura 2-parameter) Model • Based on the JC Model • Allows different probabilities for transitions and transversions • Transitions (A <-> G and C <-> T) are more likely than transversions (all other changes) • Normally set to twice as likely
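
A small sketch of a single K2P-style substitution. It reads "twice as likely" as: the transition is twice as likely as each of the two possible transversions. That reading, the `ratio` parameter and the name `substitute_k2p` are all assumptions made for illustration.

```python
import random

# Transitions stay within purines (A, G) or within pyrimidines (C, T).
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}


def substitute_k2p(base, rng, ratio=2.0):
    """Replace base with a different one.

    The transition partner gets weight `ratio`; each of the two
    transversion targets gets weight 1.
    """
    transversions = [b for b in "ACGT" if b not in (base, TRANSITION[base])]
    choices = [TRANSITION[base]] + transversions
    return rng.choices(choices, weights=[ratio, 1.0, 1.0])[0]


rng = random.Random(397)
draws = [substitute_k2p("A", rng) for _ in range(3000)]
print({b: draws.count(b) for b in "CGT"})
```

Setting `ratio=1.0` makes all three targets equally likely, which is the JC behaviour.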

  16. Data Use • Using these models we can simulate our own evolution of data. • Start with one “ancestor” and create evolutions • Plug the evolved sequences back in and see if you get the tree you started with

  17. Aspects of Trees • Topology • The way in which nodes are connected to each other • “Are we really connected to apes directly, or just linked long before we could be considered mammals?” • Distance • The sum of the weighted edges on the path from one node to another
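
To make the distance definition concrete, here is a minimal sketch that sums edge weights along the unique path between two nodes of a tree. The tree, its weights and the name `path_distance` are invented for illustration.

```python
def path_distance(edges, start, goal):
    """Sum the edge weights on the unique path from start to goal in a tree.

    edges is a list of (node, node, weight) triples.
    """
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    # Depth-first walk; a tree has no cycles, so skipping the parent suffices.
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for nxt, w in adj[node]:
            if nxt != parent:
                stack.append((nxt, node, dist + w))
    raise ValueError("nodes are not connected")


# Hypothetical tree: leaves A, B, C and internal nodes x, y.
TREE = [("A", "x", 2.0), ("B", "x", 3.0), ("x", "y", 4.0), ("C", "y", 1.0)]
print(path_distance(TREE, "A", "C"))  # 7.0
```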

  18. What can distance tell us? • The distance between nodes IS the evolutionary distance between the nodes • The distance between an ancestor and a leaf (present-day object) can be interpreted as an estimate of the number of evolutionary ‘steps’ that occurred.

  19. Current Techniques • Maximum Parsimony • Minimize the total number of evolutionary events • Find the tree that has the minimum number of changes from ancestors • Maximum Likelihood • Probability based • Which tree is most likely to have produced the current data

  20. More Techniques • Neighbor Joining • Repeatedly joins pairs of leaves (or subtrees) by rules of numerical optimization • It shrinks the distance matrix by treating two ‘neighbors’ as one node

  21. Learning Neighbor Joining • The reason will become apparent later on, but let's learn how to do Neighbor Joining (NJ) [Table: 5 x 5 distance matrix over taxa A, B, C, D, E; the cell values are not preserved in this transcript]

  22. NJ Part 1 • First start with a “star tree” [Figure: star tree with leaves A, B, C, D, E]

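  23. NJ Part 2 • Combine the closest two nodes (from the distance matrix) • In our case it is nodes A and B at distance 3 • (Full NJ first corrects each distance by how far the two nodes are from all the others, so the chosen pair is not always the raw closest pair) [Figure: star tree with A and B joined]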

  24. NJ Part 3 • Repeat this until you have added n-2 internal nodes (3 here) • n-2 internal nodes make it a binary tree, so we only have to include one more node. [Figure: tree on A, B, C, D, E after the second join]
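
The three NJ slides can be sketched in code as follows. This is the standard rate-corrected version of neighbor joining rather than the simplified "closest pair" rule, and the distance values are made up because the slide's matrix is not preserved; lowercase taxon names mark them as hypothetical.

```python
from itertools import combinations


def neighbor_joining(names, d):
    """Return the joins as (left, right, new_node, left_len, right_len).

    d maps frozenset({x, y}) to the distance between x and y.
    """
    d = dict(d)
    nodes = list(names)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        total = {x: sum(d[frozenset((x, y))] for y in nodes if y != x)
                 for x in nodes}
        # Pick the pair minimising (n - 2) * d(a, b) - total(a) - total(b).
        a, b = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - total[p[0]] - total[p[1]],
        )
        dab = d[frozenset((a, b))]
        len_a = dab / 2 + (total[a] - total[b]) / (2 * (n - 2))
        len_b = dab - len_a
        u = f"({a},{b})"
        # Shrink the matrix: the joined pair becomes the single node u.
        for k in nodes:
            if k not in (a, b):
                d[frozenset((u, k))] = (
                    d[frozenset((a, k))] + d[frozenset((b, k))] - dab
                ) / 2
        nodes = [k for k in nodes if k not in (a, b)] + [u]
        joins.append((a, b, u, len_a, len_b))
    return joins


# Hypothetical 5-taxon matrix (upper triangle only).
RAW = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
       ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
       ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
DIST = {frozenset(pair): value for pair, value in RAW.items()}
JOINS = neighbor_joining("abcde", DIST)
print(JOINS[0])  # ('a', 'b', '(a,b)', 2.0, 3.0)
```

In this example d and e are the raw closest pair (distance 3), yet a and b are joined first, which is the effect of the correction. The loop performs n-2 joins; the last two remaining nodes are then connected by a single edge, which this sketch does not return.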

  25. Are we done? • ML and MP, even in heuristic form, take too long for large data sets • NJ has poor topological accuracy, especially for large-diameter trees • We need something that works for large-diameter trees and runs fast.

  26. Here’s what we want • Our Goal • An “Absolute Fast Converging” (afc) Method • A method $\Phi$ is afc for the model $M$ if, for all positive $f$, $g$, $\varepsilon$, there is a polynomial $p$ such that, for all $(T, \{\lambda(e)\}) \in M_{f,g}$, on a set $S$ of $n$ sequences of length at least $p(n)$ generated on $T$, we have $\Pr[\Phi(S) = T] > 1 - \varepsilon$. • Simply: let's get the right tree from sequences of polynomial length, within a degree of error.

  27. A DCM*-NJ Solution • Two-phase construction of a final phylogenetic tree given a distance matrix d. • Phase 1: Create a set of plausible trees for the distance matrix • Phase 2: Find the best-fitting tree

  28. Phase 1 • For each q in {dij}, compute a tree tq • Let T = { tq : q in {dij} }

  29. Finding tq • Step 1: Compute Thresh(d,q) • Step 2: Triangulate Thresh(d,q) • Step 3: Compute an NJ tree for each maximal clique • Step 4: Merge the subtrees into a supertree

  30. What does that mean? • Breaking the problem up • Use a distance threshold to break the problem into a bunch of smaller-diameter subproblems (cliques) • Apply NJ to those cliques • Merge them back

  31. Finding tq (terms) • Threshold Graph • Thresh(d,q) is the threshold graph where (i,j) is an edge if and only if dij <= q.
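
A minimal sketch of this definition, using a hypothetical 4-taxon matrix (the names W, X, Y, Z and all values are assumptions for illustration). `candidate_thresholds` lists the distinct entries of the matrix, which are the values q ranges over in Phase 1.

```python
def threshold_graph(d, q):
    """Return the edges {i, j} of Thresh(d, q), i.e. all pairs with d[i][j] <= q."""
    names = sorted(d)
    return {
        frozenset((i, j))
        for pos, i in enumerate(names)
        for j in names[pos + 1:]
        if d[i][j] <= q
    }


def candidate_thresholds(d):
    """Return the distinct off-diagonal distances, smallest first."""
    names = sorted(d)
    return sorted({d[i][j] for pos, i in enumerate(names) for j in names[pos + 1:]})


# Hypothetical symmetric distance matrix.
D4 = {
    "W": {"W": 0, "X": 5, "Y": 5, "Z": 10},
    "X": {"W": 5, "X": 0, "Y": 15, "Z": 5},
    "Y": {"W": 5, "X": 15, "Y": 0, "Z": 5},
    "Z": {"W": 10, "X": 5, "Y": 5, "Z": 0},
}
print(candidate_thresholds(D4))      # [5, 10, 15]
print(len(threshold_graph(D4, 5)))   # 4
```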

  32. Threshold • Let's bring back our distance matrix and create a threshold graph with q equal to d15, the distance between A and E • So q = 67

  33. Distance Matrix • Our old example matrix [Table: 5 x 5 distance matrix over taxa A, B, C, D, E; the cell values are not preserved in this transcript]

  34. With q = d15 = 67 [Figure: threshold graph on A, B, C, D, E; edge labels shown are 47, 67, 63 and 16]

  35. Triangulating • A graph is triangulated if every cycle with four or more vertices has a chord • That is, an edge joining two nonconsecutive vertices of the cycle. • Our example is already triangulated, but let's look at another

  36. Triangulating • Let's say this is for q = 5 • The edges of length 10 and 15 would not be in the graph • To triangulate this graph you add the edge of length 10. [Figure: vertices W, X, Y, Z joined in a cycle by four edges of length 5, plus two further edges of length 10 and 15]
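
One way to test whether a graph is triangulated is to keep deleting a vertex whose remaining neighbours are all adjacent to each other; the graph is triangulated exactly when this can continue until no vertices are left. The sketch below applies that check to the four-cycle on this slide. Which pair of opposite corners the length-10 edge joins is not recoverable from the transcript, so W-Z is an assumption (either diagonal triangulates the cycle).

```python
def is_triangulated(vertices, edges):
    """Return True if every cycle of four or more vertices has a chord."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    remaining = set(vertices)
    while remaining:
        for v in sorted(remaining):
            nbrs = adj[v] & remaining
            # v can be removed if its remaining neighbours form a clique.
            if all(b in adj[a] for a in nbrs for b in nbrs if a != b):
                remaining.remove(v)
                break
        else:
            return False  # no removable vertex: some long cycle has no chord
    return True


SQUARE = [("W", "X"), ("X", "Z"), ("Z", "Y"), ("Y", "W")]  # the q = 5 graph
print(is_triangulated("WXYZ", SQUARE))                 # False
print(is_triangulated("WXYZ", SQUARE + [("W", "Z")]))  # True
```

This simple version is quadratic or worse in the number of vertices; it is meant to show the idea, not to be fast.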

  37. Maximal Cliques • A maximal clique is a clique that cannot be enlarged by the addition of another vertex. • Recall our original threshold graph, which is triangulated:

  38. Triangulated Threshold Graph • Our old graph [Figure: threshold graph on A, B, C, D, E; edge labels shown are 47, 67, 63 and 16]

  39. Clique • Our maximal cliques would be: {A, B, E} and {C, D}
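
A short sketch of enumerating maximal cliques with the basic Bron-Kerbosch recursion. The adjacency used here is only the edge set implied by the two cliques on this slide (A-B, A-E, B-E and C-D); the function name is illustrative. Bron-Kerbosch works on any graph, so it does not exploit the triangulated structure.

```python
def maximal_cliques(adj):
    """Return all maximal cliques of the graph given as {vertex: set of neighbours}."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to add, x: already-explored vertices.
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


GRAPH = {
    "A": {"B", "E"},
    "B": {"A", "E"},
    "E": {"A", "B"},
    "C": {"D"},
    "D": {"C"},
}
print(sorted(sorted(c) for c in maximal_cliques(GRAPH)))
# [['A', 'B', 'E'], ['C', 'D']]
```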

  40. Create Trees for the Cliques • We have two maximal cliques, so we make two trees: {A, B, E} and {C, D} • How do we make these trees? • Remember NJ?

  41. Tree {A, B, E} and {C, D} [Figure: one tree on leaves A, E, B and one tree on leaves C, D]

  42. Merge your separate trees together. • Create one supertree • This is done by creating a minimum set of edges in the trees and calling that the “backbone” • This is its own doctoral thesis, so let's do a little hand waving

  43. That sounds NP-hard! • Computing the threshold graph is polynomial • Minimally triangulating is NP-hard, but a triangulation can be obtained in polynomial time using a greedy heuristic without too much loss in performance. • Finding maximal cliques is only polynomial if the input graph is triangulated (which it is!). • If all previous steps are done, creating a supertree can be done in polynomial time as well.

  44. Where are we now? • We now have a finalized phylogeny for each threshold, created from smaller trees in our matrix joined together • Remember we started from all possible sizes of smaller trees.

  45. Phase 2 • Which one is right? • Found using the SQS (Short Quartet Support) method • Let T be a tree in the set produced by Phase 1 • Break the data into sets of four taxa • {A, B, C, D} {A, C, D, E} {A, B, D, E}... etc • Reduce the larger tree to hold only one such set • These are called quartets

  46. SQS - A Guide • Q(T) is the set of trees induced by T on each set of four leaves. • Let Qw (a different Q) be the set of quartets with diameter less than or equal to w • Find the maximum w for which every quartet in Qw also appears in Q(T) • This w is the “support” of that tree

  47. SQS - Rephrased • Qw is the set of quartet trees which have a diameter <= w • The support of T is the max w where Qw is a subset of Q(T) • Support is our “quality measure” • What exactly are we measuring?
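
A toy sketch of the support computation, under several assumptions: quartet trees are given directly as splits such as AB|CD, the estimated quartets and their diameters are invented, and w is only tried at the diameters that actually occur. How the quartet trees are estimated in the first place is outside this sketch.

```python
def quartet(a, b, c, d):
    """Represent the quartet tree ab|cd as an unordered pair of unordered pairs."""
    return frozenset((frozenset((a, b)), frozenset((c, d))))


def support(estimated, tree_quartets):
    """Return the largest w such that every estimated quartet with
    diameter <= w is in tree_quartets, or None if no such w exists.

    estimated maps each quartet tree to its diameter.
    """
    best = None
    for w in sorted(set(estimated.values())):
        if all(q in tree_quartets for q, diam in estimated.items() if diam == w):
            best = w
        else:
            break
    return best


# Q(T) for the hypothetical tree ((A,B),C,(D,E)).
Q_T = {
    quartet("A", "B", "C", "D"), quartet("A", "B", "C", "E"),
    quartet("A", "B", "D", "E"), quartet("A", "C", "D", "E"),
    quartet("B", "C", "D", "E"),
}
# Hypothetical estimated quartets with made-up diameters; AE|BC disagrees with T.
ESTIMATED = {
    quartet("A", "B", "C", "D"): 4,
    quartet("A", "B", "D", "E"): 6,
    quartet("A", "C", "D", "E"): 7,
    quartet("A", "E", "B", "C"): 9,
    quartet("B", "C", "D", "E"): 10,
}
print(support(ESTIMATED, Q_T))  # 7
```

The quartet of diameter 10 agrees with T, but it does not raise the support, because the disagreeing quartet of diameter 9 already breaks the subset condition.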

  48. Qw = [Figure: example quartet trees drawn over taxa A, B, C, D, E; the layout is not preserved in this transcript]

  49. SQS Method • Return the tree whose support is the maximum. • If more than one such tree exists, return the tree found first. • This is the tree with the smallest original diameter (remember from Phase 1)

  50. How do we know we’re right? • Compare it to the data set we created • Look at Robinson-Foulds accuracy • Remove one edge in the tree we’ve created. • We now have two trees, and so two sets of leaves • Is there any way to create the same two sets of leaves by removing one edge in the true tree from our data set? • If no, add a ‘point’ of error. • Repeat this for all edges • If the value is not zero, then the trees are not identical
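
The edge-by-edge comparison described here can be sketched as follows. Trees are given as plain edge lists with hypothetical internal node names (x, y, z), and `error_points` counts the edges of the inferred tree whose leaf split has no match in the true tree. The Robinson-Foulds distance as usually defined also counts the mismatches in the other direction.

```python
def leaf_splits(edges):
    """Return the set of leaf bipartitions obtained by removing each edge."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    leaves = frozenset(n for n, nbrs in adj.items() if len(nbrs) == 1)
    splits = set()
    for u, v in edges:
        # Collect everything reachable from u without crossing edge (u, v).
        seen, stack = {u}, [u]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in seen and {x, y} != {u, v}:
                    seen.add(y)
                    stack.append(y)
        side = frozenset(seen) & leaves
        splits.add(frozenset((side, leaves - side)))
    return splits


def error_points(inferred, true):
    """Count splits of the inferred tree that the true tree does not have."""
    return len(leaf_splits(inferred) - leaf_splits(true))


# Hypothetical trees: ((A,B),C,(D,E)) versus ((A,C),B,(D,E)).
T1 = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"),
      ("y", "z"), ("D", "z"), ("E", "z")]
T2 = [("A", "x"), ("C", "x"), ("x", "y"), ("B", "y"),
      ("y", "z"), ("D", "z"), ("E", "z")]
print(error_points(T1, T1), error_points(T1, T2))  # 0 1
```

Edges that lead to a single leaf always match, so only the internal edges can contribute error points.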
