Solving Phylogenetic Trees. Benjamin Loyle March 16, 2004 Cse 397 : Intro to MBIO. Table of Contents. Problem & Term Definitions A DCM*-NJ Solution Performance Measurements Possible Improvements. Phylogeny. From the Tree of the Life Website, University of Arizona. Orangutan. Human.
Solving Phylogenetic Trees Benjamin Loyle March 16, 2004 Cse 397 : Intro to MBIO Benjamin Loyle 2004 Cse 397
Table of Contents • Problem & Term Definitions • A DCM*-NJ Solution • Performance Measurements • Possible Improvements
Phylogeny • [Figure: phylogenetic tree relating Orangutan, Human, Gorilla, and Chimpanzee, from the Tree of Life Website, University of Arizona]
DNA Sequence Evolution • [Figure: a sequence tree rooted at AAGACTT (3 million years ago), diverging into AAGGCCT and TGGACTT (2 million years ago), then AGGGCAT, TAGCCCT, and AGCACTT (1 million years ago), and ending in today's sequences AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, and AGCGCTT]
Problem Definition • The Tree of Life • Connecting all living organisms • All-encompassing • Find evolution from simple beginnings • Even smaller relations are tough • Knowing the true history for certain is impossible • Instead, infer a possible ancestral history.
So what? • Genome sequencing provides an entire map of a species, so why link species together? • We can understand evolution • Viable drug testing and design • Predict the function of genes • Track influenza evolution
Why is that a problem? • Over 8 million organisms • The optimization problems behind current solutions are NP-hard • Computing a tree for a few hundred species takes years • Error is a very large factor
What do we want? • Input • A collection of nodes, such as taxa or protein strings, to compare in a tree • Output • A topology (tree) linking those nodes to each other • When do we want it? • FAST!
Preparing the input • Create a distance matrix • Collect all of the known pairwise distances into a matrix of size n x n • n is the number of nodes or taxa • The distances are found by sequence comparison
Distance Matrix • Take 5 separate DNA strings • A : GATCCATGA • B : GATCTATGC • C : GTCCCATTT • D : AATCCGATC • E : TCTCGATAG • The distance between A and B is 2 • The distance between A and C is 4 • Here the distance is the number of positions at which two strings differ; the measure is subjective and depends on your criteria.
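The counts on this slide can be reproduced with a small sketch. It assumes the criterion is the number of mismatched positions (Hamming distance), which matches the two values given; the function name `hamming` and the variable names are illustrative, not from the talk.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

# The five DNA strings from the slide.
seqs = {
    "A": "GATCCATGA",
    "B": "GATCTATGC",
    "C": "GTCCCATTT",
    "D": "AATCCGATC",
    "E": "TCTCGATAG",
}

names = sorted(seqs)
# n x n distance matrix; row and column order follows `names`.
matrix = [[hamming(seqs[r], seqs[c]) for c in names] for r in names]

print(matrix[0][1])  # distance between A and B
print(matrix[0][2])  # distance between A and C
```

Any other criterion (for example a model-corrected distance) could be swapped in for `hamming` without changing how the matrix is built.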
Distance Matrix • Let's start with an example matrix • [Table: 5 x 5 distance matrix with rows and columns A, B, C, D, E; the cell values were not preserved]
Let's make it simple (constrain the input) • Let's keep the distance between nodes within a certain limit • From F -> G • F and G have the largest distance; they are the most dissimilar of any nodes. • This is called the diameter of the tree • Let's keep the length of the input (the length of the strings) polynomial in the number of nodes.
ERROR?!?!!? • All trees are inferred, so how do you ever know if you're right? • How accurate do we have to be? • We can create synthetic data sets to test the trees our method builds, and assume the method will then work on real-world data
Data Sets • JC (Jukes-Cantor) Model • Sites evolve independently • All sites change with the same probability • Changes are single-character changes • I.e. A -> G or T -> C • The number of changes per site on an edge e is a Poisson random variable with expectation λ(e)
More Data Sets • K2P (Kimura 2-parameter) Model • Based on the JC Model • Allows a different probability for transitions than for transversions • It's more likely for A and G to switch and for C and T to switch (transitions) than for any other change (transversions) • Transitions are normally set to be twice as likely
Data Use • Using these models we can simulate our own evolution of data. • Start with one “ancestor” and create evolutions • Plug the evolved sequences back into the method and see if you get the tree you started with
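A minimal sketch of this kind of simulation, under JC-style assumptions: every site changes independently with the same probability, and a changed site moves to any of the other three bases with equal chance. As a simplification it takes a per-edge change probability `p_change` directly instead of a Poisson rate, and the names `evolve` and `simulate` are hypothetical.

```python
import random

BASES = "ACGT"

def evolve(seq, p_change, rng):
    """Copy seq, changing each site independently with probability p_change."""
    out = []
    for base in seq:
        if rng.random() < p_change:
            # JC-style assumption: every other base is equally likely.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def simulate(ancestor, generations, p_change, seed=0):
    """Each generation, every sequence splits into two mutated children."""
    rng = random.Random(seed)
    level = [ancestor]
    for _ in range(generations):
        level = [evolve(s, p_change, rng) for s in level for _ in range(2)]
    return level

# Two generations from one ancestor give four present-day sequences.
leaves = simulate("AAGACTT", generations=2, p_change=0.2)
print(leaves)
```

Because the branching pattern is known, the tree a method infers from `leaves` can be compared with the true one.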
Aspects of Trees • Topology • The way in which nodes are connected to each other • “Are we really connected to apes directly, or just linked long before we could be considered mammals?” • Distance • The sum of the weighted edges on the path from one node to another
What can distance tell us? • The distance between nodes IS the evolutionary distance between the nodes • The distance between an ancestor and a leaf (a present-day object) can be interpreted as an estimate of the number of evolutionary ‘steps’ that occurred.
Current Techniques • Maximum Parsimony • Minimize the total number of evolutionary events • Find the tree that has a minimum number of changes from ancestors • Maximum Likelihood • Probability based • Which tree is most likely to have produced the current data
More Techniques • Neighbor Joining • Repeatedly joins pairs of leaves (or subtrees) by rules of numerical optimization • It shrinks the distance matrix by considering two ‘neighbors’ as one node
Learning Neighbor Joining • Why will become apparent later on, but let's learn how to do Neighbor Joining (NJ) • [Table: the 5 x 5 example distance matrix with rows and columns A, B, C, D, E; the cell values were not preserved]
NJ Part 1 • First start with a “star tree” • [Figure: star tree with the leaves A, B, C, D, and E all attached to a single center]
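NJ Part 2 • Combine the closest two nodes (from the distance matrix) • In our case it is nodes A and B at distance 3 • Full NJ picks the pair using distances corrected by each node's total distance to all other nodes; in this example the closest pair is used • [Figure: the star tree with A and B joined under a new internal node]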
NJ Part 3 • Repeat this until you have added n-2 internal nodes (3 in our example) • An unrooted binary tree on n leaves has exactly n-2 internal nodes, so here we only have to include one more node. • [Figure: the partially resolved tree on A, B, C, D, and E]
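The three NJ steps can be sketched as follows. This is a minimal illustration, not the talk's implementation: it returns only the topology as nested tuples, it selects pairs with the standard NJ criterion Q(i,j) = (n-2)·d(i,j) - r(i) - r(j) rather than the plain closest pair, and the 5 x 5 matrix values are made up for the example (the slide's own matrix is not available).

```python
def neighbor_joining(names, d):
    """Return an unrooted NJ topology as nested tuples (branch lengths omitted)."""
    nodes = list(names)
    dist = {(a, b): float(d[i][j])
            for i, a in enumerate(names) for j, b in enumerate(names)}
    while len(nodes) > 3:
        n = len(nodes)
        # r[a] is the total distance from a to every other current node.
        r = {a: sum(dist[a, b] for b in nodes) for a in nodes}
        pairs = [(a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]]
        # Standard NJ selection: minimise (n - 2) * d(a, b) - r(a) - r(b).
        a, b = min(pairs, key=lambda p: (n - 2) * dist[p] - r[p[0]] - r[p[1]])
        u = (a, b)  # the new internal node that replaces a and b
        rest = [c for c in nodes if c != a and c != b]
        for c in rest:
            dist[u, c] = dist[c, u] = 0.5 * (dist[a, c] + dist[b, c] - dist[a, b])
        dist[u, u] = 0.0
        nodes = rest + [u]
    return tuple(nodes)

# Illustrative distances only; NOT the matrix from the slides.
names = ["A", "B", "C", "D", "E"]
d = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
tree = neighbor_joining(names, d)
print(tree)
```

Each pass shrinks the working matrix by one node, exactly as described: the two chosen neighbors are treated as a single node from then on. The loop stops at three nodes because joining them at one center completes an unrooted binary tree.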
Are we done? • ML and MP, even in heuristic form, take too long for large data sets • NJ has poor topological accuracy, especially for large-diameter trees • We need something that works for large-diameter trees and runs fast.
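Here's what we want • Our Goal • An “Absolute Fast Converging” (afc) Method • A method $\Phi$ is afc for the model $M$ if, for all positive $f, g, \varepsilon$, there is a polynomial $p$ such that, for all $(T, \{\lambda(e)\}) \in M_{f,g}$, on a set $S$ of $n$ sequences of length at least $p(n)$ generated on $T$, we have $\Pr[\Phi(S) = T] > 1 - \varepsilon$. • Here $M_{f,g}$ is the set of model trees whose edge parameters satisfy $f \le \lambda(e) \le g$ • Simply: let's recover the true tree in polynomial time, from sequences of polynomial length, within a chosen probability of error.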
A DCM*-NJ Solution • Two-phase construction of a final phylogenetic tree given a distance matrix d. • Phase 1 : Create a set of plausible trees for the distance matrix • Phase 2 : Find the best-fitting tree
Phase 1 • For each q in {d_ij}, compute a tree t_q • Let T = { t_q : q in {d_ij} }
Finding t_q • Step 1: Compute Thresh(d,q) • Step 2: Triangulate Thresh(d,q) • Step 3: Compute an NJ tree for each maximal clique • Step 4: Merge the subtrees into a supertree
What does that mean? • Breaking the problem up • Use a distance threshold to break the problem into a bunch of smaller-diameter subproblems (cliques) • Apply NJ to those cliques • Merge the resulting trees back together
Finding t_q (terms) • Threshold Graph • Thresh(d,q) is the threshold graph where (i,j) is an edge if and only if d_ij <= q.
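The definition translates almost directly into code. The sketch below assumes a symmetric distance matrix given as a list of lists; the function name `threshold_graph` and the four-node example values are illustrative.

```python
from itertools import combinations

def threshold_graph(names, d, q):
    """Edges (i, j) of Thresh(d, q): all pairs whose distance is at most q."""
    return [(names[i], names[j])
            for i, j in combinations(range(len(names)), 2)
            if d[i][j] <= q]

# Made-up distances on four nodes, for illustration only.
names = ["W", "X", "Y", "Z"]
d = [
    [0, 5, 5, 10],
    [5, 0, 15, 5],
    [5, 15, 0, 5],
    [10, 5, 5, 0],
]

edges = threshold_graph(names, d, q=5)
print(edges)

# Phase 1 builds one tree per distinct threshold value found in the matrix.
thresholds = sorted({d[i][j] for i, j in combinations(range(len(names)), 2)})
print(thresholds)
```

Raising q only ever adds edges, so the graphs for successive thresholds are nested.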
Threshold • Let's bring back our distance matrix and create a threshold graph with q equal to d_15, the distance between A and E • So q = 67
Distance Matrix • Our old example matrix • [Table: 5 x 5 distance matrix with rows and columns A, B, C, D, E; the cell values were not preserved]
With q = d_15 = 67 • [Figure: threshold graph on the nodes A, B, C, D, and E, with edge weights 16, 47, 63, and 67]
Triangulating • A graph is triangulated if every cycle with four or more vertices has a chord, that is, an edge joining two nonconsecutive vertices of the cycle. • Our example is already triangulated, but let's look at another
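Triangulating • [Figure: a four-cycle on the nodes W, X, Y, and Z whose sides all have length 5, with the two diagonals of length 10 and 15] • Let's say this is for q = 5 • The edges of length 10 and 15 would not be in the graph • To triangulate this graph you add the edge of length 10.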
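One way to sketch a greedy triangulation is the elimination game: repeatedly pick a vertex, connect all of its not-yet-adjacent neighbors, and remove it. The added edges always make the graph triangulated. The vertex-selection rule used here (prefer the vertex whose heaviest required fill edge is lightest) is an assumption chosen to match the slide's example, not necessarily the heuristic the method uses, and the assignment of the lengths 10 and 15 to the diagonals W-Z and X-Y is also assumed.

```python
from itertools import combinations

def greedy_triangulate(vertices, edges, weight):
    """Return the fill-in edges that make the graph triangulated (chordal)."""
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    def missing_pairs(v):
        # Pairs of v's neighbors that are not yet adjacent to each other.
        return [(a, b) for a, b in combinations(sorted(adj[v]), 2)
                if b not in adj[a]]

    def cost(v):
        # Heaviest edge that eliminating v would force us to add.
        return max((weight[frozenset(p)] for p in missing_pairs(v)), default=0)

    fill = []
    remaining = set(vertices)
    while remaining:
        v = min(sorted(remaining), key=cost)  # ties are broken by name
        for a, b in missing_pairs(v):
            adj[a].add(b)
            adj[b].add(a)
            fill.append((a, b))
        for nb in adj[v]:
            adj[nb].discard(v)  # remove v from the working graph
        adj[v] = set()
        remaining.remove(v)
    return fill

vertices = ["W", "X", "Y", "Z"]
cycle = [("W", "X"), ("X", "Z"), ("Z", "Y"), ("Y", "W")]  # the q = 5 graph
weight = {frozenset(e): 5 for e in cycle}
weight[frozenset(("W", "Z"))] = 10  # diagonal, assumed placement
weight[frozenset(("X", "Y"))] = 15  # diagonal, assumed placement

added = greedy_triangulate(vertices, cycle, weight)
print(added)
```

On the four-cycle this adds only the length-10 diagonal, after which every cycle of four vertices has a chord.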
Maximal Cliques • A maximal clique is a clique that cannot be enlarged by the addition of another vertex. • Recall our original threshold graph, which is triangulated:
Triangulated Threshold Graph • Our old graph • [Figure: threshold graph on the nodes A, B, C, D, and E, with edge weights 16, 47, 63, and 67]
Clique • Our maximal cliques would be: • {A, B, E} • {C, D}
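To see how these two cliques fall out of a graph, here is a sketch using the general Bron-Kerbosch enumeration. It is a generic technique rather than the specialized polynomial-time procedure available for triangulated graphs, and the adjacency below is an assumption: it contains only the edges implied by the two cliques listed on the slide.

```python
def maximal_cliques(adj):
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to add, x: already explored vertices
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in sorted(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)

# Assumed edges: A-B, A-E, B-E, and C-D.
adj = {
    "A": {"B", "E"},
    "B": {"A", "E"},
    "E": {"A", "B"},
    "C": {"D"},
    "D": {"C"},
}

cliques = maximal_cliques(adj)
print(cliques)
```

Each clique returned is then handed to NJ as its own small, low-diameter problem.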
Create Trees for the Cliques • We have two maximal cliques, so we make two trees; {A, B, E} and {C, D} • How do we make these trees? • Remember NJ?
Trees for {A, B, E} and {C, D} • [Figure: one NJ tree on the leaves A, B, and E, and one on the leaves C and D]
Merge your separate trees together. • Create one supertree • This is done by creating a minimum set of edges in the trees and calling that the “backbone” • This is its own doctoral thesis, so let's do a little hand waving
That sounds NP-hard! • Computing the threshold graph is polynomial • Finding an optimal triangulation is NP-hard, but a good one can be obtained in polynomial time using a greedy heuristic without too much loss in performance. • Finding the maximal cliques is polynomial only if the input graph is triangulated (which it is!). • If all of the previous steps are done, creating a supertree can be done in polynomial time as well.
Where are we now? • We now have a finalized phylogeny created from the smaller trees in our matrix joined together • Remember that we built one such phylogeny for every threshold q, so we have a whole set of candidate trees.
Phase 2 • Which one is right? • Found using the SQS (Short Quartet Support) method • Let T be a tree from the set built in Phase 1 • Break the data into sets of four taxa • {A, B, C, D} {A, C, D, E} {A, B, D, E}… etc • Restrict the larger tree so that it holds only “one set” • These four-leaf trees are called quartets
SQS - A Guide • Q(T) is the set of trees induced by T on each set of four leaves. • Let Q_w (a different Q) be the set of quartets with diameter less than or equal to w • Find the maximum w for which every quartet in Q_w is also in Q(T) • This w is the “support” of that tree
SQS - Rephrased • Q_w is the set of quartet trees which have a diameter <= w • The support of T is the max w where Q_w is a subset of Q(T) • Support is our “quality measure” • What exactly are we measuring?
Q_w = • [Figure: the quartet trees in Q_w, drawn alongside a tree on the leaves A, B, C, D, and E]
SQS Method • Return the tree whose support is the maximum. • If more than one such tree exists, return the tree found first. • This is the tree with the smallest original diameter (remember from Phase 1)
How do we know we're right? • Compare it to the data set we created • Look at Robinson-Foulds accuracy • Remove one edge in the tree we've created. • We now have two trees, each holding one part of the leaves • Is there any way to create the same two sets of leaves by removing one edge in the true tree from our data set? • If not, add a ‘point’ of error. • Repeat this for all edges • When the value is not zero, the trees are not identical
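The edge-by-edge comparison can be sketched by turning each tree into its set of leaf splits, one per internal edge. The sketch assumes trees are written as nested tuples of leaf names; `rf_errors` counts the splits of the inferred tree that the true tree lacks, as on the slide, while the usual symmetric Robinson-Foulds distance also counts the reverse direction. All names here are illustrative.

```python
def leaf_set(tree):
    """All leaf names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def splits(tree):
    """Non-trivial leaf splits, one per internal edge of the unrooted tree."""
    all_leaves = leaf_set(tree)
    ref = min(all_leaves)  # each split is stored as the side without `ref`
    found = set()

    def walk(node):
        if isinstance(node, str):
            return
        clade = leaf_set(node)
        side = all_leaves - clade if ref in clade else clade
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        for child in node:
            walk(child)

    walk(tree)
    return found

def rf_errors(inferred, true):
    """Edges of the inferred tree whose split does not occur in the true tree."""
    return len(splits(inferred) - splits(true))

inferred = ("D", "E", ("C", ("A", "B")))
true_tree = (("A", "C"), "B", ("D", "E"))
print(rf_errors(inferred, true_tree))
print(rf_errors(inferred, inferred))
```

Here the inferred tree groups A with B while the true tree groups A with C, so one edge is counted as an error; comparing a tree with itself gives zero.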