
Phylogeny II: Parsimony, ML, SEMPHY






Presentation Transcript


  1. Phylogeny II: Parsimony, ML, SEMPHY

  2. Phylogenetic Tree (figure: a tree with a branch, an internal node, and a leaf labelled) • Topology: bifurcating • Leaves - 1…N • Internal nodes - N+1…2N-2

  3. Character Based Methods • We start with a multiple alignment • Assumptions: • All sequences are homologous • Each position in the alignment is homologous • Positions evolve independently • No gaps • Seek to explain the evolution of each position in the alignment

  4. Parsimony • Character-based method • Assumptions: • Independence of characters (no interactions) • The best tree is the one where the fewest changes take place

  5. Simple Example • Suppose we have five species, such that three have ‘C’ and two have ‘T’ at a specified position • The minimal tree has one evolutionary change (figure: a tree with leaves labelled C, T, C, T, C and a single C/T substitution on one branch)

  6. Another Example • Five species: Aardvark, Bison, Chimp, Dog, Elephant • What is the parsimony score of A: CAGGTA B: CAGACA C: CGGGTA D: TGCACT E: TGCGTA?

  7. Evaluating Parsimony Scores • How do we compute the Parsimony score for a given tree? • Traditional Parsimony • Each base change has a cost of 1 • Weighted Parsimony • Each change is weighted by the score c(a,b)

  8. Traditional Parsimony • Solved independently for each position • Linear-time solution (figure: a three-leaf tree with leaves a, g, a; the node joining a and g is assigned the set {a, g}, and the root the set {a})

  9. Evaluating Weighted Parsimony • Dynamic programming on the tree • Initialization: for each leaf i set S(i,a) = 0 if i is labeled by a, otherwise S(i,a) = ∞ • Iteration: if k is a node with children i and j, then S(k,a) = min_b(S(i,b)+c(a,b)) + min_b(S(j,b)+c(a,b)) • Termination: cost of the tree is min_a S(r,a), where r is the root

  10. Cost of Evaluating Parsimony • Score is evaluated on each position independently. Scores are then summed over all positions. • If there are n nodes, m characters, and k possible values for each character, then complexity is O(nmk) • By keeping traceback information, we can reconstruct the most parsimonious values at each ancestral node

  11. Maximum Parsimony 1 2 3 4 5 6 7 8 9 10 Species 1 - A G G G T A A C T G Species 2 - A C G A T T A T T A Species 3 - A T A A T T G T C T Species 4 - A A T G T T G T C G How many possible unrooted trees?

  12. Maximum Parsimony How many possible unrooted trees? 1 2 3 4 5 6 7 8 9 10 Species 1 - A G G G T A A C T G Species 2 - A C G A T T A T T A Species 3 - A T A A T T G T C T Species 4 - A A T G T T G T C G
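For N taxa there are (2N−5)!! = 1·3·5⋯(2N−5) unrooted bifurcating trees, so the four species above give only three topologies to score. A quick sketch of the standard double-factorial count (the function name is ours):

```python
def num_unrooted_trees(n):
    """Number of distinct unrooted bifurcating trees on n >= 3 leaves:
    (2n-5)!! = 1 * 3 * 5 * ... * (2n-5)."""
    count = 1
    for k in range(3, 2 * n - 4, 2):   # multiply 3 * 5 * ... * (2n-5)
        count *= k
    return count

print(num_unrooted_trees(4))    # → 3
print(num_unrooted_trees(10))   # → 2027025
```

The super-exponential growth is why exhaustive search over topologies breaks down so quickly.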

  13. Maximum Parsimony • How many substitutions does each candidate tree require? The maximum-parsimony (MP) tree is the one requiring the fewest

  14. Maximum Parsimony • Position 1 (A A A A) is invariant, so it contributes 0 substitutions on each of the three candidate trees (running scores: 0, 0, 0) 1 2 3 4 5 6 7 8 9 10 1 - A G G G T A A C T G 2 - A C G A T T A T T A 3 - A T A A T T G T C T 4 - A A T G T T G T C G

  15. Maximum Parsimony • Position 2 (G C T A) shows four different states, so it costs 3 substitutions on every topology (running scores: 0 3, 0 3, 0 3) 1 2 3 4 5 6 7 8 9 10 1 - A G G G T A A C T G 2 - A C G A T T A T T A 3 - A T A A T T G T C T 4 - A A T G T T G T C G

  16. Maximum Parsimony • Position 2 (1 - G, 2 - C, 3 - T, 4 - A) on the three possible quartet topologies: each arrangement requires 3 substitutions (figure: the three unrooted trees, each scored 3)

  17. Maximum Parsimony • Position 3 (G G A T) costs 2 substitutions on each topology (running scores: 0 3 2, 0 3 2, 0 3 2) 1 2 3 4 5 6 7 8 9 10 1 - A G G G T A A C T G 2 - A C G A T T A T T A 3 - A T A A T T G T C T 4 - A A T G T T G T C G

  18. Maximum Parsimony • Position 4 (G A A G) costs 2, 2, and 1 substitutions on the three topologies (running scores: 0 3 2 2, 0 3 2 2, 0 3 2 1) 1 2 3 4 5 6 7 8 9 10 1 - A G G G T A A C T G 2 - A C G A T T A T T A 3 - A T A A T T G T C T 4 - A A T G T T G T C G

  19. Maximum Parsimony • Position 4 (1 - G, 2 - A, 3 - A, 4 - G) on the three topologies: grouping taxa 1 and 4 (G, G) against 2 and 3 (A, A) needs only 1 substitution, while the other two topologies need 2 (figure: the three unrooted trees scored 2, 2, and 1)

  20. Maximum Parsimony • Per-position scores and totals for the three topologies: • Tree 1: 0 3 2 2 0 1 1 1 1 3 = 14 • Tree 2: 0 3 2 2 0 1 2 1 2 3 = 16 • Tree 3: 0 3 2 1 0 1 2 1 2 3 = 15

  21. Maximum Parsimony • The most parsimonious tree scores 14, with per-position costs 0 3 2 2 0 1 1 1 1 3 1 2 3 4 5 6 7 8 9 10 1 - A G G G T A A C T G 2 - A C G A T T A T T A 3 - A T A A T T G T C T 4 - A A T G T T G T C G

  22. Searching for Trees

  23. Searching for the Optimal Tree • Exhaustive Search • Very intensive • Branch and Bound • A compromise • Heuristic • Fast • Usually starts with NJ

  24. Phylogenetic Tree Assumptions (figure: a tree with a branch, an internal node, and a leaf labelled) • Topology: bifurcating • Leaves - 1…N • Internal nodes - N+1…2N-2 • Lengths t = {ti} for each branch • Phylogenetic tree = (Topology, Lengths) = (T,t)

  25. Probabilistic Methods • The phylogenetic tree represents a generative probabilistic model (like HMMs) for the observed sequences. • Background probabilities: q(a) • Mutation probabilities: P(a|b, t) • Models for evolutionary mutations: • Jukes-Cantor • Kimura 2-parameter model • Such models are used to derive the mutation probabilities P(a|b, t)

  26. Jukes-Cantor model • A model for mutation rates • Mutation occurs at a constant rate • Each nucleotide is equally likely to mutate into any other nucleotide, with rate α.

  27. Kimura 2-parameter model • Allows a different rate for transitions and transversions.

  28. Mutation Probabilities • The rate matrix R is used to derive the mutation probability matrix S(t): • S is obtained by integration. For Jukes-Cantor: P(a|a,t) = 1/4 + 3/4·e^(−4αt) and P(b|a,t) = 1/4 − 1/4·e^(−4αt) for b ≠ a • q can be obtained by setting t to infinity: q(a) = 1/4
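With the slide's convention (each off-diagonal substitution at rate α), the Jukes-Cantor entries can be checked numerically; a small sketch with a function name of our choosing:

```python
import math

def jc_prob(b, a, t, alpha=1.0):
    """Jukes-Cantor mutation probability P(b | a, t), off-diagonal rate alpha."""
    decay = math.exp(-4 * alpha * t)
    return 0.25 + 0.75 * decay if b == a else 0.25 - 0.25 * decay

# Each row of S(t) sums to 1, and as t -> infinity every entry tends to
# the stationary probability q(a) = 1/4:
row_sum = sum(jc_prob(b, 'A', 0.1) for b in 'ACGT')
long_t = jc_prob('C', 'A', 50.0)
```

At t = 0 the matrix is the identity (no time, no mutation), which is a quick sanity check on the signs of the exponential terms.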

  29. Mutation Probabilities (a 4×4 matrix over A, C, G, T) • Both models satisfy the following properties: • Lack of memory: P(b|a, t+s) = Σ_c P(c|a, t)·P(b|c, s) • Reversibility: P_a·P(b|a,t) = P_b·P(a|b,t) • There exist stationary probabilities {P_a} s.t. P_a = Σ_b P_b·P(a|b,t)

  30. Probabilistic Approach • Given P, q, the tree topology, and the branch lengths, we can compute the joint probability of an assignment to all nodes, e.g. for a tree with root x5, internal node x4 (branch t4), and leaves x1, x2, x3 (branches t1, t2, t3): P(x1,…,x5) = q(x5)·P(x4|x5,t4)·P(x1|x4,t1)·P(x2|x4,t2)·P(x3|x5,t3)

  31. Computing the Tree Likelihood • We are interested in the probability of the observed data given the tree and branch “lengths”: P(x1,…,xN | T, t) • Computed by summing over all assignments to the internal nodes • This can be done efficiently using an upward (leaves-to-root) traversal of the tree.

  32. Tree Likelihood Computation • Define P(Lk|a) = prob. of the leaves below node k given that xk = a • Init: for leaves, P(Lk|a) = 1 if xk = a; 0 otherwise • Iteration: if k is a node with children i and j (on branches of lengths ti and tj), then P(Lk|a) = [Σ_b P(b|a,ti)·P(Li|b)] · [Σ_b P(b|a,tj)·P(Lj|b)] • Termination: the likelihood is Σ_a q(a)·P(Lr|a), where r is the root
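This upward pass (Felsenstein's pruning algorithm) is a short recursion. A sketch under an assumed encoding where each child is a (subtree, branch_length) pair, with a unit-rate Jukes-Cantor P chosen purely for concreteness:

```python
import math

ALPHABET = 'ACGT'

def jc(b, a, t):
    """Jukes-Cantor P(b | a, t) with unit rate (an illustrative choice)."""
    decay = math.exp(-4 * t)
    return 0.25 + 0.75 * decay if b == a else 0.25 - 0.25 * decay

def cond_lik(tree):
    """P(L_k | a) for every a: prob. of the leaves below k given x_k = a."""
    if isinstance(tree, str):                  # leaf: indicator on its label
        return {a: 1.0 if a == tree else 0.0 for a in ALPHABET}
    (left, tl), (right, tr) = tree
    Ll, Lr = cond_lik(left), cond_lik(right)
    # iteration: product of the two children's summed-out contributions
    return {a: sum(jc(b, a, tl) * Ll[b] for b in ALPHABET)
             * sum(jc(b, a, tr) * Lr[b] for b in ALPHABET)
            for a in ALPHABET}

def likelihood(tree, q=0.25):
    """Termination: sum_a q(a) * P(L_root | a), uniform q here."""
    L = cond_lik(tree)
    return sum(q * L[a] for a in ALPHABET)

lik = likelihood((('A', 0.1), ('C', 0.2)))     # one column, two leaves
```

Summing this likelihood over all possible leaf labelings gives 1, a useful sanity check that the recursion and the mutation model are consistent.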

  33. Maximum Likelihood (ML) • Score each tree by its likelihood P(data | T, t) • Assumption of independent positions: the likelihood is a product over alignment columns • Branch lengths t can be optimized • Gradient ascent • EM • We look for the highest-scoring tree • Exhaustive search • Sampling methods (Metropolis)

  34. Optimal Tree Search • Perform search over possible topologies T1, T2, T3, …, Tn • For each candidate topology, run parametric optimization (EM) over the parameter space of branch lengths • The search is prone to local maxima (figure: likelihood surfaces over parameter space for the candidate topologies)

  35. Computational Problem • Such procedures are computationally expensive! • Computation of optimal parameters, per candidate, requires a non-trivial optimization step. • We spend non-negligible computation on a candidate, even if it is a low-scoring one. • In practice, such learning procedures can only consider small sets of candidate structures

  36. Structural EM • Idea: Use parameters found for the current topology to help evaluate new topologies. • Outline: • Perform search in (T, t) space. • Use EM-like iterations: • E-step: use the current solution to compute expected sufficient statistics for all topologies • M-step: select a new topology based on these expected sufficient statistics

  37. The Complete-Data Scenario • Suppose we observe H, the ancestral sequences. • Define: Si,j is a matrix of # of co-occurrences of each pair (a,b) in the taxa i, j • Find: the topology T that maximizes the complete-data score F • F is a linear function of the Si,j

  38. Expected Likelihood • Start with a tree (T0,t0) • Compute the expected complete-data score under the current model • Formal justification: • Define the expected score Q(T,t) (the EM auxiliary function) • Theorem: the improvement in Q over Q(T0,t0) lower-bounds the improvement in likelihood • Consequence: improvement in expected score ⇒ improvement in likelihood

  39. Algorithm Outline • Start from the original tree (T0,t0) • Compute the expected pairwise statistics (weights) for every pair of nodes • Unlike standard EM for trees, we compute all possible pairwise statistics • Time: O(N²M)

  40. Algorithm Outline • Find the pairwise weights from the expected statistics • This stage also computes the best branch length for each pair (i,j)

  41. Algorithm Outline • Find: a maximum spanning tree over the pairwise weights (a fast greedy procedure to find the tree) • Construct the bifurcation T1 from it • By construction: Q(T′,t′) ≥ Q(T0,t0) • Thus, l(T′,t′) ≥ l(T0,t0)
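The "fast greedy procedure" can be any standard maximum-spanning-tree algorithm over the pairwise weights; a generic Kruskal-with-union-find sketch (not SEMPHY's actual implementation) looks like this:

```python
def max_spanning_tree(n, weights):
    """Greedy (Kruskal) maximum spanning tree on nodes 0..n-1.

    `weights` maps pairs (i, j) to the weight W(i, j).  Edges are taken
    in decreasing weight order, skipping any that would close a cycle.
    """
    parent = list(range(n))
    def find(x):                           # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    edges = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:                       # keep edge, merge components
            parent[ri] = rj
            edges.append((i, j))
    return edges

tree = max_spanning_tree(4, {(0, 1): 5.0, (1, 2): 3.0, (0, 2): 1.0,
                             (2, 3): 4.0, (0, 3): 2.0, (1, 3): 0.5})
```

The result is a spanning tree of arbitrary degree, which is why the next step must fix it into a bifurcation.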

  42. Algorithm Outline • Fix the tree into the bifurcation T1: remove redundant nodes, and add nodes to break up nodes of large degree • This operation preserves likelihood: l(T1,t′) = l(T′,t′) ≥ l(T0,t0)

  43. Assessing trees: the Bootstrap • Often we don’t trust that the tree found is the “correct” one. • Bootstrapping: • Sample (with replacement) n positions from the alignment • Learn the best tree for each sample • Look for tree features that are frequent across the sampled trees. • For some models this procedure approximates the tree posterior P(T | X1,…,Xn)
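The resampling step itself is simple: draw n alignment columns with replacement and rebuild each replicate alignment. A sketch (the tree-building call on each replicate is whatever method is being assessed, so it is omitted):

```python
import random

def bootstrap_replicates(alignment, n_reps, seed=0):
    """Yield `n_reps` alignments whose columns are sampled with replacement.

    `alignment` is a list of equal-length strings, one per taxon.  Each
    replicate would be handed to the tree-building method, and the
    frequency of each split across replicates reported as its support.
    """
    rng = random.Random(seed)
    m = len(alignment[0])
    for _ in range(n_reps):
        cols = [rng.randrange(m) for _ in range(m)]   # sample m positions
        yield [''.join(seq[j] for j in cols) for seq in alignment]

# The four-taxon alignment from the parsimony slides:
aln = ["AGGGTAACTG", "ACGATTATTA", "ATAATTGTCT", "AATGTTGTCG"]
reps = list(bootstrap_replicates(aln, 100))
```

Sampling whole columns (not individual characters) preserves the per-position character pattern across taxa, which is what the method's independence assumption requires.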

  44. Algorithm Outline • Optimize the branch lengths of the new tree T1, giving t1 • Thm: l(T1,t1) ≥ l(T0,t0) • These steps are then repeated until convergence
