1 / 65

Optimal Phylogenetic Networks with Constrained and Unconstrained Recombination

Optimal Phylogenetic Networks with Constrained and Unconstrained Recombination. Dan Gusfield UC Davis. Different parts of this work are joint with Satish Eddhu, Charles Langley, Dean Hickerson, Yufeng Wu. Reconstructing the Evolution of Binary Bio-Sequences. Perfect Phylogeny (tree) model

jamessexton
Télécharger la présentation

Optimal Phylogenetic Networks with Constrained and Unconstrained Recombination

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Optimal Phylogenetic Networks with Constrained and Unconstrained Recombination Dan Gusfield UC Davis Different parts of this work are joint with Satish Eddhu, Charles Langley, Dean Hickerson, Yufeng Wu.

  2. Reconstructing the Evolution of Binary Bio-Sequences • Perfect Phylogeny (tree) model • Phylogenetic Networks (DAG) with recombination • Phylogenetic Networks with disjoint cycles: Galled-Trees • Phylogenetic Networks with unconstrained cycles: Blobbed-Trees • Combinatorial Structure and Efficient Algorithms • Efficiently Computed Lower Bounds on the number of recombinations needed

  3. Geneological or Phylogenetic Networks • The major biological motivation comes from genetics and attempts to reconstruct the history of recombination in populations. • Also relates to phylogenetic-based haplotyping. • Some of the algorithmic and mathematical results also have phylogenetic applications, for example in hybrid speciation, lateral gene transfer.

  4. The Perfect Phylogeny Model for binary sequences sites 12345 Ancestral sequence 00000 1 4 Site mutations on edges 3 00010 The tree derives the set M: 10100 10000 01011 01010 00010 2 10100 5 10000 01010 01011 Extant sequences at the leaves

  5. When can a set of sequences be derived on a perfect phylogenywith the all-0 root? Classic NASC: Arrange the sequences in a matrix. Then (with no duplicate columns), the sequences can be generated on a unique perfect phylogeny if and only if no two columns (sites) contain all four pairs: 0,0 and 0,1 and 1,0 and 1,1 This is the 4-Gamete Test

  6. A richer model 10100 10000 01011 01010 00010 10101 added 12345 00000 1 4 3 00010 2 10100 5 pair 4, 5 fails the three gamete-test. The sites 4, 5 ``conflict”. 10000 01010 01011 Real sequence histories often involve recombination.

  7. Sequence Recombination 01011 10100 S P 5 Single crossover recombination 10101 A recombination of P and S at recombination point 5. The first 4 sites come from P (Prefix) and the sites from 5 onward come from S (Suffix).

  8. Network with Recombination 10100 10000 01011 01010 00010 10101 new 12345 00000 1 4 3 00010 2 10100 5 P 10000 01010 The previous tree with one recombination event now derives all the sequences. 01011 5 S 10101

  9. Multiple Crossover Recombination 4-crossovers 2-crossovers = ``gene conversion”

  10. Elements of a Phylogenetic Network (single crossover recombination) • Directed acyclic graph. • Integers from 1 to m written on the edges. Each integer written only once. These represent mutations. • A choice of ancestral sequence at the root. • Every non-root node is labeled by a sequence obtained from its parent(s) and any edge label on the edge into it. • A node with two edges into it is a ``recombination node”, with a recombination point r. One parent is P and one is S. • The network derives the sequences that label the leaves.

  11. A Phylogenetic Network 00000 4 00010 a:00010 3 1 10010 00100 5 00101 2 01100 S b:10010 P S 4 01101 c:00100 p g:00101 3 d:10100 f:01101 e:01100

  12. Which Phylogenetic Networks are meaningful? Given M we want a phylogenetic network that derives M, but which one? A: A perfect phylogeny (tree) if possible. As little deviation from a tree, if a tree is not possible. Use as little recombination or gene-conversion as possible.

  13. Minimizing recombinations • Any set M of sequences can be generated by a phylogenetic network with enough recombinations, and one mutation per site. This is not interesting or useful. • However, the number of (observable) recombinations is small in realistic sets of sequences. ``Observable” depends on n and m relative to the number of recombinations. • Two algorithmic problems: given a set of sequences M, find a phylogenetic network generating M, minimizing the number of recombinations (Hein’s problem). Find a network generating M that has some biologically-motivated structural properties.

  14. Minimization is NP-hard The problem of finding a phylogenetic network that creates a given set of sequences M, and minimizes the number of recombinations, is NP-hard. (Wang et al 2000) (Semple 2004) Wang et al. explored the problem of finding a phylogenetic network where the recombination cycles are required to be node disjoint, if possible. They gave a sufficient but not a necessary condition to recognize cases when this is possible. O(nm + n^4) time.

  15. Recombination Cycles • In a Phylogenetic Network, with a recombination node x, if we trace two paths backwards from x, then the paths will eventually meet. • The cycle specified by those two paths is called a ``recombination cycle”.

  16. Galled-Trees A recombination cycle in a phylogenetic network is called a “gall” if it shares no node with any other recombination cycle. A phylogenetic network is called a “galled-tree” if every recombination cycle is a gall.

  17. A galled-tree generating the sequences generated by the prior network. 4 3 1 s p a: 00010 3 c: 00100 b: 10010 d: 10100 2 5 s p 4 g: 00101 e: 01100 f: 01101

  18. Sales pitch for Galled-Trees Galled-trees represent a small deviation from true trees. There are sufficient applications where it is plausible that a galled tree exists that generates the sequences. Observable recombinations tend to be recent; block structure of human DNA; recombination is sparse, so the true history of observable recombinations may be a galled-tree. The number of recombinations is never more than m/2. Moreover, when M can be derived on a galled-tree, the number of recombinations used is the minimum number over any phylogenetic network, even if multiple cross-overs at a recombination event are counted as a single recombination. A galled-tree for M is ``almost unique” - implications for reconstructing the correct history.

  19. Old (Aug. 2003) Results • O(nm + n^3)-time algorithm to determine whether or not M can be derived on a galled-tree with all-0 ancestral sequence. • Proof that the galled-tree produced by the algorithm is a “nearly-unique” solution. • Proof that the galled-tree (if one exists) produced by the algorithm minimizes the number of recombinations used, over all phylogenetic-networks with all-0 ancestral sequence.

  20. New work We derive the galled-tree results in a more general setting that addresses unconstrained recombination cycles and multiple crossover recombination. This also solves the problem of finding the ``most tree-like” network when a perfect phylogeny is not possible. In this algorithm, no ancestral sequence is known in advance.

  21. Blobbed-trees: generalizing galled-trees • In a phylogenetic network a maximal set of intersecting cycles is called a blob. • Contracting each blob results in a directed, rooted tree, otherwise one of the “blobs” was not maximal. • So every phylogenetic network can be viewed as a directed tree of blobs - a blobbed-tree. The blobs are the non-tree-like parts of the network.

  22. Every network is a tree of blobs. How do the tree parts and the blobs relate? How can we exploit this relationship? Ugly tangled network inside the blob.

  23. Incompatible Sites A pair of sites (columns) of M that fail the 4-gametes test are said to be incompatible. A site that is not in such a pair is compatible.

  24. 1 2 3 4 5 Incompatibility Graph a b c d e f g 0 0 0 1 0 1 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 1 0 0 0 1 1 0 1 0 0 1 0 1 4 M 1 3 2 5 Two nodes are connected iff the pair of sites are incompatible, i.e, fail the 4-gamete test. THE MAIN TOOL: We represent the pairwise incompatibilities in a incompatibility graph.

  25. Simple Fact If sites two sites i and j are incompatible, then the sites must be together on some recombination cycle whose recombination point is between the two sites i and j. (This is a general fact for all phylogenetic networks.) Ex: In the prior example, sites 1, 3 are incompatible, as are 1, 4; as are 2, 5.

  26. A Phylogenetic Network 00000 4 00010 a:00010 3 1 10010 00100 5 00101 2 01100 S b:10010 P S 4 01101 c:00100 p g:00101 3 d:10100 f:01101 e:01100

  27. Simple Consequence of the simple fact All sites on the same (non-trivial) connected component of the incompatibility graph must be on the same blob in any blobbed-tree. Follows by transitivity. So we can’t subdivide a blob into a tree-like structure if it only contains sites from a single connected component of the incompatibility graph.

  28. Key Result about Galls: For galls, the converse of the simple consequence is also true. Two sites that are in different (non-trivial) connected components cannot be placed on the same gall in any phylogenetic network for M. Hence, in a galled-tree T for M each gall contains all and only the sites of one (non-trivial) connected component of the incompatibility graph. All compatible sites can be put on edges outside of the galls. This was the key to the galled-tree solution.

  29. Incompatibility Graph A galled-tree generating the sequences generated by the prior network. 4 4 3 1 3 2 5 1 s p a: 00010 2 c: 00100 b: 10010 d: 10100 2 5 s p 4 g: 00101 e: 01100 f: 01101

  30. Reduced Galled-Trees • A galled-tree is called reducedif every gall only contains conflicted sites. • Theorem: If M can be derived on a galled-tree, it can be derived on a reduced galled-tree. • The number of recombination nodes in a reduced galled-tree equals the number of connected components of the conflict graph, which is the minimum number of recombinations possible in any galled-tree.

  31. The main new result: The Partition Theorem For any set of sequences M, there is a blobbed-tree T(M) that derives M, where each blob contains all and only the sites in one non-trivial connected component of the incompatibility graph. The compatible sites can always be put on edges outside of any blob. Moreover, the tree part of T(M) is unique. And it is easy to find the tree part. This is weaker than the result for galled-trees: it replaces “must” with “can”.

  32. My proof is direct and self-contained, but the result may also be derivable from well-established results about splits, for example, from facts known about Bunemann graphs. Thanks to Mike Steel for pointing this out. Can also be derived from earlier work on splits by Bandelt and Dress.

  33. Algorithmically • Finding the tree part of the blobbed-tree is easy. • Determining the sequences labeling the exterior nodes on any blob is easy. • Determining a “good” structure inside a blob B is the problem of generating the sequences of the exterior nodes of B. • It is easy to test whether the exterior sequences on B can be generated with only a single (possibly multiple-crossover) recombination. The original galled-tree problem is now just the problem of testing whether one single-crossover recombination is sufficient for each blob.

  34. The main open question • Is there always a blobbed-tree where each blob has all and only the sites of a single connected component of the conflict graph, which minimizes the number of recombinations over all possible phylogenetic networks for M? • Proof attempt by splitting blob B into B’ and B”. The catch is the possibility of a recombination node in B that is needed in both B’ and B”.

  35. If Yes, Then any method that computes a lower bound on the number of needed recombinations should be applied separately on sequences defined by the sites on each connected component, and the results added together. This may be true even if the desired result is not. It is worth trying to prove this.

  36. How to arrange the sites on a blob Given a single connected component of the conflict graph with k sites, how do we arrange those k sites to generate the required sequences, using only one (multiple-crossover) recombination, or using a multiple-crossover recombination with the fewest cross-overs?

  37. Arranging the sites We will describe an O(n^3) time method to arrange all of the blobs. O(n^2) time is possible with a more complex method when only single-crossover recombinations are allowed.

  38. A needed fact in words Let Q be a gall for the sites on connected-component C of the conflict graph. Let M[C] be the matrix M restricted to the sites on C. Let LQ[C] be the sequences labeling the nodes of Q, restricted to the sites on C. Claim: The two sets of sequences are identical, i.e., M[C] = LQ[C].

  39. The idea for arranging the sites of C on B: via a short movie

  40. 4 Q 3 001 010 1 s 101 p a 2 110 b c, e, f, g d

  41. 4 Q 3 001 010 1 101 a b c, e, f, g 110 d

  42. 4 B 3 001 010 1 101 a c, e, f, g: 010 b: 101 110 d Blob B minus the recombination node is a perfect phylogeny for M[C] minus the recombinant sequence; all sites are on one or two paths from the root; and the two end sequences of those paths can recombine at point r to recreate the recombinant sequence.

  43. The point If we remove the recombinant node from B, we have a phylogenetic tree (no cycles) for the remaining sequences and hence a perfect phylogenetic tree for the sequences in M[C] minus the recombinant sequence of B. The sites in this tree are on one or two paths. Moreover, the two end sequences on that perfect phylogeny can recombine to create the removed recombinant sequence.

  44. The algorithm for arranging a blob B for C 1.Form the matrix M[C]. 2. For each row of M[C], remove the row, see if there is a perfect phylogeny for the remaining rows. If yes, see if the sites are in one or two paths, and the end sequences can generate the removed row by a recombination. Fact: Every row that works gives a permitted arrangement of the sites on B.

  45. Optimality of Galled-Trees Theorem: (G,H,B,B) The minimum number of recombination nodes in any phylogenetic network for M is at least the number of non-trivial connected components of the incompatibility graph. Hence, when the sequences on each blob on T(M) can be generated with a single recombination node, the blobbed-tree minimizes the number of recombination nodes over all phylogenetic networks and all choices of ancestral sequence. This solves the root-unknown galled-tree problem in polynomial time. Code is on the web.

  46. The number of arrangements on agall (all-0 ancestral sequence) Following the algorithmic approach above, one can prove that the number of arrangements of any gall is at most three, and this happens only if the gall has two sites. If the gall has more than two sites, then the number of arrangements is at most two. If the gall has four or more sites, with at least two sites on each side of the recombination point (not the side of the gall) then the arrangement is forced and unique.

  47. Practical and Accurate Lower Bound Computation • Unconstrained recombination. • Want efficient computation of lower bounds on the minimum number of recombinations needed. Many methods: HK, Haplotype, (history), Connected Comps, Composite Interval Composite, Gall-Interval Composite Method, Subsets, Recmin, Composite ILP. Some are specific for recombination, and some apply to any reticulation event.

  48. The grandfather of all lower bounds - HK 1985 • Arrange the nodes of the incompatibility graph on the line in order that the sites appear in the sequence. • The HK bound is the minimum number of vertical lines needed to cut every edge in the incompatibility graph. Weak bound, but widely used - not only to bound the number of recombinations, but also to suggest their locations.

  49. HK Lower Bound 1 2 3 4 5

More Related