Chaining Algorithms Simplified

This article provides a simplified explanation of chaining algorithms for genome sequence alignment. It covers the basics of genome organization, the need for sequence alignment, and introduces the anchor-based strategy for computing optimal global chains. The article also discusses various types of gap costs and presents both graph-based and sparse dynamic programming solutions for the global chaining problem.

amysliwiec

Presentation Transcript


  1. Chaining Algorithms Simplified. Mohamed Ibrahim Abouelhoda, University of Ulm and Cairo University, 2007

  2. The Genome
     • The total DNA content in a cell → a string over an alphabet of 4 characters {A, C, G, T}
     • Encodes the information necessary for the existence and reproduction of an organism
     (Figure: the sequence ..GCGGGGCGGTTCACGCGGCCGCAATCAACTGCGTGGGGGGGGGGGGG.. with two regions labeled Gene.)

  3. Computational Comparative Genomics
     In-silico identification of regions of similarity and difference between two or more genomes:
     • Similar (conserved) regions → similar (conserved) function necessary for the organism
     • Different regions → traits unique to one organism
     Objectives:
     • Basic science: understanding how genomes function, organize, replicate, and evolve
     • Industry and healthcare: increasing organism productivity and finding drugs
     (Figure: Genome of Organism A compared with Genome of Organism B.)

  4. Sequence Alignment
     Traditional solutions:
     • Local sequence alignment (figure: S1 = TCACAA, S2 = CAAATCA)
     • Global sequence alignment (figure: S1 = TACAATCAA, S2 = TCACTCAC, aligned as T_ACAATCAA over TCAC__TCAC)
     Sequence alignment is not suitable for comparing genomic sequences:
     • Dynamic programming algorithms take … time (k = number of genomes, N = average genome length)

  5. The Anchor-based Strategy
     • Composed of three phases:
       1. Computation of fragments (similar regions) among genomes
       2. Computation of an optimal global chain or chains of colinear non-overlapping fragments
       3. Detailed alignment of the regions between the fragments of the chain
     • A fragment: an exact match (e.g., a maximal exact match) bounded by different characters (figure: Genome 1 = GACCGCGCA, Genome 2 = CACCGCGCT)
     • The fragments can be computed using an index data structure (Abouelhoda, Kurtz, Ohlebusch, 2004)

  6. Fragment Representation
     • Box-line representation
     • Geometric representation: each fragment is represented by a hyper-rectangle in kD space; each axis corresponds to one sequence
     (Figure: both representations for S1 = TACAATCAA and S2 = TCACTCAC.)

  7. The Anchor-based Strategy
     • Composed of three phases:
       1. Computation of fragments (similar regions) among genomes
       2. Computation of an optimal global chain or chains of colinear non-overlapping fragments
       3. Detailed alignment of the regions between the fragments of the chain
     (Figure: fragments between First Genome G1 and Second Genome G2.)

  8. The Anchor-based Strategy
     • Composed of three phases:
       1. Computation of fragments (similar regions) among genomes
       2. Computation of an optimal global chain or chains of colinear non-overlapping fragments
       3. Detailed alignment of the regions between the fragments of the chain
     (Figure: the chained fragments act as anchors between First Genome G1 and Second Genome G2; the region between two anchors is aligned in detail, e.g. ..TCACAATCAA.. over ..TCATA_TCAA..)

  9. Chaining Algorithms
     • The Global Chaining Problem
     • The Local Chaining Problem

  10. The Global Chaining Problem
      Given n weighted fragments from k genomes, find the chain C of colinear non-overlapping fragments whose total score is maximum over all chains:
      score(C) = ∑_i f_i.weight - ∑_i g(f_i, f_{i-1})
      where g(f_i, f_{i-1}) is the gap cost of connecting f_i to f_{i-1}.
      • The weight of a fragment is, for example, its length
      (Figure: fragments shared by First Genome G1, Second Genome G2, and Third Genome G3.)

  11. The Global Chaining Problem
      Given n weighted fragments from k genomes, find the chain C of colinear non-overlapping fragments whose total score is maximum over all chains:
      score(C) = ∑_i f_i.weight - ∑_i g(f_i, f_{i-1})
      where g(f_i, f_{i-1}) is the gap cost of connecting f_i to f_{i-1}.
      • The weight of a fragment is, for example, its length
      (Figure: two consecutive fragments f_i and f_{i+1} of a chain across First Genome G1, Second Genome G2, and Third Genome G3.)
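
The score definition can be illustrated with a minimal Python sketch. The `Fragment` class, its field names, and the `chain_score` helper are hypothetical names chosen for this example, not part of the slides, and the gap cost is passed in as an arbitrary function because the slides define it only later.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Fragment:
    """Hypothetical container for one fragment (names are not from the slides)."""
    start: Tuple[int, ...]  # start position in each of the k genomes
    end: Tuple[int, ...]    # end position in each of the k genomes
    weight: float           # e.g. the fragment length


def chain_score(chain: Sequence[Fragment],
                gap_cost: Callable[[Fragment, Fragment], float]) -> float:
    """Sum of the fragment weights minus the gap cost of each adjacent pair."""
    total = sum(f.weight for f in chain)
    for prev, cur in zip(chain, chain[1:]):
        total -= gap_cost(cur, prev)  # g(f_i, f_{i-1})
    return total


chain = [Fragment((1, 1), (3, 3), 3.0), Fragment((5, 5), (8, 8), 4.0)]
no_gap_score = chain_score(chain, lambda cur, prev: 0.0)    # weights only
unit_gap_score = chain_score(chain, lambda cur, prev: 1.0)  # one unit per link
```

With a zero gap cost the score is just the sum of the weights; any positive gap cost lowers it by one term per pair of adjacent fragments.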

  12. Notions
      • A fragment f_i is represented as a hyper-rectangle in a k-dimensional space.
      • A fragment f_i is identified with its start and end points: start(f_i) and end(f_i).
      • We add two imaginary fragments O and t with weight zero.
      • Any two fragments f_i and f_{i+1} in the chain must be colinear and non-overlapping, written f_i << f_{i+1}: end(f_i).x_r < start(f_{i+1}).x_r for all r, 0 < r <= k
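
The relation << translates directly into a componentwise comparison. The following is a small Python sketch; the function name `precedes` and the use of plain tuples for points are assumptions made for illustration.

```python
from typing import Tuple

Point = Tuple[int, ...]


def precedes(end_f: Point, start_g: Point) -> bool:
    """f << g: end(f).x_r < start(g).x_r must hold in every dimension r."""
    return all(e < s for e, s in zip(end_f, start_g))


colinear = precedes((3, 4), (5, 6))     # strictly smaller in both dimensions
crossing = precedes((3, 7), (5, 6))     # violates the order in the second genome
overlapping = precedes((5, 4), (5, 6))  # equality is not enough, "<" is strict
```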

  13. Types of Gap Costs
      • The gap costs g can be described geometrically: L1 and L∞
      (Figure: example alignments of the regions between two ACC fragments, with gap regions XX, YYY, and ZZ, illustrating the L1 and L∞ gap costs.)
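
The slide shows the two gap costs only as figures. A minimal Python sketch, assuming that the gap between two consecutive fragments is measured as the distance between the end point of the first and the start point of the second (the function names and this exact definition are assumptions, not taken from the transcript):

```python
from typing import Tuple

Point = Tuple[int, ...]


def gap_l1(end_prev: Point, start_next: Point) -> int:
    """Assumed L1 gap cost: sum of the coordinate differences."""
    return sum(abs(s - e) for e, s in zip(end_prev, start_next))


def gap_linf(end_prev: Point, start_next: Point) -> int:
    """Assumed L-infinity gap cost: largest coordinate difference."""
    return max(abs(s - e) for e, s in zip(end_prev, start_next))


l1_example = gap_l1((3, 3), (6, 7))      # 3 + 4
linf_example = gap_linf((3, 3), (6, 7))  # max(3, 4)
```

Under this reading, L1 charges for every unmatched position in every genome, while L∞ charges only for the longest of the unmatched regions.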

  14. A Graph-based Solution
      • The score of a chain C is score(C) = ∑_i [f_i.weight - g(f_i, f_{i-1})]
      • An optimal chain is a chain of maximum score
      • Build a directed acyclic graph whose vertices are the fragments, with an edge from f_i to f_j whenever f_i << f_j; a highest-scoring path in the graph is an optimal chain
      • The maximum score can be computed by the recurrence f_j.score = f_j.weight + max{f_i.score - g(f_i, f_j) : f_i << f_j}
      • A graph-based solution takes O(n^2) time.
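
The quadratic recurrence can be sketched in Python as follows. This is an illustration under stated assumptions rather than the author's implementation: fragments are `(start, end, weight)` tuples, the caller supplies an origin point below all fragments and a terminus point above them (standing in for O and t), and `global_chain` and `l1_gap` are hypothetical names.

```python
from typing import Callable, List, Tuple

Point = Tuple[int, ...]
Frag = Tuple[Point, Point, float]  # (start, end, weight); layout is an assumption


def precedes(f: Frag, g: Frag) -> bool:
    """f << g: end(f) is strictly smaller than start(g) in every dimension."""
    return all(e < s for e, s in zip(f[1], g[0]))


def global_chain(frags: List[Frag], origin: Point, terminus: Point,
                 gap_cost: Callable[[Frag, Frag], float]) -> Tuple[float, List[Frag]]:
    """O(n^2) evaluation of f_j.score = f_j.weight + max{f_i.score - g(f_i, f_j)}."""
    # O and t are zero-weight fragments. Sorting by start point is a valid
    # processing order because f_i << f_j implies start(f_i) < start(f_j).
    nodes = ([(origin, origin, 0.0)] + sorted(frags, key=lambda f: f[0])
             + [(terminus, terminus, 0.0)])
    n = len(nodes)
    score = [float("-inf")] * n
    pred = [-1] * n
    score[0] = 0.0
    for j in range(1, n):
        for i in range(j):
            if score[i] == float("-inf") or not precedes(nodes[i], nodes[j]):
                continue
            cand = score[i] - gap_cost(nodes[i], nodes[j]) + nodes[j][2]
            if cand > score[j]:
                score[j], pred[j] = cand, i
    # Trace back from t towards O, skipping the two imaginary fragments.
    chain, j = [], pred[n - 1]
    while j > 0:
        chain.append(nodes[j])
        j = pred[j]
    return score[n - 1], chain[::-1]


def l1_gap(f: Frag, g: Frag) -> float:
    """Assumed L1 gap cost between end(f) and start(g)."""
    return float(sum(abs(s - e) for e, s in zip(f[1], g[0])))


frags = [((1, 1), (3, 3), 3.0), ((2, 5), (4, 7), 3.0),
         ((5, 5), (8, 8), 4.0), ((6, 2), (7, 3), 2.0)]
score0, chain0 = global_chain(frags, (0, 0), (10, 10), lambda f, g: 0.0)
score1, chain1 = global_chain(frags, (0, 0), (10, 10), l1_gap)
```

The two nested loops make the quadratic cost explicit: every fragment inspects every earlier fragment, which is the step the later slides replace with a range maximum query.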

  15. Sparse Dynamic Programming
      • Chaining algorithms are sparse dynamic programming (D. Eppstein, R. Giancarlo, Z. Galil, and G.F. Italiano, 1992)
      • The maximum score can be computed by the recurrence f_j.score = f_j.weight + max{f_i.score - g(f_i, f_j) : f_i << f_j}
        where f_i << f_j means end(f_i).x_r < start(f_j).x_r for all r, 0 < r <= k, and g(f_i, f_j) is the gap cost of connecting f_i to f_j
      (Figure: dynamic programming matrix over two sequences, with rows indexed by i and columns by j.)

  16. Sparse Dynamic Programming
      • Chaining algorithms are sparse dynamic programming (D. Eppstein, R. Giancarlo, Z. Galil, and G.F. Italiano, 1992)
      • The maximum score can be computed by the recurrence f_j.score = f_j.weight + max{f_i.score - g(f_i, f_j) : f_i << f_j}
        where f_i << f_j means end(f_i).x_r < start(f_j).x_r for all r, 0 < r <= k, and g(f_i, f_j) is the gap cost of connecting f_i to f_j
      • The string characters are not given, only positions
      • In extreme cases, you can enumerate all matches and consider everything else as gaps
        → sparse dynamic programming (chaining) is used to compute the alignment directly
        → selecting the gap cost function is critical
      (Figure: the same matrix with the characters replaced by placeholders X and Y, indexed by i and j.)

  17. A Geometric-based Solution
      • The max function in the recurrence f_j.score = f_j.weight + max{f_i.score - g(f_i, f_j) : f_i << f_j} can be replaced by a range maximum query (RMQ): f_j.score = f_j.weight + RMQ(O, start(f_j))
      • RMQ (Range Maximum Query): retrieves the fragment f_i whose end point lies in the hyper-rectangle bounded by O and start(f_j) such that f_i.score - g(f_i, f_j) is maximum.
      • If the gap cost is zero, an RMQ returns the end point of the fragment f_i such that f_i.score is maximum.
      • If all the fragments have the same weight (length) and there is no gap cost → we are solving the LCS problem

  18. The Algorithm without gap cost
      • Line-sweep algorithm:
        1. Sort the start and end points of the fragments w.r.t. x1.
        2. If a start point of a fragment, say f_j, is scanned, apply RMQ(O, (start(f_j).x1, start(f_j).x2, …, start(f_j).xk)) to the set of active end points and update the score of the end point of fragment f_j.
        3. Otherwise, add the end point to the set of active end points (already scanned end points).
      • Because of the sorting step, the dimension of the RMQ can be reduced to k-1 → we can use RMQ(O, (start(f_j).x2, …, start(f_j).xk))
      • For comparing two sequences, the RMQ dimension is 1 → we can use priority queues to find an optimal fragment in O(log log m) time
      • But the complexity is dominated by the sorting, unless the fragments are computed in order.
      • A priority queue is complicated to implement
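
The line sweep for two sequences (k = 2) and zero gap cost can be sketched in Python as follows. Note the substitution: instead of the O(log log m) priority queue mentioned on the slide, this sketch uses a Fenwick tree that answers prefix-maximum queries in O(log n), which is simpler to write and still gives O(n log n) overall. The names `MaxFenwick` and `chain_sweep`, the tuple layout, and the assumption of positive weights are all illustrative choices.

```python
from bisect import bisect_left
from typing import List, Tuple

Frag = Tuple[Tuple[int, int], Tuple[int, int], float]  # (start, end, weight)


class MaxFenwick:
    """Fenwick tree answering prefix-maximum queries over ranks 1..n."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.best = [(0.0, -1)] * (n + 1)  # (score, fragment index); -1 stands for O

    def update(self, pos: int, item: Tuple[float, int]) -> None:
        while pos <= self.n:
            if item > self.best[pos]:
                self.best[pos] = item
            pos += pos & -pos

    def query(self, pos: int) -> Tuple[float, int]:
        result = (0.0, -1)
        while pos > 0:
            if self.best[pos] > result:
                result = self.best[pos]
            pos -= pos & -pos
        return result


def chain_sweep(frags: List[Frag]) -> Tuple[float, List[int]]:
    """Best zero-gap-cost chain for k = 2; returns (score, fragment indices)."""
    if not frags:
        return 0.0, []
    ys = sorted({f[1][1] for f in frags})  # distinct x2 values of the end points
    rank = {y: r + 1 for r, y in enumerate(ys)}
    # Kind 0 (start) sorts before kind 1 (end) at equal x1, which keeps "<" strict.
    events = sorted([(f[0][0], 0, j) for j, f in enumerate(frags)] +
                    [(f[1][0], 1, j) for j, f in enumerate(frags)])
    tree = MaxFenwick(len(ys))
    score = [0.0] * len(frags)
    pred = [-1] * len(frags)
    for _, kind, j in events:
        start, end, weight = frags[j]
        if kind == 0:
            # 1-D RMQ: best active end point whose x2 is below start(f_j).x2.
            best, i = tree.query(bisect_left(ys, start[1]))
            score[j], pred[j] = weight + best, i
        else:
            tree.update(rank[end[1]], (score[j], j))  # activate end(f_j)
    j = max(range(len(frags)), key=score.__getitem__)
    best_score, chain = score[j], []
    while j != -1:
        chain.append(j)
        j = pred[j]
    return best_score, chain[::-1]


frags = [((1, 1), (3, 3), 3.0), ((2, 5), (4, 7), 3.0),
         ((5, 5), (8, 8), 4.0), ((6, 2), (7, 3), 2.0)]
sweep_score, sweep_chain = chain_sweep(frags)
```

Sorting the events by x1 takes care of the first dimension, so the tree only has to be indexed by x2, which is the dimension reduction to k-1 described above.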

  19. The Complexity of the Algorithm
      • The algorithm complexity depends on the data structure supporting RMQ
      Dynamic data structure:
      - Constructed point by point
      - Points are explicitly inserted and deleted
      - Less space, because some covered fragments can be deleted
      - Very difficult to implement
      - Works for on-line chaining
      Semi-dynamic data structure:
      - Constructed for all points at once
      - Points are not inserted/deleted, but rather activated/inactivated
      - More space, all fragments remain in memory
      - Easier to implement
      - Works for off-line chaining

  20. The Complexity of the Algorithm: RMQ using semi-dynamic range tree
      • D is implemented as a range tree
      • supported by fractional cascading
      • enhanced with priority queues
      (Willard, 1985; Johnson, 1982; van Emde Boas, 1977)
      • For n fragments and dimension d, the RMQ and activation take O(n log^{d-1} n log log n) time and O(n log^{d-1} n) space
      • Since d = k-1 > 1, the complexity of the algorithm is O(n log^{k-2} n log log n) time and O(n log^{k-2} n) space
      • For k = 2, the total complexity is O(n log n) time and O(n) space

  21. The Complexity of the Algorithm: RMQ using semi-dynamic kd-tree
      • For n fragments and dimension d > 1, the RMQ and activation take … time and O(n) space (Lee-Wong, 1977)
      • Since d = k-1 > 1, the complexity of the algorithm is … time and O(n) space
      • For k = 2, the total complexity is O(n log n) time and O(n) space
      • The running time can be sped up in practice using some programming tricks (Bentley, 1990)

  22. kd-trees

  23. kd-trees vs. Range Trees
      • d stands for dimension
      • C stands for construction
      • Q stands for query and activation time
      • For 4 strains of E. coli, the range tree did not fit in memory; estimated space consumption is 7.1 Gb

  24. Including Gap Costs
      Recall the recurrence f_j.score = f_j.weight + max{f_i.score - g(f_i, f_j) : f_i << f_j}
      The gap cost should be included in the RMQ, otherwise the algorithm would be quadratic:
      f_j.score = f_j.weight + RMQ(O, start(f_j))
      (Figure: a fragment f shown on the sequences ACCXXXACC and ACCYYACC.)

  25. Including Gap Costs in L1
      • We define the geometric cost of a fragment f as gc(f) = d1(t, end(f)), where d1(t, end(f)) is the distance in the L1 metric between t and end(f).
      • For two fragments f_1 << f and f_2 << f: f_1.score - g(f_1, f) > f_2.score - g(f_2, f) iff f_1.score - gc(f_1) > f_2.score - gc(f_2)
      • gc(f) is a constant that can be precomputed and attached to the fragment's weight
      • We activate a fragment with f.score - gc(f) instead of f.score
      • The inclusion of the gap cost can be done at no extra cost → the same complexity as the algorithm with no gap cost
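
The equivalence rests on the fact that, for any predecessor f' << f, gc(f') - g(f', f) equals d1(t, start(f)), which does not depend on f'. A small Python check of this, assuming the L1 gap cost is the L1 distance between end(f') and start(f) (the function names and the sample coordinates are made up for illustration):

```python
from typing import Tuple

Point = Tuple[int, ...]


def d1(p: Point, q: Point) -> int:
    """L1 (Manhattan) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def gap_l1(end_prev: Point, start_next: Point) -> int:
    """Assumed L1 gap cost g(f', f) = d1(end(f'), start(f))."""
    return d1(end_prev, start_next)


def gc(end_f: Point, t: Point) -> int:
    """Geometric cost gc(f) = d1(t, end(f))."""
    return d1(t, end_f)


t = (10, 10)
start_f = (5, 5)
# Two candidate predecessors of f, each given as (end point, score).
candidates = [((3, 3), 6.0), ((4, 1), 6.5)]

by_gap = max(candidates, key=lambda c: c[1] - gap_l1(c[0], start_f))
by_priority = max(candidates, key=lambda c: c[1] - gc(c[0], t))
# gc(f') - g(f', f) is the same for every predecessor: d1(t, start(f)).
offsets = [gc(end, t) - gap_l1(end, start_f) for end, _ in candidates]
```

Because the offset is the same for all candidates, ranking by score - gc picks the same predecessor as ranking by score - g, so the priorities can be stored once at activation time.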

  26. The Local Chaining Problem
      Given n weighted fragments from k genomes, a chain C of colinear non-overlapping fragments has score:
      score(C) = ∑_i f_i.weight - ∑_i g(f_i, f_{i-1})
      where g(f_i, f_{i-1}) is the gap cost of connecting f_i to f_{i-1}.
      • The weight of a fragment is, for example, its length or its statistical significance
      • A local chain C is called optimal if its score is maximum over all other chains.
      (Figure: fragments shared by First Genome G1, Second Genome G2, and Third Genome G3.)

  28. Geometric Solution
      The recurrence f_j.score = f_j.weight + max{0, f_i.score - g(f_i, f_j) : f_i << f_j} can be written as f_j.score = f_j.weight + RMQ(O, start(f_j))
      • But we have to check the result f' = RMQ(O, start(f_j)): if f'.score - g(f', f_j) >= 0, then connect f' to f_j; else start a new chain, starting with f_j
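
The effect of the extra 0 in the recurrence can be seen in a small Python sketch. For brevity it evaluates the recurrence with a quadratic scan rather than an RMQ structure, so it shows the scoring rule, not the efficient algorithm; `local_chain`, `l1_gap`, and the tuple layout are hypothetical names and conventions.

```python
from typing import Callable, List, Tuple

Point = Tuple[int, ...]
Frag = Tuple[Point, Point, float]  # (start, end, weight); layout is an assumption


def local_chain(frags: List[Frag],
                gap_cost: Callable[[Frag, Frag], float]) -> Tuple[float, List[Frag]]:
    """Quadratic evaluation of f_j.score = f_j.weight + max{0, f_i.score - g(f_i, f_j)}."""
    if not frags:
        return 0.0, []
    nodes = sorted(frags, key=lambda f: f[0])  # predecessors come first
    score = [0.0] * len(nodes)
    pred = [-1] * len(nodes)
    for j, fj in enumerate(nodes):
        best, best_i = 0.0, -1  # 0 means: start a new chain at f_j
        for i in range(j):
            fi = nodes[i]
            if all(e < s for e, s in zip(fi[1], fj[0])):  # f_i << f_j
                cand = score[i] - gap_cost(fi, fj)
                if cand > best:
                    best, best_i = cand, i
        score[j], pred[j] = fj[2] + best, best_i
    # The best local chain ends at the fragment with the highest score.
    j = max(range(len(nodes)), key=score.__getitem__)
    best_score, chain = score[j], []
    while j != -1:
        chain.append(nodes[j])
        j = pred[j]
    return best_score, chain[::-1]


def l1_gap(f: Frag, g: Frag) -> float:
    """Assumed L1 gap cost between end(f) and start(g)."""
    return float(sum(abs(s - e) for e, s in zip(f[1], g[0])))


frags = [((1, 1), (3, 3), 3.0), ((4, 4), (6, 6), 3.0), ((50, 50), (53, 53), 2.0)]
local_score, local_best = local_chain(frags, l1_gap)
```

In this example the distant third fragment is not attached to the first two, because the gap cost would exceed the score accumulated so far; it starts its own chain instead.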

  29. Comparing two bacterial genomes
      The two genomes: 1- C. trachomatis (1.2 Mbp) 2- C. pneumoniae (1.2 Mbp)
      • Fragments are maximal exact matches of minimum length 12
      • Total number of fragments: 288,899
      (Figure: dot plot of C. trachomatis against C. pneumoniae. Red points: forward fragments. Green points: reverse fragments.)

  30. Comparing two bacterial genomes
      The two genomes: 1- C. trachomatis (1.2 Mbp) 2- C. pneumoniae (1.2 Mbp)
      • Fragments are maximal multiple exact matches of minimum length 12
      • Total number of fragments: 288,899
      • CoCoNUT is fast: it takes minutes to compute fragments and local chains, a task that took hours with previous methods
      (Figure: the computed chains of C. trachomatis against C. pneumoniae, with the termini of replication marked.)

  31. Conclusions
      • Chaining algorithms are efficient for comparative genomics
      • More variations are needed for real applications in biology, e.g., limiting the range search and considering overlaps
      • CoCoNUT is a system for comparative genomics containing several variations of the chaining algorithms
      • Global and local chaining are analogous to global and local sequence alignment
      • The kd-tree is superior to the range tree in practice

  32. More on Chaining Algorithms

  33. Thanks for your attention
