Chaining Algorithms Simplified

Chaining Algorithms Simplified Mohamed Ibrahim Abouelhoda University of Ulm and Cairo University 2007

The Genome
• The total DNA content of a cell: a string over an alphabet of 4 characters {A, C, G, T}
• Encodes the information necessary for the existence and reproduction of an organism
[Figure: DNA sequence ..GCGGGGCGGTTCACGCGGCCGCAATCAACTGCGTGGGGGGGGGGGGG.. with two regions labeled "Gene"]

Computational Comparative Genomics
What is it about? In-silico identification of regions of similarity and difference among two or multiple genomes:
• Similar (conserved) regions: similar (conserved) function necessary for the organism
• Different regions: traits unique to one organism
Objectives:
• Basic science: understanding how genomes function, organize, replicate, and evolve
• Industry and healthcare: increasing organism productivity and finding drugs
[Figure: Genome of Organism A compared with Genome of Organism B]

Sequence Alignment
Traditional solutions:
• Global sequence alignment, e.g., S1 = TACAATCAA and S2 = TCACTCAC aligned as T_ACAATCAA over TCAC_ _TCAC
• Local sequence alignment [Figure: local alignment of S1 = TCACAA and S2 = CAAATCA]
Sequence alignment is not suitable for comparing genomic sequences:
• Dynamic programming algorithms take on the order of N^k time (k = number of genomes, N = average genome length)

The Anchor-based Strategy
• Composed of three phases:
1. Computation of fragments (similar regions) among genomes
2. Computation of an optimal global chain or chains of colinear non-overlapping fragments
3. Detailed alignment of the regions between the fragments of the chain
• The fragments can be computed using an index data structure (Abouelhoda, Kurtz, Ohlebusch, 2004)
• A fragment: e.g., an exact fragment such as a maximal exact match
[Figure: Genome 1 GACCGCGCA and Genome 2 CACCGCGCT share the exact fragment ACCGCGC, flanked by different characters on both sides]

Fragment Representation
• Box-line representation
• Geometric representation: each fragment is represented by a hyper-rectangle in k-dimensional space, where each axis corresponds to one sequence
[Figure: box-line and geometric representations of the fragments of S1 = TACAATCAA and S2 = TCACTCAC]


The Anchor-based Strategy (continued)
[Figure: the fragments of the chain serve as anchors between the first genome G1 and the second genome G2; the regions between the anchors are aligned in detail, e.g., ..TCACAATCAA.. with ..TCATA_TCAA..]

Chaining Algorithms
• The Global Chaining Problem
• The Local Chaining Problem

The Global Chaining Problem
Given n weighted fragments from k genomes, find the chain C of colinear non-overlapping fragments whose total score is maximum over all chains:
$\mathrm{score}(C) = \sum_i f_i.\mathrm{weight} - \sum_i g(f_{i-1}, f_i)$
where $g(f_{i-1}, f_i)$ is the gap cost of connecting $f_{i-1}$ to $f_i$.
• The weight of a fragment is, for example, its length
[Figure: a chain of fragments across the first genome G1, the second genome G2, and the third genome G3]


Notions
• A fragment $f_i$ is represented as a hyper-rectangle in k-dimensional space.
• A fragment $f_i$ is identified by its start and end points, $\mathrm{start}(f_i)$ and $\mathrm{end}(f_i)$.
• We add two imaginary fragments O and t with weight zero, marking the origin and the terminus.
• Any two consecutive fragments $f_i$ and $f_{i+1}$ in the chain must be colinear and non-overlapping, written $f_i \ll f_{i+1}$: $\mathrm{end}(f_i).x_r < \mathrm{start}(f_{i+1}).x_r$ for all r, $1 \le r \le k$.
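The colinearity relation above can be sketched in a few lines of Python. This is only an illustration: the class name `Fragment` and the function name `precedes` are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Fragment:
    """Hypothetical fragment record: one coordinate per genome."""
    start: Tuple[int, ...]  # start(f)
    end: Tuple[int, ...]    # end(f)
    weight: float = 0.0


def precedes(f: Fragment, g: Fragment) -> bool:
    """Return True if f << g, i.e. end(f).x_r < start(g).x_r in every dimension r."""
    return all(e < s for e, s in zip(f.end, g.start))


a = Fragment(start=(1, 2), end=(3, 4), weight=3)
b = Fragment(start=(5, 6), end=(8, 9), weight=4)
c = Fragment(start=(6, 1), end=(8, 2), weight=3)  # lies after a in genome 1 but before a in genome 2

print(precedes(a, b))                  # a and b can be chained
print(precedes(a, c), precedes(c, a))  # a and c cannot be chained in either order
```

Two fragments can appear in the same chain only if one precedes the other in every dimension, which is why `a` and `c` are incompatible here.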

Types of Gap Costs
• The gap costs g can be described geometrically, in the L1 metric or in the L∞ metric.
[Figure: example alignments of the regions between two ACC fragments (ACCXXACC against ACCYYYACC) under the L1 and L∞ gap costs]
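As a rough sketch of the two geometric gap costs, the functions below measure the distance between the end point of one fragment and the start point of the next. The function names and the exact definition (plain coordinate differences, with no offset and no per-character cost parameters) are assumptions made for illustration.

```python
def gap_l1(end_prev, start_next):
    """L1 gap cost (assumed form): sum of the coordinate differences."""
    return sum(abs(s - e) for e, s in zip(end_prev, start_next))


def gap_linf(end_prev, start_next):
    """L-infinity gap cost (assumed form): largest coordinate difference."""
    return max(abs(s - e) for e, s in zip(end_prev, start_next))


end_prev = (3, 3)    # end point of the first fragment
start_next = (6, 7)  # start point of the second fragment

print(gap_l1(end_prev, start_next))    # 3 + 4 = 7
print(gap_linf(end_prev, start_next))  # max(3, 4) = 4
```

One way to read the difference: the L1 cost charges for every character between the two fragments in every sequence, as if all of them were aligned to gaps, while the L∞ cost charges only for the longest gap, as if characters in the shorter gaps could be paired with characters in the longest one.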

A Graph-based Solution
• The score of a chain C is $\mathrm{score}(C) = \sum_i [f_i.\mathrm{weight} - g(f_{i-1}, f_i)]$
• An optimal chain is a chain of maximum score
• In the graph whose vertices are the fragments, with an edge from $f_i$ to $f_j$ whenever $f_i \ll f_j$, a highest-scoring path is an optimal chain
• The maximum score can be computed by the recurrence $f_j.\mathrm{score} = f_j.\mathrm{weight} + \max\{f_i.\mathrm{score} - g(f_i, f_j) : f_i \ll f_j\}$
• A graph-based solution takes $O(n^2)$ time.
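The recurrence can be evaluated directly by comparing every pair of fragments, which gives the quadratic running time. The sketch below is a minimal illustration, not the implementation behind the slides: the tuple layout `(start, end, weight)`, the function name `chain_quadratic`, and the choice to place the origin O at coordinate 0 and to omit the terminus t are all assumptions. It expects a non-empty fragment list with coordinates of at least 1.

```python
def chain_quadratic(fragments, gap=lambda f, g: 0.0):
    """Quadratic-time global chaining.

    fragments: non-empty list of (start, end, weight), start/end are coordinate tuples.
    gap: gap cost of connecting fragment f to fragment g (zero by default).
    Returns (best score, list of fragment indices in chain order).
    """
    k = len(fragments[0][0])
    origin = ((0,) * k, (0,) * k, 0.0)  # imaginary fragment O with weight zero
    # If f_i << f_j then start(f_i).x1 < start(f_j).x1, so this order visits
    # every possible predecessor of a fragment before the fragment itself.
    order = sorted(range(len(fragments)), key=lambda i: fragments[i][0][0])
    score, prev = {}, {}
    for pos, j in enumerate(order):
        fj = fragments[j]
        best, best_i = -gap(origin, fj), None
        for i in order[:pos]:
            fi = fragments[i]
            if all(e < s for e, s in zip(fi[1], fj[0])):  # f_i << f_j
                cand = score[i] - gap(fi, fj)
                if cand > best:
                    best, best_i = cand, i
        score[j] = fj[2] + best
        prev[j] = best_i
    # Trace back from the highest-scoring fragment (terminus t omitted for brevity).
    cur = max(score, key=score.get)
    best_score, chain = score[cur], []
    while cur is not None:
        chain.append(cur)
        cur = prev[cur]
    return best_score, chain[::-1]


frags = [
    ((1, 1), (3, 3), 3),
    ((4, 6), (6, 8), 3),
    ((5, 4), (9, 8), 5),    # cannot be chained together with fragment 1
    ((10, 9), (12, 11), 3),
]
best_score, chain = chain_quadratic(frags)
print(best_score, chain)
```

In this example fragments 1 and 2 are mutually exclusive, and the heavier fragment 2 is chosen, so the chain is 0, 2, 3.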

Sparse Dynamic Programming
• Chaining algorithms are sparse dynamic programming (D. Eppstein, R. Giancarlo, Z. Galil, and G.F. Italiano, 1992)
• The maximum score can be computed by the recurrence $f_j.\mathrm{score} = f_j.\mathrm{weight} + \max\{f_i.\mathrm{score} - g(f_i, f_j) : f_i \ll f_j\}$, where $f_i \ll f_j$ means $\mathrm{end}(f_i).x_r < \mathrm{start}(f_j).x_r$ for all r, $1 \le r \le k$, and $g(f_i, f_j)$ is the gap cost of connecting $f_i$ to $f_j$
[Figure: dynamic programming matrix of two sequences, with axes i and j]

Sparse Dynamic Programming (continued)
• The string characters are not given, only positions
• In extreme cases, you can enumerate all matches and consider everything else as gaps: sparse dynamic programming (chaining) is then used to compute the alignment directly, and selecting the gap cost function is critical
[Figure: the same matrix with the characters of the two sequences replaced by the placeholders X and Y]

A Geometric-based Solution
• The max function in the recurrence $f_j.\mathrm{score} = f_j.\mathrm{weight} + \max\{f_i.\mathrm{score} - g(f_i, f_j) : f_i \ll f_j\}$ can be replaced by a range maximum query (RMQ): $f_j.\mathrm{score} = f_j.\mathrm{weight} + \mathrm{RMQ}(O, \mathrm{start}(f_j))$
• An RMQ retrieves the fragment $f_i$ whose end point lies in the hyper-rectangle bounded by O and $\mathrm{start}(f_j)$ such that $f_i.\mathrm{score} - g(f_i, f_j)$ is maximum.
• If the gap cost is zero, an RMQ returns the end point of the fragment $f_i$ such that $f_i.\mathrm{score}$ is maximum.
• If all the fragments have the same weight (length) and there is no gap cost, we are solving the LCS problem.

The Algorithm without Gap Cost
• Line-sweep algorithm:
1. Sort the start and end points of the fragments w.r.t. x1.
2. If the scanned point is the start point of a fragment, say $f_j$, apply $\mathrm{RMQ}(O, (\mathrm{start}(f_j).x_1, \mathrm{start}(f_j).x_2, \ldots, \mathrm{start}(f_j).x_k))$ to the set of active end points and update the score of fragment $f_j$.
3. Otherwise, add the end point to the set of active end points (already scanned end points).
• Because of the sorting step, the dimension of the RMQ can be reduced to k-1, so we can use $\mathrm{RMQ}(O, (\mathrm{start}(f_j).x_2, \ldots, \mathrm{start}(f_j).x_k))$.
• For comparing two sequences, the RMQ dimension is 1, so we can use priority queues to find an optimal fragment in O(log log m) time.
• But the complexity is dominated by the sorting, unless the fragments are computed in order.
• The priority queue is complicated to implement.
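The line-sweep steps above can be sketched for two sequences (k = 2) and zero gap cost. Note the substitution: instead of the priority queue mentioned on the slide, this sketch answers the one-dimensional RMQ with a Fenwick tree over the x2 coordinates of the end points, which costs O(log n) per operation and is easier to write down. The function name `chain_line_sweep` and the tuple layout `(start, end, weight)` are assumptions, and weights are assumed to be positive.

```python
from bisect import bisect_left


def chain_line_sweep(fragments):
    """Global chaining for k = 2 without gap costs, in O(n log n) time.

    fragments: non-empty list of ((x1, x2) start, (x1, x2) end, weight).
    Returns (best score, list of fragment indices in chain order).
    """
    n = len(fragments)
    ys = sorted({end[1] for _, end, _ in fragments})  # distinct x2 values of end points
    tree = [(0.0, None)] * (len(ys) + 1)              # Fenwick tree of (score, index)

    def query(count):
        """Best active end point among the `count` smallest x2 values."""
        best = (0.0, None)  # the origin O: score zero, no predecessor
        while count > 0:
            if tree[count][0] > best[0]:
                best = tree[count]
            count -= count & -count
        return best

    def activate(pos, item):
        """Activate an end point at 1-based rank `pos` with its score."""
        while pos <= len(ys):
            if item[0] > tree[pos][0]:
                tree[pos] = item
            pos += pos & -pos

    # Step 1: sort all points by x1; at equal x1, start points (0) come before
    # end points (1), because colinearity requires a strict inequality.
    events = sorted(
        [(s[0], 0, j) for j, (s, _, _) in enumerate(fragments)]
        + [(e[0], 1, j) for j, (_, e, _) in enumerate(fragments)]
    )
    score, prev = [0.0] * n, [None] * n
    for _, kind, j in events:
        start, end, weight = fragments[j]
        if kind == 0:   # Step 2: start point, query end points with x2 < start.x2
            best, prev[j] = query(bisect_left(ys, start[1]))
            score[j] = weight + best
        else:           # Step 3: end point, add it to the active set
            activate(bisect_left(ys, end[1]) + 1, (score[j], j))

    cur = max(range(n), key=score.__getitem__)
    best_score, chain = score[cur], []
    while cur is not None:
        chain.append(cur)
        cur = prev[cur]
    return best_score, chain[::-1]


frags = [
    ((1, 1), (3, 3), 3),
    ((4, 6), (6, 8), 3),
    ((5, 4), (9, 8), 5),
    ((10, 9), (12, 11), 3),
]
sweep_score, sweep_chain = chain_line_sweep(frags)
print(sweep_score, sweep_chain)
```

A Fenwick tree suffices here because end points are only ever activated, never removed, so the stored prefix maxima never need to decrease.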

The Complexity of the Algorithm
• The algorithm complexity depends on the data structure supporting RMQ
Semi-dynamic data structure:
- Constructed for all points at once
- Points are not inserted or deleted, but activated or inactivated
- More space: all fragments remain in memory
- Easier to implement
- Works for off-line chaining
Dynamic data structure:
- Constructed point by point
- Points are explicitly inserted and deleted
- Less space, because some covered fragments can be deleted
- Very difficult to implement
- Works for on-line chaining

The Complexity of the Algorithm: RMQ using a semi-dynamic range tree
• The data structure D is implemented as a range tree
- supported by fractional cascading (Willard, 1985)
- enhanced with priority queues (Johnson, 1982; van Emde Boas, 1977)
• For n fragments and dimension d, the RMQs and activations take $O(n \log^{d-1} n \log\log n)$ time and $O(n \log^{d-1} n)$ space
• Since d = k-1 > 1, the complexity of the algorithm is $O(n \log^{k-2} n \log\log n)$ time and $O(n \log^{k-2} n)$ space
• For k = 2, the total complexity is $O(n \log n)$ time and $O(n)$ space

The Complexity of the Algorithm: RMQ using a semi-dynamic kd-tree
• For n fragments and dimension d > 1, the RMQs and activations take [time bound not preserved] time and O(n) space (Lee and Wong, 1977)
• Since d = k-1 > 1, the complexity of the algorithm is [time bound not preserved] time and O(n) space
• For k = 2, the total complexity is O(n log n) time and O(n) space
• The running time can be sped up in practice using some programming tricks (Bentley, 1990)

kd-trees

kd-trees vs. Range Trees
[Table not preserved; its legend follows]
• d stands for dimension
• C stands for construction
• Q stands for query and activation time
• For 4 strains of E. coli, the range tree did not fit in memory; its estimated space consumption is 7.1 Gb

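Including Gap Costs
• Recall the recurrence $f_j.\mathrm{score} = f_j.\mathrm{weight} + \max\{f_i.\mathrm{score} - g(f_i, f_j) : f_i \ll f_j\}$
• The gap cost should be included in the RMQ, otherwise the algorithm would be quadratic: $f_j.\mathrm{score} = f_j.\mathrm{weight} + \mathrm{RMQ}(O, \mathrm{start}(f_j))$
[Figure: a fragment f = ACC followed by a second ACC fragment, separated by XXX in one sequence (ACCXXXACC) and by YY in the other (ACCYYACC)]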

Including Gap Costs in L1
• We define the geometric cost of a fragment f as $gc(f) = d_1(t, \mathrm{end}(f))$, where $d_1(t, \mathrm{end}(f))$ is the distance in the L1 metric between t and end(f).
• For two fragments $f_1$ and $f_2$ that both precede f: $f_1.\mathrm{score} - g(f_1, f) > f_2.\mathrm{score} - g(f_2, f)$ iff $f_1.\mathrm{score} - gc(f_1) > f_2.\mathrm{score} - gc(f_2)$
• gc(f) is a constant that can be precomputed and attached to the fragment's weight
• We activate each fragment with f.score - gc(f) instead of f.score
• The gap cost can therefore be included at no extra cost: the complexity is the same as for the algorithm with no gap cost
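The equivalence can be checked numerically. The sketch below assumes that the L1 gap cost is the L1 distance between end(f') and start(f), and that the terminus t dominates every point; all names and coordinates are made up for illustration. Under these assumptions, for any f' preceding f, the two priorities differ by d1(t, start(f)), which does not depend on f', so both rank the candidates identically.

```python
def d1(p, q):
    """L1 distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def best_direct(candidates, start_f):
    """Pick the predecessor maximizing f'.score - g1(f', f), with g1 as assumed above."""
    return max(candidates, key=lambda c: c[2] - d1(c[1], start_f))[0]


def best_geometric(candidates, t):
    """Pick the predecessor maximizing f'.score - gc(f'), independent of f."""
    return max(candidates, key=lambda c: c[2] - d1(t, c[1]))[0]


t = (100, 100)      # terminus point, assumed to dominate all fragments
start_f = (50, 60)  # start point of the fragment f being scored
# (name, end point, score) of three fragments that all precede f
candidates = [("f1", (10, 20), 7.0), ("f2", (30, 45), 9.0), ("f3", (40, 10), 12.0)]

for name, end, score in candidates:
    direct = score - d1(end, start_f)
    geometric = score - d1(t, end)
    # The difference is the same constant for every candidate.
    print(name, direct - geometric, d1(t, start_f))

print(best_direct(candidates, start_f), best_geometric(candidates, t))
```

Because the shifted priority no longer mentions f, it can be stored with the fragment when it is activated, and a plain RMQ over the stored values returns the right predecessor.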

The Local Chaining Problem
Given n weighted fragments from k genomes, a chain C of colinear non-overlapping fragments has score
$\mathrm{score}(C) = \sum_i f_i.\mathrm{weight} - \sum_i g(f_{i-1}, f_i)$
where $g(f_{i-1}, f_i)$ is the gap cost of connecting $f_{i-1}$ to $f_i$.
• The weight of a fragment is, for example, its length or its statistical significance
• A local chain C is called optimal if its score is maximum over all other chains.
[Figure: local chains of fragments across the first genome G1, the second genome G2, and the third genome G3]

Geometric Solution
• The recurrence $f_j.\mathrm{score} = f_j.\mathrm{weight} + \max\{0, f_i.\mathrm{score} - g(f_i, f_j) : f_i \ll f_j\}$ can be written as $f_j.\mathrm{score} = f_j.\mathrm{weight} + \mathrm{RMQ}(O, \mathrm{start}(f_j))$
• But we have to check the fragment $f' = \mathrm{RMQ}(O, \mathrm{start}(f_j))$: if $f'.\mathrm{score} - g(f', f_j) \ge 0$, then connect f' to $f_j$; else start a new chain, starting with $f_j$
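The connect-or-restart rule can be illustrated with a simple quadratic-time sketch, in which a plain loop over all earlier fragments stands in for the RMQ. The function name `local_chain`, the tuple layout `(start, end, weight)`, and the L1 gap cost used in the example are assumptions made for illustration.

```python
def local_chain(fragments, gap):
    """Quadratic-time local chaining.

    fragments: non-empty list of (start, end, weight), start/end are coordinate tuples.
    gap: gap cost of connecting fragment f to fragment g.
    Returns (best score, list of fragment indices of the best local chain).
    """
    order = sorted(range(len(fragments)), key=lambda i: fragments[i][0][0])
    score, prev = {}, {}
    for pos, j in enumerate(order):
        fj = fragments[j]
        best, best_i = None, None
        for i in order[:pos]:  # stands in for RMQ(O, start(f_j))
            fi = fragments[i]
            if all(e < s for e, s in zip(fi[1], fj[0])):  # f_i << f_j
                cand = score[i] - gap(fi, fj)
                if best is None or cand > best:
                    best, best_i = cand, i
        if best is not None and best >= 0:
            score[j], prev[j] = fj[2] + best, best_i  # connect f' to f_j
        else:
            score[j], prev[j] = fj[2], None           # start a new chain at f_j
    cur = max(score, key=score.get)
    best_score, chain = score[cur], []
    while cur is not None:
        chain.append(cur)
        cur = prev[cur]
    return best_score, chain[::-1]


def gap_l1(f, g):
    """L1 distance between end(f) and start(g) (assumed gap cost)."""
    return sum(s - e for e, s in zip(f[1], g[0]))


frags = [
    ((1, 1), (5, 5), 5.0),
    ((7, 7), (12, 12), 6.0),         # close to fragment 0: worth connecting
    ((100, 200), (103, 203), 4.0),   # far away: connecting would cost more than it gains
]
local_score, local_best = local_chain(frags, gap_l1)
print(local_score, local_best)
```

Here fragment 1 is connected to fragment 0 because the gap cost of 4 is smaller than the score of 5 carried over, while fragment 2 starts a chain of its own.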

Comparing two bacterial genomes
• The two genomes: (1) C. trachomatis (1.2 Mbp), (2) C. pneumoniae (1.2 Mbp)
• Fragments of the type maximal exact matches of minimum length 12
• Total number of fragments: 288,899
[Figure: plot of the fragments between C. trachomatis and C. pneumoniae; red points are forward fragments, green points are reverse fragments]

Comparing two bacterial genomes: chains
• The two genomes: (1) C. trachomatis (1.2 Mbp), (2) C. pneumoniae (1.2 Mbp)
• Fragments of the type maximal multiple exact matches of minimum length 12
• Total number of fragments: 288,899
• CoCoNUT is fast: it takes minutes to compute fragments and local chains, a task that took hours with previous methods
[Figure: chains between C. trachomatis and C. pneumoniae, with the termini of replication marked]

Conclusions
• Chaining algorithms are efficient for comparative genomics
• More variations are needed for real applications in biology, e.g., limiting the range search and considering overlaps
• CoCoNUT is a system for comparative genomics that contains several variants of the chaining algorithms
• Global and local chaining are analogous to global and local sequence alignment
• The kd-tree is superior to the range tree in practice

More on Chaining Algorithms

Thanks for your attention
