Sparse Normalized Local Alignment Nadav Efraty Gad M. Landau


Background - Global similarity: LCS (Longest Common Subsequence)

Background - LCS milestones
• (1977) Hirschberg - Algorithms for the longest common subsequence problem.
• (1977) Hunt, Szymanski - A fast algorithm for computing longest common subsequence.
• (1987) Apostolico, Guerra - The longest common subsequence problem revisited.
• (1992) Eppstein, Galil, Giancarlo, Italiano - Sparse dynamic programming I: linear cost functions.

Background - Global vs. Local
• Global alignment algorithms compute the similarity grade of the entire input strings by computing the best path from the first to the last entry of the table.
• Local alignment algorithms report the most similar substring pair according to their scoring scheme.

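Background – Smith Waterman algorithm (1981)
\[ T(i,0) = T(0,j) = 0 \quad \text{for all } i, j \ (1 \le i \le m;\ 1 \le j \le n) \]
\[ T(i,j) = \max\{\, 0,\ T(i-1,j-1) + S(Y_i,X_j),\ T(i-1,j) + D(Y_i),\ T(i,j-1) + I(X_j) \,\} \]
\[ D(Y_i) = I(X_j) = -0.4, \qquad S(Y_i,X_j) = \begin{cases} 1 & \text{if } Y_i = X_j \\ -0.3 & \text{if } Y_i \ne X_j \end{cases} \]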
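The recurrence above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `smith_waterman` and its parameter names are assumptions, and the default scores are the ones on the slide (match 1, mismatch -0.3, insertion/deletion -0.4).

```python
def smith_waterman(x, y, match=1.0, mismatch=-0.3, indel=-0.4):
    """Return the best local alignment score of strings x and y."""
    m, n = len(y), len(x)
    # T[i][j] = best score of a local alignment ending at y[i-1] and x[j-1].
    # Row 0 and column 0 stay at 0, as in the boundary condition.
    T = [[0.0] * (n + 1) for _ in range(m + 1)]
    best = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if y[i - 1] == x[j - 1] else mismatch
            T[i][j] = max(0.0,
                          T[i - 1][j - 1] + s,    # substitute or match
                          T[i - 1][j] + indel,    # delete Y_i
                          T[i][j - 1] + indel)    # insert X_j
            best = max(best, T[i][j])
    return best
```

The sketch fills the whole (m+1) x (n+1) table, which is exactly the quadratic work that the sparse approach in the rest of the talk tries to avoid.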

Background – The Smith Waterman algorithm
The weaknesses of the Smith Waterman algorithm (according to Arslan, Eğecioğlu and Pevzner):
• Maximal score vs. maximal degree of similarity: which reflects a higher level of similarity, a score of 71 over 10,000 symbols (71/10,000) or a score of 70 over 200 symbols (70/200)?
• Mosaic effect: poorly conserved intermediate segments cannot be discarded.
• Shadow effect: short alignments may go undetected because longer alignments overlap them.
• The sparsity of the essential data is not exploited.
This cannot be fixed by post-processing.

Background – Normalized local alignment
• The statistical significance of a local alignment depends on both its score and its length.
• Thus, the solution to these weaknesses is normalization.
• Instead of maximizing S(X’,Y’), maximize S(X’,Y’)/(|X’|+|Y’|).
• Under that scoring scheme, a single match is always an optimal alignment. Thus, a minimal length or a minimal score constraint is needed.

Background – Normalized sequence alignment
• The algorithm of Arslan, Eğecioğlu and Pevzner (2001) converges to the optimal normalized alignment value through iterations of the Smith Waterman algorithm.
• They maximize SCORE(X’,Y’)/(|X’|+|Y’|+L), where L is a constant that controls the amount of normalization.
• The ratio between L and |X’|+|Y’| determines the influence of L on the value of the alignment.
• The time complexity of their algorithm is O(n^2 log n).

Our approach
• Maximize LCS(X’,Y’)/(|X’|+|Y’|).
• It can be viewed as a measure of the density of the matches.
• A minimal length or score constraint, M, must be enforced; we chose the score constraint (the value of LCS(X’,Y’)).
• The value of M is problem related.
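To make the objective concrete, the following brute-force sketch evaluates LCS(X’,Y’)/(|X’|+|Y’|) over every substring pair with at least M matches. It is only a reference for tiny inputs and is not the algorithm presented in this talk; the names `lcs_length`, `best_normalized_lcs` and `min_score` (standing for M) are assumptions made for illustration.

```python
from fractions import Fraction

def lcs_length(a, b):
    """Length of the longest common subsequence (standard row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def best_normalized_lcs(x, y, min_score):
    """Best (value, X', Y') over all substring pairs with LCS >= min_score,
    or None if no pair satisfies the constraint."""
    best = None
    for i in range(len(x)):
        for i2 in range(i + 1, len(x) + 1):
            for j in range(len(y)):
                for j2 in range(j + 1, len(y) + 1):
                    xs, ys = x[i:i2], y[j:j2]
                    score = lcs_length(xs, ys)
                    if score >= min_score:
                        value = Fraction(score, len(xs) + len(ys))
                        if best is None or value > best[0]:
                            best = (value, xs, ys)
    return best
```

With `min_score=1` any shared symbol already yields the maximal value 1/2, which is why the constraint M is needed.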

The naïve O(rL log log n) normalized local LCS algorithm

Definitions
[Figure: the alignment grid of X (0 to n) and Y (0 to m), with a chain from match (i,j) to match (i’,j’).]
• A chain is a sequence of matches that is strictly increasing in both components.
• The length of a chain from match (i,j) to match (i’,j’) is i’-i+j’-j.
• A k-chain(i,j) is the shortest chain of k matches starting from (i,j).
• The normalized value of k-chain(i,j) is k divided by its length.
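The definitions above translate into small helpers. This is a sketch under one stated assumption: matches are given as 0-based symbol indices (i, j), and the length counts the symbols spanned in both strings, (i’-i+1)+(j’-j+1) = |X’|+|Y’|. That differs by a constant 2 from the grid-point formula i’-i+j’-j on the slide, and is chosen here so that a run of identical symbols gets the value 1/2. The function names are hypothetical.

```python
from fractions import Fraction

def is_chain(matches):
    """True if the matches strictly increase in both components."""
    return all(a[0] < b[0] and a[1] < b[1]
               for a, b in zip(matches, matches[1:]))

def chain_length(matches):
    """Symbols spanned in both strings (assumed convention: |X'| + |Y'|)."""
    (i, j), (i2, j2) = matches[0], matches[-1]
    return (i2 - i + 1) + (j2 - j + 1)

def normalized_value(matches):
    """Number of matches divided by the chain length, as an exact fraction."""
    return Fraction(len(matches), chain_length(matches))
```

For example, `normalized_value([(0, 0), (1, 1), (2, 2)])` is 1/2, while `normalized_value([(0, 0), (2, 3)])` is 2/7.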

The naïve algorithm
• For each match (i,j), construct k-chain(i,j) for 1≤k≤L (L=LCS(X,Y)).
• Computing the best chains starting from each match guarantees that the optimal chain will not be missed.
• Examine all the k-chains with k≥M of all matches and report either:
  • The k-chains with the highest normalized value.
  • The k-chains whose normalized value exceeds a predefined threshold.

Problem: k-chain(i,j) is not necessarily the prefix of (k+1)-chain(i,j).

Solution: construct (k+1)-chain(i,j) by concatenating (i,j) to k-chain(i’,j’).

Question: How to find the proper k-chain(i’,j’)?
• If there is only one candidate ((i,j) is in the range of a single match (i’,j’)), the choice is clear.
• What if there are two candidates ((i,j) is in the mutual range of two matches)?

Lemma: A mutual range of two matches is owned completely by one of them.

We use the lemma in order to maintain L data structures. In the k-th data structure:
• All the matches are the heads of k-chains.
• Each match owns the range to its left.
Computing the (k+1)-chain of a match is done by concatenating it to the owner of the range it is in.

The algorithm
• Preprocessing: create the list of matches of each row.
• Process the matches row by row, from the bottom up. For the matches of row i:
  • Stage 1: Construct the k-chains for 1≤k≤L.
  • Stage 2: Update the data structures with the matches of row i and their k-chains. They will be used for the computation of the next rows.
• Examine all k-chains of all matches and report the ones with the highest normalized value.
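The bottom-up structure of these steps can be sketched as follows. To keep it short, the sketch swaps the L ownership data structures (and Johnson Trees) for a plain scan over all already-processed matches, so it runs in roughly O(r^2 L) time rather than O(rL log log n). All names are hypothetical, matches are 0-based index pairs (i, j) with y[i] == x[j], and the length is assumed to be the number of symbols spanned in both strings.

```python
from fractions import Fraction

def k_chains(x, y, max_k=None):
    """Map each match (i, j) to a list whose entry k-1 is the length of
    k-chain(i, j); the list stops at the largest k reachable from (i, j)."""
    matches = [(i, j) for i, cy in enumerate(y)
               for j, cx in enumerate(x) if cy == cx]
    # end[(i, j)][k-1] = smallest i' + j' over chains of k matches from (i, j).
    # With a fixed start, the shortest chain is the one ending soonest.
    end = {}
    # Bottom-up: every possible successor lies in a lower row, already done.
    for i, j in sorted(matches, reverse=True):
        succ = [end[p, q] for (p, q) in end if p > i and q > j]
        ends = [i + j]
        k = 1
        while max_k is None or k < max_k:
            cands = [s[k - 1] for s in succ if len(s) >= k]
            if not cands:
                break
            # (k+1)-chain(i, j) = (i, j) followed by the best k-chain below-right
            ends.append(min(cands))
            k += 1
        end[i, j] = ends
    return {(i, j): [e - (i + j) + 2 for e in ends]
            for (i, j), ends in end.items()}

def best_normalized(x, y, min_matches):
    """Best (value, k, head) over all k-chains with k >= min_matches."""
    best = None
    for head, lengths in k_chains(x, y).items():
        for k, length in enumerate(lengths, start=1):
            if k >= min_matches:
                value = Fraction(k, length)
                if best is None or value > best[0]:
                    best = (value, k, head)
    return best
```

Each k is computed independently from the k-chains of the successors, so the sketch does not rely on k-chain(i,j) being a prefix of (k+1)-chain(i,j).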

Complexity analysis
• Preprocessing: O(n log ΣY).
• Stage 1: For each of the r matches we construct at most L k-chains, with a total complexity of O(rL log log n) when Johnson Trees are used by our data structures.
• Stage 2: Each of the r matches is inserted into and extracted from each of the data structures at most once, and the total complexity is again O(rL log log n).

Complexity analysis
• Checking all k-chains of all matches and reporting the best alignments takes O(rL) time.
• The total time complexity of this algorithm is O(n log ΣY + rL log log n).
• The space complexity is O(rL+nL):
  • r matches with (at most) L records each.
  • The space of L Johnson Trees of size n.

The O(rM log log n) normalized local LCS algorithm

The O(rM log log n) normalized local LCS algorithm
The algorithm reports the best possible local alignment (value and substrings).
• This section is divided into:
  • Computing the highest normalized value.
  • Constructing the longest optimal alignment.

Computing the highest normalized value
Definition: A sub-chain of a k-chain is a path that contains a sequence of x ≤ k consecutive matches of the k-chain. It does not have to start or end at a match.

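Computing the highest normalized value
Claim: When a k-chain is split into a number of non-overlapping consecutive sub-chains, the normalized value of the k-chain is at most the value of its best sub-chain.
Examples:
$\frac{10}{40} = \frac{3+2+3+2}{14+5+12+9}$, where the best sub-chain has value $\frac{2}{5} > \frac{10}{40}$.
$\frac{5+2+3+1}{20+8+12+4} = \frac{11}{44} = \frac{1}{4}$, where every sub-chain has value exactly $\frac{1}{4}$ (the same value as $\frac{10}{40}$).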
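The claim can be checked numerically on the slide's numbers. In this sketch each sub-chain is a hypothetical (matches, length) pair, and exact fractions avoid rounding issues.

```python
from fractions import Fraction

def chain_value(parts):
    """Normalized value of the whole chain: total matches / total length."""
    matches = sum(k for k, _ in parts)
    length = sum(s for _, s in parts)
    return Fraction(matches, length)

def best_part_value(parts):
    """Highest normalized value among the individual sub-chains."""
    return max(Fraction(k, s) for k, s in parts)

# Sub-chains with 3, 2, 3, 2 matches and lengths 14, 5, 12, 9.
parts = [(3, 14), (2, 5), (3, 12), (2, 9)]
assert chain_value(parts) <= best_part_value(parts)  # 1/4 <= 2/5
```

The whole chain is a weighted average of its sub-chains, so it can only match the best sub-chain when all sub-chains have the same value.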

Computing the highest normalized value
Result:
• Any k-chain with k≥M may be split into non-overlapping consecutive sub-chains of M matches, followed by a last sub-chain of up to 2M-1 matches.
• The normalized value of the best sub-chain will be at least equal to that of the k-chain.
Assume M = 3. A 10-chain: $\frac{10}{40} = \frac{3+3+4}{12+14+14}$
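The split described above is simple arithmetic on the number of matches. A minimal sketch, with the hypothetical helper `split_sizes(k, m)` where `m` stands for M:

```python
def split_sizes(k, m):
    """Sizes of consecutive sub-chains of a k-chain: blocks of exactly m
    matches, then one last block of between m and 2m-1 matches."""
    if m < 1 or k < m:
        raise ValueError("need k >= m >= 1")
    full_blocks = k // m - 1               # leave at least m matches for the tail
    return [m] * full_blocks + [k - m * full_blocks]
```

For the slide's example, `split_sizes(10, 3)` returns `[3, 3, 4]`.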

Computing the highest normalized value
• Sub-chains of fewer than M matches may not be reported.
• Sub-chains of 2M matches or more can be split into shorter sub-chains of M to 2M-1 matches.
Question: Is it sufficient to construct all the sub-chains of exactly M matches?
[Figure: an example comparing the values 4/10 and 3/8.]

Computing the highest normalized value
• The algorithm: For each match, construct all the k-chains for k≤2M-1.
• These chains are, in fact, the sub-chains of all the longer k-chains.
• A longer chain cannot be better than its best sub-chain.
• The algorithm reports the highest normalized value of a sub-chain, which is equal to the highest normalized value of a chain.

Constructing the longest optimal alignment
Definition: A perfect alignment is an alignment of two identical strings. Its normalized value is ½.
Lemma: Unless the optimal alignment is perfect, the longest optimal alignment has no more than 2M-1 matches.

Constructing the longest optimal alignment
Proof: Assume there is a chain with more than 2M-1 matches whose normalized value is optimal, denoted by LO.
• LO may be split into a number of sub-chains of M matches, followed by a single sub-chain of between M and 2M-1 matches.
• The normalized value of each such sub-chain must be equal to that of LO; otherwise, LO is not optimal.
• Each such sub-chain must start and end at a match; otherwise, the normalized value of the chain comprised of the same matches will be higher than that of LO.
[Figure: a sub-chain of value 10/30 padded with unmatched ends of lengths 3 and 2 has value 10/35 < 10/30.]

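Constructing the longest optimal alignment
• The tails and heads of the sub-chains of which LO is composed must be adjacent to each other. Extending a sub-chain of M matches and length S by the adjacent head of the next sub-chain gives a chain whose number of matches is M+1 and whose length is S+2.
• Since $\frac{M}{S} < \frac{1}{2}$, $\frac{M}{S} < \frac{M+1}{S+2}$. Thus, we found a chain of M+1 matches whose normalized value is higher than that of LO, in contradiction to the optimality of LO.
[Figure: two adjacent sub-chains, each of value M/S, forming a chain of value 2M/2S.]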
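The inequality used in this step can be verified by cross-multiplying. The derivation below assumes, as in the proof, that the alignment is not perfect, so a sub-chain of M matches and length S has value strictly below 1/2, and that S > 0.

```latex
\[
\frac{M}{S} < \frac{M+1}{S+2}
\iff M(S+2) < S(M+1)
\iff MS + 2M < MS + S
\iff 2M < S
\iff \frac{M}{S} < \frac{1}{2}.
\]
```

Every step is an equivalence because S and S+2 are positive, so the extended chain of M+1 matches beats LO exactly when LO is not perfect.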

Closing remarks

The advantages of the new algorithm
• Ideal for textual local comparison as well as for screening biological sequences.
• Normalized, and thus does not suffer from the shadow and mosaic effects.
• A straightforward approach to the minimal constraint.

The advantages of the new algorithm
• The minimal constraint is problem related rather than input related.
• If M is treated as a constant, the complexity of the algorithm is O(r log log n).
• Since for textual comparison we can expect r << n^2, the complexity may be even better than that of the non-normalized local similarity algorithms.

The advantages of the new algorithm
• The O(rM log log n) algorithm computes the optimal normalized alignments.
• The advantage of the O(rL log log n) algorithm is that it can report all the long alignments that exceed a predefined value, and not only the short optimal alignments.

Questions

The end
