1 / 36

Sparse Normalized Local Alignment Nadav Efraty Gad M. Landau


Presentation Transcript


  1. Sparse Normalized Local Alignment. Nadav Efraty, Gad M. Landau

  2. Background - Global similarity LCS- Longest Common Subsequence

  3. Background - LCS milestones • (1977) Hirschberg - Algorithms for the longest common subsequence problem. • (1977) Hunt, Szymanski - A fast algorithm for computing longest common subsequence. • (1987) Apostolico, Guerra – The longest common subsequence problem revisited. • (1992) Eppstein, Galil, Giancarlo, Italiano - Sparse dynamic programming I: linear cost functions.

  4. Background - Global vs. Local • Global alignment algorithms compute the similarity grade of the entire input strings by computing the best path from the first to the last entry of the table. • Local alignment algorithms report the most similar substring pair according to their scoring scheme.

  5. Background – The Smith-Waterman algorithm (1981). T(i,0) = T(0,j) = 0 for all i, j (1 ≤ i ≤ m; 1 ≤ j ≤ n). T(i,j) = max{0, T(i-1,j-1) + S(Yi,Xj), T(i-1,j) + D(Yi), T(i,j-1) + I(Xj)}, where D(Yi) = I(Xj) = -0.4 and S(Yi,Xj) = 1 if Yi = Xj, -0.3 if Yi ≠ Xj.
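The recurrence above can be sketched directly in Python. The function below is a minimal illustration (the function name and signature are mine, not from the slides), using the slide's scoring of +1 for a match, -0.3 for a mismatch and -0.4 for an indel, and returning the best local score in the table.

```python
def smith_waterman(X, Y, match=1.0, mismatch=-0.3, gap=-0.4):
    """Best local alignment score of X and Y under the slide's scoring."""
    m, n = len(Y), len(X)
    # T[i][j] holds the best score of a local alignment ending at (i, j)
    T = [[0.0] * (n + 1) for _ in range(m + 1)]
    best = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if Y[i - 1] == X[j - 1] else mismatch
            T[i][j] = max(0.0,
                          T[i - 1][j - 1] + s,   # match / substitution
                          T[i - 1][j] + gap,     # deletion
                          T[i][j - 1] + gap)     # insertion
            best = max(best, T[i][j])
    return best
```

A full implementation would also record the cell attaining the maximum and trace back to recover the aligned substrings; this sketch returns only the score.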

  6. Background – The Smith-Waterman algorithm. The weaknesses of the Smith-Waterman algorithm (according to Arslan, Eğecioğlu and Pevzner): • Maximal score vs. maximal degree of similarity. What would reflect a higher similarity level: 71 (score) over 10,000 symbols, or 70 over 200? • Mosaic effect - lack of ability to discard poorly conserved intermediate segments. • Shadow effect - short alignments may not be detected because they are overlapped by longer alignments. • The sparsity of the essential data is not exploited. These weaknesses cannot be fixed by post-processing.

  7. Background – Normalized local alignment • The statistical significance of the local alignment depends on both its score and length. • Thus, the solution for these weaknesses is: Normalization • Instead of maximizing S(X’,Y’), maximize S(X’,Y’)/(|X’|+|Y’|). • Under that scoring scheme, one match is always an optimal alignment. Thus, a minimal length or a minimal score constraint is needed.

  8. Background – Normalized sequence alignment • The algorithm of Arslan, Eğecioğlu and Pevzner (2001) converges to the optimal normalized alignment value through iterations of the Smith-Waterman algorithm. • They solve the problem SCORE(X’,Y’)/(|X’|+|Y’|+L), where L is a constant that controls the amount of normalization. • The ratio between L and |X’|+|Y’| determines the influence of L on the value of the alignment. • The time complexity of their algorithm is O(n² log n).

  9. Our approach • Maximize LCS(X’,Y’)/(|X’|+|Y’|). • It can be viewed as a measure of the density of the matches. • A minimal length or score constraint, M, must be enforced; we chose the score constraint (the value of LCS(X’,Y’)). • The value of M is problem related.
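As a reference point, the objective can be checked by brute force on short strings: enumerate every substring pair, compute their LCS, and keep the best normalized value among pairs meeting the score constraint M. The quartic-time sketch below (function names are mine) is useful only for validating faster algorithms on tiny inputs.

```python
def lcs_len(a, b):
    """Standard O(|a||b|) LCS dynamic program."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def best_normalized(X, Y, M):
    """Max of LCS(X', Y') / (|X'| + |Y'|) over substring pairs
    with LCS(X', Y') >= M (brute force)."""
    best = 0.0
    for i in range(len(X)):
        for j in range(i + 1, len(X) + 1):
            for k in range(len(Y)):
                for l in range(k + 1, len(Y) + 1):
                    s = lcs_len(X[i:j], Y[k:l])
                    if s >= M:
                        best = max(best, s / ((j - i) + (l - k)))
    return best
```

Note that with M = 1 a single matching character already attains the maximum possible value of 1/2, which is exactly why the slides insist on a meaningful minimal constraint.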

  10. The naïve O(rL log log n) normalized local LCS algorithm

  11. Definitions • A chain is a sequence of matches that is strictly increasing in both components. • The length of a chain from match (i,j) to match (i’,j’) is i’-i+j’-j. • A k-chain(i,j) is the shortest chain of k matches starting from (i,j). • The normalized value of k-chain(i,j) is k divided by its length.

  12. The naïve algorithm • For each match (i,j), construct k-chain(i,j) for 1≤k≤L (L=LCS(X,Y)). • Computing the best chains starting from each match guarantees that the optimal chain will not be missed. • Examine all the k-chains with k≥M of all matches and report either: • The k-chains with the highest normalized value. • k-chains whose normalized value exceeds a predefined threshold.
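A direct way to realize this step, ignoring the data structures that yield the O(rL log log n) bound, is a quadratic dynamic program over the matches. The sketch below (function and variable names are mine) computes, for every match, the length of the shortest k-chain starting there, following the length definition of slide 11.

```python
def k_chains(X, Y, L):
    """For each match (i, j) with Y[i] == X[j], compute the length of
    the shortest k-chain starting at (i, j), for 1 <= k <= L.
    Simplified O(r^2 * L) sketch, without the sophisticated data
    structures used by the real algorithm."""
    matches = [(i, j) for i in range(len(Y)) for j in range(len(X))
               if Y[i] == X[j]]
    # length[(i, j)][k] = length of the shortest chain of k matches
    # starting at (i, j), or None if no such chain exists
    length = {m: [None] * (L + 1) for m in matches}
    # Process matches bottom-up so successors are already solved.
    for (i, j) in sorted(matches, reverse=True):
        length[(i, j)][1] = 0          # a single match has length 0
        for (i2, j2) in matches:
            if i2 > i and j2 > j:      # (i2, j2) can extend the chain
                step = (i2 - i) + (j2 - j)
                for k in range(2, L + 1):
                    prev = length[(i2, j2)][k - 1]
                    if prev is not None:
                        cand = step + prev
                        cur = length[(i, j)][k]
                        if cur is None or cand < cur:
                            length[(i, j)][k] = cand
    return length
```

Reporting then scans all the computed k-chains with k ≥ M for the best normalized value; the real algorithm replaces the inner scan over all successor matches with the range-ownership structures described on the following slides.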

  13. Problem: k-chain(i,j) is not necessarily the prefix of (k+1)-chain(i,j). (Figure: example strings a b c a d e c f h c g and b f h e c g g g f d e f.)

  14. Solution: construct (k+1)-chain(i,j) by concatenating (i,j) to k-chain(i’,j’). (Figure: the same example strings.)

  15. Question: How to find the proper k-chain(i’,j’)? • If there is only one candidate ((i,j) is in the range of a single match (i’,j’)), it is clear. • What if there are two candidates ((i,j) is in the mutual range of two matches)?

  16. Lemma: A mutual range of two matches is owned completely by one of them.

  17. We use the lemma in order to maintain L data structures. In the k-th data structure: • All the matches are the heads of k-chains. • Each match owns the range to its left. Computing the (k+1)-chain of a match is done by concatenating it to the owner of the range it is in.

  18. The algorithm • Preprocessing: • create the list of matches of each row. • Process the matches row by row, from bottom up. For the matches of row i: • Stage 1: Construct k-chains 1≤k≤L. • Stage 2: Update the data structures with the matches of row i and their k-chains. They will be used for the computation of next rows. • Examine all k-chains of all matches and report the ones with the highest normalized value.

  19. Complexity analysis • Preprocessing: O(n log |Σ_Y|). • Stage 1: for each of the r matches we construct at most L k-chains, with total complexity O(rL log log n) when Johnson trees are used as our data structures. • Stage 2: each of the r matches is inserted into and extracted from each data structure at most once, so the total complexity is again O(rL log log n).

  20. Complexity analysis • Checking all k-chains of all matches and reporting the best alignments consumes O(rL) time. • The total time complexity of this algorithm is O(n log |Σ_Y| + rL log log n). • The space complexity is O(rL + nL): • r matches with (at most) L records each. • The space of L Johnson trees of size n.

  21. The O(rM log log n) normalized local LCS algorithm

  22. The O(rM log log n) normalized local LCS algorithm. The algorithm reports the best possible local alignment (value and substrings). • This section is divided into: • Computing the highest normalized value. • Constructing the longest optimal alignment.

  23. Computing the highest normalized value. Definition: A sub-chain of a k-chain is a path that contains a sequence of x ≤ k consecutive matches of the k-chain. It does not have to start or end at a match. (Figure: example strings a b c a d e c f h c g and b f h e c g.)

  24. Computing the highest normalized value. Claim: When a k-chain is split into a number of non-overlapping consecutive sub-chains, the value of the k-chain is at most equal to the value of the best sub-chain. For example, 10/40 = (3+2+3+2)/(14+5+12+9), and the best of the sub-chain values 3/14, 2/5, 3/12 and 2/9 is 2/5 = 0.4 ≥ 10/40 = 0.25.

  25. Computing the highest normalized value. Result: • Any k-chain with k ≥ M may be split into non-overlapping consecutive sub-chains of M matches, followed by a last sub-chain of up to 2M-1 matches. • The normalized value of the best sub-chain will be at least equal to that of the k-chain. Assume M = 3: a 10-chain of value 10/40 = (3+3+4)/(12+14+14) splits into sub-chains of 3, 3 and 4 matches.
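The split in this example can be checked mechanically. The snippet below (names are mine) uses exact fractions to confirm that the best sub-chain of the 10-chain, here 4/14, is at least as good as the whole chain's 10/40.

```python
from fractions import Fraction

def best_subchain_value(parts):
    """parts: (matches, length) pairs for consecutive sub-chains.
    Returns the highest normalized value among them."""
    return max(Fraction(k, s) for k, s in parts)

parts = [(3, 12), (3, 14), (4, 14)]              # the 10-chain, M = 3
whole = Fraction(sum(k for k, _ in parts),
                 sum(s for _, s in parts))       # 10/40
assert best_subchain_value(parts) >= whole
```

This is an instance of the mediant inequality: the value of the concatenation, (k1+k2)/(s1+s2), always lies between k1/s1 and k2/s2, so it can never exceed the best part.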

  26. Computing the highest normalized value • A sub-chain of fewer than M matches may not be reported. • Sub-chains of 2M matches or more can be split into shorter sub-chains of M to 2M-1 matches. Question: Is it sufficient to construct all the sub-chains of exactly M matches? (Figure: an example where a sub-chain of value 4/10 beats the best sub-chain of exactly M matches, of value 3/8.)

  27. Computing the highest normalized value • The algorithm: for each match, construct all the k-chains for k ≤ 2M-1. • The algorithm constructs all these chains, which are, in fact, the sub-chains of all the longer k-chains. • A longer chain cannot be better than its best sub-chain. • This algorithm reports the highest normalized value of a sub-chain, which is equal to the highest normalized value of a chain.

  28. Constructing the longest optimal alignment Definition: A perfect alignment is an alignment of two identical strings. Its normalized value is ½. Lemma: unless the optimal alignment is perfect, the longest optimal alignment has no more than 2M-1 matches.

  29. Constructing the longest optimal alignment. Proof: Assume there is a chain with more than 2M-1 matches whose normalized value is optimal, denoted LO. • LO may be split into a number of sub-chains of M matches, followed by a single sub-chain of between M and 2M-1 matches. • The normalized value of each such sub-chain must be equal to that of LO; otherwise, LO is not optimal. • Each such sub-chain must start and end at a match; otherwise, the normalized value of the chain comprised of the same matches would be higher than that of LO.

  30. Constructing the longest optimal alignment • The tails and heads of the sub-chains from which LO is comprised must be next to each other. Joining one sub-chain of value M/S to the adjacent head yields a chain of M+1 matches and length S+2. • Since LO is not perfect, M/S < 1/2, so S > 2M, and therefore (M+1)/(S+2) > M/S. Thus, we found a chain of M+1 matches whose normalized value is higher than that of LO, in contradiction to the optimality of LO.

  31. Closing remarks

  32. The advantages of the new algorithm • Ideal for textual local comparison as well as for screening biological sequences. • Normalized, and thus does not suffer from the shadow and mosaic effects. • A straightforward approach to the minimal constraint.

  33. The advantages of the new algorithm • The minimal constraint is problem related rather than input related. • If we refer to it as a constant, the complexity of the algorithm is O(r log log n). • Since for textual comparison we can expect r << n², the complexity may be even better than that of the non-normalized local similarity algorithms.

  34. The advantages of the new algorithm • The O(rM log log n) algorithm computes the optimal normalized alignments. • The advantage of the O(rL log log n) algorithm is that it can report all the long alignments that exceed a predefined value, and not only the short optimal alignments.

  35. Questions

  36. The end
