
A Sub-quadratic Sequence Alignment Algorithm


yachi


Presentation Transcript


  1. A Sub-quadratic Sequence Alignment Algorithm

  2. Global alignment [Figure: alignment graph for S = aacgacga, T = ctacgaga] • V(i,j) = max { V(i-1,j-1) + δ(S[i], T[j]), V(i-1,j) + δ(S[i], -), V(i,j-1) + δ(-, T[j]) }, where δ is the scoring function • Complexity: O(n^2)
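
The recurrence on this slide can be sketched as a plain quadratic dynamic program. This is a minimal illustration, not the deck's code: the function names and the scoring values (+1 match, -1 mismatch, -1 gap) are assumptions chosen for the example.

```python
def unit_score(x, y):
    """Hypothetical scoring function: +1 match, -1 mismatch, -1 gap ('-')."""
    if x == "-" or y == "-":
        return -1
    return 1 if x == y else -1


def global_alignment_score(s, t, score=unit_score):
    """Fill the (len(s)+1) x (len(t)+1) table V and return V[len(s)][len(t)]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and first row: a prefix aligned against gaps only.
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + score(s[i - 1], "-")
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + score("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + score(s[i - 1], t[j - 1]),  # align two characters
                V[i - 1][j] + score(s[i - 1], "-"),           # gap in t
                V[i][j - 1] + score("-", t[j - 1]),           # gap in s
            )
    return V[n][m]


print(global_alignment_score("aacgacga", "ctacgaga"))
```

Every cell is visited once, which is where the O(n^2) bound comes from when both sequences have length n.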

  3. Four Russian algorithm

  4. Unrestricted scoring function

  5. Main idea: Compress the sequences • LZ-78: Divide the sequence into distinct words • T = ctacgaga • S = aacgacga [Figure: trie for T and trie for S] • The number of distinct words:

  6. Main idea • Compute the alignment score in each block • Propagate the scores between the adjacent blocks [Figure: trie for S, trie for T, and the alignment graph partitioned into blocks]

  7. Main idea • Compress the sequence into words • Pre-compute the score for each block • Do alignment between blocks • Note: • Replace normal characters by words • Operate on blocks

  8. LZ-78 Compress the sequence

  9. LZ-78: Divide the sequence into distinct words • T = ctacgaga • S = aacgacga [Figure: trie for T and trie for S] • The number of distinct words:

  10. LZ-78 • Theorem (Lempel and Ziv): for a sequence S over a constant-size alphabet, the maximal number of distinct phrases in S is O(n/log n) • Tighter upper bound: O(hn/log n) • h is the entropy factor, a real number with 0 < h ≤ 1 • The entropy is small when the sequence is repetitive
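
The LZ-78 division into words can be sketched with a dictionary standing in for the trie. The function name and the `(parent, character)` key layout are assumptions made for this illustration. A leftover suffix that repeats an earlier word is kept as a final, non-distinct word.

```python
def lz78_words(seq):
    """Split seq into LZ-78 words; return (words, trie).

    trie maps (parent_node, character) -> child_node, with node 0 as the root.
    Each new word is the longest previously seen word plus one character.
    """
    trie = {}
    words = []
    node, start = 0, 0
    for pos, ch in enumerate(seq):
        child = trie.get((node, ch))
        if child is not None:
            node = child                      # keep walking down the trie
        else:
            trie[(node, ch)] = len(trie) + 1  # new trie node = new distinct word
            words.append(seq[start:pos + 1])
            node, start = 0, pos + 1
    if start < len(seq):
        words.append(seq[start:])             # trailing word, already in the trie
    return words, trie


words_s, trie_s = lz78_words("aacgacga")
words_t, trie_t = lz78_words("ctacgaga")
print(words_s, len(trie_s))
print(words_t, len(trie_t))
```

With the two sequences from the slides, the trie for S gets 4 nodes besides the root and the trie for T gets 5, and the pair of words "acg" (S) and "ag" (T) is the block used in the later DIST example.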

  11. Compute the alignment score in each block

  12. Compute the alignment score in each block [Figure: alignment graph partitioned into blocks]

  13. [Figure: block G with input border I and output border O] • Given: the input border I and the block G • Compute: the output border O

  14. Matrices [Figure: block G with input border I and output border O] • I[i]: the input border value • DIST[i,j]: weight of the optimal path from entry i of the input border to entry j of the output border • OUT[i,j]: merges the information from the input row I and DIST • OUT[i,j] = I[i] + DIST[i,j] • O[j] = max{OUT[i,j] for i=1..n}
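
The definitions of OUT and O translate directly into a quadratic computation. The sketch below is only an illustration: the numbers are the I and DIST values as they appear in the deck's DIST matrix slides, with Δ read as an undefined entry and represented here by negative infinity (an assumption of this example), and indices are 0-based.

```python
NEG_INF = float("-inf")  # stands in for the undefined entries (Δ)

I = [1, 2, 3, 2, 1, 3]
DIST = [
    [0, -1, -2, -3, NEG_INF, NEG_INF],
    [-1, -1, -2, -1, -2, NEG_INF],
    [-2, 0, 0, 1, -1, -3],
    [NEG_INF, -2, -2, 0, -2, -2],
    [NEG_INF, NEG_INF, -2, 0, -1, -1],
    [NEG_INF, NEG_INF, NEG_INF, -2, -1, 0],
]


def output_border(I, DIST):
    """Naive propagation: OUT[i][j] = I[i] + DIST[i][j], O[j] = max over i."""
    size = len(I)
    OUT = [[I[i] + DIST[i][j] for j in range(size)] for i in range(size)]
    return [max(OUT[i][j] for i in range(size)) for j in range(size)]


print(output_border(I, DIST))
```

This touches every entry of OUT, so it is quadratic in the border size; the later slides replace it with a method that avoids building OUT.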

  15. DIST and OUT matrix example • Block: sub-sequences “acg”, “ag” [Figure: DIST matrix, I (input borders), OUT matrix, and the column maxima]

  16. For each block, given two sub-sequences S1, S2 • Compute (from scratch) DIST in Θ(n*m) time • Given I and DIST, compute OUT in Θ(n*m) time • Given OUT[i,j], compute O in Θ(m*n) time
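
To make DIST concrete, here is a deliberately naive sketch that computes it from scratch by running one small dynamic program per input border entry. This is simpler and slower than what the deck describes. The border ordering (input: up the left column, then along the top row; output: along the bottom row, then up the right column), the scoring values, and all names are assumptions of this example.

```python
NEG_INF = float("-inf")


def score(x, y):
    """Hypothetical scoring: +1 match, -1 mismatch, -1 gap ('-')."""
    if x == "-" or y == "-":
        return -1
    return 1 if x == y else -1


def dist_from_scratch(rows, cols):
    """DIST[i][j] = best path weight from input entry i to output entry j.

    rows labels the vertical side of the block, cols the horizontal side.
    Unreachable pairs keep the value -inf.
    """
    m, n = len(rows), len(cols)
    inputs = [(r, 0) for r in range(m, -1, -1)] + [(0, c) for c in range(1, n + 1)]
    outputs = [(m, c) for c in range(n + 1)] + [(r, n) for r in range(m - 1, -1, -1)]
    dist = []
    for r0, c0 in inputs:
        best = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
        best[r0][c0] = 0
        for r in range(r0, m + 1):
            for c in range(c0, n + 1):
                if r > r0:  # step down: rows[r-1] against a gap
                    best[r][c] = max(best[r][c], best[r - 1][c] + score(rows[r - 1], "-"))
                if c > c0:  # step right: cols[c-1] against a gap
                    best[r][c] = max(best[r][c], best[r][c - 1] + score("-", cols[c - 1]))
                if r > r0 and c > c0:  # diagonal: align the two characters
                    best[r][c] = max(best[r][c],
                                     best[r - 1][c - 1] + score(rows[r - 1], cols[c - 1]))
        dist.append([best[r][c] for r, c in outputs])
    return dist


for row in dist_from_scratch("ag", "acg"):
    print(row)
```

Because it repeats the block's dynamic program once per input entry, this version costs more than a single pass over the block; it is meant only to pin down what each DIST entry means.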

  17. Review • Compress the sequence • Pre-compute DIST[i,j] for each block • Compute the border values of each block • Remaining questions • How to compute DIST[i,j] efficiently? • How to compute O[j] from I[i] and DIST[i,j] efficiently?

  18. Compute O[j] Efficiently

  19. Compute O[j] efficiently • For each block of two sub-sequences S1, S2 • Given • I[i] • DIST[i,j] • Compute • O[j]

  20. DIST and OUT matrix example • Block: sub-sequences “acg”, “ag” [Figure: DIST matrix, I (input borders), OUT matrix, and the column maxima]

  21. Compute O without explicit OUT • Block: sub-sequences “acg”, “ag” [Figure: DIST matrix and I (input borders), combined by SMAWK]

  22. Given DIST[i,j] and I[i], we can compute O[j] in O(n+m) time • Without creating OUT[i,j] • How? Why?

  23. Why? • Aggarwal, Park and Schmidt observed that DIST and OUT matrices are Monge arrays. • Definition: a matrix M[0…m,0…n] is totally monotone if either condition 1 or 2 below holds for all a,b=0…m; c,d=0…n: • 1. Convex condition: M[a,c] ≥ M[b,c] ⇒ M[a,d] ≥ M[b,d] for all a<b and c<d. • 2. Concave condition: M[a,c] ≤ M[b,c] ⇒ M[a,d] ≤ M[b,d] for all a<b and c<d.
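
The concave condition can be checked by brute force on small matrices, which helps when experimenting with the definition. The sketch below is illustrative only: the function name and the two test matrices are made up for this example. The first matrix has the shape I[i] + D[i][j] with D[i][j] = -(i-j)^2, which satisfies the condition.

```python
from itertools import combinations


def is_concave_totally_monotone(M):
    """Check: M[a][c] <= M[b][c] implies M[a][d] <= M[b][d] for all a<b, c<d."""
    n_rows, n_cols = len(M), len(M[0])
    for a, b in combinations(range(n_rows), 2):      # a < b
        for c, d in combinations(range(n_cols), 2):  # c < d
            if M[a][c] <= M[b][c] and not M[a][d] <= M[b][d]:
                return False
    return True


offsets = [1, 2, 3, 2, 1, 3]  # plays the role of an input border I
good = [[offsets[i] - (i - j) ** 2 for j in range(6)] for i in range(6)]
bad = [[0, 1],
       [0, 0]]

print(is_concave_totally_monotone(good), is_concave_totally_monotone(bad))
```

The check is quartic in the matrix size, so it is a debugging aid rather than part of the algorithm.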

  24. How? • Aggarwal et al. gave a recursive algorithm, called SMAWK, which can find all row and column maxima of a totally monotone matrix by querying only O(n) elements of the matrix.
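
SMAWK itself is fairly intricate, so the sketch below deliberately uses a simpler divide-and-conquer scheme instead. It is not SMAWK and it queries O((n+m) log m) entries rather than a linear number, but it exploits the same property: under the concave condition, the bottom-most row holding a column's maximum never moves up as the column index grows. The matrix is accessed through a callback so that OUT is never stored. All names and the example matrix are assumptions of this illustration.

```python
def column_maxima(value, n_rows, n_cols):
    """Return, per column, the bottom-most row index holding the column maximum.

    value(r, c) must describe a matrix satisfying the concave condition.
    """
    best_row = [0] * n_cols

    def solve(c_lo, c_hi, r_lo, r_hi):
        if c_lo > c_hi:
            return
        mid = (c_lo + c_hi) // 2
        r_best = r_lo
        for r in range(r_lo + 1, r_hi + 1):
            if value(r, mid) >= value(r_best, mid):  # >= keeps the bottom-most maximum
                r_best = r
        best_row[mid] = r_best
        solve(c_lo, mid - 1, r_lo, r_best)  # columns to the left use rows above or at r_best
        solve(mid + 1, c_hi, r_best, r_hi)  # columns to the right use rows at or below it

    solve(0, n_cols - 1, 0, n_rows - 1)
    return best_row


I = [1, 2, 3, 2, 1, 3]


def out(i, j):
    """Hypothetical OUT entry: I[i] plus a made-up concave DIST, -(i-j)^2."""
    return I[i] - (i - j) ** 2


rows = column_maxima(out, len(I), 6)
O = [out(r, j) for j, r in enumerate(rows)]
print(rows, O)
```

The callback style mirrors the idea on the slides: each OUT entry is formed on demand from I and DIST, so only the queried entries are ever evaluated.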

  25. Why is DIST[i,j] totally monotone? • The concave condition: if the path b-c is better than the path a-c, then b-d is better than a-d. [Figure: block with input border entries a, b and output border entries c, d]

  26. Other problem • Rectangle problem of DIST • Set upper right corner of OUT to -∞ • Set lower left corner of OUT to -(n+i-1)*k • Preserve the totally monotone property of OUT

  27. Compute DIST[i,j] efficiently

  28. Compute DIST[i,j] for block(5/4) [Figure: trie for S, trie for T, and the alignment graph partitioned into blocks]

  29-33. DIST matrix (Δ marks an undefined entry; each row i is listed with its input border value I)
       i = 0, I = 1:    0  -1  -2  -3   Δ   Δ
       i = 1, I = 2:   -1  -1  -2  -1  -2   Δ
       i = 2, I = 3:   -2   0   0   1  -1  -3
       i = 3, I = 2:    Δ  -2  -2   0  -2  -2
       i = 4, I = 1:    Δ   Δ  -2   0  -1  -1
       i = 5, I = 3:    Δ   Δ   Δ  -2  -1   0

  34. Only column m in DIST[i,j] is new • DIST block can be updated in O(m+n)

  35. Maintaining direct access to DIST table

  36-38. [Figure: trie for S and trie for T next to the alignment graph partitioned into blocks]

  39. Complexity • Assume |S| = |T| = n • Number of words in S, T = O(hn/log n) • Number of blocks in alignment graph: O(h^2 n^2/(log n)^2) • For each block • Update new DIST block: O(t), where t = size of the border • Create direct access table: O(t) • Propagating I/O across blocks • SMAWK: O(t) • Sum of the sizes of all borders is O(hn^2/log n) • Total complexity: O(hn^2/log n)
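
The bound on the sum of the border sizes follows from a short counting argument, written out below. The symbols W_S and W_T (the sets of LZ-78 words of S and T) and t_{x,y} (the border size of the block for words x and y) are notation introduced here for illustration; the argument uses only that the words of each sequence partition it, so their lengths sum to n.

```latex
\[
\sum_{x \in W_S} \sum_{y \in W_T} t_{x,y}
  = O\!\Bigl( \sum_{x \in W_S} \sum_{y \in W_T} \bigl(|x| + |y|\bigr) \Bigr)
  = O\!\Bigl( |W_T| \sum_{x \in W_S} |x| \;+\; |W_S| \sum_{y \in W_T} |y| \Bigr)
  = O\!\bigl( n\,|W_T| + n\,|W_S| \bigr)
  = O\!\left( \frac{h n^2}{\log n} \right)
\]
```

Since each block costs O(t) for the DIST update, the direct access table, and the SMAWK propagation, the total running time matches this sum.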

  40. Other extensions • Trace • Reducing the space complexity for discrete scoring • Local alignment

  41. References • Crochemore, M.; Landau, G. M. & Ziv-Ukelson, M. A sub-quadratic sequence alignment algorithm for unrestricted cost matrices. ACM-SIAM, 2002, 679-688 • Some pictures from 葉恆青
