Efficient Sub-quadratic Alignment Algorithm Utilizing Trie Compression

A Sub-quadratic Sequence Alignment Algorithm

Global alignment
• [Figure: alignment graph for S = aacgacga, T = ctacgaga]
• V(i,j) = max { V(i-1,j-1) + δ(S[i], T[j]), V(i-1,j) + δ(S[i], -), V(i,j-1) + δ(-, T[j]) }
• Complexity: O(n^2)
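The recurrence above can be sketched in Python as follows. This is a minimal illustration, not the deck's own code: the scoring function `default_score` (+1 match, -1 mismatch, -1 gap) and the function names are assumptions, since the slides leave the scoring function unrestricted.

```python
def default_score(x, y):
    """Assumed scoring for illustration: +1 match, -1 mismatch, -1 gap ('-')."""
    if x == "-" or y == "-":
        return -1
    return 1 if x == y else -1


def global_alignment_score(s, t, score=default_score):
    """Fill the full (n+1) x (m+1) table V and return V(n, m)."""
    n, m = len(s), len(t)
    v = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and first row: only gaps are possible.
    for i in range(1, n + 1):
        v[i][0] = v[i - 1][0] + score(s[i - 1], "-")
    for j in range(1, m + 1):
        v[0][j] = v[0][j - 1] + score("-", t[j - 1])
    # Python strings are 0-based, so S[i] in the recurrence is s[i - 1] here.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            v[i][j] = max(
                v[i - 1][j - 1] + score(s[i - 1], t[j - 1]),  # align S[i] with T[j]
                v[i - 1][j] + score(s[i - 1], "-"),           # S[i] against a gap
                v[i][j - 1] + score("-", t[j - 1]),           # gap against T[j]
            )
    return v[n][m]
```

Every cell is visited once, which is where the quadratic cost comes from; the rest of the deck is about avoiding that cell-by-cell work.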

Four-Russians algorithm

Unrestricted scoring function

Main idea: compress the sequences
• LZ-78: divide the sequence into distinct words
• T = ctacgaga
• S = aacgacga
• [Figure: LZ-78 tries for S and T]
• The number of distinct words: O(n/log n)

Main idea
• [Figure: trie for S, trie for T, and the alignment graph divided into blocks]
• Compute the alignment score in each block
• Propagate the scores between the adjacent blocks

Main idea
• Compress the sequence into words
• Pre-compute the score for each block
• Do alignment between blocks
• Note:
  • Replace normal characters by words
  • Operate on blocks

LZ-78 Compress the sequence

LZ-78
• LZ-78: divide the sequence into distinct words
• T = ctacgaga
• S = aacgacga
• [Figure: LZ-78 tries for S and T]
• The number of distinct words: O(n/log n)
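A minimal sketch of the LZ-78 parse, assuming the usual formulation in which each new word is the longest previously seen word plus one extra character. The function name `lz78_parse` and the dictionary-based trie are choices made for this illustration.

```python
def lz78_parse(seq):
    """Split seq into LZ-78 words and build the trie as a dictionary.

    The trie maps (parent node id, character) -> child node id; node 0 is the root.
    """
    trie = {}
    words = []
    node, start = 0, 0
    for pos, ch in enumerate(seq):
        key = (node, ch)
        if key in trie:
            node = trie[key]             # keep walking down an existing word
        else:
            trie[key] = len(trie) + 1    # new word = existing word + one character
            words.append(seq[start:pos + 1])
            node, start = 0, pos + 1
    if start < len(seq):
        words.append(seq[start:])        # trailing piece that repeats an earlier word
    return words, trie
```

On the two example sequences, `lz78_parse("aacgacga")` yields the words `a, ac, g, acg, a` (trie nodes 1 to 4) and `lz78_parse("ctacgaga")` yields `c, t, a, cg, ag, a` (trie nodes 1 to 5). The final `a` in each case is a leftover that is already in the trie, so it adds no new node.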

LZ-78
• Theorem (Lempel and Ziv): for a sequence S of length n over a constant-size alphabet, the maximal number of distinct phrases in S is O(n/log n)
• Tighter upper bound: O(hn/log n)
• h is the entropy factor, a real number with 0 < h ≤ 1
• Entropy is small ⇒ the sequence is repetitive

Compute the alignment score in each block

[Figure: alignment graph divided into blocks by the words of S and T]

[Figure: block G with input border I and output border O]
• Given
  • Input border: I
  • Block: G
• Compute
  • Output border: O

Matrices
• I[i]: the input border value
• DIST[i,j]: weight of the optimal path from entry i of the input border to entry j of the output border
• OUT[i,j]: merges the information from the input row I and DIST
  • OUT[i,j] = I[i] + DIST[i,j]
• O[j] = max{OUT[i,j] for i=1..n}
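The definitions above translate directly into a naive computation. The sketch below uses the I and DIST values as transcribed from the DIST matrix slide for the block "acg" / "ag" later in this deck (indices are 0-based here, and entries with no path are represented by negative infinity); the function names are assumptions made for this illustration.

```python
NO_PATH = float("-inf")  # stands for the entries marked as having no path

# Values transcribed from the deck's DIST matrix example (block "acg" / "ag").
I = [1, 2, 3, 2, 1, 3]
DIST = [
    [0, -1, -2, -3, NO_PATH, NO_PATH],
    [-1, -1, -2, -1, -2, NO_PATH],
    [-2, 0, 0, 1, -1, -3],
    [NO_PATH, -2, -2, 0, -2, -2],
    [NO_PATH, NO_PATH, -2, 0, -1, -1],
    [NO_PATH, NO_PATH, NO_PATH, -2, -1, 0],
]


def out_matrix(border_in, dist):
    """OUT[i][j] = I[i] + DIST[i][j]."""
    return [[border_in[i] + dist[i][j] for j in range(len(dist[i]))]
            for i in range(len(border_in))]


def output_border(border_in, dist):
    """O[j] = max over i of OUT[i][j], i.e. the column maxima of OUT."""
    out = out_matrix(border_in, dist)
    return [max(row[j] for row in out) for j in range(len(out[0]))]


print(output_border(I, DIST))  # [1, 3, 3, 4, 2, 3]
```

This version touches every entry of OUT, so it is quadratic in the border size; it only serves as a reference for what the faster propagation has to reproduce.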

DIST and OUT matrix example
• Block: sub-sequences "acg", "ag"
• [Figure: DIST matrix, I (input borders), OUT matrix, and the column maxima of OUT]

For each block, given two sub-sequences S1, S2
• Compute DIST (from scratch) in O(n*m) time
• Given I and DIST, compute OUT in O(n*m) time
• Given OUT[i,j], compute O in O(m*n) time
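To make "DIST from scratch" concrete, here is a deliberately naive sketch that runs one small dynamic program per input border entry. Several things are assumptions of this illustration rather than statements from the slides: the scoring (+1 match, -1 mismatch, -1 gap), the border ordering (input border from the bottom-left corner up the left column and then along the top row; output border from the bottom-left corner along the bottom row and then up the right column), and all function names. Because it repeats a full DP for every border entry, it does not try to meet the time bound quoted above.

```python
from math import inf


def score(x, y):
    """Assumed scoring: +1 match, -1 mismatch, -1 gap ('-')."""
    if x == "-" or y == "-":
        return -1
    return 1 if x == y else -1


def input_border(n, m):
    # Bottom-left corner, up the left column, then right along the top row.
    return [(r, 0) for r in range(n, -1, -1)] + [(0, c) for c in range(1, m + 1)]


def output_border_cells(n, m):
    # Bottom-left corner, right along the bottom row, then up the right column.
    return [(n, c) for c in range(m + 1)] + [(r, m) for r in range(n - 1, -1, -1)]


def best_from(a, b, start):
    """Best path weight from `start` to every reachable cell of the block."""
    n, m = len(a), len(b)
    r0, c0 = start
    v = [[-inf] * (m + 1) for _ in range(n + 1)]
    v[r0][c0] = 0
    for r in range(r0, n + 1):
        for c in range(c0, m + 1):
            if (r, c) == start:
                continue
            best = -inf
            if r > r0 and c > c0:
                best = max(best, v[r - 1][c - 1] + score(a[r - 1], b[c - 1]))
            if r > r0:
                best = max(best, v[r - 1][c] + score(a[r - 1], "-"))
            if c > c0:
                best = max(best, v[r][c - 1] + score("-", b[c - 1]))
            v[r][c] = best
    return v


def dist_matrix(a, b):
    """DIST[i][j]: best path from input border entry i to output border entry j."""
    n, m = len(a), len(b)
    outs = output_border_cells(n, m)
    rows = []
    for start in input_border(n, m):
        v = best_from(a, b, start)
        rows.append([v[r][c] for (r, c) in outs])
    return rows
```

For example, `dist_matrix("ag", "acg")` gives a 6 x 6 matrix whose row for the top-left corner is `[-2, 0, 0, 1, -1, -3]`; unreachable pairs come out as `-inf`.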

Review
• Compress the sequence
• Pre-compute DIST[i,j] for each block
• Compute the border values of each block
• Remaining questions
  • How to compute DIST[i,j] efficiently?
  • How to compute O[j] from I[i] and DIST[i,j] efficiently?

Compute O[j] Efficiently

Compute O[j] efficiently
• For each block of two sub-sequences S1, S2
• Given
  • I[i]
  • DIST[i,j]
• Compute
  • O[j]


Compute O without explicit OUT
• Block: sub-sequences "acg", "ag"
• [Figure: DIST matrix and I (input borders); O is obtained with SMAWK]

Given DIST[i,j] and I[i], we can compute O[j] in O(n+m) time
• Without creating OUT[i,j]
• How? Why?

Why?
• Aggarwal, Park and Schmidt observed that DIST and OUT matrices are Monge arrays.
• Definition: a matrix M[0…m,0…n] is totally monotone if either condition 1 or 2 below holds for all a,b=0…m; c,d=0…n:
  1. Convex condition: M[a,c] ≥ M[b,c] ⇒ M[a,d] ≥ M[b,d] for all a<b and c<d.
  2. Concave condition: M[a,c] ≤ M[b,c] ⇒ M[a,d] ≤ M[b,d] for all a<b and c<d.
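A brute-force check can make the definition tangible. The sketch below reads the concave condition as "M[a,c] ≤ M[b,c] implies M[a,d] ≤ M[b,d] for all a<b and c<d"; that reading, the function name, and the two sample matrices are assumptions made for this illustration.

```python
from itertools import combinations


def satisfies_concave_condition(M):
    """Return True if M[a][c] <= M[b][c] implies M[a][d] <= M[b][d]
    for every pair of rows a < b and every pair of columns c < d."""
    rows, cols = range(len(M)), range(len(M[0]))
    for a, b in combinations(rows, 2):        # combinations yields a < b
        for c, d in combinations(cols, 2):    # and c < d
            if M[a][c] <= M[b][c] and not (M[a][d] <= M[b][d]):
                return False
    return True


# A matrix that satisfies the condition: M[i][j] = -(i - j)^2.
good = [[-(i - j) ** 2 for j in range(5)] for i in range(5)]
# A matrix that violates it: row 1 wins in column 0 but loses in column 1.
bad = [[1, 2], [2, 1]]
print(satisfies_concave_condition(good), satisfies_concave_condition(bad))  # True False
```

The check is quartic in the matrix size, so it is only useful for testing small examples, not as part of the alignment algorithm.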

How?
• Aggarwal et al. gave a recursive algorithm, called SMAWK, which can find all row and column maxima of a totally monotone matrix by querying only O(n) elements of the matrix.
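SMAWK itself is fairly intricate, so the sketch below swaps in a simpler divide-and-conquer search that relies on the same structural fact: in a matrix satisfying the concave condition, the row holding each column's maximum (taking the largest row index on ties) never moves upward as the column index grows. It is not SMAWK and is slower, roughly O((rows + columns) · log columns) matrix queries instead of linear, but it shows how O can be computed from I and DIST without materialising OUT. All names are assumptions made for this illustration.

```python
def column_maxima_rows(value, n_rows, n_cols):
    """For each column j, return the row index of its maximum
    (largest index on ties), querying value(i, j) lazily."""
    best_row = [0] * n_cols

    def solve(col_lo, col_hi, row_lo, row_hi):
        if col_lo > col_hi:
            return
        mid = (col_lo + col_hi) // 2
        best = row_lo
        for i in range(row_lo, row_hi + 1):
            if value(i, mid) >= value(best, mid):  # >= keeps the largest row on ties
                best = i
        best_row[mid] = best
        solve(col_lo, mid - 1, row_lo, best)   # maxima to the left lie at or above `best`
        solve(mid + 1, col_hi, best, row_hi)   # maxima to the right lie at or below `best`

    solve(0, n_cols - 1, 0, n_rows - 1)
    return best_row


def output_border_lazy(border_in, dist, n_cols):
    """O[j] = max_i (I[i] + dist(i, j)) without building the OUT matrix."""
    out = lambda i, j: border_in[i] + dist(i, j)
    rows = column_maxima_rows(out, len(border_in), n_cols)
    return [out(rows[j], j) for j in range(n_cols)]
```

A small usage example with a synthetic matrix that satisfies the concave condition: `output_border_lazy([3, 0, 5, 1, 4, 2], lambda i, j: -(i - j) ** 2, 6)` returns the same values as taking the maximum of every column explicitly. Adding the row offsets I[i] keeps the Monge inequality intact, which is why the search can be run on OUT directly.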

Why is DIST[i,j] totally monotone?
• [Figure: paths from input border entries a, b to output border entries c, d]
• The concave condition: if b-c is better than a-c, then b-d is better than a-d.

Other problem
• Rectangle problem of DIST
• Set the upper right corner of OUT to -∞
• Set the lower left corner of OUT to -(n+i-1)*k
• Preserve the totally monotone property of OUT

Compute DIST[i,j] efficiently

Compute DIST[i,j] for block(5/4)
• [Figure: trie for S, trie for T, and the alignment graph divided into blocks]

DIST matrix (Δ marks entries with no path; each row i is labeled with its input border value I)

i   I    j=0  j=1  j=2  j=3  j=4  j=5
0   1     0   -1   -2   -3    Δ    Δ
1   2    -1   -1   -2   -1   -2    Δ
2   3    -2    0    0    1   -1   -3
3   2     Δ   -2   -2    0   -2   -2
4   1     Δ    Δ   -2    0   -1   -1
5   3     Δ    Δ    Δ   -2   -1    0

Only column m in DIST[i,j] is new
• The DIST block can be updated in O(m+n)

Maintaining direct access to DIST table

[Figure: trie for S and trie for T, with the blocks of the alignment graph]

Complexity
• Assume |S| = |T| = n
• Number of words in S, T = O(hn/log n)
• Number of blocks in the alignment graph: O(h^2 n^2/(log n)^2)
• For each block
  • Update the new DIST block: O(t), where t = size of the border
  • Create the direct access table: O(t)
• Propagating I/O across blocks
  • SMAWK: O(t)
• Sum of the sizes of all borders is O(hn^2/log n)
• Total complexity: O(hn^2/log n)
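The step from "O(t) per block" to the total bound can be spelled out as a short derivation. The symbols P_S and P_T (the sets of LZ-78 words of S and T) and x, y (individual words) are notation introduced here for illustration; the border of the block for words x and y is taken to have size t = O(|x| + |y|), and the word lengths of each sequence sum to n.

```latex
\[
\sum_{x \in P_S} \sum_{y \in P_T} O\bigl(|x| + |y|\bigr)
  = O\Bigl( |P_T| \sum_{x \in P_S} |x| \;+\; |P_S| \sum_{y \in P_T} |y| \Bigr)
  = O\bigl( n \, (|P_S| + |P_T|) \bigr)
  = O\!\left( \frac{h n^2}{\log n} \right)
\]
```

The last equality uses |P_S|, |P_T| = O(hn/log n). Since each block costs time linear in its border, the sum of the border sizes is also the total running time.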

Other extensions
• Trace
• Reducing the space complexity for discrete scoring
• Local alignment

References
• Crochemore, M.; Landau, G. M. & Ziv-Ukelson, M. A sub-quadratic sequence alignment algorithm for unrestricted cost matrices. ACM-SIAM, 2002, 679-688
• Some pictures from 葉恆青
