Fast and Practical Algorithms for Computing Runs

Gang Chen (McMaster, Ontario, CAN)
Simon J. Puglisi (RMIT, Melbourne, AUS)
Bill Smyth (McMaster, Ontario, CAN)
CPM, UWO, July 11, 2007

Overview
• I won’t talk much about runs!
• Lempel-Ziv (LZ) Factorization
• How to compute LZ with SA & LCP
• Suffix Array & LCP Array Basics (again!)
• Two different methods for LZ factorization: CPS1 and CPS2
• Various space-time trade-offs
• Experimental comparison to other approaches

LZ Factorization (Defn)
• The LZ-factorization LZx of a string x[1..n] is a factorization x = w1w2...wk such that each wj, j ∈ 1..k, is either:
  • a letter that does not occur in w1w2...wj-1; or
  • the longest substring that occurs at least twice in w1w2...wj.
• This is the LZ-77 parsing of the input string
• Also known as the S-Factorization (Crochemore)

LZ Factorization (Ex)

position   1 2 3 4 5 6 7 8
x =        a b a a b a b a

wj          a      b      a      aba    ba
(POS,LEN)   (1,0)  (2,0)  (1,1)  (1,3)  (2,2) … or (5,2)

• POS = Position of some previous occurrence
• LEN = Factor length
• Convention: LEN = 0 if factor is a new letter
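The definition can be checked directly against the example with a brute-force sketch. This is not one of the algorithms from the talk: it is a quadratic-or-worse reference routine, and the name `lz_factorize` is made up for illustration. It reports 1-indexed (POS, LEN) pairs with the slide's convention that LEN = 0 marks a new letter, and it picks the leftmost earlier occurrence when several tie.

```python
def lz_factorize(x):
    """Brute-force LZ factorization; returns (POS, LEN, factor) with 1-indexed POS."""
    n = len(x)
    factors = []
    i = 0  # 0-based start of the next factor
    while i < n:
        best_len, best_pos = 0, i
        for j in range(i):  # every earlier start position is a candidate
            k = 0
            # The earlier occurrence may overlap the factor being built.
            while i + k < n and x[j + k] == x[i + k]:
                k += 1
            if k > best_len:  # strict '>' keeps the leftmost occurrence
                best_len, best_pos = k, j
        if best_len == 0:
            # New letter: POS is its own position, LEN is 0 by convention.
            factors.append((i + 1, 0, x[i]))
            i += 1
        else:
            factors.append((best_pos + 1, best_len, x[i:i + best_len]))
            i += best_len
    return factors


for pos, length, w in lz_factorize("abaababa"):
    print(w, (pos, length))
```

On the example string this yields the factors a, b, a, aba, ba with pairs (1,0), (2,0), (1,1), (1,3), (2,2).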

Applications of LZ Factorization
LZ Factorization is the computational bottleneck in numerous string processing algorithms:
• Computing all runs (Kolpakov & Kucherov)
• Repeats with fixed gap (Kolpakov & Kucherov… again)
• Branching repeats (Gusfield & Stoye)
• Sequence Alignment (Crochemore et al.)
• Local periods (Duval et al.)
• Data Compression (Lempel & Ziv, many others)
Etcetera…

Computing LZ
• “Traditional” method is to use a suffix tree
  • Can be computed as a by-product of Ukkonen’s online suffix tree construction algorithm, OR
  • during a bottom-up traversal of the whole tree
• SA/LCP interval tree (Abouelhoda et al. 2004)
  • Essentially simulates a bottom-up traversal of the suffix tree on the SA/LCP combination
• Both these approaches use lots of space.

The ubiquitous Suffix Array
• Sort the n suffixes of x[1..n] into lexorder
• Store the offsets in an array

position   1 2 3 4 5 6 7 8
x =        a b a a b a b a

Suffixes in string order      After SORT (lexorder)
1 abaababa                    8 a
2 baababa                     3 aababa
3 aababa                      6 aba
4 ababa                       1 abaababa
5 baba                        4 ababa
6 aba                         7 ba
7 ba                          2 baababa
8 a                           5 baba
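The sort on this slide can be reproduced in a few lines. The sketch below simply sorts the suffixes as strings, so it is far slower than a real suffix array construction algorithm and is meant only to reproduce the table; the function name is illustrative.

```python
def suffix_array(x):
    """1-indexed suffix array built by sorting the suffixes directly (toy method)."""
    return sorted(range(1, len(x) + 1), key=lambda i: x[i - 1:])


x = "abaababa"
for offset in suffix_array(x):
    print(offset, x[offset - 1:])
```

For the example string the offsets come out as 8, 3, 6, 1, 4, 7, 2, 5.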

LCP Array
• Many SA algorithms rely on an additional table: the LCP (longest common prefix) array
• LCP array stores the length of the longest common prefix between suffixes SA[i] and SA[i-1]
• Can be computed in O(n) time (Kasai et al. 1999)
• Several practical improvements: space consumption reduced from 13n to 9n (Manzini 2004)

SA  LCP  suffix
8   0    a
3   1    aababa
6   1    aba
1   3    abaababa
4   3    ababa
7   0    ba
2   2    baababa
5   2    baba
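A minimal sketch of the linear-time LCP computation in the style of Kasai et al. follows. It uses 0-indexed positions, whereas the slides are 1-indexed, and the function name and the convention `lcp[0] = 0` are choices made here for illustration. None of the space-saving refinements mentioned above are attempted.

```python
def lcp_array(x, sa):
    """lcp[r] = longest common prefix of suffixes sa[r-1] and sa[r]; lcp[0] = 0.

    sa holds 0-indexed suffix start positions in lexicographic order.
    """
    n = len(x)
    rank = [0] * n
    for r, p in enumerate(sa):
        rank[p] = r  # inverse suffix array
    lcp = [0] * n
    h = 0  # current match length, carried over from the previous text position
    for p in range(n):  # visit suffixes in text order
        r = rank[p]
        if r > 0:
            q = sa[r - 1]  # lexicographic predecessor of suffix p
            while p + h < n and q + h < n and x[p + h] == x[q + h]:
                h += 1
            lcp[r] = h
            if h > 0:
                h -= 1  # the next suffix loses at most one matching character
        else:
            h = 0
    return lcp


x = "abaababa"
sa = sorted(range(len(x)), key=lambda i: x[i:])  # toy construction, 0-indexed
print(lcp_array(x, sa))
```

On the example string the result is 0, 1, 1, 3, 3, 0, 2, 2, which agrees with the table on this slide.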

Computing LZ with the SA
• First “family” of LZ algorithms we call CPS1
• CPS1 algorithms compute arrays POS and LEN
• These arrays give us the factor information for every position (which is more than we require)
• Also, LEN is a permutation of LCP

position   1 2 3 4 5 6 7 8
x =        a b a a b a b a
POS =      1 2 1 1 2 1 2 1
LEN =      0 0 1 3 2 3 2 1

LCP =      0 1 1 3 3 0 2 2   (in SA order)

CPS1: LZ from SA & LCP
• POS and LEN are computed in a straight left-to-right traversal of the SA/LCP arrays
• We “ascend” the LCP array, saving indexes on the stack until LCP values decrease
• Backtrack using the stack to locate the rightmost i1 < i2 with LCP[i1] < LCP[i2]
• As we go, set the larger position with equal LCP to point leftwards to the smaller one
• 14 lines of C code!
• x, SA, LCP, POS, LEN arrays → 17n bytes + stack
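The slide does not reproduce the 14-line C routine, so the following is not that code. It is a sketch of a related, well-known technique: for each suffix, look at the nearest suffix with a smaller text position above it and below it in SA order, and take the longer of the two matches. It uses the same kind of stack over the SA/LCP arrays but needs two passes instead of one. Positions are 0-indexed, all names are hypothetical, and when several earlier occurrences tie, the POS it reports may differ from the POS row shown in the slides.

```python
def previous_factors(x, sa, lcp):
    """Return (pos, length): for each text position the longest earlier match.

    sa is 0-indexed; lcp[r] is the LCP of suffixes sa[r-1] and sa[r], lcp[0] = 0.
    pos[i] == i together with length[i] == 0 marks "no earlier occurrence".
    """
    n = len(x)
    pos = list(range(n))
    length = [0] * n

    def scan(order, lcp_to_previous):
        stack = []     # ranks already visited; their sa values increase bottom to top
        m = [0] * n    # m[r]: LCP between rank r and the rank just below it on the stack
        for r in order:
            cur = lcp_to_previous(r)  # LCP with the rank visited immediately before
            while stack and sa[stack[-1]] > sa[r]:
                cur = min(cur, m[stack.pop()])  # extend the LCP minimum across popped ranks
            if stack:  # top of stack is the nearest rank with a smaller text position
                m[r] = cur
                if cur > length[sa[r]]:
                    length[sa[r]] = cur
                    pos[sa[r]] = sa[stack[-1]]
            stack.append(r)

    scan(range(n), lambda r: lcp[r])                                      # neighbours above
    scan(range(n - 1, -1, -1), lambda r: lcp[r + 1] if r + 1 < n else 0)  # neighbours below
    return pos, length


x = "abaababa"
sa = [7, 2, 5, 0, 3, 6, 1, 4]   # 0-indexed suffix array of x
lcp = [0, 1, 1, 3, 3, 0, 2, 2]
pos, length = previous_factors(x, sa, lcp)
print(length)
```

For the example string the LEN values come out as 0, 0, 1, 3, 2, 3, 2, 1. Each rank is pushed and popped at most once per pass, so the scan is linear in n once SA and LCP are available.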

Overwrite LCP with POS
• Once POS[SA[i]] has been assigned, SA[i] and LCP[i] are no longer accessed…
• Reuse the space
  • Leave SA[i] as is
  • Assign LCP[i] = POS[SA[i]]
  • Store LEN separately as before
• After the traversal of SA/LCP is complete, permute the SA and “LCP” arrays in place into string order by following all cycles
• POS array no longer needed → 13n bytes + stack
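The cycle-following step can be sketched as below. This is one possible way to do it, not necessarily the authors' routine: values stored in SA order are moved to string order using no second array, and the suffix array itself serves as the bookkeeping, so it ends up as the identity permutation. Positions are 0-indexed and the function name is illustrative.

```python
def permute_to_string_order(sa, a):
    """In place: afterwards a[p] is the value that was stored at the rank of suffix p.

    sa (0-indexed) is consumed and ends up as the identity permutation.
    """
    for i in range(len(sa)):
        # Swap along the cycle through i until slot i holds its own value.
        while sa[i] != i:
            j = sa[i]                    # the value currently in a[i] belongs at index j
            a[i], a[j] = a[j], a[i]      # place it there
            sa[i], sa[j] = sa[j], sa[i]  # keep each value paired with its target index


sa = [7, 2, 5, 0, 3, 6, 1, 4]          # 0-indexed suffix array of abaababa
pos_by_rank = [1, 1, 1, 1, 1, 2, 2, 2]  # POS[SA[i]] for the slide's POS row (1-indexed values)
permute_to_string_order(sa, pos_by_rank)
print(pos_by_rank)
```

After the call the array reads 1, 2, 1, 1, 2, 1, 2, 1, which is the POS row in string order. Every swap puts at least one value in its final slot, so the total work is linear.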

Eliminate the LEN Array
• Given POS[i] = p, LEN[i] = longestmatch(x[p…n], x[i…n])
• Compute only the POS values
• Permute them into the POS array (as last slide)
• Compute LEN values only for factors in the parsing
• Sum of factor lengths required for the parsing is n, so still O(n) time
• LEN array no longer needed → 9n bytes + stack
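The idea on this slide, recovering LEN by direct comparison only where a factor starts, can be sketched as follows. The sketch assumes a POS array in string order is already available and, as a convention chosen here, that `pos[i] == i` marks a new letter. Positions are 0-indexed internally and reported 1-indexed; the function name is hypothetical.

```python
def factors_from_pos(x, pos):
    """Walk the parsing, computing LEN by character comparison at factor starts only."""
    n = len(x)
    out = []
    i = 0
    while i < n:
        p = pos[i]
        k = 0
        if p != i:  # there is an earlier occurrence to extend
            while i + k < n and x[p + k] == x[i + k]:
                k += 1
        out.append((p + 1, k))  # 1-indexed POS, LEN = 0 for a new letter
        i += max(k, 1)          # jump to the start of the next factor
    return out


x = "abaababa"
pos = [0, 1, 0, 0, 1, 0, 1, 0]  # the slide's POS row 1 2 1 1 2 1 2 1, shifted to 0-indexed
print(factors_from_pos(x, pos))
```

For the example this gives (1,0), (2,0), (1,1), (1,3), (2,2). The comparison at each factor start stops one character after the factor ends, so the comparisons across the whole parsing total at most n plus the number of factors.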

CPS2: LZ without LCP
• LCP computation is slow (though linear) and requires extra space: can we drop it?
• Use SA to search for the longest previous match at each position in the factorization
• Problem is: we don’t want just any match, we want a match to the left.
• When do we stop the search?

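LZ without LCP (cont…)

position   1 2 3 4 5 6 7 8
x =        a b a a b a b a

rank  SA  suffix
1     8   a
2     3   aababa
3     6   aba
4     1   abaababa
5     4   ababa
6     7   ba
7     2   baababa
8     5   baba

Finding the factor that starts at position 4:
• match “a”: RangeMinSA(1,5) = 1, Length = 1
• match “ab”: RangeMinSA(3,5) = 1, Length = 2
• match “aba”: RangeMinSA(3,5) = 1, Length = 3
• match “abab”: RangeMinSA(5,5) = 4, which is not to the left of position 4, so stop with Length = 3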

LZ without LCP (cont…)
• Use two binary searches to refine range
  • Incremental use of Manber and Myers search
  • Could use other search algorithms (like FM)
• Preprocess SA for fast RMQ queries
  • RMQSA(i,j) (written RangeMinSA in the example) returns minimum value in SA[i..j]
  • Fast implementation of RMQ requires n bytes
• O(n log n) time, ~6n bytes space
  • n single character searches
  • Each search takes O(log n) time
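The search loop can be sketched as below, with two loud simplifications: the range minimum is a plain `min` over a slice rather than a preprocessed RMQ structure, and the suffix array is assumed to be given. It therefore illustrates the logic, not the time or space bounds quoted above. Positions and ranks are 0-indexed internally, and all names are hypothetical.

```python
def lz_via_sa_search(x, sa):
    """LZ factorization by refining an SA range one character at a time (sketch)."""
    n = len(x)

    def char_at(r, d):
        # Character at depth d of the suffix with rank r; "" past the end,
        # which compares smaller than any real character, as suffix order requires.
        p = sa[r] + d
        return x[p] if p < n else ""

    def refine(lo, hi, d, c):
        # All ranks in [lo, hi) share a prefix of length d; keep those continuing with c.
        a, b = lo, hi
        while a < b:  # first binary search: lower end of the new range
            mid = (a + b) // 2
            if char_at(mid, d) < c:
                a = mid + 1
            else:
                b = mid
        left, b = a, hi
        while a < b:  # second binary search: upper end of the new range
            mid = (a + b) // 2
            if char_at(mid, d) <= c:
                a = mid + 1
            else:
                b = mid
        return left, a

    factors = []
    i = 0
    while i < n:
        lo, hi, k, p = 0, n, 0, i
        while i + k < n:
            nlo, nhi = refine(lo, hi, k, x[i + k])
            m = min(sa[nlo:nhi])  # naive range minimum; suffix i is always in the range
            if m >= i:            # no occurrence starts to the left of i: stop
                break
            lo, hi, k, p = nlo, nhi, k + 1, m
        factors.append((p + 1, k))  # 1-indexed POS, LEN = 0 for a new letter
        i += max(k, 1)
    return factors


x = "abaababa"
sa = sorted(range(len(x)), key=lambda i: x[i:])  # toy construction, 0-indexed
print(lz_via_sa_search(x, sa))
```

On the example string the output is (1,0), (2,0), (1,1), (1,3), (2,2). For the factor at position 4 the loop visits, in 1-indexed terms, the ranges 1..5, 3..5, 3..5 and finally 5..5, where the minimum is 4 and the search stops.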

Experiments
Implemented CPS algorithms and raced with:
• Kolpakov and Kucherov’s implementation (KK)
  • Computes factors during online construction of the suffix tree (Ukkonen’s algorithm)
  • Tuned specifically for DNA strings
• Abouelhoda et al.’s approach (AKO)
  • Uses SA and LCP, computes the POS, LEN

Results - Runtimes

Peak Memory Usage

Conclusions
• KK remains fastest algorithm on DNA
• CPS1 (13n) is consistently fastest on larger alphabets (notably faster than AKO)
• CPS1 (9n) provides a nice space-time tradeoff
• CPS2 most suitable if memory is tight

Future Work
• Computing the LCP array is a burden
  • Can we speed it up?
  • Compute it during SA construction?
• How easily do these algorithms map to compressed SAs?
  • Overwriting SA/LCP difficult in that setting
• Can LZ be computed efficiently without using SA/LCP or STree?
• Can we compute the rightmost previous POS instead of the leftmost? (Veli Makinen 7-9-2007)
