
Fast and Practical Algorithms for Computing Runs

Gang Chen (McMaster, Ontario, CAN), Simon J. Puglisi (RMIT, Melbourne, AUS), Bill Smyth (McMaster, Ontario, CAN). CPM, UWO, July 11, 2007.


Presentation Transcript


  1. Fast and Practical Algorithms for Computing Runs
     Gang Chen (McMaster, Ontario, CAN)
     Simon J. Puglisi (RMIT, Melbourne, AUS)
     Bill Smyth (McMaster, Ontario, CAN)
     CPM, UWO, July 11, 2007

  2. Overview
     • I won’t talk much about runs!
     • Lempel-Ziv (LZ) Factorization
     • How to compute LZ with SA & LCP
       • Suffix Array & LCP Array Basics (again!)
     • Two different methods for LZ factorization
       • CPS1 and CPS2
       • Various space-time trade-offs
     • Experimental comparison to other approaches

  3. LZ Factorization (Defn)
     • The LZ-factorization LZx of string x[1..n] is a factorization x = w1w2...wk such that each wj, j ∈ 1..k, is either:
       • a letter that does not occur in w1w2...wj-1; or
       • the longest substring that occurs at least twice in w1w2...wj.
     • This is the LZ-77 parsing of the input string
     • Also known as the S-Factorization (Crochemore)

  4. LZ Factorization (Ex)
     i         =  1 2 3 4 5 6 7 8
     x         =  a b a a b a b a
     wj        =  a      b      a      aba    ba
     (POS,LEN) =  (1,0)  (2,0)  (1,1)  (1,3)  (2,2) … or (5,2)
     • POS = Position of some previous occurrence
     • LEN = Factor length
     • Convention: LEN = 0 if factor is a new letter
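The definition and example above can be checked with a brute-force sketch. It is not one of the algorithms from the talk: `lz_factorize_naive` is a hypothetical name, and the function simply tries every earlier start position. It is only meant to make the (POS, LEN) convention concrete.

```python
def lz_factorize_naive(x):
    """Return the LZ factorization of x as a list of (POS, LEN) pairs.

    Positions are 1-based. A new letter at position i is reported as (i, 0).
    Brute force, for illustration only.
    """
    n = len(x)
    factors = []
    i = 0  # 0-based start of the next factor
    while i < n:
        best_len, best_pos = 0, i
        for j in range(i):  # every earlier start; leftmost wins ties
            l = 0
            # The earlier occurrence may overlap the factor itself.
            while i + l < n and x[j + l] == x[i + l]:
                l += 1
            if l > best_len:
                best_len, best_pos = l, j
        factors.append((best_pos + 1, best_len))
        i += max(best_len, 1)  # a new letter still consumes one character
    return factors


print(lz_factorize_naive("abaababa"))
# [(1, 0), (2, 0), (1, 1), (1, 3), (2, 2)]
```

Because ties go to the leftmost start, the last factor comes out as (2,2) rather than the equally valid (5,2).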

  5. Applications of LZ Factorization
     LZ Factorization is the computational bottleneck in numerous string processing algorithms
     • Computing all runs (Kolpakov & Kucherov)
     • Repeats with fixed gap (Kolpakov & Kucherov… again)
     • Branching repeats (Gusfield & Stoye)
     • Sequence Alignment (Crochemore et al.)
     • Local periods (Duval et al.)
     • Data Compression (Lempel & Ziv, many others)
     Etcetera…

  6. Computing LZ
     • “Traditional” method is to use a suffix tree
       • LZ can be computed as a by-product of Ukkonen’s online suffix tree construction algorithm, OR
       • during a bottom-up traversal of the whole tree
     • SA/LCP interval tree (Abouelhoda et al. 2004)
       • Essentially simulates a bottom-up traversal of the suffix tree on the SA/LCP combination
     • Both these approaches use lots of space.

  7. The ubiquitous Suffix Array
     • Sort the n suffixes of x[1..n] into lexorder
     • Store the offsets in an array
     i = 1 2 3 4 5 6 7 8
     x = a b a a b a b a
     String order      After SORT
     1 abaababa        8 a
     2 baababa         3 aababa
     3 aababa          6 aba
     4 ababa           1 abaababa
     5 baba            4 ababa
     6 aba             7 ba
     7 ba              2 baababa
     8 a               5 baba
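A minimal sketch of the "sort the suffixes, store the offsets" idea follows. It sorts whole suffix slices, which is far slower than a real suffix array construction algorithm, and `suffix_array` is a hypothetical helper name.

```python
def suffix_array(x):
    """1-based start positions of the suffixes of x, in lexicographic order."""
    # Sorting by the suffix text itself is simple but slow on long inputs;
    # it only illustrates what the array contains.
    return sorted(range(1, len(x) + 1), key=lambda i: x[i - 1:])


print(suffix_array("abaababa"))
# [8, 3, 6, 1, 4, 7, 2, 5]
```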

  8. LCP Array
     • Many SA algorithms rely on an additional table: the LCP (longest common prefix) array
     • Can be computed in O(n) time (Kasai et al. 1999)
     • Several practical improvements: space consumption reduced from 13n to 9n (Manzini 2004)
     • LCP Array stores length of Longest Common Prefix between suffixes SA[i] and SA[i-1]
     SA  LCP  suffix
     8   0    a
     3   1    aababa
     6   1    aba
     1   3    abaababa
     4   3    ababa
     7   0    ba
     2   2    baababa
     5   2    baba
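As a sketch of the linear-time idea credited to Kasai et al. on this slide: visit the suffixes in string order and reuse the previous match length minus one. The function name and the 1-based convention for `sa` are assumptions made for this example, and none of the space-saving refinements are attempted.

```python
def lcp_array(x, sa):
    """LCP values aligned with sa (sa holds 1-based suffix positions).

    lcp[r] is the longest common prefix of the suffixes sa[r] and sa[r-1];
    lcp[0] is 0 by convention.
    """
    n = len(x)
    rank = [0] * (n + 1)              # rank[p] = index of suffix p in sa
    for r, p in enumerate(sa):
        rank[p] = r
    lcp = [0] * n
    h = 0                             # match length carried between steps
    for p in range(1, n + 1):         # suffixes in string order
        r = rank[p]
        if r == 0:
            h = 0                     # first suffix in sa has no predecessor
            continue
        q = sa[r - 1]                 # lexicographic predecessor of suffix p
        while p + h <= n and q + h <= n and x[p + h - 1] == x[q + h - 1]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1                    # next suffix keeps at least h - 1
    return lcp


print(lcp_array("abaababa", [8, 3, 6, 1, 4, 7, 2, 5]))
# [0, 1, 1, 3, 3, 0, 2, 2]
```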

  9. Computing LZ with the SA
     • First “family” of LZ algorithms we call CPS1
     • CPS1 algorithms compute arrays POS and LEN
     • These arrays give us the factor information for every position (which is more than we require)
     • Also, LEN is a permutation of LCP
     i   = 1 2 3 4 5 6 7 8
     x   = a b a a b a b a
     POS = 1 2 1 1 2 1 2 1
     LEN = 0 0 1 3 2 3 2 1
     LCP = 0 1 1 3 3 0 2 2   (in SA order)
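To make POS and LEN concrete, here is a brute-force sketch that fills both arrays for every position and checks the "LEN is a permutation of LCP" remark on the running example. `pos_len_naive` is a hypothetical name. This is not CPS1, only a reference to compare against.

```python
def pos_len_naive(x):
    """Brute-force POS and LEN for every position of x (1-based positions).

    LEN[i] is the longest match between x[i..n] and a suffix that starts
    earlier; POS[i] is the leftmost such start, or i itself when LEN[i] = 0.
    """
    n = len(x)
    pos, length = [], []
    for i in range(1, n + 1):
        best_len, best_pos = 0, i
        for j in range(1, i):
            l = 0
            while i + l <= n and x[j + l - 1] == x[i + l - 1]:
                l += 1
            if l > best_len:          # strict test keeps the leftmost j
                best_len, best_pos = l, j
        pos.append(best_pos)
        length.append(best_len)
    return pos, length


pos, length = pos_len_naive("abaababa")
print(pos)     # [1, 2, 1, 1, 2, 1, 2, 1]
print(length)  # [0, 0, 1, 3, 2, 3, 2, 1]
# Same multiset of values as the LCP array of the example.
print(sorted(length) == sorted([0, 1, 1, 3, 3, 0, 2, 2]))  # True
```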

  10. CPS1: LZ from SA & LCP
      • POS and LEN are computed in a straight left-to-right traversal of the SA/LCP arrays
      • We “ascend” the LCP array, saving indexes on the stack until LCP values decrease
      • Backtrack using the stack to locate the rightmost i1 < i2 with LCP[i1] < LCP[i2]
      • As we go, set the larger position with equal LCP to point leftwards to the smaller one
      • 14 lines of C code!
      • x, SA, LCP, POS, LEN arrays → 17n bytes + stack (n for x, 4n for each of the four integer arrays)
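The following is a sketch of a single left-to-right stack scan over SA/LCP that yields LEN and a valid POS for every position. It is not the authors' 14-line C routine: it uses the stack formulation commonly used for longest-previous-factor computation, it reports some earlier occurrence rather than guaranteeing the leftmost one, and the function name and tuple layout are assumptions made for this example.

```python
def pos_len_stack_scan(sa, lcp):
    """Return (POS, LEN) lists in string order from SA and LCP.

    sa holds 1-based positions; lcp[r] is the LCP of sa[r] and sa[r-1],
    with lcp[0] = 0. POS[i] = i and LEN[i] = 0 mark a new letter.
    """
    n = len(sa)
    sa = list(sa) + [0]       # sentinel smaller than every real position
    lcp = list(lcp) + [0]     # working copy; entries are lowered during the scan
    pos = [0] * (n + 1)
    length = [0] * (n + 1)
    stack = []                # (lcp with the entry below, position)
    for i in range(n + 1):
        while stack:
            top_lcp, top_pos = stack[-1]
            if sa[i] < top_pos:
                # Both stack neighbours start earlier: take the longer match.
                stack.pop()
                if lcp[i] > top_lcp:
                    pos[top_pos], length[top_pos] = sa[i], lcp[i]
                elif top_lcp > 0:
                    pos[top_pos], length[top_pos] = stack[-1][1], top_lcp
                else:
                    pos[top_pos], length[top_pos] = top_pos, 0
                lcp[i] = min(lcp[i], top_lcp)
            elif lcp[i] <= top_lcp:
                # Suffix i starts later and matches no better than the entry below.
                stack.pop()
                if top_lcp > 0:
                    pos[top_pos], length[top_pos] = stack[-1][1], top_lcp
                else:
                    pos[top_pos], length[top_pos] = top_pos, 0
            else:
                break
        if i < n:
            stack.append((lcp[i], sa[i]))
    return pos[1:], length[1:]


pos, length = pos_len_stack_scan([8, 3, 6, 1, 4, 7, 2, 5],
                                 [0, 1, 1, 3, 3, 0, 2, 2])
print(length)  # [0, 0, 1, 3, 2, 3, 2, 1]
print(pos)     # [1, 2, 1, 1, 2, 1, 2, 3]
```

On the running example LEN agrees with the slide. POS differs only at position 8, where this sketch reports 3 instead of the leftmost occurrence 1. Both are valid earlier occurrences of "a".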

  11. Overwrite LCP with POS
      • Once POS[SA[i]] has been assigned, SA[i] and LCP[i] are no longer accessed…
      • Reuse the space
        • Leave SA[i] as is
        • Assign LCP[i] = POS[SA[i]]
        • Store LEN separately as before
      • After the traversal of SA/LCP is complete, permute the SA and “LCP” arrays in place into string order by following all cycles
      • POS array no longer needed → 13n bytes + stack
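The "permute in place by following cycles" step can be sketched as below, assuming `sa` is a permutation of 1..n and `vals[r]` holds the value that belongs to position `sa[r]`. Both names are hypothetical. Each swap puts one pair in its final slot, so the pass is linear and needs no extra array.

```python
def permute_into_string_order(sa, vals):
    """Reorder vals in place so that vals[p - 1] belongs to position p.

    sa is sorted into 1, 2, ..., n as a side effect.
    """
    for i in range(len(sa)):
        while sa[i] != i + 1:
            j = sa[i] - 1                     # final slot of the pair now at i
            sa[i], sa[j] = sa[j], sa[i]
            vals[i], vals[j] = vals[j], vals[i]


sa = [8, 3, 6, 1, 4, 7, 2, 5]
pos_in_sa_order = [1, 1, 1, 1, 1, 2, 2, 2]    # POS[SA[r]] for the example
permute_into_string_order(sa, pos_in_sa_order)
print(pos_in_sa_order)  # [1, 2, 1, 1, 2, 1, 2, 1]
```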

  12. Eliminate the LEN Array
      • Given POS[i] = p, LEN[i] = longestmatch(x[p…n], x[i…n])
      • Compute only the POS values
      • Permute them into the POS array (as last slide)
      • Compute LEN values only for factors in the parsing
      • Sum of the factor lengths required for the parsing is n, so still O(n) time
      • LEN array no longer needed → 9n bytes + stack
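A sketch of recovering the parsing from POS alone follows, comparing characters only at factor starts so the total work is proportional to n. It assumes the convention POS[i] = i for a new letter, and `factors_from_pos` is a hypothetical name.

```python
def factors_from_pos(x, pos):
    """LZ factors as (POS, LEN) pairs, given POS for every position.

    pos[i - 1] is an earlier occurrence for position i (1-based), or i
    itself when the letter at i is new.
    """
    n = len(x)
    factors = []
    i = 1
    while i <= n:
        p = pos[i - 1]
        l = 0
        if p != i:
            # longestmatch(x[p..n], x[i..n]); p < i, so only i + l is bounded
            while i + l <= n and x[p + l - 1] == x[i + l - 1]:
                l += 1
        factors.append((p, l))
        i += max(l, 1)    # skip the positions covered by this factor
    return factors


print(factors_from_pos("abaababa", [1, 2, 1, 1, 2, 1, 2, 1]))
# [(1, 0), (2, 0), (1, 1), (1, 3), (2, 2)]
```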

  13. CPS2: LZ without LCP
      • LCP computation is slow (though linear)
        • It also requires extra space: can we drop it?
      • Use SA to search for the longest previous match at each position in the factorization
      • Problem is: we don’t want just any match; we want a match to the left.
      • When do we stop the search?

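  14. LZ without LCP (cont…)
      i = 1 2 3 4 5 6 7 8
      x = a b a a b a b a
      rank  SA  suffix
      1     8   a
      2     3   aababa
      3     6   aba
      4     1   abaababa
      5     4   ababa
      6     7   ba
      7     2   baababa
      8     5   baba
      Searching for the factor that starts at position 4 (ababa):
      • Length = 1: RangeMinSA(1,5) = 1
      • Length = 2: RangeMinSA(3,5) = 1
      • Length = 3: RangeMinSA(3,5) = 1
      • Length = 4: the range narrows to rank 5 only, RangeMinSA(5,5) = 4, which is not to the left of position 4, so the search stops with LEN = 3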

  15. LZ without LCP (cont…)
      • Use two binary searches to refine range
        • Incremental use of Manber and Myers search
        • Could use other search algorithms (like FM)
      • Preprocess SA for fast RMQ queries
        • RMQSA(i,j) returns minimum value in SA[i..j]
        • Fast implementation of RMQ requires n bytes
      • O(n log n) time, ~6n bytes space
        • n single-character searches
        • Each search takes O(log n) time

  16. Experiments
      Implemented CPS algorithms and raced with:
      • Kolpakov and Kucherov’s implementation
        • Computes factors during online construction of the suffix tree (Ukkonen’s algorithm)
        • Tuned specifically for DNA strings
      • Abouelhoda et al.’s approach
        • Uses SA and LCP, computes POS and LEN

  17. Results - Runtimes

  18. Peak Memory Usage

  19. Conclusions
      • KK (Kolpakov and Kucherov’s implementation) remains the fastest algorithm on DNA
      • CPS1 (13n) is consistently fastest on larger alphabets (notably faster than AKO, the approach of Abouelhoda et al.)
      • CPS1 (9n) provides a nice space-time trade-off
      • CPS2 is most suitable if memory is tight

  20. Future Work
      • Computing the LCP array is a burden
        • Can we speed it up?
        • Compute it during SA construction?
      • How easily do these algorithms map to compressed SAs?
        • Overwriting SA/LCP difficult in that setting
      • Can LZ be computed efficiently without using SA/LCP or STree?
      • Can we compute the rightmost previous POS instead of the leftmost? (Veli Makinen 7-9-2007)
