
Compressed Suffix Arrays based on Run-Length Encoding

Veli Mäkinen (Bielefeld University) and Gonzalo Navarro (University of Chile).



  1. Compressed Suffix Arrays based on Run-Length Encoding
     • Veli Mäkinen, Bielefeld University
     • Gonzalo Navarro, University of Chile
     • [Title-slide diagram labels: BWT, RL, FID]

  2. Abstract
     • We introduce a new full-text index that occupies O(H_k |T|) bits and supports counting queries in O(|P|) time.
       - Optimal space and search time on a constant-size alphabet.
       - Works for any alphabet size σ, adding a log σ factor to the space and time bounds.

  3. Introduction
     • We consider exact string matching on static text.
     • The task is to construct an index for the text such that the occurrences of a given pattern can be found efficiently.
     • A well-known optimal solution exists: build a suffix tree over the text.

  4. Introduction...
     • The suffix-tree-based solution takes O(|T| log |T|) bits of space.
     • The text itself can be represented in O(|T| log σ) bits.
       - Or in even less space if the text is compressible.
     • In many applications the space usage is the real bottleneck, not the search efficiency.

  5. Introduction...
     • During the last 15 years, many practical and theoretical solutions with reduced space complexities have been proposed.
     • The work can roughly be divided into three categories:
       (1) Reducing constant factors
       (2) Concrete optimization
       (3) Abstract optimization

  6. Reducing constant factors
     • Suffix arrays (Manber & Myers 1990)
     • Suffix cactuses (Kärkkäinen 1995)
     • Sparse suffix trees (Kärkkäinen & Ukkonen 1996)
     • Space-efficient suffix trees (Kurtz 1998)
     • Enhanced suffix arrays (Abouelhoda & Ohlebusch & Kurtz 2002)

  7. Concrete optimization
     • “Minimizing automata”
     • DAWGs (Blumer & Blumer & Haussler & McConnell & Ehrenfeucht 1983)
     • Compact DAWGs (Crochemore & Vérin 1997)
     • Compact suffix arrays (Mäkinen 2000)

  8. Abstract optimization
     • Objective: use as little space as possible to support the functionality of a given abstract definition of a data structure.
     • Space is measured in bits and is usually given in proportion to the entropy of the text.

  9. Abstract optimization: Example
     • A full-text index for a given text T supports the following operations:
       - Exists(P): is P a substring of T?
       - Count(P): how many times does P occur in T?
       - Report(P): list the occurrences of P in T.
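
The three operations can be pinned down with a naive reference implementation that scans the text directly. This is only a sketch of the abstract interface, not an index: the function names mirror the slide, and the choice of 1-based positions is an assumption made to match the later examples.

```python
def report(pattern, text):
    """Report(P): 1-based starting positions of every occurrence of P in T."""
    last_start = len(text) - len(pattern)
    return [i + 1 for i in range(last_start + 1) if text.startswith(pattern, i)]


def count(pattern, text):
    """Count(P): number of (possibly overlapping) occurrences of P in T."""
    return len(report(pattern, text))


def exists(pattern, text):
    """Exists(P): is P a substring of T?"""
    return count(pattern, text) > 0
```

For the text `kalevala#` used later in the talk, `report("al", "kalevala#")` gives `[2, 6]`. A compressed index has to return the same answers without scanning T.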

  10. Abstract optimization...
      • Seminal work by Jacobson 1989: rank-select queries on bit-vectors.
      • Rank-select-type structures for suffix trees (Munro & Raman & Rao & Clark 1996-)
      • Lempel-Ziv index (Kärkkäinen & Ukkonen 1996)

  11. Abstract optimization...
      • Compressed suffix arrays (Grossi & Vitter 2000, Sadakane 2000, 2002)
      • FM-index (Ferragina & Manzini 2000)
      • LZ-self-index (Navarro 2002)
      • Space-optimal full-text indexes (Grossi & Gupta & Vitter 2003, 2004)
      • Alphabet-friendly FM-index (Ferragina & Manzini & Mäkinen & Navarro)
      • See also ISAAC'04, SODA'05,...

  12. This talk
      • We show that combining the FM-index with the compact suffix array gives a practical full-text index with a good space / search time tradeoff.
      • Our structure, the Run-Length FM-index, uses O(min(|T|(H_k log σ + 1), |T| log σ)) bits and supports Count(P) in O(|P| log σ) time.

  13. This talk...
      • H_k = H_k(T) is the order-k empirical entropy of T, i.e., “the average number of bits needed to encode a symbol using a fixed codebook for each possible combination of k previous symbols”.
      • It holds that 0 ≤ H_k ≤ H_{k-1} ≤ ... ≤ H_0 ≤ log σ.
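
To make the definition concrete, the empirical entropy can be computed directly from symbol counts. The sketch below assumes the usual formulation, in which the symbols of T are grouped by the k symbols that precede them and each group is charged its zeroth-order entropy; the function names `h0` and `hk` are illustrative.

```python
from collections import Counter, defaultdict
from math import log2


def h0(seq):
    """Zeroth-order empirical entropy of seq, in bits per symbol."""
    n = len(seq)
    if n == 0:
        return 0.0
    return sum(cnt / n * log2(n / cnt) for cnt in Counter(seq).values())


def hk(text, k):
    """Order-k empirical entropy: each symbol is coded given its k predecessors."""
    if k == 0:
        return h0(text)
    followers = defaultdict(list)
    for i in range(k, len(text)):
        followers[text[i - k:i]].append(text[i])
    # Weighted sum of the zeroth-order entropies of the per-context groups.
    return sum(len(group) * h0(group) for group in followers.values()) / len(text)
```

On `abababab`, `h0` is 1.0 bit per symbol while `hk(..., 1)` is 0.0, because each symbol is fully determined by the one before it.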

  14. FM-index
      • Let us first describe a simple variant of the FM-index that:
        - occupies O(|T| log σ) bits, and
        - supports counting queries in O(|P| log σ) time.

  15. Simple FM-index
      • Construct the Burrows-Wheeler-transformed text bwt(T) [BW94].
      • From bwt(T) it is possible to construct the suffix array sa(T) of T in linear time.
      • Instead of constructing the whole sa(T), one can add small data structures besides bwt(T) to simulate a search on sa(T).

  16. Burrows-Wheeler transformation
      • Construct a matrix M that contains as rows all rotations of T.
      • Sort the rows in lexicographic order.
      • Let L be the last column and F be the first column.
      • bwt(T) = L, associated with the row number of T in the sorted M.

  17. Example
      pos  123456789
      T  = kalevala#

      row  sa  M (F is the first column, L is the last)
      1    9   #kalevala
      2    8   a#kaleval
      3    6   ala#kalev
      4    2   alevala#k
      5    4   evala#kal
      6    1   kalevala#
      7    7   la#kaleva
      8    3   levala#ka
      9    5   vala#kale

      L = alvkl#aae, row 6
      ==> Exercise: Given L and the row number, how do we compute T and sa(T)?
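
The construction can be reproduced with a direct, quadratic-space sketch that literally sorts the rotations. It is meant only to check the example: real constructions go through a suffix array instead, and the function name `bwt_by_rotations` is an assumption. Python's default string ordering places `#` before the letters, which matches the slide.

```python
def bwt_by_rotations(text):
    """Return (L, F, sa, row_of_text) for the sorted rotation matrix of text.

    sa and row_of_text are 1-based, as on the slide.
    """
    n = len(text)
    # starts[r] = 0-based start position of the rotation in sorted row r
    starts = sorted(range(n), key=lambda s: text[s:] + text[:s])
    last = "".join(text[s - 1] for s in starts)   # text[-1] wraps around for s == 0
    first = "".join(text[s] for s in starts)
    sa = [s + 1 for s in starts]
    row_of_text = starts.index(0) + 1
    return last, first, sa, row_of_text
```

Calling `bwt_by_rotations("kalevala#")` returns `("alvkl#aae", "#aaaekllv", [9, 8, 6, 2, 4, 1, 7, 3, 5], 6)`, in agreement with the table.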

  18. [Figure: sorting L gives F; following LF from the row of T spells T in reverse and fills in sa(T).]
      i      1 2 3 4 5 6 7 8 9
      L[i]   a l v k l # a a e
      F[i]   # a a a e k l l v
      LF[i]  2 7 9 6 8 1 3 4 5
      sa[i]  9 8 6 2 4 1 7 3 5
      T^-1 = #alavelak
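
One way to answer the exercise is sketched below. It assumes LF is obtained by stably sorting the positions of L (the k-th occurrence of a symbol in L corresponds to its k-th occurrence in F), then walks LF starting from the row of T. The names `lf_mapping` and `invert_bwt` are illustrative, and `lf_mapping` returns 0-based rows.

```python
def lf_mapping(last):
    """LF as a 0-based list: lf[i] is the row whose first symbol is L[i]."""
    order = sorted(range(len(last)), key=lambda i: last[i])  # sorted() is stable
    lf = [0] * len(last)
    for f_row, l_row in enumerate(order):
        lf[l_row] = f_row
    return lf


def invert_bwt(last, row_of_text):
    """Recover (T, sa) from L and the 1-based row number of T."""
    n = len(last)
    lf = lf_mapping(last)
    sa = [0] * n
    reversed_text = []
    row, position = row_of_text - 1, 1      # the row of T starts at text position 1
    for _ in range(n):
        sa[row] = position
        reversed_text.append(last[row])     # symbol preceding the current rotation
        row = lf[row]                       # row of the rotation starting one symbol earlier
        position = n if position == 1 else position - 1
    return "".join(reversed(reversed_text)), sa
```

With `L = "alvkl#aae"` and row 6 this returns `("kalevala#", [9, 8, 6, 2, 4, 1, 7, 3, 5])`; the symbols are collected in the order `#alavelak`, which is T reversed.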

  19. Implicit LF[i]
      • Ferragina and Manzini (2000) noticed the following connection:
      • LF[i] = C_T[L[i]] + rank_{L[i]}(L, i)
      • Here
        - C_T[c]: the number of occurrences of letters 0, 1, ..., c-1 in L = bwt(T)
        - rank_c(L, i): the number of occurrences of letter c in the prefix L[1,i]
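
The formula can be checked against the running example with a small sketch. Here `rank` is a naive linear scan standing in for a real rank structure, and the helper names are assumptions; positions are 1-based as on the slides.

```python
from collections import Counter


def cumulative_counts(last):
    """C_T[c]: number of symbols in L that are strictly smaller than c."""
    counts = Counter(last)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    return c_table


def rank(last, ch, i):
    """rank_ch(L, i): occurrences of ch in L[1..i]; naive O(i) scan."""
    return last[:i].count(ch)


def lf(last, c_table, i):
    """LF[i] = C_T[L[i]] + rank_{L[i]}(L, i), with 1-based i."""
    ch = last[i - 1]
    return c_table[ch] + rank(last, ch, i)
```

For `L = "alvkl#aae"` the table is `{'#': 0, 'a': 1, 'e': 4, 'k': 5, 'l': 6, 'v': 8}`, and evaluating `lf` for i = 1..9 gives 2 7 9 6 8 1 3 4 5.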

  20. Rank/Select
      i              1  2  3  4  5  6  7  8  9 10 11 12
      L              0  0  1  0  0  1  0  0  1  1  0  1
      rank_1(L,i)    0  0  1  1  1  2  2  2  3  4  4  5
      select_1(L,j)  3  6  9 10 12   (j = 1, ..., 5)
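
The two queries can be illustrated with a plain lookup-table sketch. Note the assumption: this version stores full prefix sums and a list of one-positions, so it answers in constant time but uses far more than the o(n) extra bits of the succinct dictionaries the talk relies on. The class name `BitVector` is illustrative.

```python
from itertools import accumulate


class BitVector:
    """Bit-vector with 1-based rank_1 and select_1 via precomputed tables."""

    def __init__(self, bits):
        self.bits = [int(b) for b in bits]
        self.prefix = [0] + list(accumulate(self.bits))   # prefix[i] = ones in bits[1..i]
        self.ones = [i + 1 for i, b in enumerate(self.bits) if b]

    def rank1(self, i):
        """Number of 1s among the first i bits."""
        return self.prefix[i]

    def select1(self, j):
        """Position of the j-th 1."""
        return self.ones[j - 1]
```

For `BitVector("001001001101")`, `rank1(10)` is 4 and `select1(4)` is 10, matching the table above.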

  21. [Figure: the LF table for T = kalevala# again, with one entry computed through the formula.]
      i      1 2 3 4 5 6 7 8 9
      L[i]   a l v k l # a a e
      F[i]   # a a a e k l l v
      LF[i]  2 7 9 6 8 1 3 4 5
      sa[i]  9 8 6 2 4 1 7 3 5
      LF[7] = C_T[a] + rank_a(L,7) = 1 + 2 = 3

  22. Backward search on bwt(T)
      • Observation: If [i,j] is the range of rows of M that start with string X, then the range [i',j'] of rows that start with cX can be computed as
        i' := C_T[c] + rank_c(L, i-1) + 1,
        j' := C_T[c] + rank_c(L, j).

  23. Backward search on bwt(T)...
      [Figure: extending X = a to vX = va on T = kalevala#, L = alvkl#aae.]
      • Rows of M starting with X = a: [i, j] = [2, 4].
      • rank_v(L, i-1) = 0, rank_v(L, j) = 1, C_T[v] = 8.
      • i' := 8 + 0 + 1 = 9, j' := 8 + 1 = 9.

  24. Backward search on bwt(T)...
      Algorithm Count(P[1,m], L[1,n], C_T[1,σ])
        c = P[m]; k = m;
        i = C_T[c] + 1; j = C_T[c+1];        (taking C_T[σ+1] = n)
        while (i ≤ j and k > 1) do begin
          c = P[k-1]; k = k-1;
          i = C_T[c] + rank_c(L, i-1) + 1;
          j = C_T[c] + rank_c(L, j);
        end;
        if (j < i) then return 0 else return (j-i+1);
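
The algorithm above can be sketched in runnable form as follows. Several choices here are assumptions rather than part of the talk: the index is built by sorting rotations, `rank` is a naive scan instead of a constant-time or wavelet-tree structure, and the loop starts from the full range [1, n] so that the first pattern symbol is handled by the same update step (which yields the same initial range as the pseudocode).

```python
def build_index(text):
    """Return (L, C_T) for text; text should end with a unique smallest symbol."""
    n = len(text)
    starts = sorted(range(n), key=lambda s: text[s:] + text[:s])
    last = "".join(text[s - 1] for s in starts)
    c_table, total = {}, 0
    for ch in sorted(set(last)):
        c_table[ch] = total
        total += last.count(ch)
    return last, c_table


def rank(last, ch, i):
    """Occurrences of ch in L[1..i]; naive stand-in for a rank structure."""
    return last[:i].count(ch)


def count(pattern, last, c_table):
    """Number of occurrences of pattern, found by backward search on L."""
    i, j = 1, len(last)                      # range of the empty string: all rows
    for ch in reversed(pattern):
        if ch not in c_table:                # symbol absent from the text
            return 0
        i = c_table[ch] + rank(last, ch, i - 1) + 1
        j = c_table[ch] + rank(last, ch, j)
        if j < i:                            # empty range: no occurrence
            return 0
    return j - i + 1
```

With `last, c_table = build_index("kalevala#")`, `count("al", last, c_table)` returns 2 and `count("va", last, c_table)` returns 1, the latter passing through the ranges [2, 4] and [9, 9] of the example.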

  25. Backward search on bwt(T)...
      • Array C_T[1,σ] takes O(σ log |T|) bits.
      • L = bwt(T) takes O(|T| log σ) bits.
      • Assuming rank_c(L,i) can be computed in constant time for each (c,i), the algorithm takes O(|P|) time to count the occurrences of P in T.

  26. Answering rank_c(L,i)
      • The wavelet tree (GGV 2003) is a data structure replacing L = bwt(T):
        - supports rank_c(L,i) in O(log σ) time, and
        - occupies |T|H_0(T) + o(|T|) bits.
      • The generalized wavelet tree (FMMN 2004) improves the query time to constant when σ = O(polylog(|T|)).
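
The rank query on a wavelet tree can be sketched with a pointer-based, balanced tree over the sorted alphabet. This is a simplification under stated assumptions: each node keeps plain prefix sums instead of a compressed bit-vector, so the sketch shows the O(log σ) descent but not the |T|H_0(T) space bound. The class name `WaveletTree` is illustrative.

```python
class WaveletTree:
    """Balanced wavelet tree supporting rank(ch, i) over a sequence."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.left = self.right = None
        self.prefix = None                      # stays None at leaves
        if len(self.alphabet) <= 1:
            return
        mid = len(self.alphabet) // 2
        self.right_symbols = set(self.alphabet[mid:])
        # prefix[i] = number of symbols among the first i that go to the right child
        self.prefix = [0]
        for ch in seq:
            self.prefix.append(self.prefix[-1] + (ch in self.right_symbols))
        self.left = WaveletTree([ch for ch in seq if ch not in self.right_symbols],
                                self.alphabet[:mid])
        self.right = WaveletTree([ch for ch in seq if ch in self.right_symbols],
                                 self.alphabet[mid:])

    def rank(self, ch, i):
        """Occurrences of ch among the first i symbols (one step per tree level)."""
        node = self
        while node.prefix is not None:
            ones = node.prefix[i]
            if ch in node.right_symbols:
                i, node = ones, node.right
            else:
                i, node = i - ones, node.left
        return i if node.alphabet == [ch] else 0
```

For `WaveletTree("alvkl#aae")`, `rank("a", 7)` is 2, the value used in the LF[7] example.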

  27. Simple FM-index...
      • We obtained a structure that
        - occupies O(|T|H_0(T)) bits, and
        - supports counting queries in O(|P| log σ) time.
      • The original FM-index takes O(H_k|T|) bits, but only on a constant-size alphabet.
      • Compression boosting can be applied to improve the simple FM-index to take only O(|T|H_k(T)) bits (FMMN 2004).

  28. To partition or not...
      • All alphabet-friendly solutions obtaining O(|T|H_k(T)) space for compressed suffix arrays use optimal partitioning of the BWT text, and explicitly store the distribution for each piece.
        - always σ^(k+1) overhead.
      • MTF + zeroth-order coding takes O(|T|H_k(T)) + O(σ^k) bits, but supporting queries on larger alphabets is non-trivial.

  29. Run-Length FM-index
      • We make the following changes to the previous FM-index variant:
        - L = bwt(T) is replaced by a sequence S[1,n'] and two bit-vectors B[1,|T|] and B'[1,|T|],
        - the cumulative array C_T[1,σ] is replaced by C_S[1,σ],
        - the wavelet tree is built on S, and
        - some formulas are changed.

  30. Run-Length FM-index...
      [Figure: S keeps one symbol per run of L; B marks where runs start in L; B' marks where the same runs start in F.]
      L   c c c a a g g a t t
      B   1 0 0 1 0 1 0 1 1 0
      S   c a g a t
      F   a a a c c c g g t t
      B'  1 0 1 1 0 0 1 0 1 0
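
The figure can be reproduced with a short sketch that splits L into runs. The reading of B' used here is inferred from the figure: the runs are stably sorted by their symbol, and B' marks the start of each run in that order. The function name `run_length_components` is an assumption.

```python
from itertools import groupby


def run_length_components(last):
    """Return (S, B, B') for the run-length encoding of L."""
    runs = [(ch, len(list(group))) for ch, group in groupby(last)]
    s = "".join(ch for ch, _ in runs)                 # one symbol per run
    b = []
    for _, length in runs:                            # run starts in L
        b.extend([1] + [0] * (length - 1))
    b_prime = []
    for _, length in sorted(runs, key=lambda run: run[0]):   # stable sort by symbol
        b_prime.extend([1] + [0] * (length - 1))      # run starts in F
    return s, b, b_prime
```

For `L = "cccaaggatt"` this yields `S = "cagat"`, `B = 1001010110` and `B' = 1011001010`, as in the figure.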

  31. Changes to formulas
      • Recall that we need to compute C_T[c] + rank_c(L,i) in the backward search.
      • Theorem: C_T[c] + rank_c(L,i) is equivalent to
          select_1(B', C_S[c] + 1 + rank_c(S, rank_1(B,i))) - 1,
        when L[i] ≠ c, and otherwise to
          select_1(B', C_S[c] + rank_c(S, rank_1(B,i))) + i - select_1(B, rank_1(B,i)).

  32. Example, L[i] = c
      L   c c c a a g g a t t
      B   1 0 0 1 0 1 0 1 1 0
      S   c a g a t
      F   a a a c c c g g t t
      B'  1 0 1 1 0 0 1 0 1 0

      LF[8] = select_1(B', C_S[a] + rank_a(S, rank_1(B,8))) + 8 - select_1(B, rank_1(B,8))
            = select_1(B', 0 + rank_a(S,4)) + 8 - select_1(B,4)
            = select_1(B', 0 + 2) + 8 - 8
            = 3
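
The theorem can be checked mechanically with the sketch below, which evaluates both cases using only S, B, B' and C_S. Two details are assumptions added to make the sketch total: `select1` returns |B'| + 1 when asked for a 1 past the last one (a sentinel), and i = 0 is routed through the first case. All rank and select helpers are naive scans.

```python
from itertools import groupby


def build_rlfm(last):
    """Return (S, B, B', C_S) for L, with C_S[c] = number of runs of symbols < c."""
    runs = [(ch, len(list(group))) for ch, group in groupby(last)]
    s = [ch for ch, _ in runs]
    b = [bit for _, n in runs for bit in [1] + [0] * (n - 1)]
    b_prime = [bit for _, n in sorted(runs, key=lambda r: r[0])
               for bit in [1] + [0] * (n - 1)]
    c_s, total = {}, 0
    for ch in sorted(set(s)):
        c_s[ch] = total
        total += s.count(ch)
    return s, b, b_prime, c_s


def rank1(bits, i):
    return sum(bits[:i])


def select1(bits, j):
    """1-based position of the j-th 1; len(bits) + 1 if there is no such 1."""
    ones = [p + 1 for p, bit in enumerate(bits) if bit]
    return ones[j - 1] if j <= len(ones) else len(bits) + 1


def c_plus_rank(ch, i, s, b, b_prime, c_s):
    """C_T[ch] + rank_ch(L, i), computed without access to L itself."""
    run = rank1(b, i)                       # index of the run containing position i
    runs_of_ch = s[:run].count(ch)          # rank_ch(S, rank_1(B, i))
    if run == 0 or s[run - 1] != ch:        # case L[i] != ch (or i == 0)
        return select1(b_prime, c_s[ch] + 1 + runs_of_ch) - 1
    return select1(b_prime, c_s[ch] + runs_of_ch) + i - select1(b, run)
```

With `s, b, b_prime, c_s = build_rlfm("cccaaggatt")`, `c_plus_rank("a", 8, s, b, b_prime, c_s)` returns 3, the LF[8] value derived on the slide.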

  33. Space requirement
      • C_S[1,σ] takes O(σ log |T|) bits.
      • B and B' with rank/select dictionaries take 2|T| + o(|T|) bits.
      • S represented using a wavelet tree occupies |S|H_0(S) + o(|S|) bits.
      • In CPM 2004, we have shown that |S| ≤ H_k|T| + σ^k.

  34. Comparison [chart not reproduced in this transcript]
