
Compressed suffix arrays and suffix trees with applications to text indexing and string matching








  1. Compressed suffix arrays and suffix trees with applications to text indexing and string matching

  2. Roberto Grossi, Jeffrey Scott Vitter

  3. Agenda • A (very) short review of suffix arrays • Introduction • Problem definition • Information-theoretic reasoning • “Simple” solution, round 2 • Compressed suffix arrays in (1/2)·n·log log n + O(n) bits and O(log log n) access time • Rank and select: problem definitions • Rank data structure • Compressed suffix arrays in ε^(-1)·n + O(n) bits and O(log^ε n) access time • Select data structure (if time permits)

  4. Short review of suffix arrays • A suffix array of a string S is a sorted array of the suffixes of S, represented by an array of pointers to the suffixes of S • For example: the string TelAviv and its corresponding suffix array
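As a quick sketch of the idea (illustrative, not the construction used later in the talk), a suffix array can be obtained by sorting suffix start positions:

```python
# A minimal sketch: sort the starting positions of all suffixes
# by the suffix text. O(n^2 log n) time; fine for an example.
def suffix_array(s):
    return sorted(range(len(s)), key=lambda i: s[i:])

# The slide's example string. Uppercase letters sort before lowercase
# in byte order, so the suffix "Aviv" comes first.
print(suffix_array("TelAviv"))   # → [3, 0, 1, 5, 2, 6, 4]
```

Each output entry is a pointer to the start of a suffix, listed in sorted order of the suffixes themselves.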

  5. Introduction • Part of the succinct data structures branch • Motivation: DNA/genome strings (small alphabet, very long strings) • Mainly a theoretical article

  6. Problem definition • The algorithm is composed of two phases: • compress • lookup • compress: given a suffix array SA, compress it to get its succinct representation • lookup(i): given the compressed representation, return SA[i]

  7. Some definitions • We will deal (at first) with a binary alphabet: Σ = {a, b} • We add a special end-of-string symbol # • and set the order between the characters to be a < # < b (*) • Basic RAM model • log n word size • word lookup and arithmetic operations in constant time

  8. Information theory reasoning

  9. Information theory reasoning (2) • Suffix array size: n·log n bits • There is a one-to-one correspondence between suffix arrays and strings (see the construction details) • Number of possible suffix arrays: 2^(n-1) • So a perfect compression takes n bits (the string itself) • But then the cost of lookup is Ω(n) (see previous lecture)

  10. “Simple” solution, round 2: a different approach • Let’s pack each log n bits together to create a new alphabet • So the text length becomes n/log n and the pattern length m/log n • The suffix array then takes O(n) bits • But searching becomes hard (alignment): the text is aligned but the pattern isn’t — log n cases

  11. “Simple” solution, round 2 • When the text isn’t aligned, the pattern occurs k bits to the right of a word boundary • We need to prepend k bits to the pattern and check each possibility • So we need to check 2^k cases • k ≈ log n ⇒ n different cases to check • and this assumes we even know how much to pad!

  12. General framework • Abstract data type optimization [Jacobson ’89] • If the number of distinct data structures is C(n), then each data structure can be represented in O(log C(n)) bits • This does not, however, guarantee the time complexity of the supported operations

  13. Compressed suffix arrays in (1/2)·n·log log n + O(n) bits and O(log log n) access time • Recursive method in nature • Takes advantage of the relation between suffixes • Let SA0 be the uncompressed suffix array and N0 = n its size (assume a power of 2) • In phase k of the compression we start with SAk of size Nk = n/2^k and create SAk+1 of size Nk+1 = Nk/2 • SAk+1 holds a permutation of {1..Nk+1}

  14. SAk+1 construction • Create the Bk bit vector: Bk[i] = 1 iff SAk[i] is even • Create the rank vector: rankk(j) counts the number of 1 bits in the first j bits of Bk • Create the Ψk vector: Ψk(i) stores the 0-to-1 companion relation (it maps each position i holding an odd value to the position j with SAk[j] = SAk[i] + 1) • Store the even values from SAk (divided by 2) in SAk+1
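A sketch of one compression phase (0-indexed Python with plain lists standing in for the succinct vectors; the function name is hypothetical):

```python
# One phase of the compression: from SA_k build B_k, rank_k, Psi_k
# and SA_{k+1}. Assumes SA_k is a permutation of {1..N_k}, N_k even.
def phase(sa_k):
    n = len(sa_k)
    # B[i] = 1 iff SA_k[i] is even.
    B = [1 if v % 2 == 0 else 0 for v in sa_k]
    # rank[j] = number of 1 bits among B[0..j] (prefix counts).
    rank, c = [], 0
    for b in B:
        c += b
        rank.append(c)
    # Psi maps a position holding an odd value to the position holding
    # the even value SA_k[i] + 1; even positions map to themselves.
    pos = {v: i for i, v in enumerate(sa_k)}      # value -> position
    psi = [i if B[i] else pos[sa_k[i] + 1] for i in range(n)]
    # SA_{k+1}: the even values, halved, in order of appearance.
    sa_next = [v // 2 for v in sa_k if v % 2 == 0]
    return B, rank, psi, sa_next

sa_k = [6, 3, 1, 5, 8, 2, 7, 4]        # a toy permutation of {1..8}
B, rank, psi, sa_next = phase(sa_k)
assert sa_next == [3, 4, 1, 2]         # even values 6,8,2,4 halved
```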

  15. An example • The 32-character string T • abbabbabbabbabaaabababbabbbabba#

  16. An Example

  17. Example …

  18. How to compute SAk from SAk+1 • Lemma 1 • Given a suffix array SAk, let Bk, rankk, Ψk and SAk+1 be the result of the transformation performed by phase k. We can reconstruct SAk from SAk+1 by the following formula: SAk[i] = 2·SAk+1[rankk(Ψk(i))] + (Bk[i] − 1) • Let’s split into 2 cases: • SAk[i] is even • SAk[i] is odd

  19. Example continue

  20. Compress • We keep l = O(log log n) levels • All levels but the last, SAl, are stored implicitly • For each of the levels 0..l−1 we save Bj, rankj, Ψj • rankj and Ψj are stored implicitly • The size of SAl is (n/2^l)·log n = O(n) bits

  21. lookup • Just compute SAk[i] recursively from SAk+1 • Recursion depth: log log n • All the data structures used have O(1) access time • ⇒ O(log log n) lookup cost
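Putting Lemma 1 and the recursion together, an end-to-end toy sketch (plain lists instead of the succinct structures; function names are illustrative):

```python
# Keep B_k, rank_k, Psi_k for levels 0..l-1 and only the last array SA_l;
# recover any SA_0[i] via Lemma 1, recursing level by level.
def build_levels(sa, l):
    levels = []
    for _ in range(l):
        B = [1 if v % 2 == 0 else 0 for v in sa]
        rank, c = [], 0
        for b in B:
            c += b
            rank.append(c)
        pos = {v: i for i, v in enumerate(sa)}
        psi = [i if B[i] else pos[sa[i] + 1] for i in range(len(sa))]
        levels.append((B, rank, psi))
        sa = [v // 2 for v in sa if v % 2 == 0]
    return levels, sa                 # per-level structures + final SA_l

def lookup(levels, sa_l, i, k=0):
    if k == len(levels):
        return sa_l[i]
    B, rank, psi = levels[k]
    # Lemma 1: SA_k[i] = 2 * SA_{k+1}[rank_k(Psi_k(i))] + (B_k[i] - 1)
    return 2 * lookup(levels, sa_l, rank[psi[i]] - 1, k + 1) + (B[i] - 1)

sa0 = [6, 3, 1, 5, 8, 2, 7, 4]        # a toy permutation of {1..8}
levels, sa_l = build_levels(sa0, 2)   # keep 2 compressed levels
assert all(lookup(levels, sa_l, i) == sa0[i] for i in range(8))
```

The recursion depth equals the number of stored levels, matching the O(log log n) bound when l = log log n.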

  22. How the data is stored • The Bk bit vector is stored explicitly • O(Nk) bits • O(1) lookup • O(Nk) preprocessing time • The rankk vector is stored implicitly using Jacobson’s rank data structure • O(Nk·(log log Nk)/log Nk) bits • O(1) lookup • O(Nk) preprocessing time • The Ψk vector is stored implicitly (using rank and select)

  23. Ψk vector representation

  24. Let’s Take a look

  25. An Example

  26. Example …

  27. So what can we do with all the lists? • Concatenate them together in lexicographical order to form the Lk list • L1 = {9,1,6,12,14,2,4,5} • Let’s see how we can compute Ψk(i) • If Bk[i] = 1 (SAk[i] is even), it’s simply i • Otherwise: • because all the saved prefix patterns are in sorted order, • Lk contains, up to point i, entries for all the odd suffixes before i, so h = i − rankk(i) • So we can look up the h-th entry in Lk • and it will give us the answer

  28. Simple example • L2 = {5,8,2,4} • rank2 = {1,1,1,2,3,3,3,4} • B2 = {1,0,0,1,1,0,0,1} • Ψ2 = {1,5,8,4,5,1,4,8} • Ψ(3) = ? • B2[3] = 0, rank(3) = 1, h = 3 − 1 = 2, L2[2] = 8 • Ψ(3) = 8 
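The slide's computation, sketched directly (1-indexed, data copied from the slide):

```python
# Data from the slide (1-indexed positions).
B2    = [1, 0, 0, 1, 1, 0, 0, 1]
rank2 = [1, 1, 1, 2, 3, 3, 3, 4]
L2    = [5, 8, 2, 4]

def psi(i):
    if B2[i - 1] == 1:            # SA_k[i] is even: Psi is the identity
        return i
    h = i - rank2[i - 1]          # number of odd entries up to i
    return L2[h - 1]              # h-th entry of the concatenated list

assert psi(2) == 5
assert psi(3) == 8                # as computed on the slide
```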

  29. Ψk vector representation • Lemma 2: Given s integers in sorted order, each of w bits, where s < 2^w, we can store them in at most s(2 + w − ⌊log s⌋) + O(s/log log s) bits so that retrieving the h-th integer takes constant time

  30. Ψk vector representation • Take the first z = ⌊log s⌋ bits of each integer, creating q1..qs • It’s easy to see that qi ≤ qi+1 < s (we take the most significant bits, and the integers are sorted) • The remaining w − z bits of each integer form ri • (Diagram: each Si is split into a high part qi and a low part ri)

  31. Ψk vector representation • Store ri in a simple array: (w − z)·s bits • Store q1..qs in a table supporting select and rank in constant time • The table Q is implemented in the following way: instead of saving the numbers themselves, we store the differences q1, q2 − q1, q3 − q2, …, qs − qs−1 in unary representation (0^d 1) • and add a select data structure

  32. Ψk vector representation • In order to get qi we simply do select(i) and count the number of zeros before the i-th 1 • qi = select(i) − rank(select(i))

  33. Ψk vector representation • The size of the Q table: the unary string takes s + 2^z < 2s bits, plus the select overhead O(s/log log s) • So we can output Si easily: Si = qi·2^(w−z) + ri
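A sketch of Lemma 2's encoding (the class name is hypothetical, and the constant-time select is replaced by a precomputed position list for clarity):

```python
import math

# Store s sorted w-bit integers as: low parts r_i in a plain array, and
# high parts q_i gap-encoded in unary (0^d 1) with select support.
class SortedInts:
    def __init__(self, values, w):
        s = len(values)
        self.z = max(1, int(math.log2(s)))        # z = floor(log s)
        self.shift = w - self.z
        self.r = [v & ((1 << self.shift) - 1) for v in values]  # low bits
        q = [v >> self.shift for v in values]                   # high bits
        # Unary-code the gaps q_1, q_2 - q_1, ...: each becomes 0^gap 1.
        bits, prev = [], 0
        for qi in q:
            bits.extend([0] * (qi - prev) + [1])
            prev = qi
        # select(h) support: positions of the 1 bits (precomputed here).
        self.ones = [p for p, b in enumerate(bits) if b == 1]

    def get(self, h):                              # h is 1-indexed
        p = self.ones[h - 1]                       # select(h), 0-indexed
        qh = p - (h - 1)                           # zeros before the h-th 1
        return (qh << self.shift) | self.r[h - 1]  # S_h = q_h*2^(w-z) + r_h

vals = [3, 17, 21, 40, 58, 63]                     # sorted 6-bit integers
enc = SortedInts(vals, w=6)
assert [enc.get(h) for h in range(1, 7)] == vals
```

The zeros-before-the-h-th-1 count is exactly select(h) − rank(select(h)) from the slide, since rank(select(h)) = h.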

  34–36. Ψk vector representation • Lemma 3: We can store the concatenated list Lk used for Ψk in n(1/2 + 3/2^(k+1)) + O(n/(2^k·log log n)) bits, so that accessing the h-th element takes constant time, with preprocessing time O(n/2^k + 2^(2^k)) • There are 2^(2^k) lists; number them (even the empty ones) • Each integer xi in the lists, 1 ≤ xi ≤ Nk, is transformed into a new integer x′ by prepending its list number as the high-order bits • The bit size of x′ is 2^k + log Nk • After concatenating all the lists we have Nk/2 sorted numbers of 2^k + log Nk bits each • Using Lemma 2 we get: • O(1) access time • and a space bound of n(1/2 + 3/2^(k+1)) + O(n/(2^k·log log n)) bits

  37. Sum it up (space complexity)

  38. Rank data structure • Due to Jacobson • Given a bit vector of length n, rank(i) is the number of 1 bits up to position i • Multilevel approach • We slice the bit string into chunks of log^2 n bits • At each chunk boundary we keep a cumulative rank counter • Each chunk is divided into sub-chunks of (1/2)·log n bits, • and a relative counter is kept at each sub-chunk boundary • At the bottom level a simple lookup table is used

  39. Rank • (Diagram: a query walks the levels — chunks of log^2 n bits, sub-chunks of (1/2)·log n bits, then the lookup table — adding the chunk counter (14), the sub-chunk counter (3) and the table count inside the final sub-chunk (1); the output is 14 + 3 + 1)
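A runnable sketch of the two-level rank scheme (illustrative fixed chunk sizes in place of log^2 n and (1/2)·log n):

```python
# Two counter levels plus a lookup table, as in Jacobson's structure.
CHUNK, SUB = 16, 4          # stand-ins for log^2 n and (1/2) log n

def build_rank(bits):
    big, small = [], []
    total = 0
    for c in range(0, len(bits), CHUNK):
        big.append(total)                       # ones before this chunk
        rel = 0
        for s in range(c, min(c + CHUNK, len(bits)), SUB):
            small.append(rel)                   # ones inside chunk so far
            rel += sum(bits[s:s + SUB])
        total += rel
    # Bottom level: ones in every prefix of every SUB-bit pattern.
    table = {}
    for pat in range(1 << SUB):
        bs = [(pat >> (SUB - 1 - j)) & 1 for j in range(SUB)]
        for j in range(SUB + 1):
            table[(pat, j)] = sum(bs[:j])
    return big, small, table

def rank(bits, big, small, table, i):           # ones in bits[0:i]
    c, s = i // CHUNK, i // SUB
    start = s * SUB
    pat = 0
    for b in bits[start:start + SUB] + [0] * (start + SUB - len(bits)):
        pat = (pat << 1) | b
    return big[c] + small[s] + table[(pat, i - start)]

bits = [1, 0, 1, 1, 0, 0, 1, 0] * 5             # a 40-bit toy vector
big, small, table = build_rank(bits)
assert all(rank(bits, big, small, table, i) == sum(bits[:i])
           for i in range(len(bits)))
```

With the asymptotic chunk sizes, the counters and the table together take o(n) bits while each query reads O(1) words.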

  40. Rank Analysis

  41. Compressed suffix arrays in ε^(-1)·n + O(n) bits and O(log^ε n) access time • In order to break the space barrier we need to save fewer levels ⇒ longer lookups • Let’s save only 3 compressed levels: SA0, SAl′ and SAl, with l = ⌈log log n⌉ and l′ = ⌈(1/2)·log log n⌉ • We use a dictionary data structure which can tell whether an element is a member of the dictionary and supports a rank query, both in O(1) time • The space complexity of the dictionary is … • We keep 2 dictionaries, D0 and Dl′, recording which items survive to the next stored level (SA0 → SAl′ and SAl′ → SAl)

  42. The Ψ′k function • We define the Ψ′k function, which maps each 1 to its companion 0 • Let’s define the φk function to be … • We just need to merge the indexes in Lk and L′k

  43. Example

  44. The φk function implementation • Lemma 4: We can store the concatenated list used for φk • for k = 0: in n + O(n/log log n) bits • for k > 0: in n(1 + 1/2^(k−1)) + O(n/(2^k·log log n)) bits, with preprocessing time O(n/2^k + 2^(2^k)) • If k > 0, simply use Lemma 3 • For k = 0: • Encode a and # as 0, and b as 1 • Create an n-bit vector named ℓ • ℓ[f] = 0 iff the list for φ0 has a or # at position f • We add select and select0 data structures on top of it: O(n/log log n) bits • We also keep c0, the number of 0s in ℓ • A query φ0(j) is answered in the following way: • if j ≤ c0, return select0(j) • if j > c0, return select(j − c0)
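A sketch of the k = 0 query under these assumptions (the bit vector is a toy example, not derived from the talk's text, and naive precomputed selects stand in for the constant-time structures):

```python
# ell[f] = 0 iff the phi_0 list has 'a' or '#' at position f, 1 for 'b'.
ell = [0, 1, 0, 0, 1, 1, 0, 1]        # toy bit vector (assumption)
zeros = [p for p, b in enumerate(ell) if b == 0]   # select0 support
ones  = [p for p, b in enumerate(ell) if b == 1]   # select1 support
c0 = len(zeros)                        # number of 0 bits in ell

def phi0(j):                           # j is 1-indexed
    if j <= c0:
        return zeros[j - 1] + 1        # select0(j), reported 1-indexed
    return ones[j - c0 - 1] + 1        # select(j - c0)

assert phi0(1) == 1                    # first 0 bit of ell
assert phi0(c0 + 1) == 2               # first 1 bit of ell
```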

  45. The lookup algorithm • To compute SA[i], we start walking the φ function: i, i′, i″, i‴, with SA0[i] + 1 = SA0[i′], … • until reaching an entry found in the dictionary D0 • Let s be the walk length • and r the entry’s rank in the dictionary (how many items have already passed to the next level?) • Using r we start walking the next level • Let s′ be the walk length • and r′ the entry’s rank in its dictionary • From these we assemble the result • The walk length is max(s, s′) < 2^l′ ≈ sqrt(log n) • So the query time is O(sqrt(log n))

  46. The general multilevel build • For every 0 < ε < 1: • assume εl is an integer, so 2^(εl) ≤ 2·log^ε n • create the levels 0, εl, 2εl, …, l • The number of levels is ε^(-1) + 1 ⇒ lookup of O(log^ε n)

  47. The General multilevel Build

  48. Select data structure • select(i) returns the position of the i-th 1 bit in the string • Same idea as rank, but a bit more complicated • Multilevel approach • At the first level we record the position of every (log n·log log n)-th 1 bit • total space O(n/log log n) • Between each two recorded bits we keep the following data: • if the distance r between them satisfies r > (log n·log log n)^2, we keep the absolute positions of all the 1 bits between them: log^2 n·log log n bits, which is at most r/log log n (and r < n!) • otherwise we keep the relative position of every (log r·log log n)-th 1 bit: total space log r·log log n < log^2 n·log log n • Then we keep one more level (with the same notions) • The block size comes down to (log log n)^4

  49. Select data structure • After that, we keep a lookup table • For every bit pattern of length log n/d (d ≥ 2) we save: • the number of 1 bits • the location of the i-th 1 bit within the pattern • As before, the space is O(n^(1/d)·log n·log log n) • The lookup is then very simple: just walk the levels, • reach a block, and answer the query inside it using the lookup table • Total space complexity: O(n/log log n)
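A simplified sketch of the select idea, collapsing the levels into one: sample the position of every SAMPLE-th 1 bit, then finish inside blocks with a lookup table (sizes are illustrative, not the asymptotic choices):

```python
# Position samples + per-block lookup table; a rough stand-in for the
# real multilevel structure, which adds more levels and case analysis.
SAMPLE, BLOCK = 4, 8

def build_select(bits):
    ones = [p for p, b in enumerate(bits) if b == 1]
    samples = ones[::SAMPLE]                   # every SAMPLE-th 1 bit
    # Lookup table: for each BLOCK-bit pattern, positions of its 1 bits.
    table = {}
    for pat in range(1 << BLOCK):
        table[pat] = [j for j in range(BLOCK)
                      if (pat >> (BLOCK - 1 - j)) & 1]
    return samples, table

def select(bits, samples, table, i):           # position of the i-th 1
    p = samples[(i - 1) // SAMPLE]             # jump near the answer
    k = (i - 1) % SAMPLE                       # 1 bits still to skip
    while True:                                # finish block by block
        start = (p // BLOCK) * BLOCK
        chunk = bits[start:start + BLOCK] + [0] * (start + BLOCK - len(bits))
        pat = 0
        for b in chunk:
            pat = (pat << 1) | b
        here = [start + j for j in table[pat] if start + j >= p]
        if k < len(here):
            return here[k]
        k -= len(here)
        p = start + BLOCK

bits = [1, 0, 1, 1, 0, 0, 1, 0] * 5
samples, table = build_select(bits)
ones = [p for p, b in enumerate(bits) if b == 1]
assert [select(bits, samples, table, i)
        for i in range(1, len(ones) + 1)] == ones
```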
