
Compressed suffix arrays and suffix trees with applications to text indexing and string matching








  1. Compressed suffix arrays and suffix trees with applications to text indexing and string matching

  2. Roberto Grossi, Jeffrey Scott Vitter

  3. Agenda • A (very) short review of suffix arrays • Introduction • Problem definition • Information-theoretic reasoning • “Simple” solution, round 2 • Compressed suffix arrays in (1/2)·n·log log n + O(n) bits and O(log log n) access time • Rank and select: problem definitions • Rank data structure • Compressed suffix arrays in ε^(-1)·n + O(n) bits and O(log^ε n) access time • Select data structure (if time permits)

  4. Short review of suffix arrays • A suffix array of a string S is a sorted array of the suffixes of S, represented by an array of pointers to the suffixes of S • For example: the string TelAviv and its corresponding suffix array
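As a quick sketch of the idea (illustrative, not the construction used later in the talk), a suffix array can be obtained by sorting suffix start positions:

```python
# A minimal sketch: sort the starting positions of all suffixes
# by the suffix text. O(n^2 log n) time; fine for an example.
def suffix_array(s):
    return sorted(range(len(s)), key=lambda i: s[i:])

# The slide's example string. Uppercase letters sort before lowercase
# in byte order, so the suffix "Aviv" comes first.
print(suffix_array("TelAviv"))   # → [3, 0, 1, 5, 2, 6, 4]
```

Each output entry is a pointer to the start of a suffix, listed in sorted order of the suffixes themselves.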

  5. Introduction • Part of the succinct data structures branch • Motivation: DNA/genome strings (small alphabet, very long strings) • Mainly a theoretical article

  6. Problem definition • The algorithm is composed of two phases: • compress • lookup • compress: given a suffix array SA, compress it to get its succinct representation • lookup(i): given the compressed representation, return SA[i]

  7. Some definitions • We will deal (at first) with a binary alphabet: Σ = {a, b} • We add a special end-of-string symbol # • and set the order between the characters to be a < # < b (*) • Basic RAM model • log n word size • word lookup and arithmetic operations in constant time

  8. Information theory reasoning

  9. Information theory reasoning (2) • Suffix array size: n·log n bits • There is a one-to-one correspondence between suffix arrays and strings (see the construction details) • Number of possible suffix arrays: 2^(n-1) • So a perfect compression takes n bits (the string itself) • But then the cost of lookup is Ω(n) (see previous lecture)

  10. “Simple” solution, round 2: a different approach • Let’s pack each log n bits together to create a new alphabet • So the text length becomes n/log n and the pattern length m/log n • The suffix array then takes O(n) bits • But searching becomes hard (alignment): the text is aligned but the pattern isn’t — log n cases

  11. “Simple” solution, round 2 • When the text isn’t aligned, the pattern occurs k bits to the right of a word boundary • We need to prepend k bits to the pattern and check each possibility • So we need to check 2^k cases • k ≈ log n ⇒ n different cases to check • and this assumes we even know how much to pad!

  12. General framework • Abstract data type optimization [Jacobson ’89] • If the number of distinct data structures is C(n), then each data structure can be represented in O(log C(n)) bits • This does not, however, guarantee the time complexity of the supported operations

  13. Compressed suffix arrays in (1/2)·n·log log n + O(n) bits and O(log log n) access time • Recursive method in nature • Takes advantage of the relation between suffixes • Let SA0 be the uncompressed suffix array and N0 = n its size (assume a power of 2) • In phase k of the compression we start with SAk of size Nk = n/2^k and create SAk+1 of size Nk+1 = Nk/2 • SAk+1 holds a permutation of {1..Nk+1}

  14. SAk+1 construction • Create the Bk bit vector: Bk[i] = 1 iff SAk[i] is even • Create the rank vector: rankk(j) counts the number of 1 bits in the first j bits of Bk • Create the Ψk vector: Ψk(i) stores the 0-to-1 companion relation (it maps each position i holding an odd value to the position j with SAk[j] = SAk[i] + 1) • Store the even values from SAk (divided by 2) in SAk+1
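A sketch of one compression phase (0-indexed Python with plain lists standing in for the succinct vectors; the function name is hypothetical):

```python
# One phase of the compression: from SA_k build B_k, rank_k, Psi_k
# and SA_{k+1}. Assumes SA_k is a permutation of {1..N_k}, N_k even.
def phase(sa_k):
    n = len(sa_k)
    # B[i] = 1 iff SA_k[i] is even.
    B = [1 if v % 2 == 0 else 0 for v in sa_k]
    # rank[j] = number of 1 bits among B[0..j] (prefix counts).
    rank, c = [], 0
    for b in B:
        c += b
        rank.append(c)
    # Psi maps a position holding an odd value to the position holding
    # the even value SA_k[i] + 1; even positions map to themselves.
    pos = {v: i for i, v in enumerate(sa_k)}      # value -> position
    psi = [i if B[i] else pos[sa_k[i] + 1] for i in range(n)]
    # SA_{k+1}: the even values, halved, in order of appearance.
    sa_next = [v // 2 for v in sa_k if v % 2 == 0]
    return B, rank, psi, sa_next

sa_k = [6, 3, 1, 5, 8, 2, 7, 4]        # a toy permutation of {1..8}
B, rank, psi, sa_next = phase(sa_k)
assert sa_next == [3, 4, 1, 2]         # even values 6,8,2,4 halved
```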

  15. An example • The 32-character string T • abbabbabbabbabaaabababbabbbabba#

  16. An Example

  17. Example …

  18. How to compute SAk from SAk+1 • Lemma 1 • Given a suffix array SAk, let Bk, rankk, Ψk and SAk+1 be the result of the transformation performed by phase k. We can reconstruct SAk from SAk+1 by the following formula: SAk[i] = 2·SAk+1[rankk(Ψk(i))] + (Bk[i] − 1) • Let’s split into 2 cases: • SAk[i] is even • SAk[i] is odd

  19. Example continue

  20. Compress • We keep l = O(log log n) levels • All levels but the last, SAl, are stored implicitly • For each of the levels 0..l−1 we save Bj, rankj, Ψj • rankj and Ψj are stored implicitly • The size of SAl is (n/2^l)·log n = O(n) bits

  21. lookup • Just compute SAk[i] recursively from SAk+1 • Recursion depth: log log n • All the data structures used have O(1) access time • ⇒ O(log log n) lookup cost
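Putting Lemma 1 and the recursion together, an end-to-end toy sketch (plain lists instead of the succinct structures; function names are illustrative):

```python
# Keep B_k, rank_k, Psi_k for levels 0..l-1 and only the last array SA_l;
# recover any SA_0[i] via Lemma 1, recursing level by level.
def build_levels(sa, l):
    levels = []
    for _ in range(l):
        B = [1 if v % 2 == 0 else 0 for v in sa]
        rank, c = [], 0
        for b in B:
            c += b
            rank.append(c)
        pos = {v: i for i, v in enumerate(sa)}
        psi = [i if B[i] else pos[sa[i] + 1] for i in range(len(sa))]
        levels.append((B, rank, psi))
        sa = [v // 2 for v in sa if v % 2 == 0]
    return levels, sa                 # per-level structures + final SA_l

def lookup(levels, sa_l, i, k=0):
    if k == len(levels):
        return sa_l[i]
    B, rank, psi = levels[k]
    # Lemma 1: SA_k[i] = 2 * SA_{k+1}[rank_k(Psi_k(i))] + (B_k[i] - 1)
    return 2 * lookup(levels, sa_l, rank[psi[i]] - 1, k + 1) + (B[i] - 1)

sa0 = [6, 3, 1, 5, 8, 2, 7, 4]        # a toy permutation of {1..8}
levels, sa_l = build_levels(sa0, 2)   # keep 2 compressed levels
assert all(lookup(levels, sa_l, i) == sa0[i] for i in range(8))
```

The recursion depth equals the number of stored levels, matching the O(log log n) bound when l = log log n.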

  22. How the data is stored • The Bk bit vector is stored explicitly • O(Nk) bits • O(1) lookup • O(Nk) preprocessing time • The rankk vector is stored implicitly using Jacobson’s rank data structure • O(Nk·(log log Nk)/log Nk) bits • O(1) lookup • O(Nk) preprocessing time • The Ψk vector is stored implicitly (using rank and select)

  23. Ψk vector representation

  24. Let’s Take a look

  25. An Example

  26. Example …

  27. So what can we do with all the lists? • Concatenate them together in lexicographical order to form the Lk list • L1 = {9,1,6,12,14,2,4,5} • Let’s see how we can compute Ψk(i) • If Bk[i] = 1 (SAk[i] is even), it’s simply i • Otherwise: • because all the saved prefix patterns are in sorted order, • Lk contains, up to point i, entries for all the odd suffixes before i, so h = i − rankk(i) • So we can look up the h-th entry in Lk • and it will give us the answer

  28. Simple example • L2 = {5,8,2,4} • rank2 = {1,1,1,2,3,3,3,4} • B2 = {1,0,0,1,1,0,0,1} • Ψ2 = {1,5,8,4,5,1,4,8} • Ψ(3) = ? • B2[3] = 0, rank(3) = 1, h = 3 − 1 = 2, L2[2] = 8 • Ψ(3) = 8 
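The slide's computation, sketched directly (1-indexed, data copied from the slide):

```python
# Data from the slide (1-indexed positions).
B2    = [1, 0, 0, 1, 1, 0, 0, 1]
rank2 = [1, 1, 1, 2, 3, 3, 3, 4]
L2    = [5, 8, 2, 4]

def psi(i):
    if B2[i - 1] == 1:            # SA_k[i] is even: Psi is the identity
        return i
    h = i - rank2[i - 1]          # number of odd entries up to i
    return L2[h - 1]              # h-th entry of the concatenated list

assert psi(2) == 5
assert psi(3) == 8                # as computed on the slide
```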

  29. Ψk vector representation • Lemma 2: Given s integers in sorted order, each of w bits, where s < 2^w, we can store them in at most s(2 + w − ⌊log s⌋) + O(s/log log s) bits so that retrieving the h-th integer takes constant time

  30. Ψk vector representation • Take the first z = ⌊log s⌋ bits of each integer, creating q1..qs • It’s easy to see that qi ≤ qi+1 < s (we take the most significant bits, and the integers are sorted) • The remaining w − z bits of each integer form ri • (Diagram: each Si is split into a high part qi and a low part ri)

  31. Ψk vector representation • Store ri in a simple array: (w − z)·s bits • Store q1..qs in a table supporting select and rank in constant time • The table Q is implemented in the following way: instead of saving the numbers themselves, we store the differences q1, q2 − q1, q3 − q2, …, qs − qs−1 in unary representation (0^d 1) • and add a select data structure

  32. Ψk vector representation • In order to get qi we simply do select(i) and count the number of zeros before the i-th 1 • qi = select(i) − rank(select(i))

  33. Ψk vector representation • The size of the Q table: the unary string takes s + 2^z < 2s bits, plus the select overhead O(s/log log s) • So we can output Si easily: Si = qi·2^(w−z) + ri
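A sketch of Lemma 2's encoding (the class name is hypothetical, and the constant-time select is replaced by a precomputed position list for clarity):

```python
import math

# Store s sorted w-bit integers as: low parts r_i in a plain array, and
# high parts q_i gap-encoded in unary (0^d 1) with select support.
class SortedInts:
    def __init__(self, values, w):
        s = len(values)
        self.z = max(1, int(math.log2(s)))        # z = floor(log s)
        self.shift = w - self.z
        self.r = [v & ((1 << self.shift) - 1) for v in values]  # low bits
        q = [v >> self.shift for v in values]                   # high bits
        # Unary-code the gaps q_1, q_2 - q_1, ...: each becomes 0^gap 1.
        bits, prev = [], 0
        for qi in q:
            bits.extend([0] * (qi - prev) + [1])
            prev = qi
        # select(h) support: positions of the 1 bits (precomputed here).
        self.ones = [p for p, b in enumerate(bits) if b == 1]

    def get(self, h):                              # h is 1-indexed
        p = self.ones[h - 1]                       # select(h), 0-indexed
        qh = p - (h - 1)                           # zeros before the h-th 1
        return (qh << self.shift) | self.r[h - 1]  # S_h = q_h*2^(w-z) + r_h

vals = [3, 17, 21, 40, 58, 63]                     # sorted 6-bit integers
enc = SortedInts(vals, w=6)
assert [enc.get(h) for h in range(1, 7)] == vals
```

The zeros-before-the-h-th-1 count is exactly select(h) − rank(select(h)) from the slide, since rank(select(h)) = h.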

  34–36. Ψk vector representation • Lemma 3: We can store the concatenated list Lk used for Ψk in n(1/2 + 3/2^(k+1)) + O(n/(2^k·log log n)) bits, so that accessing the h-th element takes constant time, with preprocessing time O(n/2^k + 2^(2^k)) • There are 2^(2^k) lists; number them (even the empty ones) • Each integer xi in the lists, 1 ≤ xi ≤ Nk, is transformed into a new integer x′ by prepending its list number as the high-order bits • The bit size of x′ is 2^k + log Nk • After concatenating all the lists we have Nk/2 sorted numbers of 2^k + log Nk bits each • Using Lemma 2 we get: • O(1) access time • and a space bound of n(1/2 + 3/2^(k+1)) + O(n/(2^k·log log n)) bits

  37. Sum it up (space complexity)

  38. Rank data structure • Due to Jacobson • Given a bit vector of length n, rank(i) is the number of 1 bits up to position i • Multilevel approach • We slice the bit string into chunks of log^2 n bits • At each chunk boundary we keep a cumulative rank counter • Each chunk is divided into sub-chunks of (1/2)·log n bits, • and a relative counter is kept at each sub-chunk boundary • At the bottom level a simple lookup table is used

  39. Rank • (Diagram: a query walks the levels — chunks of log^2 n bits, sub-chunks of (1/2)·log n bits, then the lookup table — adding the chunk counter (14), the sub-chunk counter (3) and the table count inside the final sub-chunk (1); the output is 14 + 3 + 1)
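A runnable sketch of the two-level rank scheme (illustrative fixed chunk sizes in place of log^2 n and (1/2)·log n):

```python
# Two counter levels plus a lookup table, as in Jacobson's structure.
CHUNK, SUB = 16, 4          # stand-ins for log^2 n and (1/2) log n

def build_rank(bits):
    big, small = [], []
    total = 0
    for c in range(0, len(bits), CHUNK):
        big.append(total)                       # ones before this chunk
        rel = 0
        for s in range(c, min(c + CHUNK, len(bits)), SUB):
            small.append(rel)                   # ones inside chunk so far
            rel += sum(bits[s:s + SUB])
        total += rel
    # Bottom level: ones in every prefix of every SUB-bit pattern.
    table = {}
    for pat in range(1 << SUB):
        bs = [(pat >> (SUB - 1 - j)) & 1 for j in range(SUB)]
        for j in range(SUB + 1):
            table[(pat, j)] = sum(bs[:j])
    return big, small, table

def rank(bits, big, small, table, i):           # ones in bits[0:i]
    c, s = i // CHUNK, i // SUB
    start = s * SUB
    pat = 0
    for b in bits[start:start + SUB] + [0] * (start + SUB - len(bits)):
        pat = (pat << 1) | b
    return big[c] + small[s] + table[(pat, i - start)]

bits = [1, 0, 1, 1, 0, 0, 1, 0] * 5             # a 40-bit toy vector
big, small, table = build_rank(bits)
assert all(rank(bits, big, small, table, i) == sum(bits[:i])
           for i in range(len(bits)))
```

With the asymptotic chunk sizes, the counters and the table together take o(n) bits while each query reads O(1) words.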

  40. Rank Analysis

  41. Compressed suffix arrays in ε^(-1)·n + O(n) bits and O(log^ε n) access time • In order to break the space barrier we need to save fewer levels ⇒ longer lookups • Let’s save only 3 compressed levels: SA0, SAl′ and SAl, with l = ⌈log log n⌉ and l′ = ⌈(1/2)·log log n⌉ • We use a dictionary data structure which can tell whether an element is a member of the dictionary and supports a rank query, both in O(1) time • The space complexity of the dictionary is … • We keep 2 dictionaries, D0 and Dl′, recording which items survive to the next stored level (SA0 → SAl′ and SAl′ → SAl)

  42. The Ψ′k function • We define the Ψ′k function, which maps each 1 to its companion 0 • Let’s define the φk function to be … • We just need to merge the indexes in Lk and L′k

  43. Example

  44. The φk function implementation • Lemma 4: We can store the concatenated list used for φk • for k = 0: in n + O(n/log log n) bits • for k > 0: in n(1 + 1/2^(k−1)) + O(n/(2^k·log log n)) bits, with preprocessing time O(n/2^k + 2^(2^k)) • If k > 0, simply use Lemma 3 • For k = 0: • Encode a and # as 0, and b as 1 • Create an n-bit vector named ℓ • ℓ[f] = 0 iff the list for φ0 has a or # at position f • We add select and select0 data structures on top of it: O(n/log log n) bits • We also keep c0, the number of 0s in ℓ • A query φ0(j) is answered in the following way: • if j ≤ c0, return select0(j) • if j > c0, return select(j − c0)
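A sketch of the k = 0 query under these assumptions (the bit vector is a toy example, not derived from the talk's text, and naive precomputed selects stand in for the constant-time structures):

```python
# ell[f] = 0 iff the phi_0 list has 'a' or '#' at position f, 1 for 'b'.
ell = [0, 1, 0, 0, 1, 1, 0, 1]        # toy bit vector (assumption)
zeros = [p for p, b in enumerate(ell) if b == 0]   # select0 support
ones  = [p for p, b in enumerate(ell) if b == 1]   # select1 support
c0 = len(zeros)                        # number of 0 bits in ell

def phi0(j):                           # j is 1-indexed
    if j <= c0:
        return zeros[j - 1] + 1        # select0(j), reported 1-indexed
    return ones[j - c0 - 1] + 1        # select(j - c0)

assert phi0(1) == 1                    # first 0 bit of ell
assert phi0(c0 + 1) == 2               # first 1 bit of ell
```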

  45. The lookup algorithm • To compute SA[i], we start walking the φ function: i, i′, i″, i‴, with SA0[i] + 1 = SA0[i′], … • until reaching an entry found in the dictionary D0 • Let s be the walk length • and r the entry’s rank in the dictionary (how many items have already passed to the next level?) • Using r we start walking the next level • Let s′ be the walk length • and r′ the entry’s rank in its dictionary • From these we assemble the result • The walk length is max(s, s′) < 2^l′ ≈ sqrt(log n) • So the query time is O(sqrt(log n))

  46. The general multilevel build • For every 0 < ε < 1: • assume εl is an integer, so 2^(εl) ≤ 2·log^ε n • create the levels 0, εl, 2εl, …, l • The number of levels is ε^(-1) + 1 ⇒ lookup of O(log^ε n)

  47. The General multilevel Build

  48. Select data structure • select(i) returns the position of the i-th 1 bit in the string • Same idea as rank, but a bit more complicated • Multilevel approach • At the first level we record the position of every (log n·log log n)-th 1 bit • total space O(n/log log n) • Between each two recorded bits we keep the following data: • if the distance r between them satisfies r > (log n·log log n)^2, we keep the absolute positions of all the 1 bits between them: log^2 n·log log n bits, which is at most r/log log n (and r < n!) • otherwise we keep the relative position of every (log r·log log n)-th 1 bit: total space log r·log log n < log^2 n·log log n • Then we keep one more level (with the same notions) • The block size comes down to (log log n)^4

  49. Select data structure • After that, we keep a lookup table • For every bit pattern of length log n/d (d ≥ 2) we save: • the number of 1 bits • the location of the i-th 1 bit within the pattern • As before, the space is O(n^(1/d)·log n·log log n) • The lookup is then very simple: just walk the levels, • reach a block, and answer the query inside it using the lookup table • Total space complexity: O(n/log log n)
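A simplified sketch of the select idea, collapsing the levels into one: sample the position of every SAMPLE-th 1 bit, then finish inside blocks with a lookup table (sizes are illustrative, not the asymptotic choices):

```python
# Position samples + per-block lookup table; a rough stand-in for the
# real multilevel structure, which adds more levels and case analysis.
SAMPLE, BLOCK = 4, 8

def build_select(bits):
    ones = [p for p, b in enumerate(bits) if b == 1]
    samples = ones[::SAMPLE]                   # every SAMPLE-th 1 bit
    # Lookup table: for each BLOCK-bit pattern, positions of its 1 bits.
    table = {}
    for pat in range(1 << BLOCK):
        table[pat] = [j for j in range(BLOCK)
                      if (pat >> (BLOCK - 1 - j)) & 1]
    return samples, table

def select(bits, samples, table, i):           # position of the i-th 1
    p = samples[(i - 1) // SAMPLE]             # jump near the answer
    k = (i - 1) % SAMPLE                       # 1 bits still to skip
    while True:                                # finish block by block
        start = (p // BLOCK) * BLOCK
        chunk = bits[start:start + BLOCK] + [0] * (start + BLOCK - len(bits))
        pat = 0
        for b in chunk:
            pat = (pat << 1) | b
        here = [start + j for j in table[pat] if start + j >= p]
        if k < len(here):
            return here[k]
        k -= len(here)
        p = start + BLOCK

bits = [1, 0, 1, 1, 0, 0, 1, 0] * 5
samples, table = build_select(bits)
ones = [p for p, b in enumerate(bits) if b == 1]
assert [select(bits, samples, table, i)
        for i in range(1, len(ones) + 1)] == ones
```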
