
Hard Instances of Compressed Text Indexing

Hard Instances of Compressed Text Indexing. Rahul Shah, Louisiana State University / National Science Foundation*. Based on joint work with Sharma Thankachan (Univ. of Central Florida) and Arnab Ganguly (U. Wisc. Whitewater). Supported by NSF Grant CCF 1527435.







  1. Hard Instances of Compressed Text Indexing
  Rahul Shah, Louisiana State University / National Science Foundation*
  Based on joint work with Sharma Thankachan (Univ. of Central Florida) and Arnab Ganguly (U. Wisc. Whitewater)
  Supported by NSF Grant CCF 1527435
  *This talk does not represent the views of the NSF

  2. String Data • Fundamental in Computer Science • Finite sequence of characters drawn from alphabet set ∑ • Applications: • Genomes, e.g. sequence alignment • Biometrics, Images • Time-series, e.g. finance • Text (for phrase search): Google, Microsoft • Music • Network security, e.g. online malware detection

  3. Touches Many Fields
  • Uncertain and probabilistic matching [SIGMOD'14, EDBT'16]
  • Big Data matching [ESA'13, PODS'14, ISAAC'15]
  • Ranked pattern matching [JACM'14]
  • Software plagiarism and version control [SODA'17]
  • RNA structural matching [ISAAC'17]

  4. Agenda for today • Introduction to Text/Succinct Data Structures • Suffix Trees, Suffix Array • Bit vectors, Wavelet trees • BWT, FM-index, LF-mapping • Tree and RMQ encodings • Compressed Suffix Tree • Easy cases: augment Compressed Suffix Trees • Property Matching • Top-k retrieval : Sparsification • Hard cases: Pattern Matching problems with ST variants • Parameterized matching (pBWT) • Order-preserving matching (LF-Successor encoding) • RNA structural matching • Even harder: 2-D matching Technical Talk Alert !!!

  5. String Searching Indexes
  • Suffix trees, suffix arrays, CSA, FM-Index, etc.
  • T: mississippi$; suffixes: mississippi$, ississippi$, ssissippi$, sissippi$, issippi$, ssippi$, sippi$, ippi$, ppi$, pi$, i$, $
  • P = ssi: walk down to the locus of P in O(p) time; report occurrences in O(occ) time
  • [Figure: suffix tree of T with leaves labeled by suffix-array positions; LF(9) = 11, LF(10) = 12]
  • O(n) words of space and optimal O(p+occ) query time

  6. Suffix Array
  • T = M I S S I S S I P P I, positions 1–11
  • SA[i] = starting position of the i-th lexicographically smallest suffix
  • Space: O(n) words = O(n log n) bits
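The suffix array above can be sketched in a few lines. This is a naive O(n² log n) construction for illustration only (the function name is my own); real indexes use linear-time algorithms such as SA-IS.

```python
def suffix_array(text):
    # Sort all suffix start positions by the suffix they begin.
    # 0-based positions; the slide's figures are 1-based.
    return sorted(range(len(text)), key=lambda i: text[i:])
```

For example, `suffix_array("mississippi$")` returns `[11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]`, which is the slide's SA shifted to 0-based indexing.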

  7. Space bloat incurred by ST and SA
  • In practice, suffix trees are about 15–50 times the size of the original text; suffix arrays take about 5–15 times
  • Comparison is against the minimum size required to store the text itself
  • Complexity-wise: n log σ bits for the text vs n log n bits for the index
  • σ = 4 for DNA and σ = 256 for ASCII text
  • log n is often the word size of 32–64 bits, whereas a DNA symbol is 2 bits
  • Human genome: 3 billion base pairs = 0.8 GB of memory, but its suffix tree takes about 35–45 GB, and even more during construction
  • Tools: bowtie, bzip

  8. Pattern Matching with BWT
  • T = a b a a b a b b $, P = a b a
  • Sorted suffixes (with $ largest here): aababb$, abaababb$, ababb$, abb$, baababb$, babb$, bb$, b$, $
  • BWT(T) = b $ a b a a a b b
  • C[a] = 0, C[b] = 4 (number of characters in T smaller than the given one)
  • Backward search step, prepending character c: sp = rank_c(sp−1) + C[c] + 1; ep = rank_c(ep) + C[c]
  • LF-mapping: jumping from the i-th suffix in sorted order to its text-previous suffix, e.g. LF(6) = 3, LF(3) = 1
  • Count statistics (rank) on the BWT can be supported using a data structure called a wavelet tree
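The BWT and the backward-search recurrence can be sketched directly. Note one assumption: this sketch uses the usual convention that the sentinel '$' sorts smallest, while the slide's example sorts '$' last, so the BWT string here differs from the slide's. Plain substring counts stand in for the wavelet-tree rank queries.

```python
def bwt(text):
    # text must end with a unique sentinel '$' (treated as smallest here).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in sa)   # text[-1] wraps to the sentinel

def backward_search(text, pattern):
    # Count occurrences via the recurrence from the slide:
    # sp = C[c] + rank_c(sp-1) + 1 ; ep = C[c] + rank_c(ep)
    L = bwt(text)
    # C[c] = number of characters in text strictly smaller than c
    C = {c: sum(1 for x in text if x < c) for c in set(text)}
    sp, ep = 1, len(text)                     # 1-based inclusive suffix range
    for c in reversed(pattern):
        if c not in C:
            return 0
        sp = C[c] + L[:sp - 1].count(c) + 1
        ep = C[c] + L[:ep].count(c)
        if sp > ep:
            return 0
    return ep - sp + 1
```

For instance, `backward_search("mississippi$", "ssi")` returns 2.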

  9. Goals
  • Succinct data structures: information-theoretically optimal space, plus lower-order o(…) terms
  • Compact: O(optimum) space
  • Query times: as fast as possible within the space limitations
  • Compressed text indexing: not O(n) words, i.e., not O(n log n) bits
  • n log σ + o(…) bits, or nH_k bits
  • poly(|P|, log n) or |P| · poly(log n) query times

  10. Two building blocks
  • Text dictionary: given a string T over Σ, for any c in Σ
  • rank_c(p) – counts the number of c's in T[1..p]
  • select_c(i) – finds the (minimum) position p such that T[1..p] contains i occurrences of c
  • char(i) – gives the character T[i]
  • Bit dictionary: given a bit vector B of length n with t 1s (equivalently, a subset of t items from an ordered universe of size n)
  • rank(p) – returns the number of 1s in B[1..p]
  • select(i) – returns the position p such that B[p] is the i-th 1
  • Minimum space: nH_0(T) (vs plain n log σ) for the text dictionary; t log(ne/t) bits for the bit dictionary
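A minimal sketch of the bit dictionary (class name mine). A truly succinct structure keeps o(n)-bit precomputed block counts rather than a full prefix-sum array, but the interface is the same.

```python
import bisect

class BitDict:
    """rank/select over a 0/1 list, 1-based as on the slide."""
    def __init__(self, bits):
        self.pref = [0]                   # pref[p] = # of 1s in B[1..p]
        for b in bits:
            self.pref.append(self.pref[-1] + b)

    def rank(self, p):
        return self.pref[p]

    def select(self, i):
        # Smallest p with pref[p] == i, i.e. position of the i-th 1.
        return bisect.bisect_left(self.pref, i)
```

Usage: `BitDict([1, 0, 1, 1, 0, 1]).select(3)` returns 4, the position of the third 1.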

  11. Wavelet Tree
  • Query: count the number of e's in T[5,15]
  • T = c a b f b e g c g a g e f e a b e g, Σ = {a, b, c, d, e, f, g}
  • Root bit vector: 0 marks the left half of the alphabet {a, b, c, d}, 1 marks the right half {e, f, g}
  • Each child recursively splits its subalphabet, down to single-character leaves
  • [Figure: wavelet tree with per-node bit vectors for the subalphabets {a,b,c,d}, {e,f,g}, {a,b}, {c,d}, {e,f}, {g}]

  12. Wavelet Tree: counting e's in T[5,15]
  • At the root, the range [5,15] maps into the right child ({e,f,g}): rank1(15) = 7 and rank1(5−1) = 1 give the range [2,7] there
  • In the {e,f,g} node, the range maps into its left child ({e,f}): rank0(7) = 5 and rank0(1) = 1 give [2,5]
  • In the {e,f} node (where e = 0): rank0(5) = 3, rank0(1) = 0
  • Count of e's in T[5,15] = 3 − 0 = 3
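The wavelet-tree rank walk can be sketched with an explicit pointer-based tree (class name mine; a succinct version stores one bit vector per level with o(n)-overhead rank support instead of Python lists). The alphabet here is derived from the text, so 'd' from the slide's Σ does not appear.

```python
class WaveletTree:
    """rank(c, i) = occurrences of c in T[1..i], 1-based, one level per step."""
    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        if len(self.alphabet) == 1:       # leaf: every position holds c
            self.bits = None
            return
        mid = len(self.alphabet) // 2
        left_a, right_a = self.alphabet[:mid], self.alphabet[mid:]
        self.bits = [0 if c in left_a else 1 for c in seq]   # 0 = left half
        self.left = WaveletTree([c for c in seq if c in left_a], left_a)
        self.right = WaveletTree([c for c in seq if c in right_a], right_a)

    def rank(self, c, i):
        if i <= 0:
            return 0
        if self.bits is None:
            return i
        ones = sum(self.bits[:i])         # rank1(i) on this node's bit vector
        if c in self.alphabet[:len(self.alphabet) // 2]:
            return self.left.rank(c, i - ones)   # zeros map into left child
        return self.right.rank(c, ones)          # ones map into right child
```

With `wt = WaveletTree("cabfbegcgagefeabeg")`, the slide's query is `wt.rank('e', 15) - wt.rank('e', 4)`, which equals 3.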

  13. Encoding Tree Structure
  • Example tree: 11 nodes, ~22 bits
  • Counting argument (Catalan numbers): 2n − θ(log n) bits are necessary
  • BP: (((( ) ( ) ( )) ( ) )( )(() ( ) ))
  • DFUDS: ((() (() ((() ) ) ) ) ) (() ) )
  • With rank, select, and find_match_close: k-th child, parent, leftmost leaf, etc. in O(1) time
  • Also LCA, level ancestor (LA), etc.
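A minimal sketch of the balanced-parentheses (BP) encoding and the find_match_close primitive (function names mine; trees are given as nested lists of children). Succinct BP structures answer the matching query in O(1) time with o(n) extra bits, whereas this sketch scans.

```python
def bp_encode(children):
    # '(' on entering a node, ')' on leaving; argument = list of subtrees.
    out = "("
    for child in children:
        out += bp_encode(child)
    return out + ")"

def find_close(bp, i):
    # Matching ')' for the '(' at 0-based position i, by excess counting.
    depth = 0
    for j in range(i, len(bp)):
        depth += 1 if bp[j] == "(" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced sequence")
```

For example, a root with a leaf child and a one-child subtree encodes as `bp_encode([[], [[]]]) == "(()(()))"`: 4 nodes, 8 bits.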

  14. Range Minimum Query
  • Array: 1 2 4 3 6 9 11 5 7 8
  • Cartesian tree: using a balanced-parentheses encoding takes ~2n bits
  • No need to store the original array – queries can be answered position-based
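A sparse-table RMQ sketch (function names mine). This takes O(n log n) words, not the ~2n bits of the Cartesian-tree encoding on the slide, but it answers the same position-returning queries in O(1) after preprocessing.

```python
def build_rmq(a):
    # table[j][i] = position of the minimum in a[i .. i + 2^j - 1]
    n = len(a)
    table = [list(range(n))]
    j = 1
    while (1 << j) <= n:
        prev, row = table[-1], []
        for i in range(n - (1 << j) + 1):
            l, r = prev[i], prev[i + (1 << (j - 1))]
            row.append(l if a[l] <= a[r] else r)
        table.append(row)
        j += 1
    return table

def rmq(a, table, i, j):
    # Position of the minimum in a[i..j], inclusive, via two overlapping blocks.
    k = (j - i + 1).bit_length() - 1
    l, r = table[k][i], table[k][j - (1 << k) + 1]
    return l if a[l] <= a[r] else r
```

On the slide's array, `rmq(a, t, 2, 6)` (0-based) returns 3, the position of the value 3.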

  15. Property Matching
  • Text (T): A G T C A T A T T G A C A T A G G C C T A C A T G A A A A C C G C A T T A G
  • Properties (π) = {(3, 7), (10, 13), (18, 27), (22, 30), (37, 38)}
  • Pattern (P): C A T
  • Number of occurrences (occ) = 4
  • Number of occurrences satisfying a property, i.e. falling entirely within some interval of π (occ_π) = 2
  • Applications: tandem repeats, SINEs, LINEs, probabilistic matching, etc.

  16. Compressing the Augmenting Structures
  • All property-matching indexes consist of a suffix tree augmented with some additional structures
  • Can we compress? Compressed Suffix Trees (CST) plus O(n) bits of extras
  • Compressed Property Suffix Tree (CPST) = CSA + O(n) bits of additional structures
  • Space: nH_k + O(n) + o(n log σ) bits
  • Query time: O(|P| + occ_π log^{1+ε} n)

  17. Compressed Property Suffix Trees
  • Text (T) and π as on the previous slide
  • end(i) = max{f_k, 0} such that s_k ≤ i, over all properties (s_k, f_k) in π
  • length(i) = end(i) − i + 1

  18. Compressed Property Suffix Trees
  • Find the suffix range of P = C A T (|P| = 3) in the suffix array
  • For each i in the range, SA[i] is an output iff length(SA[i]) ≥ |P|
  • [Figure: suffix array with columns i, SA[i], length(SA[i])]

  19. 3-Sided Range Searching Using Range Maximum Queries (RMQ)
  • Over the suffix range, query the maximum of length(SA[i]); if it is ≥ |P|, report its position and recurse on the subranges to its left and right
  • All outputs can be retrieved with O(1 + occ_π) RMQ queries
  • An RMQ structure over length(SA[i]) takes 2n + o(n) bits
  • But storing length(SA[i]) itself would take O(n log n) bits
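The recursive reporting step above can be sketched as follows (function name mine). A linear scan stands in for the 2n-bit range-maximum structure; the recursion pattern is the point: each call either reports a position or terminates a branch, so the output of size occ costs O(1 + occ) RMQ calls.

```python
def report_above(vals, lo, hi, thresh):
    # Report every position in vals[lo..hi] (0-based, inclusive) whose
    # value is >= thresh, using range-maximum queries.
    if lo > hi:
        return []
    m = max(range(lo, hi + 1), key=lambda i: vals[i])   # stand-in for RMaxQ
    if vals[m] < thresh:
        return []                                       # whole range fails
    return (report_above(vals, lo, m - 1, thresh) + [m]
            + report_above(vals, m + 1, hi, thresh))
```

For example, `report_above([2, 4, 6, 0, 5, 1], 0, 5, 3)` returns `[1, 2, 4]`.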

  20. Compressed Property Suffix Trees
  • Observation: end(i) is a non-decreasing function of i
  • Hence end(i) can be encoded using a bit vector B of length 2n (2n + o(n) bits including rank/select structures)
  • B[j] = 1 for j = end(i) + i, else 0
  • Then end(i) = select_B(i) − i
  • and length(i) = select_B(i) − 2i + 1
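The bit-vector encoding of end(i) can be sketched directly (function names mine). Because end(i) is non-decreasing, the positions end(i) + i are strictly increasing, so each i gets a distinct 1-bit; a succinct bit dictionary would answer select in O(1), whereas this sketch scans.

```python
def encode_end(end):
    # end is a 1-based-conceptually array end[1..n]; B has length 2n,
    # with B[end(i) + i] = 1 (1-based positions, stored 0-based).
    n = len(end)
    B = [0] * (2 * n)
    for i, e in enumerate(end, 1):
        B[e + i - 1] = 1
    return B

def select1(B, i):
    # 1-based position of the i-th set bit (linear stand-in for select).
    seen = 0
    for p, b in enumerate(B, 1):
        seen += b
        if seen == i:
            return p
    raise ValueError("fewer than i ones")

def decode_end(B, i):
    return select1(B, i) - i          # end(i) = select_B(i) - i

def length_of(B, i):
    return select1(B, i) - 2 * i + 1  # length(i) = select_B(i) - 2i + 1
```

For example, with `B = encode_end([3, 3, 4, 4, 5])`, `decode_end(B, 3)` recovers 4 and `length_of(B, 3)` gives 2.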

  21. Compressed Property Suffix Trees
  • Hence we do not store length(SA[i]) explicitly; instead we store only an RMQ structure (of 2n + o(n) bits) over it
  • length(SA[i]) for any given i is computed by first finding SA[i] in O(log^{1+ε} n) time, then obtaining end(SA[i]) from the bit vector in constant time
  • The index consists of the following components:
  • CSA (nH_k + o(n log σ) bits)
  • end(i) bit vector (2n + o(n) bits)
  • RMQ over length(SA[i]) (2n + o(n) bits)
  • Total space: nH_k + O(n) + o(n log σ) bits
  • Query time: O(|P| + occ_π log^{1+ε} n)

  22. Top-k Most Frequent Document Retrieval
  • Instead of listing all documents (strings) in which the pattern occurs, list only highly "frequent" documents
  • Top-k: retrieve only the k most frequent documents
  • Inverted indexes: popular in the IR community, but they do not efficiently answer arbitrary pattern queries and are not efficient for string search
  • This motivates suffix-tree-based solutions

  23. Suffix Tree-Based Solutions
  • d1: banana, d2: urban
  • Suffixes: a$, an$, ana$, anana$, ban$, banana$, n$, na$, nana$, rban$, urban$
  • For the pattern "an", we look at its subtree – d1 appears twice and d2 appears once in this subtree
  • [Figure: generalized suffix tree of d1 and d2 with leaves labeled by document ids]

  24. Framework
  • First assume k is fixed; let the group size be g = k log k log^{1+ε} n
  • Take consecutive runs of g leaves (in the suffix tree) from left to right and group them; mark the least common ancestor of each group, and also mark each node that is an LCA of two marked nodes
  • At each marked node v, store an explicit list of the top-k highest-scoring documents in Subtree(v)

  25. Example
  • Group size = 4
  • The LCA of two marked nodes is also marked
  • At each marked node, the top-k list is stored
  • We build a CSA on the n/g bottom-level marked nodes
  • [Figure: tree with leaves a–p grouped in fours and their marked LCAs]

  26. Framework
  • Because of the sampling, the space consumption (for a fixed k) is O((n/g) · k log n) = O(n/(k log k log^{1+ε} n) · k log n) = O(n / (log k log^ε n)) bits
  • Repeat this for k = 1, 2, 4, 8, 16, …
  • Total space: O(n/log^ε n · Σ 1/log k) over k = 2^j with j ≤ log n; the sum Σ_j 1/j is harmonic, i.e. O(log log n)
  • So the total is O(n log log n / log^ε n) = o(n) bits

  27. Query Answering
  • Explicit top-k list stored at the marked node u; v is the query locus, with at most 2g fringe leaves
  • Key idea: any document in the top-k of the subtree of v is
  • either in the top-k list of the marked node u
  • or the document corresponding to one of the at most 2g fringe leaves

  28. Our Approach
  • Choose a smaller grouping factor h = k log k log^ε n
  • The number of fringe leaves is smaller, hence the on-the-fly frequency computation time is O(h log^{1+ε} n) = O(k log k log^{2+ε} n)
  • CHALLENGE: we cannot afford to store top-k answers explicitly (in log n bits each) at the marked nodes, because (n/h) · k log n can be very large
  • SOLUTION: encode the top-k answer of a marked node in O(log log n) bits per document instead of O(log n) bits (this bounds the total space for pre-computed answers by o(n) bits)

  29. Encoding an answer in O(log log n) bits
  • Node u is marked with respect to the larger grouping factor g = k log k log^{1+ε} n, and its answers are maintained explicitly
  • The top-k for node v then comes either from the top-k of u or from the at most 2g fringe leaves
  • Encoding idea: instead of maintaining document ids explicitly, refer to k elements among these 2g + k candidates

  30. Encoding an answer in O(log log n) bits
  • Hence the task is to encode k numbers from a universe of size O(g + k)
  • N numbers from a universe of size U can be encoded in N log(U/N) + O(N) bits, with O(1)-time decoding of any number, using indexable dictionaries [RRR, SODA '02]
  • For U = O(g + k) and N = k, the space is O(k log log n) bits

  31. Summary: Succinct and Semi-Succinct Results

  32. Parameterized Pattern Matching
  • Alphabet Σ consists of two disjoint sets: static characters Σs and parameterized characters Σp
  • A parameterized string (p-string) is a string in (Σs ∪ Σp)*
  • Two p-strings S = s_1 s_2 … s_m and S' = s'_1 s'_2 … s'_m p-match iff
  • s_i = s'_i whenever s_i is in Σs
  • there exists a bijection ƒ on Σp that renames s_i to s'_i whenever s_i is in Σp
  • Example: Σs = {A, B} and Σp = {w, x, y, z}
  • AxBy and AwBz p-match
  • AxBy and AwBw do not p-match
  • Going forward, without loss of generality, we focus on texts/suffixes consisting of parameterized characters only

  33. Canonical Encodings
  • Convert a p-string S into a string over numeric symbols from 0 onwards
  • Encoding 1: every time a new character is encountered, it receives the next numeric symbol
  • aabcacb → 0012021
  • Encoding 2 (Baker's): encode each character as the distance to its previous occurrence: prev(S)
  • prev(S)[i] = 0 if i is the first occurrence of the p-character S[i]
  • prev(S)[i] = (i − j) if j < i is the rightmost previous occurrence of the p-character S[i]
  • wwxyywz → 0100140
  • Encoding 1 needs n log σ bits, but Encoding 2 needs n log n bits
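Both canonical encodings can be sketched in a few lines (function names mine; the sketch treats every character as parameterized, matching the slide's simplification).

```python
def encode1(s):
    # Encoding 1: each newly seen character gets the next numeric symbol.
    ids = {}
    return [ids.setdefault(c, len(ids)) for c in s]

def prev_encode(s):
    # Encoding 2 (Baker's prev): distance to the previous occurrence,
    # 0 at a character's first occurrence.
    last, out = {}, []
    for i, c in enumerate(s):
        out.append(i - last[c] if c in last else 0)
        last[c] = i
    return out
```

As on the slide, `encode1("aabcacb")` gives `[0, 0, 1, 2, 0, 2, 1]` and `prev_encode("wwxyywz")` gives `[0, 1, 0, 0, 1, 4, 0]`; two p-strings that p-match get equal prev encodings, e.g. `prev_encode("wxwx") == prev_encode("yzyz")`.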

  34. Parameterized Suffix Tree
  • Two p-strings S and S' p-match iff prev(S) = prev(S')
  • p-suffix tree: encode every suffix according to prev(.) and construct a suffix tree over the encoded suffixes
  • Construction time: Baker O(n|Σp| + n log |Σ|); Kosaraju O(n log |Σ|)
  • Searching in the p-suffix tree: encode P using prev(.) and search for prev(P)
  • Time: O(|P| log |Σ| + occ)

  35. Parameterized BWT
  • Consider the suffix T_i = T[i..n] and its preceding character T[i−1]
  • Define the zero of suffix T_i: z(T_i) = first position in T_i where the character T[i−1] appears
  • Let the zero depth of T_i be zd(T_i) = number of distinct characters in T_i up to the first occurrence of T[i−1]
  • Example: T = abcabbadcb; T_4 = abbadcb; T[4−1] = c
  • the first c in T_4 is at T_4[6], so z(T_4) = 6 and zd(T_4) = 4
  • Encode all suffixes by prev(.) and sort the encoded suffixes (keeping their corresponding zd(.) values)
  • The vector of zd(.) values thus obtained is called the pBWT
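The zero and zero-depth definitions can be sketched directly (function name mine; 1-based indexing as on the slide, and the sketch assumes T[i−1] actually occurs in T_i).

```python
def zero_and_depth(T, i):
    # Returns (z, zd) for the suffix T_i = T[i..n], 1-based:
    # z  = first position in T_i holding the preceding character T[i-1]
    # zd = number of distinct characters in T_i up to and including z
    c = T[i - 2]            # T[i-1] in 1-based indexing
    suf = T[i - 1:]         # T_i
    z = suf.index(c) + 1    # back to 1-based
    zd = len(set(suf[:z]))
    return z, zd
```

On the slide's example, `zero_and_depth("abcabbadcb", 4)` returns `(6, 4)`.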

  36. pBWT
  • pBWT = 5 5 5 5 1 3 3 5 4 2
  • SA[i] = 10 9 8 7 6 3 2 1 4 5
  • LF(9) = 6

  37. pST
  • T = abcabbadcb
  • T_3 = cabbadcb → prev(T_3) = 00013064; LF(6) = 7
  • T_6 = badcb → prev(T_6) = 00004; LF(5) = 10
  • [Figure: p-suffix tree over the prev-encoded suffixes T_1 … T_10]
  • Good thing: only a constant number of characters change as a suffix transforms during LF

  38. String Searching Indexes (recap)
  • Suffix trees, suffix arrays, CSA, FM-Index, etc.
  • T: mississippi$, P = ssi: walk down to the locus of P in O(p) time; report occurrences in O(occ) time
  • [Figure: suffix tree of T with leaves labeled by suffix-array positions; LF(9) = 11, LF(10) = 12]
  • O(n) words of space and optimal O(p+occ) query time

  39. Data Structures
  • WT over the pBWT
  • Operations: pBWT[i]; rangeCount(i, j, x, y) = number of k in [i, j] satisfying x ≤ pBWT[k] ≤ y
  • Space: n log |Σ| bits; time: O(log |Σ|)
  • Succinct representation of the p-suffix tree
  • Operations on a node: leftMostLeaf, rightMostLeaf, q-th child, parent, lca
  • Space: 4n + o(n) bits; time: O(1)
  • Additional O(n) + o(n log |Σ|)-bit structures
  • Total space: n log |Σ| + O(n) + o(n log |Σ|) bits

  40. Compute LF(i)
  • z = the node just below the zero of leaf ℓ_i
  • computed in O(log |Σ|) time using the WT and an additional O(n log log |Σ|)-bit structure
  • v = parent(z)

  41. Compute LF(i) – Computing N1
  • LF(i) = N1 + N2 + N3, where N_t is the number of suffixes j in S_t such that LF(j) ≤ LF(i)
  • Write in unary, on every edge, the number of zeros falling on it; arrange the edges in post-order and form a bit vector
  • N1 = zeros from the left falling on path(v) = #zeros coming from the right − #zeros counted up to v in post-order

  42. Compute LF(i) – Computing N2
  • N2 = number of j's such that
  • L[j] is a p-character, and
  • f_j > f_i, or f_j = f_i and j ≤ i
  • = rangeCount(L_z, R_z, c+1, |Σp|) + rangeCount(L_z, i, c, c), where c = pBWT[i]
  • Computed using the WT in O(log |Σ|) time

  43. Summarizing LF(i)
  • LF(i) is computed in O(log |Σ|) time
  • Space is n log |Σ| + O(n) + o(n log |Σ|) bits
  • pSA[.] and pSA^{-1}[.] can be computed in O(log^{1+ε} n) time with an additional O(n) bits
  • via a sampled suffix array and inverse suffix array

  44. Backward Search
  • Compute the suffix range of P as follows: given the suffix range [sp, ep] of Q, a proper suffix of P, and c, the character preceding Q in P, compute the suffix range [sp', ep'] of cQ
  • Preprocess P in O(|P| log |Σ|) time so that for any p-character P[i] we can find
  • the number of distinct p-characters in P[i+1, |P|]
  • the number of distinct p-characters in P[i+1, c_i], where c_i is the first occurrence of c in P[i+1, |P|]
  • Case: c is static
  • sp' = 1 + rangeCount(1, n, 1, c−1) + rangeCount(1, sp−1, c, c)
  • ep' = rangeCount(1, n, 1, c−1) + rangeCount(1, ep, c, c)
  • Time: O(log |Σ|)

  45. Backward Search
  • Case: c does not appear in Q
  • d = number of distinct p-characters in Q
  • (ep' − sp' + 1) = rangeCount(sp, ep, d+1, |Σp|), computed in O(log |Σ|) time
  • sp' = 1 + fSum(1 + fSum(lca(leaf_sp, leaf_ep))), computed in O(1) time
  • Case: c appears in Q
  • d = number of distinct p-characters in Q until the first occurrence of c
  • (ep' − sp' + 1) = rangeCount(sp, ep, d, d), computed in O(log |Σ|) time
  • sp' = LF(i_min), where i_min = min{i | sp ≤ i ≤ ep and pBWT[i] = d}, computed in O(log |Σ|) time

  46. Summarizing
  • The suffix range of P is found in O(|P| log |Σ|) time
  • Each text position is located in O(log^{1+ε} n) time
  • Final result
  • Space: n log |Σ| + O(n) + o(n log |Σ|) bits
  • Time: O(|P| log |Σ| + occ log^{1+ε} n)

  47. Order Preserving Pattern Matching

  48. Order-Preserving Pattern Matching
  • T = 5 4 1 3 8 6 2 9 4 5 1 2
  • P = 3 2 1 matches at T[1] and T[5]
  • P = 3 1 2 matches at T[2] and T[10], but not at T[6]
  • Predecessor encoding: each number points to the number on its left that is just smaller than (or equal to) itself
  • T = 1 8 6 9 5 2 4 3 7 2
  • Pred (values) = n 1 1 8 1 1 2 2 6 2
  • Prev (distances) = n 1 2 2 4 5 1 2 6 4
  • What happens if we prepend this with a previous character, say 3?
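The predecessor-distance encoding above can be sketched as follows (function name mine; 0 stands in for the slide's 'n' at positions with no predecessor, and ties in value resolve to the nearest position, which reproduces the slide's example). The slide's closing question hints at the difficulty: prepending a character can change the encoding of later positions.

```python
def op_prev(T):
    # prev[i] = i - j, where T[j] is the largest value <= T[i] among
    # T[0..i-1] (nearest such j on value ties); 0 if no predecessor.
    out = []
    for i in range(len(T)):
        best = None
        for j in range(i):              # quadratic scan, for illustration
            if T[j] <= T[i] and (best is None or T[j] >= T[best]):
                best = j                # ties: later (nearer) j wins
        out.append(0 if best is None else i - best)
    return out
```

On the slide's array, `op_prev([1, 8, 6, 9, 5, 2, 4, 3, 7, 2])` returns `[0, 1, 2, 2, 4, 5, 1, 2, 6, 4]`; order-isomorphic strings get equal encodings, e.g. `op_prev([3, 2, 1]) == op_prev([5, 4, 1])`.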

  49. Pictorially
  • [Figure: values 9 8 7 6 5 4 3 2 2 1 plotted with their predecessor pointers]
