Speeding up pattern matching by text compression

Yusuke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, Setsuo Arikawa
Department of Informatics, Kyushu University, Japan
Department of AI, Kyushu Institute of Technology, Japan

Contents
• Pattern matching on compressed text.
• A unifying framework for compressed pattern matching (Collage System).
• Byte pair encoding (BPE).
• Pattern matching algorithm on BPE compressed text.
• Experimental result.
• Conclusion.

Pattern Matching Problem
Pattern matching is one of the most fundamental operations in string processing.
• Knuth-Morris-Pratt (1974)
• Aho-Corasick (1975)
• Boyer-Moore (1977)
• Shift-Or (1992)
[Figure: pattern matching of a pattern against a text.]
Recently, a new trend for accelerating pattern matching has emerged: speeding up pattern matching by text compression. Judged by the traditional criteria for data compression, i.e., compression ratio and compression/decompression time, adaptive dictionary methods such as the Lempel-Ziv family are often preferred. However, such methods cannot speed up pattern matching, since extra work is needed to keep track of the compression mechanism.

Pattern Matching on Compressed Text
[Figure: an original text on secondary disk storage is transferred to memory and searched there. A compressed text on secondary disk storage is transferred to memory, expanded into the original text in memory, and then searched.]
Expanding the compressed text requires extra time and space.

Pattern Matching on Compressed Text
[Figure: the compressed text on secondary disk storage is transferred to memory and searched directly.]
GOAL 1: To perform a faster search in compressed texts in comparison with a regular decompression followed by an ordinary search.
GOAL 2 (speeding up pattern matching by text compression): To perform a faster search in compressed texts in comparison with an ordinary search in the original texts.

Previous Results (1)
year  researcher                            compression
1988  Eilam-Tzoreff and Vishkin             run-length
1992  Amir, Landau, and Vishkin             two-dimensional run-length
1992  Amir and Benson                       two-dimensional run-length
1994  Amir, Benson, and Farach              two-dimensional run-length
1994  Manber                                original compression scheme
1995  Farach and Thorup                     LZ77
1996  Gasieniec, et al.                     LZ77
1996  Amir, Benson, and Farach              LZW
1998  Kida, et al.                          LZW
1997  Karpinski, Rytter, and Shinohara      straight-line programs
1997  Miyazaki, Shinohara, and Takeda       straight-line programs
1997  Takeda                                finite state encoding
1998  Fukamachi, Shinohara, and Takeda      Huffman encoding
1998  Shibata                               byte pair encoding

Previous Results (2)
year  researcher                                     compression
1998  de Moura, Navarro, Ziviani, and Baeza-Yates    Word based encoding
1999  Kida, Takeda, Shinohara, and Arikawa           LZW
1999  Navarro and Raffinot                           LZ family
1999  Shibata, Takeda, Shinohara, and Arikawa        Antidictionary based
Unifying framework:
1999  Kida, et al.                                   Dictionary based methods (Collage system)
Today's talk:
2000  Shibata, et al.                                Byte pair encoding

A Unifying Framework for Compressed Pattern Matching
Previous: a separate pattern matching algorithm for each compression method (Compression A with PM Algorithm A, Compression B with PM Algorithm B, Compression C with PM Algorithm C).
Kida et al. [1999]: Compressions A, B, and C are all expressed as a collage system, and a single pattern matching algorithm works on the unifying framework.

Collage System: Definition and Several Examples

Dictionary Based Compression
[Figure: the original text is factorized into a series of phrases; encoding produces the compressed text and a dictionary structure.]
• How to choose the phrases.
• How to design the data structure of the dictionary.
• How to encode phrases.

Collage System
A collage system is a pair 〈D, S〉.
D: a sequence of assignments (dictionary structure)
    X1 := expr1; X2 := expr2; ...; Xn := exprn;
S: a sequence of variables defined in D (compressed text)
    S = Xi1, Xi2, ..., Xil (each Xij ∈ D)
||D|| = n: number of assignments in D
|S| = l: number of variables in S

Collage System
D: a sequence of assignments (dictionary structure)
    X1 := expr1; X2 := expr2; ...; Xn := exprn;
where each exprk is one of:
    a          for a ∈ Σ ∪ {ε}            (primitive assignment)
    Xi・Xj     for i, j < k                (concatenation)
    (Xi)^j     for i < k and integer j     (j times repetition)
    [j]Xi      for i < k and integer j     (prefix truncation)
    Xi[j]      for i < k and integer j     (suffix truncation)

Example of Collage System
D: X1 := a; X2 := b; X3 := X1・X2; X4 := X2・X1; X5 := (X3)^3; X6 := [3]X5; X7 := X6・X4;
S: X3, X6, X4, X7
Expansions: X3 = ab, X4 = ba, X5 = ababab (3 times repetition of X3), X6 = bab (prefix truncation of X5), X7 = babba.
Original text: abbabbababba
height(X7) = 4, height(D) = 4
[Figure: the derivation tree T(X7).]
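
The example above can be checked mechanically. The following is a minimal sketch that expands a collage system into the text it represents. The tuple encoding of expressions and the function name `expand` are assumptions made for illustration, and the truncations are read as "remove the first (or last) j characters", which is consistent with X6 = bab in the example.

```python
def expand(assignments, sequence):
    """Expand a collage system <D, S> into plain text (illustrative only)."""
    value = {}  # variable name -> string it derives
    for name, expr in assignments:
        kind = expr[0]
        if kind == "char":        # primitive assignment: one character or ""
            value[name] = expr[1]
        elif kind == "concat":    # Xi . Xj
            value[name] = value[expr[1]] + value[expr[2]]
        elif kind == "repeat":    # (Xi)^j
            value[name] = value[expr[1]] * expr[2]
        elif kind == "ptrunc":    # [j]Xi, assumed: drop the first j characters
            value[name] = value[expr[1]][expr[2]:]
        elif kind == "strunc":    # Xi[j], assumed: drop the last j characters
            s = value[expr[1]]
            value[name] = s[:len(s) - expr[2]]
        else:
            raise ValueError(f"unknown expression kind: {kind}")
    text = "".join(value[x] for x in sequence)
    return text, value


# The dictionary D and sequence S from the example slide.
D = [
    ("X1", ("char", "a")),
    ("X2", ("char", "b")),
    ("X3", ("concat", "X1", "X2")),
    ("X4", ("concat", "X2", "X1")),
    ("X5", ("repeat", "X3", 3)),
    ("X6", ("ptrunc", "X5", 3)),
    ("X7", ("concat", "X6", "X4")),
]
S = ["X3", "X6", "X4", "X7"]

text, value = expand(D, S)
print(text)
```

Expanding every variable like this takes time proportional to the uncompressed text, which is exactly what compressed pattern matching tries to avoid; the sketch is only meant to make the definition concrete.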

Pattern Matching Algorithm on a Collage System

Compressed pattern matching on a collage system
Theorem [Kida et al. 1999]: The problem of compressed pattern matching can be solved in O((||D|| + |S|)・height(D) + m^2 + r) time using O(||D|| + m^2) space. If D contains no truncation, it can be solved in O(||D|| + |S| + m^2 + r) time.
||D||: number of assignments in D
|S|: number of variables in S
m: pattern length
r: number of pattern occurrences

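Basic Idea
Pattern π = ababb; original text: abababba; compressed text S: Xi1, Xi2, Xi3, Xi4.
[Figure: the KMP automaton for π (states 0 to 5, with goto function and failure function) run over the original text, which is split into the phrases Xi1, Xi2, Xi3, Xi4 of S.]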

Jump and Output
The function Jump(j, u) = δKMP(j, u)
• It simulates the sequence of state transitions for u.
• The domain is Q×D.
• It replies in O(1) time.
The set Output(j, u) = { 1 ≦ i ≦ |u| : P is a suffix of P[1:j]・u[1:i] }
• This set contains the pattern occurrences.
• It replies in O(l) time.
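
To make the two definitions concrete, here is a minimal sketch that evaluates Jump and Output directly from the definitions by running a KMP automaton character by character over u. This is a brute-force illustration, not the O(1) realization discussed in the slides; the helper names `kmp_automaton`, `jump`, and `output` are chosen here for illustration. It uses the pattern ababb and the text abababba from the Basic Idea slide.

```python
def kmp_automaton(pattern):
    """Deterministic KMP automaton: delta[q][c] is the next state.

    State q means "the longest pattern prefix that is a suffix of the
    text read so far has length q". Characters that do not occur in the
    pattern lead to state 0 (handled with .get(c, 0) by the callers).
    """
    m = len(pattern)
    alphabet = set(pattern)
    delta = [dict() for _ in range(m + 1)]
    for c in alphabet:
        delta[0][c] = 1 if pattern[0] == c else 0
    x = 0  # state reached after reading pattern[1:q]
    for q in range(1, m + 1):
        for c in alphabet:
            delta[q][c] = delta[x][c]      # mismatch: behave like state x
        if q < m:
            delta[q][pattern[q]] = q + 1   # match: advance
            x = delta[x][pattern[q]]
    return delta


def jump(delta, j, u):
    """Jump(j, u): state reached from state j after reading u."""
    for c in u:
        j = delta[j].get(c, 0)
    return j


def output(delta, m, j, u):
    """Output(j, u): all i such that the pattern ends at u[1:i] (1-based)."""
    ends = set()
    for i, c in enumerate(u, start=1):
        j = delta[j].get(c, 0)
        if j == m:
            ends.add(i)
    return ends


P = "ababb"
delta = kmp_automaton(P)
print(jump(delta, 0, "abababba"), output(delta, len(P), 0, "abababba"))
```

If the text is split into the phrases ab, ab, ab, ba, the same occurrence is found by chaining calls: the state after the first three phrases is 4, and Output(4, ba) reports offset 1 inside the last phrase.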

Realization of Jump and Output
For Jump(q, Xk), if Xk is ...
• a (primitive assignment): O(1) time.
• Xi・Xj: if the factor concatenation problem for a string of length m can be solved in O(1) time, Jump can be computed in O(1) time.
For Output(q, Xk), if Xk is ...
• a (primitive assignment): O(1) time.
• Xi・Xj: the set can be enumerated in O(l) time from Output of Xi and Xj, where l is the size of the set Output.

Factor Concatenation Problem
Instance: Two factors x and y of a string P, each represented as a node of the suffix trie of P.
Question: Is the string xy a factor of P? If 'yes', then return its node number.
Example: P = COPACABANA, x = OPA, y = CABAN. Concatenating them gives OPACABAN = P[2:9], so the answer is 'Yes'.
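
The question itself can be illustrated with a naive check. The sketch below simply searches for xy in P and returns the 1-based range of the first occurrence instead of a suffix trie node; that return convention and the name `factor_concat` are assumptions for illustration. It takes time that depends on m per query, so it is not the constant-time solution the slides aim for.

```python
def factor_concat(p, x, y):
    """Return the 1-based inclusive range (start, end) of xy in p, or None.

    Naive illustration only: a real solution would represent x and y as
    suffix trie nodes and answer without scanning p.
    """
    k = p.find(x + y)
    if k < 0:
        return None
    return (k + 1, k + len(x) + len(y))


print(factor_concat("COPACABANA", "OPA", "CABAN"))
print(factor_concat("COPACABANA", "ANA", "C"))
```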

Solution to the problem
• Using a suffix trie, it can be solved in O(m) time after preprocessing of O(m^2) time and space.
• Using a two-dimensional lookup table, it can be solved in O(1) time, but we need O(m^4) time and space for preprocessing.
Result: it can be solved in O(1) time after O(m^2) time and space preprocessing.

Outline of Our Algorithm
Input: pattern P and collage system 〈D, S〉 (S = Xi1, Xi2, ..., Xin)
Output: all occurrences of the pattern.
    /* preprocessing of D and P */
    preprocess(D); preprocess(P);
    l := 0; q := 0;
    for j := 1 to n do begin
        for each d ∈ Output(q, Xij) do
            report 'pattern occurs at position l+d';
        q := Jump(q, Xij);    /* state transition */
        l := l + |Xij|;       /* calculation of the offset */
    end
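
The outline can be turned into a small runnable sketch for the truncation-free case with only primitive assignments and concatenations. Note the swapped-in technique: instead of the O(1) Jump based on factor concatenation, this sketch stores, for every variable, a full table over all KMP states, using the identities Jump(q, Xi・Xj) = Jump(Jump(q, Xi), Xj) and Output(q, Xi・Xj) = Output(q, Xi) ∪ { |Xi| + d : d ∈ Output(Jump(q, Xi), Xj) }. That costs O(m・||D||) table entries plus the stored output lists, so it does not achieve the bounds of the theorem. The data layout and the names `preprocess` and `search` are assumptions; the dictionary is the one from the BPE example in these slides.

```python
def kmp_automaton(pattern):
    """delta[q][c] = next KMP state; characters outside the pattern go to 0."""
    m = len(pattern)
    alphabet = set(pattern)
    delta = [dict() for _ in range(m + 1)]
    for c in alphabet:
        delta[0][c] = 1 if pattern[0] == c else 0
    x = 0
    for q in range(1, m + 1):
        for c in alphabet:
            delta[q][c] = delta[x][c]
        if q < m:
            delta[q][pattern[q]] = q + 1
            x = delta[x][pattern[q]]
    return delta


def preprocess(D, pattern):
    """Build per-variable Jump tables, Output lists, and lengths."""
    delta = kmp_automaton(pattern)
    m = len(pattern)
    states = range(m + 1)
    jump, output, length = {}, {}, {}
    for name, expr in D:
        if isinstance(expr, str):            # primitive assignment: one character
            jump[name] = [delta[q].get(expr, 0) for q in states]
            output[name] = [[1] if jump[name][q] == m else [] for q in states]
            length[name] = 1
        else:                                # concatenation Xi . Xj
            xi, xj = expr
            jump[name] = [jump[xj][jump[xi][q]] for q in states]
            output[name] = [
                output[xi][q] + [length[xi] + d for d in output[xj][jump[xi][q]]]
                for q in states
            ]
            length[name] = length[xi] + length[xj]
    return jump, output, length


def search(D, S, pattern):
    """Return the 1-based end positions of all pattern occurrences."""
    jump, output, length = preprocess(D, pattern)
    l, q, found = 0, 0, []
    for x in S:
        for d in output[x][q]:
            found.append(l + d)   # 'pattern occurs at position l+d'
        q = jump[x][q]            # state transition
        l += length[x]            # calculation of the offset
    return found


# Dictionary and sequence for T = ABABCDEBDEFABDEABC (BPE example).
D = [("X1", "A"), ("X2", "B"), ("X3", "C"), ("X4", "D"), ("X5", "E"),
     ("X6", "F"), ("X7", ("X1", "X2")), ("X8", ("X4", "X5")),
     ("X9", ("X7", "X3"))]
S = ["X7", "X9", "X8", "X2", "X8", "X6", "X7", "X8", "X9"]

print(search(D, S, "BDE"))
```

The main loop touches each variable of S once and never expands it, which is the point of the outline: the work per phrase is a table lookup plus the reported occurrences. The pattern BDE is a useful test because both of its occurrences straddle a phrase boundary.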

Compressed pattern matching on a collage system
• With truncation (LZ77, LZSS, etc.): O((||D|| + |S|)・height(D) + m^2 + r) time. Not suitable for speeding up pattern matching.
• Without truncation (LZ78, LZW, BPE, run-length, etc.): O(||D|| + |S| + m^2 + r) time.

Byte Pair Encoding: original encoding algorithm and modified algorithm

Byte Pair Encoding
Text: T = ABABCDEBDEFABDEABC
    AB→G: GGCDEBDEFGDEGC
    DE→H: GGCHBHFGHGC
    GC→I: GIHBHFGHI
Used characters: A, B, C, D, E, F
Pair table:
    Code  Pair
    A     A
    B     B
    C     C
    D     D
    E     E
    F     F
    G     AB
    H     DE
    I     GC
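
The replacement steps on this slide can be reproduced with a short sketch of the basic scheme: repeatedly replace the most frequent adjacent pair with an unused symbol. The stopping rule (stop when no pair occurs at least twice or the unused symbols run out), the tie-breaking (first pair seen wins), and the name `bpe_compress` are assumptions made for illustration. Recounting all pairs in every round is the slow, straightforward variant, not the accelerated one.

```python
from collections import Counter


def bpe_compress(text, unused_symbols):
    """Byte pair encoding sketch: returns (compressed text, pair table)."""
    pair_table = {}  # new symbol -> the pair it replaces
    for symbol in unused_symbols:
        # Count every adjacent pair in the current text.
        counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
        if not counts:
            break
        pair = max(counts, key=counts.get)   # most frequent pair
        if counts[pair] < 2:
            break                            # assumed stopping rule
        # str.replace substitutes non-overlapping occurrences left to right.
        text = text.replace(pair, symbol)
        pair_table[symbol] = pair
    return text, pair_table


compressed, pair_table = bpe_compress("ABABCDEBDEFABDEABC", "GHI")
print(compressed, pair_table)
```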

Byte Pair Encoding as a "collage system"
Text: T = ABABCDEBDEFABDEABC
    AB→G: GGCDEBDEFGDEGC
    DE→H: GGCHBHFGHGC
    GC→I: GIHBHFGHI
D: X1 := A; X2 := B; X3 := C; X4 := D; X5 := E; X6 := F; X7 := X1・X2; X8 := X4・X5; X9 := X7・X3;
S: X7, X9, X8, X2, X8, X6, X7, X8, X9

Speeding up of compression
Time complexity of BPE: O(uN), where u is the number of character codes and N is the text length.
Using a doubly-linked list: O(u + N) time.

Speed-up of compression
We apply the BPE algorithm to the first block.
[Figure: the dictionary D (X1 = A, X2 = C, X3 = X2・X1, ..., X255 = X247・X8, X256 = X125・X48) is given to a pattern matching machine for multiple replacement [Arikawa et al. 1984], which converts the original text into the BPE compressed text.]

Comparison of Compression Ratio and Time
The compression ratios of BPE are worse than those of "Compress" and "Gzip".
Compression ratio (%)
                        BPE original  BPE modified  Compress  Gzip
Brown corpus (6.8Mb)    51.0          59.0          43.7      39.0
Medline (60.3Mb)        56.2          59.0          42.3      33.3
Genbank (17.1Mb)        30.8          32.5          26.8      23.1
The compression time is drastically accelerated by our modification.
Compression time (sec)
                        BPE original  BPE modified  Compress  Gzip
Brown corpus            196.9         8.0           12.7      37.7
Medline                 1699.9        60.7          73.3      242.2
Genbank                 440.6         16.5          19.3      100.9

Compressed pattern matching on BPE compressed text
The problem of compressed pattern matching on BPE compressed text can be solved in O(||D|| + |S| + m^2 + r) time.
||D|| ≦ 256
• The dictionary D is encoded separately from the sequence S.
• The size of D is small enough.
• The variables of S are encoded using a fixed length code.

Experimental result
[Figure: two plots comparing KMP, Agrep, and our algorithm. Medline data, a clinically-oriented subset of Medline (compression ratio is 59%). Genbank data, a data set from GenBank (compression ratio is 32%).]
Ultra ...

Concluding Remarks: Conclusion and Future Works

Conclusion
• We introduced compressed pattern matching from practical viewpoints.
• We observed that the running time of our algorithm is reduced at the same rate as the compression ratio, compared with the uncompressed case.
• We also observed that it is occasionally faster than Agrep.

Future Works
• Can we reduce the complexity of the preprocessing? O(m^2) → O(m)
• To develop a sublinear algorithm on BPE compressed texts (see More recent work).
• To develop an approximate pattern matching algorithm on a collage system.
• To develop a new compression method which is suitable for compressed pattern matching.

More recent work
Does text compression speed up such a sublinear time algorithm?
A Boyer-Moore type algorithm for compressed pattern matching [CPM2000]
We proposed a Boyer-Moore (BM) type algorithm for pattern matching in BPE compressed texts.

More recent work
[Figure: two plots comparing KMP, Agrep, our algorithm, and the most recent work. Medline data (compression ratio is 59%). Genbank data (compression ratio is 32%).]
