
Speeding up pattern matching by text compression

Speeding up pattern matching by text compression. Yusuke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, Setsuo Arikawa. Department of Informatics, Kyushu University, Japan Department of AI, Kyushu Institute of Technology, Japan.





Presentation Transcript


  1. Speeding up pattern matching by text compression. Yusuke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, Setsuo Arikawa. Department of Informatics, Kyushu University, Japan; Department of AI, Kyushu Institute of Technology, Japan.

  2. Contents • Pattern matching on compressed text. • A unifying framework for compressed pattern matching (Collage System) • Byte pair encoding (BPE). • Pattern matching algorithm on BPE compressed text. • Experimental result. • Conclusion.

  3. Pattern Matching Problem. Classical pattern matching algorithms: Knuth-Morris-Pratt (1974), Aho-Corasick (1975), Boyer-Moore (1977), Shift-Or (1992). Pattern matching is one of the most fundamental operations in string processing. Recently, a new trend for accelerating pattern matching has emerged: speeding up pattern matching by text compression. Under the traditional criteria for data compression, i.e., compression ratio and compression/decompression time, adaptive dictionary methods such as the Lempel-Ziv family are often preferred. However, such methods cannot speed up pattern matching, since extra work is needed to keep track of the compression mechanism.

  4. Pattern Matching on Compressed Text. [Diagram: the original text on secondary disk storage is transferred to memory and searched; the compressed text on secondary disk storage is transferred to memory, expanded, and then searched.] Expanding the compressed text requires extra time and space.

  5. Pattern Matching on Compressed Text. [Diagram: the compressed text on secondary disk storage is transferred to memory and searched directly.] GOAL 1: To perform a faster search in compressed texts than a regular decompression followed by an ordinary search. GOAL 2: To perform a faster search in compressed texts than an ordinary search in the original texts, i.e., speeding up pattern matching by text compression.

  6. Previous Results (1)
     year  researcher                           compression
     1988  Eliam-Tsoreff and Vishkin            run-length
     1992  Amir, Landau, and Vishkin            two-dimensional run-length
     1992  Amir and Benson                      two-dimensional run-length
     1994  Amir, Benson, and Farach             two-dimensional run-length
     1994  Manber                               original compression scheme
     1995  Farach and Thorup                    LZ77
     1996  Gasieniec, et al.                    LZ77
     1996  Amir, Benson, and Farach             LZW
     1998  Kida, et al.                         LZW
     1997  Karpinski, Rytter, and Shinohara     straight-line programs
     1997  Miyazaki, Shinohara, and Takeda      straight-line programs
     1997  Takeda                               finite state encoding
     1998  Fukamachi, Shinohara, and Takeda     Huffman encoding
     1998  Shibata                              byte pair encoding

  7. Previous Results (2)
     year  researcher                                     compression
     1998  de Moura, Navarro, Ziviani, and Baeza-Yates    Word based encoding
     1999  Kida, Takeda, Shinohara, and Arikawa           LZW
     1999  Navarro and Raffinot                           LZ family
     2000  Shibata, et al.                                Byte pair encoding (today's talk)
     1999  Shibata, Takeda, Shinohara, and Arikawa        Antidictionary based
     Unifying framework:
     1999  Kida, et al.                                   Dictionary based methods (Collage system)

  8. A Unifying Framework for Compressed Pattern Matching. Previous: Compression A with PM Algorithm A, Compression B with PM Algorithm B, Compression C with PM Algorithm C. Kida et al. [1999]: Compression A, Compression B, and Compression C are all expressed as a collage system, and one pattern matching algorithm runs on the unifying framework.

  9. Collage System Definition and Several Examples

  10. Dictionary Based Compression. The original text is factorized into a series of phrases, and the phrases are encoded to give the compressed text together with a dictionary structure. Design choices: • How to choose the phrases. • How to design the data structure of the dictionary. • How to encode phrases.

  11. Collage System. A collage system is a pair 〈D, S〉.
      D: a sequence of assignments (dictionary structure): X1 := expr1; X2 := expr2; ...; Xn := exprn;
      S: a sequence of variables defined in D (compressed text): S = X_{i_1}, X_{i_2}, ..., X_{i_l} (each X_{i_j} ∈ D)
      ||D|| = n: number of assignments in D
      |S| = l: number of variables in S

  12. Collage System. D: a sequence of assignments (dictionary structure) X1 = expr1; X2 = expr2; ...; Xn = exprn; where each exprk is one of:
      a, for a ∈ Σ∪{ε} (primitive assignment)
      Xi・Xj, for i, j < k (concatenation)
      (Xi)^j, for i < k and integer j (j times repetition)
      [j]Xi, for i < k and integer j (prefix truncation)
      Xi[j], for i < k and integer j (suffix truncation)

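  13. Example of Collage System.
      D: X1 = a; X2 = b; X3 = X1・X2; X4 = X2・X1; X5 = (X3)^3; X6 = [3]X5; X7 = X6・X4;
      S: X3, X6, X4, X7
      Strings represented: X3 = ab; X4 = ba; X5 = ababab (3 times repetition); X6 = bab (prefix truncation); X7 = babba.
      Original text: abbabbababba
      height(X7) = 4, height(D) = 4
      [Figure: the derivation tree T(X7), with X7 at the root, X6 and X4 below it, X5 below X6, X3 below X5, and X1, X2 at the leaves.]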
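
The example on this slide can be checked mechanically. The following is a minimal sketch, not the authors' code: the tuple encoding of the assignments and the names `expand`, `D`, and `S` are assumptions made for illustration. It expands every variable in order and concatenates the variables listed in S.

```python
def expand(dictionary, sequence):
    """Expand a collage system <D, S> into the text it represents.

    dictionary: list of (name, expr) in definition order (hypothetical encoding).
    sequence:   list of variable names, i.e. the compressed text S.
    """
    value = {}
    for name, expr in dictionary:
        kind = expr[0]
        if kind == "prim":            # X = a, with a a character or ""
            value[name] = expr[1]
        elif kind == "concat":        # X = Xi . Xj
            value[name] = value[expr[1]] + value[expr[2]]
        elif kind == "repeat":        # X = (Xi)^j
            value[name] = value[expr[1]] * expr[2]
        elif kind == "prefix_trunc":  # X = [j]Xi : drop the first j characters
            value[name] = value[expr[1]][expr[2]:]
        elif kind == "suffix_trunc":  # X = Xi[j] : drop the last j characters
            s = value[expr[1]]
            value[name] = s[:len(s) - expr[2]]
        else:
            raise ValueError(f"unknown expression kind: {kind}")
    return "".join(value[x] for x in sequence), value


# The collage system from the slide.
D = [
    ("X1", ("prim", "a")),
    ("X2", ("prim", "b")),
    ("X3", ("concat", "X1", "X2")),
    ("X4", ("concat", "X2", "X1")),
    ("X5", ("repeat", "X3", 3)),
    ("X6", ("prefix_trunc", "X5", 3)),
    ("X7", ("concat", "X6", "X4")),
]
S = ["X3", "X6", "X4", "X7"]

text, value = expand(D, S)
print(text)  # abbabbababba
```

Full expansion is only useful as a reference for checking; the point of compressed pattern matching is to avoid it.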

  14. Pattern Matching Algorithm on a Collage System

  15. Compressed pattern matching on a collage system. Theorem [Kida et al. 1999]: The problem of compressed pattern matching can be solved in O((||D|| + |S|)・height(D) + m^2 + r) time using O(||D|| + m^2) space. If D contains no truncation, it can be solved in O(||D|| + |S| + m^2 + r) time. Here ||D|| is the number of assignments in D, |S| is the number of variables in S, m is the pattern length, and r is the number of pattern occurrences.

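  16. Basic Idea. Pattern π = ababb; original text: abababba; S: X_{i_1}, X_{i_2}, X_{i_3}, X_{i_4}. [Figure: the KMP automaton for π with states 0 to 5, showing the goto function and the failure function. On the original text the automaton passes through states 0, 1, 2, 3, 4, 3, 4, 5, 1; on the compressed text the same run is simulated one variable of S at a time.]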

  17. Jump and Output. The function Jump(j, u) = δ_KMP(j, u): • It simulates the sequence of state transitions for u. • The domain is Q×D. • Reply in O(1) time. The set Output(j, u) = { 1 ≤ i ≤ |u| : P is a suffix of P[1:j]・u[1:i] }: • This set contains the pattern occurrences. • Reply in O(l) time.
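
To make the two definitions concrete, here is a naive sketch that computes Jump and Output by running the KMP automaton over an explicitly given string u. It is an illustration only: the helper names are assumptions, and the split of the text into phrases is hypothetical. It takes O(|u|) time per call, whereas the slide's O(1) and O(l) bounds come from preprocessing the dictionary.

```python
def failure_function(p):
    """fail[q] = length of the longest proper border of p[:q]."""
    fail = [0] * (len(p) + 1)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k]
        if p[i] == p[k]:
            k += 1
        fail[i + 1] = k
    return fail


def step(p, fail, q, c):
    """One KMP transition from state q on character c (p must be non-empty)."""
    while q > 0 and (q == len(p) or p[q] != c):
        q = fail[q]
    if q < len(p) and p[q] == c:
        q += 1
    return q


def jump_and_output(p, fail, q, u):
    """Return (Jump(q, u), Output(q, u)) by direct simulation on u."""
    out = []
    for i, c in enumerate(u, start=1):
        q = step(p, fail, q, c)
        if q == len(p):   # P is a suffix of P[1:q] . u[1:i]
            out.append(i)
    return q, out


pattern = "ababb"
fail = failure_function(pattern)

# Hypothetical split of the text abababba into four phrases.
state, offset, occurrences = 0, 0, []
for phrase in ["ab", "ab", "abb", "a"]:
    new_state, out = jump_and_output(pattern, fail, state, phrase)
    occurrences.extend(offset + d for d in out)  # end positions, 1-based
    state, offset = new_state, offset + len(phrase)

print(state, occurrences)  # 1 [7]
```

Processing the text phrase by phrase gives the same final state and the same occurrence as processing it character by character, which is the property the algorithm relies on.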

  18. Realization of Jump and Output. For Jump(q, Xk): • If Xk is a primitive assignment a: O(1) time. • If Xk is Xi・Xj: if the factor concatenation problem for a string of length m can be solved in O(1) time, Jump can be computed in O(1) time. For Output(q, Xk): • If Xk is a primitive assignment a: O(1) time. • If Xk is Xi・Xj: the set can be enumerated in O(l) time from the Output of Xi and Xj, where l is the size of the set Output.

  19. Factor Concatenation Problem. Instance: Two factors x and y of a string P, each represented as a node of the suffix trie of P. Question: Is the string xy a factor of P? If 'yes', then return its node number. Example: P = COPACABANA, x = OPA, y = CABAN. Concatenating gives OPACABAN, so the answer is 'Yes': xy = P[2:9].
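
The example can be reproduced with a naive stand-in. This sketch is an assumption-laden simplification: factors are passed as plain strings rather than suffix trie nodes, the answer is a 1-based position range instead of a node number, and the search takes O(m) time or more, not O(1). The name `concat_factor` is hypothetical.

```python
def concat_factor(p, x, y):
    """If x + y is a factor of p, return its first occurrence as a
    1-based inclusive (start, end) pair; otherwise return None."""
    pos = p.find(x + y)
    if pos < 0:
        return None
    return pos + 1, pos + len(x) + len(y)


print(concat_factor("COPACABANA", "OPA", "CABAN"))  # (2, 9)
print(concat_factor("COPACABANA", "ANA", "C"))      # None
```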

  20. Solution to the problem • Using a suffix trie, it can be solved in O(m) time after preprocessing of O(m^2) time and space. • Using a two-dimensional lookup table, it can be solved in O(1) time, but we need O(m^4) time and space preprocessing. It can be solved in O(1) time after O(m^2) space and time preprocessing.

  21. Outline of Our Algorithm
      Input. Pattern P and collage system 〈D, S〉 (S := X_{i_1}, X_{i_2}, ..., X_{i_n}).
      Output. All occurrences of the pattern.
      /* preprocessing of D and P */
      preprocess(D); preprocess(P);
      l := 0; q := 0;
      for j := 1 to n do begin
          for each d ∈ Output(q, X_{i_j}) do
              report 'pattern occurs at position l+d';
          q := Jump(q, X_{i_j});   /* state transition */
          l := l + |X_{i_j}|;      /* calculation of the offset */
      end
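
The outline above can be sketched in runnable form for the simplest case, a dictionary that uses only single-character primitives and concatenation (as BPE does). This is a simplification and not the algorithm of the theorem: it tabulates Jump and Output for every (state, variable) pair using Jump(q, Xi・Xj) = Jump(Jump(q, Xi), Xj), which costs at least O(m・||D||) preprocessing instead of the O(||D|| + m^2) obtained via factor concatenation. The dictionary encoding and all function names are assumptions.

```python
def failure_function(p):
    """fail[q] = length of the longest proper border of p[:q]."""
    fail = [0] * (len(p) + 1)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k]
        if p[i] == p[k]:
            k += 1
        fail[i + 1] = k
    return fail


def step(p, fail, q, c):
    """One KMP transition from state q on character c (p must be non-empty)."""
    while q > 0 and (q == len(p) or p[q] != c):
        q = fail[q]
    if q < len(p) and p[q] == c:
        q += 1
    return q


def preprocess(p, dictionary):
    """Tabulate |X|, Jump(q, X) and Output(q, X) for every variable X.

    dictionary: list of (name, expr); expr is a single character
    (primitive) or a pair (Xi, Xj) of earlier names (concatenation).
    """
    m, fail = len(p), failure_function(p)
    length, jump, output = {}, {}, {}
    for name, expr in dictionary:
        if isinstance(expr, str):
            length[name] = 1
            jump[name] = [step(p, fail, q, expr) for q in range(m + 1)]
            output[name] = [[1] if jump[name][q] == m else []
                            for q in range(m + 1)]
        else:
            xi, xj = expr
            length[name] = length[xi] + length[xj]
            jump[name] = [jump[xj][jump[xi][q]] for q in range(m + 1)]
            # Occurrences ending inside Xi, then those ending inside Xj.
            output[name] = [
                output[xi][q]
                + [length[xi] + d for d in output[xj][jump[xi][q]]]
                for q in range(m + 1)
            ]
    return length, jump, output


def search(p, dictionary, sequence):
    """Return the 1-based end positions of all occurrences of p."""
    length, jump, output = preprocess(p, dictionary)
    q, offset, ends = 0, 0, []
    for x in sequence:
        ends.extend(offset + d for d in output[x][q])
        q = jump[x][q]        # state transition
        offset += length[x]   # calculation of the offset
    return ends


# Collage system for the text ABABCDEBDEFABDEABC (BPE example in this talk).
D = [("X1", "A"), ("X2", "B"), ("X3", "C"), ("X4", "D"), ("X5", "E"),
     ("X6", "F"), ("X7", ("X1", "X2")), ("X8", ("X4", "X5")),
     ("X9", ("X7", "X3"))]
S = ["X7", "X9", "X8", "X2", "X8", "X6", "X7", "X8", "X9"]

print(search("ABC", D, S))  # [5, 18]
print(search("DE", D, S))   # [7, 10, 15]
```

The main loop touches each variable of S once and never expands it, so the scan itself is proportional to |S| plus the number of occurrences.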

  22. Compressed pattern matching on a collage system. • With truncation (LZ77, LZSS, etc.): O((||D|| + |S|)・height(D) + m^2 + r) time; not suitable for speeding up pattern matching. • With no truncation (LZ78, LZW, BPE, Run-length, etc.): O(||D|| + |S| + m^2 + r) time.

  23. Byte Pair Encoding original encoding algorithm and modified algorithm

  24. Byte Pair Encoding.
      Text: T = ABABCDEBDEFABDEABC
      Used characters: A, B, C, D, E, F
      AB→G: GGCDEBDEFGDEGC
      DE→H: GGCHBHFGHGC
      GC→I: GIHBHFGHI
      Pair table (code: pair): G: AB; H: DE; I: GC

  25. Byte Pair Encoding as a "collage system".
      Text: T = ABABCDEBDEFABDEABC; AB→G gives GGCDEBDEFGDEGC; DE→H gives GGCHBHFGHGC; GC→I gives GIHBHFGHI.
      D: X1 = A; X2 = B; X3 = C; X4 = D; X5 = E; X6 = F; X7 = X1・X2; X8 = X4・X5; X9 = X7・X3;
      S: X7, X9, X8, X2, X8, X6, X7, X8, X9
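
The replacement steps in this example can be reproduced with a short sketch of byte pair encoding. It is a simplified illustration under stated assumptions: it works on Python strings rather than bytes, the unused symbols are supplied by the caller, pair counts include overlapping occurrences, and each round rescans the whole text (the naive variant, not the accelerated one discussed in this talk). The function names are hypothetical.

```python
from collections import Counter


def bpe_compress(text, free_symbols):
    """Repeatedly replace the most frequent adjacent pair by an unused symbol.

    Returns the compressed text and the list of (pair, symbol) rules.
    """
    rules = []
    for symbol in free_symbols:
        pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:        # replacing a pair seen once saves nothing
            break
        text = text.replace(pair, symbol)
        rules.append((pair, symbol))
    return text, rules


def bpe_expand(text, rules):
    """Undo the replacements in reverse order."""
    for pair, symbol in reversed(rules):
        text = text.replace(symbol, pair)
    return text


compressed, rules = bpe_compress("ABABCDEBDEFABDEABC", "GHI")
print(compressed)  # GIHBHFGHI
print(rules)       # [('AB', 'G'), ('DE', 'H'), ('GC', 'I')]
```

Each rule corresponds to one concatenation assignment in D (G is X7, H is X8, I is X9), and the compressed string read symbol by symbol is the sequence S.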

  26. Speeding up of compression. Time complexity of BPE: O(uN), where u is the number of character codes and N is the text length. Using a doubly-linked list: O(u + N) time.

  27. Speed-up of compression. We apply the BPE algorithm to the first block. D: X1 = A; X2 = C; X3 = X2・X1; ...; X255 = X247・X8; X256 = X125・X48. [Diagram: a pattern matching machine for multiple replacement [Arikawa et al. 1984] converts the original text into the BPE compressed text.]

  28. Comparison of Compression Ratio and Time
      Compression ratio (%)     BPE (original)  BPE (modified)  Compress  Gzip
      Brown corpus (6.8Mb)      51.0            59.0            43.7      39.0
      Medline (60.3Mb)          56.2            59.0            42.3      33.3
      Genbank (17.1Mb)          30.8            32.5            26.8      23.1
      The compression ratios of BPE are worse than those of "Compress" and "Gzip".
      Compression time (sec)    BPE (original)  BPE (modified)  Compress  Gzip
      Brown corpus              196.9           8.0             12.7      37.7
      Medline                   1699.9          60.7            73.3      242.2
      Genbank                   440.6           16.5            19.3      100.9
      BPE compression is drastically accelerated by our modification.

  29. Compressed pattern matching on BPE compressed text. The problem of compressed pattern matching on BPE compressed text can be solved in O(||D|| + |S| + m^2 + r) time, with ||D|| ≤ 256. • The dictionary D is encoded separately from the sequence S. • The size of D is small enough. • The variables of S are encoded using a fixed length code.

  30. Experimental result. [Charts comparing KMP, Agrep, and our algorithm on Medline data (a clinically-oriented subset of Medline; compression ratio is 59%) and on Genbank data (a data set from GenBank; compression ratio is 32%).] Ultra ...

  31. Concluding Remarks Conclusion and Future Works

  32. Conclusion • We introduced compressed pattern matching from practical viewpoints. • We observed that the search time of our algorithm is reduced at about the same rate as the compression ratio, compared with the uncompressed case. • We also observed that it is occasionally faster than Agrep.

  33. Future Works • Can we reduce the complexity of the preprocessing from O(m^2) to O(m)? • To develop a sublinear algorithm on BPE compressed texts (see more recent work). • To develop an approximate pattern matching algorithm on a collage system. • To develop a new compression which is suitable for compressed pattern matching.

  34. More recent work. Does text compression speed up such a sublinear time algorithm? A Boyer-Moore type algorithm for compressed pattern matching [CPM2000]: we proposed a Boyer-Moore (BM) type algorithm for pattern matching in BPE compressed texts.

  35. More recent work. [Charts comparing KMP, Agrep, our algorithm, and the most recent work on Medline data (compression ratio is 59%) and on Genbank data (compression ratio is 32%).]
