
Faster Approximate String Matching over Compressed Text

By Gonzalo Navarro*, Takuya Kida†, Masayuki Takeda†, Ayumi Shinohara†, and Setsuo Arikawa†. * Dept. of Computer Science, University of Chile. † Dept. of Informatics, Kyushu University.


Presentation Transcript


  1. Faster Approximate String Matching over Compressed Text • By Gonzalo Navarro*, Takuya Kida†, Masayuki Takeda†, Ayumi Shinohara†, and Setsuo Arikawa† • * Dept. of Computer Science, University of Chile • † Dept. of Informatics, Kyushu University

  2. Contents • Introduction • Motivation • Related works and our goal • Our search approach on LZ78/LZW • Basic idea – Filtration technique • Multiple pattern matching algorithms on compressed text • Experimental results • Conclusion

  3. Motivation • Compressed pattern matching • Let sleeping files lie. • Reduce space, reduce searching time. • Diagram: secondary disk storage → (file transfer) → memory → (decompress) → memory → (search)

  4. Motivation • Compressed pattern matching • Let sleeping files lie. • Reduce space, reduce searching time. • Diagram: secondary disk storage → (file transfer) → memory → (search directly)

  5. Related Works (1) (year: researcher (compression)) • 1988: Eilam-Tzoreff and Vishkin (run-length) • 1992: Amir, Landau, and Vishkin (two-dimensional run-length) • 1992: Amir and Benson (two-dimensional run-length) • 1994: Amir, Benson, and Farach (two-dimensional run-length) • 1994: Manber (original compression scheme) • 1995: Farach and Thorup (LZ77) • 1996: Gąsieniec, et al. (LZ77) • 1996: Amir, Benson, and Farach (LZW) • 1997: Karpinski, Rytter, and Shinohara (straight-line programs) • 1997: Miyazaki, Shinohara, and Takeda (straight-line programs) • 1997: Takeda (finite state encoding) • 1998: Miyazaki, et al. (Huffman encoding) • 1998: Kida, et al. (LZ78/LZW) • 1998: Shibata, et al. (byte pair encoding) • 1998: Moura, et al. (word-based encoding)

  6. Related Works (2) (year: researcher (compression)) • 1999: Kida, et al. (LZW) • 1999: Gąsieniec and Rytter (LZ78/LZW) • 1999: Shibata, et al. (antidictionary based) • 1999: Kida, et al. (dictionary based methods (collage system)) • 1999: Navarro and Raffinot (LZ family, hybrid LZ) • 2000: Kärkkäinen, Navarro, and Ukkonen (LZ family) • 2000: Navarro and Tarhio (LZ family) • 2000: Klein and Shapira (LZSS variant) • 2000: Shibata, et al. (collage systems) • 2000: Matsumoto, et al. (simple collage systems) • 2001: Klein and Shapira (Huffman encoding)

  7. Approximate String Matching • Edit distance ed(P, P’) • Insertions, deletions and replacements • Report all occurrences of any string P’ s.t. ed(P, P’) ≤ k for a given pattern P. • Survey paper: G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 2000. • Example (k = 2): Pattern: TAAATCACGGCATACT • Text: ACCCTGTTTAGATCACGGCACTACTGTAAAC
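
The problem statement on this slide can be illustrated with a minimal sketch in Python. It uses the classical column-wise dynamic programming over plain (uncompressed) text, not the compressed-text algorithms of the talk. The function name `match_ends` and the convention of reporting 1-based end positions are assumptions made for this example.

```python
def match_ends(pattern, text, k):
    """Return 1-based end positions j such that some substring of
    text ending at j is within edit distance k of pattern."""
    m = len(pattern)
    # col[i] = min edit distance between pattern[:i] and any
    # substring of the text ending at the current position.
    col = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        diag = col[0]          # value of D[i-1][j-1]; col[0] stays 0
        for i in range(1, m + 1):
            old = col[i]       # D[i][j-1]
            col[i] = min(
                diag + (pattern[i - 1] != ch),  # match or replacement
                old + 1,                        # extra text character
                col[i - 1] + 1,                 # missing text character
            )
            diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


ends = match_ends("TAAATCACGGCATACT",
                  "ACCCTGTTTAGATCACGGCACTACTGTAAAC", 2)
print(ends)
```

In the slide's example, the text substring TAGATCACGGCACTACT (ending at position 25) differs from the pattern by one replacement and one insertion, so position 25 is among the reported ends.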

  8. Previous Results • J. Kärkkäinen, G. Navarro, and E. Ukkonen. Approximate string matching over Ziv-Lempel compressed text. In Proc. CPM2000. • Dynamic programming technique • O(mkn + R) worst case, O(k²n + R) average case • T. Matsumoto, T. Kida, M. Takeda, A. Shinohara, and S. Arikawa. Bit-parallel approach to approximate string matching in compressed texts. In Proc. SPIRE2000. • Bit-parallel technique • O(mk³n/w) worst case

  9. Our Search Approach on LZ78/LZW • Introduction • Motivation • Related works and our goal • Our search approach on LZ78/LZW • Basic idea • Multiple pattern matching algorithms on compressed text • Experimental results • Conclusion

  10. Basic Idea • Filtration technique (Wu and Manber, 1992) • Split the pattern into k+1 pieces of (nearly) equal length; an occurrence with at most k errors must contain at least one piece unchanged • Find pattern pieces – multiple pattern matching • Direct verification of candidate text areas (we have chosen Myers’ algorithm) • Example (k = 2): Pattern: TAAATCACGGCATACT • Pattern pieces: TAAAT, CACGG, CATACT • Text: ACCCTGTTTAGATCACGGCACTACTGTAAAC
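
The filtration steps on this slide can be sketched in Python as follows. This is only an illustration over plain text under stated assumptions: pieces are located with `str.find` rather than a multiple pattern matcher over LZ78/LZW, and candidate areas are verified with a simple dynamic programming routine instead of Myers' algorithm. All function names are hypothetical, and the pattern is assumed to have at least k+1 characters.

```python
def split_pattern(pattern, k):
    """Split into k+1 pieces; the last piece absorbs the remainder.
    Returns (offset_in_pattern, piece) pairs."""
    size = len(pattern) // (k + 1)
    pieces = []
    for i in range(k + 1):
        start = i * size
        end = (i + 1) * size if i < k else len(pattern)
        pieces.append((start, pattern[start:end]))
    return pieces


def verify(pattern, area, k):
    """Plain DP check (stand-in for Myers' algorithm): 1-based end
    positions in area where pattern matches with at most k errors."""
    m = len(pattern)
    col = list(range(m + 1))
    ends = []
    for j, ch in enumerate(area, start=1):
        diag = col[0]
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(diag + (pattern[i - 1] != ch),
                         old + 1, col[i - 1] + 1)
            diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


def filter_search(pattern, text, k):
    """Find exact piece occurrences, then verify the text area
    around each one. Returns sorted 1-based end positions."""
    m = len(pattern)
    found = set()
    for offset, piece in split_pattern(pattern, k):
        pos = text.find(piece)
        while pos != -1:
            # An occurrence containing this piece cannot start more
            # than k characters before pos - offset, nor end more
            # than k characters after pos - offset + m.
            lo = max(0, pos - offset - k)
            hi = min(len(text), pos - offset + m + k)
            for e in verify(pattern, text[lo:hi], k):
                found.add(lo + e)
            pos = text.find(piece, pos + 1)
    return sorted(found)


pattern = "TAAATCACGGCATACT"
text = "ACCCTGTTTAGATCACGGCACTACTGTAAAC"
print([piece for _, piece in split_pattern(pattern, 2)])
print(filter_search(pattern, text, 2))
```

In this example only the piece CACGG occurs in the text, so a single candidate area is verified. Giving the remainder to the last piece is one possible choice that reproduces the pieces shown on the slide.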

  11. Why LZ78/LZW? • We have already developed a multiple pattern matching algorithm on LZW. • Easy to decompress locally.

  12. Multiple Pattern Matching Algorithms on Compressed Text • Aho-Corasick technique • Boyer-Moore technique • Bit parallel technique

  13. Aho-Corasick Technique • T. Kida, M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa, Multiple pattern matching in LZW compressed text. In Proc. DCC’98. • Simulate the AC machine • Running over LZW directly • O(m² + n + R) time, O(m² + n) space

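  14. Aho-Corasick Technique (example) • Patterns: TTAA, AA • AC machine (goto and failure functions): states 0 to 4 spell TTAA, with state 4 reporting TTAA and AA; states 5 and 6 spell AA, with state 6 reporting AA; state 0 loops on any character other than T or A • Compressed text: ・・・・・ b1 b2 b3 b4 b5 b6 b7 ・・・・・ • Original text: CTTAATTAAGCCCCCTGCTAAGCT • State transition: 0 1 3 0 0 5 0 1 • Pattern occurrences are reported along the way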
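
The AC machine for the two patterns on this slide can be built and run with a short Python sketch. Note the assumption: it scans the original (uncompressed) text character by character, whereas the algorithm in the talk simulates these transitions block by block over LZW codes. The names `build_ac` and `ac_search` are hypothetical.

```python
from collections import deque


def build_ac(patterns):
    """Build goto, failure and output tables of an Aho-Corasick machine."""
    goto = [{}]   # goto[state][char] -> next state
    out = [[]]    # out[state] -> patterns recognized at that state
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)

    fail = [0] * len(goto)
    queue = deque(goto[0].values())   # depth-1 states fail to state 0
    while queue:
        r = queue.popleft()
        for ch, s in goto[r].items():
            queue.append(s)
            f = fail[r]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[s] = goto[f].get(ch, 0)
            out[s] = out[s] + out[fail[s]]
    return goto, fail, out


def ac_search(patterns, text):
    """Return (1-based end position, pattern) for every occurrence."""
    goto, fail, out = build_ac(patterns)
    s = 0
    hits = []
    for i, ch in enumerate(text, start=1):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits.append((i, p))
    return hits


print(ac_search(["TTAA", "AA"], "CTTAATTAAGCCCCCTGCTAAGCT"))
# [(5, 'TTAA'), (5, 'AA'), (9, 'TTAA'), (9, 'AA'), (21, 'AA')]
```

Inserting TTAA before AA yields the same state numbering as on the slide (states 1 to 4 for TTAA, 5 and 6 for AA).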

  15. Boyer-Moore Technique • G. Navarro and J. Tarhio, Boyer-Moore string matching over Ziv-Lempel compressed text. In Proc. CPM2000. • Y. Shibata, T. Matsumoto, M. Takeda, A. Shinohara, and S. Arikawa, A Boyer-Moore type algorithm for compressed pattern matching. In Proc. CPM2000. • T. Kida et al., Multiple pattern matching algorithms on collage system. In Proc. CPM2001, to appear.

  16. Boyer-Moore Technique (example) • Compressed text: ・・・・・ b1 b2 b3 b4 b5 b6 b7 ・・・・・ • Original text: CTTAATTAAGCCCCCTGCTAAGCT • Find all pattern occurrences that end in the focused block. • Calculate the maximum safe shift. • Move the focus according to that shift.

  17. Bit Parallel Technique • G. Navarro and M. Raffinot, A general practical approach to pattern matching over Ziv-Lempel compressed text. In Proc. CPM’99. • T. Kida, M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa, Shift-And approach to pattern matching in LZW compressed text. In Proc. CPM’99.

  18. Bit Parallel Technique (example) • Pattern: TTAA • Compressed text: ・・・・・ bi-1 bi bi+1 ・・・・・ • Focused phrase (block bi): AAGTTAACTTAAGCCGTT • Bit vectors: (i) pattern suffixes := 110000000000000000 • (ii) occurrences inside block bi := 000000100001000000 • (iii) pattern prefixes := 000000000000000011
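
The three bit vectors on this slide can be reproduced with a small brute-force Python sketch. It only shows what the vectors encode and does not implement the bit-parallel update itself. The interpretation is an assumption: bit j of (i) is set when the first j characters of the phrase form a proper suffix of the pattern, bit j of (ii) when an occurrence ends at position j, and bit j of (iii) when the phrase from position j to its end forms a proper prefix of the pattern. The function name is hypothetical.

```python
def phrase_bit_vectors(pattern, phrase):
    """Return the (suffixes, inside, prefixes) vectors as 0/1 strings,
    one character per phrase position, leftmost position first."""
    m, n = len(pattern), len(phrase)
    suffixes = ["0"] * n
    inside = ["0"] * n
    prefixes = ["0"] * n

    # Proper pattern suffixes at the start of the phrase, and
    # proper pattern prefixes at the end of the phrase.
    for j in range(1, min(m - 1, n) + 1):
        if pattern.endswith(phrase[:j]):
            suffixes[j - 1] = "1"
        if pattern.startswith(phrase[n - j:]):
            prefixes[n - j] = "1"

    # Complete occurrences inside the phrase, marked at their end.
    for end in range(m, n + 1):
        if phrase[end - m:end] == pattern:
            inside[end - 1] = "1"

    return "".join(suffixes), "".join(inside), "".join(prefixes)


for vec in phrase_bit_vectors("TTAA", "AAGTTAACTTAAGCCGTT"):
    print(vec)
# 110000000000000000
# 000000100001000000
# 000000000000000011
```

The suffix vector records where an occurrence started in earlier blocks can finish inside this phrase, and the prefix vector where an occurrence can start here and continue into later blocks.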

  19. Experimental Results • Introduction • Motivation • Related works and our goal • Our search approach on LZ78/LZW • Basic idea • Multiple pattern matching algorithms on compressed text • Experimental results • Conclusion

  20. Experimental Results • Intel Pentium III at 550 MHz with 64 MB of RAM, running Linux • 10 MB of Wall Street Journal (WSJ) articles and 10 MB of DNA data • WSJ was compressed to 42.59% of its original size and DNA to 27.71%

  21. Experimental Results

  22. Experimental Results

  23. Conclusion • We applied the filtration technique to compressed texts. • We implemented two new multiple pattern matching algorithms on compressed text. • Boyer-Moore type and Bit-parallel type. • We showed that this is a practical solution for approximate pattern matching on compressed text. • 10-30 times faster than previous solutions. • Up to 3 times faster than decompressing plus searching.
