Multiple Pattern Matching in LZW Compressed Text
Takuya KIDA, Masayuki TAKEDA, Ayumi SHINOHARA, Masamichi MIYAZAKI, Setsuo ARIKAWA
Department of Informatics, Kyushu University, Japan
Our Goal
Original Text → Pattern Matching Machine
Compressed Text → New Machine!
Previous studies
year   researcher                   compression method
1988   Eilam-Tsoreff and Vishkin    run-length
1992   Amir, Landau, and Vishkin    two-dimensional run-length
1992   Amir and Benson              two-dimensional run-length
1995   Farach and Thorup            LZ77
1996   Gasieniec, et al.            LZ77
1996   Amir, Benson and Farach      LZW
1997   Karpinski, et al.            straight-line programs
1997   Miyazaki, et al.             straight-line programs
Previous result vs Our result
• Amir, Benson, and Farach's algorithm (JCSS 1996), "Let sleeping files lie: Pattern matching in Z-compressed files"
  • deals with only a single pattern.
  • can find only the first occurrence of the pattern.
  • takes O(n + m²) time and space.
    n: length of the compressed text, m: length of the pattern.
• Our algorithm
  • deals with multiple patterns.
  • can find all occurrences of the patterns.
  • takes O(n + m² + r) time and O(n + m²) space.
    m: total length of the patterns, r: number of pattern occurrences.
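Lempel-Ziv-Welch compression
Σ = {a, b, c}
original text:   a b ab ab ba b c aba bc abab
compressed text: 1 2 4  4  5  2 3 6   9  11
Dictionary trie: D, with O(|D|) = O(n)
[Figure: dictionary trie with nodes 0 to 12. Node 0 is the root; 1 = a, 2 = b, 3 = c, 4 = ab, 5 = ba, 6 = aba, 7 = abb, 8 = bab, 9 = bc, 10 = ca, 11 = abab, 12 = bca.]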
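The example on this slide can be reproduced with a few lines of code. The following is a minimal sketch of textbook LZW, not the authors' implementation; the function names `lzw_compress` and `lzw_decompress` are illustrative, and the codes are assumed to start at 1 for the alphabet symbols so that they line up with the node numbers of the dictionary trie.

```python
def lzw_compress(text, alphabet):
    """Return the LZW code sequence for `text` (codes 1..|alphabet| are the symbols)."""
    dictionary = {symbol: code for code, symbol in enumerate(alphabet, start=1)}
    next_code = len(alphabet) + 1
    codes = []
    current = ""
    for ch in text:
        if current + ch in dictionary:
            # Keep extending the current match along the dictionary trie.
            current += ch
        else:
            # Emit the longest match and register it extended by one symbol.
            codes.append(dictionary[current])
            dictionary[current + ch] = next_code
            next_code += 1
            current = ch
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decompress(codes, alphabet):
    """Rebuild the text from the code sequence, growing the same dictionary."""
    if not codes:
        return ""
    strings = {code: symbol for code, symbol in enumerate(alphabet, start=1)}
    next_code = len(alphabet) + 1
    previous = strings[codes[0]]
    out = [previous]
    for code in codes[1:]:
        # A code not yet in the dictionary refers to the entry being created now.
        entry = strings[code] if code in strings else previous + previous[0]
        out.append(entry)
        strings[next_code] = previous + entry[0]
        next_code += 1
        previous = entry
    return "".join(out)
```

With the alphabet `"abc"`, compressing the slide's text `abababbabcababcabab` yields the ten codes shown on the slide, so the compressed length n counts codes, not characters.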
Basic Idea (Amir et al.)
KMP automaton, Pattern: abab
[Figure: KMP automaton with states -1, 0, 1, 2, 3, 4; goto edges a, b, a, b lead from state 0 to state 4, which has output {abab}. Legend: goto function, failure function, { } output.]
[Animation: the automaton scans the original text one character at a time and reports "found!" at each occurrence of abab.]
Basic Idea (Amir et al.)
KMP automaton, Pattern: abab
[Figure: the same automaton, with transitions drawn for whole dictionary strings instead of single characters.]
Next(0, bab) = 2
Basic Idea (Amir et al.)
Next(2, abc) = 0
Output(2, abc) = {〈2, abab〉}
Who is watching the occurrences of the pattern?!
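The two values on this slide can be checked with a small simulation. This is only a sketch that evaluates Next and Output by walking the automaton one character at a time, which is exactly the cost the compressed-matching approach wants to avoid. The names `build_kmp_dfa`, `next_state` and `output` are illustrative, and the automaton is built naively from the definition instead of from the failure function.

```python
def build_kmp_dfa(pattern, alphabet):
    """delta[q][c] = length of the longest prefix of `pattern`
    that is a suffix of pattern[:q] + c."""
    m = len(pattern)
    delta = [dict() for _ in range(m + 1)]
    for q in range(m + 1):
        for c in alphabet:
            s = pattern[:q] + c
            k = min(m, len(s))
            while k > 0 and not s.endswith(pattern[:k]):
                k -= 1
            delta[q][c] = k
    return delta


def next_state(delta, q, u):
    """Next(q, u): the state reached from q after reading the whole string u."""
    for c in u:
        q = delta[q][c]
    return q


def output(delta, pattern, q, u):
    """Output(q, u): pairs (i, pattern) such that an occurrence ends at u[i]."""
    found = []
    for i, c in enumerate(u, start=1):
        q = delta[q][c]
        if q == len(pattern):
            found.append((i, pattern))
    return found


delta = build_kmp_dfa("abab", "abc")
print(next_state(delta, 0, "bab"))      # state after reading bab from state 0
print(output(delta, "abab", 2, "abc"))  # occurrences seen while reading abc from state 2
```

The second call shows why Next alone is not enough: reading abc from state 2 ends in state 0, yet an occurrence of abab was completed on the way.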
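for Multiple Patterns
• Aho-Corasick Pattern Matching Machine
Patterns: Π = {aba, ababb, abca, bb}
[Figure: AC machine with states 0 to 9. Goto edges: 0 -a-> 1 -b-> 2 -a-> 3 -b-> 4 -b-> 5; 2 -c-> 6 -a-> 7; 0 -b-> 8 -b-> 9. Outputs: {aba} at state 3, {ababb, bb} at state 5, {abca} at state 7, {bb} at state 9. Legend: goto function, failure function, { } output.]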
Our Algorithm
Input.  Π: set of patterns; u1, u2, …, un: LZW compressed text.
Output. All occurrences of the patterns.

Construct from Π the AC machine and the generalized suffix trie;
Initialize the dictionary trie, Next and Output;
l := 0; state := q0;
for i := 1 to n do begin
    for each 〈d, π〉 ∈ Output(state, ui) do
        report "pattern π occurs at position l+d";
    state := Next(state, ui);
    l := l + |ui|;
    Update the dictionary trie, Next and Output
end.

Annotated costs: O(m²), O(n), O(n + r)
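The control flow of this loop can be sketched as follows. This is a deliberately naive illustration and not the algorithm with the stated bounds: here Next and Output are evaluated by feeding each dictionary string to the AC machine character by character, so the running time is proportional to the decompressed length. The names `build_ac`, `step` and `search_lzw` are illustrative, and the dictionary is kept as a plain map from code to string instead of a trie.

```python
from collections import deque


def build_ac(patterns):
    """Aho-Corasick machine: goto edges, failure links, output sets."""
    goto, out = [{}], [set()]
    for p in patterns:
        q = 0
        for c in p:
            if c not in goto[q]:
                goto.append({})
                out.append(set())
                goto[q][c] = len(goto) - 1
            q = goto[q][c]
        out[q].add(p)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        q = queue.popleft()
        for c, r in goto[q].items():
            queue.append(r)
            f = fail[q]
            while f and c not in goto[f]:
                f = fail[f]
            fail[r] = goto[f].get(c, 0)
            out[r] |= out[fail[r]]
    return goto, fail, out


def step(goto, fail, q, c):
    """One character transition of the AC machine."""
    while q and c not in goto[q]:
        q = fail[q]
    return goto[q].get(c, 0)


def search_lzw(codes, alphabet, patterns):
    """Report (end position, pattern) for every occurrence, reading LZW codes."""
    goto, fail, out = build_ac(patterns)
    strings = {code: s for code, s in enumerate(alphabet, start=1)}
    next_code = len(alphabet) + 1
    state, l, previous = 0, 0, None
    found = []
    for code in codes:
        u = strings[code] if code in strings else previous + previous[0]
        if previous is not None:
            # "Update the dictionary": register the previous string plus one symbol.
            strings[next_code] = previous + u[0]
            next_code += 1
        # Naive stand-in for Output(state, u) and Next(state, u).
        for d, c in enumerate(u, start=1):
            state = step(goto, fail, state, c)
            for p in sorted(out[state]):
                found.append((l + d, p))
        l += len(u)
        previous = u
    return found
```

Positions are 1-based end positions, matching the report "pattern π occurs at position l+d". Replacing the inner character loop by constant-time lookups of Next and Output is what the remaining slides are about.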
State Transition Function Next(q, u)
Next: Q × D → Q --- O(m × |D|) !!
  Q: states of AC machine
  D: strings represented by dictionary trie
  m: total length of patterns

Next(q, u) = N1(q, u)・u    if u ∈ Factor(Π),    --- O(m × m²)
             Next(0, u)     otherwise.            --- O(|D|)
State Transition Function Next(q, u): O(|D| + m³)
• Table of N1(q, u)・u --- O(m × m²)
Π = {aba, ababb, abca, bb}

         state
u        0  1  2  3  4  5  6  7  8  9
a        1  1  3  1  3  1  7  1  1  1
b        8  2  9  4  5  9  8  2  9  9
c        0  0  6  0  6  0  0  0  0  0
ab       2  2  4  2  4  2  2  2  2  2
ba       1  3  1  3  1  1  1  3  1  1
bb       9  9  9  5  9  9  9  9  9  9
bc       0  6  0  6  0  0  0  6  0  0
ca       1  1  7  1  7  1  1  1  1  1
aba      3  3  3  3  3  3  3  3  3  3
abb      9  9  5  9  5  9  9  9  9  9
abc      6  6  6  6  6  6  6  6  6  6
bab      2  4  2  4  2  2  2  4  2  2
bca      1  7  1  7  1  1  1  7  1  1
abab     4  4  4  4  4  4  4  4  4  4
abca     7  7  7  7  7  7  7  7  7  7
babb     9  5  9  5  9  9  9  5  9  9
ababb    5  5  5  5  5  5  5  5  5  5
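The entries of such a table can be recomputed from the definition of the AC machine. The sketch below assumes that N1(q, u) is the state reached from q after reading the factor u, that is, the longest suffix of the string of q followed by u that is a prefix of some pattern. It also assumes states are numbered in the order their prefixes first appear when the patterns are inserted one by one. The names `prefix_states`, `factors` and `n1_table` are illustrative.

```python
def prefix_states(patterns):
    """Number the pattern prefixes in insertion order (0 is the empty string)."""
    ids = {"": 0}
    for p in patterns:
        for k in range(1, len(p) + 1):
            ids.setdefault(p[:k], len(ids))
    return ids


def factors(patterns):
    """Factor(Π): all non-empty substrings of the patterns."""
    return {p[i:j] for p in patterns
            for i in range(len(p)) for j in range(i + 1, len(p) + 1)}


def n1_table(patterns):
    """Map each factor u to the list [N1(0, u), N1(1, u), ...]."""
    ids = prefix_states(patterns)
    by_id = sorted(ids, key=ids.get)  # state number -> string of that state
    table = {}
    for u in factors(patterns):
        row = []
        for q_str in by_id:
            s = q_str + u
            # Longest suffix of s that is a prefix of some pattern.
            k = next(k for k in range(len(s), -1, -1) if s[len(s) - k:] in ids)
            row.append(ids[s[len(s) - k:]])
        table[u] = row
    return table


table = n1_table(["aba", "ababb", "abca", "bb"])
for u in sorted(table, key=lambda s: (len(s), s)):
    print(f"{u:6}", *table[u])
```

The table has one row per factor and one column per state, so its size is |Factor(Π)| × |Q|, which is where the O(m × m²) bound comes from.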
Generalized Suffix Trie
Π = {aba, ababb, abca, bb}
[Figure: generalized suffix trie of Π. Legend: explicit node, nonexplicit node.]
The trie has O(m²) nodes, of which O(m) are explicit.
State Transition Function Next(q, u): O(|D| + m³) reduced to O(|D| + m²)
• Table of N1(q, u)・u --- O(m × m)
[Figure: the table of N1 for Π = {aba, ababb, abca, bb}, shown twice.]
Ancestor(q, k): the ancestor of node q at distance k in the trie of the AC machine.
û: one of the explicit descendants of node u in the generalized suffix trie.
Output Function
Output(q, u) = {〈i, π〉 | 1 ≦ i ≦ |u|, π ∈ Π, and π is a suffix of string q・u[1..i]}
O(m × |D|) !!!
[Figure: the strings q and u side by side, with a pattern π ending at position i of u.]
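Output Function
Let ũ be the longest prefix of u such that ũ is a suffix of some pattern.
[Figure: q followed by u, with occurrences π1, π2, π3. Occurrences that start inside q end within ũ; the others lie entirely inside u.]
• part dependent on q --- O(m²)
• part independent of q --- O(|D|)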
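The split into a part that depends on q and a part that does not can be illustrated directly from the definition of Output. The sketch below is an assumption-laden illustration, not the authors' data structure: states are represented by the strings they spell, everything is computed by brute force, and the names `output`, `u_tilde` and `split_output` are illustrative.

```python
def output(q_str, u, patterns):
    """Output(q, u) by definition; q is given as the string spelled by the state."""
    return {(i, p) for i in range(1, len(u) + 1) for p in patterns
            if (q_str + u[:i]).endswith(p)}


def u_tilde(u, patterns):
    """The longest prefix of u that is a suffix of some pattern."""
    for k in range(len(u), -1, -1):
        if any(p.endswith(u[:k]) for p in patterns):
            return u[:k]


def split_output(q_str, u, patterns):
    """Return (crossing, inside): occurrences that start in q, and those inside u."""
    inside = output("", u, patterns)                  # independent of q
    crossing = output(q_str, u, patterns) - inside    # dependent on q
    return crossing, inside


patterns = ["aba", "ababb", "abca", "bb"]
crossing, inside = split_output("aba", "bb", patterns)
print(sorted(crossing), sorted(inside), u_tilde("bb", patterns))
```

An occurrence that starts inside q and ends at position i of u forces u[1..i] to be a suffix of that pattern, so i can never exceed |ũ|. This is why the q-dependent part only needs to be tabulated for strings that are suffixes of patterns, while the q-independent part can be stored once per dictionary node.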
But... Is it really fast ? Uhmm....
Experiment
◆ Method 1: Compressed Text → Decompression → Original Text → AC Machine
◆ Method 2: Compressed Text → Decompression → AC Machine
◆ Method 3: Compressed Text → Our Algorithm (Without Decompression)
Experiment
Original Text: "The Brown corpus", 6.8 Mbytes
Compressed Text: 3.4 Mbytes, produced by compress (UNIX command)
Language: C++ (gcc without optimization)
Machine: Sun SPARCstation 20
Result of the Experiment
[Plot: CPU time (s), axis from 0 to 30, against occurrence rate (%), axis from 0 to 25, where occurrence rate = number of pattern occurrences / original text length. Curves: Method 1, Method 2, Method 3 (Our Algorithm).]
Conclusion
Previous Result
• deals with only a single pattern
• can find only the first occurrence of the pattern
• takes O(n + m²) time and space
• no practical evaluation
Our Result
• deals with multiple patterns
• can find all occurrences of the patterns
• takes O(n + m²) space and can answer in O(n + m² + r) time
• about twice as fast as decompression followed by the AC machine