
Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Laboratory of Information Knowledge Network, Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University. Takuya KIDA. The 3rd: Suffix type algorithms.





  1. Lecture on Information Knowledge Network "Information retrieval and pattern matching". Laboratory of Information Knowledge Network, Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University. Takuya KIDA. Lecture on Information knowledge network

  2. The 3rd: Suffix type algorithms. Boyer-Moore algorithm, Galil algorithm, Horspool algorithm, Sunday algorithm.

  3. Knuth-Morris-Pratt algorithm (review). D. E. Knuth, J. H. Morris, Jr., and V. R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323-350, 1977.
  Example: for text T = ababbababcbaababc and pattern P = ababc, the pattern occurs at position 6 of T.
  The next position of comparison can be obtained from the function next (the amount of shifting P is q - next[q]); the comparison restarts from the next text character when the value of next is 0. In the example above, next[5] = 3 and next[3] = 0. The number of comparisons at each text position is O(1).
  • KMP-String-Matching (T, P)
  • n ← length[T].
  • m ← length[P].
  • q ← 1.
  • next ← ComputeNext(P).
  • for i ← 1 to n do
  •   while q > 0 and P[q] ≠ T[i] do q ← next[q];
  •   if q = m then report an occurrence at i - m;
  •   q ← q + 1.
  (Here next must also be defined at q = m + 1 so that the search continues correctly after a complete match.)
  Even in the worst case, it takes only O(n + m) time (when next is preprocessed).
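The routine above can be sketched in Python as follows. This is a sketch, not the lecture's exact code: it uses 0-based indexing and the plain border-based failure function (the slide's next is the optimized variant that also skips equal characters); function names are ours.

```python
def compute_next(p):
    # next_[q] = length of the longest proper border (prefix that is also a
    # suffix) of p[:q]; next_[0] = -1 is a sentinel meaning "restart".
    m = len(p)
    next_ = [0] * (m + 1)
    next_[0] = -1
    k = -1
    for q in range(1, m + 1):
        while k >= 0 and p[k] != p[q - 1]:
            k = next_[k]
        k += 1
        next_[q] = k
    return next_

def kmp_search(text, pattern):
    # Scan the text once; total work is O(n + m) even in the worst case.
    next_ = compute_next(pattern)
    occurrences, q = [], 0
    for i, c in enumerate(text):
        while q >= 0 and pattern[q] != c:
            q = next_[q]
        q += 1
        if q == len(pattern):
            occurrences.append(i - len(pattern) + 1)  # 0-based start
            q = next_[q]
    return occurrences
```

On the slide's example, kmp_search("ababbababcbaababc", "ababc") reports the 0-based starts [5, 12], i.e. positions 6 and 13 in the slide's 1-based counting.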

  4. Shift-And algorithm (review). R. A. Baeza-Yates and G. H. Gonnet. A new approach to text searching. In Proceedings of the 12th International Conference on Research and Development in Information Retrieval, pages 168-175. ACM Press, 1989.
  Example: pattern P = ababb, text T = abababba. (Figure: the mask table M records, for each character, the bit positions at which it occurs in P.)
  The state vector is updated as Ri = ((Ri-1 << 1) | 1) & M(T[i]), which can be calculated in O(1) time. ※Taking AND with the mask bits M keeps only the bits whose extensions are consistent with the character just read.
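The update rule Ri = ((Ri-1 << 1) | 1) & M(T[i]) translates almost directly into Python, where unbounded integers serve as the bit vectors (function and variable names are ours):

```python
def shift_and(text, pattern):
    # Mask table M: bit j of mask[c] is set iff pattern[j] == c.
    mask = {}
    for j, c in enumerate(pattern):
        mask[c] = mask.get(c, 0) | (1 << j)
    m = len(pattern)
    accept = 1 << (m - 1)              # bit of the final NFA state
    occurrences, r = [], 0
    for i, c in enumerate(text):
        # R_i = ((R_{i-1} << 1) | 1) & M(T[i])
        r = ((r << 1) | 1) & mask.get(c, 0)
        if r & accept:
            occurrences.append(i - m + 1)  # 0-based start of the match
    return occurrences
```

On the slide's example, shift_and("abababba", "ababb") finds the single occurrence starting at 0-based position 2.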

  5. General form of efficient matching algorithms
  • MatchingAlgorithm (P, T)
  • m ← length[P].
  • n ← length[T].
  • i ← 1.
  • while i ≦ n - m + 1 do
  •   decide if i is an occurrence;
  •   if i is an occurrence then report the occurrence at i;
  •   decide the amount Δ by which the pattern can be shifted safely;
  •   i ← i + Δ.
  A lot of efficient algorithms, including KMP and BM, fit into this frame. ※Masayuki Takeda, "High-speed pattern matching algorithms for full text processing," Informatics Symposium, January 1991 (written in Japanese).
  • Important things for speeding up the algorithm:
  • How much work can we save at the 5th line (deciding whether i is an occurrence)?
  • How large can we make the shift Δ at the 7th line?

  6. Boyer-Moore algorithm. R. S. Boyer and J. S. Moore. A fast string searching algorithm. Communications of the ACM, 20(10):762-772, 1977.
  Features:
  • Characters of the pattern are compared from the right to the left.
  • The values of two functions (delta1 and delta2) are compared, and then the pattern is shifted by the larger.
  • Although the time complexity of the BM algorithm is O(mn) in the worst case, it becomes O(n/m) on average (sublinear!!).
  delta1(char) := the jump width by which we shift the pattern so that the rightmost occurrence of char in P is aligned to the current text position (if the pattern doesn't include char, it is equal to the pattern length). (bad-character heuristic)
  Example: for text T = aabcdaacbcabcca… and pattern P = abcbabab, a mismatch on text character 'c' gives delta1(c) = 5: shift P to align the rightmost 'c' in P with the current text position.

  7. delta2(j). delta2(j) := the jump width by which we shift so that the suffix of P of length j - 1 is aligned with another factor of P (or with the longest prefix of P that is also a suffix of that string). (If there is no such factor, it is equal to the length of P.) (good-suffix heuristic)
  Example 1: for text T = aabcdaabbcabcca… and pattern P = abcbabab, delta2(3) = 8, so Δ = delta2(3) - 3 + 1 = 6: shift P to align the matched 'ab' with another occurrence inside P. ※There are two candidates, 1 and 5, for the value of delta2(3); however, the character to the left of the 5th position, namely the 4th, is 'b', which doesn't match 'a'. Therefore the 1st position is the only candidate.
  Example 2: for text T = aabcababbcabcca… and the same pattern, delta2(5) = 10, so Δ = delta2(5) - 5 + 1 = 6: shift P to align the matched 'ab' with the prefix of P.
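The following slide notes that computing the delta functions naively takes O(m^2) time; that naive computation can be sketched directly. This sketch uses our own 0-based convention (shifts[j] is the safe shift when the suffix p[j+1:] matched but p[j] did not), not the slide's 1-based delta2 indexing:

```python
def good_suffix_shifts(p):
    # Naive O(m^2) strong good-suffix table, 0-based positions.
    m = len(p)

    def fits(s, j):
        # The pattern shifted right by s must still agree with the matched
        # suffix p[j+1:]...
        for i in range(j + 1, m):
            if i - s >= 0 and p[i - s] != p[i]:
                return False
        # ...and must not place the same character under the mismatch
        # position again (the "strong" condition).
        if j - s >= 0 and p[j - s] == p[j]:
            return False
        return True

    return [next(s for s in range(1, m + 1) if fits(s, j)) for j in range(m)]
```

For p = "abab" this yields [2, 2, 4, 1]: e.g. a mismatch at the last position allows a shift of 1, while a mismatch at position 2 (suffix "b" matched) forces a full shift of 4 because every other 'b' in the pattern is preceded by the mismatching 'a'.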

  8. Problems of the BM method
  • It is complicated to decide the values of the delta functions.
  • It takes O(m^2) time in a naive way.
  • Reducing it to O(m) is somewhat troublesome → done in a similar way to KMP.
  • It costs to compare the values of delta1 and delta2 at each iteration.
  • Generally, only delta1 is used. (However, we then have to take care to shift the pattern correctly, since it cannot always be shifted safely by delta1 alone.)
  • It takes O(mn) time in the worst case.
  • Consider T = a^n and P = ba^m.
  • The efficiency of BM declines when the alphabet size is small.
  • For binary strings over Σ = {0,1}, the Δ's would be very small.

  9. Galil algorithm. Z. Galil. On improving the worst case running time of the Boyer-Moore string searching algorithm. Communications of the ACM, 22(9):505-508, 1979.
  • Since the information about the matched string is forgotten in the original BM method, it takes O(mn) time in the worst case.
  • The idea for improvement is to remember how long a prefix of P has already been matched with the text; only the positions not yet examined are compared, so each character of the text is compared at most twice.
  • The Galil algorithm scans in O(n) time theoretically, but it slows down in practice since the algorithm becomes much more complicated.

  10. Horspool algorithm. R. N. Horspool. Practical fast searching in strings. Software Practice and Experience, 10(6):501-506, 1980.
  • If Σ is large enough, delta1 (the bad-character heuristic) mostly gives the best shift amount. → A small modification can enlarge the jump width: always decide the jump width by the text character aligned with the end position of the pattern, regardless of where the mismatch occurred.
  (Figure: alignments of P = abcbabad on T = aabcdcadbcabccabaca…, comparing the shift given by the original delta1 with the larger shift given by the modified delta1'.)

  11. Pseudo code
  • Horspool (P, T)
  • m ← length[P].
  • n ← length[T].
  • Preprocessing:
  • for each c ∈ Σ do delta1'[c] ← m.
  • for j ← 1 to m - 1 do delta1'[ P[j] ] ← m - j.
  • Searching:
  • i ← 0.
  • while i ≦ n - m do
  •   j ← m;
  •   while j > 0 and T[i+j] = P[j] do j ← j - 1;
  •   if j = 0 then report an occurrence at i + 1;
  •   i ← i + delta1'[ T[i+m] ].
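The pseudocode above corresponds to this Python sketch (0-based indices and a dict standing in for the delta1' table; names are ours):

```python
def horspool(text, pattern):
    m, n = len(pattern), len(text)
    # delta1'[c]: distance from the rightmost occurrence of c among the
    # first m-1 pattern characters to the pattern's end (default m).
    shift = {}
    for j in range(m - 1):
        shift[pattern[j]] = m - 1 - j
    occurrences, i = [], 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and text[i + j] == pattern[j]:
            j -= 1                      # compare right to left
        if j < 0:
            occurrences.append(i)       # whole window matched
        # shift by the text character under the pattern's last position
        i += shift.get(text[i + m - 1], m)
    return occurrences
```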

  12. Sunday algorithm. D. M. Sunday. A very fast substring search algorithm. Communications of the ACM, 33(8):132-142, 1990.
  • It is basically based on the BM method.
  • Different points:
  • It may compare the positions of the pattern in an arbitrary order, for example in order of increasing character frequency.
  • It uses the text character just to the right of the end position of the pattern to determine the value of delta1 (it also calculates delta2, and then compares the two to select the larger shift).
  • The jump width tends to be longer than that of Horspool.
  • However, the memory consumption is larger than Horspool's, and it takes more time to decide the jump width.
  Example: for text T = aabcdcabdcabccabaca… and pattern P = abcbabab, delta1'(d) = 9 and delta1'(c) = 6; the jump width is always decided by the text character just right of the end position of the pattern.
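A sketch of the Sunday shift in Python, using only the bad-character rule on the character just right of the window. This is a simplification of the slide's description (the full algorithm may also use delta2 and frequency-ordered comparisons); names are ours:

```python
def sunday(text, pattern):
    m, n = len(pattern), len(text)
    # shift[c] = distance from the rightmost occurrence of c in the pattern
    # to the position just past the pattern's end; default m + 1.
    shift = {c: m - j for j, c in enumerate(pattern)}
    occurrences, i = [], 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            occurrences.append(i)       # 0-based start
        if i + m >= n:
            break                       # no character right of the window
        i += shift.get(text[i + m], m + 1)
    return occurrences
```

A character absent from the pattern yields the maximum shift m + 1, which is why Sunday's jumps tend to be longer than Horspool's.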

  13. Factor type algorithms: BDM algorithm, BOM algorithm, BNDM algorithm.

  14. Backward Dawg Matching (BDM) algorithm. M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter. Speeding up two string matching algorithms. Algorithmica, 12(4/5):247-267, 1994.
  • It is basically based on the BM method.
  • Different points:
  • It decides whether the pattern can occur at the current position by detecting whether the string read so far matches some factor of the pattern, not just a suffix of the pattern.
  • It uses a suffix automaton (or suffix tree) to determine whether the string read is a factor of the pattern.
  • Features of the suffix automaton (SA):
  • It can tell whether a string u is a factor of pattern P in O(|u|) time.
  • It can also tell whether u is a suffix of P.
  • For P = p1p2…pm, there exists an online construction algorithm that runs in O(m) time.
  From the second feature of the SA we can see whether the string read is a prefix of P or not (scanning the window right to left with the SA of the reversed pattern, a recognized suffix of P^R is a prefix of P).
  Example: scanning the window right to left in T = aabcdcabdcabccabaca… with P = abcbabab: as neither 'cc' is a factor of P nor 'c' is a prefix of P, the pattern can be shifted safely past the scanned position.

  15. Suffix Automaton. A. Blumer, J. Blumer, D. Haussler, A. Ehrenfeucht, M. T. Chen, and J. Seiferas. The smallest automaton recognizing the subwords of a text. Theoretical Computer Science, 40:31-55, 1985.
  (Figure: the suffix trie, suffix tree, and suffix automaton that accept the suffixes of the reverse P^R of P = announce.)

  16. On-line construction algorithm
  • SuffixAutomaton (P = p1p2…pm)
  • Create the one-node graph G = DAWG(ε).
  • root ← sink ← the node of G. suf[root] ← θ.
  • for i ← 1 to m do  (let a = pi)
  •   create a new node newsink;
  •   make a solid edge (sink, newsink) labeled by a;
  •   w ← suf[sink];
  •   while w ≠ θ and son(w, a) = θ do
  •     make a non-solid a-edge (w, newsink);
  •     w ← suf[w];
  •   v ← son(w, a);
  •   if w = θ then suf[newsink] ← root
  •   elseif (w, v) is a solid edge then suf[newsink] ← v
  •   else
  •     create a node newnode;
  •     newnode has the same outgoing edges as v, except that they are all non-solid;
  •     change (w, v) into a solid edge (w, newnode);
  •     suf[newsink] ← newnode;
  •     suf[newnode] ← suf[v]; suf[v] ← newnode;
  •     w ← suf[w];
  •     while w ≠ θ and (w, v) is a non-solid a-edge do
  •       redirect this edge to newnode; w ← suf[w].
  •   sink ← newsink.
  This is rather complicated! The online construction of the SA is a hard task!
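The solid/non-solid edge construction above is equivalent to the more common len/link formulation of the suffix automaton; a compact Python sketch (names are ours; clone plays the role of newnode, and a solid edge (w, v) corresponds to length[v] = length[w] + 1):

```python
def suffix_automaton(s):
    # trans[v]: outgoing edges; link[v]: suffix link (the slide's suf);
    # length[v]: length of the longest string reaching state v.
    trans, link, length = [{}], [-1], [0]
    last = 0
    for c in s:
        cur = len(trans)
        trans.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        while p != -1 and c not in trans[p]:
            trans[p][c] = cur              # the slide's non-solid edges
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = trans[p][c]
            if length[q] == length[p] + 1:
                link[cur] = q              # (p, q) was a solid edge
            else:
                clone = len(trans)         # the slide's newnode
                trans.append(dict(trans[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and trans[p].get(c) == q:
                    trans[p][c] = clone    # redirect non-solid edges
                    p = link[p]
                link[q] = link[cur] = clone
        last = cur
    return trans

def is_factor(trans, u):
    # u is a factor of s iff the walk from the initial state never fails.
    state = 0
    for c in u:
        if c not in trans[state]:
            return False
        state = trans[state][c]
    return True
```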

  17. BNDM algorithm. G. Navarro and M. Raffinot. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics (JEA), 5(4), 2000.
  • The idea is basically the same as the BDM algorithm.
  • Different points:
  • It uses a non-deterministic version of the suffix automaton to determine whether the string read is a factor of the pattern.
  • It simulates the moves of the NFA by the bit-parallel technique.
  (Figure: an NFA that accepts the suffixes of P^R for pattern P = announce.)
  Simulating this NFA uses the same mask table M as the Shift-And method. Initial condition: R0 = 1^m. State transition: R = (R << 1) & M[ T[i] ].

  18. Pseudo code
  • BNDM (P, T)
  • m ← length[P].
  • n ← length[T].
  • Preprocessing:
  • for c ∈ Σ do M[c] ← 0^m.
  • for j ← 1 to m do M[ P[j] ] ← M[ P[j] ] | 0^(j-1) 1 0^(m-j).
  • Searching:
  • s ← 0.
  • while s ≦ n - m do
  •   j ← m, last ← m, R ← 1^m;
  •   while R ≠ 0^m do
  •     R ← R & M[ T[s+j] ];
  •     j ← j - 1;
  •     if R & 1 0^(m-1) ≠ 0^m then
  •       if j > 0 then last ← j;
  •       else report an occurrence at s + 1;
  •     R ← R << 1;
  •   s ← s + last.
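The bit-parallel loop above in Python (0-based text indices; names are ours, and an explicit break after a full-window match replaces the pseudocode's exit from the inner loop):

```python
def bndm(text, pattern):
    # mask[c]: bit positions of c in the reversed pattern, matching the
    # mask table of the pseudocode.
    m, n = len(pattern), len(text)
    mask = {}
    for j, c in enumerate(pattern):
        mask[c] = mask.get(c, 0) | (1 << (m - 1 - j))
    occurrences, s = [], 0
    full, high = (1 << m) - 1, 1 << (m - 1)
    while s <= n - m:
        j, last, r = m, m, full
        while r:
            r &= mask.get(text[s + j - 1], 0)  # read the window right to left
            j -= 1
            if r & high:                       # a pattern prefix was recognized
                if j > 0:
                    last = j                   # remember it for the shift
                else:
                    occurrences.append(s)      # whole window matched
                    break
            r <<= 1
        s += last
    return occurrences
```

The variable last records the start of the longest recognized pattern prefix inside the window, which is exactly how far the window can be shifted safely.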

  19. Backward Oracle Matching (BOM) algorithm. C. Allauzen, M. Crochemore, and M. Raffinot. Efficient experimental string matching by weak factor recognition. In Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, LNCS 2089:51-72, 2001.
  • The idea is basically the same as the BDM algorithm.
  • Different point:
  • It uses a factor oracle instead of the suffix automaton.
  • What BDM really needs is to detect that σu is not a factor, rather than that the string u is a factor.
  • Features of the factor oracle:
  • It may accept strings other than the factors of P. For example, in the figure below, 'cnn' is accepted although it is not a factor of P^R.
  • It can be constructed in O(m) time. Moreover, it is easy to implement with small memory space.
  • The number of states is m + 1; the number of state transitions is at most 2m - 1.
  (Figure: a factor oracle of P^R for P = announce.)

  20. Construction algorithm of the factor oracle
  • Oracle-on-line (P = p1p2…pm)
  • Create Oracle(ε) with one single initial state 0, S(0) ← θ.
  • for j ∈ 1…m do
  •   Oracle(p1p2…pj) ← Oracle_add_letter (Oracle(p1p2…pj-1), pj).
  • Oracle_add_letter (Oracle(P = p1p2…pm), σ)
  • Create a new state m + 1.
  • δ(m, σ) ← m + 1.
  • k ← S(m).
  • while k ≠ θ and δ(k, σ) = θ do
  •   δ(k, σ) ← m + 1;
  •   k ← S(k).
  • if k = θ then s ← 0;
  • else s ← δ(k, σ).
  • S(m+1) ← s.
  • return Oracle(P = p1p2…pmσ).
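The construction translates directly into Python (None stands for the slide's θ; names are ours). Run on P^R = ecnuonna for P = announce, it reproduces the slide's observations, including the accepted non-factor 'cnn':

```python
def factor_oracle(p):
    # States 0..m; delta[k] holds the outgoing transitions of state k,
    # S[k] is the supply function of the construction (S[0] = None = theta).
    m = len(p)
    delta = [dict() for _ in range(m + 1)]
    S = [None] * (m + 1)
    for i in range(1, m + 1):
        c = p[i - 1]
        delta[i - 1][c] = i            # the spine transition
        k = S[i - 1]
        while k is not None and c not in delta[k]:
            delta[k][c] = i            # external transitions down the supply chain
            k = S[k]
        S[i] = 0 if k is None else delta[k][c]
    return delta

def oracle_accepts(delta, u):
    # Accepts every factor of p, and possibly a few non-factors.
    state = 0
    for c in u:
        if c not in delta[state]:
            return False
        state = delta[state][c]
    return True
```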

  21. Matching time comparison. Flexible Pattern Matching in Strings, Navarro & Raffinot, 2002: Fig. 2.22, p.39.
  (Figure: experimental matching times of Shift-Or, Horspool, BNDM, and BOM on English and DNA texts, for pattern lengths from 2 to 256.)

  22. Extensions of suffix & factor type algorithms to multiple patterns: Set Horspool algorithm, Wu-Manber algorithm.

  23. Suffix & factor type algorithms for multiple patterns
  • Commentz-Walter algorithm. B. Commentz-Walter. A string matching algorithm fast on the average. In Proceedings of the 6th International Colloquium on Automata, Languages and Programming, LNCS 71:118-132, 1979.
  • A straight extension of the BM algorithm.
  • Set Horspool algorithm
  • A simplification of Commentz-Walter based on the idea of Horspool.
  • Uratani-Takeda algorithm
  • A BM type algorithm with an AC machine. It is faster than CW.
  • Set Backward Oracle Matching (SBOM) algorithm. C. Allauzen and M. Raffinot. Factor oracle of a set of words. Technical report 99-11, Institut Gaspard-Monge, Universite de Marne-la-Vallee, 1999.
  • An extension of BOM obtained by extending the factor oracle to multiple patterns.
  • Wu-Manber algorithm. S. Wu and U. Manber. A fast algorithm for multi-pattern searching. Report TR-94-17, Department of Computer Science, University of Arizona, Tucson, AZ, 1994.
  • A practically fast algorithm based on hashing. Agrep employs this algorithm.

  24. Set Horspool algorithm
  • First, it builds a trie for the set of reversed patterns in Π.
  • Its matching approach is the same as Horspool's: it traverses the trie while doing the suffix search.
  • If the string read does not match any suffix of the patterns, it shifts by delta1'.
  (Figure: the reversed trie for the patterns; when the scanned range cannot contain an occurrence, the window is shifted by delta1'.)
  ※Cf. the Uratani-Takeda algorithm uses an AC machine instead of the trie, and decides the jump width by the failure functions.

  25. Reason why the performance decreases
  The maximum jump width is limited to ℓmin, the length of the shortest pattern (delta ≦ ℓmin).
  When the number of patterns increases, the bad-character heuristic cannot work well, since the frequency of each character increases.

  26. Wu-Manber algorithm. S. Wu and U. Manber. A fast algorithm for multi-pattern searching. Report TR-94-17, Department of Computer Science, University of Arizona, Tucson, AZ, 1994.
  • It examines whether some patterns occur by reading a block of B characters ending at the current matching position of the text (i.e. T[i-B+1…i]).
  • SHIFT[ T[i-B+1…i] ]: if T[i-B+1…i] is a suffix of some pattern, then 0; otherwise, the maximum length by which the window can be shifted safely.
  • HASH[ T[i-B+1…i] ]: when SHIFT returns 0 (i.e. T[i-B+1…i] is a suffix of some patterns), it returns the list of patterns that can occur at the position.
  Example: for patterns Π = {announce, annual, annually} and text T = CPM annual conference announce, reading the block 'an' gives SHIFT[an] = 4; reading the block 'al' gives SHIFT[al] = 0, and HASH[al] lists the candidate patterns, which are then verified against the text and shifted past by 1.

  27. Pseudo code. ※In the implementation in agrep ver. 4.02 (mgrep.c), the sizes of the SHIFT and HASH tables and the value of B are in fact 4096, 8192, and 3.
  • Construct_SHIFT (P = {p1, p2, …, pr})
  • initialize the SHIFT table by ℓmin - B + 1.
  • for each block Bl = pi[j-B+1…j] do
  •   if SHIFT[h1(Bl)] > mi - j then SHIFT[h1(Bl)] ← mi - j.
  • Wu-Manber (P = {p1, p2, …, pr}, T = T[1…n])
  • Preprocessing:
  • Computation of B.
  • Construction of the hash tables SHIFT and HASH.
  • Searching:
  • pos ← ℓmin.
  • while pos ≦ n do
  •   i ← h1( T[pos-B+1…pos] );
  •   if SHIFT[i] = 0 then
  •     list ← HASH[ h2( T[pos-B+1…pos] ) ];
  •     verify all the patterns in list one by one against the text;
  •     pos ← pos + 1;
  •   else pos ← pos + SHIFT[i].
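A runnable sketch of the scheme, not agrep's actual implementation: plain dicts replace the hash tables (h1 = h2 = identity on the block), B defaults to 2, and only the first ℓmin characters of each pattern drive the tables, as in the original; names are ours:

```python
def wu_manber(patterns, text, B=2):
    # 0-based positions; pos points at the last character of the window.
    lmin = min(len(p) for p in patterns)
    default = lmin - B + 1                 # the initialization of SHIFT
    SHIFT, HASH = {}, {}
    for p in patterns:
        for j in range(B - 1, lmin):       # blocks within the first lmin chars
            block = p[j - B + 1:j + 1]
            SHIFT[block] = min(SHIFT.get(block, default), lmin - 1 - j)
        # block with shift 0: index the pattern for verification
        HASH.setdefault(p[lmin - B:lmin], []).append(p)
    occurrences = []
    pos = lmin - 1
    while pos < len(text):
        block = text[pos - B + 1:pos + 1]
        shift = SHIFT.get(block, default)
        if shift == 0:
            start = pos - lmin + 1
            for p in HASH.get(block, []):  # verify candidates one by one
                if text.startswith(p, start):
                    occurrences.append((start, p))
            pos += 1
        else:
            pos += shift
    return occurrences
```

On the slide's example, wu_manber(["announce", "annual", "annually"], "CPM annual conference announce") reports annual at 0-based position 4 and announce at 22.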

  28. The 3rd summary
  • Suffix type algorithms
  • They match the pattern from the right to the left.
  • They take O(mn) time in the worst case, but O(n/m) time on average.
  • Boyer-Moore, Galil, Horspool, and Sunday.
  • Factor type algorithms
  • They determine whether the string read at the current position is a factor of the pattern or not, and then skip along the text.
  • BDM, BNDM, and BOM algorithms.
  • Extensions of suffix & factor type algorithms to multiple patterns
  • When the number of patterns increases, the bad-character heuristic doesn't work well, since the frequency of each character increases.
  • Set Horspool and Wu-Manber algorithms.
  • The next theme
  • Approximate pattern matching: pattern matching that allows errors.
