A Fast String Matching Algorithm

The Boyer-Moore Algorithm

The obvious search algorithm
• Considers each character position of str and determines whether the successive patlen characters of str starting there match pat.
• In the worst case, the number of comparisons is on the order of i * patlen, where i is the position of the first match.
• Ex. pat: aab; str: ..aaaaac
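The obvious algorithm can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function name `naive_search` and the comparison counter are assumptions added to make the worst-case behaviour visible.

```python
def naive_search(pat, s):
    """Try every start position of s in turn.

    Returns (1-based position of the first match or None,
             number of character comparisons performed).
    """
    comparisons = 0
    for start in range(len(s) - len(pat) + 1):
        k = 0
        while k < len(pat):
            comparisons += 1
            if s[start + k] != pat[k]:
                break
            k += 1
        if k == len(pat):
            return start + 1, comparisons
    return None, comparisons


print(naive_search("aab", "aaaaac"))  # (None, 12)
```

On the slide's example every one of the four alignments costs patlen = 3 comparisons before failing, which is the quadratic behaviour described above.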

Knuth-Morris-Pratt Algorithm
• Linear search algorithm.
• Preprocesses pat in time linear in patlen and searches str in time linear in i + patlen.
• Ex. pat: EXAMPLE; str: HERE IS A SIMPLE EXAMPLE …
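For comparison with the right-to-left approach that follows, here is a hedged Python sketch of a Knuth-Morris-Pratt style search. It uses the common "failure table" formulation, which may differ in detail from the version the slide had in mind; `kmp_search` and `fail` are illustrative names.

```python
def kmp_search(pat, s):
    """Return the 1-based position of the first match of pat in s, or None."""
    if not pat:
        return 1
    # fail[q] = length of the longest proper prefix of pat[:q+1]
    # that is also a suffix of it (built in time linear in patlen).
    fail = [0] * len(pat)
    k = 0
    for q in range(1, len(pat)):
        while k > 0 and pat[q] != pat[k]:
            k = fail[k - 1]
        if pat[q] == pat[k]:
            k += 1
        fail[q] = k
    # Scan s left to right; the index into s never moves backwards.
    k = 0
    for i, ch in enumerate(s):
        while k > 0 and ch != pat[k]:
            k = fail[k - 1]
        if ch == pat[k]:
            k += 1
        if k == len(pat):
            return i - len(pat) + 2
    return None


print(kmp_search("EXAMPLE", "HERE IS A SIMPLE EXAMPLE"))  # 18
```

Note that every character of str up to the match is still examined at least once, which is the cost Boyer-Moore tries to avoid.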

Characteristics of the Boyer-Moore Algorithm
• Basic idea: compare pat against str starting from the right end of pat rather than from the left.
• Preprocess pat to compute two tables, delta1 and delta2, which determine how far to shift pat and the pointer into str.
• Ex. pat: AT-THAT; str: …WHICH-FINALLY-HALTS.--AT-THAT-POINT

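Informal Description
Compare the last char of pat with the patlen-th char of str:
pat: AT-THAT
str: WHICH-FINALLY-HALTS.--AT-THAT-POINT
Observation 1: if char is known not to occur in pat, skip patlen chars of str. Here char is F, so pat slides down 7 positions:
pat:        AT-THAT
str: WHICH-FINALLY-HALTS.--AT-THAT-POINT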

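Informal Description
Observation 2: if char occurs in pat, slide pat down delta1 positions so that char is aligned with its rightmost occurrence in pat.
delta1(char) = if char does not occur in pat, then patlen; else patlen - j, where j is the maximum integer such that pat(j) = char.
pat:        AT-THAT
str: WHICH-FINALLY-HALTS.--AT-THAT-POINT
Here char is the hyphen under the final T of pat, so delta1 = 7 - 3 = 4.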
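The delta1 rule (patlen for characters absent from pat, otherwise patlen - j for the rightmost position j holding that character) can be sketched as follows. This is a minimal sketch: storing only the characters of pat in a dictionary and falling back to patlen is an implementation choice made here, and the names `delta1_table` and `delta1` are illustrative.

```python
def delta1_table(pat):
    """Map each character of pat to patlen - j, j being its rightmost 1-based position."""
    patlen = len(pat)
    table = {}
    for j, ch in enumerate(pat, start=1):
        table[ch] = patlen - j  # later (more rightward) positions overwrite earlier ones
    return table


def delta1(table, patlen, ch):
    """Characters that do not occur in pat get the full shift patlen."""
    return table.get(ch, patlen)


t = delta1_table("AT-THAT")
print(t)                  # {'A': 1, 'T': 0, '-': 4, 'H': 2}
print(delta1(t, 7, "F"))  # 7
```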

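Informal Description
Observation 3a: str matches the last m chars of pat, and then a mismatch occurs at some new char. Slide pat down by k = delta1(char) - m to align char with its rightmost occurrence in pat; strptr then moves by k + m = delta1(char). Here m = 1 and char is L, which does not occur in pat, so strptr moves by 7 and pat is shifted by 6:
pat:       AT-THAT
str: …FINALLY-HALTS.--AT-THAT-POINT
pat:             AT-THAT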

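Informal Description
Observation 3b: the final m chars of pat (a subpat) have been matched. Find the rightmost plausible reoccurrence of the subpat in pat and align it with the matched m chars of str (slide pat down k positions, so strptr moves by k + m). Here the subpat is AT (m = 2); its plausible reoccurrence is at the start of pat, so pat slides down 5 positions:
pat:             AT-THAT
str: …FINALLY-HALTS.--AT-THAT-POINT
pat:                  AT-THAT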

The delta1 & delta2 tables
• The delta1 table has as many entries as there are chars in the alphabet.
  pat:    a b c d e
  delta1: 4 3 2 1 0   (all other chars: 5)
  pat:    a t - t h a t
  delta1: 1 0 4 0 2 1 0   (all other chars: 7)
• The delta2 table has as many entries as there are chars in pat.
  pat:    a b c d e
  delta2: 9 8 7 6 1
  pat:     a  t  -  t  h  a  t
  delta2: 11 10  9  8  7  4  1

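Ex: computing delta2(j) for j = 5
j:   1 2 3 4 5 6 7
pat: e d b c a b c
The terminal substring after position 5 is bc. Its rightmost plausible reoccurrence starts at position 3 (it is preceded there by d rather than a), so rpr(5) = 3. Then delta2(5) = patlen + 1 - rpr(5) = 7 + 1 - 3 = 5.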
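The delta2 table can be computed directly from a definition. The sketch below assumes the definition used in the original Boyer-Moore paper: rpr(j) is the greatest k such that pat(j+1)..pat(patlen) matches pat(k)..pat(k+patlen-j-1), with positions left of the pattern matching anything, and either k <= 1 or pat(k-1) != pat(j); then delta2(j) = patlen + 1 - rpr(j). It is a brute-force illustration rather than an efficient preprocessing routine, and the helper names are assumptions.

```python
def rpr(pat, j):
    """Rightmost plausible reoccurrence of pat(j+1)..pat(patlen), for 1-based j."""
    patlen = len(pat)
    m = patlen - j  # length of the matched terminal substring

    def at(pos):
        # 1-based access; None stands for "off the left end", which matches anything
        return pat[pos - 1] if pos >= 1 else None

    for k in range(min(j + 1, patlen), -patlen, -1):
        fits = all(at(k + t) in (None, at(j + 1 + t)) for t in range(m))
        if fits and (k <= 1 or at(k - 1) != at(j)):
            return k


def delta2_table(pat):
    """delta2(j) = patlen + 1 - rpr(j), returned as a list for j = 1..patlen."""
    return [len(pat) + 1 - rpr(pat, j) for j in range(1, len(pat) + 1)]


print(rpr("edbcabc", 5))      # 3
print(delta2_table("abcde"))  # [9, 8, 7, 6, 1]
```

A negative rpr value simply means that the only plausible alignment has part of the pattern hanging off the left of itself, which is why delta2 can exceed patlen.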

The algorithm
    stringlen <- length of string.
    i <- patlen.
top:  if i > stringlen then return false.
    j <- patlen.
loop: if j = 0 then return i + 1.
    if string(i) = pat(j)
      then
        j <- j - 1.
        i <- i - 1.
        goto loop.
      close;
    i <- i + max(delta1(string(i)), delta2(j)).
    goto top.
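The pseudocode above translates fairly directly into Python. The following is a minimal sketch, not a tuned implementation: it keeps the 1-based indices i and j of the pseudocode, returns None instead of false, and builds delta2 by brute force from the paper's rpr definition (delta2(j) = patlen + 1 - rpr(j)), which is an assumption about how the tables are obtained. All function names are illustrative.

```python
def delta1_table(pat):
    """delta1(char) = patlen - j for the rightmost 1-based j with pat(j) = char."""
    return {ch: len(pat) - j for j, ch in enumerate(pat, start=1)}


def delta2_table(pat):
    """delta2(j) = patlen + 1 - rpr(j), by brute force; list index is j - 1."""
    patlen = len(pat)

    def at(pos):
        return pat[pos - 1] if pos >= 1 else None  # None: off the left end

    table = []
    for j in range(1, patlen + 1):
        m = patlen - j
        for k in range(min(j + 1, patlen), -patlen, -1):
            fits = all(at(k + t) in (None, at(j + 1 + t)) for t in range(m))
            if fits and (k <= 1 or at(k - 1) != at(j)):
                table.append(patlen + 1 - k)
                break
    return table


def boyer_moore(pat, string):
    """Return the 1-based position of the first match, or None."""
    patlen, stringlen = len(pat), len(string)
    if patlen == 0:
        return 1
    d1, d2 = delta1_table(pat), delta2_table(pat)
    i = patlen
    while i <= stringlen:                      # top
        j = patlen
        while j > 0 and string[i - 1] == pat[j - 1]:
            j -= 1                             # loop: compare right to left
            i -= 1
        if j == 0:
            return i + 1
        i += max(d1.get(string[i - 1], patlen), d2[j - 1])
    return None


print(boyer_moore("AT-THAT", "WHICH-FINALLY-HALTS.--AT-THAT-POINT"))  # 23
```

On the running example the search inspects only a handful of characters of str before reaching the match at position 23.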

Implementation Considerations

Loops: fast, undo, slow
• Fast: scans down string, effectively looking for the last character in pat, skipping according to delta1.
• About 80% of the search time is spent in this loop.
• Undo: decides whether this situation arose because all of string has been scanned or because the last character of pat was hit.
• Slow: backs up, checking for matches.
• It is easy to implement on a byte-addressable machine.
• Char <- string(i), etc.
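The three loops can be illustrated with a hedged Python sketch. It assumes the scheme of the original paper, which these slides do not spell out: a table delta0 that equals delta1 except that the last character of pat maps to a value `large` greater than stringlen + patlen, so that one test on i tells the undo step whether the fast loop ran off the end of string or hit the last character of pat. The names `delta0`, `large` and the helper functions are assumptions, and delta2 is built by brute force for brevity.

```python
def delta2_table(pat):
    """delta2(j) = patlen + 1 - rpr(j), by brute force; list index is j - 1."""
    patlen = len(pat)

    def at(pos):
        return pat[pos - 1] if pos >= 1 else None  # None: off the left end

    table = []
    for j in range(1, patlen + 1):
        m = patlen - j
        for k in range(min(j + 1, patlen), -patlen, -1):
            fits = all(at(k + t) in (None, at(j + 1 + t)) for t in range(m))
            if fits and (k <= 1 or at(k - 1) != at(j)):
                table.append(patlen + 1 - k)
                break
    return table


def bm_fast_undo_slow(pat, string):
    """Return the 1-based position of the first match, or None."""
    patlen, stringlen = len(pat), len(string)
    if patlen == 0:
        return 1
    large = stringlen + patlen + 1
    d1 = {ch: patlen - j for j, ch in enumerate(pat, start=1)}
    d2 = delta2_table(pat)
    d0 = dict(d1)
    d0[pat[-1]] = large          # sentinel shift for the last character of pat
    i = patlen
    while True:
        # fast: skip along string until the last character of pat is seen
        while i <= stringlen:
            i += d0.get(string[i - 1], patlen)
        # undo: did we run out of string, or hit the last character?
        if i <= large:
            return None
        i = (i - large) - 1
        j = patlen - 1
        # slow: back up, checking the remaining characters
        while j > 0 and string[i - 1] == pat[j - 1]:
            j -= 1
            i -= 1
        if j == 0:
            return i + 1
        i += max(d1.get(string[i - 1], patlen), d2[j - 1])


print(bm_fast_undo_slow("AT-THAT", "WHICH-FINALLY-HALTS.--AT-THAT-POINT"))  # 23
```

The point of the design is that the fast loop does one table lookup, one addition and one comparison per character inspected, with no separate test for "did the character match".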

Measured the cost of each search
• Three strings: binary alphabet, English, random alphabet.
• Fig. 1: the number of references made to string.
• Fig. 2: the total number of machine instructions that actually got executed.

Performance (empirical evidence)

Boyer-Moore vs. the Knuth, Morris, and Pratt algorithm (for English text)
• Boyer-Moore:
  • Every reference to string passes about 4 characters for a pattern of length 5.
  • For sufficiently large alphabets and sufficiently long patterns, it executes fewer than 1 instruction per character passed.
• K.M.P.:
  • References string about 1.1 times per character.
  • The cost per character can be expected to be at least 3.3 instructions.

Conclusion
• Requires fewer CPU cycles.
• Runs most efficiently on a byte-addressable machine.
• Not advisable for finding the first of several possible substrings, or for identifying a location in string defined by a regular expression.
• For those tasks the Aho and Corasick algorithm is more suitable.

Conclusion
• Possible improvement: fetch larger bytes in the fast loop and use a hash array to encode the extended delta1 table.
• This exponentially increases the effective size of the alphabet and reduces the frequency of common characters.
