
A Fast String Matching Algorithm

The Boyer-Moore algorithm. The obvious search algorithm considers each character position of str and determines whether the successive patlen characters of str match pat.

chen




Presentation Transcript


  1. A Fast String Matching Algorithm: The Boyer-Moore Algorithm

  2. The obvious search algorithm • Considers each character position of str and determines whether the successive patlen characters of str, starting at that position, match pat. • In the worst case, the number of comparisons is on the order of i × patlen, where i is the position in str at which the match is found. Ex. pat: aab; str: ..aaaaac.
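
The worst case described on this slide can be observed by counting comparisons. The following is a minimal sketch, not code from the presentation; the function name `naive_search` and the comparison counter are assumptions introduced for illustration.

```python
def naive_search(pat, text):
    """Try every start position in text; return (index, comparisons).

    index is the 0-based position of the first match, or -1 if there is none.
    comparisons counts every character comparison performed.
    """
    comparisons = 0
    patlen = len(pat)
    for start in range(len(text) - patlen + 1):
        j = 0
        while j < patlen:
            comparisons += 1
            if text[start + j] != pat[j]:
                break  # mismatch: give up on this start position
            j += 1
        if j == patlen:
            return start, comparisons
    return -1, comparisons


# The slide's example: every start position matches "aa" before failing.
print(naive_search("aab", "aaaaac"))  # (-1, 12)
```

With pat = aab and str = aaaaac, each of the 4 start positions costs patlen = 3 comparisons, which is the quadratic behaviour the slide refers to.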

  3. Knuth-Morris-Pratt Algorithm • Linear search algorithm. • Preprocesses pat in time linear in patlen and searches str in time linear in i + patlen. • Ex. pat: EXAMPLE; str: HERE IS A SIMPLE EXAMPLE …
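
For comparison, the two phases named on this slide (preprocess pat, then scan str left to right) can be sketched as below. This is the common textbook prefix-function formulation, which may differ in detail from the version the presentation had in mind; the names `kmp_failure` and `kmp_search` are assumptions.

```python
def kmp_failure(pat):
    """fail[q] = length of the longest proper prefix of pat[:q+1]
    that is also a suffix of it. Runs in time linear in len(pat)."""
    fail = [0] * len(pat)
    k = 0
    for q in range(1, len(pat)):
        while k > 0 and pat[q] != pat[k]:
            k = fail[k - 1]
        if pat[q] == pat[k]:
            k += 1
        fail[q] = k
    return fail


def kmp_search(pat, text):
    """Return the 0-based index of the first occurrence of pat, or -1."""
    if not pat:
        return 0
    fail = kmp_failure(pat)
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pat[k]:
            k = fail[k - 1]  # fall back without re-reading text
        if ch == pat[k]:
            k += 1
        if k == len(pat):
            return i - k + 1
    return -1


print(kmp_search("EXAMPLE", "HERE IS A SIMPLE EXAMPLE"))  # 17
```

Note that every character of the text up to the match is examined at least once, which is the property Boyer-Moore avoids.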

  4. Characteristics of the Boyer-Moore Algorithm • Basic idea: match the pattern against the string from the right rather than from the left. • Preprocess pat to compute two tables, delta1 and delta2, used to shift pat and to advance the pointer into str. • Ex. pat: AT-THAT; str: …WHICH-FINALLY-HALTS.--AT-THAT-POINT

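  5. Informal Description • Compare the last char of pat with the patlen-th char of str:
      pat: AT-THAT
      str: WHICH-FINALLY-HALTS.--AT-THAT-POINT
     • Observation 1: if char is known not to occur in pat, skip patlen chars of str. Here char is F, which does not occur in pat, so pat slides down 7 positions:
      pat:        AT-THAT
      str: WHICH-FINALLY-HALTS.--AT-THAT-POINT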

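  6. Informal Description • Observation 2: if char does occur in pat, slide pat down delta1 positions so that char is aligned with its rightmost occurrence in pat. • delta1(char) = patlen if char does not occur in pat; otherwise patlen − j, where j is the maximum integer such that pat(j) = char. • Here char is -, whose rightmost occurrence is pat(3), so delta1 = 7 − 3 = 4:
      pat:        AT-THAT
      str: WHICH-FINALLY-HALTS.--AT-THAT-POINT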
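
The definition of delta1 on this slide translates almost directly into a table build. A minimal sketch follows; the helper name `build_delta1` is an assumption, and characters absent from the table are handled at lookup time with a default of patlen.

```python
def build_delta1(pat):
    """Map each char of pat to patlen - j, where j is the (1-based)
    position of its rightmost occurrence. Chars not in pat are absent;
    look them up with table.get(ch, len(pat))."""
    patlen = len(pat)
    table = {}
    for j, ch in enumerate(pat, start=1):
        table[ch] = patlen - j  # later (more rightward) occurrences overwrite
    return table


pat = "AT-THAT"
table = build_delta1(pat)
print(table.get("-", len(pat)))  # 4: align '-' with pat(3)
print(table.get("F", len(pat)))  # 7: 'F' does not occur in pat
```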

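  7. Informal Description • Observation 3a: str matches the last m chars of pat, and the mismatch occurs at some new char. Move strptr by delta1 (pat is shifted by delta1 − m). • Here m = 1 and the mismatched char L does not occur in pat, so delta1 = 7: strptr moves 7 and pat shifts 6.
      pat:       AT-THAT
      str: …FINALLY-HALTS.--AT-THAT-POINT
      pat:             AT-THAT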

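  8. Informal Description • Observation 3b: the final m chars of pat (a subpat) have matched. Find the rightmost plausible reoccurrence of subpat in pat and align it with the matched m chars of str (slide pat k positions; strptr moves by delta2 = k + m). • Here subpat is AT (m = 2), which reoccurs at the start of pat, so pat slides 5 positions and the match is found.
      pat:             AT-THAT
      str: …FINALLY-HALTS.--AT-THAT-POINT
      pat:                  AT-THAT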

  9. The delta1 & delta2 tables • The delta1 table has as many entries as there are chars in the alphabet. Ex.
      pat:    a b c d e    (else)
      delta1: 4 3 2 1 0    5
      pat:    a t - t h a t    (else)
      delta1: 1 0 4 0 2 1 0    7
     • The delta2 table has as many entries as there are chars in pat. Ex.
      pat:    a  b  c  d  e
      delta2: 9  8  7  6  1
      pat:    a  t  -  t  h  a  t
      delta2: 11 10 9  8  7  4  1
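
The delta2 table can be computed by a direct search for the "rightmost plausible reoccurrence" of each terminal substring. The sketch below is an unoptimized transcription of that idea, assuming the definition from the original Boyer-Moore paper: delta2(j) = patlen + 1 − rpr(j), where positions that fall off the left end of pat match anything. The function names `rpr` and `delta2_table` are chosen here for illustration.

```python
def rpr(pat, j):
    """Rightmost plausible reoccurrence (1-based start k) of the terminal
    substring pat(j+1..patlen), given a mismatch at pat(j).

    A reoccurrence is plausible if it is not preceded by pat(j) itself.
    k may be <= 0: the copy then hangs off the left end of pat, and the
    overhanging positions are treated as wildcards.
    """
    patlen = len(pat)
    m = patlen - j  # number of chars already matched
    for k in range(j, -m, -1):  # k = j, j-1, ..., 1-m
        unifies = all(
            k + t < 1 or pat[k + t - 1] == pat[j + t] for t in range(m)
        )
        if unifies and (k <= 1 or pat[k - 2] != pat[j - 1]):
            return k
    raise AssertionError("unreachable: k = 1 - m always qualifies")


def delta2_table(pat):
    """delta2(j) for j = 1..patlen, returned as a plain list."""
    patlen = len(pat)
    return [patlen + 1 - rpr(pat, j) for j in range(1, patlen + 1)]


print(delta2_table("abcde"))  # [9, 8, 7, 6, 1]
```

A production implementation would compute the same table in time linear in patlen; the brute-force form is used here only because it mirrors the definition.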

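  10. Ex: we compute delta2(j) for j = 5.
      j:   1 2 3 4 5 6 7
      pat: e d b c a b c
                 e d b c a b c
     The matched subpat is bc (positions 6-7). Its rightmost plausible reoccurrence is at positions 3-4, where it is preceded by d rather than a. Then pat slides k = 3 positions and delta2(5) = k + m = 3 + 2 = 5.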

  11. The algorithm
      stringlen ← length of string.
      i ← patlen.
      top:  if i > stringlen then return false.
            j ← patlen.
      loop: if j = 0 then return i + 1.
            if string(i) = pat(j) then
                j ← j − 1.
                i ← i − 1.
                goto loop.
            close;
            i ← i + max(delta1(string(i)), delta2(j)).
            goto top.
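
The pseudocode on this slide can be turned into a runnable sketch as follows. It keeps the 1-based pointers i and j of the slide and converts to 0-based indexing only when reading characters. The helper `_delta2` is a brute-force table build assumed for illustration (using delta2(j) = patlen + 1 − rpr(j) from the original paper), and the function returns a 0-based index or -1 rather than the slide's `i + 1` or `false`.

```python
def _delta2(pat):
    """Brute-force delta2 as a 1-based list (index 0 unused)."""
    patlen = len(pat)
    table = [0] * (patlen + 1)
    for j in range(1, patlen + 1):
        m = patlen - j  # length of the matched terminal substring
        for k in range(j, -m, -1):
            unifies = all(
                k + t < 1 or pat[k + t - 1] == pat[j + t] for t in range(m)
            )
            if unifies and (k <= 1 or pat[k - 2] != pat[j - 1]):
                table[j] = patlen + 1 - k
                break
    return table


def boyer_moore_search(pat, string):
    """Return the 0-based index of the first occurrence of pat, or -1."""
    patlen, stringlen = len(pat), len(string)
    if patlen == 0:
        return 0
    delta1 = {ch: patlen - j for j, ch in enumerate(pat, start=1)}
    delta2 = _delta2(pat)
    i = patlen
    while i <= stringlen:  # "top"
        j = patlen
        while j > 0 and string[i - 1] == pat[j - 1]:  # "loop"
            j -= 1
            i -= 1
        if j == 0:
            return i  # the slide's i + 1, expressed 0-based
        i += max(delta1.get(string[i - 1], patlen), delta2[j])
    return -1


print(boyer_moore_search("AT-THAT", "WHICH-FINALLY-HALTS.--AT-THAT-POINT"))  # 22
```

On the running example the pointer advances by 7, 4, 7 and 7 before the final right-to-left comparison succeeds, matching the walk-through in Observations 1 to 3b.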

  12. Implementation Consideration

  13. Loops: fast, undo, slow • Fast: scans down string, effectively looking for the last character in pat, skipping according to delta1. • About 80% of the time is spent in it. • Undo: decides whether the fast loop ended because all of string has been scanned or because the last character of pat was hit. • Slow: backs up, checking for matches. • Easy to implement on a byte-addressable machine: char ← string(i), etc.
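
One way to realise the fast and undo steps is the sentinel trick from the original Boyer-Moore paper: a table delta0 equals delta1 except that the entry for the last character of pat is a value `large` greater than stringlen + patlen, so a hit pushes the pointer past the end and the undo step can tell the two exit reasons apart. The sketch below assumes that scheme; the function name `fast_scan` and its return convention are illustrative, and a real implementation would build delta0 once rather than per call.

```python
def fast_scan(pat, string, i):
    """Starting at 1-based pointer i (patlen <= i), return the 1-based
    position of the next char equal to pat's last char, or None if the
    string is exhausted first."""
    patlen, stringlen = len(pat), len(string)
    large = stringlen + patlen + 1
    delta0 = {ch: patlen - j for j, ch in enumerate(pat, start=1)}
    delta0[pat[-1]] = large  # sentinel for the last character of pat

    # fast: one table lookup and one addition per character examined
    while i <= stringlen:
        i += delta0.get(string[i - 1], patlen)

    # undo: did we run off the end, or hit the last character of pat?
    if i <= large:
        return None
    return i - large


print(fast_scan("AT-THAT", "WHICH-FINALLY-HALTS.--AT-THAT-POINT", 7))  # 18
```

From the returned position the slow loop would back up through pat, and on a mismatch advance by max(delta1, delta2) before re-entering the fast loop.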

  14. Measuring the cost of each search • Three kinds of strings: binary alphabet, English, random alphabet. • Fig. 1: the number of references made to string. • Fig. 2: the total number of machine instructions actually executed.

  15. Performance (empirical evidence)

  16. Boyer-Moore vs. the Knuth, Morris, and Pratt algorithm (for English text) • Boyer-Moore: • Every reference to string passes about 4 characters for a pattern of length 5. • For sufficiently large alphabets and sufficiently long patterns, executes fewer than 1 instruction per character passed. • K.M.P.: • References string about 1.1 times per character passed. • The cost per character can be expected to be at least 3.3 instructions.

  17. Conclusion • Requires fewer CPU cycles. • Most efficient on a byte-addressable machine. • Inadvisable for finding the first of several possible substrings, or for identifying a location in string defined by a regular expression. • The Aho-Corasick algorithm is more suitable for such tasks.

  18. Conclusion • Possible improvement: fetch larger bytes in the fast loop and use a hash array to encode the extended delta1 table. • This exponentially increases the effective size of the alphabet and reduces the frequency of common characters.
