
CSE 30331 Lecture 23 – String Matching


emelda


  1. CSE 30331 Lecture 23 – String Matching
     • Simple (Brute-Force) Approach
     • Knuth-Morris-Pratt Algorithm
     • Boyer-Moore Algorithm

  2. The Problem
     • Find the first occurrence of the pattern P in text T.
     • The number of characters in P is m
     • The number of characters in T is n

  3. The Simple Approach
     • For each position j in the text
       • If T[j .. j+m) matches P[0..m)
         • stop: pattern found at position j
     • Advantage:
       • simple to implement
     • Disadvantage:
       • may require the ability to push previously read characters back into the input stream
     • Worst Case Efficiency: O(m*n)
       • The pattern is moved forward only one position each time a mismatch is found, no matter how much of the pattern matched before the mismatched character
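The simple approach can be sketched as below. This is a minimal illustration, assuming the whole text is available as a std::string (so no push-back into an input stream is needed); the name bruteForceFind is not from the lecture.

```cpp
#include <cstddef>
#include <string>

// Returns the index of the first occurrence of P in T, or -1 if there is none.
// bruteForceFind is an illustrative name, not part of the lecture code.
int bruteForceFind(const std::string& T, const std::string& P)
{
    const std::size_t n = T.size(), m = P.size();
    if (m > n) return -1;                          // pattern longer than text
    for (std::size_t j = 0; j + m <= n; ++j) {     // each alignment T[j .. j+m)
        std::size_t k = 0;
        while (k < m && T[j + k] == P[k]) ++k;     // compare left to right
        if (k == m) return static_cast<int>(j);    // whole pattern matched
        // on a mismatch the pattern moves forward by exactly one position
    }
    return -1;
}
```

For instance, searching for "hubble" in "hubbahubbletelescope" returns 5. A text such as "aaaa...ab" with pattern "aab" shows the O(m*n) behaviour, since nearly every alignment matches most of the pattern before failing.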

  4. Knuth-Morris-Pratt (KMP)
     • Based on an FSA for recognizing the pattern P
     • The FSA is represented by a KMP flowchart
       • States are letters in the pattern P
       • Arcs are SUCCESS or FAIL
     • On success ( T[j] == P[k] )
       • move forward with the match ( j++ & k++ )
     • On failure ( T[j] != P[k] )
       • Move backward in the pattern (or shift the pattern forward over the text) to align the rightmost character P[fail[k]] with text character T[j], preserving the longest matching prefix

  5. KMP Fail Links: hubbahubba
     • Example pattern: hubbahubba
       P:        H  U  B  B  A  H  U  B  B  A
       k:        0  1  2  3  4  5  6  7  8  9
       fail[k]: -1  0  0  0  0  0  1  2  3  4
     • Match to text: hubbahubbletelescope...
       hubbahubba                last A != L, fail[9] = 4
            hubbahubba           first A != L, fail[4] = 0
                hubbahubba       H != L, fail[0] = -1
                 hubbahubba
       hubbahubbletelescope...
                ^

  6. KMP – Building Fail Links
     • Pattern: ABABDD
     • If P[k] != T[j] then
       • knew = fail[k] is the pattern position at which matching resumes: the length of the longest prefix of P that still matches the text just before the mismatched character T[j]
     • Finding fail[k]:
       • Go to P[k-1] & find its fail[k-1] (the prefix that matches up to P[k-2])
       • If P[fail[k-1]] matches P[k-1], then fail[k] becomes fail[k-1] + 1
       • Else follow the next fail arrow fail[fail[k-1]] and repeat
     [Figure: KMP flowchart for ABABDD: "Read char", then nodes A B A B D D at positions 0 to 5, then * for success]

  7. KMP – Building Fail Links
     void kmpSetup(char P[], int m, int fail[])
     {
       int k, s;
       fail[0] = -1;               // ch != P[0], read another ch
       for (k=1; k<m; k++)         // for each P[k], left to right
       {
         s = fail[k-1];            // s is previous fail link
         while (s >= 0)            // if not back to start
         {
           if (P[s] == P[k-1])     // duplicate char found
             break;                // so, stop following links
           s = fail[s];            // follow next fail link
         }
         fail[k] = s + 1;          // resume one past the matched prefix
       }
     }

  8. KMP – Building Fail Links
     • Pattern: A B A B D D
     • Fail:   -1 0
     void kmpSetup(char P[], int m, int fail[])
     {
       int k, s;
       fail[0] = -1;               // ch != P[0], read another ch
       for (k=1; k<m; k++) {       // for P[1]:'B'
         s = fail[k-1];            // s is fail[0]:-1
         while (s >= 0) {          // skip loop
           if (P[s] == P[k-1])
             break;
           s = fail[s];
         }
         fail[k] = s + 1;          // set fail[1] = -1 + 1 = 0
       }
     }
     [Figure: KMP flowchart for ABABDD, nodes 0 to 5]

  9. KMP – Building Fail Links
     • Pattern: A B A B D D
     • Fail:   -1 0 0
     void kmpSetup(char P[], int m, int fail[])
     {
       int k, s;
       fail[0] = -1;               // ch != P[0], read another ch
       for (k=1; k<m; k++) {       // for P[2]:'A'
         s = fail[k-1];            // s is fail[1]:0
         while (s >= 0) {          // loop once
           if (P[s] == P[k-1])     // P[0]:'A' != P[1]:'B'
             break;
           s = fail[s];            // so s is fail[0]:-1
         }
         fail[k] = s + 1;          // fail[2] = -1+1 = 0
       }
     }
     [Figure: KMP flowchart for ABABDD, nodes 0 to 5]

  10. KMP – Building Fail Links
      • Pattern: A B A B D D
      • Fail:   -1 0 0 1
      void kmpSetup(char P[], int m, int fail[])
      {
        int k, s;
        fail[0] = -1;              // ch != P[0], read another ch
        for (k=1; k<m; k++) {      // for P[3]:'B'
          s = fail[k-1];           // s is fail[2]:0
          while (s >= 0) {         // loop once
            if (P[s] == P[k-1])    // P[0]:'A' == P[2]:'A'
              break;               // so, break
            s = fail[s];
          }
          fail[k] = s + 1;         // fail[3] = 0+1 = 1
        }
      }
      [Figure: KMP flowchart for ABABDD, nodes 0 to 5]

  11. KMP – Building Fail Links
      • Pattern: A B A B D D
      • Fail:   -1 0 0 1 2
      void kmpSetup(char P[], int m, int fail[])
      {
        int k, s;
        fail[0] = -1;              // ch != P[0], read another ch
        for (k=1; k<m; k++) {      // for P[4]:'D'
          s = fail[k-1];           // s is fail[3]:1
          while (s >= 0) {         // loop once
            if (P[s] == P[k-1])    // P[1]:'B' == P[3]:'B'
              break;               // so, break
            s = fail[s];
          }
          fail[k] = s + 1;         // fail[4] = 1+1 = 2
        }
      }
      [Figure: KMP flowchart for ABABDD, nodes 0 to 5]

  12. KMP – Building Fail Links
      • Pattern: A B A B D D
      • Fail:   -1 0 0 1 2 0
      void kmpSetup(char P[], int m, int fail[])
      {
        int k, s;
        fail[0] = -1;              // ch != P[0], read another ch
        for (k=1; k<m; k++) {      // for P[5]:'D'
          s = fail[k-1];           // s is fail[4]:2
          while (s >= 0) {         // loop twice
            if (P[s] == P[k-1])    // P[2]:'A' != P[4]:'D', P[0]:'A' != P[4]:'D'
              break;
            s = fail[s];           // s = fail[2]:0, s = fail[0]:-1
          }
          fail[k] = s + 1;         // fail[5] = -1+1 = 0
        }
      }
      [Figure: KMP flowchart for ABABDD, nodes 0 to 5]

  13. KMP Fail Links: on mismatch, new k = fail[k]
      • Example pattern: ABABDD   fail: -1 0 0 1 2 0
      • Each pair shows the pattern over the text before and after the mismatch (X is the mismatched text character)
        Before      After
        ABABDD      .ABABDD      A != X so fail[0] = -1
        X?????      X??????      skip X & k=0
        ABABDD      .ABABDD      B != X so fail[1] = 0
        AX????      AX?????      k=0 (shifts pattern 1)
        ABABDD      ..ABABDD     2nd A != X so fail[2] = 0
        ABX???      ABX?????     k=0 (shifts pattern 2)
        ABABDD      ..ABABDD     2nd B != X so fail[3] = 1
        ABAX??      ABAX????     k=1 (shifts pattern 2)

  14. KMP Fail Links: on mismatch, new k = fail[k]
      • Example pattern: ABABDD   fail: -1 0 0 1 2 0
        Before      After
        ABABDD      ..ABABDD        D != X so fail[4] = 2
        ABABX?      ABABX???        k=2 (shifts pattern 2)
        ABABDD      .....ABABDD     2nd D != X so fail[5] = 0
        ABABDX      ABABDX?????     k=0 (shifts pattern 5)
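The shift amounts quoted in these examples follow one rule: a mismatch at pattern index k moves the pattern forward by k - fail[k] positions (with fail[0] = -1 giving a shift of 1). A small sketch, assuming the fail table is already built; kmpShifts is a hypothetical helper name, not part of the lecture code.

```cpp
#include <cstddef>
#include <vector>

// For each pattern index k, the distance the pattern slides over the text
// when a mismatch at k is resolved by following fail[k]: shift = k - fail[k].
// kmpShifts is a hypothetical helper used only for illustration.
std::vector<int> kmpShifts(const std::vector<int>& fail)
{
    std::vector<int> shift(fail.size());
    for (std::size_t k = 0; k < fail.size(); ++k)
        shift[k] = static_cast<int>(k) - fail[k];
    return shift;
}
```

Applied to the ABABDD table -1 0 0 1 2 0, this yields shifts of 1, 1, 2, 2, 2 and 5, the same values listed for the six mismatch positions.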

  15. KMP Scan Algorithm
      int kmpScan (char P[], char T[], int m, int fail[])
      {
        int match = -1;            // position of match in text
        int j = 0, k = 0;
        while (k < m && ! atEndOfText(T,j)) {  // pattern unfinished and there is more text
          if (k == -1) {           // nothing in pattern matched last text char, so
            j++;                   // get next text character
            k = 0;                 // start pattern over
          }
          else if (T[j] == P[k]) {
            j++; k++;              // move forward one character in pattern and text
          }
          else {
            k = fail[k];           // follow fail link to best restart in pattern
          }
        }
        if (k == m)                // matched entire pattern (checked after the loop so
          match = j - m;           // a match ending at the last text character is found)
        return match;
      }
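Putting the setup and scan steps together, a self-contained sketch might look as follows. It assumes std::string inputs instead of the char arrays and the atEndOfText helper used on the slides; buildFail and kmpFind are illustrative names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Fail links computed as in kmpSetup, but for a std::string (illustrative name).
std::vector<int> buildFail(const std::string& P)
{
    std::vector<int> fail(P.size());
    if (P.empty()) return fail;
    fail[0] = -1;                              // mismatch at P[0]: read another char
    for (std::size_t k = 1; k < P.size(); ++k) {
        int s = fail[k - 1];                   // start from the previous fail link
        while (s >= 0 && P[s] != P[k - 1])     // follow links until the prefix extends
            s = fail[s];
        fail[k] = s + 1;
    }
    return fail;
}

// Index of the first occurrence of P in T, or -1 (illustrative name).
int kmpFind(const std::string& T, const std::string& P)
{
    const int n = static_cast<int>(T.size());
    const int m = static_cast<int>(P.size());
    const std::vector<int> fail = buildFail(P);
    int j = 0, k = 0;
    while (k < m && j < n) {
        if (k == -1) { ++j; k = 0; }             // skip this text char, restart pattern
        else if (T[j] == P[k]) { ++j; ++k; }     // success arc
        else k = fail[k];                        // fail arc; j never moves backward
    }
    return (k == m) ? j - m : -1;
}
```

Because j never decreases, the text can be consumed as a stream, which is the property the simple approach lacks.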

  16. KMP - Efficiency
      • Building fail links – O(m)
      • Scanning text – O(n)
      • Overall – O(m+n), which is O(n) when m <= n

  17. Boyer-Moore (BM)
      • Heuristic #1
        • Match the pattern right-to-left
        • Create a charJump[ch] array with an entry for each character in the alphabet (ASCII code)
        • If T[j] != P[k] then
          • If T[j] appears in P[0..k) then
            • its rightmost occurrence is aligned with T[j]
          • Else
            • the pattern P is aligned beginning at T[j+1]
        • jnew = j + charJump[T[j]]
        • Matching resumes with T[jnew] and P[m-1]
        • This skips multiple text characters WITHOUT ever examining them

  18. Boyer-Moore Algorithm
      • Heuristic #2
        • matchJump[k] = slide[k] + m – k
          • slide[k] is the amount of slide needed to align the substrings
          • m – k is the length of the suffix (substring) being realigned, with k counted from 1; the 0-based code adds m – (k+1)
        • Similar to KMP fail links, but calculated right to left
        • If a suffix has matched in P & T and that same substring appears elsewhere in P, then upon a mismatch the pattern P is "slid" to align the rightmost such matching substring with the suffix in T
        • Matching resumes at the new end of the pattern determined by matchJump[k]

  19. BM - Example
      • Pattern: BATSANDCATS
      BATSANDCATS                        first pattern alignment
        BATSANDCATS                      charJump[T[j]] aligns N's
             BATSANDCATS                 matchJump[k] aligns ATS's
      TWOOLDGNATSCANBELIKEBATSANDCATS    <- the text
      • New j (where matching resumes) is at the end of pattern P, but which one: (S =?= A) or (S =?= I)?
      • Use MAX(charJump[T[j]], matchJump[k])

  20. Computing individual charJumps
      // find cJ[ch] for each character ch in pattern P
      void computeJumps (char P[], int m, int alpha, int charJump[])
      {
        // assume jump distance is entire pattern length for all
        // characters that do not match a pattern letter.
        for (int ch=0; ch<alpha; ch++)
          charJump[ch] = m;
        // for each pattern letter find the minimum jump to align
        // rightmost occurrence in string, with same current char
        // in the text
        for (int k=0; k<m; k++)
          charJump[(int)P[k]] = m - (k + 1);
      }
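As a quick check of the table this produces, the sketch below builds charJump for an arbitrary pattern. It assumes 8-bit characters (256 table entries) and casts through unsigned char so that characters above 127 cannot index negatively; buildCharJump is an illustrative name.

```cpp
#include <array>
#include <string>

// charJump[ch] = distance from the rightmost occurrence of ch in P to the
// end of P, or the full pattern length if ch does not occur in P.
// buildCharJump is an illustrative name; 256 entries assumes 8-bit chars.
std::array<int, 256> buildCharJump(const std::string& P)
{
    const int m = static_cast<int>(P.size());
    std::array<int, 256> charJump;
    charJump.fill(m);                          // default: jump the whole pattern
    for (int k = 0; k < m; ++k)                // later (rightmost) occurrences overwrite earlier ones
        charJump[static_cast<unsigned char>(P[k])] = m - (k + 1);
    return charJump;
}
```

For the pattern BATSANDCATS this gives charJump['N'] = 5, so a mismatch against a text 'N' at j = 7 resumes matching at j = 12, which is the alignment of the N's in the earlier example.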

  21. Computing substring matchJumps
      void computeMatchJumps (char P[], int m, int matchJump[])
      {
        int k, s, low, shift, *sufx = new int[m+1];
        // note: sufx[0] tells what suffix matches a prefix of P
        for (k=0; k<m; k++)
          matchJump[k] = m + 1;    // initially, an impossibly large slide
        // Compute sufx links (like KMP fail links, but right-to-left).
        // Detect if substring equals matched suffix and is preceded
        // by mismatch at s; compute its slide.
        sufx[m] = m + 1;
        for (k=m-1; k>=0; k--)     // k indexes P (0..m-1) and sufx (0..m)
        {
          s = sufx[k+1];
          while (s <= m) {
            if (P[k] == P[s-1])    // s-1 indexes P and matchJump
              break;
            if (s-(k+1) < matchJump[s-1])  // mismatch between P[k] and P[s-1]
              matchJump[s-1] = s-(k+1);
            s = sufx[s];
          }
          sufx[k] = s - 1;
        }

  22. Computing substring matchJumps (continued)
        // if no suffix match at k+1, compute slide based on prefix that
        // matches suffix. Prefix length = (m - shift).
        low = 1;
        shift = sufx[0];
        while (shift <= m) {
          for (k=low; k<=shift; k++) {
            if (shift < matchJump[k-1])
              matchJump[k-1] = shift;
          }
          low = shift + 1;
          shift = sufx[shift];
        }
        // Add number of matched characters to slide amount
        for (k=0; k<m; k++)
          matchJump[k] += (m-(k+1));
        delete [] sufx;            // release the temporary suffix links
      }

  23. BM Scan Algorithm
      int boyerMooreScan (char P[], char T[], int m, int charJump[], int matchJump[])
      {
        // assumes m >= 1
        int match = -1, j = m-1, k = m-1;
        while (! endOfText(T,j)) {
          if (T[j] == P[k]) {
            if (k == 0) {          // entire pattern matches, so stop
              match = j;
              break;
            }
            j--; k--;              // continue match right-to-left
          }
          else {
            int jump = matchJump[k];         // use the larger of the two jumps
            if (charJump[(int)T[j]] > jump)
              jump = charJump[(int)T[j]];
            j += jump;             // jump forward & restart matching at right
            k = m-1;
          }
        }
        return match;
      }

  24. BM - Example
      • Pattern: WOWWOW
      • mJump: 8 7 6 7 3 1   cJump: 'W'=0, 'O'=1, others=6
      WOWTHISISWOWXOWWOWWOW    the TEXT (21 chars)
           1  1111111111121    # of comparisons (15)
      WOWWOW                   W != I, cJ[I]=6, mJ[5]=1
            WOWWOW             W != S, cJ[S]=6, mJ[2]=6
               WOWWOW          W != X, cJ[X]=6, mJ[3]=7
                    WOWWOW     W != O, cJ[O]=1, mJ[5]=1
                     WOWWOW    match
      • Note: cJump['W']=0 simply means that if the TEXT character is 'W', the realignment placing the rightmost pattern 'W' over the text 'W' is achieved by not moving the pattern
      • Note: the algorithm will NOT work using only cJump, because a jump of 0 never moves the pattern forward
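The example can be reproduced with a compact, self-contained sketch. It follows the table construction and scan from the preceding slides but assumes std::string inputs and 8-bit characters, and it adds a comparison counter that is not in the lecture code; BMResult, buildMatchJump and boyerMooreFind are illustrative names.

```cpp
#include <algorithm>
#include <array>
#include <string>
#include <vector>

struct BMResult {
    int match;        // index of the first occurrence, or -1
    int comparisons;  // character comparisons performed (added for illustration)
};

// matchJump table, following computeMatchJumps with std::vector storage.
std::vector<int> buildMatchJump(const std::string& P)
{
    const int m = static_cast<int>(P.size());
    std::vector<int> matchJump(m, m + 1), sufx(m + 1);
    sufx[m] = m + 1;
    for (int k = m - 1; k >= 0; --k) {         // suffix links, right to left
        int s = sufx[k + 1];
        while (s <= m) {
            if (P[k] == P[s - 1]) break;
            matchJump[s - 1] = std::min(matchJump[s - 1], s - (k + 1));
            s = sufx[s];
        }
        sufx[k] = s - 1;
    }
    int low = 1, shift = sufx[0];
    while (shift <= m) {                       // slides based on prefix == suffix
        for (int k = low; k <= shift; ++k)
            matchJump[k - 1] = std::min(matchJump[k - 1], shift);
        low = shift + 1;
        shift = sufx[shift];
    }
    for (int k = 0; k < m; ++k)                // add the matched suffix length
        matchJump[k] += m - (k + 1);
    return matchJump;
}

BMResult boyerMooreFind(const std::string& T, const std::string& P)
{
    const int n = static_cast<int>(T.size());
    const int m = static_cast<int>(P.size());
    if (m == 0) return {0, 0};
    std::array<int, 256> charJump;
    charJump.fill(m);
    for (int k = 0; k < m; ++k)
        charJump[static_cast<unsigned char>(P[k])] = m - (k + 1);
    const std::vector<int> matchJump = buildMatchJump(P);

    int comparisons = 0, j = m - 1, k = m - 1;
    while (j < n) {
        ++comparisons;
        if (T[j] == P[k]) {
            if (k == 0) return {j, comparisons};   // whole pattern matched
            --j; --k;                              // continue right to left
        } else {
            j += std::max(charJump[static_cast<unsigned char>(T[j])], matchJump[k]);
            k = m - 1;                             // restart at the pattern's right end
        }
    }
    return {-1, comparisons};
}
```

With T = "WOWTHISISWOWXOWWOWWOW" and P = "WOWWOW", this sketch returns match 15 after 15 comparisons, and buildMatchJump("WOWWOW") gives 8 7 6 7 3 1, in line with the slide.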

  25. BM Algorithm Efficiency
      • Building charJump[ ] – O(|Σ| + m), where |Σ| is the alphabet size
      • Building matchJump[ ] – O(m)
      • Scanning text – O(n)
      • In practice, only about every third or fourth text character is examined, so BM is quite fast
      • Overall – O(n)

  26. String Matching Program
      • Program to demonstrate all three approaches to string matching
      • demos\strScan.cpp
