Pattern Matching

1. 1 Pattern Matching

2. 2 Outline: Problem description; Brute Force algorithm; Knuth-Morris-Pratt algorithm; Boyer-Moore algorithm

3. 3 Problem description What is pattern matching? Pattern: quchjfng Text: kadogqlxwwenkpjkmolpevnmqcsjakoheorplkzb zfjliar Iacwzjlvkxromavcdquchjfnganqxharnhkw eweg tkrmwm fjduwhqvhwpfg pymyajsaywf aas bwxwcmrncmslgeqztyoygdbnxjqoctslbrvngoyxe

4. 4 Problem description: Given two strings P (pattern) and T (text), find the first occurrence of P in T. Input: two strings P and T. Output: the starting index of the first occurrence of P in T, or Null if P does not occur in T. Applications: text editors, search engines, biological research.

5. 5 Strings [1/2]: A string is a sequence of data elements, such as numbers or characters. An alphabet S is a set of basic symbols for a category of strings. Examples of alphabets: Character: {A, B, ..., Z, a, b, ..., z}; Number: {0, 1, 2, ..., 9}; DNA: {A, C, G, T}.

6. 6 Strings [2/2]: Let P be a string of size m. A substring P[i .. j] is the string containing the characters of P with ranks between i and j, inclusive, so P = P[0 .. m - 1]. A prefix of P is a substring of the form P[0 .. i]. A suffix of P is a substring of the form P[i .. m - 1].
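The definitions above can be sketched in a few lines of Python. The helper names `substring`, `prefixes` and `suffixes` are illustrative and not part of the slides; the only subtlety is that the slide's P[i .. j] includes rank j, while a Python slice excludes its end index.

```python
from typing import List


def substring(p: str, i: int, j: int) -> str:
    """Return P[i .. j] with both ranks inclusive, as on the slide."""
    return p[i:j + 1]


def prefixes(p: str) -> List[str]:
    """All prefixes P[0 .. i] for 0 <= i <= m - 1."""
    return [substring(p, 0, i) for i in range(len(p))]


def suffixes(p: str) -> List[str]:
    """All suffixes P[i .. m - 1] for 0 <= i <= m - 1."""
    m = len(p)
    return [substring(p, i, m - 1) for i in range(m)]
```

For P = abc this yields the prefixes a, ab, abc and the suffixes abc, bc, c.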

7. 7 Brute Force Algorithm [1/2]

8. 8 Brute Force Algorithm [2/2]
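A minimal sketch of the brute-force approach, assuming the usual formulation: try every alignment of P against T from left to right and compare character by character. The function name `brute_force_match` and the use of Python's `None` for the Null output are illustrative choices, not taken from the slides.

```python
from typing import Optional


def brute_force_match(t: str, p: str) -> Optional[int]:
    """Return the index of the first occurrence of p in t, or None."""
    n, m = len(t), len(p)
    for i in range(n - m + 1):          # candidate start of P in T
        j = 0
        while j < m and t[i + j] == p[j]:
            j += 1                      # extend the match one character
        if j == m:                      # all m characters matched
            return i
    return None                         # P does not occur in T
```

Each of the n - m + 1 alignments can cost up to m comparisons, so this sketch runs in O(nm) time in the worst case.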

9. 9 Shift P by more than one position when a mismatch occurs, but never miss an occurrence of P in T. Preprocessing approaches: preprocessing P (Knuth-Morris-Pratt algorithm, Boyer-Moore algorithm); preprocessing T (suffix trees).

10. 10 The Knuth-Morris-Pratt algorithm: best known for linear-time exact matching; preprocesses the pattern P; compares from left to right.

11. 11 The KMP Ideas: shift by more than one position; reduce the number of comparisons.

12. 12

13. 13 KMP Failure Function: Knuth-Morris-Pratt's algorithm preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself. The failure function F(j) is defined as the size of the largest prefix of P[0..j] that is also a suffix of P[1..j].
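To make the definition concrete, the sketch below evaluates F(j) directly from its wording, without any of KMP's cleverness. It is deliberately slow and only meant as a reference; the name `failure_by_definition` is hypothetical.

```python
from typing import List


def failure_by_definition(p: str) -> List[int]:
    """Compute F(j) for every j straight from the definition (not O(m))."""
    f = []
    for j in range(len(p)):
        tail = p[1:j + 1]                  # P[1 .. j]
        best = 0
        for k in range(1, len(tail) + 1):  # candidate prefix sizes
            if tail.endswith(p[:k]):       # prefix P[0 .. k-1] is a suffix of P[1 .. j]
                best = k
        f.append(best)
    return f
```

For P = abaaba this gives F = 0, 0, 1, 1, 2, 3, and for P = abacab it gives F = 0, 0, 1, 0, 1, 2.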

14. 14 The KMP Algorithm

15. 15 Example

16. 16 The KMP Algorithm: The failure function can be represented by an array and can be computed in O(m) time. At each iteration of the while-loop, where i is the current position in T and j the current position in P, either i increases by one, or the shift amount i - j increases by at least one (observe that F(j - 1) < j). Hence, there are no more than 2n iterations of the while-loop. Thus, KMP's algorithm runs in optimal time O(m + n).
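A sketch of the matching loop described above, assuming the common textbook formulation in which i indexes T and j indexes P. The names `failure_function` and `kmp_match` are illustrative; the failure function is included only so the block runs on its own.

```python
from typing import List, Optional


def failure_function(p: str) -> List[int]:
    """Failure function of p, computed in O(m) time."""
    m = len(p)
    f = [0] * m
    i, j = 1, 0
    while i < m:
        if p[i] == p[j]:
            f[i] = j + 1        # j + 1 characters of the prefix matched
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]        # fall back to a shorter prefix
        else:
            f[i] = 0            # no prefix matches here
            i += 1
    return f


def kmp_match(t: str, p: str) -> Optional[int]:
    """Return the index of the first occurrence of p in t, or None."""
    n, m = len(t), len(p)
    if m == 0:
        return 0
    f = failure_function(p)
    i = j = 0
    while i < n:
        if t[i] == p[j]:
            if j == m - 1:
                return i - j    # full match ends at i
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]        # shift P, keep i: shift amount i - j grows
        else:
            i += 1              # mismatch at P[0]: advance in T
    return None
```

Every iteration either advances i or increases i - j, which is the source of the 2n bound on the loop.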

17. 17 Computing the Failure Function: The failure function can be represented by an array and can be computed in O(m) time. The construction is similar to the KMP algorithm itself, except that i and j both index P. At each iteration of the while-loop, either i increases by one, or the shift amount i - j increases by at least one (observe that F(j - 1) < j). Hence, there are no more than 2m iterations of the while-loop.
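The 2m bound can be checked empirically by counting loop iterations. The sketch below assumes the usual construction in which the pattern is matched against itself; the function name `failure_function_counted` and the returned counter are additions for illustration only.

```python
from typing import List, Tuple


def failure_function_counted(p: str) -> Tuple[List[int], int]:
    """Return (failure array, number of while-loop iterations)."""
    m = len(p)
    f = [0] * m
    i, j = 1, 0
    iterations = 0
    while i < m:
        iterations += 1
        if p[i] == p[j]:
            f[i] = j + 1        # i increases by one
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]        # i - j increases, since F(j - 1) < j
        else:
            f[i] = 0            # i increases by one
            i += 1
    return f, iterations


# A pattern that forces several fall-back steps at its last character.
f, iterations = failure_function_counted("aaaab")
assert iterations <= 2 * len("aaaab")
```

The trailing b in aaaab makes j fall back through every shorter prefix, yet the iteration count stays within 2m.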

18. 18 Boyer-Moore's Algorithm: The Boyer-Moore pattern matching algorithm is based on two principles: the right-to-left scan and the bad character shift. Reference: http://www.cs.utexas.edu/users/moore/best-ideas/string-searching

19. 19 Right to left Scan Rule

20. 20 Bad Character Shift Rule [1/3]: Shift more than one position at a time. When a mismatch occurs at T[i] = c: if P contains c, shift P to align the last occurrence of c in P with T[i].

21. 21 Bad Character Shift Rule [2/3]: When a mismatch occurs at T[i] = c: if P contains c, shift P to align the last occurrence of c in P with T[i]. Else, if P does not contain c, shift P to align P[0] with T[i + 1].

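22. 22 Bad Character Shift Rule [3/3]: When a mismatch occurs at T[i] = c: if P contains c, shift P to align the last occurrence of c in P with T[i]. Else, if P does not contain c, shift P to align P[0] with T[i + 1]. What if we have already crossed the last occurrence, that is, the last occurrence of c in P lies to the right of the mismatch position? Aligning it with T[i] would move P to the left, so P is shifted right by one position instead.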

23. 23 Example of BM Algorithm

24. 24 Last-Occurrence Function: Preprocess the pattern P and the alphabet S to build the last-occurrence function L mapping S to integers, where L(c) is defined as the largest index i such that P[i] = c, or -1 if no such index exists. Example: S = {a, b, c, d}, P = abacab gives L(a) = 4, L(b) = 5, L(c) = 3, L(d) = -1. The last-occurrence function can be computed in time O(m + s), where m is the size of P and s is the size of S.
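A short sketch of the preprocessing step, assuming the common convention that L(c) = -1 when c does not occur in P. The function name `last_occurrence` and the choice of a dictionary to represent L are illustrative.

```python
from typing import Dict


def last_occurrence(p: str, alphabet: str) -> Dict[str, int]:
    """Map each symbol to the index of its last occurrence in p, or -1."""
    last = {c: -1 for c in alphabet}   # O(s): every symbol starts as absent
    for i, c in enumerate(p):          # O(m): later indices overwrite earlier ones
        last[c] = i
    return last
```

With S = {a, b, c, d} and P = abacab, `last_occurrence("abacab", "abcd")` maps a to 4, b to 5, c to 3 and d to -1. The two loops account for the O(m + s) bound.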

25. 25 The Boyer-Moore Algorithm
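A minimal sketch of a Boyer-Moore matcher that uses only the right-to-left scan and the simple bad character shift, assuming the common textbook formulation with a last-occurrence table. The name `boyer_moore_match` is illustrative, and symbols absent from P are treated as having last occurrence -1.

```python
from typing import Optional


def boyer_moore_match(t: str, p: str) -> Optional[int]:
    """Return the index of the first occurrence of p in t, or None."""
    n, m = len(t), len(p)
    if m == 0:
        return 0
    last = {c: k for k, c in enumerate(p)}   # last occurrence of each symbol in P
    i = j = m - 1                            # start comparing at the end of P
    while i < n:
        if t[i] == p[j]:
            if j == 0:
                return i                     # whole pattern matched
            i -= 1                           # keep scanning right to left
            j -= 1
        else:
            l = last.get(t[i], -1)           # -1 when T[i] does not occur in P
            # Align the last occurrence with T[i]; if it was already
            # crossed (l > j), this moves P right by exactly one position.
            i += m - min(j, 1 + l)
            j = m - 1
    return None
```

The expression `m - min(j, 1 + l)` covers all three cases of the bad character rule in one line.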

26. 26 Time complexity Analysis: Boyer-Moore's algorithm is significantly faster than the Brute-Force algorithm on English text. Pattern size = m, text size = n, alphabet size = s. Time complexity: O(nm + s) in the worst case. Example of worst case: T = aaaaaaaa...a, P = caaaaa.
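The worst case can be observed by counting character comparisons on the example input. The sketch below assumes the simple bad-character variant of Boyer-Moore and a non-empty pattern; the function name `bm_comparisons` and the sizes n = 20, m = 6 are chosen for illustration.

```python
from typing import Optional, Tuple


def bm_comparisons(t: str, p: str) -> Tuple[Optional[int], int]:
    """Return (match index or None, number of character comparisons)."""
    n, m = len(t), len(p)
    last = {c: k for k, c in enumerate(p)}
    comparisons = 0
    i = j = m - 1
    while i < n:
        comparisons += 1
        if t[i] == p[j]:
            if j == 0:
                return i, comparisons
            i -= 1
            j -= 1
        else:
            i += m - min(j, 1 + last.get(t[i], -1))
            j = m - 1
    return None, comparisons


n, m = 20, 6
t = "a" * n                    # T = aaaa...a
p = "c" + "a" * (m - 1)        # P = caaaaa
index, count = bm_comparisons(t, p)
assert index is None
assert count == m * (n - m + 1)   # m comparisons at each of n - m + 1 alignments
```

Every alignment matches the m - 1 trailing a's before failing on c, and the mismatched a has already been crossed in P, so the pattern advances by only one position each time.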

27. 27 Extended Boyer-Moore algorithm: Extended bad character shift rule: when a mismatch occurs at position i of P and the mismatched character in T is c, then shift P to the right so that the closest c to the left of position i in P is below the mismatched c in T. The original Boyer-Moore algorithm uses the simpler bad character rule.
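One way to sketch the extended rule is to store, for each symbol, the sorted list of its positions in P and look up the closest one to the left of the mismatch. The names `build_positions`, `extended_bad_char_shift` and `extended_bm_match` are hypothetical, and shifting P completely past the mismatch when no such occurrence exists is an assumption of this sketch.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, List, Optional


def build_positions(p: str) -> Dict[str, List[int]]:
    """For each symbol, the increasing list of its positions in p."""
    positions = defaultdict(list)
    for k, c in enumerate(p):
        positions[c].append(k)
    return positions


def extended_bad_char_shift(positions: Dict[str, List[int]], i: int, c: str) -> int:
    """Shift for a mismatch at pattern position i against text symbol c."""
    occ = positions.get(c, [])
    k = bisect_left(occ, i)        # count of occurrences strictly left of i
    if k == 0:
        return i + 1               # no c left of i: move P past the mismatch
    return i - occ[k - 1]          # align the closest c to the left of i


def extended_bm_match(t: str, p: str) -> Optional[int]:
    """Right-to-left scan using only the extended bad character rule."""
    n, m = len(t), len(p)
    positions = build_positions(p)
    s = 0                          # current start of P in T
    while s + m <= n:
        j = m - 1
        while j >= 0 and t[s + j] == p[j]:
            j -= 1
        if j < 0:
            return s
        s += extended_bad_char_shift(positions, j, t[s + j])
    return None
```

For P = abacab, a mismatch at position 4 against b shifts by 3 under this rule, whereas the simple rule can only shift by 1 because the last b (position 5) has already been crossed.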

28. 28 Extended Boyer-Moore algorithm Extended Bad character shift rule
