
Pattern Matching



1. Pattern Matching

2. Strings

A string is a sequence of characters.
Examples of strings: a Java program, an HTML document, a DNA sequence, a digitized image.
An alphabet Σ is the set of possible characters for a family of strings.
Examples of alphabets: ASCII, Unicode, {0, 1}, {A, C, G, T}.

Let P be a string of size m:
• A substring P[i .. j] of P is the subsequence of P consisting of the characters with ranks between i and j.
• A prefix of P is a substring of the form P[0 .. i].
• A suffix of P is a substring of the form P[i .. m - 1].

Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P.
Applications: text editors, search engines, biological research.

3. Brute-Force Algorithm

Algorithm BruteForceMatch(T, P)
  Input: text T of size n and pattern P of size m
  Output: starting index of a substring of T equal to P, or -1 if no such substring exists
  for i ← 0 to n - m  { test shift i of the pattern }
    j ← 0
    while j < m ∧ T[i + j] = P[j]
      j ← j + 1
    if j = m
      return i  { match at i }
    { else: mismatch, try the next shift }
  return -1  { no match anywhere }

• The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either a match is found, or all placements of the pattern have been tried.
• Brute-force pattern matching runs in time O(nm).
• Example of worst case: T = aaa … ah, P = aaah. This may occur in images and DNA sequences but is unlikely in English text.
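
For concreteness, here is the brute-force scheme as a runnable Python sketch (the function name and test string are illustrative, not from the transcript):

```python
def brute_force_match(text, pattern):
    """Return the starting index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):          # test each shift i of the pattern
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            return i                    # match at shift i
    return -1                           # no match anywhere

print(brute_force_match("abacaabaccabacabaabb", "abacab"))  # 10
```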

4. Boyer-Moore Heuristics

• The Boyer-Moore pattern matching algorithm is based on two heuristics.
• Looking-glass heuristic: compare P with a subsequence of T moving backwards.
• Character-jump heuristic: when a mismatch occurs at T[i] = c:
  • If P contains c, shift P to align the last occurrence of c in P with T[i].
  • Else, shift P to align P[0] with T[i + 1].

5. Last-Occurrence Function

• Boyer-Moore's algorithm preprocesses the pattern P and the alphabet Σ to build the last-occurrence function L, mapping Σ to integers, where L(c) is defined as:
  • the largest index i such that P[i] = c, or
  • -1 if no such index exists.
• Example: Σ = {a, b, c, d} and P = abacab give L(a) = 4, L(b) = 5, L(c) = 3, L(d) = -1.
• The last-occurrence function can be represented by an array indexed by the numeric codes of the characters.
• The last-occurrence function can be computed in time O(m + s), where m is the size of P and s is the size of Σ.
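
As a one-pass Python sketch (a dict plays the role of the array indexed by character codes):

```python
def last_occurrence(pattern, alphabet):
    """Map each alphabet character to its last index in pattern, or -1."""
    L = {c: -1 for c in alphabet}
    for i, c in enumerate(pattern):     # later indices overwrite earlier ones
        L[c] = i
    return L

print(last_occurrence("abacab", "abcd"))  # {'a': 4, 'b': 5, 'c': 3, 'd': -1}
```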

6. The Boyer-Moore Algorithm

Algorithm BoyerMooreMatch(T, P, Σ)
  L ← lastOccurrenceFunction(P, Σ)
  i ← m - 1
  j ← m - 1
  repeat
    if T[i] = P[j]
      if j = 0
        return i  { match at i }
      else
        i ← i - 1
        j ← j - 1
    else  { character-jump }
      l ← L[T[i]]
      i ← i + m - min(j, 1 + l)
      j ← m - 1
  until i > n - 1
  return -1  { no match }

The shift i ← i + m - min(j, 1 + l) unifies the two cases from the character-jump figure: Case 1: j ≤ 1 + l; Case 2: 1 + l ≤ j.
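
The same algorithm as a runnable Python sketch, with a dictionary standing in for the array-based last-occurrence function (characters absent from the pattern default to -1; the test string is illustrative):

```python
def boyer_moore_match(text, pattern):
    """Return the starting index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return -1
    last = {c: i for i, c in enumerate(pattern)}   # last-occurrence function
    i = j = m - 1
    while i <= n - 1:
        if text[i] == pattern[j]:
            if j == 0:
                return i                # match at i
            i -= 1
            j -= 1
        else:                           # character-jump heuristic
            l = last.get(text[i], -1)
            i += m - min(j, 1 + l)
            j = m - 1
    return -1

print(boyer_moore_match("abacaabadcabacabaabb", "abacab"))  # 10
```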

7. Example (figure-only slide)

8. Analysis

• Boyer-Moore's algorithm runs in time O(nm + s).
• Example of worst case: T = aaa … a, P = baaa.
• The worst case may occur in images and DNA sequences but is unlikely in English text.
• Boyer-Moore's algorithm is significantly faster than the brute-force algorithm on English text.

9. The KMP Algorithm - Motivation

• Knuth-Morris-Pratt's algorithm compares the pattern to the text left to right, but shifts the pattern more intelligently than the brute-force algorithm.
• When a mismatch occurs, what is the most we can shift the pattern so as to avoid redundant comparisons?
• Answer: the largest prefix of P[0..j] that is a suffix of P[1..j].

(Figure: the pattern abaaba mismatches the text at a character x after matching abaab; the shift aligns the pattern prefix ab under the just-matched suffix ab, so there is no need to repeat those comparisons, and comparing resumes at the mismatched text character.)

10. KMP Failure Function

• Knuth-Morris-Pratt's algorithm preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.
• The failure function F(j) is defined as the size of the largest prefix of P[0..j] that is also a suffix of P[1..j].
• Example: for P = abaaba, F(j) = 0, 0, 1, 1, 2, 3 for j = 0..5.
• Knuth-Morris-Pratt's algorithm modifies the brute-force algorithm so that if a mismatch occurs at P[j] ≠ T[i], we set j ← F(j - 1).

11. The KMP Algorithm

Algorithm KMPMatch(T, P)
  F ← failureFunction(P)
  i ← 0
  j ← 0
  while i < n
    if T[i] = P[j]
      if j = m - 1
        return i - j  { match }
      else
        i ← i + 1
        j ← j + 1
    else
      if j > 0
        j ← F[j - 1]
      else
        i ← i + 1
  return -1  { no match }

• The failure function can be represented by an array and can be computed in O(m) time.
• At each iteration of the while-loop, either i increases by one, or the shift amount i - j increases by at least one (observe that F(j - 1) < j).
• Hence, there are no more than 2n iterations of the while-loop.
• Thus, KMP's algorithm runs in optimal time O(m + n).

12. Computing the Failure Function

Algorithm failureFunction(P)
  F[0] ← 0
  i ← 1
  j ← 0
  while i < m
    if P[i] = P[j]
      { we have matched j + 1 characters }
      F[i] ← j + 1
      i ← i + 1
      j ← j + 1
    else if j > 0
      { use the failure function to shift P }
      j ← F[j - 1]
    else
      F[i] ← 0  { no match }
      i ← i + 1

• The construction is similar to the KMP algorithm itself.
• At each iteration of the while-loop, either i increases by one, or the shift amount i - j increases by at least one (observe that F(j - 1) < j).
• Hence, there are no more than 2m iterations of the while-loop, so the failure function can be computed in O(m) time.
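
Slides 11 and 12 combined into a self-contained Python sketch (the names are illustrative):

```python
def failure_function(pattern):
    """F[i] = size of the largest prefix of P[0..i] that is also a suffix of P[1..i]."""
    m = len(pattern)
    F = [0] * m
    i, j = 1, 0
    while i < m:
        if pattern[i] == pattern[j]:
            F[i] = j + 1                # we have matched j + 1 characters
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]                # use the failure function to shift P
        else:
            F[i] = 0
            i += 1
    return F

def kmp_match(text, pattern):
    """Return the starting index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    F = failure_function(pattern)
    i = j = 0
    while i < n:
        if text[i] == pattern[j]:
            if j == m - 1:
                return i - j            # match
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]
        else:
            i += 1
    return -1

print(failure_function("abaaba"))                    # [0, 0, 1, 1, 2, 3]
print(kmp_match("abacaabaccabacabaabb", "abacab"))   # 10
```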

13. Example (figure-only slide)

14. Binary Failure Function

• For your assignment, you are to implement the binary failure function. Since there are only two possible characters, when you fail at a character, you know what character the text held when you failed. Thus, you store the maximum number of characters that match the previous characters of the pattern AND the opposite of the current character.
• (The slide contrasts the binary failure function with the regular failure function; the example tables are not reproduced in the transcript.)
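
One way to read that definition, as a brute-force Python sketch (the helper names and the O(m²) approach are mine, not the assignment's, so your assignment's exact conventions may differ): on a mismatch at position j, the text character must have been the opposite of pattern[j], so BF[j] is the length of the longest pattern prefix that is a suffix of pattern[0..j-1] followed by that opposite character.

```python
def binary_failure(pattern):
    """BF[j]: longest pattern prefix that is a suffix of what the text actually
    held when a mismatch occurred at position j (alphabet {'0', '1'})."""
    m = len(pattern)
    opposite = {'0': '1', '1': '0'}
    BF = [0] * m
    for j in range(m):
        seen = pattern[:j] + opposite[pattern[j]]   # what the text actually held
        for k in range(j + 1, 0, -1):               # try longest prefixes first
            if pattern[:k] == seen[len(seen) - k:]:
                BF[j] = k
                break
    return BF

print(binary_failure("010010"))   # [0, 1, 0, 2, 1, 0]
```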

15. Tries: Basic Ideas

• Preprocess the fixed text rather than the pattern.
• Store strings in a tree, one character per node.
• Used in search engines, dictionaries, and prefix queries.
• Assumes a fixed alphabet with a canonical ordering.
• Uses a special character as a word terminator.

16. Tries Are Great If

• You are doing word matching (you know where each word begins).
• The text is large, immutable, and searched often.
• Web crawlers (for example) can afford to preprocess text ahead of time, knowing that MANY people will want to search the contents of all web pages.

17. Facts

• Prefixes of length i stop in level i.
• # leaves = # strings (words in the text).
• A trie is a multi-way tree, used similarly to the way we use a binary search tree.
• Tree height = length of the longest word.
• Tree size is O(combined length of all words).
• Insertion and search work as in multi-way ordered trees: O(word length).
• Supports word matching, not substring matching.
• Could use 27-ary trees instead.
• Exclude stop words from the trie, since we won't search for them.
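
A minimal dict-based trie sketch in Python; the END key plays the role of the special word-terminator character, and the word list is illustrative:

```python
END = "$"   # special word-terminator marker

def trie_insert(root, word):
    """Insert word into the trie, one character per node level."""
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = True

def trie_search(root, word):
    """Return True iff word was inserted as a whole word."""
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

root = {}
for w in ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]:
    trie_insert(root, w)
print(trie_search(root, "bull"), trie_search(root, "bu"))   # True False
```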

18. Trie Example (figure-only slide)

19. Compressed Tries

• When a node has only one child, space is wasted, so store substrings at the nodes (chains of single-child nodes collapse into one edge).
• Then the tree size is O(s), where s is the number of words.

20. Compressed Tries with Indexes

• Storing ranges of indexes into the original text, instead of the substrings themselves, avoids variable-length strings at the nodes.

21. Suffix Tries

• A suffix trie is the trie of all suffixes of a string.
• Used for substring matching, not just full words.
• Used in pattern matching: a substring is the prefix of a suffix (all words are from the same string).
• Changes a linear search for the beginning of the pattern into a tree search.

22. Suffix Tries Are Efficient

• In space: O(n) rather than O(n²), because in the compressed, index-based form each character of the text needs to appear only once.
• In time: O(dn) to construct and O(dm) to use, where d is the size of the alphabet.
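
A naive, non-compressed suffix trie sketch in Python for illustration; it uses O(n²) space, whereas the compressed index-based representation achieves the O(n) bound above:

```python
def build_suffix_trie(text):
    """Naive suffix trie: insert every suffix of text (O(n^2) space)."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

def contains_substring(root, pattern):
    """A substring of the text is a prefix of some suffix: walk down the trie."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("minimize")
print(contains_substring(trie, "nimi"), contains_substring(trie, "mix"))  # True False
```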

23. Search Engines

• An inverted index (inverted file) has words as keys and occurrence lists (web pages) as values: access by content.
• Also called a concordance; stop words are omitted.
• A trie can be used effectively as the index.
• Multiple keywords return the intersection of their occurrence lists.
• Keeping occurrence lists as sequences in a fixed order allows intersection by merging.
• Ranking the results is the major challenge.

24. Text Compression and Similarity

A. Text Compression
1. Text characters are encoded as binary integers; different encodings may result in more or fewer bits to represent the original text.
  a. Compression is achieved by using variable-size encodings rather than fixed-size ones (e.g. ASCII or Unicode).
  b. Compression is valuable for reduced-bandwidth communication and for minimizing storage space.

25. Huffman Encoding

• a. Shorter encodings for more frequently occurring characters.
• b. Prefix code: no code may be a prefix of another.
• c. Most useful when character frequencies differ widely.
• The encoding may change from text to text, or may be defined for a class of texts, like Morse code.

26. The Huffman Algorithm Uses Binary Trees

• Start with an individual tree for each character, storing the character and its frequency at the root.
• Iteratively merge the two trees with the smallest frequencies at the root, writing the sum of the children's frequencies at each new internal node.
• This is a greedy algorithm.
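
A Python sketch of this greedy construction using a heap (the counter tie-breaker simply keeps tuple comparisons well-defined; assigning 0 to left branches and 1 to right branches is one common convention):

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(text):
    """Build Huffman codes greedily: repeatedly merge the two lightest trees."""
    tiebreak = count()                       # avoids comparing tree nodes directly
    heap = [(freq, next(tiebreak), ch) for ch, freq in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two smallest frequencies...
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))  # ...merged
    codes = {}
    def walk(node, code):
        if isinstance(node, tuple):          # internal node: recurse on children
            walk(node[0], code + "0")
            walk(node[1], code + "1")
        else:                                # leaf: a single character
            codes[node] = code or "0"
    walk(heap[0][2], "")
    return codes

print(huffman_codes("a fast runner need never be afraid of the dark"))
```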

27. Complexity

• The complexity is O(n + d log d), where the text has n characters, d of them distinct.
• The n term is for processing the text to calculate frequencies.
• The d log d term is the cost of heaping the frequency trees, then iteratively removing two, merging them, and inserting one.

28. Text Similarity

Detect similarity in order to focus on, or ignore, slight differences:
a. DNA analysis
b. Web crawlers omit duplicate pages and distinguish between similar ones
c. Updated files, archiving, delta files, and editing distance

29. Longest Common Subsequence

One measure of the similarity of two texts is the length of their longest common subsequence (LCS). This is NOT a contiguous substring, so it loses a great deal of structure. I doubt that it is an effective metric for similarity unless the subsequence is a substantial part of the whole text.

30. The LCS Algorithm Uses Dynamic Programming

How do we write LCS in terms of other LCS problems? The parameters for the smaller problems, which are composed to solve a larger problem, are the lengths of a prefix of X and a prefix of Y.

31. Find the Recursion

Let L(i, j) be the length of the LCS between the prefixes X(0..i) and Y(0..j). Suppose we know L(i, j), L(i+1, j), and L(i, j+1), and want to know L(i+1, j+1):
a. If X[i+1] = Y[j+1], then it is L(i, j) + 1.
b. If X[i+1] ≠ Y[j+1], then it is max(L(i, j+1), L(i+1, j)).

32. (figure-only slide)

33. (figure-only slide)

34. This algorithm initializes the array (table) for L by putting 0's along the borders, and is then a simple nested loop filling in the values row by row. Thus it runs in O(nm). While the algorithm only gives the length of the LCS, the actual string can easily be found by working backward through the table (and the strings), noting the points at which the two characters are equal.
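
The table-filling and backward walk as a Python sketch (here L[i][j] refers to the prefixes X[:i] and Y[:j], so the 0 borders are row 0 and column 0):

```python
def lcs(X, Y):
    """Return (length, string) of a longest common subsequence of X and Y."""
    n, m = len(X), len(Y)
    L = [[0] * (m + 1) for _ in range(n + 1)]   # 0's along the borders
    for i in range(1, n + 1):                   # fill in values row by row
        for j in range(1, m + 1):
            if X[i - 1] == Y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    # Work backward through the table to recover the actual subsequence.
    out, i, j = [], n, m
    while i > 0 and j > 0:
        if X[i - 1] == Y[j - 1]:                # characters equal: part of the LCS
            out.append(X[i - 1])
            i -= 1
            j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return L[n][m], "".join(reversed(out))

print(lcs("mane", "mean"))   # (3, 'man')
```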

35. Sequence Comparisons

(This material is not in your text, except as exercises.)
• Problems in molecular biology involve finding the minimum number of edit steps required to change one string into another.
• Three types of edit steps: insert, delete, replace.
• Example: abbc → babb
  abbc → bbc → bbb → babb (3 steps)
  abbc → babbc → babb (2 steps)
• We are trying to minimize the number of steps.

36. • Idea: look at making just one position right, and find all the ways you could do it.
• Count how long each would take and recursively figure the total cost.
• This is an orderly way of limiting the exponential number of combinations to consider.
• For ease in coding, we make the last character right (rather than any other).

37. There are four possibilities (pick the cheapest)

C(n, m) is the cost of changing the first n characters of A into the first m characters of B.
1. If we delete a_n, we still have to change A(0..n-1) to B(0..m). The cost is C(n, m) = C(n-1, m) + 1.
2. If we insert a new value at the end of A to match b_m, we still have to change A(0..n) to B(0..m-1). The cost is C(n, m) = C(n, m-1) + 1.
3. If we replace a_n with b_m, we still have to change A(0..n-1) to B(0..m-1). The cost is C(n, m) = C(n-1, m-1) + 1.
4. If we match a_n with b_m (they are equal), we still have to change A(0..n-1) to B(0..m-1). The cost is C(n, m) = C(n-1, m-1).

38. • We have turned one problem into three slightly smaller problems.
• That is a bad situation, unless we can reuse results: dynamic programming.
• We store the results of C(i, j) for i = 1..n and j = 1..m.
• If we need to reconstruct how we would achieve the change, we store both the cost and an indication of which set of subproblems was used.
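
A Python sketch of the table version that stores both the cost table C and the decision table M, so the sequence of edit steps can be reconstructed (as on the next slides, a match is free and the other steps cost 1):

```python
def edit_distance(A, B):
    """Minimum insert/delete/replace steps to change A into B; matches are free."""
    n, m = len(A), len(B)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    M = [[""] * (m + 1) for _ in range(n + 1)]   # which decision won at (i, j)
    for i in range(1, n + 1):
        C[i][0], M[i][0] = i, "delete"           # change A[:i] into the empty string
    for j in range(1, m + 1):
        C[0][j], M[0][j] = j, "insert"           # change the empty string into B[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = A[i - 1] == B[j - 1]
            C[i][j], M[i][j] = min([
                (C[i - 1][j] + 1, "delete"),
                (C[i][j - 1] + 1, "insert"),
                (C[i - 1][j - 1] + (0 if same else 1),
                 "match" if same else "replace"),
            ])
    # Walk M backward to recover the sequence of operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        op = M[i][j]
        ops.append(op)
        if op == "delete":
            i -= 1
        elif op == "insert":
            j -= 1
        else:                                    # match or replace
            i, j = i - 1, j - 1
    return C[n][m], list(reversed(ops))

print(edit_distance("do", "redo"))     # (2, ['insert', 'insert', 'match', 'match'])
print(edit_distance("mane", "mean"))   # (2, ['match', 'insert', 'match', 'match', 'delete'])
```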

39. • We also store M(i, j), which indicates which of the four decisions led to the best result.
• Complexity: O(mn), but it needs O(mn) space as well.
• Consider changing "do" to "redo", and changing "mane" to "mean" (next two slides).

40. Changing "do" to "redo"

Assume a match is free; the other steps each cost 1. (The slide's cost table is not reproduced in the transcript.)

41. Changing "mane" to "mean"

(The slide's cost table is not reproduced in the transcript.)

42. Longest Increasing Subsequence

Find the longest increasing subsequence in a sequence of distinct integers.

Idea 1: recursion. Given a sequence of size less than m, we can find its longest increasing subsequence. The problem is that we don't know how to extend it when the next element arrives:
• Case 1: the element either can be appended to the longest subsequence or not.
• Case 2: it may be appendable to a non-selected subsequence, creating a sequence of equal length but with a smaller ending point.
• Case 3: it may be appendable to a non-selected subsequence, creating a sequence of smaller length whose successors make it a good choice.
Example: 5 1 10 2 20 30 40 4 5 6 7 8 9 10 11

43. Idea 2: Given a sequence of size < m, we know how to find all the longest increasing subsequences.
• Hard: there are many, and we would need them for all lengths.

44. Idea 3: Given a sequence of size < m, we can find the longest increasing subsequence with the smallest ending point.
• Not enough: we might have to create a smaller subsequence before we create a longer one.

45. Idea 4: Given a sequence of size < m, we can find the best increasing sequence (BIS), the one with the smallest ending point, for every length k < m - 1.
• For each new item in the sequence: when we add it to the sequence of length 3, will the result be better than the current sequence of length 4?

46. For s = 1 to n (or recursively the other way):
  For k = s downto 1, until the correct spot is found:
    If BIS(k) > A_s and BIS(k-1) < A_s:
      BIS(k) = A_s

47. Actually, we don't need the sequential search, since the BIS endings are increasing in k: we can do a binary search.

Example sequence: 5 1 10 2 12 8 15 18 45 6 7 3 8 9

After processing the whole sequence:

Length  BIS ending value
1       1
2       2
3       3
4       7
5       8
6       9

• Outputting the actual subsequence would be difficult, as you don't know where the sequence is; you would have to reconstruct it.
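
A Python sketch using bisect for the binary search; bis[k-1] holds the smallest possible ending value over all increasing subsequences of length k, which is exactly the BIS table above:

```python
from bisect import bisect_left

def lis_length(seq):
    """Length of the longest increasing subsequence (distinct integers), O(n log n)."""
    bis = []                            # bis[k-1] = smallest ending value of a length-k BIS
    for x in seq:
        pos = bisect_left(bis, x)       # binary search for the correct spot
        if pos == len(bis):
            bis.append(x)               # x extends the longest sequence so far
        else:
            bis[pos] = x                # x gives a better (smaller) ending value
    return len(bis)

print(lis_length([5, 1, 10, 2, 12, 8, 15, 18, 45, 6, 7, 3, 8, 9]))  # 6
print(lis_length([8, 1, 4, 2, 9, 10, 3, 5, 14, 11, 12, 7]))         # 6 (slide 48's exercise)
```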

48. Try: 8 1 4 2 9 10 3 5 14 11 12 7

49. Probabilistic Algorithms

• Suppose we wanted to find a number that is greater than the median (the number for which half are bigger).
• We could sort the numbers in O(n log n) and then select one.
• We could find the biggest, but stop looking half way through: O(n/2).
• We cannot guarantee a number in the upper half with fewer than n/2 comparisons.
• What if you just wanted good odds?
• Pick two numbers and take the larger one. What is the probability that it is in the lower half?

50. There are four equally likely possibilities:
• both are in the lower half
• the first is lower, the other higher
• the first is higher, the other lower
• both are in the higher half

We lose only when both are in the lowest half, so we will be right 75% of the time!
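
The idea generalizes: pick k elements at random and keep the maximum; this fails only when all k picks land in the lower half, i.e. with probability about (1/2)^k, and k = 2 gives the 75% above. A small Python simulation (the data set and trial count are arbitrary choices of mine):

```python
import random

def probably_above_median(numbers, k=2):
    """Return the max of k random picks; wrong with probability ~(1/2)**k."""
    return max(random.sample(numbers, k))

# Empirical check: how often does the k=2 pick land in the upper half?
data = list(range(1000))                 # median is 499.5; upper half is >= 500
trials = 100_000
hits = sum(probably_above_median(data) >= 500 for _ in range(trials))
print(hits / trials)                     # close to 0.75
```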
