
Avi Gavlovski



Presentation Transcript


  1. An extra push for assignment 5. In this document: Understanding Aho-Corasick: slides 2-9; Hints & Tips (updated): slides 10-12. Avi Gavlovski

  2. First, let's recall what brute-force matching does. Given a text T of length n and a word W of length m, let T[a:b] denote the substring of T from index a to index b inclusive. Brute force says: compare T[0:m-1] with W. If that fails, compare T[1:m] with W. If that fails, compare T[2:m+1] with W. … If that fails, compare T[n-m:n-1] with W. If that fails, return false.
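
The sequence of comparisons above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `brute_force_find` is an assumption, and Python's half-open slice `text[k:k + m]` plays the role of the inclusive T[k:k+m-1] used on the slide.

```python
def brute_force_find(text: str, word: str) -> bool:
    """Return True if word occurs somewhere in text (hypothetical helper)."""
    n, m = len(text), len(word)
    # Try every start position k = 0 .. n-m, as on the slide.
    for k in range(n - m + 1):
        # Compare the window of length m starting at k with the word.
        if text[k:k + m] == word:
            return True
    # Every window failed.
    return False


print(brute_force_find("abbabaabbabb", "abbabb"))
```

Each window comparison can cost up to m letter comparisons, which is the repeated work that KMP avoids.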

  3. However, consider the following situation, where W = abbabb: [figure not preserved: W aligned against T, with a mismatch marked]. Brute force would say, start over like this: [figure not preserved]. KMP would say, “but we had already seen the next 4 letters! We could have noticed the following match:” [figure not preserved]

  4. In the situation above, we were comparing index 5 within T against index 5 within W. KMP says, “so we could have resumed like this:” [figure not preserved]. We now compare index 5 within T against index 2 within W. So when a comparison fails, KMP tells us to stay at the SAME index within T as before and to change only our index within W (we move back within W). The one exception is a mismatch at index 0 within W, in which case we just increment our index within T by 1.

  5. The failure function f tells us the index within W to which we move back when there is a mismatch (failure). If we were comparing index i within T against index j within W, we know that T[i-1] = W[j-1], T[i-2] = W[j-2], …, and T[i-j] = W[0]. Therefore the failure function, however we define it, does NOT need to depend on T[i-j:i-1], since T[i-j:i-1] = W[0:j-1]. This means we can precompute f from W alone, before even looking at T. Now what should the value be when we fail at index j? We had matched T[i-j:i-1] with W[0:j-1], but the next index failed. In the example, we failed after seeing abbab, but we noticed that ab, which is a suffix of bbab, is also a prefix of W.

  6. But we need to keep as much of the already-matched text as possible, in order to guarantee that we don't miss anything. For example, if we were searching for abababa in the following situation: [figure not preserved]. It would be dangerous to move back to [figure not preserved]: we could miss something! So we need to take the LONGEST suffix of W[1:j-1] which is a prefix of W.

  7. So f(j-1), which again is the index within W to which we move j back if we fail at index j, is the length of the longest suffix of W[1:j-1] which is a prefix of W. (That length is also the position within W at which we resume comparing.) So the failure function for aabaab, for example, would be:
     f(0) = 0   longest is null
     f(1) = 1   longest is a
     f(2) = 0   longest is null
     f(3) = 1   longest is a
     f(4) = 2   longest is aa
     f(5) = 3   longest is aab
     The pseudocode for KMP would thus be:
     start i and j at 0
     while (i < n)
         if (T[i] = W[j]) then
             increment both i and j
             if j = m, we have found the word (stop, or set j = f(m-1) to keep searching)
         else if j = 0 then i++
         else j = f(j-1)
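
The failure table and the search loop from this slide could be written in Python roughly as follows. This is a sketch under assumptions, not the assignment's reference solution: the names `failure_function` and `kmp_find` are invented here, and the table is built with the standard incremental method rather than by checking each suffix directly.

```python
def failure_function(word: str) -> list[int]:
    """f[k] = length of the longest proper suffix of word[0..k] that is a prefix of word."""
    f = [0] * len(word)
    j = 0  # length of the prefix currently matched against the suffix ending at k
    for k in range(1, len(word)):
        # Fall back through shorter candidates until the next letter fits.
        while j > 0 and word[k] != word[j]:
            j = f[j - 1]
        if word[k] == word[j]:
            j += 1
        f[k] = j
    return f


def kmp_find(text: str, word: str) -> int:
    """Return the start index of the first occurrence of word in text, or -1."""
    if not word:
        return 0
    f = failure_function(word)
    i = j = 0
    while i < len(text):
        if text[i] == word[j]:
            i += 1
            j += 1
            if j == len(word):
                return i - j  # the match started j letters back
        elif j == 0:
            i += 1            # mismatch at index 0 of the word: advance in the text
        else:
            j = f[j - 1]      # stay at the same text index, move back within the word


    return -1


print(failure_function("aabaab"))          # the table from the slide
print(kmp_find("abbabaabbabb", "abbabb"))
```

Note that the text index `i` never decreases, which is what keeps the scan linear in the length of the text.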

  8. Aho-Corasick extends this to allow us to search for multiple words, by making the domain of f the set of positions (nodes) within the trie defined by the set of words, instead of just the indices within a single word. Then, instead of keeping j, which was the index in W, we keep our current node in the trie. We still use the longest-suffix rule: if we fail on making a transition from a node N to a child, we transition to the node M whose string (the string obtained by walking down from the root of the trie until hitting M) is the longest prefix in the trie, that is, the node farthest from the root, which is also a suffix of the string we had matched when we failed (ignoring its first letter, so that the suffix is proper). The picture at the bottom of page 2 of the assignment write-up should make it very easy to see how this extension works. Tips start on slide 10.

  9. The failure function for Aho-Corasick: given a node N in the trie, we need to find the longest proper suffix (proper means dropping at least the first letter) of the string represented by N that is itself in the trie. The node for that string is the failure transition for N. For example, look at the picture on page 2 of the assignment, at node ATATATA. The proper suffixes of this string are TATATA, ATATA, TATA, ATA, TA, and A. We need to take the longest one which is in the trie. Note that TATATA is not in the trie, but ATATA is, so this is the longest proper suffix of ATATATA which is in the trie, and thus the failure transition for ATATATA is ATATA. Computing the failure transitions could thus be done by just checking each proper suffix, from longest to shortest, and taking the longest one which is in the trie. BUT this won't be linear, so this isn't how we actually implement it (see the Hints section, starting on the next slide, for more on this).
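
To make the definition concrete, here is a deliberately naive Python sketch of the suffix-checking approach described above, the one the slide warns is not linear. The function name `naive_failure_links` and the keyword set in the example are assumptions made for illustration (the assignment's actual keywords are not given here); trie nodes are represented simply by the strings that lead to them, with the empty string as the root.

```python
def naive_failure_links(keywords: list[str]) -> dict[str, str]:
    """Map each non-root trie node (named by its string) to its failure target."""
    # Every trie node corresponds to one prefix of some keyword; "" is the root.
    nodes = {kw[:k] for kw in keywords for k in range(len(kw) + 1)}
    links = {}
    for s in nodes:
        if s == "":
            continue  # the root has no failure transition
        # Proper suffixes from longest to shortest; the last one, "", is the root.
        for start in range(1, len(s) + 1):
            if s[start:] in nodes:
                links[s] = s[start:]
                break
    return links


# Hypothetical keyword set, chosen so that ATATA is in the trie but TATATA is not.
links = naive_failure_links(["ATATATA", "TATAT"])
print(links["ATATATA"])
```

For a node at depth d this checks up to d suffixes, each costing up to d letters to build and look up, which is why the Hints section uses an inductive construction instead.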

  10. HINTS & TIPS page 1
      1) Precomputation:
      1.1) Building a simple trie. You might want to start by just building a simple trie containing the keywords, where only nodes that contain a word are marked as accepting states. Just remember that some search keywords might be prefixes of others, so non-leaf nodes can also store words.
      1.2) Computing the failure function. As said in the notes, you want to do this in length-lex or BFS order. To compute the failure transition f(N) for a node N reachable [1] by letter L, follow the failure transitions starting from N's parent (which, by induction, we know have already been computed) until you reach a node from which you can make a forward transition on letter L. The node that you arrive at through this forward transition is f(N). If you reach the root and it has no such transition, f(N) is the root; the children of the root also fail to the root.
      [1] Node N is 'reachable' by letter L if the forward transition from N's parent to N is on letter L.
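
Hints 1.1 and 1.2 might look like this in Python. It is a minimal sketch under assumptions: nodes are integer ids with the root as 0, `children[v]` is a dict from letter to child id, and the names `build_trie` and `failure_links` are invented for this example. The keywords in the usage lines are likewise just a small illustrative set.

```python
from collections import deque


def build_trie(keywords):
    """Return (children, word_at); word_at[v] is the keyword ending at node v, or None."""
    children = [{}]
    word_at = [None]
    for kw in keywords:
        node = 0
        for letter in kw:
            if letter not in children[node]:
                children.append({})
                word_at.append(None)
                children[node][letter] = len(children) - 1
            node = children[node][letter]
        word_at[node] = kw  # non-leaf nodes can store words too
    return children, word_at


def failure_links(children):
    """Compute fail[v] for every node in BFS order."""
    fail = [0] * len(children)
    queue = deque(children[0].values())  # children of the root fail to the root
    while queue:
        parent = queue.popleft()
        for letter, node in children[parent].items():
            # Follow failure transitions from the parent until a forward
            # transition on `letter` exists, or until we reach the root.
            f = fail[parent]
            while f != 0 and letter not in children[f]:
                f = fail[f]
            fail[node] = children[f].get(letter, 0)
            queue.append(node)
    return fail


children, word_at = build_trie(["he", "she", "his", "hers"])
print(failure_links(children))
```

BFS order guarantees that every node shallower than N already has its failure transition when N is processed, which is the induction the hint relies on.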

  11. HINTS AND TIPS page 2
      1.3) Computing accepting states. Remember: some search words can be substrings of others. If our accepting states were just the nodes that contain the keywords, then on reaching one of these nodes we would also have to update all the keywords which are substrings of the completed keyword. Don't try hacking a solution using this representation (with only keyword nodes as accepting states): it would be hard, and even if you did manage to do it, it wouldn't be linear. You will also want to do this in an 'inductive way', in length-lex or BFS order, as we did to get the failure transitions. What you want is for each node N to contain a set of words S(N), which are the words whose counts need to be incremented when we reach that node. To initialize this, every node which contains a keyword K starts out with S = {K}, and every node which doesn't contain a keyword starts out with S = empty. Then, in length-lex order, when we get to node N, we set S(N) = S(N) union S(f(N)) and proceed to the next node.
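
The update S(N) = S(N) union S(f(N)) can be sketched as below. The function name `merge_outputs` and the table layout (`children`, `fail`, `word_at`, with node 0 as the root) are assumptions for this example, and the small automaton for the keywords a, ba, cba is written out by hand so the block stands on its own.

```python
from collections import deque


def merge_outputs(children, fail, word_at):
    """Return out, where out[v] is the set S(v) of words to count on reaching v."""
    # Initialise: {K} for a node holding keyword K, empty otherwise.
    out = [set() if w is None else {w} for w in word_at]
    queue = deque([0])
    while queue:
        node = queue.popleft()
        # f(node) is shallower than node, so in BFS order S(f(node)) is already final.
        out[node] |= out[fail[node]]
        queue.extend(children[node].values())
    return out


# Hand-built tables for the hypothetical keywords "a", "ba", "cba".
# Nodes: 0 root, 1 "a", 2 "b", 3 "ba", 4 "c", 5 "cb", 6 "cba".
children = [{"a": 1, "b": 2, "c": 4}, {}, {"a": 3}, {}, {"b": 5}, {"a": 6}, {}]
fail = [0, 0, 0, 1, 0, 2, 3]
word_at = [None, "a", None, "ba", None, None, "cba"]
print(merge_outputs(children, fail, word_at)[6])
```

Node "cba" picks up "a" only because S("ba") had already absorbed it, which is why the order of processing matters. Whether copying sets like this meets the assignment's linearity requirement is not something this sketch establishes.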

  12. HINTS AND TIPS page 3
      2) Traversing the text. Remember that this is not a regular finite state machine: if you cannot make a forward transition from node N on the next text letter L, you need to follow failure transitions until you can make a forward transition on L, or until you hit the root. Also remember that all you have to do when you reach an accepting state is update the counts of the words stored at that state.
      3) Other tips: think hard about what representation and data structures you will need for each component of the process. Remember that precomputation needs to be linear in the total size of the input keywords, and traversal needs to be linear in the total size of the input (keywords + text), so if you try to brute force any part of this process, you will fail some of the tests. Good luck!
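
Putting the hints together, a compact end-to-end sketch in Python might look like the following. The function name `count_keywords`, the integer node ids, and the dict-per-node representation are all assumptions chosen for brevity; they are one possible reading of the hints, not the required design, and the sketch makes no claim about passing the assignment's timing tests.

```python
from collections import Counter, deque


def count_keywords(keywords, text):
    """Count occurrences (including overlapping ones) of each keyword in text."""
    # 1.1) Build the trie; out[v] starts as the set of keywords ending at v.
    children, out = [{}], [set()]
    for kw in keywords:
        node = 0
        for letter in kw:
            if letter not in children[node]:
                children.append({})
                out.append(set())
                children[node][letter] = len(children) - 1
            node = children[node][letter]
        out[node].add(kw)

    # 1.2) and 1.3) Failure transitions and word sets, both in BFS order.
    fail = [0] * len(children)
    queue = deque(children[0].values())
    while queue:
        parent = queue.popleft()
        for letter, node in children[parent].items():
            f = fail[parent]
            while f != 0 and letter not in children[f]:
                f = fail[f]
            fail[node] = children[f].get(letter, 0)
            out[node] |= out[fail[node]]
            queue.append(node)

    # 2) Traverse the text: follow failure transitions until a forward
    # transition on the current letter exists, or until we hit the root.
    counts = Counter()
    node = 0
    for letter in text:
        while node != 0 and letter not in children[node]:
            node = fail[node]
        node = children[node].get(letter, 0)
        counts.update(out[node])  # update the counts of the words stored here
    return counts


print(count_keywords(["he", "she", "his", "hers"], "ushers"))
```

As in KMP, the position in the text only moves forward; a failure changes the current node, never the text index.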
