
A Fast Algorithm for Multi-Pattern Searching

A Fast Algorithm for Multi-Pattern Searching. Sun Wu, Udi Manber. Tech. Rep. TR94-17, Department of Computer Science, University of Arizona, May 1994. Outline: Introduction; Boyer-Moore algorithm review; Fast algorithm for Multi-Pattern Search; Preprocessing Stage; Scanning Stage; Performance.


Presentation Transcript


  1. A Fast Algorithm for Multi-Pattern Searching • Sun Wu, Udi Manber • Tech. Rep. TR94-17, Department of Computer Science, University of Arizona, May 1994

  2. Outline • Introduction • Boyer-Moore algorithm review • Fast algorithm for Multi-Pattern Search • Preprocessing Stage • Scanning Stage • Performance • Experiments • Conclusion

  3. Introduction • Goal: an algorithm that finds all occurrences of all the patterns of P in T. • Let P = {p1, p2, ..., pk} be the set of patterns, which are strings of characters from a fixed alphabet Σ. • Let T = t1, t2, ..., tN be a large text, consisting of characters from Σ.

  4. Boyer-Moore algorithm review • Symbols used: • Σ: the alphabet • patlen: the length of the pattern • m: the number of characters at the end of the pattern that have already matched • char: the mismatched text character

  5. Bad Character Heuristic • Observation 1: • If char does not occur in pat: • Pattern shift: j characters, where j is the pattern position of the mismatch • String pointer shift: patlen characters • Example: text: ......a c d a b b a c d e a f e c a........ pat: a b c e

  6. Bad Character Heuristic (cont.) • Observation 2: • If char occurs in the pattern: • Let δ1[char] be the position of the rightmost occurrence of char in the pattern, and let j be the current position of the pattern pointer. • If j < δ1[char], we shift the pattern right by 1. • If j > δ1[char], we shift the pattern right by j - δ1[char]. • We call δ1 the SHIFT table.

  7. Bad Character Heuristic (cont.) • Example, text: ......A C F D B A D A E C A D A E....... • Case j < δ1[char]: pattern D A E C E C A, δ1[A] = 7 and j = 4, so we shift the pattern right by 1. • Case j > δ1[char]: pattern D A E C E C, δ1[A] = 2 and j = 4, so we shift the pattern right by j - δ1[A] = 2.
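The two observations can be combined into one small lookup. The following is a minimal sketch in C, assuming 1-based pattern positions and δ1[char] = 0 for characters that do not occur in the pattern; the names `build_delta1` and `bad_char_shift` are illustrative and do not come from the slides.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* delta1[c] = 1-based position of the rightmost occurrence of c in pat,
   or 0 if c does not occur (a convention assumed for this sketch). */
void build_delta1(const char *pat, int delta1[UCHAR_MAX + 1])
{
    size_t patlen = strlen(pat);
    for (int c = 0; c <= UCHAR_MAX; c++)
        delta1[c] = 0;
    for (size_t k = 0; k < patlen; k++)
        delta1[(unsigned char)pat[k]] = (int)(k + 1);  /* later positions overwrite earlier ones */
}

/* Pattern shift after a mismatch at 1-based pattern position j
   against the text character ch. */
int bad_char_shift(const int delta1[UCHAR_MAX + 1], int j, char ch)
{
    int d = delta1[(unsigned char)ch];
    assert(j >= 1 && d != j);   /* pat[j] mismatched ch, so ch cannot sit at position j */
    if (d == 0)
        return j;               /* Observation 1: ch is not in the pattern */
    if (j < d)
        return 1;               /* Observation 2, rightmost ch lies to the right of j */
    return j - d;               /* Observation 2, align the rightmost ch with the text */
}
```

With the patterns from the example, `bad_char_shift` gives 1 for D A E C E C A and 2 for D A E C E C when j = 4 and the text character is A.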

  8. Multi-Pattern Searching • Instead of looking at the characters of the text one by one, we consider them in blocks of size B. • A good value of B is on the order of log_c(2M); in practice, we use either B = 2 or B = 3. • M is the total size of all patterns. • c is the size of the alphabet.
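To see why B = 2 or B = 3 is typical, the block size can be computed directly. This is a small sketch, assuming B is taken as the smallest integer with c^B >= 2M (that is, log_c(2M) rounded up); the function name `block_size` is hypothetical.

```c
#include <assert.h>

/* Smallest B such that c^B >= 2*M, i.e. log_c(2M) rounded up.
   Assumes the values are small enough that c^B does not overflow. */
int block_size(unsigned long c, unsigned long M)
{
    assert(c >= 2 && M >= 1);
    unsigned long target = 2 * M;
    unsigned long power = 1;    /* c^B for the current B */
    int B = 0;
    while (power < target) {
        power *= c;
        B++;
    }
    return B;
}
```

For a byte alphabet (c = 256), a pattern set of total size 1,000 gives B = 2 and one of total size 100,000 gives B = 3.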

  9. Multi-Pattern Searching (cont.) • The preprocessing stage builds three tables for the set of patterns: • SHIFT table: like Boyer-Moore's shift table, with slight differences. • HASH table and PREFIX table: used when the shift value is 0.

  10. Preprocessing Stage • First compute the minimum length m of a pattern, and consider only the first m characters of each pattern. • The SHIFT table has an entry for every possible string of size B. • The table size is c^B. • We can use a hash function to compress the table.

  11. SHIFT table • Let X = x1x2...xB be the B characters of the text currently being examined, and let X be mapped to the i-th entry of the SHIFT table. • Case 1: X does not appear as a substring in any pattern of P, so we can shift the text by m-B+1 characters. • Example (text in the slide's figure: D A B C A D B A A B): m = 4 and B = 2, so we shift by m-B+1 = 3.

  12. SHIFT table (cont.) • Case 2: X appears in some patterns (text in the slide's figure: G A B C A D C E B D): • We find the rightmost occurrence of X in any of the patterns. • Suppose X ends at position q of pattern Pj, and X does not end at any position greater than q in any other pattern. • We shift the text by m-q characters, so SHIFT[i] = m-q.

  13. SHIFT table (cont.) • The values in the SHIFT table are the largest possible safe values for shifts. • To build the table, we pre-scan all of the patterns: for each block of size B ending at position j of a pattern, we set the SHIFT value to min(current value, m-j). • The initial value of every entry is m-B+1. • Several different strings may be mapped into the same entry.
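The pre-scan described above can be sketched as follows. This is an illustrative version, not the authors' code: it assumes B = 2 and a byte alphabet, and it indexes the table directly by the two bytes of the block (c^B = 65536 entries) instead of using a compressing hash. The names `block_index` and `build_shift` are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define B 2
#define TABLE_SIZE 65536   /* c^B with c = 256, one entry per 2-byte block */

/* Table index of the block whose last character is s[end] (0-based). */
static unsigned block_index(const char *s, size_t end)
{
    return ((unsigned)(unsigned char)s[end - 1] << 8) | (unsigned char)s[end];
}

/* Fill shift[] for k patterns, using only the first m characters of each. */
void build_shift(const char *const pats[], size_t k, size_t m,
                 int shift[TABLE_SIZE])
{
    assert(m >= B);
    for (size_t i = 0; i < TABLE_SIZE; i++)
        shift[i] = (int)(m - B + 1);              /* initial value: block not in any pattern */
    for (size_t p = 0; p < k; p++) {
        assert(strlen(pats[p]) >= m);
        for (size_t j = B; j <= m; j++) {         /* j = 1-based end position of the block */
            unsigned h = block_index(pats[p], j - 1);
            int candidate = (int)(m - j);
            if (candidate < shift[h])
                shift[h] = candidate;             /* keep min(current value, m - j) */
        }
    }
}
```

For the patterns "abcd" and "bcde" with m = 4, the block "bc" ends at position 3 in the first pattern and position 2 in the second, so its entry becomes min(1, 2) = 1, while "cd" and "de" get 0.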

  14. HASH table • When SHIFT[i] = 0, the current B characters of the text match the last B characters of some patterns, so a match is possible. • HASH[i] holds a pointer into PAT_POINT, a list of pointers to the patterns, sorted by the hash value of the last B characters of each pattern.

  15. HASH table (cont.) • HASH[i] = p points to the beginning of the list of patterns whose hash value is i. • To find the end of this list, we keep incrementing this pointer until its value equals the value in HASH[i+1].
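One way to lay out HASH and PAT_POINT so that bucket i runs from HASH[i] up to HASH[i+1] is a counting sort. The sketch below assumes B = 2, a byte alphabet, and a direct 16-bit index as the hash value; `suffix_hash` and `build_hash` are illustrative names, and the slides do not say how the authors build these tables.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define B 2
#define TABLE_SIZE 65536   /* one bucket per 2-byte block */

/* Hash of the last B of the first m characters of a pattern. */
static unsigned suffix_hash(const char *pat, size_t m)
{
    return ((unsigned)(unsigned char)pat[m - 2] << 8) | (unsigned char)pat[m - 1];
}

/* hash[] needs TABLE_SIZE + 1 entries, pat_point[] needs k entries.
   Afterwards bucket i is pat_point[hash[i]] .. pat_point[hash[i + 1] - 1]. */
void build_hash(const char *const pats[], size_t k, size_t m,
                size_t hash[TABLE_SIZE + 1], const char *pat_point[])
{
    assert(m >= B);
    memset(hash, 0, (TABLE_SIZE + 1) * sizeof hash[0]);
    for (size_t p = 0; p < k; p++) {
        assert(strlen(pats[p]) >= m);
        hash[suffix_hash(pats[p], m)]++;          /* count patterns per bucket */
    }
    for (size_t i = 1; i < TABLE_SIZE; i++)
        hash[i] += hash[i - 1];                   /* hash[i] = end of bucket i */
    hash[TABLE_SIZE] = k;                         /* sentinel closing the last bucket */
    for (size_t p = k; p-- > 0; )                 /* fill from the back to keep input order */
        pat_point[--hash[suffix_hash(pats[p], m)]] = pats[p];
    /* each decrement moved hash[i] from the end of bucket i to its start */
}
```

Keeping one extra sentinel entry means the scan never needs a special case for the last bucket.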

  16. PREFIX table • Natural language is not random: suffixes such as "ion" and "ing" are common in English text. • Such a suffix may appear in several of the patterns, so many patterns can share the same hash value. • We use the PREFIX table to speed up the check in this case. • The first B' characters of each pattern are mapped into the PREFIX table. • This filters out patterns whose suffix is the same but whose prefix is different.
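The prefix filter amounts to comparing one small integer per candidate before any character-by-character check. The sketch below assumes B' = 2 and forms the key the same way the scanning code forms text_prefix (first character shifted left by 8 bits, plus the second); `prefix_key` and `build_prefix` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* 16-bit key built from the first two characters at s. */
unsigned prefix_key(const char *s)
{
    return ((unsigned)(unsigned char)s[0] << 8) + (unsigned char)s[1];
}

/* prefix[q] = key of the q-th pattern in the PAT_POINT list. */
void build_prefix(const char *const pat_point[], size_t k, unsigned prefix[])
{
    for (size_t q = 0; q < k; q++) {
        assert(pat_point[q][0] != '\0' && pat_point[q][1] != '\0');
        prefix[q] = prefix_key(pat_point[q]);
    }
}
```

For example, "nation" and "motion" share the suffix "on" and therefore the same HASH bucket, but their prefix keys differ, so a text window beginning with "mo" skips "nation" after a single integer comparison.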

  17. Scanning Stage
      while (text <= textend) {
          h = Hashfunct(B);                /* the hash function (we use Hbits=5) */
          shift = SHIFT[h];
          if (shift == 0) {                /* a match is possible */
              text_prefix = (*(text-m+1)<<8) + *(text-m+2);
              p = HASH[h];
              p_end = HASH[h+1];
              while (p++ < p_end) {
                  if (text_prefix != PREFIX[p]) continue;
                  px = PAT_POINT[p];
                  qx = text-m+1;
                  while (*(px++) == *(qx++));
                  if (*(px-1) == 0) {      /* 0 indicates the end of a string */
                      /* report a match */
                  }
              }
              shift = 1;
          }
          text += shift;
      }
      • Compute the hash value h based on the current B characters of the text. • If the shift is zero, some pattern may match at this position. • Check each p with HASH[h] <= p < HASH[h+1] for which PREFIX[p] = text_prefix.
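The scanning loop can be tried end to end with a compact, self-contained version. This is a sketch under simplifying assumptions, not the authors' implementation: B = 2, a byte alphabet, directly indexed SHIFT and HASH tables of 65536 entries, candidates verified with memcmp, and the PREFIX filter left out for brevity. The function `wm_search` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define B 2
#define TABLE_SIZE 65536          /* 256^B: one slot per 2-byte block */

static unsigned block(const char *s)   /* block whose last character is s[0] */
{
    return ((unsigned)(unsigned char)s[-1] << 8) | (unsigned char)s[0];
}

/* Finds all occurrences of pats[0..k-1] in text[0..n-1]. Stores up to cap
   matches in match_pos/match_pat and returns the total number found,
   or -1 if memory allocation fails. */
long wm_search(const char *const pats[], size_t k, const char *text, size_t n,
               size_t match_pos[], size_t match_pat[], size_t cap)
{
    assert(k > 0);
    size_t m = strlen(pats[0]);                   /* m = minimum pattern length */
    for (size_t p = 1; p < k; p++)
        if (strlen(pats[p]) < m) m = strlen(pats[p]);
    assert(m >= B);

    size_t *shift = malloc(TABLE_SIZE * sizeof *shift);
    size_t *hash = calloc(TABLE_SIZE + 1, sizeof *hash);
    size_t *pat_point = malloc(k * sizeof *pat_point);
    if (!shift || !hash || !pat_point) {
        free(shift); free(hash); free(pat_point);
        return -1;
    }

    /* Preprocessing: SHIFT from every block, HASH counts from the last block. */
    for (size_t i = 0; i < TABLE_SIZE; i++) shift[i] = m - B + 1;
    for (size_t p = 0; p < k; p++) {
        for (size_t j = B; j <= m; j++) {
            unsigned h = block(pats[p] + j - 1);
            if (m - j < shift[h]) shift[h] = m - j;
        }
        hash[block(pats[p] + m - 1)]++;
    }
    for (size_t i = 1; i < TABLE_SIZE; i++) hash[i] += hash[i - 1];
    hash[TABLE_SIZE] = k;
    for (size_t p = k; p-- > 0; )
        pat_point[--hash[block(pats[p] + m - 1)]] = p;

    /* Scanning: pos is the index of the last character of the current window. */
    long found = 0;
    for (size_t pos = m - 1; pos < n; ) {
        unsigned h = block(text + pos);
        size_t s = shift[h];
        if (s == 0) {                             /* possible match: check the bucket */
            size_t start = pos - (m - 1);
            for (size_t q = hash[h]; q < hash[h + 1]; q++) {
                const char *pat = pats[pat_point[q]];
                size_t len = strlen(pat);
                if (len <= n - start && memcmp(text + start, pat, len) == 0) {
                    if ((size_t)found < cap) {
                        match_pos[found] = start;
                        match_pat[found] = pat_point[q];
                    }
                    found++;
                }
            }
            s = 1;
        }
        pos += s;
    }
    free(shift); free(hash); free(pat_point);
    return found;
}
```

Searching "CPM_annual_conference_announce" for the patterns "announce", "annual" and "annually" reports "annual" at offset 4 and "announce" at offset 22. A production version would use a smaller hashed table and the PREFIX filter, as the slides describe.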

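  18. Performance • The SHIFT table is constructed in O(M) time. • M = m * P, where P here denotes the number of patterns and each pattern is taken to have size m. • B = log_c(2M) • c^B = c^(log_c(2M)) <= 2Mc, which allows for B being rounded up to an integer.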

  19. Performance (cont.) • Lemma: The probability that a random string of size B leads to a shift value of i is <= 1/(2m). • Proof: 1. At most P = M/m strings lead to a shift value of i. 2. The number of possible strings of size B is at least 2M. 3. Hence the probability is at most (M/m)/(2M) = 1/(2m).
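The counting argument behind the lemma can be written out in one line. This is a sketch that follows the proof steps on the slide, writing P for the number of patterns and assuming B is chosen so that c^B >= 2M:

```latex
\Pr[\text{shift} = i]
  \;\le\; \frac{\#\{\text{blocks with shift } i\}}{\#\{\text{all blocks of size } B\}}
  \;\le\; \frac{P}{c^{B}}
  \;\le\; \frac{M/m}{2M}
  \;=\; \frac{1}{2m}.
```

The second inequality uses the fact that each pattern contributes at most one block ending at the position that yields shift i.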

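  20. Performance (cont.) • The lemma implies that the expected value of a shift is >= m/2. • Computing one hash value takes O(B) time, so the total work for non-zero shifts is O(BN/m). • When the shift is 0, the verification costs O(m) but happens with probability <= 1/(2m), so its expected cost is O(m) * O(1/(2m)). • The total amount of work is O(BN/m).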

  21. Experiment

  22. Experiment (cont.)

  23. Conclusion • This algorithm uses three tables, SHIFT, HASH and PREFIX, to save scanning time. • The cost of the preprocessing stage is low. • It can be used in many applications, such as file search in databases,
