Martin Kay Stanford University


Presentation Transcript


  1. Martin Kay, Stanford University: String Search 1

  2. Naive Search (1)

         naive_search(Pattern, Text, 1) :-
             append(Pattern, _, Text).
         naive_search(Pattern, [_ | Text], N) :-
             naive_search(Pattern, Text, N0),
             N is N0+1.

         | ?- naive_search("is", "mississippi", N).
         N = 2 ? ;
         N = 5 ? ;
         no
         | ?-
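The same idea can be sketched in Python for readers who want to run it outside a Prolog system. This is only an illustrative translation, not part of the slides: the function name is borrowed from the Prolog predicate, and positions are counted from 1 to agree with the answers shown above.

```python
def naive_search(pattern, text):
    """Yield every 1-based position at which pattern occurs in text."""
    for start in range(len(text) - len(pattern) + 1):
        # Same test as append(Pattern, _, Text): is the pattern a prefix
        # of the text that begins at this alignment?
        if text.startswith(pattern, start):
            yield start + 1  # the Prolog version counts from 1


print(list(naive_search("is", "mississippi")))  # [2, 5]
```

Like the Prolog version, it restarts the comparison from scratch at every alignment, which is what the later slides set out to avoid.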

  3. pref: A Prefix Predicate

         % Make an entry in the database every time the predicate is called.
         pref(P, T) :- assert(stat(T, P)), fail.
         pref([], _).
         pref([H | P], [H | T]) :- pref(P, T).

  4. Search using pref

         naive_search1(Pattern, Text, 1) :-
             pref(Pattern, Text).
         naive_search1(Pattern, [_ | Text], N) :-
             naive_search1(Pattern, Text, N0),
             N is N0+1.

         | ?- naive_search1([i,s], [m,i,s,s,i,s,s,i,p,p,i], N).
         N = 2 ? ;
         N = 5 ? ;
         no
         | ?-

  5. The Statistics: 11 alignments, 18 entries

         | ?- listing(stat).
         stat([m,i,s,s,i,s,s,i,p,p,i], [i,s]).
         stat([i,s,s,i,s,s,i,p,p,i], [i,s]).
         stat([s,s,i,s,s,i,p,p,i], [s]).
         stat([s,i,s,s,i,p,p,i], []).
         stat([s,s,i,s,s,i,p,p,i], [i,s]).
         stat([s,i,s,s,i,p,p,i], [i,s]).
         stat([i,s,s,i,p,p,i], [i,s]).
         stat([s,s,i,p,p,i], [s]).
         stat([s,i,p,p,i], []).
         stat([s,s,i,p,p,i], [i,s]).
         stat([s,i,p,p,i], [i,s]).
         stat([i,p,p,i], [i,s]).
         stat([p,p,i], [s]).
         stat([p,p,i], [i,s]).
         stat([p,i], [i,s]).
         stat([i], [i,s]).
         stat([], [s]).
         stat([], [i,s]).
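The bookkeeping behind these statistics can be reproduced with a small Python sketch. It is an assumption-laden imitation rather than the author's code: `pref` and `count_calls` are illustrative names, and a plain list called `log` stands in for the asserted `stat/2` facts.

```python
def pref(pattern, text, log):
    """Is pattern a prefix of text? Record every call, as assert(stat(T, P)) does."""
    log.append((text, pattern))
    if not pattern:
        return True
    if text and pattern[0] == text[0]:
        return pref(pattern[1:], text[1:], log)
    return False


def count_calls(pattern, text):
    """Try the pattern against every suffix of the text and return the call log."""
    log = []
    for start in range(len(text) + 1):  # includes the empty suffix
        pref(pattern, text[start:], log)
    return log


log = count_calls("is", "mississippi")
print(len(log))  # 18
```

Each entry is one call of the prefix test, so the length of the log is a rough measure of the work the naive search does.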

  6. Observe: if the pattern "mississippi" matched part of the way, we can move over all the characters matched, because none of them can be an "m", which is what we need to start a new match.

         Text:     m i s s i o n a r y . . . .
         Pattern:  m i s s i s s i p p i
                             ^ mismatch

     There is no "m" among the matched characters i s s i, so move the pattern to the position of the mismatch, or maybe even to the position after it.

  7. Observe further:

         Text:     p e r p e n d i c u l a r . . .
         Pattern:  p e r p e t r a t e
                             ^ mismatch

     The matched part ends in p e, and this is a prefix of the pattern. So try this:

         Text:     p e r p e n d i c u l a r . . .
         Pattern:        p e r p e t r a t e

  8. Observe yet further:

         Text:     p e r p e t u a l . . . . .
         Pattern:  p e r p e t r a t e
                               ^ mismatch

     No (shorter) prefix of the pattern ends at the last matched character. So move to here:

         Text:     p e r p e t u a l . . . . .
         Pattern:              p e r p e t r a t e
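The three observations amount to one rule: after a mismatch, slide the pattern so that the longest prefix of the pattern that is also a suffix of the matched part lines up with that suffix. A minimal Python sketch of the rule follows; `shift_after_mismatch` is a hypothetical name, and the brute-force comparison is chosen for clarity, not speed.

```python
def shift_after_mismatch(pattern, matched):
    """How far to slide the pattern once `matched` leading characters agreed."""
    part = pattern[:matched]
    # Look for the longest proper prefix of the matched part
    # that is also one of its suffixes.
    for k in range(matched - 1, 0, -1):
        if part[:k] == part[-k:]:
            return matched - k
    # No overlap: skip everything matched (at least one position).
    return max(matched, 1)


print(shift_after_mismatch("mississippi", 5))  # 5
print(shift_after_mismatch("perpetrate", 5))   # 3
print(shift_after_mismatch("perpetrate", 6))   # 6
```

The three calls correspond to the mission, perpendicular, and perpetual examples.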

  9. Overlaps

     Search for    a b a c a b a d a b a c a b a
     in the text   a b a b a c a b a d a b a c a b a d a b a c a b a b a

     [Figure: the pattern drawn under the text at successive alignments.]

  10. Déjà vu

      Search for    a b a c a b a d a b a c a b a
      in the text   a b a b a c a b a d a b a c a b a d a b a c a b a b a

      [Figure: the pattern drawn under the text at successive alignments.]

  11. On-line search

      We have seen this much of the text so far: c a c a

      We are looking for the pattern cacao. We have some number (0 or more) of searches in progress and are waiting for the next character to see which ones continue, and maybe to start a new one.

      [Figure: copies of the pattern aligned under the text, one for each search in progress.]

  12. Search for    a b a c a b a d a b a c a b a
      in the text   a b a b a c a b a d a b a c a b a d a b a c a b a b a

           i  char  pointers
           0  a     [0]
           1  b     [0, 1]
           2  a     [0, 2]
           3  b     [0, 1, 3]
           4  a     [0, 2]
           5  c     [0, 1, 3]
           6  a     [0, 4]
           7  b     [0, 1, 5]
           8  a     [0, 2, 6]
           9  d     [0, 1, 3, 7]
          10  a     [0, 8]
          11  b     [0, 1, 9]
          12  a     [0, 2, 10]
          13  c     [0, 1, 3, 11]
          14  a     [0, 4, 12]
          15  b     [0, 1, 5, 13]
          16  a     [0, 2, 6, 14]     result 2
          17  d     [0, 1, 3, 7]
          18  a     [0, 8]
          19  b     [0, 1, 9]
          20  a     [0, 2, 10]
          21  c     [0, 1, 3, 11]
          22  a     [0, 4, 12]
          23  b     [0, 1, 5, 13]
          24  a     [0, 2, 6, 14]     result 10
          25  b     [0, 1, 3, 7]
          26  a     [0, 2]

      • The rightmost pointer always moves.
      • Other pointers move if they can do so over the same character.
      • A new '0' is introduced on the left.
      • A pointer in a given position always has pointers in the same set of positions to its left.

      These are properties of the pattern only. Therefore they can be cached or precompiled.
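The pointer trace can be regenerated with a short Python simulation. This is a sketch under the reading that each listed pointer is a pattern position waiting to be compared with the current text character; `online_search`, `trace`, and `results` are illustrative names.

```python
def online_search(pattern, text):
    """Advance every search in progress, one text character at a time."""
    pointers = []  # pattern positions of the searches in progress
    trace = []     # one (index, char, pointers) row per text character
    results = []   # 0-based offsets of complete matches
    for i, ch in enumerate(text):
        pointers = [0] + pointers  # a new '0' is introduced on the left
        trace.append((i, ch, list(pointers)))
        advanced = []
        for p in pointers:
            if pattern[p] == ch:
                if p + 1 == len(pattern):
                    results.append(i - p)  # the whole pattern has matched
                else:
                    advanced.append(p + 1)
        pointers = advanced  # searches that failed are dropped
    return trace, results


trace, results = online_search("abacabadabacaba", "ababacabadabacabadabacababa")
print(trace[16])  # (16, 'a', [0, 2, 6, 14])
print(results)    # [2, 10]
```

Keeping every pointer alive costs work proportional to the number of pointers per character; the point of the following slides is that only the rightmost one needs to be stored.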

  13. The pointer trace of slide 12 again (search for a b a c a b a d a b a c a b a in a b a b a c a b a d a b a c a b a d a b a c a b a b a), with the annotation: if this matches ... then so will these.

  14. The pointer trace of slide 12 again (search for a b a c a b a d a b a c a b a in a b a b a c a b a d a b a c a b a d a b a c a b a b a), with the annotation: so try these only if this fails!

  15. The failure function

          char  pointers
          a     [0]
          b     [0, 1]
          a     [0, 2]
          b     [0, 1, 3]
          a     [0, 2]
          c     [0, 1, 3]
          a     [0, 4]
          b     [0, 1, 5]
          a     [0, 2, 6]
          d     [0, 1, 3, 7]
          a     [0, 8]
          b     [0, 1, 9]
          a     [0, 2, 10]
          c     [0, 1, 3, 11]
          a     [0, 4, 12]

          position   0  1  2  3  4  5  6  7  8  9 10 11 12 ...
          pattern    a  b  a  c  a  b  a  d  a  b  a  c  a ...
          failure       0  0  1  0  1  2  3  0  1  2  3  4 ...

  16. The pointer lists of slide 15 again, a [0] through a [0, 4, 12], above the same table:

          position   0  1  2  3  4  5  6  7  8  9 10 11 12 ...
          pattern    a  b  a  c  a  b  a  d  a  b  a  c  a ...
          failure       0  0  1  0  1  2  3  0  1  2  3  4 ...

  17. The Failure Function

          -1  0  0  0  1  2  3  4  5
           a  b  c  a  b  c  a  b  c

      [Figure: copies of the pattern a b c a b c a b c shifted against one another.]

  18. The Failure Function

          -1  0  0  1  0  1  2  3  0  1  2  3  4  5  6
           a  b  a  c  a  b  a  d  a  b  a  c  a  b  a

      [Figure: copies of the pattern a b a c a b a d a b a c a b a shifted against one another.]

  19. The Failure Function

          -1  0  0  1  0  1  2  3  0  1  2  3  4  5  6
           a  b  a  c  a  b  a  d  a  b  a  c  a  b  a

      [Figure: copies of the pattern a b a c a b a d a b a c a b a shifted against one another.]
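One way to see where these numbers come from is to read them off the pointer sets: when the pattern is matched against itself, the failure value at a position is the pointer immediately to the left of the rightmost one. The Python sketch below follows that reading. `failure_from_pointers` is a hypothetical name, and the method is deliberately the slow, set-based one, not the linear-time construction.

```python
def failure_from_pointers(pattern):
    """Read the failure function off the pointer sets of the on-line search."""
    failure = []
    pointers = []
    for i, ch in enumerate(pattern):
        pointers = [0] + pointers
        # The rightmost pointer is i itself; its left neighbour is the
        # position to fall back to. Position 0 has no neighbour: -1.
        failure.append(pointers[-2] if len(pointers) > 1 else -1)
        # Keep only the pointers that can move over this character.
        pointers = [p + 1 for p in pointers if pattern[p] == ch]
    return failure


print(failure_from_pointers("abcabcabc"))
# [-1, 0, 0, 0, 1, 2, 3, 4, 5]
print(failure_from_pointers("abacabadabacaba"))
# [-1, 0, 0, 1, 0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 6]
```

Both outputs agree with the rows of numbers printed above the two example patterns.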

  20. Substring, Prefix, Suffix

      • Part of a string S (even if it covers the whole of S) is a substring of S.
      • If it includes the first (last) character of S, it is a prefix (suffix) of S.
      • If it does not cover the whole of S, it is a proper substring (prefix, suffix) of S.

      Example: S = ababac
      Some substrings: ababac, ab, b, bab, ac, ε (only ababac is not proper)
      Some prefixes: ababac, a, aba, ε (only ababac is not proper)
      Some suffixes: ababac, abac, c, ε (only ababac is not proper)

      ε is the empty string.

  21. Borders

      • If B is a proper prefix and a proper suffix of a string S, it is a border of S.
      • Note: ε is a border of every string.

      Examples:
      abcabcabc has borders abc, abcabc, ε
      abacabadabacaba has borders abacaba, aba, a, ε
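The definition translates almost word for word into Python. The sketch below is a brute-force enumeration intended only to check the examples; `borders` is an illustrative name, and the empty Python string plays the role of the empty string.

```python
def borders(s):
    """All borders of s, longest first: proper prefixes that are also suffixes."""
    return [s[:k] for k in range(len(s) - 1, -1, -1) if s.endswith(s[:k])]


print(borders("abcabcabc"))        # ['abcabc', 'abc', '']
print(borders("abacabadabacaba"))  # ['abacaba', 'aba', 'a', '']
```

Only prefix lengths below `len(s)` are tried, which is what makes every border proper.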

  22. Borders

          -1  0  0  0  1  2  3  4  5
           a  b  c  a  b  c  a  b  c

      [Figure: copies of the pattern a b c a b c a b c shifted against one another.]

  23. border in Prolog

          border(Pattern, Border) :-
              append([_ | _], Border, Pattern),   % Border is a proper suffix
              append(Border, _, Pattern).         % and also a prefix

  24. Borders in Linear Time

           a  b  a  c  a  b  a  d  a  b
          -1  0  0  1  0  1  2  3  0  1

      Borders at position i+1 extend borders at position i.

          % Base case: position 0 has the value -1, as in the table above.
          border(0, _, -1) :- !.
          border(I, Pattern, Q) :-
              J is I-1,
              border(J, Pattern, P),
              nth0(J, Pattern, C),
              extend(C, P, Pattern, Q).

          extend(_, -1, _, 0).
          extend(C, P, Pattern, Q) :-
              nth0(P, Pattern, C), !,
              Q is P+1.
          extend(C, P0, Pattern, R) :-
              border(P0, Pattern, Q),
              extend(C, Q, Pattern, R).
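For comparison, here is an iterative Python sketch of the same recurrence. It assumes the convention visible in the tables, namely that entry 0 is -1 and entry i is the length of the longest border of the first i characters; `border_table` is an illustrative name.

```python
def border_table(pattern):
    """table[i] = length of the longest border of pattern[:i]; table[0] = -1."""
    table = [-1]
    for i in range(1, len(pattern) + 1):
        c = pattern[i - 1]  # the character that extends position i-1
        p = table[i - 1]
        # Fall back through shorter and shorter borders until one of
        # them can be extended by c, or none is left (p == -1).
        while p >= 0 and pattern[p] != c:
            p = table[p]
        table.append(p + 1)
    return table


print(border_table("abacabadab"))
# [-1, 0, 0, 1, 0, 1, 2, 3, 0, 1, 2]
```

The table has one more entry than the pattern has characters; the last entry is the border of the whole pattern, which a search needs in order to continue after a complete match.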

  25. Building A Table

          % Base case: position 0 has the value -1, as in the failure function.
          border(0, _, -1) :- !.
          border(I, Pattern, Q) :-
              J is I-1,
              border(J, Pattern, P),
              nth0(J, Pattern, C),
              extend(C, P, Pattern, Q).

          extend(_, -1, _, 0).
          extend(C, P, Pattern, Q) :-
              nth0(P, Pattern, C), !,
              Q is P+1.
          extend(C, P0, Pattern, R) :-
              border(P0, Pattern, Q),
              extend(C, Q, Pattern, R).

          make_table(Pattern) :-
              retractall(border_table(_, _)),
              assert(border_table(0, -1)),    % -1, matching border(0, _, -1)
              assert(border_table(1, 0)),
              length(Pattern, PL),
              make_table(Pattern, 2, PL).

          make_table(_, I, N) :- I>N, !.
          make_table(Pattern, I, N) :-
              border(I, Pattern, K),
              assert(border_table(I, K)),
              J is I+1,
              make_table(Pattern, J, N).

  26. Building A Table

          % border/3 and extend/4 now look earlier values up in the table.
          border(I, Pattern, Q) :-
              J is I-1,
              border_table(J, P),
              nth0(J, Pattern, C),
              extend(C, P, Pattern, Q).

          extend(_, -1, _, 0).
          extend(C, P, Pattern, Q) :-
              nth0(P, Pattern, C), !,
              Q is P+1.
          extend(C, P0, Pattern, R) :-
              border_table(P0, Q),
              extend(C, Q, Pattern, R).

          make_table(Pattern) :-
              retractall(border_table(_, _)),
              % Entry 0 must be -1: with 0 here, extend/4 would look up
              % entry 0 again and again and never reach its first clause.
              assert(border_table(0, -1)),
              assert(border_table(1, 0)),
              length(Pattern, PL),
              make_table(Pattern, 2, PL).

          make_table(_, I, N) :- I>N, !.
          make_table(Pattern, I, N) :-
              border(I, Pattern, K),
              assert(border_table(I, K)),
              J is I+1,
              make_table(Pattern, J, N).

  27. Searching

          % common_prefix/3 and advance/3 are helper predicates that are
          % not defined on the slides.
          search(Pattern, Text, N) :-
              make_table(Pattern),                 % Build the table
              retract(border_table(0, _)),
              assert(border_table(0, 0)),
              length(Pattern, PL),
              search(Pattern, PL, Text, N).        % Do the search

          search(Pattern, PL, Text, N) :-
              common_prefix(Pattern, Text, CPL),
              search(CPL, Pattern, PL, Text, N).

          search(CPL, _, CPL, _, 0).
          search(CPL, Pattern, PL, Text0, N) :-
              border_table(CPL, BL),
              M is CPL-BL,
              advance(Text0, M, Text),
              search(Pattern, PL, Text, N0),
              N is N0+M.
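A compact Python sketch of a table-driven search is given below for experimentation. It is not a translation of the Prolog above: it keeps a running count of matched characters instead of calling helpers such as `common_prefix` and `advance`, and it assumes the table convention in which entry 0 is -1, so that a mismatch at the very first pattern character still moves on by one. The names `border_table` and `kmp_search` are illustrative.

```python
def border_table(pattern):
    """table[i] = length of the longest border of pattern[:i]; table[0] = -1."""
    table = [-1]
    for i in range(1, len(pattern) + 1):
        p = table[i - 1]
        while p >= 0 and pattern[p] != pattern[i - 1]:
            p = table[p]
        table.append(p + 1)
    return table


def kmp_search(pattern, text):
    """Yield the 0-based offset of every occurrence of pattern in text."""
    if not pattern:
        return  # this sketch does not define matches of the empty pattern
    table = border_table(pattern)
    matched = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        # After a complete match or a mismatch, fall back to the
        # longest border that can still be extended by ch.
        while matched >= 0 and (matched == len(pattern) or pattern[matched] != ch):
            matched = table[matched]
        matched += 1
        if matched == len(pattern):
            yield i - matched + 1


print(list(kmp_search("abacabadabacaba", "ababacabadabacabadabacababa")))
# [2, 10]
```

The two offsets are the ones marked "result 2" and "result 10" in the pointer trace, and each text character is read exactly once.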

  28. Reference

      Donald E. Knuth, James H. Morris, Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323-350, June 1977.
