String Matching

The problem:
• Input: a text T (a very long string) and a pattern P (a short string).
• Output: the index in T where a copy of P begins.

Some Notation and Terminology
• |P| and |T|: the lengths of P and T.
• P[i]: the i-th letter of P.
• Prefix of P: a substring of P starting with P[1].
• P[1..i]: the prefix containing the first i letters of P.
• Example: P=abcabbccaa. Prefixes: a, ab, abc, abca, abcab, abcabb, …

Some Notation and Terminology (continued)
• Suffix of P[1..i]: a substring of P[1..i] ending at P[i], e.g. P[3..i], P[5..i] (i>4).
• Example: P[1..5]=abcaa. Suffixes of P[1..3]: c, bc, abc. Suffixes of P[1..4]: a, ca, bca, abca.

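Straightforward method
• Basic idea:
1. i=1;
2. Starting at T[i], compare P[1], P[2], ..., P[|P|] with T[i], T[i+1], ..., T[i+|P|-1], one letter at a time. If all |P| letters match, report that P occurs at index i.
3. Whenever a mismatch is found, set i=i+1 and go to step 2, as long as i+|P|-1≤|T|.
• Example 1: T=ABABABCCA and P=ABABC.
T: ABABABCCA
P: ABABC        (i=1: mismatch at P[5], C against A)
P:  ABABC       (i=2: mismatch at P[1], A against B)
P:   ABABC      (i=3: all five letters match, so P occurs at index 3)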

Analysis
• Step 2 takes O(|P|) comparisons in the worst case.
• Step 2 could be repeated O(|T|) times.
• Total running time is O(|T||P|).
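The straightforward method can be sketched in a few lines of Python. This is only an illustration: the function name `naive_search` is made up here, and the code converts between the 1-based indices of the slides and Python's 0-based strings.

```python
def naive_search(text, pattern):
    """Return the 1-based index of the first occurrence of pattern, or None."""
    n, m = len(text), len(pattern)
    # Try every alignment i with i + m - 1 <= n (1-based, as in the slides).
    for i in range(1, n - m + 2):
        j = 0
        # Compare P[1..m] with T[i..i+m-1] until a mismatch or the end of P.
        while j < m and text[i - 1 + j] == pattern[j]:
            j += 1
        if j == m:
            return i
    return None


print(naive_search("ABABABCCA", "ABABC"))  # 3, as in Example 1
```

The nested loops make the O(|T||P|) bound visible: the outer loop runs at most |T| times and the inner loop at most |P| times per alignment.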

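Knuth-Morris-Pratt Method (linear time algorithm)
A better idea
• In step 3 of the straightforward method, a mismatch moves the pattern forward by exactly one position (i=i+1).
• By studying the pattern P carefully in advance, we may be able to move forward more than one position at a time when a mismatch occurs.
• For example, with P=ABABC and T=ABABABCCA, the mismatch at P[5] allows a shift of two positions at once:
T: ABABABCCA
P: ABABC
P:   ABABC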

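Questions:
• How do we decide how many positions to jump when a mismatch occurs?
• How much can we benefit? The running time becomes O(|T|+|P|).
Example 2: P=abcabcabcaa and T=abcabcabcabcaa.
T: abcabcabcabcaa
P: abcabcabcaa        (mismatch at P[11]: a against T[11]=b)
P:    abcabcabcaa     (shift by 3; the comparison goes back to P[8] against T[11])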

We can move forward more than one position. Reason?
• Study of pattern P:
P[1..10] = abcabcabca (the mismatch occurs when trying to match P[11])
P[1..7] = abcabca
P[1..4] = abca
• P[1..7] is the longest proper prefix of P[1..10] that is also a suffix of P[1..10].
• P[1..4] is also a prefix that is a suffix of P[1..10], but not the longest.
• Key: when a mismatch occurs at P[i+1], we want the longest proper prefix of P[1..i] that is also a suffix of P[1..i].

Failure function
• f(i) is the largest r with r<i such that P[1]P[2]...P[r] = P[i-r+1]P[i-r+2]...P[i], i.e. the prefix of length r equals the suffix of P[1]P[2]…P[i] of length r.
• That is, P[1..f(i)] is the longest proper prefix of P[1..i] that is also a suffix of P[1..i].
• Example 3: P=ababaccc and i=5. Here P[1..5]=ababa, and P[1]P[2]P[3] = aba = P[3]P[4]P[5], so r=3 and f(5)=3.

Example 4: P=abcabbabcabbaa. It is easy to verify that
i:    1 2 3 4 5 6 7 8 9 10 11 12 13 14
P[i]: a b c a b b a b c a  b  b  a  a
f(i): 0 0 0 1 2 0 1 2 3 4  5  6  7  1
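The definition of f(i) can be checked directly by brute force. The sketch below is deliberately slow and follows the definition literally; the name `failure_by_definition` is hypothetical, and this is not the linear-time construction covered later in the slides.

```python
def failure_by_definition(pattern, i):
    """Largest r < i such that P[1..r] equals the last r letters of P[1..i]."""
    prefix = pattern[:i]               # P[1..i]
    for r in range(i - 1, 0, -1):      # try the longest candidate first
        if prefix[:r] == prefix[i - r:]:
            return r
    return 0


print(failure_by_definition("ababaccc", 5))  # 3, as in Example 3

pattern = "abcabbabcabbaa"
print([failure_by_definition(pattern, i) for i in range(1, len(pattern) + 1)])
# [0, 0, 0, 1, 2, 0, 1, 2, 3, 4, 5, 6, 7, 1], the values of Example 4
```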

The Scan Algorithm
• i: T[i] is the next character in T to be compared.
• q: P[q+1] is the next character in P to be compared with T[i]; the first q letters of P are currently matched.
1. i=1 and q=0;
2. Compare T[i] with P[q+1]:
case 1: T[i]==P[q+1]: i=i+1; q=q+1; if q==|P| then { print "P occurs at i-|P|"; q=f(q); }
case 2: T[i]≠P[q+1] and q≠0: q=f(q);
case 3: T[i]≠P[q+1] and q==0: i=i+1;
3. Repeat step 2 until i>|T|.

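Example 5: P=abcabbabcabbaa, T=abcabcabbabbabcabbabcabbaa.
T: abcabcabbabbabcabbabcabbaa
P: abcabb                        (mismatch at P[6] against T[6]; q=f(5)=2)
P:    abcabbabc                  (mismatch at P[9] against T[12]; q=f(8)=2)
P:          abc                  (mismatch at P[3] against T[12]; q=f(2)=0)
P:            a                  (mismatch at P[1] against T[12]; q=0, so i=i+1)
P:             abcabbabcabbaa    (q=|P|: P occurs at index 13)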

Running time complexity (hard)
• The running time of the scan algorithm is O(|T|).
• Proof: there are two pointers, i and p.
• i: the position in T of the next character to be compared.
• p: the position in T that is aligned with P[1], i.e. p=i-q. (See the figure below.)
   p         i
T: abcabcabcabcaa
P: abcabcabcaa
P:    abcabcabcaa     (after the mismatch, p moves forward to 4 while i stays at 11)

Facts:
1. When a match is found, move i forward.
2. When a mismatch is found, move p forward, until p and i are the same. (When p=i and a mismatch occurs, move both i and p forward.)
From facts 1 and 2, every comparison moves i or p forward, and neither pointer can move past the end of T, so the total number of comparisons is at most 2|T|. Thus, the time complexity is O(|T|).
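The two-pointer count can also be written as a short potential argument. This is a sketch: the symbol Φ is notation introduced here for illustration, and p = i - q is the alignment pointer from the proof above.

```latex
\begin{align*}
&p = i - q, \qquad \Phi = i + p = 2i - q \\
&\text{match: } (i, q) \to (i+1,\, q+1) && \Delta\Phi = 2 - 1 = 1 \\
&\text{mismatch, } q > 0\text{: } q \to f(q) < q && \Delta\Phi = q - f(q) \ge 1 \\
&\text{mismatch, } q = 0\text{: } i \to i + 1 && \Delta\Phi = 2 \\
&\Phi = 2 \text{ initially}, \quad \Phi \le 2|T| \text{ at every comparison} && \#\text{comparisons} \le 2|T| - 1
\end{align*}
```

Each comparison raises Φ by at least 1, and Φ can never exceed 2|T| while there is still a character T[i] to compare, which gives the bound of at most 2|T| comparisons.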

Another version of scan algorithm (code)
n=|T|
m=|P|
q=0
for i=1 to n {
    while q>0 and P[q+1]≠T[i] do {
        q=f(q)
    }
    if P[q+1]==T[i] then q=q+1
    if q==m then {
        print "pattern occurs at i-m+1"
        q=f(q)
    }
}
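The loop above translates almost line by line into Python. In this sketch the failure table is passed in as a list `f` with a dummy entry at index 0, so that `f[i]` matches the 1-based f(i) of the slides; the function name `kmp_scan` and this calling convention are assumptions made for the example.

```python
def kmp_scan(text, pattern, f):
    """Return the 1-based start index of every occurrence of pattern in text.

    f is the failure table with f[i] = f(i) for i = 1..m; f[0] is unused.
    """
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    matches = []
    q = 0  # number of pattern letters currently matched
    for i in range(1, n + 1):
        # P[q+1] in the slides is pattern[q] in 0-based Python indexing.
        while q > 0 and pattern[q] != text[i - 1]:
            q = f[q]
        if pattern[q] == text[i - 1]:
            q += 1
        if q == m:
            matches.append(i - m + 1)
            q = f[q]  # keep scanning for further occurrences
    return matches


# Failure table of Example 4, padded with a dummy entry at index 0.
f_example4 = [None, 0, 0, 0, 1, 2, 0, 1, 2, 3, 4, 5, 6, 7, 1]
print(kmp_scan("abcabcabbabbabcabbabcabbaa", "abcabbabcabbaa", f_example4))  # [13]
```

Passing the table explicitly keeps the scan separate from the construction of f, mirroring how the slides present the two parts.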

Failure Function Construction
Basic idea:
Case 1: f(1) is always 0.
Case 2: if P[q]==P[f(q-1)+1] then f(q)=f(q-1)+1.
Example: P=abcabcc. f(1)=0; f(2)=0; f(3)=0; f(4)=1; f(5)=2; f(6)=3; f(7)=0.
P[4]=P[f(4-1)+1]=P[1], so f(4)=f(4-1)+1=0+1=1.
P[5]=P[f(5-1)+1]=P[2], so f(5)=f(5-1)+1=1+1=2.
P[6]=P[f(6-1)+1]=P[3], so f(6)=f(6-1)+1=2+1=3.

Case 3: if P[q]P[f(q-1)+1] and f(q-1)≠0 then consider P[q] ?= P[f(f(q-1))+1] (Do it recursively) Case 4: if P[q] P[f(q-1)+1] and f(q-1)==0 then f[q]=0. Example : abc abc abb abc abc f(8)=5 abc f(5)=2 a f(2)=0 i: 1 2 3 4 5 6 7 8 9 f(i): 0 0 0 1 2 3 4 5 0

The algorithm (code) to compute failure function
1. m=|P|;
2. f(1)=0;
3. k=0;
4. for q=2 to m do {
5.     k=f(q-1);
6.     if (k>0 and P[k+1]!=P[q]) { k=f(k); goto 6; }
7.     if (k>0 and P[k+1]==P[q]) { f(q)=k+1; }
8.     if (k==0) { if (P[k+1]==P[q]) f(q)=1; else f(q)=0; }
}

Another version
1. m=|P|;
2. f(1)=0;
3. k=0;
4. for q=2 to m do {
5.     k=f(q-1);
6.     while (k>0 and P[k+1]!=P[q]) do {
7.         k=f(k); }
8.     if (P[k+1]==P[q]) then k=k+1;
9.     f(q)=k;
}
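The while-loop version can be written in Python as follows. As a sketch, it returns a list with an unused entry at index 0 so that `f[i]` corresponds to f(i) in the slides; the function name `failure_function` is chosen here for illustration.

```python
def failure_function(pattern):
    """Return f with f[i] = f(i) for i = 1..m; f[0] is an unused placeholder."""
    m = len(pattern)
    f = [0] * (m + 1)                  # f(1) = 0 is covered by the initial zeros
    for q in range(2, m + 1):
        k = f[q - 1]
        # P[k+1] is pattern[k] and P[q] is pattern[q-1] in 0-based indexing.
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = f[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        f[q] = k
    return f


print(failure_function("abcabbabcabbaa")[1:])
# [0, 0, 0, 1, 2, 0, 1, 2, 3, 4, 5, 6, 7, 1], the values of Example 4
```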

Example 6:
i:    1 2 3 4 5 6 7 8 9 10 11 12
P[i]: a b c a b c a b c a  a  c
f(1)=0; f(2)=0; f(3)=0; f(4)=1; f(5)=2; f(6)=3; f(7)=4; f(8)=5; f(9)=6; f(10)=7; f(11)=1.
(The computation of f(11) is very interesting: k goes from f(10)=7 to f(7)=4 to f(4)=1 to f(1)=0, failing each time because P[k+1]=b≠P[11]=a, and finally P[1]=P[11]=a gives f(11)=1.)
Question: Do we need to compute f(12)? Yes, if you want to find ALL occurrences of P. No, if you just want to find the first occurrence of P.

Example: P=abcabc, T=abcabcabc.
T: abcabcabc
P: abcabc        (occurrence at index 1)
P:    abcabc     (occurrence at index 4)
When a match is found at the end of P, set q=f(|P|) and continue scanning.
Running time complexity (Fun Part, not required)
The running time of the failure function construction algorithm is O(|P|). (The proof is similar to that for the scan algorithm.)
Total running time complexity
The total complexity of failure function construction plus the scan algorithm is O(|P|+|T|).

Linear Time Algorithm for Multiple Patterns (Fun Part)
• Input: a string T (very long) and a set of patterns P1, P2, ..., Pk.
• Output: all the occurrences of the Pi's in T.
Let us consider the set of patterns { he, she, his, hers }. We can construct an automaton as follows:

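[Figure: goto graph of the automaton for { he, she, his, hers }, with states 0 to 9. Edges: 0 -h-> 1 -e-> 2 -r-> 8 -s-> 9; 1 -i-> 6 -s-> 7; 0 -s-> 3 -h-> 4 -e-> 5; state 0 loops back to itself on e, i, r.]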

g(s,a)=s' means that at state s, if the next input letter is a, then the next state is s'.
• The states of the automaton are organized column by column.
• Each state corresponds to a prefix of some pattern Pi.
• F: the set of final states (dark circled), corresponding to the ends of patterns.
• For the starting state 0, add g(0,a)=0 if g(0,a) is originally fail.

Exercise: write down the g() function for the above automaton.
• Failure function: f(s) = the state for the longest prefix of some pattern Pi that is a proper suffix of the string on the path from 0 (starting state) to s.
• Example: he is the longest prefix of a pattern (here he and hers) that is a proper suffix of the string she.

The scan algorithm
Text: T[1]T[2]...T[n]
s=0;
for i:=1 to n do {
    while g(s,T[i])=fail do s=f(s);
    s:=g(s,T[i]);
    if s is in F then return "yes";
}
return "no"

Theorem: The scan algorithm takes O(|T|) time.
Proof: Again, the two-pointer argument.
• When a match is found, move the first pointer forward. (s:=g(s,T[i]);)
• When a mismatch is found (g(s,T[i])==fail), move the second pointer forward. (s=f(s);)
• When a final state is reached, declare that a pattern has been found. (if s is in F then return "yes";)

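Example (where two states are listed, the first is reached by a failure transition f and the second by g):
i:      1  2  3  4    5  6    7    8
T[i]:   s  h  e  r    s  h    i    i
states: 3  4  5  2,8  9  3,4  1,6  0,0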

Failure function construction
• Basic idea: similar to that for one pattern.
for each state s of depth 1 do f(s)=0
for each depth d>=1 do
    for each state sd of depth d and character a such that g(sd,a)=s' do {
        s=f(sd)
        while g(s,a)=fail do {
            s=f(s)
        }
        f(s')=g(s,a)
    }

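Note: g(0,c)≠fail for any possible character c, so the while loop always terminates.
The failure function for {he, she, his, hers}, with the states numbered 1=h, 2=he, 3=s, 4=sh, 5=she, 6=hi, 7=his, 8=her, 9=hers, is
s:    1 2 3 4 5 6 7 8 9
f(s): 0 0 0 1 2 0 3 0 3
Time complexity: O(|P1|+|P2|+...+|Pk|).
Proof: Two-pointer argument. Left as an assignment (optional).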
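The goto function, the failure function, and the scan can be combined into one small Python sketch. Several things here are assumptions that go beyond the slides: states are numbered in insertion order (so the numbers differ from the figure), a missing dictionary key plays the role of "fail", and each state carries a list `out` of patterns ending there (including those inherited through failure links) so that all occurrences are reported instead of a single "yes". The names `build_automaton` and `scan` are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    goto = [{}]   # goto[s][a] = s'; a missing key means "fail"
    fail = [0]    # failure function f(s)
    out = [[]]    # patterns recognised on reaching state s
    for pat in patterns:
        s = 0
        for a in pat:
            if a not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][a] = len(goto) - 1
            s = goto[s][a]
        out[s].append(pat)
    # Breadth-first pass by depth: depth-1 states keep f = 0.
    queue = deque(goto[0].values())
    while queue:
        sd = queue.popleft()
        for a, s_next in goto[sd].items():
            queue.append(s_next)
            s = fail[sd]
            while s != 0 and a not in goto[s]:
                s = fail[s]
            fail[s_next] = goto[s].get(a, 0)   # g(0, a) = 0 when undefined
            out[s_next] = out[s_next] + out[fail[s_next]]
    return goto, fail, out


def scan(text, patterns):
    """Return (1-based start index, pattern) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    s = 0
    found = []
    for i, a in enumerate(text, start=1):
        while s != 0 and a not in goto[s]:
            s = fail[s]
        s = goto[s].get(a, 0)
        for pat in out[s]:
            found.append((i - len(pat) + 1, pat))
    return found


print(scan("shershii", ["he", "she", "his", "hers"]))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
```

Inheriting `out` along failure links is what lets the state for she also report he; a plain membership test against F would only answer whether some pattern occurs.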
