## Approximate Boyer-Moore String Matching


Source: J. Tarhio and E. Ukkonen, SIAM Journal on Computing, Vol. 22, No. 2, 1993, pp. 243-260

Advisor: Prof. R. C. T. Lee

Speaker: Kuei-hao Chen

• The k mismatches problem
• The k differences problem

**Definition of the k mismatches problem**

• Given a pattern string P of length m and a text string T of length n, we would like to find all approximate occurrences of P in T with at most k mismatches.

[Figure: example with k = 1.]

Consider the situation where the pattern P is aligned with a window W of T and k+1 mismatches have already been found.

[Figure: P aligned with W, with k+1 mismatches.]

Since there are already k+1 mismatches, we must move the pattern. P must be moved far enough that there are at most k mismatches between a suffix S of W and the substring S' of P aligned with it.

The trick is to consider the (k+1)-suffix of W, that is, the last k+1 characters of W. There are two cases:

Case 1: At least one character of this (k+1)-suffix occurs in P at a position that a shift to the right can bring under it. Move the pattern so that this character is aligned with its occurrence in P. After such a move there are at most k mismatches between the (k+1)-suffix and its corresponding substring in P.

[Figure: Case 1 alignment.]

Case 2: No such character exists. Move the pattern so that the k-prefix of P is aligned with the k-suffix of W. Again, there are at most k mismatches between the k-suffix of W and the k-prefix of P.

[Figure: Case 2 alignment.]

The generalization of the BM algorithm to the k mismatches problem is natural: for k = 0 the generalized algorithm is exact string matching.
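The two shift cases above can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation: the names `shift_after_window` and `k_mismatch_search` are hypothetical, the shift is recomputed for every window instead of being read from a precomputed table, and the outer loop stops at j ≤ n so that every window lies fully inside the text. Both choices are assumptions of this sketch.

```python
def shift_after_window(pattern: str, window: str, k: int) -> int:
    """Shift for the current window, derived from its (k+1)-suffix.

    Assumes 0 <= k < len(pattern) and len(window) == len(pattern).
    """
    m = len(pattern)
    # Case 2 default: align the k-prefix of P with the k-suffix of W.
    shift = m - k
    # Case 1: for each of the last k+1 window characters, look for the
    # nearest occurrence of that character strictly to its left in P.
    for i in range(m - 1, m - k - 2, -1):  # 0-based positions m-1 .. m-k-1
        for s in range(1, i + 1):
            if pattern[i - s] == window[i]:
                shift = min(shift, s)
                break
    return shift


def k_mismatch_search(text: str, pattern: str, k: int) -> list[int]:
    """Return 1-based end positions of windows with at most k mismatches."""
    n, m = len(text), len(pattern)
    assert 0 <= k < m, "this sketch assumes 0 <= k < m"
    ends = []
    j = m  # 1-based end position of the current window
    while j <= n:
        window = text[j - m:j]
        mismatches = 0
        # Scan right to left; stop as soon as k+1 mismatches are seen.
        for i in range(m - 1, -1, -1):
            if window[i] != pattern[i]:
                mismatches += 1
                if mismatches > k:
                    break
        if mismatches <= k:
            ends.append(j)
        j += shift_after_window(pattern, window, k)
    return ends


if __name__ == "__main__":
    print(k_mismatch_search("abracadabra", "abra", 1))  # [4, 11]
```

The shift is always at least 1, because Case 1 only considers occurrences strictly to the left and Case 2 gives m - k ≥ 1, so the loop terminates.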
Recall that the k mismatches problem asks for all occurrences of P in T such that T and P differ in at most k positions of P.

We scan the pattern from right to left until we have found k+1 mismatches (unsuccessful search) or the pattern ends (successful search).

**Preprocessing phase for approximate matching**

The dk table: for each character a of the alphabet Σ and each end position i in m-k..m, dk[i, a] is the distance from i to the closest occurrence of a to the left of position i in the pattern, or m if there is no such occurrence.

Example: Let k = 1, m = 8, a ∈ Σ.

[Table: dk values for this example.]

**Algorithm for preprocessing phase**

P = p1p2…pm, T = t1t2…tn

    Preprocessing
    For a ∈ Σ Do
      For j = m downto m-k Do
      Begin
        dk[j, a] ← m
        Find the occurrence of a in P that is closest to the left of pj.
        If it is found, store the distance between its position and j in dk[j, a].
      End

**Algorithm for searching phase**

P = p1p2…pm, T = t1t2…tn

    Searching
    j = m;
    While j ≦ n + k Do
    Begin
      h = j; i = m; mismatch = 0; d = m - k;
      While i > 0 and mismatch ≦ k Do
      Begin
        If i ≧ m - k Then d = min(d, dk[i, th]);
        If th ≠ pi Then mismatch = mismatch + 1;
        i = i - 1; h = h - 1
      End of while;
      If mismatch ≦ k Then report match at position j;
      j = j + d
    End of while

For k = 1 the shift is d = min(m - 1, dk[m, tj], dk[m-1, tj-1]), taken over the last two characters of the window.

**Complete example for approximate string matching**

Example 1: Let k = 1, m = 4, n = 17.

Example 1 (step 6 of 6): j ← 16 + 3 = 19, which ends the while loop.

Example 2: Let k = 1, m = 8, n = 24.

Example 2 (step 5 of 14): report match at position j; j ← 13 + 2 = 15.

Example 2 (step 14 of 14): If h = 0 then report match at position j; j ← 24 + 2 = 26, which ends the while loop.

**Time complexity**

• The preprocessing phase takes O(m + kc) time and O(kc) space, where c is the size of the alphabet.
• The searching phase takes O(mn) time in the worst case.

**Definition of the k differences problem**

• Given a pattern string P of length m and a text string T of length n, we would like to find all approximate occurrences of P in T with edit distance not larger than k.

The basic approach to the problem is to find the edit distance between T(1, i) and P for every i [Ukk85b]: let Edit be an m+1 by n+1 table such that Edit(i, j) is the minimum edit distance between p1p2…pj and any substring of T ending at ti. The first index i is the text position (column) and the second index j is the pattern position (row).

Table Edit must be completely evaluated column-by-column in time O(mn).

If we can identify the positions i for which Edit(T(1, i), P) cannot be smaller than k, we may skip those positions. This paper is based upon Rule 7 proposed by Professor Lee.

**Rule 7**

• If k characters in String A do not appear in String B, Distance(A, B) is not smaller than k.

For the scanning phase, we define some terms first. A diagonal h of Edit, for h = -m, …, n, consists of all Edit(i, j) such that i - j = h. For every Edit(i, j), there is a minimizing arc

• from Edit(i-1, j) to Edit(i, j) if Edit(i, j) = Edit(i-1, j) + 1,
• from Edit(i, j-1) to Edit(i, j) if Edit(i, j) = Edit(i, j-1) + 1,
• from Edit(i-1, j-1) to Edit(i, j) if Edit(i, j) = Edit(i-1, j-1) where pj = ti, or if Edit(i, j) = Edit(i-1, j-1) + 1 where pj ≠ ti.

The costs of the arcs are 1, 1, 0 and 1, respectively.

A minimizing path is any path that consists of minimizing arcs and leads from an entry Edit(i, 0) on the first row of Edit to an entry Edit(h, m) on the last row of Edit. A minimizing path is successful if it leads to an entry Edit(h, m) ≤ k.

Lemma 1: The entries on a successful minimizing path M are contained in ≤ k+1 successive diagonals of Edit.

Proof: Each additional diagonal comes from either an insertion or a deletion. If there were more than k+1 diagonals, there would have to be more than k operations, either deletions or insertions, which contradicts the cost of the path being at most k. Thus there cannot be more than k+1 diagonals.

Example with k = 3:

    T: ABCABBA
    P: CBABAC

    S: C-AB--
    P: CBABAC

EDIT(P, S) = 3. There are k+1 = 3+1 = 4 successive diagonals because there are three deletions.

Example with k = 3:

    T: BCABDAB
    P: CBADB

    S: C-ABDAB
    P: CBA-D-B

EDIT(P, S) = 3. There are 1+2 = 3 < k+1 = 3+1 = 4 successive diagonals because there are one deletion and two insertions.

By Lemma 1, for each diagonal d, any successful minimizing path starting at the top of this diagonal stays within a band of 1 + k + k = 2k+1 diagonals.

For the first example above (T: ABCABBA, P: CBABAC, k = 3, EDIT(P, S) = 3), the successful minimizing path lies within a band of width ≤ 7 of Edit.

Based on this band, we define the k-environment. For each j = 1, …, m, let the k-environment of the pattern symbol pj be the string Cj = pj-k…pj+k, where pa = ε for a < 1 and a > m.

The longest vertical path in any minimizing path has length not greater than 2k+1. We only have to determine whether ti appears in the k-environment of pj.

Given T = ATGCGAGAGAT, P = GCAGAGAGATG, and k = 2, we select the three characters t5, t8 and t11. The 2-environment of p5 is C5 = p3p4p5p6p7 = AGAGA. The 2-environment of p8 is C8 = p6p7p8p9p10 = GAGAT. The 2-environment of p11 is C11 = p9p10p11 = ATG.

We now obtain a stronger version of Rule 7.

Lemma 2: Let a successful minimizing path M go through some entry on a diagonal h of Edit. Then for at most k indexes j, 1 ≤ j ≤ m, the character th+j does not occur in the k-environment Cj.

A formal proof can be found in the paper. In the following, we give some intuition for it.

[Figure: example with k = 1.]

In this case, although there are two mismatches, deleting the character a, which mismatches x, yields a perfect match. Thus the edit distance between T and P may still be 1.
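The Edit recurrence, the k-environments Cj, and the count used in Lemma 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `k_difference_ends`, `k_environment` and `absent_count` are hypothetical, the first function is the plain O(mn) column-by-column evaluation rather than the paper's faster scanning algorithm, and `absent_count` simply ignores indexes whose text position h+j falls outside T.

```python
def k_difference_ends(text: str, pattern: str, k: int) -> list[int]:
    """1-based text positions i with Edit(i, m) <= k, computed column by column."""
    m = len(pattern)
    col = list(range(m + 1))  # column i = 0: Edit(0, j) = j
    ends = []
    for i, t in enumerate(text, start=1):
        new = [0] * (m + 1)  # Edit(i, 0) = 0: a match may start anywhere
        for j in range(1, m + 1):
            new[j] = min(
                col[j] + 1,                          # arc from Edit(i-1, j)
                new[j - 1] + 1,                      # arc from Edit(i, j-1)
                col[j - 1] + (pattern[j - 1] != t),  # diagonal arc, cost 0 or 1
            )
        if new[m] <= k:
            ends.append(i)
        col = new
    return ends


def k_environment(pattern: str, j: int, k: int) -> str:
    """C_j = p_{j-k} ... p_{j+k} for 1-based j; out-of-range symbols are empty."""
    lo = max(1, j - k)
    hi = min(len(pattern), j + k)
    return pattern[lo - 1:hi]


def absent_count(text: str, pattern: str, h: int, k: int) -> int:
    """Number of indexes j for which t_{h+j} does not occur in C_j."""
    count = 0
    for j in range(1, len(pattern) + 1):
        pos = h + j  # 1-based text position on diagonal h
        if 1 <= pos <= len(text):
            if text[pos - 1] not in k_environment(pattern, j, k):
                count += 1
    return count


if __name__ == "__main__":
    p = "GCAGAGAGATG"
    print(k_environment(p, 5, 2))   # AGAGA
    print(k_environment(p, 8, 2))   # GAGAT
    print(k_environment(p, 11, 2))  # ATG
```

By Lemma 2, a diagonal h for which `absent_count` exceeds k cannot carry a successful minimizing path, which is the property that lets a scan skip such diagonals.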