Approximate Boyer-Moore String Matching Source : SIAM Journal on Computing, Vol. 22, No. 2, 1993, pp.243-260 J. Tarhio and E. Ukkonen Advisor: Prof. R. C. T. Lee Speaker: Kuei-hao Chen
The k mismatches problem • The k differences problem
Definition of the k mismatches problem • Given a pattern string P of length m and a text string T of length n, we would like to find all approximate occurrences of P in T with at most k mismatches. If k=1, then
Consider the following situation where a pattern P is being matched against a window W of T and there are already (k+1) mismatches:
Since there are already (k+1) mismatches, we must move the pattern. The following is obvious: P must be moved to such an extent that there are at most k mismatches between a suffix S of W and a substring S’ of P.
Our trick is as follows: Consider the (k+1)-suffix of W. There are two cases:
Case 1: There is one character in this (k+1)-suffix which exists in P in such a way as shown below. Move the pattern to match these characters. Note that in such a situation, there are at most k mismatches between the (k+1)-suffix and its corresponding substring in P.
Case 2: No such character exists. Move the pattern in such a way that the k-prefix of P aligns with the k-suffix of W as shown below. In this situation, again, there are at most k mismatches between the k-suffix of W and the k-prefix of P.
The generalization of the BM algorithm for the k mismatches problem will be very natural: for k=0 the generalized algorithm is exact string matching. Recall that the k mismatches problem asks for finding all occurrences of P in T such that in at most k positions of P, T and P have different characters.
We simply scan the pattern from right to left until we have found k+1 mismatches (unsuccessful search) or the pattern ends (successful search).
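The right-to-left scan described above can be sketched as follows. This is a minimal illustration, not code from the paper: the function name `window_matches` and its return shape are assumptions, and positions are 1-based to match the slides.

```python
def window_matches(pattern, text, j, k):
    """Compare pattern with the window of text that ends at 1-based position j.

    The scan goes right to left and stops as soon as k+1 mismatches are seen.
    Returns (is_match, mismatches_counted).
    """
    m = len(pattern)
    mismatches = 0
    i, h = m, j                       # 1-based indices into pattern and text
    while i > 0 and mismatches <= k:
        if text[h - 1] != pattern[i - 1]:
            mismatches += 1
        i -= 1
        h -= 1
    return mismatches <= k, mismatches
```

With k=0 the loop stops at the first mismatch, which is the exact-matching behavior mentioned earlier.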
Preprocessing phase for approximate matching: the dk table
For each end position i in m-k..m and each character a ∈ ∑, the value dk[i, a] is the distance from position i to the nearest occurrence of a in the pattern to the left of i. If a does not occur to the left of i, dk[i, a] = m. Example: Let k=1, m=8, a ∈ ∑
Algorithm for preprocessing phase
P = p1p2…pm, T = t1t2…tn
Preprocessing
  For a ∈ ∑ Do
    For j = m downto m-k Do
    Begin
      dk[j, a] ← m
      Find the occurrence of a that is closest to pj on its left. If it is found, set dk[j, a] to the distance between the position of that occurrence and j.
    End
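The preprocessing step can be sketched in Python as below. It assumes the reading of the table given above: dk[j, a] is the distance from position j to the nearest a on its left, or m if there is none. The name `build_dk` and the dictionary layout are illustrative choices, and this simple version takes O(km) time rather than the O(m + kc) bound quoted later.

```python
def build_dk(pattern, k, alphabet):
    """Build the shift table for end positions j = m, m-1, ..., m-k (1-based).

    dk[j][a] is the distance from j to the nearest occurrence of a strictly
    to the left of j in the pattern, or m if a does not occur there.
    """
    m = len(pattern)
    dk = {}
    for j in range(m, max(m - k, 1) - 1, -1):
        row = {a: m for a in alphabet}        # default shift: whole pattern
        for s in range(1, j):                 # walk left from position j-1
            c = pattern[j - s - 1]            # p_{j-s} in 1-based notation
            if c in row and row[c] == m:      # keep only the nearest occurrence
                row[c] = s
        dk[j] = row
    return dk
```

For example, `build_dk("abcd", 1, "abcd")` produces one row for j=4 and one for j=3.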
Algorithm for searching phase
P = p1p2…pm, T = t1t2…tn
Searching
  j = m;
  While j ≦ n+k Do
  Begin
    h = j; i = m; mismatch = 0; d = m-k;
    While i > 0 and mismatch ≦ k Do
    Begin
      If i ≧ m-k Then d = min(d, dk[i, th]);
      If th ≠ pi Then mismatch = mismatch+1;
      i = i-1; h = h-1
    End of while;
    If mismatch ≦ k Then report match at position j;
    j = j+d
  End of while
For k=1 the shift is d = min(m-1, dk[m, tj], dk[m-1, tj-1]).
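A runnable sketch of the searching phase follows. It is an interpretation of the pseudocode, not the authors' code: the function name `k_mismatch_search` is hypothetical, the shift is taken as the minimum of the table entries over the last k+1 window characters (capped at m-k), and the outer loop stops at j ≤ n rather than n+k so that every index stays inside the text, which is an assumption about the intended bound.

```python
def k_mismatch_search(text, pattern, k):
    """Return the 1-based end positions j where pattern matches text
    with at most k mismatches."""
    n, m = len(text), len(pattern)
    if not 0 <= k < m:
        raise ValueError("expected 0 <= k < len(pattern)")

    # Shift table: dk[i][a] = distance from i to the nearest a on its left.
    dk = {}
    for i in range(m - k, m + 1):
        row = {}
        for s in range(1, i):
            row.setdefault(pattern[i - s - 1], s)   # nearest occurrence wins
        dk[i] = row

    matches = []
    j = m
    while j <= n:
        h, i, mismatches = j, m, 0
        d = m - k                                   # largest shift we allow
        while i > 0 and mismatches <= k:
            if i >= m - k:                          # last k+1 window characters
                d = min(d, dk[i].get(text[h - 1], m))
            if text[h - 1] != pattern[i - 1]:
                mismatches += 1
            i -= 1
            h -= 1
        if mismatches <= k:
            matches.append(j)
        j += d
    return matches
```

Because the inner loop only stops after k+1 mismatches, it always examines the last k+1 characters of the window before the shift is used.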
Complete example for approximate string matching Example 1: Let k=1, m=4, n=17
Example 1 (6/6): j ← 16 + d, j ← 16 + 3, j ← 19; jump out of the while loop
Example 2: Let k=1, m=8, n=24
Example 2 (5/14): Then report match at position j; j ← 13 + d, j ← 13 + 2, j ← 15
Example 2 (14/14): If h = 0 Then report match at position j; j ← 24 + d, j ← 24 + 2, j ← 26; jump out of the while loop
Time complexity • The preprocessing phase takes O(m + kc) time and O(kc) space, where c is the size of the alphabet. • The searching phase takes O(mn) time in the worst case.
Definition of the k differences problem • Given a pattern string P of length m and a text string T of length n, we would like to find all approximate occurrences of P in T with edit distance not larger than k.
The basic approach to solving the problem is to compute, for every i, the edit distance between P and the best-matching substring of T ending at ti [Ukk85b]: Let Edit be a table with m+1 rows and n+1 columns such that Edit(i, j) is the minimum edit distance between p1p2…pj and any substring of T ending at ti.
If we can identify the positions i where Edit(T(1, i), P) cannot be smaller than k, we may skip them. This paper is based upon Rule 7 proposed by Professor Lee.
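The table can be filled column by column with the standard dynamic programming recurrence, using the boundary values Edit(i, 0) = 0 and Edit(0, j) = j. The sketch below is illustrative: the function name `approx_end_positions` is an assumption, and only two columns are kept in memory.

```python
def approx_end_positions(text, pattern, k):
    """Return the 1-based positions i with Edit(i, m) <= k, i.e. the ends of
    substrings of text whose edit distance to pattern is at most k."""
    n, m = len(text), len(pattern)
    prev = list(range(m + 1))          # column i = 0: Edit(0, j) = j
    ends = []
    for i in range(1, n + 1):
        cur = [0] * (m + 1)            # Edit(i, 0) = 0: a match may start anywhere
        for j in range(1, m + 1):
            cost = 0 if text[i - 1] == pattern[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # arc from Edit(i-1, j)
                         cur[j - 1] + 1,       # arc from Edit(i, j-1)
                         prev[j - 1] + cost)   # arc from Edit(i-1, j-1)
        if cur[m] <= k:
            ends.append(i)
        prev = cur
    return ends
```

This baseline inspects every entry of the table and costs O(mn) time. The filtering ideas that follow aim to skip the parts of the text where no entry of the last row can be ≤ k.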
Rule 7 • If k characters in String A do not appear in String B, Distance(A,B) is not smaller than k.
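Rule 7 gives a cheap lower bound on the edit distance. The sketch below counts the characters of A that are absent from B and compares the count with the true distance; the helper names `missing_count` and `edit_distance` are illustrative and not taken from the source.

```python
def missing_count(a, b):
    """Number of positions in a whose character does not appear anywhere in b."""
    present = set(b)
    return sum(1 for ch in a if ch not in present)


def edit_distance(a, b):
    """Plain Levenshtein distance between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete ca
                           cur[j - 1] + 1,            # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


# Rule 7: the count of missing characters never exceeds the distance.
a, b = "axbyc", "abc"
assert missing_count(a, b) <= edit_distance(a, b)
```

Each character of A that is missing from B must be deleted or substituted, which is why the count is a valid lower bound.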
For the scanning phase, we first define some terms. A diagonal h of Edit, for h=-m,…, n, consists of all Edit(i, j) such that i-j=h. For every Edit(i, j), there is a minimizing arc from Edit(i-1, j) to Edit(i, j) if Edit(i, j)=Edit(i-1, j)+1, from Edit(i, j-1) to Edit(i, j) if Edit(i, j)=Edit(i, j-1)+1, and from Edit(i-1, j-1) to Edit(i, j) if Edit(i, j)=Edit(i-1, j-1) where pj=ti, or if Edit(i, j)=Edit(i-1, j-1)+1 where pj≠ti. The costs of the arcs are 1, 1, 0 and 1, respectively.
A minimizing path is any path that consists of minimizing arcs and leads from an entry Edit(i, 0) on the first row of Edit to an entry Edit(h, m) on the last row of Edit. A minimizing path is successful if it leads to an entry Edit(h, m)≤k.
Lemma 1: The entries on a successful minimizing path M are contained in ≤ k+1 successive diagonals of Edit. Proof: Each move of the path to a new diagonal comes from either an insertion or a deletion. If the path touched more than (k+1) diagonals, it would need more than k such operations, and its cost would exceed k. Thus there cannot be more than (k+1) diagonals.
Example: T: ABCABBA, P: CBABAC
S: C-AB--
P: CBABAC
EDIT(P, S)=3. There are (k+1)=3+1=4 successive diagonals because there are three deletions.
Example: T: BCABDAB, P: CBADB, k=3
S: C-ABDAB
P: CBA-D-B
EDIT(P, S)=3. There are 1+2=3 < (k+1)=3+1=4 successive diagonals because there are one deletion and two insertions.
By Lemma 1, for each diagonal d, any successful minimizing path starting at the top of this diagonal will have a bandwidth of 1+k+k=2k+1
Example: T: ABCABBA, P: CBABAC, k=3
S: C-AB--
P: CBABAC
Result: EDIT(P, S)=3. The successful minimizing path lies entirely within a band of width ≤ 7 of Edit.
We give the part of the pattern that falls inside this band a name: the k-environment. For each j=1, …, m, let the k-environment of the pattern symbol pj be the string Cj=pj-k…pj+k, where pa=ε for a<1 and a>m.
The longest vertical path in any minimizing path has length not greater than 2k+1. We only have to determine whether ti appears in the k environment of pj.
Given T=ATGCGAGAGAT, P=GCAGAGAGATG, and k=2. We select the three characters t5, t8 and t11. The 2-environment for t5 is C5=p3p4p5p6p7=AGAGA. The 2-environment for t8 is C8=p6p7p8p9p10=GAGAT. The 2-environment for t11 is C11=p9p10p11=ATG.
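The k-environment is just a clipped slice of the pattern, as the following sketch shows. The function name `k_environment` is an illustrative choice, and j is 1-based as in the slides.

```python
def k_environment(pattern, j, k):
    """Return C_j = p_{j-k} ... p_{j+k}, clipped to the pattern (j is 1-based)."""
    m = len(pattern)
    lo = max(1, j - k)          # positions below 1 contribute the empty string
    hi = min(m, j + k)          # positions above m contribute the empty string
    return pattern[lo - 1:hi]


P = "GCAGAGAGATG"
print(k_environment(P, 5, 2))   # AGAGA
print(k_environment(P, 8, 2))   # GAGAT
print(k_environment(P, 11, 2))  # ATG
```

The three printed strings reproduce C5, C8 and C11 from the example above.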
We now obtain a stronger version of Rule 7. Lemma 2: Let a successful minimizing path M go through some entry on a diagonal h of Edit. Then for at most k indexes j, 1≤j≤m, character th+j does not occur in the k-environment Cj. A formal proof can be found in the paper. In the following, we give some intuition for it.
In this case (k=1), although there are two mismatches, by deleting a, which mismatches x, we may achieve a perfect match. Thus the edit distance between T and P may still be 1.
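The test suggested by Lemma 2 can be sketched as a count of "bad" positions along a diagonal h. This is an illustration under stated assumptions, not the paper's procedure: the names `bad_positions` and `diagonal_may_match` are hypothetical, and text positions h+j that fall outside the text are simply skipped.

```python
def bad_positions(text, pattern, k, h):
    """Return the indexes j (1-based) for which t_{h+j} does not occur
    in the k-environment C_j of the pattern."""
    n, m = len(text), len(pattern)
    bad = []
    for j in range(1, m + 1):
        pos = h + j                                       # text index on diagonal h
        if not 1 <= pos <= n:
            continue                                      # assumption: ignore positions outside the text
        env = pattern[max(1, j - k) - 1:min(m, j + k)]    # C_j
        if text[pos - 1] not in env:
            bad.append(j)
    return bad


def diagonal_may_match(text, pattern, k, h):
    """By Lemma 2, more than k bad positions rule out a successful
    minimizing path through diagonal h."""
    return len(bad_positions(text, pattern, k, h)) <= k
```

For T=ATGCGAGAGAT, P=GCAGAGAGATG, k=2 and h=0, only t2=T falls outside its environment C2=GCAG, so the diagonal is not ruled out.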