A Hybrid Indexing Method for Approximate String Matching

Journal of Discrete Algorithms, No. 1, Vol. 1, 2000, pp. 205-239, Gonzalo Navarro and Ricardo Baeza-Yates
Advisor: Prof. R. C. T. Lee
Speaker: Y. K. Shieh

The approximate string matching problem is: given a text T of length n, a pattern P of length m (n > m), and a threshold k on the number of "errors" allowed in a match, find all occurrences of P in T with at most k errors.

This paper uses an exhaustive searching mechanism. We open a window T’ in T of size m+k (Rule 2) and try to determine whether every prefix T’’ of this window T’ is guaranteed to have ed(T’’,P) > k. If the answer is yes, we ignore this window; otherwise, we use dynamic programming to examine whether any prefix T’’ of the window T’ has ed(T’’,P) ≦ k.

We use dynamic programming to compute the edit distance between two strings. A matrix C[0…m, 0…n] is filled, where C[j,i] is the minimum number of operations needed to match T1…i to P1…j. It is computed as follows:
C[j,0] = j and C[0,i] = i
C[j,i] = C[j-1,i-1] if Pj = Ti, and 1 + min(C[j-1,i], C[j,i-1], C[j-1,i-1]) otherwise
Here C[j,0] and C[0,i] are the edit distances between a string of length j or i and the empty string.

Example: T = surgery, P = survey, k = 2. Only three prefixes of T, namely surge, surger and surgery, have an edit distance from P = survey that is smaller than or equal to k = 2.
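The prefix check in this example can be reproduced with a short sketch of the standard edit-distance dynamic program. The function names `edit_distance` and `prefixes_within` are illustrative and do not come from the paper; only one matrix row is kept at a time.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (standard DP, one row at a time)."""
    prev = list(range(len(b) + 1))           # distances from "" to b[:i]
    for j, ca in enumerate(a, start=1):
        curr = [j]                            # distance from a[:j] to ""
        for i, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[i - 1])      # characters match: no new error
            else:
                curr.append(1 + min(prev[i], curr[i - 1], prev[i - 1]))
        prev = curr
    return prev[-1]


def prefixes_within(text: str, pattern: str, k: int) -> list[str]:
    """All non-empty prefixes of text whose distance to pattern is at most k."""
    return [text[:i] for i in range(1, len(text) + 1)
            if edit_distance(text[:i], pattern) <= k]


print(prefixes_within("surgery", "survey", 2))
# → ['surge', 'surger', 'surgery']
```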

Let us now see how we can be sure that, for a window T’ of size m+k, every prefix T’’ of T’ has ed(T’’,P) > k. We present Lemma 1 of this paper as follows.

Lemma 1: Let T’ in T and P be two strings such that ed(T’, P) ≦ k. Let P = P1 x1 P2 x2 … xj-1 Pj, for strings Pi and xi and for any j ≧ 1. Then at least one string Pi appears in T’ with at most ⌊k/j⌋ errors. Thus, we always divide the pattern into j pieces. We shall point out how to divide it later.

To be more precise, we may say that if ed(T’,P) ≦ k, there exist a piece Pi of P and a substring T’’ of T’ such that ed(Pi,T’’) ≦ ⌊k/j⌋.

Lemma 1 tells us that if ed(Pi,b) > ⌊k/j⌋ for all Pi in P and every substring b of T’, then ed(P,T’) > k. Suppose that there is a window T’ of size m+k for which this holds. Every substring of a prefix T’’ of T’ is also a substring of T’, so we can be sure that, for every prefix T’’ of T’, ed(Pi,b) > ⌊k/j⌋ for all Pi in P and every substring b of T’’.

Let us define the following condition. Condition A: For all Pi in P and every substring b of T’, ed(Pi, b) > ⌊k/j⌋. Thus, if Condition A is satisfied, then ed(T’’,P) > k for every prefix T’’ of T’. In such a case, we ignore T’ and shift P one step to the right.
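Condition A can be checked by brute force on a single window, which may help to see what it says. The sketch below assumes the bound from Lemma 1 is ⌊k/j⌋ and uses the search variant of the dynamic program, in which the row for the empty pattern is all zeros so that a match may start anywhere in the window. The names `min_substring_distance` and `condition_a` are illustrative only, and this direct check is not the index-based method of the paper.

```python
def min_substring_distance(piece: str, window: str) -> int:
    """Smallest ed(piece, b) over all substrings b of window."""
    prev = [0] * (len(window) + 1)            # a match may start at any position
    for j, cp in enumerate(piece, start=1):
        curr = [j]                             # piece[:j] against the empty string
        for i, cw in enumerate(window, start=1):
            if cp == cw:
                curr.append(prev[i - 1])
            else:
                curr.append(1 + min(prev[i], curr[i - 1], prev[i - 1]))
        prev = curr
    return min(prev)                           # a match may end at any position


def condition_a(pieces: list[str], window: str, k: int) -> bool:
    """True if every piece has more than floor(k / j) errors everywhere in window."""
    bound = k // len(pieces)                   # floor(k / j), assumed from Lemma 1
    return all(min_substring_distance(p, window) > bound for p in pieces)


print(condition_a(["CA", "AG"], "ACGGA", 1))   # neither CA nor AG occurs: window can be skipped
print(condition_a(["CA", "AG"], "CACGG", 1))   # CA occurs: window must be verified
```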

Question: how can we be sure that the above condition is satisfied? The approach: for each Pi, we generate all possible modified strings Pi’ whose distances from Pi are smaller than or equal to ⌊k/j⌋. After generating all possible modified strings Pi’, we use the suffix tree of T to find all occurrences of Pi’, for all i, in T. These are the occurrences of Pi in T with at most ⌊k/j⌋ errors.

We still have the following questions:
• Question 1. How to divide P into j pieces?
• Question 2. How to generate all modified Pi’s?
• Question 3. How to find the occurrences of the Pi’s in T with edit distance less than or equal to ⌊k/j⌋?

Question 1: How to divide P into j pieces? It can be proved that an optimal method is to partition P into j pieces with j = (m + k) / logσ n, where σ is the alphabet size. We get j pieces of P, and the size of every piece is around logσ n.

Question 2. How to generate all modified Pi’s? Generating all modified strings within a given distance of a pattern is straightforward. One method can be found in [HHLS2006], which was reported by C. W. Lu. Another method can be found in [HM2007], reported by L. C. Chen. In this paper, the authors used the second method mentioned in [HM2007].
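To make the idea of "all modified strings" concrete, here is a brute-force sketch that enumerates every string within a given edit distance of a word by applying insertions, deletions and substitutions level by level. It is not the method of [HHLS2006] or [HM2007], whose details are not given here; the function name `neighbors_within` and the three-letter alphabet are assumptions made for illustration.

```python
def neighbors_within(word: str, d: int, alphabet: str) -> set[str]:
    """All strings over alphabet at edit distance at most d from word."""
    seen = {word}
    frontier = {word}
    for _ in range(d):                                     # one more edit per round
        nxt = set()
        for w in frontier:
            for i in range(len(w) + 1):
                for c in alphabet:
                    nxt.add(w[:i] + c + w[i:])             # insert c before position i
                if i < len(w):
                    nxt.add(w[:i] + w[i + 1:])             # delete w[i]
                    for c in alphabet:
                        nxt.add(w[:i] + c + w[i + 1:])     # substitute w[i] by c
        frontier = nxt - seen                              # only expand new strings
        seen |= nxt
    return seen


print(sorted(neighbors_within("CA", 1, "ACG")))
```

The set grows quickly with d and with the alphabet size, which is one reason to keep the pieces short and the per-piece error bound small.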

We can use non-deterministic finite automata (NFA). An NFA is a five-tuple M = (Q, Σ, δ, q0, F), where Q is a finite set of states, Σ is a finite input alphabet, δ is a mapping from Q × (Σ ∪ {ε}) into the set of subsets of Q, q0 ∈ Q is the initial state, and F ⊆ Q is a set of final states.

P = abac, k = 2. The finite automaton M accepts Lk(P). Lk(P)={aa, ab, ac, ba, bc, aaa, aab, aac, aba, abb, abc, acc, baa, bab, bac, bbc, bcc, aaaa, aaab, aaac, aaba, aabc, aaca, aacb, aacc, abaa, abab, abac, abba, abbb, abbc, abca, abcb, abcc, baac, babc, bbac, bbbc, bcac}.

Example: the automaton M recognizes aa.
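A common way to simulate such an automaton is to track the set of active states (i, e), where i is the number of pattern characters consumed and e the number of errors used so far. The sketch below is a general edit-distance NFA that allows insertions, deletions and substitutions, so it is an assumption and need not coincide with the automaton M drawn on the slide; in particular, its language may be larger than the set Lk(P) listed above. The names `eps_closure` and `nfa_accepts` are illustrative.

```python
def eps_closure(states: set, m: int, k: int) -> set:
    """Add states reached by epsilon moves (skipping a pattern character costs one error)."""
    closed = set(states)
    stack = list(states)
    while stack:
        i, e = stack.pop()
        if i < m and e < k and (i + 1, e + 1) not in closed:
            closed.add((i + 1, e + 1))
            stack.append((i + 1, e + 1))
    return closed


def nfa_accepts(pattern: str, word: str, k: int) -> bool:
    """True if word is within k edit operations of pattern."""
    m = len(pattern)
    active = eps_closure({(0, 0)}, m, k)
    for c in word:
        nxt = set()
        for i, e in active:
            if i < m and pattern[i] == c:
                nxt.add((i + 1, e))            # c matches pattern[i]
            if e < k:
                nxt.add((i, e + 1))            # c is an extra character in word
                if i < m:
                    nxt.add((i + 1, e + 1))    # c replaces pattern[i]
        active = eps_closure(nxt, m, k)
        if not active:
            return False                       # out of active states
    return any(i == m for i, _ in active)


print(nfa_accepts("abac", "aa", 2))   # aa is recognized
print(nfa_accepts("abac", "ca", 2))   # ca needs three edits
```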

Full example: T = GACACGGACCAAAGCAG, n = 17, P = CAAG, m = 4, k = 1

P = CAAG. With σ = 3, j = (m + k) / logσ n = (4 + 1) / log3 17 = 1.9388. Therefore, we partition P into two pieces: P1 = CA and P2 = AG. According to Lemma 1, at least one piece appears in a substring of T with at most ⌊k/j⌋ = ⌊1/2⌋ = 0 errors. This means that we want to find exact matches of P1 and P2.
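The arithmetic of this step can be checked with a few lines. The sketch assumes that the value of j is rounded to the nearest integer (at least 1) and that the pieces are contiguous and of near-equal length; the slides only state that the pattern is cut into two pieces, so both choices, as well as the function names, are illustrative.

```python
import math


def raw_piece_count(m: int, k: int, n: int, sigma: int) -> float:
    """j = (m + k) / log_sigma(n), before rounding."""
    return (m + k) / math.log(n, sigma)


def split_pattern(pattern: str, j: int) -> list[str]:
    """Cut pattern into j contiguous pieces of near-equal length."""
    m = len(pattern)
    bounds = [round(i * m / j) for i in range(j + 1)]
    return [pattern[bounds[i]:bounds[i + 1]] for i in range(j)]


raw = raw_piece_count(m=4, k=1, n=17, sigma=3)
j = max(1, round(raw))                 # rounding rule is an assumption
pieces = split_pattern("CAAG", j)
print(round(raw, 4), j, pieces, 1 // j)
# → 1.9388 2 ['CA', 'AG'] 0
```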

[Figure: NFA with k = 1 for P1 = CA]
[Figure: NFA with k = 1 for P2 = AG]

T = GACACGGACCAAAGCAG. We construct the suffix tree of T.
[Figure: suffix tree of T$, with the leaves numbered 1 to 17 by the starting position of each suffix]

We only need to consider the tree levels from the root down to depth 3.
[Figure: the suffix tree of T = GACACGGACCAAAGCAG truncated at depth 3; each leaf lists the starting positions below it, for example 1,7]

T = GACACGGACCAAAGCAG, k = 1. We traverse the truncated suffix tree from the root and run the NFA of P1 and the NFA of P2 on each branch. On a branch, an NFA either runs out of active states, reports a match that is not exact, or reports an exact match:
• Exact match of P2 = AG: we record positions 13 and 16, where AG occurs.
• Exact match of P1 = CA: we record positions 3, 10 and 15, where CA occurs.
• On every other branch the result is "not exact match" or "out of active states", and nothing is recorded.
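The recorded positions can be reproduced with a much simpler stand-in for the truncated suffix tree: a dictionary that maps every substring of length at most 3 to its 1-based starting positions. This is only a sketch under that simplification, not a suffix tree implementation, and the names `build_truncated_index` and `occurrences` are illustrative.

```python
from collections import defaultdict


def build_truncated_index(text: str, depth: int) -> dict:
    """Map each substring of length <= depth to its 1-based start positions."""
    index = defaultdict(list)
    for start in range(len(text)):
        for length in range(1, depth + 1):
            if start + length <= len(text):
                index[text[start:start + length]].append(start + 1)
    return index


def occurrences(index: dict, piece: str) -> list[int]:
    """Start positions of exact occurrences of piece (empty list if none)."""
    return list(index.get(piece, []))


T = "GACACGGACCAAAGCAG"
index = build_truncated_index(T, depth=3)
ca = occurrences(index, "CA")
ag = occurrences(index, "AG")
probable = sorted(set(ca) | set(ag))
print(ca, ag, probable)
# → [3, 10, 15] [13, 16] [3, 10, 13, 15, 16]
```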

After we find all probable positions in T, we verify the substrings of T at those positions. The probable positions in T are 3, 10, 13, 15 and 16. We use dynamic programming to verify whether any approximate match between T and P occurs at the above locations.

k = 1. The probable positions of T are 3, 10, 13, 15 and 16. We slide a window of size m+k over T and examine the windows in turn:
• Window of size m+k: no approximate matching with k = 1 found.
• Window of size m+k: no approximate matching with k = 1 found.
• Window of size m+k: CACG is found.
• Window of size m+k: this window does not include any probable position. Therefore we can ignore this window.
• Window of size m+k: the window does not include any probable position. Therefore we can shift the window directly.
• Window of size m+k: CAA, CAAA and CAAAG are found.
• Window of size m+k: AAAG is found.
• Window of size m+k: AAG is found.
• Window of size m: no approximate matching with k = 1 found.
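The verification of a single window can be sketched as follows. The window is identified by its 1-based starting position, which is an assumption about how the slides number the windows; `has_candidate` and `verify_window` are illustrative names, and the edit distance is the standard dynamic program.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (standard DP, one row at a time)."""
    prev = list(range(len(b) + 1))
    for j, ca in enumerate(a, start=1):
        curr = [j]
        for i, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[i - 1])
            else:
                curr.append(1 + min(prev[i], curr[i - 1], prev[i - 1]))
        prev = curr
    return prev[-1]


def has_candidate(start: int, size: int, positions: list[int]) -> bool:
    """True if some probable position falls inside the window [start, start + size - 1]."""
    return any(start <= p < start + size for p in positions)


def verify_window(text: str, pattern: str, k: int, start: int) -> list[str]:
    """Prefixes of the window of size m + k starting at start (1-based) within k errors."""
    window = text[start - 1:start - 1 + len(pattern) + k]
    return [window[:i] for i in range(1, len(window) + 1)
            if edit_distance(window[:i], pattern) <= k]


T, P, k = "GACACGGACCAAAGCAG", "CAAG", 1
probable = [3, 10, 13, 15, 16]
for start in (3, 4, 10, 11, 12):
    if has_candidate(start, len(P) + k, probable):
        print(start, verify_window(T, P, k, start))
    else:
        print(start, "skipped: no probable position in window")
```

With these inputs, the windows starting at positions 3, 10, 11 and 12 yield the same strings as the walk-through (CACG; CAA, CAAA and CAAAG; AAAG; AAG), and the window starting at position 4 is skipped.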
