Advisor: Prof. R. C. T. Lee Speaker: C. W. Lu

Approximate String Matching Using Compressed Suffix Arrays
Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol. 352, 2006, pp. 240-249

Let x and y be two strings. The edit distance d(x, y) is the minimum number of character insertions, deletions, and replacements needed to convert string x to y.
• k-difference string matching problem:
• Given a text T of length n, a pattern P of length m, and an error bound k.
• Find all positions i of T such that there exists a suffix S of T(1, i) with d(S, P) ≦ k.
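The definitions above can be checked with a small dynamic-programming sketch. It is not the method of the paper (which uses an index on T); it is the classical table-filling approach, and the function names `edit_distance` and `k_difference_ends` are hypothetical names chosen for this illustration.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and replacements turning x into y."""
    prev = list(range(len(y) + 1))              # distances from "" to y[:j]
    for i, cx in enumerate(x, start=1):
        cur = [i]                               # distance from x[:i] to ""
        for j, cy in enumerate(y, start=1):
            cur.append(min(prev[j] + 1,         # delete cx
                           cur[j - 1] + 1,      # insert cy
                           prev[j - 1] + (cx != cy)))  # replace, or match for free
        prev = cur
    return prev[-1]


def k_difference_ends(text: str, pattern: str, k: int) -> list[int]:
    """1-based positions i such that some suffix S of T(1, i) has d(S, P) <= k."""
    m = len(pattern)
    col = list(range(m + 1))                    # column for the empty prefix of T
    ends = []
    for i, ct in enumerate(text, start=1):
        new = [0]                               # an occurrence may start anywhere in T
        for j in range(1, m + 1):
            new.append(min(col[j] + 1,
                           new[j - 1] + 1,
                           col[j - 1] + (pattern[j - 1] != ct)))
        col = new
        if col[m] <= k:
            ends.append(i)
    return ends
```

For instance, `k_difference_ends("abbaaa", "aba", 1)` lists the end positions in T = abbaaa at which an occurrence of P = aba with at most one error ends. This scan takes O(nm) time per query, which is the cost an index-based method tries to avoid.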

The approach of this paper is as follows:
• Given a pattern P and an error bound k, we generate all possible patterns P’ that can be derived from P with at most k errors.
• Then we conduct an exact match of all such P’ against T.

Example: T = abbaaa, P = aba and k = 1. From P and k, we generate the following patterns P’ (over the alphabet {a, b}): ba, aaba, baba, bba, aa, abba, aaa, ab, abaa, abab, abb, aba.
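The list in this example can be reproduced by brute force. The sketch below is only an illustration under the assumption that the alphabet is {a, b}; the names `edits1`, `neighborhood` and `occurring` are hypothetical, and the naive substring test stands in for the index-based exact matching used later in the talk.

```python
def edits1(p: str, alphabet: str) -> set[str]:
    """Strings obtained from p by one insertion, deletion or replacement."""
    out = set()
    for i in range(len(p) + 1):
        for c in alphabet:
            out.add(p[:i] + c + p[i:])              # insertion before position i
        if i < len(p):
            out.add(p[:i] + p[i + 1:])              # deletion of p[i]
            for c in alphabet:
                out.add(p[:i] + c + p[i + 1:])      # replacement (c == p[i] gives p back)
    return out


def neighborhood(p: str, k: int, alphabet: str) -> set[str]:
    """All strings within edit distance k of p."""
    seen = {p}
    frontier = {p}
    for _ in range(k):
        frontier = {q for s in frontier for q in edits1(s, alphabet)} - seen
        seen |= frontier
    return seen


def occurring(patterns: set[str], text: str) -> set[str]:
    """Patterns that match some substring of text exactly (naive check)."""
    return {q for q in patterns if q in text}
```

With `neighborhood("aba", 1, "ab")` the twelve patterns of the example are obtained, and `occurring(..., "abbaaa")` keeps those that appear in T. The number of patterns grows quickly with k, which is why the order in which they are generated matters.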

Then we conduct an exact matching of all P’ against T. Any success indicates that there is a substring S in T such that d(S, P) ≦ k.
• How can we generate all the patterns P’ that we want?
• We use the following observation.

Let S be a substring of T with S = S1S2, and let P = P1P2.
If d(S1, P1) ≦ k and d(S2, P2) = 0, then d(S, P) ≦ k.

Example: k = 2.
T = A C A C A A A A A C A C C (positions 1 to 13)
P = A G A B C A (positions 1 to 6), with P1 = AGAB and P2 = CA.
Consider the substring S = T(6, 11) = AAAACA. Let S1 = T(6, 9) = AAAA and S2 = T(10, 11) = CA. Then d(S1, P1) = 2 ≦ k and d(S2, P2) = 0, so d(S, P) = 2 ≦ k.

Example: k = 2.
T = A C A C A A A A A C A C C (positions 1 to 13)
P = A G A B C A (positions 1 to 6), with P1 = AGAB and P2 = CA.
Consider the substring S = T(8, 11) = AACA. Let S1 = T(8, 9) = AA and S2 = T(10, 11) = CA. Then d(S1, P1) = 2 ≦ k and d(S2, P2) = 0, so d(S, P) = 2 ≦ k.

Based upon the above observation, we can generate all edited patterns P’ by editing a prefix of P and keeping the remaining suffix untouched.
• Consider P = aba, k = 1.

P = aba, k = 1. The edit position i moves from left to right; the pattern marked k = 0 is carried unchanged to the next position.
• i = 1: ba (Deletion), aaba (Insertion), baba (Insertion), bba (Substitution), each with k = 1; aba with k = 0.
• i = 2: aa (Deletion), aaba (Insertion), abba (Insertion), aaa (Substitution), each with k = 1; aba with k = 0.
• i = 3: ab (Deletion), abaa (Insertion), abba (Insertion), abb (Substitution), each with k = 1; aba with k = 0.
• i = 4: abaa (Insertion), abab (Insertion), each with k = 1.

P = aba, k = 2. The first edit produces the same patterns as in the case k = 1:
• i = 1: ba, aaba, baba, bba.
• i = 2: aa, aaba, abba, aaa.
• i = 3: ab, abaa, abba, abb.
• i = 4: abaa, abab.
Each of these patterns has used one error (k = 1) and can be edited once more.

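P = aba, k = 2. For example, the pattern ba (k = 1, obtained by a deletion at i = 1) is edited further at the positions to the right of that deletion:
• i = 2: a (Deletion), aba (Insertion), bba (Insertion), aa (Substitution), each with k = 2; ba with k = 1.
• i = 3: b (Deletion), baa (Insertion), bba (Insertion), bb (Substitution), each with k = 2; ba with k = 1.
• i = 4: baa (Insertion), bab (Insertion), each with k = 2.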

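For i = 1 to m+1, write P = PLPR, where PR starts at position i, and write the current pattern as P’ = PL’PR’, where k’ = Dist(PL’, PL) ≦ k and Dist(PR’, PR) = 0. At position i we try:
• Deletion of the character at position i, k’++.
• Replacement of the character at position i by a character of the alphabet (A, C, …), k’++.
• Insertion of a character of the alphabet (A, C, …) at position i, k’++.
• No operation: the character at position i is kept and joins PL’.
Terminate if k’ > k.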
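The left-to-right scheme above can be sketched as a recursive generator. This is a minimal illustration, not the authors' code: the name `edited_patterns` is hypothetical, positions are 0-based, and replacements by the same character are skipped because they only reproduce a pattern that the "no operation" branch already covers.

```python
def edited_patterns(p: str, k: int, alphabet: str):
    """Yield (pattern, errors): the prefix is edited, the suffix p[i:] stays untouched."""
    def rec(left: str, i: int, errs: int):
        yield left + p[i:], errs                    # current P' = PL' + untouched suffix
        if errs == k:
            return                                  # no errors left to spend
        for j in range(i, len(p) + 1):
            if j < len(p):
                yield from rec(left, j + 1, errs + 1)               # deletion at j
                for c in alphabet:
                    if c != p[j]:
                        yield from rec(left + c, j + 1, errs + 1)   # replacement at j
            for c in alphabet:
                yield from rec(left + c, j, errs + 1)               # insertion at j
            if j < len(p):
                left += p[j]                        # no operation: keep p[j], move right
    yield from rec("", 0, 0)
```

Collecting `{q for q, _ in edited_patterns("aba", 1, "ab")}` gives the patterns listed for P = aba, k = 1. The same pattern can be produced more than once (for example abba), so a caller would typically deduplicate.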

Our problem now becomes the following: given a pattern P, we produce a modified pattern P’. Our job is to determine whether P’ exactly matches some substring of T.
• For example, suppose P = aba. We have ba as one of the modified patterns, so we would like to find out whether ba matches exactly with a substring of T.

This exact matching can be carried out by using the suffix array and the inverse suffix array of T.

Suffix Array
• Let T = t0t1…tn, where t0, t1, …, tn-1 are symbols of an alphabet A and tn = $ is a special symbol that is not in A and is smaller than any symbol in A.
• The jth suffix of T is defined as T(j, n) = tj…tn and is denoted by Tj.
• The suffix array SA[0..n] of T is an array of the integers j, each representing the suffix Tj, sorted in lexicographic order of the corresponding suffixes.

Example:
i      0 1 2 3 4 5 6 7 8 9
T      G A C A G T T C G $
Suffixes of T: {GACAGTTCG$, ACAGTTCG$, CAGTTCG$, AGTTCG$, GTTCG$, TTCG$, TCG$, CG$, G$, $}
Lexicographic order: $, ACAGTTCG$, AGTTCG$, CAGTTCG$, CG$, G$, GACAGTTCG$, GTTCG$, TCG$, TTCG$
= T9, T1, T3, T2, T7, T8, T0, T4, T6, T5
i      0 1 2 3 4 5 6 7 8 9
SA[i]  9 1 3 2 7 8 0 4 6 5
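The example can be reproduced with a few lines of Python. This is a naive sketch that sorts the suffixes directly, so it is far from the linear-time constructions; it relies on the fact that in ASCII the character $ is smaller than every letter. The name `suffix_array` is chosen for this illustration.

```python
def suffix_array(text: str) -> list[int]:
    """Start positions of the suffixes of text, in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda j: text[j:])


T = "GACAGTTCG$"
SA = suffix_array(T)
for i, j in enumerate(SA):
    print(i, j, T[j:])      # rank i, start position SA[i], suffix T_SA[i]
```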

Inverse Suffix Array
• The inverse suffix array of T is denoted by SA-1.
• SA-1[i] equals the number of suffixes that are lexicographically smaller than Ti.

Example: T = GACAGTTCG$ (positions 0 to 9).
Lexicographic order: $ (T9), ACAGTTCG$ (T1), AGTTCG$ (T3), CAGTTCG$ (T2), CG$ (T7), G$ (T8), GACAGTTCG$ (T0), GTTCG$ (T4), TCG$ (T6), TTCG$ (T5).
i        0 1 2 3 4 5 6 7 8 9
SA[i]    9 1 3 2 7 8 0 4 6 5
SA-1[i]  6 1 3 2 7 9 8 4 5 0
SA-1[0] = 6 because there are 6 suffixes smaller than T0 = GACAGTTCG$.
SA-1[SA[x]] = x.
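Given SA, the inverse array follows directly from the identity SA-1[SA[x]] = x. A minimal sketch (the name `inverse_suffix_array` is an assumption of this illustration):

```python
def inverse_suffix_array(sa: list[int]) -> list[int]:
    """inv[j] = rank of suffix T_j, i.e. the number of suffixes smaller than T_j."""
    inv = [0] * len(sa)
    for rank, j in enumerate(sa):
        inv[j] = rank
    return inv


SA = [9, 1, 3, 2, 7, 8, 0, 4, 6, 5]         # suffix array of GACAGTTCG$
INV = inverse_suffix_array(SA)
```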

SA and SA-1 each take O(n log n) bits. Both data structures can be constructed in linear time [13, 15, 17].

In this paper, an interval [st..ed] is called the range of the suffix array of T corresponding to a string P if [st..ed] is the largest interval such that P is a prefix of every suffix Tj for j = SA[st], SA[st+1], …, SA[ed]. We write [st..ed ] = range(T, P).

Example: T = GACAGTTCG$, P = G.
Lexicographic order: $ (T9), ACAGTTCG$ (T1), AGTTCG$ (T3), CAGTTCG$ (T2), CG$ (T7), G$ (T8), GACAGTTCG$ (T0), GTTCG$ (T4), TCG$ (T6), TTCG$ (T5).
i      0 1 2 3 4 5 6 7 8 9
SA[i]  9 1 3 2 7 8 0 4 6 5
G is a prefix of T8, T0 and T4. Since T8 = TSA[5], T0 = TSA[6] and T4 = TSA[7], we have st = 5, ed = 7 and range(T, P) = [5..7].
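One simple way to compute range(T, P) is a pair of binary searches over the suffix array, comparing the first |P| characters of each suffix with P. The sketch below (with the hypothetical name `sa_range`) takes O(|P| log n) time per call.

```python
def sa_range(text: str, sa: list[int], p: str):
    """Return (st, ed) = range(T, P), or None if P does not occur in T."""
    def prefix(i: int) -> str:
        return text[sa[i]:sa[i] + len(p)]   # first |P| characters of the i-th smallest suffix

    lo, hi = 0, len(sa)
    while lo < hi:                          # first suffix whose prefix is >= P
        mid = (lo + hi) // 2
        if prefix(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    st, hi = lo, len(sa)
    while lo < hi:                          # first suffix whose prefix is > P
        mid = (lo + hi) // 2
        if prefix(mid) <= p:
            lo = mid + 1
        else:
            hi = mid
    return (st, lo - 1) if st <= lo - 1 else None


T = "GACAGTTCG$"
SA = sorted(range(len(T)), key=lambda j: T[j:])
```

Here `sa_range(T, SA, "G")` yields the interval of the example. The lemmas that follow avoid recomputing such an interval from scratch when a pattern is extended or concatenated.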

Lemma 1 (Gusfield [12]) Given a text T together with its suffix array, assume [st..ed] = range(T, P). Then, for any character c, the interval [st’..ed’] = range(T, Pc) can be computed in O(log n) time.

Lemma 2 Given the interval [st1..ed1] = range(T , P1) and the interval [st2..ed2] = range(T , P2), we can find the interval [st..ed] = range(T , P1P2) in O(logn) time using the suffix array and the inverse suffix array of T.

Let [st1..ed1] = range(T , P1), [st2..ed2] = range(T , P2), [st..ed] = range(T , P1P2). • [st..ed] is a subinterval of [st1..ed1].

Example: T = GACAGTTCG$, P1 = G, P2 = A.
Lexicographic order: $ (T9), ACAGTTCG$ (T1), AGTTCG$ (T3), CAGTTCG$ (T2), CG$ (T7), G$ (T8), GACAGTTCG$ (T0), GTTCG$ (T4), TCG$ (T6), TTCG$ (T5).
i      0 1 2 3 4 5 6 7 8 9
SA[i]  9 1 3 2 7 8 0 4 6 5
range(T, P1) = [5..7].
range(T, P1P2) must be within [5..7]. How can we find the exact interval within [5..7]?

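By the definition of the suffix array, the suffixes TSA[st1], TSA[st1+1], …, TSA[ed1] are in increasing lexicographic order.
• Since all of them begin with P1, the suffixes TSA[st1]+|P1|, TSA[st1+1]+|P1|, …, TSA[ed1]+|P1|, obtained by deleting that prefix, are also in increasing lexicographic order.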

T2 = CAGTTCG$ and T2+1 = T3 = AGTTCG$: T2+1 is obtained by deleting the prefix of length 1 from T2. In general, Ti+1 is obtained by deleting the prefix of length 1 from Ti.

Example: T = GACAGTTCG$, P1 = G, P2 = A.
i      0 1 2 3 4 5 6 7 8 9
SA[i]  9 1 3 2 7 8 0 4 6 5
range(T, P1) = [5..7], so T8 < T0 < T4.
• Deleting the prefix P1 = G gives T8+1, T0+1, T4+1, that is, T9, T1, T5.
• T9 < T1 < T5.

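The suffixes TSA[st1]+|P1|, TSA[st1+1]+|P1|, …, TSA[ed1]+|P1| are also in increasing lexicographic order.
• Thus SA-1[SA[st1]+|P1|] < SA-1[SA[st1+1]+|P1|] < … < SA-1[SA[ed1]+|P1|].
• To find st and ed, we use binary search on [st1..ed1] to find the smallest st such that st2 ≦ SA-1[SA[st]+|P1|] and the largest ed such that SA-1[SA[ed]+|P1|] ≦ ed2.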

Example: T = GACAGATCG$, P1 = G, P2 = A.
Lexicographic order: $ (T9), ACAGATCG$ (T1), AGATCG$ (T3), ATCG$ (T5), CAGATCG$ (T2), CG$ (T7), G$ (T8), GACAGATCG$ (T0), GATCG$ (T4), TCG$ (T6).
i        0 1 2 3 4 5 6 7 8 9
SA[i]    9 1 3 5 2 7 8 0 4 6
SA-1[i]  7 1 4 2 8 3 9 5 6 0
range(T, P1) = [6..8].
range(T, P2) = [1..3].
range(T, P1P2) = [st..ed] with 6 ≦ st, ed ≦ 8.
SA-1[SA[6]+1] = SA-1[9] = 0, SA-1[SA[7]+1] = SA-1[1] = 1 and SA-1[SA[8]+1] = SA-1[5] = 3. Only the last two values lie in [1..3], so st = 7 and ed = 8.
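The computation in this example can be sketched with two binary searches over [st1..ed1], using the fact that SA-1[SA[i]+|P1|] increases with i on that interval. The function name `concat_range` and its argument layout are assumptions made for this illustration; the sketch also assumes that P1 does not contain $.

```python
def concat_range(sa, inv, r1, r2, len1):
    """Given r1 = range(T, P1), r2 = range(T, P2) and len1 = |P1|,
    return range(T, P1P2) as (st, ed), or None if P1P2 does not occur."""
    (st1, ed1), (st2, ed2) = r1, r2

    def rank_after(i: int) -> int:
        return inv[sa[i] + len1]            # rank of the suffix left after removing P1

    lo, hi = st1, ed1 + 1
    while lo < hi:                          # smallest st with rank_after(st) >= st2
        mid = (lo + hi) // 2
        if rank_after(mid) < st2:
            lo = mid + 1
        else:
            hi = mid
    st, hi = lo, ed1 + 1
    while lo < hi:                          # first index with rank_after > ed2
        mid = (lo + hi) // 2
        if rank_after(mid) <= ed2:
            lo = mid + 1
        else:
            hi = mid
    return (st, lo - 1) if st <= lo - 1 else None


T = "GACAGATCG$"
SA = sorted(range(len(T)), key=lambda j: T[j:])
INV = [0] * len(SA)
for rank, j in enumerate(SA):
    INV[j] = rank
```

Calling `concat_range(SA, INV, (6, 8), (1, 3), 1)` corresponds to P1 = G and P2 = A in the example. Each search inspects O(log n) entries, in line with the bound stated in Lemma 2.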

To find the interval of the first character of P: we construct an array C such that, for any c in A, C[c] stores the total number of occurrences in T of all characters c’ in A with c’ ≦ c. Then range(T, p1) = [C[c2]+1 .. C[c]], where c = p1 is the first character of P and c2 is the character immediately before c in A (take C[c2] = 0 if c is the smallest character of A).

Example: T = GACAGTTCG$.
Lexicographic order: $ (T9), ACAGTTCG$ (T1), AGTTCG$ (T3), CAGTTCG$ (T2), CG$ (T7), G$ (T8), GACAGTTCG$ (T0), GTTCG$ (T4), TCG$ (T6), TTCG$ (T5).
i      0 1 2 3 4 5 6 7 8 9
SA[i]  9 1 3 2 7 8 0 4 6 5
C[A] = 2, C[C] = 4, C[G] = 7, C[T] = 9.
P = GACAGCA, so p1 = G and range(T, p1) = [C[C]+1 .. C[G]] = [5..7].
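The array C of this example is a running sum of character counts. A minimal sketch, where `build_c` and `first_char_range` are hypothetical names and the alphabet is passed explicitly as a string without $:

```python
from collections import Counter


def build_c(text: str, alphabet: str) -> dict[str, int]:
    """C[c] = number of occurrences in text of alphabet characters c' <= c."""
    counts = Counter(text)
    c_array, total = {}, 0
    for ch in sorted(alphabet):
        total += counts[ch]
        c_array[ch] = total
    return c_array


def first_char_range(c_array: dict[str, int], alphabet: str, ch: str):
    """range(T, ch) = [C[c2]+1 .. C[ch]], with C[c2] = 0 for the smallest character."""
    order = sorted(alphabet)
    idx = order.index(ch)
    before = c_array[order[idx - 1]] if idx > 0 else 0
    return (before + 1, c_array[ch])


C = build_c("GACAGTTCG$", "ACGT")
```

If a character does not occur in T, the returned interval has st > ed, which a caller would treat as empty.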

Lemma 3 Given the suffix array and the inverse suffix array of T, assume [st..ed] = range(T, P). For any character c, if the array C has been computed in advance, we can find the interval [st’..ed’] = range(T, cP) in O(log n) time.

I. Construct Fst[1..m+1] and Fed[1..m+1] such that [Fst[i]..Fed[i]] = range(T, P[i..m]).
II. Call kapproximate([0..n], 1, 0, ε, ε).

kapproximate([s’..e’], i, k’, PL’, Υ)
begin
  1. Given [Fst[i]..Fed[i]] = range(T, P[i..m]) and [s’..e’] = range(T, PL’), by Lemma 2 find [st..ed] = range(T, PL’P[i..m]).
  2. Report occurrences of P∗ = PL’P[i..m] in [st..ed] if the interval exists.
  3. If (k’ = k) return.
  4. For j := i to m+1
     (a) (when j ≦ m, deletion at j)
         Call kapproximate([s’..e’], j+1, k’+1, PL’, dΥ).
     (b) (when j ≦ m, replacement at j) for each c in A
         i. Given [s’..e’] = range(T, PL’), by Lemma 1 find [s’’..e’’] = range(T, PL’c).
         ii. Call kapproximate([s’’..e’’], j+1, k’+1, PL’c, rΥ).
     (c) (insertion at j) for each c in A
         i. Given [s’..e’] = range(T, PL’), by Lemma 1 find [s’’..e’’] = range(T, PL’c).
         ii. Call kapproximate([s’’..e’’], j, k’+1, PL’c, iΥ).
     (d) (when j ≦ m)
         Given [s’..e’] = range(T, PL’), by Lemma 1 find [s’’..e’’] = range(T, PL’P[j]).
         s’ := s’’; e’ := e’’; PL’ := PL’P[j]; Υ := uΥ;
end
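The procedure above can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the authors' implementation: the suffix array is built by naive sorting, the ranges of the pattern suffixes are computed by repeated one-character extension rather than by Lemma 3, the edit trace Υ is omitted, positions are 0-based, and the names `narrow` and `k_approximate` are hypothetical. The text must end with $, and neither the pattern nor the alphabet may contain $.

```python
def narrow(lo, hi, key, low, high):
    """Largest sub-interval of [lo..hi] with low <= key(i) <= high, or None.
    Assumes key is non-decreasing on [lo..hi]."""
    a, b = lo, hi + 1
    while a < b:                                    # first index with key >= low
        mid = (a + b) // 2
        if key(mid) < low:
            a = mid + 1
        else:
            b = mid
    st, b = a, hi + 1
    while a < b:                                    # first index with key > high
        mid = (a + b) // 2
        if key(mid) <= high:
            a = mid + 1
        else:
            b = mid
    return (st, a - 1) if st <= a - 1 else None


def k_approximate(text: str, pattern: str, k: int, alphabet: str) -> set[int]:
    """0-based start positions of substrings of text within edit distance k of pattern."""
    n, m = len(text), len(pattern)
    sa = sorted(range(n), key=lambda j: text[j:])   # naive construction
    inv = [0] * n
    for rank, j in enumerate(sa):
        inv[j] = rank

    def extend(rng, depth, c):                      # Lemma 1: range(T, Xc) from range(T, X), |X| = depth
        if rng is None:
            return None
        return narrow(rng[0], rng[1], lambda r: text[sa[r] + depth], c, c)

    f = []                                          # f[i] = range(T, pattern[i:])
    for i in range(m + 1):
        rng = (0, n - 1)
        for d, c in enumerate(pattern[i:]):
            rng = extend(rng, d, c)
        f.append(rng)

    hits = set()

    def rec(rng, depth, i, errs):
        # rng = range(T, PL'), depth = |PL'|, i = first untouched position of the pattern
        if rng is None:
            return
        if f[i] is not None:                        # Lemma 2: range(T, PL' + pattern[i:])
            full = narrow(rng[0], rng[1], lambda r: inv[sa[r] + depth], f[i][0], f[i][1])
            if full is not None:
                hits.update(sa[full[0]:full[1] + 1])
        if errs == k:
            return
        for j in range(i, m + 1):
            if j < m:
                rec(rng, depth, j + 1, errs + 1)                            # deletion at j
                for c in alphabet:
                    rec(extend(rng, depth, c), depth + 1, j + 1, errs + 1)  # replacement at j
            for c in alphabet:
                rec(extend(rng, depth, c), depth + 1, j, errs + 1)          # insertion at j
            if j < m:
                rng = extend(rng, depth, pattern[j])                        # no operation
                depth += 1
                if rng is None:
                    return                          # PL' no longer occurs in T

    rec((0, n - 1), 0, 0, 0)
    return hits
```

For example, `k_approximate("abbaaa$", "aba", 1, "ab")` collects the start positions of the substrings of T = abbaaa that match one of the edited patterns. The early return when the interval of PL’ becomes empty is a pruning choice of this sketch: no longer pattern with the same prefix can occur in T.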

After an O(n)-time preprocessing of the text T into an O(n log n)-bit data structure, the algorithm solves the k-difference problem in O(|A|^k m^k log n + output time) time.

References
• [1] A. Amir, D. Keselman, G.M. Landau, M. Lewenstein, N. Lewenstein, M. Rodeh, Indexing and dictionary matching with one error, in: Proc. Sixth WADS, Lecture Notes in Computer Science, vol. 1663, Springer, Berlin, 1999, pp. 181–192.
• [2] A. Amir, M. Lewenstein, E. Porat, Faster algorithms for string matching with k mismatches, in: Proc. 11th Ann. ACM-SIAM Symp. on Discrete Algorithms, 2000, pp. 794–803.
• [3] R.A. Baeza-Yates, G. Navarro, A faster algorithm for approximate string matching, in: Proc. Seventh Ann. Symp. on Combinatorial Pattern Matching (CPM’96), pp. 1–23.
• [4] R.A. Baeza-Yates, G. Navarro, A practical index for text retrieval allowing errors, in: CLEI, vol. 1, November 1997, pp. 273–282.
• [5] R. Boyer, S. Moore, A fast string matching algorithm, CACM 20 (1977) 762–772.
• [6] A.L. Buchsbaum, M.T. Goodrich, J. Westbrook, Range searching over tree cross products, in: ESA 2000, pp. 120–131.
• [7] A. Cobbs, Fast approximate matching using suffix trees, in: Proc. Sixth Ann. Symp. on Combinatorial Pattern Matching (CPM’95), Lecture Notes in Computer Science, vol. 807, Springer, Berlin, 1995, pp. 41–54.
• [8] R. Cole, L.A. Gottlieb, M. Lewenstein, Dictionary matching and indexing with errors and don’t cares, in: Proc. 36th Ann. ACM Symp. on Theory of Computing, 2004, pp. 91–100.
• [9] P. Ferragina, G. Manzini, Opportunistic data structures with applications, in: Proc. 41st IEEE Symp. on Foundations of Computer Science (FOCS’00), 2000, pp. 390–398.

• [10] G. Gonnet, A tutorial introduction to computational biochemistry using Darwin, Technical Report, Informatik E.T.H., Zurich, Switzerland, 1992.
• [11] R. Grossi, J.S. Vitter, Compressed suffix arrays and suffix trees with applications to text indexing and string matching, in: Proc. 32nd ACM Symp. on Theory of Computing, 2000, pp. 397–406.
• [12] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, Cambridge, 1997.
• [13] W.K. Hon, K. Sadakane, W.K. Sung, Breaking a time-and-space barrier in constructing full-text indices, in: Proc. IEEE Symp. on Foundations of Computer Science, 2003.
• [14] P. Jokinen, E. Ukkonen, Two algorithms for approximate string matching in static texts, in: Proc. MFCS’91, Lecture Notes in Computer Science, vol. 520, Springer, Berlin, 1991, pp. 240–248.
• [15] D.K. Kim, J.S. Sim, H. Park, K. Park, Linear-time construction of suffix arrays, in: CPM 2003, pp. 186–199.
• [16] D.E. Knuth, J. Morris, V. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (1977) 323–350.
• [17] P. Ko, S. Aluru, Space efficient linear time construction of suffix arrays, in: CPM 2003, pp. 200–210.
• [18] G.M. Landau, U. Vishkin, Fast parallel and serial approximate string matching, J. Algorithms 10 (1989) 157–169.
• [19] U. Manber, G. Myers, Suffix arrays: a new method for on-line string searches, SIAM J. Comput. 22 (5) (1993) 935–948.

• [20] E.M. McCreight, A space economical suffix tree construction algorithm, J. ACM 23 (2) (1976) 262–272.
• [21] G. Navarro, A guided tour to approximate string matching, ACM Comput. Surveys 33 (1) (2001) 31–88.
• [22] G. Navarro, R.A. Baeza-Yates, A new indexing method for approximate string matching, in: Proc. 10th Ann. Symp. on Combinatorial Pattern Matching (CPM’99), pp. 163–185.
• [23] G. Navarro, R.A. Baeza-Yates, A hybrid indexing method for approximate string matching, J. Discrete Algorithms 1 (1) (2000) 205–239.
• [24] G. Navarro, R. Baeza-Yates, E. Sutinen, J. Tarhio, Indexing methods for approximate string matching, IEEE Data Eng. Bull. 24 (4) (2001) 19–27.
• [25] G. Navarro, E. Sutinen, J. Tanninen, J. Tarhio, Indexing text with approximate q-grams, in: Proc. 11th Ann. Symp. on Combinatorial Pattern Matching, Lecture Notes in Computer Science, vol. 1848, Springer, Berlin, 2000.
• [26] K. Sadakane, T. Shibuya, Indexing huge genome sequences for solving various problems, Genome Informatics 12 (2001) 175–183.
• [27] F. Shi, Fast approximate string matching with q-blocks sequences, in: Proc. Third South American Workshop on String Processing (WSP’96), Carleton University Press, 1996.
• [28] E. Sutinen, J. Tarhio, Filtration with q-samples in approximate string matching, in: Proc. Seventh Ann. Symp. on Combinatorial Pattern Matching (CPM’96), pp. 50–63.
• [29] E. Ukkonen, Approximate matching over suffix trees, in: Proc. Combinatorial Pattern Matching 1993, vol. 4, Springer, Berlin, June 1993, pp. 228–242.
• [30] R.A. Wagner, M.J. Fischer, The string-to-string correction problem, J. ACM 21 (1974) 168–173.

Thank you!
