This study presents an effective algorithm for solving the indexing problem, efficiently locating all occurrences of a pattern in a given text, allowing for at most one mismatch. By constructing suffix and prefix trees from the text, the algorithm identifies matches through precise node querying. The processing steps are delineated, including tree construction and interval intersections for range queries. This method demonstrates robust performance in approximate string matching tasks, establishing a foundation for further research in algorithmic string processing.
Text Indexing and Dictionary Matching with One Error Amir, A., Keselman, D., Landau, G. M., et al., Journal of Algorithms, 37, 2000, pp. 309-325 Adviser: R. C. T. Lee Speaker: C. W. Cheng
Problem Definition • The Indexing Problem: • Input: A text T of length n over alphabet Σ, a pattern P of length m over alphabet Σ, and an integer k. • Output: All occurrences of P in T with at most k mismatches.
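As a baseline for the problem statement above, here is a minimal brute-force sketch that compares the pattern with every window of the text. It is not the indexing algorithm of the paper, only a reference against which faster methods can be checked. The function name `hamming_matches` is an assumption made for illustration.

```python
def hamming_matches(text, pattern, k):
    """Return the 1-based start positions where pattern matches text
    with at most k mismatches (brute force, O(n * m) time)."""
    n, m = len(text), len(pattern)
    hits = []
    for s in range(n - m + 1):
        # Count the positions where the window and the pattern differ.
        mismatches = sum(1 for a, b in zip(text[s:s + m], pattern) if a != b)
        if mismatches <= k:
            hits.append(s + 1)
    return hits


print(hamming_matches("actgacctcagctta", "ctaa", 1))
```

No preprocessing is done here, so every query rescans the whole text; the slides that follow trade preprocessing time for faster queries.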
Main idea • In this algorithm, we construct a suffix tree and a prefix tree for the text T (the prefix tree is the suffix tree of the reversed text). For each j = 1, 2, …, m, we look up the prefix P1 … Pj-1, read backwards, in the prefix tree and the suffix Pj+1 … Pm in the suffix tree. If both exist and they occur in T on either side of the same text position, P matches T there with at most one error, at position j of P.
Processing • 1. Construct a suffix tree ST of the text string T and a suffix tree STR of the string TR, where TR = tn … t1 is the reversed text.
Ex: T=AGCAGAT TR=TAGACGA
Processing • 2. For each of the suffix trees, link all leaves of the suffix tree in a left-to-right order.
Ex: T=AGCAGAT TR=TAGACGA
Processing • 3. For each of the suffix trees, set pointers from each tree node v to its leftmost leaf vl and its rightmost leaf vr in the linked list.
Ex: T=AGCAGAT TR=TAGACGA
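The leftmost and rightmost leaf pointers mean that the leaves under any node form one contiguous run of the linked list. A minimal sketch of that idea follows, assuming the children of every node are ordered alphabetically so that the leaf list is simply the sorted list of suffixes. It uses a sorted list and binary search in place of a real suffix tree, and the name `leaf_interval` is hypothetical.

```python
from bisect import bisect_left, bisect_right


def leaf_interval(text, x):
    """Return (vl, vr), the 1-based ranks of the leftmost and rightmost leaves
    under the node that spells x, or None if x does not occur in text."""
    # Sorted suffixes stand in for the left-to-right linked list of leaves.
    suffixes = sorted(text[i:] for i in range(len(text)))
    # Truncating every suffix to len(x) keeps the list in non-decreasing order,
    # so the suffixes that start with x form one contiguous block.
    heads = [s[:len(x)] for s in suffixes]
    lo = bisect_left(heads, x)
    hi = bisect_right(heads, x)
    return (lo + 1, hi) if lo < hi else None


print(leaf_interval("actgacctcagctta", "a"))
print(leaf_interval("actgacctcagctta", "taa"))
```

Building the sorted list this way costs far more than a linear-time suffix tree construction; the sketch only illustrates why two pointers per node are enough to describe all leaves below it.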
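Processing • 4. Designate each leaf in ST by the starting location of its suffix. Designate each leaf in STR by n – i + 3, where i is the starting position of the leaf’s suffix in TR. With this labeling, a leaf of ST and a leaf of STR carry the same label exactly when the prefix ends two positions before the suffix starts, which leaves one text position in between for the mismatch.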
Ex: T=AGCAGAT TR=TAGACGA
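The leaf labels of steps 2 to 4 can be sketched as two arrays, one per tree. This is an illustration under the assumption that leaves are linked in alphabetical order of their suffixes; sorting the suffixes directly replaces the actual tree construction. The names `leaf_order` and `build_arrays` are hypothetical.

```python
def leaf_order(s):
    """1-based suffix start positions of s in left-to-right leaf order
    (assumed here to be lexicographic order of the suffixes)."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def build_arrays(text):
    """Return (V, W): leaf labels of ST and of STR in leaf order."""
    n = len(text)
    rev = text[::-1]
    V = leaf_order(text)                      # ST: label = start of the suffix in T
    W = [n - i + 3 for i in leaf_order(rev)]  # STR: label = n - i + 3
    return V, W


V, W = build_arrays("AGCAGAT")
print(V)
print(W)
```

For the later example T=actgacctcagctta, the first four entries of V produced this way are 15, 5, 1, 10, the leaves under the node for "a".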
Query Processing • For j = 1, …, m do • 1. Find node v, the location of Pj+1 … Pm in ST, if such a node exists. • 2. Find node w, the location of Pj-1 … P1 in STR, if such a node exists. • 3. If v and w both exist, let V[vl … vr] and W[wl … wr] be the labels of the leaves under v and w, and find their intersection I. If I is not empty, an approximate match occurs at Ti-j … Ti-j+m-1 for every i ∈ I.
Example Ex: T=actgacctcagctta P=ctaa k=1
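The query loop can be sketched with plain substring searches standing in for the two tree lookups. This is an assumption-laden illustration, not the paper's implementation: the sets V and W are computed by scanning the text, W uses the "end of prefix plus two" labeling from preprocessing step 4, and the helper names `occurrences` and `one_mismatch_query` are hypothetical. Exact matches are also reported, because the character aligned with position j is never inspected.

```python
def occurrences(text, sub):
    """1-based start positions of sub in text (every position if sub is empty)."""
    n, k = len(text), len(sub)
    return [s + 1 for s in range(n - k + 1) if text[s:s + k] == sub]


def one_mismatch_query(text, pattern):
    """1-based starts of windows of text that match pattern with <= 1 mismatch."""
    m = len(pattern)
    starts = set()
    for j in range(1, m + 1):            # j = pattern position allowed to mismatch
        suffix = pattern[j:]             # P[j+1..m]
        prefix = pattern[:j - 1]         # P[1..j-1]
        # V: text positions i where the suffix P[j+1..m] starts.
        V = set(occurrences(text, suffix))
        # W: a prefix occurrence starting at s ends at s + j - 2;
        #    its label is that end position plus two, i.e. s + j.
        W = {s + j for s in occurrences(text, prefix)}
        for i in V & W:
            starts.add(i - j)            # the occurrence begins at T[i - j]
    return sorted(starts)


print(one_mismatch_query("actgacctcagctta", "ctaa"))
```

Collecting the starts in a set avoids reporting the same window once per value of j when the window matches the pattern exactly.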
Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of T
Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of TR
Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of T j=1 v=Pj+1…Pm=taa w=Pj-1…P1=ε V[vl….vr]=∅
Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of TR j=1 v=Pj+1…Pm=taa w=Pj-1…P1=ε V[vl….vr]=∅ W[wl….wr]={3,12,…,14} I=∅
Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of T j=2 v=Pj+1…Pm=aa w=Pj-1…P1=c V[vl….vr]=∅
Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of TR j=2 v=Pj+1…Pm=aa w=Pj-1…P1=c V[vl….vr]=∅ W[wl….wr]={4,8,9,14,11} I=∅
Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of T j=3 v=Pj+1…Pm=a w=Pj-1…P1=tc V[vl….vr]={15,5,1,10}
Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of TR j=3 v=Pj+1…Pm=a w=Pj-1…P1=tc V[vl….vr]={15,5,1,10} W[wl….wr]={5,10,15} I={15,5,10}
When j=3, the intersection of V={15,5,1,10} and W={5,10,15} is I={5,10,15}. Therefore approximate matches occur at Ti-j…Ti-j+m-1 for all i ∈ I: T2…T5, T7…T10, T12…T15. T=actgacctcagctta P=ctaa
Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of T j=4 v=Pj+1…Pm=ε w=Pj-1…P1=atc V[vl….vr]={15,5,…,13}
Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of TR j=4 v=Pj+1…Pm=ε w=Pj-1…P1=atc V[vl….vr]={15,5,…,13} W[wl….wr]=∅ I=∅
Range Query Problem • In step 3 of the query processing, given nodes v and w, we want to find the leaf labels that appear both in the interval [vl … vr] and in the interval [wl … wr], where the four end points of the two intervals are the pointers set in step 3 of the preprocessing. Thus, we are seeking a solution to the range query problem.
Problem Definition of Range Query • Input: Let V=[v1,v2 … vn] and W=[w1,w2 … wn] be two permutation arrays, where n is the number of elements. Four constants i,j,k and l, where both i+k < n and j+l < n. • Output: Find the intersection of elements of V[i … i+k] and W[j … j+l].
Example: V=[8,5,1,4,3,7,6,2] W=[3,6,4,7,2,1,5,8] i=3, k=3, j=2, l=4 Output: the intersection of V[v3,v4,v5,v6] and W[w2,w3,w4,w5,w6]
Preprocessing [Figure: the arrays V and W drawn against the index axes 1 to 8, with the queried ranges highlighted.] The intersection of V[v3,v4,v5,v6] and W[w2,w3,w4,w5,w6] is {1,4,7}.
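The range query can be illustrated with a naive sketch that records where each element sits in W and then filters the requested slice of V. Viewing every element as a point (position in V, position in W) turns the question into a rectangle query on a grid; the code below answers it by direct scanning rather than with an efficient grid structure, and the name `range_intersection` and its explicit inclusive bounds are assumptions for illustration.

```python
def range_intersection(V, W, v_lo, v_hi, w_lo, w_hi):
    """Elements common to V[v_lo..v_hi] and W[w_lo..w_hi].
    All bounds are 1-based and inclusive; V and W are permutations
    of the same elements."""
    # Position of every element in W, so membership in the W range is O(1).
    pos_in_w = {x: idx for idx, x in enumerate(W, start=1)}
    return sorted(x for x in V[v_lo - 1:v_hi] if w_lo <= pos_in_w[x] <= w_hi)


V = [8, 5, 1, 4, 3, 7, 6, 2]
W = [3, 6, 4, 7, 2, 1, 5, 8]
print(range_intersection(V, W, 3, 6, 2, 6))
```

The running time of this version grows with the length of the V range rather than with the size of the answer, which is the cost that a dedicated range searching structure is meant to avoid.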
Time Complexity of Range Query Problem • By using Overmars’ algorithm, the range query problem can be solved with preprocessing time … and query time …, where k is the number of points in the range. [O88] Overmars, M. H., Efficient data structures for range searching on a grid, J. Algorithms 9, 1988, pp. 254-275.
Time Complexity • For the indexing problem, the preprocessing time is … and the query can be implemented in … time, where tocc is the number of occurrences of the pattern in the text with one error.
Problem Definition • The Dictionary Matching Problem • Input: • 1. A dictionary P = {p1, …, ps}, where pi, i = 1, …, s, are patterns over alphabet Σ, and d = |p1| + … + |ps| is the sum of the lengths of all the dictionary patterns. • 2. A text T of length n over alphabet Σ. • 3. An integer k. • Output: • All occurrences of any dictionary patterns in T with at most k mismatches.
Main idea • In this algorithm, we construct a suffix tree and a prefix tree for D, the concatenation of all patterns in the dictionary (the prefix tree is the suffix tree of the reversed string DR). For each j = 1, 2, …, n, we trace the text to the left of tj, read backwards as tj-1 … t1, in the prefix tree and the text to the right of tj, tj+1 … tn, in the suffix tree. A dictionary pattern occurs with one error at tj when one of its prefixes is matched on the left and the complementary suffix is matched on the right.
Processing • 1. Construct a suffix tree SD of string D and suffix tree SDR of the string DR, where D is the concatenation of all dictionary patterns, with a separator at the end of each pattern, and where DR is the reversal of string D.
Example: P={tca,gctga,gca} D=TCA$GCTGA$GCA$ DR=ACG$AGTCG$ACT$
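The two strings of the example can be produced as follows. This sketch follows the example as shown, where DR lists the reversed patterns in reverse order with the separator still placed after each pattern, so it differs from a character-by-character reversal of D only in where the separators fall. The function name `build_dictionary_strings` is hypothetical.

```python
def build_dictionary_strings(patterns, sep="$"):
    """Return (D, DR) for a list of dictionary patterns."""
    # D: every pattern followed by the separator.
    D = "".join(p + sep for p in patterns)
    # DR: patterns reversed and taken in reverse order, separator after each.
    DR = "".join(p[::-1] + sep for p in reversed(patterns))
    return D, DR


D, DR = build_dictionary_strings(["TCA", "GCTGA", "GCA"])
print(D)
print(DR)
```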
Suffix Tree of D (SD) Example: P={tca,gctga,gca} D=TCA$GCTGA$GCA$ DR=ACG$AGTCG$ACT$
Suffix Tree of DR (SDR) Example: P={tca,gctga,gca} D=TCA$GCTGA$GCA$ DR=ACG$AGTCG$ACT$
Processing • 2. Modify suffix trees SD and SDR, respectively, as follows. For each separator which is treefirst but not edgefirst, i.e., it appears on an edge (u,v) labeled σ$σ’, where σ≠ε, break (u,v) into (u,w) and (w,v). Label (u,w) with σ and (w,v) with $σ’.
Suffix Tree of D (SD) Example: P={tca,gctga,gca} D=TCA$GCTGA$GCA$ DR=ACG$AGTCG$ACT$
Suffix Tree of DR (SDR) Example: P={tca,gctga,gca} D=TCA$GCTGA$GCA$ DR=ACG$AGTCG$ACT$
Preprocessing • 3. Scan suffix tree SD, respectively SDR, and modify it as follows. For each vertex v, consider the associated string L(v), i.e., the string from the root to v. Label v with all the locations of the pattern suffixes, resp. prefixes, that are equal to L(v). To implement this, note that all the relevant suffixes share the prefix L(v)$. So go to the edge (v,w) whose label begins with $, if one exists, and scan the subtree rooted at w to find all relevant suffixes.
Suffix Tree of D (SD) Example: P={tca,gctga,gca} D=TCA$GCTGA$GCA$ DR=ACG$AGTCG$ACT$
Suffix Tree of DR (SDR) Example: P={tca,gctga,gca} D=TCA$GCTGA$GCA$ DR=ACG$AGTCG$ACT$
Query Processing • For j = 1,…., n do • 1. Find node v, the location of the longest prefix of tj+1 … tn in SD. • 2. Find node w, the location of the longest prefix of tj-1 … t1 in SDR. • 3. Find intersection of markings of nodes on the path from the root to v in SD and on the path from the root to w in SDR.
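The effect of this query loop can be sketched by brute force, without the two trees or their node markings. For each text position j the code tries every pattern and every pattern position q that could be aligned with tj, and checks that the pattern prefix matches on the left and the pattern suffix matches on the right. The function name `dictionary_one_mismatch` and the (pattern, start) output format are assumptions for illustration; exact occurrences are reported as well, since the character at tj is never compared.

```python
def dictionary_one_mismatch(text, patterns):
    """Return sorted (pattern, 1-based start) pairs for every occurrence of a
    dictionary pattern in text with at most one mismatch."""
    n = len(text)
    found = set()
    for j in range(1, n + 1):                  # j = text position allowed to mismatch
        for p in patterns:
            for q in range(1, len(p) + 1):     # q = pattern position aligned with t_j
                start = j - q + 1              # 1-based start of the occurrence
                end = start + len(p) - 1       # 1-based end of the occurrence
                if start < 1 or end > n:
                    continue
                left_ok = text[start - 1:j - 1] == p[:q - 1]   # prefix left of t_j
                right_ok = text[j:end] == p[q:]                # suffix right of t_j
                if left_ok and right_ok:
                    found.add((p, start))
    return sorted(found)


print(dictionary_one_mismatch("acagccga", ["tca", "gctga", "gca"]))
```

The marked suffix trees replace the two inner loops: the paths from the root to v and to w already list which pattern suffixes and prefixes match around tj, so only their intersection has to be computed.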
Example T=acagccga P={tca,gctga,gca} k=1
Suffix Tree of D (SD) Example: P={tca,gctga,gca} D=tca$gctga$gca$ DR=acg$agtcg$act$ T=acagccga
Suffix Tree of DR (SDR) Example: P={tca,gctga,gca} D=tca$gctga$gca$ DR=acg$agtcg$act$ T=acagccga