Text Indexing and Dictionary Matching with One Error

Text Indexing and Dictionary Matching with One Error. Amir, A., Keselman, D., Landau, G. M., et al., Journal of Algorithms, 37, 2000, pp. 309-325. Adviser: R. C. T. Lee. Speaker: C. W. Cheng.

Problem Definition
• The Indexing Problem:
• Input: A text T of length n over alphabet Σ, a pattern P of length m over alphabet Σ, and an integer k.
• Output: All occurrences of P in T with at most k mismatches.
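As a reference point for the definition above, here is a minimal brute-force sketch that compares P against every window of T. It is not the indexing algorithm from the paper, only a baseline for what the output should contain; the function name `occurrences_with_mismatches` is hypothetical, and positions are reported 1-based to match the slides.

```python
def occurrences_with_mismatches(T, P, k):
    """Return 1-based start positions where P matches T with at most k mismatches."""
    n, m = len(T), len(P)
    starts = []
    for s in range(n - m + 1):
        # Count positions where the window T[s:s+m] differs from P.
        mismatches = sum(1 for a, b in zip(T[s:s + m], P) if a != b)
        if mismatches <= k:
            starts.append(s + 1)  # convert to 1-based
    return starts


print(occurrences_with_mismatches("actgacctcagctta", "ctaa", 1))  # [2, 7, 12]
```

This takes time proportional to n times m per query, which is the cost the index is meant to avoid.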

Main idea • In this algorithm, we construct a suffix tree and a prefix tree (a suffix tree of the reversed text) for the text T. For each j = 1, 2, …, m, we look up the prefix P1,j-1 in the prefix tree and the suffix Pj+1,m in the suffix tree. If both are found at compatible text positions, an approximate match with one error occurs.

Processing • 1. Construct a suffix tree ST of the text string T and a suffix tree STR of the string TR, where TR = tn … t1 is the reversed text.

Ex： T=AGCAGAT TR=TAGACGA

Processing • 2. For each of the suffix trees, link all leaves of the suffix tree in a left-to-right order.

Processing • 3. For each of the suffix trees, set pointers from each tree node v to its leftmost leaf vl and rightmost leaf vr in the linked list.

Processing • 4. Designate each leaf in ST by the starting location of its suffix. Designate each leaf in STR by n – i + 3, where i is the starting position of the leaf’s suffix in TR.
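Steps 2 to 4 can be illustrated without building an actual suffix tree. The sketch below assumes that the children of every node are ordered alphabetically, with the end of the string sorting first. Under that assumption the left-to-right leaf list is the list of suffixes in sorted order, and the leaves below the node for a string form one contiguous interval. The helper names `leaf_order` and `leaf_interval` are hypothetical, and sorting the suffixes directly is a slow stand-in for a real suffix tree construction.

```python
def leaf_order(text):
    """Suffix start positions (1-based) in lexicographic order.

    This stands in for the linked list of leaves read from left to right.
    """
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def leaf_interval(text, leaves, piece):
    """Return (vl, vr), the 0-based inclusive range of leaves whose suffix
    starts with piece, or None when piece does not occur in text."""
    hits = [k for k, i in enumerate(leaves) if text.startswith(piece, i - 1)]
    return (hits[0], hits[-1]) if hits else None


T = "actgacctcagctta"
TR = T[::-1]
n = len(T)

st_leaves = leaf_order(T)                         # label = start position in T
str_leaves = [n - i + 3 for i in leaf_order(TR)]  # label = n - i + 3

vl, vr = leaf_interval(T, leaf_order(T), "a")
wl, wr = leaf_interval(TR, leaf_order(TR), "tc")
print(st_leaves[vl:vr + 1])   # [15, 5, 1, 10]
print(str_leaves[wl:wr + 1])  # [5, 10, 15]
```

With these labels, a value i that appears under both nodes means that the suffix part of the pattern starts at text position i and the prefix part ends at text position i-2, leaving position i-1 for the single mismatch.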

Query Processing
• For j = 1, …, m do
• 1. Find node v, the location of Pj+1 … Pm in ST, if such a node exists.
• 2. Find node w, the location of Pj-1 … P1 in STR, if such a node exists.
• 3. If v and w both exist, the leaf values under v and w are V[vl … vr] and W[wl … wr]. Find the intersection I of V[vl … vr] and W[wl … wr]. If I is nonempty, an approximate match occurs at Ti-j … Ti-j+m-1 for all i ∈ I.
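The query loop can be sketched as follows. This is only an illustration under simplifying assumptions: a naive scan (the hypothetical helper `occurrences`) replaces the suffix tree lookups, and a Python set intersection replaces the range query. It also assumes that a value i in the intersection corresponds to a match starting at text position i-j, which is what the labels n – i + 3 imply. Exact occurrences are reported too, since they satisfy "at most one mismatch".

```python
def occurrences(text, piece):
    """1-based start positions of piece in text (naive stand-in for a tree lookup)."""
    return [s + 1 for s in range(len(text) - len(piece) + 1)
            if text.startswith(piece, s)]


def one_mismatch_starts(T, P):
    """Return 1-based start positions where P matches T with at most one mismatch."""
    n, m = len(T), len(P)
    TR = T[::-1]
    starts = set()
    for j in range(1, m + 1):           # j = pattern position allowed to mismatch
        suffix = P[j:]                  # P_{j+1} ... P_m
        rev_prefix = P[:j - 1][::-1]    # P_{j-1} ... P_1
        V = set(occurrences(T, suffix))
        W = {n - i + 3 for i in occurrences(TR, rev_prefix)}
        for i in V & W:                 # i = text position aligned with P_{j+1}
            starts.add(i - j)
    return sorted(starts)


print(one_mismatch_starts("actgacctcagctta", "ctaa"))  # [2, 7, 12]
```

Only j=3 contributes here: V={1,5,10,15}, W={5,10,15}, so I={5,10,15} and the matches start at positions 2, 7 and 12.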

Example Ex: T=actgacctcagctta P=ctaa k=1

Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of T

Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of TR

Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of T j=1 v=Pj+1…Pm=taa w=Pj-1…P1=ε V[vl … vr]=∅

Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of TR j=1 v=Pj+1…Pm=taa w=Pj-1…P1=ε V[vl … vr]=∅ W[wl … wr]={3,12,…,14} I=∅

Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of T j=2 v=Pj+1…Pm=aa w=Pj-1…P1=c V[vl … vr]=∅

Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of TR j=2 v=Pj+1…Pm=aa w=Pj-1…P1=c V[vl … vr]=∅ W[wl … wr]={4,8,9,14,11} I=∅

Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of T j=3 v=Pj+1…Pm=a w=Pj-1…P1=tc V[vl … vr]={15,5,1,10}

Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of TR j=3 v=Pj+1…Pm=a w=Pj-1…P1=tc V[vl … vr]={15,5,1,10} W[wl … wr]={5,10,15} I={5,10,15}

When j=3, the intersection of V={15,5,1,10} and W={5,10,15} is I={5,10,15}. Therefore an approximate match occurs at Ti-j … Ti-j+m-1 for all i ∈ I: T2 … T5, T7 … T10, and T12 … T15. T=actgacctcagctta P=ctaa

Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of T j=4 v=Pj+1…Pm=ε w=Pj-1…P1=atc V[vl … vr]={15,5,…,13}

Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of TR j=4 v=Pj+1…Pm=ε w=Pj-1…P1=atc V[vl … vr]={15,5,…,13} W[wl … wr]=∅ I=∅

Range Query Problem • In step 3, given nodes v and w, we want to find the leaves that appear both in interval [vl … vr] and in the interval [wl … wr], where the four end points of the two intervals are defined in step P.3 of the preprocessing. Thus, we are seeking a solution to the range query problem.

Problem Definition of Range Query
• Input: Let V=[v1, v2, …, vn] and W=[w1, w2, …, wn] be two permutation arrays, where n is the number of elements, and four constants i, j, k and l, where both i+k < n and j+l < n.
• Output: The intersection of the elements of V[i … i+k] and W[j … j+l].

Example: V=[8,5,1,4,3,7,6,2] W=[3,6,4,7,2,1,5,8] i=3, k=3, j=2, l=4 Output: the intersection of V[v3,v4,v5,v6] and W[w2,w3,w4,w5,w6]

Preprocessing • [Figure: V and W drawn against axes numbered 1 to 8.] • The intersection of V[v3,v4,v5,v6] and W[w2,w3,w4,w5,w6] is {1,4,7}.
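One way to read this example is as a range search on a grid: each element becomes a point whose x-coordinate is its position in V and whose y-coordinate is its position in W, and the query asks for the points inside a rectangle. The sketch below assumes that interpretation and simply filters all points, so it shows the reduction rather than an efficient data structure. The function name `range_intersection` and its explicit 1-based inclusive bounds are hypothetical choices.

```python
def range_intersection(V, W, v_lo, v_hi, w_lo, w_hi):
    """Elements that lie in V[v_lo..v_hi] and in W[w_lo..w_hi] (1-based, inclusive)."""
    pos_in_w = {value: y for y, value in enumerate(W, start=1)}
    # One grid point per element: (position in V, position in W).
    points = [(x, pos_in_w[value], value) for x, value in enumerate(V, start=1)]
    return sorted(value for x, y, value in points
                  if v_lo <= x <= v_hi and w_lo <= y <= w_hi)


V = [8, 5, 1, 4, 3, 7, 6, 2]
W = [3, 6, 4, 7, 2, 1, 5, 8]
print(range_intersection(V, W, 3, 6, 2, 6))  # [1, 4, 7]
```

Element 3 is in V[v3 … v6] but sits at position 1 of W, outside the second range, so it is not reported.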

Time Complexity of Range Query Problem • By using Overmars' algorithm, the range query problem can be solved with preprocessing time […] and query time […], where k is the number of points in the range.
[O88] Overmars, M. H., Efficient data structures for range searching on a grid, J. Algorithms 9, 1988, pp. 254-275.

Time Complexity • For the indexing problem, the preprocessing time is […] and the query can be implemented in […], where tocc is the number of occurrences of the pattern in the text with one error.

The Dictionary Matching Problem

Problem Definition
• The Dictionary Matching Problem
• Input:
• 1. A dictionary P = {p1, …, ps}, where pi, i = 1, …, s, are patterns over alphabet Σ, and […] is the sum of the lengths of all the dictionary patterns.
• 2. A text T of length n over alphabet Σ.
• 3. An integer k.
• Output:
• All occurrences of any dictionary pattern in T with at most k mismatches.

Main idea • In this algorithm, we construct a suffix tree and a prefix tree for D, the concatenation of all patterns in the dictionary. For each j = 1, 2, …, n, we match the text prefix T1,j-1 (read backwards) in the prefix tree and the text suffix Tj+1,n in the suffix tree. If the two matches agree on a dictionary pattern, an approximate match with one error occurs.

Processing • 1. Construct a suffix tree SD of string D and suffix tree SDR of the string DR, where D is the concatenation of all dictionary patterns, with a separator at the end of each pattern, and where DR is the reversal of string D.

Example： P={tca,gctga,gca} D=TCA$GCTGA$GCA$ DR=ACG$AGTCG$ACT$
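The two strings in this example can be reproduced with a few lines. Note that DR as written on the slide is not the character-by-character reversal of D (that would start with the separator). It is the reversed patterns in reverse order, each again followed by a separator. The sketch below follows the slide's convention; the function name `build_d_and_dr` is hypothetical and lowercase letters are used.

```python
def build_d_and_dr(patterns, sep="$"):
    """Concatenate the patterns into D and the reversed patterns into DR.

    Each pattern (or reversed pattern) is followed by the separator.
    """
    D = "".join(p + sep for p in patterns)
    DR = "".join(p[::-1] + sep for p in reversed(patterns))
    return D, DR


D, DR = build_d_and_dr(["tca", "gctga", "gca"])
print(D)   # tca$gctga$gca$
print(DR)  # acg$agtcg$act$
```

Apart from the position of one separator, DR is the reversal of D: reversing D gives `$acg$agtcg$act`.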

Suffix Tree of D (SD) Example： P={tca,gctga,gca} D=TCA$GCTGA$GCA$ DR=ACG$AGTCG$ACT$

Suffix Tree of DR (SDR) Example： P={tca,gctga,gca} D=TCA$GCTGA$GCA$ DR=ACG$AGTCG$ACT$

Processing • 2. Modify suffix trees SD and SDR, respectively, as follows. For each separator that is treefirst but not edgefirst, i.e., it appears on an edge (u,v) labeled σ$σ′ where σ≠ε, break (u,v) into (u,w) and (w,v). Label (u,w) with σ and (w,v) with $σ′.

Processing • 3. Scan suffix tree SD (respectively SDR) and modify it as follows. For each vertex v, consider the associated string L(v), i.e., the string from the root to v. Label v with all the locations of the pattern suffixes (resp. prefixes) that are equal to L(v). To implement this, note that all the relevant suffixes share the prefix L(v)$. So go to the edge (v,w) whose label begins with $, if one exists, and scan the subtree rooted at w to find all relevant suffixes.
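The effect of this labelling step on SD can be pictured as a table from strings to locations in D. The sketch below builds that table directly from the patterns instead of scanning a tree, so it only shows which markings a vertex v with a given L(v) would receive. The function name `suffix_markings` is hypothetical, and locations are 1-based positions in D with one separator after each pattern.

```python
from collections import defaultdict


def suffix_markings(patterns):
    """Map each pattern suffix to the 1-based locations in D where it starts."""
    marks = defaultdict(list)
    offset = 0  # 0-based start of the current pattern inside D
    for p in patterns:
        for k in range(len(p)):
            marks[p[k:]].append(offset + k + 1)
        offset += len(p) + 1  # skip the pattern and its separator
    return dict(marks)


marks = suffix_markings(["tca", "gctga", "gca"])
print(marks["a"])   # [3, 9, 13]
print(marks["ca"])  # [2, 12]
```

For D=tca$gctga$gca$, the vertex with L(v)=a is marked with 3, 9 and 13, the three places where a pattern suffix equal to "a" starts, each followed by a separator.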

Query Processing
• For j = 1, …, n do
• 1. Find node v, the location of the longest prefix of tj+1 … tn in SD.
• 2. Find node w, the location of the longest prefix of tj-1 … t1 in SDR.
• 3. Find the intersection of the markings of the nodes on the path from the root to v in SD and on the path from the root to w in SDR.

Example T=acagccga P={tca,gctga,gca} k=1

Suffix Tree of D (SD) Example： P={tca,gctga,gca} D=tca$gctga$gca$ DR=acg$agtcg$act$ T=acagccga

Suffix Tree of DR (SDR) Example： P={tca,gctga,gca} D=tca$gctga$gca$ DR=acg$agtcg$act$ T=acagccga
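For this example, the expected output of the dictionary query can be checked with a naive sketch that follows the same idea: fix the text position j that is allowed to mismatch, then require the part of the pattern before it and the part after it to match exactly. The trees and the intersection of markings are replaced by direct string comparisons, so this is a check of the result, not the paper's data structure. The function name `dictionary_matches_one_mismatch` is hypothetical.

```python
def dictionary_matches_one_mismatch(T, patterns):
    """Return sorted (pattern, 1-based start) pairs with at most one mismatch."""
    n = len(T)
    found = set()
    for j in range(1, n + 1):               # j = text position allowed to mismatch
        for p in patterns:
            for q in range(1, len(p) + 1):  # q = pattern position aligned with j
                start = j - q + 1
                end = start + len(p) - 1
                if start < 1 or end > n:
                    continue
                before_ok = T[start - 1:j - 1] == p[:q - 1]  # p_1 .. p_{q-1}
                after_ok = T[j:end] == p[q:]                 # p_{q+1} .. p_|p|
                if before_ok and after_ok:
                    found.add((p, start))
    return sorted(found)


print(dictionary_matches_one_mismatch("acagccga", ["tca", "gctga", "gca"]))
# [('gca', 1), ('gca', 4), ('gctga', 4), ('tca', 1)]
```

For instance, gctga matches T4 … T8 = gccga with a single mismatch at its third character.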
