
String Edit Distance Matching Problem With Moves


Presentation Transcript


  1. String Edit Distance Matching Problem With Moves • Graham Cormode • S. Muthukrishnan • November 2001

  2. Pattern Matching • Text T of length n • Pattern P of length m • Our goal: find “good” matches of P in T, as measured by some edit distance function d(·,·) • For every i = 1, 2, …, n find: D[i] = min over j of d(P, T[i…j]), the cost of the best match of P starting at position i of T.

  3. Pattern Matching example • T = “assasin”, P = “ssi” • D[1] = 2: align “_ssi” (or “ss i”) against the start of “assasin” • D[3] = 1: align “s_si” against “sasi…” • Naively, computing every D[i] takes O(mn^3) time.
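To make the problem statement concrete, here is a brute-force sketch in Python. It uses the classic Levenshtein distance (insertions, deletions, replacements only; the substring moves introduced on the next slide are not modelled) and simply tries every substring starting at each position. The function names are illustrative, not from the paper.

```python
def levenshtein(x: str, y: str) -> int:
    """Classic O(|x|*|y|) edit distance (insert/delete/replace only)."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                 # delete cx
                           cur[j - 1] + 1,              # insert cy
                           prev[j - 1] + (cx != cy)))   # replace / match
        prev = cur
    return prev[-1]

def naive_D(text: str, pattern: str) -> list[int]:
    """D[i] = min over j >= i of d(pattern, text[i..j]) -- brute force."""
    n = len(text)
    return [min(levenshtein(text[i:j], pattern) for j in range(i, n + 1))
            for i in range(n)]

print(naive_D("assasin", "ssi"))  # D[0] = 2 and D[2] = 1 (0-based), as on the slide
```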

  4. The main idea • d(X,Y) = smallest number of operations to turn X into Y. • The operations are insertion, deletion, replacement of a character, and moves of a substring. • The idea is to approximate D[i] up to a factor of O(log n log* n).

  5. General algorithm • Embed the string edit distance into the L1 distance between vectors, up to an O(log n log* n) factor: • compute the vector with a single pass over the string • find the vector representation of O(n) substrings • do all this in O(n log n) time. (A deterministic algorithm with an approximate result.)

  6. Edit Sensitive Parsing (ESP) for the embedding • We want to parse the string so that edit operations have only a limited effect on the parsing (an edit operation on character i changes the parsing only in the “neighborhood” of i). For example: “abcbbabcbdfgj” parses as “abcbb | abc | b | dfgj”, while “xyzbbabcbdfgj” parses as “xyzbb | abc | b | dfgj”.

  7. Edit Sensitive Parsing (ESP) for the embedding • In practice, find landmarks in the strings, based only on their locality, and parse by them. • Local maxima are good landmarks (e.g. in “abcegiklmrtabc”), but they may be far apart for large alphabets, so we will first reduce the alphabet.

  8. Edit Sensitive Parsing (ESP) – choosing the landmarks • We use a technique called alphabet reduction:
  Text:   c   a   b   a   g   e   f   a   c   e   d
  Binary: 010 000 001 000 110 100 101 000 010 100 011
  Label:  –   2   1   0   3   2   1   0   3   2   3
  • Label(A[i]) = 2l + bit(l, A[i]), where l is the position of the first bit in which A[i] and A[i−1] differ, and bit(l, A[i]) is the value of that bit in A[i]. (The first character has no left neighbour, so it gets no label.)
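A minimal sketch of one alphabet-reduction pass, following the label formula above. Taking l to be the lowest-order differing bit, and leaving the first character unlabelled, are assumptions of this sketch.

```python
def reduce_once(labels: list[int]) -> list[int]:
    """One alphabet-reduction pass: label[i] = 2*l + bit(l, A[i])."""
    out = [labels[0]]                          # first character: no left neighbour, kept as-is
    for prev, cur in zip(labels, labels[1:]):
        diff = prev ^ cur                      # differing bits (adjacent characters are distinct)
        l = (diff & -diff).bit_length() - 1    # position of the lowest-order differing bit
        out.append(2 * l + ((cur >> l) & 1))   # 2*l + value of that bit in the current character
    return out

# "cabagefaced" with a=000, b=001, ..., g=110, as in the Binary row above:
text = [ord(c) - ord('a') for c in "cabagefaced"]
print(reduce_once(text)[1:])    # -> [2, 1, 0, 3, 2, 1, 0, 3, 2, 1]
```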

  9. Edit Sensitive Parsing (ESP) – alphabet reduction (cont.) • In each iteration the alphabet is reduced from Σ to an alphabet of size 2 log|Σ|. • After log*|Σ| iterations we get |Σ| < 7. • Then we reduce from six labels {0,…,5} to three {0,1,2}, ensuring no adjacent pair becomes identical: replace each 5, then each 4, then each 3 with the smallest value in {0,1,2} that differs from both neighbours. Final iteration: 2 1 0 3 2 1 0 3 2 3 → After reduction: 2 1 0 1 2 1 0 1 2 0
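A sketch of this final reduction step, under the replacement rule spelled out above (each 5, then 4, then 3 becomes the smallest value in {0, 1, 2} that differs from both of its current neighbours):

```python
def reduce_to_three(labels: list[int]) -> list[int]:
    out = list(labels)
    for big in (5, 4, 3):                       # eliminate 5s, then 4s, then 3s
        for i, v in enumerate(out):
            if v == big:
                left = out[i - 1] if i > 0 else None
                right = out[i + 1] if i + 1 < len(out) else None
                out[i] = min(c for c in (0, 1, 2) if c != left and c != right)
    return out

print(reduce_to_three([2, 1, 0, 3, 2, 1, 0, 3, 2, 3]))
# -> [2, 1, 0, 1, 2, 1, 0, 1, 2, 0], matching the slide's "After reduction" row
```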

  10. Edit Sensitive Parsing (ESP) – alphabet reduction (cont.) • Properties of the final labels: • The final alphabet is {0,1,2}. • No adjacent pair is identical. • The reduction takes log*|Σ| iterations. • Each label depends on O(log*|Σ|) characters to its left.

  11. Edit Sensitive Parsing (ESP) – so how do we choose the landmarks? • For repeats, parse in a regular way: aaaaaaa → (aaa)(aa)(aa). • For varying substrings, use alphabet reduction and then mark landmarks as follows:

  12. Edit Sensitive Parsing (ESP) – so how do we choose the landmarks? • Consider the final labels and mark any character that is a local maximum (greater than both its left and right neighbour):
  Text:  c a b a g e f a c e d
  Final: – 2 1 0 1 2 1 0 1 2 0

  13. Edit Sensitive Parsing (ESP) – so how do we choose the landmarks? • Consider the final labels and mark any character that is a local maximum (greater than both its left and right neighbour):
  Text:  c a b a g e f a c e d
  Final: – 2 1 0 1 2 1 0 1 2 0
  • Then mark any local minimum that is not adjacent to an already-marked character.

  14. Edit Sensitive Parsing (ESP) – so how do we choose the landmarks? • Consider the final labels and mark any character that is a local maximum (greater than both its left and right neighbour):
  Text:  c a b a g e f a c e d
  Final: – 2 1 0 1 2 1 0 1 2 0
  • Then mark any local minimum that is not adjacent to an already-marked character. • Clearly, the distance between marked characters is then 2 or 3.
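A sketch of the two marking passes just described; how the boundary characters are handled is an assumption of this sketch.

```python
def mark_landmarks(labels: list[int]) -> list[bool]:
    n = len(labels)
    marked = [False] * n
    # pass 1: local maxima (strictly greater than both neighbours)
    for i in range(1, n - 1):
        if labels[i] > labels[i - 1] and labels[i] > labels[i + 1]:
            marked[i] = True
    # pass 2: local minima, only if neither neighbour is already marked
    for i in range(1, n - 1):
        if (labels[i] < labels[i - 1] and labels[i] < labels[i + 1]
                and not marked[i - 1] and not marked[i + 1]):
            marked[i] = True
    return marked

final = [2, 1, 0, 1, 2, 1, 0, 1, 2, 0]     # final labels from the slide
print([i for i, m in enumerate(mark_landmarks(final)) if m])
# -> [2, 4, 6, 8]: maxima at 4 and 8, minima at 2 and 6; gaps of 2, consistent with the 2-or-3 claim
```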

  15. Edit Sensitive Parsing (ESP) – what did we achieve so far? • By now the whole string has been arranged into pairs and triples. • The important outcome is that two strings with small edit distance are parsed into “very close” arrangements (the parsing of each character depends only on an O(log* n) neighborhood).

  16. Edit Sensitive Parsing (ESP) – constructing the ESP tree • We will now re-label each pair or triple. This can be done by building a dictionary (Karp–Miller–Rosenberg) or by hashing (Karp–Rabin, 1987): Hash(w[0…m−1]) = (integer value of w[0…m−1]) mod q, for some large q.
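A minimal sketch of the dictionary-style re-labelling: every distinct pair or triple is assigned one fresh integer name, so equal blocks get equal names. (A Karp–Rabin hash, h(w) = (integer value of w) mod q for a large prime q, could be used instead, as the slide notes.) The blocks in the example are illustrative.

```python
def relabel(blocks: list[tuple]) -> list[int]:
    names: dict[tuple, int] = {}
    out = []
    for b in blocks:
        if b not in names:
            names[b] = len(names)     # first time we see this block: assign a new name
        out.append(names[b])
    return out

# blocks of 2-3 characters produced by the parsing step (illustrative input):
print(relabel([('a', 'b'), ('c', 'a', 'b'), ('a', 'b'), ('d', 'e')]))  # -> [0, 1, 0, 2]
```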

  17. Edit Sensitive Parsing (ESP) – constructing the ESP tree • In O(n log n) construction time we get a 2-3 tree (the ESP tree).

  18. How do we represent a 2-3 tree as a vector? • The vector records the frequency of occurrence of each (level, label) pair.
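A sketch of this embedding, assuming the ESP tree is represented simply as a list of levels (level 0 = the characters) with the node labels at each level; the tiny example trees are made up for illustration, not produced by a real ESP run.

```python
from collections import Counter

def esp_vector(levels: list[list]) -> Counter:
    """V(X): count how often each (level, label) pair occurs in the tree."""
    return Counter((lvl, lab) for lvl, labels in enumerate(levels) for lab in labels)

def l1_distance(v: Counter, w: Counter) -> int:
    return sum(abs(v[k] - w[k]) for k in set(v) | set(w))

# tiny illustrative trees:
vx = esp_vector([list("abab"), ["AB", "AB"], ["R"]])
vy = esp_vector([list("abab"), ["AB", "AB"], ["S"]])
print(l1_distance(vx, vy))   # -> 2: the trees differ only in the root label
```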

  19. Proof of correctness • Theorem: (1/2)·d(X,Y) ≤ ||V(X) − V(Y)||_1 ≤ O(log n log* n)·d(X,Y)

  20. Upper bound proof: ||V(X) − V(Y)||_1 ≤ O(log n log* n)·d(X,Y) • Insert/change/delete a character: at most log* n + c nodes change per tree level. • Move a substring: the only changes are at the fringes, i.e. at most 4(log* n + c) nodes change per tree level. • Since there are log n levels in the tree: • Conclusion: each operation changes V by O(log n log* n), and there are d(X,Y) operations.

  21. Lower bound proof: d(X,Y) ≤ 2·||V(X) − V(Y)||_1 • The idea is to transform X into Y using at most 2·||V(X) − V(Y)||_1 operations: we want to keep hold of large pieces of the string that are common to both X and Y, so we go through X and protect enough pieces that will be needed in Y.

  22. Lower bound proof (cont.) • We avoid making any changes inside the protected pieces. • At the first level of the tree we add or remove characters as needed (if a character appears in Y but not in X, we add it to the end of X), so that ||V0(X) − V0(Y)||_1 = 0. • We proceed inductively up the tree: to make any node at level i, we need to move at most 2 nodes from level i−1 (we know level i−1 has enough nodes, i.e. ||Vi−1(X) − Vi−1(Y)||_1 = 0).

  23. Lower bound proof: example • X and Y are shown as ESP trees (figure); Counter = 0 • Mark the protected pieces of X (subtrees that also occur in Y).

  24. Lower bound proof: example (cont.) • Remove and add characters as needed; the counter goes from 0 to 1 (a deletion) to 2 (an insertion).

  25. Lower bound proof: example (cont.) • Move up to level 2, moving nodes at level 1 as needed; the counter goes from 2 to 3 (one move).

  26. Lower bound proof: example (cont.) • Move up to level 3, moving nodes at level 2 as needed; the counter goes from 3 to 5 (two moves). One node is already in place and does not move.

  27. Lower bound proof: example (cont.) • The transformation is complete with Counter = 5 operations. That’s it!

  28. Lower bound proof (example with d(X,Y) = 1) • Did we achieve what we wanted? • 2·||V(Y) − V(X)||_1 = 2·(#red nodes + #green nodes) = 18, and we certainly created X from Y in fewer than 18 operations.

  29. Application to string matching • To find D[i], we need to compare P against all possible substrings of T; we can reduce this to O(n) comparisons: • d(T[1,m], P) ≤ d(T[1,m], T[1,r]) + d(T[1,r], P) = |r − m| + d(T[1,r], P) ≤ 2·d(T[1,r], P), since we need at least |r − m| operations just to make T[1,r] the same length as P. • So we only need to consider the O(n) substrings of length m, and we get a 2-approximation of the optimal matching!
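A small numeric sanity check of this chain of inequalities. It uses the plain Levenshtein distance (no substring moves) as d, which is enough here because the argument only relies on the triangle inequality and on the fact that the distance between two prefixes is their length difference; this is an illustration, not part of the paper's algorithm.

```python
def lev(x: str, y: str) -> int:
    """Classic edit distance (insert/delete/replace); moves are not modelled."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cx != cy)))
        prev = cur
    return prev[-1]

T, P = "assasin", "ssi"
m = len(P)
for r in range(1, len(T) + 1):   # every prefix T[1..r]
    # d(T[1,m], P) <= |r - m| + d(T[1,r], P) <= 2 * d(T[1,r], P)
    assert lev(T[:m], P) <= abs(r - m) + lev(T[:r], P) <= 2 * lev(T[:r], P)
print("inequality chain holds for every prefix of", T)
```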

  30. Application to string matching – final algorithm • Create ESP trees for T and P. • Find ||V(T[1:m]) − V(P)||_1. • Iteratively compute D[i] ≈ ||V(T[i:i+m−1]) − V(P)||_1. • Overall we get O(n log n) time cost for the whole algorithm, and every D[i] is computed up to a factor of O(log n log* n).
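A sketch of the final driver loop only. To stay self-contained it plugs in a trivial character-frequency counter where the real algorithm uses the ESP-tree vector V(·), so the numbers it produces are not the paper's approximation; it only shows the overall shape (slide a length-m window over T and take an L1 distance against V(P)), and the helper names are made up.

```python
from collections import Counter

def toy_vector(s: str) -> Counter:
    # stand-in for V(s); the real V(s) counts the (level, label) pairs of the ESP tree of s
    return Counter(s)

def l1(v: Counter, w: Counter) -> int:
    return sum(abs(v[k] - w[k]) for k in set(v) | set(w))

def approx_D(text: str, pattern: str) -> list[int]:
    m, vp = len(pattern), toy_vector(pattern)
    return [l1(toy_vector(text[i:i + m]), vp) for i in range(len(text) - m + 1)]

print(approx_D("abcssixyz", "ssi"))   # -> [6, 4, 2, 0, 2, 4, 6]: minimum at the exact match
```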

  31. Application to string matching – final algorithm example.

  32. The End
