Edit Distance

Edit Distance Γιώργος Πιερράκος ADS – NTUA 4 Ιουνίου 2007

Overview • What is edit distance? Why do we care? • A few interesting properties • Some algorithms from the past for finding exact edit distance • More (complicated) algorithms from today for approximate edit distance • Embedding edit distance into l1

What is that? Suppose we have two strings x,y e.g. x = kitten y = sitting and we want to transform x into y. We use edit operations: 1. insertions 2. deletions 3. substitutions

What is that? A closer look: k i t t e n s i t t i n g 1st step: kitten sitten (substitution) 2nd step: sittensittin (substitution) 3rd step: sittinsitting (insertion)

What is that? Can we do better? Answer here is no (obviously) What about: x = darladidirladada y = marmelladara Tough…

Why do we care? A lot of applications depend on the similarity of two strings • Computational Biology: …ATGCATACGATCGATT… …TGCAATGGCTTAGCTA…Animal species from the same family are bound to have more similar DNAs What about evolutionary biology?

Why do we care? • searching keywords through the net: usually by “mtallica” we mean “metallica”:

Definitions • We care about bit-strings only: Σ = {0,1}n • for i..j<n we denote the substring of x: x[i..j] • by xi we denote the i-th bit of x • Associate operations withstring positions: deleting xi↔ isubstituting xi↔ iinserting y ↔ position of y after the insertion • Alignmentτ of x,y is a sequence of edit operations transforming x into y

Definitions • Length of an alignment is the number of its edit operations • Optimal alignmentis one that uses a minimum number of edit operations • Edit Distance of two strings x,y is the length of their optimal alignment: ED(x,y)e.g. ED(kitten, sitting) = 3 • Hamming Distance of two equal length strings x,y is the number of positions forwhich the corresponding symbols are different(xi≠ yi)e.g. HD(kitten, sittin) = 2

Some interesting properties • Triangle Inequality: for any three strings x,y,z of arbitrary lengths: ED(x,y) ≤ ED(x,z) + ED(z,y) • Splitting Inequality: let x, y be strings of lengths n and m respectively. For all i,j:ED(x,y) ≤ ED(x[1..i],y[1..j])+ED(x[i+1..n],y[j+1..m])

Some interesting properties • Some simple upper and lower bounds: Let x, y be strings of lengths n, m (n ≤ m). Then: • ED(x,y) ≤ m • ED(x,y) ≥ m-n • ED(x,y)=0 iff x=y • if m=n, ED(x,y) ≤ HD(x,y) • ED(x,y) ≥ number of characters (not counting duplicates) found in x but not in y

Some interesting properties • Induced alignment property: • insτ(i..j) = number of insertions in interval [i..j] • delτ(i..j) = number of deletions in interval [i..j] • subτ(i..j) = number of sub/tions in interval [i..j]Caution: i,j refer to the initial positions of the string (see the string as an indexed array) • shτ(i..j) = insτ(i..j) - delτ(i..j)Intuitively shτ(i..j) is the induced shift on x[i..j]Define shτ(i) = shτ(1..i) and shτ(0) = 0 • edτ(i..j) is the induced alignment of τ, i.e. the sub-sequence of edit operations in τ that belong in [i..j]

Some interesting properties Consider the strings:x = savvato, y = eviva edτ(4..5)=1 because we insert “i” in pos. 4

Some interesting properties Induced alignment property: For any alignment τ of x and y, for all i ≤ j, edτ(i..j) ≥ ED (x[i..j], y[i+shτ(i-1)..j+ shτ(j)])

A Dynamic Programming Algorithm • Levenshtein ~1965 (Levenshtein distance) • induced alignment property  principle of optimality (not exactly) • use dynamic programming to solve the problem (quite similar to subset sum problem) • Key ideas: • input: strings s, t with lengths n, m respectively • use an (n+1)x(m+1) matrix • the invariant maintained throughout the algorithm is that we can transform the initial segment s[1..i] into t[1..j] using a minimum of d[i,j] operations. • use a traceback algorithm to find the optimal alignment

A Dynamic Programming Algorithm int LevenshteinDistance(char s[1..n], char t[1..m]) int d[0..n, 0..m] for i from 0 ton d[i, 0] := i for j from 1 tom d[0, j] := j for i from 1 ton for j from 1 tom if s[i] = t[j] then cost := 0 else cost := 1 d[i, j] := minimum( d[i-1, j] + 1, // deletion d[i, j-1] + 1, // insertion d[i-1, j-1] + cost // substitution ) return d[n, m]

A Dynamic Programming Algorithm • branches correspond to different optimal alignments • the length of the optimal alignment is 5 • the algorithm runs in O(n2) time and it can be improved so as to use O(n) space

A Dynamic Programming Algorithm • the above algorithm returns the exact value of the edit distance of two strings • there exists a (minor) improvement in the algorithm so as to run in time O(n2/logn) by Masek and Paterson (1980): only theoretical interest • O(n2) is too much! We want linear time. • Even developing subquadratic time algorithms for approximating edit distance within a modest factor has proved quite challenging

Approximating edit distance efficiently • Suppose we know that our strings are either too similar or too dissimilar, namely that their edit distance is either at most k, or greater than l(n,k). Then we can develop a gap algorithm, that decides which of the two holds. • We are going to present three linear algorithms (Yossef, Jayram et. al. FOCS 2004 ): • The first one has l= O((kn)2/3) & works for all strings • The second one has l= O(k2) & works only for strings that are non-repetitive • The third one is an n3/7-approximation algorithm which improves to n1/3 if the strings are non-repetitive

Approximating edit distance efficiently • Why do we care about an efficient gap algorithm? • Such algorithms yield approximation algorithms that are as efficient, with the approximation ratio directly correlated with the gap • They are useful on their own: there exist problems where two strings can be either too similar or too dissimilar

Approximating edit distance efficiently • Our model for the first two algorithms: the sketching model: • two-party public-coin simultaneous messages communication complexity protocol • persons: Alice, Bob and the Referee • goal: to jointly compute f: AxB→C, when Alice has the input and Bob has the input • Alice uses her input a and shared random coins to compute a sketch sA(a) and then sends it to the referee. Bob does the same and sends a sketch sB(b) • the referee uses the sketches and the shared coins to compute the value of the function f(a,b), with a small error probability (constant) • main measure of complexity: the sketches’ size • usually desirable that Alice and Bob are efficient too

Approximating edit distance efficiently • Why the sketching model?Because there already exist sketching algorithms for the Hamming Distance • Idea 1:map the Edit Distance into some form of Hamming Distance and then use the known results for the Hamming Distance • Idea 2:if two strings share a lot of identical substrings in near positions they cannot differ too much.

Gap algorithm for arbitrary strings • Algorithm for arbitrary strings Technique: • Map each string to the multiset of all its (overlapping) substrings. Annotate each substring with an “encoding” of its position in the original string. • Take the characteristic vectors of the two multisets and run a gap algorithm for their Hamming Distances. • Map the results for HD into results for ED

Gap algorithm for arbitrary strings 1st step: • two inputs: the size of the input n the gap parameter k • define a suitable substring length B(n,k)and a suitable window parameter D(n,k) • map: x→Tx and y→Ty. These sets consist of pairs (γ,i), where γ is a length B substring and i is the “encoding” of the substrings position

Gap algorithm for arbitrary strings • Encoding method: round down the starting position of the substring to the nearest multiple of D(n,k) Example: Let x = savvato, B=2 and D=3. Then Tx = {(sa,0),(av,0),(vv,1),(va,1),(at,1),(to,2)} • formally: • Tx = {(x[i..i+B-1],i div D) | i = 1,…n-B+1} • B = n2/3/(2k1/3) and D = n/B

Gap algorithm for arbitrary strings 2nd step: • Take the characteristic vectors u, v of multisets Tx and Ty respectively.To do that impose a (lexicographical) order on the elements of Example: If Tx = {(γ1, i1 ), (γ2, i2 ), (γ1, i3 )} and Ty = {(γ2, i1 ), (γ1, i3 ), (γ4, i5 )} then = {(γ1, i1 ),(γ2, i2 ),(γ1, i3 ),(γ2, i1 ),(γ4, i5 )} order: (γ1, i1 ), (γ1, i3 ), (γ2, i1 ), (γ2, i2 ), (γ4, i5 ) Then u=11010 and v=01101 • But we do not want to sort things (it costs…). So instead of the union consider the set of all (γj, ij )

Gap algorithm for arbitrary strings • a pair is indicative of substrings of x and y that match (i.e. they are identical in terms of contents and appear at nearby positions in x and y) and corresponds to a j such that uj = vj • a pair is indicative of substrings that do not match and corresponds to a j such that uj≠ vj • We now have to estimate the Hamming Distance between u and v

Gap algorithm for arbitrary strings • Hamming Distance can be approximated using constant-size sketches as shown by Kushilevitz, Ostrovsky, Rabani (2000) • Problem: the Hamming Distance instance our problem is reduced to is exponentially long (set of all (γj, ij )). As a result the time to produce the sketches is prohibitive (Alice and Bob are very slow…) • Solution: new improved sketching method which • produces constant size sketches • runs in time proportional to the HD of the instances • solves the k vs (1+ε)k gap HD problem for any ε>0 • Notice that HD(u,v) ≤ O(n): no more than 2n distinct substrings

Gap algorithm for arbitrary strings 3rd step: • We tune the sketching algorithm for HD to determine whether HD(u,v) ≤ 4kB or HD(u,v) ≥ 8kB with constant probability of error. The referee, upon receiving the sketches from Alice and Bob, decides that ED(x,y) ≤ k if he finds HD(u,v) ≤ 4kB and ED(x,y) ≥ 13 (kn)2/3 if he finds HD(u,v) ≥ 8kB. • The algorithm’s correctness follows from: • Lemma 1: If ED(x,y) ≤ k then HD(u,v) ≤ 4kB • Lemma 2: If ED(x,y) ≥ 13 (kn)2/3 then HD(u,v) ≥ 8kB

Gap algorithm for arbitrary strings set map(string x) set T:= empty for i=0 to n div D for j=1 to B T←(x[i*D+j.. i*D+j+B-1]) return T int algorithm_1(string x, string y, int k) int B = n2/3/(2k1/3) int D = n/B set Tx = map(x) set Ty = map(y) (u,v) = characteristic_vectors(Tx , Ty ) if HD(u,v) ≤ 4kB then return (ED(x,y) ≤ k) if HD(u,v) ≥ 8kB then return (ED(x,y) ≥ 13 (kn)2/3 )

Gap algorithm for arbitrary strings Remarks • In the above algorithm Alice and Bob run the procedure map and the referee runs the procedure HD, which is the sketching algorithm for Hamming Distance • Notice that Alice and Bob do not use their random coins. Only the referee uses randomness and has bounded error probability • The algorithm solves the k vs Ω((kn)2/3 ) gap problem for all k ≤ n1/2, in quasi-linear time, using sketches of size O(1) • The idea of independently encoding the position of each substring, though simple, has problems. Consider:x = paramana, y = mana, with B = 2, D = 3

Gap algorithm for nonrepetitive strings • We saw that “encoding” the substrings’ position independently, by choosing an appropriate integer D, can lead to problems: in fact we fail to identify many matches even in the presence of just one edit operation (consider the case of the strings x and 1x) • We overcome this handicap by resorting to a method where the “encodings” are correlated.

Gap algorithm for nonrepetitive strings • Idea:we scan the input from left to right, trying to find “anchor” substrings, i.e. identical substrings that occur in x, y at very near positions. All we need to change is the “encoding” of the position: we now map each substring to the region between successive “anchors” • Example:x = ma …ro ...as, y = *ma****ro***as** the substrings starting at a region between successive anchors have the same encoding

Gap algorithm for nonrepetitive strings • Why does this idea work?Remember that we are dealing with a gap algorithm. Hence, if the input strings are very similar, we expect that a sufficiently short substring, chosen from a sufficiently long window is unlikely to contain edit operations and thus has to be matched with a corresponding substring in y in the same window • And how do Alice and Bob choose identical substrings?Alice and Bob cannot communicate with each other, so they pick the “anchors” randomly. In fact they use some (shared) random permutations (remember the shared coins)

Gap algorithm for nonrepetitive strings • Isn’t this too good to be true?Yes, it is: in order for the random permutations to ensure that “anchors” are detected with high probability we must demand that the input strings are non-repetitive. • A string is called (t,l)-non-repetitive if for any window of size l, the l substrings of length t, which start inside the window are distinct

Gap algorithm for nonrepetitive strings set map(string x, randomness r, int t) //Alice&Bob now use their coins r set T:= empty int c:=1 int i:=1 table of anchors a table of regions reg //pick a sliding window of length W and place it at c for all length t-substrings starting in the interval [c+W..c+2W-1] produce a Karp-Rabin fingerprint using r pick a random permutation of the fingerprints using r a[i] = the substring with the minimal fingerprint permutation i:=i+1 c:=the first character after the ancher for j=1 to i reg[j] = substring starting after last char of a[i-1] till last char of a[i] T ←(reg[i],i) return T

Gap algorithm for nonrepetitive strings int algorithm_2(string x, string y, int k, int t) //t is the non-repetition parameter randomness r set Tx = map(x,r,t) set Ty = map(y,r,t) (u,v) = characteristic_vectors(Tx , Ty ) if HD(u,v) ≤ 3k then return (ED(x,y) ≤ k) if HD(u,v) ≥ 6k then return (ED(x,y) ≥ Ω(tk)2 )

Gap algorithm for nonrepetitive strings • The algorithm’s correctness follows from: • Lemma 1: If ED(x,y) ≤ k then Pr[HD(u,v) ≤ 3k] ≥ 5/6 • Lemma 2: If HD(u,v) ≤ 6kB then ED(x,y) ≤ O(tk)2 • For any 1≤t<n, the algorithm solves the k vs Ω((tk)2) gap problem for all k ≤ (n/t)1/2, for all (t,tk)-non-repetitive strings, in polynomial time, using sketches of size O(1).

Approximation algorithm • Key idea: use of graphs • Given two strings x,y the edit graph GE is a representation of ED(x,y), by means of a directed graph. • The vertices correspond to the edit distances of x[1..i] and y[1..j] for all i,j ≤n. • An edge between vertices corresponds to a single edit operation transforming one substring to the other.

Approximation algorithm • We define the graph G(B) as a (lossy) compression of GE: each vertex corresponds to a pair (i,s), where i=jB, for j=0..n/B and s=-k..k. The bigger parameter B is, the lossier the compression. • Each vertex is closely related with the edit distance of x[1..i] and y[1..i+s] (s denotes the amount by which we shift y with respect to x) • We have two types of edges: • a-type edges from (i,s’) to (i,s) where |s’-s|=1 • b-type edges from (i-B,s) to (i,s) with w(e) w(a-type edges) = 1, w(b-type edges) depends on approximation factor c

Approximation algorithm • In GE the weight of the shortest source-sink path corresponds to the optimal alignment of x, y. • Theorem 1: Given two strings x, y and their corresponding graph G(B), let the shortest path P from (0,0) to (n,0) have weight T. Then ED(x,y) ≤ T and T ≤ (2c+2)k, if ED(x,y) ≤k, where c is the approximation factor affiliated with b-type edge weights.

Approximation algorithm • Only problem: finding the shortest path. • Dijkstra is too slow (i.e. not linear) • If we could figure out the weights of all b-type edges for a given i simultaneously, perhaps we could solve the problem. • But computing b-type edges is the same as finding the approximate edit distances between x[i+1..i+b] and every substring of y[i+1-k..i+B+k] of length B

Approximation algorithm • c(p,t)-edit pattern matching problem: given a pattern string P of length p and a text string of length t ≥ p, produce numbers d1, d2,…dt-p+1, such that di /c ≤ ED(P,T[i..i+p-1]) ≤ di for all i • Theorem 2: Suppose there is an algorithm that can solve the c(p,t)-edit pattern matching problem in time TIME(p,t). Let x, y be two strings and G(B) their corresponding graph. Set p=B, t=B+2k and c=c(p,t). Then the shortest path in G(B) can be used to solve the k vs (2c+2)k edit distance gap problem in time O((k+TIME(p,t))n/B) (i.e. the shortest path can be efficiently computed)

Approximation algorithm graph make_graph(string x, string y, int k, int B) //bigger B→faster algorithm, bigger gap vertices V = empty edges E = empty for j = 0 to n div B //vertices i = j*B for s = -k to k V←(i,s) for j = 0 to n div B //a-type edges i = j*B for s = -k to k EA← ((i,s),(i,s+1),1) //(source, sink, weight) EA← ((i,s),(i,s-1),1) for j = 1 to n div B //b-type edges i = j*B for s = -k to k EB← ((i-B,s),(i,s),w) return (V, EA EB)

Approximation algorithm (d1, d2,…dt-p+1) epm_algorithm(string P, string T) //length(T) ≥ length(P) //returns approximate ED(P,S) for all S: length(P)-substrings of T int fix_weights(int x, int y, int B, int k, graph G, int n) //why n as input…? int d[-k..k] //table of b-type edges weights d = epm_algorithm(x[n-B+1..n], y[n-B+1-k..n+k]) if n-B=0 return d for s = -k to k EB← ((i-B,s),(i,s),d[s]) G’ = update(G, EB) return compose(d, fix_weights(x,y,B,k,G’,n-B)) //what is compose…? • Let T(i,s) denote the shortest path from (0,0) to (i,s) • Notice that fix_weights is recursive: in each call the T(i,s) is simultaneously (fast!) computed for all s. The idea is to compute the shortest path T(n,0) by recursively computing the shortest paths T(i,s). T(i,s) uses the result T(i-B,s). • No more than n/B recursive calls of fix_weights are needed

Approximation algorithm int shortest_path(string x, string y, int k, int B) graph G = make_graph(x,y,k,B) int T = fix_weights(x,y,k,B,G,n) return T • Correctness of the gap algorithm (follows from Theorem 1): • If ED ≤ k then T ≤ (2c+2)k • If ED ≥ (2c+2)k then T ≥ ED ≥ (2c+2)k

Approximation algorithm • Some results which use algorithms for the edit pattern matching problem: • quasi-linear time algorithm for k vs O(k2) • quasi-linear time algorithm for k vs O(k7/4) • quasi-linear time algorithm for k vs O(k3/2) for (k, O(k1/2))-non-repetitive strings • 2&3 imply approximation algorithms with factors n3/7 and n1/3 respectively

Embedding edit distance into l1 • Question: can {0,1}d endowed with edit distance be embedded into l1 (Rd endowed with L1-metric) with low distortion? • Known results up to 2005: • Edit distance cannot be embedded into l1 with distortion less than 3/2 • Edit distance can be trivially embedded into l1 with distortion d • Edit distance can be embedded into l1 with distortion (Ostrovsky, Rabani, STOC 2005) • Why do we care? • Because embeddings allow us to use efficient algorithms of one metric space on instances of another metric space • Because embeddings help us learn more about the structure of a metric space

Embedding edit distance into l1 • Let x be a string. Then for any integer s we denote by shifts(x,s) the set {x[1,|x|-s+1], x[2,|x|-s+2],…, x[s,|x|] which consists of all length s substrings of x. • Theorem: There exists a universal constant c>0 such that for every integer d>0 there exists an embedding f:({0,1}d, ED)→l1 with distortion at most

Embedding edit distance into l1 • Key idea of the embedding: for sufficiently small d the distortion is indeed that small. So it suffices to break down the string into substrings of approximately the same length (±1), the blocks, and recursively embed into l1 some metric spaces of lower dimension. The spaces embedded must be chosen in such a way, that the concatenation of their scaled embeddings results in the embedding of the original string.

Edit Distance

Edit Distance

Presentation Transcript

Edit Distance and Large Data Sets

Approximating Edit Distance in Near-Linear Time

Dynamic Programming: Edit Distance

Dynamic Programming: Edit Distance

Minimum Edit Distance

Efficient Approximation of Edit Distance

Minimum Edit Distance

String Edit Distance Matching Problem With Moves

EDIT 652 Distance Learning Applications I

Minimum Edit Distance

Edit Distance

Minimum Edit Distance

Dynamic Programming: Edit Distance

Dynamic Programming: Edit Distance

Dynamic Programming: Edit Distance

Dynamic Programming: Edit Distance

Dynamic Programming: Edit Distance

Minimum Edit Distance

On Embedding Edit Distance into L 1

Dynamic Programming: Edit Distance

Pair HMMs and edit distance

Edit Distance