Lower Bounds for Embedding Edit Distance into Normed Spaces

Lower Bounds for Embedding Edit Distance into Normed Spaces A. Andoni, M. Deza, A. Gupta, P. Indyk, S. Raskhodnikova

Definitions • Edit distance between two strings: • Minimum number of edit operations needed to transform one string into the other • Edit operation: • Insertion, deletion, or substitution of one character • Edit distance = Levenshtein metric
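The definition above can be made concrete with the textbook dynamic program. This is a minimal sketch, not code from the talk; the function name `edit_distance` is an illustrative choice.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via the standard dynamic program.

    Runs in O(len(s) * len(t)) time and O(len(t)) space.
    """
    # prev[j] = distance between the previous prefix of s and t[:j]
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a
                curr[j - 1] + 1,         # insert b
                prev[j - 1] + (a != b),  # substitute a by b (free if equal)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))    # 3
print(edit_distance("10101010", "1101010"))  # 1
```

The two nested loops are the quadratic cost that motivates looking for embeddings.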

Edit Distance • Important in: • Computational biology • Text processing • Computational bottlenecks: • The widely used dynamic-programming algorithm takes quadratic time • No efficient algorithm is known for nearest-neighbor computation • New approach to dealing with edit distance: • Embedding into a normed space

Embedding • Definition: • A mapping f: Strings → l_p^d such that for any pair of strings s and s': Edit(s, s') ≤ ||f(s) - f(s')||_p ≤ c·Edit(s, s') • The factor c is called the distortion • Embedding edit distance into a normed space is useful because: • Efficient algorithms for normed spaces are known (e.g., nearest-neighbor computation) • Edit distance can be computed approximately in subquadratic time, provided the mapping itself can be computed in subquadratic time

Embedding edit distance into a normed space • Essentially nothing is known • If moving a contiguous block of characters is allowed as a single edit operation: • the resulting metric can be embedded into l_1 with distortion O(log d · log* d) [CPSV’00] (d = length of the strings to embed)

Result in this paper • A lower bound of 3/2 on the distortion of embedding into l_1 and (l_2)^2 (squared Euclidean distance) • The bound cannot be improved using our technique

Structure of the argument • Will show that: • The edit metric contains the shortest-path metric over the K_{2,n} graph (K_{2,n}-metric) as an induced subgraph • The K_{2,n}-metric is not embeddable into (l_2)^2 with low distortion • Conclude that: • The edit metric is not embeddable into (l_2)^2 with distortion better than 3/2 • The edit metric is not embeddable into l_1 with distortion better than 3/2, since any l_1-metric can be embedded isometrically into (l_2)^2 [LLR94] • Show that: • The bound of 3/2 is tight for the considered graph

K_{2,n} metric: induced subgraph of edit metric • Vertices of the graph are A1, A2, B1, B2, …, Bn • Edges are (Ai, Bj), where 1≤i≤2, 1≤j≤n • The mapping: • A1 is mapped to the string (10)^n • A2 is mapped to the string (10)^{n-1} • Bj is mapped to the string (10)^{j-1} 1 (10)^{n-j} • Example from the figure (n=4): A1 = 10101010; A2 = 101010; B1 = 1101010, B2 = 1011010, B3 = 1010110, B4 = 1010101
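The claim that these strings realize the K_{2,n} shortest-path metric can be checked mechanically for small n. The sketch below is an illustration, not code from the talk; `edit_distance` and `k2n_strings` are hypothetical helper names.

```python
from itertools import combinations


def edit_distance(s, t):
    """Levenshtein distance (standard quadratic dynamic program)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]


def k2n_strings(n):
    """Strings assigned to A1, A2 and B1..Bn by the mapping on the slide."""
    a1 = "10" * n
    a2 = "10" * (n - 1)
    bs = ["10" * (j - 1) + "1" + "10" * (n - j) for j in range(1, n + 1)]
    return a1, a2, bs


a1, a2, bs = k2n_strings(4)
print(bs)  # ['1101010', '1011010', '1010110', '1010101']

# Distances in K_{2,n}: 1 across an edge (Ai, Bj), 2 between vertices on the same side.
assert edit_distance(a1, a2) == 2
assert all(edit_distance(a, b) == 1 for a in (a1, a2) for b in bs)
assert all(edit_distance(x, y) == 2 for x, y in combinations(bs, 2))
```

Each Bj is obtained from A1 by deleting one 0 and from A2 by inserting one 1, which is why every (Ai, Bj) pair is at distance 1.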

Lower bound for embedding K_{2,n} graph into (l_2)^2 • Theorem 1: • For any ε>0, there exists some n such that the K_{2,n}-metric cannot be embedded into (l_2)^2 with distortion less than 3/2-ε

Proof of Theorem 1 • Let: • B_{-1} = A1 and B_0 = A2 • f be some embedding of the K_{2,n}-metric into (l_2)^2 with distortion c • The metric over the points f(B_{-1}), …, f(B_n) needs to satisfy the negative type inequality: • For any integers b_{-1}, …, b_n that sum to 0: Σ_{-1≤i<j≤n} b_i b_j ||f(B_i) - f(B_j)||_2^2 ≤ 0 • With suitable values for n and b_i, the inequality gives: c ≥ 3/2 - ε
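The slide does not say which n and b_i are used. One choice that works (an assumption made here for illustration, not necessarily the authors' choice) is weight -n on A_1 and A_2 and weight 2 on every B_j, so the weights sum to 0. Taking f to be non-contracting with distortion c, so that distances between vertices on the same side are at least 2 and distances across an edge are at most c, the negative type inequality gives:

```latex
\begin{align*}
0 &\ge \sum_{-1 \le i < j \le n} b_i b_j \,\|f(B_i)-f(B_j)\|_2^2 \\
  &= n^2 \,\|f(A_1)-f(A_2)\|_2^2
     \;-\; 2n \sum_{i=1}^{2}\sum_{j=1}^{n} \|f(A_i)-f(B_j)\|_2^2
     \;+\; 4 \sum_{1 \le i < j \le n} \|f(B_i)-f(B_j)\|_2^2 \\
  &\ge 2n^2 \;-\; 2n \cdot 2n \cdot c \;+\; 4\binom{n}{2}\cdot 2
   \;=\; 6n^2 - 4n - 4n^2 c .
\end{align*}
```

Rearranging yields c ≥ 3/2 - 1/n, so any n ≥ 1/ε gives c ≥ 3/2 - ε.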

3/2 is a tight bound • Will prove that 3/2 is a tight bound for embedding the K_{2,n}-metric into l_1 • Theorem 2: • There exists an embedding f of the K_{2,n}-metric into l_1 with distortion 3/2

Proof of Theorem 2 • Will combine two embeddings f1 and f2 • f1 is: • f1(A1) = (0, …, 0) • f1(A2) = (1, …, 1)/2^n • f1(Bj) = (bin(0)_j, …, bin(2^n - 1)_j)/2^n (bin(i)_j = j-th bit of the binary representation of the integer i) • f1 satisfies: • ||f1(A1) - f1(A2)||_1 = 1 • ||f1(Ai) - f1(Bj)||_1 = 1/2, for 1≤i≤2, 1≤j≤n • ||f1(Bi) - f1(Bj)||_1 = 1/2, for 1≤i<j≤n

Proof of Theorem 2 (cont.) • f2 is: • f2(A1) = f2(A2) = (0, …, 0) • f2(Bj) = e_j/2 (e_j = vector with 1 at the j-th position and 0 elsewhere) • f2 satisfies: • ||f2(A1) - f2(A2)||_1 = 0 • ||f2(Ai) - f2(Bj)||_1 = 1/2, for 1≤i≤2, 1≤j≤n • ||f2(Bi) - f2(Bj)||_1 = 1, for 1≤i<j≤n • If f1 and f2 induce metrics D1 and D2: • 2·D1 + D2 provides a distortion of 3/2: it gives distances 2 (A1, A2), 3/2 (Ai, Bj) and 2 (Bi, Bj), against 2, 1 and 2 in the K_{2,n}-metric
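The construction of f1 and f2 can be verified numerically for small n. The following is a sketch written for this purpose, not code from the talk; `check_theorem2` and the vertex labels are hypothetical names, and exact rationals are used to avoid floating-point noise.

```python
from fractions import Fraction
from itertools import combinations


def l1(u, v):
    """l_1 distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(u, v))


def check_theorem2(n):
    """Return (min, max) of (2*D1 + D2) / d_K over all vertex pairs of K_{2,n}."""
    dim = 2 ** n
    zero, half = Fraction(0), Fraction(1, 2)

    # f1: coordinates indexed by the integers 0 .. 2^n - 1, scaled by 1/2^n.
    f1 = {"A1": [zero] * dim, "A2": [Fraction(1, dim)] * dim}
    # f2: both A vertices at the origin, Bj at e_j / 2.
    f2 = {"A1": [zero] * n, "A2": [zero] * n}
    for j in range(1, n + 1):
        f1[f"B{j}"] = [Fraction((i >> (j - 1)) & 1, dim) for i in range(dim)]
        f2[f"B{j}"] = [half if k == j else zero for k in range(1, n + 1)]

    def graph_dist(u, v):
        # K_{2,n} shortest paths: 1 across the bipartition, 2 within a side.
        return 1 if u[0] != v[0] else 2

    ratios = [
        (2 * l1(f1[u], f1[v]) + l1(f2[u], f2[v])) / graph_dist(u, v)
        for u, v in combinations(f1, 2)
    ]
    return min(ratios), max(ratios)


lo, hi = check_theorem2(3)
print(lo, hi)  # 1 3/2
```

A minimum ratio of 1 and a maximum of 3/2 mean the combined map never contracts and expands by at most 3/2, matching the definition of distortion used earlier in the deck.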

Computational Experiments • Goal: raise the lower bound (of 3/2) • Tried the following approaches: • Optimal embedding of strings of length up to d • into l_1 using the cut-metric formulation • into (l_2)^2 using semidefinite programming • Lower bounds via expansion properties of the metric

Optimal embedding into l_1 • A metric is embeddable into l_1 iff it can be represented as a nonnegative combination of cut metrics • The optimal distortion can be computed by linear programming • Deficiency: the number of variables is 2^{|X|-1}, where |X| = 2^{d+1} - 1 • Infeasible for d>3 • For d=3, the distortion is 4/3 < 3/2
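To see why the linear program becomes infeasible so quickly, the formula on the slide can simply be evaluated. This is a small illustrative sketch; `cut_lp_size` is a hypothetical helper name, and the formulas are taken directly from the slide.

```python
def cut_lp_size(d):
    """Evaluate the slide's formulas: |X| = 2^(d+1) - 1 points, 2^(|X|-1) variables."""
    points = 2 ** (d + 1) - 1
    variables = 2 ** (points - 1)
    return points, variables


for d in range(1, 5):
    points, variables = cut_lp_size(d)
    print(f"d={d}: |X|={points}, variables={variables}")
# d=1: |X|=3, variables=4
# d=2: |X|=7, variables=64
# d=3: |X|=15, variables=16384
# d=4: |X|=31, variables=1073741824
```

The jump from about 16 thousand variables at d=3 to about a billion at d=4 is consistent with the slide's remark that the approach is infeasible for d>3.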

Optimal embedding into (l_2)^2 • Formulated as a semidefinite programming problem • For d=5, obtained an optimal distortion of ~1.30 < 3/2 • Could not run for d=6, since it would require ~2 GB of memory

Lower bounds via expansion • Idea: • Show that the graph underlying the edit metric is a “good” expander • Considered a “two-layer” graph G: • The graph of all strings of length d and d-1 • Made regular by adding self-loop edges (up to degree Δ = 3d-1) • Shortest-path metric over G = induced subgraph of edit metric

Expansion • Goal: • Find C such that for any set A of vertices: • |e(A, V-A)| ≥ C·|A|·|V-A|/n (e(A, B) = set of edges between A and B) • Then: • Distortion ≥ S·C·avg(G)/Δ, where • S = const • avg(G) = average distance in G • C ≥ “eigenvalue gap”

Eigenvalue gap • Eigenvalues can be computed efficiently • The gap was not large enough: • ~2.7 for d=4, 8, 12, 16 • for comparison: 2 for the hypercube (embeddable isometrically into l_1) • The resulting lower bound on the distortion is below 3/2 for d≤16
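The hypercube figure quoted for comparison can be reproduced without a linear-algebra library. The sketch below assumes that "eigenvalue gap" means the smallest nonzero eigenvalue of the graph Laplacian (for a regular graph, the degree minus the second-largest adjacency eigenvalue); the slide does not spell this out. It checks that each of the 2^d sign vectors chi_S(x) = (-1)^{|x ∧ S|} is a Laplacian eigenvector of the d-cube with eigenvalue 2|S|; since these vectors form an orthogonal basis, they account for the whole spectrum. The function name is hypothetical.

```python
def hypercube_laplacian_spectrum(d):
    """Verify L * chi_S = 2|S| * chi_S for every S and return the sorted eigenvalues."""
    size = 1 << d
    eigenvalues = []
    for mask in range(size):
        # chi_mask(x) = (-1)^(number of coordinates where both x and mask are 1)
        chi = [(-1) ** bin(x & mask).count("1") for x in range(size)]
        lam = 2 * bin(mask).count("1")
        for x in range(size):
            # (L chi)(x) = d * chi(x) - sum of chi over the d neighbours of x
            lx = d * chi[x] - sum(chi[x ^ (1 << k)] for k in range(d))
            assert lx == lam * chi[x]
        eigenvalues.append(lam)
    return sorted(eigenvalues)


spectrum = hypercube_laplacian_spectrum(4)
gap = min(v for v in spectrum if v > 0)
print(gap)  # 2
```

The gap stays at 2 for every d, so under this reading a gap of about 2.7 for the two-layer string graph is only modestly better than that of a graph that embeds isometrically into l_1.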

Conclusion • Lower bound of 3/2 on the distortion of embedding the edit metric into l_1 and (l_2)^2 • Using the K_{2,n}-metric • Tight bound for the K_{2,n}-metric
