This paper studies embeddings of edit distance, the minimum number of edit operations needed to transform one string into another, into normed spaces. Such embeddings are a proposed way around two bottlenecks: the widely used edit distance algorithm takes quadratic time, and no efficient nearest neighbor algorithm is known for the edit metric. The paper proves a lower bound of 3/2 on the distortion of any embedding of edit distance into $\ell_1$ or $(\ell_2)^2$, and shows that this bound cannot be improved using the same technique. Edit distance is important in computational biology and text processing.
Lower Bounds for Embedding Edit Distance into Normed Spaces A. Andoni, M. Deza, A. Gupta, P. Indyk, S. Raskhodnikova
Definitions • Edit distance between two strings: • Minimum number of edit operations needed to transform one string into another • Edit operation: • Insertion, deletion, or substitution of one character • Edit distance = Levenshtein metric
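To make the definition concrete, here is a minimal sketch of the textbook dynamic program for edit distance. The function name `edit_distance` is chosen for illustration and is not taken from the slides; this is the standard quadratic-time method, not a new algorithm.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via the classic O(len(s) * len(t)) dynamic program."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a
                curr[j - 1] + 1,         # insert b
                prev[j - 1] + (a != b),  # substitute a by b, or match for free
            ))
        prev = curr
    return prev[-1]
```

For example, `edit_distance("10101010", "1101010")` is 1, because deleting a single `0` turns the first string into the second.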
Edit Distance • Important in: • Computational biology • Text processing • Computational bottlenecks: • Widely used algorithm takes quadratic time • No efficient algorithm known for nearest neighbor computation • New approach for dealing with edit distance: • Embedding into a normed space
Embedding • Definition: • A mapping $f$: Strings → $\ell_p^d$, such that for any pair of strings $s$ and $s'$: $\mathrm{Edit}(s, s') \le \|f(s) - f(s')\|_p \le c \cdot \mathrm{Edit}(s, s')$ • The factor $c$ is called the distortion • Useful to embed edit distance into a normed space because: • Efficient algorithms working on normed spaces are known (e.g. nearest neighbor computation) • Can compute edit distance (approximately) in subquadratic time, if computing the mapping takes subquadratic time
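The distortion of a concrete map on a finite point set can be measured directly. The sketch below is illustrative only: the helper names `l1` and `distortion` are hypothetical, and it assumes the map may be rescaled, so the distortion is taken as the largest ratio of embedded distance to original distance divided by the smallest such ratio. Rescaling $f$ so that the smallest ratio equals 1 recovers the two-sided inequality in the definition.

```python
from itertools import combinations

def l1(u, v):
    """l1 distance between two equal-length coordinate sequences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def distortion(points, metric, embed, norm=l1):
    """Distortion of the map `embed` (dict: point -> vector) on `points`.

    Assumption: the map may be rescaled freely, so we return
    (max ratio) / (min ratio) over all pairs of distinct points.
    """
    ratios = [
        norm(embed[x], embed[y]) / metric(x, y)
        for x, y in combinations(points, 2)
    ]
    return max(ratios) / min(ratios)

# Toy example: three points on a line, with the last one pushed too far out.
points = [0, 1, 2]
embed = {0: (0,), 1: (1,), 2: (3,)}
toy = distortion(points, lambda x, y: abs(x - y), embed)
```

In the toy example the three pairwise ratios are 1, 3/2 and 2, so `toy` is 2.0.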
Embedding edit distance into a normed space • Essentially nothing known • If moving a contiguous block of characters is allowed as a single edit operation: • the new metric can be embedded into $\ell_1$ with distortion $O(\log d \cdot \log^* d)$ [CPSV'00] ($d$ = length of the strings to embed)
Result in this paper • A lower bound of 3/2 on the distortion of embedding into $\ell_1$ and $(\ell_2)^2$ • The bound cannot be improved using our technique
Structure of the argument • Will show that: • Edit metric contains the shortest path metric over the $K_{2,n}$ graph ($K_{2,n}$-metric) as an induced subgraph • $K_{2,n}$-metric is not embeddable into $(\ell_2)^2$ with low distortion • Conclude that: • Edit metric is not embeddable into $(\ell_2)^2$ with distortion better than 3/2 • Edit metric is not embeddable into $\ell_1$ with distortion better than 3/2, since any $\ell_1$-metric can be embedded isometrically into $(\ell_2)^2$ [LLR94] • Show that: • The bound of 3/2 is tight for the considered graph
$K_{2,n}$-metric: induced subgraph of edit metric • Figure (example for $n=4$): $A_1$ = 10101010; $B_1$ = 1101010, $B_2$ = 1011010, $B_3$ = 1010110, $B_4$ = 1010101; $A_2$ = 101010 • Vertices of the graph are $A_1, A_2, B_1, B_2, \dots, B_n$ • Edges are $(A_i, B_j)$, where $1 \le i \le 2$, $1 \le j \le n$ • The mapping: • $A_1$ is mapped to the string $(10)^n$ • $A_2$ is mapped to the string $(10)^{n-1}$ • $B_j$ is mapped to the string $(10)^{j-1}\,1\,(10)^{n-j}$
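The claim that these strings realize the $K_{2,n}$-metric can be checked by brute force for small $n$. The sketch below builds the strings from the mapping above and compares every pairwise edit distance with the shortest path distance in $K_{2,n}$ (1 between an $A$ vertex and a $B$ vertex, 2 otherwise). The helper names are illustrative, not from the paper.

```python
from itertools import combinations

def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via the standard quadratic dynamic program."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]

def k2n_strings(n: int) -> dict:
    """Strings assigned to the vertices A1, A2, B1..Bn by the mapping above."""
    strings = {"A1": "10" * n, "A2": "10" * (n - 1)}
    for j in range(1, n + 1):
        strings[f"B{j}"] = "10" * (j - 1) + "1" + "10" * (n - j)
    return strings

def k2n_distance(u: str, v: str) -> int:
    """Shortest path distance in K_{2,n}: 1 across the bipartition, else 2."""
    return 1 if u[0] != v[0] else 2

def is_isometric(n: int) -> bool:
    """True if edit distance on the strings equals the K_{2,n}-metric."""
    s = k2n_strings(n)
    return all(
        edit_distance(s[u], s[v]) == k2n_distance(u, v)
        for u, v in combinations(s, 2)
    )
```

For $n=4$, `k2n_strings(4)` reproduces the strings from the figure, and `is_isometric(4)` returns `True`.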
Lower bound for embedding $K_{2,n}$ graph into $(\ell_2)^2$ • Theorem 1: • For any $\varepsilon > 0$, there exists some $n$ such that the $K_{2,n}$-metric cannot be embedded into $(\ell_2)^2$ with distortion less than $3/2 - \varepsilon$
Proof of Theorem 1 • Let: • $B_{-1} = A_1$ and $B_0 = A_2$ • $f$ be some embedding of the $K_{2,n}$-metric into $(\ell_2)^2$ with distortion $c$ • The metric over the points $f(B_{-1}), \dots, f(B_n)$ needs to satisfy the negative type inequality: • For any integers $b_{-1}, \dots, b_n$ that sum to 0: $\sum_{-1 \le i < j \le n} b_i b_j \|f(B_i) - f(B_j)\|_2^2 \le 0$ • With suitable values for $n$ and $b_i$, the inequality gives: $c \ge 3/2 - \varepsilon$
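The slide does not state which values of $n$ and $b_i$ are used. One choice that appears to work, given here as an assumption rather than as the authors' choice, is even $n$ with $b_{-1} = b_0 = -n/2$ and $b_1 = \dots = b_n = 1$. Writing $d'(x,y) = \|f(x) - f(y)\|_2^2$ and using $d \le d' \le c \cdot d$ for the $K_{2,n}$-metric $d$, the negative type inequality becomes:

```latex
\begin{align*}
0 &\ge \sum_{-1 \le i < j \le n} b_i b_j \, d'(B_i, B_j) \\
  &= \frac{n^2}{4}\, d'(A_1, A_2)
     \;-\; \frac{n}{2} \sum_{i=1}^{2} \sum_{j=1}^{n} d'(A_i, B_j)
     \;+\; \sum_{1 \le i < j \le n} d'(B_i, B_j) \\
  &\ge \frac{n^2}{4} \cdot 2
     \;-\; \frac{n}{2} \cdot 2n \cdot c
     \;+\; \binom{n}{2} \cdot 2 \\
  &= \frac{n^2}{2} - c\,n^2 + n(n-1).
\end{align*}
% Dividing by n^2 and rearranging:
\[
  c \;\ge\; \frac{3}{2} - \frac{1}{n},
\]
% so any even n >= 1/epsilon gives c >= 3/2 - epsilon.
```

The positive terms use the lower bound $d' \ge d$ (distances 2 for $A_1 A_2$ and for $B_i B_j$), and the negative term uses the upper bound $d' \le c \cdot d$ (distance 1 for $A_i B_j$).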
3/2 is a tight bound • Will prove that 3/2 is a tight bound for embedding the $K_{2,n}$-metric into $\ell_1$ • Theorem 2: • There exists an embedding $f$ of the $K_{2,n}$-metric into $\ell_1$ with distortion 3/2
Proof of Theorem 2 • Will combine two embeddings $f_1$ and $f_2$ • $f_1$ is: • $f_1(A_1) = (0, \dots, 0)$ • $f_1(A_2) = (1, \dots, 1)/2^n$ • $f_1(B_j) = (\mathrm{bin}(0)_j, \dots, \mathrm{bin}(2^n - 1)_j)/2^n$ ($\mathrm{bin}(i)_j$ = $j$-th bit of the binary representation of the integer $i$) • $f_1$ satisfies: • $\|f_1(A_1) - f_1(A_2)\|_1 = 1$ • $\|f_1(A_i) - f_1(B_j)\|_1 = 1/2$, for $1 \le i \le 2$, $1 \le j \le n$ • $\|f_1(B_i) - f_1(B_j)\|_1 = 1/2$, for $1 \le i < j \le n$
Proof of Theorem 2 (cont.) • $f_2$ is: • $f_2(A_1) = f_2(A_2) = (0, \dots, 0)$ • $f_2(B_j) = e_j/2$ ($e_j$ = vector with 1 at the $j$-th position and 0 elsewhere) • $f_2$ satisfies: • $\|f_2(A_1) - f_2(A_2)\|_1 = 0$ • $\|f_2(A_i) - f_2(B_j)\|_1 = 1/2$, for $1 \le i \le 2$, $1 \le j \le n$ • $\|f_2(B_i) - f_2(B_j)\|_1 = 1$, for $1 \le i < j \le n$ • If $f_1$ and $f_2$ induce metrics $D_1$ and $D_2$: • $2D_1 + D_2$ provides a distortion of 3/2
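The arithmetic behind Theorem 2 can be verified numerically for small $n$. The sketch below constructs $f_1$ and $f_2$ as described on the two slides, forms $2D_1 + D_2$, and compares it with the $K_{2,n}$-metric using exact fractions. Function names are illustrative, and taking the "$j$-th bit" to be the bit of weight $2^{j-1}$ is an assumption; any fixed bit ordering gives the same distances.

```python
from fractions import Fraction
from itertools import combinations

def l1(u, v):
    """l1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def f1(n: int) -> dict:
    """First embedding: 2^n coordinates, each scaled by 1/2^n."""
    size = 2 ** n
    scale = Fraction(1, size)
    emb = {"A1": [Fraction(0)] * size, "A2": [scale] * size}
    for j in range(1, n + 1):
        # Coordinate i holds the j-th bit of i (assumed: bit of weight 2^(j-1)).
        emb[f"B{j}"] = [scale * ((i >> (j - 1)) & 1) for i in range(size)]
    return emb

def f2(n: int) -> dict:
    """Second embedding: A1, A2 at the origin, Bj at e_j / 2."""
    emb = {"A1": [Fraction(0)] * n, "A2": [Fraction(0)] * n}
    for j in range(1, n + 1):
        emb[f"B{j}"] = [Fraction(1, 2) if k == j else Fraction(0)
                        for k in range(1, n + 1)]
    return emb

def k2n_distance(u: str, v: str) -> int:
    """Shortest path distance in K_{2,n}: 1 across the bipartition, else 2."""
    return 1 if u[0] != v[0] else 2

def combined_distortion(n: int) -> Fraction:
    """Distortion of the metric 2*D1 + D2 relative to the K_{2,n}-metric."""
    e1, e2 = f1(n), f2(n)
    ratios = [
        (2 * l1(e1[u], e1[v]) + l1(e2[u], e2[v])) / k2n_distance(u, v)
        for u, v in combinations(e1, 2)
    ]
    return max(ratios) / min(ratios)
```

Under $2D_1 + D_2$, the pair $A_1 A_2$ and every pair $B_i B_j$ get distance 2 (ratio 1), while every pair $A_i B_j$ gets distance 3/2 (ratio 3/2), so `combined_distortion(4)` equals `Fraction(3, 2)`.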
Computational Experiments • Goal: raise the lower bound (of 3/2) • Tried the following approaches: • Optimal embedding of strings of length up to $d$ • into $\ell_1$ using the cut-metric formulation • into $(\ell_2)^2$ using semidefinite programming • Lower bounds via expansion properties of the metric
Optimal embedding into $\ell_1$ • A metric is embeddable into $\ell_1$ iff it can be represented as a nonnegative combination of cut metrics • For computing the optimal distortion, can use linear programming • Deficiency: the number of variables is $2^{|X|-1}$, where $|X| = 2^{d+1} - 1$ • Infeasible for $d > 3$ • For $d = 3$, distortion is $4/3 < 3/2$
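The size of the linear program explains why $d > 3$ is out of reach. The sketch below evaluates the two counts from the slide; the formula $|X| = 2^{d+1} - 1$ matches the number of binary strings of length 0 through $d$, which is taken here as the assumed point set. The helper names, including the small `cut_metric` illustration, are hypothetical.

```python
def num_strings(d: int) -> int:
    """Number of binary strings of length 0..d (assumed point set X)."""
    return 2 ** (d + 1) - 1

def num_cut_variables(d: int) -> int:
    """Number of LP variables: one per cut of X, i.e. 2^(|X| - 1)."""
    return 2 ** (num_strings(d) - 1)

def cut_metric(side: set, x, y) -> int:
    """Cut metric of the subset `side`: 1 if x and y are separated, else 0."""
    return 1 if (x in side) != (y in side) else 0

sizes = {d: (num_strings(d), num_cut_variables(d)) for d in (2, 3, 4)}
```

For $d = 3$ there are 15 strings and $2^{14} = 16384$ variables; for $d = 4$ there are 31 strings and $2^{30}$ (over a billion) variables.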
Optimal embedding into $(\ell_2)^2$ • Formulated as a semidefinite programming problem • For $d = 5$, obtained optimal distortion of $\sim 1.30 < 3/2$ • Could not run for $d = 6$, since it would require ~2 GB of memory
Lower bounds via expansion • Idea: • Show that the graph underlying the edit metric is a "good" expander • Considered a "two-layer" graph $G$: • The graph of all strings of length $d$ and $d-1$ • Made regular by adding self-loop edges (up to degree $\Delta = 3d - 1$) • Shortest path metric over $G$ = induced subgraph of edit metric
Expansion • Goal: • To find $C$ such that for any set $A$ of vertices: • $|e(A, V - A)| \ge C\,|A|\,|V - A| / n$ ($e(A, B)$ = set of edges between $A$ and $B$) • Then: • Distortion $\ge S \cdot C \cdot \mathrm{avg}(G) / \Delta$, where • $S$ = const • $\mathrm{avg}(G)$ = average distance in $G$ • $C \ge$ "eigenvalue gap"
Eigenvalue gap • Can compute eigenvalues efficiently • Was not large enough: • ~2.7 for $d = 4, 8, 12, 16$ • For comparison: 2 for the hypercube (embeddable isometrically into $\ell_1$) • The resulting lower bound on the distortion is below 3/2 for $d \le 16$
Conclusion • Lower bound of 3/2 for the distortion of embedding the edit metric into $\ell_1$ and $(\ell_2)^2$ • Using the $K_{2,n}$-metric • Tight bound for the $K_{2,n}$-metric