Lower Bounds for Embedding Edit Distance into Normed Spaces

Presentation Transcript


  1. Lower Bounds for Embedding Edit Distance into Normed Spaces
     A. Andoni, M. Deza, A. Gupta, P. Indyk, S. Raskhodnikova

  2. Definitions
     • Edit distance between two strings: the minimum number of edit operations needed to transform one string into the other
     • Edit operation: insertion, deletion, or substitution of a single character
     • Edit distance = Levenshtein metric
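The definition above can be sketched as the standard quadratic-time dynamic program. This is a minimal illustration, not code from the presentation; the function name `edit_distance` is an assumption.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via the standard quadratic-time dynamic program."""
    # prev[j] holds the distance between the current prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty prefix of t
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from s
                curr[j - 1] + 1,         # insert b into s
                prev[j - 1] + (a != b),  # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))
```

Only two rows of the table are kept at a time, so memory is linear in the length of `t` while the running time stays quadratic.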

  3. Edit Distance
     • Important in:
        • Computational biology
        • Text processing
     • Computational bottlenecks:
        • The widely used algorithm takes quadratic time
        • No efficient algorithm is known for nearest neighbor computation
     • New approach for dealing with edit distance: embedding into a normed space

  4. Embedding
     • Definition: a mapping $f: \text{Strings} \to \ell_p^d$ such that for any pair of strings $s$ and $s'$: $\mathrm{Edit}(s, s') \le \|f(s) - f(s')\|_p \le c \cdot \mathrm{Edit}(s, s')$
     • The factor $c$ is called the distortion
     • Embedding edit distance into a normed space is useful because:
        • Efficient algorithms are known for normed spaces (e.g. nearest neighbor computation)
        • Edit distance can be computed (approximately) in subquadratic time, if computing the mapping takes subquadratic time

  5. Embedding edit distance into a normed space
     • Essentially nothing is known
     • If moving a contiguous block of characters is allowed as a single edit operation, the resulting metric can be embedded into $\ell_1$ with distortion $O(\log d \cdot \log^* d)$ [CPSV'00], where $d$ is the length of the strings to embed

  6. Result in this paper
     • A lower bound of 3/2 on the distortion of embedding into $\ell_1$ and $(\ell_2)^2$
     • The bound cannot be improved using our technique

  7. Structure of the argument
     • Will show that:
        • The edit metric contains the shortest path metric over the $K_{2,n}$ graph ($K_{2,n}$-metric) as an induced subgraph
        • The $K_{2,n}$-metric is not embeddable into $(\ell_2)^2$ with low distortion
     • Conclude that:
        • The edit metric is not embeddable into $(\ell_2)^2$ with distortion better than 3/2
        • The edit metric is not embeddable into $\ell_1$ with distortion better than 3/2, since any $\ell_1$-metric can be embedded isometrically into $(\ell_2)^2$ [LLR94]
     • Show that:
        • The bound of 3/2 is tight for the considered graph

  8. $K_{2,n}$-metric: induced subgraph of edit metric
     • Vertices of the graph are $A_1, A_2, B_1, B_2, \dots, B_n$
     • Edges are $(A_i, B_j)$, where $1 \le i \le 2$, $1 \le j \le n$
     • The mapping:
        • $A_1$ is mapped to the string $(10)^n$
        • $A_2$ is mapped to the string $(10)^{n-1}$
        • $B_j$ is mapped to the string $(10)^{j-1}\,1\,(10)^{n-j}$
     • Example from the slide's figure, $n = 4$: $A_1$ = 10101010; $B_1$ = 1101010, $B_2$ = 1011010, $B_3$ = 1010110, $B_4$ = 1010101; $A_2$ = 101010
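The claim that these strings realize the $K_{2,n}$-metric can be checked directly for small $n$. The sketch below is an illustration under assumed names (`edit_distance`, `k2n_distance`, `strings`); it builds the strings from the mapping above for $n = 4$ and compares every pairwise edit distance with the shortest-path distance in $K_{2,n}$ (1 between an $A$ and a $B$ vertex, 2 otherwise).

```python
from itertools import combinations


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance (insert, delete, substitute), two-row DP."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]


def k2n_distance(u: str, v: str) -> int:
    """Shortest-path distance in K_{2,n}: 1 across the bipartition, else 2."""
    return 1 if u[0] != v[0] else 2


n = 4
strings = {"A1": "10" * n, "A2": "10" * (n - 1)}
for j in range(1, n + 1):
    # B_j = (10)^(j-1) 1 (10)^(n-j)
    strings[f"B{j}"] = "10" * (j - 1) + "1" + "10" * (n - j)

# Collect every pair whose edit distance differs from the graph distance.
mismatches = [
    (u, v)
    for u, v in combinations(strings, 2)
    if edit_distance(strings[u], strings[v]) != k2n_distance(u, v)
]
print(strings)
print(mismatches)  # an empty list means the two metrics agree on all pairs
```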

  9. Lower bound for embedding the $K_{2,n}$ graph into $(\ell_2)^2$
     • Theorem 1: for any $\varepsilon > 0$, there exists some $n$ such that the $K_{2,n}$-metric cannot be embedded into $(\ell_2)^2$ with distortion less than $3/2 - \varepsilon$

  10. Proof of Theorem 1
     • Let:
        • $B_{-1} = A_1$ and $B_0 = A_2$
        • $f$ be some embedding of the $K_{2,n}$-metric into $(\ell_2)^2$ with distortion $c$
     • The metric over the points $f(B_{-1}), \dots, f(B_n)$ needs to satisfy the negative type inequality:
        • For any integers $b_{-1}, \dots, b_n$ that sum to 0: $\sum_{-1 \le i < j \le n} b_i b_j \|f(B_i) - f(B_j)\|_2^2 \le 0$
     • With suitable values for $n$ and $b_i$, the inequality gives $c \ge 3/2 - \varepsilon$
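The slide does not state which values are used. As a worked sketch, one choice that yields the bound (an assumption, not necessarily the authors' choice) is $n = 2m$, $b_{-1} = b_0 = -m$ and $b_1 = \dots = b_n = 1$, with $f$ scaled so that $D \le \|f(x) - f(y)\|_2^2 \le c \cdot D$ for the $K_{2,n}$-metric $D$:

```latex
\begin{align*}
0 &\ge \sum_{-1 \le i < j \le n} b_i b_j \,\|f(B_i) - f(B_j)\|_2^2 \\
  &\ge \underbrace{m^2 \cdot 2}_{\text{pair } A_1 A_2}
     + \underbrace{\tbinom{n}{2} \cdot 2}_{\text{pairs } B_i B_j}
     - \underbrace{2n \cdot m \cdot c}_{\text{pairs } A_i B_j} \\
  &= 2m^2 + 2m(2m - 1) - 4m^2 c
     \qquad (n = 2m), \\
\text{hence}\quad
c &\ge \frac{6m^2 - 2m}{4m^2} = \frac{3}{2} - \frac{1}{n}.
\end{align*}
```

The positive terms use the lower bound $\|f(x) - f(y)\|_2^2 \ge D(x, y) = 2$, the negative terms use the upper bound $\|f(A_i) - f(B_j)\|_2^2 \le c \cdot 1$, and taking $n > 1/\varepsilon$ gives $c \ge 3/2 - \varepsilon$.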

  11. 3/2 is a tight bound
     • Will prove that 3/2 is a tight bound for embedding the $K_{2,n}$-metric into $\ell_1$
     • Theorem 2: there exists an embedding $f$ of the $K_{2,n}$-metric into $\ell_1$ with distortion 3/2

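  12. Proof of Theorem 2
     • Will combine two embeddings $f_1$ and $f_2$
     • $f_1$ is:
        • $f_1(A_1) = (0, \dots, 0)$
        • $f_1(A_2) = (1, \dots, 1)/2^n$
        • $f_1(B_j) = (\mathrm{bin}(0)_j, \dots, \mathrm{bin}(2^n - 1)_j)/2^n$, where $\mathrm{bin}(i)_j$ is the $j$-th bit of the binary representation of the integer $i$
     • $f_1$ satisfies:
        • $\|f_1(A_1) - f_1(A_2)\|_1 = 1$
        • $\|f_1(A_i) - f_1(B_j)\|_1 = 1/2$, for $1 \le i \le 2$, $1 \le j \le n$
        • $\|f_1(B_i) - f_1(B_j)\|_1 = 1/2$, for $1 \le i < j \le n$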

  13. Proof of Theorem 2 (cont.)
     • $f_2$ is:
        • $f_2(A_1) = f_2(A_2) = (0, \dots, 0)$
        • $f_2(B_j) = e_j/2$, where $e_j$ is the vector with 1 at the $j$-th position and 0 elsewhere
     • $f_2$ satisfies:
        • $\|f_2(A_1) - f_2(A_2)\|_1 = 0$
        • $\|f_2(A_i) - f_2(B_j)\|_1 = 1/2$, for $1 \le i \le 2$, $1 \le j \le n$
        • $\|f_2(B_i) - f_2(B_j)\|_1 = 1$, for $1 \le i < j \le n$
     • If $f_1$ and $f_2$ induce metrics $D_1$ and $D_2$:
        • $2D_1 + D_2$ provides a distortion of 3/2
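The construction of $f_1$ and $f_2$ can be checked numerically for small $n$. The sketch below is an illustration with assumed names (`combined_embedding`, `distortion`); it concatenates $2 f_1$ with $f_2$ so that $\ell_1$ distances equal $2D_1 + D_2$, and it assumes bits are indexed from the least significant one, which the slide does not specify (any fixed convention gives the same distances).

```python
from fractions import Fraction
from itertools import combinations


def l1(u, v):
    """l1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def bit(i: int, j: int) -> int:
    """j-th bit of i, 1-indexed from the least significant bit (assumed convention)."""
    return (i >> (j - 1)) & 1


def combined_embedding(n: int):
    """Map each vertex to 2*f1 concatenated with f2, using exact rationals."""
    size = 2 ** n
    f1 = {"A1": [Fraction(0)] * size, "A2": [Fraction(1, size)] * size}
    f2 = {"A1": [Fraction(0)] * n, "A2": [Fraction(0)] * n}
    for j in range(1, n + 1):
        f1[f"B{j}"] = [Fraction(bit(i, j), size) for i in range(size)]
        f2[f"B{j}"] = [Fraction(1, 2) if k == j else Fraction(0) for k in range(1, n + 1)]
    return {v: [2 * x for x in f1[v]] + f2[v] for v in f1}


def k2n_distance(u: str, v: str) -> int:
    """Shortest-path distance in K_{2,n}: 1 across the bipartition, else 2."""
    return 1 if u[0] != v[0] else 2


def distortion(n: int) -> Fraction:
    """Largest expansion divided by smallest expansion over all vertex pairs."""
    points = combined_embedding(n)
    ratios = [l1(points[u], points[v]) / k2n_distance(u, v)
              for u, v in combinations(points, 2)]
    return max(ratios) / min(ratios)


print(distortion(4))
```

The pairs $A_1 A_2$ and $B_i B_j$ are preserved exactly (distance 2), while each $A_i B_j$ pair is stretched from 1 to 3/2, which is where the distortion comes from.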

  14. Computational Experiments
     • Goal: raise the lower bound (of 3/2)
     • Tried the following approaches:
        • Optimal embedding of strings of length up to $d$
           • into $\ell_1$ using the cut-metric formulation
           • into $(\ell_2)^2$ using semidefinite programming
        • Lower bounds via expansion properties of the metric

  15. Optimal embedding into $\ell_1$
     • A metric is embeddable into $\ell_1$ iff it can be represented as a nonnegative combination of cut metrics
     • The optimal distortion can be computed using linear programming
     • Deficiency: the number of variables is $2^{|X|-1}$, where $|X| = 2^{d+1} - 1$
     • Infeasible for $d > 3$
     • For $d = 3$, the distortion is $4/3 < 3/2$
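To see why the linear program stops being feasible so quickly, the counts on this slide can be tabulated. This is a small sketch assuming a binary alphabet and counting the empty string (consistent with $|X| = 2^{d+1} - 1$); the helper name `lp_size` is hypothetical.

```python
def lp_size(d: int) -> tuple[int, int]:
    """Return (|X|, number of cut-metric variables) for binary strings of length <= d."""
    num_strings = 2 ** (d + 1) - 1     # all binary strings of length 0..d
    num_cuts = 2 ** (num_strings - 1)  # variable count as stated on the slide
    return num_strings, num_cuts


for d in range(1, 5):
    points, variables = lp_size(d)
    print(f"d={d}: |X|={points}, variables={variables}")
```

Going from $d = 3$ to $d = 4$ raises the variable count from $2^{14}$ to $2^{30}$.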

  16. Optimal embedding into $(\ell_2)^2$
     • Formulated as a semidefinite programming problem
     • For $d = 5$, obtained an optimal distortion of ~1.30 < 3/2
     • Could not run for $d = 6$, since it would require ~2 GB of memory

  17. Lower bounds via expansion
     • Idea: show that the graph underlying the edit metric is a "good" expander
     • Considered a "two-layer" graph G:
        • The graph of all strings of length d and d-1
        • Regular with added self-loop edges (up to degree Δ=3d-1)
        • Shortest path metric over G = induced subgraph of the edit metric

  18. Expansion
     • Goal: find $C$ such that for any set $A$ of vertices:
        • $|e(A, V - A)| \ge C \, |A| \, |V - A| / n$, where $e(A, B)$ is the set of edges between $A$ and $B$
     • Then:
        • Distortion $\ge S \cdot C \cdot \mathrm{avg}(G) / \Delta$, where
           • $S$ = const
           • $\mathrm{avg}(G)$ = average distance in $G$
        • $C \ge$ "eigenvalue gap"

  19. Eigenvalue gap
     • Eigenvalues can be computed efficiently
     • The gap was not large enough:
        • ~2.7 for $d = 4, 8, 12, 16$
        • For comparison: 2 for the hypercube (embeddable isometrically into $\ell_1$)
     • Gives a lower bound on the distortion that is below 3/2 for $d \le 16$
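For intuition about the constant $C$ in the expansion condition, it can be found by brute force on a tiny graph. The sketch below is an illustration only, not the authors' experiment: it enumerates every vertex subset of the 3-dimensional hypercube (taking $n$ in the condition to be the number of vertices, an assumption) and returns the largest $C$ satisfying $|e(A, V - A)| \ge C \, |A| \, |V - A| / n$. The helper names are hypothetical.

```python
from fractions import Fraction


def hypercube_edges(d: int):
    """Edges of the d-dimensional hypercube on vertices 0 .. 2^d - 1."""
    return [(u, u ^ (1 << k))
            for u in range(2 ** d)
            for k in range(d)
            if u < (u ^ (1 << k))]  # list each undirected edge once


def best_expansion_constant(num_vertices: int, edges) -> Fraction:
    """Largest C with |e(A, V-A)| >= C*|A|*|V-A|/n for every nonempty proper subset A."""
    best = None
    for mask in range(1, 2 ** num_vertices - 1):  # bitmask encodes the subset A
        size = bin(mask).count("1")
        cut = sum(1 for u, v in edges if ((mask >> u) & 1) != ((mask >> v) & 1))
        ratio = Fraction(cut * num_vertices, size * (num_vertices - size))
        if best is None or ratio < best:
            best = ratio
    return best


print(best_expansion_constant(8, hypercube_edges(3)))
```

The enumeration is exponential in the number of vertices, so it is only usable on toy graphs; that is why the slides bound $C$ through the eigenvalue gap instead.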

  20. Conclusion
     • Lower bound of 3/2 on the distortion of embedding the edit metric into $\ell_1$ and $(\ell_2)^2$
        • Using the $K_{2,n}$-metric
        • Tight bound for the $K_{2,n}$-metric
