
Overcoming the L1 Non-Embeddability Barrier

Robert Krauthgamer (Weizmann Institute). Joint work with Alexandr Andoni and Piotr Indyk (MIT). A talk on algorithms on metric spaces (Hamming distance, Ulam metric, and others): fix a metric M, fix a computational problem, and solve the problem under M.


Presentation Transcript


  1. Overcoming the L1 Non-Embeddability Barrier
     Robert Krauthgamer (Weizmann Institute)
     Joint work with Alexandr Andoni and Piotr Indyk (MIT)

  2. Algorithms on Metric Spaces
     • Fix a metric M. Examples: Hamming distance, Ulam metric, earthmover distance, edit distance.
     • Fix a computational problem. Examples: compute the distance between x and y; nearest neighbor search.
     • Solve the problem under M.
     • Edit distance: ED(x,y) = minimum number of edit operations that transform x into y, where an edit operation inserts, deletes, or substitutes a character. Example: ED(0101010, 1010101) = 2.
     • Nearest Neighbor Search (NNS): preprocess n strings so that, given a query string, the closest string to it can be found.
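The edit distance defined on this slide can be checked with the textbook dynamic program. The sketch below is only an illustration of the definition (the function name `edit_distance` is ours, and this quadratic-time routine is not one of the algorithms from the talk).

```python
def edit_distance(x: str, y: str) -> int:
    """Edit (Levenshtein) distance via the standard row-by-row dynamic program."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]  # distance from x[:i] to the empty string
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("0101010", "1010101"))  # 2, as on the slide
```

Deleting the leading 0 and appending a 1 realizes the two operations in the slide's example.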

  3. Motivation for Nearest Neighbor
     • Many applications:
       • Image search (Euclidean distance, earthmover distance)
       • Processing of genetic information, text processing (edit distance)
       • Many others…
     • (Figure: generic search engine)

  4. A General Tool: Embeddings
     • An embedding of M into a host metric (H, d_H) is a map f : M → H that preserves distances approximately.
     • f has distortion A ≥ 1 if for all x, y ∈ M: d_M(x,y) ≤ d_H(f(x), f(y)) ≤ A · d_M(x,y)
     • Why?
       • If H is "easy" (= computational problems like NNS can be solved efficiently in it),
       • then we get good algorithms for the original space M!
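To make the distortion condition concrete, here is a minimal sketch that checks it on a finite sample of points. The helper names (`has_distortion`, `l1`) and the toy map x ↦ (x, x) are assumptions chosen for illustration; they do not come from the talk.

```python
from itertools import combinations


def has_distortion(points, d_M, d_H, f, A):
    """True if d_M(x,y) <= d_H(f(x),f(y)) <= A * d_M(x,y) for every sampled pair."""
    for x, y in combinations(points, 2):
        dm = d_M(x, y)
        dh = d_H(f(x), f(y))
        if not (dm <= dh <= A * dm):
            return False
    return True


def l1(u, v):
    """l1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Toy example: integers on a line, mapped to the point (x, x) in the plane under l1.
# Every distance is exactly doubled, so the map is non-contracting with A = 2.
line = lambda x, y: abs(x - y)
double = lambda x: (x, x)

print(has_distortion([0, 1, 3, 7], line, l1, double, A=2))    # True
print(has_distortion([0, 1, 3, 7], line, l1, double, A=1.5))  # False
```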

  5. Host space?
     • Popular target metric: ℓ1 = real space with d_1(x,y) = ∑_i |x_i − y_i|
     • Have efficient algorithms:
       • Distance estimation: O(d) for d-dimensional space (often less)
       • NNS: c-approximation with O(n^(1/c)) query time and O(n^(1+1/c)) space [IM98]
     • Powerful enough for some things…

  6. Below logarithmic?
     • Cannot work with ℓ1
     • Other possibilities?
       • (ℓ2)^p = real space with dist_(2^p)(x,y) = ||x − y||_2^p: bigger and algorithmically tractable, but not rich enough (often the same lower bounds)
       • ℓ∞ = real space with dist_∞(x,y) = max_i |x_i − y_i|: rich (includes all metrics), but usually not computationally efficient (high dimension)
     • And that's roughly it… (at least for efficient NNS)

  7. Meet our new host
     • Iterated product space Ρ_(2²,∞,1), built in three levels (figure):
       • d_1: ℓ1 distance on α coordinates
       • d_(∞,1): ℓ∞ product of β such ℓ1 blocks
       • d_(2²,∞,1): (ℓ2)² product of γ such ℓ∞ blocks
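As a sketch of how a distance in this iterated product space could be evaluated, assume a point is stored as γ blocks, each holding β vectors of α reals. The innermost level takes an ℓ1 distance, the middle level a maximum, and the outer level a sum of squares. The nested-list layout and the function names are assumptions made for illustration.

```python
def d1(u, v):
    """Innermost level: l1 distance on alpha coordinates."""
    return sum(abs(a - b) for a, b in zip(u, v))


def d_inf_1(U, V):
    """Middle level: maximum of the l1 distances over beta blocks."""
    return max(d1(u, v) for u, v in zip(U, V))


def d_22_inf_1(X, Y):
    """Outer level: sum of squared middle-level distances over gamma blocks."""
    return sum(d_inf_1(U, V) ** 2 for U, V in zip(X, Y))


# Tiny instance with (gamma, beta, alpha) = (2, 2, 3).
X = [[[0, 0, 0], [1, 1, 1]], [[2, 0, 0], [0, 0, 0]]]
Y = [[[1, 0, 0], [1, 1, 4]], [[0, 0, 0], [0, 0, 0]]]

print(d_22_inf_1(X, Y))  # 13 = max(1, 3)**2 + max(2, 0)**2
```

Evaluating the distance takes time linear in αβγ, which is what makes the space a usable host in the first place.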

  8. Why Ρ_(2²,∞,1)?
     • Because we can…
     • Rich: Theorem 1. Ulam embeds into Ρ_(2²,∞,1) with O(1) distortion
       • Dimensions (γ, β, α) = (d, log d, d)
     • Algorithmically tractable: Theorem 2. Ρ_(2²,∞,1) admits NNS on n points with
       • O(log log n) approximation
       • O(n^ε) query time and O(n^(1+ε)) space
     • In fact, there is more for Ulam…

  9. Our Algorithms for Ulam
     • Ulam = edit distance on strings where each symbol appears at most once. Example: ED(1234567, 7123456) = 2.
     • A classical distance between rankings
     • Exhibits hardness of misalignments (as in general edit)
     • All lower bounds are the same as for general edit (up to Θ̃(·))
       • Distortion of embedding into ℓ1 (and (ℓ2)^p, etc.): Θ̃(log d)
     • If we ever hope for approximation << log d for NNS under general edit, first we have to get it under Ulam!
     • Our approach implies new algorithms for Ulam:
       1. NNS with O(log log n) approximation, O(n^ε) query time
          • Can improve to O(log log d) approximation
       2. Sketching with O(1)-approximation in log^O(1) d space
       3. Distance estimation with O(1)-approximation in time …
          • [BEKMRRS03]: when ED ≈ d, approximation d^ε in O(d^(1−2ε)) time

  10. Theorem 1
      • Theorem 1. Can embed Ulam into Ρ_(2²,∞,1) with O(1) distortion
        • Dimensions (γ, β, α) = (d, log d, d)
      • Proof:
        • "Geometrization" of Ulam characterizations
        • Previously studied in the context of testing monotonicity (sortedness):
          • Sublinear algorithms [EKKRV98, ACCL04]
          • Data-stream algorithms [GJKK07, GG07, EH08]

  11. Thm 1: Characterizing Ulam
      • Consider permutations x, y over [d]
      • Assume for now: x = identity permutation
      • Idea:
        • Count the number of chars in y to delete to obtain an increasing sequence (≈ Ulam(x,y))
        • Call them faulty characters
      • Issues:
        • Ambiguity…
        • How do we count them?
      • Examples (x = 123456789): y = 234657891 and y = 341256789
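The count described on this slide equals the length of y minus the length of its longest increasing subsequence. The sketch below computes it with the standard patience-sorting technique (our choice of method; the slide does not prescribe one), and the function name is hypothetical.

```python
from bisect import bisect_left


def chars_to_delete(y):
    """Minimum number of characters to delete from y to leave an increasing sequence."""
    tails = []  # tails[i] = smallest possible last element of an increasing subsequence of length i + 1
    for c in y:
        i = bisect_left(tails, c)
        if i == len(tails):
            tails.append(c)   # c extends the longest subsequence found so far
        else:
            tails[i] = c      # c gives a smaller tail for length i + 1
    return len(y) - len(tails)


print(chars_to_delete("234657891"))  # 2
print(chars_to_delete("341256789"))  # 2
```

The second example also shows the ambiguity the slide mentions: deleting {1, 2} or deleting {3, 4} both leave an increasing sequence, so the set of "faulty" characters is not unique.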

  12. Thm 1: Characterization – inversions
      • Definition: chars a < b form an inversion if b precedes a in y
      • How to identify a faulty char?
        • Has an inversion? Doesn't work: all chars might have an inversion
        • Has many inversions? Still can miss "faulty" chars
        • Has many inversions locally? Same problem
      • Check if either is true!
      • Examples (x = 123456789): y = 567981234, y = 234567891, y = 213456798

  13. Thm 1: Characterization – faulty chars
      • Definition 1: a is faulty if there exists K > 0 such that a is inverted w.r.t. a majority of the K symbols preceding a in y
        • (OK to consider only K = 2^k)
      • Lemma [ACCL04, GJKK07]: # faulty chars = Θ(Ulam(x,y))
      • Example: x = 123456789, y = 234567891. The 4 characters preceding 1 in y all form inversions with 1.
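Definition 1 can be tried out directly on small permutations. The sketch below assumes x is the identity, reads "majority" as a strict majority, and tries every window length K rather than only powers of two; those readings and the function name are our assumptions.

```python
def faulty_chars(y):
    """Characters of y that are inverted w.r.t. a strict majority of the K
    symbols preceding them, for at least one window length K."""
    faulty = []
    for i, a in enumerate(y):
        for K in range(1, i + 1):
            window = y[i - K:i]                         # the K symbols preceding a
            inversions = sum(1 for b in window if b > a)
            if 2 * inversions > K:                      # strict majority
                faulty.append(a)
                break
    return faulty


print(faulty_chars([2, 3, 4, 5, 6, 7, 8, 9, 1]))  # [1]
print(faulty_chars([2, 1, 3, 4, 5, 6, 7, 9, 8]))  # [1, 8]
```

In the first example only the character 1 is flagged even though every character takes part in an inversion, which is the behavior the plain "has an inversion" test fails to deliver.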

  14. Thm 1: Characterization → Embedding
      • To get an embedding, need:
        1. Symmetrization (neither string is the identity)
        2. Deal with "exists", "majority"…?
      • To resolve (1), use instead X[a;K] …
      • Definition 2: a is faulty if there exists K = 2^k such that |X[a;2^k] Δ Y[a;2^k]| > 2^k (Δ = symmetric difference)
      • Example: for x = 123456789 and y = 123467895, X[5;4] = {1,2,3,4} and Y[5;4] = {6,7,8,9}
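The symmetric test in Definition 2 is easy to evaluate once X[a;K] is read as the set of K symbols preceding a in x, which is the reading suggested by the slide's example. In the sketch below, windows that run past the start of the string are simply truncated; that boundary rule and the function names are assumptions.

```python
def preceding(s, a, K):
    """Set of (up to) K symbols immediately preceding symbol a in string s."""
    i = s.index(a)
    return set(s[max(0, i - K):i])


def is_faulty(x, y, a):
    """Definition 2: some K = 2^k has |X[a;K] symmetric-difference Y[a;K]| > K."""
    K = 1
    while K <= max(len(x), len(y)):
        if len(preceding(x, a, K) ^ preceding(y, a, K)) > K:
            return True
        K *= 2
    return False


x, y = "123456789", "123467895"
print(sorted(preceding(x, "5", 4)), sorted(preceding(y, "5", 4)))
print(is_faulty(x, y, "5"))  # True: the two windows before 5 are disjoint
print(is_faulty(x, y, "7"))  # False
```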

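  15. Thm 1: Embedding – final step
      • Example: x = 123456789 with X[5;2²], y = 123467895 with Y[5;2²]
      • We have: … (formula not captured in the transcript; annotated "equal 1 iff true")
      • Replace by weight?
      • Final embedding: … (formula not captured in the transcript; it contains a squared term ( · )²)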
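The formulas on this slide did not survive extraction. One reading that is consistent with Definition 2, with the annotation "equal 1 iff true", and with the three-level shape of the host distance is sketched below. The exact form and the normalization by 2^k are assumptions, not text recovered from the slide.

```latex
% Step 1 (assumed): count faulty characters with an indicator that equals 1 iff the condition holds.
\mathrm{Ulam}(x,y) \;\approx\; \sum_{a \in [d]} \max_{k}\;
  \mathbf{1}\!\left[\, \bigl| X[a;2^k] \,\triangle\, Y[a;2^k] \bigr| > 2^k \,\right]

% Step 2 (assumed): replace the indicator by a weight and square it.
\sum_{a \in [d]} \left( \max_{k}\;
  \frac{\bigl| X[a;2^k] \,\triangle\, Y[a;2^k] \bigr|}{2^k} \right)^{2}
```

Under this reading, |X Δ Y| is the ℓ1 distance between the indicator vectors of the two sets, so the expression is a sum over d symbols of squared maxima over about log d scales of ℓ1 distances in dimension d, matching the stated dimensions (γ, β, α) = (d, log d, d).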

  16. Theorem 2
      • Theorem 2. Ρ_(2²,∞,1) admits NNS on n points with
        • O(log log n) approximation
        • O(n^ε) query time and O(n^(1+ε)) space for any small ε
        • (ignoring (αβγ)^O(1) factors)
      • A rather general approach
        • "LSH" on ℓ1-products of general metric spaces
        • Of course, cannot do, but can reduce to ℓ∞-products

  17. Thm 2: Proof
      • Let's start from basics: ℓ1^α
        • [IM98]: c-approximation with O(n^(1/c)) query time and O(n^(1+1/c)) space (ignoring α^O(1))
      • OK, what about …?
      • [I02]: Suppose there is NNS for M with
        • c_M-approximation
        • Q_M query time
        • S_M space
      • Then there is NNS for … with
        • O(c_M · log log n)-approximation
        • Õ(Q_M) query time
        • O(S_M · n^(1+ε)) space

  18. Thm 2: What about the (ℓ2)²-product?
      • Enough to consider …
        • (for us, M is the ℓ1-product)
      • Off-the-shelf?
        • [I04]: gives space ~n or > log n approximation
      • We reduce to multiple NNS queries under …
      • Instructive to first look at NNS for standard ℓ1 …

  19. Thm 2: Review of NNS for ℓ1
      • LSH family: a collection H of hash functions such that, for random h ∈ H and a parameter w > 0 (the original symbol did not survive extraction):
        Pr[h(q) = h(p)] ≈ 1 − ||q − p||_1 / w
      • Query just uses the primitive: "return all points p such that h(q) = h(p)"
      • Can obtain H by imposing a randomly-shifted grid of side-length w
      • Then, for h defined by r_i ∈ [0, w] chosen at random, the primitive becomes: "return all p s.t. |q_i − p_i| < r_i for all i ∈ [d]"
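A randomly-shifted grid hash is short to write down, and its collision probability can be estimated empirically. The sketch below is a toy illustration: the function names, the side-length name `w`, and the Monte Carlo check are all our assumptions. For a grid, the exact collision probability is the product over coordinates of (1 − |q_i − p_i| / w), which is close to 1 − ||q − p||_1 / w when the points are near each other relative to w.

```python
import math
import random


def make_grid_hash(dim, w, rng):
    """Hash a point to its cell in a grid of side-length w with a random shift."""
    shifts = [rng.uniform(0, w) for _ in range(dim)]

    def h(p):
        return tuple(math.floor((pi + si) / w) for pi, si in zip(p, shifts))

    return h


def collision_rate(p, q, w, trials=20000, seed=0):
    """Fraction of random grid shifts under which p and q land in the same cell."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_grid_hash(len(p), w, rng)
        hits += h(p) == h(q)
    return hits / trials


p, q, w = (0.0, 0.0), (0.5, 0.3), 10.0
estimate = collision_rate(p, q, w)
print(estimate, 1 - (0.5 + 0.3) / w)  # empirical rate next to 1 - ||q - p||_1 / w
```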

  20. Thm 2: LSH for ℓ1-product
      • Intuition: abstract LSH!
      • Recall we had: for r_i random from [0, w] (w = grid side-length), point p is returned if for all i: |q_i − p_i| < r_i
      • Equivalently, for all i: |q_i − p_i| / r_i < 1, that is, max_i |q_i − p_i| / r_i < 1. This is an ℓ∞ product of R!
      • Primitive for ℓ1: "return all p s.t. |q_i − p_i| < r_i for all i ∈ [d]"
      • Primitive for the ℓ1-product of M: "return all points p such that max_i d_M(q_i, p_i)/r_i < 1"
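The abstract primitive on this slide can be written as a brute-force filter, which makes the "ℓ∞ product with scaled coordinates" view explicit. The sketch below scans all points linearly (the talk instead hands this step to an NNS structure for ℓ∞-products), assumes every r_i is strictly positive, and uses Hamming distance as a stand-in for the base metric d_M. All names are hypothetical.

```python
def near_points(points, q, r, d_M):
    """Return all p with max_i d_M(q_i, p_i) / r_i < 1 (brute-force scan)."""
    return [
        p for p in points
        if max(d_M(qi, pi) / ri for qi, pi, ri in zip(q, p, r)) < 1
    ]


def hamming(u, v):
    """Stand-in base metric: Hamming distance between equal-length strings."""
    return sum(a != b for a, b in zip(u, v))


points = [("0000", "1111"), ("0011", "1111"), ("0001", "0111")]
q = ("0000", "1111")

print(near_points(points, q, (1.5, 0.5), hamming))  # only the exact match
print(near_points(points, q, (2.5, 1.5), hamming))  # all three points
```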

  21. Thm 2: Final
      • Thus, it is sufficient to solve the primitive: "return all points p such that max_i d_M(q_i, p_i)/r_i < 1"
        • (in fact, for k independent choices of (r_1, …, r_d))
      • We reduced NNS over … to several instances of NNS over … (with appropriately scaled coordinates)
      • Approximation is O(1) · O(log log n)
      • Done!

  22. Take-home message:
      • Can embed combinatorial metrics into iterated product spaces
        • Works for Ulam (= edit on non-repetitive strings)
      • The approach bypasses non-embeddability results for usual-suspect spaces like ℓ1, (ℓ2)² …
      • Open:
        • Embeddings for edit over {0,1}^d, EMD, other metrics?
        • Understanding product spaces? [Jayram-Woodruff]: sketching
      • Thank you!
