k-Nearest Neighbors Search in High Dimensions


  1. k-Nearest Neighbors Search in High Dimensions Tomer Peled, Dan Kushnir. "Tell me who your neighbors are, and I'll know who you are."

  2. Outline • Problem definition and flavors • Algorithms overview - low dimensions • Curse of dimensionality (d > 10..20) • Enchanting the curse: Locality Sensitive Hashing (high-dimension approximate solutions) • l2 extension • Applications (Dan)

  3. Nearest Neighbor Search: problem definition • Given: a set P of n points in R^d, over some distance metric • Find: the nearest neighbor p of q in P

  4. Applications • Classification • Clustering • Segmentation • Indexing • Dimension reduction (e.g., LLE) [Figure: a query point q in a 2-D feature space with axes weight and color]

  5. Naïve solution • No preprocessing • Given a query point q, go over all n points and compare in R^d • Query time = O(nd) • Keep this baseline in mind
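A minimal sketch of this baseline (Python with NumPy; array sizes and names are illustrative, not from the deck):

```python
import numpy as np

def nn_bruteforce(P, q):
    """Naive exact nearest neighbor: no preprocessing, O(n*d) per query."""
    dists = np.linalg.norm(P - q, axis=1)   # n distance computations in R^d
    i = int(np.argmin(dists))
    return i, float(dists[i])

P = np.random.rand(1000, 64)   # n = 1000 points in R^64
q = np.random.rand(64)
idx, dist = nn_bruteforce(P, q)
```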

  6. Common solution • Use a data structure for acceleration • Scalability with n and with d is important

  7. When to use nearest neighbors, a high-level map: parametric methods estimate a probability distribution; non-parametric methods do density estimation or use nearest neighbors directly • Nearest neighbors suit complex models, sparse data, and high dimensions • Assumes no prior knowledge about the underlying probability structure

  8. Nearest Neighbor: the closest point to q, argmin over p_i ∈ P of dist(q, p_i)

  9. r, ε-Nearest Neighbor • dist(q, p1) ≤ r (p1 counts as close) • dist(q, p2) ≥ (1 + ε)·r (p2 counts as far) • r2 = (1 + ε)·r1 • If some point lies within r of q, returning any point within (1 + ε)·r is acceptable

  10. Outline • Problem definition and flavors • Algorithms overview - low dimensions • Curse of dimensionality (d > 10..20) • Enchanting the curse: Locality Sensitive Hashing (high-dimension approximate solutions) • l2 extension • Applications (Dan)

  11. The simplest solution • "Lion in the desert": catch the lion by repeatedly halving the desert, i.e., recursive space partitioning

  12. Quadtree • Split each dimension in two • Repeat recursively • Stop when each cell contains at most 1 data point

  13. Quadtree structure [Figure: the split point (X1, Y1) yields four children: (P<X1, P<Y1), (P<X1, P≥Y1), (P≥X1, P<Y1), (P≥X1, P≥Y1)]

  14. Quadtree query: descend from the root to the cell containing q [Figure: the query walks the four-way tree rooted at (X1, Y1)] • In many cases this works
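A minimal 2-D quadtree insertion sketch, assuming distinct points (class and names are illustrative, not the deck's code); the nearest-neighbor query descends to the leaf containing q but must then backtrack into sibling cells that might still hold a closer point, which is exactly the pitfall the next slides show:

```python
class QuadNode:
    """One cell of a 2-D quadtree; a leaf holds at most one (distinct) point."""
    def __init__(self, xmin, ymin, xmax, ymax):
        self.bounds = (xmin, ymin, xmax, ymax)
        self.point = None        # payload while this cell is a leaf
        self.children = None     # 4 sub-cells after a split

    def _child(self, p):
        xmin, ymin, xmax, ymax = self.bounds
        xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2
        return self.children[(p[0] >= xmid) + 2 * (p[1] >= ymid)]

    def insert(self, p):
        if self.children is not None:
            self._child(p).insert(p)
        elif self.point is None:
            self.point = p
        else:                    # cell overflows: split into 4 and reinsert
            xmin, ymin, xmax, ymax = self.bounds
            xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2
            self.children = [QuadNode(xmin, ymin, xmid, ymid),
                             QuadNode(xmid, ymin, xmax, ymid),
                             QuadNode(xmin, ymid, xmid, ymax),
                             QuadNode(xmid, ymid, xmax, ymax)]
            old, self.point = self.point, None
            self.insert(old)
            self.insert(p)

root = QuadNode(0, 0, 16, 16)
for pt in [(3, 5), (11, 2), (12, 13)]:
    root.insert(pt)     # the first overflow splits the root into 4 cells
```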

  15. Quadtree – pitfall 1 [Figure: the descent ends in q's cell, but the true nearest neighbor sits in a sibling cell, so the query must backtrack] • In some cases it doesn't work

  16. Quadtree – pitfall 1 (continued) [Figure: a harder point configuration around q] • In some cases nothing works

  17. Quadtree – pitfall 2: the query may have to visit O(2^d) cells, i.e., query time exponential in the number of dimensions

  18. Space-partition based algorithms can be improved; see the survey "Multidimensional Access Methods" by Volker Gaede and Oliver Günther

  19. Outline • Problem definition and flavors • Algorithms overview - low dimensions • Curse of dimensionality (d > 10..20) • Enchanting the curse: Locality Sensitive Hashing (high-dimension approximate solutions) • l2 extension • Applications (Dan)

  20. Curse of dimensionality • Known exact solutions need query time or space O(min(n·d, n^d)); the first term is the naive scan • For d > 10..20 they perform worse than a sequential scan • This holds for most geometric distributions • Techniques specific to high dimensions are needed • Proved in theory and in practice by Barkol & Rabani (2000) and Beame & Vee (2002)

  21. Curse of dimensionality: some intuition • The number of cells in a grid partition grows as 2, 2^2, 2^3, …, 2^d: it doubles with every added dimension

  22. Outline • Problem definition and flavors • Algorithms overview - low dimensions • Curse of dimensionality (d > 10..20) • Enchanting the curse: Locality Sensitive Hashing (high-dimension approximate solutions) • l2 extension • Applications (Dan)

  23. Preview • General Solution – Locality sensitive hashing • Implementation for Hamming space • Generalization to l1 & l2

  24. Hash function

  25. Hash function [Diagram: Data_Item → hash function → key → bin/bucket]

  26. Hash function as a data structure • X = a number in the range 0..n • X modulo 3 gives a storage address in 0..2 • Usually we would like related data items to be stored in the same bin
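The same toy example in Python (values are illustrative):

```python
def h(x: int) -> int:
    return x % 3          # maps any number in 0..n to a storage address in 0..2

buckets = {0: [], 1: [], 2: []}
for x in [3, 7, 9, 12, 14]:
    buckets[h(x)].append(x)
# buckets == {0: [3, 9, 12], 1: [7], 2: [14]}
# an ordinary hash scatters related items; LSH will deliberately keep them together
```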

  27. Recall: r, ε-Nearest Neighbor • dist(q, p1) ≤ r • dist(q, p2) ≥ (1 + ε)·r • r2 = (1 + ε)·r1

  28. Locality sensitive hashing • A hash family is (r, ε, P1, P2)-sensitive if: • Pr[I(p) = I(q)] is "high" (≥ P1) when p is close to q (dist(q, p) ≤ r1) • Pr[I(p) = I(q)] is "low" (≤ P2) when p is far from q (dist(q, p) ≥ r2 = (1 + ε)·r1)

  29. Preview • General Solution – Locality sensitive hashing • Implementation for Hamming space • Generalization to l1 & l2

  30. Hamming space • Hamming space = the 2^N binary strings of length N • Hamming distance = number of differing digits, a.k.a. signal distance (after Richard Hamming)

  31. Hamming distance(x1, x2) = SUM(x1 XOR x2) • Example: 010100001111 vs. 010010000011 → distance = 4
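In code, using the slide's own example:

```python
def hamming_distance(x1: str, x2: str) -> int:
    """SUM(x1 XOR x2) over two binary strings of equal length N."""
    return sum(b1 != b2 for b1, b2 in zip(x1, x2))

hamming_distance("010100001111", "010010000011")  # -> 4
```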

  32. L1-to-Hamming-space embedding • With C = 11, each coordinate value v is written in unary as v ones followed by C − v zeros, so d' = C·d • Example: p = (2, 8) → 11000000000 11111111000
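A sketch of this unary embedding (function name is illustrative):

```python
def unary_embed(p, C=11):
    """L1 -> Hamming embedding: coordinate v in 0..C becomes v ones, then C - v zeros.
    L1 distance between points equals Hamming distance between their codes."""
    return "".join("1" * v + "0" * (C - v) for v in p)

unary_embed((2, 8))   # -> '11000000000' + '11111111000' (d' = C * d = 22 bits)
```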

  33. Hash function • For p ∈ H^d', G_j(p) = p|I_j, j = 1..L: sample k bit positions I_j from p (k = 3 in the example) • Store p into the bucket labeled p|I_j, one of 2^k buckets • Example: sampled bits 1, 0, 1 → bucket 101

  34. Construction: each point p is stored into one bucket in each of the tables 1, 2, …, L

  35. Query: q is hashed into each of the tables 1, 2, …, L and the points in those buckets are retrieved as candidates
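A minimal end-to-end sketch of slides 33–35, assuming points are given as 0/1 strings of length d' (all names are illustrative):

```python
import random

def build_lsh(points, d_prime, k, L):
    """Bit-sampling LSH: L hash functions G_j(p) = p|I_j, each I_j a set of
    k random bit positions; every point goes into one bucket per table."""
    index_sets = [random.sample(range(d_prime), k) for _ in range(L)]
    tables = [{} for _ in range(L)]
    for p in points:
        for I, table in zip(index_sets, tables):
            table.setdefault("".join(p[i] for i in I), []).append(p)
    return index_sets, tables

def query_lsh(q, index_sets, tables):
    """Collect every point sharing a bucket with q in any table, then
    verify the candidates by exact Hamming distance."""
    candidates = set()
    for I, table in zip(index_sets, tables):
        candidates.update(table.get("".join(q[i] for i in I), []))
    return min(candidates,
               key=lambda p: sum(a != b for a, b in zip(p, q)),
               default=None)
```

Increasing k makes each bucket more selective (fewer false candidates), while increasing L gives a near neighbor more chances to collide with q in at least one table.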

  36.–38. Alternative intuition: random projections • In the unary embedding (C = 11, d' = C·d), bit i of a coordinate v answers the threshold question "is v ≥ i?", so sampling a bit amounts to a random axis-parallel cut of the space • Running example: p = (2, 8) → 11000000000 11111111000

  39. Sampling k = 3 bits projects each point onto a 3-bit label, so the 2^3 = 8 buckets 000, 001, …, 111 partition the space; p's sampled bits 1, 0, 1 place it in bucket 101

  40. k samplings: the k sampled bit positions together make up one hash function G_j

  41.–43. Repeating L times: L independent hash functions (hence L tables) are built, so a near neighbor of q collides with q in at least one of them with high probability

  44. Secondary hashing • The 2^k primary buckets are sparse (empty ones, e.g. 011, are skipped), so a simple secondary hash maps them into M buckets of size B with M·B = α·n, e.g. α = 2 • This supports tuning dataset size vs. storage volume

  45. The above hashing is locality-sensitive • Each of the k sampled bits agrees with probability 1 − Ham(p, q)/d', so Pr[p and q land in the same bucket] = (1 − Ham(p, q)/d')^k • The probability falls with distance, and falls more sharply for k = 2 than for k = 1 [Figure: collision probability vs. distance for k = 1 and k = 2] • Adapted from Piotr Indyk's slides
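Numerically, with illustrative values (Ham(p, q) = 4, d' = 22):

```python
def collision_prob(ham, d_prime, k):
    # each sampled bit matches with prob. 1 - ham/d'; k independent samples
    return (1 - ham / d_prime) ** k

collision_prob(4, 22, 1)   # ~0.82
collision_prob(4, 22, 2)   # ~0.67  (steeper drop for larger k)
```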

  46. Preview • General Solution – Locality sensitive hashing • Implementation for Hamming space • Generalization to l2

  47. Direct L2 solution • A new hashing function • Still based on sampling • Uses a mathematical trick: a p-stable distribution for the Lp distance, the Gaussian distribution for the L2 distance

  48. Central limit theorem (weighted Gaussians): a weighted sum of Gaussians, v1·g1 + v2·g2 + … + vn·gn, is itself a Gaussian

  49. Central limit theorem: v1·X1 + v2·X2 + … + vn·Xn, where v1..vn are real numbers and X1..Xn are independent identically distributed (i.i.d.) random variables

  50. Central limit theorem: for an i.i.d. Gaussian vector X = (X1..Xn), the dot product v·X = Σ vi·Xi is distributed as ||v||2 · N(0, 1), a random projection whose spread encodes the norm of v
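The deck is building toward hashing by random Gaussian projection, presumably the p-stable scheme of Datar, Immorlica, Indyk & Mirrokni: h(v) = ⌊(a·v + b)/w⌋. A minimal sketch under that assumption (w, d, and all names are illustrative):

```python
import numpy as np

def pstable_hash(v, a, b, w):
    """One p-stable LSH hash for L2: h(v) = floor((a.v + b) / w).
    a is a Gaussian random vector (2-stable), b ~ Uniform[0, w), w = bucket width.
    By stability, a.v ~ ||v||_2 * N(0, 1), so nearby points get nearby
    projections and land in the same bucket with higher probability."""
    return int(np.floor((np.dot(a, v) + b) / w))

rng = np.random.default_rng(0)
d, w = 64, 4.0                      # illustrative dimension and bucket width
a = rng.standard_normal(d)          # the sampled Gaussian direction
b = rng.uniform(0.0, w)             # random offset
v = rng.standard_normal(d)
bucket = pstable_hash(v, a, b, w)
```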
