Approximate Nearest Neighbor (Locality Sensitive Hashing) - Theory Given by: Erez Eyal, Uri Klein
Overview: Detailed Lecture Outline • Exact Nearest Neighbor search • Definition • Low dimensions • KD-Trees • Approximate Nearest Neighbor search (LSH based) • Locality Sensitive Hashing families • Algorithm for Hamming Cube • Algorithm for Euclidean space • Summary
Nearest Neighbor Search in Springfield
Nearest “Neighbor” Search for Homer Simpson (features: home planet distance, height, weight, color)
Nearest Neighbor (NN) Search • Given: a set P of n points in R^d (d - dimension) • Goal: a data structure which, given a query point q, finds the nearest neighbor p of q in P (in terms of some distance function D)
Nearest Neighbor Search Interested in designing a data structure, with the following objectives: • Space: O(dn) • Query time: O(d log(n)) • Data structure construction time is not important
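As a baseline, here is a minimal brute-force sketch (assuming D is the Euclidean distance): it meets the O(dn) space objective but spends O(dn) per query, which is exactly what the structures below aim to beat.

```python
import numpy as np

def nearest_neighbor(P, q):
    """Brute-force NN: one linear scan, O(dn) query time, O(dn) space."""
    dists = np.linalg.norm(P - q, axis=1)  # distance from q to every point
    return P[np.argmin(dists)]

# Usage: 1000 points in R^8
P = np.random.rand(1000, 8)
q = np.random.rand(8)
print(nearest_neighbor(P, q))
```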
Lecture Outline • Exact Nearest Neighbor search • Definition • Low dimensions • KD-Trees • Approximate Nearest Neighbor search (LSH based) • Locality Sensitive Hashing families • Algorithm for Hamming Cube • Algorithm for Euclidean space • Summary
Simple cases: 1-D (d = 1) • A binary search will give the solution (e.g., among the points 1, 4, 7, 8, 13, 19, 25, 32 with query q = 9) • Space: O(n); Time: O(log(n))
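A minimal sketch of the 1-D case with Python's bisect: after the binary search, the nearest neighbor must be one of the two sorted points bracketing q.

```python
from bisect import bisect_left

def nn_1d(points, q):
    """1-D NN via binary search on a sorted array: O(log n) per query."""
    i = bisect_left(points, q)
    # The nearest neighbor is one of the (at most) two points around q.
    candidates = points[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda p: abs(p - q))

print(nn_1d([1, 4, 7, 8, 13, 19, 25, 32], 9))  # -> 8
```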
Simple cases: 2-D (d = 2) • Using Voronoi diagrams will give the solution • Space: O(n^2); Time: O(log(n))
Lecture Outline • Exact Nearest Neighbor search • Definition • Low dimensions • KD-Trees • Approximate Nearest Neighbor search (LSH based) • Locality Sensitive Hashing families • Algorithm for Hamming Cube • Algorithm for Euclidean space • Summary
KD-Trees • A KD-tree is a data structure based on recursively subdividing a set of points with alternating axis-aligned hyperplanes. • The classical KD-tree uses O(dn) space and answers queries in time logarithmic in n (worst case O(n)), but exponential in d.
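A minimal KD-tree sketch along these lines: median split on alternating axes, with backtracking into a far subtree only when the splitting hyperplane is closer than the best point found so far. The dict-based node layout is just an illustration choice.

```python
import numpy as np

def build_kdtree(points, depth=0):
    """Recursively split on alternating axes at the median point."""
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]               # cycle through the d axes
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left":  build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def kdtree_nn(node, q, best=None):
    """Descend toward q, then backtrack into the far subtree only if the
    splitting hyperplane is closer than the best distance so far."""
    if node is None:
        return best
    if best is None or np.linalg.norm(q - node["point"]) < np.linalg.norm(q - best):
        best = node["point"]
    diff = q[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = kdtree_nn(near, q, best)
    if abs(diff) < np.linalg.norm(q - best):     # hyperplane may hide a closer point
        best = kdtree_nn(far, q, best)
    return best

# Usage
pts = np.random.rand(100, 2)
tree = build_kdtree(pts)
print(kdtree_nn(tree, np.random.rand(2)))
```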
KD-Trees Construction [figure: points 1-11 recursively subdivided by splitting lines l1-l10, with the corresponding binary tree whose internal nodes are the lines and whose leaves are the points]
KD-Trees Query [figure: a query point q located in the same subdivision; the search follows the corresponding root-to-leaf path in the tree and backtracks into neighboring cells]
Lecture Outline • Exact Nearest Neighbor search • Definition • Low dimensions • KD-Trees • Approximate Nearest Neighbor search (LSH based) • Locality Sensitive Hashing families • Algorithm for Hamming Cube • Algorithm for Euclidean space • Summary
A conjecture: “The curse of dimensionality” In an exact solution, any algorithm for high dimension must use either n^ω(1) space or have d^ω(1) query time. “However, to the best of our knowledge, lower bounds for exact NN Search in high dimensions do not seem sufficiently convincing to justify the curse of dimensionality conjecture” (Borodin et al. ‘99)
Why Approximate NN? • Approximation allows a significant speedup of the calculation (on the order of 10s to 100s) • Fixed-precision arithmetic on computers causes approximation anyway • Heuristics are used for mapping features to numerical values (causing uncertainty anyway)
Approximate Nearest Neighbor (ANN) Search • Given: a set P of n points in R^d (d - dimension) and a slackness parameter ε > 0 • Goal: a data structure which, given a query point q whose nearest neighbor in P is a, finds any p s.t. D(q, p) ≤ (1+ε)·D(q, a) [figure: q, its exact NN a, and the ball of radius (1+ε)·D(q, a) around q]
Locality Sensitive Hashing A (r1, r2, P1, P2)-Locality Sensitive Hashing (LSH) family is a family H of hash functions s.t. for a random hash function h and for any pair of points a, b we have: • D(a, b) ≤ r1 ⟹ Pr[h(a)=h(b)] ≥ P1 • D(a, b) ≥ r2 ⟹ Pr[h(a)=h(b)] ≤ P2 • (r1 < r2, P1 > P2) (A common method to reduce dimensionality without losing distance information) [Indyk-Motwani ’98]
Hamming Cube • A d-dimensional Hamming cube Qd is the set {0, 1}^d • For any a, b ∈ Qd we define the Hamming distance H(a, b) = |{i : ai ≠ bi}|, the number of coordinates on which a and b differ
LSH – Example in Hamming Cube • H = {h | h(a) = ai, i ∈ {1, …, d}}; Pr[h(q)=h(a)] = 1 - H(q, a)/d; Pr is a monotonically decreasing function of H(q, a) • Multi-index hashing: G = {g | g(a) = (h1(a), h2(a), …, hk(a))}; Pr[g(q)=g(a)] = (1 - H(q, a)/d)^k; Pr is a monotonically decreasing function of k
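A short simulation of this family. The k coordinates of g are sampled with replacement here, an assumption that matches the (1 - H(q, a)/d)^k collision probability exactly.

```python
import random

def hamming(a, b):
    """Hamming distance: number of differing coordinates."""
    return sum(x != y for x, y in zip(a, b))

def make_g(d, k, rng=random):
    """g concatenates k single-coordinate hashes h(a) = a_i."""
    idx = [rng.randrange(d) for _ in range(k)]
    return lambda a: tuple(a[i] for i in idx)

# Empirical check: Pr[g(q) = g(a)] ~ (1 - H(q, a)/d)^k
d, k, trials = 32, 4, 20000
a = tuple(random.randint(0, 1) for _ in range(d))
q = tuple(1 - x if i < 8 else x for i, x in enumerate(a))  # H(q, a) = 8
hits = 0
for _ in range(trials):
    g = make_g(d, k)                     # a fresh random g each trial
    hits += g(a) == g(q)
print(hits / trials, (1 - hamming(a, q) / d) ** k)  # both ~ 0.316
```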
Lecture Outline • Exact Nearest Neighbor search • Definition • Low dimensions • KD-Trees • Approximate Nearest Neighbor search (LSH based) • Locality Sensitive Hashing families • Algorithm for Hamming Cube • Algorithm for Euclidean space • Summary
LSH – ANN Search Basic Scheme Preprocess: • Construct several such g functions gi for each l ∈ {1, …, d} • Store each a ∈ P at the entry gi(a) of the corresponding hash table Query: • Perform a binary search on l • In each step retrieve gi(q) (of the current l, if it exists) • Return the last non-empty result
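A sketch of this scheme under stated assumptions: the number k of concatenated h's per level is taken as roughly d/l (an illustrative choice, not fixed by the slide), with a handful of tables per level.

```python
import random
from collections import defaultdict

def build_lsh_tables(P, d, num_tables=10):
    """For each scale l, hash every point on k(l) sampled coordinates;
    k grows as l shrinks, so only nearby points keep colliding."""
    structures = {}
    for l in range(1, d + 1):
        k = max(1, round(d / l))          # assumed choice of k per level l
        tables = []
        for _ in range(num_tables):
            idx = [random.randrange(d) for _ in range(k)]
            table = defaultdict(list)
            for a in P:
                table[tuple(a[i] for i in idx)].append(a)
            tables.append((idx, table))
        structures[l] = tables
    return structures

def lsh_query(structures, q, d):
    """Binary search on l, keeping the last non-empty bucket's point."""
    lo, hi, result = 1, d, None
    while lo <= hi:
        l = (lo + hi) // 2
        found = None
        for idx, table in structures[l]:
            bucket = table.get(tuple(q[i] for i in idx))
            if bucket:
                found = bucket[0]
                break
        if found is not None:
            result, hi = found, l - 1     # hit: try a smaller scale
        else:
            lo = l + 1                    # miss: try a larger scale
    return result

# Usage
P = [tuple(random.randint(0, 1) for _ in range(16)) for _ in range(100)]
S = build_lsh_tables(P, d=16)
print(lsh_query(S, P[0], d=16))
```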
ANN Search in Hamming Cube β-test t: • Pick a subset C of {1, 2, …, d} at random, taking each coordinate independently w.p. β • For each i ∈ C, pick independently and uniformly ri ∈ {0, 1} • For any a ∈ Qd: t(a) = Σ_{i∈C} ri·ai mod 2 (Equivalently, we may pick R ∈ {0, 1}^d s.t. Ri is 1 w.p. β/2, and the test is the inner product of R and a mod 2. Such an R represents a β-test t) [Kushilevitz et al. ’98]
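A minimal implementation of a β-test in its equivalent inner-product form (Ri = 1 w.p. β/2, output ⟨R, a⟩ mod 2):

```python
import random

def make_beta_test(d, beta, rng=random):
    """A beta-test via its equivalent form: R_i = 1 w.p. beta/2,
    t(a) = <R, a> mod 2."""
    R = [1 if rng.random() < beta / 2 else 0 for _ in range(d)]
    return lambda a: sum(r & x for r, x in zip(R, a)) % 2

# Usage
t = make_beta_test(d=16, beta=1 / 8)
a = [random.randint(0, 1) for _ in range(16)]
print(t(a))   # a single test bit in {0, 1}
```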
ANN Search in Hamming Cube • Define: Δ(a, b) = Pr[t(a) ≠ t(b)] • For a query q, let H(a, q) ≤ l and H(b, q) > l(1+ε). Then for β = 1/(2l): Δ(a, q) ≤ δ1 < δ2 ≤ Δ(b, q), where δ1 ≈ (1 - e^(-1/2))/2 and δ2 ≈ (1 - e^(-(1+ε)/2))/2. And define: δ = δ2 - δ1 = Θ(1 - e^(-ε/2))
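These bounds can be checked numerically via a derivation not spelled out on the slide: for a β-test, Δ(a, b) = (1 - (1 - β)^H(a,b))/2, the probability that an odd number of the H differing bits enter the mod-2 sum.

```python
import math

def delta(H, beta):
    """Pr[t(a) != t(b)] for a beta-test when H(a, b) = H: the parity of a
    Binomial(H, beta/2) count is odd w.p. (1 - (1 - beta)**H) / 2."""
    return (1 - (1 - beta) ** H) / 2

l, eps = 50, 0.5
beta = 1 / (2 * l)
print(delta(l, beta))                         # <= ~(1 - e**-0.5)/2      ~ 0.197
print(delta(math.ceil(l * (1 + eps)), beta))  # >= ~(1 - e**-(1.5/2))/2 ~ 0.264
```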
ANN Search in Hamming Cube Data structure: S = {S1, …, Sd} Positive integers - M, T For any l ∈ {1, …, d}, Sl = {G1, …, GM} For any j ∈ {1, …, M}, Gj consists of a set {t1, …, tT} (each tk is a (1/(2l))-test) and a table Aj of 2^T entries
ANN Search in Hamming Cube In each Sl, construct Gj as follows: • Pick {t1, …, tT} independently at random • For v ∈ Qd, the trace is t(v) = (t1(v), …, tT(v)) ∈ {0, 1}^T • An entry z ∈ {0, 1}^T in Aj contains a point a ∈ P if H(t(a), z) ≤ (δ1 + δ/3)·T (else it is empty) The space complexity is dominated by the d·M tables of 2^T entries each
ANN Search in Hamming Cube For any query q and a, b ∈ P s.t. H(q, a) ≤ l and H(q, b) > (1+ε)l, it can be proven using Chernoff bounds that only with probability exponentially small in T does the trace of a land farther than (δ1 + δ/3)·T from t(q), or the trace of b land within that threshold This gives the result that the trace t functions as an LSH family (in its essence) (When the event bounded in these inequalities occurs for some Gj in Sl, Gj is said to ‘fail’) [Alon & Spencer ’92]
ANN Search in Hamming Cube Search Algorithm: We perform a binary search on l. In every step: • Pick Gj in Sl uniformly at random • Compute t(q) from the list of tests in Gj • Check the entry labeled t(q) in Aj: • If the entry contains a point from P, restrict the search to lower l’s • Otherwise restrict the search to greater l’s Return the last non-empty entry found in the search
ANN Search in Hamming Cube Search Algorithm: Example [flowchart] Initialize l = d/2 → access Sl → choose Gj → calculate t(q) → is Aj(t(q)) empty? If yes, continue the search in the upper half of the l’s; if no, set Res ← Aj(t(q)) and continue in the lower half. When every l is covered, return Res
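A sketch of this flowchart, assuming S[l] holds the groups Gj as (tests, table) pairs with Aj represented as a dict from traces to points (the representation is an illustration choice):

```python
import random

def hamming_cube_search(S, q, d):
    """Binary search on l over the structures S_l, as in the flowchart.
    S[l] is a list of groups G_j = (tests, A_j); A_j maps a trace
    z in {0,1}^T to a stored point of P (absent keys mean empty)."""
    lo, hi, res = 1, d, None
    while lo <= hi:
        l = (lo + hi) // 2
        tests, A = random.choice(S[l])       # pick G_j uniformly at random
        z = tuple(t(q) for t in tests)       # the trace t(q)
        if A.get(z) is not None:
            res, hi = A[z], l - 1            # non-empty entry: lower half
        else:
            lo = l + 1                       # empty entry: upper half
    return res
```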
ANN Search in Hamming Cube • Construction of S is said to ‘fail’ if, for some l, more than μM/log(d) structures Gj in Sl ‘fail’ • Choosing M suitably (in terms of some γ, μ), S’s construction fails w.p. of at most γ • If S does not fail, then for every query the search algorithm fails to find an ANN w.p. of at most μ
ANN Search in Hamming Cube • Query time complexity: polynomial in d and log(n) • Space complexity: polynomial in n and d • Both complexities are also proportional to ε^(-2)
Lecture Outline • Exact Nearest Neighbor search • Definition • Low dimensions • KD-Trees • Approximate Nearest Neighbor search (LSH based) • Locality Sensitive Hashing families • Algorithm for Hamming Cube • Algorithm for Euclidean space • Summary
Euclidean Space • The d-dimensional Euclidean space l2^d is R^d endowed with the L2 distance • For any a, b ∈ R^d we define L2(a, b) = (Σi (ai - bi)^2)^(1/2) • The algorithm presented deals with l2^d, and with l1^d under minor changes
Euclidean Space Define: • B(a, r) is the closed ball around a with radius r • D(a, r) = P ∩ B(a, r) (a subset of R^d) [Kushilevitz et al. ’98]
LSH – ANN Search Extended Scheme Preprocess: • Prepare a data structure for each ‘Hamming ball’ induced by any a, b ∈ P Query: • Start with some maximal ball • In each step calculate the ANN within the current ball • Stop according to some threshold
ANN Search in Euclidean Space For a ∈ P, define a Euclidean-to-Hamming mapping h: D(a, r) → {0, 1}^DF: • Define a parameter L • Given a set of i.i.d. unit vectors z1, …, zD • For each zi, the cutting points c1, …, cF are equally spaced on the range of projections of B(a, r) onto zi • Each zi and cj define a coordinate in the DF-Hamming cube, on which the bit of any b ∈ D(a, r) is 0 iff ⟨b, zi⟩ ≤ cj
ANN Search in Euclidean Space Euclidean-to-Hamming Mapping Example: d = 3, D = 2, F = 3 [figure: points a and b projected onto unit vectors z1, z2; each projection is thresholded against 3 cutting points, giving the bit strings h(a), h(b) ∈ {0, 1}^6]
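A sketch of the mapping, under the assumption (consistent with the construction above) that the F cutting points are spread evenly over the projection range of B(a, r) onto each zi:

```python
import numpy as np

rng = np.random.default_rng(0)

def euclid_to_hamming(points, center, r, D=2, F=3):
    """Map points of D(center, r) into the DF-dimensional Hamming cube:
    bit (i, j) is 0 iff the projection onto z_i lies below cutting point c_j."""
    d = len(center)
    Z = rng.normal(size=(D, d))
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)             # D i.i.d. unit vectors
    proj = points @ Z.T                                        # (n, D) projections
    # F cutting points per vector, evenly spaced over the range of B(center, r)
    cuts = (center @ Z.T)[:, None] + np.linspace(-r, r, F)[None, :]   # (D, F)
    bits = (proj[:, :, None] > cuts[None, :, :]).astype(int)   # (n, D, F)
    return bits.reshape(len(points), D * F)

# Usage: Hamming distance between images grows with L2 distance (in expectation)
a = np.zeros(3)
pts = np.array([[0.1, 0.0, 0.0], [0.5, 0.5, 0.0], [0.9, 0.2, 0.3]])
h = euclid_to_hamming(pts, a, r=1.0)
print(np.abs(h[0] - h[1]).sum(), np.abs(h[0] - h[2]).sum())
```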
ANN Search in Euclidean Space • It can be proven that, in expectation, the mapping h preserves the relative distances between points in P • The mapping gets more accurate as r grows smaller
ANN Search in Euclidean Space Data structure: S = {Sa | a ∈ P} Positive integers - D, F, L For any a ∈ P, Sa consists of: • A list of all other elements of P sorted by increasing distance from a • A structure Sa,b for every b ≠ a (b ∈ P)
ANN Search in Euclidean Space Let r = L2(a, b); then Sa,b consists of: • A list of D i.i.d. unit vectors {z1, …, zD} • For each unit vector zi, a list of F cutting points • A Hamming cube data structure of dimension DF, containing D(a, r) • The size of D(a, r)
ANN Search in Euclidean Space Search Algorithm (using a positive integer T): • Pick a random a0 ∈ P, let b0 be the farthest point from a0, and start from Sa0,b0 (r0 = L2(a0, b0)) • For any Saj,bj: • Query for the ANN of h(q) in the Hamming cube d.s. and get a result h(a’) • If L2(q, a’) > rj/10, return a’ • Otherwise, pick T points of D(aj, rj) at random, and let a” be the closest to q among them • Let aj+1 be the closest to q of {aj, a’, a”}
ANN Search in Euclidean Space • Let b’ ∈ P be the farthest point from aj+1 s.t. 2·L2(aj+1, q) ≥ L2(aj+1, b’), using a binary search on the sorted list of Saj+1 • If no such b’ exists, return aj+1 • Otherwise, let bj+1 = b’
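A sketch of the whole loop under stated assumptions: the Hamming-cube ANN query over D(aj, rj) is stood in for by an exact scan (so the code shows only the control flow, not the speedup), and the rj/10 stopping threshold follows the step above.

```python
import numpy as np

rng = np.random.default_rng(1)

def ann_search(P, q, T=5):
    """Ball-shrinking search; the Hamming-cube query is replaced by an
    exact scan of the current ball (a stand-in for this sketch)."""
    dist = lambda x, y: float(np.linalg.norm(x - y))
    a = P[rng.integers(len(P))]
    r = max(dist(a, p) for p in P)                 # b_0: farthest from a_0
    while True:
        ball = [p for p in P if dist(a, p) <= r]   # D(a, r)
        a1 = min(ball, key=lambda p: dist(q, p))   # stand-in for the h(q) query
        if dist(q, a1) > r / 10:
            return a1
        sample = [ball[int(i)] for i in rng.integers(len(ball), size=T)]
        a2 = min(sample, key=lambda p: dist(q, p)) # a'' among T random points
        a = min([a, a1, a2], key=lambda p: dist(q, p))
        # b': farthest point with L2(a, b') <= 2 L2(a, q)
        radii = [dist(a, p) for p in P if 0 < dist(a, p) <= 2 * dist(q, a)]
        if not radii:
            return a
        r = max(radii)

# Usage
P = rng.normal(size=(200, 4))
q = rng.normal(size=4)
print(ann_search(P, q))
```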
ANN Search in Euclidean Space Each ball in the search contains q’s (exact) NN [figure: the ball around ai of radius L2(ai, bi), with q inside]
ANN Search in Euclidean Space • D(ai, ri) contains only points from D(ai-1, ri-1) • D(ai, ri) contains at most half of the points of D(ai-1, ri-1) w.p. of at least 1 - 2^(-T) [figure: the new ball around ai nested inside the previous ball around ai-1, both containing q]
ANN Search in Euclidean Space Conclusion: in expectation, this halving gives an O(log(n)) number of iterations
ANN Search in Euclidean Space Search Algorithm: Example [figure: the initial ball (a0, b0) and the next, smaller ball (a1, b1) closing in on q]
ANN Search in Euclidean Space • Construction of S is said to ‘fail’ if for some Sa,b, h does not preserve the relative distances • Choosing the parameters suitably (for some ζ), S’s construction fails w.p. of at most ζ • If S does not fail, then for every query the search algorithm finds an ANN
ANN Search in Euclidean Space • Query time complexity: polynomial in d and log(n) • Space complexity: polynomial in n and d (S keeps a structure Sa,b for every pair of points) • Both complexities are also proportional to ε^(-2)