The Power-Method is a technique for estimating query selectivity and I/O cost for multi-dimensional queries such as range search and k-nearest neighbor in databases. Addressing an inherent flaw of traditional estimation methods, it uses the concept of a local power law to bypass the "density trap", quantifying local intrinsic dimensionality instead of density. Through experiments and theoretical analysis, the method is shown to estimate a broad family of queries accurately, under any Lp metric, with a single mechanism.
The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries
Yufei Tao (U. Hong Kong), Christos Faloutsos (CMU), Dimitris Papadias (Hong Kong UST)
Roadmap • Problem – motivation • Survey • Proposed method – main idea • Proposed method – details • Experiments • Conclusions
Target query types DB = a set of m-dimensional points. • Range search (RS) • k nearest neighbor (KNN) • Regional distance (self-) join (RDJ) • e.g., in Louisiana, find all pairs of music stores closer than 1 mile to each other
Target problem Estimate • Query selectivity • Query (I/O) cost • for any Lp metric • using a single method
Roadmap • Problem – motivation • Survey • Proposed method – main idea • Proposed method – details • Experiments • Conclusions
Older query estimation approaches • Vast literature • Sampling, kernel estimation, singular value decomposition, compressed histograms, sketches, maximal independence, Euler formula, etc. • BUT: they target specific cases (mostly range-search selectivity under the L∞ norm), and their extensions to other problems are unclear
Main competitors • Local methods: estimates depend on the query's location • Representative methods: histograms • Global methods: provide a single estimate corresponding to the average selectivity/cost of all queries, independently of their locations • Representative methods: fractal dimension and power laws
Rationale and problems of histograms • Partition the data space into a set of buckets and assume (local) uniformity • Problems • the uniformity assumption rarely holds • tricky/slow estimations for all but the L∞ norm
Roadmap • Problem – motivation • Survey • Proposed method – main idea • Proposed method – details • Experiments • Conclusions
Inherent defect of histograms • Density trap – what is the density in the vicinity of q? (figure: points on a line, with L∞ squares of diameter 10 and 100 around q) • diameter=10: 10 points / area 100 = density 0.1 • diameter=100: 100 points / area 10,000 = density 0.01 • Q: What is going on? • A: we ask a silly question: ~ “what is the area of a line?”
“Density Trap” • Caused not by a mathematical oddity like the Hilbert curve, but by a line, a perfectly behaved Euclidean object! • This ‘trap’ will appear for any non-uniform dataset • Almost ALL real point-sets are non-uniform -> the trap is real (see the sketch below)
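To make the trap concrete, here is a tiny numeric sketch (hypothetical Python; the line dataset and all names are made up, not from the paper): the measured “density” around q drops roughly 10x every time the measurement scale grows 10x, so density is not a well-defined local property.

```python
# Minimal sketch of the "density trap": 2-d points lying on a line,
# one point per unit along each axis (hypothetical data).
import numpy as np

pts = np.stack([np.arange(1000.0)] * 2, axis=1)   # points on a diagonal line
q = pts[500]                                      # query point in the middle

for diameter in (10, 100):
    r = diameter / 2.0
    # L-infinity neighborhood: a square of the given diameter around q
    count = np.sum(np.max(np.abs(pts - q), axis=1) <= r)
    print(diameter, count / diameter ** 2)        # "density" drops ~10x per 10x scale
```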
“Density Trap” In short: density (= count / area) is meaningless • What should we do instead? • A: look at the slope of log(count_of_neighbors) vs log(area)
Local power law • In more detail, the ‘local power law’ (LPL): nbp(r) = cp · r^np • nbp(r): # neighbors of point p, within radius r • cp: ‘local constant’ • np: ‘local exponent’ (= local intrinsic dimensionality)
Local power law Intuitively: to avoid the ‘density trap’, use • np: the local intrinsic dimensionality • instead of density (a fitting sketch follows)
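As an illustration only (not the authors' code; `local_power_law` and its parameters are invented), fitting cp and np amounts to least squares on the log-log plot of neighbor count vs radius:

```python
# Sketch of fitting the local power law nb_p(r) = c_p * r^{n_p}
# by least squares on the log-log plot (hypothetical helper).
import numpy as np

def local_power_law(points, p, radii, norm_ord=np.inf):
    """Fit (c_p, n_p) for point p from neighbor counts at several radii."""
    dists = np.linalg.norm(points - p, ord=norm_ord, axis=1)
    counts = np.array([np.sum(dists <= r) for r in radii])
    n_p, log_c = np.polyfit(np.log(radii), np.log(counts), 1)
    return np.exp(log_c), n_p            # local constant, local exponent

# For the line data from the earlier sketch, the fitted exponent is ~1,
# i.e., the local intrinsic dimensionality of a line:
line = np.stack([np.arange(1000.0)] * 2, axis=1)
print(local_power_law(line, line[500], np.array([4.0, 8.0, 16.0, 32.0])))
```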
Does LPL make sense? • For point ‘q’ on the line dataset above: LPL gives nbq(r) = <constant> · r^1 (no need for ‘density’, nor uniformity) • diameter=10: ~10 neighbors; diameter=100: ~100 neighbors – the single exponent 1 covers both scales
Local power law and Lx • If a point obeys the LPL under L∞, ditto for any other Lx metric, with the same ‘local exponent’ -> LPL works easily, for ANY Lx metric (a quick empirical check follows)
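A quick empirical sanity check of this claim, as a sketch on the same hypothetical line data: the fitted exponent comes out ~1 under both L2 and L∞; only the constant differs.

```python
# Same local exponent under L2 and L-infinity (only c_p changes).
import numpy as np

line = np.stack([np.arange(1000.0)] * 2, axis=1)
q, radii = line[500], np.array([4.0, 8.0, 16.0, 32.0])
for norm_ord in (2, np.inf):
    d = np.linalg.norm(line - q, ord=norm_ord, axis=1)
    counts = [np.sum(d <= r) for r in radii]
    slope = np.polyfit(np.log(radii), np.log(counts), 1)[0]
    print(norm_ord, round(slope, 2))     # both metrics give a slope of ~1
```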
Examples (log-log plot: #neighbors(<=r) vs radius, for two points p1 and p2) • p1 has a higher ‘local exponent’ = ‘local intrinsic dimensionality’ than p2
Roadmap • Problem – motivation • Survey • Proposed method – main idea • Proposed method – details • Experiments • Conclusions
Proposed method • Main idea: if we know (or can approximate) the cp and np of every point p, we can solve all the problems:
Target Problem • for any Lp metric (Lemma 3.2) • using a single method
Theoretical results • Interesting observation (Thm 3.4): the cost of a kNN query q depends • only on the ‘local exponent’ np • and NOT on the ‘local constant’ cp, • nor on the cardinality of the dataset
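As a worked sketch of how the LPL coefficients drive kNN estimation (a simplified reading, not the paper's exact cost formula of Thm 3.4; `knn_radius` is a hypothetical helper): setting nbq(r) = k and solving for r yields an expected distance to the k-th nearest neighbor.

```python
# Invert nb_q(r) = c_q * r^{n_q} at nb = k  =>  r_k = (k / c_q)^{1 / n_q}
def knn_radius(c_q: float, n_q: float, k: int) -> float:
    """Expected distance to the k-th nearest neighbor of q (sketch)."""
    return (k / c_q) ** (1.0 / n_q)

# e.g., with hypothetical coefficients c_q = 2.0, n_q = 1.0 (line-like data),
# the expected 10-NN distance is (10 / 2) ** 1 = 5.0:
print(knn_radius(2.0, 1.0, 10))
```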
Implementation • Given a query point q, we need its local exponent and constants to perform estimation • but: too expensive to store, for every point. • Q: What to do? • A: exploit locality:
Implementation • Nearby points usually have similar local constants and exponents. Thus, one solution: • ‘anchors’: pre-compute the LPL for a set of representative points (anchors) – use the ‘anchor’ nearest to q
Implementation • Choose anchors: with sampling, density-biased sampling (DBS), or any other method.
Implementation • (In addition to ‘anchors’, we also tried ‘patches’ of near-constant cp and np – similar accuracy, at the cost of a more complicated implementation; a minimal anchor sketch follows)
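The anchor scheme, as a minimal sketch (class and method names are hypothetical, and the estimator is the plain LPL formula rather than the paper's full machinery): precompute (cp, np) per anchor, then answer a query from the coefficients of the anchor nearest to q.

```python
# Sketch of the anchor idea (hypothetical names, simplified estimator).
import numpy as np

class PowerMethod:
    def __init__(self, anchors, coeffs):
        self.anchors = np.asarray(anchors, float)   # (a, d): anchor locations
        self.coeffs = coeffs                        # [(c_p, n_p)] per anchor

    def estimate_range_selectivity(self, q, r, db_size):
        """Selectivity of range query (q, r) via the nearest anchor's LPL."""
        i = np.argmin(np.linalg.norm(self.anchors - q, ord=np.inf, axis=1))
        c, n = self.coeffs[i]
        return (c * r ** n) / db_size               # expected #neighbors / |DB|

# Toy usage with two made-up anchors:
pm = PowerMethod(anchors=[[0, 0], [100, 100]], coeffs=[(2.0, 1.0), (1.5, 2.0)])
print(pm.estimate_range_selectivity(q=np.array([90.0, 95.0]), r=10, db_size=40_000))
```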
Experiments - Settings • Datasets • SC: 40k points representing the coast lines of Scandinavia • LB: 53k points corresponding to locations in Long Beach county • Structure: R*-tree • Compare the Power-method to • Minskew (histogram / local method) • Global method (fractal)
Experiments - Settings • The LPL coefficients of each anchor point are computed using L∞ 0.05-neighborhoods • Queries: biased (following the data distribution) • A query workload contains 500 queries • We report the average error Σi |acti − esti| / Σi acti
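Written out as a small helper (hypothetical name), the reported error metric is:

```python
# Average relative workload error: sum_i |act_i - est_i| / sum_i act_i
import numpy as np

def workload_error(act, est):
    act, est = np.asarray(act, float), np.asarray(est, float)
    return np.abs(act - est).sum() / act.sum()

print(workload_error([10, 20], [12, 18]))   # (2 + 2) / 30 = 0.1333...
```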
Range search selectivity • the LPL method wins
Regional distance join selectivity • No known global method applies in this case • The LPL method wins, by an even higher margin
Conclusions • We spotted the “density trap” problem of the local uniformity assumption (<- histograms) • we showed how to resolve it, using the ‘local intrinsic dimension’ instead (-> ‘Local Power Law’) • and we solved all posed problems:
Conclusions – cont’d • for any Lp metric (Lemma 3.2) • using a single method (LPL & ‘anchors’)