The Power-Method is a technique for estimating query selectivity and I/O cost for multi-dimensional queries such as range search and k-nearest neighbor in databases. Addressing an inherent flaw of traditional estimation methods, it uses the concept of a local power law to bypass the "density trap", quantifying local intrinsic dimensionality instead of density. Through experiments and theoretical analysis, the method is shown to estimate a broad family of queries accurately, under any Lp metric, with a single mechanism.
The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries
Yufei Tao (U. Hong Kong), Christos Faloutsos (CMU), Dimitris Papadias (Hong Kong UST)
Roadmap • Problem – motivation • Survey • Proposed method – main idea • Proposed method – details • Experiments • Conclusions
Target query types DB = a set of m-dimensional points. • Range search (RS) • k nearest neighbor (KNN) • Regional distance (self-) join (RDJ) • e.g., in Louisiana, find all pairs of music stores closer than 1 mile to each other
Target problem Estimate • Query selectivity • Query (I/O) cost • for any Lp metric • using a single method
Roadmap • Problem – motivation • Survey • Proposed method – main idea • Proposed method – details • Experiments • Conclusions
Older query estimation approaches • Vast literature • Sampling, kernel estimation, singular value decomposition, compressed histograms, sketches, maximal independence, Euler formula, etc. • BUT: they target specific cases (mostly range-search selectivity under the L∞ norm), and their extensions to other problems are unclear
Main competitors • Local methods: estimates depend on the query's location • Representative methods: histograms • Global methods: provide a single estimate corresponding to the average selectivity/cost of all queries, independently of their locations • Representative methods: fractal dimension and power laws
Rationale and problems of histograms • Partition the data space into a set of buckets and assume (local) uniformity • Problems • the uniformity assumption rarely holds • tricky/slow estimations for all but the L∞ norm
Roadmap • Problem – motivation • Survey • Proposed method – main idea • Proposed method – details • Experiments • Conclusions
Inherent defect of histograms • Density trap – what is the density in the vicinity of q? (figure: points on a line, with L∞ squares of diameter 10 and 100 around q) • diameter=10: 10 points / area 100 = density 0.1 • diameter=100: 100 points / area 10,000 = density 0.01 • Q: What is going on? • A: we ask a silly question: ~ “what is the area of a line?”
“Density Trap” • Caused not by a mathematical oddity like the Hilbert curve, but by a line, a perfectly behaved Euclidean object! • This ‘trap’ will appear for any non-uniform dataset • Almost ALL real point-sets are non-uniform -> the trap is real (see the sketch below)
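To make the trap concrete, here is a tiny numeric sketch (hypothetical Python; the line dataset and all names are made up, not from the paper): the measured “density” around q drops roughly 10x every time the measurement scale grows 10x, so density is not a well-defined local property.

```python
# Minimal sketch of the "density trap": 2-d points lying on a line,
# one point per unit along each axis (hypothetical data).
import numpy as np

pts = np.stack([np.arange(1000.0)] * 2, axis=1)   # points on a diagonal line
q = pts[500]                                      # query point in the middle

for diameter in (10, 100):
    r = diameter / 2.0
    # L-infinity neighborhood: a square of the given diameter around q
    count = np.sum(np.max(np.abs(pts - q), axis=1) <= r)
    print(diameter, count / diameter ** 2)        # "density" drops ~10x per 10x scale
```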
“Density Trap” In short: density (= count / area) is meaningless • What should we do instead? • A: look at the slope of log(count_of_neighbors) vs log(area)
Local power law • In more detail, the ‘local power law’ (LPL): nbp(r) = cp · r^np • nbp(r): # neighbors of point p, within radius r • cp: ‘local constant’ • np: ‘local exponent’ (= local intrinsic dimensionality)
Local power law Intuitively: to avoid the ‘density trap’, use • np: the local intrinsic dimensionality • instead of density (a fitting sketch follows)
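As an illustration only (not the authors' code; `local_power_law` and its parameters are invented), fitting cp and np amounts to least squares on the log-log plot of neighbor count vs radius:

```python
# Sketch of fitting the local power law nb_p(r) = c_p * r^{n_p}
# by least squares on the log-log plot (hypothetical helper).
import numpy as np

def local_power_law(points, p, radii, norm_ord=np.inf):
    """Fit (c_p, n_p) for point p from neighbor counts at several radii."""
    dists = np.linalg.norm(points - p, ord=norm_ord, axis=1)
    counts = np.array([np.sum(dists <= r) for r in radii])
    n_p, log_c = np.polyfit(np.log(radii), np.log(counts), 1)
    return np.exp(log_c), n_p            # local constant, local exponent

# For the line data from the earlier sketch, the fitted exponent is ~1,
# i.e., the local intrinsic dimensionality of a line:
line = np.stack([np.arange(1000.0)] * 2, axis=1)
print(local_power_law(line, line[500], np.array([4.0, 8.0, 16.0, 32.0])))
```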
Does LPL make sense? • For point ‘q’ on the line dataset above: LPL gives nbq(r) = <constant> · r^1 (no need for ‘density’, nor uniformity) • diameter=10: ~10 neighbors; diameter=100: ~100 neighbors – the single exponent 1 covers both scales
Local power law and Lx • If a point obeys the LPL under L∞, ditto for any other Lx metric, with the same ‘local exponent’ -> LPL works easily, for ANY Lx metric (a quick empirical check follows)
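A quick empirical sanity check of this claim, as a sketch on the same hypothetical line data: the fitted exponent comes out ~1 under both L2 and L∞; only the constant differs.

```python
# Same local exponent under L2 and L-infinity (only c_p changes).
import numpy as np

line = np.stack([np.arange(1000.0)] * 2, axis=1)
q, radii = line[500], np.array([4.0, 8.0, 16.0, 32.0])
for norm_ord in (2, np.inf):
    d = np.linalg.norm(line - q, ord=norm_ord, axis=1)
    counts = [np.sum(d <= r) for r in radii]
    slope = np.polyfit(np.log(radii), np.log(counts), 1)[0]
    print(norm_ord, round(slope, 2))     # both metrics give a slope of ~1
```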
Examples (log-log plot: #neighbors(<=r) vs radius, for two points p1 and p2) • p1 has a higher ‘local exponent’ = ‘local intrinsic dimensionality’ than p2
Roadmap • Problem – motivation • Survey • Proposed method – main idea • Proposed method – details • Experiments • Conclusions
Proposed method • Main idea: if we know (or can approximate) the cp and np of every point p, we can solve all the problems:
Target Problem • for any Lp metric (Lemma 3.2) • using a single method
Theoretical results • Interesting observation (Thm 3.4): the cost of a kNN query q depends • only on the ‘local exponent’ np • and NOT on the ‘local constant’ cp, • nor on the cardinality of the dataset
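As a worked sketch of how the LPL coefficients drive kNN estimation (a simplified reading, not the paper's exact cost formula of Thm 3.4; `knn_radius` is a hypothetical helper): setting nbq(r) = k and solving for r yields an expected distance to the k-th nearest neighbor.

```python
# Invert nb_q(r) = c_q * r^{n_q} at nb = k  =>  r_k = (k / c_q)^{1 / n_q}
def knn_radius(c_q: float, n_q: float, k: int) -> float:
    """Expected distance to the k-th nearest neighbor of q (sketch)."""
    return (k / c_q) ** (1.0 / n_q)

# e.g., with hypothetical coefficients c_q = 2.0, n_q = 1.0 (line-like data),
# the expected 10-NN distance is (10 / 2) ** 1 = 5.0:
print(knn_radius(2.0, 1.0, 10))
```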
Implementation • Given a query point q, we need its local exponent and constants to perform estimation • but: too expensive to store, for every point. • Q: What to do? • A: exploit locality:
Implementation • Nearby points usually have similar local constants and exponents. Thus, one solution: • ‘anchors’: pre-compute the LPL for a set of representative points (anchors) – use the ‘anchor’ nearest to q
Implementation • Choose anchors: with sampling, density-biased sampling (DBS), or any other method.
Implementation • (In addition to ‘anchors’, we also tried ‘patches’ of near-constant cp and np – similar accuracy, at the cost of a more complicated implementation; a minimal anchor sketch follows)
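The anchor scheme, as a minimal sketch (class and method names are hypothetical, and the estimator is the plain LPL formula rather than the paper's full machinery): precompute (cp, np) per anchor, then answer a query from the coefficients of the anchor nearest to q.

```python
# Sketch of the anchor idea (hypothetical names, simplified estimator).
import numpy as np

class PowerMethod:
    def __init__(self, anchors, coeffs):
        self.anchors = np.asarray(anchors, float)   # (a, d): anchor locations
        self.coeffs = coeffs                        # [(c_p, n_p)] per anchor

    def estimate_range_selectivity(self, q, r, db_size):
        """Selectivity of range query (q, r) via the nearest anchor's LPL."""
        i = np.argmin(np.linalg.norm(self.anchors - q, ord=np.inf, axis=1))
        c, n = self.coeffs[i]
        return (c * r ** n) / db_size               # expected #neighbors / |DB|

# Toy usage with two made-up anchors:
pm = PowerMethod(anchors=[[0, 0], [100, 100]], coeffs=[(2.0, 1.0), (1.5, 2.0)])
print(pm.estimate_range_selectivity(q=np.array([90.0, 95.0]), r=10, db_size=40_000))
```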
Experiments - Settings • Datasets • SC: 40k points representing the coast lines of Scandinavia • LB: 53k points corresponding to locations in Long Beach county • Structure: R*-tree • Compare the Power-method to • Minskew (histogram / local method) • Global method (fractal)
Experiments - Settings • The LPL coefficients of each anchor point are computed using L∞ 0.05-neighborhoods • Queries: biased (following the data distribution) • A query workload contains 500 queries • We report the average error Σi |acti − esti| / Σi acti
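Written out as a small helper (hypothetical name), the reported error metric is:

```python
# Average relative workload error: sum_i |act_i - est_i| / sum_i act_i
import numpy as np

def workload_error(act, est):
    act, est = np.asarray(act, float), np.asarray(est, float)
    return np.abs(act - est).sum() / act.sum()

print(workload_error([10, 20], [12, 18]))   # (2 + 2) / 30 = 0.1333...
```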
Range search selectivity • the LPL method wins
Regional distance join selectivity • No known global method applies in this case • The LPL method wins, by an even higher margin
Conclusions • We spotted the “density trap” problem of the local uniformity assumption (<- histograms) • we showed how to resolve it, using the ‘local intrinsic dimension’ instead (-> ‘Local Power Law’) • and we solved all posed problems:
Conclusions – cont’d • for any Lp metric (Lemma 3.2) • using a single method (LPL & ‘anchors’)