Approximate Selection Queries over Imprecise Data
Presentation Transcript
Approximate Selection Queries over Imprecise Data Iosif Lazaridis and Sharad Mehrotra University of California, Irvine ICDE Conference, March 2004 Boston, MA, USA
Talk Outline • Regular vs. Approximate Selection Queries • Quality-Aware Queries (QaQs) • Optimization for QaQs • Performance Study • Conclusions
Regular Selection Queries • Set of precise objects T • Selection predicate λ • Exact set E = σλ(T)
Imprecise Objects • An imprecise object o corresponds to a precise object ωo, which can be retrieved (at a cost) via a probe operation • [Figure: examples of imprecise objects, labeled x(t), (x, y), and a ≤ x ≤ b]
Approximate Selection Queries • Set of imprecise objects T • Selection predicate λ • Approximate answer A to σλ(T), in which the membership of some objects is uncertain (marked “?” in the figure)
Formal Problem Setting • Let T be a set of imprecise objects • Let λ be a selection predicate which maps an imprecise object to the set {YES, NO, MAYBE} • The exact set is: E = {ωo | o ∈ T ∧ λ(ωo) = YES} • The goal is to produce an approximate answer A with associated “quality guarantees” • A will potentially contain both precise and imprecise objects
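As a minimal sketch of a three-valued predicate, assume (this is not stated on the slide) that an imprecise object is a numeric interval and that λ is a range condition. The names `Interval` and `range_predicate` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Hypothetical imprecise object: the precise value lies somewhere in [lo, hi]."""
    lo: float
    hi: float


def range_predicate(obj: Interval, low: float, high: float) -> str:
    """Evaluate 'low <= value <= high' on an interval; returns YES, NO or MAYBE."""
    if low <= obj.lo and obj.hi <= high:
        return "YES"    # every possible precise value satisfies the predicate
    if obj.hi < low or obj.lo > high:
        return "NO"     # no possible precise value satisfies the predicate
    return "MAYBE"      # the interval straddles a boundary of the range


objects = (Interval(12, 15), Interval(25, 30), Interval(18, 22))
labels = [range_predicate(o, 10, 20) for o in objects]
print(labels)
```

A precise object is the special case `lo == hi`, for which the predicate can only return YES or NO.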
Quality Metrics • Set-based quality • Precision: fraction of objects in A that are also in E: p = |A ∩ E| / |A| • Recall: fraction of objects in E that are also in A: r = |A ∩ E| / |E| • Value-based quality • Each imprecise object o has laxity l(o) • Each precise object ωo has laxity 0 • Answer laxity: lmax = max{l(x) : x ∈ A}
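The set-based and value-based metrics can be illustrated with a short sketch. Objects are represented here by hypothetical integer ids, and the convention for empty sets (returning 1.0) is an assumption of this example, not something the slides specify.

```python
def precision(answer: set, exact: set) -> float:
    """p = |A ∩ E| / |A|; an empty answer is treated as perfectly precise (assumption)."""
    return len(answer & exact) / len(answer) if answer else 1.0


def recall(answer: set, exact: set) -> float:
    """r = |A ∩ E| / |E|; an empty exact set is treated as fully recalled (assumption)."""
    return len(answer & exact) / len(exact) if exact else 1.0


def answer_laxity(laxity_by_id: dict) -> float:
    """lmax = largest laxity among the objects in A; precise objects have laxity 0."""
    return max(laxity_by_id.values(), default=0.0)


A = {1, 2, 3, 4}        # answer: ids of returned objects
E = {2, 3, 4, 5, 6}     # exact set: ids that truly satisfy the predicate
p = precision(A, E)     # 3 of the 4 returned objects are correct
r = recall(A, E)        # 3 of the 5 correct objects were returned
l_max = answer_laxity({1: 0.0, 2: 5.0, 3: 2.5, 4: 0.0})
print(p, r, l_max)
```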
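Quality Guarantees • The total set T is partitioned into Y (YES), M (MAYBE) and N (NO) objects; M is split into Ms and Mns • The answer A contains AY (YES objects) and AMs (MAYBE objects from Ms) • Precision guarantee: pG = |AY| / (|AMs| + |AY|) • Recall guarantee: rG = |AY| / (|Y| + |Mns| + |Ms − A|) • Laxity guarantee: lmax = max{l(x) : x ∈ A}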
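A small sketch of how the guarantees could be computed from set sizes, assuming the slide's formulas read pG = |AY| / (|AMs| + |AY|) and rG = |AY| / (|Y| + |Mns| + |Ms − A|). The function and parameter names are hypothetical, and the counts in the example are made up for illustration.

```python
def precision_guarantee(a_y: int, a_ms: int) -> float:
    """pG = |AY| / (|AMs| + |AY|): worst case, no MAYBE object in A is in E."""
    total = a_y + a_ms
    return a_y / total if total else 1.0   # empty answer: assumed convention


def recall_guarantee(a_y: int, y: int, m_ns: int, ms_minus_a: int) -> float:
    """rG = |AY| / (|Y| + |Mns| + |Ms - A|): worst case, every MAYBE object
    outside A belongs to E."""
    total = y + m_ns + ms_minus_a
    return a_y / total if total else 1.0   # nothing could be in E: assumed convention


p_g = precision_guarantee(a_y=30, a_ms=10)
r_g = recall_guarantee(a_y=30, y=40, m_ns=5, ms_minus_a=15)
print(p_g, r_g)
```

Both values are lower bounds: the true precision and recall of A can only be equal to or better than these figures.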
Quality-Aware Query (QaQ) • Input consists of: • Set T • Predicate λ • Quality requirements pq, rq, lqmax • Answer A should be such that: pG ≥ pq, rG ≥ rq and lqmax ≥ lmax
QaQ Selection Operator • Requires O(1) memory/processing per input object • Each object o is read, and λ(o) is evaluated • Three choices for each object o: • Forward it to A • Ignore it • Probe it, get ωo, then Forward or Ignore ωo
Handling Objects • Ignore NO objects • YES objects • If l(o) > lqmax: Probe or Ignore • Else: Forward • MAYBE objects • If l(o) > lqmax: Probe or Ignore • Else: all three choices are feasible
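The case analysis above can be sketched as a function that returns the set of feasible actions for one object. The function name and the string labels are hypothetical; the rules themselves follow the slide.

```python
def feasible_actions(label: str, laxity: float, lq_max: float) -> set:
    """Return the actions allowed for an object with predicate result `label`."""
    if label == "NO":
        return {"ignore"}
    if label not in ("YES", "MAYBE"):
        raise ValueError(f"unknown label: {label!r}")
    if laxity > lq_max:
        # Too imprecise to forward as-is: either pay for a probe or drop it.
        return {"probe", "ignore"}
    if label == "YES":
        return {"forward"}
    # MAYBE object within the laxity requirement: all three choices are open.
    return {"forward", "probe", "ignore"}


print(feasible_actions("MAYBE", laxity=30, lq_max=50))
```

Which feasible action is actually taken is a separate decision, made by the optimizer subject to the correctness constraints.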
Ensuring Correctness • No object with laxity l(o) > lqmax may be forwarded • The precision guarantee pG may not be lower than pq • If no other YES objects remain to be seen, then pq will be violated • If |AY| / (|Y| + |Ms − A|) < rq then an object o cannot be ignored • If no other YES objects remain to be seen, then rq will be violated
QaQ Evaluation Cost • R: number of objects read (R ≤ |T|) • Y, M: number of objects that were YES/MAYBE at the input • Yf, Yp: number of YES objects that are forwarded/probed (Yf + Yp ≤ Y) • Mf, Mp: number of MAYBE objects that are forwarded/probed (Mf + Mp ≤ M) • Mpy: number of probed MAYBE objects that become YES • Cost: W = R·cr + (Yp + Mp)·cp + (Yf + Mf)·cwi + (Yp + Mpy)·cwp • The first term is the read cost, the second the probe cost, and the last two the write costs
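As a worked example of the cost formula, the sketch below evaluates W for made-up counts. The function name, keyword names and the sample numbers are hypothetical; the probe cost of 100 units mirrors the setting used later in the performance study.

```python
def evaluation_cost(R, Yf, Yp, Mf, Mp, Mpy, cr=1.0, cp=100.0, cwi=1.0, cwp=1.0):
    """W = R*cr + (Yp+Mp)*cp + (Yf+Mf)*cwi + (Yp+Mpy)*cwp."""
    read_cost = R * cr
    probe_cost = (Yp + Mp) * cp
    # Forwarded objects are written at cost cwi; probed objects that end up
    # in the answer (all probed YES plus probed MAYBE that became YES) at cwp.
    write_cost = (Yf + Mf) * cwi + (Yp + Mpy) * cwp
    return read_cost + probe_cost + write_cost


W = evaluation_cost(R=1000, Yf=100, Yp=10, Mf=50, Mp=20, Mpy=8)
print(W)  # 1000 + 30*100 + 150 + 18
```

With these numbers the 30 probes account for 3000 of the 4168 cost units, which is why avoiding probes dominates the optimization when cp is large.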
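The “Map” • [Figure: decision map over the (s(o), l(o)) plane, divided into seven numbered regions] • Horizontal axis: s(o), the probability that a MAYBE object turns out to be YES (s(o) = 0: NO; 0 < s(o) < 1: MAYBE; s(o) = 1: YES) • Vertical axis: laxity l(o), with the threshold lqmax marked • Actions shown in the regions: Ignore; Probe; Forward; Probe with probability ppy or Ignore; Forward with probability pfm or Ignore • The thresholds s3 and s5 on s(o) delimit regions within the MAYBE column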
Query Optimization • Free parameters ppy, s3, s5, pfm • Estimate # of YES, NO, MAYBE objects • Estimate # of YES, MAYBE objects above the lqmax laxity requirement • Requires some knowledge of the distribution of l(o) • Distribution of s(o) • Minimize cost W subject to pq, rq, lqmax • 4-parameter optimization problem
Query Evaluation • Get selectivity estimates • Solve optimization problem for ppy, s3, s5, pfm, thus instantiating the “Map” • Read one object at a time, handle it according to the “Map” • Make sure correctness criteria are enforced! • Finish when rG ≥ rq
Performance Study • Size of input |T| = 10,000 • Laxity ranges in [0,100] • Probe cost = 100 × read/write unit cost • We vary: • Precision, Recall, Laxity Requirement • Query selectivity • Input Uncertainty (ratio of YES/MAYBE objects) • Costs are normalized by dividing by |T|
Competing Algorithms • We devised two simple heuristics: • STINGY avoids probes: it ignores MAYBE objects and objects exceeding the lqmax threshold. • STINGY is conservative, but sometimes it is forced to probe to meet the quality guarantees. • GREEDY forwards all MAYBE objects and probes all objects that exceed the lqmax threshold. • GREEDY tries to produce the result quickly by not ignoring objects, but sometimes it uses too many probes and forwards too many objects
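The default choice of each heuristic can be sketched as follows. This is only a reading of the descriptions above: it assumes both heuristics ignore NO objects and forward YES objects that meet the laxity requirement, and it leaves out the fallback in which STINGY is forced to probe to meet the quality guarantees. The function names are hypothetical.

```python
def stingy_choice(label: str, laxity: float, lq_max: float) -> str:
    """STINGY default: never probe; drop MAYBE objects and over-lax objects."""
    if label == "NO" or label == "MAYBE" or laxity > lq_max:
        return "ignore"
    return "forward"   # YES object within the laxity requirement


def greedy_choice(label: str, laxity: float, lq_max: float) -> str:
    """GREEDY default: never drop a candidate; probe whatever is too lax to forward."""
    if label == "NO":
        return "ignore"
    if laxity > lq_max:
        return "probe"
    return "forward"   # YES or MAYBE object within the laxity requirement


for label, laxity in [("YES", 10), ("YES", 80), ("MAYBE", 10), ("MAYBE", 80)]:
    print(label, laxity, stingy_choice(label, laxity, 50), greedy_choice(label, laxity, 50))
```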
Varying Laxity • Input has 20% YES, 20% MAYBE objects • 90% Precision and 50% Recall are requested • As the laxity requirement becomes looser, the cost is reduced, since imprecise objects can be forwarded without a probe
Varying Precision • Input has 20% YES, 20% MAYBE objects • 50% Recall and laxity=50 are requested • Cost increases as the Precision requirement increases, since objects can’t be forwarded unprobed
Varying Recall • Input has 20% YES, 20% MAYBE objects • 90% Precision and laxity=50 are requested • Cost increases as the Recall requirement increases • When the Recall requirement is low, only part of the input needs to be read • As the Recall requirement tends to 100%, all the input must be read and no objects can be ignored
Varying Selectivity • Input has 20% YES, 20% MAYBE objects • 90% Precision, 50% Recall, and laxity=50 are requested • Cost increases as selectivity increases, since more objects need to be output
Varying Input Uncertainty • Input has 20% YES, 20% MAYBE objects • 90% Precision, 50% Recall, and laxity=50 are requested • When MAYBE objects are few, no probe cost needs to be paid: the few MAYBE objects can be ignored • When MAYBE objects are many, they cannot be ignored (Recall might be violated) or forwarded (Precision might be violated). Hence, they are probed, increasing the cost
Conclusions • Quality-Aware Queries (QaQs) • Query: predicate + quality requirement • Response: answer + quality guarantee • Quality Metrics for Set-Based Answers • On-line algorithm for evaluating QaQs • Works better than simple heuristics • Takes into account input characteristics/user requirements • Combines data read/write + probing cost • Future Work: • Indexes, Joins
Thank You!