1 / 38

A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries. By : Surajid Chaudhuri Gautam Das Vivek Narasayya. Presented by : Sayed Muchallil September 21 st , 2010 . CONTENTS. INTRODUCTION ARCHITECTURE FOR APPROXIMATE QUERY PROCESSING FIXED WORKLOAD

cheche
Télécharger la présentation

A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries By : SurajidChaudhuri Gautam Das VivekNarasayya Presented by :SayedMuchallil September 21st, 2010

  2. CONTENTS • INTRODUCTION • ARCHITECTURE FOR APPROXIMATE QUERY PROCESSING • FIXED WORKLOAD • STRATIFIED SAMPLING • SOLUTION • SUMMARY

  3. Pre-computed samples • Can give approximate answer very efficiently. • Workload are used to make sure that errors are acceptable.

  4. Previous Studies • Solution is difficult to evaluate theoretically. • Do not formally deal with uncertainty in the expected workload. • Ignoring the variance in the data distribution.

  5. Sample • Only 50% of R records can be used as sample • Query : “SELECT SUM(Revenue) FROM R” • The answer for is 1030 Table R

  6. Sample (cont.) • The answer for the query for table S1 is 40. • The answer for the query for table S2is 2020. • How to get these answer? Sample Table S1 Sample Table S2

  7. Sample (cont.) • large variance in the aggregate column can lead to large relative errors. • Relative error =|y- y’| / y • Relative error for S1 = |1030 – 40| / 1030 • Relative error for S2= |1030 – 2020| / 1030

  8. What’s New ? • The goal is to pick sample that minimize error. • If actual workload is identical to the given workload (fixed), error will be smaller. • Can work for identical and similar query to the given workload.

  9. Sampling • Two ways for selecting samples • Randomized • Deterministic • A Workload W is a set of pairs of queries and their weight. • W = {<Q1, w1>,<Q2, w2>,…<Qq, wq>} • Σiwi = 1.

  10. Architecture for Approximate Query Processing

  11. Architecture (cont.) • Offline Component • Selects sample or records from relation R • Online Component • Rewrites an incoming query to use the sample. What is “rewrites” means? • Reports answer with an estimate error

  12. Architecture (cont.) • New method for automatically lifting a given workload. • It is unrealistic to assume that the incoming queries will be identical to the given workload. • The key : the ability to compute a probability distribution Pw.

  13. Error Metrics • Relative Error : |y - y’| / y • Squared Error : SE(Q) = (|y - y’| / y)² • Squared Error for GROUP BY query SE(Q) = (1/g) Σi((yi – yi’)/ yi)² • a probability distribution of queries pw • Mean squared error for the distribution: MSE(pw) =ΣQpw(Q)*SE(Q) • Root mean squared error: RMSE(pw) = √MSE(pw)

  14. Fixed Workload • Special case ? • A given workload are “identical” to the incoming queries. • Problem: FIXEDSAMP Input: R, W, k Output: A sample of k records (with appropriate additional columns) such that MSE(W) is minimized.

  15. Fundamental Regions • Relation R contains 9 records • W consists of 2 queries • Q1 = select records with C values between 10 -50 • Q2 = select records with C values between 40 -70 • These queries divide Relation R into 4 fundamental regions.

  16. Fundamental Regions (cont.)

  17. Fundamental Regions (cont.) • partitioning the records in R into a minimum number of regions R1, R2, …, Rr such that for any region Rj, each query in W selects either all records in Rj or none. • Total number fundamental regions =? Min(2|W|, n)

  18. FIXEDSAMP Solution • Step 1. Identify Fundamental Regions in R • r <= k • r > k • Step 2 Pick Sample Records • Step 3 Assign values to additional columns

  19. LIFTING WORKLOAD TO QUERY DISTRIBUTION • Query Q’ is not identical, Pw(Q’) is high if Q’ is similar to queries in the workload, and Low if not. • Q’ and Q are similar if selected records have significant overlap.

  20. LIFTED WORKLOAD • P{Q}(R’)is the probability of occurrence of any query that selects exactly the set of records R’. • For any given record inside (resp. outside) RQ, the parameter δ (resp. γ) represents the probability that an incoming query will select this record

  21. LIFTED WORKLOAD (Cont.)

  22. LIFTED WORKLOAD (Cont.) δ → 1 and γ → 0: implies that incoming queries are identical to workload queries. δ → 1 and γ → ½: implies that incoming queries are supersets of workload queries. δ → ½ and γ → 0: implies that incoming queries are subsets of workload queries. δ → ½ and γ → ½: implies that incoming queries are unrestricted.

  23. RATIONALE FOR STRATIFIED SAMPLING • A population is partitioned into multiple strata, and samples are selected uniformly from each stratum.

  24. STRATIFIED SAMPLING • a stratified sampling scheme partitions R into r strata containing n1, ., nr records (where Σnj = n), with k1, …, kr records uniformly sampled from each stratum (where Σkj = k). • Q1 = SELECT COUNT(*) FROM R WHERE ProductID IN(3,4); • POPQ is population of query Q • POPQ1 = {0,0,1,1} = non-zero variance • Divided into two strata {0,0} and {1,1}

  25. SOLUTION FOR SINGLE-TABLE SELECTION QUERIES WITH AGGREGATION • Stratification • How many strata • How many records for each stratum • Allocation • Determines how to divide k • Sampling • Forms the final sample of k record

  26. SOLUTION FOR COUNT AGGREGATE • Stratification (lemma 1) • r is not known, divide R into fundamental regions and treat them as strata. • Allocation (lemma 2) • MSE(pW) = Σiwi MSE(p{Q}) • MSE(pW) can be expressed as a weighted sum of the MSE of each query in the workload

  27. SOLUTION FOR COUNT AGGREGATE (Cont.) • For any Q εW, we express MSE(p{Q}) as a function of the kj’s Lemma 3 : ApproxMSE(p{Q}) = Then,

  28. SOLUTION FOR COUNT AGGREGATE (Cont.) • Since we have an (approximate) formula for MSE(p{Q}), we can express MSE(pw) as a function of the kj’svariables. Corollary 1 : MSE(pw) = Σj(αj / kj), where each αj is a function of n1,…,nr, δ, and γ. αj captures the “importance” of a region; it is positively correlated with nj as well as the frequency of queries in the workload that access Rj. • Now we can minimize MSE(pw).

  29. SOLUTION FOR COUNT AGGREGATE (Cont.) Lemma 4: Σj(αj / kj) is minimized subject to Σjkj = k if kj = k * ( sqrt(αj) / Σisqrt(αi) ) • This provides a closed-form and computationally inexpensive solution to the allocation problem since αj depends only on δ, γ and the number of tuples in each fundamental region

  30. SOLUTION FOR SUM AGGREGATE • Stratification • Bucketing technique • Divide fundamental regions with large variance into a set of finer regions. • Treat each region as strata • Allocation • Yj is average (sum) of the aggregate column values of all records in region Rj

  31. SOLUTION FOR SUM AGGREGATE (Cont.) • Each value in the region can be approximated as yj • An approximate formula for MSE(P{Q}) for SUM query Q in W

  32. Pragmatic Issues • Identifying Fundamental Regions • Handling Large Number of Fundamental Regions • Obtaining Integer Solution • Obtaining unbiased error

  33. STRAT ALGORITHM

  34. IMPLEMENTATION AND EXPERIMENTAL RESULT • This experiment compares the STRAT method to other methods. • USAMP – uniform random sampling • WSAMP– weighted sampling • OTLIDX– outlier indexing combined with weighted sampling • CONG– Congressional sampling

  35. COUNT AGGREGATE

  36. SUM AGGREGATE

  37. COUNT AGGREGATE

  38. THANK YOU

More Related