Overcoming Limitations of Sampling for Aggregation Queries

Presentation Transcript


1. Overcoming Limitations of Sampling for Aggregation Queries Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Mayur Datar, Stanford University Rajiv Motwani, Stanford University Vivek Narasayya, Microsoft Research Presented by Ranjan Dash and Daniel Kuang, CSE@UTA

2. Outline • Introduction • Limitations of uniform random sampling • Handling data skew • Handling low selectivity & small groups • Implementation results • Conclusion

3. Introduction • Approximate aggregate query processing using sampling • Uniform sampling performs poorly when • the distribution of the aggregated attribute is skewed • the query has low selectivity • Solutions proposed by the paper • Outlier indexing • Weighted sampling based on workload information • A combined approach significantly reduces the approximation error

4. Limitations of uniform random sampling – an example • Relation R with 10,000 tuples; aggregate on column C • 99% of the tuples (9,900) have value 1 • 1% of the tuples (100) have value 1,000 • A 1% URS of R has sample size 100 • sum estimate = sum of sample × scale factor (100) • If all 100 sampled tuples have value 1, estimated sum = 100 × 100 = 10,000 • If 2 or more sampled tuples have value 1,000, estimated sum ≥ 209,800 • The result is reasonable only when the sample contains exactly one tuple of value 1,000 – probability ≈ 0.37 • With probability ≈ 0.63 we get a large error in the estimate. Actual sum = 109,900
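A minimal Python sketch of this example follows; the population sizes and the 1% sample are taken from the slide, while the simulation loop and variable names are just for illustration.

```python
import random

# Population from the slide: 9,900 tuples of value 1 and 100 tuples of value 1,000.
R = [1] * 9900 + [1000] * 100
actual_sum = sum(R)               # 109,900

n = 100                           # 1% uniform random sample
scale = len(R) / n                # scale factor = 100

random.seed(0)
trials = 10_000
bad_estimates = 0
for _ in range(trials):
    sample = random.sample(R, n)
    estimate = sum(sample) * scale
    # The estimate is reasonable only when exactly one value-1,000 tuple is sampled.
    if sample.count(1000) != 1:
        bad_estimates += 1

print(actual_sum, bad_estimates / trials)   # fraction of bad estimates is close to 0.63
```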

5. Limitations of uniform random sampling – skewed data • Contribution of the outliers to the error in estimating sum via uniform sampling • Relation R of size N with values {y1, y2, …, yN}. Let U be a uniform random sample of the yi's of size n. Then • Actual sum: Y = Σi yi • Estimated sum: Ỹ = (N/n) · Σ_{yi ∈ U} yi • Standard error of the estimate: approximately N·S/√n • where S is the standard deviation of the values in the relation, S = √((1/N) · Σi (yi − ȳ)²), with ȳ the mean of the yi's • If there are outliers in the data then S can be very large; for a given error bound we must increase the sample size n
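Continuing the 10,000-tuple example, a quick check of these quantities (the N·S/√n form of the standard error is the standard one for this estimator; the code is only a sanity check, not from the paper):

```python
import statistics as stats

R = [1] * 9900 + [1000] * 100
N, n = len(R), 100

S = stats.pstdev(R)               # standard deviation of the values, roughly 99.4
std_error = N * S / n ** 0.5      # roughly 99,400 -- the same order as the actual sum

print(round(S, 1), round(std_error))
```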

6. Limitations of uniform random sampling – low selectivity • A selection query partitions the relation into 2 sub-relations • tuples that satisfy the condition – the relevant sub-relation • tuples that do not satisfy the condition • The number of tuples sampled from the relevant sub-relation is proportional to its size • If this relevant sample is small because the query has low selectivity, the error can be large • The success of uniform sampling depends on the size of the relevant sub-relation

7. Handling data skew – outlier indexes • Idea of outlier indexing – identify outlier tuples and store them separately • Example: selection query (Q) with a sum aggregate • Partition R into two sub-relations: outliers RO and non-outliers RNO • Apply the query exactly to the sub-relation of outliers RO • Apply the query to a sample of RNO and extrapolate • Combine the two to get the overall estimated result of the query

8. Handling data skew – outlier indexes – an example • Aggregate query Q: Select sum(sales) from lineitem • Preprocessing • 1. Determine outliers – lineitem_outlier view • 2. Sample non-outliers – sample table lineitem_samp • Query processing • 1. Aggregate outliers – sum(sales) from lineitem_outlier • 2. Aggregate non-outliers – apply the query to the sample table lineitem_samp and extrapolate: sum(sales) from lineitem_samp × scale factor • 3. Combine aggregates – approximate result from RNO + exact result from RO = approximate result for R (a sketch of this step follows below)
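A hedged Python sketch of the query-processing step, assuming the lineitem_outlier and lineitem_samp tables from the slide live in a SQLite database and that the scale factor |RNO| / |sample| was recorded at preprocessing time:

```python
import sqlite3

def approx_sum_sales(conn: sqlite3.Connection, scale_factor: float) -> float:
    cur = conn.cursor()
    # 1. Aggregate outliers exactly.
    outlier_sum = cur.execute(
        "SELECT COALESCE(SUM(sales), 0) FROM lineitem_outlier").fetchone()[0]
    # 2. Aggregate the sample of non-outliers and extrapolate.
    sample_sum = cur.execute(
        "SELECT COALESCE(SUM(sales), 0) FROM lineitem_samp").fetchone()[0]
    # 3. Combine: exact result over R_O + extrapolated result over R_NO.
    return outlier_sum + sample_sum * scale_factor
```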

9. Handling data skew – outlier indexes – selection of outliers • The query error is due solely to the error in estimating the aggregate over the non-outliers • There is additional overhead for maintaining and accessing the outlier index • Goal: an optimal sub-relation RO that leads to the minimum possible sampling error

10. Handling data skew – outlier indexes – selection of outliers • Theorem 2: Consider a multiset R = {y1, y2, …, yN} in sorted order. Let RO ⊆ R with |RO| ≤ τ be such that S(R \ RO) = min{ S(R \ R') : R' ⊆ R, |R'| ≤ τ }. Then there exists some 0 ≤ τ' ≤ τ such that RO = {yi | 1 ≤ i ≤ τ'} ∪ {yi | N + τ' + 1 − τ ≤ i ≤ N} • In words: the subset that minimizes the standard deviation of the remaining set consists of the leftmost τ' elements and the rightmost τ − τ' elements of the multiset R when its elements are arranged in sorted order

11. Handling data skew – outlier indexes – selection of outliers • Algorithm Outlier-Index(R, C, τ): • Let the values in column C of relation R be sorted • For i = 1 to τ + 1, compute E(i) = S({yi, yi+1, …, yN−τ+i−1}) • Let i' be the value of i for which E(i) is minimum • The outlier-index is the set of tuples that correspond to the values {yj | 1 ≤ j ≤ τ'} ∪ {yj | N + τ' + 1 − τ ≤ j ≤ N}, where τ' = i' − 1 (see the sketch below)
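A direct Python rendering of this algorithm; the recomputation of S for each window is naive and quadratic (an illustrative sketch over a plain list of values, not the paper's optimized implementation over relation tuples):

```python
import statistics as stats

def outlier_index(values, tau):
    """Return the <= tau values whose removal minimizes the standard deviation
    of the remaining multiset, per the algorithm Outlier-Index(R, C, tau)."""
    ys = sorted(values)
    N = len(ys)
    # E(i) = S({y_i, ..., y_{N-tau+i-1}}): the stdev of the window that remains after
    # dropping i - 1 values from the left and tau - (i - 1) values from the right.
    best_i, best_e = 1, float("inf")
    for i in range(1, tau + 2):
        e = stats.pstdev(ys[i - 1: N - tau + i - 1])
        if e < best_e:
            best_i, best_e = i, e
    tau_prime = best_i - 1
    # Outliers: the leftmost tau' values and the rightmost tau - tau' values.
    return ys[:tau_prime] + ys[N - (tau - tau_prime):]
```

For instance, on the earlier example population, outlier_index([1] * 9900 + [1000] * 100, 100) picks the 100 value-1,000 tuples as the outliers, leaving a non-outlier set with zero standard deviation.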

12. Handling data skew – outlier indexes – storage allocation • Given sufficient space to store m tuples, how do we allocate storage between the sample and the outlier-index in order to minimize error? • Let S(τ) be the standard deviation of the non-outliers for an optimal outlier-index of size τ • If we allocate τ tuples to the outlier-index and m − τ to the sample, the error is proportional to S(τ)/√(m − τ) • So find the value of τ for which S(τ)/√(m − τ) is minimum (a sketch follows below)
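A small sketch of that sweep in Python (assumes m < N; s_tau reuses the windowing idea from the outlier-index sketch above and is quadratic, which is fine for illustration):

```python
import statistics as stats

def best_tau(values, m):
    """Pick tau in [0, m) minimizing S(tau) / sqrt(m - tau), where S(tau) is the
    stdev of the non-outliers under an optimal outlier-index of size tau."""
    ys = sorted(values)
    N = len(ys)

    def s_tau(tau):
        # Minimum stdev over all windows obtained by removing tau extreme values.
        return min(stats.pstdev(ys[i: N - tau + i]) for i in range(tau + 1))

    return min(range(m), key=lambda tau: s_tau(tau) / (m - tau) ** 0.5)
```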

13. Use of outlier-indexing – extension to other aggregates • Avg aggregate: same as the sum aggregate • Count aggregate: outlier-indexing is not beneficial since there is no variance among the data values • Rank-order-dependent aggregates such as min, max, and median: outlier-indexing is not useful

14. Handling low selectivity and small groups • Use weighted sampling guided by workload information instead of uniform sampling • Take more samples from the subsets that are small in size but more important to the workload

15. Handling low selectivity and small groups – exploit workload info • Workload collection – collect a workload of representative queries from query profilers and query logs • Trace query patterns – parsed information, e.g., the set of selection conditions • Trace tuple usage – usage of specific tuples, e.g., the frequency of access to each tuple • Weighted sampling – perform sampling taking the weights of the tuples into account

16. Handling low selectivity and small groups – weighted sampling • Let the weight of tuple ti be wi • Normalized weight: wi' = wi / Σj wj • Probability of accepting this tuple into the sample: pi = n · wi' • For each tuple included in the sample, store the corresponding pi • Each aggregate computed over this tuple gets multiplied by the inverse of pi • Works well if • the access pattern is local, i.e., most queries access a small part of the relation • the workload is a good representative of the actual queries (a sketch follows below)
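A minimal Python sketch of this acceptance-probability scheme; the helper names are made up, and it assumes each pi = n·wi' is at most 1 (the slide does not discuss capping):

```python
import random

def weighted_sample(tuples, weights, n):
    """Accept tuple t_i with probability p_i = n * w_i' and store p_i with it."""
    total = sum(weights)
    sample = []
    for t, w in zip(tuples, weights):
        p = n * (w / total)          # n times the normalized weight w_i'
        if random.random() < p:
            sample.append((t, p))    # keep p_i so aggregates can be scaled by 1 / p_i
    return sample

def approx_sum(sample, value_of):
    # Each value computed over a sampled tuple is multiplied by the inverse of p_i.
    return sum(value_of(t) / p for t, p in sample)
```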

17. Implementation & experimental results • Parameters • (1) skew of the data (z) – 1, 1.5, 2, 2.5, and 3 • (2) sampling fraction (f) – from 1% to 100% • (3) storage for the outlier-index – 1%, 5%, 10%, and 20% • Comparisons • (1) uniform sampling • (2) weighted sampling • (3) weighted sampling + outlier-indexing

18. Experimental Result - Varying the data skew

19. Experimental Result - Varying the sampling fraction

20. Experimental Result - Varying the selectivity of queries

21. Conclusion and Open Problems • Skew can lead to large errors – outlier-indexing significantly reduces the error • The problem of low selectivity of queries is addressed by weighted sampling based on workload information

22. Questions?
