Overcoming Limitations of Sampling for Aggregation Queries

Presentation Transcript


1. Overcoming Limitations of Sampling for Aggregation Queries Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Mayur Datar, Stanford University Rajiv Motwani, Stanford University Vivek Narasayya, Microsoft Research Presented by Ranjan Dash and Daniel Kuang, CSE@UTA

2. Outline • Introduction • Limitations of uniform random sampling • Handling data skew • Handling low selectivity & small groups • Implementation results • Conclusion

3. Introduction • Approximate aggregate query processing using sampling • Uniform sampling performs poorly when • the distribution of the aggregated attribute is skewed • the query has low selectivity • Solutions proposed by the paper • Outlier indexing • Weighted sampling based on workload information • A combined approach significantly reduces the approximation error

4. Limitations of uniform random sampling – an example • Relation R with 10,000 tuples; aggregate on column C • 99% of the tuples (9,900) have value 1 • 1% of the tuples (100) have value 1,000 • A 1% URS of R has sample size 100 • sum estimate = sum of sample × scale factor (100) • If all 100 sampled tuples have value 1, estimated sum = 100 × 100 = 10,000 • If 2 or more sampled tuples have value 1,000, estimated sum ≥ 209,800 • The result is reasonable only when the sample contains exactly one tuple of value 1,000 – probability ≈ 0.37 • With probability ≈ 0.63 we get a large error in the estimate. Actual sum = 109,900
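A minimal Python sketch of this example follows; the population sizes and the 1% sample are taken from the slide, while the simulation loop and variable names are just for illustration.

```python
import random

# Population from the slide: 9,900 tuples of value 1 and 100 tuples of value 1,000.
R = [1] * 9900 + [1000] * 100
actual_sum = sum(R)               # 109,900

n = 100                           # 1% uniform random sample
scale = len(R) / n                # scale factor = 100

random.seed(0)
trials = 10_000
bad_estimates = 0
for _ in range(trials):
    sample = random.sample(R, n)
    estimate = sum(sample) * scale
    # The estimate is reasonable only when exactly one value-1,000 tuple is sampled.
    if sample.count(1000) != 1:
        bad_estimates += 1

print(actual_sum, bad_estimates / trials)   # fraction of bad estimates is close to 0.63
```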

5. Limitations of uniform random sampling – skewed data • Contribution of the outliers to the error in estimating sum via uniform sampling • Relation R of size N with values {y1, y2, …, yN}. Let U be a uniform random sample of the yi's of size n. Then • Actual sum: Y = Σi yi • Estimated sum: Ỹ = (N/n) · Σ_{yi ∈ U} yi • Standard error of the estimate: approximately N·S/√n • where S is the standard deviation of the values in the relation, S = √((1/N) · Σi (yi − ȳ)²), with ȳ the mean of the yi's • If there are outliers in the data then S can be very large; for a given error bound we must increase the sample size n
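Continuing the 10,000-tuple example, a quick check of these quantities (the N·S/√n form of the standard error is the standard one for this estimator; the code is only a sanity check, not from the paper):

```python
import statistics as stats

R = [1] * 9900 + [1000] * 100
N, n = len(R), 100

S = stats.pstdev(R)               # standard deviation of the values, roughly 99.4
std_error = N * S / n ** 0.5      # roughly 99,400 -- the same order as the actual sum

print(round(S, 1), round(std_error))
```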

6. Limitations of uniform random sampling – low selectivity • A selection query partitions the relation into 2 sub-relations • tuples that satisfy the condition – the relevant sub-relation • tuples that do not satisfy the condition • The number of tuples sampled from the relevant sub-relation is proportional to its size • If this relevant sample is small because the query has low selectivity, the error can be large • The success of uniform sampling depends on the size of the relevant sub-relation

7. Handling data skew – outlier indexes • Idea of outlier indexing – identify outlier tuples and store them separately • Example: selection query (Q) with a sum aggregate • Partition R into two sub-relations: outliers RO and non-outliers RNO • Apply the query exactly to the sub-relation of outliers RO • Apply the query to a sample of RNO and extrapolate • Combine the two to get the overall estimated result of the query

8. Handling data skew – outlier indexes – an example • Aggregate query Q: Select sum(sales) from lineitem • Preprocessing • 1. Determine outliers – lineitem_outlier view • 2. Sample non-outliers – sample table lineitem_samp • Query processing • 1. Aggregate outliers – sum(sales) from lineitem_outlier • 2. Aggregate non-outliers – apply the query to the sample table lineitem_samp and extrapolate: sum(sales) from lineitem_samp × scale factor • 3. Combine aggregates – approximate result from RNO + exact result from RO = approximate result for R (a sketch of this step follows below)
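A hedged Python sketch of the query-processing step, assuming the lineitem_outlier and lineitem_samp tables from the slide live in a SQLite database and that the scale factor |RNO| / |sample| was recorded at preprocessing time:

```python
import sqlite3

def approx_sum_sales(conn: sqlite3.Connection, scale_factor: float) -> float:
    cur = conn.cursor()
    # 1. Aggregate outliers exactly.
    outlier_sum = cur.execute(
        "SELECT COALESCE(SUM(sales), 0) FROM lineitem_outlier").fetchone()[0]
    # 2. Aggregate the sample of non-outliers and extrapolate.
    sample_sum = cur.execute(
        "SELECT COALESCE(SUM(sales), 0) FROM lineitem_samp").fetchone()[0]
    # 3. Combine: exact result over R_O + extrapolated result over R_NO.
    return outlier_sum + sample_sum * scale_factor
```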

9. Handling data skew – outlier indexes – selection of outliers • The query error is due solely to the error in estimating the aggregate over the non-outliers • There is additional overhead for maintaining and accessing the outlier index • Goal: an optimal sub-relation RO that leads to the minimum possible sampling error

10. Handling data skew – outlier indexes – selection of outliers • Theorem 2: Consider a multiset R = {y1, y2, …, yN} in sorted order. Let RO ⊆ R with |RO| ≤ τ be such that S(R \ RO) = min{ S(R \ R') : R' ⊆ R, |R'| ≤ τ }. Then there exists some 0 ≤ τ' ≤ τ such that RO = {yi | 1 ≤ i ≤ τ'} ∪ {yi | N + τ' + 1 − τ ≤ i ≤ N} • In words: the subset that minimizes the standard deviation of the remaining set consists of the leftmost τ' elements and the rightmost τ − τ' elements of the multiset R when its elements are arranged in sorted order

11. Handling data skew – outlier indexes – selection of outliers • Algorithm Outlier-Index(R, C, τ): • Let the values in column C of relation R be sorted • For i = 1 to τ + 1, compute E(i) = S({yi, yi+1, …, yN−τ+i−1}) • Let i' be the value of i for which E(i) is minimum • The outlier-index is the set of tuples that correspond to the values {yj | 1 ≤ j ≤ τ'} ∪ {yj | N + τ' + 1 − τ ≤ j ≤ N}, where τ' = i' − 1 (see the sketch below)
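A direct Python rendering of this algorithm; the recomputation of S for each window is naive and quadratic (an illustrative sketch over a plain list of values, not the paper's optimized implementation over relation tuples):

```python
import statistics as stats

def outlier_index(values, tau):
    """Return the <= tau values whose removal minimizes the standard deviation
    of the remaining multiset, per the algorithm Outlier-Index(R, C, tau)."""
    ys = sorted(values)
    N = len(ys)
    # E(i) = S({y_i, ..., y_{N-tau+i-1}}): the stdev of the window that remains after
    # dropping i - 1 values from the left and tau - (i - 1) values from the right.
    best_i, best_e = 1, float("inf")
    for i in range(1, tau + 2):
        e = stats.pstdev(ys[i - 1: N - tau + i - 1])
        if e < best_e:
            best_i, best_e = i, e
    tau_prime = best_i - 1
    # Outliers: the leftmost tau' values and the rightmost tau - tau' values.
    return ys[:tau_prime] + ys[N - (tau - tau_prime):]
```

For instance, on the earlier example population, outlier_index([1] * 9900 + [1000] * 100, 100) picks the 100 value-1,000 tuples as the outliers, leaving a non-outlier set with zero standard deviation.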

12. Handling data skew – outlier indexes – storage allocation • Given sufficient space to store m tuples, how do we allocate storage between the sample and the outlier-index in order to minimize error? • Let S(τ) be the standard deviation of the non-outliers for an optimal outlier-index of size τ • If we allocate τ tuples to the outlier-index and m − τ to the sample, the error is proportional to S(τ)/√(m − τ) • So find the value of τ for which S(τ)/√(m − τ) is minimum (a sketch follows below)
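A small sketch of that sweep in Python (assumes m < N; s_tau reuses the windowing idea from the outlier-index sketch above and is quadratic, which is fine for illustration):

```python
import statistics as stats

def best_tau(values, m):
    """Pick tau in [0, m) minimizing S(tau) / sqrt(m - tau), where S(tau) is the
    stdev of the non-outliers under an optimal outlier-index of size tau."""
    ys = sorted(values)
    N = len(ys)

    def s_tau(tau):
        # Minimum stdev over all windows obtained by removing tau extreme values.
        return min(stats.pstdev(ys[i: N - tau + i]) for i in range(tau + 1))

    return min(range(m), key=lambda tau: s_tau(tau) / (m - tau) ** 0.5)
```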

13. Use of outlier-indexing – extension to other aggregates • Avg aggregate: same as the sum aggregate • Count aggregate: outlier-indexing is not beneficial since there is no variance among the data values • Rank-order-dependent aggregates such as min, max, and median: outlier-indexing is not useful

14. Handling low selectivity and small groups • Use weighted sampling guided by workload information instead of uniform sampling • Take more samples from the subsets that are small in size but more important to the workload

15. Handling low selectivity and small groups – exploit workload info • Workload collection – collect a workload of representative queries from query profilers and query logs • Trace query patterns – parsed information, e.g., the set of selection conditions • Trace tuple usage – usage of specific tuples, e.g., the frequency of access to each tuple • Weighted sampling – perform sampling taking the weights of the tuples into account

16. Handling low selectivity and small groups – weighted sampling • Let the weight of tuple ti be wi • Normalized weight: wi' = wi / Σj wj • Probability of accepting this tuple into the sample: pi = n · wi' • For each tuple included in the sample, store the corresponding pi • Each aggregate computed over this tuple gets multiplied by the inverse of pi • Works well if • the access pattern is local, i.e., most queries access a small part of the relation • the workload is a good representative of the actual queries (a sketch follows below)
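A minimal Python sketch of this acceptance-probability scheme; the helper names are made up, and it assumes each pi = n·wi' is at most 1 (the slide does not discuss capping):

```python
import random

def weighted_sample(tuples, weights, n):
    """Accept tuple t_i with probability p_i = n * w_i' and store p_i with it."""
    total = sum(weights)
    sample = []
    for t, w in zip(tuples, weights):
        p = n * (w / total)          # n times the normalized weight w_i'
        if random.random() < p:
            sample.append((t, p))    # keep p_i so aggregates can be scaled by 1 / p_i
    return sample

def approx_sum(sample, value_of):
    # Each value computed over a sampled tuple is multiplied by the inverse of p_i.
    return sum(value_of(t) / p for t, p in sample)
```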

17. Implementation & experimental results • Parameters • (1) skew of the data (z) – 1, 1.5, 2, 2.5, and 3 • (2) sampling fraction (f) – from 1% to 100% • (3) storage for the outlier-index – 1%, 5%, 10%, and 20% • Comparisons • (1) uniform sampling • (2) weighted sampling • (3) weighted sampling + outlier-indexing

18. Experimental Result - Varying the data skew

19. Experimental Result - Varying the sampling fraction

20. Experimental Result - Varying the selectivity of queries

21. Conclusion and Open Problems • Skew can lead to large errors – outlier-indexing significantly reduces the error • The problem of low selectivity of queries is addressed by weighted sampling based on workload information

22. Questions?
