Presented By Srikanth Vadada Fall 2010 - CSE 6339 23 rd Sep 2010

Dynamic Sample Selection for Approximate Query Processing Brian Babcock, Surajit Chaudhuri, Gautam Das ACM SIGMOD 2003 Presented By Srikanth Vadada Fall 2010 - CSE 6339 23rd Sep 2010

3 Key Terms for the Topic Approximate Query Processing (AQP) Biased Sample Dynamic Selection Goal : Dynamically construct an appropriate Biased Sample for Approximate Query Processing

Why Approximate Query Processing (AQP) ? • Rapid strides in Data Collection & Management Technologies • Resulting in very large Databases • Effective Data Analysis methods – Ongoing Research • Analysis Queries require aggregation or summarizations • Expensive Running Times • Requirements of Analysis Systems (Decision Support Systems) • Short Query Response Time • Exactness of Query Results less important AQP Techniques are the Solution

Why Sampling Techniques ? • Data Analysis Queries • OLAP – Materialized Views of Data Cubes • Building Indexes • Physical Data Design – Use Preprocessing Time & Space • Effective when Query Workload is known in advance • Expensive to build indexes for all possible queries • AQP & Physical Database Design Methods Complementary

Approximate Query Processing (Related Work) • Online Aggregation • Hellerstein J., Haas P., Wang H. Online Aggregation, CM SIGMOD 1997 • Approximate answers are produced during early stages • Gradual refinement until data is processed • Advantages: • No pre-processing required • Allows progressive refinement of answers at runtime • Disadvantages : • Require random disk access (slow). • Requires query processor code change.

Approximate Query Processing (Related Work) • Join Synopses - Sampling based method • Acharya S., Gibbons P. B., Poosala V., Ramaswamy S. Join Synopses for Approximate Query Answering, ACM SIGMOD 1999 • Join Queries – Primary Key Joins • Pre Computation - Join of Fact Tables with Dimension Tables • Disadvantages: • Does not extend to queries that involve non-foreign key joins

Approximate Query Processing (Related Work) • Icicles - Weighted Sampling method • Ganti V., Lee M. L., Ramakrishnan R. ICICLES: Self-tuning Samples for Approximate Query Answering, VLDB, 2000. • Frequency of Tuple Access by Queries in Workload • Disadvantages: • Addresses only low selectivity problem

Dynamic Sample Selection • “Appropriate Biased” Samples • Give accurate approximate results for most queries • Appropriate Biased Sample varies from Query to Query • Previous Sampling Methods vs Dynamic Sample Selection • Previous sampling - Single Sample with fixed bias • Individual Tailored Sample for each query • Creation of subsamples is done offline - Preprocessing • Assembly into an overall sample is done online - Runtime

Effectiveness of Biased Sampling - Example • Database consists of 100 Product Tuples • Product = “Stereo” - 90 Tuples • Product = “TV” – 10 Tuples • Sampling 10 tuples in 2 ways • 10% of the tuples uniformly each with weight 10 • 0% of “Stereo” tuples and 100% “TV” tuples with weight 1 • Query – Count of “TV” Tuples • Which gives a correct answer always? 2nd Sample - Always gives the exact answer 1st Sample - Only if exactly 1 of the TV tuples is chosen

Dynamic Sampling SAMPLE DATA DATA SAMPLE SAMPLE Query Query SAMPLE SAMPLE Static vs Dynamic Sampling Selection StandardSampling

Dynamic Sample Selection Architecture ( 1 / 3) • Pre- Processing Phase • Extra Disk space not taken advantage by Standard Sampling Methods • DSS uses this effectively by creating a large sample containing a family of differently biased subsamples • Step 1 - Examine Data Distribution for creating a set of biased samples – Results into Overlapping Strata • Step 2 – Samples are created with potentially different sampling rates for each stratum. Generate Metadata – Characteristics of each sample Query Workload Sample Data Select Strata Build Sample Data Meta- Data Fig Source : Babcock B., Chaudhuri C. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.

Dynamic Sample Selection Architecture ( 2 / 3) • Runtime Phase • When Queries are issued at Runtime - DSS re-writes the queries to run against sample tables • Appropriate Sample Tables to use are determined by comparing Query with the Meta data • Algorithms for choosing which samples to build in pre processing and samples to use for Query Processing are not described Query Sample Data Choose Samples Rewrite Query Meta- Data Fig Source : Babcock B., Chaudhuri C. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.

Dynamic Sample Selection Architecture ( 3 / 3) • Policies for Sample Selection • Choice of Sample guided by incoming Query Syntax • Examples • Separate sample for each table and choose sample based on the FROMclause • Separate sample for each pre-specified aggregate expression and choose sample based on the SELECTclause

Motivation - Small Group Sampling (1 / 2) • Sampling for answering aggregation queries with “group-bys” • Ex : SELECT DEPTID , COUNT(*) • FROM EMP_DEPT • GROUP BY DEPTID • Uniform Sampling give weight to each GROUP proportional to number of tuples falling in the GROUP • Skewness in data distribution results in Group Oversampling • Heuristic Method for samples satisfying most group by queries • Uniform Sampling - Good providing estimates for larger groups

Motivation - Small Group Sampling ( 2 / 2) • Small Groups are the problem. • Since the Groups are small – Take Advantage • Scan all tuples contributing to small groups • Need to identify the small groups • Small Group Sampling method • Overall sampling using Uniform Sampling- Overall Sample • Rows from small groups – Small group tables • No downsampling for small group tables -100% rows taken Large Groups Small Groups

Small Group Sampling - Example • Example : Aggregation queries with “group-bys” select Age, Income, count(*) from Employee_Tbl group by Age, Income

Small Group Sampling – Illustration (1/3) s_overall • sample – perform uniform sampling on large groups. • Small group tables - one or more sample tables for smaller groups. Pre-Processing Phase: • Create a overall sample s_overall • 2. “Age” Histogram ( Column Index: 0) • r : Base Sampling rate, determines the size of Overall Sample (eg, 30%) • t : small group fraction, max size of each small group table (eg, 20%) s_age Small group table

Small Group Sampling – Illustration (2/3) s_overall Pre-Processing Phase: “Income” Histogram (Column Index: 1) s_income 011 011

Small Group Sampling – Illustration (3/3) Runtime Phase: s_age (Column Index: 0 - 001) s_overall Query Issued - SELECT Age, Income, count(*) FROM Employee_tbl GROUP BY Age, Income SELECT Age, Income, count(*) FROM s_age GROUP BY Age, Income UNION ALL SELECT Age, Income, count(*) FROM s_income GROUP BY Age, Income WHERE Bitmask & 1 = 0 /* ie, 001 . (eg, 010 & 001 = 000 ; 011 & 001 = 1)*/ UNION ALL SELECT Age, Income, count(*) * (100/30) FROM s_overall GROUP BY Age, Income WHERE Bitmask & 3 = 0 /* 3 = 20 + 21ie, 011 (eg, 001 & 011 = 1; 011 & 011 = 1; 010 & 011 = 1)*/ s_income (Column Index: 1 - 010)

Accuracy Metrics (1 / 2) • As many possible groups to be preserved in approximate answer • Error in the aggregate value for each group should be small • Q = Aggregation Query • Let G = {g1, g2, g3, … gn} be the set of n groups in the answer to Q • xi = aggregate value for group gi. • A= Approximate Answer to Q • G’ = {gi1, gi2, gi3, … gim} be the set of m groups in A • x’i1 = aggregate value for group gij.

Accuracy Metrics (2 / 2) Percentage of Groups from Q missed by A Average relative error on Q of A Average squared relative error on Q of A

Experimental Results ( 1 / 2) TPC – H Database , Count & Sum queries , Number of Columns in all Tables =245 RelErr, PctGroups increased for Uniform Sampling & Small Group Sampling Increase was more pronounced for Uniform Sampling Fig Source : Babcock B., Chaudhuri C. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.

Experimental Results ( 2 / 2) Uniform Sampling outperforms Small group sampling at low skews Small group sampling does better at moderate to high skew Speedup decreases as the number of grouping columns increase Fig Source : Babcock B., Chaudhuri C. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.

Conclusions • Dynamic Sample Selection improves on previous AQP Methods • Productively utilizes additional Disk Space • Small Group Sampling targets aggregate queries with group bys • Small Group sampling outperforms other techniques

References • Babcock B., Chaudhuri C. and Das G. Dynamic Sample Selection for Approximate Query Processing • http://crystal.uta.edu/~cse6339/Fall08DBIR.htm • http://crystal.uta.edu/~cse6339/Fall09DBIR.htm

Q & Questions ? A

Presented By Srikanth Vadada Fall 2010 - CSE 6339 23 rd Sep 2010

Presented By Srikanth Vadada Fall 2010 - CSE 6339 23 rd Sep 2010

Presentation Transcript

Fall 2010

Presented by: Service Banking Officers SEP 2010

Fall 2010

Fall 2010

ASA Conference 23 rd October 2010

DISCUSSION CSE 140L 3 rd November 2010

Fall 2010

Sep 13, 2010

Presented by Joseph Guse, Econ 398 Fall 2010

Fall 2010

Fall, 2010

Fall 2010

Srikanth Vadada Kishan Kumar B P Fall 2010 - CSE 5311

Sep 10, 2010

Fall 2010

Presented by: Ashish Chawla CSE 6339 Spring 2009

Fall 2010

Fall 2010

ICRARD 2010 RAPHAEL QUEIROZ Buxton, June 23 rd , 2010.

November 23 rd , 2010 Web Briefing

Fall 2010

Fall 2010