Dynamic Sample Selection for Approximate Query Processing Brian Babcock, Surajit Chaudhuri & Gautam Das Presented by: Mariam John CSE 6392 02/14/2006
Contents • Introduction • Dynamic Sample Selection • Policies for Sample Selection • Small Group Sampling • Pre-Processing Phase • Summary
Why do we do Approximate Query Processing? • Multi-gigabyte data repositories • Data analysis applications • Data mining • Decision support analysis • Fast query response time • Acceptability of inexact query answers
Problem • Constructing a sample that represents the underlying data well. • Uniform sampling • Non-uniform sampling
Non-uniform sampling • Purpose is to produce more accurate results across a particular set of queries. • Can give more accurate approximate answers than uniform sampling for the queries it targets. • Optimal bias differs from query to query.
Dynamic Sample Selection [Figure: Standard Sampling (DATA reduced to a single SAMPLE) compared with Dynamic Sample Selection (DATA reduced to several SAMPLEs, with a choice among them marked "?")]
Dynamic Sample Selection • Pre-Processing Phase [Figure: pre-processing pipeline with boxes labeled Data, Query Workload, Select Strata, Build Samples, Sample Data, Metadata]
Dynamic Sample Selection • Runtime Phase [Figure: runtime pipeline with boxes labeled Query, Choose Samples, Rewrite Query, Sample Data, Metadata]
Dynamic Sample Selection • How to identify the set of biased samples to be created? • Occurs during the pre-processing phase • How to determine which of the various samples to use to answer a query? • Occurs during the runtime phase • The simplest and most efficient strategy lets the syntax of the incoming query guide the choice of samples.
Small Group Sampling • A specific dynamic sample selection technique that targets aggregate queries with GROUP BY clauses. • Small group sampling approach: • Overall sample: uniform sampling, used to answer for the large groups. • Small group tables: one or more sample tables for the smaller groups.
Small Group Sampling • The set of small groups for a query depends on its: • grouping columns • selection predicates
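A short sketch of why small groups are handled separately. It assumes Bernoulli row sampling at a fixed rate, which the slides do not specify, and `miss_probability` is a hypothetical helper name. It computes the chance that a uniform sample contains no row at all from a group of a given size.

```python
def miss_probability(group_size: int, sample_rate: float) -> float:
    """Chance that a Bernoulli sample at `sample_rate` holds no row
    from a group of `group_size` rows (assumed sampling model)."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be in [0, 1]")
    # Each row is left out independently with probability (1 - sample_rate).
    return (1.0 - sample_rate) ** group_size
```

Under this model the miss probability falls off quickly as the group grows, so only small groups are at real risk of vanishing from a uniform sample.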
Small Group Sampling Idea behind Small Group Sampling: • Determine for which values in each column to create small group tables. • Create small group tables for each column of a table along with the overall sample. • During runtime, choose a subset of sample tables to answer a query most accurately. • Query is rewritten to run against the sample tables instead of the base tables.
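The runtime choice of sample tables might be sketched as follows. This is an illustration only: it assumes the choice is driven purely by the query's grouping columns, and the function name `choose_samples`, the table name `overall_sample`, and the shape of `metadata` (column name mapped to small group table name) are all hypothetical.

```python
def choose_samples(group_by_columns, metadata, overall_name="overall_sample"):
    """Return the sample tables to query: the small group table of every
    grouping column that has one, followed by the overall sample.

    `metadata` is an assumed mapping: column name -> small group table name.
    """
    chosen = [metadata[col] for col in group_by_columns if col in metadata]
    # The overall sample is always included to cover the large groups.
    return chosen + [overall_name]
```

For example, with `metadata = {"state": "sg_state"}`, a query grouped by `state` and `year` would be directed to `sg_state` and `overall_sample`.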
Pre-processing Phase • For every column, identify the rare values within it and create small group tables. • Pre-processing phase produces three outputs: • Overall sample table • Small group tables • Metadata table
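A minimal sketch of the pre-processing phase, under stated assumptions: a value counts as "rare" when it occurs in fewer than `rare_fraction` of the rows, small group tables store their rows in full, and the metadata simply maps each column to its small group table. The slides fix none of these details, and all names here (`preprocess`, `rare_fraction`, the `sg_` prefix) are hypothetical.

```python
import random
from collections import Counter

def preprocess(rows, columns, rare_fraction=0.05, sample_rate=0.1, seed=0):
    """Build the three outputs: overall sample, small group tables, metadata.

    `rows` is a list of dicts; thresholds and naming are assumptions.
    """
    rng = random.Random(seed)
    n = len(rows)
    # Overall sample: a uniform random sample of the base table.
    k = max(1, round(n * sample_rate)) if n else 0
    overall = rng.sample(rows, k)

    small_tables, metadata = {}, {}
    for col in columns:
        counts = Counter(row[col] for row in rows)
        # Assumed rule: a value is rare if it covers too few rows.
        rare = {v for v, c in counts.items() if c < rare_fraction * n}
        if rare:
            name = f"sg_{col}"
            small_tables[name] = [row for row in rows if row[col] in rare]
            metadata[col] = name
    return overall, small_tables, metadata
```

Columns with no rare values get no small group table in this sketch, which keeps the metadata limited to tables that can actually help a query.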
Pre-processing Phase • Rows can appear in multiple sample tables. • A bitmask field on each row identifies the set of sample tables to which the row was added. • This avoids double counting of rows assigned to multiple sample tables.
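One way such a bitmask can prevent double counting is sketched below for a COUNT query with a single grouping column. The layout is an assumption: each row is a dict with an integer `mask` field whose bit i is set when the row belongs to small group table i, small group tables hold their rows in full (weight 1), and overall-sample rows are scaled by `1 / sample_rate`. The function name `approximate_count` is hypothetical.

```python
from collections import defaultdict

def approximate_count(group_col, used_tables, overall, sample_rate):
    """Estimate COUNT(*) GROUP BY group_col from the chosen sample tables.

    used_tables: list of (bit, rows) pairs for the chosen small group tables.
    overall: rows of the overall sample, drawn at `sample_rate`.
    """
    totals = defaultdict(float)
    seen_bits = 0
    for bit, rows in used_tables:
        for row in rows:
            # Skip rows already counted through an earlier small group table.
            if row["mask"] & seen_bits == 0:
                totals[row[group_col]] += 1.0
        seen_bits |= 1 << bit
    for row in overall:
        # Rows covered by a used small group table were counted exactly above.
        if row["mask"] & seen_bits == 0:
            totals[row[group_col]] += 1.0 / sample_rate
    return dict(totals)
```

A row that sits in two small group tables and in the overall sample contributes once, through the first table that is processed.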
Summary • Dynamic Sample Selection • Takes advantage of available disk space • Creates multiple biased sample tables during the pre-processing phase • Picks the best samples at runtime for query processing. • Small Group Sampling • The idea is to treat large and small groups differently • Creates an overall sample table for large groups and a number of small group tables for the rare values in each column.