Dynamic Sample Selection for Efficient Approximate Query Processing
This presentation covers dynamic sample selection for approximate query processing, motivated by the need for fast data analysis over multi-gigabyte repositories. It contrasts uniform and non-uniform sampling, noting that non-uniform (biased) samples can give more accurate results for a particular set of queries. The method has a pre-processing phase that builds multiple biased sample tables and a runtime phase that selects the best samples for each query. One specific strategy, small group sampling, targets aggregate queries with group-bys, keeping response times short while holding the inexactness of answers to an acceptable level.
Dynamic Sample Selection for Approximate Query Processing Brian Babcock, Surajit Chaudhuri & Gautam Das Presented by: Mariam John CSE 6392 02/14/2006
Contents • Introduction • Dynamic Sample Selection • Policies for Sample Selection • Small Group Sampling • Pre-Processing Phase • Summary
Why do we do Approximate Query Processing? • Multi-gigabyte data repositories • Data Analysis Application • Data mining • Decision Support Analysis • Fast query response time • Acceptability of inexact query response
Problem • Constructing an optimal sample that represents the underlying data well. • Uniform sampling • Non-uniform sampling
Non-uniform sampling • Purpose is to produce more accurate results across a particular set of queries. • Produces less accurate results than uniform sampling for queries outside that set. • Optimal bias differs from query to query.
Dynamic Sample Selection [Figure: Standard Sampling (DATA → one SAMPLE) versus Dynamic Sample Selection (DATA → several SAMPLEs, with "?" marking the choice among them)]
Dynamic Sample Selection • Pre-Processing Phase [Figure: boxes labeled Query Workload, Sample Data, Select Strata, Build Sample, Data, Meta-Data]
Dynamic Sample Selection • Runtime Phase [Figure: boxes labeled Query, Sample Data, Choose Samples, Rewrite Query, Meta-Data]
Dynamic Sample Selection • How to identify the set of biased samples to be created? • Occurs during pre-processing phase • How to determine which of the various samples to use to answer a query? • Occurs during runtime phase • The simplest and most efficient strategy lets the syntax of the incoming query guide the choice of samples.
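A syntax-guided choice of samples can be sketched in Python as follows. This is only an illustration, not the authors' implementation: the function name `choose_samples`, the table names `overall_sample` and `sg_<column>`, and the regular expression used to find the GROUP BY clause are all assumptions made for the example, and it looks only at grouping columns, not at selection predicates.

```python
import re

def choose_samples(sql, columns_with_small_group_table):
    """Pick sample tables from the query text alone (hypothetical naming)."""
    # Assumed, simplified parsing: grab the column list after GROUP BY.
    match = re.search(
        r"group\s+by\s+(.+?)(?:\s+(?:having|order\s+by|limit)\b|;|$)",
        sql,
        re.IGNORECASE | re.DOTALL,
    )
    group_cols = [c.strip() for c in match.group(1).split(",")] if match else []
    # Always use the overall sample; add one small group table per
    # grouping column that has one.
    chosen = ["overall_sample"]
    chosen += [f"sg_{c}" for c in group_cols if c in columns_with_small_group_table]
    return chosen
```

For a query grouped by `state, kind` where only `state` has a small group table, the sketch picks the overall sample plus `sg_state`.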
Small Group Sampling • Specific dynamic sample selection technique which targets aggregate queries with group-bys. • Small group sampling approach: • Overall sample: perform uniform sampling on large groups. • Small group tables: one or more sample tables for smaller groups.
Small group Sampling • Set of small groups depends on: • grouping columns • selection predicates
Small Group Sampling Idea behind Small Group Sampling: • Determine for which values in each column to create small group tables. • Create small group tables for each column of a table along with the overall sample. • During runtime, choose a subset of sample tables to answer a query most accurately. • Query is rewritten to run against the sample tables instead of the base tables.
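How a rewritten query might combine the two kinds of sample table can be sketched in Python for a simple COUNT(*) with one grouping column. The function name `approx_group_count` and its parameters are hypothetical, and the sketch assumes that a small group table holds every row of its rare groups, which the slides do not state.

```python
from collections import Counter

def approx_group_count(overall_sample, sample_rate, small_group_table, group_col):
    """Estimate SELECT group_col, COUNT(*) ... GROUP BY group_col."""
    # Assumption: the small group table is complete for its groups,
    # so those groups are counted exactly.
    exact = Counter(row[group_col] for row in small_group_table)
    # Each row of the uniform overall sample stands for 1/sample_rate rows.
    # Groups already answered from the small group table are skipped.
    estimate = Counter()
    for row in overall_sample:
        if row[group_col] not in exact:
            estimate[row[group_col]] += 1 / sample_rate
    result = dict(estimate)
    result.update(exact)
    return result
```

Large groups get a scaled estimate from the overall sample, while small groups, which a uniform sample might miss entirely, are answered from their own table.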
Pre-processing Phase • For every column, identify the rare values within it and create small group tables. • Pre-processing phase produces three outputs: • Overall sample table • Small group tables • Metadata table
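The pre-processing phase above can be sketched in Python under some stated assumptions. The function name `preprocess`, the parameters `sample_rate` and `rare_fraction`, and the rule that a value is rare when it covers less than `rare_fraction` of the rows are all illustrative choices, not details given in the slides.

```python
import random
from collections import Counter

def preprocess(rows, columns, sample_rate=0.1, rare_fraction=0.05, seed=0):
    """Build an overall sample, per-column small group tables, and metadata."""
    rng = random.Random(seed)
    n = len(rows)
    # Overall sample: uniform random sample of the base table.
    overall = [row for row in rows if rng.random() < sample_rate]

    small_group_tables = {}
    metadata = {}
    for col in columns:
        counts = Counter(row[col] for row in rows)
        # Assumed rarity rule: a value is rare if it covers less than
        # rare_fraction of all rows.
        rare_values = {v for v, c in counts.items() if c < rare_fraction * n}
        if rare_values:
            # Assumption: keep every row with a rare value (no sampling).
            table = [row for row in rows if row[col] in rare_values]
            small_group_tables[col] = table
            metadata[col] = {"rare_values": rare_values, "rows": len(table)}
    return overall, small_group_tables, metadata
```

Columns with no rare values get no small group table in this sketch, so the metadata lists only the columns for which a table exists.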
Pre-processing phase • Rows can appear in multiple sample tables. • Bitmask field is used to identify the set of sample tables to which a row was added. • Avoids double counting of rows assigned to multiple sample tables.
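One way the bitmask could prevent double counting is sketched below in Python. The helper names `assign_bitmasks` and `rows_to_read`, and the rule that a row is read only from the first chosen table containing it, are assumptions made for illustration.

```python
def assign_bitmasks(rows, rare_values_by_column):
    """Tag each row with a bitmask of the small group tables it belongs to."""
    columns = list(rare_values_by_column)  # bit i corresponds to columns[i]
    tagged = []
    for row in rows:
        mask = 0
        for i, col in enumerate(columns):
            if row[col] in rare_values_by_column[col]:
                mask |= 1 << i
        tagged.append((row, mask))
    return columns, tagged

def rows_to_read(tagged, columns, used_columns):
    """For each used table, keep only rows not covered by an earlier one."""
    bit = {col: 1 << i for i, col in enumerate(columns)}
    seen_bits = 0
    result = {}
    for col in used_columns:
        result[col] = [
            row for row, mask in tagged
            if (mask & bit[col]) and not (mask & seen_bits)
        ]
        seen_bits |= bit[col]
    return result
```

A row that is rare in two columns appears in both small group tables, but its mask lets the second table's query filter it out, so it contributes to the answer once.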
Summary • Dynamic Sample Selection • Takes advantage of available disk space • Creates multiple biased sample tables during the pre-processing phase • Picks the best samples at runtime for query processing. • Small Group Sampling • Notion is to treat large and small groups differently • Creates an overall sample table for large groups and a number of small group tables for the rare values in each column.