This paper addresses the challenges of online index selection in databases facing shifting workloads. Traditional offline methods struggle when query loads vary dynamically. We present COLT (Continuous On-Line Tuning), a system that adapts to recent query trends and efficiently manages index selection within storage budgets. By continuously analyzing query loads and leveraging a "what-if" optimizer, COLT streamlines the indexing process to enhance performance. A prototype implementation in PostgreSQL demonstrates its effectiveness, promising future applicability in various domains.
Automatic Index Selection for Shifting Workloads
Karl Schnaitter and Neoklis Polyzotis (UC Santa Cruz)
Serge Abiteboul (INRIA and University of Paris 11)
Tova Milo (University of Tel Aviv)
On-line Index Selection
• A heavily used database needs indexes
• But an evolving workload makes it tough
  • Popularity shifts over time
  • Applications come and go
• The problem: on-line index selection
  Maintain a dynamic set of indexes that improve the performance of current queries
Attempted On-line Solution
• Can off-line techniques be adapted?
• Off-line algorithms need a fixed query load
  • This is not given in the on-line problem
• We could try a loop:
  • Gather query workload W (no obvious implementation)
  • Optimize indexes for W (can be very slow; creation cost not considered)
Our Solution
• COLT: Continuous On-Line Tuning
• A system for on-line index selection
• Features:
  • Continuously tracks the query load
  • Evaluates indexes with a what-if optimizer
  • Adaptively controls overhead
  • Selects indexes based on recent trends
• Prototype implementation in PostgreSQL
Problem Assumptions
• We make the following assumptions:
  • Data is in a relational database
  • Workload consists of SELECT statements
  • Restricted to single-column indexes
  • Given a budget for index storage
• The problem still has major challenges
• Lifting these assumptions is future work
System Architecture
[Diagram: each query flows through the Parser to the Optimizer and Executor; COLT sends indexes to profile to the Optimizer's What-if Interface and receives index benefits along with each query & plan; between epochs, COLT issues CREATE INDEX / DROP INDEX statements against the Database. The epoch loop is sketched below.]
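The control flow above is only drawn at a high level. A minimal Python sketch of the epoch loop, assuming an epoch counts queries and using hypothetical names (`whatif_benefit`, `choose_indexes_to_profile`, `reorganize_sets`) that do not come from the COLT prototype:

```python
# Minimal sketch of COLT's epoch-driven loop. All names here are
# illustrative; the real prototype lives inside PostgreSQL.

EPOCH_LENGTH = 10  # per the parameter slide; assumed to count queries

def whatif_benefit(query, index):
    """Stand-in for a what-if optimizer call: the estimated cost saving
    for `query` if `index` existed. A real system asks the optimizer."""
    return 0.0

def run_colt(query_stream, choose_indexes_to_profile, reorganize_sets):
    measurements = []  # (query, index, benefit) gathered this epoch
    for seen, query in enumerate(query_stream, start=1):
        # Per query: profile a few selected indexes via what-if calls.
        for index in choose_indexes_to_profile(query):
            measurements.append((query, index, whatif_benefit(query, index)))
        if seen % EPOCH_LENGTH == 0:
            # Between epochs: revise the cold/hot/materialized sets and
            # issue CREATE INDEX / DROP INDEX for any changes.
            reorganize_sets(measurements)
            measurements.clear()
```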
COLT Internal Organization
[Diagram: COLT maintains recent queries and candidate indexes; it sends indexes to profile to the What-if Interface and receives index benefits for each query & plan]
Organizing the Queries
• Similar queries are grouped in clusters
• Each query is placed in a cluster on arrival
• Aggregate benefit statistics for each cluster (sketch below)
  • Confidence intervals of past measurements
[Diagram: five example queries partitioned into two clusters]
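The slide leaves the clustering criterion and the statistics unspecified. A minimal sketch, assuming queries are clustered by the set of columns they reference and that the confidence interval is a plain normal approximation (both assumptions, not COLT's actual formulas):

```python
import math
from collections import defaultdict

def signature(query_columns):
    """Assumption: queries referencing the same column set share a cluster."""
    return frozenset(query_columns)

class ClusterStats:
    """Running benefit statistics for one cluster of similar queries."""
    def __init__(self):
        self.n, self.total, self.total_sq = 0, 0.0, 0.0

    def add(self, benefit):
        self.n += 1
        self.total += benefit
        self.total_sq += benefit * benefit

    def confidence_interval(self, z=1.96):
        """Mean benefit +/- z standard errors (normal approximation)."""
        mean = self.total / self.n
        var = max(self.total_sq / self.n - mean * mean, 0.0)
        half = z * math.sqrt(var / self.n)
        return mean - half, mean + half

clusters = defaultdict(ClusterStats)
clusters[signature(["orders.date"])].add(12.5)   # record one measurement
lo, hi = clusters[signature(["orders.date"])].confidence_interval()
```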
Organizing the Indexes
[Diagram: a new candidate enters the Cold Set; unpromising candidates are discarded, promising ones move up toward materialization. A data-structure sketch follows.]
Cold Set
• Relevant, but not promising candidates
• Benefit measured with a crude metric
Hot Set
• Promising candidates for materialization
• Profiled accurately with what-if calls
Materialized Set
• Indexes materialized by COLT
• Disk storage must fit in the budget
• Profiled with reverse what-if calls
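A minimal sketch of the three containers and the moves between them; the method names and the budget check are illustrative, not taken from the prototype:

```python
class IndexSets:
    """Cold/hot/materialized organization; the policies that decide the
    moves run at epoch boundaries (see the index-selection slides)."""

    def __init__(self, storage_budget):
        self.cold = set()          # relevant but not promising
        self.hot = set()           # profiled with what-if calls
        self.materialized = set()  # actually created on disk
        self.storage_budget = storage_budget
        self.used_storage = 0

    def add_candidate(self, index):
        self.cold.add(index)

    def promote_to_hot(self, index):
        self.cold.discard(index)
        self.hot.add(index)

    def materialize(self, index, size):
        # The materialized set must fit in the storage budget.
        if self.used_storage + size > self.storage_budget:
            raise ValueError("storage budget exceeded")
        self.hot.discard(index)
        self.materialized.add(index)
        self.used_storage += size

    def discard(self, index):
        self.cold.discard(index)
        self.hot.discard(index)
```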
Key Challenges
Challenge: Selective Profiling
• Which hot and materialized indexes are profiled w.r.t. the current query?
Challenge: Index Selection
• How are index candidates selected for the cold, hot, and materialized sets?
Selective Profiling (1/2)
[Diagram: the query is mapped to a cluster (C1–C4); indexes from the Cold, Hot, and Materialized sets that are relevant to that cluster feed the choice of indexes to profile]
• Take a random sample of the relevant indexes
• Focus on indexes with uncertain benefit in the cluster (sketch below)
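One way to combine "random sample" with "focus on uncertain benefit" is to weight each relevant index by the width of its benefit confidence interval in the query's cluster; this weighting is an assumption, not the paper's exact rule:

```python
import random

def choose_indexes_to_profile(relevant, interval_width, k):
    """relevant: indexes (hot/materialized) tied to the query's cluster
    interval_width: index -> width of its benefit confidence interval
    k: what-if calls available for this query"""
    pool = [(ix, interval_width.get(ix, 1.0) + 1e-9) for ix in relevant]
    chosen = []
    # Weighted sampling without replacement: uncertain indexes first.
    for _ in range(min(k, len(pool))):
        total = sum(w for _, w in pool)
        r = random.uniform(0.0, total)
        acc = 0.0
        for i, (ix, w) in enumerate(pool):
            acc += w
            if acc >= r:
                chosen.append(ix)
                pool.pop(i)
                break
    return chosen
```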
Selective Profiling (2/2)
• Profiling budget: max what-if calls per epoch
• Budget is adjusted after each epoch (sketch below)
  • Set proportional to the potential of hot indexes
  • Potential based on optimistic assumptions
• Result:
  • Stable workload: profiling is suspended
  • Shifting workload: profiling intensifies
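A sketch of the budget rule, assuming "potential" sums each hot index's optimistic (upper-bound) gain over the current configuration; the proportionality constant and the cap are placeholders:

```python
MAX_WHATIF_PER_EPOCH = 20  # cap from the parameter slide

def next_profiling_budget(optimistic_gains, scale=0.5):
    """optimistic_gains: per hot index, an upper bound on the extra
    benefit it could add beyond the materialized set (assumed given)."""
    potential = sum(max(g, 0.0) for g in optimistic_gains)
    # Stable workload: potential ~ 0, so profiling is suspended.
    # Shifting workload: large potential, so profiling intensifies.
    return min(MAX_WHATIF_PER_EPOCH, round(scale * potential))
```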
Index Selection (1/3)
[Diagram: a new candidate enters the Cold Set]
• How are new candidates chosen?
• Use a crude benefit metric (sketch below)
  • Cost of index access vs. sequential scan
  • Approximate and cheap to compute
• When each query arrives:
  • Compute the crude benefit of relevant indexes
  • Indexes with benefit > 0 become candidates
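A sketch of a crude metric in the spirit of "index access vs. sequential scan", using a textbook page-I/O model; the concrete formula is an assumption:

```python
def crude_benefit(table_pages, rows, selectivity):
    """Crude benefit in page I/Os; positive means the index looks useful."""
    seq_scan_cost = table_pages
    matching_rows = selectivity * rows
    # Pessimistic unclustered-index model: ~1 page fetch per matching row.
    index_access_cost = matching_rows
    return seq_scan_cost - index_access_cost

# Example: 10,000-page table, 1,000,000 rows, 0.1% selectivity
# -> 10000 - 1000 = 9000 pages saved, so the index becomes a candidate.
print(crude_benefit(10_000, 1_000_000, 0.001))
```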
Index Selection (2/3)
[Diagram: indexes promoted from the Cold Set to the Hot Set]
• How are hot indexes chosen?
• At the end of each epoch:
  • Get the crude benefit of hot and cold indexes
  • Find the cut-off point for entering the hot set
[Diagram: indexes ordered by crude benefit, with the COLD/HOT cut-off point derived from a two-cluster model; see the sketch below]
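The slides do not define the two-cluster model. One plausible reading, sketched here, is a one-dimensional 2-means split of the crude-benefit values, with the cut-off halfway between the two cluster means:

```python
def two_cluster_cutoff(benefits, iters=50):
    """1-D 2-means over crude benefits; returns the hot-set cut-off."""
    lo, hi = min(benefits), max(benefits)
    if lo == hi:
        return lo
    m1, m2 = lo, hi  # initialize the two centers at the extremes
    for _ in range(iters):
        c1 = [b for b in benefits if abs(b - m1) <= abs(b - m2)]
        c2 = [b for b in benefits if abs(b - m1) > abs(b - m2)]
        n1 = sum(c1) / len(c1) if c1 else m1
        n2 = sum(c2) / len(c2) if c2 else m2
        if (n1, n2) == (m1, m2):
            break
        m1, m2 = n1, n2
    return (m1 + m2) / 2.0

benefits = [0.5, 0.7, 9.0, 10.0, 0.2]
cut = two_cluster_cutoff(benefits)                  # ~5.0 for this data
hot_candidates = [b for b in benefits if b > cut]   # [9.0, 10.0]
```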
Index Selection (3/3)
[Diagram: observed benefit over time extrapolated to a predicted benefit; indexes promoted from the Hot Set to the Materialized Set]
• How is the materialized set chosen? (sketch below)
• At the end of each epoch:
  • Predict the benefit of hot and materialized indexes
  • The cost of materialization is discounted
  • Select indexes to maximize total benefit
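Maximizing total benefit under the storage budget is a 0/1 knapsack. The sketch below discounts each hot index's predicted benefit by its creation cost (already-materialized indexes pay none) and solves the knapsack by dynamic programming over coarse storage units; the solver choice is an assumption, not necessarily COLT's:

```python
def choose_materialized(candidates, budget_units):
    """candidates: (name, size_units, predicted_benefit, creation_cost);
    creation_cost is 0 for indexes that are already materialized.
    Returns the set of names maximizing net benefit within the budget."""
    items = [(n, s, b - c) for n, s, b, c in candidates if b - c > 0]
    # dp[u] = (best net benefit, chosen names) using <= u storage units
    dp = [(0.0, frozenset()) for _ in range(budget_units + 1)]
    for name, size, value in items:
        for u in range(budget_units, size - 1, -1):
            better = dp[u - size][0] + value
            if better > dp[u][0]:
                dp[u] = (better, dp[u - size][1] | {name})
    return dp[budget_units][1]

# 10-unit budget: keep idx_a (no creation cost) and build idx_c,
# rather than building the larger idx_b.
picked = choose_materialized(
    [("idx_a", 4, 8.0, 0.0), ("idx_b", 7, 9.0, 3.0), ("idx_c", 5, 7.0, 2.0)],
    budget_units=10)   # -> frozenset({'idx_a', 'idx_c'})
```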
Performance of COLT
[Chart: execution time in Phases 1–4, broken into minimum time, COLT extra time, and off-line extra time]
• Experimented with a 4-phase workload
• COLT adapts to each phase
• Off-line chooses the best static index set
Overhead of Profiling
[Chart: profiling overhead across Phases 1–4]
• Overhead peaks at the start of each phase
• Decreases when the system is well tuned
• Average < 1 what-if call per query
Closing Remarks
• On-line index selection is harder than off-line
  • Efficiency is a higher priority
  • Essentially need to guess the future
  • Index creation is an issue
• COLT is our solution
  • Solves a constrained problem
  • Potentially extensible to other domains
    • Only some components would change
Setting System Parameters
• Epoch length = 10
• What-if limit = 20
  • Or less if very worried about overhead
• Averaging window
  • Used when predicting future benefit
  • Indicates the number of epochs that give a good picture of the query distribution
  • Not easy to set
These knobs are collected in the sketch below.
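The same knobs as one configuration object; the field names and the example window value are illustrative, not from the prototype:

```python
from dataclasses import dataclass

@dataclass
class ColtConfig:
    epoch_length: int = 10      # per the slide; assumed to count queries
    whatif_limit: int = 20      # max what-if calls per epoch
    averaging_window: int = 4   # epochs used to predict future benefit;
                                # example value, workload-dependent
```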
Other On-line Systems
• Bruno and Chaudhuri, ICDE 2007
  • Avoids extra optimizer calls completely
  • Heuristics to capture index interaction
  • Very different mechanisms for index selection
• Sattler, Schallehn, and Geist, IDEAS 2004
  • More similarities to COLT
  • Different prediction of future benefit
  • No control on profiling overhead
Performance with Noise
[Chart: relative performance, 20%–100% axis]
• The worst-case scenario for COLT
• Concentrated bursts of noise queries
• Performance loss in some cases