This paper addresses the challenges of online index selection in databases facing shifting workloads. Traditional offline methods struggle when query loads vary dynamically. We present COLT (Continuous On-Line Tuning), a system that adapts to recent query trends and efficiently manages index selection within storage budgets. By continuously analyzing query loads and leveraging a "what-if" optimizer, COLT streamlines the indexing process to enhance performance. A prototype implementation in PostgreSQL demonstrates its effectiveness, promising future applicability in various domains.
Automatic Index Selection for Shifting Workloads
Karl Schnaitter and Neoklis Polyzotis (UC Santa Cruz)
Serge Abiteboul (INRIA and University of Paris 11)
Tova Milo (University of Tel Aviv)
On-line Index Selection
• A heavily used database needs indexes
• But an evolving workload makes it tough
  • Popularity shifts over time
  • Applications come and go
• The problem: on-line index selection
  Maintain a dynamic set of indexes that improve the performance of current queries
Attempted On-line Solution
• Can off-line techniques be adapted?
• Off-line algorithms need a fixed query load
  • This is not given in the on-line problem
• We could try a loop:
  • Gather query workload W (no obvious implementation)
  • Optimize indexes for W (can be very slow; creation cost not considered)
Our Solution
• COLT: Continuous On-Line Tuning
• A system for on-line index selection
• Features:
  • Continuously tracks the query load
  • Evaluates indexes with a what-if optimizer
  • Adaptively controls overhead
  • Selects indexes based on recent trends
• Prototype implementation in PostgreSQL
Problem Assumptions
• We make the following assumptions:
  • Data is in a relational database
  • Workload consists of SELECT statements
  • Restricted to single-column indexes
  • Given a budget for index storage
• The problem still has major challenges
• Lifting these assumptions is future work
System Architecture
[Diagram: each query flows through the Parser to the Optimizer and Executor; COLT sends indexes to profile to the Optimizer's What-if Interface and receives index benefits along with each query & plan; between epochs, COLT issues CREATE INDEX / DROP INDEX statements against the Database. The epoch loop is sketched below.]
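The control flow above is only drawn at a high level. A minimal Python sketch of the epoch loop, assuming an epoch counts queries and using hypothetical names (`whatif_benefit`, `choose_indexes_to_profile`, `reorganize_sets`) that do not come from the COLT prototype:

```python
# Minimal sketch of COLT's epoch-driven loop. All names here are
# illustrative; the real prototype lives inside PostgreSQL.

EPOCH_LENGTH = 10  # per the parameter slide; assumed to count queries

def whatif_benefit(query, index):
    """Stand-in for a what-if optimizer call: the estimated cost saving
    for `query` if `index` existed. A real system asks the optimizer."""
    return 0.0

def run_colt(query_stream, choose_indexes_to_profile, reorganize_sets):
    measurements = []  # (query, index, benefit) gathered this epoch
    for seen, query in enumerate(query_stream, start=1):
        # Per query: profile a few selected indexes via what-if calls.
        for index in choose_indexes_to_profile(query):
            measurements.append((query, index, whatif_benefit(query, index)))
        if seen % EPOCH_LENGTH == 0:
            # Between epochs: revise the cold/hot/materialized sets and
            # issue CREATE INDEX / DROP INDEX for any changes.
            reorganize_sets(measurements)
            measurements.clear()
```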
COLT Internal Organization
[Diagram: COLT maintains recent queries and candidate indexes; it sends indexes to profile to the What-if Interface and receives index benefits for each query & plan]
Organizing the Queries
• Similar queries are grouped in clusters
• Each query is placed in a cluster on arrival
• Aggregate benefit statistics for each cluster (sketch below)
  • Confidence intervals of past measurements
[Diagram: five example queries partitioned into two clusters]
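The slide leaves the clustering criterion and the statistics unspecified. A minimal sketch, assuming queries are clustered by the set of columns they reference and that the confidence interval is a plain normal approximation (both assumptions, not COLT's actual formulas):

```python
import math
from collections import defaultdict

def signature(query_columns):
    """Assumption: queries referencing the same column set share a cluster."""
    return frozenset(query_columns)

class ClusterStats:
    """Running benefit statistics for one cluster of similar queries."""
    def __init__(self):
        self.n, self.total, self.total_sq = 0, 0.0, 0.0

    def add(self, benefit):
        self.n += 1
        self.total += benefit
        self.total_sq += benefit * benefit

    def confidence_interval(self, z=1.96):
        """Mean benefit +/- z standard errors (normal approximation)."""
        mean = self.total / self.n
        var = max(self.total_sq / self.n - mean * mean, 0.0)
        half = z * math.sqrt(var / self.n)
        return mean - half, mean + half

clusters = defaultdict(ClusterStats)
clusters[signature(["orders.date"])].add(12.5)   # record one measurement
lo, hi = clusters[signature(["orders.date"])].confidence_interval()
```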
Organizing the Indexes
[Diagram: a new candidate enters the Cold Set; unpromising candidates are discarded, promising ones move up toward materialization. A data-structure sketch follows.]
Cold Set
• Relevant, but not promising candidates
• Benefit measured with a crude metric
Hot Set
• Promising candidates for materialization
• Profiled accurately with what-if calls
Materialized Set
• Indexes materialized by COLT
• Disk storage must fit in the budget
• Profiled with reverse what-if calls
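A minimal sketch of the three containers and the moves between them; the method names and the budget check are illustrative, not taken from the prototype:

```python
class IndexSets:
    """Cold/hot/materialized organization; the policies that decide the
    moves run at epoch boundaries (see the index-selection slides)."""

    def __init__(self, storage_budget):
        self.cold = set()          # relevant but not promising
        self.hot = set()           # profiled with what-if calls
        self.materialized = set()  # actually created on disk
        self.storage_budget = storage_budget
        self.used_storage = 0

    def add_candidate(self, index):
        self.cold.add(index)

    def promote_to_hot(self, index):
        self.cold.discard(index)
        self.hot.add(index)

    def materialize(self, index, size):
        # The materialized set must fit in the storage budget.
        if self.used_storage + size > self.storage_budget:
            raise ValueError("storage budget exceeded")
        self.hot.discard(index)
        self.materialized.add(index)
        self.used_storage += size

    def discard(self, index):
        self.cold.discard(index)
        self.hot.discard(index)
```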
Key Challenges
Challenge: Selective Profiling
• Which hot and materialized indexes are profiled w.r.t. the current query?
Challenge: Index Selection
• How are index candidates selected for the cold, hot, and materialized sets?
Selective Profiling (1/2)
[Diagram: the query is mapped to a cluster (C1–C4); indexes from the Cold, Hot, and Materialized sets that are relevant to that cluster feed the choice of indexes to profile]
• Take a random sample of the relevant indexes
• Focus on indexes with uncertain benefit in the cluster (sketch below)
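One way to combine "random sample" with "focus on uncertain benefit" is to weight each relevant index by the width of its benefit confidence interval in the query's cluster; this weighting is an assumption, not the paper's exact rule:

```python
import random

def choose_indexes_to_profile(relevant, interval_width, k):
    """relevant: indexes (hot/materialized) tied to the query's cluster
    interval_width: index -> width of its benefit confidence interval
    k: what-if calls available for this query"""
    pool = [(ix, interval_width.get(ix, 1.0) + 1e-9) for ix in relevant]
    chosen = []
    # Weighted sampling without replacement: uncertain indexes first.
    for _ in range(min(k, len(pool))):
        total = sum(w for _, w in pool)
        r = random.uniform(0.0, total)
        acc = 0.0
        for i, (ix, w) in enumerate(pool):
            acc += w
            if acc >= r:
                chosen.append(ix)
                pool.pop(i)
                break
    return chosen
```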
Selective Profiling (2/2)
• Profiling budget: max what-if calls per epoch
• Budget is adjusted after each epoch (sketch below)
  • Set proportional to the potential of hot indexes
  • Potential based on optimistic assumptions
• Result:
  • Stable workload: profiling is suspended
  • Shifting workload: profiling intensifies
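A sketch of the budget rule, assuming "potential" sums each hot index's optimistic (upper-bound) gain over the current configuration; the proportionality constant and the cap are placeholders:

```python
MAX_WHATIF_PER_EPOCH = 20  # cap from the parameter slide

def next_profiling_budget(optimistic_gains, scale=0.5):
    """optimistic_gains: per hot index, an upper bound on the extra
    benefit it could add beyond the materialized set (assumed given)."""
    potential = sum(max(g, 0.0) for g in optimistic_gains)
    # Stable workload: potential ~ 0, so profiling is suspended.
    # Shifting workload: large potential, so profiling intensifies.
    return min(MAX_WHATIF_PER_EPOCH, round(scale * potential))
```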
Index Selection (1/3)
[Diagram: a new candidate enters the Cold Set]
• How are new candidates chosen?
• Use a crude benefit metric (sketch below)
  • Cost of index access vs. sequential scan
  • Approximate and cheap to compute
• When each query arrives:
  • Compute the crude benefit of relevant indexes
  • Indexes with benefit > 0 become candidates
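A sketch of a crude metric in the spirit of "index access vs. sequential scan", using a textbook page-I/O model; the concrete formula is an assumption:

```python
def crude_benefit(table_pages, rows, selectivity):
    """Crude benefit in page I/Os; positive means the index looks useful."""
    seq_scan_cost = table_pages
    matching_rows = selectivity * rows
    # Pessimistic unclustered-index model: ~1 page fetch per matching row.
    index_access_cost = matching_rows
    return seq_scan_cost - index_access_cost

# Example: 10,000-page table, 1,000,000 rows, 0.1% selectivity
# -> 10000 - 1000 = 9000 pages saved, so the index becomes a candidate.
print(crude_benefit(10_000, 1_000_000, 0.001))
```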
Index Selection (2/3)
[Diagram: indexes promoted from the Cold Set to the Hot Set]
• How are hot indexes chosen?
• At the end of each epoch:
  • Get the crude benefit of hot and cold indexes
  • Find the cut-off point for entering the hot set
[Diagram: indexes ordered by crude benefit, with the COLD/HOT cut-off point derived from a two-cluster model; see the sketch below]
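The slides do not define the two-cluster model. One plausible reading, sketched here, is a one-dimensional 2-means split of the crude-benefit values, with the cut-off halfway between the two cluster means:

```python
def two_cluster_cutoff(benefits, iters=50):
    """1-D 2-means over crude benefits; returns the hot-set cut-off."""
    lo, hi = min(benefits), max(benefits)
    if lo == hi:
        return lo
    m1, m2 = lo, hi  # initialize the two centers at the extremes
    for _ in range(iters):
        c1 = [b for b in benefits if abs(b - m1) <= abs(b - m2)]
        c2 = [b for b in benefits if abs(b - m1) > abs(b - m2)]
        n1 = sum(c1) / len(c1) if c1 else m1
        n2 = sum(c2) / len(c2) if c2 else m2
        if (n1, n2) == (m1, m2):
            break
        m1, m2 = n1, n2
    return (m1 + m2) / 2.0

benefits = [0.5, 0.7, 9.0, 10.0, 0.2]
cut = two_cluster_cutoff(benefits)                  # ~5.0 for this data
hot_candidates = [b for b in benefits if b > cut]   # [9.0, 10.0]
```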
Index Selection (3/3)
[Diagram: observed benefit over time extrapolated to a predicted benefit; indexes promoted from the Hot Set to the Materialized Set]
• How is the materialized set chosen? (sketch below)
• At the end of each epoch:
  • Predict the benefit of hot and materialized indexes
  • The cost of materialization is discounted
  • Select indexes to maximize total benefit
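Maximizing total benefit under the storage budget is a 0/1 knapsack. The sketch below discounts each hot index's predicted benefit by its creation cost (already-materialized indexes pay none) and solves the knapsack by dynamic programming over coarse storage units; the solver choice is an assumption, not necessarily COLT's:

```python
def choose_materialized(candidates, budget_units):
    """candidates: (name, size_units, predicted_benefit, creation_cost);
    creation_cost is 0 for indexes that are already materialized.
    Returns the set of names maximizing net benefit within the budget."""
    items = [(n, s, b - c) for n, s, b, c in candidates if b - c > 0]
    # dp[u] = (best net benefit, chosen names) using <= u storage units
    dp = [(0.0, frozenset()) for _ in range(budget_units + 1)]
    for name, size, value in items:
        for u in range(budget_units, size - 1, -1):
            better = dp[u - size][0] + value
            if better > dp[u][0]:
                dp[u] = (better, dp[u - size][1] | {name})
    return dp[budget_units][1]

# 10-unit budget: keep idx_a (no creation cost) and build idx_c,
# rather than building the larger idx_b.
picked = choose_materialized(
    [("idx_a", 4, 8.0, 0.0), ("idx_b", 7, 9.0, 3.0), ("idx_c", 5, 7.0, 2.0)],
    budget_units=10)   # -> frozenset({'idx_a', 'idx_c'})
```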
Performance of COLT
[Chart: execution time in Phases 1–4, broken into minimum time, COLT extra time, and off-line extra time]
• Experimented with a 4-phase workload
• COLT adapts to each phase
• Off-line chooses the best static index set
Overhead of Profiling
[Chart: profiling overhead across Phases 1–4]
• Overhead peaks at the start of each phase
• Decreases when the system is well tuned
• Average < 1 what-if call per query
Closing Remarks
• On-line index selection is harder than off-line
  • Efficiency is a higher priority
  • Essentially need to guess the future
  • Index creation is an issue
• COLT is our solution
  • Solves a constrained problem
  • Potentially extensible to other domains
    • Only some components would change
Setting System Parameters
• Epoch length = 10
• What-if limit = 20
  • Or less if very worried about overhead
• Averaging window
  • Used when predicting future benefit
  • Indicates the number of epochs that give a good picture of the query distribution
  • Not easy to set
These knobs are collected in the sketch below.
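The same knobs as one configuration object; the field names and the example window value are illustrative, not from the prototype:

```python
from dataclasses import dataclass

@dataclass
class ColtConfig:
    epoch_length: int = 10      # per the slide; assumed to count queries
    whatif_limit: int = 20      # max what-if calls per epoch
    averaging_window: int = 4   # epochs used to predict future benefit;
                                # example value, workload-dependent
```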
Other On-line Systems
• Bruno and Chaudhuri, ICDE 2007
  • Avoids extra optimizer calls completely
  • Heuristics to capture index interaction
  • Very different mechanisms for index selection
• Sattler, Schallehn, and Geist, IDEAS 2004
  • More similarities to COLT
  • Different prediction of future benefit
  • No control on profiling overhead
Performance with Noise
[Chart: relative performance, 20%–100% axis]
• The worst-case scenario for COLT
• Concentrated bursts of noise queries
• Performance loss in some cases