Compression-Aware Physical Database Design for Optimizing Database Performance
This research study presents innovative methods for compression-aware physical database design, focusing on the trade-offs between storage efficiency and CPU overhead during data compression and decompression. By analyzing various compression schemes such as local and global dictionaries and null suppression techniques, the study showcases how different strategies impact indexing and overall database performance. The findings, having been implemented on SQL Server 2008, reveal effective configurations that help database administrators make informed decisions to optimize space, performance, and resource usage.
Compression-Aware Physical Database Design for Optimizing Database Performance
E N D
Presentation Transcript
Compression Aware Physical Database Design Microsoft Research Brown University Hideaki Kimura* Vivek Narasayya Manoj Syamala hkimura@cs.brown.edu {viveknar,manojsy}@microsoft.com (*) Graduates soon. On Job Market.
Background: Compression in DB • Every Major DBMS Supports • Saves Storage Consumption • Saves I/O Bandwidth DBMS A: 4x! DBMS B: 10x! DBMS C: 12x! Query Process Engine Tables, Indexes SELECT Decompress Compressed Data Compress INSERT
Compression Schemes in DB Dictionary Encoding NULL Suppression • Local dict. (Oracle, SQL Server) • Global dict. (DB2) + Prefix Suppression, LZO, RLE…
Two Types of Compression in DB Order Independent Order Dependent • NULL-Supp. • Global dict. • … IAB IBA IAB IBA IAB IBA page fragmented = ≠ • Run Length Enc. • Local dict. • …
Benefits and Overheads • Saves Storage Space, I/O • CPU Overhead to Compress & Decompress • Different Compression Scheme= Different Saving ↔ Overhead How Do We Use It? DBA
Issue 1: To Compress or not.. • Depends on Data • Depends on Workload • SELECTs/INSERTs Frequency • CPU bottleneck? IO bottleneck? 10GB 10GB 1GB 9GB -10% -90% Low Compression Ratio High Compression Ratio
Q1 Q2 Syntactically Relevant Indexes Issue 2: What Index to Create I4 I3 I5 I2 I1 Select Candidate Configurations I3 I5 I1 Physical DB Design Tool DBMS Configuration Enumerate BestConfiguration I1 I5 Hypothetical Indexes Prune What-if Analysis Query Optimizer Estimate Runtime
Naïve Solution: Staged Design • Run Design Tool to Select Indexes • Compress them, then Repeat. Stage 2 Stage 1 Compress! MV MV MV Workload Idx Idx Idx 100 MB Budget 100 MB 50 MB 100 MB
Problem in tight space budget SELECT SUM(Price*Discount) FROM Sales WHERE State='CA' and Jul 01 < Shipdate < Sep 01 • Misses an index that makes sense only with compression Sales Choice for 100 MB? I1 (State, Shipdate): 95 MB → 50 MB I2 (State, Shipdate) Include (Price, Discount): 170 MB → 90 MB
Example: Tight Space Budget Good design: 175MB CREATE COMPRESSED INDEX (L_PARTKEY,L_ORDERKEY,L_SUPPKEY) INCLUDE (L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT) Staged: 155MB CREATE INDEX (L_ORDERKEY) INCLUDE(L_SUPPKEY,L_COMMITDATE,L_RECEIPTDATE) ?
Problem in plenty space budget Choice for 200 MB? • Result in too high CPU overheads for compression/decompression. I1 (State, Shipdate): 95 MB → 50 MB INSERT INTO Sales … UPDATE Sales SET Price=.. I2 (State, Shipdate) Include (Price, Discount): 170 MB→ 90 MB CPU Overheads
Example: Plenty Space Budget Worse with More Budget!
IntegratedSolution Needed! • How to Estimate Index-size after compression? • How to Evaluate benefits/overheads of compression? • How Compression affects Candidate Selection/Enumeration?
Size Estimation • Essential Metric of Indexes • To Fit Space Budget • To Estimate I/O cost • Need Compression Fraction Table #tuple=1M Col-A Width=8 Col-B Width=4 Col-C Width=10 Clust. Key Width=4 Stats Size (IABC) = (8 + 4 + 10 + 4) * 1M = 26 MB Comp. Size (IABC) = 26 MB * CF (IABC)
Prior work SampleCF Overheads • SampleCF[Idreoset al. ICDE'10] Table 1GB Sample 10MB CREATE COMPRESSED INDEX • Sample Size: Cost ↔ Accuracy • Still Expensive for 1,000s of indexes
Index Size Deduction Local dict. (ORD-DEP) NULL supp. (ORD-IND) IBA IAB Ia Ib Ia,b Sum-up Savings SampleCF Ia Ib Col-Set Deduction Col-Ext Deduction Ia,b Ib,a Estimate From Run-Length More Details in paper
Optimize Accuracy-Cost Trade-off • Size-Estimation Strategy • Sample Size? • Deduction Path? • Expected Errors? • Formulate as Graph Problem • Greedy algorithm to solve(details in the paper)
Issues in Design Tool • Query Cost model to consider (De)Compression CPU cost • Candidate Selection/Enumeration Key Challenge:Space-Performance Trade-off
Candidate Selection:Space-Performance Trade-off Q1 Q2 • Add CompressedIndexes IB ID IA IC Compressed Versions Most of them are Ignored! ID IB IC IA Select Fastest Compressed Indexes are often Slower-but-Smaller IA IC (exception: very highcompression ratio)
Skyline Candidate Selection • Construct Skyline of Configurations • Pick Both Fast-Indexes and Small-Indexes
Comp. IC IB 10MB ICB 10MB 5MB Enumeration: Problem ICB IB IA IC IA Seed IB IC IA IA IB IA Optimal Design 15MB Room IA ICB IC IA ICB • Greedy picks un-compressed indexes too early
IC IA ICB IB IA ICC … Local Backtrack in Enumeartion Recover If Oversized IB IB IC IA IA IA • Recover oversized configurations • Compress indexes in the config.
Experimental Results • Implemented on SQL Server 2008 • Modified Database Tuning Advisor (DTA) "DTAc" • Modified Query Cost Model • TPC-H Scale-1 (more results in paper) • SELECT-intensive/UPDATE-intensive • Compared Estimated Runtime
Candidate Selection/Enumeration • Both Skyline & Backtrack are required esp. for tight budget Clustered/2ndary Indexes
DTAc vs. DTA • Especially better in tight budget • Choose lightly compressed designs in UPDATE-intensive Clustered/2ndary/MV Indexes
Overhead in DTA • Reduce Size Estimation Overheads for a factor of 3 • Mostly <10% Estimation Error
Conclusion • Opportunities and Challenges • Integrated Approach to exploit compression in physical design • Space-Performance Tradeoff • Size Estimation • Open Issues • Column-Store