Incremental Maintenance for Non-Distributive Aggregate Functions

Incremental Maintenance for Non-Distributive Aggregate Functions Themis Palpanas (U of Toronto) Richard Sidle Bobbie Cochrane Hamid Pirahesh work done at IBM Almaden Research Center

Motivation • large amounts of data stored in databases • often times data warehouses are used • consolidate data from many sources • offer more general and descriptive view of data • queried by business intelligence tools and decision support systems • produce expensive OLAP queries • these OLAP queries have nice properties: • based on same set of tables • perform similar aggregations Themis Palpanas - U of Toronto

Motivation (cont’d) • can efficiently support such queries with Automatic Summary Tables (ASTs) • materialized queries defined over a set of base tables • precomputed once, used many times • answer complex queries fast • must maintain ASTs when base tables change • inserts, updates, deletes Themis Palpanas - U of Toronto

base tables insert/update/delete Motivation (cont’d) AST AST definition

Aggregate Functions • characterization of functions wrt insertion and deletion operations • updates are series of deletions and insertions • distributive aggregate functions • new value computed based on old value and value of operation • SUM() • non-distributive aggregate functions • above property does not hold • STDDEV() • MIN() (because of deletions) Themis Palpanas - U of Toronto

Problem Statement • given ASTs with aggregate functions • distributive • SUM, COUNT • non-distributive • STDDEV, CORRELATION, REGRESSION, MIN/MAX, XMLAGG, … • when base tables change • incrementally maintain affected ASTs efficient maintenance of ASTs with non-distributive aggregate functions Themis Palpanas - U of Toronto

Outline • Current Approach • Our Solution • Experimental Evaluation • Related Work • Conclusions Themis Palpanas - U of Toronto

Propagate phase Apply phase combine old and new values base tables insert/update/delete Current Approach AST delta AST definition

Current Approach (cont’d) • works for distributive • SUM, COUNT • does not work for non-distributive • STDDEV, CORRELATION, REGRESSION • MIN/MAX • XMLAGG • need new way to deal with these functions Themis Palpanas - U of Toronto

Our Solution • selective recomputation • no longer enough to compute delta • must recompute some aggregation groups • minimize work to be done • choose which groups to recompute • optimize query plan Themis Palpanas - U of Toronto

Propagate phase Apply phase base tables insert/update/delete Our Solution (cont’d) combine old and new values recompute affected groups AST delta AST definition

Our Solution (cont’d) • the 5 steps • compute new aggregate values • change column derivation • recompute only affected groups • eliminate unnecessary operations • optimize for special cases Themis Palpanas - U of Toronto

UDI LOJ AST prop Initial Query Plan • Query Graph Model (QGM) Themis Palpanas - U of Toronto

UDI LOJ AST LOJ prop AST 1. Compute New Aggregate Values • compute delta for distributive functions • recompute non-distributive functions • get those values only for affected groups • duplicate computation for distributive functions! Themis Palpanas - U of Toronto

2. Change Column Derivation UDI • change column derivation • rewrite phase projects out unused columns • entire AST gets recomputed! LOJ AST LOJ non-distributive only prop AST distributive only Themis Palpanas - U of Toronto

2. Change Column Derivation • example AST: SELECT dept_id,COUNT(emp_id),MAX(age),STDDEV(salary) FROM employees GROUP BY dept_id • result of COUNT() computed from old propagate phase • results of MAX() and STDDEV() from AST definition Themis Palpanas - U of Toronto

3. Recompute Affected Groups • push join predicate down in AST • only affected groups are recomputed • special rules for super-aggregates • GROUPING SETS • ROLLUP • CUBE UDI non-distributive only LOJ AST LOJ distributive only AST* J J prop … T1 Tk Themis Palpanas - U of Toronto

3. Recompute Affected Groups • special treatment for ASTs with super-aggregates • predicates not pushdownable • caution not to compute totals of totals • build special join predicate • ensure correct aggregations • change rewrite rules • allow predicate pushdown through super aggregates • applicable only for special join predicate Themis Palpanas - U of Toronto

4. Remove Unnecessary Operations • outerjoin not always needed • when changes are only inserts • reroute columns from propagate phase through AST • remove outerjoin operator • same for updates not referencing AST grouping columns and predicates UDI LOJ all columns AST distributive only AST J J prop … T1 Tk Themis Palpanas - U of Toronto

4. Remove Unnecessary Operations • example AST: SELECT dept_id,COUNT(emp_id),MAX(age),STDDEV(salary) FROM employees GROUP BY dept_id • modification on base tables: UPDATE employees SET salary=10 WHERE age>40 • outerjoin operation will not be built • update does not refer to grouping column (dept_id), and no predicate in AST refers to updated column (salary) • certain that no tuples in AST will be deleted • only STDDEV() will be recomputed • the rest are not affected by changes Themis Palpanas - U of Toronto

5. Optimize for Special Cases • recomputation step not needed when • only insertions and only MIN/MAX functions • build predicate in apply phase • check if new min/max should replace old values • only deletions referring only to grouping columns of AST • can only cause entire groups to be deleted • handled in apply phase Themis Palpanas - U of Toronto

5. Optimize for Special Cases • example AST: SELECT dept_id,COUNT(emp_id),MAX(age),STDDEV(salary) FROM employees GROUP BY dept_id • modification on base tables: DELETE FROM employees WHERE dept_id>40 • selective recomputation step not needed • deletion refers only to grouping column (dept_id) • certain that entire groups will be deleted from AST • no other groups will be affected Themis Palpanas - U of Toronto

Experimental Evaluation • prototype implementation in IBM DB2 UDB • star schema database • sales of products over 5 year time period • fact table: 10 million tuples • AST with non-distributive aggregate function • 240,000 tuples • workload simulates nightly updates • add/delete data for first day of month • add/delete data for second day of month • add/delete data for full month Themis Palpanas - U of Toronto

Experimental Evaluation (cont’d) • deletions require 40-60% of full refresh time • optimized deletions require 1-4% of full refresh time Themis Palpanas - U of Toronto

Experimental Evaluation (cont’d) • insertions/updates require 20-25% of full refresh time Themis Palpanas - U of Toronto

Related Work • incremental view maintenance • differential refresh algorithms • Lindsay et al. 1986, Blakeley et al. 1986, Qian and Wiederhold 1991, Ceri and Widom 1991 • deferred incremental maintenance • Colby et al. 1996, Salem et al. 2000 • views with aggregation • Quass 1996, Mumick et al. 1997 Themis Palpanas - U of Toronto

Conclusions • incremental maintenance for ASTs with non-distributive aggregate functions • support MIN/MAX, STDDEV, CORRELATION, REGRESSION, XMLAGG, … • efficient selective recomputation • recompute only affected groups • optimize query plan • customize for special cases • significant performance improvements Themis Palpanas - U of Toronto

Future Work • examine use of work areas • temporary storage space • store intermediate values • maintenance without recomputation • STDDEV, MIN/MAX(?), … • very helpful for ASTs defined with super-aggregates • ASTs with HAVING clauses • do not know when groups will enter/leave AST Themis Palpanas - U of Toronto

Incremental Maintenance for Non-Distributive Aggregate Functions