Efficient Incremental Maintenance of Data Cubes

Efficient Incremental Maintenance of Data Cubes 2006. 9. 15. Ki Yong Lee Software Laboratories Samsung Electronics Co., Ltd. Myoung Ho Kim Division of Computer Science Korea Advanced Institute of Science and Technology

Outline • Introduction • Related work • Incremental maintenance of aggregate views • Incremental maintenance of data cubes • Our approach • Key idea • Problem formulation • Heuristic algorithm • Performance evaluation • Conclusion

Data Cube • A generalized group-by operator [GBP96] • Computes group-bys for all possible combinations of a given set of attributes SELECT a, b, SUM(m) FROM F GROUP BY a, b SELECT a, ‘*’, SUM(m) FROM F GROUP BY a SELECT a, b, SUM(m) FROM F CUBE BY a, b 2n SELECT ‘*’, b, SUM(m) FROM F GROUP BY b Dimension attributes SELECT ‘*’, ‘*’, SUM(m) FROM F (GROUP BY )

 a b c d ab ac ad bc bd cd abc abd acd bcd abcd Cube Lattice • We represent a data cube as a lattice diagram [HRU96] • Each node represents a group-by, which is called a cuboid • Each edge (qi, qj) represents that qj can be computed from qi Cuboid (group-by) Aggregation

 a b c d ab ac ad bc bd cd abc abd acd bcd abcd Maintenance of Data Cubes • A data cube is typically stored as a materialized view • How can we update a data cube efficiently when the source relations change? SELECT a,b,c,d,SUM(m) FROM F CUBE BY a,b,c,d ? SELECT a,b,c,d,SUM(m) FROM F’ CUBE BY a,b,c,d

Related Work (1/2) • Incremental maintenance of an aggregate view A:SELECTa,b,c,SUM(m) FROMF GROUPBYa,b,c F ΔA:SELECTa,b,c,SUM(m) FROM ΔF GROUPBYa,b,c A’ ΔF

Related Work (2/2) • Incremental maintenance of a data cube • Propagate stage: computes the delta cube • Refresh stage: refreshes the data cube by the delta cube  Δ ’ a b c Δa Δb Δc a’ b’ c’ + bc’ ac bc Δac Δbc ab’ ac’ ab Δab abc Δabc abc’ Delta cuboid F ΔF Original cube Delta cube Updated cube

 ∆ a b c d ∆a ∆b ∆c ∆d ab ac ad bc bd cd ∆ab ∆ac ∆ad ∆bc ∆bd ∆cd abc abd acd bcd ∆abc ∆abd ∆acd ∆bcd abcd ∆abcd Motivation • To incrementally maintain a data cube with 2n cuboids, existing methods compute 2n delta cuboids • As n increases, the maintenance cost increases significantly Original cube Delta cube

∆  ∆a a ∆b b ∆c c ∆d d ∆ab ab ∆ac ac ∆ad ad ∆bc bc ∆bd bd ∆cd cd ∆abc abc ∆abd abd ∆acd acd ∆bcd bcd ∆abcd abcd Motivation (cont’d) • Each cuboid is refreshed separately in existing methods 2n delta cuboids are used

∆bc bc ∆cd cd b bc ∆bcd bcd c ac ∆acd acd d ad ∆abd abd  a ab abc ∆abcd abcd ∆ ∆a a ∆b b ∆c c ∆d d ∆ab ab ∆ac ac ∆ad ad ∆bc bc ∆bd bd ∆cd cd ∆abc abc ∆abd abd ∆acd acd ∆bcd bcd ∆abcd abcd Key Idea • Refresh more than one cuboid by a delta cuboid

Key Idea (cont’d) • Benefit • The number of delta cuboids that need to be computed is reduced ∆, ∆a,∆b,∆c,∆d, ∆ab,∆ac,∆ad,∆bc,∆bd,∆cd, ∆abc,∆abd,∆acd,∆bcd, ∆abcd ∆bc,∆cd, ∆abd,∆acd,∆bcd, ∆abcd 2n = 16 delta cuboids need to be computed Only 6 delta cuboids need to be computed

Key Idea (cont’d) • Refreshing more than one cuboid by a delta cuboid ∆ab ab  13 +4 ∆abc abc +4  8

Key Idea (cont’d) • But, if Δabc is sorted by a, b and c, ab can be refreshed by Δabc without more access to ab ∆ab ab  28 +19 ∆abc abc +4  8

Key Idea (cont’d) • A delta cuboid can be easily sorted during its computation with little or no additional cost • In most existing commercial relational database systems, aggregation algorithms are based on sorting [G93] • If a group-by is computed by sorting algorithms, sorted results on the grouping attributes can be easily obtained • We assume that a delta cuboid is computed by sorting based algorithms • Thus, the above method can be applied to any delta cuboid with no additional sorting cost

Generalization of the Idea • Δ[d1d2…dk] • A delta cuboid sorted in the order of attributes d1, d2, …, dk. • The following set of delta cuboids can be refreshed by Δ[d1d2…dk] • {d1d2…dk,d1d2…dk-1, …,d1,} • Δ[d1d2…dk]{q1,q2, …,qi} • Cuboids q1,q2, …,qi are refreshed by Δ[d1d2…dk] • Example • Δ[abcd] {abcd,abc,ab,a,} • Cuboids abcd,abc,ab,a, are refreshed by Δ[abcd] • Δ[acdb]{acdb,acd,ac,a,} • Cuboids acdb,acd,ac,a,are refreshed by Δ[acdb]

Our Approach • We propose an incremental maintenance method that can maintain a data cube with 2n cuboids using only a subset of 2n delta cuboids ∆[abc]{abc} ∆[ab]{ab} ∆[ca]{ca} ∆[bc]{bc} ∆[a]{a} ∆[b]{b} ∆[c]{c} ∆{} ∆[acb]{acb,ac} ∆[ba]{ba,b} ∆[cb]{cb,c} ∆[a]{a,} ∆[abc]{abc,ab,a,} ∆[ca]{ca,c} ∆[bc]{bc,b} 23 = 8 delta cuboids 4 delta cuboids 3 delta cuboids

∆ ∆ ∆ ∆a ∆b ∆c ∆a ∆b ∆c ∆a ∆b ∆c ∆ab ∆ca ∆bc ∆ab ∆ca ∆bc ∆ab ∆ca ∆bc ∆abc ∆abc ∆abc Cost of Computing Delta Cuboids • Different sets of delta cuboids incur different computation cost • We represent the cost of computing delta cuboids by the cost of a delta cuboid computation plan ∆[abc], ∆[ab], ∆[ca], ∆[bc], ∆[a], ∆[b], ∆[c], ∆ ∆[acb], ∆[ba], ∆[cb], ∆[a] ∆[abc], ∆[ca], ∆[bc] 3 7 6 8 8 15 15 16 14 16 14 14

Problem Formulation • Delta cube • ΔQ={Δq1, Δq2, …, Δqm}, where Δqi is a delta cuboid • Refresh chain • A sequence of elements <Δq1,Δq2,…,Δqi> in ΔQ such that q1q2…qi • <Δq1,Δq2,…,Δqi> implies Δ[q1]{q1,q2,…,qi} • Example • <Δabc,Δab,Δa> implies Δ[abc]{abc,ab,a} • <Δcba,Δcb,Δc> implies Δ[cba]{cba,cb,c}

Problem Formulation (cont’d) • Refresh partition • A partition of the elements of ΔQ into disjoint refresh chains • Example {<Δacb,Δac>, <Δab,Δb>, <Δbc,Δc>, <Δa,Δ>} Δ[acb]{acb,ac} Δ[ba]{ba,b} Δ[cb]{cb,c} Δ[a]{a,} implies Δ[abc]{abc,ab,a,} Δ[ca]{ca,c} Δ[bc]{bc,b} {<Δabc,Δab, Δa, Δ>, <Δac,Δc>, <Δbc,Δb>} implies

Δ Δa Δb Δc Δca Δbc Δab 15 14 Δabc Problem Formulation (cont’d) • Delta cuboid computation plan • A subtree of the delta lattice including at least all of the first elements of refresh chains in a given refresh partition • Example {<Δabc,Δab,Δa, Δ>, <Δca,Δc>, <Δbc,Δb>} Refresh partition Delta cuboid computation plan

∆ ∆a ∆b ∆c ∆ab ∆ca ∆bc ∆abc 15 14 Problem Statement • Given a delta cube and its delta lattice, find a refresh partition that minimizes the cost of a delta cuboid computation plan Delta cube: {Δabc,Δab,Δac,Δbc,Δa,Δb,Δa, Δ} find out Refresh partition: {<Δabc,Δab, Δa, Δ>, <Δca,Δc>, <Δbc,Δb>} Delta cuboid computation plan:

NP-Hardness of the Problem • For a given refresh partition, finding the minimum cost delta cuboid computation plan is NP-complete • (proved in the paper) • Our problem is NP-hard • Our problem is at least as hard as finding the minimum cost delta cuboid computation plan • Moreover, there can be many refresh partitions for a given delta cube • Hence, we resort to heuristic approaches

Idea behind Our Heuristic • As the number of delta cuboids to be computed increases, the cost of a delta cuboid computation plan increases • Hence, we minimize the number of refresh chains in a refresh partition as possible • The minimum number of refresh chains in a refresh partition for a delta cube with 2n delta cuboids = (proved in the paper) {Δab} {Δa} {Δb} {Δ} {Δab, Δa} {Δb} {Δ} {Δab, Δa} {Δb, Δ} n n/2

Heuristic Algorithm • Starts from the refresh partition with 2n refresh chains • Each refresh chain consists of only one delta cuboid • Repeatedly merge refresh chains until there are exactly aaa refresh chains in the refresh partition • Whenever refresh chains are merged, a new delta cuboid computation plan with less cost is produced n n/2 {Δb,Δ} {Δa,Δab},{Δc}, {Δca},{Δbc} {Δabc} {Δ} {Δa},{Δb},{Δc} {Δab},{Δca},{Δbc} {Δabc} {Δca,Δc}, {Δbc,Δb,Δ} {Δabc,Δab,Δa} 2n n n/2

Example of the Algorithm Lv(0): ∆ {∆} 3 Lv(1): {∆b,∆} ∆a ∆b ∆c {∆a} {∆b} {∆c} {∆a} {∆c} 7 7 8 6 8 6 Lv(2): ∆ab ∆ca ∆bc {∆ab} {∆ca} {∆bc} {∆ab} {∆ca} {∆bc} 15 15 16 14 16 14 Lv(3): {∆abc} ∆abc {∆abc} (3) Step 2 (1) Input (2) Step 1 {{∆ca,Δc}, {Δbc,Δb,Δ}, {Δabc,Δab,Δa}} {∆bc,∆b,∆} {∆ca,∆c} {∆bc,∆b,∆} {∆ab,∆a} {∆ca,∆c} 15 15 16 14 14 {∆abc,∆ab,∆a} {∆abc} (5) Step 4 (4) Step 3 (6) Output

Analysis of the Algorithm • Lemma 1: Given a delta cube with 2n delta cuboids, the proposed heuristic algorithm produces a refresh partition with exactly refresh chains • Thus, we need to compute only delta cuboids to refresh a data cube with 2n cuboids n n/2 n n/2 n n/2

Analysis of the Algorithm (cont’d) • Lemma 2: Let TC be a delta cuboid computation plan found by the proposed heuristic. Then the following is true. • G: a delta lattice with 2n delta cuboids • G/2: a subgraph of G such that Level(0), Level(1), …, Level(n/2) are removed from G • TG: the minimum spanning tree of G • TG/2: the minimum spanning tree of G/2 • Thus, the cost of TC is bounded by the cost of TG/2 Cost(TC) < Cost(TG/2) < Cost(TG)

Performance Evaluation (1/3) • Data warehouse environment • Oracle9i database • Sun Blade 1000 with UltraSparc III CPU and 512MB RAM • TPC-H benchmark schema and data • Data cubes used in the experiments • Defined over lineitem table in the TPC-H schema

Performance Evaluation (2/3) • By varying the size of changes (a) Q1 (b) Q2 (c) Q3

Performance Evaluation (3/3) • The number of tuples generated in the experiment (a) Q1 (b) Q2 (c) Q3

Summary • We proposed an efficient incremental maintenance method for data cubes • The proposed method can refresh a data cube with 2n delta cuboids using only delta cuboids • The cost of computing delta cuboids can be substantially reduced • We formulated the problem and developed a heuristic algorithm for this problem • We showed the efficiency of the proposed method through performance evaluation n n/2

References • [GBP96] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. In Proceedings of the ICDE Conference, p. 152-159, 1996 • [HRU96] V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing Data Cubes Efficiently. In Proceedings of the ACM SIGMOD Conference, p. 205-216, 1996. • [G93] Goetz Graefe, Query Evaluation Techniques for Large Databases, ACM Computing Surveys, Vol. 25, Issue 2, p. 73-169, 1993. • [FG82] L. R. Foulds and R. L. Graham. The Steiner Problem in Phylogeny is NP-Complete. Advances in Applied Mathematics, 3: 43-49, 1982.

Efficient Incremental Maintenance of Data Cubes