Implementing Data Cubes Efficiently • Vicky :: Cao Hui Ping • Sherman :: Chow Sze Ming • CTH :: Chong Tsz Ho • Ronald :: Woo Lok Yan • Ken :: Yiu Man Lung
Content • Background • Introduction of Datacube • Problem defined • Lattice model • Greedy algorithm • How to do? • How good? • How bad? • Evaluations • Conclusion
Background • DSS (Decision Support System) • Gain competitiveness for business • Data warehouse • Maintain historical information • Use “Data cube” to summarize results • Identify trends • Performance issue (time and space) • Need to reuse result (materialization of views)
Introduction of datacube • Datacube • Dimensionality (number of GROUP-BYs) • Aggregated data: values in each cell • Dimension of datacube → detail of summary • Higher dimension → higher detail • Common operations • Drill down: look in more detail • Roll up: look in less detail
What is a data cube? [Figure: a three-dimensional data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum) and Country (U.S.A., Canada, Mexico, sum). One cell is labelled "Total annual sales of TV in U.S.A."; the corner cell is labelled (All, All, All).]
Our problem • Physically materialize the whole data cube • Best query response • Heavy pre-computing, large storage space • i.e. time efficient but space inefficient • Materialize nothing • Worst query response • Dynamic query evaluation, less storage space • i.e. space efficient but time inefficient
Problem on materialized views • Materialize only part of the data cube • Balance the storage space and the response time • What is the best subset of views to materialize? • Addressed in this paper
Data? View? • We use the data cube to model aggregate data. • So what do we use to model views? • A lattice!
Example of lattice diagram • 8 possible groupings on the dimensions • p for Part • s for Supplier • c for Customer • # of rows of data shown next to each grouping • An example of Regular Lattice [Lattice figure, top to bottom: psc 6M; pc 6M, ps 0.8M, sc 6M; p 0.2M, s 0.01M, c 0.1M; none 1]
≼ operator • Suppose c ≼ d • The view d can be used to derive the view c • d is an ancestor of c in the lattice diagram • Imposes a partial order on the views • Usage on dimensions • (part) ≼ (part, customer) • (part) ⋠ (customer) • Usage within attribute values • (year) ≼ (quarter) ≼ (month) ≼ (day) • (week) ≼ (day), but (month) ⋠ (week) and (quarter) ⋠ (week), since a week can straddle two months or quarters • An example of Irregular Lattice [Figure: hierarchy over day, week, month, quarter, year]
Regular lattices with equal domain size • Grouping attributes: A1, A2, …, An (each with domain size r) • Attribute for aggregation: B • Efficient algorithm • m: # of rows in the top view • k = ⌈log_r m⌉
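The quantity k = ⌈log_r m⌉ can be illustrated with a small sketch. The slide only states the formula; the size model used here (a view on i attributes has min(r^i, m) rows, so view sizes stop growing at rank k) is an assumption added for illustration, and the function names are hypothetical.

```python
def cliff_rank(r, m):
    """Smallest k with r**k >= m, i.e. ceil(log_r m), using integers only."""
    k, size = 0, 1
    while size < m:
        size *= r
        k += 1
    return k


def view_rows(i, r, m):
    """Assumed size model (not from the slide): min(r**i, m) rows."""
    return min(r ** i, m)


# Hypothetical numbers: domain size 10 per attribute, 6M rows in the top view.
r, m = 10, 6_000_000
k = cliff_rank(r, m)
print(k, [view_rows(i, r, m) for i in range(k + 1)])
```

Integer multiplication is used instead of a floating-point logarithm so that exact powers of r are not rounded to the wrong side of the ceiling.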
The problem • The previous technique cannot be applied to irregular lattices • Irregular lattices are common in data warehouses • Optimizing the choice of views for an irregular lattice is an NP-complete problem (no efficient exact algorithm is known) • Use a greedy algorithm • i.e. use a heuristic to obtain an approximate solution
Greedy algorithm • Be as greedy as possible in each step! • Simple example: use the smallest number of coins to pay 50 cents • Suppose we have many coins of 20 cents, 10 cents and 5 cents.
How to be greedy? • Common sense approach: • Select the largest coin: 20 cents • Select the largest coin again: 20 cents • Remaining amount = 50 – 20 – 20 = 10 cents • We cannot select the largest coin again. • We choose the 2nd largest coin 10 cents instead. • Only 3 coins are needed! Optimal solution!
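The coin-picking steps above can be sketched in a few lines of Python. This is only an illustration of the "take the largest coin that still fits" rule; the function name is hypothetical.

```python
def greedy_change(amount, coins):
    """Pay `amount` by repeatedly taking the largest coin that still fits."""
    picked = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            picked.append(coin)
            amount -= coin
    return picked, amount  # the leftover is 0 when the payment is exact


picked, leftover = greedy_change(50, [20, 10, 5])
print(picked, leftover)  # prints "[20, 20, 10] 0"
```

With a different, hypothetical coin set such as 25, 10 and 1 cents, the same rule pays 30 cents with six coins (25 + 1 + 1 + 1 + 1 + 1) although three 10-cent coins would do, so greedy choices are not optimal for every input.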
Definition of “benefit of view” • C(v) denotes the cost of view v • B(v,S) denotes the benefit of a view v relative to a set of views S • For each w ≼ v • Let u be the view of least cost in S such that w ≼ u • B_w = max{ C(u) – C(v), 0 } • B(v,S) = ∑_{w ≼ v} B_w
Greedy algorithm • In each step • Select the view with the most benefit • Add it to the result
Algorithm:
    S = {top view};
    for i = 1 to k {
        select the view v not in S such that B(v,S) is maximized;
        S = S union {v};
    }
    return S;
Selecting the first view • After selecting coins, let us go back to our problem: selecting views. • We must materialize the top view • i.e. the view grouping by all attributes • It cannot be constructed from other views • Materializing it avoids going back to the raw data
Selecting k more views • Space is limited! Suppose we can only select k more views. • For each view which is not yet selected, calculate the benefit of materializing it. • Pick the one with maximum benefit! • Let’s set k = 2 for the examples.
Example • E.g. the cost of constructing view b from view a is 100 • If we choose to materialize b, the new cost of constructing view b is 50. [Lattice figure, view costs: a 100; b 50, c 75; d 20, e 30, f 40; g 1, h 10]
First round • Notice that not only b, but also d, e, g and h can be calculated from b • So the total benefit is (100 – 50) x 5 = 250
Continue… • Similarly, the benefit of materializing c is (100 – 75) x 5 = 125
Not yet finished… • For e, Benefit = (100 – 30) x 3 = 210
Let’s choose b! • For d and f, Benefit = (100 – 20) x 2 = 160 and (100 – 40) x 2 = 120 respectively.
Next round? • Seems we should choose e, as it has the second largest benefit. • Let’s see what will happen in the second round.
Second round! • Now only c and f benefit if we materialize c (since e, g and h can be calculated more cheaply from b) • Benefit = (100 – 75) x 2 = 50
How about choosing f? • If we choose f, h can be calculated more cheaply from f than from b • Benefit = (100 – 40) + (50 – 40) = 70
Easy to work out others • Benefit of d = (50 – 20) x 2 = 60 • Benefit of e = (50 – 30) x 3 = 60 • Benefit of g = 50 – 1 = 49 • Benefit of h = 50 – 10 = 40
Observation • In the first round, the benefit of choosing f (only 120) is far from the best choice (250) • But in second round, choosing f gives the maximum benefit!
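The two rounds can be replayed with a short Python sketch of B(v,S) and the greedy loop. The view costs are the ones on the slides; the edge list is an assumption reconstructed from the view counts used in the benefit calculations (5 views under b and c, 3 under e, 2 under d and f), and all function names are hypothetical.

```python
# Cost (number of rows) of each view, as given on the slides.
COST = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}

# Direct children of each view (assumed edges, reconstructed from the slides).
CHILDREN = {
    "a": ["b", "c"], "b": ["d", "e"], "c": ["e", "f"],
    "d": ["g"], "e": ["g", "h"], "f": ["h"], "g": [], "h": [],
}


def descendants(v):
    """All views w with w ≼ v, including v itself."""
    seen, stack = {v}, [v]
    while stack:
        for child in CHILDREN[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def benefit(v, selected):
    """B(v, S): summed cost reduction over every view derivable from v."""
    total = 0
    for w in descendants(v):
        # Cheapest already-selected view from which w can be computed.
        cheapest = min(COST[u] for u in selected if w in descendants(u))
        total += max(cheapest - COST[v], 0)
    return total


def greedy(k, top="a"):
    """Start from the top view, then add the best view k times."""
    selected, history = [top], []
    for _ in range(k):
        scores = {v: benefit(v, selected) for v in COST if v not in selected}
        best = max(scores, key=scores.get)
        history.append((best, scores))
        selected.append(best)
    return selected, history


selected, history = greedy(2)
for best, scores in history:
    print(best, scores)
```

Under these assumptions the first round scores b at 250 and the second round scores f at 70, so the sketch selects a, b and f, matching the slides.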
Simple? Optimal? • Trade off again! This simple algorithm is not optimal in all cases! • Consider the following case…
Bad example [Lattice figure, view costs: a 200; b 100, c 99, d 100; below them, groups of 20 nodes, each group totalling 1000]
Bad example • Choose c • Benefit = (200 – 99) x (1 + 20 + 20) = 4141 = maximum
Bad example • Now choose either b or d (same benefit)
Bad example • How about the remaining views? • Very expensive!
Optimal solution should be… • Materialize b and d instead • Only c is a little bit expensive.
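The gap between the greedy choice and the best choice can be checked by brute force. The sketch below rests on several assumptions, because the figure is not fully recoverable from the slides: four groups of 20 leaf views, hanging under b only, under b and c, under c and d, and under d only, with each leaf costing 1000 / 20 = 50. The names are hypothetical.

```python
from itertools import combinations

TOP_COST = 200
COST = {"a": TOP_COST, "b": 100, "c": 99, "d": 100}
COVERS = {"a": set(), "b": {"b"}, "c": {"c"}, "d": {"d"}}

# Assumed structure: four groups of 20 leaf views, each leaf costing 50.
parents_of_group = [("b",), ("b", "c"), ("c", "d"), ("d",)]
for g, parents in enumerate(parents_of_group):
    for i in range(20):
        leaf = f"g{g}_{i}"
        COST[leaf] = 50
        COVERS[leaf] = {leaf}
        for p in parents:
            COVERS[p].add(leaf)
COVERS["a"] = set(COST)  # the top view can answer everything


def total_benefit(chosen):
    """Total saving compared with answering every view from the top view."""
    saving = 0
    for w in COST:
        cheapest = min([COST[v] for v in chosen if w in COVERS[v]] + [TOP_COST])
        saving += TOP_COST - cheapest
    return saving


def greedy(k):
    """Add, k times, the view that raises the total benefit the most."""
    chosen = []
    for _ in range(k):
        rest = [v for v in COST if v != "a" and v not in chosen]
        chosen.append(max(rest, key=lambda v: total_benefit(chosen + [v])))
    return chosen


def optimal(k):
    """Exhaustive search over all k-subsets (feasible only for tiny lattices)."""
    candidates = [v for v in COST if v != "a"]
    return max(combinations(candidates, k), key=total_benefit)


g, o = greedy(2), optimal(2)
print(g, total_benefit(g), o, total_benefit(o))
```

Under these assumptions the greedy selection starts with c and collects a total benefit of 6241, while materializing b and d yields 8200, so greedy reaches roughly 76% of the optimum in this case.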
Some theoretical results • It can be proved that the greedy algorithm achieves at least (e – 1) / e (about 63%) of the benefit of the optimal selection.
Extensions (1) • Problem • The views in a lattice are unlikely to have the same probability of being requested in a query. • Solution: • We can weight each benefit by its probability.
Extensions (2) • Problem • Instead of asking for some fixed number (k) of views to materialize, we might instead allocate a fixed amount of space to views. • Solution • We can consider the “benefit of each view per unit space”.
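As a rough illustration of "benefit per unit space", the first-round benefits from the earlier example can be divided by the size of each view. The sketch assumes that the space a view occupies equals its cost (row count) as shown in the lattice figure; that equivalence is an assumption made here, and the variable names are hypothetical.

```python
# First-round benefits and view sizes taken from the worked example.
first_round_benefit = {"b": 250, "c": 125, "d": 160, "e": 210, "f": 120}
rows = {"b": 50, "c": 75, "d": 20, "e": 30, "f": 40}  # assumed space per view

# Benefit per unit of space for each candidate view.
per_unit = {v: first_round_benefit[v] / rows[v] for v in first_round_benefit}

best = max(per_unit, key=per_unit.get)
print(best, per_unit[best])
```

Among these five candidates the ranking changes: b has the largest absolute benefit (250), but d gives the most benefit per row (160 / 20 = 8), so a space-constrained selection would favour d.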
Conclusions • Materialization of views is an essential query optimization strategy for decision-support applications. • There are good reasons to materialize some part of the data cube but not all of it. • A lattice framework models multidimensional analysis very well.
Conclusions (cont.) • Finding the optimal solution in the case of an irregular lattice is NP-hard. • Introduction of the greedy algorithm • The greedy algorithm works on this lattice and picks nearly the right views to materialize.
Conclusions (the end) • There exist cases in which the greedy algorithm fails to produce the optimal solution. • But the greedy algorithm has guaranteed performance • Extensions of the greedy algorithm.
Reference • Venky Harinarayan, Anand Rajaraman, Jeffrey D. Ullman. Implementing Data Cubes Efficiently. SIGMOD’96:205-216.
Thank you~ Q & A Section