
Implementing Data Cubes Efficiently

Implementing Data Cubes Efficiently. Vicky :: Cao Hui Ping Sherman :: Chow Sze Ming CTH :: Chong Tsz Ho Ronald :: Woo Lok Yan Ken :: Yiu Man Lung. Content. Background Introduction of Datacube Problem defined Lattice model Greedy algorithm How to do? How good? How bad ? Evaluations


Presentation Transcript


  1. Implementing Data Cubes Efficiently Vicky :: Cao Hui Ping Sherman :: Chow Sze Ming CTH :: Chong Tsz Ho Ronald :: Woo Lok Yan Ken :: Yiu Man Lung

  2. Content • Background • Introduction of Datacube • Problem defined • Lattice model • Greedy algorithm • How to do? • How good? • How bad ? • Evaluations • Conclusion

  3. Background • DSS (Decision Support System) • Gain competitiveness for business • Data warehouse • Maintain historical information • Use “Data cube” to summarize results • Identify trends • Performance issue (time and space) • Need to reuse result (materialization of views)

  4. Introduction of datacube • Datacube • Dimensionality (number of GROUP-BYs) • Aggregated data: values in each cell • Dimension of datacube → detail of summary • Higher dimension → higher detail • Common operations • Drill down: look in more detail • Roll up: look in less detail

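  5. What is a data cube? [Figure: data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum) and Country (U.S.A., Canada, Mexico, sum); one cell is labelled "Total annual sales of TV in U.S.A." and the corner cell is (All, All, All)]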

  6. Our problem • Physically materialize the whole data cube • Best query response • Heavy pre-computing, large storage space • i.e. time efficient but space inefficient • Materialize nothing • Worst query response • Dynamic query evaluation, less storage space • i.e. space efficient but time inefficient

  7. Problem on materialized views • Materialize only part of the data cube • Balance storage space against response time • What is the best subset of views to materialize? • Addressed in this paper

  8. Data? View? • We use a data cube to model aggregate data. • So what do we use to model views? • A lattice!

  9. Example of lattice diagram • 8 possible groupings on the dimensions • p for Part • s for Supplier • c for Customer • # of rows of data shown next to each grouping: psc 6M, pc 6M, ps 0.8M, sc 6M, p 0.2M, s 0.01M, c 0.1M, none 1 [Figure: an example of a regular lattice]

  10. ≼ operator • Suppose c ≼ d • The view d can be used to derive the view c • c is a descendant of d (d is an ancestor of c) in the lattice diagram • Imposes a partial order on the views • Usage on dimensions • (part) ≼ (part, customer) • (part) ⋠ (customer) • Usage within attribute values • (year) ≼ (quarter) ≼ (month) ≼ (day) • (year) ≼ (quarter) ≼ (week) ≼ (day) [Figure: an example of an irregular lattice over day, week, month, quarter and year]
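For the dimension-only case, the ≼ test reduces to a subset check on grouping attributes. The sketch below illustrates that under stated assumptions: the names `DIMENSIONS`, `all_groupings` and `derivable_from` are made up for this example, and attribute hierarchies such as day/week/month are not modelled.

```python
from itertools import combinations

# Hypothetical dimension names taken from the part/supplier/customer example.
DIMENSIONS = ("part", "supplier", "customer")

def all_groupings(dimensions):
    """Every subset of the dimensions, i.e. every view of the regular lattice."""
    return [frozenset(combo)
            for size in range(len(dimensions) + 1)
            for combo in combinations(dimensions, size)]

def derivable_from(c, d):
    """True when c ≼ d: view c groups by a subset of the attributes of view d."""
    return frozenset(c) <= frozenset(d)

views = all_groupings(DIMENSIONS)
print(len(views))                                      # 8
print(derivable_from({"part"}, {"part", "customer"}))  # True
print(derivable_from({"part"}, {"customer"}))          # False
```

Because the subset relation is reflexive, antisymmetric and transitive, this check gives a partial order, which is what the lattice diagram draws.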

  11. Regular lattices with equal domain size • Grouping attributes: A1, A2, …, An (each with domain size r) • Attribute for aggregation: B • Efficient algorithm • m: # of rows in the top view • k = ⌈log_r m⌉

  12. The problem • The previous technique cannot be applied to irregular lattices • Irregular lattices are common in data warehouses • Optimizing the choice of views for an irregular lattice is an NP-complete problem (inefficient!) • Use a greedy algorithm • i.e. use heuristics to obtain an approximate solution

  13. Greedy algorithm • Be as greedy as possible in each step!! • Simple example: use the smallest number of coins to pay 50 cents • Suppose we have many coins of 20 cents, 10 cents and 5 cents.

  14. How to be greedy? • Common sense approach: • Select the largest coin: 20 cents • Select the largest coin again: 20 cents • Remaining amount = 50 – 20 – 20 = 10 cents • We cannot select the largest coin again. • We choose the 2nd largest coin 10 cents instead. • Only 3 coins are needed! Optimal solution!
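The coin steps above can be sketched in a few lines. This is only an illustration of the greedy idea; the function name `greedy_coins` is hypothetical.

```python
def greedy_coins(amount, denominations):
    """Repeatedly take the largest coin that still fits into the remaining amount."""
    chosen = []
    for coin in sorted(denominations, reverse=True):
        while amount >= coin:
            chosen.append(coin)
            amount -= coin
    if amount != 0:
        raise ValueError("amount cannot be paid exactly with these coins")
    return chosen

print(greedy_coins(50, [20, 10, 5]))  # [20, 20, 10]
```

For these denominations the greedy answer is also the optimal one, which matches the three-coin result on the slide.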

  15. Definition of “benefit of view” • C(v) denotes the cost of view v • B(v, S) denotes the benefit of view v relative to a set of views S • For each w ≼ v • Let u be the view of least cost in S such that w ≼ u • B_w = max{C(u) – C(v), 0} • B(v, S) = ∑_{w ≼ v} B_w

  16. Greedy algorithm • In each step • Select the view with the most benefit • Add it to the result Algorithm S={top view}; for i=1 to k { select view v not in S such that B(v,S) is maximized S = S union {v} } return S;
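A minimal Python sketch of the benefit definition and the greedy loop follows. It assumes the lattice is supplied as two dictionaries: `cost[v]` for C(v) and `descendants[v]` for the set of views w with w ≼ v, including v itself. Those names and the four-view lattice in the usage lines are hypothetical, not taken from the slides.

```python
def benefit(v, selected, cost, descendants):
    """B(v, S): total saving over every view w ≼ v if v were added to S."""
    total = 0
    for w in descendants[v]:
        # u is the cheapest already-selected view that can answer w
        cheapest = min(cost[u] for u in selected if w in descendants[u])
        total += max(cheapest - cost[v], 0)
    return total

def greedy_select(top, k, cost, descendants):
    """Start from the top view and add the most beneficial view k times."""
    selected = [top]
    for _ in range(k):
        candidates = [v for v in cost if v not in selected]
        if not candidates:
            break
        best = max(candidates,
                   key=lambda v: benefit(v, selected, cost, descendants))
        selected.append(best)
    return selected

# Hypothetical four-view lattice: top can answer everything, x and y can answer z.
cost = {"top": 100, "x": 40, "y": 60, "z": 10}
descendants = {"top": {"top", "x", "y", "z"}, "x": {"x", "z"},
               "y": {"y", "z"}, "z": {"z"}}
print(greedy_select("top", 2, cost, descendants))  # ['top', 'x', 'y']
```

The top view is always in `selected`, so the `min` inside `benefit` never runs over an empty set.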

  17. Selecting the first view • After selecting coins, let us go back to our problem: selecting views. • We must materialize the top view • i.e. the view grouping by all attributes • It cannot be constructed from other views • Avoids going to the raw data

  18. Selecting k more views • Space is limited! Suppose we can only select k more views. • For each view that is not yet selected, calculate the benefit of materializing it. • Pick the one with maximum benefit!!! • Let’s set k = 2 for the examples.

  19. Example • E.g. the cost of constructing view b given the view a is 100 • If we choose b to materialize, the new cost of constructing view b is 50. [Figure: lattice of views with costs a = 100, b = 50, c = 75, d = 20, e = 30, f = 40, g = 1, h = 10]

  20. First round • Notice that not only b, but also d, e, g and h can be calculated from b • So the total benefit is (100 – 50) x 5 = 250 [Figure: the lattice from slide 19]

  21. Continue… • Similarly, the benefit of materializing c is (100 – 75) x 5 = 125 [Figure: the lattice from slide 19]

  22. Not yet finished… • For e, Benefit = (100 – 30) x 3 = 210 [Figure: the lattice from slide 19]

  23. Let’s choose b! • For d and f, Benefit = (100 – 20) x 2 = 160 and (100 – 40) x 2 = 120 respectively. [Figure: the lattice from slide 19]

  24. Next round? • Seems we should choose e, as it has the second largest benefit. • Let’s see what will happen in the second round.

  25. Second round! • Now only c and f benefit if we materialize c (since e, g and h can be calculated more cheaply from b) • Benefit = (100 – 75) x 2 = 50 [Figure: the lattice from slide 19]

  26. How about choosing f? • If we choose f, we find that h can be calculated more cheaply from f than from b. • Benefit = (100 – 40) + (50 – 40) = 70 [Figure: the lattice from slide 19]

  27. Easy to work out others • Benefit of d = (50 – 20) x 2 = 60 • Benefit of e = (50 – 30) x 3 = 60 • Benefit of g = 50 – 1 = 49 • Benefit of h = 50 – 10 = 40 [Figure: the lattice from slide 19]
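The two rounds can be re-checked mechanically. The sketch below uses the costs shown on the example slides; the parent-to-child edges in `CHILDREN` are an assumption, reconstructed from the multipliers on the slides (b covers 5 views, c covers 5, e covers 3, d and f cover 2 each), since the figure itself did not survive.

```python
COST = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}
# Assumed edges (view -> views directly below it), inferred from the slide counts.
CHILDREN = {"a": "bc", "b": "de", "c": "ef", "d": "g",
            "e": "gh", "f": "h", "g": "", "h": ""}

def descendants(v):
    """All views derivable from v, including v itself."""
    found = {v}
    for child in CHILDREN[v]:
        found |= descendants(child)
    return found

def benefit(v, selected):
    """B(v, S) for the example lattice."""
    total = 0
    for w in descendants(v):
        cheapest = min(COST[u] for u in selected if w in descendants(u))
        total += max(cheapest - COST[v], 0)
    return total

def round_benefits(selected):
    """Benefit of every view that has not been selected yet."""
    return {v: benefit(v, selected) for v in COST if v not in selected}

first = round_benefits({"a"})        # before anything but the top view is chosen
second = round_benefits({"a", "b"})  # after b has been materialized
print(first)
print(second)
print(max(second, key=second.get))   # f
```

Under the assumed edges, the first dictionary reproduces the 250, 125, 160, 210 and 120 worked out above, and the second reproduces 50, 60, 60, 70, 49 and 40.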

  28. Observation • In the first round, the benefit of choosing f (only 120) is far from that of the best choice (250) • But in the second round, choosing f gives the maximum benefit!

  29. Simple? Optimal? • Trade off again! This simple algorithm is not optimal in all cases! • Consider the following case…

  30. Bad example [Figure: lattice with top view a (cost 200) above views b (cost 100), c (cost 99) and d (cost 100); below them are groups of 20 nodes, each group with total cost 1000]

  31. Bad example • Choose c • Benefit = (200 – 99) x (1 + 20 + 20) = 4141 = maximum [Figure: the lattice from slide 30]

  32. Bad example • Now choose either one of b and d (same benefit) [Figure: the lattice from slide 30]

  33. Bad example • How about these? • Very expensive!!! [Figure: the lattice from slide 30]

  34. Optimal solution should be… • Only c is a little bit expensive. [Figure: the lattice from slide 30]
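The gap between the greedy and the optimal choice can be quantified with a short sketch. It rests on assumptions that are not spelled out in the transcript: there are four groups of 20 nodes (below b only, below b and c, below c and d, below d only), and only b, c and d are considered as candidates. The names `GROUPS`, `total_benefit` and `greedy` are hypothetical.

```python
COST = {"a": 200, "b": 100, "c": 99, "d": 100}
# Assumed structure: which of b, c, d can answer the nodes of each 20-node group.
GROUPS = {"b_only": ["b"], "b_and_c": ["b", "c"],
          "c_and_d": ["c", "d"], "d_only": ["d"]}
GROUP_SIZE = 20

def total_benefit(chosen):
    """Total saving, versus answering everything from a, of materializing `chosen`."""
    # Each chosen view answers itself more cheaply than a does.
    saving = sum(COST["a"] - COST[v] for v in chosen)
    for parents in GROUPS.values():
        usable = [COST[p] for p in parents if p in chosen]
        best = min(usable, default=COST["a"])
        saving += GROUP_SIZE * (COST["a"] - best)
    return saving

def greedy(k):
    """Pick k of b, c, d, each time taking the view with the largest marginal benefit."""
    chosen = []
    for _ in range(k):
        rest = [v for v in "bcd" if v not in chosen]
        chosen.append(max(rest, key=lambda v: total_benefit(chosen + [v])))
    return chosen

picked = greedy(2)
print(picked, total_benefit(picked))   # ['c', 'b'] 6241
print(total_benefit(["b", "d"]))       # 8200
```

Under these assumptions the greedy choice reaches 6241 against 8200 for b and d together, roughly 76% of the optimum.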

  35. Some theoretical results • It can be proved that the greedy algorithm achieves at least (e – 1) / e (about 63%) of the benefit of the optimal selection.
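For a selection of k views, the guarantee in the referenced paper is usually written in the slightly stronger form below, where B_greedy and B_opt denote the total benefit of the greedy and of the optimal selection (symbols introduced here for illustration):

```latex
\frac{B_{\text{greedy}}}{B_{\text{opt}}}
  \;\ge\; 1 - \left(\frac{k-1}{k}\right)^{k}
  \;\ge\; 1 - \frac{1}{e}
  \;=\; \frac{e-1}{e} \;\approx\; 0.632
```

For k = 2 the middle expression equals 1 – (1/2)^2 = 3/4, and as k grows it decreases towards (e – 1) / e.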

  36. Extensions (1) • Problem • The views in a lattice are unlikely to have the same probability of being requested in a query. • Solution: • We can weight each benefit by its probability.

  37. Extensions (2) • Problem • Instead of asking for some fixed number (k) of views to materialize, we might instead allocate a fixed amount of space to views. • Solution • We can consider the “benefit of each view per unit space”.
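Both extensions fit into the same greedy loop. The sketch below is one possible reading, not the paper's exact procedure: `weight[w]` is an assumed query probability for view w, `space[v]` an assumed storage size for view v, `budget` the space left after the top view, and the four-view lattice is hypothetical.

```python
def weighted_benefit(v, selected, cost, descendants, weight):
    """B(v, S) with each saving scaled by the query probability of the view it helps."""
    total = 0.0
    for w in descendants[v]:
        cheapest = min(cost[u] for u in selected if w in descendants[u])
        total += weight[w] * max(cheapest - cost[v], 0)
    return total

def greedy_within_space(top, budget, cost, space, descendants, weight):
    """Keep adding the view with the best benefit per unit space while it still fits."""
    selected, remaining = [top], budget
    while True:
        fits = [v for v in cost if v not in selected and space[v] <= remaining]
        if not fits:
            return selected
        best = max(fits, key=lambda v: weighted_benefit(
            v, selected, cost, descendants, weight) / space[v])
        if weighted_benefit(best, selected, cost, descendants, weight) <= 0:
            return selected  # nothing left that improves any query
        selected.append(best)
        remaining -= space[best]

# Hypothetical lattice; space is taken equal to cost and all views equally likely.
cost = {"top": 100, "x": 40, "y": 60, "z": 10}
space = dict(cost)
weight = {v: 1.0 for v in cost}
descendants = {"top": {"top", "x", "y", "z"}, "x": {"x", "z"},
               "y": {"y", "z"}, "z": {"z"}}
print(greedy_within_space("top", 50, cost, space, descendants, weight))
# ['top', 'z', 'x']
```

With a space budget of 50, the small view z is picked first because its benefit per unit space (90 / 10) beats that of x (120 / 40), even though x has the larger absolute benefit.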

  38. Conclusions • Materialization of views is an essential query optimization strategy for decision-support applications. • Reason to materialize some part of the data cube but not all of the cube. • A lattice framework that models multidimensional analysis very well.

  39. Conclusions (cont.) • Finding the optimal solution in the case of an irregular lattice is NP-hard. • Introduction of the greedy algorithm • The greedy algorithm works on this lattice and picks almost the right views to materialize.

  40. Conclusions (the end) • There exist cases in which the greedy algorithm fails to produce the optimal solution. • But the greedy algorithm has guaranteed performance • Extensions of the greedy algorithm.

  41. Reference • Venky Harinarayan, Anand Rajaraman, Jeffrey D. Ullman. Implementing Data Cubes Efficiently. SIGMOD’96:205-216.

  42. Thank you~ Q & A Section
