
Ben Holm Jagadeeshwaran Ranganathan Aric Schorr


Presentation Transcript


  1. Ben Holm Jagadeeshwaran Ranganathan Aric Schorr

  2. Agenda • Problem Description • Analysis of “Bottom-Up Computation of Sparse and Iceberg CUBES” • Analysis of “C-Cubing: Efficient Computation of Closed Cubes by Aggregation-Based Checking” • Analysis of “Iceberg-cube Computation with PC Clusters” • Sequential Algorithm • Parallel Algorithm • Questions

  3. What is a data cube? A data cube is the structure created by a CUBE operation, which aggregates (for example, counts) every combination of values of the grouping columns in a dataset, including combinations in which some columns are collapsed to ALL. "Creating a data cube requires generating the power set (set of all subsets) of the aggregation columns." (Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals)
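
The power-set idea can be sketched in a few lines of Python. This is only an illustration, not code from the cited paper: the function name `full_cube`, the sample rows, and the use of the string "ALL" as the collapsed value are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def full_cube(rows, dims):
    """Count every group in every grouping (every subset of dims)."""
    cube = {}
    positions = range(len(dims))
    for k in range(len(dims) + 1):
        for grouping in combinations(positions, k):
            counts = Counter()
            for row in rows:
                # Columns outside the grouping collapse to "ALL".
                cell = tuple(row[i] if i in grouping else "ALL" for i in positions)
                counts[cell] += 1
            cube[tuple(dims[i] for i in grouping)] = dict(counts)
    return cube


# Hypothetical sample data: (Model, Year, Color).
rows = [
    ("Chevy", 1990, "red"),
    ("Chevy", 1990, "blue"),
    ("Ford", 1991, "red"),
]
cube = full_cube(rows, ("Model", "Year", "Color"))
print(len(cube))            # number of groupings for 3 dimensions
print(cube[("Model",)])     # counts grouped by Model only
```

With three dimensions the dictionary holds one entry per subset of the columns, which is where the 2^d growth comes from.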

  4. Data cube as SQL - Almost • The new keyword 'ALL' stands for all values of that column. • Even with this added syntax, the full power set is not generated: the query below produces only 4 of the 2^3 = 8 groupings.
      SELECT 'ALL', 'ALL', 'ALL', SUM(Sales) FROM Sales WHERE Model = 'Chevy'
      UNION
      SELECT Model, 'ALL', 'ALL', SUM(Sales) FROM Sales WHERE Model = 'Chevy' GROUP BY Model
      UNION
      SELECT Model, Year, 'ALL', SUM(Sales) FROM Sales WHERE Model = 'Chevy' GROUP BY Model, Year
      UNION
      SELECT Model, Year, Color, SUM(Sales) FROM Sales WHERE Model = 'Chevy' GROUP BY Model, Year, Color;

  5. Problems • A data cube can provide valuable data. • A full data cube has 2^d groupings • It takes a long time to compute • It takes a lot of storage Better algorithms must be found . . .

  6. Agenda • Problem Description • Analysis of “Bottom-Up Computation of Sparse and Iceberg CUBES” • Analysis of “C-Cubing: Efficient Computation of Closed Cubes by Aggregation-Based Checking” • Analysis of “Iceberg-cube Computation with PC Clusters” • Sequential Algorithm • Parallel Algorithm • Questions

  7. Why Iceberg Cube • The standard CUBE operation requires 2^d groupings for d dimensions. • The size of each grouping is at most the product of the cardinalities of its dimensions. • For sparse data sets, many of these groupings are uninteresting: they contain few or no results. • Iceberg-CUBE computes only the groupings that satisfy an aggregate condition specified by the user. • Iceberg-CUBE is computed bottom up, allowing groupings that cannot satisfy the condition to be pruned early.

  8. Sparse Cubes • Research was done on weather data from September 1985. • Nine dimensions, more than a million tuples in the dataset. • The CUBE on this dataset has more than two hundred million tuples, 200 times the size of the dataset. • Computing only the group-bys that aggregate two or more input tuples requires 50 times the input size. • Larger minimum-support values require even less space. • By selecting only group-bys that perform a little aggregation, I/O time can be drastically reduced.

  9. Iceberg-CUBE as SQL
      SELECT A, B, C, COUNT(*), SUM(X)
      FROM R
      CUBE BY A, B, C
      HAVING COUNT(*) >= N
     • N is called the minimum support of a grouping (minsup). • A minsup of 1 is exactly the same as a full CUBE. • A precomputed iceberg cube with minsup = N can be used to answer queries HAVING COUNT(*) >= M where M >= N.
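
The minsup semantics, and the reuse of a precomputed iceberg cube for a stricter threshold, can be illustrated with a naive Python sketch. It counts every cell by brute force and then filters, so it shows the definition only, not an efficient algorithm. The names `iceberg_cube` and `refilter`, the "*" marker for a collapsed column, and the sample rows are all assumptions of this example.

```python
from collections import Counter
from itertools import combinations


def iceberg_cube(rows, n_dims, minsup):
    """Naive iceberg cube: count every cell, keep cells with count >= minsup."""
    counts = Counter()
    for row in rows:
        for k in range(n_dims + 1):
            for keep in combinations(range(n_dims), k):
                # Each row contributes to exactly one cell per grouping.
                cell = tuple(row[i] if i in keep else "*" for i in range(n_dims))
                counts[cell] += 1
    return {cell: c for cell, c in counts.items() if c >= minsup}


def refilter(precomputed, m):
    """Answer HAVING COUNT(*) >= m from a cube built with a smaller minsup."""
    return {cell: c for cell, c in precomputed.items() if c >= m}


# Hypothetical sample relation R(A, B, C).
rows = [
    ("a1", "b1", "c1"),
    ("a1", "b1", "c2"),
    ("a1", "b2", "c1"),
    ("a2", "b1", "c1"),
]
full = iceberg_cube(rows, 3, 1)   # minsup = 1 is the full cube
ice2 = iceberg_cube(rows, 3, 2)   # only cells with at least 2 tuples
ice3 = refilter(ice2, 3)          # stricter query answered from ice2
print(len(full), len(ice2), len(ice3))
```

Filtering `ice2` gives the same cells as recomputing with minsup = 3, because every cell with count >= 3 already has count >= 2.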

  10. Iceberg-CUBE – New Algorithm • Pruning reduces final size of the cube • Early pruning can reduce the computation time of the cube • A new algorithm was created that computes the cube from the bottom up, pruning as it progresses through the data. • BUC - Bottom Up Computation

  11. Iceberg-CUBE – BUC – How does it work? • BUC proceeds from the bottom of the lattice (smallest, most aggregated groups) and works upward. All previous algorithms work in the opposite direction. • Example: A = chevy, B = 1970, C = green. • When a grouping is found to be too small (below minsup), all of its ancestors are skipped, because their size must be <= the size of the child.
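
A minimal Python sketch of this bottom-up recursion with pruning follows. It is a simplification, not the paper's pseudo-code: partitions are built with a dictionary instead of the sorting-based partitioning an efficient implementation would use, only COUNT is aggregated, and the function name `buc`, the "*" marker, and the sample rows are assumptions of this example.

```python
from collections import defaultdict


def buc(rows, n_dims, minsup):
    """Return {cell: count} for every cell whose count is >= minsup."""
    out = {}

    def recurse(part, cell, start):
        # `part` holds exactly the tuples that match the fixed values in `cell`.
        out[tuple(cell)] = len(part)
        for d in range(start, n_dims):
            groups = defaultdict(list)
            for row in part:
                groups[row[d]].append(row)
            for value, sub in groups.items():
                # Pruning step: a partition below minsup cannot yield any
                # qualifying, more detailed cell, so its subtree is skipped.
                if len(sub) >= minsup:
                    cell[d] = value
                    recurse(sub, cell, d + 1)
                    cell[d] = "*"

    if len(rows) >= minsup:
        recurse(list(rows), ["*"] * n_dims, 0)
    return out


# Hypothetical sample data with three dimensions.
rows = [
    ("a1", "b1", "c1"),
    ("a1", "b1", "c2"),
    ("a1", "b2", "c1"),
    ("a2", "b1", "c1"),
]
result = buc(rows, 3, 2)
for cell, count in sorted(result.items()):
    print(cell, count)
```

With minsup = 2, the partition for a2 contains a single tuple, so none of the cells that extend a2 are ever visited.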

  12. Iceberg-CUBE – How will we use it • This paper goes into some detail about how to implement Iceberg-CUBE.  We plan to implement this algorithm and demonstrate how it may be 'parallelized'.

  13. Agenda • Problem Description • Analysis of “Bottom-Up Computation of Sparse and Iceberg CUBES” • Analysis of “C-Cubing: Efficient Computation of Closed Cubes by Aggregation-Based Checking” • Analysis of “Iceberg-cube Computation with PC Clusters” • Sequential Algorithm • Parallel Algorithm • Questions

  14. Nomenclature of C-Iceberg • Group-by cell: a cell c = (a1, a2, . . . , an : m) (where m is a measure) is called a k-dimensional group-by cell (i.e., a cell in a k-dimensional cuboid) if exactly k of the values a1, . . . , an are not ∗. We write V(c) = (a1, a2, . . . , an) and M(c) = m. • Definition of a closed cell: given c = (a1, a2, . . . , an : m) and c' = (a1', a2', . . . , an' : m'), we denote V(c) ≤ V(c') if for each ai (i = 1, . . . , n) which is not ∗, ai' = ai. A cell c is said to be covered by another cell c' if for every c'' such that V(c) ≤ V(c'') ≤ V(c'), M(c'') = M(c'). A cell is a closed cell if it is not covered by any other cell.

  15. How to evaluate a C-cube: an example (a relational table with four attributes). Let the measure be count, and the iceberg constraint be count ≥ 2. cell1 = (a1, b1, c1, ∗ : 2) and cell2 = (a1, ∗, ∗, ∗ : 3) are closed iceberg cells, but cell3 = (a1, ∗, c1, ∗ : 2) and cell4 = (a1, b2, c2, d2 : 1) are not: the former is covered by cell1, whereas the latter does not satisfy the iceberg constraint.
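
The example can be checked with a brute-force Python sketch. The three input tuples below are an assumption: the slide does not list the table, so they were chosen only to reproduce the counts quoted for cell1 to cell4. The check also relies on a shortcut that holds for the count measure: a cell is covered exactly when, on some ∗ dimension, all of its tuples share one value. The function name `closed_iceberg_cells` is hypothetical.

```python
from itertools import combinations


def closed_iceberg_cells(rows, minsup):
    """Brute-force closed iceberg cells for the count measure."""
    n = len(rows[0])
    result = {}
    for k in range(n + 1):
        for keep in combinations(range(n), k):
            for cell in {tuple(r[i] if i in keep else "*" for i in range(n)) for r in rows}:
                matching = [r for r in rows
                            if all(c == "*" or c == v for c, v in zip(cell, r))]
                if len(matching) < minsup:
                    continue  # fails the iceberg constraint
                # Covered: some "*" dimension has a single value in all tuples,
                # so fixing that value gives a more specific cell, same count.
                covered = any(cell[i] == "*" and len({r[i] for r in matching}) == 1
                              for i in range(n))
                if not covered:
                    result[cell] = len(matching)
    return result


# Assumed table, consistent with the counts in the example above.
rows = [
    ("a1", "b1", "c1", "d1"),
    ("a1", "b1", "c1", "d3"),
    ("a1", "b2", "c2", "d2"),
]
closed = closed_iceberg_cells(rows, minsup=2)
print(closed)
```

On this assumed table only cell1 and cell2 survive: cell3 is dropped because all of its tuples share b1, and cell4 is dropped by the count ≥ 2 constraint.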

  16. Lemma 1: If a cell is not closed on measure count, it cannot be closed w.r.t. any other measure. If a cell c1 is not closed, there exists a cell c2 which covers c1. Reason: c1 and c2 are aggregated from the same group of tuples. In addition, if a measure m is strictly monotonic or anti-monotonic, then closedness on measure count is equivalent to closedness on measure m.

  17. a) Two main processes in closed cubing: 1) closedness checking (c-checking); 2) pruning non-closed cells/itemsets (c-pruning). b) Two major approaches to closedness checking: 1) output-based c-checking: checks the closedness of a result by comparing it with previous outputs; 2) tuple-based c-checking: scans the tuples in the data partition to collect the closedness information. Both methods suffer considerable overhead. Alternative: closedness checking by aggregation.

  18. a) Why compression? The data cube suffers from the curse of dimensionality. b) Previous work on lossless compression: 1) Condensed cubes; 2) Dwarf; 3) Quotient Cube.

  19. Dwarf • Based on the Dwarf complexity analysis, the size complexity is O(T^ 1+logDC ), where T = number of tuples, D = number of dimensions, C = cardinality of each dimension. • Authors' inference: since C is generally larger than D, the size complexity of the Dwarf cube does not grow exponentially with the number of dimensions.

  20. Dwarf vs C-Cube • Dwarf's compression comes from suffix coalescing, which checks a cell's closedness on the current collapsed dimension. • The closed cube extends the c-checking to all previously collapsed dimensions. • Result: the number of output cells in a Dwarf cube is always no less than that in the closed cube.

  21. Quotient Cube vs C-Cube • Both compute upper-bound cells. • Quotient Cube uses a depth-first search (QC-DFS), similar to the BUC method. • To ensure that a cell is an upper bound, it scans all the dimensions that are not within the group-by conditions in the current data-set partition. This incurs considerable overhead.

  22. Cubing algorithms: a) BUC uses bottom-up computation to enable Apriori-based pruning. b) Star-Cubing uses a star-tree structure. c) MM-Cubing partitions the data into different subspaces and uses multi-way array aggregation on them (avoiding tree operations). Mining algorithms: use an external checking architecture, a tree or hash table which is used to check the closedness of later outputs. Drawback: all the closed outputs must be maintained in memory.

  23. Closedness is an algebraic measure • Distributive measure: the measure of a data set can be computed from the measures of the parts of that data set. • Algebraic measure: the measure can be computed based on a bounded number of distributive measures. • min, count, and sum are distributive measures. • avg is an algebraic measure: avg(A ∪ B) = (sum(A) + sum(B)) / (count(A) + count(B)).
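
The avg identity can be checked with a short Python sketch. It assumes the two parts A and B are disjoint partitions of the data; the helper names `summarize`, `merge`, and `avg` are made up for this illustration.

```python
def summarize(values):
    """Distributive summary of one part: (sum, count)."""
    return (sum(values), len(values))


def merge(s1, s2):
    """Combine the summaries of two disjoint parts."""
    return (s1[0] + s2[0], s1[1] + s2[1])


def avg(summary):
    """Algebraic measure derived from two distributive measures."""
    total, count = summary
    return total / count


a = [10, 20, 30]
b = [40, 50]
merged = merge(summarize(a), summarize(b))
print(avg(merged))                                     # average of A ∪ B from the summaries
print((avg(summarize(a)) + avg(summarize(b))) / 2)     # averaging the averages differs
```

Only the (sum, count) pair needs to be carried per part. Averaging the per-part averages gives a different number whenever the parts have different sizes, which is why avg is not itself distributive.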

  24. performance Study 1

  25. Inference • C-Cubing is applied to 3 algorithms: C-Cubing(MM), C-Cubing(Star), and C-Cubing(StarArray). • These algorithms outperform the previous approach. • Among them, C-Cubing(MM) is good when iceberg pruning dominates the computation.

  26. Agenda • Problem Description • Analysis of “Bottom-Up Computation of Sparse and Iceberg CUBES” • Analysis of “C-Cubing: Efficient Computation of Closed Cubes by Aggregation-Based Checking” • Analysis of “Iceberg-cube Computation with PC Clusters” • Sequential Algorithm • Parallel Algorithm • Questions

  27. Overview • Developed parallel algorithms for computing Iceberg queries • Replicated Parallel BUC (RP) • Breadth-first writing Partitioned Parallel BUC (BPP) • Affinity SkipList (ASL) • Partitioned Tree (PT)

  28. Replicated Parallel • Defines tasks as subtrees rooted at each of the different dimensions. • For two processors, one would get all cuboids beginning with A and D, and the other would get B and C. • Suffers from two problems, the first being poor load balancing.
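
The load-balancing concern can be made concrete by counting the cuboids in each subtree. The Python sketch below uses the number of cuboids as a rough proxy for work, which is an assumption of this example: real cost also depends on the data and on pruning. The function name `subtree_sizes` and the processor labels P0 and P1 are hypothetical.

```python
from itertools import combinations


def subtree_sizes(dims):
    """Number of cuboids whose first dimension is each given dimension."""
    sizes = {d: 0 for d in dims}
    for k in range(1, len(dims) + 1):
        for cuboid in combinations(dims, k):
            # combinations() keeps the dimension order, so cuboid[0] is the
            # root of the subtree that this cuboid belongs to.
            sizes[cuboid[0]] += 1
    return sizes


sizes = subtree_sizes("ABCD")
# Assignment from the example above: one processor gets A and D, the other B and C.
assignment = {"P0": ["A", "D"], "P1": ["B", "C"]}
load = {p: sum(sizes[d] for d in roots) for p, roots in assignment.items()}
print(sizes)
print(load)
```

The subtree rooted at A alone holds half of the non-empty cuboids, so any assignment of whole subtrees to two processors leaves one of them with more cuboids.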

  29. Breadth-first writing Partitioned Parallel • Range partition the cells so each processor calculates part of each cuboid. • For attribute Ai, dataset is partitioned into n chunks: Ai(1), ... Ai(n) where n is the number of processors.  • There are m attributes in the dataset, and each processor gets m chunks • Partial cuboids are calculated and then merged. • Disadvantage comes from the degree of skew in the dataset, again causing load balancing issues.
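
The partition-then-merge idea and its sensitivity to skew can be sketched in Python. This is a simplified illustration rather than the BPP algorithm itself: it range-partitions on a single attribute, computes one cuboid per chunk, and merges the partial counts. The function names, the boundary value, and the sample rows are assumptions of this example.

```python
from collections import Counter


def range_partition(rows, dim, boundaries):
    """Split rows into len(boundaries) + 1 chunks by the value in column dim."""
    chunks = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        # Index of the chunk = number of boundaries the value has reached.
        idx = sum(row[dim] >= b for b in boundaries)
        chunks[idx].append(row)
    return chunks


def partial_cuboid(chunk, dims):
    """Count the cells of one cuboid using only the tuples in one chunk."""
    return Counter(tuple(row[d] for d in dims) for row in chunk)


# Hypothetical skewed data: most tuples have a small value in column 0.
rows = [(1, "x"), (1, "y"), (1, "x"), (2, "x"), (7, "y"), (9, "x")]
chunks = range_partition(rows, dim=0, boundaries=[5])   # two "processors"
partials = [partial_cuboid(c, dims=(0, 1)) for c in chunks]
merged = sum(partials, Counter())
print([len(c) for c in chunks])   # chunk sizes show the imbalance
print(merged)
```

Even with a boundary in the middle of the value range, the first chunk receives twice as many tuples as the second, which is the skew problem noted above.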

  30. Affinity SkipList • Iteratively reads in the tuples, inserts each tuple into the cell in the skiplist, and updates the aggregate and support counts. • Tasks are defined as the construction of an individual cuboid. • Processor assignment policy is top-down. A processor that has created the skiplist for ABCD should then compute ABC. • ASL's task granularity is too fine and cannot use the pruning to cut down on the work.

  31. Partitioned Tree • Strikes a balance between BPP and ASL task partitioning by using a recursive binary division of the tree into subtrees with equal numbers of nodes. • PT attempts to exploit affinity scheduling as well.

  32. Contributions • This paper goes through several algorithms and discusses the advantages and disadvantages of each. • We plan on using the Partitioned Tree algorithm for our parallel program.

  33. Agenda • Problem Description • Analysis of “Bottom-Up Computation of Sparse and Iceberg CUBES” • Analysis of “C-Cubing: Efficient Computation of Closed Cubes by Aggregation-Based Checking” • Analysis of “Iceberg-cube Computation with PC Clusters” • Sequential Algorithm • Parallel Algorithm • Questions

  34. Sequential Algorithm •   We will be implementing the Iceberg-CUBE BUC algorithm. •   Pseudo-code is presented in the paper with several large gaps where the data structures go. •   We will provide our own dataset to run performance analysis against.

  35. Agenda • Problem Description • Analysis of “Bottom-Up Computation of Sparse and Iceberg CUBES” • Analysis of “C-Cubing: Efficient Computation of Closed Cubes by Aggregation-Based Checking” • Analysis of “Iceberg-cube Computation with PC Clusters” • Sequential Algorithm • Parallel Algorithm • Questions

  36. Parallel Algorithm • We will be implementing the Partitioned Tree algorithm, which is a parallelization of the Iceberg-CUBE computation. • Pseudo-code for this algorithm is also provided in the paper, minus those same tricky data structures. • We will run this parallel algorithm against the same data set we used for the sequential algorithm. • Performance graphs are provided in the paper; hopefully, our results will approach theirs.

  37. References • Jinguo You; Jianqing Xi; Pingjian Zhang, "A Parallel Algorithm for Closed Cube Computation," Computer and Information Science, 2008. ICIS 08. Seventh IEEE/ACIS International Conference on, pp. 95-99, 14-16 May 2008. http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4529804&isnumber=4529780 • Dong Xin; Zheng Shao; Jiawei Han; Hongyan Liu, "C-Cubing: Efficient Computation of Closed Cubes by Aggregation-Based Checking," Data Engineering, 2006. ICDE '06. Proceedings of the 22nd International Conference on, pp. 4-4, 03-07 April 2006. http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=1617372&isnumber=33902 • Ng, R. T., Wagner, A., and Yin, Y. 2001. Iceberg-cube computation with PC clusters. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (Santa Barbara, California, United States, May 21 - 24, 2001). T. Sellis, Ed. SIGMOD '01. ACM, New York, NY, 25-36. http://doi.acm.org/10.1145/375663.375666 • Beyer, K. and Ramakrishnan, R. 1999. Bottom-up computation of sparse and Iceberg CUBE. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (Philadelphia, Pennsylvania, United States, May 31 - June 03, 1999). SIGMOD '99. ACM, New York, NY, 359-370. http://doi.acm.org/10.1145/304182.304214
