Ben Holm Jagadeeshwaran Ranganathan Aric Schorr


Agenda • Problem Description • Analysis of “Bottom-Up Computation of Sparse and Iceberg CUBES” • Analysis of “C-Cubing: Efficient Computation of Closed Cubes by Aggregation-Based Checking” • Analysis of “Iceberg-cube Computation with PC Clusters” • Sequential Algorithm • Parallel Algorithm • Questions

What is a data cube? A data cube is the structure created by a CUBE operation, which aggregates over every combination of values for every subset of the columns in a dataset. "Creating a data cube requires generating the power set (set of all subsets) of the aggregation columns." - Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals
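The power-set idea can be sketched in a few lines of Python. This is only an illustration with count as the aggregate; the function name `full_cube`, the `"ALL"` placeholder, and the three sample rows are assumptions made for the example, not part of any of the papers.

```python
from itertools import combinations
from collections import Counter

def full_cube(rows, dims):
    """Count every combination of values for every subset of the dimensions."""
    cube = {}
    n = len(dims)
    for k in range(n + 1):
        for group in combinations(range(n), k):
            # Dimensions outside the grouping are collapsed to "ALL".
            counts = Counter(
                tuple(row[i] if i in group else "ALL" for i in range(n))
                for row in rows
            )
            cube[tuple(dims[i] for i in group)] = counts
    return cube

# Hypothetical sample data: (Model, Year, Color)
rows = [("Chevy", 1970, "green"), ("Chevy", 1970, "red"), ("Ford", 1971, "green")]
cube = full_cube(rows, ("Model", "Year", "Color"))
print(len(cube))  # one grouping per subset of the 3 dimensions
```

With d = 3 dimensions the dictionary holds 2^3 = 8 groupings, which is why the cost grows so quickly as dimensions are added.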

Data cube as SQL - Almost • New keyword 'ALL' gathers all values from that column • Even with this added syntax, the full power set would not be generated.
SELECT 'ALL', 'ALL', 'ALL', SUM(Sales) FROM Sales WHERE Model = 'Chevy'
UNION
SELECT Model, 'ALL', 'ALL', SUM(Sales) FROM Sales WHERE Model = 'Chevy' GROUP BY Model
UNION
SELECT Model, Year, 'ALL', SUM(Sales) FROM Sales WHERE Model = 'Chevy' GROUP BY Model, Year
UNION
SELECT Model, Year, Color, SUM(Sales) FROM Sales WHERE Model = 'Chevy' GROUP BY Model, Year, Color;

Problems • A data cube can provide valuable data. • A full data cube has 2^d groupings • It takes a long time to compute • It takes a lot of storage Better algorithms must be found . . .

Why Iceberg Cube • A standard CUBE operation requires 2^d groupings for d dimensions. • Each grouping can contain as many cells as the product of the cardinalities of its dimensions. • For sparse data sets, many of these groupings are uninteresting - few or no results • Iceberg-CUBE computes only the groupings that satisfy an aggregate condition specified by the user • Iceberg-CUBE is computed bottom up, allowing partitions to be pruned as soon as they fail the condition.

Sparse Cubes • The research used weather data from September 1985. • Nine dimensions, more than a million tuples in the dataset • The CUBE on this dataset has more than two hundred million tuples - 200 times the size of the dataset • Computing only the group-bys that aggregate two or more input tuples requires 50 times the input size. • Larger minimum-support values require even less space • By selecting only group-bys that perform at least a little aggregation, I/O time can be drastically reduced.

Iceberg-CUBE as SQL
SELECT A, B, C, COUNT(*), SUM(X) FROM R CUBE BY A, B, C HAVING COUNT(*) >= N
• N is called the minimum support of a grouping (minsup) • A minsup of 1 is exactly the same as a full CUBE • A precomputed iceberg cube with minsup = N can be used to answer queries HAVING COUNT(*) >= M where M >= N
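The reuse property in the last bullet can be checked with a naive sketch. The function below counts every cell and only then filters, so it shows the semantics of the query rather than an efficient algorithm; `iceberg_cube`, the `"*"` placeholder, and the sample rows are assumptions made for the example.

```python
from itertools import combinations
from collections import Counter

def iceberg_cube(rows, n_dims, minsup):
    """Naive iceberg cube: count every cell, keep those with count >= minsup."""
    counts = Counter()
    for row in rows:
        for k in range(n_dims + 1):
            for group in combinations(range(n_dims), k):
                cell = tuple(row[i] if i in group else "*" for i in range(n_dims))
                counts[cell] += 1
    return {cell: c for cell, c in counts.items() if c >= minsup}

# Hypothetical sample data over dimensions A, B, C
rows = [("a1", "b1", "c1"), ("a1", "b1", "c2"),
        ("a1", "b2", "c2"), ("a2", "b1", "c1")]

cube_n2 = iceberg_cube(rows, 3, minsup=2)   # precomputed with N = 2
# A query with M = 3 >= N is answered by filtering the precomputed cube.
from_precomputed = {cell: c for cell, c in cube_n2.items() if c >= 3}
print(from_precomputed == iceberg_cube(rows, 3, minsup=3))
```

Filtering works because every cell with count >= M also has count >= N, so nothing the stricter query needs was discarded.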

Iceberg-CUBE – New Algorithm • Pruning reduces final size of the cube • Early pruning can reduce the computation time of the cube • A new algorithm was created that computes the cube from the bottom up, pruning as it progresses through the data. • BUC - Bottom Up Computation

Iceberg-CUBE – BUC – How’s it work? • BUC proceeds from the bottom of the lattice (the most aggregated group-bys, with the fewest dimensions) and works upward toward more detailed ones. All previous algorithms work in the opposite direction. • Example: A = chevy, B = 1970, C = green • When a grouping falls below minsup, all of its ancestors (the more detailed groupings that refine it) are ignored, because their size must be <= the size of the child.

Iceberg-CUBE – How will we use it • This paper goes into some detail about how to implement Iceberg-CUBE. We plan to implement this algorithm and demonstrate how it may be 'parallelized'.

Nomenclature of C-Iceberg • A cell c = (a1, a2, . . . , an : m) (where m is a measure) is called a k-dimensional group-by cell (i.e., a cell in a k-dimensional cuboid) if exactly k of the values a1, . . . , an are not ∗. • Definition of a closed cell: given two cells c = (a1, a2, . . . , an : m) and c′ = (a′1, a′2, . . . , a′n : m′), we write V(c) ≤ V(c′) if a′i = ai for every ai (i = 1, . . . , n) that is not ∗. • A cell c is said to be covered by another cell c′ if M(c″) = M(c′) for every c″ such that V(c) ≤ V(c″) ≤ V(c′). • A cell is a closed cell if it is not covered by any other cell.

How to evaluate a C-Cube: an example • Relational table with four attributes. Let the measure be count, and the iceberg constraint be count ≥ 2. • cell1 = (a1, b1, c1, ∗ : 2) and cell2 = (a1, ∗, ∗, ∗ : 3) are closed iceberg cells. • cell3 = (a1, ∗, c1, ∗ : 2) and cell4 = (a1, b2, c2, d2 : 1) are not: the former is covered by cell1, whereas the latter does not satisfy the iceberg constraint.
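The example can be reproduced by brute force. The three-row table below is an assumption chosen to be consistent with the counts quoted on the slide, and the check compares every pair of cells, which is far more expensive than the aggregation-based checking the paper proposes. For the count measure, a cell is covered exactly when some strictly more specific cell has the same count.

```python
from itertools import combinations
from collections import Counter

def all_cells(rows, n):
    """Count of every cell in the full cube ("*" marks a collapsed dimension)."""
    counts = Counter()
    for row in rows:
        for k in range(n + 1):
            for group in combinations(range(n), k):
                counts[tuple(row[i] if i in group else "*" for i in range(n))] += 1
    return counts

def generalizes(c, c2):
    """True when V(c) <= V(c2): every non-* value of c also appears in c2."""
    return all(a == "*" or a == b for a, b in zip(c, c2))

def closed_iceberg_cells(rows, n, minsup):
    counts = all_cells(rows, n)
    closed = {}
    for c, m in counts.items():
        covered = any(c2 != c and generalizes(c, c2) and m2 == m
                      for c2, m2 in counts.items())
        if m >= minsup and not covered:
            closed[c] = m
    return closed

# Assumed base table over attributes A, B, C, D
rows = [("a1", "b1", "c1", "d1"), ("a1", "b1", "c1", "d3"), ("a1", "b2", "c2", "d2")]
closed = closed_iceberg_cells(rows, 4, minsup=2)
print(closed)
```

On this table only cell1 and cell2 survive: cell3 is covered by cell1 (same count of 2), and cell4 fails count ≥ 2.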

Lemma 1 • If a cell is not closed on measure count, it cannot be closed w.r.t. any other measure. • If a cell c1 is not closed, there exists a cell c2 which covers c1. • Reason: c1 and c2 are aggregated from the same group of tuples, so every measure takes the same value on both. • Conversely, if m is strictly monotonic or anti-monotonic, then closedness on measure count is equivalent to closedness on measure m.

a) Two main processes in closed cubing: 1) closedness checking (c-checking) 2) pruning non-closed cells/itemsets (c-pruning) b) Two major approaches to closedness checking: 1) output-based c-checking: checks the closedness of a result by comparing it with previous outputs 2) tuple-based c-checking: scans the tuples in the data partition to collect the closedness information. Both methods suffer considerable overhead. * C-Cubing's alternative: closedness checking by aggregation

a) Why compression? The data cube suffers from the curse of dimensionality. b) Previous work on lossless compression: 1) Condensed cubes 2) Dwarf 3) Quotient Cube

Dwarf Based on an analysis of Dwarf complexity, the size of a Dwarf cube is O(T^(1 + 1/log_D C)), where T = number of tuples, D = number of dimensions, C = cardinality of each dimension. Authors' inference: since C is generally larger than D, the size complexity of the Dwarf cube does not grow exponentially with the number of dimensions.

Dwarf vs. C-Cube • Dwarf's compression comes from suffix coalescing, which checks cells’ closedness only on the current collapsed dimension. • The closed cube extends c-checking to all previously collapsed dimensions. • Result: the number of output cells in a Dwarf cube is never less than that in the closed cube.

Quotient Cube vs. C-Cube • Both compute upper-bound cells. • Quotient Cube uses a depth-first search (QC-DFS), similar to the BUC method. • To confirm that a cell is an upper bound, QC-DFS scans all the dimensions that are not in the group-by conditions within the current data partition. This incurs considerable overhead.

Cubing algorithms a) BUC: uses bottom-up computation for Apriori-based pruning. b) Star-Cubing: uses a star-tree structure. c) MM-Cubing: partitions the data into different subspaces and uses multi-way array aggregation on them (avoids tree operations). Mining algorithms - External checking architecture: a tree or hash table is used to check the closedness of later outputs. Drawback - all the closed outputs must be maintained in memory.

Closedness is an algebraic measure • Distributive measure: the measure of the whole data set can be computed solely from the measures of the parts of that data set. • Algebraic measure: the measure can be computed from a bounded number of measures of the parts of that data set. • min, count, and sum are distributive measures • avg is an algebraic measure: avg(A ∪ B) = (sum(A) + sum(B)) / (count(A) + count(B)) for disjoint A and B.
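The avg formula is easy to verify numerically. In the sketch below each part is summarized by the pair (sum, count), both of which are distributive, and the average of the union is derived from the merged pair; the helper names and the sample values are assumptions made for the example.

```python
def summarize(values):
    """Summarize a partition by two distributive measures: (sum, count)."""
    return (sum(values), len(values))

def merge(p, q):
    """Combine the summaries of two disjoint partitions."""
    return (p[0] + q[0], p[1] + q[1])

def avg(summary):
    """Algebraic measure derived from a bounded number (two) of summaries."""
    total, count = summary
    return total / count

A = [2.0, 4.0]
B = [6.0, 8.0, 10.0]
merged = merge(summarize(A), summarize(B))
print(avg(merged))                                   # average of A ∪ B
print((avg(summarize(A)) + avg(summarize(B))) / 2)   # averaging the averages differs
```

Averaging the two partial averages gives a different number, which is why avg by itself is not distributive and the (sum, count) pair has to be carried along.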

Performance Study 1

Inference • C-Cubing is implemented on three algorithms: C-Cubing(MM), C-Cubing(Star), and C-Cubing(StarArray). • All three algorithms outperform the previous approach. • Among them, C-Cubing(MM) is good when iceberg pruning dominates the computation.

Overview • Developed parallel algorithms for computing Iceberg queries • Replicated Parallel BUC (RP) • Breadth-first writing Partitioned Parallel BUC (BPP) • Affinity SkipList (ASL) • Partitioned Tree (PT)

Replicated Parallel • Defines tasks as subtrees rooted at each of the different dimensions • For two processors, one would get all cuboids beginning with A and D, and the other would get B and C. • Suffers from two problems, the first being poor load balancing.
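A rough way to see the load-balancing problem is to count the cuboids in each subtree for the two-processor split described above. Cuboid count is only a proxy (real cost depends on the data and on how much pruning occurs), and the names `subtree_cuboids`, `P0`, and `P1` are assumptions made for this sketch.

```python
def subtree_cuboids(dims):
    """Number of cuboids in the BUC processing subtree rooted at each dimension.

    The subtree rooted at dimension i holds every cuboid whose first
    attribute is dims[i], i.e. 2 ** (d - 1 - i) cuboids.
    """
    d = len(dims)
    return {dim: 2 ** (d - 1 - i) for i, dim in enumerate(dims)}

sizes = subtree_cuboids("ABCD")
# Split from the slide: one processor takes A and D, the other B and C.
assignment = {"P0": ["A", "D"], "P1": ["B", "C"]}
load = {p: sum(sizes[t] for t in tasks) for p, tasks in assignment.items()}
print(sizes)
print(load)
```

The subtree under A alone holds 8 of the 15 non-empty cuboids, so with only four coarse tasks there is no assignment that splits the cuboids evenly between two processors.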

Breadth-first writing Partitioned Parallel • Range partition the cells so each processor calculates part of each cuboid. • For attribute Ai, dataset is partitioned into n chunks: Ai(1), ... Ai(n) where n is the number of processors. • There are m attributes in the dataset, and each processor gets m chunks • Partial cuboids are calculated and then merged. • Disadvantage comes from the degree of skew in the dataset, again causing load balancing issues.

Affinity SkipList • Iteratively reads in the tuples, inserts each tuple into the cell in the skiplist, and updates the aggregate and support counts. • Tasks are defined as the construction of an individual cuboid. • Processor assignment policy is top-down. A processor that has created the skiplist for ABCD should then compute ABC. • ASL's task granularity is too fine and cannot use the pruning to cut down on the work.

Partitioned Tree • Balances between BPP and ASL task partitioning by using a recursive binary division of the tree into subtrees with equal numbers of nodes. • PT attempts to exploit affinity scheduling as well.

Contributions • This paper goes through several algorithms and discusses the common advantages and disadvantages of each. • We plan to use the Partitioned Tree algorithm for our parallel program.

Sequential Algorithm • We will be implementing the Iceberg-CUBE BUC algorithm. • Pseudo-code is presented in the paper with several large gaps where the data structures go. • We will provide our own dataset to run performance analysis against.

Parallel Algorithm • We will be implementing the Partitioned Tree algorithm, which is a parallelization of Iceberg-CUBE BUC • Pseudo-code for this algorithm is also provided in the paper - minus those same tricky data structures. • We will run this parallel algorithm against the same data set we used for the sequential algorithm. • Performance graphs are provided in the paper - hopefully, our results will approach theirs.

References
• Jinguo You; Jianqing Xi; Pingjian Zhang, "A Parallel Algorithm for Closed Cube Computation," Computer and Information Science, 2008. ICIS 08. Seventh IEEE/ACIS International Conference on, pp. 95-99, 14-16 May 2008. http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4529804&isnumber=4529780
• Dong Xin; Zheng Shao; Jiawei Han; Hongyan Liu, "C-Cubing: Efficient Computation of Closed Cubes by Aggregation-Based Checking," Data Engineering, 2006. ICDE '06. Proceedings of the 22nd International Conference on, pp. 4-4, 03-07 April 2006. http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=1617372&isnumber=33902
• Ng, R. T., Wagner, A., and Yin, Y. 2001. Iceberg-cube computation with PC clusters. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (Santa Barbara, California, United States, May 21 - 24, 2001). T. Sellis, Ed. SIGMOD '01. ACM, New York, NY, 25-36. http://doi.acm.org/10.1145/375663.375666
• Beyer, K. and Ramakrishnan, R. 1999. Bottom-up computation of sparse and Iceberg CUBE. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (Philadelphia, Pennsylvania, United States, May 31 - June 03, 1999). SIGMOD '99. ACM, New York, NY, 359-370. http://doi.acm.org/10.1145/304182.304214
