
Parallelizing the Data Cube


Presentation Transcript


  1. PhD Oral Defence Todd Eavis July 23, 2003 Parallelizing the Data Cube

  2. Overview • Motivation for Parallel, Relational OLAP • Core Algorithms and Methods • Primary Systems Contributions • Experimental Evaluation and Results • Conclusions and Future Work

  3. Motivation for Parallel, Relational OLAP • Core Algorithms and Methods • Primary Systems Contributions • Experimental Evaluation and Results • Conclusions and Future Work

  4. Why study OLAP and the Data Cube? • On-line Analytical Processing: the foundation for a range of essential business applications • sales and marketing analysis, planning and budgeting • 4 billion dollar industry by 2005 • Data Cube: a core OLAP construct, first proposed in 1996 by Gray et al [GBLP], that supports sophisticated multi-dimensional data analysis • Relevance to the Research Community? Results of Citeseer queries: • OLAP: 797 papers • Data Cube: 362 papers • Our interest: Data Cube Generation and Querying
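
For context, the data cube generalizes GROUP BY: for a fact table with d dimensions it materializes one aggregate view (a cuboid) for every subset of the dimension set, 2^d group-bys in all. A minimal C++ sketch that simply enumerates those views (the dimension names are illustrative, not taken from the thesis):

    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        // Example dimensions; a d-dimensional cube yields 2^d group-bys (views).
        std::vector<std::string> dims = {"product", "store", "date"};
        const std::size_t d = dims.size();

        for (unsigned mask = 0; mask < (1u << d); ++mask) {   // one mask per view
            std::cout << "GROUP BY { ";
            for (std::size_t i = 0; i < d; ++i)
                if (mask & (1u << i)) std::cout << dims[i] << ' ';
            std::cout << "}\n";                               // mask 0 = grand total
        }
    }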

  5. Scale of OLAP Data Warehouses • Average size of production data warehouses currently 700 GB [survey.com / OLAP Report] • Expected to reach 4 TB by 2004 • One third are currently <= 50 GB; within two years that share is expected to fall to just 6% • Biggest data warehouses growing by a factor of 20 [Winter Report] • Biggest expected to exceed 100 TB within 2 years • Our Interest: Exploiting Parallel Algorithms

  6. Fundamental Design Alternatives • MOLAP (Multi-dimensional OLAP) • Materialize data cube as a multi-dimensional array • In theory: implicit indexing. In practice: hybrid schemes for sparse and dense regions • Best for dense, low-dimensional spaces • ROLAP (Relational OLAP) • Store data as relational tables • Requires an explicit multi-dimensional index • Scales well to higher dimensions and higher cardinalities • Our Interest: Highly Scalable ROLAP model
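
The contrast can be made concrete. MOLAP's "implicit indexing" is just arithmetic on dimension coordinates over a dense array, while ROLAP stores only the occupied cells as table rows and needs an explicit index to find them. A toy sketch, with made-up cardinalities and a std::map standing in for a real multi-dimensional index:

    #include <cstddef>
    #include <iostream>
    #include <map>
    #include <tuple>
    #include <vector>

    int main() {
        // MOLAP: a dense array with one slot per possible cell.  "Implicit
        // indexing" means a cell's offset is computed directly from its
        // coordinates -- but every cell, empty or not, occupies storage.
        const std::size_t card_x = 100, card_y = 50, card_z = 20;  // cardinalities
        std::vector<double> molap(card_x * card_y * card_z, 0.0);
        auto offset = [&](std::size_t x, std::size_t y, std::size_t z) {
            return x + card_x * (y + card_y * z);                  // row-major
        };
        molap[offset(3, 7, 1)] = 42.0;

        // ROLAP: only non-empty cells are stored as rows of a relational table;
        // an explicit multi-dimensional index (a std::map stand-in here) locates
        // them, which is what lets sparse, high-dimensional spaces scale.
        using Cell = std::tuple<std::size_t, std::size_t, std::size_t>;
        std::map<Cell, double> rolap;
        rolap[Cell{3, 7, 1}] = 42.0;

        std::cout << molap[offset(3, 7, 1)] << " " << rolap[Cell{3, 7, 1}] << "\n";
    }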

  7. Motivation for Parallel, Relational OLAP • Core Algorithms and Methods • Primary Systems Contributions • Experimental Evaluation and Results • Conclusions and Future Work

  8. Computing the Full Cube in Parallel • Small number of previous projects [GC, LHL, MM, NWY] • Speedup quite limited • Our approach: Parallel Pipesort [DEHR2, DER3] • Model the 2^d views as a “task graph” • Create “Scan Pipelines” [AADG] as a Minimum Cost Spanning Tree using O(d n(m + n log n)) bipartite matching (n = nodes, m = edges) • Partition the task graph into sub-trees with O(p^3 d + p 2^d) augmented k-min-max partitioning [BSP] and distribute the sub-trees to the p processors • Use over-sampling (s * p sub-trees) to improve load balance; high-low pairing in s – 1 rounds approximates the NP-complete balancing problem • Use performance-optimized algorithms for sorting, scanning, and I/O to generate local views
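
The load-balancing step can be illustrated in isolation. The sketch below shows only the high-low pairing idea for oversampling factor s = 2 (the value the experiments later found best): the 2p sub-trees are sorted by estimated cost, then the heaviest is paired with the lightest, the second heaviest with the second lightest, and so on, yielding p roughly even workloads. The spanning-tree construction and the k-min-max partitioning itself are not reproduced, and the weights are invented:

    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <utility>
    #include <vector>

    // Pair the heaviest sub-tree with the lightest, the 2nd heaviest with the
    // 2nd lightest, and so on.  With oversampling factor s = 2 there are 2p
    // sub-trees, so a single pairing round yields p workloads.
    std::vector<std::pair<double, double>>
    high_low_pairing(std::vector<double> weights) {   // estimated sub-tree costs
        std::sort(weights.begin(), weights.end());
        std::vector<std::pair<double, double>> workloads;
        for (std::size_t i = 0, j = weights.size(); i + 1 < j; ++i, --j)
            workloads.emplace_back(weights[j - 1], weights[i]);   // (heavy, light)
        return workloads;
    }

    int main() {
        // Invented costs of 2p = 8 sub-trees for p = 4 processors.
        std::vector<double> w = {90, 10, 70, 30, 55, 45, 80, 20};
        for (auto [heavy, light] : high_low_pairing(w))
            std::cout << heavy << " + " << light << " = " << heavy + light << "\n";
    }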

  9. Computing Partial Cubes in Parallel • Important in practice for environments with higher dimensions and/or specific visualization needs • Little previous work, only partial solutions [BR, GC, SAG] • Our approach: Greedy algorithms for “Schedule Tree” construction [DER2, DER5] • Solution consists of algorithms for generating efficient “Essential Trees” (red) and algorithms for adding beneficial non-selected nodes (blue) • Greedy method: record “state” information in Plan Objects; incrementally add the node with maximum benefit • Pre-sorting “candidate” views by estimated size can reduce the run time from O(n^3) to O(n^2) • O(d n) heuristic extensions for higher-dimensional spaces; a “confidence factor” β limits risk
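
A stripped-down sketch of the greedy skeleton: each selected (essential) view tracks the size of the cheapest parent currently assigned to build it (the "plan" state), and candidate views are added one at a time by maximum estimated benefit. The actual benefit function, essential-tree construction, and plan objects of [DER2, DER5] are considerably richer; the views and size estimates below are invented:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    // A view is the bitmask of the dimensions it groups by; 'size' is its
    // estimated row count (supplied by the view-size estimator).
    struct View { uint32_t dims; double size; };

    bool subset(uint32_t a, uint32_t b) { return (a & b) == a; }

    int main() {
        double base_size = 1e6;   // raw fact table: the default build parent
        // Views the user selected, plus candidates that might be worth
        // materializing as intermediate build nodes (all sizes invented).
        std::vector<View> selected   = {{0b011, 5e4}, {0b001, 1e3}, {0b010, 2e3}};
        std::vector<View> candidates = {{0b111, 4e5}, {0b110, 1e5}};

        // Plan state: size of the cheapest parent assigned to each selected view.
        std::vector<double> cost(selected.size(), base_size);

        while (true) {   // greedily add the candidate with the largest benefit
            std::size_t best = 0;
            double best_benefit = 0;
            for (std::size_t c = 0; c < candidates.size(); ++c) {
                double benefit = -candidates[c].size;   // simplified build-cost charge
                for (std::size_t i = 0; i < selected.size(); ++i)
                    if (subset(selected[i].dims, candidates[c].dims) &&
                        candidates[c].size < cost[i])
                        benefit += cost[i] - candidates[c].size;
                if (benefit > best_benefit) { best_benefit = benefit; best = c; }
            }
            if (best_benefit <= 0) break;               // no helpful node left
            for (std::size_t i = 0; i < selected.size(); ++i)
                if (subset(selected[i].dims, candidates[best].dims))
                    cost[i] = std::min(cost[i], candidates[best].size);
            std::cout << "schedule node with dims mask " << candidates[best].dims << "\n";
            candidates.erase(candidates.begin() + best);
        }
    }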

  10. A Parallel Data Cube Query Engine • Views must be indexed prior to access • Related work: sequential r-trees for the data cube [RYR] and general-purpose parallel r-trees [SL] • Our Approach: Parallel RCUBE [DER1, DER6] • Records ordered along the Hilbert space-filling curve • p-processor round-robin record striping • Construct partial r-tree indexes on each node, “packing” page blocks in Hilbert order • Parallel Query Engine • Combines indexing and OLAP post-processing (query transformation, parallel Sample Sort, record permutation, etc.) • Uses surrogate views to support partial cubes • Supports “linear” dimension hierarchies
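
A sketch of the data placement only: a view's records are ordered by their Hilbert index, striped round-robin across the p nodes, and each node's share is packed into fixed-size leaf pages in that order. The classic 2-D Hilbert computation is used to keep the code short (the thesis works with d-dimensional cells), and the page capacity and records are invented:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <utility>
    #include <vector>

    struct Record { uint32_t x, y; double measure; };

    // Classic 2-D Hilbert index (side length n must be a power of two).
    uint64_t hilbert_index(uint32_t n, uint32_t x, uint32_t y) {
        uint64_t d = 0;
        for (uint32_t s = n / 2; s > 0; s /= 2) {
            uint32_t rx = (x & s) > 0, ry = (y & s) > 0;
            d += (uint64_t)s * s * ((3 * rx) ^ ry);
            if (ry == 0) {                         // rotate/flip the quadrant
                if (rx == 1) { x = s - 1 - x; y = s - 1 - y; }
                std::swap(x, y);
            }
        }
        return d;
    }

    int main() {
        const std::size_t p = 4, page_cap = 2;     // processors, records per leaf page
        std::vector<Record> recs = {{1,5,1.0},{7,2,2.0},{3,3,3.0},{0,0,4.0},
                                    {6,6,5.0},{2,7,6.0},{5,1,7.0},{4,4,8.0}};

        // 1. Order records along the Hilbert curve so nearby cells stay nearby.
        std::sort(recs.begin(), recs.end(), [](const Record& a, const Record& b) {
            return hilbert_index(8, a.x, a.y) < hilbert_index(8, b.x, b.y);
        });

        // 2. Round-robin stripe across p nodes, then "pack" each node's share
        //    into fixed-size leaf pages, preserving the Hilbert order.
        std::vector<std::vector<std::vector<Record>>> nodes(p);  // [node][page][record]
        for (std::size_t i = 0; i < recs.size(); ++i) {
            auto& pages = nodes[i % p];
            if (pages.empty() || pages.back().size() == page_cap)
                pages.emplace_back();
            pages.back().push_back(recs[i]);
        }
        for (std::size_t nd = 0; nd < p; ++nd)
            std::cout << "node " << nd << ": " << nodes[nd].size() << " page(s)\n";
    }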

  11. The Virtual Data Cube • Motivation: Hide the complexity of Data Cube algorithms and implementation • Requires no knowledge of: • Format or extent of indexing • Degree of materialization (full or partial) • Representation of hierarchies • Physical order of view attributes • Degree of parallelism
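
A sketch of what such an abstraction might look like as an interface; the class and method names are illustrative, not the thesis API:

    #include <iostream>
    #include <string>
    #include <vector>

    // One range restriction on a named dimension (e.g. date between two values).
    struct RangePredicate { std::string dimension, low, high; };
    struct QueryResult    { std::vector<std::vector<std::string>> rows; };

    // The "virtual cube": clients pose multi-dimensional queries as if every
    // view existed locally.  Whether the target view is materialized or must be
    // answered from a surrogate, how it is indexed, how hierarchies are encoded,
    // the physical attribute order, and the degree of parallelism are all hidden.
    class VirtualCube {
    public:
        virtual ~VirtualCube() = default;
        virtual QueryResult query(const std::vector<std::string>& groupBy,
                                  const std::vector<RangePredicate>& where) = 0;
    };

    // Stand-in back end; a real implementation would route the request to the
    // parallel ROLAP engine described on the previous slides.
    class StubCube : public VirtualCube {
    public:
        QueryResult query(const std::vector<std::string>&,
                          const std::vector<RangePredicate>&) override {
            return QueryResult{{{"(no data in stub)"}}};
        }
    };

    int main() {
        StubCube cube;
        QueryResult r = cube.query({"product", "store"},
                                   {{"date", "2003-01-01", "2003-06-30"}});
        std::cout << r.rows.size() << " row(s)\n";
    }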

  12. Motivation for Parallel, Relational OLAP • Core Algorithms and Methods • Primary Systems Contributions • Experimental Evaluation and Results • Conclusions and Future Work

  13. Systems Overview • Full, robust Data Cube “prototype” [DER4] • Approximately 20,000 lines of code • C/C++, LEDA, MPI, STL • Template-based graph algorithms • Designed for, and evaluated on, contemporary parallel machines: • “Shared nothing” Linux cluster (Dalhousie) • “Shared disk” SunFire multi-processor (HPCVL) • Supporting systems include: • Flex/Bison based data generator • Batch query generator • View Subset generator

  14. Key Performance Issues • Dynamic selection of “best” sorting algorithm • Radix sort versus quicksort • Minimization of data movement • Use of horizontal and vertical indirection • New pipeline aggregation algorithm • “Lazy” aggregation • Streamlined I/O • I/O manager • Independent I/O and computation threads
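
To illustrate the first point, a sketch of run-time sort selection: a linear-time LSD radix sort for narrow keys or very large inputs, a comparison sort otherwise. The threshold test here is a crude stand-in; the thesis drives this decision through the cost model on the next slide:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // LSD radix sort on 32-bit keys: linear time, but it touches every record
    // once per 8-bit digit, so it only wins when that fixed work is cheaper
    // than n log n comparisons.
    void radix_sort(std::vector<uint32_t>& keys) {
        std::vector<uint32_t> buf(keys.size());
        for (int shift = 0; shift < 32; shift += 8) {
            std::size_t count[257] = {0};
            for (uint32_t k : keys) ++count[((k >> shift) & 0xFF) + 1];
            for (int i = 0; i < 256; ++i) count[i + 1] += count[i];
            for (uint32_t k : keys) buf[count[(k >> shift) & 0xFF]++] = k;
            keys.swap(buf);
        }
    }

    // Pick the sort dynamically: a simple stand-in for a cost-model-driven
    // choice between linear-time and comparison-based sorting.
    void dynamic_sort(std::vector<uint32_t>& keys, uint32_t max_key) {
        if (keys.size() > 1'000'000 || max_key < (1u << 16))   // large n or narrow keys
            radix_sort(keys);
        else
            std::sort(keys.begin(), keys.end());
    }

    int main() {
        std::vector<uint32_t> keys = {42, 7, 100000, 3, 99, 7};
        dynamic_sort(keys, 100000);            // keys is now sorted ascending
    }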

  15. Costing Model • Sophisticated cost model, common to both the full and partial cube [DEHR1] • Based upon a view-size estimator • Probabilistic counting technique • Experimentally supported metrics for: • Dynamic sorting (linear-time versus comparison-based) • In-memory scanning and data movement • Machine-specific “Read” and “Write” I/O • Dynamically weighs the impact of computation versus I/O
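
The slide does not name the exact estimator, so the sketch below uses the classic Flajolet-Martin probabilistic counter as one representative technique: hash every group-by key, record which least-significant-set-bit positions have occurred, and derive a distinct-count (view size) estimate in one pass with constant memory. A single bitmap gives only a coarse estimate; practical estimators average many of them:

    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <string>

    // Flajolet-Martin style probabilistic counter.
    class FMCounter {
        uint64_t bitmap_ = 0;
    public:
        void add(const std::string& key) {
            uint64_t h = std::hash<std::string>{}(key);
            unsigned r = 0;                     // position of lowest set bit of the hash
            while (r < 63 && !((h >> r) & 1)) ++r;
            bitmap_ |= (1ull << r);
        }
        double estimate() const {
            unsigned R = 0;                     // lowest bit position never observed
            while (R < 63 && ((bitmap_ >> R) & 1)) ++R;
            return (1ull << R) / 0.77351;       // FM correction factor
        }
    };

    int main() {
        FMCounter c;
        for (int i = 0; i < 50000; ++i)
            c.add("customer" + std::to_string(i % 10000));   // 10000 distinct keys
        std::cout << "estimated distinct group-by keys: " << c.estimate() << "\n";
    }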

  16. A Better Search Strategy • Standard r-tree search strategy employs Depth First Search • Our approach: Linear Breadth First Search • Map the search algorithm to the linearly ordered levels of the packed index • Resolve query with a left-to-right, top-to-bottom walk of the tree • Disk head never moves backwards • Resolution consists of a sequence of clustered scans • Degrades gracefully to a sequential scan of index + sequential scan of data
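
A sketch of the traversal idea on a simplified index: levels are stored as flat, left-to-right arrays (1-D intervals stand in for the multi-dimensional bounding boxes of the packed r-tree), and the query is resolved level by level so candidate pages are always visited in increasing on-disk position:

    #include <cstddef>
    #include <iostream>
    #include <numeric>
    #include <vector>

    // One node of a packed index level: a bounding interval over the sort key
    // plus the contiguous run of children it owns on the next level.
    struct Node { int lo, hi; std::size_t child_begin, child_end; };

    // Linear BFS: scan each level left to right, keep only nodes overlapping
    // the query, and collect their (contiguous) children for the next level.
    // Because every level is packed in on-disk order and candidates are
    // processed in increasing position, the disk head never moves backwards.
    std::vector<std::size_t>
    linear_bfs(const std::vector<std::vector<Node>>& levels, int qlo, int qhi) {
        std::vector<std::size_t> candidates(levels[0].size());
        std::iota(candidates.begin(), candidates.end(), 0);   // whole top level
        for (std::size_t lvl = 0; lvl < levels.size(); ++lvl) {
            std::vector<std::size_t> next;                    // stays sorted ascending
            for (std::size_t idx : candidates) {
                const Node& n = levels[lvl][idx];
                if (n.hi < qlo || n.lo > qhi) continue;       // prune: no overlap
                if (lvl + 1 == levels.size())
                    next.push_back(idx);                      // leaf page hit
                else
                    for (std::size_t c = n.child_begin; c < n.child_end; ++c)
                        next.push_back(c);                    // descend one level
            }
            candidates.swap(next);
        }
        return candidates;   // leaf pages to read, already in on-disk order
    }

    int main() {
        // Toy two-level packed index over the key range [0, 99].
        std::vector<std::vector<Node>> levels = {
            {{0, 49, 0, 2}, {50, 99, 2, 4}},                                 // level 0
            {{0, 24, 0, 0}, {25, 49, 0, 0}, {50, 74, 0, 0}, {75, 99, 0, 0}}  // leaves
        };
        for (std::size_t leaf : linear_bfs(levels, 30, 60))
            std::cout << "fetch leaf page " << leaf << "\n";  // pages 1 and 2
    }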

  17. Motivation for Parallel, Relational OLAP • Core Algorithms and Methods • Primary Systems Contributions • Experimental Evaluation and Results • Conclusions and Future Work

  18. Experimental Evaluation • Default test environment includes: • 16 to 24 processors • 2 million records / 80 MB • 4 to 14 dimensions • Random query batches • 24-node Linux cluster, 16-node SunFire MP (disk array) • Parallel speedup approaching linear for all components • Efficiency between 80 and 95% • [Charts: Partial Cube, Full Cube, Query Processing]

  19. Full Cube Evaluation • Shared Disk • 80 – 90% efficiency • Disk array is the bottleneck • Optimized pipeline processing • Order-of-magnitude improvement • Oversampling factor • SF = 2 consistently best

  20. Partial Cube Evaluation • Tree pruning with the confidence factor (at 14 dimensions) • Can eliminate up to 60% of the original tree • Virtually no reduction in tree quality • Partial cube of 3 dimensions or less • Reductions over the “naïve” method of 65 – 70% • Using the partial cube algorithms for the full cube • All within 6% of the “best” benchmark • Recursive algorithm within 0.1%

  21. Query Evaluation • Overhead of using surrogate views (1 to 16 processors) • Run time on materialized views versus time when those views were unavailable • Record retrieval imbalance (16 processors) • Only 0.3% from optimal load balance • Ratio of blocks retrieved to required seeks • Random query batches • Up to 140:1 for large, sparse spaces

  22. Motivation for Parallel, Relational OLAP • Core Algorithms and Methods • Primary Systems Contributions • Experimental Evaluation and Results • Conclusions and Future Work

  23. Thesis Conclusions • ROLAP a viable alternative to MOLAP in parallel setting • Partial cubes can be efficiently generated • ROLAP cubes can be efficiently indexed • Virtual cube abstraction can be efficiently supported

  24. Research Highlights • First parallel ROLAP system in the Data Cube literature • A balanced approach to data cube research • Algorithm design • Systems engineering • Extensive performance analysis • Evaluated on contemporary parallel machines • Commodity-style shared nothing cluster • Shared disk architectures • Integration of three “independent” data cube research projects into a single cohesive OLAP framework – the Virtual Cube

  25. Future Work • Automated partial cube specification • Extension of the “virtual cube” • Parallel query optimization • Support beyond range queries and linear hierarchies • High-volume query environments • OLAP visualization • New projects are building on the current base • Generation of Iceberg Cubes • Mining of association rules

  26. Thank You!

References: The Data Cube Literature
• GBLP J. Gray, A. Bosworth, A. Layman and H. Pirahesh, Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals, ICDE, 1996
• GC S. Goil and A. Choudhary, A Parallel Scalable Infrastructure for OLAP and Data Mining, IDEAS, 1999
• LHL H. Lu, X. Huang and Z. Li, Computing Data Cubes Using Massively Parallel Processors, PCW '97, 1997
• MM S. Muto and M. Kitsuregawa, A Dynamic Load Balancing Strategy for Parallel Datacube Computation, WDWO, 1999
• NWY R. Ng, A. Wagner and Y. Yin, Iceberg-cube Computation with PC Clusters, SIGMOD, 2001
• AADG S. Agarwal, R. Agrawal, P. Deshpande, A. Gupta, J. Naughton, R. Ramakrishnan and S. Sarawagi, On the Computation of Multidimensional Aggregates, VLDB, 1996
• BSP R. Becker, S. Schach and Y. Perl, A shifting algorithm for min-max tree partitioning, Journal of the ACM, 1982
• BR K. Beyer and R. Ramakrishnan, Bottom-up computation of sparse and Iceberg CUBEs, SIGMOD, 1999
• RYR N. Roussopoulos, Y. Kotidis and M. Roussopoulos, Cubetree: Organization of and bulk incremental updates on the data cube, SIGMOD, 1997
• SL B. Schnitzer and S. Leutenegger, Master-client R-trees: a new parallel R-tree architecture, SSDBM, 1999

References: Our Own Virtual Data Cube Research
• DEHR1 F. Dehne, T. Eavis, S. Hambrusch and A. Rau-Chaplin, Parallelizing the Data Cube, Parallel and Distributed Databases: An International Journal, 2001
• DER1 F. Dehne, T. Eavis and A. Rau-Chaplin, Distributed Multi-dimensional ROLAP Indexing for the Data Cube, CCGrid, 2003
• DER2 F. Dehne, T. Eavis and A. Rau-Chaplin, Computing Partial Data Cubes for Parallel Data Warehousing Applications, Euro PVM/MPI, 2001
• DER3 F. Dehne, T. Eavis and A. Rau-Chaplin, Coarse Grained Parallel On-Line Analytical Processing (OLAP) for Data Mining, ICCS, 2001
• DER4 F. Dehne, T. Eavis and A. Rau-Chaplin, A Cluster Architecture for Parallel Data Warehousing, CCGrid, 2001
• DEHR2 F. Dehne, S. Hambrusch, T. Eavis and A. Rau-Chaplin, Parallelizing the Data Cube, ICDT, 2001
• CHER Y. Chen, F. Dehne, T. Eavis and A. Rau-Chaplin, Parallel ROLAP Data Cube Construction on Shared Nothing Multi-Processors, IPDPS, 2003
• DER5 F. Dehne, T. Eavis and A. Rau-Chaplin, Computing Partial Data Cubes, submitted to HICSS, 2003
• DER6 F. Dehne, T. Eavis and A. Rau-Chaplin, RCUBE: Parallel Multi-Dimensional ROLAP Indexing, submitted to Data Mining and Knowledge Discovery, 2003
