Efficient OLAP Query Processing for Distributed Data Warehouses

Efficient OLAP Query Processing for Distributed Data Warehouses Michael O. Akinde, SMHI, Sweden & NDB, Aalborg University, Denmark Michael H. Böhlen, NDB, Aalborg University, Denmark Theodore Johnson, AT&T Labs-Research, USA Laks V. S. Lakshmanan, University of British Columbia, Canada Divesh Srivastava, AT&T Labs-Research, USA Michael O. Akinde EDBT’2002 -- March 24-28, Prague

Motivation • Analysis of network data • Collect, correlate, and analyze data across the network • Huge amounts of decentrally collected data • Complex OLAP operations • Performed using ad-hoc Perl scripts • Experiment: OLAP technology • Pro: Improves specification, performance • Con: Expensive (or data loss) when centralized • Existing centralized OLAP tools are inadequate

DWs close to the data collection points Network Administrators & Control Systems Source Coordinator Source Query Coordinator Source DW DW DW Solution:Distributed Datawarehouse • Local DW at each collection point (e.g., router) • Compute queries across multiple DWs • A technology is needed fordistributed processing of complex OLAP queries Application Application

Complex OLAP Queries • Examples: • Network usage: For each IP address, what fraction of the total number of flows is due to web traffic? • Principal components: On an hourly basis, what fraction of the total traffic is from IP subnets whose total hourly traffic is within 10% of the maximum? • Pattern identification: Break down all flows recorded on US election day by all possible combinations of source AS, destination AS, and protocol • Diverse OLAP queries, involving pivots, correlations, and multiple levels of aggregation

Skalla • Translates OLAP queries expressed with extended algebra, into distributed query evaluation plans • Salient Features: • Efficiently handles a significant variety of complex OLAP queries (incl. pivots, correlations,etc.) • Only partial results are shipped between the sites and the coordinator -- never subsets of the detail data • No site to site communication

Extended Algebra: GMDJ • Algebraic OLAP operator [Chatziantoniou et al., 2001] • Salient feature: Splits grouping and aggregation • MD(B, R, l,) • B is the base-values table (the “groups”) • R is the detail table (fact data) • l is the list of aggregate functions •  : possibly complex condition over B and R describing what fact data is to be aggregated • Result: The table B extended with the aggregates in l

Extended Algebra: Example • For each IP address, what fraction of the total number of flows is due to web traffic? MD ( MD(IPT, Flows, (Cnt1), (IPT.key = Flows.key)), Flows, (Cnt2), (IPT.key = Flows.key and Flows.Source = WEB )) • Result of inner GMDJ: (IPT, Cnt1) • Result of outer GMDJ: (IPT, Cnt1, Cnt2) • Sequences of GMDJs instead of multiple aggregate-join expressions

Administers local GMDJ queries Coordinator Coordinator Source Query Engine Query Engine Mediator Mediator Source • Computes distributed query plans • Synchronizes sub-results Source DW DW DW Skalla Architecture & Evaluation Skalla Site Wrapper Application Site Wrapper Application Site Wrapper • Skalla Evaluation Rounds: • Computation of GMDJ at the local DWs • Synchronize sub-results at coordinator

Coordinator Build Groups Synchronize Synchronize result S1 • Distribute • Compute • Aggregates S2 • Distribute • Compute • Aggregates DW DW Skalla Evaluation: Example • For each IP address, what fraction of the total number of flows is due to web traffic?

Skalla Evaluation: Features • Each round of computation in the distributed query evaluation computes a single GMDJ. • Features of the Evaluation: • Semantics of the query plans ensure that the amount of data shipped by the algorithm is dependent on the number of groups and aggregate functions and independent of the size of the fact relation in the database! • The algorithm permits for a wide variety of optimizations

Optimizations: Group Reduction • During processing, we only ship data that has actually been changed • Example: • Query: For each IP address, what fraction of the total number of flows is due to web traffic? • Each local DW receives a base-values table containing all source data • Coordinator has a copy of base-values table • Local DWs ship only those tuples back that have actually been changed

DW DW Group Reduction: Example Coordinator Build Groups Synchronize Synchronize result S1 • Distribute • Compute • Aggregates S2 • Distribute • Compute • Aggregates • For each IP address, what fraction of the total number of flows is due to web traffic?

Optimizations: Synchronization Reduction • It is possible to detect cases where no synchronization is required between passes. • Example: • DW data: All the flows of an autonomous system are always registered (stored) at a particular local DW • Query: For each IP address, what fraction of the total number of flows is due to web traffic? • Each IP address belongs to a particular autonomous system; i.e., all data for a particular IP address is located at the system storing the flows of its autonomous system

DW DW Synch Reduction: Example (1) • For each IP address, what fraction of the total number of flows is due to web traffic? Coordinator Build Groups Synchronize Synchronize result S1 • Distribute • Compute • Aggregates S2 • Distribute • Compute • Aggregates

DW DW Synch Reduction: Example (2) • For each IP address, what fraction of the total number of flows is due to web traffic? Coordinator Build Groups Synchronize result S1 • Distribute • Compute • Aggregates S2

Experiment: Number of Sites (GR)

Experiment: Number of Sites (SR)

Experiments: Size of Database

Experiments: Cost Breakdown

Conclusions • We develop a framework for evaluating complex OLAP queries on a distributed data warehouse • Efficient query plans that minimize data transfer over the network • Further work: • Additional developments of the architecture • Cost-based query optimization

Efficient OLAP Query Processing for Distributed Data Warehouses

Efficient OLAP Query Processing for Distributed Data Warehouses

Presentation Transcript

Distributed Query Processing

OLAP Query Processing in Grids

BUSINESS DRIVEN TECHNOLOGY Data Warehouses, OLAP and Data Mining

Data Warehouses and OLAP

Tutorial 2: Distributed Query Processing

Distributed Query Processing

Distributed Query Processing –An Overview

Data Warehouses, OLAP and Data Mining

Data Warehouses and OLAP

Trie Indexes for Efficient XML Query Processing

Efficient Query Processing On Massive Multi-dimension Data

Parallel Star Join + DataIndexes : Efficient Query Processing in Data Warehousing and OLAP

5. Distributed Query Processing

Data Warehouses and OLAP

Tutorial 2: Distributed Query Processing

Data Warehouses and OLAP

Data Warehouses and OLAP Data Management

Distributed Databases and Query Processing

Distributed Query Processing

Chapter 6 DATABASES, DATA WAREHOUSES AND OLAP

OLAP Query Processing in Grids

Distributed Query Processing