1 / 20

Efficient OLAP Query Processing for Distributed Data Warehouses

Efficient OLAP Query Processing for Distributed Data Warehouses. Michael O. Akinde, SMHI, Sweden & NDB, Aalborg University, Denmark Michael H. Böhlen, NDB, Aalborg University, Denmark Theodore Johnson, AT&T Labs-Research, USA Laks V. S. Lakshmanan, University of British Columbia, Canada

rudolf
Télécharger la présentation

Efficient OLAP Query Processing for Distributed Data Warehouses

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Efficient OLAP Query Processing for Distributed Data Warehouses Michael O. Akinde, SMHI, Sweden & NDB, Aalborg University, Denmark Michael H. Böhlen, NDB, Aalborg University, Denmark Theodore Johnson, AT&T Labs-Research, USA Laks V. S. Lakshmanan, University of British Columbia, Canada Divesh Srivastava, AT&T Labs-Research, USA Michael O. Akinde EDBT’2002 -- March 24-28, Prague

  2. Motivation • Analysis of network data • Collect, correlate, and analyze data across the network • Huge amounts of decentrally collected data • Complex OLAP operations • Performed using ad-hoc Perl scripts • Experiment: OLAP technology • Pro: Improves specification, performance • Con: Expensive (or data loss) when centralized • Existing centralized OLAP tools are inadequate

  3. DWs close to the data collection points Network Administrators & Control Systems Source Coordinator Source Query Coordinator Source DW DW DW Solution:Distributed Datawarehouse • Local DW at each collection point (e.g., router) • Compute queries across multiple DWs • A technology is needed fordistributed processing of complex OLAP queries Application Application

  4. Complex OLAP Queries • Examples: • Network usage: For each IP address, what fraction of the total number of flows is due to web traffic? • Principal components: On an hourly basis, what fraction of the total traffic is from IP subnets whose total hourly traffic is within 10% of the maximum? • Pattern identification: Break down all flows recorded on US election day by all possible combinations of source AS, destination AS, and protocol • Diverse OLAP queries, involving pivots, correlations, and multiple levels of aggregation

  5. Skalla • Translates OLAP queries expressed with extended algebra, into distributed query evaluation plans • Salient Features: • Efficiently handles a significant variety of complex OLAP queries (incl. pivots, correlations,etc.) • Only partial results are shipped between the sites and the coordinator -- never subsets of the detail data • No site to site communication

  6. Extended Algebra: GMDJ • Algebraic OLAP operator [Chatziantoniou et al., 2001] • Salient feature: Splits grouping and aggregation • MD(B, R, l,) • B is the base-values table (the “groups”) • R is the detail table (fact data) • l is the list of aggregate functions •  : possibly complex condition over B and R describing what fact data is to be aggregated • Result: The table B extended with the aggregates in l

  7. Extended Algebra: Example • For each IP address, what fraction of the total number of flows is due to web traffic? MD ( MD(IPT, Flows, (Cnt1), (IPT.key = Flows.key)), Flows, (Cnt2), (IPT.key = Flows.key and Flows.Source = WEB )) • Result of inner GMDJ: (IPT, Cnt1) • Result of outer GMDJ: (IPT, Cnt1, Cnt2) • Sequences of GMDJs instead of multiple aggregate-join expressions

  8. Administers local GMDJ queries Coordinator Coordinator Source Query Engine Query Engine Mediator Mediator Source • Computes distributed query plans • Synchronizes sub-results Source DW DW DW Skalla Architecture & Evaluation Skalla Site Wrapper Application Site Wrapper Application Site Wrapper • Skalla Evaluation Rounds: • Computation of GMDJ at the local DWs • Synchronize sub-results at coordinator

  9. Coordinator Build Groups Synchronize Synchronize result S1 • Distribute • Compute • Aggregates S2 • Distribute • Compute • Aggregates DW DW Skalla Evaluation: Example • For each IP address, what fraction of the total number of flows is due to web traffic?

  10. Skalla Evaluation: Features • Each round of computation in the distributed query evaluation computes a single GMDJ. • Features of the Evaluation: • Semantics of the query plans ensure that the amount of data shipped by the algorithm is dependent on the number of groups and aggregate functions and independent of the size of the fact relation in the database! • The algorithm permits for a wide variety of optimizations

  11. Optimizations: Group Reduction • During processing, we only ship data that has actually been changed • Example: • Query: For each IP address, what fraction of the total number of flows is due to web traffic? • Each local DW receives a base-values table containing all source data • Coordinator has a copy of base-values table • Local DWs ship only those tuples back that have actually been changed

  12. DW DW Group Reduction: Example Coordinator Build Groups Synchronize Synchronize result S1 • Distribute • Compute • Aggregates S2 • Distribute • Compute • Aggregates • For each IP address, what fraction of the total number of flows is due to web traffic?

  13. Optimizations: Synchronization Reduction • It is possible to detect cases where no synchronization is required between passes. • Example: • DW data: All the flows of an autonomous system are always registered (stored) at a particular local DW • Query: For each IP address, what fraction of the total number of flows is due to web traffic? • Each IP address belongs to a particular autonomous system; i.e., all data for a particular IP address is located at the system storing the flows of its autonomous system

  14. DW DW Synch Reduction: Example (1) • For each IP address, what fraction of the total number of flows is due to web traffic? Coordinator Build Groups Synchronize Synchronize result S1 • Distribute • Compute • Aggregates S2 • Distribute • Compute • Aggregates

  15. DW DW Synch Reduction: Example (2) • For each IP address, what fraction of the total number of flows is due to web traffic? Coordinator Build Groups Synchronize result S1 • Distribute • Compute • Aggregates S2

  16. Experiment: Number of Sites (GR)

  17. Experiment: Number of Sites (SR)

  18. Experiments: Size of Database

  19. Experiments: Cost Breakdown

  20. Conclusions • We develop a framework for evaluating complex OLAP queries on a distributed data warehouse • Efficient query plans that minimize data transfer over the network • Further work: • Additional developments of the architecture • Cost-based query optimization

More Related