1 / 100

The Sum is Greater Than the Parts Global Query Optimization in Federated Systems

The Sum is Greater Than the Parts Global Query Optimization in Federated Systems. Tony Young M.Math Candidate CS 848 - Fall 2004. Outline. Introduction Motivation Issues System Overview Optimization Semijoin Algorithm Reduction Algorithm The Garlic Approach Conclusion My Project.

kalare
Télécharger la présentation

The Sum is Greater Than the Parts Global Query Optimization in Federated Systems

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. The Sum is GreaterThan the PartsGlobal Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

  2. Outline • Introduction • Motivation • Issues • System Overview • Optimization • Semijoin Algorithm • Reduction Algorithm • The Garlic Approach • Conclusion • My Project

  3. Outline • Introduction • Motivation • Issues • System Overview • Optimization • Semijoin Algorithm • Reduction Algorithm • The Garlic Approach • Conclusion • My Project

  4. Introduction • References: • L. M. Haas, E. T. Lin, and M. A. Roth. Data integration through database federation. IBM Systems Journal, 41(4):578–596, 2002. • David K. Hsiao. Federated databases and systems: part i — a tutorial on their data sharing. The VLDB Journal, 1(1):127–180, 1992. • David K. Hsiao. Federated databases and systems: part ii — a tutorial on their resource consolidation. The VLDB Journal, 1(2):285–310, 1992. • Hongjun Lu, Beng-Chin Ooi, and Cheng-Hian Goh. On global multidatabase query optimization. SIGMOD Rec., 21(4):6–11, 1992.

  5. Introduction • References: • Neil Coburn and Per-Ake Larson. Multidatabase services: issues and architectural design. In Proceedings of the 1992 conference of the Centre for Advanced Studies on Collaborative research, pages 57–66, Toronto, Ontario, Canada, 1992. IBM Press. • Qiang Zhu. Query optimization in multidatabase systems. In Proceedings of the 1992 conference of the Centre for Advanced Studies on Collaborative research, pages 111–127. IBM Press, 1992.

  6. Introduction • References: • H. Lu, B. C. Ooi, and C. H. Goh. Multidatabase query optimization: Issues and solutions. In Proceedings of Third International Workshop on Research Issues in Data Engineering: Interoperability in Multidatabase Systems, pages 137–143, 1993.

  7. Introduction

  8. Introduction

  9. Introduction

  10. Introduction • What is a federated database system (FDBS)? • Also referred to as a multidatabase system (MDBS) • FDBS’s combine multiple heterogeneous data sources into one global view • Users think that the data all resides in one place

  11. Introduction • The notion of an FDBS has been around for many years • First semi-commercial product was Mermaid (later became InterViso) in 1984 • Flurry of papers began to surface in late 1980’s - early 1990’s • What motivated the development of these systems?

  12. Motivation • Replacement of Data Processing With Databases • Organizations were moving away from traditional data processing techniques (such as storing information in flat files • The rise in popularity of DBMS's can be attributed to powerful data mining applications as well as ease of data access

  13. Motivation • Proliferation of Heterogeneous Databases Within an Organization • It is not uncommon for different departments within an organization to make use of their own database servers • Departments often do not coordinate to ensure that a corporation is using a homogeneous DBMS to store data • There is no guarantee that the schema individual departments use to store data will be homogeneous

  14. Motivation • Data Sharing Within Organizations • Many organizations seek to share data between different departments • Finance department may require information regarding projects in progress in the marketing department • Such information sharing is difficult without the guarantee of schema, data model or access language homogeneity • Finance may use SQL on a relational database and marketing may use Xquery on an XML database

  15. Motivation • Different Rates of Technology Adoption • Different departments will adopt technology at different rates • IT will adopt quickly to store inventory information, call tickets, etc. and are tech savvy • HR may adopt slowly as they use many paper forms and are not so tech savvy • Older systems are in place, and newer systems to come online use different products

  16. Motivation • Geographic Separation of Teams • Different teams may be broken up across geographic locations • Different sites will hold teams working on different projects • Each site may have their own IT staff and make their own purchasing and installation decisions • Often there is no coordination between sites

  17. Motivation • Mergers and Acquisitions • When companies join forces, their IT systems must be joined as well • Old applications will depend on the old software, and users might be reluctant to learn an entirely new system • If we can merge the two system so that each user can make use of their old applications and old access language, transitions might be easier

  18. Issues • Several issues affect query optimization in FDBS’s

  19. Issues • Site Autonomy • Data - Local database administrators have direct and complete control over the schemas • This information cannot be modified in any way • Design - Local database administrators decide when and how to replicate and fragment data • Communication - Each site decides locally whether or not to communicate with the FDBS • Execution - Each site can determine how, when and whether to execute global queries, as well as how queries are prioritized

  20. Issues • Local Parameters • Local cost parameters for individual sites are not always available to the FDBS • The FDBS often doesn't know what indices are available for relations at local sites • Can’t predict what access methods will be used by local sites • No idea what page size and disk latency are • … etc.

  21. Issues • Translation • Queries must be translated to and from the local schema, query language, and data model on-the-fly • This requires additional query processing time

  22. Issues • Heterogeneous Capabilities • Not all local sites have the same capabilities. • Some sites may not implement any ranking operations • This means that intermediate results might have to be shifted to sites that can provide these capabilities, further increasing processing time

  23. Issues • Additional Costs • Cost based optimization needs to take into consideration some additional factors such as: • Transmission speeds • Network loads • Local site configurations • As with local parameters, this information is not always available to the FDBS

  24. Issues • Overriding theme - the FDBS is just another application as far as the local sites are concerned • We can make no assumptions about the local sites • We can assume no global control over the local sites • We have no hooks into the local sites to directly access information

  25. Outline • Introduction • Motivation • Issues • System Overview • Optimization • Semijoin Algorithm • Reduction Algorithm • The Garlic Approach • Conclusion • My Project

  26. System Overview • References: • Qiang Zhu. Query optimization in multidatabase systems. In Proceedings of the 1992 conference of the Centre for Advanced Studies on Collaborative research, pages 111–127. IBM Press, 1992. • M. T. Ozsu and P. Valduriez. Principles of Distributed Database Systems. Prentice Hall, Upper Saddle River, NJ, 2nd edition, 1999. • John Grant, Witold Litwin, Nick Roussopoulos, and Timos Sellis. Query languages for relational multidatabases. The VLDB Journal, 2(2):153–172, 1993.

  27. System Overview • Two main approaches to a federated system: • Multidatabase Language Approach • Users must learn a special access language • Users must use one standardized data model • Users must know the sites they are contacting and how data is organized at those sites • Users must enter their authentication information each time they use a site • Users must…

  28. System Overview

  29. System Overview • Main idea: • User specifies the sites, relations and columns used in the query by their name at the local site • Queries are still submitted to a middleware, but user must know where things are stored

  30. System Overview • Two main approaches to federated system: • Global Schema Approach • Global DBA implements wrappers/agents to convert access language and data model before sending to source • Global DBA generates an integrated global schema • Global DBA stores authentication for individual users at each local site and FDBS handles login • Global DBA…

  31. System Overview

  32. System Overview • Main idea: • User specifies the relations and columns used in the query by their global name • Queries are submitted to a middleware that does conversion and subquery generation

  33. System Overview • The difference is how users perceive the system • MDBL: As far as the user is concerned, the data is stored at separate sites and their query must explicitly use those sites • GS: As far as the user is concerned, the data is stored in the middleware and they can access it directly

  34. Outline • Introduction • Motivation • Issues • System Overview • Optimization • Semijoin Algorithm • Reduction Algorithm • The Garlic Approach • Conclusion • My Project

  35. Optimization • There are many many optimization algorithms for federated systems • 2-Phase • Statistical Sampling • Adaptive • Probing Query-based • …etc. • We will look at three • Semijoin and Reduction: From Mermaid - state of the art for many years and pioneer in the field • Garlic: From IBM - incorporated into a shipping product (DB2 Information Integrator)

  36. Semijoin Optimization Algorithm • References: • David Brill, Marjorie Templeton, and Clement T. Yu. Distributed query processing strategies in mermaid, a frontend to data management systems. In Proceedings of the First International Conference on Data Engineering, pages 211–218. IEEE Computer Society, 1984.

  37. Semijion Optimization Algorithm • The semijoin algorithm was proposed in Mermaid (1984) • Assumes that the cost of data transfer through a network outweighs local site CPU overhead • Seeks to reduce the size of relations required for a query at local sites before transferring results back to the Controller • Four steps

  38. Step 1: Site Selection • A set of sites that will be used to perform a query must first be chosen • Requires finding a set of minimal size that includes one copy of each local, partitioned and replicated relation • i.e. each site holding a data fragment must be in the set, but only one replica of a relation must be in the set • Some sites may hold more than one relation required by the query • Allows us to further reduce the size of the site set

  39. Step 1: Site Selection • The system statistics that can be used to optimize this selection, such as link speeds and system loads, remains an open problem

  40. Step 1: Site Selection

  41. Step 1: Site Selection

  42. Step 2: Local Reduction • In parallel at each local site in the chosen site set, reduce each relation by performing selections and projections • Parameters used to perform these operations are taken from select, where and join conditions in the original query • It might be possible to optimize the order in which site reduction queries are performed by exploiting network traffic and speed, CPU load at local sites, etc. • i.e. submit queries to slow sites first and hope they don't increase the overall execution time too much

  43. Step 2: Local Reduction

  44. Step 3: Global Reduction • Find and execute an efficient sequence of semijoins that will reduce the set of records to be transmitted • Mermaid uses a hill-climbing algorithm to determine this set • Once the semijoins are performed, the smallest amount of data required to answer the query is ready for transport • Some other algorithms to determine the optimal semijoin sequence should be investigated as this one is slow!

  45. Step 3: Global Reduction

  46. Step 4: Assembly • Transfer the data to one central query site and generate the result set. Return the result set to the user • May perform joins at local sites or wait until we get to the FDBS

  47. Step 4: Assembly

  48. Step 4: Assembly

  49. Step 4: Assembly

  50. Step 4: Assembly

More Related