1 / 22

CSE 636 Data Integration

CSE 636 Data Integration. Overview. Data Warehouse Architecture. . OLAP / Decision Support Data Cubes / Data Mining. . Users. Applications. Relational Database (Warehouse). ETL Tools (Extract-Transform-Load). Data Cleaning. Data Source. Data Source. Data Source.

liam
Télécharger la présentation

CSE 636 Data Integration

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CSE 636Data Integration Overview

  2. Data Warehouse Architecture  OLAP / Decision Support Data Cubes / Data Mining  Users Applications Relational Database (Warehouse) ETL Tools (Extract-Transform-Load) Data Cleaning Data Source Data Source Data Source

  3. Virtual Integration Architecture • Leave the data in the sources • When a query comes in: • Determine the relevant sources to the query • Break down the query into sub-queries for the sources • Get the answers from the sources, filter them if needed and combine them appropriately • Data is fresh • Otherwise known as On Demand Integration

  4. Virtual Integration Architecture Schema Mappings Schema Mappings Schema Mappings Design-Time   End Users Applications • Sources can be: • Relational DBs • Excel Files • Web Sites • Web Services Global Schema Data Source Data Source Data Source Local Schema Local Schema Local Schema

  5. Schema Mappings • Differences in: • Names in schema • Attribute grouping • Coverage of databases • Granularity and format of attributes Inventory Database B Books Title ISBN Price DiscountPrice Edition Authors ISBN FirstName LastName Inventory Database A BookCategories ISBN Category BooksAndMusic Title Author Publisher ItemID ItemType SuggestedPrice Categories Keywords CDCategories ASIN Category CDs Album ASIN Price DiscountPrice Studio Artists ASIN ArtistName GroupName

  6. Issues for Schema Mappings Schema Mappings Schema Mappings Schema Mappings Design-Time   End Users Applications • What formalisms to express them? • How to create them? • Can we discover them somehow? • How do we use them? Global Schema Data Source Data Source Data Source Local Schema Local Schema Local Schema

  7. Virtual Integration Architecture Reformulation Optimization Execution Wrapper Wrapper Run-Time Query Result Mediator Global Schema Data Source Data Source Data Source Local Schema Local Schema Local Schema

  8. Issues for Query Processing Reformulation Query • User queries refer to the global schema • Data is stored in the sources in a local schema • Rewriting algorithms Mediator Reformulation Global Schema Data Source Data Source Data Source Local Schema Local Schema Local Schema

  9. Issues for Query Processing Reformulation Global Schema SELECT ISBN, Price FROM Books WHERE Title = ‘on the road’ Books Title ISBN Price DiscountPrice Edition Local Schema A SELECT ItemID, SuggestedPrice FROM BooksAndMusic WHERE Title = ‘on the road’ AND ItemType = ‘Books’ BooksAndMusic Title Author Publisher ItemID ItemType SuggestedPrice Categories Keywords

  10. Issues for Query Processing Wrapper Query Translation Query • Different query languages Mediator Reformulation Global Schema Optimization Execution Data Source Data Source Data Source Local Schema Local Schema Local Schema

  11. Issues for Query Processing Query Translation Global Schema SELECT ISBN, Price FROM Books WHERE Title = ‘on the road’ Books Title ISBN Price DiscountPrice Edition Local Source A http://www.amazon.com/homepage.html?ItemType=Books&Title=on+the+road

  12. Issues for Query Processing Reformulation Optimization Execution Wrapper Data Translation Query • Different data models Mediator Global Schema Data Source Data Source Data Source Local Schema Local Schema Local Schema

  13. Issues for Query Processing Data Translation Global Schema Books Title ISBN Price DiscountPrice Edition <table> <tr> <td> <a href=/details?isbn=123> <b>On the Road</b> </a> -- by Jack Kerouac; Paperback <br> <a href=/details?isbn=123> Buy new </a> :<b class=price>$10.86</b> </td> </tr> </table> Local Result A

  14. Issues for Query Processing Reformulation Optimization Execution Wrapper Wrapper Query Execution Query • Access as many data sources as needed • Duplicate/redundant and irrelevant data • Limited query capabilities Mediator Global Schema Data Source Data Source Data Source Local Schema Local Schema Local Schema

  15. Issues for Query Processing SELECT ItemID, SuggestedPrice FROM BooksAndMusic WHERE Title = ‘on the road’ A Limited Query Capabilities SELECT ISBN, Price, DiscountPrice FROM Books WHERE Title = ‘on the road’ Global Schema Books Title ISBN Price DiscountPrice Edition E Local Schema A Local Schema B BooksAndMusic Title Author ItemID ItemType SuggestedPrice DiscountBooks Title Edition ISBN GreatPrice B D SELECT ItemID, SuggestedPrice FROM BooksAndMusic WHERE Title = ? SELECT GreatPrice FROM DiscountBooks WHERE ISBN = 123 SELECT GreatPrice FROM DiscountBooks WHERE ISBN = ? C

  16. Issues for Query Processing Reformulation Optimization Execution Wrapper Wrapper Query Answering Query Result • Combine the results and further process them if needed • Mainly union and merge • Inconsistencies Mediator Global Schema Data Source Data Source Data Source Local Schema Local Schema Local Schema

  17. Issues for Query Processing Query Answering (Union)

  18. Issues for Query Processing Query Answering (Merge) Primary Key Primary Key Primary Key

  19. Issues for Query Processing Query Answering (Inconsistencies) Primary Key Primary Key Primary Key

  20. Peer-Based Integration Query Peer 4 Query Peer 5 Peer 2 Peer 1 Peer 3

  21. Peer-Based Integration • No need for a central mediated schema • Peers serve as mediators for other peers • A peer can be both a server and a client • Semantic relationships are specified locally(between small sets of peers) • Queries are posed using the peer’s schema • Answers come from anywhere in the system • This is not P2P file sharing. • Data has rich semantics

  22. References • Information integration • Maurizio Lenzerini • Eighteenth International Joint Conference on Artificial Intelligence, IJCAI 2003 • Invited Tutorial • Data Integration: a Status Report • Alon Halevy • German Database Conference (BTW), 2003 • Invited Talk

More Related