260 likes | 363 Vues
CSE 636 Data Integration. Overview. Fall 2006. What is Data Integration?. The problem of providing uniform (sources transparent to user) access to (query, and eventually updates too) multiple (even 2 is a problem!) autonomous (not affect the behavior of sources)
E N D
CSE 636Data Integration Overview Fall 2006
What is Data Integration? The problem of providing • uniform (sources transparent to user) • access to (query, and eventually updates too) • multiple (even 2 is a problem!) • autonomous (not affect the behavior of sources) • heterogeneous (different data models, schemas) • structured (at least semistructured) • data sources (not only databases)
Motivation • Enterprise data integration; web-site construction. • World-wide web: • comparison shopping (Netbot, Junglee) • portals integrating data from multiple sources • XML integration • Science & culture • Medical genetics: integrating genomic data • Astrophysics: monitoring events in the sky • Environment: Puget Sound Regional Synthesis Model • Culture: uniform access to all the cultural databases produced by different countries.
Principle Dimensions of Data Integration • Virtual vs. materialized architecture • Access: query only or query&update? • problem similar to updating through views • need distributed transactional services. • Mediated schema: yes or no? • Mediated schema requires schema integration and then query reformulation. • Without mediated schema, we lose some of the advantages of data integration.
Data Warehouse Architecture OLAP / Decision Support Data Cubes / Data Mining Users Applications Relational Database (Warehouse) ETL Tools (Extract-Transform-Load) Data Cleaning Data Source Data Source Data Source
Virtual Integration Architecture • Leave the data in the sources • When a query comes in: • Determine the relevant sources to the query • Break down the query into sub-queries for the sources • Get the answers from the sources, filter them if needed and combine them appropriately • Data is fresh • Otherwise known as On Demand Integration
Virtual Integration Architecture Schema Mappings Schema Mappings Schema Mappings Design-Time End Users Applications Sources can be: • Relational DBs • Excel Files • Web Sites • Web Services Global Schema Data Source Data Source Data Source Local Schema Local Schema Local Schema
Schema Mappings • Differences in: • Names in schema • Attribute grouping • Coverage of databases • Granularity and format of attributes Inventory Database B Books Title ISBN Price DiscountPrice Edition Authors ISBN FirstName LastName Inventory Database A BookCategories ISBN Category BooksAndMusic Title Author Publisher ItemID ItemType SuggestedPrice Categories Keywords CDCategories ASIN Category CDs Album ASIN Price DiscountPrice Studio Artists ASIN ArtistName GroupName
Issues for Schema Mappings Schema Mappings Schema Mappings Schema Mappings Design-Time End Users Applications • What formalisms to express them? • How to create them? • Can we discover them somehow? • How do we use them? Global Schema Data Source Data Source Data Source Local Schema Local Schema Local Schema
Virtual Integration Architecture Reformulation Optimization Execution Wrapper Wrapper Run-Time Query Result Mediator Global Schema Data Source Data Source Data Source Local Schema Local Schema Local Schema
Issues for Query Processing Reformulation Query • User queries refer to the global schema • Data is stored in the sources in a local schema • Rewriting algorithms Mediator Reformulation Global Schema Data Source Data Source Data Source Local Schema Local Schema Local Schema
Issues for Query Processing Reformulation Global Schema SELECT ISBN, Price FROM Books WHERE Title = ‘on the road’ Books Title ISBN Price DiscountPrice Edition Local Schema A SELECT ItemID, SuggestedPrice FROM BooksAndMusic WHERE Title = ‘on the road’ AND ItemType = ‘Books’ BooksAndMusic Title Author Publisher ItemID ItemType SuggestedPrice Categories Keywords
Issues for Query Processing Wrapper Query Translation Query • Different query languages Mediator Reformulation Global Schema Optimization Execution Data Source Data Source Data Source Local Schema Local Schema Local Schema
Issues for Query Processing Query Translation Global Schema SELECT ISBN, Price FROM Books WHERE Title = ‘on the road’ Books Title ISBN Price DiscountPrice Edition Local Source A http://www.amazon.com/homepage.html?ItemType=Books&Title=on+the+road
Issues for Query Processing Reformulation Optimization Execution Wrapper Data Translation Query • Different data models Mediator Global Schema Data Source Data Source Data Source Local Schema Local Schema Local Schema
Issues for Query Processing Data Translation Global Schema Books Title ISBN Price DiscountPrice Edition <table> <tr> <td> <a href=/details?isbn=123> <b>On the Road</b> </a> -- by Jack Kerouac; Paperback <br> <a href=/details?isbn=123> Buy new </a> :<b class=price>$10.86</b> </td> </tr> </table> Local Result A
Issues for Query Processing Reformulation Optimization Execution Wrapper Wrapper Query Execution Query • Access as many data sources as needed • Duplicate/redundant and irrelevant data • Limited query capabilities Mediator Global Schema Data Source Data Source Data Source Local Schema Local Schema Local Schema
Issues for Query Processing SELECT ItemID, SuggestedPrice FROM BooksAndMusic WHERE Title = ‘on the road’ A Limited Query Capabilities SELECT ISBN, Price, DiscountPrice FROM Books WHERE Title = ‘on the road’ Global Schema Books Title ISBN Price DiscountPrice Edition E Local Schema A Local Schema B BooksAndMusic Title Author ItemID ItemType SuggestedPrice DiscountBooks Title Edition ISBN GreatPrice B D SELECT ItemID, SuggestedPrice FROM BooksAndMusic WHERE Title = ? SELECT GreatPrice FROM DiscountBooks WHERE ISBN = 123 SELECT GreatPrice FROM DiscountBooks WHERE ISBN = ? C
Issues for Query Processing Reformulation Optimization Execution Wrapper Wrapper Query Answering Query Result • Combine the results and further process them if needed • Mainly union and merge • Inconsistencies Mediator Global Schema Data Source Data Source Data Source Local Schema Local Schema Local Schema
Issues for Query Processing Query Answering (Union)
Issues for Query Processing Query Answering (Merge) Primary Key Primary Key Primary Key
Issues for Query Processing Query Answering (Inconsistencies) Primary Key Primary Key Primary Key
Peer-Based Integration Query Peer 4 Query Peer 5 Peer 2 Peer 1 Peer 3
Peer-Based Integration • No need for a central mediated schema • Peers serve as mediators for other peers • A peer can be both a server and a client • Semantic relationships are specified locally(between small sets of peers) • Queries are posed using the peer’s schema • Answers come from anywhere in the system • This is not P2P file sharing. • Data has rich semantics
References • Information integration • Maurizio Lenzerini • Eighteenth International Joint Conference on Artificial Intelligence, IJCAI 2003 • Invited Tutorial • Data Integration: a Status Report • Alon Halevy • German Database Conference (BTW), 2003 • Invited Talk