1 / 35

CSE 636 Data Integration

CSE 636 Data Integration. Data Integration Approaches. Fall 2006. Virtual Integration Architecture. Leave the data in the sources When a query comes in: Determine the relevant sources to the query Break down the query into sub-queries for the sources

lona
Télécharger la présentation

CSE 636 Data Integration

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CSE 636Data Integration Data Integration Approaches Fall 2006

  2. Virtual Integration Architecture • Leave the data in the sources • When a query comes in: • Determine the relevant sources to the query • Break down the query into sub-queries for the sources • Get the answers from the sources, filter them if needed and combine them appropriately • Data is fresh • Otherwise known as On Demand Integration

  3. Virtual Integration Architecture Wrapper Wrapper Design-Time Run-Time  Mapping Tool Query Reformulation Query Result End User Mediation Language Optimization & Execution Mediator Global Schema Web Services XML 1 Data Source Data Source Local Schema Local Schema

  4. Virtual Integration Architecture Wrapper Wrapper Design-Time Run-Time  Mapping Tool Query Reformulation Query Result End User Mediation Language Optimization & Execution Mediator Global Schema Web Services 2 XML 1 Data Source Data Source Local Schema Local Schema

  5. Virtual Integration Architecture Wrapper Wrapper Design-Time Run-Time  Mapping Tool Query Reformulation Query Result End User Mediation Language Optimization & Execution 3 Mediator Global Schema Web Services 2 XML 1 Data Source Data Source Local Schema Local Schema

  6. Virtual Integration Architecture Wrapper Wrapper Design-Time Run-Time  Mapping Tool Query Reformulation 4 Query Result End User Mediation Language Optimization & Execution 3 Mediator Global Schema Web Services 2 XML 1 Data Source Data Source Local Schema Local Schema

  7. Virtual Integration Architecture Wrapper Wrapper Design-Time Run-Time  Mapping Tool Query Reformulation 4 Query Result 5 End User Mediation Language Optimization & Execution 3 Mediator Global Schema Web Services 2 XML 1 Data Source Data Source Local Schema Local Schema

  8. Virtual Integration Architecture Design-Time Run-Time  Mapping Tool Query Reformulation 4 Query Result 5 End User Mediation Language Optimization & Execution 3 6 Mediator Global Schema Web Services 2 XML 1 Wrapper Wrapper Data Source Data Source Local Schema Local Schema

  9. Virtual Integration Approaches Dimensions to Consider: • How many sources are we accessing? • How autonomous are they? • Meta-data about sources? • Is the data structured? • Queries or also updates? • Requirements: accuracy, completeness, performance, handling inconsistencies. • Closed world assumption vs. open world?

  10. Mediation Languages Global Schema CD ASIN Title Genre … Artist ASIN Name … Logic CDs Album ASIN Price DiscountPrice Studio Books Title ISBN Price DiscountPrice Edition Authors ISBN FirstName LastName Artists ASIN ArtistName GroupName CDCategories ASIN Category BookCategories ISBN Category

  11. Desiderata from Source Descriptions • Expressive power: distinguish between sources with closely related data. Hence, be able to prune access to irrelevant sources. • Easy addition: make it easy to add new data sources. • Reformulation: be able to reformulate a user query into a query on the sources efficiently and effectively.

  12. Source Descriptions Elements of source descriptions: • Contents: source contains movies, directors, cast. • Constraints: only movies produced after 1965. • Completeness: contains all American movies. • Capabilities: – Negative: source requires movie title or director as input – Positive: source can perform selections, joins, …

  13. Approaches to Specification of Source Descriptions • Global-as-View (GAV): Mediator relation defined as a view over source relations Ex: TSIMMIS (Stanford), HERMES (Maryland). • Local-as-View (LAV): Source relation defined as view over mediator relations Ex: Information Manifold (AT&T), Tukwila(UW), InfoMaster (Stanford). • GLAV: combines both (Friedman & Millstein 1999)

  14. Approaches to Specification of Source Descriptions Q Q’ Q’ Q’ Q’ Q’ Mediator Global Schema Mediated Schema GAV GLAV LAV Source Source Source Source Source Local Schema Local Schema Local Schema Local Schema Local Schema

  15. Global-as-View (GAV) Global Schema: Movie(title, dir, year, genre) Schedule(cinema, title, time) Integrating View of Movie: SELECT * FROM S1 [S1(title,dir,year,genre)] union SELECT * FROM S2 [S2(title,dir,year,genre)] union SELECT S3.title, S3.dir, S4.year, S4.genre FROM S3, S4 [S3(title,dir), WHERE S3.title = S4.title S4(title,year,genre)]

  16. Global-as-View: Example 2 Global Schema: Movie(title, dir, year, genre) Schedule(cinema, title, time) Integrating View of Schedule: SELECT title, dir, year, NULL FROM S1 [S1(title,dir,year)] union SELECT title, dir, NULL, genre FROM S2 [S2(title,dir,genre)]

  17. Global-as-View: Example 3 Global Schema: Movie(title, dir, year, genre) Schedule(cinema, title, time) Integrating Views: SELECT NULL, NULL, NULL, genre FROM S4 [S4(cinema, genre)] SELECT cinema, NULL, NULL FROM S4 [S4(cinema, genre)]

  18. Global-as-View (GAV): Example 4 Global Schema: MovieActor(title,actor) MovieReview(title, review) Integrating Views: MovieActor(title,actor) ← S1(id,title,actor,year) union MovieActor(title,actor) ← S2(title,director,actor,year) MovieReview(title, review) ← S1(id,title,actor,year), S3(id,review)

  19. Query Reformulation in GAV Query reformulation= rule unfolding+simplification Query: Find reviews for ‘DeNiro’ movies q(title,review) :- MovieActor(title,‘DeNiro’), MovieReview(title,review) 1. q’(title,review) :- S1(id,title,‘DeNiro’, year), S3(id,review) 2. q’(title,review) :- S2(title,director,‘DeNiro’,year), S1(id,title, ‘DeNiro’,year), S3(id,review)

  20. Global-as-View Summary • Query reformulation boils down to view unfolding. • Very easy conceptually. • Can build hierarchies of global schemas. • You sometimes loose information. Not always natural. • Adding sources is hard. Need to consider all other sources that are available.

  21. Local-as-View (LAV) Create View R1 AS SELECT B.ISBN, B.Title, A.Name FROM Book B, Author A WHERE A.ISBN = B.ISBN AND B.Year < 1970 Create View R5 AS SELECT B.ISBN, B.Title FROM Book B WHERE B.Genre = ‘Humor’ Mediator Global Schema Book ISBN Title Genre Year Author ISBN Name Books before 1970 Humor Books Mediated Schema Source 1 Source 2 Source 3 Source 4 Source 5 Local Schema Local Schema Local Schema Local Schema Local Schema R5 ISBN Title R1 ISBN Title Name

  22. Reformulation Problem Given: • A query Q posed over the global schema • Descriptions of the data sources Find: • A query Q’ over the data source relations, such that: • Q’ provides only correct answers to Q, and • Q’ provides all possible answers to Q given the sources.

  23. Query Reformulation Query: Find authors of humor books Plan: R1 Join R5 Mediator Global Schema Book ISBN Title Genre Year Author ISBN Name Books before 1970 Humor Books Mediated Schema Source 1 Source 2 Source 3 Source 4 Source 5 Local Schema Local Schema Local Schema Local Schema Local Schema R5 ISBN Title R1 ISBN Title Name

  24. Query Reformulation Query: Find authors of humor books after 1970 Plan: Can’t do it! Mediator Global Schema Book ISBN Title Genre Year Author ISBN Name Books before 1970 Humor Books Mediated Schema Source 1 Source 2 Source 3 Source 4 Source 5 Local Schema Local Schema Local Schema Local Schema Local Schema R5 ISBN Title R1 ISBN Title Name

  25. Local-as-View: Example 1 Global Schema: Movie(title, dir, year, genre) Schedule(cinema, title, time) Source Views: Create Source S1 AS [S1(title, dir, year, genre)] SELECT * FROM Movie Create Source S3 AS [S3(title, dir)] SELECT title, dir FROM Movie Create Source S5 AS [S5(title, dir, year)] SELECT title, dir, year FROM Movie WHERE year > 1960 AND genre=‘Comedy’

  26. Local-as-View: Example 2 Global Schema: Movie(title, dir, year, genre) Schedule(cinema, title, time) Source Views: Create Source S4 [S4(cinema, genre)] SELECT cinema, genre FROM Movie M, Schedule S WHERE M.title=S.title

  27. Local-as-View (LAV): Example 3 Global Schema: Movie(title,year,director,genre) American(director) MovieReview(title, review) Source Views: S1(title, year, director)→ Movie(title,year,director,genre), American(director), year ≥1960, genre= ‘Comedy’ S2(title, review)→Movie(title,year,director,genre), year≥1990, MovieReview(title, review)

  28. Query Reformulation in LAV • Query: Reviews for comedies produced after 1950 q(title,review) :- Movie(title,year,director,’Comedy’), year ≥1950, MovieReview (title,review) • Reformulated query: q’(title,review) :- S1(title,year,director), S2(title, review) S1(title, year, director)→ Movie(title,year,director,genre), American(director), year ≥1960, genre= ‘Comedy’ S2(title, review)→Movie(title,year,director,genre), year≥1990, MovieReview(title, review)

  29. Local-as-View Summary • Very flexible. You have the power of the entire query language to define the contents of the source. • Hence, can easily distinguish between contents of closely related sources. • Adding sources is easy: they’re independent of each other. • Query reformulation: answering queries using views!

  30. LAV vs. GAV See [Ullman,ICDT-1997] for a detailed comparison. • Local-as-View: – Easier to add sources: specify the query expression. – Easier to specify constraints on contents of the sources: they are part of the query expression describing them. • Global-as-View: – Easier query reformulation • GLAV combines both (Friedman & Millstein 1999)

  31. The General Problem • Given a set of views V1,…,Vn, and a query Q, can we answer Q using only the answers to V1,…,Vn? • Many, many papers on this problem • The best performing algorithm: The MiniCon Algorithm (Pottinger & Halevy, VLDB 2000)

  32. Local Completeness Information • If sources are incomplete, we need to look at each one of them. • Often, sources are locally complete. • Movie(title, director, year) complete for years after 1960, or for American directors. • Question: given a set of local completeness statements, is a query Q’ a complete answer to Q?

  33. Example • Movie(title, director, year) • complete after 1960 • Show(title, theater, city, hour) • Query: find movies (and directors) playing in Seattle: SELECT M.title, M.director FROM Movie M, Show S WHERE M.title=S.title AND city=‘Seattle’ • Complete or not?

  34. Example #2 • Movie(title, director, year), Oscar(title, year) • Query: find directors whose movies won Oscars after 1965: SELECT M.director FROM Movie M, Oscar O WHERE M.title=O.title AND M.year=O.year AND O.year > 1965 • Complete or not?

  35. References • Information integration • Maurizio Lenzerini • Eighteenth International Joint Conference on Artificial Intelligence, IJCAI 2003 • Invited Tutorial • Data Integration: a Status Report • Alon Halevy • German Database Conference (BTW), 2003 • Invited Talk

More Related