Data Integration

Data Integration

Data Integration Challenge Find houses with 4 bedrooms priced under 300K New faculty member realestate.com homeseekers.com homes.com

wrapper wrapper wrapper Architecture of Data Integration Systems Find houses with 4 bedroomspriced under 300K mediated schema source schema 1 source schema 2 source schema 3 homes.com realestate.com houses.com Provide a uniform query interface

cost | city | numbeds | numbaths price location beds baths $185,000 Urbana, IL 2 2 $270,000 Seattle, WA 3 - Architecture of Data Integration Systems Find houses with 4 bedroomspriced under 300K price location beds $185K Urbana, IL 2 $299K Kent, WA 3 wrapper wrapper $185,000 <em>Urbana, IL</em> 2 beds / 2 baths Century 21 $270,000 <em>Seattle, WA</em> 3 beds REMAX real estate homeseekers.com homeseekers.com Involve many tasks

Another Example Uniform query capability across autonomous, heterogeneous data sources on LAN, WAN, or Internet

More Motivating Examples • An organization has on average 49 databases • can talk about the same topic, but use different vocabularies, different schemas • how can we access them as if accessing a single db? • Hundreds of online bookstores • amazon.com, barnes&noble.com, etc. • how can we query them as if querying a single source? • Hundreds of CS websites in US, in text format • can we consolidate information about all of them and query them as if querying a giant relational database?

The General Problem • How can we access a set of heterogeneous, distributed, autonomous databases as if accessing a single database? • Arises in numerous contexts • on the Web, at enterprises, military, scientific cooperation, bio-informatics domains, e-commerce, etc. • Currently very hot • in both database research and industry

Current State of Affairs • Mostly ad-hoc programming: create a special solution for every case; pay consultants a lot of money. • Long-standing challenge in the DB community • AI/WWW communities are on board • Annual workshops, vision papers, ... • Companies • Informatica, many others, ...

A Brief Research History • Many early ad-hoc solutions • Converged into two approaches • data warehousing vs. virtual DI systems • Semi-structured data, XML • Wrappers, info extraction • Other issues: query optimization, schema matching, ... • Current directions • DI for specialized domains (e.g., bioinformatics) • on-the-fly DI, entity-centric DI • simplify integration tasks • New types of data sharing systems • P2P systems, Semantic Web

Data warehousing vs.Virtual DI systems

Data Warehouse Architecture OLAP / Decision support/ Data cubes/ data mining User queries Relational database (warehouse) Data extraction programs Data cleaning/ scrubbing Data source Data source Data source

Data warehousing • Data warehousing: load all the data periodically into a warehouse. • 6-18 months lead time • Separates operational DBMS from decision support DBMS. (not only a solution to data integration). • Performance is good; data may not be fresh. • Need to clean, scrub you data.

The Virtual Integration Architecture • Leave the data in the sources. • When a query comes in: • Determine the relevant sources to the query • Break down the query into sub-queries for the sources. • Get the answers from the sources, and combine them appropriately. • Data is fresh. • Challenge: many

Virtual Integration Architecture User queries Mediated schema Mediator: Reformulation engine optimizer Which data model? Data source catalog Execution engine wrapper wrapper wrapper Data source Data source Data source Sources can be: relational, hierarchical (IMS), structure files, web sites.

Architecture of (Virtual) Data Integration System Find books written by Isaac Asimov & priced under $15 global query interface query interface 1 query interface 2 query interface 3 amazon.com bn.com powell.com

A Brief History • Many early ad-hoc solutions • Converged into two approaches • data warehousing vs. virtual DI systems • Semi-structured data, XML • Wrappers • Other issues: query optimization, schema matching, ... • Current directions • DI for specialized domains (e.g., bioinformatics) • on-the-fly DI, entity-centric DI • New types of data sharing systems • P2P systems, Semantic Web

Semi-structured Data • What should be the underlying data model for DI contexts? • relational model is not an ideal choice • Developed semi-structured data model • started with the OEM (object exchange model) • Then XML came along • It is now the most well-known semi-structured data model • Generating much research in the DB community

HTML <h1> Bibliography </h1> <p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu <br> Addison Wesley, 1995 <p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu <br> Morgan Kaufmann, 1999 HTML is hard for applications

XML <bibliography> <book> <title> Foundations… </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> … </bibliography> XML describes the content: easy for applications

DTDs as Grammars Same thing as: • A DTD is a EBNF (Extended BNF) grammar • An XML tree is precisely a derivation tree db ::= (book|publisher)* book ::= (title,author*,year?) title ::= string author ::= string year ::= string publisher ::= string XML Documents that have a DTD and conform to it are called valid

More on DTDs as Grammars <!DOCTYPE paper [ <!ELEMENT paper (section*)> <!ELEMENT section ((title,section*) | text)> <!ELEMENT title (#PCDATA)> <!ELEMENT text (#PCDATA)> ]> <paper> <section> <text> </text> </section> <section> <title> </title> <section> … </section> <section> … </section> </section> </paper> XML documents can be nested arbitrarily deep

<persons> <row> <name>John</name> <phone> 3634</phone></row> <row> <name>Sue</name> <phone> 6343</phone> <row> <name>Dick</name> <phone> 6363</phone></row> </persons> XML for Representing Data XML: persons persons row row row phone name phone name phone name “John” 3634 “Sue” 6343 “Dick” 6363

XML vs Data Models • XML is self-describing • Schema elements become part of the data • Relational schema: persons(name,phone) • In XML <persons>, <name>, <phone> are part of the data, and are repeated many times • Consequence: XML is much more flexible • XML = semistructured data

Semi-structured Data Explained • Missing attributes: • Repeated attributes <person> <name> John</name> <phone>1234</phone> </person> <person> <name>Joe</name> </person>  no phone ! <person> <name> Mary</name> <phone>2345</phone> <phone>3456</phone> </person>  two phones !

Semistructured Data Explained • Attributes with different types in different objects • Nested collections (no 1NF) • Heterogeneous collections: • <db> contains both <book>s and <publisher>s <person> <name> <first> John </first> <last> Smith </last> </name> <phone>1234</phone> </person>  structured name !

XML Data v.s. E/R, ODL, Relational • Q: is XML better or worse ? • A: serves different purposes • E/R, ODL, Relational models: • For centralized processing, when we control the data • XML: • Data sharing between different systems • we do not have control over the entire data • E.g. on the Web • Do NOT use XML to model your data ! Use E/R, ODL, or relational instead.

Exporting Relational Data to XML • Product(pid, name, weight) • Company(cid, name, address) • Makes(pid, cid, price) makes product company

Export data grouped by companies <db><company> <name> GizmoWorks </name> <address> Tacoma </address> <product> <name> gizmo </name> <price> 19.99 </price> </product> <product> …</product> … </company> <company> <name> Bang </name> <address> Kirkland </address> <product> <name> gizmo </name> <price> 22.99 </price> </product> … </company> … </db> Redundant representation of products

The DTD <!ELEMENT db (company*)> <!ELEMENT company (name, address, product*)> <!ELEMENT product (name,price)> <!ELEMENT name (#PCDATA)> <!ELEMENT address (#PCDATA)> <!ELEMENT price (#PCDATA)>

Export Data by Products <db> <product> <name> Gizmo </name> <manufacturer> <name> GizmoWorks </name> <price> 19.99 </price> <address> Tacoma </address> </manufacturer> <manufacturer> <name> Bang </name> <price> 22.99 </price> <address> Kirkland </address> </manufacturer> … </product> <product> <name> OneClick </name> … </db> Redundant Representation of companies

Which One Do We Choose ? • The structure of the XML data is determined by agreement, with our partners, or dictated by committees • Many XML dialects (called applications) • XML Data is often nested, irregular, etc • No normal forms for XML 

XML Query Languages • Xpath • XML-QL • Xquery

Virtual Integration Architecture User queries Mediated schema Mediator: Reformulation engine optimizer Which data model? Data source catalog Execution engine wrapper wrapper wrapper Data source Data source Data source Sources can be: relational, hierarchical (IMS), structure files, web sites.

Wrapper Programs • Task: to communicate with the data sources and do format translations. • They are built w.r.t. a specific source. • They can sit either at the source or at the mediator. • Often hard to build (very little science). • Can be “intelligent”: perform source-specific optimizations.

Example Transform: <b> Introduction to DB </b> <i> Phil Bernstein </i> <i> Eric Newcomer </i> Addison Wesley, 1999 <book> <title> Introduction to DB </title> <author> Phil Bernstein </author> <author> Eric Newcomer </author> <publisher> Addison Wesley </publisher> <year> 1999 </year> </book> into:

Wrapper Construction • Huge amount of research in the past decade • Two major approaches • machine learning: typically requires some hand-labeled data • data-intensive, completely automatic • Different focuses • pull out each record (i.e., segment page into records) • pull out fields in each record • remove junk portions (ads, etc.) • Current solutions are still brittle • Unclear whether “standards” such as XML & Web services will eliminate the problem • the need likely will still remain

Information Extraction • If the source cannot be wrapped with a grammar or some easy-to-parse rules • must do information extraction • Huge research in the AI community

Data Source Catalog • Contains all meta-information about the sources: • Logical source contents (books, new cars). • Source capabilities (can answer SQL queries) • Source completeness (has all books). • Physical properties of source and network. • Statistics about the data (like in an RDBMS) • Source reliability • Mirror sources • Update frequency.

Content Descriptions • User queries refer to the mediated schema. • Data is stored in the sources in a local schema. • Content descriptions provide the semantic mappings between the different schemas. • Data integration system uses the descriptions to translate user queries into queries on the sources.

Desiderata from Source Descriptions • Expressive power: distinguish between sources with closely related data. Hence, be able to prune access to irrelevant sources. • Easy addition: make it easy to add new data sources. • Reformulation: be able to reformulate a user query into a query on the sources efficiently and effectively.

Reformulation Problem • Given: • A query Q posed over the mediated schema • Descriptions of the data sources • Find: • A query Q’ over the data source relations, such that: • Q’ provides only correct answers to Q, and • Q’ provides all possible answers from to Q given the sources.

Approaches to Specifying Source Descriptions • Global-as-view: express the mediated schema relations as a set of views over the data source relations • Local-as-view: express the source relations as views over the mediated schema. • Can be combined with no additional cost.

Global-as-View Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Create View Movie AS select * from S1 [S1(title,dir,year,genre)] union select * from S2 [S2(title, dir,year,genre)] union[S3(title,dir), S4(title,year,genre)] select S3.title, S3.dir, S4.year, S4.genre from S3, S4 where S3.title=S4.title

Global-as-View: Example 2 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Create View Movie AS [S1(title,dir,year)] select title, dir, year, NULL from S1 union [S2(title, dir,genre)] select title, dir, NULL, genre from S2

Global-as-View: Example 3 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Source S4: S4(cinema, genre) Create View Movie AS select NULL, NULL, NULL, genre from S4 Create View Schedule AS select cinema, NULL, NULL from S4. But what if we want to find which cinemas are playing comedies?

Global-as-View Summary • Query reformulation boils down to view unfolding. • Very easy conceptually. • Can build hierarchies of mediated schemas. • You sometimes loose information. Not always natural. • Adding sources is hard. Need to consider all other sources that are available.

Local-as-View: example 1 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Create Source S1 AS select * from Movie Create Source S3 AS [S3(title, dir)] select title, dir from Movie Create Source S5 AS select title, dir, year from Movie where year > 1960 AND genre=“Comedy”

Local-as-View: Example 2 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Source S4: S4(cinema, genre) Create Source S4 select cinema, genre from Movie m, Schedule s where m.title=s.title . Now if we want to find which cinemas are playing comedies, there is hope!

Data Integration