Data Integration: A Status Report

Data Integration:A Status Report Alon Halevy University of Washington, Seattle BTW 2003

Data Integration Report • Recent progress • Mediation languages • Query processing (XML and other) • Commercial • Current challenges • Flexible architectures: peer-data mgmt. • Getting to the root of semantic heterogeneity: schema mapping. BTW 2003

Data Integration Systems • This is one possible architecture (virtual integration) • Only logical mediated schema is central. Data stays at the sources.

Motivation and Activity • Application areas of data integration: • Enterprise information integration ($$) • The government • Data sources on the web • Scientific data sharing. • Many research projects: • Mine: Information Manifold, Tukwila, LSD. • Companies: • Many startups, big guys getting in. BTW 2003

Outline • Recent progress • Mediation languages • Adaptive Query processing • XML data management • Commercial • Current challenges • Flexible architectures: peer-data mgmt. • Getting to the root of semantic heterogeneity: schema mapping. • Crossing the Structure Chasm. BTW 2003

Q Q’ Q’ Q’ Q’ Q’ Source Source Source Source Source Mediation Languages Goal: Mediated Schema Language for Specifying Semantic relationships BTW 2003

Source Source Source Source Source Global-as-View (GAV) Create view Actor AS R1 Union Select A,B From S2 Union … Mediated Schema Title, Actor, … R1 R2 R3 R4 R5 BTW 2003

Source Source Source Source Source Local-as-View (LAV) (GLAV) Create View R5 as Select * From Movie Where lang=“German” Create View R1 as Select title, name From Title Join Actor Where Year>1970 Mediated Schema Title, Actor … R1 R2 R3 R4 R5 BTW 2003

Adaptive Query Processing • Problem: no stats, network unstable • Cannot ‘Plan and then execute’ • Need to adapt plan during execution. • Idea already in Ingres (1976) • Proposed before data integration: • Cole and Graefe (choose nodes) • Kabra and Dewitt (mid-query re-opt). BTW 2003

Convergent Query Processing[Zack Ives, Ph.D 2002, U. Penn] • Processor starts with initial plan • Monitors execution, accumulating stats. • Switches plan when a better one found • Reuses intermediate results. • Final, cleanup phase. • Possible transformation types: • Plan partitioning, data partitioning, low-level rescheduling. • Can be aggressive (e.g., with aggregations). BTW 2003

XML Query Processing • XML facilitates integration. • Mediator query processor may manipulate XML directly. • Progress on: • Publishing to XML, XML views on relations • Physical algebras for manipulating XML • Optimization of XQuery. BTW 2003

The Commercial World • Some startups: • Nimble, MetaMatrix, Calixa, Enosys, … • Big guys making announcements: • IBM, BEA, MS, (Oracle still being defiant). • Progress: analysts have buzzword -- EII. • Challenges: • Integration with EAI? • Yet another middleware? • Horizontal vs. vertical? BTW 2003

Outline • Recent progress • Mediation languages • Adaptive Query processing • XML data management • Commercial • Current challenges • Flexible architectures: peer-data mgmt. • Getting to the root of semantic heterogeneity: schema mapping. BTW 2003

Peer Data-Management • PDMS: a network of peers • Peers can: • Export base data • Provide views on base data • Serve as logical mediators for other peers • A peer can be both a server and a client. • Semantic relationships are specified locally(between small sets of peers). BTW 2003

Q’’ Q’ Q’’ Q’’ Q Q’’ Q’ Network of Mappings (Piazza) CiteSeer UW Stanford GAV, LAV GLAV DBLP Leipzig Saarbruecken Berlin

Advantages of PDMS • No need for a central mediated schema. • Can map data opportunistically, as is most convenient. • Queries are posed using the peer’s schema. Answers come from anywhere in the system. • Semantic Web. • This is not P2P file sharing. • Data has rich semantics • Membership is not as dynamic. BTW 2003

Q’’ Q’ Q’’ Q’’ Q Q’’ Q’ Schema Mediation When can LAV and GAV be combined to form such a network structure? [ICDE-03], [WWW-03 for XML] CiteSeer UW Stanford GAV, LAV GLAV DBLP Leipzig Saarbruecken Berlin

Q’’ Q’ Q’’ Q’’ Q Q’’ Q’ Query Optimization • Problems: • redundant paths • expensive reformulation. CiteSeer UW Stanford • Possible solution: • Pre-compose some paths DBLP Leipzig Saarbruecken Berlin

Mapping Composition • Incredibly subtle! [w/ Madhavan] • In general, composition can be an infinite set of GLAV formulas. • Results: • Finite in many cases • Even when infinite, often has finite, useful encoding. • Hence, compositions can usually be pre-optimized. BTW 2003

Q’’ Q’ Q’’ Q’’ Q Q’’ Q’ Management of Updates[w/ Mork, Gribble] • Problem: when updates are generated, we don’t know who will use them. • Solution: • represent updates as first-class citizens • Complement with boosters • Rules for usage. CiteSeer UW Stanford DBLP Leipzig Saarbruecken Berlin

Q’’ Q’ Q’’ Q’’ Q Q’’ Q’ Other Research Issues Intelligent data placement Management of mapping networks Improving networks: finding additional connections. Indexing of views CiteSeer UW Stanford DBLP Leipzig Saarbruecken Berlin

Schema Matching/Mapping • Given • S1 and S2: a pair of schemas/DTDs/ontologies,… • Possibly, data accompanying instances • Additional domain knowledge • Find: • A match between S1 and S2 • A set of correspondences between the terms. • Ultimately, a mapping • Should enable translating data between the schemas. BTW 2003

Example: House Listings house address Water view num-baths LakeMountains ? 1-1 mapping non 1-1 mapping house location view full-baths half-baths front back

Motivations • Heart of any data sharing architecture • Virtual, warehouse, messaging, • web services, semantic web • Translation of legacy data, EAI, … • Key operator in model management • Algebra for manipulating models of data • See [Bernstein, CIDR-03], Melnik et al. [SIGMOD 03]. • Currently, a bottleneck. Done mostly by hand. BTW 2003

Approaches to Matching • Matching is hard because schema does not fully capture the semantics. • Many techniques proposed. They consider similarities in: • Attribute names (synonyms) • Data values, data types • Relationships between columns • Structural similarities • Anything a human expert would try! • Hence, let’s try to simulate a human. BTW 2003

Philosophy of Solutions • Effective schema matching requires a principled combination of techniques. • Like human experts, the matcher should improve over time • Learn from seeing many schemas, matches. • LSD [Doan, Ph.D 2002, U. of Illinois] • COMA [Do et al.] BTW 2003

Corpus Based Solution[Madhavan, Bernstein, Chen, Halevy, Shenoy] • Collect a corpus of schemas and matches. • Learn from the corpus: • Create a classifier for every corpus element • Use multi-strategy learning. • Given S1 and S2 : • Compare each schema element to corpus elements. • If two elements’ similarity vectors are close, then maybe they match each other. BTW 2003

Learning from Corpus vs. Learning from the schemas BTW 2003

Finding Different Matches BTW 2003

Other Corpus Based Tools • Conjecture: a corpus of schemas can be the basis for many useful tools. • Auto-complete: • I start creating a schema (or show sample data), and the tool suggests a completion. • Query reformulation: • I ask a query using my terminology, and it gets reformulated appropriately. • Improving structured queries over structured web sites (and focused crawling, a la BINGO!) BTW 2003

The Corpus • Contents: • Schemas, ontologies, meta-data, data, queries. • Sample statistics: • How often does a word appear as a relation name? • When it does, what tend to be the attribute names? • What other tables are there? What are the foreign keys? BTW 2003

schema mapping Conclusion: Crossing the Structure Chasm • Data authoring, querying and sharing is everywhere; done by novices too. • Semantic web: the extreme example. Corpus Of schemas BTW 2003

Some References • www.cs.washington.edu/homes/alon • Piazza: WebDB01, ICDE03, WWW03 • The Structure Chasm: CIDR-03 • Mediation surveys: VLDB Journal 01 • Lenzerini, PODS 02 tutorial. • Schema matching: • Rahm and Bernstein, VLDB Journal 01. BTW 2003

Data Integration: A Status Report

Data Integration: A Status Report

Presentation Transcript

Information Integration: A Status Report

HALL-A STATUS REPORT

Electricity: a status report

HALL-A STATUS REPORT

Longitudinal Data system Governance: Status Report

Data Integration: A Status Report

JRA1 - Data Acquisition Status Report

BalloonWinds Integration Status

Inflation: a Status Report

APTLD – A Status Report

METS: A Status Report

NESTOR – a status report and KM3NeT – a short status report

HALL-A STATUS REPORT

DST data structures: Status report

HALL-A STATUS REPORT

HALL-A STATUS REPORT

HALL-A STATUS REPORT

Data integration

NESTOR – a status report

Integration status

Cyberinfrastructure A Status Report

Barrel integration status