Clio Schema Mapping & Data Exchange

Clio Schema Mapping & Data Exchange Renée J. Miller Bell Canada Chair of Information Systems Department of Computer Science University of Toronto

Clio Team University of Toronto • Renée J. Miller • Ariel Fuxman • Yannis Velegrakis IBM Almaden • Laura Haas • Ron Fagin, Mauricio Hernández • Howard Ho, Phokion Kolaitis, • Lucian Popa, Ling-Ling Yan

Brief History of Clio • Project begun in 1999 • VLDB 2000 - Schema Mapping as Query Discovery • SIGMOD 2001 Relational Demo • ICDE 2002 XML Demo • SIGMOD 2001 - Interactive Refinement of Schema Mappings • VLDB 2002 - Using Mappings for Data Exchange • ICDT 2003 (TCS05)- Theory of Data Exchange • VLDB 2003 (VLDBJ04) - Incremental Mapping Adaptation • ICDE 2004 Demo - “pay-as-you-go” mapping discovery • ICDE 2005 - Mapping provenance • VLDB 2006 - Discovery of Richer Mappings • Correlations between mappings • Declarative representation of grouping semantics • ICDE 2007 Demo • ICDE 2008 - Mapping Debugging by Example • SIGMOD 2008 Demo • Retrospective book chapter • Conceptual Modeling, 2009

Prehistory: Data Integration 1999 Garlic Tsimmis Information Manifold Disco Pegasus etc. Wrapper Wrapper Wrapper Wrapper User Query Integrated or Global Schema Mediated Schema Metadata Mediator City Database County Database Pubic Data Server Good Survey: Lenzerini PODS 2002

Renée J. Miller Schema Integration Integrated schema Q “conforms to” “conforms to” data data Source schema S1 Source schema Sn Schema Creation Problem: create integrate schema (and mappings) for a set of independently designed schemas [Batini et al CS86, Buneman et al EDBT92, Pottinger&Bernstein VLDB03]

Schema Mapping Q “conforms to” data Source Schema S Mapping Target Schema T “conforms to” data Mapping Creation Problem:create mapping between independently designed (legacy) schemas

Challenges • Independent Schemas • One schema does not have to be a mediator (global) • Both schemas may have data not contained in the other • Consider any combination of XML and relational schemas • Common use case publishing legacy relational data on the web • Use for data translation between data models • Flexible mapping granularity • Fine grained mappings (attribute-level) • Map concepts (master data management) • Orders, customers, products, etc. • Map full documents or translate entire schemas • Incremental • Users with only partial knowledge of schema/data can still construct useful (usable mappings)

Data Integration Q Mappings (M) “conforms to” Source Schema S’’ Source Schema S’ I’’ Source Schema S Target Schema T I’ “conforms to” I • Data Integration – answer target queries using data from source(s) • Target data is virtual • At query time, compute Q(M(II’I’’))

Data Exchange Source Schema S’’ Q Source Schema S’ data Source Schema S data Mappings (M) “conforms to” “conforms to” data data Target Schema T • Data Exchange – answer target queries using local data • Target data is materialized • Query is answered with target data

Mapping Creation: Example • Leverage schema matches • Manually entered or • Automatically discovered • Even in 1999 matchers already gave good quality results • Preserve data semantics • Discover data associations • Leverage nesting • Leverage constraints • Mine for approximate constraints • Use workload • Model incompleteness • Generate new values for data exchange • Produce correct grouping

Schema Matching • Graphical User Interface • DBA interactively specifies • Automatic Discovery • Attribute (Element) Classifier • Extensible to • Other Schema Matchers • Based on syntactic information • Within schema or data Creating Correspondences Well understood even in 2000 [Rahm, Bernstein VLDBJ01] Most Produce (Binary) Correspondences

Mapping Specification (History) grant cid gid amount project funding gid proj aid financial aid amount date Local Global Global-as-View (GAV) create view funding(gid, proj, aid) as ( select gid, project, NULL from grant ) create view financial(aid, date, amount) as ( select NULL, NULL, amount from grant ) What is total funding (amt) for “Clio” project? select sum(amount) from funding U, financial I where U.aid = I.aid and proj = ‘Clio’ Local-as-View (LAV) [LRO VLDB96] create view grant(cid, gid, amount, project) as ( select NULL, gid, amount, proj from funding, financial where funding.aid = financial.aid )

Incompleteness grant cid gid amount project funding gid proj aid financial aid amount date Global-as-View (GAV) create view funding(gid, proj, aid) as ( select gid, project, SK(gid,proj,amount) from grant ) create view financial(aid, date, amount) as ( select SK(gid,proj,amount), NULL, amount from grant ) select sum(amount) from funding U, financial I where U.aid = I.aid and proj = ‘Clio’ • LAV explicitly model global incompleteness • GAV: can “invent” values • Now global has data that was not in local

Multiples Sources grant cid gid amount project loan lid lname funding gid proj aid financial aid amount date Local1 Local2 GAV LAV: views are independent create view grant(gid, amount, project) as ( select gid, amount, proj from funding, financial where funding.aid = financial.aid ) create view funding(gid, proj, aid) as ( select gid, project, NULL from grant union select lid, lname, NULL from loan …) create view loan(lid, lname) as ( select gid, proj from funding)

Type Conversion grant cid gid amount project funding gid proj aid financial aid amount date Global proj is ProjID Local project is ProjID:ProjName Global-as-View (GAV): easy create view funding(gid, proj, aid) as ( select gid, substring(project,1,3), NULL from grant ) create view financial(aid, date, amount) as ( select NULL, NULL, amount) from grant ) Local-as-View (LAV): for query answering need to invert view and invert any fcts of view create view grant(cid, gid, amount, project) as ( select NULL, gid, amount, concat(proj,’unknownProjName’) from funding, financial where funding.aid = financial.aid )

LAV • Main innovation of LAV • Model incompleteness in Global • Local provides some, but not all data in target • Some advantage for mapping multiple local sources • Mapping expression for each source can be specified separately • Disadvantages • Basic things that were simple in GAV are now hard • Type conversions • User defined functions • Fundamentally assumes one schema is Global and each relation of Local(s) is mapped independently into Global • Query answering now requires answering queries using views • I.e., view inversion • Get unintuitive query results • select * from funding • always gives empty results independent of source data

Mapping Specification: Example • What do we need? • Source query • Join of company and grant • Can be used to populate • Target query • Join of organization (with nested fundings) and financial • Clio pioneered used of • Source-to-target tuple generating dependencies • GLAV mappings

Mapping Generation • Mapping: xQs(x)  y Qt(x, y) • Query on source contained in query on target (Qs  Qt) • Mapping discovery becomes • Query discovery (of Qs and Qt) • How do we find source (Qs) and target (Qt) queries? • Use chase [Maier, Mendelzon, Sagiv 79] to find connections within the schemas • Originally defined to solve inference problem for relational dependencies • We use it to generate possible alternative representations of information (associations) [Miller, Hernandez, Haas, VLDB00] • Generalized to nested-relational model [Popa et al, VLDB02] • Generalized to discover grouping and correlation semantics [Fuxman et al, VLDB 06]

Interpreting Correspondences statDB: Set of Rcd cityStat: Rcd orgs: Set of Rcd org: Rcd cid name fundings:Set of Rcd funding: Rcd gid proj aid financials: Set of Rcd financial: Rcd aid date amount city expenseDB: Rcd companies: Set of Rcd company: Rcd cid name city grants: Set of Rcd grant: Rcd cid gid amount project What semantics do we associate to an arrow? Good enough for one arrow ! cidexpenseDB.companiescidstatDB.cityStat.orgs Still works for these two arrows! cid,nameexpenseDB.companiescid,namestatDB.cityStat.orgs How about now? gidexpenseDB.grantsgidstatDB.cityStat.orgs.fundings

Associations statDB: Set of Rcd cityStat: Rcd city orgs: Set of Rcd org: Rcd cid name fundings:Set of Rcd funding: Rcd gid proj aid financials: Set of Rcd financial: Rcd aid date amount expenseDB: Rcd companies: Set of Rcd company: Rcd cid name city grants: Set of Rcd grant: Rcd cid gid amt sponsor project company(CID,N,C) cityStat(C,Os),Os(CID,N) company(CID,N,C),grant(CID,GID,A,S,P) cityStat(C,Os),Os(CID,N), funding(GID,P,AID), financial(AID,D,A) Red Arrows are referential constraints like keyrefs in XML

Associations statDB: Set of Rcd cityStat: Rcd city orgs: Set of Rcd org: Rcd cid name fundings:Set of Rcd funding: Rcd gid proj aid financials: Set of Rcd financial: Rcd aid date amount expenseDB: Rcd companies: Set of Rcd company: Rcd cid name city grants: Set of Rcd grant: Rcd cid gid amt sponsor project M1 company(CID,N,C)  C’,CID’,Os cityStat(C’,Os),Os(CID’,N) company(CID,N,C),grant(CID,GID,A,S,P) cityStat(C,Os),Os(CID,N), funding(GID,P,AID), financial(AID,D,A)

Associations statDB: Set of Rcd cityStat: Rcd city orgs: Set of Rcd org: Rcd cid name fundings:Set of Rcd funding: Rcd gid proj aid financials: Set of Rcd financial: Rcd aid date amount expenseDB: Rcd companies: Set of Rcd company: Rcd cid name city grants: Set of Rcd grant: Rcd cid gid amt sponsor project M2 M1 company(CID,N,C)  C’,CID’,Os cityStat(C’,Os),Os(CID’,N) company(CID,N,C),grant(CID,GID,A,S,P)  C’,CID’,Os, P’,AID,D cityStat(C’,Os),Os(CID’,N), funding(GID,P’,AID’), financial(AID’,D,A)

Alternative Mappings statDB: Set of Rcd cityStat: Rcd orgs: Set of Rcd org: Rcd cid name fundings:Set of Rcd funding: Rcd gid proj aid financials: Set of Rcd financial: Rcd aid date amount city expenseDB: Rcd companies: Set of Rcd company: Rcd cid name city grants: Set of Rcd grant: Rcd cid gid amount project sponsor • Grants may be associated with companies in multiple ways • Association 1: grants⋈companies join on cid = cid • Association 2: grants⋈companies join on sponsor = cid • We do not make the one flavor assumption (URA)

Schema Mapping • Enumerate ALLassociations consistent with schema semantics • Constraints • Nesting (schema structure) • May additionally use • Data (statistical models of data) • Query Workload • Interpret correspondences (arrows) over pair source & target associations • Consider maximal set of correspondences covered by a pair of associations

Features of Algorithm • Generate associations • Modified dependency chase algorithm • Nested schemas • Broad class of (nested) referential constraints • Including (nested) foreign keys (NRI - Popa00) • Consider both source and target constraints • Algorithm is complete: generate all associations (in the UR sense) for acyclic dependencies • Schema mapping • Set of (alternative) inter-schema dependencies

What is a Mapping? • Mapping • Source query - join on company and grant • company(C,N),grant(C,G,A) • Target query • organizations(O,M,Fid),funding(Fid,I,Aid),financials(Aid,B) • Matching • company.C --> org.O • company.N --> org.M • grant.G --> funding.I • grant.A --> financials.B • C,N,G,A company(C,N),grant(C,G,A)  Aid,Fid organizations(C,N,Fid),funding(Fid,G,Aid),financials(Aid,A)

What is a schema mapping? • Foundations of Clio’s schema mapping language • Tuple-generating dependencies (TGDs) • TGDs are expressions of the form • ∀ x,z ϕ(x,z) →∃ y ψ(x, y) whereϕ(x,z), ψ(x, y) are conjunctions of atomic formulas • Foundations • Relational Dependency Theory • Extensive study of integrity constraints in relational databases in the 1970s and 1980s(Codd, Fagin, Beeri, Vardi et al) • TGDs emerged as an important class of constraints with a balance between high expressive power and good algorithmic properties. Include: • Inclusion Dependencies • SQL Referential Constraints

Where are we? • Schema Mapping Discovery • Data Exchange • Given a schema mapping • Material a single target instance • Determine “best” instance to materialize

Data Exchange <statisticsDB> { FOR $x0 IN /expenseDB/grant, $x1 IN /expenseDB/project, $x2 IN /expenseDB/company WHERE $x2/cid/text() = $x0/cid/text() $x0/project/text() = $x1/name/text() RETURN <cityStatistics> { FOR $x0L1 IN /expenseDB/grant, $x1L1 IN /expenseDB/project, $x2L1 IN /expenseDB/company WHERE $x2L1/cid/text() = $x0L1/cid/text() $x0L1/project/text() = $x1L1/name/text() $x2/city/text() = $x2L1/city/text() RETURN <organization> <cid> { $x0L1/cid/text() } </cid> <cname> { $x2L1/name/text() } </cname> { FOR $x0L2 IN /expenseDB/grant, $x1L2 IN /expenseDB/project, $x2L2 IN /expenseDB/company WHERE $x2L2/cid/text() = $x0L2/cid/text() $x0L2/project/text() = $x1L2/name/text() $x2L1/name/text() = $x2L2/name/text() $x2L1/city/text() = $x2L2/city/text() $x0L1/cid/text() = $x0L2/cid/text() RETURN <funding> ………………………………..

Data Exchange Query (source): • unnest • join Query (target): • nest • split • create new values Schema Mapping 3: Data translation Source Associations (logical views) Target Associations (logical views) Attribute-Attribute correspondences Source schema S Target schema T 1: Generate mappings 2: Select subset mappings

Query Semantics • Data Integration • Query is answered using source data • Answer is set of tuples in query result on ALL possible target instances: certain answers • Data Exchange • Query is answered using ONE materialized target • Can single target give same information as source(s)? • Is query result the same in both settings? • Fagin, Kolaitis, M-, Popa ICDT03 and TCS05

Data Exchange Issues • Use queries for data translation • Xquery (XSLT) or SQL • Translate logical mappings into queries • Two parts, conceptually: • Unnest (join) the source association • Nest (partition) according to the shape of the target association • Skolemization algorithm: the heart of query generation • Achieves a good nesting (grouping) • Generates new values (ids) • Not to many ids, but not too few either !

Query Generation = Sk4[name,gid,amt] = Sk4[name,gid,amt] statDB: Set of Rcd cityStat: Rcd city orgs: Set of Rcd org: Rcd cid name fundings:Set of Rcd funding: Rcd gid proj aid financials: Set of Rcd financial: Rcd aid date amount • Correspondences map only into some of the atomic attributes • We use Skolem functions to control the creation of the other elements • sets (this controls how we group elements in the target) • atomic values (this enforces the integrity of the target) expenseDB: Rcd companies: Set of Rcd company: Rcd cid name city grants: Set of Rcd grant: Rcd cid gid amount sponsor project = Sk2[] M2 = Sk1[name] = Sk3[name]

Correctness Guarantees • Information preservation: • Map no less and no more data than was intended to be mapped • In data exchange terms, for relational schemas we create • universal solution, (in fact a canonical universal solution) • Target integrity constraints satisfied!! • Target nested referential constraints Tc • Generate an It whereIt╞Tc

Data Exchange Transform data structured under a source schema into data structured under a different target schema. Source Schema Target Schema MST S T J I

Solutions in Schema Mappings M M ST ST Definition: Schema MappingM = (S,T, ) If I is a valid source instance then a solution forI is a valid target instanceJ such that <I, J > satisfy Fact: In general, for a given source instance I, • No solution for I may exist (Moverspecifies). or • Multiple solutions for I may exist; in fact, infinitely many solutions for I may exist (Munderspecifies). • When more than one solution exist, which solutions are “better” than others? • How do we compute a “best” solution? • In other words, what is the “right” semantics of data exchange? [Fagin, Kolaitis, M-, Popa, 2003]

Data Exchange: Algorithmic Problems • The existence-of-solutions problemSol(M): (decision problem) Given a source instance I, is there a solution J for I? Trivial if T has no constraints. Undecidable in general. Polynomial time algorithm for weakly acyclic sets of dependencies*. • The data exchange problem associated withM: (function problem) Given a source instance I, construct a solution J for I, provided a solution exists. We introduced universal solutions as best solution for data exchange. Polynomial time algorithm to create a universal solution for weakly acyclic sets of dependencies*. *Popa & Deustch

Universal Solutions M Schema S Schema T J I Universal Solution h2 Homomorphisms h1 h3 J2 J1 J3 Solutions

Query Answering in Data Exchange M Schema S Schema T q J* I Question: What is the semantics of target query answering? Definition: The certain answers of a query q over T on I certain(q,I) = ∩{ q(J): J is a solution for I }. Note: This is the standard semantics in data integration. Importantly: universal solutions can be used to compute certain answers for conjunctive queries; and these are the only solutions with this property: For a universal solution J*: certain(q,I) = q(J*)

How do we create J* Mapping Generation Source schema S Target schema T Schema Mapping “conforms to” “conforms to” data Transformation Code • Target instance must be a universal solution J* • How to we create J*? We developed two ways: • Data chase: chase instance I with mapping and target constraints • Generate queries that use Skolem values for missing data • SQL/XQuery/XSLT Theory Practice (IBM’s Infosphere Product)

Clio • Can map between any combination of • XML schemas/DTDs/Relational schemas • The generated queries are XQuery + Skolem fcts • (or XSLT or SQL) • Tested Clio with many complex real data: • LifeScience schemas (private and public GeneX, SwissProt) • Amalgam integration benchmark, DBLP, Mondial, several others,… • Clio technology already transferred into DB2 V8 • IBM’s Rational Data Architect

ToMAS: Mapping Adaptation • Mapping Adaptation (Velegrakis et al, VLDB 03) • When schemas change, mappings may become invalid or inconsistent. • Old mappings (views and queries) need to be rewritten for the new schema • ToMAS: Provides tool support to reduce the required user effort • Detects & updates mappings affected by the schema and matching changes • VLDB03, VLDBJ04, ICDE04 (demo) Source Schema S Target Schema T mappings X New Source Schema S’ New Target Schema T’ ?

Data Viewer • Explain and debug mappings using data • Relational data viewer [SIGMOD01] • XML data viewer (MUSE) [ICDE08,SIGMOD08] • Key idea • Provide a data example that illustrates the alternative interpretations in a compact way • Use data to elicit additional semantics

Clio Schema Mapping & Data Exchange