Nested Mappings: Schema Mapping Reloaded

Clio
P. Papotti (Università Roma Tre)
M.A. Hernandez, H. Ho, L. Popa (IBM Almaden Research Center)
A. Fuxman, R.J. Miller (University of Toronto)

The Problem of Mapping Generation
• Schemas can be arbitrarily different
  • E.g., different normalization and naming, missing/extra elements
• Input: correspondences between atomic schema elements
  • (Automatic discovery)
• Mappings: logical and declarative expressions of relationships between schemas
  • Abstraction for data interoperability tasks
  • Simpler than actual implementations of data exchange (SQL/XQuery/XSLT)
• Must generate a transformation that:
  • Preserves data relationships: pname-dname, pname-ename, etc.
  • Creates new target values (pid)
  • Produces “correct” groupings

Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti

Outline
• Schema mapping generation
  • [VLDB’02] Fagin, Hernandez, Popa (IBM Almaden), Miller, Velegrakis (Univ. of Toronto)
• From basic to nested:
  • Issues with basic mappings
  • Nested mappings and their advantages
  • Generation algorithm
  • Performance impact
• Conclusion
  • Related work
  • Future directions

Schema Mapping Generation
[Figure: source schema S and target schema T linked by schema correspondences; source concepts and target concepts (relational views) on each side, with mappings between them]
• Step 1. Extraction of “concepts” (in each schema).
  • Concept = one category of data that can exist in the schema
• Step 2. Mapping generation
  • Enumerate all non-redundant maps between pairs of concepts

Example
Source schema:
  proj: Set [
    dname
    pname
    emps: Set [ ename, salary ]
  ]
Target schema:
  dept: Set [
    dname
    budget
    emps: Set [ ename, salary, worksOn: Set [ pid ] ]
    projects: Set [ pid, pname ]
  ]
• m1 maps proj to dept-projects (the concept of “project of a department”)
  m1: ∀(p0 in proj) → ∃(d in dept)(p in d.projects)
        p0.dname = d.dname ∧ p0.pname = p.pname
• m2 maps proj-emps to dept-emps-worksOn-projects (the concept of “project of an employee of a department”)
  m2: ∀(p0 in proj)(e0 in p0.emps) → ∃(d in dept)(p in d.projects)(e in d.emps)(w in e.worksOn)
        w.pid = p.pid ∧ p0.dname = d.dname ∧ p0.pname = p.pname ∧ e0.ename = e.ename ∧ e0.salary = e.salary
  • The target side of m2 is the expression for dept-emps-worksOn-projects
• Two ‘basic’ mappings (or source-to-target tgds or GLAV formulas)

Issue 1: Many Small Uncorrelated Formulas
[Figure: the proj (source) and dept (target) schemas of the running example, linked by m1 and m2]
• m1: “for every proj tuple there must be dept and project tuples such that …”
• m2: “for every emp of a proj tuple there must be: dept, emp, worksOn, project …”
• If we also had dependents under employees, then: “for every dependent of an emp of a proj …” and so on …
• There is a lot of common mapping behavior that is repeated
  • E.g., m2 repeats the mapping behavior of m1 (although for a “subconcept”)

Issue 2: Redundancy in the Generated Data
Input (proj: Set [ dname, pname, emps: Set [ ename, salary ] ]):
  CS  uSearch  { (Alice, 120K), (John, 90K) }
Possible output (dept: Set [ dname, budget, emps: Set [ ename, salary, worksOn: Set [ pid ] ], projects: Set [ pid, pname ] ]):
  Required to exist based on m1:
    CS  B1  { }                        { (X1, uSearch) }
  Required to exist based on m2:
    CS  B2  { (Alice, 120K, { X2 }) }  { (X2, uSearch) }
    CS  B3  { (John, 90K, { X3 }) }    { (X3, uSearch) }
• m2 repeats the mapping behavior of m1:
  • “duplicate” dept and project tuples
  • “duplicate” nulls (pid values: X2 and X3, and budget values)
• Moreover, this duplication happens for each joining emp tuple in the source

Issue 3: No Grouping in the Target
Input (proj: Set [ dname, pname, emps: Set [ ename, salary ] ]):
  CS  uSearch  { (Alice, 120K), (John, 90K) }
Possible output (dept: Set [ dname, budget, emps: Set [ ename, salary, worksOn: Set [ pid ] ], projects: Set [ pid, pname ] ]):
  Required to exist based on m1:
    CS  B1  { }                        { (X1, uSearch) }
  Required to exist based on m2:
    CS  B2  { (Alice, 120K, { X2 }) }  { (X2, uSearch) }
    CS  B3  { (John, 90K, { X3 }) }    { (X3, uSearch) }
  Tuple with both employees in a single emps set, as drawn on the slide:
    CS  B2  { (Alice, 120K, { X2 }), (John, 90K, { X3 }) }  { (X3, uSearch) }
• Alice and John are in different singleton sets (E and E’)
• There can be as many singleton sets as emp tuples in the source nested set
• It is desirable to enforce the grouping on the target data
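The redundancy and the missing grouping can be reproduced with a small sketch that applies m1 and m2 independently, inventing fresh nulls for budget and pid each time a formula fires. This is only an illustration under assumptions: the dict/list encoding, the helper names `apply_m1` and `apply_m2`, and the `B<i>`/`X<i>` null naming are hypothetical, and the sketch is not the data exchange procedure used by the authors.

```python
from itertools import count

# Source instance from the slides: one proj tuple with two employees.
proj = [{"dname": "CS", "pname": "uSearch",
         "emps": [{"ename": "Alice", "salary": "120K"},
                  {"ename": "John", "salary": "90K"}]}]


def apply_m1(proj, ids):
    """One dept tuple (with one project) per proj tuple, using fresh nulls."""
    out = []
    for p0 in proj:
        i = next(ids)
        out.append({"dname": p0["dname"], "budget": f"B{i}", "emps": [],
                    "projects": [{"pid": f"X{i}", "pname": p0["pname"]}]})
    return out


def apply_m2(proj, ids):
    """One dept tuple per (proj, emp) pair: m2 knows nothing about m1,
    so it invents its own dept, budget and pid for every employee."""
    out = []
    for p0 in proj:
        for e0 in p0["emps"]:
            i = next(ids)
            out.append({"dname": p0["dname"], "budget": f"B{i}",
                        "emps": [{"ename": e0["ename"],
                                  "salary": e0["salary"],
                                  "worksOn": [{"pid": f"X{i}"}]}],
                        "projects": [{"pid": f"X{i}",
                                      "pname": p0["pname"]}]})
    return out


ids = count(1)  # shared counter so that every null is distinct
target = apply_m1(proj, ids) + apply_m2(proj, ids)
budgets = [d["budget"] for d in target]
emp_set_sizes = [len(d["emps"]) for d in target]
```

Here `target` holds three dept tuples for a single source project, each with its own budget and pid nulls, and no emps set contains more than one employee.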

Summary of issues
• Fragmentation of the specification
  • (Too) many small tgds
• Fragmentation of the data
  • Generate redundant data (which later needs to be removed or fused)
  • No grouping enforced on the target data (need additional phase to enforce any grouping)

Idea
[Figure: the proj (source) and dept (target) schemas of the running example, linked by m1 and m2]
• We would like to reuse (in m2) the “dept” and “project” tuples that the simpler mapping m1 asserts.
  • Make m2 assert only the “extra” information
  • Also accumulate the corresponding employees into one set
• Idea: Correlate the mapping formulas based on their common part

Correlating Mapping Formulas
m1: ∀(p0 in proj) → ∃(d in dept)(p in d.projects)
      p0.dname = d.dname ∧ p0.pname = p.pname
m2: ∀(p0 in proj)(e0 in p0.emps) → ∃(d in dept)(p in d.projects)(e in d.emps)(w in e.worksOn)
      w.pid = p.pid ∧ p0.dname = d.dname ∧ p0.pname = p.pname ∧ e0.ename = e.ename ∧ e0.salary = e.salary
Replace with:
n: ∀(p0 in proj) → ∃(d in dept)(p in d.projects)
     p0.dname = d.dname ∧ p0.pname = p.pname ∧
     [ ∀(e0 in p0.emps) → ∃(e in d.emps)(w in e.worksOn)
         w.pid = p.pid ∧ e0.ename = e.ename ∧ e0.salary = e.salary ]
• This is a nested mapping
  • proj tuples are mapped only once
  • The bracketed part is a submapping, correlated to the parent mapping
• For every proj tuple, we map all employees, as a group.
  • (Source grouping is preserved)
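A minimal sketch of how the nested mapping n could be applied: the outer loop creates the dept and project tuples once per proj tuple, and the inner loop (the submapping) reuses them while collecting all employees into one set. The dict/list encoding, the function name `apply_nested`, and the `B<i>`/`X<i>` null naming are assumptions made for illustration only.

```python
from itertools import count

# Source instance from the slides: one proj tuple with two employees.
proj = [{"dname": "CS", "pname": "uSearch",
         "emps": [{"ename": "Alice", "salary": "120K"},
                  {"ename": "John", "salary": "90K"}]}]


def apply_nested(proj):
    """Apply the nested mapping n: map each proj tuple once, then map its
    employees as a group inside the same dept tuple."""
    ids = count(1)
    dept = []
    for p0 in proj:
        i = next(ids)
        pid = f"X{i}"  # one fresh pid null per proj tuple
        # Submapping: correlated to the parent through d and p (here: pid).
        emps = [{"ename": e0["ename"], "salary": e0["salary"],
                 "worksOn": [{"pid": pid}]}
                for e0 in p0["emps"]]
        dept.append({"dname": p0["dname"], "budget": f"B{i}",
                     "emps": emps,
                     "projects": [{"pid": pid, "pname": p0["pname"]}]})
    return dept


target = apply_nested(proj)
names = [e["ename"] for e in target[0]["emps"]]
```

With the same input as before, this sketch yields a single dept tuple whose emps set contains both Alice and John, and both worksOn entries refer to the same pid null.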

Advantages of Nested Mappings
[Figure: source schema proj: Set [ dname, pname, emps: Set [ ename, salary ] ]]
• Nested tgds can exploit the natural hierarchy that exists on the concepts of a schema
  • e.g., proj-emps is a “subconcept” of proj, in the source schema
  • Map higher concept only once; use submappings for subconcepts
• Nested mappings are strictly more expressive: There is no set of source-to-target tgds that is equivalent to n.

Nesting Algorithm: Sketch
• Step 1. Discovery: construct a DAG of basic mappings based on the concept hierarchy
• Step 2. Correlation: construct nested mappings by traversing the DAG, starting from each root, and repeatedly applying the nesting step sketched earlier.
• Result: a forest of nested mappings

Nesting Algorithm: Example
Source schema:
  proj: Set of [
    dname
    pname
    emps: Set of [ ename, salary ]
  ]
Target schema:
  dept: Set of [
    dname
    budget
    emps: Set of [ ename, salary, worksOn: Set of [ pid ] ]
    projects: Set of [ pid, pname ]
  ]
[Figure: a DAG of basic mappings; concept labels shown: P, D, X, PE, DE, DP, DEPW]
Resulting nested mapping:
  for p in proj
  exists d’ in dept, p’ in d’.projects
  where d’.dname = p.dname and p’.pname = p.pname and          [P → DP]
    ( for e in p.emps
      exists e’ in d’.emps, w in e’.worksOn
      where w.pid = p’.pid and e’.ename = e.ename and e’.salary = e.salary )   [PE → DEPW]
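The two steps (discovery, then correlation) can be sketched as follows. This is a simplified illustration, not the authors' algorithm: it assumes each basic mapping is described only by the sets of generators of its source and target concepts, and it assumes a mapping can be nested under another when the parent's concepts are contained in the child's. The class `BasicMapping` and the functions `nestable`, `build_dag` and `nested_forest` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BasicMapping:
    name: str
    source: frozenset  # generators of the source concept
    target: frozenset  # generators of the target concept


def nestable(child, parent):
    """Assumed criterion: parent's concepts are contained in child's,
    with at least one containment being strict."""
    contained = parent.source <= child.source and parent.target <= child.target
    strict = parent.source < child.source or parent.target < child.target
    return contained and strict


def build_dag(mappings):
    """Step 1 (discovery): edges parent -> child, skipping edges that are
    implied by an intermediate mapping."""
    edges = {m.name: [] for m in mappings}
    for parent in mappings:
        for child in mappings:
            if not nestable(child, parent):
                continue
            if any(nestable(mid, parent) and nestable(child, mid)
                   for mid in mappings):
                continue
            edges[parent.name].append(child.name)
    return edges


def nest(name, edges):
    """Step 2 (correlation): recursively attach submappings to a mapping."""
    return {name: [nest(c, edges) for c in edges[name]]}


def nested_forest(mappings):
    edges = build_dag(mappings)
    has_parent = {c for children in edges.values() for c in children}
    return [nest(m.name, edges) for m in mappings if m.name not in has_parent]


m1 = BasicMapping("P->DP", frozenset({"proj"}),
                  frozenset({"dept", "dept.projects"}))
m2 = BasicMapping("PE->DEPW", frozenset({"proj", "proj.emps"}),
                  frozenset({"dept", "dept.projects", "dept.emps",
                             "dept.emps.worksOn"}))
forest = nested_forest([m1, m2])
```

For the running example, `forest` contains a single tree with P->DP at the root and PE->DEPW as its submapping. The sketch only decides which mapping nests under which; rewriting the formulas themselves (dropping the repeated generators and conditions from the submapping) is not shown.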

Experimental evaluation
• Goal: show empirically that nested mappings can dramatically:
  • reduce the cost of producing a target instance
  • improve the quality of the generated data
• DBLP-like schema, on both source and target, with four levels of nesting/grouping:
  • authors – level 1
  • conferences – level 2
  • years – level 3
  • publications – level 4
• Mappings are implemented by generating queries (in XQuery)
  • Qbasic based on basic mappings
  • Qnested based on nested mappings

Example Queries – 2 Levels Only

Qbasic:

    let $doc0 := fn:doc("instance.xml")
    return
    <authorDB>
      {
        (: query term for the author-conf basic mapping :)
        for $x0 in $doc0/authorDB/author,
            $x1 in $x0/conf
        return
          <author>
            <name> { $x0/name/text() } </name>
            {
              (: re-grouping: scan the whole document for confs of the same author :)
              for $x0L1 in $doc0/authorDB/author,
                  $x1L1 in $x0L1/conf
              where $x0/name/text() = $x0L1/name/text()
              return
                <conf>
                  <name> { $x1L1/name/text() } </name>
                </conf>
            }
          </author>
      }
      {
        (: query term for the author basic mapping :)
        for $x0 in $doc0/authorDB/author
        return
          <author>
            <name> { $x0/name/text() } </name>
          </author>
      }
    </authorDB>

• Multiple query terms (one per basic mapping)
• Need re-grouping (over entire data)
• Generate duplicates

Qnested:

    let $doc0 := fn:doc("instance.xml")
    return
    <authorDB>
      {
        for $x0 in $doc0/authorDB/author
        return
          <author>
            <name> { $x0/name/text() } </name>
            {
              (: submapping: confs are taken directly from the current author :)
              for $x1 in $x0/conf
              return
                <conf>
                  <name>{ $x1/name/text() }</name>
                </conf>
            }
          </author>
      }
    </authorDB>

• Single pass over the data
• No duplicates

Execution time comparison
[Chart: Qbasic execution time / Qnested execution time, logarithmic scale]
• Execution time for basic: 22 minutes
• Execution time for nested: 1.1 seconds

Output file size comparison
[Chart: Qbasic output file size / Qnested output file size, logarithmic scale]
• Size of generated data for basic (including duplicates): 45MB
• Size of generated data for nested: 552KB
• The nested mapping results in much more efficient execution with less redundant data
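As a quick check of the magnitudes behind the two comparison slides, the ratios can be computed from the reported figures. The sketch assumes that the quoted execution times and file sizes belong to the same data point and that 1 MB = 1000 KB; neither assumption is stated on the slides.

```python
# Reported execution times.
basic_seconds = 22 * 60   # 22 minutes for Qbasic
nested_seconds = 1.1      # 1.1 seconds for Qnested
speedup = basic_seconds / nested_seconds

# Reported output sizes (assumption: decimal units, 1 MB = 1000 KB).
basic_kb = 45 * 1000      # 45 MB for Qbasic, including duplicates
nested_kb = 552           # 552 KB for Qnested
size_ratio = basic_kb / nested_kb

print(f"Qbasic / Qnested execution time: about {speedup:.0f}x")
print(f"Qbasic / Qnested output size:    about {size_ratio:.1f}x")
```

Under these assumptions the execution time ratio is roughly three orders of magnitude and the output size ratio roughly two, which is consistent with the use of a logarithmic scale on both charts.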

Related work
• Both embedded mappings [Melnik et al. SIGMOD’05] and HePToX [Bonifati et al. VLDB’05] support nested data, but do not support nesting of mappings.
• Nested mappings are less general than languages used for composition [Fagin et al. PODS’04, Nash et al. PODS’05], but are more compact and easier to understand/program
• The generation algorithm identifies common expressions within mappings, in the same spirit as work in query optimization [e.g., Roy et al. SIGMOD’00].
  • But query optimization preserves query equivalence, while our techniques lead to mappings with better semantics (do not preserve query equivalence).
• There are already commercial tools that use similar paradigms (e.g., IBM Ascential DataStage TX), but most of the mapping generation work is manual.

Conclusion
• Nested tgds: better specification language for transformation
  • Use correlation (hierarchy) between concepts
  • Less redundancy in the output, more efficient
  • Naturally preserve source grouping
  • For more complex mappings we expose Skolem functions to let users alter the default grouping behavior
• Nested tgds are more compact and easier to understand/program
  • Humans think top-down: map top concepts, then submappings, etc.
  • Can be generated too!

Future Directions
• Extend existing solutions to use nested mappings
  • Data integration, mapping analysis and reasoning, schema evolution, etc.
  • Nested tgds are more complex as a logic formalism!
• Study the formal foundation of nested mappings
• More generally, develop methods for deciding when and why one schema mapping specification is “better” than another
  • Need to look at issues such as:
    • preservation of the source data (associations, correlations, etc.)
    • minimization of incompleteness
