410 likes | 574 Vues
ICDE2014. April, 1st. Floris Geerts ( University of Antwerp ) Giansalvatore Mecca, Donatello Santoro ( Università della Basilicata ) Paolo Papotti ( Qatar Computing Research Institute ). Overview. Motivations and Goals. Semantics. Experimental Results. Overview.
E N D
ICDE2014 April, 1st FlorisGeerts (University of Antwerp) Giansalvatore Mecca, DonatelloSantoro (Universitàdella Basilicata) Paolo Papotti (Qatar Computing Research Institute)
Overview • Motivations and Goals • Semantics • Experimental Results
Overview • Motivations and Goals • Semantics • Experimental Results
Schema Mapping System A Mapping and Cleaning Task STRONGLY INTERRELATED PROBLEMS Data Cleaning Tools
A Motivating Example Source #1 (Confidence 0.7) Target Source #2 (Confidence0.5) Source #3 (Confidence1.0)
Step1: To exchange data from source to target A Motivating Example Source #1 (Confidence 0.7) Target Source #2 (Confidence0.5) Source #3 (Confidence1.0)
A Motivating Example Source #1 (Confidence 0.7) ST-TGD [Popa et al., VLDB’02] Schema Mappingstrasformation can be expressedas a set of source to target tuplegeneratingdependencies(st-tgds) Target Source #2 (Confidence0.5) Source #3 (Confidence1.0)
A Motivating Example Source #1 (Confidence 0.7) ST-TGD Target Source-to-Target TGD MedTreat(ssn, n, p, s, c, i, t, d) → ∃Y3, Y4 : Cust(ssn, n, p, 0.7, s, c, Y3), Treat(ssn, Y4, i, t, d)
A Motivating Example Source #2 (Confidence0.5) ST-TGD Pre-Solution for the TGDs Target Source-to-Target TGD Pat(ssn, n, p, s, c), Surg(ssn, i, t, d) → ∃Y3, Y4 : Cust(ssn, n, p, 0.5, s, c, Y3), Treat(ssn, Y4, i, t, d)
Step2: To ensure Data Quality A Motivating Example Source #1 (Confidence 0.7) Target Source #2 (Confidence0.5) Source #3 (Confidence1.0)
A Motivating Example Functional Dependencies ST-TGD FD ID CFD ER fd1. Cust: SSN→ Name, Phone, Str, City, CC# fd2. Cust: Name, Str, City → SSN fd3. Treat: SSN → Salary Inclusion Dependencies id4. Treat[SSN] ⊆Customers[SSN] Conditional Functional Dependencies cfd5. Treat: Insur[‘Abx’] → Tr[‘Dental’] cfd6. IF Treat:Insur[‘Abx’] THEN Cust: City[‘SF’] Editing Rules er7. IF Cust.SSN = MD.SSN, Cust.Phone = MD.Phone → TAKE Name, Streetfrom MD
A Motivating Example Functional Dependencies VIOLATIONS ST-TGD FD ID CFD ER fd1. Cust: SSN→ Name, Phone, Str, City, CC# fd2. Cust: Name, Str, City → SSN fd3. Treat: SSN → Salary Inclusion Dependencies id4. Treat[SSN] ⊆Customers[SSN] Conditional Functional Dependencies cfd5. Treat: Insur[‘Abx’] → Tr[‘Dental’] cfd6. IF Treat:Insur[‘Abx’] THEN Cust: City[‘SF’] Editing Rules er7. IF Cust.SSN = MD.SSN, Cust.Phone = MD.Phone → TAKE Name, Streetfrom MD
PreviousSemantics? A Motivating Example Source #1 (Confidence 0.7) Target Source #2 (Confidence0.5) Source #3 (Confidence1.0)
Data Exchange [Faginet al., TCS ’05] ST-TGD ID FD • Elegantsemantics • Scalable algorithms CFD ER fd1. Cust: SSN→ Name, Phone, Str, City, CC# Soft Violation Hard Violation
Data Repairing INTERACTION! FD ID CFD ER TGD Hard Violation • Manyapproachesand techniques[Bohannon SIGMOD ’05] [Cong VLDB ’07] [KolahiICDT ’09] [Fan VLDB ’10] [Beskales VLDB ’10] • No support for mapping • No way to handleourexample • Main-memoryimplementationonly!
Pipeline • Negative Result: There exist scenarios such that pipeline doesn’t return solutions • Even when it works, its quality is usually poor Data Repairing Data Exchange ✔ Mappings ✔ Cleaning Rules ✔ Cleaning Rules ✗ Cleaning Rules ✔ Mappings ✗ Mappings
Pipeline Target Source #1 (Confidence 0.7) Source #2 (Confidence0.5)
Pipeline Target Source #1 (Confidence 0.7) Source #2 (Confidence0.5) PreSolution for TGDs
Pipeline Target Source #1 (Confidence 0.7) Source #2 (Confidence0.5) 123 fd2. Cust: Name, Str, City → SSN
Pipeline Target Source #1 (Confidence 0.7) Source #2 (Confidence0.5) 123
Contributions A Uniform Framework for Type 1 Type 2 Type 3 MD MD Schema Mapping Scenarios Data Repairing Scenarios Mapping and Cleaning Scenarios ST-TGD ID FD FD CFD ER TGD With a fast and general-purpose chase engine FD ID CFD ER
Overview • Motivations and Goals • Semantics • Experimental Results
Llunatic Data Repairing [Geerts et al., VLDB ‘13] • An extension of the data-repairing framework • Let’s see a quick summary… Partial Order Cell Groups LLUNs Upgrades
Llunatic Data Repairing PREFERRED VALUE [Geerts et al., VLDB ‘13] • The Partial Order Π • Elegantway to model preferencerules • Standard preferencerulesOrderingattribute • No order
Llunatic Data Repairing [Geerts et al., VLDB ‘13] • The Partial Order Π • Elegantway to model preferencerules • LLUNs • a new class of symbols • placeholders used to mark conflicts L0
Llunatic Data Repairing [Geerts et al., VLDB ‘13] • The Partial Order Π • Elegantway to model preferencerules • LLUNs • a new class of symbols • Cell Groups • Represent the set of changes Sky • 122-1876 g1 = <122→ {t4.phn, t5.phn} > g2= <Sky→ {t4.str, t5.str} by {tm.strauth}>
Upgrades • Upgrade: an improvementover J, sinceitcontainsbettervaluewrtΠ CardinalityMinimal Update 1 Update 2 Update 3 Update 4 Update 5 • g1 <L0→ {t4.cc, t5.cc}> g3<555→ {t4.cc, t5.cc}> g4 <777→ {t4.ssn}> g2 <L1→ {t4.ssn}> g5 <333→ {t4.cc, t5.cc}> Forward Backward Upgrades J Not an upgrade e1. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n’, ph’, c’, cc’) → cc = cc’
Upgrades over generalization Update 6 g6 <L2→ {allcells}> Update 1 Update 2 • g1 <L0→ {t4.cc, t5.cc}> g2 <L1→ {t4.ssn}> Forward Backward MinimalSolutions J e1. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n’, ph’, c’, cc’) → cc = cc’
ST-TGDs T-TGDs User Inputs + = Non trivial extension!
Mapping and Cleaning Scenario M&C Scenario M={S, Sa,T,Σt,Σe,Π, User} • S: source schema, Sa: authoritative source tablesT: target schema, Σt: TGDs, Σe: EGDs • Π: the partial order specification • User: a partial function to abstract user interaction • Solution: Given M, an instance I of S, and an instance J of T, a solution is an instance J’ such that: • it is a repair, i.e., “I and J’ satisfy Σt∪ Σe” • and “J’ is an upgradeof J according to Π”
How to handle TGDs Target Source #1 (Confidence 0.7) KEY INTUITION • We model it in terms of cell groups and updates m1: MedTreat(ssn, n, p, s, c, i, t, d) → ∃Y3, Y4 : Cust(ssn, n, p, 0.7, s, c, Y3), Treat(ssn, Y4, i, t, d) g1 = <124→ {t8.ssnnew, t9.ssnnew} by {t1.ssn}> • we do not disrupt key – fkey equality in the following g2= <W. Smith→ {t8.namenew} by {t1.name}> ... new cells
UserInputs • In the presence of inconsistencies user inputs are crucial. User may • change the value of a cell group • refuse a cell group • We model user interaction using a partial function over cell groups 555 g1 <123→ {t4.ph, t5.ph}> g2 <L1→ {t4.ssn}>
Non trivial extension • Data cleaning semantics has some nice properties • scenario C always has a solution for <I, J> • the chase always terminates (it never fails) • Adding TGDs and User Inputs • concept of upgrade change significantly • requires to completely rework upgrades
Upgrades • Must take into account many issues • some target cells are “better” than others • source cells may be authoritative • compare instances with different new values • compare instances with different number of tuples • some cells may be changed by users
A Few Results • Conservative extension of the data exchange • Every (core) solution of a data exchange scenario corresponds to a (minimal) solution of its associated mapping scenario, and vice versa • Given a MC scenario, if Σtis a set of weakly-acyclictgds, then the chaseterminates • in essencewemay re-use terminationconditions for data exchange
Overview • Motivations and Goals • Semantics • Experimental Results
Chase Tree Differentorders of application givedifferentresults • Chase algorithmfor chasingegds and tgds J e1, b2 e0, b1 e0, f e1, f e0, b2 e1, b1 R1 R2 R3 R10 R11 R12 e1, b1 e0, b1 e1, f e1, b2 e0, f e0, b2 R4 R5 R6 R13 R14 R15 the e0-e1 sequence the e1-e0 sequence
ScalabilityTechniques • Chase implementationbased on equivalenceclasses • Delta Databases • a representationsystem for chasetrees • Costmanagers • pluggablestrategies to prune the chasetree
Scalability Llunatic-FR-S5 Llunatic-FR-S1 Llunatic-FR-S10 Llunatic-FR-S50 Doctors-MC sec.
Quality of Repairs Llunatic-FR-S1 Pipeline Hospital-MC Norm max. rep-rate(Rep, DBexp) 5k, 6%-10% 10k, 6%-10% 25k, 6%-10%