Download
managing inconsistent data in data integration and data exchange n.
Skip this Video
Loading SlideShow in 5 Seconds..
Managing Inconsistent Data in Data Integration and Data Exchange PowerPoint Presentation
Download Presentation
Managing Inconsistent Data in Data Integration and Data Exchange

Managing Inconsistent Data in Data Integration and Data Exchange

272 Vues Download Presentation
Télécharger la présentation

Managing Inconsistent Data in Data Integration and Data Exchange

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. Managing Inconsistent Datain Data Integration and Data Exchange Renée J. Miller University of Toronto Periklis Andritsos, Ariel Fuxman Tasos Kementsietsidis, Yannis Velegrakis

  2. Outline • Schema Mapping – reconciling differences in schemas • Clio: Creating mappings (VLDB00,VLDB02) • Using semantics of schemas and data • ToMAS: Managing schema mappings (VLDB03) • Evolving schemas and semantics • Using Mappings • Data Exchange (ICDT03) • Querying Inconsistent Data (IJCAI/IIWeb03) • Data Mapping – reconciling differences in data • Hyperion: managing data mappings (SIGMOD03) • Using networks of P2P data mappings R.J. Miller - U. Toronto

  3. Q Mapping “conforms to” “conforms to” data data Mapping Independent Data Sources Source Schema S’’ Source Schema S’ data Source Schema S Target Schema T data • Data Integration – answer target queries using data from source(s) • Target data is virtual R.J. Miller - U. Toronto

  4. Q Mapping “conforms to” data Mapping Independent Data Sources Source Schema S Target Schema T “conforms to” data • Data Exchange – answer target queries answered locally • Target data is materialized R.J. Miller - U. Toronto

  5. Overview • Goal: interoperability between independent data sources • Creating Mappings • Managing Mappings – as sources change • Using Mappings – to query and exchange data • Even when data is dirty or inconsistent • Challenges • Schemas can be arbitrarily different • Still, data must not lose its meaning • Use semantics embedded in schemas & data • Facilitate specification of any additional semantics • Performed manually: complex user queries, programs, etc. • Hard to debug; understand; verify correctness R.J. Miller - U. Toronto

  6. Q Mapping “conforms to” Schema Mapping • Wants data from S • Understands T • May not understand S • XML Schema • DTD • Relational Source schema S Target schema T “conforms to” data • Automate (to the extent possible) the creation of mappings • Mappings used for (virtual) data integration or (materialized) data exchange R.J. Miller - U. Toronto

  7. Illustration: Mapping Creation • Support Nested Structures • Element correspondences • Human friendly • Automatic discovery • Preserve data meaning • Discover data associations • Use constraints & schema • Create New Target Values • Produce Correct Grouping R.J. Miller - U. Toronto

  8. Creating Correspondences • Graphical User Interface • DBA interactively specifies • Automatic Discovery • Attribute (Element) Classifier • Extensible to • Other Schema Matchers • VLDB J. 01 Survey • Correspondence based on syntactic information • Within schema or data R.J. Miller - U. Toronto

  9. Interpreting Correspondences statDB: Set of Rcd cityStat: Rcd orgs: Set of Rcd org: Rcd cid name fundings:Set of Rcd funding: Rcd gid proj aid financials: Set of Rcd financial: Rcd aid date amount city expenseDB: Rcd companies: Set of Rcd company: Rcd cid name city grants: Set of Rcd grant: Rcd cid gid amount project What semantics do we associate to an arrow? Good enough for one arrow ! cidexpenseDB.companiescidstatDB.cityStat.orgs Still works for these two arrows! cid,nameexpenseDB.companiescid,namestatDB.cityStat.orgs How about now ? gidexpenseDB.grantsgidstatDB.cityStat.orgs.fundings R.J. Miller - U. Toronto

  10. Associations btw Elements statDB: Set of Rcd cityStat: Rcd orgs: Set of Rcd org: Rcd cid name fundings:Set of Rcd funding: Rcd gid proj aid financials: Set of Rcd financial: Rcd aid date amount city expenseDB: Rcd companies: Set of Rcd company: Rcd cid name city grants: Set of Rcd grant: Rcd cid gid amount project • We must recognize that grants are associated to companies • Association (in the source): grants⋈ companies • Association (in the target):statDB⋈ orgs ⋈ fundings ⋈ financials R.J. Miller - U. Toronto

  11. Schema Mapping • Enumerate ALLlogical associations consistent with schema semantics • Constraints • Nesting (schema structure) • Data • Interpret correspondences (arrows) over pair source & target association R.J. Miller - U. Toronto

  12. Mappings as Views q • Views: st have a special form: • GAV:Qs(S)  Ti where Ti is a relation in T, Qs is a query on S • Company(C,N,Ct),Grant(C,G,A,P)  Projects(P,Ct) • Plain old view: create view projects (p, ct) as (select p,ct from …) • LAV:Si  Qt(T) where Si is a relation in S , Qt is a query on T • Company(C,N,City)  Org(C,N) City(C,Ct) Target schema T (Global) Source schema S (Local) t st virtual ! I J R.J. Miller - U. Toronto

  13. Clio Mappings q • Clio Schema Mapping: • Qs(S)  Qt(T) and constraints on S and T (s , t ) • More general than views • Generality often required when S, T are fixed • No design control Target schema T (Global) Source schema S (Local) t st Virtual or Materialized I J R.J. Miller - U. Toronto

  14. Using Mappings and Views q • Data Integration • The target is not materialized; it is just a querying interface • Queries are posed on the target schema; data is in the source. • Problem: how to answer the query in the “best” possible way • AKA: Answering queries using views • GAV/LAV (mostly) assumes conjunctive queries • (mostly) assumes no target constraints – target is a view • Uses relational (not nested relational) model Target schema T Source schema S t st virtual ! I J R.J. Miller - U. Toronto

  15. Using Mappings q • Data Exchange • The target is materialized • Queries are posed on the target schema; answered using target data • Problem: what is “best” instance to exchange • Mapping:Qs(S)  Qt(T) and constraints on S and T (s , t ) • Given instance of S there may be many instances of T • Which is best instance to exchange? • Grant(C,G,A,P,S) Funding(C,G,Aid),Financials(Aid,D,A) Target schema T (Global) Source schema S (Local) t st Materialize! I J R.J. Miller - U. Toronto

  16. Semantics of Query Answering • Answering queries using views • Query is answered using source data • Answer is set of tuples in query result on ALL possible target instances: certain answers • Data Exchange • Query is answered using ONE materialized target • Can single target give same information as source(s)? • Is query result the same in both settings? R.J. Miller - U. Toronto

  17. Mappings at Data Level Employee Salary Employee Salary Employee Salary Financial Target Database HumanResources • Mapping • Financial(e,s)  Global(e,s) • HumanRes(e,s)  Global(e,s) R.J. Miller - U. Toronto

  18. Data Inconsistency Employee Salary Employee Salary Employee Salary Financial Target Database HumanResources Mapping Financial(e,s)  Global(e,s) HumanRes(e,s)  Global(e,s) R.J. Miller - U. Toronto

  19. Reconciling Inconsistencies (I) Employee Salary Employee Salary Employee Salary 1 – Delete all tuples for John Financial Target Database HumanResources Mapping Financial(e,s)  Global(e,s) HumanRes(e,s)  Global(e,s) R.J. Miller - U. Toronto

  20. Reconciling Inconsistencies (II) Employee Salary Employee Salary Employee Salary 2 – Delete the salaries of John Financial Target Database HumanResources Mapping Financial(e,s)  Global(e,s) HumanRes(e,s)  Global(e,s) R.J. Miller - U. Toronto

  21. Reconciling Inconsistencies (III) Employee Salary Employee Salary Employee Salary 3 – Delete only one tuple for John Financial Target Database HumanResources Mapping Financial(e,s)  Global(e,s) HumanRes(e,s)  Global(e,s) R.J. Miller - U. Toronto

  22. Repairing an integrated database Employee Salary Repair 1 An integrated inconsistent database R.J. Miller - U. Toronto

  23. Repairing an integrated database Repair 1 An integrated inconsistent database Repair 2 R.J. Miller - U. Toronto

  24. Consistent Query Answers Repair 1 • Intuition: • Input: query Q • Get a query result Q( ) for each • repair . • A tuple is in the consistent answer • if it appears in all query results. Repair 2 R.J. Miller - U. Toronto

  25. Consistent Query Answers Repair 1 Q(e,s)=Target(e,s) “Get all employees and their salaries” Repair 2 R.J. Miller - U. Toronto

  26. Consistent Query Answers Repair 1 Q(e,s)=Target(e,s) “Get all employees and their salaries” Repair 2 ConsistentS(Q,I)={(Mary,3000)} R.J. Miller - U. Toronto

  27. Consistent Query Answers Repair 1 Q(e)= s: Global(e,s) “Get all employees” Repair 2 R.J. Miller - U. Toronto

  28. Consistent Query Answers Result 1 Q(e)=  s: Global(e,s) “Get all employees” Result 2 ConsistentS(Q,I)={(John),(Mary)} R.J. Miller - U. Toronto

  29. Consistent Query Answers Repair 1 Q= e Target(e,2000) “Is there an employee who earns $2000?” Repair 2 R.J. Miller - U. Toronto

  30. Consistent Query Answers Repair 1 FALSE Q= e Target(e,2000) “Is there an employee who earns $2000?” Repair 2 ConsistentS(Q,I)=FALSE TRUE R.J. Miller - U. Toronto

  31. Our work (IJCAI/IIWeb03) Problem: Retrieving consistent answers is co-NP complete in general (i.e., we need to explore an exponential number of repairs) [Chomicki and Marcinkowski 2002, Cali et al. 2003] R.J. Miller - U. Toronto

  32. Our work Problem: Retrieving consistent answers is co-NP complete in general (i.e., we need to explore an exponential number of repairs) [Chomicki and Marcinkowski 2002, Cali et al. 2003] Goal: Find a class of tractable queries (i.e., the consistent answers can be retrieved in polynomial time without explicitly building all repairs). R.J. Miller - U. Toronto

  33. Example: A tractable query Are there two employees with the same salary? Inconsistent instance Graph of the inconsistent instance 1000 John Mary 2000 Anna 3000 Employee Salary R.J. Miller - U. Toronto

  34. Example: A tractable query 1000 John Mary 2000 Anna 3000 R.J. Miller - U. Toronto

  35. Example: A tractable query 1000 John Mary 2000 Anna 3000 R.J. Miller - U. Toronto

  36. Inexpressibility result • Query rewriting • Input: query Q • Output: query Q’ s.t. Q’(I)=consistentS(Q,I) for every I. • Appealing approach • tractable • reuses existing DBMSs • BUT: so far known to be applicable only to a restricted classes of queries ([ABC, PODS 1999]) R.J. Miller - U. Toronto

  37. Inexpressibility result • Can we use query rewriting? R.J. Miller - U. Toronto

  38. Inexpressibility result • Can we use query rewriting? NO R.J. Miller - U. Toronto

  39. Practical Considerations (I) Conflicts are usually confined to a small portion of the database Robert 4000 Fred 5000 Paul 6000 7000 Peter 1000 John 2000 Mary Anna 3000 R.J. Miller - U. Toronto

  40. Practical Considerations (I) Conflicts are usually confined to a small portion of the database 1000 John 2000 Mary Anna 3000 R.J. Miller - U. Toronto

  41. Practical Considerations (II) Reasonable assumption in integration and exchange: constant number of conflicts per key. Financial Target Database Employee ! Salary Human Resources Employee ! Salary R.J. Miller - U. Toronto

  42. Bibliography J. Chomicki and J. Marcinkowski. On the Computational Complexity of Consistent Query Answers. coRR cs.DB/0204010, 2002. M. Arenas, L. Bertossi, and J. Chomicki. Consistent Query Answers in Inconsistent Databases, Proc. ACM PODS, 1999. Andrea Calì, Domenico Lembo, Riccardo Rosati. On the decidability and complexity of query answering over inconsistent and incomplete databases, Proc. ACM PODS, 2003. R.J. Miller - U. Toronto

  43. Data Mapping (SIGMOD03) • What if sources unwilling to share schemas? • Common in more autonomous P2P settings • How can such sources share data? • Shared schema mappings not appropriate • Need to manage and share • Data mappings • Hyperion – P2P data sharing R.J. Miller - U. Toronto

  44. P2P File-Sharing Systems Currently, P2P querying relies on the use of value searches. e.g., retrieve songs for music band “NewOrder” However, P2P query mechanisms do not capture the intricacies of values, i.e., that values are often associated to each other. e.g. the value “New Order” is an alias for the value “Joy Division” We propose the use of mapping tables to record such associations e.g. a mapping table that records artist aliases R.J. Miller - U. Toronto

  45. A P2P Genome Database System Peers store information about genes, proteins, etc. SwissProt(pid, name) “alias” Gene(gid, name) Characteristics of mapping tables: • The recorded associations can be 1:1, 1:n or m:n • They are, in general, non-binary • They associate values within or across domains R.J. Miller - U. Toronto

  46. Contributions State of the art: Mapping tables represent expert knowledge. Currently, they are created manually by domain specialists. Our contributions: We automate the creation and maintenance of these tables. More specifically: • We investigate alternative semantics for mapping tables. • We motivate why reasoning capabilities are needed to manage them. • We propose efficient algorithms for both finding inconsistencies in mapping tables and inferring new mapping tables R.J. Miller - U. Toronto

  47. Conclusions • Managing Data Inconsistency • Tolerate inconsistency • Identify inconsistency at query time • Recognizes that cleaning not always possible or desirable • Reconciling inconsistency • Data mappings record reconciliation • Manage use and combination of data mappings www.cs.toronto.edu/db www.cs.toronto.edu/db/tomas www.cs.toronto.edu/db/hyperion www.cs.toronto.edu/~miller R.J. Miller - U. Toronto