XML Data Quality Modeling

XML Data Quality Modeling. Monica Scannapieco, Dipartimento di Informatica e Sistemistica, Università di Roma “La Sapienza”. Outline: Motivations; the D2Q data model; querying the D2Q data model; DaQuinCIS architecture: Mediator; Conclusions. Data Quality: a multi-dimensional concept.


Presentation Transcript


  1. XML Data Quality Modeling Monica Scannapieco Dipartimento di Informatica e Sistemistica Università di Roma “La Sapienza” Monica Scannapieco-IFIP2.6 Meeting Catania, Sicily

  2. Outline
     • Motivations
     • The D2Q data model
     • Querying the D2Q data model
     • DaQuinCIS architecture: Mediator
     • Conclusions

  3. Data Quality: a multi-dimensional concept
     • Completeness
     • Accuracy
         • e.g. "Jhn" vs. "John"
     • Currency
         • e.g. residence address: outdated vs. up-to-date
     • Consistency
         • e.g. ZIP code and city are consistent

  4. Data Quality & CISs
     • Cooperative Information Systems (CISs)
         • data sharing to accomplish cooperative tasks
         • high data replication
     • Instance-level heterogeneities need to be reconciled: CISs need data quality
     • High data replication: data quality needs CISs

  5. Why DQ modeling?
     • Quality can be associated with data in order to:
         • Certify the "correctness" (accuracy, consistency, currency) and completeness of data
             • benefits for cooperation
         • Support instance-level reconciliation
             • e.g. a last-update timestamp, as a currency guarantee, drives the reconciliation phase

  6. Why XML modeling?
     • The CIS context requires facing interoperability issues
     • Flexibility of semi-structured models
         • quality values can be associated with data at different levels of granularity

  7. D2Q: Data and Data Quality Model
     • Graph-based data model, enhancing the semantics of the XML data model to represent quality data

  8. Data Schema
     [Figure: example data schema graph with nodes Enterprise:EnterpriseClass, Owner:OwnerClass, Name: string, Code: string, FiscalCode: string, Name: string, Address: string; edges are labelled with the cardinalities 1 and *]
     • Data class δ(name_δ, π_1, …, π_n)
         • Name: name_δ
         • Set of properties π_i = <name_i : Type_i>, where:
             • name_i is the name of the property π_i
             • Type_i can be (i) a basic type, (ii) a data class, or (iii) a type set-of <X>, where <X> can be either a basic type or a data class
     • Data schema: node- and edge-labelled directed acyclic graph of data classes
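The data class definition on this slide can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the names `DataClass`, `SetOf`, `referenced_classes` and `is_acyclic` are hypothetical, basic types are represented as plain strings, and the assignment of properties to Enterprise and Owner (including Owner as a set-of property of Enterprise) is an assumption read off the figure.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, Union


@dataclass
class SetOf:
    """Type 'set-of <X>', where X is a basic type name or a data class."""
    element: Union[str, "DataClass"]


@dataclass
class DataClass:
    """Data class with a name and a set of properties name_i -> Type_i."""
    name: str
    properties: Dict[str, Union[str, "DataClass", SetOf]] = field(default_factory=dict)


def referenced_classes(dc: DataClass) -> Iterator[DataClass]:
    """Yield the data classes directly referenced by the properties of dc."""
    for prop_type in dc.properties.values():
        if isinstance(prop_type, SetOf):
            prop_type = prop_type.element
        if isinstance(prop_type, DataClass):
            yield prop_type


def is_acyclic(root: DataClass) -> bool:
    """Check that the schema graph reachable from root contains no cycle."""
    visiting, done = set(), set()

    def visit(dc: DataClass) -> bool:
        if id(dc) in done:
            return True
        if id(dc) in visiting:  # back edge: the graph has a cycle
            return False
        visiting.add(id(dc))
        ok = all(visit(child) for child in referenced_classes(dc))
        visiting.discard(id(dc))
        done.add(id(dc))
        return ok

    return visit(root)


# Assumed layout of the example schema in the figure.
owner = DataClass("OwnerClass", {"FiscalCode": "string", "Name": "string", "Address": "string"})
enterprise = DataClass("EnterpriseClass", {"Name": "string", "Code": "string", "Owner": SetOf(owner)})
```

Calling `is_acyclic(enterprise)` checks the directed-acyclic-graph condition stated in the last bullet for this small schema.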

  9. Quality Schema
     • Quality class λ_δ associated with a data class δ
     [Figure: example quality schema. Enterprise_Quality:EnterpriseQualityClass with Enterprise_Accuracy: t_accuracy, Enterprise_Completeness: t_completeness, Enterprise_Consistency: t_consistency, Enterprise_Currency: t_currency; Owner_Quality:OwnerQualityClass (cardinality *) with Owner_Accuracy: t_accuracy, …; FiscalCode_Quality:FiscalCodeQualityClass with FiscalCode_Accuracy: t_accuracy, …, FiscalCode_Currency: t_currency]

  10. D2Q Schema
     [Figure: quality associations. The data nodes Enterprise, Owner (*), Code: string and Name: string are linked by edges labelled "quality" to the quality nodes Enterprise_Quality, Owner_Quality (*), Code_Quality and Name_Quality, whose leaves include Enterprise_accuracy, Code_accuracy and Name_accuracy]
     • Quality associations: one-to-one (biunivocal) functions between all nodes of a data schema and all non-leaf nodes of a quality schema

  11. D2Q Schema Instances
     • Data class instances -> data objects
     • Quality class instances -> quality objects
     • Quality association values -> quality links
     [Figure: data object Owner1 with FiscalCode[SCNMNCXXX], Address[Via Salaria 113 Roma], Name[MonicaScannapieco]]

  12. From D2Q to XML
     • D2Q schemas are translated into XML Schemas
         • XML Schema type definitions
         • Data and quality classes and their properties become XML elements
         • OID and QOID attributes represent quality associations
         • Root elements are introduced

  13. Querying the D2Q Model with XQuery
     • Quality selectors: a set of user-defined XQuery functions
     • Each quality selector gives access to the values of a specific dimension, or to the overall quality, of a set of input nodes
         • e.g. accuracy(node*) -> node*

  14. Example of DQ Accessing
     • Query:
         for $i in input()//owner[Name eq "Monica Scannapieco"]
         return quality($i/Address, $i/FiscalCode)
     • Result:
         <root>
           <Address_Quality qOID="qOID132">
             <Address_Accuracy>high</Address_Accuracy>
             <Address_Currency>medium</Address_Currency>
           </Address_Quality>
           <FiscalCode_Quality qOID="qOID131">
             <FiscalCode_Accuracy>high</FiscalCode_Accuracy>
           </FiscalCode_Quality>
         </root>
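The behaviour of the `quality` and `accuracy` selectors can be emulated outside XQuery. The Python sketch below is a hedged illustration only: the document layout (data elements carrying a `qOID` attribute that matches the `qOID` of a top-level `*_Quality` element) is an assumption inferred from the result above, and the Python functions are stand-ins for the user-defined XQuery functions, not their actual definitions.

```python
import xml.etree.ElementTree as ET

# Assumed D2Q-style document: data elements point to quality elements via qOID.
DOC = """
<root>
  <Owner>
    <Name>Monica Scannapieco</Name>
    <Address qOID="qOID132">Via Salaria 113 Roma</Address>
    <FiscalCode qOID="qOID131">SCNMNCXXX</FiscalCode>
  </Owner>
  <Address_Quality qOID="qOID132">
    <Address_Accuracy>high</Address_Accuracy>
    <Address_Currency>medium</Address_Currency>
  </Address_Quality>
  <FiscalCode_Quality qOID="qOID131">
    <FiscalCode_Accuracy>high</FiscalCode_Accuracy>
  </FiscalCode_Quality>
</root>
""".strip()


def quality(doc_root, nodes):
    """Return the quality elements linked to the given data nodes."""
    by_qoid = {el.get("qOID"): el for el in doc_root if el.tag.endswith("_Quality")}
    return [by_qoid[n.get("qOID")] for n in nodes if n.get("qOID") in by_qoid]


def accuracy(doc_root, nodes):
    """Return only the accuracy elements of the given data nodes."""
    result = []
    for q in quality(doc_root, nodes):
        result.extend(child for child in q if child.tag.endswith("_Accuracy"))
    return result


root = ET.fromstring(DOC)
owner = next(o for o in root.iter("Owner") if o.findtext("Name") == "Monica Scannapieco")
selected = [owner.find("Address"), owner.find("FiscalCode")]

for q in quality(root, selected):
    print(q.tag, {child.tag: child.text for child in q})
```

Here `quality(root, selected)` returns the `Address_Quality` and `FiscalCode_Quality` elements, mirroring the result of the slide's query, while `accuracy(root, selected)` narrows the answer to a single dimension.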

  15. DaQuinCIS Platform: a platform for exchanging and improving data quality in CISs
     [Figure: platform architecture. Components shown: Rating Service, Communication Infrastructure, Quality Factory, Quality Notification Service, and Cooperative Gateways in front of the back-end systems internal to each organization (Org1, Org2, …, OrgN)]
     • Data Quality Broker
     • Record Matcher

  16. Quality Improvement Strategy
     [Figure: Broker, Record Matcher and Notification Service operating on cooperative data; labels: On-line Improvement, Off-line Improvement, Quality Maintenance; data ranges from very bad data through not very bad data to good data]
     • The Broker selects the best-quality data answering a query and sends it to the requester (data quality-driven query answering) and to other providers (on-line improvement)
     • The Record Matcher periodically compares exported data in order to improve their quality
     • The Notification Service multicasts data quality changes

  17. Data Quality Broker: a Quality-based Data Integration System
     [Figure: XQuery queries are posed to the Mediator, which exchanges XML with Wrappers placed over the databases (DB) of ORG 1, ORG 2 and ORG 3]
     • Wrapper/mediator architecture
     • Global and local views expressed as D2Q-compliant XML Schemas
     • Global-as-View (GAV) mapping

  18. Query Processing Steps (1)
     • Given a query Q on a D2Q global schema:
         • Q is unfolded according to a static mapping that retrieves all copies of the same data available in the CIS
         • The execution of local queries returns a set of results, on which run-time matching is performed

  19. Query Processing Steps (2)
     • The result to be returned is built as follows:
         • (i) if no quality requirement is specified, a best-quality default semantics is adopted: the result is constructed by selecting the best-quality values
         • (ii) if quality requirements are specified, the result is constructed by checking the satisfiability of the requirements on the whole result
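The two cases above can be illustrated with a minimal Python sketch. Several simplifications are assumptions, not part of the slides: each copy carries a single overall quality level, levels are ordered low < medium < high (only "high" and "medium" appear in the transcript), a quality requirement is a minimum level per property, and the names `RANK` and `build_result` are hypothetical.

```python
# Assumed ordinal scale for quality levels.
RANK = {"low": 0, "medium": 1, "high": 2}


def build_result(matched, requirements=None):
    """Build a query result from matched copies.

    matched: property name -> list of (value, quality_level) copies
             collected from the different sources.
    requirements: optional property name -> minimum quality level.
    Returns property -> best (value, quality_level), or None when the
    requirements are not satisfied by the result as a whole.
    """
    # Case (i): best-quality default semantics, applied per property.
    result = {
        prop: max(copies, key=lambda copy: RANK[copy[1]])
        for prop, copies in matched.items()
    }
    if requirements is None:
        return result
    # Case (ii): every requirement must hold on the whole result.
    satisfied = all(
        prop in result and RANK[result[prop][1]] >= RANK[minimum]
        for prop, minimum in requirements.items()
    )
    return result if satisfied else None


matched = {
    "Address": [("Via Salaria 113 Roma", "medium"), ("Via Salaria 113, Roma", "high")],
    "FiscalCode": [("SCNMNCXXX", "high")],
}
print(build_result(matched))
print(build_result(matched, {"Address": "high"}))
```

Returning `None` for an unsatisfied requirement is one possible reading of "checking the satisfiability of the requirements on the whole result"; the slides do not say what the mediator returns in that case.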

  20. Why such a semantics?
     • Best-quality copies are always available
     • Quality improvement feature
         • Results collected at query time have an associated quality
         • The best-quality result is proposed to all data sources that provided a lower-quality result

  21. How does it work?
     • The static mapping is specified through path expressions
         • A path expression locates a concept in a schema
         • XML schemas are D2Q-compliant

  22. Mediator query processing steps

  23. Unfolding
     • Path expression extraction: a global query is analyzed to extract path expressions
     • Path expression pre-processing: to obtain "completely specified" path expressions
     • Translation: each path expression on the global view is translated into (a set of) path expressions over the structure of the local views
     • Framing: keeps track of the transformation steps
     • Queries over local sources are sent to the Transport Engine module
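The translation and framing steps can be sketched in Python as below. This is a simplified illustration under stated assumptions: the mapping table, the local schema paths (`/Company/Holder/Addr` and so on) and the function name `unfold` are all hypothetical, path expressions are assumed to be already extracted and completely specified, and the "frame" is reduced to a list recording where each local path came from.

```python
from typing import Dict, List, Tuple

# Hypothetical GAV-style static mapping:
# global path expression -> list of (source, local path expression).
GAV_MAPPING: Dict[str, List[Tuple[str, str]]] = {
    "/Enterprise/Owner/Address": [
        ("Org1", "/Company/Holder/Addr"),
        ("Org2", "/Firm/Owner/Address"),
    ],
    "/Enterprise/Owner/FiscalCode": [
        ("Org1", "/Company/Holder/TaxId"),
    ],
}


def unfold(global_paths, mapping):
    """Translate global path expressions into per-source local ones.

    Returns (local_queries, frame):
      local_queries: source -> list of local path expressions to evaluate
      frame: list of (global_path, source, local_path) triples, kept so
             that local results can later be re-translated to the global
             schema.
    """
    local_queries: Dict[str, List[str]] = {}
    frame: List[Tuple[str, str, str]] = []
    for global_path in global_paths:
        for source, local_path in mapping.get(global_path, []):
            local_queries.setdefault(source, []).append(local_path)
            frame.append((global_path, source, local_path))
    return local_queries, frame


queries, frame = unfold(
    ["/Enterprise/Owner/Address", "/Enterprise/Owner/FiscalCode"], GAV_MAPPING
)
for source, paths in queries.items():
    print(source, paths)
```

Because one global path may map to several sources, unfolding naturally retrieves all available copies of the same data, which is what the later matching step needs.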

  24. Refolding
     • Re-translation: received results are re-translated according to the global schema specification
     • Materialization: results are concatenated into a single temporary file
     • Global query execution: the global query is changed into a query using only local files, and can then be executed
     • Record matching: records are matched and, after a comparison of quality values performed externally by the Comparator module, the query results, ordered by quality, are sent for quality filtering
     • The results that best fit the user's query requirements are sent back to the user. Moreover, quality feedback is sent to the Transport Engine, which is in charge of propagating it through the system.
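The record matching, quality ordering and quality filtering steps can be sketched as follows. This Python sketch swaps in deliberately naive techniques and says so: records are matched by exact equality of a normalized fiscal code (a stand-in for a real record matching algorithm), each record carries a single overall quality level on an assumed low < medium < high scale, and the function names and record fields are hypothetical.

```python
# Assumed ordinal scale for quality levels.
RANK = {"low": 0, "medium": 1, "high": 2}


def match_and_rank(records):
    """Group records that refer to the same entity, best quality first.

    records: list of dicts with keys 'source', 'FiscalCode', 'Address'
             and 'quality'.
    Matching here is exact equality on a normalized fiscal code.
    """
    groups = {}
    for record in records:
        key = record["FiscalCode"].replace(" ", "").upper()
        groups.setdefault(key, []).append(record)
    return {
        key: sorted(group, key=lambda r: RANK[r["quality"]], reverse=True)
        for key, group in groups.items()
    }


def quality_filter(ranked, minimum):
    """Keep only the records whose quality reaches the required minimum."""
    return {
        key: [r for r in group if RANK[r["quality"]] >= RANK[minimum]]
        for key, group in ranked.items()
    }


records = [
    {"source": "Org1", "FiscalCode": "scnmncxxx", "Address": "Via Salaria 13", "quality": "low"},
    {"source": "Org2", "FiscalCode": "SCNMNCXXX", "Address": "Via Salaria 113 Roma", "quality": "high"},
    {"source": "Org3", "FiscalCode": "SCNMNC XXX", "Address": "Via Salaria 113", "quality": "medium"},
]
ranked = match_and_rank(records)
print([r["source"] for r in ranked["SCNMNCXXX"]])
```

In this sketch, the sources whose copies rank below the first record of a group would be the natural recipients of the quality feedback mentioned in the last bullet.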

  25. Implementation Modules

  26. Ongoing Experiments
     • Currently comparing:
         • Periodic record matching in a "traditional" setting
         • The quality improvement strategy underlying DaQuinCIS
     • Three Italian public administration (PA) databases are being used

  27. Conclusions
     • An XML data quality model
     • How to use it within a data integration (DI) architecture for:
         • Quality accessing
         • Quality improving

  28. DaQuinCIS Platform
     • Data Quality in Cooperative Information Systems (DaQuinCIS): a platform for exchanging and improving data quality in CISs
     [Figure: platform architecture. Components shown: Rating Service, Communication Infrastructure, Quality Notification Service (QNS), Quality Factory (QF), Data Quality Broker (DQB), and Cooperative Gateways in front of the back-end systems internal to each organization (Org1, Org2, …, OrgN)]
