
Quality and Repair


Presentation Transcript


  1. Quality and Repair Pablo N. Mendes (Freie Universität Berlin) Giorgos Flouris (FORTH) 1st year review Luxembourg, December 2011

  2. Work Plan View WP2 (timeline in months: 0, 6, 12, 18, 24, 30, 36, 42, 48) • Task 2.1 Data quality assessment and repair (FUB): D2.1 Conceptual model and best practices for high-quality data publishing • D2.2 Methods for quality repair • D2.4 Update of D2.1 • D2.6 Methods for assessing the quality of sensor data • Task 2.2 Temporal, spatial and social aspects of data (KIT): D2.3 Modelling and processing contextual aspects of data • D2.5 Proof-of-concept evaluation for modelling space and time • Task 2.3 Recommendations for enhancing best practices for data publishing (KIT): D2.7 Recommendations for contextual data publishing

  3. Upcoming deliverables • Quality Assessment • D2.1 - Conceptual model and best practices for high-quality data publishing • Quality Enhancement • D2.2 - Methods for quality repair

  4. Outline • Overview of Quality • Data Quality Framework • Quality Assessment • Quality Enhancement (Repair)

  5. Quality “Fitness for use.” Joseph Juran. The Quality Control Handbook. McGraw-Hill, New York, 3rd edition, 1974.

  6. Data Quality • Multifaceted • is accurate the same as high quality? • what about availability? • timeliness? • Subjective • for some users, weekly updates are OK; for others, they are not • Task-dependent • task: weather forecast • data is not good if it is not available for online query • vacation planning or aviation? • for me, for vacation planning, weekly updates are fine

  7. Data Quality Dimensions (figure: the dimensions, shown in the order they are presented next)

  8. Data Quality Framework (figure: Quality Assessment and Quality Enhancement)

  9. Dereferenceability ACCESSIBILITY • Indicator: Dereferenceable URIs • “Resources identified by URIs that respond with RDF to HTTP requests” • Metrics: • for datasets (d) and resources (r) • deref(d) = count(r ∈ d | deref(r)) • ratio_deref(d) = deref(d) / no-deref(d) • Recommendation: • Your URIs should be dereferenceable. • Prefer reusing URIs that are dereferenceable.
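A minimal sketch of how the dereferenceability metrics could be computed over a sample of a dataset's resource URIs. The use of Python's requests library, the Accept header value, and the set of accepted media types are illustrative assumptions, not part of the deliverable; the function names mirror the slide's notation.

```python
import requests

RDF_TYPES = {"application/rdf+xml", "text/turtle", "application/n-triples"}

def deref(resource_uri):
    """True if the URI answers an HTTP GET with an RDF serialization."""
    try:
        resp = requests.get(resource_uri,
                            headers={"Accept": "application/rdf+xml, text/turtle"},
                            timeout=10, allow_redirects=True)
        served = resp.headers.get("Content-Type", "").split(";")[0].strip()
        return resp.status_code == 200 and served in RDF_TYPES
    except requests.RequestException:
        return False

def deref_count(resources):
    """deref(d): number of dereferenceable resources in the dataset sample."""
    return sum(1 for r in resources if deref(r))

def ratio_deref(resources):
    """ratio_deref(d): dereferenceable vs. non-dereferenceable resources."""
    ok = deref_count(resources)
    failed = len(resources) - ok
    return ok / failed if failed else float("inf")
```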

  10. Access methods ACCESSIBILITY • Indicator: Access methods • “Data is accessible in varied and recommended ways.” • Metrics: • sample(d): {0,1} “example resource available for d” • endpoint(d): {0,1} “SPARQL endpoint available for d” • dump(d): {0,1} “RDF dumps available for d” • Recommendation: • Provide as many access methods as possible • A sample resource provides a quick view into the type of data you serve. • SPARQL endpoints for clients to obtain part of the data • Dumps are cheaper than alternatives when bulk access is needed
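A sketch of the three binary access-method checks, assuming the dataset's example URI, SPARQL endpoint URL, and dump URL are known in advance; the trivial ASK query and the use of HTTP HEAD are illustrative choices rather than prescribed tests.

```python
import requests

def sample_available(example_uri):
    """sample(d): 1 if an example resource URI responds successfully."""
    try:
        return int(requests.head(example_uri, timeout=10,
                                 allow_redirects=True).status_code == 200)
    except requests.RequestException:
        return 0

def endpoint_available(sparql_url):
    """endpoint(d): 1 if the SPARQL endpoint answers a trivial ASK query."""
    try:
        resp = requests.get(sparql_url,
                            params={"query": "ASK { ?s ?p ?o }"}, timeout=10)
        return int(resp.status_code == 200)
    except requests.RequestException:
        return 0

def dump_available(dump_url):
    """dump(d): 1 if the RDF dump URL is reachable."""
    try:
        return int(requests.head(dump_url, timeout=10,
                                 allow_redirects=True).status_code == 200)
    except requests.RequestException:
        return 0
```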

  11. Availability ACCESSIBILITY • Indicator: Availability • “Average availability in a time interval” • Metrics: • avail(d) = Σ_{hour=1..24} deref(sample(d)) / 24 • Alternatively, httphead() instead of deref() • Recommendation: • the higher the better
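A small sketch of the availability average, assuming the hourly checks (one boolean per hour, e.g. produced by the deref check sketched above or by a cheaper HTTP HEAD probe) have already been collected.

```python
def avail(hourly_checks):
    """avail(d): average of 24 hourly dereferenceability checks on sample(d).

    hourly_checks holds one boolean (or 0/1) per hour of the measured
    interval, e.g. the result of deref(sample(d)) run once every hour.
    """
    return sum(int(ok) for ok in hourly_checks) / len(hourly_checks)

# Example: a dataset that was reachable for 22 of 24 hourly probes.
print(avail([True] * 22 + [False] * 2))   # 0.9166...
```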

  12. Accessibility Dimensions ACCESSIBILITY • Dereferenceability (example: HTTP GET / HEAD) • Availability (example: hourly derefs) • Access methods (example: URI, Bulk, SPARQL) • Response time (example: timed deref) • Robustness (example: requests per minute) • Reachability (example: LOD cloud inlinks)

  13. Representational: Interpretability REPRESENTATIONAL • Indicator: Human/Machine interpretability • “URI is dereferenceable to human- and machine-readable formats” • Metrics: • format(deref(r)) ∈ F_h ∪ F_m : {0,1} • F_h = {HTML, XHTML+RDFa, ...} • F_m = {NT, RDF/XML, ...} • Recommendation: • Resources should dereference at least to human-readable HTML and one widely adopted RDF serialization.
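A sketch of the human/machine interpretability check via HTTP content negotiation; the media-type sets standing in for F_h and F_m, and the requests-based helper, are assumptions for illustration.

```python
import requests

HUMAN_FORMATS = {"text/html", "application/xhtml+xml"}          # F_h
MACHINE_FORMATS = {"text/turtle", "application/rdf+xml",         # F_m
                   "application/n-triples"}

def dereferences_to(resource_uri, accept, expected_types):
    """1 if the URI content-negotiates to one of the expected media types."""
    try:
        resp = requests.get(resource_uri, headers={"Accept": accept},
                            timeout=10, allow_redirects=True)
        served = resp.headers.get("Content-Type", "").split(";")[0].strip()
        return int(resp.status_code == 200 and served in expected_types)
    except requests.RequestException:
        return 0

def interpretability(resource_uri):
    """Human and machine interpretability flags for a single resource."""
    human = dereferences_to(resource_uri, "text/html", HUMAN_FORMATS)
    machine = dereferences_to(resource_uri,
                              "text/turtle, application/rdf+xml",
                              MACHINE_FORMATS)
    return {"human": human, "machine": machine}
```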

  14. Vocabulary understandability REPRESENTATIONAL • Indicator: Schema understandability • “Schema terms are familiar to existing agents.” • Metrics: • vocab-underst(d) = triples(v,d) × triples(v,D) / triples(D) • Alternative: PageRank (probability that a random surfer has found v) • Recommendation: • Reuse widely deployed vocabularies.
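One plausible reading of the vocabulary-understandability metric, sketched under the assumption that triples(v, d) counts the triples of dataset d that use vocabulary v, that D is a reference corpus such as a crawl of the Web of Data, and that the per-vocabulary terms are summed; the aggregation over vocabularies is an assumption, not taken from the deliverable.

```python
def vocab_understandability(triples_v_d, triples_v_D, triples_D):
    """Sum over the vocabularies v used in dataset d of
       triples(v, d) * triples(v, D) / triples(D).

    triples_v_d: {vocab: number of triples in d using vocab}
    triples_v_D: {vocab: number of triples in the reference corpus D using vocab}
    triples_D:   total number of triples in D
    """
    return sum(n_d * triples_v_D.get(v, 0) / triples_D
               for v, n_d in triples_v_d.items())

# A dataset reusing a widely deployed vocabulary (many triples in D) scores
# higher than one using only a home-grown vocabulary that is rare in D.
```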

  15. Representational Dimensions REPRESENTATIONAL • Human/Machine Interpretability (example: HTML, RDF) • Vocabulary Understandability (example: vocabulary usage stats) • Representational Conciseness (example: triples / byte)

  16. Contextual Dimensions CONTEXTUAL DIMENSIONS • Completeness • Full set of objects and attributes with respect to a task • Conciseness • Amount of duplicate entries, redundant attributes • Coherence • How well instance data conforms to the schema

  17. Contextual Dimensions CONTEXTUAL DIMENSIONS • Verifiability • How easy is it to check the data? • Can use provenance information. • Validity • Encodes context- or application-specific requirements

  18. Intrinsic Dimensions INTRINSIC DIMENSIONS • Accuracy • usually estimated; may be available for sensors • Timeliness • can use last update • Consistency • two or more values do not conflict with each other • Objectivity • Can be traced via provenance

  19. Example: AEMET • Metadata entry: http://thedatahub.org/dataset/aemet • Example item: http://aemet.linkeddata.es/page/resource/WeatherStation/id08001?output=ttl • Access methods: Example URI, SPARQL, Bulk • Availability: • Example URI: available • SPARQL endpoint: 100% • Format Interpretability: • TTL=OK • RDF/XML=OK • Verifiability: • Published by third party: http://www4.wiwiss.fu-berlin.de/lodcloud/ckan/validator/validate.php?package=aemet

  20. Data Quality Framework (figure: Quality Assessment and Quality Enhancement)

  21. Validity as a Quality Indicator • Validity is an important quality indicator • Encodes context- or application-specific requirements • Applications may be useless over invalid data • Binary concept (valid/invalid) • Two steps to guarantee validity (repair process): • Identifying invalid ontologies (diagnosis) • Detecting invalidities in an automated manner • Subtask of Quality Assessment • Removing invalidities (repair) • Repairing invalidities in an automated manner • Subtask of Quality Enhancement

  22. Diagnosis • Expressing validity using validity rules over an adequate relational schema • Examples: • Properties must have a unique domain • ∀p Prop(p) → ∃a Dom(p,a) • ∀p,a,b Dom(p,a) ∧ Dom(p,b) → (a=b) • Correct classification in property instances • ∀x,y,p,a P_Inst(x,y,p) ∧ Dom(p,a) → C_Inst(x,a) • ∀x,y,p,a P_Inst(x,y,p) ∧ Rng(p,a) → C_Inst(y,a) • Diagnosis reduced to relational queries
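A sketch of how diagnosis reduces to relational queries, using an in-memory SQLite database whose tables mirror the slide's predicates; the schema and the SQL query are illustrative, not the deliverable's actual implementation.

```python
import sqlite3

# Relational encoding of the ontology; table names mirror the slide's
# predicates (Prop, Dom, P_Inst, C_Inst).
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE Prop  (p TEXT);
    CREATE TABLE Dom   (p TEXT, a TEXT);
    CREATE TABLE P_Inst(x TEXT, y TEXT, p TEXT);
    CREATE TABLE C_Inst(x TEXT, a TEXT);
""")

# Diagnosis query for the rule
#   forall x,y,p,a: P_Inst(x,y,p) AND Dom(p,a) -> C_Inst(x,a)
# Every returned row is a violation (a missing C_Inst fact).
DOMAIN_VIOLATIONS = """
    SELECT pi.x, pi.p, d.a
    FROM P_Inst pi JOIN Dom d ON pi.p = d.p
    WHERE NOT EXISTS (
        SELECT 1 FROM C_Inst c WHERE c.x = pi.x AND c.a = d.a
    );
"""

def diagnose(connection):
    """Return all violations of the domain-classification rule."""
    return connection.execute(DOMAIN_VIOLATIONS).fetchall()
```

Loading the facts of the example ontology from the next slide and calling diagnose(db) would report (Item1, geo:location, Sensor) as a violation.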

  23. Example Ontology O0 • Schema: Class(Sensor), Class(SpatialThing), Class(Observation), Prop(geo:location), Dom(geo:location,Sensor), Rng(geo:location,SpatialThing) • Data: Inst(Item1), Inst(ST1), P_Inst(Item1,ST1,geo:location), C_Inst(Item1,Observation), C_Inst(ST1,SpatialThing) • Violated rule (correct classification in property instances): ∀x,y,p,a P_Inst(x,y,p) ∧ Dom(p,a) → C_Inst(x,a) • Violation: Sensor is the domain of geo:location, but Item1 is not a Sensor: P_Inst(Item1,ST1,geo:location) ∈ O0, Dom(geo:location,Sensor) ∈ O0, C_Inst(Item1,Sensor) ∉ O0 • Possible resolutions: • Remove P_Inst(Item1,ST1,geo:location) • Remove Dom(geo:location,Sensor) • Add C_Inst(Item1,Sensor)

  24. Preferences for Repair • Which repairing option is best? • Ontology engineer determines that via preferences • Specified by ontology engineer beforehand • High-level “specifications” for the ideal repair • Serve as “instructions” to determine the preferred solution

  25. Preferences (On Ontologies) (figure: the candidate repair results O1, O2, O3 of ontology O0, each assigned a score: 3, 4, 6)

  26. Preferences (On Deltas) (figure: the candidate repair deltas from O0 to O1, O2, O3, each assigned a score: -P_Inst(Item1,ST1,geo:location) scores 2, -Dom(geo:location,Sensor) scores 4, +C_Inst(Item1,Sensor) scores 5)

  27. Preferences • Preferences on ontologies are result-oriented • Consider the quality of the repair result • Ignore the impact of repair • Popular options: prefer newest information, prefer trustable information • Preferences on deltas are more impact-oriented • Consider the impact of repair • Ignore the quality of the repair result • Popular options: minimize schema changes, minimize addition/deletion of information, minimize delta size • Two sides of the same coin (equivalent options) • Quality metrics can be used for stating preferences • Metadata on the data may be needed • Can be qualitative or quantitative
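A toy sketch of how a delta-level (impact-oriented) preference such as “minimize schema changes” could be expressed as a score over candidate deltas; the weights and the string-based encoding of facts are arbitrary illustrative choices, not the scores from the previous slide.

```python
# A delta is a set of (op, fact) pairs, e.g. {("-", "Dom(geo:location,Sensor)")}.
SCHEMA_PREDICATES = ("Class(", "Prop(", "Dom(", "Rng(")

def delta_score(delta, schema_penalty=10, delete_penalty=2, add_penalty=1):
    """Lower score = more preferred (impact-oriented preference)."""
    score = 0
    for op, fact in delta:
        score += delete_penalty if op == "-" else add_penalty
        if fact.startswith(SCHEMA_PREDICATES):
            score += schema_penalty           # penalize touching the schema
    return score

def preferred_delta(candidate_deltas):
    """Pick the candidate repair with the smallest impact."""
    return min(candidate_deltas, key=delta_score)

# For the three candidate repairs of the earlier example: with these
# (arbitrary) weights, adding the missing classification is cheapest and
# removing the Dom axiom (a schema change) is the most expensive.
candidates = [
    {("-", "P_Inst(Item1,ST1,geo:location)")},
    {("-", "Dom(geo:location,Sensor)")},
    {("+", "C_Inst(Item1,Sensor)")},
]
print(preferred_delta(candidates))
```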

  28. Generalizing the Approach • For one violated constraint • Diagnose invalidity • Determine minimal ways to resolve it • Determine and return the preferred resolution • For many violated constraints • Problem becomes more complicated • More than one resolution step is required • Issues: • Resolution order • When and how to filter non-preferred solutions? • Constraint (and resolution) interdependencies

  29. Constraint Interdependencies • A given resolution may: • Cause other violations (bad) • Resolve other violations (good) • Cannot pre-determine the best resolution • Difficult to predict the ramifications of each one • Exhaustive search required • Recursive, tree-based search (resolution tree) • Two ways to create the resolution tree • Globally-preferred (GP), locally-preferred (LP) • When and how to filter non-preferred solutions?

  30. Resolution Tree Creation (GP) • Globally-preferred (GP): find all minimal resolutions for all the violated constraints, then find the preferred ones • Find all minimal resolutions for one violation • Explore them all • Repeat recursively until consistent • Return the preferred leaves (the preferred repairs are returned)

  31. Resolution Tree Creation (LP) • Locally-preferred (LP): find the minimal and preferred resolutions for one violated constraint, then repeat for the next • Find all minimal resolutions for one violation • Explore the preferred one(s) • Repeat recursively until consistent • Return all remaining leaves (the preferred repair is returned)
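A schematic sketch of the two tree-construction strategies, not the deliverable's actual algorithm: violations, minimal_resolutions, apply_delta and the preference score are assumed placeholder functions, and deltas are modelled as frozensets of changes so they can be accumulated along a branch. The GP/LP difference lies only in where pruning happens.

```python
def build_tree(ont, delta_so_far, violations, minimal_resolutions,
               apply_delta, score, greedy):
    """Return (repaired ontology, accumulated delta) pairs for explored leaves."""
    bad = violations(ont)
    if not bad:                                  # consistent: this branch is a leaf
        return [(ont, delta_so_far)]
    options = minimal_resolutions(ont, bad[0])   # minimal resolutions for one violation
    if greedy:                                   # LP: keep only the locally preferred one(s)
        best = min(score(d) for d in options)
        options = [d for d in options if score(d) == best]
    leaves = []                                  # GP: explore every minimal resolution
    for d in options:
        leaves.extend(build_tree(apply_delta(ont, d), delta_so_far | d,
                                 violations, minimal_resolutions,
                                 apply_delta, score, greedy))
    return leaves

def gp_repairs(ont, **funcs):
    """GP: build the full tree, then return only the globally preferred leaves."""
    leaves = build_tree(ont, frozenset(), greedy=False, **funcs)
    best = min(funcs["score"](delta) for _, delta in leaves)
    return [o for o, delta in leaves if funcs["score"](delta) == best]

def lp_repairs(ont, **funcs):
    """LP: prune to the preferred child at every step, return all remaining leaves."""
    return [o for o, _ in build_tree(ont, frozenset(), greedy=True, **funcs)]
```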

  32. Comparison (GP versus LP) • Characteristics of GP: exhaustive; less efficient (large resolution trees); always returns the most preferred repairs; insensitive to constraint syntax; does not depend on resolution order • Characteristics of LP: greedy; more efficient (small resolution trees); does not always return the most preferred repairs; sensitive to constraint syntax; depends on resolution order

  33. Algorithm and Complexity • Detailed complexity analysis for GP/LP and various types of constraints and preferences • Inherently difficult problem • Exponential complexity (in general) • Main exception: LP is polynomial (in special cases) • Theoretical complexity is misleading as to the actual performance of the algorithms

  34. Performance in Practice • Performance in practice: • Linear with respect to ontology size • Linear with respect to tree size • Tree size is determined by: • Types of violated constraints (tree width) • Number of violations (tree height) – causes the exponential blowup • Constraint interdependencies (tree height) • Preference (for LP): affects pruning (tree width) • Further performance improvement: • Use optimizations • Use LP with a restrictive preference

  35. Evaluation Parameters • Evaluation • Effect of ontology size (for GP/LP) • Effect of tree size (for GP/LP) • Effect of violations (for GP/LP) • Effect of preference (relevant for LP only) • Quality of LP repairs • Preliminary results support our claims: • Linear with respect to ontology size • Linear with respect to tree size

  36. Publications • Yannis Roussakis, Giorgos Flouris, Vassilis Christophides. Declarative Repairing Policies for Curated KBs. In Proceedings of the 10th Hellenic Data Management Symposium (HDMS-11), 2011 • Yannis Roussakis, Giorgos Flouris, Vassilis Christophides. Preference-Based Repairing of RDF/S DBs. Tentative title, to be submitted to PVLDB, January 2012

  37. Outlook • Continue refining the model based on experience with the dataset catalog • Derive “best practices checks” from metrics • Results of quality assessment to be added to the next release of the catalog • Collaboration with the EU-funded LOD2 project (FP7) towards Data Fusion based on the PlanetData Quality Framework • Finalize experiments for Data Repair
