
Using Provenance for Quality Assessment and Repair in Linked Open Data



  1. Using Provenance for Quality Assessment and Repair in Linked Open Data Giorgos Flouris, Yannis Roussakis, María Poveda-Villalón, Pablo N. Mendes, Irini Fundulaki Publication at EvoDyn-12

  2. Setting and General Idea • Linked Open Data cloud • Uncontrolled • Vast • Unstructured • Dynamic • Datasets are interrelated, fused, etc. • Quality problems emerge • Assessment (measure quality) • Repair (improve quality)

  3. Motivating Example • [Diagram: Wikipedia language editions ES, FR, EN, GE, PT] • User seeks information on Brazilian cities • Fuses Wikipedia dumps from different languages • Guarantees maximal coverage, but may lead to conflicts • E.g., a city with two different population counts

  4. Main Tasks • Assess the quality of the resulting dataset • Framework for associating data with its quality • Repair the resulting dataset • By removing one of the conflicting values (i.e., one of the conflicting population counts) • How to determine which value to keep? • Solution: use heuristics • Here, we evaluate the use of provenance-related heuristics • Prefer most recent information • Prefer most trustworthy information
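
As an illustration of these two heuristics, here is a minimal Python sketch that repairs a conflict by keeping one value and removing the rest, chosen by recency or by source trust. The record layout, the dates, and the trust ranking are illustrative assumptions, not the paper's implementation; the conflicting counts are borrowed from the Aracati example on slide 18.

```python
from datetime import date

# Hypothetical conflicting records for one city's population:
# (value, source, last-modification date). Layout and dates are assumed.
conflict = [
    (69616, "en.dbpedia", date(2012, 3, 1)),
    (69159, "pt.dbpedia", date(2012, 6, 15)),
]

def keep_most_recent(records):
    # Heuristic 1: prefer the most recent information.
    return max(records, key=lambda r: r[2])

TRUST = {"pt.dbpedia": 2, "en.dbpedia": 1}  # assumed trust ranking

def keep_most_trusted(records):
    # Heuristic 2: prefer the most trustworthy source.
    return max(records, key=lambda r: TRUST.get(r[1], 0))

# Repair = keep one value, remove the other conflicting ones.
kept = keep_most_recent(conflict)
removed = [r for r in conflict if r is not kept]
print(kept, removed)
```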

  5. Contributions • Emphasis on provenance • Assessment metrics (done) • Heuristics for repair (done, but without support for metadata information) • Contributions: • Extend repair algorithm to support heuristics on metadata • Define 5 different metrics based on provenance • Used for both assessment and repair • Evaluate them in a real setting

  6. Quality Assessment • Quality = “fitness for use” • Multi-dimensional, multi-faceted, context-dependent • Methodology for quality assessment • Dimensions • Aspects of quality • Accuracy, completeness, timeliness, … • Indicators • Metadata values for measuring dimensions • Last modification date (related to timeliness) • Scoring Functions • Functions to quantify quality indicators • Days since last modification date • Metrics • Measures of dimensions (result of scoring function) • Can be combined • We use this framework to define our metrics
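
To make the indicator → scoring function → metric pipeline concrete, here is a small sketch of a timeliness metric built from the last-modification-date indicator, as in the slide's example. The linear decay and the 365-day horizon are assumed normalisation choices, not taken from the paper.

```python
from datetime import date

def days_since_modification(last_modified: date, today: date) -> int:
    # Scoring function: quantifies the "last modification date" indicator.
    return (today - last_modified).days

def timeliness_metric(last_modified: date, today: date,
                      horizon: int = 365) -> float:
    # Metric in [0, 1]: 1.0 if modified today, decaying linearly to 0
    # after `horizon` days. The linear normalisation is an assumed choice.
    age = days_since_modification(last_modified, today)
    return max(0.0, 1.0 - age / horizon)

print(timeliness_metric(date(2012, 1, 1), date(2012, 7, 1)))  # ~0.50
```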

  7. Quality Repair (Setting) • Focus on validity (quality dimension) • Encodes context- or application-specific requirements • Applications may be useless over invalid data • Binary concept (valid/invalid) • Generic

  8. Quality Repair (Rules) • Rules determine validity • Expressive • Disjunctive Embedded Dependencies (DEDs) • Cause interdependencies • Resolving one violation may cause or resolve others • Difficult to foresee the ramifications of repairing choices • User cannot make the selection alone
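
The functional-property constraint used later in the experiments (slide 11) is one simple rule of this kind. Below is a sketch of detecting its violations over subject–property–value triples; the triple layout and the toy area value are illustrative, while the conflicting population counts come from slide 18.

```python
from collections import defaultdict

triples = [
    ("Aracati", "populationTotal", 69159),
    ("Aracati", "populationTotal", 69616),  # violates functionality
    ("Aracati", "areaTotal", 1000),         # toy value
]

def functional_violations(triples):
    # Group values per (subject, property); more than one distinct
    # value violates the functional-property rule.
    values = defaultdict(set)
    for s, p, v in triples:
        values[(s, p)].add(v)
    return {k: vs for k, vs in values.items() if len(vs) > 1}

print(functional_violations(triples))
# {('Aracati', 'populationTotal'): {69159, 69616}}
```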

  9. Quality Repair (Preferences) • Selection is done automatically, according to a set of user-defined specifications • Which repairing option is best? • Ontology engineer determines that via preferences • Specified by ontology engineer beforehand • High-level “specifications” for the ideal repair • Serve as “instructions” to determine the preferred solution for repair • Highly expressive
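
Read operationally, a preference can be seen as an ordering over candidate repairing options that the algorithm consults automatically. The interface below is an illustrative sketch under that reading, not the paper's formalisation; `minimal_change` is a hypothetical example preference.

```python
def choose_option(options, preference_key):
    # The repair algorithm consults the preference automatically:
    # among the candidate repairing options, keep the best-scoring one.
    return max(options, key=preference_key)

def minimal_change(option):
    # Hypothetical example preference: delete as little as possible.
    return -len(option)

options = [  # each option = the set of triples a repair would remove
    {("Aracati", "populationTotal", 69616)},
    {("Aracati", "populationTotal", 69159)},
    {("Aracati", "populationTotal", 69159),
     ("Aracati", "populationTotal", 69616)},
]
print(choose_option(options, minimal_change))  # a single-deletion option
```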

  10. Quality Repair (Extensions) • Existing work on repair is limited • Provenance cannot be considered for preferences • Assessment metrics based on provenance cannot be exploited • Extensions are needed (and provided) • Metadata (including provenance) can be used in preferences • Preferences can apply to both repairs and repairing options • Formal details omitted (see paper)

  11. Experiments (Setting) • Setting taken from the motivating example • Fused 5 Wikipedias: EN, PT, SP, GE, FR • Distilled information about Brazilian cities • Properties considered: • populationTotal • areaTotal • foundingDate • Validity rules: properties must be functional • Repaired invalidities (using our metrics) • Checked quality of result • Dimensions: consistency, validity, conciseness, completeness and accuracy

  12. Metrics for Experiments (1/2) • PREFER_PT: select conflicting information based on its source (PT > EN > SP > GE > FR) • PREFER_RECENT: select conflicting information based on its recency (most recent is preferred) • PLAUSIBLE_PT: ignore “irrational” data (population < 500, area < 300 km², founding date < 1500 AD); otherwise use PREFER_PT
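
A sketch of these three metrics as selection functions over conflicting records; the dictionary-based record layout, the field names, and the modification dates are assumptions made for illustration (the sample values are Aracati's counts from slide 18).

```python
from datetime import date

SOURCE_RANK = {"PT": 5, "EN": 4, "SP": 3, "GE": 2, "FR": 1}  # PT>EN>SP>GE>FR

def prefer_pt(records):
    # PREFER_PT: keep the value coming from the highest-ranked source.
    return max(records, key=lambda r: SOURCE_RANK[r["source"]])

def prefer_recent(records):
    # PREFER_RECENT: keep the most recently modified value.
    return max(records, key=lambda r: r["modified"])

PLAUSIBILITY = {  # thresholds as stated on the slide
    "populationTotal": lambda v: v >= 500,
    "areaTotal": lambda v: v >= 300,           # km^2
    "foundingDate": lambda v: v.year >= 1500,  # AD
}

def plausible_pt(records):
    # PLAUSIBLE_PT: drop "irrational" values, then apply PREFER_PT.
    rational = [r for r in records if PLAUSIBILITY[r["property"]](r["value"])]
    return prefer_pt(rational or records)

records = [  # Aracati's conflicting counts from slide 18; dates assumed
    {"source": "EN", "property": "populationTotal", "value": 69616,
     "modified": date(2012, 5, 1)},
    {"source": "PT", "property": "populationTotal", "value": 69159,
     "modified": date(2012, 2, 1)},
]
print(prefer_pt(records)["value"])      # 69159 (PT outranks EN)
print(prefer_recent(records)["value"])  # 69616 (newer record)
```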

  13. Metrics for Experiments (2/2) • WEIGHTED_RECENT: select based on recency, but in cases where the records are almost equally recent, use source reputation (if less than 3 months apart, use PREFER_PT, else use PREFER_RECENT) • CONDITIONAL_PT: define source trustworthiness depending on data values (prefer PT for small cities with population < 500,000, prefer EN for the rest)
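
Sketches of the two composite metrics in the same style. Two simplifications are assumptions: the 3-month window is approximated as 90 days, and CONDITIONAL_PT is applied only to population conflicts; the sample values are Oiapoque's counts from slide 18.

```python
from datetime import date

SOURCE_RANK = {"PT": 5, "EN": 4, "SP": 3, "GE": 2, "FR": 1}

def prefer_pt(records):
    # As in the previous sketch: highest-ranked source wins.
    return max(records, key=lambda r: SOURCE_RANK[r["source"]])

def prefer_recent(records):
    return max(records, key=lambda r: r["modified"])

def weighted_recent(records):
    # WEIGHTED_RECENT: recency, unless the records are nearly equally
    # recent; "3 months" is approximated as 90 days (an assumption).
    dates = [r["modified"] for r in records]
    if (max(dates) - min(dates)).days < 90:
        return prefer_pt(records)
    return prefer_recent(records)

def conditional_pt(records):
    # CONDITIONAL_PT: trust PT for small cities (population < 500,000),
    # EN otherwise; fall back to the rank order if that source is absent.
    small = any(r["property"] == "populationTotal" and r["value"] < 500_000
                for r in records)
    wanted = "PT" if small else "EN"
    matching = [r for r in records if r["source"] == wanted]
    return matching[0] if matching else prefer_pt(records)

records = [  # Oiapoque's conflicting counts from slide 18; dates assumed
    {"source": "EN", "property": "populationTotal", "value": 20226,
     "modified": date(2012, 5, 1)},
    {"source": "PT", "property": "populationTotal", "value": 20426,
     "modified": date(2012, 4, 1)},
]
print(weighted_recent(records)["value"])  # 20426: <90 days apart, PT wins
print(conditional_pt(records)["value"])   # 20426: small city, PT preferred
```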

  14. Consistency, Validity • Consistency • Lack of conflicting triples • Guaranteed to be perfect (by the repairing algorithm), regardless of preference • Validity • Lack of rule violations • Coincides with consistency for this example • Guaranteed to be perfect (by the repairing algorithm), regardless of preference

  15. Conciseness, Completeness • Conciseness • No duplicates in the final result • Guaranteed to be perfect (by the fuse process), regardless of preference • Completeness • Coverage of information • Improved by fusion • Unaffected by our algorithm • Input completeness = output completeness, regardless of preference • Measured at 77.02%
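
One plausible coverage reading of completeness, sketched below: the fraction of (city, property) slots for which the fused dataset holds at least one value. The paper's exact definition may differ; the toy data is illustrative.

```python
def completeness(dataset, cities, properties):
    # Coverage: fraction of (city, property) slots with at least one value.
    have = {(s, p) for s, p, _ in dataset}
    filled = sum((c, p) in have for c in cities for p in properties)
    return filled / (len(cities) * len(properties))

dataset = [("Aracati", "populationTotal", 69159),
           ("Aracati", "areaTotal", 1000),        # toy value
           ("Oiapoque", "populationTotal", 20426)]
cities = ["Aracati", "Oiapoque"]
properties = ["populationTotal", "areaTotal", "foundingDate"]
print(completeness(dataset, cities, properties))  # 0.5 (3 of 6 slots filled)
```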

  16. Accuracy • Most important metric for this experiment • Accuracy • Closeness to the “actual state of affairs” • Affected by the repairing choices • Compared repair with the Gold Standard • Taken from an official and independent data source (IBGE)

  17. Accuracy Evaluation • [Diagram: fr.dbpedia, en.dbpedia, pt.dbpedia, … (dbpedia:areaTotal, dbpedia:populationTotal, dbpedia:foundingDate) are fused/repaired into integrated data, which is then compared for accuracy against a Gold Standard from the Instituto Brasileiro de Geografia e Estatística (IBGE)]

  18. Accuracy Examples • City of Aracati • Population: 69159/69616 (conflicting) • Record in Gold Standard: 69159 • Good choice: 69159 • Bad choice: 69616 • City of Oiapoque • Population: 20226/20426 (conflicting) • Record in Gold Standard: 20509 • Optimal approximation choice: 20426 • Sub-optimal approximation choice: 20226
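
The "optimal approximation" notion in these examples can be stated directly: when no candidate matches the Gold Standard exactly, the best repairing choice is the numerically closest one. A sketch using the slide's own figures:

```python
def closest_to_gold(candidates, gold):
    # Pick the candidate value numerically closest to the Gold Standard.
    return min(candidates, key=lambda v: abs(v - gold))

print(closest_to_gold([69159, 69616], 69159))  # Aracati: exact match, 69159
print(closest_to_gold([20226, 20426], 20509))  # Oiapoque: best approx, 20426
```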

  19. Accuracy Results

  20. Accuracy of Input and Output

  21. Conclusion • Quality assessment and repair of LOD • Evaluated a set of sophisticated, provenance-inspired metrics for: • Assessing quality • Repairing conflicts • Used in a specific experimental setting • Results are necessarily application-specific • THANK YOU!
