
Knowledge and the Web – Data quality issues


Presentation Transcript


  1. Last update: 19 October 2016 Knowledge and the Web – Data quality issues Bettina Berendt KU Leuven, Department of Computer Science http://www.cs.kuleuven.be/~berendt/teaching/2016-17-1stsemester/kaw

  2. About your project proposals • General remarks (now) • Specific remarks (tomorrow in exercise session)

  3. Agenda: Data quality, esp. in (L)OD: hopes, concerns, tests • What is data quality? • Dimensions of LOD quality • Provenance and inconsistencies • Task: be a data-quality detective!

  4. Recall: Different types of “Open”

  5. But why open? (from http://opendefinition.org/) • Do you see a statement in this definition that does not appear substantiated? • Can you give 3 reasons why it may be true? • Can you give 3 reasons why it may be false?

  6. PS: “Modify” • … does not necessarily mean that everybody should be able to modify the original data • It does mean that you can take the data and modify your own copy. • Specifically, opendefinition.org’s content is licensed under a CC Attribution 4.0 International License.

  7. More about the Wikipedia case • https://www.scientificamerican.com/article/wikipedia-editors-woo-scientists-to-improve-content-quality/

  8. Our brainstorming (Task: check whether each is contained in Zaveri et al.’s list!) • readability (format, ...) • missing data • different units • inaccurate data, false data (intention?) • duplicate data • format/type/value (e.g. strings instead of numbers) • context description missing • language • verifiability? • outdated, non-constant, lack of data-creation timestamp • how much info can you get out of the data (how connected is it)? • easy availability?
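
Several of these issues can be screened for mechanically. Below is a minimal sketch of such checks in plain Python; the records, field names, and thresholds are invented for illustration, not taken from any real dataset.

```python
# Minimal, self-contained checks for a few of the brainstormed issues.
# The records and field names are made up for illustration.
records = [
    {"city": "Leuven", "population": 101000, "updated": "2016-09-01"},
    {"city": "Leuven", "population": 101000, "updated": "2016-09-01"},  # duplicate
    {"city": "Gent", "population": "unknown"},                # wrong type, no timestamp
    {"city": None, "population": 260000, "updated": "2015-01-15"},      # missing value
]

issues = []
seen = set()
for i, rec in enumerate(records):
    fingerprint = tuple(sorted(rec.items()))
    if fingerprint in seen:                         # duplicate data
        issues.append((i, "duplicate record"))
    seen.add(fingerprint)
    if any(v is None for v in rec.values()):        # missing data
        issues.append((i, "missing value"))
    if not isinstance(rec.get("population"), int):  # format/type/value errors
        issues.append((i, "population is not a number"))
    if "updated" not in rec:                        # no data-creation timestamp
        issues.append((i, "no timestamp"))

for i, problem in issues:
    print(f"record {i}: {problem}")
```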

  9. From questions students asked last year: (1) Quality dimensions • Is there any gain or information to be won by companies that do not provide services on the web, given that services such as DBpedia are not stable? To what extent should they be able to trust the quality and accessibility of these services? Or do you think that the Semantic Web should be seen as just a source of information, and not as a possible asset to commercial institutions/industry/...? • How is the credibility of entries usually evaluated in real-life applications, especially for user-generated content? • If some data is modified, what happens to all the links pointing to that data? Will they be updated or not? How difficult is it to update all related links? What happens if new links are only added, while old links are neither deleted nor updated?

  10. From questions: (2) “Inconsistencies between datasets” • If I want to reuse an already developed ontology, is there any general methodology for choosing one? How much impact do the incoming and outgoing links have? Are there other parameters that should be considered? • In an application using Linked Data, how do you handle disagreement and contradictory information about an entity? • In the text “Linked Data: Evolving the Web into a Global Data Space” by Heath and Bizer, it is stated that one of the properties of the “Web of Data” is that it is able to represent disagreement and contradictory information. How is contradictory information a good thing, and how can the user know which information is trustworthy? What if people maliciously add wrong data? • How do we know which data is correct/incorrect? Who checks this? • Who or what determines whether a link is valid? What can you do if there is a wrong link in the current LOD set?

  11. From your questions: (3) Issues affected by these questions • If we are going to build a user-friendly search engine for navigating Linked Data, what is the biggest problem we need to solve? What affects the search speed and the accuracy of the search results? How can they be improved? How can we prevent users from getting lost while searching?

  12. Agenda: Data quality, esp. in (L)OD: hopes, concerns, tests • What is data quality? • Dimensions of LOD quality • Provenance and inconsistencies • Task: be a data-quality detective!

  13. What is data quality? http://www.dqglossary.com/data%20quality_.html (a collection of definitions) • Data quality: The totality of features and characteristics of data that bears on their ability to satisfy a given purpose [...] (Glossary of Quality Assurance Terms, Hanford.gov, retrieved 26 August 2009) • Data quality: The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use. (Government of British Columbia, retrieved 26 August 2009)

  14. Agenda: Data quality, esp. in (L)OD: hopes, concerns, tests • What is data quality? • Dimensions of LOD quality • Provenance and inconsistencies • Task: be a data-quality detective!

  15. What is data quality in LOD? (*: newly introduced for LD) Zaveri et al., 2012

  16. A wonderful example of good research!

  17. What is data quality in LOD? (*: newly introduced for LD) Zaveri et al., 2012

  18. Contextual

  19. What is data quality in LOD? (*: newly introduced for LD) Zaveri et al., 2012

  20. Intrinsic (1)

  21. Intrinsic (2)

  22. What is data quality in LOD? (*: newly introduced for LD) Zaveri et al., 2012

  23. Accessibility

  24. What is data quality in LOD? (*: newly introduced for LD) Zaveri et al., 2012

  25. Representational (1)

  26. Representational (2)

  27. What is data quality in LOD? (*: newly introduced for LD) Zaveri et al., 2012

  28. Dataset dynamicity

  29. What is data quality in LOD? (*: newly introduced for LD) Zaveri et al., 2012

  30. Trust (1)

  31. Trust (2)

  32. Agenda: Data quality, esp. in (L)OD: hopes, concerns, tests • What is data quality? • Dimensions of LOD quality • Provenance and inconsistencies • Task: be a data-quality detective!

  33. PROV – the W3C Provenance Specifications • W3C Working Group Note 30 April 2013 http://www.w3.org/TR/prov-overview/ • Tutorial at ESWC 2013 by Paul Groth, Jun Zhao, and Olaf Hartig: http://www.w3.org/2001/sw/wiki/ESWC2013ProvTutorial • Shown in class: introduction • Also interesting: PROV-O • (PPTs can be found on the tutorial page) • Book: http://www.provbook.org/ (accessible from within KU Leuven)
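
As a taste of what PROV looks like in practice, here is a minimal sketch using the prov Python package (the library used in the provbook.org examples); the dataset, agent, and activity names are invented.

```python
from prov.model import ProvDocument

# Build a tiny provenance record: a cleaned dataset derived from an
# original one by a cleaning activity, attributed to an (invented) agent.
doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")

original = doc.entity("ex:dataset-v1")
cleaned = doc.entity("ex:dataset-v2")
alice = doc.agent("ex:Alice")
cleaning = doc.activity("ex:cleaning-run")

doc.used(cleaning, original)
doc.wasGeneratedBy(cleaned, cleaning)
doc.wasDerivedFrom(cleaned, original)
doc.wasAssociatedWith(cleaning, alice)
doc.wasAttributedTo(cleaned, alice)

print(doc.get_provn())  # human-readable PROV-N serialization
```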

  34. ProvStore (https://provenance.ecs.soton.ac.uk/store/)

  35. A case of provenance

  36. Formalizing provenance: a high-level view

  37. More about provenance … • … by our invited speaker Tom De Nies on 23 Nov.

  38. Inconsistencies between different data sources

  39. What to do when you can retrieve/derive both A and ¬A • The anarchistic solution: ex falso quodlibet (from a contradiction, anything follows) • The careful solution: “don’t know” (retract both) • The pragmatic solution: choose one • (Note: this classification of solutions is NOT standard! ;-) )

  40. Inconsistency Resolution Strategies • Pass it on: pass conflicting values on to the user and let him/her decide. • Take the information: if the value is missing in dataset 1, use the value from dataset 2. • Trust your friends: prefer information from certain sources. • Cry with the wolves: choose the most common value. • Meet in the middle: take the average of all values. • Keep up to date: use the newest value. Slide adapted from Bizer (2008). See also: Bleiholder and Naumann: Conflict Handling Strategies in an Integrated Information System. WWW 2006.
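
A minimal sketch of four of these strategies in Python; the conflicting population values, source names, and trust order are all invented.

```python
from collections import Counter
from datetime import date

# Conflicting values for one attribute ("population" of some entity),
# each with a source name and a timestamp; all values are made up.
values = [
    {"source": "dbpedia",  "value": 620_000, "date": date(2015, 3, 1)},
    {"source": "geonames", "value": 604_000, "date": date(2016, 8, 1)},
    {"source": "wikidata", "value": 620_000, "date": date(2016, 1, 1)},
]

TRUSTED = ["wikidata", "dbpedia", "geonames"]  # assumed preference order

def trust_your_friends(vs):
    """Prefer information from certain sources."""
    return min(vs, key=lambda v: TRUSTED.index(v["source"]))["value"]

def cry_with_the_wolves(vs):
    """Choose the most common value."""
    return Counter(v["value"] for v in vs).most_common(1)[0][0]

def meet_in_the_middle(vs):
    """Take the average of all values."""
    return sum(v["value"] for v in vs) / len(vs)

def keep_up_to_date(vs):
    """Use the newest value."""
    return max(vs, key=lambda v: v["date"])["value"]

print(trust_your_friends(values))   # 620000 (wikidata is most trusted)
print(cry_with_the_wolves(values))  # 620000 (appears twice)
print(meet_in_the_middle(values))   # 614666.66... (average)
print(keep_up_to_date(values))      # 604000 (newest, 2016-08-01)
```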

  41. “Prefer information from certain sources” • How do I know what the source is? • How do I know which sources are better than others? • What do I do with this information about sources?

  42. Q1. How do I know what the source is? • Provenance

  43. Q2. How do I know which sources are better than others? • How do I know this • in real life? • in reading scientific papers? • on the Web? • We will look at two approaches: • “democratic” – example: voting • “meritocratic” – example: PageRank

  44. PageRank, “the basis of Google” • Notions of “hub” and “authority”: an authority is a highly referenced page; a hub is a page containing good reference lists. • Intuition: a high-quality site is one that has many high-quality sites linking to it. • Two algorithms developed at the same time: Kleinberg’s HITS (hubs and authorities) and Brin & Page’s PageRank (authorities). • A “rediscovery” of a work from bibliometrics: Pinski & Narin (1976). • Many re-uses of this idea in different domains (information retrieval, text summarization, the social Web, …). Slide based on Karimzadehgan (2007)

  45. Exploiting inter-document links: what does a link tell us? Links indicate the utility of a document, and the anchor text provides a description of it. [Figure: a hub page linking to authority pages] Slide based on Karimzadehgan (2007)

  46. PageRank: the “random surfer” model [Figure: page A and the pages pointing to A; with probability q the surfer randomly jumps to a page] Slide based on Karimzadehgan (2007)

  47. PageRank and relevance • A “random surfer” selects a page, keeps clicking links until “bored”, then randomly selects another page. • PageRank(A) is the probability that such a user visits A; q is the probability of getting bored at a page. • The PageRank matrix can be computed offline. • Google takes into account both the relevance of a page and its PageRank (and many other things, of course). Relevance is computed from the text and other features (a proprietary and evolving scheme). Slide based on Karimzadehgan (2007)
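
To make the random-surfer reading concrete, here is a minimal power-iteration PageRank sketch in Python; the toy graph and parameter values are invented, and real implementations differ in many details.

```python
# With probability q the surfer jumps to a random page; otherwise they
# follow a random outgoing link of the current page.
def pagerank(links, q=0.15, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: q / n for p in pages}          # random-jump share
        for p, outs in links.items():
            if not outs:                          # dangling page: spread evenly
                for t in pages:
                    new[t] += (1 - q) * rank[p] / n
            else:                                 # follow an outgoing link
                for t in outs:
                    new[t] += (1 - q) * rank[p] / len(outs)
        rank = new
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
for page, r in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
    print(page, round(r, 3))
```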

  48. Q3. What do I do with this information about sources? • Example: Bonatti et al. 2011 • Compute the PageRank of sources (domains or documents; recall the LOD cloud) • Rank of a triple = sum over the ranks of the sources containing this triple • Rank of an inference = minimum over the ranks of the triples needed to make this inference • Identify the minimal set of triples that causes the inconsistency • Remove the minimum-ranked triples to restore consistency (see the sketch below)
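
A minimal sketch of this repair idea, with invented source ranks, toy triples, and a hand-given minimal inconsistent set (in practice, finding such sets requires a reasoner).

```python
# Assumed PageRank-style ranks of three (invented) sources.
source_rank = {"src1": 0.5, "src2": 0.3, "src3": 0.2}

# Each triple (abbreviated as a tuple) with the sources that state it.
stated_in = {
    ("ex:Leuven", "ex:population", "101000"): {"src1", "src3"},
    ("ex:Leuven", "ex:population", "620000"): {"src2"},
}

def triple_rank(triple):
    # Rank of a triple = sum of the ranks of the sources containing it.
    return sum(source_rank[s] for s in stated_in[triple])

# Suppose a consistency check found that these two triples contradict each
# other (e.g. because ex:population is a functional property); here the
# minimal inconsistent set is simply given by hand.
conflict = list(stated_in)

# Remove the minimum-ranked triple to restore consistency.
to_remove = min(conflict, key=triple_rank)
print("remove:", to_remove, "rank:", round(triple_rank(to_remove), 2))
```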
