This document discusses the challenges and solutions in Linked Open Data (LOD) quality assessment. It highlights the necessity of publishing quality datasets for effective interlinking in the Semantic Web. Existing methods for data quality evaluation are critiqued for focusing on syntactic validation rather than inherent quality. Our solution introduces a metrics-driven framework for assessing the inherent quality of datasets before their integration into the LOD cloud, using the Linked Open Data Quality Model (LODQM) and its metrics to ensure consistency and accessibility.
Metrics-Driven Approach for LOD Quality Assessment
2014-May-07
Outline
• What is the problem?
• What have others done?
• What is our solution?
• Does it work?
What is the problem?
• Linked Open Data (LOD): realizing the Semantic Web by interlinking existing but dispersed data
• Main components of LOD:
  • URIs to identify things
  • RDF to describe data
  • HTTP to access data
(A minimal illustration of these components follows below.)
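A minimal sketch, using Python's rdflib, of how the three components fit together: a URI names the resource, RDF triples describe it, and HTTP is how a client dereferences the URI. The example.org URIs and property names are invented for illustration.

```python
# Minimal sketch of the three LOD building blocks (assumed example URIs).
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")

g = Graph()
person = URIRef("http://example.org/people/alice")  # a URI identifies the thing
g.add((person, RDF.type, EX.Person))                # RDF describes the data
g.add((person, RDFS.label, Literal("Alice")))
g.add((person, EX.knows, URIRef("http://example.org/people/bob")))

# Over HTTP, a Linked Data server would return a serialization like this
# (Turtle, RDF/XML, N-Triples, ...) when the URI is dereferenced.
print(g.serialize(format="turtle"))
```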
What is the problem?
• Datasets: 295
• Triples: over 30,000,000,000 (30 B)
• Links: over 500,000,000 (500 M)
What is the problem?
Inclusion criteria for publishing and interlinking datasets into the LOD cloud:
• Resolvable http:// or https:// URIs
• Presented in one of the standard Semantic Web formats (RDF, RDFa, RDF/XML, Turtle, N-Triples)
• Contains at least 1,000 triples
• Connected via at least 50 RDF links to existing datasets in the LOD cloud
• Accessible via RDF crawling, an RDF dump, or a SPARQL endpoint
Is the dataset ready to publish? (A sketch of checking the countable criteria follows below.)
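Two of these criteria (triple count and outbound RDF links) can be tested mechanically on a local dump. Below is a minimal sketch with rdflib; the host-comparison heuristic for "external link" and the function name are assumptions for illustration, not the official LOD cloud procedure.

```python
# Hedged sketch: check the two countable inclusion criteria on a local dump.
from urllib.parse import urlparse
from rdflib import Graph, URIRef

def check_inclusion_criteria(dump_path, dataset_host):
    g = Graph()
    g.parse(dump_path)  # rdflib guesses the format from the file extension

    # Assumption: an "RDF link to an existing dataset" is approximated as a
    # triple whose object URI lives on a different host than the dataset.
    external_links = sum(
        1 for _, _, o in g
        if isinstance(o, URIRef) and urlparse(str(o)).netloc != dataset_host
    )
    return {
        "triples": len(g),
        "triples_ok": len(g) >= 1000,
        "external_links": external_links,
        "links_ok": external_links >= 50,
    }

# Usage (hypothetical file and host):
# print(check_inclusion_criteria("mydata.ttl", "example.org"))
```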
What is the problem?
• The idea behind LOD: publish first, improve later
• Result: quality problems in the published datasets
• Missing link: data quality evaluation before release
What have others done?
Data quality in the context of LOD:
• Validators:
  • General validators
  • Parsing and syntax
  • Accessibility / dereferenceability
• Quality assessment of published data:
  • Classifying quality problems of LOD
  • Using metadata for quality assessment
  • Filtering poor-quality data (WIQA)
  • Semantic annotation using ontologies
What have others done?
Limitations of related work:
• Syntax validation, not quality evaluation
• Not scalable
• Not fully automated
• Evaluation only after publishing
What is our solution?
Proposing a set of metrics for inherent quality assessment of datasets before interlinking them into the LOD cloud.
2. Proposing Metrics
Example:
• Goal: assess the consistency of a dataset in the context of LOD
• Question: what is the degree of conflict at the level of data values?
• Metric: the number of functional properties with inconsistent values
(A sketch of this metric follows below.)
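One way to operationalize this metric is a SPARQL query that counts owl:FunctionalProperty instances holding two different values for the same subject. This is a hedged sketch of that reading; the paper's exact operationalization may differ.

```python
# Sketch: count functional properties with conflicting values via SPARQL.
from rdflib import Graph

QUERY = """
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT (COUNT(DISTINCT ?p) AS ?inconsistent)
WHERE {
  ?p a owl:FunctionalProperty .
  ?s ?p ?v1 , ?v2 .          # same subject, same property, two values
  FILTER (?v1 != ?v2)        # the values conflict
}
"""

def inconsistent_functional_properties(dump_path):
    g = Graph()
    g.parse(dump_path)
    row = next(iter(g.query(QUERY)))
    return int(row.inconsistent)
```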
3. Developing LODQM
• LODQM: Linked Open Data Quality Model
• 6 quality dimensions
• 32 metrics
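As a rough illustration, a quality model like LODQM can be represented in code as a mapping from dimensions to metric functions. Only the dimensions actually named in this talk (Consistency, Syntactic accuracy, Interlinking) appear below; the remaining dimensions, the 32 concrete metrics, and the equal-weight aggregation are placeholders, not the model's actual definition.

```python
# Hypothetical skeleton of a dimensions -> metrics quality model.
from typing import Callable, Dict, List
from rdflib import Graph

Metric = Callable[[Graph], float]  # each metric scores one dataset

LODQM: Dict[str, List[Metric]] = {
    "Consistency": [],          # e.g. inconsistent functional property values
    "Syntactic accuracy": [],   # e.g. malformed or ill-typed literals
    "Interlinking": [],         # e.g. outbound RDF links per entity
    # ... three further dimensions; 32 metrics in total in the full model
}

def assess(g: Graph) -> Dict[str, float]:
    # Assumption: a dimension's score is the plain average of its metrics.
    return {
        dim: sum(m(g) for m in metrics) / len(metrics) if metrics else 0.0
        for dim, metrics in LODQM.items()
    }
```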
5. Empirical Evaluation
Result (metric correlations):
• Three pairs of metrics are correlated: {IFP, Im_DT}, {Im_DT, Sml_Cls}, {Inc_Prp_Vlu, IF}
• The other metric pairs are independent
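The slides do not state how correlation was tested. A typical setup, sketched below under that assumption, scores every dataset with every metric and tests each pair of score vectors; Spearman's rank correlation and the 0.05 threshold are assumed choices, not the paper's stated method.

```python
# Hedged sketch of pairwise correlation testing between metric scores.
from itertools import combinations
from scipy.stats import spearmanr

def correlated_metric_pairs(scores, alpha=0.05):
    """scores: dict mapping metric name -> array of values,
    one value per evaluated dataset."""
    pairs = []
    for a, b in combinations(scores, 2):
        rho, p = spearmanr(scores[a], scores[b])
        if p < alpha:  # reject independence at the assumed threshold
            pairs.append((a, b, round(float(rho), 3)))
    return pairs
```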
5. Empirical Evaluation
Result (dimension correlations):
• Only one pair of quality dimensions is correlated: {Interlinking, Syntactic accuracy}
• The other dimension pairs are independent
6. Quality Prediction
• Method: a neural network (MultiLayerPerceptron)
• Result: 20 out of 32 metrics are selected
(A sketch of this setup follows below.)
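A hedged sketch of this setup using scikit-learn: select 20 of the 32 metric features, then train a multilayer perceptron to predict a dataset's quality label. SelectKBest with an F-test stands in for whatever selection procedure the authors actually used, and X / y are hypothetical training data.

```python
# Sketch: feature selection (20 of 32 metrics) + MLP quality predictor.
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def build_quality_predictor():
    return make_pipeline(
        StandardScaler(),              # MLPs are sensitive to feature scale
        SelectKBest(f_classif, k=20),  # keep 20 of the 32 metric features
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    )

# Usage (hypothetical data): X has shape (n_datasets, 32); y holds labels.
# model = build_quality_predictor().fit(X, y)
# predictions = model.predict(X_new)
```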
Thank you for your attention and comments.