Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis

OC Working Group – 21.01.2014, Serge Tymaniuk. Overview: Introduction, Methodology, Results, Questions.

Presentation Transcript


  1. Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis OC Working Group – 21.01.2014 Serge Tymaniuk

  2. Overview • Introduction • Methodology • Results • Questions

  3. Introduction • Written by Christian Bizer (1), Kai Eckert (1), Robert Meusel (1), Hannes Mühleisen (2), Michael Schuhmacher (1), and Johanna Völker (1) • (1) Data and Web Science Group, University of Mannheim, Germany • (2) Database Architectures Group, Centrum Wiskunde & Informatica, Netherlands • Features: • Analysis of RDFa, Microdata, and Microformats adoption on the Web • Based on a large public Web crawl of 3 billion HTML pages • Aims at revealing the main topical areas of the published data and the different vocabularies used within each topical area • Examines structural richness (which properties are used to describe popular types of entities)

  4. Web Crawl • Web crawl provided by the Common Crawl Foundation, available as ARC files from Amazon S3 • 3,005,626,093 unique HTML pages from 40.6 million pay-level domains (PLDs) • Crawl conducted between January and June 2012 • Compressed size of the corpus is 48 TB • The crawler relies on the PageRank algorithm to decide which pages to retrieve

  5. Data Extraction Process • Parsing framework is executed on Amazon EC2 • Relies on the Apache Any23 (Anything To Triples, http://any23.apache.org/) parsing library • The RapidMiner data mining framework is used for vocabulary term co-occurrence analyses
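To make the extraction step concrete, the following is a minimal sketch of how the three formats can be told apart in a single HTML page. It is not the Any23 pipeline used in the study: it only looks for a few characteristic attributes and class names, and `FormatDetector`, `detect_formats`, and the `MICROFORMAT_CLASSES` set are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

# Assumption: a small, deliberately incomplete set of microformat root class names.
MICROFORMAT_CLASSES = {"vcard", "vevent", "hreview", "hrecipe", "hproduct"}


class FormatDetector(HTMLParser):
    """Records which markup formats appear to be present in one HTML page."""

    def __init__(self):
        super().__init__()
        self.formats = set()

    def handle_starttag(self, tag, attrs):
        names = {name for name, _ in attrs}
        # Microdata entities are introduced with itemscope / itemtype.
        if "itemscope" in names or "itemtype" in names:
            self.formats.add("microdata")
        # RDFa uses attributes such as typeof and property (e.g. OGP meta tags).
        if "typeof" in names or "property" in names:
            self.formats.add("rdfa")
        # Microformats are signalled through class names.
        for name, value in attrs:
            if name == "class" and value:
                if MICROFORMAT_CLASSES & set(value.split()):
                    self.formats.add("microformats")


def detect_formats(html):
    """Return the set of formats detected in the given HTML string."""
    detector = FormatDetector()
    detector.feed(html)
    detector.close()
    return detector.formats


print(detect_formats('<meta property="og:title" content="Example">'))
```

A real parser additionally resolves vocabularies, nesting, and base URIs and emits RDF triples; this sketch only answers the yes/no question of which formats a page uses.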

  6. Results: Overall picture • Structured data was discovered within 369M out of 3B pages contained in the Common Crawl corpus (12.3%), and within 2.29M out of 40.6M domains (5.64%)
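The two percentages can be re-derived from the counts given on the slides. The short check below uses the rounded figures (369M pages, 2.29M domains) together with the exact page total from slide 4, so it only reproduces the shares to the precision shown.

```python
# Counts taken from the slides; the "with data" figures are rounded.
pages_with_data = 369_000_000
total_pages = 3_005_626_093
domains_with_data = 2_290_000
total_domains = 40_600_000

page_share = 100 * pages_with_data / total_pages
domain_share = 100 * domains_with_data / total_domains

print(f"{page_share:.1f}% of pages, {domain_share:.2f}% of domains")
# prints "12.3% of pages, 5.64% of domains"
```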

  7. Results: Deployment by FORMAT * PLDs – pay-level domains (i.e. websites) * URLs – HTML pages

  8. Results: Deployment by POPULARITY * According to Alexa Internet Inc. (AL) list of the most frequently visited websites

  9. Results: Deployment by domains

  10. Results: Deployment on the same Website • 93.5% of all websites that contain structured data use only a single format
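A share like this can be computed once each website is mapped to the set of formats found on it. The sketch below assumes such a mapping already exists; `single_format_share` and the sample domains are hypothetical and carry no data from the study.

```python
def single_format_share(formats_by_site):
    """Percentage of sites with structured data that use exactly one format."""
    # Ignore sites on which no structured data was found.
    with_data = [formats for formats in formats_by_site.values() if formats]
    if not with_data:
        return 0.0
    single = sum(1 for formats in with_data if len(formats) == 1)
    return 100 * single / len(with_data)


# Made-up sample: four sites with structured data, three of them single-format.
sample = {
    "a.example": {"rdfa"},
    "b.example": {"microdata"},
    "c.example": {"microformats"},
    "d.example": {"rdfa", "microdata"},
    "e.example": set(),
}

print(single_format_share(sample))  # prints 75.0
```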

  11. Results: Deployment of RDFa • Alexa top 100 websites that use RDFa: • IMDB • Microsoft News Portal • BBC • [Table: most frequently used RDFa classes] • [Table: most frequently used properties co-occurring with all 4 of the most frequently used OGP classes]

  12. Results: Deployment of Microdata • Alexa top 100 websites that use Microdata: • eBay • Microsoft Corp. • Apple Inc. • [Table: most frequently used Microdata classes]

  13. Results: Deployment of Microformats • Alexa top 100 websites that use Microformats: • Wikipedia • Adobe • Taobao marketplace • [Table: most frequently used Microformats classes]

  14. Results: Topical Domains • Dominant Domains of the published data: • Persons and Organizations (by all 3 formats) • Blog- and CMS-related metadata (by RDFa and Microdata) • Navigational metadata (by RDFa and Microdata) • Product data (by all 3 formats) • Event data (by Microformats)

  15. Results: Structural Richness • Only a small set of generic properties is used to describe entities: • Instances of the OGP class “Product” are described by title, url, site_name, and description in most cases • Instances of the schema.org class “Product” are described largely only by name and description • Additional extraction techniques have to be employed for deeper understanding
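One simple way to quantify structural richness is to count, for one class, the fraction of entities that use each property. The sketch below is an illustration under that assumption, not the RapidMiner analysis from the study; `property_usage` and the sample entities are hypothetical.

```python
from collections import Counter


def property_usage(entities):
    """Map each property to the fraction of entities that use it.

    `entities` is an iterable of property-name collections, one per entity.
    """
    entities = [set(props) for props in entities]
    if not entities:
        return {}
    counts = Counter(prop for props in entities for prop in props)
    return {prop: count / len(entities) for prop, count in counts.items()}


# Made-up "Product" entities: name is always present, price only rarely.
products = [
    {"name", "description"},
    {"name", "description"},
    {"name"},
    {"name", "description", "price"},
]

usage = property_usage(products)
for prop in sorted(usage, key=usage.get, reverse=True):
    print(f"{prop}: {usage[prop]:.2f}")
```

A distribution dominated by a few generic properties, as in this toy sample, is the pattern the slide describes.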

  16. Sources • Christian Bizer, Kai Eckert, Robert Meusel, Hannes Mühleisen, Michael Schuhmacher, and Johanna Völker (2013). Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis. Retrieved from: http://hannes.muehleisen.org/Bizer-etal-DeploymentRDFaMicrodataMicroformats-ISWC-InUse-2013.pdf

  17. Thank you for your attention! Questions?
