Data on the (Semantic) Web
Agenda (75 min) • Data on the Web • Extracting data • Publishing data • Linked Data • Metadata in HTML • SPARQL endpoints • Crawling and extraction • Indexing RDF data • Database-style indexing • IR-style indexing
IR view of the Web • Web accessible resources • Documents (typically HTML) • Multimedia • Search engines index natural language (NL) text • Most of the structure in HTML is discarded • Multimedia is indexed by surrounding text • Additional information on web graph, usage • See Manning, Raghavan, Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
Data on the Web • Most pages on the Web are generated from structured data • Data is stored in relational databases (typically) • Queried through web forms • Presented as tables or simply as unstructured text • The structure and semantics (meaning) of the data are not directly accessible to search engines • Two solutions • Extraction using Information Extraction (IE) techniques (implicit metadata) • Relying on publishers to expose structured data using standard Semantic Web formats (explicit metadata)
Information Extraction methods • Named Entity Recognition (NER) and disambiguation • OpenCalais, Zemanta • Extraction of triples • TextRunner, NELL • Suchanek et al. YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia. WWW 2007. • Wu and Weld. Autonomously Semantifying Wikipedia. CIKM 2007. • Filling web forms automatically (form-filling) • Madhavan et al. Google's Deep-Web Crawl. VLDB 2008. • Extraction from HTML tables • Cafarella et al. WebTables: Exploring the Power of Tables on the Web. VLDB 2008. • Wrapper induction • Kushmerick et al. Wrapper Induction for Information Extraction. IJCAI 1997.
Information Extraction • A tale of many trade-offs • The less training data (or none), the lower the quality • The more complex the model to learn, the more training data is needed • The deeper the analysis, the slower the processing • The more narrowly trained, the more likely to break • Populating a Knowledge Base is easier than ad-hoc extraction • However, a complete and correct semantic representation of the content may not be needed for all tasks
Publishing data on the Web • Pre-Semantic Web technologies have been inadequate • Existing formats are not appropriate for serendipitous reuse • HTML: structure is lost due to a mix of presentation and content • XML: captures structure, but not semantics • Lack of protocols to talk to databases over the Web • Motivation has been lacking • Publishers are interested to the extent that they benefit from sharing data, e.g. because it drives traffic back to their site
What the Semantic Web provides • Data format: RDF • Designed for object-relationship data • Identification of objects by URIs • Multiple serializations: RDF/XML, Turtle, N3, N-Triples, TriX, etc. • Schema language: OWL • Description Logic based • Extensible using rule languages such as RIF • Query language and protocol: SPARQL • The principles of Linked Data
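As a minimal sketch of the RDF data model named on this slide, a graph can be treated as a set of (subject, predicate, object) triples whose identifiers are URIs. The example.org URIs, the `Literal` marker class, and the `to_ntriples` helper are hypothetical names introduced for illustration; only the FOAF namespace is a real vocabulary. The serializer covers plain literals only (no datatypes, language tags, or blank nodes).

```python
# A tiny RDF graph as a set of (subject, predicate, object) triples.
# The example.org URIs are hypothetical identifiers.
FOAF = "http://xmlns.com/foaf/0.1/"
EX = "http://example.org/people/"


class Literal(str):
    """Marks an object value as a plain literal rather than a URI."""


triples = {
    (EX + "alice", FOAF + "name", Literal("Alice")),
    (EX + "alice", FOAF + "knows", EX + "bob"),
    (EX + "bob", FOAF + "name", Literal("Bob")),
}


def to_ntriples(graph):
    """Serialize the graph as N-Triples, one statement per line."""
    lines = []
    for s, p, o in sorted(graph):
        if isinstance(o, Literal):
            # Escape backslashes and quotes inside the literal.
            escaped = o.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        else:
            obj = f"<{o}>"
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


print(to_ntriples(triples))
```

The same three statements could equally be written in Turtle or RDF/XML; the triples are the data, and the serialization is only a syntax for exchanging them.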
Methods for publishing RDF data • Multiple ways of publishing RDF data • SPARQL endpoints • Linked Data • Metadata in HTML documents • Data feeds • GRDDL • Automated tools • Each requires different treatment in crawling and extraction
SPARQL endpoints • SPARQL is a standard query language and protocol for accessing RDF stores via HTTP • Also possible to expose a traditional RDBMS via a wrapper • Advantages: • Most flexible and best performing access from a consumer perspective • Disadvantages: • Higher maintenance • Discovery is problematic • Tools: • Triple stores (Oracle, Virtuoso, Sesame, Jena, OWLIM etc.) • RDB-to-RDF mappers such as D2RQ and Triplify • SPARQL query builders
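To illustrate what "accessing RDF stores via HTTP" means in practice, here is a rough sketch of how a client could build a SPARQL protocol request: the query text travels as the `query` parameter of a GET request. The endpoint URL and the `build_sparql_request` helper are assumptions made for this example, not a real service or library function, and nothing is sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint URL

QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person ?name
WHERE { ?person foaf:name ?name }
LIMIT 10
"""


def build_sparql_request(endpoint, query):
    """Build an HTTP GET request that carries the query as a URL parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for SELECT results in the SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)
# Against a real endpoint, urllib.request.urlopen(request) would send it.
```

A crawler cannot discover such an endpoint by following links, which is one reason the slide lists discovery as a disadvantage.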
Linked Data • A web of interlinked RDF documents • Each document describes the characteristics of a single object, and links to related objects • Most important: links to the same object in different data sets (sameAs) • Guidelines for proper configuration of web servers to serve such documents • Rapidly growing community • Focus on public datasets (government, scientific) • see linkeddata.org
Linked Data • Advantages: • No change to the publishing of the HTML documents • Data can be published by a third party (e.g. DBpedia) • Disadvantages: • Web servers need to be configured to properly handle URIs that identify concepts instead of documents • Search engines need to be extended to crawl linked data • Data is not always linked to documents • Tools • Linked Data browsers (Tabulator, Marbles etc.) • RDB-to-RDF mappers (D2RQ, Triplify)
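The server configuration mentioned above usually comes down to content negotiation: a URI that identifies a concept answers with a 303 redirect to a document about that concept, chosen by the client's Accept header. The sketch below assumes a hypothetical URI layout (`/resource/X` for the thing, `/data/X` for its RDF description, `/page/X` for its HTML page), and `negotiate` is an illustrative helper that ignores q-values.

```python
# Hypothetical URI layout: /resource/X identifies the thing itself,
# /data/X is its RDF description, /page/X is its HTML page.
RDF_TYPES = ("application/rdf+xml", "text/turtle")
HTML_TYPES = ("text/html", "application/xhtml+xml")


def negotiate(path, accept_header):
    """Return (status, location) for a request to a concept URI."""
    prefix = "/resource/"
    if not path.startswith(prefix):
        return 404, None
    name = path[len(prefix):]
    # Simplified Accept handling: parameters are dropped, first match wins.
    wanted = [part.split(";")[0].strip() for part in accept_header.split(",")]
    for media_type in wanted:
        if media_type in RDF_TYPES:
            return 303, "/data/" + name
        if media_type in HTML_TYPES:
            return 303, "/page/" + name
    return 303, "/page/" + name  # default to the human-readable page


print(negotiate("/resource/Berlin", "application/rdf+xml"))
print(negotiate("/resource/Berlin", "text/html,application/xhtml+xml"))
```

A crawler that wants the data rather than the page therefore has to send an RDF media type in its Accept header, which is part of why search engines need to be extended to crawl linked data.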
Metadata in HTML • Microformats, RDFa, Microdata • Advantages: • Data and document are always in sync • Browser plug-in friendly • Search engine friendly • Copy-paste friendly • Tools: • XML editors (e.g. Oxygen) • RDFa Distiller • RDFa bookmarklet, Ubiquity RDFa plugin • Optimus microformat parser • Examples: many, including SlideShare, YouTube, LinkedIn, Digg, Myspace, Facebook…
Microformats (μf) • Agreements on the way to encode certain kinds of data in HTML • Reuse of semantic-bearing HTML elements • Based on existing standards • Minimality: designed to solve particular problems • Microformats exist for a limited set of objects • hCard, hResume, hProduct, hRecipe • Varying degrees of support and stability • hCard and rel-tag are widely supported • Community centered around microformats.org • Specifications and discussions are hosted there
Example: the hCard microformat
<div class="vcard">
  <a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a>
  <div class="tel">+1-919-555-7878</div>
  <div class="title">Area Administrator, Assistant</div>
</div>
<cite class="vcard"><a class="fn url" rel="friend colleague met" href="http://meyerweb.com/">Eric Meyer</a></cite>
wrote a post (<cite><a href="http://meyerweb.com/eric/thoughts/2005/12/16/tax-relief/">Tax Relief</a></cite>)
about an unintentionally humorous letter he received from the
<span class="vcard"><a class="fn org url" href="http://irs.gov/">Internal Revenue Service</a></span>.
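A consumer of this markup only needs to look at class names. The following is a rough sketch, not a conforming hCard parser: `HCardParser` is a hypothetical class built on Python's standard `html.parser`, it reads only a handful of properties, and it assumes well-nested markup. The sample input repeats two of the cards from the slide.

```python
from html.parser import HTMLParser


class HCardParser(HTMLParser):
    """Collect a few hCard properties from elements with class="vcard"."""

    TEXT_PROPERTIES = {"fn", "tel", "title", "org"}
    VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.cards = []   # finished cards, one dict each
        self._card = None  # card currently being filled
        self._depth = 0    # nesting depth inside the current vcard
        self._open = []    # (depth, property names) of elements being read

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return  # void elements have no end tag, so skip depth tracking
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        if self._card is None:
            if "vcard" not in classes:
                return
            self._card = {}
            self._depth = 0
        self._depth += 1
        href = attrs.get("href") or ""
        if "url" in classes and href:
            self._card["url"] = href
        if "email" in classes and href.startswith("mailto:"):
            self._card["email"] = href[len("mailto:"):]
        names = classes & self.TEXT_PROPERTIES
        if names:
            self._open.append((self._depth, names))
            for name in names:
                self._card.setdefault(name, "")

    def handle_data(self, data):
        for _, names in self._open:
            for name in names:
                self._card[name] += data

    def handle_endtag(self, tag):
        if self._card is None or tag in self.VOID_TAGS:
            return
        if self._open and self._open[-1][0] == self._depth:
            self._open.pop()
        self._depth -= 1
        if self._depth == 0:  # the vcard element itself has closed
            self.cards.append(
                {k: " ".join(v.split()) for k, v in self._card.items()}
            )
            self._card = None


HTML = """
<div class="vcard">
  <a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a>
  <div class="tel">+1-919-555-7878</div>
  <div class="title">Area Administrator, Assistant</div>
</div>
<span class="vcard"><a class="fn org url" href="http://irs.gov/">
  Internal Revenue Service</a></span>
"""

parser = HCardParser()
parser.feed(HTML)
for card in parser.cards:
    print(card)
```

Note that the property names are hard-coded into the parser: each microformat needs its own extraction logic, since there is no shared syntax.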
Microformats: limitations • No shared syntax • Each microformat has a separate syntax tailored to the vocabulary • No formal schemas • Limited reuse, extensibility of schemas • Unclear which combinations are allowed • No datatypes • No namespaces, unique identifiers (URIs) • no interlinking • mapping between instances is required
RDFa • W3C standard for embedding RDF data in HTML documents • A set of new HTML attributes • Despite the extension of HTML, RDFa does not require XHTML • A specification of how to extract the data from these attributes • RDFa can be used to embed data in HTML headers or to annotate parts of the body of HTML documents • RDFa is just a syntax; you have to choose a vocabulary separately
Differences in usage • Microformats are the first choice for most publishers because they are simple • If you find none that perfectly fits your needs, then you need RDFa • Microformats have a fixed schema: you cannot add your own attributes • Example: a social networking site with user profiles • vCard is a good candidate, but it has no way to express, for example, the user’s social connections • You either live without this, or go with RDFa
Example: Facebook’s Open Graph Protocol • Open Graph Protocol • RDF vocabulary to be used in conjunction with RDFa • Simplify the work of developers by restricting the freedom in RDFa • Activities, Businesses, Groups, Organizations, People, Places, Products and Entertainment • Only HTML <head> accepted • http://opengraphprotocol.org/ • Facebook as consumer • Facebook indexes OGP data whenever someone ‘likes’ a page with OGP data • Social recommendation (‘like’ button) provides publishers with a way to promote their content on Facebook • Shows up in profiles and news feed, the user is subscribing to a channel of future feeds from the web page they liked • Facebook Graph API allows 3rd party developers to access the data • http://developers.facebook.com/docs/api
Example: Facebook’s Open Graph Protocol
<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
  <title>The Rock (1996)</title>
  <meta property="og:title" content="The Rock" />
  <meta property="og:type" content="movie" />
  <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
  <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" />
  …
</head>
...
</html>
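Because OGP restricts itself to `<meta>` elements in the `<head>`, a consumer can read it with very little code. The sketch below uses Python's standard `html.parser`; `OpenGraphParser` is a hypothetical class name, and it matches the literal `og:` prefix instead of resolving the `xmlns:og` declaration, which a full RDFa processor would do. The sample input is the markup from the slide without the elided parts.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect og:* properties from <meta> elements inside <head>."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._in_head = False

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self._in_head = True
        elif tag == "meta" and self._in_head:
            attrs = dict(attrs)
            prop = attrs.get("property") or ""
            content = attrs.get("content")
            # Assumption: the prefix is literally "og:".
            if prop.startswith("og:") and content is not None:
                self.properties[prop[len("og:"):]] = content

    def handle_endtag(self, tag):
        if tag == "head":
            self._in_head = False


HTML = """
<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
  <title>The Rock (1996)</title>
  <meta property="og:title" content="The Rock" />
  <meta property="og:type" content="movie" />
  <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
  <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" />
</head>
<body></body>
</html>
"""

parser = OpenGraphParser()
parser.feed(HTML)
for name, value in parser.properties.items():
    print(name, "=", value)
```

The simplicity on the consumer side mirrors the design goal stated on the previous slide: restricting the freedom of RDFa to simplify the work of developers.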
Microdata • HTML5 is currently under standardization at the W3C • Introduces Microdata • Similar to microformats • Some predefined vocabularies with central registration • Some of the flexibility of RDFa • Introduce new terms using reverse domain names or full URIs • Semantic HTML elements such as <time>, <video>, <article>…
Microdata example
<div itemscope itemid="http://www.yahoo.com/resource/person">
  <p>My name is <span itemprop="name">Neil</span>.</p>
  <p>My band is called <span itemprop="band">Four Parts Water</span>.
  I was born on <time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>.
  <img itemprop="image" src="me.png" alt="me"></p>
</div>
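Reading Microdata differs from reading microformats in one detail worth showing: the value of a property is taken from an attribute for some elements (`datetime` on `<time>`, `src` on `<img>`) and from the text content otherwise. The sketch below handles a single flat item only; `MicrodataParser` is a hypothetical class, the `"@id"` key is a convention chosen for this example, and nested items and same-tag nesting are not supported.

```python
from html.parser import HTMLParser


class MicrodataParser(HTMLParser):
    """Read the itemprop values of one flat Microdata item."""

    # Elements whose property value comes from an attribute, not from text.
    VALUE_ATTRS = {"time": "datetime", "img": "src", "a": "href", "meta": "content"}

    def __init__(self):
        super().__init__()
        self.item = {}
        self._current = None  # (tag, property name) whose text is being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemid" in attrs:
            self.item["@id"] = attrs["itemid"]  # "@id" is our own key name
        name = attrs.get("itemprop")
        if not name:
            return
        value_attr = self.VALUE_ATTRS.get(tag)
        if value_attr and attrs.get(value_attr) is not None:
            self.item[name] = attrs[value_attr]
        else:
            self._current = (tag, name)
            self.item[name] = ""

    def handle_data(self, data):
        if self._current:
            self.item[self._current[1]] += data

    def handle_endtag(self, tag):
        if self._current and self._current[0] == tag:
            name = self._current[1]
            self.item[name] = " ".join(self.item[name].split())
            self._current = None


HTML = """
<div itemscope itemid="http://www.yahoo.com/resource/person">
  <p>My name is <span itemprop="name">Neil</span>.</p>
  <p>My band is called <span itemprop="band">Four Parts Water</span>.
  I was born on <time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>.
  <img itemprop="image" src="me.png" alt="me"></p>
</div>
"""

parser = MicrodataParser()
parser.feed(HTML)
print(parser.item)
```

Unlike the class-name conventions of microformats, the same generic extraction logic works for any vocabulary, since property names are carried by the `itemprop` attribute.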
The state of metadata in HTML • 5-10% of webpages contain some explicit metadata • Depending on how you count… • Too many competing approaches • Too many formats: microformats vs. RDFa vs. Microdata • Too many schemas: publishers may need to use multiple different vocabularies or microformats to satisfy everyone