Introduction to RDF, Jena, SparQL, and the “Semantic Web” Michael Grobe

Introduction to RDF, Jena, SparQL, and the “Semantic Web” Michael Grobe Pervasive Technology Institute Indiana University October 12, 2009

This presentation in perspective This is actually one of a series of presentations on Linked Data Web and semantic technologies: Introduction to ontologies This on RDF, Jena, SparQL, and the “Semantic Web” Using inference and OWL In general, these Semantic technology topics seem “deceptively simple,” but are fraught with complications, limitations, and qualifications…especially when the casual user attempts to compare them with relational data approaches to the same or similar problems.

Topics Simple introduction to the semantic approach - sentences as triples and graphs - sentence components encoded using URIs - serializing sentences using the Resource Description Format (RDF) - storing semantically encoded data in triplestores - browsing information encoded in RDF Accessing and querying semantic data - Introduction to SparQL - Free-standing query clients: Twinkle, RDF-gravity, Explorator - Jena: software for manipulating triples Preeminent semantic resources - DBpedia - Bio2RDF “semantic web atlas of postgenomic knowledge” - Queries using Virtuoso SparQL and iSparQL endpoints Ontologies: what are they and how are they used? Discussion of the semantic approach

From raw data to sentences Here is some information that might be useful to you: Smith 21 Smith Jones Do you get it? Would it help to see the data tables? Perhaps you could guess what I’m trying to say if you look at column names. What’s missing here: the “relationships” between the separate pieces of “data”. In natural languages these relationships are established by using “predicates” to form sentences that connect these components, . . . as in the sentences on the next slide:

Sentences . . . some information in sentence form: Smith has age 21. Jones has age 45. Blake has age 12. George has age 21. Smith has favorite friend Jones. Jones has favorite friend Smith. Blake has favorite friend Blake. George has favorite friend Smith. where each sentence has the form: Subject Predicate Object also known as Entity Property Value and these elements are known together as a “triple”.

A “Sentence base” We can put these triples into one or more files to build a “sentence base” to hold these sentences. To help with manipulation and searching, each grammatical component is stored and accessed separately, so that each sentence retains its triple form: SubjectPredicateObject Smith has age 21 Jones has age 45 Blake has age 12 George has age 21 Smith has favorite friend Jones Jones has favorite friend Smith Blake has favorite friend Blake George has favorite friend Smith

Query sentences We can query such information with queries like: “Someone has friend Smith?” where “Someone” acts like a “variable” and “resolves” as the list: Jones George because the pattern “Someone has friend Smith” matches both triples: Jones has favorite friend Smith George has favorite friend Smith

Query sentences We can interpret a more complicated query like: "Someone has favorite friend Smith and has age 21?” as a pair of requirements: "Someone has favorite friend Smith?” and "Someone has age 21?“ where we mean “thatsamesomeone” has both characteristics . . . in which case Someone will resolve as "George“, since George is the only “Someone” who satisfies both requirements via the following triples: George has age 21 George has favorite friend Smith Note that in both example we have used triple “patterns” to query the triple store

Using graphs used to represent sentences If we want to complicate things, we can also represent the same information in “graph form” as with these 2 graphs that represent the 2 kinds of information in the collection of sentences: Graph #1: Person ages Graph #2: Favorite Friends Typically we don’t really want to complicate these issues, but the semantic web literature often “thinks” in graph terms and some applications display results as visual graphs.

Using graphs to represent sentences Here the 2 graphs are combined using named edges to represent 2 kinds of information associated with the same 4 persons. Graph #3: Person ages (:age) and favorite friends (:fav) Each arc represents the “predicate” of a sentence, connecting a “subject” with an “object”. (Note that a subject may have >= 0 arcs of each type.)

Using URIs and URLs to identify predicates and metadata! Now if it hadn’t already happened someone would come up with the idea to use URLs to point to Web documents that describe the “exact” meaning of each predicate, or “metadata”. For example, “http://CelebrityMagazine.com/fav” could contain a definition of “favorite friend”, and other documents would define “BFF”, “long-time-friend”, “family-friend”, “friends with benefits”, etc, And, in fact, these definitions could themselves refer to other definitions like some “superset” of relationships such as: http://CelebrityMagazine.com/personal_relationships or the personal_relationships file could include a collection of subset definitions that we might refer to like: http://CelebrityMagazine.com/personal_relationships#fav using the # convention for targeting a specific location within a URL. Note that this form of metadata is not the only useful form of metadata, but it is clearly integrated with the data in a unique fashion. The basic triplet structure of each sentence provides another (implicit) form of metadata.

The sentences as a set of 8 triples (2 for each person) |-------------------------------------| | Subject | Predicate | Object | ======================================= | “Blake” | example:fav | “Blake” | | “Blake” | info:has_age | "12" | | “Jones” | example:fav | “Smith” | | “Jones” | info:has_age | "35" | | “George” | example:fav | “Smith” | | “George” | info:has_age | "21" | | “Smith” | example:fav | “Jones” | | “Smith” | info:has_age | "21" | --------------------------------------- Here the abbreviation “example:” stands for http://CelebrityMagazine.com/personal_relationships# and the abbreviation “info” stands for some imaginary web page that defines age, let’s say http://demographicstats.org/characteristics#”.

Representing sentence components using URIs To specify exactly which person named “Blake”, “Smith”, etc. we are referring to, we can again use URIs. ------------------------------------------------------------------------------ | Subject | Predicate | Object | =============================================================================== | <http://fake.host.edu/blake> | example:fav | <http://fake.host.edu/blake> | | <http://fake.host.edu/blake> | info:has_age | "12" | | <http://fake.host.edu/jones> | example:fav | <http://fake.host.edu/smith> | | <http://fake.host.edu/jones> | info:has_age | "35" | | <http://fake.host.edu/george> | example:fav | <http://fake.host.edu/smith> | | <http://fake.host.edu/george> | info:has_age | "21" | | <http://fake.host.edu/smith> | example:fav | <http://fake.host.edu/jones> | | <http://fake.host.edu/smith> | info:has_age | "21" | ------------------------------------------------------------------------------- Here the abbreviation “example:” stands for http://CelebrityMagazine.com/personal_relationships# and the abbreviation “info” stands for some imaginary web page that defines age, let’s say http://demographicstats.org/characteristics#”.

Triplestore summary and outrageous claims Sentences are composed of subject, predicate, object “triples”. Subjects and predicates are specified as URIs that may be dereferenceable, and predicate URLs may provide metadata describing the meaning of the predicate. A collection of triples can be represented as a “graph”, and may be known as a “graph.” Sentences are stored in “triplestores” or “quad stores” (when they are members of identifiable graphs whose names give the 4th component). Triples will contain URIs that: - identify and/or name “resources”: subjects and/or objects, and - serve to identify and/or reference predicate definitions, and object data types (as in “25”^^xsd:int), and One way to think about this, is that triplestores do NOT contain “data”, but rather “sentences”, “information”, “assertions” (not necessarily true or correct assertions), “units of thought” (Mons), or maybe “little chunks o’ meaning”. One might also say that the semantic approach transcends the data/meta-data dichotomy because the triple format provides implicit metadata, and because predicates can link to their definitions.

Triples may be serialized in various forms There are several ways to convert such triples into a serialized, or text-based, form. Here is the simplest. It is the N3 (for Notation 3) form of a standard known as “Turtle” (for “Terse RDF Triple Language”), with each line holding 3 URIs, and ending with a “.” @prefix example <http://CelebrityMagazine.com/personal_relationships#> . @prefix info <http://demographicstats.org/characteristics#> . <http://fake.host.edu/blake> example:fav <http://fake.host.edu/blake> . <http://fake.host.edu/blake> info:has_age "12" . <http://fake.host.edu/jones> example:fav <http://fake.host.edu/smith> . <http://fake.host.edu/jones> info:has_age "35" . <http://fake.host.edu/george> example:fav <http://fake.host.edu/smith> . <http://fake.host.edu/george> info:has_age "21" . <http://fake.host.edu/smith> example:fav <http://fake.host.edu/jones> . <http://fake.host.edu/smith> info:has_age "21" .

Triples may be serialized in various forms: Another serialization format is the standard Resource Description Format (RDF), which is used in this encoding of the Smith information (with non-dereferenceable URIs): <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:example="http://fake.host.edu/example-schema#"> <example:Person rdf:about=“http://fake.host.edu/smith”> <example:name>Smith</example:name> <example:age>21</example:has_age> <example:fav rdf:resource=“http://fake.host.edu/jones”/> </example:Person> </rdf:RDF> Note: There exist other, “standard” schemas for encoding personal information, such as the Friend of a Friend (FOAF) schema.

Dereferenceable URI version of the Smith RDF triple Here is the same information encoded with “dereferenceable” URIs, URIs that can actually be accessed and from which content can be downloaded: <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:example="http://fake.host.edu/example-schema#"> <example:Person rdf:about=“http://discern.uits.iu.edu:8421/smith”> <example:name>Smith</example:name> <example:age>21</example:has_age> <example:fav rdf:resource=“http://discern.uits.iu.edu:8421/jones”/> </example:Person> </rdf:RDF>

Browsing RDF documents Here is a view of the Smith RDF file from within Firefox using the Tabulator plug-in: You can click on the jones.rdf link to see the Jones record, and browse from there, or choose the Person link to examine its definition (if its dereferenceable).

The “Semantic Web” In general, if URIs are dereferenceable they can link into a “Gigantic Global Graph”, usually know as the “Linked Data Web” or the “Semantic Web.” “If HTML and the Web make all online documents look like one huge book, RDF, schema, and inference languages will make all the data (sic) in the world look like on huge database.” --TimBL

Documents in RDF format may be interrogated: - by physical inspection (for anyone willing to read XML) - by using an RDF browser (like the Tabulator plug-in, etc.) - by writing programs (in Jena, for example) that read RDF files, construct the represented graphs internally, and then - access graph triples in sequential order, - select triples according to specified content, and/or - apply SparQL queries and access results in sequential order - using command-line tools that apply SparQL queries, and/or - using GUI interfaces accepting SparQL queries - written in text, or - represented graphically - using SparQL endpoints that accept queries embedded in URLs

A SparQL example If http://discern.uits.iu.edu:8421/all-persons.rdf contains all the triples listed earlier, then this SparQL query should find all the triples related to “smith”: select $p $o from <http://fake.host.edu:8421/all-persons.rdf> where { <http://discern.uits.iu.edu:8421/smith.rdf> $p $o . } Intuitively, this query asks “Smith has what relationship(s) to whom/what?” and should identify these 2 value pairs: <http://fake.host.edu/example-schema#fav> <http://discern.uits.iu.edu:8421/jones.rdf> <http://fake.host.edu/example-schema#age> "21” $p, $o are variable names that were each assigned a value as the query was “satisified.” Variable names may also start with “?”.

Another SparQL example If http://discern.uits.iu.edu:8421/all-persons.rdf contains all the triples listed earlier, then this SparQL query simply asks for a list of all those triple values: select * from <http://discern.uits.iu.ed:8421/all-persons.rdf> where { $sub $pred $obj . } Intutitively, this query asks “Who has what relationship to whom?” $sub, $pred, and $obj will each be assigned one or more values as the query is satisified and all three will be printed (*). (Note that “$sub $pred $obj .” is a triple pattern in the Turtle/N3 format.)

Results of the single (unified) file SparQL query -------------------------------------------------------------------------- | sub | pred | obj | ========================================================================== | http://...8421/blake.rdf | example:fav | http://...8421/blake.rdf | | http://...8421/blake.rdf | example:has_age | "12" | | http://...8421/jones.rdf | example:fav | http://...8421/smith.rdf | | http://...8421/jones.rdf | example:has_age | "35" | | http://...8421/george.rdf | example:fav | http://...8421/smith.rdf | | http://...8421/george.rdf | example:has_age | "21" | | http://...8421/smith.rdf | example:fav | http://...8421/jones.rdf | | http://...8421/smith.rdf | example:has_age | "21" | -------------------------------------------------------------------------- where “…” indicates “discern.uits.iu.edu:”.

A “distributed” SparQL query against 4 separate RDF files The next query searches 4 dereferenceable files holding the same data broken into 4 files, one for each subject: select * from <http://discern.uits.iu.edu:8421/smith.rdf> from <http://discern.uits.iu.edu:8421/jones.rdf> from <http://discern.uits.iu.edu:8421/george.rdf> from <http://discern.uits.iu.edu:8421/blake.rdf> where { $sub $pred $obj . } The results of this query will be the same as the results for the single file query (though order my vary due to remote URL access latency).

Use SparQL to find the predicates This SparQL example query simply asks for a list of all the unique predicates that occur in all the triples: select distinct $p from <http://discern...8421/friend-network.rdf> where { $s $p $o . } If you don’t use “distinct” you will get multiple occurrences of the same predicate. This can be very useful when you are trying to figure out what predicates are available to interrogate a triplestore that you don’t know much about.

SparQL (incomplete) basic syntax : SELECT some_variable_list FROM <some_RDF_source_URI> WHERE { { some_n3_triple_pattern . another n3_triple_pattern . } Notes: - the “<“ and “>” characters are required. - other commands in place of SELECT are: CONSTRUCT, ASK and DESCRIBE. - * is a valid variable list, specifying any variable included in a triple pattern, and may be preceded by DISTINCT, which will prevent duplicate triples. - there may be multiple FROM clauses, whose targets will be combined and treated as a single store. - a “.” separating multiple triple patterns is intuitively similar to a natural language “and”, but actually behaves like an SQL natural join. - the term WHERE is optional, and may be omitted. SparQL reference: http://www.dajobe.org/2005/04-sparql/SPARQLreference-1.8.pdf

Optional clauses in SparQL queries Clauses permitted within a “where” clause: optional { triple_pattern }: identifies a triple that need not appear in an RDF target but whose absence will not prohibit a pattern match. filter: restricts variable matches in the preceding triple to specified filter patterns, as in: { $s $p $date FILTER ( $date > "2005-01-01T00:00:00Z"^^xsd:dateTime ) } or { $s $p $d FILTER ( xsd:dateTime( $d ) < xsd:dateTime( "2005-01-01T00:00:00Z“ ) ) } or { ?s ?p ?name FILTER regex( ?name, "^smi", “some_flag“ ) } union: “where” clauses may be constructed as { triple_pattern_1 } UNION { triple_pattern_2 } and any RDF element matching either of these triples will be included in the resulting output. Clauses permitted following the “where” clause: order by [DESC|ASC| ] ( variable_list ) limit n: print up to n return values. offset n: start output with the nth return value. group by: implemented by some SparQL implementations.

Some useful SparQL pattern patterns Display two property values of some entity (<some_URI>) on the same line: select * where { <some_URI> <some_predicate> ?o . <the_same_URI> <some_other_predicate> ?o1 . } Example using the friend information and PREFIX statements: PREFIX example: <http://CelebrityMagazine.com/personal_relationships#> PREFIX info: <http://demographicstats.org/characteristics#> select * where { <http://fake.host.edu/smith> example:fav ?favorite . <http://fake.host.edu/smith> info:has_age ?age . }

Some more useful SparQL pattern patterns Merge results of 2 pattern matches into a single output column: select * where { { <some_URI> <some_predicate> ?o . } UNION { <some_other_URI> <some_other_predicate> ?o . } } Example: PREFIX example: <http://CelebrityMagazine.com/personal_relationships#> PREFIX info: <http://demographicstats.org/characteristics#> select * where { { <http://fake.host.edu/smith> example:fav ?values .} UNION { <http://fake.host.edu/smith> info:has_age ?values . } }

Some more useful SparQL pattern patterns Slowly find all triples whose object components mention “hexokinase”: select * where { ?s ?p ?o . FILTER regex( $o, "hexokinase" ) . } Quickly find all entries with object components mentioning hexokinase, but works only through a Virtuoso SparQL endpoint when applied to indexed graphs (and will return nothing when applied to a non-indexed graph): select * where { ?s1 ?p1 ?o1 . ?o1 bif:contains "hexokinase" . }

SparQL desktop client: Twinkle (version of the upward paths query)

SparQL desktop client: RDF-gravity (using the friend data)

SparQL desktop client: Explorator RDF explorer The Explorator can download (extracts from) multiple RDF resources, and manipulate them in combination. Here with the Russian lakes example. This approach provides an interface using a set algebra model of data manipulation. (See Araujo, et al. and http://139.82.71.60:3000/explorator)

Jena The Java-based Jena package from HP Labs allows users to manipulate and query RDF graphs. You can write a program that uses Jena classes to - retrieve and parse an RDF file containing a graph or a collection of graphs, - store it in memory, - examine each triple in turn, examine one component (say, the subject) of each triple in turn, or examine only triples that meet specified criteria, and, - write a serialized version of a graph to a file or STDOT. For example, one might examine each stored triple searching for a specific reference URI, or for a specific literal value, as with a search for triples containing a specific value, “21”^^xsd:age, in their object portions. An RDF graph is stored in Jena as a “model”, and a Jena model is created by a factory, as in: Model m = ModelFactory.createDefaultModel(); Once a model has been defined, Jena can populate it by reading data from files, backend data bases, etc. in various formats, and once it has been populated, Jena can perform set operations on pairs of populated models and/or search models for specific values or combinations (patterns) of values.

Jena For example, there are several methods for creating iterators over a model to access specific components. Iterators may be built by - listing the components of each triple: - model.listSubjects(); - model.listObjects(); - comparing a specific component with a specified value, as in: model.listSubjectsWithProperty( Prop p, RDFNode object ); which will get you a collection of subjects possessingproperty/predicate p and specific value object ) - comparing all components against specific values in 2 steps: - construct a “selector” possessing specific values s, p and o: Selector selector = new SimpleSelector( subject, predicate, object ) - and then build the statement list: model.listStatements( selector );

Preeminent Linked Data resources: The DBpedia and Bio2RDF The “DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web” (http://dbpedia.org/About) DBpedia currently holds over 200 million triples, harvested by scraping DBpedia Infoboxes included within the Wikipedia. The DBpedia is currently housed in a OpenLink Virtuoso Universal Database, which can store relational, object, XML, and semantic information. Details at: http://dbpedia.org/About

Bio2RDF: “Atlas of postgenomic knowledge” Bio2RDF integrates (extracts from) some 40 biomedical information resources (such as GO, Uniprot, etc.) recoded in RDF (>2 Gtriples): - currently runs over the Virtuoso Universal Database server at http://atlas.bio2rdf.org but each resource has its own SparQL endpoint, in addition to the endpoint accessing the unified triplestore: http://atlas.bio2rdf.org/sparql - a list of included resources is at (http://www.freebase.com/view/user/bio2rdf/public/sparql) and includes links to the SparQL endpoint for each resource, as well as descriptions of the resource contents and triple counts. - raw text N3 formats for this data use around 1 TB, but install in much less space within Virtuoso (perhaps 100 GB). - there is also a Bio2RDF proxy service that takes queries and relays them to multiple distributed servers (examples later).

Resources included in Bio2RDF (downloadable from http://quebec.bio2rdf.org/download/n3/) GO KEGG OMIM HGNC PUbMed INOH GeneID IProClass UniProt MGI UniRef CellMap UniParc BioPAX Kegg Pathway InterPro CPATH Pfam Reactome PROSITE Biocyc Protein MeSH SID PDB CID CPD: Kegg Ligand for chemical compound PubChem GL: Kegg Ligand for carbohydrate structure UniSTS EC Homologene RN Kegg Ligand for chemical reaction DBpedia DR: Kegg Ligand for drugs OBO CheBI Taxonomy: NEWT Affymetrix PID Biocarta

Bio2RDF resources (Edge width is proportional to link density.)

SparQL endpoints Triplestores like the Virtuoso Universal Database Server publish “SparQL endpoints” that will take SparQL queries through several interfaces. For example you can query the DBpedia through a Virtuoso SparQL endpoint at http://dbpedia.org/sparql by sending SparQL queries: - encoded in URLs addressed to the triplestore endpoint, like http://dbpedia.org/sparql?query=SELECT distinct * WHERE { $s $p $o . $o bif:contains “Goethe_Johann_Wolfgang” . } - entered into Web forms that present text areas into which one can enter queries, as on the next pages

The SparQL interface to DBpedia

The iSparQL Advanced interface to DBpedia

The iSparQL QBE interface to DBpedia (close up) Here is the same query in graphical form as constructed using the iSparql QBE interface:

The iSparQL QBE interface to DBpedia

Results from the iSparql text and/or QBE queries

Using SparQL to get RDF extracts Suppose you want to build a local RDF triplestore from DBpedia containing only the Goethe entries, or import these entries into some other desktop client like the Explorator. Documents returned by SparQL select queries are usually not RDF documents. They may not have triples, and they are usually structured for display or storage in HTML, Excel or some other format. You can use the CONSTRUCT command (in place of SELECT) within a SparQL query to build a proper RDF formatted response: construct { <http://dbpedia.org/resource/Johann_Wolfgang_von_Goethe> $p $o } where { <http://dbpedia.org/resource/Johann_Wolfgang_von_Goethe> $p $o . } The structure of the triple to be created is specified in the “construct” clause. Note that construct queries like these can be embedded in URLs to SparQL endpoints.

Ontologies The term “ontology” is used in different ways by different people. Pidcock writes that “People use the word to mean different things, e.g.: glossaries and data dictionaries, thesauri and taxonomies, schema and data models, and formal ontologies and inference.” And Uschold writes “An ontology may take a variety of forms, but necessarily it will include a vocabulary of terms, and some specification of their meaning. . .This includes definitions and an indication of how concepts are inter-related which collectively impose a structure on the domain and constrain the possible interpretations of terms.”

The DBpedia ontology The DBpedia ontology is “shallow, cross-domain” ontology. In http://www4.wiwiss.fu-berlin.de/dbpedia/dev/ontology.html it appears as a tree with maximum depth of 4. The main level class is a “Thing”, and the first sublevel classes are: Person, Organization, Anatomical structure, Place, Species, etc. The next level persons are Scientist, College Coach, Monarch, Politician, etc. Some classes are also assigned “properties”. For example, a species may have Order and Family properties (even though an organism’s Order and Family could be inferred from its position in the (ontology that is the) evolutionary tree.

Wikipedia Infoboxes The DBpedia gets its information from the Wikipedia “Infoboxes”, such as this one for Johann Wolfgang von Goethe that appears on his Wikipedia page. Infobox contents are mapped to DBpedia ontology classes and properties, which are used as RDF predicates. Here the Goethe “resource” is: http://dbpedia.org/resource/ Johann_Wolfgang_von_Goethe and you know how to find all the predicates and objects by now?

The DBpedia ontology Here is a query to find all the “Places” known to DBpedia: select distinct * where { $s a <http://dbpedia.org/ontology/Place> } limit 1000 And a query to find every “person’s” birth info: select $s $o where { $s a <http://dbpedia.org/ontology/Person> . $s <http://dbpedia.org/property/birth> $o } limit 1000 where the predicate “a” is a short form of http://www.w3.org/1999/02/22-rdf-syntax-ns#type

Introduction to RDF, Jena, SparQL, and the “Semantic Web” Michael Grobe

Introduction to RDF, Jena, SparQL, and the “Semantic Web” Michael Grobe

Presentation Transcript

Introduction to Semantic Web

Introduction to Semantic Web Technologies Ivan Herman, W3C June 22 nd , 2010

Semantic Enhancement

Semantic Web Services

A year on the Semantic Web @ W3C (with more details on RDFa)

Semantic Search Engines – On the Way to Web 3.0

Semantic Inference for Question Answering

Feasting on Brains! From Web Services to Web 2.0 to the Semantic Web and back again…

Table of Contents

Table of Contents

Natural Language Processing COMPSCI 423/723 Rohit Kate

Tutorial – Syllabus – AAAI 2006 Tutorial Forum

Tutorial AAAI 2006 Tutorial Forum

Carnegie Mellon University

Querying the Web of Data with SPARQL and XSPARQL

Semantic Web Service Systems

Intelligent Systems

Semantic Web Services Tutorial

Web - Technologies

An Introduction to RDF and the Semantic Web