A Semantic Web Search and Metadata Engine

Roi Adadi, David Ben-David

Glossary

Example SWD (RDF/XML). The document as a whole is an SWD; each of its two classes and four properties is an SWT.

    <rdf:RDF>
      …
      <rdfs:Class rdf:ID="Department" />
      <rdfs:Class rdf:ID="Course" />
      <rdf:Property rdf:ID="name">
        <rdfs:domain>
          <owl:Class>
            <owl:unionOf rdf:parseType="Collection">
              <rdfs:Class rdf:about="#Department" />
              <rdfs:Class rdf:about="#Course" />
            </owl:unionOf>
          </owl:Class>
        </rdfs:domain>
        <rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Literal" />
      </rdf:Property>
      <rdf:Property rdf:ID="number">
        <rdfs:domain rdf:resource="#Course" />
        <rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Literal" />
      </rdf:Property>
      <rdf:Property rdf:ID="department">
        <rdfs:domain rdf:resource="#Course" />
        <rdfs:range rdf:resource="#Department" />
      </rdf:Property>
      <rdf:Property rdf:ID="creditPts">
        <rdfs:domain rdf:resource="#Course" />
        <rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Literal" />
      </rdf:Property>
      <Department rdf:ID="dept_cs">
        <name>Computer Science</name>
      </Department>
      <Course rdf:ID="cs236703">
        <name>Object Oriented Programming</name>
        <department rdf:resource="#dept_cs" />
        <creditPts>3.0</creditPts>
      </Course>
      …
    </rdf:RDF>

• Semantic Web Document (SWD)
  • A web page that serializes an RDF graph.
  • Uses one of the recommended RDF syntax languages, i.e. RDF/XML, N-TRIPLE or N3.
• Semantic Web Term (SWT)
  • An RDF resource that represents an instance of rdfs:Class or rdf:Property, and can be universally referenced by its URI reference (URIref).
• Semantic Web Ontology (SWO)
  • An SWD is considered to be an SWO when a significant proportion of the statements it makes define new SWTs.
• Semantic Web Database (SWDB)
  • An SWD that does not define or extend a significant number of terms.
  • Introduces individuals and makes assertions about them.
  • Makes assertions about individuals defined in other SWDs.

SWO example: FOAF
• Document: http://xmlns.com/foaf/spec/index.rdf
• Contains 12 classes and 51 properties (in 466 triples); no individuals.
• Highlighted terms: class Document, class Organization, property mbox.

SWDB example: FOAF description for Tim Finin
• Document: www.cs.umbc.edu/~finin//foaf.rdf
• Defines three individuals and makes statements about them; no classes or properties.
• Highlighted statements: name, nickname.

Motivation
• Current form of the Semantic Web
  • A web of Semantic Web Documents (SWDs).
• Navigating the Semantic Web is difficult
  • Paucity of explicit hyperlinks (beyond namespaces in URIrefs).
  • Relations such as rdfs:seeAlso and owl:imports are rare.
• There is a need for a search engine customized for SWDs
  • Find and analyze SWDs on the web.
  • Suggest a measure of an SWD's importance (ranking).

Who needs it?
• Semantic Web researchers
  • Search for SWTs and SWOs for publishing their knowledge.
• Software agents
  • Search SWDs for external knowledge.
  • Retrieve SWOs to fully understand SWTs.
• Example: find the most popular ontology for publishing a personal profile.

Why not just use Google?
• Conventional web navigation and ranking models are not suitable for the Semantic Web.
  • They do not differentiate SWDs from other web pages.
  • They do not parse or use the internal structure of an SWD or the external semantic links among SWDs.
  • They are designed to work with natural language and unstructured text.
• Example: the FOAF ontology is not among the top 10 Google results for "person ontology".

Swoogle Objectives
• Finding appropriate ontologies
  • Qualified search (terms + types).
  • Ontologies are sorted by their popularity.
• Finding instance data
  • Querying SWDs with constraints on the classes and properties they use.
  • Helps to integrate Semantic Web data on the web.
• Characterizing the Semantic Web
  • Structural properties.

Related Work
• Ontology-based annotation systems
  • SHOE, Ontobroker, webKB, QuizRDF, CREAM, …
  • Annotate online documents.
  • Index documents based on the annotations, not on the entire document.
  • Use their own ontologies, which might not suit some SWDs.

Related Work – cont.
• Ontology repositories
  • DAML Ontology Library, SemWebCentral, Schema Web, …
  • Collect ontologies (simply store the entire RDF document).
  • Do not automatically discover SWDs but rather require people to submit URLs.
  • Constitute a small portion of the Semantic Web.

Related Work – cont.
• Semantic Web browsers
  • W3C's Ontaria
    • Searchable and browsable directory of RDF documents developed by the W3C.
    • Does not automatically discover SWDs.
    • Stores the full RDF graphs.
    • Indexes individuals of well-known classes, e.g. foaf:Person, rss:Item.
• Experiments show that Swoogle outperforms all of these systems.

Swoogle
• Crawler-based indexing and retrieval system for the Semantic Web.
  • Discovers Semantic Web documents.
  • Computes relations between documents.
  • Stores and reasons over extracted metadata.
• The system is designed to scale up to handle tens of millions of documents.
• Enables rich query constraints on semantic relations.

Swoogle Architecture

Swoogle Architecture – Discovery
• Collects candidate URLs to find and cache SWDs
  • Submitted URLs.
  • A web crawler.
  • A customized meta-crawler (using conventional search engines).
  • SwoogleBot, a Semantic Web crawler.
    • Analyzes SWDs to produce new candidates.
• To date, Swoogle has found over 1.7M SWDs containing more than 1G triples.

Swoogle Architecture – Indexing
• Analyzes the discovered SWDs.
• Generates the bulk of Swoogle's metadata about the Semantic Web
  • Characterizes features associated with SWDs and SWTs.
  • Tracks relations among SWDs and SWTs.
• Questions the metadata answers: How do SWDs use, define or populate a given SWT? How are two SWTs associated? …

Swoogle Architecture – Analysis
• Analyzes the generated metadata.
  • Classification of SWOs and SWDBs.
• Hosts the modular ranking mechanisms.
  • OntologyRank.

Swoogle Architecture – Services
• Provides search services to software agents and users, allowing them to access metadata and navigate the Semantic Web.
  • Swoogle Search – searches SWDs using constraints on URLs, SWTs being used or defined, etc.
  • Ontology Dictionary – searches ontologies at the term level and offers more navigational paths.

SWD Metadata
• SWD metadata is collected to make SWD search more efficient and effective.
• It is derived from the content of SWDs as well as the relations among SWDs.
• Three categories of metadata:
  • Basic metadata
  • Relations among SWDs
  • Analytical results

Basic Metadata
• Language Features – properties describing the syntactic or semantic features of an SWD.
  • Encoding – the syntactic encoding of an SWD: "RDF/XML", "N-TRIPLE" or "N3".
  • Language – the language used by an SWD: "OWL", "DAML+OIL", "RDFS" or "RDF".
  • OWL Species – the language species of an SWD written in OWL: "OWL-LITE", "OWL-DL" or "OWL-FULL".

Basic Metadata – cont.
Example SWD: the course-catalog document shown in the Glossary (classes Department and Course; properties name, number, department and creditPts; individuals dept_cs and cs236703).
• RDF Statistics – properties summarizing the node distribution of the RDF graph of an SWD.
  • How an SWD defines new classes, properties and individuals.
  • Let foo be an SWD and let C(foo), P(foo), I(foo) be the sets of classes, properties and individuals defined in foo, respectively. The ontology ratio R(foo) is calculated by:
    $$R(foo) = \frac{|C(foo)| + |P(foo)|}{|C(foo)| + |P(foo)| + |I(foo)|}$$
  • R(foo) ranges from 0 to 1, where 0 implies that foo is a pure SWDB and 1 implies that foo is a pure SWO.
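The ontology ratio can be illustrated with a small Python sketch. It assumes the ratio is the share of classes and properties among all classes, properties and individuals a document defines, which is consistent with the 0 (pure SWDB) and 1 (pure SWO) endpoints described on this slide. The function name `ontology_ratio` and the error handling are illustrative choices, not part of Swoogle.

```python
def ontology_ratio(num_classes: int, num_properties: int, num_individuals: int) -> float:
    """Share of defined nodes that are terms (classes or properties).

    Assumed form: (|C| + |P|) / (|C| + |P| + |I|).
    """
    terms = num_classes + num_properties
    total = terms + num_individuals
    if total == 0:
        # The ratio is undefined for a document that defines nothing.
        raise ValueError("document defines no classes, properties or individuals")
    return terms / total


# Course-catalog example: 2 classes (Department, Course),
# 4 properties (name, number, department, creditPts),
# 2 individuals (dept_cs, cs236703).
course_catalog_ratio = ontology_ratio(2, 4, 2)
```

Under this assumption the course-catalog example scores 6/8, so it sits closer to an ontology than to a pure database, while a document like the FOAF specification (classes and properties only) scores 1.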

Basic Metadata – cont.
• Ontology Annotations – properties that describe an SWD as an ontology.
  • Apply when the SWD has an instance of owl:Ontology.
  • Swoogle records the following properties:
    • label (rdfs:label)
    • comment (rdfs:comment)
    • versionInfo (owl:versionInfo / daml:versionInfo)

Relations Among SWDs
• Capturing and analyzing relations at the RDF node level is hard.
• Swoogle generalizes RDF node level relations and focuses on SWD level relations.
• Swoogle captures the following SWD level relations:
  • TM/IN – an SWD uses terms defined by other SWDs.
  • IM – an ontology imports another ontology.
  • EX – an ontology extends another ontology.
  • PV – an ontology is a prior version of another.
  • CPV – an ontology is a prior version of another and is compatible with it.
  • IPV – an ontology is a prior version of another and is incompatible with it.

Inter-Ontology Relations
(Table: indicators of inter-ontology relations.)

Ranking SWDs
• OntologyRank is inspired by Google's PageRank algorithm.
• Underlying random surfing model:
  • The surfer jumps to a random URL.
  • With probability d, the surfer randomly chooses a link to follow.
  • With probability 1 - d, the surfer jumps to another random URL.

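PageRank
• Given a document A, A's PageRank is computed by:
  $$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$$
  where T_1, …, T_n are the web documents that link to A; C(T_i) is the total number of outlinks of T_i; and d is a damping factor, typically set to 0.85.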
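The PageRank computation can be sketched as a fixed-point iteration in Python. This is a minimal illustration of the standard non-normalized recurrence PR(A) = (1 - d) + d · Σ PR(T)/C(T) over the documents T that link to A; the function name `pagerank`, the adjacency-dict input format and the fixed iteration count are assumptions made for the example.

```python
def pagerank(links: dict[str, set[str]], d: float = 0.85, iterations: int = 50) -> dict[str, float]:
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over all T linking to A.

    links maps each document to the set of documents it links to.
    """
    pages = set(links) | {target for targets in links.values() for target in targets}
    rank = {page: 1.0 for page in pages}  # arbitrary starting value
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Each linking document T passes on PR(T) split evenly over its C(T) outlinks.
            incoming = sum(
                rank[source] / len(targets)
                for source, targets in links.items()
                if page in targets
            )
            new_rank[page] = (1 - d) + d * incoming
        rank = new_rank
    return rank


# Two documents that both link to a third one.
example_ranks = pagerank({"A": {"C"}, "B": {"C"}})
```

In this toy graph A and B receive only the random-jump share (1 - d), while C additionally collects the rank passed along by both of its inlinks.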

PageRank

The SW Navigation Model
• The graph formed by SWDs has a richer set of relations.
• The edges have explicit semantics.
• Users can navigate the Semantic Web within or across the web and the RDF graph through 7 groups of navigational paths.

The SW Navigation Model

OntologyRank
• The semantics of links lead to a non-uniform probability of following a particular outgoing link.
• Given SWDs A and B, Swoogle classifies inter-SWD links into four categories:
  • imports(A,B) – A imports all content of B.
  • uses-term(A,B) – A uses some of the terms defined by B (without importing B).
  • extends(A,B) – A extends the definitions of terms defined by B.
  • asserts(A,B) – A makes assertions about the individuals defined by B.
• Each category is assigned a different weight, which represents the probability of following that kind of link.

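OntologyRank – cont.
• Given an SWD a, Swoogle computes its raw rank by:
  $$rawPR(a) = (1 - d) + d \sum_{x \in L(a)} rawPR(x) \frac{f(x, a)}{f(x)}$$
  $$f(x, a) = \sum_{l \in links(x, a)} weight(l), \qquad f(x) = \sum_{b \in T(x)} f(x, b)$$
  where L(a) is the set of SWDs that link to a, T(x) is the set of SWDs that x links to, links(x, a) is the set of links from x to a, and weight(l) is the weight assigned to the category of link l.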
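A weighted variant of the PageRank iteration can illustrate the raw rank. The sketch assumes that the raw rank follows the PageRank recurrence, except that each linking SWD x passes its rank to a in proportion to the total weight of its links to a rather than evenly. The weight values in `LINK_WEIGHTS` are hypothetical (the slides do not give the values Swoogle uses), as are the function name `raw_rank` and the triple-list input format.

```python
# Hypothetical weights per link category; not Swoogle's actual values.
LINK_WEIGHTS = {"imports": 4.0, "extends": 3.0, "uses-term": 2.0, "asserts": 1.0}


def raw_rank(links: list[tuple[str, str, str]], d: float = 0.85, iterations: int = 50) -> dict[str, float]:
    """links is a list of (source SWD, link category, target SWD) tuples."""
    docs = {source for source, _, _ in links} | {target for _, _, target in links}

    # pair_weight[(x, a)]: summed weight of all links from x to a.
    pair_weight: dict[tuple[str, str], float] = {}
    for source, category, target in links:
        key = (source, target)
        pair_weight[key] = pair_weight.get(key, 0.0) + LINK_WEIGHTS[category]

    # out_weight[x]: summed weight of all outgoing links of x.
    out_weight: dict[str, float] = {}
    for (source, _), weight in pair_weight.items():
        out_weight[source] = out_weight.get(source, 0.0) + weight

    rank = {doc: 1.0 for doc in docs}
    for _ in range(iterations):
        # The comprehension reads the previous iteration's ranks before rebinding `rank`.
        rank = {
            a: (1 - d) + d * sum(
                rank[x] * weight / out_weight[x]
                for (x, target), weight in pair_weight.items()
                if target == a
            )
            for a in docs
        }
    return rank


# A imports B and merely asserts facts about individuals in C.
example = raw_rank([("A", "imports", "B"), ("A", "asserts", "C")])
```

With these illustrative weights, B receives four fifths of the rank that A passes on and C one fifth, so the imported document ends up ranked above the one that is only asserted about.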

OntologyRank – cont.
• Then, Swoogle computes the rank for SWDBs and SWOs by:
  $$PR_{SWDB}(a) = rawPR(a)$$
  $$PR_{SWO}(a) = \sum_{x \in TC(a)} rawPR(x)$$
  where TC(a) is the transitive closure of SWOs imported by a.
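The final ranking step can be sketched as follows, assuming an SWDB keeps its raw rank and an SWO sums the raw ranks over the transitive closure of ontologies it imports. Whether that closure includes the ontology itself is not stated on the slide; this sketch assumes it does, so that an ontology with no imports keeps its own raw rank. The names `import_closure` and `final_rank` and the dict-based inputs are illustrative.

```python
def import_closure(doc: str, imports: dict[str, list[str]]) -> set[str]:
    """Return doc plus every ontology reachable from it through imports links.

    Including doc itself is an assumption of this sketch.
    """
    seen = {doc}
    stack = list(imports.get(doc, ()))
    while stack:
        current = stack.pop()
        if current not in seen:  # also guards against import cycles
            seen.add(current)
            stack.extend(imports.get(current, ()))
    return seen


def final_rank(doc: str, is_ontology: bool, raw: dict[str, float], imports: dict[str, list[str]]) -> float:
    """SWDBs keep their raw rank; SWOs sum raw ranks over their import closure."""
    if not is_ontology:
        return raw[doc]
    return sum(raw[x] for x in import_closure(doc, imports))


# Toy data: ontology a imports b, b imports c; db is a database document.
raw_scores = {"a": 1.0, "b": 0.5, "c": 0.25, "db": 2.0}
import_links = {"a": ["b"], "b": ["c"]}
rank_a = final_rank("a", True, raw_scores, import_links)
rank_db = final_rank("db", False, raw_scores, import_links)
```

The explicit `seen` set keeps the traversal finite even when two ontologies import each other.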

Indexing and Retrieval of SWDs
• The problem of indexing and searching SWDs
  • Significant semantic information is encoded in marked-up documents.
  • Reasoning over a large collection of documents can be expensive.
• Traditional information retrieval techniques
  • Faster (coarse view of the text).
  • Can quickly retrieve a set of SWDs based on similarities of the source text alone.

Applying IR Techniques
• SWDs are not entirely markup.
  • Search should be applied to both the structured and unstructured components of the document.
• We may want SWDs to be available to commonly used search engines.
  • Documents must be transformed into a form that a standard IR engine can understand and manipulate.
• IR offers well-researched methods for ranking matches, computing similarities between documents and employing relevance feedback.

Applying IR Techniques – cont.
• Look at a document as a collection of either tokens or N-Grams.
  • URIrefs of classes, properties and individuals correspond to words in natural languages.
• Apply the following process to an SWD:
  • Reduce it to triples.
  • Extract URIrefs (with duplicates).
  • Discard URIrefs of blank nodes.
  • Hash each URI to a token.
  • Index the document.
• Swoogle indexes by either N-Grams or URIrefs. With N-Grams, the query "time" matches:
  • http://foo.com/timeont.owl#timeInterval
  • http://foo.com/timeont.owl#calendarClockInterval
  • http://purl.org/upper/temporal/t13.owl#timeThing
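The indexing steps above can be sketched in Python. The sketch assumes triples are given as N-Triples-style strings (`<uri>` for URIrefs, `_:id` for blank nodes, quoted strings for literals). The MD5 hash, the 4-character N-Gram size and the function names are illustrative assumptions; the slides do not say which hash or N-Gram length Swoogle uses.

```python
import hashlib


def uriref_tokens(triples: list[tuple[str, str, str]]) -> list[str]:
    """Extract URIrefs (keeping duplicates), skip blank nodes and literals, hash each URI."""
    tokens = []
    for triple in triples:
        for term in triple:
            # Only terms of the form <uri> are URIrefs; "_:b0" and "literal" are skipped.
            if term.startswith("<") and term.endswith(">"):
                uri = term[1:-1]
                tokens.append(hashlib.md5(uri.encode("utf-8")).hexdigest())
    return tokens


def char_ngrams(uri: str, n: int = 4) -> set[str]:
    """All lowercase character N-Grams of a URI."""
    text = uri.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_matches(query: str, uris: list[str]) -> list[str]:
    """URIs whose N-Gram set contains the query; N is the query's length."""
    needle = query.lower()
    return [uri for uri in uris if needle in char_ngrams(uri, len(needle))]


uris = [
    "http://foo.com/timeont.owl#timeInterval",
    "http://foo.com/timeont.owl#calendarClockInterval",
    "http://purl.org/upper/temporal/t13.owl#timeThing",
]
hits = ngram_matches("time", uris)
```

Hashed URIref tokens only support exact term lookups, whereas the N-Gram view lets a short query such as "time" reach terms whose URIs merely contain it, which is the trade-off the matching example on this slide points at.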

Swoogle Demo…
