CASS-MT Review: 6-Apr-2011 Task 3: Semantic Databases on the XMT

PNNL: David Haglin, Bob Adolf, Sinan al-Saffar, Cliff Joslyn • Cray: David Mizell • SNL: Eric Goodman, Edward Jimenez, Greg Mackey

Recap from August Review
• We built a simple automatic query translator
• Much of the work was done by hand
• Lessons learned from experiments:
  • Query optimization must happen early and often
  • An efficient semantic search engine will almost certainly need data-driven and query-driven optimization
• Now that BTC is largely behind us, we are continuing forward with these goals

Query Optimization Research Agenda
• Prerequisites for query optimization research:
  • Build out an end-to-end query engine
    • Enables: validation, measurement, profiling
  • Build a simple research compiler
    • Enables: rapid prototyping, attribute aggregation
• Not to be construed as standing up a product
  • Glue code is not engineered to be robust
  • Compiler is a first pass at using intermediate forms

A Modular Query Engine

Progress: Data Ingest
• Portability important
• Using MTGL on multiple systems:
  • Cray XMT Threadstorm nodes
  • Cray XMT service node (Linux)
  • Cray CX-1
• Endianness is an issue for storing/retrieving binary triples
  • Ensure the first triple has a small (< 2^32) “Subject”
  • Upon reading triples, if the first integer is >= 2^32, swap bytes on all integers read in. Do this on all systems.
  • Swap 100,000,000 uint64_t
  • Identical code compiled and run on each platform
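The byte-order check described on this slide can be sketched in a few lines. This is a minimal illustration in Python, not the project's MTGL code; the names `read_triples` and `byteswap64` are hypothetical, and the file is assumed to be a flat sequence of unsigned 64-bit integers, three per triple.

```python
import struct

LIMIT = 1 << 32  # the first subject ID is assumed to be below 2**32


def byteswap64(value):
    """Reverse the byte order of one unsigned 64-bit integer."""
    return int.from_bytes(value.to_bytes(8, "little"), "big")


def read_triples(raw):
    """Decode a buffer of uint64 values into (subject, predicate, object) triples.

    The buffer is first decoded in the host's native byte order. If the first
    integer is >= 2**32, the file is assumed to have been written on a machine
    of the opposite endianness, and every integer is byte-swapped.
    """
    count = len(raw) // 8
    values = list(struct.unpack(f"={count}Q", raw[:count * 8]))
    if values and values[0] >= LIMIT:
        values = [byteswap64(v) for v in values]
    usable = len(values) - len(values) % 3  # ignore a trailing partial triple
    return [tuple(values[i:i + 3]) for i in range(0, usable, 3)]
```

The check works because any nonzero value below 2^32 has all of its set bits in the low four bytes; after a byte swap those bits land in the high four bytes, so the misread value is always at least 2^32. The same function can therefore run unchanged on every platform.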

Progress: Graph Representation
• Work in progress: use MTGL and sample code from SPEED-MT to build out components:
  • Build a compressed_sparse_row<BidirectionalS> representation of the RDF graph
  • Focus on an ability to memory map graph data structures for fast reloading (rememd).
  • Adapt Search-Space Recursive Descent code (described in the August 2010 review) to the MTGL-based data structure.
• Redesign Dictionary Encoding storage on disk:
  • Use only one file that supports an “Endian-ness hack”
  • Avoid the need to parse strings from char-array to rebuild in-memory data dictionary.
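To make the bidirectional compressed-sparse-row idea concrete, here is a small sketch in Python. It is not the MTGL data structure or its API; `build_csr` and `build_bidirectional` are hypothetical names, and subjects and objects are assumed to be already dictionary-encoded as integers in the range 0..num_vertices-1, with the predicate stored as an edge label.

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays from (source, target, label) edges.

    Returns (offsets, targets, labels); the edges leaving vertex v occupy
    positions offsets[v] .. offsets[v + 1] - 1 of targets and labels.
    """
    offsets = [0] * (num_vertices + 1)
    for src, _, _ in edges:
        offsets[src + 1] += 1          # count the out-degree of each vertex
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]   # prefix sum turns counts into offsets

    targets = [0] * len(edges)
    labels = [0] * len(edges)
    cursor = offsets[:-1]              # next free slot per vertex (a copy)
    for src, dst, label in edges:
        slot = cursor[src]
        targets[slot] = dst
        labels[slot] = label
        cursor[src] += 1
    return offsets, targets, labels


def build_bidirectional(num_vertices, triples):
    """Build forward (subject -> object) and reverse (object -> subject) CSR."""
    out_edges = [(s, o, p) for s, p, o in triples]
    in_edges = [(o, s, p) for s, p, o in triples]
    return build_csr(num_vertices, out_edges), build_csr(num_vertices, in_edges)
```

Because the result is a handful of flat integer arrays, it is the kind of layout that can be written to disk and memory mapped back without rebuilding pointers, which is the property the slide is after.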

A Transformation-oriented Query Compiler

Progress: Parser
• Query Parsing
  • SPARQL 1.0 implemented as ANTLR LL(*) grammar
  • Tested using SPARQL Performance Benchmark (SP2Bench) and Data Access Working Group (DAWG) tests
  • Currently passes 175 of 214 tests (81%)
• We are not currently working to improve the coverage
  • SPARQL parsing is not a priority
  • We needed enough coverage to get interesting properties (OPTIONAL, UNION, FILTER, blank nodes, etc.)

Progress: Intermediate Representation
• Query language is not amenable to optimization
  • So we lower into a more comfortable form
• GPIR: Graph Pattern Intermediate Representation
  • Query is represented as a graph
  • Entity references are unified (all ?x refer to the same thing)
  • Entities are tagged with language attributes
    • e.g., all triples from a UNION statement are tagged with a union ID and a common union group ID
• EPIR: Execution Plan Intermediate Representation
  • Still very much a work-in-progress

Progress: Intermediate Representation

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT DISTINCT ?predicate
    WHERE {
      { ?person rdf:type foaf:Person .
        ?subject ?predicate ?person }
      UNION
      { ?person rdf:type foaf:Person .
        ?person ?predicate ?object }
    }

Parses to:

    # Query Graph <21148736> in GPIR
    9:0-5-8
    3:0-1-2
    6:4-5-0
    7:0-1-2
    %
    0:variable:T
    0:label:"?person"
    1:label:"http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
    2:label:"http://xmlns.com/foaf/0.1/Person"
    3:union_group:1
    3:optional:F
    3:union_id:0
    4:variable:T
    4:label:"?subject"
    5:variable:T
    5:select:T
    5:label:"?predicate"
    6:union_group:1
    6:optional:F
    6:union_id:0
    7:union_group:1
    7:optional:F
    7:union_id:1
    8:variable:T
    8:label:"?object"
    9:union_group:1
    9:optional:F
    9:union_id:1
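The two properties illustrated by this example, unified entity references and union tagging, can be sketched as follows. This is an illustrative Python sketch, not the research compiler; `lower_to_graph` is a hypothetical name, the input is assumed to be already parsed into UNION branches of (subject, predicate, object) strings, and the node numbering is simpler than in the dump above, where nodes and triples appear to share one ID space.

```python
def lower_to_graph(branches, union_group=1):
    """Lower UNION branches of triple patterns into a small query graph.

    branches: list of branches, each a list of (s, p, o) term strings.
    Returns (nodes, edges). Every distinct term gets exactly one node, so all
    occurrences of ?x refer to the same node. Every triple becomes an edge
    record tagged with its union group and the ID of the branch it came from.
    """
    node_ids = {}   # term -> node index
    nodes = []      # node index -> attribute dict
    edges = []

    def node_for(term):
        if term not in node_ids:
            node_ids[term] = len(nodes)
            nodes.append({"label": term, "variable": term.startswith("?")})
        return node_ids[term]

    for union_id, patterns in enumerate(branches):
        for s, p, o in patterns:
            edges.append({
                "triple": (node_for(s), node_for(p), node_for(o)),
                "union_group": union_group,
                "union_id": union_id,
                "optional": False,
            })
    return nodes, edges
```

Applied to the two branches of the query above, this yields six nodes (one per distinct term) and four tagged edges, two per branch.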

Progress: Transforms
• Transforms operate on an IR
  • Input and output are the same format, so they can be chained
• Example transform: xform_rem_uni
  • Removes union group attributes which only have one member
  • Think of this as the analogue, for SPARQL UNION statements, of algebraic simplification on math expressions (A+0 => A)
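A transform of this shape might look like the sketch below. Only the name `xform_rem_uni` comes from the slide; the edge-record layout (dicts with `union_group` and `union_id` keys) is a hypothetical stand-in for the IR, and "one member" is interpreted here as a union group with a single branch, which is an assumption.

```python
from collections import defaultdict


def xform_rem_uni(edges):
    """Drop union attributes from groups that have only one branch.

    edges: list of dicts, optionally carrying "union_group" and "union_id".
    Returns a new list in the same format, so transforms can be chained.
    """
    branches = defaultdict(set)   # union_group -> set of branch IDs seen
    for edge in edges:
        if "union_group" in edge:
            branches[edge["union_group"]].add(edge.get("union_id"))

    result = []
    for edge in edges:
        out = dict(edge)          # copy so the input IR is left untouched
        group = out.get("union_group")
        if group is not None and len(branches[group]) == 1:
            del out["union_group"]
            out.pop("union_id", None)
        result.append(out)
    return result
```

Since the output uses the same record format as the input, applying the transform twice gives the same result as applying it once, and it can be placed anywhere in a chain of other transforms.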

Potential Optimizing Transforms
• Longer-term, we are looking at several different types of transforms to attempt. Here are some examples:
  • Impossible query identification: a triple pattern, constraint, or inferred interaction does not exist in the data
  • Deterministic bind: if a property is known to be unique (e.g., rdf:type is usually unique), a traversal can avoid nondeterminism while satisfying that constraint
  • Selectivity-based strategy detection: if the pattern does not include complex interactions (or can be reduced to one that does not), a simpler execution strategy can be chosen on-the-fly.
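As one illustration of the first idea, the simplest form of impossible query identification is a check that every constant in a triple pattern actually occurs in the data in that position. The sketch below is a guess at what such a check could look like, not the authors' design; the function names are hypothetical, and it assumes a purely conjunctive pattern (no UNION or OPTIONAL), where one unmatched constant makes the whole query empty.

```python
def is_variable(term):
    """SPARQL variables are written with a leading question mark."""
    return term.startswith("?")


def build_term_index(triples):
    """Collect the sets of subjects, predicates, and objects seen in the data."""
    return tuple({triple[pos] for triple in triples} for pos in range(3))


def is_impossible(patterns, index):
    """Return True if some constant never occurs in the data in its position.

    True means the conjunctive query cannot have results. False only means
    this cheap test found no reason to rule the query out.
    """
    for pattern in patterns:
        for pos, term in enumerate(pattern):
            if not is_variable(term) and term not in index[pos]:
                return True
    return False
```

The index here is built with a single pass over the data, so the test costs only a few set lookups per query and can run before any graph traversal is planned.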

Future Work: specific directions
• Continue working with Larry Holder (WSU) to find common ground on frequent subgraph mining and semantic database query
• Work with Bill Howe on query language and hybrid search strategies
• Expand our collaboration with Task 1
• Support Task 16 (Mayo)
• Engage with Bioinformatics domain to find/build interestingly large and complex Bio dataset (i.e., more complex than UniProt)
• Find collections of complex queries
• Continue work on search engine comparison:
  • Array-based
  • Subgraph-isomorphism (MTGL)
  • Sprinkle-SPARQL
  • Query-optimization infused with pattern matching
• Extend study of larger path types (n=4,5) and/or non-linear motifs
