CASS-MT Review: 6-Apr-2011 Task 3: Semantic Databases on the XMT

Presentation Transcript


  1. CASS-MT Review: 6-Apr-2011 Task 3: Semantic Databases on the XMT
     PNNL: David Haglin, Bob Adolf, Sinan al-Saffar, Cliff Joslyn
     Cray: David Mizell
     SNL: Eric Goodman, Edward Jimenez, Greg Mackey

  2. Recap from August Review
     • We built a simple automatic query translator
     • Much of the work was done by hand
     • Lessons learned from experiments:
       • Query optimization must happen early and often
       • An efficient semantic search engine will almost certainly need data-driven and query-driven optimization
     • Now that BTC is largely behind us, we are continuing forward with these goals

  3. Query Optimization Research Agenda
     • Prerequisites for query optimization research:
       • Build out an end-to-end query engine
         • Enables: validation, measurement, profiling
       • Build a simple research compiler
         • Enables: rapid prototyping, attribute aggregation
     • Not to be construed as standing up a product
       • Glue code is not engineered to be robust
       • Compiler is a first pass at using intermediate forms

  4. A Modular Query Engine

  5. Progress: Data Ingest
     • Portability important
     • Using MTGL on multiple systems:
       • Cray XMT Threadstorm nodes
       • Cray XMT service node (Linux)
       • Cray CX-1
     • Endian-ness an issue for storing/retrieving binary triples
       • Ensure the first triple has a small (< 2^32) “Subject”
       • Upon reading triples, if the first integer is >= 2^32, then swap bytes on all integers read in. Do this on all systems.
       • Swap 100,000,000 uint64_t
       • Identical code compiled and run on each platform
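The byte-order check described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the project's ingest code: the file layout (a flat run of 3 x uint64 per triple) and the function name `read_triples` are hypothetical. One detail follows from the arithmetic: the sentinel subject must also be nonzero, because zero looks the same in either byte order.

```python
import struct

LIMIT = 1 << 32  # the first subject is guaranteed to be below this value


def read_triples(raw: bytes) -> list[tuple[int, int, int]]:
    """Decode packed uint64 triples, swapping bytes if the writer used the other byte order.

    Assumed layout (hypothetical): a flat sequence of 8-byte unsigned integers,
    three per triple, with no header.
    """
    count = len(raw) // 8
    if count == 0 or count % 3 != 0 or len(raw) % 8 != 0:
        raise ValueError("expected a non-empty whole number of 3 x uint64 triples")
    # Decode in this machine's native byte order first.
    values = struct.unpack(f"={count}Q", raw)
    if values[0] >= LIMIT:
        # A small nonzero subject read in the wrong byte order lands at or above
        # 2^32, so a large first integer means every integer must be swapped.
        values = tuple(
            int.from_bytes(v.to_bytes(8, "little"), "big") for v in values
        )
    return [tuple(values[i:i + 3]) for i in range(0, count, 3)]
```

The same function runs unchanged on either kind of machine, which matches the slide's point that identical code is compiled and run on each platform.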

  6. Progress: Graph Representation
     • Work in progress: use MTGL and sample code from SPEED-MT to build out components:
       • Build a compressed_sparse_row<BidirectionalS> representation of the RDF graph
       • Focus on an ability to memory map graph data structures for fast reloading (rememd).
       • Adapt Search-Space Recursive Descent code (described in the August 2010 review) to the MTGL-based data structure.
     • Redesign Dictionary Encoding storage on disk:
       • Use only one file that supports an “Endian-ness hack”
       • Avoid the need to parse strings from char-array to rebuild in-memory data dictionary.
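For readers unfamiliar with the layout, here is a minimal sketch of a bidirectional compressed sparse row structure built from dictionary-encoded triples. It is not MTGL's `compressed_sparse_row` API; the function names and the choice to store the predicate as an edge label are assumptions made for illustration. The flat integer arrays are the kind of structure that lends itself to memory mapping.

```python
def build_csr(num_vertices, edges):
    """Build one CSR direction from (source, target, label) integer edges.

    Returns (offsets, targets, labels): the edges leaving vertex v occupy
    positions offsets[v] .. offsets[v + 1] - 1 of targets and labels.
    """
    offsets = [0] * (num_vertices + 1)
    for src, _, _ in edges:
        offsets[src + 1] += 1            # count edges per source vertex
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]     # prefix sum turns counts into offsets
    targets = [0] * len(edges)
    labels = [0] * len(edges)
    cursor = offsets[:-1]                # next free slot per vertex (slice is a copy)
    for src, dst, label in edges:
        slot = cursor[src]
        targets[slot] = dst
        labels[slot] = label
        cursor[src] += 1
    return offsets, targets, labels


def build_bidirectional(num_vertices, triples):
    """Index (subject, predicate, object) triples by subject and by object."""
    out_edges = build_csr(num_vertices, [(s, o, p) for s, p, o in triples])
    in_edges = build_csr(num_vertices, [(o, s, p) for s, p, o in triples])
    return out_edges, in_edges


# Tiny example: vertices 0..2, predicates encoded as 10 and 11.
out_edges, in_edges = build_bidirectional(3, [(0, 10, 1), (0, 11, 2), (2, 10, 0)])
```

Keeping both directions costs twice the edge storage but lets a pattern be expanded from a bound subject or a bound object without scanning.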

  7. A Transformation-oriented Query Compiler

  8. Progress: Parser
     • Query Parsing
       • SPARQL 1.0 implemented as ANTLR LL* grammar
       • Tested using SPARQL Performance Benchmark (SP2Bench) and Data Access Working Group (DAWG) tests
       • Currently passes 175 of 214 tests (81%)
     • We are not currently working to improve the coverage
       • SPARQL parsing is not a priority
       • We needed enough coverage to get interesting properties (OPTIONAL, UNION, FILTER, blank nodes, etc.)

  9. Progress: Intermediate Representation
     • Query language is not amenable to optimization
       • So we lower into a more comfortable form
     • GPIR: Graph Pattern Intermediate Representation
       • Query is represented as a graph
       • Entity references are unified (all ?x refer to the same thing)
       • Entities are tagged with language attributes
         • e.g., all triples from a UNION statement are tagged with a union ID and a common union group ID
     • EPIR: Execution Plan Intermediate Representation
       • Still very much a work-in-progress

  10. Progress: Intermediate Representation

      PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
      PREFIX foaf: <http://xmlns.com/foaf/0.1/>
      SELECT DISTINCT ?predicate
      WHERE {
        { ?person rdf:type foaf:Person .
          ?subject ?predicate ?person }
        UNION
        { ?person rdf:type foaf:Person .
          ?person ?predicate ?object }
      }

      Parses to:

      # Query Graph <21148736> in GPIR
      9:0-5-8
      3:0-1-2
      6:4-5-0
      7:0-1-2
      %
      0:variable:T
      0:label:"?person"
      1:label:"http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
      2:label:"http://xmlns.com/foaf/0.1/Person"
      3:union_group:1
      3:optional:F
      3:union_id:0
      4:variable:T
      4:label:"?subject"
      5:variable:T
      5:select:T
      5:label:"?predicate"
      6:union_group:1
      6:optional:F
      6:union_id:0
      7:union_group:1
      7:optional:F
      7:union_id:1
      8:variable:T
      8:label:"?object"
      9:union_group:1
      9:optional:F
      9:union_id:1
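To make the dump on this slide easier to read, the sketch below loads it into two dictionaries. The grammar is inferred from this single example and is an assumption, not a documented format: lines before `%` are read as `id:subject-predicate-object` triple patterns over node IDs, and lines after `%` as `id:attribute:value` tags on nodes or triples. The name `parse_gpir` is hypothetical.

```python
SAMPLE = '''# Query Graph <21148736> in GPIR
9:0-5-8
3:0-1-2
6:4-5-0
7:0-1-2
%
0:variable:T
0:label:"?person"
1:label:"http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
2:label:"http://xmlns.com/foaf/0.1/Person"
3:union_group:1
3:optional:F
3:union_id:0
4:variable:T
4:label:"?subject"
5:variable:T
5:select:T
5:label:"?predicate"
6:union_group:1
6:optional:F
6:union_id:0
7:union_group:1
7:optional:F
7:union_id:1
8:variable:T
8:label:"?object"
9:union_group:1
9:optional:F
9:union_id:1
'''


def parse_gpir(text):
    """Parse a GPIR-style dump (format inferred from the slide, not a spec)."""
    edges = {}   # triple id -> (subject node, predicate node, object node)
    attrs = {}   # node or triple id -> {attribute name: value string}
    in_attrs = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and the header comment
        if line == "%":
            in_attrs = True               # separator between triples and attributes
            continue
        if in_attrs:
            # Split at most twice: label values such as URIs contain colons.
            ident, key, value = line.split(":", 2)
            attrs.setdefault(int(ident), {})[key] = value.strip('"')
        else:
            ident, nodes = line.split(":", 1)
            edges[int(ident)] = tuple(int(n) for n in nodes.split("-"))
    return edges, attrs


edges, attrs = parse_gpir(SAMPLE)
for ident, (s, p, o) in sorted(edges.items()):
    print(ident, attrs[s]["label"], attrs[p]["label"], attrs[o]["label"],
          "union branch", attrs[ident]["union_id"])
```

Read this way, node 0 (`?person`) appears in all four triple patterns, which illustrates the unification of entity references described on the previous slide, and the `union_id` tags separate the two branches of the UNION.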

  11. Progress: Transforms
      • Transforms operate on an IR
        • Input and output are the same format, so they can be chained
      • Example transform: xform_rem_uni
        • Removes union group attributes which only have one member
        • Think of this as algebraic simplification on math expressions (A+0 => A), except for SPARQL UNION statements
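A transform of this kind might look like the following sketch. It borrows the name `xform_rem_uni` from the slide, but the body is a guess at the behavior, not the project's code. It assumes the IR is held as two dictionaries (triple patterns and their attributes) and interprets "only one member" as a union group with a single branch, that is, a single distinct `union_id`.

```python
from collections import defaultdict


def xform_rem_uni(edges, attrs):
    """Drop union tags from groups that contain only one branch.

    edges: triple id -> (subject, predicate, object) node IDs
    attrs: id -> {attribute name: value}
    Returns a new (edges, attrs) pair in the same shape, so transforms chain.
    """
    branches = defaultdict(set)           # union_group -> set of union_id values
    for ident in edges:
        tags = attrs.get(ident, {})
        if "union_group" in tags:
            branches[tags["union_group"]].add(tags.get("union_id"))
    trivial = {group for group, ids in branches.items() if len(ids) == 1}

    new_attrs = {}
    for ident, tags in attrs.items():
        if ident in edges and tags.get("union_group") in trivial:
            # A one-branch UNION is just its branch, like A + 0 => A.
            tags = {k: v for k, v in tags.items()
                    if k not in ("union_group", "union_id")}
        new_attrs[ident] = dict(tags)
    return edges, new_attrs


# Group "1" has two branches and is kept; group "2" has one and is simplified.
edges = {3: (0, 1, 2), 6: (4, 5, 0), 7: (0, 1, 2)}
attrs = {
    3: {"union_group": "1", "union_id": "0", "optional": "F"},
    7: {"union_group": "1", "union_id": "1", "optional": "F"},
    6: {"union_group": "2", "union_id": "0", "optional": "F"},
}
edges, simplified = xform_rem_uni(edges, attrs)
```

Because the output has the same shape as the input, applying the transform a second time is a no-op, which is the property that makes chaining safe.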

  12. Potential Optimizing Transforms
      • Longer-term, we are looking at several different types of transforms to attempt. Here are some examples:
        • Impossible query identification: a triple pattern, constraint, or inferred interaction does not exist in the data
        • Deterministic bind: if a property is known to be unique (e.g., rdf:type is usually unique), a traversal can avoid nondeterminism while satisfying that constraint
        • Selectivity-based strategy detection: if the pattern does not include complex interactions (or can be reduced to one that does not), a simpler execution strategy can be chosen on-the-fly.
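The simplest case of impossible query identification can be sketched as a lookup against statistics gathered at ingest. Everything here is hypothetical: the function names, the string encoding of terms, and the assumption that a set of predicates present in the data is available. It covers only required triple patterns; patterns inside OPTIONAL or UNION would need extra care, since a missing predicate there does not rule out the whole query.

```python
def is_variable(term):
    """SPARQL variables are written with a leading '?'."""
    return term.startswith("?")


def impossible_patterns(patterns, known_predicates):
    """Return the required triple patterns that cannot match any data.

    patterns: list of (subject, predicate, object) strings
    known_predicates: set of predicates that occur at least once in the data
    A pattern with a constant predicate absent from the data has no solutions,
    so a conjunctive query containing it can be answered without a search.
    """
    return [
        p for p in patterns
        if not is_variable(p[1]) and p[1] not in known_predicates
    ]


# "ex:shoeSize" is a made-up predicate that the data does not contain.
patterns = [
    ("?person", "rdf:type", "foaf:Person"),
    ("?person", "ex:shoeSize", "?n"),
    ("?subject", "?predicate", "?person"),
]
dead = impossible_patterns(patterns, {"rdf:type", "foaf:name"})
```

If `dead` is non-empty, the engine could return an empty result immediately; this is one example of the data-driven optimization mentioned in the recap slide.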

  13. Future Work: specific directions
      • Continue working with Larry Holder (WSU) to find common ground on frequent subgraph mining and semantic database query
      • Work with Bill Howe on query language and hybrid search strategies
      • Expand our collaboration with Task 1
      • Support Task 16 (Mayo)
      • Engage with Bioinformatics domain to find/build interestingly large and complex Bio dataset (i.e., more complex than UniProt)
      • Find collections of complex queries
      • Continue work on search engine comparison:
        • Array-based
        • Subgraph-isomorphism (MTGL)
        • Sprinkle-SPARQL
        • Query-optimization infused with pattern matching
      • Extend study of larger path types (n=4,5) and/or non-linear motifs
