
Querying Distributed RDF Data Sources with SPARQL

Querying Distributed RDF Data Sources with SPARQL. Presented by Bastian Quilitz and Ulf Leser, Humboldt-Universität zu Berlin, ESWC 2008. 2009-07-23, summarized by Jaeseok Myung, Intelligent Database Systems Lab, School of Computer Science & Engineering, Seoul National University, Seoul, Korea.


Presentation Transcript


  1. Querying Distributed RDF Data Sources with SPARQL
     Presented by Bastian Quilitz and Ulf Leser, Humboldt-Universität zu Berlin, ESWC 2008
     2009-07-23, summarized by Jaeseok Myung
     Intelligent Database Systems Lab, School of Computer Science & Engineering, Seoul National University, Seoul, Korea

  2. Introduction
     • SPARQL engines have to deal with thousands of RDF data sets
        • on a single local machine
        • across multiple, distributed machines
     • Integrated access to multiple RDF data sources is a key challenge for many semantic web applications
     • Current SPARQL implementations load all RDF graphs onto the local machine
     • This usually incurs a large overhead in network traffic
     Center for E-Business Technology

  3. Introduction
     • DARQ is an engine for federated SPARQL queries
     • It provides transparent query access to multiple SPARQL services
     • DARQ stands for Distributed ARQ; it is an extension to ARQ (Jena)
     • Available under the GPL license at http://darq.sf.net/
     [Figure: DARQ overview. Labels: "In this presentation: building sub-queries, metadata for each data source (DS)"; "Do not care: data source"]

  4. Preliminaries
     • A SPARQL query Q is defined as Q = (E, DS, R)
        • E : the algebra expression of the SPARQL query
        • DS : an RDF data source
        • R : the query type (SELECT, CONSTRUCT, DESCRIBE, ASK)
     • The algebra expression E consists of
        • Graph patterns
           • Triple pattern : (s, p, o)
           • Basic graph pattern (BGP) : a set of triple patterns
           • Filtered BGP : a BGP with constraints
        • Solution modifiers, such as PROJECTION, DISTINCT, LIMIT or ORDER BY
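The definitions on this slide can be written down as a small data model. The following is only a sketch: the class and field names (`FilteredBGP`, `Query`, `modifiers`, and so on) are hypothetical and are not taken from DARQ or ARQ, and filters and modifiers are kept as plain strings for brevity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Set, Tuple


class QueryType(Enum):
    """R: the query type."""
    SELECT = "SELECT"
    CONSTRUCT = "CONSTRUCT"
    DESCRIBE = "DESCRIBE"
    ASK = "ASK"


# A triple pattern (s, p, o); terms that start with "?" are variables.
TriplePattern = Tuple[str, str, str]


@dataclass(frozen=True)
class FilteredBGP:
    """A basic graph pattern (set of triple patterns) plus constraints."""
    patterns: Tuple[TriplePattern, ...]
    filters: Tuple[str, ...] = ()

    def variables(self) -> Set[str]:
        # Collect every variable used in any position of any pattern.
        return {t for tp in self.patterns for t in tp if t.startswith("?")}


@dataclass(frozen=True)
class AlgebraExpression:
    """E: graph pattern plus solution modifiers (kept as strings here)."""
    pattern: FilteredBGP
    modifiers: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Query:
    """Q = (E, DS, R)."""
    expression: AlgebraExpression
    data_source: str
    query_type: QueryType


# The example query of the next slide, expressed in this model.
# The data source name is a placeholder.
example = Query(
    expression=AlgebraExpression(
        pattern=FilteredBGP(
            patterns=(("?x", "foaf:name", "?name"),
                      ("?x", "foaf:mbox", "?mbox")),
            filters=('regex(?name, "^Tim")', 'regex(?mbox, "w3c")'),
        ),
        modifiers=("PROJECT ?name ?mbox", "ORDER BY ?name", "LIMIT 5"),
    ),
    data_source="hypothetical-source",
    query_type=QueryType.SELECT,
)
```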

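  5. An Example SPARQL Query
        SELECT ?name ?mbox
        WHERE {
          ?x foaf:name ?name .
          ?x foaf:mbox ?mbox .
          FILTER (regex(?name, "^Tim") && regex(?mbox, "w3c"))
        }
        ORDER BY ?name
        LIMIT 5
     • Query type : SELECT
     • Projection : ?name ?mbox
     • TP : each of the two triple patterns
     • BGP : the two triple patterns together
     • FBGP : the BGP together with the FILTER constraint
     • Solution modifiers : ORDER BY, LIMIT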

  6. Query Processing
     • A query is processed in 4 stages:
        • Parsing : converts the query string into a tree model of SPARQL. The DARQ query engine reuses the parser shipped with ARQ
        • Query planning : the query engine decomposes the original query and builds multiple sub-queries according to the information in the service descriptions. Each sub-query can be answered by one known data source
        • Query optimization : the query optimizer takes the sub-queries and rewrites them into a more efficient form
        • Query execution : the query execution plan is executed. The sub-queries are sent to the data sources and the results are integrated

  7. Service Descriptions
     • Information about each data source is needed
        • To find the relevant data sources for the different triple patterns
        • To decompose the query into sub-queries
     • Service descriptions
        • Describe which data is available from a data source
        • Allow limitations on access patterns to be stated
        • Include statistical information used for query optimization
        • Are represented in RDF

  8. Service Descriptions
     • Data description
        • A service description defines capabilities, which indicate whether data is available from the source or not
        • Ex) sd:capability [ sd:predicate rdf:type ];
        • The definition of capabilities is based on predicates
        • DARQ currently supports only queries with bound predicates
     • Limitations on access patterns
        • DARQ supports limitations on access patterns
        • Ex) sd:requiredBindings [ sd:subjectBinding foaf:name ];
        • Ex) sd:requiredBindings [ sd:objectBinding foaf:name ];

  9. Service Descriptions
     • Statistical information
        • Helps the query optimizer to find a cost-effective query plan
        • Includes
           • Ns : the total number of triples
           • Optional information for each predicate
              • nD(p) : the number of triples with predicate p in data source D
              • sselD(p) : the selectivity of a triple pattern with predicate p when the subject is bound (default = 1 / nD(p))
              • oselD(p) : the selectivity of a triple pattern with predicate p when the object is bound (default = 1)
        • The statistics are deliberately simple, so that every data source can provide them
        • More precise statistics would be preferable but will not be available

  10. Service Descriptions
     • The data source defined in the example can answer queries for foaf:name, foaf:mbox and foaf:weblog
     • Objects of a triple with predicate foaf:name must always start with a letter from A to R
     • In total it stores 112 triples
     • The data source has limitations on access patterns, i.e. a query must contain a triple pattern with predicate foaf:name or foaf:mbox and a bound object
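The example service description and its access-pattern limitation can be mimicked in a few lines. This is a minimal sketch under stated assumptions: the `ServiceDescription` class and `satisfies_access_pattern` function are hypothetical names, not DARQ's API; the "A to R" constraint on foaf:name objects is left out; and the required bindings are modelled as "at least one listed predicate must occur with a bound object", which is how the slide words it.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set, Tuple

TriplePattern = Tuple[str, str, str]


def is_variable(term: str) -> bool:
    """Terms starting with '?' are variables, everything else is bound."""
    return term.startswith("?")


@dataclass
class ServiceDescription:
    predicates: Set[str]                # capabilities (one per predicate)
    total_triples: int                  # total number of triples
    # Predicates of which at least one must occur with a bound object.
    object_bindings: Set[str] = field(default_factory=set)


def satisfies_access_pattern(sd: ServiceDescription,
                             patterns: Iterable[TriplePattern]) -> bool:
    """True if the patterns respect the source's required object bindings."""
    if not sd.object_bindings:
        return True  # the source imposes no limitation
    return any(p in sd.object_bindings and not is_variable(o)
               for (_s, p, o) in patterns)


# The data source described on the slide.
example_sd = ServiceDescription(
    predicates={"foaf:name", "foaf:mbox", "foaf:weblog"},
    total_triples=112,
    object_bindings={"foaf:name", "foaf:mbox"},
)

unbound = [("?x", "foaf:name", "?name")]
bound = [("?x", "foaf:mbox", "<mailto:a@b.c>"),
         ("?x", "foaf:weblog", "?blog")]
rejected = satisfies_access_pattern(example_sd, unbound)   # object is a variable
accepted = satisfies_access_pattern(example_sd, bound)     # foaf:mbox object is bound
```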

  11. Query Planning
     • Query planning is based on the information provided by the service descriptions
     • It has two stages
        • Source selection : determines which data sources are relevant to a given query
           • The algorithm simply matches the given triple patterns against the capabilities of the data sources
           • Ex) the capability sd:capability [ sd:predicate rdf:type ]; matches the query SELECT ?x WHERE { ?x rdf:type foaf:Person }
           • As a result, every triple pattern in a BGP has a set of corresponding data sources
           • The results of source selection are used to build sub-queries that can be answered by the data sources
        • Building sub-queries
           • Each data source has a sub-query
           • Each sub-query has a filtered BGP
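The two planning stages can be sketched as follows. This is a simplified illustration of what the slide states, not DARQ's implementation: the function names are hypothetical, capabilities are reduced to a set of predicates per source, and a triple pattern that matches several sources is simply added to the sub-query of each of them.

```python
from typing import Dict, List, Set, Tuple

TriplePattern = Tuple[str, str, str]


def select_sources(patterns: List[TriplePattern],
                   capabilities: Dict[str, Set[str]]
                   ) -> Dict[TriplePattern, Set[str]]:
    """Match each triple pattern against the predicates each source offers."""
    selection: Dict[TriplePattern, Set[str]] = {}
    for tp in patterns:
        predicate = tp[1]
        if predicate.startswith("?"):
            # Capabilities are defined per predicate, so it must be bound.
            raise ValueError(f"unbound predicate in {tp}")
        selection[tp] = {src for src, preds in capabilities.items()
                         if predicate in preds}
    return selection


def build_subqueries(selection: Dict[TriplePattern, Set[str]]
                     ) -> Dict[str, List[TriplePattern]]:
    """Group the triple patterns by data source: one sub-query per source."""
    subqueries: Dict[str, List[TriplePattern]] = {}
    for tp, sources in selection.items():
        for src in sorted(sources):
            subqueries.setdefault(src, []).append(tp)
    return subqueries


# Hypothetical sources A, B and C with their predicate capabilities.
capabilities = {
    "A": {"foaf:name"},
    "B": {"foaf:mbox"},
    "C": {"foaf:name", "foaf:mbox"},
}
patterns = [("?x", "foaf:name", "?name"), ("?x", "foaf:mbox", "?mbox")]

selection = select_sources(patterns, capabilities)
subqueries = build_subqueries(selection)
```

With these inputs, source C receives both triple patterns in a single sub-query, while A and B each receive one.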

  12. Query Planning
     [Figure: DARQ matches the triple patterns of the example query from slide 5 against the capabilities of the data sources]
     • Triple patterns : (?x foaf:name ?name), (?x foaf:mbox ?mbox)
     • Capabilities shown
        • sd:capability sd:predicate foaf:name.
        • sd:capability sd:predicate foaf:mbox.
        • sd:capability sd:predicate foaf:name; sd:predicate foaf:mbox.
     • Sample data shown : (Person, name, "TBL"), (Person, mbox, "T@x.y"), (Person, name, "ABC"), (Person, mbox, "A@b.c")

  13. Query Optimization - Logical
     • Rule-based query rewriting
        • Based on [Perez, J. et al., ISWC 2006]
        • Reduces the number of BGPs and variables
        • Moves value constraints into sub-queries
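Moving value constraints into sub-queries can be illustrated with a simple rule: a filter may be evaluated at a data source if every variable it mentions is bound by that source's sub-query. The sketch below is an assumption about how such a rule could look, not the rewriting rules of the cited paper; the function names are hypothetical, and variables are found with a naive regular expression that would also match a "?name"-like token inside a string literal.

```python
import re
from typing import Dict, List, Set, Tuple

TriplePattern = Tuple[str, str, str]

# Naive variable matcher: "?" followed by an identifier.
VAR = re.compile(r"\?[A-Za-z_][A-Za-z0-9_]*")


def pattern_vars(patterns: List[TriplePattern]) -> Set[str]:
    """All variables that occur in a list of triple patterns."""
    return {t for tp in patterns for t in tp if t.startswith("?")}


def push_filters(subqueries: Dict[str, List[TriplePattern]],
                 filters: List[str]
                 ) -> Tuple[Dict[str, List[str]], List[str]]:
    """Assign each filter to every sub-query that binds all its variables.

    Returns (filters per source, filters left for the federating engine).
    """
    pushed: Dict[str, List[str]] = {src: [] for src in subqueries}
    remaining: List[str] = []
    for f in filters:
        needed = set(VAR.findall(f))
        targets = [src for src, pats in subqueries.items()
                   if needed <= pattern_vars(pats)]
        if targets:
            for src in targets:
                pushed[src].append(f)
        else:
            remaining.append(f)  # must be applied after joining the results
    return pushed, remaining


# Hypothetical sub-queries for two sources.
subqueries = {
    "A": [("?x", "foaf:name", "?name")],
    "B": [("?x", "foaf:mbox", "?mbox")],
}
filters = ['regex(?name, "^Tim")', 'regex(?mbox, "w3c")', "?name != ?mbox"]

pushed, remaining = push_filters(subqueries, filters)
```

Filters that span variables from several sub-queries stay with the federating engine, which applies them once the partial results have been joined.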

  14. Query Optimization - Physical
     • Physical optimization is cost-based and relies on estimating the size of intermediate results
     • The result size estimation is based on the statistics provided in the service descriptions
     • Estimates are defined for joins, single triple patterns and multiple triple patterns (BGPs)
     • Example on the slide : the estimate for a single triple pattern
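The transcript does not preserve the formula for the single-triple-pattern case. The sketch below shows one plausible way to combine the statistics from slide 9 (nD(p), sselD(p), oselD(p) and their defaults); it is an assumption, not the formula from the slide. In particular, the constant returned when both subject and object are bound is a choice made for this sketch, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PredicateStats:
    triples: int                     # nD(p): triples with predicate p in D
    ssel: Optional[float] = None     # sselD(p); default 1 / nD(p)
    osel: Optional[float] = None     # oselD(p); default 1


def estimate_result_size(stats: PredicateStats,
                         subject_bound: bool,
                         object_bound: bool) -> float:
    """Estimate the number of results of one triple pattern (s, p, o)."""
    n = stats.triples
    if n == 0:
        return 0.0
    # Fall back to the defaults given in the service description slide.
    ssel = stats.ssel if stats.ssel is not None else 1.0 / n
    osel = stats.osel if stats.osel is not None else 1.0
    if subject_bound and object_bound:
        # Assumption of this sketch: at most one matching triple,
        # represented by a small constant.
        return 0.5
    if subject_bound:
        return n * ssel
    if object_bound:
        return n * osel
    return float(n)


# A predicate with 100 triples and no optional selectivities.
defaults = PredicateStats(triples=100)
# The same predicate with an explicit object selectivity.
with_osel = PredicateStats(triples=100, osel=0.25)

size_unbound = estimate_result_size(defaults, False, False)
size_subject = estimate_result_size(defaults, True, False)
size_object = estimate_result_size(with_osel, False, True)
```

With the default oselD(p) = 1, binding the object does not reduce the estimate at all, which illustrates why more precise statistics would be preferable.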

  15. Evaluation
     • Dataset : a subset of DBpedia, 31.5 million triples in total
        • Contains RDF data extracted from Wikipedia
        • http://dbpedia.org

  16. Evaluation
     • Setup : 2 physical machines, 5 logical SPARQL endpoints

  17. Evaluation
     • The optimizations yielded significant improvements
     • My opinion
        • The experiments do not account for the loading time
        • A comparison with other systems is needed
        • http://esw.w3.org/topic/LargeTripleStores

  18. Conclusion
     • DARQ offers a single interface for querying multiple, distributed SPARQL endpoints
        • Uses the SPARQL standard => flexible
     • Using service descriptions
        • Data sources can be added and/or removed dynamically
        • A query can be federated and optimized with statistical information
     • Limitations
        • Predicates must be bound (Sub. ?p Obj. is not allowed)
        • CONSTRUCT, DESCRIBE and ASK are not supported
        • GRAPH, UNION and OPTIONAL are not supported

  19. Paper Evaluation
     • Pros
        • Good idea
        • Distributed SPARQL processing is a relatively new research field
        • Defines service descriptions
        • Deals with all aspects of a query engine
        • Implementation
     • My comments
        • Too simple, and still slow
        • Many limitations
