
Efficient SPARQL Query Processing in MapReduce through Data Partitioning and Indexing


Presentation Transcript


  1. Efficient SPARQL Query Processing in MapReduce through Data Partitioning and Indexing Nie Zhi niezhixuesen@163.com

  2. Outline • Introduction • Related work • SPARQL Query Processing in MapReduce • Experiments • Conclusion

  3. Outline • Introduction • Related work • SPARQL Query Processing in MapReduce • Experiments • Conclusion

  4. RDF • Resource Description Framework • Data is expressed as subject-predicate-object (S-P-O) triples • Example (namespace http://www.mpii.de/yago/resource/): the subject Albert Einstein isCalled "Albert Einstein" and "阿尔伯特•爱因斯坦", wasBornIn Ulm, and hasWonPrize Nobel Prize in Physics

  5. SPARQL Query Language for RDF • Query: PREFIX source:<http://www.mpii.de/yago/resource/> SELECT ?name ?where WHERE { ?who source:hasWonPrize Nobel Prize in Physics. ?who source:isCalled ?name. ?who source:wasBornIn ?where } • [Figure: the query pattern matched against the Albert Einstein example graph (isCalled, wasBornIn Ulm, hasWonPrize Nobel Prize in Physics)]

  6. RDF knowledge bases • Semantic Web, Web 2.0 • Extract knowledge from the Web • Examples: YAGO, DBpedia, Freebase, Billion Triple Challenge…

  7. RDF knowledge base • 295 data sets • 31 billion RDF triples • 504 million RDF links (September 2011)

  8. Challenge and Opportunity • Challenge: RDF data is growing rapidly, and researchers are working with billions of triples. Relational databases have limited scalability. • Opportunity: Google GFS, MapReduce, BigTable • Hadoop: an implementation of the MapReduce framework and HDFS • Adopters: Yahoo!, Amazon, Tencent, Baidu, Taobao... • We need to consider these recent achievements for handling massive-scale Web data on clusters

  9. MapReduce: word count • Map(k1,v1) → list(k2,v2) • Reduce(k2, list(v2)) → list(k3,v3) • Input: file1: the weather is good; file2: today is good; file3: good weather is good • Map output: Worker 1: (the 1), (weather 1), (is 1), (good 1); Worker 2: (today 1), (is 1), (good 1); Worker 3: (good 1), (weather 1), (is 1), (good 1) • Reduce input: Worker 1: (the 1); Worker 2: (is 1), (is 1), (is 1); Worker 3: (weather 1), (weather 1); Worker 4: (today 1); Worker 5: (good 1), (good 1), (good 1), (good 1) • Reduce output: Worker 1: (the 1); Worker 2: (is 3); Worker 3: (weather 2); Worker 4: (today 1); Worker 5: (good 4)
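The word-count flow on this slide can be sketched in a few lines of plain Python. This is only a single-process illustration of the Map, shuffle, and Reduce steps, not Hadoop code; the function names `map_phase`, `shuffle`, and `reduce_phase` are made up for the example.

```python
from collections import defaultdict

def map_phase(text):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in one file."""
    return [(word, 1) for word in text.split()]

def shuffle(pairs):
    """Group the map output by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(word, counts):
    """Reduce(k2, list(v2)) -> (k3, v3): sum the counts collected for one word."""
    return (word, sum(counts))

# The three input files from the slide.
files = {
    "file1": "the weather is good",
    "file2": "today is good",
    "file3": "good weather is good",
}

map_output = [pair for text in files.values() for pair in map_phase(text)]
reduce_input = shuffle(map_output)
word_counts = dict(reduce_phase(w, c) for w, c in reduce_input.items())
print(word_counts)
```

In a real cluster each call to `map_phase` and `reduce_phase` would run on a different worker; here they run sequentially so the data flow is easy to follow.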

  10. Outline • Introduction • Related work • SPARQL Query Processing in MapReduce • Experiments • Conclusion

  11. Solution 1 • Directly map the SPARQL query onto a sequence of MapReduce jobs • Pro. • Scalable • Con. • Burdens the user in terms of usage and maintenance • Does not support complex queries • No index • Does not consider the characteristics of RDF data

  12. Solution 2 • Map the SPARQL query to Pig -> MapReduce jobs • Pro. • Scalable • Supports complex queries • Con. • No index • Does not consider the characteristics of RDF data

  13. Outline • Introduction • Related work • SPARQL Query Processing in MapReduce • Experiments • Conclusion

  14. Architecture overview • [Figure: layered architecture] • SPARQL Translator (BGP, Union, Filter, Optional) and RDF 2 JSON Loader • JAQL Query Language (Transform, Filter, Join, Sort, Group, Built-in Functions) • Optimizer • JSON Data Model • Map-Reduce Runtime • HDFS • Cluster Deployment and Management

  15. JSON • JSON (JavaScript Object Notation) is a lightweight data-interchange format • It is based on a subset of the JavaScript Programming Language • JSON is built on two structures: • A collection of name/value (Key/value) pairs • An ordered list of values (array)

  16. RDF to JSON • JSON is built on two structures: • name/value (Key/value) pairs {s:Albert Einstein} • list of values(array) [{s:Albert Einstein},{}…]
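As a rough illustration of the RDF-to-JSON step, the sketch below turns triples into an array of name/value records using the `s`, `p`, `o` keys that appear on the slides. The helper name `triple_to_record` and the exact record layout are assumptions for the example; the slides do not show the loader itself.

```python
import json

def triple_to_record(s, p, o):
    """Represent one RDF triple as a JSON object of name/value pairs."""
    return {"s": s, "p": p, "o": o}

# Triples taken from the Albert Einstein example earlier in the deck.
triples = [
    ("Albert Einstein", "isCalled", "Albert Einstein"),
    ("Albert Einstein", "wasBornIn", "Ulm"),
    ("Albert Einstein", "hasWonPrize", "Nobel Prize in Physics"),
]

# A JSON array (ordered list) of records, one per triple.
records = [triple_to_record(*t) for t in triples]
json_text = json.dumps(records, ensure_ascii=False)
print(json_text)
```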

  17. JAQL • JAQL is an open-source language for querying JSON (JavaScript Object Notation) data • It provides a general parallel data processing platform on Hadoop • Developed by IBM

  18. Basic Idea • SPARQL can be supported on Hadoop by translating queries into JAQL operators

  19. SPARQL to JAQL Transformation • [Figure: the numbered steps of the translated query mapped to MapReduce job1 through MapReduce job4] • Example record: {s:Albert Einstein, p:isCalled, o:Albert Einstein}
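The idea of evaluating a SPARQL basic graph pattern with filter and join operators can be sketched as follows. This is plain Python standing in for the operators, not JAQL syntax and not the authors' translator; the helper names and the extra "Person X" record are hypothetical. It evaluates the three-pattern query from the SPARQL slide: each triple pattern becomes a filter over the records, and patterns sharing `?who` are joined on the subject.

```python
def filter_pattern(records, predicate, obj=None):
    """Filter step: keep records matching one triple pattern (fixed predicate, optional fixed object)."""
    return [r for r in records
            if r["p"] == predicate and (obj is None or r["o"] == obj)]

def join_on_subject(bindings, matches, var):
    """Join step: extend each binding of ?who with the objects found for the same subject."""
    by_subject = {}
    for r in matches:
        by_subject.setdefault(r["s"], []).append(r["o"])
    return [dict(b, **{var: o})
            for b in bindings
            for o in by_subject.get(b["who"], [])]

records = [
    {"s": "Albert Einstein", "p": "isCalled", "o": "Albert Einstein"},
    {"s": "Albert Einstein", "p": "isCalled", "o": "阿尔伯特•爱因斯坦"},
    {"s": "Albert Einstein", "p": "wasBornIn", "o": "Ulm"},
    {"s": "Albert Einstein", "p": "hasWonPrize", "o": "Nobel Prize in Physics"},
    # Hypothetical record with no prize, so it must not appear in the result.
    {"s": "Person X", "p": "wasBornIn", "o": "City Y"},
]

# ?who hasWonPrize "Nobel Prize in Physics"
winners = [{"who": r["s"]}
           for r in filter_pattern(records, "hasWonPrize", "Nobel Prize in Physics")]
# ?who isCalled ?name
with_name = join_on_subject(winners, filter_pattern(records, "isCalled"), "name")
# ?who wasBornIn ?where
results = join_on_subject(with_name, filter_pattern(records, "wasBornIn"), "where")

answers = [(b["name"], b["where"]) for b in results]
print(answers)
```

On a cluster, each join would typically be carried out by its own MapReduce job, which is why the number of joins matters for query cost.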

  20. Data storage • In the Hadoop framework, a file is the smallest unit of input that a MapReduce job reads from disk. • The most straightforward strategy is to store all the data in one file, but then every read operation must scan the entire data set. • This motivates a data partitioning strategy.

  21. Data Partitioning Strategy • Horizontal partitioning • Vertical partitioning • Clustered property partitioning
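The three strategies can be contrasted with a small sketch. The slides name the strategies but the transcript does not preserve their definitions, so the interpretations below are assumptions: horizontal partitioning splits the records into fixed-size chunks regardless of content, vertical partitioning keeps one partition per predicate, and clustered property partitioning keeps one partition per group of related predicates. The function names and the cluster names are hypothetical.

```python
from collections import defaultdict

def horizontal_partition(records, chunk_size):
    """Assumed meaning: split records into fixed-size chunks, ignoring their content."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def vertical_partition(records):
    """Assumed meaning: one partition per predicate."""
    parts = defaultdict(list)
    for r in records:
        parts[r["p"]].append(r)
    return dict(parts)

def clustered_property_partition(records, clusters):
    """Assumed meaning: one partition per cluster of predicates; unlisted predicates go to 'other'."""
    predicate_to_cluster = {p: name for name, preds in clusters.items() for p in preds}
    parts = defaultdict(list)
    for r in records:
        parts[predicate_to_cluster.get(r["p"], "other")].append(r)
    return dict(parts)

records = [
    {"s": "Albert Einstein", "p": "isCalled", "o": "Albert Einstein"},
    {"s": "Albert Einstein", "p": "wasBornIn", "o": "Ulm"},
    {"s": "Albert Einstein", "p": "hasWonPrize", "o": "Nobel Prize in Physics"},
]

horizontal = horizontal_partition(records, chunk_size=2)
vertical = vertical_partition(records)
# Hypothetical clustering: predicates that tend to be queried together share a partition.
clustered = clustered_property_partition(
    records, {"person": {"isCalled", "wasBornIn"}, "awards": {"hasWonPrize"}}
)
print(len(horizontal), sorted(vertical), sorted(clustered))
```

Under these assumptions, a star query over `isCalled` and `wasBornIn` would read two partitions with vertical partitioning but only one with the clustered layout.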

  22. Horizontal partitioning with JSON • For example • Store in HDFS

  23. Vertical Partitioning with JSON • For example • Store in HDFS

  24. Clustered property partitioning with JSON • For example • Store in HDFS

  25. Partition Index: Vertical Partitioning

  26. Partition Index: Horizontal partitioning

  27. Partition Index: Clustered property partitioning
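The transcript keeps only the titles of the partition index slides, so the following is a guess at the general idea rather than the authors' design: an index that maps each predicate to the partition files containing it, so that a query only reads the files its predicates need instead of scanning everything. The file names and function names are hypothetical.

```python
from collections import defaultdict

def build_partition_index(partitions):
    """Map each predicate to the set of partition files that contain it."""
    index = defaultdict(set)
    for file_name, recs in partitions.items():
        for r in recs:
            index[r["p"]].add(file_name)
    return index

def files_for_query(index, predicates):
    """Return the partition files that must be read to answer a query over the given predicates."""
    files = set()
    for p in predicates:
        files |= index.get(p, set())
    return sorted(files)

# Hypothetical clustered-property layout with two partition files.
partitions = {
    "person.json": [
        {"s": "Albert Einstein", "p": "isCalled", "o": "Albert Einstein"},
        {"s": "Albert Einstein", "p": "wasBornIn", "o": "Ulm"},
    ],
    "awards.json": [
        {"s": "Albert Einstein", "p": "hasWonPrize", "o": "Nobel Prize in Physics"},
    ],
}

index = build_partition_index(partitions)
needed = files_for_query(index, ["isCalled", "wasBornIn"])
print(needed)
```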

  28. Outline • Introduction • Related work • SPARQL Query Processing in MapReduce • Experiments • Conclusion

  29. Experiments • Dataset: Billion Triples Challenge 2010 (BTC10) • 3.2B <s, p, o, q> quads, 624 GB; the resulting dataset has 1,426,823,976 unique triples • Hadoop 0.20.2, Ubuntu 10.04, Linux 2.6.32-24-server 64-bit • 30 nodes: one node is the master, and the others are slaves • 47 GB memory, 4.3 TB disk space, and 24 processors (Intel(R) Xeon(R) CPU E5645 @ 2.40GHz) • "dfs.replication" is 2 • JAQL version 0.5.1 • Java 1.6

  30. Experiments Fig. Distribution of data

  31. Experiments Fig. Cost time of each query

  32. Outline • Introduction • Related work • SPARQL Query Processing in MapReduce • Experiments • Conclusion

  33. Conclusion • A solution for SPARQL queries in MapReduce: transform the queries into JAQL operators running on Hadoop • Transformation of SPARQL to JAQL: Filter, Transform, Join, … • Data partitioning strategies: horizontal partitioning, vertical partitioning, clustered property partitioning • Experiments show the performance: clustered property partitioning performs best, and horizontal partitioning performs worst

  34. Scalability • RDBMS: waits and deadlocks increase nonlinearly with transaction size and concurrency • Scale-up (vertical scaling): commercial RDBMSes are very expensive • Schema: structured data • MapReduce: linear scaling, high throughput • Scale-out (horizontal scaling) • Schema-free: unstructured data

  35. RDBMS vs. MapReduce • Table. RDBMS compared to MapReduce

  36. Limits of Hadoop • The Apache Hadoop MapReduce framework has hit a scalability limit at around 4,000 machines • The MapReduce JobTracker needs a drastic overhaul to address several deficiencies in its scalability, memory consumption, threading model, reliability, and performance

  37. The Next Generation of Apache Hadoop MapReduce • Divide the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate components • ResourceManager • ApplicationMaster

  38. Conclusion • SPARQL on Cloud • Pro. • Scalable • High throughput • Con. • Expense of latency • Complex query: JAQL • Join operation • Hadoop (MapReduce) • Pro. • Scalable • High throughput • Con. • Expense of latency • No index • No more than 4,000 nodes

  39. Thanks!

  40. SPARQL queries Q1: select ?X ?Y where { ?X rdfs:label Albert Einstein. ?X smc:page ?Y. ?X rdf:type smc:Subject. } Q2: select ?x ?y ?z where { dbsc:Ulm rdf:type ?x. ?x rdfs:label ?y. ?x rdfs:comment ?z. } Q3: select ?who ?Y ?date1 ?Z ?date2 ?prize where { ?who source:bornIn ?Y. ?who source:bornOnDate ?date1. ?who source:diedIn ?Z. ?who source:diedOnDate ?date2. ?who source:hasWonPrize ?prize. } Q4: select ?x ?author ?title where { ?x purl:hasAuthor ?author. ?x purl:hasBooktitle ISWC 2009. ?x purl:hasTitle ?title. } Q5: select distinct ?name ?lat ?long ?pop where { ?a property:name ?name. ?a property:region dbsc:Nord-Pas-de-Calais. ?a pos:lat ?lat. ?a pos:long ?long. ?a property:population ?pop. }

  41. SPARQL queries Q6: select ?bn ?b ?p where { ?a property:name ?bn. ?a property:dateOfBirth ?b. ?a property:placeOfBirth ?p. } Q7: select ?Y ?type ?prize where { source:Albert_Einstein source:bornIn ?Y. source:Albert_Einstein rdf:type ?type. source:Albert_Einstein source:hasWonPrize ?prize. } Q8: select ?a ?type ?pub where { ?a rdf:type ?type. ?a semweb:publisher ?pub. ?a semweb:periodical_title Theory of Computing Systems. } Q9: select distinct ?a ?lat ?long ?pop where { ?a geo:ontology#name Chevilly. ?a geo:ontology#inCountry geo:countries#FR. ?a pos:lat ?lat. ?a pos:long ?long. ?a geo:ontology#population ?pop. } Q10: select distinct ?l ?long ?lat where { ?a property:placeOfBirth ?l. ?l pos:lat ?lat. ?l pos:long ?long. }

  42. SPARQL queries • Q3 and Q10 are star-join queries with popular predicates and unspecified objects • Q1, Q4, Q5, Q6, Q8, and Q9 are also star joins, but with one or more known objects • Q2 is a chain query • In Q7, the subject is a fixed value rather than a variable
