
ESWC 2009 Research IX: Evaluation and Benchmarking

Benchmarking Fulltext Search Performance of RDF Stores
Enrico Minack, Wolf Siberski, Wolfgang Nejdl
L3S Research Center, Universität Hannover, Germany
{minack,siberski,nejdl}@L3S.de
03.06.2009


Presentation Transcript


  1. ESWC 2009 Research IX: Evaluation and Benchmarking
     Benchmarking Fulltext Search Performance of RDF Stores
     Enrico Minack, Wolf Siberski, Wolfgang Nejdl
     L3S Research Center, Universität Hannover, Germany
     {minack,siberski,nejdl}@L3S.de
     03.06.2009
     http://www.l3s.de/~minack/rdf-fulltext-benchmark/

  2. Outline
     • Motivation
     • Benchmark
       - Data set and Query set
     • Evaluation
       - Methodology and Results
     • Conclusion
     • References

  3. 1. Motivation
     • Semantic applications provide fulltext search
     • Underlying RDF stores have to provide fulltext search
     • Application developers have to choose an RDF store
     • Best practice: benchmark
     • No fulltext search benchmark for RDF exists
     • RDF store developers perform ad hoc benchmarks
     • ⇒ Strong need for an RDF fulltext benchmark

  4. 2. Benchmark
     • Extended Lehigh University Benchmark [LUBM]
     • Synthetic data, fixed list of queries
     • Familiar but not trivial ontology
       - University, Faculty, Professors, Students, Courses, …
     • Realistic structural properties
     • Artificial literal data
       - „Professor1“, „GraduateStudent216“, „Course7“

  5. 2. Benchmark

  6. 2. Benchmark

  7. 2.1 Data set
     • Added
       - Person names (first name, surname) following a real-world distribution
       - Publication content following topic-mixture-based word distributions trained on a real document collection [LSA]

  8. 2.1 Data set (Person Names)
     • Probabilities from U.S. Census 1990 (http://www.census.gov/genealogy/names/)
       - 1,200 male first names
       - 4,300 female first names
       - 19,000 surnames
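
A minimal sketch of how names can be drawn so that they follow a given frequency distribution, as this slide describes. The name tables and weights below are made-up placeholders, not the census figures, and `sample_full_name` is a hypothetical helper, not the benchmark generator's actual code.

```python
import random
from typing import Tuple

# Placeholder frequency tables (name -> relative frequency).
# Illustrative values only; these are NOT the U.S. Census 1990 data.
FIRST_NAMES = {"james": 3.3, "mary": 2.6, "john": 3.2, "linda": 1.0}
SURNAMES = {"smith": 1.0, "johnson": 0.8, "williams": 0.7}


def sample_full_name(rng: random.Random) -> Tuple[str, str]:
    """Draw a (first name, surname) pair, each proportional to its frequency."""
    first = rng.choices(list(FIRST_NAMES), weights=list(FIRST_NAMES.values()), k=1)[0]
    last = rng.choices(list(SURNAMES), weights=list(SURNAMES.values()), k=1)[0]
    return first, last


rng = random.Random(42)  # fixed seed so the generated data is reproducible
names = [sample_full_name(rng) for _ in range(1000)]
```

With weighted sampling, frequent surnames such as "smith" show up in many generated persons, which is what makes a keyword query on a surname return a realistic number of hits.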

  9. 2.1 Data set (Publication Text)
     • Probabilistic topic model
       - 100 topics (word probabilities)
       - Topics of documents
       - Topic occurrence probability
       - Topic co-occurrence probability
     • Trained on the NIPS data set (1,740 documents)

  10. 2.1 Data set (Publication Text)
      [Diagram labels: Graduate Student, Faculty, Professor; Topic (×3); Publication]
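
The idea of generating publication text from a topic mixture can be sketched as follows. This is a generic topic-model sampler under stated assumptions, not the authors' generator: the two toy topics, their word probabilities, and the helper `generate_text` are all hypothetical, and how a publication's mixture is derived from its authors' topics is not shown.

```python
import random
from typing import Dict, List

# Toy topic model: topic -> {word: probability}. Illustrative values only;
# the slides mention 100 topics trained on the NIPS collection.
TOPICS: Dict[str, Dict[str, float]] = {
    "networks": {"network": 0.5, "layer": 0.3, "training": 0.2},
    "engineering": {"engineer": 0.4, "design": 0.4, "system": 0.2},
}


def generate_text(topic_mixture: Dict[str, float], length: int,
                  rng: random.Random) -> List[str]:
    """Generate `length` words: pick a topic per word, then a word from that topic."""
    topic_names = list(topic_mixture)
    topic_weights = list(topic_mixture.values())
    words = []
    for _ in range(length):
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        vocab = TOPICS[topic]
        words.append(rng.choices(list(vocab), weights=list(vocab.values()), k=1)[0])
    return words


rng = random.Random(7)
# Hypothetical publication whose topic mixture leans towards "networks".
text = generate_text({"networks": 0.8, "engineering": 0.2}, length=20, rng=rng)
```

Because words are drawn from trained distributions instead of labels like "Course7", keyword frequencies and co-occurrences in the generated text resemble those of real documents.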

  11. 2.1 Data set (Statistics)

  12. 2.2 Query set
      • Three sets of queries
        - Basic IR Queries
        - Semantic IR Queries
        - Advanced IR Queries

  13. 2.2 Query set (Basic IR Queries)
      • Pure IR queries
      • Q1 to Q5 [query diagrams not preserved; labels shown: „engineer“, „network“, „network engineer“, „smith“, „Smith“, ub:publicationText, ub:surname]

  14. 2.2 Query set (Semantic IR Queries)
      • Q6 to Q9 [query diagrams not preserved; labels shown: ub:Publication, ub:publicationText „engineer“, ub:title ?title, ub:publicationAuthor, ub:FullProfessor, ub:fullname „smith“, ?name]

  15. 2.2 Query set (Semantic IR Queries)
      • Q10, Q11 [query diagrams not preserved; labels shown: ub:Publication, ub:publicationText „engineer“, ub:publicationText „network“, ub:publicationAuthor, ub:FullProfessor, ub:fullname „smith“]

  16. 2.2 Query set (Advanced IR Queries)
      • Q12: „+network +engineer“
      • Q13: „+network -engineer“
      • Q14: „network engineer“~10
      • Q15: „engineer*“
      • Q16: „engineer?“
      • Q17: „engineer“~0.8
      • Q18: „engineer“ → Score
      • Q19: „engineer“ → Snippet
      • Q20: „network“ → Top 10
      • Q21: „network“ → Score > 0.75
      • Property: ub:publicationText
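
To illustrate what the two wildcard queries (Q15 and Q16) ask for, here is a small sketch. It uses Python's shell-style matcher `fnmatchcase` as a stand-in, since `*` (any number of characters, including none) and `?` (exactly one character) have the same meaning there. The term list and the helper `matching_terms` are hypothetical, and this is not how any of the evaluated stores implements wildcard search.

```python
from fnmatch import fnmatchcase
from typing import List

# Hypothetical index terms, for illustration only.
TERMS = ["engine", "engineer", "engineers", "engineering"]


def matching_terms(pattern: str, terms: List[str]) -> List[str]:
    """Return the terms that match a wildcard pattern (case-sensitive)."""
    return [t for t in terms if fnmatchcase(t, pattern)]


star = matching_terms("engineer*", TERMS)    # ['engineer', 'engineers', 'engineering']
single = matching_terms("engineer?", TERMS)  # ['engineers']
```

The `?` query is the stricter of the two: it excludes "engineer" itself because exactly one additional character is required.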

  17. 3. Evaluation
      • 2 GHz AMD Athlon 64-bit dual-core processor
      • 3 GB RAM, RAID 5 array
      • GNU/Linux, Java SE Runtime Environment 1.6.0_10 with 2 GB memory
      • Jena 2.5.6 + TDB
      • Sesame 2.2.1 NativeStore + LuceneSail
      • Virtuoso 5.0.9
      • YARS post beta 3

  18. 3.1 Evaluation Methodology
      • Evaluated LUBMft(N) with N ∈ {1, 5, 10, 50}
      • For each store:
        - For each query:
          - Flush the file system cache
          - Start the store
          - Repeat 6 times:
            - Evaluate the query
            - If evaluation time > 1,000 s, break
          - Stop the store
      • Performed 5 times
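
The measurement procedure on this slide can be sketched as a nested loop. This is one reading of the slide, not the authors' harness: the nesting is an interpretation, the store interface (`name`, `start`, `evaluate`, `stop`), the `flush_cache` callback, and `DummyStore` are hypothetical, and only the constants 6, 1,000 s and 5 come from the slide.

```python
import time
from typing import Callable, Dict, List, Tuple

TIMEOUT_S = 1000.0  # from the slide: stop repeating once an evaluation exceeds 1,000 s
REPETITIONS = 6     # from the slide: evaluate each query 6 times per store start
RUNS = 5            # from the slide: the whole procedure is performed 5 times


def run_benchmark(stores, queries, flush_cache: Callable[[], None],
                  clock: Callable[[], float] = time.perf_counter
                  ) -> Dict[Tuple[str, str], List[float]]:
    """Collect evaluation times per (store name, query) over all runs."""
    timings: Dict[Tuple[str, str], List[float]] = {}
    for _ in range(RUNS):
        for store in stores:
            for query in queries:
                flush_cache()  # start every query with a cold file system cache
                store.start()
                try:
                    for _ in range(REPETITIONS):
                        begin = clock()
                        store.evaluate(query)
                        elapsed = clock() - begin
                        timings.setdefault((store.name, query), []).append(elapsed)
                        if elapsed > TIMEOUT_S:
                            break  # too slow: skip the remaining repetitions
                finally:
                    store.stop()
    return timings


class DummyStore:
    """Stand-in used only to exercise the loop; not a real RDF store client."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.starts = 0

    def start(self) -> None:
        self.starts += 1

    def evaluate(self, query: str) -> list:
        return []

    def stop(self) -> None:
        pass


store = DummyStore("dummy")
timings = run_benchmark([store], ["Q1", "Q2"], flush_cache=lambda: None)
```

Restarting the store and flushing the cache before each query means the first of the six repetitions measures a cold start, while the later ones measure warm-cache performance.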

  19. 3.2 Evaluation Results
      • Basic IR Queries [chart not preserved; labels shown: „engineer“, „network“]

  20. 3.2 Evaluation Results
      • Semantic IR Queries [chart not preserved; labels shown: ub:Publication, ub:publicationText „engineer“, ub:title ?title, ub:publicationAuthor, ub:FullProfessor, ub:fullname „smith“, ?name]

  21. 3.2 Evaluation Results
      • Semantic IR Queries [chart not preserved; labels shown: ub:Pub, ub:pubText „engineer“, ub:pubText „network“, ub:pubAuth, ub:FullProf, ub:full „smith“]

  22. 3.2 Evaluation Results
      • Advanced IR Queries
        - Same relative performance
        - Feature richness: Sesame (10), Jena (9), YARS (5), Virtuoso (1)

  23. 4. Conclusion
      • Identified strong need for a fulltext benchmark
        - For semantic application and RDF store developers
      • Extended LUBM towards a fulltext benchmark
        - Other benchmarks can be extended similarly
      • RDF stores provide many IR features
        - Boolean, phrase, proximity, fuzzy queries
      • Multiple fulltext queries in one query are challenging

  24. 5. References
      • [LSA] Handbook of Latent Semantic Analysis. Lawrence Erlbaum Associates, Mahwah, N.J. (2007).
      • [LUBM] Guo, Y., et al.: LUBM: A Benchmark for OWL Knowledge Base Systems. Journal of Web Semantics 3(2), 158-182 (2005).
      • [LuceneSail] Minack, E., et al.: The Sesame LuceneSail: RDF Queries with Full-text Search. Technical Report 2008-1, NEPOMUK (February 2008).
      • [Sesame] Broekstra, J., et al.: Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema. In: Horrocks, I., Hendler, J. (eds.) ISWC 2002. LNCS, vol. 2342, pp. 54-68. Springer, Heidelberg (2002).
      • [Jena] Carroll, J.J., et al.: Jena: Implementing the Semantic Web Recommendations. In: WWW Alternate track papers & posters, pp. 74-83. ACM, New York (2004).
      • [YARS] Harth, A., Decker, S.: Optimized Index Structures for Querying RDF from the Web. In: Proceedings of the 3rd Latin American Web Congress. IEEE Press, Los Alamitos (2005).
