Routing of Structured Queries in Large-Scale Distributed Systems

Routing of Structured Queries in Large-Scale Distributed Systems. Workshop on Large-Scale Distributed Systems for Information Retrieval (LSDS_IR'08) @ ACM 17th CIKM 2008, Napa Valley, California, USA, Oct 2008. Judith Winter

Presentation Transcript


  1. Routing of Structured Queries in Large-Scale Distributed Systems. Workshop on Large-Scale Distributed Systems for Information Retrieval (LSDS_IR'08) @ ACM 17th CIKM 2008, Napa Valley, California, USA, Oct 2008. Judith Winter, Institute for Informatics / Telematics Group, Goethe-University Frankfurt am Main, Germany

  2. Overview:
     • 1. Introduction
     • 2. Concept & Architecture
     • 3. Routing
     • 4. Evaluation
     • 5. Questions and Discussion
     Section 1: Introduction

  3. Proposed research:
     • XML Information Retrieval in P2P systems
     • Investigate the impact of using structural information when retrieving XML documents in a P2P network
     • Challenge: not all information is accessible; scalability issues
     How can query routing in a large-scale P2P system be performed and improved by using structural information?

  4. XML Information Retrieval in Peer-to-Peer Systems:
     • Challenges:
       • no central index
       • only selected information available
       • bandwidth consumption / communication overhead
       • efficiency vs. effectiveness
     • Fields combined (diagram labels): Information Retrieval, Peer-to-Peer, XML-Retrieval
     • Characteristics noted in the diagram: vague queries; relevance ranking; structured documents; more precise search; based on client/server architectures; distributed; autonomous peers; growing amount of XML documents

  5. Section 2: Concept & Architecture

  6. Concept for a P2P search engine:
     • Queries: content-and-structure (CAS)
     • Indexing: include structure
     • Hybrid indexing: globally or locally (distributing summaries) depending on peer status → index with posting lists (document level) & peer lists (peer level)
     • Distributing global information into a DHT
     • Ranking: extended vector space model (using structure)
     • Results/retrieval units: document or element retrieval

  7. Concept for a P2P search engine:
     • Routing:
       • Use peer lists and posting lists
       • Use of pre-computed posting lists for popular term combinations → highly discriminative keys (HDKs)
       • Use of pruned posting lists by considering structural information
       • Ordering of posting lists by a query-independent score (evidence from document, element, collection, and peer level)
       • Select the top k results according to a pre-ranking based on the structural similarity between CAS query and posting key

  8. Architecture (diagram labels, grouped in order of appearance):
     • Data stores: Frequent XTerm index, HDK index, P2P network, DL, Local documents
     • APPLICATION: GUI, Indexing, Querying & result presentation, Querying component
     • INFORMATION RETRIEVAL: Index storage component, Inverted index, Statistics index, Document index, Retrieval unit index, Similarity calculator, Retrieval component, Ranking component, Routing component, Weighting calculator, Source selector
     • PEER-TO-PEER: P2P component, SpirixDHT, PeerMetrics calculator, SimulationDHT, Chord

  9. Section 3: Routing

  10. Example: routing the query q = {apple, \book} in a ring of eight peers (P0 to P7)
     • Peer P0 looks for books about apples.
     • The ID i0 = hash(apple, \book) = hash(apple) is calculated.
     • Peer P5, which is assigned to i0, is located in log(n) hops.
     • Query q is sent to P5.
     • P5 selects the top k=2 postings for q; these relate to dok1 and dok2.
     • The IDs i1 = hash(dok1) and i2 = hash(dok2) are calculated and their peers located.
     • q is sent to P2 and P6, which are assigned to i1 and i2.
     • P2 and P6 calculate the relevance of dok1 and dok2 plus their retrieval units (RUs).
     • P2 and P6 send their results back to P0.
     Posting lists at P5 (q is assigned to hash(apple)):
     • apple, \book → dok1(4.8), dok2(4.1), dok3(3.7), …
     • apple, \novel → dok2(12.9)
     • apple, \article\p\sec → ----
     Document vectors: Dok1 = (0,1,5,1,3,…), Dok2 = (1,4,0,0,3,…)
     Results sent back: {(dok1/sec, 5.4)} for dok1 and {(dok2, 12.4), (dok2/chap, 11.2)} for dok2
     Result list at P0: (dok2, 12.4), (dok2/chap, 11.2), (dok1/sec, 5.4)
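
The first steps of this example (hashing the key, locating the responsible peer, selecting the top k postings) can be sketched in Python. This is only an illustration under stated assumptions: `dht_id`, `key_id`, `responsible_peer`, and `top_k` are hypothetical helper names, SHA-1 is an assumed hash function, and the modulo assignment stands in for a real Chord lookup, so the peers computed here will generally not be P5, P2, and P6 as on the slide.

```python
import hashlib
import heapq

NUM_PEERS = 8  # P0..P7, as in the slide's ring


def dht_id(text: str) -> int:
    """Hypothetical ID function: SHA-1 digest of the text read as an integer."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest(), "big")


def key_id(term: str, structure: str) -> int:
    # As on the slide, only the term is hashed: hash(apple, \book) = hash(apple).
    # All structure variants of a term therefore end up on the same peer.
    return dht_id(term)


def responsible_peer(identifier: int) -> str:
    # Assumption: plain modulo assignment instead of a Chord successor lookup.
    return f"P{identifier % NUM_PEERS}"


def top_k(postings, k):
    """Return the k postings with the highest score, best first."""
    return heapq.nlargest(k, postings, key=lambda posting: posting[1])


# Posting lists stored at the peer responsible for hash(apple), from the slide.
postings_at_key_peer = {
    ("apple", r"\book"): [("dok1", 4.8), ("dok2", 4.1), ("dok3", 3.7)],
    ("apple", r"\novel"): [("dok2", 12.9)],
}

query = ("apple", r"\book")
key_peer = responsible_peer(key_id(*query))          # peer holding the posting lists
selected = top_k(postings_at_key_peer[query], k=2)   # postings for dok1 and dok2
doc_peers = {doc: responsible_peer(dht_id(doc)) for doc, _ in selected}

print("key peer:", key_peer)
print("selected postings:", selected)
print("peers to forward q to:", doc_peers)
```

Hashing only the term keeps every posting list for "apple" on one peer, which is what allows that peer to compare the query's structure against all stored structures before choosing postings.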

  11. Routing process: (diagram; no text in the transcript)

  12. Weighting of postings (query-independent, at indexing time):
     • Entries are sorted by score_t(d_i); the k best entries are chosen for XTerm t
     • Considers document d_i, best retrieval unit ru_best, and peer p_i
     • Weighting function w: BM25e-based
     • PeerScore: high for peers with good collections regarding t and with good performance metrics

  13. Selection of postings (query-dependent reordering):
     Example: q = { (apple, \book\chapter), (chips, \section) }
     • apple, \book\chapter → dok1(12.8), dok2(12.4); sim = 1
     • apple, \article\p → dok2(25.3), dok3(12.7), dok4(10.7); sim = 0
     • chips, \book\c1\section → dok4(18.4), dok2(3.1), dok1(2.3), dok3(1.5); sim = 0.7
     Final posting list = { dok2(12.4*1 + 3.1*0.7 = 14.6), dok1(12.8*1 + 2.3*0.7 = 14.4), dok4(18.4*0.7 = 12.9), dok3(1.5*0.7 = 1.1) }
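
The arithmetic of this example can be reproduced with a few lines of Python. The sketch below assumes that each posting's score is multiplied by the structural similarity of its key and that the products are summed per document, which is what the numbers on the slide suggest; the function name `rerank` and the data layout are hypothetical, and the similarity values are taken from the slide rather than computed.

```python
from collections import defaultdict

# Posting lists from the slide: (term, structure) -> [(document, score)]
posting_lists = {
    ("apple", r"\book\chapter"): [("dok1", 12.8), ("dok2", 12.4)],
    ("apple", r"\article\p"): [("dok2", 25.3), ("dok3", 12.7), ("dok4", 10.7)],
    ("chips", r"\book\c1\section"): [
        ("dok4", 18.4), ("dok2", 3.1), ("dok1", 2.3), ("dok3", 1.5),
    ],
}

# Structural similarity between each posting key and the matching query part,
# copied from the slide (not derived here).
similarity = {
    ("apple", r"\book\chapter"): 1.0,
    ("apple", r"\article\p"): 0.0,
    ("chips", r"\book\c1\section"): 0.7,
}


def rerank(posting_lists, similarity):
    """Sum score * similarity per document and sort best first."""
    totals = defaultdict(float)
    for key, postings in posting_lists.items():
        sim = similarity[key]
        if sim == 0:
            continue  # structurally unrelated lists contribute nothing
        for doc, score in postings:
            totals[doc] += score * sim
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


final = rerank(posting_lists, similarity)
for doc, score in final:
    print(f"{doc}: {score:.2f}")
```

The resulting order is dok2, dok1, dok4, dok3, matching the final posting list above. Note that dok2's high score of 25.3 under \article\p is ignored entirely because that structure has similarity 0 to the query.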

  14. Section 4: Evaluation

  15. Implementation:
     • Implementation of SPIRIX: Search Engine for P2P Information Retrieval in XML-Documents
     • P2P-complex:
       • Based on OpenChord
       • Collects peer characteristics
       • Adapted to special requirements of XML IR
     • Preliminary evaluation with INEX-Collection

  16. Evaluation:
     • Evaluation with the INEX collection of 2007:
       • Wikipedia collection: 660,000 documents (4.6 GB)
       • 80 CAS queries (out of 123 topics)
       • run on 1 peer with SimulationDHT (measurement of #postings)
       • retrieval of the best 1500 results per query
       • PLmax set to unlimited (→ all HDKs are single XTerms)
       • different structural similarity functions
       • simple version of the proposed formulas (document-based)
     • Goal: show the effect of using structural hints for routing
       • efficiency (#postings: 100, 500, 2000 postings)
       • effectiveness (precision at different recall levels)
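
To make "precision at different recall levels" concrete, here is a minimal Python sketch of interpolated precision on a ranked result list. It is a document-level simplification assumed for illustration, not the evaluation code used for these slides; the function name `precision_at_recall`, the toy ranking, and the relevance judgments are all hypothetical.

```python
def precision_at_recall(ranked_docs, relevant_docs, recall_level):
    """Interpolated precision: the best precision reached at any rank
    whose recall is at least recall_level."""
    if not relevant_docs:
        return 0.0
    hits = 0
    best = 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant_docs:
            hits += 1
            recall = hits / len(relevant_docs)
            if recall >= recall_level:
                best = max(best, hits / rank)
    return best


# Hypothetical toy data: four ranked results, two of them relevant.
ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}

curve = {
    level: precision_at_recall(ranking, relevant, level)
    for level in (0.0, 0.5, 1.0)
}
print(curve)
```

Comparing such curves for runs limited to 100, 500, and 2000 postings shows how much effectiveness is kept when fewer postings are transferred.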

  17. Evaluation (figure; no text in the transcript)

  18. Evaluation (figure; no text in the transcript)

  19. Evaluation (figure; annotations: +7.2%, +8.7%, +5.5%)

  20. Evaluation (figure; no text in the transcript)

  21. Conclusion:
     • Propose to take advantage of XML structure when routing in highly distributed environments such as P2P systems
     • Provide an infrastructure for investigating the proposed techniques to perform routing based on evidence from document, element, collection, and peer level
     • For 80 CAS topics of INEX 2007, efficiency and effectiveness could be improved
     • Future work to verify the observed improvement:
       • evaluate the formulas in their full version
       • runs with multimedia topics of INEX 2007; INEX 2008
       • measure bandwidth consumption (incl. #messages, message sizes)
       • run on different peers; split collection

  22. Section 5: Questions and Discussion
