
MINERVA Infinity: A Scalable Efficient Peer-to-Peer Search Engine


Presentation Transcript


  1. MINERVA Infinity: A Scalable Efficient Peer-to-Peer Search Engine
     Gerhard Weikum, Max-Planck-Institut für Informatik, Saarbrücken, Germany, weikum@mpi-inf.mpg.de
     Peter Triantafillou, University of Patras, Rio, Greece, peter@ceid.upatras.gr
     Sebastian Michel, Max-Planck-Institut für Informatik, Saarbrücken, Germany, smichel@mpi-inf.mpg.de
     Middleware 2005, Grenoble, France

  2. Vision
     • Today: Web search is dominated by centralized engines (“to google”)
       - censorship?
       - single point of attack/abuse
       - coverage of the web?
     • Ultimate goal: a “Distributed Google” to break information monopolies
     • A P2P approach is best suited:
       - large number of peers
       - exploits mostly idle resources
       - intellectual input of the user community
     MINERVA Infinity: A Scalable Efficient P2P Search Engine

  3. Challenges
     • Large-scale networks
       - 100,000 to 10,000,000 users
     • Large collections
       - > 10^10 documents
       - 1,000,000 terms
     • High dynamics

  4. Questions
     • Network Organization
       - structured?
       - hierarchical?
       - unstructured?
     • Data Placement
       - move data around?
       - data remains at the owner?
     • Scalability?
     • Query Routing/Execution
       - routing indexes?
       - message flooding?

  5. Overview
     • Motivation (Vision/Challenges/Questions)
     • Introduction to IR and P2P Systems
     • P2P-IR
     • Minerva Infinity
       - Network Organization
       - Data Placement
       - Query Processing
       - Data Replication
     • Experiments
     • Conclusion

  6. Information Retrieval Basics
     [Figure: a document and its terms, each term annotated with its number of occurrences (term frequency), e.g. 5x, 7x, 4x]

  7. Information Retrieval Basics (2)
     Top-k query processing: find the k documents with the highest total score.
     Index structure: B+ tree on terms; index lists with (DocId: tf*idf) entries sorted by score.
     [Figure: example index lists; entries as extracted: d28: 0.7, d53: 0.8, d51: 0.6, d11: 0.6, d55: 0.6, d12: 0.5, d17: 0.1, d44: 0.4, d14: 0.4, d17: 0.3, ..., d52: 0.3, d52: 0.1, d44: 0.2, ..., d28: 0.1, ...]
     Query execution: usually uses some kind of threshold algorithm*:
       - sequential scans over the index lists (round-robin)
       - (random accesses to fetch missing scores)
       - aggregate scores
       - stop when the threshold is reached
     * e.g. Fagin’s algorithm TA or a variant without random accesses
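
The query execution steps on this slide can be sketched in Python. This is a minimal illustration of a TA-style threshold algorithm, not the authors' implementation: the function name, the in-memory dictionaries used for random access, and the convention that a document missing from a list contributes a score of 0 are all assumptions made for the example.

```python
import heapq

def threshold_algorithm(index_lists, k):
    """TA-style top-k: round-robin sorted access plus random access.

    index_lists: one list per query term of (doc_id, score) pairs,
    sorted by descending score. Assumption: a document that is missing
    from a list contributes 0 to its total score.
    """
    lookup = [dict(lst) for lst in index_lists]  # stands in for random access
    seen = {}                                    # doc_id -> aggregated score
    max_depth = max((len(lst) for lst in index_lists), default=0)
    for depth in range(max_depth):
        last_scores = []
        for lst in index_lists:                  # one round-robin step per list
            if depth < len(lst):
                doc, score = lst[depth]
                last_scores.append(score)
                if doc not in seen:              # fetch the missing scores
                    seen[doc] = sum(d.get(doc, 0.0) for d in lookup)
            else:
                last_scores.append(0.0)
        # No unseen document can score higher than this sum.
        threshold = sum(last_scores)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top                           # stop: threshold reached
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

# Hypothetical index lists for a two-term query.
list_a = [("d1", 0.75), ("d2", 0.5), ("d3", 0.125)]
list_b = [("d2", 0.75), ("d3", 0.5), ("d1", 0.25)]
print(threshold_algorithm([list_a, list_b], 2))
```

The scan stops as soon as the k-th best aggregated score is at least the sum of the scores last seen in each list, so the tails of the lists are never read.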

  8. P2P Systems
     • Peer: “one that is of equal standing with another” (source: Merriam-Webster Online Dictionary)
     • Benefits:
       - no single point of failure
       - resource/data sharing
     • Problems/Challenges:
       - authority/trust/incentives
       - high dynamics
       - …
     • Applications:
       - File Sharing
       - IP Telephony
       - Web Search
       - Digital Libraries

  9. Structured P2P Systems based on Distributed Hash Tables (DHTs)
     • “Structured” P2P networks provide one simple method: lookup: key -> peer
       - CAN [SIGCOMM 2001]
       - CHORD [SIGCOMM 2001]
       - Pastry [Middleware 2001]
       - P-Grid [CoopIS 2001]
     • Robustness to load skew, failures, dynamics

  10. Chord
      • Peers and keys are mapped to the same cyclic ID space using a hash function.
      • Key k (e.g., hash(file name)) is assigned to the node with key p (e.g., hash(IP address)) such that k ≤ p and there is no node p' with k ≤ p' and p' < p.
      [Figure: Chord ring with peers p1, p8, p14, p21, p32, p38, p42, p48, p51, p56 and keys k10, k24, k30, k38, k54]
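
The assignment rule on this slide (a key goes to the first peer whose ID is not smaller than the key, wrapping around the ring) can be sketched as follows. The function name is hypothetical; the peer and key IDs are the ones that appear in the slide's figure.

```python
from bisect import bisect_left

def successor(peer_ids, key):
    """Return the peer responsible for `key`: the smallest peer id p with
    key <= p, wrapping around to the smallest id if no such peer exists."""
    ring = sorted(peer_ids)
    i = bisect_left(ring, key)       # index of the first id >= key
    return ring[i % len(ring)]       # i == len(ring) wraps to ring[0]

peers = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
for key in (10, 24, 30, 38, 54):
    print(f"k{key} -> p{successor(peers, key)}")
```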

  11. Chord (2)
      • Use finger tables to speed up the lookup process
      • Store pointers to a few distant peers
      • Lookup in O(log n) steps
      [Figure: Chord ring with peers p1, p8, p14, p21, p32, p38, p42, p48, p51, p56, showing Lookup(54) for key k54]
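
A small simulation can make the finger-table lookup concrete. This is a sketch under stated assumptions: a 6-bit ID space (64 IDs), the peer IDs from the slide's figure, fingers defined in the usual Chord way as successor(p + 2^i), and p8 as the peer that issues Lookup(54). All function names are hypothetical, and finger tables are recomputed on the fly rather than stored.

```python
from bisect import bisect_left

M = 6                                          # assumed: 2**6 = 64 ids
RING = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]  # peer ids from the figure

def successor(key):
    """First peer id >= key on the ring, with wrap-around."""
    i = bisect_left(RING, key % 2**M)
    return RING[i % len(RING)]

def fingers(p):
    """Finger i of peer p points to successor(p + 2**i)."""
    return [successor((p + 2**i) % 2**M) for i in range(M)]

def in_interval(x, a, b):
    """True if x lies in the ring interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b

def lookup(start, key):
    """Route from peer `start` to the peer responsible for `key`.
    Returns (responsible_peer, path_of_visited_peers)."""
    node, path = start, [start]
    while True:
        succ = fingers(node)[0]
        if in_interval(key, node, succ):
            path.append(succ)
            return succ, path
        nxt = succ                             # fallback: always make progress
        for f in reversed(fingers(node)):      # closest preceding finger
            if f != key and in_interval(f, node, key):
                nxt = f
                break
        node = nxt
        path.append(node)

print(lookup(8, 54))
```

Each hop jumps to the farthest known finger that still precedes the key, which is what keeps the hop count logarithmic in the number of peers.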

  12. Overview
      • Motivation (Vision/Challenges/Questions)
      • Introduction to IR and P2P Systems
      • P2P-IR
      • Minerva Infinity
        - Network Organization
        - Data Placement
        - Query Processing
        - Data Replication
      • Experiments
      • Conclusion

  13. P2P-IR
      • Share documents (e.g. Web pages) in an efficient and scalable way
      • Ranked retrieval
      • A simple DHT is insufficient

  14. Possible Approaches
      • Each peer is responsible for storing the COMPLETE index list for a subset of terms.
      • Query routing: DHT lookups
      • Query execution: distributed top-k [TPUT ’04, KLEE ’05]
      • Capacity overload of peers with highly frequent / popular terms (data load AND query load)
      [Figure: ring of peers p1, p8, p14, p21, p32, p38, p42, p48, p51, p56]

  15. Possible Approaches (2)
      • Each peer has its own local index (e.g., created by web crawls).
      • Query routing:
        1. DHT lookups
        2. Retrieve metadata
        3. Find the most promising peers
      • Query execution: send the complete query and merge the incoming results
      • Capacity overload of peers with
        - highly frequent terms
        - high-quality collections
      [Figure: distributed directory mapping each term to a list of peers, e.g. a: P1, P6, P4; b: P5, P3, P1, P6, ...]

  16. Overview
      • Motivation (Vision/Challenges/Questions)
      • Introduction to IR and P2P Systems
      • P2P-IR
      • Minerva Infinity
        - Network Organization
        - Data Placement
        - Query Processing
        - Data Replication
      • Experiments
      • Conclusion

  17. Minerva Infinity
      • Idea:
        - assign (term, docId, score) triplets to the peers
        - order preserving
        - load balancing
        - hash(score) + hash(term) as offset
        - guarantee 100% recall
      [Figure: score ranges labeled from low to high along the ring]
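
One way to read “hash(score) + hash(term) as offset” is sketched below. The concrete choices are assumptions made for illustration only: scores normalized to [0, 1], a ring of n IDs, a linear order-preserving mapping for the score, and SHA-1 for the term offset. The function names are hypothetical.

```python
import hashlib

def term_offset(term, n):
    """Pseudo-random starting position of a term's index list on the ring."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % n

def place(term, score, n, s_min=0.0, s_max=1.0):
    """Ring id for a (term, docId, score) triplet: an order-preserving
    position derived from the score, shifted by a per-term offset."""
    pos = min(int((score - s_min) / (s_max - s_min) * n), n - 1)
    return (pos + term_offset(term, n)) % n

n = 64
for score in (0.1, 0.5, 0.9):
    print(score, place("peer", score, n), place("search", score, n))
```

Within one term the positions still follow the score order (relative to the term's offset), while different terms start at different points of the ring, which spreads their lists over different peers.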

  18. Hash Function
      • Requirements:
        - Load balancing (to avoid overloading peers)
        - Order preserving (to make the QP work)
      • One without the other is trivial ...
        - Load balancing: apply a pseudo-random hash function
        - Order preserving: $\frac{S - S_{min}}{S_{max} - S_{min}} \cdot N$
      • Both together is challenging …
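
The two “trivial” options can be contrasted in a few lines. This is a sketch, not the hash function of the system: the use of SHA-1 for the pseudo-random variant, the clamping of the maximum score to the last ID, and both function names are assumptions.

```python
import hashlib

def random_hash(score, n):
    """Load balancing only: pseudo-random placement, score order is lost."""
    digest = hashlib.sha1(repr(score).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % n

def order_preserving_hash(score, s_min, s_max, n):
    """Order preserving only: (S - Smin) / (Smax - Smin) * N, clamped so
    that the maximum score maps to the last id n - 1 (an assumption)."""
    frac = (score - s_min) / (s_max - s_min)
    return min(int(frac * n), n - 1)

# Skewed scores: ten low values and one high value, four ids.
scores = [0.01 * i for i in range(10)] + [0.9]
print([order_preserving_hash(s, 0.0, 1.0, 4) for s in scores])
print([random_hash(s, 4) for s in scores])
```

With skewed scores the linear mapping keeps the order but sends almost every item to the same ID, which is why combining both requirements is the hard part.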

  19. Hash Function (2)
      • Assume an exponential score distribution
      • Place the first half of the data on the first peer
      • The next quarter on the next peer
      • and so on …
      [Figure: score axis from 1 to 0]
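
Read literally, the placement rule on this slide hands out geometrically shrinking shares of a score-ordered list. The sketch below follows that literal reading; the function name, the direction of the ordering, and the rule that the last peer takes whatever remains are assumptions, and the slide does not say how this interacts with the score distribution in the actual system.

```python
def geometric_split(items, num_peers):
    """Split `items` (already ordered by score) into consecutive chunks:
    the first half goes to peer 0, the next quarter to peer 1, and so on.
    The last peer takes whatever is left."""
    chunks, start, remaining = [], 0, len(items)
    for peer in range(num_peers):
        if peer == num_peers - 1:
            size = remaining
        else:
            size = (remaining + 1) // 2   # half of what is still unassigned
        chunks.append(items[start:start + size])
        start += size
        remaining -= size
    return chunks

print([len(c) for c in geometric_split(list(range(16)), 4)])
```

Because the chunks are consecutive, concatenating them in peer order restores the original score order.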

  20. Term Index Networks (TINs)
      • Reduce the number of hops during query processing (QP) by reducing the number of peers that maintain the index list for a particular term.
      • Only a small subset of peers is used to store an index list.
      [Figure: global network with peers 2, 7, 12, 15, 16, 20, 24, 37, 41, 45, 62 and term index networks A, B, C formed over subsets of these peers]

  21. How to Create/Find a TIN
      • Use u beacon peers to bootstrap the TIN for term t:
          p = 1/u
          for i = 0 to i < n' do
              id = hash(t, i*p)
              if (i > 0)
                  use hash(t, (i-1)*p) as a gateway to the TIN
              else
                  node with id creates the TIN
          end for
      • Beacon nodes act as gateways to the TIN.
      [Figure: global network containing the TIN for term T]
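
The bootstrap loop can be sketched in Python. Several points are assumptions rather than facts from the slides: the loop bound n' is taken to be the number of beacons u, hash(t, s) is replaced by a toy function (a linear score position plus a SHA-1 term offset on a ring of 64 IDs), and the function names are hypothetical. The sketch only computes which node creates the TIN and which gateway each later beacon would use; it does not model the join itself.

```python
import hashlib

N = 64  # assumed size of the id space

def toy_hash(term, score, n=N):
    """Stand-in for hash(t, score): score position plus a per-term offset."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    offset = int.from_bytes(digest, "big") % n
    return (min(int(score * n), n - 1) + offset) % n

def beacon_ids(term, u):
    """Ids of the u beacon peers for `term`, at scores 0, p, 2p, ... (p = 1/u)."""
    p = 1.0 / u
    return [toy_hash(term, i * p) for i in range(u)]

def bootstrap_tin(term, u):
    """Return (creator_id, joins), where joins lists (beacon_id, gateway_id):
    beacon 0 creates the TIN, beacon i > 0 enters through beacon i - 1."""
    ids = beacon_ids(term, u)
    joins = [(ids[i], ids[i - 1]) for i in range(1, u)]
    return ids[0], joins

print(bootstrap_tin("peer", 4))
```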

  22. Publish Data / Join a TIN
      • If the peer with id = hash(t, score) is not yet in the TIN for term t:
        - Randomly select a beacon node (beacon nodes act as gateways to the TIN)
        - Call the join method
      • Store the item (docId, t, score)

  23. Query Processing
      [Figure: 2-keyword query with data peers and a coordinator]
      Alternative: collect data and send it in one batch.

  24. QP with Moving Coordinator
      [Figure: 3-keyword query with data peers and a coordinator]

  25. Data Replication
      • Vertical: replicate data inside a TIN via a ‘reverse’ communication.
      • Horizontal: replicate complete TINs.
      [Figure: TINs A, B, and C with their peers, illustrating vertical and horizontal replication]

  26. Experiments
      Test bed: 10,000 peers
      Benchmarks:
      • GOV: TREC .GOV collection + 50 TREC-2003 Web queries, e.g. “juvenile delinquency”
      • XGOV: TREC .GOV collection + 50 manually expanded queries, e.g. “juvenile delinquency youth minor crime law jurisdiction offense prevention”
      • SCALABILITY: one query executed multiple times …

  27. Experiments: Metrics
      • Network traffic (in KB)
      • Query response time (in s)
        - network cost (150 ms RTT, 800 Kb/s data transfer rate)
        - local I/O cost (8 ms rotation latency + 8 MB/s transfer rate)
        - processing cost
      • Number of hops
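
To give a feel for the cost parameters on this slide, here is a small arithmetic sketch. The slide lists the parameters but not how they are combined, so the additive model below is an assumption, as are the readings 1 Kb = 1,000 bits and 1 MB = 1,000,000 bytes and the function names.

```python
RTT_S = 0.150                      # 150 ms round-trip time
NET_RATE_BITS_PER_S = 800_000      # 800 Kb/s, assuming 1 Kb = 1,000 bits
DISK_LATENCY_S = 0.008             # 8 ms rotation latency
DISK_RATE_BYTES_PER_S = 8_000_000  # 8 MB/s, assuming 1 MB = 1,000,000 bytes

def network_cost(round_trips, payload_bytes):
    """Seconds spent on the network: latency plus transfer time."""
    return round_trips * RTT_S + payload_bytes * 8 / NET_RATE_BITS_PER_S

def io_cost(accesses, bytes_read):
    """Seconds spent on local disk I/O: latency per access plus transfer time."""
    return accesses * DISK_LATENCY_S + bytes_read / DISK_RATE_BYTES_PER_S

# Example: one round trip carrying 100 KB, and one disk access reading 8 MB.
print(round(network_cost(1, 100_000), 3), round(io_cost(1, 8_000_000), 3))
```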

  28. Scalability Experiment
      • Measure the time for different query loads.
        - identical queries
        - inserted into a queue

  29. Experiments: Results

  30. Conclusion
      • Novel architecture for P2P web search.
      • High level of distribution both in data and processing.
      • Novel algorithms to create the networks, place data, and execute queries.
      • Support of two different data replication strategies.

  31. Future Work
      • Support of different score distributions
      • Adapt TIN sizes to the actual load
      • Different top-k query processing algorithms

  32. Thank you for your attention
