
Gossip-based Search Selection in Hybrid Peer-to-Peer Networks


Presentation Transcript


  1. Gossip-based Search Selection in Hybrid Peer-to-Peer Networks M. Zaharia and S. Keshav D. R. Cheriton School of Computer Science University of Waterloo, Waterloo, ON, Canada matei@matei.ca, keshav@uwaterloo.ca IPTPS 2006, Feb 28th 2006

  2. The Search Problem • Decentralized system of nodes, each of which stores copies of documents • Keyword-based search • Each document is identified by a set of keywords (e.g. song title) • Queries return lists of documents whose keyword sets are supersets of the query keywords (“AND queries”) • Example • Song: “Here Comes the Sun” • keywords: “Here”, “Comes”, “The”, “Sun” • Query: “Here” AND “Sun” • Responses: “Here Comes the Sun”, “The Sun is Here”
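As a minimal illustration of the AND-query semantics above (class and method names are illustrative, not from the paper), a document matches when its keyword set contains every query keyword:

```java
import java.util.*;

// Illustrative "AND" keyword matching: a document is returned only if its
// keyword set is a superset of the query keywords.
class AndQueryExample {
    static boolean matches(Set<String> docKeywords, Set<String> queryKeywords) {
        return docKeywords.containsAll(queryKeywords);
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("here", "comes", "the", "sun"));
        Set<String> query = new HashSet<>(Arrays.asList("here", "sun"));
        System.out.println(matches(doc, query));  // true: every query keyword appears in the document
    }
}
```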

  3. Metrics • Success rate • fraction of queries that return a result, conditional on a result being available • Number of results found • no more than a desired maximum Rmax • Response time • for the first result, and for the Rmax-th result • Bandwidth cost • includes the costs of index creation, query propagation, and result fetching

  4. Key Workload Characteristics • Document popularities follow a Zipfian distribution • Some documents are more widely copied than others • Are also requested more often • Some nodes have much faster connections and much longer connection durations than others

  5. So… • Retrieve popular documents with least work • Offload work to better-connected and longer-lived peers How can we do that?

  6. Hybrid P2P network [Loo, IPTPS 2004] • Flood queries for popular documents • Use the DHT for rare documents • Only publish rare documents to the DHT index • (Figure: bootstrap nodes, DHT, ultrapeers, and peers)

  7. How to know document popularity? • PIERSearch uses • Observations of • result size history • keyword frequency • keyword pair frequency • Sampling of neighboring nodes • These are all local • Global knowledge is better

  8. More on global knowledge • Want histogram of document popularity • i.e. number of ultrapeers that index a document • we only care about popular documents, so can truncate the tail • On getting a query, sum histogram values for all matching document titles and divide by number of ultrapeers • If this exceeds threshold, then flood, else use DHT* * modulo rare documents with common keywords, see paper
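A minimal sketch of this decision rule, under the assumption that the histogram maps titles to ultrapeer counts (names and representation are illustrative, not the paper's code):

```java
import java.util.*;

// Hypothetical sketch: choose flooding vs. DHT lookup from a popularity
// histogram (document title -> number of ultrapeers indexing it).
class SearchSelectorSketch {
    static boolean shouldFlood(Map<String, Integer> histogram,
                               List<String> matchingTitles,
                               int numUltrapeers,
                               double threshold) {
        int sum = 0;
        for (String title : matchingTitles) {
            sum += histogram.getOrDefault(title, 0);
        }
        double estimatedPopularity = (double) sum / numUltrapeers;
        return estimatedPopularity > threshold;  // flood if popular enough, otherwise use the DHT
    }
}
```

With the numbers on the next slide (counts 15 and 2 over 100 ultrapeers, threshold 0.05), shouldFlood returns true for the query 'Sun' and false for 'Are My'.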

  9. Example • Assume 100 ultrapeers and only two documents • Suppose title 'Here comes the Sun' has count 15 (15 ultrapeers index it) and 'You are my Sun' has count 2 • Query 'Sun' has sum (15 + 2)/100 = 0.17 • Query 'Are My' has sum 2/100 = 0.02 • If the threshold is 0.05, then the first query is flooded and for the second, we use the DHT

  10. How to compute the histogram? • Central server • Centralizes load and introduces single point of failure • Compute on induced tree • brittle to failures • Gossip • pick random node and exchange partial histograms • can result in double counting

  11. Double counting problem • (Figure: ultrapeers A, B, C index titles {a, b}, {a, c}, {a, d}; naively summing gossiped partial histograms counts title a 5 times even though only 3 ultrapeers index it)

  12. Avoiding double counting • When an ultrapeer indexes a document title it hasn't indexed already, it tosses a coin up to k times and counts the number of heads it sees before the first tail; call this CT • Gossip CT values for all titles with other ultrapeers to compute maxCT • because max is an extremal value, there is no double counting • (Flajolet-Martin) The number of ultrapeers with the document is roughly 2^maxCT • Example • 1000 nodes • Chances are good that one will see 10 consecutive heads • It gossips '10'
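A sketch of the coin-flipping estimator described above, in the Flajolet-Martin style (variable names and structure are illustrative):

```java
import java.util.Random;

// Illustrative Flajolet-Martin-style counting sketch, as described on the slide.
class FMCounterSketch {
    private static final Random RNG = new Random();

    // Number of heads seen before the first tail, capped at k flips.
    static int drawCT(int k) {
        int heads = 0;
        while (heads < k && RNG.nextBoolean()) {
            heads++;
        }
        return heads;
    }

    // Estimate of how many ultrapeers index the document, given the
    // maximum CT value seen while gossiping.
    static long estimateCount(int maxCT) {
        return 1L << maxCT;  // roughly 2^maxCT
    }

    public static void main(String[] args) {
        // With ~1000 ultrapeers, some node is likely to draw CT near 10,
        // so the gossiped maximum yields an estimate near 2^10 = 1024.
        System.out.println(estimateCount(10));
    }
}
```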

  13. Approximate histograms • Use coin-flipping trick for each document • Note that there can be up to 50% error • Gossip partial histograms • Concatenate histograms • Truncate low-count documents
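A possible merge step for the gossiped partial histograms above (representation assumed): because each entry holds a maximum of CT values, taking the per-title max is idempotent, so repeated gossip exchanges cannot double count, and low-count titles can be truncated.

```java
import java.util.*;

// Hypothetical merge step for gossiped partial histograms of maxCT values.
// Taking the per-title maximum is idempotent, so double counting cannot occur.
class HistogramMergeSketch {
    static Map<String, Integer> merge(Map<String, Integer> mine,
                                      Map<String, Integer> theirs,
                                      int minCT) {
        Map<String, Integer> merged = new HashMap<>(mine);
        for (Map.Entry<String, Integer> e : theirs.entrySet()) {
            merged.merge(e.getKey(), e.getValue(), Math::max);
        }
        // Truncate the tail: only popular documents matter for the flood-vs-DHT decision.
        merged.values().removeIf(ct -> ct < minCT);
        return merged;
    }
}
```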

  14. What about the threshold? • If chosen too low, flood too often! • If chosen too high, flood too rarely! • Threshold is time dependent and load dependent • No easy way to choose it

  15. Adaptive thresholding • Associate utility with the performance of a query • Threshold should maximize utility • For some queries, use both flooding and DHT and compare utilities • This will tell us how to move the threshold in the future
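A rough sketch of this idea (the update rule and step size are assumptions, not the paper's algorithm; the actual utility function appears on the next slide): occasionally run a query both ways, compare utilities, and nudge the threshold toward the strategy that performed better.

```java
// Hypothetical threshold adaptation step, not the paper's exact rule.
class AdaptiveThresholdSketch {
    private double threshold = 0.05;        // assumed starting point
    private final double step = 0.005;      // assumed step size

    // Called for probe queries that were executed with both flooding and the DHT.
    void update(double floodUtility, double dhtUtility) {
        if (floodUtility > dhtUtility) {
            threshold = Math.max(0.0, threshold - step);  // flood more often in the future
        } else if (dhtUtility > floodUtility) {
            threshold = Math.min(1.0, threshold + step);  // use the DHT more often in the future
        }
    }

    double current() { return threshold; }
}
```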

  16. Utility function

  17. Adaptive thresholding

  18. Evaluation • Built an event-driven simulator, in Java, for search in generic peer-to-peer network architectures. • Simulates each query, response and document download. • Uses user lifetime and bandwidth distributions observed in real systems. • Generates random exact queries based on the fetch-at-most-once model (Zipfian with flattened head) • can also use traces of queries from real systems.

  19. Parameters • 3 peers join every 4 seconds • Each enters with an average of 20 documents, randomly chosen from a dataset of 20,000 unique documents • Peers emit queries on average once every 300 seconds, requesting at most 25 results • Zipf parameter of 1.0. • 1.7 million queries over a 22 hour period

  20. Simulation stability • Stable population achieved at 20,000 seconds • Variance of all results is under 5% and is omitted from the plots for clarity

  21. Systems compared

  22. Metrics

  23. Performance (normalized)

  24. Adaptive thresholding

  25. Scaling (normalized)

  26. Trace-based simulation • Trace of 50 ultrapeers for 3 hours on Sunday October 12, 2003 • ~230,000 distinct queries • ~200,000 distinct keywords • ~672,000 distinct documents

  27. Conclusions • Gossip is an effective way to compute global state • Utility functions provide simple ‘knobs’ to control performance and balance competing objectives • Adaptive algorithms (threshold selection and flooding) reduce the need for external management and “magic constants” • Giving hybrid ultrapeers access to global state reduces overhead by a factor of about two

  28. Questions?

  29. Simulator Speedup • Fast I/O routines • Java creates temporary objects during string concatenation. Custom, large StringBuffer for string concatenation greatly improves performance. • Batch database uploads • prepared statements turn out to be much less efficient than importing a table from a tab-separated text file. • Avoid keyword search for exact queries • Can simulate 20 hours with a population of 7000 users (~2,300,000 queries) in about 20 minutes
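As an illustration of the string-concatenation point above (not the simulator's actual code): repeated '+' concatenation in a loop allocates temporary objects, while reusing one large StringBuffer avoids them.

```java
// Illustration of the speedup trick: reuse one large buffer instead of
// concatenating with '+' in a loop, which creates temporary objects.
class FastConcatExample {
    static String slow(String[] parts) {
        String s = "";
        for (String p : parts) {
            s = s + p + "\t";   // allocates new intermediate objects on every iteration
        }
        return s;
    }

    static String fast(String[] parts) {
        StringBuffer buf = new StringBuffer(1 << 20);  // one large, reusable buffer
        for (String p : parts) {
            buf.append(p).append('\t');
        }
        return buf.toString();
    }
}
```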
