
Gossip-based Search Selection in Hybrid Peer-to-Peer Networks


Presentation Transcript


  1. Gossip-based Search Selection in Hybrid Peer-to-Peer Networks M. Zaharia and S. Keshav D. R. Cheriton School of Computer Science University of Waterloo, Waterloo, ON, Canada matei@matei.ca, keshav@uwaterloo.ca IPTPS 2006, Feb 28th 2006

  2. The Search Problem • Decentralized system of nodes, each of which stores copies of documents • Keyword-based search • Each document is identified by a set of keywords (e.g. song title) • Queries return lists of documents whose keyword sets are supersets of the query keywords (“AND queries”) • Example • Song: “Here Comes the Sun” • keywords: “Here”, “Comes”, “The”, “Sun” • Query: “Here” AND “Sun” • Responses: “Here Comes the Sun”, “The Sun is Here”
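As a minimal illustration of the AND-query semantics above (class and method names are illustrative, not from the paper), a document matches when its keyword set contains every query keyword:

```java
import java.util.*;

// Illustrative "AND" keyword matching: a document is returned only if its
// keyword set is a superset of the query keywords.
class AndQueryExample {
    static boolean matches(Set<String> docKeywords, Set<String> queryKeywords) {
        return docKeywords.containsAll(queryKeywords);
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("here", "comes", "the", "sun"));
        Set<String> query = new HashSet<>(Arrays.asList("here", "sun"));
        System.out.println(matches(doc, query));  // true: every query keyword appears in the document
    }
}
```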

  3. Metrics • Success rate • fraction of queries that return a result, conditional on a result being available • Number of results found • no more than a desired maximum Rmax • Response time • for the first result, and for the Rmax-th result • Bandwidth cost • includes the costs of index creation, query propagation, and result fetching

  4. Key Workload Characteristics • Document popularities follow a Zipfian distribution • Some documents are more widely copied than others • Are also requested more often • Some nodes have much faster connections and much longer connection durations than others

  5. So… • Retrieve popular documents with least work • Offload work to better-connected and longer-lived peers How can we do that?

  6. Hybrid P2P network [Loo, IPTPS 2004] • Flood queries for popular documents • Use the DHT for rare documents • Only publish rare documents to the DHT index • (Figure: bootstrap nodes, DHT, ultrapeers, and peers)

  7. How to know document popularity? • PIERSearch uses • Observations of • result size history • keyword frequency • keyword pair frequency • Sampling of neighboring nodes • These are all local • Global knowledge is better

  8. More on global knowledge • Want histogram of document popularity • i.e. number of ultrapeers that index a document • we only care about popular documents, so can truncate the tail • On getting a query, sum histogram values for all matching document titles and divide by number of ultrapeers • If this exceeds threshold, then flood, else use DHT* * modulo rare documents with common keywords, see paper
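A minimal sketch of this decision rule, under the assumption that the histogram maps titles to ultrapeer counts (names and representation are illustrative, not the paper's code):

```java
import java.util.*;

// Hypothetical sketch: choose flooding vs. DHT lookup from a popularity
// histogram (document title -> number of ultrapeers indexing it).
class SearchSelectorSketch {
    static boolean shouldFlood(Map<String, Integer> histogram,
                               List<String> matchingTitles,
                               int numUltrapeers,
                               double threshold) {
        int sum = 0;
        for (String title : matchingTitles) {
            sum += histogram.getOrDefault(title, 0);
        }
        double estimatedPopularity = (double) sum / numUltrapeers;
        return estimatedPopularity > threshold;  // flood if popular enough, otherwise use the DHT
    }
}
```

With the numbers on the next slide (counts 15 and 2 over 100 ultrapeers, threshold 0.05), shouldFlood returns true for the query 'Sun' and false for 'Are My'.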

  9. Example • Assume 100 ultrapeers and only two documents • Suppose title 'Here comes the Sun' has count 15 (15 ultrapeers index it) and 'You are my Sun' has count 2 • Query 'Sun' has sum (15 + 2)/100 = 0.17 • Query 'Are My' has sum 2/100 = 0.02 • If the threshold is 0.05, then the first query is flooded and for the second, we use the DHT

  10. How to compute the histogram? • Central server • Centralizes load and introduces single point of failure • Compute on induced tree • brittle to failures • Gossip • pick random node and exchange partial histograms • can result in double counting

  11. Double counting problem • (Figure: ultrapeers A, B, C index titles {a, b}, {a, c}, {a, d}; naively summing gossiped partial histograms counts title a 5 times even though only 3 ultrapeers index it)

  12. Avoiding double counting • When an ultrapeer indexes a document title it hasn't indexed already, it tosses a coin up to k times and counts the number of heads it sees before the first tail; call this CT • Gossip CT values for all titles with other ultrapeers to compute maxCT • because max is an extremal value, there is no double counting • (Flajolet-Martin) The number of ultrapeers with the document is roughly 2^maxCT • Example • 1000 nodes • Chances are good that one will see 10 consecutive heads • It gossips '10'
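A sketch of the coin-flipping estimator described above, in the Flajolet-Martin style (variable names and structure are illustrative):

```java
import java.util.Random;

// Illustrative Flajolet-Martin-style counting sketch, as described on the slide.
class FMCounterSketch {
    private static final Random RNG = new Random();

    // Number of heads seen before the first tail, capped at k flips.
    static int drawCT(int k) {
        int heads = 0;
        while (heads < k && RNG.nextBoolean()) {
            heads++;
        }
        return heads;
    }

    // Estimate of how many ultrapeers index the document, given the
    // maximum CT value seen while gossiping.
    static long estimateCount(int maxCT) {
        return 1L << maxCT;  // roughly 2^maxCT
    }

    public static void main(String[] args) {
        // With ~1000 ultrapeers, some node is likely to draw CT near 10,
        // so the gossiped maximum yields an estimate near 2^10 = 1024.
        System.out.println(estimateCount(10));
    }
}
```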

  13. Approximate histograms • Use coin-flipping trick for each document • Note that there can be up to 50% error • Gossip partial histograms • Concatenate histograms • Truncate low-count documents
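A possible merge step for the gossiped partial histograms above (representation assumed): because each entry holds a maximum of CT values, taking the per-title max is idempotent, so repeated gossip exchanges cannot double count, and low-count titles can be truncated.

```java
import java.util.*;

// Hypothetical merge step for gossiped partial histograms of maxCT values.
// Taking the per-title maximum is idempotent, so double counting cannot occur.
class HistogramMergeSketch {
    static Map<String, Integer> merge(Map<String, Integer> mine,
                                      Map<String, Integer> theirs,
                                      int minCT) {
        Map<String, Integer> merged = new HashMap<>(mine);
        for (Map.Entry<String, Integer> e : theirs.entrySet()) {
            merged.merge(e.getKey(), e.getValue(), Math::max);
        }
        // Truncate the tail: only popular documents matter for the flood-vs-DHT decision.
        merged.values().removeIf(ct -> ct < minCT);
        return merged;
    }
}
```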

  14. What about the threshold? • If chosen too low, flood too often! • If chosen too high, flood too rarely! • Threshold is time dependent and load dependent • No easy way to choose it

  15. Adaptive thresholding • Associate utility with the performance of a query • Threshold should maximize utility • For some queries, use both flooding and DHT and compare utilities • This will tell us how to move the threshold in the future
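A rough sketch of this idea (the update rule and step size are assumptions, not the paper's algorithm; the actual utility function appears on the next slide): occasionally run a query both ways, compare utilities, and nudge the threshold toward the strategy that performed better.

```java
// Hypothetical threshold adaptation step, not the paper's exact rule.
class AdaptiveThresholdSketch {
    private double threshold = 0.05;        // assumed starting point
    private final double step = 0.005;      // assumed step size

    // Called for probe queries that were executed with both flooding and the DHT.
    void update(double floodUtility, double dhtUtility) {
        if (floodUtility > dhtUtility) {
            threshold = Math.max(0.0, threshold - step);  // flood more often in the future
        } else if (dhtUtility > floodUtility) {
            threshold = Math.min(1.0, threshold + step);  // use the DHT more often in the future
        }
    }

    double current() { return threshold; }
}
```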

  16. Utility function

  17. Adaptive thresholding

  18. Evaluation • Built an event-driven simulator, in Java, for search in generic peer-to-peer network architectures. • Simulates each query, response and document download. • Uses user lifetime and bandwidth distributions observed in real systems. • Generates random exact queries based on the fetch-at-most-once model (Zipfian with flattened head) • can also use traces of queries from real systems.

  19. Parameters • 3 peers join every 4 seconds • Each enters with an average of 20 documents, randomly chosen from a dataset of 20,000 unique documents • Peers emit queries on average once every 300 seconds, requesting at most 25 results • Zipf parameter of 1.0. • 1.7 million queries over a 22 hour period

  20. Simulation stability • Stable population achieved at 20,000 seconds • Variance of all results is under 5% and is omitted from the plots for clarity

  21. Systems compared

  22. Metrics

  23. Performance (normalized)

  24. Adaptive thresholding

  25. Scaling (normalized)

  26. Trace-based simulation • Trace of 50 ultrapeers for 3 hours on Sunday October 12, 2003 • ~230,000 distinct queries • ~200,000 distinct keywords • ~672,000 distinct documents

  27. Conclusions • Gossip is an effective way to compute global state • Utility functions provide simple ‘knobs’ to control performance and balance competing objectives • Adaptive algorithms (threshold selection and flooding) reduce the need for external management and “magic constants” • Giving hybrid ultrapeers access to global state reduces overhead by a factor of about two

  28. Questions?

  29. Simulator Speedup • Fast I/O routines • Java creates temporary objects during string concatenation. Custom, large StringBuffer for string concatenation greatly improves performance. • Batch database uploads • prepared statements turn out to be much less efficient than importing a table from a tab-separated text file. • Avoid keyword search for exact queries • Can simulate 20 hours with a population of 7000 users (~2,300,000 queries) in about 20 minutes
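As an illustration of the string-concatenation point above (not the simulator's actual code): repeated '+' concatenation in a loop allocates temporary objects, while reusing one large StringBuffer avoids them.

```java
// Illustration of the speedup trick: reuse one large buffer instead of
// concatenating with '+' in a loop, which creates temporary objects.
class FastConcatExample {
    static String slow(String[] parts) {
        String s = "";
        for (String p : parts) {
            s = s + p + "\t";   // allocates new intermediate objects on every iteration
        }
        return s;
    }

    static String fast(String[] parts) {
        StringBuffer buf = new StringBuffer(1 << 20);  // one large, reusable buffer
        for (String p : parts) {
            buf.append(p).append('\t');
        }
        return buf.toString();
    }
}
```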
