Efficient Search Techniques in Peer-to-Peer Systems

Routing Indices for Peer-to-Peer Systems (ICDCS 2002)

Introduction • Search in a P2P system • Mechanisms without an index • Mechanisms with specialized index nodes (centralized search) • Mechanisms with indices at each node • Structured P2P network • Unstructured P2P network • Parallel vs. sequential search • Response time • Network traffic

Routing indices (RI) • Query • Documents are on zero or more “topics”, and queries request documents on particular topics. • Document topics are independent • Local index • RI • Each node has a local routing index which contains the following information • The number of documents along each path • The number of documents on each topic of interest • Allows a node to select the “best” neighbors to send a query to

The RI may be “coarser” than the local indices • Overcounts • Undercounts

Goodness measure • Number of results in a path • Using Routing indices
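One way to turn an RI row into a goodness score, assuming (as stated for the query model) that document topics are independent, is to scale the number of documents along a path by the fraction of documents on each queried topic. The sketch below is illustrative only: the function name `estimate_goodness`, the path names, and the counts are made up for this example.

```python
from math import prod

def estimate_goodness(total_docs, topic_counts, query_topics):
    """Estimate how many documents along a path match ALL query topics.

    Assumes topics are independent, so the matching fraction is the
    product of the per-topic fractions.
    """
    if total_docs == 0:
        return 0.0
    fractions = (topic_counts.get(t, 0) / total_docs for t in query_topics)
    return total_docs * prod(fractions)

# Hypothetical RI rows: path -> (documents along path, documents per topic)
ri = {
    "B": (100, {"databases": 20, "languages": 30}),
    "C": (1000, {"databases": 0, "languages": 50}),
    "D": (200, {"databases": 100, "languages": 150}),
}
query = ["databases", "languages"]

# Neighbors ordered from most to least promising for this query.
ranked = sorted(ri, key=lambda n: estimate_goodness(*ri[n], query), reverse=True)
```

A node would forward the query to the first entry of `ranked` and fall back to the next one only if the first path does not return enough results.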

Storage space • N: number of nodes in the P2P network • b: branching factor • c: number of categories • s: counter size in bytes • Centralized index: s * (c + 1) * N • Distributed system: s * (c + 1) * b (at each node)
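A quick worked example of the two storage formulas may help. The parameter values below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def centralized_index_bytes(s, c, n):
    """Storage for one centralized index: s * (c + 1) * N."""
    return s * (c + 1) * n

def per_node_ri_bytes(s, c, b):
    """Storage for the routing index at a single node: s * (c + 1) * b."""
    return s * (c + 1) * b

# Hypothetical values: 4-byte counters, 1000 categories,
# 10,000 nodes, branching factor 4.
s, c, n, b = 4, 1000, 10_000, 4
central = centralized_index_bytes(s, c, n)
per_node = per_node_ri_bytes(s, c, b)
```

With these numbers the centralized index needs 40,040,000 bytes while each node's RI needs 16,016 bytes. The per-node cost does not depend on N, although the total across all nodes is larger than the centralized index.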

Creating routing indices

Maintaining Routing Indices • Trade-off between RI freshness and update cost • Does not require the participation of a disconnecting node • Discussion • What if the search topics are dependent? • Can the number of “hops” necessary to reach a document be estimated?

Alternative Routing Indices • Hop-count RI • Aggregated RIs for each “hop” up to a maximum number of hops are stored

Search cost • Number of messages • The goodness of a neighbor • The ratio between the number of documents available through that neighbor and the number of messages required to get those documents • Regular tree with fanout F • It takes F^h messages to find all documents at hop h • Storage cost?
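Under the regular-tree model, documents h hops away cost on the order of F^h messages to reach, so one hedged way to score a neighbor from a hop-count RI is to discount each hop's document count by that cost. The sketch below counts hops from 0 for the nearest entry, which is an indexing assumption made here; the function name and the sample numbers are illustrative.

```python
def hop_count_goodness(docs_per_hop, fanout):
    """Documents-per-message score for one neighbor.

    docs_per_hop[j] is the estimated number of matching documents at
    hop index j (0 = nearest entry, an assumption of this sketch).
    Each count is divided by fanout**j, the assumed message cost.
    """
    return sum(docs / fanout ** j for j, docs in enumerate(docs_per_hop))

# Hypothetical hop-count RI entries for two neighbors, fanout F = 3.
x = hop_count_goodness([13, 10], fanout=3)  # fewer documents, but closer
y = hop_count_goodness([0, 31], fanout=3)   # more documents, but farther
best = "X" if x > y else "Y"
```

Here neighbor X wins even though Y leads to more documents in total, because Y's documents are all one hop farther and therefore cost more messages.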

Exponentially aggregated RI • Store the result of applying the regular-tree cost formula to a hop-count RI • How to compute the goodness of a path for the query containing several topics?

Cycles in the P2P network (HW)

Improving Search in Peer-to-Peer Networks ICDCS 2002 Beverly Yang Hector Garcia-Molina

Outline • Introduction • Techniques • Experiment

Introduction • We present three techniques for efficient search in P2P systems. • The basic idea is to reduce the number of nodes that process a query

Current Techniques • Gnutella • BFS with depth limit D. • Wastes bandwidth and processing resources • Freenet • DFS with depth limit D. • Poor response time.

Iterative Deepening • Under policy P = {a, b, c}: BFS to depth a, then b, then c, with waiting time W between iterations, stopping once the query is satisfied • See example.
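The idea of searching to successively larger depths can be sketched as follows. This is a simplified, synchronous illustration: it restarts the depth-limited BFS at every iteration and leaves out the waiting time W, and the graph, the `has_result` predicate, and all function names are assumptions made for the example.

```python
from collections import deque

def bfs_within(graph, source, depth_limit):
    """Return the set of nodes within depth_limit hops of source."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == depth_limit:
            continue  # do not expand past the limit
        for nb in graph.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, depth + 1))
    return seen

def iterative_deepening(graph, source, has_result, policy, wanted):
    """Search to each depth in `policy` in turn until `wanted` results are found.

    Returns (results, depth_reached). The waiting time W is omitted
    because this sketch runs synchronously.
    """
    results, depth = set(), None
    for depth in sorted(policy):
        reached = bfs_within(graph, source, depth)
        results = {n for n in reached if n != source and has_result(n)}
        if len(results) >= wanted:
            break  # query satisfied, no deeper iteration needed
    return results, depth

# Hypothetical overlay: S is the query source, D and E hold matching data.
graph = {"S": ["A", "B"], "A": ["C"], "B": ["D"], "C": ["E"], "D": [], "E": []}
holders = {"D", "E"}
one = iterative_deepening(graph, "S", holders.__contains__, {1, 2, 3}, wanted=1)
two = iterative_deepening(graph, "S", holders.__contains__, {1, 2, 3}, wanted=2)
```

Asking for one result stops at depth 2, while asking for two results forces the search to the last depth in the policy.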

Directed BFS • A source sends query messages to just a subset of its neighbors • A node maintains simple statistics on its neighbors • Number of results received from each neighbor • Latency of connection

Candidate nodes • Returned the highest number of results for previous queries • Lowest hop count for returned responses • Forwarded the largest number of messages
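Selecting a subset of neighbors from per-neighbor statistics might look like the sketch below. The `NeighborStats` fields, the heuristic names, and the sample values are hypothetical and only mirror the kinds of statistics listed on the slides.

```python
from dataclasses import dataclass

@dataclass
class NeighborStats:
    results_returned: int    # results received via this neighbor for past queries
    avg_hops: float          # average hops taken by responses from this neighbor
    messages_forwarded: int  # messages this neighbor has forwarded
    latency_ms: float        # latency of the connection

# Each heuristic maps stats to a sort key; smaller keys rank first.
HEURISTICS = {
    "most_results": lambda s: -s.results_returned,
    "lowest_hops": lambda s: s.avg_hops,
    "most_messages": lambda s: -s.messages_forwarded,
    "lowest_latency": lambda s: s.latency_ms,
}

def select_neighbors(stats, heuristic, k):
    """Return the k neighbor names that rank best under the given heuristic."""
    score = HEURISTICS[heuristic]
    return sorted(stats, key=lambda name: score(stats[name]))[:k]

# Hypothetical statistics kept by the query source.
stats = {
    "n1": NeighborStats(12, 3.5, 400, 80.0),
    "n2": NeighborStats(30, 2.0, 150, 120.0),
    "n3": NeighborStats(5, 4.0, 900, 40.0),
}
targets = select_neighbors(stats, "most_results", k=2)
```

Only the source restricts its fan-out in this sketch; what the selected neighbors do with the query afterwards is outside its scope.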

Local Indices • Each node n maintains an index over the data of all nodes within a radius of r hops. • Only nodes at depths listed in the policy process the query; all nodes at other depths simply forward it. • Example: policy P = {1, 5}
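Assuming that a node at depth d which processes the query can answer on behalf of every node within r hops of itself, the depths covered by a policy can be checked with a few lines. This is a simplified, tree-shaped view of the overlay, and `covered_depths` is a hypothetical helper name.

```python
def covered_depths(policy, r):
    """Depths (from the query source) whose data is searched under `policy`.

    Assumes a processing node at depth d covers depths d - r .. d + r
    through its radius-r local index.
    """
    covered = set()
    for d in policy:
        covered.update(range(max(0, d - r), d + r + 1))
    return sorted(covered)

wide = covered_depths({1, 5}, r=2)    # no gaps between depth 0 and 7
narrow = covered_depths({1, 5}, r=1)  # depth 3 is not covered
```

Under this assumption, policy P = {1, 5} with r = 2 searches every depth up to 7 while only two levels of nodes actually process the query.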

Experimental Setup • For each response, we log: • Number of hops taken • IP address from which the Response message came • Response time • Individual results

Experimental result

Efficient Content Location Using Interest-Based Locality in Peer-to-Peer Systems Kunwadee Sripanidkulchai Bruce Maggs Hui Zhang IEEE INFOCOM 2003

Motivation • Although flooding is simple and robust, it is not scalable. • A content location solution in which peers are organized into an interest-based structure on top of Gnutella. • The algorithm is called interest-based shortcuts

Interest-based locality

Shortcuts Architecture and Design Goals • To create additional links on top of a peer-to-peer system’s overlay • As a separate performance enhancement layer on top of existing content location mechanisms

Content location paths

Shortcut Discovery • The first lookup returns a set of peers that store the content • These are potential candidates. • One peer is selected at random from the set and added to the shortcut list • For scalability, each peer allocates a fixed amount of storage to implement shortcuts.

Shortcut selection • We rank shortcuts based on their perceived utility • A peer sequentially asks the shortcuts on its list, in rank order, until the content is found.

Ranking metrics • Probability of providing content • Latency of the path to the shortcut • Load at the shortcut • A combination of metrics can be used based on each peer’s preference
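Discovery, ranking, and sequential lookup can be put together in a small sketch. It ranks shortcuts by observed success rate only (one of the metrics above), evicts the lowest-ranked entry when the fixed-size list is full, and falls back to a caller-supplied `flood` function standing in for the underlying content location mechanism. The class name, the eviction rule, and the callbacks are assumptions made for illustration.

```python
import random

class ShortcutList:
    def __init__(self, capacity=10):
        self.capacity = capacity  # fixed-size storage for shortcuts
        self.stats = {}           # peer -> [successes, attempts]

    def _utility(self, peer):
        ok, tried = self.stats[peer]
        return ok / tried if tried else 0.0

    def ranked(self):
        """Shortcuts ordered from highest to lowest observed success rate."""
        return sorted(self.stats, key=self._utility, reverse=True)

    def add(self, peer):
        if peer in self.stats:
            return
        if len(self.stats) >= self.capacity:
            del self.stats[self.ranked()[-1]]  # evict the lowest-ranked shortcut
        self.stats[peer] = [0, 0]

    def lookup(self, item, peer_has, flood):
        """Ask shortcuts in rank order, then fall back to flooding."""
        for peer in self.ranked():
            self.stats[peer][1] += 1
            if peer_has(peer, item):
                self.stats[peer][0] += 1
                return peer
        holders = flood(item)
        if not holders:
            return None
        chosen = random.choice(sorted(holders))  # one candidate picked at random
        self.add(chosen)
        return chosen

# Hypothetical content placement used to exercise the sketch.
content = {"p1": {"a", "b"}, "p2": {"c"}}
peer_has = lambda peer, item: item in content[peer]
flood = lambda item: {p for p in content if item in content[p]}

shortcuts = ShortcutList(capacity=2)
first = shortcuts.lookup("a", peer_has, flood)   # found by flooding, p1 becomes a shortcut
second = shortcuts.lookup("b", peer_has, flood)  # answered directly by the shortcut
```

Other ranking metrics from the list, such as latency or load, could replace `_utility` without changing the rest of the structure.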

Performance indices • Success rate • Load characteristics • Query scope • Minimum reply path lengths • Additional state

Potential and Limitations • Adding 5 shortcuts at a time produces success rates that are close to the best possible. • Slightly increasing the shortest path length from 1 to 2 hops yields a better success rate.

Conclusion • A simple and practical mechanism was proposed.

Similarity Discovery in Structured P2P Overlays ICPP

Introduction • Structured P2P network • Only support search with a single keyword • Similarity between two documents • Keyword sets • Vector space • Measure • Problems • Search problem • New keyword?

Meteorograph • Absolute angle

Publishing and Searching • Publish • Hash • Publish the item to the node np whose hash key is closest to the item's hash value
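The "closest hash key" rule can be illustrated with a generic sketch. It does not reproduce Meteorograph's own mapping from items to keys, which is not given here; the 16-bit key space, the SHA-1 based `hash_key`, and the absolute-difference notion of closeness are all assumptions made for the example.

```python
import hashlib

KEY_BYTES = 2  # assumed 16-bit key space, for illustration only

def hash_key(name):
    """Map an item name to an integer key (hypothetical hashing scheme)."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:KEY_BYTES], "big")

def closest_node(node_keys, key):
    """Return the node key numerically closest to `key`."""
    return min(node_keys, key=lambda nk: abs(nk - key))

def publish(store, node_keys, item_name):
    """Place the item at the node whose key is closest to the item's hash."""
    np_key = closest_node(node_keys, hash_key(item_name))
    store.setdefault(np_key, set()).add(item_name)
    return np_key

# Hypothetical node keys and an empty store (node key -> published items).
node_keys = [100, 20000, 50000]
store = {}
home = publish(store, node_keys, "report.pdf")
found_at = closest_node(node_keys, hash_key("report.pdf"))
```

Because publishing and searching apply the same rule, a lookup for the same name arrives at the node that stores the item.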

Search problem • Nearest answers • k-nearest answers • e • Partial • Comprehensive • Search strategy • Discussions • What happens when the keyword vector is represented by q?

Other issues • Load balance • Changes of vector space • Republished? • Comprehensive set of keywords • Other methods?

SWAM: A Family of Access Methods for Similarity-Search in Peer-to-Peer Data Networks Farnoush Banaei-Kashani Cyrus Shahabi (CIKM04)

PDN access method • Defines • How to organize the PDN topology into an index-like structure • How to use the index structure

Hilbert space • Hilbert space (V, Lp) • Key k = (a1, a2, …, ad) • d: the dimension of the vector space • The domain is a contiguous and finite interval of R • The Lp norm, with p in Z+ • Serves as the distance function that measures dissimilarity
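As a concrete reading of the distance function, the Lp distance between two d-dimensional keys can be computed as below. The function name `lp_distance` and the sample keys are illustrative.

```python
def lp_distance(k1, k2, p):
    """Lp distance between two keys of equal dimension, for integer p >= 1."""
    if len(k1) != len(k2):
        raise ValueError("keys must have the same dimension")
    if p < 1:
        raise ValueError("p must be a positive integer")
    return sum(abs(a - b) ** p for a, b in zip(k1, k2)) ** (1 / p)

# Hypothetical 2-dimensional keys.
manhattan = lp_distance((0, 0), (3, 4), p=1)
euclidean = lp_distance((0, 0), (3, 4), p=2)
```

A smaller distance means the two keys, and hence the tuples they describe, are more similar.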

Topology • The topology of a PDN can be modelled as a directed graph G(N, E) • A(n) is the set of neighbors of node n • A node maintains a limited amount of information about its neighbors, including • The keys of the tuples maintained at the neighbors • The physical addresses of the neighbors

The processing of the query is completed when all expected tuples in the relevant result set are visited • Access methods • Join, leave for virtual nodes • Forward for using local information to process queries and make forwarding decisions

The small world example • Grid component • Random graph component • Processing of queries (exact, range, kNN) in a topology with high locality

Flat partitioning • SWAM also employs the space partitioning idea: flat partitioning

Query Processing • Exact-Match query processing • Range query processing • kNN Query processing

Data Indexing in Peer-to-Peer DHT Networks ICDCS 2004
