"A Measurement Study of Peer-to-Peer File Sharing Systems"

"A Measurement Study of Peer-to-Peer File Sharing Systems" Stefan Saroiu, P. Krishna Gummadi Steven D. Gribble, "A Measurement Study of Peer-to-Peer File Sharing Systems", Proceedings of the Multimedia Computing and Networking (MMCN), San Jose, January, 2002.

Peer-to-Peer • Membership • Ad-hoc • Dynamic • Goals • Design architecture that encourages cooperation • Examples • Napster • Gnutella • BitTorrent

Locating Files (Napster) • Server maintains index of files contained in connected peers • Peer queries server for file • Server returns peer who has file and then queries other servers • Direct link between peers to transfer file

Locating Files (Gnutella) • Queries are performed by flooding the network • Responses are returned only if the peer has the file, queries are forwarded to neighbors • Files are downloaded via direct link between two peers

Collecting Metrics (Napster) • Napster Crawler • Query for “popular” files using multiple simultaneous connections, maintain list of peers returned by server • For each peer collect metadata from the server • Peers reported bandwidth • Number of files shared • Current number of uploads/downloads • Names and sizes of all files shared • IP address

Collecting Metrics (Gnutella) • Gnutella Crawler • Connect to popular peers • Send ping messages with large TTLs to known peers • Add newly discovered peers based on pong messages • Pong messages include metadata about peer • Number of files • Total size of files

Metadata collected • Bottleneck bandwidth • Latency • Number of shared files • Lifetime/Uptime • Distribution across DNS domains

Approx. 60% of peers have a session duration shorter than 1 hour • Napster peers are up a larger percentage of the time as compared to Gnutella peers • Gnutella –20% of hosts are available >45% of the time • Napster – 20% of hosts are available >80% of the time

Approximately 30% of Napster peers misreport their bandwidth (Modems + ISDN < 64Kbps) • Authors argue that misreporting is one indication that many users in a p2p system are not willing to cooperate.

Gnutella • Approximately 25% of Gnutella peers share no files • 75% of peers share less than 100 files • 7% of peers share more than 1000 files (which is more than the other 93% combined) • Napster • 40%-60% of users share only 5-20% of files

Gnutella network of 1771 peers • 30% of peers randomly removed • 1106 of remaining 1300 nodes still connected • 4% of peers selectively removed (63 best connected peers) • Network becomes highly fragmented

"Characterizing Unstructured Overlay Topologies in Modern P2P File-Sharing Systems" Daniel Stutzbach, Reza Rejaie, and Subhabrata Sen, "Characterizing Unstructured Overlay Topologies in Modern P2P File-Sharing Systems," Networking, 2008.

Gnutella Characteristics in Depth • Previous studies as claimed in the paper: • Lack the accuracy of the captured snapshots. • The crawlers that have been used weren’t fast enough. • Resulting in distorted snapshots • And partial view of the topology. • Simulations are standing on invalid assumption (e.g. power-law distributing). • “Finally, to our knowledge, the dynamics of unstructured P2P overlay topologies have not been studied in detail in any prior work.”

Cruiser: Fast and Accurate Crawler “Cruiser can accurately capture a complete snapshot of the Gnutella network with more than one million peers in just a few minutes”. Faster than any crawler ever built, as claimed. Enabled capturing the overlay dynamics which led to more accurate characterization. Data Set: 18,000 snapshot captured in 11 month period. Weekly intervals and daily random captures.

Some of the Findings WikiPedia As mentioned, the study refutes the power-law distribution of the node degree. Online-like architecture. Limewire v.s BearShare. Reachability: 30-38% are unreachable. Previous studies ignored this factor which forms a non-negligible fraction of the peers

Two-Tier Topology Ultrapeer Top-level overlay Leaf

Fast Crawler vs. Slow Crawler Form Two-piece Power-Law distribution. Why the Power-Law Distribution is a measurement artifact? Slow Crawler: Cruiser with less concurrent connection.

Top-Level Overlay Analysis Peaks at 30 degree Due to the fact of the pre-configuration of Limewire and BearShare. New peers, approaching 30 degree Re-configured or other implementation to have higher degree • The actual distribution is two-piece Power-Law distribution

Leaves Overlay Analysis Limewire: 30 leaves Majority have 3 or less parent < 0.02% 100-3000 parent BearShare: 45 leaves Other: 75 leaves

Reachability • Flood-based Query: New peers are discovered exponentially up to a certain point. • Pair-wise Distance: 60% have a length of 4 as shortest path distance • Effect of two-tier topology on leaves: • One Parent: we get a distribution similar to the Utlrapeers shifted by 2. • More than that: • 50% -> length of 5 (+1). • 50% -> length of 6 (+2).

Overlay Dynamics As the number of peers increases, the uptime (in hours) decreases exponentially.

Overlay Dynamics-2 • What are the causes ? • Protocol Driven: As the peers select their neighbors according to the protocol. • User Driven: Peer participation • Definition of a stable peer: • a peer is stable if it manages to have a connection duration of time t. t=48 hours as in the study.

Internal Connectivity of Stable Core • 88-94% remained connected. • Observations: • Stable core are clustered together • Peers with higher uptime are more biased to establish connections with each others. Stable core: peers with t >= 48. [Excluding the connection between unstable peers]

External Connectivity of Stable Core • Peers (not in the stable core) are following the same behavior. (i.e biased to connect with peers have >= uptime) • This behavior led to form onion-like connections. • The core of the onion is the Stable Core • SC(t) < P1(t) • … • Pn-1(t) < Pn(t)

Overlay Dynamics .. The Main Cause User driven dynamics are the major factor of the overlay dynamics.

"Making Gnutella-like P2P Systems Scalable" Y. Chawathe, S. Ratnaswamy, L. Breslau, N. Lanham, S. Shenker, "Making Gnutella-like P2P Systems Scalable," ACM Sigcomm 2003,Dongyu Qiu and R. Srikant, "Modeling and performance analysis of BitTorrent-like peer-to-peer networks", Proceedings of ACM Sigcomm, 2004.

Review of Gnutella • Uses an unstructured overlay network • Distributed download AND search • Floods query across this overlay with a limited scope • Notorious for poor scaling

Scaling Gnutella • Previously proposed solution: hash table to wide-area file search • The paper advocates maintaining simplicity of unstructured system but with new mechanisms • propose solution using aspects of a system similar to KaZaA and its supernodes model • supernodes have higher bandwidth connectivity • searches are routed to supernodes which hold pointers to peer data

Distributed Hash Tables (structured overlay) • Pros: • Looking up using DHT requires O(log n) steps vs. Gnutella's O(n) steps • Cons: Doesn’t deal well with… • Transience of nodes in P2P network: DHTs require repair operations to preserve efficiency and correctness of routing. • Situations where keyword searches are more prevalent, and important, than exact-match queries • Situations where most queries are for relatively well-replicated files not single copies of files

Gia’s Improvements • All messages exchanged by clients are tagged at their origin with a globally unique identifier • Explicitly accounts for node heterogeneity and capacity constraints • Replace flooding with “biased” random walks • It is not any single of the following components, but in fact, the combination of them all that provides GIA's large performance advantage…

Dynamic topology adaptation: • Puts most nodes within short reach of high capacity nodes and makes sure that the well-connected high-degree nodes can actual handle the large number of queries by calculating “satisfaction level” (discussed later) • Active flow control: • Avoid overloading by assigning flow-control tokens based on "available capacity" • Dropping queries not an option in Gia where random walks are taken instead of floods • A node that advertises high capacity to handle incoming queries is in turn assigned more tokens for its own outgoing queries.

One-hop replication: of pointers to content offered by immediate neighbors • This way a node can also answer queries for its neighbors (resulting in fewer hops) • When a node goes offline, its information is flushed from its neighbors to maintain consistency and accuracy • Search protocol: • Based on biased random walks that directs queries towards high-capacity nodes (instead of purely random) • TTLs (and MAX_RESPONSES) bound duration of the biased random walks and book-keeping avoids redundant paths

“Satisfaction Level” • A measure of how close the sum of the capacities of all of a node's neighbors (normalized by their degrees) is to the node's own capacity. • To add a new neighbor, a node randomly selects a small number of candidate entries. From these randomly chosen entries, it selects the node with maximum capacity greater than its own capacity. If no such candidate entry exists, it selects one at random. (see Algorithm 1)

Nodes with low satisfaction levels perform topology adaptation more frequently than satisfied nodes. I = T x K^(1-S)S : Satisfaction level (see algorithm 2)I : Adaptation interval T : maximum interval between adaptation iterationsK : aggressiveness of the adaptation. • After each interval I, if a node's S < 1.0, it attempts to add a new neighbor. If S = 1.0, it still continues to iterate through the adaptation process, checking its satisfaction level every T seconds.

Algorithms(regarding Satisfaction)

Gia, Flood, Random Walk over Random Topology, and Supernode models. • When the query load increases, we notice a sharp "knee" in the curves beyond which the success rate drops sharply and delays increase rapidly. The hop-count holds steady until the knee-point and then decreases.

Gia, Flood, Random Walk over Random Topology, and Supernode models. • Compare by observing Collapse Point (CP) and Hop-count before Collapse Point (CP-HC)

Single Search Responses • Aggregate System Capacity of Gia is 3 to 5 orders of magnitude higher than Flood and Random Walk Random Topology. • RWRT performs better than Flood typically but can be about the same when there are fewer nodes since RWRT may end up visiting practically all the nodes anyway. • Flood and Supernode models retain low hop counts whereas Gia may sometimes need to traverse quite a bit. RWRT, being random, has hop counts that are inversely proportional to the replication factor. • Flood and Supernode models have a collapse point that falls as the number of nodes increases due to the greater query load at each node.

Multiple Search Responses • MAX_RESPONSES parameter only affect Gia and RWRT since Flood and Supernode flood throughout the network • Naturally, higher MAX_RESPONSES causes more hop counts in Gia and RWRT models. Collapse point is also drops when the replication factor is low. • Note: A search for k responses at r% replication is equivalent to one for a single answer at r/k% replication.

Node failure • When a node leaves the network, its queued queries are resumed by the nodes that originally generated them. • How do we make sure a query isn't lost? • Keep-alive messages (query responses sent back to the originator of the query act as implicit keep-alives). If no actual matches have been found yet, we’ll still just send a dummy query response message.

Next bottleneck:Download process? • The authors believe that the technique of directing towards higher capacity nodes helps alleviate this issue. • To make the above benefit more significant, it would be better to have the higher capacity nodes store more files (rather than simply pointers). • Simple solution would be to have popular files at low capacity nodes replicated to the higher capacity nodes.

"A Measurement Study of Peer-to-Peer File Sharing Systems"