Scalability

Scalability • Optimizing P2P Networks: Lessons learned from social networking • Social Networks • Lessons Learned • Organizing P2P Networks • Gnutella Case Studies • 3 case studies • DHTs • what are they? • example

Social Networks • Stanley Milgram (Harvard professor) – 1967 social networking experiment • How many ‘social hops’ would it take for a message to traverse the US population (200 million)? • Posted 160 letters to randomly recruited people in Omaha, Nebraska • Asked them to try to pass these letters to a stockbroker working in Boston, Massachusetts • Rules: • use intermediaries whom they know on a first-name basis • chosen intelligently • make a note at each hop • 42 letters made it in one version of the experiment • Average of 5.5 hops • Demonstrated the ‘small world effect’: suggests that the social network of the United States is indeed connected, with a path length (number of hops) of around 6 – the ‘six degrees of separation’ • Does this mean that it takes 6 hops to traverse 200 million people?

Lessons Learned from Milgram’s Experiment • Social circles are highly clustered • A few members have wide-ranging connections • these form a bridge between far-flung social clusters • this bridging plays a critical role in bringing the network closer together • For example: • a quarter of all letters passed through a local storekeeper • half were mediated by just 3 people • Lessons Learned • These people acted as gateways or hubs between the source and the wider world • A small number of bridges dramatically reduces the number of hops
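The effect of a single bridge can be illustrated with a toy model. The sketch below is not Milgram's data: it assumes a hypothetical ring of 200 people who each know only their two immediate neighbours, then gives one person (the "hub") long-range links to every 20th person and compares the average number of hops before and after. The function names and sizes are illustrative choices.

```python
from collections import deque

def ring_lattice(n, k=1):
    """Each node links to its k nearest neighbours on either side of a ring."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for d in range(1, k + 1):
            adj[i].add((i + d) % n)
            adj[(i + d) % n].add(i)
    return adj

def add_hub(adj, hub, step):
    """Give one node long-range links to every `step`-th node (a 'bridge')."""
    for j in range(0, len(adj), step):
        if j != hub:
            adj[hub].add(j)
            adj[j].add(hub)

def average_hops(adj):
    """Mean shortest-path length over all ordered pairs, using BFS from each node."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

network = ring_lattice(200)
before = average_hops(network)
add_hub(network, hub=0, step=20)
after = average_hops(network)
print(f"average hops without a hub: {before:.1f}, with one hub: {after:.1f}")
```

Only nine extra links are added, yet every pair of nodes can now route through the hub, which is the same role the storekeeper played in the experiment.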

From Social Networks to Computer Networks… • There are a number of similarities to social networks • People = peers • Intermediaries = Hubs, Gateways or Rendezvous Nodes (JXTA speak...) • Number of intermediaries passed through = number of hops • Are P2P Networks Special then? • P2P networks are more like social networks than other types of computer network because they are often: • Self-organizing • Ad-hoc • Clustered based on prior interactions (like we form relationships) • Decentralized in discovery and communication (like we form neighbourhoods, villages, cities etc) • What about social networking sites? • huge – “If Facebook were a country, it would be the eighth most populated in the world, just ahead of Japan, Russia and Nigeria.” • But the application overlay network does not reflect the social network • They use centralized data centers

Peer to Peer: What’s the problem? • Problem: how do we organize peers within ad-hoc, multi-hop, pervasive P2P networks? • a network of self-organizing peers organized in a decentralized fashion • such networks can rapidly expand from a few hundred peers to several thousand or even millions • P2P Environment Recap: • Unreliable environments • Peers connecting/disconnecting, for reasons ranging from network failures to participation • Random failures, e.g. power outages, cable or DSL failures, hackers • Personal machines are much more vulnerable than servers • algorithms have to cope with this continuous restructuring of the network core • P2P systems need to treat failures as normal occurrences, not freak exceptions • they must be designed in a way that promotes redundancy, with the tradeoff of a degradation in performance

Performance Issues in P2P Networks • 3 main factors make P2P networks more sensitive to performance issues: • 1. Communication • Fundamental necessity • Users connect at different connection speeds • Multi-hop • 2. Searching • No central control, so more effort is needed • Each hop adds to total bandwidth • 3. Equal Peers • Free riders create an imbalance in the harmony of the network • This degrades performance for others • Need to get this right and adjust accordingly

Gnutella Studies 1: Free Riding • E. Adar and B.A. Huberman (2000), “Free Riding on Gnutella,” First Monday 5(10), http://firstmonday.org/issues/issue5_10/adar/index.html • Two types of free riding: • users who download files but never provide any files for others to download • users whose shared content is undesirable • They found that 22,084 of the 33,335 peers in the network (66%) share no files • 24,347 (73%) share ten or fewer files • the top 1 percent (333 hosts) account for 37 percent of the total files shared • the top 20 percent (6,667 hosts) share 98% of the files • This shows that, even without Gnutella Reflector nodes, the Gnutella network naturally converges into a centralized + decentralized topology with the top 20% of nodes acting as super peers or reflectors
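The percentages and host counts quoted above follow directly from the raw peer counts. A quick check in Python, using only the figures on this slide (the helper name `percent` is an illustrative choice):

```python
total_peers = 33_335
share_nothing = 22_084
share_ten_or_fewer = 24_347

def percent(part, whole):
    """Share of `whole` represented by `part`, as a percentage."""
    return 100 * part / whole

print(f"share no files:        {percent(share_nothing, total_peers):.1f}%")
print(f"share ten or fewer:    {percent(share_ten_or_fewer, total_peers):.1f}%")
print(f"top 1% of hosts:       {total_peers // 100} hosts")
print(f"top 20% of hosts:      {total_peers // 5} hosts")
```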

Gnutella Studies 2: Equal Peers • Study on Reflector Nodes [clip] www.clip2.com • Studied Gnutella for one month • Noted an apparent scalability barrier when query rates went above 20 per second. Why? • In a network of roughly 1000 nodes, a servent must handle up to 20 queries per second • a 56K dial-up link cannot keep up with this amount of traffic • one slow node connected in the wrong place can grind the whole network to a halt because it becomes a dead end • The network fragments • This is why P2P networks place slower nodes at the edges

Gnutella Studies 3: Communication • Matei Ripeanu, “Peer-to-Peer Architecture Case Study: Gnutella Network”, on-line at: http://people.cs.uchicago.edu/~matei/PAPERS/P2P2001.pdf • Studied the topology of Gnutella over several months and reported two findings: • 1. The Gnutella network shares the benefits and drawbacks of a power-law structure • networks that organize themselves so that most nodes have few links and a small number of nodes have many • such networks show an unexpected degree of robustness when facing random node failures • but they are vulnerable to attacks: removing a few of the super nodes can have a massive effect on the function of the network as a whole • 2. The Gnutella network topology does not match the underlying Internet topology well, leading to inefficient use of network bandwidth • He gave 2 suggestions: • use an agent to monitor the network and intervene by asking servents to drop/add links to keep the topology optimal • replace the Gnutella flooding mechanism with a smarter routing and group communication mechanism
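To see why flooding is a target for replacement, it helps to count messages. The sketch below gives an upper bound for one flooded query under simplifying assumptions that are not from the study: every node has the same number of connections, no two paths reach the same node (no duplicates are suppressed), and the query stops when its time-to-live (TTL) runs out. The values 4 and 7 in the usage line are illustrative.

```python
def flood_messages(degree, ttl):
    """Upper bound on messages generated by one flooded query.

    Assumes a loop-free overlay in which every node has `degree` links.
    """
    total = 0
    frontier = degree              # messages sent by the originator
    for _ in range(ttl):
        total += frontier
        frontier *= degree - 1     # each recipient forwards to all other neighbours
    return total

print(flood_messages(degree=4, ttl=7))
```

The count grows geometrically with the TTL, so each extra hop multiplies the bandwidth consumed by a single query.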

Gnutella Studies • Gnutella shows properties associated with a power-law distribution • (e.g., with an exponent of 2, nodes with twice as many connections are four times less frequent) • Power-law distributions happen all over the place in nature and society: • word frequency distribution • sizes of meteorites and sand particles • sizes of cities • the Pareto principle (80 – 20 rule): 20% of the population own 80% of the wealth • Zipf distribution (and Zipf-Mandelbrot) – Mandelbrot coined the term fractal • It has re-emerged recently as The Long Tail on the Web
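The "twice the connections, four times less frequent" example corresponds to a power law with exponent 2. Written out, with $P(k)$ the fraction of nodes that have $k$ connections, $C$ a normalizing constant and $\gamma$ the exponent (the value $\gamma = 2$ is implied by the example, not a measured figure for Gnutella):

```latex
P(k) = C\,k^{-\gamma}
\qquad\Longrightarrow\qquad
\frac{P(2k)}{P(k)} = \frac{C\,(2k)^{-\gamma}}{C\,k^{-\gamma}} = 2^{-\gamma}
\;\overset{\gamma = 2}{=}\; \frac{1}{4}
```

The ratio does not depend on $k$, which is why such distributions are also called scale-free.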

The Gnutella Network • The figure below is a view of the topology of a Gnutella network as shown on the web site of LimeWire, a popular Gnutella file-sharing client. Notice how the power-law or centralized-decentralized structure is demonstrated.

Another View of the Gnutella Network

Reflector Nodes • [Figure: a reflector's file index for its connected peers, with entries such as ID0:F1.mp3] • Known as ‘super peers’ – in JXTA these are Rendezvous peers • They cache the file lists of connected users – maintain an index • When a query is issued, the Reflector does not retransmit it - it answers the query from its own memory • Do they remind you of anything?
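The caching behaviour described above can be sketched as a small in-memory index. This is an illustration only, not the Clip2 Reflector implementation: the class name, method names and substring matching are assumptions made for the example.

```python
class Reflector:
    """Toy super peer: indexes the files of connected peers and answers queries locally."""

    def __init__(self):
        self.index = {}  # lower-cased filename -> set of peer ids sharing it

    def register(self, peer_id, filenames):
        """Cache the file list of a newly connected peer."""
        for name in filenames:
            self.index.setdefault(name.lower(), set()).add(peer_id)

    def unregister(self, peer_id):
        """Forget a peer that has disconnected."""
        for peers in self.index.values():
            peers.discard(peer_id)

    def query(self, term):
        """Answer from the local index instead of forwarding the query."""
        term = term.lower()
        return {name: sorted(peers)
                for name, peers in self.index.items()
                if term in name and peers}

reflector = Reflector()
reflector.register("peer-a", ["F1.mp3", "F2.mp3"])
reflector.register("peer-b", ["F1.mp3", "F3.mp3"])
print(reflector.query("f1"))
```

Because the query is answered from memory, the peers behind the reflector never see it, which removes their slow links from the query path.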

Napster = Gnutella? • [Figure: Napster users connecting to duplicated Napster.com servers, compared with Gnutella users connecting to super peers] • Gnutella super peers: 1. Natural? 2. Reflector (clip2.com)

Scalability Through Structure • Gnutella and Kazaa can be classified as ‘unstructured’ networks • the interconnection of nodes is ad-hoc, highly dynamic, and defined independently by each node according to individual requirements • the network settles into a topology with qualities associated with a power-law distribution • A class of P2P systems known as ‘structured’ evolved just after the millennium: • Chord • CAN • Pastry • Tapestry • Generally a form of Distributed Hash Table (DHT)

What are DHTs? • A DHT is a topology that provides similar functionality to a typical hash table. • put(key, value) • get(key) • Peers are buckets in the table • with their own local hash tables • Allows a peer to publish a resource onto a network using a key to determine where the data will be stored (i.e. which peer will receive the data). • Using keys presupposes a logical ‘space’ which the keys map onto. • The key is mapped to the space using a hashing function to ensure equal distribution of resources across the network. • Nodes are responsible for sections of this space.
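The put/get interface and the idea of peers as buckets can be sketched in a few lines. This is a deliberately simplified model, not any particular DHT: it assumes a single process that knows every peer, a 16-bit key space, SHA-1 as the hashing function, and the rule that a key belongs to the first peer at or after its position in the space. All class and method names are hypothetical.

```python
import hashlib

class Peer:
    """One bucket of the table, with its own local hash table."""
    def __init__(self, name):
        self.name = name
        self.store = {}

class ToyDHT:
    def __init__(self, peer_names, bits=16):
        self.space = 2 ** bits  # size of the logical key space
        # Place each peer at a position in the space, ordered by position.
        self.ring = sorted(((self._hash(n), Peer(n)) for n in peer_names),
                           key=lambda entry: entry[0])

    def _hash(self, text):
        """Map a string onto the key space."""
        digest = hashlib.sha1(text.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % self.space

    def _responsible(self, key):
        """The peer responsible for the section of the space containing the key."""
        point = self._hash(key)
        for position, peer in self.ring:
            if position >= point:
                return peer
        return self.ring[0][1]  # wrap around the end of the space

    def put(self, key, value):
        self._responsible(key).store[key] = value

    def get(self, key):
        return self._responsible(key).store.get(key)

dht = ToyDHT(["peer-a", "peer-b", "peer-c"])
dht.put("F1.mp3", "held by peer-b")
print(dht.get("F1.mp3"))
```

A real DHT differs in the important respect that no node has the global view used by `_responsible` here; finding the responsible peer is itself a multi-hop routing problem.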

Why DHTs? • They address the flooding issue without resorting to a centralized + decentralized architecture • Typically a search can be achieved in O(log n) hops, where n is the number of nodes in the network • only a few neighbors need to be known – typically O(log n) • small neighborhoods and a flat topology make for a robust network in which churn is easy to handle

Example: Chord Topology • Divides the key space into a circle • keys are m bits long • the ring can contain up to 2^m nodes • keys range from 0 to 2^m – 1 • Consistent hashing (over a base hash function, e.g. MD5 or SHA-1) is used to distribute keys evenly around the ring • increases the probability of robustness • allows nodes to join and leave without disrupting the network • only an O(1/n) fraction of keys is moved to a different location when a node joins or leaves, where n is the number of nodes • Node IDs are distributed based on the key size and the number of nodes in the network • Each node should be responsible for roughly (number of keys) / (number of nodes) keys
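The claim that a join disturbs only a small fraction of keys can be checked with a small simulation. The sketch below is an illustration rather than the Chord protocol itself: it assumes 16-bit identifiers, SHA-1 as the base hash, made-up node and file names, and a global sorted list of node IDs in place of real routing. Each key is assigned to its successor, the first node at or after the key's position on the ring.

```python
import hashlib
from bisect import bisect_left

ID_BITS = 16  # assumed identifier length for this example

def chord_id(text, bits=ID_BITS):
    """Hash a node name or key onto the identifier circle."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)

def successor(sorted_node_ids, point):
    """First node at or clockwise after `point`, wrapping past the top of the ring."""
    i = bisect_left(sorted_node_ids, point)
    return sorted_node_ids[i % len(sorted_node_ids)]

def assignment(node_names, keys):
    """Map each key to the ID of the node responsible for it."""
    ids = sorted(chord_id(n) for n in node_names)
    return {k: successor(ids, chord_id(k)) for k in keys}

nodes = [f"node-{i}" for i in range(20)]
keys = [f"file-{i}.mp3" for i in range(2000)]

before = assignment(nodes, keys)
after = assignment(nodes + ["node-new"], keys)
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved after one node joined")
```

Every key that moves ends up on the new node; the other nodes keep what they had, which is what lets nodes join and leave without disrupting the rest of the network.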

Chord Finger Tables • Just knowing your predecessor and successor leads to very bad performance • O(n) hops to find a key (n/2 expected) • Chord nodes have a routing (finger) table containing approx. O(log n) nodes • The distance of the nodes in the table increases exponentially • having this many nodes in the finger table means O(log n) hops are needed to find the key • For each query for key k there is a choice of O(log n) nodes. Choose the one whose id is closest to k without passing it
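The routing rule described above can be sketched as a simulation. This is a simplified model under stated assumptions, not a networked implementation: identifiers are 8 bits, the node IDs are arbitrary example values, finger tables are built from a global list of nodes rather than by the join protocol, and a "hop" means forwarding the query to another node. Finger i of node n points to the successor of n + 2^i.

```python
from bisect import bisect_left

M = 8  # assumed identifier bits; the ring has 2**M positions

def in_open(x, a, b):
    """True if x lies in the ring interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b       # interval wraps past zero

def in_half_open(x, a, b):
    """True if x lies in the ring interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b      # interval wraps past zero

class Ring:
    def __init__(self, node_ids):
        self.ids = sorted(node_ids)
        # Finger i of node n is the successor of n + 2**i: distances double each entry.
        self.fingers = {n: [self.successor((n + 2 ** i) % 2 ** M) for i in range(M)]
                        for n in self.ids}

    def successor(self, point):
        """First node at or clockwise after `point` (global view, used for setup and checks)."""
        i = bisect_left(self.ids, point)
        return self.ids[i % len(self.ids)]

    def closest_preceding(self, node, key):
        """Farthest finger of `node` that does not pass `key`."""
        for finger in reversed(self.fingers[node]):
            if in_open(finger, node, key):
                return finger
        return node

    def lookup(self, start, key):
        """Route from `start` using finger tables only; return (owner, hops)."""
        node, hops = start, 0
        while not in_half_open(key, node, self.fingers[node][0]):
            node = self.closest_preceding(node, key)
            hops += 1
        return self.fingers[node][0], hops

ring = Ring([1, 12, 40, 77, 100, 150, 200, 230])
owner, hops = ring.lookup(start=1, key=210)
print(f"key 210 is held by node {owner}, reached in {hops} hops")
```

Because each forwarding step at least halves the remaining distance to the key, the number of hops in this model never exceeds the number of identifier bits.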

DHT Issues • DHTs are structured. Maintaining the structure has overhead. • presupposes equal capabilities in nodes • NOT a power-law distribution • not always possible to have fuzzy or attribute-based queries. It’s a lookup facility – you need to know the key • Searching on a Gnutella network is open ended. • you may get results, you may not. • DHT algorithms are deterministic and designed for lookup • So data going missing is more problematic • replication needs to be employed to ensure data availability

Closing Remarks • Summary • Centralized + Decentralized – understand the progression from the original Gnutella to the new models • The role of Reflector nodes • Structured topologies (DHTs) – efficient lookup without centralization
