
Scalability of Reliable Group Communication Using Overlays


Presentation Transcript


  1. IEEE INFOCOM 2004 Scalability of Reliable Group Communication Using Overlays Sambit Sahu Columbia University, IEOR Dept, NY, USA Presented by Zahid Anwar

  2. Motivation • Reliable group communication has been an important research problem for the last decade • Problems with IP-supported reliable multicast • No widespread deployment of IP multicast • Group throughput vanishes as the group size increases • ALM, the “new kid on the block” • Uses end-system overlays to support group communication • Studies so far focus on protocol development for efficient tree construction and maintenance • No comprehensive study has yet addressed the scalability concerns of • Throughput • Buffer utilization • Latency of content delivery

  3. Contributions • Two decisive results • Group throughput is scalable and depends on the minimum of the local maximum throughputs of the overlay edges • The edges only need to measure the RTT and the packet-marking probability and report them back to the source • Buffer occupancy is scalable provided the sender performs rate control • The sender must either be pessimistic and adopt a low rate for the worst case, or adapt the send rate dynamically

  4. Model: for Reliable Overlay Group Communication • Directed graph • Nodes connected in a tree topology • Each node replicates its parent’s data on each outgoing edge • Infinite buffer size (Figure used from the paper, colored)

  5. Model: TCP Connections in Tandem • The TCP connection from node k to node k+1 is referred to as edge k • Underlying overlay edge k there are H_k routers • The source (node (0,0)) has infinitely many packets to multicast • The m-th packet is available at time T_m • s_m(k, h) = aggregated service time experienced by the m-th packet going through the h-th router of overlay edge k • The TCP congestion control window is characterized by W_m(k) • Node k transmits packet m once packet m − W_m(k) has been received by node k+1 (Figure: K overlay edges in tandem; edge k has routers with service times s_m(k, 0), …, s_m(k, H_k) and congestion window W_m(k).)

  6. Model: TCP Congestion Control • Window size is governed by the AIMD rule of TCP Reno • W_m(k) is assumed independent across overlay edges k • Packet marking and the evolution of the window size are jointly Markovian • Establish linear evolution equations governing the packet departure times • Use them as a recursive way of computing the evolution of the packets in a large tandem

  7. Model: Evolution Equations • x_m(k, h) is the time when router (k, h) has finished forwarding packet m • Rules 1 and 3: “Packet m can’t leave node h of edge k until packet m−1 has cleared it, and packet m itself has cleared the previous node” • Rule 2: “The first node in the k-th overlay edge can’t proceed with packet m if the packet that was W_m(k) packets ago has not cleared edge k” • Rule 4: “The first node in the system can’t proceed with packet m until it has processed packet m−1 and, of course, packet m has been emitted by the source” Note: ∨ denotes max
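The rules above define a max-plus recursion on the packet departure times. The following Python sketch is illustrative only (not code from the paper); the inputs T, s, W, H and the index conventions are assumptions made here for concreteness:

def departure_times(T, s, W, K, H, M):
    # x[(m, k, h)] = time router (k, h) finishes forwarding packet m
    # T[m]: emission time of packet m at the source (assumed given)
    # s[m][k][h]: aggregated service time of packet m at router (k, h)
    # W[m][k]: congestion window of edge k seen by packet m
    # H[k]: index of the last router on edge k
    x = {}
    get = lambda m, k, h: x.get((m, k, h), float("-inf"))
    for m in range(1, M + 1):
        for k in range(1, K + 1):
            for h in range(0, H[k] + 1):
                start = max(
                    get(m - 1, k, h),    # Rules 1/3: packet m-1 has cleared router (k, h)
                    get(m, k, h - 1),    # packet m has cleared the previous router
                )
                if h == 0:
                    if k == 1:
                        start = max(start, T[m])                     # Rule 4: packet m emitted by the source
                    else:
                        start = max(start, get(m, k - 1, H[k - 1]))  # packet m has cleared edge k-1
                    start = max(start, get(m - W[m][k], k, H[k]))    # Rule 2: window constraint on edge k
                x[(m, k, h)] = start + s[m][k][h]
    return x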

  8. Analysis: Dependency Graph • Construct a weighted random graph describing the dependency relations between the state variables • The time at which a particular packet leaves edge k is obtained as the maximum, over all paths from (m, k, h) back to (1, 0, 0), of the total weight along the path

  9. Analysis: Scalability of Throughput for a Chain Topology • Under the mild assumption that the sequence of aggregated service times is stationary and ergodic, the long-term average throughput converges to a constant: Θ^λ_{1,k} ≡ lim_{m→∞} m / D^λ_{m,k} • where λ is the arrival intensity at the source and • D^λ_{m,k} = x_m(k, H_k) is the time when the m-th packet has been transmitted over the k-th overlay edge

  10. Results: Throughput for a Chain • The overall throughput of a chain topology is the minimum of the local maximum throughputs of all the edges • Theorem 2: Under Assumption 1, for all 0 ≤ k ≤ K, Θ^λ_{1,k} = min(λ, θ_1, …, θ_k), i.e., the throughput of the first k nodes is the minimum of the arrival intensity and of the local maximum throughputs of overlay edges 1, …, k • Hence Θ = θ_1 ∧ θ_2 ∧ … ∧ θ_K (where ∧ denotes min)

  11. Results: Generalization to Tree Topologies (uncongested access links) • Assumption 2: The aggregated service times in any router of an overlay edge originating from a node are independent of the number of TCP connections originating from that node • Theorem 3: Under Assumptions 1 and 2, for any arbitrary tree rooted at the source node, Θ^λ_{1,k} = min(θ_1, …, θ_k) • In the core of the Internet there is already a large number of other simultaneous TCP sessions, so each individual session added by the multicast tree has little effect on a router’s behavior

  12. Analysis: Where Does Assumption 2 Get Us? • The end-to-end control of IP-supported reliable multicast means that each node is permanently randomly delayed while waiting for the ACKs of the latest of its offspring nodes • In overlay multicast, in contrast, each line of offspring of a node can progress at its own proper speed; a key decoupling takes place that allows each TCP connection to obtain the long-term average throughput it would get in the absence of the other parts of the tree (Figure: overlay multicast vs. IP multicast)

  13. Analysis: Backing Up Assumption 2 • Locality Assumption: the non-reference transfers originating from end-system k affect the aggregated service times of the reference transfer of overlay edge k only • Natural if nodes are sparse enough that they are all located on different LANs or in different geographical areas • Fairness Assumption: Let s_m(k, h) (resp. s'_m(k, h)) denote the aggregated service time of packet m of the reference transfer on hop h of overlay edge k when the out-degree of end-system k is equal to 1 (resp. M). Then s'_m(k, h) ≤ M · s_m(k, h)

  14. Simulation: Throughput & Buffer Utilization • Local throughput is measured by sending packets on all downstream links at the maximum rate, without waiting for incoming transmissions • Buffer size is not restricted; each buffer entry corresponds to one 100-byte block, and 20,000 blocks were sent • Buffer utilization is the ratio of the maximum number of blocks used in the buffer to the total number of blocks sent during the experiment • Simulation presented without analysis (Figure: measured overlay tree over hosts such as berk-1, berk-2, cmu-1, umn-1, ucsb-1, pisa-1, fermi-1, asterix-1, asterix-2, baobab, ace, ananda-1, edge, b7; each edge is labeled Link throughput (KB/s) / Tree throughput (KB/s) / Buffer occupancy (%).)

  15. Analysis: Necessity of Rate Control at the Source • Theorem 4: If the intensity of the packet arrival process (T_m)_{m∈Z}, denoted λ, is larger than Θ^λ_{1,K}, then there exists at least one station 1 ≤ k ≤ K for which the sojourn time of packet m converges to infinity in probability as m goes to infinity • The sojourn time of packet m in the k-th overlay edge is formally given by D^λ_{m,k} − D^λ_{m,k−1}

  16. Simulation Results with Rate Control • The buffer occupancy at an end system at depth k converges to a stationary value with increasing k when the throttling rate is below θ • Likewise, the packet passage time through an overlay edge at depth k converges with increasing k • Hence neither the buffer occupancy nor the edge passage time (delay) grows without bound with increasing depth

  17. Design: Optimal Tree Construction Algorithm • Forwarding paths should be chosen so that the resulting tree maximizes the local maximum throughput of its bottleneck overlay edge • Sort all edges in increasing throughput order • Discard edges, starting with those of smallest throughput, as long as the set of remaining edges still forms a connected graph • Build a spanning tree rooted at the source using the remaining edges of the sorted list • The resulting tree is optimal (proof by contradiction) (Figure: example graph with edge throughputs 2, 3, 4, 5, 6, 7, 8, 8, 9.) A sketch of this construction follows below.
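The procedure above is essentially a maximum-bottleneck spanning-tree construction. A minimal Python sketch using networkx, with assumed edge/node inputs (not the authors' code):

import networkx as nx

def bottleneck_optimal_tree(nodes, edges, source):
    # edges: list of (u, v, throughput) measured between overlay nodes
    g = nx.Graph()
    g.add_nodes_from(nodes)
    for u, v, thr in edges:
        g.add_edge(u, v, throughput=thr)
    # Discard edges in increasing throughput order while the graph stays connected
    for u, v, thr in sorted(edges, key=lambda e: e[2]):
        g.remove_edge(u, v)
        if not nx.is_connected(g):
            g.add_edge(u, v, throughput=thr)  # removing it would disconnect the graph, so keep it
    # The surviving edges form a spanning tree whose bottleneck throughput is maximal;
    # orient it away from the source to obtain the forwarding tree
    return nx.bfs_tree(g, source)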

  18. Design: Tree Construction When Accounting for a Bottleneck at the Access Link • Models the situation where forwarding nodes are typically connected to the Internet via DSL, cable, or modem links • The decision problem is a generalization of the minimum-degree spanning tree problem (NP-hard) • A heuristic achieves at least ½ of the optimal throughput

  19. Weaknesses • The maximal-throughput tree-building algorithm may perform poorly with respect to delay

  20. Design: Solution Strategy • Fix a target group throughput θ • Remove from the network G the links that have throughput less than θ • Call the new graph G'_θ = (V, E'_θ), where E'_θ = {(i, j) ∈ E : θ_ij ≥ θ} • With θ fixed, the constraint on node throughput for each node i can be treated as a degree constraint, allowing at most floor(c_i / θ) outgoing links per node • Use binary search to find the largest value of θ for which such a tree can be constructed (see the sketch below)
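A hedged sketch of the binary search over θ, assuming a feasibility oracle can_build_tree (for example the degree-constrained spanning-tree heuristic of the next slides); the names and structure are illustrative, not the paper's code:

import math

def best_theta(edges, capacity, source, can_build_tree):
    # edges: list of (i, j, throughput); capacity[i]: access-link capacity c_i of node i
    candidates = sorted({thr for _, _, thr in edges})  # only the edge throughputs matter as targets
    lo, hi, best = 0, len(candidates) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        theta = candidates[mid]
        pruned = [(i, j, t) for (i, j, t) in edges if t >= theta]              # G'_theta
        degree_bound = {i: math.floor(capacity[i] / theta) for i in capacity}  # at most floor(c_i / theta) out-links
        if can_build_tree(pruned, degree_bound, source):
            best, lo = theta, mid + 1   # feasible: try a larger target throughput
        else:
            hi = mid - 1
    return best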

  21. Design: Bounding Constraint Violation • The approximation algorithm constructs, in polynomial time, a tree in G'_θ such that the degree constraints are violated by at most 1 at each node, provided there exists a spanning tree satisfying all degree constraints implied by throughput θ • Construct an arbitrary spanning tree in G'_θ using a simple algorithm such as depth-first search • Compute the set B ⊆ V of all nodes with the maximum degree-constraint violation, and try to reduce the cardinality of B by performing a series of improvements

  22. Design: Defining an Improvement • Suppose the maximum degree violation in the tree is k • Add an edge connecting two nodes with degree violation at most k − 2, and break the resulting cycle by removing one edge incident to a node in B • This reduces the degree violation of one of the nodes in B from k to k − 1 • The algorithm performs improvements until no improvement is possible, or until B is empty • When B is empty, rebuild B with violation k − 1, and repeat (Figure: example tree illustrating one improvement step.)

  23. TCP Slow Start • Algorithm: initialize CongWin = 1; for each segment ACKed, CongWin++; until (loss event OR CongWin > threshold) • The congestion window doubles every RTT (exponential increase) • Loss event: timeout (Tahoe TCP) and/or three duplicate ACKs (Reno TCP) (Figure: Host A / Host B timeline sending one segment, then two, then four segments per RTT.)

  24. TCP After Slow Start • /* slow start is over: CongWin > threshold */ • until (loss event) { every CongWin segments ACKed: CongWin++ } • threshold = CongWin / 2 • if (loss detected by timeout) { CongWin = 1; perform slow start } • if (loss detected by triple duplicate ACK) CongWin = CongWin / 2 (Figure: congestion-window size (segments) vs. transmission round for TCP Tahoe and TCP Reno, with the threshold marked.)
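To make the window dynamics of the two preceding slides concrete, here is a toy Python simulation (my own illustration, not from the slides) that combines slow start, congestion avoidance, and the Tahoe/Reno loss reactions:

def evolve_window(events, threshold=8, max_rounds=20):
    # events: dict mapping round number -> "timeout" or "dupack"
    congwin, trace = 1, []
    for rnd in range(max_rounds):
        trace.append(congwin)
        loss = events.get(rnd)
        if loss == "timeout":          # Tahoe-style reaction: back to slow start
            threshold, congwin = max(congwin // 2, 1), 1
        elif loss == "dupack":         # Reno-style reaction: halve the window
            threshold = max(congwin // 2, 1)
            congwin = threshold
        elif congwin < threshold:      # slow start: double every round (RTT)
            congwin *= 2
        else:                          # congestion avoidance: add one segment per round
            congwin += 1
    return trace

# Example: a triple-duplicate-ACK loss in round 8
print(evolve_window({8: "dupack"}))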

  25. Random Early Detection • Routers mark packets (network-assisted congestion control, as opposed to end-to-end congestion control) • W_m^k: window size seen by packet m at node k • r_m^k: the counter triggering window-size increments (position of the packet within the current window) • Consider the Markov chain (W_m^k, r_m^k) ∈ {(w, r) ∈ {1, 2, …, W_max}² : r ≤ w}

  26. Transitions • From (w, r) with r > 1, the next state is • (w, r − 1) with probability 1 − p_k • (w/2 ∨ 1, w/2 ∨ 1) with probability p_k • From (w, 1), the next state is • ((w + 1) ∧ W_max, (w + 1) ∧ W_max) with probability 1 − p_k • (w/2 ∨ 1, w/2 ∨ 1) with probability p_k Note: ∨ denotes max and ∧ denotes min
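A minimal sketch simulating the (W, r) chain under a constant marking probability p_k (a simplifying assumption made here for illustration; this is not the paper's code):

import random

def simulate_window(p_k, w_max=16, steps=10, seed=0):
    rng = random.Random(seed)
    w, r = 1, 1
    trace = [(w, r)]
    for _ in range(steps):
        if rng.random() < p_k:        # packet marked: multiplicative decrease
            w = r = max(w // 2, 1)
        elif r > 1:                   # not marked, window not yet exhausted: count down
            r -= 1
        else:                         # not marked, full window acknowledged: increase
            w = r = min(w + 1, w_max)
        trace.append((w, r))
    return trace

print(simulate_window(p_k=0.1))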

  27. IEEE Journal on Selected Areas in Communication (JSAC) 2002 SCRIBE: A large-scale and decentralized application-level multicast infrastructure Miguel Castro, Anne-Marie Kermarrec, Antony Rowstron Microsoft Research, Cambridge, UK Peter Druschel Rice University, USA Presented by Zahid Anwar

  28. Motivation • The use of multicast in applications has been limited by the lack of wide-scale deployment and the issue of how to track group membership • Application-level multicast has grown popular but needs to address problems of • Scalability • Fault tolerance • Self-organization • Delay

  29. Motivation: Unicast vs. IP Multicast vs. ALM • Unicasting: duplication at the sender (4 copies in the figure) • IP multicasting: duplication at routers • Application-level multicasting (ALM): duplication at end hosts

  30. Description: Scribe is a Publish/Subscribe Event-Notification Application • Decentralized peer-to-peer model • A multicast tree is formed by joining the Pastry routes from each group member to a rendezvous point associated with the group (Figure: publishers and subscribers connected through a topic of interest.)

  31. Description: Pastry, a Peer-to-Peer Location and Routing Substrate • Routes messages to the nodeId numerically closest to the destination key in fewer than log_{2^b}(N) steps • Focuses not only on minimizing the number of overlay hops but also on the delay experienced • “The route chosen for a message is likely to be good with respect to the proximity metric” (Figure: node 1122 routes a message to key 3000; the first hop, to 3032, fixes the first digit (3), the second hop, to 3001, fixes the second digit (30), and the route ends at 3001, the closest live node to 3000.)
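A simplified sketch of Pastry-style prefix routing over digit strings. It assumes global knowledge of the live node set and always jumps to the best-known node, whereas real Pastry consults a per-node routing table and leaf set, so the exact hop sequence differs:

def shared_prefix_len(a, b):
    n = 0
    while n < len(a) and a[n] == b[n]:
        n += 1
    return n

def route(key, start, nodes):
    # nodes: set of nodeId strings, all of the same length and base as the key
    path = [start]
    while True:
        current = path[-1]
        p = shared_prefix_len(current, key)
        dist = lambda n: abs(int(n) - int(key))
        # prefer a node sharing a strictly longer prefix with the key (routing-table hop) ...
        candidates = [n for n in nodes if shared_prefix_len(n, key) > p]
        # ... otherwise a node with an equally long prefix that is numerically closer (leaf-set hop)
        if not candidates:
            candidates = [n for n in nodes
                          if shared_prefix_len(n, key) >= p and dist(n) < dist(current)]
        if not candidates:
            return path               # current node is the closest live node this sketch can reach
        path.append(min(candidates, key=dist))

# Example: node 1122 routing toward key 3000
print(route("3000", "1122", {"1122", "3032", "3001", "3321", "2222"}))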

  32. Design: Creating/Joining a Scribe Multicast Group • The creator creates a group by computing GroupId = hash(group name + creator name) • It routes a CREATE message with key = GroupId • The node with the id closest to the GroupId becomes the root • Joining nodes send a JOIN message with key = GroupId
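A minimal sketch of this create/join flow, assuming a Pastry-like route(key, message) primitive (a hypothetical helper, not the actual Scribe API):

import hashlib

def group_id(group_name, creator_name):
    digest = hashlib.sha1((group_name + creator_name).encode()).digest()
    return int.from_bytes(digest, "big")            # 160-bit key in Pastry's id space

def create_group(route, group_name, creator_name):
    gid = group_id(group_name, creator_name)
    route(gid, {"type": "CREATE", "group": gid})    # the node with id closest to gid becomes the root
    return gid

def join_group(route, gid, member_id):
    # the JOIN is intercepted by nodes along the Pastry route, which become forwarders in the tree
    route(gid, {"type": "JOIN", "group": gid, "member": member_id})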

  33. Design: Joining a Group (Subscribing) and Multicasting a Message (Publishing) • A forwarder may or may not itself be part of the group (Figure: multicast tree over nodes A:1100, B:1000, C:1011, D:1101, E:1001, F:0100, G:0111, H:1111, showing an Event being published and forwarded down the tree.)

  34. SCRIBE: Reliability and Fault Tolerance • Uses TCP for message delivery and flow control • On failure, uses Pastry to repair the tree • Forwarder failure (figure) • The parent sends heartbeat messages to its children • Multicast messages act as implicit heartbeats • A child sends a JOIN message to find a new parent when its parent fails • Root failure • The root state is replicated across the k nodes closest to the root node (its leaf set)

  35. Infocom 2004 Peer-to-Peer Support for Massively Multiplayer Games Bjorn Knutsson, Honghui Lu, Wei Xu, Bryan Hopkins Presented by Zahid Anwar

  36. Motivation • Games like Lineage have recorded 2 million registered players and 180k concurrent players in one night • Traditionally, online games use a client-server architecture and achieve scalability using server clusters • Lacks flexibility • Has to be over-provisioned to handle peak loads • Difficult to allow user-designed game extensions • Proposes the use of P2P overlays to support massively multiplayer games (MMGs) • Primary contributions of the paper: • Architectural (P2P for MMGs) • Evaluative

  37. Design: The MMG Sits on Top of the Scribe/Pastry Layers • MMG game • SCRIBE (multicast support) • PASTRY (P2P overlay)

  38. Design: Architecture of the Game • Thousands of players co-exist in the same game world • Most MMGs are role-playing games (RPG), real-time strategy games (RTS), or hybrids • Examples: EverQuest, Ultima Online, The Sims Online • The world is made up of • Immutable landscape information (terrain) • Characters controlled by players • Mutable objects (food, tools, weapons) • Non-player characters (NPCs) controlled by automated algorithms • The world is divided into regions

  39. Design: Assumptions • The game design is based on the facts that • Players have limited movement speed • Players have limited sensing capability • Hence the data shows temporal and spatial locality • Use interest management • Limit the amount of state a player has access to • Players in the same region form an interest group • State updates relevant to the group are disseminated only within the group • A player changes group when moving from region to region

  40. Design: Regions and Scribe groups

  41. Design: Coordinators • Use a coordinator-based mechanism for shared objects • Each object is assigned a coordinator • The coordinator resolves conflicting updates and keeps the current value • Players and objects are grouped by region • Regions are mapped to peers using the Pastry key • Each region is assigned an ID • The live node with the closest ID becomes the coordinator • Currently, all objects in a region are coordinated by one node

  42. Design: Replication • Shared-state replication • Lightweight primary-backup scheme to handle failures • Failures are detected using regular game events • The coordinator is dynamically replicated when a failure is detected • At least one replica is kept at all times • The replica is kept at the node M whose id is the next closest to the object key K • If a new node joins whose id is closer to key K than the coordinator’s, it • Forwards to the coordinator • Updates itself • Takes over as coordinator

  43. Experimental Results • Prototype implementation of “SimMud” • Uses FreePastry (open source) • Maximum simulation size constrained by memory to 4000 virtual nodes • Players eat and fight every 20 seconds • Players remain in a region for 40 seconds • Position updates are multicast every 150 ms

  44. Evaluation: Distributions of message rate. Average group size is 10. • Message rates for 1000 and 4000 players with 100 and 400 regions, respectively. • Each node receives between 50 and 120 messages per second • Good Scalability

  45. Experimental Results • Breakdown of message types • 99% of messages are position updates • Region changes take the most bandwidth • The message rate of object updates is higher than that of player-player updates • Object updates are multicast to the region • Object updates are also sent to the replica • Player-player interactions affect only the players involved

  46. Concerns • Objects may not always be assigned to the nodes that are best for those particular objects • Is a simulation on one machine close to reality? • Would a bandwidth requirement of 7.2–22.34 KB/s be sufficient for more complicated games?

  47. SOSP 11/2001 Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility Antony Rowstron Microsoft Research, Cambridge, UK Peter Druschel Rice University, USA Presented by Zahid Anwar

  48. Motivation • Peer-to-peer systems have recently gained popularity because of file-sharing applications, e.g. Napster, Gnutella, and Freenet • PAST (among many other second-generation P2P applications) aims to provide a global storage utility with • strong persistence, high availability, scalability, and security

  49. Introduction • Used for archival storage and content distribution • Not a general-purpose file system • Stores multiple replicas of files • Caches additional copies of popular files • Based on the Pastry routing scheme • Offers persistent storage services for replicated read-only files • Owners can insert/reclaim files • Clients can only look files up

  50. Design: Insertion and Retrieval • Insertion • fileId = secure hash(name, owner’s public key, salt) • The file is stored on the k nodes whose nodeIds are numerically closest to the 128 most significant bits of the fileId • The required storage is debited against the owner’s storage quota • A file certificate is returned • Signed with the owner’s private key • Contains the fileId, a hash of the content, the replication factor, and other fields • Each of the k replica-storing nodes attaches a store receipt • An ACK is sent back after all k nodes have accepted the file • Retrieval • The file is located in log_16 N steps (expected) • Usually the replica nearest to the client is located
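A hedged Python sketch of the insertion naming and replica placement described above; the choice of hash function (SHA-256 here), the flat node list, and the example values are illustrative assumptions, not the PAST implementation:

import hashlib

def make_file_id(name, public_key, salt):
    digest = hashlib.sha256(name + public_key + salt).digest()
    return int.from_bytes(digest[:16], "big")       # keep the 128 most significant bits

def replica_nodes(file_id, node_ids, k):
    # the k nodes whose nodeIds are numerically closest to the fileId store the replicas
    return sorted(node_ids, key=lambda nid: abs(nid - file_id))[:k]

fid = make_file_id(b"report.pdf", b"owner-public-key", b"salt-42")
print(replica_nodes(fid, node_ids=[1, 2**127, 2**128 - 5, fid + 7, fid - 3], k=3))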
