



Presentation Transcript


  1. Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure • Miguel Castro, Peter Druschel, Anne-Marie Kermarrec, and Antony I.T. Rowstron • Presented by Yu Feng and Elizabeth Lynch

  2. Introduction • Application-level multicast • Goals • Scalability • Failure tolerance • Low delay • Effective use of network resources

  3. Pastry • P2P location and routing substrate • Provides: • Scalability • Large numbers of groups • Large numbers of multicast sources • Large numbers of members per group • Self-organization • Peer-to-peer location and routing • Good locality properties

  4. Scribe • Application-level multicast infrastructure • Built on top of Pastry • Takes advantage of Pastry properties • Robustness • Self-organization • Locality • Reliability

  5. nodeId • Each node is assigned a 128-bit nodeId • nodeIds are uniformly distributed • Each node maintains tables that map nodeIds to IP addresses • (2^b − 1) × ⌈log_{2^b} N⌉ + l entries in total • O(log_{2^b} N) messages are required to update this state after a node arrival or failure
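As an illustration, the routing-state size above can be computed directly (a sketch; `pastry_state_size` is a hypothetical helper, not part of Pastry):

```python
from math import ceil, log

def pastry_state_size(n_nodes: int, b: int, l: int) -> int:
    # Routing table: ceil(log_{2^b} N) rows of (2^b - 1) entries each,
    # plus the l-entry leaf set.
    rows = ceil(log(n_nodes, 2 ** b))
    return (2 ** b - 1) * rows + l

# With one million nodes, b = 4, and l = 16, each node keeps
# only 15 * 5 + 16 = 91 entries.
```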

  6. Routing Guarantees • A message with a given key is routed to the live node whose nodeId is numerically closest to the key • In a network of N nodes, the average route to any node takes fewer than ⌈log_{2^b} N⌉ steps • Delivery is guaranteed unless ⌊l/2⌋ or more nodes with adjacent nodeIds fail simultaneously

  7. Routing Tables • nodeIds and keys are treated as sequences of digits in base 2^b • Each node's routing table has ⌈log_{2^b} N⌉ rows with 2^b − 1 entries per row • Each entry in row n refers to a node whose nodeId matches the present node's nodeId in the first n digits but differs in digit n+1, which takes one of the 2^b − 1 other possible values • Among the eligible nodes, the entry closest to the present node according to the proximity metric is chosen
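The row assignment above can be sketched as a shared-prefix computation (illustrative only; nodeIds here are hex strings with b = 4):

```python
def table_row(node_id: str, other_id: str) -> int:
    # Length of the shared digit prefix = the routing-table row
    # in which an entry for other_id would appear.
    n = 0
    while n < len(node_id) and n < len(other_id) and node_id[n] == other_id[n]:
        n += 1
    return n
```

For example, an entry for `65F0` sits in row 2 of node `65A1`'s table (shared prefix `65`), while `D46A` sits in row 0.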

  8. Leaf Sets • The l/2 numerically closest larger and l/2 closest smaller nodeIds relative to the present node's nodeId • Each node maintains the IP addresses of its leaf-set members

  9. Routing Algorithm • The current node forwards the message to a node whose nodeId shares a prefix with the key at least one digit (b bits) longer than the prefix the current node shares • If no such node is known, it forwards to a node that shares a prefix of the same length but whose nodeId is numerically closer to the key
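A minimal sketch of this routing step, assuming nodeIds are fixed-length hex strings (b = 4) and `candidates` stands in for the node's routing table and leaf set:

```python
def _prefix(a: str, b: str) -> int:
    # Number of leading hex digits a and b share.
    n = 0
    while n < len(a) and a[n] == b[n]:
        n += 1
    return n

def next_hop(current: str, key: str, candidates: list) -> str:
    # Prefer a known node whose shared prefix with the key is longer
    # than the current node's.
    cur = _prefix(current, key)
    longer = [c for c in candidates if _prefix(c, key) > cur]
    if longer:
        return max(longer, key=lambda c: _prefix(c, key))
    # Otherwise, a node with the same prefix length but numerically
    # closer to the key.
    closer = [c for c in candidates
              if _prefix(c, key) == cur
              and abs(int(c, 16) - int(key, 16)) < abs(int(current, 16) - int(key, 16))]
    if closer:
        return min(closer, key=lambda c: abs(int(c, 16) - int(key, 16)))
    return current  # current node is numerically closest: deliver here
```

For instance, routing key `D46A` from node `65A1` picks `D4C0` (two shared digits) over `D13D` (one) or `6FC2` (none).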

  10. Locality • Pastry routes using a network proximity metric • Locality properties relevant to Scribe: • Short routes: in simulations, routes are 1.59 to 2.2 times the direct distance between source and destination • Route convergence: in simulations, the average distance traveled by two messages sent to the same key before their routes merge is approximately equal to the distance between the two source nodes

  11. Node Addition • New node X picks a nodeId • X contacts a nearby node A • A routes a special message with X's nodeId as the key • The message is routed to the node Z whose nodeId is numerically closest to X's • If X == Z, X must choose a new nodeId • X obtains its leaf set from Z • X obtains the ith row of its routing table from the ith node traversed on the route from A to Z • X notifies the appropriate nodes that it is now alive

  12. Node Failure • Neighboring nodes in nodeId space periodically exchange keep-alive messages • If a node is silent for a period of time T, it is presumed failed • All members of the failed node's leaf set are notified; they remove the failed node from their leaf sets and repair them

  13. Node Recovery • Contacts all the nodes in last known leaf set • Obtains their leaf sets • Updates its leaf set • Notifies members of new leaf set

  14. Pastry API • nodeId=pastryInit(Credentials) • Causes local node to join existing Pastry network or start a new one • route(msg, key) • Routes msg to the node with nodeId numerically closest to key • send(msg, IP-addr) • Sends msg to the node at IP-addr

  15. Required Pastry Functions • deliver(msg, key) • Called when msg is received and the local node's nodeId is the closest to key among all live nodes • Also called when a msg transmitted via send() arrives at the local node's IP address • forward(msg, key, nextId) • Called just before msg is forwarded to the node with nodeId = nextId • The application can change the msg content or the nextId value • If the application sets nextId to NULL, msg terminates at the local node • newLeafs(leafSet) • Called whenever there is a change in the leaf set
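The callback interface above can be sketched as a skeleton class (method names follow the slides; the bodies are placeholders, not Pastry's implementation):

```python
class PastryApp:
    """Skeleton of the application callbacks a Pastry node invokes."""

    def deliver(self, msg, key):
        # msg arrived at the node responsible for key (or via send()).
        pass

    def forward(self, msg, key, next_id):
        # Inspect or modify msg / next_id before forwarding;
        # returning next_id = None terminates routing at this node.
        return msg, next_id

    def new_leafs(self, leaf_set):
        # Leaf-set membership changed.
        pass
```

Scribe itself is implemented by overriding `deliver` and `forward`, as the following slides describe.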

  16. Scribe Overview • Multicast application framework built on top of Pastry • Any Scribe node may create a group • Other nodes can join the group and multicast to all of its members • Delivery is best effort; ordered delivery is not guaranteed

  17. How? • A group is formed by building a multicast tree from the union of the Pastry routes from each group member to a rendezvous point (the root of the tree) • Multicast messages are sent to the rendezvous point for distribution • Pastry and Scribe are fully decentralized: all decisions are based on local information • This provides reliability and scalability

  18. Multicast Tree • Scribe creates a multicast tree rooted at the rendezvous point • Scribe nodes that are part of a multicast tree are called forwarders • Forwarders may or MAY NOT be members of the group • Each forwarder maintains a children table with an entry (IP address and nodeId) for each of its children in the multicast tree

  19. Scribe API • create(credentials, groupId) • Creates a new group using the credentials to control future access • join(credentials, groupId, messageHandler) • Join a group with the specified groupId • leave(credentials, groupId) • Leave a group with the specified groupId • multicast(credentials, groupId, message) • Multicast the specified message to the group with specified groupId

  20. Scribe Implementation: Creating a Group • A Scribe node asks Pastry to route a CREATE message using the groupId as the key [e.g., route(CREATE, groupId)] • Pastry delivers the CREATE message to the node whose nodeId is numerically closest to the groupId • Scribe's deliver method is invoked; it checks the credentials to ensure the group may be created and adds the new groupId to the list of groups the node knows • This node becomes the rendezvous point for the newly created group

  21. Scribe Implementation: Joining a Group • The joining node asks Pastry to route a JOIN message with the groupId as the key [e.g., route(JOIN, groupId)]; the message is routed toward the rendezvous point • At each node along the route, Pastry invokes Scribe's forward method, which checks whether the node is already a forwarder for the group • If it is a current forwarder, it adds the joining node as a child • If it is NOT a current forwarder, it creates a children table for the group, adds the joining node as a child, and then sends its own JOIN message with groupId as the key [e.g., route(JOIN, groupId)] • Finally, it terminates the JOIN message it received from the source
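The per-node JOIN handling above can be sketched as follows (a simplification; `route_toward` is a hypothetical stand-in for Pastry's route()):

```python
class ScribeForwarder:
    """Sketch of Scribe's JOIN handling at one node on the route."""

    def __init__(self, node_id, route_toward):
        self.node_id = node_id
        self.children = {}               # groupId -> set of child nodeIds
        self.route_toward = route_toward  # callable(msg, groupId)

    def on_join(self, group_id, child_id):
        # Returns True if a fresh JOIN was propagated toward the root.
        if group_id in self.children:
            # Already a forwarder: just record the new child.
            self.children[group_id].add(child_id)
            return False
        # Not yet a forwarder: become one and join the tree ourselves.
        self.children[group_id] = {child_id}
        self.route_toward(("JOIN", self.node_id), group_id)
        return True
```

Either way, the incoming JOIN stops here, which is what keeps the tree's fan-out local.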

  22. Scribe Implementation: Leaving a Group • The leaving node records locally that it has left the group • If its children table for the group is empty, it sends a LEAVE message to its parent • The parent removes the leaving node and repeats the previous step, so the LEAVE propagates up the tree until a node with a non-empty children table is reached

  23. Multicasting a Message • The source uses Pastry to locate the rendezvous point for the group [e.g., route(MULTICAST, groupId)] and asks it to return its IP address • The source caches the IP address and uses it for future multicasts • If the rendezvous point changes or fails, the source uses Pastry again to find the new one • All multicast messages are disseminated from the rendezvous point down the tree

  24. Scribe Implementation

  25. Reliability of Scribe: Repairing the Tree • Periodically, each non-leaf node sends a heartbeat message to all of its children • When a child does not receive a heartbeat within a certain period, it presumes its parent has failed and sends a JOIN message with the group's identifier • Pastry routes the message to a new parent, repairing the multicast tree
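The child-side check above reduces to a simple timeout test (a sketch; the timeout value and clock source are assumptions, not from the paper):

```python
def needs_repair(last_heartbeat: float, timeout: float, now: float) -> bool:
    # True when the parent is presumed failed and a JOIN should be
    # re-routed toward the group's rendezvous point.
    return now - last_heartbeat > timeout
```

When this returns True, the node simply re-issues route(JOIN, groupId) and lets Pastry pick the new parent.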

  26. Reliability of Scribe: Failure of the Rendezvous Point • The state of the rendezvous point is replicated across the k nodes with nodeIds closest to the root's (a typical value of k is 5) • These k nodes are all children of the root • When the root fails, its immediate children detect the failure and rejoin through Pastry • Pastry routes the new JOIN messages to a new root (the live node with the nodeId numerically closest to the groupId), which takes over as the rendezvous point

  27. Reliability of Scribe • Children table entries are discarded unless the child sends an explicit message stating that it wants to remain in the table • The tree repair mechanism scales well: • Fault detection sends messages to only a small number of nodes • Recovery from faults is local and involves only a small number of nodes (O(log_{2^b} N))

  28. Scribe: Providing Additional Guarantees • Scribe provides reliable, ordered delivery of multicast messages only if the TCP connections along the tree do not fail • Scribe offers a simple mechanism that allows applications to implement stronger reliability guarantees: • forwardHandler(msg): invoked by Scribe before the node forwards a multicast message to its children • joinHandler(msg): invoked by Scribe after a new child is added to one of the node's children tables • faultHandler(msg): invoked by Scribe when a node suspects its parent is faulty

  29. Additional Reliability Example • forwardHandler • The root assigns a sequence number to each message • Multicast messages are buffered by the root and by each node in the multicast tree • Messages are retransmitted after the multicast tree is repaired • faultHandler • Adds the last sequence number n delivered at the node to the JOIN message sent out to repair the tree • joinHandler • Retransmits the buffered messages with sequence numbers above n to the new child

  30. Experimental Setup • Randomly generated network topology with 5050 routers • Scribe was run on 100,000 end nodes randomly assigned to routers with uniform distribution • Ten different topologies were generated using different random seeds; results are averaged over all ten • Experiments covered a wide range of group sizes and a large number of groups • Size of the group with rank r: gsize(r) = floor(N × r^(−1.25) + 0.5) • Group membership was selected randomly with uniform distribution
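The group-size distribution can be reproduced directly from the formula above (a Zipf-like distribution with exponent 1.25 over group ranks):

```python
from math import floor

def gsize(r: int, n: int = 100_000) -> int:
    # Size of the group with rank r among n end nodes.
    return floor(n * r ** -1.25 + 0.5)
```

So the rank-1 group contains all 100,000 nodes, the rank-2 group about 42,000, and sizes fall off rapidly from there.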

  31. Delay Penalty • Compare the delay of Scribe multicast against IP multicast • Measure the distribution of delays to deliver a message to each member of a group • Two metrics: • RMD (ratio of the maximum delay using Scribe to the maximum delay using IP multicast) • 50% of groups: less than 1.69 • Max = 4.26 • RAD (the same ratio for average delay) • 50% of groups: less than 1.68 • Max = 2

  32. Node Stress • Stress imposed on end nodes, rather than routers, by maintaining group state and by forwarding and duplicating packets • Measure the number of groups with non-empty children tables and the number of children-table entries per node • In the simulation with 1500 groups: • Non-empty children tables per node: avg = 2.4, max = 40 • Children table entries per node: avg = 6.2, max = 1059

  33. Link Stress Experiment • Link stress was computed by counting the number of packets sent over each link when a message is multicast to each of the 1500 groups • Total number of links: 1,035,295 • Total messages for Scribe: 2,489,824 • Total messages for IP multicast: 758,853 • Mean number of messages per link: • 2.4 for Scribe • 0.7 for IP multicast • Maximum link stress: • 4031 for Scribe • 950 for IP multicast

  34. Bottleneck Remover • When a node detects that it is overloaded, it selects the group that consumes the most resources • It then chooses the child in this group that is farthest away • The parent drops that child by sending it a message containing the children table for the group along with the delay between each child and the parent • When the child receives the message, it does the following: • It measures the delay between itself and each other child in the received children table • It computes the delay between itself and the parent via each of those nodes • Finally, it sends a JOIN message to the node that provides the smallest combined delay
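The dropped child's choice of a new parent amounts to minimizing a combined delay (a sketch; the two delay maps are hypothetical measurements keyed by sibling nodeId):

```python
def best_new_parent(delay_parent_to: dict, delay_child_to: dict) -> str:
    # Pick the sibling s minimizing
    # delay(parent -> s) + delay(s -> dropped child).
    return min(delay_parent_to,
               key=lambda s: delay_parent_to[s] + delay_child_to[s])
```

For example, with parent-to-sibling delays {a: 10, b: 5} and child-measured delays {a: 2, b: 9}, the child rejoins under `a` (combined 12 vs 14).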

  35. Bottleneck Remover Results • The mechanism introduces the potential for routing loops • When a loop is detected, the node sends another JOIN message to generate a new random route • The bottleneck remover bounds the number of entries in a node's children tables at the cost of increased link stress during joins • Average link stress increases from 2.4 to 2.7, and the maximum increases from 4031 to 4728

  36. Scalability with Many Small Groups • 50,000 Scribe nodes • 30,000 Scribe groups with 11 nodes per group • The average number of children-table entries per node is 21.2, compared to only 6.6 for a plain (naïve) multicast • Average link stress: • 6.1 for Scribe • 1.6 for IP multicast • 2.9 for naïve multicast • Scribe's numbers are higher because it creates trees with long paths and no branching

  37. Conclusion • Scribe is a fully decentralized, large-scale application-level multicast infrastructure built on top of Pastry • It is designed to scale to a large number of groups and large group sizes, and supports multiple multicast sources per group • Scribe and Pastry's randomized placement of nodes, groups, and multicast roots balances the load and the multicast trees • Scribe uses a best-effort delivery scheme but can be extended to provide stricter multicast guarantees • Experimental results show that Scribe can efficiently support a large number of nodes and groups and a wide range of group sizes, at acceptable cost compared to IP multicast
