
A Scalable, Commodity Data Center Network Architecture



  1. A Scalable, Commodity Data Center Network Architecture Mohammad Al-Fares Alexander Loukissas Amin Vahdat Presenter: Xin Li

  2. Outline
  • Introduction: Data Center Networks
  • Motivation: Scale, Disadvantages of Current Solutions
  • Fat-Tree: Topology, Multi-path Routing, Addressing, Two-Level Routing, Flow Scheduling, Fault-Tolerance, Power & Heat
  • Evaluation and Results
  • Conclusion

  3. Introduction
  [Figure: virtualized data center hosting model: applications and guest OSes in VMs under client web service control; network links are oversubscribed]

  4. Motivation
  • The principal bottleneck in large-scale clusters is often inter-node communication bandwidth
  • Two existing solutions:
  • Specialized hardware and communication protocols, e.g. InfiniBand, Myrinet (supercomputer environments). Cons: no commodity parts (expensive); the protocols are not compatible with TCP/IP
  • Commodity Ethernet switches. Cons: scale poorly; non-linear cost increase with cluster size; require a high-end core switch and oversubscription (a trade-off)

  5. Oversubscription Ratio
  [Figure: n servers, each with link bandwidth B, sharing an uplink of bandwidth UB]
  Oversubscription Ratio = (B × n) / UB
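  For a concrete example (illustrative numbers, not from the slide): 48 servers with B = 1 Gbps sharing UB = 40 Gbps of uplink capacity give an oversubscription ratio of (1 × 48) / 40 = 1.2:1.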

  6. Current Data Center Topology
  • Edge hosts connect to 1G Top of Rack (ToR) switches
  • ToR switches connect to 10G End of Row (EoR) switches
  • Large clusters: EoR switches connect to 10G core switches
  • Oversubscription of 2.5:1 to 8:1 is typical in design guidelines
  • No story for what happens as we move to 10G at the edge
  • Key challenges: performance, cost, routing, energy, cabling

  7. Data Center Cost

  8. Design Goals
  • Scalable interconnection bandwidth: arbitrary host communication at full bandwidth
  • Economies of scale: commodity switches
  • Backward compatibility: compatible with hosts running Ethernet and IP

  9. Fat-Tree Topology
  • (k/2)² core switches
  • k pods
  • k/2 aggregation switches in each pod
  • k/2 edge switches in each pod
  • k/2 servers in each rack (one rack per edge switch)
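  As a quick cross-check of these counts, here is a minimal Python sketch (mine, not from the slides) that computes the element counts for a fat-tree built from k-port switches:

```python
def fat_tree_sizes(k: int) -> dict:
    """Element counts for a fat-tree built from k-port switches (k must be even)."""
    assert k % 2 == 0, "a fat-tree needs an even switch port count k"
    half = k // 2
    return {
        "pods": k,
        "core_switches": half * half,        # (k/2)^2 core switches
        "aggregation_switches": k * half,    # k/2 per pod
        "edge_switches": k * half,           # k/2 per pod
        "hosts": k * half * half,            # k/2 hosts per edge switch = k^3/4 total
    }

# The k = 4 instance used in the evaluation: 16 hosts, 4 pods, 4 core switches
print(fat_tree_sizes(4))
```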

  10. Fat-tree Topology Equivalent

  11. Routing
  • Between hosts in different pods there are (k/2) × (k/2) equal-cost shortest paths; plain IP routing needs an extension to exploit them
  • Single-path routing vs. multi-path routing
  • Static vs. dynamic

  12. ECMP (Equal-Cost Multi-Path routing)
  • Static flow scheduling: the flow-to-path mapping is fixed by a hash
  • Limits the multiplicity of paths to 8-16
  • Routing table grows multiplicatively with the number of paths, increasing lookup latency
  • Advantages: no packet reordering within a flow, and modern switches support it
  • Hash-threshold variant: extract the source and destination addresses, hash them (CRC16), and forward on the path whose region of the hash space the value falls in
  [Figure: hash space divided into regions 0-4]
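  To make the hash-threshold step concrete, here is a small illustrative sketch (zlib.crc32 truncated to 16 bits stands in for the CRC16 a switch would compute; this is not the authors' code):

```python
import zlib

def ecmp_hash_threshold(src_ip: str, dst_ip: str, num_paths: int) -> int:
    """Pick one of num_paths equal-cost next hops by hash-threshold.

    The hash space is split into num_paths equal regions; the region the hash
    falls into selects the next hop, so all packets of a flow (same src/dst)
    take the same path and are never reordered.
    """
    key = (src_ip + dst_ip).encode()
    h = zlib.crc32(key) & 0xFFFF              # stand-in for the switch's CRC16
    region_size = (0xFFFF + 1) // num_paths
    return min(h // region_size, num_paths - 1)

# Example: a flow between two hosts always maps to the same one of 4 uplinks
print(ecmp_hash_threshold("10.0.1.2", "10.2.0.3", 4))
```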

  13. Two-level Routing Table
  • Routing aggregation
  [Figure: routing aggregation example: destinations 192.168.1.10, 192.168.1.45, 192.168.1.89 are forwarded on port 0; destinations 192.168.2.3, 192.168.2.8, 192.168.2.10 are forwarded on port 1]

  14. Two-level Routing Table: Addressing
  • Uses the 10.0.0.0/8 private IP address block
  • Pod switch: 10.pod.switch.1, where pod ranges over [0, k-1] (left to right) and switch over [0, k-1] (left to right, bottom to top)
  • Core switch: 10.k.i.j, where (i, j) is the switch's position in the (k/2) × (k/2) core grid
  • Host: 10.pod.switch.ID, where ID ranges over [2, k/2+1] (left to right)
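  A minimal sketch of this addressing scheme (the helper names are mine):

```python
def pod_switch_addr(pod: int, switch: int) -> str:
    """Address of a pod (edge or aggregation) switch: 10.pod.switch.1."""
    return f"10.{pod}.{switch}.1"

def core_switch_addr(k: int, i: int, j: int) -> str:
    """Address of the core switch at position (i, j) in the (k/2) x (k/2) grid."""
    return f"10.{k}.{i}.{j}"

def host_addr(pod: int, edge_switch: int, host_id: int) -> str:
    """Address of a host: 10.pod.switch.ID, with ID in [2, k/2 + 1]."""
    return f"10.{pod}.{edge_switch}.{host_id}"

# Example for k = 4: second host under edge switch 0 of pod 2, and core switch (1, 2)
print(host_addr(2, 0, 3))         # 10.2.0.3  (appears in the example on the next slide)
print(core_switch_addr(4, 1, 2))  # 10.4.1.2
```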

  15. Two-level Routing Table
  [Figure: addressing example for a k = 4 fat-tree, showing addresses 10.0.0.1, 10.0.0.2, 10.0.0.3 in pod 0, addresses 10.2.0.2, 10.2.0.3 in pod 2, and core switches 10.4.1.1, 10.4.1.2, 10.4.2.1, 10.4.2.2]

  16. Two-level Routing Table
  • Two-level routing table structure: a primary prefix table whose entries may point into a secondary suffix table
  • Implementation: TCAM (Ternary Content-Addressable Memory), which provides parallel searching and priority encoding
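  The real table lives in TCAM, but the prefix-then-suffix lookup itself can be illustrated in software (a sketch with hypothetical table contents, not the switch implementation):

```python
import ipaddress

def two_level_lookup(dst_ip: str, prefixes, suffixes) -> int:
    """Return the output port for dst_ip.

    prefixes: list of (cidr_prefix, port_or_None), most specific first; a None
              port means "fall through to the suffix table".
    suffixes: list of (host_id_octet, port) matched on the last address byte.
    """
    addr = ipaddress.ip_address(dst_ip)
    for cidr, port in prefixes:
        if addr in ipaddress.ip_network(cidr):
            if port is not None:
                return port                      # terminating prefix
            break                                # non-terminating: use the suffix table
    last_octet = int(dst_ip.split(".")[-1])
    for host_id, port in suffixes:
        if last_octet == host_id:
            return port
    raise LookupError("no matching entry")

# Example (hypothetical pod-switch table for k = 4): intra-pod prefixes, then
# host-ID suffixes that spread inter-pod traffic over the two uplinks (ports 2, 3).
prefixes = [("10.2.0.0/24", 0), ("10.2.1.0/24", 1), ("0.0.0.0/0", None)]
suffixes = [(2, 2), (3, 3)]
print(two_level_lookup("10.2.0.3", prefixes, suffixes))  # 0 (intra-pod)
print(two_level_lookup("10.3.1.2", prefixes, suffixes))  # 2 (uplink chosen by host ID)
```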

  17. Two-level Routing Table: Example
  [Figure: example two-level routing table]

  18. Two-level Routing Table Generation: Aggregation Switches
  [Figure: pseudocode of the routing-table generator for the aggregation (upper pod) switches]
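  A Python sketch of that generator for the upper (aggregation) switches of each pod, following the addressing on slide 14; add_prefix and add_suffix are assumed callbacks that install a single table entry on the named switch, and the port arithmetic is my reading of the paper's algorithm rather than a quote from the slide:

```python
def generate_aggregation_tables(k: int, add_prefix, add_suffix) -> None:
    """Populate routing tables of the upper (aggregation) switches in every pod.

    add_prefix(switch_ip, prefix, port) and add_suffix(switch_ip, suffix, port)
    are assumed helpers that install one table entry each.
    """
    half = k // 2
    for pod in range(k):
        for z in range(half, k):                   # upper switches: k/2 .. k-1
            sw = f"10.{pod}.{z}.1"
            for subnet in range(half):             # terminating prefixes for the own pod
                add_prefix(sw, f"10.{pod}.{subnet}.0/24", subnet)
            for host_id in range(2, half + 2):     # suffixes spread inter-pod traffic
                port = (host_id - 2 + z) % half + half
                add_suffix(sw, f"0.0.0.{host_id}/8", port)
```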

  19. Two-level Routing Table Generation: Core Switches
  [Figure: pseudocode of the routing-table generator for the core switches]
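  The corresponding sketch for the core switches is simpler: each core switch reaches pod x through exactly one port, so its table is just k terminating /16 prefixes (same assumed add_prefix helper as above):

```python
def generate_core_tables(k: int, add_prefix) -> None:
    """Populate routing tables of the (k/2)^2 core switches.

    Core switch 10.k.i.j reaches destination pod x through its port x, so each
    table holds one terminating /16 prefix per pod.
    """
    half = k // 2
    for i in range(1, half + 1):
        for j in range(1, half + 1):
            sw = f"10.{k}.{i}.{j}"
            for pod in range(k):
                add_prefix(sw, f"10.{pod}.0.0/16", pod)
```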

  20. Two-Level Routing Table
  • Avoids packet reordering: traffic diffusion occurs only in the first half of a packet's journey
  • A centralized protocol initializes the routing tables

  21. Flow Classification (Dynamic)
  • Soft state (compatible with the two-level routing table)
  • A flow = packets with the same source and destination IP addresses
  • Avoids reordering within a flow
  • Balances load across outgoing ports
  • Two mechanisms: flow assignment and periodic updating

  22. Flow Classification: Flow Assignment
  • For each incoming packet, compute Hash(Src, Dst)
  • Have we seen this hash value before?
  • No: record a new flow f and assign f to the least-loaded port x
  • Yes: look up the previously assigned port x
  • Send the packet on port x
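  A minimal software sketch of that assignment logic (data structures and names are mine; a real switch would hash the header fields in hardware):

```python
class FlowClassifier:
    """Soft-state flow-to-port assignment, as described on the slide above."""

    def __init__(self, uplink_ports):
        self.uplink_ports = list(uplink_ports)
        self.flow_port = {}                              # flow key -> assigned port
        self.flow_size = {}                              # flow key -> bytes seen
        self.port_load = {p: 0 for p in uplink_ports}    # bytes sent per port

    def assign(self, src: str, dst: str, nbytes: int) -> int:
        key = (src, dst)                                 # src/dst pair identifies a flow
        port = self.flow_port.get(key)
        if port is None:                                 # new flow: least-loaded port
            port = min(self.uplink_ports, key=lambda p: self.port_load[p])
            self.flow_port[key] = port
        self.flow_size[key] = self.flow_size.get(key, 0) + nbytes
        self.port_load[port] += nbytes
        return port                                      # send the packet on this port
```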

  23. Flow Classification: Update
  • Performed every t seconds at every aggregation switch
  • Let D = the difference in aggregate traffic between the most-loaded port pmax and the least-loaded port pmin
  • Find the largest flow f assigned to pmax whose size is smaller than D, then reassign this flow to pmin
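  Continuing the FlowClassifier sketch above, the periodic update could look roughly like this (an illustration of the slide's rule, not the authors' code):

```python
    def rebalance(self) -> None:
        """Run every t seconds: move one flow from the most- to the least-loaded port."""
        pmax = max(self.uplink_ports, key=lambda p: self.port_load[p])
        pmin = min(self.uplink_ports, key=lambda p: self.port_load[p])
        d = self.port_load[pmax] - self.port_load[pmin]
        # Largest flow currently assigned to pmax whose size is smaller than D
        movable = [(size, key) for key, size in self.flow_size.items()
                   if self.flow_port[key] == pmax and size < d]
        if not movable:
            return
        size, key = max(movable)
        self.flow_port[key] = pmin         # reassign flow f to the least-loaded port
        self.port_load[pmax] -= size
        self.port_load[pmin] += size
```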

  24. Flow Scheduling
  • The distribution of transfer times and burst lengths of Internet traffic is long-tailed
  • A small number of large flows dominates the traffic
  • Large flows should be handled specially

  25. Flow Scheduling
  • Eliminates global congestion
  • Prevents long-lived flows from sharing the same links
  • Assigns long-lived flows to different links:
  • An edge switch detects a flow whose size is above a threshold
  • It notifies the central controller
  • The controller assigns this flow to a non-conflicting path
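  A rough sketch of the central controller's bookkeeping (the path representation and names are assumptions; the real scheduler also releases reservations when large flows end):

```python
class CentralScheduler:
    """Tracks which links are reserved by large flows and picks non-conflicting paths."""

    def __init__(self):
        self.reserved_links = set()   # links currently carrying a scheduled large flow
        self.flow_path = {}           # flow id -> assigned path

    def place_large_flow(self, flow_id, candidate_paths):
        """candidate_paths: iterable of paths, each a tuple of (switch, switch) links."""
        for path in candidate_paths:
            if not any(link in self.reserved_links for link in path):
                self.reserved_links.update(path)   # reserve every link on the path
                self.flow_path[flow_id] = path
                return path                        # tell the edge switch to use this path
        return None                                # no non-conflicting path found
```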

  26. Fault-Tolerance
  • Bidirectional Forwarding Detection (BFD) sessions detect link and switch failures
  • Between lower- and upper-layer switches
  • Between upper-layer and core switches
  • With flow scheduling, failure handling is much easier

  27. Failure between upper-layer and core switches
  • Outgoing inter-pod traffic: the local routing table marks the affected link as unavailable and chooses another core switch
  • Incoming inter-pod traffic: the core switch broadcasts a tag to the upper-layer switches directly connected to it, signifying its inability to carry traffic to that entire pod; those upper-layer switches then avoid that core switch when assigning flows destined to that pod

  28. Failure between lower- and upper-layer switches
  • Outgoing inter- and intra-pod traffic from the lower-layer switch: the local flow classifier sets the link's cost to infinity, assigns it no new flows, and chooses another upper-layer switch
  • Intra-pod traffic using the upper-layer switch as an intermediary: the switch broadcasts a tag notifying all lower-layer switches, which check the tag when assigning new flows and avoid the affected switch
  • Inter-pod traffic coming into the upper-layer switch: the switch sends a tag to all of its core switches signifying its inability to carry traffic; the core switches mirror this tag to all other upper-layer switches, which then avoid the affected core switch when assigning new flows
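  As an illustration of the broadcast-tag bookkeeping a switch might keep (a sketch under assumed data structures, not the authors' protocol):

```python
class FailureView:
    """A switch's view of which neighbors to avoid when assigning new flows."""

    def __init__(self):
        self.unusable = set()   # switches currently flagged as unable to carry traffic

    def receive_tag(self, from_switch: str, can_carry: bool) -> None:
        """Handle a broadcast tag announcing (in)ability to carry traffic."""
        if can_carry:
            self.unusable.discard(from_switch)   # BFD session restored, link usable again
        else:
            self.unusable.add(from_switch)

    def next_hop_candidates(self, candidates):
        """Filter candidate next-hop switches when assigning a new flow."""
        return [s for s in candidates if s not in self.unusable]
```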

  29. Power and Heat

  30. Experiment Description: Fat-tree (Click)
  • 4-port fat-tree: 16 hosts, four pods (each with four switches), and four core switches
  • These 36 elements are multiplexed onto ten physical machines, interconnected by a 48-port ProCurve 2900 switch with 1 Gigabit Ethernet links
  • Each pod of switches is hosted on one machine; each pod's hosts are hosted on one machine; the two remaining machines run two core switches each
  • Links are bandwidth-limited to 96 Mbit/s to ensure that the configuration is not CPU-limited
  • Each host generates a constant 96 Mbit/s of outgoing traffic

  31. Experiment Description: Hierarchical Tree (Click)
  • Four machines running four hosts each, and four machines each running four pod switches with one additional uplink
  • The four pod switches are connected to a 4-port core switch running on a dedicated machine
  • 3.6:1 oversubscription on the uplinks from the pod switches to the core switch
  • Each host generates a constant 96 Mbit/s of outgoing traffic

  32. Result

  33. Conclusion
  • Bandwidth is the scalability bottleneck in large-scale clusters
  • Existing solutions are expensive and limit cluster size
  • The fat-tree topology offers scalable routing while remaining backward compatible with TCP/IP and Ethernet

  34. Q&A Thank you
