Data Center Routing – Traffic Engineering

Data Center Routing – Traffic Engineering • Yao Lu, Rui Zhang • ECE 260C VLSI Advanced Topics

Outline • What is routing/traditional routing algorithm • What is a data center • Difference between data center and the Internet • Recent work in data center TE • Open questions/proposals

What is routing

Traditional routing algorithm • RIP (Routing Information Protocol) • IGRP (Interior Gateway Routing Protocol) • EIGRP (Enhanced Interior Gateway Routing Protocol) • OSPF (Open Shortest Path First) • IS-IS (Intermediate System-to-Intermediate System) • BGP (Border Gateway Protocol)

What is a data center • As of 2013, roughly 40% of total Internet traffic goes to Google [1]

Difference between data center and the Internet • Design Goal • latency, reliability, throughput, energy, etc. • Properties • Well-structured topology • Movability of the locations of sources and destinations • Global knowledge of the whole data center network

Recent work • Equal-Cost Multi-Path (ECMP)[7] • Valiant Load Balancing (VLB)[6] • CamCube[5] • Hedera[8] • Joint VM Placement and Routing (JVMPR)[4]

ECMP • Many equal-cost paths going up to the core switches • Only one path down from each core switch • Randomly allocate paths to flows using a hash of the flow
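The hash-based path choice can be sketched as follows. This is a minimal illustration, not a switch implementation: the 5-tuple layout, the path names, and the use of SHA-256 (real switches typically use a cheaper hardware hash) are all assumptions made for the example.

```python
import hashlib

def ecmp_pick_path(flow, paths):
    """Pick one of the equal-cost paths from a hash of the flow identifier.

    Every packet of the same flow hashes to the same path, so packets
    within a flow are not reordered.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(paths)
    return paths[index]

# Hypothetical flow: (src IP, dst IP, protocol, src port, dst port)
flow = ("10.0.1.2", "10.0.3.4", 6, 45123, 80)
paths = ["core-1", "core-2", "core-3", "core-4"]  # hypothetical core switches
chosen = ecmp_pick_path(flow, paths)
```

Because the choice ignores flow size, two large flows can hash onto the same path, which is the collision problem Hedera targets.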

VLB • Goal • Guarantee equal-spread load balancing in a mesh network • Method • Bounce individual packets from a source switch in the mesh off randomly chosen intermediate “core” switches, which then forward those packets to their destination switch.
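A minimal sketch of the two-hop VLB idea, assuming hypothetical switch names and a per-packet random choice of intermediate switch:

```python
import random

def vlb_route(src_switch, dst_switch, core_switches, rng=random):
    """Route a packet via a uniformly random intermediate core switch."""
    intermediate = rng.choice(core_switches)
    return [src_switch, intermediate, dst_switch]

cores = ["core-1", "core-2", "core-3", "core-4"]  # hypothetical names
route = vlb_route("tor-1", "tor-7", cores, random.Random(0))
```

Since the intermediate hop is drawn uniformly, the load on the core switches evens out across many packets regardless of the traffic pattern between source and destination.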

CamCube • 3D Torus Topology • Offers the CamCube API • Lets each service/application design its own routing protocol • Core services • Basic routing algorithm • Link-state-based protocol
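To make the 3D torus concrete, here is a small sketch of neighbor addressing, assuming each server is identified by an (x, y, z) coordinate in a k × k × k torus with wraparound links. The function name is hypothetical and is not part of the CamCube API.

```python
def torus_neighbors(node, k):
    """Return the six neighbors of (x, y, z) in a k x k x k 3D torus."""
    x, y, z = node
    return [
        ((x + 1) % k, y, z), ((x - 1) % k, y, z),
        (x, (y + 1) % k, z), (x, (y - 1) % k, z),
        (x, y, (z + 1) % k), (x, y, (z - 1) % k),
    ]

neighbors = torus_neighbors((0, 0, 0), 3)
```

The modulo arithmetic is what closes each axis into a ring: node (0, 0, 0) is directly linked to (2, 0, 0) as well as (1, 0, 0).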

Hedera • Pipeline: Detect Large Flows → Estimate Flow Demands → Place Flows • Detect Large Flows • Flows that need bandwidth but are network-limited • Estimate Flow Demands • Use max-min fairness to allocate flows between src-dst pairs • Place Flows • Use estimated demands to heuristically find better placement of large flows on the ECMP paths

Hedera • Large Flow Detection • Scheduler continually polls edge switches for flow byte counts • Flows exceeding a bytes-per-second threshold are “large” • > 10% of the host’s link capacity (i.e., > 100 Mbps on a 1 Gbps link)
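The threshold test can be sketched as below. The polling interval, flow identifiers, and counter format are assumptions for illustration; only the 10% rule comes from the slide.

```python
def detect_large_flows(prev_bytes, curr_bytes, interval_s,
                       link_capacity_bps, fraction=0.10):
    """Return IDs of flows whose rate exceeds a fraction of link capacity.

    prev_bytes / curr_bytes map flow ID -> cumulative byte count from
    two consecutive polls taken interval_s seconds apart.
    """
    threshold_bps = fraction * link_capacity_bps
    large = []
    for flow_id, count in curr_bytes.items():
        delta_bytes = count - prev_bytes.get(flow_id, 0)
        rate_bps = delta_bytes * 8 / interval_s
        if rate_bps > threshold_bps:
            large.append(flow_id)
    return large

# Hypothetical counters from two polls 5 s apart on a 1 Gbps link.
prev = {"f1": 0, "f2": 0}
curr = {"f1": 100_000_000, "f2": 1_000_000}
large = detect_large_flows(prev, curr, interval_s=5, link_capacity_bps=1e9)
```

Here f1 averages 160 Mbps and crosses the 100 Mbps threshold, while f2 averages 1.6 Mbps and stays on its default path.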

Hedera • Demand Estimation • Goal • Estimate available bandwidth to allocate • Method • Using max-min fairness, given the traffic matrix of large flows, modify each flow’s size at its source and destination iteratively… • Each sender distributes its bandwidth equally among outgoing flows that are not receiver-limited • Receivers whose incoming demand exceeds capacity reduce the excess equally across incoming flows • Repeat until all flows converge
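The sender/receiver iteration can be sketched as follows. This is a simplified reading of the estimator, assuming every host NIC has capacity 1.0 and flows are given as (src, dst) pairs; the helper names and the round limit are hypothetical.

```python
EPS = 1e-9

def _estimate_source(src, flows, demand, converged):
    """Split the sender's remaining capacity equally among unconverged flows."""
    mine = [i for i, (s, _) in enumerate(flows) if s == src]
    fixed = sum(demand[i] for i in mine if converged[i])
    free = [i for i in mine if not converged[i]]
    for i in free:
        demand[i] = (1.0 - fixed) / len(free)

def _estimate_destination(dst, flows, demand, converged):
    """If the receiver is oversubscribed, cap its receiver-limited flows."""
    mine = [i for i, (_, d) in enumerate(flows) if d == dst]
    if sum(demand[i] for i in mine) <= 1.0 + EPS:
        return
    limited, small_total = set(mine), 0.0
    share = 1.0 / len(mine)
    changed = True
    while changed and limited:
        changed = False
        for i in list(limited):
            if demand[i] < share - EPS:  # flow needs less than the equal share
                small_total += demand[i]
                limited.discard(i)
                changed = True
        if limited:
            share = (1.0 - small_total) / len(limited)
    for i in limited:
        demand[i] = share
        converged[i] = True

def estimate_demands(flows, max_rounds=100):
    """Alternate sender and receiver passes until no demand changes."""
    demand = [0.0] * len(flows)
    converged = [False] * len(flows)
    for _ in range(max_rounds):
        before = list(demand)
        for src in {s for s, _ in flows}:
            _estimate_source(src, flows, demand, converged)
        for dst in {d for _, d in flows}:
            _estimate_destination(dst, flows, demand, converged)
        if all(abs(a - b) < EPS for a, b in zip(before, demand)):
            break
    return demand

flows = [("A", "X"), ("A", "Y"), ("B", "Y"), ("C", "Y")]
demands = estimate_demands(flows)
```

In this example receiver Y is oversubscribed, so its three incoming flows are capped at 1/3 each; sender A then gives the capacity freed on A→Y to A→X, which ends at 2/3.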

Hedera • Demand estimation example (figures): hosts A, X, B, Y, C, shown over alternating passes: Senders, Receivers, Senders, Receivers

Hedera • Flow Placement • Goal • Find a good allocation of paths for the set of large flows, such that the average bisection bandwidth of the flows is maximized • Method • Global First Fit: • Greedily choose path that has sufficient unreserved b/w • Simulated Annealing: • Iteratively find a globally better mapping of paths to flows

Hedera • Global First Fit • When a new flow is detected, linearly search all possible paths from source to destination • Place the flow on the first path whose component links can all fit that flow
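A minimal sketch of the first-fit search, assuming paths are tuples of link names and that `reserved` and `capacity` are per-link dictionaries (all hypothetical names). Returning `None` when nothing fits is an assumption of this sketch; a caller could then leave the flow on its default ECMP path.

```python
def global_first_fit(flow_demand, candidate_paths, reserved, capacity):
    """Reserve the flow on the first path whose links all have room."""
    for path in candidate_paths:
        fits = all(
            reserved.get(link, 0.0) + flow_demand <= capacity[link]
            for link in path
        )
        if fits:
            for link in path:
                reserved[link] = reserved.get(link, 0.0) + flow_demand
            return path
    return None

capacity = {"e1-a1": 1.0, "a1-c1": 1.0, "e1-a2": 1.0, "a2-c2": 1.0}
reserved = {"a1-c1": 0.8}  # an earlier flow already holds most of this link
paths = [("e1-a1", "a1-c1"), ("e1-a2", "a2-c2")]
placed = global_first_fit(0.5, paths, reserved, capacity)
```

The first path is rejected because link a1-c1 would carry 1.3, so the flow lands on the second path and its links are marked as reserved.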

Hedera • Simulated Annealing • 4 specifications • State space • Neighboring states • Energy • Temperature • Simple example: minimizing f(x)

Hedera • State: All possible mappings of flows to paths • Constrained to reduce state space size • Flows to a destination are constrained to use the same core • Neighbor State: Swap paths between 2 hosts • Within the same pod • Function/Energy: Total exceeded b/w capacity • Using the estimated demand of flows • Minimize the exceeded capacity • Temperature: Iterations left • Fixed number of iterations (1000s)
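The four specifications can be put together in a short sketch. It assumes a simplified model in which a flow only loads its pod-to-core uplink and core-to-pod downlink, each with capacity 1.0; the acceptance rule, the `scale` constant, and all host and core names are assumptions for illustration rather than Hedera's exact formulation.

```python
import math
import random

def energy(assignment, flows, pod_of, capacity=1.0):
    """Total demand exceeding capacity, summed over pod<->core links."""
    load = {}
    for src, dst, demand in flows:
        core = assignment[dst]  # all flows to dst share one core
        for link in ((pod_of[src], core, "up"), (pod_of[dst], core, "down")):
            load[link] = load.get(link, 0.0) + demand
    return sum(max(0.0, used - capacity) for used in load.values())

def neighbor(assignment, pod_of, rng):
    """Swap the cores assigned to two destination hosts in the same pod."""
    hosts = list(assignment)
    a = rng.choice(hosts)
    same_pod = [h for h in hosts if h != a and pod_of[h] == pod_of[a]]
    candidate = dict(assignment)
    if same_pod:
        b = rng.choice(same_pod)
        candidate[a], candidate[b] = candidate[b], candidate[a]
    return candidate

def anneal(assignment, flows, pod_of, iterations=1000, scale=50.0, seed=0):
    """Temperature is the number of iterations left."""
    rng = random.Random(seed)
    current, e_current = dict(assignment), energy(assignment, flows, pod_of)
    best, e_best = current, e_current
    for temperature in range(iterations, 0, -1):
        candidate = neighbor(current, pod_of, rng)
        e_candidate = energy(candidate, flows, pod_of)
        # Always accept improvements; accept worse states with decaying odds.
        accept = math.exp(min(0.0, scale * (e_current - e_candidate) / temperature))
        if rng.random() < accept:
            current, e_current = candidate, e_candidate
        if e_current < e_best:
            best, e_best = current, e_current
    return best, e_best

pod_of = {"h1": "p1", "h2": "p1", "h3": "p2", "h4": "p2", "h5": "p3", "h6": "p3"}
flows = [("h1", "h3", 0.8), ("h2", "h5", 0.8)]
start = {"h3": "c1", "h4": "c2", "h5": "c1", "h6": "c2"}  # destination -> core
best, e_best = anneal(start, flows, pod_of)
```

In the starting state both flows cross the p1 uplink to core c1; a single swap inside pod p2 or p3 moves one destination to c2 and brings the exceeded capacity to zero.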


JVMPR • Joint VM Placement and Routing • Goal: Efficient traffic engineering under dynamic arrivals and departures of jobs • One method: Localize traffic through flexible VM placement (affects node utilization) • Another method: Avoid congestion through intelligent routing (affects link utilization) • The two are coupled with each other

JVMPR • Figure 1 (legend: existing VMs; VMs we need to add) • The left structure shows the existing VMs and traffic • The middle structure is a good VM placement with high congestion • The right structure is a worse placement with lower congestion

JVMPR • JVMPR considers placement and routing at the same time • It develops an approximation algorithm that leverages the specific structure of the joint design problem

JVMPR • Placement and Route Selection • Placement: The feasible decision space for VM placement is • Routing: The feasible decision space for routing is

JVMPR • Optimize Resource Utilization • cost_net: Network cost • Measures the congestion • cost_node: Node cost • Operating cost induced by a switch or a machine • Goal: Minimize the total cost

JVMPR • Any problem? • Yes! • The number of jobs is not fixed • Jobs enter or depart the system dynamically • Better way: Online solution • Adapt the static problem setting to a dynamic environment • Key idea: Perform local re-optimization

JVMPR • Online solution algorithm • Upon a new job arrival, assign the new job to one configuration according to the transition probability • Upon a job departure, pick one job and migrate it to new machines according to the transition probability

JVMPR • Why is the dynamic JVMPR solution appealing? • It requires no VM migrations when new jobs arrive and at most one job migration when a job departs • Computing the migration probability requires only local information

JVMPR • Fig. Performance comparison (axes: max core switch utilization, percentage of elephant flows)

JVMPR • What is the price we pay for it? • The approximated Markov chain no longer converges to the exact stationary distribution • But to a neighborhood around it • Needs a lot of computation

Summary

Open questions/proposals • Imperfections of current algorithms • Hedera • Large flow detection is too simple • Demand estimation only considers TCP flows • JVMPR • Demands a lot of computation • It is an approximation • Current algorithms do not take full advantage of the nice features of data centers • Combine topology, movability and VM placement together • Add VM placement consideration into Hedera

References
[1] http://www.forbes.com/sites/timworstall/2013/08/17/fascinating-number-google-is-now-40-of-the-internet/
[2] Moy, John T. OSPF: Anatomy of an Internet Routing Protocol. Addison-Wesley Professional, 1998.
[3] Chen, Kai, Chengchen Hu, Xin Zhang, Kai Zheng, Yan Chen, and Athanasios V. Vasilakos. "Survey on routing in data centers: insights and future directions." IEEE Network 25, no. 4 (2011): 6-10.
[4] Jiang, Joe Wenjie, Tian Lan, Sangtae Ha, Minghua Chen, and Mung Chiang. "Joint VM placement and routing for data center traffic engineering." In INFOCOM, 2012 Proceedings IEEE, pp. 2876-2880. IEEE, 2012.
[5] Abu-Libdeh, Hussam, Paolo Costa, Antony Rowstron, Greg O'Shea, and Austin Donnelly. "Symbiotic routing in future data centers." ACM SIGCOMM Computer Communication Review 41, no. 4 (2011): 51-62.
[6] Farrington, Nathan, George Porter, Sivasankar Radhakrishnan, Hamid Hajabdolali Bazzaz, Vikram Subramanya, Yeshaiahu Fainman, George Papen, and Amin Vahdat. "Helios: a hybrid electrical/optical switch architecture for modular data centers." ACM SIGCOMM Computer Communication Review 41, no. 4 (2011): 339-350.
[7] Hopps, Christian E. "Analysis of an equal-cost multi-path algorithm." (2000).
[8] Al-Fares, Mohammad, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, and Amin Vahdat. "Hedera: Dynamic Flow Scheduling for Data Center Networks." In NSDI, vol. 10, pp. 19-19. 2010.

Thank you!
