
Hedera : Dynamic Flow Scheduling for Data Center Networks


Presentation Transcript


  1. Hedera: Dynamic Flow Scheduling for Data Center Networks • Mohammad Al-Fares • Sivasankar Radhakrishnan • Barath Raghavan • Nelson Huang • Amin Vahdat • Presented by TD • Graphics stolen from original NSDI 2010 slides (thanks Mohammad!)

  2. Easy to understand problem • MapReduce-style DC applications need bandwidth • DC networks have many ECMP paths between servers • Flow-hash-based load balancing is insufficient • Simple key insight/idea • Find flows that need bandwidth and periodically rearrange them in the network to balance load across ECMP paths • Algorithmic challenges • Estimate the bandwidth demands of flows • Find an optimal allocation of network paths to flows

  3. ECMP Paths • Many equal-cost paths going up to the core switches • Only one path down from each core switch • Randomly allocate paths to flows using a hash of the flow • Agnostic to available resources • Long-lasting collisions between long (elephant) flows • [Figure: multiple equal-cost paths between source S and destination D]
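
To make the collision problem concrete, here is a minimal sketch of static hash-based ECMP path selection. The 5-tuple hashing and the function name `ecmp_uplink` are illustrative assumptions, not code from the paper or from any particular switch.

```python
# Minimal sketch of static ECMP: a flow's 5-tuple is hashed once, so every
# packet of the flow takes the same uplink regardless of current load.
import hashlib

def ecmp_uplink(flow_5tuple, num_paths):
    """Pick an uplink index from a hash of the flow's 5-tuple (illustrative)."""
    key = "|".join(str(field) for field in flow_5tuple).encode()
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % num_paths

# Two elephant flows from different hosts can hash to the same uplink and
# then collide for their entire lifetimes.
print(ecmp_uplink(("10.0.0.2", "10.4.0.2", 6, 40001, 5001), 4))
print(ecmp_uplink(("10.0.1.2", "10.4.1.2", 6, 40002, 5001), 4))
```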

  4. Collisions of elephant flows • Collisions possible in two different ways • Upward path • Downward path • [Figure: flows from S1-S4 to D1-D4 colliding on upward and downward paths]

  5. Collisions of elephant flows • An average of 61% of bisection bandwidth is wasted on a network of 27K servers • [Figure: same collision example, hosts S1-S4 and D1-D4]

  6. Hedera Scheduler • Detect Large Flows • Flows that need bandwidth but are network-limited • Estimate Flow Demands • Use max-min fairness to allocate bandwidth between src-dst pairs • Place Flows • Use the estimated demands to heuristically find a better placement of large flows on the ECMP paths • [Figure: control loop of Detect Large Flows, Estimate Flow Demands, Place Flows]
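
The three steps above form a periodic control loop. The sketch below shows one plausible shape of that loop, assuming a hypothetical switch object with `poll_flows()` and a per-flow `install_route()` call, both invented here for illustration rather than taken from the paper.

```python
# A sketch of Hedera's periodic control loop (illustrative, not the authors' code).
import time

def control_loop(edge_switches, estimate_demands, place_flows, period_s=5.0):
    while True:
        # 1. Detect large flows: poll edge-switch byte counters and keep flows
        #    above 10% of the host link capacity.
        large = [f for sw in edge_switches for f in sw.poll_flows()
                 if f.rate_bps > 0.1 * f.link_capacity_bps]
        # 2. Estimate each large flow's "natural" demand (max-min fair shares).
        demands = estimate_demands(large)
        # 3. Place flows with Global First-Fit or Simulated Annealing and push
        #    the resulting paths to the switches.
        for flow, path in place_flows(large, demands).items():
            flow.install_route(path)
        time.sleep(period_s)
```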

  7. Elephant Detection • Scheduler continually polls edge switches for flow byte-counts • Flows exceeding a B/s threshold are “large” • > 10% of a host’s link capacity (i.e. > 100 Mbps) • What if there are only mice on a host? • Default ECMP load-balancing is efficient for small flows

  8. Demand Estimation • Flows can be constrained in two ways • Host-limited (at source, or at destination) • Network-limited • Measured flow rate is misleading • Need to find a flow’s “natural” bandwidth requirement when not limited by the network • Forget the network; just allocate capacity between flows using max-min fairness

  9. Demand Estimation • Given the traffic matrix of large flows, modify each flow’s size at its source and destination iteratively… • Senders equally distribute bandwidth among outgoing flows that are not receiver-limited • Receivers whose capacity is exceeded throttle their incoming flows to equal shares • Repeat until all flows converge (a code sketch follows the worked example below) • Guaranteed to converge in O(|F|) time

  10-13. Demand Estimation • [Figures: worked example with senders A, B, C and receivers X, Y; the sender and receiver steps alternate until the flow demands converge]
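
Below is a compact sketch of the estimator described on slide 9, assuming unit-capacity (1.0) host links and flows given as (src, dst) pairs. The function and variable names are mine, and the convergence bookkeeping is simplified relative to the paper.

```python
def estimate_demands(flows, max_iters=100, eps=1e-6):
    """Iteratively estimate max-min fair demands for a list of (src, dst) flows."""
    demand = {f: 0.0 for f in flows}
    receiver_limited = {f: False for f in flows}
    for _ in range(max_iters):
        previous = dict(demand)
        # Sender step: each source splits its leftover capacity equally among
        # its flows that are not yet receiver-limited.
        for src in {s for s, _ in flows}:
            outgoing = [f for f in flows if f[0] == src]
            fixed = sum(demand[f] for f in outgoing if receiver_limited[f])
            free = [f for f in outgoing if not receiver_limited[f]]
            for f in free:
                demand[f] = max(0.0, 1.0 - fixed) / len(free)
        # Receiver step: a destination whose incoming demand exceeds its
        # capacity caps each flow at an equal share and marks it receiver-limited.
        for dst in {d for _, d in flows}:
            incoming = [f for f in flows if f[1] == dst]
            if sum(demand[f] for f in incoming) > 1.0 + eps:
                share = 1.0 / len(incoming)
                for f in incoming:
                    if demand[f] > share:
                        demand[f] = share
                        receiver_limited[f] = True
        if all(abs(demand[f] - previous[f]) < eps for f in flows):
            break
    return demand

# Example in the spirit of the slides: A and B both send to X, C sends to Y.
print(estimate_demands([("A", "X"), ("B", "X"), ("C", "Y")]))
# -> {('A', 'X'): 0.5, ('B', 'X'): 0.5, ('C', 'Y'): 1.0}
```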

  14. Flow Placement • Find a good allocation of paths for the set of large flows, such that the average bisection bandwidth of the flows is maximized • That is, maximum utilization of the theoretically available b/w • Two approaches • Global First-Fit: Greedily choose the first path that has sufficient unreserved b/w • Simulated Annealing: Iteratively find a globally better mapping of paths to flows

  15. Global First-Fit Scheduler • New flow detected: linearly search all possible paths from S to D • Place the flow on the first path whose component links can fit that flow • [Figure: flows A, B, C placed across core switches 0-3]

  16. Global First-Fit Scheduler • Flows are placed upon detection and are not moved afterwards • Once a flow ends, its entries + reservations time out • [Figure: flows A, B, C pinned to their first-fit paths across cores 0-3]
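
A sketch of the Global First-Fit placement step, under the assumptions that each candidate path is a list of link identifiers, link capacities are normalized to 1.0, and `reserved` tracks bandwidth already promised on each link; all names are illustrative.

```python
# Global First-Fit sketch: reserve the first path that can absorb the flow.
def global_first_fit(flow_demand, candidate_paths, reserved, capacity=1.0):
    """Return the first path with room for `flow_demand`, or None if none fits."""
    for path in candidate_paths:
        if all(reserved.get(link, 0.0) + flow_demand <= capacity for link in path):
            for link in path:                       # reserve on every hop
                reserved[link] = reserved.get(link, 0.0) + flow_demand
            return path
    return None  # no fit found: the flow simply stays on its default ECMP path

# Reservations accumulate, so a second flow of the same size spills to the
# next path once the first path is full.
reserved = {}
print(global_first_fit(0.6, [["e0-a0", "a0-c0"], ["e0-a1", "a1-c2"]], reserved))
print(global_first_fit(0.6, [["e0-a0", "a0-c0"], ["e0-a1", "a1-c2"]], reserved))
```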

  17. Simulated Annealing • Annealing: slowly cooling metal to give it nice properties like ductility, homogeneity, etc. • Heating enters a high-energy state (shakes things up) • Slowly cooling lets the crystalline structure settle into a low-energy state • Simulated Annealing: treat everything as metal • Probabilistically shake things up • Let it settle slowly (gradient descent)

  18. Simulated Annealing • 4 specifications • State space • Neighboring states • Energy • Temperature • Simple example: minimizing f(x) • [Figure: plot of f(x) against x]

  19. Simulated Annealing • State: all possible mappings of flows to paths • Constrained to reduce the state-space size • Flows to a destination are constrained to use the same core • Neighbor state: swap paths between 2 hosts • Within the same pod, • within the same ToR, • etc.

  20. Simulated Annealing • Function/Energy: total exceeded b/w capacity • Using the estimated demands of flows • Minimize the exceeded capacity • Temperature: iterations left • Fixed number of iterations (1000s) • Achieves a good core-to-flow mapping • Sometimes very close to the global optimum • Non-zero probability of accepting a worse state
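
Putting slides 17-20 together, a generic simulated-annealing loop might look like the sketch below. Here `energy_of` stands in for the total-exceeded-capacity computation and `neighbor_of` for the path-swap move, both assumed rather than taken from the paper, and the temperature is simply the number of iterations remaining.

```python
# Generic simulated-annealing sketch specialized by the energy and neighbor
# functions described on the slides (illustrative, not the authors' code).
import math
import random

def anneal(initial_state, energy_of, neighbor_of, iterations=1000):
    state, energy = initial_state, energy_of(initial_state)
    best_state, best_energy = state, energy
    for t in range(iterations, 0, -1):      # temperature = iterations left
        candidate = neighbor_of(state)      # e.g. swap the paths of two flows
        delta = energy_of(candidate) - energy
        # Always accept improvements; accept worse states with a probability
        # that shrinks as the temperature drops.
        if delta < 0 or random.random() < math.exp(-delta / t):
            state, energy = candidate, energy + delta
            if energy < best_energy:
                best_state, best_energy = state, energy
    return best_state, best_energy
```

Occasionally accepting worse states is what lets the search escape local minima of the exceeded-capacity function instead of getting stuck at the first one it finds.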

  21. Simulated Annealing Scheduler • Example run: 3 flows, 3 iterations • [Figure: flows A, B, C and their core-switch assignments (cores 0-3) changing over the iterations]

  22. Simulated Annealing Scheduler • The final state is published to the switches and used as the initial state for the next round • [Figure: final core assignments for flows A, B, C]

  23. Simulated Annealing • Optimizations • Assign a single core switch to each destination host • Incremental calculation of exceeded capacity • Use the previous round’s best result as the initial state
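
The first optimization shrinks the state to a per-destination core assignment, so a neighbor move is just a swap of the cores of two destinations (and only the links those destinations touch need their exceeded capacity recomputed). The sketch below shows the swap move; the dict-from-destination-to-core layout is an assumption for illustration.

```python
import random

def swap_neighbor(core_of):
    """Return a neighboring state: swap the core switches of two destinations."""
    state = dict(core_of)                      # keep the current state intact
    a, b = random.sample(list(state), 2)       # pick two destination hosts
    state[a], state[b] = state[b], state[a]    # exchange their core assignments
    return state
```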

  24. Fault-Tolerance • Link / switch failure • Use PortLand’s fault notification protocol • Hedera routes around failed components • [Figure: scheduler re-routing flows A, B, C across cores 0-3 around a failed component]

  25. Fault-Tolerance • Scheduler failure • Soft state, not required for correctness (connectivity) • Switches fall back to ECMP • [Figure: flows A, B, C falling back to ECMP paths across cores 0-3]

  26. Evaluation

  27. Evaluation: Data Shuffle • 16 hosts: 120 GB all-to-all in-memory shuffle • Hedera achieves 39% better bisection BW than ECMP, 88% of an ideal non-blocking switch

  28. Reactiveness • Demand Estimation: • 27K hosts, 250K flows, converges in < 200ms • Simulated Annealing: • Asymptotically dependent on # of flows + # of iterations • 50K flows and 10K iterations: 11ms • Most of the final bisection BW is reached in the first few hundred iterations • Scheduler control loop: • Polling + Estimation + SA = 145ms for 27K hosts

  29. Limitations • Dynamic workloads, where large-flow turnover is faster than the control loop • The scheduler will be continually chasing the traffic matrix • Need to include a penalty term for unnecessary SA flow re-assignments • [Figure: regions where ECMP vs. Hedera is preferable, plotted by matrix stability (stable to unstable) and flow size]

  30. Conclusions • Simulated Annealing delivers significant bisection BW gains over standard ECMP • Hedera complements ECMP • RPC-like traffic is fine with ECMP • If you are running MapReduce/Hadoop jobs on your network, you stand to benefit greatly from Hedera; tiny investment!

  31. Thoughts [1] • Inconsistencies • Motivation is MapReduce, but Demand Estimation assumes a sparse traffic matrix • Something fuzzy • The Demand Estimation step assumes all bandwidth at a host is available • Forgets the poor mice :’(

  32. Thoughts [2] • Inconsistencies • Evaluation results (already discussed) • Simulation using their own flow-level simulator • Makes me question things • Some evaluation details are fuzzy • No specification of flow sizes (other than the shuffle), arrival rates, etc.

  33. Thoughts [3] • Periodicity of 5 seconds • Sufficient? • Limit? • Scalable to higher b/w?
