Flowlet Switching

Srikanth Kandula, Shan Sinha & Dina Katabi

ISPs Want to Split Traffic Across Multiple Paths
• Load balancing to remove hot spots
• Rebalance traffic when unpredictable events occur (outages, DoS, BGP reroutes, flash crowds, …)
[Figure: traffic split 70% / 30% across two paths; when unpredictable traffic arrives, the split is rebalanced to 30% / 70%]

Much research on balancing and rebalancing load
• But implementation is hard, particularly with dynamic ratios
• Existing schemes either sacrifice accuracy or reorder TCP packets

Problem
• Given the desired split ratios (possibly dynamic)
• Split traffic accurately, at the edge router, without reordering TCP packets

Existing Scheme 1: Packet-Based Splitting
• Assign packets to paths in proportion to the desired ratios
• Reorders TCP packets, which hurts throughput
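To make the reordering problem concrete, here is a minimal sketch of packet-based splitting. The function names, the "furthest behind its target share" assignment rule, and the delay values are illustrative assumptions, not taken from the talk.

```python
def packet_split(n_packets, ratios):
    """Assign each packet to the path furthest behind its target share.

    This assignment rule is an assumption chosen for determinism; any
    per-packet proportional scheme shows the same effect.
    """
    sent = [0] * len(ratios)
    paths = []
    for n in range(1, n_packets + 1):
        # Deficit of path i after n packets: target count minus actual count.
        path = max(range(len(ratios)), key=lambda i: ratios[i] * n - sent[i])
        sent[path] += 1
        paths.append(path)
    return paths


def arrival_order(paths, delays, gap):
    """Return sequence numbers sorted by arrival time at the merge point.

    Packet `seq` is sent at time seq * gap on path paths[seq].
    """
    arrivals = [(seq * gap + delays[p], seq) for seq, p in enumerate(paths)]
    return [seq for _, seq in sorted(arrivals)]


# Hypothetical numbers: 10 packets, 1 ms apart, path delays of 10 ms and 30 ms.
paths = packet_split(10, [0.7, 0.3])
order = arrival_order(paths, delays=[0.010, 0.030], gap=0.001)
```

The split is accurate (7 packets on one path, 3 on the other), but packets of the same flow arrive out of sequence whenever the path delays differ by more than the packet spacing.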

Existing Scheme 2: Flow-Based Splitting
• Assign TCP flows to each path in proportion to the desired ratios
• Flows are not all equal: elephants & mice
• So, estimate the rate of each TCP flow
• But rates change with time
• Too complex
• Very inaccurate if desired ratios change
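The elephants-and-mice problem can be sketched as follows. This is a simplified illustration that pins each flow to a path by hashing its header, without the rate estimation mentioned above; the function names, the CRC32 hash, and the flow sizes are assumptions for the example.

```python
import zlib
from bisect import bisect_right
from itertools import accumulate


def flow_path(header, ratios):
    """Pin a flow to a path by hashing its header into [0, 1)."""
    h = zlib.crc32(repr(header).encode()) / 2**32
    cumulative = list(accumulate(ratios))
    # Clamp guards against floating-point error in the cumulative sum.
    return min(bisect_right(cumulative, h), len(ratios) - 1)


def byte_shares(flows, ratios):
    """flows maps header -> bytes sent. Returns the byte fraction per path."""
    totals = [0] * len(ratios)
    for header, size in flows.items():
        totals[flow_path(header, ratios)] += size
    total = sum(totals)
    return [t / total for t in totals]


# Hypothetical workload: nine 1 KB mice and one 1 MB elephant, target 50/50.
flows = {("10.0.0.1", "10.0.0.9", 5000 + i, 80): 1_000 for i in range(9)}
flows[("10.0.0.2", "10.0.0.9", 6000, 80)] = 1_000_000
shares = byte_shares(flows, [0.5, 0.5])
```

No packet is reordered because a flow never changes path, but whichever path receives the elephant carries over 99% of the bytes, far from the 50/50 target.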

How to Split Traffic?

Flow-Based
• Inaccurate
• No packet reordering
• Hard to track if ratios change

Packet-Based
• Accurate
• Reorders TCP packets
• Easily tracks dynamic ratios

Can we combine the best of the two approaches?

This Talk
• Show how to send a single TCP flow down multiple paths without reordering
• Accurately split traffic even when desired ratios are dynamic
• Easy to implement

Flowlet Switching
• If the previous packet from the TCP flow has already left the merging point, the flow can be reassigned to a different path
[Figure: a TCP flow whose packets can take path 1 or path 2]

Flowlet Switching
[Figure: two paths, with Delay = D1 and Delay = D2]
• Given δ > |D2 - D1|
• Flowlets are bursts from the same flow separated by an idle time of at least δ; they can be switched independently!
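A short derivation shows why the condition on δ rules out reordering. The symbols t_1, t_2, a_1, a_2 are notation introduced here for illustration: the last packet of one flowlet is sent at t_1 on the path with delay D_1, and the first packet of the next flowlet is sent at t_2 on the path with delay D_2.

```latex
% Assumed notation: t_i = send time at the split point,
% a_i = arrival time at the merging point.
\begin{align*}
a_1 &= t_1 + D_1, \qquad a_2 = t_2 + D_2, \qquad
t_2 - t_1 \ge \delta > |D_2 - D_1| \\
a_2 - a_1 &= (t_2 - t_1) + (D_2 - D_1)
\;\ge\; \delta - |D_2 - D_1| \;>\; 0
\end{align*}
```

So the later packet always arrives after the earlier one, whichever of the two paths each flowlet takes.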

Implementing Flowlet Switching is Simple
• Router at the split point hashes the packet header (SRCip, DSTip, SRCPort, DSTPort)
• If (Now - Last_Seen) > δ, the flow can change path
• Reassign path in proportion to the desired split ratios
[Figure: the header hash indexes a table entry holding Last_Seen (s) and Path, e.g. 9920.2659 and 3]
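The three steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the class name, the CRC32 hash, the table size, and the byte-deficit rule used to pick a path "in proportion to the desired split ratios" are all choices made for this example.

```python
import zlib


class FlowletSwitch:
    """Minimal flowlet switch: a hash table of (last_seen, path) entries."""

    def __init__(self, ratios, delta, table_size=1024):
        self.ratios = list(ratios)          # desired split ratios, sum to 1
        self.delta = delta                  # flowlet timeout in seconds
        self.table_size = table_size
        self.table = {}                     # slot -> (last_seen, path)
        self.deficit = [0.0] * len(ratios)  # bytes each path is still owed

    def _slot(self, header):
        key = "|".join(map(str, header)).encode()
        return zlib.crc32(key) % self.table_size

    def route(self, header, size, now):
        # Each packet adds to every path's target in proportion to its ratio.
        for i, r in enumerate(self.ratios):
            self.deficit[i] += r * size
        slot = self._slot(header)
        entry = self.table.get(slot)
        if entry is None or now - entry[0] > self.delta:
            # New flowlet: free to pick the path that is furthest behind.
            path = max(range(len(self.ratios)), key=lambda i: self.deficit[i])
        else:
            # Same flowlet: must stay on its current path to avoid reordering.
            path = entry[1]
        self.table[slot] = (now, path)
        self.deficit[path] -= size
        return path


# Hypothetical flow: two back-to-back packets, then one after a 199 ms pause.
switch = FlowletSwitch(ratios=[0.5, 0.5], delta=0.050)
header = ("10.0.0.1", "10.0.0.2", 1234, 80)
p1 = switch.route(header, 1000, now=0.000)
p2 = switch.route(header, 1000, now=0.001)
p3 = switch.route(header, 1000, now=0.200)
```

The first two packets share a path because they are 1 ms apart; the third arrives after an idle period longer than δ, so it starts a new flowlet and is moved to the path that is behind its share.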

Does it Really Work?
• Traces collected on a peering link, an edge link and two core links
• Split vectors (3 paths)
  • Static: (.3, .3, .4)
  • Dynamic: sinusoidal with amplitude 60%, period 20 min [Akella04, Chuah02]

Is Flowlet Switching Accurate?
[Figure: splitting error (axis label: Error)]
• Flowlet switching is much more accurate than flow-based switching

Can do Flowlet Switching without Per-Flow State
[Figure: error vs. number of hash table entries (4, 16, 64, 256, 1024, 2048, 4096, 8192); shows avg. and max. over many traces]
• Errors stabilize for a small table
• #Active flows ~ 50,000, but the router maintains a hash table of < 1000 entries (5KB)
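One way to see why a table much smaller than the number of flows is still safe: when two flows collide in a slot, either flow's packets refresh the slot's timestamp, so the slot only changes path after the whole slot has been idle for longer than the timeout. The sketch below illustrates this; the class name, the CRC32 hash, and the round-robin path choice (a stand-in for ratio-based assignment) are assumptions for the example.

```python
import zlib


class SharedSlotTable:
    """Fixed-size flowlet table in which colliding flows share one slot."""

    def __init__(self, n_slots, n_paths, delta):
        self.last_seen = [None] * n_slots
        self.path = [0] * n_slots
        self.n_paths = n_paths
        self.delta = delta
        self.next_path = 0  # round-robin stand-in for ratio-based choice

    def route(self, header, now):
        slot = zlib.crc32(repr(header).encode()) % len(self.last_seen)
        seen = self.last_seen[slot]
        if seen is None or now - seen > self.delta:
            # Slot has been idle long enough: everything hashed here may move.
            self.path[slot] = self.next_path
            self.next_path = (self.next_path + 1) % self.n_paths
        self.last_seen[slot] = now
        return self.path[slot]


# Force a collision with a one-slot table. Flows "a" and "b" alternate every
# 30 ms, so each flow alone has 60 ms gaps, longer than the 50 ms timeout.
table = SharedSlotTable(n_slots=1, n_paths=3, delta=0.050)
paths = [table.route(("a",) if i % 2 == 0 else ("b",), now=i * 0.030)
         for i in range(10)]
```

All ten packets stay on one path: collisions cost switching opportunities, which affects accuracy, but they never cause a flow to be moved while its packets could still be in flight.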

Understanding Flowlets

But Where do Flowlets Come From?
• Can’t be just timeouts or short flows; most of the bytes are in the elephants
• Why can a large flow be broken into many small flowlets?

Flowlets exist because TCP is bursty at RTT and sub-RTT scales
• It is well known that TCP usually sends a window in one or a few bursts and then waits for acks [Zhang91, Zhang03, Jiang04]
• Some reasons
  • Slow-start
  • Ack compression
  • Window is much smaller than the delay-bandwidth product

Flowlets exist because TCP is Bursty
• Most flowlets have inter-arrivals less than an RTT → most flowlets are sub-windows
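Given a packet trace, flowlets can be extracted by cutting each flow wherever the gap between consecutive packets exceeds the timeout. A minimal sketch, assuming sorted per-flow timestamps in seconds; the function name and sample timestamps are made up for illustration.

```python
def split_into_flowlets(timestamps, delta):
    """Group one flow's sorted packet timestamps into flowlets.

    A new flowlet starts whenever the gap since the previous packet
    exceeds delta.
    """
    flowlets = []
    for t in timestamps:
        if flowlets and t - flowlets[-1][-1] <= delta:
            flowlets[-1].append(t)   # still inside the current burst
        else:
            flowlets.append([t])     # idle gap > delta: new flowlet
    return flowlets


# Hypothetical flow: a 3-packet burst, a 2-packet burst, then a lone packet.
trace = [0.000, 0.001, 0.002, 0.100, 0.101, 0.300]
flowlets = split_into_flowlets(trace, delta=0.050)
```

With a 50 ms timeout this single flow breaks into three flowlets of 3, 2 and 1 packets.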

Why is Flowlet Switching Accurate?
• 80% of bytes are in flowlets smaller than 10KB
• Assigning a flowlet to a path isn’t a long commitment

Why can Flowlets Track Dynamics?
Arrival rate of both flows and flowlets (/sec)

Trace     Flows     Flowlets
Edge      143.16    1454.98
Peering   611.95    8661.43
Core1     3784.10   35287.04
Core2     111.33    2848.76

An order of magnitude more opportunities to rebalance!
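The "order of magnitude" claim can be checked directly from the arrival rates on this slide. The snippet below only divides the flowlet rate by the flow rate for each trace; the assignment of the smaller number in each pair to flows is inferred from the slide's layout.

```python
# Arrival rates per second, as (flows, flowlets), copied from the slide.
arrival_rates = {
    "Edge": (143.16, 1454.98),
    "Peering": (611.95, 8661.43),
    "Core1": (3784.10, 35287.04),
    "Core2": (111.33, 2848.76),
}

# How many more rebalancing opportunities flowlets give than flows.
ratios = {trace: flowlets / flows
          for trace, (flows, flowlets) in arrival_rates.items()}

for trace, ratio in ratios.items():
    print(f"{trace}: {ratio:.1f}x more flowlet arrivals than flow arrivals")
```

Every trace comes out above 9x, with Core2 the highest at over 25x.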

Why doesn’t flowlet switching need per-flow state?
[Figure: timelines of Flow 1, Flow 2 and Flow 3, with the number of active flowlets (0 to 3) plotted over time]

Trace     #Active Flows   #Active Flowlets
Edge      1450.42         18.41
Peering   8477.33         28.08
Core1     47883.33        240.12
Core2     1559.33         50.66

#Active flowlets is 2 orders of magnitude smaller than #active flows → very small hash table

Why is Flowlet Switching Possible?
• Why can a large flow be broken into many small flowlets? TCP burstiness at small time scales
• Why is flowlet switching accurate? Small commitment; many more chances to rebalance
• Why does flowlet switching not need per-flow state? Few simultaneously active flowlets

Configuring Flowlet Switching
• Flowlet separation δ > delay difference
• But how to find the delay difference?
• For our traces, which are a diverse collection of traffic within the continental US:
  • ~50ms is a good and safe choice!
  • Our procedure is a constructive way to find δ

Flowlet Separation of 50ms is Good
• ~50ms results in accurate splitting
• Any flowlet timeout in [50, 100] ms yields highly accurate splits

Flowlet Separation of 50ms is Safe
[Figure: plot with axis ticks from 0% to 1%]
• Even if the delay difference >> 50ms, the probability of reordering is negligible compared to the drop rate in the Internet (about 1%)

Conclusion
• Harness TCP burstiness to split traffic at a finer resolution than a flow, without reordering
• Flowlet Switching:
  • Splitting errors are a few percent
  • Reordering probability is negligible compared to the drop probability in the Internet
  • Easy to implement
• Enables ISPs to do dynamic load balancing
