TCP & Data Center Networking

TCP & Data Center Networking: Overview
• TCP Incast Problem & Possible Solutions
• DC-TCP
• MPTCP (multipath TCP)
Please read the following papers: [InCast] [DC-TCP] [MPTCP]
CSci5221: TCP and Data Center Networking

TCP Congestion Control: Recap
• Designed to address the network congestion problem
  • reduce sending rates when the network is congested
• How to detect network congestion at end systems?
  • Assume packet losses (& re-ordering) ⇒ network congestion
• How to adjust sending rates dynamically?
  • AIMD (additive increase & multiplicative decrease):
    • no packet loss in one RTT: W ← W+1
    • packet loss in one RTT: W ← W/2
• How to determine the initial sending rates?
  • probe the network available bandwidth via “slow start”
  • W := 1; no loss in one RTT: W ← 2W
• Fairness: assume everyone will use the same algorithm
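The window rules above can be traced with a small simulation. This is a minimal sketch, not a real TCP stack: the window is counted in packets and updated once per RTT, and a loss is assumed (purely for illustration) whenever the window exceeds a fixed `capacity`. The names `next_window` and `simulate` are hypothetical.

```python
def next_window(w, loss, slow_start):
    """Apply one RTT of the window rules; returns (new window, still in slow start?)."""
    if loss:
        # Multiplicative decrease; a loss also ends slow start.
        return max(1.0, w / 2), False
    if slow_start:
        # Slow start: double the window after every loss-free RTT.
        return 2 * w, True
    # Congestion avoidance: additive increase of one packet per RTT.
    return w + 1, False


def simulate(capacity, rounds):
    """Trace the window over `rounds` RTTs.

    Assumption for illustration only: a loss occurs in any RTT whose
    window exceeds `capacity` packets.
    """
    w, slow_start = 1.0, True
    trace = []
    for _ in range(rounds):
        trace.append(w)
        w, slow_start = next_window(w, w > capacity, slow_start)
    return trace


if __name__ == "__main__":
    print(simulate(capacity=20, rounds=10))
```

With `capacity=20` the window doubles up to 32, is halved to 16 on the first loss, and then grows by one packet per RTT.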

TCP Congestion Control: Devils in the Details
• How to detect packet losses?
  • e.g., as opposed to late-arriving packets?
  • estimate (average) RTT, and set a time-out threshold
    • called the RTO (Retransmission Time-Out) timer
    • packets arriving very late are treated as if they were lost!
• RTT and RTO estimation: Jacobson’s algorithm
  • Compute estRTT and devRTT using exponential smoothing:
    • estRTT := (1-a) estRTT + a sampleRTT (a > 0 small, e.g., a = 0.125)
    • devRTT := (1-a) devRTT + a |sampleRTT - estRTT|
  • Set RTO conservatively:
    • RTO := max{minRTO, estRTT + 4 x devRTT}, where minRTO = 200 ms
• Aside: many variants of TCP: Tahoe, Reno, Vegas, ...
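The estimator can be sketched in a few lines, using the standard form in which the deviation is measured against estRTT. This is an illustration under stated assumptions, not any particular stack's code: times are in seconds, a single gain `a` is used for both averages as on the slide, and the first-sample initialisation (estRTT = sample, devRTT = sample / 2) is borrowed from RFC 6298 because the slide does not specify one. The class name `RtoEstimator` is hypothetical.

```python
class RtoEstimator:
    """Exponentially smoothed RTT estimate with a floor on the RTO."""

    def __init__(self, a=0.125, min_rto=0.200):
        self.a = a                # smoothing gain (slide: a = 0.125)
        self.min_rto = min_rto    # slide: minRTO = 200 ms
        self.est_rtt = None
        self.dev_rtt = None

    def update(self, sample_rtt):
        """Feed one RTT sample (seconds) and return the new RTO."""
        if self.est_rtt is None:
            # Assumption: RFC 6298-style initialisation on the first sample.
            self.est_rtt = sample_rtt
            self.dev_rtt = sample_rtt / 2
        else:
            a = self.a
            # Deviation is updated first, against the previous estimate.
            self.dev_rtt = (1 - a) * self.dev_rtt + a * abs(sample_rtt - self.est_rtt)
            self.est_rtt = (1 - a) * self.est_rtt + a * sample_rtt
        return self.rto()

    def rto(self):
        return max(self.min_rto, self.est_rtt + 4 * self.dev_rtt)


if __name__ == "__main__":
    est = RtoEstimator()
    for _ in range(10):
        est.update(0.0001)        # 0.1 ms data center RTT samples
    print(est.rto())              # the 200 ms floor dominates
```

With RTT samples of 0.1 ms the computed `estRTT + 4 x devRTT` is well under a millisecond, so the RTO is set entirely by the `minRTO` floor.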

But …
• Internet vs. data center network:
  • Internet propagation delay: 10-100 ms
  • data center propagation delay: 0.1 ms
  • packet size 1 KB, link capacity 1 Gbps ⇒ packet transmission time is about 0.01 ms (8,000 bits at 10^9 bit/s is 8 µs)
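The arithmetic behind these numbers is easy to check. The sketch below assumes 1 KB = 1000 bytes and reuses the 200 ms minRTO from the RTO slide; the function names are hypothetical.

```python
def transmission_time(packet_bytes, link_bps):
    """Seconds needed to put one packet on the wire."""
    return packet_bytes * 8 / link_bps


def rtts_per_timeout(min_rto_s, rtt_s):
    """How many round trips fit into a single minimum retransmission timeout."""
    return min_rto_s / rtt_s


if __name__ == "__main__":
    print(transmission_time(1000, 1e9))      # 1 KB packet on a 1 Gbps link
    print(rtts_per_timeout(200e-3, 0.1e-3))  # data center RTT of 0.1 ms
    print(rtts_per_timeout(200e-3, 100e-3))  # Internet RTT of 100 ms
```

A 200 ms timeout is only a couple of round trips on a 100 ms Internet path, but about two thousand round trips inside a data center.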

What’s Special about Data Center Transport?
• Application requirements (particularly, low latency)
• Particular traffic patterns
  • customer facing vs. internal: often co-exist
  • internal: e.g.,
    • Google File System
    • MapReduce
    • …
• Commodity switches: shallow buffer
• And time is money!

How does search work? Partition/Aggregate Application Structure
• Time is money
  • Strict deadlines (SLAs)
• Missed deadline
  • Lower quality result
• Many requests per query
  • Tail latency matters
[Figure: a query (“Picasso”) fans out from the TLA (deadline = 250 ms) through the MLAs (deadline = 50 ms) to the worker nodes (deadline = 10 ms); the partial results, shown as Picasso quotations, are ranked and aggregated on the way back up.]

Data Center Workloads
• Partition/Aggregate (Query): bursty, delay-sensitive
• Short messages [50KB-1MB] (Coordination, Control state): delay-sensitive
• Large flows [1MB-100MB] (Data update): throughput-sensitive

Flow Size Distribution
• > 65% of flows are < 1MB
• > 95% of bytes come from flows > 1MB

A Simple Data Center Network Model
[Figure: N servers (1, 2, 3, …, N) send through a switch with a small buffer B over a link of capacity C to an aggregator.]
• Ethernet: 1-10 Gbps
• Round Trip Time (RTT): 10-100 µs
• Packet size: S_DATA
• Logical data block (S) (e.g., 1 MB)
• Server Request Unit (SRU) (e.g., 32 KB)

TCP Incast Problem
• Vasudevan et al. (SIGCOMM’09)
• Synchronized fan-in congestion:
  • Caused by Partition/Aggregate.
[Figure: timeline with an Aggregator and Workers 1-4. Request sent; responses 1-6 done, 7-8 dropped; TCP timeout (RTOmin = 200 ms) during which the link is idle; 7-8 resent; response sent.]

TCP Throughput Collapse
• TCP Incast
• Cause of throughput collapse:
  • coarse-grained TCP timeouts
[Figure annotation: “Collapse!”]

Incast in Bing
[Figure: MLA query completion time (ms)]

Problem Statement
• High-speed, low-latency network (RTT ≤ 0.1 ms)
• Highly-multiplexed link (e.g., 1000 flows)
• Highly-synchronized flows on bottleneck link
• Limited switch buffer size (e.g., 32 KB)
Together these cause TCP retransmission timeouts and hence TCP throughput degradation.
How to provide high goodput for data center applications?

One Quick Fix: µsecond TCP + no minRTO
µsecond Retransmission Timeouts (RTO)
RTO = max( minRTO, f(RTT) )
• minRTO: 200 ms today; reduce to 200 µs? to 0?
• f(RTT): RTT is tracked in milliseconds today; track RTT in µseconds
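A back-of-envelope model shows why the minRTO floor matters so much. The sketch below is not the simulation from the [InCast] paper: it simply assumes that a 1 MB block would otherwise be delivered at full link rate, that the block suffers exactly one timeout, and that the link sits idle for the whole timeout. The function name `goodput_bps` and the chosen numbers are illustrative.

```python
def goodput_bps(block_bytes, link_bps, rto_s, timeouts=1):
    """Goodput of one block if `timeouts` retransmission timeouts stall the transfer.

    Assumption: the link is completely idle while the sender waits for the RTO.
    """
    bits = block_bytes * 8
    transfer_s = bits / link_bps
    return bits / (transfer_s + timeouts * rto_s)


if __name__ == "__main__":
    for label, rto in [("200 ms", 200e-3), ("200 us", 200e-6), ("no timeout", 0.0)]:
        mbps = goodput_bps(1_000_000, 1e9, rto) / 1e6
        print(f"RTO {label}: {mbps:.0f} Mbps")
```

Under these assumptions a single 200 ms timeout cuts a 1 Gbps link to under 40 Mbps of goodput, while a 200 µs timeout costs only a few percent.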

Solution: µsecond TCP + no minRTO
[Figure: throughput (Mbps) as more servers are added, proposed solution vs. unmodified TCP]
• High throughput for up to 47 servers
• Simulation scales to thousands of servers

TCP in the Data Center
• TCP does not meet the demands of applications.
  • Requires large queues for high throughput:
    • Adds significant latency.
    • Wastes buffer space, especially bad with shallow-buffered switches.
• Operators work around TCP problems.
  • Ad-hoc, inefficient, often expensive solutions
  • No solid understanding of consequences, tradeoffs

Queue Buildup
• Large flows build up queues.
  • This increases latency for short flows.
• How was this supported by measurements?
  • Measurements in Bing cluster
  • For 90% of packets: RTT < 1 ms
  • For 10% of packets: 1 ms < RTT < 15 ms
[Figure: Sender 1 and Sender 2 share a switch queue toward one Receiver]

Data Center Transport Requirements
1. High Burst Tolerance
  • Incast due to Partition/Aggregate is common.
2. Low Latency
  • Short flows, queries
3. High Throughput
  • Continuous data updates, large file transfers
The challenge is to achieve these three together.

DCTCP: Main Idea
• React in proportion to the extent of congestion.
• Reduce window size based on the fraction of marked packets.

DCTCP: Algorithm
Switch side:
• Mark packets when queue length > K; don’t mark below K (B is the buffer size).
Sender side:
• Maintain a running average of the fraction of packets marked (α).
• Adaptive window decrease: the window is divided by a factor that depends on α.
  • Note: decrease factor between 1 and 2.
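The two sides of the algorithm can be sketched as follows. The update rules used here, α ← (1 - g)·α + g·F and W ← W·(1 - α/2), are the ones given in the [DC-TCP] paper; everything else is an assumption made for illustration: the gain g = 1/16, one update per window of ACKs, growth of one packet per unmarked window, and the names `switch_marks` and `DctcpSender`.

```python
def switch_marks(queue_len, k):
    """Switch side: mark an arriving packet when the queue exceeds threshold K."""
    return queue_len > k


class DctcpSender:
    """Sender side: track the marked fraction and cut the window in proportion."""

    def __init__(self, cwnd=100.0, g=1 / 16):
        self.cwnd = cwnd      # congestion window in packets
        self.g = g            # averaging gain (assumed value)
        self.alpha = 0.0      # running average of the marked fraction

    def on_window_acked(self, acked, marked):
        """Process one window's worth of ACKs, `marked` of which carried ECN marks."""
        frac = marked / acked if acked else 0.0
        self.alpha = (1 - self.g) * self.alpha + self.g * frac
        if marked > 0:
            # alpha near 0 barely shrinks the window; alpha = 1 halves it like TCP.
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            # Assumption: ordinary additive increase when nothing was marked.
            self.cwnd += 1
        return self.cwnd


if __name__ == "__main__":
    sender = DctcpSender()
    print(sender.on_window_acked(acked=10, marked=10), sender.alpha)
```

Because α moves slowly, a single marked window trims the window by only a few percent; only sustained marking drives α toward 1 and the cut toward a full halving.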

DCTCP vs TCP
• Setup: Windows 7, Broadcom 1 Gbps switch
• Scenario: 2 long-lived flows, ECN marking threshold = 30 KB
[Figure: vertical axis in KBytes]

Multi-path TCP (MPTCP)
In a data center with rich path diversity (e.g., Fat-Tree or BCube), can we use multipath to get higher throughput?
• Initially, there is one flow.
• A new flow starts. Its direct route collides with the first flow.
• But it also has longer routes available, which don’t collide.

The MPTCP protocol
MPTCP is a replacement for TCP which lets you use multiple paths simultaneously.
• The sender stripes packets across paths.
• The receiver puts the packets in the correct order.
[Figure: protocol stack with user space, socket API, MPTCP in place of TCP, and IP with addresses addr, addr1, addr2]

Design goal 1: Multipath TCP should be fair to regular TCP at shared bottlenecks
[Figure: a multipath TCP flow with two subflows shares a bottleneck link with a regular TCP flow]
To be fair, Multipath TCP should take as much capacity as TCP at a bottleneck link, no matter how many paths it is using.
• Strawman solution: Run “½ TCP” on each path

Design goal 2: MPTCP should use efficient paths
[Figure: three links of 12 Mb/s each; each flow has a choice of a 1-hop and a 2-hop path.]
How should we split its traffic?
• If each flow split its traffic 1:1, each flow would get 8 Mb/s.
• If each flow split its traffic ∞:1 (everything on the 1-hop path), each flow would get 12 Mb/s.
Theoretical solution (Kelly+Voice 2005; Han, Towsley et al. 2006)
Theorem: MPTCP should send all its traffic on its least-congested paths. This will lead to the most efficient allocation possible, given a network topology and a set of available paths.
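The 8 Mb/s and 12 Mb/s figures can be reproduced with a little arithmetic. The sketch below assumes a symmetric triangle (the topology is not spelled out in the text): three 12 Mb/s links and three flows, where each flow's 1-hop path uses one link and its 2-hop path uses the other two. Every link then carries one flow's 1-hop traffic x plus the 2-hop traffic y of the other two flows, so x + 2y = 12 when the links are full. The function name is hypothetical.

```python
import math


def per_flow_throughput(ratio, link_mbps=12.0):
    """Throughput of each flow when it splits traffic ratio:1 between 1-hop and 2-hop paths.

    Assumes the symmetric triangle described above with all links saturated:
    one_hop + 2 * two_hop = link_mbps and one_hop = ratio * two_hop.
    """
    if math.isinf(ratio):
        return link_mbps  # everything on the 1-hop path
    two_hop = link_mbps / (ratio + 2)
    one_hop = ratio * two_hop
    return one_hop + two_hop


if __name__ == "__main__":
    for ratio in (1, 2, 4, math.inf):
        print(ratio, per_flow_throughput(ratio))
```

Every unit of traffic on a 2-hop path consumes capacity on two links, which is why shifting traffic toward the 1-hop path raises the throughput of every flow.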

Design goal 3: MPTCP should be fair compared to TCP
[Figure: two paths. WiFi path: high loss, small RTT. 3G path: low loss, high RTT.]
Design Goal 2 says to send all your traffic on the least congested path, in this case 3G. But this has high RTT, hence it will give low throughput.
• Goal 3a. A Multipath TCP user should get at least as much throughput as a single-path TCP would on the best of the available paths.
• Goal 3b. A Multipath TCP flow should take no more capacity on any link than a single-path TCP would.

Design goals
• Goal 1. Be fair to TCP at bottleneck links
• Goal 2. Use efficient paths ...
• Goal 3. as much as we can, while being fair to TCP
• Goal 4. Adapt quickly when congestion changes
• Goal 5. Don’t oscillate
[Slide annotation: “redundant”]
How does MPTCP try to achieve all this?

How does MPTCP congestion control work?
Maintain a congestion window w_r, one window for each path, where r ∊ R ranges over the set of available paths.
• Increase w_r for each ACK on path r, by an amount chosen to meet design goals 2 and 3:
  • Design goal 3: At any potential bottleneck S that path r might be in, look at the best that a single-path TCP could get, and compare to what I’m getting.
  • Design goal 2: We want to shift traffic away from congestion. To achieve this, we increase windows in proportion to their size.
• Decrease w_r for each drop on path r, by w_r/2
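The increase formula itself did not survive in the slide text. As an illustration only, the sketch below uses the “Linked Increases” rule standardized for MPTCP in RFC 6356, expressed in packets: each ACK on path r increases w_r by min(α / w_total, 1 / w_r), with α = w_total · max_r(w_r / RTT_r²) / (Σ_r w_r / RTT_r)². Whether this is exactly the expression shown on the original slide is an assumption, and the function names are hypothetical.

```python
def lia_alpha(windows, rtts):
    """Aggressiveness factor alpha from RFC 6356 (windows in packets, RTTs in seconds)."""
    total = sum(windows)
    best = max(w / rtt ** 2 for w, rtt in zip(windows, rtts))
    denom = sum(w / rtt for w, rtt in zip(windows, rtts)) ** 2
    return total * best / denom


def increase_per_ack(r, windows, rtts):
    """Window increase for one ACK on path r.

    The 1 / w_r cap keeps a subflow from being more aggressive than
    a regular TCP flow on the same path.
    """
    total = sum(windows)
    return min(lia_alpha(windows, rtts) / total, 1.0 / windows[r])


def on_drop(r, windows):
    """Decrease w_r by w_r / 2 for a drop on path r (kept at one packet or more)."""
    windows[r] = max(1.0, windows[r] / 2)


if __name__ == "__main__":
    print(increase_per_ack(0, [10.0], [0.1]))              # single path
    print(increase_per_ack(0, [10.0, 10.0], [0.1, 0.1]))   # two equal paths
```

With a single path the rule reduces to regular TCP's 1 / w per ACK; with two equal paths each subflow grows more slowly, so the connection as a whole does not take more than one TCP's share at a common bottleneck.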

MPTCP chooses efficient paths in a BCube data center, hence it gets high throughput.
• Initially, there is one flow.
• A new flow starts. Its direct route collides with the first flow.
• But it also has longer routes available, which don’t collide.
• MPTCP shifts its traffic away from the congested link.

MPTCP chooses efficient paths in a BCube data center, hence it gets high throughput.
• Packet-level simulations of BCube (125 hosts, 25 switches, 100 Mb/s links), measuring average throughput for three traffic matrices.
• For two of the traffic matrices, MPTCP and ½ TCP (strawman) were equally good.
• For one of the traffic matrices, MPTCP got 19% higher throughput.
[Figure: throughput [Mb/s] for the perm., sparse, and local traffic matrices]
