ICTCP: Incast Congestion Control for TCP in Data Center Networks∗

ICTCP: Incast Congestion Control for TCP in Data Center Networks∗ Haitao Wu ⋆, Zhenqian Feng ⋆ †, Chuanxiong Guo ⋆, Yongguang Zhang ⋆ {hwu, v-zhfe, chguo, ygz}@microsoft.com, ⋆ Microsoft Research Asia, China †School of Computer, National University of Defense Technology, China Presenter: B99106017, Library and Information Science, third year, 謝宗昊

Outline • Background • Design Rationale • Algorithm • Implementation • Experimental results • Discussion and conclusion

Outline • Background • Design Rationale • Algorithm • Implementation • Experimental results • Discussions, related work and conclusion

Background • In distributed file systems, files are stored across multiple servers. • TCP does not work well for the many-to-one traffic pattern on high-bandwidth, low-latency networks.

Background • Three preconditions of data center networks • They are well structured and layered to achieve high bandwidth and low latency, and the buffer size of ToR (top-of-rack) switches is usually small • The barrier-synchronized many-to-one traffic pattern is common in data center networks • The transmission data volume for such a traffic pattern is usually small

Background • TCP incast collapse • Caused by multiple connections overflowing the Ethernet switch buffer in a short period of time • Leads to intense packet losses and thus TCP retransmissions and timeouts • Previous solutions • Reducing the waiting time for packet loss recovery • Controlling switch buffer occupation to avoid overflow by using ECN and a modified TCP at both the sender and receiver side

Background • This paper focuses on: • Avoiding packet losses before incast congestion occurs • Modifying the TCP receiver only • The receiver side knows the throughput of all TCP connections and the available bandwidth

Background • Controlling the receive window well is challenging • The receive window should be small enough to avoid incast congestion • It should also be large enough for good performance and for non-incast cases • A good setting for one scenario may not fit others

Background • The technical novelties in this paper: • Use the available bandwidth as a quota to coordinate receive window increases • Per-flow congestion control is performed independently on each connection, in time slots of one RTT • Receive window adjustment is based on the ratio of the difference between measured and expected throughput over the expected throughput

Background • “Goodput” is the throughput obtained and observed at the application layer • TCP incast congestion • Happens when multiple sending servers under the same ToR switch send to one receiver server simultaneously • TCP throughput is severely degraded under incast congestion

Background • TCP goodput, receive window and RTT • A small static TCP receive buffer may prevent TCP incast congestion collapse, but it cannot adapt dynamically • Conventional window-based congestion control requires either losses or ECN marks to trigger a window decrease

Background • TCP goodput, receive window and RTT • TCP Vegas assumes that an increase in RTT is caused only by packet queuing at the bottleneck buffer • Unfortunately, the increase of TCP RTT in high-bandwidth, low-latency networks does not follow this assumption

Design Rationale • Goal • Improve TCP performance for incast congestion. • No new TCP option or modification to TCP header.

Design Rationale • Three observations that form the basis of ICTCP • The available bandwidth at the receiver side is the signal for the receiver to do congestion control • The frequency of receive-window-based congestion control should follow each flow's own feedback loop, independently of other flows • A receive-window-based scheme should adjust the window according to both the link congestion status and the application requirement • Set a proper receive window for all TCP connections sharing the same last hop • because parallel TCP connections may belong to the same job

Outline • Background • Design Rationale • Algorithm • Implementation • Experimental results • Discussion and conclusion

Algorithm • Available bandwidth • $C$: The link capacity of the interface on the receiver server • $BW_T$: Bandwidth of the total incoming traffic observed on that interface • $\alpha \in [0, 1]$: Parameter to absorb potential oversubscription during window adjustment • $BW_A$: The quota for all incoming connections to increase their receive windows for higher throughput

Algorithm • Available bandwidth • $BW_A = \max(0, \alpha \cdot C - BW_T)$
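The quota on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it reads the quota as the link capacity scaled by the headroom parameter, minus the observed incoming traffic, clamped at zero. The function name `available_bandwidth` and the default `alpha=0.9` are assumptions made for the example.

```python
def available_bandwidth(capacity_bps: float,
                        total_incoming_bps: float,
                        alpha: float = 0.9) -> float:
    """Quota (bits/s) left for receive-window increases on one interface.

    alpha (assumed default 0.9) reserves headroom so that window increases
    do not oversubscribe the link.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Never report a negative quota: a saturated link simply has none left.
    return max(0.0, alpha * capacity_bps - total_incoming_bps)


if __name__ == "__main__":
    # 1 Gbit/s link carrying 400 Mbit/s of incoming traffic.
    print(available_bandwidth(1e9, 4e8))
```

With the link nearly saturated (for example 950 Mbit/s incoming on a 1 Gbit/s interface), the function returns zero, so no connection would be allowed to grow its window in that slot.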

Algorithm • Window adjustment on single connection • $b_i^m$: Measured incoming throughput (on connection i) • $b_i^s$: Sample of current throughput (on connection i)

Algorithm • Window adjustment on single connection • $b_i^e = \max(b_i^m, rwnd_i / RTT_i)$: Expected throughput • $rwnd_i$: Receive window of connection i • The max procedure ensures $b_i^m \le b_i^e$

Algorithm • Window adjustment on single connection • $d_i^b = (b_i^e - b_i^m) / b_i^e$: The ratio of throughput difference of connection i • $b_i^m \le b_i^e$, thus $d_i^b \in [0, 1]$
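The three per-connection quantities on these slides (measured throughput, expected throughput, and their difference ratio) can be sketched as below. This is a hedged illustration: the function names are hypothetical, throughput is assumed to be in bytes per second, and the exponential smoothing with `beta=0.75` is an assumption about how the sample is folded into the measured value.

```python
def update_measured(prev_measured: float, sample: float,
                    beta: float = 0.75) -> float:
    """Smooth the throughput sample, but never report less than the sample."""
    return max(sample, beta * prev_measured + (1.0 - beta) * sample)


def expected_throughput(measured: float, rwnd_bytes: int,
                        rtt_seconds: float) -> float:
    """Throughput the current window could carry; max keeps measured <= expected."""
    return max(measured, rwnd_bytes / rtt_seconds)


def throughput_diff_ratio(measured: float, expected: float) -> float:
    """(expected - measured) / expected, which lies in [0, 1]."""
    if expected <= 0.0:
        return 0.0  # idle connection: treat as no unused window
    return (expected - measured) / expected


if __name__ == "__main__":
    m = update_measured(prev_measured=100.0, sample=200.0)
    e = expected_throughput(m, rwnd_bytes=3000, rtt_seconds=10.0)
    print(m, e, throughput_diff_ratio(m, e))
```

A ratio near 0 means the connection is using almost all of its window, while a ratio near 1 means most of the window is unused.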

Algorithm • Window adjustment on single connection • Two thresholds $\gamma_1$, $\gamma_2$ ($\gamma_2 > \gamma_1$) differentiate three cases: • $d_i^b \le \gamma_1$ or $d_i^b \le MSS_i / rwnd_i$ → increase the receive window if in the global second sub-slot and there is enough quota of available bandwidth • $d_i^b > \gamma_2$ → decrease the receive window by one MSS if this condition holds for three continuous RTTs • Otherwise, keep the current receive window • Start a newly established or long-idle connection in slow start • Go into congestion avoidance when the second or third case is met, or when the first case is met but there is not enough quota
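The three cases on this slide can be sketched as a small per-RTT decision function. This is an illustrative sketch under stated assumptions, not the authors' implementation: the `Connection` class, the threshold defaults `gamma1=0.1` and `gamma2=0.5`, the 1460-byte MSS, the two-MSS window floor, doubling the window in slow start, and expressing the quota in bytes are all assumptions for the example.

```python
from dataclasses import dataclass

MSS = 1460            # assumed segment size in bytes
MIN_RWND = 2 * MSS    # assumed floor for the receive window


@dataclass
class Connection:
    rwnd: int                  # current receive window in bytes
    slow_start: bool = True    # new or long-idle connections start here
    high_ratio_rtts: int = 0   # consecutive RTTs with ratio above gamma2


def adjust_window(conn: Connection, ratio: float, in_second_subslot: bool,
                  quota: int, gamma1: float = 0.1, gamma2: float = 0.5) -> int:
    """Apply one per-RTT decision; return the quota left (bytes, assumed unit)."""
    if ratio <= gamma1 or ratio <= MSS / conn.rwnd:
        # Case 1: the window is nearly fully used, so try to increase it.
        conn.high_ratio_rtts = 0
        if in_second_subslot:
            step = conn.rwnd if conn.slow_start else MSS
            if quota >= step:
                conn.rwnd += step
                return quota - step
            conn.slow_start = False   # first case met, but not enough quota
        return quota
    conn.slow_start = False           # second or third case met
    if ratio > gamma2:
        # Case 2: shrink only after three consecutive RTTs in this state.
        conn.high_ratio_rtts += 1
        if conn.high_ratio_rtts >= 3:
            conn.rwnd = max(MIN_RWND, conn.rwnd - MSS)
            conn.high_ratio_rtts = 0
    else:
        # Case 3: keep the current window.
        conn.high_ratio_rtts = 0
    return quota


if __name__ == "__main__":
    c = Connection(rwnd=2 * MSS)
    left = adjust_window(c, ratio=0.05, in_second_subslot=True, quota=10 * MSS)
    print(c.rwnd, left)
```

Returning the remaining quota lets a caller loop over all connections in one sub-slot, so that the total window increase stays within the available bandwidth.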

Algorithm • Fairness controller for multiple connections • Fairness is only considered among low-latency flows • For window decrease, cut the receive window by one MSS for connections whose receive window is larger than the average • For window increase, fairness is achieved automatically by the window adjustment algorithm described above
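The decrease rule on this slide can be sketched as follows. This is a minimal sketch: the function name `fairness_decrease`, the dictionary-based flow table, the 1460-byte MSS, and the two-MSS floor are assumptions for illustration. When to invoke the decrease is not specified on the slide and is left to the caller.

```python
from typing import Dict, Optional


def fairness_decrease(rwnds: Dict[str, int], mss: int = 1460,
                      min_rwnd: Optional[int] = None) -> Dict[str, int]:
    """Cut by one MSS every connection whose window exceeds the average."""
    if not rwnds:
        return {}
    floor = 2 * mss if min_rwnd is None else min_rwnd  # assumed floor
    average = sum(rwnds.values()) / len(rwnds)
    return {
        conn_id: max(floor, window - mss) if window > average else window
        for conn_id, window in rwnds.items()
    }


if __name__ == "__main__":
    table = {"a": 10 * 1460, "b": 2 * 1460, "c": 3 * 1460}
    print(fairness_decrease(table))
```

Repeated calls slowly pull the largest windows toward the average while leaving the smaller ones untouched.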

Implementation • ICTCP is developed as an NDIS driver on Windows • It naturally supports virtual machines • The incoming throughput on a very short time scale can be easily obtained • It does not touch the TCP/IP implementation in the Windows kernel

Implementation • Packets are redirected to the header parser module • The packet header is parsed and the information in the flow table is updated • The algorithm module is responsible for receive window calculation • When a TCP ACK packet is sent out, the header modifier changes the receive window field in the TCP header if needed
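The header-modifier step can be illustrated on raw header bytes. The sketch below is not the NDIS driver code: it is a Python illustration that overwrites the 16-bit window field of a TCP header and patches the checksum incrementally (in the style of RFC 1624), so the payload does not need to be re-summed. The function names are hypothetical, and TCP window scaling is ignored.

```python
import struct

WINDOW_OFFSET = 14  # the window field occupies bytes 14-15 of the TCP header


def _fold(total: int) -> int:
    """Fold carries back into 16 bits (one's-complement addition)."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def checksum16(data: bytes) -> int:
    """Plain Internet checksum over data (checksum field assumed zeroed)."""
    if len(data) % 2:
        data += b"\x00"
    words = struct.unpack(f"!{len(data) // 2}H", data)
    return ~_fold(sum(words)) & 0xFFFF


def rewrite_window(tcp_header: bytes, new_window: int) -> bytes:
    """Return a copy of the header with a new window and a patched checksum."""
    if len(tcp_header) < 20:
        raise ValueError("TCP header must be at least 20 bytes")
    if not 0 <= new_window <= 0xFFFF:
        raise ValueError("window must fit in 16 bits")
    # The window (bytes 14-15) and checksum (bytes 16-17) are adjacent.
    old_window, old_csum = struct.unpack_from("!HH", tcp_header, WINDOW_OFFSET)
    # Incremental update: HC' = ~(~HC + ~m + m')
    total = (~old_csum & 0xFFFF) + (~old_window & 0xFFFF) + new_window
    new_csum = ~_fold(total) & 0xFFFF
    patched = bytearray(tcp_header)
    struct.pack_into("!HH", patched, WINDOW_OFFSET, new_window, new_csum)
    return bytes(patched)


if __name__ == "__main__":
    hdr = struct.pack("!HHIIBBHHH", 5000, 80, 1, 2, 5 << 4, 0x10, 65535, 0, 0)
    hdr = hdr[:16] + struct.pack("!H", checksum16(hdr)) + hdr[18:]
    print(rewrite_window(hdr, 2920).hex())
```

Because the update depends only on the old and new field values, the same patch stays valid when the real checksum also covers the pseudo-header and payload.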

Implementation • Support for virtual machines • The total capacity of virtual NICs is typically configured higher than that of the physical NIC, as most virtual machines won't be busy at the same time • The observed virtual link capacity and available bandwidth therefore do not represent the real values • There are two solutions • Change the settings so that the total capacity of the virtual NICs equals that of the physical NIC • Deploy an ICTCP driver on the virtual machine host server

Implementation • Obtaining fine-grained RTT at the receiver • Define the reverse RTT as the RTT after an exponential filter at the TCP receiver side • The reverse RTT can be obtained when there is data traffic in both directions • The data traffic in the reverse direction may not be enough to keep obtaining a live reverse RTT → use the TCP timestamp option • In the implementation, the timestamp counter is modified to 100 ns granularity
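The exponential filter over timestamp-derived RTT samples can be sketched as below. This is an assumption-laden illustration rather than the driver's logic: the class name `ReverseRttEstimator`, the filter weight of 0.125, and the 32-bit wraparound handling are choices made for the example. Only the 100 ns tick comes from the slide.

```python
from typing import Optional


class ReverseRttEstimator:
    """Smooth receiver-side RTT samples taken from echoed timestamps."""

    TICK_SECONDS = 100e-9    # one timestamp tick = 100 ns
    COUNTER_MODULUS = 2 ** 32  # the TCP timestamp field is 32 bits wide

    def __init__(self, weight: float = 0.125) -> None:
        self.weight = weight            # assumed gain for new samples
        self.rtt: Optional[float] = None  # smoothed RTT in seconds

    def update(self, echoed_tick: int, now_tick: int) -> float:
        """Fold one sample into the filter and return the smoothed RTT."""
        # The modulo handles the counter wrapping between send and echo.
        elapsed = (now_tick - echoed_tick) % self.COUNTER_MODULUS
        sample = elapsed * self.TICK_SECONDS
        if self.rtt is None:
            self.rtt = sample
        else:
            self.rtt = (1.0 - self.weight) * self.rtt + self.weight * sample
        return self.rtt


if __name__ == "__main__":
    est = ReverseRttEstimator()
    print(est.update(echoed_tick=0, now_tick=1000))     # first sample, 100 us
    print(est.update(echoed_tick=1000, now_tick=4000))  # smoothed toward 300 us
```

A 32-bit counter at 100 ns per tick wraps roughly every seven minutes, so handling the wrap matters more here than with millisecond timestamps.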

Experimental results

Discussion and Conclusion • Three issues for discussion • Scalability: what if the number of connections becomes extremely large? • Switch the receive window between several values • How to handle congestion when the sender and receiver are not under the same switch • Use ECN to obtain congestion information • Whether ICTCP works for future high-bandwidth, low-latency networks • The switch buffer should be enlarged correspondingly • The MSS should be enlarged

Discussion and Conclusion • Conclusion • Focus on receiver-based congestion control to prevent packet loss • Adjust the TCP receive window based on the ratio of the difference between achieved and expected per-connection throughput • Experimental results show that ICTCP is effective in avoiding congestion

Thanks for listening
