Results on High Throughput and QoS Between the US and CERN

California Institute of Technology - CERN External Network Division
Sylvain.ravot@cern.ch
Datagrid WP7 - 24 January, 2002

Agenda
• TCP performance over high latency/bandwidth network
  • TCP behavior
  • TCP limits
  • TCP improvement
• Scavenger measurements over the transatlantic link
• Load balancing over the transatlantic links

TCP overview: Slow Start and Congestion Avoidance
State diagram:
• Connection opening: cwnd = 1 segment; the connection starts in Slow Start.
• Slow Start: exponential increase of cwnd until cwnd = SSTHRESH, then Congestion Avoidance.
• Congestion Avoidance: additive increase of cwnd.
• Retransmission timeout (in either state): SSTHRESH := cwnd/2, cwnd := 1 segment, back to Slow Start.
Window updates:
• Exponential increase of cwnd: for every useful acknowledgment received, cwnd := cwnd + (1 segment size).
• Additive increase of cwnd: for every useful acknowledgment received, cwnd := cwnd + (segment size) * (segment size) / cwnd. It takes a full window of acknowledgments to increase the window size by one segment.
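
The two update rules above can be sketched in a few lines of Python. This is only an illustration of the rules as stated on the slide, not the kernel code used in the tests; the function names and the 1460-byte segment size (the value quoted later in the slides) are assumptions made for the example.

```python
MSS = 1460  # segment size in bytes (assumed; the value quoted later in the slides)


def on_ack(cwnd, ssthresh, mss=MSS):
    """Return the new cwnd (bytes) after one useful acknowledgment."""
    if cwnd < ssthresh:
        # Slow start: one segment per acknowledgment (exponential per RTT).
        return cwnd + mss
    # Congestion avoidance: roughly one segment per full window of acks.
    return cwnd + mss * mss / cwnd


def on_timeout(cwnd, mss=MSS):
    """Return (new_cwnd, new_ssthresh) after a retransmission timeout."""
    return float(mss), cwnd / 2


def one_rtt(cwnd, ssthresh, mss=MSS):
    """Apply on_ack once per full segment in the window (one round trip)."""
    for _ in range(int(cwnd // mss)):
        cwnd = on_ack(cwnd, ssthresh, mss)
    return cwnd
```

Calling `one_rtt` repeatedly with a large `ssthresh` doubles the window each round trip, while with `cwnd >= ssthresh` the window grows by slightly less than one segment per round trip.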

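TCP overview: Fast Recovery
State diagram:
• Connection opening: cwnd = 1 segment; the connection starts in Slow Start.
• Slow Start: exponential increase of cwnd until cwnd = SSTHRESH, then Congestion Avoidance.
• Congestion Avoidance: additive increase of cwnd.
• Retransmission timeout (in Slow Start or Congestion Avoidance): SSTHRESH := cwnd/2, cwnd := 1 segment, back to Slow Start.
• 3 duplicate acks received (in Slow Start or Congestion Avoidance): enter Fast Recovery.
• Fast Recovery: exponential increase beyond cwnd. When the expected ack is received, cwnd := cwnd/2 and the connection moves to Congestion Avoidance. On a retransmission timeout, SSTHRESH := cwnd/2 and the connection goes back to Slow Start.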

TCP overview: Slow Start and Congestion Avoidance Example
Here is an estimation of the cwnd:
[Plot: cwnd versus time. Curves: cwnd average of the last 10 samples; cwnd average over the life of the connection to that point. Annotations: SSTHRESH, Slow Start, Congestion Avoidance.]
• Slow Start: fast increase of the cwnd
• Congestion Avoidance: slow increase of the window size

Tests configuration
[Topology diagram: Pcgiga-gbe.cern.ch (Geneva), GbEth, Cernh9, Uslink POS 155 Mbps, Ar1-chicago, GbEth, Lxusa-ge.cern.ch (Chicago); Calren2 / Abilene towards Plato.cacr.caltech.edu (California).]
• CERN<->Caltech (California)
  • RTT: 175 ms
  • Bandwidth-delay product: 2.65 MBytes
  • It is difficult to evaluate the available bandwidth between CERN and Caltech. Using UDP flows, we transferred data at a rate of 120 Mbit/s without losing any packets (each test ran for 60 s). At higher rates we lost packets. We deduced from this simple measurement that the available bandwidth was about 120 Mbit/s. We repeated the measurement several times to check whether the network conditions were changing.
• CERN<->Chicago
  • RTT: 110 ms
  • Bandwidth-delay product: 1.9 MBytes
• TCP flows were generated by Iperf. Tcpdump was used to capture packet flows, and tcptrace and xplot were used to plot and summarize the tcpdump data sets.
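
As a quick check of the bandwidth-delay product quoted for the CERN<->Caltech path, the sketch below multiplies the estimated available bandwidth (120 Mbit/s) by the RTT (175 ms). The function name is an assumption made for the example. The result is about 2.6 MBytes, close to the 2.65 MBytes quoted on the slide; the small difference presumably comes from rounding or from a slightly different bandwidth estimate.

```python
def bdp_bytes(bandwidth_bit_s, rtt_s):
    """Bandwidth-delay product in bytes: bits in flight over one RTT, divided by 8."""
    return bandwidth_bit_s * rtt_s / 8


# CERN <-> Caltech: 120 Mbit/s estimated available bandwidth, 175 ms RTT.
cern_caltech = bdp_bytes(120e6, 0.175)
print(f"CERN<->Caltech BDP: {cern_caltech / 1e6:.2f} MBytes")
```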

Influence of the initial SSTHRESH on TCP performance (1)
The two plots below show the influence of the initial SSTHRESH on performance:
[Two plots: cwnd = f(time), each showing Slow Start followed by Congestion Avoidance. Plot labels: SSTHRESH = 1460 Kbyte; SSTHRESH = 730 Kbyte. Captions: throughput of the connection = 33 Mbit/s; throughput of the connection = 63 Mbit/s.]
• During congestion avoidance and without loss, the cwnd increases by one segment each RTT. In our case we have no loss, so the window increases by 1460 bytes every 175 ms.
• If the cwnd is equal to 730 Kbyte, it takes almost 4 minutes to reach a cwnd larger than the bandwidth-delay product (2.65 MByte). In other words, we have to wait almost 4 minutes before using the whole capacity of the link.
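
The "almost 4 minutes" figure follows directly from the numbers on the slide: the window has to grow from 730 Kbyte to 2.65 MByte at 1460 bytes per 175 ms round trip. A minimal sketch of that arithmetic, with a function name chosen for the example:

```python
import math


def ramp_time_s(start_bytes, target_bytes, rtt_s, mss=1460):
    """Seconds needed to grow cwnd from start to target at one segment per RTT."""
    rtts = math.ceil((target_bytes - start_bytes) / mss)
    return rtts * rtt_s


t = ramp_time_s(730e3, 2.65e6, 0.175)
print(f"{t:.0f} s, i.e. {t / 60:.1f} minutes")
```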

Influence of the initial SSTHRESH on TCP performance (2)
• The Linux TCP implementation derives the initial ssthresh from the cwnd size of the previous TCP connection. The initial ssthresh therefore depends on the environment of the previous connection (bandwidth, RTT, loss rate …) => this parameter is not optimally set when the environment changes.
• We modified the Linux kernel in order to be able to set the initial ssthresh. We measured the influence of this parameter on TCP performance between CERN and Caltech (2.65 MByte bandwidth-delay product, 175 ms RTT).
[Plot annotation: bandwidth-delay product]

Too large cwnd compared to the available bandwidth
• By limiting the cwnd size to the value of the estimated bandwidth-delay product (2.65 Mbyte) we get 121 Mbit/s throughput. In this case, no loss occurs.
• When the cwnd can reach a larger value than the bandwidth-delay product, losses may occur and performance decreases. In the plot here, the cwnd can grow larger than the bandwidth-delay product, and we only get 53 Mbit/s throughput.
[Plot: cwnd when packets are lost because the window size is too large. Annotations: losses occur when the cwnd is larger than 3.5 Mbyte; 1) a packet is lost; 2) Fast Recovery (temporary state to repair the loss); 3) back to slow start (Fast Recovery couldn't repair the loss; the lost packet is detected by timeout => go back to slow start, cwnd = 2 MSS); new loss.]
• Losses occur when the window size becomes larger than 3.5 Mbyte. Then the cwnd goes back to 1 MSS (Maximum Segment Size) => performance is affected.
• Losses occur because the network doesn't have enough capacity to store all the packets sent.
• To get high throughput over a high delay/bandwidth network, we need to avoid losses. From this simple example, we see that if the window size is too large, some packets are dropped and performance decreases.

TCP Improvement
• Example, assuming the following parameters:
  • no loss
  • SSTHRESH = 65535 bytes
  • RTT = 175 ms (RTT between CERN and Caltech without congestion)
  • bandwidth = 120 Mbit/s => bandwidth-delay product = 2.65 Mbyte
• We can easily estimate the time needed to increase the cwnd to a size larger than the bandwidth-delay product. During congestion avoidance, cwnd is incremented by 1 full-sized segment per round-trip time (RTT). Therefore, increasing the congestion window from 65535 bytes to 2.65 Mbytes takes more than 5 minutes!
• Idea: increase the speed at which the window size grows.
• We change the TCP algorithm:
  • During slow start, for every useful acknowledgement received, cwnd increases by N segments. N is called the slow start increment.
  • During congestion avoidance, for every useful acknowledgement received, cwnd increases by M * (segment size) * (segment size) / cwnd. This is equivalent to increasing cwnd by M segments each RTT. M is called the congestion avoidance increment.
• Note: N = 1 and M = 1 in common TCP implementations.
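
The effect of N and M can be illustrated with a small loss-free simulation that counts round trips until the window exceeds a target size. This is a sketch of the modified rules as described on the slide, not the kernel patch itself; the function name, the one-ack-per-segment assumption and the 1460-byte segment size are assumptions made for the example.

```python
def rtts_to_reach(target, ssthresh, n=1, m=1, mss=1460):
    """Round trips needed for cwnd (bytes) to reach target, starting at 1 segment.

    n: slow start increment (segments added per ack while cwnd < ssthresh)
    m: congestion avoidance increment (about m segments added per RTT)
    Assumes no loss and one acknowledgment per full segment in the window.
    """
    cwnd = float(mss)
    rtts = 0
    while cwnd < target:
        for _ in range(int(cwnd // mss)):
            if cwnd < ssthresh:
                cwnd += n * mss                 # modified slow start
            else:
                cwnd += m * mss * mss / cwnd    # modified congestion avoidance
        rtts += 1
    return rtts


if __name__ == "__main__":
    rtt_s = 0.175
    for m in (1, 10):
        r = rtts_to_reach(2.65e6, 65535, m=m)
        print(f"M={m}: {r} RTTs, about {r * rtt_s / 60:.1f} minutes")
```

With the example parameters from the slide (SSTHRESH = 65535 bytes, 175 ms RTT), M = 1 reproduces the "more than 5 minutes" estimate, and a larger M shortens the ramp roughly in proportion.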

TCP tuning by modifying the slow start increment
[Four plots: congestion window (cwnd) as a function of time. Curves: cwnd average of the last 10 samples; cwnd average over the life of the connection to that point.]
• Slow start increment = 1: throughput = 98 Mbit/s
• Slow start increment = 2: throughput = 113 Mbit/s
• Slow start increment = 3: throughput = 116 Mbit/s
• Slow start increment = 5: throughput = 119 Mbit/s
• Slow start durations annotated on the plots: 0.8 s, 2.0 s, 1.2 s, 0.65 s
Note that for each connection the SSTHRESH (slow start threshold) is equal to 4.12 Mbyte.

TCP tuning by modifying the congestion avoidance increment (1)
SSTHRESH = 0.783 Mbyte
[Plot: congestion window (cwnd) as a function of time. Congestion avoidance increment = 1, throughput = 37.5 Mbit/s. Annotation: cwnd is increased by 1200 bytes in 27 sec.]
[Plot: congestion window (cwnd) as a function of time. Congestion avoidance increment = 10, throughput = 61.5 Mbit/s. Annotation: cwnd is increased by 12000 bytes (10 * 1200) in 27 sec.]
=> A larger congestion avoidance increment improves performance.

TCP tuning by modifying the congestion avoidance increment (2)
[Plot: throughput for several SSTHRESH values as a function of the congestion avoidance increment.]
• SSTHRESH < bandwidth-delay product (blue, pink and yellow plots): the larger the congestion avoidance increment, the better the performance.
• SSTHRESH > bandwidth-delay product (red plots): the cwnd is already larger than the bandwidth-delay product at the end of slow start, so the connection uses the whole available bandwidth (120 Mbit/s) from the beginning of congestion avoidance. The increment size doesn't influence the throughput.

Benefit of a larger congestion avoidance increment when losses occur
We simulate losses by using a program which drops packets according to a configured loss rate. For the next two plots, the program drops one packet out of every 10000 packets.
[Plot: congestion window (cwnd) as a function of time. Congestion avoidance increment = 1, throughput = 8 Mbit/s. Annotations: 1) a packet is lost; 2) Fast Recovery (temporary state until the loss is repaired); 3) cwnd := cwnd/2.]
[Plot: congestion window (cwnd) as a function of time. Congestion avoidance increment = 10, throughput = 20 Mbit/s.]
When a loss occurs, the cwnd is divided by two. The performance is determined by the speed at which the cwnd increases after the loss. So the higher the congestion avoidance increment, the better the performance.
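
The reasoning above can be illustrated with a toy model: the window is halved each time the 10000th packet since the previous loss is sent, and otherwise grows by M segments per round trip. This is not the drop program used for the measurements and it ignores slow start, fast recovery details and timeouts; the function name, the 600 s duration and the window cap of 1815 segments (about 2.65 MByte) are assumptions made for the example. It is only meant to show the qualitative trend.

```python
def simulate(ca_increment, loss_every=10_000, rtt_s=0.175, mss=1460,
             max_cwnd_segments=1815, duration_s=600.0):
    """Toy sawtooth model; returns the average throughput in Mbit/s."""
    cwnd = 1.0               # congestion window, in segments
    sent = 0                 # packets sent so far
    next_loss = loss_every   # sequence count at which the next packet is dropped
    for _ in range(int(duration_s / rtt_s)):
        sent += int(cwnd)    # one window of packets per round trip
        if sent >= next_loss:
            cwnd = max(cwnd / 2, 1.0)   # loss: cwnd := cwnd/2
            next_loss += loss_every
        else:
            cwnd = min(cwnd + ca_increment, max_cwnd_segments)
    return sent * mss * 8 / duration_s / 1e6


if __name__ == "__main__":
    for m in (1, 10):
        print(f"congestion avoidance increment {m}: {simulate(m):.1f} Mbit/s")
```

In this model the window recovers M times faster after each halving, so the average window between losses, and therefore the throughput, is higher for the larger increment.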

TCP over high latency/bandwidth conclusion
• To achieve high throughput over a high latency/bandwidth network, we need to:
  • Set the initial slow start threshold (ssthresh) to an appropriate value for the delay and bandwidth of the link. The initial slow start threshold has to be larger than the bandwidth-delay product but not too large.
  • Avoid losses by limiting the maximum cwnd size.
  • Recover fast if a loss occurs:
    • Larger cwnd increment => the cwnd increases faster after a loss
    • Smaller window reduction after a loss (not studied in this presentation)
    • …

Scavenger Service
• Introduction: Qbone Scavenger Service (QBSS) is an additional best-effort class of service. A small amount of network capacity is allocated for this service; when the default best-effort capacity is underutilized, QBSS can expand itself to consume the unused capacity.
• Goal of our test:
  • Does the Scavenger traffic affect performance of the normal best-effort traffic?
  • Does the Scavenger Service use the whole available bandwidth?

Tests configuration
• CERN<->Chicago
  • RTT: 116 ms
  • Bandwidth-delay product: 1.9 MBytes
• QBSS traffic is marked with DSCP 001000 (TOS field 0x20).
[Topology diagram: Pcgiga-gbe.cern.ch (Geneva), GbEth, Cernh9, Uslink POS 155 Mbps, Ar1-chicago, GbEth, Lxusa-ge.cern.ch (Chicago).]
• Cernh9 configuration:
    class-map match-all qbss
      match ip dscp 8
    policy-map qos-policy
      class qbss
        bandwidth percent 1
        queue-limit 64
        random-detect
      class class-default
        random-detect
    interface ...
      service-policy output qos-policy
• TCP and UDP flows were generated by Iperf. QBSS traffic is marked using the TOS option of iperf:
    iperf -c lxusa-ge -w 4M -p 5021 --tos 0x20
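
The relation between the DSCP value matched on the router (8, binary 001000) and the TOS byte passed to iperf (0x20) is a two-bit shift, since the DSCP occupies the upper six bits of the TOS byte. The sketch below shows the conversion and, as an assumption about how an application other than iperf could mark its own traffic, sets the TOS byte on a socket with the standard `IP_TOS` option. The function names are chosen for the example.

```python
import socket

QBSS_DSCP = 0b001000  # DSCP 8, the value matched by "match ip dscp 8"


def dscp_to_tos(dscp):
    """Return the TOS byte carrying the given 6-bit DSCP in its upper bits."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in 6 bits")
    return dscp << 2


def mark_socket(sock, dscp=QBSS_DSCP):
    """Mark all packets sent on sock with the given DSCP (IPv4)."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))


print(hex(dscp_to_tos(QBSS_DSCP)))
```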

Scavenger and TCP traffic (1)
• We ran two connections at the same time. Packets of connection #2 were marked (scavenger traffic) and packets of connection #1 were not marked. We measured how the two connections shared the bandwidth.
• TCP scavenger traffic doesn't affect normal TCP traffic. Packets of connection #2 are dropped by the scavenger service, so connection #2 reduces its rate before affecting connection #1. The throughput of connection #2 remains low because the loss rate of the scavenger traffic is high.

How does TCP Scavenger traffic use the available bandwidth?
• We performed TCP scavenger transfers when the available bandwidth was larger than 120 Mbps and measured the performance of the scavenger traffic.
[Plot annotation: available bandwidth]
• We performed the same tests without marking the packets and obtained a throughput larger than 120 Mbps.
• TCP scavenger traffic doesn't use the whole available bandwidth. Even if there is no congestion on the link, some packets are dropped by the router. This is probably due to the small size of the queue reserved for scavenger traffic (queue-limit 64).

Scavenger Conclusion
• TCP Scavenger traffic doesn't affect normal traffic. TCP connections are very sensitive to loss. When congestion occurs, scavenger packets are dropped first and the TCP scavenger source immediately reduces its rate. Therefore normal traffic isn't affected.
• Scavenger traffic expands to consume unused bandwidth, but doesn't use the whole available bandwidth.
• Scavenger is a good solution for transferring data without affecting normal (best-effort) traffic. It has to be kept in mind that scavenger doesn't take advantage of the whole unused bandwidth.
Future Work
• Our idea is to implement a monitoring tool based on Scavenger traffic. We could generate UDP scavenger traffic without affecting normal traffic in order to measure the available bandwidth.
• Can we use the Scavenger Service to perform tests without affecting the production traffic?
• Does the Scavenger traffic behave like normal traffic when no congestion occurs?

Load balancing over the transatlantic link
• Load balancing optimizes resource use by distributing traffic to a destination over multiple paths. Load balancing can be configured on a per-destination or per-packet basis. On Cisco routers, there are two types of load balancing for CEF (Cisco Express Forwarding):
  • Per-destination load balancing
  • Per-packet load balancing
• Per-destination load balancing allows the router to use multiple paths to achieve load sharing. Packets for a given source-destination pair are guaranteed to take the same path, even if multiple paths are available.
• Per-packet load balancing allows the router to send successive data packets without regard to individual hosts. It uses a round-robin method to determine which path each packet takes to the destination.
• We tested the two types of load balancing between Chicago and CERN using our two STM-1 circuits.
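
The difference between the two modes can be sketched as two path-selection functions. This is only an illustration of the behaviour described above: the slides do not say how CEF maps a source-destination pair to a path, so the CRC32 hash used here is a stand-in, and the path names and function names are assumptions made for the example.

```python
import itertools
import zlib

PATHS = ["circuit-1", "circuit-2"]  # hypothetical names for the two STM-1 circuits


def per_destination(src, dst, paths=PATHS):
    """Pick a path from the source-destination pair; the same pair always gets the same path."""
    key = f"{src}->{dst}".encode()
    return paths[zlib.crc32(key) % len(paths)]  # stand-in hash, not CEF's algorithm


def per_packet(paths=PATHS):
    """Return an iterator that hands out paths round-robin, one per packet."""
    return itertools.cycle(paths)


if __name__ == "__main__":
    rr = per_packet()
    for seq in range(4):
        print(seq,
              per_destination("pcgiga-gbe.cern.ch", "lxusa-ge.cern.ch"),
              next(rr))
```

For a single bulk transfer between two hosts, the first function keeps every packet on one circuit, while the second spreads the packets evenly over both.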

Configuration
• CERN<->Chicago
• RTT: 116 ms
• Bandwidth-delay product: 2 * 1.9 MBytes
[Topology diagram: Pcgiga-gbe.cern.ch (Geneva), GbEth, Cernh9 (Cisco 7507), POS 155 Mbps circuit #1 and POS 155 Mbps circuit #2, Ar1-chicago (Cisco 7507), GbEth, Lxusa-ge.cern.ch (Chicago).]

Load balancing: Per Destination vs Per Packets
• MRTG report: traffic from Chicago to CERN
[MRTG graphs: traffic from Chicago to CERN on link #1 and on link #2. Load balancing type over time: Per Destination, Per Packets, Per Destination.]
• When the bulk of data passing through parallel links is for a single source/destination pair, per-destination load balancing overloads a single link while the other link carries very little traffic. Per-packet load balancing makes it possible to use alternate paths to the same destination.

Load Balancing Per-Packets and TCP performance
• UDP flow (CERN -> Chicago):
  CERN:
    [sravot@pcgiga sravot]$ iperf -c lxusa-ge -w 4M -b 20M -t 20
    [ ID] Interval       Transfer     Bandwidth
    [  3]  0.0-20.0 sec  50.1 MBytes  21.0 Mbits/sec
    [  3] Sent 35716 datagrams
  Chicago:
    [sravot@lxusa sravot]$ iperf -s -w 4M -u
    [ ID] Interval       Transfer     Bandwidth       Jitter    Lost/Total Datagrams
    [  3]  0.0-20.0 sec  50.1 MBytes  21.0 Mbits/sec  0.446 ms  0/35716 (0%)
    [  3]  0.0-20.0 sec  17795 datagrams received out-of-order
  About 50% of the packets are received out of order (17795 of 35716).
• TCP flow (CERN -> Chicago):
    [root@pcgiga sravot]# iperf -c lxusa-ge -w 5700k -t 30
    [ ID] Interval       Transfer     Bandwidth
    [  3]  0.0-30.3 sec  690 MBytes   191 Mbits/sec
  By using tcptrace to plot and summarize the TCP flows captured by tcpdump, we measured that 99.8% of the acknowledgements are selective (SACK). The performance is quite good even though packets are received out of order: the SACK option is efficient. However, we were not able to get a throughput higher than 190 Mbit/s. It seems that receiving too many out-of-order packets limits TCP performance.
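
The out-of-order share reported by iperf can be checked from the two counters in the output, and a small model shows one way round-robin forwarding could produce a figure close to 50%. The model is an assumption, not something measured in these tests: it supposes that the two circuits differ in one-way delay by more than the gap between consecutive packets, with illustrative delay values. A packet counts as out of order when it arrives after a packet with a higher sequence number.

```python
def out_of_order_fraction(n_packets, gap_s, delays_s):
    """Fraction of packets arriving after a higher sequence number.

    Packets are sent every gap_s seconds and assigned round-robin to paths
    whose one-way delays are given in delays_s.
    """
    arrivals = sorted(
        (seq * gap_s + delays_s[seq % len(delays_s)], seq)
        for seq in range(n_packets)
    )
    highest = -1
    late = 0
    for _, seq in arrivals:
        if seq < highest:
            late += 1
        else:
            highest = seq
    return late / n_packets


# Share measured by iperf in the UDP test above.
measured = 17795 / 35716

# Model: 35716 datagrams in 20 s, two paths whose delays differ by 1 ms (assumed values).
modelled = out_of_order_fraction(10_000, 20 / 35716, (0.058, 0.059))
print(f"measured {measured:.1%}, modelled {modelled:.1%}")
```

With equal delays on both paths the model reports no reordering, which is consistent with per-packet load balancing being harmless only when the parallel paths are closely matched.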

Load Balancing Conclusion
• We decided to remove the per-packet load balancing option because it was impacting the operational traffic. Every packet flow going through the Uslink was reordered.
• Per-packet load balancing is inappropriate for traffic that depends on packets arriving at the destination in sequence.
