
TCP: extensions for high-speed networks




Presentation Transcript


  1. TCP: extensions for high-speed networks

  2. Scalable TCP: Improving performance in high-speed wide area networks. Ref: http://datatag.web.cern.ch/datatag/pfldnet2003/papers/kelly.pdf

  3. Agility of the standard congestion window adjustment algorithm. Round trip time = 200 ms, packet size = 1500 bytes, bandwidth = 1 Gbps, so cwnd ≈ 16,000 packets. After the detection of a congestion event: cwnd = 8,000 ==> 500 Mbps. Since standard TCP then grows the window by roughly one packet per round trip, reaching the 1 Gbps sending rate again takes about 8,000 RTTs (about 27 minutes).
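  A quick back-of-the-envelope check of these figures (a minimal sketch; it only assumes the standard behaviour stated above: halving on loss, then one extra packet per RTT):

    # Check of the standard TCP recovery time quoted above.
    rtt = 0.200              # round trip time, seconds
    pkt_bits = 1500 * 8      # packet size, bits
    link_bps = 1e9           # 1 Gbps bottleneck

    cwnd_full = link_bps * rtt / pkt_bits            # window that fills the pipe
    cwnd_after_loss = cwnd_full / 2                   # multiplicative decrease halves the window
    rtts_to_recover = cwnd_full - cwnd_after_loss     # +1 packet per RTT until full again
    print(round(cwnd_full), round(rtts_to_recover), round(rtts_to_recover * rtt / 60))
    # ~16667 packets, ~8333 RTTs, ~28 minutes: consistent with the ~16,000 / ~27 min on the slide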

  4. Scalable TCP. By altering the congestion window adjustment algorithm, the agility with large windows can be improved. For each acknowledgment received in a round trip time in which congestion has not been detected: cwnd ← cwnd + 0.01, and on the first detection of congestion in a given round trip time: cwnd ← cwnd − [0.125 * cwnd]
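  A minimal sketch of this update rule (function and variable names are illustrative, not taken from the paper's implementation):

    # Scalable TCP window update sketch: a = 0.01 per ACK, b = 0.125 on congestion.
    A, B = 0.01, 0.125

    def on_ack(cwnd):
        # Called for each ACK received in an RTT without congestion.
        return cwnd + A

    def on_congestion(cwnd):
        # Called on the first detection of congestion within an RTT.
        return cwnd - int(B * cwnd)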

  5. Scalable TCP. With the Scalable TCP algorithm, the time for a source to double its sending rate is about 70 RTTs, for any rate. Round trip time = 200 ms, packet size = 1500 bytes, bandwidth = 1 Gbps. After a congestion event, the Scalable TCP algorithm recovers its original rate, after a transient, in under 3 seconds.
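  These figures follow from the per-ACK rule on the previous slide: receiving roughly cwnd ACKs per RTT multiplies the window by 1.01 each round trip. A quick check (a sketch using the 200 ms RTT from the slide):

    import math

    a, b = 0.01, 0.125
    growth = 1 + a                                                # window growth factor per RTT
    rtts_to_double = math.log(2) / math.log(growth)               # ~69.7 RTTs
    rtts_to_recover = math.log(1 / (1 - b)) / math.log(growth)    # ~13.4 RTTs, independent of rate
    print(rtts_to_double, rtts_to_recover, rtts_to_recover * 0.200)   # recovery ~2.7 s at 200 ms RTT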

  6. Supporting loss rate for a connection: the maximum packet loss rate that a congestion control algorithm will tolerate to sustain a given level of throughput. Packet loss recovery time for a given rate and connection: the length of time required by a congestion control algorithm to return to its initial sending rate following the detection of a packet loss. Traditional TCP connections are unable to achieve high throughput in high-speed wide area networks because of their long packet loss recovery times and their need for low supporting loss rates.

  7. Properties of a traditional TCP connection with an RTT of 200 ms and a segment size of 1500 bytes

  8. Analysis and Design. Each source and destination pair is identified with a route r. The end-to-end dropping probability on route r is denoted by p_r(t). cwnd_r and T_r denote the congestion window and the round trip time of a connection on route r. Scalable TCP window update algorithm: on each acknowledgment, cwnd_r ← cwnd_r + a; on congestion, cwnd_r ← cwnd_r − [b * cwnd_r]. Packet loss recovery times for a traditional TCP connection are proportional to the connection’s window size and round trip time. Packet loss recovery times for a Scalable TCP connection are proportional to the connection’s round trip time only.

  9. Traditional TCP scaling properties. Scalable TCP scaling properties. Figures show the congestion window dynamics of a single connection over a dedicated link of capacity c or C (c < C).

  10. Experiments (1) Testbed topology used for experiments. The DataTAG testbed consists of 12 high performance PCs that have Supermicro P4DP8-G2 motherboards with dual 2.4GHz Xeon processors and 2 gigabytes of memory. SysKonnect SK-9843 Gigabit Ethernet cards on a 133MHz/64bit PCI bus provided connectivity to the testbed network. 6 servers are located at CERN, Geneva, and 6 servers at StarLight, Chicago. The clusters are connected through two Cisco 76xx routers with a 2.4Gbps packet-over-SONET link between Geneva and Chicago. The PCs are connected to each router through gigabit Ethernet ports.

  11. Experiments (2)
  • Scalable TCP was implemented in a system with a Linux 2.4.19 kernel.
  • This kernel implements a sophisticated TCP stack supporting:
  • TCP extensions for high performance (window scaling, timestamping, protection against wrapped sequence numbers)
  • SACK
  • D-SACK
  • The round trip time for a ping from Geneva to Chicago was 120ms. In the experiments that follow, the interface between Geneva and Chicago had a FIFO queue of 2048 packets. All the other gigabit Ethernet interfaces on the routers had the factory default setting of a 40 packet FIFO queue.
  • Three sender-side test cases are compared: TCP in an unaltered Linux 2.4.19 kernel, TCP in a Linux 2.4.19 kernel with the gigabit kernel modifications, and Scalable TCP in a Linux 2.4.19 kernel with the gigabit kernel modifications.

  12. Experiments (3) Number of 2 Gigabyte transfers completed in 1200 seconds. In these tests 4 server and receiver pairs were used, with TCP flows distributed evenly across the 4 machines. Each receiver in Chicago would request a file of size 2 Gigabytes from its associated server in Geneva. The server responded by transferring 2 Gigabytes of data (from memory) back to the receiver in Chicago. Upon completion of the 2 Gigabyte transfer the connection was closed and another request was initiated. This was intended to capture some slow-start and termination dynamics. In all cases each TCP socket had send and receive buffers set to 64MB; this allowed a single flow to make full use of any bandwidth available to it.

  13. TCP Westwood: Bandwidth estimation for enhanced transport over wireless links http://www.cs.ucla.edu/NRL/hpi/tcpw/tcpw_papers/2001-mobicom-0.pdf “Link Capacity Allocation and Network Control by Filtered Input Rate in High Speed Networks”, IEEE/ACM Transactions on Networking, vol. 3, no. 1, Feb. 1995, pp. 10-25 http://www.cs.ucla.edu/NRL/hpi/tcpw/tcpw_papers/tcpwred-spects2003.pdf

  14. Introduction (1) Current TCP implementations rely on packet loss as an indicator of network congestion.
  • In the wired portion of the network a congested router is indeed the likely cause of packet loss;
  • On a wireless link, on the other hand, a noisy, fading radio channel is the more likely cause of loss.
  This creates problems for TCP Reno, since it does not possess the capability to distinguish and isolate congestion loss from wireless loss.

  15. Introduction (2)
  • The best performing approach to solve the problem was shown to be a localized link-layer solution applied directly to the wireless links: the “Snoop” protocol.
  • The protocol, appropriately called “Snoop”, monitors the packets flowing over the wireless link as well as their related acknowledgments.
  • The protocol entities cache copies of TCP data packets and monitor the ACKs in the reverse direction. If a packet loss is detected (i.e., through duplicate acknowledgments, DUPACKs), the cached copy is used for local retransmission, and any packet carrying feedback information back to the TCP sender is extracted so as to avoid “premature” retransmission at the TCP sender.
  • Snoop, however, has its own limitations: it requires a snoop proxy in the base station, and if the TCP sender is mobile the TCP code must be modified to respond to Explicit Loss Notification (ELN) packets from the base station.

  16. End-to-end bandwidth measurement. The ACK-based measurement procedure (1). A fundamental design philosophy of the Internet TCP congestion control algorithm is that it must be performed end-to-end; the network is considered as a "black box". The key idea of TCP Westwood is to exploit TCP acknowledgment packets to derive rather sophisticated measurements. After a congestion episode (i.e. the source receives three duplicate ACKs or a timeout), the source uses the measured bandwidth to properly set the congestion window and the slow start threshold, starting a procedure that we will call faster recovery. When an ACK is received by the source, it conveys the information that an amount of data corresponding to a specific transmitted packet was delivered to the destination. If the transmission process is not affected by losses, simply averaging the delivered data count over time yields a fair estimate of the bandwidth currently used by the source.

  17. The ACK-based measurement procedure (2) When duplicate ACKs (DUPACKs), indicating an out-of-sequence reception, reach the source, they should also count toward the bandwidth estimate, and a new estimate should be computed right after their reception. However, the source is in no position to tell for sure which segment triggered the DUPACK transmission, and it is thus unable to update the data count by the size of that segment. For the sake of simplicity, we assume all TCP segments to be of the same size.

  18. Filtering the ACK reception rate. If an ACK is received at the source at time t_k, this implies that a corresponding amount of data d_k has been received by the TCP receiver. We can measure the following sample of bandwidth used by that connection as: b_k = d_k / (t_k – t_{k-1}), where t_{k-1} is the time the previous ACK was received. Since congestion occurs whenever the low-frequency input traffic rate exceeds the link capacity, we employ a low-pass filter to average sampled measurements and to obtain the low-frequency components of the available bandwidth.
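  A sketch of this estimator in code. The paper specifies a particular discrete low-pass filter; the simple exponential smoothing below, with an illustrative gain ALPHA, only stands in for it:

    # ACK-based bandwidth estimation sketch; ALPHA is illustrative,
    # not the paper's filter coefficient.
    ALPHA = 0.9

    class BandwidthEstimator:
        def __init__(self):
            self.last_ack_time = None
            self.bwe = 0.0                                  # filtered bandwidth estimate

        def on_ack(self, now, acked_bytes):
            if self.last_ack_time is not None:
                sample = acked_bytes / (now - self.last_ack_time)    # b_k = d_k / (t_k - t_{k-1})
                self.bwe = ALPHA * self.bwe + (1 - ALPHA) * sample   # low-pass filtering
            self.last_ack_time = now
            return self.bwe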

  19. TCP Westwood (1). How can the bandwidth estimate be used by the congestion control algorithm executed at the sender side of a TCP connection? The general idea is to use the bandwidth estimate BWE to set the congestion window (cwin) and the slow start threshold (ssthresh) after a congestion episode. Algorithm after n duplicate ACKs:
  if (n DUPACKs are received)
    ssthresh = (BWE * RTTmin) / seg_size;
    if (cwin > ssthresh)   /* congestion avoidance */
      cwin = ssthresh;
    endif
  endif
  (RTTmin = the smallest RTT sample observed over the duration of the connection)

  20. TCP Westwood (2). Algorithm after coarse timeout expiration:
  if (coarse timeout expires)
    ssthresh = (BWE * RTTmin) / seg_size;
    if (ssthresh < 2)
      ssthresh = 2;
    endif;
    cwin = 1;
  endif
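  A compact sender-side sketch of the two rules above (cwin and ssthresh expressed in segments; function and variable names are illustrative):

    def on_dupacks(cwin, ssthresh, bwe, rtt_min, seg_size):
        # Faster recovery after n duplicate ACKs.
        ssthresh = (bwe * rtt_min) / seg_size
        if cwin > ssthresh:            # congestion avoidance
            cwin = ssthresh
        return cwin, ssthresh

    def on_timeout(bwe, rtt_min, seg_size):
        # Reaction to a coarse timeout expiration.
        ssthresh = (bwe * rtt_min) / seg_size
        if ssthresh < 2:
            ssthresh = 2
        cwin = 1
        return cwin, ssthresh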

  21. TCP Westwood Performance, Fairness and Friendliness. Bandwidth estimation effectiveness (1). TCPW with concurrent UDP traffic: bandwidth estimation on a 5 Mb/s bottleneck link with a one-way propagation delay of 30ms

  22. Bandwith estimation effectiveness (2) Consider a single TCPW connection sharing the bottleneck link with UDP connections. Packets are queued and transmitted on the link in FCFS order. In addition to demonstrating the accuracy of the bandwidth estimation algorithm, this scenario also illustrates the capability of a TCP Westwood connection to use the bandwidth left over by dynamic UDP flows. The configuration simulated here features a 5 Mb/s bottleneck link with a one-way propagation delay of 30ms. One TCP connection shares the bottleneck link with two ON/OFF UDP connections, and TCP and UDP packets are assigned the same priority. Each UDP connection transmits at a constant bit rate of 1 Mb/s while ON. Both UDP connections start in the OFF state; after 25s, the first UDP connection is turned ON, joined by the second one at 50s; the second connection follows an OFF-ON-OFF pattern at times 75s, 125s and 175s; at time 200s the first UDP connection is turned off as well. The UDP connections remain silent until the end of the simulation. The TCPW connection sends data throughout the simulation.

  23. TCPW fairness (1) Fair bandwidth sharing implies that all connections are provided with similar opportunity to transfer data. Sequence numbers vs. time for long and short RTT connections without RED (Random Early Detection) Sequence numbers vs. time for long and short RTT connections with RED (Random Early Detection)

  24. TCPW fairness (2) Two flows with different E2E round trip times (RTT) share the bandwidth more effectively under TCPW than under TCP Reno. We ran simulations in which connections were subject to 50ms and 200ms RTT, respectively. Figures show sequence number progress for TCPW and Reno connections without and with RED, respectively. In all cases the short connection progresses faster as expected. We note however that TCPW provides better fairness than Reno across different propagation times. The reason for the superior fairness exhibited by TCPW is that the long connection suffers less reduction in cwin and ssthresh. In Reno, cwin reduction is independent of RTT. The results show that both protocols benefit from RED, as far as fairness is concerned. Remarkably, the improvement in TCPW due to RED was higher than the improvement in Reno.

  25. TCPW friendliness (1) Friendliness is another important property of a TCP protocol. TCPW must be “friendly” to other TCP variants. TCPW connections must be able to coexist with connections from TCP variants while providing opportunities for all connections to progress satisfactorily. At least, TCPW connections should not result in starvation of connections running other TCP variants. Better yet, the bandwidth share of TCPW connections should be equal to their fair share. We ran simulation experiments with the following parameters: 2-Mbps bottleneck link, 20 flows total, all flows with 100ms RTT.

  26. TCPW friendliness (2) With all 20 connections running TCPW, the average throughput per connection was 0.0994 Mbps. All 20 Reno connections resulted in an average throughput of 0.0992 Mbps. As predicted, we got the same results for the two schemes. We then ran 10 Reno and 10 Westwood connections sharing the same 2Mbps bottleneck link over a path of 100ms RTT. The average throughput for a TCPW connection went up to 0.1078 Mbps, and that of a Reno connection went down to 0.0913 Mbps. This shows that TCPW behavior departs from “fair share” by 16% (TCPW gains 8% and TCP Reno loses 8%). This unfairness is rather moderate and can be tolerated, as it allows for coexistence with Reno.
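  A quick check of the fair-share departure quoted above, taking the homogeneous-case average (~0.0995 Mbps per connection) as the fair share:

    fair = 0.0995                   # Mbps, per-connection fair share
    tcpw, reno = 0.1078, 0.0913     # Mbps, mixed 10 + 10 experiment
    print((tcpw - fair) / fair, (reno - fair) / fair)   # ~ +8% and ~ -8%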

  27. TCPW performance in mixed (wired and wireless) networks Independent loss model in ground radio environment (1) A simple simulation topology Figure shows a topology of a mixed network with a wired portion including a 10-Mbps link between a source node and a base station. The propagation time over the wired link is initially assumed to be 45ms. Later, the propagation time is varied from 0 to 250ms to represent a variety of wired network environments ranging from campus to intercontinental connections. The wireless portion of the network is a very short 2-Mbps wireless link with a propagation time of 0.01ms. The wireless link is assumed to connect the base station to a destination mobile terminal. Errors are assumed to occur in both directions of the wireless link.

  28. Independent loss model in ground radio environment (2) Throughput vs. error rate of the wireless link. We compare the throughput of TCPW to that of Reno and SACK assuming independent (Bernoulli) errors ranging from 0 to 5% packet loss probability. The error model assumed here is equivalent to the “exponential error” model in which the time between successive errors is exponentially distributed. The results in the figure show that TCPW gains up to 394% over Reno or SACK. This gain occurs at a realistic packet error probability of 1%.

  29. Independent loss model in ground radio environment (3) Throughput vs. one-way propagation delay. To assess TCPW throughput gain and its relation to the E2E propagation time, we ran simulations with the wired portion propagation time varying from 0 to 250ms. The results in the figure show a significant gain for TCPW of up to 567%, at a propagation time of 100ms. When the propagation time is small (say, less than 5ms), all protocols are equally effective. This is because a small window is adequate and window optimization is not an issue. TCPW reaches maximum improvement over Reno and SACK as the propagation time increases to about 100ms. After that, in this experiment, the gain starts to decrease as the feedback information used to estimate the available bandwidth arrives too late to be of significant help to TCPW.

  30. Independent loss model in ground radio environment (4) Throughput vs. link capacity Simulation results in Figure show that TCPW gains also increase significantly as the bottleneck link transmission speed increases (again, because what matters is the window size determined by the bandwidth-delay product). Thus, TCPW is more effective than TCP Reno in utilizing the Gbps bandwidth provided by new-generation, high-speed networks. Figure shows that the improvement obtained via TCPW increases to approximately 550% when the wireless link speed reaches 8 Mbps. The error model is still Bernoulli with parameter 0.5%, and the E2E propagation time is 45ms.

  31. LEO satellite model (1) Another environment where TCPW is likely to be valuable is the LEO satellite system. LEO satellites present an environment with varying link quality and relatively long propagation delay. Also, higher transmission speeds are expected in the future. That is where TCPW would be most beneficial. Throughput vs. link capacity of the satellite link

  32. LEO satellite model (2) We considered for this study a scenario where a single hop, up to the satellite and down to an earth terminal, connects a terminal to a gateway and from there to the terrestrial network. One-way (e.g. terminal to gateway) propagation time is assumed to be 100ms. The error rate is assumed to be 0.1% in normal operating conditions. Occasionally, if the LEO system supports satellite diversity, a handoff to a different LEO satellite (different orbit) becomes necessary to overcome the blocking due to buildings, thick foliage etc. During handoff, we assume all packets are lost. In our model, the handoff from one satellite to another needs 100ms to complete, and the period between handoffs is 4s, say. The figure shows the major improvements of TCPW over Reno and SACK, especially at high speeds.

  33. Fast TCP: Motivation, Architecture, Algorithms, Performance. From Theory to Experiments. http://netlab.caltech.edu/pub/papers/FAST-infocom2004.pdf http://netlab.caltech.edu/pub/papers/FAST-csreport2003.pdf http://netlab.caltech.edu/pub/papers/fast-network05.pdf

  34. TCP Reno difficulties
  • Equilibrium problem
  • Packet level: additive increment too slow, multiplicative decrement too drastic.
  • Flow level: requires a very small loss probability.
  • Dynamic problem
  • Packet level: must oscillate on a binary signal.
  • Flow level: unstable at large window.

  35. Fast TCP: Architecture and algorithms. We separate the congestion control mechanism of TCP into 4 components that are functionally independent, so that they can be designed separately and upgraded asynchronously.
  Data control → determines which packets to transmit
  Window control → determines how many packets to transmit
  Burstiness control → determines when to transmit these packets
  These decisions are made based on information provided by the fourth component, estimation.

  36. Estimation. When a positive acknowledgement is received, it calculates the RTT for the corresponding data packet and updates the average queueing delay and the minimum RTT. When a negative acknowledgement (signaled by 3 duplicate ACKs or a timeout) is received, it generates a loss indication for this data packet to the other components. The estimation component generates both a multi-bit queueing delay sample and a one-bit loss-or-no-loss sample for each data packet. The queueing delay is smoothed by taking a moving average with the weight η(t) := min(3/w_i(t), 1/4) that depends on the window w_i(t) at time t, as follows. The k-th RTT sample T_i(k) updates the average RTT according to: avgRTT_i(k+1) = (1 − η(t_k)) · avgRTT_i(k) + η(t_k) · T_i(k), where t_k is the time at which the k-th RTT sample is received. Taking d_i(k) to be the minimum RTT observed so far, the average queueing delay is estimated as: qdelay_i(k+1) = avgRTT_i(k+1) − d_i(k).
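  A sketch of this estimation component in code (names follow the formulas above; nothing here is taken from the reference implementation):

    class Estimator:
        def __init__(self):
            self.base_rtt = float("inf")     # minimum RTT observed so far, d_i
            self.avg_rtt = None              # moving average of RTT samples

        def on_rtt_sample(self, rtt_sample, cwnd):
            eta = min(3.0 / cwnd, 0.25)      # weight eta(t) := min(3/w_i(t), 1/4)
            self.base_rtt = min(self.base_rtt, rtt_sample)
            if self.avg_rtt is None:
                self.avg_rtt = rtt_sample
            else:
                self.avg_rtt = (1 - eta) * self.avg_rtt + eta * rtt_sample
            return self.avg_rtt - self.base_rtt     # average queueing delay estimate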

  37. Data Control
  • It selects the next packet to send from 3 pools of candidates:
  • new packets,
  • packets that are deemed lost (negatively acknowledged),
  • transmitted packets that are not yet acknowledged.
  When there is no loss, new packets are sent in sequence as old packets are acknowledged (self-clocking or ack-clocking). During loss recovery a decision must be made on whether to retransmit lost packets, to keep transmitting new packets, or to retransmit older packets that are neither acknowledged nor marked as lost. Certainly lost packets → longer recovery time. It is also important to transmit a sufficient number of new packets to maintain reliable RTT measurements for window control. It’s important to find the right mix!

  38. Window Control. It determines the congestion window based on congestion information (queueing delay and packet loss). The same algorithm is used for congestion window computation regardless of the sender state. The congestion control reacts to both queueing delay and packet loss. FAST TCP periodically updates the congestion window according to: w ← min{ 2w, (1 − γ)·w + γ·( (baseRTT / (baseRTT + qdelay)) · w + α ) }, where γ ∈ [0, 1], baseRTT is the minimum RTT observed so far, qdelay is the end-to-end (average) queueing delay, and α is a protocol parameter (the number of packets the flow aims to keep queued in the network).
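  A minimal sketch of this periodic update (ALPHA and GAMMA are protocol parameters; the values below are purely illustrative):

    ALPHA = 200     # target number of packets kept queued in the network (illustrative)
    GAMMA = 0.5     # update gain, in [0, 1] (illustrative)

    def update_window(w, base_rtt, qdelay):
        rtt = base_rtt + qdelay
        target = (base_rtt / rtt) * w + ALPHA
        return min(2 * w, (1 - GAMMA) * w + GAMMA * target)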

  39. Burstiness Control (1) This component smooths out the transmission of packets in a fluid-like manner to track the available bandwidth. It is particularly important in networks with large bandwidth-delay products, where large bursts may create long queues and even massive losses. TCP Reno uses self-clocking to regulate burstiness by transmitting a new packet only when an old packet is acknowledged. This works well when the receiver acknowledges every data packet. When the congestion window is large, self-clocking is not sufficient. Lost or delayed ACKs can often lead to a single ACK acknowledging a large number of outstanding packets; self-clocking will then allow the transmission of a large burst of packets. ACKs may also arrive in a burst at the sender due to queueing of ACKs in the reverse path of the connection (ack compression), again triggering a large burst of outgoing packets. In networks with a large bandwidth-delay product, the congestion window can be increased by a large amount during transients, e.g. in slow start. This breaks packet conservation and self-clocking, and allows a large burst of packets to be sent.

  40. Burstiness Control (2)
  • There are two burstiness control mechanisms, one to supplement self-clocking in streaming out individual packets and the other to increase the window size smoothly in smaller bursts.
  • Burstiness reduction decides how many packets to send when an ACK advances the congestion window by a large amount, and attempts to limit the burst size on a time scale smaller than one RTT.
  • Window pacing determines how to increase the congestion window over the idle time of a connection to the target determined by the window control component.
  Burstiness reduction. We define the instantaneous burstiness B_i(t) at time t as the extra backlog introduced during the RTT before t:

  41. Burstiness Control (3) where T_i(t) and w_i(t) are the round trip time and the window size, and W_i(s,t) is the number of packets sent in the time interval [s,t]. When an acknowledgement arrives, B_i(t) is calculated before transmitting any new packets. New packets are sent only if B_i(t) is less than a threshold, and postponed otherwise. Window pacing. Idle-time detection is accomplished by checking the difference between the time of the most recent packet transmission and the current time when an acknowledgement is received. If this time difference is larger than a threshold, we deem the idle time long enough to insert additional packet transmissions. The increment to the congestion window in the RTT will be spread over this idle time when it reappears.
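  A small sketch of the two checks just described (both thresholds are illustrative, and burstiness stands in for the B_i(t) quantity defined on the previous slide):

    BURST_THRESHOLD = 8      # packets, illustrative
    IDLE_THRESHOLD = 0.01    # seconds, illustrative

    def may_send_new_packets(burstiness):
        # Burstiness reduction: send only while the extra backlog stays small.
        return burstiness < BURST_THRESHOLD

    def idle_long_enough(now, last_tx_time):
        # Window pacing: on ACK arrival, detect an idle period long enough
        # to spread the congestion window increment over.
        return (now - last_tx_time) > IDLE_THRESHOLD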

  42. Packet level implementation (1) TCP is an event-based protocol, so we need to translate our flow-level algorithms into event-based packet-level algorithms. There are 4 types of events that FAST TCP reacts to:
  • the reception of an acknowledgement,
  • the transmission of a packet,
  • the end of an RTT,
  • each packet loss.

  43. Packet level implementation (2) For each acknowledgment received, the estimation component computes the average queueing delay, and the burstiness control component determines whether packets can be injected. For each packet transmitted, the estimation component records a time stamp, and the burstiness control component updates the corresponding data structures for bookkeeping. At a constant time interval, which we check on the arrival of each acknowledgement, window control calculates a new window size. At the end of each RTT, burstiness reduction calculates the target throughput using the window and RTT measurements from the last RTT. Window pacing will then schedule to break up a large increment in the congestion window into smaller increments over time. During loss recovery the congestion window is continually updated, as if under normal network conditions, based on congestion signals from the network. Each time a new packet loss is detected, the sender determines whether to retransmit it right away or hold off until a more appropriate time.
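  An event-handler skeleton matching these four events (structure only; the component internals are the ones sketched on the earlier slides):

    class FastTcpSender:
        def on_ack(self, ack):
            ...   # estimation: update average queueing delay; burstiness control:
                  # decide whether packets may be injected; run the periodic
                  # window-control update if its interval has elapsed

        def on_packet_sent(self, pkt):
            ...   # estimation: record a timestamp; burstiness control: bookkeeping

        def on_rtt_end(self):
            ...   # burstiness reduction: target throughput from the last RTT's
                  # window and RTT; window pacing: spread large window increments

        def on_loss(self, pkt):
            ...   # decide whether to retransmit now or later; keep updating cwnd
                  # from congestion signals as in normal operation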
