End-to-End Protocols: Getting Processes to Communicate
Outline • UDP • TCP • Remote Procedure Call • Performance
Underlying best-effort network • drops messages • reorders messages • delivers duplicate copies of a given message • limits messages to some finite size • delivers messages after an arbitrarily long delay
Common end-to-end services • guarantee message delivery • deliver messages in the same order they are sent • deliver at most one copy of each message • support arbitrarily large messages • support synchronization • allow the receiver to flow control the sender • support multiple application processes on each host
Simple Demultiplexer (UDP) • Unreliable and unordered datagram service • Adds multiplexing • No flow control • Endpoints identified by ports • servers have well-known ports • see /etc/services on Unix • Header format (figure, 32 bits wide): SrcPort (bits 0-15), DstPort (bits 16-31), Length, Checksum, then Data • Optional checksum • computed over pseudo header + UDP header + data • Pseudo header: • three fields from the IP header: protocol number, source IP address, destination IP address • plus the UDP length field
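The checksum coverage described above can be sketched as follows. This is a minimal illustration, assuming IPv4 addresses and the standard 16-bit ones' complement Internet checksum; the function names `ones_complement_sum16` and `udp_checksum` are made up for this example.

```python
import socket
import struct

def ones_complement_sum16(data: bytes) -> int:
    """Add 16-bit big-endian words with end-around carry."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with one zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return total

def udp_checksum(src_ip: str, dst_ip: str,
                 src_port: int, dst_port: int, payload: bytes) -> int:
    """Checksum over pseudo header + UDP header + data (IPv4 assumed)."""
    length = 8 + len(payload)  # the UDP header is 8 bytes long
    # Pseudo header: source address, destination address, zero byte,
    # protocol number (17 for UDP), UDP length.
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, 17, length))
    # UDP header with the checksum field set to zero while computing.
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    checksum = ~ones_complement_sum16(pseudo + header + payload) & 0xFFFF
    return checksum or 0xFFFF  # a transmitted 0 means "no checksum"
```

A receiver can verify a datagram by summing the same bytes with the received checksum in place; a correct datagram sums to 0xFFFF.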
UDP message queue (figure): packets arrive at UDP and are demultiplexed by port into per-port queues, each of which is read by one application process
TCP Overview • Figure: the sending application process writes bytes into the TCP send buffer, TCP transmits segments, and the receiving application process reads bytes from the TCP receive buffer • Connection-oriented • Byte-stream • app writes bytes • TCP sends segments • app reads bytes • Full duplex • Flow control: keep sender from overrunning receiver • Congestion control: keep sender from overrunning network
Data Link Versus Transport • Potentially connects many different hosts • need explicit connection establishment and termination • Potentially different RTT • need adaptive timeout mechanism • Potentially long delay in network • need to be prepared for arrival of very old packets • Potentially different capacity at destination • need to accommodate different node capacity • Potentially different network capacity • need to be prepared for network congestion
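Segment Format (figure, 32 bits wide) • SrcPort (bits 0-15), DstPort (bits 16-31) • SequenceNum • Acknowledgment • HdrLen (bits 0-3), reserved 0 (bits 4-9), Flags (bits 10-15), AdvertisedWindow (bits 16-31) • Checksum, UrgPtr • Options (variable) • Data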
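To make the field layout concrete, here is a minimal sketch that unpacks the fixed 20-byte part of a segment header. The `TcpHeader` type and `parse_tcp_header` function are hypothetical names for this example; the flag names follow the slides, and the bit positions are the standard ones.

```python
import struct
from typing import NamedTuple

class TcpHeader(NamedTuple):
    src_port: int
    dst_port: int
    sequence_num: int
    acknowledgment: int
    hdr_len: int            # header length in bytes
    flags: frozenset
    advertised_window: int
    checksum: int
    urg_ptr: int

# Low six bits of the 16-bit HdrLen/Flags word.
FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RESET": 0x04,
             "PUSH": 0x08, "ACK": 0x10, "URG": 0x20}

def parse_tcp_header(segment: bytes) -> TcpHeader:
    """Parse the fixed part of a TCP header; options are not decoded."""
    (src, dst, seq, ack, len_flags,
     window, checksum, urg) = struct.unpack("!HHIIHHHH", segment[:20])
    hdr_len = (len_flags >> 12) * 4  # HdrLen counts 32-bit words
    flags = frozenset(n for n, bit in FLAG_BITS.items() if len_flags & bit)
    return TcpHeader(src, dst, seq, ack, hdr_len, flags,
                     window, checksum, urg)
```

Any options occupy the bytes between offset 20 and `hdr_len`, and the data starts at `hdr_len`.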
Segment Format (cont) • Figure: the sender sends Data (SequenceNum); the receiver returns Acknowledgment + AdvertisedWindow • Each connection identified with 4-tuple: • (SrcPort, SrcIPAddr, DstPort, DstIPAddr) • Sliding window + flow control • Acknowledgment, SequenceNum, AdvertisedWindow • Flags • SYN, FIN, RESET, PUSH, URG, ACK • Checksum • pseudo header + TCP header + data • Pseudo header = SrcIPAddr + DstIPAddr + protocol number + TCP segment length
Three-way handshake to establish a connection (figure) • 1. Active participant (client) to passive participant (server): SYN, SequenceNum = x • 2. Server to client: SYN + ACK, SequenceNum = y, Acknowledgment = x + 1 • 3. Client to server: ACK, Acknowledgment = y + 1
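State Transition Diagram (figure; each arc is labeled event/action) • CLOSED: passive open to LISTEN; active open/SYN to SYN_SENT • LISTEN: SYN/SYN + ACK to SYN_RCVD; send/SYN to SYN_SENT; close to CLOSED • SYN_SENT: SYN/SYN + ACK to SYN_RCVD; SYN + ACK/ACK to ESTABLISHED; close to CLOSED • SYN_RCVD: ACK to ESTABLISHED; close/FIN to FIN_WAIT_1 • ESTABLISHED: close/FIN to FIN_WAIT_1; FIN/ACK to CLOSE_WAIT • FIN_WAIT_1: ACK to FIN_WAIT_2; FIN/ACK to CLOSING; ACK + FIN/ACK to TIME_WAIT • FIN_WAIT_2: FIN/ACK to TIME_WAIT • CLOSING: ACK to TIME_WAIT • CLOSE_WAIT: close/FIN to LAST_ACK • LAST_ACK: ACK to CLOSED • TIME_WAIT: timeout after two segment lifetimes to CLOSED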
Terminating a connection • Termination is symmetric: the application process on each side needs to close its half of the connection independently. • Three combinations lead to CLOSED • This side closes first • The other side closes first • Both sides close at the same time • It is also possible that one side closes immediately after the other • A connection in TIME_WAIT has to wait for twice the maximum amount of time an IP datagram might live in the Internet before it moves to CLOSED.
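The close combinations above can be traced with a small table-driven state machine. This is a sketch of only the teardown half of the TCP state diagram; the event names (`close`, `FIN`, `ACK`, `ACK+FIN`, `timeout`) and the `close_path` helper are labels chosen for this example.

```python
# (state, event) -> (next state, segment sent in response or None)
TRANSITIONS = {
    ("ESTABLISHED", "close"):   ("FIN_WAIT_1", "FIN"),
    ("ESTABLISHED", "FIN"):     ("CLOSE_WAIT", "ACK"),
    ("FIN_WAIT_1", "ACK"):      ("FIN_WAIT_2", None),
    ("FIN_WAIT_1", "FIN"):      ("CLOSING", "ACK"),
    ("FIN_WAIT_1", "ACK+FIN"):  ("TIME_WAIT", "ACK"),
    ("FIN_WAIT_2", "FIN"):      ("TIME_WAIT", "ACK"),
    ("CLOSING", "ACK"):         ("TIME_WAIT", None),
    ("CLOSE_WAIT", "close"):    ("LAST_ACK", "FIN"),
    ("LAST_ACK", "ACK"):        ("CLOSED", None),
    ("TIME_WAIT", "timeout"):   ("CLOSED", None),  # two segment lifetimes
}

def close_path(events, state="ESTABLISHED"):
    """Return the list of states visited while handling the events."""
    path = [state]
    for event in events:
        state, _segment = TRANSITIONS[(state, event)]
        path.append(state)
    return path

# This side closes first.
print(close_path(["close", "ACK", "FIN", "timeout"]))
# The other side closes first.
print(close_path(["FIN", "close", "ACK"]))
# Both sides close at the same time.
print(close_path(["close", "FIN", "ACK", "timeout"]))
```

An event that is not valid in the current state raises `KeyError`, which is a simplification; a real implementation would handle or ignore unexpected segments.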
State-Transition Diagram: one side closes first (figure) • The diagram is repeated with seven numbered steps covering both ends of the connection • Active close: ESTABLISHED, FIN_WAIT_1, FIN_WAIT_2, TIME_WAIT, CLOSED • Passive close: ESTABLISHED, CLOSE_WAIT, LAST_ACK, CLOSED
State-Transition Diagram: active close (figure) • The diagram is repeated with four numbered steps that take the actively closing side from ESTABLISHED to CLOSED
State-Transition Diagram: server closes immediately (figure) • The diagram is repeated with five numbered steps covering an active close on one side and a passive close on the other, annotated "Server closes immediately"
Sliding Window Revisited • Figure: the sending TCP tracks LastByteAcked, LastByteSent, and LastByteWritten; the receiving TCP tracks LastByteRead, NextByteExpected, and LastByteRcvd • Sending side • LastByteAcked <= LastByteSent • LastByteSent <= LastByteWritten • buffer bytes between LastByteAcked and LastByteWritten • Receiving side • LastByteRead < NextByteExpected • NextByteExpected <= LastByteRcvd + 1 • buffer bytes between LastByteRead and LastByteRcvd
Flow Control • Send buffer size: MaxSendBuffer • Receive buffer size: MaxRcvBuffer • Receiving side • LastByteRcvd - LastByteRead <= MaxRcvBuffer • AdvertisedWindow = MaxRcvBuffer - ((NextByteExpected - 1) - LastByteRead) • Sending side • LastByteSent - LastByteAcked <= AdvertisedWindow • EffectiveWindow = AdvertisedWindow - (LastByteSent - LastByteAcked) • LastByteWritten - LastByteAcked <= MaxSendBuffer • block the sending process if (LastByteWritten - LastByteAcked) + y > MaxSendBuffer, where y is the number of bytes it is trying to write • Receiver always sends an ACK in response to an arriving data segment • Sender persists (keeps probing) when AdvertisedWindow = 0
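The window arithmetic above can be checked with a few one-line functions. This is a sketch only: the function names are invented for this example, and the advertised window uses the form that counts buffered bytes as (NextByteExpected - 1) - LastByteRead.

```python
def advertised_window(max_rcv_buffer: int,
                      next_byte_expected: int,
                      last_byte_read: int) -> int:
    """Free space left in the receive buffer."""
    return max_rcv_buffer - ((next_byte_expected - 1) - last_byte_read)

def effective_window(advertised: int,
                     last_byte_sent: int,
                     last_byte_acked: int) -> int:
    """How many more bytes the sender may put in flight."""
    return advertised - (last_byte_sent - last_byte_acked)

def must_block(last_byte_written: int, last_byte_acked: int,
               y: int, max_send_buffer: int) -> bool:
    """True if writing y more bytes would overflow the send buffer."""
    return (last_byte_written - last_byte_acked) + y > max_send_buffer

# Receiver with a 4096-byte buffer that has read up to byte 500 and
# expects byte 1001 next: 500 bytes are buffered.
window = advertised_window(4096, 1001, 500)
# Sender with bytes 2001..3000 still unacknowledged.
print(window, effective_window(window, 3000, 2000))
```

When the effective window reaches zero the sender stops transmitting new data, even if the application has written more bytes into the send buffer.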
Protection Against Wrap Around • The sequence space should be at least twice as big as the window size. • 2^32 >> 2 × 2^16 • The sequence number should not wrap around within the MSL (= 120 s). • This depends on the network bandwidth. • The 32-bit sequence number space is adequate for today’s networks. • Future TCP connections might need a larger sequence number space to protect against the sequence number wrapping around.
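How quickly the sequence space wraps depends only on the bandwidth, since each byte consumes one sequence number. The sketch below works this out; the link speeds in `LINKS` are illustrative assumptions, not values taken from the slides.

```python
MSL_SECONDS = 120        # maximum segment lifetime used on the slide
SEQUENCE_SPACE = 2 ** 32  # one sequence number per byte

def wraparound_seconds(bandwidth_bps: float) -> float:
    """Time to send 2^32 bytes at the given bandwidth."""
    return SEQUENCE_SPACE * 8 / bandwidth_bps

# Illustrative link speeds in bits per second (assumed for this example).
LINKS = {"1.5 Mbps": 1.5e6, "10 Mbps": 10e6,
         "100 Mbps": 100e6, "622 Mbps": 622e6}

for name, bps in LINKS.items():
    seconds = wraparound_seconds(bps)
    verdict = "safe" if seconds > MSL_SECONDS else "wraps within the MSL"
    print(f"{name}: {seconds:,.0f} s ({verdict})")
```

At 10 Mbps the space lasts close to an hour, while at 622 Mbps it wraps in under a minute, which is shorter than the 120 s MSL.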
Keeping the Pipe Full • A 16-bit AdvertisedWindow field allows a window of only 64 KB. • To keep the pipe full, the AdvertisedWindow should be at least as large as the delay × bandwidth product. • A 64 KB window is not big enough to handle even a T3 connection across the continental US. • A TCP extension provides a mechanism for effectively increasing the size of the advertised window.
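A quick calculation shows why 64 KB falls short. The sketch below assumes a 45 Mbps T3 link and a 100 ms round-trip time for a cross-country path; both numbers are assumptions for illustration. It also assumes the window-scaling extension works as a left shift of the 16-bit field, with a maximum shift of 14.

```python
MAX_WINDOW = 2 ** 16 - 1  # largest value of the 16-bit AdvertisedWindow
MAX_SHIFT = 14            # largest shift allowed by the window scale option

def delay_bandwidth_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Bytes that must be in flight to keep the pipe full."""
    return bandwidth_bps * rtt_seconds / 8

def min_window_shift(required_bytes: float) -> int:
    """Smallest scale shift whose maximum window covers required_bytes."""
    shift = 0
    while (MAX_WINDOW << shift) < required_bytes and shift < MAX_SHIFT:
        shift += 1
    return shift

needed = delay_bandwidth_bytes(45e6, 0.100)  # assumed T3 and 100 ms RTT
print(f"need {needed:,.0f} bytes, unscaled window holds {MAX_WINDOW:,}")
print("minimum window scale shift:", min_window_shift(needed))
```

Under these assumptions the pipe holds roughly 549 KB, more than eight times what the unscaled field can advertise.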
Silly Window Syndrome • How does an aggressive sender exploit an open window? • Receiver-side solutions • after advertising a zero window, wait for space equal to a maximum segment size (MSS) before advertising an open window • delayed acknowledgments
Nagle’s Algorithm • How long does the sender delay sending data? • too long: hurts interactive applications • too short: poor network utilization • strategies: timer-based vs self-clocking • When the application generates additional data • if it fills a maximum-size segment (and the window is open): send it • else • if there is unacknowledged data in transit: buffer it until an ACK arrives • else: send it
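The decision rule above fits in one small function. This is a sketch of the rule as stated on the slide, not of any particular TCP implementation; `nagle_decision` and its parameter names are made up for this example.

```python
def nagle_decision(pending_bytes: int, mss: int,
                   window: int, unacked_in_transit: bool) -> str:
    """Return "send" or "buffer" for newly generated application data."""
    if pending_bytes >= mss and window >= mss:
        return "send"    # a full segment fits in the open window
    if unacked_in_transit:
        return "buffer"  # self-clocking: wait for the next ACK
    return "send"        # nothing in flight, so send the small segment

# A full segment goes out immediately.
print(nagle_decision(1460, 1460, 4000, unacked_in_transit=True))
# A small write waits while earlier data is unacknowledged.
print(nagle_decision(100, 1460, 4000, unacked_in_transit=True))
# The same small write is sent when the connection is idle.
print(nagle_decision(100, 1460, 4000, unacked_in_transit=False))
```

The arrival of an ACK acts as the timer: an interactive application still gets one small segment per round trip, while bulk writes are coalesced into full segments.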
Adaptive Retransmission • TCP sets the timeout as a function of the RTT between the two ends of the connection. • Choosing an appropriate timeout value is hard because RTTs differ between connections and vary over time. • TCP therefore uses an adaptive retransmission mechanism.
The Original Algorithm • Measure SampleRTT for each segment-ACK pair • Compute weighted average of RTT • EstRTT = α × EstRTT + (1 - α) × SampleRTT • α is between 0.8 and 0.9 • Set timeout based on EstRTT • TimeOut = 2 × EstRTT
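The weighted average can be sketched in a few lines. The smoothing factor 0.875 is one assumed value inside the 0.8 to 0.9 range given above, and the function names are illustrative.

```python
def update_est_rtt(est_rtt: float, sample_rtt: float,
                   alpha: float = 0.875) -> float:
    """EstRTT = alpha * EstRTT + (1 - alpha) * SampleRTT."""
    return alpha * est_rtt + (1 - alpha) * sample_rtt

def timeout(est_rtt: float) -> float:
    """TimeOut = 2 * EstRTT."""
    return 2 * est_rtt

est = 100.0                       # milliseconds, assumed starting estimate
for sample in (200.0, 200.0, 200.0):
    est = update_est_rtt(est, sample)
    print(f"EstRTT = {est:.1f} ms, TimeOut = {timeout(est):.1f} ms")
```

A large α makes the estimate stable but slow to follow a real change in RTT, which is the trade-off the later algorithms address.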
Karn/Partridge Algorithm • Do not sample RTT when retransmitting • Double timeout after each retransmission • Figure: two sender/receiver timelines, each with an original transmission, a retransmission, and one ACK; the measured SampleRTT differs depending on whether the ACK is associated with the original transmission or with the retransmission
Jacobson/Karels Algorithm • New calculations for average RTT • Diff = SampleRTT - EstRTT • EstRTT = EstRTT + (δ × Diff) • Dev = Dev + δ × (|Diff| - Dev) • where δ is a factor between 0 and 1 • Consider variance when setting timeout value • TimeOut = μ × EstRTT + φ × Dev • where μ = 1 and φ = 4 • Notes • algorithm only as good as granularity of clock (500 ms on Unix) • accurate timeout mechanism important to congestion control (later)
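One update step of these calculations can be written as follows. This is a minimal sketch: δ = 0.125 is an assumed value between 0 and 1, μ = 1 and φ = 4 come from the slide, and `jacobson_karels_update` is a name invented for this example.

```python
def jacobson_karels_update(est_rtt: float, dev: float, sample_rtt: float,
                           delta: float = 0.125,
                           mu: float = 1.0, phi: float = 4.0):
    """Return the new (EstRTT, Dev, TimeOut) after one RTT sample."""
    diff = sample_rtt - est_rtt
    est_rtt = est_rtt + delta * diff
    dev = dev + delta * (abs(diff) - dev)
    time_out = mu * est_rtt + phi * dev
    return est_rtt, dev, time_out

# Start from EstRTT = 100 ms and Dev = 10 ms, then observe a 140 ms sample.
print(jacobson_karels_update(100.0, 10.0, 140.0))
```

Because the deviation term is weighted by φ, a connection with stable RTTs gets a timeout close to EstRTT, while a connection with highly variable RTTs gets a much more conservative one.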
Record Boundaries • TCP is a byte-stream protocol and does not inject record boundaries in the byte stream. • Two different features can be used to insert record boundaries, informing the receiver how to break the byte stream into records. • using the URG flag and the UrgPtr field to signify record marker • using the PSH flag to flush the TCP buffer
TCP Extensions • TCP extensions are implemented as header options • Three extensions to TCP: • Store a timestamp in outgoing segments • Extend the sequence space with a 32-bit timestamp • Shift (scale) the advertised window
Alternative design choices • Is TCP good for request/reply applications? • More segments are sent because of the byte-stream service and the setup/teardown phases • Why not a message-stream service? • Is an explicit setup/teardown phase a must? • How about rate-based flow control?
Remote Procedure Call • The request/reply paradigm, also called message transaction, is a common pattern of communication used by application programs. • We want to design a transport protocol, RPC, that is more suitable for the request/reply message exchange.
Complete RPC mechanism • Two major RPC components: • A protocol that manages the message exchange and deals with the potentially undesirable properties of the underlying network • A stub compiler that • packages the arguments into a request message on the client side and then • translates this message back into the arguments on the server side, and likewise for the return value
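Complete RPC mechanism (figure) • The caller (client) passes arguments to the client stub, which hands a request to the RPC protocol • The RPC protocol on the server side delivers the request to the server stub, which passes the arguments to the callee (server) • The return value travels back the same way as a reply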
Three microprotocols • We develop RPC as a stack of three smaller protocols: • BLAST: fragments and reassembles large messages • CHAN: synchronizes request and reply messages • SELECT: dispatches request messages to the correct process
BLAST (the sending side) • After fragmenting the message and transmitting each of the fragments, the sender sets a timer called DONE. • Whenever a selective retransmission request (SRR) arrives, the sender retransmits the requested fragments and resets timer DONE. • Should the SRR indicate that all the fragments have arrived, the sender frees its copy of the message and cancels timer DONE. • If timer DONE ever expires, the sender gives up and frees its copy of the message.
BLAST (the receiving side) • When the first fragment arrives, the receiver sets a timer LAST_FRAG. • Should all the fragments be present, the receiver reassembles them into a complete message and passes it up to the higher-level protocol. • There are four exceptions that the receiver watches for: • If the last fragment arrives but the message is not complete, the receiver sends an SRR and sets the timer RETRY. • If timer LAST_FRAG expires, the receiver sends an SRR and sets the timer RETRY. • If timer RETRY expires for the 1st or 2nd time, the receiver resends the SRR. • If timer RETRY expires for the 3rd time, the receiver gives up.
Timeline for BLAST (figure) • Sender transmits Fragments 1 through 6 • Receiver returns an SRR • Sender retransmits Fragment 3 and Fragment 5 • Receiver returns a second SRR
BLAST message format • ProtNum: identifies the high-level protocol on top of BLAST • MID: uniquely identifies this message • Length: how many bytes of data in this fragment • NumFrags: how many fragments in this message • Type: data message or SRR • FragMask: used as a bit mask to distinguish among fragments
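The role of FragMask in a data message and in an SRR can be sketched with plain integer bit operations. The bit numbering here (bit i stands for fragment i, counting from 0) and all function names are assumptions made for this example, not details taken from the slides.

```python
def data_frag_mask(index: int) -> int:
    """Mask carried by a data fragment: exactly one bit set."""
    return 1 << index

def srr_frag_mask(received) -> int:
    """Mask carried by an SRR: one bit set per fragment that has arrived."""
    mask = 0
    for index in received:
        mask |= data_frag_mask(index)
    return mask

def missing_fragments(srr_mask: int, num_frags: int):
    """Fragments the sender should retransmit in response to an SRR."""
    return [i for i in range(num_frags) if not srr_mask & (1 << i)]

# Six fragments sent (indices 0..5); the third and fifth were lost.
srr = srr_frag_mask({0, 1, 3, 5})
print(f"SRR FragMask = {srr:06b}, retransmit {missing_fragments(srr, 6)}")
```

An SRR whose mask has all NumFrags bits set tells the sender that every fragment arrived, so it can free its copy of the message.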
CHAN (Request/Reply) • CHAN implements a logical request/reply channel. • At any given time, there can be only one message transaction active on a given channel. • To account for message loss, both sides save a copy of each message they send until an ACK for it has arrived. • Each side also sets a RETRANSMIT timer and resends the message should this timer expire. • Both sides reset this timer and retry a limited number of times before giving up and freeing the message. • MID can be used to detect duplicate messages. • To help the client distinguish between a slow server and a dead server, the client side can send a PROBE message and expect an ACK from a server that is slow but still alive.
At-most-once and zero-or-more • The most important property of each CHAN channel is that it preserves at-most-once semantics. • For every request message that the client sends, at most one copy of the message is delivered to the server. • Other RPC protocols support zero-or-more semantics. • Each invocation on a client results in the remote procedure being invoked zero or more times. • This might not cause problems if the remote procedure being invoked is idempotent. • Multiple invocations have the same effect as just one.
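One way to picture at-most-once delivery is a server that remembers the last MID seen on each channel and answers a duplicate from a cache instead of invoking the procedure again. The `ChanServer` class below is a hypothetical sketch of that idea, not the CHAN implementation; it ignores timers, ACKs, and boot IDs.

```python
class ChanServer:
    """Deliver each (CID, MID) request to the handler at most once."""

    def __init__(self, handler):
        self.handler = handler
        self.last = {}  # CID -> (MID, cached reply)

    def on_request(self, cid: int, mid: int, payload):
        cached = self.last.get(cid)
        if cached is not None and cached[0] == mid:
            return cached[1]          # duplicate: resend the saved reply
        reply = self.handler(payload)  # first copy: invoke the procedure
        self.last[cid] = (mid, reply)
        return reply

balance = {"value": 0}

def deposit(amount):
    """A non-idempotent procedure: running it twice changes the result."""
    balance["value"] += amount
    return balance["value"]

server = ChanServer(deposit)
print(server.on_request(cid=1, mid=7, payload=50))
print(server.on_request(cid=1, mid=7, payload=50))  # retransmitted request
print(balance["value"])
```

Under zero-or-more semantics the retransmitted request would run `deposit` again, which is only harmless when the procedure is idempotent.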
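Timeline for CHAN (figure) • Simple case: client sends Request, server returns ACK, server sends Reply, client returns ACK • With implicit ACKs: Request 1, Reply 1, Request 2, Reply 2, …; each reply acknowledges its request, and the next request acknowledges the previous reply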
CHAN message format • Type: REQ, REP, ACK, PROBE • CID: the logical channel to which the message belongs • MID: uniquely identifies each request/reply pair • BID: the boot id for the host • Length: how many bytes of data in this message • ProtNum: identifies the high-level protocol on top of CHAN
Timeout • CHAN involves three different timers: • RETRANSMIT timer on both sides • If it is too large, CHAN might wait an unnecessarily long time before retransmitting. • If it is too small, CHAN may load the network with unnecessary traffic. • PROBE timer on the client side • It is not critical to performance. • CHAN would calculate the RETRANSMIT timeout using a mechanism similar to the one TCP uses. • The only difference is that CHAN has to take into account the different sizes of messages.
Synchronous vs. Asynchronous Protocols • At the transport level, synchrony should be treated as a spectrum of possibilities. • At the asynchronous end of the spectrum, the application knows absolutely nothing when send returns. • At the synchronous end, the send operation typically returns a reply message. • Synchronous protocols implement the request/reply alternation. • Asynchronous protocols are used if the sender wants to be able to transmit many messages without waiting for a response. • With this definition, CHAN is a synchronous protocol.
SELECT (Dispatcher) • SELECT dispatches request messages to the appropriate procedures. • Unlike UDP, it is a synchronous protocol. • On the client side, • SELECT is given the number of the procedure that the client wants to invoke, puts this number in its header, and then invokes the call operation on a lower-level protocol like CHAN. • When this invocation returns, SELECT lets the return pass through to the client. No real demultiplexing is done. • On the server side, • SELECT uses the procedure number to invoke the right local procedure. • When this procedure returns, SELECT simply returns to the low-level protocol that just invoked it.
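The client and server roles described above can be sketched as two small classes. Everything here is hypothetical: `SelectClient`, `SelectServer`, and the `channel_call` callable stand in for SELECT and for the call operation of a lower-level protocol such as CHAN, and the "header" is just a tuple.

```python
class SelectServer:
    """Map procedure numbers to local procedures."""

    def __init__(self):
        self.procedures = {}

    def register(self, number: int, procedure) -> None:
        self.procedures[number] = procedure

    def dispatch(self, message):
        proc_num, args = message       # read the procedure number "header"
        return self.procedures[proc_num](*args)

class SelectClient:
    """Attach the procedure number, then make a synchronous call."""

    def __init__(self, channel_call):
        self.channel_call = channel_call  # stands in for CHAN's call

    def call(self, proc_num: int, *args):
        # The reply from the lower level passes straight through.
        return self.channel_call((proc_num, args))

server = SelectServer()
server.register(1, lambda a, b: a + b)
server.register(2, lambda text: text.upper())

# Wire the client directly to the server in place of a real channel.
client = SelectClient(server.dispatch)
print(client.call(1, 2, 3))
print(client.call(2, "ping"))
```

In a real stack the tuple would be a header and payload carried by the request/reply channel, and the server side would be invoked by the lower-level protocol when a request arrives.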
A simple RPC stack (figure: SELECT on top of CHAN, BLAST, IP, and ETH) • The fragmentation/reassembly burden is on BLAST. • CHAN implements the reliable delivery of request/reply messages. • SELECT defines an address space for identifying remote procedures.
SunRPC • SunRPC has become a de facto standard. • The IETF is considering officially adopting SunRPC as a standard Internet protocol. • SunRPC implements the core request/reply algorithm but does not guarantee at-most-once semantics. • The role of SELECT is split between UDP and SunRPC. • The functionality implemented in BLAST is handled by IP.
SunRPC header formats (figure, fields 32 bits wide) • Request: XID, MsgType = CALL, RPCVersion = 2, Program, Version, Procedure, Credentials (variable), Verifier (variable), Data • Reply: XID, MsgType = REPLY, Status = ACCEPTED, Data