
Transport Layer Enhancements for Unified Ethernet in Data Centers

Presentation Transcript


  1. Transport Layer Enhancements for Unified Ethernet in Data Centers
  K. Kant, Raj Ramanujan – Intel Corp.
  Exploratory work only, not a committed Intel position.

  2. Context
  • The data center is evolving → the fabric should too.
  • Last talk: enhancements to Ethernet, already on track.
  • This talk: enhancements to the transport layer; exploratory, not in any standards track.

  3. Outline
  • Data center evolution & transport impact
  • Transport deficiencies & remedies
    • Many areas of deficiency …
    • Only congestion control and QoS addressed in detail
  • Summary & call to action

  4. Data Center Today
  • Tiered structure
  • Multiple incompatible fabrics
    • Ethernet, Fibre Channel, IBA, Myrinet, etc.
  • Management complexity
  • Dedicated servers for applications → inflexible resource usage
  (Diagram: tiered data center with client request/response, business transaction, and database query tiers connected by network, IPC, and storage (SAN) fabrics.)

  5. Future DC: Stage 1 – Fabric Unification
  • Ethernet dominant, but convergence is really on IP.
  • New layer-2 options: PCI Express, optical, WLAN, UWB, …
  • Most ULPs run over a transport over IP → need to comprehend transport implications.
  (Diagram: unified fabric carrying client request/response, business transaction, database query, and iSCSI storage traffic.)

  6. Future DC: Stage 2 – Clustering & Virtualization
  • SMP → cluster (cost, flexibility, …)
  • Virtualization of nodes, network, storage, … → virtual clusters (VCs)
  • Each VC may have multiple traffic types inside.
  (Diagram: sub-clusters and storage nodes partitioned over an IP network into Virtual Clusters 1–3.)

  7. Future DC: New Usage Models
  • Dynamically provisioned virtual clusters
  • Distributed storage (per node)
  • Streaming traffic (VoIP/IPTV + data services)
  • HPC in the DC
    • Data mining for focused advertising, pricing, …
  • Special-purpose nodes
    • Protocol accelerators (XML, authentication, etc.)
  New models → new fabric requirements

  8. Fabric Impact
  • More types of traffic, more demanding needs.
  • Protocol impact at all levels:
    • Ethernet: previous presentation.
    • IP: change affects the entire infrastructure.
    • Transport: this talk.
  • Why focus on transport?
    • Change is primarily confined to endpoints.
    • Many application needs relate to the transport layer.
    • Application interface (Sockets/RDMA) mostly unchanged.
  DC evolution → transport evolution

  9. Transport Issues & Enhancements
  • Transport (TCP) enhancement areas:
    • Better congestion control and QoS
    • Support for media evolution
    • Support for high availability
    • Many others:
      • Message-based & unordered data delivery
      • Connection migration in virtual clusters
      • Transport-layer multicasting
  • How do we enhance transport?
    • A new TCP-compatible protocol?
    • Use an existing protocol (SCTP)?
    • Evolutionary changes to TCP from a DC perspective

  10. What’s Wrong with TCP Congestion Control?
  • TCP congestion control (CC) works independently for each connection → by default TCP equalizes throughput across connections, which is undesirable.
  • Sophisticated QoS can change this, but …
  • Lower-level CC → backpressure on the transport.
  • Transport-layer congestion control is crucial.
  (Diagram: App/transport/IP/MAC stacks at the endpoints, connected through switches and a router, with transport-layer congestion control driven by ECN/ICMP congestion feedback.)

  11. What’s Wrong with QoS?
  • Elaborate mechanisms exist: IntServ (RSVP), DiffServ, bandwidth brokers, …
  • … but they are a nightmare to use: app knowledge, many parameters, sensitivity, …
  • What do we need?
    • Simple, intuitive parameters (e.g., streaming or not, normal vs. premium; see the sketch below)
    • Automatic estimation of bandwidth needs
    • Application focus, not flow focus!
  • QoS is relevant primarily under congestion → fix TCP congestion control, use IP QoS sparingly.
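To make "simple/intuitive parameters" concrete, here is a minimal sketch of what an intent-level QoS descriptor could look like at the application boundary; the class, its fields, and the DSCP mapping are illustrative assumptions, not anything proposed in the deck.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AppQosIntent:
    """Hypothetical application-level QoS intent with only intuitive knobs."""
    streaming: bool = False             # latency/jitter sensitive (VoIP/IPTV) or not
    premium: bool = False               # normal vs. premium service class
    est_bw_bps: Optional[float] = None  # None => let the transport estimate it

    def to_dscp(self) -> int:
        """Map the intent to a DiffServ code point, to be used sparingly."""
        if self.streaming:
            return 46                   # EF (expedited forwarding)
        return 26 if self.premium else 0    # AF31 vs. best effort

# Example: a premium request/response app that lets the transport estimate bandwidth.
print(AppQosIntent(premium=True).to_dscp())   # 26
```

The point of the sketch is that the application states intent (streaming, premium) and the transport derives bandwidth needs itself, instead of exposing IntServ/DiffServ knobs directly.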

  12. TCP Congestion Control Enhancements
  • Collective control of all flows of an app
    • Applicable to both TCP & UDP
    • Ensures proportional fairness among multiple inter-related flows
    • Tagging of connections to identify related flows (see the sketch below)
  • Packet loss is highly undesirable in a DC
    • Move towards a delay-based TCP variant
  • Multilevel coordination
    • Socket vs. RDMA apps, TCP vs. UDP, …
    • A layer above the transport for coordination
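The deck does not specify a tagging mechanism; as a rough sketch of how related flows could be tagged and nudged toward a delay-based variant on a present-day Linux endpoint (SO_MARK, TCP_CONGESTION, and the availability of the "vegas" module are assumptions about the host, not part of the proposal):

```python
import socket

# Linux socket-option numbers used as fallbacks if the Python build lacks the names.
SO_MARK = getattr(socket, "SO_MARK", 36)                 # setting it needs CAP_NET_ADMIN
TCP_CONGESTION = getattr(socket, "TCP_CONGESTION", 13)
APP_TAG = 0x11   # hypothetical identifier shared by all flows of one application

def open_tagged_flow(host, port):
    """Open a TCP flow that carries an application tag (so related flows can be
    grouped for collective control) and asks for a delay-based CC variant."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.setsockopt(socket.SOL_SOCKET, SO_MARK, APP_TAG)           # tag the flow
        s.setsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, b"vegas")  # delay-based CC
    except OSError:
        pass   # unprivileged, or the module is unavailable: keep the system defaults
    s.connect((host, port))
    return s
```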

  13. Collective Congestion Control
  • Control connections through a congested device together (a "control set").
  • Determining the control set is challenging.
  • Bandwidth requirement is estimated automatically during non-congested periods.
  (Diagram: clients CL1 and CL2 reaching servers S11, S13, S21, S23 through switches SW0–SW2.)
  A sketch of such an allocation follows below.
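The allocation policy within a control set is left open in the deck; the following is a minimal sketch, assuming a single known bottleneck capacity and per-flow demand estimates gathered during uncongested periods, of a demand-proportional split with caps (water-filling):

```python
def collective_shares(capacity, demands, eps=1e-9):
    """Split a congested link's capacity across the flows of a control set in
    proportion to their estimated demands, never giving a flow more than it
    asked for; surplus from satisfied flows is redistributed (water-filling).

    demands: {flow_id: estimated_bw}  ->  returns {flow_id: allocated_bw}
    """
    alloc = dict.fromkeys(demands, 0.0)
    unsat = [f for f in demands if demands[f] > 0]
    left = capacity
    while unsat and left > eps:
        weight = sum(demands[f] for f in unsat)
        progressed = False
        for f in list(unsat):
            share = left * demands[f] / weight
            need = demands[f] - alloc[f]
            if share + eps >= need:
                alloc[f] = demands[f]     # fully satisfied; surplus returns to the pool
                unsat.remove(f)
                progressed = True
            else:
                alloc[f] += share
        left = capacity - sum(alloc.values())
        if not progressed:
            break                         # capacity fully used, split proportionally
    return alloc
```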

  14. Sample Collective Control
  • App 1: client1 → server1
    • Database queries over a single connection → drives ~5.0 Mb/s
  • App 2: client2 → server1
    • Similar to App 1 → drives 2.5 Mb/s
  • App 3: client3 → server2
    • FTP, starts at t = 30 s → 25 connections → 8 Mb/s
  (A worked allocation with these numbers follows below.)
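Feeding these demands into the collective_shares() sketch above, with an assumed 10 Mb/s bottleneck shared by all three apps (the deck does not state the link capacity), illustrates how the 2:1 ratio between App 1 and App 2 is preserved under congestion:

```python
# Aggregate demands per app (Mb/s); the 10 Mb/s bottleneck is an assumed figure.
demands = {"app1_db": 5.0, "app2_db": 2.5, "app3_ftp": 8.0}
print(collective_shares(capacity=10.0, demands=demands))
# Demand-proportional split: app1 ≈ 3.23, app2 ≈ 1.61, app3 ≈ 5.16 Mb/s.
# App 1 still gets twice App 2's share, unlike default per-connection TCP,
# which would favor App 3's 25 parallel connections.
```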

  15. Sample Results
  • The modified TCP can maintain the 2:1 throughput ratio (App 1 vs. App 2).
  • It also yields lower losses & smaller RTTs.
  Collective control is highly desirable within a DC.

  16. Adaptation to Media
  • Problem: TCP assumes loss → congestion, and it was designed for the WAN (high loss/delay).
  • Effects:
    • Wireless (e.g., UWB) is attractive in the DC (wiring reduction, mobility, self-configuration) …
    • … but TCP is not a suitable transport for it.
    • TCP is also overkill for communication within a DC.
  • Solution: a self-adjusting transport
    • Supports multiple congestion/flow-control regimes.
    • The regime is selected automatically during connection setup (sketched below).
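The deck does not say how the regime would be chosen; one illustrative setup-time policy might look like the following, where the regime names, the RTT threshold, and the link-type hints are all assumptions:

```python
from enum import Enum

class CcRegime(Enum):
    LOSS_BASED = "loss-based"        # classic WAN-style TCP behavior
    DELAY_BASED = "delay-based"      # low-loss intra-DC paths
    LOSS_TOLERANT = "loss-tolerant"  # wireless: treat loss as corruption, not congestion

def pick_regime(link_type: str, rtt_ms: float) -> CcRegime:
    """Illustrative connection-setup policy: choose a congestion/flow-control
    regime from coarse path hints instead of assuming every loss is congestion."""
    if link_type in ("uwb", "wlan"):
        return CcRegime.LOSS_TOLERANT   # losses are likely corruption
    if rtt_ms < 1.0:
        return CcRegime.DELAY_BASED     # intra-DC: react to queueing delay, not loss
    return CcRegime.LOSS_BASED          # WAN-facing traffic keeps classic behavior

print(pick_regime("ethernet", rtt_ms=0.2))   # CcRegime.DELAY_BASED
```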

  17. High Availability Issues
  • Problem: a single failure → broken connection; weak robustness checks, …
  • Effect: difficult to achieve high availability.
  • Solution:
    • Multi-homed connections with load sharing among the paths (sketched below).
    • Ideally, controlled diversity & path management.
    • Difficult: needs topology awareness, spanning-tree problem, …
  (Diagram: endpoints A and B connected by two disjoint paths, Path 1 and Path 2.)
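SCTP's multihoming (compared in the backup slides) is one existing building block; purely to illustrate the load-sharing idea, here is a toy weighted scheduler with failover across two paths, where the path names and weights are hypothetical:

```python
import itertools

class MultihomedSender:
    """Toy model of load sharing on a multi-homed connection: spread messages
    over all healthy paths by weight and fail over when a path dies."""
    def __init__(self, paths):
        self.weights = dict(paths)        # e.g. {"path1": 2, "path2": 1}
        self.healthy = set(paths)

    def _schedule(self):
        # Weighted round-robin order over the currently healthy paths.
        order = [p for p in self.healthy for _ in range(self.weights[p])]
        return itertools.cycle(order)

    def send_all(self, messages):
        sched = self._schedule()
        return [(msg, next(sched)) for msg in messages]

    def path_failed(self, path):
        self.healthy.discard(path)        # remaining paths absorb the load

sender = MultihomedSender({"path1": 2, "path2": 1})   # hypothetical 2:1 split
print(sender.send_all(["m1", "m2", "m3"]))
sender.path_failed("path1")
print(sender.send_all(["m4", "m5"]))                  # all traffic now on path2
```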

  18. Summary & Call to Action
  • Data centers are evolving.
  • Transport must evolve too, but it is a difficult proposition:
    • TCP is heavily entrenched; change needs an industry-wide effort.
  • Call to action:
    • Get an industry effort going to define new features & their implementation, plus deployment & compatibility issues.
    • Change will need a push from data center administrators & planners.

  19. Additional Resources
  • The presentation can be downloaded from the IDF web site – when prompted, enter:
    • Username: idf
    • Password: fall2005
  • Additional backup slides follow.
  • Several relevant papers are available at http://kkant.ccwebhost.com/download.html:
    • Analysis of collective bandwidth control
    • SCTP performance in data centers

  20. Backup

  21. Comparative Fabric Features
  (Table comparing DC requirements against fabric features not reproduced in the transcript.)
  TCP lacks many desirable features; SCTP has some.

  22. Transport Layer QoS
  • Needed at multiple levels:
    • Between transport uses (inter-app)
    • Between connections of a given transport (intra-app)
    • Between logical streams of a connection (intra-connection)
  • The apps may be on two VMs on the same physical machine.
  • What is the best bandwidth subdivision to maximize performance?
  • Requirements:
    • Must be compatible with lower-level QoS (PCI Express, MAC, etc.)
    • Automatic estimation of bandwidth requirements
    • Automatic bandwidth control
  (Diagram: a web app and a DB app sharing network, IPC, paging, and iSCSI traffic, with a connection subdivided into text, images, control, and data streams.)

  23. Multicasting in DC
  • Software/patch distribution: multicast to all machines with the same version.
    • Characteristics: medium-to-large file transfers; time to finish matters, bandwidth doesn’t; scale: 10s to 1000s of nodes.
  • High-performance computing: MPI collectives need multicasting.
    • Characteristics: small but frequent transfers; latency at a premium, bandwidth mostly not an issue; scale: 10s to 100s of nodes.

  24. IP Multicasting vs. Transport-Layer Multicasting
  (Diagram: two panels, each showing a sender A, two subnets, and an outer router; one panel illustrates IP multicasting, the other transport-layer (TL) multicasting.)

  25. TL Multicasting Value
  • Assumptions:
    • A 16-node cluster with 4-node sub-clusters.
    • Multicast group: 2 nodes in each sub-cluster.
    • Latencies: endpoint 2 µs, ack processing 1 µs, switch 1 µs; App–TL interface 5 µs.
  • Latency without multicast:
    • send: 7×2 + 3×1 + 2 = 19 µs
    • ack: 1 + 3×1 + 7×1 = 11 µs
    • reply: 5 + 2 + 7×2 = 21 µs
    • Total: 19 + 11 + 21 = 51 µs
  • Latency with multicast:
    • send: 3×2 + 3×1 + 2 + 2×(1+1) + 2 = 17 µs
    • ack: 1 + 1 + 2×1 + 3×1 + 3×1 = 10 µs
    • Total: 17 + 10 + 5 = 32 µs
  • Larger savings with full network multicast.
  (Diagram: nodes A, B, C, D in subnets 1–4 behind an outer router. The arithmetic is reproduced in the sketch below.)
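For checkability, the slide's arithmetic can be written out as code; the per-hop constants and expressions are taken directly from the slide, and only the variable names and the parenthetical interpretations are added here:

```python
# Reproduce the slide's latency arithmetic (all times in microseconds).
ENDPT, ACK_PROC, SWITCH, APP_TL = 2, 1, 1, 5   # per-hop costs from the slide

# Without TL multicast (the sender presumably unicasts to the 7 other group members):
send_u  = 7 * ENDPT + 3 * SWITCH + 2            # 19 us
ack_u   = 1 + 3 * SWITCH + 7 * ACK_PROC         # 11 us
reply_u = APP_TL + 2 + 7 * ENDPT                # 21 us
assert send_u + ack_u + reply_u == 51

# With TL multicast (subnet leaders fan the message out locally, see slide 26):
send_m = 3 * ENDPT + 3 * SWITCH + 2 + 2 * (1 + 1) + 2   # 17 us
ack_m  = 1 + 1 + 2 * 1 + 3 * 1 + 3 * 1                  # 10 us
assert send_m + ack_m + APP_TL == 32
```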

  26. Hierarchical Connections
  • Choose a “leader” in each subnet (topology directed).
  • Multicast connections reach the other nodes via the leaders →
    • Ack consolidation at the leaders (multicast)
    • Msg consolidation at the leaders (reverse multicast)
  • Done by a layer above the transport? (layer 4.5?)
  (Diagram: sender A and nodes n1, n2 in subnets 1–4 with switches S2–S4 behind an outer router. A small sketch of building such a leader-rooted fan-out follows.)
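To make the leader idea concrete, here is a small sketch that groups multicast group members by subnet, picks one leader per subnet, and emits the two-level fan-out the slide describes; the addresses, the /24 prefix, and the lowest-address leader rule are illustrative assumptions:

```python
from collections import defaultdict
import ipaddress

def build_fanout(members, prefixlen=24):
    """Group multicast members by subnet, pick a 'leader' per subnet
    (lowest address, an arbitrary illustrative rule), and return
    {leader: [local members it forwards to]} for a two-level fan-out."""
    by_subnet = defaultdict(list)
    for addr in members:
        net = ipaddress.ip_network(f"{addr}/{prefixlen}", strict=False)
        by_subnet[net].append(addr)
    tree = {}
    for net, nodes in by_subnet.items():
        leader = min(nodes, key=ipaddress.ip_address)
        tree[leader] = [n for n in nodes if n != leader]
    return tree

members = ["10.0.1.2", "10.0.1.7", "10.0.2.3", "10.0.2.9",
           "10.0.3.4", "10.0.3.8", "10.0.4.5", "10.0.4.6"]   # 2 per sub-cluster
print(build_fanout(members))
# {'10.0.1.2': ['10.0.1.7'], '10.0.2.3': ['10.0.2.9'],
#  '10.0.3.4': ['10.0.3.8'], '10.0.4.5': ['10.0.4.6']}
```

The sender then unicasts once per leader, and each leader handles local delivery plus ack/msg consolidation for its subnet, which is what drives the latency savings computed on slide 25.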
