
vSnoop: Improving TCP Throughput in Virtualized Environments via Acknowledgement Offload


Presentation Transcript


  1. vSnoop: Improving TCP Throughput in Virtualized Environments via Acknowledgement Offload. Ardalan Kangarlou, Sahan Gamage, Ramana Kompella, Dongyan Xu. Department of Computer Science, Purdue University

  2. Cloud Computing and HPC

  3. Background and Motivation • Virtualization: A key enabler of cloud computing • Amazon EC2, Eucalyptus • Increasingly adopted in other real systems: • High performance computing • NERSC’s Magellan system • Grid/cyberinfrastructure computing • In-VIGO, Nimbus, Virtuoso

  4. VM Consolidation: A Common Practice • Multiple VMs hosted by one physical host • Multiple VMs sharing the same core • Flexibility, scalability, and economy (Diagram: VMs 1–4 running on the virtualization layer atop shared hardware.) Key observation: VM consolidation negatively impacts network performance!

  5. Investigating the Problem (Setup diagram: a client/sender communicating with VM 1 on a server that consolidates VMs 1–3 above the virtualization layer and hardware.)

  6. Q1: How does CPU Sharing affect RTT? (Plot: RTT in ms, roughly 40–180 ms, vs. number of VMs per core, 2 to 5, for US East–West, US East–Europe, and US West–Australia paths.) RTT increases in proportion to the VM scheduling slice (30 ms).

  7. Q2: What is the Cause of the RTT Increase? VM scheduling latency dominates virtualization overhead! (Figure: packets from the sender queue in per-VM buffers in the driver domain (dom0); a CDF compares dom0 processing time with the packet's wait time in the buffer. The buffer wait clusters around the 30 ms scheduling slice and accounts for most of the RTT increase.)
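
As a rough illustration of why the scheduling slice dominates: a packet arriving just after its destination VM has been descheduled may sit in the buffer while each of the other VMs sharing the core consumes up to its 30 ms slice, i.e. up to (N − 1) × 30 ms for N VMs per core. With 3 VMs that is on the order of 60 ms of buffering delay, far larger than dom0's packet-processing time, which matches the RTT growth with the number of VMs seen on the previous slide.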

  8. Q3: What is the Impact on TCP Throughput? (Plot: TCP throughput of a connection terminating in dom0 vs. one terminating in the VM.) The connection to the VM is much slower than the one to dom0!

  9. Our Solution: vSnoop • Alleviates the negative effect of VM scheduling on TCP throughput • Implemented within the driver domain to accelerate TCP connections • Does not require any modifications to the VM • Does not violate end-to-end TCP semantics • Applicable across a wide range of VMMs • Xen, VMware, KVM, etc.

  10. TCP Connection to a VM (Timeline: the sender establishes a TCP connection to VM1, which shares a core with VM2 and VM3. The SYN reaches the driver domain and sits in VM1's buffer until VM1 is scheduled; only then does VM1 return the SYN,ACK. Every round trip is therefore inflated by the VM scheduling latency.)

  11. Key Idea: Acknowledgement Offload (Timeline with vSnoop: packets from the sender are placed in VM1's shared buffer in the driver domain and acknowledged on VM1's behalf right away, instead of after VM1 is next scheduled.) The connection makes faster progress during TCP slow start.
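
The sketch below is a minimal user-space model of this idea, assuming invented stand-ins (vm_ring, flow_state, send_ack_from_dom0) rather than the real Xen/netback structures: an in-order segment that finds room in the VM's shared buffer is acknowledged by dom0 immediately; everything else is left for the VM's own TCP stack.

    /*
     * Minimal user-space sketch of acknowledgement offload.  The types and
     * helpers (vm_ring, flow_state, send_ack_from_dom0) are invented for
     * illustration; they are not the real Xen/netback structures.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct vm_ring {                 /* buffer shared between dom0 and the VM */
        unsigned used, capacity;
    };

    struct flow_state {              /* per-TCP-flow state kept in dom0 */
        uint32_t next_seq;           /* next in-order sequence number expected */
    };

    static bool ring_has_room(const struct vm_ring *r) { return r->used < r->capacity; }
    static void enqueue_for_vm(struct vm_ring *r)      { r->used++; }

    static void send_ack_from_dom0(uint32_t ack_seq)
    {
        /* In the real system this would craft a TCP ACK on the VM's behalf and
         * send it right away, so the sender keeps transmitting without waiting
         * for the VM to be scheduled. */
        printf("dom0 acknowledges up to sequence %u for the VM\n", ack_seq);
    }

    /* Invoked for each data segment headed to the (possibly descheduled) VM. */
    static void vsnoop_recv(struct flow_state *f, struct vm_ring *ring,
                            uint32_t seq, uint32_t len)
    {
        if (seq == f->next_seq && ring_has_room(ring)) {
            enqueue_for_vm(ring);            /* VM picks it up when scheduled */
            f->next_seq = seq + len;
            send_ack_from_dom0(f->next_seq); /* early acknowledgement */
        } else {
            /* Out-of-order segment or no buffer space: do not acknowledge here.
             * The packet takes the normal path and the VM's own TCP stack
             * (or the sender's retransmission) handles it. */
        }
    }

    int main(void)
    {
        struct vm_ring ring = { .used = 0, .capacity = 4 };
        struct flow_state flow = { .next_seq = 1 };

        vsnoop_recv(&flow, &ring, 1, 1460);    /* in-order: dom0 ACKs early    */
        vsnoop_recv(&flow, &ring, 4381, 1460); /* out-of-order: left to the VM */
        return 0;
    }

Acknowledging only packets that are already safely buffered for the VM is what keeps this within end-to-end TCP semantics: every byte dom0 acknowledges will be delivered to the VM once it is scheduled.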

  12. vSnoop’s Impact on TCP Flows • TCP slow start • Early acknowledgements help connections progress faster • The most significant benefit is for short transfers, which are prevalent in data centers [Kandula IMC’09], [Benson WREN’09] • TCP congestion avoidance and fast retransmit • Large flows in the steady state also benefit from vSnoop • The benefit is smaller than during slow start

  13. Challenges • Challenge 1: Out-of-order and special packets (SYN, FIN) • Solution: Let the VM handle these packets • Challenge 2: Packet loss after vSnoop has acknowledged • Solution: vSnoop acknowledges a packet only if there is room for it in the buffer • Challenge 3: ACKs generated by the VM itself • Solution: Suppress or rewrite the VM's ACKs for packets that vSnoop has already acknowledged • Challenge 4: Keeping vSnoop online (buffer space available) • Solution: Throttle the receive window advertised to the sender according to the buffer size (see the sketch below)
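
Below is an illustrative sketch (not vSnoop's actual code; the struct and function names are invented, and sequence wraparound and window scaling are ignored) of how Challenges 3 and 4 could be handled for ACKs travelling from the VM back to the sender: ACKs that only repeat what dom0 has already acknowledged are suppressed, and the advertised receive window is clamped to the free space in the shared buffer so that vSnoop keeps having room to acknowledge.

    #include <stdbool.h>
    #include <stdint.h>

    struct ack_filter_state {
        uint32_t dom0_acked;   /* highest sequence number vSnoop already ACKed */
        uint32_t buffer_room;  /* free bytes left in the VM's shared buffer    */
    };

    struct tcp_ack {
        uint32_t ack_seq;      /* cumulative ACK number generated by the VM    */
        uint16_t window;       /* receive window the VM wants to advertise     */
    };

    /* Returns false if the VM's ACK should be suppressed (it only repeats what
     * dom0 already acknowledged); otherwise rewrites it in place and forwards. */
    bool vsnoop_filter_vm_ack(const struct ack_filter_state *f, struct tcp_ack *ack)
    {
        /* Challenge 4: never let the VM advertise more than the shared buffer
         * can hold, so vSnoop keeps having room to acknowledge (stays online). */
        if (ack->window > f->buffer_room)
            ack->window = (uint16_t)f->buffer_room;

        /* Challenge 3: an ACK covering no new data beyond what dom0 already
         * acknowledged would look like a duplicate to the sender, so drop it. */
        if (ack->ack_seq <= f->dom0_acked)
            return false;

        return true;           /* forward the possibly rewritten ACK */
    }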

  14. State Machine Maintained Per Flow (State diagram.) A flow starts in the Active (online) state, where each in-order packet that finds buffer space is acknowledged early. An in-order packet that finds no buffer space moves the flow to the No-buffer (offline) state, and an out-of-order packet moves it to the Unexpected-sequence state; in both offline states vSnoop does not acknowledge and simply passes packets, including out-of-order ones, on to the VM. An in-order packet that again finds buffer space returns the flow to Active.
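
A compact model of that state machine, using the states and events named on the slide (an illustrative reconstruction, not the actual implementation):

    #include <stdbool.h>

    enum vsnoop_state {
        ACTIVE,           /* online: early-acknowledge in-order packets        */
        NO_BUFFER,        /* offline: shared buffer is full, don't acknowledge */
        UNEXPECTED_SEQ    /* offline: saw out-of-order data, don't acknowledge */
    };

    /* Returns true if vSnoop should acknowledge this packet early. */
    static bool vsnoop_transition(enum vsnoop_state *s,
                                  bool in_order, bool buffer_available)
    {
        if (in_order && buffer_available) {
            *s = ACTIVE;          /* enter or stay in the online state       */
            return true;          /* early acknowledgement                   */
        }
        if (!in_order)
            *s = UNEXPECTED_SEQ;  /* pass the out-of-order packet to the VM  */
        else
            *s = NO_BUFFER;       /* in-order but no room: go offline        */
        return false;             /* don't acknowledge; leave it to the VM   */
    }

    int main(void)
    {
        enum vsnoop_state s = ACTIVE;
        vsnoop_transition(&s, true, true);   /* stays ACTIVE, early ACK          */
        vsnoop_transition(&s, true, false);  /* -> NO_BUFFER, no acknowledgement */
        vsnoop_transition(&s, true, true);   /* back to ACTIVE once room returns */
        return 0;
    }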

  15. vSnoop Implementation in Xen (Architecture diagram: each VM's netfront talks to a netback instance in the driver domain (dom0); traffic flows from the bridge through vSnoop to netback and into the per-VM buffer. Netfront tuning is also applied in the VMs.)
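
The following toy model shows where vSnoop sits on the dom0 receive path (bridge, then vSnoop, then netback, then the per-VM buffer); the function names are invented for illustration, since the real code hooks into netback inside the driver domain.

    #include <stdbool.h>
    #include <stdio.h>

    struct packet { int dst_vm; bool is_tcp_data; };

    static void vsnoop_inspect(const struct packet *p)
    {
        /* For TCP data headed to a VM, run the per-flow state machine and,
         * when the flow is online, emit an early ACK from dom0 (see the
         * earlier sketches).  Other traffic passes through untouched. */
        if (p->is_tcp_data)
            printf("vSnoop considers early-acknowledging for VM%d\n", p->dst_vm);
    }

    static void netback_deliver(const struct packet *p)
    {
        /* Places the packet in VM p->dst_vm's buffer; the VM's netfront
         * picks it up the next time the VM is scheduled. */
        printf("queued packet for VM%d\n", p->dst_vm);
    }

    static void bridge_rx(const struct packet *p)
    {
        vsnoop_inspect(p);      /* vSnoop sits between the bridge and netback */
        netback_deliver(p);
    }

    int main(void)
    {
        struct packet p = { .dst_vm = 1, .is_tcp_data = true };
        bridge_rx(&p);
        return 0;
    }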

  16. Evaluation • Overheads of vSnoop • TCP throughput speedup • Application speedup • Multi-tier web service (RUBiS) • MPI benchmarks (Intel, High-Performance Linpack)

  17. Evaluation – Setup • VM hosts • 3.06GHz Intel Xeon CPUs, 4GB RAM • Only one core/CPU enabled • Xen 3.3 with Linux 2.6.18 for the driver domain (dom0) and the guest VMs • Client machine • 2.4GHz Intel Core 2 Quad CPU, 2GB RAM • Linux 2.6.19 • Gigabit Ethernet switch

  18. vSnoop Overhead • Profiling per-packet vSnoop overhead using Xenoprof [Menon VEE’05] • (Table: per-packet CPU overhead of the vSnoop routines in dom0.) The aggregate CPU overhead is minimal.

  19. TCP Throughput Improvement • 3 consolidated VMs, 1000 transfers of a 100 KB file • Configurations: vanilla Xen, Xen+tuning, Xen+tuning+vSnoop (CDF of per-transfer throughput: the median is 0.192 MB/s for vanilla Xen, 0.778 MB/s for Xen+tuning, and 6.003 MB/s for Xen+tuning+vSnoop, a roughly 30x improvement in the median.)
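
The slides do not name the measurement tool, but a client-side measurement of this kind can be sketched as follows: time a single 100 KB TCP transfer to the VM and report the throughput (the address, port, and the completion signal, i.e. the receiver closing the connection, are assumptions here).

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *ip = argc > 1 ? argv[1] : "192.168.1.10"; /* VM address (assumed) */
        const size_t total = 100 * 1024;                      /* 100 KB per transfer  */
        char buf[4096];
        memset(buf, 'x', sizeof(buf));

        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(5001) };
        inet_pton(AF_INET, ip, &addr.sin_addr);
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            return 1;
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t sent = 0; sent < total; ) {
            size_t chunk = total - sent < sizeof(buf) ? total - sent : sizeof(buf);
            ssize_t n = write(fd, buf, chunk);
            if (n <= 0) { perror("write"); return 1; }
            sent += (size_t)n;
        }
        shutdown(fd, SHUT_WR);                  /* no more data to send           */
        while (read(fd, buf, sizeof(buf)) > 0)  /* wait until the receiver closes */
            ;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("throughput: %.3f MB/s\n", (double)total / (1024.0 * 1024.0) / secs);
        close(fd);
        return 0;
    }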

  20. TCP Throughput: 1 VM/Core (Bar chart: normalized TCP throughput for Xen, Xen+tuning, and Xen+tuning+vSnoop across transfer sizes from 50 KB to 100 MB.)

  21. TCP Throughput: 2 VMs/Core (Bar chart: normalized throughput for the same three configurations and transfer sizes, with 2 VMs sharing a core.)

  22. TCP Throughput: 3 VMs/Core (Bar chart: normalized throughput for the same configurations and transfer sizes, with 3 VMs sharing a core.)

  23. TCP Throughput: 5 VMs/Core (Bar chart: normalized throughput for the same configurations and transfer sizes, with 5 VMs sharing a core.) vSnoop's benefit rises with higher VM consolidation.

  24. TCP Throughput: Other Setup Parameters • CPU load for VMs • Number of TCP connections to the VM • Driver domain on a separate core • Sender being a VM • Across these settings, vSnoop consistently achieves significant TCP throughput improvement

  25. Application-Level Performance: RUBiS (Setup diagram: RUBiS client threads on a client machine drive an Apache VM on Server1 and a MySQL VM on Server2; each server hosts two VMs (dom1, dom2) and runs vSnoop in dom0.)

  26. RUBiS Results (Chart: RUBiS performance with and without vSnoop; about a 30% improvement, as summarized in the conclusions.)

  27. Application-level Performance – MPI Benchmarks • Intel MPI Benchmark: network intensive • High-Performance Linpack: CPU intensive (Setup diagram: MPI nodes run inside VMs, two per server (dom1, dom2), across four servers, each with vSnoop in dom0.)

  28. Intel MPI Benchmark Results: Broadcast (Bar chart: normalized execution time for Xen, Xen+tuning, and Xen+tuning+vSnoop across message sizes from 64 KB to 8 MB; up to a 40% improvement.)

  29. Intel MPI Benchmark Results: All-to-All (Bar chart: normalized execution time for the same configurations across message sizes from 64 KB to 8 MB.)

  30. HPL Benchmark Results (Bar chart: Gflops for Xen vs. Xen+tuning+vSnoop across problem-size/block-size combinations (N, NB) from (4K, 2) to (8K, 16); a 40% gain is marked on the chart.)

  31. Related Work • Optimizing virtualized I/O path • Menon et al. [USENIX ATC’06,’08; ASPLOS’09] • Improving intra-host VM communications • XenSocket [Middleware’07], XenLoop [HPDC’08], Fido [USENIX ATC’09], XWAY [VEE’08], IVC [SC’07] • I/O-aware VM scheduling • Govindan et al. [VEE’07], DVT [SoCC’10]

  32. Conclusions • Problem: VM consolidation degrades TCP throughput • Solution: vSnoop • Leverages acknowledgment offloading • Does not violate end-to-end TCP semantics • Is transparent to applications and OS in VMs • Is generically applicable to many VMMs • Results: • 30x improvement in median TCP throughput • About 30% improvement in RUBiS benchmark • 40-50% reduction in execution time for Intel MPI benchmark

  33. Thank you. For more information: http://friends.cs.purdue.edu/dokuwiki/doku.php?id=vsnoop Or Google “vSnoop Purdue”

  34. TCP Benchmarks cont. • Testing different scenarios: • a) 10 concurrent connections • b) Sender also subject to VM scheduling • c) Driver domain on a separate core (Three charts, one per scenario.)

  35. TCP Benchmarks cont. • Varying CPU load for 3 consolidated VMs (Charts: results at 40%, 60%, and 80% CPU load.)
