Realizing the Performance Potential of the Virtual Interface Architecture

Evan Speight, Hazim Abdel-Shafi, and John K. Bennett
Rice University, Dept. of Electrical and Computer Engineering
Presented by Constantin Serban, R.U.

VIA Goals
• Communication infrastructure for System Area Networks (SANs)
• Targets mainly high-speed cluster applications
• Efficiently harnesses the communication performance of the underlying networks

Trends
• Peak bandwidth has increased by two orders of magnitude over the past decade, while user latency has decreased only modestly.
• The latency introduced by the protocol is typically several times the latency of the transport layer.
• The problem is especially acute for small messages.
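The small-message problem can be made concrete with a simple cost model: transfer time is a fixed per-message latency plus message size divided by peak bandwidth. The sketch below is illustrative only. The 10 µs latency and 125 MB/s peak are assumed values, not measurements from the paper, and `effective_bandwidth` is a hypothetical helper.

```python
def effective_bandwidth(size_bytes, latency_s, peak_bw):
    """Bytes per second achieved for one message.

    Model: time = fixed per-message latency + size / peak bandwidth.
    """
    transfer_time = latency_s + size_bytes / peak_bw
    return size_bytes / transfer_time


# Assumed, illustrative parameters (not taken from the paper).
LATENCY_S = 10e-6   # per-message latency in seconds
PEAK_BW = 125e6     # peak link bandwidth in bytes per second

for size in (64, 1024, 65536, 1_000_000):
    fraction = effective_bandwidth(size, LATENCY_S, PEAK_BW) / PEAK_BW
    print(f"{size:>9} bytes: {fraction:6.1%} of peak")
```

Under this model a small message is dominated by the fixed latency term, so it reaches only a small fraction of peak bandwidth, while a large message approaches the peak.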

Targets
The VI architecture addresses the following issues:
• Decrease latency, especially for small messages (used in synchronization)
• Increase the aggregate bandwidth (only a fraction of the peak bandwidth is utilized)
• Reduce the CPU processing caused by message overhead

Overhead
Overhead mainly comes from two sources:
• Every network access requires one or two traps into the kernel
  • The user/kernel mode switch is time-consuming
• Usually two data copies occur:
  • From the user buffer to the message-passing API
  • From the message layer to the kernel buffer

VIA approach
• Remove the kernel from the critical path
  • Move communication code out of the kernel into user space
• Provide a zero-copy protocol
  • Data is sent from and received into the user buffer directly; no message copy is performed
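The difference between a copying path and a zero-copy path can be sketched in user space. This is only an analogy, not VIA code: `send_with_copies` and `send_zero_copy` are hypothetical names, the staged `bytearray` objects stand in for the message-layer and kernel buffers, and a `memoryview` stands in for the NIC accessing the user buffer directly.

```python
def send_with_copies(user_buffer):
    """Traditional path: user buffer -> message layer -> kernel buffer."""
    message_layer = bytearray(user_buffer)      # copy 1
    kernel_buffer = bytearray(message_layer)    # copy 2
    return kernel_buffer, 2 * len(user_buffer)  # staged data, bytes copied


def send_zero_copy(user_buffer):
    """Zero-copy path: hand out a view of the user buffer itself."""
    return memoryview(user_buffer), 0           # shared view, bytes copied


user_buffer = bytearray(b"hello")
staged, copied = send_with_copies(user_buffer)
view, zero_copied = send_zero_copy(user_buffer)

# Later changes to the user buffer are visible through the view but not in
# the staged copy, which shows that the view shares the original memory.
user_buffer[0] = ord("J")
print(bytes(staged), copied)
print(bytes(view), zero_copied)
```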

VIA emerged as a standardization effort from Compaq, Intel, and Microsoft.
It was built on several academic ideas:
• The main architecture is most similar to U-Net
• Essential features are derived from VMMC
Among current implementations:
• GigaNet cLan: VIA implemented in hardware
• Tandem ServerNet: VIA emulated in a software driver
• Myricom Myrinet: VIA emulated in software running in firmware

VIA architecture

VIA operations
• Set-Up/Tear-Down
  • VIA is a point-to-point, connection-oriented protocol
  • VI endpoint: the core concept in VIA
• Register/De-Register Memory
• Connect/Disconnect
• Transmit
• Receive
• RDMA

VIA operations
Set-Up/Tear-Down: VIA is a point-to-point, connection-oriented protocol
• VI endpoint: the core concept in VIA
• The VipCreateVi function creates a VI endpoint in user space.
• The user-level library passes the call to the kernel agent, which passes the creation information to the NIC.
• The OS thus controls application access to the NIC.

VIA operations - cont’d
Register/De-Register Memory:
• All data buffers and descriptors reside in registered memory
• The NIC performs DMA I/O operations in this registered memory
• Registration pins the pages in physical memory and provides a handle used to manipulate the pages and transfer their addresses to the NIC
• Registration is performed once, usually at the beginning of the communication session
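The registration rules above can be sketched as a toy table of registered regions. This is a simulation under stated assumptions, not the VIA API: `RegisteredMemory`, `register`, `deregister`, and `dma_view` are hypothetical names, and Python cannot actually pin pages, so only the handle and bounds checks are modeled.

```python
class RegisteredMemory:
    """Toy model of a NIC's table of registered memory regions."""

    def __init__(self):
        self._regions = {}       # handle -> registered buffer
        self._next_handle = 1

    def register(self, buffer):
        """Record the buffer and return a handle that names it."""
        handle = self._next_handle
        self._next_handle += 1
        self._regions[handle] = buffer
        return handle

    def deregister(self, handle):
        """Forget the region; later DMA through this handle fails."""
        del self._regions[handle]

    def dma_view(self, handle, offset, length):
        """Return the slice the NIC may access, checking handle and bounds."""
        if handle not in self._regions:
            raise KeyError(f"handle {handle} is not registered")
        buffer = self._regions[handle]
        if offset < 0 or length < 0 or offset + length > len(buffer):
            raise ValueError("access outside the registered region")
        return memoryview(buffer)[offset:offset + length]


nic_memory = RegisteredMemory()
data = bytearray(16)
handle = nic_memory.register(data)          # done once, at session start
nic_memory.dma_view(handle, 4, 4)[:] = b"abcd"
print(bytes(data[4:8]))
```

Registering once and reusing the handle mirrors the slide's point that registration belongs at the start of a session rather than on every transfer.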

VIA operations - cont’d
Connect/Disconnect:
• Before communication, each endpoint is connected to a remote endpoint
• The connection request is passed to the kernel agent and down to the NIC
• VIA does not define an addressing scheme; existing schemes can be used in the various implementations

VIA operations - cont’d
Transmit/receive:
• The sender builds a descriptor for the message to be sent. The descriptor points to the actual data buffer. Both the descriptor and the data buffer reside in a registered memory area.
• The application then posts a doorbell to signal the availability of the descriptor. The doorbell contains the address of the descriptor.
• The doorbells are maintained in an internal queue inside the NIC

VIA operations - cont’d
Transmit/receive (cont’d):
• Meanwhile, the receiver creates a descriptor that points to an empty data buffer and posts a doorbell in the receiver NIC queue
• When the doorbell in the sender queue reaches the head of the queue, the data is located through a double indirection (doorbell to descriptor, descriptor to buffer) and sent into the network.
• The first doorbell/descriptor is picked up from the receiver queue, and its buffer is filled with the incoming data
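The transmit/receive steps can be sketched as two queues of descriptors. This is a minimal simulation, not the VIA API: `Descriptor`, `ToyNic`, `post_send`, `post_recv`, and `deliver` are hypothetical names, a queue entry plays the role of a doorbell (a reference to a descriptor), and the network is reduced to a single function call.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Descriptor:
    buffer: bytearray    # data buffer the descriptor points to
    length: int = 0      # number of valid bytes in the buffer
    done: bool = False   # set when the NIC has processed the descriptor


class ToyNic:
    """Holds the doorbell queues; each entry refers to a posted descriptor."""

    def __init__(self):
        self.send_queue = deque()
        self.recv_queue = deque()

    def post_send(self, descriptor):
        self.send_queue.append(descriptor)   # "ring the doorbell"

    def post_recv(self, descriptor):
        self.recv_queue.append(descriptor)


def deliver(sender, receiver):
    """Move one message from the head send descriptor to the head receive descriptor."""
    if not sender.send_queue or not receiver.recv_queue:
        return False                         # nothing to send or no buffer posted
    send_desc = sender.send_queue[0]
    recv_desc = receiver.recv_queue[0]
    n = send_desc.length
    if n > len(recv_desc.buffer):
        raise ValueError("posted receive buffer is too small")
    sender.send_queue.popleft()
    receiver.recv_queue.popleft()
    # Double indirection: queue entry -> descriptor -> data buffer.
    recv_desc.buffer[:n] = send_desc.buffer[:n]
    recv_desc.length = n
    send_desc.done = recv_desc.done = True
    return True


node_a, node_b = ToyNic(), ToyNic()
rx = Descriptor(bytearray(8))
node_b.post_recv(rx)                         # receiver pre-posts an empty buffer
tx = Descriptor(bytearray(b"ping"), length=4)
node_a.post_send(tx)
deliver(node_a, node_b)
print(bytes(rx.buffer[:rx.length]))
```

Note that the receive descriptor must already be posted when the data arrives; in this sketch an unmatched send simply stays queued.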

VIA operations - cont’d
RDMA:
• As a mechanism derived from VMMC, VIA allows Remote DMA operations: RDMA Read and RDMA Write
• Each node allocates a receive buffer and registers it with the NIC. Additional structures that contain read and write pointers to the receive buffers are exchanged during connection setup
• Each node can read from and write to the remote node's address space directly.
• These operations pose potential implementation problems.
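The contrast with send/receive is that an RDMA operation names a location in the remote buffer and needs no descriptor posted by the remote side. The sketch below only illustrates that idea: `RemoteNode`, `rdma_write`, and `rdma_read` are hypothetical names, and the exchange of read and write pointers at connection setup is reduced to the caller knowing an offset.

```python
class RemoteNode:
    """A node that has allocated and registered one receive buffer."""

    def __init__(self, size):
        self.receive_buffer = bytearray(size)


def _check_range(target, offset, length):
    if offset < 0 or length < 0 or offset + length > len(target.receive_buffer):
        raise ValueError("access outside the registered receive buffer")


def rdma_write(target, offset, data):
    """Place data directly into the remote buffer at the given offset."""
    _check_range(target, offset, len(data))
    target.receive_buffer[offset:offset + len(data)] = data


def rdma_read(target, offset, length):
    """Fetch bytes directly from the remote buffer."""
    _check_range(target, offset, length)
    return bytes(target.receive_buffer[offset:offset + length])


peer = RemoteNode(32)
rdma_write(peer, 8, b"payload")   # no receive descriptor is consumed
print(rdma_read(peer, 8, 7))
```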

Evaluation Benchmarks
• Two VI implementations:
  • GigaNet cLan: bandwidth 125 MB/s, latency 480 ns
  • Tandem ServerNet: bandwidth 50 MB/s, latency 300 ns
• Performance measured:
  • Bandwidth and latency
  • Polling vs. blocking
  • CPU utilization

Bandwidth

Latency

Latency Polling/Blocking

CPU utilization

MPI performance using VIA
• The challenge is to deliver performance to distributed applications
• Software layers such as MPI usually sit between VIA and the application: they provide increased usability but bring additional overhead
• How can this layer be optimized so that it works efficiently with VIA?

MPI VIA - performance

MPI observations
• The difference between MPI-UDP and MPI-VIA-baseline is remarkable
• MPI-VIA-baseline still falls far short of VIA-Native
• Several improvements are proposed to bring MPI-VIA closer to VIA-Native by reducing MPI overhead

MPI Improvements
• Eliminating unnecessary copies: the MPI implementations over UDP and over VIA use a single set of receive buffers, so data must be copied to the application. The fix is to allow the user to register any buffer.
• Choosing a synchronization primitive: all synchronization formerly used OS constructs/events. A better implementation uses processor swap instructions.
• No acknowledgment: remove the message acknowledgment by switching to a reliable VIA mode

VIA - Disadvantages
• Polling vs. blocking synchronization: a tradeoff between CPU consumption and overhead
• Memory registration: locking large amounts of memory makes virtual memory mechanisms inefficient. Registering/deregistering on the fly is slow
• Point-to-point vs. multicast: VIA lacks multicast primitives. Implementing multicast on top of the existing point-to-point mechanism makes communication inefficient
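The polling vs. blocking tradeoff can be sketched with ordinary threads. This is a generic illustration, not VIA code: `Completion` is a hypothetical class, a timer thread stands in for the NIC signaling completion, and the 10 ms delay is an arbitrary assumed value.

```python
import threading


class Completion:
    """Toy completion flag that a waiter can either poll or block on."""

    def __init__(self):
        self._event = threading.Event()

    def signal(self):
        self._event.set()

    def is_done(self):
        return self._event.is_set()

    def wait_polling(self):
        """Spin on the flag: reacts quickly but keeps a CPU busy."""
        polls = 0
        while not self._event.is_set():
            polls += 1
        return polls

    def wait_blocking(self, timeout=None):
        """Sleep until signaled: frees the CPU but adds wake-up overhead."""
        return self._event.wait(timeout)


polled = Completion()
threading.Timer(0.01, polled.signal).start()
poll_count = polled.wait_polling()

blocked = Completion()
threading.Timer(0.01, blocked.signal).start()
completed = blocked.wait_blocking(timeout=5.0)

print(f"polling spun {poll_count} times; blocking completed: {completed}")
```

The number of spins reported for the polling waiter is CPU work that the blocking waiter avoids, at the cost of being rescheduled by the OS when the signal arrives.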

Conclusion
• Low latency for small messages. Small messages have a strong impact on application behavior
• Significant improvement over UDP communication (does this still hold after recent TCP/UDP hardware implementations?)
• At the expense of an uncomfortable API
