  1. Supporting Parallel Applications on Clusters of Workstations: The Intelligent Network Interface Approach
  Marcel Catalin Rosu, Karsten Schwan, and Richard Fujimoto, Georgia Institute of Technology
  Appears in HPDC 1997
  Presented by: Lei Yang

  2. Background
  • Multiprocessor-based system models
    • Parallel vector processor (PVP)
    • Symmetric multiprocessor (SMP)
    • Massively parallel processor (MPP)
    • Distributed shared memory machine
    • Cluster of workstations (COW)
  • COW features
    • Each node is a complete workstation minus peripherals (monitor, keyboard, mouse, …)
    • Nodes are connected through a commodity network, e.g., Ethernet, FDDI, or an ATM switch
    • A complete OS resides on each node

  3. Motivation
  • Problems with COWs
    • The performance of communication software inherently fails to scale with host CPU performance
    • High communication overhead: the software overhead (time required to prepare and authenticate a message) is significantly higher than the hardware overhead (network setup and message propagation time)
  • Coprocessors on the network interface
    • Found on Myrinet and ATM adapters
    • But what should coprocessors do to minimize communication overheads?

  4. Motivation
  • The critical step is the reduction of host communication overheads, rather than network latency. Why?
    • Many existing parallel applications are designed to hide network latencies.
    • Multithreaded applications typically cannot benefit significantly from improving network latencies below the cost of several user-level thread context switches.
    • In a cluster, in contrast to a parallel machine, the schedulers of distinct nodes are only loosely synchronized. This implies highly dynamic offsets among schedulers, and therefore among cooperating application threads, on the order of tens of microseconds.

  5. The VCM approach
  • VCM: Virtual Communication Machine
    • Enables applications to set up a customized, lightweight communication path between their address spaces and the “wire”
  • Goal
    • Reduction of software communication overheads
  • How
    • Transfer selected communication-related processing from the host CPU(s) to the network coprocessor
    • A low-level abstraction between applications and the coprocessor
    • Applications interact with the VCM directly
      • The complexity is hidden by a user-level library
      • The usual protection is enforced via a kernel extension
    • The VCM and applications operate asynchronously
    • The VCM and applications communicate through shared memory (sketched below)
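
  To make the shared-memory interface concrete, the command area can be pictured as a small ring of program slots mapped into both the application's address space and the coprocessor's view of memory. The paper does not publish the actual layout, so every name, field, and constant in this minimal C sketch is an assumption, not the real ABI.

      #include <stdint.h>

      #define VCM_CMD_SLOTS 16          /* assumed ring size */

      /* Status-word values polled by both sides (assumed encoding). */
      enum vcm_status {
          VCM_FREE    = 0,              /* slot available to the application */
          VCM_POSTED  = 1,              /* program written, awaiting the VCM */
          VCM_RUNNING = 2,              /* coprocessor is executing it       */
          VCM_DONE    = 3,              /* finished; result field is valid   */
      };

      /* One VCM "program": an opcode plus operands such as the address
       * and length of a buffer inside the application's address space. */
      struct vcm_command {
          volatile uint32_t status;     /* status word, written by both sides */
          uint32_t opcode;              /* e.g., send, receive, loop          */
          uint64_t buf_addr;            /* address in the application's space */
          uint32_t buf_len;
          uint32_t result;              /* filled in by the coprocessor       */
      };

      /* The command area itself, mapped into both address spaces. */
      struct vcm_command_area {
          struct vcm_command slot[VCM_CMD_SLOTS];
      };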

  6. VCM features
  • The “intelligent network interface” VCM
    • The name was changed in a later journal version.
  • The VCM has an active role
    • It has access to the application's address space
    • Extensions to shared-memory applications
  • Zero-copy messaging is available at both ends
    • sending
    • receiving
  • Communication-related processing can be transferred to the network coprocessor
  • Buffer pages are managed by the application
    • The application knows its own behavior best (see the sketch below)
  • Multiple VCMs are supported on each host
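
  Because the application manages its own buffer pages, a receive can land directly in an application data structure with no host-side copy. The library calls below (vcm_register_pages, vcm_post_receive) are invented for illustration; the paper does not name its user-level API. A minimal sketch, assuming the kernel extension pins and translates the pages at registration time:

      #include <stdint.h>
      #include <stdlib.h>

      #define PAGE_SIZE 4096

      static int vcm_register_pages(void *base, size_t len)
      {
          (void)base; (void)len;   /* stub: the real call enters the kernel
                                    * extension to pin/translate the pages */
          return 0;
      }

      static int vcm_post_receive(void *base, size_t len)
      {
          (void)base; (void)len;   /* stub: the real call hands the pages to
                                    * the VCM as the next receive target    */
          return 0;
      }

      /* The application chooses which of its own pages back a receive, so
       * the coprocessor assembles the incoming message directly into the
       * application's data structure: zero-copy at the receiving end. */
      int setup_receive_buffer(void)
      {
          void *buf;
          if (posix_memalign(&buf, PAGE_SIZE, 8 * PAGE_SIZE) != 0)
              return -1;
          if (vcm_register_pages(buf, 8 * PAGE_SIZE) != 0)
              return -1;
          return vcm_post_receive(buf, 8 * PAGE_SIZE);
      }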

  7. VCM Architecture
  • The coprocessor is responsible for
    • Ensuring data integrity
    • Assembling/disassembling messages directly from/into an application's data structures
    • Multiplexing/demultiplexing network messages
    • Enforcing protection
  • Three components
    • The Virtual Communication Machine, implemented on the network coprocessor
    • A kernel extension module, for address-space management and protection
    • A user-level library, which shields applications from the complexity of interacting with the VCM and the kernel extension

  8. Application–VCM interaction
  • An application gains access to a VCM by registering
    • Registration extends the application's address space with a region shared with the VCM: the command area
  • The application and the VCM interact via the command area
    • Program and instruction completion is signaled through status words placed in the command area.
  • Operations are asynchronous
    • The coprocessor polls for new programs to execute
    • The host CPU(s) check for program and instruction completion by polling the status words (see the sketch below)
  • Data transfers are performed only by the coprocessor
  • To improve performance
    • Loop instructions handle bursty invocations with many identical parameters
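
  The asynchronous hand-off might look like the following, reusing the hypothetical vcm_command layout from the earlier sketch (redefined here so the fragment stands alone). The opcode value and busy-wait discipline are assumptions; the point is that posting a program and checking its status word never requires a kernel trap.

      #include <stdint.h>

      enum { VCM_FREE = 0, VCM_POSTED = 1, VCM_DONE = 3, VCM_OP_SEND = 10 };

      struct vcm_command {              /* one slot of the command area */
          volatile uint32_t status;
          uint32_t opcode;
          uint64_t buf_addr;
          uint32_t buf_len;
      };

      /* Post a send and return immediately; the coprocessor, which polls
       * the command area, picks the program up asynchronously. */
      static void vcm_post_send(struct vcm_command *slot,
                                void *buf, uint32_t len)
      {
          while (slot->status != VCM_FREE)  /* wait for a free slot */
              ;
          slot->opcode   = VCM_OP_SEND;
          slot->buf_addr = (uint64_t)(uintptr_t)buf;
          slot->buf_len  = len;
          slot->status   = VCM_POSTED;      /* hand the program to the VCM */
      }

      /* The host checks completion by polling the status word, so useful
       * computation can be overlapped with the coprocessor's work. */
      static int vcm_send_done(const struct vcm_command *slot)
      {
          return slot->status == VCM_DONE;
      }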

  9. Command Area

  10. Implementation
  • Platform
    • Cluster of Sun UltraSPARC I Model 170 workstations
    • Solaris 2.5
    • FORE SBA-200E ATM network adapters
      • 25 MHz i960 microprocessor

  11. Implementation
  • VCM interpreter
    • Runs on the coprocessor
    • Requests are served in the following order:
      • Protection-related instructions
      • VCM programs
      • Loop instructions
      • Incoming data
  • Protection and buffer-page management
    • The VCM accepts protection-management instructions only from the kernel or from the connection server
    • The VCM checks the correctness of all parameters received from an application (see the sketch below)
    • Messages longer than expected are truncated to the size of the receiving buffer
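
  Two of these safeguards are easy to picture in code. The region bookkeeping and function names below are assumptions; the real interpreter ran on the card's i960 and would use DMA rather than a copy loop.

      #include <stdint.h>
      #include <stddef.h>

      /* Reject programs whose buffer falls outside the region that the
       * kernel extension registered for this application (assumed
       * bookkeeping; the VCM validates every application parameter). */
      static int params_ok(uint64_t base, uint32_t len,
                           uint64_t region_lo, uint64_t region_hi)
      {
          return base >= region_lo && base <= region_hi &&
                 len <= region_hi - base;
      }

      /* Deliver an incoming message into the posted receive buffer,
       * truncating anything longer than the buffer, as the paper states.
       * On the real card this transfer is a coprocessor-driven DMA. */
      static size_t deliver_truncated(uint8_t *dst, size_t dst_len,
                                      const uint8_t *msg, size_t msg_len)
      {
          size_t n = msg_len < dst_len ? msg_len : dst_len;
          for (size_t i = 0; i < n; i++)
              dst[i] = msg[i];
          return n;
      }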

  12. Implementation
  • VCM instruction set

  13. Evaluation
  • Microbenchmarks
  • Synthetic client/server application
    • Ten client workstations issue back-to-back data requests to the server workstation
  • Traveling Salesman Problem (TSP)
  • Georgia Tech Time Warp (GTW)
    • A parallel kernel for discrete-event simulation
    • PHold, a synthetic application
    • PCS, a wireless network simulation

  14. Performance – Microbenchmarks
  • Latency is linear in the message size
  • The maximum send rate approaches the maximum data capacity of the wire

  15. Performance – Client/server application
  • Outgoing bandwidth of the server as a function of the request size, when the server uses one or two network interfaces

  16. Performance – TSP

  17. Performance – PHold

  18. Performance – PCS

  19. Limitations
  • Requires special hardware: a network adapter card equipped with
    • A network coprocessor
    • A few megabytes of fast memory
    • One or more DMA engines under the control of the coprocessor
    • Network-specific hardware to help with performance-critical processing (e.g., CRC computation)
  • How hard is it to port shared-memory applications to a VCM-based COW?

  20. Conclusion
  • Host communication overhead is the crucial factor
  • VCM
    • Flexible integration between network and application
    • Low overhead on the host processor
    • Latency and bandwidth close to the hardware limits
    • Enables zero-copy messaging
    • Enables porting certain shared-memory parallel applications to a VCM-based COW
  • The performance is desirable and the contribution is valuable
