
CS160 – Lecture 3



Presentation Transcript


  1. CS160 – Lecture 3 Clusters. Introduction to PVM and MPI

  2. Introduction to PC Clusters • What are PC Clusters? • How are they put together ? • Examining the lowest level messaging pipeline • Relative application performance • Starting with PVM and MPI

  3. Clusters, Beowulfs, and more • How do you put a “Pile-of-PCs” into a room and make them do real work? • Interconnection technologies • Programming them • Monitoring • Starting and running applications • Running at Scale

  4. Beowulf Cluster • Current working definition: a collection of commodity PCs running an open-source operating system with a commodity interconnection network • Dual Intel PIIIs with fast ethernet, Linux • Program with PVM, MPI, … • Single Alpha PCs running Linux

  5. Beowulf Clusters cont’d • Interconnection network is usually fast ethernet running TCP/IP • (Relatively) slow network • Programming model is message passing • Most people now associate the name “Beowulf” with any cluster of PCs • Beowulfs are differentiated from high-performance clusters by the network • www.beowulf.org has lots of information

  6. High-Performance Clusters • Gigabit networks: Myrinet, SCI, FC-AL, Giganet, GigE, ATM • Killer micros: low-cost Gigaflop processors here for a few kilo-$$ per processor • Killer networks: Gigabit network hardware, high-performance software (e.g. Fast Messages), soon at 100’s of $$ per connection • Leverage HW, commodity SW (*nix/Windows NT), build key technologies => high performance computing in a RICH software environment

  7. Cluster Research Groups • Many other cluster groups that have had impact • Active Messages / Network of Workstations (NOW), UC Berkeley • Basic Interface for Parallelism (BIP), Univ. of Lyon • Fast Messages (FM) / High Performance Virtual Machines (HPVM), UIUC/UCSD • Real World Computing Partnership (Japan) • Scalable High-performance Really Inexpensive Multi-Processor (SHRIMP), Princeton

  8. Clusters are Different • A pile of PCs is not a large-scale SMP server. • Why? Performance and programming model • A cluster’s closest cousin is an MPP • What’s the major difference? Clusters run N copies of the OS, MPPs usually run one.

  9. Ideal Model: HPVMs [Figure: an application program runs on a “Virtual Machine Interface” that hides the actual system configuration] • HPVM = High Performance Virtual Machine • Provides a simple uniform programming model, abstracts and encapsulates underlying resource complexity • Simplifies use of complex resources

  10. Virtualization of Machines • Want the illusion that a collection of machines is a single machine • Start, stop, monitor distributed programs • Programming and debugging should work seamlessly • PVM (Parallel Virtual Machine) was the first widely-adopted virtualization for parallel computing • This illusion is only partially complete in any software system. Some issues: • Node heterogeneity • Real network topology can lead to contention • Unrelated – What is a Java Virtual Machine?
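To make the virtual-machine idea concrete, here is a minimal PVM 3 master sketch, assuming a standard PVM installation; the "worker" executable name and the choice of four tasks are illustrative assumptions, not from the lecture.

    /* Minimal PVM 3 master sketch: spawn workers across the virtual machine,
     * send each its index, and collect one integer reply apiece.
     * The "worker" executable is hypothetical; build against pvm3.h/libpvm3. */
    #include <stdio.h>
    #include <pvm3.h>

    int main(void)
    {
        int tids[4], i, n, result;

        printf("master tid %x\n", pvm_mytid());   /* enroll in the virtual machine */
        n = pvm_spawn("worker", (char **)0, PvmTaskDefault, "", 4, tids);

        for (i = 0; i < n; i++) {                 /* send each worker its index */
            pvm_initsend(PvmDataDefault);
            pvm_pkint(&i, 1, 1);
            pvm_send(tids[i], 1);                 /* message tag 1 */
        }
        for (i = 0; i < n; i++) {                 /* gather one int from each */
            pvm_recv(-1, 2);                      /* any sender, tag 2 */
            pvm_upkint(&result, 1, 1);
            printf("got %d\n", result);
        }
        pvm_exit();                               /* leave the virtual machine */
        return 0;
    }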

  11. High-Performance Communication [Figure: switched multigigabit networks with user-level access vs. switched 100 Mbit networks with OS-mediated access] • Level of network interface support + NIC/network router latency • Overhead and latency of communication => deliverable bandwidth • High-performance communication => Programmability! • Low-latency, low-overhead, high-bandwidth cluster communication • … much more is needed … • Usability issues, I/O, Reliability, Availability • Remote process debugging/monitoring

  12. Putting a cluster together • (16, 32, 64, … X) individual nodes • E.g. dual-processor Pentium III/733, 1 GB memory, ethernet • Scalable high-speed network • Myrinet, Giganet, Servernet, Gigabit Ethernet • Message-passing libraries • TCP, MPI, PVM, VIA • Multiprocessor job launch • Portable Batch System • Load Sharing Facility • PVM spawn, mpirun, rsh • Techniques for system management • VA Linux Cluster Manager (VACM) • High Performance Technologies Inc (HPTI)

  13. Communication style is message passing [Figure: machine A breaks a message into numbered packets (1, 2, 3, 4) that traverse the network and are reassembled at machine B] • How do we efficiently get a message from Machine A to Machine B? • How do we efficiently break a large message into packets and reassemble at receiver? • How does receiver differentiate among message fragments (packets) from different senders?
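To make the packetization questions concrete, here is a hedged sketch of the kind of fragment header a messaging layer might prepend to each packet; this is illustrative only, not FM's actual wire format.

    /* Hypothetical fragment header a messaging layer might prepend to each
     * packet; not FM's wire format. The receiver uses (src, msg_id) to pick
     * a reassembly buffer and seq/total to place the fragment within it. */
    #include <stdint.h>

    #define MAX_PACKET 1500    /* assume an Ethernet-sized MTU */

    struct frag_hdr {
        uint32_t src;      /* sending node: distinguishes interleaved senders */
        uint32_t msg_id;   /* per-sender message number */
        uint16_t seq;      /* fragment index within the message */
        uint16_t total;    /* total fragments in the message */
        uint32_t len;      /* payload bytes carried by this fragment */
    };

    struct packet {
        struct frag_hdr hdr;
        uint8_t payload[MAX_PACKET - sizeof(struct frag_hdr)];
    };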

  14. Will use the details of FM to illustrate some communication engineering

  15. FM on Commodity PC’s [Figure: layered stack – FM host library on the Pentium II/III (~450 MIPS), FM device driver, FM NIC firmware (~33 MIPS), connected over the P6/PCI bus to a 1280 Mbps link] • Host Library: API presentation, flow control, segmentation/reassembly, multithreading • Device driver: protection, memory mapping, scheduling monitors • NIC Firmware: link management, incoming buffer management, routing, multiplexing/demultiplexing

  16. Fast Messages 2.x Performance [Figure: bandwidth (MB/s) vs. message size, 4 bytes to 64 KB; the curve rises past 100 MB/s, with the N1/2 point marked] • Latency 8.8 µs, Bandwidth 100+ MB/s, N1/2 ~250 bytes • Fast in absolute terms (compares to MPPs, internal memory BW) • Delivers a large fraction of hardware performance for short messages • Technology transferred into emerging cluster standards such as Intel/Compaq/Microsoft’s Virtual Interface Architecture.

  17. Comments about Performance • Latency and bandwidth are the most basic measurements of message passing machines • Will discuss performance models in detail because latency and bandwidth do not tell the entire story • High-performance clusters exhibit • 10X improvement in deliverable bandwidth over ethernet • 20X – 30X improvement in latency

  18. How does FM really get Speed? • Protected user-level access to network (OS-bypass) • Efficient credit-based flow control • assumes reliable hardware network [only OK for System Area Networks] • No buffer overruns (stalls sender if no receive space) • Early de-multiplexing of incoming packets • multithreading, use of NT user-schedulable threads • Careful implementation with many tuning cycles • Overlapping DMAs (recv), programmed I/O send • No interrupts! Polling only.
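A minimal sketch of the credit-based flow control idea, under the reliable-network assumption above; the credit count and the net_poll/nic_put_packet helpers are hypothetical names for illustration, not FM's actual code.

    /* Sender-side credit-based flow control sketch. Each credit corresponds
     * to one free slot in the receiver's pinned DMA receive region, so the
     * region can never be overrun; the sender stalls (polling) at zero. */
    #define INITIAL_CREDITS 64          /* slots in the receive region (assumed) */

    static volatile int credits = INITIAL_CREDITS;

    void net_poll(void);                          /* assumed: drains NIC, may refill credits */
    void nic_put_packet(const void *p, int len);  /* assumed: programmed-I/O copy to the NIC */

    void send_packet(const void *pkt, int len)
    {
        while (credits == 0)     /* stall the sender instead of overrunning the receiver */
            net_poll();          /* polling only: credit refills arrive as packets */
        credits--;
        nic_put_packet(pkt, len);
    }

    /* Called from net_poll() when a credit-refill packet arrives. */
    void credit_refill(int n)
    {
        credits += n;
    }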

  19. OS-Bypass Background • Suppose you want to perform a sendto on a standard IP socket • The operating system mediates access to the network device • Must trap into the kernel to ensure authorization on each and every message (very time consuming) • Message is copied from the user program to kernel packet buffers • Protocol information about each packet is generated by the OS and attached to a packet buffer • Message is finally sent out onto the physical device (ethernet) • Receiving does the inverse with a recvfrom • Packet to kernel buffer, OS strips the header, reassembles the data, OS mediation for authorization, copy into user program
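For contrast, this is the kernel-mediated path in miniature: a plain UDP sendto, where every call traps into the OS and copies the buffer (the address and port below are placeholders).

    /* Minimal UDP sendto: each call traps into the kernel, which checks the
     * descriptor, copies the user buffer into kernel packet buffers, builds
     * the UDP/IP headers, and hands the frame to the driver. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);

        struct sockaddr_in dst;
        memset(&dst, 0, sizeof dst);
        dst.sin_family = AF_INET;
        dst.sin_port = htons(5000);                        /* placeholder port */
        inet_pton(AF_INET, "192.168.1.2", &dst.sin_addr);  /* placeholder address */

        const char msg[] = "hello";
        sendto(s, msg, sizeof msg, 0,
               (struct sockaddr *)&dst, sizeof dst);       /* one kernel trap + copy */
        close(s);
        return 0;
    }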

  20. OS-Bypass • A user program is given a protected slice of the network interface • Authorization is done once (not per message) • Outgoing packets get directly copied or DMAed to network interface • Protocol headers added by user-level library • Incoming packets get routed by network interface card (NIC) into user-defined receive buffers • NIC must know how to differentiate incoming packets. This is called early demultiplexing. • Outgoing and incoming message copies are eliminated. • Traps to OS kernel are eliminated

  21. Packet Pathway [Figure: the sender pushes packets to its NIC with programmed I/O; packets cross the network NIC-to-NIC; the receiving NIC DMAs them into a pinned DMA receive region, and user-level handlers move them into user message buffers] • Concurrency of I/O busses • Sender specifies receiver handler ID • Flow control keeps DMA region from being overflowed

  22. Fast Messages 1.x – An example message passing API and library • API: Berkeley Active Messages • Key distinctions: guarantees(reliable, in-order, flow control), network-processor decoupling (DMA region) • Focus on short-packet performance: • Programmed IO (PIO) instead of DMA • Simple buffering and flow control • Map I/O device to user space (OS bypass) Sender: FM_send(NodeID,Handler,Buffer,size); // handlers are remote procedures Receiver: FM_extract()

  23. What is an active message? • Usually, message passing has a send with a corresponding explicit receive at the destination. • Active messages specify a function to invoke (activate) when message arrives • Function is usually called a message handler The handler gets called when the message arrives, not by the destination doing an explicit receive.
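A hedged sketch of the active-message style using the FM 1.x calls named above; the FM_set_handler registration call and the handler signature are assumptions for illustration, not FM's exact API.

    /* Active-message sketch in the style of FM 1.x. */
    #include <stdio.h>

    void FM_set_handler(int id, void (*h)(void *buf, int size));  /* assumed call */
    void FM_send(int dest, int handler, void *buf, int size);
    void FM_extract(void);

    #define PING_HANDLER 7            /* handler IDs name remote procedures */

    /* Runs on the receiver when the message arrives --
     * there is no matching explicit receive call. */
    void ping_handler(void *buf, int size)
    {
        printf("ping: got %d bytes, value %d\n", size, *(int *)buf);
    }

    void init(void)                   /* run on every node (SPMD style) */
    {
        FM_set_handler(PING_HANDLER, ping_handler);
    }

    void sender(int dest)
    {
        int payload = 42;
        FM_send(dest, PING_HANDLER, &payload, sizeof payload);
    }

    void receiver(void)
    {
        for (;;)
            FM_extract();             /* poll the network; invoke handlers */
    }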

  24. FM 1.x Performance (6/95) [Figure: bandwidth (MB/s) vs. message size (16–2048 bytes); curves labeled FM and 1Gb Ethernet, 0–20 MB/s scale] • Latency 14 µs, Peak BW 21.4 MB/s [Pakin, Lauria et al., Supercomputing ’95] • Hardware limits PIO performance, but N1/2 = 54 bytes • Delivers 17.5 MB/s @ 128 byte messages (140 Mbps, greater than OC-3 ATM deliverable)

  25. The FM Layering Efficiency Issue • How good is the FM 1.1 API? • Test: build a user-level library on top of it and measure the available performance • MPI chosen as representative user-level library • porting of MPICH 1.0 (ANL/MSU) to FM • Purpose: to study what services are important in layering communication libraries • integration issues: what kind of inefficiencies arise at the interface, and what is needed to reduce them [Lauria & Chien, JPDC 1997]

  26. MPI on FM 1.x – Inefficient Layering of Protocols [Figure: bandwidth (MB/s) vs. message size (16–2048 bytes); the MPI-FM curve sits well below the FM curve] • First implementation of MPI on FM was ready in Fall 1995 • Disappointing performance: only a fraction of the FM bandwidth was available to MPI applications

  27. MPI-FM Efficiency [Figure: % efficiency vs. message size (16–2048 bytes), 0–100% scale] • Result: FM fast, but its interface not efficient

  28. MPI-FM Layering Inefficiencies [Figure: the MPI and FM layers each attach headers between the source buffer and the destination buffer, forcing intermediate copies] • Too many copies due to header attachment/removal, lack of coordination between transport and application layers

  29. Redesign API - FM 2.x • Sending • FM_begin_message(NodeID, Handler, size) • FM_send_piece(stream,buffer,size) // gather • FM_end_message() • Receiving • FM_receive(buffer,size) // scatter • FM_extract(total_bytes) // rcvr flow control
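A hedged sketch of how a layered library might use this gather-scatter interface to send a header plus payload without intermediate copies; the FM_stream handle, the return value of FM_begin_message, and the handler's shape are assumptions for illustration, not FM's exact API.

    /* Header + payload send over the FM 2.x calls listed above. */
    typedef void FM_stream;                                   /* assumed opaque handle */

    FM_stream *FM_begin_message(int node, int handler, int size);
    void FM_send_piece(FM_stream *s, void *buf, int size);
    void FM_end_message(void);
    void FM_receive(void *buf, int size);

    struct msg_hdr { int tag; int len; };    /* hypothetical upper-layer header */

    void layered_send(int dest, int handler, int tag, void *data, int len)
    {
        struct msg_hdr hdr = { tag, len };
        FM_stream *s = FM_begin_message(dest, handler, sizeof hdr + len);
        FM_send_piece(s, &hdr, sizeof hdr);  /* gather: header straight from its own buffer */
        FM_send_piece(s, data, len);         /* ... then payload, no intermediate copy */
        FM_end_message();
    }

    /* Receiver-side handler: scatter header and payload directly into place. */
    void layered_handler(int size)           /* signature assumed */
    {
        struct msg_hdr hdr;
        static char payload[65536];          /* hypothetical destination buffer */
        FM_receive(&hdr, sizeof hdr);        /* pull the header first ... */
        FM_receive(payload, hdr.len);        /* ... then the data, in place */
    }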

  30. MPI-FM 2.x Improved Layering [Figure: with gather-scatter, the MPI header and source buffer flow through FM into the destination buffer without intermediate copies] • Gather-scatter interface + handler multithreading enables efficient layering, data manipulation without copies

  31. MPI on FM 2.x [Figure: bandwidth (MB/s) vs. message size (4 bytes to 64 KB) for FM and MPI-FM, 0–100 MB/s scale] • MPI-FM: 91 MB/s, 13 µs latency, ~4 µs overhead • Short messages much better than IBM SP2, PCI limited • Latency ~ SGI O2K

  32. MPI-FM 2.x Efficiency [Figure: % efficiency vs. message size (4 bytes to 64 KB), 0–100% scale] • High transfer efficiency, approaches 100% [Lauria, Pakin et al., HPDC7 ’98] • Other systems much lower even at 1 KB (100 Mbit: 40%, 1 Gbit: 5%)

  33. HPVM III (“NT Supercluster”) • 256 x Pentium II, April 1998, 77 Gflops • 3-level fat tree (large switches), scalable bandwidth, modular extensibility • => 512 x Pentium III (550 MHz), early 2000, 280 Gflops • Both with National Center for Supercomputing Applications [Photos: the 77 GF system (April 1998) and the 280 GF system (early 2000)]

  34. Supercomputer Performance Characteristics

      System                  Mflops/Proc   Flops/Byte   Flops/Network RT
      Cray T3E                1200          ~2           ~2,500
      SGI Origin2000          500           ~0.5         ~1,000
      HPVM NT Supercluster    300           ~3.2         ~6,000
      Berkeley NOW II         100           ~3.2         ~2,000
      IBM SP2                 550           ~3.7         ~38,000
      Beowulf (100Mbit)       300           ~25          ~200,000

  • Compute/communicate and compute/latency ratios • Clusters can provide programmable characteristics at a dramatically lower system cost
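As an illustrative back-of-the-envelope reading of the Flops/Byte column: a 300 Mflops/proc node behind Fast Ethernet delivering roughly 12 MB/s gives about 300/12 ≈ 25 floating point operations per byte communicated (the Beowulf row), while the same node behind a ~90 MB/s HPVM network gives roughly 300/90 ≈ 3.3. An application must sustain at least that many flops per byte it sends to keep the processors busy, which is why the network, not the processor, is what separates these systems.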

  35. Solving 2D Navier-Stokes Kernel – Performance of Scalable Systems • Preconditioned conjugate gradient method with multi-level additive Schwarz Richardson preconditioner (2D 1024x1024) • Danesh Tafti, Rob Pennington, NCSA; Andrew Chien (UIUC, UCSD)

  36. Is the detail important? Is there something easier? • Detail of a particular high-performance interface illustrates some of the complexity for these systems • Performance and scaling are very important. Sometimes the underlying structure needs to be understood to reason about applications. • Class will focus on distributed computing algorithms and interfaces at a higher level (message passing)

  37. How do we program/run such machines? • PVM (Parallel Virtual Machine) provides • Simple message passing API • Construction of a virtual machine with a software console • Ability to spawn (start), kill (stop), monitor jobs • XPVM is a graphical console and performance monitor • MPI (Message Passing Interface) • Complex and complete message passing API • De facto, community-defined standard • No defined method for job management • mpirun is provided as a tool with the MPICH distribution • Commercial and non-commercial tools for monitoring and debugging • Jumpshot, Vampir, …
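As a concrete starting point, a minimal MPI send/receive program; the launch line is only an example, and the exact syntax depends on the MPI installation.

    /* Minimal MPI program: rank 1 sends an integer to rank 0.
     * Build with mpicc and launch with e.g. "mpirun -np 2 ./a.out". */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1) {
            value = 160;
            MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);   /* dest 0, tag 0 */
        } else if (rank == 0) {
            MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                          /* src 1, tag 0 */
            printf("rank 0 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }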

  38. Next Time … • Parallel programming paradigms: shared memory, message passing
