Message Passing

Message Passing MPI on Origin Systems

MPI Programming Model

Compiling MPI Programs
      cc -64 compute.c -lmpi
      f77 -64 -LANG:recursive=on compute.f -lmpi
      f90 -64 -LANG:recursive=on compute.f -lmpi
      CC -64 compute.c -lmpi++ -lmpi
• -64 is NOT required, but it improves functionality and optimization
• With compiler level 7.2.1 or higher, -auto_use mpi_interface can be used with f77/f90 for compile-time checking of subroutine interfaces

Compiling MPI Programs
• Must use the header file from /usr/include, since the SGI libraries are built with it (do not use the public domain version)
  • FORTRAN: mpif.h or USE MPI
  • C: mpi.h
  • C++: mpi++.h
• The mpi_init version must match the main program language (if called from multiple shared memory threads, use mpi_init_thread)

Compiling MPI Programs
• MPI definitions:
  • FORTRAN: MPI_XXXX (not case sensitive)
  • C: MPI_Xxxx (upper and lower case)
  • C++: Xxxx (part of namespace MPI::)
• Every MPI_ entry point in the MPI library has a “shadow” PMPI_ entry point to aid the implementation of user profiling
• Array Services (arrayd) is required to run MPI

Basic MPI Features

MPI Basic Calls
• MPI has a large number of calls. The following are the most basic:
  • every MPI program has to start and finish with these calls (the first and the last executable statements):
    • mpi_init
    • mpi_finalize
  • essential inquiry about the environment:
    • mpi_comm_size
    • mpi_comm_rank
  • basic communication calls:
    • mpi_send
    • mpi_recv
  • basic synchronization calls:
    • mpi_barrier
      program mpitest
      include 'mpif.h'
      call mpi_init(ierr)
      call mpi_comm_size(MPI_COMM_WORLD,np,ierr)
      call mpi_comm_rank(MPI_COMM_WORLD,id,ierr)
      do i=0,np-1
         if(i.eq.id) print *,'np, id',np,id
         call mpi_barrier(MPI_COMM_WORLD,ierr)
      enddo
      call mpi_finalize(ierr)
      stop
      end
Compile with: f77 -o mpitest -LANG:recursive=on mpitest.f -lmpi
Run with: mpirun -np N [-stats -prefix "%g"] mpitest

MPI send and receive Calls
• mpi_send(buf,count,datatype,dest,tag,comm,ierr)
• mpi_recv(buf,count,datatype,source,tag,comm,stat,ierr)
  • buf: data to be sent/received
  • count: number of items to be sent; size of buf for recv
  • datatype: type of the data items to send/recv (MPI_INTEGER, MPI_REAL, MPI_DOUBLE_PRECISION, etc.)
  • dest/source: id of the peer process (MPI_ANY_SOURCE is accepted on receive)
  • tag: integer mark of the message (MPI_ANY_TAG is accepted on receive)
  • comm: communication handle (MPI_COMM_WORLD)
  • stat: status of the message (MPI_Status in C); in Fortran INTEGER stat(MPI_STATUS_SIZE)
      call mpi_get_count(stat,MPI_REAL,nitems,ierr)
    where nitems can be <= count
• dest (or source), tag, and comm form the message envelope
• check for errors:
      if(ierr.ne.MPI_SUCCESS) call abort()

Using send and receive Calls
• Example:
      if(mod(id,2).eq.0) then
         idst = mod(id+1,np)
         itag = 0
         call mpi_send(A,N,MPI_REAL,idst,itag,MPI_COMM_WORLD,ierr)
         if(ierr.ne.MPI_SUCCESS) print *,'error from',id,np,ierr
      else
         isrc = mod(id-1+np,np)
         itag = MPI_ANY_TAG
         call mpi_recv(B,NSIZE,MPI_REAL,isrc,itag,
     &                 MPI_COMM_WORLD,stat,ierr)
         if(ierr.ne.MPI_SUCCESS) print *,'error from',id,np,ierr
         call mpi_get_count(stat,MPI_REAL,N,ierr)
      endif
• Rules of use:
  • mpi_send/recv are defined as blocking calls
  • the program should not assume blocking behaviour (small messages are buffered)
  • when these calls return, the buffers can be (re-)used
  • the arrival order of messages sent from A and B to C is not determined. Two messages from A to B will arrive in the order sent.
• Message Passing programming models are non-deterministic.
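The neighbour arithmetic in this example (mod(id+1,np) for the destination, mod(id-1+np,np) for the source) can be sketched in C as below. This is an illustration only: ring_next and ring_prev are hypothetical helper names, not MPI calls, and the sketch assumes ranks are numbered 0 to np-1.

```c
#include <assert.h>

/* Rank that process `id` sends to: the next rank, wrapping back to 0. */
int ring_next(int id, int np)
{
    assert(np > 0 && id >= 0 && id < np);
    return (id + 1) % np;
}

/* Rank that process `id` receives from: the previous rank.
 * Adding np before taking the remainder keeps the result non-negative
 * for id == 0, where (id - 1) % np would evaluate to -1. */
int ring_prev(int id, int np)
{
    assert(np > 0 && id >= 0 && id < np);
    return (id - 1 + np) % np;
}
```

With np = 4, rank 3 sends to rank 0 and rank 0 receives from rank 3, which is why the "+np" term is needed in the source expression.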

Another Simple Example

MPI send/receive: Buffering
• An MPI program should not assume buffering of messages. The following program is erroneous:
      program long_messages
      include 'mpif.h'
      real*8 h(4000)
      integer stat(MPI_STATUS_SIZE)
      call mpi_init(info)
      call mpi_comm_rank(MPI_COMM_WORLD,mype,info)
      call mpi_comm_size(MPI_COMM_WORLD,npes,info)
      do i = 1000, 4000, 100     ! increasing size of the message
         call mpi_barrier(MPI_COMM_WORLD,info)
         print *,'mype=',mype,' before send',i
         call mpi_send(h,i,MPI_REAL8,mod(mype+1,npes),i,
     &                 MPI_COMM_WORLD,info)
         call mpi_barrier(MPI_COMM_WORLD,info)
         call mpi_recv(h,i,MPI_REAL8,mod(mype-1+npes,npes),i,
     &                 MPI_COMM_WORLD,stat,info)
      enddo
      call mpi_finalize(info)
      end
• Running on an Origin2000 on 2 CPUs, the program blocks once the message size reaches i=2100 (because of the buffering constraint MPI_BUFFER_MAX=16384 bytes, i.e. 2048 items of real*8)
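The point at which the loop stalls follows from simple arithmetic on the buffer limit. A minimal sketch in C, assuming a message is buffered only while count * item size does not exceed MPI_BUFFER_MAX; first_unbuffered_count is a hypothetical helper written for this illustration, not an MPI or MPT routine.

```c
#include <assert.h>
#include <stddef.h>

/* Walks the message lengths start, start+step, ..., stop (in items) and
 * returns the first one whose size in bytes exceeds buffer_max.
 * Returns 0 if every message in the sequence fits in the buffer.
 * Assumption: messages of exactly buffer_max bytes are still buffered. */
size_t first_unbuffered_count(size_t start, size_t stop, size_t step,
                              size_t item_bytes, size_t buffer_max)
{
    assert(step > 0);
    for (size_t n = start; n <= stop; n += step) {
        if (n * item_bytes > buffer_max) {
            return n;
        }
    }
    return 0;
}
```

For the loop in the slide (1000 to 4000 in steps of 100, 8-byte items, limit 16384 bytes) the helper returns 2100: 2000 items occupy 16000 bytes and still fit, 2100 items occupy 16800 bytes and do not.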

MPI Asynchronous send/receive
• Non-blocking send and receive calls are available:
  • mpi_isend(buf,count,datatype,dest,tag,comm,req,ierr)
  • mpi_irecv(buf,count,datatype,source,tag,comm,req,ierr)
    • buf,count,datatype: message content
    • dest (or source),tag,comm: message envelope
    • req: integer holding the request id
• The asynchronous call returns the request id after registering the buffer. The request id can be used in the probe and wait calls:
  • mpi_wait(req,stat,ierr)
    • blocks until the MPI send or receive with request id req completes
  • mpi_waitall(count,array-of-req,array-of-stat,ierr)
    • waits for all given communications to complete (a blocking call)
• The (array-of-)stat can be probed for items received. The data can be retrieved with the recv call (or irecv call, or any other variety of receive)
• NOTE: although this interface advertises asynchronous communication, the actual copy of buffers happens only at the time of the receive and wait calls

MPI Asynchronous: Example
• Buffer management with asynchronous communication:
  • buffers declared in isend/irecv can be (re-)used only after the communication has actually completed.
  • Requests should be freed (mpi_test, mpi_wait, mpi_request_free) for all the isend calls in the program, otherwise mpi_finalize might hang
      include 'mpif.h'
      integer stat(MPI_STATUS_SIZE,10)
      integer req(10)
      real B1(NB1,10)
      if(mype.eq.0) then         ! master receiving from all slaves
         do ip=1,npes-1
            call mpi_irecv(B1(1,ip),NB1,MPI_REAL,ip,MPI_ANY_TAG,
     &                     MPI_COMM_WORLD,req(ip),info)
         enddo
         nreq = npes-1           ! one request per slave
      else                       ! slave sends to master
         call mpi_isend(B1(1,mype),NB1,MPI_REAL,0,itag,
     &                  MPI_COMM_WORLD,req(1),info)
         nreq = 1
      endif
      …                          ! some unrelated calculations
      call mpi_waitall(nreq,req,stat,info)
      …                          ! data is available in B1 in the master process
      …                          ! buffer B1 can be reused in the slave processes

Performance of Asynchronous Communication

MPI Functionality

MPI Most Important Functions
• Synchronous communication:
  • mpi_send
  • mpi_recv
  • mpi_sendrecv
• Asynchronous communication:
  • mpi_isend
  • mpi_irecv
  • mpi_iprobe
  • mpi_wait/waitall
• Collective communication:
  • mpi_barrier
  • mpi_bcast
  • mpi_gather/scatter
  • mpi_reduce/allreduce
  • mpi_alltoall
• Creating communicators:
  • mpi_comm_dup
  • mpi_comm_split
  • mpi_comm_free
  • mpi_intercomm_create
• Derived data types:
  • mpi_type_contiguous
  • mpi_type_vector
  • mpi_type_indexed
  • mpi_pack
  • mpi_type_commit
  • mpi_type_free

MPI Most Important Functions
• One-sided communication:
  • mpi_win_create
  • mpi_put
  • mpi_get
  • mpi_win_fence
• Miscellaneous:
  • MPI_Wtime()
    • Based on the SGI_CYCLE clock with 0.8 microsecond resolution

MPI Run Time System on SGI
[Figure: mpirun passes the program name, path, and environment variables to the Array daemon on each host; each Array daemon fork()s t.exe N times (ranks 0 to N-1); the hosts are linked by HiPPI optimized communication]
• On SGI, all MPI programs are launched with the mpirun command
  • single-host syntax: mpirun -np N executable-name arguments
      mpirun -np N t.exe
  • multi-host execution of different executables is possible:
      mpirun Host_A -np N a.out : Host_B -np M b.out
• mpirun establishes a connection with the Array Daemon through the socket interface.
• The Array Daemon launches the MPI executable.
• N+1 threads are started. The additional thread is the “lazy” thread, which is blocked in the mpi_init() call and terminates when all other threads call mpi_finalize()
• mpirun -cpr (or -miser) works on a single host and avoids the socket interface to the Array Daemon (for the Checkpoint/Restart facility)
• Note: start MPI programs with N < #procs

MPI Run Time on SGI

MPI Implementation on SGI
• In C, mpi_init ignores all arguments passed to it
• All MPI processes are required to call mpi_finalize at exit
• I/O streams:
  • stdin is enabled only for the master thread (process with rank 0)
  • stdout and stderr are enabled for all the threads and are line buffered
  • output from different MPI threads can be prefixed using the -prefix argument; output is sent to the mpirun process
    example: mpirun -prefix "<proc %g out of %G> "
    prints:
      <proc 0 out of 2> Hello World
      <proc 1 out of 2> Hello World
  • see man mpi(5) and man mpirun(1) for a complete description
• Systems with the HIPPI software installed will trigger usage of the HIPPI optimized communication (HIPPI bypass). If the hardware is not installed, it is necessary to switch the HIPPI bypass off (setenv MPI_BYPASS_OFF TRUE)
• With f77/f90, the -auto_use mpi_interface flag is available to check the consistency of MPI arguments at compile time
• With -64 compilation, the MPI run time maps out the address space such that shared memory optimizations are available to circumvent the double copy problem. In particular, communication involving static data (i.e. common blocks) can be sped up.

SGI Message-Passing Software
• SGI Message Passing Toolkit (MPT 1.5)
  • MPI, SHMEM, PVM components
  • Packaged with Array Services software
• MPT external web page:
  • http://www.sgi.com/software/mpt/
• MPT engineering internal web page:
  • http://wwwmn.americas.sgi.com/mpi/

SGI Message-Passing Toolkit
• Fully MPI 1.2 standard compliant (based on MPICH)
• SHMEM API for one-sided communication
• Support for selected MPI-2 features, to be extended as customer needs dictate
  • MPI I/O (ROMIO version 1.0.2)
  • MPI one-sided communication
  • Thread safety
  • Fortran 90 bindings: USE MPI
  • C++ bindings
• PVM available on IRIX (public domain version)

MPT: Supported Platforms
• Now
  • IRIX SSI
  • IRIX clusters (GSN, HIPPI, Ethernet)
  • IA32 and IA64 SSI with Linux
  • IA32 cluster (Myrinet, Ethernet) with Linux
• Soon
  • Partitioned IRIX (NUMAlink interconnect)
  • IRIX clusters (Myrinet)
  • Partitioned SN IA (NUMAlink interconnect)
  • IA64 cluster (Myrinet, Ethernet)

Convenience Features in MPT
• MPI job management with LSF, NQE, PBS, and others
• Totalview debugger interoperability
• Fortran MPI subroutine interface checking at compile time with USE MPI
• Aborted cluster jobs are cleaned up automatically
• Array Services provides job control for cluster jobs
• Array Services and MPI work together to propagate user signals to all slaves
• Shell modules can be used to install multiple versions of MPT on the same system

MPI Performance
• Low latency and high bandwidth
• Fetchop-assisted fast message queuing
• Fast fetchop tree barriers
• Very fast MPI and SHMEM one-sided communication
• Interoperability with SHMEM
• Support for SSI to 512 P
• Automatic NUMA placement
• Optimized MPI collectives
• Internal MPI statistics reporting
• Integration with PCP
• Direct send/recv transfers
• No-impact thread safety support
• Runtime MPI tuning

NUMAlink Implementation
• Used by MPI_Barrier, MPI_Win_fence, and shmem_barrier_all
• Fetch-op variables on the Hub provide fast synchronization for flat and tree barrier methods
• The fetch-op AMO helped reduce MPI send/recv latency from 12 to 8 usec
[Figure: two CPUs attached to a HUB holding the fetch-op variable, with the HUB connected to a ROUTER]

NUMAlink-based MPI Performance
[Figure: MPI performance on Origin 2000 (Origin 3000)]

SHMEM Model

SHMEM API

One-Sided Communication Pattern
[Figure: processes 0 to N-1 plotted against time, alternating COMPUTE and COMMUNICATE phases separated by barriers]

MPI Message Exchange (on host)
[Figure: process 0 calls MPI_Send(src,len,…) and process 1 calls MPI_Recv(dst,len,…); shared memory between them holds the fetchop variable, the message queues, the message headers of each process, and the data buffers through which src is copied to dst]

MPI Message Exchange using Single Copy (on host)
[Figure: process 0 calls MPI_Send(src,len,…) and process 1 calls MPI_Recv(dst,len,…); shared memory holds the fetchop variable, the message queues, and the message headers of each process; src is copied to dst without intermediate data buffers]

Performance of Synchronous Communication

Using Single Copy send/recv
• Set MPI_BUFFER_MAX to N
  • any message with size > N bytes will be transferred by direct copy if
    • MPI semantics allow it
    • the -64 ABI is used
    • the memory region it is allocated in is a globally accessible location
  • N=2000 seems to work well
    • shorter messages don’t benefit from the direct copy transfer method
• Look at the stats to verify that direct copy was used.

Making Memory Globally Accessible for Single Copy send/recv
• The user’s send buffer must reside in one of the following regions:
  • static memory (-static/common blocks/DATA/SAVE)
  • symmetric heap (allocated with SHPALLOC or shmalloc)
  • global heap (allocated with the f90 ALLOCATE statement with SMA_GLOBAL_ALLOC set; MIPSpro version 7.3.1.1m)
• When SMA_GLOBAL_ALLOC is set, the global heap size usually needs to be increased by setting SMA_GLOBAL_HEAP_SIZE

Global Communication Test
• The ALL-to-ALL communication test (known as COMMS3 in the Parkbench suite)
[Figure: send array (A) and receive array (B), each divided into blocks of iw words, one block per process p0, p1, p2, … pn]

Global Communication
• The ALL-to-ALL communication test:
MPI Version
C     every processor sends message to every other processor
C     then every processor receives messages directed to it.
      T0 = mpi_wtime()
      do i = 1, NREPT
         call mpi_alltoall(A, iw, MPI_DOUBLE_PRECISION,
     &                     B, iw, MPI_DOUBLE_PRECISION,
     &                     MPI_COMM_WORLD, ier)
      end do
      T1 = mpi_wtime()
      Tn = (T1-T0)/(NREPT*NP*(NP-1))   ! NP processes send NP-1 messages
SHMEM Version
      T0 = mpi_wtime()
      do i = 1, NREPT
         call shmem_barrier_all()
         do j = 0, NP-1
            other = MOD(my_rank+j, NP)
            call shmem_put8(B(1+iw*my_rank), A(1+iw*other), iw, other)
         end do
      end do
      T1 = mpi_wtime()
      Tn = (T1-T0)/(NREPT*NP*(NP-1))   ! NP processes send NP-1 messages
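The normalisation applied at the end of both versions, Tn = (T1-T0)/(NREPT*NP*(NP-1)), can be written as a small C helper. This is a sketch only: time_per_message is a hypothetical name, and the timestamps are assumed to be wall-clock readings in seconds, as a timer such as MPI_Wtime() would return.

```c
#include <assert.h>

/* Average time per message for the all-to-all test.
 * Each of the np processes sends np-1 messages per repetition, so
 * nrept repetitions move nrept * np * (np-1) messages in total.
 * The product is formed in double to avoid integer overflow for
 * large repetition counts. */
double time_per_message(double t0, double t1, int nrept, int np)
{
    assert(nrept > 0 && np > 1);
    double messages = (double)nrept * (double)np * (double)(np - 1);
    return (t1 - t0) / messages;
}
```

For example, 10 repetitions on 4 processes exchange 10 * 4 * 3 = 120 messages, so an elapsed time of 12 seconds gives 0.1 seconds per message.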

Global Communication
• Performance of the global communication test
• Actions:
  • convert to SHMEM
  • use single copy versions on remotely accessible variables
[Figure: AlltoAll bandwidth for R12K@300MHz]
• The test case shows cache effects, since every operation is performed 50 times.
• As of MPT 1.4.0.0, the global communication routines already use a single copy algorithm for remotely accessible variables.

Global Communication
[Figure: single copy vs. double copy]
• Conclusions: implement critical data exchange in MPI programs with SHMEM, or with single copy MPI on static or shmalloc/shpalloc allocated data.

MPI get/put
• For codes that are latency sensitive, try using one-sided MPI (get/put).
  • latency over NUMAlink on O3000:
    • send/recv: 5 microseconds
    • mpi_get: 0.7 microseconds
• If portability isn’t an issue, use SHMEM instead
  • shmem_get latency: 0.5 microseconds (estimate by the MPT group)
  • much easier to write code

Transposition with SHMEM vs. send/recv
SHMEM version:
      call shmem_barrier_all
      do 150 kk=1,lmtot
         ktag=ksendto(kk)
         call shmem_put8(y(1+(ktag-1)*len), x(1,ksnding(kk)),
     &                   len, ipsndto(kk))
  150 continue
      call shmem_barrier_all
send/recv version:
      ltag=0
      do 150 kk=1,lmtot
         ltag=ltag+1
         ktag=ksendto(kk)
         call mpi_isend(x(1,ksnding(kk)), len, mpireal,
     &                  ipsndto(kk), ktag, mpicomm, iss(ltag), istat)
         ltag=ltag+1
         ktag=krcving(kk)
         call mpi_irecv(y(1,krcving(kk)), len, mpireal,
     &                  iprcvfr(kk), ktag, mpicomm, iss(ltag), istat)
  150 continue
      call mpi_waitall(ltag, iss, istatm, istat)

Transposition with MPI_put
      common/buffer/ yg(length)
      integer(kind=MPI_ADDRESS_KIND) winsize, target_disp
!     Setup: create a window for array yg since we will do puts into it
      call mpi_type_extent(MPI_REAL8, isizereal8, ierr)
      winsize=isizereal8*length
      call mpi_win_create(yg, winsize, isizereal8, MPI_INFO_NULL,
     &                    MPI_COMM_WORLD, iwin, ierr)

Transposition with MPI_put
      call mpi_barrier(MPI_COMM_WORLD,ierr)
      do 150 kk=1,lmtot
         ktag=ksendto(kk)
         target_disp=(1+(ktag-1)*len)-1
         call mpi_put(x(1,ksnding(kk)), len, MPI_REAL8, ipsndto(kk),
     &                target_disp, len, MPI_REAL8, iwin, ierr)
  150 continue
      call mpi_win_fence(0, iwin, ierr)
      do kk=1,len*lmtot
         y(kk)=yg(kk)
      end do
!     Cleanup - destroy window
      call mpi_barrier(MPI_COMM_WORLD,ierr)
      call mpi_win_free(iwin, ierr)
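The target_disp expression converts the 1-based Fortran index used in the SHMEM version, y(1+(ktag-1)*len), into a 0-based displacement into the window. A minimal C sketch of that conversion follows; put_displacement is a hypothetical helper name, and the result is assumed to be counted in window displacement units (one real*8 element here, since the window was created with isizereal8 as its displacement unit).

```c
#include <assert.h>

/* Zero-based displacement of block number ktag (1-based) when every
 * block holds len elements. The intermediate value is the 1-based
 * Fortran element index; subtracting 1 gives the offset from the
 * start of the window. */
long put_displacement(long ktag, long len)
{
    assert(ktag >= 1 && len >= 0);
    long fortran_index = 1 + (ktag - 1) * len;
    return fortran_index - 1;
}
```

With len = 100, block 1 starts at displacement 0 and block 3 at displacement 200, so the expression reduces to (ktag-1)*len.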
