
Message-Passing Computing


Presentation Transcript


1. Chapter 2: Message-Passing Computing
Ways of programming a message-passing computer:
• Using a special parallel programming language
• Extending an existing language
• Using a high-level language and providing a message-passing library
Here, the third option is employed.

2. Message-Passing Programming Using User-Level Message-Passing Libraries
Two primary mechanisms are needed:
1. A method of creating separate processes for execution on different computers
   • Static process creation (MPI-1): the number of processes is fixed before execution
   • Dynamic process creation (MPI-2): processes can be created at runtime
2. A method of sending and receiving messages
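
As an illustration of dynamic process creation, here is a minimal C sketch (not from the original slides) in which a parent spawns workers with the MPI-2 routine MPI_Comm_spawn; the worker executable name "./worker" is a hypothetical placeholder.

#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Comm intercomm;   /* intercommunicator connecting parent and spawned workers */

    MPI_Init(&argc, &argv);

    /* Spawn 4 copies of a hypothetical worker executable at runtime (MPI-2). */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

    /* ... communicate with the workers through intercomm ... */

    MPI_Finalize();
    return 0;
}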

3. Programming Models: 1. Multiple Program, Multiple Data (MPMD) Model
[Diagram: separate source files are each compiled to suit their processor, producing one executable per processor, Processor 0 through Processor p-1.]

4. Programming Models: 2. Single Program, Multiple Data (SPMD) Model
Different processes are merged into one program. Control statements select the different parts for each processor to execute. All executables are started together - static process creation. This is the basic MPI way.
[Diagram: one source file compiled to suit each processor, producing executables for Processor 0 through Processor p-1.]

5. Multiple Program, Multiple Data (MPMD) Model
Separate programs for each processor. One processor executes the master process; the other processes are started from within the master process - dynamic process creation.
[Diagram: Process 1 calls spawn() to start execution of Process 2; time runs downward.]

6. Basic “Point-to-Point” Send and Receive Routines
Passing a message between processes using send() and recv() library calls:
[Diagram: Process 1 executes send(&x, 2); Process 2 executes recv(&y, 1); the data moves from x in Process 1 to y in Process 2.]
Generic syntax (actual formats later).

7. Synchronous Message Passing
Routines that actually return when the message transfer has been completed.
Synchronous send routine
• Waits until the complete message can be accepted by the receiving process before sending the message.
Synchronous receive routine
• Waits until the message it is expecting arrives.
• No need for buffer storage.
Synchronous routines intrinsically perform two actions: they transfer data and they synchronize processes.
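
As a hedged illustration (not from the original slides), MPI's synchronous-mode send MPI_Ssend() completes only when the matching receive has started, giving the synchronizing behaviour described above; the tag value 0 is an arbitrary choice.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, x = 42, y = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Synchronous-mode send: does not complete until process 1 has
           started the matching receive, so the two processes synchronize. */
        MPI_Ssend(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&y, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Process 1 received %d\n", y);
    }

    MPI_Finalize();
    return 0;
}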

8. Synchronous send() and recv() Using a Three-Way Protocol
[Diagram (a), send() occurs before recv(): Process 1 issues a request to send and suspends; when Process 2 reaches recv() an acknowledgment is returned, the message is transferred, and both processes continue.]
[Diagram (b), recv() occurs before send(): Process 2 suspends at recv(); when Process 1 reaches send() it issues the request to send, the message is transferred with an acknowledgment, and both processes continue.]

9. Asynchronous Message Passing
• Routines that do not wait for actions to complete before returning. They usually require local storage for messages.
• More than one version exists, depending upon the actual semantics for returning.
• In general, they do not synchronize processes but allow processes to move forward sooner. Must be used with care.

10. MPI Definitions of Blocking and Non-Blocking
• Blocking - routines return after their local actions complete, though the message transfer may not have been completed.
• Non-blocking - routines return immediately. It is assumed that the data storage used for the transfer is not modified by subsequent statements before it has been used for the transfer, and it is left to the programmer to ensure this.
These terms may have different interpretations in other systems.

11. How Message-Passing Routines Can Return Before the Message Transfer Is Completed
A message buffer is needed between source and destination to hold the message:
[Diagram: Process 1 calls send() and continues; the message is placed in a message buffer; later Process 2 calls recv() and reads the message from the buffer.]

12. Asynchronous (Blocking) Routines Changing to Synchronous Routines
• Once the local actions are completed and the message is safely on its way, the sending process can continue with subsequent work.
• Buffers are only of finite length, and a point could be reached where the send routine is held up because all available buffer space is exhausted.
• Then the send routine will wait until storage becomes available again - i.e., the routine then behaves as a synchronous routine.

13. Message Tag
• Used to differentiate between different types of messages being sent.
• The message tag is carried within the message.
• If special type matching is not required, a wild-card message tag is used, so that the recv() will match with any send().

14. Message Tag Example
To send a message, x, with message tag 5 from a source process, 1, to a destination process, 2, and assign it to y:
[Diagram: Process 1 executes send(&x, 2, 5); Process 2 executes recv(&y, 1, 5), which waits for a message from process 1 with a tag of 5; the data moves from x to y.]
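
A hedged C/MPI sketch of the same idea (not from the original slides): the sender uses tag 5, and the receiver uses the wild-card tag MPI_ANY_TAG so it matches a message with any tag from the source; the payload value is arbitrary.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, x = 7, y = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Send x to process 1 with message tag 5. */
        MPI_Send(&x, 1, MPI_INT, 1, 5, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Wild-card tag: matches a message from process 0 with any tag. */
        MPI_Recv(&y, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        printf("Received %d with tag %d\n", y, status.MPI_TAG);
    }

    MPI_Finalize();
    return 0;
}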

15. “Group” Message-Passing Routines
Routines that send message(s) to a group of processes or receive message(s) from a group of processes.
They offer higher efficiency than separate point-to-point routines, although they are not absolutely necessary.

16. Scatter
Sending each element of an array in the root process to a separate process. The contents of the ith location of the array are sent to the ith process.
[Diagram: every process calls scatter(); the root's buffer buf is split and one portion of the data is delivered to each of Process 0 through Process p-1. The MPI form is shown below.]
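
A minimal hedged sketch of the MPI form, MPI_Scatter() (not from the original slides); it assumes the program is run with exactly 4 processes and scatters one int to each.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, recv_val;
    int sendbuf[4] = {10, 20, 30, 40};   /* significant only at the root */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Root (process 0) sends sendbuf[i] to process i; every process,
       including the root, receives one int into recv_val. */
    MPI_Scatter(sendbuf, 1, MPI_INT, &recv_val, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Process %d received %d\n", rank, recv_val);
    MPI_Finalize();
    return 0;
}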

17. Gather
Having one process collect individual values from a set of processes.
[Diagram: every process calls gather(); the data item from each of Process 0 through Process p-1 is collected into the buffer buf of the root process. The MPI form is shown below.]
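
A hedged MPI_Gather() sketch (not from the original slides), again assuming 4 processes: each process contributes its rank and the root collects the values into an array.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, recvbuf[4];   /* recvbuf is significant only at the root */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each process sends its rank; process 0 gathers them so that
       recvbuf[i] holds the value contributed by process i. */
    MPI_Gather(&rank, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Gathered: %d %d %d %d\n",
               recvbuf[0], recvbuf[1], recvbuf[2], recvbuf[3]);

    MPI_Finalize();
    return 0;
}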

18. Reduce
A gather operation combined with a specified arithmetic/logical operation. Example: values could be gathered and then added together by the root:
[Diagram: every process calls reduce(); the data items from Process 0 through Process p-1 are combined with “+” into the buffer buf of the root process.]
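
A hedged sketch of the MPI form, MPI_Reduce() with MPI_SUM (not from the original slides): each process contributes its rank and the root receives the sum.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Combine the ranks of all processes with "+"; the result
       appears only at the root (process 0). */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum of ranks = %d\n", sum);

    MPI_Finalize();
    return 0;
}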

19. AllGather and AllReduce
• AllGather and AllReduce perform a gather/reduce and broadcast the result to all processes.
• First a group must be formed and a root process selected.
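
A hedged MPI_Allreduce() sketch (not from the original slides). In the MPI form there is no root argument, since every process receives the combined result.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Like MPI_Reduce with MPI_SUM, but every process gets the result. */
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("Process %d sees sum of ranks = %d\n", rank, sum);
    MPI_Finalize();
    return 0;
}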

20. Barrier
A barrier is a synchronization point: no process passes the barrier until all processes have reached it.
• An example barrier can be based on an allReduce.
• (Typically more efficient implementations are used.)
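
In MPI the barrier is provided directly as MPI_Barrier(); a minimal hedged sketch (not from the original slides):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("Process %d before the barrier\n", rank);

    /* No process returns from MPI_Barrier until all processes
       in the communicator have called it. */
    MPI_Barrier(MPI_COMM_WORLD);

    printf("Process %d after the barrier\n", rank);
    MPI_Finalize();
    return 0;
}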

21. PVM (Parallel Virtual Machine)
Perhaps the first widely adopted attempt at using a workstation cluster as a multicomputer platform, developed by Oak Ridge National Laboratory. Available at no charge.
The programmer decomposes the problem into separate programs (usually a master and a group of identical slave programs). The programs are compiled to execute on specific types of computers.
The set of computers used on a problem must first be defined prior to executing the programs (in a hostfile).

22. Message routing between computers is done by PVM daemon processes, installed by PVM on the computers that form the virtual machine. There can be more than one process running on each computer. The MPI implementation we use is similar.
[Diagram: each workstation runs a PVM daemon and one or more application programs (executables); messages are sent through the network between workstations.]

23. MPI (Message Passing Interface)
• A message-passing library standard developed by a group of academics and industrial partners to foster more widespread use and portability.
• Defines routines, not implementation.
• Several free implementations exist.

24. MPI Process Creation and Execution
• Purposely not defined - will depend upon the implementation.
• Only static process creation is supported in MPI version 1. All processes must be defined prior to execution and started together.
• Originally an SPMD model of computation.
• MPMD is also possible with static creation - each program to be started together is specified.

25. Communicators
• A communicator defines the scope of a communication operation.
• Processes have ranks associated with a communicator.
• Initially, all processes are enrolled in a “universe” called MPI_COMM_WORLD, and each process is given a unique rank, a number from 0 to p - 1 for p processes.
• Other communicators can be established for groups of processes.

26. Using the SPMD Computational Model

#include <mpi.h>

int main(int argc, char *argv[])
{
    int myrank;
    MPI_Init(&argc, &argv);
    .
    .
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* find process rank */
    if (myrank == 0)
        master();
    else
        slave();
    .
    .
    MPI_Finalize();
    return 0;
}

where master() and slave() are to be executed by the master process and slave processes, respectively.

27. Unsafe Message Passing - Example
[Diagram (a), intended behavior: Process 0's send(…,1,…) in the user code is matched by Process 1's recv(…,0,…) in the user code, while the library routines lib() on both processes exchange their own message.]
[Diagram (b), possible behavior: the recv(…,0,…) inside Process 1's lib() matches Process 0's user-level send(…,1,…) instead, so the messages are delivered to the wrong receives.]

28. MPI Solution: “Communicators”
• A communicator defines a communication domain - a set of processes that are allowed to communicate between themselves.
• Communication domains of libraries can be separated from that of a user program.
• Used in all point-to-point and collective MPI message-passing communications.

29. Default Communicator: MPI_COMM_WORLD
• Exists as the first communicator for all processes existing in the application.
• A set of MPI routines exists for forming communicators.
• Processes have a “rank” in a communicator.

30. MPI Point-to-Point Communication
• Uses send and receive routines with message tags (and a communicator).
• A wild-card message tag (MPI_ANY_TAG) is available.

31. MPI Blocking Routines
• Return when “locally complete” - when the location used to hold the message can be used again or altered without affecting the message being sent.
• A blocking send will send the message and return - this does not mean that the message has been received, just that the process is free to move on without adversely affecting the message.

32. Parameters of Blocking Send
MPI_Send(buf, count, datatype, dest, tag, comm)
• buf - address of send buffer
• count - number of items to send
• datatype - datatype of each item
• dest - rank of destination process
• tag - message tag
• comm - communicator

33. Example: send-recv in MPI Fortran 90

! Example send-recv, MPI Fortran 90
program main
  use mpi
  implicit none
  integer status(MPI_STATUS_SIZE)
  integer :: ierr, rank, i, p, tag
  character*100 message
  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, p, ierr)
  if (rank .ne. 0) then
     ! Every non-root process sends a greeting to process 0
     message = "Hello from the processor!"
     call MPI_SEND(message, len(message), MPI_CHARACTER, 0, 0, MPI_COMM_WORLD, ierr)
  else
     ! Process 0 receives one message from each of the other processes
     tag = 0
     do i = 1, p-1
        call MPI_RECV(message, len(message), MPI_CHARACTER, i, tag, MPI_COMM_WORLD, status, ierr)
        print *, "Processor:", i, "message:", message
     end do
  end if
  call MPI_FINALIZE(ierr)
end

34. Mid-Point Rule to Compute PI Using MPI_BCAST and MPI_REDUCE

! Mid-point rule to compute pi using MPI_BCAST and MPI_REDUCE
program main
  use mpi
  double precision PI25DT
  parameter (PI25DT = 3.141592653589793238462643d0)
  double precision mypi, pi, h, sum, x, f, a
  double precision starttime, endtime
  integer n, myid, numprocs, i, ierr
  f(a) = 4.d0 / (1.d0 + a*a)            ! function to integrate
  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
10 if (myid .eq. 0) then
     print *, 'Enter the number of intervals: (0 quits) '
     read(*,*) n
  endif
  starttime = MPI_WTIME()
  call MPI_BCAST(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
  ! check for quit signal
  if (n .le. 0) goto 30
  ! calculate the interval size
  h = 1.0d0/n
  sum = 0.0d0
  do 20 i = myid+1, n, numprocs
     x = h * (dble(i) - 0.5d0)
     sum = sum + f(x)
20 continue
  mypi = h * sum
  ! collect all the partial sums
  call MPI_REDUCE(mypi, pi, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, &
                  MPI_COMM_WORLD, ierr)
  endtime = MPI_WTIME()
  if (myid .eq. 0) then
     print *, 'pi is ', pi, 'Error is ', abs(pi - PI25DT)
     print *, 'time is ', endtime-starttime, ' seconds'
  endif
  go to 10
30 call MPI_FINALIZE(ierr)
  stop
end

35. Parameters of Blocking Receive
MPI_Recv(buf, count, datatype, src, tag, comm, status)
• buf - address of receive buffer
• count - maximum number of items to receive
• datatype - datatype of each item
• src - rank of source process
• tag - message tag
• comm - communicator
• status - status after operation

36. Example
To send an integer x from process 0 to process 1:

int myrank;
int msgtag = 0;               /* message tag (any agreed value) */
MPI_Status status;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* find rank */
if (myrank == 0) {
    int x;
    MPI_Send(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD);
} else if (myrank == 1) {
    int x;
    MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);
}

37. C - MPI Datatypes

38. Fortran - MPI Basic Datatypes

39. The Status Array
Status is a data structure allocated in the user's program.

C:
    int recvd_tag, recvd_from, recvd_count;
    MPI_Status status;
    MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ..., &status);
    recvd_tag  = status.MPI_TAG;
    recvd_from = status.MPI_SOURCE;
    MPI_Get_count(&status, datatype, &recvd_count);

Fortran:
    integer recvd_tag, recvd_from, recvd_count
    integer status(MPI_STATUS_SIZE)
    call MPI_RECV(.., MPI_ANY_SOURCE, MPI_ANY_TAG, .., status, ierr)
    recvd_tag  = status(MPI_TAG)
    recvd_from = status(MPI_SOURCE)
    call MPI_GET_COUNT(status, datatype, recvd_count, ierr)

40. MPI Nonblocking Routines
• Nonblocking send - MPI_Isend() - will return “immediately”, even before the source location is safe to be altered.
• Nonblocking receive - MPI_Irecv() - will return even if there is no message to accept.

41. Nonblocking Routine Formats
MPI_Isend(buf, count, datatype, dest, tag, comm, request)
MPI_Irecv(buf, count, datatype, source, tag, comm, request)
Completion is detected by MPI_Wait() and MPI_Test():
• MPI_Wait() waits until the operation has completed and then returns.
• MPI_Test() returns immediately with a flag set indicating whether the operation has completed at that time.
One needs to know when a particular operation has completed; this is determined by accessing the request parameter.

42. MPI_ISEND & MPI_IRECV
Fortran:
MPI_ISEND(buf, count, type, dest, tag, comm, req, ierr)
MPI_IRECV(buf, count, type, sour, tag, comm, req, ierr)
• buf - array of type type
• count (INTEGER) - number of elements of buf to be sent
• type (INTEGER) - MPI type of buf
• dest (INTEGER) - rank of the destination process
• sour (INTEGER) - rank of the source process
• tag (INTEGER) - number identifying the message
• comm (INTEGER) - communicator of the sender and receiver
• req (INTEGER) - output, identifier of the communication handle
• ierr (INTEGER) - output, error code (if ierr=0 no error occurred)

43. Example
To send an integer x from process 0 to process 1 and allow process 0 to continue:

int myrank;
int msgtag = 0;
MPI_Request req1;
MPI_Status status;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* find rank */
if (myrank == 0) {
    int x;
    MPI_Isend(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD, &req1);
    compute();
    MPI_Wait(&req1, &status);
} else if (myrank == 1) {
    int x;
    MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);
}
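
As a hedged variation (not from the original slides), MPI_Test() can be used instead of MPI_Wait() to poll for completion while continuing to compute; do_some_work() is a hypothetical placeholder for useful computation.

#include <mpi.h>

void do_some_work(void) { /* ... hypothetical useful computation ... */ }

int main(int argc, char *argv[]) {
    int myrank, x = 0, msgtag = 0, flag = 0;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0) {
        MPI_Isend(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD, &req);
        while (!flag) {
            do_some_work();                  /* keep computing ...            */
            MPI_Test(&req, &flag, &status);  /* ... until the send completes  */
        }
    } else if (myrank == 1) {
        MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);
    }

    MPI_Finalize();
    return 0;
}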

44. Send Communication Modes
• Standard mode send - it is not assumed that the corresponding receive routine has started. The amount of buffering is not defined by MPI. If buffering is provided, the send could complete before the receive is reached.
• Buffered mode - the send may start and return before a matching receive. It is necessary to specify buffer space via the routine MPI_Buffer_attach().
• Synchronous mode - send and receive can start before each other but can only complete together.
• Ready mode - the send can only start if the matching receive has already been reached; otherwise it is an error. Use with care.
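
A hedged sketch of buffered mode (not from the original slides): the user attaches buffer space with MPI_Buffer_attach() and sends with MPI_Bsend(); the buffer size shown is an arbitrary choice, padded with MPI_BSEND_OVERHEAD.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, x = 1, y = 0;
    int bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;   /* room for one int message */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        void *buffer = malloc(bufsize);
        MPI_Buffer_attach(buffer, bufsize);    /* supply user buffer space */

        /* Buffered-mode send: copies the message into the attached
           buffer and returns without waiting for the receive. */
        MPI_Bsend(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

        MPI_Buffer_detach(&buffer, &bufsize);  /* blocks until buffered sends complete */
        free(buffer);
    } else if (rank == 1) {
        MPI_Recv(&y, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    }

    MPI_Finalize();
    return 0;
}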

45. • Each of the four modes can be applied to both blocking and nonblocking send routines.
• Only the standard mode is available for the blocking and nonblocking receive routines.
• Any type of send routine can be used with any type of receive routine.

46. Communication Modes and MPI Subroutines

47. Collective Communication
Involves a set of processes, defined by an intra-communicator. Message tags are not present. Principal collective operations:
• MPI_BCAST() - broadcast from root to all other processes
• MPI_GATHER() - gather values from a group of processes
• MPI_SCATTER() - scatter a buffer in parts to a group of processes
• MPI_ALLTOALL() - send data from all processes to all processes
• MPI_REDUCE() - combine values on all processes to a single value
• MPI_REDUCE_SCATTER() - combine values and scatter the results
• MPI_SCAN() - compute prefix reductions of data on processes

48. Broadcast
Sending the same message to all processes concerned with the problem. Multicast - sending the same message to a defined group of processes.
[Diagram: every process calls bcast(); the contents of buf in the root are copied to buf in each of Process 0 through Process p-1. The MPI form is shown on the following slides.]

49. Broadcast Illustrated

50. Broadcast (MPI_BCAST)
One-to-all communication: the same data is sent from the root process to all the others in the communicator.
• Fortran:
  INTEGER count, type, root, comm, ierr
  CALL MPI_BCAST(buf, count, type, root, comm, ierr)
  buf is an array of type type.
• C:
  int MPI_Bcast(void *buf, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
• All processes must specify the same root and comm.
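
A minimal hedged C example of MPI_Bcast() (not from the original slides): the root sets a value and broadcasts it so that every process ends up with the same contents in n.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, n = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        n = 100;            /* only the root holds the value initially */

    /* After the call every process in MPI_COMM_WORLD has n == 100. */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Process %d has n = %d\n", rank, n);
    MPI_Finalize();
    return 0;
}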
