
Ace104 Lecture 8


Presentation Transcript


  1. Ace104 Lecture 8: Tightly Coupled Components, MPI (Message Passing Interface)

  2. Motivation • To this point we have focused on highly granular, loosely coupled components via web services (i.e., using SOAP/XML/HTTP) • Some components need to couple more tightly, for example because of • The rate and volume of data exchange • The granularity of interfaces • These components are normally controlled in a unified “back-end” environment, so inter-component security is a less prominent issue

  3. Multi-grained services • Tight coupling implies fine granularity, but not necessarily an RPC architectural style • Real-world architectures are built of multi-grained components • Low-granularity, loosely coupled components communicating via web services • These components are themselves made up of high-granularity (sub)components communicating via some more efficient mechanism • Java RMI • Raw sockets • MPI, etc.

  4. Role of MPI -- HPC is not all • One good example of this is speeding up numerical operations by parallelization • Risk management, option pricing, data mining, flow simulation, etc. • These faster components can then be coupled via web services (e.g. this is the common architectural model of Grid Computing) • However, tight coupling is more general than parallel computing • It can be used for any sub-service where performance matters, and has gained popularity recently in this area

  5. Standardization • Parallel computing community has resorted to “community-based” standards • HPF • MPI • OpenMP? • Some commercial products are becoming “de facto” standards, but only because they are portable • TotalView parallel debugger, PBS batch scheduler

  6. Risks of Standardization • Failure to involve all stakeholders can result in standard being ignored • application programmers • researchers • vendors • Premature standardization can limit production of new ideas by shutting off support for further research projects in the area

  7. Models for Parallel Computation • Shared memory (load, store, lock, unlock) • Message Passing (send, receive, broadcast, ...) • Transparent (compiler works magic) • Directive-based (compiler needs help) • Others (BSP, OpenMP, ...) • Task farming (scientific term for large transaction processing)

  8. The Message-Passing Model • A process is (traditionally) a program counter and address space • Processes may have multiple threads (program counters and associated stacks) sharing a single address space • Message passing is for communication among processes, which have separate address spaces • Interprocess communication consists of • synchronization • movement of data from one process’s address space to another’s

  9. What is MPI? • A message-passing library specification • extended message-passing model • not a language or compiler specification • not a specific implementation or product • For parallel computers, clusters, and heterogeneous networks • Full-featured • Designed to provide access to advanced parallel hardware for end users, library writers, and tool developers

  10. Where Did MPI Come From? • Early vendor systems (Intel’s NX, IBM’s EUI, TMC’s CMMD) were not portable (or very capable) • Early portable systems (PVM, p4, TCGMSG, Chameleon) were mainly research efforts • Did not address the full spectrum of issues • Lacked vendor support • Were not implemented at the most efficient level • The MPI Forum organized in 1992 with broad participation by: • vendors: IBM, Intel, TMC, SGI, Convex, Meiko • portability library writers: PVM, p4 • users: application scientists and library writers • finished in 18 months

  11. Novel Features of MPI • Communicators encapsulate communication spaces for library safety • Datatypes reduce copying costs and permit heterogeneity • Multiple communication modes allow precise buffer management • Extensive collective operations for scalable global communication • Process topologies permit efficient process placement, user views of process layout • Profiling interface encourages portable tools

  12. MPI References • The Standard itself: • at http://www.mpi-forum.org • All official MPI releases, in both PostScript and HTML • Books: • Using MPI: Portable Parallel Programming with the Message-Passing Interface, 2nd Edition, by Gropp, Lusk, and Skjellum, MIT Press, 1999. Also Using MPI-2, with R. Thakur • MPI: The Complete Reference, 2 vols., MIT Press, 1999 • Designing and Building Parallel Programs, by Ian Foster, Addison-Wesley, 1995 • Parallel Programming with MPI, by Peter Pacheco, Morgan-Kaufmann, 1997 • Other information on the Web: • at http://www.mcs.anl.gov/mpi • pointers to lots of material, including other talks and tutorials, a FAQ, and other MPI pages

  13. send/recv • Basic MPI functionality • MPI_Send(void *buf, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm) • MPI_Recv(void *buf, int count, MPI_Datatype type, int src, int tag, MPI_Comm comm, MPI_Status *stat) • The receive fills in an MPI_Status struct with at least the following fields • stat.MPI_SOURCE • stat.MPI_TAG • stat.MPI_ERROR
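
  A minimal sketch of a blocking send/recv pair using these calls (assumes the program is launched with at least two processes; the message value and tag are illustrative, not from the lecture code):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, value;
        MPI_Status stat;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* send one int to rank 1 with tag 0 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* receive one int from rank 0; stat records source, tag, error */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &stat);
            printf("rank 1 received %d from rank %d\n", value, stat.MPI_SOURCE);
        }

        MPI_Finalize();
        return 0;
    }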

  14. Blocking vs. non-blocking • The send/recv functions on the previous slide are referred to as blocking point-to-point communication • MPI also has non-blocking send/recv functions that will be studied next class – MPI_Isend, MPI_Irecv • The semantics of the two are very different – you must be very careful to understand the rules in order to write safe programs

  15. Blocking recv • Semantics of blocking recv • A blocking receive can be started whether or not a matching send has been posted • A blocking receive returns only after its receive buffer contains the newly received message • A blocking receive can complete before the matching send has completed (but only after it has started)

  16. Blocking send • Semantics of blocking send • Can start whether or not a matching recv has been posted • Returns only after the data in the send buffer is safe to be overwritten • This can mean that the data was either buffered or sent directly to the receiving process • Which happens is up to the implementation • Very strong implications for writing safe programs

  17. Examples

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(sendbuf, count, MPI_DOUBLE, 1, tag, comm);
        MPI_Recv(recvbuf, count, MPI_DOUBLE, 1, tag, comm, &stat);
    } else if (rank == 1) {
        MPI_Recv(recvbuf, count, MPI_DOUBLE, 0, tag, comm, &stat);
        MPI_Send(sendbuf, count, MPI_DOUBLE, 0, tag, comm);
    }

  Is this program safe? Why or why not?
  Yes, this is safe even if no buffer space is available!

  18. Examples

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Recv(recvbuf, count, MPI_DOUBLE, 1, tag, comm, &stat);
        MPI_Send(sendbuf, count, MPI_DOUBLE, 1, tag, comm);
    } else if (rank == 1) {
        MPI_Recv(recvbuf, count, MPI_DOUBLE, 0, tag, comm, &stat);
        MPI_Send(sendbuf, count, MPI_DOUBLE, 0, tag, comm);
    }

  Is this program safe? Why or why not?
  No, this will always deadlock!

  19. Examples

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(sendbuf, count, MPI_DOUBLE, 1, tag, comm);
        MPI_Recv(recvbuf, count, MPI_DOUBLE, 1, tag, comm, &stat);
    } else if (rank == 1) {
        MPI_Send(sendbuf, count, MPI_DOUBLE, 0, tag, comm);
        MPI_Recv(recvbuf, count, MPI_DOUBLE, 0, tag, comm, &stat);
    }

  Is this program safe? Why or why not?
  Often, but not always! It depends on buffer space.

  20. Message order • Messages in MPI are said to be non-overtaking • That is, messages sent from one process to another process are guaranteed to arrive in the order they were sent • However, nothing is guaranteed about the relative order of messages sent from different processes, regardless of when the sends were initiated

  21. Illustration of message ordering (figure): three processes, P0 (send), P1 (recv), and P2 (send); each send is labeled with its destination and tag (e.g. dest = 1, tag = 1) and each receive with the source and tag it matches, where * denotes a wildcard (MPI_ANY_SOURCE / MPI_ANY_TAG).

  22. Another example

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(buf1, count, MPI_FLOAT, 2, tag, comm);
        MPI_Send(buf2, count, MPI_FLOAT, 1, tag, comm);
    } else if (rank == 1) {
        MPI_Recv(buf2, count, MPI_FLOAT, 0, tag, comm, &stat);
        MPI_Send(buf2, count, MPI_FLOAT, 2, tag, comm);
    } else if (rank == 2) {
        MPI_Recv(buf1, count, MPI_FLOAT, MPI_ANY_SOURCE, tag, comm, &stat);
        MPI_Recv(buf2, count, MPI_FLOAT, MPI_ANY_SOURCE, tag, comm, &stat);
    }

  23. Illustration of previous code (figure): rank 0 posts two sends, rank 1 posts a recv followed by a send, and rank 2 posts two recvs using MPI_ANY_SOURCE. Which message will arrive first at rank 2? Impossible to say!

  24. Progress • Progress • If a pair of matching send/recv has been initiated, at least one of the two operations will complete, regardless of any other actions in the system • send will complete, unless recv is satisfied by another message • recv will complete, unless message sent is consumed by another matching recv

  25. Fairness • MPI makes no guarantee of fairness • If MPI_ANY_SOURCE is used, a sent message may repeatedly be overtaken by other messages (from different processes) that match the same receive.

  26. Send Modes • To this point, we have studied blocking send routines using standard mode • In standard mode, the implementation determines whether buffering occurs • This has major implications for writing safe programs

  27. Other send modes • MPI includes three other send modes that give the user explicit control over buffering. • These are: buffered, synchronous, and ready modes. • Corresponding MPI functions • MPI_Bsend • MPI_Ssend • MPI_Rsend

  28. MPI_Bsend • Buffered Send: allows the user to explicitly create buffer space and attach the buffer to send operations: • MPI_Bsend(void *buf, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm) • Note: these are the same arguments as standard send • MPI_Buffer_attach(void *buf, int size); • Creates buffer space to be used with Bsend • MPI_Buffer_detach(void *buf, int *size); • Note: in the detach case the void* argument is really a pointer to the buffer address, so that MPI can return the address of the detached buffer • Note: the detach call blocks until buffered messages have been safely sent • Note: It is up to the user to properly manage the buffer and ensure space is available for any Bsend call
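
  A minimal sketch of buffered send with explicit attach/detach (sizing the buffer with MPI_Pack_size plus MPI_BSEND_OVERHEAD follows the standard's recommendation; the function and variable names are illustrative):

    #include <mpi.h>
    #include <stdlib.h>

    void buffered_send_example(double *data, int count, int dest, MPI_Comm comm) {
        int pack_size, buf_size;
        void *buf, *old_buf;

        /* ask MPI how much space one message needs, add the per-message overhead */
        MPI_Pack_size(count, MPI_DOUBLE, comm, &pack_size);
        buf_size = pack_size + MPI_BSEND_OVERHEAD;

        buf = malloc(buf_size);
        MPI_Buffer_attach(buf, buf_size);     /* hand the buffer to MPI */

        MPI_Bsend(data, count, MPI_DOUBLE, dest, 0, comm);

        /* detach blocks until buffered messages have been transmitted */
        MPI_Buffer_detach(&old_buf, &buf_size);
        free(old_buf);
    }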

  29. MPI_Ssend • Synchronous Send • Ensures that no buffering is used • Couples the send and receive operations – the send cannot complete until the matching receive is posted and the message is fully copied to the remote processor • Very good for testing the buffer safety of a program

  30. MPI_Rsend • Ready Send • Matching receive must be posted before send, otherwise program is incorrect • Can be implemented to avoid handshake overhead when program is known to meet this condition • Not very typical + dangerous

  31. Implementation observations • MPI_Send could be implemented as MPI_Ssend, but this would be weird and undesirable • MPI_Rsend could be implemented as MPI_Ssend, but this would eliminate any performance enhancements • Standard mode (MPI_Send) is most likely to be efficiently implemented

  32. MPI’s Non-blocking Operations • Non-blocking operations return (immediately) “request handles” that can be tested and waited on: • MPI_Isend(start, count, datatype, dest, tag, comm, &request) • MPI_Irecv(start, count, datatype, src, tag, comm, &request) • MPI_Wait(&request, &status) • One can also test without waiting: MPI_Test(&request, &flag, &status)
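
  A minimal sketch of overlapping a non-blocking receive with computation (the do_local_work callback is a hypothetical stand-in for useful work, not part of the lecture code):

    #include <mpi.h>

    /* post a receive, overlap it with local work, then wait for completion */
    void overlap_example(double *buf, int count, void (*do_local_work)(void)) {
        MPI_Request request;
        MPI_Status status;

        /* post the receive immediately; it proceeds in the background */
        MPI_Irecv(buf, count, MPI_DOUBLE, MPI_ANY_SOURCE, 0,
                  MPI_COMM_WORLD, &request);

        do_local_work();            /* work that does not touch buf */

        /* block only when the received data is actually needed */
        MPI_Wait(&request, &status);
    }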

  33. Multiple Completions • It is sometimes desirable to wait on multiple requests: MPI_Waitall(count, array_of_requests, array_of_statuses) MPI_Waitany(count, array_of_requests, &index, &status) MPI_Waitsome(incount, array_of_requests, &outcount, array_of_indices, array_of_statuses) • There are corresponding versions of test for each of these.
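
  A brief sketch of MPI_Waitall (assumes the requests would be filled in by real MPI_Isend/MPI_Irecv calls; initializing them to MPI_REQUEST_NULL just keeps the fragment well defined):

    enum { NREQ = 4 };
    MPI_Request reqs[NREQ];
    MPI_Status  stats[NREQ];
    int i;

    for (i = 0; i < NREQ; i++)
        reqs[i] = MPI_REQUEST_NULL;   /* replace with posted Isend/Irecv requests */

    /* block until every request in the array has completed */
    MPI_Waitall(NREQ, reqs, stats);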

  34. Embarrassingly parallel examples • Mandelbrot set • Monte Carlo methods • Image manipulation

  35. Embarrassingly Parallel • Also referred to as naturally parallel • Each processor works on its own sub-chunk of data independently • Little or no communication required

  36. Mandelbrot Set • Creates pretty and interesting fractal images with a simple recursive formula zk+1 = zk * zk + c • Both z and c are complex numbers • For each point c we iterate this formula until either • A specified maximum number of iterations has occurred, or • The magnitude of z surpasses 2 • In the former case the point is taken to be in the Mandelbrot set • In the latter case it is not in the Mandelbrot set
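
  A minimal sketch of the per-point iteration just described (this is not the lecture's mandelbrot.c; escape_time and MAX_ITER are illustrative names):

    #include <complex.h>

    #define MAX_ITER 1000

    /* returns MAX_ITER if c appears to be in the set, otherwise the
       iteration count at which |z| exceeded 2 */
    int escape_time(double complex c) {
        double complex z = 0.0;
        int k;
        for (k = 0; k < MAX_ITER; k++) {
            if (cabs(z) > 2.0)
                return k;
            z = z * z + c;      /* z_{k+1} = z_k * z_k + c */
        }
        return MAX_ITER;
    }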

  37. Parallelizing Mandelbrot Set • What are the major defining features of problem? • Each point is computed completely independently of every other point • Load balancing issues – how to keep procs busy • Strategies for Parallelization?

  38. Mandelbrot Set Simple Example • See mandelbrot.c and mandelbrot_par.c for simple serial and parallel implementations • Think about how load balancing could be better handled

  39. Monte Carlo Methods • Generic description of a class of methods that uses random sampling to estimate values of integrals, etc. • A simple example is to estimate the value of pi

  40. Using Monte Carlo to Estimate π • Randomly sample points in a square that encloses a circle (figure) • The fraction of randomly selected points that lie in the circle approaches the ratio of the areas • The ratio of the area of the circle to the area of the square is π/4 • So what is the value of π?

  41. Parallelizing Monte Carlo • What are general features of algorithm? • Each sample is independent of the others • Memory is not an issue – master-slave architecture? • Getting independent random numbers in parallel is an issue. How can this be done?
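
  A minimal sketch of a parallel Monte Carlo estimate of π combined with MPI_Reduce (seeding each rank's C rand() with its rank is a simple but statistically naive answer to the independent-random-numbers question; all names are illustrative, not the lecture's code):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[]) {
        long n_local = 1000000, hits = 0, total_hits = 0, i;
        int rank, nprocs;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        srand(rank + 1);                 /* naive per-rank seeding */
        for (i = 0; i < n_local; i++) {
            double x = rand() / (double)RAND_MAX;
            double y = rand() / (double)RAND_MAX;
            if (x * x + y * y <= 1.0)    /* point falls inside the quarter circle */
                hits++;
        }

        /* sum the hit counts from all ranks onto rank 0 */
        MPI_Reduce(&hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("pi is approximately %f\n",
                   4.0 * total_hits / (double)(n_local * nprocs));

        MPI_Finalize();
        return 0;
    }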

  42. MPI Datatypes • The data in a message to send or receive is described by a triple (address, count, datatype), where • An MPI datatype is recursively defined as: • predefined, corresponding to a data type from the language (e.g., MPI_INT, MPI_DOUBLE) • a contiguous array of MPI datatypes • a strided block of datatypes • an indexed array of blocks of datatypes • an arbitrary structure of datatypes • There are MPI functions to construct custom datatypes, in particular ones for subarrays
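
  A brief sketch of constructing a strided-block datatype, here describing one column of a row-major matrix (N, column_t, and send_column are illustrative names, not from the lecture):

    #include <mpi.h>

    #define N 8

    /* one column of a row-major N x N double matrix:
       N blocks of 1 double, separated by a stride of N doubles */
    void send_column(double matrix[N][N], int col, int dest, MPI_Comm comm) {
        MPI_Datatype column_t;

        MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column_t);
        MPI_Type_commit(&column_t);

        /* send the whole column as a single element of the new type */
        MPI_Send(&matrix[0][col], 1, column_t, dest, 0, comm);

        MPI_Type_free(&column_t);
    }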

  43. MPI Tags • Messages are sent with an accompanying user-defined integer tag, to assist the receiving process in identifying the message • Messages can be screened at the receiving end by specifying a specific tag, or not screened by specifying MPI_ANY_TAG as the tag in a receive • Some non-MPI message-passing systems have called tags “message types”. MPI calls them tags to avoid confusion with datatypes

  44. MPI is Simple • Many MPI programs can be written using just these six functions, only two of which are non-trivial: • MPI_INIT • MPI_FINALIZE • MPI_COMM_SIZE • MPI_COMM_RANK • MPI_SEND • MPI_RECV
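
  A minimal sketch using only these six functions: a token passed around a ring of processes (assumes at least two processes; the token value and tag are arbitrary placeholders):

    #include <mpi.h>
    #include <stdio.h>

    /* rank 0 starts the token; each rank forwards it to the next rank */
    int main(int argc, char *argv[]) {
        int rank, size, token;
        MPI_Status stat;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            token = 1;
        } else {
            MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, &stat);
            token++;
        }
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD, &stat);
            printf("token made it around %d ranks: %d\n", size, token);
        }

        MPI_Finalize();
        return 0;
    }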

  45. Collective Operations in MPI • Collective operations are called by all processes in a communicator • MPI_BCAST distributes data from one process (the root) to all others in a communicator • MPI_REDUCE combines data from all processes in communicator and returns it to one process • In many numerical algorithms, SEND/RECEIVE can be replaced by BCAST/REDUCE, improving both simplicity and efficiency

  46. Example: PI in C - 1

    #include "mpi.h"
    #include <stdio.h>
    #include <math.h>

    int main(int argc, char *argv[])
    {
        int done = 0, n, myid, numprocs, i, rc;
        double PI25DT = 3.141592653589793238462643;
        double mypi, pi, h, sum, x, a;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        while (!done) {
            if (myid == 0) {
                printf("Enter the number of intervals: (0 quits) ");
                scanf("%d", &n);
            }
            MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
            if (n == 0) break;

  47. Example: PI in C - 2

            h = 1.0 / (double) n;
            sum = 0.0;
            for (i = myid + 1; i <= n; i += numprocs) {
                x = h * ((double)i - 0.5);
                sum += 4.0 / (1.0 + x*x);
            }
            mypi = h * sum;
            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
            if (myid == 0)
                printf("pi is approximately %.16f, Error is %.16f\n",
                       pi, fabs(pi - PI25DT));
        }
        MPI_Finalize();
        return 0;
    }

  48. Alternative Set of 6 Functions • Using collectives: • MPI_INIT • MPI_FINALIZE • MPI_COMM_SIZE • MPI_COMM_RANK • MPI_BCAST • MPI_REDUCE

  49. Buffers • When you send data, where does it go? One possibility (figure): the user data on process 0 is copied into a local buffer, moved across the network into a local buffer on process 1, and finally copied into the user data on process 1.

  50. Avoiding Buffering • It is better to avoid copies (figure): the user data on process 0 moves across the network directly into the user data on process 1 • This requires that MPI_Send wait on delivery, or that MPI_Send return before the transfer is complete and we wait later.
