
A Very Short Introduction to MPI



  1. A Very Short Introduction to MPI Vincent Keller, CADMOS (with Basile Schaeli, EPFL – I&C – LSP, 2008 - 2013)

  2. MPI • Run multiple instances of the same program • mpiexec -n p myApp args • Starts p instances of myApp • Instances exchange information by sending messages to each other • Communications take place within a communicator • Set of processes, indexed from 0 to communicatorSize-1 • MPI_COMM_WORLD predefined communicator containing all processes • Compile with mpicc (C), mpicxx (C++) or mpif77/mpif90 (Fortran) • All are wrappers around regular compilers (try e.g. mpicc -show)

  3. hello.c

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int size, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("I'm process %d out of %d\n", rank, size);
    MPI_Finalize();
}

  4. Point-to-point communications
  MPI_Send(buf, count, datatype, destination, tag, communicator)
  MPI_Recv(buf, count, datatype, source, tag, communicator, status)
  • Sends (receives) count elements of type datatype from (into) buffer buf • Each send must be matched by a receive! • You must know source, tag, communicator and size of incoming message • If you don’t know, use MPI_Probe / MPI_Iprobe to get that information
  Mismatches cause race conditions or deadlocks

  5. ping.c

#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank;
    int buf[100];
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        MPI_Send(buf, 100, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else
        MPI_Recv(buf, 100, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    MPI_Finalize();
}

  6. Watch out for deadlocks! • ping.c runs OK with 2 processes • Deadlocks with more than 2 processes: every additional process posts a recv(0, 0) that is never matched

With 2 processes:
Process 0: send(1, 0); finalize()
Process 1: recv(0, 0); finalize()

With 3 processes:
Process 0: send(1, 0); finalize()
Process 1: recv(0, 0); finalize()
Process 2: recv(0, 0) (never matched)

  7. ping_correct.c

#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank;
    int buf[100];
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        MPI_Send(buf, 100, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf, 100, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    MPI_Finalize();
}

  8. Blocking communications • MPI_Send • Returns when buffer can be reused • MPI_Ssend • Returns when other end called matching recv • MPI_Recv • Returns when message has been received • MPI_Sendrecv • Sends and receives within the same call to avoid deadlocks • See also: • MPI_Bsend, MPI_Rsend, MPI_Sendrecv_replace

  9. Non-blocking communications • MPI_Isend, MPI_Irecv • Do not wait for message to be buffered / sent / received • Fills an additional MPI_Request parameter that identifies the request • Do not use / modify / delete buffers until request completed • Wait calls block until request(s) completed
  MPI_Wait(request, status)
  MPI_Waitall(count, array_of_requests, array_of_statuses)
  • See also: • MPI_Issend, MPI_Ibsend, MPI_Irsend, MPI_Waitsome, MPI_Waitany, MPI_Test, MPI_Testall, MPI_Testany, MPI_Testsome
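A minimal sketch of this pattern, overlapping computation with a non-blocking send and polling for completion with MPI_Test (the do_some_work() helper is a hypothetical placeholder for whatever computation overlaps the transfer):

#include <mpi.h>

static void do_some_work(void) { /* hypothetical computation overlapped with the transfer */ }

int main(int argc, char *argv[]) {
    int rank, done = 0;
    int buf[100] = {0};
    MPI_Request request;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* Start the send, then keep computing while it progresses */
        MPI_Isend(buf, 100, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
        while (!done) {
            do_some_work();
            MPI_Test(&request, &done, MPI_STATUS_IGNORE);   /* non-blocking completion check */
        }
    } else if (rank == 1) {
        MPI_Recv(buf, 100, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}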

  10. iping.c

#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank;
    int buf[100];
    MPI_Request request;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        MPI_Isend(buf, 100, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
    else if (rank == 1)
        MPI_Irecv(buf, 100, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
    MPI_Wait(&request, &status);
    MPI_Finalize();
}

  11. exchange.c • Processes 0 and 1 exchange the content of their buffers

if (rank == 0) {
    MPI_Isend(buf1, 10, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
    MPI_Recv(buf2, 10, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
} else if (rank == 1) {
    MPI_Isend(buf1, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
    MPI_Recv(buf2, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
MPI_Wait(&request, &status);
memcpy(buf1, buf2, 10*sizeof(int));

  12. exchange_sendrecv.c • Processes 0 and 1 exchange the content of their buffers

if (rank == 0) {
    MPI_Sendrecv(buf1, 10, MPI_INT, 1, 0,
                 buf2, 10, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
} else if (rank == 1) {
    MPI_Sendrecv(buf1, 10, MPI_INT, 0, 0,
                 buf2, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
memcpy(buf1, buf2, 10*sizeof(int));

  13. exchange_sendrecv_replace.c • Processes 0 and 1 exchange the content of their buffers

if (rank == 0) {
    MPI_Sendrecv_replace(buf1, 10, MPI_INT, 1, 0, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
} else if (rank == 1) {
    MPI_Sendrecv_replace(buf1, 10, MPI_INT, 0, 0, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

  14. Wildcard receives • MPI_ANY_SOURCE and MPI_ANY_TAG are wildcards

switch(rank) {
case 0:
    MPI_Send(..., 1, 0, ...);
    break;
case 1:
    MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ...);
    MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ...);
    break;
case 2:
    MPI_Send(..., 1, 1, ...);
    break;
}

Process 0: send(1, 0)
Process 1: recv(*, *); recv(*, *)
Process 2: send(1, 1)

  15. Barrier synchronization
  MPI_Barrier(communicator) • Returns when all processes in communicator have joined

switch(rank) {
case 0:
    MPI_Send(..., dest = 1, tag = 0, ...);
    MPI_Barrier(MPI_COMM_WORLD);
    break;
case 1:
    MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ...);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ...);
    break;
case 2:
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Send(..., dest = 1, tag = 1, ...);
    break;
}

  16. Barriers • Order of calls on process 1 is important • recv – barrier – recv: correct • barrier – recv – recv: message race or deadlock (depends on message size) • recv – recv – barrier: deadlock

Process 0: send(1, 0); barrier
Process 1: recv(*, *); barrier; recv(*, *)
Process 2: barrier; send(1, 1)

  17. Barriers • barrier – recv – recv • MPI_Send on process 0 may succeed if the message can be buffered (MPI_Ssend always deadlocks here, MPI_Isend always returns)

Process 0: send(1, 0); barrier
Process 1: barrier; recv(*, *); recv(*, *)
Process 2: barrier; send(1, 1)

  18. Barriers • recv – recv – barrier • Process 1 never enters the barrier since its second recv is never matched

Process 0: send(1, 0); barrier
Process 1: recv(*, 0); recv(2, *); barrier
Process 2: barrier; send(1, 1)

  19. Collective communications • MPI_Bcast(buffer, count, datatype, root, comm) • Sends the same data to every process • MPI_Scatter(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm) • Sends pieces of a buffer to every process of the communicator • “The outcome is as if the root executed n send operations MPI_Send(sendbuf+i*sendcount*extent(sendtype), sendcount, sendtype, i, ...) and each process executed a receive MPI_Recv(recvbuf, recvcount, recvtype, i, ...)”
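A minimal sketch of these two collectives, assuming an illustrative chunk size of 10 ints per process:

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, params[4] = {0};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* Root fills the parameters; MPI_Bcast copies them to every process */
    if (rank == 0)
        params[0] = 42;
    MPI_Bcast(params, 4, MPI_INT, 0, MPI_COMM_WORLD);
    /* Root scatters one chunk of 10 ints to each process (including itself) */
    int *sendbuf = NULL;
    int recvbuf[10];
    if (rank == 0) {
        sendbuf = malloc(size * 10 * sizeof(int));
        for (int i = 0; i < size * 10; i++)
            sendbuf[i] = i;   /* illustrative data */
    }
    MPI_Scatter(sendbuf, 10, MPI_INT, recvbuf, 10, MPI_INT, 0, MPI_COMM_WORLD);
    free(sendbuf);   /* free(NULL) is a no-op on non-root ranks */
    MPI_Finalize();
    return 0;
}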

  20. Collective communications • MPI_Gather • Retrieves pieces of data from every process • MPI_Allgather • All pieces retrieved by all processes • MPI_Reduce • Performs a reduction operation (MPI_SUM, …) across all processes • E.g. dot product on distributed vectors (see the sketch below) • MPI_Allreduce • Result distributed to all processes • MPI_Alltoall • Sends all data to all processes • Every process of the communicator must participate • Parameters must match
  Mismatches cause race conditions or deadlocks
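For example, the distributed dot product mentioned above could be computed with MPI_Allreduce roughly as follows (a sketch assuming each process already holds its local chunks localA and localB of length n):

#include <mpi.h>

/* Dot product of two distributed vectors: each process holds a local chunk of length n */
double distributed_dot(const double *localA, const double *localB, int n, MPI_Comm comm) {
    double local = 0.0, global = 0.0;
    for (int i = 0; i < n; i++)
        local += localA[i] * localB[i];
    /* Sum the partial products across all processes; every rank receives the total */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}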

  21. Receiving image parts in order

/* Generate image parts */
...
/* Each process sends */
MPI_Isend(imgPart, partSize, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &request);
/* Process 0 receives all parts into buf */
if (rank == 0) {
    char *buf = malloc(nProcs*partSize);
    for (int i = 0; i < nProcs; i++)
        MPI_Recv(buf + i*partSize, partSize, MPI_BYTE, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
MPI_Wait(&request, MPI_STATUS_IGNORE);

  22. Receiving image parts out-of-order

/* Generate image parts */
...
/* Each process sends */
MPI_Isend(imgPart, partSize, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &request);
/* Process 0 receives all parts into buf */
if (rank == 0) {
    char *buf = malloc(nProcs*partSize);
    MPI_Status s;
    int count;
    for (int i = 0; i < nProcs; i++) {
        MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &s);
        MPI_Get_count(&s, MPI_BYTE, &count);
        MPI_Recv(buf + s.MPI_SOURCE*count, count, MPI_BYTE, s.MPI_SOURCE, s.MPI_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}
MPI_Wait(&request, MPI_STATUS_IGNORE);

  23. Receiving image parts with a collective

/* Generate image parts */
...
/* Process 0 is the root of the collective, i.e. the receiver of all parts */
int root = 0;
char *buf = NULL;
if (rank == root)   /* Only the root must allocate buf */
    buf = malloc(nProcs*partSize);
MPI_Gather(part, partSize, MPI_BYTE, buf, partSize, MPI_BYTE, root, MPI_COMM_WORLD);

  24. Collective communications • Avoid putting collective calls within conditional clauses • Example from slide 15:

/* Send from 0 to 1 */
if (rank == 0)
    MPI_Send(..., 1, 0, ...);
else if (rank == 1)
    MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ...);
MPI_Barrier(MPI_COMM_WORLD);
/* Send from 2 to 1 */
if (rank == 1)
    MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ...);
else if (rank == 2)
    MPI_Send(..., 1, 1, ...);

  25. Custom communicators and datatypes • You can define your own communicators • MPI_Comm_dup duplicates a communicator (e.g. to enable private communications within library functions such as DPS) • MPI_Comm_split splits a communicator into multiple smaller communicators (useful for 2D and 3D domain decompositions; the DPS game of life is 1D) • You can define custom datatypes, e.g. • Simple structs (MPI_Type_struct) • Vectors (MPI_Type_vector) • No pointers
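As an illustration of MPI_Comm_split, here is a minimal sketch that builds one communicator per row of a 2D process grid (the grid width NCOLS is an assumed example value):

#include <stdio.h>
#include <mpi.h>

#define NCOLS 4   /* assumed width of the 2D process grid */

int main(int argc, char *argv[]) {
    int rank, row_rank;
    MPI_Comm row_comm;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Processes with the same color (row index) end up in the same sub-communicator */
    MPI_Comm_split(MPI_COMM_WORLD, rank / NCOLS, rank, &row_comm);
    MPI_Comm_rank(row_comm, &row_rank);
    printf("global rank %d is rank %d in its row\n", rank, row_rank);
    /* Collectives can now be restricted to one row, e.g. MPI_Bcast(..., 0, row_comm) */
    MPI_Comm_free(&row_comm);
    MPI_Finalize();
    return 0;
}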

  26. Does this program terminate? (assume 2 processes)

int rank;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank)
    MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
else
    MPI_Recv(&rank, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
if (rank)
    MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
else
    MPI_Recv(&rank, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Finalize();

  27. MPI vs. DPS • MPI does not impose a communication pattern + Suitable for all problems – Everything is left to the user – Serialization up to the user (but custom datatypes can be created) – Deadlocks may occur • DPS requires an acyclic graph – Some (many?) problems not naturally mapped onto a graph – Overheads for creating operations, threads, data objects + Graph provides convenient graphical representation, makes it easy to build pipelined applications + Built-in load-balancing, flow control + Automatic serialization of complex datatypes (pointers, STL, …) + Built-in checkpointing + Deadlock-free

  28. MPI vs. DPS • MPI communicators represent groups of processes • Memory space private to each process • One process may belong to multiple communicators • DPS thread collections represent groups of threads • Thread state private to each thread • One thread only belongs to a single thread collection • DPS runs on top of MPI • DPS applications become MPI applications • Submit DPS jobs like any MPI job (integrates nicely with infrastructure) • We could not implement MPI on top of DPS

  29. MPI vs. DPS • MPI allocates processes statically • MPI 2.0 adds support for adding and removing processes dynamically • Whole application stops if one process (or one computer) crashes • When using TCP, DPS allocates processes dynamically • Fault-tolerance is built into the framework • For simple applications: just add -ft on the command line • More in “Fault-tolerant applications” in the “Advanced Tutorial” chapter • Both can run on heterogeneous clusters • Hence datatypes (MPI_INT, MPI_DOUBLE, …) to handle byte ordering • DPS tested on Solaris/Windows/Mac/Linux on x86/Sparc/PowerPC

  30. DPS over MPI

#include <cassert>
#include <mpi.h>
// (plus the DPS headers)

int dpsMain(int argc, char *argv[], dps::Application *app) {
    int support, rank;
    // Initialize MPI with multithreading
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &support);
    assert(support == MPI_THREAD_MULTIPLE);
    dps::Controller c(argc, argv, app);
    if (!app->init())      // Every process calls init()
        return 1;          // abort if init returned false
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        app->start();      // Only the main instance calls start()
    c.run();               // All instances start waiting for messages
    MPI_Finalize();
    return 0;
}

int main(int argc, char *argv[]) {
    return dps::dpsMain(argc, argv, new MyApp());
}

  31. Useful links Function list and official MPI documentation: http://www.mpi-forum.org/docs/mpi-11-html/node182.html A quick MPI reference: http://www.mathematik.uni-marburg.de/~berthold/parprog05/MPI-QRef.pdf
