
MPI - A User level Message-passing Interface



  1. MPI - A User level Message-passing Interface K. Ramakrishna Shenai

  2. History of MPI • Principal branches of MPI • MPI 1.2 version • MPI 2.0 version • An MPI process is normally written in C (or C++) or Fortran and linked with the MPI libraries.

  3. MPI Basic functionality • An MPI program consists of • A set of processes, each occupying a separate, unshared address space • Communication of data by invoking MPI procedures • MPI 2 adds the ability to create and destroy processes.

  4. MPI Basic functionality (contd.) • MPI processes • Can be instances of the same executable file running in different address spaces • Can be instances of different executables, built from different source files and linked together • Can be passed different command-line arguments • Rank in the MPI_COMM_WORLD communicator • Each process has a unique rank in the range [0 .. p-1]

  5. /* Example Program: Hello world */
     #include <mpi.h>
     #include <stdio.h>
     #include <stdlib.h>
     #include <string.h>

     int main(int argc, char *argv[]) {
         int rank, size, partner;
         int namelen;
         char name[MPI_MAX_PROCESSOR_NAME];
         char greeting[sizeof(name) + 100];

         MPI_Init(&argc, &argv);                  /* Initialize MPI */
         MPI_Comm_size(MPI_COMM_WORLD, &size);    /* How many processes are there? */
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* Which one am I? */
         MPI_Get_processor_name(name, &namelen);  /* Where am I running? */

         sprintf(greeting, "hello world %d of %d running on %s\n", rank, size, name);
         if (rank == 0) {
             fputs(greeting, stdout);
             for (partner = 1; partner < size; partner++) {
                 MPI_Status stat;
                 MPI_Recv(greeting, sizeof(greeting), MPI_BYTE, partner, 1,
                          MPI_COMM_WORLD, &stat);
                 fputs(greeting, stdout);
             }
         } else {
             MPI_Send(greeting, strlen(greeting) + 1, MPI_BYTE, 0, 1, MPI_COMM_WORLD);
         }
         MPI_Finalize();
         exit(0);
     }

  6. Basic Concepts • MPI_Init and MPI_Finalize • In C, MPI_Init should be passed the addresses of argc and argv. This procedure must be called before any other MPI procedure. • MPI_Finalize: a terminating procedure, after which no other MPI procedure may be called.

  7. Basic Concepts • Communicators • Requirement: the caller specifies a communicator argument, which defines the context in which communication takes place • In C, communicators have the type MPI_Comm • Every process obtains the communicator's size with MPI_Comm_size and its own rank within it with MPI_Comm_rank • MPI_Get_processor_name • A procedure used to determine which physical processor the current process is running on. It usually returns the system's Internet hostname.

  8. Basic Concepts: MPI_Send and MPI_Recv • They are point-to-point communication primitives • Messages are transmitted directly from the sender's memory to the receiver's memory • C prototypes • int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm) • int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status) • Mainly a blocking mode of communication; however, in certain cases a non-blocking mode of communication is possible • Other communication modes (buffered, synchronous and ready) exist

  9. Basic Concepts • MPI_Status structure: can be examined to determine the source and tag of any received message • Loosely synchronous operations: all processes engaged in a communication operation must call the corresponding MPI procedure at the same logical point in execution, with respect to all other MPI communication calls.
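      A brief sketch (not on the original slide) of examining the status after a wildcard receive; buf and BUFSIZE are assumed names:

      MPI_Status stat;
      int count;

      /* Receive from any source with any tag, then inspect the status */
      MPI_Recv(buf, BUFSIZE, MPI_BYTE, MPI_ANY_SOURCE, MPI_ANY_TAG,
               MPI_COMM_WORLD, &stat);
      printf("message from rank %d with tag %d\n", stat.MPI_SOURCE, stat.MPI_TAG);

      /* Number of MPI_BYTE elements actually received */
      MPI_Get_count(&stat, MPI_BYTE, &count);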

  10. Basic Concepts: Balancing Sends and Receives
      /* Incorrect: may deadlock, depending on buffering */
      MPI_Send(out, count, type, partner, tag, comm);
      MPI_Recv(in, count, type, partner, tag, comm, &stat);

      /* Correct: every send is matched by a receive in the right order */
      if (rank > partner) {
          MPI_Send(out, count, type, partner, tag, comm);
          MPI_Recv(in, count, type, partner, tag, comm, &stat);
      } else if (rank < partner) {
          MPI_Recv(in, count, type, partner, tag, comm, &stat);
          MPI_Send(out, count, type, partner, tag, comm);
      } else {  /* rank == partner */
          MPI_Type_size(type, &sz);
          memcpy(in, out, count * sz);  /* contents of stat not set */
      }

      /* Also correct: use the single MPI_Sendrecv procedure */
      MPI_Sendrecv(out, count, type, partner, tag,
                   in, count, type, partner, tag, comm, &stat);

  11. Basic Concepts • MPI data types: elementary MPI data types in C • MPI_CHAR, MPI_SHORT, MPI_INT, MPI_LONG, MPI_UNSIGNED_CHAR, MPI_UNSIGNED_SHORT, MPI_UNSIGNED, MPI_UNSIGNED_LONG, MPI_FLOAT, MPI_DOUBLE, MPI_LONG_DOUBLE, MPI_BYTE • Use of standard libraries • MPI does not alter any standard libraries. • Calls to external libraries (especially I/O libraries) should be made from a distinguished process, the rank 0 process. • LAM and MPICH are both quite flexible when it comes to handling such libraries. • Errors in MPI (a sketch follows)
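      The "Errors in MPI" bullet above is only named on the slide; as a hedged sketch, error handling might look like the following, using MPI_Errhandler_set to switch MPI_COMM_WORLD from the default abort-on-error behaviour to returning error codes (the out/count/type/partner/tag names follow slide 10):

      int  err, len;
      char msg[MPI_MAX_ERROR_STRING];

      /* By default MPI aborts on error; ask for error codes to be returned instead */
      MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

      err = MPI_Send(out, count, type, partner, tag, MPI_COMM_WORLD);
      if (err != MPI_SUCCESS) {
          MPI_Error_string(err, msg, &len);
          fprintf(stderr, "MPI_Send failed: %s\n", msg);
      }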

  12. Compiling using MPI on the Beowulf • MPICH is installed in a public location like /usr/local/mpich, also called <MPI_ROOT> • All user-oriented utilities are stored in <MPI_ROOT>/bin, header files in <MPI_ROOT>/include and libraries in <MPI_ROOT>/lib/LINUX/ch_p4/ • <MPI_ROOT>/bin contains mpicc, a script for compiling C programs with the proper options • Two ways of compiling a C program to produce the executable: • MPIROOT=/usr/local/mpich • mpicc hello.c -o hello • cc -o hello -I$MPIROOT/include hello.c -L$MPIROOT/lib/LINUX/ch_p4 -lmpich

  13. Running an MPI Program • mpirun utility • Options • -np <number_of_processes>: starts the program with the specified number of processes • -machinefile <filename>: explicitly specifies the list of processors • -nolocal: forces the first process to execute on the first processor specified in the processor list • Otherwise, mpirun will start process 0 on the processor that called mpirun.
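      For example (the hostname file and program name are illustrative only):

      mpirun -np 8 hello
      mpirun -np 8 -machinefile machines.beowulf hello   # machines.beowulf lists one host per line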

  14. Parallel Data Structures with MPI
      • MPI provides a simple memory model over which parallel data structures can be created.
      • Every symbol is a parallel object that exists on every process; the symbol can take different values on different processes.
      • Example implementation: a parallel array, described by the parray structure
      typedef struct {
          int nelem_local;        /* the number of elements in local memory */
          void *data;
          MPI_Comm comm;
          int first_index_local;  /* global index of the first element of the
                                     local data (whose local index is 0) */
          int nelem_global;       /* aggregate number of elements */
          size_t elem_size;
          int commrank, commsize;
      } parray;

  15. /* Procedure to create a self-consistent parallel array from information
         provided by each process about its own local data */
      void pacreate(parray *pa, void *data, int nelem, size_t size, MPI_Comm comm)
      {
          int last_index;
          pa->nelem_local = nelem;
          pa->data = data;
          pa->comm = comm;
          pa->elem_size = size;
          /* My first element follows the elements held by lower-ranked processes */
          MPI_Scan(&pa->nelem_local, &last_index, 1, MPI_INT, MPI_SUM, pa->comm);
          pa->first_index_local = last_index - pa->nelem_local;
          MPI_Comm_size(pa->comm, &pa->commsize);
          MPI_Comm_rank(pa->comm, &pa->commrank);
          /* The global element count is the last index on the highest-ranked
             process (commsize - 1); use Bcast to distribute it */
          MPI_Bcast(&last_index, 1, MPI_INT, pa->commsize - 1, pa->comm);
          pa->nelem_global = last_index;
      }

  16. Example: A Parallel Array • The parray structure automatically exists on every process, once defined. • Collective communication • MPI_Bcast sends the contents of a buffer on one process to every other process in the communicator • MPI_Scan performs an operation (like '+') on values supplied by each process; the result returned on process i is the result of the operation applied to the values supplied by every process of rank 0 to i • MPI predefined reduction/scan operations: MPI_MAX, MPI_SUM, MPI_PROD, MPI_MAXLOC (returns an extremum together with a location, like an array index or rank), etc.

  17. Example: A Parallel Array (contd.) • MPI_Allreduce computes the result of an operation applied to a value supplied by every process; the result is returned to every process. Its result matches what MPI_Scan returns on the highest-ranked process. • Further collectives such as MPI_Gather and MPI_Scatter exist, and some collectives are combinations of two or more operations, e.g. MPI_Allreduce is equivalent to MPI_Reduce followed by MPI_Bcast (a sketch follows).
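      A small sketch of the equivalence just mentioned (not on the original slide); sum and sumall follow the names used in the parraymean procedure on the next slide, and comm is a communicator:

      /* One collective call ... */
      MPI_Allreduce(&sum, &sumall, 1, MPI_DOUBLE, MPI_SUM, comm);

      /* ... is logically equivalent to a reduction onto rank 0 followed by a broadcast */
      MPI_Reduce(&sum, &sumall, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
      MPI_Bcast(&sumall, 1, MPI_DOUBLE, 0, comm);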

  18. /* Procedure to compute the mean of the elements in a parallel array
         (here assumed to hold doubles) */
      double parraymean(parray *pa)
      {
          double sum, sumall;
          double *elem = (double *) pa->data;
          int i;
          /* Compute the local sum of elements */
          sum = 0;
          for (i = 0; i < pa->nelem_local; i++) {
              sum = sum + elem[i];
          }
          /* Use Allreduce to get the global sum on every process */
          MPI_Allreduce(&sum, &sumall, 1, MPI_DOUBLE, MPI_SUM, pa->comm);
          return sumall / pa->nelem_global;
      }

  19. A One-dimensional Cellular Automaton
      • One-dimensional cellular automata (CA) are simple dynamic systems in which the placement of data is meaningful, requiring point-to-point communication.
      • A CA with half-width hw is an array of values whose update rule states that the next value in location i depends only on the previous values in locations (i-hw, ..., i, ..., i+hw).
      • The values of a cellular automaton can be integers, bits, etc.
      • The update rule can be an arbitrary function of the 2*hw+1 input values. Special cases include linear functions and functions like "parity", which count the values in the input domain.
      • Structure definition of the one-dimensional cellular automaton:
      typedef struct ca_s {
          unsigned char *state;   /* size ncells */
          int A;
          int hw;
          int ncells;
          unsigned char *old;     /* size ncells + 2*hw */
          /* other fields that control updateRule */
      } CA_t;
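      As an illustration (not part of the original slides), an update rule such as "parity" might be sketched as follows; it assumes the CA_t layout above and the calling convention used by CAupdate on the next slide:

      /* Hypothetical parity rule: the new cell value is 1 if an odd number of
         the 2*hw+1 old values around it are set.  oldcenter points at the old
         value of the cell being updated, so oldcenter[-hw .. +hw] are its
         neighbors (the padding is supplied by CAcopystate). */
      static unsigned char updateRule(CA_t *ca, unsigned char *oldcenter)
      {
          int j;
          unsigned char sum = 0;
          for (j = -ca->hw; j <= ca->hw; j++)
              sum += oldcenter[j];
          return sum & 1;
      }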

  20. /* Updating a cellular automaton */
      void CAiterate(CA_t *ca, int ntimes)
      {
          int i;
          for (i = 0; i < ntimes; i++) {
              CAcopystate(ca);
              CAupdate(ca);
          }
      }

      static void CAupdate(CA_t *ca)
      {
          int n = ca->ncells;
          unsigned char *oldcenter = ca->old;
          unsigned char *new = ca->state;
          while (n-- > 0) {
              *new++ = updateRule(ca, oldcenter++);
          }
      }

      static void CAcopystate(CA_t *ca)
      {
          memcpy(ca->old, ca->state, ca->ncells);
          /* Now for periodic boundary conditions */
          memcpy(ca->old - ca->hw, ca->state + (ca->ncells - ca->hw), ca->hw);
          memcpy(ca->old + ca->ncells, ca->state, ca->hw);
      }

  21. A One-dimensional Cellular Automaton (contd.)
      • The two operations in CAiterate:
      • CAcopystate copies the contents of the state array into the old array so that new values can be written into state. The second phase, CAupdate, computes the new state from the values in the old array.
      • Periodic boundary conditions are imposed by padding the old array at both ends with hw values copied from the opposite end of the state.
      [Figure: the state array A B C D E F G H is copied into old as F G H | A B C D E F G H | A B C; each arrow indicates a call to memcpy.]

  22. Parallelizing the problem!
      • General guideline: keep as much of the sequential code intact as possible. The order of the elements, and the relationship between elements stored on different processors, is important.
      • Boundary data must be exchanged between neighbors.
      • Modifications:
      • An MPI_Comm field is added to the CA structure
      • The boundary calls to memcpy in CAcopystate are replaced by MPI_Sendrecv (next slide)
      typedef struct ca_s {
          unsigned char *state;   /* size ncells */
          int A;
          int hw;
          int ncells;
          unsigned char *old;     /* size ncells + 2*hw */
          MPI_Comm comm;
          /* other fields that control updateRule */
      } CA_t;

  23. Parallelizing the problem!
      static void CAcopystate(CA_t *ca)
      {
          int up_neighbor, down_neighbor, myrank, nproc;
          MPI_Status stat;
          memcpy(ca->old, ca->state, ca->ncells);
          MPI_Comm_rank(ca->comm, &myrank);
          MPI_Comm_size(ca->comm, &nproc);
          up_neighbor = (myrank + 1) % nproc;
          down_neighbor = (myrank + nproc - 1) % nproc;
          /* Send my top edge up; receive my lower padding from below */
          MPI_Sendrecv(ca->state + (ca->ncells - ca->hw), ca->hw, MPI_BYTE,
                       up_neighbor, CA_UPTAG,
                       ca->old - ca->hw, ca->hw, MPI_BYTE,
                       down_neighbor, CA_UPTAG, ca->comm, &stat);
          /* Send my bottom edge down; receive my upper padding from above */
          MPI_Sendrecv(ca->state, ca->hw, MPI_BYTE,
                       down_neighbor, CA_DOWNTAG,
                       ca->old + ca->ncells, ca->hw, MPI_BYTE,
                       up_neighbor, CA_DOWNTAG, ca->comm, &stat);
      }

  24. [Figure: two processors are shown. In the parallel version, when the state of the CA is copied to old, it is padded at both ends with data from the neighboring processors. Diagonal arrows represent data transfers by MPI_Sendrecv, and dotted arrows represent a local memcpy.]

  25. CONSIDERATIONS • Domain decomposition and boundaries: build the parallel version of a problem as a collection of near-copies of a sequential implementation, with special boundary conditions derived by communication with neighbors. • Start with a good sequential implementation: • Fundamental algorithms are designed sequentially by default, so the changes required for parallelization should be as small as possible. • For the CA, it is a matter of obtaining boundary conditions from the neighboring processors. • Avoid changing the fundamental sequential parts of the code as far as possible.

  26. CONSIDERATIONS • Communication should be treated as a phase separate from local computation; it has a major effect on performance. • Performance considerations: • Communication is slow, and communication latency can dominate the overall time. • Even if the local computation is fast, eventual performance suffers because of the time spent in communication.

  27. CONSIDERATIONS • Time taken to complete one iteration over N cells with P processors: • t_step = 2*t_latency + (N/P)*t_update = (N*t_update/P) * (1 + 2*t_latency/((N/P)*t_update)) • The first factor is just the time taken on one processor divided by P, the perfect-speedup result. • The second factor expresses how much worse than perfect the actual implementation is: it is one plus the ratio of the time spent on communication, 2*t_latency, to the time spent in CAupdate, (N/P)*t_update.
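      For illustration only, with hypothetical values t_latency = 100 microseconds, t_update = 1 microsecond, N = 10^6 cells and P = 16 processors: (N/P)*t_update = 62,500 microseconds while 2*t_latency = 200 microseconds, so the correction factor is 1 + 200/62,500 ≈ 1.003, i.e. each iteration runs only about 0.3% slower than the perfect-speedup time of 62.5 ms. With much smaller N or much larger P, the latency term dominates instead.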

  28. MPI Advanced Features • Blocking and non-blocking calls, and alternative sending modes • Blocking calls always wait for the requested action to complete before returning control to the caller. • Non-blocking calls initiate the requested action in the background and return immediately; they are used, among other things, to avoid unnecessary copying between system buffers and user memory. • Standard-mode sends • Buffered-mode sends • Synchronous-mode sends
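      As a minimal sketch (not from the original slides) of the non-blocking calls, reusing the out/in/count/type/partner/tag/comm names from slide 10:

      MPI_Request reqs[2];
      MPI_Status  stats[2];

      MPI_Irecv(in, count, type, partner, tag, comm, &reqs[0]);   /* post the receive first */
      MPI_Isend(out, count, type, partner, tag, comm, &reqs[1]);  /* start the send in the background */
      /* ... local computation that touches neither 'in' nor 'out' ... */
      MPI_Waitall(2, reqs, stats);                                /* both transfers are now complete */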

  29. MPI Advanced Features • Derived data types • MPI includes a powerful set of procedures that can be used to refer to non-contiguous data or data of heterogeneous type, for example an integer followed by a set of floating-point values. • Latency is a major problem on Beowulf systems, especially for the transfer of small messages; an alternative is to pack multiple logical messages into a single physical message and incur the startup cost only once. • Intercommunicators • Normally there is MPI_COMM_WORLD, the group of all active processes, and further communicators are derived by selecting subsets of it; sometimes the resulting logical structures are insufficient. • It is then desirable to construct a new communicator from the union of two existing communicators; these are called intercommunicators, and they are used in applications involving functional parallelism.
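      A hedged sketch of the heterogeneous case above (an integer followed by floating-point values), using the MPI-1 calls MPI_Address and MPI_Type_struct; the particle structure and the dest/tag variables are invented for illustration:

      /* Hypothetical record: one integer id followed by three doubles */
      struct particle { int id; double pos[3]; };

      struct particle p;
      int             blocklens[2] = { 1, 3 };
      MPI_Datatype    types[2]     = { MPI_INT, MPI_DOUBLE };
      MPI_Aint        displs[2], base, addr;
      MPI_Datatype    particle_type;

      MPI_Address(&p, &base);
      MPI_Address(&p.id, &addr);     displs[0] = addr - base;
      MPI_Address(&p.pos[0], &addr); displs[1] = addr - base;

      MPI_Type_struct(2, blocklens, displs, types, &particle_type);
      MPI_Type_commit(&particle_type);

      /* The whole record now travels as one element of particle_type */
      MPI_Send(&p, 1, particle_type, dest, tag, MPI_COMM_WORLD);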

  30. MPI 2 • Process creation and management: processes can be dynamically created and destroyed. • One-sided communication: either the sender or the receiver alone sets the parameters for a communication. This reduces complexity and can exploit fast communication mechanisms. • Extended collective operations: new procedures for collective operations, including operations on intercommunicators. • External interfaces: users and library designers can access the internals of MPI's objects, allowing new functionality to be layered atop MPI's internal data structures. • New language bindings for C++ and Fortran. • I/O: MPI-IO describes the application side of an abstract model of parallel I/O, i.e. how a distributed-memory parallel program defines requests to a parallel abstract I/O device.
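      As a rough sketch of MPI-2 one-sided communication (not part of the original slides); rank and size are assumed to have been obtained as in the earlier examples, and the buffer names are invented:

      int     winbuf[1];   /* window memory: will receive one value */
      int     sendval;
      MPI_Win win;

      sendval = rank;      /* value this process contributes */
      MPI_Win_create(winbuf, sizeof(int), sizeof(int),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win);

      MPI_Win_fence(0, win);                 /* open the access/exposure epoch */
      MPI_Put(&sendval, 1, MPI_INT,          /* origin buffer */
              (rank + 1) % size,             /* target rank: right-hand neighbor */
              0, 1, MPI_INT, win);           /* target displacement, count, type */
      MPI_Win_fence(0, win);                 /* all puts are now complete */
      /* winbuf[0] now holds the rank of the left-hand neighbor */

      MPI_Win_free(&win);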
