

  1. MPI An Introduction Kadin Tseng Scientific Computing and Visualization Group Boston University

  2. SCV Computing Facilities • SGI Origin2000 – 192 processors • Lego (32), Slinky (32), Jacks (64), Playdoh (64) • R10000, 195 MHz • IBM SP - four 16-processor nodes • Hal - login node • Power3, 375 MHz • IBM Regatta - three 32-processor nodes • Twister - login node (currently in friendly-user phase) • Power4, 1.3 GHz • Linux cluster - 38 2-processor nodes • Skate and Cootie - two login nodes • 1.3 GHz Intel Pentium 3 (Myrinet interconnect)

  3. Useful SCV Info • SCV home page http://scv.bu.edu/ • Batch queues http://scv.bu.edu/SCV/scf-techsumm.html • Resource Applications • MARINER http://acct.bu.edu/FORMS/Mariner-Pappl.html • Alliance http://www.ncsa.uiuc.edu/alliance/applying/ • Help • Online FAQs • Web-based tutorials (MPI, OpenMP, F90, MATLAB, Graphics tools) • HPC consultations by appointment • Doug Sondak (sondak@bu.edu) • Kadin Tseng (kadin@bu.edu) • help@tonka.bu.edu, bugs@tonka.bu.edu • Origin2000 Repository http://scv.bu.edu/SCV/Origin2000/ • IBM SP and Regatta Repository http://scv.bu.edu/SCV/IBMSP/ • Alliance web-based tutorials http://foxtrot.ncsa.uiuc.edu:8900/

  4. Multiprocessor Architectures Linux cluster • Shared memory – between the 2 processors of each node. • Distributed memory – across the 38 nodes; both processors in each node can be used.

  5. Shared Memory Architecture [Diagram: processors P0, P1, …, Pn all connected to a single shared memory.]

  6. Distributed Memory Architecture [Diagram: nodes N0, N1, …, Nn, each with its own local memory M0, M1, …, Mn, connected by an interconnect (Myrinet or Ethernet).]

  7. Parallel Computing Paradigms • Parallel Computing Paradigms • Message Passing (MPI, PVM, ...) • Distributed and shared memory • Directives (OpenMP, ...) • Shared memory • Multi-Level Parallel programming (MPI + OpenMP) • Distributed/shared memory

  8. MPI Topics to Cover • Fundamentals • Basic MPI functions • Nonblocking send/receive • Collective Communications • Virtual Topologies • Virtual Topology Example • Solution of the Laplace Equation

  9. What is MPI ? • MPI stands for Message Passing Interface. • It is a library of subroutines/functions, not a computer language. • These subroutines/functions are callable from Fortran or C programs. • The programmer writes Fortran/C code, inserts the appropriate MPI subroutine/function calls, compiles, and finally links with the MPI message-passing library. • In general, MPI codes run on shared-memory multiprocessors, distributed-memory multicomputers, clusters of workstations, or heterogeneous clusters of the above. • The current standard is MPI-1; MPI-2 will be available in the near future.

  10. Why MPI ? • To enable more analyses in a prescribed amount of time. • To reduce the time required for one analysis. • To increase the fidelity of physical modeling. • To have access to more memory. • To provide efficient communication between the nodes of a network of workstations, … • To enhance code portability; MPI works for both shared- and distributed-memory architectures. • For “embarrassingly parallel” problems, such as some Monte-Carlo applications, parallelizing with MPI can be trivial, with near-linear (or even superlinear) scaling.

  11. MPI Preliminaries • MPI's pre-defined constants, function prototypes, etc., are collected in a header file. This file must be included in your code wherever MPI function calls appear (in "main" and in user subroutines/functions): • #include "mpi.h" for C codes • #include "mpi++.h" for C++ codes • include "mpif.h" for Fortran 77 and Fortran 9x codes • MPI_Init must be the first MPI function called. • MPI is terminated by calling MPI_Finalize. • Each of these two functions must be called exactly once in the user code.
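As a minimal sketch (not from the original slides), the C skeleton below puts these preliminaries together: the header include, MPI_Init as the first MPI call, MPI_Finalize as the last, and the rank/size queries used throughout the examples that follow.

#include <stdio.h>
#include <mpi.h>   /* pre-defined MPI constants, prototypes, ... */

int main(int argc, char *argv[])
{
  int rank, nprocs;
  MPI_Init(&argc, &argv);                  /* must be the first MPI call */
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's id, 0..p-1  */
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* total number of processes  */
  printf("Hello from process %d of %d\n", rank, nprocs);
  MPI_Finalize();                          /* must be the last MPI call  */
  return 0;
}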

  12. MPI Preliminaries (continued) • C is a case-sensitive language. MPI function names always begin with “MPI_”, followed by a specific name whose leading character is capitalized, e.g., MPI_Comm_rank. MPI pre-defined constants are expressed in upper-case characters, e.g., MPI_COMM_WORLD. • Fortran is not case-sensitive; no specific case rules apply. • MPI Fortran routines return the error status as the last argument of the subroutine call, e.g., • call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr) • For C, the error status is returned as the int function value, e.g., • int ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  13. What is a Message? • A collection of data (an array) of MPI data types • Basic data types such as int/integer, float/real • Derived data types • Message “envelope” – source, destination, tag, communicator

  14. Modes of Communication • Point-to-point communication • Blocking – returns from the call when the task is complete • Several send modes; one receive mode • Nonblocking – returns from the call without waiting for the task to complete • Several send modes; one receive mode • Collective communication

  15. Essentials of Communication • The sender must specify a valid destination. • The sender's and receiver's data type, tag, and communicator must match. • The receiver can receive from a non-specific (but valid) source. • The receiver returns an extra (status) parameter reporting information about the message received. • The sender specifies the size of sendbuf; the receiver specifies an upper bound for recvbuf. (See the sketch below.)
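The fragment below is an illustrative sketch (not from the original slides) of these rules in C: a blocking send matched by a receive with a wildcard source, with the status parameter reporting the actual source, tag, and message size. The buffer name and tag value are made up for illustration; run with at least two processes.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  float buf[100];
  int rank, count, j;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 1) {
    for (j = 0; j < 100; j++) buf[j] = (float)j;
    /* blocking send: destination 0, tag 99; the data type must match the receive */
    MPI_Send(buf, 100, MPI_FLOAT, 0, 99, MPI_COMM_WORLD);
  } else if (rank == 0) {
    /* wildcard source; the count 100 is only an upper bound on the receive buffer */
    MPI_Recv(buf, 100, MPI_FLOAT, MPI_ANY_SOURCE, 99, MPI_COMM_WORLD, &status);
    MPI_Get_count(&status, MPI_FLOAT, &count);   /* actual number of items received */
    printf("Received %d floats from process %d, tag %d\n",
           count, status.MPI_SOURCE, status.MPI_TAG);
  }
  MPI_Finalize();
  return 0;
}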

  16. MPI Data Types vs C Data Types • MPI types -- C types • MPI_INT – signed int • MPI_UNSIGNED – unsigned int • MPI_FLOAT – float • MPI_DOUBLE – double • MPI_CHAR – char • . . .

  17. MPI vs Fortran Data Types • MPI_INTEGER – INTEGER • MPI_REAL – REAL • MPI_DOUBLE_PRECISION – DOUBLE PRECISION • MPI_CHARACTER – CHARACTER(1) • MPI_COMPLEX – COMPLEX • MPI_LOGICAL – LOGICAL • …

  18. MPI Data Types • MPI_PACKED • MPI_BYTE
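MPI_PACKED and MPI_BYTE have no single language counterpart: MPI_BYTE sends raw bytes, and MPI_PACKED is used together with MPI_Pack/MPI_Unpack to bundle mixed data into one message. The sketch below (not from the original slides; the buffer size and values are made up) packs an int and a float into one buffer and sends it as a single MPI_PACKED message; run with at least two processes.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  char packbuf[1024];   /* staging buffer, generously sized for this sketch */
  int rank, position = 0;
  int n = 500;
  float h = 0.25;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    /* pack an int and a float, back to back, into packbuf */
    MPI_Pack(&n, 1, MPI_INT,   packbuf, 1024, &position, MPI_COMM_WORLD);
    MPI_Pack(&h, 1, MPI_FLOAT, packbuf, 1024, &position, MPI_COMM_WORLD);
    /* send the packed bytes as one MPI_PACKED message */
    MPI_Send(packbuf, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    MPI_Recv(packbuf, 1024, MPI_PACKED, 0, 0, MPI_COMM_WORLD, &status);
    MPI_Unpack(packbuf, 1024, &position, &n, 1, MPI_INT,   MPI_COMM_WORLD);
    MPI_Unpack(packbuf, 1024, &position, &h, 1, MPI_FLOAT, MPI_COMM_WORLD);
    printf("Unpacked n = %d, h = %f\n", n, h);
  }
  MPI_Finalize();
  return 0;
}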

  19. Some MPI Implementations There are a number of implementations: • MPICH (ANL) • Version 1.2.1 is the latest. • There is a list of supported MPI-2 functionalities. • LAM (UND/OSC) • CHIMP (EPCC) • Vendor implementations (SGI, IBM, …) • Don't worry – your code remains portable as long as the vendor library supports the MPI Standard (IBM's MPL does not). • Job execution procedures may differ among implementations.

  20. Example 1 (Integration) We will introduce some fundamental MPI function calls through the computation of a simple integral by the Mid-point Rule, where p is the number of partitions and n is the number of increments per partition.

  21. Integrate cos(x) by Mid-point Rule
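The figure on this slide is not reproduced in the transcript. As a reconstruction from the code that follows (an assumption based on that code, not on the slide image), the quantity being computed by the Mid-point Rule is

\int_a^b \cos x \, dx \;\approx\; \sum_{i=0}^{p-1} \sum_{j=0}^{n-1} h \,\cos\!\left(a_{ij} + \frac{h}{2}\right),
\qquad h = \frac{b-a}{p\,n}, \qquad a_{ij} = a + (i\,n + j)\,h,

with a = 0 and b = π/2, so the exact value of the integral is 1. Process i computes the inner sum over its own partition.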

  22. Example 1 - Serial fortran code

      Program Example1
      implicit none
      integer n, p, i, j
      real h, result, a, b, integral, pi

      pi = acos(-1.0)   ! = 3.14159...
      a = 0.0           ! lower limit of integration
      b = pi/2.         ! upper limit of integration
      p = 4             ! number of partitions (processes)
      n = 500           ! # of increments in each part.
      h = (b-a)/p/n     ! length of increment

      result = 0.0      ! Initialize solution to the integral
      do i=0,p-1        ! Integral sum over all partitions
        result = result + integral(a,i,h,n)
      enddo
      print *,'The result =',result
      stop
      end

  23. . . Serial fortran code (cont'd) example1.f continues . . .

      real function integral(a, i, h, n)
      implicit none
      integer n, i, j
      real h, h2, aij, a, fct, x

      fct(x) = cos(x)   ! Integral kernel

      integral = 0.0    ! initialize integral
      h2 = h/2.
      do j=0,n-1        ! sum over all "j" increments
        aij = a + (i*n + j)*h   ! lower limit of integration
        integral = integral + fct(aij+h2)*h
      enddo
      return
      end

  24. Example 1 - Serial C code

#include <math.h>
#include <stdio.h>
float integral(float a, int i, float h, int n);

void main()
{
  int n, p, i, j, ierr;
  float h, result, a, b, pi, my_result;

  pi = acos(-1.0);  /* = 3.14159... */
  a = 0.;           /* lower limit of integration */
  b = pi/2.;        /* upper limit of integration */
  p = 4;            /* # of partitions */
  n = 500;          /* increments in each process */
  h = (b-a)/n/p;    /* length of increment */

  result = 0.0;
  for (i=0; i<p; i++) {   /* integral sum over procs */
    result += integral(a,i,h,n);
  }
  printf("The result =%f\n",result);
}

  25. . . Serial C code (cont'd) example1.c continues . . .

float integral(float a, int i, float h, int n)
{
  int j;
  float h2, aij, integ;

  integ = 0.0;      /* initialize integral */
  h2 = h/2.;
  for (j=0; j<n; j++) {      /* integral in each partition */
    aij = a + (i*n + j)*h;   /* lower limit of integration */
    integ += cos(aij+h2)*h;
  }
  return integ;
}

  26. Example 1 - Parallel fortran code • MPI functions used for this example: • MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize • MPI_Send, MPI_Recv

      PROGRAM Example1_1
      implicit none
      integer n, p, i, j, ierr, master, Iam
      real h, result, a, b, integral, pi
      include "mpif.h"   ! pre-defined MPI constants, ...
      integer source, dest, tag, status(MPI_STATUS_SIZE)
      real my_result
      data master/0/     ! 0 is the master processor responsible
                         ! for collecting integral sums ...

  27. . . . Parallel fortran code (cont'd)

! Starts MPI processes ...
      call MPI_Init(ierr)
! Get current process id
      call MPI_Comm_rank(MPI_COMM_WORLD, Iam, ierr)
! Get number of processes from command line
      call MPI_Comm_size(MPI_COMM_WORLD, p, ierr)
! Executable statements before MPI_Init are not
! advisable; side effects are implementation-dependent

      pi = acos(-1.0)    ! = 3.14159...
      a = 0.0            ! lower limit of integration
      b = pi/2.          ! upper limit of integration
      n = 500            ! increments in each process
      h = (b - a)/p/n    ! length of increment

      dest = master      ! proc to compute final result
      tag = 123          ! set tag for job

  28. ... Parallel fortran code (cont'd)

      my_result = integral(a,Iam,h,n)   ! compute local sum
      write(*,"('Process ',i2,' has the partial result of',
     &      f10.6)") Iam, my_result

      if(Iam .eq. master) then
        result = my_result              ! Init. final result
        do source=1,p-1                 ! loop to collect local sums (not parallel!)
          call MPI_Recv(my_result, 1, MPI_REAL, source, tag,
     &                  MPI_COMM_WORLD, status, ierr)   ! not safe
          result = result + my_result
        enddo
        print *,'The result =',result
      else
        call MPI_Send(my_result, 1, MPI_REAL, dest, tag,
     &                MPI_COMM_WORLD, ierr)   ! send my_result to dest.
      endif

      call MPI_Finalize(ierr)           ! let MPI finish up ...
      end

  29. Example 1 - Parallel C code

#include <mpi.h>
#include <math.h>
#include <stdio.h>
float integral(float a, int i, float h, int n);   /* prototype */

void main(int argc, char *argv[])
{
  int n, p, i;
  float h, result, a, b, pi, my_result;
  int myid, source, dest, tag;
  MPI_Status status;

  MPI_Init(&argc, &argv);                 /* start MPI processes */
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);   /* current proc. id    */
  MPI_Comm_size(MPI_COMM_WORLD, &p);      /* # of processes      */

  30. … Parallel C code (continued)

  pi = acos(-1.0);  /* = 3.14159... */
  a = 0.;           /* lower limit of integration */
  b = pi/2.;        /* upper limit of integration */
  n = 500;          /* number of increments within each process */
  dest = 0;         /* define the process that computes the final result */
  tag = 123;        /* set the tag to identify this particular job */
  h = (b-a)/n/p;    /* length of increment */

  i = myid;         /* MPI process number range is [0,p-1] */
  my_result = integral(a,i,h,n);
  printf("Process %d has the partial result of %f\n", myid, my_result);

  31. … Parallel C code (continued)

  if(myid == 0) {
    result = my_result;
    for (i=1; i<p; i++) {
      source = i;   /* MPI process number range is [0,p-1] */
      MPI_Recv(&my_result, 1, MPI_FLOAT, source, tag,
               MPI_COMM_WORLD, &status);
      result += my_result;
    }
    printf("The result =%f\n", result);
  }
  else {
    MPI_Send(&my_result, 1, MPI_FLOAT, dest, tag,
             MPI_COMM_WORLD);   /* send my_result to "dest" */
  }
  MPI_Finalize();               /* let MPI finish up ... */
}

  32. Example1_2 – Parallel Integration • MPI functions used for this example: • MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize • MPI_Recv, MPI_Isend, MPI_Wait

      PROGRAM Example1_2
      implicit none
      integer n, p, i, j, k, ierr, master
      real h, a, b, integral, pi
      integer req(1)
      include "mpif.h"   ! This brings in pre-defined MPI constants, ...
      integer Iam, source, dest, tag, status(MPI_STATUS_SIZE)
      real my_result, Total_result, result
      data master/0/

  33. Example1_2 (continued)

c**Starts MPI processes ...
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, Iam, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, p, ierr)

      pi = acos(-1.0)   ! = 3.14159...
      a = 0.0           ! lower limit of integration
      b = pi/2.         ! upper limit of integration
      n = 500           ! number of increments within each process
      dest = master     ! define process that computes the final result
      tag = 123         ! set the tag to identify this particular job
      h = (b-a)/n/p     ! length of increment

      my_result = integral(a,Iam,h,n)   ! Integral of process Iam
      write(*,*)'Iam=',Iam,', my_result=',my_result

  34. Example1_2 (continued)

      if(Iam .eq. master) then          ! the following is serial
        result = my_result
        do k=1,p-1
          call MPI_Recv(my_result, 1, MPI_REAL,
     &                  MPI_ANY_SOURCE, tag,           ! "wildcard" source
     &                  MPI_COMM_WORLD, status, ierr)  ! status - info about source ...
          result = result + my_result                  ! Total = sum of integrals
        enddo
      else
        call MPI_Isend(my_result, 1, MPI_REAL, dest, tag,
     &                 MPI_COMM_WORLD, req, ierr)      ! send my_result to "dest"
        call MPI_Wait(req, status, ierr)               ! wait for nonblocking send ...
      endif

c**results from all procs have been collected and summed ...
      if(Iam .eq. 0) write(*,*)'Final Result =',result

      call MPI_Finalize(ierr)           ! let MPI finish up ...
      stop
      end
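For comparison, here is a sketch in C (not part of the original slides) of the same nonblocking-send pattern: MPI_Isend returns immediately with a request handle, and MPI_Wait blocks until the send buffer may safely be reused. The partial result below is a stand-in value rather than the actual integral.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int rank, p, k, tag = 123;
  float my_result, result;
  MPI_Request req;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &p);

  my_result = (float)rank;   /* stand-in for the partial integral */

  if (rank == 0) {           /* master collects the partial results serially */
    result = my_result;
    for (k = 1; k < p; k++) {
      MPI_Recv(&my_result, 1, MPI_FLOAT, MPI_ANY_SOURCE, tag,
               MPI_COMM_WORLD, &status);
      result += my_result;
    }
    printf("Final Result = %f\n", result);
  } else {
    MPI_Isend(&my_result, 1, MPI_FLOAT, 0, tag, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, &status);   /* the buffer may be reused only after this */
  }
  MPI_Finalize();
  return 0;
}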

  35. Example1_3 Parallel Integration • MPI functions used for this example: • MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize • MPI_Bcast, MPI_Reduce, MPI_SUM

      PROGRAM Example1_3
      implicit none
      integer n, p, i, j, ierr, master
      real h, result, a, b, integral, pi
      include "mpif.h"   ! This brings in pre-defined MPI constants, ...
      integer Iam, source, dest, tag, status(MPI_STATUS_SIZE)
      real my_result
      data master/0/

  36. Example1_3 (continued)

c**Starts MPI processes ...
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, Iam, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, p, ierr)

      pi = acos(-1.0)   ! = 3.14159...
      a = 0.0           ! lower limit of integration
      b = pi/2.         ! upper limit of integration
      dest = 0          ! define the process that computes the final result
      tag = 123         ! set the tag to identify this particular job

      if(Iam .eq. master) then
        print *,'The requested number of processors =',p
        print *,'enter number of increments within each process'
        read(*,*)n
      endif

c**Broadcast "n" to all processes
      call MPI_Bcast(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)

  37. Example1_3 (continued)

      h = (b-a)/n/p     ! length of increment
      my_result = integral(a,Iam,h,n)
      write(*,"('Process ',i2,' has the partial result of',f10.6)")
     &      Iam, my_result

c**Compute the integral sum over all processes
      call MPI_Reduce(my_result, result, 1, MPI_REAL, MPI_SUM, dest,
     &                MPI_COMM_WORLD, ierr)

      if(Iam .eq. master) then
        print *,'The result =',result
      endif

      call MPI_Finalize(ierr)   ! let MPI finish up ...
      stop
      end
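For reference, the same broadcast-then-reduce pattern in C, as a sketch that is not part of the original slides; the integral() function is copied from the serial C code on slide 25.

#include <stdio.h>
#include <math.h>
#include <mpi.h>

float integral(float a, int i, float h, int n);   /* defined below (slide 25) */

int main(int argc, char *argv[])
{
  int n, p, Iam, master = 0;
  float a = 0.0, b, h, my_result, result;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &Iam);
  MPI_Comm_size(MPI_COMM_WORLD, &p);

  b = acos(-1.0)/2.0;                  /* upper limit of integration, pi/2 */
  if (Iam == master) {
    printf("enter number of increments within each process\n");
    scanf("%d", &n);
  }
  /* broadcast "n" from the master to every process */
  MPI_Bcast(&n, 1, MPI_INT, master, MPI_COMM_WORLD);

  h = (b - a)/n/p;                     /* length of increment */
  my_result = integral(a, Iam, h, n);

  /* sum all partial results onto the master */
  MPI_Reduce(&my_result, &result, 1, MPI_FLOAT, MPI_SUM, master,
             MPI_COMM_WORLD);
  if (Iam == master) printf("The result = %f\n", result);
  MPI_Finalize();
  return 0;
}

float integral(float a, int i, float h, int n)
{
  int j;
  float h2 = h/2., aij, integ = 0.0;
  for (j = 0; j < n; j++) {
    aij = a + (i*n + j)*h;             /* lower limit of this increment */
    integ += cos(aij + h2)*h;
  }
  return integ;
}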

  38. Compilations With MPICH • Creating a Makefile is a practical way to compile programs.

CC       = mpicc
CCC      = mpiCC
F77      = mpif77
F90      = mpif90
OPTFLAGS = -O3

f-example: f-example.o
	$(F77) $(OPTFLAGS) -o f-example f-example.o

c-example: c-example.o
	$(CC) $(OPTFLAGS) -o c-example c-example.o

  39. Compilations With MPICH (cont'd) Makefile continues . . .

.c.o:
	$(CC) $(OPTFLAGS) -c $*.c

.f.o:
	$(F77) $(OPTFLAGS) -c $*.f

.f90.o:
	$(F90) $(OPTFLAGS) -c $*.f90

.SUFFIXES: .f90

  40. Run Jobs Interactively On SCV's Linux cluster, use mpirun.ch_gm to run MPI jobs.

skate% mpirun.ch_gm -np 4 example

(alias mpirun '/usr/local/mpich/1.2.1..7b/gm-1.5.1_Linux-2.4.7-10enterprise/smp/pgi/ssh/bin/mpirun.ch_gm')

  41. Run Jobs in Batch We use PBS, the Portable Batch System, to manage batch jobs.

% qsub my_pbs_script

where qsub is the batch submission command and my_pbs_script is a user-provided script containing user-specific PBS batch job information.

  42. My_pbs_script

#!/bin/bash
# Set the default queue
#PBS -q dque
# Request 4 nodes with 1 processor per node
#PBS -l nodes=4:ppn=1,walltime=00:10:00
MYPROG="psor"
GMCONF=~/.gmpi/conf.$PBS_JOBID
/usr/local/xcat/bin/pbsnodefile2gmconf $PBS_NODEFILE > $GMCONF
cd $PBS_O_WORKDIR
NP=$(head -1 $GMCONF)
mpirun.ch_gm --gm-f $GMCONF --gm-recv polling --gm-use-shmem \
             --gm-kill 5 --gm-np $NP PBS_JOBID=$PBS_JOBID $MYPROG

  43. Output of Example1_1

skate% make -f make.linux example1_1
skate% mpirun.ch_gm -np 4 example1_1
Process 1 has the partial result of 0.324423
Process 2 has the partial result of 0.216773
Process 0 has the partial result of 0.382683
Process 3 has the partial result of 0.076120
The result = 1.000000

Note that the output lines are not in rank order; the order in which processes print is not deterministic.
