
Parallel Systems



  1. Parallel Systems Dr. Guy Tel-Zur

  2. Agenda • Barnes-Hut (final remarks) • Continue slides5 from the previous lecture • MPI Virtual Topologies • ScaLAPACK • Mixing programming languages • Impressions from the SC12 conference • Home Assignment #2 • Please think hard and send me your proposals for the final presentation!!!

  3. A simple Jacobi iteration • Analyze program from: http://www.mcs.anl.gov/research/projects/mpi/tutorial/mpiexmpl/src/jacobi/C/main.html

  4. In this example, you will put together some of the previous examples to implement a simple Jacobi iteration for approximating the solution to a linear system of equations. Here we solve the Laplace equation in two dimensions with finite differences. This may sound involved, but it really amounts only to a simple computation, combined with the previous example of a parallel mesh data structure. Any numerical analysis text will show that iterating

while (not converged) {
    for (i,j)
        xnew[i][j] = (x[i+1][j] + x[i-1][j] + x[i][j+1] + x[i][j-1]) / 4;
    for (i,j)
        x[i][j] = xnew[i][j];
}

  5. will compute an approximation to the solution of Laplace's equation. There is one last detail: this replacement of xnew with the average of the values around it is applied only in the interior; the boundary values are left fixed. In practice, this means that if the mesh is n by n, then the values x[0][j], x[n-1][j], x[i][0], x[i][n-1] are left unchanged. Of course, these refer to the complete mesh; you'll have to figure out what to do for the decomposed data structures (xlocal). Because the values are replaced by averaging around them, these techniques are called relaxation methods. We wish to compute this approximation in parallel. Write a program to apply this approximation. For convergence testing, compute

diffnorm = 0;
for (i,j)
    diffnorm += (xnew[i][j] - x[i][j]) * (xnew[i][j] - x[i][j]);
diffnorm = sqrt(diffnorm);

  6. You'll need to use MPI_Allreduce for this. (Why not use MPI_Reduce?) Have process zero write out the value of diffnorm and the iteration count at each iteration. When diffnorm is less than 1.0e-2, consider the iteration converged. Also, if you reach 100 iterations, exit the loop.
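MPI_Allreduce is needed (rather than MPI_Reduce) because every process evaluates the loop condition, so every process must see the global value of diffnorm; with MPI_Reduce only the root would know it and an extra MPI_Bcast would be required. A minimal sketch of the convergence test, using the same variable names as the full program on the following slides:

/* each process has accumulated its local sum of squared differences in diffnorm */
MPI_Allreduce( &diffnorm, &gdiffnorm, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD );
gdiffnorm = sqrt( gdiffnorm );   /* now identical on every rank */
/* so every rank can test the same condition: gdiffnorm > 1.0e-2 && itcnt < 100 */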

  7. For simplicity, consider a 12 x 12 mesh on 4 processors. The example solution uses the boundary values from the previous exercise; they are -1 on the top and bottom, and the rank of the process on the side. The initial data (the values of x that are being relaxed) are also the same; the interior points have the same value as the rank of the process. This is shown below:

-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 3  3  3  3  3  3  3  3  3  3  3  3
 3  3  3  3  3  3  3  3  3  3  3  3
 2  2  2  2  2  2  2  2  2  2  2  2
 2  2  2  2  2  2  2  2  2  2  2  2
 2  2  2  2  2  2  2  2  2  2  2  2
 1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1
 0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

  8. #include "mpi.h"
#include <stdio.h>
#include <math.h>

/* This example handles a 12 x 12 mesh, on 4 processors only. */
#define maxn 12

int main( int argc, char **argv )
{
    int        rank, value, size, errcnt, toterr, i, j, itcnt;
    int        i_first, i_last;
    MPI_Status status;
    double     diffnorm, gdiffnorm;
    double     xlocal[(12/4)+2][12];
    double     xnew[(12/4)+2][12];

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    if (size != 4) MPI_Abort( MPI_COMM_WORLD, 1 );

  9. /* xlocal[0][] is the lower ghost row, xlocal[maxn/size+1][] is the upper ghost row */
    /* Note that the top and bottom processes have one less row of interior points */
    i_first = 1;
    i_last  = maxn/size;
    if (rank == 0)        i_first++;
    if (rank == size - 1) i_last--;

    /* Fill the data as specified */
    for (i=1; i<=maxn/size; i++)
        for (j=0; j<maxn; j++)
            xlocal[i][j] = rank;
    for (j=0; j<maxn; j++) {
        xlocal[i_first-1][j] = -1;
        xlocal[i_last+1][j]  = -1;
    }

  10. itcnt = 0;
    do {
        /* Send up unless I'm at the top, then receive from below */
        /* Note the use of xlocal[i] for &xlocal[i][0] */
        if (rank < size - 1)
            MPI_Send( xlocal[maxn/size], maxn, MPI_DOUBLE, rank + 1, 0,
                      MPI_COMM_WORLD );
        if (rank > 0)
            MPI_Recv( xlocal[0], maxn, MPI_DOUBLE, rank - 1, 0,
                      MPI_COMM_WORLD, &status );
        /* Send down unless I'm at the bottom */
        if (rank > 0)
            MPI_Send( xlocal[1], maxn, MPI_DOUBLE, rank - 1, 1, MPI_COMM_WORLD );
        if (rank < size - 1)
            MPI_Recv( xlocal[maxn/size+1], maxn, MPI_DOUBLE, rank + 1, 1,
                      MPI_COMM_WORLD, &status );

  11. /* Compute new values (but not on boundary) */
        itcnt++;
        diffnorm = 0.0;
        for (i=i_first; i<=i_last; i++)
            for (j=1; j<maxn-1; j++) {
                xnew[i][j] = (xlocal[i][j+1] + xlocal[i][j-1] +
                              xlocal[i+1][j] + xlocal[i-1][j]) / 4.0;
                diffnorm += (xnew[i][j] - xlocal[i][j]) *
                            (xnew[i][j] - xlocal[i][j]);
            }
        /* Only transfer the interior points */
        for (i=i_first; i<=i_last; i++)
            for (j=1; j<maxn-1; j++)
                xlocal[i][j] = xnew[i][j];

        MPI_Allreduce( &diffnorm, &gdiffnorm, 1, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD );
        gdiffnorm = sqrt( gdiffnorm );
        if (rank == 0)
            printf( "At iteration %d, diff is %e\n", itcnt, gdiffnorm );
    } while (gdiffnorm > 1.0e-2 && itcnt < 100);

    MPI_Finalize( );
    return 0;
}

  12. The Makefile

# Generated automatically from Makefile.in by configure.
ALL: jacobi
SHELL = /bin/sh
DIRS =

jacobi: jacobi.c
	mpicc -o jacobi jacobi.c -lm

profile.alog: jacobi.c
	mpicc -o jacobi.log -mpilog jacobi.c -lm
	mpirun -np 4 jacobi.log
	/bin/mv jacobi.log_profile.log profile.alog

clean:
	/bin/rm -f jacobi jacobi.o jacobi.log
#	for dir in $(DIRS) ; do \
#	  ( cd $$dir ; make clean ) ; done

  13. MPI Virtual Topologies • http://www.pdc.kth.se/education/historical/previous-years-summer-schools/2009/handouts/lect5.pdf

  14. Cartesian Constructor

int MPI_Cart_create( MPI_Comm comm_old, int ndims, int *dims,
                     int *periods, int reorder, MPI_Comm *comm_cart )

  15. Cartesian Convenience - MPI_Dims_create

int MPI_Dims_create( int nnodes, int ndims, int *dims )

MPI_Dims_create creates a division of processes in a Cartesian grid.
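A minimal usage sketch (the zero-initialized dims values are an assumption; a zero entry tells MPI to choose that dimension itself):

int nnodes, dims[2] = {0, 0};        /* 0 = let MPI_Dims_create choose        */
MPI_Comm_size( MPI_COMM_WORLD, &nnodes );
MPI_Dims_create( nnodes, 2, dims );  /* e.g. 12 processes -> dims = {4, 3}    */
/* dims[] can now be passed to MPI_Cart_create */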

  16. See more: http://www.rc.usf.edu/tutorials/classes/tutorial/mpi/chapter10.html

  17. An Example

#include "mpi.h"
#include <stdio.h>
#define SIZE  16
#define UP    0
#define DOWN  1
#define LEFT  2
#define RIGHT 3

int main( int argc, char *argv[] )
{
    int numtasks, rank, source, dest, outbuf, i, tag=1,
        inbuf[4]={MPI_PROC_NULL,MPI_PROC_NULL,MPI_PROC_NULL,MPI_PROC_NULL},
        nbrs[4], dims[2]={4,4}, periods[2]={0,0}, reorder=0, coords[2];
    MPI_Request reqs[8];
    MPI_Status  stats[8];
    MPI_Comm    cartcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

  18. if (numtasks == SIZE) {
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, &cartcomm);
        MPI_Comm_rank(cartcomm, &rank);
        MPI_Cart_coords(cartcomm, rank, 2, coords);
        MPI_Cart_shift(cartcomm, 0, 1, &nbrs[UP],   &nbrs[DOWN]);
        MPI_Cart_shift(cartcomm, 1, 1, &nbrs[LEFT], &nbrs[RIGHT]);

        printf("rank= %d coords= %d %d neighbors(u,d,l,r)= %d %d %d %d\n",
               rank, coords[0], coords[1],
               nbrs[UP], nbrs[DOWN], nbrs[LEFT], nbrs[RIGHT]);

        outbuf = rank;
        for (i=0; i<4; i++) {
            dest   = nbrs[i];
            source = nbrs[i];
            MPI_Isend(&outbuf,   1, MPI_INT, dest,   tag, MPI_COMM_WORLD, &reqs[i]);
            MPI_Irecv(&inbuf[i], 1, MPI_INT, source, tag, MPI_COMM_WORLD, &reqs[i+4]);
        }
        MPI_Waitall(8, reqs, stats);

        printf("rank= %d inbuf(u,d,l,r)= %d %d %d %d\n",
               rank, inbuf[UP], inbuf[DOWN], inbuf[LEFT], inbuf[RIGHT]);
    }
    else
        printf("Must specify %d processors. Terminating.\n", SIZE);

    MPI_Finalize();
}

  19. Jacobi Iteration using MPI virtual topologies • Reference: Intermediate MPI from “Using MPI” book • http://www.mcs.anl.gov/research/projects/mpi/usingmpi/examples/intermediate/main.htm • Similar: an MPI parallelization of the solution of the 2D Poisson's equation • http://www.ee.bgu.ac.il/~tel-zur/2003B/lecture10.pdf
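For orientation only, here is a minimal sketch (not the code from the references above) of how the hand-coded up/down exchange from slide 10 could be expressed with a 1-D Cartesian topology; MPI_Cart_shift returns MPI_PROC_NULL at the ends of a non-periodic grid, so the explicit rank tests disappear:

int dims[1], periods[1] = {0}, up, down;
MPI_Comm comm1d;
dims[0] = size;                              /* one block of rows per process        */
MPI_Cart_create( MPI_COMM_WORLD, 1, dims, periods, 1, &comm1d );
MPI_Cart_shift( comm1d, 0, 1, &down, &up );  /* neighbours; MPI_PROC_NULL at edges   */

/* halo exchange: sends/receives involving MPI_PROC_NULL are no-ops */
MPI_Sendrecv( xlocal[maxn/size],   maxn, MPI_DOUBLE, up,   0,
              xlocal[0],           maxn, MPI_DOUBLE, down, 0, comm1d, &status );
MPI_Sendrecv( xlocal[1],           maxn, MPI_DOUBLE, down, 1,
              xlocal[maxn/size+1], maxn, MPI_DOUBLE, up,   1, comm1d, &status );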

  20. ScaLAPACK

  21. • ScaLAPACK: http://www.netlib.org/scalapack/ • Hands-On Exercises for ScaLAPACK: http://acts.nersc.gov/scalapack/hands-on/ • PBLAS Quick Reference Card: http://www.netlib.org/scalapack/pblas_qref.html

  22. C interface to BLACS: http://www.netlib.org/blacs/cblacsqref.ps (see next slide)

  23. BLACS
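As a reminder of the call sequence, a minimal sketch of setting up a BLACS process grid through the C interface (the 2 x 2 grid shape is an arbitrary assumption; see the quick-reference card above for the full routine list):

int iam, nprocs, ictxt, nprow = 2, npcol = 2, myrow, mycol;

Cblacs_pinfo( &iam, &nprocs );                  /* my id and number of processes  */
Cblacs_get( -1, 0, &ictxt );                    /* default system context         */
Cblacs_gridinit( &ictxt, "Row", nprow, npcol ); /* 2 x 2 grid, row-major ordering */
Cblacs_gridinfo( ictxt, &nprow, &npcol, &myrow, &mycol );  /* my grid coordinates */

/* ... ScaLAPACK / PBLAS calls using ictxt ... */

Cblacs_gridexit( ictxt );
Cblacs_exit( 0 );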

  24. http://netlib.org/scalapack/slug/node184.html

  25. Mixing C and FORTRAN • Demo on my other laptop • ~/tests/cprog.c and ffunction.f (a sketch of the idea follows below) • http://www.cae.tntech.edu/help/programming/mixed_languages • Important! Column-major vs. row-major order: C => row-major, Fortran => column-major • Ref: http://en.wikipedia.org/wiki/Row-major_order
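The demo files are not reproduced in the slides, so the following is only a hypothetical sketch of the idea behind cprog.c and ffunction.f (the routine name, arguments and build line are assumptions, not the actual demo). With most Unix Fortran compilers the Fortran symbol is lower case with a trailing underscore, every argument is passed by address, and any 2-D array shared between the two languages must be indexed column-major on the C side:

/* Suppose ffunction.f contains something like:
 *       subroutine ffunction(a, n, s)
 *       integer n
 *       double precision a(n), s
 *       (computes s from the array a)
 * Then cprog.c could call it as follows:
 */
#include <stdio.h>

extern void ffunction_( double *a, int *n, double *s );  /* trailing underscore */

int main( void )
{
    double a[3] = { 1.0, 2.0, 3.0 };
    int    n = 3;
    double s;

    ffunction_( a, &n, &s );     /* all arguments passed by reference */
    printf( "s = %f\n", s );
    return 0;
}

A typical build (assuming GNU compilers) would be: gcc -c cprog.c ; gfortran -c ffunction.f ; gfortran -o cprog cprog.o ffunction.o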

  26. Example: PDGEMV(). The sample uses PDGEMV(), which computes a distributed matrix-vector product y = alpha*A*x + beta*y. A = [1 4 7 10 13 ; 3 6 9 12 15 ; 5 8 11 14 17 ; 7 10 13 16 19 ; 9 12 15 18 21] and x = [1 ; 1 ; 0 ; 0 ; 1]^T (x is a column vector). Call the PDGEMV() routine to compute y = A*x. The correct result is y = [18 ; 24 ; 30 ; 36 ; 42]^T (T = transpose). Code sample in C: pdgemv.c. The function call in C looks like:

double alpha = 1.0;
double beta  = 0.0;
pdgemv_("N", &M, &M, &alpha, A, &ONE, &ONE, descA, x, &ONE, &ONE, descx,
        &ONE, &beta, y, &ONE, &ONE, descy, &ONE);
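The call above assumes that the array descriptors descA, descx and descy were already created. For completeness, a hedged sketch of how a descriptor is typically initialized with the ScaLAPACK DESCINIT routine (the block sizes MB, NB and the local leading dimension lld are illustrative assumptions, not values taken from pdgemv.c):

/* descinit_( desc, M, N, MB, NB, first_proc_row, first_proc_col, context, LLD, info ) */
/* lld must be at least the number of local rows, usually obtained from numroc_()      */
int descA[9], info, izero = 0;
descinit_( descA, &M, &M, &MB, &NB, &izero, &izero, &ictxt, &lld, &info );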

  27. PDGEMV(). Check pdgemv_.c here: http://www.netlib.org/scalapack/explore-html/da/dca/pdgemv___8c_source.html http://www.cs.utexas.edu/~plapack/Code/ScaLAPACK/PBLAS/pdgemv_.c

  28. Purpose
=======
PDGEMV performs one of the matrix-vector operations

  sub( Y ) := alpha*sub( A )*sub( X ) + beta*sub( Y ),   or
  sub( Y ) := alpha*sub( A )'*sub( X ) + beta*sub( Y ),

where sub( A ) denotes A(IA:IA+M-1,JA:JA+N-1).

When TRANS = 'N',
  sub( X ) denotes X(IX:IX,JX:JX+N-1)   if INCX = M_X,
                   X(IX:IX+N-1,JX:JX)   if INCX = 1 and INCX <> M_X,
  sub( Y ) denotes Y(IY:IY,JY:JY+M-1)   if INCY = M_Y,
                   Y(IY:IY+M-1,JY:JY)   if INCY = 1 and INCY <> M_Y,

otherwise
  sub( X ) denotes X(IX:IX,JX:JX+M-1)   if INCX = M_X,
                   X(IX:IX+M-1,JX:JX)   if INCX = 1 and INCX <> M_X,
  sub( Y ) denotes Y(IY:IY,JY:JY+N-1)   if INCY = M_Y,
                   Y(IY:IY+N-1,JY:JY)   if INCY = 1 and INCY <> M_Y.

Alpha and beta are scalars, sub( X ) and sub( Y ) are subvectors, and sub( A ) is an m by n submatrix.
