The Need for Speed: Parallelization in GEM

Presentation Transcript


1. The Need for Speed: Parallelization in GEM
Michel Desgagné
Recherche en Prévision Numérique, Environment Canada - MSC/RPN
Many thanks to Michel Valin

2. The reason behind parallel programming
Single processor limitations:
• Processor clock speed is limited
• Physical size of the processor limits speed because signal speed cannot exceed the speed of light
• Single-processor speed is limited by integrated-circuit feature size (propagation delays and thermal problems)
• Memory (size and speed, especially latency)
• The amount of logic on a processor chip is limited by real-estate considerations (die size / transistor size)
• Algorithm limitations

3. Parallel computing: a solution
• Increase parallelism within the processor (multi-operand functional units such as vector units)
• Increase parallelism on the chip (multiple processors on one chip)
• Multi-processor computers
• Multi-computer systems using a communication network (latency and bandwidth considerations)

4. Parallel computing paradigms
• Hardware taxonomy:
  SISD (Single Instruction, Single Data)
  SIMD (Single Instruction, Multiple Data)
  MISD (Multiple Instruction, Single Data)
  MIMD (Multiple Instruction, Multiple Data)
• Programmer taxonomy:
  SPMD: Single Program, Multiple Data
  MPMD: Multiple Program, Multiple Data
• Memory taxonomy:
  SMP (Shared Memory Parallelism): one processor can "see" another's memory (Cray X-MP, single-node NEC SX-3/4/5/6)
  DMP (Distributed Memory Parallelism): processors exchange "messages" (Cray T3D, IBM SP, ES-40, ASCI machines)

5. SMP architectures
[Figure: two shared-memory node layouts, each with several CPUs and memory units inside one node: a bus topology and a network / crossbar topology]

6. SMP: OpenMP (microtasking / autotasking)
• OpenMP works at small granularity, often at the loop level: multiple CPUs execute the same code in a shared memory space.
• OpenMP is an explicit (not automatic) programming model, offering the programmer full control over parallelization.
• OpenMP uses the fork-join model of parallel execution.

7. OpenMP: Basic features (FORTRAN “comments”)

      PROGRAM VEC_ADD_SECTIONS
      INTEGER ni, I, n
      PARAMETER (ni=1000)
      REAL A(ni), B(ni), C(ni)
!     Some initializations
      n = 4
      DO I = 1, ni
         A(I) = I * 1.0
         B(I) = A(I)
      ENDDO
      call omp_set_num_threads(n)
!$OMP PARALLEL SHARED(A,B,C), PRIVATE(I)
!$omp do
      DO I = 1, ni
         C(I) = A(I) + B(I)
      ENDDO
!$omp enddo
!$OMP END PARALLEL
      END

Two ways to initiate threads:
• at the shell level:    n=4 ; export OMP_NUM_THREADS=$n
• at the Fortran level:  call omp_set_num_threads(n)
The parallel region is the block between the PARALLEL and END PARALLEL directives.

8. OpenMP

!     Work-sharing: loop iterations are distributed among the threads
!$omp parallel
!$omp do
      do n = 1, omp_get_max_threads()
         call itf_phy_slb ( n, F_stepno, obusval, cobusval,
     $                      pvptr, cvptrp, cvptrm, ndim, chmt_ntr,
     $                      trp, trm, tdu, tdv, tdt, kmm, ktm,
     $                      LDIST_DIM, l_nk )
      enddo
!$omp enddo
!$omp end parallel

!     Critical section: one thread at a time updates the shared counter
!$omp critical
      jdo = jdo + 1
!$omp end critical

!     Single: the enclosed call is executed by only one thread
!$omp single
      call vexp (expf_8, xmass_8, nij)
!$omp end single

9. SMP: General remarks
• Shared-memory parallelism at the loop level can often be implemented after the fact if a moderate level of parallelism is all that is desired.
• It can also be done, to a lesser extent, at the thread level in some cases, but reentrancy, data scope (thread-local vs global) and race conditions can be a problem.
• Does NOT scale all that well.
• Limited to the real estate of a node.

10. DMP architecture
[Figure: several nodes, each containing its own CPUs and memory, connected through a high-speed interconnect (network / crossbar)]

11. 2D domain decomposition: regular horizontal block partitioning
[Figure: a global grid of Gni x Gnj points split over a PE topology npex=2, npey=2. Each PE owns an Lni x Lnj subdomain. Global indexing runs 1..Gni, 1..Gnj; local indexing runs 1..Lni, 1..Lnj on each PE. PEs are identified both by their position in the PE matrix, Pe(0,0) to Pe(1,1), and by their rank, PE #0 to PE #3. N, S, W, E give the orientation.]
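As an illustration of the bookkeeping this implies, here is a sketch only (not GEM code): the program name, the row-major rank ordering and the evenly divisible global grid are all assumptions.

  program pe_topology
  ! Map each rank to its position in the PE matrix and give the global
  ! index of its local point (1,1), for an npex x npey decomposition.
    implicit none
    integer, parameter :: Gni = 100, Gnj = 80, npex = 2, npey = 2
    integer :: rank, mypex, mypey, lni, lnj, offi, offj

    lni = Gni / npex                 ! local subdomain dimensions (even split assumed)
    lnj = Gnj / npey
    do rank = 0, npex*npey - 1
       mypex = mod(rank, npex)       ! column in the PE matrix
       mypey = rank / npex           ! row in the PE matrix
       offi  = mypex * lni           ! global i of local point (1,1) is offi+1
       offj  = mypey * lnj           ! global j of local point (1,1) is offj+1
       print *, 'PE #', rank, ' = Pe(', mypex, ',', mypey, '), global origin', offi+1, offj+1
    end do
  end program pe_topology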

12. High-level operations
• Halo exchange
  - What is a halo?
  - Why and when is it necessary to exchange a halo?
• Data transpose
  - What is a data transpose?
  - Why and when is it necessary to transpose data?
• Collective and reduction operations

13. 2D array layout with halos
[Figure: a local subdomain seen from one PE. The private data occupies local indices 1..Lni, 1..Lnj; the inner halo is the outermost band of private data, and the outer halo is the band of points held on behalf of the neighboring PEs. Including the halos, the array runs from Mini to Maxi and from Minj to Maxj. N, S, W, E give the orientation.]
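A minimal sketch of how such a field might be dimensioned, assuming a halo width of 1 (names and sizes are illustrative, not taken from GEM):

  program halo_layout
    implicit none
    integer, parameter :: Lni = 15, Lnj = 9, halo = 1
    integer, parameter :: Mini = 1 - halo, Maxi = Lni + halo
    integer, parameter :: Minj = 1 - halo, Maxj = Lnj + halo
    real :: f(Mini:Maxi, Minj:Maxj)   ! private data in (1:Lni,1:Lnj), outer halo around it

    f = 0.0
    print *, 'array bounds:', Mini, Maxi, Minj, Maxj
  end program halo_layout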

14. Halo exchange: why and when?
• Need to access neighboring data in order to perform local computation.
• In general, any stencil-type discrete operator, e.g.
  dfdx(i) = (f(i+1) - f(i-1)) / (x(i+1) - x(i-1))
• Halo width depends on the operator.
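A minimal sketch of such a stencil in local index space, assuming a halo width of 1 and halo points already filled (on a distributed grid they would come from a halo exchange; all names are illustrative):

  program stencil_demo
    implicit none
    integer, parameter :: Lni = 15, halo = 1
    real :: f(1-halo:Lni+halo), x(1-halo:Lni+halo), dfdx(Lni)
    integer :: i

    do i = 1 - halo, Lni + halo       ! fill x and f, including the halo points
       x(i) = real(i)
       f(i) = x(i)**2
    end do
    do i = 1, Lni                     ! centered difference over the private data:
       dfdx(i) = (f(i+1) - f(i-1)) / (x(i+1) - x(i-1))
    end do                            ! i=1 and i=Lni read from the halo points
    print *, 'dfdx(1), dfdx(Lni) =', dfdx(1), dfdx(Lni)
  end program stencil_demo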

15. Halo exchange
[Figure: a PE topology npex=3, npey=3 with the local PE in the centre, surrounded by its West, East, North, South, North-West, North-East, South-West and South-East neighbors; the outer halo of the local PE is shaded]
• How many neighbor PEs must the local PE exchange data with to get the data from the shaded area (outer halo)?
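For illustration only, a much-simplified exchange written with plain MPI (not the GEM/RPN_COMM implementation): an east-west exchange of a 1-point halo on a 1D decomposition. A real 2D exchange also covers north/south, and the corners can be obtained by doing the two directions in sequence.

  program halo_xch_demo
    use mpi
    implicit none
    integer, parameter :: Lni = 4, Lnj = 3, halo = 1
    real    :: f(1-halo:Lni+halo, Lnj)
    real    :: sbufE(Lnj), sbufW(Lnj), rbufE(Lnj), rbufW(Lnj)
    integer :: ierr, rank, nproc, east, west
    integer :: status(MPI_STATUS_SIZE)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)

    west = rank - 1; if (west < 0)      west = MPI_PROC_NULL   ! no wrap-around
    east = rank + 1; if (east >= nproc) east = MPI_PROC_NULL

    f = -1.0
    f(1:Lni, :) = real(rank)            ! private data carries the owner's rank

    sbufE = f(Lni, :)                   ! pack the two boundary columns
    sbufW = f(1, :)
    ! my east boundary goes to the east neighbor (and fills its western
    ! outer halo); symmetrically for the west boundary
    call MPI_Sendrecv(sbufE, Lnj, MPI_REAL, east, 1, &
                      rbufW, Lnj, MPI_REAL, west, 1, MPI_COMM_WORLD, status, ierr)
    call MPI_Sendrecv(sbufW, Lnj, MPI_REAL, west, 2, &
                      rbufE, Lnj, MPI_REAL, east, 2, MPI_COMM_WORLD, status, ierr)
    if (west /= MPI_PROC_NULL) f(0,     :) = rbufW   ! unpack into the outer halo
    if (east /= MPI_PROC_NULL) f(Lni+1, :) = rbufE
    print *, 'PE', rank, ' west/east halo:', f(0,1), f(Lni+1,1)

    call MPI_Finalize(ierr)
  end program halo_xch_demo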

16. Data transposition
[Figure: a PE topology npex=4, npey=4; two successive transposes, T1 and T2, redistribute the data among the PEs so that a different axis (X, then Z, then Y) lies entirely within each PE]
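In MPI terms, a transpose boils down to an all-to-all redistribution within a row or column of PEs. A minimal sketch with MPI_Alltoall (illustrative only, not the GEM solver code):

  program transpose_demo
    use mpi
    implicit none
    integer, parameter :: blk = 2            ! points sent to each PE
    real, allocatable :: sendbuf(:), recvbuf(:)
    integer :: ierr, rank, nproc, i

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
    allocate (sendbuf(blk*nproc), recvbuf(blk*nproc))

    do i = 1, blk*nproc                      ! a distinct block for every PE
       sendbuf(i) = rank*1000 + i
    end do

    ! block k of sendbuf goes to PE k-1; block k of recvbuf comes from PE k-1:
    ! this is the building block of a data transpose
    call MPI_Alltoall(sendbuf, blk, MPI_REAL, recvbuf, blk, MPI_REAL, &
                      MPI_COMM_WORLD, ierr)

    if (rank == 0) print *, 'PE 0 received:', recvbuf
    call MPI_Finalize(ierr)
  end program transpose_demo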

17. What is MPI?
• A Message Passing Interface
• Communications through messages can be:
  - cooperative: send / receive (democratic)
  - one-sided: get / put (autocratic)
• Bindings defined for FORTRAN, C, C++
• For parallel computers, clusters, heterogeneous networks
• Full featured (but can be used in simple fashion):
  - MPI_gather, MPI_allgather
  - MPI_scatter, MPI_alltoall
  - MPI_bcast
  - MPI_reduce, MPI_allreduce (mpi_sum, mpi_min, mpi_max)
• Basic calls:
  include 'mpif.h'
  call MPI_INIT(ierr)
  call MPI_FINALIZE(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
  call MPI_SEND(buffer, count, datatype, destination, tag, comm, ierr)
  call MPI_RECV(buffer, count, datatype, source, tag, comm, status, ierr)
[Figure: communication time as a function of message length; the slope is Tw = cost / word]
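Put together, the calls listed above are enough for a complete (if trivial) message-passing program. A minimal sketch, using the 'use mpi' module in place of mpif.h:

  program mpi_hello
    use mpi
    implicit none
    integer :: ierr, rank, nproc, msg
    integer :: status(MPI_STATUS_SIZE)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)

    if (nproc > 1) then                ! need at least two PEs for a message
       if (rank == 0) then
          msg = 42                     ! PE 0 sends one integer to the last PE
          call MPI_Send(msg, 1, MPI_INTEGER, nproc-1, 99, MPI_COMM_WORLD, ierr)
       else if (rank == nproc-1) then
          call MPI_Recv(msg, 1, MPI_INTEGER, 0, 99, MPI_COMM_WORLD, status, ierr)
          print *, 'PE', rank, 'received', msg, 'from PE 0'
       end if
    end if

    call MPI_Finalize(ierr)
  end program mpi_hello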

18. The RPN_COMM toolkit (Michel Valin)
• NO INCLUDE FILE NEEDED (like mpif.h)
• Higher level of abstraction:
  - Initialization / termination of communications
  - Topology determination
  - Point-to-point operations: halo exchange (direct message to N/S/W/E neighbor)
  - Collective operations: transpose, gather / distribute, data reduction
• Equivalent calls to most frequently used MPI routines:
  MPI_[something] => RPN_COMM_[something]

19. Partitioning global data
checktopo -gni 62 -gnj 25 -gnk 58 -npx 4 -npy 2 -pil 7 -hblen 10
[Figure: the lni x lnj dimensions of every subdomain (lni=16 lnj=7, lni=15 lnj=8, lni=14 lnj=9, ...) when a global grid Gni=62, Gnj=25 is partitioned on a PE topology npex=4, npey=3, comparing the Valin and Thomas partitioning schemes]
• Valin partitioning: lni = (Gni + npex - 1) / npex
• The dimensions of the largest subdomain are NOT affected.
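A minimal sketch of the (Gni + npex - 1) / npex partitioning quoted on the slide, under the assumption that the last PE of a row simply keeps whatever is left over:

  program partition_demo
    implicit none
    integer, parameter :: Gni = 62, npex = 4
    integer :: base, pex, lni

    base = (Gni + npex - 1) / npex       ! ceiling(Gni/npex), integer division
    do pex = 0, npex - 1
       if (pex < npex - 1) then
          lni = base                     ! all but the last PE get the maximum width
       else
          lni = Gni - base*(npex - 1)    ! the last PE keeps the remainder
       end if
       print *, 'pex =', pex, '  lni =', lni   ! prints 16, 16, 16, 14
    end do
  end program partition_demo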

20. DMP scalability
• Scaling up with an optimum subdomain dimension:
  - subdomain size 500 x 50 on vector-processor systems
  - subdomain size 100 x 50 on cache systems
  - time to solution should remain the same
• Scaling up on a fixed-size problem:
  - time to solution should decrease linearly with the # of CPUs

21. MC2 performance on NEC SX-4 and Fujitsu VPP700
[Figure: flop rate per PE (MFlops/sec) as a function of the number of PEs, for a 513 x 433 x 41 grid; SX-4 runs with npx=2, VPP700 runs with npx=1]

22. IFS performance on NEC SX-4 and Fujitsu VPP700
[Figure: forecast days per day (10 to 1000) as a function of the number of PEs, for the SX-4 and the VPP700]
Amdahl's law for parallel programming:
• The speedup factor is influenced very much by the residual serial (non-parallelizable) work.
• As the number of processors grows, so does the damage caused by the non-parallelizable work.
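The law itself is not written out on the slide; in its usual form the speedup is 1 / ((1 - p) + p/N), where p is the parallelizable fraction of the work and N the number of processors. A small sketch showing how quickly it saturates:

  program amdahl
    implicit none
    real    :: p, speedup
    integer :: i, n

    p = 0.95                             ! fraction of the work that parallelizes
    do i = 0, 10
       n = 2**i                          ! 1, 2, 4, ..., 1024 processors
       speedup = 1.0 / ((1.0 - p) + p/real(n))
       print *, 'N =', n, '  speedup =', speedup
    end do
    ! with p = 0.95 the speedup saturates near 1/(1-p) = 20, however many CPUs are used
  end program amdahl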

23. Scalability: limiting factors
• Any algorithm requiring global communications (one should THINK LOCAL):
  - SL transport on a global configuration
  - lat-lon grid-point model with numerical poles (GEM)
  - 2-time-level fully implicit discretization leading to an elliptic problem: the direct solver requires data transposes
• Any algorithm producing inherent load imbalance

24. DMP: General remarks
• A more difficult but more powerful programming paradigm.
• Easily combined with SMP (on all MPI processes).
• Distributed memory parallelism does not just happen: it must be DESIGNED.
• One does not parallelize a code; the code must be rebuilt (and often redesigned) taking into account the constraints imposed upon the dataflow by message passing. Array dimensioning and loop indexing are likely to be VERY HEAVILY IMPACTED.
• One may get lucky and HPF or an automatic parallelizing compiler will solve the problem (if one believes in miracles, Santa Claus, the tooth fairy, or all of them).

25. Web sites and books
• http://pollux.cmc.ec.gc.ca/~armnmfv/MPI_workshop
• http://www.llnl.gov/ (OpenMP, threads, MPI, ...)
• http://hpcf.nersc.gov/
• http://www.idris.fr/ (in French: OpenMP, MPI, F90)
• Using MPI, Gropp et al., ISBN 0-262-57204-8
• MPI: The Complete Reference, Snir et al., ISBN 0-262-69184-1
• MPI: The Complete Reference, vol. 2, Gropp et al., ISBN 0-262-57123-4
