Global Arrays Tutorial: Develop Parallel Software

Overview of the Global ArraysParallel Software Development Toolkit Jarek Nieplocha1 , Bruce Palmer1, Manojkumar Krishnan1, Vinod Tipparaju1, P. Saddayappan2 1Pacific Northwest National Laboratory 2Ohio State University

Outline • Writing, Building, and Running GA Programs • Basic Calls • Intermediate Calls • Advanced Calls Global Arrays Tutorial

Writing, Building and Running GA programs • Writing GA programs • Compiling and linking • Running GA programs • For detailed information • GA Webpage • GA papers, APIs, user manual, etc. • (Google: Global Arrays) • http://www.emsl.pnl.gov/docs/global/ • GA User Manual • http://www.emsl.pnl.gov/docs/global/user.html • GA API Documentation • GA Webpage => User Interface • http://www.emsl.pnl.gov/docs/global/userinterface.html • GA Support/Help • hpctools@pnl.gov or hpctools@emsl.pnl.gov • 2 Mailing lists: GA User Forum, GA Announce Global Arrays Tutorial

Writing GA Programs • GA Definitions and Data types • C programs include files:ga.h, macdecls.h' • Fortran programs should include the files: mafdecls.fh, global.fh. • GA Initialize, GA_Terminate --> initializes and terminates GA library #include <stdio.h> #include "mpi.h“ #include "ga.h" #include "macdecls.h" int main( int argc, char **argv ) { MPI_Init( &argc, &argv ); GA_Initialize(); printf( "Hello world\n" ); GA_Terminate(); MPI_Finalize(); return 0; } Global Arrays Tutorial

Writing GA Programs • GA requires the following functionalities from a message passing library (MPI/TCGMSG) • initialization and termination of processes • Broadcast, Barrier • a function to abort the running parallel job in case of an error • The message-passing library has to be, • initialized before the GA library • terminated after the GA library is terminated • GA Compatible with MPI #include <stdio.h> #include "mpi.h“ #include "ga.h" #include "macdecls.h" int main( int argc, char **argv ) { MPI_Init( &argc, &argv ); GA_Initialize(); printf( "Hello world\n" ); GA_Terminate(); MPI_Finalize(); return 0; } Global Arrays Tutorial

Compiling and Linking GA Programs • 2 ways • GA Makefile in global/testing • Your Makefile • GA Makefile in global/testing • To compile and link your GA based program, for example “app.c” (or “app.f”, ..) • Copy to $GA_DIR/global/testing, and type • make app.x or gmake app.x • Compile any test program in GA testing directory, and use the appropriate compile/link flags in your program Global Arrays Tutorial

Compiling and Linking GA Programs (cont.) • Your Makefile • Please refer to the INCLUDES, FLAGS and LIBS variables, which will be printed at the end of successful GA installation on your platform • You can use these variables in your Makefile • For example: gcc $(INCLUDES) $(FLAGS) –o ga_test ga_test.c $(LIBS) INCLUDES = -I./include -I/msrc/apps/mpich-1.2.6/gcc/ch_shmem/include LIBS = -L/msrc/home/manoj/GA/cvs/lib/LINUX -lglobal -lma -llinalg -larmci -L/msrc/apps/mpich-1.2.6/gcc/ch_shmem/lib -lmpich –lm For Fortran Programs: FLAGS= -g -Wall -funroll-loops -fomit-frame-pointer -malign-double -fno-second-underscore -Wno-globals For C Programs: FLAGS = -g -Wall -funroll-loops -fomit-frame-pointer -malign-double -fno-second-underscore -Wno-globals • NOTE: Please refer to GA user manual chapter 2 for more information Global Arrays Tutorial

Running GA Programs • Example: Running a test program “ga_test” on 2 processes • mpirun -np 2 ga_test • Running GA program is same as MPI Global Arrays Tutorial

Setting GA Memory Limits • GA offers an optional mechanism that allows a programmer to limit aggregate memory consumption used by GA for storing Global Array data • Good for systems with limited-memory (e.g. BlueGene/L) • GA uses Memory Allocator (MA) library for internal temporary buffers • allocated dynamically, and deallocated when operation completes. • Users should specify memory usage of their application during GA initialization • Can be specified in MA_Init(..) call after GA_Initialize() ! Initialization of MA and setting GA memory limits call ga_initialize() if (ga_uses_ma()) then status = ma_init(MT_DBL, stack, heap+global) else status = ma_init(mt_dbl,stack,heap) call ga_set_memory_limit(ma_sizeof(MT_DBL,global,MT_BYTE)) endif if(.not. status) ... !we got an error condition here Global Arrays Tutorial

GA Basic Operations • GA programming model is very simple. • Most of the parallel programs can be written with these basic calls • GA_Initialize, GA_Terminate • GA_Nnodes, GA_Nodeid • GA_Create, GA_Destroy • GA_Put, GA_Get • GA_Sync Global Arrays Tutorial

GA Initialization/Termination • There are two functions to initialize GA: • Fortran • subroutine ga_initialize() • subroutine ga_initialize_ltd(limit) • C • void GA_Initialize() • void GA_Initialize_ltd(size_t limit) • To terminate a GA program: • Fortran subroutine ga_terminate() • C void GA_Terminate() integer limit - amount of memory in bytes per process [input] program main #include “mafdecls.h” #include “global.fh” integer ierr c call mpi_init(ierr) call ga_initialize() c write(6,*) ‘Hello world’ c call ga_terminate() call mpi_finilize() end Global Arrays Tutorial

Parallel Environment - Process Information • Parallel Environment: • how many processes are working together (size) • what their IDs are (ranges from 0 to size-1) • To return the process ID of the current process: • Fortran integer function ga_nodeid() • C int GA_Nodeid() • To determine the number of computing processes: • Fortran integer function ga_nnodes() • C int GA_Nnodes() • To determine the coordinates of a processor: • Fortran subroutine nga_proc_topology(ga, proc, coordinates) • C void NGA_Proc_topology(int g_a, int proc, int coordinates[]) integer g_a [input] integer proc [input] integer coordinates(ndim) coordinates in processor grid [output] Global Arrays Tutorial

Parallel Environment - Process Information (EXAMPLE) program main #include “mafdecls.h” #include “global.fh” integer ierr,me,nproc call mpi_init(ierr) call ga_initialize() me = ga_nodeid() size = ga_nnodes() write(6,*) ‘Hello world: My rank is ’ + me + ‘ out of ‘ + ! size + ‘processes/nodes’ call ga_terminate() call mpi_finilize() end Sample Output: mpirun –np 4 helloworld Hello world: My rank is 0 out of 4 processes/nodes Hello world: My rank is 1 out of 4 processes/nodes Hello world: My rank is 2 out of 4 processes/nodes Hello world: My rank is 3 out of 4 processes/nodes Global Arrays Tutorial

GA Data Types • C Data types • C_INT - int • C_LONG - long • C_FLOAT - float • C_DBL - double • C_DCPL - double complex • Fortran Data types • MT_F_INT - integer (4/8 bytes) • MT_F_REAL - real • MT_F_DBL - double precision • MT_F_DCPL - double complex Global Arrays Tutorial

Creating/Destroying Arrays • To create an array with regular distribution: • Fortran logical function nga_create(type, ndim, dims, name, chunk, g_a) • C int NGA_Create(int type, int ndim, int dims[], char *name, int chunk[]) character*(*) name - a unique character string [input] integer type - GA data type [input] integer dims() - array dimensions [input] integer chunk() - minimum size that dimensions should be chunked into [input] integer g_a - handle for future references [output] dims(1) = 5000 dims(2) = 5000 chunk(1) = -1 !Use defaults chunk(2) = -1 if (.not.nga_create(MT_F_DBL,2,dims,’Array_A’,chunk,g_a)) + call ga_error(“Could not create global array A”,g_a) Global Arrays Tutorial

Creating/Destroying Arrays (cont.) • To create an array with irregular distribution: • Fortran logical function nga_create_irreg (type, ndim, dims, array_name, map, nblock, g_a) • C int NGA_Create_irreg(int type, int ndim, int dims[], int nblock[], int map[]) character*(*) name - a unique character string [input] integer type - GA datatype [input] integer dims (ndim) - array dimensions [input] integer nblock(ndim) - no. of blocks each dimension is divided into [input] integer map(s) - starting index of each block [input] integer g_a - integer handle for future references [output] Global Arrays Tutorial

Creating/Destroying Arrays (cont.) • Example of irregular distribution: • The distribution is specified as a Cartesian product of distributions for each dimension. The array indices start at 1. • The figure demonstrates distribution of a 2-dimensional array 8x10 on 6 (or more) processors. block[2]={3,2}, the size of map array is s=5 and array map contains the following elements map={1,3,7, 1, 6}. • The distribution is nonuniform because, P1 and P4 get 20 elements each and processors P0,P2,P3, and P5 only 10 elements each. 5 5 P0 P3 2 P1 P4 4 P2 P5 2 block(1) = 3 block(2) = 2 map(1) = 1 map(2) = 3 map(3) = 7 map(4) = 1 map(5) = 6 if (.not.nga_create_irreg(MT_F_DBL,2,dims,’Array_A’,map,block,g_a)) + call ga_error(“Could not create global array A”,g_a) Global Arrays Tutorial

Creating/Destroying Arrays (cont.) • To duplicate an array: • Fortran logical function ga_duplicate(g_a, g_b, name) • C int GA_Duplicate(int g_a, char *name) • Global arrays can be destroyed by calling the function: • Fortran subroutine ga_destroy(g_a) • C void GA_Destroy(int g_a) integer g_a, g_b; character*(*) name; name - a character string [input] g_a - Integer handle for reference array [input] g_b - Integer handle for new array [output] call nga_create(MT_F_INT,dim,dims, + ‘array_a’,chunk,g_a) call ga_duplicate(g_a,g_b,‘array_b’) call ga_destroy(g_a) Global Arrays Tutorial

Put/Get • Put copies data from the local array to the global array section: • Fortran subroutine nga_put(g_a, lo, hi, buf, ld) • C void NGA_Put(int g_a, int lo[], int hi[], void *buf, int ld[]) • Get copies data from a global array section to the local array: • Fortran subroutine nga_get(g_a, lo, hi, buf, ld) • C void NGA_Get(int g_a, int lo[], int hi[], void *buf, int ld[]) integer g_a global array handle [input] integer lo(),hi() limits on data block to be moved [input] double precision/complex/integer buf local buffer [output] integer ld() array of strides for local buffer [input] Global Arrays Tutorial

Put/Get (cont.) • Example of get operation: • transfer data from the (7:15,1:8) section of a 2-dimensional 15 x10 global array into a local buffer which is a 10 x10 array • lo={7,1}, hi={15,8}, ld={10} global lo local hi double precision buf(10,10) : : call nga_get(g_a,lo,hi,buf,ld) Global Arrays Tutorial

Accumulate • Accumulate combines the data from the local array with data in the global array section: • Fortran subroutine nga_acc(g_a, lo, hi, buf, ld, alpha) • C void NGA_Acc(int g_a, int lo[], int hi[], void *buf, int ld[], void *alpha) integer g_a array handle [input] integer lo(), hi() limits on data block to be moved [input] double precision/complex buf local buffer [input] integer ld() array of strides for local buffer [input] double precision/complex alpha arbitrary scale factor [input] global local ga(i,j) = ga(i,j)+alpha*buf(k,l) Global Arrays Tutorial

Sync • Sync is a collective operation • It acts as a barrier, which synchronizes all the processes and ensures that all the Global Array operations are complete at the call • The functions are: • Fortran subroutine ga_sync() • C void GA_Sync() sync Global Arrays Tutorial

Message-Passing Wrappers to Reduce/Broadcast Operations root • To broadcast from process root to all other processes: • Fortran subroutine ga_brdcst(type, buf, lenbuf, root) • C void GA_Brdcst(void *buf, int lenbuf, int root) integer type [input] ! message type for broadcast byte buf(lenbuf) [input/output] integer lenbuf [input] integer root [input] double precision buf(lenbuf) : size = 8*lenbuf call ga_brdcst(msg_id,buf,size,0) Global Arrays Tutorial

Message-Passing Wrappers to Reduce/Broadcast Operations (cont.) • To sum elements across all nodes and broadcast result to all nodes: • Fortran • Subroutine ga_igop(type, x, n, op) • Subroutine ga_dgop(type, x, n, op) • C • Void GA_Igop(long x[], int n, char *op) • Void GA_Dgop(double x[], int n, char *op) integer type [input] <type> x(n) vector of elements [input/output] character*(*) OP operation [input] integer n number of elements [input] double precision rbuf(len) : call ga_dgop(msg_id,rbuf,len,’+’) Global Arrays Tutorial

Shared Object Shared Object get copy to local memory copy to shared object put compute/update local memory local memory local memory Global Array Model of Computations Global Arrays Tutorial

Locality Information • Discover array elements held by each processor • Fortran nga_distribution(g_a,proc,lo,hi) • C void NGA_Distribution(int g_a, int proc, int *lo, int *hi) integer g_a array handle [input] integer proc processor ID [input] Integer lo(ndim) lower index [output] Integer hi(ndim) upper index [output] do iproc = 1, nproc write(6,*) ‘Printing g_a info for processor’,iproc call nga_distribution(g_a,iproc,lo,hi) do j = 1, ndim write(6,*) j,lo(j),hi(j) end do dnd do Global Arrays Tutorial

Example: Matrix Multiply global arrays representing matrices = • nga_put nga_get = • dgemm local buffers on the processor Global Arrays Tutorial

Matrix Multiply (a better version) more scalable! (less memory, higher parallelism) = • atomic accumulate get = • dgemm local buffers on the processor Global Arrays Tutorial

Basic Array Operations • Whole Arrays: • To set all the elements in the array to zero: • Fortran subroutine ga_zero(g_a) • C void GA_Zero(int g_a) • To assign a single value to all the elements in array: • Fortran subroutine ga_fill(g_a, val) • C void GA_Fill(int g_a, void *val) • To scale all the elements in the array by factorval: • Fortran subroutine ga_scale(g_a, val) • C void GA_Scale(int g_a, void *val) Global Arrays Tutorial

0 1 2 3 4 5 0 1 2 6 7 8 3 4 5 6 7 8 Basic Array Operations (cont.) • Whole Arrays: • To copy data between two arrays: • Fortran subroutine ga_copy(g_a, g_b) • C void GA_Copy(int g_a, int g_b) • Arrays must be same size and dimension • Distribution may be different call ga_create(MT_F_INT,ndim,dims, ‘array_A’,chunk_a,g_a) call nga_create(MT_F_INT,ndim,dims, ‘array_B’,chunk_b,g_b) ... Initialize g_a .... call ga_copy(g_a, g_b) “g_a” “g_b” Global Arrays g_a and g_b distributed on a 3x3 process grid Global Arrays Tutorial

0 1 2 0 1 2 3 4 5 3 4 5 6 7 8 6 7 8 Basic Array Operations (cont.) • Patch Operations: • The copy patch operation: • Fortran subroutine nga_copy_patch(trans, g_a, alo, ahi, g_b, blo, bhi) • C void NGA_Copy_patch(char trans, int g_a, int alo[], int ahi[], int g_b, int blo[], int bhi[]) • Number of elements must match “g_a” “g_b” Copy Global Arrays Tutorial

Basic Array Operations (cont.) • Patches (Cont): • To set only the region defined by lo and hi to zero: • Fortran subroutine nga_zero_patch(g_a, lo, hi) • C void NGA_Zero_patch(int g_a, int lo[] int hi[]) • To assign a single value to all the elements in a patch: • Fortran subroutine nga_fill_patch(g_a, lo, hi, val) • C voidNGA_Fill_patch(int g_a, int lo[] int hi[], void *val) • To scale the patch defined by lo and hi by the factor val: • Fortran subroutine nga_scale_patch(g_a, lo, hi, val) • C voidNGA_Scale_patch(int g_a, int lo[] int hi[], void *val) • The copy patch operation: • Fortran subroutine nga_copy_patch(trans, g_a, alo, ahi, g_b, blo, bhi) • C void NGA_Copy_patch(char trans, int g_a, int alo[], int ahi[], int g_b, int blo[], int bhi[]) Global Arrays Tutorial

Scatter/Gather • Scatter puts array elements into a global array: • Fortran subroutine nga_scatter(g_a, v, subscrpt_array, n) • C void NGA_Scatter(int g_a, void *v, int *subscrpt_array[], int n) • Gather gets the array elements from a global array into a local array: • Fortran subroutine nga_gather(g_a, v, subscrpt_array, n) • C void NGA_Gather(int g_a, void *v, int *subscrpt_array[], int n) integer g_a array handle [input] double precision v(n) array of values [input/output] Integer n number of values [input] integer subscrpt_array location of values in global array [input] Global Arrays Tutorial

Scatter/Gather (cont.) • Example of scatter operation: • Scatter the 5 elements into a 10x10 global array • Element 1 v[0] = 5 subsArray[0][0] = 2 subsArray[0][1] = 3 • Element 2 v[1] = 3 subsArray[1][0] = 3 subsArray[1][1] = 4 • Element 3 v[2] = 8 subsArray[2][0] = 8 subsArray[2][1] = 5 • Element 4 v[3] = 7 subsArray[3][0] = 3 subsArray[3][1] = 7 • Element 5 v[4] = 2 subsArray[4][0] = 6 subsArray[4][1] = 3 • After the scatter operation, the five elements would be scattered into the global array as shown in the figure. integer subscript(ndim,nlen) : call nga_scatter(g_a,v,subscript,nlen) Global Arrays Tutorial

Cluster Information • To return the total number of nodes that the program is running on: • Fortran integer function ga_cluster_nnodes() • C int GA_Cluster_nnodes() • To return the node ID of the process: • Fortran integer function ga_cluster_nodeid() • C int GA_Cluster_nodeid() Global Arrays Tutorial

Cluster Information (cont.) • To return the number of processors available on node inode: • Fortran integer function ga_cluster_nprocs(inode) • C int GA_Cluster_nprocs(int inode). • To return the processor ID associated with node inode and the local processor ID iproc: • Fortran integer function ga_cluster_procid(inode, iproc) • C int GA_Cluster_procid(int inode, int iproc) integer inode [input] inode [input] integer inode,iproc [input] inode,iproc [input] Global Arrays Tutorial

Cluster Information (cont.) • Example: • 2 nodes with 4 processors each. Say, there are 7 processes created. • Assume 4 processes on node 0 and 3 processes on node 1. • In this case: • number of nodes=2, node id is either 0 or 1 (e.x., nodeid of process 2 is 0) • number of processes in node 0 is 4 and node 1 is 3 • The global rank of each process and the local rank (rank of the process within the node.i.e.cluster_procid) is shown in the paranthesis. Global Arrays Tutorial

Read and Increment • Read_inc remotely updates a particular element in an integer global array: • Fortran integer function nga_read_inc(g_a, subscript, inc) • C long NGA_Read_inc(int g_a, int subscript[], long inc) • Applies to integer arrays only • Example: can be used as a global counter integer g_a [input] integer subscript(ndim), inc [input] c Create task counter call nga_create(MT_F_INT,one,one,chunk,g_counter) call ga_zero(g_counter) : itask = nga_read_inc(g_counter,one,one) ... Translate itask into task ... Global Arrays Tutorial

Access • To provide direct access to local data in the specified patch of the array owned by the calling process: • Fortran subroutine nga_access(g_a, lo, hi, index, ld) • C void NGA_Access(int g_a, int lo[], int hi[], void *ptr, int ld[]) • Processes can access the local position of the global array • Process “0” can access the specified patch of its local position of the array • Avoids memory copy 0 1 2 Access: gives a pointer to this local patch 3 4 5 6 7 8 call nga_create(MT_F_DBL,2,dims,’Array’,chunk,g_a) : call nga_distribution(g_a,me,lo,hi) call nga_access(g_a,lo,hi,index,ld) call do_subroutine_task(dbl_mb(index),ld(1)) subroutine do_subroutine_task(a,ld1) double precision a(ld1,*) Global Arrays Tutorial

Locality Information (cont.) • Global Arrays support abstraction of a distributed array object • Object is represented by an integer handle • A process can access its portion of the data in the global array • To do this, the following steps need to be taken: • Find the distribution of an array, which part of the data the calling process own • Access the data • Operate on the data: read/write • Release the access to the data Global Arrays Tutorial

Non-blocking Operations • The non-blocking APIs are derived from the blocking interface by adding a handle argument that identifies an instance of the non-blocking request. • Fortran • subroutine nga_nbput(g_a, lo, hi, buf, ld, nbhandle) • subroutine nga_nbget(g_a, lo, hi, buf, ld, nbhandle) • subroutine nga_nbacc(g_a, lo, hi, buf, ld, alpha, nbhandle) • subroutine nga_nbwait(nbhandle) • C • void NGA_NbPut(int g_a, int lo[], int hi[], void *buf,int ld[], ga_nbhdl_t* nbhandle) • void NGA_NbGet(int g_a, int lo[], int hi[], void *buf, int ld[], ga_nbhdl_t* nbhandle) • void NGA_NbAcc(int g_a, int lo[], int hi[], void *buf, int ld[], void *alpha, ga_nbhdl_t* nbhandle) • int NGA_NbWait(ga_nbhdl_t* nbhandle) integer nbhandle - non-blocking request handle [output/input] Global Arrays Tutorial

Non-Blocking Operations double precision buf1(nmax,nmax) double precision buf2(nmax,nmax) : call nga_nbget(g_a,lo1,hi1,buf1,ld1,nb1) ncount = 1 do while(.....) if (mod(ncount,2).eq.1) then ... Evaluate lo2, hi2 call nga_nbget(g_a,lo2,hi2,buf2,nb2) call nga_wait(nb1) ... Do work using data in buf1 else ... Evaluate lo1, hi1 call nga_nbget(g_a,lo1,hi1,buf1,nb1) call nga_wait(nb2) ... Do work using data in buf2 endif ncount = ncount + 1 end do Global Arrays Tutorial

SUMMA Matrix Multiplication Issue NB Get A and B blocks do (until last chunk) issue NB Get to the next blocks wait for previous issued call compute A*B (sequential dgemm) NB atomic accumulate into “C” matrix done Computation Comm. (Overlap) C=A.B A B Advantages: - Minimum memory - Highly parallel - Overlaps computation and communication - latency hiding - exploits data locality - patch matrix multiplication (easy to use) - dynamic load balancing = patch matrix multiplication Global Arrays Tutorial

SUMMA Matrix Multiplication:Improvement over PBLAS/ScaLAPACK Global Arrays Tutorial

Processor Groups • To create a new processor group: • Fortran integer function ga_pgroup_create(list, size) • C intGA_Pgroup_create(int *list, int size) • To assign a processor groups: • Fortran subroutine ga_set_pgroup(g_a, p_handle) • C voidGA_Set_pgroup(int g_a, int p_handle) integer g_a - global array handle [input] integer p_handle - processor group handle [input] list[size] list of processor IDs in group [input] size number of processors in group [input] Global Arrays Tutorial

Processor Groups group A group B group C world group Global Arrays Tutorial

Processor Groups (cont.) • To set the default processor group • Fortran subroutine ga_pgroup_set_default(p_handle) • C voidGA_Pgroup_set_default(int p_handle) • To access information about the processor group: • Fortran • integer function ga_pgroup_nnodes(p_handle) • integer function ga_pgroup_nodeid(p_handle) • C • intGA_Pgroup_nnodes(int p_handle) • intGA_Pgroup_nodeid(int p_handle) integer p_handle - processor group handle [input] Global Arrays Tutorial

Global Arrays Tutorial: Develop Parallel Software