UPC collective functions
Steven Seidel
Department of Computer Science
Michigan Technological University
steve@mtu.edu
Overview
• Background
• Collective operations in the UPC language
• The V1.0 UPC collectives specification
• Relocalization operations
• Computational operations
• Performance and implementation issues
• Extensions
• Other work
Background
• UPC is an extension of C that provides a partitioned shared memory programming model.
• The V1.1 UPC spec was adopted on March 25.
• Processes in UPC are called threads.
• Each thread has a private (local) address space.
• All threads share a global address space that is partitioned among the threads.
• A shared object that resides in thread i's partition is said to have affinity to thread i.
• If thread i has affinity to a shared object x, accesses to x are expected to take less time than accesses to shared objects to which thread i does not have affinity.
UPC programming model

int i;
shared [5] int A[10*THREADS];

i = 3;
A[0] = 7;
A[i] = A[0] + 2;

(Figure: A is distributed in blocks of 5 across threads th0, th1, th2; each thread holds a private copy of i (here 3); A[0] holds 7 and A[3] holds 9, both in thread 0's partition.)
Collective operations in UPC
• If any thread calls a collective function, then all threads must also call that function.
• Collective arguments are single-valued: corresponding arguments must have the same value on every thread.
• V1.1 UPC contains several collective functions:
  • upc_notify and upc_wait
  • upc_barrier
  • upc_all_alloc
  • upc_all_lock_alloc
• These collectives provide synchronization and memory allocation across all threads.
shared void *upc_all_alloc(nblocks, nbytes);

This function allocates shared [nbytes] char[nblocks*nbytes].

shared [5] char *p;
p = upc_all_alloc(4, 5);   // called by every thread; all copies of p point to the same allocation

(Figure: the 4 blocks of 5 bytes are distributed across threads th0, th1, th2, and each thread's private pointer p refers to the same shared allocation.)
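upc_all_lock_alloc, listed on the previous slide, works similarly: every thread makes the same call and all of them receive a pointer to a single shared lock. A minimal sketch (the variable name l and the critical section are placeholders, not from the slides):

upc_lock_t *upc_all_lock_alloc(void);

upc_lock_t *l;              // private pointer on each thread
l = upc_all_lock_alloc();   // collective call; every thread gets the same lock
upc_lock(l);
// ... critical section ...
upc_unlock(l);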
The V1.0 UPC Collectives Spec
• First draft by Wiebel and Greenberg, March 2002.
• Spec discussed at the May 2002 and SC'02 UPC workshops.
• Many helpful comments from Dan Bonachea and Brian Wibecan.
• V1.0 will be released shortly.
Collective functions
• Initialization
  • upc_all_init
• "Relocalization" collectives change data affinity:
  • upc_all_broadcast
  • upc_all_scatter
  • upc_all_gather
  • upc_all_gather_all
  • upc_all_exchange
  • upc_all_permute
• "Computational" collectives for reduction and sorting:
  • upc_all_reduce
  • upc_all_prefix_reduce
  • upc_all_sort
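For reference, the relocalization collectives on the following slides share a common three-argument form, with a fourth perm argument for upc_all_permute. The sketch below assumes the pointer types used in the later implementation examples and may differ from the released spec:

void upc_all_broadcast (shared void *dst, shared const void *src, size_t blk);
void upc_all_scatter   (shared void *dst, shared const void *src, size_t blk);
void upc_all_gather    (shared void *dst, shared const void *src, size_t blk);
void upc_all_gather_all(shared void *dst, shared const void *src, size_t blk);
void upc_all_exchange  (shared void *dst, shared const void *src, size_t blk);
void upc_all_permute   (shared void *dst, shared const void *src,
                        shared const int *perm, size_t blk);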
void upc_all_broadcast(dst, src, blk);

Thread 0 sends the same block of data to each thread.

shared []    char src[blk];
shared [blk] char dst[blk*THREADS];

(Figure: the blk-byte block src, which has affinity to thread 0, is copied into each thread's block of dst on threads th0, th1, th2.)
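A hedged usage sketch, following the slide's declarations with a concrete block size; BLK and the barriers are illustrative, since (as discussed later) the V1.0 spec leaves synchronization to the caller:

#define BLK 16
shared []    char src[BLK];            // entire src has affinity to thread 0
shared [BLK] char dst[BLK*THREADS];    // one BLK-byte block per thread

// every thread makes the same call with the same arguments
upc_barrier;
upc_all_broadcast(dst, src, BLK);
upc_barrier;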
void upc_all_scatter(dst, src, blk);

Thread 0 sends a unique block of data to each thread.

shared []    char src[blk*THREADS];
shared [blk] char dst[blk*THREADS];

(Figure: block i of src, all of which has affinity to thread 0, is copied to thread i's block of dst.)
void upc_all_gather(dst, src, blk);

Each thread sends a block of data to thread 0.

shared [blk] char src[blk*THREADS];
shared []    char dst[blk*THREADS];

(Figure: thread i's block of src is copied into block i of dst, all of which has affinity to thread 0.)
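Since gather is the inverse of scatter, a short round-trip sketch may help; the array names and block size are illustrative, not from the slides:

#define BLK 8
shared []    char on0[BLK*THREADS];    // all of on0 has affinity to thread 0
shared [BLK] char work[BLK*THREADS];   // one block per thread

upc_all_scatter(work, on0, BLK);       // distribute thread 0's data
// ... each thread updates its own block of work ...
upc_barrier;                           // user-supplied synchronization (see later slides)
upc_all_gather(on0, work, BLK);        // collect the results back on thread 0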
void upc_all_gather_all(dst, src, blk);

Each thread sends one block of data to all threads.

(Figure: every thread ends up with a copy of every thread's block.)
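The slide does not show declarations for upc_all_gather_all. By analogy with gather, a plausible layout (an assumption, not from the slides) is that each thread contributes one block of src and receives all THREADS blocks in its part of dst:

shared [blk]         char src[blk*THREADS];
shared [blk*THREADS] char dst[blk*THREADS*THREADS];

upc_all_gather_all(dst, src, blk);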
void upc_all_exchange(dst, src, blk);

Each thread sends a unique block of data to each thread.

(Figure: an all-to-all exchange of blocks among threads th0, th1, th2.)
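Similarly, for upc_all_exchange each thread both contributes and receives THREADS blocks, so a plausible layout (again an assumption following the same pattern) gives both arrays blk*THREADS bytes per thread:

shared [blk*THREADS] char src[blk*THREADS*THREADS];
shared [blk*THREADS] char dst[blk*THREADS*THREADS];

upc_all_exchange(dst, src, blk);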
void upc_all_permute(dst, src, perm, blk);

Thread i sends a block of data to thread perm(i).

(Figure: with perm = {1, 2, 0}, thread 0's block goes to thread 1, thread 1's block to thread 2, and thread 2's block to thread 0.)
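A hedged sketch of setting up the permutation shown in the figure (a cyclic shift); declaring perm as a shared int array and the block size BLK are assumptions for illustration:

#define BLK 16
shared int   perm[THREADS];                  // perm[i] = destination thread for thread i's block
shared [BLK] char src[BLK*THREADS], dst[BLK*THREADS];

perm[MYTHREAD] = (MYTHREAD + 1) % THREADS;   // each thread fills in its own entry
upc_barrier;                                 // make perm visible before the collective
upc_all_permute(dst, src, perm, BLK);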
Computational collectives
• Reduce and prefix reduce
  • One function for each C scalar type, e.g., upc_all_reduceI(…) returns an int
  • Operations: +, *, &, |, XOR, &&, ||, min, max, or a user-defined binary function
• Sort
  • User-defined comparison function (a usage sketch follows)

void upc_all_sort(shared void *A, size_t size, size_t n, size_t blk,
                  int (*func)(shared void *, shared void *));
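A hedged sketch of calling upc_all_sort on a blocked shared array of ints, assuming size is the element size, n the number of elements, and blk the block size; the comparison function name is illustrative:

int cmp_int(shared void *a, shared void *b)
{
    int x = *(shared int *)a, y = *(shared int *)b;
    return (x > y) - (x < y);      // negative, zero, or positive, as for qsort
}

shared [3] int A[3*THREADS];
// ... fill A ...
upc_barrier;
upc_all_sort(A, sizeof(int), 3*THREADS, 3, cmp_int);   // element size, count, block size, comparator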
int upc_all_reduceI(src, UPC_ADD, n, blk, NULL);

Thread 0 receives the reduction under UPC_OP of src[0], …, src[n-1].

int i;
shared [3] int src[4*THREADS];

i = upc_all_reduceI(src, UPC_ADD, 12, 3, NULL);   // called by every thread

(Figure: with THREADS = 3, src holds 1, 2, 4, …, 2048 in blocks of 3; the per-thread partial sums are 3591, 56, and 448, and thread 0 receives the total 4095.)
void upc_all_prefix_reduceI(dst, src, UPC_ADD, n, blk, NULL);

dst[k] receives the reduction under UPC_OP of src[0], …, src[k].

shared [*] int src[3*THREADS], dst[3*THREADS];

(Figure: with THREADS = 3, src holds 1, 2, 4, …, 256; dst receives the prefix sums 1, 3, 7, 15, 31, 63, 127, 255, 511, laid out with the [*] blocking of three elements per thread.)
Performance and implementation issues
• "Push" or "pull"?
• Synchronization semantics
• Effects of data distribution
A "pull" implementation of upc_all_broadcast

void upc_all_broadcast( shared void *dst,
                        shared const void *src, size_t blk )
{
    upc_memcpy( (shared char *)dst + MYTHREAD,   // dst + MYTHREAD has affinity to this thread
                (shared char *)src, blk );       // each thread pulls its own copy from thread 0
}

(Figure: every thread reads the block from thread 0's src into its own part of dst.)
A "push" implementation of upc_all_broadcast

void upc_all_broadcast( shared void *dst,
                        shared const void *src, size_t blk )
{
    int i;
    upc_forall( i=0; i<THREADS; ++i; 0 )         // affinity 0: thread 0 executes every iteration
        upc_memcpy( (shared char *)dst + i,      // push one copy into each thread's block of dst
                    (shared char *)src, blk );
}

(Figure: thread 0 writes the block into every thread's part of dst.)
Synchronization semantics
• When are function arguments ready?
• When are function results available?
Synchronization semantics
• Arguments with affinity to thread i are ready when thread i calls the function; results with affinity to thread i are ready when thread i returns.
• This is appealing but it is incorrect: in a broadcast, thread 1 does not know when thread 0 is ready.
Synchronization semantics
• Require the implementation to provide barriers at function entry and exit.
• This is convenient for the programmer but it is likely to adversely affect performance.

void upc_all_broadcast( shared void *dst,
                        shared const void *src, size_t blk )
{
    upc_barrier;                                 // entry barrier
    upc_memcpy( (shared char *)dst + MYTHREAD,   // pull
                (shared char *)src, blk );
    upc_barrier;                                 // exit barrier
}
Synchronization semantics
• V1.0 spec: synchronization is a user responsibility.

#define numelems 10
shared [] int A[numelems];
shared [numelems] int B[numelems*THREADS];

void upc_all_broadcast( shared void *dst,
                        shared const void *src, size_t blk )
{
    upc_memcpy( (shared char *)dst + MYTHREAD,
                (shared char *)src, blk );
}

...
// Initialize A.
...
upc_barrier;
upc_all_broadcast( B, A, sizeof(int)*numelems );
upc_barrier;
Performance and implementation issues
• Data distribution affects both performance and implementation.
void upc_all_prefix_reduceI(dst, src, UPC_ADD, n, blk, NULL);

dst[k] receives the reduction under UPC_OP of src[0], …, src[k].

shared int src[3*THREADS], dst[3*THREADS];

(Figure: the same values and prefix sums as before, but src and dst now have the default cyclic layout, so src[i] has affinity to thread i % THREADS and consecutive elements round-robin across th0, th1, th2.)
Extensions
• Strided copying
• Vectors of offsets for src and dst arrays
• Variable-sized blocks
• Reblocking (cf. the preceding prefix reduce example)

shared     int src[3*THREADS];
shared [3] int dst[3*THREADS];

upc_forall(i=0; i<3*THREADS; i++; ?)   // '?': which affinity expression should drive the copy? (one choice is sketched below)
    dst[i] = src[i];
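One plausible way to fill in the '?' above (an assumption, not the slide's answer) is to let the owner of each destination element perform the copy:

int i;
shared     int src[3*THREADS];
shared [3] int dst[3*THREADS];

upc_forall(i=0; i<3*THREADS; i++; &dst[i])   // iteration i runs on the thread with affinity to dst[i]
    dst[i] = src[i];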
More sophisticated synchronization semantics
• Consider the "pull" implementation of broadcast. There is no need for two threads i and j (i, j != 0) to synchronize with each other. Each thread does a pairwise synchronization with thread 0. Thread i will not have to wait if it reaches its synchronization point after thread 0. Thread 0 returns from the call after it has sync'd with each thread. (A sketch of this idea follows.)
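A minimal sketch of this pairwise scheme for the "pull" broadcast, assuming hypothetical strict shared flags; resetting and reusing the flags between calls is ignored here:

strict shared int src_ready;              // set by thread 0 when src may be read
strict shared int done[THREADS];          // done[i] set by thread i after its copy

void bcast_pairwise( shared void *dst, shared const void *src, size_t blk )
{
    int i;
    if (MYTHREAD == 0) {
        src_ready = 1;                    // announce that src is ready
        upc_memcpy( (shared char *)dst, (shared char *)src, blk );
        done[0] = 1;
        for (i = 1; i < THREADS; ++i)     // thread 0 returns only after sync'ing with each thread
            while (!done[i]) ;
    } else {
        while (!src_ready) ;              // wait only for thread 0, never for other threads
        upc_memcpy( (shared char *)dst + MYTHREAD, (shared char *)src, blk );
        done[MYTHREAD] = 1;
    }
}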
What's next?
• The V1.0 collective spec will be adopted in the next few weeks.
• A reference implementation will be available from MTU immediately afterwards.
UPC projects at MTU
• MuPC run time system for UPC
• UPC memory model (Chuck Wallace)
• UPC programmability (Phil Merkey)
• UPC test suite (Phil Merkey)

UPC Michigan Tech home page: http://www.upc.mtu.edu