UPC collective functions
Steven Seidel
Department of Computer Science
Michigan Technological University
steve@mtu.edu
Overview
• Background
• Collective operations in the UPC language
• The V1.0 UPC collectives specification
• Relocalization operations
• Computational operations
• Performance and implementation issues
• Extensions
• Other work
Background
• UPC is an extension of C that provides a partitioned shared memory programming model.
• The V1.1 UPC spec was adopted on March 25.
• Processes in UPC are called threads.
• Each thread has a private (local) address space.
• All threads share a global address space that is partitioned among the threads.
• A shared object that resides in thread i's partition is said to have affinity to thread i.
• If thread i has affinity to a shared object x, accesses to x are expected to take less time than accesses to shared objects to which thread i does not have affinity.
UPC programming model

int i;
shared [5] int A[10*THREADS];

i = 3;
A[0] = 7;
A[i] = A[0] + 2;

(Figure: A is distributed in blocks of 5 across threads th0, th1, th2; each thread holds a private copy of i (here 3); A[0] holds 7 and A[3] holds 9, both in thread 0's partition.)
Collective operations in UPC
• If any thread calls a collective function, then all threads must also call that function.
• Collective arguments are single-valued: corresponding arguments must have the same value on every thread.
• V1.1 UPC contains several collective functions:
  • upc_notify and upc_wait
  • upc_barrier
  • upc_all_alloc
  • upc_all_lock_alloc
• These collectives provide synchronization and memory allocation across all threads.
shared void *upc_all_alloc(nblocks, nbytes);

This function allocates shared [nbytes] char[nblocks*nbytes].

shared [5] char *p;
p = upc_all_alloc(4, 5);   // called by every thread; all copies of p point to the same allocation

(Figure: the 4 blocks of 5 bytes are distributed across threads th0, th1, th2, and each thread's private pointer p refers to the same shared allocation.)
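upc_all_lock_alloc, listed on the previous slide, works similarly: every thread makes the same call and all of them receive a pointer to a single shared lock. A minimal sketch (the variable name l and the critical section are placeholders, not from the slides):

upc_lock_t *upc_all_lock_alloc(void);

upc_lock_t *l;              // private pointer on each thread
l = upc_all_lock_alloc();   // collective call; every thread gets the same lock
upc_lock(l);
// ... critical section ...
upc_unlock(l);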
The V1.0 UPC Collectives Spec
• First draft by Wiebel and Greenberg, March 2002.
• Spec discussed at the May 2002 and SC'02 UPC workshops.
• Many helpful comments from Dan Bonachea and Brian Wibecan.
• V1.0 will be released shortly.
Collective functions
• Initialization
  • upc_all_init
• "Relocalization" collectives change data affinity:
  • upc_all_broadcast
  • upc_all_scatter
  • upc_all_gather
  • upc_all_gather_all
  • upc_all_exchange
  • upc_all_permute
• "Computational" collectives for reduction and sorting:
  • upc_all_reduce
  • upc_all_prefix_reduce
  • upc_all_sort
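For reference, the relocalization collectives on the following slides share a common three-argument form, with a fourth perm argument for upc_all_permute. The sketch below assumes the pointer types used in the later implementation examples and may differ from the released spec:

void upc_all_broadcast (shared void *dst, shared const void *src, size_t blk);
void upc_all_scatter   (shared void *dst, shared const void *src, size_t blk);
void upc_all_gather    (shared void *dst, shared const void *src, size_t blk);
void upc_all_gather_all(shared void *dst, shared const void *src, size_t blk);
void upc_all_exchange  (shared void *dst, shared const void *src, size_t blk);
void upc_all_permute   (shared void *dst, shared const void *src,
                        shared const int *perm, size_t blk);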
void upc_all_broadcast(dst, src, blk);

Thread 0 sends the same block of data to each thread.

shared []    char src[blk];
shared [blk] char dst[blk*THREADS];

(Figure: the blk-byte block src, which has affinity to thread 0, is copied into each thread's block of dst on threads th0, th1, th2.)
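A hedged usage sketch, following the slide's declarations with a concrete block size; BLK and the barriers are illustrative, since (as discussed later) the V1.0 spec leaves synchronization to the caller:

#define BLK 16
shared []    char src[BLK];            // entire src has affinity to thread 0
shared [BLK] char dst[BLK*THREADS];    // one BLK-byte block per thread

// every thread makes the same call with the same arguments
upc_barrier;
upc_all_broadcast(dst, src, BLK);
upc_barrier;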
void upc_all_scatter(dst, src, blk);

Thread 0 sends a unique block of data to each thread.

shared []    char src[blk*THREADS];
shared [blk] char dst[blk*THREADS];

(Figure: block i of src, all of which has affinity to thread 0, is copied to thread i's block of dst.)
void upc_all_gather(dst, src, blk);

Each thread sends a block of data to thread 0.

shared [blk] char src[blk*THREADS];
shared []    char dst[blk*THREADS];

(Figure: thread i's block of src is copied into block i of dst, all of which has affinity to thread 0.)
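Since gather is the inverse of scatter, a short round-trip sketch may help; the array names and block size are illustrative, not from the slides:

#define BLK 8
shared []    char on0[BLK*THREADS];    // all of on0 has affinity to thread 0
shared [BLK] char work[BLK*THREADS];   // one block per thread

upc_all_scatter(work, on0, BLK);       // distribute thread 0's data
// ... each thread updates its own block of work ...
upc_barrier;                           // user-supplied synchronization (see later slides)
upc_all_gather(on0, work, BLK);        // collect the results back on thread 0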
void upc_all_gather_all(dst, src, blk);

Each thread sends one block of data to all threads.

(Figure: every thread ends up with a copy of every thread's block.)
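The slide does not show declarations for upc_all_gather_all. By analogy with gather, a plausible layout (an assumption, not from the slides) is that each thread contributes one block of src and receives all THREADS blocks in its part of dst:

shared [blk]         char src[blk*THREADS];
shared [blk*THREADS] char dst[blk*THREADS*THREADS];

upc_all_gather_all(dst, src, blk);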
void upc_all_exchange(dst, src, blk);

Each thread sends a unique block of data to each thread.

(Figure: an all-to-all exchange of blocks among threads th0, th1, th2.)
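Similarly, for upc_all_exchange each thread both contributes and receives THREADS blocks, so a plausible layout (again an assumption following the same pattern) gives both arrays blk*THREADS bytes per thread:

shared [blk*THREADS] char src[blk*THREADS*THREADS];
shared [blk*THREADS] char dst[blk*THREADS*THREADS];

upc_all_exchange(dst, src, blk);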
void upc_all_permute(dst, src, perm, blk);

Thread i sends a block of data to thread perm(i).

(Figure: with perm = {1, 2, 0}, thread 0's block goes to thread 1, thread 1's block to thread 2, and thread 2's block to thread 0.)
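A hedged sketch of setting up the permutation shown in the figure (a cyclic shift); declaring perm as a shared int array and the block size BLK are assumptions for illustration:

#define BLK 16
shared int   perm[THREADS];                  // perm[i] = destination thread for thread i's block
shared [BLK] char src[BLK*THREADS], dst[BLK*THREADS];

perm[MYTHREAD] = (MYTHREAD + 1) % THREADS;   // each thread fills in its own entry
upc_barrier;                                 // make perm visible before the collective
upc_all_permute(dst, src, perm, BLK);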
Computational collectives
• Reduce and prefix reduce
  • One function for each C scalar type, e.g., upc_all_reduceI(…) returns an int
  • Operations: +, *, &, |, XOR, &&, ||, min, max, or a user-defined binary function
• Sort
  • User-defined comparison function (a usage sketch follows)

void upc_all_sort(shared void *A, size_t size, size_t n, size_t blk,
                  int (*func)(shared void *, shared void *));
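A hedged sketch of calling upc_all_sort on a blocked shared array of ints, assuming size is the element size, n the number of elements, and blk the block size; the comparison function name is illustrative:

int cmp_int(shared void *a, shared void *b)
{
    int x = *(shared int *)a, y = *(shared int *)b;
    return (x > y) - (x < y);      // negative, zero, or positive, as for qsort
}

shared [3] int A[3*THREADS];
// ... fill A ...
upc_barrier;
upc_all_sort(A, sizeof(int), 3*THREADS, 3, cmp_int);   // element size, count, block size, comparator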
int upc_all_reduceI(src, UPC_ADD, n, blk, NULL);

Thread 0 receives the reduction under UPC_OP of src[0], …, src[n-1].

int i;
shared [3] int src[4*THREADS];

i = upc_all_reduceI(src, UPC_ADD, 12, 3, NULL);   // called by every thread

(Figure: with THREADS = 3, src holds 1, 2, 4, …, 2048 in blocks of 3; the per-thread partial sums are 3591, 56, and 448, and thread 0 receives the total 4095.)
void upc_all_prefix_reduceI(dst, src, UPC_ADD, n, blk, NULL);

dst[k] receives the reduction under UPC_OP of src[0], …, src[k].

shared [*] int src[3*THREADS], dst[3*THREADS];

(Figure: with THREADS = 3, src holds 1, 2, 4, …, 256; dst receives the prefix sums 1, 3, 7, 15, 31, 63, 127, 255, 511, laid out with the [*] blocking of three elements per thread.)
Performance and implementation issues
• "Push" or "pull"?
• Synchronization semantics
• Effects of data distribution
A "pull" implementation of upc_all_broadcast

void upc_all_broadcast( shared void *dst,
                        shared const void *src, size_t blk )
{
    upc_memcpy( (shared char *)dst + MYTHREAD,   // dst + MYTHREAD has affinity to this thread
                (shared char *)src, blk );       // each thread pulls its own copy from thread 0
}

(Figure: every thread reads the block from thread 0's src into its own part of dst.)
A "push" implementation of upc_all_broadcast

void upc_all_broadcast( shared void *dst,
                        shared const void *src, size_t blk )
{
    int i;
    upc_forall( i=0; i<THREADS; ++i; 0 )         // affinity 0: thread 0 executes every iteration
        upc_memcpy( (shared char *)dst + i,      // push one copy into each thread's block of dst
                    (shared char *)src, blk );
}

(Figure: thread 0 writes the block into every thread's part of dst.)
Synchronization semantics
• When are function arguments ready?
• When are function results available?
Synchronization semantics
• Arguments with affinity to thread i are ready when thread i calls the function; results with affinity to thread i are ready when thread i returns.
• This is appealing but it is incorrect: in a broadcast, thread 1 does not know when thread 0 is ready.
Synchronization semantics
• Require the implementation to provide barriers at function entry and exit.
• This is convenient for the programmer but it is likely to adversely affect performance.

void upc_all_broadcast( shared void *dst,
                        shared const void *src, size_t blk )
{
    upc_barrier;                                 // entry barrier
    upc_memcpy( (shared char *)dst + MYTHREAD,   // pull
                (shared char *)src, blk );
    upc_barrier;                                 // exit barrier
}
Synchronization semantics
• V1.0 spec: synchronization is a user responsibility.

#define numelems 10
shared [] int A[numelems];
shared [numelems] int B[numelems*THREADS];

void upc_all_broadcast( shared void *dst,
                        shared const void *src, size_t blk )
{
    upc_memcpy( (shared char *)dst + MYTHREAD,
                (shared char *)src, blk );
}

...
// Initialize A.
...
upc_barrier;
upc_all_broadcast( B, A, sizeof(int)*numelems );
upc_barrier;
Performance and implementation issues
• Data distribution affects both performance and implementation.
void upc_all_prefix_reduceI(dst, src, UPC_ADD, n, blk, NULL);

dst[k] receives the reduction under UPC_OP of src[0], …, src[k].

shared int src[3*THREADS], dst[3*THREADS];

(Figure: the same values and prefix sums as before, but src and dst now have the default cyclic layout, so src[i] has affinity to thread i % THREADS and consecutive elements round-robin across th0, th1, th2.)
Extensions
• Strided copying
• Vectors of offsets for src and dst arrays
• Variable-sized blocks
• Reblocking (cf. the preceding prefix reduce example)

shared     int src[3*THREADS];
shared [3] int dst[3*THREADS];

upc_forall(i=0; i<3*THREADS; i++; ?)   // '?': which affinity expression should drive the copy? (one choice is sketched below)
    dst[i] = src[i];
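One plausible way to fill in the '?' above (an assumption, not the slide's answer) is to let the owner of each destination element perform the copy:

int i;
shared     int src[3*THREADS];
shared [3] int dst[3*THREADS];

upc_forall(i=0; i<3*THREADS; i++; &dst[i])   // iteration i runs on the thread with affinity to dst[i]
    dst[i] = src[i];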
More sophisticated synchronization semantics
• Consider the "pull" implementation of broadcast. There is no need for two threads i and j (i, j != 0) to synchronize with each other. Each thread does a pairwise synchronization with thread 0. Thread i will not have to wait if it reaches its synchronization point after thread 0. Thread 0 returns from the call after it has sync'd with each thread. (A sketch of this idea follows.)
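A minimal sketch of this pairwise scheme for the "pull" broadcast, assuming hypothetical strict shared flags; resetting and reusing the flags between calls is ignored here:

strict shared int src_ready;              // set by thread 0 when src may be read
strict shared int done[THREADS];          // done[i] set by thread i after its copy

void bcast_pairwise( shared void *dst, shared const void *src, size_t blk )
{
    int i;
    if (MYTHREAD == 0) {
        src_ready = 1;                    // announce that src is ready
        upc_memcpy( (shared char *)dst, (shared char *)src, blk );
        done[0] = 1;
        for (i = 1; i < THREADS; ++i)     // thread 0 returns only after sync'ing with each thread
            while (!done[i]) ;
    } else {
        while (!src_ready) ;              // wait only for thread 0, never for other threads
        upc_memcpy( (shared char *)dst + MYTHREAD, (shared char *)src, blk );
        done[MYTHREAD] = 1;
    }
}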
What's next?
• The V1.0 collective spec will be adopted in the next few weeks.
• A reference implementation will be available from MTU immediately afterwards.
UPC projects at MTU
• MuPC run time system for UPC
• UPC memory model (Chuck Wallace)
• UPC programmability (Phil Merkey)
• UPC test suite (Phil Merkey)

UPC Michigan Tech home page: http://www.upc.mtu.edu