
  1. Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

  2. Agenda
     • StarSs overview
     • SMPSs
     • SMPSs examples
     • Single node hands-on
     • Hybrid model MPI/SMPSs
     • Programming examples
     • MPI/SMPSs hands-on
     Slides available in: http://marsa.ac.upc.edu/prace/course_material_starss
     /gpfs/scratch/bsc19/bsc19776/TutorialPRACE/tutorial_PRACE_starss.ppt

  3. StarSs overview

  4. StarSs
     A "node" level programming model: C/Fortran + directives
     • Nicely integrates in hybrid MPI/StarSs
     • Natural support for heterogeneity
     Programmability
     • Incremental parallelization/restructure
     • Abstract/separate algorithmic issues from resources
     • Disciplined programming
     Portability
     • "Same" source code runs on "any" machine
     • Optimized task implementations will result in better performance
     • "Single source" for the maintained version of an application
     Performance
     • Asynchronous (data-flow) execution and locality awareness
     • Intelligent runtime, specific for each type of target platform: automatically extracts and exploits parallelism, matches computations to resources
     The StarSs family: GridSs, CellSs, CompSs (Java), SMPSs, ClusterSs, GPUSs, ClearSpeedSs

  5. History / Strategy
     PERMAS ~1994, NANOS ~1996, GridSs ~2002, CellSs ~2006, SMPSs V1 ~2007, CompSs ~2007, NANOS++ ~2008, SMPSs V2 ~2009, GPUSs ~2009

  6. StarSs: a sequential program …

     void vadd3 (float A[BS], float B[BS], float C[BS]);
     void scale_add (float sum, float A[BS], float B[BS]);
     void accum (float A[BS], float *sum);

     for (i=0; i<N; i+=BS)   // C = A+B
        vadd3 (&A[i], &B[i], &C[i]);
     ...
     for (i=0; i<N; i+=BS)   // sum(C[i])
        accum (&C[i], &sum);
     ...
     for (i=0; i<N; i+=BS)   // B = sum*A
        scale_add (sum, &E[i], &B[i]);
     ...
     for (i=0; i<N; i+=BS)   // A = C+D
        vadd3 (&C[i], &D[i], &A[i]);
     ...
     for (i=0; i<N; i+=BS)   // E = G+F
        vadd3 (&G[i], &F[i], &E[i]);

  7. StarSs: … taskified …
     [Task dependence graph: color/number give the order of task instantiation; some antidependences covered by flow dependences are not drawn]
     Dependences are computed at task instantiation time.

     #pragma css task input(A, B) output(C)
     void vadd3 (float A[BS], float B[BS], float C[BS]);
     #pragma css task input(sum, A) inout(B)
     void scale_add (float sum, float A[BS], float B[BS]);
     #pragma css task input(A) inout(sum)
     void accum (float A[BS], float *sum);

     for (i=0; i<N; i+=BS)   // C = A+B
        vadd3 (&A[i], &B[i], &C[i]);
     ...
     for (i=0; i<N; i+=BS)   // sum(C[i])
        accum (&C[i], &sum);
     ...
     for (i=0; i<N; i+=BS)   // B = sum*A
        scale_add (sum, &E[i], &B[i]);
     ...
     for (i=0; i<N; i+=BS)   // A = C+D
        vadd3 (&C[i], &D[i], &A[i]);
     ...
     for (i=0; i<N; i+=BS)   // E = G+F
        vadd3 (&G[i], &F[i], &E[i]);

  8. StarSs: … and executed in a data-flow model
     Decouple how we write from how it is executed.
     [Figure: the same code shown twice, as written (sequential order) and as executed (data-flow order); color/number give a possible order of task execution]

     #pragma css task input(A, B) output(C)
     void vadd3 (float A[BS], float B[BS], float C[BS]);
     #pragma css task input(sum, A) inout(B)
     void scale_add (float sum, float A[BS], float B[BS]);
     #pragma css task input(A) inout(sum)
     void accum (float A[BS], float *sum);

     for (i=0; i<N; i+=BS)   // C = A+B
        vadd3 (&A[i], &B[i], &C[i]);
     ...
     for (i=0; i<N; i+=BS)   // sum(C[i])
        accum (&C[i], &sum);
     ...
     for (i=0; i<N; i+=BS)   // B = sum*A
        scale_add (sum, &E[i], &B[i]);
     ...
     for (i=0; i<N; i+=BS)   // A = C+D
        vadd3 (&C[i], &D[i], &A[i]);
     ...
     for (i=0; i<N; i+=BS)   // E = G+F
        vadd3 (&G[i], &F[i], &E[i]);

  9. StarSs: the potential of data access information
     • Flat global address space seen by the programmer
     • Flexibility to dynamically traverse the dataflow graph, "optimizing":
        • Concurrency, critical path
        • Memory access: data transfers performed by the runtime
     • Opportunities for:
        • Prefetch
        • Reuse
        • Eliminating antidependences (renaming; see the sketch below)
        • Replication management
     • Coherency/consistency handled by the runtime
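     To make the renaming point concrete, here is a minimal sketch (mine, not from the slides; the task names are invented) of a write-after-read antidependence the runtime can remove by giving the second writer a fresh copy of the block:

     /* Sketch only: illustrates runtime renaming under the clauses above. */
     #pragma css task input(A) output(B)
     void produce (float A[BS], float B[BS]);   /* reads block A */

     #pragma css task output(A)
     void refill (float A[BS]);                 /* overwrites block A */

     ...
     produce (&A[0], &B[0]);   /* task 1 reads A[0..BS-1] */
     refill  (&A[0]);          /* task 2: WAR on A with task 1. Instead of
                                  waiting, the runtime can rename A so that
                                  both tasks run concurrently. */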

  10. StarSs: … reductions
     [Task dependence graph: color/number give a possible order of task execution; with the reduction clause the accum() tasks no longer form a serial chain]

     #pragma css task input(A, B) output(C)
     void vadd3 (float A[BS], float B[BS], float C[BS]);
     #pragma css task input(sum, A) inout(B)
     void scale_add (float sum, float A[BS], float B[BS]);
     #pragma css task input(A) inout(sum) reduction(sum)
     void accum (float A[BS], float *sum);

     for (i=0; i<N; i+=BS)   // C = A+B
        vadd3 (&A[i], &B[i], &C[i]);
     ...
     for (i=0; i<N; i+=BS)   // sum(C[i])
        accum (&C[i], &sum);
     ...
     for (i=0; i<N; i+=BS)   // B = sum*A
        scale_add (sum, &E[i], &B[i]);
     ...
     for (i=0; i<N; i+=BS)   // A = C+D
        vadd3 (&C[i], &D[i], &A[i]);
     ...
     for (i=0; i<N; i+=BS)   // E = G+F
        vadd3 (&G[i], &F[i], &E[i]);

  11. StarSs: just a few directives

     #pragma css task [input(parameters)]                \
                      [output(parameters)]               \
                      [inout(parameters)]                \
                      [target device([cell, smp, cuda])] \
                      [implements(task_name)]            \
                      [reduction(parameters)]            \
                      [highpriority]
     #pragma css wait on (data_address)
     #pragma css barrier
     #pragma css mutex lock (variable)
     #pragma css mutex unlock (variable)

     parameters: parameter [, parameter]*
     parameter: variable_name {[dimension]}*
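     Putting the directives together, a minimal sketch of a complete program in the style of the slides (my own example, not from the tutorial):

     #include <stdio.h>
     #define N  1024
     #define BS 128

     float a[N];

     #pragma css task input(a) inout(acc)
     void acc_block (float a[BS], float *acc)
     {
        int j;
        for (j = 0; j < BS; j++)
           *acc += a[j];
     }

     int main ()
     {
        int i;
        float acc = 0.0f;

        #pragma css start
        for (i = 0; i < N; i++) a[i] = 1.0f;

        for (i = 0; i < N; i += BS)   /* each call spawns an asynchronous task */
           acc_block (&a[i], &acc);

        #pragma css wait on (acc)     /* the inout(acc) chain serializes the
                                         tasks; this waits for the last one */
        printf ("acc = %f\n", acc);   /* expected: 1024.0 */
        #pragma css finish
        return 0;
     }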

  12. StarSs Syntax
     Examples: task selection

     #pragma css task input(A, B) inout(C)
     void block_addmultiply (float C[N][N], float A[N][N], float B[N][N]) { ...

     #pragma css task input(A[BS][BS], B[BS][BS]) inout(C[BS][BS])
     void block_addmultiply (float *C, float *A, float *B) { ..

     #pragma css task input(A[size][size], B[size][size], size) inout(C[size][size])
     void block_addmultiply (float *C, float *A, float *B, int size) { ..

     Note: task annotations can be placed before the function code or before the function declaration, in a header file.

     Examples: waiting for data

     #pragma css task input (ref_block, to_comp) output (mse)
     void are_blocks_equal (float ref_block[BS][BS], float to_comp[BS][BS], float *mse) { ...
     ...
     are_blocks_equal (X[ii][jj], Y[ii][jj], &sq_error);
     #pragma css wait on (sq_error)
     if (sq_error > 0.0000001) { ...

  13. StarSs Syntax
     Examples: reductions

     #pragma css task input (vec, n) inout(results) reduction (results)
     void sum_task(int *vec, int n, int *results)
     {
        int i;
        int local_sol = 0;
        for (i = 0; i < n; i++)
           local_sol += vec[i];
        #pragma css mutex lock (results)
        *results = *results + local_sol;
        #pragma css mutex unlock (results)
     }

     The parameter used in the mutex can be anything that identifies the access uniquely:

     #pragma css task input (vec, n, id) inout(results) reduction (results)
     void sum_task(int *vec, int n, int *results, int id)
     {
        int i;
        int local_sol = 0;
        for (i = 0; i < n; i++)
           local_sol += vec[i];
        #pragma css mutex lock (id)
        *results = *results + local_sol;
        #pragma css mutex unlock (id)
     }
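     A possible driver for the first variant (my own sketch, not from the slides): with reduction(results) the partial sums may be accumulated in any order, so a single barrier before reading the total suffices.

     #define N  4096
     #define BS 512

     int vec[N];
     int total = 0;
     int i;
     ...
     for (i = 0; i < N; i += BS)   /* one reduction task per block */
        sum_task (&vec[i], BS, &total);

     #pragma css barrier            /* all partial sums are in total now */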

  14. StarSs Syntax
     MPI tasks

     #pragma css task input(partner, bufsend) output(bufrecv) target(comm_thread)
     void communication(int partner, double bufsend[BLOCK_SIZE], double bufrecv[BLOCK_SIZE])
     {
        int ierr;
        MPI_Request request[2];
        MPI_Status status[2];
        ierr = MPI_Isend(bufsend, BLOCK_SIZE, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &request[0]);
        ierr = MPI_Irecv(bufrecv, BLOCK_SIZE, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &request[1]);
        MPI_Waitall(2, request, status);
     }
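     Because the task runs on the communication thread and declares its data footprint, the exchange can overlap with independent computation. A hedged usage sketch (the two compute tasks are hypothetical, only communication() is from the slide):

     communication (partner, bufsend, bufrecv);   /* asynchronous exchange task */
     compute_interior (inner);                    /* hypothetical task with no
                                                     dependence on bufrecv: may
                                                     run concurrently */
     compute_boundary (bufrecv, border);          /* hypothetical task with
                                                     input(bufrecv): scheduled
                                                     after the exchange completes */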

  15. StarSs Syntax in Fortran

     subroutine example()
        ...
        interface
           !$CSS TASK
           subroutine block_add_multiply(C, A, B, BS)
              implicit none
              integer, intent (in) :: BS
              real, intent (in) :: A(BS,BS), B(BS,BS)
              real, intent (inout) :: C(BS,BS)
           end subroutine
        end interface
        ...
        !$CSS START
        ...
        call block_add_multiply(C, A, B, BLOCK_SIZE)
        ...
        !$CSS FINISH
        ...
     end subroutine

     !$CSS TASK
     subroutine block_add_multiply(C, A, B, BS)
        ...
     end subroutine

  16. StarSs Syntax in Fortran: with MPI

     subroutine example()
        ...
        interface
           !$CSS TASK TARGET(COMM_THREAD)
           subroutine checksum(i, u1, d1, d2, d3)
              implicit none
              integer, intent(in) :: i, d1, d2, d3
              double complex, intent(in) :: u1(d1*d2*d3)
           end subroutine
        end interface
        ...

  17. SMPSs

  18. SMPSs implementation
     [Figure: the SMPSs runtime drawn by analogy with an out-of-order processor pipeline (IFU, DEC, REN, IQ, ISS, REG, FU, RET): a main thread instantiates tasks and slave threads execute them]
     J.M. Perez, et al., "A Dependency-Aware Task-Based Programming Environment for Multi-Core Architectures", Cluster 2008

  19. SMPSs: compiler phase
     app.c → SMPSS-CC code translation (mcc) → smpss-cc_app.c + app.tasks (tasks list)
     smpss-cc_app.c → C compiler (gcc, icc, ...) → smpss-cc_app.o
     smpss-cc_app.o + app.tasks → pack → app.o

  20. SMPSs: linker phase
     app.o → unpack → smpss-cc_app.o + app.tasks
     app.tasks → SMPSS-CC glue code generator → exec-adapters.c, exec-registration.c, app-adapters.c
     generated sources → C compiler (gcc, icc, ...) → exec-adapters.o, exec-registration.o
     smpss-cc_app.o + adapter/registration objects + libSMPSS.so → linker → exec

  21. SMPSs Programming examples

  22. Programming examples

  23. Simple examples
     Where the information on sizes and addresses comes from:

     #pragma css task input(A, B) inout(C)
     void dgemm( float C[N][N], float A[N][N], float B[N][N] ) { ...}

     #pragma css task input(A[BS][BS], B[BS][BS]) inout(C[BS][BS])
     void dgemm( float *C, float *A, float *B ) { ...}

     Barriers, here used for timing purposes:

     #pragma css task input(A, B) inout(C)
     void dgemm( float C[N][N], float A[N][N], float B[N][N] ) { ...}

     main() {
        ...
        #pragma css barrier
        t = time();
        dgemm(A,B,C);
        dgemm(D,E,F);
        dgemm(C,F,G);
        dgemm(C,E,H);
        #pragma css barrier
        t = time() - t;
        printf ("result G = %f; time = %d\n", G[0][0], t);
     }

     [Task dependence graph of the four dgemm calls]

  24. Simple examples

     #pragma css task input(A, B) inout(C)
     void dgemm( float C[N][N], float A[N][N], float B[N][N] ) { ...}

     main() {
        ...
        dgemm(A,B,C);
        dgemm(D,E,F);
        dgemm(C,F,G);
        dgemm(A,D,H);
        dgemm(C,H,I);
        #pragma css wait on (F)
        printf ("result F = %f\n", F[0][0]);
        dgemm(H,G,C);
        #pragma css barrier
        printf ("result C = %f\n", C[0][0]);
     }

     [Task dependence graph of the six dgemm calls]

  25. MxM on a matrix stored by blocks (NB×NB blocks of BS×BS elements)

     int main (int argc, char **argv)
     {
        int i, j, k;
        ...
        initialize(A, B, C);
        for (i=0; i < NB; i++)
           for (j=0; j < NB; j++)
              for (k=0; k < NB; k++)
                 mm_tile( C[i][j], A[i][k], B[k][j]);
     }

     #pragma css task input(A, B) inout(C)
     static void mm_tile ( float C[BS][BS], float A[BS][BS], float B[BS][BS])
     {
        int i, j, k;
        for (i=0; i < BS; i++)
           for (j=0; j < BS; j++)
              for (k=0; k < BS; k++)
                 C[i][j] += A[i][k] * B[k][j];
     }

     Will work on matrices of any size; will work on any number of cores/devices.

  26. MxM @ CellSs: leverage existing kernels, libraries, … (NB×NB blocks of 64×64 elements)

     int main (int argc, char **argv)
     {
        int i, j, k;
        ...
        initialize(A, B, C);
        for (i=0; i < NB; i++)
           for (j=0; j < NB; j++)
              for (k=0; k < NB; k++)
                 dgemm_64x64( C[i][j], A[i][k], B[k][j]);
     }

     #ifdef SPU_CODE
     #include <blas_s.h>
     #endif
     #pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
     void dgemm_64x64(double *C, double *A, double *B);

     Hand-tuned SPU kernel (excerpt):

     static void dgemm_64x64(volatile float *blkC, volatile float *blkA, volatile float *blkB)
     {
        unsigned int i;
        volatile float *ptrA, *ptrB, *ptrC;
        vector float a0, a1, a2, a3;
        vector float a00, a01, a02, a03;
        vector float a10, a11, a12, a13;
        .....
        for(i=0; i<M; i+=4){
           ptrA = &blkA[i*M];
           ptrB = &blkB[0];
           ptrC = &blkC[i*M];
           a0 = *((volatile vector float *)(ptrA));
           a1 = *((volatile vector float *)(ptrA+M));
           a2 = *((volatile vector float *)(ptrA+2*M));
           a3 = *((volatile vector float *)(ptrA+3*M));
           a00 = spu_shuffle(a0, a0, pat0);
           a01 = spu_shuffle(a0, a0, pat1);
           a02 = spu_shuffle(a0, a0, pat2);
           a03 = spu_shuffle(a0, a0, pat3);
           a10 = spu_shuffle(a1, a1, pat0);
           a11 = spu_shuffle(a1, a1, pat1);
           ……
           a33 = spu_shuffle(a3, a3, pat3);
           Loads4RegSetA(0); Ops4RegSetA(); Loads4RegSetB(8);
           StageCBA(0,0); StageACB(8,0); StageBAC(16,0);
           StageCBA(24,0); StageACB(32,0); StageBAC(40,0); StageMISC(0,0);
           StageCBAmod(0,4); StageACB(8,4); StageBAC(16,4);
           StageCBA(24,4); StageACB(32,4); StageBAC(40,4); StageMISC(4,4);
           StageCBAmod(0,8); StageACB(8,8);
           …..

     #define StageCBA(OFFSET,OFFB) \
     { \
        ALIGN8B; \
        SPU_FMA(c0_0B,a00,b0_0B,c0_0B); c0_0C = *((volatile vector float *)(ptrC+OFFSET+16)); \
        SPU_FMA(c1_0B,a10,b0_0B,c1_0B); c1_0C = *((volatile vector float *)(ptrC+M+OFFSET+16)); \
        SPU_FMA(c2_0B,a20,b0_0B,c2_0B); c2_0C = *((volatile vector float *)(ptrC+2*M+OFFSET+16)); \
        SPU_FMA(c3_0B,a30,b0_0B,c3_0B); SPU_LNOP; \
        SPU_FMA(c0_1B,a00,b0_1B,c0_1B); c3_0C = *((volatile vector float *)(ptrC+3*M+OFFSET+16)); \
        SPU_FMA(c1_1B,a10,b0_1B,c1_1B); c0_1C = *((volatile vector float *)(ptrC+OFFSET+20)); \
        SPU_FMA(c2_1B,a20,b0_1B,c2_1B); c1_1C = *((volatile vector float *)(ptrC+M+OFFSET+20)); \
        SPU_FMA(c3_1B,a30,b0_1B,c3_1B); SPU_LNOP; \
        SPU_FMA(c0_0B,a01,b1_0B,c0_0B); c2_1C = *((volatile vector float *)(ptrC+2*M+OFFSET+20)); \
        …….

  27. MxM @ GPUSs using a CUBLAS kernel (NB×NB blocks of BS×BS elements)

     int main (int argc, char **argv)
     {
        int i, j, k;
        ...
        initialize(A, B, C);
        for (i=0; i < NB; i++)
           for (j=0; j < NB; j++)
              for (k=0; k < NB; k++)
                 mm_tile( A[i][k], B[k][j], C[i][j], BS);
     }

     #pragma css task input(A[NB][NB], B[NB][NB], NB) \
                      inout(C[NB][NB]) target device(cuda)
     void mm_tile (float *A, float *B, float *C, int NB)
     {
        unsigned char TR = 'T', NT = 'N';
        float DONE = 1.0, DMONE = -1.0;
        cublasSgemm (NT, NT, NB, NB, NB, DMONE, A, NB, B, NB, DONE, C, NB);
     }

  28. Cholesky factorization
     • Common matrix operation, used for example to solve the normal equations in linear least squares problems
     • Computes a triangular matrix L from a symmetric positive definite matrix A: Cholesky(A) = L, with L · Lᵗ = A
     • Different implementations are possible, depending on how the matrix is traversed (by rows, by columns, left-looking, right-looking)
     • It can be decomposed into block operations

  29. Cholesky factorization
     [Figure: in each iteration the red and blue blocks of the matrix are updated]
     • SPOTRF: computes the Cholesky factorization of the diagonal block
     • STRSM: computes the column panel
     • SSYRK: computes the row panel
     • SGEMM: updates the rest of the matrix
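     As a block-level view of iteration j (my own summary of the left-looking variant, derived directly from the code comments on the next slide):

     \begin{align*}
     A_{ij} &\leftarrow A_{ij} - A_{ik} A_{jk}^{T}, && k < j,\ i > j && \text{(SGEMM)}\\
     A_{jj} &\leftarrow A_{jj} - A_{ji} A_{ji}^{T}, && i < j         && \text{(SSYRK)}\\
     A_{jj} &\leftarrow L_{jj}\ \text{such that}\ L_{jj} L_{jj}^{T} = A_{jj} && && \text{(SPOTRF)}\\
     A_{ij} &\leftarrow A_{ij}\, L_{jj}^{-T}, && i > j         && \text{(STRSM)}
     \end{align*}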

  30. Cholesky factorization (DIM×DIM blocks of 64×64 elements)

     main (){
        ...
        for (int j = 0; j < DIM; j++){
           for (int k = 0; k < j; k++){
              for (int i = j+1; i < DIM; i++){
                 // A[i,j] = A[i,j] - A[i,k] * (A[j,k])^t
                 css_sgemm_tile( A[i][k], A[j][k], A[i][j] );
              }
           }
           for (int i = 0; i < j; i++){
              // A[j,j] = A[j,j] - A[j,i] * (A[j,i])^t
              css_ssyrk_tile(A[j][i], A[j][j]);
           }
           // Cholesky factorization of A[j,j]
           css_spotrf_tile( A[j][j] );
           for (int i = j+1; i < DIM; i++){
              // A[i,j] <- X, where X * (A[j,j])^t = A[i,j]
              css_strsm_tile( A[j][j], A[i][j] );
           }
        }
        ...
        for (int i = 0; i < DIM; i++) {
           for (int j = 0; j < DIM; j++) {
              #pragma css wait on (A[i][j])
              print_block(A[i][j]);
           }
        }
        ...
     }

     #pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
     void sgemm_tile(float *A, float *B, float *C);
     #pragma css task input (T[64][64]) inout(B[64][64])
     void strsm_tile(float *T, float *B);
     #pragma css task input(A[64][64]) inout(C[64][64])
     void ssyrk_tile(float *A, float *C);
     #pragma css task inout(A[64][64])
     void spotrf_tile(float *A);

  31. Cholesky factorization

  32. N Queens
     Compute the number of solutions to the problem of placing N queens on an N × N board such that no queen attacks any other.

  33. N Queens: sequential version

     void nqueens(int n, int j, char *a, int *results)
     {
        int i;
        if (n == j) {
           /* good solution: record the first one and count it */
           if (!find_solution){
              find_solution = 1;
              memcpy(sol, a, n * sizeof(char));
           }
           (*results)++;
        } else {
           /* try each possible position for queen <j> */
           for (i = 0; i < n; i++) {
              a[j] = i;
              if (ok(j + 1, a)) {
                 nqueens(n, j + 1, a, results);
              }
           }
        }
     }
     ...
     nqueens (n, 0, a, &total_res);
     ...

     int ok(int n, char *a)
     {
        int i, j;
        char p, q;
        for (i = 0; i < n; i++) {
           p = a[i];
           for (j = i + 1; j < n; j++) {
              q = a[j];
              if (q == p || q == p - (j - i) || q == p + (j - i))
                 return 0;
           }
        }
        return 1;
     }

  34. Queens strategy in SMPSs
     [Figure: the search tree is traversed sequentially down to a CUT level; each subtree below that level becomes a task; leaves are solutions]

  35. N Queens
     • Manual memory allocation

     void nqueens(int n, int j, char *a, int depth)
     {
        int i;
        char *b;
        for (i = 0; i < n; i++) {
           a[j] = i;
           if (ok(j + 1, a)) {
              if (depth < task_depth) {
                 nqueens(n, j + 1, a, depth + 1);
              } else {
                 b = malloc(BOARD_SIZE * sizeof(char));
                 memcpy(b, a, (j + 1) * sizeof(char));  /* copy rows 0..j */
                 nqueens_ser_task(n, j + 1, b, &total_res);
              }
           }
        }
     }

     #pragma css task input (n, j, a[n]) inout (results) \
                      reduction (results)
     void nqueens_ser_task(int n, int j, char *a, int *results)
     {
        int i;
        int local_sols = 0;
        if (n == j) {
           #pragma css mutex lock (&find_solution)
           if (!find_solution){
              find_solution = 1;
              memcpy(sol, a, n * sizeof(char));
           }
           #pragma css mutex unlock (&find_solution)
           local_sols++;
        } else {
           for (i = 0; i < n; i++) {
              a[j] = i;
              if (ok(j + 1, a)) {
                 nqueens_ser_task(n, j + 1, a, &local_sols);
              }
           }
        }
        #pragma css mutex lock (results)
        *results = *results + local_sols;
        #pragma css mutex unlock (results)
     }

  36. N Queens
     • Memory allocation: the runtime automatically renames the array

     void nqueens(int n, int j, char *a, char *b, int depth)
     {
        int i;
        for (i = 0; i < n; i++) {
           a[j] = i;
           if (ok(j + 1, a)) {
              add_queen_task(b, j, i, n);
              if (depth < task_depth) {
                 nqueens(n, j + 1, a, b, depth + 1);
              } else {
                 nqueens_ser_task(n, j + 1, b, &total_res);
              }
           }
        }
     }

     #pragma css task input (j, i, n) \
                      inout (a[n]) highpriority
     void add_queen_task(char *a, int j, int i, int n)
     {
        a[j] = i;
     }

     #pragma css task input (n, j, a[n]) inout (results) \
                      reduction (results)
     void nqueens_ser_task(int n, int j, char *a, int *results)
     {
        int i;
        int local_sols = 0;
        if (n == j) {
           #pragma css mutex lock (&find_solution)
           if (!find_solution){
              find_solution = 1;
              memcpy(sol, a, n * sizeof(char));
           }
           #pragma css mutex unlock (&find_solution)
           local_sols++;
        } else {
           for (i = 0; i < n; i++) {
              a[j] = i;
              if (ok(j + 1, a)) {
                 nqueens_ser_task(n, j + 1, a, &local_sols);
              }
           }
        }
        #pragma css mutex lock (results)
        *results = *results + local_sols;
        #pragma css mutex unlock (results)
     }

  37. Stream
     • Stream is one of the HPC Challenge benchmark suite: http://icl.cs.utk.edu/hpcc/
     • It is a simple synthetic benchmark that measures sustainable memory bandwidth (in GB/s) and the corresponding computation rate for simple vector kernels:
        Initialize: a, b, c
        Copy:  c = a
        Scale: b = s*c
        Add:   c = a + b
        Triad: a = b + s*c
     • The original OpenMP version inserts barriers after each operation has processed all elements of the array
     • Two StarSs versions: with and without barriers
     • Without barriers, correctness is guaranteed by the data-dependence analysis

  38. Stream: mimicking the original version

     main (){
        #pragma css start
        tuned_initialization();
        #pragma css barrier
        scalar = 3.0;
        for (k=0; k<NTIMES; k++) {
           tuned_STREAM_Copy();
           #pragma css barrier
           tuned_STREAM_Scale(scalar);
           #pragma css barrier
           tuned_STREAM_Add();
           #pragma css barrier
           tuned_STREAM_Triad(scalar);
           #pragma css barrier
        }
        #pragma css finish
     }

     #pragma css task input (a) output (c)
     void copy_task(double a[BSIZE], double c[BSIZE])
     {
        int j;
        for (j=0; j < BSIZE; j++)
           c[j] = a[j];
     }

     void tuned_STREAM_Copy()
     {
        int j;
        for (j=0; j<N; j+=BSIZE)
           copy_task (&a[j], &c[j]);
     }

     #pragma css task input (c, scalar) output (b)
     void scale_task (double b[BSIZE], double c[BSIZE], double scalar)
     {
        int j;
        for (j=0; j < BSIZE; j++)
           b[j] = scalar*c[j];
     }

     void tuned_STREAM_Scale(double scalar)
     {
        int j;
        for (j=0; j<N; j+=BSIZE)
           scale_task (&b[j], &c[j], scalar);
     }

  39. Stream

     #pragma css task input (a, b) output (c)
     void add_task (double a[BSIZE], double b[BSIZE], double c[BSIZE])
     {
        int j;
        for (j=0; j < BSIZE; j++)
           c[j] = a[j]+b[j];
     }

     void tuned_STREAM_Add()
     {
        int j;
        for (j=0; j<N; j+=BSIZE)
           add_task(&a[j], &b[j], &c[j]);
     }

     #pragma css task input (b, c, scalar) output (a)
     void triad_task (double a[BSIZE], double b[BSIZE], double c[BSIZE], double scalar)
     {
        int j;
        for (j=0; j < BSIZE; j++)
           a[j] = b[j]+scalar*c[j];
     }

     void tuned_STREAM_Triad(double scalar)
     {
        int j;
        for (j=0; j<N; j+=BSIZE)
           triad_task (&a[j], &b[j], &c[j], scalar);
     }

  40. Stream: version without barriers
     The SMPSs/CellSs data-dependence analysis guarantees correctness:

     main (){
        #pragma css start
        tuned_initialization();
        #pragma css barrier
        scalar = 3.0;
        for (k=0; k<NTIMES; k++) {
           tuned_STREAM_Copy();
           tuned_STREAM_Scale(scalar);
           tuned_STREAM_Add();
           tuned_STREAM_Triad(scalar);
        }
        #pragma css finish
     }

     [Task dependence graph: Init, Copy, Scale, Add and Triad tasks; tasks of the next iteration overlap with those of the same iteration, constrained only by true dependences]

  41. Molecular dynamics: Argon simulation
     • Simulates the mobility of Argon atoms in the gas state, in a constant volume at T = 300 K
     • The electrostatic forces exerted on each atom by all the others are computed (Fi)
     • Newton's second law is then applied to each atom: Fi = m*ai
     • The initial velocities are random but reasonable for argon atoms at 300 K
     • To maintain a constant temperature throughout the process, the Berendsen algorithm is applied
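     Concretely, the thermostat step computes the instantaneous temperature from the velocity moduli and rescales all velocities; this reading is derived directly from the tins/lam1 computation in the listing on the next slide:

     T_{\mathrm{inst}} \;=\; \frac{1}{N}\sum_{i=1}^{N} \frac{m\, v_i^{2}}{3 k_B},
     \qquad
     \lambda \;=\; \sqrt{T / T_{\mathrm{inst}}},
     \qquad
     v_i \;\leftarrow\; \lambda\, v_i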

  42. Molecular dynamics: Argon simulation

     program argon
        ...
        !$CSS START
        do step=1,niter
           do ii=1, N, BSIZE
              do jj=1, N, BSIZE
                 call velocity(BSIZE, ii, jj, x(ii), y(ii), z(ii), x(jj), y(jj), z(jj), vx(ii), vy(ii), vz(ii))
              enddo
           enddo
           do jj=1, N, BSIZE
              call v_mod(BSIZE, v(jj), vx(jj), vy(jj), vz(jj))
           enddo
           !$CSS BARRIER
           tins=0.e0
           do i=1,N
              tins=mkg*v(i)**2/3.e0/kb+tins
           enddo
           tins=tins/N
           lam1=sqrt(t/tins)
           do ii=1, N, BSIZE
              call update_position(BSIZE, lam1, vx(ii), vy(ii), vz(ii), x(ii), y(ii), z(ii))
           enddo
        enddo
        !$CSS FINISH
     end

     The task interfaces:

     interface
        !$CSS TASK
        subroutine velocity(BSIZE, ii, jj, xi, yi, zi, xj, yj, zj, vx, vy, vz)
           implicit none
           integer, intent(in) :: BSIZE, ii, jj
           real, intent(in), dimension(BSIZE) :: xi, yi, zi, xj, yj, zj
           real, intent(inout), dimension(BSIZE) :: vx, vy, vz
        end subroutine
        !$CSS TASK
        subroutine update_position(BSIZE, lam1, vx, vy, vz, x, y, z)
           implicit none
           integer, intent(in) :: BSIZE
           real, intent(in) :: lam1
           real, intent(inout), dimension(BSIZE) :: vx, vy, vz
           real, intent(inout), dimension(BSIZE) :: x, y, z
        end subroutine
        !$CSS TASK
        subroutine v_mod(BSIZE, v, vx, vy, vz)
           implicit none
           integer, intent(in) :: BSIZE
           real, intent(out) :: v(BSIZE)
           real, intent(in), dimension(BSIZE) :: vx, vy, vz
        end subroutine
     end interface

  43. Molecular dynamics: Argon simulation

     program argon
        ...
        !$CSS START
        do step=1,niter
           do ii=1, N, BSIZE
              do jj=1, N, BSIZE
                 call velocity(BSIZE, ii, jj, x(ii), y(ii), z(ii), x(jj), y(jj), z(jj), vx(ii), vy(ii), vz(ii))
              enddo
           enddo
           do jj=1, N, BSIZE
              call v_mod(BSIZE, v(jj), vx(jj), vy(jj), vz(jj))
           enddo
           !$CSS BARRIER
           tins=0.e0
           do i=1,N
              tins=mkg*v(i)**2/3.e0/kb+tins
           enddo
           tins=tins/N
           lam1=sqrt(t/tins)
           do ii=1, N, BSIZE
              call update_position(BSIZE, lam1, vx(ii), vy(ii), vz(ii), x(ii), y(ii), z(ii))
           enddo
        enddo
        !$CSS FINISH
     end

     !$CSS TASK
     subroutine velocity(BSIZE, ii, jj, xi, yi, zi, xj, yj, zj, vx, vy, vz)
        ! subroutine code
     end subroutine

     !$CSS TASK
     subroutine update_position(BSIZE, lam1, vx, vy, vz, x, y, z)
        ! subroutine code
     end subroutine

     !$CSS TASK
     subroutine v_mod(BSIZE, v, vx, vy, vz)
        ! subroutine code
     end subroutine

  44. Hands-on Copy files from: /gpfs/scratch/bsc19/bsc19776/TutorialPRACE/tutorial.tar.gz

  45. Extensions to OpenMP

  46. Extensions to OpenMP tasking
     • Dependences: not all arguments need to appear in a directionality clause
     • Heterogeneous devices
     • Different implementations of a task
     • Separation of dependences and transfers
     E. Ayguade, et al., "A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures", LNCS, IWOMP 2009; IJPP 2010
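     A minimal sketch of what such an extended task construct could look like, written here by carrying the StarSs clause names over to OpenMP (the exact spelling in the IWOMP 2009 proposal may differ):

     /* Sketch only: clause names follow the StarSs directives above. */
     #pragma omp task input(A, B) inout(C) target device(cuda)
     void mm_tile (float *A, float *B, float *C, int BS);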

  47. Inline directives: save manual outlining!
     Tasks have no name → multiple implementations are not possible
     Example: Sparse LU (data by blocks)
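     For illustration, an inline task in the spirit of this slide (my own sketch; the original slide shows the Sparse LU code as an image): the annotated statement itself becomes the task, so no function has to be outlined, but the task is anonymous.

     /* Sketch: one Sparse LU update step as an inline (unnamed) task.
        fwd() is a hypothetical block kernel, not from the slides. */
     for (k = 0; k < NB; k++)
        for (j = k + 1; j < NB; j++)
           if (A[k][j] != NULL) {
              #pragma omp task input(A[k][k]) inout(A[k][j])
              fwd (A[k][k], A[k][j]);   /* block forward substitution */
           }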

  48. Annotated function declaration: ALL instances (call sites) become tasks
     Example: heterogeneous Cholesky
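     A sketch of the idea (mine, not the slide's code, which is shown as an image): annotating the declaration, e.g. in a header, turns every call into a task, and a named task can then be given device-specific implementations.

     /* Sketch: every call to spotrf_tile becomes a task; a CUDA variant is
        declared as an alternative implementation (names are illustrative). */
     #pragma omp task inout(A[64][64])
     void spotrf_tile (float *A);

     #pragma omp task inout(A[64][64]) target device(cuda) implements(spotrf_tile)
     void spotrf_tile_cuda (float *A);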

  49. Array sections

  50. Hybrid MPI/SMPSs
