Presentation Transcript


  1. COMP 422, Guest Lecture: High Performance Fortran
Chuck Koelbel, Department of Computer Science, Rice University, chk@rice.edu
COMP 422 Lecture, 21 February 2008

  2. A Quick History of HPF
• 1991: Supercomputing ‘91 Birds of a Feather session; Ken Kennedy, Geoffrey Fox, and David Loveman propose standardizing a parallel version of Fortran
• 1992-1993: High Performance Fortran Forum (HPFF) meetings, 30-40 people defining the language
• 1993: HPF 1.0 standard published; High Performance Fortran Handbook shortly thereafter
• 1994: Further HPFF meetings lead to HPF 1.1
• 1995-1996: HPFF 2.0 meetings; HPF 2.0 released January 1997
• 1994-now: HPF 1.x compilers

  3. HPF Goal: Portable Parallel Programming
(Diagram: four machine classes and their native programming models, with arrows between them labeled "Automatic parallelization(?)", "???", and "Vectorization(?)")
• Single (Vector) Processor: Cray-1, 1 processor, FORTRAN 77
• Shared Memory MIMD: Convex C3800, 8 processors, PCF Fortran / OpenMP
• Distributed Memory MIMD: Intel Paragon, 2048 processors, Intel message passing / MPI
• SIMD: Connection Machine CM-2, 65,536 processors, CM Fortran / Fortran 90

  4. HPF Goal: Portable Parallel Programming (continued)
(Same diagram; arrow labels on this build: "Trivial (ignore directives)", "???", "Vectorization?")

  5. HPF Goal: Portable Parallel Programming (continued)
(Same diagram; arrow labels on this build: "Why would you???", "Easy, but uses extra copies", "No way in Hell")

  6. HPF Goal: Portable Parallel Programming (continued)
(Same diagram; arrow labels on this build: "Scalarization", "Owner-computes rule(?)", "Owner-computes rule(?)")

  7. What’s the Owner-Computes Rule? • A compilation strategy suggested by • Callahan & Kennedy (thought paper), Fox (vague musings) • Vienna Fortran (Gerndt thesis), Kali (Koelbel thesis), Id Nouveau (Rogers thesis), Fortran D (Tseng thesis), … • CM Fortran, MasPar Fortran, DEC Fortran, COMPASS compilers • Given • A (sequential or parallel) program using (large) data structures • A partitioning of the data structures among processors • Produce a Single Program Multiple Data (SPMD) program (i.e., the same program text on each processor) that • Allocates only a part of the data • Restricts each processor to assign only to its own data • Inserts communication and synchronization as needed for non-local data

  8. Other HPF Goals and Constraints • Aimed at scientific programming, circa 1990 • FORTRAN 77, Fortran 90 • Arrays as the main data structure • Mostly regular computations • Defined by committee • Consensus where possible, votes elsewhere • Good: Not tied to particular architectures • Bad: Can’t exploit a particular architecture • Trying to raise the level of programming on HPC systems • Good: Programmers would thank us • Bad: Compiler writers wouldn’t

  9. HPF Overview • Fortran 90 • Base language (controversial in 1992) • Data Mapping • Align • Distribute • Data Parallel Statements and Functions • Forall statement • Independent • HPF Library • Advanced Topic • Extrinsic interfaces

  10. Fortran 90 • Major update of FORTRAN 77 (and FORTRAN 66) • Officially released in 1992 • Since supplanted by Fortran 95 and Fortran 2003 • All of FORTRAN 77 • Arrays, DO, IF-THEN-ELSE, CALL, DOUBLE PRECISION, … • The ugly stuff: Storage and sequence association • Some really ugly stuff was “deprecated” • Important additions for HPF • Array assignments, arithmetic, and intrinsics • Explicit subprogram interfaces • Many other new features • POINTER, ALLOCATE, CASE, derived types, free source form, …

  11. Fortran 90 Array Features
• Treat arrays as whole objects
• Array assignment: LHS and RHS must conform in size (or assign a scalar to an array)
• Array expressions: elementwise arithmetic, elementwise intrinsics, array sections
• Array intrinsics: queries, reductions, transformations

      REAL A(100,100), B(100,100)
      REAL U(100), V(100), X
      INTEGER IDX(100), I
      A = B                             ! whole-array assignment
      U = 0.0                           ! scalar assigned to an array
      A = A * B + 1.0                   ! elementwise arithmetic
      V = MATMUL(A,U)                   ! array intrinsic
      U = SIN(V)                        ! elementwise intrinsic
      U(1:50) = V(2:100:2)              ! array sections
      A(1:100,1) = V(100:1:-1)          ! reversed section
      U = U(IDX(1:100))                 ! vector-subscripted section
      I = SIZE(U)                       ! query
      X = SUM(A(2:99,2:99))             ! reduction
      V = CSHIFT(V,-1)                  ! transformation
      A(1:10,1:10) = RESHAPE(V,(/10,10/))

  12. Fortran 90 Interfaces
• Old “implicit” interfaces still there
  • Like the name says, no declarations needed
  • Sequence association: arguments passed in column-major order
• Explicit interfaces added
  • Type and number of dimensions checked at compile time, size at runtime (maybe)
  • Required only for using certain features (assumed-shape arrays)

Implicit interface (old style):
      SUBROUTINE SOLVE(A,LDA,N,IPVT,B)
      INTEGER LDA,N,IPVT(1)
      REAL A(LDA,1),B(1)

Explicit interface (Fortran 90, assumed-shape arrays):
      INTERFACE
        SUBROUTINE SOLVE(A,IPVT,B)
          REAL A(:,:), B(:)
          INTEGER IPVT(:)
        END SUBROUTINE
      END INTERFACE

Calls:
      REAL X(2048,2048), Y(2048)
      INTEGER IDX(2048)
      CALL SOLVE(X,2048,100,IDX,Y)
      CALL SOLVE(X(100,100),IDX(100),Y(100))

  13. HPF Data Mapping Model • Accomplished by directives • Structured comments beginning with !HPF$ • Did not change output semantics of the program • Ignore them on single-processor machines • Different implementations possible • Two-level scheme • ALIGN arrays to fine-grain TEMPLATE • Allows “the same” mappings, handles corner cases & different sizes • DISTRIBUTE onto PROCESSORS • Splits up all aligned data structures
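As a concrete illustration of the two-level scheme, a minimal sketch (the template, processor, and array names here are assumptions, not from the slides):

      REAL A(100), B(104)
!HPF$ TEMPLATE T(104)                  ! fine-grain index space
!HPF$ PROCESSORS P(4)                  ! abstract processor arrangement
!HPF$ ALIGN A(I) WITH T(I+2)           ! A sits inside T, offset by 2
!HPF$ ALIGN B(I) WITH T(I)             ! B covers all of T
!HPF$ DISTRIBUTE T(BLOCK) ONTO P       ! splitting T splits A and B consistently

Because A and B are aligned to the same template, corresponding elements land on the same processor no matter how T is distributed.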

  14. DISTRIBUTE Directive • !HPF$ DISTRIBUTE array(pat1, pat2, …) [ONTO proc] • Patterns (pat1, …) applied orthogonally • BLOCK: N/P contiguous elements • CYCLIC: Every Pth element • CYCLIC(m): Every Pth block of m elements • *: Don’t distribute this dimension • Number of distributed dimensions must match proc shape • Acts like a declaration • See also the Fortran DIMENSION statement
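As a worked illustration (sizes chosen here, not from the slide), distributing an 8-element array over 2 processors P0 and P1 gives:

      BLOCK      ->  P0: 1, 2, 3, 4    P1: 5, 6, 7, 8
      CYCLIC     ->  P0: 1, 3, 5, 7    P1: 2, 4, 6, 8
      CYCLIC(2)  ->  P0: 1, 2, 5, 6    P1: 3, 4, 7, 8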

  15. Examples of DISTRIBUTE
      REAL W(12,12), X(12,12)
      REAL Y(12,12), Z(12,12)
!HPF$ DISTRIBUTE W(BLOCK,*)
!HPF$ DISTRIBUTE X(*,CYCLIC)
!HPF$ DISTRIBUTE Y(BLOCK,BLOCK)
!HPF$ DISTRIBUTE Z(CYCLIC(2),CYCLIC(3))

  16. ALIGN Directive
• !HPF$ ALIGN array1(i1,…) WITH array2(lin-fcn1,…)
• i1, … were dummy variables
• lin-fcn1, … could be an integer constant, an integer linear function of one dummy, or *, meaning all elements of that dimension
• array2 could be an array or a template
• No circular ALIGN chains allowed
• Acts like a declaration; see also the Fortran DIMENSION statement

  17. Examples of ALIGN + DISTRIBUTE
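The figure for this slide is not in the transcript; a minimal sketch of the kind of combination it would show (array names and sizes are assumptions):

      REAL A(12,12), B(12)
!HPF$ PROCESSORS P(4)
!HPF$ ALIGN B(I) WITH A(I,1)           ! B follows the first column of A
!HPF$ DISTRIBUTE A(BLOCK,*) ONTO P     ! rows of A in blocks; B splits the same way

With this mapping each processor owns 3 whole rows of A plus the matching 3 elements of B, so a statement like B = B + A(:,1) needs no communication.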

  18. More examples of ALIGN + DISTRIBUTE
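Again hedged, since the original figure is missing, a sketch of two further alignment idioms (transposed and replicated alignment; names assumed):

      REAL C(8,8), D(8,8), V(8)
!HPF$ PROCESSORS Q(2,2)
!HPF$ ALIGN D(I,J) WITH C(J,I)         ! transposed alignment: D(I,J) lives with C(J,I)
!HPF$ ALIGN V(I) WITH C(I,*)           ! replicated alignment: a copy of V(I) wherever row I of C lives
!HPF$ DISTRIBUTE C(BLOCK,BLOCK) ONTO Q ! distributing C determines where D and V land

Transposed alignment makes D = TRANSPOSE(C) fully local; replication makes V(I) available on every processor that holds part of row I of C.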

  19. Implementing Data Mapping
• Memory management
  • Allocate local section
  • Generate global-to-local addressing: A(I) → A_local[I - LBOUND_local(A)]
  • Generate local-to-global mapping: A_local[I] → A(I + LBOUND_local(A))
• Bounds reduction on array assignments
  • Run only over (part of) local section
  • May generate as local or global indices

HPF source:
      REAL A(1000)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P(1:10)
      A(2:99) = 6.66

Generated per-processor code (sketch):
      REAL MY_A(0:99)
      PARAMETER (MY_LO=MY_ID*100,…)
      MY_LB=MAX(2,MY_LO+1)-MY_LO-1
      MY_UB=MIN(99,MY_HI+1)-MY_LO-1
      DO I=MY_LB,MY_UB
        MY_A(I) = 6.66
      ENDDO

  20. Dynamic Mapping • Some programs need to change distributions mid-stream • Dynamic memory allocation • Alternating direction methods (first by rows, then by columns) • Dependences on input data (data sizes, irregular meshes) • Support by dynamic data mapping directives • DYNAMIC - marks array for (possible) remapping • REALIGN - “executable” version of ALIGN • REDISTRIBUTE - “executable” version of DISTRIBUTE • Use caused all-to-all data movement • Not popular with either users or vendors
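A minimal sketch of REDISTRIBUTE in an alternating-direction sweep; the array, its size, and the SWEEP_* routines are illustrative assumptions:

      REAL U(512,512)
!HPF$ DYNAMIC U
!HPF$ DISTRIBUTE U(*,BLOCK)       ! whole columns are local: sweeps along dimension 1 need no communication
      CALL SWEEP_DIM1(U)
!HPF$ REDISTRIBUTE U(BLOCK,*)     ! remap (all-to-all movement): now whole rows are local for dimension-2 sweeps
      CALL SWEEP_DIM2(U)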

  21. Mapping in Subprogram Interfaces
• Prescriptive (“Wherever it was, move it here! I’ll move it back when I’m done”): system will move data on entry and exit (if needed)
• Descriptive (“It’s supposed to be here!”): system will check, and move data if needed
• Transcriptive (“Whatever!”): system will deal with data where it is. Warning: this will create lots of run-time overhead

Callee:
      SUBROUTINE EXAMPLE(A,B,C)
      REAL A(100), B(100), C(100)
!HPF$ DISTRIBUTE A(BLOCK)          ! prescriptive
!HPF$ DISTRIBUTE B *(BLOCK)        ! descriptive
!HPF$ DISTRIBUTE C *               ! transcriptive
      A = B
      A = C
      B = C
      END SUBROUTINE

Caller:
      REAL A(100), B(100), C(100)
!HPF$ DISTRIBUTE A(CYCLIC)
!HPF$ DISTRIBUTE B(BLOCK)
!HPF$ DISTRIBUTE C(CYCLIC)
      CALL EXAMPLE(A,B,C)

  22. HPF 2.0 - More DISTRIBUTE Patterns • Added SHADOW directive for regular computations • Added two new options primarily for irregular computations • GEN_BLOCK: blocks of differing sizes • INDIRECT: value-based mapping
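Hedged sketches of these directives (the arrays, sizes, and mapping values below are assumptions, not from the slides):

      REAL A(1000), B(1000), C(1000)
      INTEGER, PARAMETER :: SIZES(4) = (/400, 300, 200, 100/)   ! uneven block sizes, one per processor
      INTEGER MAP(1000)
!HPF$ PROCESSORS P(4)
!HPF$ DISTRIBUTE A(GEN_BLOCK(SIZES)) ONTO P    ! A(1:400) on P(1), A(401:700) on P(2), ...
!HPF$ DISTRIBUTE B(INDIRECT(MAP)) ONTO P       ! element B(I) lives on processor MAP(I)
!HPF$ DISTRIBUTE C(BLOCK)
!HPF$ SHADOW C(1)                              ! one ghost cell on each side for nearest-neighbor stencils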

  23. HPF Parallel Statements and Directives • Parallelism came from access to many elements of large data structures • E.g. update all elements of an array • Several methods • Fortran 90 array operations • FORALL statement • Syntactic sugar for array assignment • Also allows non-rectangular regions and a few other tricks • INDEPENDENT assertion • “This loop has no carried dependences” • If used correctly, doesn’t change output semantics • EXTRINSIC subroutines • Escape from data parallelism

  24. FORALL Statement
Initially, a = [0, 1, 2, 3, 4], b = [0, 10, 20, 30, 40], c = [-1, -1, -1, -1, -1]
      FORALL ( i = 2:4 )
        a(i) = a(i-1) + a(i+1)
        c(i) = b(i) * a(i+1)
      END FORALL
Afterwards, a = [0, 2, 4, 6, 4], b = [0, 10, 20, 30, 40], c = [-1, 40, 120, 120, -1]
Adopted (word-for-word) by Fortran 95 and later
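Note that the first statement in this FORALL is equivalent to the Fortran 90 array assignment below: all right-hand sides are evaluated before any element is assigned, which is why a(2) sees the old a(3).

      a(2:4) = a(1:3) + a(3:5)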

  25. Why Use FORALL?
• Assignments to (extended) array sections
      FORALL ( i = 1:4, j = 2:4 ) a(i,j) = i*4+j-3
      FORALL ( i = 1:4 ) a(i,i) = a(i,i) * scale
      FORALL ( i = 1:4 )
        FORALL ( j = i:4 ) a(i,j) = a(i,j) / a(i,i)
      END FORALL
      FORALL ( i = 1:4 )
        FORALL ( j = ilo(i):ihi(i) ) x(j) = x(j)*y(i)
      END FORALL
(Slide figure: example bounds jlo = 1, 11, 8, 5 and jhi = 3, 11, 8, 6)

  26. PURE Subprograms • PURE functions (and subroutines) had no side effects • Had to be declared so with explicit interface • Purity ensured by long list of syntactic conditions • No globals or dummy parameters on left-hand side • No calls to non-PURE functions or subroutines • No HPF remapping, no I/O, …. • This made them safe to call from FORALL • Which was the whole idea • Useful to call them from FORALL for certain things • Internal control flow (e.g. element-wise iteration) • Local variables Adopted (word-for-word) by Fortran 95 and later
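A minimal sketch of a PURE function used from FORALL; the function is an illustrative assumption and would need an explicit interface (e.g., by living in a module):

      PURE REAL FUNCTION SATURATE(V, LIMIT)
        REAL, INTENT(IN) :: V, LIMIT
        ! internal control flow and local variables are allowed in PURE code
        IF (ABS(V) > LIMIT) THEN
          SATURATE = SIGN(LIMIT, V)
        ELSE
          SATURATE = V
        END IF
      END FUNCTION SATURATE

      FORALL ( i = 1:n ) a(i) = SATURATE(b(i), 10.0)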

  27. INDEPENDENT Directive
Initially, a = [0, 2, 4, 6, 1, 3, 5, 7], b = [6, 5, 4, 3, 2, 3, 4, 5], c = [-1,-1,-1,-1,-1,-1,-1,-1]
!HPF$ INDEPENDENT
      DO j = 1, 3
        a(j) = a(b(j))
        c(a(j)) = a(j)*b(a(j))
      END DO
Afterwards, a = [3, 1, 6, 6, 1, 3, 5, 7], b = [6, 5, 4, 3, 2, 3, 4, 5], c = [6,-1,12,-1,-1,18,-1,-1]

  28. Why Use INDEPENDENT?
• Express application-dependent information
      ! “Colors” don't mix with each other
!HPF$ INDEPENDENT, NEW(ix, i1, i2, f12)
      DO i = 1, ncolor
        DO ix = color_beg(i), color_end(i)
          i1 = icolor(ix,1)
          i2 = icolor(ix,2)
          f12 = w(i1) - w(i2)
          x(i1) = x(i1) + f12
          x(i2) = x(i2) - f12
        END DO
      END DO

  29. INDEPENDENT Is an Assertion
(Slide figure: example iblue / ired index values, one set marked “True” and one marked “False”, showing data for which the assertion does and does not hold)
!HPF$ INDEPENDENT, NEW (j, n1)
      DO i = 1, nblue
        n1 = iblue(i)
        DO j = ibegin(n1), ibegin(n1+1)-1
          x(n1) = x(n1) + y(j)*x(ired(j))
        END DO
      END DO

  30. Implementing an INDEPENDENT Loop • Implement data mapping • See slide 19 • Choose reference(s) to partition loop bounds • Left-hand-side owner-computes rule • Majority owner-computes rule • Minimize communication (if you can) • Derive communication sets • Apply subscript expression to partitioned bounds • Intersect with data range the processor owns • Anything left over must be sent/received • Allocate buffers (or ghost regions) • Insert communication operations • RHS references: before loop, LHS references: after loop

  31. An INDEPENDENT Loop
HPF source:
!HPF$ DISTRIBUTE (BLOCK) :: ired, x, y
!HPF$ ALIGN iblue(i) WITH ired(*)
!HPF$ ALIGN ibegin(i) WITH ired(*)
      . . .
!HPF$ INDEPENDENT, NEW (j, n1)
      DO i = 1, nblue
        n1 = iblue(i)
        DO j = ibegin(n1), ibegin(n1+1)-1
          x(n1) = x(n1) + y(j)*x(ired(j))
        END DO
      END DO

Generated code (sketch, per processor):
      DO i = 1, nblue
        If iblue(i) local then
          add to list
          check for non-local y(j)
          check for non-local x(j)
        End if
      END DO
      Communicate nonlocal x, y to x_buf, y_buf
      DO i = local iterations
        n1 = iblue(i) - my_lo
        DO j = ibegin(n1), ibegin(n1+1)-1
          x(n1) = x(n1) + y_buf(j)*x_buf(j)
        END DO
      END DO

  32. An INDEPENDENT Loop (on Shared Memory)
HPF source: same as slide 31.
Generated code (sketch, per processor):
      DO i = 1, nblue
        If iblue(i) local then
          add to list
        End if
      END DO
      Synchronize after last uses of x and y
      DO i = local iterations
        n1 = iblue(i) - my_lo
        DO j = ibegin(n1), ibegin(n1+1)-1
          x(n1) = x(n1) + y(j)*x(ired(j))
        END DO
      END DO

  33. A Complex INDEPENDENT Loop
HPF source (the loop from slide 31, now inside an iteration loop):
!HPF$ DISTRIBUTE (BLOCK) :: ired, x, y
!HPF$ ALIGN iblue(i) WITH ired(*)
!HPF$ ALIGN ibegin(i) WITH ired(*)
      . . .
      DO iter = 1, 100
!HPF$ INDEPENDENT, NEW (j, n1)
        DO i = 1, nblue
          n1 = iblue(i)
          DO j = ibegin(n1), ibegin(n1+1)-1
            x(n1) = x(n1) + y(j)*x(ired(j))
          END DO
        END DO
      END DO

Generated code (sketch, per processor):
      DO i = 1, nblue
        If iblue(i) local then
          add to list
          check for non-local y(j)
          check for non-local x(j)
        End if
      END DO
      Communicate nonlocal y to y_buf
      DO iter = 1, 100
        Communicate nonlocal x to x_buf
        DO i = local iterations
          n1 = iblue(i) - my_lo
          DO j = ibegin(n1), ibegin(n1+1)-1
            x(n1) = x(n1) + y_buf(j)*x_buf(j)
          END DO
        END DO
      END DO

  34. A Complex INDEPENDENT Loop (on Shared Memory)
HPF source: same as slide 33.
Generated code (sketch, per processor):
      DO i = 1, nblue
        If iblue(i) local then
          add to list
        End if
      END DO
      DO iter = 1, 100
        Synchronize
        DO i = local iterations
          n1 = iblue(i) - my_lo
          DO j = ibegin(n1), ibegin(n1+1)-1
            x(n1) = x(n1) + y(j)*x(ired(j))
          END DO
        END DO
      END DO

  35. HPF Library • HPF included a substantial library • MODULE HPF_LIBRARY • USE HPF_LIBRARY • We thought of it as extending the Fortran 90 intrinsics • Fortran 95 accepted some of our ideas as new intrinsics • Criteria for HPF_LIBRARY • User requested function (or closely related one) • Can be implemented in parallel, or needed for parallel programs • But parallel implementation not easy and/or not portable

  36. HPF Library Categories • Reductions • For all associative & commutative built-ins, e.g. PARITY • Scans • Prefix and suffix, for each reduction • Scatters • With reductions at collision • Sorting • SORT_UP, SORT_DOWN, GRADE_UP, GRADE_DOWN • System and data distribution inquiry • NUMBER_OF_PROCESSORS, … • Elemental bit operations • E.g. POPCNT, also known as the NSA instruction

  37. HPF Library Examples
      X = SUM(A)                     x = Σ (i = 1..n) a(i)
      Y = SUM_PREFIX(B)              y(j) = Σ (i = 1..j) b(i)
      Z = SUM_SCATTER(C, Z, IND)     z(j) = z(j) + Σ (i : ind(i) = j) c(i)
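For instance (values chosen here for illustration), with B = (/1, 2, 3, 4/):

      Y = SUM_PREFIX(B)      ! Y becomes (/1, 3, 6, 10/): Y(j) is the running sum of B(1:j)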

  38. EXTRINSIC Subprograms • An escape hatch into lower-level programming paradigms (e.g. MPI, machine-specific hacks) • General mechanism and syntax EXTRINSIC(model-name) • Particular cases defined as examples • HPF_LOCAL • Constituted a contract • On the caller (HPF) side: • Explicit interface with the EXTRINSIC directive • Remapping occurs (if needed) to meet distribution specifications • System synchronizes all processors before the call • System calls “local” routine on every processor • On the callee (non-HPF) side: • INTENT(IN) and INTENT(OUT) must be obeyed • If variables are replicated, callee must make them consistent before return • Processors can access their own section of distributed arrays

  39. EXTRINSIC Example: Global HPF caller and Local HPF callee (the slide's code was not captured in this transcript)
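Since the slide's code is missing, a minimal sketch of the pattern the previous slide describes, with all routine and array names assumed:

! Global HPF side: declares the extrinsic interface and calls it like any routine
      INTERFACE
        EXTRINSIC(HPF_LOCAL) SUBROUTINE LOCAL_SMOOTH(X)
          REAL X(:)
!HPF$     DISTRIBUTE X(BLOCK)
        END SUBROUTINE LOCAL_SMOOTH
      END INTERFACE
      REAL A(1000)
!HPF$ DISTRIBUTE A(BLOCK)
      CALL LOCAL_SMOOTH(A)      ! system synchronizes, then every processor enters the local routine

! Local (HPF_LOCAL) side: each processor sees only its own section of A
      EXTRINSIC(HPF_LOCAL) SUBROUTINE LOCAL_SMOOTH(X)
      REAL X(:)                 ! the processor's local piece of the global array
      INTEGER I
      DO I = 2, SIZE(X)-1
        X(I) = 0.5*X(I) + 0.25*(X(I-1)+X(I+1))
      END DO
      END SUBROUTINE LOCAL_SMOOTH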

  40. HPF’s Legacy • HPF didn’t take over the world as we once hoped • MPI started later, but came out with implementations sooner • Lesson: Early (portable, efficient) implementation needed. • (Perceived) Lack of support for irregular applications • Lesson: Listen to users! • Performance inconsistency among first HPF implementations • Lesson: Environment needed (both conceptual & software)! • But it was important • JAHPF was critical to success of Earth Simulator • “14.9 Tflops Three-dimensional Fluid Simulation for Fusion Science with HPF on the Earth Simulator” won Gordon Bell prize for Language in 2002 • HPCS languages all have data parallel support • Including distributions inspired by HPF
