This document presents a study of affine partitioning for optimizing parallelism and locality, particularly through loop transformations. It covers transformation strategies such as interchange, reversal, skewing, fusion, and fission, and shows how they apply to arbitrary loop nesting structures. Results demonstrate effective speedups on a high-performance multiprocessor through systematic application of these techniques. The approach works at the statement level (instructions are optimized separately) and aims to achieve the maximum degree of parallelism while minimizing synchronization and communication overhead.
Affine Partitioning for Parallelism & Locality
Amy Lim, Stanford University
http://suif.stanford.edu/
Useful Transforms for Parallelism & Locality

INTERCHANGE
  FOR i                          FOR j
    FOR j                ==>       FOR i
      A[i,j] =                       A[i,j] =

REVERSAL
  FOR i = 1 TO n         ==>     FOR i = n DOWNTO 1
    A[i] =                         A[i] =

SKEWING
  FOR i = 1 TO n                 FOR i = 1 TO n
    FOR j = 1 TO n       ==>       FOR k = i+1 TO i+n
      A[i,j] =                       A[i,k-i] =

FUSION/FISSION
  FOR i = 1 TO n                 FOR i = 1 TO n
    A[i] =               <=>       A[i] =
  FOR i = 1 TO n                   B[i] =
    B[i] =

REINDEXING
  FOR i = 1 TO n                 A[1] = B[0]
    A[i] = B[i-1]        ==>     FOR i = 1 TO n-1
    C[i] = A[i+1]                  A[i+1] = B[i]
                                   C[i] = A[i+1]
                                 C[n] = A[n+1]

Traditional approach: is it legal & desirable to apply one transform?
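As a minimal sanity check (my own illustration, not part of the slides), the following Python snippet verifies that the SKEWING rewrite above preserves the computed result: iterating k = i+1 .. i+n and writing A[i, k-i] touches exactly the same elements as iterating j = 1 .. n.

    # Minimal sketch: the skewed loop nest computes the same array as the original.
    n = 4
    A1 = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):            # original nest: FOR i, FOR j
        for j in range(1, n + 1):
            A1[i][j] = 10 * i + j        # stand-in for the real loop body

    A2 = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):            # skewed nest: FOR i, FOR k = i+1 .. i+n
        for k in range(i + 1, i + n + 1):
            A2[i][k - i] = 10 * i + (k - i)

    assert A1 == A2                      # same result, different iteration space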
Affine Mappings [Lim & Lam, POPL '97; ICS '99]
• Domain: arbitrary loop nesting, affine loop indices; instructions optimized separately
• Unifies: permutation, skewing, reversal, fusion, fission, statement reordering
• Supports blocking across all (non-perfectly nested) loops
• Optimal: maximum degree of parallelism & minimum degree of synchronization
• Minimizes communication by aligning the computation and pipelining
• Question: how to combine the transformations?
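To make the unification concrete, here is a small sketch (my own illustration; the matrices are assumed canonical forms for a 2-deep (i, j) nest, not taken from the paper) showing that each classical transform is an affine map phi(i) = C·i + c applied to the iteration vector:

    import numpy as np

    n = 5
    it = np.array([2, 3])                    # an example iteration (i, j)

    # Each transform as phi(i) = C @ i + c on the iteration vector:
    transforms = [
        ("permutation", np.array([[0, 1], [1, 0]]),  np.array([0, 0])),      # (i,j) -> (j,i)
        ("reversal",    np.array([[-1, 0], [0, 1]]), np.array([n + 1, 0])),  # i -> n+1-i
        ("skewing",     np.array([[1, 0], [1, 1]]),  np.array([0, 0])),      # (i,j) -> (i,i+j)
    ]
    for name, C, c in transforms:
        print(name, C @ it + c)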
Loop Transforms: Cholesky Factorization Example

      DO 1 J = 0, N
        I0 = MAX ( -M, -J )
        DO 2 I = I0, -1
          DO 3 JJ = I0 - I, -1
            DO 3 L = 0, NMAT
  3           A(L,I,J) = A(L,I,J) - A(L,JJ,I+J) * A(L,I+JJ,J)
          DO 2 L = 0, NMAT
  2         A(L,I,J) = A(L,I,J) * A(L,0,I+J)
        DO 4 L = 0, NMAT
  4       EPSS(L) = EPS * A(L,0,J)
        DO 5 JJ = I0, -1
          DO 5 L = 0, NMAT
  5         A(L,0,J) = A(L,0,J) - A(L,JJ,J) ** 2
        DO 1 L = 0, NMAT
  1       A(L,0,J) = 1. / SQRT ( ABS (EPSS(L) + A(L,0,J)) )

      DO 6 I = 0, NRHS
        DO 7 K = 0, N
          DO 8 L = 0, NMAT
  8         B(I,L,K) = B(I,L,K) * A(L,0,K)
          DO 7 JJ = 1, MIN (M, N-K)
            DO 7 L = 0, NMAT
  7           B(I,L,K+JJ) = B(I,L,K+JJ) - A(L,-JJ,K+JJ) * B(I,L,K)
        DO 6 K = N, 0, -1
          DO 9 L = 0, NMAT
  9         B(I,L,K) = B(I,L,K) * A(L,0,K)
          DO 6 JJ = 1, MIN (M, K)
            DO 6 L = 0, NMAT
  6           B(I,L,K-JJ) = B(I,L,K-JJ) - A(L,-JJ,K) * B(I,L,K)
Results for Optimizing Perfect Nests
Speedup on a Digital TurboLaser with eight 300 MHz Alpha 21164 processors
Optimizing Arbitrary Loop Nesting Using Affine Partitions

(Figure: the same Cholesky factorization code as above, annotated with its affine partitions; the computations on arrays A, B, and EPSS are each partitioned along the L dimension.)
A Simple Example

FOR i = 1 TO n DO
  FOR j = 1 TO n DO
    A[i,j] = A[i,j] + B[i-1,j];    (S1)
    B[i,j] = A[i,j-1] * B[i,j];    (S2)

(Figure: iteration-space diagram of the dependences between S1 and S2 over i and j.)
Best Parallelization Scheme

SPMD code (let p be the processor's ID number):

  if (1-n <= p <= n) then
    if (1 <= p) then
      B[p,1] = A[p,0] * B[p,1];                     (S2)
    for i1 = max(1,1+p) to min(n,n-1+p) do
      A[i1,i1-p] = A[i1,i1-p] + B[i1-1,i1-p];       (S1)
      B[i1,i1-p+1] = A[i1,i1-p] * B[i1,i1-p+1];     (S2)
    if (p <= 0) then
      A[n+p,n] = A[n+p,n] + B[n+p-1,n];             (S1)

The solution can be expressed as affine partitions:
  S1: execute iteration (i, j) on processor i - j.
  S2: execute iteration (i, j) on processor i - j + 1.
Maximum Parallelism & No Communication

Let F_xj be an access to array x in statement j,
    i_j be an iteration index for statement j,
    B_j i_j ≥ 0 represent the loop bound constraints for statement j.

Find C_j, which maps an instance of statement j to a processor, such that

  ∀ i_j, i_k with B_j i_j ≥ 0 and B_k i_k ≥ 0:
      F_xj(i_j) = F_xk(i_k)  ⇒  C_j(i_j) = C_k(i_k)

with the objective of maximizing the rank of C_j.

(Figure: two loops whose accesses F_1(i_1) and F_2(i_2) touch the same array element are mapped by C_1(i_1) and C_2(i_2) to the same processor ID.)
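For the simple example above (statements S1 and S2), these constraints are satisfied by the quoted partitions C1(i,j) = i - j and C2(i,j) = i - j + 1. The following brute-force check (my own sketch, not the paper's algorithm) verifies this on a small iteration space:

    # S1: A[i,j] = A[i,j] + B[i-1,j];   S2: B[i,j] = A[i,j-1] * B[i,j]
    n = 6
    C1 = lambda i, j: i - j          # processor for an S1 instance
    C2 = lambda i, j: i - j + 1      # processor for an S2 instance

    for i1 in range(1, n + 1):
        for j1 in range(1, n + 1):
            for i2 in range(1, n + 1):
                for j2 in range(1, n + 1):
                    # S1 writes A[i1,j1]; S2 reads A[i2,j2-1]
                    if (i1, j1) == (i2, j2 - 1):
                        assert C1(i1, j1) == C2(i2, j2)
                    # S2 writes B[i2,j2]; S1 reads B[i1-1,j1]
                    if (i2, j2) == (i1 - 1, j1):
                        assert C2(i2, j2) == C1(i1, j1)
    print("communication-free constraints hold on the sampled iteration space")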
Algorithm

  ∀ i_j, i_k with B_j i_j ≥ 0 and B_k i_k ≥ 0:
      F_xj(i_j) = F_xk(i_k)  ⇒  C_j(i_j) = C_k(i_k)

• Rewrite the partition constraints as systems of linear equations:
  • use the affine form of Farkas' Lemma to rewrite the constraints as systems of linear inequalities in C and λ
  • use the Fourier-Motzkin algorithm to eliminate the Farkas multipliers λ and obtain a system of linear equations A C = 0
• Find solutions using linear algebra techniques:
  • the null space of the matrix A is a solution for C with maximum rank.
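A minimal sketch of only the last step (the Farkas-lemma rewriting and Fourier-Motzkin elimination are omitted): once the constraints have been reduced to A C = 0 over the unknown coefficients of C, a null-space basis of A gives a maximum-rank solution. The matrix below is a made-up example, not one derived from a real program.

    import numpy as np

    A_mat = np.array([[1., -1., 0.],      # hypothetical reduced constraint system A c = 0
                      [2., -2., 0.]])

    # Null-space basis via SVD: right singular vectors whose singular values are ~0
    _, s, Vt = np.linalg.svd(A_mat)
    rank = int((s > 1e-10).sum())
    basis = Vt[rank:].T                   # columns span { c : A_mat @ c = 0 }

    print("rank of the resulting affine partition:", basis.shape[1])
    print(basis)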
Pipelining: Alternating Direction Integration Example

Requires transposing data:
  DO J = 1 to N    (parallel)
    DO I = 1 to N
      A(I,J) = f(A(I,J), A(I-1,J))
  DO J = 1 to N
    DO I = 1 to N    (parallel)
      A(I,J) = g(A(I,J), A(I,J-1))

Moves only boundary data:
  DO J = 1 to N    (parallel)
    DO I = 1 to N
      A(I,J) = f(A(I,J), A(I-1,J))
  DO J = 1 to N    (pipelined)
    DO I = 1 to N
      A(I,J) = g(A(I,J), A(I,J-1))
Finding the Maximum Degree of Pipelining

Let F_xj be an access to array x in statement j,
    i_j be an iteration index for statement j,
    B_j i_j ≥ 0 represent the loop bound constraints for statement j.

Find T_j, which maps an instance of statement j to a time stage, such that

  ∀ i_j, i_k with B_j i_j ≥ 0 and B_k i_k ≥ 0:
      (i_j ≺ i_k) ∧ (F_xj(i_j) = F_xk(i_k))  ⇒  T_j(i_j) ⪯ T_k(i_k)  (lexicographically)

with the objective of maximizing the rank of T_j.

(Figure: two loops whose accesses F_1(i_1) and F_2(i_2) touch the same array element are mapped by T_1(i_1) and T_2(i_2) to ordered time stages.)
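As a quick legality check (my own brute-force sketch, not the paper's algorithm), the time mapping T(I,J) = (J, I) satisfies these constraints for the second ADI loop nest above, where iteration (I,J) reads the value A(I,J-1) written by iteration (I,J-1); since rank(T) = 2, it exposes one degree of pipelined parallelism (see the next slide).

    n = 6
    T = lambda I, J: (J, I)                     # candidate lexicographic time stage

    for I in range(1, n + 1):
        for J in range(2, n + 1):
            producer = (I, J - 1)               # writes A(I,J-1)
            consumer = (I, J)                   # reads  A(I,J-1)
            assert T(*producer) < T(*consumer)  # tuple comparison is lexicographic
    print("time mapping respects every flow dependence in the nest")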
Key Insight
• Choice in the time mapping => (pipelined) parallelism
• Degrees of parallelism = rank(T) - 1
Putting It All Together
• Find maximum outer-loop parallelism with minimum synchronization:
  • Divide the program into strongly connected components
  • Apply the processor-mapping algorithm (no communication) to the program
  • If no parallelism is found, apply the time-mapping algorithm to find pipelining
  • If no pipelining is found (the outer loop stays sequential), repeat the process on the inner loops
• Minimize communication:
  • Use a greedy method to order communicating pairs
  • Try to find communication-free, or neighborhood-only, communication by solving similar equations
• Aggregate computations of consecutive data to improve spatial locality
(A high-level sketch of this driver appears below.)
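The control flow of that recipe can be sketched as follows. Every helper here (find_sccs, processor_rank, pipeline_rank, inner_loops) is a hypothetical stub standing in for the algorithms on the earlier slides; only the overall structure mirrors the bullets above.

    # Hypothetical stubs: a "program" is modeled as nested dicts for illustration only.
    def find_sccs(nest):      return nest.get("sccs", [])
    def processor_rank(scc):  return scc.get("proc_rank", 0)   # rank of C found for this SCC
    def pipeline_rank(scc):   return scc.get("time_rank", 1)   # rank of T found for this SCC
    def inner_loops(scc):     return scc.get("inner", [])

    def parallelize(nest, emit):
        for scc in find_sccs(nest):
            if processor_rank(scc) > 0:          # communication-free parallelism found
                emit(("spmd", scc["name"]))
            elif pipeline_rank(scc) > 1:         # rank(T) - 1 degrees of pipelining
                emit(("pipelined", scc["name"]))
            else:                                # outer loop stays sequential:
                for inner in inner_loops(scc):
                    parallelize(inner, emit)     # repeat the process on inner loops

    # Tiny usage example with a made-up program structure:
    program = {"sccs": [{"name": "S1", "proc_rank": 1},
                        {"name": "S2", "proc_rank": 0, "time_rank": 2}]}
    plan = []
    parallelize(program, plan.append)
    print(plan)        # [('spmd', 'S1'), ('pipelined', 'S2')]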
Use of Affine Partitioning in Locality Optimization
• Promotes array contraction (see the sketch below):
  • Finds independent threads and shortens the live ranges of variables
• Supports blocking of imperfectly nested loops:
  • Finds the largest fully permutable loop nest via affine partitioning
  • A fully permutable loop nest is blockable
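A tiny sketch of the array-contraction effect (my own example, not from the slides): once the producer and consumer of a temporary are placed in the same iteration, the whole temporary array contracts to a scalar with a short live range.

    n = 8
    B = list(range(n))

    # Before: temporary array T is live across two separate loops
    T = [0] * n
    for i in range(n):
        T[i] = 2 * B[i]
    C_before = [T[i] + 1 for i in range(n)]

    # After fusion + contraction: T shrinks to a scalar produced and consumed per iteration
    C_after = []
    for i in range(n):
        t = 2 * B[i]
        C_after.append(t + 1)

    assert C_before == C_after         # same values, without the temporary array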