Enhancing Fine-Grained Parallelism Part II



  1. Enhancing Fine-Grained Parallelism Part II (Chapter 5 of Allen and Kennedy)

  2. Seen So Far... • Uncovering potential vectorization in loops by • Loop Interchange • Scalar Expansion • Scalar and Array Renaming • Safety and Profitability of these transformations

  3. Today’s Talk... • More transformations • Node Splitting • Recognition of Reductions • Index-Set Splitting • Run-time Symbolic Resolution • Loop Skewing • Unified framework to generate vector code

  4. Node Splitting • Sometimes renaming fails:
    DO I = 1, N
  S1:   A(I) = X(I+1) + X(I)
  S2:   X(I+1) = B(I) + 32
    ENDDO
  • The recurrence is kept intact by the renaming algorithm

  5. Node Splitting
    DO I = 1, N
  S1:   A(I) = X(I+1) + X(I)
  S2:   X(I+1) = B(I) + 32
    ENDDO
  • Break the critical antidependence: make a copy of the node from which the antidependence emanates
    DO I = 1, N
  S1':  X$(I) = X(I+1)
  S1:   A(I) = X$(I) + X(I)
  S2:   X(I+1) = B(I) + 32
    ENDDO
  • The recurrence is broken; vectorized to
    X$(1:N) = X(2:N+1)
    X(2:N+1) = B(1:N) + 32
    A(1:N) = X$(1:N) + X(1:N)
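
The transformation above can be checked in a minimal Python sketch (function names and the 0-based index model are my own; X$ is spelled Xs): the scalar loop and the node-split vector order produce the same A and X.

```python
def original(X0, B):
    # Scalar loop: S1: A(I) = X(I+1) + X(I); S2: X(I+1) = B(I) + 32
    X = list(X0)
    N = len(B)
    A = [0] * N
    for i in range(N):          # i = 0..N-1 models I = 1..N
        A[i] = X[i + 1] + X[i]
        X[i + 1] = B[i] + 32
    return A, X

def node_split(X0, B):
    # Vectorized node-split order:
    # X$(1:N) = X(2:N+1); X(2:N+1) = B(1:N) + 32; A(1:N) = X$(1:N) + X(1:N)
    X = list(X0)
    N = len(B)
    Xs = X[1:N + 1]                         # the copy breaks the antidependence
    X[1:N + 1] = [b + 32 for b in B]
    A = [xs + x for xs, x in zip(Xs, X[:N])]
    return A, X
```

Note that A(1:N) still reads X(1:N) after X(2:N+1) has been overwritten, which is exactly what the scalar loop does: iteration I reads the X(I) written by iteration I-1.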

  6. Node Splitting Algorithm • Takes a constant loop-independent antidependence D • Add a new assignment x: T$ = source(D) • Insert x before source(D) • Replace source(D) with T$ • Update the dependence graph accordingly

  7. Node Splitting: Profitability • Not always profitable. For example:
    DO I = 1, N
  S1:   A(I) = X(I+1) + X(I)
  S2:   X(I+1) = A(I) + 32
    ENDDO
  • Node splitting gives
    DO I = 1, N
  S1':  X$(I) = X(I+1)
  S1:   A(I) = X$(I) + X(I)
  S2:   X(I+1) = A(I) + 32
    ENDDO
  • The recurrence is still not broken: the antidependence was not critical

  8. Node Splitting • Determining a minimal set of critical antidependences is NP-complete, so a perfect job of node splitting is difficult • Heuristic: • Select an antidependence • Delete it and test whether the graph becomes acyclic • If acyclic, apply node splitting

  9. Recognition of Reductions • Sum reduction, min/max reduction, count reduction • Reduce a vector to a single element:
    S = 0.0
    DO I = 1, N
      S = S + A(I)
    ENDDO
  • Not directly vectorizable

  10. Recognition of Reductions • Assuming commutativity and associativity:
    S = 0.0
    DO k = 1, 4
      SUM(k) = 0.0
      DO I = k, N, 4
        SUM(k) = SUM(k) + A(I)
      ENDDO
      S = S + SUM(k)
    ENDDO
  • Distribute the k loop:
    S = 0.0
    DO k = 1, 4
      SUM(k) = 0.0
    ENDDO
    DO k = 1, 4
      DO I = k, N, 4
        SUM(k) = SUM(k) + A(I)
      ENDDO
    ENDDO
    DO k = 1, 4
      S = S + SUM(k)
    ENDDO
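
A minimal Python sketch of the four-way splitting (function names mine; integer data is used so that the reassociation is exact, which the floating-point version is not):

```python
def serial_sum(A):
    # S = 0; DO I = 1, N: S = S + A(I)
    S = 0
    for a in A:
        S = S + a
    return S

def four_way_sum(A):
    # SUM(k) accumulates every fourth element starting at k (k = 1..4),
    # then the four partial sums are combined; valid given associativity.
    SUM = [0] * 4
    for k in range(4):
        for i in range(k, len(A), 4):   # models DO I = k, N, 4
            SUM[k] += A[i]
    return SUM[0] + SUM[1] + SUM[2] + SUM[3]
```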

  11. Recognition of Reductions • After loop interchange:
    DO I = 1, N, 4
      DO k = I, min(I+3,N)
        SUM(k-I+1) = SUM(k-I+1) + A(k)
      ENDDO
    ENDDO
  • Vectorize:
    DO I = 1, N, 4
      SUM(1:4) = SUM(1:4) + A(I:I+3)
    ENDDO

  12. Recognition of Reductions • Useful for vector machines with 4 stage pipeline • Recognize Reduction and Replace by the efficient version

  13. Recognition of Reductions • Properties of Reductions • Reduce Vector/Array to one element • No use of Intermediate values • Reduction operates on vector and nothing else

  14. Recognition of Reductions • A reduction is recognized by • Presence of self true, output, and anti dependences • Absence of other true dependences
    DO I = 1, N          ! a reduction
      S = S + A(I)
    ENDDO
    DO I = 1, N          ! not a reduction: S has another true dependence
      S = S + A(I)
      T(I) = S
    ENDDO

  15. Index-set Splitting • Subdivide loop into different iteration ranges to achieve partial parallelization • Threshold Analysis [Strong SIV, Weak Crossing SIV] • Loop Peeling [Weak Zero SIV] • Section Based Splitting [Variation of loop peeling]

  16. Threshold Analysis
    DO I = 1, 20
      A(I+20) = A(I) + B
    ENDDO
  • Vectorize to
    A(21:40) = A(1:20) + B
    DO I = 1, 100
      A(I+20) = A(I) + B
    ENDDO
  • The dependence threshold is 20, so strip mine to
    DO I = 1, 100, 20
      DO i = I, I+19
        A(i+20) = A(i) + B
      ENDDO
    ENDDO
  • and vectorize the inner loop
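
A minimal Python sketch of the strip-mined version (names and the 0-based model are mine): because the dependence distance is 20, each 20-iteration strip is dependence-free and can run as one vector statement.

```python
def scalar(A0, B, N):
    # DO I = 1, N: A(I+20) = A(I) + B   (0-based model)
    A = list(A0)
    for i in range(N):
        A[i + 20] = A[i] + B
    return A

def strip_mined(A0, B, N):
    # Strips of 20 (the threshold): within one strip no write can reach a
    # later read, so the whole strip executes as a vector assignment.
    A = list(A0)
    for I in range(0, N, 20):
        A[I + 20:I + 40] = [x + B for x in A[I:I + 20]]
    return A
```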

  17. Threshold Analysis • Crossing thresholds:
    DO I = 1, 100
      A(101-I) = A(I) + B
    ENDDO
  • The dependence crosses between I = 50 and I = 51, so strip mine to
    DO I = 1, 100, 50
      DO i = I, I+49
        A(101-i) = A(i) + B
      ENDDO
    ENDDO
  • Vectorize to
    DO I = 1, 100, 50
      A(101-I:52-I:-1) = A(I:I+49) + B
    ENDDO
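
A minimal Python sketch of the crossing-threshold split (names mine; 1-based model with A[0] unused, and the loop written as A(101-I) = A(I) + B to match the strip-mined form): each strip of 50 reads one half of A and writes the other, so both strips vectorize.

```python
def scalar(A0, B):
    # DO I = 1, 100: A(101-I) = A(I) + B
    A = list(A0)
    for I in range(1, 101):
        A[101 - I] = A[I] + B
    return A

def split(A0, B):
    # Two strips of 50, split at the crossing point: within each strip the
    # read section A(I:I+49) and the write section are disjoint.
    A = list(A0)
    for I in (1, 51):
        src = A[I:I + 50]                              # all reads before any write
        A[52 - I:102 - I] = [x + B for x in reversed(src)]  # A(101-i) for i = I..I+49
    return A
```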

  18. Loop Peeling • The source of the dependence is a single iteration:
    DO I = 1, N
      A(I) = A(I) + A(1)
    ENDDO
  • Peel off the first iteration:
    A(1) = A(1) + A(1)
    DO I = 2, N
      A(I) = A(I) + A(1)
    ENDDO
  • Vectorize to
    A(1) = A(1) + A(1)
    A(2:N) = A(2:N) + A(1)
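
A minimal Python sketch of loop peeling (names mine; 1-based model with A[0] unused): once the first iteration is peeled, A(1) is loop-invariant and the rest vectorizes.

```python
def scalar(A0):
    # DO I = 1, N: A(I) = A(I) + A(1)
    A = list(A0)
    for I in range(1, len(A)):
        A[I] = A[I] + A[1]
    return A

def peeled(A0):
    A = list(A0)
    A[1] = A[1] + A[1]                  # peeled first iteration
    a1 = A[1]                           # A(1) is now loop-invariant
    A[2:] = [x + a1 for x in A[2:]]     # A(2:N) = A(2:N) + A(1)
    return A
```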

  19. Section-based Splitting
    DO I = 1, N
      DO J = 1, N/2
  S1:     B(J,I) = A(J,I) + C
      ENDDO
      DO J = 1, N
  S2:     A(J,I+1) = B(J,I) + D
      ENDDO
    ENDDO
  • The loop nest is bound by a recurrence due to B, but only a portion of B is responsible for it
  • Partition the second J loop into a loop that uses the result of S1 and a loop that does not:
    DO I = 1, N
      DO J = 1, N/2
  S1:     B(J,I) = A(J,I) + C
      ENDDO
      DO J = 1, N/2
  S2:     A(J,I+1) = B(J,I) + D
      ENDDO
      DO J = N/2+1, N
  S3:     A(J,I+1) = B(J,I) + D
      ENDDO
    ENDDO

  20. Section-based Splitting
    DO I = 1, N
      DO J = 1, N/2
  S1:     B(J,I) = A(J,I) + C
      ENDDO
      DO J = 1, N/2
  S2:     A(J,I+1) = B(J,I) + D
      ENDDO
      DO J = N/2+1, N
  S3:     A(J,I+1) = B(J,I) + D
      ENDDO
    ENDDO
  • S3 is now independent of S1 and S2; distribute the I loop:
    DO I = 1, N
      DO J = N/2+1, N
  S3:     A(J,I+1) = B(J,I) + D
      ENDDO
    ENDDO
    DO I = 1, N
      DO J = 1, N/2
  S1:     B(J,I) = A(J,I) + C
      ENDDO
      DO J = 1, N/2
  S2:     A(J,I+1) = B(J,I) + D
      ENDDO
    ENDDO

  21. Section-based Splitting
    DO I = 1, N
      DO J = N/2+1, N
  S3:     A(J,I+1) = B(J,I) + D
      ENDDO
    ENDDO
    DO I = 1, N
      DO J = 1, N/2
  S1:     B(J,I) = A(J,I) + C
      ENDDO
      DO J = 1, N/2
  S2:     A(J,I+1) = B(J,I) + D
      ENDDO
    ENDDO
  • Vectorized to
    A(N/2+1:N,2:N+1) = B(N/2+1:N,1:N) + D
    DO I = 1, N
      B(1:N/2,I) = A(1:N/2,I) + C
      A(1:N/2,I+1) = B(1:N/2,I) + D
    ENDDO
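
The whole section-based split can be checked in a minimal Python sketch (names and the 0-based A[J][I] layout are mine): S3 only reads the upper half of B, which is never written, so distributing it out and vectorizing it leaves the result unchanged.

```python
def original(A0, C, D, N):
    # DO I = 1, N:
    #   DO J = 1, N/2:  B(J,I)   = A(J,I) + C
    #   DO J = 1, N:    A(J,I+1) = B(J,I) + D
    A = [list(r) for r in A0]           # A[J][I], J = 0..N-1, I = 0..N
    B = [[0] * N for _ in range(N)]
    for I in range(N):
        for J in range(N // 2):
            B[J][I] = A[J][I] + C
        for J in range(N):
            A[J][I + 1] = B[J][I] + D
    return A

def split(A0, C, D, N):
    A = [list(r) for r in A0]
    B = [[0] * N for _ in range(N)]
    # S3 distributed out: the upper half of B is never written, so the whole
    # I loop over it is one vector statement A(N/2+1:N,2:N+1) = B(N/2+1:N,1:N) + D
    for J in range(N // 2, N):
        A[J][1:N + 1] = [b + D for b in B[J]]
    # the remaining recurrence involves only the lower half of B
    for I in range(N):
        for J in range(N // 2):
            B[J][I] = A[J][I] + C
            A[J][I + 1] = B[J][I] + D
    return A
```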

  22. Run-time Symbolic Resolution • "Breaking conditions":
    DO I = 1, N
      A(I+L) = A(I) + B(I)
    ENDDO
  • Transformed to
    IF (L.LE.0) THEN
      A(L+1:N+L) = A(1:N) + B(1:N)
    ELSE
      DO I = 1, N
        A(I+L) = A(I) + B(I)
      ENDDO
    ENDIF
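
A minimal Python sketch of the breaking condition (names mine; 1-based model, restricted to L >= -1 so indices stay non-negative): when L <= 0 there is no loop-carried true dependence, so the vector section A(L+1:N+L) = A(1:N) + B(1:N) is safe; otherwise the scalar loop runs.

```python
def scalar(A0, B, L, N):
    # DO I = 1, N: A(I+L) = A(I) + B(I)
    A = list(A0)
    for I in range(1, N + 1):
        A[I + L] = A[I] + B[I]
    return A

def resolved(A0, B, L, N):
    A = list(A0)
    if L <= 0:   # breaking condition: no loop-carried true dependence
        # vector semantics: all of A(1:N) is read before any element is written
        A[L + 1:N + L + 1] = [a + b for a, b in zip(A[1:N + 1], B[1:N + 1])]
    else:        # fall back to the scalar loop
        for I in range(1, N + 1):
            A[I + L] = A[I] + B[I]
    return A
```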

  23. Run-time Symbolic Resolution • Identifying the minimum number of breaking conditions needed to break a recurrence is NP-complete • Heuristic: • Identify when a critical dependence can be conditionally eliminated via a breaking condition

  24. Loop Skewing • Reshape the iteration space to uncover parallelism:
    DO I = 1, N
      DO J = 1, N
  S:      A(I,J) = A(I-1,J) + A(I,J-1)   ! direction vectors (<,=) and (=,<)
      ENDDO
    ENDDO
  • Parallelism is not apparent

  25. Loop Skewing • Dependence Pattern before loop skewing

  26. Loop Skewing • Apply the skewing transformation j = J + I (i.e., J = j - I):
    DO I = 1, N
      DO j = I+1, I+N
  S:      A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)   ! direction vectors (<,<) and (=,<)
      ENDDO
    ENDDO
  • Note: the direction vectors change

  27. Loop Skewing • The accesses to A follow this pattern (the skewed statement instance is written S(I,j)):
    A(1,1) = A(0,1) + A(1,0)   S(1,2)
    A(1,2) = A(0,2) + A(1,1)   S(1,3)
    A(1,3) = A(0,3) + A(1,2)   S(1,4)
    A(1,4) = A(0,4) + A(1,3)   S(1,5)
    A(2,1) = A(1,1) + A(2,0)   S(2,3)
    A(2,2) = A(1,2) + A(2,1)   S(2,4)
    A(2,3) = A(1,3) + A(2,2)   S(2,5)
    A(2,4) = A(1,4) + A(2,3)   S(2,6)

  28. Loop Skewing • Dependence pattern after loop skewing

  29. Loop Skewing
    DO I = 1, N
      DO j = I+1, I+N
  S:      A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
      ENDDO
    ENDDO
  • Loop interchange to
    DO j = 2, N+N
      DO I = max(1,j-N), min(N,j-1)
  S:      A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
      ENDDO
    ENDDO
  • Vectorize to
    DO j = 2, N+N
      FORALL I = max(1,j-N), min(N,j-1)
  S:      A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
      END FORALL
    ENDDO
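
A minimal Python sketch of the skewed form (names and the boundary initialization are mine): after skewing and interchange, all iterations of the inner I loop for a fixed j touch independent elements (a wavefront along the anti-diagonal), and the result matches the original nest.

```python
def init(N):
    # A(0:N, 0:N) with the boundary row and column set to 1
    A = [[0] * (N + 1) for _ in range(N + 1)]
    for t in range(N + 1):
        A[0][t] = 1
        A[t][0] = 1
    return A

def original(N):
    # DO I = 1, N: DO J = 1, N: A(I,J) = A(I-1,J) + A(I,J-1)
    A = init(N)
    for I in range(1, N + 1):
        for J in range(1, N + 1):
            A[I][J] = A[I - 1][J] + A[I][J - 1]
    return A

def skewed(N):
    # j = J + I; after interchange, the inner I loop is a parallel wavefront:
    # both operands lie on diagonal j-1, already computed.
    A = init(N)
    for j in range(2, 2 * N + 1):
        for I in range(max(1, j - N), min(N, j - 1) + 1):
            A[I][j - I] = A[I - 1][j - I] + A[I][j - I - 1]
    return A
```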

  30. Loop Skewing • Disadvantages: • Varying vector length; not profitable if N is small • If vector startup time exceeds the time saved, the transformation is not profitable • Vector bounds must be recomputed on each iteration of the outer loop • Apply loop skewing only if everything else fails

  31. Putting It All Together • Good Part • Many transformations imply more choices to exploit parallelism • Bad Part • Choosing the right transformation • How to automate transformation selection process? • Interference between transformations

  32. Putting It All Together • Example of interference:
    DO I = 1, N
      DO J = 1, M
        S(I) = S(I) + A(I,J)
      ENDDO
    ENDDO
  • Sum reduction gives
    DO I = 1, N
      S(I) = S(I) + SUM(A(I,1:M))
    ENDDO
  • while loop interchange and vectorization give
    DO J = 1, M
      S(1:N) = S(1:N) + A(1:N,J)
    ENDDO
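
Both transformed forms compute the same result; the interference is about which one the compiler should pick, not about correctness. A minimal Python sketch (names mine; A modeled as a list of rows, integer data so reassociation is exact):

```python
def reduction_form(S0, A):
    # DO I = 1, N: S(I) = S(I) + SUM(A(I,1:M))
    return [s + sum(row) for s, row in zip(S0, A)]

def interchange_form(S0, A):
    # DO J = 1, M: S(1:N) = S(1:N) + A(1:N,J)
    S = list(S0)
    for j in range(len(A[0])):
        S = [s + row[j] for s, row in zip(S, A)]
    return S
```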

  33. Putting It All Together • Any algorithm that tries to tie all the transformations together must • Take a global view of the transformed code • Know the architecture of the target machine • Goal of our algorithm: • Find ONE good vector loop [works well for most vector register architectures]

  34. Unified Framework • Detection: finding ALL loops for EACH statement that can be run in vector • Selection: choosing best loop for vector execution for EACH statement • Transformation: carrying out the transformations necessary to vectorize the selected loop

  35. Unified Framework: Detection
    procedure mark_loop(S, D)
      for each edge e in D deletable by scalar expansion, array and scalar
          renaming, node splitting, or symbolic resolution do begin
        add e to deletable_edges;
        delete e from D;
      end
      mark_gen(S, 1, D);
      for each statement x in S with no vector loop marked do begin
        attempt index-set splitting and loop skewing;
        mark vector loops found;
      end
      // Restore deletable edges from deletable_edges to D
    end mark_loop

  36. Unified Framework: Detection
    procedure mark_gen(S, k, D)
      // Variation of codegen; does no vectorization, only marks vector loops
      for i = 1 to m do begin
        if Si is cyclic then
          if the outermost carried dependence is at level p > k then
            // Loop shifting
            mark all loops at level < p as vector for Si;
          else if Si is a reduction then begin
            mark loop k as vector;
            mark Si reduction;
          end
          else begin
            // Recur at deeper level
            mark_gen(Si, k+1, Di);
          end
        else
          mark statements in Si as vector for loops k and deeper;
      end
    end mark_gen

  37. Selection and Transformation
    procedure transform_code(R, k, D)
      // Variation of codegen
      for i = 1 to m do begin
        if k is the index of the best vector loop for some statement in Ri then
          if Ri is cyclic then begin
            select_and_apply_transformation(Ri, k, D);
            // Retry vectorization on the new dependence graph
            transform_code(Ri, k, D);
          end
          else
            generate a vector statement for Ri in loop k;
        else begin
          // Recur at deeper level
          // Generate level-k DO and ENDDO statements
          transform_code(Ri, k+1, D);
        end
      end
    end transform_code

  38. Selection of Transformations
    procedure select_and_apply_transformation(Ri, k, D)
      if loop k does not carry a dependence in Ri then
        shift loop k to the innermost position;
      else if Ri is a reduction at level k then
        replace with a reduction and adjust dependences;
      else  // transform and adjust dependences
        if array renaming is possible then
          apply array renaming and adjust dependences;
        else if node splitting is possible then
          apply node splitting and adjust dependences;
        else if scalar expansion is possible then
          apply scalar expansion and adjust dependences;
        else
          apply loop skewing or index-set splitting and adjust dependences;
    end select_and_apply_transformation

  39. Performance on Benchmark
