Enhancing Fine-Grained Parallelism Part II



  1. Enhancing Fine-Grained Parallelism Part II (Chapter 5 of Allen and Kennedy)

  2. Seen So Far... • Uncovering potential vectorization in loops by • Loop Interchange • Scalar Expansion • Scalar and Array Renaming • Safety and Profitability of these transformations

  3. Today’s Talk... • More transformations • Node Splitting • Recognition of Reductions • Index-Set Splitting • Run-time Symbolic Resolution • Loop Skewing • Unified framework to generate vector code

  4. Node Splitting • Sometimes renaming fails:
    DO I = 1, N
  S1:   A(I) = X(I+1) + X(I)
  S2:   X(I+1) = B(I) + 32
    ENDDO
  • The recurrence is kept intact by the renaming algorithm

  5. Node Splitting
    DO I = 1, N
  S1:   A(I) = X(I+1) + X(I)
  S2:   X(I+1) = B(I) + 32
    ENDDO
  • Break the critical antidependence: make a copy of the node from which the antidependence emanates
    DO I = 1, N
  S1':  X$(I) = X(I+1)
  S1:   A(I) = X$(I) + X(I)
  S2:   X(I+1) = B(I) + 32
    ENDDO
  • The recurrence is broken; vectorized to
    X$(1:N) = X(2:N+1)
    X(2:N+1) = B(1:N) + 32
    A(1:N) = X$(1:N) + X(1:N)
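
The transformation above can be checked in a minimal Python sketch (function names and the 0-based index model are my own; X$ is spelled Xs): the scalar loop and the node-split vector order produce the same A and X.

```python
def original(X0, B):
    # Scalar loop: S1: A(I) = X(I+1) + X(I); S2: X(I+1) = B(I) + 32
    X = list(X0)
    N = len(B)
    A = [0] * N
    for i in range(N):          # i = 0..N-1 models I = 1..N
        A[i] = X[i + 1] + X[i]
        X[i + 1] = B[i] + 32
    return A, X

def node_split(X0, B):
    # Vectorized node-split order:
    # X$(1:N) = X(2:N+1); X(2:N+1) = B(1:N) + 32; A(1:N) = X$(1:N) + X(1:N)
    X = list(X0)
    N = len(B)
    Xs = X[1:N + 1]                         # the copy breaks the antidependence
    X[1:N + 1] = [b + 32 for b in B]
    A = [xs + x for xs, x in zip(Xs, X[:N])]
    return A, X
```

Note that A(1:N) still reads X(1:N) after X(2:N+1) has been overwritten, which is exactly what the scalar loop does: iteration I reads the X(I) written by iteration I-1.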

  6. Node Splitting Algorithm • Takes a constant loop-independent antidependence D • Add a new assignment x: T$ = source(D) • Insert x before source(D) • Replace source(D) with T$ • Update the dependence graph accordingly

  7. Node Splitting: Profitability • Not always profitable. For example:
    DO I = 1, N
  S1:   A(I) = X(I+1) + X(I)
  S2:   X(I+1) = A(I) + 32
    ENDDO
  • Node splitting gives
    DO I = 1, N
  S1':  X$(I) = X(I+1)
  S1:   A(I) = X$(I) + X(I)
  S2:   X(I+1) = A(I) + 32
    ENDDO
  • The recurrence is still not broken: the antidependence was not critical

  8. Node Splitting • Determining a minimal set of critical antidependences is NP-complete, so a perfect job of node splitting is difficult • Heuristic: • Select an antidependence • Delete it and test whether the graph becomes acyclic • If acyclic, apply node splitting

  9. Recognition of Reductions • Sum reduction, min/max reduction, count reduction • Reduce a vector to a single element:
    S = 0.0
    DO I = 1, N
      S = S + A(I)
    ENDDO
  • Not directly vectorizable

  10. Recognition of Reductions • Assuming commutativity and associativity:
    S = 0.0
    DO k = 1, 4
      SUM(k) = 0.0
      DO I = k, N, 4
        SUM(k) = SUM(k) + A(I)
      ENDDO
      S = S + SUM(k)
    ENDDO
  • Distribute the k loop:
    S = 0.0
    DO k = 1, 4
      SUM(k) = 0.0
    ENDDO
    DO k = 1, 4
      DO I = k, N, 4
        SUM(k) = SUM(k) + A(I)
      ENDDO
    ENDDO
    DO k = 1, 4
      S = S + SUM(k)
    ENDDO
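
A minimal Python sketch of the four-way splitting (function names mine; integer data is used so that the reassociation is exact, which the floating-point version is not):

```python
def serial_sum(A):
    # S = 0; DO I = 1, N: S = S + A(I)
    S = 0
    for a in A:
        S = S + a
    return S

def four_way_sum(A):
    # SUM(k) accumulates every fourth element starting at k (k = 1..4),
    # then the four partial sums are combined; valid given associativity.
    SUM = [0] * 4
    for k in range(4):
        for i in range(k, len(A), 4):   # models DO I = k, N, 4
            SUM[k] += A[i]
    return SUM[0] + SUM[1] + SUM[2] + SUM[3]
```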

  11. Recognition of Reductions • After loop interchange:
    DO I = 1, N, 4
      DO k = I, min(I+3,N)
        SUM(k-I+1) = SUM(k-I+1) + A(k)
      ENDDO
    ENDDO
  • Vectorize:
    DO I = 1, N, 4
      SUM(1:4) = SUM(1:4) + A(I:I+3)
    ENDDO

  12. Recognition of Reductions • Useful for vector machines with 4 stage pipeline • Recognize Reduction and Replace by the efficient version

  13. Recognition of Reductions • Properties of Reductions • Reduce Vector/Array to one element • No use of Intermediate values • Reduction operates on vector and nothing else

  14. Recognition of Reductions • A reduction is recognized by • Presence of self true, output, and anti dependences • Absence of other true dependences
    DO I = 1, N          ! a reduction
      S = S + A(I)
    ENDDO
    DO I = 1, N          ! not a reduction: S has another true dependence
      S = S + A(I)
      T(I) = S
    ENDDO

  15. Index-set Splitting • Subdivide loop into different iteration ranges to achieve partial parallelization • Threshold Analysis [Strong SIV, Weak Crossing SIV] • Loop Peeling [Weak Zero SIV] • Section Based Splitting [Variation of loop peeling]

  16. Threshold Analysis
    DO I = 1, 20
      A(I+20) = A(I) + B
    ENDDO
  • Vectorize to
    A(21:40) = A(1:20) + B
    DO I = 1, 100
      A(I+20) = A(I) + B
    ENDDO
  • The dependence threshold is 20, so strip mine to
    DO I = 1, 100, 20
      DO i = I, I+19
        A(i+20) = A(i) + B
      ENDDO
    ENDDO
  • and vectorize the inner loop
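
A minimal Python sketch of the strip-mined version (names and the 0-based model are mine): because the dependence distance is 20, each 20-iteration strip is dependence-free and can run as one vector statement.

```python
def scalar(A0, B, N):
    # DO I = 1, N: A(I+20) = A(I) + B   (0-based model)
    A = list(A0)
    for i in range(N):
        A[i + 20] = A[i] + B
    return A

def strip_mined(A0, B, N):
    # Strips of 20 (the threshold): within one strip no write can reach a
    # later read, so the whole strip executes as a vector assignment.
    A = list(A0)
    for I in range(0, N, 20):
        A[I + 20:I + 40] = [x + B for x in A[I:I + 20]]
    return A
```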

  17. Threshold Analysis • Crossing thresholds:
    DO I = 1, 100
      A(101-I) = A(I) + B
    ENDDO
  • The dependence crosses between I = 50 and I = 51, so strip mine to
    DO I = 1, 100, 50
      DO i = I, I+49
        A(101-i) = A(i) + B
      ENDDO
    ENDDO
  • Vectorize to
    DO I = 1, 100, 50
      A(101-I:52-I:-1) = A(I:I+49) + B
    ENDDO
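
A minimal Python sketch of the crossing-threshold split (names mine; 1-based model with A[0] unused, and the loop written as A(101-I) = A(I) + B to match the strip-mined form): each strip of 50 reads one half of A and writes the other, so both strips vectorize.

```python
def scalar(A0, B):
    # DO I = 1, 100: A(101-I) = A(I) + B
    A = list(A0)
    for I in range(1, 101):
        A[101 - I] = A[I] + B
    return A

def split(A0, B):
    # Two strips of 50, split at the crossing point: within each strip the
    # read section A(I:I+49) and the write section are disjoint.
    A = list(A0)
    for I in (1, 51):
        src = A[I:I + 50]                              # all reads before any write
        A[52 - I:102 - I] = [x + B for x in reversed(src)]  # A(101-i) for i = I..I+49
    return A
```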

  18. Loop Peeling • The source of the dependence is a single iteration:
    DO I = 1, N
      A(I) = A(I) + A(1)
    ENDDO
  • Peel off the first iteration:
    A(1) = A(1) + A(1)
    DO I = 2, N
      A(I) = A(I) + A(1)
    ENDDO
  • Vectorize to
    A(1) = A(1) + A(1)
    A(2:N) = A(2:N) + A(1)
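
A minimal Python sketch of loop peeling (names mine; 1-based model with A[0] unused): once the first iteration is peeled, A(1) is loop-invariant and the rest vectorizes.

```python
def scalar(A0):
    # DO I = 1, N: A(I) = A(I) + A(1)
    A = list(A0)
    for I in range(1, len(A)):
        A[I] = A[I] + A[1]
    return A

def peeled(A0):
    A = list(A0)
    A[1] = A[1] + A[1]                  # peeled first iteration
    a1 = A[1]                           # A(1) is now loop-invariant
    A[2:] = [x + a1 for x in A[2:]]     # A(2:N) = A(2:N) + A(1)
    return A
```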

  19. Section-based Splitting
    DO I = 1, N
      DO J = 1, N/2
  S1:     B(J,I) = A(J,I) + C
      ENDDO
      DO J = 1, N
  S2:     A(J,I+1) = B(J,I) + D
      ENDDO
    ENDDO
  • The loop nest is bound by a recurrence due to B, but only a portion of B is responsible for it
  • Partition the second J loop into a loop that uses the result of S1 and a loop that does not:
    DO I = 1, N
      DO J = 1, N/2
  S1:     B(J,I) = A(J,I) + C
      ENDDO
      DO J = 1, N/2
  S2:     A(J,I+1) = B(J,I) + D
      ENDDO
      DO J = N/2+1, N
  S3:     A(J,I+1) = B(J,I) + D
      ENDDO
    ENDDO

  20. Section-based Splitting
    DO I = 1, N
      DO J = 1, N/2
  S1:     B(J,I) = A(J,I) + C
      ENDDO
      DO J = 1, N/2
  S2:     A(J,I+1) = B(J,I) + D
      ENDDO
      DO J = N/2+1, N
  S3:     A(J,I+1) = B(J,I) + D
      ENDDO
    ENDDO
  • S3 is now independent of S1 and S2; distribute the I loop:
    DO I = 1, N
      DO J = N/2+1, N
  S3:     A(J,I+1) = B(J,I) + D
      ENDDO
    ENDDO
    DO I = 1, N
      DO J = 1, N/2
  S1:     B(J,I) = A(J,I) + C
      ENDDO
      DO J = 1, N/2
  S2:     A(J,I+1) = B(J,I) + D
      ENDDO
    ENDDO

  21. Section-based Splitting
    DO I = 1, N
      DO J = N/2+1, N
  S3:     A(J,I+1) = B(J,I) + D
      ENDDO
    ENDDO
    DO I = 1, N
      DO J = 1, N/2
  S1:     B(J,I) = A(J,I) + C
      ENDDO
      DO J = 1, N/2
  S2:     A(J,I+1) = B(J,I) + D
      ENDDO
    ENDDO
  • Vectorized to
    A(N/2+1:N,2:N+1) = B(N/2+1:N,1:N) + D
    DO I = 1, N
      B(1:N/2,I) = A(1:N/2,I) + C
      A(1:N/2,I+1) = B(1:N/2,I) + D
    ENDDO
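
The whole section-based split can be checked in a minimal Python sketch (names and the 0-based A[J][I] layout are mine): S3 only reads the upper half of B, which is never written, so distributing it out and vectorizing it leaves the result unchanged.

```python
def original(A0, C, D, N):
    # DO I = 1, N:
    #   DO J = 1, N/2:  B(J,I)   = A(J,I) + C
    #   DO J = 1, N:    A(J,I+1) = B(J,I) + D
    A = [list(r) for r in A0]           # A[J][I], J = 0..N-1, I = 0..N
    B = [[0] * N for _ in range(N)]
    for I in range(N):
        for J in range(N // 2):
            B[J][I] = A[J][I] + C
        for J in range(N):
            A[J][I + 1] = B[J][I] + D
    return A

def split(A0, C, D, N):
    A = [list(r) for r in A0]
    B = [[0] * N for _ in range(N)]
    # S3 distributed out: the upper half of B is never written, so the whole
    # I loop over it is one vector statement A(N/2+1:N,2:N+1) = B(N/2+1:N,1:N) + D
    for J in range(N // 2, N):
        A[J][1:N + 1] = [b + D for b in B[J]]
    # the remaining recurrence involves only the lower half of B
    for I in range(N):
        for J in range(N // 2):
            B[J][I] = A[J][I] + C
            A[J][I + 1] = B[J][I] + D
    return A
```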

  22. Run-time Symbolic Resolution • "Breaking conditions":
    DO I = 1, N
      A(I+L) = A(I) + B(I)
    ENDDO
  • Transformed to
    IF (L.LE.0) THEN
      A(L+1:N+L) = A(1:N) + B(1:N)
    ELSE
      DO I = 1, N
        A(I+L) = A(I) + B(I)
      ENDDO
    ENDIF
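
A minimal Python sketch of the breaking condition (names mine; 1-based model, restricted to L >= -1 so indices stay non-negative): when L <= 0 there is no loop-carried true dependence, so the vector section A(L+1:N+L) = A(1:N) + B(1:N) is safe; otherwise the scalar loop runs.

```python
def scalar(A0, B, L, N):
    # DO I = 1, N: A(I+L) = A(I) + B(I)
    A = list(A0)
    for I in range(1, N + 1):
        A[I + L] = A[I] + B[I]
    return A

def resolved(A0, B, L, N):
    A = list(A0)
    if L <= 0:   # breaking condition: no loop-carried true dependence
        # vector semantics: all of A(1:N) is read before any element is written
        A[L + 1:N + L + 1] = [a + b for a, b in zip(A[1:N + 1], B[1:N + 1])]
    else:        # fall back to the scalar loop
        for I in range(1, N + 1):
            A[I + L] = A[I] + B[I]
    return A
```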

  23. Run-time Symbolic Resolution • Identifying the minimum number of breaking conditions needed to break a recurrence is NP-complete • Heuristic: • Identify when a critical dependence can be conditionally eliminated via a breaking condition

  24. Loop Skewing • Reshape the iteration space to uncover parallelism:
    DO I = 1, N
      DO J = 1, N
  S:      A(I,J) = A(I-1,J) + A(I,J-1)   ! direction vectors (<,=) and (=,<)
      ENDDO
    ENDDO
  • Parallelism is not apparent

  25. Loop Skewing • Dependence Pattern before loop skewing

  26. Loop Skewing • Apply the skewing transformation j = J + I (i.e., J = j - I):
    DO I = 1, N
      DO j = I+1, I+N
  S:      A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)   ! direction vectors (<,<) and (=,<)
      ENDDO
    ENDDO
  • Note: the direction vectors change

  27. Loop Skewing • The accesses to A follow this pattern (the skewed statement instance is written S(I,j)):
    A(1,1) = A(0,1) + A(1,0)   S(1,2)
    A(1,2) = A(0,2) + A(1,1)   S(1,3)
    A(1,3) = A(0,3) + A(1,2)   S(1,4)
    A(1,4) = A(0,4) + A(1,3)   S(1,5)
    A(2,1) = A(1,1) + A(2,0)   S(2,3)
    A(2,2) = A(1,2) + A(2,1)   S(2,4)
    A(2,3) = A(1,3) + A(2,2)   S(2,5)
    A(2,4) = A(1,4) + A(2,3)   S(2,6)

  28. Loop Skewing • Dependence pattern after loop skewing

  29. Loop Skewing
    DO I = 1, N
      DO j = I+1, I+N
  S:      A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
      ENDDO
    ENDDO
  • Loop interchange to
    DO j = 2, N+N
      DO I = max(1,j-N), min(N,j-1)
  S:      A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
      ENDDO
    ENDDO
  • Vectorize to
    DO j = 2, N+N
      FORALL I = max(1,j-N), min(N,j-1)
  S:      A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
      END FORALL
    ENDDO
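
A minimal Python sketch of the skewed form (names and the boundary initialization are mine): after skewing and interchange, all iterations of the inner I loop for a fixed j touch independent elements (a wavefront along the anti-diagonal), and the result matches the original nest.

```python
def init(N):
    # A(0:N, 0:N) with the boundary row and column set to 1
    A = [[0] * (N + 1) for _ in range(N + 1)]
    for t in range(N + 1):
        A[0][t] = 1
        A[t][0] = 1
    return A

def original(N):
    # DO I = 1, N: DO J = 1, N: A(I,J) = A(I-1,J) + A(I,J-1)
    A = init(N)
    for I in range(1, N + 1):
        for J in range(1, N + 1):
            A[I][J] = A[I - 1][J] + A[I][J - 1]
    return A

def skewed(N):
    # j = J + I; after interchange, the inner I loop is a parallel wavefront:
    # both operands lie on diagonal j-1, already computed.
    A = init(N)
    for j in range(2, 2 * N + 1):
        for I in range(max(1, j - N), min(N, j - 1) + 1):
            A[I][j - I] = A[I - 1][j - I] + A[I][j - I - 1]
    return A
```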

  30. Loop Skewing • Disadvantages: • Varying vector length; not profitable if N is small • If vector startup time exceeds the time saved, the transformation is not profitable • Vector bounds must be recomputed on each iteration of the outer loop • Apply loop skewing only if everything else fails

  31. Putting It All Together • Good Part • Many transformations imply more choices to exploit parallelism • Bad Part • Choosing the right transformation • How to automate transformation selection process? • Interference between transformations

  32. Putting It All Together • Example of interference:
    DO I = 1, N
      DO J = 1, M
        S(I) = S(I) + A(I,J)
      ENDDO
    ENDDO
  • Sum reduction gives
    DO I = 1, N
      S(I) = S(I) + SUM(A(I,1:M))
    ENDDO
  • while loop interchange and vectorization give
    DO J = 1, M
      S(1:N) = S(1:N) + A(1:N,J)
    ENDDO
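
Both transformed forms compute the same result; the interference is about which one the compiler should pick, not about correctness. A minimal Python sketch (names mine; A modeled as a list of rows, integer data so reassociation is exact):

```python
def reduction_form(S0, A):
    # DO I = 1, N: S(I) = S(I) + SUM(A(I,1:M))
    return [s + sum(row) for s, row in zip(S0, A)]

def interchange_form(S0, A):
    # DO J = 1, M: S(1:N) = S(1:N) + A(1:N,J)
    S = list(S0)
    for j in range(len(A[0])):
        S = [s + row[j] for s, row in zip(S, A)]
    return S
```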

  33. Putting It All Together • Any algorithm that tries to tie all the transformations together must • Take a global view of the transformed code • Know the architecture of the target machine • Goal of our algorithm: • Find ONE good vector loop [works well for most vector register architectures]

  34. Unified Framework • Detection: finding ALL loops for EACH statement that can be run in vector • Selection: choosing best loop for vector execution for EACH statement • Transformation: carrying out the transformations necessary to vectorize the selected loop

  35. Unified Framework: Detection
    procedure mark_loop(S, D)
      for each edge e in D deletable by scalar expansion, array and scalar
          renaming, node splitting, or symbolic resolution do begin
        add e to deletable_edges;
        delete e from D;
      end
      mark_gen(S, 1, D);
      for each statement x in S with no vector loop marked do begin
        attempt index-set splitting and loop skewing;
        mark vector loops found;
      end
      // Restore deletable edges from deletable_edges to D
    end mark_loop

  36. Unified Framework: Detection
    procedure mark_gen(S, k, D)
      // Variation of codegen; does no vectorization, only marks vector loops
      for i = 1 to m do begin
        if Si is cyclic then
          if the outermost carried dependence is at level p > k then
            // Loop shifting
            mark all loops at level < p as vector for Si;
          else if Si is a reduction then begin
            mark loop k as vector;
            mark Si reduction;
          end
          else begin
            // Recur at deeper level
            mark_gen(Si, k+1, Di);
          end
        else
          mark statements in Si as vector for loops k and deeper;
      end
    end mark_gen

  37. Selection and Transformation
    procedure transform_code(R, k, D)
      // Variation of codegen
      for i = 1 to m do begin
        if k is the index of the best vector loop for some statement in Ri then
          if Ri is cyclic then begin
            select_and_apply_transformation(Ri, k, D);
            // Retry vectorization on the new dependence graph
            transform_code(Ri, k, D);
          end
          else
            generate a vector statement for Ri in loop k;
        else begin
          // Recur at deeper level
          // Generate level-k DO and ENDDO statements
          transform_code(Ri, k+1, D);
        end
      end
    end transform_code

  38. Selection of Transformations
    procedure select_and_apply_transformation(Ri, k, D)
      if loop k does not carry a dependence in Ri then
        shift loop k to the innermost position;
      else if Ri is a reduction at level k then
        replace with a reduction and adjust dependences;
      else  // transform and adjust dependences
        if array renaming is possible then
          apply array renaming and adjust dependences;
        else if node splitting is possible then
          apply node splitting and adjust dependences;
        else if scalar expansion is possible then
          apply scalar expansion and adjust dependences;
        else
          apply loop skewing or index-set splitting and adjust dependences;
    end select_and_apply_transformation

  39. Performance on Benchmark
