
Coarse-Grain Parallelism



  1. Coarse-Grain Parallelism Chapter 6 of Allen and Kennedy

  2. Introduction • Previously, our transformations targeted vector and superscalar architectures. • In this lecture, we worry about transformations for symmetric multiprocessor machines. • The difference between these transformations tends to be one of granularity.

  3. Review • SMP machines have multiple processors all accessing a central memory. • The processors are unrelated, and can run separate processes. • Starting processes and synchronization between processes is expensive.

  4. Synchronization • A basic synchronization element is the barrier. • A barrier in a program forces all processes to reach a certain point before execution continues. • Bus contention can cause slowdowns.
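The barrier idiom can be sketched with Python threads standing in for processes (an illustration only, not SMP hardware): each worker writes its own element in phase 1, waits at the barrier, and only then reads another worker's result in phase 2.

```python
import threading

N = 4
barrier = threading.Barrier(N)
a = [0] * N
b = [0] * N

def worker(i):
    # Phase 1: each worker writes its own element of a.
    a[i] = i * 2
    # Barrier: no worker continues until every worker has finished phase 1.
    barrier.wait()
    # Phase 2: now it is safe to read any element of a.
    b[i] = a[(i + 1) % N]

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(b)   # [2, 4, 6, 0]
```

Without the barrier, a worker could read a neighbor's element before it was written, so the result would depend on scheduling.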

  5. Single Loops • The analog of scalar expansion is privatization. • Temporaries can be given separate namespaces for each iteration.

    DO I = 1,N
    S1 T = A(I)
    S2 A(I) = B(I)
    S3 B(I) = T
    ENDDO

    PARALLEL DO I = 1,N
    PRIVATE t
    S1 t = A(I)
    S2 A(I) = B(I)
    S3 B(I) = t
    ENDDO

  6. Privatization Definition: A scalar variable x in a loop L is said to be privatizable if every path from the loop entry to a use of x inside the loop passes through a definition of x. Privatizability can be stated as a data-flow problem: x is privatizable exactly when it has no upwards-exposed use in the loop body. We can also do this by declaring a variable x private if its SSA graph doesn’t contain a phi function at the loop entry.
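As a rough illustration (not the book's algorithm), the privatizability test for a straight-line loop body can be sketched in Python. The `(defs, uses)` representation of statements is an assumption made here for the example; real compilers solve the data-flow problem over the full control-flow graph.

```python
def privatizable(body, var):
    """Check whether `var` is privatizable in a straight-line loop body.

    body: list of (defs, uses) pairs, one per statement, in program order.
    A scalar is privatizable if every use is preceded by a definition
    inside the body; with no branches, that means the first reference
    to it must be a write."""
    for defs, uses in body:
        if var in uses:
            return False   # upwards-exposed use: value flows in from outside
        if var in defs:
            return True    # defined before any use: safe to privatize
    return True            # never referenced: trivially privatizable

# T = A(I); A(I) = B(I); B(I) = T  -- T is written before it is read
swap = [({"T"}, {"A"}), ({"A"}, {"B"}), ({"B"}, {"T"})]
print(privatizable(swap, "T"))   # True

# S = S + A(I)  -- S is read before it is written: not privatizable
accum = [({"S"}, {"S", "A"})]
print(privatizable(accum, "S"))  # False
```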

  7. Array Privatization We need to privatize array variables as well. For iteration J, the upwards-exposed variables are those used in the loop body before being defined in that iteration.

    DO I = 1,100
    S0 T(1) = X
    L1 DO J = 2,N
    S1   T(J) = T(J-1)+B(I,J)
    S2   A(I,J) = T(J)
       ENDDO
    ENDDO

For this fragment, T(1) is the only upwards-exposed variable.

  8. Array Privatization • Using this analysis, we get the following code:

    PARALLEL DO I = 1,100
    PRIVATE t
    S0 t(1) = X
    L1 DO J = 2,N
    S1   t(J) = t(J-1)+B(I,J)
    S2   A(I,J) = t(J)
       ENDDO
    ENDDO

  9. Loop Distribution • Loop distribution eliminates carried dependencies. • Consequently, it often creates opportunity for outer-loop parallelism. • We must add extra barriers to keep dependent loops from executing out of order, so the overhead may outweigh the savings from parallelism. • Attempt other transformations before attempting this one.
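The effect of distribution can be checked with a small Python model (the loop and array names are illustrative, not from the book): the original loop carries a dependence, but after splitting, each loop is dependence-free and only a barrier between them is needed.

```python
N = 8
B = [float(i) for i in range(N + 1)]
C = 1.0

# Original loop: S2 reads A(I-1), written by S1 on the previous
# iteration, so the dependence is carried and the loop must run serially.
A = [0.0] * (N + 1)
D = [0.0] * (N + 1)
for i in range(1, N + 1):
    A[i] = B[i] + C          # S1
    D[i] = A[i - 1] * 2.0    # S2

# After distribution: neither loop carries a dependence, so each can run
# as a parallel loop; a barrier between them preserves the S1 -> S2 order.
A2 = [0.0] * (N + 1)
D2 = [0.0] * (N + 1)
for i in range(1, N + 1):    # parallel loop 1
    A2[i] = B[i] + C
# --- barrier ---
for i in range(1, N + 1):    # parallel loop 2
    D2[i] = A2[i - 1] * 2.0

print(A == A2 and D == D2)   # True: distribution preserves semantics
```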

  10. Alignment • Many carried dependencies are due to array alignment issues. • If we can align all references, the dependencies go away and parallelism becomes possible.

    DO I = 2,N
      A(I) = B(I)+C(I)
      D(I) = A(I-1)*2.0
    ENDDO

    DO I = 1,N
      IF (I .GT. 1) A(I) = B(I)+C(I)
      IF (I .LT. N) D(I+1) = A(I)*2.0
    ENDDO
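A quick Python check of the alignment idea (array sizes and bounds are chosen here so that both versions compute exactly the same elements): shifting the second statement by one iteration makes each A value be produced and consumed in the same iteration, so the loop no longer carries a dependence.

```python
N = 8
B = [float(i) for i in range(N + 2)]
C = [1.0] * (N + 2)

# Original loop: D(I) reads A(I-1), a dependence carried with distance 1.
A = [0.0] * (N + 2)
D = [0.0] * (N + 2)
for i in range(2, N + 1):
    A[i] = B[i] + C[i]
    D[i] = A[i - 1] * 2.0

# Aligned loop: the D statement is shifted (D(I+1) = A(I)*2.0) and the
# boundary iterations are guarded; A(I) is now produced and consumed in
# the same iteration, so no dependence is carried.
A2 = [0.0] * (N + 2)
D2 = [0.0] * (N + 2)
for i in range(1, N + 1):
    if i > 1:
        A2[i] = B[i] + C[i]
    if i < N:
        D2[i + 1] = A2[i] * 2.0

print(A == A2 and D == D2)   # True: alignment preserves semantics
```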

  11. Alignment • There are other ways to align the loop:

    DO I = 2,N
      J = MOD(I+N-4,N-1)+2
      A(J) = B(J)+C(J)
      D(I) = A(I-1)*2.0
    ENDDO

    D(2) = A(1)*2.0
    DO I = 2,N-1
      A(I) = B(I)+C(I)
      D(I+1) = A(I)*2.0
    ENDDO
    A(N) = B(N)+C(N)

  12. Alignment • If an array is involved in a recurrence, then alignment isn’t possible. • If two dependencies between the same statements have different dependence distances, then alignment alone doesn’t work. • We can fix the second case by replicating code:

    DO I = 1,N
      A(I+1) = B(I)+C
      X(I) = A(I+1)+A(I)
    ENDDO

    DO I = 1,N
      A(I+1) = B(I)+C
      ! Replicated statement
      IF (I .EQ. 1) THEN
        t = A(I)
      ELSE
        t = B(I-1)+C
      END IF
      X(I) = A(I+1)+t
    ENDDO

  13. Alignment Theorem: Alignment, replication, and statement reordering are sufficient to eliminate all carried dependencies in a single loop containing no recurrence, and in which the distance of each dependence is a constant independent of the loop index. • We can establish this constructively. • Let G = (V, E, δ) be a weighted graph, where each v ∈ V is a statement and δ(v1, v2) is the dependence distance between v1 and v2. Let o: V → Z give the offset of each vertex. • G is said to be carry-free if o(v1) + δ(v1, v2) = o(v2) for every edge.

  14. Alignment Procedure

    procedure Align(V, E, δ, o)
      while V is not empty
        remove element v from V
        for each (w,v) ∈ E
          if w ∈ V
            W ← W ∪ {w}
            o(w) ← o(v) − δ(w,v)
          else if o(w) ≠ o(v) − δ(w,v)
            create vertex w′
            replace (w,v) with (w′,v)
            replicate all edges into w onto w′
            W ← W ∪ {w′}
            o(w′) ← o(v) − δ(w,v)
        for each (v,w) ∈ E
          if w ∈ V
            W ← W ∪ {w}
            o(w) ← o(v) + δ(v,w)
          else if o(w) ≠ o(v) + δ(v,w)
            create vertex v′
            replace (v,w) with (v′,w)
            replicate edges into v onto v′
            W ← W ∪ {v′}
            o(v′) ← o(w) − δ(v,w)
    end Align
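The offset computation at the heart of alignment can be sketched in Python. This simplified version (the function and variable names are mine, not the book's) only handles recurrence-free graphs with consistent distances: it assigns offsets by traversal and reports a conflict instead of performing replication.

```python
from collections import deque

def align_offsets(vertices, edges):
    """Assign an offset o(v) to each statement so that every dependence
    (v, w) with distance d satisfies o(v) + d == o(w); such offsets make
    the aligned loop carry-free.

    edges: dict mapping (v, w) -> dependence distance d.
    Returns a dict of offsets, or None when distances conflict (the case
    the full Align procedure resolves by replicating a statement)."""
    adj = {v: [] for v in vertices}
    for (v, w), d in edges.items():
        adj[v].append((w, d))
        adj[w].append((v, -d))       # traverse dependences in both directions
    offsets = {}
    for start in vertices:
        if start in offsets:
            continue
        offsets[start] = 0           # anchor each component at offset 0
        work = deque([start])
        while work:
            v = work.popleft()
            for w, d in adj[v]:
                if w not in offsets:
                    offsets[w] = offsets[v] + d
                    work.append(w)
                elif offsets[w] != offsets[v] + d:
                    return None      # inconsistent distances: needs replication
    return offsets

# S1 -> S2 with distance 1: shifting S2 by one iteration aligns the loop.
print(align_offsets(["S1", "S2"], {("S1", "S2"): 1}))
# {'S1': 0, 'S2': 1}

# Two paths from S1 to S2 with different total distances: no consistent
# offsets exist, so alignment alone fails.
print(align_offsets(["S1", "S2", "S3"],
                    {("S1", "S2"): 1, ("S1", "S3"): 0, ("S3", "S2"): 0}))
# None
```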

  15. Loop Fusion • Loop distribution was a method for separating parallel parts of a loop. • Our solution attempted to find the maximal loop distribution. • The maximal distribution often produces parallelizable components too small for efficient parallelization. • Two obvious solutions: • Strip mine large loops to create larger granularity. • Perform maximal distribution, and fuse together parallelizable loops.

  16. Fusion Safety Definition: A loop-independent dependence between statements S1 and S2 in loops L1 and L2 respectively is fusion-preventing if fusing L1 and L2 causes the dependence to be carried by the combined loop in the opposite direction.

    DO I = 1,N
    S1 A(I) = B(I)+C
    ENDDO
    DO I = 1,N
    S2 D(I) = A(I+1)+E
    ENDDO

    DO I = 1,N
    S1 A(I) = B(I)+C
    S2 D(I) = A(I+1)+E
    ENDDO
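A Python model of this example (array names follow the slide; sizes are illustrative) shows the dependence reversal concretely: in the fused loop, S2 reads A(I+1) before S1 has written it, so the results change.

```python
N = 8
B = [float(i) for i in range(N + 2)]
C, E = 1.0, 2.0

# Separate loops: S2 reads A(I+1) only after loop 1 has written all of A.
A = [0.0] * (N + 2)
D = [0.0] * (N + 2)
for i in range(1, N + 1):
    A[i] = B[i] + C          # S1
for i in range(1, N + 1):
    D[i] = A[i + 1] + E      # S2

# Fused: at iteration I, S1 has not yet written A(I+1) (that happens at
# iteration I+1), so S2 reads the stale value -- the loop-independent
# dependence has become a backward carried dependence.
Af = [0.0] * (N + 2)
Df = [0.0] * (N + 2)
for i in range(1, N + 1):
    Af[i] = B[i] + C
    Df[i] = Af[i + 1] + E

print(D == Df)   # False: this dependence is fusion-preventing
```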

  17. Fusion Safety • We shouldn’t fuse loops if fusing would violate the ordering of the dependence graph. • Ordering Constraint: Two loops can’t be validly fused if there exists a path of loop-independent dependencies between them containing a loop or statement not being fused with them. In the figure, fusing L1 with L3 violates the ordering constraint: {L1,L3} would have to occur both before and after the node L2.

  18. Fusion Profitability • Parallel loops should generally not be merged with sequential loops. • Definition: An edge between two statements in loops L1 and L2 respectively is said to be parallelism-inhibiting if, after merging L1 and L2, the dependence is carried by the combined loop.

    DO I = 1,N
    S1 A(I+1) = B(I) + C
    ENDDO
    DO I = 1,N
    S2 D(I) = A(I) + E
    ENDDO

    DO I = 1,N
    S1 A(I+1) = B(I) + C
    S2 D(I) = A(I) + E
    ENDDO

  19. Typed Fusion • We start off by classifying loops into two types: parallel and sequential. • We next gather together all edges that inhibit efficient fusion, and call them bad edges. • Given a loop dependency graph (V,E), we want to obtain a graph (V’,E’) by merging vertices of V subject to the following constraints: • Bad Edge Constraint: vertices joined by a bad edge aren’t fused. • Ordering Constraint: vertices joined by a path containing a non-parallel vertex aren’t fused.

  20. Typed Fusion Procedure

    procedure TypedFusion(G, T, type, B, t0)
      initialize all variables to zero
      set count[n] to the in-degree of node n
      initialize W with all nodes of in-degree zero
      while W is not empty
        remove element n with type t from W
        if t = t0
          if maxBadPrev[n] = 0 then p ← fused
          else p ← next[maxBadPrev[n]]
          if p ≠ 0 then
            x ← node[p]
            num[n] ← num[x]
            update_successors(n, t)
            fuse x and n and call the result n
          else
            create_new_fused_node(n)
            update_successors(n, t)
        else
          create_new_node(n)
          update_successors(n, t)
    end TypedFusion
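A much-simplified sketch of the idea in Python (names and the straight-line restriction are mine): loops are scanned in program order, and a loop is fused into the immediately preceding group when both have the same type and no bad edge connects the group to it. The real TypedFusion works on a general dependence graph and can also fuse non-adjacent loops when the ordering constraint allows.

```python
def typed_fusion(types, bad_edges):
    """Greedy typed fusion for a straight-line sequence of loops 0..n-1.

    types: list of loop types in program order, e.g. "P" (parallel)
           or "S" (sequential).
    bad_edges: set of (i, j) pairs that must not end up in one group
           (fusion-preventing or parallelism-inhibiting edges).
    Returns the list of fused groups, each a list of loop indices."""
    groups = []
    for i, t in enumerate(types):
        if (groups
                and types[groups[-1][-1]] == t       # same type as last group
                and not any((j, i) in bad_edges      # no bad edge into i
                            for j in groups[-1])):
            groups[-1].append(i)                     # fuse into current group
        else:
            groups.append([i])                       # start a new group
    return groups

# Four loops: parallel, parallel, sequential, parallel.
print(typed_fusion(["P", "P", "S", "P"], set()))
# [[0, 1], [2], [3]] -- only the adjacent parallel pair fuses

print(typed_fusion(["P", "P", "S", "P"], {(0, 1)}))
# [[0], [1], [2], [3]] -- the bad edge blocks that fusion
```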

  21. Typed Fusion Example (figure panels): original loop graph; graph annotated with (maxBadPrev, p) → num; after fusing parallel loops; after fusing sequential loops.

  22. Cohort Fusion • Given an outer loop containing some number of inner loops, we want to be able to run some inner loops in parallel. • We can do this as follows: • Run TypedFusion with B = {fusion-preventing edges, parallelism-inhibiting edges, and edges between a parallel loop and a sequential loop} • Put a barrier at the end of each identified cohort • Run TypedFusion again to fuse the parallel loops in each cohort
