
Enhancing Fine-Grained Parallelism







1. Enhancing Fine-Grained Parallelism. Chapter 5 of Allen and Kennedy, Optimizing Compilers for Modern Architectures.

2. Fine-Grained Parallelism
Techniques to enhance fine-grained parallelism:
• Loop Interchange
• Scalar Expansion
• Scalar Renaming
• Array Renaming
• Node Splitting

3. We can fail here
Recall the vectorization procedure:

procedure codegen(R, k, D);
  // R is the region for which we must generate code.
  // k is the minimum nesting level of possible parallel loops.
  // D is the dependence graph among statements in R.
  find the set {S1, S2, ..., Sm} of maximal strongly-connected
    regions in the dependence graph D restricted to R;
  construct Rp from R by reducing each Si to a single node, and
    compute Dp, the dependence graph naturally induced on Rp by D;
  let {p1, p2, ..., pm} be the m nodes of Rp numbered in an order
    consistent with Dp (use topological sort to do the numbering);
  for i = 1 to m do begin
    if pi is cyclic then begin
      generate a level-k DO statement;
      let Di be the dependence graph consisting of all dependence
        edges in D that are at level k+1 or greater and are
        internal to pi;
      codegen(pi, k+1, Di);
      generate the level-k ENDDO statement;
    end
    else
      generate a vector statement for pi in r(pi)-k+1 dimensions,
        where r(pi) is the number of loops containing pi;
  end
end

4. Can we do better?
• codegen tries to find parallelism using only two transformations: loop distribution and statement reordering
• If we deal with loops containing cyclic dependences early on in the loop nest, we can potentially vectorize more loops
• Goal in Chapter 5: explore other transformations that expose parallelism

5. Motivational Example
DO J = 1, M
  DO I = 1, N
    T = 0.0
    DO K = 1, L
      T = T + A(I,K) * B(K,J)
    ENDDO
    C(I,J) = T
  ENDDO
ENDDO
codegen will not uncover any vector operations: the scalar T is written and read on every iteration, tying all three statements into one dependence cycle. However, by scalar expansion, we can get:
DO J = 1, M
  DO I = 1, N
    T$(I) = 0.0
    DO K = 1, L
      T$(I) = T$(I) + A(I,K) * B(K,J)
    ENDDO
    C(I,J) = T$(I)
  ENDDO
ENDDO

6. Motivational Example
DO J = 1, M
  DO I = 1, N
    T$(I) = 0.0
    DO K = 1, L
      T$(I) = T$(I) + A(I,K) * B(K,J)
    ENDDO
    C(I,J) = T$(I)
  ENDDO
ENDDO

7. Motivational Example II
• Loop distribution gives us:
DO J = 1, M
  DO I = 1, N
    T$(I) = 0.0
  ENDDO
  DO I = 1, N
    DO K = 1, L
      T$(I) = T$(I) + A(I,K) * B(K,J)
    ENDDO
  ENDDO
  DO I = 1, N
    C(I,J) = T$(I)
  ENDDO
ENDDO

8. Motivational Example III
Finally, interchanging the I and K loops and vectorizing, we get:
DO J = 1, M
  T$(1:N) = 0.0
  DO K = 1, L
    T$(1:N) = T$(1:N) + A(1:N,K) * B(K,J)
  ENDDO
  C(1:N,J) = T$(1:N)
ENDDO
• A couple of new transformations used:
  • Loop interchange
  • Scalar expansion

9. Loop Interchange
DO I = 1, N
  DO J = 1, M
S   A(I,J+1) = A(I,J) + B
  ENDDO
ENDDO
• DV: (=, <)
• Applying loop interchange:
DO J = 1, M
  DO I = 1, N
S   A(I,J+1) = A(I,J) + B
  ENDDO
ENDDO
• DV: (<, =)
• leads to:
DO J = 1, M
S A(1:N,J+1) = A(1:N,J) + B
ENDDO

10. Loop Interchange
• Loop interchange is a reordering transformation
• Why? Think of each statement instance as parameterized by its iteration vector: loop interchange merely changes the execution order of these instances; it does not create new instances or delete existing ones
DO J = 1, M
  DO I = 1, N
S   <some statement>
  ENDDO
ENDDO
• If interchanged, S(2, 1) will execute before S(1, 2)
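A tiny program (my own illustration, not from the slides; M = N = 2 is arbitrary) makes the reordering concrete by printing the instances S(J, I) in both execution orders:

PROGRAM ORDER
  ! Illustrative only: enumerate statement instances S(J,I) for the
  ! original (J-outer) and interchanged (I-outer) loop orders.
  INTEGER, PARAMETER :: M = 2, N = 2
  INTEGER :: I, J
  PRINT *, 'Original order:'
  DO J = 1, M
    DO I = 1, N
      PRINT *, 'S(', J, ',', I, ')'
    ENDDO
  ENDDO
  PRINT *, 'Interchanged order:'
  DO I = 1, N
    DO J = 1, M
      PRINT *, 'S(', J, ',', I, ')'
    ENDDO
  ENDDO
END PROGRAM ORDER

The first order is S(1,1), S(1,2), S(2,1), S(2,2); the second is S(1,1), S(2,1), S(1,2), S(2,2): the same four instances, merely permuted, with S(2,1) now ahead of S(1,2).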

11. Loop Interchange: Safety
• Safety: not all loop interchanges are safe
DO J = 1, M
  DO I = 1, N
    A(I,J+1) = A(I+1,J) + B
  ENDDO
ENDDO
• Direction vector: (<, >)
• If we interchange the loops, the direction vector becomes (>, <), meaning the sink of the dependence would execute before its source: the dependence is violated

  12. Loop Interchange: Safety • A dependence is interchange-preventing with respect to a given pair of loops if interchanging those loops would reorder the endpoints of the dependence.

13. Loop Interchange: Safety
• A dependence is interchange-sensitive if it is carried by the same loop after interchange. That is, an interchange-sensitive dependence moves with its original carrier loop to the new level.
• Example: Interchange-Sensitive?
• Example: Interchange-Insensitive?
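The slide leaves the two examples as prompts. A plausible pair of answers, in the deck's own style (my illustration, constructed from the definitions above):

! Interchange-sensitive: DV (<, =) is carried by the I loop both
! before and after interchange; the dependence moves inward with
! its carrier.
DO I = 1, N
  DO J = 1, M
    A(I+1,J) = A(I,J) + B
  ENDDO
ENDDO

! Interchange-insensitive: DV (<, <) is carried by whichever loop
! is outermost, so the carrier changes when the loops are swapped.
DO I = 1, N
  DO J = 1, M
    A(I+1,J+1) = A(I,J) + B
  ENDDO
ENDDO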

14. Loop Interchange: Safety
• Theorem 5.1: Let D(i,j) be a direction vector for a dependence in a perfect nest of n loops. Then the direction vector for the same dependence after a permutation of the loops in the nest is determined by applying the same permutation to the elements of D(i,j).
• For example, a dependence with direction vector (<, =) in an (I, J) nest has direction vector (=, <) once the nest is interchanged to (J, I).
• The direction matrix for a nest of loops is a matrix in which each row is a direction vector for some dependence between statements contained in the nest, and every such direction vector is represented by a row.

15. Loop Interchange: Safety
DO I = 1, N
  DO J = 1, M
    DO K = 1, L
      A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1)
    ENDDO
  ENDDO
ENDDO
• The direction matrix for the loop nest is:
  < < =
  < = >
• Theorem 5.2: A permutation of the loops in a perfect nest is legal if and only if the direction matrix, after the same permutation is applied to its columns, has no ">" direction as the leftmost non-"=" direction in any row.
• Follows from Theorem 5.1 and Theorem 2.3
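Theorem 5.2 can be applied mechanically. The following function is a sketch of my own (the integer encoding and all names are assumptions, not from the book): each direction is stored as -1 for "<", 0 for "=", and +1 for ">", and each row is scanned for its leftmost non-"=" entry after the permutation.

! Sketch: test the legality of a loop permutation per Theorem 5.2.
! DIR(d,l) is the direction of dependence d at loop l; PERM lists
! the loops in their proposed outermost-to-innermost order.
LOGICAL FUNCTION LEGAL(DIR, NDEP, NLOOPS, PERM)
  INTEGER, INTENT(IN) :: NDEP, NLOOPS
  INTEGER, INTENT(IN) :: DIR(NDEP,NLOOPS), PERM(NLOOPS)
  INTEGER :: I, J, D
  LEGAL = .TRUE.
  DO I = 1, NDEP
    DO J = 1, NLOOPS
      D = DIR(I, PERM(J))
      IF (D < 0) EXIT            ! leftmost non-"=" is "<": row is safe
      IF (D > 0) THEN            ! leftmost non-"=" is ">": illegal
        LEGAL = .FALSE.
        RETURN
      END IF
    END DO                        ! entry was "=": keep scanning right
  END DO
END FUNCTION LEGAL

For the nest above, reversing the nest to put K outermost (PERM = (3, 2, 1)) turns the second row into (>, =, <), whose leftmost non-"=" entry is ">", so that permutation is rejected; interchanging only J and K (PERM = (1, 3, 2)) yields rows (<, =, <) and (<, >, =), both leading with "<", so it is accepted.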

16. Loop Interchange: Profitability
• Profitability depends on the target architecture
DO I = 1, N
  DO J = 1, M
    DO K = 1, L
S     A(I+1,J+1,K) = A(I,J,K) + B
    ENDDO
  ENDDO
ENDDO
• For SIMD machines with a large number of functional units:
DO I = 1, N
S A(I+1,2:M+1,1:L) = A(I,1:M,1:L) + B
ENDDO
• Not suitable for vector register machines

17. Loop Interchange: Profitability
• For vector machines, we want to vectorize loops with stride-one memory access
• Since Fortran stores arrays in column-major order, it is most useful to vectorize the I loop
• Thus, transform to:
DO J = 1, M
  DO K = 1, L
S   A(2:N+1,J+1,K) = A(1:N,J,K) + B
  ENDDO
ENDDO

18. Loop Interchange: Profitability
• For MIMD machines with vector execution units, we also want to cut down synchronization costs
• Hence, shift the K loop, which carries no dependence (its direction entry is "="), to the outermost level and run it in parallel:
PARALLEL DO K = 1, L
  DO J = 1, M
    A(2:N+1,J+1,K) = A(1:N,J,K) + B
  ENDDO
END PARALLEL DO

19. Scalar Expansion
DO I = 1, N
S1  T = A(I)
S2  A(I) = B(I)
S3  B(I) = T
ENDDO
• Scalar expansion:
DO I = 1, N
S1  T$(I) = A(I)
S2  A(I) = B(I)
S3  B(I) = T$(I)
ENDDO
T = T$(N)
• leads to:
S1 T$(1:N) = A(1:N)
S2 A(1:N) = B(1:N)
S3 B(1:N) = T$(1:N)
T = T$(N)

20. Scalar Expansion
• However, expansion is not always profitable. Consider:
DO I = 1, N
  T = T + A(I) + A(I+1)
  A(I) = T
ENDDO
• Scalar expansion gives us:
T$(0) = T
DO I = 1, N
S1  T$(I) = T$(I-1) + A(I) + A(I+1)
S2  A(I) = T$(I)
ENDDO
T = T$(N)
• The carried true dependence of S1 on itself remains: T carries a recurrence, so expansion exposes no parallelism here

21. Scalar Expansion: Safety
• Scalar expansion is always safe
• When is it profitable?
• Naïve approach: expand all scalars, vectorize, then shrink all unnecessary expansions
• However, we want to predict when expansion is profitable
• Key distinction: dependences due to reuse of a memory location vs. dependences due to reuse of values
  • Dependences due to reuse of values must be preserved
  • Dependences due to reuse of a memory location can be deleted by expansion (see the contrast below)
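A minimal contrast to the previous slide's recurrence (my own example, not from the slides): here T is dead at the end of every iteration, so the loop-carried dependences on T arise purely from reusing its memory location, and expansion deletes them.

DO I = 1, N
  T = A(I) + 1        ! T carries no value into the next iteration
  B(I) = T
ENDDO
! After expansion, both statements vectorize:
! T$(1:N) = A(1:N) + 1
! B(1:N) = T$(1:N)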

22. Scalar Expansion: Drawbacks
• Expansion increases memory requirements
• Solutions:
  • Expand in a single loop only
  • Strip mine the loop before expansion (see the sketch after this slide)
  • Use forward substitution instead:
DO I = 1, N
  T = A(I) + A(I+1)
  A(I) = T + B(I)
ENDDO
becomes
DO I = 1, N
  A(I) = A(I) + A(I+1) + B(I)
ENDDO
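A sketch of the strip-mining option applied to the same example (the strip size 64 is my assumption): the expanded temporary needs only one element per iteration of a strip, so T$ can be a fixed 64-element array no matter how large N is.

! Expand T across one strip at a time, then vectorize within the strip.
DO II = 1, N, 64
  IU = MIN(II+63, N)
  T$(1:IU-II+1) = A(II:IU) + A(II+1:IU+1)
  A(II:IU) = T$(1:IU-II+1) + B(II:IU)
ENDDO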

23. Scalar Renaming
DO I = 1, 100
S1  T = A(I) + B(I)
S2  C(I) = T + T
S3  T = D(I) - B(I)
S4  A(I+1) = T * T
ENDDO
• Renaming scalar T:
DO I = 1, 100
S1  T1 = A(I) + B(I)
S2  C(I) = T1 + T1
S3  T2 = D(I) - B(I)
S4  A(I+1) = T2 * T2
ENDDO

24. Scalar Renaming
• After expanding the renamed scalars and reordering statements, this will lead to:
S3 T2$(1:100) = D(1:100) - B(1:100)
S4 A(2:101) = T2$(1:100) * T2$(1:100)
S1 T1$(1:100) = A(1:100) + B(1:100)
S2 C(1:100) = T1$(1:100) + T1$(1:100)
T = T2$(100)
• The reordering is what renaming buys us: S4 writes A(I+1), which S1 reads on the next iteration, so S3 and S4 must be scheduled first; renaming removed the antidependences on T that previously cycled back from S4 to S1

25. Node Splitting
• Sometimes renaming fails:
DO I = 1, N
S1: A(I) = X(I+1) + X(I)
S2: X(I+1) = B(I) + 32
ENDDO
• The recurrence is kept intact by the renaming algorithm: the cycle runs through the array X (an antidependence S1 → S2 and a carried true dependence S2 → S1), not through a scalar

26. Node Splitting
DO I = 1, N
S1: A(I) = X(I+1) + X(I)
S2: X(I+1) = B(I) + 32
ENDDO
• Break the critical antidependence: make a copy of the node from which the antidependence emanates
DO I = 1, N
S1': X$(I) = X(I+1)
S1:  A(I) = X$(I) + X(I)
S2:  X(I+1) = B(I) + 32
ENDDO
• The recurrence is broken
• Vectorized to:
X$(1:N) = X(2:N+1)
X(2:N+1) = B(1:N) + 32
A(1:N) = X$(1:N) + X(1:N)

27. Node Splitting
• Determining a minimal set of critical antidependences is NP-complete, so a perfect job of node splitting is difficult
• Heuristic:
  • Select an antidependence
  • Delete it and test whether the dependence graph becomes acyclic (test sketched below)
  • If it does, apply node splitting to that antidependence
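The heart of the heuristic is the acyclicity test. A self-contained sketch of that test (my own illustration; the adjacency-matrix representation and all names are assumptions) using the standard remove-zero-in-degree-nodes algorithm:

! Sketch: after tentatively deleting a candidate antidependence edge,
! test whether the remaining dependence graph is acyclic.
! ADJ(I,J) is .TRUE. iff there is an edge from statement I to J.
LOGICAL FUNCTION ACYCLIC(ADJ, N)
  INTEGER, INTENT(IN) :: N
  LOGICAL, INTENT(IN) :: ADJ(N,N)
  INTEGER :: INDEG(N), I, J, NREM
  LOGICAL :: LIVE(N)
  INDEG = 0
  DO I = 1, N                   ! compute in-degree of every node
    DO J = 1, N
      IF (ADJ(I,J)) INDEG(J) = INDEG(J) + 1
    END DO
  END DO
  LIVE = .TRUE.
  NREM = 0
  DO                            ! peel off zero in-degree nodes
    DO I = 1, N
      IF (LIVE(I) .AND. INDEG(I) == 0) EXIT
    END DO
    IF (I > N) EXIT             ! none left to remove
    LIVE(I) = .FALSE.
    NREM = NREM + 1
    DO J = 1, N                 ! drop node I's outgoing edges
      IF (ADJ(I,J)) INDEG(J) = INDEG(J) - 1
    END DO
  END DO
  ACYCLIC = (NREM == N)         ! acyclic iff all nodes were removed
END FUNCTION ACYCLIC

If the graph stays cyclic, the deleted edge was not critical on its own, and the heuristic tries another antidependence.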
