
Lecture 15 Loop Transformations


Presentation Transcript


1. Lecture 15: Loop Transformations (Chapter 11.10-11.11). CS243: Loop Optimization and Array Analysis

2. Loop Optimization

• Domain
  • Loops: change the order in which we iterate through loops
• Goals
  • Minimize inner-loop dependences that inhibit software pipelining
  • Minimize loads and stores
  • Parallelism: SIMD/vector today, in general multiprocessor as well
  • Minimize cache misses
  • Minimize register spilling
• Tools
  • Loop interchange
  • Fusion
  • Fission
  • Outer loop unrolling
  • Cache tiling
  • Vectorization
  • An algorithm for putting it all together

3. Loop Interchange

for j = 1 to n
  for i = 1 to n
    A[j][i] = A[j-1][i-1] * b[i]

for i = 1 to n
  for j = 1 to n
    A[j][i] = A[j-1][i-1] * b[i]

• Should I interchange the two loops?
• Stride-1 accesses (i as the inner loop) are better for caches
• But that costs one more load in the inner loop (b[i] is no longer invariant there)
• But it also needs one less register to hold the result across the inner loop
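The tradeoff is easier to see in concrete C. This is a minimal sketch, assuming a global array of size N and double data (the names and the size are illustrative, not from the slides); the loop body is the one on the slide.

#include <stddef.h>

#define N 512
double A[N + 1][N + 1], b[N + 1];

/* i outer, j inner: the inner loop walks down a column of A (a stride of
 * N doubles per iteration), which is bad for the cache, but b[i] is
 * invariant in the inner loop and can be held in a register. */
void order_i_outer(void) {
    for (size_t i = 1; i <= N; i++)
        for (size_t j = 1; j <= N; j++)
            A[j][i] = A[j - 1][i - 1] * b[i];
}

/* j outer, i inner: the inner loop walks along a row of A (stride-1
 * accesses), but b[i] must now be loaded on every inner iteration. */
void order_j_outer(void) {
    for (size_t j = 1; j <= N; j++)
        for (size_t i = 1; i <= N; i++)
            A[j][i] = A[j - 1][i - 1] * b[i];
}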

4. Loop Interchange

(figure: i-j iteration space of the dependence)

for i = 1 to n
  for j = 1 to n
    A[j][i] = A[j+1][i-1] * b[i]

Distance vector is (delta_i, delta_j) = (1, -1); direction vector is (>, <)

• A dependence records that one reference, the write a_w, must happen before another, the read a_r
• To permute loops, permute the direction vectors in the same manner
• A permutation is legal iff all permuted direction vectors are lexicographically positive
• Special case: fully permutable loop nest
  • Either every dependence is carried by a loop outside of the nest, or all of its components are '>' or '='
  • All the loops in the nest can then be arbitrarily permuted
  • (>, >, <): the inner two loops are fully permutable
  • (>=, =, >): all three loops are fully permutable
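A small sketch of this legality test in C, using the slide's encoding of directions (+1 for '>', 0 for '=', -1 for '<'); the function and variable names are mine, not from the lecture.

#include <stdbool.h>
#include <stdio.h>

/* A direction vector is lexicographically positive when its first
 * non-'=' component is '>'. */
#define DEPTH 3

static bool lex_positive(const int dir[DEPTH]) {
    for (int k = 0; k < DEPTH; k++) {
        if (dir[k] > 0) return true;    /* first non-'=' component is '>' */
        if (dir[k] < 0) return false;   /* first non-'=' component is '<' */
    }
    return true;                        /* all '=': loop-independent dependence */
}

/* perm[k] names the original loop that ends up at nesting level k. */
static bool permutation_legal(int ndeps, int deps[][DEPTH],
                              const int perm[DEPTH]) {
    for (int d = 0; d < ndeps; d++) {
        int permuted[DEPTH];
        for (int k = 0; k < DEPTH; k++)
            permuted[k] = deps[d][perm[k]];
        if (!lex_positive(permuted))
            return false;               /* this permutation reverses a dependence */
    }
    return true;
}

int main(void) {
    /* The slide's 2-deep example, direction (>, <), padded with '=' to
     * three components for this sketch. */
    int deps[1][DEPTH] = { { +1, -1, 0 } };
    int original[DEPTH]     = { 0, 1, 2 };
    int interchanged[DEPTH] = { 1, 0, 2 };   /* swap the outer two loops */

    printf("original order legal:     %d\n", permutation_legal(1, deps, original));
    printf("interchanged order legal: %d\n", permutation_legal(1, deps, interchanged));
    return 0;
}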

5. Loop Interchange

(figure: triangular i-j iteration space)

for i = 1 to n
  for j = 1 to i
    …

• How do I interchange this? The interchanged form is

for j = 1 to n
  for i = j to n
    …

• In general ugly, but doable

6. Non-Perfectly Nested Loops

for i = 1 to n
  for j = 1 to n
    S1
  for j = 1 to n
    S2

• Can't always interchange
• Can be expensive when you can

7. Loop Fusion

Before:

for i = 1 to n
  for j = 1 to n
    S1
  for j = 1 to n
    S2

After:

for i = 1 to n
  for j = 1 to n
    S1
    S2

• Fusion moves S2 across 'j' iterations, but not across any 'i' iterations
• To test legality, pretend to fuse and compute the dependences
• Legal as long as there is no direction vector from S2 to S1 with '=' in all the outer loops and '>' in one of the inner loops: (=, =, …, =, >, …)
• Such a vector would imply that S2 now comes before the S1 it must follow
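A minimal C sketch of a legal fusion, with concrete statements standing in for S1 and S2 (the arrays, statements, and size N are illustrative assumptions, not from the slides):

#include <stddef.h>

#define N 100
double a[N + 1][N + 1], b[N + 1][N + 1], c[N + 1][N + 1];

/* Before fusion: two separate j loops inside the same i loop. */
void unfused(void) {
    for (size_t i = 1; i <= N; i++) {
        for (size_t j = 1; j <= N; j++)
            b[i][j] = 2.0 * a[i][j];         /* S1 */
        for (size_t j = 1; j <= N; j++)
            c[i][j] = b[i][j] + 1.0;         /* S2 */
    }
}

/* After fusion: S2 runs in the same j iteration as S1.  Legal here because
 * S2 only reads the b[i][j] written by S1 in the same iteration, so there
 * is no S2-to-S1 direction vector of the form (=, ..., =, >, ...).
 * Bonus: b[i][j] can now stay in a register between S1 and S2. */
void fused(void) {
    for (size_t i = 1; i <= N; i++)
        for (size_t j = 1; j <= N; j++) {
            b[i][j] = 2.0 * a[i][j];         /* S1 */
            c[i][j] = b[i][j] + 1.0;         /* S2 */
        }
}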

8. Loop Fusion

for i = 1 to n
  for j = 1 to n
    a[i][j] = …
  for j = 1 to n
    … = a[i][j+1]

• Legal as long as there is no direction vector from the read to the write with '=' in all the outer loops and '>' in one of the inner loops: (=, =, …, =, >, …)
• Here that vector is (=, >), so we can't fuse

9. Loop Fusion

for i = 1 to n
  for j = 1 to n
    a[i][j] = …
  for j = 1 to n
    … = a[i][j+1]

If the first '+' direction is always a small literal constant, we can skew (shift and peel) the loop and allow fusion. Bonus: we can get rid of a load and maybe a store.

Peel the first write and the last read:

for i = 1 to n
  a[i][1] = …
  for j = 2 to n
    a[i][j] = …
  for j = 1 to n-1
    … = a[i][j+1]
  … = a[i][n+1]

Shift the read loop so both inner loops run over j = 2 to n:

for i = 1 to n
  a[i][1] = …
  for j = 2 to n
    a[i][j] = …
  for j = 2 to n
    … = a[i][j]
  … = a[i][n+1]

Fuse:

for i = 1 to n
  a[i][1] = …
  for j = 2 to n {
    a[i][j] = …
    … = a[i][j]
  }
  … = a[i][n+1]

10. Loop Fission

Before:

for i = 1 to n
  for j = 1 to n
    S1
  for j = 1 to n
    S2

After:

for i = 1 to n
  for j = 1 to n
    S1
for i = 1 to n
  for j = 1 to n
    S2

• Fission moves S2 across all of S1's later 'i' iterations
• Legal as long as there are no dependences from S2 to S1 with '>' in the fissioned outer loops
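A minimal C sketch of a legal outer-loop fission, again with illustrative arrays and statements of my own rather than the slides':

#include <stddef.h>

#define N 100
double a[N + 1][N + 1], b[N + 1][N + 1];

/* Before fission: S1 and S2 share the same i loop. */
void before_fission(void) {
    for (size_t i = 1; i <= N; i++)
        for (size_t j = 1; j <= N; j++) {
            a[i][j] = 2.0 * a[i][j];    /* S1 */
            b[i][j] = a[i - 1][j];      /* S2: reads what S1 wrote on an earlier i */
        }
}

/* After fission: every S1 iteration now runs before any S2 iteration.
 * The only dependence between them goes from S1 to S2 (S2 reads what S1
 * wrote on an earlier i); there is no S2-to-S1 dependence with '>' in the
 * fissioned i loop, so moving S2 after all later S1 iterations is legal. */
void after_fission(void) {
    for (size_t i = 1; i <= N; i++)
        for (size_t j = 1; j <= N; j++)
            a[i][j] = 2.0 * a[i][j];    /* S1 */
    for (size_t i = 1; i <= N; i++)
        for (size_t j = 1; j <= N; j++)
            b[i][j] = a[i - 1][j];      /* S2 */
}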

11. Loop Fission

for i = 1 to n
  for j = 1 to n
    … = a[i-1][j]      (S1)
  for j = 1 to n
    a[i][j] = …        (S2)

Proposed fission:

for i = 1 to n
  for j = 1 to n
    … = a[i-1][j]
for i = 1 to n
  for j = 1 to n
    a[i][j] = …

• Fission moves S2 across all of S1's later 'i' iterations
• Legal as long as there are no dependences from S2 to S1 with '>' in the fissioned outer loops
• Here there is a dependence from the write to the read with distance (1) in the i loop, i.e. direction (>), so this fission is not legal

12. Inner Loop Fission

Before:

for i = 1 to n
  for j = 1 to n
    … = h[i];
    … = h[i+1];
    …
    … = h[i+49];
    … = h[i+50];

After:

for i = 1 to n
  for j = 1 to n
    … = h[i];
    …
    … = h[i+25];
  for j = 1 to n
    … = h[i+26];
    …
    … = h[i+50];

• Legal as long as there is no dependence from an S2 (in the second loop) to an S1 (in the first) where the first '>' is in the 'j' loop

13. Inner Loop Fission

(figure: dependence graph over S1, S2, S3 with edges labeled '=' and '>')

for j = 1 to n
  S1
  S2
  S3

• Look at the edges carried by the innermost loop
• Strongly connected components cannot be fissioned apart
• Everything else can be fissioned, as long as the loops are emitted in topological order

14. Outer Loop Unrolling

for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      c[i][j] += a[i][k] * b[k][j];

• How many loads in the inner loop? How many MACs (multiply-accumulates)?

15. Outer Loop Unrolling

for i = 1 to n by 2
  for j = 1 to n by 2
    for k = 1 to n
      c[i][j]     += a[i][k]   * b[k][j];
      c[i][j+1]   += a[i][k]   * b[k][j+1];
      c[i+1][j]   += a[i+1][k] * b[k][j];
      c[i+1][j+1] += a[i+1][k] * b[k][j+1];

• Is it legal?

16. Outer Loop Unrolling

If n = 2:

for i = 1 to 2 by 2
  for j = 1 to 2 by 2
    for k = 1 to 2
      c[i][j]     += a[i][k]   * b[k][j];
      c[i][j+1]   += a[i][k]   * b[k][j+1];
      c[i+1][j]   += a[i+1][k] * b[k][j];
      c[i+1][j+1] += a[i+1][k] * b[k][j+1];

• Original (i, j, k) order was (1,1,1) (1,1,2) (1,2,1) (1,2,2) (2,1,1) (2,1,2) (2,2,1) (2,2,2)
• New order is (1,1,1) (1,2,1) (2,1,1) (2,2,1) (1,1,2) (1,2,2) (2,1,2) (2,2,2)
• Equivalent to permuting the loops into

for k = 1 to 2
  for i = 1 to 2
    for j = 1 to 2

• If the loops are fully permutable, we can also outer-loop unroll

17. Unrolling Trapezoidal Loops

for i = 1 to n by 2
  for j = 1 to i
    …

(figure: trapezoidal i-j iteration space)

• Ugly
• We can unroll two-level trapezoidal loops, but the details are very ugly

18. Trapezoidal Example

Original loop:

for (i = 0; i < n; i++) {
  for (j = 2*i; j < n-i; j++) {
    a[i][j] += 1;
  }
}

After unrolling i by 2 (generated code):

for (i = 0; i <= n - 2; i = i + 2) {
  lstar = (i * 2) + 2;
  ustar = (n - (i + 1)) - 1;
  if (((i * 2) + 2) < (n - (i + 1))) {
    for (r2d_i = i; r2d_i <= (i + 1); r2d_i = r2d_i + 1) {
      for (j = r2d_i * 2; j <= ((i * 2) + 1); j = j + 1) {
        a[r2d_i][j] = a[r2d_i][j] + 1;
      }
    }
    for (j0 = lstar; ustar >= j0; j0 = j0 + 1) {
      a[i][j0] = a[i][j0] + 1;
      a[i + 1][j0] = a[i + 1][j0] + 1;
    }
    for (r2d_i0 = i; r2d_i0 <= (i + 1); r2d_i0 = r2d_i0 + 1) {
      for (j1 = n - (i + 1); j1 < (n - r2d_i0); j1 = j1 + 1) {
        a[r2d_i0][j1] = a[r2d_i0][j1] + 1;
      }
    }
  } else {
    for (r2d_i1 = i; r2d_i1 <= (i + 1); r2d_i1 = r2d_i1 + 1) {
      for (j2 = r2d_i1 * 2; j2 < (n - r2d_i1); j2 = j2 + 1) {
        a[r2d_i1][j2] = a[r2d_i1][j2] + 1;
      }
    }
  }
}
if (n > i) {
  for (j3 = i * 2; j3 < (n - i); j3 = j3 + 1) {
    a[i][j3] = a[i][j3] + 1;
  }
}

19. Cache Tiling

for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      c[i][j] += a[i][k] * b[k][j];

• How many cache misses?

20. Cache Tiling

for jb = 1 to n by b
  for kb = 1 to n by b
    for i = 1 to n
      for j = jb to jb+b-1
        for k = kb to kb+b-1
          c[i][j] += a[i][k] * b[k][j];

• How many cache misses?
• Order-b reuse for each array
• If the loops are fully permutable, we can cache tile
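A C sketch of the same tiling (0-based, row-major), with clipped bounds so n need not be a multiple of the tile size; the tile size B = 64 is an illustrative assumption that the Phase 2 search would actually pick.

#include <stddef.h>

#define B 64   /* illustrative tile size */

/* a, b, c are n x n row-major arrays: element (i, j) lives at [i * n + j]. */
void matmul_tiled(size_t n, const double *a, const double *b, double *c) {
    for (size_t jb = 0; jb < n; jb += B)
        for (size_t kb = 0; kb < n; kb += B) {
            size_t jmax = (jb + B < n) ? jb + B : n;   /* clip the last tile */
            size_t kmax = (kb + B < n) ? kb + B : n;
            for (size_t i = 0; i < n; i++)
                for (size_t j = jb; j < jmax; j++)
                    for (size_t k = kb; k < kmax; k++)
                        c[i * n + j] += a[i * n + k] * b[k * n + j];
        }
}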

21. Vectorization: SIMD

Before:

for i = 1 to n
  for j = 1 to n
    a[j][i] = 0;

After:

for i = 1 to n by 8
  for j = 1 to n
    a[j][i:i+7] = 0;

• N-way parallel, where N is the SIMD width of the machine
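A C sketch of the strip-mined form, assuming VEC = 8 as the SIMD width and a flat row-major array; a compiler (or hand-written intrinsics) would turn the fixed-width v loop into a single vector store. The remainder loop, which the slide omits, handles n not divisible by VEC.

#include <stddef.h>

#define VEC 8   /* stands in for the SIMD width of the machine */

/* a is an n x n row-major array, so a[j][i .. i+VEC-1] is contiguous and
 * the accesses in the v loop are stride-1, as vectorization requires. */
void zero_array(size_t n, float *a) {
    for (size_t i = 0; i + VEC <= n; i += VEC)        /* strip-mined i loop */
        for (size_t j = 0; j < n; j++)
            for (size_t v = 0; v < VEC; v++)          /* becomes one vector store */
                a[j * n + i + v] = 0.0f;
    for (size_t i = n - n % VEC; i < n; i++)          /* remainder columns */
        for (size_t j = 0; j < n; j++)
            a[j * n + i] = 0.0f;
}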

22. Vectorization: SIMD

for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      S1
      …
      SM

• Vectorization moves later iterations of S1 ahead of earlier iterations of S2, …, SM
• Legal as long as there is no dependence from a later S to an earlier S that is carried by the vector loop
• E.g., it is legal to vectorize 'j' above if there is no dependence from a later S to an earlier S with direction (=, >, *)

23. Putting It All Together

• Three-phase algorithm
  1. Use fission and fusion to build perfectly nested loops
     • We prefer fusion, though it is not obvious that this is always the right choice
  2. Enumerate possibilities for unrolling, interchanging, cache tiling, and vectorizing
  3. Use inner loop fission if necessary to minimize register pressure

24. Phase 2

• Choose a loop to vectorize
  • All references that refer to the vector loop must be stride-1
• For each possible inner loop
  • Compute the best possible unrollings for each outer loop
  • Compute the best possible ordering and tiling
• To compute the best possible unrolling
  • Try all combinations of unrolling up to a maximum product of 16
  • For each possible unrolling
    • Estimate the machine cycles for the inner loop (ignoring the cache)
    • Estimate the register pressure
    • Don't unroll more if there is too much register pressure
• To compute the best possible ordering and tiling
  • Consider only loops with “reuse”
  • Choose the best three
  • Iterate over all orderings of the three, with a binary search on the cache tile size
• Note and record the total cycle time for this configuration; pick the best
• Estimating cycles: we could compile every combination, but …

25. Machine Modeling

• Recall that software pipelining had resource limits and latency limits
• Map the high-level IR to machine resources
  • Unroll the high-level IR operations
  • Remove duplicate loads and stores
  • Count machine resources
• Build a latency graph of the unrolled operations
  • Iterate over inner-loop cycles and find the worst cycle
• Assume performance is the worse of the two limits
• Model register pressure
  • Count loop-invariant loads and stores
  • Count address streams
  • Count cross-iteration CSEs, e.g. = a[i] + a[i-2]
  • Add a machine-dependent constant
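A toy version of that cycle estimate in C. The structure fields, resource counts, and the way the two limits are combined are illustrative assumptions in the spirit of the slide, not the lecture's actual model.

#include <stddef.h>

/* As in software pipelining, the inner loop can run no faster than its
 * worst resource limit or its worst dependence cycle, so the estimate
 * is the larger of the two. */
struct loop_model {
    int loads, stores, flops;        /* ops per unrolled iteration, after CSE */
    int mem_ports, fp_units;         /* machine resources available per cycle */
    double worst_recurrence_cycles;  /* longest latency cycle / its distance  */
};

static int ceil_div(int a, int b) { return (a + b - 1) / b; }

double estimated_cycles_per_iteration(const struct loop_model *m) {
    int mem_limit = ceil_div(m->loads + m->stores, m->mem_ports);
    int fp_limit  = ceil_div(m->flops, m->fp_units);
    int resource_limit = mem_limit > fp_limit ? mem_limit : fp_limit;
    return resource_limit > m->worst_recurrence_cycles
               ? (double)resource_limit
               : m->worst_recurrence_cycles;
}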

26. Cache Modeling

• Given a loop ordering and a set of tile factors
  • Combine array references that differ by a constant, e.g. a[i][j] and a[i+1][j+1]
  • Estimate the capacity needed by all array references, multiply by a fudge factor for interference, and stop increasing block sizes if the capacity is larger than the cache
  • Estimate the quantity of data that must be brought into the cache
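A rough sketch of the capacity check in C; the structure, the per-group footprint numbers, and the fudge factor are all illustrative assumptions rather than the model described in lecture.

#include <stdbool.h>
#include <stddef.h>

/* One entry per group of references that differ only by a constant offset
 * (e.g. a[i][j] and a[i+1][j+1] count once). */
struct ref_group {
    size_t element_size;      /* bytes per element                          */
    size_t footprint_elems;   /* distinct elements touched per tile, roughly
                                 the product of the relevant tile sizes     */
};

/* Returns true if the estimated footprint still fits in the cache; the
 * caller stops increasing block sizes once it does not. */
bool tiles_fit_in_cache(const struct ref_group *groups, size_t ngroups,
                        size_t cache_bytes, double interference_fudge) {
    double total = 0.0;
    for (size_t g = 0; g < ngroups; g++)
        total += (double)groups[g].element_size * groups[g].footprint_elems;
    return total * interference_fudge <= (double)cache_bytes;
}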

27. Phase 3: Inner Loop Fission

• Does the inner loop use too many registers?
  • Break it into SCCs
  • Pick the biggest SCC
  • Does it use too many registers?
    • If yes, too bad
    • If no, search for other SCCs to merge in
      • Pick the one with the most commonality
      • Keep merging while there are enough registers

28. Extra: Reductions

for i = 1 to n
  for j = 1 to n
    a[j] += b[i][j];

Can I unroll this into

for i = 1 to n by 2
  for j = 1 to n
    a[j] += b[i][j];
    a[j] += b[i+1][j];

• Is it legal?
  • Integer: yes
  • Floating point: maybe

29. Extra: Outer Loop Invariants

for i
  for j
    a[i][j] += b[i] * cos(c[j])

Can be replaced with

for j
  t[j] = cos(c[j])
for i
  for j
    a[i][j] += b[i] * t[j];

• Need to integrate this with the model
• The model must assume that the invariant computation will be replaced with loads
