
Using the Iteration Space Visualizer in Loop Parallelization



  1. Using the Iteration Space Visualizer in Loop Parallelization Yijun YU http://winpar.elis.rug.ac.be/ppt/isv

  2. Overview ISV – a 3D Iteration Space Visualizer: view the dependences in the iteration space (an iteration is one instance of the loop body; the space is the grid of all index values) • Detect the parallelism • Estimate the speedup • Derive a loop transformation • Find statement-level parallelism • Future development

  3. 1. Dependence
     Sequential loop:                 Parallelized loop:
       DO I = 1,3                       DOALL I = 1,3
         A(I) = A(I-1)                    A(I) = A(I-1)
       ENDDO                            ENDDO
     Shared-memory execution traces:
       sequential: A(1) = A(0); A(2) = A(1); A(3) = A(2)
       parallel:   A(2) = A(1); A(1) = A(0); A(3) = A(2)
     [figure: the shared-memory contents of A under each trace]
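
     A minimal Python sketch (a stand-in for the Fortran loop above; the initial value 10 in A(0) is made up) of why the DOALL version is unsafe: any schedule that runs iteration I before iteration I-1 reads a stale A(I-1), so the result depends on the execution order.

       from itertools import permutations

       def run(order):
           A = [10, 0, 0, 0]          # A(0..3); A(0) holds the initial value
           for i in order:            # each i executes the body A(I) = A(I-1)
               A[i] = A[i - 1]
           return A

       sequential = run([1, 2, 3])    # DO order -> [10, 10, 10, 10]
       for order in permutations([1, 2, 3]):
           if run(order) != sequential:
               print(order, "->", run(order), "differs from", sequential)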

  4. 1.1 Example 1 • ISV directive: visualize

  5. 1.2 Visualize the Dependence • A dependence is visualized in an iteration space dependence graph • Node: an iteration • Edge: the dependence order between iterations • Color: the dependence type – FLOW: write → read; ANTI: read → write; OUTPUT: write → write

  6. 1.3 Parallelism? • Stepwise view of the sequential execution • No parallelism found • However, many programs have parallelism…

  7. 2. Potential Parallelism • Time(sequential) = number of iterations • Dataflow: an iteration executes as soon as its data are ready; Time(dataflow) = number of iterations on the longest critical path • The potential parallelism is expressed as speedup = Time(sequential) / Time(dataflow)
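
     A sketch of how the dataflow time can be computed (illustrative, not ISV's actual implementation): it is the length of the longest path in the dependence graph, found by levelling each iteration one step above its deepest predecessor.

       def dataflow_time(iterations, deps):
           # deps[i] lists the iterations i depends on; `iterations` must be
           # in a valid sequential order so predecessors are levelled first
           level = {}
           for it in iterations:
               level[it] = 1 + max((level[d] for d in deps.get(it, ())), default=0)
           return max(level.values())

       iters = [1, 2, 3]          # the loop of slide 3: one dependence chain
       deps = {2: [1], 3: [2]}    # A(2) needs A(1); A(3) needs A(2)
       print(len(iters) / dataflow_time(iters, deps))   # speedup = 3/3 = 1.0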

  8. 2.1 Example 2

  9. Diophantine Equations + Loop bounds (polytope) = Iteration Space Dependencies
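
     A hypothetical instance of this recipe (the access functions 2*I and I+3 are made up for illustration): if one statement writes A(2*I1) and another reads A(I2+3), a dependence exists exactly where the Diophantine equation 2*i1 = i2 + 3 has a solution inside the loop bounds (the polytope).

       N = 10                               # assumed loop bound
       deps = [(i1, i2)
               for i1 in range(1, N + 1)
               for i2 in range(1, N + 1)
               if 2 * i1 == i2 + 3]         # the dependence equation
       print(deps)    # [(2, 1), (3, 3), (4, 5), (5, 7), (6, 9)]

     The distances i2 - i1 here vary (-1, 0, 1, 2, 3): the non-uniform pattern of the next slide.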

  10. 2.2 Irregular dependence • Dependences have non-uniform distances • Parallelism analysis: 200 iterations over 15 dataflow steps • Speedup: 13.3 • Problem: how to exploit it?

  11. 3. Visualize parallelism. Find answers to these questions: • What is the dependence pattern? • Is there a parallel loop? (How to find it?) • What is the maximal parallelism? (How to exploit it?) • Is the load of the parallel tasks balanced?

  12. 3.1 Example 3

  13. 3.2 3D Space

  14. 3.3 Loop parallelizable? • The I, J, K loops span a 3D space: 32 iterations • Simulate the sequential execution • Which loop can be parallel?

  15. 3.4 Loop parallelization • Interactively try a parallelization: check loop I as parallel • The blinking dependence edges mark the dependences that prevent parallelizing loop I.

  16. 3.5 Parallel execution • Let ISV find the correct parallelization: automatically check for a parallel loop • Simulate the parallel execution: it takes 16 time steps

  17. 3.6 Dataflow execution • Simulate the dataflow execution • Sequential execution takes 32 time steps; dataflow execution takes only 4 • Potential speedup = 8.

  18. 3.7 Graph partitioning • Dataflow speedup = 8 • Iterate through the partitions: the connected components of the dependence graph • All the partitions are load balanced (see the sketch below)
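
     A sketch of this partitioning step, assuming the dependence graph arrives as plain node and edge lists: connected components share no dependences, so each one can be scheduled as an independent parallel task.

       def components(nodes, edges):
           parent = {n: n for n in nodes}
           def find(x):                        # union-find with path halving
               while parent[x] != x:
                   parent[x] = parent[parent[x]]
                   x = parent[x]
               return x
           for a, b in edges:
               parent[find(a)] = find(b)
           groups = {}
           for n in nodes:
               groups.setdefault(find(n), []).append(n)
           return list(groups.values())

       # toy dependence graph: two independent chains of equal length
       print(components([1, 2, 3, 4], [(1, 2), (3, 4)]))   # [[1, 2], [3, 4]]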

  19. 4. Loop Transformation • Potential parallelism → Transformation → Real parallelism

  20. 4.1 Example 4

  21. 4.2 The iteration space • Sequential execution: 25 iterations

  22. 4.3 Loop parallelizable? • Check loop I • Check loop J

  23. 4.4 Dataflow execution • 9 steps in total • Potential speedup: 25/9 = 2.78 • Wavefront effect: all iterations in the same wave lie on the same line (see the sketch below)
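
     A sketch of the wavefront, assuming the uniform distances (1,0) and (0,1) behind this example: iteration (i,j) can run at dataflow step i+j-1, so each wave is one anti-diagonal of the 5×5 space.

       N = 5
       waves = {}
       for i in range(1, N + 1):
           for j in range(1, N + 1):
               waves.setdefault(i + j - 1, []).append((i, j))
       print(len(waves))            # 2N-1 = 9 dataflow steps
       print(N * N / len(waves))    # speedup 25/9 = 2.78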

  24. 4.5 Zoom-in on the I-space

  25. 4.6 Speedup vs. program size • Zoom-in previews the parallelism in part of a loop without modifying the program • Executing the program at different sizes n estimates a speedup of n²/(2n-1)

  26. 4.7 How to obtain the potential parallelism • We already have these metrics: sequential time steps = N²; dataflow time steps = 2N-1; potential speedup = N²/(2N-1) • How to actually obtain this potential speedup of the loop? Transformation.

  27. 4.8 Unimodular transformation (UT) • A unimodular matrix is a square integer matrix with determinant ±1; it can be derived from the identity matrix by three kinds of elementary transformations: reversal, interchange, and skewing • The new loop execution order is determined by the transformed index (new loop index = T × old loop index); the iteration space keeps its unit step size • A suitable UT reorders the iterations such that the new loop nest has a parallel loop
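
     A small numpy sketch for a 2-deep nest: the three elementary matrices, a check that their product keeps |det| = 1, and the index mapping new = T × old. The composed T below is one wavefront-style choice (new outer index = I+J), not the only valid one.

       import numpy as np

       reversal    = np.array([[-1, 0], [0, 1]])   # negate one index
       interchange = np.array([[0, 1], [1, 0]])    # swap the two loops
       skewing     = np.array([[1, 0], [1, 1]])    # add outer index to inner

       T = interchange @ skewing                   # T = [[1, 1], [1, 0]]
       assert round(abs(np.linalg.det(T))) == 1    # still unimodular
       print(T @ np.array([2, 3]))                 # old (I,J)=(2,3) -> [5 2]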

  28. 4.9 Hyperplane transformation • Interactively define a hyperplane • Observe that the plane iteration matches the dataflow simulation (plane = dataflow) • Based on the plane, ISV calculates a unimodular transformation

  29. 4.10 The derived UT The transformed iteration space and the generated loop

  30. 4.11 Verify the UT • ISV checks whether the transformation is valid • Observe that the parallel loop execution of the transformed loop matches the plane execution (parallel = plane)

  31. 5. Statement-level parallelism • Unimodular transformations work at the iteration level • The statement dependences within the loop body are hidden in the iteration space graph • How to exploit parallelism at the statement level? Map statements to iterations.

  32. 5.1 Example 5 SSV: statement space visualization

  33. 5.2 Iteration-level parallelism • The iteration space is 2D • There are N² = 16 iterations • The dataflow execution has 2N-1 = 7 time steps • The potential speedup is 16/7 = 2.29

  34. 5.3 Parallelism in statements • The (statement) iteration space is 3D • There are 2N² = 32 statement instances • The dataflow execution still has 2N-1 = 7 time steps • The potential speedup is 32/7 = 4.57

  35. 5.4 Comparison • Statement-level analysis doubles the potential speedup found at the iteration level

  36. 5.5 Define the partition planes • partitions • hyper-planes

  37. What is validity? Show the execution order on top of the dependence arrows (for one plane or all planes together, depending on the density of the slide).

  38. 5.6 Invalid UT • An invalid unimodular transformation derived from the hyperplane is refused by ISV • Alternatively, ISV calculates the unimodular transformation from the dependence distance vectors available in the dependence graph

  39. 6. Pseudo distance method • Extract base vectors from the dependent iterations • Examine whether the base vectors generate all the distances • Calculate the unimodular matrix from the base vectors

  40. Another way to find parallelism automatically • The iteration space is a grid; non-uniform dependences are members of a uniform dependence grid with unknown base vectors • Finding these base vectors allows us to extend existing parallelization to the non-uniform case.

  41. 6.1 Dependence distance • (1,0,-1) • (0,1,1)
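
     A sketch of the "do the base vectors generate all distances?" test from slide 39, using the two vectors above. The least-squares-plus-integrality check below is a stand-in for an exact integer-lattice test; it suffices here because the base vectors are linearly independent.

       import numpy as np

       base = np.array([[1, 0, -1],
                        [0, 1,  1]])            # base vectors as rows

       def generated_by_base(d):
           # solve x @ base = d, then require an integer, exact fit
           x, *_ = np.linalg.lstsq(base.T, np.asarray(d, float), rcond=None)
           return bool(np.allclose(x, np.round(x))
                       and np.allclose(np.round(x) @ base, d))

       print(generated_by_base([1, 1, 0]))   # True: (1,0,-1) + (0,1,1)
       print(generated_by_base([0, 0, 1]))   # False: off the dependence grid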

  42. 6.2 The Transformation • The transforming matrix discovered by the pseudo distance method:
         1  1  0
        -1  0  1
         1  0  0
     • The distance vectors are transformed: (1,0,-1) → (0,1,0) and (0,1,1) → (0,0,1) • The dependent iterations have the same first index, which implies the outermost loop is parallel.
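
     A quick numpy check of these numbers, assuming the matrix applies to distances as row vectors (d' = d · T); this reproduces the two mappings on the slide.

       import numpy as np

       T = np.array([[ 1, 1, 0],
                     [-1, 0, 1],
                     [ 1, 0, 0]])
       for d in ([1, 0, -1], [0, 1, 1]):
           print(d, "->", np.array(d) @ T)   # (0,1,0) and (0,0,1)
       # first component 0: no dependence crosses outermost iterations,
       # so the outermost transformed loop is parallel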

  43. 6.3 Compare the UT matrices • The transforming matrix discovered by the pseudo distance method:
         1  1  0
        -1  0  1
         1  0  0
     • An invalid transforming matrix discovered by the hyperplane method:
         1  0  0
        -1  1  0
         1  0  1
     • The same first column means the transformed outermost loops have the same index.

  44. 6.4 The transformed space • The outermost loop is parallel • There are 8 parallel tasks • The load of tasks is not balanced • The longest task takes 7 time steps

  45. 7. Non-perfectly nested loops • What are they? • Unimodular transformations only work for perfectly nested loops • For a non-perfectly nested loop, the iteration space is constructed with extended indices • An N-fold non-perfectly nested loop becomes an (N+1)-fold perfectly nested loop

  46. 7.1 Perfectly nested loop?
     Non-perfectly nested loop:
       DO I1 = 1,3
         A(I1) = A(I1-1)
         DO I2 = 1,4
           B(I1,I2) = B(I1-1,I2) + B(I1,I2-1)
         ENDDO
       ENDDO
     Perfectly nested loop:
       DO I1 = 1,3
         DO I2 = 1,5
           DO I3 = 0,1
             IF (I2.EQ.1 .AND. I3.EQ.0) THEN
               A(I1) = A(I1-1)
             ELSE IF (I3.EQ.1) THEN
               B(I1-1,I2) = B(I1-2,I2) + B(I1-1,I2-1)
             ENDIF
           ENDDO
         ENDDO
       ENDDO

  47. 7.2 Exploit parallelism with UT

  48. 8. Applications
     Program                Category   Depth  Form         Pattern      Transformation
     Example 1              Tutorial   1      Perfect      Uniform      N/A
     Example 2              Tutorial   2      Perfect      Non-uniform  N/A
     Example 3              Tutorial   3      Perfect      Uniform      Wavefront UT
     Example 4              Tutorial   2      Perfect      Uniform      Wavefront UT
     Example 5              Tutorial   2+1    Perfect      Uniform      Stmt Partitioning UT
     Example 6              Tutorial   2+1    Non-perfect  Uniform      Wavefront UT
     Matrix multiplication  Algorithm  3      Perfect      Uniform      Parallelization
     Gauss-Jordan           Algorithm  3      Perfect      Non-uniform  Parallelization
     FFT                    Algorithm  3      Perfect      Non-uniform  Parallelization
     Cholesky               Benchmark  4      Non-perfect  Non-uniform  Partitioning UT
     TOMCATV                Benchmark  3      Non-perfect  Uniform      Parallelization
     Flow3D                 CFD App.   3      Perfect      Uniform      Wavefront UT

  49. 9. Future considerations • Weighted dependence graph • More semantics on data locality: data space graph, data communication graph, data-reuse iteration space graph • More loop transformations: affine (statement) iteration space mappings, automatic statement distribution, integration with the Omega library
