This paper presents an approach to the automatic parallelization of stencil computations, which sweep through large data sets over multiple time iterations. Tiling is essential for data locality, but in skewed iteration spaces the dependences between tiles force pipelined execution and cause load imbalance. Overlapped tiling and split tiling adjust the tile shape to re-enable concurrent execution while respecting data dependences. Both are compared with two pipelined schedules on a cluster of dual-processor Opteron nodes, using 1 to 32 processors.
Effective Automatic Parallelization of Stencil Computations* Sriram Krishnamoorthy1, Muthu Baskaran1, Uday Bondhugula1, Atanas Rountev1, J. Ramanujam2, P. Sadayappan1 1The Ohio State University 2Louisiana State University * Work supported by NSF
Introduction • Stencil computations • Sweep through large data set • Multiple time iterations • Simple load balanced schedule • Tiling – essential to improve data locality • Dependences between tiles • Pipelined execution • Skewed iteration spaces – load imbalance • Solution: Adjust tiling – re-enable concurrent execution
Motivation
FOR t = 0 TO T-1
  FOR i = 1 TO N-1
    A[t,i] = (A[t,i-1] + A[t,i] + A[t,i+1]) / 3
[Figure: iteration space with axes t and i]
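The loop nest above can be sketched in Python as follows. This is a minimal sketch, assuming a Jacobi-style reading in which every value at time step t is computed from the values of the previous time step, and assuming the two end cells stay fixed; the name stencil_reference is hypothetical and not from the slides. Under this reading each point (t, i) depends on (t-1, i-1), (t-1, i) and (t-1, i+1).

```python
def stencil_reference(a0, steps):
    """Apply a 3-point averaging stencil `steps` times (assumed Jacobi-style).

    The first and last cells are treated as fixed boundary values.
    """
    cur = list(a0)
    n = len(cur)
    for _ in range(steps):
        nxt = cur[:]  # boundary cells are carried over unchanged
        for i in range(1, n - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3
        cur = nxt
    return cur


if __name__ == "__main__":
    print(stencil_reference([0.0, 3.0, 0.0], 1))
```

Within one time step every i is independent, which is the simple load-balanced schedule; the difficulty only appears once several time steps are fused into one tile for locality.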
Notation • Iteration space B: n-dim polyhedron • Dependences D: n-dim vectors • Hyperplanes H: • n-dim normal vectors • Tile bounded by pairs of hyperplanes
Approach • Concurrent start in non-tiled iteration space • Identify hyperplanes inhibiting concurrent start in tiled space • Replace one face for each inhibiting pair • Overlapped Tiling – Replace “back-face” • Split Tiling – Replace “front-face”
Concurrent Start: Before Tiling Condition: A boundary that does not carry any dependence
Inter-tile Dependences • Shift vectors • Tile traversal order • Normal to all other hyperplanes • Hyperplane carries dependence • A dependence “pokes” through • Inter-tile dependence vector • Shift vector • Corresponding hyperplane carries dependence
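The idea of a hyperplane carrying a dependence can be sketched with dot products. This is an illustrative sketch, assuming the common reading that a hyperplane with normal h carries dependence d when h.d > 0 (the dependence "pokes" through it), and that a tiling is legal when no dependence points backward across any hyperplane (h.d >= 0). The dependence vectors and hyperplane normals below are assumptions for a Jacobi-style 3-point stencil in (t, i) coordinates, and the function names are hypothetical.

```python
def dot(h, d):
    """Dot product of a hyperplane normal and a dependence vector."""
    return sum(x * y for x, y in zip(h, d))


def carried_by(h, deps):
    """Dependences that cross the hyperplane with normal h (h.d > 0)."""
    return [d for d in deps if dot(h, d) > 0]


def is_legal_tiling(hyperplanes, deps):
    """True if no dependence points backward across any tiling hyperplane."""
    return all(dot(h, d) >= 0 for h in hyperplanes for d in deps)


# Assumed (t, i) dependence vectors of a Jacobi-style 3-point stencil.
DEPS = [(1, -1), (1, 0), (1, 1)]
# Assumed tiling hyperplane normals: time, and space skewed by time.
HYPERPLANES = [(1, 0), (1, 1)]

if __name__ == "__main__":
    for h in HYPERPLANES:
        print(h, "carries", carried_by(h, DEPS))
    print("skewed tiling legal:", is_legal_tiling(HYPERPLANES, DEPS))
    print("rectangular tiling legal:", is_legal_tiling([(1, 0), (0, 1)], DEPS))
```

With these assumed vectors the unskewed space hyperplane (0, 1) is crossed backward by (1, -1), so rectangular time tiles are not legal, while the skewed hyperplane (1, 1) is legal but carries two of the three dependences. That carried dependence between neighboring tiles is what forces pipelined execution.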
Concurrent Start Inhibition • Concurrent start in original iteration space along a boundary • But that boundary carries an inter-tile dependence [Equation labels: a boundary has concurrent start; S_j is an inter-tile dependence; that boundary carries the inter-tile dependence]
Companion Hyperplane • Hyperplane that destroys the inter-tile dependence • Swivel a hyperplane “backward” • Dependences carried by original hyperplane are “neutralized” • Incoming dependences become non-incoming • Outgoing dependences become non-outgoing
Overlapped Tiling • Replace “back face” with companion hyperplane • Additional region is shared with preceding tile • Region of preceding tile that caused the dependence • Each new tile independent of preceding tile (“do-all” parallelism) • Increased computation cost and communication volume
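Overlapped tiling can be sketched for a 1-D, 3-point stencil as follows. This is a simplified illustration, not the authors' generated code: it assumes a Jacobi-style stencil with fixed end cells, and the names untiled, overlapped_tiled, tile_w and tile_t are hypothetical. Each tile copies its own cells plus tile_t extra cells on each side, recomputes the shared region privately, and therefore never waits for a neighbor.

```python
def untiled(a0, steps):
    """Reference: plain time-step-by-time-step sweep."""
    cur = list(a0)
    for _ in range(steps):
        nxt = cur[:]
        for i in range(1, len(cur) - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3
        cur = nxt
    return cur


def overlapped_tiled(a0, steps, tile_w, tile_t):
    """Time-tiled sweep in which every tile recomputes its overlap region."""
    cur = list(a0)
    n = len(cur)
    done = 0
    while done < steps:
        tt = min(tile_t, steps - done)  # height of this time block
        nxt = cur[:]
        # The tiles of one time block are mutually independent ("do-all").
        for lo in range(1, n - 1, tile_w):
            hi = min(lo + tile_w, n - 1)
            left = max(0, lo - tt)
            right = min(n, hi + tt)
            buf = cur[left:right]  # private copy, including the overlap
            for s in range(1, tt + 1):
                # The computed range shrinks by one cell per step on each
                # side that is not a true domain boundary (a trapezoid).
                a = max(1, lo - (tt - s))
                b = min(n - 1, hi + (tt - s))
                new = buf[:]
                for i in range(a, b):
                    j = i - left
                    new[j] = (buf[j - 1] + buf[j] + buf[j + 1]) / 3
                buf = new
            nxt[lo:hi] = buf[lo - left:hi - left]  # keep only owned cells
        cur = nxt
        done += tt
    return cur


if __name__ == "__main__":
    data = [float((i * i) % 11) for i in range(23)]
    print(overlapped_tiled(data, 7, 4, 3) == untiled(data, 7))
```

The cells between lo - tt and lo (and between hi and hi + tt) are computed by two tiles, which is the increased computation cost; the larger input copy per tile corresponds to the increased communication volume.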
Split Tiling • Replace “front face” with companion hyperplane • Tile split into independent and dependent regions • Execute independent region followed by dependent region • Increased #communications
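Split tiling can be sketched for the same kind of 1-D, 3-point stencil. This is a simplified illustration under stated assumptions (Jacobi-style stencil, fixed end cells, tile width at least twice the time-tile height), not the authors' generated code; the names untiled, split_tiled, tile_w and tile_t are hypothetical. Phase 1 computes the part of each tile that needs no neighbor data, and phase 2 fills the regions around the tile junctions that depend on phase 1 results from both sides.

```python
def untiled(a0, steps):
    """Reference: plain time-step-by-time-step sweep."""
    cur = list(a0)
    for _ in range(steps):
        nxt = cur[:]
        for i in range(1, len(cur) - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3
        cur = nxt
    return cur


def split_tiled(a0, steps, tile_w, tile_t):
    """Time-tiled sweep with independent and dependent regions per tile."""
    assert tile_w >= 2 * tile_t, "tiles must be wide enough to split"
    cur = list(a0)
    n = len(cur)
    starts = list(range(1, n - 1, tile_w))
    done = 0
    while done < steps:
        tt = min(tile_t, steps - done)
        # lv[s][i] holds cell i after s steps of the current time block.
        lv = [cur[:] for _ in range(tt + 1)]

        def update(s, a, b):
            for i in range(a, b):
                lv[s][i] = (lv[s - 1][i - 1] + lv[s - 1][i]
                            + lv[s - 1][i + 1]) / 3

        # Phase 1: independent regions; all tiles can run concurrently.
        for lo in starts:
            hi = min(lo + tile_w, n - 1)
            for s in range(1, tt + 1):
                a = lo if lo == 1 else lo + s      # no shrink at domain edge
                b = hi if hi == n - 1 else hi - s
                update(s, a, b)

        # Phase 2: dependent regions around each junction between tiles;
        # these are again independent of one another.
        for x in starts[1:]:
            for s in range(1, tt + 1):
                update(s, x - s, min(x + s, n - 1))

        cur = lv[tt]
        done += tt
    return cur


if __name__ == "__main__":
    data = [float((i * i) % 11) for i in range(23)]
    print(split_tiled(data, 7, 6, 3) == untiled(data, 7))
```

No cell is computed twice here, but each time block needs two rounds of data exchange (before phase 1 and before phase 2), which matches the increased number of communications noted on the slide.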
Experimental Evaluation • Cluster • 2.8 GHz dual-processor Opteron 254 • 1MB L2 cache; 4GB RAM • Linux 2.6.9; Intel compiler (icc) -O3 • Comparison • Two pipelined schedules – along space and time • 1000 time steps • 1 – 32 processors
Pipelined Execution: Parameters 64000 elements; 32 processors Space tile size : 1000 Time tile size : 16
Weak Scaling • Problem size = #procs * 20000 • Horizontal line – Linear Scaling
Conclusion • Time tiling stencils – crucial for data locality • Might inhibit concurrent execution • Presented: Two approaches to enabling concurrent execution • Ongoing work: Modeling relative benefits of the two approaches