Code Transformations to Improve Memory Parallelism

Vijay S. Pai and Sarita Adve, MICRO-32, 1999

Motivation and Solutions
• The memory system is the bottleneck in ILP-based systems
  • Solution: overlap multiple read misses (the dominant source of memory stalls) within the same instruction window, while preserving cache locality
• A single instruction window lacks enough independent load misses
  • Solution: read miss clustering enabled by code transformations, e.g. unroll-and-jam
• The code transformation should be automated
  • Solution: map the memory parallelism problem to floating-point pipelining (D. Callahan et al. Estimating Interlock and Improving Balance for Pipelined Machines. Journal of Parallel and Distributed Computing, Aug. 1988)

Unroll-and-jam
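A minimal sketch of the transformation in C may help here. The function names, the flat row-major array layout, and the unroll degree of 4 are assumptions made for illustration and are not taken from the paper. In the baseline nest the inner loop walks along one row, so each miss is followed by a run of hits to the same cache line and successive misses are unlikely to share an instruction window. Unrolling the outer loop and jamming the copies into the inner loop puts references to several different rows side by side, so their misses can be outstanding together.

```c
#include <assert.h>
#include <stddef.h>

#define UNROLL 4  /* illustrative unroll degree (an assumption of this sketch) */

/* Baseline nest: the inner loop walks along one row of a row-major matrix.
 * Each cache line contributes one miss followed by several hits, so misses
 * are spread out in the instruction stream. */
double sum_baseline(const double *a, size_t rows, size_t cols) {
    double s = 0.0;
    for (size_t j = 0; j < rows; j++)
        for (size_t i = 0; i < cols; i++)
            s += a[j * cols + i];
    return s;
}

/* Unroll-and-jam: the outer loop is unrolled by UNROLL and the resulting
 * copies of the inner loop are fused (jammed) into one body. Each inner
 * iteration now touches UNROLL different rows, i.e. UNROLL different cache
 * lines, so the corresponding misses sit next to each other. */
double sum_unroll_and_jam(const double *a, size_t rows, size_t cols) {
    /* A compiler would emit a cleanup loop; this sketch requires an exact fit. */
    assert(rows % UNROLL == 0);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t j = 0; j < rows; j += UNROLL) {
        const double *r0 = a + (j + 0) * cols;
        const double *r1 = a + (j + 1) * cols;
        const double *r2 = a + (j + 2) * cols;
        const double *r3 = a + (j + 3) * cols;
        for (size_t i = 0; i < cols; i++) {
            s0 += r0[i];
            s1 += r1[i];
            s2 += r2[i];
            s3 += r3[i];
        }
    }
    /* Separate accumulators reassociate the sum; with floating point this
     * can change rounding in general. */
    return (s0 + s1) + (s2 + s3);
}
```

Each inner iteration still reads consecutive elements within every row, so the spatial locality of the original nest is kept while the misses are clustered.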

Apply code transformations in a compiler
• Automatic unroll-and-jam transformation
• Locality analysis to determine leading references (M. E. Wolf and M. S. Lam. A Data Locality Optimizing Algorithm. PLDI 1991)
• Analysis of dependences that limit memory parallelism
  • Cache-line dependences
  • Address dependences
  • Window constraints
• Experimental methodology
  • Environment: Rice Simulator for ILP Multiprocessors (RSIM)
  • Workload: Latbench and five scientific applications
  • Miss clustering incorporated by hand
• Results
  • 9-39% reduction in multiprocessor execution time
  • 11-48% reduction in uniprocessor execution time
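To make the notion of an address dependence concrete, here is a small C sketch. The index-array representation of the chains and the function names are hypothetical, chosen only for illustration. When the address of each load is the result of the previous load, the misses along one chain are serialized no matter how large the instruction window is. Interleaving two independent chains in one loop body gives the processor two misses per iteration that do not depend on each other.

```c
#include <assert.h>
#include <stddef.h>

/* Follow a single chain stored as an index array: the address of each load
 * is the value returned by the previous load, so the loads (and any misses
 * they cause) cannot overlap. */
size_t chase_one(const size_t *next, size_t start, size_t steps) {
    size_t p = start;
    for (size_t k = 0; k < steps; k++)
        p = next[p];
    return p;
}

/* Follow two independent chains in the same loop body. The load for q does
 * not depend on the load for p, so the two can be outstanding together. */
void chase_two(const size_t *next, size_t start0, size_t start1,
               size_t steps, size_t *end0, size_t *end1) {
    assert(end0 != NULL && end1 != NULL);
    size_t p = start0, q = start1;
    for (size_t k = 0; k < steps; k++) {
        p = next[p];
        q = next[q];
    }
    *end0 = p;
    *end1 = q;
}
```

The interleaved version visits exactly the same elements as two separate calls to `chase_one`; only the order in which the loads are issued changes.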

Strengths
• Good performance
Weaknesses
• The validity of the transformations is not established

Questions to discuss:
• What hardware support is needed to overlap multiple read misses?
• Why use unroll-and-jam instead of strip-mine and interchange?
• What do you think of the future work?
  • V. S. Pai and S. Adve. Improving Software Prefetching with Transformations to Increase Memory Parallelism. http://www.ece.rice.edu/~rsim/pubs/TR9910.ps
