In the paper "Code Transformations to Improve Memory Parallelism," presented at MICRO-32, Vijay S. Pai and Sarita Adve address the memory-system bottleneck in ILP-based architectures. They propose overlapping multiple read misses within the same instruction window using read miss clustering and an automatic unroll-and-jam transformation. The study uses the Rice Simulator for ILP Multiprocessors, and the experiments demonstrate a 9-39% reduction in multiprocessor execution time and an 11-48% reduction in uniprocessor execution time. The presentation also discusses the hardware support needed for these transformations and highlights their strengths as well as their weaknesses.
Code Transformations to Improve Memory Parallelism Vijay S. Pai and Sarita Adve MICRO-32, 1999
Motivation and Solutions • The memory system is the bottleneck in ILP-based systems • Solution: overlap multiple read misses (the dominant source of memory stalls) within the same instruction window, while preserving cache locality • There are not enough independent load misses within a single instruction window • Solution: read miss clustering enabled by code transformations, e.g., unroll-and-jam (see the sketch below) • Automating the code transformation • Solution: map the memory parallelism problem to floating-point pipelining (D. Callahan et al. Estimating Interlock and Improving Balance for Pipelined Machines. Journal of Parallel and Distributed Computing, Aug. 1988)
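To make read miss clustering concrete, here is a minimal sketch in C. It is not taken from the paper or its benchmarks; the array A, the size N, and the unroll factor 4 are illustrative assumptions. The original loop walks memory contiguously, so successive misses are many iterations apart; unrolling the outer loop and jamming the inner loops places several independent misses into the same instruction window while keeping spatial locality within each row.

```c
#define N 1024  /* hypothetical problem size, assumed to be a multiple of 4 */

/* Original form: the inner loop walks row j contiguously, so a read miss
   occurs only once per cache line and successive misses are separated by
   many iterations -- they rarely coexist in the instruction window. */
double matrix_sum_original(const double A[N][N]) {
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += A[j][i];
    return sum;
}

/* Unroll-and-jam by 4: unroll the outer j loop and fuse ("jam") the copies
   of the inner i loop.  Each inner iteration now touches four different
   rows, hence up to four different cache lines, so up to four independent
   read misses can overlap in one instruction window.  Spatial locality
   within each row is preserved, and the separate accumulators also break
   the floating-point recurrence on the sum. */
double matrix_sum_unroll_jam(const double A[N][N]) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int j = 0; j < N; j += 4)
        for (int i = 0; i < N; i++) {
            s0 += A[j + 0][i];
            s1 += A[j + 1][i];
            s2 += A[j + 2][i];
            s3 += A[j + 3][i];
        }
    return s0 + s1 + s2 + s3;
}
```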
Apply code transformations in a compiler • Automatic unroll-and-jam transformation • Locality analysis to determine leading references (M. E. Wolf and M. S. Lam. A Data Locality Optimizing Algorithm. PLDI 1991) • Dependence analysis of the limits to memory parallelism • Cache-line dependences • Address dependences • Window constraints (sketched below) • Experimental methodology • Environment: Rice Simulator for ILP Multiprocessors • Workload: Latbench, five scientific applications • Miss clustering incorporated by hand • Results • 9-39% reduction in multiprocessor execution time • 11-48% reduction in uniprocessor execution time
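The window-constraint idea can be illustrated with a small helper. This is purely hypothetical, not the paper's algorithm or RSIM code: it bounds the unroll-and-jam factor by how many copies of the jammed loop body fit in the instruction window and by how many misses the memory system can keep outstanding.

```c
#include <stddef.h>

/* Hypothetical helper, not from the paper: bound the unroll-and-jam factor
   so that the clustered leading-reference misses can actually overlap.
   window_size    = instructions the processor's reorder window can hold
   insns_per_iter = instructions in one copy of the jammed inner-loop body
   mshrs          = outstanding misses the cache hierarchy supports        */
static size_t unroll_factor_bound(size_t window_size,
                                  size_t insns_per_iter,
                                  size_t mshrs)
{
    if (insns_per_iter == 0)
        return 1;

    /* Window constraint: all unrolled copies of the body must fit in the
       instruction window for their misses to be in flight together. */
    size_t by_window = window_size / insns_per_iter;

    /* Resource constraint: clustering more misses than there are MSHRs
       gains nothing, since the extras simply queue up. */
    size_t bound = by_window < mshrs ? by_window : mshrs;
    return bound > 0 ? bound : 1;
}

/* Example: unroll_factor_bound(64, 12, 8) == 5 for these assumed values. */
```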
Strengths • Good performance • Weaknesses • The validity of the transformations is not well established
Questions to discuss: • What hardware support is needed to overlap multiple read misses? • Why use unroll-and-jam instead of strip-mine and interchange? (See the sketch below.) • What do you think of the future work? • V. S. Pai and S. Adve. Improving Software Prefetching with Transformations to Increase Memory Parallelism. http://www.ece.rice.edu/~rsim/pubs/TR9910.ps
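For the strip-mine-and-interchange question, a sketch of that alternative on the same reduction may help (it reuses the hypothetical N and row-major layout from the earlier example). It clusters misses in the same way, but the strip loop survives as a real innermost loop; fully unrolling that strip loop and splitting the accumulator recovers exactly the unroll-and-jam form, which avoids the extra loop and index overhead. That is one plausible reason to prefer unroll-and-jam, though the paper's own argument should be consulted.

```c
/* Strip-mine-and-interchange version of the same sum (sketch only).
   The outer j loop is strip-mined into (j, jj) and the jj strip loop is
   interchanged inward past i, so each i iteration touches S distinct rows
   and therefore S distinct cache lines -- the same miss clustering as
   unroll-and-jam, but with a real innermost loop and a single accumulator. */
double matrix_sum_strip_interchange(const double A[N][N]) {
    enum { S = 4 };                       /* strip size, assumed to divide N */
    double sum = 0.0;
    for (int j = 0; j < N; j += S)        /* strip-mined outer loop          */
        for (int i = 0; i < N; i++)       /* i interchanged outward of jj    */
            for (int jj = j; jj < j + S; jj++)
                sum += A[jj][i];
    return sum;
}
```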