
Code Transformations to Improve Memory Parallelism

In the paper "Code Transformations to Improve Memory Parallelism," presented at MICRO-32 (1999), Vijay S. Pai and Sarita Adve address the memory-system bottleneck in ILP-based architectures. They propose overlapping multiple read misses within the same instruction window through read miss clustering, which is enabled by code transformations such as unroll-and-jam, and they describe how a compiler could apply the transformation automatically. Experiments on the Rice Simulator for ILP Multiprocessors show a 9-39% reduction in multiprocessor execution time and an 11-48% reduction in uniprocessor execution time. The presentation closes with the work's strengths and weaknesses and with discussion questions, including what hardware support is needed to overlap multiple read misses.



Presentation Transcript


  1. Code Transformations to Improve Memory Parallelism. Vijay S. Pai and Sarita Adve. MICRO-32, 1999.

  2. Motivation and Solutions
     • The memory system is the bottleneck in ILP-based systems
       • Solution: overlap multiple read misses (the dominant source of memory stalls) within the same instruction window, while preserving cache locality
     • A single instruction window often lacks enough independent load misses
       • Solution: read miss clustering enabled by code transformations, e.g. unroll-and-jam
     • The code transformation should be automated
       • Solution: map the memory parallelism problem to floating-point pipelining (D. Callahan et al. Estimating Interlock and Improving Balance for Pipelined Machines. Journal of Parallel and Distributed Computing, Aug. 1988)

  3. Unroll-and-jam
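The figure for this slide did not survive in the transcript. As a stand-in, here is a minimal C sketch of unroll-and-jam applied to a row-wise array traversal. It is an illustration under stated assumptions, not the authors' code: the function names, the array sizes, and the unroll factor of 4 are hypothetical. In the baseline, consecutive inner iterations walk along one cache line, so successive misses are separated by many hits and few of them fit in the instruction window together. After the outer loop is unrolled and the copies of the inner loop are fused (jammed), each inner iteration touches four different rows, so up to four independent misses can be outstanding at once.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sizes; these are assumptions, not values from the slides. */
enum { ROWS = 8, COLS = 64, UNROLL = 4 };

/* The jammed loop below has no cleanup loop, so ROWS must divide evenly. */
static_assert(ROWS % UNROLL == 0, "ROWS must be a multiple of UNROLL");

/* Baseline: row-major traversal of a ROWS x COLS matrix stored flat.
 * The inner loop stays within one row, so misses to new cache lines
 * are spread far apart in the instruction stream. */
double sum_baseline(const double *a)
{
    double sum = 0.0;
    for (size_t j = 0; j < ROWS; j++)
        for (size_t i = 0; i < COLS; i++)
            sum += a[j * COLS + i];
    return sum;
}

/* Unroll-and-jam: the outer loop is unrolled by UNROLL and the resulting
 * inner-loop copies are fused into one body. The four loads in the body
 * come from four different rows (hence different cache lines), so their
 * misses are independent and can overlap.
 * Note: this reorders the floating-point additions, which can change
 * rounding for general inputs. */
double sum_unroll_and_jam(const double *a)
{
    double sum = 0.0;
    for (size_t j = 0; j < ROWS; j += UNROLL) {
        for (size_t i = 0; i < COLS; i++) {
            sum += a[(j + 0) * COLS + i];
            sum += a[(j + 1) * COLS + i];
            sum += a[(j + 2) * COLS + i];
            sum += a[(j + 3) * COLS + i];
        }
    }
    return sum;
}
```

Each row is still walked sequentially in both versions, which is how the transformation keeps the cache locality mentioned on slide 2. Whether the misses actually overlap depends on the hardware (for example, a non-blocking cache and a large enough instruction window).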

  4. Apply code transformations in a compiler
     • Automatic unroll-and-jam transformation
       • Locality analysis to determine leading references (M. E. Wolf and M. S. Lam. A Data Locality Optimizing Algorithm. PLDI 1991)
       • Dependence analysis to find limits on memory parallelism
         • Cache-line dependences
         • Address dependences
         • Window constraints
     • Experimental methodology
       • Environment: Rice Simulator for ILP Multiprocessors
       • Workload: Latbench and five scientific applications
       • Miss clustering incorporated by hand
     • Results
       • 9-39% reduction in multiprocessor execution time
       • 11-48% reduction in uniprocessor execution time
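Of the limits listed on slide 4, the address dependence is the easiest to show in code. The C sketch below is hypothetical (the `struct node` layout, the function names, and the choice of four chains are assumptions for illustration, not the benchmark code used in the paper). When a linked chain is traversed, the address of each load comes from the previous load, so misses within one chain cannot overlap. Advancing several independent chains in lockstep, which is the same outer-loop jamming idea, places one miss per chain in the window at the same time.

```c
#include <assert.h>
#include <stddef.h>

/* Number of independent chains; an illustrative assumption. */
enum { CHAINS = 4 };

struct node {
    struct node *next;
    long value;
};

/* Serial version: each chain is chased to its end before the next one
 * starts. Within a chain, the address of every load depends on the
 * previous load, so at most one miss is outstanding at a time. */
long sum_chains_serial(struct node *const heads[CHAINS])
{
    assert(heads != NULL);
    long sum = 0;
    for (size_t c = 0; c < CHAINS; c++)
        for (const struct node *p = heads[c]; p != NULL; p = p->next)
            sum += p->value;
    return sum;
}

/* Clustered version: all chains advance one step per pass. The loads
 * for different chains have no address dependence on each other, so up
 * to CHAINS misses can be in flight together. */
long sum_chains_clustered(struct node *const heads[CHAINS])
{
    assert(heads != NULL);
    const struct node *p[CHAINS];
    size_t live = 0;   /* chains that still have nodes left */
    long sum = 0;

    for (size_t c = 0; c < CHAINS; c++) {
        p[c] = heads[c];
        if (p[c] != NULL)
            live++;
    }
    while (live > 0) {
        for (size_t c = 0; c < CHAINS; c++) {
            if (p[c] == NULL)
                continue;          /* this chain is already finished */
            sum += p[c]->value;
            p[c] = p[c]->next;
            if (p[c] == NULL)
                live--;
        }
    }
    return sum;
}
```

The clustered version visits the nodes in a different order, so it is only a valid replacement when the per-node work is order-independent, as the integer sum is here.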

  5. Strengths and Weaknesses
     • Strengths
       • Good performance
     • Weaknesses
       • The validity of the transformations is not established

  6. Questions to discuss:
     • What hardware support is needed to overlap multiple read misses?
     • Why use unroll-and-jam instead of the strip-mine-and-interchange transformation?
     • What do you think of the future work?
       • V. S. Pai and S. Adve. Improving Software Prefetching with Transformations to Increase Memory Parallelism. http://www.ece.rice.edu/~rsim/pubs/TR9910.ps
