
Code Transformations to Improve Memory Parallelism

Vijay S. Pai and Sarita Adve. MICRO-32, 1999.


Presentation Transcript


  1. Code Transformations to Improve Memory Parallelism. Vijay S. Pai and Sarita Adve. MICRO-32, 1999.

  2. Motivation and Solutions
     • Problem: the memory system is the bottleneck in ILP-based systems.
       Solution: overlap multiple read misses (the dominant source of memory stalls) within the same instruction window, while preserving cache locality.
     • Problem: a single instruction window often lacks enough independent load misses.
       Solution: read miss clustering enabled by code transformations, e.g. unroll-and-jam.
     • Problem: the code transformation has to be automated.
       Solution: map the memory parallelism problem to floating-point pipelining (D. Callahan et al. Estimating Interlock and Improving Balance for Pipelined Machines. Journal of Parallel and Distributed Computing, Aug. 1988).

  3. Unroll-and-jam
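
The slide's figure did not survive in this transcript. As a stand-in, here is a minimal sketch of unroll-and-jam in C; it is not the authors' code. It assumes a row-major matrix whose rows are at least one cache line long, and the function names and the unroll factor of 4 are chosen only for illustration. In the baseline, consecutive inner-loop iterations walk along one cache line, so read misses are spaced a whole line apart in the instruction stream. After unrolling the outer loop and jamming the copies into one inner loop, each iteration loads from four different rows, which places up to four independent misses close together.

```c
#include <stddef.h>

/* Baseline: row-wise sum of a rows x cols row-major matrix.
 * Successive misses are separated by a full cache line of iterations. */
double sum_rowwise(const double *a, size_t rows, size_t cols)
{
    double s = 0.0;
    for (size_t j = 0; j < rows; j++)
        for (size_t i = 0; i < cols; i++)
            s += a[j * cols + i];
    return s;
}

/* Unroll-and-jam by 4 (illustrative factor): the outer loop is unrolled
 * and the four inner-loop copies are fused, so every inner iteration
 * touches four different rows, i.e. four different cache lines. */
double sum_unroll_and_jam(const double *a, size_t rows, size_t cols)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t j = 0;

    for (; j + 4 <= rows; j += 4) {
        const double *r0 = a + (j + 0) * cols;
        const double *r1 = a + (j + 1) * cols;
        const double *r2 = a + (j + 2) * cols;
        const double *r3 = a + (j + 3) * cols;
        for (size_t i = 0; i < cols; i++) {
            s0 += r0[i];
            s1 += r1[i];
            s2 += r2[i];
            s3 += r3[i];
        }
    }

    /* Remainder rows when rows is not a multiple of 4. */
    for (; j < rows; j++)
        for (size_t i = 0; i < cols; i++)
            s0 += a[j * cols + i];

    return (s0 + s1) + (s2 + s3);
}
```

Each row is still traversed sequentially, so spatial locality within a row is kept. Using separate accumulators changes the floating-point summation order, so results can differ in the last bits from the baseline for general inputs.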

  4. Apply code transformations in a compiler
     • Automatic unroll-and-jam transformation
     • Locality analysis to determine leading references (M. E. Wolf and M. S. Lam. A Data Locality Optimizing Algorithm. PLDI 1991)
     • Dependence analysis of the limits on memory parallelism
       • Cache-line dependences
       • Address dependences
       • Window constraints
     • Experimental methodology
       • Environment: Rice Simulator for ILP Multiprocessors
       • Workload: Latbench and five scientific applications
       • Miss clustering incorporated by hand
     • Results
       • 9-39% reduction in multiprocessor execution time
       • 11-48% reduction in uniprocessor execution time
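
To make "address dependences" concrete, here is a hedged sketch in C; it is an illustration written for this transcript, not code from the paper. In a linked-list walk, the address of each load comes from the data returned by the previous load, so the misses of one chain cannot overlap no matter how large the instruction window is. Interleaving two independent chains, as unroll-and-jam over an outer loop of lists would, gives the processor two unrelated misses per iteration. The `node` type and both function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical list node used only for this illustration. */
typedef struct node {
    struct node *next;
    long val;
} node;

/* One chain: each load address depends on the previous load's result,
 * so misses along the chain are serialized. */
long sum_list(const node *p)
{
    long s = 0;
    while (p != NULL) {
        s += p->val;
        p = p->next;
    }
    return s;
}

/* Two chains interleaved: the loads through p and q are independent of
 * each other, so a miss on one chain can overlap a miss on the other. */
long sum_two_lists(const node *p, const node *q)
{
    long sp = 0, sq = 0;
    while (p != NULL && q != NULL) {
        sp += p->val;
        sq += q->val;
        p = p->next;
        q = q->next;
    }
    /* Finish whichever chain is longer. */
    sp += sum_list(p);
    sq += sum_list(q);
    return sp + sq;
}
```

The other two limits act differently. References that fall in the same cache line share a single miss, which is why only the leading reference counts. Independent misses also overlap only if they fit within the same instruction window.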

  5. Strengths
     • Good performance
     Weaknesses
     • The validity of the transformations is not established

  6. Questions to discuss:
     • What hardware support is needed to overlap multiple read misses?
     • Why use unroll-and-jam instead of the strip-mine-and-interchange code transformation?
     • What do you think of the future work?
       • V. S. Pai and S. Adve. Improving Software Prefetching with Transformations to Increase Memory Parallelism. http://www.ece.rice.edu/~rsim/pubs/TR9910.ps
