
Many-Thread Aware Prefetching Mechanisms for GPGPU Applications

Many-Thread Aware Prefetching Mechanisms for GPGPU Applications. Jaekyu Lee, Nagesh B. Lakshminarayana, Hyesoon Kim, Richard Vuduc. In the Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2010. Paper presentation by





Presentation Transcript

  1. Many-Thread Aware Prefetching Mechanisms for GPGPU Applications • Jaekyu Lee, Nagesh B. Lakshminarayana, Hyesoon Kim, Richard Vuduc • In the Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2010 • Paper presentation by Sankalp Shivaprakash

  2. Motivation • Hide memory latency through many-thread aware prefetching schemes • Per-warp training and stride promotion • Inter-thread prefetching • Adaptive throttling • Propose software and hardware prefetching mechanisms for a GPGPU architecture • Scalable to a large number of threads • Robust, using feedback and throttling mechanisms to avoid degraded performance

  3. Memory Latency Hiding Techniques • Multithreading • Thread-level and warp-level context switching • Use of complex cache memory hierarchies • Serving accesses from L1 and L2 caches rather than going to global memory (DRAM) each time • Prefetching • Helps when thread-level parallelism is insufficient • Memory request merging [Figure: requests from Thread1, Thread2, and Thread3 being merged]

  4. Prefetching – Parallel Architectures • Reason for prefetching: consider warps that each execute three instructions (Add, Sub, Load) • Without prefetch: the core idles while Load1 for Warp1 and Load2 for Warp2 are outstanding • With prefetch: • Prefetch1: fetching for Load2 • Prefetch2: fetching for Load3 [Figure: execution timelines for Warp1, Warp2, and Warp3 without and with prefetching]

  5. Prefetching (Contd) • Software Prefetching • Prefetching into registers • Prefetching into cache • Can congest the cache if prefetches are not controlled and accurate • Useful data in the cache can be polluted by prefetched data

  6. Prefetching (Contd) • Hardware Prefetching • Stream Prefetcher • Monitors the direction of access in a memory region • Once a constant access direction is detected, launches prefetches in that direction • Stride Prefetcher • Tracks the difference in address between two accesses • Launches prefetch requests using the delta once a constant difference is detected • Example: accesses to addresses 0, 1000, and 2000 give a constant δ = 1000 • GHB Prefetcher (Global History Buffer) • Stores miss addresses in an n-entry FIFO table (the GHB table) • Each miss address entry points to another entry, which allows detection of stream, stride, and irregular repeating address patterns • Prefetchers are also characterized by their aggressiveness
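The stride prefetcher described on this slide can be sketched in a few lines. This is a toy software model, not the hardware from the paper: the class name `StridePrefetcher`, the `degree` knob, and the rule "two equal non-zero deltas in a row" are assumptions chosen for illustration. The trace reuses the slide's example addresses.

```python
class StridePrefetcher:
    """Toy stride detector for a single access stream (illustrative only)."""

    def __init__(self, degree=1):
        self.last_addr = None    # most recent address seen
        self.last_delta = None   # difference between the last two addresses
        self.degree = degree     # assumed knob: how many strides ahead to prefetch

    def access(self, addr):
        """Record one access and return the list of addresses to prefetch."""
        prefetches = []
        if self.last_addr is not None:
            delta = addr - self.last_addr
            # Two equal, non-zero deltas in a row count as a constant stride.
            if delta != 0 and delta == self.last_delta:
                prefetches = [addr + delta * i
                              for i in range(1, self.degree + 1)]
            self.last_delta = delta
        self.last_addr = addr
        return prefetches


pf = StridePrefetcher()
trace = [0, 1000, 2000, 3000]          # constant delta of 1000
issued = [pf.access(a) for a in trace]
print(issued)
```

With this trace the detector stays silent for the first two accesses and starts issuing one prefetch per access once the delta of 1000 has repeated. Raising `degree` is one simple way to model a more aggressive prefetcher.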

  7. Many-Thread Aware Prefetching: MT-SWP • Conventional stride prefetching • Inter-thread Prefetching (IP)

  8. Many-Thread Aware Prefetching: MT-HWP • Scalable versions of the traditional training policies for PC-based stride prefetchers • Per-warp training • Strong stride behavior exists within a warp • Stride information trained per warp is stored in a PWS (Per Warp Stride) table

  9. Many-Thread Aware Prefetching: MT-HWP • Stride Promotion • Because the stride pattern is often the same across all warps for a given PC, the PWS table is monitored for three accesses • If the same stride is found, the PWS entry is promoted to the Global Stride (GS) table; if not, it stays in the PWS table • Inter-thread Prefetching • Monitors the stride pattern across threads at the same PC for three memory accesses • If the stride is the same, the stride information is stored in the IP table
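Per-warp training and stride promotion can be sketched as two small tables. This is a minimal model under stated assumptions: the class `StrideTables`, the dictionary layout, and the reading of "three accesses" as "three PWS entries for the same PC holding the same stride" (`PROMOTION_COUNT`) are all illustrative choices, not details given on the slides.

```python
PROMOTION_COUNT = 3  # assumed: matching PWS entries needed before promotion


class StrideTables:
    """Toy model of a Per Warp Stride (PWS) table and a Global Stride (GS) table."""

    def __init__(self):
        self.pws = {}  # (pc, warp_id) -> (last_addr, stride or None)
        self.gs = {}   # pc -> stride shared by all warps

    def access(self, pc, warp_id, addr):
        """Train on one demand access and return a prefetch address or None."""
        if pc in self.gs:
            # Promoted PCs need no per-warp training.
            return addr + self.gs[pc]

        last_addr, old_stride = self.pws.get((pc, warp_id), (None, None))
        stride = None if last_addr is None else addr - last_addr
        self.pws[(pc, warp_id)] = (addr, stride)
        if stride is None or stride == 0:
            return None

        # Collect PWS entries for this PC that currently hold the same stride.
        matching = [key for key, (_, s) in self.pws.items()
                    if key[0] == pc and s == stride]
        if len(matching) >= PROMOTION_COUNT:
            self.gs[pc] = stride
            for key in matching:
                del self.pws[key]  # promoted entries free their PWS slots
            return addr + stride

        # Not promoted: prefetch only if this warp repeated its own stride.
        return addr + stride if stride == old_stride else None


tables = StrideTables()
for warp in range(3):
    tables.access(pc=0x1A, warp_id=warp, addr=warp * 4096)   # first touch
results = [tables.access(pc=0x1A, warp_id=w, addr=w * 4096 + 128)
           for w in range(3)]
print(results, tables.gs)
```

In this run the third warp to show a stride of 128 triggers promotion, after which any warp, including one never seen before, gets a prefetch address on its first access to that PC. That is the scalability argument for promotion: one GS entry replaces many PWS entries.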

  10. Many-Thread Aware Prefetching: MT-HWP • Implementation • When a PC hits in both the GS and IP tables, GS is given preference because: • Strides within a warp are more common than those across warps • GS entries have been trained for a longer period
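The priority rule above amounts to an ordered lookup. The sketch below assumes both tables can be modeled as dictionaries keyed by PC; the function name `select_stride` and the table contents are hypothetical.

```python
def select_stride(pc, gs_table, ip_table):
    """Return (source, stride) for a PC, preferring GS over IP.

    Returns None when the PC misses in both tables.
    """
    if pc in gs_table:
        return ("GS", gs_table[pc])
    if pc in ip_table:
        return ("IP", ip_table[pc])
    return None


gs_table = {0x10: 128}             # hypothetical promoted per-warp strides
ip_table = {0x10: 4, 0x20: 4}      # hypothetical inter-thread strides

both_hit = select_stride(0x10, gs_table, ip_table)
ip_only = select_stride(0x20, gs_table, ip_table)
miss = select_stride(0x30, gs_table, ip_table)
print(both_hit, ip_only, miss)
```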

  11. Useful vs. Harmful Prefetching • MTAML: Minimum Tolerable Average Memory Latency • Minimum average number of cycles per memory request that does not lead to stalls • MTAML_pref: the corresponding value when prefetching is enabled
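The slide defines MTAML only in words. One way to make it concrete is a simple model in which every instruction occupies one issue cycle and, while one warp waits on memory, the remaining warps supply computation. The formulas, function names, and parameters below are assumptions for illustration and are not taken from the slides.

```python
def mtaml(comp_inst, mem_inst, active_warps):
    """Latency per memory request that the other warps can cover (assumed model)."""
    return (comp_inst / mem_inst) * (active_warps - 1)


def mtaml_pref(comp_inst, mem_inst, prefetch_inst, prefetch_hit_prob, active_warps):
    """Same model with prefetching enabled.

    Assumptions: prefetch instructions add to the computation count, and
    demand requests that hit on prefetched data no longer count as memory
    requests.
    """
    comp_new = comp_inst + prefetch_inst
    mem_new = (1.0 - prefetch_hit_prob) * mem_inst
    if mem_new == 0:
        return float("inf")  # every request hits on a prefetch
    return (comp_new / mem_new) * (active_warps - 1)


# Hypothetical kernel: 20 compute and 2 memory instructions per warp, 8 warps.
base = mtaml(comp_inst=20, mem_inst=2, active_warps=8)
pref = mtaml_pref(comp_inst=20, mem_inst=2, prefetch_inst=2,
                  prefetch_hit_prob=0.5, active_warps=8)
print(base, pref)
```

Under this model, effective prefetching raises the tolerable latency because fewer requests reach memory while slightly more instructions are available to overlap with them.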

  12. Useful vs. Harmful Prefetching • Comparison of MTAML and measured average latency (AVG Latency) • Case 1: AVG Latency < MTAML and AVG Latency(PREF) < MTAML_pref: latency is already tolerated, so prefetching has little effect • Case 2: AVG Latency > MTAML: prefetching is beneficial provided AVG Latency(PREF) is less than MTAML_pref • Case 3: otherwise, prefetching might turn out useful or harmful • Measured AVG Latency(PREF) ignores successively prefetched memory operations • Contention and delay increase as the number of warps increases [Figure: regions labeled 1, 2, and 3]
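The three cases can be written as a small decision function. This is a sketch of the comparison as the slide states it; the function name, argument names, return strings, and the handling of exact equality are assumptions.

```python
def classify_prefetch(avg_latency, avg_latency_pref, mtaml, mtaml_pref):
    """Map measured latencies and MTAML values to one of the three cases."""
    if avg_latency_pref < mtaml_pref:
        if avg_latency < mtaml:
            # Case 1: latency is tolerated with or without prefetching.
            return "case 1: little effect"
        # Case 2: latency is only tolerated once prefetching is on.
        return "case 2: beneficial"
    # Case 3: latency with prefetching is not tolerated.
    return "case 3: useful or harmful"


# Hypothetical measurements in cycles.
examples = [
    classify_prefetch(50, 60, mtaml=70, mtaml_pref=154),
    classify_prefetch(120, 130, mtaml=70, mtaml_pref=154),
    classify_prefetch(120, 200, mtaml=70, mtaml_pref=154),
]
print(examples)
```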

  13. Useful vs. Harmful Prefetching • Harmful prefetch requests could be due to: • Queuing Delays • DRAM row-buffer conflicts • Wasting of off-chip bandwidth due to early eviction • Wasting of off-chip bandwidth due to inaccurate prefetches

  14. Metrics for Adaptive Prefetch Throttling • Early Eviction Rate • Merge Ratio • Throttling avoids: • Wasted system bandwidth • Delayed requests • Cache space occupied by unnecessary prefetches • Prefetch requests that are merged with demand requests may be late, but this is compensated for by context switching across warps

  15. Metrics for Adaptive Prefetch Throttling • Monitoring of Early Eviction and Merge Ratio
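A feedback loop built on these two metrics could look like the sketch below. The slides do not define the metrics or the policy numerically, so everything here is an assumption: the ratio definitions, the threshold values, and the mapping from metric levels to actions are placeholders that show the shape of such a controller.

```python
def early_eviction_rate(early_evictions, useful_prefetches):
    """Assumed definition: early evictions per useful prefetch."""
    if useful_prefetches == 0:
        return float("inf") if early_evictions else 0.0
    return early_evictions / useful_prefetches


def merge_ratio(merged_requests, total_requests):
    """Assumed definition: fraction of requests merged with an in-flight request."""
    return merged_requests / total_requests if total_requests else 0.0


def throttle_action(eviction_rate, merge,
                    high_evict=0.02, low_evict=0.01, high_merge=0.15):
    """Pick an action for the next monitoring period (placeholder thresholds)."""
    if eviction_rate > high_evict:
        return "no_prefetch"   # prefetched data is evicted before use
    if eviction_rate > low_evict:
        return "fewer"         # back off but keep prefetching
    if merge > high_merge:
        return "more"          # prefetches are being consumed, so ramp up
    return "no_prefetch"       # little sign that prefetches are useful


# Hypothetical counters sampled over one monitoring period.
rate = early_eviction_rate(early_evictions=3, useful_prefetches=200)
ratio = merge_ratio(merged_requests=50, total_requests=200)
action = throttle_action(rate, ratio)
print(rate, ratio, action)
```

The point of the structure is that the controller adjusts aggressiveness in steps instead of switching prefetching on or off based on accuracy alone.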

  16. Methodology • The baseline processor is NVIDIA's 8800GT • Application inputs to the simulator are generated using GPUOcelot, a binary translation framework for PTX

  17. Methodology

  18. Results and Discussion

  19. Results and Discussion

  20. Results and Discussion

  21. Conclusion • The throttling mechanism proposed in this paper controls the aggressiveness of prefetching rather than disabling it completely • The metrics considered go beyond accuracy alone: they account for cache pollution due to early eviction and for memory request merging • Scalability and robustness were given importance • The study does not consider complex cache memory hierarchies • The overhead of prefetching is not clearly substantiated

  22. Thank You
