Many-Thread Aware Prefetching Mechanisms for GPGPU Application

Presentation Transcript

  1. Many-Thread Aware Prefetching Mechanisms for GPGPU Application. Jaekyu Lee, Nagesh B. Lakshminarayana, Hyesoon Kim, Richard Vuduc. In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2010. Paper presentation by Sankalp Shivaprakash

  2. Motivation • Memory latency hiding through multithreaded prefetching schemes • Per-warp training and stride promotion • Inter-thread prefetching • Adaptive throttling • Propose software and hardware prefetching mechanisms for a GPGPU architecture • Scalable to a large number of threads • Robust, using feedback and throttling mechanisms to avoid degrading performance

  3. Memory Latency Hiding Techniques • Multithreading • Thread-level and warp-level context switching • Utilization of complex cache memory hierarchies • Serving accesses from on-chip L1 and L2 caches rather than going to off-chip global memory (DRAM) each time • Prefetching • Helps when thread-level parallelism is insufficient to hide latency • Memory request merging [Figure: requests from Thread1, Thread2 and Thread3 being merged]

  4. Prefetching: Parallel Architectures • Reason for prefetching: consider warp1 and warp2, each executing three instructions (Add, Sub, Load) • Without prefetch: the core idles while a warp waits for its load • With prefetch: • Prefetch1: fetching for Load2 • Prefetch2: fetching for Load3 [Figure: execution timelines for Warp1, Warp2 and Warp3, showing Load1 for Warp1 and an idle period during Load2 for Warp2]

  5. Prefetching (Contd) • Software Prefetching • Prefetching into registers • Prefetching into cache • Causes congestion in the cache if prefetches are not controlled and accurate • Useful cached data could be evicted (cache pollution)

  6. Prefetching (Contd) • Hardware Prefetching • Stream prefetcher • Monitors the direction of access in a memory region • Once a constant access direction is detected, launches prefetches in that direction • Stride prefetcher • Tracks the difference in address between two accesses • Launches prefetch requests using the delta once a constant difference is detected (for example, accesses to addresses 0, 1000, 2000 give δ = 1000) • GHB (Global History Buffer) prefetcher • Stores miss addresses in an n-entry FIFO table (the GHB table) • Each miss address points to another entry, which allows detection of stream, stride and irregular repeating address patterns • Each prefetcher can be characterized by its aggressiveness
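
The stride prefetcher described on this slide can be sketched in a few lines of Python. This is a toy model, not the hardware from the paper: the class name `StridePrefetcher`, the `confirmations` threshold and the `degree` parameter are all assumptions made for illustration.

```python
class StridePrefetcher:
    """Toy PC-indexed stride prefetcher (illustrative only)."""

    def __init__(self, confirmations=2, degree=1):
        self.confirmations = confirmations  # equal deltas needed before prefetching (assumed)
        self.degree = degree                # how many strides ahead to prefetch (assumed)
        self.table = {}                     # pc -> (last_addr, last_delta, count)

    def access(self, pc, addr):
        """Record one access and return the list of addresses to prefetch."""
        if pc not in self.table:
            self.table[pc] = (addr, None, 0)
            return []
        last_addr, last_delta, count = self.table[pc]
        delta = addr - last_addr
        # Count how many times in a row the same delta has been observed.
        count = count + 1 if delta == last_delta else 1
        self.table[pc] = (addr, delta, count)
        if delta != 0 and count >= self.confirmations:
            return [addr + delta * k for k in range(1, self.degree + 1)]
        return []


pf = StridePrefetcher()
issued = [pf.access(pc=0x40, addr=a) for a in (0, 1000, 2000, 3000)]
print(issued)  # [[], [], [3000], [4000]]
```

With the slide's example addresses, the first prefetch is issued only after the delta of 1000 has been seen twice. Raising `degree` is one simple way to make such a prefetcher more aggressive.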

  7. Many-Thread Aware Prefetching: MT-SWP • Conventional stride prefetching • Inter-thread prefetching (IP)
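
The slide names inter-thread prefetching without showing it, so the following is only a sketch of the idea. It assumes a thread issues a prefetch on behalf of the thread in the same lane of a later warp, and that each thread reads `base + tid * elem_size`. The helper names, the 32-thread warp size and the `warp_distance` parameter are assumptions for illustration.

```python
WARP_SIZE = 32  # assumption: 32 threads per warp


def demand_address(base, tid, elem_size=4):
    """Address that thread `tid` itself loads (assumed access pattern)."""
    return base + tid * elem_size


def ip_prefetch_address(base, tid, elem_size=4, warp_distance=1):
    """Address thread `tid` prefetches for the same lane `warp_distance` warps ahead."""
    return demand_address(base, tid + warp_distance * WARP_SIZE, elem_size)


# Thread 5 prefetches the element that thread 37 (same lane, next warp) will load.
print(hex(ip_prefetch_address(0x1000, 5)))  # 0x1094
```

The prefetched data is consumed by a different thread than the one that requested it, so this only makes sense when the target is a cache shared by those threads rather than a register.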

  8. Many-Thread Aware Prefetching: MT-HWP • Scalable versions of the traditional training policies for PC-based stride prefetchers • Per-warp training • Strong stride behavior exists within a warp • Stride information trained per warp is stored in a PWS (Per Warp Stride) table
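
A small experiment shows why training per warp matters. The trace below is hypothetical: two warps execute the same load PC and their accesses interleave. Training on the PC alone sees no constant delta, while training per warp recovers the stride. The helper `trained_stride` and the three-access window are assumptions for this sketch.

```python
def trained_stride(addresses, needed=3):
    """Return the stride if the last `needed` accesses are evenly spaced, else None."""
    if len(addresses) < needed:
        return None
    recent = addresses[-needed:]
    deltas = {b - a for a, b in zip(recent, recent[1:])}
    return deltas.pop() if len(deltas) == 1 else None


# (warp_id, address) pairs observed at one load PC; hypothetical interleaving.
trace = [(0, 0), (1, 4096), (0, 1000), (1, 5096), (0, 2000), (1, 6096)]

# PC-only training: all warps share one history, so the deltas look irregular.
pc_only = trained_stride([addr for _, addr in trace])

# Per-warp training: one history per warp, as in a PWS-style table.
pws = {}
for warp, addr in trace:
    pws.setdefault(warp, []).append(addr)
per_warp = {warp: trained_stride(addrs) for warp, addrs in pws.items()}

print(pc_only)   # None
print(per_warp)  # {0: 1000, 1: 1000}
```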

  9. Many-Thread Aware Prefetching: MT-HWP • Stride promotion • Since the stride pattern is often the same across all warps for a given PC, the PWS table is monitored for three accesses • If the same stride is found, the PWS entry is promoted to the Global Stride (GS) table; if not, it is retained in the PWS table • Inter-thread prefetching • The stride pattern across threads at the same PC is monitored for three memory accesses • If the stride is the same, the stride information is stored in the IP table
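
Stride promotion can be sketched as a pass over the PWS table. The slide says the table is monitored "for three accesses"; this sketch interprets that as three warps having trained the same stride for one PC, which is an assumption. The table layouts and the function name `promote` are likewise hypothetical.

```python
PROMOTE_AFTER = 3  # assumption: three agreeing PWS entries trigger promotion


def promote(pws, gs):
    """Move strides shared by enough warps from the PWS table to the GS table.

    pws: dict mapping (pc, warp_id) -> trained stride
    gs:  dict mapping pc -> stride
    """
    by_pc = {}
    for (pc, _warp), stride in pws.items():
        by_pc.setdefault(pc, []).append(stride)
    for pc, strides in by_pc.items():
        if len(strides) >= PROMOTE_AFTER and len(set(strides)) == 1:
            gs[pc] = strides[0]
            # Promoted entries are removed, freeing PWS space (assumed behavior).
            for key in [k for k in pws if k[0] == pc]:
                del pws[key]
    return gs


pws = {(0x10, 0): 1000, (0x10, 1): 1000, (0x10, 2): 1000,
       (0x20, 0): 64, (0x20, 1): 128, (0x20, 2): 64}
gs = promote(pws, {})
print(gs)           # {16: 1000}
print(sorted(pws))  # [(32, 0), (32, 1), (32, 2)]
```

PC 0x10 is promoted because all three warps agree on a stride of 1000. PC 0x20 stays in the PWS table because its warps disagree.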

  10. Many-Thread Aware Prefetching: MT-HWP • Implementation • When there are hits in both the GS and IP tables, GS is given preference because • Strides within a warp are more common than those across warps • GS entries have been trained for a longer period
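
The priority rule on this slide amounts to an ordered lookup. In the sketch below both tables are plain dictionaries from PC to stride, and the prefetch address is simplified to `addr + stride`. Both simplifications, and the function name `choose_prefetch`, are assumptions.

```python
def choose_prefetch(pc, addr, gs_table, ip_table):
    """Return (source, prefetch_address), or None if neither table hits.

    The GS table is consulted first, so it wins when both tables hit.
    """
    if pc in gs_table:
        return ("GS", addr + gs_table[pc])
    if pc in ip_table:
        return ("IP", addr + ip_table[pc])
    return None


gs_table = {0x10: 1000}
ip_table = {0x10: 128, 0x20: 128}
print(choose_prefetch(0x10, 4000, gs_table, ip_table))  # ('GS', 5000)
print(choose_prefetch(0x20, 4000, gs_table, ip_table))  # ('IP', 4128)
```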

  11. Useful vs. Harmful Prefetching • MTAML: Minimum Tolerable Average Memory Latency • Minimum average number of cycles per memory request that does not lead to stalls • MTAML_pref: the corresponding bound when prefetching is enabled
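
The formulas for MTAML and MTAML_pref did not survive in this transcript, so the sketch below uses a first-order model that is an assumption here. While one warp waits on memory, each of the other `warps - 1` warps can issue `comp_inst / mem_inst` compute instructions at one cycle each. For the prefetching variant it further assumes that prefetch instructions add to the compute count and that prefetch hits remove memory requests. All parameter names are hypothetical.

```python
def mtaml(comp_inst, mem_inst, warps):
    """Latency (cycles per memory request) that the other warps can cover."""
    return comp_inst / mem_inst * (warps - 1)


def mtaml_pref(comp_inst, mem_inst, warps, prefetch_inst, prefetch_hit_prob):
    """Same bound with prefetching, under the assumptions stated above."""
    comp_new = comp_inst + prefetch_inst            # prefetches cost issue slots
    mem_new = mem_inst * (1.0 - prefetch_hit_prob)  # hits no longer go to memory
    if mem_new == 0:
        return float("inf")                         # every request is covered
    return comp_new / mem_new * (warps - 1)


print(mtaml(20, 2, 9))                                               # 80.0
print(mtaml_pref(20, 2, 9, prefetch_inst=2, prefetch_hit_prob=0.5))  # 176.0
```

In this model, prefetching raises the tolerable latency from 80 to 176 cycles, because fewer requests reach memory and slightly more instructions are available to overlap with them.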

  12. Useful vs. Harmful Prefetching • Comparison of MTAML and measured average latency (AVG Latency) • Case 1: AVG Latency < MTAML and AVG Latency(PREF) < MTAML_pref • Case 2: AVG Latency > MTAML: prefetching is beneficial provided AVG Latency(PREF) is less than MTAML_pref • Case 3: prefetching might turn out useful or harmful • Measured AVG Latency(PREF) ignores successfully prefetched memory operations • Greater contention and increased delay are seen as the number of warps increases [Figure: regions labelled 1, 2 and 3]
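
The three cases on this slide can be written as a small decision function. The function name and the numeric return values are assumptions. The comment on case 1 follows from the definition of MTAML rather than from the slide, which does not state a conclusion for that case.

```python
def classify(avg_latency, mtaml, avg_latency_pref, mtaml_pref):
    """Return the slide's case number (1, 2 or 3) for the measured latencies."""
    if avg_latency < mtaml and avg_latency_pref < mtaml_pref:
        return 1  # latency is already tolerated with or without prefetching
    if avg_latency > mtaml and avg_latency_pref < mtaml_pref:
        return 2  # prefetching is beneficial
    return 3      # prefetching might be useful or harmful


print(classify(50, 80, 60, 176))    # 1
print(classify(120, 80, 150, 176))  # 2
print(classify(120, 80, 200, 176))  # 3
```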

  13. Useful vs. Harmful Prefetching • Harmful prefetch requests could be due to: • Queuing Delays • DRAM row-buffer conflicts • Wasting of off-chip bandwidth due to early eviction • Wasting of off-chip bandwidth due to inaccurate prefetches

  14. Metrics for Adaptive Prefetch Throttling • Early eviction rate • Merge ratio • Throttling on these metrics avoids: • Consumption of system bandwidth • Delayed requests • Occupation of the cache by unnecessary prefetches • Prefetch requests might be late, as indicated by prefetch merges, but that is compensated for by context switching across warps

  15. Metrics for Adaptive Prefetch Throttling • Monitoring of Early Eviction and Merge Ratio
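
A feedback loop over these two metrics might look like the sketch below. The transcript gives neither the metric definitions nor the thresholds, so everything numeric here is hypothetical. The early eviction rate is taken as early evictions per useful prefetch, and the merge ratio as merged requests per total request. The `degree` value means "how hard prefetching is throttled", with 0 meaning no throttling.

```python
def update_throttle(degree, early_evictions, useful_prefetches, merges,
                    total_requests, evict_hi=0.02, merge_hi=0.15, max_degree=5):
    """Return the new throttle degree after one monitoring period."""
    early_rate = early_evictions / max(useful_prefetches, 1)
    merge_ratio = merges / max(total_requests, 1)
    if early_rate > evict_hi:
        degree += 1   # prefetched lines are evicted before use: throttle harder
    elif merge_ratio > merge_hi:
        degree -= 1   # prefetches are being consumed: allow more of them
    return min(max(degree, 0), max_degree)


print(update_throttle(2, 10, 100, 0, 1000))   # 3
print(update_throttle(2, 0, 100, 300, 1000))  # 1
print(update_throttle(2, 0, 100, 10, 1000))   # 2
```

Adjusting the degree one step at a time, rather than switching prefetching on or off, matches the conclusion's point that the mechanism controls aggressiveness instead of disabling prefetching outright.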

  16. Methodology • The baseline processor is NVIDIA's 8800GT • Application inputs to the simulator are generated using GPUOcelot, a binary translation framework for PTX

  17. Methodology

  18. Results and Discussion

  19. Results and Discussion

  20. Results and Discussion

  21. Conclusion • The throttling mechanism proposed in this paper controls the aggressiveness of prefetching rather than curbing it completely • The metrics considered go beyond accuracy alone: they are convincing enough to avoid cache pollution due to early eviction and to exploit memory request merging • Scalability and robustness were given importance • The study does not consider complex cache memory hierarchies • The overhead of prefetching is not clearly substantiated

  22. Thank You