Many-Thread Aware Prefetching Mechanisms for GPGPU Applications

Many-Thread Aware Prefetching Mechanisms for GPGPU Applications
Jaekyu Lee, Nagesh B. Lakshminarayana, Hyesoon Kim, Richard Vuduc
In the Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2010
Paper presentation by Sankalp Shivaprakash

Motivation
• Memory latency hiding through multithread prefetching schemes
  • Per-warp training and stride promotion
  • Inter-thread prefetching
  • Adaptive throttling
• Propose software and hardware prefetching mechanisms for a GPGPU architecture
  • Scalable to a large number of threads
  • Robustness through feedback and throttling mechanisms to avoid degraded performance

Memory Latency Hiding Techniques
• Multithreading
  • Thread-level and warp-level context switching
• Utilization of complex cache memory hierarchies
  • Using L1 and L2 caches rather than accessing global memory (DRAM) each time
• Prefetching
  • Helps when thread-level parallelism is insufficient
• Memory request merging
[Figure: requests from Thread1, Thread2, Thread1, Thread3]

Prefetching – Parallel Architectures
• Reason for prefetching: consider warp1 and warp2, each having three instructions (Add, Sub, Load)
• Without prefetch:
• With prefetch:
  • Prefetch1: fetching for Load2
  • Prefetch2: fetching for Load3
[Figure: execution timelines without and with prefetching; labels: Warp1, Warp2, Warp3, Load1 for Warp1, Idle, Load2 for Warp2]

Prefetching (Contd)
• Software prefetching
  • Prefetching into registers
  • Prefetching into cache
    • Causes cache congestion if prefetches are not controlled and accurate
    • Cache contents could be polluted

Prefetching (Contd)
• Hardware prefetching
  • Stream prefetcher
    • Monitors the direction of access in a memory region
    • Once a constant access direction is detected, launches prefetches in that direction
  • Stride prefetcher
    • Tracks the difference in address between two accesses
    • Launches prefetch requests using the delta once a constant difference is detected
  • GHB (Global History Buffer) prefetcher
    • Stores miss addresses in an n-entry FIFO table (GHB table)
    • Each miss address points to another entry (right), which allows detection of stream, stride, and irregular repeating address patterns
* Characterize aggressiveness
[Figure: stride example; addresses 0, 1000, 2000 with δ = 1000]
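The stride prefetcher bullets above can be illustrated with a small software model. This is a minimal sketch, not the hardware described in the paper: the class name `StridePrefetcher`, the `threshold` (how many times the same delta must repeat) and the `degree` (how many prefetches are issued per trigger) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class StrideEntry:
    last_addr: int
    stride: Optional[int] = None  # last observed address delta
    confidence: int = 0           # consecutive times the same delta was seen


class StridePrefetcher:
    """Toy PC-indexed stride detector (illustrative assumption, not the paper's design)."""

    def __init__(self, threshold: int = 2, degree: int = 1) -> None:
        self.threshold = threshold  # assumed: repeats required before prefetching
        self.degree = degree        # assumed: prefetches issued per trigger
        self.table: Dict[int, StrideEntry] = {}

    def access(self, pc: int, addr: int) -> List[int]:
        """Record a demand access and return the addresses to prefetch."""
        entry = self.table.get(pc)
        if entry is None:
            # First access from this PC: nothing to compare against yet.
            self.table[pc] = StrideEntry(last_addr=addr)
            return []
        delta = addr - entry.last_addr
        if delta != 0 and delta == entry.stride:
            entry.confidence += 1
        else:
            # New or changed delta: restart training with this delta.
            entry.stride = delta
            entry.confidence = 1 if delta != 0 else 0
        entry.last_addr = addr
        if entry.confidence >= self.threshold:
            return [addr + delta * i for i in range(1, self.degree + 1)]
        return []
```

With accesses at 0, 1000 and 2000 from the same PC, the delta of 1000 is confirmed on the third access, so this model would issue its first prefetch for address 3000.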

Many-Thread Aware Prefetching: MT-SWP
• Conventional stride prefetching
• Inter-thread prefetching (IP)

Many-Thread Aware Prefetching: MT-HWP
Scalable versions of the traditional training policies for PC-based stride prefetchers
• Per-warp training
  • Strong stride behavior exists within a warp
  • Stride information trained per warp is stored in a PWS (Per-Warp Stride) table

Many-Thread Aware Prefetching: MT-HWP
• Stride promotion
  • Because the stride pattern is often the same across all warps for a given PC, the PWS table is monitored for three accesses
  • If the same stride is found, the PWS entry is promoted to the Global Stride (GS) table; if not, it is retained in the PWS table
• Inter-thread prefetching
  • The stride pattern across threads at the same PC is monitored for three memory accesses
  • If the stride is the same, the stride information is stored in the IP table
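Per-warp training and stride promotion can be sketched in software as follows. This is only an illustration under stated assumptions: the class `StrideTables`, the per-PC candidate counter, and the decision to drop per-warp entries after promotion are hypothetical choices, and the slide does not specify how the three monitored accesses are counted.

```python
from typing import Dict, Optional, Tuple

PROMOTION_THRESHOLD = 3  # the slides mention monitoring three accesses


class StrideTables:
    """Toy model of a PWS table with promotion to a GS table (assumed structure)."""

    def __init__(self) -> None:
        # PWS: (pc, warp_id) -> (last address, last observed stride)
        self.pws: Dict[Tuple[int, int], Tuple[int, Optional[int]]] = {}
        # Promotion candidate per PC: pc -> (stride, matching observations so far)
        self.candidate: Dict[int, Tuple[int, int]] = {}
        # GS: pc -> stride shared by all warps
        self.gs: Dict[int, int] = {}

    def train(self, pc: int, warp_id: int, addr: int) -> None:
        """Train on one memory access issued by warp_id at instruction pc."""
        if pc in self.gs:
            return  # already promoted; no per-warp training needed
        key = (pc, warp_id)
        if key not in self.pws:
            self.pws[key] = (addr, None)
            return
        last_addr, _ = self.pws[key]
        stride = addr - last_addr
        self.pws[key] = (addr, stride)

        # Count consecutive observations that agree on the same stride.
        cand_stride, count = self.candidate.get(pc, (stride, 0))
        count = count + 1 if stride == cand_stride else 1
        self.candidate[pc] = (stride, count)

        if count >= PROMOTION_THRESHOLD:
            self.gs[pc] = stride
            # Assumption: per-warp entries are dropped once the PC is promoted.
            for k in [k for k in self.pws if k[0] == pc]:
                del self.pws[k]
            del self.candidate[pc]
```

In this model, three warps that each show a stride of 1000 at the same PC cause one GS entry to replace three PWS entries, which is the sense in which promotion helps the tables scale to many warps.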

Many-Thread Aware Prefetching: MT-HWP
• Implementation
  • When there are hits in both the GS and IP tables, GS is given preference because:
    • Strides within a warp are more common than those across warps
    • GS entries are trained for a longer period
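The lookup preference can be sketched as a simple priority check. The function name `select_prefetch_source` and the table layouts are hypothetical; the slide only states that GS wins over IP, so placing the PWS table last is an assumption made here for completeness.

```python
from typing import Dict, Optional, Tuple


def select_prefetch_source(
    pc: int,
    warp_id: int,
    gs: Dict[int, int],
    ip: Dict[int, int],
    pws: Dict[Tuple[int, int], int],
) -> Optional[Tuple[str, int]]:
    """Return (table name, stride) for the table that should drive the prefetch."""
    if pc in gs:
        # Stated on the slide: GS is preferred when both GS and IP hit.
        return ("GS", gs[pc])
    if pc in ip:
        return ("IP", ip[pc])
    key = (pc, warp_id)
    if key in pws:
        # Assumption: fall back to the per-warp stride when nothing else hits.
        return ("PWS", pws[key])
    return None
```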

Useful vs. Harmful Prefetching
• MTAML: Minimum Tolerable Average Memory Latency
  • Minimum average number of cycles per memory request that does not lead to stalls
• MTAML_pref: MTAML with prefetching enabled

Useful vs. Harmful Prefetching
• Comparison of MTAML and measured average latency (AVG Latency)
• Case 1: AVG Latency < MTAML and AVG Latency (PREF) < MTAML_pref
• Case 2: AVG Latency > MTAML
  • Prefetching is beneficial provided AVG Latency (PREF) is less than MTAML_pref
• Case 3: Prefetching might turn out useful or harmful
  • Measured AVG Latency (PREF) ignores successively prefetched memory operations
  • Greater contention and increased delay are seen as the number of warps increases
[Figure: regions labeled 1, 2, 3, corresponding to the cases above]
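The three cases can be written as a small decision function. This is a sketch of the comparison as listed on the slide, assuming the four quantities have already been measured or computed; the function name and the returned labels are hypothetical, and anything not matching case 1 or case 2 is treated as case 3.

```python
def classify_prefetch_case(
    avg_latency: float,
    mtaml: float,
    avg_latency_pref: float,
    mtaml_pref: float,
) -> str:
    """Map measured latencies and MTAML values onto the three cases."""
    if avg_latency < mtaml and avg_latency_pref < mtaml_pref:
        # Case 1: latency is below the tolerable limit both without and with prefetching.
        return "case 1"
    if avg_latency > mtaml and avg_latency_pref < mtaml_pref:
        # Case 2: prefetching brings the latency under the tolerable limit.
        return "case 2"
    # Case 3 (assumed catch-all): prefetching may turn out useful or harmful.
    return "case 3"
```

For example, with hypothetical values `classify_prefetch_case(400, 300, 250, 350)` falls into case 2: latency exceeds MTAML without prefetching but stays under MTAML_pref with it.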

Useful vs. Harmful Prefetching
• Harmful prefetch requests could be due to:
  • Queuing delays
  • DRAM row-buffer conflicts
  • Wasting of off-chip bandwidth due to early eviction
  • Wasting of off-chip bandwidth due to inaccurate prefetches

Metrics for Adaptive Prefetch Throttling
• Early eviction rate
• Merge ratio
• Avoids:
  • Consumption of system bandwidth
  • Delayed requests
  • Occupation of the cache by unnecessary prefetches
Prefetch requests that are merged with demand requests might be late, but that lateness is compensated by context switching across warps

Metrics for Adaptive Prefetch Throttling
• Monitoring of early eviction rate and merge ratio
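A feedback loop built on these two metrics might look like the sketch below. Everything specific here is an assumption: the slides do not define the metrics numerically, so the early eviction rate is taken as early evictions over useful prefetches, the merge ratio as merged requests over total requests, and the thresholds and the three-way decision are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class PrefetchCounters:
    early_evictions: int    # prefetched lines evicted before being used
    useful_prefetches: int  # prefetched lines that were later used
    merged_requests: int    # requests merged with an in-flight request
    total_requests: int     # all memory requests in the sampling period


def early_eviction_rate(c: PrefetchCounters) -> float:
    # Assumed definition: early evictions relative to useful prefetches.
    if c.useful_prefetches == 0:
        return float("inf") if c.early_evictions else 0.0
    return c.early_evictions / c.useful_prefetches


def merge_ratio(c: PrefetchCounters) -> float:
    # Assumed definition: merged requests relative to all requests.
    return c.merged_requests / c.total_requests if c.total_requests else 0.0


def throttle_decision(
    c: PrefetchCounters,
    eviction_high: float = 0.5,  # hypothetical threshold
    merge_high: float = 0.15,    # hypothetical threshold
) -> str:
    """Return 'decrease', 'increase' or 'keep' for the prefetch aggressiveness."""
    if early_eviction_rate(c) > eviction_high:
        return "decrease"  # prefetched data is being evicted before use
    if merge_ratio(c) > merge_high:
        return "increase"  # prefetches are being merged with demand requests
    return "keep"
```

The decision adjusts aggressiveness in steps rather than switching prefetching off outright, in line with the conclusion that throttling controls aggressiveness instead of curbing prefetching completely.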

Methodology
• The baseline processor is NVIDIA's 8800GT
• Application input to the simulator is generated using GPUOcelot, a binary translator framework for PTX

Methodology

Results and Discussion

Conclusion
• The throttling mechanism proposed in this paper controls the aggressiveness of prefetching rather than curbing it completely
• The metrics considered were convincing: rather than relying on accuracy alone, they account for cache pollution due to early eviction and for memory request merging
• Scalability and robustness were given importance
• The study does not consider complex cache memory hierarchies
• The overhead of prefetching is not clearly substantiated

Thank You
