
Power Efficient DRAM Speculation



  1. Power Efficient DRAM Speculation
  Nidhi Aggarwal†, Jason F. Cantin‡, Mikko H. Lipasti†, and James E. Smith†
  †University of Wisconsin-Madison, 1415 Engineering Drive, Madison, WI 53705
  ‡International Business Machines, 11400 Burnet Road, Austin, TX 78758
  The 14th Annual International Symposium on High Performance Computer Architecture, February 20th, 2008

  2. Overview
  Power Efficient DRAM Speculation:
  • Utilizes Region Coherence Arrays to identify requests likely to result in cache-to-cache transfers
  • Does not access DRAM speculatively for these requests
  • Reduces DRAM power and energy consumption

  3. Problem
  DRAM power consumption is a growing problem
  • Large and increasing portion of the total system power in the mid-range and high-end markets
  • E.g., DRAM power in Niagara is ~22% of system power
  Many systems access DRAM speculatively for performance
  + Reduces latency
  • Wastes DRAM power
  • Wastes DRAM bandwidth

  4. Opportunity
  Not all requests use data from DRAM, depending on:
  • Number, size, and associativity of caches in the system
  • Number of processors
  • Amount of sharing in the application and OS
  • Protocols optimized for cache-to-cache transfers, e.g., IBM Power6
  There is no need to access DRAM if a request will not use the data
  Coarse-Grain Coherence Tracking can help detect these requests

  5. Example
  [Diagram: a speculative DRAM read overlapped with the snoop. Useful DRAM read: the snoop response from Processor B is a miss, so the memory controller's speculative read supplies the data and DRAM latency is hidden by the snoop. Unused DRAM read: the snoop response is a hit and Processor B supplies the data, so the speculative read's DRAM power is wasted.]

  6. Unused DRAM Reads
  29% of DRAM requests are unused reads

  7. Background
  Coarse-Grain Coherence Tracking:
  • Memory is divided into coarse-grain regions
    • Aligned, power-of-two multiple of the cache line size
    • Can range from two lines to a physical page
  • A structure is added to each processor's cache hierarchy to monitor the coherence of regions
    • Region Coherence Arrays (Cantin et al., ISCA'05)
    • RegionScout Filters (Moshovos, ISCA'05)
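The region alignment described above can be made concrete with a small sketch. The 512B region size appears later on the simulator slide; the 64B line size and the helper names are assumptions for illustration:

```python
# Sketch: mapping a physical address to its coarse-grain region.
# 512B regions are taken from the simulator slide; the 64B line size
# is an assumed value. Regions are aligned, power-of-two multiples
# of the line size, so alignment is a single mask operation.
LINE_SIZE = 64      # bytes per cache line (assumed)
REGION_SIZE = 512   # bytes per region (power-of-two multiple of line size)

def region_base(addr: int) -> int:
    """Align an address down to its region boundary."""
    return addr & ~(REGION_SIZE - 1)

def lines_per_region() -> int:
    """Number of cache lines covered by one RCA entry."""
    return REGION_SIZE // LINE_SIZE

# Two lines in the same 512B region map to the same RCA entry:
assert region_base(0x1000) == region_base(0x11FF)
```

One RCA entry thus summarizes the external coherence status of several cache lines at once, which is what makes the tracking "coarse-grain".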

  8. Background
  • The RCA is used to avoid broadcast snoops
    • Requests to data not currently shared
    • Reduces latency, snoop traffic
  • The RCA is also used to filter broadcast snoops from other processors
    • Reduces power, tag lookup bandwidth
  • Though RCAs were designed to detect non-shared data, they also accurately detect shared data

  9. Terminology
  • Regions have “unknown” external state if there is not a valid entry in the RCA
  • Regions have “externally-clean” state if other processors may have clean copies of lines
  • Regions have “externally-dirty” state if other processors may have modifiable copies of lines
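These three states could be captured in a small lookup sketch. The enum names and the dictionary-based RCA stand-in are illustrative, not the paper's implementation:

```python
from enum import Enum, auto

class RegionState(Enum):
    """External coherence state of a coarse-grain region (per the slide)."""
    UNKNOWN = auto()           # no valid RCA entry for this region
    EXTERNALLY_CLEAN = auto()  # other processors may have clean copies
    EXTERNALLY_DIRTY = auto()  # other processors may have modifiable copies

def external_state(rca: dict, region: int) -> RegionState:
    """Look up a region's external state; a missing entry means 'unknown'."""
    return rca.get(region, RegionState.UNKNOWN)
```

The "unknown" case matters: an RCA has finite capacity, so evicted regions fall back to unknown and must be treated conservatively.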

  10. Unused DRAM Reads
  29% of DRAM reads are unused and go to externally-dirty regions

  11. Unused DRAM Reads
  [Chart: per-workload breakdown of unused DRAM reads; annotated values: 76%, 59%, 36%, 33%, 15%]

  12. Approach
  Utilize information from Region Coherence Arrays to identify requests likely to obtain data from other processors' caches
  • Set a bit in the memory request to inform the memory controller not to speculatively access DRAM
  Buffer requests in the memory controller until the snoop response arrives
  Use the snoop response to validate the prediction
  • If another processor will provide the data, drop the request
  • If not, perform the DRAM read, incurring a latency penalty
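The buffer-and-validate flow above can be sketched as follows. All class, field, and function names here are hypothetical; the single `no_spec` flag stands in for the bit set in the memory request:

```python
# Sketch of the memory-controller flow described above (names assumed).
# A request tagged no_spec is buffered until the snoop response arrives;
# all other reads access DRAM speculatively as in the baseline.
from types import SimpleNamespace

class Dram:
    """Minimal stand-in that records which reads are actually issued."""
    def __init__(self):
        self.reads = []
    def start_read(self, addr):
        self.reads.append(addr)

def handle_read(request, dram, buffer):
    if request.no_spec:
        buffer.append(request)          # hold until the snoop response
    else:
        dram.start_read(request.addr)   # speculative DRAM access

def on_snoop_response(request, hit_in_remote_cache, dram, buffer):
    if request in buffer:
        buffer.remove(request)
    if request.no_spec and not hit_in_remote_cache:
        # Misprediction: no cache-to-cache transfer, so read DRAM now
        # and pay the added latency.
        dram.start_read(request.addr)
    # On a hit, the remote cache supplies the data and the read is dropped.
```

A correctly predicted unused read never touches DRAM; the cost of a misprediction is only the serialized snoop-then-read latency.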

  13. Power-Efficient DRAM Speculation
  [Diagram: reads predicted to be unused are buffered in the memory controller until the snoop response arrives. Unused read predicted unused: snoop response is a hit, the buffered read is dropped, and DRAM power is saved. Useful read predicted unused: snoop response is a miss, the DRAM read starts only after the response, adding latency.]

  14. Policies
  • Baseline: All read requests speculatively access DRAM
  • Base-NoSpec: No read requests speculatively access DRAM
  • Shen-CRP: Read requests do not speculatively access DRAM if there is a tag match on an invalid frame in the cache
  • PEDS-DKD: “Delay Known Dirty” – Read requests speculatively access DRAM unless the region state is externally-dirty
  • PEDS-DLD: “Delay Likely Dirty” – Read requests speculatively access DRAM unless the region is externally-dirty, or was externally-dirty in the past (special state added to the RCA)
  • PEDS-DNC: “Delay Not Clean” – Only requests to a region that is externally-clean (or has been) speculatively access DRAM (special state added to the RCA)
  • PEDS-DAS: “Delay All Snoops” – No broadcast reads speculatively access DRAM
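The PEDS policies above can be expressed as predicates over the region state. This is a sketch with illustrative string encodings: Shen-CRP is omitted because it depends on cache tags rather than region state, `history` stands in for the special RCA state that remembers past external states, and PEDS-DAS's broadcast/non-broadcast distinction is simplified away:

```python
# Sketch: each policy as a predicate deciding whether a read should
# access DRAM speculatively. Encodings are illustrative assumptions.
DIRTY = "externally-dirty"
CLEAN = "externally-clean"

def speculate(policy, state, history=()):
    """Return True if the read should access DRAM speculatively.

    state:   the region's current external state ("unknown", CLEAN, DIRTY)
    history: past external states remembered via the extra RCA state
             (used only by PEDS-DLD and PEDS-DNC)
    """
    if policy == "Baseline":
        return True                                    # always speculate
    if policy in ("Base-NoSpec", "PEDS-DAS"):
        return False                                   # never speculate
    if policy == "PEDS-DKD":
        return state != DIRTY                          # delay known dirty
    if policy == "PEDS-DLD":
        return state != DIRTY and DIRTY not in history # delay likely dirty
    if policy == "PEDS-DNC":
        return state == CLEAN or CLEAN in history      # delay not clean
    raise ValueError(f"unknown policy: {policy}")
```

The policies form a spectrum from aggressive (Baseline) to conservative (PEDS-DAS); DKD, DLD, and DNC trade progressively more latency risk for larger DRAM power savings.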

  15. Overhead
  One additional bit in the memory request packet
  • Tags good/bad candidates for a speculative DRAM access
  One additional region state for some policies
  • PEDS-DLD and PEDS-DNC
  Space in memory controller queues to buffer requests until the snoop response arrives
  • Optional

  16. Simulator
  PHARMsim:
  • Execution-driven simulator built on top of SimOS-PPC
  • Four 4-way superscalar out-of-order processors (1.5 GHz)
  • Two-level cache hierarchy with split L1, unified L2 caches
  • Separate address and data networks, shared memory controller
  • RCA with the same number of sets and associativity as the L2 cache, 512B regions
  DRAMSim:
  • Detailed DRAM timing/power model
  • Models DRAM power at the rank level
  • 8GB Micron DDR200, dual channel

  17. Workloads
  Scientific benchmarks:
  • Barnes
  • Ocean
  • Raytrace
  • Radiosity
  Multiprogrammed workloads:
  • SPECint95rate
  • SPECint2000rate
  Commercial workloads:
  • TPC-W
  • TPC-B
  • TPC-H
  • SPECweb99
  • SPECjbb2000

  18. Comparison – Reads Performed
  [Chart: DRAM reads performed under each policy; annotated reductions of ~33%, ~15%, and ~28%]

  19. Comparison – DRAM Power
  [Chart: DRAM power under each policy; ~31% reduction]

  20. Comparison – DRAM Energy
  [Chart: DRAM energy under each policy; ~10% less reduction than DRAM power due to the added latency]
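The gap between power and energy savings follows from energy = power × time: the snoop-wait latency stretches execution time, which offsets part of the power reduction. A rough check using the ~31% power reduction and the 7.4% execution-time increase annotated on the neighboring slides (illustrative arithmetic, not the paper's measured energy figures):

```python
# Rough check: E = P * T, so a power reduction is partly offset by
# any slowdown. The 31% and 7.4% figures come from the adjacent
# slides; the resulting energy number is illustrative, not measured.
power_factor = 1 - 0.31    # relative DRAM power after PEDS
time_factor = 1 + 0.074    # relative execution time with PEDS

energy_factor = power_factor * time_factor  # relative DRAM energy
energy_reduction = 1 - energy_factor        # roughly a 26% energy saving
```

So even in the worst reported slowdown case, most of the power saving survives as an energy saving.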

  21. Comparison – Execution Time
  [Chart: execution time under each policy, with “RCA Alone” as a reference; annotated 7.4% increase]

  22. Comparison – Time Between Requests
  [Chart: time between DRAM requests under each policy; annotated increases of 3.7x, 2.3x, and 2.3x]

  23. Future Work
  Add more bits to memory requests to enable the memory controller to better prioritize requests
  Combine PEDS with other DRAM power management techniques, e.g.:
  • A Comprehensive Approach to DRAM Power Management, Hur and Lin, HPCA'08
  • Memory Controller Policies for DRAM Power Management, Fan, Ellis, and Lebeck, ISLPED'01
  Combine PEDS with DRAM scheduling techniques:
  • Memory Access Scheduling, Rixner et al., ISCA'00

  24. Conclusion
  Power Efficient DRAM Speculation reduces DRAM power consumption
  • Filters unnecessary DRAM reads
  • Reduces DRAM utilization → less dynamic power
  • Increases time between requests → less standby power
  Small performance impact
  • Few memory requests delayed unnecessarily
  • Fewer DRAM reads → less contention for other requests
  Reduces DRAM energy consumption

  25. Something to think about…
  DRAM power may soon dominate system power
  • And thus cooling costs, operating costs, and battery life
  This does not bode well for micro-architectural techniques:
  • Speculative DRAM accesses
  • Prefetching
  • Run-ahead execution
  • Hardware/software parallelization
  Research needs to focus more on this
  • Beyond using existing low-power modes
  • Beyond filtering speculative accesses
  • Beyond just inserting another level of cache

  26. The End
