
Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures


Presentation Transcript


  1. Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures
  Rajeev Balasubramonian, School of Computing, University of Utah
  July 1st, 2004

  2. Billion-Transistor Chips
  • Partitioned architectures: small computational units connected by a communication fabric
  • Small computational units with limited functionality → fast clocks, low design effort, low power
  • Numerous computational units → high parallelism

  3. The Communication Bottleneck
  • Wire delays do not scale down at the same rate as logic delays [Agarwal, ISCA ’00][Ho, Proc. IEEE ’01]
  • 30-cycle delay to go across the chip in 10 years
  • 1-cycle inter-hop latency in the RAW prototype [Taylor, ISCA ’04]

  4. Cache Design
  [Diagram: centralized cache with a single L1D; address transfer 6 cycles, RAM access 6 cycles, data transfer 6 cycles; 18-cycle access (12 cycles for communication)]

  5. Cache Design
  [Diagram: centralized cache (single L1D; 6-cycle address transfer, 6-cycle RAM access, 6-cycle data transfer; 18-cycle access with 12 cycles for communication) alongside a decentralized cache with L1D banks replicated next to the clusters]
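To make the latency breakdown on slides 4 and 5 concrete, here is a minimal Python sketch. The 6-cycle components and the 18-cycle total come from the slides; the 1-cycle local hops in the decentralized case are an illustrative assumption, not a number from the talk.

```python
# Load latency under the two cache organizations on slides 4-5.
# The 6-cycle components are from the slides; the 1-cycle local hops
# for the decentralized case are assumed for illustration.

def centralized_load_latency(addr_xfer=6, ram_access=6, data_xfer=6):
    """Centralized L1D: the address and the data cross the chip."""
    return addr_xfer + ram_access + data_xfer      # 18 cycles, 12 of them communication

def decentralized_load_latency(local_hop=1, ram_access=6):
    """Replicated L1D bank near the cluster: only short local hops."""
    return local_hop + ram_access + local_hop

print(centralized_load_latency())    # 18
print(decentralized_load_latency())  # 8, under the assumed 1-cycle hops
```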

  6. Research Goals
  • Identify bottlenecks in cache access
  • Design cluster prefetch, a latency-hiding mechanism
  • Evaluate and compare centralized and decentralized designs

  7. Outline
  • Motivation
  • Evaluation platform
  • Cluster prefetch
  • Centralized vs. decentralized caches
  • Conclusions

  8. Clustered Microarchitectures
  • Centralized front-end
  • Dynamically steered (dependences & load)
  • Out-of-order issue and 1-cycle bypass within a cluster
  • Hierarchical interconnect
  [Diagram: instruction fetch and the L1D/LSQ connected to the clusters through a crossbar and a ring]

  9. Simulation Parameters
  • Simplescalar-based simulator
  • In-flight instruction window of 480
  • 16 clusters, each with 60 registers, 30 issue queue entries, and one FU of each kind
  • Inter-cluster latencies between 2 and 10 cycles
  • Primary focus on SPEC-FP programs
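For quick reference, the simulated machine on slide 9 can be summarized as a configuration record; a minimal sketch follows, where the key names are my own and only the values are taken from the slide.

```python
# Summary of the simulated machine from slide 9.
# The key names are illustrative; only the values come from the slide.
simulated_config = {
    "simulator": "SimpleScalar-based",
    "in_flight_window": 480,                  # in-flight instruction window
    "num_clusters": 16,
    "registers_per_cluster": 60,
    "issue_queue_entries_per_cluster": 30,
    "fus_per_cluster": "one of each kind",
    "inter_cluster_latency_cycles": (2, 10),  # range of hop latencies
    "benchmarks": "SPEC-FP (primary focus)",
}
```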

  10. Steps Involved in Cache Access
  [Diagram: instruction fetch and dispatch, effective address computation in the cluster, effective address transfer to the LSQ, memory disambiguation, L1D RAM access, and data transfer back to the cluster]

  11. Lifetime of a Load

  12. Load Address Prediction
  [Diagram, baseline timeline: the load dispatches at cycle 0, the effective address reaches the LSQ at cycle 27, the cache access happens at cycle 68, and the data transfer back to the cluster completes at cycle 94]

  13. Load Address Prediction
  [Diagram, timeline with an address predictor: the cache access starts at cycle 0 using the predicted address and the data transfer completes at cycle 26; the effective address still reaches the LSQ at cycle 27 for verification, compared with cache access at cycle 68 and data transfer at cycle 94 without prediction]
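The cycle numbers on slides 12 and 13 line up directly; the short sketch below just restates the two timelines and the resulting saving (the numbers are the slides', the dictionary layout is mine).

```python
# Timelines from slides 12-13: without prediction the cache access waits for
# the effective address to reach the LSQ; with a predicted address the access
# starts at dispatch and is verified when the real address arrives.
baseline  = {"dispatch": 0, "eff_addr_at_lsq": 27, "cache_access": 68, "data_back": 94}
predicted = {"dispatch": 0, "cache_access": 0, "data_back": 26, "verified_at": 27}

print("cycles saved on this load:", baseline["data_back"] - predicted["data_back"])  # 68
```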

  14. Memory Dependence Speculation
  • To allow early cache access, loads must issue before earlier store addresses are resolved
  • High-confidence store address predictions are employed for disambiguation
  • Stores that have never forwarded results within the LSQ are ignored
  • Cluster prefetch: the combination of load address prediction and memory dependence speculation
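A minimal sketch of the disambiguation check described on slide 14, assuming hypothetical store records with predicted-address, confidence, and has-ever-forwarded fields; this is one reading of the slide, not the paper's actual hardware logic.

```python
# Decide whether a load may speculatively access the cache before all older
# store addresses are resolved (slide 14). A store blocks the load unless it
# has never forwarded data in the LSQ, or its address is confidently
# predicted to differ from the load's predicted address.

def can_issue_load_early(load_pred_addr, older_unresolved_stores):
    for st in older_unresolved_stores:
        if not st.ever_forwarded:
            continue          # never forwarded within the LSQ -> ignored
        if st.pred_confident and st.pred_addr != load_pred_addr:
            continue          # high-confidence prediction says no conflict
        return False          # possible conflict: wait for this store
    return True
```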

  15. Implementation Details
  • Centralized table that maintains a stride and the last address; the stride is determined by five consecutive accesses and cleared in case of five mispredicts
  • Separate centralized table that maintains a single bit per entry to indicate stores that pose conflicts
  • Each mispredict flushes all subsequent instructions
  • Storage overhead: 18 KB
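The predictor table on slide 15 stores a stride and a last address per load; the sketch below is one plausible rendering of that structure. The class layout, method names, and threshold parameters are my own additions; only the stride/last-address fields and the five-access / five-mispredict policy come from the slide.

```python
# Stride-based load address predictor in the spirit of slide 15.
# Predictions are made only once enough consecutive accesses have matched the
# learned stride; the stride is replaced after `clear` consecutive mispredicts.

class StrideEntry:
    def __init__(self):
        self.last_addr = None
        self.stride = None
        self.confidence = 0    # consecutive accesses that matched the stride
        self.mispredicts = 0   # consecutive accesses that did not

class StrideAddressPredictor:
    def __init__(self, confirm=5, clear=5):
        self.table = {}        # indexed by load PC
        self.confirm = confirm
        self.clear = clear

    def predict(self, pc):
        e = self.table.get(pc)
        if e and e.stride is not None and e.confidence >= self.confirm:
            return e.last_addr + e.stride   # confident prediction -> early access
        return None                         # no prediction -> wait for the real address

    def update(self, pc, actual_addr):
        e = self.table.setdefault(pc, StrideEntry())
        if e.last_addr is not None:
            observed = actual_addr - e.last_addr
            if observed == e.stride:
                e.confidence += 1
                e.mispredicts = 0
            else:
                e.confidence = 0
                e.mispredicts += 1
                if e.stride is None or e.mispredicts >= self.clear:
                    e.stride = observed     # (re)learn the stride
                    e.mispredicts = 0
        e.last_addr = actual_addr
```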

  16. Performance Results
  Overall IPC improvement: 21%

  17. Results Analysis
  • Roughly half the programs improved IPC by >8%
  • Load address prediction rate: 65%
  • Store address prediction rate: 79%
  • Stores likely not to pose conflicts: 59%
  • Avg. number of mispredicts: 12K per 100M instrs

  18. Decentralized Cache
  • Replicated cache banks
  • Loads do not travel far
  • Stores & cache refills are broadcast
  • Memory disambiguation is not accelerated
  • Overheads: interconnect for broadcast and cache refill, power for redundant writes, distributed LRU, etc.
  [Diagram: an L1D bank and LSQ replicated next to each group of clusters]
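A minimal sketch (my own, not the paper's design) of the replicated-bank behaviour on slide 18: loads read only their local replica, while stores and refills must be broadcast to every replica, which is where the interconnect and write-power overheads come from.

```python
# Replicated L1D banks: local loads, broadcast stores (slide 18).

class ReplicatedL1D:
    def __init__(self, num_banks):
        self.banks = [dict() for _ in range(num_banks)]   # one replica per cluster group

    def load(self, bank_id, addr):
        return self.banks[bank_id].get(addr)              # local access only

    def store(self, addr, value):
        for bank in self.banks:                           # broadcast: every replica is written
            bank[addr] = value
```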

  19. Comparing Centralized & Decentralized
  [Diagram: the centralized and decentralized organizations side by side. IPCs without cluster prefetch: 1.43 and 1.52; IPCs with cluster prefetch: 1.73 and 1.79]

  20. Sensitivity Analysis
  • Results verified for processor models with varying resources and interconnect latencies
  • Evaluations on SPEC-Int: the address prediction rate is only 38% → modest speedups:
    • twolf (7%), parser (9%)
    • crafty, gcc, vpr (3-4%)
    • rest (< 2%)

  21. Related Work
  • Modest speedups with decentralized caches: Racunas and Patt [ICS ’03], for dynamic clustered processors; Gibert et al. [MICRO ’02], for VLIW clustered processors
  • Gibert et al. [MICRO ’03]: compiler-managed L0 buffers for critical data

  22. Conclusions
  • Address prediction and memory dependence speculation can hide the latency to cache banks; prediction rate of 66% for SPEC-FP and IPC improvement of 21%
  • Additional benefits from decentralization are modest
  • Future work: build better predictors, impact on power consumption [WCED ’04]

