
Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures


Presentation Transcript


  1. Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures
  Rajeev Balasubramonian, School of Computing, University of Utah
  July 1st, 2004

  2. Billion-Transistor Chips
  • Partitioned architectures: small computational units connected by a communication fabric
  • Small computational units with limited functionality → fast clocks, low design effort, low power
  • Numerous computational units → high parallelism

  3. The Communication Bottleneck
  • Wire delays do not scale down at the same rate as logic delays [Agarwal, ISCA ’00][Ho, Proc. IEEE ’01]
  • 30-cycle delay to go across the chip in 10 years
  • 1-cycle inter-hop latency in the RAW prototype [Taylor, ISCA ’04]

  4. Cache Design
  [Diagram: centralized cache with a single L1D; address transfer 6 cycles, RAM access 6 cycles, data transfer 6 cycles; 18-cycle access (12 cycles for communication)]

  5. Cache Design
  [Diagram: centralized cache (single L1D; 6-cycle address transfer, 6-cycle RAM access, 6-cycle data transfer; 18-cycle access with 12 cycles for communication) alongside a decentralized cache with L1D banks replicated next to the clusters]
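To make the latency breakdown on slides 4 and 5 concrete, here is a minimal Python sketch. The 6-cycle components and the 18-cycle total come from the slides; the 1-cycle local hops in the decentralized case are an illustrative assumption, not a number from the talk.

```python
# Load latency under the two cache organizations on slides 4-5.
# The 6-cycle components are from the slides; the 1-cycle local hops
# for the decentralized case are assumed for illustration.

def centralized_load_latency(addr_xfer=6, ram_access=6, data_xfer=6):
    """Centralized L1D: the address and the data cross the chip."""
    return addr_xfer + ram_access + data_xfer      # 18 cycles, 12 of them communication

def decentralized_load_latency(local_hop=1, ram_access=6):
    """Replicated L1D bank near the cluster: only short local hops."""
    return local_hop + ram_access + local_hop

print(centralized_load_latency())    # 18
print(decentralized_load_latency())  # 8, under the assumed 1-cycle hops
```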

  6. Research Goals
  • Identify bottlenecks in cache access
  • Design cluster prefetch, a latency-hiding mechanism
  • Evaluate and compare centralized and decentralized designs

  7. Outline
  • Motivation
  • Evaluation platform
  • Cluster prefetch
  • Centralized vs. decentralized caches
  • Conclusions

  8. Clustered Microarchitectures
  • Centralized front-end
  • Dynamically steered (dependences & load)
  • Out-of-order issue and 1-cycle bypass within a cluster
  • Hierarchical interconnect
  [Diagram: instruction fetch and the L1D/LSQ connected to the clusters through a crossbar and a ring]

  9. Simulation Parameters
  • Simplescalar-based simulator
  • In-flight instruction window of 480
  • 16 clusters, each with 60 registers, 30 issue queue entries, and one FU of each kind
  • Inter-cluster latencies between 2 and 10 cycles
  • Primary focus on SPEC-FP programs
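For quick reference, the simulated machine on slide 9 can be summarized as a configuration record; a minimal sketch follows, where the key names are my own and only the values are taken from the slide.

```python
# Summary of the simulated machine from slide 9.
# The key names are illustrative; only the values come from the slide.
simulated_config = {
    "simulator": "SimpleScalar-based",
    "in_flight_window": 480,                  # in-flight instruction window
    "num_clusters": 16,
    "registers_per_cluster": 60,
    "issue_queue_entries_per_cluster": 30,
    "fus_per_cluster": "one of each kind",
    "inter_cluster_latency_cycles": (2, 10),  # range of hop latencies
    "benchmarks": "SPEC-FP (primary focus)",
}
```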

  10. Steps Involved in Cache Access
  [Diagram: instruction fetch and dispatch, effective address computation in the cluster, effective address transfer to the LSQ, memory disambiguation, L1D RAM access, and data transfer back to the cluster]

  11. Lifetime of a Load

  12. Load Address Prediction
  [Diagram, baseline timeline: the load dispatches at cycle 0, the effective address reaches the LSQ at cycle 27, the cache access happens at cycle 68, and the data transfer back to the cluster completes at cycle 94]

  13. Load Address Prediction
  [Diagram, timeline with an address predictor: the cache access starts at cycle 0 using the predicted address and the data transfer completes at cycle 26; the effective address still reaches the LSQ at cycle 27 for verification, compared with cache access at cycle 68 and data transfer at cycle 94 without prediction]
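The cycle numbers on slides 12 and 13 line up directly; the short sketch below just restates the two timelines and the resulting saving (the numbers are the slides', the dictionary layout is mine).

```python
# Timelines from slides 12-13: without prediction the cache access waits for
# the effective address to reach the LSQ; with a predicted address the access
# starts at dispatch and is verified when the real address arrives.
baseline  = {"dispatch": 0, "eff_addr_at_lsq": 27, "cache_access": 68, "data_back": 94}
predicted = {"dispatch": 0, "cache_access": 0, "data_back": 26, "verified_at": 27}

print("cycles saved on this load:", baseline["data_back"] - predicted["data_back"])  # 68
```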

  14. Memory Dependence Speculation
  • To allow early cache access, loads must issue before earlier store addresses are resolved
  • High-confidence store address predictions are employed for disambiguation
  • Stores that have never forwarded results within the LSQ are ignored
  • Cluster prefetch: the combination of load address prediction and memory dependence speculation
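A minimal sketch of the disambiguation check described on slide 14, assuming hypothetical store records with predicted-address, confidence, and has-ever-forwarded fields; this is one reading of the slide, not the paper's actual hardware logic.

```python
# Decide whether a load may speculatively access the cache before all older
# store addresses are resolved (slide 14). A store blocks the load unless it
# has never forwarded data in the LSQ, or its address is confidently
# predicted to differ from the load's predicted address.

def can_issue_load_early(load_pred_addr, older_unresolved_stores):
    for st in older_unresolved_stores:
        if not st.ever_forwarded:
            continue          # never forwarded within the LSQ -> ignored
        if st.pred_confident and st.pred_addr != load_pred_addr:
            continue          # high-confidence prediction says no conflict
        return False          # possible conflict: wait for this store
    return True
```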

  15. Implementation Details
  • Centralized table that maintains a stride and the last address; the stride is determined by five consecutive accesses and cleared in case of five mispredicts
  • Separate centralized table that maintains a single bit per entry to indicate stores that pose conflicts
  • Each mispredict flushes all subsequent instructions
  • Storage overhead: 18 KB
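The predictor table on slide 15 stores a stride and a last address per load; the sketch below is one plausible rendering of that structure. The class layout, method names, and threshold parameters are my own additions; only the stride/last-address fields and the five-access / five-mispredict policy come from the slide.

```python
# Stride-based load address predictor in the spirit of slide 15.
# Predictions are made only once enough consecutive accesses have matched the
# learned stride; the stride is replaced after `clear` consecutive mispredicts.

class StrideEntry:
    def __init__(self):
        self.last_addr = None
        self.stride = None
        self.confidence = 0    # consecutive accesses that matched the stride
        self.mispredicts = 0   # consecutive accesses that did not

class StrideAddressPredictor:
    def __init__(self, confirm=5, clear=5):
        self.table = {}        # indexed by load PC
        self.confirm = confirm
        self.clear = clear

    def predict(self, pc):
        e = self.table.get(pc)
        if e and e.stride is not None and e.confidence >= self.confirm:
            return e.last_addr + e.stride   # confident prediction -> early access
        return None                         # no prediction -> wait for the real address

    def update(self, pc, actual_addr):
        e = self.table.setdefault(pc, StrideEntry())
        if e.last_addr is not None:
            observed = actual_addr - e.last_addr
            if observed == e.stride:
                e.confidence += 1
                e.mispredicts = 0
            else:
                e.confidence = 0
                e.mispredicts += 1
                if e.stride is None or e.mispredicts >= self.clear:
                    e.stride = observed     # (re)learn the stride
                    e.mispredicts = 0
        e.last_addr = actual_addr
```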

  16. Performance Results
  Overall IPC improvement: 21%

  17. Results Analysis
  • Roughly half the programs improved IPC by >8%
  • Load address prediction rate: 65%
  • Store address prediction rate: 79%
  • Stores likely not to pose conflicts: 59%
  • Avg. number of mispredicts: 12K per 100M instrs

  18. Decentralized Cache
  • Replicated cache banks
  • Loads do not travel far
  • Stores & cache refills are broadcast
  • Memory disambiguation is not accelerated
  • Overheads: interconnect for broadcast and cache refill, power for redundant writes, distributed LRU, etc.
  [Diagram: an L1D bank and LSQ replicated next to each group of clusters]
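A minimal sketch (my own, not the paper's design) of the replicated-bank behaviour on slide 18: loads read only their local replica, while stores and refills must be broadcast to every replica, which is where the interconnect and write-power overheads come from.

```python
# Replicated L1D banks: local loads, broadcast stores (slide 18).

class ReplicatedL1D:
    def __init__(self, num_banks):
        self.banks = [dict() for _ in range(num_banks)]   # one replica per cluster group

    def load(self, bank_id, addr):
        return self.banks[bank_id].get(addr)              # local access only

    def store(self, addr, value):
        for bank in self.banks:                           # broadcast: every replica is written
            bank[addr] = value
```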

  19. Comparing Centralized & Decentralized
  [Diagram: the centralized and decentralized organizations side by side. IPCs without cluster prefetch: 1.43 and 1.52; IPCs with cluster prefetch: 1.73 and 1.79]

  20. Sensitivity Analysis
  • Results verified for processor models with varying resources and interconnect latencies
  • Evaluations on SPEC-Int: the address prediction rate is only 38% → modest speedups:
    • twolf (7%), parser (9%)
    • crafty, gcc, vpr (3-4%)
    • rest (< 2%)

  21. Related Work
  • Modest speedups with decentralized caches: Racunas and Patt [ICS ’03], for dynamic clustered processors; Gibert et al. [MICRO ’02], for VLIW clustered processors
  • Gibert et al. [MICRO ’03]: compiler-managed L0 buffers for critical data

  22. Conclusions
  • Address prediction and memory dependence speculation can hide the latency to cache banks; prediction rate of 66% for SPEC-FP and IPC improvement of 21%
  • Additional benefits from decentralization are modest
  • Future work: build better predictors, impact on power consumption [WCED ’04]

