This presentation examines how the latency tolerance of loads can be exploited to relax cache design constraints. As the memory-processor frequency gap widens, larger caches are needed to hide access latencies, yet larger caches also take longer to access. We compare cache configurations that prioritize critical loads against faster and larger conventional configurations, review methods for determining load criticality and their practical applicability, and report results with run-time heuristics aimed at high-demand processing environments; prior work suggests a potential performance improvement of up to 40%.
Exploiting Load Latency Tolerance for Relaxing Cache Design Constraints. Ramu Pyreddy, Gary Tyson. Advanced Computer Architecture Laboratory, University of Michigan
Motivation • Increasing Memory – Processor Frequency Gap • Large Data Caches to hide Long Latencies • Larger Caches – Longer Access Latencies [McFarland 98] • Processor Cycle Time determines Cache Size • Intel Pentium III – 16K DL1 Cache, 3-cycle access • Intel Pentium 4 – 8K DL1 Cache, 2-cycle access • Need Large AND Fast Caches!
Related Work • Load Latency Tolerance [Srinivasan & Lebeck, MICRO 98] • All Loads are NOT equal • Determining Criticality – Very Complex • Sophisticated Simulator with Rollback • Non-Critical Buffer [Fisk & Bahar, ICCD 99] • Determining Criticality – Performance Degradation/Dependency Chains • Non-Critical Buffer – Victim Cache for non-critical loads • Small Performance Improvements (up to 4%)
Related Work (contd.) • Locality vs. Criticality [Srinivasan et al., ISCA 01] • Determining Criticality – Practical Heuristics • Potential for Improvement – 40% • Locality is better than Criticality • Non-Vital Loads [Rakvic et al., HPCA 02] • Determining Criticality – Run-time Heuristics • Small and fast Vital Cache for Vital Loads • 17% Performance Improvement
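To make the routing idea behind the non-critical buffer and the vital cache concrete, here is a minimal C sketch: loads classified as critical are serviced by a small, fast structure, while non-critical loads are sent to the slower conventional DL1. The structure names, sizes, latencies, and the is_critical() lookup are illustrative assumptions, not the actual mechanisms of the cited papers.

/*
 * Hedged sketch of criticality-based load routing, in the spirit of the
 * non-critical buffer and vital cache proposals.  Names, latencies, and
 * the criticality table are illustrative assumptions.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    const char *name;
    int         access_latency;   /* cycles */
} cache_t;

static cache_t small_fast_cache = { "critical/vital cache", 1 };
static cache_t large_slow_cache = { "conventional DL1",     3 };

/* Hypothetical per-PC criticality table filled by a run-time heuristic. */
static bool criticality_table[4096];

static bool is_critical(uint64_t load_pc)
{
    return criticality_table[load_pc % 4096];
}

/* Route a load to the structure matching its criticality class. */
static int issue_load(uint64_t load_pc)
{
    cache_t *target = is_critical(load_pc) ? &small_fast_cache
                                           : &large_slow_cache;
    printf("load at pc=0x%llx -> %s (%d cycles)\n",
           (unsigned long long)load_pc, target->name, target->access_latency);
    return target->access_latency;
}

int main(void)
{
    criticality_table[0x40 % 4096] = true;   /* pretend this PC was flagged critical */
    issue_load(0x40);                        /* critical: fast structure */
    issue_load(0x80);                        /* non-critical: tolerates the slower DL1 */
    return 0;
}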
Criticality • Criticality – Effect of Load Latency on Performance • Two Thresholds – Performance and Latency • A Very Direct Estimation of Criticality • Computationally Intensive! • Static
Determining Criticality – A Closer Look • IPC Threshold = 99.6% • Latency Threshold = 8 cycles
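A minimal sketch of the two-threshold test described on this slide, assuming the simulator can report IPC with a single static load artificially slowed. The simulate_ipc_with_slow_load() stub and its return values are made up for illustration; in a real study they would come from a detailed out-of-order simulator rerun (or rollback and replay).

/*
 * Two-threshold criticality test: a load is critical if slowing only that
 * load by LATENCY_THRESHOLD cycles drags overall IPC below IPC_THRESHOLD
 * of the baseline.  All numbers here are illustrative placeholders.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define IPC_THRESHOLD      0.996   /* keep at least 99.6% of baseline IPC */
#define LATENCY_THRESHOLD  8       /* extra cycles injected into one load */

static double baseline_ipc = 2.00;

/* Stand-in for a simulator run with one static load slowed by extra cycles. */
static double simulate_ipc_with_slow_load(uint64_t load_pc, int extra_latency)
{
    (void)extra_latency;
    return (load_pc == 0x40) ? 1.85 : 1.999;   /* pretend pc 0x40 hurts IPC */
}

static bool load_is_critical(uint64_t load_pc)
{
    double ipc = simulate_ipc_with_slow_load(load_pc, LATENCY_THRESHOLD);
    return ipc < IPC_THRESHOLD * baseline_ipc;
}

int main(void)
{
    printf("load 0x40 critical? %d\n", load_is_critical(0x40));  /* prints 1 */
    printf("load 0x80 critical? %d\n", load_is_critical(0x80));  /* prints 0 */
    return 0;
}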
Effectiveness? • Load Reference Distribution • What Percentage of Loads is Identified as Critical • Miss Rate for Critical Load References • Critical Cache Configuration compared with a Faster Conventional Cache Configuration • DL1/DL2 Latencies – 3/10, 6/20, 9/30 cycles • Critical Cache Configuration compared with a Larger Conventional Cache Configuration • DL1 Sizes – 8KB, 16KB, 32KB, 64KB
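The two headline metrics above reduce to simple ratios over simulator counters; the sketch below shows the arithmetic with placeholder counts (the real distributions come from the benchmark runs, not from these numbers).

/*
 * Effectiveness metrics: fraction of load references classified as
 * critical, and the miss rate among critical references only.
 * Counter values are placeholders for illustration.
 */
#include <stdio.h>

int main(void)
{
    unsigned long total_loads     = 100000000UL;
    unsigned long critical_loads  =  28000000UL;  /* classified critical     */
    unsigned long critical_misses =   1400000UL;  /* DL1 misses among them   */

    double critical_fraction = (double)critical_loads  / total_loads;
    double critical_missrate = (double)critical_misses / critical_loads;

    printf("critical load references: %.1f%% of all loads\n",
           100.0 * critical_fraction);
    printf("miss rate for critical references: %.2f%%\n",
           100.0 * critical_missrate);
    return 0;
}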
Processor Configuration • Similar to the Alpha 21264, modeled with SimpleScalar 3.0 [Austin & Burger 97]
Results – Comparison with a Faster Conventional Cache Configuration • IPCs normalized to the 16K, 1-cycle Configuration • 25-66% of the Penalty due to a slower cache is eliminated
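As a worked example of how the "penalty eliminated" figure can be derived from normalized IPCs: the three IPC values below are placeholders, and only the formula reflects the metric quoted on these result slides.

/*
 * Fraction of the slow-cache penalty recovered by the critical-cache
 * configuration, computed from IPCs normalized to the ideal baseline.
 * IPC values are illustrative, not measurements from the paper.
 */
#include <stdio.h>

int main(void)
{
    /* IPCs normalized to the ideal 16K, 1-cycle configuration (= 1.00). */
    double ipc_fast_ideal   = 1.00;   /* 16K DL1, 1-cycle access            */
    double ipc_slow_conv    = 0.85;   /* same size, slower conventional DL1 */
    double ipc_critical_cfg = 0.94;   /* critical-cache configuration       */

    /* Share of the penalty (ideal - slow) recovered by the critical cache. */
    double penalty_eliminated =
        (ipc_critical_cfg - ipc_slow_conv) / (ipc_fast_ideal - ipc_slow_conv);

    printf("penalty eliminated: %.0f%%\n", 100.0 * penalty_eliminated);  /* 60% */
    return 0;
}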
Results – Comparison with a Faster Conventional Cache Configuration • IPCs normalized to the 32K, 1-cycle Configuration • 25-70% of the Penalty due to a slower cache is eliminated
Results – Comparison with a Larger Conventional Cache Configuration • IPCs normalized to the 16K, 3-cycle Configuration
Results – Comparison with a Larger Conventional Cache Configuration • IPCs normalized to the 32K, 6-cycle Configuration • The Critical Cache Configuration outperforms a larger conventional cache
Conclusions & Future Work • Conclusions • Compares well with a faster conventional cache • Outperforms a larger conventional cache in most cases • Future Work • More heuristics to refine "criticality" • Why are "critical loads" critical? • Criticality of a memory address vs. criticality of a load instruction • Criticality for low-power Caches