This paper presents a dynamic spatial pattern prediction approach that improves cache efficiency by predicting memory access patterns. It exploits variation in spatial locality, within and across applications, to decide which parts of a cache block to keep powered and which neighboring blocks to prefetch. The Spatial Pattern Predictor (SPP) reaches about 95% coverage with a 256-entry table, reduces L1 data cache leakage energy by about 40% on average (up to 73%) with less than 1% average performance degradation, and delivers up to 2x speedup on some applications through prefetching.
Accurate and Complexity-Effective Spatial Pattern Prediction
Chi Chen, Se-Hyun Yang, Babak Falsafi, Andreas Moshovos
Motivation: Variation in Spatial Locality
• Caches Exploit Spatial Locality via Block Size
  • Prefetching Nearby Data Improves Performance
• “One Size Fits All” Solution
  • Large enough for prefetching
  • Small enough to avoid memory link saturation
• Opportunity: Variation Within and Across Applications
• If the “Best Block Size” were known:
  • Prefetch even further for Higher Performance
  • “Turn off” unused data in the cache for Lower Leakage Power
This Work
• Dynamic Spatial Pattern Prediction
• Leakage Power Reduction
  • Sub-blocks of a block as a Group
  • Place “unused” block parts in low-leakage state
• Prefetching
  • Consecutive Memory Blocks as a Group
  • Selectively Prefetch Blocks Upon First Access in Group
• Key Contribution: PC + Offset Within Group
  • Quick Learning
  • Compact Representation
  • High Coverage
How Well It Works
• Spatial Pattern Predictor (SPP)
  • 256-entry Tag-Less Direct-Mapped
  • ~95% coverage
• L1 Data Leakage Energy Reduction
  • ~40% reduction w/ 70nm CMOS technology
  • < 1% average performance degradation
• Prefetching w/ 1024-byte Group
  • Up to 2x speedup, 56% on average
  • Conventional Cache: 14% Slowdown
Outline
• Conventional Cache: Optimization Opportunities
• Variation in Spatial Locality
• Prediction Framework
• Prior Work
• Results
Optimization Opportunity #1: Conventional Cache

    typedef struct person {
        char name[20];
        …
        int age;
        int isAdult;
        struct person* next;
    } person; // total 64 bytes

    // do something …
    while ( people ) {
        if ( people->age >= 21 )
            people->isAdult = TRUE;
        people = people->next;
    }

L1D with 64-Byte cache lines
• Each list node misses in the cache, but only age, isAdult, and next are touched
• The rest of each line is resident but untouched: Wasteful Leakage
Optimization Opportunity #2: Conventional Cache

    struct person {
        char name[20];
        …
        int age;
        int isAdult;
    } people[LARGE];

    // do something …
    for ( i = 0; i < LARGE; i++ ) {
        if ( people[i].age >= 21 )
            people[i].isAdult = TRUE;
    }

L1D with 64-Byte cache lines
[Figure: cache lines arranged into Group #1 and Group #2, with age and isAdult touched in each line]
• Detect Access Patterns at Group Level
• Selectively Prefetch Same Block Members
• Improve Performance w/o Saturating Memory
Variation in Spatial Locality
[Figure: average line usage per benchmark, with cache lines broken down by the fraction touched (1/8 to 8/8), for facerec, gcc, mcf, and vortex]
• Fraction of data used before eviction
• Measured on 64KB 2-way L1D w/ 64B cache lines
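The "fraction of data used before eviction" metric can be sketched as follows. This is only an illustration: it assumes each 64-byte line is tracked as eight 8-byte sub-blocks with one touched bit each (matching the 1/8 to 8/8 buckets in the figure legend), and the function names are hypothetical, not taken from the slides.

```c
#include <stdint.h>

/* Assumption: a 64-byte line is tracked as 8 sub-blocks of 8 bytes,
   with one "touched" bit per sub-block (bit i = sub-block i). */
enum { SUBBLOCKS_PER_LINE = 8 };

/* Number of sub-blocks touched before the line was evicted (0..8). */
int touched_subblocks(uint8_t touched_mask)
{
    int n = 0;
    while (touched_mask) {
        touched_mask &= (uint8_t)(touched_mask - 1); /* clear lowest set bit */
        n++;
    }
    return n;
}

/* Average line usage over a set of evicted lines, as a fraction in [0, 1]. */
double average_line_usage(const uint8_t *masks, int count)
{
    long touched = 0;
    if (count <= 0)
        return 0.0;
    for (int i = 0; i < count; i++)
        touched += touched_subblocks(masks[i]);
    return (double)touched / ((double)count * SUBBLOCKS_PER_LINE);
}
```

Under these assumptions, one fully used line and one half-used line average out to a line usage of 0.75.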
Prediction Framework
• Minimum Fetch Unit (MFU): replacement unit of cache, e.g., cache line or sub-block
• Spatial Group: group of adjacent MFUs, indexed by logical tag
• Spatial Pattern: reference pattern of a spatial group
• Spatial Group Generation: starts with a new logical tag
[Figure: timeline of references labeled Tag0 and Tag1, with the spatial pattern bits recorded for each group]
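The framework terms above can be made concrete with a small sketch. The parameters are assumptions chosen for illustration (64-byte MFUs, 16 MFUs per spatial group, i.e. the 1024-byte group used in the prefetching results), and the function names are hypothetical.

```c
#include <stdint.h>

/* Assumed parameters: 64-byte MFUs and 16 MFUs per spatial group
   (a 1024-byte group). Other configurations work the same way. */
enum { MFU_BYTES = 64, MFUS_PER_GROUP = 16 };

/* Logical tag: identifies the spatial group that an address falls in. */
uint64_t group_tag(uint64_t addr)
{
    return addr / ((uint64_t)MFU_BYTES * MFUS_PER_GROUP);
}

/* Index of the MFU within its spatial group (0..15). */
unsigned group_offset(uint64_t addr)
{
    return (unsigned)((addr / MFU_BYTES) % MFUS_PER_GROUP);
}

/* Record one reference in a spatial pattern: bit i set means MFU i
   of the group was referenced during the current generation. */
uint16_t record_access(uint16_t pattern, uint64_t addr)
{
    return (uint16_t)(pattern | (1u << group_offset(addr)));
}
```

A reference whose `group_tag` differs from the tag being tracked would start a new spatial group generation with an empty pattern.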
Spatial Pattern Prediction
• Current Pattern Table (CPT) records patterns
• Pattern History Table (PHT) stores captured patterns
• Prediction Index (32 bits): PC + SPG Offset
[Figure: the Spatial Pattern Predictor beside the data cache. Each CPT entry holds a spatial pattern register and a PHT entry pointer; the PHT is looked up by prediction index and holds the spatial pattern history.]
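A minimal sketch of how the two tables might interact is shown below. It assumes a 256-entry tag-less direct-mapped PHT and 16-bit patterns; the index hash, the struct layouts, and all names are hypothetical, since the slides only state that the index is formed from the PC and the SPG offset.

```c
#include <stdint.h>

enum { PHT_ENTRIES = 256 };

/* Pattern History Table: tag-less, direct-mapped, one pattern per entry. */
typedef struct {
    uint16_t pht[PHT_ENTRIES];
} spp_t;

/* One in-flight spatial group generation (a CPT entry in the slides). */
typedef struct {
    uint32_t pht_index; /* PHT entry to update when the generation ends */
    uint16_t pattern;   /* MFUs referenced so far in this generation */
} cpt_entry_t;

/* Hypothetical hash of PC and SPG offset into the PHT. */
uint32_t spp_index(uint64_t pc, unsigned offset)
{
    return (uint32_t)((((pc >> 2) << 4) + offset) % PHT_ENTRIES);
}

/* First access to a group: read the predicted pattern, start recording. */
uint16_t spp_begin(const spp_t *spp, cpt_entry_t *cur, uint64_t pc, unsigned offset)
{
    cur->pht_index = spp_index(pc, offset);
    cur->pattern = (uint16_t)(1u << offset);
    return spp->pht[cur->pht_index];
}

/* Later access within the same generation: extend the current pattern. */
void spp_access(cpt_entry_t *cur, unsigned offset)
{
    cur->pattern = (uint16_t)(cur->pattern | (1u << offset));
}

/* End of generation: store the captured pattern for future predictions. */
void spp_end(spp_t *spp, const cpt_entry_t *cur)
{
    spp->pht[cur->pht_index] = cur->pattern;
}
```

In this sketch the predictor learns after a single generation: the next time the same PC touches the same offset first, the previously captured pattern is returned as the prediction.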
Prior Work
• Static profiling, V. Vleet, et al. ICCD 1999
• Adjustable block size, Dubnicki & LeBlanc. ISCA 1992
• Fetching adjacent cache lines, Temam & Jegou. ICS 1994
• Dual cache, Gonzalez, Aliagas & Valero. ICS 1995
• Spatial Locality Detection Table, Johnson, Merten & Hwu. MICRO 1998
• Spatial Footprint Predictor (SFP), Kumar & Wilkerson. ISCA 1998
Key Difference is Prediction Handle: PC + Group Offset
1. Compact Representation
2. Quick Learning
3. High Coverage
Results Overview
• Predictor Performance Statistics
• Leakage Power Reduction
• Performance Improvement w/ Prefetching
Methodology
• SimpleScalar simulator
  • 64KB 2-way L1D/L1I cache, 2-cycle latency
  • 2MB 8-way L2 cache, 12-cycle latency
• SPEC CPU2000
  • Alpha binaries + reference inputs
• Predictor performance evaluation
  • Simulated to completion
• Performance impact evaluation
  • Skipped the first 10B instructions and simulated the next 500M
• Energy reduction evaluation
  • SPICE w/ 70nm CMOS technology & 1V supply voltage
Practical Predictor: Performance
[Figure: prediction outcomes as a percentage of perfect predictions (training, over-prediction, under-prediction, correct prediction) for 256-entry predictors, A: 16-way, B: DM, C: FA, on facerec, gcc, mcf, and vortex]
• 256-entry tag-less direct-mapped
• Average prediction accuracy of 96%
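The outcome categories in the figure can be illustrated by comparing a predicted pattern with the pattern actually observed. The sketch below assumes 16-bit patterns and counts MFUs per category; the names are hypothetical, and the "training" category (no usable prediction yet) is not modeled.

```c
#include <stdint.h>

/* Per-generation outcome counts, in MFUs. */
typedef struct {
    int correct; /* predicted and actually referenced */
    int over;    /* predicted but never referenced */
    int under;   /* referenced but not predicted */
} outcome_t;

/* Count the set bits in a 16-bit pattern. */
int popcount16(uint16_t x)
{
    int n = 0;
    while (x) {
        x &= (uint16_t)(x - 1); /* clear lowest set bit */
        n++;
    }
    return n;
}

/* Classify each MFU of a spatial group by comparing the two patterns. */
outcome_t score_prediction(uint16_t predicted, uint16_t actual)
{
    outcome_t o;
    o.correct = popcount16((uint16_t)(predicted & actual));
    o.over    = popcount16((uint16_t)(predicted & ~actual));
    o.under   = popcount16((uint16_t)(actual & ~predicted));
    return o;
}
```

Over-predictions cost bandwidth or leakage for data that is never used, while under-predictions show up as extra misses, which is why the figure reports them separately.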
Predictor Applications
• Leakage energy reduction
  • Sub-blocks as minimum fetch units
  • Cache lines as spatial groups
  • A cache miss starts a spatial group generation
  • Assuming Gated-Ground by Agarwal, Li, & Roy
• Spatial group prefetcher
  • Cache lines as minimum fetch units
  • Adjacent cache lines grouped into spatial groups
  • A new logical tag starts a spatial group generation
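For the spatial group prefetcher, the step from a predicted pattern to prefetch requests might look like the sketch below. It assumes 64-byte lines and 16 lines per group (a 1024-byte group); the function name and interface are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed parameters: 64-byte lines, 16 lines per spatial group. */
enum { LINE_BYTES = 64, LINES_PER_GROUP = 16 };

/* Fill out[] with the addresses of the lines that the predicted pattern
   marks as used in the group containing miss_addr, skipping the line that
   triggered the prediction (it is being fetched anyway).
   Returns the number of addresses written. */
size_t prefetch_candidates(uint64_t miss_addr, uint16_t pattern,
                           uint64_t out[LINES_PER_GROUP])
{
    const uint64_t group_bytes = (uint64_t)LINE_BYTES * LINES_PER_GROUP;
    const uint64_t base = miss_addr - (miss_addr % group_bytes);
    const unsigned trigger = (unsigned)((miss_addr % group_bytes) / LINE_BYTES);
    size_t n = 0;

    for (unsigned i = 0; i < LINES_PER_GROUP; i++) {
        if (i != trigger && (pattern & (1u << i)))
            out[n++] = base + (uint64_t)i * LINE_BYTES;
    }
    return n;
}
```

The leakage application would use the same kind of bitmask at sub-block granularity: sub-blocks whose bit is clear in the predicted pattern are the ones placed in the low-leakage state.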
Leakage Energy Reduction
[Figure: relative leakage power and execution time increase for facerec, gcc, mcf, vortex, and the average (AVG)]
• Up to 73% leakage energy reduction
• ~40% average leakage energy reduction
• < 1% average performance degradation
Performance Improvement
• Up to 2x speedup with 1024B spatial groups
• ~60% average speedup with 1024B spatial groups
Summary
• Spatial Pattern Predictor (SPP)
  • Key Contribution: PC + Group Offset
  • Small and Effective, High Coverage
  • 256-entry Tag-Less Direct-Mapped
  • ~95% coverage
• L1 Data Leakage Energy Reduction
  • ~40% reduction w/ 70nm CMOS technology
  • < 1% average performance degradation
• Prefetching w/ 1024-byte Group
  • Up to 2x speedup, 56% on average
  • Conventional Cache: 14% Slowdown
Prediction Index
[Figure: prediction outcomes (training, over-prediction, under-prediction, correct prediction) with infinite tables for four prediction indexes, A: PC, B: PC + SPG ID, C: PC + SPG offset, D: PC + ADDR, on facerec, gcc, mcf, and vortex]
• Infinite Tables
• PC + SPG offset yields high prediction accuracy
• PC + SPG offset has low prediction memory requirements
Contributions
• Spatial Pattern Predictor (SPP)
  • 256-entry Tag-Less Direct-Mapped
  • ~95% coverage
• Leakage Energy Reduction
  • ~40% reduction w/ 70nm CMOS technology
  • < 1% average performance degradation
• Processor Performance Improvement
  • Up to 2x speedup
Variations in Spatial Locality
• Fraction of data used before eviction
• Measured on 64KB 2-way L1D w/ 64B cache lines
Prediction Index
• PC + SPG offset yields high prediction accuracy
• PC + SPG offset has low prediction memory requirements
Predictor Memory Organization
• 256-entry tag-less direct-mapped yields average prediction accuracy of 96%
Leakage Energy Reduction
• Up to 73% leakage energy reduction
• ~40% average leakage energy reduction
• < 1% average performance degradation
Performance Improvement
• Up to 2x speedup with 1024B spatial groups
• ~60% average speedup with 1024B spatial groups
Predictor Memory Organization
[Figure: prediction outcomes (training, over-prediction, under-prediction, correct prediction) for six organizations, A: 128-entry 16-way, B: 128-entry DM, C: 128-entry FA, D: 256-entry 16-way, E: 256-entry DM, F: 256-entry FA, on facerec, gcc, mcf, and vortex]
• 256-entry tag-less direct-mapped
• Average prediction accuracy of 96%