This presentation delves into the innovative region-centric memory design approach developed by the AENAO research group at the University of Toronto, led by Patrick Akl and team. It explores how coarse-grain memory tracking and optimizations, like snoop coherence and region scouting, can enhance cache efficiency, reduce power consumption, and improve overall system performance. By examining memory access patterns and leveraging the concept of non-shared regions, this work presents significant advancements in the architecture of on-chip caches, setting a foundation for future CPU designs.
Region-Centric Memory Design AENAO Research Group Patrick Akl, M.A.Sc. Ioana Burcea, Ph.D. C. Myrto Papadopoulou, M.A.Sc. C. Elham Safi, Ph.D. C. Jason Zebchuk, M.A.Sc. C. Andreas Moshovos {pakl, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu
Future On-Chip Caches: Just Larger?
Observe and Exploit Memory Access Behavior at a Coarse Grain
[Figure: CPUs, each with an I$ and a D$, attached through an interconnect to a block labeled "10s – 100s of MB" and to Main Memory]
Aenao Group/Toronto
Conventional Block-Centric Memory Hierarchy
• “Small” Blocks
• Performance and Bandwidth
• Several optimizations exist
• Big picture is lost
[Figure label: Conventional Fine-Grain Tracking]
“Big Picture” View
Supplemental Coarse-Grain Tracking
• Region: 2^n-sized, aligned memory area
• Concept already in use: TLBs
• Patterns Emerge in Space / Time
• Exploit for performance & power
• Expose to software
This Presentation
• Examples of Coarse-Grain Optimizations
  • Snoop Coherence
  • Thread-level speculation disambiguation
• Region-Centric Memory Design
  • RegionTracker Cache
  • Snoop Coherence Revisited
• Current Activities
  • Coherence Delegation
  • Predictor Virtualization
An Example: Snoop Coherence
• Conventional Considerations: Complexity and Correctness, NOT Power/Bandwidth
• Can we: (1) Reduce Power/Bandwidth (2) Leverage snoop coherence?
• Remains Attractive: Simple / Design Re-use
• Yes: Exploit Program Behavior to Dynamically Identify Requests that do not Need Snooping
[Figure: three CPUs, each with I$ and D$, connected through an interconnect to Main Memory]
Coherence Basics
• Given request for memory block X (address)
• Detect where current value resides
[Figure: three CPUs over Main Memory; a request for X triggers snoops, one of which hits]
Conventional Coherence not Power-Aware/Bandwidth-Effective
• All L2 tags see all accesses
• Perf. & Complexity: Have L2 tags, why not use them?
• Power: All L2 tags consume power on all accesses
• Bandwidth: broadcast all coherent requests
[Figure: three CPUs with L2s over Main Memory; a request misses in the other L2s]
RegionScout Motivation: Sharing is Coarse
• Region: large contiguous memory area, power-of-2 size
• CPU X asks for data block in region R
  • No one else has X
  • No one else has any block in R
• RegionScout Exploits this Behavior
• Layered Extension over Snoop Coherence
[Figure: Typical Memory Space Snapshot, addresses colored by owner(s)]
Optimization Opportunities
• Power and Bandwidth
• Originating node: avoid asking others
• Remote node: avoid tag lookup
[Figure: three CPUs, each with I$ and D$, connected through a SWITCH to Memory]
Potential: Region Miss Frequency
[Figure: Global Region Misses as % of all requests vs. Region Size]
Even with a 16K Region, ~45% of requests miss in all remote nodes
RegionScout at Work: Non-Shared Region Discovery
First request detects a non-shared region
[Figure: steps 1 to 3; a request produces a Region Miss at each remote CPU, giving a Global Region Miss; nodes keep records of Non-Shared Regions and Locally Cached Regions]
RegionScout at Work: Avoiding Snoops
Subsequent request avoids snoops
[Figure: steps 1 and 2; three CPUs over Main Memory, with the Global Region Miss recorded; nodes keep records of Non-Shared Regions and Locally Cached Regions]
RegionScout is Self-Correcting
Request from another node invalidates non-shared record
[Figure: steps 1 and 2; three CPUs over Main Memory; nodes keep records of Non-Shared Regions and Locally Cached Regions]
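The three "RegionScout at Work" slides (discovery, snoop avoidance, self-correction) can be walked through with a toy model. This is only a sketch under assumptions: the `Node` and `System` classes, the 1KB region size, and the broadcast counter are illustrative names and choices, evictions are ignored, and the real hardware uses the small tables described on the following slides rather than Python sets.

```python
REGION_BITS = 10  # assumed 1KB regions; any power-of-two size works


class Node:
    def __init__(self):
        self.cached_regions = {}  # region -> number of locally cached blocks
        self.non_shared = set()   # regions this node believes nobody else caches

    def snoop(self, region):
        """Remote side of a broadcast: drop any stale non-shared record
        (self-correction) and report whether a block of the region is cached."""
        self.non_shared.discard(region)
        return self.cached_regions.get(region, 0) > 0


class System:
    def __init__(self, n_nodes):
        self.nodes = [Node() for _ in range(n_nodes)]
        self.broadcasts = 0

    def request(self, node_id, addr):
        node = self.nodes[node_id]
        region = addr >> REGION_BITS
        if region not in node.non_shared:
            # Region status unknown: broadcast and collect region hit/miss.
            self.broadcasts += 1
            hits = [o.snoop(region) for o in self.nodes if o is not node]
            if not any(hits):
                # Global region miss: remember the region as non-shared.
                node.non_shared.add(region)
        # Known non-shared regions skip the broadcast entirely.
        node.cached_regions[region] = node.cached_regions.get(region, 0) + 1


system = System(3)
system.request(0, 0x000)  # first request: broadcast, discovers non-shared region
system.request(0, 0x040)  # same region: no broadcast needed
system.request(1, 0x080)  # another node asks: node 0 drops its record
system.request(0, 0x0C0)  # node 0 must broadcast again
print(system.broadcasts)
```

In this run only the second request avoids a broadcast; the more requests a node makes to a region nobody else touches, the larger the saving.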
Implementation: Requirements
• Requesting Node provides address:
  • At Originating Node – from CPU: Have I discovered that this region is not shared?
  • At Remote Nodes – from Interconnect: Do I have a block in the region?
[Figure: address split into Region Tag and offset; the offset is lg(Region Size) bits wide]
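Both questions on this slide start from the same split of the address into a Region Tag and an offset of lg(Region Size) bits. A minimal sketch of that split, assuming 16K regions (one size the slides mention) and hypothetical helper names:

```python
REGION_SIZE = 16 * 1024                     # bytes; must be a power of two
OFFSET_BITS = REGION_SIZE.bit_length() - 1  # lg(Region Size) = 14 here


def split_address(addr):
    """Return (region_tag, offset) for a byte address."""
    return addr >> OFFSET_BITS, addr & (REGION_SIZE - 1)


def same_region(a, b):
    """True when two addresses fall in the same aligned region."""
    return (a >> OFFSET_BITS) == (b >> OFFSET_BITS)


print(split_address(0x12345))
```

Because regions are aligned and power-of-two sized, the region check is a shift and a compare, with no arithmetic on region boundaries.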
Remembering Non-Shared Regions
• Non-Shared Region Table: records non-shared regions
• Lookup by Region portion prior to issuing a request
• Snoop requests and invalidate
• Few entries: 16x4 in most experiments
[Figure: the Region Tag portion of the address indexes the Non-Shared Region Table; each entry has a valid bit]
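A small table like this can be sketched as a set-associative structure. The sketch below reads "16x4" as 16 sets of 4 ways, which is an assumption, as are the LRU replacement policy, the modulo set index, and all method names; the slide only states that the table is looked up before issuing a request and invalidated by snooped requests.

```python
from collections import OrderedDict


class NonSharedRegionTable:
    def __init__(self, sets=16, ways=4):
        self.sets = [OrderedDict() for _ in range(sets)]
        self.ways = ways

    def _set_for(self, region):
        return self.sets[region % len(self.sets)]

    def record(self, region):
        """Remember a region after a global region miss."""
        entries = self._set_for(region)
        entries[region] = True
        entries.move_to_end(region)
        if len(entries) > self.ways:
            entries.popitem(last=False)  # evict the least recently recorded entry

    def is_non_shared(self, region):
        """Checked before issuing a request; a hit means no broadcast is needed."""
        return region in self._set_for(region)

    def invalidate(self, region):
        """Called when a snooped request from another node touches the region."""
        self._set_for(region).pop(region, None)
```

Losing an entry to replacement is safe in this model: the node simply broadcasts again and may rediscover the region as non-shared.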
What Regions are Locally Cached?
• If we had as many counters as regions:
  • Block Allocation: counter[region]++
  • Block Eviction: counter[region]--
  • Region cached only if counter[region] non-zero
• Not Practical:
  • E.g., 16K Regions and 4G Memory → 256K counters
[Figure: the Region Tag portion of the address selects a counter]
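The idealized scheme on this slide, one counter per region, can be written down directly, together with the arithmetic behind the 256K figure. The class and method names are illustrative only, and a Python `Counter` stands in for what would be a full counter array in hardware.

```python
from collections import Counter

REGION_BITS = 14  # 16K regions


class ExactRegionCounters:
    def __init__(self):
        self.count = Counter()  # region -> number of cached blocks

    def allocate(self, block_addr):
        self.count[block_addr >> REGION_BITS] += 1

    def evict(self, block_addr):
        region = block_addr >> REGION_BITS
        self.count[region] -= 1
        if self.count[region] == 0:
            del self.count[region]

    def region_cached(self, addr):
        # A region is locally cached only if its counter is non-zero.
        return self.count[addr >> REGION_BITS] > 0


# Why this is impractical: one counter per region of a 4G address space.
counters_needed = (4 * 2**30) // (16 * 2**10)
print(counters_needed)  # 262144, i.e. 256K counters
```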
What Regions are Locally Cached?
• Imprecise:
  • Records a superset of locally cached Regions
  • False positives: lost opportunity, correctness preserved
• Small: e.g., 256 entries for 1M cache
• Power-Optimized structures described in the paper
[Figure: the Region Tag passes through hash() to select a counter]
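Hashing many regions onto a few counters gives the imprecise but safe structure described here. In the sketch below the class name `CachedRegionHash` is a guess based on the "CRH" label that appears on the results slide, and the modulo hash is an assumption; the slides do not specify the hash function.

```python
class CachedRegionHash:
    def __init__(self, entries=256, region_bits=14):
        self.entries = entries
        self.region_bits = region_bits
        self.counters = [0] * entries

    def _index(self, addr):
        # Assumed hash: region tag modulo the number of counters.
        return (addr >> self.region_bits) % self.entries

    def allocate(self, addr):
        self.counters[self._index(addr)] += 1

    def evict(self, addr):
        self.counters[self._index(addr)] -= 1

    def may_be_cached(self, addr):
        """False means definitely not cached; True may be a false positive."""
        return self.counters[self._index(addr)] != 0


crh = CachedRegionHash(entries=4)
crh.allocate(1 << 14)                # cache a block of region 1
print(crh.may_be_cached(5 << 14))    # region 5 aliases with region 1
print(crh.may_be_cached(2 << 14))    # region 2 maps to an empty counter
```

A false positive only makes the node answer "region hit" when it could have answered "region miss", so a filtering opportunity is lost but coherence is never violated.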
LFSR-Based Implementation
• Linear-Feedback Shift Register Array
• Increment/Decrement/Is Zero?
• 130nm commercial technology
• ISLPED ’06
• Faster: 1.6x to 3.7x
• More Energy Efficient: 1.4x to 2.3x
• But Area: 3.2x
[Figure: the Region Tag passes through hash() to select an LFSR, followed by a Zero Detector]
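An LFSR can act as a counter because the three operations the slide lists do not need binary arithmetic: incrementing is one shift forward along the LFSR sequence, decrementing is one shift backward, and "is zero" is a comparison with the starting state. The sketch below assumes a 4-bit register with feedback from the two top bits (a maximal-length choice that cycles through 15 states); the width, taps, and class name are illustrative and not taken from the ISLPED ’06 design.

```python
class LFSRCounter:
    SEED = 0b0001  # the state that represents a count of zero

    def __init__(self):
        self.state = self.SEED

    def increment(self):
        # Forward step: shift left, feeding in the XOR of bits 3 and 2.
        new_bit = ((self.state >> 3) ^ (self.state >> 2)) & 1
        self.state = ((self.state << 1) | new_bit) & 0xF

    def decrement(self):
        # Backward step: the old top bit is the XOR of the current bits 0 and 3.
        old_top = (self.state ^ (self.state >> 3)) & 1
        self.state = (old_top << 3) | (self.state >> 1)

    def is_zero(self):
        return self.state == self.SEED


counter = LFSRCounter()
for _ in range(5):
    counter.increment()
for _ in range(5):
    counter.decrement()
print(counter.is_zero())
```

With 15 states, this counter can represent counts 0 through 14 before wrapping around.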
Filter Rates: SPLASH-II
[Figure: Identified Global Region Misses vs. CRH Size]
Jason Cantin@Wisconsin studied commercial workloads: 40% filter rate
Region-Centric Disambiguation
Joint work w/ Greg Steffan and Mihai Burcea
Patrick Akl, Andreas Moshovos
Speculative Parallelization Models
• Thread level speculation
• Transactional Memory
[Figure: Original code vs. Speculative Parallelization along a time axis, showing a Good Scenario and a Bad Scenario built from "read a", "read b", and "write a"]
Need to Compare Addresses Across Code Pieces
Ex #2: Region-Centric Disambiguation
• Send digest at region level
• Region-conflict
  • Send block-level info
• Reduced bandwidth, potential for performance and power
[Figure: Region-Centric vs. Conventional; in each, Task 1 and Task 2 over the Memory Space]
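One reading of this slide is a two-phase check: tasks first exchange a region-level digest, and block-level addresses are sent only for regions that conflict. The sketch below follows that reading. The function names, the 1KB region size, and the "messages" cost (one unit per region or block address sent) are all assumptions made for illustration, not the authors' protocol or traffic model.

```python
REGION_BITS = 10  # assumed 1KB regions


def regions_of(block_addrs):
    return {addr >> REGION_BITS for addr in block_addrs}


def disambiguate(writes_task1, reads_task2):
    """Return (conflict_found, messages_sent) for two tasks' block addresses."""
    digest = regions_of(writes_task1)           # phase 1: region-level digest
    messages = len(digest)
    suspect = digest & regions_of(reads_task2)
    if not suspect:
        return False, messages                  # no region conflict: done
    # Phase 2: send block-level info only for the conflicting regions.
    blocks = {a for a in writes_task1 if (a >> REGION_BITS) in suspect}
    messages += len(blocks)
    return bool(blocks & set(reads_task2)), messages


# Three writes in region 0, one read in region 1: a single digest entry suffices.
print(disambiguate({0x000, 0x040, 0x080}, {0x400}))
```

When tasks touch disjoint regions, the region-level digest replaces every block-level address, which is where the bandwidth reduction would come from.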
How Much Traffic Can We Save?
• TLS benchmarks from STAMPEDE group (G. Steffan)
• Approximate timing model
[Figure: traffic results per benchmark]
Potential for traffic reduction by 38%
Exploiting Region-Level Information
• Region Coherence Arrays
  • Cantin, Lipasti and Smith
• RegionScout
  • Our work
• Both of these reduce snoop lookups (and broadcasts) in snoop coherence protocols
• Spatial Memory Prefetching
  • Leverages spatial memory patterns for prefetching with commercial workloads
  • Impetus Group at CMU
• Stealth Prefetching
  • Cantin, Lipasti and Smith
Coarse-Grain Techniques Today
• Overhead
  • Storage: e.g., 60% of tags
  • Functionality: Restrict placement, Region Evictions
• Loss of Information
Hard to justify for a commercial design
[Figure: CPU with I$ and D$ above a Conventional Cache (TAGS and DATA) with Auxiliary Tracking]
Rethinking Cache Design
• Can we provide a common substrate for all these optimizations?
• Redesign caches:
  • Regions a first class citizen
  • RegionTracker Cache
[Figure: CPU with I$ and D$ above a cache with Dual-Grain TAGS (Embedded Tracking) and DATA]
RegionTracker Cache
• Goals
  • Expose region behavior
    • Is region X cached?
    • Which blocks are?
  • Facilitate management at the region level
    • Evict/migrate region X
    • Do something with all blocks in X
• Constraints:
  • Data movement only at the block level
  • No increase in area
  • No decrease in performance
  • Complexity
  • Associativity
Region-Based Caches
• Start with conventional 16-way cache and replace tag array
• Sector Caches
  • Hit rate suffers: 20% loss
• Sector Pool Caches
  • High Associativity: 48-way for matching a 16-way cache
• Decoupled-Sector Caches
  • No coarse-grain info
  • Replacements require searching
• No previous design is adequate
• RegionTracker:
  • Meets all requirements
  • But does not save as many tag resources
Sector Cache
• Reduced Area and Power
• Increased miss-rates (2.5% - 96% for 1kB sectors)
• Replacement?
[Figure: D-way Region Tags (labeled RVA) and D-way Data Array]
Sector Pool Cache
• M > D
• Requires highly associative cache to achieve same performance as RegionTracker (~48-way)
[Figure: M-way Region Tags (labeled RVA), a DSR, and D-way Data Array]
Decoupled-Sectored Cache
• Has multiple block evictions
• Requires scanning “status” array
• No simple mechanism to avoid this
• Does NOT expose region-level information
RegionTracker
• In practice L <= D
• Decouple Data and Lookup organizations
• Lower Associativity lookups with no hit-rate penalty
• RegionTracker provides complete solution
[Figure: L-way Region Tags (labeled RVA), a DSR, and D-way Data Array]
RegionTracker Cache
• Block and Region Lookups
• Region Tag + Way Per Block
• Evict Region Blocks Lazily
• Simplify replacement and reduce area
• Status per block + RVA set backpointer
• Can be banked and partitioned
[Figure: components labeled Data Array, RVA, ERB, and BST]
Region-Aware Cache: Performance vs. Area
• Commercial workloads: DB2, Oracle, TPC-C and TPC-H, Apache, Zeus
• SimICS + SimFlex, Sampling, 2K Regions
[Figure: performance vs. area]
RegionTracker-RegionScout
• One bit per Region tag: Known to be not shared
• 1KB Regions, Commercial workloads
• 512KB L2 private caches
Filter 41% of snoops at “Zero Cost” compared to conventional cache
[Figure: Reduction in Broadcasts, including a BlockScout series]
Directory Optimizations: Base Architecture
[Figure: Core, L2 Tags, L2 Data, Directory, L3 Tags, L3 Data, DRAM]
Coherence Delegation
• Eliminate 3-hop overhead
• Attract directory tracking to nodes
[Figure: Requesting Node, Directory Lookup, and Remote L2 containing data, with the Ideal Path marked]
Optimization Engines: Predictors
Predictor Virtualization
[Figure: many CPUs, each with L1-D and L1-I, connected through an Interconnect to L2 and Main Memory]
Motivating Trends
• Chip multiprocessors
  • Space dedicated to predictors × #processors
• Larger predictor table
  • Increased performance
• Memory hierarchies
  • Increased capacities
Use conventional memory hierarchies to store predictor information
PV Architecture
[Figure: Optimization Engine and Predictor Table, with index, entry, and prediction signals]
PV Architecture
[Figure: Optimization Engine and Predictor Virtualization, with index, entry, and prediction signals]
PV Architecture
[Figure: Optimization Engine (index, entry, prediction) served by a PVProxy containing PVCache, MSHR, and PVStart + index; the PVTable resides in L2 and Main Memory]
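The labels on this slide suggest a small on-chip PVCache in front of a full predictor table (PVTable) that lives in the ordinary memory hierarchy at address PVStart + index. The sketch below models that idea only. The class name, the LRU policy, the write-back on eviction, the one-address-per-entry layout, and the dict standing in for L2 and main memory are all assumptions.

```python
from collections import OrderedDict


class VirtualizedPredictor:
    def __init__(self, memory, pv_start, cache_entries=8):
        self.memory = memory            # dict standing in for L2 / main memory
        self.pv_start = pv_start        # base address of the PVTable
        self.cache = OrderedDict()      # PVCache: index -> predictor entry
        self.cache_entries = cache_entries
        self.fetches = 0                # PVTable reads caused by PVCache misses

    def _address(self, index):
        return self.pv_start + index    # assumed: one address unit per entry

    def lookup(self, index):
        if index in self.cache:
            self.cache.move_to_end(index)
            return self.cache[index]
        self.fetches += 1
        entry = self.memory.get(self._address(index))
        self._fill(index, entry)
        return entry

    def update(self, index, entry):
        self._fill(index, entry)

    def _fill(self, index, entry):
        self.cache[index] = entry
        self.cache.move_to_end(index)
        if len(self.cache) > self.cache_entries:
            # Write the least recently used entry back to the PVTable.
            old_index, old_entry = self.cache.popitem(last=False)
            self.memory[self._address(old_index)] = old_entry
```

The optimization engine sees the same index-in, entry-out interface as with a dedicated table, while dedicated on-chip storage shrinks to the PVCache.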
Virtualized Spatial Memory Streaming
Original Prefetcher: Cost: 80KB
Virtualized Prefetcher: Cost: <1KB
Nearly Identical Performance
Region-Centric Memory Design AENAO Research Group Patrick Akl, M.A.Sc. C. Ioana Burcea, Ph.D. C. Myrto Papadopoulou, M.A.Sc. C. Elham Safi, Ph.D. C. Jason Zebchuk, M.A.Sc. C. Andreas Moshovos {pakl, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu
Summary
• Caches are getting larger
• Time to look at the “big picture”
• Region-Centric Memory Design
  • Expose region-level info
  • Allow management at the region-level
• RegionScout
  • Eliminate broadcasts for snoop coherence
• Region-Centric Disambiguation
  • Reduce bandwidth for TLS or TM
• Region-Aware Memory
  • “Same” area and performance as conventional + region info
• Predictor Virtualization