
An Adaptive Shared/Private NUCA Cache Partitioning Scheme for CMPs






Presentation Transcript


  1. An Adaptive Shared/Private NUCA Cache Partitioning Scheme for CMPs Haakon Dybdahl, Per Stenström HPCA 2007

  2. CMP Caching • Extremes: private and shared caches • NUCA organizations • Shared caches: adaptive vs. uncontrolled sharing; pollution issues (Chang and Sohi, ISCA '06) • So, is custom cache partitioning better?

  3. Adaptive Partitioning • Dynamic sharing of last-level caches among cores • Private and shared cache partitions • Who needs more, gets more! • Overall goal – minimize total cache misses

  4. Issues to be considered? • How to estimate private/shared space for a core? • How to share the “shared space” among cores? • Replacement policy for shared spaces?

  5. Private/Shared Cache Partition Size • Private partition : Increase/decrease blocks per set, keep # of sets constant

  6. Private/Shared Cache Partition Size • Shared partition: estimate relative gain • Estimate the misses that would be avoided by adding one block per set • Estimate the misses that would be added by removing one block per set

  7. H/W Support • Core ID • Shadow tags • Counters • Max # of blocks in a set • Cost: shadow tags, core-ID bits, and counters

  8. Relative comparisons • How many cache misses can be avoided? • Shadow tags: one per set per core • Count hits in shadow tags • Count hits to LRU blocks • Re-evaluation every 2000 cycles • Compare core_with_most_hits_to_shadow_tags (1) with core_with_lowest_hits_to_LRU_block (2) • If 1 > 2, one cache block/set is reassigned to core 1
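The re-evaluation step on this slide can be sketched as follows. This is a hedged sketch of the decision logic, not the paper's hardware: the counter names (`shadow_tag_hits`, `lru_block_hits`) and the per-core `blocks_per_set` allocation dict are illustrative names, not identifiers from the paper.

```python
def repartition(shadow_tag_hits, lru_block_hits, blocks_per_set, max_blocks_per_set):
    """One re-evaluation step (run every 2000 cycles per the slide).

    shadow_tag_hits[c]: hits in core c's shadow tags -- misses that would
                        have been avoided with one more block per set.
    lru_block_hits[c]:  hits to core c's LRU blocks -- hits that would be
                        lost with one fewer block per set.
    Returns (gainer, loser) if a block/set is reassigned, else None.
    """
    gainer = max(shadow_tag_hits, key=shadow_tag_hits.get)
    loser = min(lru_block_hits, key=lru_block_hits.get)
    if gainer == loser:
        return None
    # Reassign only if the estimated gain exceeds the estimated loss,
    # the gainer is below its maximum allocation, and the loser still
    # has a block per set to give up.
    if (shadow_tag_hits[gainer] > lru_block_hits[loser]
            and blocks_per_set[gainer] < max_blocks_per_set
            and blocks_per_set[loser] > 0):
        blocks_per_set[gainer] += 1
        blocks_per_set[loser] -= 1
        return gainer, loser
    return None
```

Ties, update intervals, and saturation handling in the real design may differ; the point is that the decision needs only small per-core counters, matching the modest hardware cost listed on slide 7.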

  9. Managing Partitions • Private partition: LRU replacement • Some key events: • Cache hit in private L3: block found, reclassified as MRU • Cache hit in a neighboring L3: all neighboring caches checked in parallel; block brought into the private L3 with LRU replacement; the evicted private block is moved to the shared partition • Cache miss: block fetched from memory and placed in the private partition; the LRU block from the private partition is moved to the shared partition
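The event handling above can be sketched per set. This is a simplified model (assumption: one private region per set shown for a single core, with a shared overflow region); real hardware tracks LRU order with state bits, not Python dicts.

```python
from collections import OrderedDict

class PartitionedSet:
    """One cache set with a private region and a shared overflow region.

    Models the slide's three events: private hit, shared/neighbor hit,
    and miss. OrderedDict keeps LRU order (oldest entry first).
    """
    def __init__(self, private_ways, shared_ways):
        self.private = OrderedDict()
        self.shared = OrderedDict()
        self.private_ways = private_ways
        self.shared_ways = shared_ways

    def access(self, tag):
        if tag in self.private:               # cache hit in private L3
            self.private.move_to_end(tag)     # reclassify as MRU
            return "private_hit"
        if tag in self.shared:                # hit in shared partition
            del self.shared[tag]
            self._insert_private(tag)         # bring block into private
            return "shared_hit"
        self._insert_private(tag)             # miss: fetch from memory
        return "miss"

    def _insert_private(self, tag):
        if len(self.private) >= self.private_ways:
            victim, _ = self.private.popitem(last=False)  # evict private LRU...
            self._insert_shared(victim)                   # ...demote to shared
        self.private[tag] = True

    def _insert_shared(self, tag):
        if len(self.shared) >= self.shared_ways:
            self.shared.popitem(last=False)   # shared LRU falls out to memory
        self.shared[tag] = True
```

Note the design choice visible in the slide: blocks evicted from the private partition get a second chance in the shared partition before leaving the chip, which is what makes the shared space act as a victim buffer for all cores.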

  10. Shared partition block replacement Algorithm
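The algorithm figure on this slide did not survive extraction. As a placeholder, here is a plausible quota-aware victim-selection sketch consistent with the scheme's goal of controlled sharing; this is an assumption about the policy's shape, not the paper's exact algorithm, and all names here are hypothetical.

```python
def pick_shared_victim(blocks, quota):
    """Quota-aware victim selection for the shared partition (a sketch,
    not the paper's lost figure).

    blocks: list of (owner_core, age) tuples; larger age = closer to LRU.
    quota:  dict mapping core -> allowed number of shared blocks.
    Picks the oldest block owned by the core that most exceeds its quota;
    if no core is over quota, falls back to the globally oldest block.
    """
    usage = {}
    for core, _ in blocks:
        usage[core] = usage.get(core, 0) + 1
    over = {c: usage[c] - quota.get(c, 0)
            for c in usage if usage[c] > quota.get(c, 0)}
    if over:
        worst = max(over, key=over.get)            # most over-quota core
        candidates = [b for b in blocks if b[0] == worst]
    else:
        candidates = blocks
    return max(candidates, key=lambda b: b[1])     # oldest = LRU victim
```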

  11. Results • Single-threaded workloads on each core • 4 MB shared L3 vs. 1 MB private L3 per core • Workload characterization: last-level-cache sensitive vs. insensitive • Overall goal: maximize the harmonic mean (HM) of the IPCs of all 4 cores • Forms the basis for comparison
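The optimization target above is the harmonic mean of per-core IPCs, which, unlike the arithmetic mean, penalizes slowing down any single core:

```python
def harmonic_mean_ipc(ipcs):
    """Harmonic mean of per-core IPC values: n / sum(1/ipc_i).
    Dominated by the slowest core, so starving one core hurts the metric."""
    return len(ipcs) / sum(1.0 / x for x in ipcs)
```

For example, four cores at IPC 1.0 give HM = 1.0, while doubling two cores' IPC to 2.0 raises the HM only to 4/3, since the two slow cores dominate.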

  12. Speedups for last-level-cache-sensitive benchmarks

  13. Larger Caches: 8 MB L3

  14. Technology scaling • At smaller technology nodes, wire delay becomes more dominant

  15. Comparison with Chang and Sohi, ISCA '06 • Uncontrolled vs. adaptive partitioning

  16. Summary • Adaptive cache partitioning gives you: • Better performance • Less interference • Improved sharing • Can do more with less (cache)
