
An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors


Presentation Transcript


  1. An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors Haakon Dybdahl and Per Stenstrom Proceedings of the IEEE 13th International Symposium on High Performance Computer Architecture (HPCA’07) Feb. 2007 Presenter: Chun-Hung Lai

  2. Abstract • The significant speed gap between processor and memory and the limited chip memory bandwidth make last-level cache performance crucial for future chip multiprocessors. To use the capacity of shared last-level caches efficiently and to allow for a short access time, proposed Non-Uniform Cache Architectures (NUCAs) are organized into per-core partitions. If a core runs out of cache space, blocks are typically relocated to nearby partitions, thus managing the cache as a shared cache. This uncontrolled sharing of all resources may unfortunately result in pollution that degrades performance. • We propose a novel Non-Uniform Cache Architecture in which the amount of cache space that can be shared among the cores is controlled dynamically. The adaptive scheme continuously estimates the effect of increasing/decreasing the shared partition size on the overall performance. We show that our scheme outperforms private and shared cache organizations as well as a hybrid NUCA organization in which blocks in a local partition can spill over to neighbor core partitions.

  3. What’s the Problem • The Non-Uniform Cache Architecture (NUCA) combines the best of the private and shared cache organizations • Low latency of private caches • High capacity of shared caches (no replicas) • However, the uncontrolled sharing of private partitions causes cache pollution (interference): an over-utilized core spills its evicted lines to a neighbor and may evict blocks there that will be used later [Figure: four cores with local caches; over-utilized cores spill evicted lines into an under-utilized neighbor’s partition]

  4. Related Works • Cache sharing among multiprocessors: two families of prior work • Dynamic partitioning of a shared cache: trial-and-fail algorithms; estimating the best partition size (way/set counters estimate the # of increased misses when the $ size is reduced, and vice versa); protecting cache blocks within a set (control block replacement to reduce conflicts); fair cache sharing. Drawback: does not scale well • Cache space sharing based on private caches: spill evicted cache blocks to a neighboring $ (Cooperative Caching, CMP-NuRAPID). Drawback: the sharing is uncontrolled • This paper: the adaptive shared/private NUCA cache partitioning. It divides the cache into private and shared partitions, provides a mechanism to find the best partition sizes, and puts constraints in place to protect against pollution

  5. The Adaptive Partitioning Scheme • The size of the private partition for each core is dynamically adjusted to minimize the total number of cache misses • The shared partition is intended for all the cores • Each private partition is intended for its own core’s blocks and is inaccessible by the other cores, which protects it from pollution [Figure: each core’s cache slice split into a private and a shared partition]

  6. The Involved Techniques • A method for estimating the best private/shared partition sizes • Estimate the # of reduced misses when increasing the $ size • Estimate the # of increased misses when decreasing the $ size • A method for managing the private/shared partitions • A replacement policy for the shared partition • A method for repartitioning the cache

  7. Estimation of Private/Shared Partition Sizes - Preview • An illustrating example: the # of misses as a function of the number of blocks per set • gzip: requires four blocks per set • mcf: one block per set is enough • Goal: a core gives up blocks to benefit another (e.g., total misses can be reduced if mcf gives one block to gzip)

  8. Estimation of Private/Shared Partition Sizes - Information Collection (Phase 1) • Estimate the # of reduced misses when increasing the $ size by one block per set (the effect of a larger cache) • Shadow tags: each set has one shadow tag per core, recording the tag of the block that core most recently evicted from the set • On a cache miss that hits in the shadow tag for that set, increase the counter “hits in shadow tags” for the requesting core • Estimate the # of increased misses when decreasing the $ size by one block per set (the effect of a smaller cache) • On a hit in the LRU block of a set, increase the counter “hits in LRU blocks” for the requesting core • This requires only two global counters per core, plus the shadow tags; a sketch follows below
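The paper gives no implementation; the following is a minimal C sketch of this phase-1 bookkeeping, assuming four cores and illustrative names (set_meta_t, hits_in_shadow_tags, hits_in_lru_blocks are not from the paper):

```c
/* Phase-1 bookkeeping sketch: one shadow tag per core per set, plus two
 * global counters per core. All names are illustrative. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_CORES 4

typedef struct {
    uint64_t shadow_tag[NUM_CORES];   /* tag most recently evicted by each core */
    bool     shadow_valid[NUM_CORES];
} set_meta_t;

/* Two global counters per core, reset at each repartitioning interval. */
uint64_t hits_in_shadow_tags[NUM_CORES]; /* estimated gain of +1 block/set */
uint64_t hits_in_lru_blocks[NUM_CORES];  /* estimated loss of -1 block/set */

/* On a cache miss: a hit in the requesting core's shadow tag means that
 * one more block per set would have turned this miss into a hit. */
void record_miss(set_meta_t *set, int core, uint64_t tag)
{
    if (set->shadow_valid[core] && set->shadow_tag[core] == tag)
        hits_in_shadow_tags[core]++;
}

/* On a hit in a set's LRU block: one block less per set would have
 * turned this hit into a miss. */
void record_lru_hit(int core)
{
    hits_in_lru_blocks[core]++;
}

/* On an eviction: remember the evicted tag in the evicting core's shadow tag. */
void record_eviction(set_meta_t *set, int core, uint64_t tag)
{
    set->shadow_tag[core]   = tag;
    set->shadow_valid[core] = true;
}
```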

  9. Estimation of Private/Shared Partition Sizes - Decision Made (Phase 2) • The private partition sizes per core are re-evaluated every 2000 cache misses • Find the core with the highest gain from increasing its cache size, i.e., the core with the most “hits in shadow tags” • Find the core with the lowest loss from decreasing its cache size, i.e., the core with the fewest “hits in LRU blocks” • Compare the two: if the gain is higher than the loss, one cache block per set is taken from the core with the lowest loss and given to the core with the highest gain • The outcome is stored in the per-core partitioning parameters (see the sketch below)
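Under the same assumptions, a sketch of the phase-2 decision; the counters are those of the phase-1 sketch, and max_blocks_per_set is an illustrative name for the per-core partitioning parameter:

```c
/* Phase-2 decision sketch: every 2000 cache misses, move one block per
 * set from the core with the lowest loss to the core with the highest
 * gain, but only if the estimated gain exceeds the estimated loss. */
#include <stdint.h>

#define NUM_CORES 4
#define REPARTITION_INTERVAL 2000 /* cache misses between re-evaluations */

extern uint64_t hits_in_shadow_tags[NUM_CORES]; /* from the phase-1 sketch */
extern uint64_t hits_in_lru_blocks[NUM_CORES];

int max_blocks_per_set[NUM_CORES]; /* per-core partitioning parameter */

void reevaluate_partitions(void)
{
    int gainer = 0, loser = 0;
    for (int c = 1; c < NUM_CORES; c++) {
        if (hits_in_shadow_tags[c] > hits_in_shadow_tags[gainer]) gainer = c;
        if (hits_in_lru_blocks[c]  < hits_in_lru_blocks[loser])   loser  = c;
    }
    /* Reassign one block per set only when the gain beats the loss. */
    if (gainer != loser &&
        hits_in_shadow_tags[gainer] > hits_in_lru_blocks[loser] &&
        max_blocks_per_set[loser] > 0) {
        max_blocks_per_set[gainer]++;
        max_blocks_per_set[loser]--;
    }
    for (int c = 0; c < NUM_CORES; c++)  /* start a fresh measurement interval */
        hits_in_shadow_tags[c] = hits_in_lru_blocks[c] = 0;
}
```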

  10. Management of the Private/Shared Partitions - Cache Lookup • The behavior of the cache is summarized as follows • Cache hit in the private partition: the block is served locally • Cache hit in a neighboring cache: (1) a miss occurs in the private partition; (2) all neighboring caches are checked in parallel; (3) the hit block is moved to the local cache • Cache miss: (1) the LRU block in the private partition is inserted into the shared partition, where the victim block is found by Algorithm 1 (next slide); (2) the requested block is fetched from memory and inserted into the private partition • A control-flow sketch follows below
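As a control-flow summary only, here is a C-style sketch of this lookup path; every helper below is a hypothetical placeholder for machinery the slide only names, not an interface from the paper:

```c
/* Lookup-path sketch: private hit, parallel neighbor probe, or miss with
 * demotion of the local LRU block into the shared partition. */
#include <stdint.h>

typedef uint64_t addr_t;
typedef struct line line_t;

/* Hypothetical helpers (declarations only). */
line_t *private_lookup(int core, addr_t addr);
line_t *neighbor_lookup_parallel(int core, addr_t addr); /* all neighbors at once */
void    move_to_local(int core, line_t *l);
line_t *private_lru_block(int core, addr_t addr);
void    insert_into_shared(line_t *l); /* victim chosen by Algorithm 1 */
line_t *fetch_into_private(int core, addr_t addr);

line_t *cache_access(int core, addr_t addr)
{
    line_t *l = private_lookup(core, addr);
    if (l)
        return l;                       /* hit in the local private partition */

    l = neighbor_lookup_parallel(core, addr);
    if (l) {
        move_to_local(core, l);         /* migrate the hit block to the local cache */
        return l;
    }

    /* Miss: demote the local LRU block into the shared partition, then
     * fetch the requested block from memory into the private partition. */
    insert_into_shared(private_lru_block(core, addr));
    return fetch_into_private(core, addr);
}
```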

  11. Management of the Private/Shared Partitions - Replacement Policy for the Shared Partition • Goal: find the victim block in the shared partition that makes room for the block evicted from a private partition • Traverse the LRU stack from LRU to MRU and check which core owns each block • Evict the first block whose owner holds more blocks in the set than the partition decision allows (see the sketch below)
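A sketch of this victim search (the paper’s Algorithm 1) under the same illustrative naming; the associativity of 16 is an assumption, and max_blocks_per_set is the partitioning parameter from the phase-2 sketch:

```c
/* Victim search in the shared partition: walk the LRU stack from LRU
 * toward MRU and evict the first block whose owning core already holds
 * more blocks in this set than its partition parameter allows; if no
 * core is over its quota, fall back to the plain LRU block. */
#define NUM_CORES 4
#define ASSOC 16 /* blocks per set; an assumed value */

extern int max_blocks_per_set[NUM_CORES]; /* partition decision per core */

typedef struct {
    int owner; /* core that inserted the block */
} block_t;

int find_shared_victim(const block_t lru_stack[ASSOC],     /* [0] = LRU */
                       const int blocks_in_set[NUM_CORES]) /* per-core occupancy */
{
    for (int i = 0; i < ASSOC; i++) {
        int owner = lru_stack[i].owner;
        if (blocks_in_set[owner] > max_blocks_per_set[owner])
            return i; /* this core is over its quota: evict its block */
    }
    return 0;         /* nobody over quota: evict the LRU block itself */
}
```

Note that this quota check is also what makes the lazy repartitioning on the next slide work: shrinking a core’s parameter immediately makes its surplus blocks eligible for eviction, without any explicit invalidation.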

  12. Repartitioning of the Cache • Lazy repartitioning: a repartitioning only changes the partition parameters used by the replacement policy • The cache itself is repartitioned later, through the eviction process on subsequent cache misses • E.g., when the partition size for one core is decreased, the “extra” cache blocks are not invalidated immediately; they remain valid until they are evicted

  13. Experimental Setup • The SimpleScalar simulator, extended with the new scheme • Four independent cores • Private L3 cache partitions that can be shared • Efficiency is compared with • An LRU-based shared cache • Private caches • Recently proposed NUCA schemes • Benchmarks • SPEC2000 benchmarks • Multi-programmed workloads for a CMP architecture: four randomly picked applications are run in parallel • Note: only results for applications that are intensive in the last-level cache are shown

  14. Speedup of the Proposed Scheme - Global • Metric: IPC (Instructions Per Cycle) • The new scheme has equal or higher performance than both the private and the shared cache • Average speedup is 5% over the shared cache • Average speedup is 13% over the private caches [Figure: speedup relative to private caches for workloads that each consist of four randomly picked applications run in parallel]

  15. Speedup of the Proposed Scheme - Per Application • Metric: IPC (Instructions Per Cycle) • Four of the applications benefit from larger caches, and the proposed scheme works well for all four; a shared cache, in contrast, degrades performance for two of them compared to private caches • The proposed scheme degrades performance for some applications, since their cache space is given away to benefit other applications [Figure: IPC relative to private caches for ammp, applu, apsi, art, facerec, galgel, mgrid, swim, twolf, and vpr]

  16. Speedup of the Proposed Scheme - Comparison with Another NUCA Scheme • The compared scheme is called “random replacement”: the victim block in the neighboring cache is picked randomly • The proposed scheme performs better than the random-replacement scheme

  17. Conclusions • This paper proposed an adaptive NUCA cache partitioning scheme that • Adapts the cache usage of each core to its needs • Protects each core’s most recently used data in the last-level cache • Results show that the proposed scheme has higher performance than • Private caches: apps. that benefit from an increased $ size get a potentially larger cache • A shared cache: (a) lower access time; (b) improved sharing • Earlier NUCA schemes: each core is protected from pollution by the other cores

  18. Comments on This Paper • The paper lacks figures and block diagrams to illustrate the concept • How is the replacement policy implemented, given that it takes a varying number of steps to walk through the LRU stack? Does it finish in one cycle or in N cycles?
