This study explores innovative strategies for optimizing cache organization in chip multiprocessors, addressing challenges of capacity, latency, and communication to enhance performance. The novel mechanisms include controlled replication, in-situ communication, and capacity stealing. The CMP-NuRAPID hybrid cache, utilizing these mechanisms, shows significant performance improvements over shared and private caches for various workloads.
Optimizing Replication, Communication, and Capacity Allocation in CMPs
Z. Chishti, M. D. Powell, and T. N. Vijaykumar
Presented by: Siddhesh Mhambrey
Published in Proceedings of the 32nd International Symposium on Computer Architecture, pages 357-368, June 2005.
Motivation
• Emerging trend toward CMPs creates new challenges for cache design policies
• Increased capacity pressure on on-chip memory: multiple cores need large on-chip capacity
• Increased cache latencies in large caches: wire delays
Need for a cache design that tackles both challenges
Cache Organization
• Goals:
• Utilize capacity effectively: reduce capacity misses
• Mitigate increased latencies: keep wire delays small
• Shared cache: high capacity but increased latency
• Private caches: low latency but limited capacity
Neither private nor shared caches achieve both goals
Latency-Capacity Tradeoff
• SMPs and DSMs have the same goals for cache design
• Capacity
• CMPs have limited on-chip memory
• SMPs have large off-chip memories
• Latency of accesses
• SMPs have slow off-chip accesses
• CMPs have fast on-chip accesses
CMPs change the latency-capacity tradeoff in two ways
Novel Mechanisms
• Controlled Replication: avoid copies of some read-only shared data
• In-Situ Communication: use fast on-chip communication to avoid coherence misses on read-write shared data
• Capacity Stealing: allow a core to steal another core's unused capacity
• Hybrid cache: private tag arrays and a shared data array
• CMP-NuRAPID (Non-Uniform access with Replacement and Placement using Distance associativity)
• Performance: CMP-NuRAPID improves performance by 13% over a shared cache and 8% over a private cache for three commercial multithreaded workloads
Three novel mechanisms exploit the changes in the latency-capacity tradeoff
CMP-NuRAPID
• Non-uniform access and distance associativity
• Caches divided into d-groups
• Each core prefers its closest d-group
[Figure: 4-core CMP with CMP-NuRAPID]
CMP-NuRAPID Organization
[Figure: CMP-NuRAPID tag and data arrays]
CMP-NuRAPID Organization
• Private tag arrays, shared data array
• Leverages forward pointers (tag to data) and reverse pointers (data to tag)
• A single copy of a block can be shared by multiple tags
• Data for one core can reside in different d-groups
The extra level of indirection enables the novel mechanisms
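The pointer indirection above can be sketched in a few lines. This is a hypothetical illustration (all class and field names are invented, not from the paper): per-core private tag entries hold forward pointers into a shared data array, each data frame records reverse pointers back to its tags, and two cores' tags can share one data copy.

```python
# Hypothetical sketch of CMP-NuRAPID-style indirection: private tags with
# forward pointers into a shared data array; data frames hold reverse pointers.

class TagEntry:
    def __init__(self, addr_tag, frame_id):
        self.addr_tag = addr_tag      # address tag bits
        self.frame_id = frame_id      # forward pointer into the data array

class DataFrame:
    def __init__(self, frame_id, data):
        self.frame_id = frame_id
        self.data = data
        self.back_refs = []           # reverse pointers: (core_id, addr_tag)

# Shared data array: one frame holds the data for block 0x40.
frame = DataFrame(frame_id=7, data=b"payload")
data_array = {frame.frame_id: frame}

# Two cores' private tag arrays both point at the SAME frame,
# so no second data copy exists on chip.
tags = {
    0: TagEntry(addr_tag=0x40, frame_id=frame.frame_id),
    1: TagEntry(addr_tag=0x40, frame_id=frame.frame_id),
}
for core_id, entry in tags.items():
    frame.back_refs.append((core_id, entry.addr_tag))

def read(core_id, data_array):
    """Follow a core's forward pointer from its tag to the shared data."""
    entry = tags[core_id]
    return data_array[entry.frame_id].data

assert read(0, data_array) == read(1, data_array) == b"payload"
```

The reverse pointers matter on replacement: when a frame is evicted or moved, the cache can find and update every tag that points at it.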
Mechanisms • Controlled Replication • In-Situ Communication • Capacity Stealing
Controlled Replication
• On a first read miss: update the tag pointer to point to the already-on-chip copy (no replication)
• On a subsequent read: copy the data into the reader's closest d-group to avoid slow accesses in the future
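A minimal sketch of the replicate-on-second-read policy above, under invented data structures (the real hardware tracks this through the tag and pointer state, not a Python set):

```python
# Sketch of controlled replication: first read gets a pointer to the
# existing on-chip copy; only a repeated read triggers a local replica.

class Block:
    def __init__(self):
        self.copies = {"home": "data"}     # d-group -> replicated data
        self.pointer_only_readers = set()  # d-groups reading via pointer

def read(block, reader_dgroup):
    if reader_dgroup in block.copies:
        return "local hit"
    if reader_dgroup not in block.pointer_only_readers:
        # First access: install a tag pointer, do NOT replicate.
        block.pointer_only_readers.add(reader_dgroup)
        return "remote hit via pointer"
    # Subsequent access: replicate into the reader's closest d-group.
    block.copies[reader_dgroup] = block.copies["home"]
    return "replicated locally"

b = Block()
assert read(b, "dgroup0") == "remote hit via pointer"
assert read(b, "dgroup0") == "replicated locally"
assert read(b, "dgroup0") == "local hit"
```

The point of the policy is that blocks read only once never consume a second frame, while hot read-shared blocks still end up close to their readers.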
Mechanisms • Controlled Replication • In-Situ Communication • Capacity Stealing
In-Situ Communication
• Enforce a single copy of a read-write shared block in L2 and keep the block in a communication (C) state
• Replaces the M-to-S transition with an M-to-C transition
Fast communication with capacity savings
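The state change above can be sketched as a toy transition function. This is a deliberate simplification (function names and the reduced state set are invented): a remote read of a Modified block moves it to C rather than S, and in C the owner updates the single copy in place instead of invalidating readers.

```python
# Simplified sketch of the M-to-C transition used by in-situ communication.

def on_remote_read(state):
    # Conventional protocol: M -> S plus a replica for the reader.
    # In-situ communication: M -> C; the reader only installs a tag pointer.
    return "C" if state == "M" else state

def on_owner_write(state):
    # In C, the owner writes the single shared copy in place, so readers
    # see the new value on their next access without a coherence miss.
    return "C" if state == "C" else "M"

state = "M"
state = on_remote_read(state)   # a reader arrives: M -> C
assert state == "C"
state = on_owner_write(state)   # the producer updates in place, stays C
assert state == "C"
```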
Mechanisms • Controlled Replication • In-Situ Communication • Capacity Stealing
Capacity Stealing
• Demotion: demote less frequently used data to unused frames in d-groups closer to a core with less capacity demand
• Promotion: on a tag hit to a block in a farther d-group, promote the block to a closer d-group
• Data for one core can reside in different d-groups
A core makes use of unused capacity near a neighboring core
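Demotion and promotion can be sketched with two LRU-ordered d-groups (all names and the two-level structure are invented for illustration; the real design has four d-groups and hardware replacement):

```python
# Sketch of capacity stealing: a busy core demotes LRU blocks into a
# neighbor's under-used d-group and promotes them back on a hit there.
from collections import OrderedDict

class DGroup:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # LRU order: oldest first

    def full(self):
        return len(self.blocks) >= self.capacity

near = DGroup(capacity=2)   # d-group closest to the busy core
far = DGroup(capacity=2)    # under-used d-group near a neighboring core

def insert(addr):
    if near.full():
        # Demotion: move the busy core's LRU block into stolen capacity.
        victim, data = near.blocks.popitem(last=False)
        far.blocks[victim] = data
    near.blocks[addr] = f"data:{addr}"

def access(addr):
    if addr in near.blocks:
        near.blocks.move_to_end(addr)   # refresh LRU position
        return "near hit"
    if addr in far.blocks:
        far.blocks.pop(addr)
        insert(addr)                    # promotion: bring the block closer
        return "far hit, promoted"
    return "miss"

for a in ("A", "B", "C"):
    insert(a)                 # inserting C demotes LRU block A
assert "A" in far.blocks
assert access("A") == "far hit, promoted"
assert "A" in near.blocks
```

The net effect is the one the slide describes: capacity flows toward the core that needs it, while hot blocks migrate back to the fast d-group.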
Methodology
• Full-system simulation of a 4-core CMP using Simics
• CMP-NuRAPID: 8 MB, 8-way
• 4 d-groups, 1 port per tag array and per data d-group
• Compared to:
• Private: 2 MB, 8-way, 1 port per core
• CMP-SNUCA: shared with non-uniform access, no replication
Results
[Graphs: multi-threaded workloads and multi-programmed workloads]
Conclusions
• CMPs change the latency-capacity tradeoff
• Controlled Replication, In-Situ Communication, and Capacity Stealing are novel mechanisms that exploit this change
• CMP-NuRAPID is a hybrid cache that incorporates these mechanisms
• Commercial multi-threaded workloads: 13% better than shared, 8% better than private
• Multi-programmed workloads: 28% better than shared, 8% better than private
Thank you Questions?