CMP L2 Cache Management

CMP L2 Cache Management. Presented by: Yang Liu, CPS221, Spring 2008. Based on: "Optimizing Replication, Communication, and Capacity Allocation in CMPs" by Z. Chishti, M. Powell, and T. Vijaykumar, and "ASR: Adaptive Selective Replication for CMP Caches" by B. Beckmann, M. Marty, and D. Wood.


Presentation Transcript


  1. CMP L2 Cache Management • Presented by: Yang Liu, CPS221, Spring 2008 • Based on: "Optimizing Replication, Communication, and Capacity Allocation in CMPs", Z. Chishti, M. Powell, and T. Vijaykumar • "ASR: Adaptive Selective Replication for CMP Caches", B. Beckmann, M. Marty, and D. Wood

  2. Outline • Motivation • Related Work (1) – Non-uniform Caches • CMP-NuRAPID • Related Work (2) – Replication Schemes • ASR

  3. Motivation • Two options for L2 caches in CMPs • Shared: high latency because of wire delay • Private: more misses because of replication • Need hybrid L2 caches • Keep in mind • On-chip communication is fast • On-chip capacity is limited

  4. NUCA • Non-Uniform Cache Architecture • Place frequently-accessed data closest to the core to allow fast access • Couple tag and data placement • Can only place one or two ways in each set close to the processor

  5. NuRAPID • Non-uniform access with Replacement And Placement usIng Distance associativity • Decouple the set-associative way number from data placement • Divide the cache data array into d-groups • Use forward and reverse pointers • Forward: from tag to data • Reverse: from data to tag • One to one?
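The forward and reverse pointers on this slide can be sketched as a toy data structure. This is a minimal illustration, not the paper's design: it assumes a one-to-one mapping between tag entries and data frames, and the names `TagEntry`, `Frame`, and `ToyNuRapid` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TagEntry:
    address: int
    forward: Tuple[int, int]  # forward pointer: (d-group, frame index)


@dataclass
class Frame:
    data: Optional[bytes] = None
    reverse: Optional[int] = None  # reverse pointer: index of the owning tag entry


class ToyNuRapid:
    """Toy model: tag position is decoupled from where the data lives."""

    def __init__(self, num_dgroups: int, frames_per_dgroup: int):
        self.tags: List[TagEntry] = []
        self.dgroups = [
            [Frame() for _ in range(frames_per_dgroup)] for _ in range(num_dgroups)
        ]

    def _free_frame(self, dgroup: int) -> int:
        for index, frame in enumerate(self.dgroups[dgroup]):
            if frame.reverse is None:
                return index
        raise RuntimeError("no free frame in d-group")

    def insert(self, address: int, data: bytes, dgroup: int) -> int:
        index = self._free_frame(dgroup)
        tag_index = len(self.tags)
        self.tags.append(TagEntry(address, (dgroup, index)))
        self.dgroups[dgroup][index] = Frame(data, tag_index)
        return tag_index

    def move(self, tag_index: int, new_dgroup: int) -> None:
        # The tag entry stays where it is; only the two pointers are updated.
        old_group, old_index = self.tags[tag_index].forward
        new_index = self._free_frame(new_dgroup)
        old_frame = self.dgroups[old_group][old_index]
        self.dgroups[new_dgroup][new_index] = Frame(old_frame.data, tag_index)
        self.dgroups[old_group][old_index] = Frame()
        self.tags[tag_index].forward = (new_dgroup, new_index)

    def read(self, tag_index: int) -> Optional[bytes]:
        group, index = self.tags[tag_index].forward
        return self.dgroups[group][index].data
```

Moving a block between d-groups changes only the pointers, which is what lets the set-associative way number stay independent of the physical distance of the data.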

  6. CMP-NuRAPID - Overview • Hybrid organization: private per-core tag arrays with a shared data array • Controlled Replication – CR • In-Situ Communication – ISC • Capacity Stealing – CS

  7. CMP-NuRAPID – Structure • Need carefully chosen d-group preference

  8. CMP-NuRAPID – Data and Tag Array • Tag arrays snoop on bus to maintain coherence • The data array is accessed through a crossbar

  9. CMP-NuRAPID – Controlled Replication • For read-only sharing • First use: make no copy, which saves capacity • Second use: make a copy, which reduces future access latency • Overall, avoids off-chip misses
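The first-use and second-use rule can be sketched as follows. This is a simplified illustration under stated assumptions: d-group `i` is taken to be the closest one to core `i`, and the class name `ControlledReplication` and the returned action strings are hypothetical.

```python
class ControlledReplication:
    """Toy model: point at the existing copy first, replicate on reuse."""

    def __init__(self):
        # (core, address) -> d-group that this core's tag currently points to
        self.location = {}

    def read(self, core: int, address: int, holder_dgroup: int) -> str:
        key = (core, address)
        if key not in self.location:
            # First use: no new copy, just point the tag at the existing data.
            self.location[key] = holder_dgroup
            return "pointer"
        if self.location[key] != core:
            # Second use: copy the block into the core's closest d-group
            # (assumed here to have the same index as the core).
            self.location[key] = core
            return "replicate"
        return "local-hit"
```

A block that is read only once by a second core never consumes a second data frame, while a block that is reused ends up close to the reader.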

  10. CMP-NuRAPID – Timing Issues • Case 1: a read starts before an invalidation and ends after it • Solution: mark as busy the tag of the block being read from a farther d-group • Case 2: a read starts after an invalidation begins and ends before the invalidation completes • Solution: before sending a read request to a farther d-group, put an entry in the queue that holds the order of bus transactions

  11. CMP-NuRAPID – In-situ Communication • For read-write sharing • Communication state • Write-through for all C blocks in L1 cache

  12. CMP-NuRAPID – Capacity Stealing • Demote less-frequently-used data to unused frames in the d-groups closer to cores with lower capacity demand • Placement and Promotion • Place all private blocks in the d-group closest to the initiating core • Promote the block directly to the closest d-group for the core

  13. CMP-NuRAPID – Capacity Stealing • Demotion and Replacement • Demote the block to the next-fastest d-group • Replace in the order of invalid, private, and shared • Doesn’t this kind of demotion pollute another core’s fastest d-group?
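The demotion and replacement rules on this slide can be sketched as two small functions. This is a minimal illustration: the frame representation, the per-core `preference` list, and the choice to return `None` when no slower d-group remains are assumptions, not details from the slides.

```python
# Lower number = replaced first, following the order on the slide.
REPLACEMENT_ORDER = {"invalid": 0, "private": 1, "shared": 2}


def choose_victim(frames):
    """frames: list of (frame_id, state) pairs. Returns the frame_id to replace."""
    return min(frames, key=lambda frame: REPLACEMENT_ORDER[frame[1]])[0]


def next_fastest(preference, current):
    """preference: a core's d-groups ordered from fastest to slowest.

    Returns the d-group a block is demoted to from `current`, or None if
    `current` is already the slowest d-group for that core.
    """
    position = preference.index(current)
    if position + 1 < len(preference):
        return preference[position + 1]
    return None
```

For example, with a hypothetical preference list `[2, 3, 0, 1]`, a block demoted from d-group 2 lands in d-group 3, which may be the fastest d-group of another core. That is the situation the question on the slide is pointing at.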

  14. CMP-NuRAPID - Methodology • Simics • 4-core CMP • 8 MB, 8-way CMP-NuRAPID with 4 single-ported d-groups • Both multithreaded and multiprogrammed workloads

  15. CMP-NuRAPID – Multithreaded

  16. CMP-NuRAPID – Multiprogrammed

  17. Replication Schemes • Cooperative Caching • Private L2 caches • Restrict replication under certain criteria • Victim Replication • Shared L2 cache • Allow replication under certain criteria • Both have static replication policies • How about dynamic?

  18. ASR - Overview • Adaptive Selective Replication • Dynamic cache block replication • Replicate blocks when the benefits exceed the costs • Benefits: lower L2 hit latency • Costs: More L2 misses

  19. ASR – Sharing Types • Single Requestor • Blocks are accessed by a single processor • Shared Read-Only • Blocks are read, but not written, by multiple processors • Shared Read-Write • Blocks are accessed by multiple processors, with at least one write • Focus on replicating shared read-only blocks • High locality • Little capacity • Large portion of requests

  20. ASR - SPR • Selective Probabilistic Replication • Assumes private L2 caches and selectively limits replication on L1 evictions • Uses probabilistic filtering to make local replication decisions
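Probabilistic filtering on an L1 eviction can be sketched as a single random draw against the probability attached to the current replication level. The probability table below is an illustrative assumption rather than a value taken from the slides, and `should_replicate` is a hypothetical name.

```python
import random

# Illustrative replication probabilities, one per replication level (assumed).
LEVEL_PROBABILITY = [0.0, 1 / 64, 1 / 16, 1 / 4, 1 / 2, 1.0]


def should_replicate(level: int, rng: random.Random) -> bool:
    """Decide locally whether an evicted L1 block is replicated in the L2."""
    # rng.random() is in [0.0, 1.0), so level 0 never replicates
    # and the top level always does.
    return rng.random() < LEVEL_PROBABILITY[level]
```

Because each decision is local and stateless, raising or lowering the replication level changes the fraction of evicted blocks that are replicated without any per-block bookkeeping.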

  21. ASR – Balancing Replication

  22. ASR – Replication Control • Replication levels • C: Current • H: Higher • L: Lower • Cycles • H: Hit cycles-per-instruction • M: Miss cycles-per-instruction

  23. ASR – Replication Control

  24. ASR – Replication Control • Wait until there are enough events to ensure a fair cost/benefit comparison • Wait until four consecutive evaluation intervals predict the same change before changing the replication level
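The two waiting rules, combined with the cost/benefit comparison from the earlier Replication Control slide, can be sketched as a small controller. This is a rough illustration under assumptions: the class name, the `min_events` threshold, the number of levels, and the way the four benefit and cost estimates are passed in are all hypothetical.

```python
class ReplicationController:
    """Toy model: change level only after a streak of identical predictions."""

    def __init__(self, num_levels=6, level=0, required_streak=4, min_events=100):
        self.num_levels = num_levels
        self.level = level
        self.required_streak = required_streak
        self.min_events = min_events  # assumed threshold for a fair comparison
        self.last_prediction = 0
        self.streak = 0

    def end_interval(self, events, benefit_higher, cost_higher,
                     benefit_lower, cost_lower):
        # Too few events: skip the comparison and keep the current level.
        if events < self.min_events:
            return self.level
        # Benefits and costs are cycles-per-instruction estimates.
        if benefit_higher > cost_higher:
            prediction = 1    # replicate more
        elif benefit_lower > cost_lower:
            prediction = -1   # replicate less
        else:
            prediction = 0    # stay at the current level
        if prediction != 0 and prediction == self.last_prediction:
            self.streak += 1
        else:
            self.streak = 1 if prediction != 0 else 0
        self.last_prediction = prediction
        if self.streak >= self.required_streak:
            new_level = self.level + prediction
            self.level = min(max(new_level, 0), self.num_levels - 1)
            self.streak = 0
            self.last_prediction = 0
        return self.level
```

Requiring a streak of identical predictions acts as hysteresis, so a single noisy interval does not move the replication level back and forth.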

  25. ASR – Designs Supported by SPR • SPR-VR • Add 1 bit per L2 cache block to identify replicas • Disallow replication when the local cache set is filled with owner blocks with identified sharers • SPR-NR • Store a 1-bit counter per remote processor for each L2 block • Remove the shared bus overhead (How?) • SPR-CC • Model the centralized tag structure using an idealized distributed tag structure

  26. ASR - Methodology • Two CMP configurations – Current and Future • 8 processors • Writeback, write-allocate cache • Both commercial and scientific workloads • Use throughput as the metric

  27. ASR – Memory Cycles

  28. ASR - Speedup

  29. Conclusion • Hybrid is better • Dynamic is better • Need tradeoff • How does it scale…
