
CS 7960-4 Lecture 10


Presentation Transcript


  1. CS 7960-4 Lecture 10 Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers, N.P. Jouppi, Proceedings of ISCA-17, 1990

  2. Cache Basics • [Figure: a 2-way set-associative cache; the decoder selects a set, each way has a tag array and data array, comparators match the address tag, and a mux selects the hitting way] • (See the address-decomposition sketch below)
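As a complement to the figure, here is a minimal sketch of how a direct-mapped cache splits an address into tag, index, and offset. The line size and set count are assumed example values (32 B lines, 256 sets, so an 8 KB cache), not numbers from the slides:

```python
# Hedged sketch: address decomposition for a direct-mapped cache.
# LINE_SIZE and NUM_SETS are assumed example values.
LINE_SIZE = 32    # bytes per line -> 5 offset bits
NUM_SETS  = 256   # sets           -> 8 index bits

def split_address(addr: int):
    offset = addr % LINE_SIZE                    # byte within the line
    index  = (addr // LINE_SIZE) % NUM_SETS      # which set to probe
    tag    = addr // (LINE_SIZE * NUM_SETS)      # compared against the tag array
    return tag, index, offset

# Two addresses one cache-size (8 KB) apart share a set and differ
# only in the tag, so they conflict in a direct-mapped cache:
print(split_address(0x1040))   # (0, 130, 0)
print(split_address(0x3040))   # (1, 130, 0)
```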

  3. Multiplexing • [Figure]

  4. Banking • [Figure: two banking schemes, one distributing words/ways across banks and one distributing sets, with wordlines and bitlines per bank] • Banking reduces access time per bank and overall power • Allows multiple accesses without true multiporting (see the sketch below)
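A minimal sketch of one way sets can be distributed across banks, assuming low-order interleaving and four banks (both assumptions, not from the slides): consecutive sets land in different banks, which is what lets two accesses proceed in parallel without true multiporting.

```python
# Hedged sketch: distributing sets across banks by low-order interleaving.
NUM_BANKS = 4   # assumed bank count

def bank_of(set_index: int) -> int:
    # Consecutive set indices map to different banks.
    return set_index % NUM_BANKS

# Sets 130 and 131 use banks 2 and 3 and can be accessed in parallel;
# sets 130 and 134 would collide on bank 2.
print(bank_of(130), bank_of(131), bank_of(134))   # 2 3 2
```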

  5. Virtual Memory • A single physical address (A) can map to multiple virtual addresses (X, Y) • The CPU provides addresses X and Y, and the cache must make sure that both map to the same cache location • Naive solution: perform virtual-to-physical translation (TLB look-up) before accessing the cache

  6. Page Coloring • To identify potential cache locations and initiate the RAM look-up, only the index bits are needed • If the OS ensures that the virtual index bits always match the physical index bits, you can start the RAM look-up before completing the TLB look-up • When both finish, use the newly obtained physical address for the tag comparison (note: can't use the virtual address for the tag comparison) • Virtually-indexed, physically-tagged (see the sketch below)
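The sketch below works through the coloring constraint with assumed sizes (4 KB pages, a 32 KB direct-mapped cache with 32 B lines); `same_color` is a hypothetical helper for illustration, not an OS API:

```python
# Hedged sketch: which index bits must page coloring constrain?
PAGE_BITS   = 12   # 4 KB pages (assumed)
OFFSET_BITS = 5    # 32 B lines (assumed)
INDEX_BITS  = 10   # 1024 sets  (assumed)

# Index bits above the page offset change under translation unless
# the OS colors pages; here that is bits [12:14], i.e. 3 bits.
overlap = OFFSET_BITS + INDEX_BITS - PAGE_BITS
print(overlap)   # 3

def same_color(vaddr: int, paddr: int) -> bool:
    # True iff the virtual and physical page numbers agree on the
    # index bits, so the virtual index selects the correct set.
    mask = (1 << overlap) - 1
    return (vaddr >> PAGE_BITS) & mask == (paddr >> PAGE_BITS) & mask
```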

  7. Memory Wall

     Year   Clock speed   Memory latency
     1997   0.75 GHz      50+20 ns (53 cycles)
     2011   10 GHz        16 ns (160 cycles)

  • Memory latency improves by 10%/year • Clock speed has traditionally improved by 50%/year, but will improve by only ~20%/year in the future
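The cycle counts in the table follow directly from latency (ns) times clock rate (GHz), as this quick check shows:

```python
# Cycles of memory latency = latency in ns x clock in GHz.
print((50 + 20) * 0.75)   # 1997: 52.5 -> ~53 cycles
print(16 * 10)            # 2011: 160 cycles
```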

  8. Bottlenecks

  9. Conflict Misses • Direct-mapped caches have lower access times, but suffer from conflict misses • Most conflict misses are localized to a few sets -- an associativity of 1.2 is desirable?

  10. Victim Caches • Every eviction from L1 gets put in the victim cache (VC and L1 are exclusive) • The victim cache's associative look-up can happen in parallel with the L1 look-up; a VC hit results in a swap (see the sketch below) • [Figure: L1 backed by a small fully-associative victim cache]
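The toy model below (not Jouppi's simulator; the sizes and names are assumed) captures the slide's two rules: L1 and the victim cache stay exclusive, and a VC hit swaps the line back into L1 while the displaced L1 line moves into the VC.

```python
# Hedged sketch: direct-mapped L1 + small fully-associative victim cache.
from collections import OrderedDict

NUM_SETS, VC_ENTRIES = 256, 4    # assumed sizes

l1 = {}               # set index -> resident line address
vc = OrderedDict()    # line address -> True, kept in LRU order

def access(line: int) -> str:
    idx = line % NUM_SETS
    if l1.get(idx) == line:
        return "L1 hit"
    if line in vc:                  # in hardware, checked in parallel with L1
        vc.pop(line)
        if idx in l1:
            vc[l1[idx]] = True      # swap: displaced L1 line becomes a victim
        l1[idx] = line
        return "VC hit (swap)"
    if idx in l1:                   # miss in both: evicted L1 line enters the VC
        vc[l1[idx]] = True
        if len(vc) > VC_ENTRIES:
            vc.popitem(last=False)  # drop the LRU victim
    l1[idx] = line
    return "miss"

# Two lines that conflict in L1 (130 and 130 + NUM_SETS) miss once each,
# then ping-pong as cheap VC-hit swaps instead of repeated misses:
for line in [130, 386, 130, 386]:
    print(access(line))    # miss, miss, VC hit (swap), VC hit (swap)
```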

  11. Results • The cache and line size influence the percentage of misses attributable to conflicts • A 15-entry victim cache eliminates half the conflict misses; the reduction in total cache misses is less than 20%

  12. Prefetch Techniques • Prefetch-on-miss fetches multiple lines on every cache miss • Tagged prefetch waits until a prefetched line is touched before bringing in more lines (see the sketch below) • Prefetch deals with capacity and compulsory misses, but causes cache pollution
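A minimal sketch of tagged prefetch, assuming a one-line lookahead (the dictionary cache and function names are illustrative): a line brought in by prefetch carries a tag bit, and only its first demand touch triggers the next prefetch, so the stream advances without refetching on every reference.

```python
# Hedged sketch: tagged prefetch with one-line lookahead.
cache = {}   # line address -> tag bit (True if brought in by prefetch
             # and not yet touched by a demand reference)

def prefetch(line: int):
    cache.setdefault(line, True)   # enters tagged, unless already resident

def reference(line: int):
    if line not in cache:          # demand miss: fetch it, prefetch the next
        cache[line] = False
        prefetch(line + 1)
    elif cache[line]:              # first touch of a prefetched line:
        cache[line] = False        # clear the tag and keep the stream
        prefetch(line + 1)         # running one line ahead
```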

  13. Stream Buffers • On a cache miss, fill the stream buffer with the contiguous cache lines that follow the miss • When you read the top of the queue, bring in the next line • If the top of the queue does not service a miss, the stream buffer flushes and starts from scratch (see the sketch below) • [Figure: L1 backed by a stream buffer holding sequential lines]
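The sketch below models a single stream buffer of assumed depth four, following the slide's three rules: fill with the lines after a miss, a top-of-queue hit shifts the FIFO and prefetches the next sequential line, and anything else flushes and restarts.

```python
# Hedged sketch: one stream buffer of sequential lines (depth assumed).
from collections import deque

DEPTH = 4
buf = deque()

def refill(line: int):
    # On a miss the buffer restarts with the lines after the miss.
    buf.clear()
    buf.extend(range(line + 1, line + 1 + DEPTH))

def l1_miss(line: int) -> bool:
    # Only the head (top of the queue) is checked.
    if buf and buf[0] == line:
        buf.popleft()                                # head hit: stream continues
        buf.append((buf[-1] if buf else line) + 1)   # prefetch the next line
        return True                                  # serviced by the buffer
    refill(line)                                     # otherwise flush, start over
    return False

# A sequential miss stream: the first miss primes the buffer, the rest hit.
print([l1_miss(n) for n in range(100, 105)])   # [False, True, True, True, True]
```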

  14. Results • Eight entries are enough to eliminate most capacity and compulsory misses • 72% of I-cache misses and 25% of D-cache misses are eliminated • Multiple stream buffers help eliminate 43% of D-cache misses • Large cache lines minimize stream buffer impact (the stream buffer removes only 10% of D-cache misses at a 128B cache line size)

  15. Potential Improvements • Relax the top-of-queue constraint for the stream buffer • Maintain a stride value to detect non-sequential accesses

  16. Bottlenecks Again • For 4KB caches, 16B lines

  17. Harmonic and Arithmetic Means • HM of IPC = N / (1/IPC_a + 1/IPC_b + 1/IPC_c) = N / (CPI_a + CPI_b + CPI_c) = 1 / (AM of CPI) • This weights each benchmark as if it executes one instruction • If you want to assume each benchmark executes for the same time, HM of CPI or AM of IPC is appropriate (see the worked check below)
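A quick numeric check of the identity, with assumed example IPCs of 2, 1, and 0.5:

```python
# HM of IPC equals 1 / (AM of CPI); the IPC values are assumed examples.
ipcs = [2.0, 1.0, 0.5]
n = len(ipcs)
hm_ipc = n / sum(1 / x for x in ipcs)    # 3 / 3.5 ~ 0.857
am_cpi = sum(1 / x for x in ipcs) / n    # 3.5 / 3 ~ 1.167
assert abs(hm_ipc - 1 / am_cpi) < 1e-12
print(hm_ipc, am_cpi)
```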

  18. Next Week's Paper • "Memory Dependence Prediction Using Store Sets", Chrysos and Emer, ISCA-25, 1998

