


Presentation Transcript


  1. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers By Sreemukha Kandlakunta and Phani Shashank Nagari

  2. Outline • Base Line Design • Reducing Conflict Misses: Miss Caching, Victim Caching • Reducing Capacity and Compulsory Misses: Stream Buffers, Multi-way Stream Buffers

  3. Base Line Design

  4. Base Line Design Contd. • The size of on-chip caches varies between implementations • Higher-speed technologies result in smaller on-chip caches • L1 caches are assumed to be direct-mapped • L1 cache line sizes: 16-32 B • L2 cache line sizes: 128-256 B

  5. Parameters Assumed • Processor speed: 1000 MIPS • L1 instruction and data caches – Size: 4 KB, Line size: 16 B • L2 instruction and data caches – Size: 1 MB, Line size: 128 B

  6. Parameters Assumed Contd. • Miss penalties – L1 miss: 24 instruction times – L2 miss: 320 instruction times

  7. Test Program Characteristics

  8. Base Line System L1 Cache Miss Rates

  9. Base Line Design Performance

  10. Inferences • Most of the potential performance loss is in the memory hierarchy • Improve the performance of the memory hierarchy rather than that of the CPU • Hardware techniques are used to improve the performance of the baseline memory hierarchy

  11. How Direct-Mapped Cache Works [diagram: a main memory of 5-bit block addresses mapped onto a direct-mapped cache with 8 blocks, each holding a tag and a data line] • Where to search: the index bits select exactly one block; addresses 00101, 01101, 10101 and 11101 all map to block 101 • How to identify: match the tag; tag 01 in block 001 means address 01001 is present
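
To make the slide's address example concrete, here is a minimal Python sketch (ours, not from the original slides) of direct-mapped lookup on the 5-bit addresses above: the low 3 index bits select one of the 8 blocks, and the remaining 2 tag bits identify which address currently occupies it.

```python
NUM_BLOCKS = 8    # 8 cache blocks -> 3 index bits
INDEX_BITS = 3

cache = [None] * NUM_BLOCKS   # each entry holds a tag (or None if empty)

def access(addr: int) -> bool:
    """Return True on a hit, False on a miss (which fills the block)."""
    index = addr & (NUM_BLOCKS - 1)   # low 3 bits pick the block
    tag = addr >> INDEX_BITS          # high 2 bits are the tag
    if cache[index] == tag:
        return True
    cache[index] = tag                # miss: evict whatever was there
    return False

# 00101, 01101, 10101 and 11101 all map to block 101, so they keep
# evicting one another: every access below is a miss.
for addr in (0b00101, 0b01101, 0b10101, 0b11101, 0b00101):
    print(f"{addr:05b}: {'hit' if access(addr) else 'miss'}")
```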

  12. How Fully-Associative Cache Works [diagram: the same main memory mapped onto a fully-associative cache with 8 blocks; any address can occupy any block] • Where to search: every block in the cache • Very expensive: each block needs its own tag comparator
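
A matching sketch for the fully-associative case, assuming LRU replacement (the policy the conflict-miss definition on slide 15 presumes). Any address can occupy any of the 8 entries, so a lookup must compare against every entry; that full comparison is what makes the design expensive in hardware.

```python
from collections import OrderedDict

class FullyAssociativeCache:
    def __init__(self, num_blocks: int = 8):
        self.num_blocks = num_blocks
        self.entries = OrderedDict()  # address -> None, oldest first

    def access(self, addr: int) -> bool:
        if addr in self.entries:
            self.entries.move_to_end(addr)    # refresh LRU position
            return True
        if len(self.entries) >= self.num_blocks:
            self.entries.popitem(last=False)  # evict the LRU entry
        self.entries[addr] = None
        return False

# The four addresses that fought over block 101 in the direct-mapped
# cache now coexist: the second pass hits on all of them.
fa = FullyAssociativeCache()
for addr in (0b00101, 0b01101, 0b10101, 0b11101) * 2:
    print(f"{addr:05b}: {'hit' if fa.access(addr) else 'miss'}")
```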

  13. Cache Misses • Three kinds: – Instruction read miss: causes the most delay; the CPU must wait until the instruction is fetched from DRAM – Data read miss: causes less delay; instructions not dependent on the missing data can continue executing until the data is returned from DRAM – Data write miss: causes the least delay; the write can be queued and the CPU can continue until the queue is full

  14. Types of Misses • Conflict misses – reduced by caching: miss caches and victim caches • Compulsory misses • Capacity misses – the latter two are reduced by prefetching: stream buffers and multi-way stream buffers

  15. Conflict Miss • Conflict misses are the misses that would not occur if the cache were fully associative with LRU replacement • If an item has been evicted from the cache and a later miss references that same item, the miss is a conflict miss

  16. Conflict Misses Contd. • Conflict misses account for: – 20-40% of overall direct-mapped cache misses – 39% of L1 D$ misses – 29% of L1 I$ misses

  17. Conflict Misses, 4 KB I&D Caches

  18. Outline • Base Line Design • Reducing Conflict Misses: Miss Caching, Victim Caching • Reducing Capacity and Compulsory Misses: Stream Buffers, Multi-way Stream Buffers

  19. Miss Caching • A small, fully-associative on-chip cache • On a miss, data is returned to both: – the direct-mapped cache – the small miss cache, where it replaces the LRU entry • The processor probes the D-M cache and the miss cache in parallel (a sketch of the lookup path follows below)
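
This is an illustrative model combining the direct-mapped cache with a small LRU-managed miss cache, not the paper's actual hardware. Note the duplication: a line fetched from L2 is written into both structures, which is exactly the waste that victim caching (slide 26) removes.

```python
from collections import OrderedDict

class MissCacheSystem:
    def __init__(self, num_blocks=8, index_bits=3, miss_entries=2):
        self.index_bits = index_bits
        self.dm = [None] * num_blocks    # direct-mapped cache: stored tags
        self.miss_cache = OrderedDict()  # small F-A cache, LRU order
        self.miss_entries = miss_entries

    def access(self, addr):
        index = addr & ((1 << self.index_bits) - 1)
        tag = addr >> self.index_bits
        # Probe the D-M cache and the miss cache (conceptually in parallel).
        if self.dm[index] == tag:
            return "hit"
        if addr in self.miss_cache:
            self.miss_cache.move_to_end(addr)
            self.dm[index] = tag         # reload the D-M cache from on-chip
            return "miss-cache hit (no off-chip access)"
        # True miss: fill BOTH caches, replacing the miss cache's LRU entry.
        if len(self.miss_cache) >= self.miss_entries:
            self.miss_cache.popitem(last=False)
        self.miss_cache[addr] = None
        self.dm[index] = tag
        return "miss (fetched from L2)"
```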

  20. Miss Cache Organization

  21. Observations • Eliminates the long off-chip miss penalty for misses that hit in the miss cache • More data conflict misses are removed than instruction conflict misses – Instructions within a procedure do not conflict as long as the procedure is smaller than the cache – If a procedure calls another procedure that is mapped elsewhere, the two can conflict, producing instruction conflict misses

  22. Miss Cache Performance • For a 4 KB D$: – A miss cache of 2 entries removes 25% of D$ conflict misses, i.e. 13% of overall D$ misses – A miss cache of 4 entries removes 36% of D$ conflict misses, i.e. 18% of overall D$ misses • Beyond 4 entries the improvement is minor

  23. Conflict Misses removed by Miss Caching

  24. Overall Cache Misses removed by Miss Caching

  25. Outline • Base Line Design • Reducing Conflict Misses: Miss Caching, Victim Caching • Reducing Capacity and Compulsory Misses: Stream Buffers, Multi-way Stream Buffers

  26. Victim Caching • In a miss cache, duplicating data in the D-M cache and the miss cache wastes storage space • A victim cache instead loads the F-A cache with the victim line evicted from the D-M cache • When a reference misses in the D-M cache but hits in the victim cache, the contents of the two lines are swapped (see the sketch below)
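
A sketch of the swap behavior, again as an illustrative model rather than the paper's hardware. The victim cache is filled only with lines evicted from the D-M cache, so no line is ever stored twice:

```python
from collections import OrderedDict

class VictimCacheSystem:
    def __init__(self, num_blocks=8, index_bits=3, victim_entries=1):
        self.index_bits = index_bits
        self.dm = [None] * num_blocks  # direct-mapped cache: stored tags
        self.victims = OrderedDict()   # full addresses of victims, LRU order
        self.victim_entries = victim_entries

    def access(self, addr):
        index = addr & ((1 << self.index_bits) - 1)
        tag = addr >> self.index_bits
        if self.dm[index] == tag:
            return "hit"
        evicted_tag = self.dm[index]
        self.dm[index] = tag
        if addr in self.victims:
            # Swap: the victim line moves back into the D-M cache, and
            # the line it displaces takes its place in the victim cache.
            del self.victims[addr]
            result = "victim-cache hit (lines swapped)"
        else:
            result = "miss (fetched from L2)"
        if evicted_tag is not None:
            full_addr = (evicted_tag << self.index_bits) | index
            if len(self.victims) >= self.victim_entries:
                self.victims.popitem(last=False)   # drop the LRU victim
            self.victims[full_addr] = None
        return result
```

With a single victim entry, two lines that ping-pong in the same D-M block (e.g. 00101 and 01101) hit in the victim cache on every access after the first two, which suggests why even a one-line victim cache can beat a two-line miss cache (slide 28).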

  27. Victim Cache Organization

  28. Victim Cache Performance • A victim cache of just one line performs better than a miss cache of 2 lines • Significant improvement across all the benchmark programs

  29. Conflict Misses removed by Victim Caching

  30. Overall Cache Misses removed by Victim Caching

  31. Comparison of Miss Cache and Victim Cache Performance

  32. Effect of D-M Cache Size on Victim Cache Performance • Smaller D-M caches benefit the most from the addition of a victim cache • As D-M cache size increases, the likelihood of conflict misses being removed by the victim cache decreases • As the percentage of conflict misses decreases, the percentage of those misses removed by the victim cache also decreases

  33. Victim cache: vary direct-map cache size

  34. Effect of Line Size on Victim Cache Performance • As line size increases, the number of conflict misses increases • As a result, the percentage of misses removed by the victim cache increases

  35. Victim cache: vary data cache line size

  36. Victim Caches and L2 Caches • Victim caches are also useful for L2 caches because of their large line sizes • An L1 victim cache can also reduce the number of L2 conflict misses

  37. Outline • Base Line Design • Reducing Conflict Misses: Miss Caching, Victim Caching • Reducing Capacity and Compulsory Misses: Stream Buffers, Multi-way Stream Buffers

  38. Reducing Capacity and Compulsory Misses • Compulsory misses: the first reference to a piece of data • Capacity misses: caused by insufficient cache size

  39. Prefetching Algorithms • Prefetch always: an access to line i triggers a prefetch of line i+1 • Prefetch on miss: a reference to line i triggers a prefetch of line i+1 only if the reference missed • Tagged prefetch: a tag bit is set to 0 when a line is prefetched and to 1 when the line is first used; that first use triggers a prefetch of the next line
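
The three policies differ only in what triggers the next prefetch. A compact way to state them (our phrasing, with `hit` meaning the access hit in the cache and `first_use` meaning a prefetched line's tag bit just flipped from 0 to 1):

```python
def prefetch_always(line, hit):
    return [line + 1]                 # every access prefetches the next line

def prefetch_on_miss(line, hit):
    return [] if hit else [line + 1]  # only a miss triggers a prefetch

def tagged_prefetch(line, hit, first_use):
    # A miss, or the first use of a previously prefetched line,
    # triggers the next prefetch -- so a sequential stream stays one
    # line ahead without re-issuing a prefetch on every access.
    return [line + 1] if (not hit or first_use) else []
```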

  40. Limited Time For Prefetch

  41. Outline • Base Line Design • Reducing Conflict Misses: Miss Caching, Victim Caching • Reducing Capacity and Compulsory Misses: Stream Buffers, Multi-way Stream Buffers

  42. Stream Buffers • Prefetched lines are placed in a buffer rather than in the cache, to avoid polluting the cache • Each entry consists of a tag, an available bit, and a data line • If a reference misses in the cache but hits in the buffer, the cache can be reloaded from the buffer • When a line is moved out of the SB, the remaining entries shift up and the next successive line is prefetched

  43. Stream Buffer Mechanism

  44. Stream Buffer Mechanism Contd. • On a miss: – Prefetch the successive lines – Enter the tag for each address into the SB – Set the available bit to false • On return of the prefetched data: – Place the data in the entry alongside its tag – Set the available bit to true
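
Putting the two halves together, a single sequential stream buffer can be sketched as a FIFO in which only the head is compared, a head hit shifts the queue up, and every shift starts a prefetch of the next successive line. This is an illustrative model of the mechanism above, with `available` standing in for the bit that stays false while a prefetch is still in flight.

```python
from collections import deque

class StreamBuffer:
    def __init__(self, depth: int = 4):
        self.depth = depth
        self.entries = deque()  # each entry: {"tag": line, "available": bool}

    def allocate(self, miss_line: int):
        """On a cache miss: restart the buffer just past the miss address."""
        self.entries.clear()
        for i in range(self.depth):
            # Prefetch miss_line+1 .. miss_line+depth; data not here yet.
            self.entries.append({"tag": miss_line + 1 + i, "available": False})

    def fill(self, line: int):
        """Prefetched data has returned from memory: set its available bit."""
        for entry in self.entries:
            if entry["tag"] == line:
                entry["available"] = True

    def lookup(self, line: int) -> bool:
        """Head-only tag compare; on a hit, shift up and prefetch one more."""
        if self.entries and self.entries[0]["tag"] == line \
                and self.entries[0]["available"]:
            self.entries.popleft()
            nxt = self.entries[-1]["tag"] + 1 if self.entries else line + 1
            self.entries.append({"tag": nxt, "available": False})
            return True
        return False
```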

  45. Stream Buffer Performance • Most instruction references break the purely sequential access pattern by the time the 6th successive line is fetched • Data references break the pattern even sooner • As a result, stream buffers are better at removing I$ misses than D$ misses

  46. Sequential SB Performance

  47. Limitations of Stream Buffers • The stream buffers considered are FIFO queues • Only the head of the queue has a tag comparator • Elements must be removed strictly in sequence • They work only for sequential line misses and fail on a non-sequential line miss

  48. Outline • Base Line Design • Reducing Conflict Misses: Miss Caching, Victim Caching • Reducing Capacity and Compulsory Misses: Stream Buffers, Multi-way Stream Buffers

  49. Multi-Way Stream Buffers • A single SB could remove 72% of I$ misses but only 25% of D$ misses • A multi-way SB was simulated to improve the performance of stream buffers on data references • It consists of 4 stream buffers in parallel • On a miss in all buffers, the least recently hit SB is cleared and begins fetching from the miss address
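
A sketch of the four-way arrangement, reusing the StreamBuffer model from the slide-44 sketch (so it is illustrative, not the paper's hardware): all heads are checked in parallel, and a miss in every buffer reallocates the least recently hit one.

```python
class MultiWayStreamBuffer:
    def __init__(self, ways: int = 4, depth: int = 4):
        self.buffers = [StreamBuffer(depth) for _ in range(ways)]
        self.lru = list(range(ways))  # buffer indices, least recently hit first

    def lookup(self, line: int) -> bool:
        for i, buf in enumerate(self.buffers):
            if buf.lookup(line):
                self.lru.remove(i)    # mark buffer i most recently hit
                self.lru.append(i)
                return True
        # Missed in every buffer: clear the least recently hit buffer
        # and start it prefetching from the miss address.
        victim = self.lru.pop(0)
        self.buffers[victim].allocate(line)
        self.lru.append(victim)
        return False
```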

  50. Multi-way Stream Buffer Design
