Chapter 7b: Cache Memory Performance

Chapter 7b: Cache Memory Performance

6-bit Address Main Memory 00 00 00 5600 00 01 00 3223 00 10 00 2 1 0 31 23 Cache Tag Index 00 11 00 1122 01 00 00 0 Valid Index Tag Data 01 01 00 32324 00 Y 00 5600 01 10 00 845 01 Y 11 775 01 11 00 43 10 Y 01 845 10 00 00 976 11 N 00 33234 10 01 00 77554 10 10 00 433 10 11 00 7785 11 00 00 2447 11 01 00 775 11 10 00 433 11 11 00 3649 Split depends oncache size Memory Address: Direct Mapping Review Each word has only one placeit can be in the cache: Index must match exactly Tag Index Always zero (words) 7.2

Missed me, Missed me... • What to do on a hit: • Carry on... (Hits should take one cycle or less) • What to do on an instruction fetch miss: • Undo PC increment (PC <-- PC-4) • Do a memory read • Stall until memory returns the data • Update the cache (data, tag and valid) at index • Un-stall • What to do on a load miss • Same thing, except don’t mess with the PC 7.2

Missed me, Missed me... • What to do on a store (hit or miss) • Won’t do to just write it to the cache • The cache would have a different (newer) value than main memory • Simple Write-Through • Write both the cache and memory • Works correctly, but slowly • Buffered Write-Through • Write the cache • Buffer a write request to main memory • 1 to 10 buffer slots are typical 7.2

IF RF M WB EX Splitting up • It is common to use two separate caches for Instructions and for Data • All Instruction fetches use the I-cache • All data accesses (loads and stores) use the D-cache • This allows the CPU to access the I-cache at the same time it is accessing the D-cache • Still have to share a single memory Note: The hit rate will probably be lower than for a combined cache of the same total size. 7.2

Data V Tag CacheEntry Word 4 3 2 1 0 31 14 13 Address 10 2 2 18 Index Blockoffset Byteoffset Tag What about Spatial Locality? • Spatial locality says that physically close data is likely to be accessed close together • On a cache miss, don’t just grab the word needed, but also the words nearby • The easiest way to do this is to increase the block size 3 Word 2 Word 1 Word 0 One 4-word Block All words in the same block have the same index and tag Note: 22 = 4 7.2

32 KB / 4 Words/Block / 4 Bytes/Word --> 2K blocks 4 3 1 2 0 31 15 14 Data (4-word Blocks) Index V Tag 0 1 2 ... ... 2046 2047 Mux 3 2 1 0 32 32KByte/4-Word Block D.M. Cache 211=2K Tag Index Byte offset 11 Block offset 17 17 Hit! Data 7.2

Benchmark Block Size Instruction Data miss Combined (words) miss rate miss rate How Much Change? Miss rates for DEC 3100 (MIPS machine) Separate 64KB Instruction/Data Caches (16K 1-word blocks or 4K 4-word blocks) gcc 1 6.1% 2.1% 5.4% gcc 4 2.0% 1.7% 1.9% spice 1 1.2% 1.3% 1.2% spice 4 0.3% 0.6% 0.4% 7.2

This actuallydepends onthe busspeed The cost of a cache miss • For a memory access, assume: • 1 clock cycle to send address to memory • 40 clock cycles for each DRAM access(clock cycle 0.5ns, 20 ns access time) • 1 clock cycle to send each resulting data word • Miss access time (4-word block) • 4 x (Address + access + sending data word) • 4 x (1 + 40 + 1) = 168= 168 cycles for each miss 7.2

4 bytes 4 bytes CPU CPU Bus Bus Cache Cache Bus 1 1 1 1 40 40 40 40 1 1 1 1 Memory Bus Bus Bus Bus Memory3 Memory2 Memory1 Memory0 Memory Interleaving Interleaving Default Begin accessing one word, and while waiting, start accessing other three words (pipelining) Must finish accessing one word before starting the next access (1+40+1)x4 = 168 cycles 45 cycles Requires 4 separate memories, each 1/4 size Interleaving worksperfectly with caches Spread out addresses among the memories Sophisticated DRAMs (EDO, SDRAM, etc.) provide support for this 7.2

Block with index 1000: tag word 3 word 2 word 0 word 1 V 1 3000 23 322 355 2 The issue of Writes Ö On a read miss, we read the entire block from memory into the cache On a write hit, we write one word into the block. The other words in theblock are unchanged. Ö On a write miss, we write one word into the block and update the tag. 4334 2420 Perform a write to a location with index 1000, tag 2420, word 1 (value 4334) The other words are still the old data (for tag 3000). Bad news! Solution 1: Don’t update the cache on a write miss. Write only to memory. Solution 2:On a write miss, first read the referenced block in (including the old value of the word being written), then write the new word into the cache and write-through to memory. 7.2

Choosing a block size • Large block sizes help with spatial locality, but... • It takes time to read the memory in • Larger block sizes increase the time for misses • It reduces the number of blocks in the cache • Number of blocks = cache size/block size • Need to find a middle ground • 16-64 bytes works nicely 7.2

V Tag Data V Tag Data 0: 1: 2 3: 4: 5: 6: 7: 8 9: 10: 11: 12: 13: 14: 15: Other Cache organizations Fully Associative Direct Mapped Index No Index Each address has only one possible location Address = Tag | Index | Block offset Address = Tag | Block offset 7.3

Fully Associative vs. Direct Mapped • Fully associative caches provide much greater flexibility • Nothing gets “thrown out” of the cache until it is completely full • Direct-mapped caches are more rigid • Any cached data goes directly where the index says to, even if the rest of the cache is empty • A problem, though... • Fully associative caches require a complete search through all the tags to see if there’s a hit • Direct-mapped caches only need to look one place 7.3

V Tag Data V Tag Data 0: 0: 1: 1: 2: 3: 2: 4: 5: 3: 6: 7: A Compromise 4-Way set associative 2-Way set associative Each address has four possible locations with the same index Each address has two possible locations with the same index One fewer index bit: 1/2 the indexes Two fewer index bits: 1/4 the indexes Address = Tag | Index | Block offset Address = Tag | Index | Block offset 7.3

Index Index Index V Tag Data V Tag Data V Tag Data 000: 0 0 0 00: 001: 0 0 0 0: 010: 0 0 0 01: 011: 0 0 0 100: 0 0 0 10: 101: 0 0 0 1: 110: 0 0 0 11: 111: 0 0 0 Byte offset (2 bits)Block offset (2 bits)Index (1-3 bits)Tag (3-5 bits) Set Associative Example 128-byte cache, 4-word blocks, 10 bit addresses,1-4 way assocativity 0100111000 0100111000 0100111000 Miss Miss Miss Miss Miss Miss 1100110100 1100110100 1100110100 Miss Hit Hit 0100111100 0100111100 0100111100 Miss Miss Miss 0110110000 0110110000 0110110000 1100111000 Miss 1100111000 Miss 1100111000 Hit - 010 011 110 110 010 1 - 01001 1 - 11001 1 - 01101 - 0100 1100 1 1 - 1100 0110 1 Direct-Mapped 2-Way Set Assoc. 4-Way Set Assoc. 7.3

New Performance Numbers Miss rates for DEC 3100 (MIPS machine) Separate 64KB Instruction/Data Caches (4K 4-word blocks) Benchmark Associativity Instruction Data miss Combined rate miss rate gcc Direct 2.0% 1.7% 1.9% gcc 2-way 1.6% 1.4% 1.5% gcc 4-way 1.6% 1.4% 1.5% spice Direct 0.3% 0.6% 0.4% spice 2-way 0.3% 0.6% 0.4% spice 4-way 0.3% 0.6% 0.4% 7.3

Block Replacement Strategies • We have to replace a block when there is a collision • Collisions occur whenever the selected set is full • Strategy 1: Ideal (Oracle) • Replace the block that won’t be used again for the longest time • Drawback - Requires knowledge of the future • Strategy 2: Least Recently Used (LRU) • Replace the block that was last used (hit) the longest time ago • Drawback - Requires difficult bookkeeping • Strategy 3: Approximate LRU • Set a use bit for each block every time it is hit, clear all periodically • Replace a block without its use bit set • Strategy 4: Random • Pick a block at random (works almost as well as approx. LRU) 7.5

The Three C’s of Misses • Compulsory Misses • The first time a memory location is accessed, it is always a miss • Also known as cold-start misses • Only way to decrease miss rate is to increase the block size • Capacity Misses • Occur when a program is using more data than can fit in the cache • Some misses will result because the cache isn’t big enough • Increasing the size of the cache solves this problem • Conflict Misses • Occur when a block forces out another block with the same index • Increasing Associativityreduces conflict misses • Worst in Direct-Mapped, non-existent in Fully Associative 7.5

Chapter 7b: Cache Memory Performance

Chapter 7b: Cache Memory Performance

Presentation Transcript

Chapter 8: Memory-Management Strategies

Chapter 8 Virtual Memory

Chapter 5. The Memory System

Advanced Computer Architecture Memory Hierarchy Design

Memory

Chapter 8 Virtual Memory

Advanced Computer Architecture Memory Hierarchy Design

Computer Architecture Shared Memory MIMD Architectures

Cognition: Memory How Does your Memory Work 1 of 5 - BBC Horizon Documentary.flv

Inside RAC

Dynamic Memory Allocation

CS 201 The Memory Hierarchy II

Chapter 5

Csci 136 Computer Architecture II –Cache Memory

Chapter 4

Performance Tuning Tips

Semiconductor Memories

Very High Performance Cache Based Techniques for Iterative Methods

Chapter 4 Internal Memory