Understanding Cache Memory: Amdahl’s Law, Importance, and Organization

Chapter 7a: Cache Memory

Amdahl’s law? Big is Slow
• The more information stored, the slower the access
• Consider taking an open-book exam. You might find the answer:
  • In your memory
  • In a sheet of notes
  • In course handouts
  • In the textbook
• Spatial Locality – You’re likely to have questions on similar topics
• Temporal Locality – If you need a particular formula, you’re likely to need it again soon
7.1

And so it is with Computers
• Our system has two kinds of memory
  • Registers
    • Close to CPU
    • Small number of them
    • Fast
  • Main memory
    • Big
    • Slow (15ns)
    • “Far” from CPU
• Assembly language programmers and compilers manage all transitions between registers and main memory
[Diagram: CPU (Registers) connected to Main Memory by Load or I-Fetch and Store]
7.1

The problem...
• DRAM Memory access takes around 15ns
  • At 100 MHz, that’s 1.5 cycles
  • At 1GHz, that’s 15 cycles
  • Don’t even get started talking about 3-4GHz
• Note: Access time is faster in some memory modes, but basic access is around 10-20ns
• Since every instruction has to be fetched from memory, we lose big time
• We lose double big time when executing a load or store
[Diagram: LW pipeline stages IF, RF, EX, M, WB, with Instruction Fetch and Memory Access labeled]
7.1
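
The cycle counts above are just latency times clock frequency. A minimal sketch of that arithmetic follows; the function name `access_cycles` and the 3000/4000 MHz rows are illustrative choices, not part of the slides.

```python
def access_cycles(latency_ns: float, clock_mhz: float) -> float:
    """Clock cycles that elapse during one memory access."""
    # 1 MHz is 0.001 cycles per nanosecond, so cycles = ns * MHz / 1000.
    return latency_ns * clock_mhz / 1000


DRAM_NS = 15  # DRAM access time quoted on the slide

for mhz in (100, 1000, 3000, 4000):
    print(f"{mhz:>5} MHz: {access_cycles(DRAM_NS, mhz):g} cycles")
```

The last two rows show why the slide does not want to talk about 3-4GHz: the same 15ns costs dozens of cycles.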

A hopeful thought
• Static RAMs are much faster than DRAMs
  • 3-4 ns possible (instead of 15ns)
• So, build memory out of SRAMs
  • SRAMs cost about 20 times as much as DRAM
  • Technology limitations cause the price difference
  • Access time gets worse if larger SRAM systems are needed (small is fast...)
• Nice try.
7.1

A more hopeful thought
• Remember the telephone directory?
• Do the same thing with computer memory
  • Build a hierarchy of memories between the registers and main memory
  • Closer to CPU: Small and fast (frequently used)
  • Closer to Main Memory: Big and slow (more rarely used)
The big question: What goes in the cache?
[Diagram: CPU (Registers), then SRAM Cache, then Main Memory (DRAM), connected by Load or I-Fetch and Store]
7.1

Locality
Temporal locality: The program is very likely to access the same data again and again over time
    i = i + 1;
    if (i < 20) {
        z = i*i + 3*i - 2;
    }
    q = A[i];
Spatial locality: The program is very likely to access data that is close together
    p = A[i];
    q = A[i+1];
    r = A[i] * A[i+3] - A[i+2];
    name = employee.name;
    rank = employee.rank;
    salary = employee.salary;
7.1

The Cache
Main Memory Fragment
Address  Data
1000     5600
1004     3223
1008     23
1012     1122
1016     0
1020     32324
1024     845
1028     43
1032     976
1036     77554
1040     433
1044     7785
1048     2447
1052     775
1056     433
Cache: the 4 most recently accessed memory locations (exploits temporal locality)
Address  Data
1000     5600
1016     0
1048     2447
1028     43
Issues:
• How do we know what’s in the cache?
• What if the cache is full?
7.2

Goals for Cache Organization
• Complete
  • Data may come from anywhere in main memory
• Fast lookup
  • We have to look up data in the cache on every memory access
• Exploits temporal locality
  • Stores only the most recently accessed data
• Exploits spatial locality
  • Stores related data

Direct Mapping
6-bit Address = Tag (2 bits) | Index (2 bits) | Always zero (words) (2 bits)
Main Memory
Address   Data
00 00 00  5600
00 01 00  3223
00 10 00  23
00 11 00  1122
01 00 00  0
01 01 00  32324
01 10 00  845
01 11 00  43
10 00 00  976
10 01 00  77554
10 10 00  433
10 11 00  7785
11 00 00  2447
11 01 00  775
11 10 00  433
11 11 00  3649
Cache
Index  Valid  Tag  Data
00     Y      00   5600
01     Y      11   775
10     Y      01   845
11     N      00   33234
In a direct-mapped cache:
• Each memory address corresponds to one location in the cache
• There are many different memory locations for each cache entry (four in this case)
7.2
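
A minimal sketch of how the 6-bit addresses above split into fields. The helper name `split_address` is an assumption made for illustration; the field widths are the ones on the slide.

```python
def split_address(addr: int) -> tuple[int, int, int]:
    """Split a 6-bit byte address into (tag, index, byte_offset)."""
    byte_offset = addr & 0b11    # low 2 bits, always zero for word accesses
    index = (addr >> 2) & 0b11   # next 2 bits pick one of the 4 cache entries
    tag = (addr >> 4) & 0b11     # top 2 bits tell apart the words sharing an entry
    return tag, index, byte_offset


# The word 775 lives at address 11 01 00: tag 11, index 01.
print(split_address(0b110100))

# The four memory addresses that compete for cache entry 10.
rivals = [format(a, "06b") for a in range(0, 64, 4) if split_address(a)[1] == 0b10]
print(rivals)
```

The second print lists four addresses, which is the "four in this case" from the slide: 16 words share 4 entries.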

Hits and Misses
• When the CPU reads from memory:
  • Calculate the index and tag
  • Is the data in the cache? Yes – a hit, you’re done!
  • Data not in cache? This is a miss.
    • Read the word from memory, give it to the CPU.
    • Update the cache so we won’t miss again. Write the data and tag for this memory location to the cache. (Exploits temporal locality)
• The hit rate and miss rate are the fraction of memory accesses that are hits and misses
• Typically, hit rates are around 95%
• Many times instructions and data are considered separately when calculating hit/miss rates
7.2
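
The read procedure above can be sketched as a tiny simulator. This is an illustration under stated assumptions, not a model of any real hardware: the class name `DirectMappedCache`, the dictionary used as main memory, and the access trace are all hypothetical; the memory values are the first words of the 6-bit example memory.

```python
class DirectMappedCache:
    """Direct-mapped cache with one-word blocks; entries are (valid, tag, data)."""

    def __init__(self, memory: dict[int, int], index_bits: int):
        self.memory = memory
        self.index_bits = index_bits
        self.entries = [(False, 0, 0)] * (1 << index_bits)
        self.hits = 0
        self.misses = 0

    def read(self, addr: int) -> int:
        word = addr >> 2                              # drop the 2-bit byte offset
        index = word & ((1 << self.index_bits) - 1)   # calculate the index...
        tag = word >> self.index_bits                 # ...and the tag
        valid, stored_tag, data = self.entries[index]
        if valid and stored_tag == tag:               # hit: done
            self.hits += 1
            return data
        self.misses += 1                              # miss: read the word from memory
        data = self.memory[addr & ~0b11]
        self.entries[index] = (True, tag, data)       # update the cache (data and tag)
        return data

    def hit_rate(self) -> float:
        return self.hits / (self.hits + self.misses)


memory = {0: 5600, 4: 3223, 8: 23, 12: 1122, 16: 0}
cache = DirectMappedCache(memory, index_bits=2)
values = [cache.read(a) for a in (0, 4, 0, 4, 16, 0)]
print(values, cache.hits, cache.misses, cache.hit_rate())
```

Addresses 0 and 16 share index 00, so the read of 16 throws out the word at 0 and the final read of 0 misses again.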

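A 1024-entry Direct-mapped Cache
[Diagram: Memory Address = Tag (20 bits, bits 31-12) | Index (10 bits, bits 11-2) | Byte offset (bits 1-0). The index selects one of entries 0 through 1023, each holding V, Tag, and Data (One Block, 32 bits). Outputs: Hit! and Data]
7.2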

Example - 1024-entry Direct Mapped Cache
Assume the cache has been used for a while, so it’s not empty...
Address = Tag (20 bits, bits 31-12) | Index (10 bits, bits 11-2) | Byte offset (bits 1-0)
Index  V  Tag    Data
0      1  11153  2332232
1      1  4323   323
2      0  212    998
3      1  14     34238829
...
1023   1  8941   1976
LW $t3, 0x0000E00C($0)
  byte address = 0000 0000 0000 0000 1110 0000 0000 1100
  tag = 14, index = 3, byte offset = 0
  Hit: Data is 34238829
LB $t3, 0x00003005($0)
  byte address = 0000 0000 0000 0000 0011 0000 0000 0101
  tag = 3, index = 1, byte offset = 1
  (let’s assume the word at mem[0x00003004] = 8764)
  Miss: load word from mem[0x00003004] and write into cache at index 1 (new entry: tag 3, data 8764)
7.2
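
The two lookups in this example can be checked mechanically. The sketch below assumes nothing beyond the slide's numbers; the names `CACHE`, `fields`, and `lookup` are made up for illustration, and the rows elided on the slide are left out rather than guessed.

```python
# Rows shown on the slide: index -> (valid, tag, data). Elided rows stay elided.
CACHE = {
    0: (1, 11153, 2332232),
    1: (1, 4323, 323),
    2: (0, 212, 998),
    3: (1, 14, 34238829),
    1023: (1, 8941, 1976),
}


def fields(addr: int) -> tuple[int, int, int]:
    """Split a 32-bit byte address into (tag, index, byte_offset)."""
    return addr >> 12, (addr >> 2) & 0x3FF, addr & 0b11


def lookup(addr: int):
    """Return ('hit', data) or ('miss', None) for a row shown on the slide."""
    tag, index, _ = fields(addr)
    valid, stored_tag, data = CACHE[index]
    if valid and stored_tag == tag:
        return "hit", data
    return "miss", None


print(fields(0x0000E00C), lookup(0x0000E00C))
print(fields(0x00003005), lookup(0x00003005))
```

The second access misses because entry 1 is valid but holds tag 4323, not tag 3.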

Separate I- and D-Caches
• It is common to use two separate caches for Instructions and for Data
  • All Instruction fetches use the I-cache
  • All data accesses (loads and stores) use the D-cache
• This allows the CPU to access the I-cache at the same time it is accessing the D-cache
• Still have to share a single memory
[Diagram: pipeline stages IF, RF, EX, M, WB; the Instruction Cache and the Data Cache each go to Main Memory on a miss]
7.2

So, how’d we do?
Miss rates for DEC 3100 (MIPS machine)
Separate 64KB Instruction/Data Caches (16K 1-word blocks)
Benchmark  Instruction miss rate  Data miss rate  Combined miss rate
gcc        6.1%                   2.1%            5.4%
spice      1.2%                   1.3%            1.2%
Note: The combined miss rate isn’t just the average
7.2
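
One way to read the note: the combined rate is weighted by how many accesses go to each cache. The sketch below assumes a simple weighted average; the slides do not give the instruction/data access mix, so the fraction computed here is back-solved from the gcc row purely for illustration.

```python
def combined_miss_rate(instr_rate: float, data_rate: float, instr_fraction: float) -> float:
    """Weighted average; instr_fraction is the share of accesses that are instruction fetches."""
    return instr_fraction * instr_rate + (1 - instr_fraction) * data_rate


def implied_instr_fraction(instr_rate: float, data_rate: float, combined: float) -> float:
    """Solve the weighted average for the instruction share (an inferred value, not from the slides)."""
    return (combined - data_rate) / (instr_rate - data_rate)


plain_average = (6.1 + 2.1) / 2                   # what "just the average" would give for gcc
share = implied_instr_fraction(6.1, 2.1, 5.4)     # share that reproduces the 5.4% in the table
print(plain_average, round(share, 3), combined_miss_rate(6.1, 2.1, share))
```

Under this assumption most gcc accesses are instruction fetches, which is why the combined rate sits close to the instruction miss rate.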

The issue of writes
• What to do on a store (hit or miss)
• Won’t do to just write it to the cache
  • The cache would have a different (newer) value than main memory
• Simple Write-Through
  • Write both the cache and memory
  • Works correctly, but slowly
• Buffered Write-Through
  • Write the cache
  • Buffer a write request to main memory
  • 1 to 10 buffer slots are typical
7.2
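
A minimal sketch of buffered write-through, under loud assumptions: the cache is modeled as a plain dictionary (no tags or indexes), the class and method names are hypothetical, and the policy of stalling until one buffered write retires when the buffer is full is an assumption the slides do not spell out.

```python
from collections import deque


class BufferedWriteThrough:
    def __init__(self, memory: dict, slots: int = 4):
        self.cache = {}
        self.memory = memory
        self.buffer = deque()   # pending (addr, value) writes to main memory
        self.slots = slots
        self.stalls = 0

    def store(self, addr: int, value: int) -> None:
        self.cache[addr] = value                 # write the cache
        if len(self.buffer) == self.slots:       # assumed: full buffer stalls the CPU
            self.stalls += 1
            self.drain_one()
        self.buffer.append((addr, value))        # buffer a write request to main memory

    def drain_one(self) -> None:
        """Retire the oldest buffered write into main memory."""
        if self.buffer:
            addr, value = self.buffer.popleft()
            self.memory[addr] = value


main_memory = {}
wt = BufferedWriteThrough(main_memory, slots=2)
for addr in (0, 4, 8):
    wt.store(addr, addr * 10)
pending = len(wt.buffer)
while wt.buffer:
    wt.drain_one()
print(wt.stalls, pending, main_memory)
```

With only 2 slots the third store has to wait; more slots absorb longer bursts of stores before the CPU stalls.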

What about Spatial Locality?
• Spatial locality says that data at nearby addresses is likely to be accessed close together in time
• On a cache miss, don’t just grab the word needed, but also the words nearby
• Organize memory in multi-word blocks
• Memory transfers between cache and memory are always one full block
Example of 4-word blocks. Each block is 16 bytes.
Main Memory
Block addresses       Word 0  Word 1  Word 2  Word 3
00 00 00 to 00 11 00  5600    3223    23      1122
01 00 00 to 01 11 00  0       32324   845     43
10 00 00 to 10 11 00  976     77554   433     7785
11 00 00 to 11 11 00  2447    775     433     3649
On a miss, the cache copies the entire block that contains the desired word
7.2

Working with Blocks
• The block size may be any power of 2: 1,2,4,8,16,…
• The requested word may be at any position within a block.
• All words in the same block have the same index and tag
[Diagram: Address = Tag (18 bits, bits 31-14) | Index (10 bits, bits 13-4) | Block offset (2 bits, bits 3-2) | Byte offset (2 bits, bits 1-0). Cache Entry = V | Tag | Data, where Data is One 4-word Block (Word 3, Word 2, Word 1, Word 0)]

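32KByte/4-Word Block D.M. Cache
32 KB / 4 Words/Block / 4 Bytes/Word --> 2K blocks (2^11 = 2K)
[Diagram: Address = Tag (17 bits, bits 31-15) | Index (11 bits, bits 14-4) | Block offset (bits 3-2) | Byte offset (bits 1-0). The index selects one of entries 0 through 2047, each holding V, Tag, and Data (4-word Blocks). A Mux picks one of words 3, 2, 1, 0 as the 32-bit Data output. Outputs: Hit! and Data]
7.2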
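
The field widths follow directly from the cache size and block size. A minimal sketch, assuming a direct-mapped cache, 32-bit addresses, 4-byte words, and power-of-two sizes; the function name `cache_geometry` is illustrative.

```python
def cache_geometry(cache_bytes: int, words_per_block: int,
                   addr_bits: int = 32, bytes_per_word: int = 4) -> dict:
    """Field widths for a direct-mapped cache (all sizes assumed to be powers of 2)."""
    blocks = cache_bytes // (words_per_block * bytes_per_word)
    # For a power of two n, (n - 1).bit_length() equals log2(n).
    byte_offset_bits = (bytes_per_word - 1).bit_length()
    block_offset_bits = (words_per_block - 1).bit_length()
    index_bits = (blocks - 1).bit_length()
    tag_bits = addr_bits - index_bits - block_offset_bits - byte_offset_bits
    return {"blocks": blocks, "tag": tag_bits, "index": index_bits,
            "block_offset": block_offset_bits, "byte_offset": byte_offset_bits}


print(cache_geometry(32 * 1024, 4))   # the 32KByte, 4-word-block cache
print(cache_geometry(4 * 1024, 1))    # the 1024-entry, 1-word-block cache
print(cache_geometry(64 * 1024, 4)["blocks"])
```

For a fixed cache size, every doubling of the block size moves one bit from the index to the block offset and halves the number of blocks.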

How Much Change?
Miss rates for DEC 3100 (MIPS machine)
Separate 64KB Instruction/Data Caches (16K 1-word blocks or 4K 4-word blocks)
Benchmark  Block size (words)  Instruction miss rate  Data miss rate  Combined miss rate
gcc        1                   6.1%                   2.1%            5.4%
gcc        4                   2.0%                   1.7%            1.9%
spice      1                   1.2%                   1.3%            1.2%
spice      4                   0.3%                   0.6%            0.4%
7.2

Choosing a block size
• Large block sizes help with spatial locality, but...
  • It takes time to read the memory in
    • Larger block sizes increase the time for misses
  • It reduces the number of blocks in the cache
    • Number of blocks = cache size/block size
• Need to find a middle ground
  • 16-64 bytes works nicely
7.2

Other Cache organizations
Direct Mapped
• Index: entries 0 through 15, each holding V, Tag, and Data
• Each address has only one possible location
• Address = Tag | Index | Block offset
Fully Associative
• No Index: each entry holds V, Tag, and Data
• Address = Tag | Block offset
7.3

Fully Associative vs. Direct Mapped
• Fully associative caches provide much greater flexibility
  • Nothing gets “thrown out” of the cache until it is completely full
• Direct-mapped caches are more rigid
  • Any cached data goes directly where the index says to, even if the rest of the cache is empty
• A problem, though...
  • Fully associative caches require a complete search through all the tags to see if there’s a hit
  • Direct-mapped caches only need to look one place
7.3

A Compromise
2-Way set associative
• Indexes 0 through 7, two entries (V, Tag, Data) per index
• Each address has two possible locations with the same index
• One fewer index bit: 1/2 the indexes
• Address = Tag | Index | Block offset
4-Way set associative
• Indexes 0 through 3, four entries (V, Tag, Data) per index
• Each address has four possible locations with the same index
• Two fewer index bits: 1/4 the indexes
• Address = Tag | Index | Block offset
7.3

Set Associative Example
128-byte cache, 4-word blocks, 10 bit addresses, 1-4 way associativity
Address = Tag (3-5 bits) | Index (1-3 bits) | Block offset (2 bits) | Byte offset (2 bits)
All entries start out invalid (V = 0)
Address     Direct-Mapped  2-Way Set Assoc.  4-Way Set Assoc.
0100111000  Miss           Miss              Miss
1100110100  Miss           Miss              Miss
0100111100  Miss           Hit               Hit
0110110000  Miss           Miss              Miss
1100111000  Miss           Miss              Hit
All five addresses map to the same place: index 011 (direct-mapped, 3-bit tags), index 11 (2-way, 4-bit tags), index 1 (4-way, 5-bit tags)
7.3
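
The hit/miss pattern in this example can be reproduced with a short simulation. This is a sketch under one assumption the slide does not state: when a set is full, the least recently used block is replaced. The function name `simulate` and its parameters are illustrative.

```python
def simulate(addresses, ways, cache_bytes=128, block_bytes=16):
    """Return 'Hit'/'Miss' per access for a set associative cache (LRU replacement assumed)."""
    num_sets = cache_bytes // block_bytes // ways
    sets = [[] for _ in range(num_sets)]   # each set lists its tags, least recently used first
    results = []
    for addr in addresses:
        block = addr // block_bytes        # drop the block offset and byte offset bits
        index, tag = block % num_sets, block // num_sets
        tags = sets[index]
        if tag in tags:
            tags.remove(tag)               # re-appended below as most recently used
            results.append("Hit")
        else:
            if len(tags) == ways:
                tags.pop(0)                # set is full: evict the least recently used tag
            results.append("Miss")
        tags.append(tag)
    return results


TRACE = [0b0100111000, 0b1100110100, 0b0100111100, 0b0110110000, 0b1100111000]
for ways in (1, 2, 4):
    print(f"{ways}-way:", simulate(TRACE, ways))
```

Passing `ways=8` makes the 8-block cache fully associative (a single set), which gives the same result as 4-way for this trace.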

New Performance Numbers
Miss rates for DEC 3100 (MIPS machine)
Separate 64KB Instruction/Data Caches (4K 4-word blocks)
Benchmark  Associativity  Instruction miss rate  Data miss rate  Combined miss rate
gcc        Direct         2.0%                   1.7%            1.9%
gcc        2-way          1.6%                   1.4%            1.5%
gcc        4-way          1.6%                   1.4%            1.5%
spice      Direct         0.3%                   0.6%            0.4%
spice      2-way          0.3%                   0.6%            0.4%
spice      4-way          0.3%                   0.6%            0.4%
7.3

Block Replacement Strategies
• We have to replace a block when there is a collision
  • Collisions occur whenever the selected set is full
• Strategy 1: Ideal (Oracle)
  • Replace the block that won’t be used again for the longest time
  • Drawback - Requires knowledge of the future
• Strategy 2: Least Recently Used (LRU)
  • Replace the block that was last used (hit) the longest time ago
  • Drawback - Requires difficult bookkeeping
• Strategy 3: Approximate LRU
  • Set a use bit for each block every time it is hit, clear all periodically
  • Replace a block without its use bit set
• Strategy 4: Random
  • Pick a block at random (works almost as well as approx. LRU)
7.5
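
A minimal sketch of Strategy 3 for a single set. Several details are assumptions, since the slide leaves them open: empty ways are filled first, the victim is the lowest-numbered way without its use bit set, way 0 is the fallback when every use bit is set, and a newly loaded block starts with its use bit clear. The class name `UseBitSet` is hypothetical.

```python
class UseBitSet:
    """One cache set with a use bit per block (approximate LRU)."""

    def __init__(self, ways: int):
        self.tags = [None] * ways
        self.use = [False] * ways

    def access(self, tag) -> bool:
        """Return True on a hit, False on a miss (after loading the block)."""
        if tag in self.tags:
            self.use[self.tags.index(tag)] = True   # set the use bit on every hit
            return True
        way = self.pick_victim()
        self.tags[way] = tag
        self.use[way] = False                       # assumed: new blocks start unused
        return False

    def pick_victim(self) -> int:
        if None in self.tags:                       # assumed: fill empty ways first
            return self.tags.index(None)
        for way, used in enumerate(self.use):
            if not used:                            # replace a block without its use bit set
                return way
        return 0                                    # assumed fallback when all bits are set

    def clear_use_bits(self) -> None:
        """Called periodically, so stale blocks become replaceable again."""
        self.use = [False] * len(self.use)


s = UseBitSet(ways=2)
outcomes = [s.access(t) for t in ("A", "B", "A", "C")]
after_first_round = list(s.tags)
s.clear_use_bits()
s.access("C")
s.access("D")
print(outcomes, after_first_round, s.tags)
```

One bit per block is far less bookkeeping than true LRU, which has to track the full ordering of every block in the set.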

The Three C’s of Misses
• Compulsory Misses
  • The first time a memory location is accessed, it is always a miss
  • Also known as cold-start misses
  • Only way to decrease miss rate is to increase the block size
• Capacity Misses
  • Occur when a program is using more data than can fit in the cache
  • Some misses will result because the cache isn’t big enough
  • Increasing the size of the cache solves this problem
• Conflict Misses
  • Occur when a block forces out another block with the same index
  • Increasing associativity reduces conflict misses
  • Worst in Direct-Mapped, non-existent in Fully Associative
7.5
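
One common way to sort misses into the three C's is to run a fully associative cache of the same size alongside the real one: a first touch is compulsory, a miss that the fully associative cache would also take is capacity, and anything else is conflict. The sketch below assumes LRU replacement in both caches; the function name and the choice of trace are illustrative. The block numbers are the five addresses from the set associative example with the 4 offset bits dropped.

```python
from collections import OrderedDict


def classify_misses(block_trace, num_blocks, ways):
    """Count hits and the three kinds of misses (LRU replacement assumed)."""
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]   # the cache under study
    full = OrderedDict()                              # fully associative reference, same size
    seen = set()
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}
    for block in block_trace:
        full_hit = block in full
        if full_hit:
            full.move_to_end(block)
        else:
            if len(full) == num_blocks:
                full.popitem(last=False)              # evict the LRU block
            full[block] = True
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)
            counts["hit"] += 1
        else:
            if len(lines) == ways:
                lines.popitem(last=False)
            lines[block] = True
            if block not in seen:
                counts["compulsory"] += 1             # first access ever
            elif full_hit:
                counts["conflict"] += 1               # only the index mapping caused it
            else:
                counts["capacity"] += 1               # even full associativity would miss
        seen.add(block)
    return counts


BLOCKS = [0b010011, 0b110011, 0b010011, 0b011011, 0b110011]
direct = classify_misses(BLOCKS, num_blocks=8, ways=1)
fully = classify_misses(BLOCKS, num_blocks=8, ways=8)
print(direct)
print(fully)
```

For this trace the direct-mapped cache takes two conflict misses that disappear entirely when the same 8 blocks are fully associative; the three compulsory misses remain either way.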

Cache Sizing
• How big should the cache be?
  • As big as possible! Hold as much data in the cache as you can.
  • But… Smaller is faster…
• The cache must provide the data within 1 CPU cycle to avoid stalling
  • Cache must be on the same chip as the CPU
• Make the cache as large as possible until either:
  • Access time is > 1 CPU cycle
  • Run out of room on CPU chip
[Diagram: CPU (Registers), then Cache, then Main Memory (DRAM), connected by Load or I-Fetch and Store]

Multi-level Caches
• The difference between a cache hit (1 cycle) and miss (30-50 cycles) is huge
• Introduce a series of larger, but slower caches to smooth out the difference
  • L1 Cache: As big as can be in 1 cycle
  • L2 Cache: As big as can be in 3-5 cycles
  • L3 Cache: As big as can be in 5-10 cycles
• L2/L3 Cache may be on/off chip depending on CPU speeds and constraints
[Diagram: CPU (Registers), then L1 Cache, L2 Cache, L3 Cache, then Main Memory (DRAM)]
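
A rough way to see how extra levels smooth out the difference is to compute the average cycles per access. The sketch assumes a simple model in which each level is tried in turn and its access time is paid by every access that reaches it. The 1-cycle L1, 4-cycle L2, and 50-cycle memory figures fall in the ranges quoted above and the 95% hit rate is the typical figure quoted earlier; the 80% L2 hit rate is a hypothetical value chosen only for illustration.

```python
def average_access_cycles(levels, memory_cycles):
    """levels: list of (access_cycles, hit_rate) from L1 outward; hit rates are per level."""
    total = 0.0
    reach = 1.0                       # fraction of accesses that get as far as this level
    for access_cycles, hit_rate in levels:
        total += reach * access_cycles
        reach *= 1 - hit_rate         # only the misses continue to the next level
    return total + reach * memory_cycles


l1_only = average_access_cycles([(1, 0.95)], memory_cycles=50)
with_l2 = average_access_cycles([(1, 0.95), (4, 0.80)], memory_cycles=50)
print(l1_only, with_l2)
```

Under these assumptions the L2 cache cuts the average from about 3.5 cycles to about 1.7, even though only 5% of accesses ever reach it.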
