

  1. Lecture 4.2 Memory Hierarchy: Cache Basics

  2. Learning Objectives • Given a 32-bit address, figure out the corresponding block address • Given a block address, find the index of the cache line it maps to • Given a 32-bit address and a multi-word cache line, identify the location inside the cache line the word maps to • Given a sequence of memory requests, decide hit/miss for each request • Given the cache configuration, calculate the total number of bits needed to implement the cache • Describe the behavior of a write-through cache and a write-back cache on a write hit or write miss

  3. Coverage • Textbook Chapter 5.3

  4. Cache Memory • Cache memory: the level of the memory hierarchy closest to the CPU • Given accesses X1, …, Xn−1, Xn: • How do we know if the data is present? • Where do we look?

  5. Direct Mapped Cache • Location determined by address • Direct mapped: only one choice • (Block address) modulo (#Blocks in cache) • #Blocks is a power of 2 • Use low-order address bits
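
As a quick sanity check on the mapping rule, here is a minimal sketch (a hypothetical 8-block cache, matching the example a few slides below): because #Blocks is a power of 2, the modulo reduces to keeping the low-order bits of the block address.

```c
#include <stdio.h>

/* Hypothetical parameters: an 8-block direct-mapped cache.
   Since NUM_BLOCKS is a power of 2, the modulo is just the
   low-order bits of the block address. */
#define NUM_BLOCKS 8

unsigned index_of(unsigned block_addr) {
    return block_addr % NUM_BLOCKS;   /* == block_addr & (NUM_BLOCKS - 1) */
}

int main(void) {
    printf("block 22 -> line %u\n", index_of(22));  /* 22 mod 8 = 6 */
    printf("block 26 -> line %u\n", index_of(26));  /* 26 mod 8 = 2 */
    return 0;
}
```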

  6. Tags and Valid Bits • How do we know which particular data block is stored in a cache location? • Store block address as well as the data • Actually, only need the high-order bits • Called the tag • What if there is no data in a location? • Valid bit: 1 = valid, 0 = not valid • Initially 0
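
The hit condition can be stated directly in code. A minimal sketch with assumed field names (valid, tag, data), not taken from the textbook:

```c
#include <stdio.h>

struct cache_line {
    unsigned valid : 1;   /* 0 until the line is first filled */
    unsigned tag;         /* high-order bits of the resident block's address */
    unsigned data;        /* the cached word */
};

/* A reference hits only if the line is valid AND the stored tag
   matches the high-order bits of the requested address. */
int is_hit(const struct cache_line *line, unsigned addr_tag) {
    return line->valid && line->tag == addr_tag;
}

int main(void) {
    struct cache_line line = {0};              /* valid = 0: always a miss */
    printf("%s\n", is_hit(&line, 2) ? "hit" : "miss");
    line.valid = 1; line.tag = 2;              /* after filling the line */
    printf("%s\n", is_hit(&line, 2) ? "hit" : "miss");
    return 0;
}
```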

  7. Cache Example • 8 blocks, 1 word / block, direct mapped • Initial state: all entries invalid • Access sequence (word addresses): 22, 26, 22, 26, 16, 3, 16, 18, 16

  8.–12. Cache Example (figure slides stepping the access sequence through the cache one access at a time; the sketch below reproduces the outcome)
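
A minimal simulation of this example, assuming the standard direct-mapped lookup (index = address mod 8, tag = address / 8, since each block is one word); it reports miss, miss, hit, hit, miss, miss, hit, miss, hit for the sequence above:

```c
#include <stdio.h>

#define NUM_BLOCKS 8

struct line { int valid; unsigned tag; };

int main(void) {
    struct line cache[NUM_BLOCKS] = {0};   /* valid bits start at 0 */
    unsigned seq[] = {22, 26, 22, 26, 16, 3, 16, 18, 16};
    int n = sizeof seq / sizeof seq[0];

    for (int i = 0; i < n; i++) {
        unsigned idx = seq[i] % NUM_BLOCKS;   /* low-order bits */
        unsigned tag = seq[i] / NUM_BLOCKS;   /* high-order bits */
        int hit = cache[idx].valid && cache[idx].tag == tag;
        printf("addr %2u -> index %u, tag %u: %s\n",
               seq[i], idx, tag, hit ? "hit" : "miss");
        if (!hit) {                           /* allocate on miss */
            cache[idx].valid = 1;
            cache[idx].tag = tag;
        }
    }
    return 0;
}
```

Note how address 18 evicts address 26 from line 2 (same index, different tag), so a later access to 26 would miss again.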

  13. Address Subdivision (figure: how a 32-bit address is divided into tag, index, and offset fields and checked against the cache)

  14. Multiword-Block Direct Mapped Cache • Four words / block, cache size = 1K words → 256 cache lines (index 0 … 255) • Address split: tag = bits 31:12 (20 bits), index = bits 11:4 (8 bits), block offset = bits 3:2, byte offset = bits 1:0 • On a lookup, the 20-bit tag is compared against the tag stored in the indexed line; the block offset then selects one of the four 32-bit words • What kind of locality are we taking advantage of?
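
The field extraction can be written out as shift-and-mask arithmetic. A sketch assuming exactly the split above (8-bit index, 2-bit block offset, 2-bit byte offset); the address used in main is arbitrary:

```c
#include <stdio.h>

/* Field extraction for this slide's configuration: 1K words of data,
   4 words/block -> 256 lines, so for a 32-bit byte address:
   byte offset = bits 1:0, block offset = bits 3:2,
   index = bits 11:4, tag = bits 31:12. */
void split(unsigned addr) {
    unsigned byte_off  =  addr        & 0x3;
    unsigned block_off = (addr >> 2)  & 0x3;
    unsigned index     = (addr >> 4)  & 0xFF;
    unsigned tag       =  addr >> 12;
    printf("addr 0x%08X: tag=0x%05X index=%u word=%u byte=%u\n",
           addr, tag, index, block_off, byte_off);
}

int main(void) {
    split(0x12345678);   /* arbitrary example address */
    return 0;
}
```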

  15. Taking Advantage of Spatial Locality • Let the cache block hold more than one word (here: 2 words / block, 4 lines) • Start with an empty cache, all blocks initially marked as not valid • Word-address sequence: 0, 1, 2, 3, 4, 3, 4, 15 • Outcome: miss, hit, miss, hit, miss, hit, hit, miss • 8 requests, 4 misses: the hits on 1 and 3 are words that were brought in by the misses on 0 and 2

  16. Larger Block Size • 64 blocks, 16 bytes / block • Address split: tag = bits 31:10 (22 bits), index = bits 9:4 (6 bits), offset = bits 3:0 (4 bits) • To what block number does address 1200 map? • 1200 (decimal) = 0x000004B0 • Block address = 1200 / 16 = 75; block number = 75 mod 64 = 11
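
The same arithmetic as a sketch, with this slide's parameters hard-coded:

```c
#include <stdio.h>

/* Mapping for this slide's configuration: 64 blocks, 16 bytes/block.
   Block address = byte address / 16; cache block = block address mod 64. */
int main(void) {
    unsigned addr = 1200;                  /* 0x4B0 */
    unsigned block_addr = addr / 16;       /* = 75 */
    unsigned block_num  = block_addr % 64; /* = 11 */
    printf("byte addr %u -> block addr %u -> cache block %u\n",
           addr, block_addr, block_num);
    return 0;
}
```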

  17. Block Size Considerations • Larger blocks should reduce miss rate • Due to spatial locality • In a fixed-sized cache • Larger blocks → fewer of them • More competition → increased miss rate • Larger miss penalty • Can override benefit of reduced miss rate • Early restart and critical-word-first can help

  18. Miss Rate vs. Block Size (figure: miss rate plotted against block size)

  19. Cache Field Sizes • The number of bits in a cache includes the storage for data, the tags, and the flag bits • For a direct-mapped cache with 2^n blocks, n bits are used for the index • For a block size of 2^m words (2^(m+2) bytes), m bits are used to address the word within the block and 2 bits are used to address the byte within the word • Size of the tag field: 32 − (n + m + 2) bits for a 32-bit address • The total number of bits in a direct-mapped cache is then 2^n × (block size + tag field size + flag bits size) • How many total bits are required for a direct-mapped cache with 16 KB of data and 4-word blocks, assuming a 32-bit address? (worked out in the sketch below)
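
A sketch that works the slide's question: with n = 10 index bits and m = 2 block-offset bits, the tag is 18 bits, each line costs 147 bits, and the whole cache takes 150,528 bits (about 18.4 KiB) to hold 16 KiB of data:

```c
#include <stdio.h>

/* Direct-mapped, 16 KB of data, 4-word blocks, 32-bit addresses.
   16 KB = 4K words = 1K blocks -> n = 10 index bits.
   4 words/block -> m = 2, plus 2 byte-offset bits. */
int main(void) {
    int n = 10, m = 2;
    int data_bits  = (1 << m) * 32;            /* 128 bits of data per line */
    int tag_bits   = 32 - n - m - 2;           /* 18 */
    int line_bits  = data_bits + tag_bits + 1; /* +1 valid bit = 147 */
    int total_bits = (1 << n) * line_bits;     /* 1024 * 147 = 150528 */
    printf("%d bits total = %.1f KiB (vs. 16 KiB of data)\n",
           total_bits, total_bits / 8.0 / 1024);
    return 0;
}
```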

  20. Handling Cache Hits • Read hits (I$ and D$) • this is what we want! • Write hits (D$ only), two policies • Write-through: require the cache and memory to be consistent • always write the data into both the cache block and the next level in the memory hierarchy • Write-back: allow cache and memory to be inconsistent • write the data only into the cache block; write the cache block back to the next level in the memory hierarchy when that block is “evicted”

  21. Write-Through • Write through: update the data in both the cache and the main memory • But makes writes take longer • Writes run at the speed of the next level in the memory hierarchy • e.g., if base CPI = 1, 10% of instructions are stores, write to memory takes 100 cycles • Effective CPI = 1 + 0.1×100 = 11 • Solution: write buffer • Holds data waiting to be written to memory • CPU continues immediately • Only stalls on write if write buffer becomes full

  22. Write-Back • Write back: On data-write hit, just update the block in cache • Keep track of whether each block is dirty • Need a dirty bit • When a dirty block is replaced • Write it back to memory • Can use a write buffer
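
A sketch contrasting the two write-hit policies on a single cache line (the one-line "cache", the memory array, and the function names are illustrative, not from the textbook):

```c
#include <stdio.h>

#define MEM_WORDS 16
unsigned memory[MEM_WORDS];

struct line { int valid, dirty; unsigned tag, data; };

/* Write-through: update the cache AND memory on every write hit. */
void write_through_hit(struct line *l, unsigned addr, unsigned val) {
    l->data = val;
    memory[addr] = val;          /* memory stays consistent */
}

/* Write-back: update only the cache; mark the line dirty and copy it
   to memory later, when the line is evicted. */
void write_back_hit(struct line *l, unsigned val) {
    l->data = val;
    l->dirty = 1;                /* memory is now stale until eviction */
}

void evict(struct line *l, unsigned addr) {
    if (l->valid && l->dirty)
        memory[addr] = l->data;  /* the deferred write happens here */
    l->valid = l->dirty = 0;
}

int main(void) {
    struct line l = {1, 0, 0, 7};        /* a valid, clean line for addr 0 */

    write_through_hit(&l, 0, 41);
    printf("write-through hit: cache=%u mem=%u\n", l.data, memory[0]);

    write_back_hit(&l, 42);
    printf("write-back hit:    cache=%u mem=%u dirty=%d\n",
           l.data, memory[0], l.dirty);

    evict(&l, 0);
    printf("after eviction:    mem=%u\n", memory[0]);
    return 0;
}
```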

  23. Handling Cache Misses (single-word cache block) • Read misses (I$ and D$) • Stall the CPU pipeline • Fetch block from the next level of hierarchy • Copy it to the cache • Send the requested word to the processor • Resume the CPU pipeline

  24. Handling Cache Misses (single-word cache block) • Write allocate (aka fetch on write): data at the missed-write location is loaded into the cache • For write-through: two alternatives • Allocate on miss: fetch the block, i.e., write allocate • No write allocate: don’t fetch the block • Reasonable since programs often write a whole data structure before reading it (e.g., initialization) • For write-back • Usually fetch the block, i.e., write allocate
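
A sketch of the two write-miss options for a 4-word block (fetch_block and memory_write are hypothetical stand-ins for the next level of the hierarchy):

```c
#include <stdio.h>

struct line { int valid; unsigned tag; unsigned data[4]; };

/* Stand-in for fetching a whole block from the next level. */
void fetch_block(struct line *l, unsigned block_addr) {
    l->valid = 1;
    l->tag = block_addr;          /* simplified: whole block addr as tag */
    for (int i = 0; i < 4; i++)   /* fill all 4 words from "memory" */
        l->data[i] = 0;
}

void memory_write(unsigned addr, unsigned val) {
    printf("memory[%u] <- %u\n", addr, val);
}

/* Write allocate: fetch the whole block first, THEN write the word,
   so the line never holds a mix of old and new words. */
void write_miss_allocate(struct line *l, unsigned addr, unsigned val) {
    fetch_block(l, addr / 4);
    l->data[addr % 4] = val;
}

/* No write allocate: bypass the cache entirely on a write miss. */
void write_miss_no_allocate(unsigned addr, unsigned val) {
    memory_write(addr, val);
}

int main(void) {
    struct line l = {0};
    write_miss_allocate(&l, 6, 99);      /* block 1, word 2 */
    printf("line tag=%u word2=%u\n", l.tag, l.data[2]);
    write_miss_no_allocate(7, 100);
    return 0;
}
```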

  25. Multiword Block Considerations • Read misses (I$ and D$) • Processed the same as for single word blocks – a miss returns the entire block from memory • Miss penalty grows as block size grows • Early restart – processor resumes execution as soon as the requested word of the block is returned • Requested word first – requested word is transferred from the memory to the cache (and processor) first • Nonblocking cache – allows the processor to continue to access the cache while the cache is handling an earlier miss

  26. Multiword Block Considerations • Write misses (D$) • If using write allocate, must first fetch the block from memory and then write the word to the block • or could end up with a “garbled” block in the cache, e.g., for 4-word blocks, a new tag, one word of data from the new block, and three words of data from the old block • If not using write allocate, forward the write request to the main memory

  27. Main Memory Supporting Caches • Use DRAMs for main memory • Example: • cache block read • 1 bus cycle for address transfer • 15 bus cycles per DRAM access • 1 bus cycle per data transfer • 4-word cache block • 3 different main memory configurations • 1-word wide memory • 4-word wide memory • 4-bank interleaved memory

  28.–29. 1-Word Wide Bus, 1-Word Wide Memory • What is the miss penalty if the block size is four words and the bus and the memory are 1 word wide? • 1 cycle to send the address • 4 × 15 = 60 cycles to read DRAM (four sequential 15-cycle accesses) • 4 cycles to return the 4 data words • Total: 65 clock cycles miss penalty • Bandwidth = (4 × 4 bytes) / 65 cycles = 0.246 bytes/cycle

  30.–31. 4-Word Wide Bus, 4-Word Wide Memory • What is the miss penalty if the block size is four words and the bus and the memory are 4 words wide? • 1 cycle to send the address • 15 cycles to read DRAM (one access delivers the whole block) • 1 cycle to return the 4 data words • Total: 17 clock cycles miss penalty • Bandwidth = (4 × 4 bytes) / 17 cycles = 0.941 bytes/cycle

  32.–33. Interleaved Memory, 1-Word Wide Bus • For a block size of four words, with four DRAM banks accessed in parallel • 1 cycle to send the address • 15 cycles to read the DRAM banks (the four 15-cycle accesses overlap) • 4 × 1 = 4 cycles to return the 4 data words • Total: 20 clock cycles miss penalty • Bandwidth = (4 × 4 bytes) / 20 cycles = 0.8 bytes/cycle
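
The three miss penalties and bandwidths can be reproduced with the timing parameters from slide 27; a sketch assuming the interleaved banks fully overlap their 15-cycle accesses:

```c
#include <stdio.h>

/* Timing parameters from slide 27: 1 cycle to send the address,
   15 cycles per DRAM access, 1 bus cycle per word returned,
   4-word (16-byte) cache blocks. */
int main(void) {
    int words = 4, addr = 1, dram = 15, xfer = 1;

    int narrow      = addr + words * dram + words * xfer; /* 1+60+4 = 65 */
    int wide        = addr + dram + xfer;                 /* 1+15+1 = 17 */
    int interleaved = addr + dram + words * xfer;         /* 1+15+4 = 20 */

    printf("1-word wide: %2d cycles, %.3f B/cycle\n", narrow, 16.0 / narrow);
    printf("4-word wide: %2d cycles, %.3f B/cycle\n", wide, 16.0 / wide);
    printf("interleaved: %2d cycles, %.3f B/cycle\n", interleaved,
           16.0 / interleaved);
    return 0;
}
```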
