
CPE 232 Computer Organization Cache Introduction

Presentation Transcript


  1. CPE 232 Computer Organization – Cache Introduction. Dr. Iyad Jafar. Adapted from Dr. Gheith Abandah's slides: http://www.abandah.com/gheith/Courses/CPE335_S08/index.html

  2. The Memory Hierarchy [figure: Processor, L1$, L2$, Main Memory, Secondary Memory, with access time and relative size increasing with distance from the processor; typical transfer sizes are 4-8 bytes (word) between processor and L1$, 8-32 bytes (block) between L1$ and L2$, 1 to 4 blocks between L2$ and main memory, and 1,024+ bytes (disk sector = page) between main memory and secondary memory] • The hierarchy is inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in main memory, which is a subset of what is in secondary memory • The size given is for the block, which represents the minimum amount of data moved between two adjacent levels

  3. Cache • Caches were first introduced in research computers in the early 1960s • The term refers to the memory unit that resides between the processor and the main memory • Consider a small cache and a processor request for the word at address 5 • Questions to answer (in hardware): • Q1: How do we know if a data item is in the cache? • Q2: If it is, how do we find it?

  4. The Memory Hierarchy: Terminology [figure: an upper level memory holding Blk X, connected to the processor, and a lower level memory holding Blk Y] • Hit: data requested by the processor is found in some block in the upper level (Blk X) • Hit Rate: the fraction of memory accesses found in the upper level • Hit Time: time to access the upper level, which consists of the access time plus the time to determine hit/miss

  5. The Memory Hierarchy: Terminology [figure: same upper/lower level diagram as the previous slide] • Miss: data requested by the processor is not in the upper level, so it must be retrieved from a block in the lower level (Blk Y) • Miss Rate = 1 - (Hit Rate) • Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor • Hit Time << Miss Penalty

  6. How is the Hierarchy Managed? • registers ↔ memory: by the compiler (programmer?) • cache ↔ main memory: by the cache controller hardware • main memory ↔ disks: by the operating system (virtual memory), with virtual-to-physical address mapping assisted by the hardware (TLB), and by the programmer (files)

  7. Cache – Direct Mapped • Direct-mapped: for each item of data at the lower level, there is exactly one location in the cache where it might be • So lots of items at the lower level must share locations in the upper level • Use the memory address to select the cache location and to identify the block • Memory-to-cache address mapping: cache block # = (block address) modulo (# of blocks in the cache) • If the cache size is a power of 2, this is just the lower log2(# of cache blocks) bits of the address, after the bits used for the block offset (log2(# words/block)) and the byte offset (log2(# bytes/word)) • First consider a cache with a block size of one word
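A direct-mapped lookup is just address slicing. Here is a minimal Python sketch of that decomposition (the names and the 1024-block, one-word-per-block geometry are assumptions, chosen to match the MIPS example on slide 9):

    # Split a 32-bit byte address into the fields a direct-mapped cache uses.
    BYTE_OFFSET_BITS  = 2                                   # 4 bytes per 32-bit word
    WORDS_PER_BLOCK   = 1
    NUM_BLOCKS        = 1024
    BLOCK_OFFSET_BITS = (WORDS_PER_BLOCK - 1).bit_length()  # log2(words/block)
    INDEX_BITS        = (NUM_BLOCKS - 1).bit_length()       # log2(# of blocks)

    def split_address(addr):
        """Return (tag, index, block_offset, byte_offset) for a byte address."""
        byte_off = addr & ((1 << BYTE_OFFSET_BITS) - 1)
        addr >>= BYTE_OFFSET_BITS
        block_off = addr & ((1 << BLOCK_OFFSET_BITS) - 1)
        addr >>= BLOCK_OFFSET_BITS
        index = addr & ((1 << INDEX_BITS) - 1)
        tag = addr >> INDEX_BITS
        return tag, index, block_off, byte_off

    print(split_address(0x00001234))   # -> (1, 141, 0, 0)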

  8. Caching: A Simple First Example • Main memory addresses are 6 bits: 0000xx, 0001xx, ..., 1111xx, where the two low-order bits (xx) define the byte in the word (32-bit words) • Use the next 2 low-order memory address bits – the index – to determine where we may find the word: (block address) modulo (# of blocks in the cache) • Use the remaining bits of the address – the tag – to identify the word • Each entry also needs a valid bit! [figure: a four-entry cache with Index (00, 01, 10, 11), Valid, Tag, and Data columns]

  9. MIPS Direct Mapped Cache Example • One word/block, cache size = 1K words [figure: the 32-bit address splits into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2), and a 2-bit byte offset (bits 1-0); the index selects one of the 1024 (Valid, Tag, Data) entries, the stored tag is compared with the address tag to produce Hit, and the 32-bit data word is read out] • What kind of locality are we taking advantage of?

  10. Direct Mapped Cache Example • Consider the main memory word reference string 0 1 2 3 4 3 4 15 on the four-block, one-word-per-block cache • Start with an empty cache – all blocks initially marked as not valid • 0 miss, 1 miss, 2 miss, 3 miss: words 0-3 fill indices 00-11, all with tag 00 • 4 miss: word 4 also maps to index 00, so it replaces word 0 and the tag becomes 01 • 3 hit, 4 hit • 15 miss: word 15 maps to index 11 (tag 11), replacing word 3 • 8 requests, 6 misses
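The trace is easy to replay in code. A tag-only simulator sketch in Python (the function and variable names here are illustrative, not from the slides):

    def simulate_direct_mapped(trace, num_blocks=4):
        """Count misses for a word-address reference string."""
        tags = [None] * num_blocks       # None plays the role of a cleared valid bit
        misses = 0
        for word_addr in trace:
            index = word_addr % num_blocks
            tag = word_addr // num_blocks
            if tags[index] != tag:       # invalid entry or tag mismatch
                misses += 1
                tags[index] = tag        # install the new block
        return misses

    print(simulate_direct_mapped([0, 1, 2, 3, 4, 3, 4, 15]))   # -> 6 misses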

  11. Caching and Pipelining [figure: five instructions flowing through the pipeline, each using the instruction cache (I$) in the fetch stage and the data cache (D$) in the memory stage, with Reg and ALU stages in between] • To avoid a structural hazard we need two caches on-chip: one for instructions (I$) and one for data (D$) • To keep the pipeline running at its maximum rate, both I$ and D$ need to satisfy a request from the datapath every cycle. What happens when they can't do that?

  12. Handling Cache Hits • Read hits (I$ and D$) • Data/instruction is found in the cache • Keep executing • Write hits (D$ only): Two approaches • Require the cache and memory to be consistent • always write the data into both the cache block and the next level in the memory hierarchy (write-through) • writes run at the speed of the next level in the memory hierarchy – so slow! – or can use a write buffer, so only have to stall if the write buffer is full • Allow cache and memory to be inconsistent • write the data only into the cache block (write-back the cache contents to the next level in the memory hierarchy when that cache block is replaced) • need a dirty bit for each data cache block to tell if it needs to be written back to memory when it is replaced

  13. Write Buffer for Write-Through Caching [figure: the write buffer sits between the cache and DRAM: Processor → Cache → write buffer → DRAM] • A write buffer is placed between the cache and main memory • Processor: writes data into the cache and the write buffer • Memory controller: writes the contents of the write buffer to memory • The write buffer is just a FIFO; a typical number of entries is 4 • Works fine if the store frequency (w.r.t. time) << 1 / DRAM write cycle • Memory system designer's nightmare: when the store frequency (w.r.t. time) approaches 1 / DRAM write cycle, the write buffer saturates • One solution is to use a write-back cache; another is to use an L2 cache
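The buffer's behavior can be sketched in a few lines of Python (all names here are illustrative, and dram_write is a placeholder for the slow memory-side operation):

    from collections import deque

    WRITE_BUFFER_ENTRIES = 4
    write_buffer = deque()              # FIFO of pending (address, data) writes

    def processor_store(cache, addr, data):
        """Processor side: update the cache, then queue the write to memory."""
        cache[addr] = data              # write-through keeps the cache consistent
        if len(write_buffer) >= WRITE_BUFFER_ENTRIES:
            drain_one_write()           # buffer full: the processor stalls here
        write_buffer.append((addr, data))

    def drain_one_write():
        """Memory-controller side: retire the oldest buffered write to DRAM."""
        addr, data = write_buffer.popleft()
        dram_write(addr, data)          # takes a full DRAM write cycle

    def dram_write(addr, data):
        pass                            # placeholder for the actual DRAM write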

  14. Handling Cache Misses • Read misses (I$ and D$) • Stall the entire pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache, send the requested word to the processor, then let the pipeline resume • Write misses (D$ only) • Stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache (which may involve having to replace a dirty block if using a write-back cache), write the word from the processor to the cache, then let the pipeline resume • Or, write allocate (normally used in write-back caches): just write the word into the cache, updating both the tag and data; no need to check for a cache hit, no need to stall • Or, no-write allocate (normally used in write-through caches with a write buffer): skip the cache write and just write the word to the write buffer (and eventually to the next memory level); no need to stall if the write buffer isn't full; must invalidate the cache block since it will be inconsistent (now holding stale data)

  15. Sources of Cache Misses • Consider the main memory word reference string 0 4 0 4 0 4 0 4 on the same four-block cache • Start with an empty cache – all blocks initially marked as not valid • Words 0 and 4 both map to index 00 (tags 00 and 01), so every access misses and replaces the other: 0 miss, 4 miss, 0 miss, 4 miss, 0 miss, 4 miss, 0 miss, 4 miss • 8 requests, 8 misses • Ping pong effect due to conflict misses – two memory locations that map into the same cache block
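Replaying this trace with the simulate_direct_mapped sketch from slide 10 confirms the ping pong behavior:

    print(simulate_direct_mapped([0, 4, 0, 4, 0, 4, 0, 4]))   # -> 8 misses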

  16. Sources of Cache Misses • Compulsory (cold start or process migration, first reference): • First access to a block; a “cold” fact of life, not a whole lot you can do about it • If you are going to run “millions” of instructions, compulsory misses are insignificant • Conflict (collision): • Multiple memory locations mapped to the same cache location • Solution 1: increase cache size • Solution 2: increase associativity (next lecture) • Capacity: • Cache cannot contain all blocks accessed by the program • Solution: increase cache size

  17. Taking Advantage of Spatial Locality • Multiword direct mapped cache: four words/block, cache size = 1K words (256 blocks) [figure: the 32-bit address splits into a 20-bit tag (bits 31-12), an 8-bit index (bits 11-4), a 2-bit block offset (bits 3-2), and a 2-bit byte offset (bits 1-0); the index selects one of the 256 entries, the tag comparison produces Hit, and the block offset selects one of the four data words in the block]

  18. Multiword Direct Mapped Cache Example • Consider a cache with two blocks; each holds two words • Assume access to the word addresses 0,1,2,3,4,3,4,15 • Block address = floor((word address) / (# of words per block)) • Block # = (block address) mod (# blocks)
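These two formulas amount to a floor division followed by a modulo. A small helper sketch (the name block_number is mine, not the slides'):

    def block_number(addr, units_per_block, num_blocks):
        """Map an address to its direct-mapped cache block."""
        block_address = addr // units_per_block   # floor(addr / units per block)
        return block_address % num_blocks

    # Two blocks of two words each, as in this example:
    for w in [0, 1, 2, 3, 4, 3, 4, 15]:
        print("word", w, "-> cache block", block_number(w, 2, 2))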

  19. Multiword Direct Mapped Cache Example • Word addresses 0, 1, 2, 3, 4, 3, 4, 15 • 0 miss (loads words 0-1 into block 0), 1 hit, 2 miss (loads words 2-3 into block 1), 3 hit • 4 miss (loads words 4-5, replacing 0-1), 3 hit, 4 hit • 15 miss (loads words 14-15, replacing 2-3) • 8 requests, 4 misses • The hits on 1 and 3 come from spatial locality: they were fetched as part of their neighbors' blocks

  20. Miss Rate vs Block Size vs Cache Size • Miss rate goes up if the block size becomes a significant fraction of the cache size because the number of blocks that can be held in the same size cache is smaller (increasing capacity misses)

  21. Block Size Tradeoff [figure: three sketches against block size – miss rate first falls (exploits spatial locality) then rises (fewer blocks compromises temporal locality), miss penalty grows with block size, and average access time therefore has a minimum at an intermediate block size] • Larger block sizes take advantage of spatial locality, but • If the block size is too big relative to the cache size, the miss rate will go up • Larger block size means larger miss penalty: latency to first word in block + transfer time for remaining words • In general, Average Memory Access Time = Hit Time + Miss Penalty x Miss Rate
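A quick worked example of the formula, with assumed illustrative numbers (the 1-cycle hit time, 5% miss rate, and 100-cycle miss penalty are not from the slides):

    hit_time     = 1       # cycles to access the cache
    miss_rate    = 0.05    # fraction of accesses that miss
    miss_penalty = 100     # cycles to fetch the block from the next level

    amat = hit_time + miss_penalty * miss_rate
    print(amat)            # -> 6.0 cycles per access on average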

  22. Multiword Block Considerations • Read misses (I$ and D$) • Processed the same as for single word blocks – a miss returns the entire block from memory • Miss penalty grows as block size grows • Early restart – datapath resumes execution as soon as the requested word of the block is returned • Requested word first – requested word is transferred from the memory to the cache (and datapath) first • Nonblocking cache – allows the datapath to continue to access the cache while the cache is handling an earlier miss • Write misses (D$) • Can’t use write allocate or will end up with a “garbled” block in the cache (e.g., for 4 word blocks, a new tag, one word of data from the new block, and three words of data from the old block), so must fetch the block from memory first and pay the stall time

  23. Determining the Size of a Direct-Mapped Cache • The size of the cache is determined by the block size and the size of the tag field • Consider a cache with 2^n blocks of 2^m words each and a 32-bit memory address • The number of bits used for the tag is nTag = 32 – log2(# of cache blocks) – log2(# words per block) – 2 = 32 – n – m – 2 = 30 – (m + n) • 1 bit is used as the valid bit • So, the block size (data + tag + valid) in bits is nBlock = (# words per block) x (word size) + nTag + 1 = 2^m x 32 + nTag + 1 • Cache size = # of blocks x block size = 2^n x nBlock = 2^n x (2^m x 32 + 31 – (m + n))

  24. Example 1 • How many bits are required for a direct mapped cache with 16 KB of data and 4-word blocks, assuming a 32-bit address? • # of blocks in cache = 16 KB / (4 bytes per word) / (4 words per block) = 1K blocks • Index bits = n = log2(1K) = 10 • Block offset = m = log2(4) = 2 • Cache size = 2^n x (2^m x 32 + 31 – (m + n)) = 2^10 x (2^2 x 32 + 31 – (2 + 10)) = 2^10 x 147 = 147 Kbits
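The arithmetic is easy to check in code; a sketch of the slide 23 formula (32-bit addresses and 32-bit words assumed):

    def cache_size_bits(n, m):
        """n = log2(# of blocks), m = log2(# of words per block)."""
        n_tag = 32 - n - m - 2                # tag bits (2 bits go to the byte offset)
        n_block = (2 ** m) * 32 + n_tag + 1   # data + tag + valid bit
        return (2 ** n) * n_block

    print(cache_size_bits(10, 2))             # -> 150528 bits
    print(cache_size_bits(10, 2) // 1024)     # -> 147 Kbits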

  25. Example 2 • Consider a cache with 64 blocks and a block size of 16 bytes. What block number does byte address 1200 map to? • Block address = floor(byte address / # bytes per block) = floor(1200 / 16) = 75 • Block # = (block address) mod (# cache blocks) = 75 mod 64 = 11
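The block_number helper sketched under slide 18 reproduces this; the floor-divide-then-mod computation works the same with byte units as with word units:

    print(block_number(1200, 16, 64))   # byte address 1200, 16 bytes/block -> 11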

  26. Memory Systems that Support Caches • The logic vs DRAM speed gap continues to grow [figure: clocks per DRAM access climbing over the years while clocks per instruction falls]

  27. Memory Systems that Support Caches • The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways [figure: one-word-wide organization – the on-chip CPU and cache connect over a one-word-wide bus (32-bit data and 32-bit address per cycle) to a one-word-wide memory] • Assume: 1 clock cycle to send the address, 25 clock cycles for the DRAM cycle time, 1 clock cycle to return a word of data • Memory-bus-to-cache bandwidth = number of bytes accessed from memory and transferred to the cache/CPU per clock cycle

  28. One Word Wide Memory Organization • If the block size is one word, then for a memory access due to a cache miss, the pipeline stalls for the number of cycles required to return one data word from memory: 1 cycle to send the address + 25 cycles to read the DRAM + 1 cycle to return the data = 27 total clock cycles of miss penalty • Number of bytes transferred per clock cycle (bandwidth) for a single miss: 4/27 = 0.148 bytes per clock

  29. One Word Wide Memory Organization, con't • What if the block size is four words? 1 cycle to send the 1st address + 4 x 25 = 100 cycles to read the DRAM + 4 cycles to return the data words = 105 total clock cycles of miss penalty • Number of bytes transferred per clock cycle (bandwidth) for a single miss: (4 x 4)/105 = 0.152 bytes per clock

  30. Wide Memory Organization • What if the memory is two words wide (64-bit data per cycle) and the block is 4 words? 1 cycle to send the 1st address + 25 + 25 = 50 cycles to read the DRAM + 2 cycles to return the data (two words per transfer) = 53 total clock cycles of miss penalty • Number of bytes transferred per clock cycle (bandwidth) for a single miss: (4 x 4)/53 = 0.302 bytes per clock

  31. Interleaved Memory Organization • For a block size of four words spread across four memory banks [figure: the CPU and cache connect over a one-word bus to memory banks 0-3, whose 25-cycle reads overlap]: 1 cycle to send the 1st address + 25 cycles to read the DRAMs (the four banks read in parallel) + 4 cycles to return the data words = 30 total clock cycles of miss penalty • Number of bytes transferred per clock cycle (bandwidth) for a single miss: (4 x 4)/30 = 0.533 bytes per clock
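All four organizations on slides 28-31 follow the same pattern, which a short sketch can tabulate (the 1/25/1 cycle counts are the slides' stated assumptions; the labels are mine):

    SEND_ADDR, DRAM_CYCLE = 1, 25     # cycles, from the slide 27 assumptions
    BYTES_PER_WORD = 4

    def miss_penalty(dram_waits, return_cycles):
        """Address send + (possibly overlapped) DRAM reads + bus transfers."""
        return SEND_ADDR + dram_waits * DRAM_CYCLE + return_cycles

    cases = [
        ("1-word block, 1-word-wide bus", 1, miss_penalty(1, 1)),   # slide 28
        ("4-word block, 1-word-wide bus", 4, miss_penalty(4, 4)),   # slide 29
        ("4-word block, 2-word-wide bus", 4, miss_penalty(2, 2)),   # slide 30
        ("4-word block, 4 banks        ", 4, miss_penalty(1, 4)),   # slide 31
    ]
    for name, words, cycles in cases:
        bandwidth = words * BYTES_PER_WORD / cycles
        print(f"{name}: {cycles:3d} cycles, {bandwidth:.3f} bytes/clock")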

  32. Cache Summary • The Principle of Locality: programs are likely to access a relatively small portion of the address space at any instant of time • Temporal Locality: locality in time • Spatial Locality: locality in space • Three major categories of cache misses (3 Cs): • Compulsory misses: sad facts of life, e.g. cold start misses • Conflict misses: increase cache size and/or associativity; nightmare scenario: the ping pong effect! • Capacity misses: increase cache size • Cache design space: total size, block size, associativity (replacement policy), write-hit policy (write-through, write-back), write-miss policy (write allocate, write buffers)
