Computer Architecture: Cache Memory By Yoav Etsion and Dan Tsafrir. Presentation based on slides by David Patterson, Avi Mendelson, Lihu Rappoport, and Adi Yoaz
In the old days… EDVAC (Electronic Discrete Variable Automatic Computer) • The successor of ENIAC (the first general-purpose electronic computer) • Designed & built in 1944-1949 by Eckert & Mauchly (who also designed ENIAC), together with John von Neumann • Unlike ENIAC, binary rather than decimal, and a “stored program” machine • Operational until 1961
In the olden days… • In 1945, Von Neumann wrote:“…This result deserves to be noted. It shows in a most striking way where the real difficulty, the main bottleneck, of an automatic very high speed computing device lies: at the memory.” Von Neumann & EDVAC
In the olden days… • Later, in 1946, he wrote:“…Ideally one would desire an indefinitely large memory capacity such that any particular … word would be immediately available……We are forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible” Von Neumann & EDVAC
Not so long ago… • In 1994, in their paper “Hitting the Memory Wall: Implications of the Obvious”, William Wulf and Sally McKee said:“We all know that the rate of improvement in microprocessor speed exceeds the rate of improvement in DRAM memory speed – each is improving exponentially, but the exponent for microprocessors is substantially larger than that for DRAMs. The difference between diverging exponentials also grows exponentially; so, although the disparity between processor and memory speed is already an issue, downstream someplace it will be a much bigger one.”
Not so long ago… • [Chart: processor vs. DRAM performance, 1980–2000, log scale] • CPU performance improved ~60% per year (2× in 1.5 years) • DRAM performance improved ~9% per year (2× in 10 years) • The processor–memory performance gap grew ~50% per year
More recently (2008)… • [Chart: “The memory wall in the multicore era” – performance (seconds) of a conventional architecture vs. number of processor cores, from fast to slow]
Memory Trade-Offs • Large (dense) memories are slow • Fast memories are small, expensive, and consume high power • Goal: give the processor the feeling that it has a memory that is large (dense), fast, cheap, and consumes low power • Solution: a hierarchy of memories • CPU ↔ L1 Cache ↔ L2 Cache ↔ L3 Cache ↔ Memory (DRAM) • Moving away from the CPU: speed goes from fastest to slowest, size from smallest to biggest, cost per byte from highest to lowest, power from highest to lowest
Why Hierarchy Works: Locality • Temporal Locality (locality in time): • If an item is referenced, it will tend to be referenced again soon • Example: code and variables in loops ⇒ keep recently accessed data closer to the processor • Spatial Locality (locality in space): • If an item is referenced, nearby items tend to be referenced soon • Example: scanning an array ⇒ move contiguous blocks closer to the processor • Due to locality, a memory hierarchy is a good idea • We’re going to use what we’ve just recently used • And we’re going to use its immediate neighborhood • (Both kinds are illustrated in the sketch below)
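To make the two kinds of locality concrete, here is a minimal C sketch (mine, not from the original slides): the accumulator and loop index are reused on every iteration (temporal locality), while the array is scanned through consecutive addresses, so each cache-line fill serves several iterations (spatial locality).

```c
#include <stdio.h>

#define N 1024

int main(void)
{
    static int a[N];   /* zero-initialized array, scanned sequentially */

    long sum = 0;
    for (int i = 0; i < N; i++) {
        /* Temporal locality: 'sum' and 'i' are touched on every iteration,
         * so they stay in the fastest levels (registers / L1).
         * Spatial locality: a[0], a[1], a[2], ... occupy consecutive
         * addresses, so one cache-line fill serves several iterations. */
        sum += a[i];
    }

    printf("sum = %ld\n", sum);
    return 0;
}
```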
Programs with locality cache well… • [Figure: memory address (one dot per access) vs. time, showing regions of temporal locality, spatial locality, and bad locality behavior] • Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)
Memory Hierarchy: Terminology • For each memory level define the following • Hit: data appears in the memory level • Hit Rate: the fraction of accesses found in that level • Hit Latency: time to access the memory level • Includes the time to determine hit/miss • Miss: need to retrieve data from the next level • Miss Rate: 1 – Hit Rate • Miss Penalty: time to bring in the missing info (replace a block) + time to deliver the info to the accessor • Average memory access time: t_effective = (Hit Latency × Hit Rate) + (Miss Penalty × Miss Rate) = (Hit Latency × Hit Rate) + (Miss Penalty × (1 – Hit Rate)) • If the hit rate is close to 1, t_effective is close to the hit latency, which is generally what we want
Effective Memory Access Time • Cache – holds a subset of the memory • Hopefully, the subset that is being used now, known as “the working set” • Effective memory access time: t_effective = (t_cache × Hit Rate) + (t_mem × (1 – Hit Rate)) • t_mem includes the time it takes to detect a cache miss • Example: assume t_cache = 10 ns, t_mem = 100 ns • Hit rate 0% ⇒ t_eff = 100 ns; 50% ⇒ 55 ns; 90% ⇒ 19 ns; 99% ⇒ 10.9 ns; 99.9% ⇒ 10.1 ns • As t_mem/t_cache goes up, it becomes more important that the hit rate be close to 1
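The table above can be reproduced directly from the formula; the following small C program is a sketch of that computation (the variable names are mine, and the 10 ns / 100 ns figures are the assumed values from the example).

```c
#include <stdio.h>

/* t_eff = t_cache * hit_rate + t_mem * (1 - hit_rate) */
static double t_effective(double t_cache, double t_mem, double hit_rate)
{
    return t_cache * hit_rate + t_mem * (1.0 - hit_rate);
}

int main(void)
{
    const double t_cache = 10.0;    /* ns */
    const double t_mem   = 100.0;   /* ns, includes miss detection */
    const double hit_rates[] = { 0.0, 0.5, 0.9, 0.99, 0.999 };

    for (int i = 0; i < 5; i++)
        printf("hit rate %.1f%% -> t_eff = %.1f ns\n",
               hit_rates[i] * 100.0,
               t_effective(t_cache, t_mem, hit_rates[i]));
    return 0;
}
```

With t_mem/t_cache = 10, raising the hit rate from 90% to 99.9% roughly halves the effective access time again (19 ns to 10.1 ns), which is why the hit rate must be very close to 1.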
Cache – main idea • [Figure: memory blocks 0, 1, 2, …, 90, 91, 92, 93, … and a small cache holding a few of them, e.g., blocks 2, 4, 90, 92] • The cache holds a small part of the entire memory • Need to map parts of the memory into the cache • Main memory is (logically) partitioned into “blocks” or “lines” or, when the info is cached, “cachelines” • Typical block size is 32 or 64 bytes • Blocks are “aligned” in memory • The cache is partitioned into cache lines • Each cache line holds a block • Only a subset of the blocks is mapped to the cache at a given time • The cache views an address as a block number plus an offset within the block • Why use lines/blocks rather than words?
Cache Lookup • [Figure: the same memory/cache mapping diagram as before] • Cache hit • The block is mapped to the cache – return data according to the block’s offset • Cache miss • The block is not mapped to the cache ⇒ do a cacheline fill • Fetch the block into a fill buffer • May require a few cycles • Write the fill buffer into the cache • May need to evict another block from the cache • To make room for the new block
Checking valid bit & tag • [Figure: address split into tag (bits 31–5) and offset (bits 4–0); each tag-array entry holds a valid bit and a tag, which is compared against the address tag to produce the hit signal and select the data] • Initially the cache is empty • Need a “line valid” indication – a valid bit per line • A line may also be invalidated
Cache organization • Basic questions: • Associativity: Where can we place a memory block in the cache? • Eviction policy: Which cache line should be evicted on a miss? • Associativity: • Ideally, every memory block can go to each cache line • Called Fully-associative cache • Most flexible, but most expensive • Compromise: simpler designs • Blocks can only reside in a subset of cache lines • Direct-mapped cache • 2-way set associative cache • N-way set associative cache
Fully Associative Cache • [Figure: address split into tag (= block #, bits 31–5) and offset (bits 4–0); all tag-array entries are compared in parallel against the block #, and the data array supplies the line] • An address is partitioned into • an offset within the block • a block number • Each block may be mapped to each of the cache lines • Lookup: the block is searched in all lines • Each cache line has a tag • All tags are compared to the block # in parallel • Need a comparator per line • If one of the tags matches the block #, we have a hit • Supply data according to the offset • Best hit rate, but most wasteful • Must be relatively small
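As a rough software model (not the hardware itself), the lookup can be sketched in C as below; the sizes are hypothetical, and the loop only emulates what the hardware does with one comparator per line, all operating in parallel.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_LINES   8      /* hypothetical tiny fully associative cache */
#define OFFSET_BITS 5      /* 32-byte lines, as in the slides */

struct line {
    bool     valid;
    uint32_t tag;          /* tag = block number = address >> OFFSET_BITS */
};

static struct line cache[NUM_LINES];

/* Hardware compares every tag in parallel (one comparator per line);
 * this loop only emulates that sequentially. */
static bool fa_lookup(uint32_t addr, int *way)
{
    uint32_t block = addr >> OFFSET_BITS;
    for (int i = 0; i < NUM_LINES; i++) {
        if (cache[i].valid && cache[i].tag == block) {
            *way = i;
            return true;
        }
    }
    return false;
}

int main(void)
{
    int way;
    uint32_t addr = 0x12345678;
    printf("first access: %s\n", fa_lookup(addr, &way) ? "hit" : "miss");
    /* "Cacheline fill": place the block in some line, then look it up again. */
    cache[0] = (struct line){ .valid = true, .tag = addr >> OFFSET_BITS };
    printf("after fill:   %s\n", fa_lookup(addr, &way) ? "hit" : "miss");
    return 0;
}
```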
Fully Associative Cache • Is said to be a “CAM” • Content Addressable Memory
Direct Map Cache • [Figure: address split into tag (bits 31–14), set (bits 13–5), and offset (bits 4–0); 2^9 = 512 sets index the tag array and the data array] • Each memory block can only be mapped to a single cache line • Offset • Byte within the cache line • Set • The index into the “cache array” and the “tag array” • For a given set (an index), only one of the blocks that map to this set can reside in the cache at a time • Tag • The remaining block bits are used as the tag • The tag uniquely identifies the memory block • Must compare the tag stored in the tag array to the tag of the address
Direct Map Cache (cont.) • [Figure: memory partitioned into cache-size slices; the block at offset X within every slice maps to set X] • Partition memory into slices • Slice size = cache size • Partition each slice into blocks • Block size = cache line size • The distance of a block from the start of its slice indicates its position in the cache (set) • Advantages • Easy & fast hit/miss resolution • Easy & fast replacement algorithm • Lowest power • Disadvantage • A line has only “one chance” • Lines are replaced due to “conflict misses” • The organization with the highest miss rate
Direct Map Cache – Example • [Figure: address split into tag (bits 31–14), set (bits 13–5), offset (bits 4–0); the tag array is compared against the address tag to produce hit/miss] • Line size: 32 bytes ⇒ 5 offset bits • Cache size: 16 KB = 2^14 bytes • #lines = cache size / line size = 2^14 / 2^5 = 2^9 = 512 • #sets = #lines = 512 ⇒ #set bits = 9 (bits 5…13) • #tag bits = 32 – (#set bits + #offset bits) = 32 – (9 + 5) = 18 (bits 14…31) • Lookup address: 0x12345678 = 0001 0010 0011 0100 0101 0110 0111 1000 • tag = 0x048D1, set = 0x0B3, offset = 0x18
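The field extraction in this example is just shifting and masking; the following C sketch (names and structure are mine) reproduces the tag/set/offset values above for address 0x12345678.

```c
#include <stdint.h>
#include <stdio.h>

/* Field widths from the example: 32-byte lines, 16 KB direct-mapped cache */
#define OFFSET_BITS 5
#define SET_BITS    9

int main(void)
{
    uint32_t addr   = 0x12345678;
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);          /* bits 4..0   */
    uint32_t set    = (addr >> OFFSET_BITS) & ((1u << SET_BITS) - 1); /* bits 13..5 */
    uint32_t tag    = addr >> (OFFSET_BITS + SET_BITS);          /* bits 31..14 */

    /* Prints: tag=0x48d1 set=0xb3 offset=0x18 */
    printf("tag=0x%x set=0x%x offset=0x%x\n", tag, set, offset);
    return 0;
}
```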
Direct map (tiny example) • Assume • Memory size is 2^5 = 32 bytes ⇒ need a 5-bit address • A block is comprised of 4 bytes ⇒ there are exactly 8 blocks • Note • Need only 3 bits to identify a block • The offset is used exclusively within the cache lines • The offset is not used to locate the cache line • [Figure: the 5-bit address split into a block index and an offset within the block; example addresses 00001, 01110, 11111]
Direct map (tiny example) • Further assume • The size of our cache is 2 cache lines (⇒ need 5 − 2 − 1 = 2 tag bits) • The address divides like so: b4 b3 | b2 | b1 b0 = tag | set | offset • [Figure: even and odd cache lines; tag array (bits), data array (bytes), memory array (bytes)]
Direct map (tiny example) • Accessing address 00010 (= marked “C” in the figure) • The address divides like so: b4 b3 | b2 | b1 b0 ⇒ tag (00) | set (0) | offset (10) • [Figure: tag array (bits), cache array (bytes), memory array (bytes)]
Direct map (tiny example) • Accessing address 01010 (= Y): tag (01) | set (0) | offset (10)
Direct map (tiny example) • Accessing address 10010 (= Q): tag (10) | set (0) | offset (10)
Direct map (tiny example) • Accessing address 11010 (= J): tag (11) | set (0) | offset (10)
Direct map (tiny example) • Accessing address 00110 (= B): tag (00) | set (1) | offset (10)
Direct map (tiny example) • Accessing address 01110 (= Y): tag (01) | set (1) | offset (10)
Direct map (tiny example) • Now assume • The size of our cache is 4 cache lines (⇒ need 5 − 2 − 2 = 1 tag bit) • The address divides like so: b4 | b3 b2 | b1 b0 = tag | set | offset • [Figure: memory array (bytes), tag array (bits), cache array (bytes)]
2-Way Set Associative Cache • [Figure: address split into tag (bits 31–13), set (bits 12–5), offset (bits 4–0); two ways, each with its own tag array and cache storage] • Each set holds two lines (way 0 and way 1) • Each block can be mapped into one of two lines in the appropriate set (HW checks both ways in parallel) • The cache is effectively partitioned into two • Example: line size 32 bytes, cache size 16 KB • #lines = 512, #sets = 256 • Offset bits: 5, set bits: 8, tag bits: 19 • Address 0x12345678 = 0001 0010 0011 0100 0101 0110 0111 1000 • Offset: 1 1000 = 0x18 = 24 • Set: 1011 0011 = 0x0B3 = 179 • Tag: 000 1001 0001 1010 0010 = 0x091A2
2-Way Cache – Hit Decision • [Figure: the set bits index the tag and data arrays of both ways; the two stored tags are compared against the address tag, and the comparison results drive a MUX that selects the data out and the hit/miss signal]
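A software sketch of the same hit decision is shown below (the sizes match the 16 KB, 2-way example; in hardware the two tag comparisons happen in parallel and a mux picks the matching way’s data, whereas the loop here is only a sequential model).

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 5                   /* 32-byte lines */
#define SET_BITS    8                   /* 256 sets, 16 KB 2-way cache */
#define NUM_SETS    (1u << SET_BITS)

struct way {
    bool     valid;
    uint32_t tag;
    uint8_t  data[1 << OFFSET_BITS];
};

static struct way cache[NUM_SETS][2];   /* [set][way] */

/* Hit decision: check both ways of the selected set; on a match,
 * the "mux" selects that way's data at the given offset. */
static bool lookup_2way(uint32_t addr, uint8_t *out)
{
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t set    = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + SET_BITS);

    for (int w = 0; w < 2; w++) {
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            *out = cache[set][w].data[offset];
            return true;                /* hit */
        }
    }
    return false;                       /* miss */
}

int main(void)
{
    uint8_t byte;
    uint32_t addr = 0x12345678;
    uint32_t set  = (addr >> OFFSET_BITS) & (NUM_SETS - 1);

    /* Install the block in way 1 of its set, then look it up. */
    cache[set][1].valid = true;
    cache[set][1].tag   = addr >> (OFFSET_BITS + SET_BITS);
    printf("%s\n", lookup_2way(addr, &byte) ? "hit" : "miss");
    return 0;
}
```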
2-Way Set Associative Cache (cont.) • [Figure: memory partitioned into way-size slices; the block at offset X within every slice maps to set X] • Partition memory into “slices” (“ways”) • Slice size = way size = ½ cache size • Partition each slice into blocks • Block size = cache line size • The distance of a block from the start of its slice indicates its position in the cache (set) • Compared to a direct map cache • Half-size slices ⇒ 2× #slices ⇒ 2× #blocks mapped to each cache set • But each set can hold 2 blocks at a given time • ++ Fewer collisions/evictions • −− More logic, more power consuming
N-way set associative cache • Similar to 2-way, with N lines per set • At the extreme, every cache line is a way ⇒ a fully associative cache
Cache organization summary • Increasing set associativity • Improves hit rate • Increases power consumption • Increases access time • Strike a balance
Cache Read Miss • On a read miss – perform a cache line fill • Fetch the entire block that contains the missing data from memory • The block is fetched into the cache line fill buffer • May take a few bus cycles to complete the fetch • e.g., 64-bit (8-byte) data bus, 32-byte cache line ⇒ 4 bus cycles • Can stream (forward) the critical chunk into the core before the line fill ends • Once the entire block is fetched into the fill buffer, it is moved into the cache
Cache Replacement Policy • Direct map cache – easy • A new block is mapped to a single line in the cache • The old line is evicted (re-written to memory if needed) • N-way set associative cache – harder • Choose a victim from all the ways in the appropriate set • But which? To determine, use a replacement algorithm • Example replacement policies • Optimum (theoretical, postmortem, called “Belady”) • FIFO (First In First Out) • Random • LRU (Least Recently Used) – a decent approximation of Belady • More on this next week…
LRU Implementation • 2 ways • 1 bit per set to mark the latest way accessed in the set • Evict the way not pointed to by the bit • k-way set associative LRU • Requires a full ordering of way accesses • Algorithm, when way i is accessed: x = counter[i]; counter[i] = k−1; for (j = 0 to k−1) if ((j ≠ i) && (counter[j] > x)) counter[j]−−; • When replacement is needed: evict the way whose counter = 0 • Expensive even for small k’s • Because it is invoked for every load/store • Need a log2(k)-bit counter per line • Example (k = 4): initial counters of ways 0,1,2,3 are 0,1,2,3; after accessing way 2: 0,1,3,2; after accessing way 0: 3,0,2,1
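A C sketch of the counter scheme for k = 4 is shown below (mine, not from the slides); replaying the slide’s access sequence reproduces the counter states listed above.

```c
#include <stdio.h>

#define K 4                                  /* associativity */

static int counter[K] = { 0, 1, 2, 3 };      /* initial state from the example */

/* On an access to way i: ways whose counter exceeds i's old value move
 * one step closer to LRU, and way i becomes MRU (counter = K-1). */
static void lru_access(int i)
{
    int x = counter[i];
    for (int j = 0; j < K; j++)
        if (j != i && counter[j] > x)
            counter[j]--;
    counter[i] = K - 1;
}

/* The victim is the way whose counter is 0 (least recently used). */
static int lru_victim(void)
{
    for (int j = 0; j < K; j++)
        if (counter[j] == 0)
            return j;
    return -1;                               /* unreachable: counters stay a permutation */
}

int main(void)
{
    lru_access(2);                           /* counters become 0 1 3 2 */
    lru_access(0);                           /* counters become 3 0 2 1 */
    printf("victim = way %d\n", lru_victim()); /* prints: victim = way 1 */
    return 0;
}
```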
Pseudo LRU (PLRU) • In practice, it’s sufficient to approximate LRU efficiently • Maintain k−1 bits, instead of k∙log2(k) bits • Assume k=4, and let’s enumerate the ways’ cache lines • We need 2 bits: cache line 00, cl-01, cl-10, and cl-11 • Use a binary search tree to represent the 4 cache lines • Set each of the 3 (= k−1) internal nodes to hold a bit variable: B0, B1, and B2 • Whenever accessing a cache line b1b0 • Set the bit variables Bj along the path to the corresponding cache-line bit • Think of the bit value of Bj as “the right side was referenced more recently” • Need to evict? Walk the tree as follows: • Go left if Bj = 1; go right if Bj = 0 • Evict the leaf you’ve reached (= the opposite direction relative to previous insertions) • [Figure: the tree with root B0, internal nodes B1 and B2, and leaves for cache lines 11, 01, 10, 00]
Pseudo LRU (PLRU) – Example • Access 3 (11), 0 (00), 2 (10), 1 (01) ⇒ next victim is 3 (11), as expected • [Figure: the tree bits B0, B1, B2 after each access, and the eviction walk that reaches leaf 11]
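The following C sketch (my own bit convention, not necessarily the exact one in the figure) implements tree-PLRU for 4 ways; replaying the access sequence 3, 0, 2, 1 indeed leaves way 3 as the next victim.

```c
#include <stdio.h>

/* Tree-PLRU for 4 ways: 3 bits, B0 at the root, B1 over ways 00/01,
 * B2 over ways 10/11. Each bit records which side was used more recently. */
static int B0, B1, B2;

/* On access to way b1b0: update the bits along the path to that leaf. */
static void plru_access(int way)
{
    int b1 = (way >> 1) & 1, b0 = way & 1;
    B0 = b1;
    if (b1 == 0) B1 = b0;
    else         B2 = b0;
}

/* Victim: at every level, walk away from the recently used side. */
static int plru_victim(void)
{
    int v1 = !B0;
    int v0 = (v1 == 0) ? !B1 : !B2;
    return (v1 << 1) | v0;
}

int main(void)
{
    int seq[] = { 3, 0, 2, 1 };
    for (int i = 0; i < 4; i++)
        plru_access(seq[i]);
    printf("next victim = way %d\n", plru_victim());  /* prints 3 */
    return 0;
}
```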
LRU vs. Random vs. FIFO • LRU: hardest • FIFO: easier, approximates LRU (evicts the oldest rather than the least recently used) • Random: easiest • Results: • Misses per 1000 instructions in the L1-d, on average • Averaged across ten SPECint2000 / SPECfp2000 benchmarks • PLRU turns out rather similar to LRU
Effect of Cache on Performance • MPKI (misses per kilo-instruction) • Average number of misses for every 1000 instructions • MPKI = memory accesses per kilo-instruction × miss rate • Memory stall cycles = #memory accesses × miss rate × miss penalty cycles = IC/1000 × MPKI × miss penalty cycles • CPU time = (CPU execution cycles + memory stall cycles) × cycle time = IC/1000 × (1000 × CPI_execution + MPKI × miss penalty cycles) × cycle time
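As a worked illustration of these formulas (all the input numbers below are assumed for the example, not taken from the slides), a short C program:

```c
#include <stdio.h>

int main(void)
{
    /* Assumed inputs, for illustration only. */
    double ic            = 1e9;     /* instruction count */
    double mem_per_kilo  = 300.0;   /* memory accesses per 1000 instructions */
    double miss_rate     = 0.02;    /* 2% miss rate */
    double miss_penalty  = 100.0;   /* cycles */
    double cpi_execution = 1.0;     /* CPI without memory stalls */
    double cycle_time    = 0.5e-9;  /* seconds, i.e. a 2 GHz clock */

    double mpki = mem_per_kilo * miss_rate;                 /* 6 misses / 1000 instr */
    double stall_cycles = ic / 1000.0 * mpki * miss_penalty;
    double cpu_time = ic / 1000.0 *
                      (1000.0 * cpi_execution + mpki * miss_penalty) * cycle_time;

    /* Prints: MPKI = 6.0, stall cycles = 600000000, CPU time = 0.800 s */
    printf("MPKI = %.1f, stall cycles = %.0f, CPU time = %.3f s\n",
           mpki, stall_cycles, cpu_time);
    return 0;
}
```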
Memory Update Policy on Writes • Write back: Lazy writes to next cache level; prefer cache • Write through: Immediately update next cache level
Write Back: Cheaper writes • Store operations that hit the cache • Write only to the cache; the next cache level (or memory) is not accessed • The line is marked as “modified” or “dirty” • When evicted, the line is written to the next level only if dirty • Pros: • Saves memory accesses when a line is updated more than once • Attractive for multicore/multiprocessor • Cons: • On eviction, the entire line must be written to memory (there’s no indication of which bytes within the line were modified) • A read miss might require writing to memory (if the evicted line is dirty)
Write Through: Cheaper evictions • Stores that hit the cache • Write to the cache, and • Write to the next cache level (or memory) • Need to write only the bytes that were changed • Not the entire line ⇒ less work • When evicting, no need to write to the next cache level • Lines are never dirty, so they don’t need to be written back • Still need to throw stuff out, though • Use write buffers • To mask waiting for the lower-level memory
Write through: need a write buffer • [Figure: Processor → Cache, with a write buffer between the cache and DRAM] • A write buffer sits between the cache & memory • Processor core: writes data into the cache & the write buffer • The write buffer allows the processor to avoid stalling on writes • Works OK if the rate of stores << 1 / DRAM write cycle • Otherwise the store buffer overflows, no matter how big it is • Write combining • Combine adjacent writes to the same location in the write buffer • Note: on a cache miss, need to look up the write buffer (or drain it)
Cache Write Miss • The processor is not waiting for the data • It continues to work • Option 1: Write allocate: fetch the line into the cache • Goes with the write-back policy • Because, with write back, write ops are quicker if the line is in the cache • Assumes more writes/reads to the cache line will be performed soon • Hopes that subsequent accesses to the line hit the cache • Option 2: Write no-allocate: do not fetch the line into the cache • Goes with the write-through policy • Subsequent writes would update memory anyhow • (If read ops occur, the first read will bring the line into the cache) • (A control-flow sketch of both options follows)
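The sketch below (a toy, single-line “cache” with made-up names, meant only to show the control flow) contrasts the two options on a write miss.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum policy { WRITE_ALLOCATE, WRITE_NO_ALLOCATE };

/* Toy model: one one-byte cache "line" and a tiny flat memory. */
static uint8_t  memory[64];
static uint8_t  line;
static uint32_t line_addr;
static bool     line_valid;

static void handle_store(uint32_t addr, uint8_t v, enum policy p)
{
    if (line_valid && line_addr == addr) {      /* write hit */
        line = v;                               /* write back would also mark dirty */
        return;
    }
    if (p == WRITE_ALLOCATE) {                  /* usually paired with write back */
        line = memory[addr];                    /* 1. fetch the line into the cache */
        line_addr  = addr;
        line_valid = true;
        line = v;                               /* 2. then perform the write in the cache */
    } else {                                    /* usually paired with write through */
        memory[addr] = v;                       /* update memory only; no line fill */
    }
}

int main(void)
{
    handle_store(7, 42, WRITE_NO_ALLOCATE);     /* miss: memory updated, nothing cached */
    printf("memory[7]=%d cached=%d\n", memory[7], line_valid);
    handle_store(7, 43, WRITE_ALLOCATE);        /* miss: line filled, then written in cache */
    printf("cached line holds %d (memory still %d until eviction)\n", line, memory[7]);
    return 0;
}
```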