
MAMAS – Computer Architecture 234367 Lectures 3-4 Memory Hierarchy and Cache Memories


Presentation Transcript


  1. MAMAS – Computer Architecture 234367, Lectures 3-4: Memory Hierarchy and Cache Memories. Dr. Avi Mendelson. Some of the slides were taken from: (1) Lihu Rapoport, (2) Randy Katz and (3) Patterson

  2. Technology Trends

  Capacity and speed trends:
  Logic:  capacity 2x in 3 years,  speed 2x in 3 years
  DRAM:   capacity 4x in 3 years,  speed 1.4x in 10 years
  Disk:   capacity 2x in 3 years,  speed 1.4x in 10 years

  DRAM generations (capacity improved ~1000:1, cycle time only ~2:1):
  Year   Size     Cycle Time
  1980   64 Kb    250 ns
  1983   256 Kb   220 ns
  1986   1 Mb     190 ns
  1989   4 Mb     165 ns
  1992   16 Mb    145 ns
  1995   64 Mb    120 ns

  3. Processor-DRAM Memory Gap (latency). [Chart: performance on a log scale (1 to 1000) vs. year, 1980-2000; CPU performance climbs away from DRAM performance, so the processor-memory performance gap grows about 50% per year.]

  4. Why can’t we build Memory at the same frequency as Logic? • It is too expensive to build a large memory with that technology • The size of the memory determines its access time: the larger, the slower • We do not aim for the best-performance solution. We aim for the best COST-EFFECTIVE solution (best performance for a given amount of money).

  5. Important observation – programs preserve locality (and we can help it) • Temporal Locality (Locality in Time): • If an item is referenced, it will tend to be referenced again soon • Example: code and variables in loops => keep the most recently accessed data items closer to the processor • Spatial Locality (Locality in Space): • If an item is referenced, nearby items tend to be referenced soon • Example: scanning an array • Example 2: program execution => move blocks of contiguous words closer to the processor • Locality + smaller HW is faster + Amdahl’s law => memory hierarchy

  6. The Goal: illusion of large, fast, and cheap memory • Fact: Large memories are slow, fast memories are small • How do we create a memory that is large, cheap and fast (most of the time)? • Hierarchy: [Diagram: CPU → Level 1 → Level 2 → Level 3 → Level 4; going down the hierarchy, speed goes from fastest to slowest, size from smallest to biggest, and cost from most expensive to cheapest]

  7. Levels of the Memory Hierarchy

  Level            Capacity     Access Time     Cost                  Managed by (staging)   Transfer unit
  Registers        100s Bytes   <10s ns                               prog./compiler         1-8 bytes (instr. operands)
  Cache            K Bytes      10-100 ns       $.01-.001/bit         cache controller       8-128 bytes (blocks)
  Main Memory      M Bytes      100 ns - 1 us   $.01-.001             OS                     512-4K bytes (pages)
  Disk             G Bytes      ms              10^-4 - 10^-3 cents   user/operator          Mbytes (files)
  Backup storage   infinite     sec-min         10^-6 cents

  Upper levels are faster and smaller; lower levels are larger and cheaper.

  8. Simple performance evaluation • Suppose we have a processor that can execute one instruction per cycle when working from the first level of the memory hierarchy (hits the L1). • Example: If the information is not found in the first level, the CPU waits for 10 cycles, and if it is found only in the third level, it costs another 100 cycles. [Diagram: CPU → Level 1 → Level 2 → Level 3]

  9. Cache Performance

  CPU time = (CPU execution cycles + Memory stall cycles) × cycle time
  Memory stall cycles = Reads × Read miss rate × Read miss penalty + Writes × Write miss rate × Write miss penalty
  Memory stall cycles = Memory accesses × Miss rate × Miss penalty
  CPU time = IC × (CPI_execution + Memory accesses per instruction × Miss rate × Miss penalty) × Clock cycle time
  Misses per instruction = Memory accesses per instruction × Miss rate
  CPU time = IC × (CPI_execution + Misses per instruction × Miss penalty) × Clock cycle time
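
  A minimal C sketch of the last CPU-time formula above; the function and parameter names, and the sample numbers in main, are illustrative assumptions rather than values from the slides:

    #include <stdio.h>

    /* CPU time = IC x (CPI_execution + Misses per instruction x Miss penalty) x Clock cycle time */
    static double cpu_time(double ic,            /* instruction count                    */
                           double cpi_exec,      /* CPI when every access hits the cache */
                           double acc_per_instr, /* memory accesses per instruction      */
                           double miss_rate,
                           double miss_penalty,  /* in cycles                            */
                           double cycle_time)    /* in seconds                           */
    {
        double misses_per_instr = acc_per_instr * miss_rate;
        return ic * (cpi_exec + misses_per_instr * miss_penalty) * cycle_time;
    }

    int main(void)
    {
        /* Illustrative numbers: 1e9 instructions, CPI=1, 1.5 accesses/instruction,
           5% miss rate, 50-cycle miss penalty, 1 ns cycle -> 4.75 s */
        printf("CPU time = %.2f s\n", cpu_time(1e9, 1.0, 1.5, 0.05, 50.0, 1e-9));
        return 0;
    }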

  10. Example • Consider a program that executes 10×10^6 instructions with CPI=1. • Each instruction causes (on average) 0.5 accesses to data • 95% of the accesses hit L1 • 50% of the accesses to L2 are misses and so need to be looked up in L3. • What is the slowdown due to the memory hierarchy? Solution • The program generates 15×10^6 memory accesses (10×10^6 instruction fetches + 0.5 × 10×10^6 data accesses) that could be executed in 10×10^6 cycles if all the information were at level 1 • 0.05 × 15×10^6 = 750,000 accesses go to L2, and half of them, 375,000, go on to L3. • New cycles = 10×10^6 + 10 × 750,000 + 100 × 375,000 = 55×10^6 • That is a 5.5× slowdown!
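
  The same arithmetic, spelled out as a short C program (variable names are illustrative); it prints 55000000 cycles and a 5.5x slowdown:

    #include <stdio.h>

    int main(void)
    {
        const double instructions = 10e6;                  /* CPI = 1                 */
        const double accesses     = 15e6;                  /* 10e6 fetches + 5e6 data */
        const double l2_accesses  = 0.05 * accesses;       /* 5% miss L1              */
        const double l3_accesses  = 0.50 * l2_accesses;    /* half also miss L2       */

        double cycles = instructions + 10.0 * l2_accesses + 100.0 * l3_accesses;
        printf("cycles = %.0f, slowdown = %.1fx\n", cycles, cycles / instructions);
        return 0;
    }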

  11. The first level of the memory hierarchy: Cache memories – Main Idea • At this point we assume only two levels of memory hierarchy: main memory and cache memory • For simplicity we also assume that the whole program (data and instructions) is placed in the main memory. • The cache memory(ies) is part of the processor • Same technology • Speed: same order of magnitude as accessing registers • Relatively small and expensive • Acts like a HASH function: holds parts of the program’s address space. • It needs to achieve: • Fast access time • Fast search mechanism • Fast replacement mechanism • High hit ratio

  12. Cache - Main Idea (cont) • When the processor needs an instruction or data it first looks for it in the cache. If that fails, it brings the data from the main memory into the cache and uses it from there. • The address space (or main memory) is partitioned into blocks • Typical block size is 32, 64 or 128 bytes • The block address is the address of the first byte in the block • The block address is aligned (a multiple of the block size) • The cache holds lines, each line holds a block • Need to determine which line the block is mapped to (if at all) • A block may not exist in the cache - a cache miss • If we miss the cache • The entire block is fetched into a line fill buffer (which may require a few bus cycles), and then put into the cache • Before putting the new block in the cache, another block may need to be evicted from the cache (to make room for the new block)

  13. Memory Hierarchy: Terminology • For each memory level we can define the following: • Hit: the data appears in the memory level • Hit Rate: the fraction of memory accesses which are hits • Hit Time: the time to access the memory level (including the time to determine hit/miss) • Miss: the data needs to be retrieved from the lower level • Miss Rate = 1 - (Hit Rate) • Miss Penalty: the time to replace a block in the current level + the time to deliver the data to the processor • Average memory-access time = t_effective = (Hit time × Hit Rate) + (Miss Time × Miss Rate) = (Hit time × Hit Rate) + (Miss Time × (1 - Hit Rate)) • If the hit rate is close to 1, t_effective is close to the Hit time

  14. Four Questions for Memory Hierarchy Designers In order to increase efficiency, we move data between the different memory levels in blocks; e.g., pages in main memory. In order to achieve that we need to answer (at least) 4 questions: • Q1: Where can a block be placed when brought in? (Block placement) • Q2: How is a block found when needed? (Block identification) • Q3: Which block should be replaced on a miss? (Block replacement) • Q4: What happens on a write? (Write strategy)

  15. Q1-2: Where can a block be placed and how can we find it? • Direct Mapped: Each block has only one place that it can appear in the cache. • Fully associative: Each block can be placed anywhere in the cache. • Set associative: Each block can be placed in a restricted set of places in the cache. • If there are n blocks in a set, the cache placement is called n-way set associative • What is the associativity of a direct mapped cache?

  16. Fully Associative Cache [Diagram: the address is split into Tag = Block# (bits 31-5) and Line Offset (bits 4-0); the tag is compared in parallel against every entry of the tag array, and a match selects the corresponding line in the data array and raises hit] • An address is partitioned into • offset within the block • block number • Each block may be mapped to any of the cache lines • need to look up the block in all lines • Each cache line has a tag • the tag is compared to the block number • If one of the tags matches the block# we have a hit and the line is accessed according to the line offset • need a comparator per line

  17. Fully associative - Cont • Advantages • Good utilization of the area, since any block in main memory can be mapped to any cache line • Disadvantages • A lot of hardware • Complicated hardware that slows down the access time.

  18. Direct Mapped Cache • The least-significant bits of the block number determine which cache line the block is mapped to - called the set number • Each block is mapped to a single line in the cache • If a block is mapped to the same line as another block, it will replace it. • The rest of the block number bits are used as a tag • Compared to the tag stored in the cache for the appropriate set [Diagram: the address is split into Tag (bits 31-14), Set (bits 13-5) and Line Offset (bits 4-0); 2^9 = 512 sets; the Set field indexes the tag array and the cache storage]

  19. Direct Map Cache (cont) • Memory is conceptually divided into slices whose size is the cache size • The offset from the slice start indicates the position in the cache (the set) • Addresses with the same offset map into the same line • One tag per line is kept • Advantages • Easy hit/miss resolution • Easy replacement algorithm • Lowest power and complexity • Disadvantage • Excessive line replacement due to “conflict misses” [Diagram: addresses X, X + cache size, X + 2 × cache size, ... all map to the same set]

  20. 2-Way Set Associative Cache [Diagram: the address is split into Tag (bits 31-13), Set (bits 12-5) and Line Offset (bits 4-0); the Set field indexes the tag array and cache storage of Way 0 and of Way 1] • Each set holds two lines (way 0 and way 1) • Each block can be mapped into one of two lines in the appropriate set • Example:
  Line size: 32 bytes, cache size: 16KB, # of lines: 512, # of sets: 256
  Offset bits: 5, set bits: 8, tag bits: 19
  Address 0x12345678 = 0001 0010 0011 0100 0101 0110 0111 1000 (binary)
  Offset: 1 1000 = 0x18 = 24
  Set: 1011 0011 = 0x0B3 = 179
  Tag: 000 1001 0001 1010 0010 = 0x091A2
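
  The address breakdown above can be reproduced with a few shifts and masks; a minimal C sketch for the slide's 16KB, 2-way, 32-byte-line cache (macro and variable names are illustrative):

    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 5   /* 32-byte line */
    #define SET_BITS    8   /* 256 sets     */

    int main(void)
    {
        uint32_t addr   = 0x12345678;
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t set    = (addr >> OFFSET_BITS) & ((1u << SET_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + SET_BITS);

        /* prints: offset=0x18 set=0xb3 tag=0x91a2 (matching the slide) */
        printf("offset=0x%x set=0x%x tag=0x%x\n", offset, set, tag);
        return 0;
    }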

  21. 2-Way Cache - Hit Decision [Diagram: the Set field of the address (bits 12-5) indexes one set in both ways; the stored tag of Way 0 and of Way 1 is each compared against the address Tag (bits 31-13); a match in either way asserts Hit/Miss and drives a MUX that selects the Data Out of the matching way]

  22. 2-Way Set Associative Cache (cont) • Memory is conceptually divided into slices whose size is 1/2 the cache size (the way size) • The offset from the slice start indicates the set#; each set now contains two potential lines! • Addresses with the same offset map into the same set • Two tags per set, one tag per line, are needed [Diagram: addresses X, X + way size, X + 2 × way size, ... all map to the same set]

  23. What happens on a Cache miss? • Read miss • Cache line fill - fetch the entire block that contains the missing data from memory • The block is fetched into the cache line fill buffer • It may take a few bus cycles to complete the fetch • e.g., 64 bit (8 byte) data bus, 32 byte cache line → 4 bus cycles • Once the entire line is fetched it is moved from the fill buffer into the cache • What happens on a write miss? • The processor does not wait for the data → it continues its work • 2 options: write allocate and no write allocate • Write allocate: fetch the line into the cache • Assumes that we may read from the line soon • Goes with a write back policy (hoping that subsequent writes to the line hit the cache) • Write no allocate: do not fetch the line into the cache on a write miss • Goes with a write through policy (subsequent writes would update memory anyhow)

  24. Replacement • Each line contains a Valid indication • Direct map: simple, a line can be brought to only one place • The old line is evicted (written back to memory, if needed) • n-ways: need to choose among the ways in the set • Options: FIFO, LRU, Random, Pseudo LRU • LRU is the best (on average) • LRU • 2 ways: requires 1 bit per set to mark the latest accessed way • 4 ways: • need to save the full ordering • Fully associative: • the full ordering cannot be saved (too many bits) • approximate LRU

  25. Implementing LRU in a k-way set associative cache • For each set hold a k×k bit matrix • Initialization (row i has its first i-1 bits set):
  row 1: 0 0 0 0 … 0
  row 2: 1 0 0 0 … 0
  row 3: 1 1 0 0 … 0
  …
  row N: 1 1 1 1 … 1 0
  • When line j (1 ≤ j ≤ k) is accessed: set all bits in row j to 1 (done in parallel by hardware), THEN reset all bits in column j to 0 (in the same cycle) • Evict the line whose row is ALL “0”
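
  A software model of the matrix scheme above, for one 4-way set (data layout and function names are illustrative; real hardware does the row-set and column-clear in a single cycle). Starting from an all-zero matrix instead of the slide's triangular initialization only affects the victim choice before every way has been touched once:

    #include <stdbool.h>
    #include <stdio.h>

    #define K 4                                       /* ways per set */

    typedef struct { bool m[K][K]; } lru_matrix_t;    /* one set's k x k bit matrix */

    /* Access to way j: set all bits of row j to 1, then clear all bits of column j. */
    static void lru_touch(lru_matrix_t *s, int j)
    {
        for (int c = 0; c < K; c++) s->m[j][c] = true;
        for (int r = 0; r < K; r++) s->m[r][j] = false;
    }

    /* The victim is the way whose row is all zeros (the least recently used way). */
    static int lru_victim(const lru_matrix_t *s)
    {
        for (int r = 0; r < K; r++) {
            bool all_zero = true;
            for (int c = 0; c < K; c++)
                if (s->m[r][c]) all_zero = false;
            if (all_zero) return r;
        }
        return 0;                                     /* not reached for a consistent state */
    }

    int main(void)
    {
        lru_matrix_t set = {{{false}}};
        int order[] = {3, 0, 2, 1};                   /* access order from the next slide's example */
        for (int i = 0; i < 4; i++) lru_touch(&set, order[i]);
        printf("victim = way %d\n", lru_victim(&set));   /* prints: victim = way 3 */
        return 0;
    }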

  26. Pseudo LRU [Diagram: a two-level decision tree over ways 0-3; bit0 at the root chooses between the {0,1} pair and the {2,3} pair, bit1 and bit2 choose within each pair] • We will use a 4-way set associative cache as an example. • Full LRU records the full order of way accesses in each set (which way was most recently accessed, which was second, and so on). • Pseudo LRU (PLRU) records a partial order, using 3 bits per set: • Bit0 specifies whether the LRU way is one of 0 and 1 or one of 2 and 3 • Bit1 specifies which of ways 0 and 1 was least recently used • Bit2 specifies which of ways 2 and 3 was least recently used • For example, if the order in which the ways were accessed is 3,0,2,1, then bit0=1, bit1=1, bit2=1
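
  A matching C sketch of the 3-bit PLRU state for one 4-way set. The exact bit polarity is an assumption inferred from the slide's example (access order 3,0,2,1 ends with bit0=bit1=bit2=1); names are illustrative:

    #include <stdio.h>

    /* bit0 = 1 -> the LRU way is in the {2,3} pair (0 -> in the {0,1} pair)
       bit1 = 1 -> way 0 is the LRU of {0,1}        (0 -> way 1)
       bit2 = 1 -> way 3 is the LRU of {2,3}        (0 -> way 2)            */
    typedef struct { unsigned bit0, bit1, bit2; } plru_t;

    static void plru_touch(plru_t *s, int way)
    {
        if (way < 2) {
            s->bit0 = 1;              /* the LRU side is now the other pair     */
            s->bit1 = (way == 1);     /* the untouched way of the pair is LRU   */
        } else {
            s->bit0 = 0;
            s->bit2 = (way == 2);
        }
    }

    static int plru_victim(const plru_t *s)
    {
        if (s->bit0) return s->bit2 ? 3 : 2;
        return s->bit1 ? 0 : 1;
    }

    int main(void)
    {
        plru_t s = {0, 0, 0};
        int order[] = {3, 0, 2, 1};
        for (int i = 0; i < 4; i++) plru_touch(&s, order[i]);
        printf("bits=%u%u%u victim=way %d\n", s.bit0, s.bit1, s.bit2, plru_victim(&s));
        /* prints: bits=111 victim=way 3 */
        return 0;
    }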

  27. Write Buffer for Write Through [Diagram: Processor → Cache → DRAM, with a Write Buffer between the Cache and DRAM] • A Write Buffer is needed between the Cache and Memory • The write buffer is just a FIFO • Processor: writes data into the cache and into the write buffer • Memory controller: writes the contents of the buffer to memory • Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle • Problem: store frequency (w.r.t. time) > 1 / DRAM write cycle • If this persists for a long period of time (the CPU cycle time is too short, i.e., CPU cycle time ≤ DRAM write cycle time, and/or too many store instructions in a row): the store buffer will overflow no matter how big you make it • Write combining: combine writes in the write buffer • On a cache miss we need to look up the write buffer
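
  A minimal write-buffer sketch in C: a small FIFO of pending stores sitting between the cache and DRAM, plus the lookup a read miss must perform. The entry count, the field names and the dram_write callback are illustrative assumptions:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define WB_ENTRIES 8

    typedef struct { uint32_t addr, data; } wb_entry_t;

    typedef struct {
        wb_entry_t e[WB_ENTRIES];
        size_t head, tail, count;                 /* FIFO state */
    } write_buffer_t;

    /* Processor side: returns false (i.e., stall) if the buffer is full. */
    bool wb_push(write_buffer_t *wb, uint32_t addr, uint32_t data)
    {
        if (wb->count == WB_ENTRIES) return false;
        wb->e[wb->tail] = (wb_entry_t){addr, data};
        wb->tail = (wb->tail + 1) % WB_ENTRIES;
        wb->count++;
        return true;
    }

    /* Memory-controller side: drain one entry per DRAM write cycle. */
    bool wb_drain_one(write_buffer_t *wb, void (*dram_write)(uint32_t, uint32_t))
    {
        if (wb->count == 0) return false;
        dram_write(wb->e[wb->head].addr, wb->e[wb->head].data);
        wb->head = (wb->head + 1) % WB_ENTRIES;
        wb->count--;
        return true;
    }

    /* On a cache (read) miss the write buffer must also be searched;
       the youngest matching store wins. */
    bool wb_lookup(const write_buffer_t *wb, uint32_t addr, uint32_t *data)
    {
        bool found = false;
        size_t idx = wb->head;
        for (size_t i = 0; i < wb->count; i++, idx = (idx + 1) % WB_ENTRIES)
            if (wb->e[idx].addr == addr) { *data = wb->e[idx].data; found = true; }
        return found;
    }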

  28. Improving Cache Performance • Separating the data cache from the instruction cache (will be discussed in future lectures) • Reduce the miss rate • In order to reduce misses, we need to understand why misses happen • Reduce the miss penalty • Bring the information to the processor as soon as possible • Reduce the time to hit in the cache • By Amdahl’s law, since most of the time we hit the cache, it is important to make sure we accelerate the hit path.

  29. Classifying Misses: 3 Cs • Compulsory • The first access to a block cannot be in the cache, so the block must be brought into the cache • Also called cold start misses or first reference misses • These are misses even in an infinite cache • Solution (for a fixed cache-line size): prefetching • Capacity • The cache cannot contain all the blocks needed during program execution (also termed: the working set of the program is too big) • Blocks are evicted and later retrieved • Solution: increase cache size, stream buffers, software solutions • Conflict • Occurs in set associative or direct mapped caches when too many blocks map to the same set • Also called collision misses or interference misses • Solution: increase associativity, victim cache, linker optimizations

  30. 3Cs Absolute Miss Rate (SPEC92) [Chart: the miss rate broken down into its conflict, capacity and compulsory components; the compulsory component is vanishingly small]

  31. How Can We Reduce Misses? • 3 Cs: Compulsory, Capacity, Conflict • In all cases, assume total cache size is not changed: • What happens if: 1) Change Block Size: Which of 3Cs is obviously affected? 2) Change Associativity: Which of 3Cs is obviously affected? 3) Change Compiler: Which of 3Cs is obviously affected?

  32. Reduce Misses via Larger Block Size

  33. Reduce Misses via Higher Associativity • We have two conflicting trends here: • Higher associativity • improves the hit ratio, BUT • increases the access time • slows down the replacement • increases complexity • Most modern cache memory systems use at least 4-way set associative cache memories

  34. Example: Avg. Memory Access Time vs. Miss Rate • Example: assume a Cache Access Time of 1.10 for 2-way, 1.12 for 4-way and 1.14 for 8-way, relative to the CAT of a direct mapped cache

  Effective access time to the cache:
  Cache Size (KB)   1-way   2-way   4-way   8-way
  1                 2.33    2.15    2.07    2.01
  2                 1.98    1.86    1.76    1.68
  4                 1.72    1.67    1.61    1.53
  8                 1.46    1.48    1.47    1.43
  16                1.29    1.32    1.32    1.32
  32                1.20    1.24    1.25    1.27
  64                1.14    1.20    1.21    1.23
  128               1.10    1.17    1.18    1.20

  (Red in the original slide marks entries not improved by more associativity. Note this is for a specific example.)

  35. Reducing Miss Penalty by Critical Word First • Don’t wait for the full block to be loaded before restarting the CPU • Early restart • As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution • Critical Word First • Request the missed word first from memory and send it to the CPU as soon as it arrives • Let the CPU continue execution while filling the rest of the words in the block • Also called wrapped fetch and requested word first • Example: • 64 bit = 8 byte bus, 32 byte cache line → 4 bus cycles to fill the line • Fetch data from 95H [Diagram: the line’s four 8-byte chunks 80H-87H, 88H-8FH, 90H-97H, 98H-9FH are returned with the chunk containing 95H (90H-97H) first]

  36. Prefetchers • In order to avoid compulsory misses, we need to bring the information in before it is requested by the program • We can exploit the locality-of-reference behavior • Space → bring the surrounding data as well • Time → the same “patterns” repeat themselves • Prefetching relies on having extra memory bandwidth that can be used without penalty • There are hardware and software prefetchers.

  37. Hardware Prefetching • Instruction Prefetching • The Alpha 21064 fetches 2 blocks on a miss • The extra block is placed in a stream buffer in order to avoid possible cache pollution in case the prefetched instructions are not required • On a miss, check the stream buffer • Branch-predictor directed prefetching • Let the branch predictor run ahead • Data Prefetching • Try to predict future data accesses • Next sequential • Stride • General pattern

  38. Software Prefetching • Data Prefetch • Load data into a register (HP PA-RISC loads) • Cache Prefetch: load into the cache (MIPS IV, PowerPC, SPARC v. 9) • Special prefetching instructions cannot cause faults; a form of speculative execution • How it is done • A special prefetch intrinsic in the language • Automatically by the compiler • Issuing prefetch instructions takes time • Is the cost of issuing prefetches < the savings in reduced misses? • Wider superscalar issue reduces the difficulty of finding the issue bandwidth
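
  A small example of software prefetching in C using GCC/Clang's __builtin_prefetch intrinsic (a real builtin; the prefetch distance of 16 elements is an illustrative tuning knob, not a value from the slides):

    #include <stddef.h>

    double sum_with_prefetch(const double *a, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)
                /* args: address, rw (0 = read), temporal locality hint (0..3) */
                __builtin_prefetch(&a[i + 16], 0, 1);
            sum += a[i];
        }
        return sum;
    }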

  39. Other techniques

  40. Multi-ported cache and Banked Cache • An n-ported cache enables n cache accesses in parallel • Parallelize cache accesses in different pipeline stages • Parallelize cache accesses in a super-scalar processor • Effectively doubles the cache die size • Possible solution: a banked cache • Each line is divided into n banks • Can fetch data from k ≤ n different banks (in possibly different lines)

  41. Separate Code / Data Caches • Enables parallelism between data accesses (done in the memory access stage) and instruction fetch (done in the fetch stage of a pipelined processor) • The code cache is a read-only cache • No need to write the line back to memory when it is evicted • Simpler to manage • What about self-modifying code? (x86 only) • Whenever executing a memory write we need to snoop the code cache • If the code cache contains the written address, the line in which the address is contained is invalidated • Now the code cache is accessed both in the fetch stage and in the memory access stage • The tags need to be dual ported to avoid stalling

  42. Increasing the size with minimum latency loss - L2 cache • L2 is much larger than L1 (256KB-1MB compared to 32KB-64KB) • It used to be an off-chip cache (between the cache and the memory bus). Now, most implementations are on-chip (but some architectures have an off-chip level-3 cache) • If L2 is on-chip, why not just make L1 larger? • Can be inclusive: • All addresses in L1 are also contained in L2 • Data in L1 may be more up to date than in L2 • L2 is unified (code / data) • Most architectures do not require the caches to be inclusive (although, due to the size difference, they usually are)

  43. Victim Cache • Problem: the per-set load may be non-uniform • some sets may have more conflict misses than others • Solution: allocate ways to sets dynamically, according to the load • When a line is evicted from the cache it is placed in the victim cache • If the victim cache is full - its LRU line is evicted to L2 to make room for the new victim line from L1 • On a cache lookup, a victim cache lookup is also performed (in parallel) • On a victim cache hit, • the line is moved back to the cache • the evicted line is moved to the victim cache • Same access time as a cache hit • Especially effective for a direct mapped cache • Enables combining the fast hit time of a direct mapped cache while still reducing conflict misses
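
  A tags-only C sketch of the interaction described above: a direct-mapped L1 backed by a tiny fully associative victim cache with LRU replacement. Sizes, names and the age-counter LRU are illustrative assumptions, and eviction from the victim cache simply drops the line where real hardware would send it to L2:

    #include <stdbool.h>
    #include <stdint.h>

    #define L1_SETS  256
    #define VC_LINES 4

    typedef struct { bool valid; uint32_t tag; } l1_line_t;
    typedef struct { bool valid; uint32_t block; unsigned age; } vc_line_t;

    static l1_line_t l1[L1_SETS];
    static vc_line_t vc[VC_LINES];
    static unsigned now;                             /* access counter for LRU ages */

    bool cache_lookup(uint32_t block_addr)           /* true = hit (L1 or victim)   */
    {
        uint32_t set = block_addr % L1_SETS;
        uint32_t tag = block_addr / L1_SETS;
        now++;

        if (l1[set].valid && l1[set].tag == tag)
            return true;                             /* L1 hit                      */

        for (int i = 0; i < VC_LINES; i++) {         /* probe victim cache (in HW: in parallel) */
            if (vc[i].valid && vc[i].block == block_addr) {
                uint32_t evicted = l1[set].tag * L1_SETS + set;
                bool swap = l1[set].valid;
                l1[set] = (l1_line_t){true, tag};    /* move the line back into L1  */
                vc[i] = swap ? (vc_line_t){true, evicted, now}   /* swap roles      */
                             : (vc_line_t){false, 0, 0};
                return true;                         /* victim-cache hit            */
            }
        }

        if (l1[set].valid) {                         /* full miss: old L1 line goes to the VC */
            int lru = 0;                             /* invalid entries have age 0  */
            for (int i = 1; i < VC_LINES; i++)
                if (vc[i].age < vc[lru].age) lru = i;
            vc[lru] = (vc_line_t){true, l1[set].tag * L1_SETS + set, now};
        }
        l1[set] = (l1_line_t){true, tag};
        return false;
    }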

  44. Stream Buffers • Before inserting a new line into the cache, put it in a stream buffer • The line is moved from the stream buffer into the cache only if we get some indication that the line will be accessed again in the future • Example: • Assume that we scan a very large array (much larger than the cache), and we access each item in the array just once • If we insert the array into the cache it will thrash the entire cache • If we detect that this is just a scan-once operation (e.g., using a hint from the software) we can avoid putting the array lines into the cache

  45. Backup

  46. Compiler issues • Data Alignment • Misaligned access might span several cache lines • Prohibited in some architectures (Alpha, SPARC) • Very slow in others (x86) • Solution 1: add padding to data structures • Solution 2: make sure memory allocations are aligned • Code Alignment • Misaligned instruction might span several cache lines • x86 only. VERY slow. • Solution: insert NOPs to make sure instructions are aligned
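
  A hedged C sketch of both solutions mentioned above: aligning a structure with C11 _Alignas and aligning a heap allocation with aligned_alloc (both are standard C11; the 64-byte line size is an assumption):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Solution 1: pad/align the data structure so it never straddles a line. */
    struct counter {
        _Alignas(64) uint64_t value;   /* each counter starts on its own 64-byte line */
    };

    int main(void)
    {
        /* Solution 2: make sure heap allocations are aligned.
           aligned_alloc is C11 (the size must be a multiple of the alignment);
           posix_memalign is the POSIX alternative. */
        double *buf = aligned_alloc(64, 1024 * sizeof(double));
        if (!buf) return 1;
        printf("buf %% 64 = %lu\n", (unsigned long)((uintptr_t)buf % 64));  /* prints 0 */
        free(buf);
        return 0;
    }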

  47. Compiler issues 2 • Overalignment • The alignment of an array can be a multiple of the cache size • Several arrays then map to the same cache lines • Excessive conflict misses (thrashing), e.g. in: for (int i=0; i<N; i++) a[i] = a[i] + b[i] * c[i] • Solution 1: increase cache associativity • Solution 2: break the alignment (see the sketch below)
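
  A minimal C sketch of "break the alignment": carve the three arrays out of one allocation with a small gap between them, so a[i], b[i] and c[i] no longer fall into the same sets of a low-associativity cache. The array size, the 2-line pad and the 64-byte line size are illustrative assumptions:

    #include <stdlib.h>

    #define N (1 << 20)

    int main(void)
    {
        const size_t pad = 2 * 64 / sizeof(float);            /* 2 cache lines of padding  */
        float *block = calloc(3 * (size_t)N + 2 * pad, sizeof(float));
        if (!block) return 1;

        float *a = block;                                     /* arrays start at different */
        float *b = block + N + pad;                           /* offsets modulo the cache  */
        float *c = block + 2 * N + 2 * pad;                   /* (or way) size             */

        for (int i = 0; i < N; i++)
            a[i] = a[i] + b[i] * c[i];                        /* the loop from the slide   */

        free(block);
        return 0;
    }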
