Lecture 4.3. Memory Hierarchy: Improving Cache Performance

Presentation Transcript


  1. Lecture 4.3 Memory Hierarchy: Improving Cache Performance

  2. Learning Objectives • Calculate the effective CPI and the average memory access time • Given a 32-bit address, specify the index and tag for direct mapped cache, n-way set associative cache, and fully associative cache, respectively • Calculate the effective CPI for multiple-level caches

  3. Coverage • Textbook: Chapter 5.4

  4. Measuring Cache Performance
  • Components of CPU time:
    • Program execution cycles - includes cache hit time
    • Memory stall cycles - mainly from cache misses
  • CPU time = IC × CPI × CC = IC × (CPI_ideal + average memory stall cycles per instruction) × CC

  5. Memory Stall Cycles
  • With simplifying assumptions:
    Memory stall cycles = IC × (Misses / Instruction) × Miss penalty
                        = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty

  6. Cache Performance Example
  • Given:
    • I-cache miss rate = 2%
    • D-cache miss rate = 4%
    • Miss penalty = 100 cycles
    • Base CPI (ideal cache) = 2
    • Loads & stores are 36% of instructions
  • Memory stall cycles per instruction:
    • I-cache: 0.02 × 100 = 2 (every instruction is fetched from the I-cache)
    • D-cache: 0.36 × 0.04 × 100 = 1.44 (36% of instructions access the D-cache)
  • Actual CPI = 2 + 2 + 1.44 = 5.44
  • The real execution is 5.44/2 = 2.72 times slower than the ideal case
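A quick way to verify this arithmetic is to script it. Below is a minimal sketch (plain Python; the values come from the slide, the variable names are mine):

```python
# Effective CPI = base CPI + memory stall cycles per instruction
base_cpi     = 2.0    # CPI with an ideal cache
icache_miss  = 0.02   # I-cache miss rate; every instruction is fetched
dcache_miss  = 0.04   # D-cache miss rate
mem_frac     = 0.36   # fraction of instructions that are loads/stores
miss_penalty = 100    # cycles

stall_i = icache_miss * miss_penalty             # 2.0 cycles/instruction
stall_d = mem_frac * dcache_miss * miss_penalty  # 1.44 cycles/instruction
cpi     = base_cpi + stall_i + stall_d

print(cpi)             # 5.44
print(cpi / base_cpi)  # 2.72x slower than the ideal case
```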

  7. Average Memory Access Time
  • Average memory access time (AMAT) = Hit time + Miss rate × Miss penalty
  • Hit time is also important for performance
  • Example: CPU with 1 ns clock, hit time = 1 cycle, miss penalty = 20 cycles, cache miss rate = 5%
    • AMAT = 1 + 0.05 × 20 = 2 cycles (i.e., 2 ns)
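The same check for AMAT, using the slide's numbers (a sketch; names are illustrative):

```python
# AMAT = hit time + miss rate * miss penalty, all measured in cycles here
hit_time, miss_rate, miss_penalty = 1, 0.05, 20
amat = hit_time + miss_rate * miss_penalty
print(amat)  # 2.0 cycles = 2 ns with the 1 ns clock
```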

  8. Performance Summary
  • As CPU performance increases, the miss penalty becomes more significant
  • Decreasing the base CPI: a greater proportion of time is spent on memory stalls
  • Increasing the clock rate: memory stalls account for more CPU cycles
  • Can't neglect cache behavior when evaluating system performance

  9. Mechanisms to Reduce Cache Miss Rates • 1. Allow more flexible block placement • 2. Use multiple levels of caches

  10. Reducing Cache Miss Rates #1
  1. Allow more flexible block placement
  • In a direct mapped cache a memory block maps to exactly one cache block
  • At the other extreme, a memory block could be allowed to map to any cache block - a fully associative cache
  • A compromise is to divide the cache into sets, each of which consists of n "ways" (n-way set associative)
  • A memory block maps to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices)

  11. Associative Caches
  • Fully associative (only one set):
    • Allows a given block to go into any cache entry
    • Requires all entries to be searched at once
    • One comparator per entry (expensive)
  • n-way set associative:
    • Each set contains n entries
    • The block address determines the set: (block address) modulo (#sets in cache)
    • Search all entries in a given set at once
    • n comparators (less expensive)
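The set-mapping rule above is easy to sanity-check in a few lines; a minimal sketch (the helper name is mine):

```python
def set_index(block_addr: int, num_sets: int) -> int:
    # A memory block can only live in set (block address) mod (#sets)
    return block_addr % num_sets

# A 4-block cache organized 2-way set associative has 2 sets:
# block addresses 0, 6 and 8 all land in set 0
print([set_index(a, 2) for a in (0, 6, 8)])  # [0, 0, 0]
```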

  12. Associative Cache Example [figure not reproduced in the transcript]

  13. Spectrum of Associativity • For a cache with 8 entries: 1-way (direct mapped) = 8 sets of 1 block, 2-way = 4 sets of 2 blocks, 4-way = 2 sets of 4 blocks, 8-way (fully associative) = 1 set of 8 blocks [figure not reproduced in the transcript]

  14. Associativity Example
  • Compare caches containing 4 cache blocks, each block 1 word wide
  • Direct mapped, 2-way set associative, fully associative
  • Block address sequence: 0, 8, 0, 6, 8
  • Direct mapped (cache index = block address modulo 4):

    Block addr | Cache index | Hit/miss | Cache contents after access
    0          | 0           | miss     | Mem[0]
    8          | 0           | miss     | Mem[8]
    0          | 0           | miss     | Mem[0]
    6          | 2           | miss     | Mem[0], Mem[6]
    8          | 0           | miss     | Mem[8], Mem[6]

    5 misses

  15. Associativity Example
  • 2-way set associative (set = block address modulo 2, LRU replacement):

    Block addr | Set | Hit/miss | Set 0 contents after access
    0          | 0   | miss     | Mem[0]
    8          | 0   | miss     | Mem[0], Mem[8]
    0          | 0   | hit      | Mem[0], Mem[8]
    6          | 0   | miss     | Mem[0], Mem[6]  (LRU block 8 evicted)
    8          | 0   | miss     | Mem[8], Mem[6]  (LRU block 0 evicted)

    4 misses

  • Fully associative:

    Block addr | Hit/miss | Cache contents after access
    0          | miss     | Mem[0]
    8          | miss     | Mem[0], Mem[8]
    0          | hit      | Mem[0], Mem[8]
    6          | miss     | Mem[0], Mem[8], Mem[6]
    8          | hit      | Mem[0], Mem[8], Mem[6]

    3 misses
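The miss counts on slides 14-15 can be reproduced with a tiny LRU cache simulator. This is a sketch under the slides' assumptions (1-word blocks, LRU replacement), not production code:

```python
def simulate(seq, num_blocks, ways):
    """Return the miss count for a block-address sequence."""
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]   # each set: block addrs, LRU first
    misses = 0
    for addr in seq:
        s = sets[addr % num_sets]          # set determined by the index
        if addr in s:
            s.remove(addr)                 # hit: move to MRU position below
        else:
            misses += 1
            if len(s) == ways:             # set full: evict the LRU block
                s.pop(0)
        s.append(addr)                     # addr is now most recently used
    return misses

seq = [0, 8, 0, 6, 8]
print(simulate(seq, 4, 1))  # direct mapped:      5 misses
print(simulate(seq, 4, 2))  # 2-way:              4 misses
print(simulate(seq, 4, 4))  # fully associative:  3 misses
```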

  16. 01 4 00 01 0 4 00 0 01 4 00 0 01 4 Another Reference String Mapping • Consider the main memory word reference string 0 4 0 4 0 4 0 4 Start with an empty cache - all blocks initially marked as not valid miss miss miss miss 0 4 0 4 00 Mem(0) 00 Mem(0) 01 Mem(4) 00 Mem(0) 4 0 4 0 miss miss miss miss 01 Mem(4) 00 Mem(0) 01 Mem(4) 00 Mem(0) • 8 requests, 8 misses • Ping pong effect due to conflict misses - two memory locations that map into the same cache block

  17. Another Reference String Mapping
  • Now map the same reference string 0 4 0 4 0 4 0 4 into a 2-way set associative cache
  • Start with an empty cache - all blocks initially marked as not valid

  18. Another Reference String Mapping
  • Consider the main memory word reference string: 0 4 0 4 0 4 0 4
  • Start with an empty cache - all blocks initially marked as not valid
  • Words 0 and 4 map to the same set but occupy different ways: miss, miss, then hit, hit, hit, hit, hit, hit
  • 8 requests, 2 misses
  • Solves the ping pong effect caused by conflict misses in a direct mapped cache, since two memory locations that map into the same cache set can now co-exist!
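The LRU simulator sketched after slide 15 reproduces both outcomes (again a sketch, not the slides' own code):

```python
print(simulate([0, 4] * 4, 4, 1))  # direct mapped: 8 misses (ping pong)
print(simulate([0, 4] * 4, 4, 2))  # 2-way set associative: 2 misses
```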

  19. How Much Associativity
  • Increased associativity decreases the miss rate, but with diminishing returns
  • Simulation of a system with a 64KB D-cache, 16-word blocks, SPEC2000:
    • 1-way: 10.3%
    • 2-way: 8.6%
    • 4-way: 8.3%
    • 8-way: 8.1%

  20. Benefits of Set Associative Caches • Largest gains are in going from direct mapped cache to 2-way set associative cache (20+% reduction in miss rate)

  21. Set Associative Cache Organization [figure not reproduced in the transcript]

  22. Range of Set Associative Caches
  • For a fixed size cache, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets - decreases the size of the index by 1 bit and increases the size of the tag by 1 bit
  • Address fields: | Tag | Index | Block offset | Byte offset |

  23. Range of Set Associative Caches (cont.)
  • Address fields: | Tag | Index | Block offset | Byte offset |
    • Tag: used for tag compare
    • Index: selects the set
    • Block offset: selects the word in the block
  • Increasing associativity: fewer sets, so fewer index bits and larger tags; fully associative (only one set) has no index - the tag has all the bits except the block and byte offsets
  • Decreasing associativity: more sets, so more index bits; direct mapped (only one way) has the smallest tags and needs only a single comparator
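A hedged sketch of the field split described above (the geometry and address values are mine, chosen for illustration). Halving the number of sets moves one bit from the index into the tag, exactly the trade-off the slide describes:

```python
# Split a 32-bit byte address into tag / index / offset for a cache geometry
def split_address(addr: int, num_sets: int, block_bytes: int):
    offset_bits = block_bytes.bit_length() - 1   # log2(bytes per block)
    index_bits  = num_sets.bit_length() - 1      # log2(number of sets)
    offset = addr & (block_bytes - 1)
    index  = (addr >> offset_bits) & (num_sets - 1)
    tag    = addr >> (offset_bits + index_bits)
    return tag, index, offset

# Direct mapped, 4096 sets, 16-byte blocks: 16 tag / 12 index / 4 offset bits
print(split_address(0x12345678, 4096, 16))  # (4660, 1383, 8) = (0x1234, 0x567, 0x8)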

  24. Example: Tag Size
  • Cache: 4K blocks, 4-word block size, 32-bit byte address
  • Number of tag bits for the cache (?)
    • Direct mapped
    • 2-way set associative
    • 4-way set associative
    • Fully associative
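The slide leaves the counts as a question; the sketch below works them out under the stated parameters (4K blocks, 4-word = 16-byte blocks, so 4 offset bits). Treat the printed totals as my worked answer rather than slide content:

```python
from math import log2

NUM_BLOCKS, OFFSET_BITS, ADDR_BITS = 4096, 4, 32

for ways in (1, 2, 4, 4096):                 # 4096-way = fully associative
    num_sets   = NUM_BLOCKS // ways
    index_bits = int(log2(num_sets))         # 0 index bits when fully associative
    tag_bits   = ADDR_BITS - index_bits - OFFSET_BITS
    total_kbits = tag_bits * NUM_BLOCKS / 1024
    print(f"{ways:>4}-way: tag = {tag_bits} bits, total = {total_kbits:g} Kbits")

# Output: 1-way 16 bits (64 Kbits), 2-way 17 (68), 4-way 18 (72), fully 28 (112)
```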

  25. Replacement Policy
  • Direct mapped: no choice
  • Set associative:
    • Prefer a non-valid entry, if there is one
    • Otherwise, choose among entries in the set
  • Least-recently used (LRU): choose the one unused for the longest time
    • Simple for 2-way, manageable for 4-way, too hard beyond that
  • Random: gives approximately the same performance as LRU for high associativity

  26. Reducing Cache Miss Rates #2
  2. Use multiple levels of caches
  • Primary (L1) cache attached to the CPU: small, but fast; separate L1 I$ and L1 D$
  • Level-2 cache services misses from the L1 cache: larger, slower, but still faster than main memory; unified cache for both instructions and data
  • Main memory services L2 cache misses
  • Some high-end systems include an L3 cache

  27. Multilevel Cache Example
  • Given:
    • CPU base CPI = 1, clock rate = 4 GHz (cycle time = 0.25 ns)
    • Miss rate/instruction = 2%
    • Main memory access time = 100 ns
  • With just the primary cache:
    • Miss penalty = 100 ns / 0.25 ns = 400 cycles
    • Effective CPI = 1 + 0.02 × 400 = 9

  28. Example (cont.)
  • Now add an L2 cache:
    • Access time = 5 ns
    • Global miss rate to main memory = 0.5%
  • L1 miss with L2 hit: penalty = 5 ns / 0.25 ns = 20 cycles
  • L1 miss with L2 miss: extra penalty = 400 cycles
  • CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
  • Performance ratio = 9 / 3.4 = 2.6
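A scripted version of the two examples on slides 27-28 (a minimal sketch; variable names are mine):

```python
cycle_ns = 0.25                    # 4 GHz clock -> 0.25 ns cycle
l1_miss, l2_global_miss = 0.02, 0.005
l2_hit_cycles = 5 / cycle_ns       # 20 cycles to reach L2
mem_cycles    = 100 / cycle_ns     # 400 cycles to reach main memory

cpi_l1_only = 1 + l1_miss * mem_cycles                  # 9.0
cpi_l1_l2   = 1 + l1_miss * l2_hit_cycles \
                + l2_global_miss * mem_cycles           # 3.4

print(cpi_l1_only, cpi_l1_l2, cpi_l1_only / cpi_l1_l2)  # 9.0 3.4 ~2.6
```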

  29. Multilevel Cache Considerations
  • Primary (L1) cache:
    • Focus on minimal hit time
    • Smaller total size with smaller block size
  • L2 cache:
    • Focus on low miss rate to avoid main memory accesses
    • Hit time has less overall impact
    • Larger total size with larger block size
    • Higher levels of associativity

  30. Global vs. Local Miss Rate
  • Global miss rate (GR): the fraction of references that miss in all levels of a multilevel cache
    • Dictates how often main memory is accessed
  • Global miss rate so far (GRS): the fraction of references that miss up to a certain level in a multilevel cache
  • Local miss rate (LR): the fraction of references to one level of a cache that miss
  • The L2$ local miss rate >> the global miss rate

  31. Example
  • LR1 = 5%, LR2 = 20%, LR3 = 50%
  • GR = LR1 × LR2 × LR3 = 0.05 × 0.2 × 0.5 = 0.005
  • GRS1 = LR1 = 0.05
  • GRS2 = LR1 × LR2 = 0.05 × 0.2 = 0.01
  • GRS3 = LR1 × LR2 × LR3 = 0.05 × 0.2 × 0.5 = 0.005
  • CPI = 1 + GRS1 × Pty1 + GRS2 × Pty2 + GRS3 × Pty3 = 1 + 0.05 × 10 + 0.01 × 20 + 0.005 × 100 = 2.2
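The same pattern generalizes to any number of levels; a sketch using the slide's numbers (penalties of 10, 20, and 100 cycles charged per miss level):

```python
local_rates = [0.05, 0.20, 0.50]   # LR1, LR2, LR3
penalties   = [10, 20, 100]        # Pty1, Pty2, Pty3

cpi, grs = 1.0, 1.0
for lr, pty in zip(local_rates, penalties):
    grs *= lr                      # GRS_i = LR1 x ... x LR_i
    cpi += grs * pty
print(grs, cpi)                    # 0.005 (global miss rate), 2.2
```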

  32. Sources of Cache Misses
  • Compulsory (cold start or process migration, first reference):
    • First access to a block - a "cold" fact of life, not much you can do about it; if you run millions of instructions, compulsory misses are insignificant
    • Solution: increase the block size (increases miss penalty; very large blocks could increase the miss rate)
  • Capacity:
    • The cache cannot contain all blocks accessed by the program
    • Solution: increase the cache size (may increase access time)
  • Conflict (collision):
    • Multiple memory locations mapped to the same cache location
    • Solution 1: increase the cache size
    • Solution 2: increase associativity (may increase access time)

  33. Cache Design Trade-offs

    Design change          | Effect on miss rate          | Possible negative performance effect
    Increase cache size    | Decreases capacity misses    | May increase access time
    Increase associativity | Decreases conflict misses    | May increase access time
    Increase block size    | Decreases compulsory misses  | Increases miss penalty; very large blocks may increase the miss rate
