Cache Part 2
Presentation Transcript
Timing • Total execution cycles = Program execution cycles + Memory stall cycles • Program execution cycles • Includes cache hit time • Memory stall cycles • Mainly from cache misses
Cache Performance Example • Given: • Level 1 only, separate Instruction and Data Caches • I-cache miss rate = 2% • D-cache miss rate = 4% • Miss penalty = 100 cycles / miss • Base CPI (including accessing ideal cache) = 2 • Loads & stores are 36% of instructions • What is the CPI cost of cache misses? How many times faster would a perfect cache be?
Cache Performance Example • Given: • Level 1 only, separate Instruction and Data Caches • I-cache miss rate = 2% • D-cache miss rate = 4% • Miss penalty = 100 cycles • Base CPI (including accessing ideal cache) = 2 • Loads & stores are 36% of instructions • Instruction misses • All instructions miss 2% at a cost of 100 cycles per miss: 0.02 × 100 = 2 CPI
Cache Performance Example • Given: • Level 1 only, separate Instruction and Data Caches • I-cache miss rate = 2% • D-cache miss rate = 4% • Miss penalty = 100 cycles • Base CPI (including accessing ideal cache) = 2 • Loads & stores are 36% of instructions • Data misses • 36% of instructions miss 4% of the time: 0.36 × 0.04 × 100 = 1.44 CPI
Cache Performance Example • Given: • Level 1 only, separate Instruction and Data Caches • I-cache miss rate = 2% • D-cache miss rate = 4% • Miss penalty = 100 cycles • Base CPI (including accessing ideal cache) = 2 • Loads & stores are 36% of instructions • Miss penalty = 2 + 1.44 = 3.44 CPI penalty • Total CPI = 2 + 3.44 = 5.44 • Speedup of perfect cache: 5.44 / 2 = 2.72 times
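The arithmetic on this slide can be checked with a short script using only the values given above (base CPI 2, 100-cycle miss penalty, 2% I-cache and 4% D-cache miss rates, 36% loads/stores):

```python
# Values taken from the slide.
base_cpi = 2.0          # CPI with an ideal cache
miss_penalty = 100      # cycles per miss
i_miss_rate = 0.02      # every instruction is fetched through the I-cache
d_miss_rate = 0.04      # only loads/stores go through the D-cache
mem_fraction = 0.36     # loads & stores as a fraction of instructions

i_stall = i_miss_rate * miss_penalty                 # 2 CPI from instruction fetches
d_stall = mem_fraction * d_miss_rate * miss_penalty  # 1.44 CPI from loads/stores
total_cpi = base_cpi + i_stall + d_stall             # 5.44 CPI
speedup = total_cpi / base_cpi                       # 2.72x for a perfect cache
print(total_cpi, speedup)
```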
Speedup vs Hit Rate • Need high hit rate for large speedup • k = 1 / cycles per miss https://ggbm.at/CXfa6mCP
Hierarchy • Often multiple levels of cache • Bigger # usually means • Larger cache • Slower
Process • I need memory location 0x000E • Is it in L1 cache? • Yes : Hit – use it • No : Miss – go search next level • Is it in L2? • Yes : Hit – use it • No : Miss – go search next level • Is it in L3… • Is it in memory…
Cache • L2 & L3 • May be on chip or board • May be shared by cores • ~1 MB (L2), ~5-10 MB (L3)
Differences • No hard rules about • What cache you have • Where it lives
Multi Level Example • Given • CPU base CPI = 1 • Clock rate = 4GHz (or 0.25 ns / cycle) • L1 miss rate/instruction = 2% • Main memory access time = 100ns • What is effective CPI?
Multi Level Example • Given • CPU base CPI = 1 • Clock rate = 4GHz (or 0.25 ns / cycle) • L1 miss rate/instruction = 2% • Main memory access time = 100ns / miss • Effective CPI = base + miss penalty • Miss penalty = 0.02 × (100 ns / 0.25 ns per cycle) = 0.02 × 400 cycles = 8 CPI • Effective CPI = 1 + 8 = 9 CPI
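The L1-only case works out as follows (all values from the slide: 4 GHz clock, 100 ns memory access, 2% miss rate per instruction):

```python
# L1-only effective CPI, values from the slide.
base_cpi = 1.0
cycle_ns = 0.25       # 4 GHz clock -> 0.25 ns per cycle
mem_ns = 100.0        # main memory access time per miss
l1_miss_rate = 0.02   # misses per instruction

penalty_cycles = mem_ns / cycle_ns                        # 400 cycles per miss
effective_cpi = base_cpi + l1_miss_rate * penalty_cycles  # 1 + 8 = 9 CPI
print(effective_cpi)
```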
Multi Level Example • Given • CPU base CPI = 1 • Clock rate = 4GHz (or 0.25 ns / cycle) • L1 miss rate/instruction = 2% • Main memory access time = 100ns • Add an L2 Cache • 5ns per access • 0.5% global miss rate to memory • i.e. 25% hit rate on misses from L1 (0.5 / 2) • How many times faster is it than just L1?
Multi Level Example • Total CPI = Base CPI + L1 miss and L2 hit + L2 miss • L1 miss and L2 hit: 0.02 × (5 ns / 0.25 ns per cycle) = 0.02 × 20 = 0.4 CPI • L2 miss: 0.005 × 400 = 2 CPI
Multi Level Example • Total CPI = Base CPI + L1 miss and L2 hit + L2 miss = 1 + 0.4 + 2 • Total CPI with L2 = 3.4 • Speedup over just L1: 9 / 3.4 = 2.6 times
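Extending the same script with the slide's L2 numbers (5 ns access, 0.5% global miss rate) reproduces the 3.4 CPI and the roughly 2.6x speedup:

```python
# Two-level cache CPI, values from the slide.
base_cpi = 1.0
cycle_ns = 0.25          # 4 GHz clock
l1_miss_rate = 0.02      # per instruction
l2_ns = 5.0              # L2 access time
global_miss_rate = 0.005 # misses that go all the way to memory
mem_ns = 100.0

l2_hit_stall = l1_miss_rate * (l2_ns / cycle_ns)        # every L1 miss pays 20 cycles
l2_miss_stall = global_miss_rate * (mem_ns / cycle_ns)  # global misses pay 400 more
total_cpi = base_cpi + l2_hit_stall + l2_miss_stall     # 1 + 0.4 + 2 = 3.4
speedup = 9.0 / total_cpi                               # vs. the L1-only CPI of 9
print(total_cpi, speedup)
```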
Associativity • Associativity • What chunks of memory can go in which cache lines
Direct Mapping • Direct mapping : every memory line has one cache entry it can use
Direct Mapped Cache • Issue: Thrashing • Reusing the same line rapidly for different sets
Direct Mapped Cache • Adding arrays in a loop: 0x0040 = 0x0000 + 0x0010 Read Line 0 / Set 0 (miss) Read Line 1 / Set 0 (miss) Write Line 0 / Set 1 (miss – kills Set 0) 0x0044 = 0x0004 + 0x0014 Read Line 0 / Set 0 (miss – kills Set 1) Read Line 1 / Set 0 (hit) Write Line 0 / Set 1 (miss)
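The slide's hit/miss pattern can be reproduced with a tiny simulation. The 16-byte line size and 4-line cache below are assumptions chosen to match the addresses shown; with them, 0x0000 and 0x0040 collide on line 0 and thrash:

```python
# Hypothetical direct-mapped cache: 4 lines of 16 bytes (assumed sizes).
LINE_SIZE = 16
NUM_LINES = 4

def line_and_tag(addr):
    block = addr // LINE_SIZE
    return block % NUM_LINES, block // NUM_LINES  # (cache line, tag)

cache = {}  # line index -> tag currently resident

def access(addr):
    line, tag = line_and_tag(addr)
    hit = cache.get(line) == tag
    cache[line] = tag  # on a miss, evict whatever was there
    return hit

# One loop iteration of c[i] = a[i] + b[i], addresses from the slide.
results = []
for addr in (0x0000, 0x0010, 0x0040, 0x0004, 0x0014, 0x0044):
    hit = access(addr)
    results.append(hit)
    print(hex(addr), "hit" if hit else "miss")
# Only 0x0014 hits, matching the slide's trace.
```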
Fully Associative • Fully associative cache • Any memory line can go in any cache entry
Fully Associative Cache • Minimal thrashing • Put wherever you want • When space is needed, evict the oldest
Fully Associative • Issues: • Must check all tags in parallel for a match • Large amounts of hardware • Most practical for very small caches
Set Associative • n-way Set Associative: every memory block has n slots it can be in • 2-way
Set Associative • n-way Set Associative: every memory block has n slots it can be in • 4-way
Set Associative Address • Example – 2-way • 2 cache entries • Each holds two items (A & B)
Set Associative Address • 2-way set associative: 0x0040 = 0x0000 + 0x0010 Read Line 0 / Set 0 (miss) Read Line 1 / Set 0 (miss) Write Line 0 / Set 2 (miss) 0x0044 = 0x0004 + 0x0014 Read Line 0 / Set 0 (hit) Read Line 1 / Set 0 (hit) Write Line 0 / Set 2 (hit)
Associativity Compared • A cache with space for 8 entries could be • 8 blocks, direct mapped • 4 sets, 2 lines capacity • 2 sets, 4 lines capacity • Fully associative, 8 lines capacity
4-Way Implementation • 32 bit addresses, 256 cache lines, each line holds 1 word of memory (4 bytes) • Address broken up: • 2 bits for offset in line (4 addresses/line) • 8 bits for line (256 lines) • 22 bits for tag (everything else)
4-Way Implementation • Address: 0101 1101 0010 0010 0011 0010 0010 1100 • 2 bits for offset in line: 00 = Byte 0 in line • 8 bits for line: 10 0010 11 = Index 139 in cache • 22 bits for tag: 0101 1101 0010 …
4-Way Implementation • Address: 0101 1101 0010 0010 0011 0010 0010 1100 • Cache index is 139 • Tag is 0101 1101 0010 … • Check all 4 tags in index 139 looking for a match
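The offset/index/tag split described above is a few shifts and masks. This sketch uses the slide's address and field widths (2 offset bits, 8 index bits, 22 tag bits):

```python
# Field widths from the slide.
OFFSET_BITS = 2
INDEX_BITS = 8

addr = 0b0101_1101_0010_0010_0011_0010_0010_1100  # the slide's example address

offset = addr & ((1 << OFFSET_BITS) - 1)                 # low 2 bits
index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)  # next 8 bits
tag = addr >> (OFFSET_BITS + INDEX_BITS)                 # remaining 22 bits

print(offset, index)  # byte 0 in the line, cache index 139
```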
Set Associative Performance • Larger caches = higher hit rate • Smaller caches benefit more from associativity
Replacement Strategies • How do we decide which block to kick out? • FIFO: Track age • Least Used: Track accesses • Very susceptible to thrashing • Least Recently Used: Track age of accesses • Very complex for larger caches • Random
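A minimal sketch of LRU for a single cache set, using Python's `OrderedDict` to track recency; the 2-way capacity and the access pattern are illustrative, not from the slides:

```python
from collections import OrderedDict

class LRUSet:
    """One set of an n-way set-associative cache with LRU replacement."""
    def __init__(self, ways):
        self.ways = ways
        self.tags = OrderedDict()  # least recently used tag first

    def access(self, tag):
        if tag in self.tags:             # hit: mark as most recently used
            self.tags.move_to_end(tag)
            return True
        if len(self.tags) >= self.ways:  # set full: evict the LRU tag
            self.tags.popitem(last=False)
        self.tags[tag] = True            # install the new tag
        return False

s = LRUSet(ways=2)
pattern = [s.access(t) for t in (0, 1, 0, 2, 1)]
print(pattern)  # only the second access to tag 0 hits; tag 2 evicts tag 1
```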
Update Strategies • When we store to memory, what gets updated? • Write Through: Update all levels of cache and memory • - Have to stall and wait for the slowest component to finish the update • + Consistency between levels • + Simple • Write Back: Just update the cache; update memory when the line leaves the cache • - Complex • - Lack of consistency between levels • + Faster – only stall for the current level
What do they use? • Intel Nehalem & ARM Cortex-A8 (ARMv7)