
CHAPTER 5



  1. CHAPTER 5

  2. 1977: DRAM faster than microprocessors

  3. Since 1980, CPU has outpaced DRAM ...

  4. How do architects address this gap? • Programmers want unlimited amounts of memory with low latency • Fast memory technology is more expensive per bit than slower memory • Solution: organize the memory system into a hierarchy • Entire addressable memory space available in the largest, slowest memory • Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor • Temporal and spatial locality ensure that nearly all references can be found in the smaller memories • Gives the illusion of a large, fast memory being presented to the processor

  5. Memory Hierarchy

  6. Advantage of memory hierarchy

  7. Memory Hierarchy Design • Memory hierarchy design becomes more crucial with recent multi-core processors: • Aggregate peak bandwidth grows with # cores: • Intel Core i7 can generate two references per core per clock • Four cores and 3.2 GHz clock • 25.6 billion 64-bit data references/second + • 12.8 billion 128-bit instruction references • = 409.6 GB/s! • DRAM bandwidth is only 6% of this (25 GB/s) • Requires: • Multi-port, pipelined caches • Two levels of cache per core • Shared third-level cache on chip
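
As a quick check of the arithmetic above, the peak bandwidth demand can be reproduced in a few lines (a minimal sketch, assuming the slide's reference widths of 64-bit data and 128-bit instructions):

```python
# Peak memory demand of a 4-core, 3.2 GHz Core i7, using the slide's assumptions.
cores, clock_hz = 4, 3.2e9
data_refs = cores * clock_hz * 2        # two data references per core per clock
inst_refs = cores * clock_hz * 1        # one instruction reference per core per clock
data_bw = data_refs * 8                 # 64-bit (8-byte) data references
inst_bw = inst_refs * 16                # 128-bit (16-byte) instruction references
total_bw = data_bw + inst_bw
print(data_refs / 1e9)                  # 25.6 billion data references per second
print(inst_refs / 1e9)                  # 12.8 billion instruction references per second
print(total_bw / 1e9)                   # 409.6 GB/s of peak demand
print(25e9 / total_bw)                  # DRAM at 25 GB/s covers only about 6% of it
```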

  8. Locality in Caches • A principle that makes memory hierarchy a good idea • If an item is referenced • Temporal locality: it will tend to be referenced again soon • Spatial locality: nearby items will tend to be referenced soon

  9. Memory Hierarchy Basics • When a word is not found in the cache, a miss occurs: • Fetch word from lower level in hierarchy, requiring a higher latency reference • Lower level may be another cache or the main memory • Also fetch the other words contained within the block • Takes advantage of spatial locality

  10. Cache • Two issues • How do we know if a data item is in the cache? • If it is, how do we find it? • Our first example • Block size is one word of data • "Direct mapped" • Our initial focus: two levels (upper, lower) • Block: minimum unit of data • Hit: data requested is in the upper level • Miss: data requested is not in the upper level • Direct mapped cache: for each item of data at the lower level, there is exactly one location in the cache where it might be, so many items at the lower level share locations in the upper level

  11. Direct mapped cache • Mapping • Cache address is Memory address modulo the number of blocks in the cache • Find a cache location: • (Block address) modulo (#Blocks in cache)

  12. Direct mapped cache • What kind of locality are we taking advantage of? • How many words does this cache store? • How do we determine if the data we are looking for is in the cache? • For a 32-bit byte address • Cache size is 2^n blocks, so n bits are used for the index • Block size is 2^m words (2^(m+2) bytes), so m bits are used to address the word within a block and two bits are used for the byte part of the address • Size of the tag field is 32 – (n + m + 2)

  13. Direct mapped cache • Taking advantage of spatial locality • (16KB cache, 256 blocks, 16 words/block) • For a 32-bit byte address • Cache size is 2^n blocks, so n bits are used for the index • Block size is 2^m words (2^(m+2) bytes), so m bits are used to address the word within a block and two bits are used for the byte part of the address • Size of the tag field is 32 – (n + m + 2)
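
For the cache above (256 blocks and 16 words per block, so n = 8 and m = 4), the field boundaries can be checked with a short sketch; decode() is a hypothetical helper, not code from the slides:

```python
import math

def decode(addr, n_blocks=256, words_per_block=16):
    """Split a 32-bit byte address into tag / index / word offset / byte offset."""
    n = int(math.log2(n_blocks))                  # index bits (8 for 256 blocks)
    m = int(math.log2(words_per_block))           # word-offset bits (4 for 16 words per block)
    byte_off = addr & 0b11                        # 2 bits select the byte within a word
    word_off = (addr >> 2) & ((1 << m) - 1)       # m bits select the word within a block
    index    = (addr >> (2 + m)) & ((1 << n) - 1) # n bits select the cache block
    tag      = addr >> (2 + m + n)                # remaining 32 - (n + m + 2) = 18 bits
    return tag, index, word_off, byte_off

print(decode(0x1234ABCD))
```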

  14. Block Size vs. Performance

  15. Block Size vs. Cache Measures • Increasing block size generally increases the miss penalty and decreases the miss rate • [Figure: plots of Miss Penalty vs. Block Size, Miss Rate vs. Block Size, and Average Memory Access Time vs. Block Size]

  16. Number of Bits? • How many bits are required for a direct-mapped cache with 16 KB of data and 4-word blocks, assuming a 32-bit address? • The 16 KB cache contains 4K words (2^12) • There are 1024 blocks (2^10) because the block size is 4 words, so n = 10 • Each block has 4 * 32, or 128 bits of data, plus the tag and a valid bit • Cache size = 2^10 * (bits for words in block + valid bit + tag) • Cache size = 2^10 * (128 + 1 + tag) • How many bits are used for the tag? • 32 – (n + m + 2) = 32 – (10 + 2 + 2) = 18 bits • Cache size = 2^10 * (128 + 1 + 18) = 2^10 * 147, or 147 Kbits (Cache size is 2^n blocks; block size is 2^m words)
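
The same bookkeeping can be written out directly; cache_bits() below is a hypothetical helper that mirrors the formula on the slide:

```python
def cache_bits(cache_kb=16, words_per_block=4, addr_bits=32):
    """Total storage (data + valid + tag) for a direct-mapped cache."""
    words     = cache_kb * 1024 // 4              # 4K words of data
    blocks    = words // words_per_block          # 1024 blocks, so n = 10
    n         = blocks.bit_length() - 1           # index bits
    m         = words_per_block.bit_length() - 1  # word-offset bits
    tag       = addr_bits - (n + m + 2)           # 18 tag bits
    per_block = words_per_block * 32 + 1 + tag    # data + valid + tag = 147 bits
    return blocks * per_block

print(cache_bits())            # 150528 bits = 147 Kbits, as computed on the slide
```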

  17. Four Questions for Memory Hierarchy Designers • Q1: Where can a block be placed in the upper level? (Block placement) • Q2: How is a block found if it is in the upper level? (Block identification) • Q3: Which block should be replaced on a miss? (Block replacement) • Q4: What happens on a write? (Write strategy)

  18. Q1: Where can a block be placed in the upper level? • Direct Mapped: Each block has only one place that it can appear in the cache. • Fully associative: Each block can be placed anywhere in the cache. • Must search entire cache for block (costly in terms of time) • Can search in parallel with additional hardware (costly in terms of space) • Set associative: Each block can be placed in a restricted set of places in the cache. • Compromise between direct mapped and fully associative • If there are n blocks in a set, the cache placement is called n-way set associative

  19. Associativity Examples • Fully associative: Block 12 can go anywhere • Direct mapped: Block no. = (Block address) mod (No. of blocks in cache); Block 12 can go only into block 4 (12 mod 8) • Set associative: Set no. = (Block address) mod (No. of sets in cache); Block 12 can go anywhere in set 0 (12 mod 4)
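
A minimal sketch of the three placement rules for the block-12 example above (an 8-block cache, organized as 4 sets of 2 blocks in the set-associative case):

```python
def placement(block_addr, n_blocks=8, n_sets=4):
    """List the cache block numbers where a memory block may be placed."""
    ways = n_blocks // n_sets
    direct = block_addr % n_blocks           # exactly one slot: 12 mod 8 = 4
    sa_set = block_addr % n_sets             # any way of set 12 mod 4 = 0
    return {
        "direct mapped":     [direct],
        "set associative":   list(range(sa_set * ways, (sa_set + 1) * ways)),
        "fully associative": list(range(n_blocks)),   # anywhere in the cache
    }

print(placement(12))   # direct mapped: [4], set associative: [0, 1], fully associative: [0..7]
```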

  20. Direct Mapped Cache

  21. 2 Way Set Associative Cache

  22. Fully Associative Cache

  23. An implementation of a four-way set associative cache

  24. Performance

  25. Q2: How Is a Block Found If It Is in the Upper Level? • The address can be divided into two main parts • Block offset: selects the data from the block; offset size = log2(block size) • Block address: tag + index • Index: selects the set in the cache; index size = log2(#blocks / associativity) • Tag: compared to the tag in the cache to determine a hit; tag size = address size – index size – offset size • [Figure: the address divided into tag, index, and block offset fields]

  26. Set Associative Cache Problem • Design an 8-way set associative cache that has 16 blocks and 32 bytes per block. Assume 32-bit addressing. Calculate the following: • How many bits are used for the block offset? • How many bits are used for the set (index) field? • How many bits are used for the tag? • Offset size = log2(block size) = log2(32) = 5 bits • Index size = log2(#blocks / associativity) = log2(16 / 8) = log2(2) = 1 bit • Tag size = address size – index size – offset size = 32 – 1 – 5 = 26 bits
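
The three answers can be verified with a short sketch (field_sizes() is a hypothetical helper that follows the formulas from the previous slide):

```python
import math

def field_sizes(n_blocks=16, block_bytes=32, ways=8, addr_bits=32):
    """Offset / index / tag widths for a set-associative cache."""
    offset = int(math.log2(block_bytes))   # log2(32) = 5
    sets   = n_blocks // ways              # 16 / 8 = 2 sets
    index  = int(math.log2(sets))          # log2(2) = 1
    tag    = addr_bits - index - offset    # 32 - 1 - 5 = 26
    return offset, index, tag

print(field_sizes())   # (5, 1, 26)
```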

  27. Q3: Which Block Should be Replaced on a Miss? • Easy for direct mapped • Set associative or fully associative: • Random - easier to implement • Least recently used (LRU) - harder to implement - may approximate • Miss rates for caches with different size, associativity, and replacement algorithm:

                2-way            4-way            8-way
      Size      LRU     Random   LRU     Random   LRU     Random
      16 KB     5.18%   5.69%    4.67%   5.29%    4.39%   4.96%
      64 KB     1.88%   2.01%    1.54%   1.66%    1.39%   1.53%
      256 KB    1.15%   1.17%    1.13%   1.13%    1.12%   1.12%
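
A minimal sketch of the two replacement policies acting on a single set; access_set() is a hypothetical helper, not a full cache simulator:

```python
import random

def access_set(trace, ways=2, policy="LRU"):
    """Count misses for one cache set under LRU or random replacement."""
    resident, misses = [], 0              # resident[-1] is the most recently used tag
    for tag in trace:
        if tag in resident:
            resident.remove(tag)          # hit: move the tag to the MRU position
        else:
            misses += 1
            if len(resident) == ways:     # set is full: pick a victim
                victim = 0 if policy == "LRU" else random.randrange(ways)
                resident.pop(victim)
        resident.append(tag)
    return misses

trace = [1, 2, 3, 1, 2, 3, 1, 2]          # blocks that all map to the same set
print(access_set(trace, policy="LRU"), access_set(trace, policy="Random"))
```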

  28. Q4: What Happens on a Write? • Write through: The information is written to both the block in the cache and to the block in the lower-level memory. • Write back: The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced. • Is block clean or dirty? (add a dirty bit to each block) • Pros and Cons of each: • Write through • read misses cannot result in writes to memory, • easier to implement • Always combine with write buffers to avoid memory latency • Write back • Less memory traffic • Perform writes at the speed of the cache

  29. Q4: What Happens on a Write?

  30. Q4: What Happens on a Write? • Since data does not have to be brought into the cache on a write miss, there are two options: • Write allocate • The block is brought into the cache on a write miss • Used with write-back caches • Hope subsequent writes to the block hit in cache • No-write allocate • The block is modified in memory, but not brought into the cache • Used with write-through caches • Writes have to go to memory anyway, so why bring the block into the cache
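
One way to see how write allocate pairs with write back is a sketch of a single cache block (hypothetical and heavily simplified; a real cache has many blocks and a tag/index lookup):

```python
class Block:
    """One cache block under a write-back, write-allocate policy."""
    def __init__(self):
        self.valid, self.dirty, self.tag, self.data = False, False, None, None

    def write(self, tag, word, memory):
        if not (self.valid and self.tag == tag):   # write miss
            if self.valid and self.dirty:
                memory[self.tag] = self.data       # write back the displaced dirty block
            self.tag, self.data, self.valid = tag, memory.get(tag), True   # allocate
        self.data, self.dirty = word, True         # the write itself goes to the cache only

memory = {0x10: "old value"}
blk = Block()
blk.write(0x10, "new value", memory)
print(blk.dirty, memory[0x10])   # True old value -> memory is updated only when the block is evicted
```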

  31. Hits vs. misses • Read hits • This is what we want! • Read misses • Stall the CPU, fetch block from memory, deliver to cache, restart • Write hits • Can replace data in cache and memory (write-through) • Write the data only into the cache (write-back the cache later) • Write misses • Read the entire block into the cache, then write the word

  32. Cache Misses • On cache hit, CPU proceeds normally • On cache miss • Stall the CPU pipeline • Fetch block from next level of hierarchy • Instruction cache miss • Restart instruction fetch • Data cache miss • Complete data access

  33. Performance • Simplified model • Execution time = (execution cycles + stall cycles) * clock cycle time • Stall cycles = # of instructions * miss ratio * miss penalty (assumes roughly one memory reference per instruction) • Two ways of improving performance • Decreasing the miss ratio • Decreasing the miss penalty • What happens if we increase block size?
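
The simplified model maps directly to a few lines (a sketch; the numbers passed in are hypothetical, and the model assumes roughly one memory reference per instruction):

```python
def exec_time(instructions, cpi, miss_ratio, miss_penalty, clock_cycle_time):
    """Execution time = (execution cycles + stall cycles) * clock cycle time."""
    execution_cycles = instructions * cpi
    stall_cycles     = instructions * miss_ratio * miss_penalty
    return (execution_cycles + stall_cycles) * clock_cycle_time

# Hypothetical numbers: 10^9 instructions, CPI of 1, 2% miss ratio,
# 50-cycle miss penalty, 1 ns clock cycle time.
print(exec_time(1e9, 1, 0.02, 50, 1e-9))   # 2.0 seconds: half the time is stalls
```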

  34. Cache Measures • Hit rate: fraction found in the cache • Miss rate = 1 - Hit Rate • Hit time: time to access the cache • Miss penalty: time to replace a block from lower level, • access time: time to access lower level • transfer time: time to transfer block CPU time = (CPU execution cycles + Memory stall cycles)*Cycle time

  35. Cost of Misses • Average memory access time = Hit time + Memory stall cycles per access (Miss rate * Miss penalty) • Note that speculative and multithreaded processors may execute other instructions during a miss • Reduces the performance impact of misses

  36. Assume 75% instruction, 25% data access

  37. Assume 75% instruction, 25% data access • Which has a lower miss rate: a 16-KB instruction cache with a 16-KB data cache, or a 32-KB unified cache? • 16KB instruction cache miss rate: 0.64% • 16KB data cache miss rate: 6.47% • 32KB unified cache miss rate: 1.99% • Miss rate of the separate caches = (75% * 0.64%) + (25% * 6.47%) = 2.10%, so the unified cache has the lower miss rate

  38. Assume 75% instruction, 25% data access • What is the average memory access time for the separate instruction and data caches and unified cache assuming write-through caches with a write buffer. Ignore stalls due to the write buffer. A hit takes 1 clock cycle and a miss penalty costs 50 clock cycles. A load or store hit on the unified cache takes an extra clock cycle. • 16KB Instruction cache miss rate: 0.64% • 16KB Data cache miss rate: 6.47% • 32KB Unified cache miss rate: 1.99% Average Access Time = % inst * (Hit time + instruction miss rate * miss penalty) + %data * (Hit time + data miss rate * miss penalty) • Split • Average memory access time = 75% * (1 + 0.64% * 50) + 25% * (1 + 6.47% * 50) = 75% * 1.32 + 25% * 4.235 = 2.05 cycles • Unified • Average memory access time = 75% * (1 + 1.99% * 50) + 25% * (1 + 1+ 1.99% * 50) = (75% * 1.995) + (25% * 2.995) = 2.24 cycles
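
Both averages can be reproduced with a short calculation (a sketch of the slide's formula; amat() is a hypothetical helper):

```python
def amat(inst_frac, data_frac, inst_mr, data_mr, penalty, inst_hit=1, data_hit=1):
    """Average memory access time weighted over instruction and data accesses."""
    return (inst_frac * (inst_hit + inst_mr * penalty) +
            data_frac * (data_hit + data_mr * penalty))

# Split 16KB I-cache + 16KB D-cache vs. 32KB unified cache; unified loads and
# stores pay one extra hit cycle because they contend with instruction fetches.
split   = amat(0.75, 0.25, 0.0064, 0.0647, 50)
unified = amat(0.75, 0.25, 0.0199, 0.0199, 50, data_hit=2)
print(split, unified)   # about 2.05 and 2.24 cycles, matching the slide
```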

  39. Cost of Misses, CPU time

  40. Improving Cache Performance • Average memory-access time = Hit time + Miss rate x Miss penalty • Improve performance by: 1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the cache.

  41. Types of misses • Compulsory • Very first access to a block (cold-start miss) • Capacity • Cache cannot contain all blocks needed • Conflict • Too many blocks mapped onto the same set

  42. How do you solve • Compulsory misses? • Larger blocks - but with side effects (a larger miss penalty, for one) • Capacity misses? • Not many options: enlarge the cache, or face "thrashing", where the computer runs at the speed of the lower-level memory or slower • Conflict misses? • A fully associative cache - at the cost of extra hardware, and it may slow the processor

  43. Basic cache optimizations: • Larger block size • Reduces compulsory misses • Increases capacity and conflict misses, increases miss penalty • Larger total cache capacity to reduce miss rate • Increases hit time, increases power consumption • Higher associativity • Reduces conflict misses • Increases hit time, increases power consumption • Higher number of cache levels • Reduces overall memory access time

  44. 3. Reducing Misses via Victim Cache • Add a small fully associative victim cache to place data discarded from regular cache • When data not found in cache, check victim cache • 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache • Get access time of direct mapped with reduced miss rate
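
A minimal sketch of the victim-cache idea, restricted to the buffer itself (hypothetical helper class, not the actual hardware organization):

```python
from collections import OrderedDict

class VictimCache:
    """Small fully associative buffer for blocks evicted from the main cache."""
    def __init__(self, entries=4):
        self.entries, self.blocks = entries, OrderedDict()

    def insert(self, tag, data):
        if len(self.blocks) >= self.entries:
            self.blocks.popitem(last=False)   # buffer full: drop the oldest victim
        self.blocks[tag] = data

    def lookup(self, tag):
        return self.blocks.pop(tag, None)     # a hit returns (and removes) the block

victims = VictimCache()
victims.insert(0x40, "evicted block")         # main cache just displaced tag 0x40
print(victims.lookup(0x40))                   # miss in the main cache, hit in the victim cache
print(victims.lookup(0x80))                   # None: must go to the next level of the hierarchy
```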

  45. 4. Reducing Misses by HW Prefetching of Instruction & Data • E.g., instruction prefetching • Alpha 21064 fetches 2 blocks on a miss • Extra block placed in stream buffer • On miss, check stream buffer • Norman Jouppi [1990 HP]: 1 data stream buffer got 25% of misses from a 4KB cache; 4 streams got 43% • Works with data blocks too: • Palacharla & Kessler [1994]: for scientific programs, 8 streams got 50% to 70% of misses from two 64KB, 4-way set associative caches • Prefetching relies on extra memory bandwidth that can be used without penalty

  46. 5. Reducing Misses by SW Prefetching Data • Data prefetch • Load data into register (HP PA-RISC loads) • Cache prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9) • Special prefetching instructions cannot cause faults; a form of speculative execution • Issuing prefetch instructions takes time • Is the cost of prefetch issues < savings in reduced misses?

  47. Multi-Level Caches • Second level cache accessed on first level cache miss • On first level miss, only pay the cost of accessing the second level instead of main memory • Different design considerations • Primary cache: minimize hit time to yield a shorter clock cycle or fewer pipeline stages • Secondary cache: reduce miss rate to reduce the penalty of main memory accesses • Primary cache is generally smallest • May use smaller block size and lower associativity to reduce the miss penalty • Secondary cache is much larger • May use larger block size and higher associativity • Intel Core i7-980X Gulftown (cache sizes per core, except the shared L3) • 32KB Level 1 Data Cache • 32KB Level 1 Instruction Cache • 256KB Level 2 Cache • 12MB Level 3 Cache (shared among all cores)
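
The benefit of the second level is usually expressed with the recursive average-access-time formula; the sketch below uses hypothetical latencies and miss rates, not measured i7 numbers:

```python
def two_level_amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_penalty):
    """AMAT = L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * memory penalty)."""
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_penalty)

# Hypothetical numbers: 1-cycle L1, 4% L1 miss rate, 10-cycle L2,
# 25% local L2 miss rate, 100-cycle main-memory penalty.
print(two_level_amat(1, 0.04, 10, 0.25, 100))   # 2.4 cycles on average
```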
