CMPE 421 Parallel Computer Architecture

CMPE 421 Parallel Computer Architecture, Part 4: Caching with Associativity

Fully Associative Cache: Reducing Cache Misses by More Flexible Placement of Blocks
• Instead of direct mapping, we allow any memory block to be placed in any cache slot.
• In a direct mapped cache, many different addresses map to each index.
• A fully associative cache uses any available entry to store a memory block.
• Remember: direct mapped caches are more rigid. A block goes exactly where the index says, even if the rest of the cache is empty.
• In a fully associative cache, nothing gets "thrown out" of the cache until it is completely full.
• It is harder to check for a hit (hit time will increase).
• It requires a lot more hardware (a comparator for each cache slot).
• Each tag is a complete block address (no index bits are used).

Fully Associative Cache
• The tags of all entries must be compared in parallel to find the desired one (if there is a hit).
• A direct mapped cache, by contrast, needs to look in only one place.
• No conflict misses; apart from compulsory misses, only capacity misses remain.
• Practical only for caches with a small number of blocks, since the parallel search increases the hardware cost.

Fully Associative Cache

Direct Mapped vs Fully Associative
[Figure: two 16-entry caches (entries 0 to 15), each entry holding V, Tag, and Data fields.]
• Direct mapped: uses an index. Each address has only one possible location. Address = Tag | Index | Block offset
• Fully associative: no index. Address = Tag | Block offset

Trade-off
• Fully associative is much more flexible, so the miss rate will be lower.
• Direct mapped requires less hardware (cheaper) and will also be faster.
• This is a trade-off of miss rate vs. hit time.
• We can therefore look for a compromise between the direct mapped cache and the fully associative cache.
• We can provide more flexibility without going all the way to a fully associative placement policy.
• For each memory location, provide a small number of cache slots that can hold the memory element.
• This is much more flexible than direct mapped, but requires less hardware than fully associative.
SOLUTION: Set Associative

Set Associative Cache
• A fixed number of locations where each block can be placed.
• N-way set associative means there are N places (slots) where each block can be placed.
• The cache is divided into a number of sets, each of size N "ways".
• A memory block maps to a unique set (specified by the index field) and can be placed in any way of that set, so there are N choices.
• In a set associative cache, the set containing a memory block is given by (Block address) modulo (Number of sets in the cache)
• Remember that in a direct mapped cache the position of a memory block is given by (Block address) modulo (Number of cache blocks)
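The two modulo rules can be compared side by side in a short sketch. The function names and the 8-block cache size are assumptions chosen for illustration only; they do not come from the slides.

```python
def direct_mapped_slot(block_address: int, num_blocks: int) -> int:
    """Direct mapped: (block address) modulo (number of cache blocks)."""
    return block_address % num_blocks


def set_index(block_address: int, num_blocks: int, ways: int) -> int:
    """Set associative: (block address) modulo (number of sets in the cache)."""
    num_sets = num_blocks // ways  # a fixed-size cache split into sets of `ways` blocks
    return block_address % num_sets


NUM_BLOCKS = 8  # assumed cache size in blocks
for block in (12, 13):
    print(f"block {block}: direct mapped slot {direct_mapped_slot(block, NUM_BLOCKS)}")
    for ways in (2, 4, 8):  # ways == NUM_BLOCKS means fully associative (one set)
        print(f"  {ways}-way: set {set_index(block, NUM_BLOCKS, ways)}")
```

With `ways` equal to the number of blocks there is a single set, so every block maps to set 0, which is the fully associative case.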

A Compromise
[Figure: cache entries (V, Tag, Data) grouped into sets for the 2-way and 4-way set associative organizations.]
• 2-way set associative: each address has two possible locations with the same index. One fewer index bit: 1/2 the indexes.
• 4-way set associative: each address has four possible locations with the same index. Two fewer index bits: 1/4 the indexes.
• In both cases: Address = Tag | Index | Block offset

Range of Set Associative Caches
Address fields: Tag | Index | Block offset | Byte offset
• Tag: used for tag compare.
• Index: the set number. It selects the set, i.e. it determines which set the block can be placed in.
• Block offset: selects the word in the block.
• Decreasing associativity leads to direct mapped (only one way), with smaller tags.
• Increasing associativity leads to fully associative (only one set), where the tag is all the bits except the block and byte offset.

Range of Set Associative Caches
• For a fixed size cache, each increase by a factor of two in associativity:
  - doubles the number of blocks per set (i.e. the number of ways)
  - halves the number of sets
  - decreases the size of the index by 1 bit
  - increases the size of the tag by 1 bit
Address fields: Tag | Index | Block offset | Byte offset
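A small sketch makes the bit bookkeeping concrete. The parameters (10-bit addresses, 8 blocks, 2-bit block offset, 2-bit byte offset) and the helper name `field_widths` are assumptions for illustration; the number of sets is assumed to be a power of two.

```python
def field_widths(address_bits, num_blocks, ways, block_offset_bits, byte_offset_bits):
    """Return (number of sets, index bits, tag bits) for a fixed-size cache."""
    num_sets = num_blocks // ways
    index_bits = num_sets.bit_length() - 1  # log2(num_sets) for a power of two
    tag_bits = address_bits - index_bits - block_offset_bits - byte_offset_bits
    return num_sets, index_bits, tag_bits


# Doubling the associativity halves the sets, drops one index bit, adds one tag bit.
for ways in (1, 2, 4, 8):
    num_sets, index_bits, tag_bits = field_widths(10, 8, ways, 2, 2)
    print(f"{ways}-way: {num_sets} sets, index = {index_bits} bits, tag = {tag_bits} bits")
```

The 1-way row is the direct mapped cache and the 8-way row is the fully associative cache, which has no index bits at all.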

Set Associative Cache
[Figure: a main memory of 16 one-word blocks (addresses 0000xx to 1111xx) mapped into a 2-way set associative cache with two sets (0 and 1); each entry holds V, Tag, and Data.]
• One-word blocks. The two low order bits define the byte in the word (32-bit words).
• Q1: How do we find it? Use the next 1 low order memory address bit to determine which cache set, i.e. (block address) modulo (number of sets in the cache).
• Q2: Is it there? Compare all the cache tags in the set to the high order 3 memory address bits to tell if the memory block is in the cache.
• The valid bit indicates whether an entry contains valid information. If the bit is not set, there cannot be a match for this block.

Set Associative Cache Organization FIGURE 7.17 The implementation of a four-way set-associative cache requires four comparators and a 4-to-1 multiplexor. The comparators determine which element of the selected set (if any) matches the tag. The output of the comparators is used to select the data from one of the four blocks of the indexed set, using a multiplexor with a decoded select signal. In some implementations, the Output enable signals on the data portions of the cache RAMs can be used to select the entry in the set that drives the output. The Output enable signal comes from the comparators, causing the element that matches to drive the data outputs.

Set Associative Cache Organization
• This is called a 4-way set associative cache because there are four cache entries for each cache index. Essentially, you have four direct mapped caches working in parallel.
• This is how it works: the cache index selects a set from the cache. The four tags in the set are compared in parallel with the upper bits of the memory address.
• If no tag matches the incoming address tag, we have a cache miss.
• Otherwise, we have a cache hit, and we select the data from the way where the tag match occurs.
• This is simple enough. What are its disadvantages?
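The lookup steps can be sketched in software. This is only a model: the loop over the ways stands in for the comparators that hardware runs in parallel, and the `Entry` class, the `lookup` function, and the 2-set cache contents are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Entry:
    valid: bool = False
    tag: int = 0
    data: Optional[str] = None


def lookup(sets: List[List[Entry]], block_address: int) -> Tuple[bool, Optional[str]]:
    """Return (hit, data) for a block address in an N-way set associative cache."""
    num_sets = len(sets)
    index = block_address % num_sets   # the cache index selects a set
    tag = block_address // num_sets    # the upper bits form the tag
    for entry in sets[index]:          # hardware does these compares in parallel
        if entry.valid and entry.tag == tag:
            return True, entry.data    # hit: select the data of the matching way
    return False, None                 # no valid tag matched: miss


# Assumed example: 2 sets of 4 ways, with block 7 stored in set 1, way 2.
sets = [[Entry() for _ in range(4)] for _ in range(2)]
sets[1][2] = Entry(valid=True, tag=3, data="Mem(7)")
print(lookup(sets, 7))  # block 7 -> set 1, tag 3: hit
print(lookup(sets, 5))  # block 5 -> set 1, tag 2: miss
```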

N-way Set Associative Cache versus Direct Mapped Cache
• An N-way set associative cache will be slower than a direct mapped cache because of:
  - N comparators vs. 1
  - extra MUX delay for the data
  - data arriving AFTER the hit/miss decision and set selection
• In a direct mapped cache, the cache block is available BEFORE the hit/miss decision:
  - It is possible to assume a hit and continue, then recover later on a miss.

Remember the Example for Direct Mapping (ping pong effect)
• Consider the main memory word reference string 0 4 0 4 0 4 0 4
• Start with an empty cache: all blocks initially marked as not valid

Reference   Result   Cache block after the access (tag, data)
0           miss     00 Mem(0)
4           miss     01 Mem(4)
0           miss     00 Mem(0)
4           miss     01 Mem(4)
0           miss     00 Mem(0)
4           miss     01 Mem(4)
0           miss     00 Mem(0)
4           miss     01 Mem(4)

• 8 requests, 8 misses
• Ping pong effect due to conflict misses: two memory locations that map into the same cache block
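The ping pong effect can be reproduced with a minimal simulation. It assumes a direct mapped cache of 4 one-word blocks (consistent with the 2-bit tags 00 and 01 in the example); the function name is hypothetical.

```python
def simulate_direct_mapped(references, num_blocks):
    """Return a list of 'hit'/'miss' results for a direct mapped cache."""
    tags = [None] * num_blocks  # None models a block whose valid bit is not set
    results = []
    for block_address in references:
        index = block_address % num_blocks   # the only slot this block may use
        tag = block_address // num_blocks
        if tags[index] == tag:
            results.append("hit")
        else:
            results.append("miss")
            tags[index] = tag                # the new block evicts the old one
    return results


results = simulate_direct_mapped([0, 4, 0, 4, 0, 4, 0, 4], num_blocks=4)
print(results)
print(results.count("miss"), "misses out of", len(results))
```

Words 0 and 4 both map to index 0, so each access evicts the block the next access needs.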

Solution: Use set associative cache
• Consider the main memory word reference string 0 4 0 4 0 4 0 4
• Start with an empty cache: all blocks initially marked as not valid

Reference   Result   Set 0 after the access (tag, data)
0           miss     000 Mem(0)
4           miss     000 Mem(0), 010 Mem(4)
0           hit      000 Mem(0), 010 Mem(4)
4           hit      000 Mem(0), 010 Mem(4)

The remaining references 0 4 0 4 also hit.

• 8 requests, 2 misses
• Solves the ping pong effect that conflict misses cause in a direct mapped cache, since two memory locations that map into the same cache set can now co-exist!
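A minimal simulation of the same reference string on a set associative cache follows. It assumes 4 one-word blocks organized as 2 sets of 2 ways and LRU replacement within a set; the function name and these parameters are illustrative assumptions.

```python
def simulate_set_associative(references, num_blocks, ways):
    """Return 'hit'/'miss' results for an N-way cache with LRU replacement."""
    num_sets = num_blocks // ways
    # Each set is a list of tags ordered from least to most recently used.
    sets = [[] for _ in range(num_sets)]
    results = []
    for block_address in references:
        index = block_address % num_sets
        tag = block_address // num_sets
        lines = sets[index]
        if tag in lines:
            results.append("hit")
            lines.remove(tag)          # re-appended below as most recently used
        else:
            results.append("miss")
            if len(lines) == ways:     # set is full: evict the LRU tag
                lines.pop(0)
        lines.append(tag)
    return results


results = simulate_set_associative([0, 4, 0, 4, 0, 4, 0, 4], num_blocks=4, ways=2)
print(results)
print(results.count("miss"), "misses out of", len(results))
```

Calling the same function with `ways=1` degenerates to the direct mapped case and brings back the 8 misses.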

Set Associative Example
Address fields: Tag (3-5 bits) | Index (1-3 bits) | Block offset (2 bits) | Byte offset (2 bits)
[Figure: the same 8-block cache drawn three ways, each entry holding V, Tag, and Data, all initially invalid: direct mapped (indexes 000 to 111), 2-way set associative (indexes 00 to 11), and 4-way set associative (indexes 0 and 1).]

Address      Direct-Mapped   2-Way Set Assoc.   4-Way Set Assoc.
0100111000   Miss            Miss               Miss
1100110100   Miss            Miss               Miss
0100111100   Miss            Hit                Hit
0110110000   Miss            Miss               Miss
1100111000   Miss            Miss               Hit

Tags held at the end: direct mapped index 011 holds 110; 2-way set 11 holds 1100 and 0110; 4-way set 1 holds 01001, 11001, and 01101.
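To see why these five addresses collide, it helps to split each one into its fields for the three organizations. The sketch below assumes the field widths given above (2-bit block offset, 2-bit byte offset, and 3, 2, or 1 index bits); `split_address` is a hypothetical helper working on bit strings.

```python
def split_address(address: str, index_bits: int,
                  block_offset_bits: int = 2, byte_offset_bits: int = 2):
    """Split a binary address string into (tag, index, block offset, byte offset)."""
    tag_end = len(address) - index_bits - block_offset_bits - byte_offset_bits
    index_end = tag_end + index_bits
    block_end = index_end + block_offset_bits
    return (address[:tag_end], address[tag_end:index_end],
            address[index_end:block_end], address[block_end:])


ADDRESSES = ["0100111000", "1100110100", "0100111100", "0110110000", "1100111000"]
ORGANIZATIONS = {"Direct-Mapped": 3, "2-Way Set Assoc.": 2, "4-Way Set Assoc.": 1}

for name, index_bits in ORGANIZATIONS.items():
    print(name)
    for address in ADDRESSES:
        tag, index, block_offset, byte_offset = split_address(address, index_bits)
        print(f"  {address}: tag={tag} index={index} "
              f"block offset={block_offset} byte offset={byte_offset}")
```

In every organization all five addresses share one index (011, 11, or 1), so they compete for the same set. Only the number of ways in that set decides how many of their tags can be resident at once.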

New Performance Numbers
Miss rates for DEC 3100 (MIPS machine), separate 64KB instruction/data caches

Benchmark   Associativity   Instruction miss rate   Data miss rate   Combined miss rate
gcc         Direct          2.0%                    1.7%             1.9%
gcc         2-way           1.6%                    1.4%             1.5%
gcc         4-way           1.6%                    1.4%             1.5%
spice       Direct          0.3%                    0.6%             0.4%
spice       2-way           0.3%                    0.6%             0.4%
spice       4-way           0.3%                    0.6%             0.4%
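The relative improvement from direct mapped to 2-way can be worked out from the gcc rows of the table. The sketch below only does that arithmetic; the helper name `relative_reduction` is an assumption.

```python
def relative_reduction(old_rate: float, new_rate: float) -> float:
    """Percentage of misses removed when the miss rate drops from old to new."""
    return (old_rate - new_rate) / old_rate * 100.0


# gcc miss rates in percent: (direct mapped, 2-way), taken from the table.
gcc = {"instruction": (2.0, 1.6), "data": (1.7, 1.4), "combined": (1.9, 1.5)}
for name, (direct, two_way) in gcc.items():
    print(f"gcc {name}: {relative_reduction(direct, two_way):.1f}% fewer misses")
```

For gcc this gives roughly a 20% reduction for instructions, 18% for data, and 21% combined, while spice shows no change at any associativity.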

Benefits of Set Associative Caches
• The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation.
• The largest gains are in going from direct mapped to 2-way (20%+ reduction in miss rate).
Data from Hennessy & Patterson, Computer Architecture, 2003

Benefits of Set Associative Caches
• As the cache size grows, the relative improvement from associativity increases only slightly.
• Since the overall miss rate of a larger cache is lower, the opportunity for improving the miss rate decreases.
• The absolute improvement in miss rate from associativity shrinks significantly.

Cache Block Replacement Policy
For deciding which block to replace when a new entry comes in:
• Random replacement:
  - Hardware randomly selects a cache entry and throws it out.
• First In First Out (FIFO):
  - Equally fair / equally unfair to all frames.
• Least Recently Used (LRU):
  - Uses the idea of temporal locality to select the entry that has not been accessed recently.
  - Additional bit(s) are required in the cache entry to track access order.
  - Must update on each access and scan all entries on a replacement.
  - A two-way set associative cache needs one bit per set for LRU replacement.
• A common approach is to use a pseudo-LRU strategy.
• Example of a simple "pseudo" LRU implementation:
  - Assume 64 fully associative entries.
  - A hardware replacement pointer points to one cache entry.
  - Whenever an access is made to the entry the pointer points to, move the pointer to the next entry.
  - Otherwise, do not move the pointer.
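The pointer-based pseudo-LRU scheme can be sketched as follows. The class name, the wrap-around from the last entry back to entry 0, and the choice of the pointed-to entry as the replacement victim are assumptions that the slide does not spell out.

```python
class PointerPseudoLRU:
    """Replacement pointer that steps away from entries that were just accessed."""

    def __init__(self, num_entries: int = 64):
        self.num_entries = num_entries
        self.pointer = 0  # entry that would be replaced next

    def access(self, entry: int) -> None:
        # Move only when the accessed entry is the one the pointer points to.
        if entry == self.pointer:
            self.pointer = (self.pointer + 1) % self.num_entries  # assumed wrap-around

    def victim(self) -> int:
        # Assumption: the pointed-to entry is the one replaced on a miss.
        return self.pointer


policy = PointerPseudoLRU(num_entries=4)  # small size chosen for the example
policy.access(2)   # not the pointed-to entry: pointer stays at 0
policy.access(0)   # pointed-to entry was just used: pointer moves to 1
print(policy.victim())
```

The pointer never rests on an entry that was accessed while it pointed there, so the victim is never the most recent of those accesses. It needs one pointer for the whole cache instead of ordering bits in every entry.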

Source of Cache Misses

Designing a cache
Note: If you are running "billions" of instructions, compulsory misses are insignificant.

Key Cache Design Parameters

Two Machines’ Cache Parameters

Where can a block be placed/found?

Multilevel caches
• A two-level cache structure allows the primary cache (L1) to focus on reducing hit time to yield a shorter clock cycle.
• The second-level cache (L2) focuses on reducing the penalty of long memory access time.
• Compared to the cache of a single-cache machine, L1 on a multilevel cache machine is usually smaller, has a smaller block size, and has a higher miss rate.
• Compared to the cache of a single-cache machine, L2 on a multilevel cache machine is often larger, with a larger block size.
• The access time of L2 is less critical than that of the cache of a single-cache machine.
