
CMPE 421 Parallel Computer Architecture


Presentation Transcript


  1. CMPE 421 Parallel Computer Architecture PART 3: Accessing a Cache

  2. Direct Mapped Cache Example 2 (4 one-word blocks) • Consider the main memory word reference string 0 1 2 3 4 3 4 15. Start with an empty cache - all blocks initially marked as not valid. • References 0, 1, 2 and 3 each miss and fill the four cache blocks with Mem(0), Mem(1), Mem(2) and Mem(3) (tag 00). • Reference 4 misses: since 4 mod 4 = 0, it replaces Mem(0) with Mem(4) (tag 01). • References 3 and 4 then hit. • Reference 15 misses: since 15 mod 4 = 3, it replaces Mem(3) with Mem(15) (tag 11). • 8 requests, 6 misses, 2 hits

  3. Direct Mapped Caching: A Simple First Example • Main memory holds 16 words at addresses 0000xx through 1111xx; the two low-order bits select the byte within the 32-bit word. • The cache has four entries (index 00, 01, 10, 11), each holding a valid bit, a tag and a data word. The valid bit indicates whether an entry contains valid information - if the bit is not set, there cannot be a match for this block. • Q1: How do we find it? Use the next 2 low-order memory address bits - the index - to determine which cache block: (block address) modulo (# of blocks in the cache). • Q2: Is it there? Compare the cache tag to the high-order 2 memory address bits to tell if the memory block is in the cache.

  4. Direct Mapped Cache Example 1 • 8 blocks, 1 word/block, direct mapped • Initial state

  5.-9. Direct Mapped Cache Example 1 (continued; figure-only slides)

  10. Address Subdivision: Direct Mapped Cache • One word/block, cache size = 1K words • The 32-bit address is divided into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2) and a 2-bit byte offset (bits 1-0). The 10-bit index selects one of the 1024 cache entries (0-1023); the stored 20-bit tag is compared with the upper 20 address bits, and Hit is asserted when they match and the valid bit is set. • What kind of locality are we taking advantage of? • FIGURE 7.7 For this cache, the lower portion of the address is used to select a cache entry consisting of a data word and a tag.
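To make the field boundaries concrete, here is a minimal C sketch (my own illustration, not from the slides) that splits a 32-bit byte address into the 20-bit tag, 10-bit index and 2-bit byte offset of this 1K-word, one-word-block cache; the function and constant names are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

#define INDEX_BITS  10                      /* 1024 entries -> 10-bit index */
#define OFFSET_BITS 2                       /* 4 bytes per 32-bit word      */

/* Split a 32-bit byte address into the fields shown on the slide. */
static void split_address(uint32_t addr,
                          uint32_t *tag, uint32_t *index, uint32_t *offset)
{
    *offset = addr & ((1u << OFFSET_BITS) - 1);                  /* bits 1..0   */
    *index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);  /* bits 11..2  */
    *tag    = addr >> (OFFSET_BITS + INDEX_BITS);                /* bits 31..12 */
}

int main(void)
{
    uint32_t tag, index, offset;
    split_address(0x00001234u, &tag, &index, &offset);
    printf("tag=0x%x index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}
```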

  11. Another Example for Direct Mapping • Consider the main memory word reference string 0 4 0 4 0 4 0 4. Start with an empty cache - all blocks initially marked as not valid. • Because 0 mod 4 = 0 and 4 mod 4 = 0, words 0 and 4 both map to cache index 0, so each reference evicts the other: every access is a miss, and the block alternates between Mem(0) (tag 00) and Mem(4) (tag 01). • 8 requests, 8 misses • Ping pong effect due to conflict misses - two memory locations that map into the same cache block

  12. Handling Cache Misses • Our control unit must detect a miss and process the miss by fetching the data from memory or from a lower-level cache. • Approach – stall the CPU, freezing the contents of all the registers – a separate controller fetches the data from memory – once the data is present, execution of the datapath resumes

  13. Instruction Cache Misses • Send the original PC value (current PC - 4) to the memory • Instruct main memory to perform a read and wait for the memory to complete its access. • Write the cache entry, putting the data from memory in the data portion of the entry, writing the upper bits of the address (from the ALU) into the tag field, and setting the valid bit • Restart the instruction execution at the first step, which will re-fetch the instruction (now in the cache)

  14. Example Machine • The Digital DECStation 3100, one of the first commercially available RISC-architecture machines, used a MIPS R2000 processor. • 5-stage pipeline • requested an instruction word and a data word on every clock cycle • static branch prediction • delayed branch instruction • 64 KB data cache and 64 KB instruction cache

  15. DECStation 3100 Cache (16K words) • This cache has 2^14 (16K) words with a block size of 1 word. 14 bits are used to index into the cache, and the upper 16 bits are compared against the tag. A hit results if the upper 16 bits match the tag AND the valid bit is set.

  16. DECStation 3100 Cache • 64 KB = 16 K words with a 1 word block • Read requests (lw instruction): 1. Send the address to the appropriate cache. • Instruction - from PC • Data - from ALU 2. If the cache signals a hit, read the requested word from the data lines. Else, send the address to main memory. When the requested word comes back from main memory, write it into the cache.
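A minimal software sketch of the read path just described, for a 16K-entry, one-word-block direct-mapped cache. This is my own illustration, not the DECStation hardware; `memory_read_word` is an assumed placeholder for the main-memory access.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_ENTRIES (1 << 14)          /* 16K one-word blocks, as in the DECStation data cache */

struct line { bool valid; uint16_t tag; uint32_t data; };
static struct line cache[NUM_ENTRIES];

/* Placeholder for the main-memory access (assumed, not from the slides). */
extern uint32_t memory_read_word(uint32_t word_addr);

/* Handle a read request for a word address (byte offset already stripped). */
uint32_t cache_read(uint32_t word_addr)
{
    uint32_t index = word_addr & (NUM_ENTRIES - 1);   /* low 14 bits of the word address  */
    uint16_t tag   = (uint16_t)(word_addr >> 14);     /* remaining upper bits              */

    if (cache[index].valid && cache[index].tag == tag)
        return cache[index].data;                     /* hit: read the word from the data lines */

    /* Miss: send the address to main memory, then write the word into the cache. */
    uint32_t word = memory_read_word(word_addr);
    cache[index].valid = true;
    cache[index].tag   = tag;
    cache[index].data  = word;
    return word;
}
```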

  17. DECStation 3100 Cache • Write requests: On a sw instruction, the DECStation 3100 used a scheme called write-through. • Write-through stores the word in the data cache AND in main memory. This is done to keep the data cache and main memory consistent. • Don’t bother to check for hits and misses. Just write the word into the cache and into main memory. • Example: initially both the cache and memory hold A=5. The process wants to update A to 10. If only the cache copy is written, the cache copy of A (10) is no longer equal to the memory copy of A (5) - an inconsistency. With the write-through policy the word is written into both of them, so they stay consistent.

  18. DECStation 3100 Cache • Write requests using a write-through scheme (see slide 17): 1. Index the cache using bits 2-15 of the address. 2. Write bits 31-16 of the address into the tag, write the data word into the data portion, and set the valid bit. 3. Also write the word to main memory using the entire address. • This simple approach slows performance down because of the long write to main memory.
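A sketch of those three steps in the same style as the read sketch above. This is illustrative only; `memory_write_word` and all identifiers are assumed names, not DECStation signals.

```c
#include <stdint.h>
#include <stdbool.h>

#define DC_ENTRIES (1 << 14)                 /* 16K one-word blocks */

struct dc_line { bool valid; uint16_t tag; uint32_t data; };
static struct dc_line dcache[DC_ENTRIES];

extern void memory_write_word(uint32_t word_addr, uint32_t data);   /* assumed placeholder */

/* Write-through store, following the three steps on this slide. */
void cache_write_through(uint32_t word_addr, uint32_t data)
{
    uint32_t index = word_addr & (DC_ENTRIES - 1);   /* step 1: index the cache (bits 2-15 of the byte address) */
    uint16_t tag   = (uint16_t)(word_addr >> 14);    /* upper address bits                                      */

    dcache[index].tag   = tag;                       /* step 2: write tag, data word, and valid bit  */
    dcache[index].data  = data;                      /*         (no hit/miss check is made)          */
    dcache[index].valid = true;

    memory_write_word(word_addr, data);              /* step 3: also write the word to main memory   */
}
```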

  19. EXAMPLE: the cost of cache writes • Suppose 10% of the instructions are stores and the CPI without cache misses is 1.0, but every write spends 100 extra cycles going to main memory • Effective CPI = 1.0 + 100 * 10% = 11 • This reduces performance by more than a factor of 10 • SOLUTION: a write buffer

  20. Write Buffer Solution • Use a write buffer to store the data while it is waiting to be written to memory • Write the data into the cache and into the write buffer • The processor then continues with execution • The write buffer (memory controller) copies the data into main memory; when a write to main memory completes, the entry in the write buffer is freed • Hopefully, the processor does not generate writes faster than the write buffer can take them. If the write buffer becomes full, stalls occur.
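A minimal sketch (illustrative, not the DECStation design) of a write buffer as a small FIFO: the processor enqueues the store and keeps running, the memory controller drains entries as DRAM writes complete, and a full buffer forces a stall. The depth and all names are assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 4                      /* small FIFO; the real depth is a design choice */

struct wb_entry { uint32_t addr, data; };
static struct wb_entry write_buf[WB_ENTRIES];
static int wb_head, wb_count;

/* Processor side: returns false (stall) if the buffer is full. */
bool write_buffer_put(uint32_t addr, uint32_t data)
{
    if (wb_count == WB_ENTRIES)
        return false;                                  /* buffer full: the CPU must stall */
    int tail = (wb_head + wb_count) % WB_ENTRIES;
    write_buf[tail].addr = addr;
    write_buf[tail].data = data;
    wb_count++;
    return true;                                       /* CPU continues executing */
}

/* Memory-controller side: removes one entry when its DRAM write completes. */
bool write_buffer_drain(uint32_t *addr, uint32_t *data)
{
    if (wb_count == 0)
        return false;
    *addr = write_buf[wb_head].addr;
    *data = write_buf[wb_head].data;
    wb_head = (wb_head + 1) % WB_ENTRIES;
    wb_count--;
    return true;                                       /* entry freed once the write completes */
}
```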

  21. Write Buffer Saturation • PROBLEM: the memory system designer’s nightmare • If the store frequency (w.r.t. time) << 1 / DRAM write cycle, the write buffer works fine • If the store frequency (w.r.t. time) approaches 1 / DRAM write cycle, and this condition persists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row), the store buffer will overflow no matter how big you make it, because you are simply feeding it data faster than it can be emptied • Remember: the CPU cycle time << DRAM write cycle time

  22. Solutions for write buffer saturation: • Use a write-back cache • Install a second-level (L2) cache • Store compression

  23. Alternative: Write-Back Solution • The write-back scheme doesn’t automatically write to the cache AND to main memory. • Instead, it writes only to the cache. The data does not get written to memory until that cache block has to be replaced with a different block (i.e., when that cache block is being replaced on a cache miss). • Can improve performance and greatly reduce the memory bandwidth requirement. • But it is more complex to implement than write-through: the control can be complex, and a “dirty bit” is needed for each cache block.
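A minimal sketch of the write-back idea with a dirty bit, assuming for simplicity a one-word block and the illustrative memory helpers used in the earlier sketches (all names are my own, not from the slides): the write touches only the cache, and the old contents are written back to memory only when a dirty line is replaced.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_LINES 1024                      /* example size; one word per block for simplicity */

struct wb_line { bool valid, dirty; uint32_t tag, data; };
static struct wb_line wb_cache[NUM_LINES];

extern uint32_t memory_read_word(uint32_t word_addr);               /* assumed helpers */
extern void     memory_write_word(uint32_t word_addr, uint32_t data);

/* Write-back store: update only the cache and mark the line dirty;
   write the victim to memory only when the line is replaced. */
void cache_write_back(uint32_t word_addr, uint32_t data)
{
    uint32_t index = word_addr % NUM_LINES;
    uint32_t tag   = word_addr / NUM_LINES;
    struct wb_line *line = &wb_cache[index];

    if (!line->valid || line->tag != tag) {            /* miss: the line is being replaced   */
        if (line->valid && line->dirty)                /* dirty victim: write it back first  */
            memory_write_word(line->tag * NUM_LINES + index, line->data);
        line->valid = true;
        line->tag   = tag;
    }
    line->data  = data;                                /* the write goes only to the cache... */
    line->dirty = true;                                /* ...and the dirty bit records it     */
}
```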

  24. Sources of Cache Misses • Compulsory (cold start or process migration, first reference): They are caused when we first start the program. • First access to a block, “cold” fact of life, not a whole lot you can do about it • If you are going to run “millions” of instructions, compulsory misses are insignificant • Conflict (collision): • Multiple memory locations mapped to the same cache location • Solution 1: increase cache size • Solution 2: increase associativity (next lecture) • Capacity: • Cache cannot contain all blocks accessed by the program • Solution: increase cache size

  25. Multiword (4-word) Block Direct Mapped Cache: Taking Advantage of Spatial Locality • The cache organization we have discussed so far (1 word = 1 block) does not take advantage of spatial locality • We want to have a cache block size > one word • So, when a cache miss occurs, we fetch multiple adjacent words, and the probability that one of these words will be needed shortly is high • Example: • Choose block size = 4 words (4x4 = 16 bytes) • Now, we need an extra “block offset” field, which selects one of the four words in the indexed block according to the request • The total size of the tag field is also reduced per word, because each tag is shared by 4 words (only 25% of the tag overhead, compared to the case where the block size is 1 word)

  26. Multiword (4-word) Block Direct Mapped Cache • Cache size: 4K blocks x 4 words = 16K words • 16K words x 4 bytes = 64 KB cache

  27. Formulas • During Lecture Period

  28. Multiword (4-word) Block Direct Mapped Cache • Four words/block, cache size = 16K words • The 32-bit address is divided into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit block offset (bits 3-2) and a 2-bit byte offset (bits 1-0). The index selects one of the 4096 entries (0-4095), the stored 16-bit tag is compared with the upper 16 address bits to generate Hit, and the block offset selects one of the four 32-bit data words in the block. • What kind of locality are we taking advantage of?
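The earlier single-word address split extends naturally with the block-offset field; here is a short, illustrative C sketch using the field widths from this slide (the names are mine, not from the text). Note that the four words of one 16-byte block differ only in the block offset, so they share the same index and tag - which is exactly why one tag can cover four words.

```c
#include <stdint.h>

/* Field widths for the 16K-word, 4-words-per-block cache on this slide. */
#define BYTE_OFFSET_BITS  2      /* bits 1-0: byte within a word   */
#define BLOCK_OFFSET_BITS 2      /* bits 3-2: word within a block  */
#define INDEX_BITS        12     /* bits 15-4: one of 4096 entries */

struct fields { uint32_t tag, index, block_offset; };

static struct fields split_multiword(uint32_t addr)
{
    struct fields f;
    f.block_offset = (addr >> BYTE_OFFSET_BITS) & ((1u << BLOCK_OFFSET_BITS) - 1);
    f.index        = (addr >> (BYTE_OFFSET_BITS + BLOCK_OFFSET_BITS))
                     & ((1u << INDEX_BITS) - 1);
    f.tag          = addr >> (BYTE_OFFSET_BITS + BLOCK_OFFSET_BITS + INDEX_BITS);
    return f;                    /* tag = upper 16 bits */
}
```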

  29. Taking Advantage of Spatial Locality • Let the cache block hold more than one word (here, two-word blocks in a two-block cache) • Consider the main memory word reference string 0 1 2 3 4 3 4 15. Start with an empty cache - all blocks initially marked as not valid. • Reference 0 misses and brings in the block Mem(1)/Mem(0) (tag 00); reference 1 then hits • Reference 2 misses and brings in Mem(3)/Mem(2) (tag 00); reference 3 hits • Reference 4 misses and replaces Mem(1)/Mem(0) with Mem(5)/Mem(4) (tag 01); references 3 and 4 then hit • Reference 15 misses and replaces Mem(3)/Mem(2) with Mem(15)/Mem(14) (tag 11) • 8 requests, 4 misses (compared with 6 misses for the same string with one-word blocks in slide 2)

  30. Cache Hits and Misses • Read misses are processed the same as with a 1 word block cache. Read the entire block from memory into the cache. • Write hits and misses must be handled differently with a multiword block cache. Consider: • Assume memory addresses X and Y both map to cache block C • C is a 4 word block containing Y • What would happen if we did a write to address X by simply overwriting the data and tag in cache block C? Ans: Cache block C would contain the tag for X, 1 word of X, and 3 words of Y.

  31. Cache Hits and Misses: Solution 1 (when a write miss occurs) • Ignore the cache when we have a write miss • Do not change the tag for X and do not update the data; just write the word to memory • But then where is the benefit of using a cache, if the written data resides only in memory? • In this case we cannot take advantage of locality • ... Another solution?

  32. Cache Hits and Misses: Solution 2 (first handle it as a read miss, then write) • Perform a tag comparison while writing the data. • If equal, we have a write hit. No problem. • If unequal, we have a write miss. – Fetch the block from memory – Rewrite the word that caused the miss (write-through) • So, with a multi-word cache, a write miss causes a read from memory.
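A minimal sketch of Solution 2 for a 4-word-block cache, assuming illustrative block-read and word-write helpers (all names are my own): the tag is compared, the whole block is fetched on a miss, and only then is the offending word rewritten, with write-through keeping memory up to date.

```c
#include <stdint.h>
#include <stdbool.h>

#define MW_LINES       4096
#define WORDS_PER_BLK  4

struct mw_line { bool valid; uint16_t tag; uint32_t data[WORDS_PER_BLK]; };
static struct mw_line mw_cache[MW_LINES];

extern void memory_read_block(uint32_t block_addr, uint32_t out[WORDS_PER_BLK]); /* assumed */
extern void memory_write_word(uint32_t word_addr, uint32_t data);                /* assumed */

/* Solution 2: compare tags while writing; on a miss, fetch the whole block
   first, then rewrite the word that caused the miss (write-through). */
void mw_cache_write(uint32_t word_addr, uint32_t data)
{
    uint32_t offset = word_addr % WORDS_PER_BLK;       /* word within the block */
    uint32_t block  = word_addr / WORDS_PER_BLK;       /* block address         */
    uint32_t index  = block % MW_LINES;
    uint16_t tag    = (uint16_t)(block / MW_LINES);
    struct mw_line *line = &mw_cache[index];

    if (!line->valid || line->tag != tag) {             /* write miss: read the block from memory */
        memory_read_block(block, line->data);
        line->valid = true;
        line->tag   = tag;
    }
    line->data[offset] = data;                          /* now rewrite the word that missed */
    memory_write_word(word_addr, data);                 /* write-through to main memory     */
}
```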

  33. Miss Penalty and Miss Rate vs Block Size • In general, larger block sizes take advantage of spatial locality • BUT: – A larger block size means a larger miss penalty: it takes longer to fill up the block – If the block size is too big relative to the cache size, the miss rate will go up: too few cache blocks • 16-64 bytes works well • In general, Average Access Time = Hit Time + Miss Rate x Miss Penalty • Need to find a middle ground (good design requires compromise)
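As an illustration with assumed numbers (not from the slides): if the hit time is 1 cycle, the miss rate is 5% and the miss penalty is 65 cycles, then Average Access Time = 1 + 0.05 x 65 = 4.25 cycles. If doubling the block size cut the miss rate to 3% but raised the penalty to 129 cycles, the result would be 1 + 0.03 x 129 = 4.87 cycles - so the larger block is not automatically a win, which is exactly the middle ground the slide asks for.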

  34. Miss Rate vs Block Size vs Cache Size • Miss rate goes up if the block size is too large relative to the cache size, because the number of blocks that can be held in a cache of the same size is smaller (increasing capacity misses)

  35. Spatial Locality • Does increasing the block size help the miss rate? • Yes, until the number of blocks in the cache becomes small. Then, a cache block may be swapped out before many of its words are accessed, losing any spatial locality benefits.
  Program                 Instruction miss rate   Data miss rate   Effective combined miss rate
  gcc (1 word blocks)     6.1%                    2.1%             5.4%
  gcc (4 word blocks)     2.0%                    1.7%             1.9%
  spice (1 word blocks)   1.2%                    1.3%             1.2%
  spice (4 word blocks)   0.3%                    0.6%             0.4%

  36. Miss Penalty • Increasing the block size also increases the miss penalty since we must read more words from memory for each miss. Reading more words takes more time. • Miss Penalty = latency to the first word + transfer time for the rest. • One way around this problem is to design our memory system to transfer blocks of memory efficiently. • One common way is to increase the width of the memory and the bus. (Transfer 2 or 4 words at a time, 1 latency period.) • Another common way is interleaving. This technique uses banks of memory. The requested address is sent to each bank in parallel. The memory latency is incurred once. Then the banks take turns at sending the requested words to the cache.

  37. Cache Summary • The Principle of Locality: • Program likely to access a relatively small portion of the address space at any instant of time • Temporal Locality: Locality in Time • Spatial Locality: Locality in Space • Three Major Categories of Cache Misses: • Compulsory Misses: sad facts of life. Example: cold start misses • Conflict Misses: increase cache size and/or associativity Nightmare Scenario: ping pong effect! • Capacity Misses: increase cache size • Cache Design Space • total size, block size, associativity (replacement policy) • write-policy (write-through, write-back)

  38. Main Memory Organizations • Three organizations (figure): (a) one-word-wide memory organization - CPU, cache, bus and memory are all one word wide; (b) wide memory organization - a wider memory and bus, with a multiplexor between the cache and the CPU; (c) interleaved memory organization - four memory banks (bank 0 - bank 3) on a one-word-wide bus • DRAM access time >> bus transfer time • Processing the cache miss: latency to fetch the first word from MEM (finding the address for word 0) + block transfer time (to bring all words of the block from MEM) • It is difficult to reduce the latency to fetch the first word from MEM. However, we can reduce the miss penalty if we increase the bandwidth from MEM to cache

  39. Memory Access Time Example • Assume that it takes 1 cycle to send the address, 15 cycles for each DRAM access, and 1 cycle to send a word of data. • With a cache block of 4 words and a one-word-wide DRAM (fig. a on slide 38): miss penalty = 1 + 4x15 + 4x1 = 65 cycles • With a main memory and bus width of 2 words (fig. b): miss penalty = 1 + 2x15 + 2x1 = 33 cycles. For a 4-word-wide memory, the miss penalty is 17 cycles, but this is expensive due to the wide bus and control circuits. • With interleaved memory of 4 memory banks and the same one-word bus width (fig. c): miss penalty = 1 + 1x15 + 4x1 = 20 cycles. The memory controller must supply consecutive addresses to the different memory banks. Interleaving is universally adopted in high-performance computers.
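The arithmetic above generalizes cleanly. Here is a small, illustrative C program (my own sketch, not from the slides) that reproduces the 65-, 33-, 17- and 20-cycle penalties from the stated cycle counts; the interleaved case assumes at least as many banks as words per block.

```c
#include <stdio.h>

/* Cycle counts from the slide: 1 to send the address, 15 per DRAM access,
   1 to transfer one bus-width of data. Block size is 4 words. */
#define ADDR_CYCLES   1
#define DRAM_CYCLES   15
#define BUS_CYCLES    1
#define BLOCK_WORDS   4

/* (a) one-word-wide memory. */
static int penalty_narrow(void)
{
    return ADDR_CYCLES + BLOCK_WORDS * DRAM_CYCLES + BLOCK_WORDS * BUS_CYCLES;
}

/* (b) memory and bus that are w words wide. */
static int penalty_wide(int w)
{
    int accesses = (BLOCK_WORDS + w - 1) / w;
    return ADDR_CYCLES + accesses * DRAM_CYCLES + accesses * BUS_CYCLES;
}

/* (c) interleaved banks (>= BLOCK_WORDS banks): one DRAM latency, then
   the banks take turns sending words over the one-word bus. */
static int penalty_interleaved(void)
{
    return ADDR_CYCLES + DRAM_CYCLES + BLOCK_WORDS * BUS_CYCLES;
}

int main(void)
{
    printf("one-word wide : %d cycles\n", penalty_narrow());       /* 65 */
    printf("two-word wide : %d cycles\n", penalty_wide(2));        /* 33 */
    printf("four-word wide: %d cycles\n", penalty_wide(4));        /* 17 */
    printf("interleaved   : %d cycles\n", penalty_interleaved());  /* 20 */
    return 0;
}
```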

  40. Cache Performance • During Lecture Period
