
Chapter Seven Memory Hierarchy-2 by Patterson


Presentation Transcript


  1. Chapter Seven Memory Hierarchy-2 by Patterson

  2. Hits vs. Misses Increasing the block size usually decreases the miss rate, but the miss rate may eventually go up if the block size becomes a significant fraction of the cache size. Why?

  3. Hits vs. Misses • Because: • The number of blocks that can be held in the cache becomes small. • Competition between blocks in the cache increases. • As a result, a block will be bumped out of the cache before many of its words have been accessed. • The cost of a miss also increases: • The time to fetch the block has two parts: • 1. The latency to the first word, and • 2. The transfer time for the rest of the block. This part increases as the block becomes larger. A rough model of this two-part penalty is sketched below.
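As an illustration of the second point, here is a minimal sketch (not from the slides) of the two-part miss penalty model. The cycle counts are assumptions borrowed from the hardware example later in the deck.

```python
LATENCY_CYCLES = 15   # assumed latency to the first word (illustrative)
TRANSFER_CYCLES = 1   # assumed cycles to transfer each word of the block

def miss_penalty(block_words: int) -> int:
    """Cycles to service one miss: first-word latency plus per-word transfer."""
    return LATENCY_CYCLES + block_words * TRANSFER_CYCLES

for words in (1, 4, 16, 64):
    print(f"block of {words:2d} words -> miss penalty {miss_penalty(words)} cycles")
```

The fixed latency is amortized over more words as blocks grow, but the transfer term grows linearly, which is why very large blocks stop paying off.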

  4. Hits vs. Misses • Read hits • This is what we want. • Read misses • The control unit must detect a miss and process it by fetching the requested data from the lower-level memory. • The miss is handled by the processor control unit together with a separate controller that initiates the memory access and refills the cache. • Stall the CPU, fetch the block from memory, deliver it to the cache, and restart: • Send the original PC value (current PC – 4) to the memory. • Instruct main memory to perform a read and wait for the memory to complete its access. • Write the cache entry, putting the data from memory in the data portion of the entry, writing the upper bits of the address (from the ALU) into the tag field, and turning the valid bit on. • Restart the instruction execution at the first step, which will re-fetch the instruction, this time finding it in the cache. This sequence is sketched below.
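A schematic sketch of the read-miss sequence just listed. The miniature cache and memory structures here are hypothetical stand-ins, not a real simulator; one-word blocks are assumed.

```python
class Entry:
    def __init__(self):
        self.valid, self.tag, self.data = False, None, None

class Cache:
    def __init__(self, num_blocks: int = 8):
        self.num_blocks = num_blocks
        self.entries = [Entry() for _ in range(num_blocks)]

    def handle_read_miss(self, memory, pc):
        address = pc - 4                     # the original PC value (PC - 4)
        block_addr = address // 4            # one-word (4-byte) blocks assumed
        data = memory[block_addr]            # main memory read; the CPU stalls here
        entry = self.entries[block_addr % self.num_blocks]
        entry.data = data                                # fill the data portion
        entry.tag = block_addr // self.num_blocks        # upper bits into the tag
        entry.valid = True                               # turn the valid bit on
        return data              # execution restarts and re-fetches, now a hit

memory = {i: f"instr@{i}" for i in range(64)}
print(Cache().handle_read_miss(memory, pc=44))
```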

  5. Hits vs. Misses • Write hits: • Write-through: write the data into both the cache and memory. • This avoids an inconsistent situation in which the data is written into the cache only, so memory would hold a different value from the cache. • With a write-through scheme, every write causes the data to be written to main memory. • If every write to main memory takes, say, 100 processor clock cycles, performance suffers. • Use a write buffer to improve performance. A write buffer stores the data while it is waiting to be written to memory. • The processor continues execution after writing the data into the cache and into the write buffer. • When a write to main memory completes, the entry in the write buffer is freed. • If the write buffer is full when the processor reaches a write, the processor must stall until there is an empty position in the write buffer (see the sketch below).
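A minimal sketch of the write-buffer behavior just described. The four-entry capacity is an assumption for illustration.

```python
from collections import deque

class WriteBuffer:
    """Holds writes on their way to main memory; sketch only."""

    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.pending = deque()

    def write(self, address, data) -> bool:
        """Return False when full: the processor must stall on this write."""
        if len(self.pending) == self.capacity:
            return False
        self.pending.append((address, data))   # processor continues immediately
        return True

    def memory_write_completed(self) -> None:
        """Main memory finished one write, so its buffer entry is freed."""
        if self.pending:
            self.pending.popleft()

wb = WriteBuffer()
for a in range(5):
    print(f"write to {a}:", "buffered" if wb.write(a, a * 10) else "stall")
wb.memory_write_completed()          # one slot frees up
print("retry write to 4:", wb.write(4, 40))
```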

  6. Hits vs. Misses • Write-back: write the data only into the cache; write it to memory later. • Handle writes by updating values only in the block in the cache, then writing the modified block to the lower level of the hierarchy when the block is replaced. • Write-back schemes can improve performance, especially when processors can generate writes as fast as or faster than main memory can handle them, but they are more complex than write-through. • Write misses: • Fetch the word from memory and place it into the cache. • Overwrite the word that caused the miss in the cache block. • Write the word to main memory using the full address. • or • Read the entire block into the cache, then write the word. • Write the data only into the cache (write-back: write to memory later). A sketch of the write-back dirty-bit scheme follows.
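A sketch of write-back behavior using a dirty bit: writes update only the cached block, and the modified block goes to the lower level only when it is evicted. The class layout is illustrative, not the book's implementation.

```python
class Line:
    """One cache line with a dirty bit; layout is illustrative."""
    def __init__(self):
        self.valid = self.dirty = False
        self.tag = self.data = None

def write_word(line: Line, tag: int, data, memory: dict) -> None:
    if line.valid and line.tag == tag:               # write hit
        line.data = data
        line.dirty = True                            # memory update is deferred
    else:                                            # miss: may need a write-back
        if line.valid and line.dirty:
            memory[line.tag] = line.data             # old modified block to memory
        line.tag, line.data = tag, data
        line.valid = line.dirty = True

memory, line = {}, Line()
write_word(line, tag=3, data="A", memory=memory)     # miss, nothing to write back
write_word(line, tag=3, data="B", memory=memory)     # hit, stays dirty in cache
write_word(line, tag=7, data="C", memory=memory)     # evicts tag 3, writes back "B"
print(memory)                                        # {3: 'B'}
```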

  7. Direct Mapped Cache
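This slide shows a direct-mapped cache diagram (not reproduced here). As an illustration of the mapping it depicts, here is a minimal sketch assuming 32-bit byte addresses, one-word blocks, and a hypothetical 1024-block cache.

```python
NUM_BLOCKS = 1024          # assumed number of cache blocks (hypothetical)
BLOCK_BYTES = 4            # one-word blocks

def direct_mapped_location(address: int):
    """Return (index, tag): the one block an address maps to, and its tag."""
    block_address = address // BLOCK_BYTES
    index = block_address % NUM_BLOCKS     # cache block = block address mod #blocks
    tag = block_address // NUM_BLOCKS      # remaining upper bits, stored for lookup
    return index, tag

for addr in (0x0000_1234, 0x0010_1234):   # same index, different tags: a conflict
    print(hex(addr), "->", direct_mapped_location(addr))
```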

  8. Hardware Issues • When cache misses occur, it is difficult to reduce the latency of fetching the first word from memory, but we can reduce the miss penalty if we increase the bandwidth from the memory to the cache. • This reduction allows larger block sizes to be used while still maintaining a low miss penalty, similar to that for a smaller block. • The processor is typically connected to memory over a bus. The clock rate of the bus is usually slower than the processor, by as much as a factor of 10. The speed of this bus affects the miss penalty. • Suppose the memory access times are: • 1 memory bus clock cycle to send the address. • 15 memory bus clock cycles for each DRAM access initiated. • 1 memory bus clock cycle to send a word of data.

  9. Hardware Issues If we have a cache block of four words and a one-word-wide bank of DRAMs, the miss penalty is 1 + 4×15 + 4×1 = 65 memory bus clock cycles, and the number of bytes transferred per bus clock cycle is (4×4)/65 ≈ 0.25.

  10. Hardware Issues • There are three options (organizations): • a. One-word-wide memory organization: as described on the previous slide. • b. Wide memory organization: increase the bandwidth to memory by widening the memory and the buses between the processor and memory. • This permits parallel access to all the words of the block. • With a memory width of two words, the miss penalty is 1 + 2×15 + 2×1 = 33 memory bus clock cycles. • Bandwidth per miss is 16/33 ≈ 0.48 bytes per bus cycle. • With a memory width of four words, the miss penalty is 1 + 15 + 1 = 17 memory bus clock cycles. • Bandwidth per miss is 16/17 ≈ 0.94 bytes per bus cycle.

  11. Hardware Issues • c. Interleaved memory organization: increase the bandwidth by widening the memory but not the interconnection bus. • The memory chips are organized in banks so that multiple words can be read or written in one access time, rather than a single word each time. • We still pay a cost to transmit each word, but we avoid paying the access latency more than once. • Sending an address to several banks permits them all to read simultaneously; this is interleaving, incurring the full memory latency only once. • Miss penalty: 1 + 15 + 4×1 = 20 memory bus clock cycles. • Bandwidth per miss is 16/20 = 0.8 bytes per bus cycle. The sketch below recomputes all three organizations.
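Recomputing the miss penalties and bandwidths from the last three slides, under the stated assumptions: a four-word block, 1 bus cycle to send the address, 15 cycles per DRAM access, and 1 bus cycle to transfer each word.

```python
ADDR, ACCESS, XFER, WORDS = 1, 15, 1, 4

penalties = {
    "one-word-wide":  ADDR + WORDS * ACCESS + WORDS * XFER,  # 1 + 4*15 + 4*1 = 65
    "two-word-wide":  ADDR + 2 * ACCESS + 2 * XFER,          # 1 + 2*15 + 2*1 = 33
    "four-word-wide": ADDR + ACCESS + XFER,                  # 1 + 15 + 1     = 17
    "interleaved":    ADDR + ACCESS + WORDS * XFER,          # 1 + 15 + 4*1   = 20
}

for name, penalty in penalties.items():
    bandwidth = (WORDS * 4) / penalty        # 16 bytes moved per miss
    print(f"{name:15s} penalty={penalty:2d} cycles, {bandwidth:.2f} bytes/cycle")
```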

  12. Hardware Issues

  13. Performance • How to measure cache performance: • CPU time = (CPU execution clock cycles + Memory-stall clock cycles) × Clock cycle time, where the memory-stall time comes from cache misses. • Memory-stall clock cycles = Read-stall cycles + Write-stall cycles • Read-stall cycles = (Reads/Program) × Read miss rate × Read miss penalty • For a write-through scheme, there are two sources of write stalls: • Write misses: we must fetch the block. • Write buffer stalls: the write buffer is full. • Write-stall cycles = (Writes/Program × Write miss rate × Write miss penalty) + Write buffer stalls

  14. Performance • Write buffer stalls are usually very small and can be ignored. • In most write-through cache organizations, the read and write miss penalties are the same (the time to fetch the block from memory), so: • Memory-stall clock cycles = (Memory accesses/Program) × Miss rate × Miss penalty • or • Memory-stall clock cycles = (Instructions/Program) × (Misses/Instruction) × Miss penalty
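The simplified formula transcribed directly into code; the numbers in the usage example are hypothetical, not from the slides.

```python
def memory_stall_cycles(instr_count: int, accesses_per_instr: float,
                        miss_rate: float, miss_penalty: int) -> float:
    """Memory-stall cycles = memory accesses x miss rate x miss penalty."""
    return instr_count * accesses_per_instr * miss_rate * miss_penalty

# Hypothetical: 1M instructions, 1.2 memory accesses per instruction,
# 3% miss rate, 100-cycle miss penalty.
print(memory_stall_cycles(1_000_000, 1.2, 0.03, 100))   # 3,600,000 stall cycles
```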

  15. Performance • See the example on page 493. • (Calculating cache performance) Assume the instruction cache miss rate for a program is 2% and the data cache miss rate is 4%. If a processor has a CPI (clock cycles per instruction) of 2 without any memory stalls and the miss penalty is 100 cycles for all misses, determine how much faster the processor would run with a perfect cache that never missed. The frequency of loads and stores is 36%. • Solution: • The number of instruction miss cycles in terms of the instruction count (I) is: Instruction miss cycles = I × 2% × 100 = 2.00 × I • The number of miss cycles for data references: Data miss cycles = I × 36% × 4% × 100 = 1.44 × I • The total number of memory-stall cycles is 2.00 I + 1.44 I = 3.44 I. • The CPI with memory stalls is 2 + 3.44 = 5.44. • The CPI without memory stalls (perfect cache) is 2.

  16. Performance • The ratio of CPU execution times is (CPU time with stalls)/(CPU time with perfect cache) = 5.44/2 = 2.72.
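A quick check of this example's arithmetic:

```python
instr_miss = 0.02 * 100            # 2.00 stall cycles per instruction
data_miss = 0.36 * 0.04 * 100      # 1.44 stall cycles per instruction
cpi = 2 + instr_miss + data_miss   # 5.44
print(cpi, cpi / 2)                # 5.44, 2.72x slower than a perfect cache
```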

  17. Performance • (Continued from the last example) What if we speed up the processor by reducing its CPI from 2 to 1? The CPI with stalls becomes 1 + 3.44 = 4.44, and the ratio is 4.44/1 = 4.44. • The fraction of execution time spent on memory stalls rises from 3.44/5.44 = 63% to 3.44/4.44 = 77%. • See the example on page 495. (Cache Performance with Increased Clock Rate) When the clock rate is increased, the new miss penalty is twice as many clock cycles. Suppose we increase the performance of the computer in the previous example by doubling its clock rate. Since main memory speed is unlikely to change, assume that the absolute time to handle a cache miss does not change. How much faster will the computer be with the faster clock, assuming the same miss rate as the previous example? The computation is worked below.
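Working this example through under its stated assumptions: miss rates and base CPI are unchanged, while the 100-cycle penalty doubles to 200 because the absolute miss time is fixed and the cycle time halves.

```python
def stalls(penalty: int) -> float:
    # instruction misses + data misses, per instruction
    return 0.02 * penalty + 0.36 * 0.04 * penalty

cpi_slow = 2 + stalls(100)             # 5.44
cpi_fast = 2 + stalls(200)             # 8.88
# Relative performance = (CPI_slow * cycle) / (CPI_fast * cycle / 2)
print(cpi_slow / (cpi_fast / 2))       # ~1.23: far less than the 2x clock gain
```

The machine with twice the clock rate ends up only about 1.23 times faster, because memory stalls consume a larger fraction of the faster machine's cycles.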

  18. Performance • Reducing cache misses by more flexible placement of blocks: • Simple placement: a block can go in exactly one place in the cache. This is direct mapped, one extreme. • The other extreme is fully associative. • Fully associative: a block can be placed in any location in the cache. To find a given block in a fully associative cache, all the entries in the cache must be searched. • It needs a comparator per entry to search all the blocks in parallel. • The comparators increase the hardware cost. • It is practical only for caches with a small number of blocks.

  19. Performance • Set associative: a middle range of designs between direct mapped and fully associative. • An n-way set-associative cache consists of a number of sets, each of which consists of n blocks. • Each block in memory maps to a unique set in the cache, given by the index field, and a block can be placed in any element of that set. • It combines direct-mapped and fully associative placement: a block is directly mapped to a set, and all the blocks in the set are searched for a match.

  20. Performance • The set containing a memory block is given by (Block number) modulo (Number of sets in the cache). • Since the block may be placed in any element of the set, all the elements of the set must be searched. • All the tags of all the elements of the set must be searched. • In a fully associative cache, the tags of all the blocks in the cache must be searched. • See Figure 7.13. In direct-mapped placement, there is only one cache block where memory block 12 can be found, and that block is given by (12 modulo 8) = 4. In a two-way set-associative cache, there would be four sets, and memory block 12 must be in set (12 modulo 4) = 0. In fully associative placement, memory block 12 can appear in any of the eight cache blocks. The sketch below reproduces this mapping.
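The Figure 7.13 example as a few lines of code, for an eight-block cache:

```python
BLOCK, CACHE_BLOCKS = 12, 8

print("direct mapped:   block", BLOCK % CACHE_BLOCKS)         # 12 mod 8 = 4
print("two-way (4 sets): set ", BLOCK % (CACHE_BLOCKS // 2))  # 12 mod 4 = 0
print("fully associative: any of the", CACHE_BLOCKS, "blocks")
```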

  21. Performance • A direct-mapped cache is simply a one-way set-associative cache: each entry holds one block and each set has one element. • A fully associative cache with m entries is an m-way set-associative cache: it has one set with m blocks, and an entry can reside in any block within that set. • Increasing the degree of associativity usually decreases the miss rate but increases the hit time.

  22. Decreasing miss ratio with associativity • See the example on page 499. • (Misses and Associativity) Assume three small caches, each consisting of four one-word blocks. One cache is fully associative, a second is two-way set associative, and the third is direct mapped. Find the number of misses for the sequence of block addresses 0, 8, 0, 6, 8. • Note: because we have a choice of which entry in a set to replace on a miss, we need a replacement rule. • Set-associative caches usually replace the least recently used block within a set; that is, the block that was used furthest in the past is replaced. • The reference traces for each organization are shown on the next slide.

  23. Decreasing miss ratio with associativity • [The slide traces the reference stream through the direct-mapped, two-way set-associative, and fully associative caches; the trace tables are not reproduced here. The direct-mapped cache has 5 misses, the two-way set-associative cache has 4, and the fully associative cache has 3.] • How much of a reduction in the miss rate is achieved by associativity? See Figure 7.15. The simulation below reproduces the miss counts.
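A small simulation of the example's trace (block addresses 0, 8, 0, 6, 8) on the three four-block caches, using LRU replacement within each set as the slide specifies:

```python
def count_misses(trace, num_sets, total_blocks=4):
    ways = total_blocks // num_sets
    sets = [[] for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        s = sets[addr % num_sets]
        if addr in s:
            s.remove(addr)          # hit: refresh its LRU position
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)            # evict the least recently used block
        s.append(addr)              # mark as most recently used
    return misses

trace = [0, 8, 0, 6, 8]
print("direct mapped:    ", count_misses(trace, 4))  # 5 misses
print("2-way set assoc:  ", count_misses(trace, 2))  # 4 misses
print("fully associative:", count_misses(trace, 1))  # 3 misses
```

Note that 0, 8, and 6 all map to set 0 in the two-way cache, so associativity helps but cannot eliminate the conflicts entirely here.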

  24. Decreasing miss ratio with associativity How do we find a block in a set-associative cache? • Each block in a set-associative cache includes an address tag that gives the block address. The address is divided into three fields: tag | index | block offset (see Figure 7.16). • The index is used to select the set containing the address of interest. • Address tag: gives the block address. The tag of every cache block within the selected set is checked to see if it matches the block address from the processor. • Block offset: the address of the desired data within the block. • All the tags in the set are searched in parallel. A field-extraction sketch follows.
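A sketch of the field extraction, assuming a 32-bit byte address, 16-byte (four-word) blocks, and a hypothetical 256-set cache; the field widths are assumptions for illustration.

```python
OFFSET_BITS = 4    # 16-byte (four-word) blocks
INDEX_BITS = 8     # 256 sets: an assumed cache size

def split_address(address: int):
    """Split a 32-bit byte address into (tag, index, block offset)."""
    block_offset = address & ((1 << OFFSET_BITS) - 1)
    index = (address >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, block_offset

tag, index, offset = split_address(0xDEADBEEF)
print(hex(tag), index, offset)   # the index selects the set; tags are compared
```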

  25. Decreasing miss ratio with associativity • In Figure 7.17, all the tags are searched in parallel. • In a four-way set-associative cache, four comparators are needed, together with a 4-to-1 multiplexor to choose among the four potential members of the selected set.

  26. Decreasing miss ratio with associativity

  27. Decreasing miss ratio with associativity • The choice among direct-mapped, set-associative, and fully associative mapping in any memory hierarchy depends on the cost of a miss versus the cost of implementing associativity, both in time and in extra hardware. • See the example on page 504. (Size of Tags versus Set Associativity) Increasing associativity requires more comparators and more tag bits per cache block. Assuming a cache of 4K blocks, a four-word block size, and a 32-bit address, find the total number of sets and the total number of tag bits for caches that are direct mapped, two-way set associative, four-way set associative, and fully associative.

  28. Decreasing miss ratio with associativity • Increasing associativity requires more comparators and more tag bits per cache block. • A four-word block size means 16 = 2^4 bytes per block. • A 32-bit address therefore yields 32 − 4 = 28 bits to be used for index and tag. • A direct-mapped cache has the same number of sets as blocks, so it uses log2(4K) = 12 bits of index, leaving 28 − 12 = 16 tag bits. The total number of tag bits is 16 × 4K = 64 Kbits. • Each doubling of associativity halves the number of sets, decreases the number of index bits by one, and increases the number of tag bits by one. • A two-way set-associative cache has 2K sets; the total number of tag bits is (28 − 11) × 2 × 2K = 34 × 2K = 68 Kbits.

  29. Decreasing miss ratio with associativity • A four-way set-associative cache has 1K sets; the total number of tag bits is (28 − 10) × 4 × 1K = 72 × 1K = 72 Kbits. • A fully associative cache has only one set with 4K blocks, and the tag is 28 bits, leading to a total of 28 × 4K × 1 = 112 Kbits of tag. The sketch below recomputes all four cases.
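Recomputing the page-504 example across all four associativities, under the stated assumptions (4K blocks, four-word blocks, 32-bit addresses, so 28 bits split between index and tag):

```python
BLOCKS, ADDR_BITS, OFFSET_BITS = 4 * 1024, 32, 4   # 2^4 = 16 bytes per block
upper_bits = ADDR_BITS - OFFSET_BITS               # 28 bits for index + tag

for ways in (1, 2, 4, BLOCKS):                     # direct mapped ... fully assoc.
    sets = BLOCKS // ways
    index_bits = sets.bit_length() - 1             # log2(sets); sets is a power of 2
    tag_bits = upper_bits - index_bits
    total = tag_bits * BLOCKS                      # tag bits per block, all blocks
    print(f"{ways:4d}-way: {sets:4d} sets, tag={tag_bits} bits, "
          f"total={total // 1024} Kbits")
```

This prints 64, 68, 72, and 112 Kbits, matching the slide's corrected figures.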

  30. Choosing which block to replace • Least recently used (LRU): the block replaced is the one that has been unused for the longest time.

  31. Decreasing miss penalty with multilevel caches • Add a second-level cache: • Often, the primary cache is on the same chip as the processor. • Use SRAMs to add another cache between the processor and main memory (DRAM). • The miss penalty goes down if the data is found in the second-level cache. • Example: page 505 (Performance of Multilevel Caches). Consider a processor with a base CPI of 1.0, assuming all references hit in the primary cache, and a clock rate of 5 GHz. Assume a main memory access time of 100 ns, including all the miss handling, and a miss rate per instruction at the primary cache of 2%. How much faster will the processor be if we add a secondary cache that has a 5 ns access time for either a hit or a miss and is large enough to reduce the miss rate to main memory to 0.5%?

  32. Decreasing miss penalty with multilevel caches • CPI of 1.0 on a 5 GHz machine with a 2% miss rate and 100 ns DRAM access. • Adding a second-level cache with a 5 ns access time reduces the miss rate to main memory to 0.5%. • The miss penalty to main memory is (100 ns)/(0.2 ns/clock cycle) = 500 clock cycles. • For the processor with one level of caching: Total CPI = Base CPI + Memory-stall cycles per instruction = 1.0 + 2% × 500 = 11.0 • With two levels of cache, the miss penalty for an access that is satisfied by the second-level cache is (5 ns)/(0.2 ns/clock cycle) = 25 clock cycles. • For the two-level cache: Total CPI = 1 + Primary stalls per instruction + Secondary stalls per instruction = 1 + 2% × 25 + 0.5% × 500 = 1 + 0.5 + 2.5 = 4.0 • The processor with the secondary cache is faster by 11.0/4.0 = 2.75 ≈ 2.8. This is verified below.
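A quick check of the multilevel-cache arithmetic under the example's assumptions:

```python
CYCLE_NS = 0.2                         # 5 GHz clock
main_penalty = 100 / CYCLE_NS          # 500 clock cycles to main memory
l2_penalty = 5 / CYCLE_NS              # 25 clock cycles to the L2 cache

cpi_one_level = 1.0 + 0.02 * main_penalty                       # 11.0
cpi_two_level = 1.0 + 0.02 * l2_penalty + 0.005 * main_penalty  # 4.0
print(cpi_one_level / cpi_two_level)   # 2.75: the L2 makes it ~2.8x faster
```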

  33. Decreasing miss penalty with multilevel caches • When using multilevel caches: • Optimize the hit time on the first-level cache. • Optimize the miss rate on the second-level cache.

  34. Cache Complexities • It is not always easy to understand the implications of caches. [Figures, not reproduced: theoretical vs. observed behavior of radix sort vs. quicksort.]

  35. Cache Complexities • Here is why: • Memory system performance is often the critical factor. • Multilevel caches and pipelined processors make outcomes harder to predict. • Compiler optimizations that increase locality sometimes hurt ILP. • It is difficult to predict the best algorithm analytically: experimental data is needed.

  36. Homework #3-1 • Page 556 • 7.5 A new processor can use either a write-through or a write-back cache, selectable through software. a. Assume the processor will run a data-intensive application with a large number of load and store operations. Explain which cache write policy should be used. b. Consider the same application, but this time for a safety-critical system in which data integrity is more important than memory performance. • 7.9 Here is a series of address references given as word addresses: 2, 3, 11, 16, 21, 13, 64, 48, 19, 11, 3, 22, 4, 27, 6 and 11. Assuming a direct-mapped cache with 16 one-word blocks that is initially empty, label each reference in the list as a hit or a miss and show the final contents of the cache. • 7.12 Compute the total number of bits required to implement the cache in Figure 7.9 on page 486. This number is different from the size of the cache, which usually refers to the number of bytes of data stored in the cache. The number of bits needed to implement the cache is the total amount of memory needed to store all the data, tags, and valid bits.

  37. Homework #3-1 • Page 557 • 7.14 Consider a memory hierarchy using one of the three organizations for main memory shown in Figure 7.11 on page 486. Assume that the cache block size is 16 words, that the width of organization (b) in the figure is four words, and that the number of banks in organization (c) is four. If the main memory latency for a new access is 10 memory bus clock cycles and the transfer time is 1 memory bus clock cycle, what are the miss penalties for each of these organizations? • Page 558 • 7.29 Suppose a computer's address size is k bits (using byte addressing), the cache size is S bytes, the block size is B bytes, and the cache is A-way set-associative. Assume that B is a power of two, so B = 2^b. Figure out what the following quantities are in terms of S, B, A, b, and k: the number of sets in the cache, the number of index bits in the address, and the number of bits needed to implement the cache. (See 7.12.)
