Reducing Cache Miss Penalties


Presentation Transcript


  1. Reducing Cache Miss Penalties • Traditionally, the focus of cache improvements has been on miss rate • But our memory access formula suggests that reducing the miss penalty (the time to retrieve a block from main memory) will also be of benefit • A problem with the miss penalty is that, since processors and SRAM memory have increased in speed much faster than DRAM, the miss penalty has been increasing over time rather than decreasing!

  2. Priority of Reads over Writes • As stated earlier, “make the more common case fast” • reads occur with a much greater frequency than writes • instructions are read-only, and many operands are read but never written back • How to make reads faster than writes? Writes are slower anyway because of the need to write to both cache and main memory • Write-through cache: the write buffer should be properly sized to prevent stalls while writing to memory • Write-back cache: we can use a buffer to hold a dirty block when it is removed from the cache and write it to memory later, so that the cache is not stalled by the write

  3. Sub-block Placement • Because tags may be very large and take up too much space in the cache, designers may move toward larger block sizes • This reduces tag size and improves the miss rate, although it raises the miss penalty • Sub-block placement may improve the miss penalty in this case • A valid bit is added to units smaller than a full block, making up sub-blocks • A single sub-block is read into the cache on a miss and its valid bit is set (all other sub-blocks of the given block have their valid bits cleared) • See figure 5.21, page 413

  4. Early Restart • If a cache miss occurs and the memory system moves a block into the cache, the cache may wait until the entire block has arrived before returning the requested word • Early restart - the cache transmits the requested word as soon as it arrives from memory rather than waiting on the whole block • Critical word first - request the specific word needed first and the remainder of the block afterward; as soon as the requested word arrives, send it to the CPU. This is also called a wrapped fetch • These techniques are only useful for large blocks (e.g., when the block size is larger than the bus width)

  5. Nonblocking Caches • For CPUs that can execute instructions out of order (e.g., Tomasulo’s approach) there is no need to stall while memory returns the missing block to the cache • Instead, the CPU can continue fetching instructions if there is a data cache miss, or fetching data if there is an instruction cache miss • Similarly, a nonblocking cache allows the data cache to continue supplying cache hits during a miss (also called a lock-up free cache) • This is known as “hit under miss” and reduces the overall miss penalty

  6. Second-Level Caches • Should the cache be faster to keep pace with the CPU, or should the cache be larger to lower the miss rate? Both • Offer a smaller but faster cache and a larger but slower cache (still much faster than main memory) • New formula for avg mem access time = hit time L1 + miss rate L1 * miss penalty L1 (L1 - first-level cache) • Miss penalty L1 = hit time L2 + miss rate L2 * miss penalty L2 (L2 - second-level cache) • Or avg mem access time = hit time L1 + miss rate L1 * (hit time L2 + miss rate L2 * miss penalty L2), as in the sketch below
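
A minimal sketch of the two-level formula, with purely hypothetical numbers (the hit times, miss rates, and main-memory penalty below are assumed only to illustrate the calculation):

```c
/* Sketch of the two-level average memory access time formula.
 * All timing and miss-rate numbers are hypothetical. */
#include <stdio.h>

int main(void) {
    double hit_time_l1     = 1.0;    /* cycles, assumed */
    double miss_rate_l1    = 0.04;   /* 4%, assumed */
    double hit_time_l2     = 10.0;   /* cycles, assumed */
    double miss_rate_l2    = 0.50;   /* local miss rate of L2, assumed */
    double miss_penalty_l2 = 100.0;  /* cycles to main memory, assumed */

    /* The miss penalty of L1 is the time spent in L2 (and below). */
    double miss_penalty_l1 = hit_time_l2 + miss_rate_l2 * miss_penalty_l2;
    double amat = hit_time_l1 + miss_rate_l1 * miss_penalty_l1;

    printf("L1 miss penalty = %.1f cycles\n", miss_penalty_l1);
    printf("Average memory access time = %.1f cycles\n", amat);
    return 0;
}
```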

  7. Miss Rates • We redefine miss rates because there are two caches: • Local miss rate - the number of misses in a cache divided by the total number of memory accesses to this cache • Global miss rate - the number of misses in a cache divided by the total number of memory accesses overall • The local miss rate of the second-level cache will be larger than that of the first-level cache, because the first-level cache skims off the easy hits and the second-level cache is only asked to search when the first level has missed entirely • The global miss rate is a more useful term than the second-level cache’s local miss rate -- it tells us how many misses there are out of all accesses

  8. Example • Suppose in 1000 references, there are 40 misses in the first level cache and 20 in the second level cache. What are the miss rates? • Local (and global) miss rate of first level cache = 40/1000 = 4% • Local miss rate of second level cache = 20/40 = 50% • Global miss rate of second level cache = 20/1000 = 2% • The second level’s local miss rate is misleading and the global miss rate gives us an indication of how both caches perform overall
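
The same arithmetic written out as a short program, using the slide’s numbers (1000 references, 40 misses in L1, 20 misses in L2):

```c
/* Local vs. global miss rates for the example above. */
#include <stdio.h>

int main(void) {
    double refs = 1000.0, l1_misses = 40.0, l2_misses = 20.0;

    double l1_local  = l1_misses / refs;       /* L1 sees every reference    */
    double l2_local  = l2_misses / l1_misses;  /* L2 only sees L1's misses   */
    double l2_global = l2_misses / refs;       /* misses out of all accesses */

    printf("L1 local/global miss rate: %.0f%%\n", l1_local  * 100);  /* 4%  */
    printf("L2 local miss rate:        %.0f%%\n", l2_local  * 100);  /* 50% */
    printf("L2 global miss rate:       %.0f%%\n", l2_global * 100);  /* 2%  */
    return 0;
}
```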

  9. Cache Sizes vs. Performance • Notice in figure 5.23 we see how a large single cache performs vs. two caches, a small and a large • If the second-level cache is large enough, both schemes perform roughly the same in terms of miss rate -- but with the two level cache, we have the added benefit that the first-level cache is faster • So, we need to ask - what sizes should the caches be and what speeds should they operate at?

  10. Second-Level Cache • Since everything in the first-level cache will most likely be in the second-level cache, the second-level cache should be substantially larger • Another question is associativity: how does this affect the performance of the second-level cache? • See the example on page 420. It indicates that an associative second-level cache makes sense, since the associativity lowers the miss rate but does not have a dramatic effect on the miss penalty • We may also increase the block size of the second-level cache to improve the miss rate (again without much of an effect on the miss penalty)

  11. Reducing Hit Time • Miss rate and miss penalty play a large role in the average memory access time, but hit time is also important since most cache accesses are hits -- reducing it might improve performance beyond what reducing the miss rate and miss penalty can • Hit time also has an impact on the clock speed - it doesn’t make much sense to have a clock faster than the cache, because the CPU would have to stall on every memory fetch - whether an instruction or a data fetch

  12. Small and Simple Caches • A cache access requires using the index portion of the memory address to read the cache entry tags and then compare them • This can be time consuming, especially with associative caches that have large tags • It is also critical to keep the cache small so that it fits on the chip • Some designs compromise by keeping the tags on the chip and the data off the chip, which provides fast tag checking along with the ability to store a lot of data • Another approach is to use direct-mapped caches

  13. Avoid Address Translation • The CPU generates a virtual address that must first be translated into a physical memory address through hardware translation • However, if we store the virtual addresses in the cache, we can skip this translation • There are problems with this approach though: • if a process is switched out of memory then the cache must be flushed • The OS and user may share addresses in two separate virtual address spaces -- and this may cause problems if we use the virtual addresses in the cache

  14. Pipelining Writes for Fast Write Hits • Writes take longer than reads because the tag must be checked before the write can begin (a read can commence immediately, and if the tag is wrong, the item read is simply discarded) • By pipelining writes, we can partially speed up the process • This works by overlapping successive writes: while the tag for the current write is being checked, the data of the previous write (whose tag has already been verified) is written into the cache • In this way, a write effectively takes the same time as a read

  15. Cache Optimization Summary • For each category, the hardware complexity is provided as an integer from 0 (easy) to 3 (difficult) • Miss rate: Larger block size (0) (has a negative impact on miss penalty), higher associativity (1) (has a negative impact on hit time), victim caches (2), pseudo-assoc. caches (2), hardware prefetching (2), compiler prefetching (3), other compiler techniques (0) • Miss Penalty: Priority of read over write (1), Subblocks (1), Early restart (2), Nonblocking cache (3), Second-level cache (2) • Hit time: Small, simple cache (0) (negative impact on miss rate), avoiding address translation (2), pipeline writes (1)

  16. Main Memory • The next level in the memory hierarchy, between cache and I/O/storage • Although one concern with main memory has been increasing its size, this is much less of a concern today than main memory latency, which determines the cache miss penalty and therefore the average memory access time • With the use of second-level caches, another concern is main memory bandwidth - that is, how much can be transferred between main memory and cache at a time

  17. Main Memory Technology • In order to handle larger and larger DRAM sizes • the memory address is divided into two components that are multiplexed over the same pins • the first is used to identify the row of the memory location • the second identifies the column of the memory location • DRAMs also require refreshing (~5% of the time) • See figure 5.30, which shows how DRAM sizes have increased greatly while speed has increased very little • SRAMs, which are more expensive, require neither refreshing nor multiplexing and so are faster
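
A small sketch of the row/column split; the 24-bit address and the 12/12 bit split are assumed, only to show the two halves that would be sent over the same pins:

```c
/* Sketch of DRAM row/column address multiplexing: the same pins carry
 * the row half (with RAS) and then the column half (with CAS).
 * The address width and bit split are hypothetical. */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t addr = 0x00ABC123;    /* 24-bit DRAM address, assumed */
    uint32_t row  = addr >> 12;    /* upper half, sent first  */
    uint32_t col  = addr & 0xFFF;  /* lower half, sent second */

    printf("row = 0x%03X, col = 0x%03X\n", (unsigned)row, (unsigned)col);
    return 0;
}
```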

  18. Improving Memory Performance • It is easier to improve memory bandwidth than memory latency • We could improve bandwidth by also increasing the cache block size (which may increase the miss rate, but since the transfer rate is greater, the miss penalty would not be affected very much) • Consider a memory organization where it takes 4 clock cycles to send an address to memory, 24 cycles of access time per word, and 4 cycles to send the word back • If the cache block = 4 words (16 bytes), then the miss penalty = 4*(4+24+4) = 128 clock cycles! (memory bandwidth would be 16 bytes/128 cycles, or 1/8 byte per clock cycle)
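
Working the slide’s numbers through a short program (4 cycles for the address, 24 cycles of access per word, 4 cycles to transfer each word, 4-word blocks):

```c
/* Miss penalty and bandwidth for the baseline memory organization above. */
#include <stdio.h>

int main(void) {
    int addr_cycles = 4, access_cycles = 24, transfer_cycles = 4;
    int words_per_block = 4, bytes_per_block = 16;

    int miss_penalty = words_per_block *
                       (addr_cycles + access_cycles + transfer_cycles);
    double bandwidth = (double)bytes_per_block / miss_penalty;

    printf("Miss penalty = %d cycles\n", miss_penalty);      /* 128       */
    printf("Bandwidth    = %.3f bytes/cycle\n", bandwidth);  /* 1/8 byte  */
    return 0;
}
```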

  19. Wider Main Memory • Main memory “width” is usually equal to the width of the second level cache (the first level cache might have a width of 1 word, equal to the width of CPU accesses) • By doubling or quadrupling the width of cache, we increase the width of main memory and thus the memory bandwidth • In our previous example, if the width is 2 words instead of 1, our miss penalty drops to 2 * (4+24+4) = 64 clock cycles. If the width is 4 words, our miss penalty drops to 1 * (4+24+4) = 32 clock cycles

  20. Cost in Wider Main Memory • The cost is that to increase the bandwidth requires a larger bus: • 1 word bandwidth = 4 bytes = 32 bit data bus • 2 word bandwidth = 64 bit data bus • 4 word bandwidth = 128 bit data bus • Since the CPU still only accesses 1 word at a time, the remainder of the transfer is not used immediately (it will be stored in cache) • And we need a multiplexor between the CPU and cache so that only 1 word out of the 2 or 4 is sent to the CPU • However, if first-level cache only has a width of 1 word, we can move the multiplexor between the two caches

  21. Simple Interleaved Memory • Use banks of memory chips to interleave accesses • If a bank is 1 word wide, the width of the bus does not have to change • Send the address to several banks and interleave the fetching so that the words arrive at the cache consecutively • Assume 4-way interleaved memory; then, as per our previous example, it takes 4 clock cycles to send the address to the banks, 24 cycles for all banks to access their word in parallel, and 4 cycles per bank to return its word, giving a miss penalty of 4+24+4*4 = 44 clock cycles
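
Comparing the miss penalties of the organizations discussed so far, using the running example’s timings; the arithmetic mirrors the slides:

```c
/* Miss penalty of a 4-word block under narrow, wider, and interleaved memory. */
#include <stdio.h>

int main(void) {
    int addr = 4, access = 24, xfer = 4;  /* cycles, from the running example */

    int narrow      = 4 * (addr + access + xfer);  /* 1 word at a time: 128  */
    int twice_wide  = 2 * (addr + access + xfer);  /* 2 words at a time: 64  */
    int four_wide   = 1 * (addr + access + xfer);  /* 4 words at a time: 32  */
    /* Interleaved: one address send and one access overlapped across 4 banks,
     * then the 4 words return one after another on the narrow bus. */
    int interleaved = addr + access + 4 * xfer;    /* 44 */

    printf("narrow: %d  2-wide: %d  4-wide: %d  4-way interleaved: %d\n",
           narrow, twice_wide, four_wide, interleaved);
    return 0;
}
```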

  22. Example • Block size=1 word, memory bus width=1 word, miss rate=3%, memory accesses per instruction=1.2, cache miss penalty=32 cycles, average cycles per instruction=2 • If we change the block size to 2 words, the miss rate falls to 2% and a four-word block has a miss rate of 1% • What is the improvement in performance of interleaving two ways and four ways versus doubling the width of memory and the bus?

  23. Solution • Average CPI=2+(1.2*3%*32)=3.15 cycles • Increasing the block size to 2 words gives • 32 bit bus and memory, no interleaving = 2+(1.2*2%*2*32) = 3.54 • 32 bit bus and memory, interleaving = 2+(1.2*2%*(4+24+8)) = 2.86 • 64 bit bus and memory, no interleaving = 2+(1.2*2%*1*32) = 2.77 • Doubling the block size does not help, but interleaving or wider memory does

  24. Solution continued • If we increase the block size to four words, we get: • 32 bit bus and memory, no interleaving = 2+(1.2*1%*4*32) = 3.54 • 32 bit bus and memory, interleaving = 2+(1.2*1%*(4+24+16)) = 2.53 • 64 bit bus and memory, no interleaving = 2+(1.2*1%*2*32) = 2.77 • The larger block size hurts performance if we do not increase the width or use interleaving. The best option is to use interleaving without increasing the width (this is also cheaper)
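
The CPI figures in the two solution slides can be checked with a short program; the helper name cpi() exists only for this sketch:

```c
/* Reproducing the CPI calculations: base CPI = 2, 1.2 memory accesses per
 * instruction, miss rates of 3%/2%/1% for 1-, 2-, and 4-word blocks. */
#include <stdio.h>

static double cpi(double miss_rate, double miss_penalty) {
    return 2.0 + 1.2 * miss_rate * miss_penalty;
}

int main(void) {
    printf("1-word block, 32-bit bus:            %.2f\n", cpi(0.03, 32));          /* 3.15 */
    printf("2-word block, 32-bit, no interleave: %.2f\n", cpi(0.02, 2 * 32));      /* 3.54 */
    printf("2-word block, 32-bit, interleaved:   %.2f\n", cpi(0.02, 4 + 24 + 8));  /* 2.86 */
    printf("2-word block, 64-bit, no interleave: %.2f\n", cpi(0.02, 1 * 32));      /* 2.77 */
    printf("4-word block, 32-bit, no interleave: %.2f\n", cpi(0.01, 4 * 32));      /* 3.54 */
    printf("4-word block, 32-bit, interleaved:   %.2f\n", cpi(0.01, 4 + 24 + 16)); /* 2.53 */
    printf("4-word block, 64-bit, no interleave: %.2f\n", cpi(0.01, 2 * 32));      /* 2.77 */
    return 0;
}
```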

  25. More on Interleaving • Interleaving gives much of the benefit of wider memory: the bus is no wider, but the amount that can be in flight at any time is greater • But how many banks should be used? One metric (used in vector computers) is • Banks >= number of clock cycles to access a word • This may fulfill a goal of delivering 1 word per clock cycle • Also, the more banks, the less chance of a bank conflict occurring (notice that with 4 banks, a 4-word request may require accessing all 4 words from 1 bank, defeating the advantage of interleaving, but with 100 banks there is much less chance of conflict)

  26. More on Memory Banks • Two potential problems with interleaving: • as chip capacities increase, there are fewer small-capacity chips, making it difficult to implement large numbers of memory banks • with interleaving, memory expansion must be done in large increments (e.g., doubling or quadrupling memory size) • Another approach is independent memory banks • One accessed by CPU/cache, one for I/O, etc… • Each bank needs its own address and data buses • The whole collection of banks is called a superbank

  27. Avoiding Memory Bank Conflicts • Independent requests need to go to different banks, or else the purpose of having multiple memory banks is defeated. Some solutions include: • loop interchange to avoid accessing the same bank for data on successive iterations • expanding the size of arrays so that it is not a multiple of the number of banks (or, if the number of banks is not known, not a power of 2) • using a prime number of banks • See page 436 for a description of the address-to-bank calculation using mod (or skip it); a small sketch of that mapping follows below
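
A tiny sketch of the mod-based bank mapping; the bank count and strides are made up, only to show why a stride that is a multiple of the number of banks serializes every access on one bank:

```c
/* Word-address-to-bank mapping under simple interleaving. */
#include <stdio.h>

int main(void) {
    int banks = 4;                /* assumed number of banks */
    int strides[] = {1, 4, 5};    /* unit stride, bad stride, non-power-of-2 stride */

    for (int s = 0; s < 3; s++) {
        printf("stride %d: banks touched =", strides[s]);
        for (int i = 0; i < 8; i++) {
            int word_addr = i * strides[s];
            printf(" %d", word_addr % banks);   /* interleaved bank index */
        }
        printf("\n");
    }
    /* Stride 1 cycles through banks 0,1,2,3,...; stride 4 hits bank 0 only,
     * so all accesses serialize on one bank; stride 5 spreads out again,
     * which is why padding arrays or using a prime number of banks helps. */
    return 0;
}
```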

  28. Virtual Memory • Because main memory is limited in size, and because most computers run more than 1 process at a time (multitasking) we need a large address space • Of course, the address space does not require physical memory, much of it can reside in virtual memory (secondary storage) • Virtual memory divides physical memory into blocks and allocates them to different processes • Protection and bookkeeping mechanisms are then required -- this information is typically stored in the page table

  29. Virtual Memory and Cache • While both use address translation to determine physical locations and both must store instructions and data in a smaller space, there are many differences: • replacement strategies are implemented in hardware for the cache and in the OS for vmem • strategies for vmem should make very informed decisions (because the miss penalty is very large) • the size of the processor address determines the size of vmem but has no effect on the size of the cache • while vmem is physically stored on disk, the disk is primarily used to store the entire file system

  30. Segmentation and Paging • We covered the differences in 6300 • Most systems tend to either use paging or segmentation with paging, but not segmentation alone (notable exception, Pentium) • Paging - causes internal fragmentation, and swapping is needed more often • Segmentation - causes external fragmentation • See figure 5.39 for a comparison between the two approaches

  31. Q1: Where can a block be placed? • The miss penalty for virtual memory is high because of accessing data from a rotating disk • rotational delay, head movement delay, latency of transferring data • Because the miss penalty is high, the OS is not concerned with where the page is placed • Any free frame in memory can be used • Therefore, vmem can be thought of as fully associative since there are no restrictions on placement

  32. Q2: How is a block found? • The page table stores, for each page of a process, the frame number of that page in memory • Translation is performed by indexing into the page table using the upper part of the address and concatenating the frame number with the lower part of the address (the page offset) • To speed up the process, part of the page table might be stored in a cache, the translation look-aside buffer (TLB) • To reduce the amount of memory required for the page table, the OS may use an inverted page table
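
A minimal sketch of this translation step, assuming 4 KB pages, a flat (single-level) page table, and made-up mappings; a real system adds the TLB and the protection checks described on the following slides:

```c
/* Index the page table with the virtual page number, then concatenate
 * the frame number with the unchanged page offset. All values assumed. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_BITS 12     /* assumed 4 KB pages => 12-bit page offset */
#define NUM_PAGES 16     /* tiny address space, just for the sketch  */

static int page_table[NUM_PAGES];   /* frame number per page, -1 if not resident */

int main(void) {
    for (int i = 0; i < NUM_PAGES; i++) page_table[i] = -1;
    page_table[1] = 7;                            /* made-up mapping */

    uint32_t vaddr  = (1u << PAGE_BITS) | 0x2A4;  /* page 1, offset 0x2A4 */
    uint32_t vpn    = vaddr >> PAGE_BITS;         /* upper part indexes the page table */
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);

    if (page_table[vpn] < 0) {
        printf("page fault on page %u\n", (unsigned)vpn);   /* OS would load it from disk */
    } else {
        uint32_t paddr = ((uint32_t)page_table[vpn] << PAGE_BITS) | offset;
        printf("virtual 0x%X -> physical 0x%X\n", (unsigned)vaddr, (unsigned)paddr);
    }
    return 0;
}
```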

  33. Q3: Which block should be replaced? • A goal of the OS is to minimize page faults in order to minimize average memory access time • To accomplish this, most OSes replace pages that have not been used recently, in the expectation that they will not be used again soon • The least recently used (LRU) strategy is hard to implement exactly, and so is typically approximated using one or more use (or reference) bits which are set when a page is referenced and cleared occasionally -- pages with a clear use bit are selected for replacement, as in the sketch below
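
One common way to realize this approximation is the "clock" (second-chance) scheme; the sketch below uses made-up use bits to show how a frame with a clear bit ends up being chosen:

```c
/* Reference-bit ("clock") approximation of LRU; frame count and bits assumed. */
#include <stdio.h>

#define FRAMES 4
static int use_bit[FRAMES];
static int hand = 0;                /* clock hand sweeps over the frames */

/* Pick a victim: skip frames whose use bit is set, clearing the bit as we go. */
int choose_victim(void) {
    while (use_bit[hand]) {         /* recently used: give it a second chance */
        use_bit[hand] = 0;
        hand = (hand + 1) % FRAMES;
    }
    int victim = hand;
    hand = (hand + 1) % FRAMES;
    return victim;
}

int main(void) {
    use_bit[0] = 1; use_bit[1] = 1; use_bit[2] = 0; use_bit[3] = 1;
    printf("victim frame = %d\n", choose_victim());   /* frame 2 */
    return 0;
}
```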

  34. Q4: What happens on a write? • The vmem write strategy is to write back so that the page in memory is updated, but not the page on disk -- until later • A dirty bit is added to the page table so that if a page is written to, the bit is set • Replacing a page that has been read but not written to is simple, just discard it, but replacing a page written to means writing it back to disk • When selecting a page to replace, favor pages whose dirty bits are not set

  35. Fast Address Translation • Page tables are stored in memory • Every memory access now requires at least two memory references, one to the page table to find the frame number and one to the actual main memory location • Store part of the page table in the cache (usually a special cache, different from the instruction and data caches) -- the TLB • Each TLB entry stores • Tag: the virtual memory address generated by the CPU • Data: the page’s physical frame number, whether it is in memory or not (valid bit), dirty bit, other protection information such as if the page is shared or not
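
A hypothetical software picture of the fields just listed and of a fully associative lookup (a real TLB is a hardware structure; the entry layout and size here are assumed):

```c
/* Illustrative TLB entry and lookup; on a miss, hardware or the OS
 * would fall back to the page table in memory. */
#include <stdio.h>
#include <stdint.h>

typedef struct {
    uint32_t tag;         /* virtual page number */
    uint32_t frame;       /* physical frame number */
    unsigned valid  : 1;  /* entry holds a live translation */
    unsigned dirty  : 1;  /* page has been written */
    unsigned shared : 1;  /* other protection information */
} tlb_entry;

static tlb_entry tlb[8];  /* tiny fully associative TLB, assumed size */

int tlb_lookup(uint32_t vpn, uint32_t *frame) {
    for (int i = 0; i < 8; i++) {
        if (tlb[i].valid && tlb[i].tag == vpn) {
            *frame = tlb[i].frame;
            return 1;     /* TLB hit */
        }
    }
    return 0;             /* TLB miss */
}

int main(void) {
    tlb[0] = (tlb_entry){ .tag = 5, .frame = 42, .valid = 1 };
    uint32_t frame;
    printf("vpn 5: %s\n", tlb_lookup(5, &frame) ? "hit" : "miss");
    printf("vpn 6: %s\n", tlb_lookup(6, &frame) ? "hit" : "miss");
    return 0;
}
```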

  36. Selecting a Page Size • The main architectural parameter in virtual memory is the page size • Larger page sizes mean fewer pages, which means smaller page tables and a larger percentage of the page table in the TLB at a time • Larger page sizes can also reduce cache hit time • Smaller page sizes have a smaller miss penalty because transferring a page from disk (or over a network) takes less time • Smaller page sizes also cause less internal fragmentation • Most microprocessors support multiple page sizes

  37. Virtual Memory Protection • With multitasking, two or more processes share main memory using different locations • But, each process refers to all of virtual memory as its own, so they will have overlapping address references • There is a need to protect one process from using another’s physical memory space • Simple mechanism: • Each process has a pair of registers, base and bound, so that every memory reference must be between these two
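
A sketch of the check the hardware would perform on every reference; the register values and the convention that the bound is exclusive are assumed for illustration:

```c
/* Base-and-bound protection check. */
#include <stdio.h>
#include <stdint.h>

static uint32_t base_reg  = 0x4000;   /* start of this process's region, assumed  */
static uint32_t bound_reg = 0x8000;   /* one past the end of the region, assumed  */

/* Every memory reference must lie between base and bound,
 * otherwise the hardware raises a protection fault. */
int access_ok(uint32_t addr) {
    return addr >= base_reg && addr < bound_reg;
}

int main(void) {
    printf("0x5000: %s\n", access_ok(0x5000) ? "ok" : "protection fault");
    printf("0x9000: %s\n", access_ok(0x9000) ? "ok" : "protection fault");
    return 0;
}
```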

  38. Accessing Base and Bound • If a user’s process could access these two registers, it could change their values and then violate the protection mechanism! • So, access to these registers must be secure • Two modes of operation: kernel and user • User mode cannot access these registers (or do other “supervisor” operations such as executing an OS routine) or change to kernel mode • The computer switches to kernel mode via a system call • Kernel mode can change back to user mode

  39. More Elaborate Mechanisms • Some architectures use more than 2 modes (we will see that the Pentium has 4) • Others use concentric “rings” of security where more and more rights are available as you move in • Still other architectures restrict access to rights through a lock and key, that is, a process can only access something if it has the proper key, obtained through OS calls or other mechanisms

  40. Alpha AXP Memory Management • Uses a combination of segmentation and paging • 64-bit address space (that’s huge, and could lead to an enormous number of page table entries) • All addresses are split into 3 segments: seg0 and seg1 (both user segments) and kseg (the OS segment) • The highest 2 bits of an address denote the segment • Seg0 grows upward from 0, seg1 downward from the maximum address, and kseg sits in the middle (see figure 5.42)

  41. Alpha Paging • Within the 3 segments, the Alpha uses paging with 3 levels of page tables • See figure 5.43 for a diagram of address mapping • Page table entries are 64 bits, the first 32 bits contain the physical page frame number (4 gig) and the other half includes: • valid bit • user read enable and user write enable • kernel read enable and kernel write enable • additional fields for use by the system software

  42. More on Alpha Page Tables • Page table entries are 8 bytes long (64 bits) • Each page is 8KB in size and each page table is 1 page in length, containing 1024 page table entries • As explained on pages 451-452, the Alpha’s virtual address space is actually 43 bits long instead of 64 • Illegal addresses generated by a program are caught during translation by detecting invalid page table entries -- this leads to an OS trap • Page tables themselves are protected against user writes so that the protection cannot be circumvented
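
The 43-bit figure follows directly from the page and entry sizes above; a short check of the arithmetic:

```c
/* 8 KB pages give a 13-bit offset; a page-sized table of 8-byte entries
 * holds 1024 entries (10 index bits); three levels of tables. */
#include <stdio.h>

int main(void) {
    int page_size = 8 * 1024;   /* 8 KB pages */
    int pte_size  = 8;          /* 8-byte (64-bit) page table entries */
    int levels    = 3;          /* three levels of page tables */

    /* Count the bits needed for the page offset and the per-level index. */
    int offset_bits = 0, index_bits = 0;
    for (int s = page_size; s > 1; s >>= 1) offset_bits++;             /* 13 */
    for (int e = page_size / pte_size; e > 1; e >>= 1) index_bits++;   /* 10 */

    printf("virtual address bits = %d + %d*%d = %d\n",
           offset_bits, levels, index_bits,
           offset_bits + levels * index_bits);   /* 13 + 3*10 = 43 */
    return 0;
}
```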

  43. Conclusion on Alpha Vmem • See figure 5.44 which shows various parameters of the Alpha memory hierarchy • The Alpha’s memory management is similar to other architectures today: • OS performs multi-process protection • page table performs address translation • Notable difference is that the Alpha’s virtual memory space is expanded past the 32 bit limitation of most modern computers • The overall Alpha memory hierarchy is shown in figure 5.47, page 462

  44. Virtual Memory for Pentium • The original 8086 used segmentation -- not for virtual memory, but to expand the memory space because of the limited 16-bit address size • The Pentium architecture implements virtual memory using segmentation • Both the OS and the user have their own full address space (so an OS instruction and a user instruction could have the same address -- protection mechanisms must ensure that the proper memory space is accessed) • See pages 453-457 for details

  45. Number of Cache Ports • In Chapter 4, we saw CPUs that can issue multiple instructions per clock cycle • This requires that the cache be able to service multiple requests per clock cycle and be nonblocking, or else there is little to no benefit to issuing multiple instructions • Mechanisms for allowing multiple cache accesses: • using separate instruction and data caches • having the clock cycle be long enough to allow > 1 cache access per cycle • having a wide data path so that more than 1 word is retrieved at a time

  46. Speculative Execution and Memory • Another problem that must be resolved in the memory system is that speculative execution may generate an invalid address that would not have occurred had the instruction not been speculative • We do not want an OS trap to occur if the speculation turns out to be wrong -- so the memory system must know whether a request from a given instruction is speculative; if so and a trap would be generated, ignore it

  47. ILP vs. Reducing Cache Misses • The code shown on pages 458-459 demonstrates an interesting problem: • We unroll the nested for-loops to maximize the parallelism available - however, we find RAW hazards, so instead we interchange the outer and inner loops and then unroll • The new interchanged and unrolled loop maximizes the parallelism and minimizes stalls • Unfortunately, since the arrays are stored in row-major order in memory, by interchanging the loops we have greatly reduced the hit rate and so have worsened performance! (see the sketch below)
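
The loop-order issue can be seen on a plain C array; the array size is arbitrary, and this is only a sketch of the access-pattern difference, not the book’s actual code:

```c
/* C stores x[i][j] in row-major order, so the j-inner version walks memory
 * sequentially while the interchanged version strides by a whole row. */
#include <stdio.h>

#define N 1024
static double x[N][N];

int main(void) {
    /* Cache-friendly: consecutive iterations touch adjacent addresses. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = x[i][j] * 2.0;

    /* Interchanged: each access jumps N*sizeof(double) bytes, so nearly
     * every reference can miss even though the loop exposes more ILP. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            x[i][j] = x[i][j] * 2.0;

    printf("%f\n", x[0][0]);   /* keep the work from being optimized away */
    return 0;
}
```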

  48. Cache Consistency • The dirty bit allows a write-back cache to defer updating memory until the block is replaced • However, since memory can be directly accessed by I/O devices, what happens if a copy of a memory value is in the cache and I/O accesses the memory item? • We need to ensure that the cache and memory are kept consistent, or that a mechanism is available to prevent I/O from using a stale copy of a cached item • This is known as the cache coherency problem

  49. More on Cache Coherency • An example is shown in figure 5.46 • Here, A and B are in both memory and cache • The CPU updates A in the cache, leaving the memory value stale • I/O outputs A -- if not prevented, the output will give the stale value • An input process inputs a new B making the cache value of B stale • Notice that the cache coherency problem is also present in multiprocessor systems where two processors share memory but not the same cache

  50. Cache Coherency Solutions • This is a serious problem and one that does not have any great solutions • Could have I/O directly access cache instead of memory -- this is impractical because I/O is slow and while I/O is accessing cache, CPU must stall • If a memory item is already in the cache, disallow I/O access (this is difficult to implement, how does I/O know what is in cache and what is not? -- we will need to store a duplicate set of the cache tags in memory) • Use write-through caches so that an update to the cache is always reflected in memory -- for this, we must ensure that if memory is being updated by the cache, no I/O is allowed to access memory until done
