
UNIT V Memory And IO



  1. UNIT V Memory And IO

  2. SYLLABUS – UNIT V MEMORY AND I/O Cache performance – Reducing cache miss penalty and miss rate – Reducing hit time – Main memory and performance – Memory technology. Types of storage devices – Buses – RAID – Reliability, availability and dependability – I/O performance measures 2

  3. Memory Hierarchy Design Introduction Review of the ABCs of Caches Cache Performance Reducing Cache Miss Penalty Reducing Cache Miss Rate Reducing Cache Miss Penalty or Miss Rate via Parallelism Reducing Hit Time Main Memory and Organizations for Improving Performance Memory Technology Virtual Memory Protection and Examples of Virtual Memory 3

  4. Introduction The five classic components of a computer: processor (control and datapath), memory, input, and output. • Where do we fetch instructions to execute? • Build a memory hierarchy that includes main memory and caches (internal memory) and the hard disk (external memory) • Instructions are first fetched from external storage such as the hard disk and kept in main memory. Before they reach the CPU, copies are typically placed in the caches 4

  5. Technology Trends
             Capacity           Speed (latency)
  CPU:       2x in 1.5 years    2x in 1.5 years
  DRAM:      4x in 3 years      2x in 10 years
  Disk:      4x in 3 years      2x in 10 years
  DRAM generations (capacity and cycle time):
  Year   Size      Cycle Time
  1980   64 Kb     250 ns
  1983   256 Kb    220 ns
  1986   1 Mb      190 ns
  1989   4 Mb      165 ns
  1992   16 Mb     145 ns
  1996   64 Mb     120 ns
  2000   256 Mb    100 ns
  From 1980 to 2000, DRAM capacity grew about 4000:1 while cycle time improved only 2.5:1. 5

  6. Performance Gap between CPUs and Memory [Figure: processor vs. memory performance over time; CPU performance improves 1.35x/year and later 1.55x/year, while memory (DRAM) improves about 7%/year.] The gap (latency) grows about 50% per year! 6

  7. Levels of the Memory Hierarchy
  Level           Capacity     Access Time
  CPU Registers   500 bytes    0.25 ns
  Cache           64 KB        1 ns       (exchanges blocks with memory)
  Main Memory     512 MB       100 ns     (exchanges pages with disk)
  Disk            100 GB       5 ms       (exchanges files with lower levels)
  Moving toward the upper level, speed increases; moving toward the lower level, capacity increases. 7

  8. ABCs of Caches Cache: In this textbook it mainly means the first level of the memory hierarchy encountered once the address leaves the CPU. The term is applied whenever buffering is employed to reuse commonly occurring items, e.g., file caches, name caches, and so on. Principle of Locality: programs access a relatively small portion of the address space at any instant of time. Two different types of locality: Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse). Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access). 8

  9. Memory Hierarchy: Terminology Hit: the data appears in some block in the cache (example: Block X). Hit Rate: the fraction of cache accesses found in the cache. Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss. Miss: the data needs to be retrieved from a block in the main memory (Block Y). Miss Rate = 1 - (Hit Rate). Miss Penalty: time to replace a block in the cache + time to deliver the block to the processor. Hit Time << Miss Penalty (e.g., 1 clock cycle vs. 40 clock cycles). [Figure: on a hit, Block X is delivered from the cache to the processor; on a miss, Block Y is brought from main memory into the cache.] 9

  10. Cache Measures CPU execution time incorporating cache performance:
  CPU execution time = (CPU clock cycles + Memory stall cycles) x Clock cycle time
  Memory stall cycles: the number of cycles during which the CPU is stalled waiting for a memory access.
  Memory stall clock cycles = Number of misses x Miss penalty
    = IC x (Misses/Instruction) x Miss penalty
    = IC x (Memory accesses/Instruction) x Miss rate x Miss penalty
    = IC x Reads per instruction x Read miss rate x Read miss penalty
      + IC x Writes per instruction x Write miss rate x Write miss penalty
  Memory accesses consist of fetching instructions and reading/writing data. 10
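To make the formula concrete, here is a minimal sketch in C; the variable names and sample values (2% miss rate, 1.5 accesses per instruction, 100-cycle penalty) are illustrative, not from this slide:

  /* Minimal sketch: CPU time with memory stalls. Sample values are illustrative. */
  #include <stdio.h>

  int main(void) {
      double IC = 1e9;              /* instruction count */
      double cpi_exec = 1.0;        /* ideal CPI, no memory stalls */
      double accesses_per_instr = 1.5;
      double miss_rate = 0.02;
      double miss_penalty = 100.0;  /* clock cycles */
      double cycle_time_ns = 1.0;

      /* Memory stall cycles = IC x accesses/instr x miss rate x miss penalty */
      double stall_cycles = IC * accesses_per_instr * miss_rate * miss_penalty;
      double cpu_cycles = IC * cpi_exec + stall_cycles;

      printf("stall cycles per instruction = %.2f\n", stall_cycles / IC);
      printf("effective CPI = %.2f\n", cpu_cycles / IC);
      printf("CPU time = %.3f s\n", cpu_cycles * cycle_time_ns * 1e-9);
      return 0;
  }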

  11. Example Assume we have a computer where the CPI is 1.0 when all memory accesses hit the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all memory accesses hit the cache? Answer: (A) If all accesses hit in the cache, CPI = 1.0 and there are no memory stalls, so CPU time(A) = (IC x CPI + 0) x Clock cycle time = IC x Clock cycle time. (B) With a 2% miss rate and CPI = 1.0, we need the memory stall cycles: Memory stalls = IC x (Memory accesses/Instruction) x Miss rate x Miss penalty = IC x (1 + 50%) x 2% x 25 = 0.75 x IC. Then CPU time(B) = (IC + 0.75 x IC) x Clock cycle time = 1.75 x IC x Clock cycle time. The performance ratio is the inverse of the CPU execution times: CPU(B)/CPU(A) = 1.75. The computer with no cache misses is 1.75 times faster. 11

  12. Four Memory Hierarchy Questions Q1 (block placement): Where can a block be placed in the upper level? Q2 (block identification): How is a block found if it is in the upper level? Q3 (block replacement): Which block should be replaced on a miss? Q4 (write strategy): What happens on a write? 12

  13. Q1 (block placement): Where can a block be placed? Direct mapped: (Block number) mod (Number of blocks in cache). Set associative: (Block number) mod (Number of sets in cache), where the number of sets = number of blocks / n for an n-way cache (n blocks per set); 1-way = direct mapped. Fully associative: number of sets = 1. Example: block 12 placed in an 8-block cache (see the sketch below). 13
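A small sketch of the block-12 example; the values mirror the slide, and the program itself is illustrative:

  /* Where can memory block 12 go in an 8-block cache? Illustrative sketch. */
  #include <stdio.h>

  int main(void) {
      int block = 12, cache_blocks = 8;

      /* Direct mapped: exactly one frame */
      printf("direct mapped: frame %d\n", block % cache_blocks);          /* 12 mod 8 = 4 */

      /* 2-way set associative: 4 sets, either frame in the chosen set */
      int sets = cache_blocks / 2;
      printf("2-way set associative: set %d\n", block % sets);            /* 12 mod 4 = 0 */

      /* Fully associative: a single set, so any of the 8 frames */
      printf("fully associative: any of the %d frames\n", cache_blocks);
      return 0;
  }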

  14. Simplest Cache: Direct Mapped (1-way) Each block has only one place it can appear in the cache. The mapping is usually (Block address) MOD (Number of blocks in cache). [Figure: a 16-block memory (blocks 0 through F) mapped onto a 4-block direct-mapped cache; memory block i maps to cache index i mod 4.] 14

  15. Example: 1 KB Direct Mapped Cache, 32 B Blocks For a 2^N-byte cache with 2^M-byte blocks, the uppermost (32 - N) bits of the address are the Cache Tag and the lowest M bits are the Byte Select. For 1 KB and 32 B blocks (N = 10, M = 5), a 32-bit address splits as: bits 31-10 Cache Tag (example: 0x50), bits 9-5 Cache Index (ex: 0x01), bits 4-0 Byte Select (ex: 0x00). Each cache entry stores a Valid Bit and the Cache Tag as part of the cache "state", together with 32 bytes of Cache Data (the whole cache holds bytes 0 through 1023 across 32 entries). 15
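A minimal sketch of how the three fields are extracted for this 1 KB / 32-byte-block organization; the macro names and the example address are illustrative:

  /* Split a 32-bit address for a 1 KB direct-mapped cache with 32-byte blocks. */
  #include <stdio.h>
  #include <stdint.h>

  #define OFFSET_BITS 5   /* 32-byte block            => 5 byte-select bits */
  #define INDEX_BITS  5   /* 1 KB / 32 B = 32 entries => 5 index bits       */

  int main(void) {
      uint32_t addr   = 0x00014020;   /* example address */
      uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
      uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
      uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

      /* For 0x00014020 this prints tag 0x50, index 0x1, byte select 0x0,
         matching the slide's example values. */
      printf("tag = 0x%x, index = 0x%x, byte select = 0x%x\n",
             (unsigned)tag, (unsigned)index, (unsigned)offset);
      return 0;
  }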

  16. Q2 (block identification): How is a block found? The Block Offset selects the desired data from the block, the Index field selects the set, and the Tag field is compared against the CPU address for a hit. Use the Cache Index to select the cache set; check the Tag on each block in that set; there is no need to compare the index or block offset. A valid bit is added to the tag to indicate whether or not the entry contains a valid address. Select the desired bytes using the Block Offset. Increasing associativity shrinks the index and expands the tag. The three portions of an address in a set-associative or direct-mapped cache are Tag | Cache/Set Index | Block Offset; the tag and index together form the Block Address, and the offset width is set by the block size. 16

  17. Example: Two-way Set Associative Cache The Cache Index selects a "set" from the cache; the two tags in the set are compared in parallel; data is selected based on the tag comparison result. [Figure: a two-way set-associative cache. The address (tag 0x50, index 0x01, byte select 0x00) indexes both ways; each way holds a valid bit, cache tag, and cache data block. Two comparators check the address tag against both stored tags, their results are ORed to form the Hit signal, and a multiplexer (Sel1/Sel0) selects the cache block from the matching way.] 17

  18. Disadvantage of Set Associative Cache N-way set associative cache vs. direct mapped cache: N comparators vs. 1; extra MUX delay for the data; the data arrives AFTER the Hit/Miss decision. In a direct mapped cache, the cache block is available BEFORE Hit/Miss, so it is possible to assume a hit and continue, recovering later if it was a miss. [Figure: the two-way organization from the previous slide, highlighting the tag comparators and the data-select multiplexer in the critical path.] 18

  19. Q3 (block replacement): Which block should be replaced on a cache miss? Easy for direct mapped - hardware decisions are simplified: only one block frame is checked and only that block can be replaced. Set associative or fully associative: there are many blocks to choose from on a miss. Three primary strategies for selecting the block to be replaced: Random (randomly selected), LRU (the Least Recently Used block is removed), FIFO (First In, First Out). Data cache misses per 1000 instructions for the replacement strategies (an LRU sketch follows the table):
  Associativity:     2-way                    4-way                    8-way
  Size      LRU     Random   FIFO    LRU     Random   FIFO    LRU     Random   FIFO
  16 KB     114.1   117.3    115.5   111.7   115.1    113.3   109.0   111.8    110.4
  64 KB     103.4   104.3    103.9   102.4   102.3    103.1   99.7    100.5    100.3
  256 KB    92.2    92.1     92.5    92.1    92.1     92.5    92.1    92.1     92.5
  There is little difference between LRU and random for the largest cache size, with LRU outperforming the others for smaller caches. FIFO generally outperforms random for the smaller cache sizes. 19
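A minimal sketch of LRU replacement within one set, using per-block age counters (a common textbook implementation; the structure and names are illustrative, not from the slides):

  /* LRU victim selection within one set of a 4-way cache, using age counters. */
  #include <stdint.h>

  #define WAYS 4

  struct way {
      int      valid;
      uint32_t tag;
      unsigned age;   /* larger age = less recently used */
  };

  /* Called on a hit: the referenced way becomes most recently used. */
  static void lru_touch(struct way set[WAYS], int hit) {
      for (int i = 0; i < WAYS; i++)
          if (set[i].valid && set[i].age < set[hit].age)
              set[i].age++;           /* every more-recent block ages by one */
      set[hit].age = 0;
  }

  /* Called on a miss: pick an invalid way if any, otherwise the oldest one. */
  static int lru_victim(const struct way set[WAYS]) {
      int victim = 0;
      for (int i = 0; i < WAYS; i++) {
          if (!set[i].valid) return i;
          if (set[i].age > set[victim].age) victim = i;
      }
      return victim;
  }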

  20. Q4 (write strategy): What happens on a write? Reads dominate processor cache accesses: e.g., writes are about 7% of overall memory traffic and about 21% of data cache accesses. Two options we can adopt when writing to the cache: Write through - the information is written both to the block in the cache and to the block in the lower-level memory. Write back - the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced. To reduce the frequency of writing back blocks on replacement, a dirty bit is used to indicate whether the block was modified in the cache (dirty) or not (clean). If clean, no write back is needed, since the lower-level memory already holds the same information as the cache. Pros and cons - WT: simple to implement; the cache is always clean, so read misses cannot result in writes. WB: writes occur at the speed of the cache, and multiple writes within a block require only one write to the lower-level memory. 20

  21. Write Stall and Write Buffer A write buffer is needed between the cache and memory [Figure: Processor -> Cache and Write Buffer -> DRAM]. The processor writes data into the cache and the write buffer; the memory controller writes the contents of the buffer to memory. The write buffer is just a FIFO; a typical number of entries is 4. • When the CPU must wait for writes to complete during write through, the CPU is said to be in a write stall • A common optimization to reduce write stalls is a write buffer, which allows the processor to continue as soon as the data are written to the buffer, thereby overlapping processor execution with memory updating 21

  22. Write-Miss Policy: Write Allocate vs. No-write Allocate Two options on a write miss: Write allocate - the block is allocated on a write miss, followed by the write-hit actions; write misses act like read misses. No-write allocate - write misses do not affect the cache; the block is modified only in the lower-level memory. Blocks stay out of the cache under no-write allocate until the program tries to read them, but with write allocate even blocks that are only written will be in the cache. 22

  23. Write-Miss Policy Example Example: Assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations: Write Mem[100]; Write Mem[100]; Read Mem[200]; Write Mem[200]; Write Mem[100]. What are the numbers of hits and misses (counting both reads and writes) when using no-write allocate versus write allocate? Answer (a simulation sketch follows this slide):
  Operation         No-write allocate    Write allocate
  Write Mem[100]    write miss           write miss
  Write Mem[100]    write miss           write hit
  Read  Mem[200]    read miss            read miss
  Write Mem[200]    write hit            write hit
  Write Mem[100]    write miss           write hit
  Total             4 misses; 1 hit      2 misses; 3 hits 23
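A small simulation sketch in C that reproduces the counts above; the cache is modeled simply as the set of resident block addresses, and all names are illustrative:

  /* Hits/misses for write-allocate vs. no-write-allocate on the slide's trace. */
  #include <stdio.h>

  #define MAX 16

  static int resident[MAX], nres;

  static int in_cache(int addr) {
      for (int i = 0; i < nres; i++)
          if (resident[i] == addr) return 1;
      return 0;
  }

  static void run(int write_allocate) {
      struct { char op; int addr; } trace[] = {
          {'W',100}, {'W',100}, {'R',200}, {'W',200}, {'W',100}
      };
      int hits = 0, misses = 0;
      nres = 0;
      for (int i = 0; i < 5; i++) {
          if (in_cache(trace[i].addr)) { hits++; continue; }
          misses++;
          /* reads always allocate; writes allocate only under write-allocate */
          if (trace[i].op == 'R' || write_allocate)
              resident[nres++] = trace[i].addr;
      }
      printf("%-19s %d misses, %d hits\n",
             write_allocate ? "write allocate:" : "no-write allocate:", misses, hits);
  }

  int main(void) {
      run(0);   /* no-write allocate: 4 misses, 1 hit  */
      run(1);   /* write allocate:    2 misses, 3 hits */
      return 0;
  }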

  24. Cache Performance • Example: Split Cache vs. Unified Cache • Which has the better avg. memory access time? • A 16-KB instruction cache with a 16-KB data cache (split cache), or • A 32-KB unified cache? • Miss rates (the values used below: about 0.4% instruction, 11.4% data, 3.18% unified) • Assume • A hit takes 1 clock cycle and the miss penalty is 100 cycles • A load or store takes 1 extra clock cycle on a unified cache since there is only one cache port • 36% of the instructions are data transfer instructions • About 74% of the memory accesses are instruction references • Answer: • Average memory access time (split) • = % instructions x (Hit time + Instruction miss rate x Miss penalty) • + % data x (Hit time + Data miss rate x Miss penalty) • = 74% x (1 + 0.4% x 100) + 26% x (1 + 11.4% x 100) = 4.24 • Average memory access time (unified) • = 74% x (1 + 3.18% x 100) + 26% x (1 + 1 + 3.18% x 100) = 4.44 24

  25. Impact of Memory Access on CPU Performance • Example: Suppose a processor with: • Ideal CPI = 1.0 (ignoring memory stalls) • Avg. miss rate of 2% • Avg. memory references per instruction of 1.5 • Miss penalty of 100 cycles • What is the impact on performance when the behavior of the cache is included? • Answer: • CPI = CPU execution cycles per instr. + Memory stall cycles per instr. • = CPI execution + Miss rate x Memory accesses per instr. x Miss penalty • CPI with cache = 1.0 + 2% x 1.5 x 100 = 4 • CPI without cache = 1.0 + 1.5 x 100 = 151 • CPU time with cache = IC x CPI x Clock cycle time = IC x 4.0 x Clock cycle time • CPU time without cache = IC x 151 x Clock cycle time • Without the cache, the CPI of the processor increases from 1 to 151! • With the cache, 75% of the time the processor is stalled waiting for memory (CPI: 1 -> 4) 25

  26. Impact of Cache Organizations on CPU Performance • Example: What is the impact of two different cache organizations (direct mapped vs. 2-way set associative) on the performance of a CPU? • Ideal CPI = 2.0 (ignoring memory stalls) • Clock cycle time is 1.0 ns • Avg. memory references per instruction is 1.5 • Cache size: 64 KB, block size: 64 bytes • For set-associative, assume the clock cycle time is stretched 1.25 times to accommodate the selection multiplexer • Cache miss penalty is 75 ns • Hit time is 1 clock cycle • Miss rate: direct mapped 1.4%; 2-way set-associative 1.0% • Answer (a worked sketch in C follows): • Avg. memory access time (1-way) = 1.0 + (0.014 x 75) = 2.05 ns • Avg. memory access time (2-way) = 1.0 x 1.25 + (0.01 x 75) = 2.00 ns • CPU time (1-way) = IC x (CPI execution x Clock cycle time + Miss rate x Memory accesses per instruction x Miss penalty) = IC x (2.0 x 1.0 + (1.5 x 0.014 x 75)) = 3.58 x IC • CPU time (2-way) = IC x (2.0 x 1.0 x 1.25 + (1.5 x 0.01 x 75)) = 3.63 x IC 26
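A small sketch comparing the two organizations; the inputs are the values from this slide and the helper name is illustrative:

  /* Direct mapped vs. 2-way set associative: AMAT and CPU time per instruction. */
  #include <stdio.h>

  static double amat_ns(double hit_ns, double miss_rate, double penalty_ns) {
      return hit_ns + miss_rate * penalty_ns;
  }

  int main(void) {
      double cpi_exec = 2.0, refs_per_instr = 1.5, penalty_ns = 75.0;
      double cycle_1way = 1.0, cycle_2way = 1.0 * 1.25;   /* stretched clock for the mux */
      double mr_1way = 0.014, mr_2way = 0.010;

      printf("AMAT 1-way = %.2f ns\n", amat_ns(cycle_1way, mr_1way, penalty_ns)); /* 2.05 */
      printf("AMAT 2-way = %.2f ns\n", amat_ns(cycle_2way, mr_2way, penalty_ns)); /* 2.00 */

      /* CPU time per instr = CPI_exec x cycle time + refs/instr x miss rate x penalty */
      double t1 = cpi_exec * cycle_1way + refs_per_instr * mr_1way * penalty_ns;
      double t2 = cpi_exec * cycle_2way + refs_per_instr * mr_2way * penalty_ns;
      printf("CPU time per instr: 1-way %.2f ns, 2-way %.2f ns\n", t1, t2);       /* ~3.58, ~3.63 */
      return 0;
  }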

  27. Summary of Performance Equations 27

  28. Improving Cache Performance • The next few sections in the textbook look at ways to improve cache and memory access times. Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty (hit time: Section 5.7; miss rate: Section 5.5; miss penalty: Section 5.4) 28

  29. Reducing Cache Miss Penalty Time to handle a miss is becoming more and more the controlling factor. This is because of the great improvement in speed of processors as compared to the speed of memory. Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty • Five optimizations • Multilevel caches • Critical word first and early restart • Giving priority to read misses over writes • Merging write buffer • Victim caches 29

  30. O1: Multilevel Caches Approaches: make the cache faster to keep pace with the speed of CPUs, and make the cache larger to overcome the widening gap. L1: fast hits; L2: fewer misses. L2 equations (a sketch of these equations follows):
  Average Memory Access Time = Hit Time(L1) + Miss Rate(L1) x Miss Penalty(L1)
  Miss Penalty(L1) = Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2)
  Average Memory Access Time = Hit Time(L1) + Miss Rate(L1) x (Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2))
  Hit Time(L1) << Hit Time(L2) << ... << Hit Time(Mem); Miss Rate(L1) < Miss Rate(L2) < ...
  Definitions: Local miss rate - misses in this cache divided by the total number of memory accesses to this cache (Miss Rate(L1), Miss Rate(L2)); the L1 cache skims the cream of the memory accesses. Global miss rate - misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate(L1), Miss Rate(L1) x Miss Rate(L2)); it indicates what fraction of the memory accesses that leave the CPU go all the way to memory. 30
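A minimal sketch of the two-level equations; the numeric values are illustrative, not from the slides:

  /* Two-level cache AMAT and global L2 miss rate. Illustrative values. */
  #include <stdio.h>

  int main(void) {
      double hit_l1 = 1.0, hit_l2 = 10.0, penalty_l2 = 100.0;   /* cycles */
      double miss_l1 = 0.05, miss_l2_local = 0.40;               /* local miss rates */

      double miss_penalty_l1 = hit_l2 + miss_l2_local * penalty_l2;
      double amat = hit_l1 + miss_l1 * miss_penalty_l1;

      /* Global L2 miss rate: fraction of CPU accesses that go all the way to memory */
      double miss_l2_global = miss_l1 * miss_l2_local;

      printf("L1 miss penalty     = %.1f cycles\n", miss_penalty_l1);   /* 10 + 0.4*100 = 50 */
      printf("AMAT                = %.1f cycles\n", amat);              /* 1 + 0.05*50 = 3.5 */
      printf("global L2 miss rate = %.3f\n", miss_l2_global);           /* 0.020 */
      return 0;
  }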

  31. Design of L2 Cache Size: since everything in the L1 cache is likely to be in the L2 cache, L2 should be much bigger than L1. Whether data in L1 is also in L2: Novice approach: design L1 and L2 independently. Multilevel inclusion: L1 data are always present in L2. Advantage: consistency between I/O and the cache is easy (check L2 only). Drawback: L2 must invalidate all L1 blocks that map onto a 2nd-level block being replaced, giving a slightly higher 1st-level miss rate (e.g., Intel Pentium 4: 64-byte blocks in L1 and 128-byte blocks in L2). Multilevel exclusion: L1 data is never found in L2; a cache miss in L1 results in a swap of blocks between L1 and L2. Advantage: prevents wasting space in L2 (e.g., AMD Athlon: 64 KB L1 and 256 KB L2). 31

  32. O2: Critical Word First and Early Restart Don't wait for the full block to be loaded before restarting the CPU. Critical word first - request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the words in the block are filled in. Also called wrapped fetch and requested word first. Early restart - as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution. Given spatial locality, the CPU tends to want the next sequential word, so it is not clear how much early restart benefits. Generally useful only with large blocks. 32

  33. O3: Giving Priority to Read Misses over Writes Serve reads before outstanding writes have completed. Write through with write buffers: SW R3, 512(R0) ; M[512] <- R3 (cache index 0) LW R1, 1024(R0) ; R1 <- M[1024] (cache index 0) LW R2, 512(R0) ; R2 <- M[512] (cache index 0) Problem: write through with write buffers creates RAW conflicts between main memory reads on cache misses and the buffered writes. If we simply wait for the write buffer to empty, we may increase the read miss penalty (by 50% on the old MIPS 1000). Instead, check the write buffer contents before the read; if there are no conflicts, let the memory access continue. Write back: suppose a read miss will replace a dirty block. Normal: write the dirty block to memory, and then do the read. Instead: copy the dirty block to a write buffer, do the read, and then do the write. The CPU stalls less since it restarts as soon as the read is done. 33

  34. O4: Merging Write Buffer If the write buffer is empty, the data and the full address are written into the buffer, and the write is finished from the CPU's perspective. A write buffer entry usually holds multiple words. Write merging: the addresses in the write buffer are checked to see whether the address of the new data matches the address of a valid write buffer entry; if so, the new data are combined with that entry. • With a write buffer of 4 entries, each holding four 64-bit words, sequential writes occupy four entries without merging but only a single entry with merging • Writing multiple words at the same time is faster than writing them one at a time 34
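A minimal sketch of the merging check; the entry layout and names are illustrative, and real buffers operate on aligned blocks of several words:

  /* Merging write buffer: combine a new word with an existing entry when the
     addresses fall in the same buffer block. Illustrative sketch. */
  #include <stdint.h>
  #include <string.h>

  #define ENTRIES     4
  #define WORDS       4                       /* four 64-bit words per entry */
  #define BLOCK_BYTES (WORDS * 8)

  struct wb_entry {
      int      valid;
      uint64_t block_addr;                    /* address of the aligned block */
      uint64_t data[WORDS];
      uint8_t  word_valid[WORDS];
  };

  static struct wb_entry wb[ENTRIES];

  /* Returns 1 if the write was merged or buffered, 0 if the buffer is full. */
  int write_buffer_put(uint64_t addr, uint64_t value) {
      uint64_t block = addr & ~(uint64_t)(BLOCK_BYTES - 1);
      int word = (int)((addr % BLOCK_BYTES) / 8);

      for (int i = 0; i < ENTRIES; i++)       /* try to merge with a valid entry */
          if (wb[i].valid && wb[i].block_addr == block) {
              wb[i].data[word] = value;
              wb[i].word_valid[word] = 1;
              return 1;
          }
      for (int i = 0; i < ENTRIES; i++)       /* otherwise take a free entry */
          if (!wb[i].valid) {
              memset(&wb[i], 0, sizeof wb[i]);
              wb[i].valid = 1;
              wb[i].block_addr = block;
              wb[i].data[word] = value;
              wb[i].word_valid[word] = 1;
              return 1;
          }
      return 0;                               /* full: the CPU must stall */
  }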

  35. O5: Victim Caches The idea is recycling: remember what was most recently discarded on a cache miss, in case it is needed again, rather than simply discarding it or swapping it into L2. A victim cache is a small, fully associative cache between a cache and its refill path. It contains only blocks that were discarded from the cache because of a miss ("victims"), and it is checked on a miss before going to the next lower-level memory. Victim caches of 1 to 5 entries are effective at reducing misses, especially for small, direct-mapped data caches. The AMD Athlon uses 8 entries. 35

  36. Reducing Miss Rate • 3 C's of Cache Misses • Compulsory - the first access to a block cannot be in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (Misses in even an infinite cache) • Capacity - if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative cache of size X) • Conflict - if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache that would be hits in a fully associative cache of size X) 36

  37. 3Cs Absolute Miss Rate (SPEC92) [Figure: miss rate per type (compulsory, capacity, conflict) vs. cache size from 4 KB to 128 KB for 1-, 2-, 4-, and 8-way associativity; miss rates range from 0 to 0.14.] 2:1 Cache Rule: the miss rate of a 1-way associative cache of size X equals the miss rate of a 2-way associative cache of size X/2. Compulsory misses are vanishingly small. 37

  38. 3Cs Relative Miss Rate [Figure: miss-rate breakdown by type (conflict, capacity, compulsory) as a percentage of all misses (0% to 100%), for cache sizes 4 KB to 128 KB and 1-, 2-, 4-, and 8-way associativity.] Flaw: assumes a fixed block size. Good: it provides insight that leads to invention. 38

  39. Five Techniques to Reduce Miss Rate Larger block size Larger caches Higher associativity Way prediction and pseudoassociative caches Compiler optimizations 39

  40. O1: Larger Block Size Using the principle of locality: the larger the block, the greater the chance that parts of it will be used again. • Takes advantage of spatial locality • For a cache of the same size, a larger block means fewer blocks, and the longer fill time increases the miss penalty • It may increase conflict misses, and even capacity misses if the cache is small • Usually high latency and high bandwidth encourage large block sizes [Figure: miss rate vs. block size for several cache sizes.] 40

  41. O2: Larger Caches Increasing the capacity of the cache reduces capacity misses (Figures 5.14 and 5.15). Drawbacks: possibly longer hit time and higher cost. Trend: larger L2 or L3 off-chip caches. [Figure: the 3Cs absolute miss rate chart from slide 37 - miss rate per type vs. cache size (4 KB to 128 KB) for 1- to 8-way associativity.] 41

  42. O3: Higher Associativity Figures 5.14 and 5.15 show how miss rates improve with higher associativity. An 8-way set associative cache is as effective as fully associative for practical purposes. 2:1 Cache Rule: the miss rate of a direct-mapped cache of size N equals the miss rate of a 2-way set-associative cache of size N/2. Tradeoff: a more highly associative cache complicates the circuitry and may lengthen the clock cycle. Execution time is the only final measure! Will the clock cycle time increase as a result of having a more complicated cache? Hill [1988] suggested the hit time for 2-way vs. 1-way is about +10% for an external cache and +2% for an internal cache. 42

  43. O4: Way Prediction & Pseudoassociative Caches Way prediction: extra bits are kept in the cache to predict the way, or block within the set, of the next cache access. Example: the 2-way I-cache of the Alpha 21264. If the predictor is correct, the I-cache latency is 1 clock cycle; if incorrect, it tries the other block, changes the way predictor, and has a latency of 3 clock cycles. Prediction accuracy is in excess of 85%. This reduces conflict misses while maintaining the hit speed of a direct-mapped cache. Pseudoassociative (or column associative) caches: on a miss, a second cache entry is checked before going to the next lower level, giving one fast hit time and one slow hit time. Invert the most significant bit of the index to find the other block in the "pseudoset". The miss penalty may become slightly longer. 43

  44. O5: Compiler Optimizations Improve the hit rate by compile-time optimization. Reordering instructions with profiling information (McFarling [1989]): reduced misses by 50% for a 2 KB direct-mapped 4-byte-block I-cache, and by 75% for an 8 KB cache; the best performance came when it was possible to prevent some instructions from entering the cache. Aligning basic blocks: the entry point is placed at the beginning of a cache block, which decreases the chance of a cache miss for sequential code. Loop interchange: exchanging the nesting of loops improves spatial locality and reduces misses by making data be accessed in the order it is stored, maximizing use of the data in a cache block before it is discarded. /* Before: skips through memory in strides of 100 words */ for(j=0;j<100;j=j+1) for(i=0;i<5000;i=i+1) x[i][j]=2*x[i][j]; /* After: accesses all the words in a cache block before moving on */ for(i=0;i<5000;i=i+1) for(j=0;j<100;j=j+1) x[i][j]=2*x[i][j]; 44

  45. Blocking: operating on submatrices or blocks to maximize accesses to the data loaded into the cache before it is replaced; improves temporal locality. Example: matrix multiply X = Y*Z. /* Before */ for(i=0;i<N;i=i+1) for(j=0;j<N;j=j+1){ r=0; for(k=0;k<N;k=k+1) r=r+y[i][k]*z[k][j]; x[i][j]=r; } /* After: B = blocking factor */ for(jj=0;jj<N;jj=jj+B) for(kk=0;kk<N;kk=kk+B) for(i=0;i<N;i=i+1) for(j=jj;j<min(jj+B,N);j=j+1){ r=0; for(k=kk;k<min(kk+B,N);k=k+1) r=r+y[i][k]*z[k][j]; x[i][j]=x[i][j]+r; } The number of capacity misses depends on N and the cache size. • The total number of memory words accessed falls to roughly 2N^3/B + N^2 • y benefits from spatial locality • z benefits from temporal locality 45

  46. Reducing Cache Miss Penalty or Miss Rate via Parallelism Three techniques that overlap the execution of instructions with activity in the memory hierarchy: Nonblocking caches to reduce stalls on cache misses, matching out-of-order processors; Hardware prefetching of instructions and data; Compiler-controlled prefetching 46

  47. O1: Nonblocking cache to reduce stalls on cache misses For pipelined computers that allow out-of-order completion, the CPU need not stall on a cache miss. With separate I-cache and D-cache, the CPU can continue fetching instructions from the I-cache while waiting for the D-cache to return the missing data. A nonblocking (lockup-free) cache supports "hit under miss": the D-cache continues to supply cache hits during a miss; "hit under multiple miss" or "miss under miss" overlaps multiple misses. • Ratio of average memory stall time for a blocking cache to the hit-under-miss schemes: • for the first 14 (FP) programs the averages are 76% for hit-under-1-miss, 51% for hit-under-2-misses, and 39% for hit-under-64-misses • for the final 4 (integer) programs the averages are 81%, 78%, and 78% 47

  48. O2: Hardware Prefetching of Instructions and Data Prefetch instructions or data before they are requested by the CPU, either directly into the caches or into an external buffer (faster to access than main memory). Instruction prefetch is frequently done in hardware outside the cache: fetch two blocks on a miss; the requested block is placed in the I-cache when it returns, and the prefetched block is placed in an instruction stream buffer (ISB). A single ISB would catch 15% to 25% of the misses from a 4 KB 16-byte-block direct-mapped I-cache; 4 ISBs increased the data hit rate to 43% (Jouppi [1990]). UltraSPARC III data prefetch: if a load hits in the prefetch cache, the block is read from the prefetch cache and the next prefetch request is issued, calculating the "stride" of the next prefetched block from the difference between the current address and the previous address; up to 8 simultaneous prefetches are supported. Prefetching may interfere with demand misses, lowering performance. A stride-prefetch sketch follows. 48
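A minimal sketch of the stride calculation described above, as a one-entry predictor; the structure, names, and the issue_prefetch hook are illustrative, not the UltraSPARC III design:

  /* One-entry stride prefetcher: on a hit in the prefetch cache, compute the
     stride from the last two addresses and prefetch the next block. */
  #include <stdint.h>

  static uint64_t prev_addr;
  static int      have_prev;

  /* Hypothetical hook into the memory system. */
  extern void issue_prefetch(uint64_t addr);

  void on_prefetch_cache_hit(uint64_t addr) {
      if (have_prev) {
          int64_t stride = (int64_t)addr - (int64_t)prev_addr;
          if (stride != 0)
              issue_prefetch(addr + stride);   /* next block the stream will want */
      }
      prev_addr = addr;
      have_prev = 1;
  }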

  49. O3: Compiler-Controlled Prefetching Compiler-controlled prefetching comes in two flavors: register prefetch loads the value into a register; cache prefetch loads the data only into the cache (not a register). Faulting vs. nonfaulting: the address does or does not cause an exception for virtual address faults and protection violations; a normal load instruction is a faulting register prefetch instruction. The most effective prefetch is "semantically invisible" to a program: it does not change the contents of registers or memory and cannot cause virtual memory faults. A nonbinding prefetch is a nonfaulting cache prefetch. Overlapping execution: the CPU proceeds while the prefetched data are being fetched. Advantage over hardware prefetching: the compiler can limit prefetches to the references likely to miss, avoiding unnecessary prefetches. Drawback: prefetch instructions incur instruction overhead. 49
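A sketch of what a compiler (or programmer) might emit for a simple loop, using GCC/Clang's __builtin_prefetch as a stand-in for a nonbinding cache-prefetch instruction; the prefetch distance of 8 iterations is an illustrative choice, not from the slides:

  /* Software prefetching a few iterations ahead of use. */
  void scale(double *a, const double *b, int n) {
      for (int i = 0; i < n; i++) {
          if (i + 8 < n) {
              __builtin_prefetch(&b[i + 8], 0 /* read  */);
              __builtin_prefetch(&a[i + 8], 1 /* write */);
          }
          a[i] = 2.0 * b[i];   /* the actual work overlaps the prefetches */
      }
  }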

  50. Reducing Hit Time Importance of cache hit time Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty More importantly, cache access time limits the clock cycle rate in many processors today! Fast hit time: Quickly and efficiently find out if data is in the cache, and if it is, get that data out of the cache Four techniques: Small and simple caches Avoiding address translation during indexing of the cache Pipelined cache access Trace caches 50
