

Presentation Transcript


  1. What impacts memory system design? • Principle of Locality • Temporal locality (90% of the time is spent in 10% of the code) • Spatial locality (neighboring code has a high probability of being executed soon) • Smaller hardware is faster • Speed vs. capacity • Price/performance considerations (Amdahl's Law) CS520S99 Memory

  2. Memory Hierarchy
• cache (SRAM): size 16K-2M, access time 8-35 ns, price per MB $200-$500 ('93) / $50-$200 ('99); transfers to the next level in cache blocks of 8-256 bytes
• primary memory (DRAM): size 32M-512M, access time 90-120 ns, price per MB $25-$50 ('93) / $2-$5 ('99); transfers to disk in pages of 4K-64K
• disk: size 2G-17G, access time 7.5-15 ms, price per MB $1-$2 ('93) / $0.02-$0.06 ('99)
• Moving up the hierarchy: smaller, faster, more expensive; moving down: larger, slower, cheaper.
CS520S99 Memory

  3. DRAM, SIMM, DIMM • DRAM: Dynamic RAM • SIMM: Single In-Line Memory Module, saves space (30 pins / 8 bits, 72 pins / 32 bits) • DIMM: Dual In-Line Memory Module, with DRAM on both sides (168 pins / 64 bits) • SO-DIMM: Small Outline DIMM, used in most notebooks (72 pins / 32 bits) • See http://www.kingston.com/king/mg0.htm for an informative guide. (Figure: SIMM sockets on a system board.) CS520S99 Memory

  4. 30-pin and 72-pin SIMMs • Most desktop computers use either 72- or 30-pin SIMMs. • A 30-pin SIMM supports 8 data bits (data bus width). • A typical system has 2 banks (Bank 0 and Bank 1), each with four 30-pin SIMM sockets, delivering 32 bits. • A 72-pin SIMM supports 32 data bits (in one bus cycle). • Mixing SIMMs of different capacities in the same memory bank results in system booting problems. • Newer DIMMs have 168 pins. CS520S99 Memory

  5. Data Integrity • To detect / fix bit errors in memory chips, parity checking (a 9th bit) or ECC (Error Correcting Code) is used. These additional bits are called check bits. • In ECC, the number of check bits required for single-error correction is derived from the bound m + r + 1 <= 2^r, where m = number of data bits and r = number of check bits. For m = 8, r = 4; for r = 6, m <= 57 bits; for r = 7, m <= 120 bits. • The SIMM memory modules (x36, x39, x40 bits) provide the data bits and the redundant check bits to the memory controller, which actually performs the data integrity check. • Most home PCs do not use memory with check bits. • Most servers and high-end workstations support ECC memory. CS520S99 Memory
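As a small illustration of the check-bit bound above, the following C sketch (a hypothetical helper, not from the slides) searches for the smallest r satisfying m + r + 1 <= 2^r:

#include <stdio.h>

/* Smallest number of check bits r for single-error correction of m data
   bits, from the bound m + r + 1 <= 2^r.                                */
static int min_check_bits(int m)
{
    int r = 1;
    while (m + r + 1 > (1 << r))
        r++;
    return r;
}

int main(void)
{
    printf("m=8   -> r=%d\n", min_check_bits(8));    /* 4, as on the slide */
    printf("m=57  -> r=%d\n", min_check_bits(57));   /* 6                  */
    printf("m=120 -> r=%d\n", min_check_bits(120));  /* 7                  */
    return 0;
}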

  6. SIMM Module Identification Note that the parity SIMMs are distinguished by the "x 9" or "x 36" format specifications. This is because parity memory adds a parity bit to every 8 bits of data. So, a 30-pin SIMM provides 8 data bits per cycle, plus a parity bit, which equals 9 bits; 72-pin SIMMs provide 32 bits per cycle, plus 4 parity bits, which equals 36 bits. CS520S99 Memory

  7. Page Mode, EDO, SDRAM • Fast-page mode chip: the same row of bits is retrieved from the memory cell array and saved in a special latch buffer. If the next access is on the same row, the data is already in the latch, so the access takes only about half the time. • Extended Data Output (EDO) memory allows memory accesses 10 to 15 percent faster than comparable fast-page mode chips. • Synchronous DRAM (SDRAM) uses a clock to coordinate with the CPU clock, so the timing of the memory chips and the timing of the CPU are in `synch.' • Synchronous DRAM saves time in executing commands and transmitting data, thereby increasing the overall performance of the computer. The new PC 100 MHz bus requires SDRAM. • SDRAM memory allows the CPU to access memory approximately 25 percent faster than EDO memory. CS520S99 Memory

  8. Memory Access • A memory access is said to hit (miss) in a memory level if the data is found (cannot be found) in that level. • Hit rate (miss rate): the fraction of memory accesses (not) found in the level. • Hit time: the time to access data in a memory level, including the time to decide whether the access is a hit or a miss. • Miss penalty: the time to replace a block in a level with the corresponding block from the level below, plus the time to deliver the block to the CPU = time to access the first word on a miss + transfer time for the remaining words. (Figure: access time vs. transfer time.) CS520S99 Memory

  9. Memory Access in a System with Cache/Virtual Memory (Paging) CS520S99 Memory

  10. Evaluating Performance of a Memory Hierarchy • Average memory access time is a better measure than the miss rate. • Average Memory Access Time = Hit time + Miss rate * Miss penalty (Figure: block size vs. average access time, miss penalty, and miss rate.) CS520S99 Memory

  11. Goal of Memory Hierarchy Reduce execution time, not the no. of misses Computer designers favor a block size with the lowest average access time rather than the lowest miss rate. CS520S99 Memory

  12. Classify Memory Hierarchy Design Four Questions for Classifying Memory Hierarchies Q1: Where to place a block in the upper memory level? (Block placement) Q2: How to find a block in a memory level? (Block identification) Q3: Which block should be replaced on a miss? (Block replacement) Q4: What happens on a write? (Write strategy) CS520S99 Memory

  13. Cache The memory level between the CPU and main memory. Cache: a safe place for hiding or storing things. (Webster's New World Dictionary of the American Language, Second College Edition, 1976) CS520S99 Memory

  14. Q1: Where to place a block in a cache? (Block placement) Direct mapped cache—a fixed place for a block to appear in a cache. e.g., the location = (block address) modulo (no. of blocks in cache). Fully Associative cache—a block can be placed anywhere in the cache. Set Associative cache—a block can be placed in a restricted set of places. If there are n blocks in a set, the cache placement is called n-way set associative. CS520S99 Memory

  15. Q2: How to find a block in a cache? (Block identification) • Caches include an address tag (which gives part of the block address) on each block. • A valid bit is attached to each tag to indicate whether the information in the block is valid. • The address from the CPU to the cache is divided into two fields: block address and block offset. • The block address identifies the block. • The block offset identifies the byte within the block. The number of bits in the block offset depends on the block size (BS); e.g., BS = 32 B means a 5-bit block offset, since 2^5 = 32. • The block address is further divided into Tag and Index fields. The index selects the set, while the tag is compared against the tags of the blocks in that set to identify the matching block. CS520S99 Memory
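A minimal C sketch of this address split, assuming for illustration a 32-bit address, 32 B blocks (5 offset bits), and 256 sets (8 index bits); the field widths are examples, not the parameters of any particular cache:

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 5                 /* 32 B blocks -> 2^5 */
#define INDEX_BITS  8                 /* 256 sets    -> 2^8 */

int main(void)
{
    uint32_t addr   = 0x12345678;
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);                /* byte within block */
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); /* selects the set   */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);              /* compared for a hit */

    /* For a direct-mapped cache, the index computed this way is exactly
       (block address) mod (number of blocks), the placement rule of slide 14. */
    printf("tag=0x%x index=0x%x offset=0x%x\n", tag, index, offset);
    return 0;
}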

  16. Searching the Tag Values • When a block is brought into the cache, the corresponding tag value is saved in the tag field of the block. • When an address arrives from the CPU, its tag field is extracted and compared with the tag values in the cache. • The comparison is often implemented with content-addressable memory (parallel compare, fast); a sequential search is slower but simpler to implement. CS520S99 Memory

  17. Q3: Which block should be replaced on a miss? (Block replacement) • For the direct-mapped cache, this is easy since only one block can be replaced. • For fully associative and set-associative caches, there are two strategies: • Random • Least-recently used (LRU): replace the block that has not been accessed for the longest time (principle of temporal locality). • The following sequence shows the block access history (upper row) and the LRU block at each time slot (lower row), with the time axis heading to the right. Initially, block 0 is the LRU block by default. Assume there are only 4 blocks. CS520S99 Memory
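A minimal sketch of LRU bookkeeping for one 4-block set, using simple timestamps for clarity (real hardware typically keeps a few LRU bits rather than counters); the access trace in main() is illustrative:

#include <stdio.h>

#define WAYS 4

/* Last-access time per block of one set; the LRU victim is the block
   with the smallest timestamp.  Block 0 starts out as the LRU block.  */
static unsigned last_used[WAYS] = {0, 1, 2, 3};
static unsigned now = 4;

static void touch(int block)            /* record an access to a block */
{
    last_used[block] = now++;
}

static int lru_victim(void)             /* block to replace on a miss  */
{
    int victim = 0;
    for (int b = 1; b < WAYS; b++)
        if (last_used[b] < last_used[victim])
            victim = b;
    return victim;
}

int main(void)
{
    int trace[] = {2, 0, 3, 1, 2};      /* illustrative access history */
    for (int i = 0; i < 5; i++) {
        touch(trace[i]);
        printf("after access to %d, LRU block = %d\n", trace[i], lru_victim());
    }
    return 0;
}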

  18. Q4: What happens on a write? (Write strategy) • Reads dominate cache accesses; all instruction accesses are reads. • Write policies (options when writing to the cache): • Write-through: the information is written to both the cache and main memory. • Write-back: the information is written only to the cache; the modified cache block is written to main memory only when it is replaced. • A block in a write-back cache can be either clean or dirty, depending on whether the block's content is the same as that in main memory. CS520S99 Memory

  19. Write Back vs. Write Through • The write-back cache • uses less memory bandwidth, since multiple writes within a block require only one write to main memory. • may write to main memory on a read miss (which causes a block to be replaced). The delayed write is fast for the CPU, but data blocks can be inconsistent with main memory. • The write-through cache • never writes to main memory on a read miss. • is easier to implement. • keeps the most current copy of the data in main memory (data consistency, coherence). CS520S99 Memory

  20. Dealing with Write Misses • Write miss: the data block is not in the cache. • There are two options (whether to bring the block into the cache): • Write-allocate: the block is loaded into the cache, followed by the write-hit actions above. • No-write-allocate: the block is modified in main memory and not loaded into the cache. • In general, write-back caches use write-allocate, hoping that there are subsequent writes to the same block. • Write-through caches often use no-write-allocate, since subsequent writes also go to main memory. CS520S99 Memory

  21. Dealing with CPU Write Stalls • The CPU has to wait for writes to complete during write-through. • This can be solved by adding a write buffer and letting the CPU continue while memory is updated from the write buffer. • If the write buffer is full, the CPU and cache need to wait. • Write merging: allow multiple writes to the write buffer to be merged into a single entry to be transferred to the lower-level memory. CS520S99 Memory

  22. Write Merging • Subsequent writes whose addresses fall within the same write buffer entry can be combined/merged. • This leaves more vacant entries for the CPU to write into. • It also reduces the memory bandwidth used for writes. (Figure: without write merging, the write buffer fills and the CPU stalls; with write merging, 4 writes are merged into one write buffer entry and the CPU is free to proceed.) CS520S99 Memory
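A minimal sketch of the merging check, assuming a write buffer entry that covers one aligned 32 B block and tracks valid 8 B words with a bit mask; the entry layout and sizes are illustrative, not any specific machine's buffer format:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_BYTES 32
#define WORD_BYTES  8

/* One write buffer entry: an aligned block address, the data, and a
   per-word valid mask.  A new write merges if it hits the same block. */
struct wb_entry {
    uint64_t block_addr;                   /* address of the 32 B block */
    uint8_t  data[BLOCK_BYTES];
    uint8_t  valid_mask;                   /* one bit per 8 B word      */
    int      in_use;
};

static int try_merge(struct wb_entry *e, uint64_t addr, const void *word)
{
    uint64_t block = addr & ~(uint64_t)(BLOCK_BYTES - 1);
    unsigned slot  = (unsigned)((addr % BLOCK_BYTES) / WORD_BYTES);

    if (!e->in_use) {                      /* allocate a fresh entry    */
        e->in_use = 1;
        e->block_addr = block;
        e->valid_mask = 0;
    } else if (e->block_addr != block) {
        return 0;                          /* different block: no merge */
    }
    memcpy(&e->data[slot * WORD_BYTES], word, WORD_BYTES);
    e->valid_mask |= (uint8_t)(1u << slot); /* mark this word as valid  */
    return 1;
}

int main(void)
{
    struct wb_entry e = {0};
    uint64_t v = 42;
    /* Four sequential 8 B writes to addresses 96, 104, 112, 120 all
       land in the same entry, as in the slide's merged example.       */
    for (uint64_t a = 96; a < 128; a += 8)
        printf("write to %llu merged: %d\n",
               (unsigned long long)a, try_merge(&e, a, &v));
    return 0;
}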

  23. Alpha AXP 21064 Data/Instruction Cache (Figure: 34-bit address; 8 KB = 2^13 B direct-mapped cache; 32 B (256-bit) blocks; 8 B data path to the CPU.) CS520S99 Memory

  24. Intel Pentium II • 32 KB Level 1 cache (16 KB instruction / 16 KB data) • Cacheable address space up to 64 GB (36-bit physical address) • Dual Independent Bus (D.I.B.) architecture increases performance and provides more data to the processor core • 100 MHz system bus speeds data transfer between the processor and the system • The 400 MHz part offers 1 MB and 512 KB L2 cache options; the 450 MHz part offers 2 MB, 1 MB, and 512 KB L2 cache options • Error Checking and Correction (ECC) to maintain the integrity of mission-critical data CS520S99 Memory

  25. Cache Performance Formulas
• AverageMemoryAccessTime = HitTime + MissRate * MissPenalty
• MemoryStallClockCycles = #Reads * ReadMissRate * ReadMissPenalty + #Writes * WriteMissRate * WriteMissPenalty
• CPUtime = (CPU-execution clock cycles + MemoryStallClockCycles) * CycleTime
• CPUtime = IC * (CPIexecution + MemoryStallClockCycles / IC) * CycleTime
• CPUtime = IC * (CPIexecution + MAPI * MissRate * MissPenalty) * CycleTime
• MAPI: memory accesses per instruction. MissRate: fraction of memory accesses not found in the cache. MissPenalty: the additional time to service the miss (related to block size and memory transfer time). IC: instruction count.
CS520S99 Memory
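A minimal C sketch of the two main formulas, using the numbers from slide 29 below (CPIexecution = 2.0, 1.33 memory accesses per instruction, 2% miss rate, 50-cycle miss penalty) as a worked check:

#include <stdio.h>

/* AverageMemoryAccessTime = HitTime + MissRate * MissPenalty */
static double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

/* CPUtime = IC * (CPIexecution + MAPI * MissRate * MissPenalty) * CycleTime */
static double cpu_time(double ic, double cpi_exec, double mapi,
                       double miss_rate, double miss_penalty, double cycle_time)
{
    return ic * (cpi_exec + mapi * miss_rate * miss_penalty) * cycle_time;
}

int main(void)
{
    /* With IC = 1 and CycleTime = 1, cpu_time() gives the effective CPI. */
    double cpi_with_cache = cpu_time(1.0, 2.0, 1.33, 0.02, 50.0, 1.0);
    printf("effective CPI with cache = %.2f\n", cpi_with_cache);      /* 3.33 */
    printf("AMAT with 1-cycle hit    = %.2f cycles\n", amat(1.0, 0.02, 50.0));
    return 0;
}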

  26. Split Caches vs. Unified Cache • The percentage of instruction references is about 75%. • Split (data + instruction) caches offer two memory ports per clock cycle and allow parallel access of instructions and data. • A unified cache has only one port to satisfy both accesses, resulting in 1 more clock cycle of hit time. • But with the same total cache size, the unified cache will have a lower miss rate. CS520S99 Memory

  27. Miss Rates of Split/Unified Caches CS520S99 Memory

  28. Design Trade-off: Split Caches vs. Unified Cache
• 16 KB instruction cache + 16 KB data cache vs. a 32 KB unified cache
• Normal HitTime = 1 cycle; load and store HitTime = 2 cycles in the unified cache (its single port must serve both the instruction fetch and the data access).
• Which one has the lower miss rate?
• Ans: Based on 75% instruction references, the overall MissRate(split caches) = 0.75 * 0.64% + 0.25 * 6.47% = 2.10%. MissRate(unified cache) = 1.99%. By miss rate, the unified cache performs better.
• AverageMemoryAccessTime(split) = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = (75% x 1.32) + (25% x 4.235) = 0.990 + 1.059 = 2.05 cycles. Better!
• AverageMemoryAccessTime(unified) = 75% x (1 + 1.99% x 50) + 25% x (2 + 1.99% x 50) = (75% x 1.995) + (25% x 2.995) = 1.496 + 0.749 = 2.24 cycles.
CS520S99 Memory

  29. Impact of Cache on Performance
• Assume cache MissPenalty = 50 clock cycles.
• All instructions normally take 2.0 clock cycles (ignoring memory stalls).
• MissRate = 2%.
• MemoryAccessesPerInstruction = 1.33.
• Ans: Apply CPUtime = IC * (CPIexecution + MemoryStallClockCycles / IC) * CycleTime.
• CPUtime(with cache) = IC * (2.0 + 1.33 x 2% x 50) * CycleTime = IC * 3.33 * CycleTime.
• CPI(perfect cache) = 2.0; CPI(with cache) = 3.33; CPI(without cache) = 2.0 + 1.33 x 50 = 68.5!
CS520S99 Memory

  30. 2-way Set-Associative vs. Direct-Mapped Cache
• A 2-way set-associative cache requires extra logic to select the block within the set, giving a longer hit time and a longer CPU clock cycle time.
• Will the advantage of a lower miss rate offset the slower hit time?
• Example (page 387): CPIexecution = 2, DataCacheSize = 64 KB, MissPenalty = 70 ns (35 CPU clock cycles), MemoryAccessesPerInstruction = 1.3.
• Direct-mapped cache: ClockCycleTime = 2 ns; MissRate = 0.014; AMAT = 2 + 0.014 x 70 = 2.98 ns (worse!)
• 2-way set-associative cache: ClockCycleTime = 2 x 1.1 = 2.2 ns; MissRate = 0.010; AMAT = 2.2 + 0.010 x 70 = 2.90 ns
• CPUtime = IC * (CPIexecution * ClockCycleTime + MemoryAccessesPerInstruction * MissRate * MissPenalty), with the miss penalty expressed in ns (70 ns).
• CPUtime(direct-mapped) = IC * (2.0 x 2 + 1.3 x 0.014 x 70) = IC * 5.27; CPUtime(2-way) = IC * (2.0 x 2.2 + 1.3 x 0.010 x 70) = IC * 5.31
• Since CPU time is the bottom-line evaluation metric and a direct-mapped cache is simpler to build, in this case the direct-mapped cache is preferred.
CS520S99 Memory

  31. Improving Cache Performance • Caches can be improved by: • reducing the miss rate • reducing the miss penalty • reducing the hit time • These are often related; improving one area may impact performance in the others. CS520S99 Memory

  32. Reducing Cache Misses • Three basic types of cache misses: • Compulsory: the first access to a block not yet in the cache (first-reference misses, cold-start misses). • Capacity: since the cache cannot contain all the blocks of a program, some blocks will be replaced and later retrieved. • Conflict: when too many blocks map to the same set, some blocks will be replaced and later retrieved. CS520S99 Memory

  33. Reducing Miss Rate with a Larger Block Size • Good: larger blocks take advantage of spatial locality. • Bad: larger blocks increase the miss penalty and reduce the number of blocks in the cache. CS520S99 Memory

  34. Miss Rate vs. Block Size CS520S99 Memory

  35. Select the Block Size that Minimizes AMAT • Assume the memory system takes 40 cycles of overhead and then delivers 16 bytes every 2 clock cycles. Miss rates are from Figure 5.12; Figure 5.13 shows the resulting AMAT. • AMAT(BS=16B, CS=1KB) = 1 + (15.05% x (40 + (16/16) x 2)) = 7.321 clock cycles. • AMAT(BS=256B, CS=256KB) = 1 + (0.49% x (40 + (256/16) x 2)) = 1.353 clock cycles. CS520S99 Memory
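A minimal C sketch of this calculation, assuming the same memory system (40-cycle overhead plus 2 cycles per 16 bytes transferred) and a 1-cycle hit time; the two miss rates are simply the slide's values taken from Figure 5.12:

#include <stdio.h>

/* Miss penalty for a given block size: 40 cycles of overhead plus
   2 cycles for every 16 bytes transferred.                          */
static double miss_penalty(int block_size)
{
    return 40.0 + (block_size / 16.0) * 2.0;
}

static double amat(double hit_time, double miss_rate, double penalty)
{
    return hit_time + miss_rate * penalty;
}

int main(void)
{
    printf("BS=16B,  CS=1KB:   AMAT = %.3f cycles\n",
           amat(1.0, 0.1505, miss_penalty(16)));    /* 7.321 */
    printf("BS=256B, CS=256KB: AMAT = %.3f cycles\n",
           amat(1.0, 0.0049, miss_penalty(256)));   /* 1.353 */
    return 0;
}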

  36. Reducing Miss Rate with a Victim Cache • A small, fully associative cache that holds blocks discarded from the cache on a miss. • It is checked on a miss; if it matches, the victim block and the cache block are swapped. • A four-entry victim cache removed 20% to 95% of the conflict misses in a 4 KB direct-mapped data cache. CS520S99 Memory
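A minimal sketch of the victim-cache lookup path, assuming a direct-mapped main cache and a 4-entry fully associative victim cache with FIFO replacement; the structures are simplified (tags stand in for full block addresses, no data is stored) so that only the control flow shows:

#include <stdint.h>
#include <string.h>

#define SETS        256     /* direct-mapped cache: one block per set */
#define VICTIM_WAYS 4

struct dm_cache { uint32_t tag[SETS];        int valid[SETS]; };
struct victim   { uint32_t tag[VICTIM_WAYS]; int valid[VICTIM_WAYS]; int next; };

/* Returns 1 on a hit in the main cache, 2 on a hit in the victim cache,
   0 on a true miss (block fetched from the next level).                 */
static int cache_access(struct dm_cache *c, struct victim *v,
                        uint32_t tag, uint32_t index)
{
    if (c->valid[index] && c->tag[index] == tag)
        return 1;                               /* normal cache hit      */

    for (int w = 0; w < VICTIM_WAYS; w++)
        if (v->valid[w] && v->tag[w] == tag) {  /* hit in victim cache:  */
            uint32_t evicted = c->tag[index];   /* swap the two blocks   */
            int had = c->valid[index];
            c->tag[index] = tag;  c->valid[index] = 1;
            v->tag[w] = evicted;  v->valid[w] = had;
            return 2;
        }

    /* True miss: the displaced block goes into the victim cache (FIFO). */
    if (c->valid[index]) {
        v->tag[v->next]   = c->tag[index];
        v->valid[v->next] = 1;
        v->next = (v->next + 1) % VICTIM_WAYS;
    }
    c->tag[index] = tag;
    c->valid[index] = 1;
    return 0;
}

int main(void)
{
    struct dm_cache c; struct victim v;
    memset(&c, 0, sizeof c); memset(&v, 0, sizeof v);
    /* Two blocks that conflict in the same set ping-pong, but the second
       reference to the first block is caught by the victim cache.       */
    cache_access(&c, &v, 0xA, 7);
    cache_access(&c, &v, 0xB, 7);
    return cache_access(&c, &v, 0xA, 7);        /* returns 2             */
}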

  37. Pseudo (Column)-Associative Caches • On a hit, the cache access proceeds just like in a direct-mapped cache. • On a miss, one additional cache entry is checked for a match. • The candidate cache entry can be the one with the most significant bit of the index field inverted. • These cache blocks form the "pseudo set". • One cache block has a fast hit time; the other has a slow hit time. • What is the difference between this and a two-way set-associative cache? (It does not affect the processor clock rate.) CS520S99 Memory
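A minimal sketch of how the second ("pseudo set") probe can be computed, assuming an 8-bit index field; the index width is illustrative:

#include <stdio.h>

#define INDEX_BITS 8

/* The pseudo-set partner of an index: the same index with its most
   significant bit inverted, checked only after the first probe misses. */
static unsigned partner_index(unsigned index)
{
    return index ^ (1u << (INDEX_BITS - 1));
}

int main(void)
{
    unsigned idx = 0x2B;                         /* illustrative index   */
    printf("first probe: 0x%02X, second probe: 0x%02X\n",
           idx, partner_index(idx));             /* 0x2B -> 0xAB         */
    return 0;
}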

  38. Hardware Prefetching of Instructions and Data • Prefetch items before the processor requests them, placing them in the cache or in an external buffer. • Instruction prefetch: • The AXP 21064 fetches two blocks on a miss: the requested block into the cache and the next consecutive block (the prefetch block) into an instruction stream buffer (ISB). • If the requested instruction is in the instruction stream buffer, the cache request is cancelled, the block is read from the ISB, and the next prefetch request is issued. • For a 4 KB direct-mapped cache with BS = 16 B: 1 block in the ISB catches 15-25% of misses; 4 blocks in the ISB improve the hit rate to about 50%; 16 blocks improve it to 72%. • Data prefetch: • A single data stream buffer catches 25% of misses; 4 data stream buffers increase the data hit rate to 43%. CS520S99 Memory

  39. Compiler-Controlled Prefetching • The compiler generates prefetch instructions to request data before they are needed. • Register prefetch loads the value into a register. • Cache prefetch loads the data only into the cache. • Since prefetching a block can cause a page fault (the page containing the block is not in main memory), prefetches can be classified as faulting or non-faulting. • Non-binding prefetch: a non-faulting cache prefetch that does not cause virtual memory faults. • A nonblocking (lockup-free) cache allows the processor to keep retrieving data/instructions while prefetched data is being fetched. • Goal: overlap execution with the prefetching of data. Loops are the key target. CS520S99 Memory

  40. Example of Compiler-Controlled Prefetching
for (i=0; i<3; i=i+1)
  for (j=0; j<100; j=j+1)
    a[i][j] = b[j][0] * b[j+1][0];
Assume an 8 KB direct-mapped cache, BS = 16 B, write-back with write allocate. a and b are double-precision (8 B) floating-point arrays; a has 3 rows and 100 columns; b has 101 rows and 3 columns. Assume a and b are not in the cache at the start of the program.
1. (Compiler) Determine the number of cache misses.
Ans: In C, a two-dimensional array is stored in row-major order: a[0][0]; a[0][1]; ...; a[0][99]; a[1][0]; a[1][1]; ...; a[1][99]; a[2][0]; ... Elements in the same column of a (e.g., a[i][0] and a[i+1][0]) are separated by 800 B; stepping through them is called striding. In the loop above, a's elements are written in the same order as they are laid out in memory. Since BS = 16 B, the access to a[0][0] misses and brings in a block holding a[0][0] and a[0][1]; the next access, a[0][1], hits. For array a, accesses with an even value of j miss and accesses with an odd value of j hit: 300/2 = 150 misses (spatial locality).
CS520S99 Memory

  41. Example of Compiler-Controlled Prefetching (continued)
(Same loop and assumptions as the previous slide.)
1. (Compiler) Determine the number of cache misses for array b.
Ans: In the first loop iteration, b[0][0] and b[1][0] are accessed. They are separated by 3 x 8 = 24 bytes, so they do not fall in the same 16 B block; their misses bring in blocks whose neighboring elements (e.g., b[0][1]) are not used. In the next iteration, b[1][0] and b[2][0] are accessed: b[1][0] hits and b[2][0] misses. In the loop with i = 0, there are 101 misses for b[0][0] through b[100][0]. For the loops with i = 1 and i = 2, all accesses to array b hit.
CS520S99 Memory

  42. Example of Compiler-Controlled Prefetching (continued)
2. Insert prefetch instructions to reduce misses. Assume the miss penalty is large, so we prefetch 7 iterations in advance, and that prefetches at the beginning and end of the range do not fault.
Ans: The first loop prefetches both b and a; the second loop prefetches only a.
for (j=0; j<100; j++) {
  prefetch(b[j+7][0]);   /* b[j][0] for 7 iterations later */
  prefetch(a[0][j+7]);   /* a[0][j] for 7 iterations later */
  a[0][j] = b[j][0] * b[j+1][0];
}
for (i=1; i<3; i++)
  for (j=0; j<100; j++) {
    prefetch(a[i][j+7]); /* a[i][j] for 7 iterations later */
    a[i][j] = b[j][0] * b[j+1][0];
  }
The revised code prefetches a[i][7]...a[i][99] and b[7][0]...b[100][0].
Nonprefetched misses in the 1st loop (a[0][0-6] and b[0-6][0]): 4 + 7 = 11.
Nonprefetched misses in the 2nd loop (a[1][0-6] and a[2][0-6]): 4 + 4 = 8.
This avoids 251 - 19 = 232 cache misses, at the cost of executing 400 prefetch instructions.
CS520S99 Memory
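The prefetch() calls above are pseudo-instructions. As one possible rendering, and assuming a GCC/Clang toolchain (an assumption, not part of the original example), the first loop could be written with the __builtin_prefetch intrinsic:

/* Sketch of the first prefetch loop using __builtin_prefetch (GCC/Clang).
   PF_DIST is the prefetch distance of 7 iterations from the slide.       */
#define PF_DIST 7

double a[3][100], b[101][3];

void first_loop(void)
{
    for (int j = 0; j < 100; j++) {
        if (j + PF_DIST < 101)
            __builtin_prefetch(&b[j + PF_DIST][0], 0);  /* prefetch for read  */
        if (j + PF_DIST < 100)
            __builtin_prefetch(&a[0][j + PF_DIST], 1);  /* prefetch for write */
        a[0][j] = b[j][0] * b[j + 1][0];
    }
}

int main(void)
{
    first_loop();
    return 0;
}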

  43. Correction to Prefetch Results • Assume that a[i][0-6] is prefetched by the spill-over prefetches of a[i-1][100-106]; then only a[0][0-6] causes ceil(7/2) = 4 nonprefetched misses. • b[0-6][0] causes 7 nonprefetched misses. • b[100][0] is prefetched. • The total number of nonprefetched misses should be 4 + 7 = 11. • Does prefetching a[0][7] bring in a[0][7] and a[0][8], or a[0][6] and a[0][7]? Note that if a[0][0] is allocated at an address divisible by 16, the second is correct; otherwise the first is. CS520S99 Memory

  44. Improvement from Compiler-Controlled Prefetch
• Ignore instruction cache misses; assume no conflict or capacity misses in the data cache.
• Prefetches can overlap with each other and with cache misses.
• Each iteration of the original loop takes 7 clock cycles; each iteration of the first prefetch loop takes 9 clock cycles and of the second prefetch loop 8 clock cycles.
• A miss takes 50 clock cycles.
• Ans: Original loop: 300 iterations x 7 cycles = 2100 cycles; 251 cache misses x 50 cycles = 12,550 cycles; total = 2100 + 12,550 = 14,650 cycles.
• 1st prefetch loop: 100 iterations x 9 cycles + 11 misses x 50 cycles = 1450 cycles. 2nd prefetch loop: 200 iterations x 8 cycles + 8 misses x 50 cycles = 2000 cycles. Total prefetch-loop cycles = 3450.
• The prefetch code is 14,650 / 3450 = 4.2 times faster.
CS520S99 Memory

  45. Reducing Miss Rate Through Compiler Optimization • A pure software solution. • Using profiling information and rearranging code, the miss rate can be reduced by 50% [McFarling 98]. • Data has fewer restrictions on relocation than code. • Goal: rearrange code and data to improve spatial and temporal locality. CS520S99 Memory

  46. Example 1: Merging Arrays
• Group related data into a record, hoping the fields are fetched together in one cache block.
• What locality is exploited here? (Spatial locality.)
• Before:
int val[SIZE];
int key[SIZE];
• After:
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];
CS520S99 Memory

  47. Loop Interchange
• In C, arrays are arranged in row-major order; in Fortran, arrays are arranged in column-major order.
• Before: the inner loop strides down a column, touching one element per row, causing many misses. The block containing x[0][1] may get kicked out by conflict misses before it is accessed when j = 1.
for (j=0; j<100; j++)
  for (i=0; i<5000; i++)
    x[i][j] = 2 * x[i][j];
• After: access data in consecutive memory locations.
for (i=0; i<5000; i++)
  for (j=0; j<100; j++)
    x[i][j] = 2 * x[i][j];
CS520S99 Memory

  48. Loop Fusion
• Before:
for (i=0; i<N; i++)
  for (j=0; j<N; j++)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i=0; i<N; i++)
  for (j=0; j<N; j++)
    d[i][j] = a[i][j] + c[i][j];
• After:
for (i=0; i<N; i++)
  for (j=0; j<N; j++) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }
• The second accesses to a and c will be hits.
CS520S99 Memory

  49. Blocking (Improve Temporal Locality)
• Some algorithms access data by both row and column, so row/column-major storage order alone does not help.
• Idea: modify the algorithm to access data in blocks (submatrices).
• Before (matrix multiplication):
for (i=0; i<N; i++)
  for (j=0; j<N; j++) {
    r = 0;
    for (k=0; k<N; k++)
      r = r + y[i][k]*z[k][j];
    x[i][j] = r;
  }
• After (B = submatrix block size; note that the result is accumulated into x[i][j], since each (jj, kk) block contributes only a partial sum):
for (jj=0; jj<N; jj=jj+B)
  for (kk=0; kk<N; kk=kk+B)
    for (i=0; i<N; i++)
      for (j=jj; j<min(jj+B,N); j++) {
        r = 0;
        for (k=kk; k<min(kk+B,N); k++)
          r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
      }
CS520S99 Memory

Access Pattern (Figure: access of the matrices for i = 1, without blocking and with blocking; gray = old accesses, black = new accesses, white = not yet accessed.) CS520S99 Memory
