Cache Organization and Performance Evaluation

Cache Organization and Performance Evaluation Vittorio Zaccaria

Exercise 1 • How many total bits are required for a direct mapped instruction cache with 64 KB of data and one-word blocks, assuming 32-bit address? 1 Word=4 bytes Block no. = 64KB/4Bytes=2^14 blocks Tag bits=32-14[index]-2[offset]=16 Size=[16(tag)+1(validbit)+4(blocksize)*8]*2^14=802816

Exercise 2: DM cache 64blocks x 32 bytes • Assuming byte addressing and 32-bit addresses, how many bits are there in each of the tag, Index, and Offset fields of the address? • How many total bytes of data can be stored in the cache? • How many bytes of memory does the cache use (including tags, valid bits, and data)? • How many possible blocks reference to the same cache block? • If the cache is loaded with random blocks, what is the probability of, given an address, having a match in the tag field? Index=6 bits, offset=5 bits; tag=21bits 2KB (21+1[valid])*64/8+32*64=2224Bytes 2^21 1/(2^21)

Exercise 3 • Assume a cache with: • Cache size = 128 bytes total. • 2-word blocks. • 2-way set associative. • How may blocks has the cache? • How many bits is the index? • How many bits is the tag? [128/(8*2)]=8 [2=log(8blocks/2 sets)] [32-2-3(offset)=27]

Cache Performance • CPUtime = Instruction Count x (CPIexecution + Mem accesses per instruction x Miss rate x Miss penalty in cycles) x Clock cycle time • Misses per instruction = #Memory accesses per instruction x Miss rate • CPI = CPIexecution + Misses per instruction x Miss penalty cycles • AMAT= HitTime+MissRate*MissPenalty (can be expressed in cycles or in secs).

Why misses? • Compulsory—The first access to a block is not in the cache, so the block must be brought into the cache. • Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. • Conflict—If block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur.

3Cs Absolute Miss Rate (SPEC92)

Exercise 4 • Consider a VAX-11/780 • MP=6 cycles • CPI exec = 8.5 • MR=0.11 • #mem_acc/instruction=3 • Compute arch. CPI with cache CPIrealCache= CPIexec+#memacc/instr*MR*MP= 8.5 + 3* 0.11 *6 = 10.48

Exercise 5 • Compare the previous architecture in the 100% miss rate case with the same in the 100% hit rate case. Compare the speedup of the real cache with the ideal one. • 100%miss • 100%Hit: CPInoCache=8.5 + 3*6 = 26.5 CPIidealCache=CPIexec=8.5 Speedup(idealCache, realCache)=10.48/8.5=1.23

Excercise 6 • Compute the CPI of an architecture cache with: • CPIideal=1.5 • MP=10 • MR=0.11 • #mem_acc/instr=1.4 CPIrealCache= 1.5+1.4*0.11*10=3.04

Exercise (6 cont.) • Compare the case of 100% hit rate with the case of 100% miss rate. • Speedup real-ideal cache: CPInoCache= 1.5+1.4*10=15.5 CPIidealCache= 1.5 Speedup=3.04 / 1.5 = 2

Exercise 7 • Consider two architectures: A and B • Tclk(A)=20ns, 8.5% faster than Tclk(B) • Both A and B have #mem_acc/instr=1.3 • MP(A)=MP(B)=200 ns • MR(A)=3.9%, MR(B)=3.0% • Compute AMAT(A) and AMAT(B) • Compute CPI(A) and CPI(B)

Solution 7 • CPI(A)= • CPI(B) • AMAT(A)= • AMAT(B)= 1.5+1.3*10*3.9%=2.07 1.5+1.3*[3%*round(200ns*/(20ns+8.5%*20ns))] =1.85 20ns+200ns*3.9%=27.8ns 20ns(1+8.5%)+200ns*3.0%=27.7ns

Exercise 8 • Architecture A[I$,D$]: • 1 instr. on 85% of the cycles; other cycles NOP. • Architecture B[I$,D$]: • 2 instr. on 65% of cycles; 1 instr. on 30% of the time; other cycles NOP. • Assume hit time= 1 cycle, miss time = 50 cycles. • I$ hit rate = 100% • D$ hit rate= 98% • L/S instr = 33% of all instr.

Exercise 8 (cont.) • CPI(A) and CPI(B) with a perfect memory system? • AMAT in cycles relative to D$? • CPI(A)=100cycles/85instr=1.17 • CPI(B)=100/(65*2+30)=0.62 1+0.02*49=1.98 cycles

Exercise 8 (cont.) • CPI(A) and CPI(B) with actual cache? Speedup(B,A)=1.58; • CPI(A)=1.17+0.33*0.02*49=1.49 • CPI(B)=0.62+0.33*0.02*49=0.94

Exercise 9 • 300 MHz CPU, 50 MHz bus speed • DCache has 2 64-bit words per block • Buses: • 2 bytes wide • burst transfer mode: • each block read is: 4-1-1-1-1-1-1-1 (bus clocks) • Hit time= 1 cycle • 6% miss rate. • Ideal ICache

Exercise 9 (cont.) • Consider only read data accesses. What is the effective AMAT in ns? • How would you speedup? • Doubling bus width? • Doubling bus speed? • Compute first AMAT and then speedup (1+0.06*((4+7)*300/50))CPU clocks =4.96CPU clocks, 16.5 ns

Exercise 9 • Doubling bus width? First datum in 4 bus clocks, then 1-1-1 AMAT= (1+0.06*(4+3)*6)CPUclocks =3.52CPUclocks =11.7 ns

Exercise 9 (cont.) • Doubling bus speed? 1 bus clock= 3 cpu cycles AMAT= (1+0.06*(4+7)*3)CPUclocks =2.98CPUclocks =9 ns Speedup(2Xfreq,2Xwidth)=1.18

Cache Organization and Performance Evaluation