Lecture 15 Main Memory

Lecture 15Main Memory CS510 Computer Architectures

Main Memory Background • Performance of Main Memory: • Latency: Cache Miss Penalty • Access Time(AT): time between request and word arrives • Cycle Time(CT): time between requests • Bandwidth: I/O & Large Block Miss Penalty (L2) • Main Memory, a 2D matrix, is DRAM: • Dynamic since needs to be refreshed periodically (8 ms) • Difference in AT and CT, AT<CT • Addresses divided into 2 halves, multiplexing them to memory: • RAS or Row Access Strobe • CAS or Column Access Strobe • Cache uses SRAM: • No refresh (6 transistors/bit vs. 1 transistor/bit) • No difference in AT and CT, AT=CT • Address not divided CS510 Computer Architectures

Main Memory Background • Size: DRAM/SRAM » 4~8 • Costand Cycle time: SRAM/DRAM » 8~16 • Capacity of DRAM :4 times/3 years or 60%/year • RAS access time :7% per year CS510 Computer Architectures

1-word-wide memory Interleaved Memory Wide Memory CPU CPU CPU MUX Cache Cache Cache BUS BUS BUS M M M M M M Bank bank bank bank 0 1 2 3 Main Memory Organization • Simple: • CPU, Cache, Bus, Memory are same width (32 bits) • Wide: • CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits) • Interleaved: • CPU, Cache, Bus 1wd: Memory N Modules (4 Modules); shows word interleave CS510 Computer Architectures

Address Bank 0 Bank1 Bank 2 Bank 3 0 4 8 12 3 7 11 15 1 5 9 13 2 6 10 14 Main Memory Performance Timing model • 1 to send address, • 6 access time, • 1 to send data • Block access time • Assuming Cache Block is 4 words Simple M.P. =4x (1+6+1) = 32 Wide M.P. =1+6+ 1= 8 Interleaved M.P.= 1+6 + (4x1)= 11 CS510 Computer Architectures

Technique for Higher BW:1. Wider Main Memory • Alpha AXP 21064 : 256-bit wide L2, Memory Bus, Memory • Drawbacks • expandability • doubling the width needs doubling the capacity • bus width • need a multiplexer to get the desired word from a block • error correction - separate error correction every 32 bits • otherwise, on WRITE, read block -> modify word -> calculate the new ECC -> store CS510 Computer Architectures

block size(word) 1 2 4 • miss rate(%) 3 2 1 Technique for Higher BW:2. Interleaved Memory Interleaved Memory and Wide Memory • Consider the following description of a machine and its cache performance • mem bus width = 1 word=32 bit • memory accesses/ instr = 1.2 • cache miss penalty = 8(1+6+1) cycles • average CPI(ignoring cache misses) = 2 • What is the improvement over the base machine(block size=1) in performance of interleaving 2-way and 4-way versus doubling the width of memory and the bus CS510 Computer Architectures

Interleaved Memory Answer • CPI + (M ref/instr. x miss rate x miss penalty) • 2 + (1.2 x (0.03 for 1-way, 0.02 for 2-way, or 0.01 for 4-way) x mis penalty) • the CPI for the base machine(Simple Memory)(BM) • 2+(1.2 x 0.03 x 8) = 2.288 • 2-word wide memory • 32-bit bus and mem, no interleaving = 2+(1.2x0.02x(2x8)) = 2.384 slower than BM • 32-bit bus and mem, interleaving = 2+(1.2x0.02x(1+6+(2x1))) = 2.216 faster than BM • 64-bit bus and mem, no interleaving = 2+(1.2x0.02x8) = 2.192 faster than BM • 4-word wide memory • 32-bit bus and mem, no interleaving = 2+(1.2x0.01x(4x8)) = 2.384slower than BM • 32-bit bus and mem, interleaving = 2+(1.2x0.01x(1+6+(4x1))) = 2.132 faster than 2-word • 64-bit bus and mem, no interleaving = 2+(1.2x0.01x(2x8)) = 2.192 same as 2-word CS510 Computer Architectures

Superbank Bank Superbank Offset Bank Number Bank Offset Superbank Number Technique for Higher BW:3. Independent Memory Banks • Interleaved Memory-Faster Sequential Accesses; • Independent Memory Banks - Faster Independent Accesses • Motivation: Higher BW for sequential accesses by interleaving sequential bank addresses - each bank shares the address line • Memory banks for independent accesses - each bank has a bank controller, separate address lines • 1 bank for I/O, 1 bank for cache read, 1 bank for cache write, etc. • If 1 controller controls all the banks, it can only provide fast access time for one operation • Benefit of memory banks for Miss under Miss in Non-faulting caches Superbank: all memory banks active on one block transfer Bank: portion within a superbank that is word interleaved CS510 Computer Architectures

Independent Memory Banks • How many banks? • For sequential accesses, a new bank delivers a word on each clock • For sequential accesses, number of banks ³ number of clocks to access a word in a bank • Otherwise will return to the original bank before it has the next word ready • Increasing capacity of a DRAM chip => fewer chips to build the same capacity memory system => harder to have banks CS510 Computer Architectures

Bank0Bank1Bank127 ,…, Bank511 0,0 0,1 ,..., 0,127 ,..., 0,511 1,0 1,1 ,..., 1,127 ,..., 1,511 ... 127,0 127,1 ,..., 127,127 ,..., 127,511 ... 128,0 128,1 ,..., 128,127 ,..., 128,511 ... 255,0 255,1 ,..., 255,127 ,..., 255,511 int x[256][512]; for (j = 0; j < 512; j = j+1) for (i = 0; i < 256; i = i+1) x[i][j] = 2 * x[i][j]; Column processing Technique for Higher BW:4. Avoiding Bank Conflicts Even a lot of banks, still bank conflict in certain regular accesses - e.g. Storing 256x512 array in 128 banks and column processing (512 is an even multiple of 128) Inner Loop is a column processing which causes bank conflicts Column elements are in the same bank CS510 Computer Architectures

Avoiding Bank Conflicts • SW approaches • Loop interchange to avoid accessing the same bank • Declaring array size not power of 2(number of banks is a power of 2) so that addresses point to the different banks, i.e., a column elements are spread around different banks • HW: Prime number of banks • bank number = (address) MOD (number of banks) • address within bank = address / number of banks • To avoid calculation of divide per memory access address within bank = (address) MOD (number words in bank ) 3=(31)MOD(7) • bank number? words per bank? • Easy if both are power of 2 CS510 Computer Architectures

bi=(x) MOD (ai), 0 < bi < ai, 0 < x < a0 x a1 x a2 x ... Fast Bank Number Chinese Remainder Theorem As long as two sets of integers ai and bi follow these rules • and that ai and aj are co-prime if i ¹ j, then the integer x has only one solution (unambiguous mapping): • bank number = b0=(x) Mod (a0); • number of banks = a0 (= 3 in ex), 0 < b0 < a0 • address within a bank = b1=(x) Mod (a1); • size of a bank = a1 (= 8 in ex) • N words’ addresses 0 to N-1; • prime no. of banks(3); • words/bank power of 2(8) CS510 Computer Architectures

Seq. Interleaved Modulo Interleaved Bank Number: 0 1 2 0 1 2 Addr in Bank: 0 0 1 2 0 16 8 13 4 5 9 1 172 6 7 8 18 10 239 10 11 3 19 11412 13 14 12 4 20515 16 17 21 13 5618 19 20 6 22 14721 22 23 15 7 23 Bank # = (5) Mod (3) = 2: 5/3 = 1 (5) Mod (8) = 5 Fast Bank Numbers Address = 5 CS510 Computer Architectures

Technique for Higher BW:5. DRAM Specific Interleaving • DRAM access - Row Access(RAS) and Column Access(CAS) • Multiple accesses to a RAS buffer: several names (page mode) • 64 Mbit DRAM: cycle time = 100 ns, page mode = 20 ns • New DRAMs to address CPU-DRAM speed gap; what will they cost, will they survive? • Synchronous DRAM: Provide a clock signal to DRAM, transfer synchronous to system clock • RAMBUS: startup company; reinvent DRAM interface • Each Chip acts as a module vs. slice of memory(or bank) • Short bus between CPU and chips • Does own refresh • Variable amount of data returned • 1 byte / 2 ns (500 MB/s per chip) • Niche memory only? or main memory? • e.g., Video RAM for frame buffers, DRAM + fast serial output CS510 Computer Architectures

Main Memory Summary • Wider Memory: for independent access • Interleaved Memory: for sequential or independent accesses • Avoiding bank conflicts: SW & HW • DRAM specific optimizations: page mode & Specialty DRAM CS510 Computer Architectures

Lecture 15 Main Memory