1 / 16

Lecture 15 Main Memory

Lecture 15 Main Memory. Main Memory Background. Performance of Main Memory: Latency: Cache Miss Penalty Access Time(AT) : time between request and word arrives Cycle Time(CT) : time between requests Bandwidth: I/O & Large Block Miss Penalty (L2) Main Memory, a 2D matrix, is DRAM :

galvin
Télécharger la présentation

Lecture 15 Main Memory

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Lecture 15Main Memory CS510 Computer Architectures

  2. Main Memory Background • Performance of Main Memory: • Latency: Cache Miss Penalty • Access Time(AT): time between request and word arrives • Cycle Time(CT): time between requests • Bandwidth: I/O & Large Block Miss Penalty (L2) • Main Memory, a 2D matrix, is DRAM: • Dynamic since needs to be refreshed periodically (8 ms) • Difference in AT and CT, AT<CT • Addresses divided into 2 halves, multiplexing them to memory: • RAS or Row Access Strobe • CAS or Column Access Strobe • Cache uses SRAM: • No refresh (6 transistors/bit vs. 1 transistor/bit) • No difference in AT and CT, AT=CT • Address not divided CS510 Computer Architectures

  3. Main Memory Background • Size: DRAM/SRAM » 4~8 • Costand Cycle time: SRAM/DRAM » 8~16 • Capacity of DRAM :4 times/3 years or 60%/year • RAS access time :7% per year CS510 Computer Architectures

  4. 1-word-wide memory Interleaved Memory Wide Memory CPU CPU CPU MUX Cache Cache Cache BUS BUS BUS M M M M M M Bank bank bank bank 0 1 2 3 Main Memory Organization • Simple: • CPU, Cache, Bus, Memory are same width (32 bits) • Wide: • CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits) • Interleaved: • CPU, Cache, Bus 1wd: Memory N Modules (4 Modules); shows word interleave CS510 Computer Architectures

  5. Address Bank 0 Bank1 Bank 2 Bank 3 0 4 8 12 3 7 11 15 1 5 9 13 2 6 10 14 Main Memory Performance Timing model • 1 to send address, • 6 access time, • 1 to send data • Block access time • Assuming Cache Block is 4 words Simple M.P. =4x (1+6+1) = 32 Wide M.P. =1+6+ 1= 8 Interleaved M.P.= 1+6 + (4x1)= 11 CS510 Computer Architectures

  6. Technique for Higher BW:1. Wider Main Memory • Alpha AXP 21064 : 256-bit wide L2, Memory Bus, Memory • Drawbacks • expandability • doubling the width needs doubling the capacity • bus width • need a multiplexer to get the desired word from a block • error correction - separate error correction every 32 bits • otherwise, on WRITE, read block -> modify word -> calculate the new ECC -> store CS510 Computer Architectures

  7. block size(word) 1 2 4 • miss rate(%) 3 2 1 Technique for Higher BW:2. Interleaved Memory Interleaved Memory and Wide Memory • Consider the following description of a machine and its cache performance • mem bus width = 1 word=32 bit • memory accesses/ instr = 1.2 • cache miss penalty = 8(1+6+1) cycles • average CPI(ignoring cache misses) = 2 • What is the improvement over the base machine(block size=1) in performance of interleaving 2-way and 4-way versus doubling the width of memory and the bus CS510 Computer Architectures

  8. Interleaved Memory Answer • CPI + (M ref/instr. x miss rate x miss penalty) • 2 + (1.2 x (0.03 for 1-way, 0.02 for 2-way, or 0.01 for 4-way) x mis penalty) • the CPI for the base machine(Simple Memory)(BM) • 2+(1.2 x 0.03 x 8) = 2.288 • 2-word wide memory • 32-bit bus and mem, no interleaving = 2+(1.2x0.02x(2x8)) = 2.384 slower than BM • 32-bit bus and mem, interleaving = 2+(1.2x0.02x(1+6+(2x1))) = 2.216 faster than BM • 64-bit bus and mem, no interleaving = 2+(1.2x0.02x8) = 2.192 faster than BM • 4-word wide memory • 32-bit bus and mem, no interleaving = 2+(1.2x0.01x(4x8)) = 2.384slower than BM • 32-bit bus and mem, interleaving = 2+(1.2x0.01x(1+6+(4x1))) = 2.132 faster than 2-word • 64-bit bus and mem, no interleaving = 2+(1.2x0.01x(2x8)) = 2.192 same as 2-word CS510 Computer Architectures

  9. Superbank Bank Superbank Offset Bank Number Bank Offset Superbank Number Technique for Higher BW:3. Independent Memory Banks • Interleaved Memory-Faster Sequential Accesses; • Independent Memory Banks - Faster Independent Accesses • Motivation: Higher BW for sequential accesses by interleaving sequential bank addresses - each bank shares the address line • Memory banks for independent accesses - each bank has a bank controller, separate address lines • 1 bank for I/O, 1 bank for cache read, 1 bank for cache write, etc. • If 1 controller controls all the banks, it can only provide fast access time for one operation • Benefit of memory banks for Miss under Miss in Non-faulting caches Superbank: all memory banks active on one block transfer Bank: portion within a superbank that is word interleaved CS510 Computer Architectures

  10. Independent Memory Banks • How many banks? • For sequential accesses, a new bank delivers a word on each clock • For sequential accesses, number of banks ³ number of clocks to access a word in a bank • Otherwise will return to the original bank before it has the next word ready • Increasing capacity of a DRAM chip => fewer chips to build the same capacity memory system => harder to have banks CS510 Computer Architectures

  11. Bank0Bank1Bank127 ,…, Bank511 0,0 0,1 ,..., 0,127 ,..., 0,511 1,0 1,1 ,..., 1,127 ,..., 1,511 ... 127,0 127,1 ,..., 127,127 ,..., 127,511 ... 128,0 128,1 ,..., 128,127 ,..., 128,511 ... 255,0 255,1 ,..., 255,127 ,..., 255,511 int x[256][512]; for (j = 0; j < 512; j = j+1) for (i = 0; i < 256; i = i+1) x[i][j] = 2 * x[i][j]; Column processing Technique for Higher BW:4. Avoiding Bank Conflicts Even a lot of banks, still bank conflict in certain regular accesses - e.g. Storing 256x512 array in 128 banks and column processing (512 is an even multiple of 128) Inner Loop is a column processing which causes bank conflicts Column elements are in the same bank CS510 Computer Architectures

  12. Avoiding Bank Conflicts • SW approaches • Loop interchange to avoid accessing the same bank • Declaring array size not power of 2(number of banks is a power of 2) so that addresses point to the different banks, i.e., a column elements are spread around different banks • HW: Prime number of banks • bank number = (address) MOD (number of banks) • address within bank = address / number of banks • To avoid calculation of divide per memory access address within bank = (address) MOD (number words in bank ) 3=(31)MOD(7) • bank number? words per bank? • Easy if both are power of 2 CS510 Computer Architectures

  13. bi=(x) MOD (ai), 0 < bi < ai, 0 < x < a0 x a1 x a2 x ... Fast Bank Number Chinese Remainder Theorem As long as two sets of integers ai and bi follow these rules • and that ai and aj are co-prime if i ¹ j, then the integer x has only one solution (unambiguous mapping): • bank number = b0=(x) Mod (a0); • number of banks = a0 (= 3 in ex), 0 < b0 < a0 • address within a bank = b1=(x) Mod (a1); • size of a bank = a1 (= 8 in ex) • N words’ addresses 0 to N-1; • prime no. of banks(3); • words/bank power of 2(8) CS510 Computer Architectures

  14. Seq. Interleaved Modulo Interleaved Bank Number: 0 1 2 0 1 2 Addr in Bank: 0 0 1 2 0 16 8 13 4 5 9 1 172 6 7 8 18 10 239 10 11 3 19 11412 13 14 12 4 20515 16 17 21 13 5618 19 20 6 22 14721 22 23 15 7 23 Bank # = (5) Mod (3) = 2: 5/3 = 1 (5) Mod (8) = 5 Fast Bank Numbers Address = 5 CS510 Computer Architectures

  15. Technique for Higher BW:5. DRAM Specific Interleaving • DRAM access - Row Access(RAS) and Column Access(CAS) • Multiple accesses to a RAS buffer: several names (page mode) • 64 Mbit DRAM: cycle time = 100 ns, page mode = 20 ns • New DRAMs to address CPU-DRAM speed gap; what will they cost, will they survive? • Synchronous DRAM: Provide a clock signal to DRAM, transfer synchronous to system clock • RAMBUS: startup company; reinvent DRAM interface • Each Chip acts as a module vs. slice of memory(or bank) • Short bus between CPU and chips • Does own refresh • Variable amount of data returned • 1 byte / 2 ns (500 MB/s per chip) • Niche memory only? or main memory? • e.g., Video RAM for frame buffers, DRAM + fast serial output CS510 Computer Architectures

  16. Main Memory Summary • Wider Memory: for independent access • Interleaved Memory: for sequential or independent accesses • Avoiding bank conflicts: SW & HW • DRAM specific optimizations: page mode & Specialty DRAM CS510 Computer Architectures

More Related