Chapter 5: Memory System

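Figure 5.1: Connection of the main memory to the processor. [The processor's MAR drives a k-bit address bus and its MDR connects to an n-bit data bus. Main memory has up to 2^k addressable locations, with a word length of n bits. Control lines: R/W, MFC, etc.]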

Basic Computer Organization Revisited
[Block diagram with blocks labeled I/O, Memory (Data, Program), and Processor (General-Purpose Registers, ALUs, MAR, MDR, PC, Control Logic).]

Access time vs. cycle time
• Memory access time
  • The time taken by a single access
• Memory cycle time
  • How quickly two back-to-back accesses to a memory chip can be made
  • Cycle time > access time, because of the delay required between successive memory accesses
• DRAM (used to build main memory)
  • Access time: 50 to 150 ns
  • Requires a pause (refresh) between back-to-back accesses
• SRAM (used to build cache memory)
  • Access time: 10 ns
  • No pause between back-to-back accesses

Figure 5.2: Organization of bit cells in a memory chip. [A 16 × 8 array of cells (FF). An address decoder takes address lines A0-A3 and drives word lines W0-W15. Each column of cells shares a pair of bit lines (b, b′) connected to a Sense/Write circuit, with data input/output lines b7-b0. Control inputs are R/W and CS. The chip needs 4 + 2 + 8 = 14 external connections.]

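Figure 5.3: Organization of a 1K × 1 memory chip. [The 10-bit address is split into a 5-bit row address and a 5-bit column address. A 5-bit decoder uses the row address to select one of word lines W0-W31 in a 32 × 32 array of memory cells. The column address drives a 32-to-1 output multiplexer and input demultiplexer connected to the Sense/Write circuitry. Control inputs are R/W and CS; there is a single data input/output line.]
Slide annotations: RAS = row address strobe; CAS = column address strobe.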
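The row/column split in Figure 5.3 can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and it assumes the row address occupies the high-order 5 bits of the 10-bit address.

```python
ROW_BITS = 5  # selects one of the 32 word lines W0-W31
COL_BITS = 5  # selects one of the 32 columns through the multiplexer


def split_address(addr):
    """Split a 10-bit cell address into (row, column).

    Assumption: the row address is the high-order half of the address.
    """
    if not 0 <= addr < (1 << (ROW_BITS + COL_BITS)):
        raise ValueError("address must fit in 10 bits")
    row = addr >> COL_BITS               # upper 5 bits
    col = addr & ((1 << COL_BITS) - 1)   # lower 5 bits
    return row, col


row, col = split_address(0b1011000011)
print(row, col)
```

With 5 + 5 address bits the chip addresses 32 × 32 = 1024 one-bit cells, which is the 1K × 1 organization named in the caption.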

SRAM
• Static Random Access Memory
  • Very fast reads and writes
  • Needs 6 transistors per cell, so it costs more and takes more area
  • Does not need to be refreshed
  • Low power consumption
• Implementation technology
  • CMOS
• Used to build cache memory

Figure 5.5: An example of a CMOS memory cell. [Six transistors T1-T6, internal nodes X and Y, supply voltage Vsupply, bit lines b and b′, and a word line.]

DRAM
• Dynamic Random Access Memory
  • Needs 1 transistor and 1 capacitor per cell
  • Lower cost and more compact
  • Each bit must be refreshed periodically
• Implementation technology
  • CMOS
• Used to build main memory

Figure 5.6: A single-transistor dynamic memory (RAM) cell. [Transistor T and capacitor C, connected to a bit line and a word line.]

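Asynchronous DRAM
Figure 5.7: Internal organization of a 2M × 8 dynamic memory chip. [The multiplexed address inputs carry A20-9 (row) and A8-0 (column). RAS loads the row address latch, which feeds the row decoder for a 4096 × (512 × 8) cell array. CAS loads the column address latch, which feeds the column decoder. Sense/Write circuits connect to data lines D7-D0. Control inputs are CS and R/W.]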

Fast Page Mode (FPM)
• Conventional DRAM requires a row address and a column address to be sent for every access
• FPM sends the row address just once for many accesses to nearby memory locations, which improves access time. That is, the row address is decoded once, and different column addresses are then decoded to access different bytes in the same row. (See Example 5.1 on page 5-54.)

Extended Data Out (EDO) DRAM
• Also called hyper page mode DRAM
• The timing circuits of EDO memory are modified so that one access to the memory can begin before the previous one has finished (note: conventional DRAM needs some delay between two consecutive accesses)

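Synchronous DRAM (SDRAM)
Figure 5.8: Synchronous DRAM. [Blocks: refresh counter, row address latch, row decoder, cell array, column address latch, column decoder, Read/Write circuits and latches, mode register and timing control, data input register, data output register. Signals: row/column address, Clock, RAS, CAS, R/W, CS, and data.]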

SDRAM
• Supports burst operation
  • Automatic column-address increment: no external CAS cycle is needed to select each successive column address
• Interleaved memory
  • Contains two banks of memory internally instead of one
  • This allows the second bank to be "precharging" (RAS and CAS activation) while the first bank is transferring data
• Will replace older DRAM technologies

DDR SDRAM
• Double data rate SDRAM
  • Transfers data on both the rising and falling edges of the clock
  • This doubles the memory bandwidth by transferring data twice per clock cycle
  • Standard SDRAM acts only on the rising edge of the clock
• DDR-II
  • The memory core runs at half the clock frequency of the I/O buffers
  • DDR: 100 MHz driven clock -> 100 MHz data buffers -> DDR applied -> 200 MHz final data frequency
  • DDR-II: 100 MHz driven clock -> 200 MHz data buffers -> DDR applied -> 400 MHz final data frequency
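The two frequency chains on this slide reduce to a small piece of arithmetic. The sketch below is only an illustration; the function name and the buffer_multiplier parameter are hypothetical, chosen to mirror the "driven clock -> data buffers -> DDR applied" steps.

```python
def final_data_frequency_mhz(driven_clock_mhz, buffer_multiplier):
    """Effective data frequency for a double-data-rate memory.

    buffer_multiplier is the ratio of the data-buffer clock to the
    driven clock: 1 for DDR, 2 for DDR-II (per the slide).
    """
    buffer_clock_mhz = driven_clock_mhz * buffer_multiplier
    # "DDR applied": one transfer on each clock edge, so two per cycle.
    return buffer_clock_mhz * 2


print("DDR:   ", final_data_frequency_mhz(100, 1), "MHz")
print("DDR-II:", final_data_frequency_mhz(100, 2), "MHz")
```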

SIMM vs. DIMM
• SIMM
  • Single In-line Memory Module
  • 30 pins (8-bit bus version)
  • 72 pins (wider bus, more address lines)
• DIMM
  • Dual In-line Memory Module
  • 168 pins

RAMBUS
• Rambus (the company)
  • Makes a single chip act more like a memory system than a memory component
  • Each chip has interleaved memory and a high-speed interface
• RDRAM (1st generation)
  • Drops RAS/CAS, replacing them with a bus that allows other accesses between the sending of the address and the return of the data
  • Runs at a 300 MHz clock
• DRDRAM (2nd generation)
  • Direct RDRAM
  • Separate row- and column-command buses instead of the conventional multiplexing
  • Runs at a 400 MHz clock
• RIMM
  • 16 RDRAMs

Other memory
• ROM
• PROM
• EPROM
• EEPROM
• Flash
  • Low power consumption
  • Used in portable systems such as PDAs, mobile phones, digital cameras, and MP3 players

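Memory hierarchy
Figure 5.13: Memory hierarchy. [From top to bottom: processor registers, primary cache (L1), secondary cache (L2), main memory, magnetic-disk secondary memory. Size increases toward the bottom; speed and cost per bit increase toward the top.]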

Memory hierarchy
• Level 1: Registers
  • Size: < 1 KB
  • Access time: 0.25-0.5 ns
  • Bandwidth: 20,000-100,000 MB/s
  • Managed by the compiler
• Level 2: Cache
  • Size: < 16 MB
  • Access time: 0.5-25 ns
  • Bandwidth: 5,000-10,000 MB/s
  • Managed by hardware
• Level 3: Main memory
  • Size: < 16 GB
  • Access time: 80-250 ns
  • Bandwidth: 1,000-5,000 MB/s
  • Managed by the OS
• Level 4: Disk storage
  • Size: > 100 GB
  • Access time: 5,000,000 ns (5 ms)
  • Bandwidth: 20-150 MB/s
  • Managed by the OS/operator

Cache Terms
• Locality of reference
  • Temporal
  • Spatial
• Cache block (cache line)
• Replacement algorithm
• Read/write hit/miss
• Write-through
• Write-back (copy-back)
• Dirty bit (modified bit)
• Valid bit
  • The valid bit is set every time a line is loaded into the cache after a cache miss, and can be reset only by the flush line

Cache mapping functions
• Direct mapping
• Fully associative mapping
• Set-associative mapping
  • N-way associative mapping

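Direct mapping
Figure 5.15: Direct-mapped cache. [Main memory has 4096 blocks (Block 0 to Block 4095); the cache has 128 blocks (Block 0 to Block 127), each stored with a tag. Memory blocks 0, 128, 256, ... map to cache block 0; blocks 1, 129, 257, ... map to cache block 1; and so on up to cache block 127. The main memory address is divided into a 5-bit tag, a 7-bit block field, and a 4-bit word field.]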
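The 5/7/4 address split of Figure 5.15 can be written out as bit operations. This is a minimal sketch under the figure's field widths; the function name is hypothetical.

```python
TAG_BITS, BLOCK_BITS, WORD_BITS = 5, 7, 4  # field widths from Figure 5.15


def direct_map_fields(addr):
    """Split a 16-bit main memory address into (tag, cache block, word)."""
    word = addr & ((1 << WORD_BITS) - 1)
    block = (addr >> WORD_BITS) & ((1 << BLOCK_BITS) - 1)
    tag = addr >> (WORD_BITS + BLOCK_BITS)
    return tag, block, word


# Memory block j starts at address j * 16 and lands in cache block j mod 128.
for memory_block in (1, 129, 257):
    tag, block, word = direct_map_fields(memory_block << WORD_BITS)
    print(f"memory block {memory_block}: tag={tag}, cache block={block}")
```

Memory blocks 1, 129, and 257 all compete for cache block 1 and are told apart only by their tags, which is why a direct-mapped cache needs no replacement algorithm.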

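Fully associative mapping
Figure 5.16: Associative-mapped cache. [Any of the 4096 main memory blocks (Block 0 to Block 4095) can be placed in any of the 128 cache blocks, each stored with a tag. The main memory address is divided into a 12-bit tag and a 4-bit word field.]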

Set-associative mapping
Figure 5.17: Set-associative-mapped cache with two blocks per set. [The 128 cache blocks, each stored with a tag, are grouped into 64 sets (Set 0 to Set 63) of two blocks each. Main memory blocks 0, 64, 128, ... map to Set 0; blocks 1, 65, 129, ... map to Set 1; and so on. The main memory address is divided into a 6-bit tag, a 6-bit set field, and a 4-bit word field.]
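For comparison with direct mapping, the 6/6/4 split of Figure 5.17 can be sketched the same way. The function name is hypothetical and the code only shows the address decomposition, not a full cache.

```python
TAG_BITS, SET_BITS, WORD_BITS = 6, 6, 4  # field widths from Figure 5.17


def set_assoc_fields(addr):
    """Split a 16-bit main memory address into (tag, set, word)."""
    word = addr & ((1 << WORD_BITS) - 1)
    set_index = (addr >> WORD_BITS) & ((1 << SET_BITS) - 1)
    tag = addr >> (WORD_BITS + SET_BITS)
    return tag, set_index, word


# Memory block j maps to set j mod 64; either block of that set may hold it.
for memory_block in (0, 64, 128, 129):
    tag, set_index, _ = set_assoc_fields(memory_block << WORD_BITS)
    print(f"memory block {memory_block}: tag={tag}, set={set_index}")
```

Blocks 0, 64, and 128 share Set 0, but because the set holds two blocks, two of them can be resident at once; the tag comparison is done against both blocks of the selected set.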

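Replacement algorithm
• LRU (least recently used)
  • Replace the block that has gone unused for the longest time
• Random
  • Replace a randomly chosen block
• FIFO (first in, first out)
  • Replace the oldest block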
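The difference between LRU and FIFO shows up on even a short reference string. The following is a minimal sketch of a small fully associative cache, not a model of any particular processor; the function name and the reference string are made up for illustration.

```python
from collections import OrderedDict


def simulate(block_refs, capacity, policy):
    """Count hits for a tiny fully associative cache.

    policy is "LRU" or "FIFO". The OrderedDict keeps the next victim
    at the front: least recently used for LRU, oldest for FIFO.
    """
    cache = OrderedDict()
    hits = 0
    for block in block_refs:
        if block in cache:
            hits += 1
            if policy == "LRU":
                cache.move_to_end(block)  # a hit refreshes recency
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the front entry
            cache[block] = True
    return hits, list(cache)


refs = [1, 2, 3, 1, 4, 2]
for policy in ("LRU", "FIFO"):
    hits, contents = simulate(refs, 3, policy)
    print(policy, "hits:", hits, "final contents:", contents)
```

On this particular string FIFO happens to score one more hit than LRU, a reminder that no policy wins on every access pattern; LRU relies on temporal locality being present.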

68040 cache
• 4 KB data cache
• 4 KB instruction cache
• Each cache contains 64 sets
  • Every set contains 4 blocks
  • 4-way associative mapping
• 1 cache block contains 4 long words
  • 1 valid bit per cache block
  • 1 dirty bit per long word
• Write-back or write-through
• Random replacement

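Figure 5.23: Data cache organization in the 68040 microprocessor. [The 32-bit address is divided into a 22-bit tag, a 6-bit set field, and a 4-bit byte field. In the example, the address bits 0000 0000 0010 1111 1100 1000 0000 1000 give tag 000BF2, set 0, byte 8. Each of the 64 sets (Set 0 to Set 63) holds four blocks (Block 0 to Block 3), each with a tag, a valid bit (v), and dirty bits (d). In Set 0, the comparison against tag 0CA020 fails (Miss = 1, Hit = 0) and the comparison against tag 000BF2 succeeds (Miss = 0, Hit = 1).]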
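The example address in Figure 5.23 can be checked by hand or with a few lines of code. This sketch uses the figure's 22/6/4 field widths and its 32 address bits; the function names are hypothetical, and the lookup models only the tag comparison for the two tags the figure names.

```python
TAG_BITS, SET_BITS, BYTE_BITS = 22, 6, 4  # field widths from Figure 5.23


def decode_68040_address(addr):
    """Split a 32-bit address into (tag, set, byte)."""
    byte = addr & ((1 << BYTE_BITS) - 1)
    set_index = (addr >> BYTE_BITS) & ((1 << SET_BITS) - 1)
    tag = addr >> (BYTE_BITS + SET_BITS)
    return tag, set_index, byte


def lookup(set_entries, tag):
    """Return the index of the valid block whose tag matches, else None."""
    for index, (valid, stored_tag) in enumerate(set_entries):
        if valid and stored_tag == tag:
            return index
    return None


# The 32 address bits shown in the figure.
addr = 0b0000_0000_0010_1111_1100_1000_0000_1000
tag, set_index, byte = decode_68040_address(addr)
print(f"tag={tag:06X} set={set_index} byte={byte}")

# Illustrative contents for one set: only the two tags named in the figure.
set0 = [(True, 0x0CA020), (True, 0x000BF2)]
print("hit" if lookup(set0, tag) is not None else "miss")
```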

ARM710T cache
• A single cache for both data and instructions
• 4 KB cache
• 64 sets
  • 1 set contains 4 blocks
  • 4-way associative mapping
• 1 cache block contains four 32-bit words = 16 bytes
• Write-through
• Random replacement

Pentium III cache
• L1 cache
  • 16 KB data cache
    • 4-way
    • Write-back or write-through
  • 16 KB instruction cache
    • 2-way
    • No write policy needed, because it holds only code
• L2 cache
  • 512 KB
  • 4-way
  • Write-back or write-through
• Coppermine
  • L2 cache built into the CPU
  • 256 KB
  • 8-way

Pentium 4 cache
• L1 cache
  • 8 KB data cache
  • 4-way
  • Each block contains 64 bytes
  • Write-through
• L2 cache
  • Within the CPU
  • 256 KB
  • 8-way
  • Each block contains 128 bytes
  • Write-back
• L3 cache
  • Server-based CPUs

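Figure 5.24: Caches and external connections in the Pentium III processor. [The processing unit connects to an L1 data cache and an L1 instruction cache. A bus interface unit connects over the cache bus to the L2 cache and over the system bus to main memory and input/output.]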

Memory Performance
• Every memory module has an address buffer register (ABR) and a data buffer register (DBR)
• Consecutive words in a single module
• Consecutive words in consecutive modules
  • Interleaved memory
  • CPU references to consecutive memory addresses access multiple modules concurrently (the low-order address bits select the module)
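The two address layouts differ only in which address bits select the module. The sketch below is an illustration with made-up sizes (4 modules, 8-bit addresses); the function name and parameters are hypothetical.

```python
def locate(addr, addr_bits, module_bits, interleaved):
    """Return (module, address within module) for a memory address.

    interleaved=True:  low-order bits select the module, so consecutive
                       words fall in consecutive modules.
    interleaved=False: high-order bits select the module, so consecutive
                       words fall in the same module.
    """
    within_bits = addr_bits - module_bits
    if interleaved:
        module = addr & ((1 << module_bits) - 1)
        within = addr >> module_bits
    else:
        module = addr >> within_bits
        within = addr & ((1 << within_bits) - 1)
    return module, within


for interleaved in (False, True):
    modules = [locate(a, 8, 2, interleaved)[0] for a in range(4)]
    print("interleaved" if interleaved else "single module", modules)
```

With interleaving, a block of four consecutive words touches four different modules, so the accesses can overlap instead of queuing on one module.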

Calculating the miss penalty
• See the examples on pp. 5-54 to 5-59
• T_ave = hC + (1 − h)M, where h is the hit rate, M is the miss penalty, and C is the cache access time
• T_ave = h1·C1 + (1 − h1)·h2·C2 + (1 − h1)(1 − h2)·M, where
  • h1 is the hit rate of the L1 cache
  • C1 is the access time of the L1 cache
  • h2 is the hit rate of the L2 cache
  • C2 is the access time of the L2 cache
  • M is the access time of main memory
• Note: if h1 = h2 = 0.9, the fraction of accesses that miss in both caches is (1 − 0.9)(1 − 0.9) = 0.01 = 1%
  • This means that with a two-level cache and a 0.9 hit rate at each level, only about 1% of memory accesses pay the main memory penalty
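Both formulas are easy to evaluate numerically. The sketch below follows the slide's symbols; the function names and the timing values (1, 10, and 100 cycles) are made-up numbers for illustration only, not figures from the course examples.

```python
def t_ave_one_level(h, C, M):
    """T_ave = hC + (1 - h)M for a single cache level."""
    return h * C + (1 - h) * M


def t_ave_two_level(h1, C1, h2, C2, M):
    """T_ave = h1*C1 + (1 - h1)*h2*C2 + (1 - h1)*(1 - h2)*M."""
    return h1 * C1 + (1 - h1) * h2 * C2 + (1 - h1) * (1 - h2) * M


# Hypothetical timings in clock cycles.
print("one level :", round(t_ave_one_level(0.9, 1, 100), 3))
print("two levels:", round(t_ave_two_level(0.9, 1, 0.9, 10, 100), 3))

# Fraction of accesses that miss in both caches and reach main memory.
print("to main memory:", round((1 - 0.9) * (1 - 0.9), 4))
```

Under these assumed timings the second cache level cuts the main memory term from 10% of accesses to about 1%, which is where most of the improvement comes from.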

Other methods to reduce miss penalty
• Write buffer (an improvement for write-through)
  • Built into the CPU
  • The CPU writes to the write buffer rather than to memory, so it does not need to wait for the memory write to complete
• Prefetch
  • The compiler inserts prefetch instructions (by analyzing the code)
• Lockup-free cache
  • Allows the data cache to continue to supply cache hits during a miss
  • Helpful for processors that support out-of-order completion (e.g., via Tomasulo's algorithm)

Virtual memory
• Virtual address (logical address)
• MMU (built into the CPU)
• Physical address
• Page table (in main memory)
• Page frame
• Address translation
• TLB
  • A cache inside the CPU that holds recently used address translations
• Page fault
• Replacement algorithm
  • LRU

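Figure 5.26: Virtual memory organization. [The processor sends virtual addresses to the MMU, which sends physical addresses to the cache and main memory; data flows between the processor, cache, and main memory. Main memory and disk storage exchange data by DMA transfer.]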

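Figure 5.27: Virtual-memory address translation. [The virtual address from the processor consists of a virtual page number and an offset. The page table base register points to the starting address of the page table; adding the virtual page number gives the page table address, which points to an entry in the page table. The entry holds control bits (valid bit, dirty bit, access rights of the program to the page) and the page frame in memory, which points to the starting address of the physical page. The physical address in main memory is the page frame combined with the offset, and points to a byte within the physical page.]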

Intel IA-32 Processor’s Memory management

Intel IA-32 Page Translation
The entries in the page directory point to page tables, and the entries in a page table point to pages in physical memory. This paging method can be used to address up to 2^20 pages, which spans a linear address space of 2^32 bytes (4 GBytes).

To select the various table entries, the linear address is divided into three sections:
• Page-directory entry: bits 22 through 31 provide an offset to an entry in the page directory. The selected entry provides the base physical address of a page table.
• Page-table entry: bits 12 through 21 of the linear address provide an offset to an entry in the selected page table. This entry provides the base physical address of a page in physical memory.
• Page offset: bits 0 through 11 provide an offset to a physical address in the page.
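The three-way division of the linear address can be sketched directly from the bit ranges above. The function name and the sample address are hypothetical; the code shows only the field extraction, not the table walk itself.

```python
def split_linear_address(addr):
    """Split a 32-bit linear address into (directory, table, offset).

    Bits 31-22 index the page directory, bits 21-12 index the selected
    page table, and bits 11-0 are the offset within the 4 KB page.
    """
    directory = (addr >> 22) & 0x3FF
    table = (addr >> 12) & 0x3FF
    offset = addr & 0xFFF
    return directory, table, offset


directory, table, offset = split_linear_address(0x00403025)
print(f"directory={directory} table={table} offset=0x{offset:03X}")

# 2^10 directory entries x 2^10 table entries = 2^20 pages of 2^12 bytes.
print((1 << 10) * (1 << 10) == 1 << 20)
```

The 10 + 10 + 12 split is what makes the figures in the text work out: 2^20 pages of 2^12 bytes each cover the full 2^32-byte linear address space.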

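Figure 5.28: Use of an associative-mapped TLB. [The virtual address from the processor consists of a virtual page number and an offset. The TLB, which stores the virtual-to-physical address mappings the CPU has just used, holds entries made up of a virtual page number, control bits, and the page frame in memory. The virtual page number is compared (=?) with the TLB entries: on a match (Yes) it is a hit, otherwise (No) a miss. On a hit, the page frame and the offset form the physical address in main memory.]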

Figure 5.30: Organization of one surface of a hard disk. [Labels: sector 0, track 0; sector 0, track 1; sector 3, track n.]

Disk Access Time • Seek time • Rotation time (latency time) • Transfer time

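Figure 5.31: Disks connected to the system bus. [The processor, main memory, and a disk controller are attached to the system bus; the disk controller connects to two disk drives.]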

RAID
• Redundant Array of Inexpensive Disks
• RAID 0: data striping, no redundancy
  • Level 0 stripes data at the block level
• RAID 1: mirroring (shadowing)
• RAID 01 (RAID 0+1): mirrored stripes
• RAID 2: error-correcting coding with Hamming code
  • Not a typical implementation and rarely used. Level 2 stripes data at the bit level rather than the block level.
• RAID 3: bit-interleaved parity
  • Provides byte-level striping with a dedicated parity disk. Level 3, which cannot service multiple simultaneous requests, is also rarely used.
• RAID 4: dedicated parity drive
  • A commonly used implementation of RAID. Level 4 provides block-level striping (like Level 0) with a parity disk. If a data disk fails, the parity data is used to create a replacement disk. A disadvantage of Level 4 is that the parity disk can create write bottlenecks.
• RAID 5: block-interleaved distributed parity
  • Provides data striping at the block level and also stripes the error-correction (parity) information across the disks. This results in excellent performance and good fault tolerance. Level 5 is one of the most popular implementations of RAID.
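The parity used by Levels 3, 4, and 5 is a bytewise XOR across the data blocks of a stripe, which is why a single failed disk can be rebuilt. The following is a minimal sketch of that idea with hypothetical function names and made-up two-byte blocks; it ignores striping layout and everything else a real RAID controller does.

```python
from functools import reduce


def parity(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(surviving_blocks, parity_block):
    """Recover the single missing block from the survivors and the parity."""
    return parity(surviving_blocks + [parity_block])


# Made-up contents of one stripe across three data disks.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
p = parity(data)

# Suppose the second disk fails: XOR the survivors with the parity block.
recovered = rebuild([data[0], data[2]], p)
print(recovered == data[1])
```

Every write to a data block also changes the parity block, which is the source of the Level 4 bottleneck on the dedicated parity disk and the reason Level 5 spreads parity over all disks.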

Compact Disc (CD)
• CD-ROM 1X: 150 KB/s
• CD-ROM 40X: 150 KB/s × 40 = 6,000 KB/s (about 6 MB/s)
• DVD (Digital Versatile Disc)
  • DVD+R
    • A non-rewritable format, compatible with about 89% of all DVD players and most DVD-ROM drives
  • DVD+R/W
    • Has some "better" features than DVD-R/W, such as lossless linking and both CAV and CLV writing
  • DVD-R
    • A non-rewritable format, compatible with about 93% of all DVD players and most DVD-ROM drives
  • DVD-R/W
    • The first DVD recording format released that was compatible with standalone DVD players
