Presentation Transcript


  1. Computer Architecture Principles, Dr. Mike Frank. CDA 5155 (UF) / CA 714-R (NTU), Summer 2003. Module #33: Main Memory & Virtual Memory

  2. Main Memory & Virtual Memory

  3. Overview
  • Today: Cont. H&P ch. 5 - Memory Hierarchy
    • 5.1 Introduction
    • 5.2 Basics of caches
    • 5.3 Cache performance
    • 5.4 Reducing cache miss penalty
    • 5.5 Reducing cache miss rate
    • 5.6 Using parallelism
    • 5.7 Reducing hit time
    • 5.8 Main memory & organizations for improving performance
    • 5.9 Memory technology
    • 5.10 Virtual memory
    • 5.11 Protection & virtual memory examples
    • 5.12 Design of Memory Hierarchies
    • 5.13 Alpha 21264 Memory Hierarchy
    • 5.14 Sony PS2 Emotion Engine
    • 5.15 Sun Fire 6800 Server
    • 5.16 Fallacies & Pitfalls
    • 5.17 Concluding Remarks
    • 5.18 Historical Perspective

  4. 5.8 Main Memory
  • Some definitions:
  • Bandwidth (BW): bytes read or written per unit time
  • Latency: described by
    • Access time: delay from when an access is initiated until it completes (for reads: from presenting the address until the result is ready)
    • Cycle time: minimum interval between separate requests to memory
  • Address lines: separate bus CPU => Mem to carry addresses (not usually counted in BW figures)
  • RAS (Row Access Strobe)
    • First half of address, sent first
  • CAS (Column Access Strobe)
    • Second half of address, sent second
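  • A minimal sketch (with hypothetical timing numbers, not taken from the slides) of why cycle time, not access time, limits sustained bandwidth:

      #include <stdio.h>

      int main(void) {
          double access_time_ns  = 60.0;   /* assumed: address presented -> data ready */
          double cycle_time_ns   = 110.0;  /* assumed: minimum spacing between requests */
          double bus_width_bytes = 8.0;    /* assumed: 64-bit data bus */

          /* Back-to-back requests are limited by the cycle time, not the access time. */
          double sustained_bw = bus_width_bytes / (cycle_time_ns * 1e-9); /* bytes/s */
          printf("Latency of one read : %.0f ns\n", access_time_ns);
          printf("Sustained bandwidth : %.2f MB/s\n", sustained_bw / 1e6);
          return 0;
      }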

  5. RAS vs. CAS
  • DRAM bit-cell array:
    1. RAS selects a row
    2. Parallel readout of all row data
    3. CAS selects a column to read
    4. Selected bit written to memory bus
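  • A small sketch of the row/column address split (the array size, and hence the bit widths, are illustrative assumptions):

      #include <stdio.h>
      #include <stdint.h>

      /* Hypothetical 16-Mbit (4096 x 4096) bit-cell array: 12 row bits + 12 column bits. */
      #define ROW_BITS 12
      #define COL_BITS 12

      int main(void) {
          uint32_t cell = 0x00ABCDEF & ((1u << (ROW_BITS + COL_BITS)) - 1);
          uint32_t row  = cell >> COL_BITS;              /* first half, latched by RAS  */
          uint32_t col  = cell & ((1u << COL_BITS) - 1); /* second half, latched by CAS */
          printf("cell 0x%06X -> row %u (RAS), column %u (CAS)\n", cell, row, col);
          return 0;
      }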

  6. Types of Memory
  • DRAM (Dynamic Random Access Memory)
    • Cell design needs only 1 transistor per bit stored.
    • Cell charges leak away and may dynamically (over time) drift from their initial levels.
    • Requires periodic refreshing to correct the drift, e.g. every 8 ms.
    • Time spent refreshing is kept to <5% of BW.
  • SRAM (Static Random Access Memory)
    • Cell voltages are statically (unchangingly) tied to power-supply references. No drift, no refresh.
    • But needs 4-6 transistors per bit.
  • DRAM is roughly 4-8x larger (capacity per chip), 8-16x slower, and 8-16x cheaper per bit than SRAM.
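  • A rough check of the refresh-overhead budget (the row count and per-row refresh time are illustrative assumptions; only the 8 ms period comes from the slide):

      #include <stdio.h>

      int main(void) {
          double rows               = 4096.0;  /* assumed number of rows to refresh   */
          double refresh_per_row_ns = 60.0;    /* assumed time to refresh one row     */
          double interval_ms        = 8.0;     /* refresh period from the slide       */

          double busy_ms = rows * refresh_per_row_ns * 1e-6;
          printf("Refresh overhead: %.2f%% of memory bandwidth\n",
                 100.0 * busy_ms / interval_ms);   /* ~3%, under the ~5% budget */
          return 0;
      }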

  7. Typical DRAM Organization (64 Mbit) [figure: address split into low 14 bits and high 14 bits]

  8. Amdahl/Case Rule
  • Memory size (and I/O bandwidth) should grow linearly with CPU speed (throughput).
  • Typical: 1 MB of main memory and 1 Mbit/s of I/O bandwidth per 1 MIPS of CPU performance.
  • It then takes a fairly constant ~8 seconds to scan the entire memory (if memory bandwidth = I/O bandwidth, 4 bytes/load, 1 load per 4 instructions, and latency is not a problem).
  • Moore's Law:
    • DRAM size doubles every 18 months (up ~60%/yr)
    • Roughly tracks processor speed improvements
  • Unfortunately, DRAM latency has only decreased ~7%/yr! Latency is a big deal.
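  • A worked arithmetic sketch of the ~8-second figure, under one consistent reading of the parenthetical (scan the memory at the rule's I/O rate); the 100-MIPS CPU is a hypothetical choice and cancels out:

      #include <stdio.h>

      int main(void) {
          double mips          = 100.0;        /* assumed CPU throughput          */
          double mem_bytes     = mips * 1e6;   /* rule: 1 MB per MIPS             */
          double io_bits_per_s = mips * 1e6;   /* rule: 1 Mbit/s per MIPS         */

          /* Scanning at the I/O rate: 8 bits per byte, so the MIPS term cancels. */
          double scan_s = (mem_bytes * 8.0) / io_bits_per_s;
          printf("Time to scan %g MB at %g Mbit/s: %.0f s\n",
                 mem_bytes / 1e6, io_bits_per_s / 1e6, scan_s);
          return 0;
      }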

  9. Some DRAM trend data
  • Since 1998, the rate of increase in chip capacity has slowed to 2x per 2 years:
    • 128 Mb in 1998
    • 256 Mb in 2000
    • 512 Mb in 2002

  10. Latency Improvement Techniques
  • To reduce capacitive delay: break a memory array of n bits into m sub-blocks, each with n/m bits.
    • Reduces the C load (& R) on each bit line/word line
    • From Θ(n^(1/2)) to Θ((n/m)^(1/2)), a reduction by Θ(m^(1/2))
  • Adds an extra indexing level: block row/column lines!
    • These also have resistance & capacitance Θ(m^(1/2))
  • Minimum total RC delay is attained when m = Θ(n^(1/2)):
    • R = Θ(n^(1/4)), C = Θ(n^(1/4)), RC = Θ(n^(1/2))
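  • A short derivation sketch of that optimum, taking the slide's per-segment R and C scalings as given and modeling each segment's delay as proportional to its R·C product:

      Sub-block lines give RC ∝ (n/m), block-level lines give RC ∝ m, so
      \[
        T(m) \;\propto\; \frac{n}{m} + m,
        \qquad
        \frac{dT}{dm} = -\frac{n}{m^2} + 1 = 0
        \;\Rightarrow\; m = \Theta(n^{1/2}),
        \qquad
        T_{\min} \propto 2\sqrt{n} = \Theta(n^{1/2}).
      \]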

  11. Latency Improv. Techs., cont.
  • Really, whenever you see an RC delay that far exceeds the speed-of-light delay,
    • You know you're doing something wrong!
    • Communication just isn't that hard!
  • An alternative technique:
    • Use fixed-size blocks w. small RC
    • Periodically rebuffer along the block access lines (wire pipelining)
    • Total propagation delay scales w. wire length, thus as Θ(n^(1/2))
  • Additional advantage (vs. prev. slide):
    • Wire pipelining reduces the cycle time of memory dramatically!
    • Can equal the time to access small blocks ≈ CPU cycle time

  12. Latency Improv. Techs., cont.
  • Once you are close to the lightspeed latency limit, there's only one more thing you can do...
  • Move the memory closer to the processor!
    • Put it on the same chip as the processor (PIM approach)
      • Also greatly improves memory↔CPU bandwidth
    • Make bits smaller, so the distance between bits decreases
    • Stack memory arrays in multiple layers:
      • Taken to the extreme, reduces latency from Θ(n^(1/2)) to Θ(n^(1/3))
      • This is a latency reduction by a factor of Θ(n^(1/6)),
      • Order-10 speedup for 1 Mb, order-100 speedup for 1 Tb!
  • But, if memory access is heavily pipelined,
    • The 3D arrangement may cause an overheating problem!
    • Can fix this using reversible memories (cf. my research)
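  • A quick numeric check of the Θ(n^(1/6)) factor for the two sizes mentioned (a sketch only; constant factors are ignored):

      #include <stdio.h>
      #include <math.h>

      int main(void) {
          /* 1 Mb = 2^20 bits, 1 Tb = 2^40 bits */
          double sizes_bits[] = { 1048576.0, 1099511627776.0 };
          for (int i = 0; i < 2; i++)
              printf("n = %.3g bits -> n^(1/6) ~ %.0f\n",
                     sizes_bits[i], pow(sizes_bits[i], 1.0 / 6.0));
          return 0;   /* prints ~10 and ~102, i.e. order 10 and order 100 */
      }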

  13. Bandwidth Improvement Techs.
  • RAS/CAS strobing for fast readout of row data
    • Fast page mode: 1 RAS followed by multiple CASs
  • Increase the number of parallel paths (data lines) between CPU and memory
    • Facilitated by on-chip memory, and/or row-parallel readout
    • Limited by CPU and memory widths and areas, and (when CPU and memory are not on the same chip) by pin count
  • Wire pipelining
    • For wires between CPU & memory, & wires across the mem. array!
  • Bit-device technology improvements
    • Improve max. transistor speed
    • Reduces cycle time in wire-pipelined designs

  14. Bandwidth Improvement [figure: memory organizations labeled 1 word wide, 1 word wide, and 4 words wide]
  • (c) Reduces cycle time for mem. access (like having multiple functional units instead of a single pipelined unit)

  15. Interleaved Memory
  • Adjacent words are found in different mem. banks
    • Banks can be accessed in parallel
    • Overlaps the latencies for accessing each word
  • Can use a narrow bus to return the accessed words sequentially
  • Fits well with sequential access patterns
    • E.g., of words in cache blocks
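  • A minimal sketch of the word-to-bank mapping (the bank count and word size are illustrative assumptions):

      #include <stdio.h>
      #include <stdint.h>

      #define NUM_BANKS  4   /* assumed number of banks        */
      #define WORD_BYTES 8   /* assumed 64-bit memory words    */

      int main(void) {
          /* Consecutive words land in consecutive banks, so a sequential
             cache-block fill can proceed in all banks concurrently. */
          for (uint64_t addr = 0x1000; addr < 0x1000 + 8 * WORD_BYTES; addr += WORD_BYTES) {
              uint64_t word = addr / WORD_BYTES;
              printf("addr 0x%04llx -> bank %llu, index %llu within bank\n",
                     (unsigned long long)addr,
                     (unsigned long long)(word % NUM_BANKS),
                     (unsigned long long)(word / NUM_BANKS));
          }
          return 0;
      }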

  16. ROM, NVRAM, Flash
  • You should also be aware of the following technologies, used for embedded systems, PC BIOS memory, etc.:
  • ROM (Read-Only Memory)
  • Nonvolatile RAMs, such as Flash
    • NVRAMs require no power to maintain state
    • Reading flash is near DRAM speeds
    • Writing is 10-100x slower than with DRAM
    • Frequently used for upgradeable embedded SW

  17. DRAM variations
  • SDRAM - Synchronous DRAM
    • DRAM internal operation synchronized by a clock signal provided on the memory bus
  • RDRAM - RAMBUS (Inc.) DRAM
    • Proprietary DRAM interface technology:
      • on-chip interleaving / multi-bank technology
      • a high-speed, packet-switched (split-transaction) bus interface
      • byte-wide interface, synchronous, dual-rate
    • Licensed to many chip & CPU makers
    • Higher BW, more $ than generic SDRAM
  • DRDRAM - “Direct” RDRAM (2nd ed. spec.)
    • Separate row and column address/command buses
    • Higher BW (18-bit data, more banks, faster clk)

  18. Inline Memory Modules (IMMs)
  • Boards w. multiple memory chips (4-16)
  • Typically organized at least 8 bytes (64b) wide
  • Some types:
    • SIMM - Single IMM
    • DIMM - Dual IMM
    • RIMM - RAMBUS (Inc.) IMM
      • Like DIMMs, but w. RDRAM; proprietary/incompatible

  19. Overview
  • Today: Cont. H&P ch. 5 - Memory Hierarchy
    • 5.1 Introduction
    • 5.2 Basics of caches
    • 5.3 Cache performance
    • 5.4 Reducing cache miss penalty
    • 5.5 Reducing cache miss rate
    • 5.6 Using parallelism
    • 5.7 Reducing hit time
    • 5.8 Main memory & organizations for improving performance
    • 5.9 Memory technology
    • 5.10 Virtual memory
    • 5.11 Protection & virtual memory examples
    • 5.12 Design of Memory Hierarchies
    • 5.13 Alpha 21264 Memory Hierarchy
    • 5.14 Sony PS2 Emotion Engine
    • 5.15 Sun Fire 6800 Server
    • 5.16 Fallacies & Pitfalls
    • 5.17 Concluding Remarks
    • 5.18 Historical Perspective

  20. Virtual Memory (5.10)
  • Basic idea: use physical memory as a kind of cache for pages of “virtual memory” that may also reside on disk.
  • Usually combined with protection schemes that separate the address spaces of different user processes and the OS.

  21. Addressing Virtual Memories

  22. Paging vs. Segmentation
  • Pages: fixed-size blocks, usu. 4 KB-64 KB
    • Simple to implement!
  • Segments: variable-size blocks, up to 2^16 to 2^32 bytes
    • Addressed by 2 words: segment number + offset
  • Can be combined: paged segments.
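  • A minimal sketch contrasting the two address forms (the page size and field values are illustrative assumptions):

      #include <stdio.h>
      #include <stdint.h>

      #define PAGE_SHIFT 12  /* assumed 4 KB pages */

      /* A segmented address needs two words: segment number + offset. */
      typedef struct { uint32_t segment; uint32_t offset; } seg_addr_t;

      int main(void) {
          /* A paged address is a single word, split implicitly into page number + offset. */
          uint32_t vaddr = 0x00403ABC;
          printf("paged:     page %u, offset 0x%03X\n",
                 vaddr >> PAGE_SHIFT, vaddr & ((1u << PAGE_SHIFT) - 1));

          /* The segmented form carries its two parts explicitly and separately. */
          seg_addr_t s = { .segment = 5, .offset = 0x3ABC };
          printf("segmented: segment %u, offset 0x%X\n", s.segment, s.offset);
          return 0;
      }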

  23. The Four Questions, for VM
  A common set of answers:
  • Placement:
    • Anywhere in main memory (fully associative)
  • Identification:
    • Page table, or a tree-structured variant
  • Replacement:
    • Usually LRU, via a “used-lately?” bit
  • Write strategy:
    • Always use write-back! (Disks are slooooooow!)
    • A dirty bit is usually used (for writable pages)
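  • A minimal sketch of a page-table entry carrying the bits these answers imply (the field layout is an illustrative assumption, not any particular architecture's format):

      #include <stdint.h>
      #include <stdbool.h>

      /* One page-table entry: fully associative placement means any physical
         frame number may appear here; the flag bits support the policies above. */
      typedef struct {
          uint32_t frame    : 20;  /* physical frame number (identification)            */
          uint32_t valid    : 1;   /* page currently resident in main memory?           */
          uint32_t used     : 1;   /* "used lately?" bit consulted by LRU-ish replacement */
          uint32_t dirty    : 1;   /* set on write; write back to disk only if set      */
          uint32_t writable : 1;   /* protection: is the page writable at all?          */
      } pte_t;

      /* On a store to the page, hardware (or the miss handler) would do: */
      static inline void touch_on_write(pte_t *pte) {
          pte->used  = true;   /* feed the replacement policy      */
          pte->dirty = true;   /* must be written back on eviction */
      }

      int main(void) {
          pte_t pte = { .frame = 0x1234, .valid = 1, .writable = 1 };
          touch_on_write(&pte);
          return pte.dirty ? 0 : 1;
      }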

  24. Page Indexing Issues
  • Page tables can be large!
    • Consider a 40-bit (1 TB) address space w. 4 KB pages → 256M pages!
    • One 4-byte word per page-table entry → a 1 GB page table!
  • Usually, a map for all virtual pages is not stored
    • Only a map for pages that are currently in use
    • Even then, the page table is often too big to keep all in cache
  • Also, the in-use pages might not be contiguous
    • E.g., text (code) in low memory, stack in high memory
  • The page table usually must use some kind of sparse data structure
    • For fast searching, sometimes a tree structure is used
    • Hash tables are also possible
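  • The arithmetic behind those numbers, as a quick sketch:

      #include <stdio.h>
      #include <stdint.h>

      int main(void) {
          uint64_t vaddr_bits = 40;     /* 1 TB virtual address space        */
          uint64_t page_bytes = 4096;   /* 4 KB pages -> 12 offset bits      */
          uint64_t pte_bytes  = 4;      /* one 4-byte entry per virtual page */

          uint64_t num_pages  = (1ULL << vaddr_bits) / page_bytes;  /* 2^28 = 256M */
          uint64_t table_size = num_pages * pte_bytes;              /* 2^30 = 1 GB */

          printf("pages: %llu M, flat page table: %llu MB\n",
                 (unsigned long long)(num_pages >> 20),
                 (unsigned long long)(table_size >> 20));
          return 0;
      }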

  25. Fast Address Translation
  • To cache the results of page-table lookups, a special, separate cache is sometimes used.
    • Prevents address translation from interfering with other caching, and increases total cache BW.
    • Similar to the arguments for a separate instruction cache.
  • This is called a translation lookaside buffer (TLB), or just translation buffer (TB).
    • Caches the contents of entries in the page table.
  • When a page is moved, its page-table entry changes, and the corresponding entry in the TLB (if any) must be kept consistent with the change.
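  • A minimal sketch of a direct-mapped TLB lookup with a page-table fallback (the sizes, the direct-mapped organization, and the stub page-table walk are illustrative assumptions, not the Alpha's actual design):

      #include <stdio.h>
      #include <stdint.h>
      #include <stdbool.h>

      #define PAGE_SHIFT  12   /* assumed 4 KB pages            */
      #define TLB_ENTRIES 64   /* assumed direct-mapped TLB size */

      typedef struct { uint64_t vpn, pfn; bool valid; } tlb_entry_t;
      static tlb_entry_t tlb[TLB_ENTRIES];

      /* Stand-in for the real page-table walk (the slow path on a TLB miss). */
      static uint64_t page_table_lookup(uint64_t vpn) { return vpn + 0x1000; }

      static uint64_t translate(uint64_t vaddr) {
          uint64_t vpn = vaddr >> PAGE_SHIFT;
          tlb_entry_t *e = &tlb[vpn % TLB_ENTRIES];
          if (!(e->valid && e->vpn == vpn)) {   /* TLB miss: refill from the page table */
              *e = (tlb_entry_t){ vpn, page_table_lookup(vpn), true };
          }
          /* Hit (or freshly refilled entry): form the physical address. */
          return (e->pfn << PAGE_SHIFT) | (vaddr & ((1ULL << PAGE_SHIFT) - 1));
      }

      int main(void) {
          printf("0x%llx -> 0x%llx\n", 0x403ABCULL,
                 (unsigned long long)translate(0x403ABC));
          return 0;
      }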

  26. TLB Example: Alpha 21264

  27. An Entire Cache System [figure: Level 1 Cache, Translation Lookaside Buffer, Level 2 Cache]
  • Note: Virtually-addressed L1 cache, but physically-addressed L2 cache.

  28. Multi-level Virtual Addressing
  • From the DEC (now Compaq) Alpha 21264
