
Operating Systems & Memory Systems: Address Translation



  1. Operating Systems & Memory Systems: Address Translation Computer Science 220 ECE 252 Professor Alvin R. Lebeck Fall 2006

  2. Outline • Finish Main Memory • Address Translation • basics • 64-bit Address Space • Managing memory • OS Performance Throughout • Review Computer Architecture • Interaction with Architectural Decisions CPS 220

  3. Fast Memory Systems: DRAM specific • Multiple CAS accesses within one row: several names (page mode) • 64 Mbit DRAM: cycle time = 100 ns, page mode = 20 ns • New DRAMs to address the gap; what will they cost, will they survive? • Synchronous DRAM: provide a clock signal to the DRAM; transfers synchronous with the system clock • RAMBUS: reinvent the DRAM interface (Intel will use it) • Each chip a module vs. a slice of memory • Short bus between CPU and chips • Does own refresh • Variable amount of data returned • 1 byte / 2 ns (500 MB/s per chip) • Cached DRAM (CDRAM): keep an entire row in SRAM CPS 220

  4. Main Memory Summary • Big DRAM + Small SRAM = Cost Effective • Cray C-90 uses all SRAM (how many sold?) • Wider Memory • Interleaved Memory: for sequential or independent accesses • Avoiding bank conflicts: SW & HW • DRAM specific optimizations: page mode & Specialty DRAM, CDRAM • Niche memory or main memory? • e.g., Video RAM for frame buffers, DRAM + fast serial output • IRAM: Do you know what it is? CPS 220

  5. Review: Reducing Miss Penalty Summary • Five techniques • Read priority over write on miss • Subblock placement • Early Restart and Critical Word First on miss • Non-blocking Caches (Hit Under Miss) • Second Level Cache • Can be applied recursively to Multilevel Caches • Danger is that time to DRAM will grow with multiple levels in between CPS 220

  6. Review: Improving Cache Performance 1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the cache CPS 220

  7. Review: Cache Optimization Summary (MR = miss rate, MP = miss penalty, HT = hit time)

  Technique                          MR  MP  HT  Complexity
  Larger Block Size                  +   –       0
  Higher Associativity               +       –   1
  Victim Caches                      +           2
  Pseudo-Associative Caches          +           2
  HW Prefetching of Instr/Data       +           2
  Compiler Controlled Prefetching    +           3
  Compiler Reduce Misses             +           0
  Priority to Read Misses                +       1
  Subblock Placement                     +   +   1
  Early Restart & Critical Word 1st      +       2
  Non-Blocking Caches                    +       3
  Second Level Caches                    +       2
  Small & Simple Caches              –       +   0
  Avoiding Address Translation               +   2
  Pipelining Writes                          +   1

  CPS 220

  8. System Organization [diagram: processor with cache connects through the core chip set to main memory and to an I/O bus; on the I/O bus sit a disk controller (disks), a graphics controller (graphics), and a network interface (network); interrupts go to the processor] CPS 220

  9. Computer Architecture • Interface between hardware and software [diagram: applications, compiler, and operating system software above the interface ("This is IT"); CPU, memory, and I/O hardware, with multiprocessors and networks, below] CPS 220

  10. Memory Hierarchy 101 • Processor (P): very fast, <1 ns clock, multiple instructions per cycle • Cache ($): SRAM; fast, small, expensive • Main memory (called physical or main): DRAM; slow, big, cheap • Disk: magnetic; really slow, really big, really cheap • => cost-effective memory system (price/performance) CPS 220

  11. Virtual Memory: Motivation • Process = address space + thread(s) of control • If the address space is the physical address space: • programmer controls movement from disk • protection? • relocation? • Want a linear address space • larger than the physical address space • 32 or 64 bits vs. 28-bit physical (256MB) • Automatic management [diagram: virtual address space mapped onto physical memory] CPS 220

  12. Virtual Memory • Process = virtual address space + thread(s) of control • Translation • VA -> PA • What physical address does virtual address A map to • Is VA in physical memory? • Protection (access control) • Do you have permission to access it? CPS 220

  13. Virtual Memory: Questions • How is data found if it is in physical memory? • Where can data be placed in physical memory? Fully Associative, Set Associative, Direct Mapped • What data should be replaced on a miss? (Take Compsci 210 …) CPS 220

  14. Segmented Virtual Memory • Virtual address (2^32, 2^64) to physical address (2^30) mapping • Variable size, base + offset, contiguous in both VA and PA [diagram: variable-size virtual segments at 0x0000, 0x1000, 0x2000 map to physical regions at 0x1000, 0x6000, 0x9000, 0x11000] CPS 220

  15. Intel Pentium Segmentation • Logical address = segment selector + offset • The selector indexes the Global Descriptor Table (GDT) to fetch a segment descriptor • The descriptor's segment base address is added to the offset to form an address in the physical address space CPS 220

  16. Pentium Segmentation (Continued) • Segment Descriptors • Local and Global • base, limit, access rights • Can define many • Segment Registers • contain segment descriptors (faster than load from mem) • Only 6 • Must load segment register with a valid entry before segment can be accessed • generally managed by compiler, linker, not programmer CPS 220

  17. Paged Virtual Memory • Virtual address (2^32, 2^64) to physical address (2^28) mapping • virtual page to physical page frame • Fixed size units for access control & translation [diagram: virtual address split into virtual page number + offset; virtual pages at 0x0000, 0x1000, 0x2000 map to physical frames at 0x1000, 0x6000, 0x9000, 0x11000] CPS 220

  18. Page Table • Kernel data structure (per process) • Page Table Entry (PTE) • VA -> PA translations (if none page fault) • access rights (Read, Write, Execute, User/Kernel, cached/uncached) • reference, dirty bits • Many designs • Linear, Forward mapped, Inverted, Hashed, Clustered • Design Issues • support for aliasing (multiple VA to single PA) • large virtual address space • time to obtain translation CPS 220
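The PTE lookup and access-rights check above can be sketched as a minimal model (the 8 KB page size, the `PTE`/`PageFault` names, and the dict-based table are illustrative, not from the slides):

```python
# Minimal sketch of a linear page table: one PTE per virtual page,
# holding a frame number plus access rights. Raising an exception
# stands in for delivering a page fault to the OS.

PAGE_SHIFT = 13                      # 8 KB pages (illustrative)
PAGE_SIZE = 1 << PAGE_SHIFT

class PageFault(Exception):
    pass

class PTE:
    def __init__(self, frame, r=True, w=False, valid=True):
        self.frame, self.r, self.w, self.valid = frame, r, w, valid

def translate(page_table, va, write=False):
    vpn, offset = va >> PAGE_SHIFT, va & (PAGE_SIZE - 1)
    pte = page_table.get(vpn)
    if pte is None or not pte.valid:
        raise PageFault(hex(va))     # no valid translation -> page fault
    if write and not pte.w:
        raise PageFault("write protection at " + hex(va))
    return (pte.frame << PAGE_SHIFT) | offset

pt = {0: PTE(frame=5), 1: PTE(frame=9, w=True)}
pa = translate(pt, 0x2010)           # vpn 1, offset 0x10 -> frame 9
```

A real kernel stores PTEs in one of the layouts below (linear, forward mapped, inverted, hashed, clustered); only the lookup contract is shown here.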

  19. Alpha VM Mapping (Forward Mapped) • “64-bit” address divided into 3 segments • seg0 (bit 63 = 0) user code/heap • seg1 (bit 63 = 1, bit 62 = 1) user stack • kseg (bit 63 = 1, bit 62 = 0) kernel segment for OS • Three-level page table, each level one page [diagram: VA = 21 seg/sign bits, three 10-bit indices L1/L2/L3, and a 13-bit page offset; the page table base plus L1, then L2, then L3 yields the physical page frame number] • Alpha 21064 only 43 unique bits of VA • (future min page size up to 64KB => 55 bits of VA) • PTE bits: valid, kernel & user read & write enable (no reference, use, or dirty bit) • What do you do for replacement? CPS 220
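The three-level walk can be sketched as follows (a toy model: the dict-of-dicts table layout and names are illustrative, though the 10/10/10-bit indices and 13-bit page offset match the split described on the slide):

```python
# Sketch of a three-level (forward-mapped) page table walk in the
# style of the Alpha split: three 10-bit indices, 13-bit page offset.

PAGE_SHIFT, LEVEL_BITS = 13, 10
MASK = (1 << LEVEL_BITS) - 1

def walk(root, va):
    """Return the physical address for va, or None if any level misses."""
    l1 = (va >> (PAGE_SHIFT + 2 * LEVEL_BITS)) & MASK
    l2 = (va >> (PAGE_SHIFT + LEVEL_BITS)) & MASK
    l3 = (va >> PAGE_SHIFT) & MASK
    node = root
    for index in (l1, l2, l3):
        node = node.get(index)       # one memory reference per level
        if node is None:
            return None              # a real walker would page-fault here
    return (node << PAGE_SHIFT) | (va & ((1 << PAGE_SHIFT) - 1))

root = {3: {7: {1: 0x42}}}           # frame 0x42 at L1=3, L2=7, L3=1
va = (3 << 33) | (7 << 23) | (1 << 13) | 0x123
```

Note the cost the later slides return to: a TLB miss here takes one memory reference per level before the data access itself.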

  20. Inverted Page Table (HP, IBM) • One PTE per page frame • only one VA per physical frame • Must search for virtual address • More difficult to support aliasing • Force all sharing to use the same VA [diagram: VA = virtual page number + offset; the VPN is hashed into the Hash Anchor Table (HAT), which points into the Inverted Page Table (IPT), where the VA is matched to yield PA and status] CPS 220
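The HAT-plus-chain search can be sketched as follows (hash function, table sizes, and names are illustrative, not from any specific HP or IBM design):

```python
# Sketch of an inverted page table: one entry per physical frame,
# chained from a hash anchor table (HAT). The chain is searched
# until the (vpn, pid) tag matches.

NFRAMES = 8

class IPTEntry:
    def __init__(self, vpn=None, pid=None, nxt=None):
        self.vpn, self.pid, self.next = vpn, pid, nxt

ipt = [IPTEntry() for _ in range(NFRAMES)]
hat = [None] * 4                     # hash anchor table: bucket -> frame

def insert(vpn, pid, frame):
    h = (vpn ^ pid) % len(hat)
    ipt[frame] = IPTEntry(vpn, pid, hat[h])
    hat[h] = frame                   # new head of the chain

def lookup(vpn, pid):
    """Walk the chain; return the frame number or None (page fault)."""
    frame = hat[(vpn ^ pid) % len(hat)]
    while frame is not None:
        e = ipt[frame]
        if e.vpn == vpn and e.pid == pid:
            return frame
        frame = e.next
    return None

insert(vpn=0x10, pid=1, frame=3)
insert(vpn=0x14, pid=1, frame=5)     # collides into the same chain here
```

Because the table is keyed by frame, each frame can carry only one VA, which is why aliasing forces all sharers onto the same VA.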

  21. Intel Pentium Segmentation + Paging • Segmentation first: the segment selector indexes the GDT for a segment descriptor, and the descriptor’s segment base address plus the offset forms an address in the linear address space • Paging second: the linear address is split into directory, table, and offset fields; the page directory and page table translate it into the physical address space CPS 220

  22. The Memory Management Unit (MMU) • Input • virtual address • Output • physical address • access violation (exception, interrupts the processor) • Access Violations • not present • user vs. kernel • write • read • execute CPS 220

  23. Translation Lookaside Buffers (TLB) • Need to perform address translation on every memory reference • 30% of instructions are memory references • 4-way superscalar processor • at least one memory reference per cycle • Make Common Case Fast, others correct • Throw HW at the problem • Cache PTEs CPS 220

  24. Fast Translation: Translation Buffer • Cache of translated addresses • Alpha 21164 TLB: 48-entry fully associative [diagram: VA = page number + page offset; the page number is compared against all 48 tags (48:1 mux), each entry holding valid/read/write bits and the physical frame number] CPS 220

  25. TLB Design • Must be fast, not increase critical path • Must achieve high hit ratio • Generally small, highly associative • Mapping change • page removed from physical memory • processor must invalidate the TLB entry • PTE is per process entity • Multiple processes with same virtual addresses • Context Switches? • Flush TLB • Add ASID (PID) • part of processor state, must be set on context switch CPS 220
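The ASID alternative to flushing can be sketched as follows (a toy model: the dict stands in for a fully associative CAM, and the entry count, eviction policy, and names are illustrative):

```python
# Sketch of a small fully associative TLB tagged with ASIDs, so
# entries from different processes coexist and a context switch
# only has to change the current ASID, not flush the TLB.

class TLB:
    def __init__(self, nentries=48):
        self.nentries = nentries
        self.entries = {}            # (asid, vpn) -> frame

    def lookup(self, asid, vpn):
        return self.entries.get((asid, vpn))   # None models a TLB miss

    def insert(self, asid, vpn, frame):
        if len(self.entries) >= self.nentries:
            # evict an arbitrary entry (no replacement policy modeled)
            self.entries.pop(next(iter(self.entries)))
        self.entries[(asid, vpn)] = frame

    def invalidate(self, asid, vpn):
        # e.g. when the page is removed from physical memory
        self.entries.pop((asid, vpn), None)

tlb = TLB()
tlb.insert(asid=1, vpn=0x10, frame=7)
tlb.insert(asid=2, vpn=0x10, frame=9)   # same VPN, different process
```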

  26. Hardware Managed TLBs • Hardware handles TLB miss • Dictates page table organization • Complicated state machine to “walk page table” • Multiple levels for forward mapped • Linked list for inverted • Exception only if access violation [diagram: CPU, TLB, control logic, memory] CPS 220

  27. Software Managed TLBs • Software handles TLB miss • Flexible page table organization • Simple hardware to detect hit or miss • Exception if TLB miss or access violation • Should you check for access violation on TLB miss? [diagram: CPU, TLB, control logic, memory] CPS 220

  28. Mapping the Kernel • Digital Unix kseg • kseg (bit 63 = 1, bit 62 = 0) • Kernel has direct access to physical memory • One VA->PA mapping for entire kernel • Lock (pin) TLB entry • or special HW detection [diagram: user address space from 0 (user code/data) to 2^64−1 (user stack), with the kernel mapped once onto physical memory] CPS 220

  29. Considerations for Address Translation • Large virtual address space • Can map more things • files • frame buffers • network interfaces • memory from another workstation • Sparse use of address space • Page table design • space • less locality => TLB misses • OS structure • microkernel => more TLB misses CPS 220

  30. Address Translation for Large Address Spaces • Forward Mapped Page Table • grows with virtual address space • worst case 100% overhead not likely • TLB miss time: memory reference for each level • Inverted Page Table • grows with physical address space • independent of virtual address space usage • TLB miss time: memory reference to HAT, IPT, list search CPS 220

  31. Hashed Page Table (HPT) • Combine hash table and IPT [Huck96] • can have more entries than physical page frames • Must search for virtual address • Easier to support aliasing than IPT • Space • grows with physical space • TLB miss • one less memory ref than IPT [diagram: VA = virtual page number + offset; the VPN is hashed directly into the hashed page table, where the VA is matched to yield PA and status] CPS 220
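The HPT lookup can be sketched as follows (bucket count, entry format, and names are illustrative; unlike the IPT sketch, the table is not bounded by the number of frames, and the hash goes straight to the entries with no anchor table, saving one memory reference):

```python
# Sketch of a hashed page table: hash the VPN straight into the
# table, then match the VA tag within the bucket's short chain.

class HPT:
    def __init__(self, nbuckets=64):
        self.buckets = [[] for _ in range(nbuckets)]

    def insert(self, vpn, frame, status="rw"):
        # entries carry (VA tag, PA, status); more entries than frames is fine
        self.buckets[vpn % len(self.buckets)].append((vpn, frame, status))

    def lookup(self, vpn):
        """One reference to the bucket, then a short list search."""
        for v, frame, status in self.buckets[vpn % len(self.buckets)]:
            if v == vpn:
                return frame, status
        return None                  # miss handler would take a page fault

hpt = HPT()
hpt.insert(0x1234, frame=7)
hpt.insert(0x1234 + 64 * 999, frame=8)   # lands in the same bucket
```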

  32. Clustered Page Table (SUN) • Combine benefits of HPT and linear [Talluri95] • Store one base VPN (TAG) and several PPN values • virtual page block number (VPBN) • block offset [diagram: VA = VPBN + block offset + page offset; the VPBN hashes into chains of entries, each holding a VPBN tag, a next pointer, and several PA/attribute pairs (PA0..PA3)] CPS 220

  33. Reducing TLB Miss Handling Time • Problem • must walk Page Table on TLB miss • usually incur cache misses • big problem for IPC in microkernels • Solution • build a small second-level cache in SW • on TLB miss, first check SW cache • use simple shift and mask index to hash table CPS 220
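The software second-level cache can be sketched as follows (the table size, names, and the stand-in page walker are illustrative):

```python
# Sketch of a software second-level TLB cache: a direct-mapped
# table indexed by a simple shift-and-mask of the VPN, consulted
# on a TLB miss before doing the full (slow) page-table walk.

STLB_BITS = 10
STLB_SIZE = 1 << STLB_BITS
stlb = [None] * STLB_SIZE            # each slot: (vpn, frame) or None

def stlb_lookup(vpn):
    slot = stlb[vpn & (STLB_SIZE - 1)]        # shift-and-mask index
    if slot is not None and slot[0] == vpn:
        return slot[1]               # hit: refill the hardware TLB
    return None                      # miss: fall through to the walk

def stlb_fill(vpn, frame):
    stlb[vpn & (STLB_SIZE - 1)] = (vpn, frame)

def translate(vpn, walk_page_table):
    frame = stlb_lookup(vpn)
    if frame is None:
        frame = walk_page_table(vpn) # slow path: may itself miss in caches
        stlb_fill(vpn, frame)
    return frame

walks = []
def slow_walk(vpn):
    walks.append(vpn)                # count how often the slow path runs
    return vpn + 100                 # stand-in for a real page-table walk

translate(5, slow_walk)              # first access: does the walk
translate(5, slow_walk)              # second access: hits the SW cache
```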

  34. Cache Indexing • Tag on each block • No need to check index or block offset • Increasing associativity shrinks index, expands tag • Fully associative: no index • Direct-mapped: large index [diagram: block address = tag + index, followed by block offset] CPS 220
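The tag/index/offset split, and how associativity moves bits from index to tag, can be sketched as follows (the 8 KB cache and 32-byte blocks are illustrative parameters):

```python
# Sketch of splitting an address into tag / index / block offset
# for a given cache geometry.

def split_address(addr, cache_bytes=8192, block_bytes=32, assoc=1):
    nsets = cache_bytes // (block_bytes * assoc)
    offset = addr % block_bytes
    index = (addr // block_bytes) % nsets   # fully assoc: nsets == 1, no index
    tag = addr // (block_bytes * nsets)
    return tag, index, offset

# Higher associativity shrinks the index and expands the tag:
t1, i1, _ = split_address(0x12345678, assoc=1)   # 256 sets, 8 index bits
t2, i2, _ = split_address(0x12345678, assoc=4)   # 64 sets, 6 index bits
```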

  35. Address Translation and Caches • Where is the TLB wrt the cache? • What are the consequences? • Most of today’s systems have more than 1 cache • Digital 21164 has 3 levels • 2 levels on chip (8KB-data,8KB-inst,96KB-unified) • one level off chip (2-4MB) • Does the OS need to worry about this? Definition: page coloring = careful selection of va->pa mapping CPS 220

  36. TLBs and Caches • Conventional organization: TLB translates VA to PA before a physically addressed cache • Virtually addressed cache: cache indexed and tagged with VAs; translate only on a miss; alias (synonym) problem • Overlapped organization: $ access overlaps VA translation (VA-indexed cache with PA tags, then a PA-addressed L2); requires the $ index to remain invariant across translation CPS 220

  37. Virtual Caches • Send virtual address to cache. Called Virtually Addressed Cache or just Virtual Cache, vs. Physical Cache or Real Cache • Avoid address translation before accessing cache • faster hit time to cache • Context Switches? • Just like the TLB (flush or PID) • Cost is time to flush + “compulsory” misses from empty cache • Add process identifier tag that identifies process as well as address within process: can’t get a hit if wrong process • I/O must interact with cache CPS 220

  38. I/O and Virtual Caches • I/O is accomplished with physical addresses • DMA • flush pages from cache • need pa->va reverse translation • or coherent DMA [diagram: processor with virtual cache on the memory bus; an I/O bridge connects to the I/O bus carrying a disk controller, graphics controller, and network interface] CPS 220

  39. Aliases and Virtual Caches • Aliases (sometimes called synonyms): two different virtual addresses map to the same physical address • But, but... the virtual address is used to index the cache • Could have data in two different locations in the cache [diagram: user address space from 0 (user code/data) to 2^64−1 (user stack); kernel and user mappings can alias the same physical memory] CPS 220

  40. Index with Physical Portion of Address • If the index is the physical part of the address, can start tag access in parallel with translation and then compare against the physical tag • Limits cache to page size: what if we want bigger caches and the same trick? • Higher associativity • Page coloring [diagram: address = page address + page offset; the index and block offset fall within the page offset, the tag within the page address] CPS 220
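The constraint behind this trick, and how associativity relaxes it, can be sketched in one line (the 8 KB page size is an illustrative parameter):

```python
# Sketch of the "index within the page offset" constraint: the index
# and block-offset bits must fit inside the page offset, i.e.
# cache_size / associativity <= page_size.

def can_index_in_parallel(cache_bytes, assoc, page_bytes=8192):
    return cache_bytes // assoc <= page_bytes

ok = can_index_in_parallel(8192, assoc=1)        # 8 KB direct-mapped: fits
too_big = can_index_in_parallel(32768, assoc=1)  # 32 KB direct-mapped: no
fixed = can_index_in_parallel(32768, assoc=4)    # same cache, 4-way: fits
```

This is why the slide offers higher associativity (or page coloring) as the way to grow the cache past the page size.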

  41. Page Coloring for Aliases • HW that guarantees that every cache frame holds a unique physical address • OS guarantee: lower n bits of virtual & physical page numbers must have the same value; if direct-mapped, then aliases map to the same cache frame • one form of page coloring [diagram: address = page address + page offset; the low index bits overlap the page number] CPS 220

  42. Page Coloring to Reduce Misses • Notion of a bin • region of the cache that may contain cache blocks from a page • Random vs. careful mapping • Selection of physical page frame dictates cache index • Overall goal is to minimize cache misses [diagram: page frames map onto regions (bins) of the cache] CPS 220
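The bin of a page frame is just the low bits of its frame number that land in the cache index, so it can be sketched directly (the 8 KB page and 2 MB direct-mapped cache are illustrative numbers):

```python
# Sketch of the "bin" notion: which region of the cache a given
# physical page frame maps to.

PAGE_BYTES = 8192

def nbins(cache_bytes, assoc=1, page_bytes=PAGE_BYTES):
    return max(1, cache_bytes // (assoc * page_bytes))

def bin_of(frame, cache_bytes, assoc=1):
    return frame % nbins(cache_bytes, assoc)

# 2 MB direct-mapped cache with 8 KB pages -> 256 bins.
# Frames 3 and 259 fall in the same bin, so their pages can conflict:
same = bin_of(3, 2 * 1024 * 1024) == bin_of(259, 2 * 1024 * 1024)
```

The allocation policies on the following slides differ only in which bin they pick a free frame from.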

  43. Careful Page Mapping [Kessler92, Bershad94] • Select a page frame such that cache conflict misses are reduced • only choose from available pages (no VM replacement induced) • static • “smart” selection of page frame at page fault time • dynamic • move pages around CPS 220

  44. A Case for Large Pages • Page table size is inversely proportional to the page size • memory saved • Fast cache hit time is easy when cache <= page size (VA caches); • a bigger page makes this feasible as cache size grows • Transferring larger pages to or from secondary storage, possibly over a network, is more efficient • The number of TLB entries is restricted by clock cycle time; • a larger page size maps more memory • reduces TLB misses CPS 220

  45. A Case for Small Pages • Fragmentation • large pages can waste storage • data must be contiguous within page • Quicker process start for small processes(??) CPS 220

  46. Superpages • Hybrid solution: multiple page sizes • 8KB, 16KB, 32KB, 64KB pages • 4KB, 64KB, 256KB, 1MB, 4MB, 16MB pages • Need to identify candidate superpages • Kernel • Frame buffers • Database buffer pools • Application/compiler hints • Detecting superpages • static, at page fault time • dynamically create superpages • Page Table & TLB modifications CPS 220

  47. Page Coloring • Make physical index match virtual index • Behaves like a virtually indexed cache • no conflicts for sequential pages • Possibly many conflicts between processes • address spaces all have same structure (stack, code, heap) • modify to xor PID with address (MIPS used a variant of this) • Simple implementation • Pick an arbitrary page if necessary CPS 220

  48. Bin Hopping • Allocate sequentially mapped pages (time) to sequential bins (space) • Can exploit temporal locality • pages mapped close in time will be accessed close in time • Search from last allocated bin until bin with available page frame • Separate search list per process • Simple implementation CPS 220
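Bin hopping can be sketched as follows (the per-process search state, the bin-to-free-frame map, and all names are illustrative):

```python
# Sketch of bin hopping: allocate sequentially mapped pages (in time)
# to sequential bins (in space), searching from the last bin used.
# free_frames maps bin -> list of free frame numbers.

def make_allocator(nbins):
    state = {"last": -1}             # separate search state per process
    def allocate(free_frames):
        # search from the bin after the last one allocated
        for step in range(1, nbins + 1):
            b = (state["last"] + step) % nbins
            if free_frames.get(b):
                state["last"] = b
                return free_frames[b].pop()
        raise MemoryError("no free frames")
    return allocate

free = {0: [100], 1: [101], 2: [102, 103]}
alloc = make_allocator(nbins=4)
frames = [alloc(free) for _ in range(3)]   # hops through bins 0, 1, 2
```

Pages touched close in time end up in different bins, which is exactly the temporal-locality argument on the slide.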

  49. Best Bin • Keep track of two counters per bin • used: # of pages allocated to this bin for this address space • free: # of available pages in the system for this bin • Bin selection is based on low values of used and high values of free • Low used value • reduce conflicts within the address space • High free value • reduce conflicts between address spaces CPS 220
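The two-counter selection can be sketched as follows (the tie-breaking order, low `used` first then high `free`, is one reasonable reading of the slide; names are illustrative):

```python
# Sketch of best-bin selection: prefer a bin this address space has
# used little (fewer self-conflicts) that still has many free frames
# system-wide (fewer conflicts with other address spaces).

def best_bin(used, free):
    """used[b]: pages this process already has in bin b;
       free[b]: free frames remaining in bin b system-wide."""
    candidates = [b for b in range(len(free)) if free[b] > 0]
    # minimize used, and among ties maximize free
    return min(candidates, key=lambda b: (used[b], -free[b]))

used = [2, 0, 0, 5]
free = [9, 1, 6, 7]
b = best_bin(used, free)      # bins 1 and 2 are unused; bin 2 has more free
```

Scanning all bins like this is linear in the number of bins, which is the cost the Hierarchical scheme on the next slide attacks with a tree of summed counters.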

  50. Hierarchical • Best bin could be linear in # of bins • Build a tree • internal nodes contain sum of child <used,free> values • Independent of cache size • simply stop at a particular level in the tree CPS 220
