
Virtual Memory & Address Translation






Presentation Transcript


  1. Virtual Memory & Address Translation Vivek Pai Princeton University

  2. Memory Gedanken • For a given system, estimate the following • # of processes • Amount of memory • Are all portions of a program likely to be used? • What fraction might be unused? • What does this imply about memory bookkeeping?

  3. Some Numbers • Bubba: desktop machine • 63 processes • 128MB memory + 12MB swap used • About 2 MB per process • Oakley: CS shared machine • 188 processes • 2GB memory + 624MB swap • About 13 MB per process • Arizona: undergrad shared machine • 818 processes • 8GB memory, but only 2GB used • About 2MB per process

  4. Quiz 1 Answers • Microkernel – small privileged kernel with most kernel services running as user-space processes • Speed? Lots of context switches, memory copying for communications • Memory layout: [diagram not reproduced]

  5. Quiz 1 Breakdown • 4.0 – 5 • 3.5 – 7 • 3.0 – 8 • 2.5 – 11 • 2.0 – 1 • 1.5 – 5 • 0.5 – 3

  6. General Memory Problem • We have a limited (expensive) physical resource: main memory • We want to use it as efficiently as possible • We have an abundant, slower resource: disk

  7. Lots of Variants • Many programs, total size less than memory • Technically possible to pack them together • Will programs know about each other’s existence? • One program, using lots of memory • Can you only keep part of the program in memory? • Lots of programs, total size exceeds memory • What programs are in memory, and how to decide?

  8. History Versus Present • History • Each variant had its own solution • Solutions have different hardware requirements • Some solutions software/programmer visible • Present – general-purpose microprocessors • One mechanism used for all of these cases • Present – less capable microprocessors • May still use “historical” approaches

  9. Many Programs, Small Total Size • Observation: we can pack them into memory • Requirements by segment • Text: maybe contiguous • Data: keep contiguous, “relocate” at start • Stack: assume contiguous, fixed size • Just set pointer at start, reserve space • Heap: no need to make it contiguous • [Diagram: two programs packed into physical memory – Text 1/2, Data 1/2, Stack 1/2 interleaved]

  10. Many Programs, Small Total Size • Software approach • Just find appropriate space for data & code segments • Adjust any pointers to globals/functions in the code • Heap, stack “automatically” adjustable • Hardware approach • Pointer to data segment • All accesses to globals indirected

  11. One Program, Lots of Memory • Observation: locality • Instructions in a function generally related • Stack accesses generally in current stack frame • Not all data used all the time • Goal: keep recently-used portions in memory • Explicit: programmer/compiler reserves, controls part of memory space – “overlays” • Note: the limited resource may be address space • [Diagram: address space with Stack, Active Heap, Other Heap(s), Data, Text – only the active portions resident]

  12. Many Programs, Lots of Memory • Software approach • Keep only subset of programs in memory • When loading a program, evict any programs that use the same memory regions • “Swap” programs in/out as needed • Hardware approach • Don’t permanently associate any address of any program to any part of physical memory • Note: doesn’t address problem of too few address bits

  13. Why Virtual Memory? • Use secondary storage ($) • Extend DRAM ($$$) with reasonable performance • Protection • Programs do not step over each other • Communication requires explicit IPC operations • Convenience • Flat address space • Programs have the same view of the world

  14. How To Translate • Must have some “mapping” mechanism • Mapping must have some granularity • Granularity determines flexibility • Finer granularity requires more mapping info • Extremes: • Any byte to any byte: mapping equals program size • Map whole segments: larger segments problematic

  15. Translation Options • Granularity • Small # of big fixed/flexible regions – segments • Large # of fixed regions – pages • Visibility • Translation mechanism integral to instruction set – segments • Mechanism partly visible, external to processor – obsolete • Mechanism part of processor, visible to OS – pages

  16. Translation Overview • Actual translation is done in hardware (MMU) • Controlled in software • CPU view • what the program sees: virtual memory • Memory view • physical memory • [Diagram: CPU issues virtual address → Translation (MMU) → physical address → physical memory and I/O devices]

  17. Goals of Translation • Implicit translation for each memory reference • A hit should be very fast • Trigger an exception on a miss • Protected from user’s faults • [Diagram: memory hierarchy – registers, cache(s) (~10x slower), DRAM (~100x), paging to disk (~10Mx)]

  18. Base and Bound • Built in Cray-1 • A program can only access physical memory in [base, base+bound] • On a context switch: save/restore base, bound registers • Pros: simple • Cons: fragmentation, hard to share, and difficult to use disks • [Diagram: virtual address compared against bound (greater ⇒ error), then added to base to form the physical address]

  19. Segmentation • Have a table of (seg, size) • Protection: each entry has (nil, read, write, exec) • On a context switch: save/restore the table, or a pointer to the table in kernel memory • Pros: efficient, easy to share • Cons: complex management and fragmentation within a segment • [Diagram: virtual address = seg # | offset; offset compared against the entry’s size (greater ⇒ error), then added to the segment base to form the physical address]

  20. Paging • Use a page table to translate • Various bits in each entry • Context switch: similar to the segmentation scheme • What should the page size be? • Pros: simple allocation, easy to share • Cons: big table, and cannot deal with holes easily • [Diagram: virtual address = VPage # | offset; VPage # checked against the page table size (greater ⇒ error) and used to index the table; the PPage # found there is concatenated with the offset to form the physical address]

  21. How Many PTEs Do We Need? • Assume a 4KB page • Equals the “low order” 12 bits • Worst case for a 32-bit address machine • # of processes × 2^20 • What about a 64-bit address machine? • # of processes × 2^52

  22. Segmentation with Paging • [Diagram: virtual address = Vseg # | VPage # | offset; the segment entry, checked against the segment size (greater ⇒ error), points at that segment’s page table; the PTE’s PPage # is concatenated with the offset to form the physical address]

  23. Stretch Time: Getting a Line • First try: fflush(stdout); scanf("%s", line); • Second try: scanf("%c", &firstChar); if (firstChar != '\n') scanf("%s", rest); • Final: back to first principles – a while loop around getc()

  24. Multiple-Level Page Tables • [Diagram: virtual address = dir | table | offset; the dir bits index a directory whose entry points at a second-level page table; the table bits index that table to find the PTE] • What does this buy us? Sparse address spaces and easier paging

  25. Inverted Page Tables • Main idea • One PTE for each physical page frame • Hash (Vpage, pid) to PPage # • Pros • Small page table for a large address space • Cons • Lookup is difficult • Overhead of managing hash chains, etc. • [Diagram: virtual address = pid | vpage | offset hashed into the inverted page table (frames 0..n-1); the matching entry’s index k becomes the physical page number in k | offset]

  26. Virtual-To-Physical Lookups • Programs only know virtual addresses • Each virtual address must be translated • May involve walking hierarchical page table • Page table stored in memory • So, each program memory access requires several actual memory accesses • Solution: cache “active” part of page table

  27. Translation Look-aside Buffer (TLB) • [Diagram: virtual address = VPage # | offset; the VPage # is matched against all (VPage #, PPage #, …) entries in the TLB at once – on a hit, the cached PPage # is concatenated with the offset to form the physical address; on a miss, the real page table is consulted]

  28. Bits in a TLB Entry • Common (necessary) bits • Virtual page number: match with the virtual address • Physical page number: translated address • Valid • Access bits: kernel and user (nil, read, write) • Optional (useful) bits • Process tag • Reference • Modify • Cacheable

  29. Hardware-Controlled TLB • On a TLB miss • Hardware loads the PTE into the TLB • Need to write back if there is no free entry • Generate a fault if the page containing the PTE is invalid • VM software performs fault handling • Restart the CPU • On a TLB hit, hardware checks the valid bit • If valid, pointer to page frame in memory • If invalid, the hardware generates a page fault • Perform page fault handling • Restart the faulting instruction

  30. Software-Controlled TLB • On a miss in TLB • Write back if there is no free entry • Check if the page containing the PTE is in memory • If no, perform page fault handling • Load the PTE into the TLB • Restart the faulting instruction • On a hit in TLB, the hardware checks valid bit • If valid, pointer to page frame in memory • If invalid, the hardware generates a page fault • Perform page fault handling • Restart the faulting instruction

  31. Hardware vs. Software Controlled • Hardware approach • Efficient • Inflexible • Needs more space for the page table • Software approach • Flexible • Software can do mappings by hashing • PP# → (Pid, VP#) • (Pid, VP#) → PP# • Can deal with a large virtual address space

  32. Cache vs. TLBs • Similarities • Both cache a portion of memory • Both write back on a miss • Differences • Associativity: TLB is usually fully set-associative; a cache can be direct-mapped • Consistency: the TLB does not deal with consistency with memory; a TLB can be controlled by software • Combine the L1 cache with the TLB: virtually addressed cache • Why wouldn’t everyone use virtually addressed caches?

  33. Caches vs. TLBs • Similarities • Both cache a portion of memory • Both read from memory on misses • Differences • Associativity: TLBs are generally fully associative; caches can be direct-mapped • Consistency: no TLB/memory consistency; some TLBs are software-controlled • Combining L1 caches with TLBs: virtually addressed caches • Not always used – what are their drawbacks?

  34. Issues • Which TLB entry should be replaced? • Random • Pseudo-LRU • What happens on a context switch? • Process tag: change TLB registers and the process register • No process tag: invalidate the entire TLB contents • What happens when changing a page table entry? • Change the entry in memory • Invalidate the TLB entry

  35. Consistency Issues • “Snoopy” cache protocols can maintain consistency with DRAM, even when DMA happens • No hardware maintains consistency between DRAM and TLBs: you need to flush related TLBs whenever changing a page table entry in memory • On multiprocessors, when you modify a page table entry, you need to do “TLB shoot-down” to flush all related TLB entries on all processors

  36. Issues to Ponder • Everyone’s moving to hardware TLB management – why? • Segmentation was/is a way of maintaining backward compatibility – how? • For the hardware-inclined – what kind of hardware support is needed for everything we discussed today?
