
Computer Architecture Virtual Memory (VM)



Presentation Transcript


  1. Computer Architecture: Virtual Memory (VM) • By Dan Tsafrir, 23/5/2011 • Presentation based on slides by Lihu Rappoport

  2. http://www.youtube.com/watch?v=3ye2OXj32DM (funny beginning)

  3. DRAM (dynamic random-access memory) • Corsair 1333 MHz DDR3 laptop memory • Price (at amazon.com): • $43 for 4 GB • $79 for 8 GB • “The physical memory”

  4. VM – motivation • Provides isolation between processes • Processes can concurrently run on a single machine • VM prevents them from accessing the memory of one another • (But still allows for convenient sharing when required) • Provides illusion of large memory • VM size can be bigger than physical memory size • VM decouples program from real memory size (which can differ across machines) • Provides illusion of contiguous memory • Programmers need not worry about where data is placed exactly • Allows for dynamic memory growth • Can add memory to processes at runtime as needed • Allows for memory overcommitment • Sum of VM spaces (across all processes) can be >= physical memory • DRAM is often one of the most costly parts of the system

  5. VM – terminology • Virtual address space • Space used by the programmer • “Ideal” = contiguous & as big as you’d like • Physical address • The real, underlying physical memory address • Completely abstracted away by OS/HW

  6. VM – basic idea • Divide memory (virtual & physical) into fixed size blocks • “page” = chunk of contiguous data in virtual space • “frame” = physical memory exactly enough to hold one page • |page| = |frame| (= size) • page size = power of 2 = 2^k (bytes) • By default, k=12 almost always => page size is 4KB • While virtual address space is contiguous • Pages can be mapped into arbitrary frames • Pages can reside • In memory or on disk (hence, overcommitment) • All programs are written using the VM address space • HW does on-the-fly translation from virtual to physical addresses • Uses a page table to translate between virtual and physical addresses (see the sketch below)
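To make the split concrete, here is a minimal sketch of how a virtual address decomposes into a virtual page number and a page offset, assuming the 4KB pages (k=12) used throughout these slides; the example address is arbitrary.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                      /* k = 12 => 4KB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)     /* 4096 bytes */
#define PAGE_MASK  (PAGE_SIZE - 1)         /* low 12 bits = page offset */

int main(void) {
    uint64_t vaddr  = 0x7f3a12345678ULL;   /* arbitrary example address */
    uint64_t vpn    = vaddr >> PAGE_SHIFT; /* virtual page number */
    uint64_t offset = vaddr & PAGE_MASK;   /* byte offset within the page */
    printf("VPN = 0x%llx, offset = 0x%llx\n",
           (unsigned long long)vpn, (unsigned long long)offset);
    return 0;
}
```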

  7. VM – simplistic illustration • Memory acts as a cache for the secondary storage (disk) • Immediate advantages • Illusion of contiguity & of having more physical memory • Program’s actual location unimportant • Dynamic growth, isolation, & sharing are easy to obtain • [Figure: pages (virtual space) are mapped by address translation to frames (DRAM) or to disk]

  8. Translation – use a “page table” • Virtual address (64 bit): virtual page number (52 bit, bits 63:12) + page offset (12 bit, bits 11:0) • Physical address (32 bit): physical frame number (20 bit) + page offset (12 bit) • How to map the virtual page number to the frame number? • (page size is typically 2^12 bytes = 4KB)

  9. Translation – use a “page table” • A page table base register points at the start of the table • Each entry holds: frame number, V (valid bit, 1/0), D (dirty bit), AC (access control) • (page size is typically 2^12 bytes = 4KB)

  10. Translation – use a “page table” • [Figure: the virtual page number (bits 63:12 of the 64-bit virtual address) indexes the page table located by the page table base register; the selected entry supplies the physical frame number (20 bit), which is concatenated with the page offset (bits 11:0) to form the 32-bit physical address] • (page size is typically 2^12 bytes = 4KB)

  11. Translation – use a “page table” • “PTE” (page table entry) = frame number + V (valid) + D (dirty) + AC (access control)

  12. Page tables • The page table points to a memory frame or to a disk address • [Figure: valid PTEs (valid=1) map virtual page numbers to frames in physical memory; invalid PTEs (valid=0) point to the page’s location on disk]

  13. Checks • If ( valid == 1 ), the page is in main memory at the frame address stored in the table => data is readily available (e.g., can copy it to the cache) • Else /* page fault */, need to fetch the page from disk => causes a trap, usually accompanied by a context switch: the current process is suspended while the page is fetched from disk • Access Control • R=read-only, R/W=read/write, X=execute • If ( access type incompatible with specified access rights ) => protection violation fault => traps to fault-handler • Demand paging • Pages are fetched from secondary memory only upon the first fault • Rather than, e.g., upon file open • (A sketch of these checks follows below)
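A hedged sketch of these checks in C, assuming a hypothetical single-level page table and a made-up PTE layout (real hardware formats differ); `translate` either produces the physical address or signals a fault for the OS to handle.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical PTE layout: 20-bit frame number plus V/D/AC bits. */
typedef struct {
    uint32_t frame : 20;  /* physical frame number */
    uint32_t valid : 1;   /* page present in memory? */
    uint32_t dirty : 1;   /* page modified since load? */
    uint32_t ac    : 2;   /* access control: 0=R, 1=R/W, 2=X */
} pte_t;

enum { AC_R = 0, AC_RW = 1, AC_X = 2 };
enum access { ACC_READ, ACC_WRITE, ACC_EXEC };

/* Returns true and fills *paddr on success; false signals a fault
 * (page fault or protection violation) that the OS must handle. */
bool translate(pte_t *page_table, uint64_t vaddr, enum access type,
               uint32_t *paddr) {
    uint64_t vpn = vaddr >> 12;           /* 4KB pages */
    pte_t pte = page_table[vpn];          /* toy: flat, in-bounds table */

    if (!pte.valid)
        return false;                     /* page fault: fetch from disk */
    if (type == ACC_WRITE && pte.ac != AC_RW)
        return false;                     /* protection violation fault */
    if (type == ACC_EXEC && pte.ac != AC_X)
        return false;                     /* protection violation fault */

    *paddr = ((uint32_t)pte.frame << 12) | (uint32_t)(vaddr & 0xfff);
    return true;
}
```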

  14. Page replacement • Page replacement policy • Decides which page to evict to disk • LRU (least recently used) • Typically too wasteful (updated upon each memory reference) • FIFO (first in first out) • Simplest: no need to update upon references, but ignores usage • Second-chance • Set a per-page “was it referenced?” bit (can be done by HW or SW) • Swap out the first page with bit = 0, in FIFO order • When traversed, if bit = 1, set it to 0 and push the associated page to the end of the list (in FIFO terms, the page becomes newest) • Clock • More efficient variant of second-chance (sketched below) • Pages are cyclically ordered (no FIFO); search clockwise for the first page with bit=0; set bit=0 for pages that have bit=1
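A minimal sketch of the clock variant, assuming a circular array of frame descriptors with a software-visible reference bit; all names here are illustrative, not from any particular OS.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical frame descriptor; 'referenced' mirrors the per-page
 * "was it referenced?" bit described on the slide. */
struct frame {
    bool referenced;
    /* ... owning page, dirty bit, etc. ... */
};

/* Clock replacement: sweep the circular frame array from 'hand',
 * giving referenced frames a second chance (clear the bit, move on)
 * and evicting the first frame found with referenced == false. */
size_t clock_evict(struct frame *frames, size_t nframes, size_t *hand) {
    for (;;) {
        struct frame *f = &frames[*hand];
        if (!f->referenced) {
            size_t victim = *hand;
            *hand = (*hand + 1) % nframes;  /* advance past the victim */
            return victim;
        }
        f->referenced = false;              /* second chance */
        *hand = (*hand + 1) % nframes;
    }
}
```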

  15. Page replacement – cont. • NRU (not recently used) • More sophisticated LRU approximation • HW or SW maintains per-page ‘referenced’ & ‘modified’ bits • Periodically (clock interrupt), SW turns ‘referenced’ off • Replacement algorithm partitions pages into • Class 0: not referenced, not modified • Class 1: not referenced, modified • Class 2: referenced, not modified • Class 3: referenced, modified • Choose a page at random from the lowest non-empty class for removal • Underlying principles (order is important): • Prefer keeping referenced over unreferenced • Prefer keeping modified over unmodified • Can a page be modified but not referenced? (Yes: the periodic clearing turns off ‘referenced’ but leaves ‘modified’ set; see the sketch below)
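A small sketch of NRU victim selection under the class definition above; taking the first page of the lowest class stands in for the random choice within that class, and the struct is illustrative.

```c
#include <stdbool.h>
#include <stddef.h>

struct page { bool referenced, modified; };

/* Class 0..3 exactly as on the slide: 2*referenced + modified. */
static int nru_class(const struct page *p) {
    return (p->referenced ? 2 : 0) + (p->modified ? 1 : 0);
}

/* Pick a victim from the lowest non-empty class (a real NRU would
 * choose randomly among that class's pages). */
size_t nru_victim(const struct page *pages, size_t n) {
    size_t best = 0;
    int best_class = 4;                  /* above any real class */
    for (size_t i = 0; i < n; i++) {
        int c = nru_class(&pages[i]);
        if (c < best_class) { best_class = c; best = i; }
        if (best_class == 0) break;      /* cannot do better */
    }
    return best;
}
```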

  16. Page replacement – advanced • ARC (adaptive replacement cache) • Factors in not only recency (when last accessed), but also frequency (how many times accessed) • User determines which factor has more weight • Better (but more wasteful) than LRU • Developed by IBM: Nimrod Megiddo & Dharmendra Modha • Details: http://www.usenix.org/events/fast03/tech/full_papers/megiddo/megiddo.pdf • CAR (clock with adaptive replacement) • Similar to ARC, and comparable in performance • But, unlike ARC, doesn’t require user-specified parameters • Likewise developed by IBM: Sorav Bansal & Dharmendra Modha • Details: http://www.usenix.org/events/fast04/tech/full_papers/bansal/bansal.pdf

  17. Page faults • Page fault: the data is not in memory => retrieve it from disk • CPU detects the situation (valid=0) • But it cannot remedy the situation (it doesn’t know about the disk; that’s the OS’s job) • Thus, it must trap to the OS • OS loads the page from disk • Possibly writing a victim page to disk (if no room & if dirty) • Possibly avoids reading from disk thanks to the OS “buffer cache” • OS updates the page table (valid=1) • OS resumes the process; now, HW will retry & succeed! • Page fault incurs a significant penalty • “Major” page fault = must go get the page from disk • “Minor” page fault = page already resides in the OS buffer cache • Possible only for files; not for “anonymous” spaces like the stack • => pages shouldn’t be too small (as noted, typically 4KB) • (A toy fault handler follows below)
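A toy, C-flavored model of the handler steps above; every structure and helper here (`buffer_cache_frame`, `choose_victim`, the disk I/O stubs) is an illustrative stand-in, not a real kernel API.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct pte { uint32_t frame; bool valid; bool dirty; };

/* Illustrative stubs standing in for real OS machinery. */
static bool buffer_cache_frame(uint64_t vpn, uint32_t *f) {
    (void)vpn; (void)f; return false;        /* pretend: never cached */
}
static uint32_t choose_victim(void) { return 7; } /* e.g., via clock/NRU */
static void write_page_to_disk(uint32_t f)  { printf("writeback frame %u\n", f); }
static void read_page_from_disk(uint64_t vpn, uint32_t f) {
    printf("read page %llu into frame %u\n", (unsigned long long)vpn, f);
}

void handle_page_fault(struct pte *pt, struct pte *victim, uint64_t vpn) {
    uint32_t frame;
    if (buffer_cache_frame(vpn, &frame)) {
        /* minor fault: page already in memory, no disk read needed */
    } else {
        /* major fault: free a frame, then read the page from disk */
        frame = choose_victim();
        if (victim->dirty)
            write_page_to_disk(frame);       /* write back dirty victim */
        victim->valid = false;               /* victim no longer mapped */
        read_page_from_disk(vpn, frame);
    }
    pt[vpn].frame = frame;
    pt[vpn].valid = true;                    /* HW retry now succeeds */
}
```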

  18. Page size • Smaller page size (typically 4KB) • PROS: minimizes internal fragmentation • CONS: increases the size of the page table (see the calculation below) • Bigger page size (called “superpages” if > 4K) • PROS: • Amortizes disk access cost • May prefetch useful data • May discard useless data early • CONS: • Increased fragmentation • Might transfer unnecessary info at the expense of useful info • Lots of work to increase page size beyond 4K • HW has supported it for years; the OS is the “bottleneck” • Attractive because: • Bigger DRAMs, increasing memory/disk performance gap
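To quantify the table-size tradeoff: a flat table needs one PTE per virtual page, so for an assumed 4GB virtual space with 4-byte PTEs, 4KB pages imply a 4MB table while 2MB superpages need only 8KB. A sketch of that arithmetic:

```c
#include <stdint.h>
#include <stdio.h>

/* Flat page-table size: one PTE per virtual page. Illustrates the
 * slide's tradeoff: smaller pages => bigger page table. */
int main(void) {
    uint64_t vspace   = 1ULL << 32;  /* assumed 4GB virtual space */
    uint64_t pte_size = 4;           /* assumed 4-byte PTEs */
    for (int shift = 12; shift <= 21; shift += 9) {  /* 4KB, then 2MB */
        uint64_t npages = vspace >> shift;
        printf("page size %7llu B -> %8llu PTEs, table = %llu KB\n",
               1ULL << shift, (unsigned long long)npages,
               (unsigned long long)(npages * pte_size / 1024));
    }
    return 0;
}
```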

  19. TLB (translation lookaside buffer) • [Flowchart: virtual address → access TLB; on a TLB hit, the physical address is available immediately; on a miss, access the page table] • Page table resides in memory • Each translation requires a memory access • Might be required for each load/store! • TLB • Caches recently used PTEs • Speeds up translation • Typically 128 to 256 entries • Usually 4 to 8 way associative • TLB access time is comparable to L1 cache access time • (A lookup sketch follows below)
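A hedged sketch of a TLB lookup in C, assuming 128 entries organized as 32 sets × 4 ways (one of the configurations the slide mentions); the entry layout and names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_SETS 32   /* assumed: 128 entries / 4 ways */
#define TLB_WAYS 4

/* Hypothetical entry of a 4-way set-associative TLB. */
struct tlb_entry { bool valid; uint64_t tag; uint32_t frame; };
static struct tlb_entry tlb[TLB_SETS][TLB_WAYS];

/* Index by the low VPN bits, then compare tags across all ways. */
bool tlb_lookup(uint64_t vpn, uint32_t *frame) {
    uint64_t set = vpn % TLB_SETS;   /* set number = low VPN bits */
    uint64_t tag = vpn / TLB_SETS;   /* remaining VPN bits = tag */
    for (int way = 0; way < TLB_WAYS; way++) {
        struct tlb_entry *e = &tlb[set][way];
        if (e->valid && e->tag == tag) {  /* TLB hit */
            *frame = e->frame;
            return true;
        }
    }
    return false;  /* TLB miss: walk the page table, then refill */
}
```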

  20. Making address translation fast • TLB is a cache for recent address translations: • [Figure: the TLB holds (valid, tag, physical page) triples for recently used virtual page numbers; on a TLB miss, the page table in memory supplies the physical page or the disk address]

  21. TLB access • [Figure: a 4-way set-associative TLB; the virtual page number splits into tag and set number (the page offset is unused); the set number selects a set, the tags of the four ways are compared in parallel, and a way MUX delivers the matching PTE together with hit/miss]

  22. Unified L2 • L2 is unified (no separation for data/inst) – like the main memory • In case of a miss in either d-L1, i-L1, d-TLB, or i-TLB => try to get the missed data from L2 • PTEs can and do reside in L2 • [Figure: the data TLB with the L1 data cache, and the instruction TLB with the L1 instruction cache, all fall back to the unified L2 cache, which falls back to memory]

  23. VM & cache • [Flowchart: virtual address → access TLB; on a TLB miss, access the page table in memory; with the physical address, access the L1 cache, then L2 on an L1 miss, then memory on an L2 miss] • TLB access is serial with cache access => performance is crucial! • Page table entries can be cached in the L2 cache (as data)

  24. Overlapped TLB & cache access • VM view of a physical address: physical page number (bits 29:12) + page offset (bits 11:0) • Cache view of a physical address: tag (bits 29:14) + set (bits 13:6) + disp (bits 5:0) • Here the #Set is not contained within the page offset • The #Set is not known until the physical page number is known • => The cache can be accessed only after address translation is done

  25. Overlapped TLB & cache access (cont) • VM view of a physical address: physical page number (bits 29:12) + page offset (bits 11:0) • Cache view of a physical address: tag (bits 29:12) + set (bits 11:6) + disp (bits 5:0) • In this example the #Set is contained within the page offset • The #Set is known immediately • The cache can be accessed in parallel with address translation • Once translation is done, match the upper bits with the tags • Limitation: cache size ≤ (page size × associativity)

  26. Overlapped TLB & cache access (cont) • [Figure: the set bits, taken from the page offset, index the cache while the TLB translates the virtual page number in parallel; the physical page number from the TLB is then compared against the cache tags, and a way MUX produces hit/miss and the data]

  27. Overlapped TLB & cache access (cont) • Assume the cache is 32KB, 2-way set-associative, 64 bytes/line • (2^15 / 2 ways) / (2^6 bytes/line) = 2^(15-1-6) = 2^8 = 256 sets • In order to still allow overlap between set access and TLB access • Take the upper two bits of the set number from bits [1:0] of the VPN • Physical_addr[13:12] may differ from virtual_addr[13:12] • Tag is comprised of bits [31:12] of the physical address • The tag may mismatch bits [13:12] of the physical address • Cache miss => allocate the missing line according to its virtual set address and physical tag • [Figure: physical address = physical page number (bits 29:12) + page offset (bits 11:0); cache view = tag + set (bits 13:12 supplied by VPN[1:0], bits 11:6 by the offset) + disp (bits 5:0)] • (See the index computation below)
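A quick check of the arithmetic above, computing the 8-bit set index (bits [13:6]) and the two index bits that must come from VPN[1:0]; the address is arbitrary.

```c
#include <stdint.h>
#include <stdio.h>

/* The slide's example: 32KB cache, 2-way, 64B lines =>
 * (2^15 / 2) / 2^6 = 2^8 = 256 sets => 8 set-index bits [13:6]. */
int main(void) {
    uint64_t vaddr = 0x12345678ULL;          /* arbitrary example */
    unsigned set      = (vaddr >> 6) & 0xff; /* set index, bits [13:6] */
    /* Bits [13:12] of the set index lie above the 4KB page offset,
     * so they come from VPN[1:0], before translation finishes. */
    unsigned from_vpn = (vaddr >> 12) & 0x3;
    printf("set = %u (upper 2 bits %u taken from VPN[1:0])\n",
           set, from_vpn);
    return 0;
}
```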

  28. Swap & DMA (direct memory access) • DMA copies the page to the disk controller • Accesses memory without requiring CPU involvement • For each line of the page: • Executes snoop-invalidate for the line in the cache (both L1 and L2) • If the line resides in the cache: • If it is modified, the line is read from the cache into memory • The line is invalidated • Writes the line to the disk controller • This means that when a page is swapped out of memory • All data in the caches which belongs to that page is invalidated • The page on the disk is up-to-date • The TLB is snooped • If the TLB hits for the swapped-out page, the TLB entry is invalidated • In the page table • Assign 0 to the valid bit in the PTE of swapped-out pages • The rest of the PTE bits may be used by the OS for keeping the location of the page on disk

  29. Context switch • Each process has its own address space • Akin to saying “each process has its own page table” • OS allocates frames for a process => updates its page table • If only one PTE points to a frame throughout the system • Only the associated process can access the corresponding frame • Shared memory • Two PTEs of two processes point to the same frame • Upon context switching • Save the current architectural state to memory • Architectural registers • Register that holds the page table base address in memory • Flush the TLB • Same virtual addresses are routinely reused • Load the new architectural state from memory • Architectural registers • Register that holds the page table base address in memory

  30. Virtually-addressed cache • [Figure: the CPU accesses the cache with a virtual address (VA); translation to a physical address (PA) happens only on the way to main memory] • Cache uses virtual addresses (tags are virtual) • Address translation is required only on a cache miss • TLB not in the path to a cache hit! But… • Aliasing: 2 virtual addresses mapped to the same physical address • => 2 cache lines holding data of the same physical address • => Must update all cache entries with the same physical address

  31. Virtually-addressed cache • Cache must be flushed at task switch • Possible solution: include a unique process ID (PID) in the tag • How to share & synchronize memory among processes? • As noted, must permit multiple virtual pages to refer to the same physical frame • Problem: incoherence if the aliases map to different cache lines • Solution: require sufficiently many common virtual LSBs (see the check below) • With a direct-mapped cache, this guarantees that the aliases all map to the same cache line
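A small sketch of the "common LSBs" condition: two virtual aliases of one frame index the same set of a virtually-addressed cache only if they agree in the offset+index bits (the constraint behind what is often called page coloring); parameter names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Two virtual aliases index the same cache set iff they share the
 * low (offset_bits + index_bits) address bits. With a direct-mapped
 * cache, same set implies same cache line, avoiding alias copies. */
bool same_cache_set(uint64_t va1, uint64_t va2,
                    unsigned offset_bits, unsigned index_bits) {
    uint64_t mask = (1ULL << (offset_bits + index_bits)) - 1;
    return (va1 & mask) == (va2 & mask);
}
```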
