Fabián E. Bustamante, Spring 2007

Virtual Memory
Today
• Motivations for VM
• Address translation
• Accelerating translation with TLBs
• Dynamic memory allocation: mechanisms & policies
• Memory bugs

A system with physical memory only
• Addresses generated by the CPU correspond directly to bytes in physical memory
• E.g., most Cray machines, early PCs, nearly all embedded systems
[Figure: the CPU sends physical addresses directly to memory locations 0 through N-1]
EECS 213 Introduction to Computer Systems, Northwestern University

A system with virtual memory
• Modern processors use virtual addresses
• Hardware converts virtual addresses to physical addresses via an OS-managed lookup table (the page table)
• E.g., workstations, servers, modern PCs
[Figure: the CPU issues virtual addresses; the page table maps each one to a physical address in memory or to a location on disk]

Motivations for virtual memory
• Use physical DRAM as a cache for the disk
  • Address space of a process can exceed physical memory size
  • Sum of address spaces of multiple processes can exceed physical memory
• Simplify memory management
  • Multiple processes resident in main memory
    • Each process with its own address space
  • Only “active” code and data is actually in memory
    • Allocate more memory to process as needed
• Provide protection
  • One process can’t interfere with another
    • because they operate in different address spaces
  • User process cannot access privileged information
    • different sections of address spaces have different permissions

Motivation #1: DRAM a “cache” for disk
• Full address space is quite large:
  • 32-bit addresses: ~4,000,000,000 (4 billion) bytes
  • 64-bit addresses: ~16,000,000,000,000,000,000 (16 quintillion) bytes
• Disk storage is ~300X cheaper than DRAM storage
  • 80 GB of DRAM: ~$33,000
  • 80 GB of disk: ~$110
• To access large amounts of data in a cost-effective manner, the bulk of the data must be stored on disk
[Figure: typical prices. SRAM, 4 MB: ~$500; DRAM, 1 GB: ~$200; Disk, 80 GB: ~$110]

Levels in memory hierarchy
[Figure: CPU registers, cache, memory, and disk in a chain; the transfer units between adjacent levels are 8 B, 32 B, and 4 KB. The register/cache/memory boundary is labeled “cache”, the memory/disk boundary “virtual memory”]

             Register   Cache       Memory      Disk Memory
size:        32 B       32KB-4MB    1024 MB     100 GB
speed:       1 ns       2 ns        30 ns       8 ms
$/Mbyte:                $125/MB     $0.20/MB    $0.001/MB
line size:   8 B        32 B        4 KB

Moving right, each level is larger, slower, and cheaper.

DRAM vs. SRAM as a “cache”
• DRAM vs. disk is more extreme than SRAM vs. DRAM
  • Access latencies:
    • DRAM ~10X slower than SRAM
    • Disk ~100,000X slower than DRAM
  • Importance of exploiting spatial locality:
    • First byte is ~100,000X slower than successive bytes on disk
    • vs. ~4X improvement for page-mode vs. regular accesses to DRAM
• Bottom line:
  • Design decisions made for DRAM caches are driven by the enormous cost of misses

Impact of properties on design
• If DRAM were organized like an SRAM cache, how would we set the following design parameters?
  • Line size? Large, since disk is better at transferring large blocks
  • Associativity? High, to minimize miss rate
  • Write through or write back? Write back, since we can’t afford to perform small writes to disk
• What would the impact of these choices be on:
  • Miss rate: Extremely low, << 1%
  • Hit time: Must match cache/DRAM performance
  • Miss latency: Very high, ~20 ms
  • Tag storage overhead: Low, relative to block size

Locating an object in a “cache”
• SRAM cache
  • Tag stored with cache line
  • Maps from cache block to memory blocks
    • From cached to uncached form
    • Save a few bits by only storing tag
  • No tag for block not in cache
  • Hardware retrieves information
    • Can quickly match against multiple tags
[Figure: a cache with lines 0 to N-1, each holding a tag and data (D: 243, X: 17, J: 105); the object name X is compared against the stored tags]

Locating an object in a “cache” (cont.)
• DRAM cache
  • Each allocated page of virtual memory has an entry in the page table
  • Mapping from virtual pages to physical pages
    • From uncached form to cached form
  • Page table entry even if page not in memory
    • Specifies disk address
    • Only way to indicate where to find page
  • OS retrieves information
[Figure: a page table maps object names (D, J, X) either to a location in the DRAM “cache” or to “On Disk”]

Page faults (like “cache misses”)
• What if an object is on disk rather than in memory?
  • Page table entry indicates virtual address not in memory
  • OS exception handler invoked to move data from disk into memory
    • current process suspends, others can resume
    • OS has full control over placement, etc.
[Figure: before the fault, the page table entry for the virtual address points to disk; after the fault, it points to a physical address in memory]

Servicing a page fault
• Processor signals controller
  • Read block of length P starting at disk address X and store starting at memory address Y
• Read occurs
  • Direct Memory Access (DMA)
  • Under control of I/O controller
• I/O controller signals completion
  • Interrupt processor
  • OS resumes suspended process
[Figure: processor, cache, memory, and I/O controller on the memory-I/O bus, with disks behind the controller. (1) Initiate block read, (2) DMA transfer, (3) Read done]

Motivation #2: Memory management
• Multiple processes can reside in physical memory.
• How do we resolve address conflicts?
  • what if two processes access something at the same address?
[Figure: Linux/x86 process memory image, from high addresses to 0: kernel virtual memory (memory invisible to user code); stack (%esp); memory mapped region for shared libraries; run-time heap (via malloc), topped by the “brk” ptr; uninitialized data (.bss); initialized data (.data); program text (.text); forbidden]

Solution: Separate virtual addr. spaces
• Virtual and physical address spaces divided into equal-sized blocks
  • blocks are called “pages” (both virtual and physical)
• Each process has its own virtual address space
  • operating system controls how virtual pages are assigned to physical memory
[Figure: address translation maps the virtual address spaces of Process 1 (0 to N-1) and Process 2 (0 to N-1) onto the physical address space (DRAM, 0 to M-1). Their virtual pages VP 1 and VP 2 land on physical pages PP 2, PP 7, and PP 10; PP 7 is shared by both processes (e.g., read-only library code)]

Motivation #3: Protection
• Page table entry contains access rights information
  • hardware enforces this protection (trap into OS if violation occurs)

Page tables:
Process i:   Read?   Write?   Physical Addr
  VP 0:      Yes     No       PP 9
  VP 1:      Yes     Yes      PP 4
  VP 2:      No      No       XXXXXXX

Process j:   Read?   Write?   Physical Addr
  VP 0:      Yes     Yes      PP 6
  VP 1:      Yes     No       PP 9
  VP 2:      No      No       XXXXXXX

[Figure: both tables point into a memory with pages 0 to N-1]

VM address translation
• Virtual address space
  • V = {0, 1, …, N–1}
• Physical address space
  • P = {0, 1, …, M–1}
  • M < N
• Address translation
  • MAP: V → P ∪ {∅}
  • For virtual address a:
    • MAP(a) = a' if data at virtual address a is at physical address a' in P
    • MAP(a) = ∅ if data at virtual address a is not in physical memory
      • Either invalid or stored on disk

VM address translation: Hit
[Figure: the processor sends virtual address a to the hardware address translation mechanism (part of the on-chip memory management unit, MMU), which sends physical address a' to main memory]

VM address translation: Miss
[Figure: the processor sends virtual address a to the hardware address translation mechanism (part of the on-chip MMU). When a is not in physical memory (∅), the hardware raises a page fault; the fault handler has the OS transfer the page from secondary memory to main memory (only on a miss), after which the access proceeds with physical address a']

VM address translation
• Parameters
  • P = 2^p = page size (bytes)
  • N = 2^n = virtual address limit
  • M = 2^m = physical address limit
• Virtual address: bits n–1 to p hold the virtual page number, bits p–1 to 0 the page offset
• Physical address: bits m–1 to p hold the physical page number, bits p–1 to 0 the page offset
• Page offset bits don’t change as a result of translation
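The split described above can be sketched in C. This is only an illustration: the 4 KB page size (p = 12), the 32-bit address width, and the helper names `vpn_of`, `offset_of`, and `make_pa` are assumptions, not part of the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed parameter for illustration: 4 KB pages, so p = 12. */
enum { PAGE_SHIFT = 12 };
#define PAGE_SIZE   ((uint32_t)1 << PAGE_SHIFT)
#define OFFSET_MASK (PAGE_SIZE - 1)

/* Virtual page number: the high n-p bits of the virtual address. */
static uint32_t vpn_of(uint32_t va) { return va >> PAGE_SHIFT; }

/* Page offset: the low p bits, which translation leaves unchanged. */
static uint32_t offset_of(uint32_t va) { return va & OFFSET_MASK; }

/* Build a physical address from a physical page number and an offset. */
static uint32_t make_pa(uint32_t ppn, uint32_t offset)
{
    assert(offset < PAGE_SIZE);
    return (ppn << PAGE_SHIFT) | offset;
}
```

Translation then amounts to replacing `vpn_of(va)` with a physical page number and reattaching `offset_of(va)` through `make_pa`.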

Page tables
[Figure: the virtual page number indexes a memory-resident page table. Each entry holds a valid bit and either a physical page number or a disk address: entries with valid = 1 point into physical memory, entries with valid = 0 point to disk storage (swap file or regular file system file)]

Address translation via page table
[Figure: the page table base register locates the page table, and the virtual page number (VPN, bits n–1 to p of the virtual address) acts as the table index. The selected entry holds valid, access, and physical page number (PPN) fields; if valid = 0 the page is not in memory. The physical address is the PPN (bits m–1 to p) followed by the unchanged page offset (bits p–1 to 0)]

Page table operation
• Translation
  • Separate (set of) page table(s) per process
  • VPN forms index into page table (points to a page table entry)

Page table operation
• Computing physical address
  • Page Table Entry (PTE) provides info about page
    • if (valid bit = 1) then the page is in memory
      • Use physical page number (PPN) to construct address
    • if (valid bit = 0) then the page is on disk: page fault

Page table operation
• Checking protection
  • Access rights field indicates allowable access
    • e.g., read-only, read-write, execute-only
    • typically supports multiple protection modes
  • Protection violation fault if user doesn’t have necessary permission
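The three page table steps (index by VPN, check the valid bit, check access rights) can be put together in a small C sketch. The `pte_t` layout, the `ACC_*` bits, the 4 KB page size, and the order of the two checks are all assumptions made for illustration; real MMUs define their own PTE formats.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { PAGE_SHIFT = 12 };               /* assumed 4 KB pages */
enum { ACC_READ = 1, ACC_WRITE = 2 };   /* hypothetical access-rights bits */

typedef struct {
    bool     valid;    /* true: page is in memory; false: page fault */
    unsigned access;   /* allowed kinds of access (ACC_* bits) */
    uint32_t ppn;      /* physical page number, meaningful only if valid */
} pte_t;

typedef enum { XLATE_OK, XLATE_PAGE_FAULT, XLATE_PROTECTION_FAULT } xlate_t;

/* Translate va for an access of kind `wanted`; on success store the
   physical address in *pa. */
xlate_t translate(const pte_t *table, size_t n_entries,
                  uint32_t va, unsigned wanted, uint32_t *pa)
{
    uint32_t vpn = va >> PAGE_SHIFT;            /* VPN indexes the table */
    assert(vpn < n_entries);
    const pte_t *pte = &table[vpn];

    if (!pte->valid)
        return XLATE_PAGE_FAULT;                /* page is not in memory */
    if ((pte->access & wanted) != wanted)
        return XLATE_PROTECTION_FAULT;          /* access not permitted */

    *pa = (pte->ppn << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1));
    return XLATE_OK;
}
```

With a table holding the entries shown for Process i in the protection slide (VP 0 read-only at PP 9, VP 1 read-write at PP 4, VP 2 invalid), a write to VP 0 yields a protection fault and any access to VP 2 yields a page fault.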

Multi-level page tables
• Given:
  • 4 KB (2^12) page size
  • 32-bit address space
  • 4-byte PTE
• Problem:
  • Would need a 4 MB page table!
    • 2^20 × 4 bytes
• Common solution
  • multi-level page tables
  • e.g., 2-level table (P6)
    • Level 1 table: 1024 entries, each of which points to a Level 2 page table
    • Level 2 table: 1024 entries, each of which points to a page
[Figure: a Level 1 table whose entries point to Level 2 tables]
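The arithmetic behind these numbers can be checked with a short C sketch. It assumes the 10 + 10 + 12 bit split implied by two levels of 1024-entry tables and 4 KB pages; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit address = 10-bit Level 1 index, 10-bit Level 2 index, 12-bit offset. */
static uint32_t l1_index(uint32_t va)    { return (va >> 22) & 0x3FFu; }
static uint32_t l2_index(uint32_t va)    { return (va >> 12) & 0x3FFu; }
static uint32_t page_offset(uint32_t va) { return va & 0xFFFu; }

/* Size in bytes of a flat, single-level table: 2^20 PTEs of 4 bytes each. */
static uint32_t flat_table_bytes(void)
{
    uint32_t entries = 1u << (32 - 12);     /* one PTE per virtual page */
    assert(entries == 1024u * 1024u);
    return entries * 4u;
}
```

The flat table always costs 4 MB per process, whereas the two-level layout only needs a Level 2 table for those 4 MB regions of the address space that are actually in use.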

Integrating VM and cache
• Most caches “physically addressed”
  • Accessed by physical addresses
  • Allows multiple processes to have blocks in cache at a time
  • Allows multiple processes to share pages
  • Cache doesn’t need to be concerned with protection issues
    • Access rights checked as part of address translation
• Perform address translation before cache lookup
  • But this could involve a memory access itself (of the PTE)
  • Of course, page table entries can also become cached
[Figure: the CPU sends a virtual address (VA) to translation, which sends a physical address (PA) to the cache; a hit returns data to the CPU, a miss goes to main memory]

Speeding up translation with a TLB
• “Translation Lookaside Buffer” (TLB)
  • Small hardware cache in MMU
  • Maps virtual page numbers to physical page numbers
  • Contains complete page table entries for small number of pages
[Figure: the CPU sends a virtual address (VA) to the TLB lookup. On a TLB hit the physical address (PA) goes straight to the cache; on a TLB miss, translation is performed first. A cache hit returns data to the CPU; a cache miss goes to main memory]

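Address translation with a TLB
[Figure: the virtual page number (bits n–1 to p of the virtual address) is compared against the tags of the TLB entries, each holding valid, tag, and physical page number fields; a match on a valid entry is a TLB hit. The resulting physical address is split into tag, index, and byte offset for the cache, whose lines hold valid, tag, and data fields; a tag match on a valid line is a cache hit and returns data]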
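A set-associative TLB lookup of the kind pictured here might look as follows in C. This is a sketch under assumptions: the 4-set, 4-way geometry, the `tlb_entry_t` layout, and the flat array indexing are illustrative choices, not a description of any particular processor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { TLB_SETS = 4, TLB_WAYS = 4 };   /* hypothetical geometry */

typedef struct {
    bool     valid;
    uint32_t tag;   /* high bits of the virtual page number */
    uint32_t ppn;   /* cached physical page number */
} tlb_entry_t;

/* tlb points to TLB_SETS * TLB_WAYS entries, stored set by set.
   Returns true on a TLB hit and stores the PPN in *ppn. */
bool tlb_lookup(const tlb_entry_t *tlb, uint32_t vpn, uint32_t *ppn)
{
    assert(tlb != NULL && ppn != NULL);
    uint32_t set = vpn % TLB_SETS;     /* low VPN bits select the set */
    uint32_t tag = vpn / TLB_SETS;     /* remaining VPN bits form the tag */

    for (int way = 0; way < TLB_WAYS; way++) {
        const tlb_entry_t *e = &tlb[set * TLB_WAYS + way];
        if (e->valid && e->tag == tag) {
            *ppn = e->ppn;
            return true;
        }
    }
    return false;                      /* miss: fall back to the page table */
}
```

Hardware compares all ways of the selected set in parallel; the loop here is only a sequential stand-in for that comparison.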

Taking stock: main themes
• Programmer’s view
  • Large “flat” address space
    • Can allocate large blocks of contiguous addresses
  • Processor “owns” machine
    • Has private address space
    • Unaffected by behavior of other processes
• System view
  • Virtual address space created by mapping to set of pages
    • Need not be contiguous
    • Allocated dynamically
    • Enforce protection during address translation
  • OS manages many processes simultaneously
    • Continually switching among processes
    • Especially when one must wait for resource
      • E.g., disk I/O to handle page fault

Simple memory system
• Memory is byte addressable
• Accesses are to 1-byte words
• 14-bit virtual addresses, 12-bit physical addresses
• Page size = 64 bytes (2^6)
• Virtual address: bits 13 to 6 are the VPN (Virtual Page Number), bits 5 to 0 the VPO (Virtual Page Offset)
• Physical address: bits 11 to 6 are the PPN (Physical Page Number), bits 5 to 0 the PPO (Physical Page Offset)

Simple memory system page table
Only the first 16 entries are shown.

VPN  PPN  Valid    VPN  PPN  Valid
00   28   1        08   13   1
01   –    0        09   17   1
02   33   1        0A   09   1
03   02   1        0B   –    0
04   –    0        0C   –    0
05   16   1        0D   2D   1
06   –    0        0E   11   1
07   –    0        0F   0D   1

Simple memory system TLB
• TLB
  • 16 entries
  • 4-way associative
• Within the VPN (bits 13 to 6), the low 2 bits (7 to 6) are the TLB index (TLBI) and the high 6 bits (13 to 8) the TLB tag (TLBT)

Set  Tag PPN Valid   Tag PPN Valid   Tag PPN Valid   Tag PPN Valid
0    03  –   0       09  0D  1       00  –   0       07  02  1
1    03  2D  1       02  –   0       04  –   0       0A  –   0
2    02  –   0       08  –   0       06  –   0       03  –   0
3    07  –   0       03  0D  1       0A  34  1       02  –   0

Simple memory system cache
• Cache
  • 16 lines
  • 4-byte line size
  • Direct mapped
• Physical address: bits 11 to 6 are the cache tag (CT), bits 5 to 2 the cache index (CI), bits 1 to 0 the cache offset (CO)

Idx  Tag  Valid  B0  B1  B2  B3     Idx  Tag  Valid  B0  B1  B2  B3
0    19   1      99  11  23  11     8    24   1      3A  00  51  89
1    15   0      –   –   –   –      9    2D   0      –   –   –   –
2    1B   1      00  02  04  08     A    2D   1      93  15  DA  3B
3    36   0      –   –   –   –      B    0B   0      –   –   –   –
4    32   1      43  6D  8F  09     C    12   0      –   –   –   –
5    0D   1      36  72  F0  1D     D    16   1      04  96  34  15
6    31   0      –   –   –   –      E    13   1      83  77  1B  D3
7    16   1      11  C2  DF  03     F    14   0      –   –   –   –

Address translation problem 10.12
• Virtual address 0x03a9
  • VPN ___ TLBI ___ TLBTag ____ TLB Hit? __ Page Fault? __ PPN: ____
• Physical address
  • Offset ___ CI ___ CT ____ Hit? __ Byte returned: ____
[Figure: virtual address layout (VPN in bits 13 to 6, split into TLBT and TLBI; VPO in bits 5 to 0) and physical address layout (PPN in bits 11 to 6; PPO in bits 5 to 0; fields CT, CI, CO)]

Address translation problem 10.13
• Virtual address 0x0040
  • VPN ___ TLBI ___ TLBTag ____ TLB Hit? __ Page Fault? __ PPN: ____
• Physical address
  • Offset ___ CI ___ CT ____ Hit? __ Byte returned: ____
[Figure: virtual address layout (VPN in bits 13 to 6, split into TLBT and TLBI; VPO in bits 5 to 0) and physical address layout (PPN in bits 11 to 6; PPO in bits 5 to 0; fields CT, CI, CO)]
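The bit-field extraction that these two problems ask for can be checked mechanically. The C sketch below covers only the field splitting for the simple memory system (14-bit virtual addresses, 12-bit physical addresses, 64-byte pages, 4 TLB sets, 16 cache lines of 4 bytes); the struct and function names are hypothetical, and the TLB, page table, and cache lookups are still left to the reader.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { unsigned vpn, vpo, tlbi, tlbt; } va_fields_t;
typedef struct { unsigned co, ci, ct; } pa_fields_t;

/* 14-bit virtual address: 8-bit VPN (6-bit TLBT + 2-bit TLBI), 6-bit VPO. */
va_fields_t split_va(uint16_t va)
{
    assert(va < (1u << 14));
    va_fields_t f;
    f.vpo  = va & 0x3Fu;        /* bits 5..0 */
    f.vpn  = va >> 6;           /* bits 13..6 */
    f.tlbi = f.vpn & 0x3u;      /* low 2 bits of the VPN pick the TLB set */
    f.tlbt = f.vpn >> 2;        /* remaining 6 bits are the TLB tag */
    return f;
}

/* 12-bit physical address: 6-bit CT, 4-bit CI, 2-bit CO. */
pa_fields_t split_pa(uint16_t pa)
{
    assert(pa < (1u << 12));
    pa_fields_t f;
    f.co = pa & 0x3u;           /* bits 1..0: byte within the line */
    f.ci = (pa >> 2) & 0xFu;    /* bits 5..2: cache line index */
    f.ct = pa >> 6;             /* bits 11..6: cache tag */
    return f;
}
```

For instance, `split_va(0x03A9)` gives VPN 0x0E, VPO 0x29, TLBI 2, and TLBT 0x03, and `split_va(0x0040)` gives VPN 0x01, VPO 0x00, TLBI 1, and TLBT 0x00.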

Harsh reality
• Memory matters
• Memory is not unbounded
  • It must be allocated and managed
  • Many applications are memory dominated
    • Especially those based on complex graph algorithms
• Memory referencing bugs are especially pernicious
  • Effects are distant in both time and space
• Memory performance is not uniform
  • Cache and virtual memory effects can greatly affect program performance
  • Adapting program to characteristics of memory system can lead to major speed improvements

Dynamic memory allocation
• Explicit vs. implicit memory allocator
  • Explicit: application allocates and frees space
    • E.g., malloc and free in C
  • Implicit: application allocates, but does not free space
    • E.g., garbage collection in Java, ML or Lisp
• Allocation
  • In both cases the memory allocator provides an abstraction of memory as a set of blocks
  • Doles out free memory blocks to application
• Will discuss simple explicit memory allocation today
[Figure: the application sits on top of the dynamic memory allocator, which manages heap memory]

Process memory image
• Allocators request additional heap memory from the operating system using the sbrk function.
[Figure: process memory image, from high addresses to 0: kernel virtual memory (memory invisible to user code); stack (%esp); memory mapped region for shared libraries; run-time heap (via malloc), topped by the “brk” ptr; uninitialized data (.bss); initialized data (.data); program text (.text)]

Malloc package
• #include <stdlib.h>
• void *malloc(size_t size)
  • If successful:
    • Returns a pointer to a memory block of at least size bytes, (typically) aligned to 8-byte boundary.
    • If size == 0, returns NULL
  • If unsuccessful: returns NULL (0) and sets errno.
• void *realloc(void *p, size_t size)
  • Changes size of block p and returns pointer to new block.
  • Contents of new block unchanged up to min of old and new size.
• void free(void *p)
  • Returns the block pointed at by p to pool of available memory
  • p must come from a previous call to malloc or realloc.

Malloc example

    #include <stdio.h>
    #include <stdlib.h>

    void foo(int n, int m) {
        int i, *p;

        /* allocate a block of n ints */
        if ((p = (int *) malloc(n * sizeof(int))) == NULL) {
            perror("malloc");
            exit(EXIT_FAILURE);   /* nonzero status signals the failure */
        }
        for (i = 0; i < n; i++)
            p[i] = i;

        /* grow the block so it holds m more ints */
        if ((p = (int *) realloc(p, (n + m) * sizeof(int))) == NULL) {
            perror("realloc");
            exit(EXIT_FAILURE);
        }
        for (i = n; i < n + m; i++)
            p[i] = i;

        /* print new array */
        for (i = 0; i < n + m; i++)
            printf("%d\n", p[i]);

        free(p); /* return p to available memory pool */
    }

Allocation examples
• p1 = malloc(4)
• p2 = malloc(5)
• p3 = malloc(6)
• free(p2)
• p4 = malloc(2)

Constraints
• Applications:
  • Can issue arbitrary sequence of allocation and free requests
  • Free requests must correspond to an allocated block
• Allocators
  • Can’t control number or size of allocated blocks
  • Must respond immediately to all allocation requests
    • i.e., can’t reorder or buffer requests
  • Must allocate blocks from free memory
    • i.e., can only place allocated blocks in free memory
  • Must align blocks so they satisfy all alignment requirements
    • 8-byte alignment for GNU malloc (libc malloc) on Linux boxes
  • Can only manipulate and modify free memory
  • Can’t move the allocated blocks once they are allocated
    • i.e., compaction is not allowed

Goals of good malloc/free
• Primary goals
  • Good time performance for malloc and free
    • Ideally should take constant time (not always possible)
    • Should certainly not take linear time in the number of blocks
  • Good space utilization
    • User allocated structures should be a large fraction of the heap.
    • Want to minimize “fragmentation”.
• Some other goals
  • Good locality properties
    • Structures allocated close in time should be close in space
    • “Similar” objects should be allocated close in space
  • Robust
    • Can check that free(p1) is on a valid allocated object p1
    • Can check that memory references are to allocated space

Performance goals: throughput
• Given some sequence of malloc and free requests:
  • R_0, R_1, ..., R_k, ..., R_(n-1)
• Want to maximize throughput and peak memory utilization.
  • These goals are often conflicting
• Throughput:
  • Number of completed requests per unit time
  • Example:
    • 5,000 malloc calls and 5,000 free calls in 10 seconds
    • Throughput is 10,000 operations / 10 seconds = 1,000 operations/second.

Performance goals: Peak mem utilization
• Given some sequence of malloc and free requests:
  • R_0, R_1, ..., R_k, ..., R_(n-1)
• Def: Aggregate payload P_k:
  • malloc(p) results in a block with a payload of p bytes.
  • After request R_k has completed, the aggregate payload P_k is the sum of currently allocated payloads.
• Def: Current heap size is denoted by H_k
  • Assume that H_k is monotonically nondecreasing
• Def: Peak memory utilization:
  • After k requests, peak memory utilization is:
  • U_k = ( max_{i<k} P_i ) / H_k
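As a sketch of how the definition above could be evaluated, the following C function computes U_k from recorded per-request values. The array-based interface is an assumption for illustration: `payload[i]` holds the aggregate payload P_i after request R_i, and `heap[i]` the heap size after R_i, so the heap size after k requests is `heap[k - 1]`.

```c
#include <assert.h>
#include <stddef.h>

/* Peak memory utilization after k requests:
   the largest aggregate payload seen so far divided by the current heap size. */
double peak_utilization(const size_t *payload, const size_t *heap, size_t k)
{
    assert(k > 0);
    size_t max_payload = 0;
    for (size_t i = 0; i < k; i++) {
        if (payload[i] > max_payload)
            max_payload = payload[i];
    }
    assert(heap[k - 1] > 0);
    return (double)max_payload / (double)heap[k - 1];
}
```

Because the numerator is a running maximum, freeing blocks never lowers U_k; only growing the heap does.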

Internal fragmentation
• Poor memory utilization caused by fragmentation.
  • Comes in two forms: internal and external fragmentation
• Internal fragmentation
  • For some block, internal fragmentation is the difference between the block size and the payload size.
  • Caused by overhead of maintaining heap data structures, padding for alignment purposes, or explicit policy decisions (e.g., not to split the block).
  • Depends only on the pattern of previous requests, and thus is easy to measure.
[Figure: a block containing a payload, with internal fragmentation on both sides of the payload]
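A minimal C sketch of the header and alignment overheads named above, assuming a 4-byte header and 8-byte alignment (both values, and the function names, are illustrative assumptions rather than the behavior of any specific allocator):

```c
#include <assert.h>
#include <stddef.h>

enum { HEADER_SIZE = 4, ALIGNMENT = 8 };   /* assumed values */

/* Block size: header plus payload, rounded up to a multiple of ALIGNMENT. */
size_t block_size_for(size_t payload)
{
    assert(payload > 0);
    size_t raw = payload + HEADER_SIZE;
    return (raw + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
}

/* Internal fragmentation: block size minus payload size. */
size_t internal_fragmentation(size_t payload)
{
    return block_size_for(payload) - payload;
}
```

Under these assumptions a 4-byte payload fits an 8-byte block (4 bytes of overhead), while a 5-byte payload needs a 16-byte block (11 bytes of overhead).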

External fragmentation
• Occurs when there is enough aggregate heap memory, but no single free block is large enough
  • p1 = malloc(4)
  • p2 = malloc(5)
  • p3 = malloc(6)
  • free(p2)
  • p4 = malloc(6): oops!
• External fragmentation depends on the pattern of future requests, and thus is difficult to measure.

Implementation issues
• How do we know how much memory to free just given a pointer?
• How do we keep track of the free blocks?
• What do we do with the extra space when allocating a structure that is smaller than the free block it is placed in?
• How do we pick a block to use for allocation, when many might fit?
• How do we reinsert a freed block?

Knowing how much to free
• Standard method
  • Keep the length of a block in the word preceding the block.
    • This word is often called the header field or header
  • Requires an extra word for every allocated block
[Figure: p0 = malloc(4) returns a pointer to the data just past a header word holding the block size, 5; free(p0) reads that header]
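The header technique can be illustrated with a toy wrapper around the C library allocator. This is a sketch, not how any real malloc is implemented: the `sized_*` names are hypothetical, the header here records the payload size in bytes rather than the block size in words, and a union with `max_align_t` is used so the payload stays suitably aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Header placed immediately before the payload. */
typedef union {
    size_t      size;    /* payload size in bytes */
    max_align_t align;   /* forces header size to preserve alignment */
} header_t;

void *sized_malloc(size_t size)
{
    header_t *h = malloc(sizeof(header_t) + size);
    if (h == NULL)
        return NULL;
    h->size = size;
    return h + 1;        /* payload starts right after the header */
}

/* Recover the length from the pointer alone by stepping back one header. */
size_t sized_length(const void *p)
{
    assert(p != NULL);
    return ((const header_t *)p - 1)->size;
}

void sized_free(void *p)
{
    if (p != NULL)
        free((header_t *)p - 1);
}
```

The caller only ever sees the payload pointer; `sized_length` and `sized_free` find the bookkeeping by pointer arithmetic, which is why freeing needs no size argument.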

Keeping track of free blocks
• Method 1: Implicit list using lengths; links all blocks
• Method 2: Explicit list among the free blocks using pointers within the free blocks
• Method 3: Segregated free list
  • Different free lists for different size classes
• Method 4: Blocks sorted by size
  • Can use a balanced tree (e.g., red-black tree) with pointers within each free block, and the length used as a key
[Figure: two example heaps with blocks of sizes 5, 4, 6, and 2, illustrating the implicit and explicit lists]
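Method 1 can be sketched in C as a first-fit search over an implicit list. Everything specific here is an assumption for illustration: the heap is an array of words, each block starts with one header word, and the header encodes the block size in words (header included) shifted left by one with the low bit marking an allocated block.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical header encoding: (size in words << 1) | allocated bit. */
static size_t blk_size(size_t hdr)  { return hdr >> 1; }
static int    blk_alloc(size_t hdr) { return (int)(hdr & 1u); }
static size_t pack(size_t size, int alloc)
{
    return (size << 1) | (size_t)(alloc != 0);
}

/* First fit: walk from block to block using the sizes in the headers.
   Returns the index of the chosen block's header, or heap_words if no
   free block is large enough. */
size_t first_fit(size_t *heap, size_t heap_words, size_t payload_words)
{
    size_t need = payload_words + 1;             /* payload plus header word */
    for (size_t i = 0; i < heap_words; i += blk_size(heap[i])) {
        size_t size = blk_size(heap[i]);
        assert(size > 0);                        /* guards against a corrupt heap */
        if (!blk_alloc(heap[i]) && size >= need) {
            if (size > need)                     /* split: the remainder stays free */
                heap[i + need] = pack(size - need, 0);
            heap[i] = pack(need, 1);
            return i;
        }
    }
    return heap_words;
}
```

The search is linear in the total number of blocks, allocated or free, which is the main cost of the implicit list; the explicit and segregated lists in Methods 2 and 3 visit only free blocks. This sketch also splits unconditionally, so it can leave a header-only free block behind.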
