
Outline for Today’s Lecture



Presentation Transcript


1. Outline for Today’s Lecture • Administrative: • Next Friday - midterm exam • Change on the web – now covers Chapters 1-3, 7.4, 8.2, and 10.1-10.3 • Review at Monday discussion – bring your questions • Next programming assignment will be posted this weekend • Objective for today: • Replacement policies • Distributed shared memory • Superpages

2. Variable / Global Algorithms • [Figure: size of the locality set over time - stable phases separated by transitions.] • These do not require each process to live within a fixed number of frames, replacing only its own pages. • The previously mentioned algorithms can be applied globally, victimizing any process’s pages. • Or use algorithms that make the number of frames allocated to each process explicit.

3. Variable Space Algorithms • Working Set: tries to capture the set of currently active pages. The whole working set should be resident in memory for the process to be worth running. The WS is the set of pages referenced during the window of time (now − w, now). • Working Set Clock - a hybrid approximation. • Page Fault Frequency (PFF): monitor the fault rate; if pff > high threshold, grow the number of frames allocated to this process; if pff < low threshold, reduce it. The idea is to determine the right amount of memory to allocate (a sketch follows below).
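The page-fault-frequency idea can be made concrete with a minimal sketch. This is illustrative only - the threshold values, the bookkeeping fields, and the one-frame-at-a-time adjustment are assumptions, not details given in the lecture:

    #include <stddef.h>

    /* Hypothetical per-process bookkeeping for page-fault frequency (PFF).
     * Threshold values are illustrative only. */
    #define PFF_HIGH 0.05   /* faults per reference: too many -> grow   */
    #define PFF_LOW  0.01   /* faults per reference: too few  -> shrink */

    struct process {
        unsigned long faults;      /* faults since last adjustment     */
        unsigned long references;  /* references since last adjustment */
        size_t frames;             /* frames currently allocated       */
    };

    /* Called periodically; nudges the frame allocation toward the
     * "right" amount of memory for this process. */
    void pff_adjust(struct process *p)
    {
        if (p->references == 0)
            return;
        double pff = (double)p->faults / (double)p->references;

        if (pff > PFF_HIGH)
            p->frames += 1;            /* faulting too often: grow   */
        else if (pff < PFF_LOW && p->frames > 1)
            p->frames -= 1;            /* wasting memory: shrink     */

        p->faults = p->references = 0; /* start a new window         */
    }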

4. Working Set Model • The working set at time t is the set of pages referenced in the interval (t − w, t), where w is the working set window. • Implies that per-reference information must be captured. • How to choose w? • Identifies the “active” pages. Any page that is resident in memory but not in any process’s working set is a candidate for replacement. • The size of the working set can vary with locality changes. (A direct computation from a reference trace is sketched below.)
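As a concrete illustration of the definition, the working set can be computed directly from a recorded reference trace. This is a hypothetical sketch; the trace representation and the bound on page numbers are assumptions:

    #include <stdbool.h>
    #include <stddef.h>

    #define MAX_PAGES 1024   /* illustrative bound on page numbers */

    /* Mark which pages belong to WS(t, w): the pages referenced in
     * the last w references of the trace, up to time t. */
    size_t working_set(const int trace[], size_t t, size_t w,
                       bool in_ws[MAX_PAGES])
    {
        size_t start = (t > w) ? t - w : 0;
        size_t count = 0;

        for (int p = 0; p < MAX_PAGES; p++)
            in_ws[p] = false;

        for (size_t i = start; i < t; i++) {
            int page = trace[i];
            if (!in_ws[page]) {
                in_ws[page] = true;
                count++;               /* |WS| varies with locality */
            }
        }
        return count;                  /* size of the working set   */
    }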

5. Example Reference String • …0 {1 2 {2 3 4 5 15 5 20 5 6}^m 2 6 {2 3 4 5 16 5 21 5 6}^m 2 6 {2 3 4 5 17 5 22 5 6}^m 2 6 … {2 3 4 5 19 5 24 5 6}^m 2 6 7}^n • Braced groups with exponents denote loops: each inner loop repeats m times and the outer loop repeats n times. (The labels i, j, A, and B mark loop indices and phases in the original figure.)

6. Reminder: Clock Algorithm • [Figure: circular queue of pages, with the clock hand at the first candidate and the newest page just behind it.] • Maintain a circular queue with a pointer to the next candidate (the clock hand). • At fault time: scan around the clock looking for a page with a usage bit of zero (that’s your victim), clearing usage bits as they are passed. • We then know whether or not a page has been used since the last time the bits were cleared. (A sketch of the scan follows.)
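A compact sketch of the fault-time scan described above, assuming a simple array of frames and a hand index (all names are illustrative):

    #include <stdbool.h>
    #include <stddef.h>

    struct frame {
        int  page;   /* resident page number  */
        bool used;   /* usage (reference) bit */
    };

    /* At fault time: sweep the circular queue from the hand, clearing
     * usage bits, until a frame with used == 0 is found (the victim). */
    size_t clock_select_victim(struct frame frames[], size_t nframes,
                               size_t *hand)
    {
        for (;;) {
            struct frame *f = &frames[*hand];
            if (!f->used) {
                size_t victim = *hand;
                *hand = (*hand + 1) % nframes;  /* advance past victim  */
                return victim;
            }
            f->used = false;                    /* give a second chance */
            *hand = (*hand + 1) % nframes;
        }
    }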

7. WSClock • The implementable approximation. • At fault time: scan the usage bits of the resident pages. For each page i:

    if (used[i]) {
        time_of_ref[i] = vt[owner[i]];  /* virtual time of the owning process */
        used[i] = 0;
    } else if (vt[owner[i]] - time_of_ref[i] >= w) {
        /* replaceable */
    }   /* else still in the "working set" */

8. WSClock – Virtual Times • [Figure: timeline with real time rt = 0, 3, 6, 10, 13; working sets {x,y}, {y,z}, {a,f}, {a,b,c} with virtual times vt = 3, 7, 6, 3; window w = 2. A process’s virtual time advances only while it runs.]

9. WSClock – In Operation • [Figure: pages of two processes around the clock. P0 has VT0 = 8 and pages last referenced at ref = 2, 5, 8; P1 has VT1 = 6 and pages last referenced at ref = 6, 3.] • Let w = 5. Then the P0 page with ref = 2 is no longer in the working set of P0 (8 − 2 ≥ 5) and is replaceable.

10. Summary: Pros and Cons of VM • Demand paging gives the OS flexibility to manage memory... • programs may run with pages missing • unused or “cold” pages do not consume real memory • improves degree of multiprogramming • program size is not limited by physical memory • program size may grow (e.g., stack and heap) • …but VM takes control away from the application. • With traditional interfaces, the application cannot tell how much memory it has or how much a given reference costs. • Fetching pages on demand may force the application to incur I/O stalls for many of its references.

11. Another problem: Sharing • The indirection of the mapping mechanism in paging makes it tempting to consider sharing code or data by having the page tables of two processes map to the same page, but… • there is interaction with caches, especially with a virtually addressed cache; • what if there are addresses embedded inside the page to be shared?

12. Paging and Sharing (difficulties with embedded addresses) • [Figure: three virtual address spaces VAS0, VAS1, VAS2; foo contains the embedded instruction “BR 42”, and bar lives at address 42.] • Virtual address spaces still look contiguous. • The virtual address space of Process 0 links foo into the pink address region. • The virtual address space of Process 1 links bar into the blue region. • Then along comes Process 2: VAS2 wants to share both foo and bar, but the embedded address 42 cannot be correct in both address spaces at once.

13. Segmentation • A better basis for sharing. Naming is by logical unit (rather than by an arbitrary fixed-size unit) and then by offset within the unit (e.g., a procedure). • Segments are variable size. • The segment table is like a bunch of base/limit registers. • [Figure: the virtual address (seg#, offset) indexes the segment table; the segment base is added to the offset to form the effective address.]

14. Sharing in Segmentation • [Figure: VAS0, VAS1, and VAS2 each have their own segment table; foo and bar are shared segments, and branches are written as “BR foo, 30” and “BR bar, 2” - segment name plus offset, with no absolute addresses embedded.] • Naming is by logical objects. • Offsets are relative to the base of the object. • Address spaces may be sparse as well as non-contiguous.

15. Combining Segmentation and Paging • The virtual address is divided into (segment #, page #, offset). • Sharing is supported by segmentation: programs name shared segments. • Physical storage management is simplified by paging each segment - yet another page table for each different segment. • [Figure: the segment number indexes the segment table; each segment has its own page table, which maps the page number to a frame; frame plus offset forms the effective address.] (A translation sketch follows.)
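A hedged sketch of the combined translation: the segment number selects a per-segment page table, which maps the page number to a frame. The structure names, page size, and fault convention are assumptions for illustration:

    #include <stdint.h>

    #define PAGE_BITS 12                   /* illustrative: 4 KB pages */
    #define PAGE_SIZE (1u << PAGE_BITS)

    struct segment {
        uint32_t *page_table;  /* per-segment table: page# -> frame#  */
        uint32_t  npages;      /* segment length in pages (the limit) */
    };

    /* Translate (seg#, page#, offset) to a physical address.
     * Returns -1 on a limit violation (i.e., a fault). */
    int64_t translate(const struct segment segtab[], uint32_t nsegs,
                      uint32_t seg, uint32_t page, uint32_t offset)
    {
        if (seg >= nsegs || page >= segtab[seg].npages)
            return -1;                        /* segmentation fault   */
        uint32_t frame = segtab[seg].page_table[page];
        return ((int64_t)frame << PAGE_BITS) | (offset & (PAGE_SIZE - 1));
    }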

16. Motivation: How to exploit remote memory? • [Figure: a processor (P) with its cache ($) and local memory, connected over the network to remote memory.] • Low-latency networks make retrieving data across the network faster than retrieving it from the local disk. • This eases the “I/O bottleneck” for both VM paging and file caching.

17. Distributed Shared Memory (DSM) • Allows the use of a shared-memory programming model (a shared address space) in a distributed system (processors with only local memory). • [Figure: two nodes, each with its own processor, MMU, and memory, exchanging messages over the network.]

18. DSM Issues • Can use the local memory-management hardware to generate a fault when the desired page is not locally present, or when a write is attempted on a read-only copy. • Locate the page remotely - via the current “owner” of the page (the last writer) or the “home” node for the page. • The page is sent in a message to the requesting node (a read access makes a copy; a write migrates the page). • Consistency protocol - invalidations or broadcast of changes (update); a directory is kept of the “caches” holding copies.

19. DSM States • Forced faults are key to consistency operations: • Invalid local mapping, attempted read access - data is flushed from the most recent writer, and the write-protect bit is set for all copies. • Invalid local mapping, attempted write access - the data migrates, and all other copies are invalidated. • Local read-only copy, write fault - all other copies are invalidated. • (A schematic state machine follows.)
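The three transitions above can be summarized as a small state machine. This is a schematic, not the lecture’s implementation; the messaging and directory operations are illustrative stubs:

    /* Hypothetical DSM page states mirroring the three cases above. */
    enum dsm_state { INVALID, READ_ONLY, READ_WRITE };

    struct dsm_page {
        enum dsm_state state;
        int owner;               /* node holding the most recent copy */
    };

    extern void fetch_copy(struct dsm_page *p);      /* from owner/home */
    extern void migrate(struct dsm_page *p);         /* move ownership  */
    extern void invalidate_other_copies(struct dsm_page *p);
    extern void write_protect_all_copies(struct dsm_page *p);

    void dsm_fault(struct dsm_page *p, int is_write)
    {
        if (p->state == INVALID && !is_write) {
            fetch_copy(p);                 /* flush from last writer   */
            write_protect_all_copies(p);   /* every copy read-only     */
            p->state = READ_ONLY;
        } else if (p->state == INVALID && is_write) {
            migrate(p);                    /* data moves to the writer */
            invalidate_other_copies(p);
            p->state = READ_WRITE;
        } else if (p->state == READ_ONLY && is_write) {
            invalidate_other_copies(p);    /* write fault on r/o copy  */
            p->state = READ_WRITE;
        }
    }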

20. Consistency Models • Sequential consistency • All memory operations appear to execute one at a time. A write is considered done only when invalidations or updates have propagated to all copies. • Weaker forms of consistency • Guarantees associated with synchronization primitives; at other times, it doesn’t matter. • For example: acquire lock - make sure others’ writes are done; release lock - make sure all my writes are seen by others.

21. Example - The Problem

    /* Process 0 */            /* Process 1 */
    A = 0;                     B = 0;
    ...                        ...
    A = 1;                     B = 1;
    if (B == 0)                if (A == 0)
        succ[0] = true;            succ[1] = true;

If the writes propagate late, both tests can read 0 and both processes record success, even though at that point in time A == 1 & B == 1.

22. Example - Sequential Consistency • Same two processes as above. • [Figure: timeline in which the write B = 1 is delayed and the write A = 1 is delayed until they have propagated to all copies; only then do the subsequent tests run, so at most one of them can see a 0.]

23. Example - Weaker Consistency

    /* Process 0 */            /* Process 1 */
    A = 0;                     B = 0;
    ...                        ...
    A = 1;                     B = 1;
    acquire (mutex0);          acquire (mutex1);
    if (B == 0)                if (A == 0)
        succ[0] = true;            succ[1] = true;
    release (mutex0);          release (mutex1);

Consistency is enforced only at the acquire/release synchronization points; between them, the order in which writes become visible does not matter.

24. False Sharing • Page bouncing sets in when coherency operations become too frequent. • One cause: false sharing - half of a page is used by one processor and the other half by another, yet the two halves are packaged and managed as one page. • Subpage management is one possible remedy. (A layout sketch follows.)
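A tiny illustration of the layout problem, plus one remedy that does not require subpage management. The page size and the use of per-page alignment are assumptions:

    #include <stdint.h>

    #define DSM_PAGE 4096                /* assumed coherency unit */

    /* False sharing: two logically unrelated counters land on the
     * same page, so writes by node 0 and node 1 bounce the page
     * back and forth between the two nodes. */
    struct shared_bad {
        uint64_t counter_node0;          /* written by node 0            */
        uint64_t counter_node1;          /* written by node 1: same page */
    };

    /* Remedy: pad each item out to its own page (C11 _Alignas). */
    struct shared_good {
        _Alignas(DSM_PAGE) uint64_t counter_node0;
        _Alignas(DSM_PAGE) uint64_t counter_node1;
    };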

25. OS Support for Superpages - Juan Navarro, Sitaram Iyer, Peter Druschel, Alan Cox (OSDI 2002) • Increasing cost in TLB miss overhead • growing working sets • TLB size does not grow at the same pace • Processors now provide superpages • one TLB entry can map a large region • OSs have been slow to harness them • no transparent superpage support for apps • Proposed: a practical and transparent solution to support superpages

  26. Translation look-aside buffer • TLB caches virtual-to-physical address translations • TLB coverage • amount of memory mapped by TLB • amount of memory that can be accessed without TLB misses

27. TLB Coverage Trend • [Figure: TLB coverage as a percentage of main memory - a factor of 1000 decrease in 15 years. Measured TLB miss overheads range from 5% through 5-10% up to 30%.]

28. How to increase TLB coverage • Typical TLB coverage ≈ 1 MB (e.g., 128 entries × 8 KB pages = 1 MB). • Use superpages! • mix large and small pages • increases TLB coverage • no increase in TLB size • no internal fragmentation • If only large pages were used: larger working sets, more I/O.

  29. What are these superpages anyway? • Memory pages of larger sizes • supported by most modern CPUs • Otherwise, same as normal pages • power of 2 size • use only one TLB entry • contiguous • aligned (physically and virtually) • uniform protection attributes • one reference bit, one dirty bit

30. A Superpage TLB • Supported superpage sizes - Alpha: 8, 64, 512 KB; 4 MB. Itanium: 4, 8, 16, 64, 256 KB; 1, 4, 16, 64, 256 MB. • [Figure: a TLB mapping virtual to physical memory - a base-page entry maps one page (size = 1), while a superpage entry maps a contiguous, aligned region (size = 4) with a single entry.]

  31. The superpage problem

32. Issue 1: Superpage Allocation • [Figure: pages A, B, C, D of a virtual memory object and their frames in physical memory, with superpage boundaries marked; pages scattered across arbitrary frames cannot later form a contiguous, aligned superpage.] • How / when / what size to allocate?

33. Issue 2: Promotion • Promotion: create a superpage out of a set of smaller pages; mark the page table entry of each base page. • When to promote? • Wait for the app to touch the pages? May lose the opportunity to increase TLB coverage. • Create a small superpage early? May add overhead for little benefit. • Forcibly populate pages? May cause internal fragmentation.

  34. Issue 3: demotion Demotion: convert a superpage into smaller pages • when page attributes of base pages of a superpage become non-uniform • during partial pageouts

  35. Issue 4: fragmentation • Memory becomes fragmented due to • use of multiple page sizes • persistence of file cache pages • scattered wired (non-pageable) pages • Contiguity: contended resource • OS must • use contiguity restoration techniques • trade off impact of contiguity restoration against superpage benefits

  36. Design

37. Key observation • Once an application touches the first page of a memory object, it is likely to quickly touch every page of that object. • Example: array initialization. • Opportunistic policies: make superpages as large, and as soon, as possible • as long as there is no penalty if the decision turns out wrong.

38. Superpage Allocation: Preemptible Reservations • [Figure: when page B of a virtual memory object is first touched, a whole aligned region of frames is reserved in physical memory, so pages A, C, D can later land in frames that complete a superpage.] • How much do we reserve? Goal: good TLB coverage, without internal fragmentation.

39. Allocation: Reservation Size • Opportunistic policy: • Go for the biggest size that is no larger than the memory object (e.g., the file). • If that size is not available, try preemption before resigning to a smaller size - the preempted reservation had its chance. • (A sketch of this choice follows.)
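A sketch of that opportunistic choice, using the Alpha page sizes from the evaluation setup; the allocator and preemption hooks are hypothetical:

    #include <stddef.h>

    /* Illustrative Alpha page sizes, largest first (bytes). */
    static const size_t sizes[] = { 4u<<20, 512u<<10, 64u<<10, 8u<<10 };
    #define NSIZES (sizeof sizes / sizeof sizes[0])

    extern void *alloc_contiguous(size_t size);     /* NULL if unavailable */
    extern void *preempt_reservation(size_t size);  /* steal an old one    */

    /* Reserve the biggest superpage no larger than the object; prefer
     * preempting an existing reservation over a smaller size. */
    void *reserve(size_t object_size)
    {
        for (size_t i = 0; i < NSIZES; i++) {
            if (sizes[i] > object_size)
                continue;                  /* would overshoot the object */
            void *p = alloc_contiguous(sizes[i]);
            if (!p)
                p = preempt_reservation(sizes[i]); /* it had its chance  */
            if (p)
                return p;                  /* else resign to next size   */
        }
        return alloc_contiguous(sizes[NSIZES - 1]); /* base pages only   */
    }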

40. Allocation: Managing Reservations • [Figure: reservation lists keyed by the largest unused (and aligned) chunk each reservation still holds - sizes 4, 2, 1 - with the best candidate for preemption at the front of each list.] • Best candidate for preemption: the reservation whose most recently populated frame was populated the least recently.

41. Incremental Promotions • Promotion policy: opportunistic. • [Figure: a region is promoted one size at a time as its base pages are populated - to size 2, then 4, then 4+2, and finally 8.] • (A sketch of the promotion step follows.)
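A sketch of one incremental promotion step; the population check and the promote hook (which would mark the page table entries of the base pages) are assumptions:

    #include <stdbool.h>
    #include <stddef.h>

    extern bool all_populated(size_t base, size_t npages); /* hypothetical */
    extern void promote(size_t base, size_t npages);       /* update PTEs  */

    /* After the page at `pfn` is populated, try to promote its region
     * to the next superpage size, one step at a time (2, 4, 8, ...). */
    void maybe_promote(size_t pfn, size_t cur_size, size_t max_size)
    {
        for (size_t size = cur_size * 2; size <= max_size; size *= 2) {
            size_t base = pfn & ~(size - 1);   /* aligned region start */
            if (!all_populated(base, size))
                break;                         /* wait for more touches */
            promote(base, size);               /* one incremental step  */
        }
    }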

  42. Speculative demotions • One reference bit per superpage • How do we detect portions of a superpage not referenced anymore? • On memory pressure, demote superpages when resetting ref bit • Re-promote (incrementally) as pages are referenced

43. Demotions: Dirty Superpages • With one dirty bit per superpage, we cannot tell what’s dirty and what’s not, so we would have to page out the entire superpage. • Demote on the first write to a clean superpage. • Re-promote (incrementally) as other pages are dirtied.

44. Fragmentation control • Low contiguity: modified page daemon • restore contiguity • move clean, inactive pages to the free list • minimize impact • prefer pages that contribute the most to contiguity • keep contents for as long as possible (even when part of a reservation: if reactivated, break the reservation) • Cluster wired pages

45. Experimental Evaluation

  46. Experimental setup • FreeBSD 4.3 • Alpha 21264, 500 MHz, 512 MB RAM • 8 KB, 64 KB, 512 KB, 4 MB pages • 128-entry DTLB, 128-entry ITLB • Unmodified applications

47. Best-case benefits • TLB miss reduction usually above 95% • SPEC CPU2000 integer • 11.2% improvement (0 to 38%) • SPEC CPU2000 floating point • 11.0% improvement (-1.5% to 83%) • Other benchmarks • FFT (200³ matrix): 55% • 1000x1000 matrix transpose: 655% • 30%+ in 8 out of 35 benchmarks

48. Why Multiple Superpage Sizes • [Figure: per-benchmark improvements with only one superpage size vs. all sizes supported - no single size captures the benefit across all benchmarks.]

  49. Conclusions • Superpages: 30%+ improvement • transparently realized; low overhead • Contiguity restoration is necessary • sustains benefits; low impact • Multiple page sizes are important • scales to very large superpages
