Virtual Memory Primitives for User Programs

Virtual Memory Primitives for User Programs Presentation by David Florey CS533 - Concepts of Operating Systems

Overview
• This paper presents basic VM primitives, how they are used, and their implementation details on various OSs
• Discuss the various primitives and how they are used (in user-level algorithms)
• Discuss the performance on various OSs
• Discuss the ramifications of these uses (algorithms) on system design

The Primitives (VM Services)
• TRAP
  • Facility allowing user-level handling of page faults (protection or otherwise)
  • An event that is raised (in the form of a message or signal from the OS)
• PROT1
  • Decreases accessibility of a single page
  • A procedure call (via messaging, trap to OS, etc.)
• PROTN
  • Decreases accessibility of n pages
  • A procedure call (via messaging, trap to OS, etc.)
• UNPROT
  • Increases the accessibility of a single page
  • A procedure call (via messaging, trap to OS, etc.)
• DIRTY
  • Returns the set of pages that have been touched since the last call to DIRTY
  • A procedure call (via messaging, trap to OS, etc.)
• MAP2
  • Maps two different virtual addresses to the same physical page
  • Each virtual address has its own protection level
  • Both mappings are in the same address space (not two different processes, tasks, or address spaces)
  • A procedure call (via messaging, trap to OS, etc.)
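To make the list above concrete, here is a minimal toy model of the primitives in Python. It is a simulation only, not a real OS interface: the names `PagedMemory`, `Access`, and `AccessFault` are made up for illustration, pages are plain lists, and MAP2 is not modeled (a handler can reach `pages` directly, which loosely stands in for a second, unprotected mapping).

```python
from enum import IntEnum


class Access(IntEnum):
    NONE = 0
    READ = 1
    READ_WRITE = 2


class AccessFault(Exception):
    """Raised when no handler makes a faulting page accessible."""


class PagedMemory:
    """Toy model: pages are Python lists, protection is a per-page flag."""

    def __init__(self, num_pages, page_size=4):
        self.pages = [[0] * page_size for _ in range(num_pages)]
        self.access = [Access.READ_WRITE] * num_pages
        self.dirty_pages = set()
        self.trap_handler = None  # TRAP: callable(memory, page, is_write)

    def prot1(self, page, level):
        # PROT1 only ever decreases accessibility.
        self.access[page] = min(self.access[page], level)

    def protn(self, pages, level):
        # PROTN: one call covering many pages.
        for page in pages:
            self.prot1(page, level)

    def unprot(self, page, level=Access.READ_WRITE):
        # UNPROT only ever increases accessibility.
        self.access[page] = max(self.access[page], level)

    def dirty(self):
        # DIRTY: pages written since the previous call, then reset.
        touched, self.dirty_pages = self.dirty_pages, set()
        return touched

    def _check(self, page, needed, is_write):
        if self.access[page] < needed and self.trap_handler is not None:
            self.trap_handler(self, page, is_write)
        if self.access[page] < needed:
            raise AccessFault(page)

    def read(self, page, offset):
        self._check(page, Access.READ, False)
        return self.pages[page][offset]

    def write(self, page, offset, value):
        self._check(page, Access.READ_WRITE, True)
        self.pages[page][offset] = value
        self.dirty_pages.add(page)
```

A handler that simply calls `unprot` on the faulting page reproduces the PROTN, TRAP, UNPROT pattern that the algorithms in this deck build on.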

VM Service Usage: Concurrent Garbage Collection
• Stop all threads
• Divide memory into from-space and to-space
• Copy all objects reachable from “roots” and registers into to-space
• Use PROTN to protect all pages in the unscanned area
• Use MAP2 to allow the collector access to all pages while preventing mutators from accessing the same pages
• Restart threads
• As mutator threads attempt to access pages in to-space that are unscanned, a TRAP event:
  • Stops the mutator in its tracks
  • Calls the collector, which scans, forwards, and UNPROTs the page
  • Allows the mutator to continue
• At some point this process is restarted, and all objects left in from-space are considered garbage and removed
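The trap-and-scan step above can be sketched as a small simulation. This is a toy under loud assumptions: objects are identified by name, to-space "addresses" are list indices, `PAGE_SIZE` counts objects rather than bytes, and the class name `Collector` and its methods are hypothetical. The background collector thread is left out; only the mutator-triggered path is shown.

```python
PAGE_SIZE = 2  # objects per to-space page (tiny, for illustration only)


class Collector:
    def __init__(self, from_space, roots):
        self.from_space = from_space  # name -> list of referenced names
        self.to_space = []            # copied objects, in copy order
        self.forwarded = {}           # from-space name -> to-space index
        self.protected = set()        # to-space pages not yet fully scanned
        self.traps = 0
        # "Flip": copy only the roots, leaving their fields unscanned.
        self.roots = [self.forward(name) for name in roots]

    def forward(self, name):
        """Copy an object to to-space once; return its to-space index."""
        if name not in self.forwarded:
            index = len(self.to_space)
            self.forwarded[name] = index
            # Fields still hold from-space names until the page is scanned.
            self.to_space.append(
                {"fields": list(self.from_space[name]), "scanned": False}
            )
            self.protected.add(index // PAGE_SIZE)  # keep mutators out
        return self.forwarded[name]

    def scan_page(self, page):
        """Forward every pointer on the page, then UNPROT it."""
        i = page * PAGE_SIZE
        # Re-check the bound each step: forwarding may append to this page.
        while i < min((page + 1) * PAGE_SIZE, len(self.to_space)):
            obj = self.to_space[i]
            if not obj["scanned"]:
                obj["fields"] = [self.forward(f) for f in obj["fields"]]
                obj["scanned"] = True
            i += 1
        self.protected.discard(page)

    def mutator_read(self, index):
        """Mutator access; touching an unscanned page TRAPs to the collector."""
        page = index // PAGE_SIZE
        if page in self.protected:
            self.traps += 1
            self.scan_page(page)
        return self.to_space[index]
```

With this arrangement the mutator only ever sees objects whose fields already point into to-space, and anything never forwarded is what the slide calls garbage.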

Concurrent Garbage Collection

VM Service Usage: Shared Virtual Memory
• Each CPU (or machine) has its own memory and memory mapping manager (MM)
• Mapping managers keep each CPU's memory consistent with the “shared” memory
• When a page is shared, it is marked “read-only” (PROT1)
• A write to this page faults in the writing thread, raising a TRAP event in the associated mapping manager
• The mapping manager uses the trap to notify the other MMs, which in turn flush their copies of the page (this mechanism may also be used to get an up-to-date copy of the page)
• The page is then marked writable (UNPROT) and written
• MAP2 is used to allow the trap handler to access the protected page while the client cannot
• TRAP is also used by the MM to pull in a page from another CPU or from disk when it is not available locally
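The write-fault and invalidate sequence above can be sketched as follows. This is a toy simulation, not the protocol from the paper: `Node` and `SharedMemory` are hypothetical names, a write replaces the whole page so no fetch is needed first, and the "network" is a direct method call.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.copies = {}       # page id -> local copy of the page data
        self.writable = set()  # pages this node may write without faulting


class SharedMemory:
    def __init__(self, nodes, initial):
        self.nodes = nodes
        self.owner = {}
        self.traps = 0
        for page, data in initial.items():
            # Assumption: the first node starts out owning every page.
            self.owner[page] = nodes[0]
            nodes[0].copies[page] = data
            nodes[0].writable.add(page)

    def read(self, node, page):
        if page not in node.copies:        # page absent locally: TRAP
            self.traps += 1
            owner = self.owner[page]
            owner.writable.discard(page)   # PROT1: shared means read-only
            node.copies[page] = owner.copies[page]
        return node.copies[page]

    def write(self, node, page, data):
        if page not in node.writable:      # read-only or absent: TRAP
            self.traps += 1
            for other in self.nodes:
                if other is not node:      # other managers flush their copy
                    other.copies.pop(page, None)
                    other.writable.discard(page)
            self.owner[page] = node
            node.writable.add(page)        # UNPROT
        node.copies[page] = data
```

Repeated writes by the same node cost nothing extra until another node reads the page again, which is where the locality argument later in the deck comes from.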

Shared Memory

VM Service Usage: Concurrent Checkpointing
• Checkpointing is the process of saving state such as the heap, stack, etc., which can be slow
• Instead of a synchronous save, we can simply use PROTN to mark the pages that need to be saved to disk read-only
• A second thread can then run concurrently with the user threads, writing out pages and UNPROTing each page as it is written
• If a user thread writes to a “read-only” page, a fault occurs, TRAPping to the concurrent thread, which quickly writes the page out and allows the faulting thread to continue
• Could also just do this with the DIRTY pages using PROT1
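A minimal sketch of this scheme, assuming page contents are immutable strings and a dictionary stands in for the checkpoint file. `Checkpointer` and its method names are hypothetical, and the background thread is modeled as explicit `background_step` calls rather than a real thread.

```python
class Checkpointer:
    def __init__(self, pages):
        self.pages = pages      # live memory: list of page contents
        self.read_only = set()  # pages not yet written to the checkpoint
        self.saved = {}         # stands in for the checkpoint file on disk
        self.traps = 0

    def begin(self):
        self.saved = {}
        self.read_only = set(range(len(self.pages)))  # PROTN on everything

    def _save(self, page):
        self.saved[page] = self.pages[page]
        self.read_only.discard(page)                  # UNPROT once saved

    def background_step(self):
        """Save one remaining page; return False when nothing is left."""
        if not self.read_only:
            return False
        self._save(min(self.read_only))
        return True

    def write(self, page, value):
        """User-thread write; a read-only page TRAPs and is saved first."""
        if page in self.read_only:
            self.traps += 1
            self._save(page)
        self.pages[page] = value
```

Whatever order the user writes and background steps arrive in, `saved` ends up holding the page contents as they were at `begin`.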

Concurrent Checkpointing

Concurrent Checkpointing With DIRTY

VM Service Usage: Generational Garbage Collection
• Objects are kept in generations
• The longer an object lives, the older its generation
• Typically garbage is in younger generations, but an old object might be pointing at a young object, so…
• Use DIRTY to see if pages containing old objects were changed; objects in these DIRTY pages can be scanned to see where they point
• Or:
  • PROTN all old pages and TRAP to a handler when an old page is written to; save the page id in a list for later scanning and UNPROT the page so the writer can write
  • Later, the collector can scan the list of pages to see if any objects within them point to younger generations
• Why use a small page size here?
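The PROTN/TRAP variant above can be sketched like this. It is a toy: references are string labels such as "young:3", and `OldGeneration` with its methods is a hypothetical name, not anything from the paper.

```python
class OldGeneration:
    def __init__(self, num_pages, page_size):
        # Each slot holds None or a reference label, e.g. "young:3" or "old:7".
        self.pages = [[None] * page_size for _ in range(num_pages)]
        self.protected = set()
        self.written = []  # page ids recorded by the trap handler

    def protect_all(self):
        """PROTN every old page and start a fresh list of written pages."""
        self.protected = set(range(len(self.pages)))
        self.written = []

    def store(self, page, slot, ref):
        """Mutator store; the first write to a protected page TRAPs."""
        if page in self.protected:
            self.written.append(page)      # remember the page for later
            self.protected.discard(page)   # UNPROT so the write can proceed
        self.pages[page][slot] = ref

    def young_roots(self):
        """Scan only the recorded pages for pointers into the young generation."""
        roots = []
        for page in self.written:
            for ref in self.pages[page]:
                if ref is not None and ref.startswith("young:"):
                    roots.append(ref)
        return roots

    def slots_scanned(self):
        """Scanning work is proportional to pages written times page size."""
        return sum(len(self.pages[page]) for page in self.written)
```

`slots_scanned` hints at the answer to the question on the slide: every written page is scanned in full, so smaller pages mean less scanning per recorded write.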

VM Service Usage: Others…
• Persistent Stores
  • Can use VM services to protect pages, trap on writes, and persist dirty pages on commit or toss them on abort
  • Uses TRAP, UNPROT, PROTN, MAP2
• Extending addressability
  • After 64-bit to 32-bit translation, pages may need to be protected so that a TRAP handler can properly “load” the page for suitable access, then UNPROT it
  • Uses TRAP, UNPROT, PROT1 or PROTN, and MAP2
• Data-compression paging
  • Compressing n pages into a couple of pages may be faster than writing these pages to disk. The compressed pages can then be access-protected. When the user then tries to access such a page: TRAP, decompress, UNPROT
  • Could also use PROT1 to test the access frequency of a page
  • Uses TRAP, PROT1 or PROTN, UNPROT
• Heap overflow detection
  • Terminate the memory allocation area with a “guard” (PROT1) page
  • An access to this page calls the TRAP handler, which triggers the collector
  • The alternative is a conditional branch on each allocation
  • Uses PROT1, TRAP
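As one illustration from the list above, data-compression paging can be sketched with Python's standard `zlib` module. The class `CompressedPager` is hypothetical, and the dictionaries stand in for resident memory and the compressed, access-protected pool.

```python
import zlib


class CompressedPager:
    def __init__(self):
        self.resident = {}    # page id -> raw bytes, directly accessible
        self.compressed = {}  # page id -> zlib-compressed bytes (protected)
        self.traps = 0

    def page_out(self, page):
        """Compress a page instead of writing it to disk, then protect it."""
        self.compressed[page] = zlib.compress(self.resident.pop(page))

    def read(self, page):
        """Access a page; a compressed page TRAPs, decompresses, UNPROTs."""
        if page not in self.resident:
            self.traps += 1
            self.resident[page] = zlib.decompress(self.compressed.pop(page))
        return self.resident[page]
```

Whether this beats going to disk depends on how compressible the pages are and how fast the trap path is, which is the concern of the performance slides that follow.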

Persistent Store Example & Data Compression Example

Performance in OSs
• Devised two benchmarks, Appel1 and Appel2, based on the algorithms’ patterns of primitive usage
• Appel1
  • PROT1, TRAP, UNPROT
  • e.g. Shared virtual memory
• Appel2
  • PROTN, TRAP, UNPROT
  • e.g. Concurrent garbage collection

Performance in OSs

Performance of Primitives
• All data normalized to the speed of an Add instruction on the CPU
• Some OSs didn’t implement MAP2
• Some OSs did a crummy job of implementing these primitives
  • mprotect does not flush the TLB correctly
• OS designers seem to be relying on old notions like disk latency
  • Not relevant with CPU-based algorithms like these
• One OS performed exceptionally well, showing that these primitives don’t have to perform poorly

Ramifications on System Design
• Fault handling must be fast because we are no longer at the mercy of the disk; we can do it all in the CPU
• TLB Consistency
  • Making memory more accessible is good for TLB consistency
    • One less thing you need to worry about
  • Making memory less accessible in the multiprocessor case forces a TLB “shootdown”
    • Stop all processors and tell each to flush entry 123 in its TLB
    • Better if done in batches
    • In fact, paging out could improve if done in batches too

Ramifications on System Design
• Optimal Page Size
  • Some operations depend on the size of the page
  • “HEY OS DESIGNERS LISTEN UP!”
  • Disk latency can no longer be counted on to hide crummy design
  • Computations linearly proportional to page size are now going to be noticed, so we might benefit by cutting down the page size
  • Algorithms that do a lot of scanning, like the generational garbage collector, would benefit from a smaller page size
  • Also be aware that shrinking the page size will cause more page faults and more calls to the fault trap handler, so its overhead must also be very small

Ramifications on System Design
• Access to Protected Pages
  • Mapping the same page two different ways, with two different protections, in the same address space is FAST
    • Although it does add some bookkeeping overhead
    • And cache consistency could be a problem
  • You could achieve the same results by copying memory around: only 65 copies and you’re there!
  • Or pounding your head on the desk; that works too
  • You could also use a heavyweight process and super-heavy RPC to context switch heavily, relying on OS support for sharing a page between processes
    • Techniques employed in LRPC and URPC can alleviate the context switch problem

Ramifications on System Design
• What about pipelined processors?
  • Out-of-order execution
  • Dependence on sequential execution
  • Only a problem in the heap overflow detection case
    • Register tweaking can be a problem
  • All other algorithms work just like a typical page fault handler: handle the fault, pull the page in, make the page accessible

Final Considerations
• Making memory more accessible one page at a time, and less accessible in large batches, is good for TLB consistency
• The total performance effect of page size should be considered (fixed costs vs. variable costs)
• Locality of reference is exploited in these algorithms
  • Better locality reduces fault-handling overhead (as data is closer to the CPU)
• Pages should be accessible in different ways in a single address space
