Improving IPC by Kernel Design

Jochen Liedtke. Presentation: Rebekah Leslie

Microkernels and IPC: • Microkernel architectures rely heavily on IPC, particularly in modular systems • Mach pioneered an approach to highly modular and configurable systems, but had poor IPC performance • Poor performance leads people to avoid microkernels entirely, or to structure their designs to reduce IPC • Paper examines a performance-oriented design process and specific optimizations that achieve good performance

Performance vs. Protection: • Mach provided strong isolation between tasks using indirect IPC transfer (ports) with limited access controlled by capabilities (port rights) • L3 removes indirect transfer, capabilities, and RPC validation to achieve better performance • Provides basic address space protection • Does not provide true isolation • Recent L4 designs incorporate the use of capabilities to achieve isolation in security-critical systems

L3 System Architecture: • Mach-like design with a focus on highly modular systems • Limit kernel features to functionality that absolutely cannot be implemented at user level • Reliance on user-level servers for many “traditional” OS features: page-fault handling, exception handling, device drivers • System organized into tasks and threads • IPC allows direct data transfer and memory sharing • Direct communication between threads via thread ID

Performance-centric Design: • Focus on IPC • Any feature that will increase cost must be closely evaluated • When in doubt, design in favor of IPC • Design for Performance • A poorly performing technique is unacceptable • Evaluate feature cost compared to concrete baseline • Aim for a concrete performance goal • Comprehensive design • Consider synergistic effects of all methods and techniques • Cover all levels of implementation, from design to code

Performance Baseline: • The cost of each feature must be evaluated relative to a concrete performance baseline • For IPC, the theoretical minimum is an empty message: this measures the overhead without data transfer cost • 127 cycles without prefetching delays or cache misses + 45 cycles for TLB misses = 172-cycle minimum time • GOAL: 350 cycles (7 μs) for short messages
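
The baseline arithmetic can be checked with a short sketch. The cycle counts come from the slide; the clock rate is not stated there, so the sketch infers it from the goal of 350 cycles taking 7 microseconds, which is an assumption of this example.

```python
# Cycle counts taken from the slide.
BASE_CYCLES = 127          # no prefetching delays or cache misses
TLB_MISS_CYCLES = 45       # additional cost of TLB misses
GOAL_CYCLES = 350          # design goal for short messages
GOAL_MICROSECONDS = 7.0

# Assumption: clock rate inferred from "350 cycles = 7 microseconds".
CYCLES_PER_MICROSECOND = GOAL_CYCLES / GOAL_MICROSECONDS


def cycles_to_us(cycles):
    """Convert a cycle count to microseconds at the inferred clock rate."""
    return cycles / CYCLES_PER_MICROSECOND


minimum_cycles = BASE_CYCLES + TLB_MISS_CYCLES
print("minimum:", minimum_cycles, "cycles =", cycles_to_us(minimum_cycles), "us")
print("goal / minimum:", GOAL_CYCLES / minimum_cycles)
```

Under these numbers the goal leaves roughly a factor of two above the theoretical minimum for everything else the IPC path has to do.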

Messages in L3: • Tag: Description of message contents • Direct string: Data to be transferred directly from send buffer to receive buffer • Indirect string: Location and size of data to be transferred by reference • Memory object: Description of a region of memory to be mapped in receiver address space (shared memory) • System calls: Send, receive, call (send and receive), reply/wait (receive and send) (Figure: message layout showing tag, direct string, indirect strings, memory objects)
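
As a rough illustration of the four message parts, the sketch below models a message as plain data. All class names, field names, and the tuple-shaped tag are hypothetical and chosen for readability; they are not L3's actual message layout.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class IndirectString:
    address: int   # where the data lives in the sender's address space
    size: int      # number of bytes to transfer by reference


@dataclass
class MemoryObject:
    base: int      # start of the region to map into the receiver
    size: int      # length of the region in bytes


@dataclass
class Message:
    direct_string: bytes = b""
    indirect_strings: List[IndirectString] = field(default_factory=list)
    memory_objects: List[MemoryObject] = field(default_factory=list)

    def tag(self) -> Tuple[int, int, int]:
        """Describe the contents: direct bytes, indirect strings, memory objects."""
        return (
            len(self.direct_string),
            len(self.indirect_strings),
            len(self.memory_objects),
        )


msg = Message(b"ack", [IndirectString(address=0x1000, size=4096)])
print(msg.tag())
```

The point of the tag is that the receiver (and the kernel) can tell from one small descriptor how much of each kind of data follows.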

Basic Message Optimizations: • Ability to transfer long, complex messages reduces the number of messages that need to be sent (system calls) • Indirect strings avoid copy operations at user level • User specifies data location, rather than copying data to buffer • Receiver specifies destination, rather than copying from buffer • Memory objects transferred lazily, i.e., page table is not modified until access is required • Combined send/receive calls reduce number of traps

Optimization - Direct Transfer via Temporary Mapping: • Two-copy message transfer costs 20 + 0.75n cycles • L3 copies data once to a special communication window in kernel space • Window is mapped to the receiver for the duration of the call (page directory entry) (Figure: address spaces A and B with the kernel; the kernel adds a mapping to space B, mapped with kernel-only permission, and copies once)
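
The difference between the two-copy path and the single copy through a temporary mapping can be mimicked in user space. In this sketch a `memoryview` stands in for the communication window that aliases the receiver's memory; the function names are hypothetical and no real page tables are involved.

```python
def two_copy_transfer(send_buf: bytes, recv_buf: bytearray) -> int:
    """Sender -> kernel buffer -> receiver. Returns total bytes copied."""
    assert len(send_buf) <= len(recv_buf)
    kernel_buf = bytearray(send_buf)            # copy 1: user A -> kernel
    recv_buf[:len(kernel_buf)] = kernel_buf     # copy 2: kernel -> user B
    return 2 * len(send_buf)


def window_transfer(send_buf: bytes, recv_buf: bytearray) -> int:
    """Copy once through a window that aliases the receiver's buffer."""
    assert len(send_buf) <= len(recv_buf)
    window = memoryview(recv_buf)               # stands in for the temporary mapping
    window[:len(send_buf)] = send_buf           # the single copy lands in B's memory
    window.release()                            # mapping is dropped after the call
    return len(send_buf)


first = bytearray(8)
second = bytearray(8)
print(two_copy_transfer(b"abcd", first), window_transfer(b"abcd", second))
```

Both receivers end up with the same bytes; only the amount of copying differs.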

Optimization - Transfer Short Messages in Registers: • IPC messages are often very short • Example: Device driver ack or error replies • On average, between 50% and 80% of L3 messages are less than eight bytes long • Even on the register-poor x86, 2 registers can be set aside for short message transfer • Register transfer implementation saved 2.4 μs, even more than the overhead of temporary mapping (1.2 μs), because it enabled further optimizations
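
A minimal sketch of the short-message path, assuming two 32-bit registers (so up to eight bytes fit). The helper names and the zero-padding convention are assumptions of this example; longer payloads fall back to the memory path, signalled here by `None`.

```python
import struct

REGISTER_BYTES = 4                  # 32-bit x86 registers
MESSAGE_REGISTERS = 2               # registers set aside for short messages
MAX_REGISTER_MESSAGE = REGISTER_BYTES * MESSAGE_REGISTERS


def pack_into_registers(payload: bytes):
    """Return two 32-bit register values, or None if the payload is too long."""
    if len(payload) > MAX_REGISTER_MESSAGE:
        return None                 # caller must use the memory-copy path
    padded = payload.ljust(MAX_REGISTER_MESSAGE, b"\x00")
    return struct.unpack("<2I", padded)


def unpack_from_registers(regs, length: int) -> bytes:
    """Recover the original payload from the two register values."""
    return struct.pack("<2I", *regs)[:length]


regs = pack_into_registers(b"ok")
print(regs, unpack_from_registers(regs, 2))
```

Because the payload never touches a buffer, the register path skips both the copy and the temporary mapping.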

Thread Scheduling in L3: • Scheduler maintains several queues to keep track of relevant thread-state information • Ready queue stores threads that are able to run • Wakeup queues store threads that are blocked waiting for an IPC operation to complete or time out (organized by region) • Polling-me queue stores threads waiting to send to some thread • Efficient representation of data structures • Queues are stored as doubly-linked lists distributed across TCBs • Scheduling never causes page faults

Optimization - Lazy Dequeueing: • A significant cost in a microkernel is the scheduler overhead for kernel threads (recall the user-level threads papers) • Sometimes, threads are removed from a queue, only to be inserted again a short while later • With weak invariants on scheduling queues, you can delay deleting a thread from a queue and save overhead • The ready queue contains at least all ready threads • A wakeup queue contains at least all waiting threads
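
The weak invariant can be illustrated with a small ready queue that never removes a thread when it blocks. This is a sketch only: the class names are hypothetical, and the `queued` set stands in for whatever per-thread bookkeeping a real kernel would keep in the TCB.

```python
from collections import deque


class Thread:
    def __init__(self, name):
        self.name = name
        self.ready = True


class LazyReadyQueue:
    """Weak invariant: the queue contains at least all ready threads."""

    def __init__(self):
        self.queue = deque()
        self.queued = set()

    def make_ready(self, thread):
        thread.ready = True
        if thread not in self.queued:      # a stale entry means no insert is needed
            self.queue.append(thread)
            self.queued.add(thread)

    def block(self, thread):
        thread.ready = False               # lazy: the queue entry stays in place

    def pick_next(self):
        while self.queue:
            thread = self.queue.popleft()
            if thread.ready:
                self.queue.append(thread)  # round-robin: rotate to the back
                return thread
            self.queued.discard(thread)    # stale entry dropped when encountered
        return None


q = LazyReadyQueue()
a, b = Thread("a"), Thread("b")
q.make_ready(a)
q.make_ready(b)
q.block(a)        # no queue operation
q.make_ready(a)   # no queue operation either: a is still queued
print(q.pick_next().name)
```

A thread that blocks and becomes ready again shortly afterwards costs two flag updates instead of a delete and an insert.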

Optimization - Store Thread Control Blocks in Virtual Arrays: • A thread control block (TCB) stores kernel data for a particular thread • Every operation on a thread requires lookup, and possibly modification, of that thread’s TCB • Storing TCBs in a virtual array provides fast access to TCB structures
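
With TCBs laid out as an array in virtual memory, finding a TCB is pure address arithmetic. The base address, TCB size, and thread count below are hypothetical values picked for illustration, not L3's real constants.

```python
TCB_AREA_BASE = 0xE0000000   # hypothetical start of the virtual TCB array
TCB_SHIFT = 11               # hypothetical: TCBs are 2**11 = 2048 bytes
TCB_SIZE = 1 << TCB_SHIFT
MAX_THREADS = 1 << 16        # hypothetical number of thread slots


def tcb_address(thread_number: int) -> int:
    """Locate a TCB with one shift and one add; no lookup table is consulted."""
    if not 0 <= thread_number < MAX_THREADS:
        raise ValueError("thread number out of range")
    return TCB_AREA_BASE + (thread_number << TCB_SHIFT)


print(hex(tcb_address(0)), hex(tcb_address(3)))
```

Because the array is virtual, slots for threads that do not exist need no physical memory behind them.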

Optimization - Compact Structures with Good Locality: • Access TCBs through a pointer to the center of the structure so that short displacements can be used • One-byte displacements reach twice as much TCB data as with a pointer to the start of the structure • Group related TCB information on cache line boundaries to minimize cache misses • Store frequently accessed kernel data in same page as hardware tables (IDT, GDT, TSS)
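
The "twice as much" claim follows from the range of a signed one-byte displacement, which covers -128 to +127. The sketch below counts how many bytes of a structure are reachable from a given pointer position; the 256-byte structure size is a hypothetical value for illustration.

```python
DISP8_MIN, DISP8_MAX = -128, 127   # range of a signed one-byte displacement


def reachable_bytes(pointer_offset: int, struct_size: int) -> int:
    """Count bytes of the structure addressable with a one-byte displacement."""
    low = max(0, pointer_offset + DISP8_MIN)
    high = min(struct_size - 1, pointer_offset + DISP8_MAX)
    return max(0, high - low + 1)


STRUCT_SIZE = 256                  # hypothetical size of the hot part of a TCB
from_start = reachable_bytes(0, STRUCT_SIZE)
from_center = reachable_bytes(STRUCT_SIZE // 2, STRUCT_SIZE)
print(from_start, from_center)
```

A pointer to the start wastes the negative half of the displacement range; a pointer to the center uses both halves.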

Optimization - Reduce Segment Register Loads: • Loading segment registers is expensive (9 cycles per register), so many systems use a single, flat segment • Kernel preservation of the segment registers requires 66 cycles for the naive approach (always reload registers) • L3 instead checks if the flat value is still intact, and only does a load if not • Checking alone costs 10 cycles
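
The check-first policy can be expressed as a small cost model. The 66-cycle and 10-cycle figures are from the slide; the selector value and the assumption that a failed check pays the check plus the full naive reload are hypothetical simplifications made for this sketch.

```python
FLAT_SELECTOR = 0x10   # hypothetical selector value for the flat segment
NAIVE_CYCLES = 66      # slide: always save and reload the segment registers
CHECK_CYCLES = 10      # slide: cost of checking alone


def restore_segments(current_selectors):
    """Return (selectors after the check, cycles spent) for the check-first policy."""
    if all(sel == FLAT_SELECTOR for sel in current_selectors):
        return list(current_selectors), CHECK_CYCLES
    # Assumption: a failed check costs the check plus the naive reload.
    return [FLAT_SELECTOR] * len(current_selectors), CHECK_CYCLES + NAIVE_CYCLES


print(restore_segments([0x10, 0x10]))
print(restore_segments([0x10, 0x23]))
```

Under this model the check wins whenever user code leaves the flat selectors untouched, which is the common case the optimization targets.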

Performance Impact of Specific Optimizations: • Large messages dominated by copy overhead • Small messages get benefit of faster context switching, fewer system calls, and fast access to kernel structures

IPC Performance Compared to Mach (Short Message): • Measured using pingpong micro-benchmark that makes use of unified send/receive calls • For an n-byte message, the cost is 7 + 0.02n μs in L3

IPC Performance Compared to Mach (Long Messages): • Same benchmark with larger messages • For n-byte messages larger than 2k, cache misses increase and the IPC time is 10 + 0.04n μs • Slightly higher base cost • Higher per-byte cost • By comparison, Mach takes 120 + 0.08n μs
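
The linear cost formulas from the two comparison slides can be evaluated directly. The sketch assumes times are in microseconds, that "2k" means 2048 bytes, and that the short-message formula applies up to that size; the function names are hypothetical.

```python
LONG_MESSAGE_THRESHOLD = 2048   # assumption: "2k" read as 2048 bytes


def l3_ipc_us(n: int) -> float:
    """L3 IPC time in microseconds for an n-byte message, per the slide formulas."""
    if n > LONG_MESSAGE_THRESHOLD:
        return 10 + 0.04 * n    # long messages: more cache misses
    return 7 + 0.02 * n         # short messages


def mach_ipc_us(n: int) -> float:
    """Mach IPC time in microseconds, per the slide formula."""
    return 120 + 0.08 * n


for size in (8, 4096):
    print(size, l3_ipc_us(size), mach_ipc_us(size))
```

The formulas show where the gap comes from: Mach's fixed cost dominates for small messages, while for large messages both systems are increasingly bounded by per-byte copy cost.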

Comparison of L3 RPC to Previous Systems:

Conclusions: • Well-performing IPC is essential for microkernels to gain wide adoption; poor IPC performance was a major limitation of Mach • L3 demonstrates that good performance is attainable in a microkernel system, with IPC performance that is 10 to 22 times better than Mach • The performance-centric techniques demonstrated in the paper can be employed in any system, even if the specific optimizations cannot
