Enhancing IPC Performance in Microkernel Designs: Lessons from L3 and L4

Improving IPC by Kernel Design. Jochen Liedtke. Proceedings of the 14th ACM Symposium on Operating Systems Principles, Asheville, North Carolina, 1993.

The Performance of µ-Kernel-Based Systems. H. Haertig, M. Hohmuth, J. Liedtke, S. Schoenberg, J. Wolter. Proceedings of the 16th Symposium on Operating Systems Principles, October 1997, pp. 66-77.

Jochen Liedtke (1953 – 2001)
• 1977 – Diploma in Mathematics from the University of Bielefeld.
• 1984 – Moved to GMD (German National Research Center for Computer Science). Built L3. Known for overcoming ipc performance hurdles.
• 1996 – IBM T.J. Watson Research Center. Developed L4, a 12 KB second-generation microkernel.

The IPC Dilemma
• Inter-process communication (ipc) by message passing is one of the central paradigms of µ-kernel and client/server architectures.
• It increases modularity, flexibility, security and scalability.
• But most ipc implementations of the time performed poorly (1st-generation micro-kernels such as Mach or Chorus). Really fast message passing was needed to run device drivers and other performance-critical components at user level.
• So programmers started to circumvent ipc, for example by moving device drivers and other components back into the kernel.
• To gain acceptance, ipc has to become a very efficient basic mechanism.

What to Do?
• The author sets out to construct a µ-kernel that achieves a tenfold improvement in ipc performance over comparable systems.
• “ipc performance is the master” is a key design principle.
• The result is L3, a micro-kernel-based operating system built at GMD (German National Research Center for Computer Science), and finally L4.
• Use a synergistic approach; no single “silver bullet” exists.

Summary of Techniques: Seventeen in Total

Measured Performance Gains
• Note the synergistic effect. For 8-byte ipc:
• 49% + 23% + 21% + 18% + 13% + 10% = 134%
• A figure of 49% means that removing that technique would increase ipc time by 49%.
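The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the percentages are the ones listed above, the 5.2 µs baseline is the 8-byte L3 figure quoted on the IPC Performance slide, and the helper name `time_without` is made up for illustration.

```python
# Percentages from the slide: each one is the increase in 8-byte ipc time
# that would result from removing that single technique.
gains_percent = [49, 23, 21, 18, 13, 10]
total_percent = sum(gains_percent)

def time_without(technique_percent, ipc_time_us):
    """ipc time (in microseconds) if one technique were removed (hypothetical helper)."""
    return ipc_time_us * (1 + technique_percent / 100)

print(total_percent)                      # 134
print(round(time_without(49, 5.2), 2))    # 7.75
```

Under these assumptions, dropping only the largest item would move an 8-byte ipc from 5.2 µs to roughly 7.75 µs.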

Standard System Calls (Send, Receive)
Client (Sender) and Server (Receiver) each issue separate send and receive calls:
• L4_ipc_send(): system call, enter kernel, exit kernel.
• L4_ipc_receive(): system call, enter kernel, exit kernel.
• Client is not blocked.
• L4_ipc_send(): system call, enter kernel, exit kernel.
• L4_ipc_receive(): system call, enter kernel, exit kernel.
• Kernel entered and exited four times, 107 cycles each time.

Add New System Calls
Client (Sender) and Server (Receiver) use combined calls:
• Client: L4_ipc_call(): system call, enter kernel; allocate processor to server; suspend.
• Server: L4_ipc_reply_and_wait(): resume from being suspended; return to user (exit kernel).
• Client IS blocked.
• Server: inspect message.
• Server: L4_ipc_reply_and_wait(): enter kernel; send reply; wait for next message.
• L4_ipc_receive(): system call; processor allocated to client; exit kernel.
• Kernel entered and exited two times, half as often.
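The saving can be sketched with a very small cost model. It assumes that every kernel entry/exit pair costs the 107 cycles quoted for the standard send/receive calls; the names `CYCLES_PER_KERNEL_CROSSING` and `crossing_cost` are hypothetical, and the model ignores everything except the cost of entering and leaving the kernel.

```python
# Assumption: each kernel entry/exit pair costs 107 cycles, as quoted
# for the standard send/receive system calls.
CYCLES_PER_KERNEL_CROSSING = 107

def crossing_cost(system_calls):
    """Cycles spent only on entering and leaving the kernel for the listed calls."""
    return len(system_calls) * CYCLES_PER_KERNEL_CROSSING

# One request/reply round trip with separate send and receive calls.
separate = ["L4_ipc_send", "L4_ipc_receive", "L4_ipc_send", "L4_ipc_receive"]
# The same round trip with the combined calls.
combined = ["L4_ipc_call", "L4_ipc_reply_and_wait"]

print(crossing_cost(separate))  # 428
print(crossing_cost(combined))  # 214
```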

Complex Message Structure
• Combine a sequence of send operations into a single operation by supporting complex messages.
• Benefit: reduces the number of sends.

Direct Transfer by Temporary Mapping
• LRPC and RPC share user-level memory of client and server to transfer messages. But this may affect security.
• Other micro-kernels transfer messages by a twofold copy: from process A's space into kernel space, then into process B's space.
• L4 provides single-copy transfers by temporarily sharing the target region with the sender.
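The difference between the two transfer schemes can be illustrated with a toy model that only counts copied bytes. This is not kernel code: the function names are invented, and a `memoryview` over the receiver's buffer merely stands in for the temporary mapping of the target region into the sender's address space.

```python
def twofold_copy(message):
    """Model of the conventional path; returns (receiver buffer, bytes copied)."""
    kernel_buffer = bytearray(message)          # copy 1: sender space -> kernel space
    receiver_buffer = bytearray(kernel_buffer)  # copy 2: kernel space -> receiver space
    return receiver_buffer, 2 * len(message)

def single_copy_via_mapping(message, receiver_buffer):
    """Model of the temporary-mapping path; returns bytes copied."""
    # The memoryview plays the role of the temporarily mapped target region:
    # the sender writes straight into the receiver's buffer.
    with memoryview(receiver_buffer) as window:
        window[:len(message)] = message         # the only copy
    return len(message)

message = b"x" * 4096
received, copied_twofold = twofold_copy(message)
target = bytearray(len(message))
copied_single = single_copy_via_mapping(message, target)

print(copied_twofold, copied_single)  # 8192 4096
```

In this model the receiver ends up with the same data either way; only the number of bytes moved differs.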

Scheduling, Conventional
• Conventionally, ipc operations such as call or reply & receive require scheduling actions:
• Delete the sending thread from the ready queue.
• Insert the sending thread into the waiting queue.
• Delete the receiving thread from the waiting queue.
• Insert the receiving thread into the ready queue.
• These operations, together with 4 expected TLB misses, will take at least 1.2 µs (23% of the ipc time T).

Solution, Lazy Scheduling
• Conventional ipc requires updating the thread scheduler queues on every operation.
• Performance can be improved by delaying the movement of threads within and between queues until the queues are queried.
• This "lazy" scheduling is achieved by setting state flags (ready / waiting) in the Thread Control Blocks (TCB: holds basic information about a thread) and then, at query time, scanning the queues for threads that should be moved to a different queue.
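The idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the L3/L4 implementation: the class and method names (`TCB`, `LazyScheduler`, `ipc_switch`, `next_ready`) are hypothetical, there are only two queues, and `queue_ops` simply counts queue deletions and insertions.

```python
from collections import deque
from dataclasses import dataclass

READY, WAITING = "ready", "waiting"

@dataclass(eq=False)
class TCB:
    """Minimal thread control block: a name and a state flag."""
    name: str
    state: str = READY

class LazyScheduler:
    def __init__(self, threads):
        self.ready_queue = deque(t for t in threads if t.state == READY)
        self.waiting_queue = deque(t for t in threads if t.state == WAITING)
        self.queue_ops = 0  # number of queue deletions and insertions

    def ipc_switch(self, sender, receiver):
        """ipc fast path: flip the state flags only, leave the queues alone."""
        sender.state = WAITING
        receiver.state = READY

    def next_ready(self):
        """Query time: repair the queues first, then pick a ready thread."""
        self._move_stale(self.ready_queue, self.waiting_queue, WAITING)
        self._move_stale(self.waiting_queue, self.ready_queue, READY)
        return self.ready_queue[0] if self.ready_queue else None

    def _move_stale(self, source, destination, stale_state):
        for thread in [t for t in source if t.state == stale_state]:
            source.remove(thread)
            destination.append(thread)
            self.queue_ops += 2  # one delete plus one insert

client, server = TCB("client", READY), TCB("server", WAITING)
sched = LazyScheduler([client, server])
for _ in range(100):                  # 100 call/reply round trips
    sched.ipc_switch(client, server)  # call
    sched.ipc_switch(server, client)  # reply
print(sched.next_ready().name, sched.queue_ops)  # client 0
```

In this sketch the 100 round trips touch no queue at all, because by the time the scheduler is queried each thread's flag again matches the queue it sits in. A conventional scheduler would have performed four queue operations per ipc.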

Pass Short Messages in Registers
• Typically, a high proportion of messages are very short: 8 bytes (plus 8 bytes of sender id). Examples are ack/error replies from device drivers and hardware-initiated interrupt messages.
• The 486 processor has enough registers to allow direct transfer of short messages via CPU registers.
• Performance gain of 2.4 µs, or 48% of T.

IPC Performance
• For an eight-byte message, ipc time for L3 is 5.2 µs compared to 115 µs for Mach, a 22-fold improvement.
• For a large message (4K), a 3-fold improvement is seen.

Monolithic Kernel vs. Microkernel

L4 Performance

Conclusion
• Use a synergistic approach to achieve greater ipc performance; a single “silver bullet” may not exist.
• A thorough understanding of the interaction between the hardware architecture and the operating system is key to many of the improvements. Microkernels are not portable between hardware architectures.
• L4 demonstrated the viability of running applications on top of a micro-kernel.

References
• http://i30www.ira.de/aboutus/people/liedtke/inmemoriam.php
• Microkernels; Ulfar Erlingsson, Athanasios Kyparlis
• Monolithic Kernel vs. Microkernel; Benjamin Roch; TU Wien
