
Operating System Design



  1. Operating System Design LINUX SYSTEM DESIGN & KERNEL (2.6, 3.X) Ref: Linux Kernel Development by R. Love Ref: Operating System Concepts by Silberschatz…

  2. Introduction
  • Monolithic kernel with dynamically loadable kernel modules
  • SMP support (a run queue per CPU, with load balancing)
  • Preemptive, schedulable kernel with kernel thread support
  • CPU affinity (soft and hard)
  • Kernel memory is not pageable
  • Source in GNU C (not strict ANSI C), using GNU extensions such as inline functions for efficiency
  • Kernel source tree split into architecture-independent and architecture-dependent parts
  • Portable to different architectures

  3. CPU Affinity
  • CPU affinity reduces overhead: a task that stays on one processor keeps its working set warm in that processor's cache
  • Soft affinity means the scheduler tries not to migrate processes between processors frequently
  • Hard affinity means processes run only on the processors you specify
  Reason 1: You have a hunch – computation-heavy workloads
  Reason 2: Testing complex applications – does performance scale linearly with processors?
  Reason 3: Running time-sensitive, deterministic processes
  • sched_setaffinity(…) sets a process's CPU affinity mask (see the sketch below)
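
  A minimal user-space sketch of hard affinity via sched_setaffinity(); the choice of CPU 0 is arbitrary:

      #define _GNU_SOURCE
      #include <sched.h>
      #include <stdio.h>

      int main(void)
      {
          cpu_set_t mask;

          CPU_ZERO(&mask);          /* clear the mask */
          CPU_SET(0, &mask);        /* allow CPU 0 only */

          /* pid 0 means "the calling process" */
          if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
              perror("sched_setaffinity");
              return 1;
          }
          /* from here on, the scheduler keeps this process on CPU 0 */
          return 0;
      }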

  4. Process (Task) Basics
  • Process States
  • TASK_RUNNING (running or ready to run)
  • TASK_INTERRUPTIBLE (sleeping/blocked; may be woken by a signal)
  • TASK_UNINTERRUPTIBLE (sleeping/blocked; only the awaited event can wake this task)
  • TASK_STOPPED (stopped by the SIGSTOP, SIGTTIN, or SIGTTOU signals)
  • TASK_ZOMBIE (terminated, waiting for the parent task to issue wait())

  5. Process (Task) Basics – Continued
  • Context
  • Process context – user code, or kernel code run on the process's behalf (system calls)
  • Interrupt context – kernel interrupt handling
  • Task (Process) Creation
  • fork (typically implemented with COW, i.e. copy-on-write)
  • vfork: like fork, but the child shares the parent's page tables and the parent waits until the child exits or calls exec
  • The clone system call is used to implement both fork and vfork
  • Threads are created the same way as normal tasks, except that clone is passed flags specifying which resources to share (see the sketch below)
  • Task (Process) Termination
  • Memory/files/timers/semaphores released; parent notified
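
  As a hedged illustration that threads are created like normal tasks with shared resources, this user-space sketch passes clone() the sharing flags a threading library would use; the stack size and child function are arbitrary:

      #define _GNU_SOURCE
      #include <sched.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/wait.h>

      static int child_fn(void *arg)
      {
          printf("child shares parent's VM, files, fs info, signal handlers\n");
          return 0;
      }

      int main(void)
      {
          const size_t stack_size = 64 * 1024;   /* arbitrary */
          char *stack = malloc(stack_size);
          /* these flags are what make the new task a "thread" */
          int flags = CLONE_VM | CLONE_FILES | CLONE_FS | CLONE_SIGHAND;
          pid_t pid = clone(child_fn, stack + stack_size, flags | SIGCHLD, NULL);

          if (pid == -1) { perror("clone"); return 1; }
          waitpid(pid, NULL, 0);
          free(stack);
          return 0;
      }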

  6. Process (Task) Scheduling
  • Preemptive
  • Scheduler classes (each class has its own priority range)
  • Real-time: FIFO and RR (timesliced), fixed priorities
  • Normal (SCHED_NORMAL)
  • SMP (a run queue and scheduler structures per CPU – why? to avoid contention on a single global run-queue lock)
  • Processor affinity (soft & hard)
  • Load balancing

  7. Process (Task) Scheduling (Cont.)
  • Two process-scheduling classes:
  • Normal time-sharing (dynamic priority): nice values -20 to 19, with default 0 (mapping to static priority 120)
  • Real-time (FIFO/RR) – soft real-time
  Absolute, static priorities: 0–99
  FIFO tasks run until they exit, yield, or block
  RR tasks run with a time slice
  Preemption is possible, based on priority
  • Normal processes are the focus of what follows

  8. Early Kernel 2.6 – O(1) Scheduler
  • O(1) Scheduler (early kernel 2.6): improved scheduler with O(1) operations, using bitmap operations to find the highest-priority non-empty queue
  • Active and expired arrays (run queues per CPU)
  • Scalable
  • Heuristics to favor interactive (I/O-bound) tasks over CPU-bound tasks

  9. O(1) Scheduler Priority Array

  10. O(1) Scheduler Summary
  • Implements priority arrays of task entries so that the highest-priority task can be found quickly (a priority bitmap searched with a fast find-first-set instruction; see the sketch below).
  • Recalculates the timeslice of an expired task before placing it on the expired array. When all tasks have expired, the scheduler simply swaps the active and expired array pointers and schedules the next task. Long scans of run queues are thus eliminated.
  • Scheduling takes the same amount of processing regardless of the number of tasks in the system: the cost no longer depends on n, but is a fixed constant.
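
  A minimal sketch (not the kernel's actual code) of the priority-bitmap idea: one bit per priority level, so a few word tests plus a find-first-set instruction locate the highest-priority non-empty queue in constant time. The 140-level layout follows these slides; the helper name is hypothetical:

      #define MAX_PRIO 140                     /* 0-99 real-time, 100-139 normal */
      #define BITS_PER_LONG (8 * (int)sizeof(unsigned long))
      #define BITMAP_WORDS ((MAX_PRIO + BITS_PER_LONG - 1) / BITS_PER_LONG)

      unsigned long prio_bitmap[BITMAP_WORDS]; /* bit i set => queue i non-empty */

      /* return the highest priority (lowest index) holding a runnable task,
       * or -1 if none; the cost is fixed, independent of how many tasks exist */
      int find_highest_prio(void)
      {
          for (int i = 0; i < BITMAP_WORDS; i++)
              if (prio_bitmap[i])
                  return i * BITS_PER_LONG + __builtin_ctzl(prio_bitmap[i]);
          return -1;
      }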

  11. O(1) Scheduler Problems
  • Although the O(1) scheduler performed well and scaled effortlessly to large systems with tens or hundreds of processors, it fell short on:
  • Slow response for latency-sensitive applications, i.e. the interactive processes typical of desktop systems
  • Failing to achieve fair (equal) CPU allocation

  12. Current: Completely Fair Scheduler (CFS)
  • Mainline since kernel 2.6.23
  • CFS aims at:
  • Giving each task a fair share (portion) of processor time ("completely fair")
  • Improving on the interactive performance of the O(1) scheduler for desktops, while the O(1) scheduler remained well suited to large server workloads
  • Introducing a simple, efficient algorithmic approach (a red-black tree) with O(log N) operations, whereas the O(1) scheduler relied on heuristics and its code was large and lacked algorithmic substance

  13. Completely Fair Scheduler (CFS)

  14. CFS – Processor Time Allocation
  • Select next the task that has run the least. Rather than assigning each process a fixed time slice, CFS calculates how long a process should run as a function of the total number of runnable processes (with a default minimum granularity of 1 ms)
  • Nice values weight the portion of processor time a process receives – not by additive increases, but by geometric differences. Each process runs for a "timeslice" proportional to its weight divided by the total weight of all runnable processes.
  Assume TARGETED_LATENCY = 20 ms and two threads with nice values 0 and 5: CFS assigns them relative weights of roughly 3 : 1 (per CFS's weighting algorithm), so the nice-0 thread receives 15 ms and the nice-5 thread receives 5 ms (see the worked sketch below).
  The CPU portion is determined only by the relative nice values.
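
  A worked version of this example (a sketch, using the nice-to-weight values from the kernel's weight table, where nice 0 maps to 1024 and nice 5 to 335 – roughly 3 : 1):

      #include <stdio.h>

      int main(void)
      {
          const double target_latency_ms = 20.0;
          const double w_nice0 = 1024, w_nice5 = 335;  /* kernel weight table */
          const double total = w_nice0 + w_nice5;

          /* each slice = target latency * (own weight / total weight) */
          printf("nice 0: %.1f ms\n", target_latency_ms * w_nice0 / total); /* ~15.1 */
          printf("nice 5: %.1f ms\n", target_latency_ms * w_nice5 / total); /* ~4.9 */
          return 0;
      }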

  15. CFS – The Virtual Runtime (vruntime)
  • The virtual runtime (vruntime) is the actual runtime (the amount of time spent running) normalized (or weighted) by the number of runnable processes
  • The virtual runtime is measured in nanoseconds
  • The virtual runtime is updated every time a thread runs: after running for t ns, vruntime += t, weighted by the task's niceness (priority)
  • The virtual runtime thus accounts for how long a process has run; CFS picks the process with the smallest vruntime next

  16. CFS – Process Selection
  • CFS selects the process with the minimum virtual runtime, i.e. vruntime
  • CFS uses a red-black tree (rbtree – a type of self-balancing binary search tree) to manage the list of runnable processes and efficiently find the process with the smallest vruntime
  • The selected process with the smallest vruntime is the leftmost node in the tree (a simulation of this policy follows below)
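
  Tying the last three slides together, a toy user-space simulation of the policy (an illustration only, not the kernel's rbtree code; the real kernel keeps tasks sorted in a red-black tree so the minimum-vruntime pick is just the cached leftmost node):

      #include <stdio.h>

      struct task { const char *name; double weight, vruntime, ran_ms; };

      int main(void)
      {
          /* weights from the kernel's nice-to-weight table: nice 0, nice 5 */
          struct task tasks[] = { {"nice0", 1024, 0, 0}, {"nice5", 335, 0, 0} };
          const int n = 2;
          const double tick_ms = 1.0;   /* minimum granularity, per slide 14 */

          for (int step = 0; step < 20; step++) {  /* one 20 ms target latency */
              /* pick the task with the smallest vruntime ("leftmost") */
              struct task *next = &tasks[0];
              for (int i = 1; i < n; i++)
                  if (tasks[i].vruntime < next->vruntime)
                      next = &tasks[i];

              /* run it one tick; vruntime advances inversely to weight */
              next->ran_ms += tick_ms;
              next->vruntime += tick_ms * 1024.0 / next->weight;
          }
          for (int i = 0; i < n; i++)
              printf("%s ran %.0f ms\n", tasks[i].name, tasks[i].ran_ms);
          /* prints roughly 15 ms vs 5 ms, matching slide 14's example */
          return 0;
      }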

  17. CFS – Process Just Created or Awakened
  • A new process is created: the new process is assigned the current minimum virtual runtime (adjusted) and inserted into the rbtree
  • A process is awakened from blocking: vruntime = max(old vruntime, current min_vruntime minus an adjusted TARGETED_LATENCY)
  This prevents a process that blocked for a long time from monopolizing the CPU (see the one-line sketch below)
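
  The wakeup rule in one line of kernel-style pseudocode (a sketch: se and cfs_rq follow the kernel's naming, while adjusted_target_latency is a placeholder whose exact value varies across kernel versions):

      /* on wakeup: cap the credit a long sleep can accumulate */
      se->vruntime = max(se->vruntime,
                         cfs_rq->min_vruntime - adjusted_target_latency);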

  18. CFS – Group Scheduling
  • In plain CFS, with 25 runnable processes CFS allocates 4% to each (assuming equal weights). If 20 belong to user A and 5 belong to user B, user B is at an inherent disadvantage.
  • Group scheduling is first fair to the groups, then to the individual tasks within each group: 50% to user A and 50% to user B. A's 50% is then divided fairly among A's 20 tasks (2.5% each), and B's 50% among B's 5 tasks (10% each).

  19. CFS – Run Queue (Red-Black Tree)
  • Tasks are maintained in a time-ordered (by vruntime) red-black tree per CPU
  • Red-black tree: a variant of binary search tree with colored nodes
  No path from root to leaf can be more than twice as long as any other
  Self-balancing
  Tree operations run in O(log N) time
  • CFS switches to the leftmost task in the tree, that is, the one with the lowest virtual runtime (the greatest need for the CPU), to maintain fairness

  20. CFS – Red-Black Tree (www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/)

  21. Interrupt Handling
  • Interrupts
  • Devices -> interrupt controller -> CPU -> interrupt handlers
  • Each device interrupt line has a unique value: the IRQ number
  On the PC, IRQ 0 is the timer interrupt and IRQ 1 is the keyboard interrupt
  • Exceptions ("soft interrupts")
  • Synchronous interrupts
  • Programming errors, abnormal conditions (e.g. page fault), system calls (trap)

  22. Top Halves and Bottom Halves
  • Top half
  • The interrupt handler proper (its IRQ line disabled, local interrupts disabled)
  • Runs immediately and must finish quickly
  • Acknowledges receipt of the interrupt or resets the hardware, copies data
  • Bottom half
  • Runs with interrupts enabled
  • Performs the detailed, larger share of the work
  • Example: network card
  • Top half: respond at once to optimize network throughput and latency and avoid hardware timeouts – ACK the hardware, copy packets to memory, ready the network card for more packets
  • The rest is left to the bottom half

  23. Top Half
  • Registering an interrupt handler:
  int request_irq(unsigned int irq, irq_handler_t handler, unsigned long flags, const char *name, void *dev_id)
  • Writing an interrupt handler (see the sketch below)
  • When the kernel receives an interrupt, it invokes sequentially each handler registered on that line (each handler checks whether its own device raised the interrupt)
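
  A hedged sketch of registering and writing a handler against the later-2.6 API (older 2.6 kernels also passed a struct pt_regs * to the handler and spelled the shared flag SA_SHIRQ; the device names here are hypothetical):

      #include <linux/interrupt.h>

      static irqreturn_t my_handler(int irq, void *dev_id)
      {
          struct my_dev *dev = dev_id;     /* hypothetical device structure */

          if (!my_dev_raised_irq(dev))     /* hypothetical check: shared line */
              return IRQ_NONE;             /* not ours; try the next handler */

          /* ACK the hardware, copy data, schedule the bottom half ... */
          return IRQ_HANDLED;
      }

      static int __init my_init(void)
      {
          /* dev_id lets the kernel tell sharers apart at free_irq() time */
          if (request_irq(MY_IRQ, my_handler, IRQF_SHARED, "my_device", dev))
              return -EBUSY;
          return 0;
      }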

  24. Interrupt Context
  • The top half and bottom halves (other than the work-queue type) run in interrupt context
  • Not associated with a process
  • Without a backing process it cannot sleep, because it cannot be rescheduled; it therefore cannot call functions that may sleep, which limits what one can call from interrupt context
  • Time-critical code that should be quick and simple

  25. Bottom Halves and Deferring Work
  • Softirqs – interrupt context (cannot block)
  Time-critical, high-frequency, highly threaded uses
  • Tasklets – interrupt context (cannot block)
  A special kind of softirq; simpler interface, weaker locking requirements
  • Work queues – process context (can block)
  Can sleep and be scheduled, but higher overhead from context switches; the easiest to use

  26. Bottom Halves – ksoftirqd
  • When the system is overwhelmed with softirqs, heavy softirq activity can starve user processes
  • Per-CPU kernel thread: ksoftirqd (runs at the lowest priority, 139)
  One thread per CPU services softirqs, but at the lowest priority so that user processes can still run
  ksoftirqd is awakened when do_softirq() has rerun pending softirqs MAX_SOFTIRQ_RESTART (i.e. 10) times

  27. Which Bottom Half?
  • Softirqs – interrupt context
  For highly threaded code with per-CPU data; time-critical, high-frequency uses
  • Tasklets – interrupt context
  Not finely threaded; simpler interface, easier to implement
  • Work queues – process context
  Schedulable, can sleep
  Highest overhead (context switches), but easiest to use
  If there is no need to sleep, softirqs and tasklets are the better choice (sketches of the latter two follow below)
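
  Hedged sketches of declaring the two most common bottom halves (the function names and data are hypothetical; the work-queue declaration follows later 2.6 kernels, where earlier ones took a third data argument):

      #include <linux/interrupt.h>    /* tasklets */
      #include <linux/workqueue.h>    /* work queues */

      /* tasklet: runs in interrupt context, must not sleep */
      static void my_tasklet_fn(unsigned long data)
      {
          /* quick deferred work here */
      }
      DECLARE_TASKLET(my_tasklet, my_tasklet_fn, 0);

      /* work queue item: runs in process context, may sleep */
      static void my_work_fn(struct work_struct *work)
      {
          /* may call functions that block */
      }
      static DECLARE_WORK(my_work, my_work_fn);

      static irqreturn_t my_top_half(int irq, void *dev_id)
      {
          tasklet_schedule(&my_tasklet);  /* run at the next softirq point */
          schedule_work(&my_work);        /* run in a kernel worker thread */
          return IRQ_HANDLED;
      }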

  28. Kernel Synchronization
  • The kernel is concurrent (multiple threads of execution) and needs synchronization
  • Terminology for code safe from concurrent access: interrupt-safe (against interrupt handlers), SMP-safe, preempt-safe (against kernel preemption)
  • Atomic integer operations and atomic bitwise operations (see the sketch below)
  • Spin locks, reader-writer spin locks, semaphores, reader-writer semaphores, sequence locks, completion variables
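
  A small sketch of the atomic integer and bitwise interfaces named above:

      #include <asm/atomic.h>     /* <linux/atomic.h> in later kernels */
      #include <linux/bitops.h>

      static atomic_t refcount = ATOMIC_INIT(1);
      static unsigned long flags_word;

      void atomics_example(void)
      {
          atomic_inc(&refcount);               /* refcount++ atomically */
          if (atomic_dec_and_test(&refcount))  /* --refcount; true if now 0 */
              ; /* last reference dropped: safe to free the object */

          set_bit(0, &flags_word);             /* atomic bitwise set */
          if (test_and_clear_bit(0, &flags_word))
              ; /* the bit was set, and we cleared it atomically */
      }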

  29. Spin Locks
  • Spin locks are lightweight: for short hold times, they save the overhead of a context switch
  • Spin locks (not semaphores) can be used in interrupt handlers
  The kernel must disable local interrupts before taking a spin lock; otherwise an interrupt handler could interrupt kernel code while the lock is held and attempt to reacquire it, deadlocking the CPU (see the pattern below)
  • Spin locks and bottom halves
  Bottom halves may preempt process-context code, so that code must disable bottom halves first
  Interrupts may preempt bottom halves, so shared data requires disabling interrupts first
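
  A sketch of the interrupt-safe pattern the slide describes: disabling local interrupts and taking the lock in one call, so an interrupt handler on this CPU cannot deadlock against the lock holder:

      #include <linux/spinlock.h>

      static DEFINE_SPINLOCK(mr_lock);

      void process_context_path(void)
      {
          unsigned long flags;

          spin_lock_irqsave(&mr_lock, flags);      /* disable local IRQs + lock */
          /* ... touch data shared with the interrupt handler ... */
          spin_unlock_irqrestore(&mr_lock, flags); /* restore prior IRQ state */
      }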

  30. Reader-Writer Spin Locks
  • Shared/exclusive locks
  • Reader and writer paths:
  read_lock(&my_rwlock);   /* critical region */   read_unlock(&my_rwlock);
  write_lock(&my_rwlock);  /* critical region */   write_unlock(&my_rwlock);
  • Linux 2.6 favors readers over writers for reader-writer spin locks (writers can starve)

  31. Semaphores
  • Semaphores are for long waits
  • Semaphores are for process context (the holder can sleep)
  • You cannot hold a spin lock while acquiring a semaphore (acquiring may sleep)
  • Processes need not disable kernel preemption before taking a semaphore (code holding a semaphore can be preempted)
  • Using semaphores: down and up (see the sketch below)
  down_interruptible(&mr_sem)
  up(&mr_sem)
  down_trylock()
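
  The down/up pattern as a sketch against the 2.6-era API (DECLARE_MUTEX later became DEFINE_SEMAPHORE):

      #include <asm/semaphore.h>   /* <linux/semaphore.h> in later kernels */
      #include <linux/errno.h>

      static DECLARE_MUTEX(mr_sem);  /* a semaphore with a count of one */

      int guarded_path(void)
      {
          /* sleeps until acquired; returns nonzero if woken by a signal */
          if (down_interruptible(&mr_sem))
              return -EINTR;

          /* ... long critical section; the holder may itself sleep ... */

          up(&mr_sem);
          return 0;
      }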

  32. Reader-Writer Semaphores
  • The reader-writer flavor of semaphores
  • All reader-writer semaphores are mutexes: their usage count for writers is one
  • Reader-writer semaphore locks use uninterruptible sleep
  • As with semaphores, the following are provided:
  down_read_trylock(), down_write_trylock()
  down_read(), down_write(), up_read(), up_write()

  33. Completion Variables
  • One task signals another that an event has occurred: one task waits on the completion variable while the other performs work; when it finishes, it uses the completion variable to wake the waiter (see the sketch below)
  init_completion(struct completion *) or DECLARE_COMPLETION(mr_comp)
  wait_for_completion(struct completion *)
  complete(struct completion *)
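
  A sketch of the signaling pattern: one task blocks on the completion, the other completes it:

      #include <linux/completion.h>

      static DECLARE_COMPLETION(mr_comp);

      /* task A: wait for the event */
      void waiter(void)
      {
          wait_for_completion(&mr_comp);   /* sleeps until complete() runs */
          /* ... the event has occurred ... */
      }

      /* task B: perform the work, then signal */
      void worker(void)
      {
          /* ... do the work ... */
          complete(&mr_comp);              /* wake up a waiter */
      }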

  34. Sequential Locks
  • A simple mechanism for reading and writing shared data, built on a sequence counter
  Write: lock obtained -> sequence number incremented; unlock -> sequence number incremented again
  Read: the sequence number is read before and after the read; it must be even before the read (no writer active) and equal at the end (no writer intervened)
  • Writers always succeed (barring other writers); readers never block
  • Favors writers over readers: readers do not affect a writer's locking
  • Seq locks provide a very lightweight, scalable lock for many readers and few writers

  35. Seq Locks (Cont.)
  • Example, with each primitive's rough internals shown as a comment (seqlock_t *s1 points at the lock):

      /* WRITE side */
      write_seqlock(s1);    /* spin_lock(&s1->lock); ++s1->sequence; smp_wmb(); */
      /* write data */
      write_sequnlock(s1);  /* smp_wmb(); s1->sequence++; spin_unlock(&s1->lock); */

      /* READ side */
      do {
          seq = read_seqbegin(s1);        /* ret = s1->sequence; smp_rmb(); return ret; */
          /* read data */
      } while (read_seqretry(s1, seq));   /* smp_rmb(); return (seq & 1) |
                                             (s1->sequence ^ seq); */

  • Pending writers continually cause the read loop to repeat until the writers are done.

  36. Ordering and Barriers
  • Both the compiler and the CPU can reorder reads and writes: the compiler for optimization, the CPU for performance (e.g. pipelining)
  • Memory barriers instruct the CPU not to reorder loads/stores across them; barrier() instructs the compiler not to reorder across it
  • Memory-barrier and compiler-barrier methods:
  barrier()  // compiler barrier for loads/stores
  smp_rmb(), smp_wmb(), smp_mb()
  • Intel x86 processors do not reorder writes with respect to other writes (a sketch of the classic usage follows)
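
  A sketch of the classic publish pattern these primitives exist for: order the data write before the flag write on one CPU, and the flag read before the data read on the other:

      /* shared between two CPUs; barrier macros come from the arch headers */
      int data, ready;

      void producer(void)
      {
          data = 42;      /* 1: write the payload */
          smp_wmb();      /* make data visible before ready */
          ready = 1;      /* 2: then raise the flag */
      }

      void consumer(void)
      {
          while (!ready)
              ;           /* spin until the flag is raised */
          smp_rmb();      /* don't read data before having seen ready */
          /* data is now guaranteed to be 42 */
      }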

  37. Memory Management
  • Main memory: three (3) parts
  kernel memory (never paged out)
  kernel memory for the memory map (never paged out)
  pageable page frames (user pages, paging cache, etc.)
  • Memory map: mem_map
  an array with a page descriptor for each page frame in the system; each descriptor holds a pointer to the address space the frame belongs to (if not free), or linked-list pointers for free frames

  38. Physical Memory Management
  • Various uses of physical memory
  Kernel memory (kernel, memory map)
  User pages (pageable page frames)
  Paging cache (pageable page frames)
  Arbitrary-size, contiguous kernel memory (for LKMs – device drivers)
  • Memory allocation mechanisms (see next slide)
  Page allocator – buddy algorithm
  Slab allocator
  kmalloc

  39. Physical Memory Management (Cont.)
  • Page allocator – buddy algorithm (2^i-page chunks split or coalesced)
  e.g. a request for a 65-page chunk is satisfied from a 128-page chunk
  • Slab allocator: carves chunks (obtained from the buddy allocator) into slabs of one or more physically contiguous pages
  A cache (one per kernel data structure: TCBs, semaphores, ...) consists of one or more slabs and is populated with kernel objects (instances)
  Example: to allocate a new task_struct, the kernel looks in the object cache: first in partially full slabs, then in empty slabs; failing that, it allocates a new slab
  • kmalloc(): allocates entire pages on demand, then splits them into smaller pieces and returns a pointer to a physically contiguous region of the requested size (see the sketch below)
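
  A hedged sketch of both interfaces; the cache name and object type are hypothetical, and kmem_cache_create()'s signature shifted across 2.6 versions (an extra destructor argument was later dropped):

      #include <linux/slab.h>

      struct my_obj { int id; /* ... */ };

      static struct kmem_cache *my_cache;

      void slab_example(void)
      {
          /* one cache per kernel data structure, as with task_struct */
          my_cache = kmem_cache_create("my_obj_cache", sizeof(struct my_obj),
                                       0, SLAB_HWCACHE_ALIGN, NULL);

          struct my_obj *obj = kmem_cache_alloc(my_cache, GFP_KERNEL);
          /* ... use obj ... */
          kmem_cache_free(my_cache, obj);

          /* kmalloc: byte-granular, physically contiguous memory */
          char *buf = kmalloc(512, GFP_KERNEL);
          /* ... */
          kfree(buf);
      }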

  40. Virtual Memory
  • Virtual address space
  Homogeneous, contiguous, page-aligned areas (text, mapped files, ...)
  Page size: 4 KB (Pentium), 8 KB (Alpha) – Linux also supports 4 MB pages
  • Memory descriptor
  A process address space is represented by mm_struct (pointed to by the mm field of task_struct):

      struct mm_struct {
          struct vm_area_struct *mmap; /* list of memory areas: text, data, ... */
          pgd_t *pgd;                  /* page global directory */
          atomic_t mm_users;           /* address-space users: 2 for 2 threads */
          atomic_t mm_count;           /* primary reference count */
          struct list_head mmlist;     /* list of all mm_structs */
          ...                          /* lock, semaphore ... */
          ...                          /* start/end addresses of code, data, heap, stack */
      };

  41. Virtual Memory – Paging
  • Four-level paging (for 64-bit architectures): page global/upper/middle directory, and page table
  The Pentium uses two-level paging (the global directory points to page tables)
  • Demand paging (no prepaging, no working-set concept)
  Only the user structure (PCB) and the page tables need to stay in memory
  • Page daemon (process 2): awakened (periodically or on demand) to check the number of free frames

  42. Page Replacement
  • A modified version of LRU: the two-list strategy
  Active list – the working set of all processes
  Inactive list – reclaim candidates
  Pages are placed on the active list only when they are accessed while already residing on the inactive list (where they are put when first allocated)
  Both lists are maintained pseudo-LRU: pages are added at the tail and removed from the head (like a queue)
  The lists are kept in balance: if the active list grows too large, pages are moved from the head of the active list to the tail of the inactive list.

  43. Page Replacement (Cont.)
  • A global policy
  All reclaimable pages are contained in just two lists, and pages belonging to any process may be reclaimed, rather than only those belonging to the faulting process
  • The two-list strategy enables simple pseudo-LRU semantics to perform well, and solves the only-used-once failure mode of classical LRU

  44. The Filesystem
  • To the user, Linux's file system appears as a hierarchical directory tree obeying UNIX semantics
  • Internally, the kernel hides implementation details and manages the multiple different file systems via an abstraction layer, the virtual file system (VFS)
  • The Linux VFS is designed around object-oriented principles:
  • write() -> sys_write() (VFS) -> the filesystem's write method -> physical media (see the sketch below)
  • VFS objects
  • Primary: superblock, inode (cached), dentry (cached), and file objects
  • An operations object is contained within each primary object: super_operations, inode_operations, dentry_operations, file_operations
  • Other VFS objects: file_system_type, vfsmount, and three per-process structures: files_struct, fs_struct, and namespace
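
  A sketch of the object-oriented layering (the filesystem names are hypothetical): a filesystem fills in a file_operations table, and sys_write() dispatches through it to reach the medium:

      #include <linux/fs.h>

      /* a hypothetical filesystem's implementation of the write method */
      static ssize_t myfs_write(struct file *filp, const char __user *buf,
                                size_t count, loff_t *ppos)
      {
          /* ... copy from user space, hand the data to the physical medium ... */
          return count;
      }

      /* the operations object contained in each file object */
      static const struct file_operations myfs_fops = {
          .write = myfs_write,
          /* .read, .open, .release, ... */
      };

      /* path: write(fd, ...) -> sys_write() -> file->f_op->write() -> media */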

  45. File System and Device Drivers (layered diagram)
  User mode: user applications -> libraries
  Kernel mode: file subsystem -> buffer/page cache -> character and block device drivers -> hardware control

  46. Virtual File System
