Linux Kernel 2.6 Scheduler O(1) from O(n) A B.Sc Seminar in Software Engineering

Student: Maxim Raskin
Lecturer: Dr. Itzhak Aviv
Tel Aviv Afeka College of Engineering, Semester B 2010

A Brief Intro to the Linux Kernel
• Created in 1991 by Linus Torvalds (hence, Linux), a Finnish computer science student, as a hobby project to experiment with the Intel 80386 CPU.
• Kernel code is mostly C and assembler.
• The first version contained 10,239 lines of code; nowadays it contains 12,990,041 (!).
• In 1992 the code became self-hosting.
• The X Window System GUI was added to the OS.
• Distributed under the GPL license to keep it out of commercial clutches.

Monolithic Kernel
Linux is a monolithic kernel: the whole OS operates under a privileged jurisdiction called Kernel Mode (or Kernel Space). User code and OS code operate in separate spaces, and the kernel code is protected by hardware-dependent CPU flags.
• The kernel is in charge of the following routine tasks:
  • Handling drivers - communicating with the hardware.
  • File system management.
  • Memory management.
  • Scheduling of tasks.

Present
Today, the Linux kernel spans a variety of distros for all kinds of architectures:
• Lightweight netbooks
• Music players
• Servers
• Mobile devices
• Desktop computers
• Various other embedded systems
• Cluster computing

Program Building Blocks
Programs and Processes
• A program is a box which consists of data and the instructions which give meaning to it.
• A process is an instance of a program (as an object is to a class in OOP).
• The purpose of a process is to embody the running state of a program: its threads and their data, the hardware registers and the contents of the program's address space.

Program Building Blocks
Threads
• A process can have multiple threads of execution that work together to accomplish its goals.
• An OS kernel must keep state information for every thread it creates.
• Threads of a process sometimes share the same address space; at other times their address spaces overlap or are completely separate.
• Only one thread can execute on a given CPU at a time.
• Threads can be seen in almost every video game, where multiple threads handle the various modules of the game: one for user input, one for graphics, one for sound processing and another for AI.

Program Building Blocks
Scheduling
A multitasking kernel allows multiple processes to run alongside one another:
• Processes are not aware of each other unless programmed to be so.
• A scheduler frees the programmer from the tedious task of scheduling and lets them focus on actual programming (imagine how "fun" it would have been otherwise).
• The scheduler's goal is to decide the policy for each thread in the system:
  • How long will it run?
  • On which CPU?
  • At what priority?
• The scheduler itself is invoked routinely, driven by a timer interrupt.

Program Building Blocks
CPU and I/O Bound Threads
• Threads divide into two categories:
  • CPU Bound – threads that perform computations and depend heavily on the CPU's resources.
  • I/O Bound – threads that wait for a certain I/O device; these threads are CPU friendly since they sleep more.
A scheduler cannot know for sure which thread is CPU bound and which is I/O bound; however, it can guesstimate with reasonable precision. The common practice is to prioritize I/O bound tasks: since they spend a long time waiting for their resource, it's best to start them ASAP.

Linux Scheduling Goals
Real World Needs
In the real world, beyond theory, a scheduler's goal is not to blindly follow a set of mathematically proven algorithms, but to adjust itself to the needs of its target market through real experience. Linux has gone far beyond its 80386 origins; nowadays its code has to support desktops, servers, embedded systems, cluster computers and NUMA environments. The Linux scheduler has been tailored to these uses accordingly.

Linux Scheduling Goals
Core Requirements
• Efficiency – the fewer context switches, the better.
• Interactivity – responsiveness.
• Preventing starvation – fairness among threads.
• Support for soft RT (real-time) tasks.

Linux Scheduling Goals
Multiprocessing: SMP, SMT and NUMA
• The era of multiprocessing is at hand; 2.6 includes support for:
• SMP – Symmetric MultiProcessing – several identical processors sharing the same memory, for example several cores on the same CPU die. The goals:
  • Efficient division of workload.
  • Keeping a thread running on the CPU it started on, to avoid re-caching.
• SMT – Simultaneous MultiThreading – a concept first implemented by Intel, which in turn named it HyperThreading (HT). One CPU virtualizes multiple CPUs. The caveat: virtual cores cannot be treated as truly separate cores, as they share the same cache resources.
• NUMA – Non-Uniform Memory Access – a type of clustering technique which involves tight coupling of nodes (node = motherboard) and a high number of CPUs.
  • Major problem: memory locality. Threads should stay on the node they started on so that efficiency does not suffer.

The Linux 2.6.x Scheduler
Introduction: O(1)
The current 2.6.x scheduler guarantees constant O(1) running time because every algorithm it is composed of runs in O(1). This is a vast improvement over the 2.4 scheduler, which ran in O(n). The underlying reason is that the scheduler's running time no longer depends on the number of tasks in the system.

The Linux 2.6.x Scheduler
Introduction: Priority & Timeslices
• Process priority is dynamic:
  • Threads which use up all of their timeslice are considered CPU bound and get little to no priority boost.
  • Threads which sleep a lot are considered nice to the CPU and in turn gain a priority boost.
• Priority levels are divided into 2 ranges with a total of 140 levels:
  • nice value – [-20,+19], default: 0 (lower number = higher priority).
  • real-time priority – [0,99] (again, in the kernel's internal numbering, lower number = higher priority).
• Timeslice – the time in milliseconds for which a given thread may run before it is preempted.
• Deciding the perfect length of a timeslice is not a trivial task:
  • The longer the timeslice, the less interactive the system can become.
  • Timeslices that are too short cause heavy resource losses to context switches.
  • A timeslice doesn't have to be used up all at once, e.g. a 100ms timeslice can be used up over 5 different reschedules, 20ms each.
  • Large timeslices benefit interactive tasks: these tasks run first and foremost, and they won't really use up all of the timeslice anyway.
  • A process which has exhausted its timeslice is not eligible to run until all other processes have exhausted theirs.
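
The two ranges above can be laid out on a single 0..139 scale. The following is a minimal sketch, assuming the usual split in which real-time levels occupy 0..99 and nice values map onto 100..139. The function names nice_to_prio and prio_to_nice are illustrative and are not taken from the slides.

```c
#include <assert.h>

#define MAX_RT_PRIO 100                  /* real-time levels: 0..99 */
#define MAX_PRIO    (MAX_RT_PRIO + 40)   /* 140 levels in total */

/* Map a nice value in [-20, +19] to a static priority in [100, 139]. */
int nice_to_prio(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return MAX_RT_PRIO + nice + 20;
}

/* Inverse mapping: static priority in [100, 139] back to a nice value. */
int prio_to_nice(int prio)
{
    assert(prio >= MAX_RT_PRIO && prio < MAX_PRIO);
    return prio - MAX_RT_PRIO - 20;
}
```

With this layout the default nice value 0 lands on level 120, and every real-time level is numerically below (hence more urgent than) every nice level.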

The Linux 2.6.x Scheduler
Introduction: Preemption
• The Linux operating system is preemptive.
• The scheduler executes in the following scenarios:
  • A process enters the TASK_RUNNING state and its priority is higher than that of the current process.
  • A process has exhausted its timeslice (it has reached 0).

The Linux 2.6.x Scheduling Algorithm
Code Orientation
• Code is located in kernel/sched.c and include/linux/sched.h.
• Interesting facts:
  • Some types are declared within sched.c and not in sched.h; this is done in order to abstract away scheduler-private types.
  • Public declarations are to be found within sched.h.

2.6.x Scheduler Data Structures
Runqueue
• Keeps track of the runnable and expired tasks of each CPU (1 rq per CPU).
• Declared in kernel/sched.c:
    struct runqueue {
        spinlock_t lock;                        /* spin lock that protects this runqueue */
        unsigned long nr_running;               /* number of runnable tasks */
        unsigned long nr_switches;              /* context switch count */
        unsigned long expired_timestamp;        /* time of last array swap */
        unsigned long nr_uninterruptible;       /* uninterruptible tasks */
        unsigned long long timestamp_last_tick; /* last scheduler tick */
        struct task_struct *curr;               /* currently running task */
        struct task_struct *idle;               /* this processor's idle task */
        struct mm_struct *prev_mm;              /* mm_struct of last ran task */
        struct prio_array *active;              /* active priority array */
        struct prio_array *expired;             /* the expired priority array */
        struct prio_array arrays[2];            /* the actual priority arrays */
        struct task_struct *migration_thread;   /* migration thread */
        struct list_head migration_queue;       /* migration queue */
        atomic_t nr_iowait;                     /* number of tasks waiting on I/O */
    };

2.6.x Scheduler Data Structures
Runqueue - Safety
• A runqueue can be accessed from multiple threads; however, only one thread is allowed to access it at a time. Locks are used to obtain thread safety:
    struct runqueue *rq;
    rq = this_rq_lock();
    /* manipulate this process's current runqueue, rq */
    rq_unlock(rq);
• To avoid deadlocks when accessing multiple runqueues, a convention is used: locks are obtained in order of ascending runqueue address:
    /* to lock ... */
    if (rq1 == rq2)
        spin_lock(&rq1->lock);
    else {
        if (rq1 < rq2) {
            spin_lock(&rq1->lock);
            spin_lock(&rq2->lock);
        } else {
            spin_lock(&rq2->lock);
            spin_lock(&rq1->lock);
        }
    }
    /* manipulate both runqueues ... */
    /* to unlock ... */
    spin_unlock(&rq1->lock);
    if (rq1 != rq2)
        spin_unlock(&rq2->lock);
• To automate the steps above, the functions double_rq_lock(rq1, rq2) and double_rq_unlock(rq1, rq2) can be used.
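
The address-ordering convention can be tried out in user space. The following is a minimal sketch, assuming a toy lock that only records whether it is held; toy_lock, acquire_log, double_lock and double_unlock are hypothetical names for illustration and are not kernel APIs. Because both callers agree on the same global order, two threads locking the same pair in opposite argument order can never each hold one lock while waiting for the other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a spinlock: it only records whether it is held. */
struct toy_lock {
    int held;
};

/* Acquisition order of the latest double_lock() call, kept for inspection. */
struct toy_lock *acquire_log[2];
size_t acquire_count;

void toy_lock_acquire(struct toy_lock *l)
{
    assert(!l->held);              /* a real spinlock would spin here */
    l->held = 1;
    acquire_log[acquire_count++] = l;
}

void toy_lock_release(struct toy_lock *l)
{
    assert(l->held);
    l->held = 0;
}

/* Take both locks, lowest address first; a shared lock is taken only once. */
void double_lock(struct toy_lock *a, struct toy_lock *b)
{
    acquire_count = 0;
    if (a == b) {
        toy_lock_acquire(a);
    } else if ((uintptr_t)a < (uintptr_t)b) {
        toy_lock_acquire(a);
        toy_lock_acquire(b);
    } else {
        toy_lock_acquire(b);
        toy_lock_acquire(a);
    }
}

void double_unlock(struct toy_lock *a, struct toy_lock *b)
{
    toy_lock_release(a);
    if (a != b)
        toy_lock_release(b);
}
```

Calling double_lock(&x, &y) and double_lock(&y, &x) acquires the two locks in the same order, which is the whole point of the convention.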

2.6.x Scheduler Data Structures
Priority Arrays
• Priority arrays are the data structures that provide O(1) scheduling by mapping each runnable task to a priority queue.
• Each runqueue contains pointers to 2 priority array objects: active and expired.
• Priority arrays are defined in kernel/sched.c:
    struct prio_array {
        int nr_active;                      /* number of tasks in the queues */
        unsigned long bitmap[BITMAP_SIZE];  /* priority bitmap */
        struct list_head queue[MAX_PRIO];   /* priority queues */
    };
• queue[MAX_PRIO] – an array of queues (linked lists), 1 queue per priority (MAX_PRIO == 140 by default).
• bitmap[BITMAP_SIZE] – a map of bits (of which MAX_PRIO == 140 are used); a bit is turned on whenever at least one task exists at the given priority level.
  • The bitmap makes it easy to find the highest-priority task using a macro called sched_find_first_bit(), which operates in O(1).
• nr_active – the number of tasks active in all of the priority queues.
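
To see why the bitmap lookup does not depend on the number of tasks, here is a minimal user-space sketch. It assumes 64-bit bitmap words and replaces the per-priority linked lists with simple counters; toy_prio_array, toy_enqueue, toy_dequeue and toy_find_first_bit are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PRIO     140
#define BITMAP_WORDS ((MAX_PRIO + 63) / 64)   /* 3 words of 64 bits */

struct toy_prio_array {
    int      nr_active;              /* tasks across all levels */
    uint64_t bitmap[BITMAP_WORDS];   /* bit p set <=> level p is non-empty */
    int      queued[MAX_PRIO];       /* per-level count (stands in for the lists) */
};

void toy_enqueue(struct toy_prio_array *a, int prio)
{
    assert(prio >= 0 && prio < MAX_PRIO);
    a->queued[prio]++;
    a->nr_active++;
    a->bitmap[prio / 64] |= (uint64_t)1 << (prio % 64);
}

void toy_dequeue(struct toy_prio_array *a, int prio)
{
    assert(prio >= 0 && prio < MAX_PRIO && a->queued[prio] > 0);
    a->nr_active--;
    if (--a->queued[prio] == 0)      /* level became empty: clear its bit */
        a->bitmap[prio / 64] &= ~((uint64_t)1 << (prio % 64));
}

/* Lowest set bit = most urgent non-empty level; MAX_PRIO if the array is empty.
 * The work is bounded by MAX_PRIO, never by the number of queued tasks. */
int toy_find_first_bit(const struct toy_prio_array *a)
{
    for (int w = 0; w < BITMAP_WORDS; w++) {
        if (a->bitmap[w] == 0)
            continue;
        for (int b = 0; b < 64; b++)
            if (a->bitmap[w] & ((uint64_t)1 << b))
                return w * 64 + b;
    }
    return MAX_PRIO;
}
```

Whether 3 or 3,000 tasks are queued, the search inspects at most a fixed number of bits; a real implementation would typically use a hardware find-first-set instruction instead of the inner loop.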

2.6.x Calculations
Recalculating Timeslices – The Past
• Here's a naïve (pre 2.6.x) algorithm:
    for (each task in the system) {
        recalculate priority
        recalculate timeslice
    }
• Issues that arise from the above algorithm:
  • Worst case O(n) complexity (n = number of tasks in the system).
  • Recalculation must be performed under some kind of lock protection, resulting in high lock contention.
  • The recalculation time is nondeterministic, which causes problems for deterministic real-time tasks.

2.6.x Calculations
Recalculating Timeslices – The Present
    #define BASE_TIMESLICE(p) (MIN_TIMESLICE + \
        ((MAX_TIMESLICE - MIN_TIMESLICE) * \
         (MAX_PRIO-1 - (p)->static_prio) / (MAX_USER_PRIO-1)))
• A simple O(1) call to BASE_TIMESLICE(task) does the trick.
• Timeslice calculation protects against unlimited forking: when a child is forked, the parent splits its remaining timeslice with the child instead of the child receiving a fresh one.
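
A worked example of the formula may help. The sketch below assumes MIN_TIMESLICE = 10 ms, MAX_TIMESLICE = 200 ms, MAX_PRIO = 140 and MAX_USER_PRIO = 40, expressed directly in milliseconds; the slides do not state these values, so treat them as illustrative. The even split on fork in split_timeslice is likewise an assumption about how the parent "divides" its slice.

```c
#include <assert.h>

#define MAX_RT_PRIO    100
#define MAX_PRIO       140
#define MAX_USER_PRIO  (MAX_PRIO - MAX_RT_PRIO)   /* 40 nice levels */
#define MIN_TIMESLICE  10     /* ms, assumed for illustration */
#define MAX_TIMESLICE  200    /* ms, assumed for illustration */

/* Linear interpolation between MAX_TIMESLICE (prio 100) and MIN_TIMESLICE (prio 139). */
int base_timeslice_ms(int static_prio)
{
    assert(static_prio >= MAX_RT_PRIO && static_prio < MAX_PRIO);
    return MIN_TIMESLICE +
           (MAX_TIMESLICE - MIN_TIMESLICE) *
           (MAX_PRIO - 1 - static_prio) / (MAX_USER_PRIO - 1);
}

/* On fork the remaining slice is shared, so forking cannot mint extra CPU time. */
void split_timeslice(int *parent_ms, int *child_ms)
{
    *child_ms  = (*parent_ms + 1) / 2;   /* child takes the rounded-up half */
    *parent_ms = *parent_ms / 2;         /* parent keeps the rest */
}
```

Under these assumptions a default task (static priority 120) gets 10 + 190 * 19 / 39 = 102 ms with integer division, the most favoured nice level gets 200 ms and the least favoured gets 10 ms. The sum of parent and child slices after a fork equals the parent's slice before it.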

2.6.x Calculations
Recalculating Timeslices – The Present
• The new scheduler removes the need for a recalculation loop.
• Instead, it maintains two priority arrays for each CPU (located in its runqueue):
  • active – contains all tasks that still have a nonzero timeslice.
  • expired – contains all tasks which have exhausted their timeslice.
• When a task's timeslice reaches zero, it is recalculated and the task is put into the expired array. When all tasks are done with their timeslices, the arrays are swapped:
    struct prio_array *array = rq->active;
    if (!array->nr_active) {
        rq->active = rq->expired;
        rq->expired = array;
    }
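
The swap is just an exchange of two pointers, which is why it costs the same no matter how many tasks have expired. A minimal sketch follows, with the arrays reduced to task counters; toy_rq, toy_expire_one and toy_swap_if_drained are hypothetical names used only for this illustration.

```c
#include <assert.h>

struct toy_array {
    int nr_active;                 /* number of tasks held by this array */
};

struct toy_rq {
    struct toy_array *active;      /* tasks with timeslice left */
    struct toy_array *expired;     /* tasks that used up their timeslice */
    struct toy_array arrays[2];    /* backing storage for both */
};

void toy_rq_init(struct toy_rq *rq, int nr_tasks)
{
    rq->arrays[0].nr_active = nr_tasks;
    rq->arrays[1].nr_active = 0;
    rq->active  = &rq->arrays[0];
    rq->expired = &rq->arrays[1];
}

/* A task ran out of timeslice: it moves from active to expired. */
void toy_expire_one(struct toy_rq *rq)
{
    assert(rq->active->nr_active > 0);
    rq->active->nr_active--;
    rq->expired->nr_active++;
}

/* Swap the roles of the two arrays once active is empty; returns 1 if swapped. */
int toy_swap_if_drained(struct toy_rq *rq)
{
    struct toy_array *array = rq->active;

    if (array->nr_active != 0)
        return 0;
    rq->active  = rq->expired;
    rq->expired = array;
    return 1;
}
```

After the swap, every task that had expired is runnable again with its already recalculated timeslice, and no per-task loop was needed.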

2.6.x Calculations
Calculating Priority
• As you may recall, interactive tasks are given priority. How can the scheduler tell the difference? It tracks how long each task sleeps compared to how long it runs:
    task->sleep_avg += sleep_time;  /* sleep_avg is capped at MAX_SLEEP_AVG (on the order of one second in the 2.6 sources) */
    task->sleep_avg -= run_time;
• The scheduler uses the value above to reward or penalize a task's dynamic priority; the penalty/reward range is [-5,+5]:
    #define CURRENT_BONUS(p) \
        (NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / MAX_SLEEP_AVG)
    int effective_prio(struct task_struct *p)
    {
        int bonus, prio;
        /* Compute bonus based on sleep_avg (see CURRENT_BONUS above) */
        bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
        /* a positive bonus lowers the number, i.e. raises the dynamic priority */
        prio = p->static_prio - bonus;
        return prio;
    }
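
The arithmetic can be checked with a few numbers. The sketch below assumes MAX_BONUS = 10, measures sleep_avg in milliseconds, and uses a cap of 1000 ms purely for illustration; it also clamps the result to the non-real-time range 100..139, which the excerpt above omits. toy_effective_prio is a hypothetical name.

```c
#include <assert.h>

#define MAX_RT_PRIO       100
#define MAX_PRIO          140
#define MAX_BONUS         10
#define MAX_SLEEP_AVG_MS  1000   /* assumed cap, for illustration only */

/* Dynamic priority = static priority minus a bonus in [-5, +5]. */
int toy_effective_prio(int static_prio, int sleep_avg_ms)
{
    int bonus, prio;

    assert(sleep_avg_ms >= 0 && sleep_avg_ms <= MAX_SLEEP_AVG_MS);

    /* Scale sleep_avg to 0..MAX_BONUS, then centre it around zero. */
    bonus = sleep_avg_ms * MAX_BONUS / MAX_SLEEP_AVG_MS - MAX_BONUS / 2;

    prio = static_prio - bonus;

    /* Keep the result inside the non-real-time range. */
    if (prio < MAX_RT_PRIO)
        prio = MAX_RT_PRIO;
    if (prio > MAX_PRIO - 1)
        prio = MAX_PRIO - 1;
    return prio;
}
```

A task at static priority 120 that sleeps almost all the time ends up at 115, one that never sleeps ends up at 125, and one that splits its time evenly stays at 120.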

2.6.x Decisions
Deciding which task goes next
• The act of picking the next task and switching to it is implemented via the schedule() function:
  • Called when a task goes to sleep.
  • Called when a task is preempted.
  • Runs independently on each CPU.
• schedule() is relatively simple for all it must accomplish. Here is how it determines the next task to run:
    struct task_struct *prev, *next;
    struct list_head *queue;
    struct prio_array *array;
    int idx;

    prev = current;
    array = rq->active;                            /* rq is this CPU's runqueue */
    idx = sched_find_first_bit(array->bitmap);     /* highest non-empty priority level */
    queue = array->queue + idx;
    next = list_entry(queue->next, struct task_struct, run_list);
    if (prev != next)
        context_switch();

2.6.x Decisions Deciding which task goes next - Illustration

2.6.x Decisions
Sleeping And Waking Up
• Sleeping and waking up are an important aspect which must be implemented properly:
  • When a task declares that it needs to sleep, it is marked as sleeping, moved off the runqueue and put on a wait queue; it then calls schedule() so the scheduler can pick another task to run. Waking up is the reverse.
• Sleeping tasks have 2 states associated with them:
  • TASK_INTERRUPTIBLE – can be awakened prematurely (to respond to a signal).
  • TASK_UNINTERRUPTIBLE – cannot be disturbed.
• Wait queues are represented in the kernel code by the wait_queue_head_t struct, which heads a linked list.

2.6.x Decisions
Sleeping And Waking Up: The Code
    /* 'q' is the wait queue we wish to sleep on */
    DECLARE_WAITQUEUE(wait, current);

    add_wait_queue(q, &wait);
    while (!condition) {    /* condition is the event that we are waiting for */
        set_current_state(TASK_INTERRUPTIBLE);  /* or TASK_UNINTERRUPTIBLE */
        if (signal_pending(current)) {
            /* handle signal (typically by giving up the wait) */
            break;
        }
        schedule();         /* give up the CPU until we are woken */
    }
    set_current_state(TASK_RUNNING);
    remove_wait_queue(q, &wait);

2.6.x Decisions Sleeping And Waking Up: Illustration

2.6.x Scheduler – Multiprocessor Env
The Load Balancer
In a multiprocessor environment each CPU has its own dedicated runqueue (5). It is imperative to balance the tasks across the runqueues, so that we do not end up in a situation where, for example, one CPU has 20 tasks and another has 15. The idea is this: pull tasks from busy runqueues and put them into less busy runqueues.

2.6.x Scheduler – Multiprocessor Env
The Load Balancer
The load balancer is implemented via the load_balance() function. It is invoked in two ways:
• From the schedule() function, whenever the current runqueue is empty.
• By a timer interrupt: every 1ms when the system is idle and every 200ms otherwise.
• Note: on uniprocessor systems it is never called, and is not even compiled into the kernel.

2.6.x Scheduler – Multiprocessor Env
The Load Balancer
• The load_balance() function and its related functions are fairly large and complicated, although the steps they perform are comprehensible:
  • First, load_balance() calls find_busiest_queue() to determine the busiest runqueue, that is, the runqueue with the greatest number of processes in it. If no runqueue has at least 25% more processes than the current one, find_busiest_queue() returns NULL and load_balance() returns. Otherwise, the busiest runqueue is returned.
  • Second, load_balance() decides from which priority array on the busiest runqueue it wants to pull. The expired array is preferred because those tasks have not run in a relatively long time and thus are most likely not in the processor's cache (that is, they are not "cache hot"). If the expired priority array is empty, the active one is the only choice.
  • Next, load_balance() finds the highest-priority (smallest value) list that has tasks, because it is more important to fairly distribute high-priority tasks than lower-priority ones.
  • Each task of the given priority is analyzed to find a task that is not running, not prevented from migrating by processor affinity, and not cache hot. If the task meets these criteria, pull_task() is called to pull the task from the busiest runqueue to the current runqueue.
  • As long as the runqueues remain imbalanced, the previous two steps are repeated and more tasks are pulled from the busiest runqueue to the current one. Finally, when the imbalance is resolved, the current runqueue is unlocked and load_balance() returns.
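
The first step, the 25% threshold, can be sketched in a few lines. This is a literal reading of the description above rather than the kernel's actual code: runqueue load is reduced to an array of task counts, and toy_find_busiest is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the busiest queue if it holds at least 25% more tasks
 * than this CPU's queue; return -1 if balancing is not worthwhile. */
int toy_find_busiest(const unsigned long *nr_running, size_t ncpus, size_t this_cpu)
{
    size_t busiest = this_cpu;

    assert(ncpus > 0 && this_cpu < ncpus);

    for (size_t i = 0; i < ncpus; i++)
        if (nr_running[i] > nr_running[busiest])
            busiest = i;

    if (busiest == this_cpu)
        return -1;                     /* nobody is busier than we are */

    /* busiest >= 1.25 * ours, written in integer arithmetic */
    if (nr_running[busiest] * 4 < nr_running[this_cpu] * 5)
        return -1;                     /* imbalance too small to bother */

    return (int)busiest;
}
```

With counts of 20 and 15, the CPU holding 15 tasks would pull from the other (20 is a third more than 15), whereas with 16 and 15 it would leave things alone.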

2.6.x Scheduler – Multiprocessor Env
The Load Balancer Code
    static int load_balance(int this_cpu, runqueue_t *this_rq,
                            struct sched_domain *sd, enum idle_type idle)
    {
        struct sched_group *group;
        runqueue_t *busiest;
        unsigned long imbalance;
        int nr_moved;

        spin_lock(&this_rq->lock);

        group = find_busiest_group(sd, this_cpu, &imbalance, idle);
        if (!group)
            goto out_balanced;

        busiest = find_busiest_queue(group);
        if (!busiest)
            goto out_balanced;

        nr_moved = 0;
        if (busiest->nr_running > 1) {
            double_lock_balance(this_rq, busiest);
            nr_moved = move_tasks(this_rq, this_cpu, busiest,
                                  imbalance, sd, idle);
            spin_unlock(&busiest->lock);
        }
        spin_unlock(&this_rq->lock);

2.6.x Scheduler – Multiprocessor Env
The Load Balancer Code
        if (!nr_moved) {
            sd->nr_balance_failed++;
            if (unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2)) {
                int wake = 0;

                spin_lock(&busiest->lock);
                if (!busiest->active_balance) {
                    busiest->active_balance = 1;
                    busiest->push_cpu = this_cpu;
                    wake = 1;
                }
                spin_unlock(&busiest->lock);
                if (wake)
                    wake_up_process(busiest->migration_thread);
                sd->nr_balance_failed = sd->cache_nice_tries;
            }
        } else
            sd->nr_balance_failed = 0;

        sd->balance_interval = sd->min_interval;
        return nr_moved;

    out_balanced:
        spin_unlock(&this_rq->lock);
        if (sd->balance_interval < sd->max_interval)
            sd->balance_interval *= 2;
        return 0;
    }

2.6.x Scheduler
Soft Real-Time (RT) Scheduling
The scheduler supports real-time scheduling quite well: it will do its best to meet predetermined deadlines but does not guarantee it. RT tasks have the priority range [0,99]. These tasks will always preempt user tasks, as user tasks range over [100,139] and, in the kernel's internal numbering, a lower number means a higher priority. Two scheduling schemes are available:
1. SCHED_FIFO – As the name implies, first in, first out. Timeslices are irrelevant in this scheme: the task with the highest priority runs until it finishes, blocks or yields, or until a higher-priority task preempts it.
2. SCHED_RR – Round Robin. Tasks are scheduled by priority; tasks of the same priority run in round-robin fashion, each for a pre-allotted timeslice.
(Illustration: execution timelines of two tasks, T1 and T2.)

2.6.x Scheduler
Conclusion
The Linux scheduler has come a long way to become a reliable, smart and adjustable piece of code which drives the heart of what computers were meant to do: run tasks as efficiently as possible. It provides proper means for the various computing environments in the market, be it embedded systems, personal computers (uniprocessor and SMP), NUMA or general clustering. It has adapted itself quite well to all of the above, but still has some open problems which have been eased but not eradicated. As always in engineering, if it is accurate enough and usable for the tasks it is assigned to, then it achieves its goals.

Performance Graphs

Performance Graphs Served Hours

Thanks For Listening
