
Concurrency and Race Conditions



Presentation Transcript


  1. Concurrency and Race Conditions Linux Kernel Programming CIS 4930/COP 5641

  2. Motivation: Example Pitfall in scull

  3. Pitfalls in scull • Race condition: result of uncontrolled access to shared data
       if (!dptr->data[s_pos]) {
           dptr->data[s_pos] = kmalloc(quantum, GFP_KERNEL);
           if (!dptr->data[s_pos]) {
               goto out;
           }
       }

  6. Pitfalls in scull • If two processes both find dptr->data[s_pos] to be NULL, both call kmalloc(); the second assignment overwrites the first pointer, which is never freed • Memory leak
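The fix previewed by the later mutex slides can be sketched in user space. This is a pthreads analogue, not the kernel API: the NULL check and the allocation happen under one lock, so only one thread allocates. The names (get_or_alloc_slot, slot) are illustrative, standing in for dptr->data[s_pos].

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t slot_lock = PTHREAD_MUTEX_INITIALIZER;
static void *slot;        /* stands in for dptr->data[s_pos] */
static int alloc_calls;   /* counts how many allocations actually happened */

void *get_or_alloc_slot(size_t quantum)
{
    pthread_mutex_lock(&slot_lock);
    if (!slot) {                     /* check and update are now one atomic step */
        slot = calloc(1, quantum);   /* calloc plays the role of kzalloc here */
        alloc_calls++;
    }
    pthread_mutex_unlock(&slot_lock);
    return slot;
}

static void *worker(void *arg)
{
    (void)arg;
    return get_or_alloc_slot(4000);
}

int run_two_threads(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return alloc_calls;   /* with the lock held, exactly one allocation */
}
```

Without slot_lock, both workers could pass the NULL check before either assignment, reproducing the leak on the slide.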

  7. Managing Concurrency

  8. Concurrency and Its Management • Sources of concurrency • Multiple user-space processes • Multiple CPUs • Device interrupts • Timers

  9. Some guiding principles • Try to avoid concurrent access entirely • Global variables • Apply locking and mutual exclusion principles • Implications for device drivers • Use sufficient concurrency mechanisms (depending on context) • No object can be made available to the kernel until it can function properly • References to such objects must be tracked for proper removal • Avoid “roll your own” solutions

  10. Managing Concurrency • Atomic operation: all or nothing from the perspective of other threads • Critical section: code executed by only one thread at a time • Not all critical sections are the same • Access from interrupt handlers • Latency constraints

  11. Lock Design Considerations • Context • Can another thread be scheduled on the current processor? • Assumptions of kernel operation • Breaking assumptions will break code that relies on them • Time expected to wait for the lock • Considerations • Amount of time the lock is expected to be held • Amount of expected contention • If held long: other threads can make better use of the processor • If held short: the time to switch to another thread would be longer than just waiting a short amount of time

  12. Kernel Locking Implementations • mutex • Sleep if lock cannot be acquired immediately • Allow other threads to use the processor • spinlock • Continuously try to grab the lock • Generally do not allow sleeping • Why?

  13. Mutex

  14. Mutex Implementation • Architecture-dependent code • Optimizations • Initialization
       DEFINE_MUTEX(name);
       void mutex_init(struct mutex *lock);
     • Various routines
       void mutex_lock(struct mutex *lock);
       int mutex_lock_interruptible(struct mutex *lock);
       int mutex_lock_killable(struct mutex *lock);
       void mutex_unlock(struct mutex *lock);
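The two kernel initialization forms have direct pthreads counterparts, which this userspace sketch uses (it is an analogue, not the kernel API): PTHREAD_MUTEX_INITIALIZER plays the role of DEFINE_MUTEX(name), and pthread_mutex_init() the role of mutex_init(). The function name demo_mutex_init is illustrative.

```c
#include <assert.h>
#include <pthread.h>

/* Static initialization, like DEFINE_MUTEX(static_lock) in the kernel */
static pthread_mutex_t static_lock = PTHREAD_MUTEX_INITIALIZER;

int demo_mutex_init(void)
{
    pthread_mutex_t runtime_lock;
    int ok = 1;

    /* Runtime initialization, like mutex_init(&runtime_lock) */
    pthread_mutex_init(&runtime_lock, NULL);

    /* trylock returns 0 when the lock was free and is now held */
    ok &= (pthread_mutex_trylock(&static_lock) == 0);
    pthread_mutex_unlock(&static_lock);

    ok &= (pthread_mutex_trylock(&runtime_lock) == 0);
    pthread_mutex_unlock(&runtime_lock);

    pthread_mutex_destroy(&runtime_lock);
    return ok;
}
```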

  15. Using mutexes in scull • scull_dev structure revisited
       struct scull_dev {
           struct scull_qset *data;  /* Pointer to first quantum set */
           int quantum;              /* the current quantum size */
           int qset;                 /* the current array size */
           unsigned long size;       /* amount of data stored here */
           unsigned int access_key;  /* used by sculluid & scullpriv */
           struct mutex mutex;       /* mutual exclusion */
           struct cdev cdev;         /* Char device structure */
       };

  16. Using mutexes in scull • scull_dev initialization
       for (i = 0; i < scull_nr_devs; i++) {
           scull_devices[i].quantum = scull_quantum;
           scull_devices[i].qset = scull_qset;
           mutex_init(&scull_devices[i].mutex);  /* before cdev_add */
           scull_setup_cdev(&scull_devices[i], i);
       }

  17. Using mutexes in scull • scull_write() begins with
       if (mutex_lock_interruptible(&dev->mutex))
           return -ERESTARTSYS;
     • scull_write() ends with
       out:
           mutex_unlock(&dev->mutex);
           return retval;

  18. mutex_lock_interruptible() returns nonzero • If the system call can be resubmitted • Undo any visible changes and restart • Otherwise return -EINTR • E.g., when the changes could not be undone


  20. Restartable system call • Automatic restarting of certain interrupted system calls • Retry with the same argument values • Simplifies user-space programming for dealing with “interrupted system call” errors • POSIX permits an implementation to restart system calls, but does not require it • SUS defines the SA_RESTART flag to provide a means by which an application can request that interrupted system calls be restarted • http://pubs.opengroup.org/onlinepubs/009604499/functions/sigaction.html • Kernel side: return -ERESTARTSYS
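From user space, "restart with the same arguments" is the classic EINTR retry loop that SA_RESTART automates. The sketch below simulates it with fake_syscall(), a hypothetical stand-in (not a real system call) that fails once with EINTR and then succeeds, so the wrapper's retry behavior can be checked deterministically.

```c
#include <assert.h>
#include <errno.h>

static int calls;   /* how many times the fake "syscall" ran */

/* Hypothetical stand-in for an interruptible system call:
 * interrupted (EINTR) on the first attempt, succeeds on the retry. */
static int fake_syscall(void)
{
    if (++calls == 1) {
        errno = EINTR;   /* interrupted before doing any work */
        return -1;
    }
    return 0;            /* success on the restarted attempt */
}

int call_with_restart(void)
{
    int ret;
    do {
        ret = fake_syscall();            /* retry with the same arguments */
    } while (ret == -1 && errno == EINTR);
    return ret;
}
```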

  21. Restartable system call • Arguments may need to be modified • return -ERESTART_RESTARTBLOCK • Specify a callback function to modify the arguments • http://lwn.net/Articles/17744/

  22. Userspace write() and kernel-space *_interruptible() • From the POSIX man page • If write() is interrupted by a signal before it writes any data, it shall return -1 with errno set to [EINTR]. • If write() is interrupted by a signal after it successfully writes some data, it shall return the number of bytes written. • http://pubs.opengroup.org/onlinepubs/009604499/functions/sigaction.html

  23. mutex_lock_killable() • mutex_lock() • The process assumes that it cannot be interrupted by a signal • Breaking that assumption breaks the user/kernel-space interface • If the process receives a fatal signal and mutex_lock() never returns • Results in an immortal process • The assumption does not apply once the process receives a fatal signal • The process that made the system call will never return to user space • So returning early does not break the assumption, since the process does not continue • http://lwn.net/Articles/288056/

  24. Mutex Usage as Completion (Error) • https://lkml.org/lkml/2013/12/2/997

  25. General Pattern • refcount variable for deciding which thread performs cleanup • Usage • Initialize the shared object • Set refcount to the number of concurrent threads • Start multiple threads • The last thread cleans up
       <do stuff>
       mutex_lock(obj->lock);
       dead = !--obj->refcount;
       mutex_unlock(obj->lock);
       if (dead)
           free(obj);
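The pattern can be sketched in user space with pthreads (an analogue, not kernel code; the names obj_new and obj_put are illustrative): the refcount starts at the number of users, and whichever caller drops it to zero frees the object.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct obj {
    pthread_mutex_t lock;
    int refcount;
};

struct obj *obj_new(int users)
{
    struct obj *o = malloc(sizeof(*o));
    pthread_mutex_init(&o->lock, NULL);
    o->refcount = users;   /* one reference per concurrent user */
    return o;
}

/* Returns 1 if this caller was the last user and freed the object. */
int obj_put(struct obj *o)
{
    int dead;
    pthread_mutex_lock(&o->lock);
    dead = !--o->refcount;
    pthread_mutex_unlock(&o->lock);   /* slide 27 shows why this unlock/free
                                         ordering is subtle for kernel mutexes */
    if (dead) {
        pthread_mutex_destroy(&o->lock);
        free(o);
    }
    return dead;
}
```

With two users, the first obj_put() returns 0 and the second returns 1 and frees.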

  26. fs/pipe.c
       __pipe_lock(pipe);
       ...
       spin_lock(&inode->i_lock);
       if (!--pipe->files) {
           inode->i_pipe = NULL;
           kill = 1;
       }
       spin_unlock(&inode->i_lock);
       __pipe_unlock(pipe);
       if (kill)
           free_pipe_info(pipe);

  27. CPU 1 / CPU 2 interleaving
       CPU 1:
           mutex_lock(obj->lock);
           dead = !--obj->refcount;
           // refcount was 2, is now 1, dead = 0.
       CPU 2:
           mutex_lock(obj->lock);
           // blocks on obj->lock, goes to slowpath
           // mutex is negative; CPU 2 is in optimistic
           // spinning mode in __mutex_lock_common
       CPU 1:
           mutex_unlock(obj->lock);
           // __mutex_fastpath_unlock(): fastpath fails (because mutex is nonpositive)
           // __mutex_unlock_slowpath:
           if (__mutex_slowpath_needs_to_unlock())
               atomic_set(&lock->count, 1);
       CPU 2:
           if ((atomic_read(&lock->count) == 1) &&
               (atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
           // ... and now CPU 2 owns the mutex, and goes on:
           dead = !--obj->refcount;
           // refcount was 1, is now 0, dead = 1.
           mutex_unlock(obj->lock);
           if (dead)
               free(obj);
       CPU 1 (meanwhile, still busy unlocking):
           if (!list_empty(&lock->wait_list)) {
           // ... touching the mutex inside the now-freed obj

  28. Conclusion • A mutex serializes what is inside the mutex, but not necessarily the lock ITSELF • Use spinlocks and/or atomic refcounts • “don't use mutexes to implement completions”

  29. Completions

  30. Completions • Start an operation and wait for it to complete (outside the current thread) • Common pattern in kernel programming • E.g., wait for initialization to complete • Reasons to use instead of mutexes • Wake up multiple threads • More efficient • More meaningful syntax • Subtle races with mutex implementation code (cleanup of the mutex itself) • http://lkml.iu.edu//hypermail/linux/kernel/0107.3/0674.html • https://lkml.org/lkml/2008/4/11/323 • Declared in #include <linux/completion.h>

  31. Completions • To create a completion
       DECLARE_COMPLETION(my_completion);
     • Or
       struct completion my_completion;
       init_completion(&my_completion);
     • To wait for the completion, call
       void wait_for_completion(struct completion *c);
       int wait_for_completion_interruptible(struct completion *c);
       unsigned long wait_for_completion_timeout(struct completion *c,
                                                 unsigned long timeout);

  32. Completions • To signal a completion event, call one of the following
       /* wake up one waiting thread */
       void complete(struct completion *c);
       /* wake up multiple waiting threads */
       /* need to call INIT_COMPLETION(struct completion c)
          to reuse the completion structure */
       void complete_all(struct completion *c);

  33. Completions • Example: misc-modules/complete.c
       DECLARE_COMPLETION(comp);

       ssize_t complete_read(struct file *filp, char __user *buf,
                             size_t count, loff_t *pos)
       {
           printk(KERN_DEBUG "process %i (%s) going to sleep\n",
                  current->pid, current->comm);
           wait_for_completion(&comp);
           printk(KERN_DEBUG "awoken %i (%s)\n", current->pid, current->comm);
           return 0; /* EOF */
       }

  34. Completions • Example
       ssize_t complete_write(struct file *filp, const char __user *buf,
                              size_t count, loff_t *pos)
       {
           printk(KERN_DEBUG "process %i (%s) awakening the readers...\n",
                  current->pid, current->comm);
           complete(&comp);
           return count; /* succeed, to avoid retrial */
       }
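A completion can be sketched in user space with a mutex, a condition variable, and a done flag. This is an analogue of the kernel API, not the API itself; the my_ prefix marks the illustrative names. The done flag gives complete_all() its semantics: every current waiter wakes, and later waiters return immediately.

```c
#include <assert.h>
#include <pthread.h>

struct my_completion {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int done;
};

void my_init_completion(struct my_completion *c)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->cond, NULL);
    c->done = 0;
}

void my_wait_for_completion(struct my_completion *c)
{
    pthread_mutex_lock(&c->lock);
    while (!c->done)                       /* guards against spurious wakeups */
        pthread_cond_wait(&c->cond, &c->lock);
    pthread_mutex_unlock(&c->lock);
}

void my_complete_all(struct my_completion *c)
{
    pthread_mutex_lock(&c->lock);
    c->done = 1;
    pthread_cond_broadcast(&c->cond);      /* wake every waiter */
    pthread_mutex_unlock(&c->lock);
}

static struct my_completion comp_demo;
static int awoken;

static void *reader_thread(void *arg)
{
    (void)arg;
    my_wait_for_completion(&comp_demo);    /* like the readers in complete.c */
    __sync_fetch_and_add(&awoken, 1);
    return NULL;
}

int run_completion_demo(void)
{
    pthread_t t1, t2;
    my_init_completion(&comp_demo);
    pthread_create(&t1, NULL, reader_thread, NULL);
    pthread_create(&t2, NULL, reader_thread, NULL);
    my_complete_all(&comp_demo);           /* like complete_all(&comp) */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return awoken;
}
```

Because done is checked under the lock, it does not matter whether the readers start waiting before or after my_complete_all() runs.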

  35. Spinlocks

  36. Spinlocks • Generally used in code that should not sleep • (e.g., interrupt handlers) • Usually implemented as a single bit • If the lock is available, the bit is set and the code continues • If the lock is taken, the code enters a tight loop • Repeatedly checks the lock until it becomes available
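The "single bit plus tight loop" idea can be sketched with a C11 atomic_flag in user space. Real kernel spinlocks layer fairness, preemption control, and IRQ handling on top of this; the my_ names are illustrative.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_flag lock_bit = ATOMIC_FLAG_INIT;   /* the "single bit" */
static long counter;

static void my_spin_lock(void)
{
    /* test-and-set: loop until we are the one who flipped the bit */
    while (atomic_flag_test_and_set_explicit(&lock_bit, memory_order_acquire))
        ;   /* tight loop: repeatedly check until the lock is released */
}

static void my_spin_unlock(void)
{
    atomic_flag_clear_explicit(&lock_bit, memory_order_release);
}

static void *incrementer(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        my_spin_lock();
        counter++;                /* critical section: one thread at a time */
        my_spin_unlock();
    }
    return NULL;
}

long run_spin_demo(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, incrementer, NULL);
    pthread_create(&b, NULL, incrementer, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return counter;   /* no lost updates: 2 * 100000 */
}
```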

  37. Spinlocks • Actual implementation varies for different architectures • Protect a process from other CPUs and interrupts • Usually does nothing on uniprocessor machines • Exception: changing the IRQ masking status

  38. Introduction to Spinlock API • #include <linux/spinlock.h> • To initialize, declare
       DEFINE_SPINLOCK(my_lock);   /* older kernels: spinlock_t my_lock = SPIN_LOCK_UNLOCKED; */
     • Or call
       void spin_lock_init(spinlock_t *lock);
     • To acquire a lock, call
       void spin_lock(spinlock_t *lock);
     • Spinlock waits are uninterruptible
     • To release a lock, call
       void spin_unlock(spinlock_t *lock);

  39. Spinlocks and Atomic Context • While holding a spinlock, be atomic • Do not sleep or relinquish the processor • Examples of calls that can sleep • Copying data to or from user space • User-space page may need to be on disk… • Memory allocation • Memory might not be available • Disable interrupts (on the local CPU) as needed • Hold spinlocks for the minimum time possible

  40. The Spinlock Functions • Four functions to acquire a spinlock
       void spin_lock(spinlock_t *lock);
       /* disables interrupts on the local CPU */
       void spin_lock_irqsave(spinlock_t *lock, unsigned long flags);
       /* only if no other code disabled interrupts */
       void spin_lock_irq(spinlock_t *lock);
       /* disables software interrupts (e.g., tasklets); leaves hardware interrupts enabled */
       void spin_lock_bh(spinlock_t *lock);

  41. The Spinlock Functions • Four functions to release a spinlock
       void spin_unlock(spinlock_t *lock);
       /* need to use the same flags variable used for locking */
       /* need to call spin_lock_irqsave and spin_unlock_irqrestore in the
          same function, or your code may break on some architectures */
       void spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags);
       void spin_unlock_irq(spinlock_t *lock);
       void spin_unlock_bh(spinlock_t *lock);

  42. Locking Traps • It is very hard to manage concurrency • What can possibly go wrong?

  43. Ambiguous Rules • Shared data structure D, protected by lock L
       function A() {
           lock(&L);
           /* call function B() that accesses D */
           unlock(&L);
       }
     • If function B() calls lock(&L), we have a deadlock

  44. Ambiguous Rules • Solution • Have clear entry points to access data structures • Document assumptions about locking

  45. Lock Ordering Rules • Multiple locks should always be acquired in the same order • Easier said than done
       function A() {
           lock(&L1);
           lock(&L2);
           /* access D */
           unlock(&L2);
           unlock(&L1);
       }

       function B() {
           lock(&L2);
           lock(&L1);
           /* access D */
           unlock(&L1);
           unlock(&L2);
       }
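One common way to enforce "same order everywhere" is sketched below in user space with pthreads (an analogue, with illustrative names lock_pair/unlock_pair): a helper acquires any two locks in a single global order, here by comparing addresses, so A() and B() can no longer deadlock no matter which order their callers name the locks in.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t L1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t L2 = PTHREAD_MUTEX_INITIALIZER;
static int D;   /* shared data protected by both locks */

static void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    /* canonical order: always take the lower-addressed lock first */
    if (a > b) { pthread_mutex_t *t = a; a = b; b = t; }
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
}

static void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    /* release order does not matter for deadlock avoidance */
    pthread_mutex_unlock(a);
    pthread_mutex_unlock(b);
}

int access_D_from_A(void)
{
    lock_pair(&L1, &L2);
    int v = ++D;
    unlock_pair(&L1, &L2);
    return v;
}

int access_D_from_B(void)
{
    lock_pair(&L2, &L1);   /* caller order differs; acquisition order does not */
    int v = ++D;
    unlock_pair(&L2, &L1);
    return v;
}
```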

  46. Lock Ordering Rules
       function A() {
           lock(&L1);
           X();
           unlock(&L1);
       }

       function X() {
           lock(&L2);
           /* access D */
           unlock(&L2);
       }

       function B() {
           lock(&L2);
           Y();
           unlock(&L2);
       }

       function Y() {
           lock(&L1);
           /* access D */
           unlock(&L1);
       }

  47. Lock Ordering Rules of Thumb • Choose a lock ordering that is local to your code before taking a lock belonging to a more central part of the kernel • Lock of central kernel code likely has more users (more contention) • Obtain the mutex first before taking the spinlock • Grabbing a mutex (which can sleep) inside a spinlock can lead to deadlocks

  48. Fine- Versus Coarse-Grained Locking • Coarse-grained locking • Poor concurrency • Fine-grained locking • Need to know which one to acquire • And which order to acquire • At the device driver level • Start with coarse-grained locking • Refine the granularity as contention arises • Can enable lockstat to check lock holding time

  49. BKL • Kernel used to have “big kernel lock” • Giant spinlock introduced in Linux 2.0 • Only one CPU could be executing locked kernel code at any time • BKL has been removed • https://lwn.net/Articles/384855/ • https://www.linux.com/learn/tutorials/447301:whats-new-in-linux-2639-ding-dong-the-big-kernel-lock-is-dead

  50. Alternatives to Locking • Lock-free algorithms • Atomic variables • Bit operations • seqlocks • Read-copy-update (RCU)
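The "atomic variables" alternative can be sketched with userspace C11 atomics (the kernel equivalent is atomic_t with atomic_inc() and friends; run_atomic_demo is an illustrative name): a shared counter updated with atomic_fetch_add needs no lock at all.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_long hits;   /* shared counter, no lock protecting it */

static void *hit_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++)
        atomic_fetch_add(&hits, 1);   /* lock-free read-modify-write */
    return NULL;
}

long run_atomic_demo(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, hit_worker, NULL);
    pthread_create(&b, NULL, hit_worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return atomic_load(&hits);   /* no lost updates: 2 * 100000 */
}
```

A plain `long` with `hits++` here would lose updates; the atomic read-modify-write is what makes the lock unnecessary.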
