Threads and Synchronization

Threads and Synchronization Jeff Chase Duke University

A process can have multiple threads int main(intargc, char *argv[]) { if (argc != 2) { fprintf(stderr, "usage: threads <loops>\n"); exit(1); } loops = atoi(argv[1]); pthread_t p1, p2; printf("Initial value : %d\n", counter); pthread_create(&p1, NULL, worker, NULL); pthread_create(&p2, NULL, worker, NULL); pthread_join(p1, NULL); pthread_join(p2, NULL); printf("Final value : %d\n", counter); return 0; } volatile int counter = 0; int loops; void *worker(void *arg) { inti; for (i = 0; i < loops; i++) { counter++; } pthread_exit(NULL); } data [code from OSTEP]

Threads • A thread is a stream of control. • defined by CPU register context (PC, SP, …) • Note: process “context” is thread context plus protected registers defining current VAS, e.g., ASID or “page table base register(s)”. • Generally “context” is the register values and referenced memory state (stack, page tables) • Multiple threads can execute independently: • They can run in parallel on multiple cores... • physical concurrency • …or arbitrarily interleaved on a single core. • logical concurrency • Each thread must have its own stack.

Threads and the kernel process • Modern operating systems have multi-threaded processes. • A program starts with one main thread, but once running it may create more threads. • Threads may enter the kernel (e.g., syscall). • Threads are known to the kernel and have separate kernel stacks, so they can block independently in the kernel. • Kernel has syscalls to create threads (e.g., Linux clone). • Implementations vary. • This model applies to Linux, MacOS-X, Windows, Android, and pthreads or Java on those systems. VAS user mode user space data threads trap fault resume kernel mode kernel space

Portrait of a thread “Heuristic fencepost”: try to detect stack overflow errors Thread Control Block (“TCB”) Storage for context (register values) when thread is not running. name/status etc 0xdeadbeef Stack ucontext_t Thread operations (parent) a rough sketch: t = create(); t.start(proc, argv); t.alert(); (optional) result = t.join(); Self operations (child) a rough sketch: exit(result); t = self(); setdata(ptr); ptr = selfdata(); alertwait(); (optional) Details vary.

A thread This slide applies to the process abstraction too, or, more precisely, to the main thread of a process. active ready or running User TCB sleep wait wakeup signal wait blocked kernel TCB kernel stack user stack When a thread is blocked its TCB is placed on a sleep queue of threads waiting for a specific wakeup event. Program

Java threads: the basics class RunnableTaskimplements Runnable { public RunnableTask(…) { // save any arguments or input for the task (optional) } public void run() { // do task: your code here } } … RunnableTask task = new RunnableTask(); Thread t1= new Thread(task, "thread1"); t1.start(); …

Java threads: the basics If you prefer, you may extend the Java Thread class. public void MyThread extends Thread { public void run() { // do task: your code here } } … Thread t1 = new MyThread(); t1.start();

CPU Scheduling 101 The OS scheduler makes a sequence of “moves”. • Next move: if a CPU core is idle, pick a ready threadfrom the ready pool and dispatch it (run it). • Scheduler’s choice is “nondeterministic” • Scheduler and machine determine the interleaving of execution (a schedule). blocked threads If timer expires, or wait/yield/terminate ready pool Wakeup GetNextToRun SWITCH()

Non-determinism and ordering Thread A Thread B Thread C Global ordering Time Why do we care about the global ordering? Might have dependencies between events Different orderings can produce different results Why is this ordering unpredictable? Can’t predict how fast processors will run

Non-determinism example • y=10; • Thread A: x = y+1; • Thread B: y = y*2; • Possible results? • A goes first: x = 11 and y = 20 • B goes first: y = 20 and x = 21 • Variable y is shared between threads.

Another example • Two threads (A and B) • A tries to increment i • B tries to decrement i i= 0; Thread A: while (i < 10){ i++; } print “A done.” Thread B: while (i > -10){ i--; } print “B done.”

Example continued • Who wins? • Does someone have to win? Thread A: i = 0; while (i < 10){ i++; } print “A done.” Thread B: i = 0; while (i > -10){ i--; } print “B done.”

Two threads sharing a CPU concept reality context switch

Resource Trajectory Graphs Resource trajectory graphs (RTG) depict the “random walk” through the space of possible program states. Sm Sn So • RTG is useful to depict all possible executions of multiple threads. I draw them for only two threads because slides are two-dimensional. • RTG for N threads is N-dimensional. • Thread i advances along axis i. • Each point represents one state in the set of all possible system states. • Cross-product of the possible states of all threads in the system

Resource Trajectory Graphs This RTG depicts a schedule within the space of possible schedules for a simple program of two threads sharing one core. Every schedule ends here. Blue advances along the y-axis. EXIT The diagonal is an idealized parallel execution (two cores). Purple advances along the x-axis. The scheduler chooses the path (schedule, event order, or interleaving). context switch From the point of view of the program, the chosen path is nondeterministic. EXIT Every schedule starts here.

A race This is a valid schedule. But the schedule interleaves the executions of “x = x+ 1” in the two threads. The variable x is shared (like the counter in the pthreads example). This schedule can corrupt the value of the shared variable x, causing the program to execute incorrectly. This is an example of a race: the behavior of the program depends on the schedule, and some schedules yield incorrect results. x=x+1 x=x+1 start

Reading Between the Lines of C load x, R2 ; load global variable x add R2, 1, R2 ; increment: x = x + 1 store R2, x ; store global variable x Two threads execute this code section. x is a shared variable. load add store load add store Two executions of this code, so: x is incremented by two. ✔

Interleaving matters load x, R2 ; load global variable x add R2, 1, R2 ; increment: x = x + 1 store R2, x ; store global variable x Two threads execute this code section. x is a shared variable. load add store load add store X In this schedule, x is incremented only once: last writer wins. The program breaks under this schedule. This bug is a race.

Concurrency control • Each thread executes a sequence of instructions, but their sequences may be arbitrarily interleaved. • E.g., from the point of view of loads/stores on memory. • Each possible execution order is a schedule. • It is the program’s responsibility to exclude schedules that lead to incorrect behavior. • It is called synchronization or concurrency control. • The scheduler (and the machine) select the execution order of threads

This is not a game • But we can think of it as a game. • You write your program. • The game begins when you submit your program to your adversary: the scheduler. • The scheduler chooses all the moves while you watch. • Your program may constrain the set of legal moves. • The scheduler searches for a legal schedule that breaks your program. • If it succeeds, then you lose (your program has a race). • You win by not losing.  x=x+1 x=x+1

The need for mutual exclusion  The program may fail if the schedule enters the grey box (i.e., if two threads execute the critical section concurrently). The two threads must not both operate on the shared global x “at the same time”. x=??? x=x+1 x=x+1

A Lock or Mutex Locks are the basic tools to enforce mutual exclusion in conflicting critical sections. • A lock is a special data item in memory. • API methods: Acquire and Release. • Also called Lock() and Unlock(). • Threads pair calls to Acquire and Release. • Acquire upon entering a critical section. • Release upon leaving a critical section. • Between Acquire/Release, the thread holds the lock. • Acquire does not pass until any previous holder releases. • Waiting locks can spin (a spinlock) or block (a mutex). A A R R

Definition of a lock (mutex) • Acquire + release ops on L are strictly paired. • After acquire completes, the caller holds (owns) the lock L until the matching release. • Acquire + release pairs on each L are ordered. • Total order: each lock L has at most one holder at any given time. • That property is mutual exclusion; L is a mutex.

OSTEP pthread example (2) load add store load add store pthread_mutex_t m; volatile int counter = 0; int loops; void *worker(void *arg) { inti; for (i = 0; i < loops; i++) { Pthread_mutex_lock(&m); counter++; Pthread_mutex_unlock(&m); } pthread_exit(NULL); } A A “Lock it down.” R R 

Portrait of a Lock in Motion The program may fail if it enters the grey box. A lock (mutex) prevents the schedule from ever entering the grey box, ever: both threads would have to hold the same lock at the same time, and locks don’t allow that.  R x=??? x=x+1 A R A x=x+1

Handing off a lock serialized (one after the other) First I go. release acquire Then you go. Handoff The nth release, followed by the (n+1)th acquire

Mutual exclusion in Java • Mutexes are built in to every Java object. • no separate classes • Every Java object is/has a monitor. • At most one thread may “own” a monitor at any given time. • A thread becomes owner of an object’s monitor by • executing an object method declared as synchronized • executing a block that is synchronized on the object public void increment() { synchronized(this) { x = x + 1; } } public synchronized void increment() { x = x + 1; }

New Problem: Ping-Pong • void • PingPong() { • while(not done) { • … • if (blue) • switch to purple; • if (purple) • switch to blue; • } • }

Ping-Pong with Mutexes? • void • PingPong() { • while(not done) { • Mx->Acquire(); • … • Mx->Release(); • } • } ???

Mutexes don’t work for ping-pong

Condition variables • A condition variable (CV) is an object with an API. • wait: block until the condition becomes true • Not to be confused with Unix wait* system call • signal (also called notify): signal that the condition is true • Wake up one waiter. • Every CV is bound to exactly one mutex, which is necessary for safe use of the CV. • “holding the mutex”  “in the monitor” • A mutex may have any number of CVs bound to it. • CVs also define a broadcast (notifyAll) primitive. • Signal all waiters.

Condition variable operations wait (){ release lock put thread on wait queue go to sleep // after wake up acquire lock } Atomic Lock always held Lock usually held Lock usually held Lock always held signal (){ wakeup one waiter (if any) } Atomic broadcast (){ wakeup all waiters (if any) } Atomic

Ping-Pong using a condition variable void PingPong() { mx->Acquire(); while(not done) { … cv->Signal(); cv->Wait(); } mx->Release(); }

Lab #1

OSTEP pthread example (1) volatile int counter = 0; int loops; void *worker(void *arg) { inti; for (i = 0; i < loops; i++) { counter++; } pthread_exit(NULL); } int main(intargc, char *argv[]) { if (argc != 2) { fprintf(stderr, "usage: threads <loops>\n"); exit(1); } loops = atoi(argv[1]); pthread_t p1, p2; printf("Initial value : %d\n", counter); pthread_create(&p1, NULL, worker, NULL); pthread_create(&p2, NULL, worker, NULL); pthread_join(p1, NULL); pthread_join(p2, NULL); printf("Final value : %d\n", counter); return 0; } data

Threads on cores load load add add store intx; worker() { while (1) {x++}; } store A A jmp jmp load R load add add R store store jmp jmp load load load add add add store store store jmp jmp jmp

Interleaving matters load x, R2 ; load global variable x add R2, 1, R2 ; increment: x = x + 1 store R2, x ; store global variable x Two threads execute this code section. x is a shared variable. load add store load add store X In this schedule, x is incremented only once: last writer wins. The program breaks under this schedule. This bug is a race.

“Lock it down” Use a lock (mutex) to synchronize access to a data structure that is shared by multiple threads. A thread acquires (locks) the designated mutex before operating on a given piece of shared data. The thread holds the mutex. At most one thread can hold a given mutex at a time (mutual exclusion). Thread releases (unlocks) the mutex when done. If another thread is waiting to acquire, then it wakes. context switch  R x=x+1 A R A x=x+1 start Themutex bars entry to the grey box: the threads cannot both hold the mutex.

Spinlock: a first try Spinlocks provide mutual exclusion among cores without blocking. int s = 0; lock() { while (s == 1) {}; ASSERT (s == 0); s = 1; } unlock (); ASSERT(s == 1); s = 0; } Global spinlock variable Busy-wait until lock is free. Spinlocks are useful for lightly contended critical sections where there is no risk that a thread is preempted while it is holding the lock, i.e., in the lowest levels of the kernel.

Spinlock: what went wrong Race to acquire. Two (or more) cores see s == 0. ints = 0; lock() { while (s == 1) {}; s = 1; } unlock (); s = 0; }

We need an atomic “toehold” • To implement safe mutual exclusion, we need support for some sort of “magic toehold” for synchronization. • The lock primitives themselves have critical sections to test and/or set the lock flags. • Safe mutual exclusion on multicore systems requires some hardware support: atomic instructions • Examples: test-and-set, compare-and-swap, fetch-and-add. • These instructions perform an atomic read-modify-write of a memory location. We use them to implement locks. • If we have any of those, we can build higher-level synchronization objects like monitors or semaphores. • Note: we also must be careful of interrupt handlers…. • They are expensive, but necessary.

Atomic instructions: Test-and-Set load test store load test store Spinlock::Acquire () { while(held); held = 1; } Wrong load 4(SP), R2 ; load “this” busywait: load 4(R2), R3 ; load “held” flag bnz R3, busywait ; spin if held wasn’t zero store #1, 4(R2) ; held = 1 Right load 4(SP), R2 ; load “this” busywait: tsl 4(R2), R3 ; test-and-set this->held bnz R3, busywait ; spin if held wasn’t zero One example: tsl test-and-set-lock (from an old machine) Problem: interleaved load/test/store. Solution: TSL atomically sets the flag and leaves the old value in a register. (bnz means “branch if not zero”)

Threads on cores: with locking tsl L tsl L bnz bnz load tsl L add bnz intx; worker() while (1) { acquire L; x++; release L; }; } store tsl L zero L bnz tsl L jmp tsl L bnz tsl L bnz load bnz load add tsl L add store bnz store zero L tsl L zero L jmp bnz jmp tsl L tsl L

Threads on cores: with locking tsl L tsl L bnz load atomic add spin intx; worker() while (1) { acquire L; x++; release L; }; } store A A zero L R jmp tsl L tsl L bnz R load add spin store zero L jmp tsl L tsl L

Spinlock: IA32 Idle the core for a contended lock. Spin_Lock: CMP lockvar, 0 ;Check if lock is free JE Get_Lock PAUSE ; Short delay JMP Spin_Lock Get_Lock: MOV EAX, 1 XCHG EAX, lockvar ; Try to get lock CMP EAX, 0 ; Test if successful JNE Spin_Lock Atomic exchange to ensure safe acquire of an uncontended lock. XCHG is a variant of compare-and-swap: compare x to value in memory location y; if x!= *y then exchange x and *y. Determine success/failure from subsequent value of x.

Locking and blocking H T If thread T attempts to acquire a lock that is busy (held), T must spin and/or block (sleep) until the lock is free. By sleeping, T frees up the core for some other use. Just sitting and spinning is wasteful! A A R STOP wait running R yield preempt sleep dispatch Note: H is the lock holder when T attempts to acquire the lock. blocked ready wakeup

Sleeping in the kernel Any trap or fault handler may suspend (sleep) the current thread, leaving its state (call frames) on its kernel stack and a saved context in its TCB. syscall traps faults sleep queue ready queue interrupts A later event/action may wakeup the thread.

Locking and blocking H T T enters the kernel (via syscall) to block. Suppose T is sleeping in the kernel to wait for a contended lock (mutex). When the lock holder H releases, H enters the kernel (via syscall) to wakeup a waiting thread (e.g., T). A A R STOP wait running R yield preempt Note: H can block too, perhaps for some other resource! H doesn’t implicitly release the lock just because it blocks. Many students get that idea somehow. sleep dispatch blocked ready wakeup

Blocking This slide applies to the process abstraction too, or, more precisely, to the main thread of a process. active ready or running sleep wait wakeup signal wait When a thread is blocked on a synchronization object (e.g., a mutex or CV) its TCB is placed on a sleep queue of threads waiting for an event on that object. blocked kernel TCB sleep queue ready queue

Threads and Synchronization