Threads Cannot be Implemented As a Library

Andrew Hobbs

As a library... what does that mean?
• Threads are added by a library (for example, Pthreads) on top of a language such as C
• The language specification says nothing about threads
• The specification defines what compilers must do
• So the compiler doesn't know about threads either

How does this affect programming?
• The compiler transforms your code to make it as fast as possible
• It is restricted only by what the language specification requires
• But if the compiler doesn't know about concurrency...
• It can make optimizations that are valid in sequential programs but cause bugs in multithreaded ones

An example
Assume x and y are both initially 0, and suppose we have 2 threads:
    Thread 1: x = 1; r1 = y;
    Thread 2: y = 1; r2 = x;
What are the possible values of r1 and r2 after both threads have finished?

An example
But what if our compiler changes our code to the following?
    Thread 1: r1 = y; x = 1;
    Thread 2: r2 = x; y = 1;
What are the possible values of r1 and r2 after both threads have finished?
The results can turn out differently: r1 == 0 and r2 == 0 is now possible, which no interleaving of the original code allows. But from the compiler's view everything is fine, because the two statements in each thread are independent and it doesn't know that the threads can interact.
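To make the two questions concrete, here is a minimal sketch that enumerates every interleaving of the two threads in a single-threaded simulation. It assumes each load and store is one indivisible step, and the names (op_t, outcomes_t, enumerate, original_program, reordered_program) are invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* One step of a thread: store 1 to a shared variable, or load one. */
typedef enum { STORE_X, STORE_Y, LOAD_X, LOAD_Y } op_t;

/* seen[r1][r2] is true if some interleaving ends with those values. */
typedef struct { bool seen[2][2]; } outcomes_t;

/* Run one interleaving. Bit i of mask picks the thread that takes step i
   (0 = thread 1, 1 = thread 2). Each thread must take exactly two steps. */
static void run_one(const op_t t1[2], const op_t t2[2],
                    unsigned mask, outcomes_t *out)
{
    int x = 0, y = 0, r1 = 0, r2 = 0;
    int pc1 = 0, pc2 = 0;
    for (int step = 0; step < 4; step++) {
        bool second = (mask >> step) & 1u;
        assert(second ? pc2 < 2 : pc1 < 2);
        op_t op = second ? t2[pc2++] : t1[pc1++];
        int *reg = second ? &r2 : &r1;
        switch (op) {
        case STORE_X: x = 1; break;
        case STORE_Y: y = 1; break;
        case LOAD_X:  *reg = x; break;
        case LOAD_Y:  *reg = y; break;
        }
    }
    out->seen[r1][r2] = true;
}

/* Try all 6 interleavings that preserve each thread's program order. */
outcomes_t enumerate(const op_t t1[2], const op_t t2[2])
{
    outcomes_t out = {{{false, false}, {false, false}}};
    for (unsigned mask = 0; mask < 16; mask++) {
        int ones = 0;
        for (int i = 0; i < 4; i++)
            ones += (int)((mask >> i) & 1u);
        if (ones == 2)
            run_one(t1, t2, mask, &out);
    }
    return out;
}

/* Thread 1: x = 1; r1 = y;   Thread 2: y = 1; r2 = x; */
outcomes_t original_program(void)
{
    const op_t t1[2] = { STORE_X, LOAD_Y };
    const op_t t2[2] = { STORE_Y, LOAD_X };
    return enumerate(t1, t2);
}

/* Thread 1: r1 = y; x = 1;   Thread 2: r2 = x; y = 1; */
outcomes_t reordered_program(void)
{
    const op_t t1[2] = { LOAD_Y, STORE_X };
    const op_t t2[2] = { LOAD_X, STORE_Y };
    return enumerate(t1, t2);
}
```

Calling original_program() reports (0,1), (1,0) and (1,1) as reachable but never (0,0). Calling reordered_program() reports (0,0), (0,1) and (1,0) but never (1,1).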

Why did this happen?
• The compiler didn't know about concurrency, so it performed optimizations assuming sequential execution
• Some of these don't work with concurrency!
• In fact, the hardware itself can also do this in an attempt to speed up execution, for example by moving loads ahead of unrelated stores

The Pthreads approach
No thread shall read or modify memory while another thread may be modifying it (such simultaneous access is called a race condition)
To restrict access, the programmer uses synchronization routines:
• pthread_mutex_lock()
• pthread_mutex_unlock()
• …
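As an illustration of the intended use of these routines, here is a minimal sketch in which two threads increment a shared counter, each increment wrapped in pthread_mutex_lock() and pthread_mutex_unlock(). The names counter, counter_lock, add_many and run_two_threads are hypothetical. The sketch assumes a POSIX system, and linking with -pthread may be needed.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static long counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Thread body: add 1 to the shared counter n times, under the lock. */
static void *add_many(void *arg)
{
    long n = *(long *)arg;
    for (long i = 0; i < n; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;                      /* only touched while locked */
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Start two threads that each add n, wait for both, return the total. */
long run_two_threads(long n)
{
    pthread_t a, b;
    int rc;

    counter = 0;
    rc = pthread_create(&a, NULL, add_many, &n);
    assert(rc == 0);
    rc = pthread_create(&b, NULL, add_many, &n);
    assert(rc == 0);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return counter;                     /* safe: both threads have exited */
}
```

Under the Pthreads rule, run_two_threads(100000) should return 200000, because no two threads ever access counter at the same time.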

The Pthreads approach
If the programmer uses the synchronization routines correctly to prevent race conditions, then they should have no issues.
But this isn't quite true...

Concurrent modification
Suppose x and y are both initially 0 and we have the following two threads:
    Thread 1: if (x == 1) ++y;
    Thread 2: if (y == 1) ++x;
Is there a data race in this program?

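Concurrent modification
What if our compiler modified our code a little?
    Thread 1: ++y; if (x != 1) --y;
    Thread 2: ++x; if (y != 1) --x;
Is there a data race in this program?
Now each thread writes unconditionally to a variable that the other thread reads, so there is. The transformation is valid for sequential code, but it introduces a race that the original program did not have.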
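The effect of the transformation can be sketched by replaying one particular interleaving by hand in ordinary sequential code. This is a simulation under the assumption that each statement executes as one step, with x and y starting at 0. The names state_t, run_original and run_transformed_interleaved are made up for the example.

```c
#include <assert.h>

typedef struct { int x, y; } state_t;

/* Original code: with x == y == 0, neither condition can ever hold,
   so no thread writes anything, whatever the interleaving. */
state_t run_original(void)
{
    state_t s = {0, 0};
    if (s.x == 1) ++s.y;        /* thread 1 */
    if (s.y == 1) ++s.x;        /* thread 2 */
    assert(s.x == 0 && s.y == 0);
    return s;
}

/* Transformed code, with the two threads' steps interleaved. */
state_t run_transformed_interleaved(void)
{
    state_t s = {0, 0};
    ++s.y;                      /* thread 1, step 1: speculative write */
    ++s.x;                      /* thread 2, step 1: speculative write */
    if (s.x != 1) --s.y;        /* thread 1, step 2: sees x == 1, keeps y */
    if (s.y != 1) --s.x;        /* thread 2, step 2: sees y == 1, keeps x */
    return s;
}
```

run_original() leaves x == 0 and y == 0, while run_transformed_interleaved() ends with x == 1 and y == 1, a result the source program could never produce.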

Adjacent data
Suppose we had the following structure definition:
    struct { int a:17; int b:15; } x;
There are probably no machines that have a 17-bit wide store, so if someone were to execute:
    x.a = 42;
it would probably be done like this:
    {
        tmp = x;          // Read both fields into a 32-bit variable
        tmp &= ~0x1ffff;  // Mask off old a
        tmp |= 42;        // Insert new a
        x = tmp;          // Overwrite all of x
    }
If another thread updates x.b between the read and the write of x, that update is lost, even if a and b are protected by different locks.
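The lost update can be sketched by packing the two fields into a uint32_t by hand and replaying the steps in one thread. The layout (a in the low 17 bits, b in the high 15) is an assumption, since real bit-field layout is implementation-defined, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define A_MASK  0x1ffffu        /* low 17 bits hold a (assumed layout) */
#define B_SHIFT 17              /* high 15 bits hold b */

uint32_t get_a(uint32_t w) { return w & A_MASK; }
uint32_t get_b(uint32_t w) { return w >> B_SHIFT; }

/* Thread 1 performs x.a = 42 as a read-modify-write of the whole word.
   Thread 2's store to b lands between thread 1's read and write. */
uint32_t lost_update_demo(void)
{
    uint32_t x = 0;                         /* a = 0, b = 0 */

    uint32_t tmp = x;                       /* thread 1: read both fields */
    tmp &= ~A_MASK;                         /* thread 1: mask off old a */
    tmp |= 42u;                             /* thread 1: insert new a */

    x = (x & A_MASK) | (7u << B_SHIFT);     /* thread 2: b = 7 */
    assert(get_b(x) == 7);

    x = tmp;                                /* thread 1: overwrite all of x */
    return x;
}
```

After lost_update_demo(), get_a() yields 42 but get_b() yields 0: thread 2's write of 7 has disappeared although neither thread touched the other's field in the source code.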

Adjacent data
Suppose we had the following structure definition:
    struct { char a; char b; char c; char d; char e; char f; char g; char h; } x;
where a is the only field that needs to be protected by a lock. If that was the case, some programmer might write the following code:
    x.b = 'b'; x.c = 'c'; x.d = 'd'; x.e = 'e'; x.f = 'f'; x.g = 'g'; x.h = 'h';
But a compiler might realize that it could just write all of the data at once as a 64-bit quantity (not exact syntax):
    x = 'hgfedcb\0' | x.a;
This reads and rewrites x.a without holding its lock, so a concurrent update to x.a can be lost.

Register Promotion
Suppose we had a global shared variable x, protected by a lock... but only conditionally, perhaps only if we had actually created other threads:
    for (...) {
        ...
        if (mt) pthread_mutex_lock(...);
        x = ... x ...
        if (mt) pthread_mutex_unlock(...);
    }
If the conditionals are rarely taken, the compiler might decide to promote x to a register r to increase performance:
    r = x;
    for (...) {
        ...
        if (mt) { x = r; pthread_mutex_lock(...); r = x; }
        r = ... r ...
        if (mt) { x = r; pthread_mutex_unlock(...); r = x; }
    }
    x = r;
Now x is read and written while the lock is not held, which introduces a race.
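A single-threaded sketch can show what the unlocked accesses to x cost. Here the loop body is assumed to be x = x + 1, mt is taken to be true, and a hypothetical other_thread_update() stands in for a second thread that adds 10 to x under the lock between our iterations. The lock calls are shown as comments because the simulation runs in one thread.

```c
#include <assert.h>

static int x;                   /* shared, nominally lock-protected */

/* Stand-in for another thread doing x += 10 while holding the lock. */
static void other_thread_update(void) { x += 10; }

/* Source program: every access to x happens inside the lock. */
int run_original(int iters)
{
    assert(iters >= 0);
    x = 0;
    for (int i = 0; i < iters; i++) {
        /* lock   */
        x = x + 1;
        /* unlock */
        other_thread_update();
    }
    return x;
}

/* After register promotion: x is copied to and from r around each
   lock and unlock, so x is written while the lock is not held. */
int run_promoted(int iters)
{
    assert(iters >= 0);
    x = 0;
    int r = x;
    for (int i = 0; i < iters; i++) {
        x = r; /* lock   */ r = x;  /* x = r clobbers the other thread */
        r = r + 1;
        x = r; /* unlock */ r = x;
        other_thread_update();
    }
    x = r;                          /* final unlocked write-back */
    return x;
}
```

With three iterations, run_original(3) returns 33, while run_promoted(3) returns 3: every update made by the other thread is overwritten by a stale register value.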

What does this mean?
Pthreads says that as long as we prevent race conditions with the synchronization functions, we will be fine.
But since our compiler doesn't know about threads, it might make optimizations that break our program, even though the source looks perfectly fine to us.
We can't reason about locks at the source level if the presence of race conditions depends on the compiler and the hardware.

Performance
So why are we running multiple threads? To (hopefully) get better performance out of our program.
But locking is expensive! An atomic update can cost a hundred or more processor cycles on some hardware, far more than an ordinary update.

Is synchronization always needed?
Consider the following Sieve of Eratosthenes implementation:
    for (my_prime = start; my_prime < 10000; ++my_prime)
        if (!get(my_prime)) {
            for (multiple = my_prime; multiple < 100000000; multiple += my_prime)
                if (!get(multiple)) set(multiple);
        }
What happens if we run this on multiple threads, with all of them accessing one shared data block?
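One way to see why the sieve tolerates unsynchronized sharing is that set() only ever turns an entry from false to true, so repeated or reordered work cannot undo anything. The sketch below is a scaled-down, single-threaded version (limits of 10 and 100 instead of 10000 and 100000000) in which each call to sieve_from() plays the role of one thread. As an assumption of this sketch, the inner loop starts at 2 * my_prime so that primes themselves stay unmarked and can be counted. The array composite and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define LIMIT 100               /* scaled down from 100000000 */
#define ROOT  10                /* scaled down from 10000 */

static bool composite[LIMIT];   /* the one shared data block */

static bool get(int i) { return composite[i]; }
static void set(int i) { composite[i] = true; }     /* only false -> true */

void sieve_reset(void) { memset(composite, 0, sizeof composite); }

/* The work one thread would do, beginning at its own start value. */
void sieve_from(int start)
{
    assert(start >= 2);
    for (int my_prime = start; my_prime < ROOT; ++my_prime)
        if (!get(my_prime)) {
            for (int multiple = 2 * my_prime; multiple < LIMIT;
                 multiple += my_prime)
                if (!get(multiple)) set(multiple);
        }
}

/* Count entries in [2, LIMIT) that were never marked composite. */
int count_primes(void)
{
    int n = 0;
    for (int i = 2; i < LIMIT; i++)
        if (!get(i)) n++;
    return n;
}
```

Running sieve_from(2) then sieve_from(6), or the same two calls in the opposite order, leaves count_primes() at 25 either way. A worker that sees a stale false only does redundant marking.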

The conclusions?
Sometimes you can gain large performance benefits without directly using atomic operations.
But if we use a library that disallows this (like Pthreads), we are throwing away this ability.
But if we are allowed to do it, then we need the compiler and hardware to somehow know about it and help us.

The conclusions?
So how do we get the compiler and hardware to help us?
We need the programming language itself to define a memory model, so that the programmer knows whether a program has races.
Only then can we reason about our programs.
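As one illustration of what a language-level memory model offers, C11 defines one together with <stdatomic.h>. The sketch below replays the first example (x = 1; r1 = y and y = 1; r2 = x) with atomic_int variables and the default sequentially consistent atomic_store() and atomic_load(), for which the language rules out r1 == 0 and r2 == 0. The function names are hypothetical, POSIX threads are assumed for thread creation, and linking with -pthread may be needed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int x, y;         /* shared, accessed only atomically */
static int r1, r2;              /* each written by one thread, read after join */

static void *thread1(void *arg)
{
    (void)arg;
    atomic_store(&x, 1);        /* sequentially consistent store */
    r1 = atomic_load(&y);       /* sequentially consistent load */
    return NULL;
}

static void *thread2(void *arg)
{
    (void)arg;
    atomic_store(&y, 1);
    r2 = atomic_load(&x);
    return NULL;
}

/* Run the two threads `trials` times; return 1 if the outcome
   r1 == 0 && r2 == 0 is ever observed, otherwise 0. */
int saw_both_zero(int trials)
{
    for (int i = 0; i < trials; i++) {
        pthread_t a, b;
        int rc;

        atomic_store(&x, 0);
        atomic_store(&y, 0);
        rc = pthread_create(&a, NULL, thread1, NULL);
        assert(rc == 0);
        rc = pthread_create(&b, NULL, thread2, NULL);
        assert(rc == 0);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r1 == 0 && r2 == 0)
            return 1;
    }
    return 0;
}
```

Because the compiler now knows that x and y are shared, it may not reorder the store and load in each thread, and it must emit whatever hardware fences are needed to keep the processor from doing so. saw_both_zero() should therefore return 0 for any number of trials.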
