
Parallel and Distributed Programming 236370 Spring 2001



  1. Parallel and Distributed Programming 236370, Spring 2001
  Course website: www.cs.technion.ac.il/236370
  Lecture: Thursday, 10:30
  Lecturer: Assaf Schuster. Reception hour: Thu 14:00-15:00, Room 626
  Frontal exercises: see course website.
  Teaching Assistants: Ran Wolff (in charge), Nili Efargan
  Exercise checking and grading policy: 3 programming home assignments in Java, 1 programming home assignment in MPI (speedups), midterm in the last lecture (June 19), Moed B in Moed A of exam (???). Relative weight: midterm 30%; you must pass the midterm for the exercise grades to count.

  2. Sources
  No textbook.
  MPI programming literature – find on the WWW and in the library.
  Java programming – find on the WWW or in the library. Doug Lea: "Concurrent Programming in Java", Addison-Wesley, 1996.
  Papers in the library.
  Other papers, listed in the transparencies.

  3. Planned Syllabus: see course site.

  4. Basic Paradigms
  Process = a unit of sequential instruction execution.
  Program = a collection of processes.
  Process communication:
  • Shared memory; at the language level we find:
    • Shared variables
    • Semaphores for synchronization
    • Mutual exclusion, critical code, monitors/locks
  • Message passing:
    • Local variables for each process
    • Send/receive parameters and data
    • Remote procedure call (Java's Remote Method Invocation)
    • Barrier synchronization
  • Many variants: Linda's tuple space, Ada's rendezvous, CSP's guarded execution
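  As a concrete taste of the shared-memory style in the course's language, here is a minimal Java sketch (illustrative, not course material): two threads communicate through a shared variable, with a lock providing mutual exclusion.

    // A minimal sketch of shared-memory communication in Java:
    // two threads increment a shared counter under mutual exclusion.
    public class SharedCounter {
        private int count = 0;                 // shared variable

        public synchronized void increment() { // monitor/lock guards the critical section
            count++;
        }

        public synchronized int get() {
            return count;
        }

        public static void main(String[] args) throws InterruptedException {
            SharedCounter c = new SharedCounter();
            Runnable worker = () -> { for (int k = 0; k < 100_000; k++) c.increment(); };
            Thread t1 = new Thread(worker), t2 = new Thread(worker);
            t1.start(); t2.start();
            t1.join(); t2.join();
            System.out.println(c.get());       // always 200000, thanks to mutual exclusion
        }
    }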

  5. Reality is Different from the Paradigm
  • In shared memory, reads and writes are not atomic, because of write queues and caching effects.
  • Message passing proceeds point to point, hop by hop and with packetization; there is no direct connection.
  • The OS should present to the user one of the simpler models, and the user may assume everything works as in the spec. More often than not, the implementation is buggy, or exposes details of a native view different from the spec. Sometimes the model is deliberately complicated to enhance performance and reduce communication – relaxed consistency.

  6. Common Types of Parallel Systems
  Ordered by communication efficiency (bandwidth + latency):
  1. Multi-threading on a uni-processor (your home PC)
  2. Multi-threading on a multi-processor (SMP)
  3. Tightly-coupled parallel computer (Compaq's Proliant, SGI's Origin 2000, IBM's MP/2, Cray's T3D)
  4. Distributed system (cluster)
  5. Internet computing (peer-to-peer)
  (Communication efficiency decreases down the list, while scalability and level of parallelism increase.)
  Traditionally: 1+2 are programmable using shared memory, 3+4 are programmable using message passing, and in 5 peer processes communicate with central control only. However, things change! Most importantly, recent systems in 3 move towards presenting a shared-memory interface to a physically distributed system. Is this an indication for the future?

  7. Execution Order
  [Diagram: instructions of P1 (x) and P2 (o) interleaved along a common time axis.]
  • Process execution is asynchronous: there is no global beat and no global clock. Each process has a different execution speed, which may change over time. For an observer, on the time axis, instruction execution is ordered in an execution order. Any order is legal. (Sometimes different processes may observe different global orders – TBD.)
  • The execution order of a single process is called its program order.

  8. Atomicity of Instruction Execution
  Consider:
    P1: INC(i)
    P2: INC(i)
  Expected result: i := i+2.
  But what if INC(i) compiles to three instructions:
    Read  Rx, i
    Add   Rx, 1
    Store Rx, i
  Then a possible order of execution is:
    Read  R1, i
    Read  R2, i
    Add   R1, 1
    Add   R2, 1
    Store R1, i
    Store R2, i
  with the result i := i+1.
  The atomicity model is important for answering the question: "Is my parallel program correct?"
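  The race is easy to reproduce in Java (an illustrative sketch, not course material): i++ compiles to exactly the read-add-store sequence above, so two unsynchronized threads lose updates.

    // Demonstrates the non-atomic INC(i) race: i++ is read-add-store.
    public class LostUpdate {
        static int i = 0;  // shared, unsynchronized on purpose

        public static void main(String[] args) throws InterruptedException {
            Runnable inc = () -> {
                for (int k = 0; k < 1_000_000; k++) i++;  // Read Rx,i / Add Rx,1 / Store Rx,i
            };
            Thread p1 = new Thread(inc), p2 = new Thread(inc);
            p1.start(); p2.start();
            p1.join(); p2.join();
            // Expected 2,000,000; typically prints less: increments interleave.
            System.out.println(i);
        }
    }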

  9. Correctness of Concurrent Programs
  Correctness is proven by means of invariants, or properties.
  Necessity: Recall that the speed of instruction execution varies in time. Hence, if a certain property is true for any program execution, then it is necessarily true for each and every execution order.
  Sufficiency: We will assume the other direction as well: if a property is true for any execution order, then it is true for the program. Sufficiency is not always true; it may fail to hold when "true" concurrency prevails. However, there is commonly a refinement of the model in which it holds (see the INC example above). The intuitive reason: there exists a software/hardware level at which instructions are ordered (say, when accessing a shared bus).

  10. Correctness cntd.
  Sufficiency implies a general method for proving correctness of parallel systems/programs: induction on all possible execution orders.
  There are a lot of execution orders: for p processes of n instructions each, about p^(np).
  With a little luck, the induction is not too complicated.
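  For a feel of the numbers (an illustrative sketch): the exact count of interleavings of p processes with n instructions each is the multinomial coefficient (np)! / (n!)^p, which is of the order of p^(np).

    import java.math.BigInteger;

    // Counts the interleavings of p processes with n instructions each:
    // (n*p)! / (n!)^p.  For p = 2, n = 3 this is 20; for p = 3, n = 10 it
    // already exceeds 5 * 10^12 -- hence "a lot of execution orders".
    public class OrderCount {
        static BigInteger fact(int m) {
            BigInteger f = BigInteger.ONE;
            for (int k = 2; k <= m; k++) f = f.multiply(BigInteger.valueOf(k));
            return f;
        }

        public static void main(String[] args) {
            int p = 3, n = 10;
            BigInteger count = fact(n * p).divide(fact(n).pow(p));
            System.out.println(count);
        }
    }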

  11. Program Properties – Safety
  Safety properties are kept throughout the computation – always true: something "bad" cannot happen, and if the property does not hold, we will know within a finite number of steps.
  Example: deadlock freedom. There is always a process that can execute another instruction (however, it does not necessarily execute it).
  Example: mutual exclusion. Two given code regions (in two different processes) are not allowed to execute concurrently.
  Example: if x>y holds, then x>y holds for the rest of the execution.
  However: mutual exclusion as above holds even if the program does not allow any of the processes to execute any of the code regions!

  12. Liveness Properties
  Liveness properties guarantee progress in the computation: something "good" must happen (within a finite number of steps).
  Example: no starvation. Any process that wishes to execute an instruction will eventually be able to do so.
  Example: the program/process eventually terminates.
  Example: one of the processes will enter the critical section. (Note the difference from deadlock freedom.)

  13. Fairness Properties
  Liveness properties give relatively weak guarantees of access to a shared resource. Fairness properties strengthen them:
  Weak fairness – if a process waits on a certain request, then eventually it will be granted.
  "Eventually" is not good enough for OSes and real-time systems, where response time counts.
  Strong fairness – if the process makes the request sufficiently frequently, then eventually it will be granted.
  Linear waiting – if a process makes the request, it will be granted before any other process is granted twice.
  FIFO – ... it will be granted before any other process that asked later.
  These are easy to implement in a centralized system. However, in a distributed system it is not clear what "before" or "later" mean.

  14. Mutual Exclusion
  N processes perform an infinite loop over an instruction sequence composed of a critical section and a non-critical section.
  Mutual exclusion property: instructions from the critical sections of two or more processes must not be interleaved in the (global observer's) execution order.
  [Diagram: executions of P1 (x x x ...) and P2 (o o o ...) along a time axis, with the critical sections marked in parentheses and never overlapping.]

  15. The Solution
  The solution is by way of additional instructions executed by every process that is about to enter or leave its critical section: the pre-protocol before entering, and the post-protocol after leaving.

  Loop
    Non_critical_section;
    Pre_protocol;
    Critical_section;
    Post_protocol;
  End_loop;

  16. A Solution Must Guarantee
  • A process cannot stop for an indefinite time in the critical section or in the protocols. The solution must ensure that such a stop in the non-critical section by one of the processes will not prevent the other processes from entering the critical section.
  • No deadlock. Several processes may be executing inside their pre-protocols; eventually, one of them will succeed in entering the critical section.
  • No starvation. If a process enters its pre-protocol with the intention of entering the critical section, it will eventually succeed.
  • No self exclusion. In the absence of other processes trying to enter the critical section, a single process will always succeed in doing so within a very short time.

  17. Solution try 1 – give the processes a token that decides whose turn it is
  (Note: atomic Read/Write is assumed.)

  Integer Turn = 1;

  P1: begin
    loop
      non_crit_1;
      loop exit when Turn = 1; end loop;
      crit_sec_1;
      Turn := 2;
    end loop;
  end P1;

  P2: begin
    loop
      non_crit_2;
      loop exit when Turn = 2; end loop;
      crit_sec_2;
      Turn := 1;
    end loop;
  end P2;

  Problem: strict alternation. If one process stops in its non-critical section, the other can enter the critical section at most once more, violating the requirements above.

  18. Solution try 2 – give each process a variable it can use to announce that it is in its critical section

  Integer C1=1, C2=1;

  P1: Loop
    non_crit_sec_1;
    loop exit when C2=1; end loop;
    C1 := 0;
    crit_sec_1;
    C1 := 1;
  End Loop;

  P2: Loop
    non_crit_sec_2;
    loop exit when C1=1; end loop;
    C2 := 0;
    crit_sec_2;
    C2 := 1;
  End Loop;

  Problem: no mutual exclusion. Execution example:
    P1 sees C2=1
    P2 sees C1=1
    P1 sets C1 := 0
    P2 sets C2 := 0
    P1 enters its critical section
    P2 enters its critical section

  19. Solution try 3 – set the announcing variable before the waiting loop

  Integer C1=1, C2=1;

  P1: Loop
    non_crit_sec_1;
    C1 := 0;
    loop exit when C2=1; end loop;
    crit_sec_1;
    C1 := 1;
  End Loop;

  P2: Loop
    non_crit_sec_2;
    C2 := 0;
    loop exit when C1=1; end loop;
    crit_sec_2;
    C2 := 1;
  End Loop;

  Problem: deadlock. Execution example:
    P1 sets C1 := 0
    P2 sets C2 := 0
    P1 checks C2 forever
    P2 checks C1 forever

  20. Solution try 4 – let the other process enter its critical section if we fail to do so

  Integer C1=1, C2=1;

  P1: Loop
    non_crit_sec_1;
    C1 := 0;
    loop exit when C2=1;
      C1 := 1;
      C1 := 0;
    end loop;
    crit_sec_1;
    C1 := 1;
  End Loop;

  P2: Loop
    non_crit_sec_2;
    C2 := 0;
    loop exit when C1=1;
      C2 := 1;
      C2 := 0;
    end loop;
    crit_sec_2;
    C2 := 1;
  End Loop;

  Can the other process enter between Ci := 1 and Ci := 0? Yes: between C1 := 1 and C1 := 0, P2 may complete a full "round". Problem: starvation.
  Problem: livelock – both processes may keep yielding to each other forever.

  21. Dekker's algorithm – give the processes a priority token that gives its holder the right of way when competing

  Integer C1=1, C2=1, Turn=1;

  P1: Loop
    non_crit_sec_1;
    C1 := 0;
    loop exit when C2=1;
      if Turn = 2 then
        C1 := 1;
        loop exit when Turn = 1; end loop;
        C1 := 0;
      end if;
    end loop;
    crit_sec_1;
    C1 := 1;
    Turn := 2;
  End Loop;

  P2: Loop
    non_crit_sec_2;
    C2 := 0;
    loop exit when C1=1;
      if Turn = 1 then
        C2 := 1;
        loop exit when Turn = 2; end loop;
        C2 := 0;
      end if;
    end loop;
    crit_sec_2;
    C2 := 1;
    Turn := 1;
  End Loop;

  • The algorithm is correct!
  • Suppose P1 is executing inside the "insisting loop":
    • If C2=0 then P1 knows P2 wants to enter its critical section.
    • If, in addition, Turn=2, then P1 gives the turn to P2 and waits for P2 to finish.
    • Clearly, while P1 does all this, P2 itself will not give up, because it is its turn.
  • All the characteristics of a valid solution hold.
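  A sketch of Dekker's algorithm in Java (illustrative, not from the slides). Note the caveat of slide 5: plain Java reads and writes are not guaranteed to be atomic and ordered, so the flags and the turn variable are declared volatile, which restores the atomic-read/write assumption the pseudocode relies on.

    // Dekker's algorithm for two threads with ids 0 and 1.
    // volatile is essential: without it the JVM/hardware may reorder
    // the flag and turn accesses, breaking mutual exclusion.
    public class Dekker {
        private volatile boolean wants0 = false, wants1 = false; // C1, C2 (inverted)
        private volatile int turn = 0;                           // the priority token

        public void lock(int id) {
            if (id == 0) {
                wants0 = true;
                while (wants1) {              // the "insisting loop"
                    if (turn == 1) {          // other side holds the token: back off
                        wants0 = false;
                        while (turn == 1) Thread.yield();
                        wants0 = true;
                    }
                }
            } else {
                wants1 = true;
                while (wants0) {
                    if (turn == 0) {
                        wants1 = false;
                        while (turn == 0) Thread.yield();
                        wants1 = true;
                    }
                }
            }
        }

        public void unlock(int id) {
            if (id == 0) wants0 = false; else wants1 = false;
            turn = 1 - id;                    // hand the token to the other thread
        }
    }

  The Java memory model guarantees a single total order over all volatile accesses, which is exactly the atomic-read/write note on slide 17.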

  22. Bakery Algorithm – mutual exclusion for N processes
  The idea is to have processes take tickets with numbers on them (just as at city hall, or at the health clinic). The other processes give the turn to the process holding the ticket with the minimal number (it got there first). If two tickets happen to carry the same number, the process with the minimal id enters.

  Shared arrays: array(1..N) of integer Choosing, Number;
  Process Pi performs (integer i := process id):

  Loop
    non_crit_sec_i;
    choosing(i) := 1;
    number(i) := 1 + max(number);
    choosing(i) := 0;
    for j in 1..N loop
      if j /= i then
        loop exit when choosing(j) = 0; end loop;
        loop exit when number(j) = 0
                    or number(i) < number(j)
                    or (number(i) = number(j) and i < j);
        end loop;
      end if;
    end loop;
    crit_sec_i;
    number(i) := 0;
  End loop;
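  A Java transcription of the bakery algorithm (an illustrative sketch, not course material). AtomicIntegerArray stands in for the shared arrays so that individual slots are read and written atomically with the required visibility; thread ids here are 0-based.

    import java.util.concurrent.atomic.AtomicIntegerArray;

    // Bakery algorithm for n threads, ids 0..n-1.
    public class Bakery {
        private final int n;
        private final AtomicIntegerArray choosing; // 1 while picking a ticket
        private final AtomicIntegerArray number;   // 0 = not competing

        public Bakery(int n) {
            this.n = n;
            choosing = new AtomicIntegerArray(n);
            number = new AtomicIntegerArray(n);
        }

        public void lock(int i) {
            choosing.set(i, 1);
            int max = 0;
            for (int j = 0; j < n; j++) max = Math.max(max, number.get(j));
            number.set(i, max + 1);                           // take the next ticket
            choosing.set(i, 0);
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                while (choosing.get(j) != 0) Thread.yield();  // wait until j has a ticket
                while (number.get(j) != 0 &&                  // j is competing, and holds
                       (number.get(j) < number.get(i) ||      // a smaller ticket, or the
                        (number.get(j) == number.get(i) && j < i))) { // same ticket, smaller id
                    Thread.yield();
                }
            }
        }

        public void unlock(int i) {
            number.set(i, 0);
        }
    }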

  23. Changing the rules of the game – increasing atomicity (load+store)
  C – shared variable; Bi – Pi's private variable.
  T&S (Test and Set), one atomic operation:  Bi := C; C := 1;
  C&S (Compare and Swap), one atomic operation:  if Bi /= C then tmp := C; C := Bi; Bi := tmp; end if;
  Such strong operations are usually supported by the underlying hardware/OS.

  Loop
    non_crit_sec_i;
    loop T&S(Bi); exit when Bi=0; end loop;
    crit_sec_i;
    C := 0;
  End loop;
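  Java exposes such strong operations through java.util.concurrent.atomic: getAndSet is a test-and-set and compareAndSet is a compare-and-swap. A minimal spinlock sketch along the lines of the slide's loop:

    import java.util.concurrent.atomic.AtomicBoolean;

    // A test-and-set spinlock built on an atomic read-modify-write primitive.
    public class SpinLock {
        private final AtomicBoolean locked = new AtomicBoolean(false); // the shared C

        public void lock() {
            // getAndSet is the atomic [load+store]: Bi := C; C := 1.
            while (locked.getAndSet(true)) {
                Thread.yield();  // Bi was 1: someone else holds the lock, retry
            }
        }

        public void unlock() {
            locked.set(false);   // C := 0
        }
    }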

  24. The Price of Atomic [load+store], or: Why Not Simply Always Use Strong Operations?
  [Diagram: Proc. 0 to Proc. 3, each with local cache and registers and an L2/L3 cache; ordinary loads/stores of the private B0, B2 stay local, while a T&S of the shared C must go through main memory.]
  The "set" of C must be seen immediately by all other processors, in case they execute competing code. Since communication between processors goes via the main memory, the operation needs to cut through the cache levels. Price: dozens to hundreds of clock cycles, and growing.

  25. Semaphores
  A semaphore is a special variable. After initialization, only two atomic operations are applicable:
  Busy-Wait Semaphore:
    P(S) = WAIT(S):: when S>0 then S := S-1
    V(S) = SIGNAL(S):: S := S+1
  Another definition – Blocked-Set Semaphore:
    WAIT(S):: if S>0 then S := S-1 else "wait on S"
    SIGNAL(S):: if there are processes waiting on S then let one of them proceed, else S := S+1
  NOTE: [load+store] is embedded in both WAIT and SIGNAL. Thus, mutual exclusion using semaphores is easy.
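  For comparison, Java's library semaphore (a blocked-queue semaphore when constructed as fair) makes the mutual-exclusion pattern a one-liner; a minimal sketch:

    import java.util.concurrent.Semaphore;

    // Mutual exclusion with a semaphore initialized to 1.
    public class SemMutex {
        // 'true' requests the fair (FIFO) variant; see the fairness slide.
        private static final Semaphore s = new Semaphore(1, true);

        static void criticalSection() throws InterruptedException {
            s.acquire();          // WAIT(S)
            try {
                // ... critical section ...
            } finally {
                s.release();      // SIGNAL(S)
            }
        }
    }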

  26. Semaphores cntd.
  Note: with blocked-set and busy-wait semaphores, starvation is possible.
  Blocked-Queue Semaphore: change the blocked-set semaphore definition so that blocked processes are released in FIFO order.
  Fair Semaphore: change the busy-wait semaphore definition so that if S>0 infinitely many times, then every process performing WAIT(S) will eventually be released.

  27. Binary Semaphores
  Replace S := S+1 by S := 1 in all the definitions. Note: the operations are still strong and expensive.
  "Implementing Semaphores by Binary Semaphores", Hans Barz, SIGPLAN Notices, vol. 18, Feb. 1983:
  S – the simulated (counting) semaphore; S1, S2 – binary semaphores; X – an integer variable.

  WAIT(S):
    wait(S2);
    wait(S1);
    X := X-1;
    if X>0 then signal(S2);   -- this check saves signal operations
    signal(S1);

  SIGNAL(S):
    wait(S1);
    X := X+1;
    if X=1 then signal(S2);   -- this check saves signal operations
    signal(S1);

  Invariant: outside the "atomic" regions wait(S1)...signal(S1), X>0 iff S2=1.
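  A Java transcription of Barz's construction (an illustrative sketch): Semaphore objects holding at most one permit stand in for the binary semaphores S1 and S2; the invariant above guarantees they indeed never exceed one permit.

    import java.util.concurrent.Semaphore;

    // Barz's counting semaphore built from two binary semaphores.
    public class BarzSemaphore {
        private final Semaphore s1 = new Semaphore(1); // protects X
        private final Semaphore s2;                    // 1 iff X > 0
        private int x;                                 // current count

        public BarzSemaphore(int initial) {
            x = initial;
            s2 = new Semaphore(initial > 0 ? 1 : 0);
        }

        public void waitS() throws InterruptedException {
            s2.acquire();
            s1.acquire();
            x--;
            if (x > 0) s2.release();  // permits remain: keep S2 open
            s1.release();
        }

        public void signalS() throws InterruptedException {
            s1.acquire();
            x++;
            if (x == 1) s2.release(); // first permit after zero: reopen S2
            s1.release();
        }
    }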

  28. Policy for Programming with Semaphores
  • Use semaphores as little as possible – these are strong operations!
  • Define the role of each semaphore by a fixed relation between the semaphore's value and "something" in the program. Examples:
    • Mutual exclusion: a process may enter the critical section iff S=1.
    • Bounded buffer (producers/consumers): S = number of free slots in the buffer.
  • Then: justify the necessity of each wait and signal with respect to the declared role of the semaphore. Do the same for the semaphore's initialization.
  • Make sure each wait is eventually released.

  29. Semaphores – a Software Engineering Problem
  Processes handling semaphores contain code related to the role those shared variables play in other processes. An error in semaphore use in any one place in the system manifests itself in other processes, at other times. It is extremely hard to identify the sources of such bugs.

  30. Monitors – C.A.R. Hoare, CACM, vol. 17, no. 10, Oct. 1974
  Idea: let's put all the code for handling the shared variables in one place. So, let's make something which is:
  • Object-oriented in programming style (a Simula class)
  • A monolithic monitor – a central core handling all requests
  • Each monitor has its own mission and private data
  • Only a single process can be inside a monitor at any point in time

  Monitor <name>
    (variables local to the monitor and global to the monitor's procedures)
    Procedure name1 (...)
    Procedure name2 (...)
    ...
    Begin
      ::: initialization of the monitor's local variables
    End.

  31. Condition Variables
  Each condition variable handles a set of waiting processes.
  wait(condvar) – the process always blocks and joins the set of processes waiting on condvar.
  signal(condvar) – one process from the set of those waiting on condvar is released; if the set is empty, nothing happens. (Since only one process is allowed inside the monitor, we shall assume signal is the last instruction before exiting the monitor.)
  nonempty(condvar) – returns True iff the waiting set is not empty.
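  Java offers both flavors: the implicit monitor of every object (synchronized, wait, notify) with a single anonymous condition, and explicit condition variables via java.util.concurrent.locks. A sketch of the latter (class and field names are illustrative): a ReentrantLock plays the monitor lock, await corresponds to wait(condvar) and signal to signal(condvar).

    import java.util.concurrent.locks.Condition;
    import java.util.concurrent.locks.ReentrantLock;

    // A monitor-style cell: consumers wait on a condition variable
    // until a producer deposits a value.
    public class Cell {
        private final ReentrantLock monitor = new ReentrantLock();
        private final Condition nonEmpty = monitor.newCondition();
        private Integer value = null;

        public void put(int v) {
            monitor.lock();
            try {
                value = v;
                nonEmpty.signal();          // release one waiter, if any
            } finally {
                monitor.unlock();
            }
        }

        public int take() throws InterruptedException {
            monitor.lock();
            try {
                while (value == null)       // re-check after waking (see slide 35)
                    nonEmpty.await();
                int v = value;
                value = null;
                return v;
            } finally {
                monitor.unlock();
            }
        }
    }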

  32. Barrier Synchronization – schedule 5 processes for concurrent execution

  Monitor example
    Integer count;
    Condition five;
    Procedure sync() {
      if count < 4 {
        count := count + 1;
        wait(five);
        signal(five);
      } else {
        count := 0;
        signal(five);
      }
    }

  Note this program is unfair (unless five is a FIFO condition): it allows a process to be released from waiting on five, run a looooong loop, acquire the monitor again, wait on five, and be released again, while other processes keep waiting on five.
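  A Java sketch of the same barrier (illustrative; it swaps the slide's signal cascade for notifyAll plus a generation counter, which also avoids the unfair re-entry described above):

    // A reusable barrier for N threads, in monitor style
    // (synchronized + wait/notifyAll on the implicit monitor).
    public class Barrier {
        private final int n;
        private int count = 0;
        private int generation = 0;  // stamps each "round" of the barrier

        public Barrier(int n) { this.n = n; }

        public synchronized void sync() throws InterruptedException {
            int gen = generation;
            if (++count < n) {
                while (gen == generation)    // re-check: spurious wakeups, early re-entry
                    wait();
            } else {
                count = 0;
                generation++;                // open the barrier for everyone waiting
                notifyAll();
            }
        }
    }

  The library class java.util.concurrent.CyclicBarrier packages this pattern.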

  33. Concurrent Readers or an Exclusive Writer

  Monitor readwrite;
    integer readers;
    boolean writing;
    condition okread, okwrite;

    Procedure startread {
      if (writing or nonempty(okwrite)) wait(okread);
      readers := readers + 1;
      signal(okread);
    }
    Procedure endread {
      readers := readers - 1;
      if (readers == 0) signal(okwrite);
    }
    Procedure startwrite {
      if (readers /= 0 or writing) wait(okwrite);
      writing := true;
    }
    Procedure endwrite {
      writing := false;
      if (nonempty(okread)) signal(okread) else signal(okwrite);
    }
  BeginMonitor
    readers := 0; writing := false;
  EndMonitor;

  The client processes:
    Procedure readproc {
      repeat
        M.startread;
        read the data;
        M.endread;
      forever
    }
    Procedure writeproc {
      repeat
        M.startwrite;
        write the data;
        M.endwrite;
      forever
    }
  Cobegin readproc; readproc; writeproc; ... Coend.

  34. The program in the previous slide does not work
  What if a reader goes to sleep on okread? Now a writer comes in and then exits while signaling okread, and another writer comes in. While the second writer is between startwrite and endwrite, the reader can acquire the monitor, and since this time it does not re-check "writing", it will enter its critical section together with the second writer.
  Fix: check "writing" also when re-entering after waiting on the condition variable (replace the "if" with a "while").

  35. In general...
  When re-acquiring the monitor after waiting on a condition variable, always make sure the conditions (the program state) remain as they were when the wait was performed. Either:
  • prove that this is always the case, or
  • check it when re-acquiring the monitor.
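  In Java this rule is the standard guarded-wait idiom: call wait only inside a while loop that re-tests the condition (the documentation of Object.wait recommends exactly this, in part because of spurious wakeups). A minimal sketch:

    // The guarded-wait idiom: re-test the condition after every wakeup.
    public class Guarded {
        private boolean ready = false;

        public synchronized void awaitReady() throws InterruptedException {
            while (!ready)        // "while", never "if": the state may have changed
                wait();           // releases the monitor, re-acquires it on wakeup
            // safe to proceed: ready holds and we own the monitor
        }

        public synchronized void setReady() {
            ready = true;
            notifyAll();
        }
    }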

  36. Recursive Monitor Calls
  The problem: when a procedure A.x in monitor A calls a procedure B.y in monitor B, should the process exit (release) A before entering B?
  Against releasing: from a software engineering point of view, it is important that when the process returns to A, the conditions at its exit persist. Also, exiting B (right back into A) would depend on succeeding to enter A again.
  For releasing: there is no activity in A while the process is in B – others could use the time. It may also prevent deadlock, if B has dependencies on actions happening in A.
  Current Java definition: no release.

  37. WAIT in Recursive Calls
  • If monitor A calls monitor B, and B waits, does this release the lock on A?
  • Current Java definition: no. Only the lock on B is released.
