COMP60611 Fundamentals of Parallel and Distributed Systems

COMP60611Fundamentals of Parallel and Distributed Systems Lecture 5 Parallel Programming: Languages – pthreads (and MPI) John Gurd, Graham Riley Centre for Novel Computing School of Computer Science University of Manchester

Overview • Parallel Programming Fundamentals • Different levels of programming… • Managing parallel work units and their interaction • The Unix Model ( http://www.unix.org ) • Processes and threads ---- memory mapping • Overview of pthreads - single address space (and MPI - multiple address spaces) • Summary

Extensions to C • We need specific programming constructs to define parallel computations. We shall use (sequential) C as a starting point. • In this lecture, we investigate extensions to C that allow the programmer to express parallel activity in the thread-based or data-sharing style and in the message passing style. • We approach this from two main directions: • Extensions that allow the programmer to create and manage threadsexplicitly and interact via shared memory. • Extensions that allow the programmer to manage processes explicitly and exchange data via messages.

Different Levels ofThread-based Programming • None of these schemes is fully implicit (i.e. automatic); unfortunately, autoparallelisation of C (or any other serial) programs is beyond the present state-of-the-art. Instead, different schemes offer increasing amounts of high-level assistance for the creation and management of parallel threads. • POSIX Parallel Threads Library • 'Bare-metal' approach --- the programmer is responsible for everything except the posix call implementations. • OpenMP API (a higher level alternative for threads) • Much functionality is provided, e.g. at the loop level --- the programmer is presented with a simpler picture, but the scope for losing performance through naivety increases.

How to Obtain Parallel Thread-based Activity • The general approach to developing a parallel code is the same in each scheme. The basic idea is to create a number of parallel threads and then find (relatively) independent units of work for each of them to execute. • Units of work come in two basic types which correspond to task- and data-parallelism: • Functionally different subprograms, each executed once; • Single subprogram, executed multiple times – with different data • In general, these forms of parallelism can be nested. • Each scheme relies on run-time support routines, provided as part of the operating system. It is important to know how memory (address space) is laid out at run-time. An example is given by the UNIX system, described on the next slide – this is similar to other operating systems.

The UNIX Model:Processes and Threads • There are two basic units: • A Process, the basic unit of resource. • A thread, the basic unit of execution. • The simplest process is one having a single thread of execution. • This corresponds well to our programming models. Code is shared by all threads in a process. The general situation is illustrated in the following slide. • (Note: the terminology used in other operating systems is dangerously ambiguous.)

Process-shared data Memory Map for UNIX Processes and Threads Thread-shared data

POSIX Threads and an example… • An IEEE standard for UNIX (like) systems (defined for C) • Standard ‘wrappers’ exist to support use from FORTRAN and other languages • A set of library routines (and a run-time system) to manage the explicit creation and destruction of threads, and to manage their interaction • Essentially, a pthread executes a user-defined function • Scheduling of work to threads is down to the user • Calls to pthread synchronisation routines manage the interaction between shared data • OpenMP implementations can be built on top of pthreads • with details hidden from user • See: https://computing.llnl.gov/tutorials/pthreads • Good overview; starts with a description of the relationship between processes and threads

Pthreads – simple example #include <pthread.h> #include <stdio.h> #define NUM_THREADS 5 int main (int argc, char *argv[]) { pthread_t threads[NUM_THREADS]; Int rc; long t; for (t=0; t<NUM_THREADS; t++) { printf ("In main: creating thread %ld\n", t); rc = pthread_create (&threads[t], NULL, PrintHello, (void *)t); if (rc) { printf ("ERROR; return code from pthread_create() is %d\n", rc); exit(-1); } } pthread_exit(NULL); }

void *PrintHello(void *threadid) { long tid; tid = (long)threadid; printf("Hello World! It's me, thread #%ld!\n", tid); pthread_exit(NULL); }

Synchronisation mechanisms • Locks and condition variables in pthreads • Example calls are shown on the following slides • Semphores (provided by the OS) • An early (Dijkstra, 1965) mechanism to control resource sharing in concurrent systems (providing mutual exclusion). • Pthread synchronisation routines are frequently implemented by OS-supported semaphores. For example, a binary semaphore is a lock. • More on these later…

Pthread locks pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER; void *func() { pthread_mutex_lock( &mutex1 ); counter++; printf("Counter value: %d\n",counter); pthread_mutex_unlock( &mutex1 ); } Note: using Semaphores, typical routines are: semaphore_wait(s), equivalent to pthread_mutex_lock() semaphore_signal(s), equivalent to pthread_mutex_unlock() Where s is a semaphore initialised to 1 (a binary semaphore)

Pthread condition synchronisation • Condition variables provide a “wait-notify” mechanism for threads. • For example, consider a producer-consumer model where one producer thread is served by one (or more) consumer threads. A consumer must check the shared state to see if there is anything for it to do. If there is nothing, the thread must wait until there is work. When the producer has work to be done, it updates the shared state and must then notify any consumers waiting. • Condition synchronisation involves a condition lock and a condition variable. • The combination of some shared data and some synchronisation mechanism (locks and condition variables, for example) is called a monitor. • More in later lectures on monitors.

Monitor Thread A Thread B data wait notify We refer to a thread enteringa monitor when a thread acquires the mutual exclusion lock associated with the monitor and exitingthe monitor when it releases the lock. wait - causes the thread to exit the monitor (release the lock), permitting other threads to enter the monitor. notify – causes one of any waiting threads to be allowed to enter the monitor.

A consumer thread takes the condition lock before examining the relevant shared state. • If there is work, the thread can modify the state (i.e. ‘get’ the work) and then release the lock. • If there is no work, the thread calls pthread_cond_wait() with the names of the condition lock and condition variable. This routine releases the condition lock and ensures that the thread waits (inside the wait routine) until it is woken up by a notification from the producer. • When the producer thread has work to be done, it takes the condition lock in order to modify the shared state (to make the work available to consumers). • The producer then calls pthread_cond_broadcast, with the name of the condition variable, to wake up all waiting threads and then releases the condition lock. • If the producer knows that only a single thread is waiting it may call pthread_cond_signal which wakes only a single thread.

On being awakened, a consumer thread (which is still inside the pthread_cond_wait routine) will immediately try to take the condition lock so that, on return from pthread_cond_wait, it owns the lock. • The consumer thread should then re-check the shared state to confirm that there is some work to be done (why?). • Typically, the consumer will call pthread_cond_wait from within a ‘while’ loop which checks the shared state. Thus, if there is no work, the consumer thread will simply call pthread_cond_wait again. If there is work, the consumer will modify the shared state to ‘get’ the work and then release the condition lock.

An example • Consider a producer thread which increments a shared counter (up to the value of 5) before passing on the work of further incrementing the shared counter (up to a value of 10) to a consumer thread. • In the following C/pthreads example, functionCount2 is the producer and functionCount1 is the consumer.

Pthread condition variables int count = 0; #define COUNT_DONE 10 #define COUNT_HALT1 5 pthread_mutex_t condition_mutex = PTHREAD_MUTEX_INITIALIZER; pthread_cond_t condition_cond = PTHREAD_COND_INITIALIZER; main() { pthread_t thread1, thread2; pthread_create( &thread1, NULL, &functionCount1, NULL); pthread_create( &thread2, NULL, &functionCount2, NULL); pthread_join( thread1, NULL); pthread_join( thread2, NULL); exit(0); }

Producer void *functionCount2() { for(;;) { pthread_mutex_lock( &condition_mutex ); if( count < COUNT_HALT1 ) { count++; pthread_cond_signal( &condition_cond ); /* OR pthread_cond_broadcast( &condition_cond ); */ printf("Counter value functionCount2: %d\n",count); } pthread_mutex_unlock( &condition_mutex ); if (count >= COUNT_DONE) return (NULL); } }

Consumer void *functionCount1() { for(;;) { pthread_mutex_lock( &condition_mutex ); while( count < COUNT_HALT1 ) { pthread_cond_wait( &condition_cond, &condition_mutex ); printf("Counter value functionCount1 on waking: %d\n",count); } count++; printf("Counter value functionCount1: %d\n",count); pthread_mutex_unlock( &condition_mutex ); if(count >= COUNT_DONE) return(NULL); } }

Behaviour • The producer continually cycles taking the lock (entering the monitor) and increments the counter if it is less than 5. • After each increment, the producer signals a waiting thread (the consumer) before exiting the monitor by releasing the lock. • Either the producer or consumer will enter the monitor next. • If it is the producer, it carries on as before. • Note that once it has done its work, it simply releases the lock. • If it is the consumer, and it isn’t time for it to take over (i.e. the state is not 5), it simply announces the value it saw, and goes back to waiting. • Eventually the consumer takes over incrementing the counter until it reaches the final value, at which point both it and the producer thread terminate. • The example code is in the directory CondVar in the teaching domain directory ~griley/COMP60611/labs/extras

Implicit Thread Control: OpenMP • OpenMP is the de-facto standard API for writing shared memory parallel applications in C, C++, and Fortran. • OpenMP consists of: • Compiler directives, • Run time routines, • Environment variables. • Specification maintained by the OpenMP Architecture Review Board (http://www.openmp.org). • Version 3.0 was released May 2008.

OpenMP Overview:How do threads interact? • OpenMP is a multi-threading, shared address model. • Threads communicate by sharing variables. • Unintended sharing of data causes race conditions: • race condition – the program’s outcome changes as the threads are scheduled differently. • To control race conditions: • Use synchronization to protect data conflicts. • But synchronization can be expensive

Advantages of OpenMP • Good performance and scalability. • if you do it right .... • De-facto and mature standard. • An OpenMP program is portable. • Supported by a large number of compilers. • Simplifies the writing of multi-threaded programs in Fortran, C and C++. • Allows the program to be parallelized incrementally. • OpenMP is well suited for multicore architectures.

Summary of Threads approach • Programming multiple threads with explicit accesses to shared data requires attention to detail and much low level control. • This can be alleviated by providing the programmer with a high-level data-sharing model and leaving low-level problems to the implementation. Higher level abstractions make programming increasingly easier, but they provide more opportunity for performance to be lost, due to unforseen actions by the compiler or run-time system. • Experience shows that it is somewhat easier to program using threads, compared to other approaches we shall study, although it is still non-trivial.

Overview of MPI • Processes versus threads • Shared and distributed memory systems • Process-based Programming Fundamentals • Managing Processes • Passing Messages • The Message-Passing Interface (MPI) • See MPI forum on web • Message-Passing Overheads • Summary

Parallel Computing with Multiple Processes • For anyone familiar with concurrent execution of processes under a conventional uni-processor operating system, such as Unix, the notion of parallel computing with multiple (single-thread) processes is quite natural. • Each process is essentially a stand-alone sequential program, with some form of interprocess communication mechanism provided to permit controlled interaction with other processes. • The simplest form of interprocess communication mechanism is via input and output files. However, this does not allow very 'rich' forms of interaction. • Hence, more complex varieties of message-passing have evolved, e.g.: • UNIX pipes, sockets, MPI, Java RMI…

Message-Passing • The process-based approach to parallel programming is the main alternative to the thread-based approach. Again, we use (sequential) C as a starting point. • We will look at an extension to C which allow the programmer to express parallel activity in the message-passing style. • Extensions that allow the programmer to send and receive messages explicitly (to exchange program data and synchronise). We illustrate this using the Message-Passing Interface (MPI) standard library. • MPI can also be used from FORTRAN and C++ (research versions for, for example, Java are available too).

Why MPI? • Shared memory computers tend to be limited in size (numbers of processors) and the cost of hardware to maintain cache coherency across an interconnect grows rapidly with system size. So the ‘biggest’ computers do not support shared memory. • Distributed memory systems are relatively cheap to build. They are essentially multiple copies of ‘independent’ processors connected together. Interconnects for these are relatively simple and cheap (e.g. based on routers). For example: • Networks of workstations, NoWs, using Ethernet or Myrinet • Supercomputers with specialised router-based interconnects: HECToR, a Cray XT4 using fast SeaStar routers – more than 22,000 cores (5664 quad-core Opterons). Upgraded in 2010. • Most of the Top100 computers in the world are DM systems and MPI is the de-facto standard for writing parallel programs for them (at least in the scientific world). See: www.top500.org.

Managing Processes • Remember: in our UNIX view, a process is a (virtual) address space with one or more threads (program counter plus stack). Processes are independent! • A key requirement is to be able to create a new process and know its (unique) identity. With process identities known to one another in this way, it is feasible within any process to construct a message and direct it specifically to some other process. • MPI has the concept of process ‘groups’ through communicators, e.g. MPI_COMM_WORLD. • Finally, there needs to be a mechanism for causing a process to ‘die’ and allow the MPI ‘group’ to ‘die’.

Passing Messages • The fundamental requirements for passing a message between two processes are: • The sending process knows how to direct a message to an appropriate receiving process. • In MPI this is achieved explicitly by naming a process id or through the use of a communicator (naming a group of processes). • There are several models of interacting and synchronising processes in MPI. We shall keep it simple and look only at basic sending and receiving: • Where recvs block but sends do not (implying buffering of the data) • MPI also supports synchronous (sends block until a receive is posted) and asynchronous sends and recvs (using poling)

The MPI C Library • Typical MPI scientific codes use processes that are identical, thus implementing the Single-Program-Multiple-Data (SPMD) scheme. For example: mpiexec –n 4 a.out ! Runs a 4 process single SPMD job • MPI also supports the Multiple-Program-Multiple-Data (MPMD) scheme, in which different code is executed in each process: mpiexec –n 3 a.out : -n 4 b.out : -n 6 c.out ! An MPMD job • Chapter 8 in Foster's book is a good source for additional MPI information. See also, LLNL tutorials.

MPI Fundamentals • Processes are grouped together; they are numbered within a group using contiguous integers, starting from 0. Messages are passed using the send (MPI_SEND) and receive (MPI_RECV) library subroutine calls (many other forms exist!) • A message send has the general form: MPI_Send(sbuf,icount,itype,idest,itag,icomm,ierr) • A send may block or not – depends on the MPI implementation’s use of buffering. (MPI_Ssend is a guaranteed blocking send) • Programs should not assume buffering of sends! Can lead to deadlock (see later example). • A message receive has the general form: MPI_Recv(rbuf,icount,itype,isrce,itag,icomm,istat,ierr) • The receiving process blocks until a message of the appropriate kind becomes available. The buffer starting at rbuf has to be guaranteed to be large enough to hold icount elements. The istat parameter shows how many elements actually arrived, where from, etc.

MPI Fundamentals • There are four other core library functions, illustrated in the following slides (which uses C syntax) • The following slides show the MPI basics plus the skeleton on an n-body MPI application in the SPMD-style.

#include “mpi.h” /* include file of compile-time constants needed for MPI library calls */ /* main program */ main (int argc, char *argv[]) { /* call to initialise this process - called once only per process */ ierr = MPI_Init(&argc, &argv); /* find the number of processes */ MPI_Comm_size(MPI_COMM_WORLD, &np); /* find the id (number) of this process */ MPI_Comm_rank(MPI_COMM_WORLD, &myid); /* print a “Hello world” message from this process */ fprintf (“I am %d of %d processes!”, myid, nprocs); /* shut down this process - last thing a process should do MPI_Finalize(); }

#include “mpi.h” /* include file */ main(int argc, char *argv[]) { /* main program */ int myid, np, ierr, lnbr, rnbr; Real x[300], buff[300], forces[300]; MPI_Status status; ierr = MPI_Init(&argc, &argv); /* initialize */ if (ierr != MPI_SUCCESS) { /* check return code */ fprintf(stderr,”MPI initialisation error\n”); exit(1); }

MPI_Comm_size(MPI_COMM_WORLD, &np); /* nprocs */ MPI_Comm_rank(MPI_COMM_WORLD, &myid); /* my process id */ lnbr = (myid+np-1)%np; /* id of left neighbour */ rnbr = (myid+1)%np; /* id of right nbr */ Initialize(x, buff, forces); for (i=0; i<np-1; i++) { /* circulate messages */ /* Note: assumes sends do not block! What if they do?*/ MPI_Send(buff, 300, MPI_FLOAT, rnbr, 0, MPI_COMM_WORLD); MPI_Recv(buff, 300, MPI_FLOAT, lnbr, 0, MPI_COMM_WORLD, &status); update_forces(x, buff, forces); } Print _forces(myid, forces); /* print result */ MPI_Finalize(); /* shutdown */ }

Other MPI Facilities • The tag parameter is used to match up an input message with a specific expected kind. If the kind of message is immaterial, MPI_ANY_TAG will match with anything. • There are also constructs for: global 'barrier' synchronisation (MPI_Barrier); transfer of data, including one-to-many 'broadcast' (MPI_Bcast) and 'scatter' (MPI_Scatter), and many-to-one 'gather' (MPI_Gather); and 'reduction' operators (MPI_Reduce and MPI_All_reduce). • A reduction has the general form: MPI_Reduce(src,result,icnt,ityp,op,iroot,icomm,ierr) where op is the operator, typ is the element type, and root is the number of the process that will receive the reduced result. All processes in the group receive the same result when MPI_All_reduce is used. • There are many other features, but these are too numerous to be studied further here.

MPI – pros and cons • MPI is the de-facto standard for programming large supercomputers because the current trend is only to build distributed memory machines. • The vast majority of current DM machines are built out of multicore processors • Mixed mode programming with MPI ‘outside’ and pthreads (or OpenMP) ‘inside’ is possible… • MPI forces the programmer to face up to the distributed nature of machines – is this a good thing? • MPI solutions tend to be more scalable than ptrhread (or OpenMP) solutions • (OpenMP is somewhat easier to use…)

Summary of MPI • Process-based programming using a library such as MPI for explicit passing of messages requires attention to detail and much low level description of activities. • Ultimately, the same underlying problems of parallelism emerge, regardless of whether the shared memory (e.g. pthreads or OpenMP) or distributed memory (e.g. MPI) programming approach is used.

Summary • Pthreads exploit parallelism by exploiting multiple threads within a single process • A single address space model • MPI exploits parallelism between processes and supports the explicit exchange of messages between processes • A multiple address space model • Note that MPI and pthreads can be nested! • Multiple (p)threads can execute inside each (MPI) process • An approach that appears to match the hierarchical architecture of modern computers (i.e. muticore processors in a distributed memory machine, e.g. HeCToR)

COMP60611 Fundamentals of Parallel and Distributed Systems