
COMP60621 Designing for Parallelism



  1. COMP60621 Designing for Parallelism
     Lecture 2 Parallel Programming: Language extensions – pthreads (and MPI)
     Introduction to Interference
     John Gurd, Graham Riley
     Centre for Novel Computing, School of Computer Science, University of Manchester

  2. Overview • Parallel Programming Fundamentals • Different levels of programming… • Managing parallel work units and their interaction • The Unix Model (http://www.unix.org) • Processes and threads --- memory mapping • Overview of pthreads - single address space • Overview of MPI - multiple address spaces • Summary

  3. Extensions to C • We need specific programming constructs to define parallel computations. We shall use (sequential) C as a starting point. • In this lecture, we investigate extensions to C that allow the programmer to express parallel activity in the thread-based or data-sharing style and in the message-passing style. • We approach this from two main directions: • Extensions that allow the programmer to create and manage threads explicitly and interact via shared memory. • Extensions that allow the programmer to manage processes explicitly and exchange data via messages.

  4. Different Levels of Thread-based Programming • None of these schemes is fully implicit (i.e. automatic); unfortunately, autoparallelisation of C (or any other serial) programs is beyond the present state-of-the-art. Instead, different schemes offer increasing amounts of high-level assistance for the creation and management of parallel threads. • POSIX Parallel Threads Library • 'Bare-metal' approach --- the programmer is responsible for everything except the POSIX call implementations. • OpenMP API (a higher-level alternative for threads) • Much functionality is provided, e.g. at the loop level --- the programmer is presented with a simpler picture, but the scope for losing performance through naivety increases.
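To illustrate the loop-level assistance that OpenMP provides, here is a minimal sketch (not taken from the lecture slides; the array names are illustrative) in which the programmer only marks a loop as parallelisable and the compiler and run-time system create and manage the threads:

      #include <stdio.h>

      #define N 1000000

      int main(void)
      {
         static double a[N], b[N];

         /* OpenMP creates a team of threads and divides the loop iterations
            among them; there is no explicit thread management in the source. */
         #pragma omp parallel for
         for (int i = 0; i < N; i++) {
            a[i] = 2.0 * b[i];
         }

         printf("a[0] = %f\n", a[0]);
         return 0;
      }

Compare this with the explicit pthread_create/pthread_join calls needed on the following slides for even a trivial computation.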

  5. How to Obtain Parallel Thread-based Activity • The general approach to developing a parallel code is the same in each scheme. The basic idea is to create a number of parallel threads and then find (relatively) independent units of work for each of them to execute. • Units of work come in two basic types which correspond to task- and data-parallelism: • Functionally different subprograms, each executed once; • Single subprogram, executed multiple times – with different data • In general, these forms of parallelism can be nested. • Each scheme relies on run-time support routines, provided as part of the operating system. It is important to know how memory (address space) is laid out at run-time. An example is given by the UNIX system, described on the next slide – this is similar to other operating systems.

  6. The UNIX Model: Processes and Threads • There are two basic units: • A process, the basic unit of resource. • A thread, the basic unit of execution. • The simplest process is one having a single thread of execution. • This corresponds well to our programming models. Code is shared by all threads in a process. The general situation is illustrated in the following slide. • (Note: the terminology used in other operating systems is dangerously ambiguous.)

  7. Memory Map for UNIX Processes and Threads [Figure: the layout of a process's address space, showing data shared by all threads within the process, and data shared between processes, e.g. mapped in via mmap()]

  8. POSIX Threads and an example… • An IEEE standard for UNIX-like systems (defined for C) • Standard ‘wrappers’ exist to support use from FORTRAN and other languages • A set of library routines (and a run-time system) to manage the explicit creation and destruction of threads, and to manage their interaction • Essentially, a pthread executes a user-defined function • Scheduling of work to threads is down to the user • Calls to pthread synchronisation routines manage the interaction of threads via shared data • OpenMP implementations can be built on top of pthreads • with details hidden from the user • See: https://computing.llnl.gov/tutorials/pthreads • Good overview; starts with a description of the relationship between processes and threads

  9. Pthreads – simple example

      #include <pthread.h>
      #include <stdio.h>
      #include <stdlib.h>

      #define NUM_THREADS 5

      void *PrintHello(void *threadid);   /* defined on the next slide */

      int main (int argc, char *argv[])
      {
         pthread_t threads[NUM_THREADS];
         int rc;
         long t;
         for (t=0; t<NUM_THREADS; t++) {
            printf("In main: creating thread %ld\n", t);
            rc = pthread_create(&threads[t], NULL, PrintHello, (void *)t);
            if (rc) {
               printf("ERROR; return code from pthread_create() is %d\n", rc);
               exit(-1);
            }
         }
         pthread_exit(NULL);
      }

  10. void *PrintHello(void *threadid)
      {
         long tid;
         tid = (long)threadid;
         printf("Hello World! It's me, thread #%ld!\n", tid);
         pthread_exit(NULL);
      }
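To try the example, a typical build-and-run sequence on a UNIX-like system is shown below (the source file name hello.c is an assumption):

      gcc -o hello hello.c -pthread
      ./hello

Note that the order in which the five greeting lines appear can differ from run to run, because the operating system is free to schedule the threads in any order.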

  11. Synchronisation mechanisms • Needed to manage interference between threads modifying shared data. • Need to provide mutual exclusion • Example to come… • Locks and condition variables in pthreads • Example calls are shown on the following slides • Semaphores (provided by the OS) • An early (Dijkstra, 1965) mechanism to control resource sharing in concurrent systems (provide mutual exclusion). • Pthread synchronisation routines are frequently implemented by OS-supported semaphores. For example, a binary semaphore is a lock. • More on these later…

  12. The Ornamental Garden problem - Interference People enter an ornamental garden through either of two turnstiles. Management wish to know how many are in the garden at any time. The concurrent program consists of two concurrent threads and a shared counter variable, value. From Magee & Kramer

  13. Ornamental garden program The people count (value) and the turnstile threads are created by the Garden program as follows:

      #include <pthread.h>
      #include <stdio.h>
      #include <stdlib.h>

      #define GARDEN_MAX 20      /* each turnstile admits 20 people (see slide 16) */

      void *turnstyle(void *arg);   /* defined on the next slide */

      int value = 0;

      int main()
      {
         pthread_t thread1, thread2;
         pthread_create( &thread1, NULL, &turnstyle, (void*)0 );   // East
         pthread_create( &thread2, NULL, &turnstyle, (void*)1 );   // West
         pthread_join( thread1, NULL );
         pthread_join( thread2, NULL );
         printf("Number of people on exit = %d\n", value);
         exit(0);
      }

  14. Turnstyle function

      void *turnstyle(void *arg)
      {
         long id;
         int arrive;
         id = (long) arg;
         for (arrive = 0; arrive < GARDEN_MAX; arrive++) {
            value++;                     /* unprotected update of the shared counter */
         }
         printf("Turnstyle %ld completed\n", id);
         return NULL;
      }

  15. Running this code • Note, we do not consider whether the threads are running on the same processor core (multitasking) or on different cores (true parallelism). • The point is that there is exploitation of concurrency (multiple threads) and the behaviour of the concurrent program is independent of its deployment onto actual hardware. • We want to design a correct implementation regardless of the deployment.

  16. Graphic of possible outcome After the East and West turnstile threads have each incremented the people count 20 times, the garden people counter is not the sum of the counts displayed. Counter increments have been lost. Why? Magee & Kramer

  17. Concurrent processes! The turnstyle threads for east and west may be executing the code for the increment 'at the same time'. [Figure: the east and west threads, each with its own program counter (PC), sharing the increment code: read value, then write value + 1] Without some form of locking to ensure mutual exclusion, writes can be lost and the wrong total computed.

  18. Semaphores • Note: using Semaphores, typical routines are: • semaphore_wait(s), equivalent to pthread_mutex_lock() • semaphore_signal(s), equivalent to pthread_mutex_unlock() • Where s is a semaphore initialised to 1 (a binary semaphore) • We will see semaphores used in lab exercise 1
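As a concrete illustration (a minimal sketch, not taken from the original slides), the lost updates in the turnstyle function can be prevented by protecting the increment with a pthread mutex; the global value and GARDEN_MAX are as defined on the earlier slides:

      pthread_mutex_t value_lock = PTHREAD_MUTEX_INITIALIZER;

      void *turnstyle(void *arg)
      {
         long id = (long) arg;
         int arrive;
         for (arrive = 0; arrive < GARDEN_MAX; arrive++) {
            pthread_mutex_lock(&value_lock);     /* enter the critical section */
            value++;                             /* only one thread updates at a time */
            pthread_mutex_unlock(&value_lock);   /* leave the critical section */
         }
         printf("Turnstyle %ld completed\n", id);
         return NULL;
      }

With the lock in place the final count is always 2 * GARDEN_MAX, at the cost of serialising the two threads around the increment.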

  19. Summary of Threads approach • Programming multiple threads with explicit accesses to shared data requires attention to detail and much low-level control. • This can be alleviated by providing the programmer with a high-level data-sharing model and leaving low-level problems to the implementation (e.g. OpenMP). Higher-level abstractions make programming increasingly easier, but they provide more opportunity for performance to be lost, due to unforeseen actions by the compiler or run-time system. • Experience shows that it is somewhat easier to program using threads, compared to other approaches we shall study, although it is still non-trivial.

  20. Overview of MPI • This is review material from COMP60611 • Processes (no shared data, so message passing) versus threads (shared data) • Process-based Programming Fundamentals • Managing Processes • Passing Messages • The Message-Passing Interface (MPI) • See MPI forum on web • Summary

  21. Parallel Computing with Multiple Processes • For anyone familiar with concurrent execution of processes under a conventional uni-processor operating system, such as Unix, the notion of parallel computing with multiple (single-threaded) processes is quite natural. • Each process is essentially a stand-alone sequential program, with some form of interprocess communication mechanism provided to permit controlled interaction with other processes. • The simplest form of interprocess communication mechanism is via input and output files. However, this does not allow very 'rich' forms of interaction. • Hence, more complex varieties of message-passing have evolved, e.g.: • UNIX pipes, sockets, MPI, Java RMI…

  22. Message-Passing • The process-based approach to parallel programming is the main alternative to the thread-based approach. Again, we use (sequential) C as a starting point. • We will look at an extension to C which allows the programmer to express parallel activity in the message-passing style. • Extensions that allow the programmer to send and receive messages explicitly (to exchange program data and synchronise). We illustrate this using the Message-Passing Interface (MPI) standard library. • MPI can also be used from FORTRAN and C++ (research versions exist for other languages too, for example Java).

  23. Why MPI? • Shared memory computers tend to be limited in size (numbers of processors) and the cost of hardware to maintain cache coherency across an interconnect grows rapidly with system size. So the ‘biggest’ computers do not support shared memory. • Distributed memory (DM) systems are relatively cheap to build. They are essentially multiple copies of ‘independent’ processors connected together. Interconnects for these are relatively simple and cheap (e.g. based on routers). For example: • Networks of workstations, NoWs, using Ethernet or Myrinet • Supercomputers with specialised router-based interconnects: HECToR, a Cray XT4 using fast SeaStar routers – more than 22,000 cores (5664 quad-core Opterons). Upgraded in 2010. • Most of the Top 100 computers in the world are DM systems and MPI is the de-facto standard for writing parallel programs for them (at least in the scientific world). See: www.top500.org.

  24. Managing Processes • Remember: in our UNIX view, a process is a (virtual) address space with one or more threads (program counter plus stack). Processes are independent! • A key requirement is to be able to create a new process and know its (unique) identity. With process identities known to one another in this way, it is feasible within any process to construct a message and direct it specifically to some other process. • MPI has the concept of process ‘groups’ through communicators, e.g. MPI_COMM_WORLD. • Finally, there needs to be a mechanism for causing a process to ‘die’ and allow the MPI ‘group’ to ‘die’.

  25. Passing Messages • The fundamental requirements for passing a message between two processes are: • The sending process knows how to direct a message to an appropriate receiving process. • In MPI this is achieved explicitly by naming a process id or through the use of a communicator (naming a group of processes). • There are several models of interacting and synchronising processes in MPI. We shall keep it simple and look only at basic sending and receiving: • Where recvs block but sends do not (implying buffering of the data) • MPI also supports synchronous sends (which block until a receive is posted) and asynchronous, non-blocking sends and recvs (completed later by polling or waiting)
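As an illustration of the non-blocking (asynchronous) style just mentioned, here is a minimal sketch (not from the original slides; the buffers and neighbour ids are assumed to be set up as in the later n-body example). The calls start the transfers and return immediately, and MPI_Wait is used later to complete them:

      MPI_Request sreq, rreq;
      MPI_Status  status;

      /* start the transfers; both calls return immediately */
      MPI_Isend(sbuf, 300, MPI_FLOAT, rnbr, 0, MPI_COMM_WORLD, &sreq);
      MPI_Irecv(rbuf, 300, MPI_FLOAT, lnbr, 0, MPI_COMM_WORLD, &rreq);

      /* ... useful computation can overlap with the communication ... */

      /* block until each transfer has completed */
      MPI_Wait(&sreq, &status);
      MPI_Wait(&rreq, &status);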

  26. The MPI C Library • Typical MPI scientific codes use processes that are identical, thus implementing the Single-Program-Multiple-Data (SPMD) scheme. For example: mpiexec -n 4 a.out ! Runs a single 4-process SPMD job • MPI also supports the Multiple-Program-Multiple-Data (MPMD) scheme, in which different code is executed in each process: mpiexec -n 3 a.out : -n 4 b.out : -n 6 c.out ! An MPMD job • Chapter 8 in Foster's book is a good source for additional MPI information. See also, LLNL tutorials.
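On most installations such programs are built with the MPI C compiler wrapper mpicc; a typical sequence (the source file name nbody.c is an assumption) is:

      mpicc -o a.out nbody.c      # compile and link against the MPI library
      mpiexec -n 4 ./a.out        # launch 4 copies as an SPMD job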

  27. MPI Fundamentals • Processes are grouped together; they are numbered within a group using contiguous integers, starting from 0. Messages are passed using the send (MPI_Send) and receive (MPI_Recv) library calls (many other forms exist!) • In C, a message send has the general form: ierr = MPI_Send(sbuf, icount, itype, idest, itag, icomm) • A send may block or not – depends on the MPI implementation’s use of buffering. (MPI_Ssend is a guaranteed blocking send) • Programs should not assume buffering of sends! Can lead to deadlock (see later example). • A message receive has the general form: ierr = MPI_Recv(rbuf, icount, itype, isrce, itag, icomm, &istat) • The receiving process blocks until a message of the appropriate kind becomes available. The buffer starting at rbuf has to be guaranteed to be large enough to hold icount elements. The istat (status) parameter records where the message came from, its tag and (via MPI_Get_count) how many elements actually arrived.
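A minimal point-to-point exchange (a sketch, not from the original slides) makes these forms concrete: process 0 sends ten floats to process 1, which blocks until they arrive.

      float data[10];
      MPI_Status status;
      int myid;

      MPI_Comm_rank(MPI_COMM_WORLD, &myid);
      if (myid == 0) {
         /* send 10 floats to process 1 with tag 99 */
         MPI_Send(data, 10, MPI_FLOAT, 1, 99, MPI_COMM_WORLD);
      } else if (myid == 1) {
         /* block until a matching message arrives from process 0 */
         MPI_Recv(data, 10, MPI_FLOAT, 0, 99, MPI_COMM_WORLD, &status);
      }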

  28. MPI Fundamentals • There are four other core library functions, illustrated in the following slides (which use C syntax) • The following slides show the MPI basics plus the skeleton of an n-body MPI application in the SPMD style.

  29. #include "mpi.h"   /* include file of compile-time constants needed for MPI library calls */
      #include <stdio.h>

      /* main program */
      int main (int argc, char *argv[])
      {
         int ierr, np, myid;

         /* initialise this process - called once only per process */
         ierr = MPI_Init(&argc, &argv);

         /* find the number of processes */
         MPI_Comm_size(MPI_COMM_WORLD, &np);

         /* find the id (number) of this process */
         MPI_Comm_rank(MPI_COMM_WORLD, &myid);

         /* print a "Hello world" message from this process */
         printf("I am %d of %d processes!\n", myid, np);

         /* shut down this process - last thing a process should do */
         MPI_Finalize();
         return 0;
      }

  30. #include "mpi.h"   /* include file */
      #include <stdio.h>
      #include <stdlib.h>

      int main(int argc, char *argv[])          /* main program */
      {
         int myid, np, ierr, lnbr, rnbr, i;
         float x[300], buff[300], forces[300];
         MPI_Status status;

         ierr = MPI_Init(&argc, &argv);         /* initialize */
         if (ierr != MPI_SUCCESS) {             /* check return code */
            fprintf(stderr, "MPI initialisation error\n");
            exit(1);
         }

  31.    MPI_Comm_size(MPI_COMM_WORLD, &np);      /* nprocs */
         MPI_Comm_rank(MPI_COMM_WORLD, &myid);    /* my process id */
         lnbr = (myid+np-1)%np;                   /* id of left neighbour */
         rnbr = (myid+1)%np;                      /* id of right neighbour */
         Initialize(x, buff, forces);
         for (i=0; i<np-1; i++) {                 /* circulate messages */
            /* Note: assumes sends do not block! What if they do? */
            MPI_Send(buff, 300, MPI_FLOAT, rnbr, 0, MPI_COMM_WORLD);
            MPI_Recv(buff, 300, MPI_FLOAT, lnbr, 0, MPI_COMM_WORLD, &status);
            update_forces(x, buff, forces);
         }
         Print_forces(myid, forces);              /* print result */
         MPI_Finalize();                          /* shutdown */
      }
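To answer the question in the comment: if the MPI implementation does not buffer the sends, every process blocks in MPI_Send and the ring deadlocks. One portable fix (a sketch, not part of the original slides) replaces the send/receive pair with the combined call MPI_Sendrecv_replace, which lets the library manage the overlap safely:

      for (i=0; i<np-1; i++) {                   /* circulate messages */
         /* send buff to the right neighbour and receive the next block from
            the left neighbour into the same buffer, without relying on the
            implementation buffering the send */
         MPI_Sendrecv_replace(buff, 300, MPI_FLOAT,
                              rnbr, 0,            /* destination and send tag */
                              lnbr, 0,            /* source and receive tag */
                              MPI_COMM_WORLD, &status);
         update_forces(x, buff, forces);
      }

An alternative is to make even-numbered processes send first and odd-numbered processes receive first, so that every send always has a matching receive posted.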

  32. Other MPI Facilities • The tag parameter is used to match up an input message with a specific expected kind. If the kind of message is immaterial, MPI_ANY_TAG will match with anything. • There are also constructs for: global 'barrier' synchronisation (MPI_Barrier); transfer of data, including one-to-many 'broadcast' (MPI_Bcast) and 'scatter' (MPI_Scatter), and many-to-one 'gather' (MPI_Gather); and 'reduction' operators (MPI_Reduce and MPI_Allreduce). • A reduction has the general form: ierr = MPI_Reduce(src, result, icnt, ityp, op, iroot, icomm) where op is the operator, ityp is the element type, and iroot is the number of the process that will receive the reduced result. All processes in the group receive the same result when MPI_Allreduce is used. • There are many other features, but these are too numerous to be studied further here.
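For example (a sketch with assumed variable names), summing one float held by every process onto process 0, or onto all processes:

      float local_sum, global_sum;

      /* combine local_sum from every process; only process 0 receives the result */
      MPI_Reduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

      /* as above, but every process in the communicator receives the result */
      MPI_Allreduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);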

  33. MPI – pros and cons • MPI is the de-facto standard for programming large supercomputers because the current trend is only to build distributed memory machines. • The vast majority of current DM machines are built out of multicore processors • Mixed-mode programming with MPI ‘outside’ and pthreads (or OpenMP) ‘inside’ is possible (see the sketch below) • MPI forces the programmer to face up to the distributed nature of machines – is this a good thing? • MPI solutions tend to be more scalable than pthread (or OpenMP) solutions • (OpenMP is somewhat easier to use…)
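In such mixed-mode codes the MPI library must be told at start-up that threads will be used. A minimal sketch (an assumption, not from the original slides) using the 'funneled' level, where only the main thread makes MPI calls, looks like this:

      int provided;

      /* request MPI_THREAD_FUNNELED: threads exist inside the process, but
         only the thread that called MPI_Init_thread makes MPI calls */
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      if (provided < MPI_THREAD_FUNNELED) {
         fprintf(stderr, "MPI library does not support the requested thread level\n");
         MPI_Abort(MPI_COMM_WORLD, 1);
      }
      /* ... create pthreads (or an OpenMP parallel region) inside each MPI process ... */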

  34. Summary of MPI • Process-based programming using a library such as MPI for explicit passing of messages requires attention to detail and much low level description of activities. • Ultimately, the same underlying problems of parallelism emerge, regardless of whether the shared memory (e.g. pthreads or OpenMP) or distributed memory (e.g. MPI) programming approach is used.

  35. Summary • Pthreads exploit parallelism by exploiting multiple threads within a single process • A single address space model • MPI exploits parallelism between processes and supports the explicit exchange of messages between processes • A multiple address space model • Note that MPI and pthreads can be nested! • Multiple (p)threads can execute inside each (MPI) process • An approach that appears to match the hierarchical architecture of modern computers (i.e. multicore processors in a distributed memory machine, e.g. HECToR)
