Presentation Transcript


  1. Parallel Computing 3: Models of Parallel Computations. Ondřej Jakl, Institute of Geonics, Academy of Sci. of the CR

  2. Outline of the lecture • Aspects of practical parallel programming • Parallel programming models • Data parallel • High Performance Fortran • Shared variables/memory • compiler’s support: automatic/assisted parallelization • OpenMP • thread libraries • Message passing

  3. Parallel programming (1) • Primary goal: maximization of performance • specific approaches are expected to be more efficient than universal ones • considerable diversity in parallel hardware • techniques/tools are much more dependent on the target platform than in sequential programming • understanding the hardware makes it easier to get high performance out of programs • back to the era of assembly programming? • On the contrary, standard/portable/universal methods increase the productivity in software development and maintenance • a trade-off between the two goals

  4. Parallel programming (2) • Parallel programs are more difficult to write and debug than sequential ones • parallel algorithms can generally be qualitatively different from the corresponding sequential ones • merely changing the form of the code may not be enough • several new classes of potential software bugs (e.g. race conditions) • difficult debugging • issues of scalability

  5. General approaches • Special programming language supporting concurrency • theoretically advantageous, in practice not very popular • ex.: Ada, Occam, Sisal, etc. (there are dozens of designs) • language extensions: CC++, Fortran M, etc. • Universal programming language (C, Fortran, ...) with a parallelizing compiler • autodetection of parallelism in the sequential code • easier for shared memory, limited efficiency • a matter of the future? (despite 30 years of intense research) • ex.: Forge90 for Fortran (1992), some standard compilers • Universal programming language plus a library of external parallelizing functions • the mainstream nowadays • ex.: PVM (Parallel Virtual Machine), MPI (Message Passing Interface), Pthreads, among others

  6. Parallel programming models • A parallel programming model is a set of software technologies to express parallel algorithms and match applications with the underlying parallel systems [Wikipedia] • Considered models: • data parallel [just introductory info in this course] • shared variables/memory [related to the OpenMP lecture in part II of the course] • message passing [continued in the next lecture (MPI)]

  7. Data parallel model

  8. Hardware requirements • Assumed underlying hardware: multicomputer or multiprocessor • originally associated with SIMD machines such as CM-200 • multiple processing elements perform the same operation on multiple data simultaneously • array processors [Wikipedia]

  9. Data parallel model • Based on the concept of applying the same operation (e.g. “add 1 to every array element”) to the elements of a data ensemble in parallel • a set of tasks operates collectively on the same data structure (usually an array) – each task on a different partition • On multicomputers the data structure is split up and resides as “chunks” in the local memory of each task • On multiprocessors, all tasks may have access to the data structure through global memory • The tasks are loosely synchronized • at the beginning and end of the parallel operations • SPMD execution model
  Fortran 90 fragment: the array statement A = A+1 is split between two tasks:
    real A(100)
    ! Task 1:
    do i = 1, 50
      A(i) = A(i)+1
    enddo
    ! Task 2:
    do i = 51, 100
      A(i) = A(i)+1
    enddo

  10. Characteristics • Higher-level parallel programming • data distribution and communication done by compiler • transfer low-level details from programmer to compiler • compiler converts the program into standard code with calls to a message passing library (MPI usually); all message passing is done invisibly to the programmer • Ease of use • simple to write, debug and maintain • no explicit message passing • single-threaded control (no spawn, fork, etc.) • Restricted flexibility and control • only suitable for certain applications • data in large arrays • similar independent operations on each element • naturally load-balanced • harder to get top performance • reliant on good compilers

  11. High Performance Fortran • The best known representative of data parallel programming languages • HPF version 1.0 in 1993 (extends Fortran 90), version 2.0 in 1997 • Extensions to Fortran 90 to support the data parallel model, including • directives to tell the compiler how to distribute data • DISTRIBUTE, ALIGN directives • ignored as comments in serial Fortran compilers • mathematical operations on array-valued arguments • reduction operations on arrays • FORALL construct • assertions that can improve optimization of generated code • INDEPENDENT directive • additional intrinsics and library routines • Available e.g. in the Portland Group PGI Workstation package • http://www.pgroup.com/products/pgiworkstation.htm • Nowadays not frequently used

  12. HPF data mapping example
  REAL A(12, 12)                           ! declaration
  REAL B(16, 16)                           ! of two arrays
  !HPF$ TEMPLATE T(16, 16)                 ! and a template
  !HPF$ ALIGN B WITH T                     ! align B with T
  !HPF$ ALIGN A(i, j) WITH T(i+2, j+2)     ! align A with T, shifted by 2 in each dimension
  !HPF$ PROCESSORS P(2, 2)                 ! declare a 2x2 arrangement of processors
  !HPF$ DISTRIBUTE T(BLOCK, BLOCK) ONTO P  ! distribute the template (and the aligned arrays) onto P
  [Mozdren 2010]

  13. Data parallel in MATLAB • Parallel MATLAB (the MathWorks): Parallel Computing Toolbox • plus the Distributed Computing Server for larger parallel environments • released in 2004; increasing popularity • Some features consistent with the data parallel model • codistributed arrays: arrays partitioned into segments, each of which resides in the workspace of a different task • allow handling larger data sets than in a single MATLAB session • support for more than 150 MATLAB functions (e.g. finding eigenvalues) • used in a very similar way as regular arrays • parallel FOR loop: loop iterations without enforcing their particular ordering • distributes loop iterations over a set of tasks • iterations must be independent of each other
  parfor i = (1:nsteps)
    x = i * step;
    s = s + (4 / (1 + x^2));
  end

  14. Shared variables model

  15. Hardware requirements • Assumed underlying hardware: multiprocessor • collection of processors that share common memory • interconnection fabric (bus, crossbar) supporting a single address space • Not applicable to multicomputers • but: Intel Cluster OpenMP • Easier to apply than message passing • allows incremental parallelization • Based on the notion of threads [after Wilkinson 2004]

  16. Thread vs. process (1) [Figure: a process containing code, heap, files and interrupt routines, with two threads, each having its own stack and instruction pointer (IP)]

  17. Thread vs. process (2) • A thread (“lightweight” process) differs from a (“heavyweight”) process: • all threads in a process share the same memory space • each thread has a thread-private area for its local variables • e.g. stack • threads can work on shared data structures • threads can communicate with each other via the shared data • Threads were originally not targeted at technical or HPC computing • low level, task (rather than data) parallelism • Details of the thread/process relationship are very OS dependent [Figure: the same process diagram as on the previous slide]

  18. Thread communication • A parallel application generates, when appropriate, a set of cooperating threads • usually one per processor • distinguished by enumeration • Shared memory provides the means to exchange data among threads • shared data can be accessed by all threads • no message passing necessary [Figure: Thread 1 stores its private value my_a = 23 into the shared variable sh_a; Thread 2 reads sh_a and computes its private my_a = sh_a + 1 = 24]

  19. Thread synchronization • Threads execute their programs asynchronously • Writes and reads are always nonblocking • Accessing shared data needs careful control • some mechanisms are needed to ensure that the actions occur in the correct order • e.g. a write of A in thread 1 must occur before its read in thread 2 • Most common synchronization constructs [see the sketch below]: • master section: a section of code executed by one thread only • e.g. initialisation, writing a file • barrier: all threads must arrive at a barrier before any thread can proceed past it • e.g. delimiting phases of computation (e.g. a timestep) • critical section: only one thread at a time can enter a section of code • e.g. modification of shared variables • This need for explicit synchronization makes shared-variables programming error-prone
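  A minimal OpenMP sketch of the three constructs in C (not from the original slides; the shared variable and the work done are illustrative only):
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
      double shared_sum = 0.0;
      #pragma omp parallel
      {
        /* master section: executed by the master thread only */
        #pragma omp master
        printf("running with %d threads\n", omp_get_num_threads());

        /* barrier: no thread proceeds until all threads have arrived here */
        #pragma omp barrier

        /* critical section: one thread at a time updates the shared variable */
        #pragma omp critical
        shared_sum += omp_get_thread_num();
      }
      printf("sum of thread numbers: %f\n", shared_sum);
      return 0;
    }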

  20. Accessing shared data • Consider two threads, each of which is to add 1 to a shared data item X, e.g. X = 10: • read X • compute X+1 • write X back • If step 1 is performed at the same time by both threads, the result will be 11 (instead of the expected 12) • Race condition: two or more threads (processes) are reading or writing shared data, and the result depends on who runs precisely when • X = X+1 must be an atomic operation • Can be ensured by mechanisms of mutual exclusion [see the sketch below] • e.g. critical section, mutex, lock, semaphore, monitor [Wilkinson2004]
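  A minimal Pthreads sketch of such mutual exclusion (illustrative code, not from the slides; it mirrors the lock/unlock pattern used in the PI examples later):
    #include <pthread.h>
    #include <stdio.h>

    int X = 10;                                  /* shared data item */
    pthread_mutex_t x_mutex = PTHREAD_MUTEX_INITIALIZER;

    void *increment(void *arg) {
      pthread_mutex_lock(&x_mutex);              /* enter the critical section */
      X = X + 1;                                 /* the read-modify-write is now protected */
      pthread_mutex_unlock(&x_mutex);            /* leave the critical section */
      return NULL;
    }

    int main(void) {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, increment, NULL);
      pthread_create(&t2, NULL, increment, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      printf("X = %d\n", X);                     /* always 12, never 11 */
      return 0;
    }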

  21. Fork/Join parallelism • Initially only the master thread is active • executes sequential code • Basic operations: • fork: the master thread creates / awakens additional threads to execute in a parallel region • join: at the end of the parallel region the created threads die / are suspended • Dynamic thread creation • the number of active threads changes during execution • fork is not an expensive operation • A sequential program is a special / trivial case of a shared-memory parallel program [see the sketch below] [Quinn 2004]
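  The fork/join pattern sketched in OpenMP (illustrative only; the printed thread counts assume the default runtime behaviour):
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
      printf("before: %d thread(s)\n", omp_get_num_threads());   /* only the master: 1 */

      #pragma omp parallel      /* fork: a team of threads is created / awakened */
      printf("in parallel region: thread %d of %d\n",
             omp_get_thread_num(), omp_get_num_threads());
                                /* join: implicit barrier, the extra threads are suspended */
      printf("after: %d thread(s)\n", omp_get_num_threads());    /* only the master: 1 */
      return 0;
    }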

  22. Computer realization • Compiler’s support: • automatic parallelization • assisted parallelization • OpenMP • Thread libraries: • POSIX threads, Windows threads [next slides]

  23. Automatic parallelization • The code is instrumented automatically by the compiler • according to the compilation flags and/or environment variables • Parallelizes independent loops only • processed by the prescribed number of parallel threads • Usually provided by Fortran compilers for multiprocessors • as a rule proprietary solutions • Simple and sometimes fairly efficient • Applicable to programs with a simple structure • Ex.: • XL Fortran (IBM, AIX): -qsmp=auto option, XLSMPOPTS environment variable (the number of threads) • Fortran (SUN, Solaris): -autopar flag, PARALLEL environment variable • PGI C (Portland Group, Linux): -Mconcur flag

  24. Assisted parallelization • The programmer provides the compiler with additional information by adding compiler directives • special lines of source code with meaning only to a compiler that understands them • in the form of stylized Fortran comments or #pragma in C • ignored by nonparallelizing compilers • Assertive and prescriptive directives [next slides] • Diverse formats of the parallelizing directives, but similar capabilities – hence a standard was required

  25. Assertive directives • Hints that state facts that the compiler might not guess from the code itself • Evaluation is context dependent • Ex.: XL Fortran (IBM, AIX) • no dependencies (the references in the loop do not overlap, parallelization possible): !SMP$ ASSERT (NODEPS) • trip count (average number of iterations of the loop; helps to decide whether to unroll or parallelize the loop): !SMP$ ASSERT (INTERCNT(100))

  26. Prescriptive directives • Instructions for the parallelizing compiler, which it must obey • clauses may specify additional information • A means for manual parallelization • Ex.: XL Fortran (IBM, AIX) • parallel region: defines a block of code that can be executed by a team of threads concurrently • parallel loop: makes it possible to specify which loops the compiler should parallelize • Besides directives, additional constructs within the base language to express parallelism can be introduced • e.g. the forall statement in Fortran 95
  !SMP$ PARALLEL <clauses>
    <block>
  !SMP$ END PARALLEL

  !SMP$ PARALLEL DO <clauses>
    <do loop>
  !SMP$ END PARALLEL DO

  27. OpenMP • API for writing portable multithreaded applications based on the shared variables model • master thread spawns a team of threads as needed • relatively high level (compared to thread libraries) • A standard developed by the OpenMP Architecture Review Board • http://www.openmp.org • first specification in 1997 • A set of compiler directives and library routines • Language interfaces for Fortran, C and C++ • OpenMP-like interfaces for other languages (e.g. Java) • Parallelism can be added incrementally • i.e. the sequential program evolves into a parallel program • single source code for both the sequential and parallel versions • OpenMP compilers available on most platforms (Unix, Windows, etc.) [More in a special lecture]

  28. Thread libraries • Collection of routines to create, manage, and coordinate threads • Main representatives: • POSIX threads (Pthreads), • Windows threads (Windows (Win32) API) • Explicit threading not primarily intended for parallel programming • low level, quite complex coding

  29. Example: PI calculation • Calculation of π by numerical integration of F(x) = 4/(1+x²) over [0, 1]; the value of this integral is π • Numerical integration based on the rectangle (midpoint) method:
  set n (number of strips)
  for each strip
    calculate the height y of the strip (rectangle) at its midpoint
    add y to the result S
  endfor
  multiply S by the width of the strips
  print result

  30. PI in Windows threads (1)
  /* Pi, Win32 API */
  #include <windows.h>
  #include <stdio.h>                /* for printf in part (2) */
  #define NUM_THREADS 2
  HANDLE thread_handles[NUM_THREADS];
  CRITICAL_SECTION hUpdateMutex;
  static long num_steps = 100000;
  double step, global_sum = 0.0;

  void Pi (void *arg)
  {
    int i, start;
    double x, sum = 0.0;

    start = *(int *)arg;
    step = 1.0 / (double)num_steps;
    for (i = start; i <= num_steps; i = i + NUM_THREADS) {
      x = (i - 0.5) * step;
      sum = sum + 4.0 / (1.0 + x * x);
    }
    EnterCriticalSection(&hUpdateMutex);
    global_sum += sum;
    LeaveCriticalSection(&hUpdateMutex);
  }

  31. PI in Windows threads (2)
  void main ()
  {
    double pi;
    int i;
    DWORD threadID;
    int threadArg[NUM_THREADS];

    for (i = 0; i < NUM_THREADS; i++)
      threadArg[i] = i + 1;
    InitializeCriticalSection(&hUpdateMutex);
    for (i = 0; i < NUM_THREADS; i++) {
      thread_handles[i] = CreateThread(0, 0, (LPTHREAD_START_ROUTINE)Pi,
                                       &threadArg[i], 0, &threadID);
    }
    WaitForMultipleObjects(NUM_THREADS, thread_handles, TRUE, INFINITE);
    pi = global_sum * step;
    printf(" pi is %f \n", pi);
  }

  32. PI in POSIX threads (1)
  /* Pi, pthreads library */
  #define _REENTRANT
  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>
  #define NUM_THREADS 2
  pthread_t thread_handles[NUM_THREADS];
  pthread_mutex_t hUpdateMutex;
  pthread_attr_t attr;
  static long num_steps = 100000;
  double step, global_sum = 0.0;

  void* Pi (void *arg)
  {
    int i, start;
    double x, sum = 0.0;

    start = *(int *)arg;
    step = 1.0 / (double)num_steps;
    for (i = start; i <= num_steps; i = i + NUM_THREADS) {
      x = (i - 0.5) * step;
      sum = sum + 4.0 / (1.0 + x * x);
    }
    pthread_mutex_lock(&hUpdateMutex);
    global_sum += sum;
    pthread_mutex_unlock(&hUpdateMutex);
    return NULL;                    /* the thread function returns a void pointer */
  }

  33. PI in POSIX threads (2)
  void main ()
  {
    double pi;
    int i;
    int retval;
    pthread_t threadID;
    int threadArg[NUM_THREADS];

    pthread_attr_init(&attr);
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
    pthread_mutex_init(&hUpdateMutex, NULL);
    for (i = 0; i < NUM_THREADS; i++)
      threadArg[i] = i + 1;
    for (i = 0; i < NUM_THREADS; i++) {
      retval = pthread_create(&threadID, NULL, Pi, &threadArg[i]);
      thread_handles[i] = threadID;
    }
    for (i = 0; i < NUM_THREADS; i++) {
      retval = pthread_join(thread_handles[i], NULL);
    }
    pi = global_sum * step;
    printf(" pi is %.10f \n", pi);
  }

  34. PI in OpenMP (1)
  /* Pi, OpenMP, using parallel for and reduction */
  #include <stdio.h>
  #include <stdlib.h>
  #include <omp.h>
  #define NUM_THREADS 2
  static long num_steps = 1000000;
  double step;

  void main ()
  {
    int i;
    double x, pi, sum = 0.0;

    step = 1.0 / (double)num_steps;
    omp_set_num_threads(NUM_THREADS);

  35. PI in OpenMP (2)
    #pragma omp parallel for reduction(+:sum) private(x)
    for (i = 1; i <= num_steps; i++) {
      x = (i - 0.5) * step;
      sum += 4.0 / (1.0 + x*x);
    }
    pi = sum * step;
    printf("Pi is %.10f \n", pi);
  }
  NB: Programs such as the PI calculation are likely to be parallelized successfully through automatic parallelization as well

  36. Message passing model

  37. Hardware requirements • Assumed underlying hardware: multicomputer • collection of processors, each with its own local memory • interconnection network supporting message transfer between every pair of processors • Supported by all (parallel) architectures – the most general model • naturally fits multicomputers • easily implemented on multiprocessors • Complete control: data distribution and communication • May not be easy to apply – the sequential-to-parallel transformation requires major effort • one giant step rather than many tiny steps • message passing = “assembler” of parallel computing [Quinn2004]

  38. Message passing • A parallel application generates (next slide) a set of cooperating processes • process = instance of a running program • usually one per processor • distinguished by a unique ID number • rank (MPI), tid (PVM), etc. • To solve a problem, processes alternately perform computations and exchange messages • basic operations: send, receive • no shared memory space necessary • Messages transport the contents of variables of one process to variables of another process • Message passing also has a synchronization function [Figure: Process 1 executes send(&x, 2), Process 2 executes recv(&y, 1); the value of x is transferred into y; after Wilkinson 2004] • [a minimal MPI sketch of this pattern follows]
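  A minimal MPI sketch of the send/receive pair above (illustrative C code; MPI itself is covered in the next lecture):
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
      int rank;
      double x = 3.14, y = 0.0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* unique ID of this process */

      if (rank == 0)                           /* plays "process 1" in the figure */
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
      else if (rank == 1) {                    /* plays "process 2" in the figure */
        MPI_Recv(&y, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received y = %f\n", y);
      }

      MPI_Finalize();
      return 0;
    }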

  39. Process creation • Static process creation • fixed number of processes in time • specified before the execution (e.g. on the command line) • usually the processes follow the same code, but their control paths through the code can differ – depending on the ID • SPMD (Single Program Multiple Data) model • one master process (ID 0) – several slave processes [see the sketch below] • Dynamic process creation • varying number of processes in time • just one process at the beginning • processes can create (destroy) other processes: the spawn operation • rather expensive! • the processes often differ in code • MPMD (Multiple Program Multiple Data) model [Figure: Process 1 starts Process 2 with spawn(); after Wilkinson 2004]
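  The SPMD master/slave idea, sketched in MPI (illustrative; the two work functions are hypothetical placeholders, not part of any library):
    #include <stdio.h>
    #include <mpi.h>

    /* hypothetical work routines, just to show the rank-based branching */
    void master_work(int nprocs) { printf("master: coordinating %d processes\n", nprocs); }
    void slave_work(int rank)    { printf("slave %d: computing my partition\n", rank); }

    int main(int argc, char *argv[]) {
      int rank, nprocs;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's ID */
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* number of processes, fixed at start-up */

      if (rank == 0)
        master_work(nprocs);                   /* master: ID 0 */
      else
        slave_work(rank);                      /* slaves: IDs 1 .. nprocs-1, same program */

      MPI_Finalize();
      return 0;
    }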

  40. Point-to-point communication • Exactly two processes are involved • One process (sender / source) sends a message and another process (receiver / destination) receives it • active participation of processes on both sides is usually required • two-sided communication • In general, the source and destination processes operate asynchronously • the source may complete sending a message long before the destination gets around to receiving it • the destination may initiate receiving a message that has not yet been sent • The order of messages is guaranteed (they do not overtake each other) • Examples of technical issues: • handling several messages waiting to be received • sending complex data structures • using message buffers • send and receive routines – blocking vs. nonblocking [next slide]

  41. (Non-)blocking & (a-)synchronous • Blocking operation: only returns (from the subroutine call) when the operation has completed • ex.: sending fax on a standard machine • Nonblocking operation: returns immediately, the operation need not be completed yet, other work may be performed in the meantime • the completion of the operation can/must be tested • ex.: sending fax on a machine with memory • Synchronous send: does not complete until the message has been received • provides (synchronizing) info about the message delivery • ex.: sending fax (on a standard machine) • Asynchronous send: completes as soon as the message is on its way • sender only knows when the message has left • ex.: sending a letter
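  The blocking vs. nonblocking distinction, sketched in MPI terms (illustrative; assumes the partner process posts the matching call):
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
      int rank;
      double data = 42.0, result = 0.0;
      MPI_Request req;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
        /* nonblocking send: returns immediately, the transfer proceeds in the background */
        MPI_Isend(&data, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        /* ... other work may be performed here, but 'data' must not be modified yet ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);     /* wait for completion before reusing 'data' */
      } else if (rank == 1) {
        /* blocking receive: only returns once the message has arrived in 'result' */
        MPI_Recv(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %f\n", result);
      }

      MPI_Finalize();
      return 0;
    }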

  42. Collective communication • Transfer of data in a set of processes • Provided by most message passing systems • Basic operations [next slides]: • barrier: synchronization of processes • broadcast: one-to-many communication of the same data • scatter: one-to-many communication of different portions of data • gather: many-to-one communication of (different, but related) data • reduction: gather plus combination of the data with an arithmetic/logical operation • Root – in some collective operations, the single prominent source / destination • e.g. in broadcast • Collective operations could be built out of a set of point-to-point operations, but these “black box” routines • hide a lot of the messy details • are usually more efficient • can take advantage of special communication hardware

  43. Barrier • A basic mechanism for synchronizing processes • Inserted at the point in each process where it must wait for the others • All processes can continue from this point when all the processes have reached it • or when a stated number of processes have reached this point • Often involved in other operations [Wilkinson2004]

  44. Broadcast • Distributes the same piece of data from a single source (root) to all processes (concerned with the problem) • multicast – sending the message to a defined group of processes [Figure: the root’s value B is copied to every process]

  45. Scatter • Distributes each element of an array in the root to a separate process • including the root • contents of the i-th array element are sent to the i-th process [Figure: the root’s array A, B, C, D is split so that the i-th process receives the i-th element]

  46. Gather • Collects data from each process at the root • the value from the i-th process is stored in the i-th array element (rank order) [Figure: the values A, B, C, D held by the individual processes are assembled into a single array at the root]

  47. Reduction • Gather operation combined with a specified arithmetic/logical operation • collect data from each processor • reduce these data to a single value (such as a sum or a maximum) • store the reduced result on the root processor [Figure: the processes’ values are combined element-wise at the root] • [an MPI sketch follows]
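  A reduction in MPI, sketched on the PI example from the shared-variables part (illustrative; each process computes its own partial sum, which is then combined at the root):
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
      int rank, nprocs;
      long num_steps = 100000;
      double local_sum = 0.0, global_sum, step = 1.0 / (double)num_steps;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      /* each process sums its own strips (round-robin partition, as in the thread examples) */
      for (long i = rank + 1; i <= num_steps; i += nprocs) {
        double x = (i - 0.5) * step;
        local_sum += 4.0 / (1.0 + x * x);
      }

      /* reduction: partial sums are combined with MPI_SUM and stored at the root (rank 0) */
      MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

      if (rank == 0)
        printf("pi is %.10f\n", global_sum * step);

      MPI_Finalize();
      return 0;
    }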

  48. Message passing system (1) • Computer realization of the message passing model • Most popular message passing systems (MPS): • Message Passing Interface (MPI) [next lecture] • Parallel Virtual Machine (PVM) • in distributed computing: Corba, Java RMI, DCOM, etc.

  49. Message passing system (2) • Information needed by an MPS to transfer a message includes: • the sending process and the location, type and amount of transferred data • no interest in the data itself (message body) • the receiving process(-es) and storage to receive the data • Most of this information is attached as the message envelope • may be (partially) available to the receiving process • MPS may provide various information to the processes • e.g. about the progress of communication • A lot of other technical aspects, e.g.: • process enrolment in the MPS • addressing scheme • content of the envelope • using message buffers (system, user space)

  50. WWW (what, when, why)
  Shared variables (OpenMP): • easier to program than MP, code is simpler • implementation can be incremental • no message start-up costs • can cope with irregular communication patterns • limited to shared-memory systems • harder to debug and optimize • scalability limited • usually less efficient than MP equivalents
  Data parallel (HPF): • easier to program than MP • simpler to debug than SV • does not require shared memory • DP style suitable only for certain applications • restricted control over data and work distribution • difficult to obtain top performance • only a few APIs available • out of date?
  Message passing (MPI): • easier to debug • easiest to optimize • can overlap communication and computation • potential for high scalability • support on all parallel architectures • harder to program • load balancing, deadlock prevention, etc. need to be addressed • most freedom and responsibility
