Computer Architecture II
Programming: POSIX Threads and OpenMP
OpenMP overview
• Open specifications for Multi Processing
• A set of APIs for writing multithreaded applications in C/C++ and Fortran
• Thread-based parallelism: fork/join model
• OpenMP consists of compiler directives, library calls, and environment variables
OpenMP Release History
• 1997: OpenMP Fortran 1.0
• 1998: OpenMP C/C++ 1.0
• 1999: OpenMP Fortran 1.1
• 2000: OpenMP Fortran 2.0
• 2002: OpenMP C/C++ 2.0
Goals
• A standard for shared-memory machines, backed by the major computer hardware and software vendors
• A limited number of directives
• Ease of use: incrementally parallelize a serial program; both coarse-grain and fine-grain parallelism
• Portability: Fortran (77, 90, and 95), C, and C++
OpenMP Constructs
• Directives
  • Parallel region
  • Work-sharing
  • Synchronization
• Runtime library routines
• Environment variables
OpenMP C directive format:

    #pragma omp directive-name [clause, ...]
    { code }
1.a. Parallel Region Directive
• Marks a block of code that will be executed by multiple threads.
• Fork/join model: the master thread creates (forks) a team of threads (T0, T1, T2, ...) at the start of the parallel region and joins them at its end.

    #include <omp.h>
    int main() {
        int x;
        sequential_code();
        #pragma omp parallel
        {
            parallel_code();
        }
        sequential_code();
        return 0;
    }
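A minimal, compilable variant of the sketch above (sequential_code/parallel_code are placeholders in the slide); compile with the compiler's OpenMP flag, e.g. gcc -fopenmp:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        printf("before: one thread\n");      /* sequential part */
        #pragma omp parallel                 /* fork a team of threads */
        {
            printf("hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }                                    /* implicit join + barrier */
        printf("after: one thread\n");       /* sequential again */
        return 0;
    }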
1.b. Work-sharing Directives
• Types: for, sections, single
• For construct:
  • Distributes the iterations of a loop among the threads of the team.
  • The method of assigning iterations depends on the SCHEDULE clause.
  • An implicit barrier is assumed at the end.
  • All private variables are flushed at the end.
Work-sharing example

Sequential code:

    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region (manual partitioning):

    #pragma omp parallel
    {
        int id, i, Nthrds, istart, iend;
        id = omp_get_thread_num();
        Nthrds = omp_get_num_threads();
        istart = id * N / Nthrds;
        iend = (id + 1) * N / Nthrds;
        for (i = istart; i < iend; i++) { a[i] = a[i] + b[i]; }
    }

OpenMP parallel region with a work-sharing for construct:

    #pragma omp parallel
    #pragma omp for schedule(static)
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
Schedule Clause
The schedule clause affects how loop iterations are mapped onto threads:
• schedule(static [,chunk]): assigns blocks of "chunk" iterations to each thread in round-robin fashion.
• schedule(dynamic [,chunk]): when free, each thread picks "chunk" iterations from a queue until all iterations have been executed.
• schedule(guided [,chunk]): a special dynamic schedule; at the beginning each thread grabs a large block of iterations, then the block size decreases gradually (down to "chunk").
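As an illustration (a sketch, not from the original slides; work, i, and n are placeholder names), the same loop under each schedule kind:

    #include <omp.h>
    #include <stdio.h>

    static void work(int i) {            /* placeholder per-iteration work */
        printf("iter %d on thread %d\n", i, omp_get_thread_num());
    }

    int main(void) {
        int i, n = 16;
        #pragma omp parallel for schedule(static, 4)   /* fixed blocks of 4, round-robin */
        for (i = 0; i < n; i++) work(i);

        #pragma omp parallel for schedule(dynamic, 4)  /* free threads grab 4 iterations at a time */
        for (i = 0; i < n; i++) work(i);

        #pragma omp parallel for schedule(guided, 4)   /* shrinking blocks, never below 4 */
        for (i = 0; i < n; i++) work(i);
        return 0;
    }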
Sections Directive
• Non-iterative construct
• Each section is executed by one thread

    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            code_executed_by_one();
            #pragma omp section
            code_executed_by_another_one();
        }
    }
Single Directive
• Only one thread will execute the single block, while the others do nothing.

    #pragma omp single
    code_executed_by_only_one();
Parallel Region and Work-sharing Directives Combined
A parallel region directive can be combined with a work-sharing construct (see the sketch below):

    #pragma omp parallel for schedule_clause
    #pragma omp parallel sections
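A minimal sketch of the combined form (scale, a, and s are placeholder names):

    #include <omp.h>
    #define N 1000

    void scale(double *a, double s) {
        int i;
        /* combined directive: forks a team and work-shares the loop in one line */
        #pragma omp parallel for schedule(static)
        for (i = 0; i < N; i++)
            a[i] *= s;
    }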
Data Scoping Clauses
• Scoping: in which blocks of the program a declared variable is visible
• By default, the majority of variables are shared
• Exceptions:
  • The loop index within a parallel for
  • Variables local to subroutines called within a parallel region
  • Local variables declared within the lexical scope of a parallel region
• It is recommended to declare the scope of variables explicitly, using the clauses (see the sketch after this list):
  • SHARED: the variable is shared among threads
  • PRIVATE: the variable is private to each thread
  • FIRSTPRIVATE: the variable is private, and each private copy is initialized to the value of the original object before entering the parallel region
  • LASTPRIVATE: the value from the last iteration is copied back to the original object location
  • REDUCTION: performs a reduction on the private copies at the end of the parallel construct
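A sketch illustrating firstprivate and lastprivate (t and last are placeholder names):

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        int i, t = 10, last = -1;
        /* t: each thread's copy starts from the original value 10 (firstprivate);
           last: the value from the final iteration survives the loop (lastprivate) */
        #pragma omp parallel for firstprivate(t) lastprivate(last)
        for (i = 0; i < 100; i++) {
            t += i;     /* updates a private copy: no data race */
            last = i;
        }
        printf("last = %d\n", last);   /* 99: copied back from the last iteration */
        printf("t = %d\n", t);         /* 10: the original object is unchanged */
        return 0;
    }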
Reduction example

    #include <omp.h>
    #define N 100
    int main() {
        int i;
        float a[N], b[N], result = 0.0;
        /* ... initialize a and b ... */
        #pragma omp parallel for private(i) reduction(+:result)
        for (i = 0; i < N; i++)
            result = result + (a[i] * b[i]);
        return 0;
    }

• a and b arrays are shared (by default)
• result is declared in the reduction clause: each thread works on a private copy, and the copies are reduced at the end
• i is private by default (one of the 3 exceptions)
1.c. Synchronization Directives and Clauses

    #pragma omp barrier
    nowait (clause)
    #pragma omp critical
    #pragma omp master
    #pragma omp flush
Synchronization Directives
• When a BARRIER directive is reached, a thread waits at that point until all other threads have reached that barrier.
• Implicit barriers are applied at:
  • The end of parallel regions
  • The end of work-sharing constructs (for, sections, single)
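A sketch of an explicit barrier separating two phases of work (phase is a placeholder function):

    #include <omp.h>
    #include <stdio.h>

    static void phase(int n, int id) {   /* placeholder per-phase work */
        printf("phase %d, thread %d\n", n, id);
    }

    int main(void) {
        #pragma omp parallel
        {
            int id = omp_get_thread_num();
            phase(1, id);
            #pragma omp barrier   /* no thread starts phase 2 until all finish phase 1 */
            phase(2, id);
        }
        return 0;
    }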
Synchronization Directives
• nowait is a clause that removes the implicit barrier at the end of a construct.
• It is used with work-sharing directives (for, sections, single).
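A sketch of nowait on a work-sharing loop (the function and array names are placeholders); since the two loops touch independent arrays, threads need not wait between them:

    #include <omp.h>
    #define N 1000

    void independent_loops(double *a, double *b) {
        #pragma omp parallel
        {
            int i;
            /* nowait: a thread may proceed to the second loop without
               waiting for the other threads to finish the first one */
            #pragma omp for nowait
            for (i = 0; i < N; i++) a[i] = a[i] * 2.0;
            #pragma omp for
            for (i = 0; i < N; i++) b[i] = b[i] + 1.0;
        }
    }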
Synchronization Directives
• The CRITICAL directive specifies a region of code that must be executed by only one thread at a time.
• It blocks all other threads until the current thread exits that CRITICAL region.

    #pragma omp critical [name]

• The optional name enables multiple different CRITICAL regions to exist.
• Different CRITICAL regions with the same name are treated as the same region (see the sketch below).
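A sketch of a named critical region protecting a shared counter (count and counter_lock are placeholder names):

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        int count = 0;
        #pragma omp parallel
        {
            /* only one thread at a time may update count */
            #pragma omp critical (counter_lock)
            count++;
        }
        printf("count = %d\n", count);   /* equals the number of threads */
        return 0;
    }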
Synchronization Directives
• The FLUSH directive identifies a synchronization point at which the implementation must provide a consistent view of memory. Thread-visible variables are written back to memory at this point.
• FLUSH is implied by these directives:
  • critical (entry and exit)
  • barrier
  • parallel (exit)
  • for (exit)
  • sections (exit)
  • single (exit)
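A sketch of the classic flag-signaling use of flush (data and flag are placeholder names; modern OpenMP code would typically use atomic operations for this pattern instead):

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        int data = 0, flag = 0;
        omp_set_num_threads(2);          /* the spin-wait needs two threads */
        #pragma omp parallel sections shared(data, flag)
        {
            #pragma omp section          /* producer */
            {
                data = 42;
                #pragma omp flush(data)  /* publish data before raising the flag */
                flag = 1;
                #pragma omp flush(flag)
            }
            #pragma omp section          /* consumer */
            {
                while (1) {              /* spin until the flag becomes visible */
                    #pragma omp flush(flag)
                    if (flag) break;
                }
                #pragma omp flush(data)
                printf("data = %d\n", data);   /* prints 42 */
            }
        }
        return 0;
    }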
2. Runtime Library Routines
The OpenMP standard defines an API of library calls that perform a variety of functions:
• Query the number of threads/processors; set the number of threads to use
• General-purpose locking routines (semaphores)
• Set execution-environment functions: nested parallelism, dynamic adjustment of the number of threads
Runtime Library Routines
• void omp_set_num_threads(int num_threads)
  Sets the number of threads that will be used in the next parallel region.
• int omp_get_num_threads(void)
  Returns the number of threads currently in the team executing the parallel region from which it is called.
• int omp_get_thread_num(void)
  Returns the thread number of the calling thread within the team. This number is between 0 and omp_get_num_threads() - 1; the master thread of the team is thread 0.
• int omp_get_num_procs(void)
  Returns the number of processors that are available to the program.
• int omp_in_parallel(void)
  Used to determine whether the section of code that is executing is parallel or not.
(A usage sketch follows.)
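A short sketch exercising these routines:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        printf("procs: %d, in parallel? %d\n",
               omp_get_num_procs(), omp_in_parallel());   /* 0: serial here */
        omp_set_num_threads(4);          /* request 4 threads for the next region */
        #pragma omp parallel
        {
            printf("thread %d of %d (in parallel? %d)\n",
                   omp_get_thread_num(), omp_get_num_threads(),
                   omp_in_parallel());   /* 1: inside the parallel region */
        }
        return 0;
    }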
Runtime Library Routines
• By default, a program with multiple parallel regions uses the same number of threads to execute each region.
• This behavior can be changed to allow the runtime system to dynamically adjust the number of threads created for a given parallel region.
• void omp_set_dynamic(int dynamic_threads)
  Enables or disables dynamic adjustment (by the runtime system) of the number of threads available for the execution of parallel regions.
3. Environment Variables
• Some of them are variants of runtime library calls
• OMP_NUM_THREADS
  Sets the maximum number of threads to use during execution. For example: setenv OMP_NUM_THREADS 8
• OMP_DYNAMIC
  Enables or disables dynamic adjustment of the number of threads available for the execution of parallel regions. Valid values are TRUE or FALSE. For example: setenv OMP_DYNAMIC TRUE