Mohsan Jameel, Department of Computing, NUST School of Electrical Engineering and Computer Science
Outline • Introduction to OpenMP • OpenMP Programming Model • OpenMP Directives • OpenMP Clauses • Run-Time Library Routine • Environment Variables • Summary
What is OpenMP • An Application Program Interface (API) used to explicitly direct multi-threaded, shared-memory parallelism • Consists of: • Compiler directives • Run-time library routines • Environment variables • Specification maintained by the OpenMP Architecture Review Board (http://www.openmp.org) • Version 3.0 was released in May 2008
What OpenMP is Not • Not automatic parallelization • The user explicitly specifies parallel execution • The compiler does not check directives for correctness; it applies them even if they are wrong • Not just loop-level parallelism • Provides functionality to enable coarse-grained parallelism • Not meant for distributed-memory parallel systems • Not necessarily implemented identically by all vendors • Not guaranteed to make the most efficient use of shared memory
History of OpenMP • In the early 1990s, vendors of shared-memory machines supplied similar, directive-based Fortran programming extensions: • The user would augment a serial Fortran program with directives specifying which loops were to be parallelized. • The first attempt at a standard was the draft for ANSI X3H5 in 1994. It was never adopted, largely due to waning interest as distributed-memory machines became popular. • The OpenMP standard specification effort started in the spring of 1997, taking over where ANSI X3H5 had left off, as newer shared-memory machine architectures started to become prevalent.
Goals of OpenMP • Standardization: • Provide a standard among a variety of shared-memory architectures/platforms • Lean and mean: • Establish a simple and limited set of directives for programming shared-memory machines • Ease of use: • Provide the capability to incrementally parallelize a serial program • Provide the capability to implement both coarse-grain and fine-grain parallelism • Portability: • Support Fortran (77, 90, and 95), C, and C++
Outline • Introduction to OpenMP • OpenMP Programming Model • OpenMP Directives • OpenMP Clauses • Run-Time Library Routine • Environment Variables • Summary
OpenMP Programming Model • Thread Based Parallelism • Explicit Parallelism • Compiler Directive Based • Dynamic Threads • Nested Parallelism Support • Task parallelism support (OpenMP specification 3.0)
Execution Model (fork-join diagram): the master thread (ID = 0) forks a team of worker threads (ID = 1, 2, 3, …, N-1) at a parallel region and joins with them at its end
Terminology • OpenMP team = master + workers • A parallel region is a block of code executed by all threads simultaneously • The master thread always has thread ID = 0 • Thread adjustment (if any) is done before entering a parallel region • An "if" clause can be used with the parallel construct; if the condition evaluates to FALSE, the parallel region is skipped and the code runs serially (see the sketch below) • Work-sharing constructs are responsible for dividing work among the threads in a parallel region
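A minimal sketch of the "if" clause (the variable n and the threshold 100 are illustrative; compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp):

#include <omp.h>
#include <stdio.h>

int main() {
    int n = 50;   /* hypothetical problem size */
    /* The parallel region is created only if n > 100; otherwise
       the block is executed serially by the master thread alone. */
    #pragma omp parallel if (n > 100)
    {
        printf("Thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}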
Outline • Introduction to OpenMP • OpenMP Programming Model • OpenMP Directives • OpenMP Clauses • Run-Time Library Routine • Environment Variables • Summary
Parallel Region Example (Fortran directive syntax; a C version follows below)
!$OMP PARALLEL
  write (*,*) "Hello"
!$OMP END PARALLEL
Sample output from a parallel region run with three threads (each thread prints its ID; the master also prints the team size):
Hello world from thread = 0
Number of threads = 3
Hello world from thread = 1
Hello world from thread = 2
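The same example in C, as a minimal sketch (assuming an OpenMP-enabled compiler, e.g. gcc -fopenmp):

#include <omp.h>
#include <stdio.h>

int main() {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();           /* ID of this thread */
        printf("Hello world from thread = %d\n", tid);
        if (tid == 0)                              /* master reports the team size */
            printf("Number of threads = %d\n", omp_get_num_threads());
    }
    return 0;
}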
OpenMP Scoping • Static extent: • The code textually enclosed between the beginning and end of a structured block • The static extent does not span other routines • Orphaned directive: • An OpenMP directive that appears outside the static extent of the enclosing parallel construct, e.g. in a routine called from the parallel region (a sketch follows below) • Dynamic extent: • Includes both the static extent and orphaned directives
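A small sketch of an orphaned directive (the routine name scale and the array size are illustrative):

#include <omp.h>

void scale(double *a, int n) {
    /* Orphaned work-sharing directive: it lies outside the static
       extent of the parallel region in main(). */
    #pragma omp for
    for (int i = 0; i < n; i++)
        a[i] *= 2.0;
}

int main() {
    double a[100] = {0};
    #pragma omp parallel
    {
        scale(a, 100);   /* the dynamic extent of the region includes the orphaned for */
    }
    return 0;
}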
OpenMP Parallel Regions • A block of code that will be executed by multiple threads • Properties: • Fork-join model • The number of threads does not change inside a parallel region • SPMD execution within the region • The enclosed block of code must be structured; no branching into or out of the block • Format: #pragma omp parallel clause1 clause2 …
OpenMP Threads • How many threads? • Use of the omp_set_num_threads() library function (see the sketch below) • Setting of the OMP_NUM_THREADS environment variable • Implementation default • Dynamic threads: • By default, the same number of threads is used to execute each parallel region • Two methods for enabling dynamic threads: • Use of the omp_set_dynamic() library function • Setting of the OMP_DYNAMIC environment variable
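A minimal sketch of setting the thread count from code (the value 4 is arbitrary; the same effect can be had by exporting OMP_NUM_THREADS=4):

#include <omp.h>
#include <stdio.h>

int main() {
    omp_set_num_threads(4);            /* request 4 threads for subsequent parallel regions */

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0) /* let the master report the actual team size */
            printf("Team size = %d\n", omp_get_num_threads());
    }
    return 0;
}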
OpenMP Work-sharing Constructs • for loop: data parallelism • sections: functional parallelism • single: serialize a section of code
Example: Count 3s in an array • Let's assume we have an array of N integers • We want to find how many 3s are in the array • We need: a for loop, an if statement, and a count variable • Let's look at its serial and parallel versions
Serial: Count 3s in an array
int i, count = 0, n = 100;
int array[n];
// initialize array
for (i = 0; i < n; i++) {
    if (array[i] == 3) count++;
}
Work-sharing construct: "for loop" • The "for loop" work-sharing construct can be thought of as a data-parallelism construct.
Parallelize, 1st attempt: Count 3s in an array
int i, count = 0, n = 100;
int array[n];
// initialize array
#pragma omp parallel for default(none) shared(n,array,count) private(i)
for (i = 0; i < n; i++) {
    if (array[i] == 3) count++;   // all threads update the shared count: data race (addressed in the 3rd and 4th attempts)
}
Work-sharing construct: Example of "for loop"
#pragma omp parallel for default(none) shared(n,a,b,c) private(i)
for (i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
}
Work-sharing construct: "sections" • The "sections" work-sharing construct can be thought of as a functional-parallelism construct.
Parallelize, 2nd attempt: Count 3s and 4s in an array • Say we also want to count 4s in the same array • Now we have two different functions, i.e. count 3s and count 4s
int i, count3 = 0, count4 = 0, n = 100;
int array[n];
// initialize array
#pragma omp parallel sections default(none) shared(n,array,count3,count4) private(i)
{
    #pragma omp section
    for (i = 0; i < n; i++) {
        if (array[i] == 3) count3++;
    }
    #pragma omp section
    for (i = 0; i < n; i++) {
        if (array[i] == 4) count4++;
    }
}
No data race condition in this example. WHY?
Work-sharing construct: Example 1 of "sections"
#pragma omp parallel sections default(none) shared(a,b,c,d,e,n) private(i)
{
    #pragma omp section
    {
        printf("Thread %d executes 1st loop\n", omp_get_thread_num());
        for (i = 0; i < n; i++) a[i] = 3*b[i];
    }
    #pragma omp section
    {
        printf("Thread %d executes 2nd loop\n", omp_get_thread_num());
        for (i = 0; i < n; i++) e[i] = 2*c[i] + d[i];
    }
}
final_sum = sum(a,n) + sum(e,n);
printf("FINAL_SUM is %d\n", final_sum);
Work-sharing construct: Example of "single" • Inside a parallel region, a "single" block specifies that the enclosed code is executed by only one thread in the team. Let's look at an example (sketch below).
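A minimal sketch of the single construct (the shared variable n and its value are illustrative):

#include <omp.h>
#include <stdio.h>

int main() {
    int n = 0;
    #pragma omp parallel shared(n)
    {
        #pragma omp single
        {
            n = 100;   /* executed by exactly one thread in the team */
            printf("single block run by thread %d\n", omp_get_thread_num());
        }
        /* implicit barrier at the end of single: all threads now see n == 100 */
        printf("thread %d sees n = %d\n", omp_get_thread_num(), n);
    }
    return 0;
}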
Outline • Introduction to OpenMP • OpenMP Programming Model • OpenMP Directives • OpenMP Clauses • Run-Time Library Routine • Environment Variables • Summary
OpenMP Clauses: Data Sharing 1/2 • shared(list) • The shared clause is used to specify which data are shared among threads • All threads can read and write a shared variable • By default, all variables are shared • private(list) • Private variables are local to each thread • A typical example of a private variable is a loop counter, since each thread needs its own copy
OpenMP Clauses: Data Sharing 2/2 • A private variable exists only between the entry and exit points of the parallel region • A private variable has no scope outside the parallel region • The firstprivate and lastprivate clauses are used to extend the scope of a variable beyond the parallel region • firstprivate: each variable in the list is initialized with the value the original object had before entering the parallel region • lastprivate: the thread that executes the last iteration or section updates the value of the original object in the list
Example: firstprivate and lastprivate
int main(){
    int i, n = 10, C, B, A = 10;
    /*--- Start of parallel region ---*/
    #pragma omp parallel for default(none) shared(n) firstprivate(A) lastprivate(B) private(i)
    for (i = 0; i < n; i++) {
        …
        B = i + A;   /* every thread starts with its private A == 10 */
        …
    }
    /*--- End of parallel region ---*/
    C = B;           /* B holds the value from the last iteration */
}
OpenMP Clauses: nowait • The nowait clause is used to avoid the implicit synchronization (barrier) at the end of a work-sharing directive (see the sketch below).
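A minimal sketch of nowait (the two loops write to independent, illustrative arrays, so skipping the barrier between them is safe):

#include <omp.h>
#define N 1000

int main() {
    double a[N], b[N];
    #pragma omp parallel
    {
        #pragma omp for nowait      /* no barrier at the end of this loop */
        for (int i = 0; i < N; i++)
            a[i] = i * 2.0;

        #pragma omp for             /* independent of a[], so the missing barrier is harmless */
        for (int i = 0; i < N; i++)
            b[i] = i + 1.0;
    }
    return 0;
}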
OpenMP Clause: schedule • The schedule clause is supported on the loop construct only • Used to control the manner in which loop iterations are distributed over the threads • Syntax: schedule(kind[,chunk_size]) • Kinds: • static[,chunk]: distribute iterations in blocks of size "chunk" over the threads in a round-robin fashion • dynamic[,chunk]: fixed-size portions of work, with the size controlled by the value chunk; when a thread finishes its portion it starts on the next one • guided[,chunk]: same as dynamic, but the size of the portions of work decreases exponentially • runtime: the iteration scheduling scheme is set at run time through the environment variable OMP_SCHEDULE • A sketch follows below.
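A minimal sketch of the schedule clause (the chunk size 4 and the loop body are illustrative):

#include <omp.h>
#include <stdio.h>
#define N 100

int main() {
    double a[N];
    /* dynamic schedule: each thread grabs 4 iterations at a time,
       which helps when iteration costs vary */
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < N; i++)
        a[i] = i * i;
    printf("a[N-1] = %f\n", a[N-1]);
    return 0;
}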
OpenMP Critical construct: Example, summation of a vector
int main(){
    int i, sum = 0, n = 5;
    int a[5] = {1,2,3,4,5};
    /*--- Start of parallel region ---*/
    #pragma omp parallel for default(none) shared(sum,a,n) private(i)
    for (i = 0; i < n; i++) {
        sum += a[i];   /* race condition: all threads update sum concurrently */
    }
    /*--- End of parallel region ---*/
    printf("sum of vector a = %d", sum);
}
OpenMP Critical construct
int main(){
    int i, sum = 0, local_sum, n = 5;
    int a[5] = {1,2,3,4,5};
    /*--- Start of parallel region ---*/
    #pragma omp parallel default(none) shared(sum,a,n) private(local_sum,i)
    {
        local_sum = 0;
        #pragma omp for
        for (i = 0; i < n; i++) {
            local_sum += a[i];
        }
        #pragma omp critical
        {
            sum += local_sum;   /* one thread at a time updates the shared sum */
        }
    } /*--- End of parallel region ---*/
    printf("sum of vector a = %d", sum);
}
Parallelize, 3rd attempt: Count 3s in an array
int i, count = 0, local_count, n = 100;
int array[n];
// initialize array
#pragma omp parallel default(none) shared(n,array,count) private(i,local_count)
{
    local_count = 0;
    #pragma omp for
    for (i = 0; i < n; i++) {
        if (array[i] == 3) local_count++;
    }
    #pragma omp critical
    {
        count += local_count;
    }
} /*--- End of parallel region ---*/
OpenMP Clause: reduction • OpenMP provides a reduction clause that is used with the for loop and sections directives • The reduction variable must be shared among threads • The race condition is avoided implicitly
int main(){
    int i, sum = 0, n = 5;
    int a[5] = {1,2,3,4,5};
    /*--- Start of parallel region ---*/
    #pragma omp parallel for default(none) shared(a,n) private(i) \
        reduction(+:sum)
    for (i = 0; i < n; i++) {
        sum += a[i];
    }
    /*--- End of parallel region ---*/
    printf("sum of vector a = %d", sum);
}
Parallelize, 4th attempt: Count 3s in an array
int i, count = 0, n = 100;
int array[n];
// initialize array
#pragma omp parallel for default(none) shared(n,array) private(i) \
    reduction(+:count)
for (i = 0; i < n; i++) {
    if (array[i] == 3) count++;
}
/*--- End of parallel region ---*/
Tasking in OpenMP • In OpenMP 3.0 the concept of tasks was added to the OpenMP execution model • The task model is useful in cases where the number of parallel pieces, and the work involved in each piece, varies and/or is unknown • Before the inclusion of the task model, OpenMP was not well suited to unstructured problems • Tasks are often set up within a single construct, in a manager-worker model.
Task Parallelism Approach 1/2 • Threads line up as workers, go through the queue of work to be done, and execute a task • Threads do not wait, as in loop parallelism; rather, they go back to the queue and take more tasks • Each task is executed serially by the worker thread that picks it up from the queue • Load balancing occurs naturally, as short and long tasks are completed and threads become available (see the sketch below)
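A minimal sketch of the manager-worker task pattern (the linked list, node type, and process() routine are illustrative; one thread creates the tasks, any thread in the team may execute them):

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct node { int value; struct node *next; } node;

void process(node *p) {   /* hypothetical per-item work */
    printf("thread %d processed %d\n", omp_get_thread_num(), p->value);
}

int main() {
    /* build a small list: 0 -> 1 -> ... -> 9 */
    node *head = NULL;
    for (int i = 9; i >= 0; i--) {
        node *p = malloc(sizeof(node));
        p->value = i; p->next = head; head = p;
    }

    #pragma omp parallel
    {
        #pragma omp single        /* one "manager" thread walks the list and queues tasks */
        {
            for (node *p = head; p != NULL; p = p->next) {
                #pragma omp task firstprivate(p)
                process(p);       /* any available worker thread executes the task */
            }
        }
    }
    return 0;
}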
Best Practices • Optimize barrier use • Avoid the ordered construct • Avoid large critical regions • Maximize the size of parallel regions • Avoid repeatedly creating and destroying parallel regions (see the sketch below) • Address poor load balance
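A small sketch of maximizing parallel regions (array names and sizes are illustrative): the two work-sharing loops share one parallel region, so threads are forked and joined only once instead of twice.

#include <omp.h>
#define N 1000

int main() {
    double a[N], b[N];
    #pragma omp parallel      /* single fork/join for both loops */
    {
        #pragma omp for
        for (int i = 0; i < N; i++)
            a[i] = i * 1.0;

        #pragma omp for
        for (int i = 0; i < N; i++)
            b[i] = a[i] + 1.0;
    }
    return 0;
}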