CS 591x Cluster Computing and Programming Parallel Computers
CS 591x Cluster Computing and Programming Parallel Computers Introduction to OpenMP
Recall… • Three paradigms for parallel software • Distributed memory • Shared memory • Data Parallel • As the names imply – • Distributed memory paradigm suited for distributed memory architectures • Shared memory paradigm suited for SMP architectures (shared memory)
Distributed Memory • No common memory space • No automatic sharing of data across processes • All data sharing is done by explicit message passing • MPI
Shared Memory • Processes share common memory space • Data sharing via common memory space • No explicit message passing required • OpenMP
SMP vs Distributed Memory Architectures • SMP – several/many processors share a common memory pool • No explicit message passing • Performance very good – relative to message passing • So, why not just SMPs…
SMPs • SMPs with a large number of processors tend to be very expensive • Processor/cache/interconnect technology limits the number of processors that can be connected in an SMP • The memory/processor interconnect can be a bottleneck – limiting performance
SMPs - Multiport Memory Processor Memory Bank Processor Processor Processor
SMPs – Bus architecture Processor Processor Processor Processor Memory Memory Memory
SMP – Crossbar switches Processor Processor Processor Processor Processor Memory Memory Memory
SMPs • 64 or 128 processor SMPs are very large SMPs • Distributed memory systems can scale up to hundreds or thousands of processors • The fastest computers are distributed memory systems • Still… SMPs are very powerful.
PSC – Rachel and Jonas • 64 processors • 1.67 GHz EV7 processors • 256 GB of shared memory
OpenMP • OpenMP – application programming interface standard for SMP parallel programming • Widely accepted standard, not widely implemented • API for Fortran, C, C++ • Implementations • Portland Group Fortran, C, C++ • Intel Fortran, C, C++
OpenMP Concepts • OpenMP is not a language, more like an extension to existing languages – Fortran, C, C++ • Parallelization is implemented in a program through compiler directives… • Fortran – directives • C/C++ Pragmas • … and limited number of runtime functions… • and environment variables
OpenMP Concepts • Pragmas/directives are parallelization instructions to the compiler • Pragmas/directives start with sentinels • Fortran !$omp [command] c$omp [command] *$omp [command] • C/C++ #pragma omp [command]
OpenMP Concepts • Parallel program blocks • blocks of code… • completely contained within an OpenMP construct • start/end with {/} (just like C) • one entry point/one exit point • no breaks or jumps
OpenMP Concepts • Processes vs. Threads • Process – executable code with memory space independent from other (maybe similar) executable code • Thread – executable code that shares memory space with other executable code • A group of threads in a program is called a "team"
OpenMP Concepts • Fork and join • Fork – the process adds threads of execution • Join – the threads merge back into one at completion of the task
Fork and Join execution model • [Diagram: an MPI program runs fixed processes p0–p3, each single-threaded, for its whole lifetime; an OpenMP program starts as one thread, forks a team of threads t0–t3 at each parallel region, and joins back to one thread afterward]
Compiling OpenMP programs • Portland Group C/C++: pgcc -mp myprog.c -o myprog • Intel C/C++: icc -openmp myproc.c -o myproc
Running OpenMP programs • Set environment variables • OMP_NUM_THREADS=n • NCPUS=n • export OMP_NUM_THREADS=4 • PBS/Torque script:

#!/bin/sh
#PBS -l nodes=2,ppn=2
export OMP_NUM_THREADS=4
./myprog
Sample OpenMP Program

#include <stdio.h>
#include <omp.h>

int main(int argc, char** argv)
{
    printf("Hello world from an OpenMP program.\n");
    #pragma omp parallel
    {
        printf("  from thread number %d\n", omp_get_thread_num());
    }
    printf("this part is sequential.\n");
    return 0;
}
OpenMP – Three parts • Environment Variables • Run-time Library Functions • Pragmas/Directives *
OpenMP – Environment Variables • OMP_NUM_THREADS • specifies the number of threads to use during the execution of an OpenMP program • determines the number of threads assigned to a job regardless of the number of processors in the system • default = 1 • export OMP_NUM_THREADS=4 • NCPUS does the same thing
OpenMP – Environment Variables • OMP_SCHEDULE • defines the type of iteration scheduling used in omp for and omp parallel for loops • options are • static • guided • dynamic
OpenMP – Environment Variables • MPSTKZ (Portland Group compilers) • increases the size of the stacks used by threads in parallel regions • may be needed if threads have many private variables, or… • if functions called within parallel regions have a lot of local variable storage • value must be an integer plus "M" or "m", meaning megabytes • export MPSTKZ=8M
OpenMP – Run-time Library Functions • Always use • #include <omp.h> • int omp_get_num_threads(void); • returns the number of threads in the team executing the parallel region from which it is called int thcount=omp_get_num_threads();
OpenMP – Run-time Library Functions • void omp_set_num_threads(int n); • sets the number of threads for the next parallel region • must be called before a parallel region • behavior is undefined if called from within a parallel region • takes precedence over the OMP_NUM_THREADS environment variable omp_set_num_threads(6);
OpenMP – Run-time Library Functions • int omp_get_thread_num(void); • returns a thread's number within its team • thread numbers run from 0 (the master thread) to omp_get_num_threads()-1 • if called in a serial region, returns 0 • similar to MPI_Comm_rank(…) int mythread_num=omp_get_thread_num();
OpenMP – Run-time Library Functions • int omp_get_max_threads(void); • returns the max number of threads a job can have • returns max number of threads even if in serial region • can be changed with omp_set_num_threads(n); int max_t=omp_get_max_threads();
OpenMP – Run-time Library Functions • int omp_in_parallel(void); • returns non-zero if it is called within a parallel region… • returns zero if not in a parallel region int p_or_pnot=omp_in_parallel();
OpenMP – Run-time Library Functions • Defined by the OpenMP standard but not implemented in these compilers • void omp_set_dynamic(int dyn); • int omp_get_dynamic(void); • void omp_set_nested(int nested); • int omp_get_nested(void);
OpenMP – Run-time Library Functions • to be continued….
OpenMP – Pragmas/directives • General format • #pragma omp pragma_name [clauses] • where pragma_name is one of the pragma command names • clauses is one or more options or qualifying parameters for the pragma
OpenMP – Pragmas/directives

#pragma omp parallel
{
    /* parallel region of C/C++ code */
}

-- declares a parallel region and spawns threads to execute the region in parallel
OpenMP – Pragmas/Directives • In Fortran –

!$OMP PARALLEL [clause]
    Fortran code block
!$OMP END PARALLEL
OpenMP – Pragmas/directives • Sample program – revisited

#include <stdio.h>
#include <omp.h>

int main(int argc, char** argv)
{
    printf("Hello world from an OpenMP program.\n");
    #pragma omp parallel
    {
        printf("  from thread number %d\n", omp_get_thread_num());
    }
    printf("this part is sequential.\n");
    return 0;
}
OpenMP – Pragmas/directives • #pragma omp parallel [clauses] • clauses… • private(list) • shared(list) • default(private | shared | none) • firstprivate(list) • reduction(operator:list) • copyin(list) • if (scalar_expression)
OpenMP – omp parallel clauses • private(list) • list is a variable list • variables in the list are private (local) to each thread • shared(list) • all variables in the list are shared among threads
OpenMP – omp parallel clauses • default(shared | none) • defines the default state of variables in the parallel region • shared means that variables are shared among threads unless otherwise stated • none means there is no default - all variables must be defined shared, private… • firstprivate(list) • private but initialize from object prior to parallel region
OpenMP – omp parallel clauses • reduction(operator: list) • performs reduction operation on variables in the list • specific reduction operation defined by operator • operators – +, *, -, &, |, ^, &&, ||
OpenMP – omp parallel clauses • copyin(list) • variables in list must appear in a threadprivate directive • copies the variable's value from the master thread to the private copy in each thread • if (scalar_expression) • if it evaluates to non-zero – executes the region in parallel • if it evaluates to zero – executes the region in a single thread
OpenMP Pragmas/Directives • #pragma omp threadprivate (list) • declares variables that are private/local to threads • … but persistent across multiple parallel sections • must appear immediately after variable declaration
OpenMP - threadprivate • The copyin clause – copyin (a, b) • copies in (initializes) values of threadprivate variables in parallel region • used in parallel directives
OpenMP – Pragmas/Directives • #pragma omp for • distributes the work of a for loop across the threads in a team… • …if there are no serial dependencies in the loop calculations • a for loop must follow this pragma (DO in Fortran) • this pragma must be within a parallel region
OpenMP – Pragmas/Directives • Clauses • private(list) • shared(list) • lastprivate(list) • reduction(operator:list) • schedule(kind[,chunk]) • ordered • nowait – no barrier at the end of for loop
OpenMP – Pragmas/directives • #pragma omp parallel for [clauses] • performs a for loop within a parallel region • distributes the work of the loop across threads • if there are no serial dependencies in the loop calculations
OpenMP – Pragmas/Directives • #pragma omp parallel for – clauses • private(list) • shared(list) • default(shared | none) • firstprivate(list) • variables in list are private in each thread, but they are initialized by object in serial region
OpenMP – Pragmas/Directives • Parallel for clauses • lastprivate(list) • variables in the list are private in each thread, but the last thread to set the variables updates their shared counterpart in the serial program • reduction(operator:list) • copyin(list)
OpenMP – Pragmas/Directives • parallel for clauses • if (scalar_expression) • ordered • specifies that ordered sections within the loop will be executed in the same order as they would on a serial computer • schedule(kind, chunk)
OpenMP – Pragmas/Directives

#include <stdio.h>
#include <omp.h>

int main(int argc, char** argv)
{
    int a[12] = {1,2,3,4,5,6,7,8,9,10,11,12};
    int i;
    #pragma omp parallel for shared(a)
    for (i = 0; i < 12; i++) {
        a[i] = a[i] + 100;
    }
    for (i = 0; i < 12; i++) {
        printf("here is a[%d] -- %d\n", i, a[i]);
    }
    return 0;
}