CS 591x Cluster Computing and Programming Parallel Computers
CS 591x Cluster Computing and Programming Parallel Computers Introduction to OpenMP
Recall… • Three paradigms for parallel software • Distributed memory • Shared memory • Data Parallel • As the names imply – • Distributed memory paradigm suited for distributed memory architectures • Shared memory paradigm suited for SMP architectures (shared memory)
Distributed Memory • No common memory space • No automatic sharing of data across processes • All data sharing is done by explicit message passing • MPI
Shared Memory • Processes share common memory space • Data sharing via common memory space • No explicit message passing required • OpenMP
SMP vs Distributed Memory Architectures • SMP – several/many processors share a common memory pool • No explicit message passing • Performance very good – relative to message passing • So, why not just SMPs…
SMPs • SMPs with a large number of processors tend to be very expensive • Processor/cache/interconnect technology limits the number of processors that can be connected in an SMP • The memory/processor interconnect can be a bottleneck – limiting performance
SMPs - Multiport Memory Processor Memory Bank Processor Processor Processor
SMPs – Bus architecture Processor Processor Processor Processor Memory Memory Memory
SMP – Crossbar switches Processor Processor Processor Processor Processor Memory Memory Memory
SMPs • 64 or 128 processor SMPs are very large SMPs • Distributed memory systems can scale up to hundreds or thousands of processors • The fastest computers are distributed memory systems • Still… SMPs are very powerful.
PSC – Rachel and Jonas • 64 processors • 1.67 GHz EV7 processors • 256 GB of shared memory
OpenMP • OpenMP – application programming interface standard for SMP parallel programming • Widely accepted standard, not widely implemented • API for Fortran, C, C++ • Implementations • Portland Group Fortran, C, C++ • Intel Fortran, C, C++
OpenMP Concepts • OpenMP is not a language, more like an extension to existing languages – Fortran, C, C++ • Parallelization is implemented in a program through compiler directives… • Fortran – directives • C/C++ Pragmas • … and limited number of runtime functions… • and environment variables
OpenMP Concepts • Pragmas/directives are parallelization instructions to the compiler • Pragmas/directives start with sentinels • Fortran !$omp [command] c$omp [command] *$omp [command] • C/C++ #pragma omp [command]
OpenMP Concepts • Parallel program blocks • blocks of code… • completely contained within an OpenMP construct • start/end with {/} (just like C) • one entry point/one exit point • no breaks or jumps
OpenMP Concepts • Processes vs. Threads • Process – executable code with memory space independent from other (maybe similar) executable code • Thread – executable code that shares memory space with other executable code • A group of threads in a program is called a "team"
OpenMP Concepts • Fork and join • Fork – the process adds threads of execution • Join – the threads merge back into one at completion of the task
Fork and Join execution model • [Diagram: an MPI program runs fixed processes p0–p3, each single-threaded, for its whole lifetime; an OpenMP program starts as one thread, forks a team of threads t0–t3 at each parallel region, and joins back to one thread afterward]
Compiling OpenMP programs • Portland Group C/C++: pgcc -mp myprog.c -o myprog • Intel C/C++: icc -openmp myproc.c -o myproc
Running OpenMP programs • Set environment variables • OMP_NUM_THREADS=n • NCPUS=n • export OMP_NUM_THREADS=4 • PBS/Torque script:

#!/bin/sh
#PBS -l nodes=2,ppn=2
export OMP_NUM_THREADS=4
./myprog
Sample OpenMP Program

#include <stdio.h>
#include <omp.h>

int main(int argc, char** argv)
{
    printf("Hello world from an OpenMP program.\n");
    #pragma omp parallel
    {
        printf("  from thread number %d\n", omp_get_thread_num());
    }
    printf("this part is sequential.\n");
    return 0;
}
OpenMP – Three parts • Environment Variables • Run-time Library Functions • Pragmas/Directives *
OpenMP – Environment Variables • OMP_NUM_THREADS • specifies the number of threads to use during the execution of an OpenMP program • determines the number of threads assigned to a job regardless of the number of processors in the system • default = 1 • export OMP_NUM_THREADS=4 • NCPUS does the same thing
OpenMP – Environment Variables • OMP_SCHEDULE • defines the type of iteration scheduling used in omp for and omp parallel for loops • options are • static • guided • dynamic
OpenMP – Environment Variables • MPSTKZ (Portland Group compilers) • increases the size of the stacks used by threads in parallel regions • may be needed if threads have many private variables, or… • if functions called within parallel regions have a lot of local variable storage • value must be an integer plus "M" or "m", meaning megabytes • export MPSTKZ=8M
OpenMP – Run-time Library Functions • Always use • #include <omp.h> • int omp_get_num_threads(void); • returns the number of threads in the team executing the parallel region from which it is called int thcount=omp_get_num_threads();
OpenMP – Run-time Library Functions • void omp_set_num_threads(int n); • sets the number of threads for the next parallel region • must be called before a parallel region • behavior is undefined if called from within a parallel region • takes precedence over the OMP_NUM_THREADS environment variable omp_set_num_threads(6);
OpenMP – Run-time Library Functions • int omp_get_thread_num(void); • returns a thread's number within its team • thread numbers run from 0 (the master thread) to omp_get_num_threads()-1 • if called in a serial region, returns 0 • similar to MPI_Comm_rank(…) int mythread_num=omp_get_thread_num();
OpenMP – Run-time Library Functions • int omp_get_max_threads(void); • returns the max number of threads a job can have • returns max number of threads even if in serial region • can be changed with omp_set_num_threads(n); int max_t=omp_get_max_threads();
OpenMP – Run-time Library Functions • int omp_in_parallel(void); • returns non-zero if it is called within a parallel region… • returns zero if not in a parallel region int p_or_pnot=omp_in_parallel();
OpenMP – Run-time Library Functions • Defined by the OpenMP standard but not implemented in these compilers • void omp_set_dynamic(int dyn); • int omp_get_dynamic(void); • void omp_set_nested(int nested); • int omp_get_nested(void);
OpenMP – Run-time Library Functions • to be continued….
OpenMP – Pragmas/directives • General format • #pragma omp pragma_name [clauses] • where pragma_name is one of the pragma command names • clauses is one or more options or qualifying parameters for the pragma
OpenMP – Pragmas/directives

#pragma omp parallel
{
    /* parallel region of C/C++ code */
}

-- declares a parallel region and spawns threads to execute the region in parallel
OpenMP – Pragmas/Directives • In Fortran –

!$OMP PARALLEL [clause]
    Fortran code block
!$OMP END PARALLEL
OpenMP – Pragmas/directives • Sample program – revisited

#include <stdio.h>
#include <omp.h>

int main(int argc, char** argv)
{
    printf("Hello world from an OpenMP program.\n");
    #pragma omp parallel
    {
        printf("  from thread number %d\n", omp_get_thread_num());
    }
    printf("this part is sequential.\n");
    return 0;
}
OpenMP – Pragmas/directives • #pragma omp parallel [clauses] • clauses… • private(list) • shared(list) • default(private | shared | none) • firstprivate(list) • reduction(operator:list) • copyin(list) • if (scalar_expression)
OpenMP – omp parallel clauses • private(list) • list is a variable list • variables in the list are private (local) to each thread • shared(list) • all variables in the list are shared among threads
OpenMP – omp parallel clauses • default(shared | none) • defines the default state of variables in the parallel region • shared means that variables are shared among threads unless otherwise stated • none means there is no default - all variables must be defined shared, private… • firstprivate(list) • private but initialize from object prior to parallel region
OpenMP – omp parallel clauses • reduction(operator: list) • performs reduction operation on variables in the list • specific reduction operation defined by operator • operators – +, *, -, &, |, ^, &&, ||
OpenMP – omp parallel clauses • copyin(list) • variables in list must appear in a threadprivate directive • copies the variable's value from the master thread to the private copy in each thread • if (scalar_expression) • if it evaluates to non-zero – executes the region in parallel • if it evaluates to zero – executes the region in a single thread
OpenMP Pragmas/Directives • #pragma omp threadprivate (list) • declares variables that are private/local to threads • … but persistent across multiple parallel sections • must appear immediately after variable declaration
OpenMP - threadprivate • The copyin clause – copyin (a, b) • copies in (initializes) values of threadprivate variables in parallel region • used in parallel directives
OpenMP – Pragmas/Directives • #pragma omp for • distributes the work of a for loop across the threads in a team… • …if there are no serial dependencies in the loop calculations • a for loop must follow this pragma (DO in Fortran) • this pragma must be within a parallel region
OpenMP – Pragmas/Directives • Clauses • private(list) • shared(list) • lastprivate(list) • reduction(operator:list) • schedule(kind[,chunk]) • ordered • nowait – no barrier at the end of for loop
OpenMP – Pragmas/directives • #pragma omp parallel for [clauses] • performs a for loop within a parallel region • distributes the work of the loop across threads • if there are no serial dependencies in the loop calculations
OpenMP – Pragmas/Directives • #pragma omp parallel for – clauses • private(list) • shared(list) • default(shared | none) • firstprivate(list) • variables in list are private in each thread, but they are initialized by object in serial region
OpenMP – Pragmas/Directives • Parallel for clauses • lastprivate(list) • variables in the list are private in each thread, but the last thread to set the variables updates their shared counterpart in the serial program • reduction(operator:list) • copyin(list)
OpenMP – Pragmas/Directives • parallel for clauses • if (scalar_expression) • ordered • specifies that ordered sections within the loop will be executed in the same order as they would on a serial computer • schedule(kind, chunk)
OpenMP – Pragmas/Directives

#include <stdio.h>
#include <omp.h>

int main(int argc, char** argv)
{
    int a[12] = {1,2,3,4,5,6,7,8,9,10,11,12};
    int i;
    #pragma omp parallel for shared(a)
    for (i = 0; i < 12; i++) {
        a[i] = a[i] + 100;
    }
    for (i = 0; i < 12; i++) {
        printf("here is a[%d] -- %d\n", i, a[i]);
    }
    return 0;
}