CI-TRAIN

CI-TRAIN Introduction to OpenMP

The Need for Speed • Why not “out of the box”? • We have a never-ending need to get more performance out of our computers • The scale, scope and complexity of scientific computations can greatly exceed what “off the shelf” computers can deliver in any reasonable amount of time • Growth in processor (core) performance is leveling off; computational demand is not • We need to harness multiple processors working on a common computation

Parallel Machine Architectures • Paradigms for parallel machine architectures • Distributed memory • Shared memory • Data Parallel • GPGPU • Mix of above/All of above

Parallel Machine Architectures • As the names imply… • Distributed memory paradigm is suited to distributed memory architectures • Shared memory paradigm is suited to SMP (shared memory) architectures • Data parallel: concurrent processing with the same/similar operations on different data • Do these mix? We will talk about this later…

SMP vs Distributed Memory

SMP vs Distributed Memory Architectures • SMP – several/many processors share a common memory pool • No explicit message passing • Performance very good – relative to message passing • So, why not just SMPs…

SMPs • SMPs with a large number of processors tend to be very expensive • Process/Cache/Interconnect technology limits the number of processors that can be connected in an SMP • Memory/Processor interconnect can be a bottleneck – limiting performance

SMPs • SMPs with 64 or 128 processors are very large SMPs • Distributed memory systems can scale up to hundreds or thousands of processors • The fastest computers are distributed memory systems, because of scale • Still… SMPs are very powerful.

XSEDE SMP Resources • NICS Nautilus • 1024 Intel Nehalem EX Processors • 4 Terabytes of memory in single memory image • 4 Nvidia Tesla GPUs • TACC Ranger • 3,936 16-way SMP nodes • 32 GB memory per node • PSC Blacklight • 512 8 core Xeons • 16 TB shared memory (*2) via SM Interconnect

Your computer? • Dual core, Quad Core? • Dual processor, Quad processor? • …..

Parallel Software Models • GPGPU – CUDA • Message Passing - MPI • Threads – pthreads, Windows Threads • SMP - OpenMP

OpenMP and MPI – can they live in harmony? https://computing.llnl.gov/tutorials/parallel_comp/#ModelsData

A note about Threads • Threads are streams of execution (code)… • That can run concurrently… • Usually spawned by a process • That process runs on a single processor (core)… • Code can spawn and control multiple threads…

A note about Threads • Threads are streams of execution (code)… • On single processor systems threads timeshare the processor… • On multi-processor or multiple core systems the process can start, stop and control threads across multiple cores… • …with obvious performance advantages.

A note about Threads • Threads are streams of execution (code)… • Using threads… • Some OSs provide a thread API (Linux pthreads, Windows Win32) • Some programming models abstract the threads away from the programmer, e.g. OpenMP

OpenMP • OpenMP – “application programming interface” standard for SMP parallel programming • Widely accepted standard, not widely implemented • API for Fortran, C, C++ • Implementations • Portland Group Fortran, C, C++ • Intel Fortran, C, C++ • GCC

OpenMP Concepts • OpenMP is not a language, more like an extension to existing languages – Fortran, C, C++ • Parallelization is implemented in a program through compiler directives… • Fortran – directives • C/C++ Pragmas • … and limited number of runtime functions… • and environment variables

OpenMP Concepts • Pragmas/directives are parallelization instructions to the compiler • Pragmas/directives start with sentinels • Fortran !$omp [command] c$omp [command] *$omp [command] • C/C++ #pragma omp [command]

OpenMP Concepts • Parallel program blocks • blocks of code… • completely contained within an OpenMP construct • start/end with {/} (just like C) • one entry point/one exit point • no branching into or out of the block (no break, goto or return that crosses its boundary)

OpenMP Concepts • Processes vs Threads • Process – executable code with a memory space independent from other (maybe similar) executable code • Thread – executable code that shares its memory space with other executable code • A group of threads executing a parallel region is called a “team”

OpenMP Concepts • Fork and join • Fork – the process (master thread) spawns a team of threads • Join – the threads synchronize and finish at completion of the task, and only the master thread continues

Fork and Join execution model • [Figure: MPI runs separate processes p0, p1, p2, p3; an OpenMP thread forks into a team of threads t0, t1, t2, t3, joins, and later forks again into t0, t1, t2, t3]

Compiling OpenMP programs
• Intel C/C++ (newer Intel compilers spell the flag -qopenmp)
    icc -openmp myproc.c -o myproc
• GCC
    gcc -fopenmp myproc.c -o myproc

Running OpenMP programs
• Set environment variables
• OMP_NUM_THREADS=n
• NCPUS=n
    export OMP_NUM_THREADS=4
• PBS/Torque (OpenMP threads share memory, so request all cores on one node)
    #!/bin/sh
    #PBS -l nodes=1:ppn=4
    export OMP_NUM_THREADS=4
    ./myprog

Sample OpenMP Program
    #include <stdio.h>
    #include <omp.h>

    int main(int argc, char** argv)
    {
        printf("Hello world from an openMP program.\n");
        /* every thread in the team executes this block once */
        #pragma omp parallel
        {
            printf(" from thread number %d\n", omp_get_thread_num());
        }
        /* back to a single thread after the implicit join */
        printf("this part is sequential.\n");
        return 0;
    }

OpenMP – Three parts • Three parts (interfaces) to OpenMP • Environment Variables • Run-time Library Functions • Pragmas/Directives *

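OpenMP – Environment Variables • OMP_NUM_THREADS • specifies the number of threads to run during the execution of an OpenMP program • determines the number of threads assigned to a job regardless of the number of processors in the system • the default is implementation defined (the value 1 applies to the PGI compilers; GCC and Intel typically default to the number of available cores) • export OMP_NUM_THREADS=4 • NCPUS does the same thing with the PGI compilers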

OpenMP – Environment Variables • OMP_SCHEDULE • defines the type of iteration scheduling used by omp for (C/C++) and omp do (Fortran) loops that specify schedule(runtime) • options are • static • dynamic • guided • csh: setenv OMP_SCHEDULE "dynamic" • sh/bash: export OMP_SCHEDULE="dynamic"

OpenMP – Environment Variables • MPSTKZ (PGI-specific; the standard variable since OpenMP 3.0 is OMP_STACKSIZE) • increases the size of the stacks used by threads in parallel regions • may be needed if threads have a lot of private variables or… • if functions called within parallel regions have a lot of local variable storage • value must be an integer followed by “M” or “m” to mean megabytes • export MPSTKZ=8M

OpenMP – Run-time Library Functions • Always use • #include <omp.h> • int omp_get_num_threads(void); • returns the number of threads in the team executing the parallel region from which it is called • returns 1 if called from a serial region • int thcount=omp_get_num_threads();

OpenMP – Run-time Library Functions • void omp_set_num_threads(int n); • sets the number of threads for the next parallel region • must be called before parallel region • if called from within parallel regions it is undefined • has precedence over OMP_NUM_THREADS environment variable omp_set_num_threads(6);

OpenMP – Run-time Library Functions • int omp_get_thread_num(void); • returns a thread number for a thread within a team • thread numbers run from 0 (root thread) to omp_get_num_threads()-1 • if called in serial region returns 0 • similar to MPI_Comm_rank(…) int mythread_num=omp_get_thread_num();
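
The thread-numbering rules above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `report_team`, the constant `TEAM_LIMIT`, and the serial stand-in functions in the `#else` branch are assumptions of this sketch so that the file also builds when the OpenMP flag is left off.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins (an assumption of this sketch, not part of OpenMP)
   so the file still builds when the OpenMP flag is left off. */
static void omp_set_num_threads(int n) { (void)n; }
static int  omp_get_num_threads(void)  { return 1; }
static int  omp_get_thread_num(void)   { return 0; }
#endif

#define TEAM_LIMIT 4   /* hypothetical team size requested below */

/* Returns the team size if every thread number 0..size-1 was reported
   exactly once, or -1 otherwise. */
int report_team(void)
{
    int seen[TEAM_LIMIT] = {0};
    int team_size = 0;

    omp_set_num_threads(TEAM_LIMIT);   /* a request; the runtime may give fewer */

    #pragma omp parallel
    {
        int id = omp_get_thread_num();          /* declared in the block: private */
        if (id == 0)
            team_size = omp_get_num_threads();  /* written by one thread only */
        seen[id]++;                             /* each thread owns one slot */
        printf("thread %d of %d\n", id, omp_get_num_threads());
    }

    for (int i = 0; i < team_size; i++)
        if (seen[i] != 1)
            return -1;
    return team_size;
}
```

Calling `report_team()` from `main` in a program built with `gcc -fopenmp` prints one line per thread, in no particular order, and returns the size of the team that actually ran.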

OpenMP – Run-time Library Functions • int omp_get_max_threads(void); • returns the maximum number of threads that a parallel region started at this point could use • returns this maximum even when called from a serial region • can be changed with omp_set_num_threads(n); int max_t=omp_get_max_threads();

OpenMP – Run-time Library Functions • int omp_in_parallel(void); • returns non-zero if it is called within a parallel region… • returns zero if not in a parallel region int p_or_pnot=omp_in_parallel();
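
A small sketch of how these query functions behave in serial code and inside a parallel region. The function names `serial_state_ok` and `in_parallel_inside_region` and the serial stand-ins are assumptions made for illustration. One detail worth keeping in mind: `omp_in_parallel()` reports non-zero only for an active region, that is, one executed by more than one thread, so the second function can legitimately return 0 on a single-thread run.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins (assumption of this sketch) for builds without OpenMP. */
static int omp_in_parallel(void)     { return 0; }
static int omp_get_max_threads(void) { return 1; }
static int omp_get_thread_num(void)  { return 0; }
#endif

/* In serial code we expect: not in parallel, thread number 0,
   and at least one thread available for the next region. */
int serial_state_ok(void)
{
    return omp_in_parallel() == 0
        && omp_get_thread_num() == 0
        && omp_get_max_threads() >= 1;
}

/* Returns what omp_in_parallel() reports from inside a parallel region. */
int in_parallel_inside_region(void)
{
    int flag = 0;

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)   /* only the master writes the flag */
            flag = omp_in_parallel();
    }
    return flag;
}
```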

OpenMP – Run-time Library Functions • Defined in the standard, but not implemented by every compiler • void omp_set_dynamic(int dyn); • int omp_get_dynamic(void); • void omp_set_nested(int nested); • int omp_get_nested(void);

OpenMP – Run-time Library Functions • to be continued….

OpenMP – Pragmas/directives • General format • #pragma omp pragma_name [clauses] • where pragma_name is one of the pragma command names • clauses is one or more options or qualifying parameters for the pragma

OpenMP – Pragmas/directives
    #pragma omp parallel
    {
        parallel region of C/C++ code
    }
• declares a parallel region and spawns threads to execute the region in parallel

OpenMP – Pragmas/Directives • In Fortran
    !$OMP PARALLEL [clause]
        Fortran code block
    !$OMP END PARALLEL

OpenMP – Pragmas/directives • Sample program - revisited
    #include <stdio.h>
    #include <omp.h>

    int main(int argc, char** argv)
    {
        printf("Hello world from an openMP program.\n");
        /* every thread in the team executes this block once */
        #pragma omp parallel
        {
            printf(" from thread number %d\n", omp_get_thread_num());
        }
        /* back to a single thread after the implicit join */
        printf("this part is sequential.\n");
        return 0;
    }

OpenMP – Pragmas/directives • #pragma omp parallel [clauses] structured block • clauses… • private(list) • shared(list) • default(shared | none) (Fortran also allows private) • firstprivate(list) • reduction(operator:list) • copyin(list) • if (scalar_expression) • num_threads(scalar_expression)

OpenMP – omp parallel clauses • private(list) • list is a variable list • variables in the list are private or local to each thread • shared(list) • all variables in the list are shared among threads
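
The difference between the two clauses can be sketched as below. The function `square_all`, its cyclic work split, and the serial stand-ins are illustrative assumptions, not something taken from the slides. The array and its length are shared by the whole team, while the scratch variable and the loop counter are private to each thread.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins (assumption of this sketch) for builds without OpenMP. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Fills out[i] = i * i for 0 <= i < n. */
void square_all(int *out, int n)
{
    int tmp = -1;   /* this value is NOT carried into the private copies */
    int i;

    #pragma omp parallel private(tmp, i) shared(out, n)
    {
        int id       = omp_get_thread_num();
        int nthreads = omp_get_num_threads();

        /* cyclic split: thread id handles i = id, id + nthreads, ... */
        for (i = id; i < n; i += nthreads) {
            tmp = i * i;     /* private scratch value: no race */
            out[i] = tmp;    /* threads touch distinct elements of shared out */
        }
    }
}
```

If `tmp` or `i` were left shared, the threads would overwrite each other's values and the result would depend on timing.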

OpenMP – omp parallel clauses • default(shared | none) • defines the default state of variables in the parallel region • shared means that variables are shared among threads unless otherwise stated • none means there is no default - all variables must be explicitly declared shared, private… • firstprivate(list) • private, but each thread's copy is initialized from the value the variable had just before the parallel region
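
A hedged sketch of `default(none)` together with `firstprivate`. The names `firstprivate_demo` and `SLOTS` are hypothetical, and the stand-in for `omp_get_thread_num` exists only so the file builds without OpenMP. With `default(none)` the compiler rejects any variable from the enclosing scope that is used in the region without being listed in a clause.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-in (assumption of this sketch) for builds without OpenMP. */
static int omp_get_thread_num(void) { return 0; }
#endif

#define SLOTS 4   /* hypothetical team size and result array length */

/* Each thread adds its thread number to its own copy of 'base' and
   stores the result; returns the value of 'base' after the region. */
int firstprivate_demo(int results[SLOTS])
{
    int base = 10;

    #pragma omp parallel default(none) firstprivate(base) \
                         shared(results) num_threads(SLOTS)
    {
        int id = omp_get_thread_num();
        base += id;           /* private copy, initialized to 10 */
        results[id] = base;   /* thread id stores 10 + id */
    }

    /* The threads changed only their own copies, so this is still 10. */
    return base;
}
```

The runtime may supply fewer than `SLOTS` threads, in which case the unused entries of `results` are simply left untouched.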

OpenMP – omp parallel clauses • reduction(operator: list) • performs reduction operation on variables in the list • specific reduction operation defined by operator • operators – +, *, -, &, |, ^, &&, ||
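
As an illustration of the reduction clause, the sketch below sums squares with the `+` operator. The function name `sum_of_squares` is an assumption of this example. Each thread accumulates into its own private copy of `sum`, initialized to 0 for `+`, and the copies are combined into the original variable when the region ends.

```c
/* Returns 1*1 + 2*2 + ... + n*n. */
long sum_of_squares(int n)
{
    long sum = 0;
    int i;

    #pragma omp parallel reduction(+:sum)
    {
        /* the loop iterations are divided among the threads of the team */
        #pragma omp for
        for (i = 1; i <= n; i++)
            sum += (long)i * i;   /* updates this thread's private sum */
    }
    return sum;   /* private sums have been added together here */
}
```

Without the reduction clause every thread would update the shared `sum` at the same time, which is a data race and generally yields a wrong total.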

OpenMP – omp parallel clauses • copyin(list) • variables in the list must be threadprivate • copies each variable's value from the master thread to the private copy in every other thread • if (scalar_expression) • if it evaluates to non-zero – executes the region in parallel • if it evaluates to zero – executes the region in a single thread
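
One possible use of the `if` clause is to skip the fork/join overhead for small problems. In this sketch the function name `scale` and the threshold of 1000 elements are arbitrary illustration values, not recommendations from the slides.

```c
/* Multiplies every element of x by factor. */
void scale(double *x, int n, double factor)
{
    int i;

    /* Fork a team only when the loop is long enough to repay the
       overhead; otherwise the region runs on a single thread. */
    #pragma omp parallel for if (n > 1000)
    for (i = 0; i < n; i++)
        x[i] *= factor;
}
```

The result is the same either way; only the number of threads doing the work changes.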

OpenMP Pragmas/Directives • #pragma omp threadprivate (list) • declares variables that are private/local to threads • … but persistent across multiple parallel regions • applies to file-scope (global) or static variables • must appear after the declaration of the variables and before their first use

OpenMP - threadprivate • The copyin clause – copyin (a, b) • copies in (initializes) values of threadprivate variables in parallel region • used in parallel directives
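
A minimal sketch of `threadprivate` combined with `copyin`. The variable `counter` and the function `threadprivate_demo` are hypothetical names chosen for this example, and the team size of 4 is only a request.

```c
/* One copy of 'counter' per thread; the directive follows the
   declaration and precedes the first use. */
static int counter = 0;
#pragma omp threadprivate(counter)

/* Returns 1 if every thread started from the master's value. */
int threadprivate_demo(void)
{
    int ok = 1;

    counter = 5;   /* serial code: sets the master thread's copy */

    #pragma omp parallel copyin(counter) num_threads(4) reduction(&&:ok)
    {
        /* copyin initialized every thread's copy from the master's 5 */
        ok = ok && (counter == 5);
        counter += 1;   /* each thread updates only its own copy */
    }

    /* Serial code sees the master thread's copy, incremented once. */
    return ok && counter == 6;
}
```

Without `copyin`, the copies belonging to the other threads would keep whatever value they had before, here the initial 0.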

OpenMP – Pragmas/Directives • #pragma omp for • distributes the work of a for loop across the threads in a team… • …if there are no serial dependencies in the loop calculations • a for loop must follow this pragma (DO in Fortran) • this pragma must be within a parallel region

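OpenMP – Pragmas/Directives • Clauses for #pragma omp for • private(list) • firstprivate(list) • lastprivate(list) • reduction(operator:list) • schedule(kind[,chunk]) • ordered • nowait – no barrier at the end of the for loop • (shared is not accepted here; it belongs on the enclosing parallel directive)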
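
The sketch below combines `#pragma omp for` with the `schedule` and `lastprivate` clauses inside a parallel region. The function name `fill_and_last` and the chunk size of 2 are assumptions made for illustration; the function expects `n >= 1`.

```c
/* Fills out[i] = 2 * i for 0 <= i < n (n >= 1) and returns the value
   written by the sequentially last iteration. */
int fill_and_last(int *out, int n)
{
    int i;
    int last = -1;

    #pragma omp parallel
    {
        /* iterations are handed out in chunks of 2, round-robin;
           the loop counter i is made private automatically */
        #pragma omp for schedule(static, 2) lastprivate(last)
        for (i = 0; i < n; i++) {
            out[i] = 2 * i;
            last = out[i];   /* private copy inside the loop */
        }
        /* implicit barrier here, since nowait is not given */
    }

    /* lastprivate copied back the value from iteration n - 1 */
    return last;
}
```

With plain `private(last)` the variable would not be updated by the loop, and the function would return -1.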

OpenMP – Pragmas/directives • #pragma omp parallel for [clauses] • combines a parallel region and a for loop in one directive • distributes the work of the loop across threads • valid only if there are no serial dependencies in the loop calculations
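
To make the "no serial dependencies" condition concrete, here is a sketch with one loop that qualifies and one that does not. The function names `vector_add` and `prefix_sum` are assumptions of this example.

```c
/* Independent iterations: each a[i] depends only on b[i] and c[i],
   so the loop can safely be split across threads. */
void vector_add(const double *b, const double *c, double *a, int n)
{
    int i;

    #pragma omp parallel for
    for (i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Iteration i reads the result of iteration i - 1: a serial
   dependency, so this loop is deliberately left without a pragma. */
void prefix_sum(double *a, int n)
{
    int i;

    for (i = 1; i < n; i++)
        a[i] += a[i - 1];
}
```

Adding `#pragma omp parallel for` to `prefix_sum` as written would compile, but threads could read `a[i - 1]` before another thread has updated it.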
