
Parallel Programming On the IUCAA Clusters


Presentation Transcript


  1. Parallel Programming On the IUCAA Clusters Sunu Engineer

  2. IUCAA Clusters • The Cluster – Cluster of Intel Machines on Linux • Hercules – Cluster of HP ES45 quad processor nodes • References: http://www.iucaa.ernet.in/

  3. The Cluster • Four Single Processor Nodes with 100 Mbps Ethernet interconnect. • 1.4 GHz, Intel Pentium 4 • 512 MB RAM • Linux 2.4 Kernel (Redhat 7.2 Distribution) • MPI – LAM 6.5.9 • PVM – 3.4.3

  4. Hercules • Four quad processor nodes with Memory Channel interconnect • 1.25 GHz Alpha 21264D RISC Processor • 4 GB RAM • Tru64 5.1A with TruCluster software • Native MPI • LAM 7.0 • PVM 3.4.3

  5. Expected Computational Performance • ES45 Cluster • Processor ~ 679/960 • System GFLOPS ~ 30 • Algorithm/Benchmark Used – Specint/float/HPL • Intel Cluster • Processor ~ 512/590 • System GFLOPS ~ 2 • Algorithm/Benchmark Used – Specint/float/HPL

  6. Parallel Programs • Move towards large scale distributed programs • Larger class of problems with higher resolution • Enhanced levels of details to be explored • …

  7. The Starting Point • Model → Single Processor Program → Multiprocessor Program • Model → Multiprocessor Program

  8. Decomposition of a Single Processor Program • Temporal • Initialization • Control • Termination • Spatial • Functional • Modular • Object based

  9. Multi Processor Programs • Spatial delocalization – Dissolving the boundary • Single spatial coordinate - Invalid • Single time coordinate - Invalid • Temporal multiplicity • Multiple streams at different rates w.r.t. an external clock.

  10. In comparison • Multiple points of initialization • Distributed control • Multiple points and times of termination • Distribution of the activity in space and time

  11. Breaking up a problem

  12. Yet Another way

  13. And another

  14. Amdahl’s Law
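  The transcript does not carry the figure from this slide. For reference, the usual statement of the law, with P the parallelizable fraction of the program and N the number of processors:

      S(N) = \frac{1}{(1 - P) + P/N}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - P}

  so the serial fraction 1 - P bounds the achievable speedup no matter how many processors are added.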

  15. Degrees of refinement • Fine parallelism • Instruction level • Program statement level • Loop level • Coarse parallelism • Process level • Task level • Region level

  16. Patterns and Frameworks • Patterns - Documented solutions to recurring design problems. • Frameworks – Software and hardware structures implementing the infrastructure

  17. Processes and Threads • From heavy multitasking to lightweight multitasking on a single processor • Isolated memory spaces to shared memory space

  18. Posix Threads in Brief • pthread_create(pthread_t *id, const pthread_attr_t *attributes, void *(*thread_function)(void *), void *arguments) • pthread_exit • pthread_join • pthread_self • pthread_mutex_init • pthread_mutex_lock/unlock • Link with -lpthread (put together in the sketch below)
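  A minimal runnable sketch of the calls listed above (the thread body, counter, and mutex are illustrative, not from the slide):

      #include <stdio.h>
      #include <pthread.h>

      #define NTHREADS 4

      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
      static int counter = 0;

      /* Each thread bumps a shared counter; the mutex serializes the update. */
      static void *thread_function(void *arg)
      {
          (void)arg;                       /* unused */
          pthread_mutex_lock(&lock);
          counter++;
          printf("counter is now %d\n", counter);
          pthread_mutex_unlock(&lock);
          return NULL;
      }

      int main(void)
      {
          pthread_t id[NTHREADS];
          int i;

          for (i = 0; i < NTHREADS; i++)
              pthread_create(&id[i], NULL, thread_function, NULL);
          for (i = 0; i < NTHREADS; i++)
              pthread_join(id[i], NULL);   /* wait for each thread to finish */
          return 0;
      }

  As the slide notes, this builds with the pthreads library linked in, e.g. cc threads.c -lpthread.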

  19. Multiprocessing architectures • Symmetric Multiprocessing • Shared memory • Space Unified • Different temporal streams • OpenMP standard

  20. OpenMP Programming • Set of directives to the compiler to express shared memory parallelism • Small library of functions • Environment variables. • Standard language bindings defined for FORTRAN, C and C++

  21. An OpenMP example (Fortran and C)

      C     An OpenMP program
            program openmp
            integer omp_get_thread_num
      !$OMP PARALLEL
            print *, "Hello world from", omp_get_thread_num()
      !$OMP END PARALLEL
            stop
            end

      /* The same OpenMP example in C */
      #include <stdio.h>
      #include <omp.h>

      int main(int argc, char **argv)
      {
      #pragma omp parallel
          {
              printf("Hello World from %d\n", omp_get_thread_num());
          }
          return (0);
      }
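  A note on building these (the switches below are common conventions, not taken from the slides): OpenMP is enabled with a compiler flag, e.g. f90 -omp or cc -omp with the Tru64 compilers, or -fopenmp with recent GNU compilers; the thread count is then chosen with the OMP_NUM_THREADS environment variable, e.g. export OMP_NUM_THREADS=4.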

  22. OpenMP directives: Parallel and Work sharing • OMP parallel [clauses] • OMP do [clauses] • OMP sections [clauses] • OMP section • OMP single

  23. Combined work sharing / Synchronization • OMP parallel do • OMP parallel sections • OMP master • OMP critical • OMP barrier • OMP atomic • OMP flush • OMP ordered • OMP threadprivate

  24. OpenMP Directive clauses • shared(list) • private(list)/threadprivate • firstprivate/lastprivate(list) • default(private|shared|none) (Fortran) • default(shared|none) (C/C++) • reduction(operator|intrinsic : list) • copyin(list) • if(expr) • schedule(type[,chunk]) • ordered/nowait (several of these combined in the sketch below)
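  A short C sketch putting a few of these clauses together (the array, bounds, and chunk size are invented for illustration):

      #include <stdio.h>
      #include <omp.h>

      #define N 1000

      int main(void)
      {
          double x[N], sum = 0.0;
          int i;

          /* default(none) forces every variable's scope to be stated;
             each thread's partial sum is combined by the reduction;
             iterations are handed out statically in chunks of 100. */
          #pragma omp parallel for default(none) shared(x) private(i) \
                  reduction(+:sum) schedule(static,100)
          for (i = 0; i < N; i++) {
              x[i] = (double)i;
              sum += x[i];
          }
          printf("sum = %f\n", sum);
          return 0;
      }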

  25. OpenMP Library functions • omp_get/set_num_threads() • omp_get_max_threads() • omp_get_thread_num() • omp_get_num_procs() • omp_in_parallel() • omp_get/set_(dynamic/nested)() • omp_init/destroy/test_lock() • omp_set/unset_lock()
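  A small sketch exercising a few of the query functions (the printed labels are mine, not from the slide):

      #include <stdio.h>
      #include <omp.h>

      int main(void)
      {
          omp_set_num_threads(4);                       /* request four threads */
          printf("processors available: %d\n", omp_get_num_procs());
          printf("in parallel here?   : %d\n", omp_in_parallel());  /* 0 outside a region */

          #pragma omp parallel
          {
              /* inside the region each thread sees its own id */
              printf("thread %d of %d\n",
                     omp_get_thread_num(), omp_get_num_threads());
          }
          return 0;
      }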

  26. OpenMP environment variables • OMP_SCHEDULE • OMP_NUM_THREADS • OMP_DYNAMIC • OMP_NESTED

  27. OpenMP Reduction and Atomic Operators • Reduction : +,-,*,&,|,&&,|| • Atomic : ++,--,+,*,-,/,&,>>,<<,|
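  A sketch contrasting the two mechanisms (variable names are illustrative): a reduction combines per-thread partial results once at the end, while atomic serializes each individual update of a shared location:

      #include <stdio.h>

      int main(void)
      {
          int i, n = 1000000;
          double sum = 0.0;
          int hits = 0;

          #pragma omp parallel for reduction(+:sum)
          for (i = 1; i <= n; i++)
              sum += 1.0 / i;          /* partial sums merged once per thread */

          #pragma omp parallel for
          for (i = 0; i < n; i++) {
              #pragma omp atomic       /* serialize just this one update */
              hits++;
          }
          printf("sum = %f, hits = %d\n", sum, hits);
          return 0;
      }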

  28. Simple loops

          do I = 1, N
             z(I) = a * x(I) + y
          end do

    !$OMP parallel do
          do I = 1, N
             z(I) = a * x(I) + y
          end do

  29. Data Scoping • Loop index private by default • Declare as shared, private or reduction

  30. Private variables

    !$OMP parallel do private(a,b,c)
          do I = 1, m
             do j = 1, n
                b = f(I)
                c = k(j)
                call abc(a, b, c)
             end do
          end do

      The equivalent directive in C: #pragma omp parallel for private(a,b,c) (see the sketch below)
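  A self-contained C rendering of the same idea (f, k, and abc here are stand-in definitions, not the slide's routines):

      #include <stdio.h>

      /* stand-ins for the slide's f, k and abc */
      static double f(int i) { return (double)i; }
      static double k(int j) { return 2.0 * j; }
      static void   abc(double a, double b, double c)
      {
          if (b + c < a) printf("unexpected\n");   /* placeholder work */
      }

      int main(void)
      {
          int i, j, m = 100, n = 100;
          double a = 1.0, b, c;

          /* b, c and the inner index j must be private: sharing them
             would let threads overwrite each other's temporaries */
          #pragma omp parallel for private(j, b, c) firstprivate(a)
          for (i = 1; i <= m; i++)
              for (j = 1; j <= n; j++) {
                  b = f(i);
                  c = k(j);
                  abc(a, b, c);
              }
          return 0;
      }

  Note that in C the inner index j must be listed explicitly; only the loop variable of the parallel for itself is private by default.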

  31. Dependencies • Data dependencies (lexical/dynamic extent) • Flow dependencies • Classifying and removing the dependencies • Non-removable dependencies • Examples:

          do I = 2, n
             a(I) = a(I) + a(I-1)
          end do

          do I = 2, N, 2
             a(I) = a(I) + a(I-1)
          end do

      The first loop carries a flow dependence – iteration I reads the value iteration I-1 just wrote – so its iterations cannot run concurrently; in the stride-2 loop a(I-1) is always an odd-indexed element that no iteration writes, so the iterations are independent.
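  The same pair of loops in C, as a sketch; only the second is safe to parallelize:

      #include <stdio.h>

      #define N 16

      int main(void)
      {
          double a[N];
          int i;
          for (i = 0; i < N; i++) a[i] = 1.0;

          /* Flow dependence: iteration i reads what i-1 just wrote,
             so this loop must run in order (not parallelizable). */
          for (i = 1; i < N; i++)
              a[i] = a[i] + a[i-1];

          /* Stride 2: a[i-1] is an odd-index element no iteration
             writes, so the iterations are independent. */
          #pragma omp parallel for
          for (i = 2; i < N; i += 2)
              a[i] = a[i] + a[i-1];

          printf("a[N-1] = %f\n", a[N-1]);
          return 0;
      }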

  32. Making sure everyone has enough work • Parallel overhead – thread creation and synchronization vs. work done in the loop • !$OMP parallel do schedule(dynamic,3) • schedule types – static, dynamic, guided, runtime (see the sketch below)
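  A sketch of the trade-off with deliberately uneven iterations (the work function is invented): schedule(dynamic,3) hands three iterations at a time to whichever thread is free, at the cost of extra synchronization.

      #include <stdio.h>
      #include <omp.h>

      /* iterations get progressively more expensive */
      static double work(int i)
      {
          double s = 0.0;
          int k;
          for (k = 0; k < i * 1000; k++)
              s += 1.0 / (k + 1);
          return s;
      }

      int main(void)
      {
          double total = 0.0;
          int i;

          /* static scheduling would hand the last thread the most work;
             dynamic,3 rebalances the load at run time */
          #pragma omp parallel for schedule(dynamic,3) reduction(+:total)
          for (i = 0; i < 100; i++)
              total += work(i);

          printf("total = %f\n", total);
          return 0;
      }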

  33. Parallel regions – from fine to coarse parallelism • !$OMP parallel • threadprivate and copyin • Work sharing constructs – do, sections, section, single • Synchronization – critical, atomic, barrier, ordered, master (combined in the sketch below)
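  One coarse-grained region enclosing several of these constructs, as an illustrative sketch:

      #include <stdio.h>
      #include <omp.h>

      #define N 1000

      int main(void)
      {
          double x[N];
          double sum = 0.0;
          int i;

          #pragma omp parallel private(i)
          {
              #pragma omp single       /* exactly one thread prints this */
              printf("threads: %d\n", omp_get_num_threads());

              #pragma omp for          /* iterations shared among threads */
              for (i = 0; i < N; i++)
                  x[i] = (double)i;

              #pragma omp for reduction(+:sum)
              for (i = 0; i < N; i++)
                  sum += x[i];
              /* the implicit barrier after each 'for' keeps the phases ordered */

              #pragma omp master       /* only thread 0, no implied barrier */
              printf("sum = %f\n", sum);
          }
          return 0;
      }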

  34. To distributed memory systems • MPI, PVM, BSP …
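  For the distributed-memory systems the slide points to, the canonical MPI starting point (a sketch; build and launch commands vary by installation, e.g. mpicc hello.c and mpirun -np 4 a.out under LAM/MPI):

      #include <stdio.h>
      #include <mpi.h>

      int main(int argc, char **argv)
      {
          int rank, size;

          MPI_Init(&argc, &argv);                 /* start the MPI runtime */
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
          MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total process count */

          printf("Hello from process %d of %d\n", rank, size);

          MPI_Finalize();
          return 0;
      }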

  35. Some Parallel Libraries Existing parallel libraries and toolkits include: • PUL, the Parallel Utilities Library from EPCC. • The Multicomputer Toolbox from Tony Skjellum and colleagues at LLNL and MSU. • The Portable, Extensible Toolkit for Scientific Computation (PETSc) from ANL. • ScaLAPACK from ORNL and UTK. • ESSL and PESSL on AIX. • PBLAS, PLAPACK, ARPACK.
