
Introduction to Parallel Computing




Presentation Transcript


  1. Introduction to Parallel Computing George Mozdzynski March 2004

  2. Outline • What is parallel computing? • Why do we need it? • Types of computer • Parallel Computing today • Parallel Programming Languages • OpenMP and Message Passing • Terminology

  3. What is Parallel Computing? The simultaneous use of more than one processor or computer to solve a problem

  4. Why do we need Parallel Computing? • Serial computing is too slow • Need for large amounts of memory not accessible by a single processor

  5. An IFS operational TL511L60 forecast model takes about one hour wall time for a 10-day forecast using 288 CPUs of our IBM Cluster 1600 1.3 GHz system (1920 CPUs in total). How long would this model take on a fast PC with sufficient memory, e.g. a 3.2 GHz Pentium 4?

  6. Answer: about 8 days, and this PC would need about 25 Gbytes of memory. 8 days is far too long for a 10-day forecast; even 2-3 hours is too long …

  7. IFS Forecast Model (TL511L60)

     CPUs   Wall time (s)
       64       11355
      128        5932
      192        4230
      256        3375
      320        2806
      384        2338
      448        2054
      512        1842

     Amdahl's Law (named after Gene Amdahl): Wall Time = S + P/NCPUS, with Serial S = 574 secs and Parallel P = 690930 secs (calculated using Excel's LINEST function). If F is the fraction of a calculation that is sequential, and (1-F) is the fraction that can be parallelised, then the maximum speedup that can be achieved by using N processors is 1/(F + (1-F)/N).
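A minimal sketch of how the fitted parameters reproduce these timings (the program itself is illustrative; only the S and P values are taken from the slide). Evaluating Wall Time = S + P/NCPUS also gives the single-CPU estimate behind slide 6: S + P = 691504 secs, i.e. about 8 days.

      program amdahl_fit
        implicit none
        ! Fitted values from the slide (Excel LINEST over the measured times)
        real :: s = 574.0, p = 690930.0
        integer, dimension(8) :: ncpus = (/ 64, 128, 192, 256, 320, 384, 448, 512 /)
        integer :: i

        ! Single-CPU estimate: S + P = 691504 secs, roughly 8 days
        print '(a,f9.0,a,f4.1,a)', 'T(1) = ', s + p, ' secs = ', (s + p)/86400.0, ' days'

        ! Predicted wall time for each measured CPU count
        do i = 1, 8
           print '(i4,f9.0)', ncpus(i), s + p/real(ncpus(i))
        end do
      end program amdahl_fit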

  8. IFS Forecast Model (TL511L60)

     CPUs   Wall time (s)   Speedup   Efficiency (%)
        1       691504          1        100.0
       64        11355         61         95.2
      128         5932        117         91.1
      192         4230        163         85.1
      256         3375        205         80.0
      320         2806        246         77.0
      384         2338        296         77.0
      448         2054        337         75.1
      512         1842        375         73.3
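The two derived columns follow directly from the measured wall times: Speedup(N) = T(1)/T(N) and Efficiency(N) = Speedup(N)/N. For example, at 64 CPUs the speedup is 691504/11355 ≈ 61 and the efficiency is 61/64 ≈ 95.2%, matching the first measured row of the table.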

  9. The IFS model would be inefficient on very large numbers of CPUs, but is OK up to 512.

  10. Types of Parallel Computer (P = Processor, M = Memory, S = Switch) [diagram: in a shared-memory machine the processors reach a common memory through a switch; in a distributed-memory machine each processor has its own memory and the processors are connected by a switch]

  11. IBM Cluster 1600 (at ECMWF) (P = Processor, M = Memory, S = Switch) [diagram: several nodes, each containing processors that share the node's memory, with the nodes connected by a switch]

  12. IBM Cluster 1600s at ECMWF (hpca + hpcb)

  13. ECMWF supercomputers • 1979: CRAY-1A (Vector) • CRAY XMP-2, XMP-4, YMP-8, C90-16 (Vector + Shared Memory Parallel) • Fujitsu VPP700, VPP5000 (Vector + MPI Parallel) • 2002: IBM p690 (Scalar + MPI + Shared Memory Parallel)

  14. ECMWF’s first Supercomputer CRAY-1A 1979

  15. Where have 25 years gone?

  16. Types of Processor

      DO J=1,1000
        A(J) = B(J) + C
      ENDDO

      SCALAR PROCESSOR: a single instruction processes one element
        LOAD  B(J)
        FADD  C
        STORE A(J)
        INCR  J
        TEST

      VECTOR PROCESSOR: a single instruction processes many elements
        LOADV  B -> V1
        FADDV  B,C -> V2
        STOREV V2 -> A

  17. Parallel Computing Today • Scalar Systems: IBM Cluster 1600, Fujitsu PRIMEPOWER HPC2500, HP Integrity rx2600 Itanium2 • Vector Systems: NEC SX6, CRAY X-1, Fujitsu VPP5000 • Cluster Systems (typically installed by an integrator): Virginia Tech (Apple G5 / Infiniband), NCSA (Dell PowerEdge 1750, P4 Xeon / Myrinet), LLNL (MCR Linux Cluster, Xeon / Quadrics), LANL (Linux Networx, AMD Opteron / Myrinet)

  18. The TOP500 project • started in 1993 • Top 500 sites reported • Report produced twice a year • EUROPE in JUNE • USA in NOV • Performance based on LINPACK benchmark • http://www.top500.org/

  19. Top 500 Supercomputers

  20. Where is ECMWF in the Top 500? Rmax: Gflop/sec achieved with the Linpack benchmark. Rpeak: peak hardware Gflop/sec (that will never be reached!)

  21. What performance do Meteorological Applications achieve? • Vector computers • About 30 to 50 percent of peak performance • Relatively more expensive • Also have front-end scalar nodes • Scalar computers • About 5 to 10 percent of peak performance • Relatively less expensive • Both Vector and Scalar computers are being used in Met Centres around the world • Is it harder to parallelize than vectorize? • Vectorization is mainly a compiler responsibility • Parallelization is mainly the user’s responsibility

  22. Overview of Recent Supercomputers, Aad J. van der Steen and Jack J. Dongarra: http://www.top500.org/ORSC/2003/

  23. Parallel Programming Languages? • High Performance Fortran (HPF) • directive based extension to Fortran • works on both shared and distributed memory systems • not widely used (more popular in Japan?) • not suited to applications using irregular grids • http://www.crpc.rice.edu/HPFF/home.html • OpenMP • directive based • support for Fortran 90/95 and C/C++ • shared memory programming only • http://www.openmp.org
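As a minimal illustration of the directive-based style mentioned above (the loop is the one from slide 16; the directive is a standard OpenMP worksharing construct, not taken from any particular application):

      ! The iterations of the loop are divided among the threads of one
      ! shared-memory node; A, B and C are shared, the loop index is private.
!$OMP PARALLEL DO SHARED(A,B,C) PRIVATE(J)
      DO J = 1, 1000
         A(J) = B(J) + C
      ENDDO
!$OMP END PARALLEL DO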

  24. Most Parallel Programmers use… • Fortran 90/95 or C/C++ with MPI for communicating between tasks (processes) • works for applications running on shared and distributed memory systems • Fortran 90/95 or C/C++ with OpenMP • for applications whose performance needs can be met by a single node (shared memory) • Hybrid combination of MPI and OpenMP • ECMWF's IFS uses this approach (see the sketch below)
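A hedged sketch of the hybrid MPI/OpenMP pattern (this is not IFS code; the program and variable names are illustrative): MPI carries the communication between tasks, and each task opens an OpenMP parallel region on its own shared-memory node.

      program hybrid_sketch
        use mpi
        implicit none
        integer :: ierr, myrank, ntasks

        call MPI_INIT(ierr)
        call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
        call MPI_COMM_SIZE(MPI_COMM_WORLD, ntasks, ierr)

        ! Each MPI task owns one partition of the data (distributed memory);
        ! within the task, OpenMP threads share the node's memory.
!$OMP PARALLEL
        ! ... thread-parallel work on this task's partition ...
!$OMP END PARALLEL

        call MPI_FINALIZE(ierr)
      end program hybrid_sketch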

  25. The myth of automatic parallelization (two common versions) • "Compilers can do anything (but we may have to wait a while)": automatic parallelization makes it possible (or will soon make it possible) to port any application to a parallel machine and see wonderful speedups without any modifications to the source • "Compilers can't do anything (now or ever)": automatic parallelization is useless and will never work on real code; if you want to port an application to a parallel machine, you have to restructure it extensively, and this is a fundamental limitation that will never be overcome

  26. Terminology • Cache, cache line • NUMA • False sharing • Data decomposition • Halo, halo exchange • FLOP • Load imbalance • Synchronization

  27. THANK YOU

  28. Cache (P = Processor, C = Cache, M = Memory) [diagram: processors with one or two levels of cache (C1, C2) between the processor and memory]

  29. IBM node = 8 CPUs + 3 levels of cache ($) [diagram: eight processors, each with its own level-1 cache (C1), pairs of processors sharing level-2 caches (C2), a node-level level-3 cache (C3), then memory]

  30. Cache is … • Small and fast memory • Cache line typically 128 bytes • A cache line has state (copy, exclusive owner) • Coherency protocol • Mapping, sets, ways • Replacement strategy • Write-through or not • Important for performance • Single (unit) stride access is always the best! • Try to avoid writes to the same cache line from different CPUs (false sharing; see the sketch below) • But don't lose sleep over this
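A minimal sketch of the false-sharing point (all names here are illustrative): when several threads keep updating neighbouring array elements, those elements share one cache line and the line bounces between the CPUs' caches; a reduction avoids this by giving each thread a private copy.

      ! Sketch only: N, NTHREADS, SUMS, WORK and TOTAL are illustrative names.
      ! False sharing: the threads repeatedly write neighbouring elements of
      ! SUMS, which sit in the same 128-byte cache line.
!$OMP PARALLEL DO PRIVATE(J)
      DO ITHR = 1, NTHREADS
         DO J = ITHR, N, NTHREADS
            SUMS(ITHR) = SUMS(ITHR) + WORK(J)
         ENDDO
      ENDDO

      ! Better: a REDUCTION gives each thread a private copy of TOTAL and
      ! combines the copies once, so no cache line is written by two CPUs.
!$OMP PARALLEL DO REDUCTION(+:TOTAL)
      DO J = 1, N
         TOTAL = TOTAL + WORK(J)
      ENDDO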

  31. IFS blocking in grid space (IBM p690 / TL159L60): the block size trades optimal use of cache against subroutine call overhead (sketched below)
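A hedged sketch of the blocking idea (the names NGPTOT, NPROMA and the routine are assumptions for illustration, not taken from the slide): the grid points are processed in blocks small enough for their working set to stay in cache, but large enough that the per-call overhead of the computational routines stays negligible.

      ! Process NGPTOT grid points in blocks of NPROMA points each.
      DO JSTART = 1, NGPTOT, NPROMA
         JEND = MIN(JSTART + NPROMA - 1, NGPTOT)
         CALL GRID_POINT_CALCULATIONS(JSTART, JEND)
      ENDDO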
