
Presentation Transcript


  1. Trends and Challenges in High Performance Computing. Hai Xiang Lin, Delft Institute of Applied Mathematics, Delft University of Technology

  2. What is HPC? • Supercomputers are computers that are typically 100 times or more faster than a PC or a workstation. • High Performance Computing typically refers to applications running on parallel and distributed supercomputers in order to solve large-scale or complex models within an acceptable time (huge computing speed and huge memory storage).

  3. Computational Science & Engineering • Computational Science & Engineering, as a third paradigm for scientific research, is becoming increasingly important (the traditional paradigms being the analytical (theory) and the experimental); • HPC is the driving force behind the rise of this third paradigm (although CSE is not necessarily connected to HPC);

  4. Hunger for more computing power • Tremendous increase in speed: • Clock speed: 10^6 • Parallel processing: 10^5 • Efficient algorithms: 10^6 • Computational scientists and engineers demand ever more computing power (a rough combination of these factors is sketched below)
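As a back-of-the-envelope illustration, assuming the three contributions can be treated as independent multiplicative factors (an idealization, not a claim from the slides), they combine to roughly seventeen orders of magnitude:

```latex
% Idealized product of the speedup factors listed above
\[
  S_{\text{total}} \;\approx\; S_{\text{clock}} \times S_{\text{parallel}} \times S_{\text{algorithms}}
  \;\approx\; 10^{6} \times 10^{5} \times 10^{6} \;=\; 10^{17}
\]
```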

  5. Outline • Architectures • Software and Tools • Algorithms and Applications

  6. Evolution of supercomputers • Vector supercomputers (70’s, 80’s) • Very expensive! (Cray-2 (1986): 1 Gflops) • Massively Parallel Computers (90’s) • Still expensive (Intel ASCI Red (1997): 1 Tflops) • Distributed cluster computers (late 90’s onwards) • Cheap (using off-the-shelf components) and easy to produce (IBM Roadrunner (2008): 1 Pflops) • What’s next? • 1 Exaflops in 2019? (a cycle of 11 years)

  7. Hardware trends: no more doubling of clock speed every 2~3 years

  8. CMOS devices hitting a scaling wall • Power components: • Active power • Passive power • Gate leakage • Sub-threshold leakage (source-drain leakage) • Net: further improvements require structure/materials changes. [Figure: power density (W/cm²) against the air-cooling limit. Source: P. Hofstee, IBM, Euro-Par 2009 keynote]

  9. Microprocessor trends • Single-thread performance is power limited • Multi-core extends throughput performance • Hybrid extends performance and efficiency. [Figure: performance versus power for single-thread, multi-core, and hybrid processors]

  10. Hardware trends: moving towards multi-core and accelerators (e.g., Cell, GPU, …) • Multi-core: e.g., IBM Cell BE: 1 PPE + 8 SPEs • GPU: e.g., Nvidia G200 (240 cores) and GF-100 (512 cores) • A “supercomputer” is affordable for everyone now, e.g., a PC + a 1 Tflops GPU • The size keeps increasing: the largest supercomputers will soon have more than 1 million processors/cores (e.g., IBM Sequoia: 1.6 million Power processor cores, 1.6 Pbytes of memory and 20 Pflops, 2012) • Power consumption is becoming an important metric (watts/Gflops) for (HPC) computers.

  11. Geographical distribution of supercomputing power (Tflops). [Figure: John West, InsideHPC]

  12. HPC Cluster Directions (according to Peter Hofstee, IBM)

  13. Software and Tools Challenges • From the mid-70s to the mid-90s, data-parallel languages and the SIMD execution model were popular alongside the vector computers. • Automatic vectorization of array-type operations is quite well developed. • For MPPs and clusters, an efficient automatic parallelizing compiler has not been developed to this day. • Optimizing data distribution and automatically detecting task parallelism turn out to be very hard problems to tackle.

  14. Software and Tools Challenges (cont.) • Current situation: OpenMP works for SMP systems with a small number of processors/cores. For large systems and distributed-memory systems, data distribution and communication must be done manually by the programmer, mostly with MPI (a minimal contrast is sketched below). • Programming GPU-type accelerators using CUDA, OpenCL, etc. bears some resemblance to programming vector processors in the old days: very high performance for certain types of operations, but the programmability and applicability are somewhat limited.
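As a minimal illustration of the shared-memory case (a sketch of my own, not code from the slides; the array names and sizes are arbitrary), a single OpenMP directive is enough to parallelize a loop on an SMP node, whereas on a distributed-memory cluster the programmer would have to partition the arrays and exchange data explicitly with MPI:

```c
/* Minimal sketch (not from the slides): a vector update parallelized with
 * OpenMP on a shared-memory node. On a distributed-memory cluster the same
 * loop would require the programmer to partition x and y explicitly across
 * ranks and move data with MPI calls (e.g. MPI_Scatter / MPI_Allgather),
 * which is the manual work referred to above. */
#include <stdio.h>

#define N 1000000

static double x[N], y[N];

int main(void) {
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* OpenMP: one directive suffices on an SMP node (compile with -fopenmp) */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = y[i] + 3.0 * x[i];

    printf("y[0] = %f\n", y[0]);
    return 0;
}
```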

  15. The programming difficulty is getting more severe • In contrast to the fast development of hardware, the development of parallel compilers and programming tools lags behind. • Moving towards larger and larger systems enlarges this problem even further • Heterogeneity • Debugging • …

  16. DOE report on exascale supercomputing [4] • “The shift from faster processors to multicore processors is as disruptive as the shift from vector to distributed memory supercomputers 15 years ago. That change required complete restructuring of scientific application codes, which took years of effort. The shift to multicore exascale systems will require applications to exploit million-way parallelism. This ‘scalability challenge’ affects all aspects of the use of HPC. It is critical that work begin today if the software ecosystem is to be ready for the arrival of exascale systems in the coming decade”

  17. The big challenge requires a concerted international effort • IESP - International Exascale Software Project [5].

  18. Applications • Applications which require Exaflops computing power, for example ([4],[6]): • Climate and atmospheric modelling • Astrophysics • Energy research (e.g., combustion and fusion) • Biology (genetics, molecular dynamics, …) • … • Are there applications that can use 1 million processors? • Parallelism is inherent in nature • Serialization is a way we deal with complexity • Some mathematical and computational models may have to be reconsidered

  19. Algorithms • Algorithms with a large degree of parallelism are essential • Data locality is important for efficiency • Data movement at the cache level(s) • Data movement (communication) between processors/nodes

  20. HPC & Algorithms: the growing gap between processor and memory

  21. HPC & Algorithms: the memory hierarchy (NUMA). Reducing the large delay of directly accessing main or remote memory requires: • Optimizing data movement: maximize reuse of data already in the fastest memory (a blocking sketch follows below); • Minimizing data movement between ‘remote’ memories (communication)
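A classic way to realize the first point is loop blocking (tiling). The sketch below is illustrative only, with arbitrary matrix and tile sizes of my choosing; it applies blocking to a dense matrix product so that each tile loaded into fast memory is reused many times before being evicted:

```c
/* Illustrative sketch (not from the slides): cache blocking (tiling) of a
 * dense matrix-matrix product. Working on B x B tiles keeps the working set
 * in fast memory, so each loaded element is reused many times: this is the
 * "maximize reuse of data already in the fastest memory" point above. */
#include <stdio.h>

#define N 512
#define B 64   /* tile size; assumed small enough that three B x B tiles fit in cache */

static double Amat[N][N], Bmat[N][N], Cmat[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { Amat[i][j] = 1.0; Bmat[i][j] = 1.0; Cmat[i][j] = 0.0; }

    /* Blocked triple loop: each (ii, kk, jj) step updates one B x B tile of C */
    for (int ii = 0; ii < N; ii += B)
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int k = kk; k < kk + B; k++)
                        for (int j = jj; j < jj + B; j++)
                            Cmat[i][j] += Amat[i][k] * Bmat[k][j];

    printf("C[0][0] = %f (expect %d)\n", Cmat[0][0], N);
    return 0;
}
```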

  22. A change of scale requires a change in algorithms • It is well known that an algorithm with a higher degree of parallelism is sometimes preferred over an ‘optimal’ algorithm (optimal in the sense of the number of operations); • In order to reduce data movement, we need to consider restructuring existing algorithms

  23. An example: Krylov iterative methods. James Demmel et al., “Avoiding Communication in Sparse Matrix Computations”, Proc. IPDPS, April 2008. In each iteration of a Krylov method such as CG or GMRES, an SpMV (sparse matrix-vector multiplication) is typically computed: y ← y + A x, where A is a sparse matrix; for each a_ij ≠ 0, y_i = y_i + a_ij * x_j. SpMV has low computational intensity: each a_ij is used only once, with no reuse at all (a minimal sketch follows below).
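For concreteness, here is a minimal SpMV sketch, assuming the compressed sparse row (CSR) storage format (the slide does not prescribe a format); it makes the low computational intensity visible, since each stored a_ij is read once, multiplied once, and never reused:

```c
/* Minimal sketch of the SpMV step y <- y + A*x with A stored in CSR format
 * (a storage-format assumption, not taken from the slides). Each nonzero
 * a_ij is touched exactly once per product, so the kernel is memory-bandwidth
 * bound: the low computational intensity mentioned above. */
#include <stdio.h>

/* y <- y + A*x for an n-row sparse matrix in CSR (row_ptr, col_idx, val) */
void spmv_csr(int n, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        for (int p = row_ptr[i]; p < row_ptr[i + 1]; p++)
            y[i] += val[p] * x[col_idx[p]];   /* y_i += a_ij * x_j */
}

int main(void) {
    /* 3x3 example: tridiagonal matrix with 2 on the diagonal, -1 off-diagonal */
    int row_ptr[] = {0, 2, 5, 7};
    int col_idx[] = {0, 1, 0, 1, 2, 1, 2};
    double val[]  = {2, -1, -1, 2, -1, -1, 2};
    double x[] = {1, 1, 1}, y[] = {0, 0, 0};

    spmv_csr(3, row_ptr, col_idx, val, x, y);
    printf("y = [%g %g %g]\n", y[0], y[1], y[2]);  /* prints [1 0 1] */
    return 0;
}
```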

  24. An example: Krylov iterative methods (cont.) • Consider the operation across a number of iterations, where the “matrix powers kernel” [x, Ax, A^2 x, …, A^k x] is computed. • Computing all of these terms at the same time minimizes the data movement of A (at the cost of some redundant work); a naive baseline is sketched below. • Speedups of up to 7x, and 22x across the Grid, are reported.
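The sketch below shows the naive form of the matrix powers kernel as k repeated CSR SpMVs; it is an illustration of my own, not the communication-avoiding implementation of Demmel et al. This baseline streams A through memory k times, which is exactly the data movement the communication-avoiding reformulation reduces:

```c
/* Sketch (not the communication-avoiding code from the paper): the matrix
 * powers kernel [x, Ax, A^2 x, ..., A^k x] computed as k repeated CSR SpMVs.
 * This naive baseline streams A through memory k times; the point of the
 * communication-avoiding reformulation is to read (and communicate) A
 * essentially once for all k vectors, at the cost of some redundant work. */
#include <stdio.h>

void matrix_powers(int n, int k, const int *row_ptr, const int *col_idx,
                   const double *val, const double *x,
                   double *V /* (k+1) x n, row-major: V[s*n + i] = (A^s x)_i */) {
    for (int j = 0; j < n; j++) V[j] = x[j];              /* V[0] = x        */
    for (int s = 1; s <= k; s++) {                        /* V[s] = A*V[s-1] */
        const double *prev = V + (s - 1) * n;
        double *cur = V + s * n;
        for (int i = 0; i < n; i++) {
            cur[i] = 0.0;
            for (int p = row_ptr[i]; p < row_ptr[i + 1]; p++)
                cur[i] += val[p] * prev[col_idx[p]];
        }
    }
}

int main(void) {
    /* Same 3x3 tridiagonal CSR matrix as in the SpMV sketch, with k = 3 */
    int row_ptr[] = {0, 2, 5, 7};
    int col_idx[] = {0, 1, 0, 1, 2, 1, 2};
    double val[]  = {2, -1, -1, 2, -1, -1, 2};
    double x[] = {1, 1, 1};
    double V[4 * 3];

    matrix_powers(3, 3, row_ptr, col_idx, val, x, V);
    for (int s = 0; s <= 3; s++)
        printf("A^%d x = [%g %g %g]\n", s, V[s*3], V[s*3+1], V[s*3+2]);
    return 0;
}
```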

  25. Example: generating parallel operations by graph transformations. [Lin2001] A Unifying Graph Model for Designing Parallel Algorithms for Tridiagonal Systems, Parallel Computing, Vol. 27, 2001. [Lin2004] Graph Transformation and Designing Parallel Sparse Matrix Algorithms beyond Data Dependence Analysis, Scientific Programming, Vol. 12, 2004. This may still be a step too far; the first step should be an automatic parallelizing compiler (detecting parallelism and optimizing data locality).
