Introduction to High Performance Computing

Presentation Transcript

  1. Introduction to High Performance Computing Jon Johansson, Academic ICT, University of Alberta

  2. Agenda • What is High Performance Computing? • What is a “supercomputer”? • is it a mainframe? • Supercomputer architectures • Who has the fastest computers? • Speedup • Programming for parallel computing • The GRID??

  3. High Performance Computing • HPC is the field that concentrates on developing supercomputers and software to run on supercomputers • a main area of this discipline is developing parallel processing algorithms and software • programs that can be divided into little pieces so that each piece can be executed simultaneously by separate processors

  4. High Performance Computing • HPC is about “big problems”, i.e. problems that need: • lots of memory • many CPU cycles • big hard drives • no matter what field you work in, perhaps your research would benefit from making problems “larger” • 2d → 3d • finer mesh • increase the number of elements in the simulation

  5. Grand Challenges • weather forecasting • economic modeling • computer-aided design • drug design • exploring the origins of the universe • searching for extra-terrestrial life • computer vision • nuclear power and weapons simulations

  6. Grand Challenges – Protein To simulate the folding of a 300 amino acid protein in water: • # of atoms: ~32,000 • folding time: 1 millisecond • # of FLOPs: 3 × 10^22 • machine speed: 1 PetaFLOP/s • simulation time: 1 year (Source: IBM Blue Gene Project) Ken Dill and Kit Lau’s protein folding model. IBM’s answer: the Blue Gene Project – US$ 100 M of funding to build a 1 PetaFLOP/s computer. Charles L Brooks III, Scripps Research Institute
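The slide's "1 year" figure follows directly from the other numbers quoted; a few lines of arithmetic, using only the values above (3 × 10^22 FLOPs at a sustained 1 PetaFLOP/s), confirm it:

```python
# Sanity-check of the Blue Gene estimate quoted on the slide:
# 3e22 floating-point operations at a sustained 1 PetaFLOP/s.
flops_needed = 3e22          # total FLOPs to fold the protein
speed = 1e15                 # 1 PetaFLOP/s, in FLOP/s
seconds = flops_needed / speed
years = seconds / (365 * 24 * 3600)
print(f"{years:.2f} years")  # roughly one year, as the slide claims
```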

  7. Grand Challenges - Nuclear • National Nuclear Security Administration • use supercomputers to run three-dimensional codes to simulate instead of test • address critical problems of materials aging • simulate the environment of the weapon and try to gauge whether the device continues to be usable • stockpile science, molecular dynamics and turbulence calculations

  8. Grand Challenges - Nuclear ASCI White • March 7, 2002: first full-system three-dimensional simulations of a nuclear weapon explosion • the simulation used more than 480 million cells (a 780×780×780 grid, if the grid is a cube) • 1,920 processors on the IBM ASCI White at the Lawrence Livermore National Laboratory • 2,931 wall-clock hours, or 122.5 days • 6.6 million CPU hours Test shot “Badger”, Nevada Test Site – Apr. 1953, yield: 23 kilotons

  9. Grand Challenges - Nuclear • Advanced Simulation and Computing Program (ASC)

  10. Agenda • What is High Performance Computing? • What is a “supercomputer”? • is it a mainframe? • Supercomputer architectures • Who has the fastest computers? • Speedup • Programming for parallel computing • The GRID??

  11. What is a “Mainframe”? • large and reasonably fast machines • the speed isn't the most important characteristic • high-quality internal engineering and resulting proven reliability • expensive but high-quality technical support • top-notch security • strict backward compatibility for older software

  12. What is a “Mainframe”? • these machines can, and do, run successfully for years without interruption (long uptimes) • repairs can take place while the mainframe continues to run • the machines are robust and dependable • IBM coined a term to advertise the robustness of its mainframe computers: • Reliability, Availability and Serviceability (RAS)

  13. What is a “Mainframe”? • Introducing IBM System z9 109 • Designed for the On Demand Business • IBM is delivering a holistic approach to systems design • Designed and optimized with a total systems approach • Helps keep your applications running with enhanced protection against planned and unplanned outages • Extended security capabilities for even greater protection • Increased capacity with more available engines per server

  14. What is a Supercomputer?? • at any point in time the term “Supercomputer” refers to the fastest machines currently available • a supercomputer this year might be a mainframe in a couple of years • a supercomputer is typically used for scientific and engineering applications that must do a great amount of computation

  15. What is a Supercomputer?? • the most significant difference between a supercomputer and a mainframe: • a supercomputer channels all its power into executing a few programs as fast as possible • if the system crashes, restart the job(s) – no great harm done • a mainframe uses its power to execute many programs simultaneously • e.g. – a banking system • must run reliably for extended periods

  16. What is a Supercomputer?? • to see the world’s “fastest” computers, look at the Top 500 list • performance is measured with the Linpack benchmark • solve a dense system of linear equations • the performance numbers give a good indication of peak performance
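The Linpack kernel is exactly this dense solve, which costs about (2/3)n³ floating-point operations for an n×n system. A minimal sketch (naive Gaussian elimination without pivoting, so only suitable for well-behaved matrices; the function name is illustrative, not from the benchmark) that both solves a system and counts the operations:

```python
def solve_count_flops(A, b):
    """Solve Ax = b by naive Gaussian elimination, counting FLOPs."""
    n = len(b)
    A = [row[:] for row in A]  # work on copies
    b = b[:]
    flops = 0
    # forward elimination: zero out the sub-diagonal, column by column
    for k in range(n):
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]; flops += 1
            for j in range(k, n):
                A[i][j] -= m * A[k][j]; flops += 2
            b[i] -= m * b[k]; flops += 2
    # back substitution on the resulting upper-triangular system
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = b[i]
        for j in range(i + 1, n):
            s -= A[i][j] * x[j]; flops += 2
        x[i] = s / A[i][i]; flops += 1
    return x, flops

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x, flops = solve_count_flops(A, b)
```

Reporting `flops / elapsed_seconds` for a large n is, in spirit, what the Top 500 numbers measure.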

  17. Terminology • combining a number of processors to run a program is called variously: • multiprocessing • parallel processing • coprocessing

  18. Terminology • parallel computing – harnessing a bunch of processors on the same machine to run your computer program • note that this is one machine • generally a homogeneous architecture • same processors, memory, operating system • all the machines in the Top 500 are in this category

  19. Terminology • distributed computing - harnessing a bunch of processors on different machines to run your computer program • heterogeneous architecture • different operating systems, cpus, memory • the terms “parallel” and “distributed” computing are often used interchangeably • the work is divided into sections so each processor does a unique piece
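The last bullet — dividing the work so each processor does a unique piece — usually begins with computing contiguous index ranges, one per processor. A minimal sketch (the function name `partition` is illustrative, not from the slides):

```python
def partition(n_items, n_procs):
    """Split n_items into n_procs contiguous, near-equal (start, end) ranges."""
    base, extra = divmod(n_items, n_procs)
    ranges, start = [], 0
    for p in range(n_procs):
        # the first `extra` processors each take one leftover item
        size = base + (1 if p < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges
```

For example, `partition(10, 3)` gives `[(0, 4), (4, 7), (7, 10)]`: every item is assigned exactly once, and no processor gets more than one item beyond its fair share.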

  20. Terminology • some distributed computing projects are built on BOINC (Berkeley Open Infrastructure for Network Computing): • SETI@home – Search for Extraterrestrial Intelligence • Proteins@home – deduces DNA sequence, given a protein • Hydrogen@home – enhance clean energy technology by improving hydrogen production and storage (this is beta now)

  21. Quantify Computer Speed • we want a way to compare computer speeds • count the number of “floating point operations” required to solve the problem • + - × / • the results of the benchmark are reported as Floating point Operations Per Second (FLOPS) • a supercomputer is a machine that can provide a very large number of FLOPS

  22. Floating Point Operations • multiply two 1000×1000 matrices • for each resulting array element • 1000 multiplies • 999 adds • do this 1,000,000 times • ~10^9 operations needed • increasing the array size has the number of operations increasing as O(N^3)
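The count above can be reproduced directly: each of the N² output elements needs N multiplies and N−1 adds, i.e. N²(2N−1) operations in total — about 2 × 10^9 for N = 1000, the order of magnitude the slide quotes, and O(N³) as N grows. A small sketch that multiplies and counts at the same time:

```python
def matmul_count(A, B):
    """Multiply two n x n matrices with the triple loop, counting operations."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    ops = 0
    for i in range(n):
        for j in range(n):
            s = A[i][0] * B[0][j]; ops += 1      # first multiply
            for k in range(1, n):
                s += A[i][k] * B[k][j]; ops += 2  # one multiply + one add
            C[i][j] = s
    return C, ops
```

For any n, `ops` comes out to exactly `n * n * (2 * n - 1)`.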

  23. Agenda • What is High Performance Computing? • What is a “supercomputer”? • is it a mainframe? • Supercomputer architectures • Who has the fastest computers? • Speedup • Programming for parallel computing • The GRID??

  24. High Performance Computing • supercomputers use many CPUs to do the work • note that all supercomputing architectures have • processors and some combination of cache • some form of memory and I/O • the processors are separated from the other processors by some distance • there are major differences in the way that the parts are connected • some problems fit into different architectures better than others

  25. High Performance Computing • increasing computing power available to researchers allows • increasing problem dimensions • adding more particles to a system • increasing the accuracy of the result • improving experiment turnaround time

  26. Flynn’s Taxonomy • Michael J. Flynn (1972) • classified computer architectures based on the number of concurrent instructions and data streams available • single instruction, single data (SISD) – basic old PC • multiple instruction, single data (MISD) – redundant systems • single instruction, multiple data (SIMD) – vector (or array) processor • multiple instruction, multiple data (MIMD) – shared or distributed memory systems: symmetric multiprocessors and clusters • common extension: • single program (or process), multiple data (SPMD)
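The SPMD extension mentioned last is the style most cluster codes actually use: one program, launched many times, each copy working on its own slice of the data. A rough sketch — here Python threads stand in for the separate processes (MPI ranks) a real SPMD code would launch, and all names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # the *same* function runs in every worker, on a different slice
    return sum(x * x for x in chunk)

def spmd_sum_of_squares(data, n_workers=4):
    # carve the data into one contiguous chunk per worker
    size = (len(data) + n_workers - 1) // n_workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # launch the copies and combine their partial results
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(partial_sum, chunks))
```

The single program plus the final combining step is the whole pattern; in a real MPI code the split would be done by rank number and the combine by a reduction operation.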

  27. Architectures • we can also classify supercomputers according to how the processors and memory are connected • couple processors to a single large memory address space • couple computers, each with its own memory address space

  28. Architectures • Symmetric Multiprocessing (SMP) • Uniform Memory Access (UMA) • multiple CPUs, residing in one cabinet, share the same memory • processors and memory are tightly coupled • the processors share memory and the I/O bus or data path

  29. Architectures • SMP • a single copy of the operating system is in charge of all the processors • SMP systems range from two to as many as 32 or more processors

  30. Architectures • SMP • "capability computing" • one CPU can use all the memory • all the CPUs can work on a little memory • whatever you need

  31. Architectures • UMA-SMP negatives • as the number of CPUs get large the buses become saturated • long wires cause latency problems

  32. Architectures • Non-Uniform Memory Access (NUMA) • NUMA is similar to SMP - multiple CPUs share a single memory space • hardware support for shared memory • memory is separated into close and distant banks • basically a cluster of SMPs • memory on the same processor board as the CPU (local memory) is accessed faster than memory on other processor boards (shared memory) • hence "non-uniform" • NUMA architecture scales much better to higher numbers of CPUs than SMP

  33. Architectures

  34. Architectures [photos: University of Alberta SGI Origin; SGI NUMA cables]

  35. Architectures • Cache Coherent NUMA (ccNUMA) • each CPU has an associated cache • ccNUMA machines use special-purpose hardware to maintain cache coherence • typically done by using inter-processor communication between cache controllers to keep a consistent memory image when the same memory location is stored in more than one cache • ccNUMA performs poorly when multiple processors attempt to access the same memory area in rapid succession

  36. Architectures Distributed Memory Multiprocessor (DMMP) • each computer has its own memory address space • looks like NUMA but there is no hardware support for remote memory access • the special purpose switched network is replaced by a general purpose network such as Ethernet or more specialized interconnects: • Infiniband • Myrinet Lattice: Calgary’s HP ES40 and ES45 cluster – each node has 4 processors

  37. Architectures • Massively Parallel Processing (MPP) – cluster of commodity PCs • processors and memory are loosely coupled • "capacity computing" • each CPU contains its own memory and copy of the operating system and application • each subsystem communicates with the others via a high-speed interconnect • in order to use MPP effectively, a problem must be breakable into pieces that can all be solved simultaneously

  38. Architectures

  39. Architectures • lots of “how to build a cluster” tutorials on the web – just Google for them

  40. Architectures • Vector Processor or Array Processor • a CPU design that is able to run mathematical operations on multiple data elements simultaneously • a scalar processor operates on data elements one at a time • vector processors formed the basis of most supercomputers through the 1980s and into the 1990s • “pipeline” the data

  41. Architectures • Vector Processor or Array Processor • operate on many pieces of data simultaneously • consider the following add instruction: • C = A + B • on both scalar and vector machines this means: • add the contents of A to the contents of B and put the sum in C • on a scalar machine the operands are numbers • on a vector machine the operands are vectors and the instruction directs the machine to compute the pair-wise sum of each pair of vector elements
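The C = A + B contrast can be mimicked in a few lines: a plain loop plays the scalar machine (one add per instruction), and a whole-array expression plays the single vector instruction. This is only an analogy — a real vector unit does the pair-wise sums in hardware, not in interpreted Python:

```python
def scalar_add(A, B):
    # scalar machine: one element pair per "instruction"
    C = []
    for i in range(len(A)):
        C.append(A[i] + B[i])
    return C

def vector_add(A, B):
    # vector machine: one "instruction" over the whole operands
    return [a + b for a, b in zip(A, B)]
```

Both produce the same C; the vector form expresses the operation on the vectors as a whole, which is what lets vector hardware pipeline the element-wise work.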

  42. Architectures • University of Victoria has 4 NEC SX-6/8A vector processors • in the School of Earth and Ocean Sciences • each has 32 GB of RAM • 8 vector processors in the box • peak performance is 72 GFLOPS

  43. Agenda • What is High Performance Computing? • What is a “supercomputer”? • is it a mainframe? • Supercomputer architectures • Who has the fastest computers? • Speedup • Programming for parallel computing • The GRID??

  44. BlueGene/L • the fastest on the Nov. 2007 Top 500 list • installed at the Lawrence Livermore National Laboratory (LLNL) (US Department of Energy) • Livermore, California


  46. BlueGene/L • processors: 212,992 • memory: 72 TB • 104 racks – each has 2,048 processors • the first 64 had 512 GB of RAM (256 MB/processor) • the 40 new racks have 1 TB of RAM (512 MB/processor) • a Linpack performance of 478.2 TFlop/s • in Nov. 2005 it was the only system ever to exceed the 100 TFlop/s mark • there are now 10 machines over 100 TFlop/s

  47. The Fastest Six

  48. # of Processors with Time The number of processors in the fastest machines has increased by about a factor of 200 in the last 15 years

  49. # of Gflops Increase with Time Machine speed has increased by more than a factor of 5000 in the last 15 years.

  50. Future BlueGene