
High-Performance Grid Computing and Research Networking


Presentation Transcript


  1. High-Performance Grid Computing and Research Networking Introduction to High Performance Computing Instructor: S. Masoud Sadjadi http://www.cs.fiu.edu/~sadjadi/Teaching/ sadjadi At cs Dot fiu Dot edu

  2. Acknowledgements • The content of many of the slides in these lecture notes has been adapted from online resources prepared by the people listed below. Many thanks! • Henri Casanova • Principles of High Performance Computing • http://navet.ics.hawaii.edu/~casanova • henric@hawaii.edu • Ligang He • http://www.dcs.warwick.ac.uk/~liganghe • Email: liganghe@dcs.warwick.ac.uk • Kai Wang • Department of Computer Science • University of South Dakota • http://www.usd.edu/~Kai.Wang • Kyril Faenov • Director of High Performance Computing • Windows Server Group • Andrew Tanenbaum

  3. Agenda • HPC Introduction • HPC Applications • HPC Goals • Concurrency • History

  4. High Performance Computing • Difficult to define - it’s a moving target. • In the 1980s: • a “supercomputer” performed 100 Mega FLOPS • FLOPS: FLoating point Operations Per Second • Today: • a 2 GHz desktop/laptop performs a few Giga FLOPS • a “supercomputer” performs tens of Tera FLOPS (Top500) • High Performance Computing: loosely, computing on the order of 1000 times more powerful than the latest desktops

  5. Units of Measure in HPC • High Performance Computing (HPC) units are: • Flop: floating point operation • Flop/s: floating point operations per second • Bytes: size of data (a double-precision floating-point number is 8 bytes) • Typical sizes are millions, billions, trillions…
  Mega: Mflop/s = 10^6 flop/sec; Mbyte = 10^6 bytes (also 2^20 = 1,048,576)
  Giga: Gflop/s = 10^9 flop/sec; Gbyte = 10^9 bytes (also 2^30 = 1,073,741,824)
  Tera: Tflop/s = 10^12 flop/sec; Tbyte = 10^12 bytes (also 2^40 = 1,099,511,627,776)
  Peta: Pflop/s = 10^15 flop/sec; Pbyte = 10^15 bytes (also 2^50 = 1,125,899,906,842,624)
  Exa: Eflop/s = 10^18 flop/sec; Ebyte = 10^18 bytes
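
To make these units concrete, here is a minimal timing sketch (my own illustration, not course code; the array size and the 2-flops-per-iteration count are assumptions of this toy kernel):

```c
/* Toy benchmark: estimate sustained Mflop/s from a timed loop.
   Assumes a POSIX system for clock_gettime(); numbers are illustrative. */
#include <stdio.h>
#include <time.h>

#define N 1000000   /* 10^6 elements, ~16 MB of doubles total */

static double x[N], y[N];

int main(void) {
    struct timespec t0, t1;
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        y[i] = y[i] + 3.0 * x[i];          /* 1 multiply + 1 add = 2 flops */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("y[42] = %g, rate = %.1f Mflop/s\n", y[42], 2.0 * N / secs / 1e6);
    return 0;
}
```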

  6. Metric Units • The principal metric prefixes.

  7. High Performance Computing • HPC: • “The term high performance computing (HPC) refers to the use of (parallel) supercomputers and computer clusters, that is, computing systems comprised of multiple (usually mass-produced) processors linked together in a single system with commercially available interconnects.” • Wikipedia • “This is in contrast to mainframe computers, which are generally monolithic in nature.” • Wikipedia

  8. High Performance Computing • HPC: • “The more current and evolving definition of HPC refers to High Productivity Computing, and reflects the purpose and use model of the myriad of existing and evolving architectures, and the supporting ecosystem of software, middleware, storage, networking and tools behind the next generation of applications.” • Wikipedia • Parallel Computing: • Computing on parallel computers • Super Computing: • Computing on top 500 machines

  9. High Performance Computing • The definition that we use in this course • “How do we make computers compute bigger problems faster?” • Three main issues • Hardware: How do we build faster computers? • Software: How do we write faster programs? • Hardware and Software: How do they interact? • Many perspectives, spanning theory and practice • architecture • systems • programming • modeling and analysis • simulation • algorithms and complexity

  10. High Performance Computing • HPC Related Technologies • HPC is an all-encompassing term for related technologies that continually push computing boundaries. • Computer architecture • CPU, memory, VLSI • Compilers • Identify inefficient implementations • Make use of the characteristics of the computer architecture • Choose a suitable compiler for a given architecture • Algorithms (for parallel and distributed systems) • How to program on parallel and distributed systems • Middleware • From Grid computing technology • Application -> middleware -> operating system • Resource discovery and sharing

  11. High Performance Computing • The key technique for making computers compute “bigger problems faster” is to use multiple computers at once • Later in this lecture, we will learn why! • This is called parallelism • It takes 1000 hours for this program to run on one computer! • Well, if I use 100 computers, maybe it will take only 10 hours?! • This computer can only handle a dataset that’s 2GB! • So maybe if I use 100 computers I can deal with a 200GB dataset?! • We will spend enough time to learn and experience different flavors of parallel computing (a first shared-memory sketch follows below) • shared-memory parallelism • distributed-memory parallelism • hybrid parallelism
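
As a first taste of shared-memory parallelism, a minimal OpenMP sketch (my own illustration; the work function f() and the array size are placeholders):

```c
/* Shared-memory parallelism with OpenMP: loop iterations are split
   across threads. Compile with e.g. "gcc -fopenmp example.c". */
#include <omp.h>
#include <stdio.h>

#define N 10000000

static double f(int i) { return 0.5 * i; }   /* stand-in for real work */

static double a[N];

int main(void) {
    double t = omp_get_wtime();

    #pragma omp parallel for                 /* each thread gets a chunk */
    for (int i = 0; i < N; i++)
        a[i] = f(i);

    printf("a[7] = %g, time = %g s on up to %d threads\n",
           a[7], omp_get_wtime() - t, omp_get_max_threads());
    return 0;
}
```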

  12. Agenda • HPC Introduction • HPC Applications • HPC Goals • Concurrency • History

  13. Words of Wisdom • “Four or five computers should be enough for the entire world until the year 2000.” • T.J. Watson, Chairman of IBM, 1945. • “640KB [of memory] ought to be enough for anybody.” • Bill Gates, Chairman of Microsoft, 1981. • You may laugh at their vision today, but … • Lesson learned: Don’t be too visionary and try to make things work! ;) • We now know this was not quite true! • Games • Digital video/images • Databases • Operating systems • But the first people to really need more computing oomph were scientists • And they go way back

  14. Evolution of Science • Traditional scientific and engineering: • Do theory or paper design • Perform experiments or build system • Limitations: • Too difficult -- build large wind tunnels • Too expensive -- build a throw-away airplane • Too slow -- wait for climate or galactic evolution • Too dangerous -- weapons, drug design, climate experiments • Solution: • Use high performance computer systems to simulate the phenomenon

  15. Scientific Computing • Use of computers to solve/compute scientific models • For instance, many natural phenomena can be well approximated by differential equations • Classic Example: Heat Transfer • Consider a “1-D” material between 2 heat sources (figure: a bar held at T = H on the left and T = L on the right, with position x along the bar)

  16. Scientific Computing • Use of computers to solve/compute scientific models • For instance, many natural phenomena can be well approximated by partial differential equations (PDEs) • Problem: compute f(x,t), the temperature at location x at time t, for 0 < x < X, with the ends held at T = H and T = L

  17. Heat Transfer • The laws of physics say that: ∂f/∂t = α ∂²f/∂x² • where alpha depends on the material • where f(0,t) = H, f(X,t) = L and f(x,0) are all fixed • Called the boundary conditions • Question: How do we solve this PDE? • It does not have an analytical solution • Therefore it must be solved numerically (i.e., via approximation)

  18. Heat Transfer • One well-known method to solve the heat equation is called “finite differences” • Approach: • Discretize the domain: decide that the values of f(x,t) will only be known for some finite (but large) number of values of x and t • The discretized domain is called a mesh • All x values are separated by ∆x • All t values are separated by ∆t • Then, one replaces the partial derivatives by algebraic differences • In the limit, when ∆x and ∆t go to zero, we get close to the real solution
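
Spelling out the replacement the slide describes (standard finite-difference approximations, not shown on the original slide), with $f_{i,m}$ denoting $f(i\,\Delta x,\ m\,\Delta t)$:

$$\frac{\partial f}{\partial t}\bigg|_{i,m} \approx \frac{f_{i,m+1}-f_{i,m}}{\Delta t},\qquad \frac{\partial^2 f}{\partial x^2}\bigg|_{i,m} \approx \frac{f_{i+1,m}-2f_{i,m}+f_{i-1,m}}{\Delta x^{2}}.$$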

  19. Heat Transfer • There are many different approximations of the partial derivatives, based on Taylor series developments, etc. • For instance, denoting the discrete f(i∆x, m∆t) as f_{i,m}, we can write the “Forward Time, Centered Space” (FTCS) heat transfer equation as: f_{i,m+1} = f_{i,m} + (α∆t/∆x²)·(f_{i+1,m} − 2f_{i,m} + f_{i−1,m}) • The various discretizations of the heat transfer equation have advantages and drawbacks in terms of • complexity • numerical stability • (if you’re into it, there are countless papers and textbooks) • We have transformed a difficult PDE into a simple algebraic recurrence! • Easy to compute in an iterative fashion: given all the values at time m, one can compute all the values at time m+1 (see the sketch below)
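
A minimal C sketch of this FTCS iteration (my own illustration, not course code; NX, NSTEPS and the constants are arbitrary choices satisfying the usual stability bound α∆t/∆x² ≤ 1/2):

```c
/* 1-D heat equation via FTCS: nx points, ends held at H and L. */
#include <stdio.h>
#include <string.h>

#define NX     100     /* spatial points */
#define NSTEPS 1000    /* time steps */

int main(void) {
    double f[NX], fnew[NX];
    const double H = 1.0, L = 0.0;
    const double alpha = 1.0, dx = 1.0, dt = 0.4;  /* r = 0.4 <= 0.5 */
    const double r = alpha * dt / (dx * dx);

    for (int i = 0; i < NX; i++) f[i] = L;         /* initial condition */
    f[0] = H; f[NX-1] = L;                         /* boundary conditions */

    for (int m = 0; m < NSTEPS; m++) {
        for (int i = 1; i < NX - 1; i++)           /* interior update */
            fnew[i] = f[i] + r * (f[i+1] - 2.0*f[i] + f[i-1]);
        fnew[0] = H; fnew[NX-1] = L;
        memcpy(f, fnew, sizeof f);                 /* advance one step */
    }
    printf("temperature at midpoint: %g\n", f[NX/2]);
    return 0;
}
```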

  20. Heat Transfer • Summary • All these methods use some matrix or volume of numbers (in the 2-D and 3-D cases) and iteratively do additions, multiplications and divisions, for many iterations • Therefore, we can replace difficult calculus by simple computations on multi-dimensional arrays of numbers • Challenges • These matrices may be really big, for better resolution and larger domains → Large Data • The number of additions and multiplications can be overwhelming → Heavy Computation • Hence • the early and ever-constant need of scientists for bigger memories and faster CPUs

  21. HPC Applications • Science • Global climate modeling • Astrophysical modeling • Biology: genomics; protein folding; drug design • Computational Chemistry • Computational Material Sciences and Nanosciences • Engineering • Crash simulation • Semiconductor design • Earthquake and structural modeling • Computational fluid dynamics (airplane design) • Combustion (engine design) • Business • Financial and economic modeling • Transaction processing, web services and search engines • Defense • Nuclear weapons -- test by simulation • Cryptography

  22. Example: Computational Fluid Dynamics (CFD) Replacing NASA’s Wind Tunnels with Computers

  23. Example: Global Climate Source: http://www.epm.ornl.gov/chammp/chammp.html • Problem is to compute: f(latitude, longitude, elevation, time) → temperature, pressure, humidity, wind velocity • Approach: • Discretize the domain, e.g., a measurement point every 10 km • Devise an algorithm to predict weather at time t+1 given t • Uses: • Predict El Niño • Set air emissions standards

  24. Global Climate Requirements • One piece is modeling the fluid flow in the atmosphere • Solve the Navier-Stokes equations • Roughly 100 flops per grid point with a 1-minute timestep • Computational requirements: • To match real time, need 5×10^11 flops in 60 seconds ≈ 8 Gflop/s • Weather prediction (7 days in 24 hours) → 56 Gflop/s • Climate prediction (50 years in 30 days) → 4.8 Tflop/s • To use in policy negotiations (50 years in 12 hours) → 288 Tflop/s (the arithmetic is worked out below) • Let’s make it even worse! • To double the grid resolution, the computation grows by more than 8× • State-of-the-art models require integration of atmosphere, ocean, sea-ice and land models, plus possibly carbon cycle, geochemistry and more • Current models are coarser than this!
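
Restating the slide's arithmetic (same numbers, worked out; the last line matches the slide's 288 Tflop/s within rounding):

$$\frac{5\times 10^{11}\ \text{flop}}{60\ \text{s}} \approx 8\ \text{Gflop/s (real time)}$$
$$\text{7 days in 24 h: } 7 \times 8\ \text{Gflop/s} = 56\ \text{Gflop/s}$$
$$\text{50 years in 30 days: } \frac{50\times 365}{30} \approx 608,\qquad 608 \times 8\ \text{Gflop/s} \approx 4.8\ \text{Tflop/s}$$
$$\text{50 years in 12 h: } 50\times 365\times 2 = 36{,}500,\qquad 36{,}500 \times 8\ \text{Gflop/s} \approx 288\ \text{Tflop/s}$$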

  25. High Resolution Climate Modeling on NERSC-3 – P. Duffy, et al., LLNL

  26. Agenda • HPC Introduction • HPC Applications • HPC Goals • Concurrency • History

  27. Goals of HPC • Minimize turn-around time • to complete specific application problems (strong scaling) • Maximize the problem size • that can be solved in a given amount of time (weak scaling) • Identify the compromise between • performance and cost. • Note: Most supercomputers are obsolete • in terms of performance before the end of their physical life.
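
For reference, these two notions can be written compactly (standard definitions, not from the slide); with $T(p)$ the execution time on $p$ processors:

$$\text{strong scaling (fixed total problem size):}\quad S(p) = \frac{T(1)}{T(p)},$$
$$\text{weak scaling (problem size grown $p$-fold with $p$):}\quad E(p) = \frac{T(1)}{T(p)},\ \text{ideally} \approx 1.$$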

  28. Maximizing Performance • How is performance maximized? • Reduce the time per instruction (cycle time) [1]: clock rate. • Increase the number of instructions executed per cycle [2]: pipelining. • Allow multiple processors to work on different parts of the same program at the same time [3]: parallel execution. • When performance is gained from [1] and [2]: • There is a limit to how fast processors can operate. • Speed of light and electricity. • Heat dissipation. • Power consumption. • Instruction processing cannot be divided into infinitely many pipeline stages. • When performance improvements come from [3]: • Overhead of communications.

  29. A 10 TFlop/s CPU? • Question: Could we build a single CPU that delivers 10,000 billion floating point operations per second (10 TFlop/s), and operates over 10,000 billion bytes (10 TByte)? • Representative of what many scientists need today. • The clock rate would have to be 10,000 GHz • Assume that data travels at the speed of light • Assume that the computer is an “ideal” sphere

  30. A 10 TFlop/s CPU? • Assume that the machine issues one instruction per cycle • therefore the clock rate must be 10,000 GHz ~ 10^13 Hz • Data must travel some distance from the memory to the CPU • Assume that each instruction needs at least one 8-byte word of memory • Assume that data travels at the speed of light c = 3×10^8 m/s • Then the distance between the memory and the CPU must satisfy r < c / 10^13 = 3×10^-5 m • Then we must fit 10^13 bytes of memory in (4/3)πr³ ≈ 1.1×10^-13 m³ • Therefore, each byte of memory must occupy about 1.1×10^-26 m³ • This is roughly 11 nm³ (a cube about 2 nm on a side), molecular scale • Current memory densities are 10GB/cm³ (about 10^-16 m³ per byte), • or about a factor of 10^10 from what would be needed! • Conclusion: It’s not going to happen until some sci-fi breakthrough happens
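
Compactly, the argument chains as follows (same assumptions as the slide, with the arithmetic worked out):

$$r < \frac{c}{10^{13}\ \text{Hz}} = \frac{3\times 10^{8}\ \text{m/s}}{10^{13}\ \text{s}^{-1}} = 3\times 10^{-5}\ \text{m},\qquad V = \tfrac{4}{3}\pi r^{3} \approx 1.1\times 10^{-13}\ \text{m}^{3},$$
$$\frac{V}{10^{13}\ \text{bytes}} \approx 1.1\times 10^{-26}\ \text{m}^{3}/\text{byte} \approx 11\ \text{nm}^{3}/\text{byte}.$$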

  31. Agenda • HPC Introduction • HPC Applications • HPC Goals • Concurrency • History

  32. Concurrency • Since we cannot conceivably build a single CPU to solve relevant scientific problems, we resort to concurrency: • execution of multiple “tasks” at the “same” time • Concurrency is everywhere in computers • Load a word from memory while adding two registers • Adding two pairs of registers at the same time • Receiving data from the network while writing to disk • Dual-proc systems • Clusters of workstations • SETI@home • Some concurrency is “true” • meaning that things really happen at the same time • Some concurrency is just the illusion • of simultaneous execution, with rapid switching among activities

  33. Concurrent, parallel, distributed? • “Concurrency” is typically the more general term • A program is said to be concurrent if it contains more than one execution context • e.g., more than one thread/process • Typically the word “parallel” implies some notion of high performance / scientific application running on a single hardware platform • The word “distributed” typically refers to applications that run on multiple computers that may not be in the same room • These terms are conflated and misused all the time; in different research communities they mean different things. • We’ll see that distinctions are disappearing anyway

  34. Two Types of HPC • Parallel Computing • Breaking the problem to be computed into parts that can be run simultaneously on different processors • Distributed Computing • Parts of the work to be computed are computed in different places • Note: does not necessarily imply simultaneous processing • An example: the client/server (C/S) model • Solves loosely-coupled problems • (not much communication)

  35. Parallel Computing • Architectures of Parallel Computing • SMP (Symmetric Multi-Processing) • Multiple CPUs, single memory, shared I/O • All resources in an SMP machine are equally available to each CPU • Does not scale well to a large number of processors (typically fewer than 8) • NUMA (Non-Uniform Memory Access) • Multiple CPUs • Each CPU has fast access to its local area of the memory, but slower access to other areas • Scales well to a large number of processors • Complicated memory access pattern • MPP (Massively Parallel Processing) • Cluster

  36. Reasons for Concurrency • Concurrency arises for at least 4 reasons • To increase performance or memory capacity • To allow users and computers to “collaborate” • To capture the logical structure of a problem • To cope with independent physical devices

  37. Reason #1 • To increase performance

  38. Reason #1 (cont.) • To increase memory capacity • Example • A 3D weather simulation over Kaneohe Bay (1-meter resolution) • Say we consider a volume 2km x 2km x 1km over the bay • Each zone is characterized by, say, temperature, wind direction, wind velocity, air pressure and air moisture, for a total of (1+3+1+1+1)*8 = 56 bytes • Therefore we need about 208GB of memory to hold the data (the arithmetic is below) • Option #1: Buy a machine with > 208 GB RAM • A 96GB server from Sun: about one million dollars! • They have a 288GB configuration (contact them for price) • There is a 3TB shared-memory SGI machine at NCSA • Option #2: Couple individual machines together • Buy 52 4GB PowerEdge servers from Dell for $2.5K each • Slap some network on them and you’ve got enough memory • total cost: ~$200K • But: it’s not as simple as that!
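
Checking the slide's memory estimate (same numbers, worked out; taking 1 GiB = 2^30 bytes):

$$2000 \times 2000 \times 1000 = 4\times 10^{9}\ \text{zones},\qquad 4\times 10^{9} \times 56\ \text{B} = 2.24\times 10^{11}\ \text{B} \approx 208\ \text{GiB}.$$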

  39. Reason #1 (cont.) • Laser Interferometer Gravitational-Wave Observatory (LIGO) • Detects tiny distortions of space and time caused when very large masses, such as stars, move suddenly • 1TB/day (1024 GB/day), year-long experiments • The Compact Muon Solenoid (CMS) at CERN • A 12,000-ton detector designed to study proton-proton collisions with high-quality measurements • 10 GB/sec!!! Many PB/year (1024 TB/year)

  40. Reason #2 • To allow users and computers to collaborate • Example • Assume that we want to allow users to do on-line purchases • We need Web browsers, Web servers, Database servers • All these are processes • They all communicate with multiple processes simultaneously, they are all multithreaded, running on multiple machines, some of them are multi-proc servers • It’s just a big concurrent system and it is critical that it be fast and correct!

  41. Reason #3 • To capture the logical structure of a problem • Example • Let’s assume that we want to write a program that simulates the interactions between a robot and living entities • We can implement the robot as its own thread • The code is just the code of the robot • We can implement each entity as its own thread • The code is the simulation of the entity’s behavior • Now we let them “loose” at the beginning • They may meet, interact, etc. • All of this happens without a central notion of control, even though the program may be running on a single CPU • Concurrency just fits the problem (see the sketch below)
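
A minimal POSIX-threads sketch of this structure (my own illustration; entity_step() is a hypothetical placeholder for the robot/entity behavior code):

```c
/* One thread per simulated entity; thread 0 plays the robot. */
#include <pthread.h>
#include <stdio.h>

#define N_ENTITIES 4
#define N_STEPS    3

static void entity_step(int id, int step) {      /* placeholder behavior */
    printf("entity %d acts at step %d\n", id, step);
}

static void *entity_main(void *arg) {
    int id = (int)(long)arg;
    for (int step = 0; step < N_STEPS; step++)
        entity_step(id, step);                   /* robot or living entity */
    return NULL;
}

int main(void) {
    pthread_t threads[N_ENTITIES];
    for (long i = 0; i < N_ENTITIES; i++)
        pthread_create(&threads[i], NULL, entity_main, (void *)i);
    for (int i = 0; i < N_ENTITIES; i++)
        pthread_join(threads[i], NULL);
    return 0;
}
```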

  42. Reason #4 • To cope with independent physical devices • Example • Let’s assume that we want to write a program that receives data from the network, processes it, and writes output to the disk • We can read from the network and write to disk at the same time, “almost” for free • We can compute on the data while we receive from the network, “almost” for free • We can compute on the data while we write the previously computed data to disk, “almost” for free • We are better off writing this program as three concurrent threads (even if on a single CPU), as sketched below • Each thread uses one “independent” device of the computer
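
A sketch of that three-thread structure with POSIX threads (my own illustration: the network read and disk write are simulated by stand-in loops, and the single-slot buffers are a deliberately minimal hand-off mechanism):

```c
/* Reader -> worker -> writer pipeline connected by single-slot buffers. */
#include <pthread.h>
#include <stdio.h>

#define N_ITEMS 8

typedef struct {                 /* single-slot hand-off buffer */
    pthread_mutex_t lock;
    pthread_cond_t  ready;
    int             full;
    double          item;
} slot_t;

static void slot_put(slot_t *s, double v) {
    pthread_mutex_lock(&s->lock);
    while (s->full) pthread_cond_wait(&s->ready, &s->lock);
    s->item = v; s->full = 1;
    pthread_cond_signal(&s->ready);
    pthread_mutex_unlock(&s->lock);
}

static double slot_get(slot_t *s) {
    pthread_mutex_lock(&s->lock);
    while (!s->full) pthread_cond_wait(&s->ready, &s->lock);
    double v = s->item; s->full = 0;
    pthread_cond_signal(&s->ready);
    pthread_mutex_unlock(&s->lock);
    return v;
}

static slot_t in_q  = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 };
static slot_t out_q = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 };

static void *reader(void *arg) {     /* stands in for a network read */
    for (int i = 0; i < N_ITEMS; i++) slot_put(&in_q, (double)i);
    return NULL;
}
static void *worker(void *arg) {     /* computes while the others do I/O */
    for (int i = 0; i < N_ITEMS; i++) slot_put(&out_q, 2.0 * slot_get(&in_q));
    return NULL;
}
static void *writer(void *arg) {     /* stands in for a disk write */
    for (int i = 0; i < N_ITEMS; i++) printf("wrote %g\n", slot_get(&out_q));
    return NULL;
}

int main(void) {
    pthread_t t[3];
    pthread_create(&t[0], NULL, reader, NULL);
    pthread_create(&t[1], NULL, worker, NULL);
    pthread_create(&t[2], NULL, writer, NULL);
    for (int i = 0; i < 3; i++) pthread_join(t[i], NULL);
    return 0;
}
```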

  43. Agenda • HPC Introduction • HPC Applications • HPC Goals • Concurrency • History

  44. A brief history of concurrency • First machines were used in “single-user mode” • The user would declare: “I am going to use the machine from 2PM till 4PM” • Then the user would go into the special machine room and sit there for 2 hours • The user punched cards, which were prepared in advance • The user tried to run the program • The user tried to debug the program • etc. etc. • Extreme lack of productivity • During the user’s “thinking time”, the multi-million-$ machine did practically nothing!

  45. A brief history of concurrency • Batch Processing! • Instead of reserving the machine for a stretch of time to do all the activities (including debugging), the user just “submits” requests to a “queue” • The queue serves requests in order (possibly with priorities) • When a program fails and stops, another program is scheduled to use the machine “immediately” • Great! But what about the CPU idle time during I/O?

  46. A brief history of concurrency

  47. A brief history of concurrency • Multi-programming! • Multiple programs reside in memory at once • Required interrupts and memory protection • Interrupts are used to switch the CPU among programs and devices • Concurrency issues in the O/S • race conditions, deadlocks, critical sections • semaphores, monitors, etc. • beginnings of the theory of concurrent systems (1960s) • Increase in memory size • Development of virtual memory

  48. A brief history of concurrency • Multiprogramming system • three jobs in memory

  49. A brief history of concurrency • Time-sharing! • For fast, interactive response, one needs fast context switching • Makes it possible to have the illusion that one is alone on a (perhaps slower) machine • Already common by 1970 • Led to concurrency in user applications! • The user’s application is “logically” two concurrent tasks • The user can now implement it as two concurrent tasks!

  50. A brief history of concurrency • Technology advances! • Multiple CPUs on a motherboard • faster buses, shared-memory, cache coherency • Networked computers • distributed memory • Clusters, ..., Internet • Concurrency across CPUs • Also: Concurrency within the CPU at the hardware level • Beyond CPU and I/O devices • Multiple units (e.g., ALUs) • Vector processors • Pipelining
