
CS 491 Parallel & Distributed Computing


Presentation Transcript


  1. CS 491 Parallel & Distributed Computing

  2. Welcome to CS 491 • Instructor • Dan Stevenson • Office: P 136 • stevende@uwec.edu • Course Web Site: • http://www.cs.uwec.edu/~stevende/cs491/

  3. Getting Help: When you have questions • Regarding HELP with course materials and assignments • Come to office hours – Phillips 136 • TIME TBD (Check website) • OR by appointment (just e-mail or call my office) • Send me an e-mail: stevende@uwec.edu

  4. Textbooks • Required: • Michael J. Quinn • Parallel Programming in C with OpenMP and MPI • Suggested: • Web tutorials during the semester

  5. Overall Course Grading

  6. CS 491: Overview • Slide content for this course was collected from various sources, including: • Dr. Henry Neeman, University of Oklahoma • Dr. Libby Shoop, Macalester College • Dr. Charlie Peck, Earlham College • Tom Murphy, Contra Costa College • Dr. Robert Panoff, Shodor Foundation • Others, who will be credited…

  7. “Parallel & Distributed Computing” • What does it mean to you? • Coordinating Threads • Supercomputing • Multi-core Processors • Beowulf Clusters • Cloud Computing • Grid Computing • Client-Server • Scientific Computing • All contexts for “splitting up work” in an explicit way CS 491 – Parallel and Distributed Computing

  8. CS 491 • In this course, we will draw mostly from the context of “Supercomputing” • This is the field with the longest record of parallel computing expertise. • It also has a long record of being a source for “trickle-down” technology. CS 491 – Parallel and Distributed Computing

  9. What is Supercomputing? • Supercomputing is the biggest, fastest computing - right this minute. • Likewise, a supercomputer is one of the biggest, fastest computers right this minute. • The definition of supercomputing is, therefore, constantly changing. • A Rule of Thumb: A supercomputer is typically at least 100 times as powerful as a PC. • Jargon: Supercomputing is also known as High Performance Computing (HPC) or High End Computing (HEC) or Cyberinfrastructure (CI).

  10. Fastest Supercomputer vs. Moore • GFLOPs: billions of calculations per second • Over recent years, supercomputers have benefited directly from microprocessor performance gains, and have also gotten better at coordinating their efforts.

  11. Recent Champion • Jaguar – Oak Ridge National Laboratory (TN) • 224,162 processor cores – 1.76 PetaFLOP/second CS 491 – Parallel and Distributed Computing

  12. Current Champ • 2008 IBM Roadrunner: 1.1 petaflops • 2009 Cray Jaguar: 1.6 • 2010 Tianhe-1A (China): 2.6 • 2011 Fujitsu K (Japan): 10.5 • 88,128 8-core processors -> 705,024 cores • Needs power equivalent to 10,000 homes • Linpack numbers for comparison: • Core i7 – 2.3 Gflops • Galaxy Nexus – 97 Mflops CS 491 – Parallel and Distributed Computing

  13. Hold the Phone • Why should we care? • What useful thing actually takes a long time to run anymore? (especially long enough to warrant investing 7/8/9 figures on a supercomputer) • Important: It’s usually not about getting something done faster, but about getting a harder thing done in the same amount of time • This is often referred to as capability computing CS 491 – Parallel and Distributed Computing

  14. What Is HPC Used For? (image: tornadic storm) • Simulation of physical phenomena, such as • Weather forecasting • Galaxy formation • Oil reservoir management • Data mining: finding needles of information in a haystack of data, such as: • Gene sequencing • Signal processing • Detecting storms that might produce tornadoes (want forecasting, not retrocasting…) • Visualization: turning a vast sea of data into pictures that a scientist can understand • Oak Ridge National Lab has a 512-core cluster devoted entirely to visualization runs

  15. CS 491 – Parallel and Distributed Computing

  16. What is Supercomputing About? • Size • Speed (image: laptop shown for comparison)

  17. What is Supercomputing About? • Size: Many problems that are interesting™ can’t fit on a PC – usually because they need more than a few GB of RAM, or more than a few hundred GB of disk. • Speed: Many problems that are interesting™ would take a very, very long time to run on a PC: months or even years. But a problem that would take a month on a PC might take only a few hours on a supercomputer.

  18. Supercomputing Issues • Parallelism: doing multiple things at the same time • finding and coordinating this can be challenging • The tyranny of the storage hierarchy • The hardware you’re running on matters • Moving data around is often more expensive than actually computing something

  19. Parallel Computing Hardware CS 491 – Parallel and Distributed Computing

  20. Parallel Processing • The term parallel processing is usually reserved for the situation in which a single task is executed on multiple processors • Discounts the idea of simply running separate tasks on separate processors – a common thing to do to get high throughput, but not really parallel processing Key questions in hardware design: • How do parallel processors share data and communicate? • shared memory vs distributed memory • How are the processors connected? • single bus vs network • The number of processors is determined by a combination of #1 and #2

  21. How is Data Shared? • Shared Memory Systems • All processors share one memory address space and can access it • Information sharing is often implicit • Distributed Memory Systems (AKA “Message Passing Systems”) • Each processor has its own memory space • All data sharing is done via programming primitives to pass messages • i.e. “Send data value to processor 3” • Information sharing is always explicit

  22. Message Passing • Processors communicate via messages that they send to each other: send and receive • This form is required for multiprocessors that have separate private memories for each processor • Cray T3E • “Beowulf Cluster” • SETI@HOME • Note: shared memory multiprocessors can also have separate memories – they just aren’t “private” to each processor
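
A minimal sketch of the send/receive style described above, in C with MPI (the library used by this course's textbook); the tag value and the choice of two processes are illustrative assumptions, not a prescribed pattern:

    /* Message-passing sketch: process 0 sends one integer to process 1.
       All data sharing is explicit -- nothing is visible to the other
       process until it is sent and received. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;   /* lives only in process 0's private memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Process 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }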

  23. Shared Memory Systems • Processors all operate independently, but operate out of the same logical memory. • Data structures can be read by any of the processors • To properly maintain ordering in our programs, synchronization primitives are needed! (locks/semaphores)
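
For contrast, a minimal shared-memory sketch in C with OpenMP (the other library in the course textbook): every thread updates the same variable, so a synchronization primitive, here a critical section, is needed to keep increments from being lost.

    /* Shared-memory sketch: all threads operate out of the same logical
       memory, so the update to the shared counter must be synchronized. */
    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        int counter = 0;                 /* shared by every thread */

        #pragma omp parallel
        {
            #pragma omp critical         /* lock-like primitive: one thread at a time */
            counter++;
        }

        printf("Threads that incremented the counter: %d\n", counter);
        return 0;
    }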

  24. Connecting Multiprocessors

  25. Single Bus Multiprocessor • Connect several processors via a single shared bus • bus bandwidth limits the number of processors • local cache lowers bus traffic • single memory module attached to the bus • Limited to very small systems! • Intel processors support this mode by default

  26. The Cache Coherence Problem

  27. Cache Coherence Solutions • Two most common variations: • “snoopy” schemes • rely on broadcast to observe all coherence traffic • well suited for buses and small-scale systems • example: SGI Challenge or Intel x86 • directory schemes • uses centralized information to avoid broadcast • scales well to large numbers of processors • example: SGI Origin/Altix

  28. Snoopy Cache Coherence Schemes • Basic Idea: • all coherence-related activity is broadcast to all processors • e.g., on a global bus • each processor monitors (aka “snoops”) these actions and reacts to any which are relevant to the current contents of its cache • examples: • if another processor wishes to write to a line, you may need to “invalidate” (i.e. discard) the copy in your own cache • if another processor wishes to read a line for which you have a dirty copy, you may need to supply it • Most common approach in commercial shared-memory multiprocessors. • Protocol is a distributed algorithm: cooperating state machines • Set of states, state transition diagram, actions
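
As a rough illustration of those two reactions, here is a simplified MSI-style state update in C; this is a teaching sketch, not any particular machine's protocol (real protocols such as MESI have more states and also specify the data transfers).

    /* Simplified snoop reaction for one cache line (illustrative sketch).
       Each cache watches the bus and reacts to traffic for lines it holds. */
    typedef enum { INVALID, SHARED, MODIFIED } LineState;
    typedef enum { BUS_READ, BUS_WRITE } BusEvent;

    LineState snoop(LineState current, BusEvent event) {
        if (event == BUS_WRITE)
            return INVALID;     /* another processor is writing: discard our copy */
        if (event == BUS_READ && current == MODIFIED)
            return SHARED;      /* we hold the dirty copy: supply it, then downgrade */
        return current;         /* otherwise nothing to do */
    }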

  29. Network Connected Multiprocessors • In the single bus case, the bus is used for every main memory access • In the network connected model, the network is used only for inter-process communication • There are multiple “memories” BUT that doesn’t mean that there are separate memory spaces

  30. Directory Coherence • Network-based machines do not want to use a snooping coherence protocol! • Means that every memory transaction would need to be sent everywhere! • Directory-based systems use a global “Directory” to arbitrate who owns data • Point-to-point communication with the directory instead of bus broadcasts • The directory keeps a list of what caches have the data in question • When a write to that data occurs, all of the affected caches can be notified directly
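
One way to picture a directory entry is as an owner plus a bit-vector of sharers. The sketch below assumes a simple full-bit-vector organization, a common textbook design rather than any specific machine's:

    /* Sketch of a full-bit-vector directory entry. On a write, only the
       nodes whose sharer bits are set receive point-to-point invalidations;
       no bus broadcast is needed. */
    #include <stdint.h>

    #define MAX_NODES 64

    typedef struct {
        uint64_t sharers;   /* bit i set => node i's cache holds this block */
        int      owner;     /* node with a dirty copy, or -1 if memory is clean */
    } DirEntry;

    void handle_write(DirEntry *e, int writer) {
        for (int node = 0; node < MAX_NODES; node++) {
            if ((e->sharers & (1ULL << node)) && node != writer) {
                /* send an invalidation message to 'node' (network call omitted) */
            }
        }
        e->sharers = 1ULL << writer;     /* writer is now the only holder */
        e->owner   = writer;
    }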

  31. Network Topologies: Ring • Each node (processor) contains its own local memory • Each node is connected to the network via a switch • Messages hop along the ring from node to node until they reach the proper destination

  32. Network Topologies: 2D Mesh • 2D grid, or mesh, of nodes • Each “inside” node has 4 neighbors • “outside” nodes only have 2 • If all nodes have four neighbors, then this is a 2D torus

  33. Network Topologies: Hypercube • Also called an n-cube • For n=2 → 2D cube (4 nodes → a square) • For n=3 → 3D cube (8 nodes) • For n=4 → 4D cube (16 nodes) • In an n-cube, all nodes have n neighbors (images: 3-cube and 4-cube)
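
The structure is easy to compute: give the 2^n nodes n-bit IDs, and two nodes are neighbors exactly when their IDs differ in a single bit. A small sketch in C:

    /* Hypercube neighbors: flipping one bit of a node's n-bit ID gives each
       of its n neighbors. Example: in a 3-cube, node 5 (101) neighbors
       4 (100), 7 (111), and 1 (001). */
    #include <stdio.h>

    int main(void) {
        int n = 3;        /* 3-cube: 8 nodes, 3 neighbors per node */
        int node = 5;

        for (int bit = 0; bit < n; bit++)
            printf("neighbor across dimension %d: %d\n", bit, node ^ (1 << bit));
        return 0;
    }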

  34. Network Topologies: Full Crossbar • Every node can communicate directly with every other node in only one pass → fully connected network • n nodes → n² switches • Therefore, extremely expensive to implement!

  35. Network Topologies: Butterfly Network (images: Omega network, switch box) • Fully connected, but requires passes through multiple switch boxes • Less hardware required than a crossbar, but contention can occur

  36. Flynn’s Taxonomy of Computer Systems (1966) A simple model for categorizing computers, with 4 categories: • SISD – Single Instruction Single Data • the standard uniprocessor model • SIMD – Single Instruction Multiple Data • Full systems that are “true” SIMD are no longer in use • Many of the concepts live on in vector processing and, to some extent, graphics cards • MISD – Multiple Instruction Single Data • doesn’t really make sense • MIMD – Multiple Instruction Multiple Data • the most common model in use

  37. “True” SIMD • A single instruction is applied to multiple data elements in parallel – same operation on all elements at the same time • Most well known examples are: • Thinking Machines CM-1 and CM-2 • MasPar MP-1 and MP-2 • others • All are out of existence now • SIMD requires massive data parallelism • Usually have LOTS of very very simple processors (e.g. 8-bit CPUs)

  38. Vector Processors • Closely related to SIMD • Cray J90, Cray T90, Cray SV1, NEC SX-6 • Starting to “merge” with MIMD systems • Cray X1E and upcoming systems (“Cascade”) • Use a single instruction to operate on an entire vector of data • Difference from “True” SIMD is that data in a vector processor is not operated on in true parallel, but rather in a pipeline • Uses “vector registers” to feed a pipeline for the vector operation • Generally have memory systems optimized for “streaming” of large amounts of consecutive or strided data • (Because of this, they typically didn’t have caches until the late 90s)
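
The kind of loop a vector processor is designed for applies one operation across a whole array, as in this SAXPY-style sketch; on a vector machine the loop is executed as vector loads, a vector multiply-add, and vector stores streaming through the vector registers.

    /* Classic vectorizable loop: the same multiply-add is applied to every
       element, so the hardware can stream x and y through vector registers
       rather than issuing one scalar instruction per element. */
    void saxpy(int n, float a, const float *x, float *y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }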

  39. MIMD • Multiple instructions are applied to multiple data • The multiple instructions can come from the same program, or from different programs • Generally “parallel processing” implies the first • Most modern multiprocessors are of this form • IBM Blue Gene, Cray T3D/T3E/XT3/4/5, SGI Origin/Altix • Clusters

  40. Parallel Computing Hardware “Supercomputer Edition” CS 491 – Parallel and Distributed Computing

  41. The Most Common Supercomputer: Clustering • A parallel computer built out of commodity hardware components • PCs or server racks • Commodity network (like Ethernet) • Often running a free-software OS like Linux, with a low-level software library to facilitate multiprocessing • Use software to send messages between machines • The standard is to use MPI (Message Passing Interface)
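
A minimal MPI "hello world" for such a cluster, assuming an implementation like MPICH or Open MPI is installed; each node runs a copy of the program and reports its rank. It would typically be built with mpicc and launched across the nodes with mpirun or the cluster's batch scheduler.

    /* Cluster hello world: one copy of this program runs per processor,
       and MPI tells each copy its rank within the whole job. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's ID */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total processes in the job */
        printf("Hello from process %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }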

  42. What is a Cluster? “… [W]hat a ship is … It's not just a keel and hull and a deck and sails. That's what a ship needs. But what a ship is ... is freedom.” – Captain Jack Sparrow “Pirates of the Caribbean”

  43. What a Cluster is …. • A cluster needs a collection of small computers, called nodes, hooked together by an interconnection network • It also needs software that allows the nodes to communicate over the interconnect. • But what a cluster is … is all of these components working together as if they’re one big computer (a supercomputer)

  44. What a Cluster is …. • nodes • PCs • Server rack nodes • interconnection network • Ethernet (“GigE”) • Myrinet (“10GigE”) • Infiniband (low latency) • The Internet (not really – typically called “Grid”) • software • OS • Generally Linux • Redhat / CentOS / SuSE • Windows HPC Server • Libraries (MPICH, PBLAS, MKL, NAG) • Tools (Torque/Maui, Ganglia, GridEngine)

  45. An Actual (Production) Cluster (photo: interconnect and nodes labeled)

  46. Other Actual Clusters… CS 491 – Parallel and Distributed Computing

  47. What a Cluster is NOT… • At the high end, many supercomputers are made with custom parts • Custom backplane/network • Custom/Reconfigurable processors • Extreme Custom cooling • Custom memory system • Examples: • IBM Blue Gene • Cray XT4/5/6 • SGI Altix CS 491 – Parallel and Distributed Computing

  48. Moore’s Law

  49. Moore’s Law • In 1965, Gordon Moore was an engineer at Fairchild Semiconductor. • He noticed that the number of transistors that could be squeezed onto a chip was doubling about every 18 months. • It turns out that computer speed was roughly proportional to the number of transistors per unit area. • Moore wrote a paper about this concept, which became known as “Moore’s Law.”

  50. Fastest Supercomputer vs. Moore • GFLOPs: billions of calculations per second
