Parallel platforms, etc.

Presentation Transcript


  1. Parallel platforms, etc. Dr. Marco Antonio Ramos Corchado

  2. Where parallel computing is used

  3. Where parallel computing is used

  4. Taxonomy of platforms?
  • It would be nice to have a great taxonomy of parallel platforms in which we could pigeon-hole all past and present systems
  • But it's not going to happen
  • Recently, Gordon Bell and Jim Gray published an article in the Communications of the ACM discussing what the taxonomy should be
  • Dongarra, Sterling, etc. answered that they were wrong, said what the taxonomy should really be, and proposed a new multi-dimensional scheme!
  • Both papers agree that terms are conflated, misused, etc. (e.g., MPP)
  • We'll look at one traditional taxonomy
  • We'll look at current categorizations from the Top500
  • We'll look at examples of platforms
  • We'll look at interesting/noteworthy architectural features that one should know as part of one's parallel computing culture
  • What about conceptual models of parallel machines?

  5. The Flynn taxonomy
  • Proposed in 1966!!!
  • Functional taxonomy based on the notion of streams of information: data and instructions
  • Platforms are classified according to whether they have a single (S) or multiple (M) stream of each of the above
  • Four possibilities:
    • SISD (sequential machine)
    • SIMD
    • MIMD
    • MISD (rare, no commercial system... systolic arrays)

  6. SIMD
  (figure: a control unit fetches, decodes, and broadcasts a single stream of instructions to many processing elements)
  • PEs can be deactivated and activated on-the-fly
  • Vector processing (e.g., vector add) is easy to implement on SIMD (see the sketch below)
  • Debate: is a vector processor an SIMD machine?
    • often confused
    • strictly not true according to the taxonomy (it's really SISD with pipelined operations)
    • more later on vector processors
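A minimal C sketch of the data-parallel model behind SIMD (illustrative only; all names and values are made up): a single operation is conceptually applied to every element at once, and an activity mask stands in for deactivating individual PEs.

```c
#include <stdio.h>

#define N 8

int main(void) {
    float a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[N] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[N];
    int active[N] = {1, 1, 0, 1, 1, 0, 1, 1};  /* "deactivated" PEs skip the op */

    for (int i = 0; i < N; i++)     /* conceptually: all i in the same step */
        if (active[i])
            c[i] = a[i] + b[i];     /* one broadcast instruction, many data */
        else
            c[i] = 0.0f;

    for (int i = 0; i < N; i++) printf("%g ", c[i]);
    printf("\n");
    return 0;
}
```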

  7. MIMD
  • Most general category
  • Pretty much everything in existence today is a MIMD machine at some level
    • This limits the usefulness of the taxonomy
    • But you have to have heard of it at least once, because people keep referring to it, somehow...
  • Other taxonomies have been proposed, none very satisfying
  • Shared- vs. distributed-memory is a common distinction among machines, but these days many are hybrid anyway

  8. A host of parallel machines
  • There are (and have been) many kinds of parallel machines
  • For the last 11 years their performance has been measured and recorded with the LINPACK benchmark, as part of the Top500
  • It is a good source of information about what machines are (or were) like and how they have evolved: http://www.top500.org

  9. What is the LINPACK Benchmark?
  • LINPACK: "LINear algebra PACKage"
    • A FORTRAN library
    • Matrix multiply, LU/QR/Cholesky factorizations, eigensolvers, SVD, etc.
  • LINPACK Benchmark
    • Dense linear system solve with LU factorization
    • 2/3 n^3 + O(n^2) floating-point operations
    • Measure: MFlops (see the sketch below)
    • The problem size can be chosen
    • You have to report the best performance for the best n, and the n that achieves half of the best performance
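For concreteness, a small C sketch of how the reported rate follows from the flop count above. The problem size and timing are hypothetical numbers, the O(n^2) term is taken as 2n^2 for illustration, and this is not the actual benchmark code.

```c
#include <stdio.h>

/* Toy calculation: given the chosen problem size n and the measured time of
 * the LU-based solve, the rate is the flop count divided by elapsed time. */
double linpack_mflops(double n, double seconds) {
    double flops = (2.0 / 3.0) * n * n * n + 2.0 * n * n;  /* 2/3 n^3 + O(n^2) */
    return flops / seconds / 1e6;                          /* report in MFlops */
}

int main(void) {
    double n = 10000.0, seconds = 350.0;   /* hypothetical run */
    printf("n = %.0f solved in %.0f s: %.1f MFlops\n",
           n, seconds, linpack_mflops(n, seconds));
    return 0;
}
```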

  10. What can we find on the Top500?

  11. Pies

  12. Platform Architectures (chart categories: SIMD, Cluster, Vector, Constellation, SMP, MPP)

  13. SIMD
  • ILLIAC-IV, TMC CM-1, MasPar MP-1
  • Expensive logic for the CU, but there is only one
  • Cheap logic for the PEs, and there can be a lot of them
    • 32 procs on 1 chip of the MasPar, a 1024-proc system with 32 chips that fit on a single board!
    • 65,536 processors for the CM-1
  • Thinking Machines' gimmick was that the human brain consists of many simple neurons that are turned on and off, and so was their machine
    • CM-5: hybrid SIMD and MIMD
  • Death
    • The machines are no longer popular, but the programming model is
    • Vector processors are often labeled SIMD because that's in effect what they do, but they are not SIMD machines
    • Led to the MPP terminology (Massively Parallel Processor)
    • Ironic, because none of today's "MPPs" are SIMD

  14. SMPs
  (figure: processors P1..Pn, each with a cache ($), connected by a network/bus to a shared memory)
  • "Symmetric MultiProcessors" (often mislabeled as "Shared-Memory Processors", which has now become tolerated)
  • Processors are all connected to a (large) memory
  • UMA: Uniform Memory Access, which makes it easy to program (e.g., with OpenMP; see the sketch below)
  • Symmetric: all memory is equally close to all processors
  • Difficult to scale to many processors (<32 typically)
  • Cache coherence via "snoopy caches"
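A minimal shared-memory sketch in C with OpenMP (illustrative, not tied to any particular SMP): every thread can address all of a, b, and c directly, and the runtime simply splits the loop iterations among the threads.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];   /* one shared address space */

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    #pragma omp parallel for          /* threads share a, b, c */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];           /* no explicit communication needed */

    printf("computed c = a + b using up to %d threads\n", omp_get_max_threads());
    return 0;
}
```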

  15. Distributed Shared Memory
  (figure: processors P1..Pn with caches ($), each attached to a local memory bank, all linked by a network)
  • Memory is logically shared, but physically distributed in banks
  • Any processor can access any address in memory
  • Cache lines (or pages) are passed around the machine
  • Cache coherence: distributed directories
  • NUMA: Non-Uniform Memory Access (some processors may be closer to some banks)
  • The SGI Origin2000 is a canonical example
  • Scales to 100s of processors
  • Hypercube topology for the memory (later)
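A common idiom on NUMA machines is "first touch" placement; a minimal C/OpenMP sketch, assuming an operating system (such as Linux) that places each page on the memory bank of the thread that first writes it. Initializing with the same static schedule as the compute loop keeps most accesses on local banks; sizes here are arbitrary.

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    long n = 1L << 25;
    double *a = malloc(n * sizeof *a);
    if (!a) return 1;

    #pragma omp parallel for schedule(static)   /* first touch: places the pages */
    for (long i = 0; i < n; i++)
        a[i] = 0.0;

    #pragma omp parallel for schedule(static)   /* same distribution: mostly local banks */
    for (long i = 0; i < n; i++)
        a[i] = a[i] + 1.0;

    printf("a[0] = %g, a[n-1] = %g\n", a[0], a[n - 1]);
    free(a);
    return 0;
}
```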

  16. Clusters, Constellations, MPPs
  (figure: nodes P0..Pn, each with its own memory and a network interface (NI), connected by an interconnect)
  • These are the only 3 categories today in the Top500
  • They all belong to the distributed-memory model (MIMD) (with many twists)
  • Each processor/node has its own memory and cache but cannot directly access another processor's memory
    • nodes may be SMPs
  • Each "node" has a network interface (NI) for all communication and synchronization (see the MPI sketch below)
  • So what are these 3 categories?
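A minimal C/MPI sketch of the distributed-memory model (illustrative values; run with at least two ranks): each rank owns its memory, and data moves only through explicit messages over the interconnect.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    if (rank == 0) {
        value = 42;                                   /* lives only in rank 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d over the interconnect\n", value);
    }

    MPI_Finalize();
    return 0;
}
```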

  17. Clusters
  • 58.2% of the Top500 machines are labeled as "clusters"
  • Definition: a parallel computer system comprising an integrated collection of independent "nodes", each of which is a system in its own right, capable of independent operation, and derived from products developed and marketed for other standalone purposes
  • A commodity cluster is one in which both the network and the compute nodes are available on the market
  • In the Top500, "cluster" means "commodity cluster"
  • A well-known type of commodity cluster is the "Beowulf-class PC cluster", or "Beowulf"

  18. What is Beowulf?
  • An experiment in parallel computing systems
  • Established a vision of low-cost, high-end computing with public-domain software (and led to software development)
  • Tutorials and a book on best practice for how to build such platforms
  • Today, a "Beowulf cluster" means a commodity cluster that runs Linux and GNU-type software
  • Project initiated by T. Sterling and D. Becker at NASA in 1994

  19. Constellations???
  • Commodity clusters that differ from the previous ones by the dominant level of parallelism
  • Clusters consist of nodes, and nodes are typically SMPs
  • If there are more processors in a node than nodes in the cluster, then we have a constellation
  • Typically, constellations are space-shared among users, with each user running OpenMP on a node, although an app could run on the whole machine using MPI/OpenMP
  • To be honest, this term is not very useful and not much used

  20. MPP????????
  • Probably the most imprecise term for describing a machine (isn't a 256-node cluster of 4-way SMPs massively parallel?)
  • May use proprietary networks or vector processors, as opposed to commodity components
  • The IBM SP2, Cray T3E, IBM SP-4 (DataStar), Cray X1, and Earth Simulator are distributed-memory machines, but the nodes are SMPs
  • Basically, everything that's fast and not commodity is an MPP, in terms of today's Top500
  • Let's look at these "non-commodity" things

  21. Vector Processors
  • Vector architectures were based on a single processor
    • Multiple functional units
    • All performing the same operation
    • Instructions may specify large amounts of parallelism (e.g., 64-way) but the hardware executes only a subset in parallel
  • Historically important
    • Overtaken by MPPs in the 90s, as seen in the Top500
  • Re-emerging in recent years
    • At a large scale in the Earth Simulator (NEC SX-6) and Cray X1
    • At a small scale in SIMD media extensions to microprocessors (see the sketch below)
      • SSE, SSE2 (Intel: Pentium/IA64)
      • Altivec (IBM/Motorola/Apple: PowerPC)
      • VIS (Sun: Sparc)
  • Key idea: the compiler does some of the difficult work of finding parallelism, so the hardware doesn't have to
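For the media-extension flavor, a small C sketch using SSE intrinsics (illustrative; assumes an x86 processor with SSE): one instruction adds four packed single-precision floats, and a vectorizing compiler can emit the same code from the plain scalar loop.

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* One _mm_add_ps adds four floats at a time; a scalar loop handles the tail. */
void add4(const float *a, const float *b, float *c, int n) {
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
    for (; i < n; i++)                   /* scalar cleanup for the remainder */
        c[i] = a[i] + b[i];
}

int main(void) {
    float a[6] = {1, 2, 3, 4, 5, 6}, b[6] = {10, 20, 30, 40, 50, 60}, c[6];
    add4(a, b, c, 6);
    for (int i = 0; i < 6; i++) printf("%g ", c[i]);
    printf("\n");
    return 0;
}
```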

  22. Vector Processors
  (figure: a vector add, vr3 = vr1 + vr2, applied element-wise across vector registers, compared with a scalar add r3 = r1 + r2)
  • Definition: a processor that can do element-wise operations on entire vectors with a single instruction, called a vector instruction
  • These are specified as operations on vector registers
  • A processor comes with some number of such registers
    • A vector register holds ~32-64 elements
  • The number of elements is larger than the amount of parallel hardware, called vector pipes or lanes (say 2-4)
  • The hardware performs a full vector operation in (#elements-per-vector-register / #pipes) steps
    • logically, it performs #elements adds in parallel
    • actually, it performs #pipes adds in parallel

  23. Vector Processors
  • Advantages
    • quick fetch and decode of a single instruction for multiple operations
    • the instruction provides the processor with a regular source of data, which can arrive at each cycle and be processed in a pipelined fashion
    • the compiler does the work for you, of course
  • Memory-to-memory machines
    • no registers
    • can process very long vectors, but the startup time is large
    • appeared in the 70s and died in the 80s
  • Cray, Fujitsu, Hitachi, NEC

  24. Global Address Space
  (figure: nodes P0..Pn, each with a memory and a network interface (NI), connected by an interconnect)
  • Cray T3D, T3E, X1, and HP AlphaServer clusters
  • The network interface supports "Remote Direct Memory Access"
    • The NI can directly access memory without interrupting the CPU
    • One processor can read/write remote memory with one-sided operations (put/get)
    • Not just a load/store as on a shared-memory machine
    • Remote data is typically not cached locally
  • (remember the MPI-2 extension; see the sketch below)
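Along the lines of the MPI-2 one-sided extension mentioned above, a minimal C sketch (illustrative; run with at least two ranks): rank 0 puts a value directly into a window exposed by rank 1, without rank 1 posting a matching receive.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int buf = -1;                        /* memory exposed to remote access */
    MPI_Win win;
    MPI_Win_create(&buf, sizeof buf, sizeof buf, MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        int value = 42;
        /* one-sided: the target's CPU is not involved in this transfer */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1) printf("rank 1's buffer now holds %d\n", buf);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```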

  25. Cray X1: Parallel Vector Architecture
  Cray combines several technologies in the X1:
  • 12.8 Gflop/s vector processors (MSP)
  • Shared caches (unusual on earlier vector machines)
  • 4-processor nodes sharing up to 64 GB of memory
  • Single system image up to 4096 processors
  • Remote put/get between nodes (faster than MPI)

  26. Cray X1: the MSP
  • The Cray X1 building block is the MSP
    • Multi-Streaming vector Processor
    • 4 SSPs (each a 2-pipe vector processor)
    • The compiler will (try to) vectorize/parallelize across the MSP, achieving "streaming"
  (figure, source J. Levesque, Cray — labels: custom blocks; 12.8 Gflops (64 bit) / 25.6 Gflops (32 bit); 4 scalar units (S) and 8 vector pipes (V); four 0.5 MB shared caches forming a 2 MB Ecache; frequency 400/800 MHz; 51 GB/s and 25-41 GB/s cache bandwidth; 25.6 GB/s and 12.8-20.5 GB/s to local memory and network)

  27. Cray X1: A node
  (figure: 16 processors (P), each with a cache ($), 16 M/mem memory banks, and two I/O blocks)
  • Shared memory
  • 32 network links and four I/O links per node

  28. Cray X1: 32 nodes
  (figure: 32 nodes connected through routers (R) and a fast switch)

  29. Cray X1: 128 nodes

  30. Cray X1: Parallelism
  • Many levels of parallelism
    • Within a processor: vectorization
    • Within an MSP: streaming
    • Within a node: shared memory
    • Across nodes: message passing
  • Some are automated by the compiler, some require work by the programmer (see the skeleton below)
  • Hard to fit the machine into a simple taxonomy
  • Similar story for the Earth Simulator
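To make the programmer-visible levels concrete, a generic hybrid C skeleton (not Cray-specific; sizes and distribution are illustrative): MPI for message passing across nodes, OpenMP for shared memory within a node, and an inner loop left to the compiler to vectorize.

```c
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

#define N 1024

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    static double a[N], b[N];
    double local = 0.0, global = 0.0;

    /* shared-memory level: threads split the iterations of this rank */
    #pragma omp parallel for reduction(+:local)
    for (int i = rank; i < N; i += size) {   /* message-passing level: ranks split the data */
        a[i] = i; b[i] = 2.0 * i;
        local += a[i] * b[i];                /* innermost level: vectorizable arithmetic */
    }

    /* combine the per-rank partial sums across the machine */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("dot product = %g\n", global);

    MPI_Finalize();
    return 0;
}
```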

  31. The Earth Simulator (NEC)
  • Each node:
    • Shared memory (16 GB)
    • 8 vector processors + an I/O processor
  • 640 nodes fully connected by a 640x640 crossbar switch
  • Total: 5,120 processors at 8 Gflop/s each -> ~40 Tflop/s peak

  32. DataStar
  • 8-way or 32-way Power4 SMP nodes
  • Connected via IBM's Federation (formerly Colony) interconnect
    • 8-ary fat-tree topology
  • 1,632 processors
  • 10.4 TeraFlops
  • Each node is directly connected via fiber to IBM's GPFS (parallel file system)
  • Similar to the SP-x series, but with higher bandwidth and a higher-arity fat-tree

  33. Blue Gene/L
  • 65,536 processors (still being assembled)
  • Relatively modest clock rates, so that power consumption is low, cooling is easy, and the footprint is small (1,024 nodes in a single rack)
    • Besides, processor speed is on par with memory speed, so a faster clock would not help much
  • 2-way SMP nodes!
  • Several networks
    • 64x32x32 3-D torus for point-to-point communication
    • a tree for collective operations and for I/O
    • plus Ethernet and others

  34. If you like dead supercomputers
  • Lots of old supercomputers, with pictures: http://www.geocities.com/Athens/6270/superp.html
  • Dead Supercomputers: http://www.paralogos.com/DeadSuper/Projects.html
  • e-Bay: a Cray Y-MP/C90 (1993) sold for $45,100.70
    • From the Pittsburgh Supercomputing Center, which wanted to get rid of it to make space in its machine room
    • Original cost: $35,000,000
    • Weight: 30 tons
    • Cost $400,000 to make it work at the buyer's ranch in Northern California

  35. Network Topologies
  • People have experimented with different topologies for distributed-memory machines, or to arrange memory banks in NUMA shared-memory machines
  • Examples include:
    • Ring: KSR (1991)
    • 2-D grid: Intel Paragon (1992)
    • Torus
    • Hypercube: nCube, Intel iPSC/860; used in the SGI Origin 2000 for memory
    • Fat-tree: IBM Colony and Federation interconnects (SP-x)
    • Arrangements of switches
      • pioneered with "butterfly networks" like the BBN TC2000 in the early 1990s
      • 200 MHz processors in a multi-stage network of switches
      • virtually shared distributed memory (NUMA)
      • I actually worked with that one!

  36. Hypercube
  • Defined by its dimension, d
  (figure: hypercubes of dimension 1, 2, 3, and 4)

  37. Hypercube
  • Properties
    • Has 2^d nodes
    • The number of hops between two nodes is at most d
    • The diameter of the network grows logarithmically with the number of nodes, which was the key reason for interest in hypercubes
    • But each node needs d neighbors, which is a problem
  • Routing and Addressing
    • d-bit addresses (figure: a 4-D hypercube with nodes labeled 0000 through 1111)
    • routing from xxxx to yyyy: just keep going to a neighbor that has a smaller Hamming distance to the destination (see the sketch below)
    • reminiscent of some p2p things
    • TONS of hypercube research (even today!!)
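A small C sketch of that greedy routing rule (illustrative; the labels match a 4-D cube): at every step, flip one bit on which the current address still differs from the destination, so the Hamming distance drops by one per hop.

```c
#include <stdio.h>

/* At each hop, move across the dimension of the lowest differing bit. */
unsigned next_hop(unsigned current, unsigned dest) {
    unsigned diff = current ^ dest;
    if (diff == 0) return current;          /* already at the destination */
    unsigned lowest = diff & -diff;         /* lowest differing bit */
    return current ^ lowest;                /* neighbor across that dimension */
}

int main(void) {
    unsigned node = 0x0;                    /* 0000 */
    unsigned dest = 0xB;                    /* 1011 */
    printf("%x", node);
    while (node != dest) {
        node = next_hop(node, dest);
        printf(" -> %x", node);
    }
    printf("\n");                           /* prints 0 -> 1 -> 3 -> b: 3 hops */
    return 0;
}
```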

  38. Systolic Array?
  • Array of processors in some topology, with each processor having a few neighbors
    • typically a 1-D linear array or a 2-D grid
  • Processors perform regular sequences of operations on data that flows between them
    • e.g., receive from my left and top neighbors, compute, pass to my right and bottom neighbors
  • Like SIMD machines, everything happens in lock step
  • Example: CMU's iWarp, built by Intel (1988 or so)
  • Allows for convenient algorithms for some problems (see the sketch below)
  • Today: used in FPGA systems that implement systolic arrays to run a few algorithms
    • regular computations (matrix multiply)
    • genetic algorithms
  • Impact: allows us to reason about algorithms
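A C sketch that simulates the classic 2-D systolic matrix-multiply schedule (a software emulation of the timing only, not of the physical register shifting; sizes and values are illustrative): cell (i,j) consumes a[i][k] and b[k][j] at time step t = i + j + k, as if A streamed in from the left and B from the top.

```c
#include <stdio.h>

#define N 3

int main(void) {
    double a[N][N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
    double b[N][N] = {{9, 8, 7}, {6, 5, 4}, {3, 2, 1}};
    double c[N][N] = {{0}};

    for (int t = 0; t <= 3 * (N - 1); t++)          /* global clock ticks */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                int k = t - i - j;                  /* operand pair arriving at cell (i,j) now */
                if (k >= 0 && k < N)
                    c[i][j] += a[i][k] * b[k][j];   /* multiply-accumulate in the cell */
            }

    for (int i = 0; i < N; i++) {                   /* print the resulting product */
        for (int j = 0; j < N; j++) printf("%6.1f", c[i][j]);
        printf("\n");
    }
    return 0;
}
```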

  39. Models for Parallel Computation
  • We have seen broad taxonomies of machines, examples of machines, and techniques to program them (OpenMP, MPI, etc.)
  • At this point, how does one reason about parallel algorithms, about their complexity, about their design, etc.?
  • What one needs is abstract models of parallel platforms
    • Some are truly abstract
    • Some are directly inspired by actual machines
  • Although these machines may no longer exist or be viable, the algorithms can be implemented on more relevant architectures, or at least give us clues
    • e.g., matrix multiply on a systolic array helps with matrix multiply on a logical 2-D grid topology that sits on top of a cluster of workstations
  • PRAM, sorting networks, systolic arrays, etc.
