
Types of Parallelism


Presentation Transcript


  1. Types of Parallelism
     • Overt
       - Parallelism is visible to the programmer
       - Difficult to do (right)
       - Large improvements in performance
     • Covert
       - Parallelism is not visible to the programmer
       - Compiler responsible for parallelism
       - Easy to do
       - Small improvements in performance

  2. Parallel Architectures
     • For a long time, parallel programs were written with a specific architecture in mind
       - Programs would only run on one type of machine, perhaps even on only one machine
       - A programmer had to fit a problem to a specific architecture
     • Over time, programmers started writing programs in a particular style
       - The programs are then mapped onto a specific machine by a compiler

  3. Problem Architectures
     • Synchronous (SIMD)
       - The same operation is performed on all data points at the same time
     • Loosely Synchronous (SPMD)
       - The same operations are performed by all processors, but they need not be done at exactly the same time
       - Not synchronized at the computer clock cycle, but only macroscopically, "every now and then"
     • Asynchronous (MPMD)
       - Every processor executes its own instructions on its own data

  4. SIMD
     [Diagram: a single controller issues one program to processors P0 through Pn-1, all connected by an interconnection network]

  5. SPMD
     [Diagram: the same program runs on each of processors P0 through Pn-1 – but no longer strictly synchronized – connected by an interconnection network]

  6. MPMD
     [Diagram: a different program (Prog0 through Progn-1) runs on each of processors P0 through Pn-1, connected by an interconnection network]

  7. Processes
     • One can view a parallel program as consisting of a number of independent processes
       - These processes are mapped to the physical processors
       - Ideally(?) one process per processor
       - You can also think of these as threads, although technically threads are a different sort of beast
       - For program development we do not really care about the mapping
     • Two ways to create processes
       - Static
         · All processes are specified before execution
         · The system executes a fixed number of processes
         · In a world where there is a mapping between process and processor, this is the only view that makes sense
       - Dynamic
         · Processes can be created at runtime
         · More powerful, but incurs overhead at runtime

  8. Communication
     • Communication is vital in any kind of distributed application
     • Initially most people wrote their own protocols
       - Tower of Babel effect
     • Eventually standards appeared
       - Parallel Virtual Machine (PVM)
       - Message Passing Interface (MPI)

  9. Message Passing
     • In basic message passing, processes coordinate activities by explicitly sending and receiving messages
     • Commonly used in distributed-memory MIMD systems
     • Programming in an MP environment can be achieved by
       - Designing a special parallel language
         · Occam
       - Extending an existing sequential language to include MP constructs
         · Inmos C
       - Using a middleware layer that, in conjunction with an existing language, provides MP facilities
         · MPI
         · Parallel Java
         · PVM

  10. Synchronous Message Passing
     [Diagram: if recv(x) is posted first, it blocks until the matching send(2, x) is complete; if send(2, y) is issued first, it blocks until the matching recv(y) is complete; either way the matching pair forms a synchronization point]

  11. Asynchronous Message Passing
     [Diagram: send(2, x) copies the message to a buffer and continues; the matching recv(x) may or may not block; there is no synchronization point between send(2, y) and recv(y); an MPI sketch of both styles follows below]
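
These two behaviours map onto MPI's send variants (MPI itself is introduced later in these slides). A minimal sketch, assuming two ranks in MPI_COMM_WORLD; the rank numbers, the tag 0, and the single-int payload are illustrative choices:

    #include <mpi.h>

    /* Sketch: synchronous vs. buffered/asynchronous sends from rank 0 to
       rank 1. The rank numbers, tag 0, and int payload are illustrative. */
    void send_styles( int myRank )
    {
        int x = 42, y;
        MPI_Status status;

        if ( myRank == 0 ) {
            /* Synchronous send: does not complete until the matching
               receive has started, so it creates a synchronization point */
            MPI_Ssend( &x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD );

            /* Standard send: MPI may copy the message into an internal
               buffer and return at once, as in the asynchronous picture;
               it is free to block if no buffer space is available */
            MPI_Send( &x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD );
        } else if ( myRank == 1 ) {
            /* The receives block until a matching message arrives */
            MPI_Recv( &y, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status );
            MPI_Recv( &y, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status );
        }
    }

MPI_Ssend gives the synchronous behaviour of slide 10; the standard MPI_Send is allowed to behave like the buffered, asynchronous picture of slide 11.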

  12. Broadcast
     [Diagram: one process sends the same data from its buffer to every process P0 through P3; may or may not be synchronous]

  13. Multicast
     [Diagram: like broadcast, one process sends the same data from its buffer to P0 through P3; multicast targets a selected group of processes rather than every process; may or may not be synchronous]

  14. Scatter
     [Diagram: one process splits the data in its buffer and sends a different piece to each of P0 through P3; may or may not be synchronous]

  15. Gather
     [Diagram: each of P0 through P3 sends its piece of data to one process, which collects the pieces into its buffer; may or may not be synchronous]
     (MPI's collective calls for broadcast, scatter, and gather are sketched below.)
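
In MPI the broadcast, scatter, and gather patterns are single collective calls. A minimal sketch, assuming a four-process job with root rank 0; those choices, and the one-int-per-process payload, are illustrative:

    #include <mpi.h>

    /* Sketch: MPI collectives for the patterns above. A 4-process job,
       root rank 0, and one int per process are illustrative assumptions. */
    void collectives( int myRank )
    {
        int value = 0;          /* broadcast payload                    */
        int piece = 0;          /* each process's slice of the scatter  */
        int table[ 4 ];         /* root's send/receive buffer (4 ranks) */

        if ( myRank == 0 ) {
            value = 99;
            table[ 0 ] = 10;  table[ 1 ] = 20;  table[ 2 ] = 30;  table[ 3 ] = 40;
        }

        /* Broadcast: every process ends up with the root's value */
        MPI_Bcast( &value, 1, MPI_INT, 0, MPI_COMM_WORLD );

        /* Scatter: the root hands one element of table to each process */
        MPI_Scatter( table, 1, MPI_INT, &piece, 1, MPI_INT, 0, MPI_COMM_WORLD );

        /* Gather: the root collects one element back from each process */
        MPI_Gather( &piece, 1, MPI_INT, table, 1, MPI_INT, 0, MPI_COMM_WORLD );
    }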

  16. Reduction
     • A method to calculate a value under an associative, commutative operation (e.g., sum, product, minimum, maximum, ...) in log P steps
     • Think of summing the values in a tree
     [Diagram: a binary tree with leaves 10, 5, 15, 4, 6, 8, 1, 7; the partial sums 15, 19, 14, 8 and then 34, 22 appear at the internal nodes, with the total 56 at the root]

  17. Reduction
     • If each node is a process...
       - Pairs of nodes at the bottom add their values and pass the result to their parent
       - Pairs at the next level do the same
       - Repeat until the total reaches the root
     [Diagram: the same summation tree, with leaves 10, 5, 15, 4, 6, 8, 1, 7 and the total 56 at the root]

  18. Reduction
     • Instead of a tree, consider the reduction across a group of eight processors
       Processor:  7   6   5   4   3   2   1   0
       Value:      10  5   15  4   6   8   1   7

  19. Reduction
     • Sum to the even processors: each odd-numbered processor sends its value to the even-numbered processor next to it
       Processor:  7   6   5   4   3   2   1   0
       Value:      10  15  15  19  6   14  1   8

  20. Reduction
     • Repeat: the surviving partial sums are combined on the processors whose rank is a multiple of four
       Processor:  7   6   5   4   3   2   1   0
       Value:      10  15  15  34  6   14  1   22

  21. Reduction
     • Repeat one last time: processor 4 sends its partial sum to processor 0, which now holds the total
       Processor:  7   6   5   4   3   2   1   0
       Value:      10  15  15  34  6   14  1   56

  22. Think Binary
     [Diagram: the eight processors labeled with their ranks in binary, 111 down to 000]

  23.–28. Step 1, Step 2, Step 3
     [Diagrams: the three communication steps shown on the binary ranks. In Step 1, ranks 001, 011, 101, and 111 each send to the partner whose rank differs only in the lowest bit; in Step 2, ranks 010 and 110 send to the partner differing in the middle bit; in Step 3, rank 100 sends to rank 000, which then holds the total]

  29. Programming It
     • Mask: 001 – each process computes (rank AND mask)
       Rank:        111 110 101 100 011 010 001 000
       Rank & 001:  001 000 001 000 001 000 001 000
     • A nonzero result means the process sends this step; zero means it receives

  30. Programming It
     • Mask: 010 – the processes still active (110, 100, 010, 000) compute (rank AND mask)
       Rank:        110 100 010 000
       Rank & 010:  010 000 010 000

  31. Programming It
     • Mask: 100 – the two remaining processes compute (rank AND mask)
       Rank:        100 000
       Rank & 100:  100 000

  32. Reduction
     • Okay, now I know who sends when, but...
     • How do I know who to send to?

  33. Programming It
     • Mask: 001 – bitwise AND picks the senders; bitwise XOR with the mask gives each sender its destination
       Rank:        111 110 101 100 011 010 001 000
       Rank & 001:  001 000 001 000 001 000 001 000
       Rank ^ 001:  110     100     010     000

  34. Programming It
     • Mask: 010 – among the active processes, bitwise AND again picks the senders and bitwise XOR gives their destinations
       Rank:        110 100 010 000
       Rank & 010:  010 000 010 000
       Rank ^ 010:  100     000

  35. Programming It
     • Mask: 100 – the last step: 100 sends its partial sum to 000 (the full loop is sketched below)
       Rank:        100 000
       Rank & 100:  100 000
       Rank ^ 100:  000
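
Putting the mask slides together, the whole reduction is a short loop: at each step, the bitwise AND says whether a process sends or receives, and the bitwise XOR names its partner. A sketch, assuming the number of processes is a power of two; the function and variable names and the tag 0 are illustrative, not the course's prescribed solution:

    #include <mpi.h>

    /* Sketch: sum-reduction onto rank 0 using the mask scheme above.
       Assumes the number of processes is a power of two; the function
       and variable names and the tag 0 are illustrative choices. */
    int tree_sum( int myValue, int myRank, int numProcs )
    {
        int sum = myValue;
        int received;
        int partner;
        int mask;
        MPI_Status status;

        for ( mask = 1; mask < numProcs; mask <<= 1 ) {
            if ( ( myRank & ( mask - 1 ) ) != 0 )
                break;                        /* already sent in an earlier step */
            partner = myRank ^ mask;          /* bitwise XOR names the partner   */
            if ( ( myRank & mask ) != 0 ) {
                /* bitwise AND is nonzero: send the partial sum and drop out */
                MPI_Send( &sum, 1, MPI_INT, partner, 0, MPI_COMM_WORLD );
            } else {
                /* bitwise AND is zero: receive the partner's sum and add it */
                MPI_Recv( &received, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &status );
                sum += received;
            }
        }
        return sum;                           /* rank 0 ends up with the total */
    }

Only rank 0 never takes the send branch, so after the last step it alone holds the full total.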

  36. Reduce
     [Diagram: P0 through P3 each contribute their data; the values are combined with an operator (here +) into one process's buffer; may or may not be synchronous; the corresponding MPI call is sketched below]
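
In practice the hand-written loop above is rarely needed, because MPI packages the whole pattern as a collective. A minimal sketch; root rank 0 and the MPI_SUM operation are illustrative choices:

    #include <stdio.h>
    #include <mpi.h>

    /* Sketch: the same sum delivered by the MPI_Reduce collective.
       Root rank 0 and the MPI_SUM operation are illustrative choices. */
    void collective_sum( int myValue, int myRank )
    {
        int total = 0;

        /* Every process contributes myValue; rank 0 receives the sum */
        MPI_Reduce( &myValue, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD );

        if ( myRank == 0 )
            printf( "total = %d\n", total );
    }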

  37. What is MPI?
     • A message passing library specification
       - Message-passing model
       - Not a compiler specification (i.e., not a language)
       - Not a specific product
     • Designed for parallel computers, clusters, and heterogeneous networks
     • Lets users, tool writers, and library developers concentrate on their own code rather than on low-level communication code
     • API
     • Middleware

  38. The MPI Process
     • Development began in early 1992
     • Open process / broad participation
       - IBM, Intel, TMC, Meiko, Cray, Convex, nCUBE
       - PVM, p4, Express, Linda, ...
       - Laboratories, universities, government
     • Final version of the draft in May 1994
     • Public and vendor implementations are now widely available

  39. Why Message Passing?
     • Message passing is a mature paradigm
       - CSP was developed in 1978
       - Well understood
       - Relatively easy to match to distributed hardware
     • Goal was to provide a full-featured, portable system
       - Modularity
       - Peak performance
       - Portability
       - Heterogeneity
       - Performance measurement tools

  40. Features
     • Communicators
       - A collection of processes that can send messages to each other
     • Point-to-point communication
     • Collective communication
       - Barrier synchronization
       - Broadcast
       - Gather/scatter data
       - All-to-all exchange of data
       - Global reduction
       - Scan across all members of a communicator

  41. Bare bones MPI Program

     #include <mpi.h>

     int main( int argc, char **argv )
     {
         // Non-MPI stuff can go here

         MPI_Init( &argc, &argv );

         // Your parallel code goes here

         MPI_Finalize();

         // Non-MPI stuff can go here

         return 0;
     }

  42. Odds and Ends
     • Even though processes are running on different processors, you can print using printf()
       - No promise about ordering
       - Very useful for debugging
     • Supposedly scanf() works as well
       - Be sure to use the -i option
     • Although it appears that argc and argv do what you expect, in some implementations they do not work
       - Send messages instead
     • Be careful with random number generators
       - If every process seeds with the same value, the numbers will not be very random (see the sketch below)
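
One common fix for the seeding problem is to fold the process rank into the seed. A small sketch; the exact offset formula is an illustrative choice:

    #include <stdlib.h>
    #include <time.h>
    #include <mpi.h>

    /* Sketch: give each process a different random stream by folding its
       rank into the seed. The exact formula is an illustrative choice. */
    void seed_per_rank( void )
    {
        int myRank;
        MPI_Comm_rank( MPI_COMM_WORLD, &myRank );
        srand( (unsigned) time( NULL ) + 1000 * myRank );
    }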

  43. Communicators
     • Many MPI calls require a communicator
     • A communicator is a collection of processes that can send messages to each other
       - Think of a communicator as defining a group
       - Only processes in the same communicator can communicate
       - Allows you to segment your communication traffic (see the sketch below)
     • Every process belongs to the MPI_COMM_WORLD communicator
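
One way to carve MPI_COMM_WORLD into smaller groups is MPI_Comm_split, which is standard MPI although it is not covered on these slides. A sketch; splitting on even/odd rank is an illustrative choice:

    #include <mpi.h>

    /* Sketch: split MPI_COMM_WORLD into two smaller communicators,
       one for even ranks and one for odd ranks (illustrative choice). */
    void split_world( void )
    {
        int worldRank, subRank;
        MPI_Comm half;

        MPI_Comm_rank( MPI_COMM_WORLD, &worldRank );

        /* Processes passing the same "color" end up in the same communicator */
        MPI_Comm_split( MPI_COMM_WORLD, worldRank % 2, worldRank, &half );

        /* Ranks are renumbered from 0 within the new communicator */
        MPI_Comm_rank( half, &subRank );

        MPI_Comm_free( &half );
    }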

  44. Getting Information
     • You can gather information about your environment
     • MPI_Comm_rank( communicator, &retVal );
       - Returns your rank – the first process gets 0
     • MPI_Get_processor_name( str_array, &length );
       - Returns information about the processor
       - The name buffer should hold MPI_MAX_PROCESSOR_NAME characters

  45. HelloWorldPrint.c

     #include <stdio.h>
     #include <mpi.h>

     int main( int argc, char** argv )
     {
         int myRank;
         int nameLen;
         char myName[ MPI_MAX_PROCESSOR_NAME ];

         /* Initialize MPI */
         MPI_Init( &argc, &argv );

         /* Obtain information about the process */
         MPI_Comm_rank( MPI_COMM_WORLD, &myRank );
         MPI_Get_processor_name( myName, &nameLen );

         /* Standard print */
         printf( "Hello world from process #%d on %s\n", myRank, myName );

         /* Terminate MPI */
         MPI_Finalize();

         return 0;
     }

  46. Compiling Parallel Programs
     • All clusters within the CS department are running Sun's HPC software
       - Contains a variety of tools – including MPI
     • Everything (including documentation) is in /opt/SUNWhpc
       - Executables are in /opt/SUNWhpc/bin
       - You should probably add that to your path
     • Note that only the "clusters" have this software installed
     • See http://www.cs.rit.edu/~ark/runningpj.shtml for details
     • Compile MPI C programs using mpcc:

       mpcc HelloWorldPrint.c -o hello -lmpi

  47. CS Parallel Resources
     • SMP parallel computers
       - paradise/parasite – 8 processors, 1.35 GHz clock, 16 GB RAM
       - paradox/paragon – 4 processors, 450 MHz clock, 4 GB RAM
     • Cluster parallel computer
       - paranoia.cs.rit.edu (296 MHz clock, 192 MB RAM)
       - 32 backend computers (thug01 through thug32) – each an UltraSPARC-IIe CPU, 650 MHz clock, 1 GB RAM
       - 100-Mbps switched Ethernet backend interconnection network
     • Hybrid SMP cluster parallel computer (not for class use)
       - tardis.cs.rit.edu (650 MHz clock, 512 MB RAM)
       - 10 backend computers (dr00 through dr09) – each with two AMD Opteron processors (four CPUs), 2.6 GHz clock, 8 GB RAM
       - 1-Gbps switched Ethernet backend interconnection network

  48. Running Parallel Programs
     • Rules of engagement
       - Use the paradox and paradise machines to run SMP parallel programs
       - Use the java mprun command on the paranoia machine to run MPI cluster parallel programs; do not use the mprun command directly
       - Run Parallel Java cluster programs on the paranoia machine
       - Details at http://www.cs.rit.edu/~ark/runningpj.shtml
     • Account setup
       - You need to set up your account so you can ssh to the parallel machines without specifying a password
       - You need to include the Parallel Java libraries in your classpath

  49. Sample Run

     paranoia> mpcc HelloWorldPrint.c -o hello -lmpi
     paranoia> java mprun -np 6 hello
     Job 2, thug05, thug06, thug07, thug08, thug09, thug10
     Hello world from process #1 on thug06 SunOS 5.9 SUNW,Sun-Blade-100 Sun_Microsystems
     Hello world from process #2 on thug07 SunOS 5.9 SUNW,Sun-Blade-100 Sun_Microsystems
     Hello world from process #3 on thug08 SunOS 5.9 SUNW,Sun-Blade-100 Sun_Microsystems
     Hello world from process #4 on thug09 SunOS 5.9 SUNW,Sun-Blade-100 Sun_Microsystems
     Hello world from process #5 on thug10 SunOS 5.9 SUNW,Sun-Blade-100 Sun_Microsystems
     Hello world from process #0 on thug05 SunOS 5.9 SUNW,Sun-Blade-100 Sun_Microsystems
     paranoia>

  50. Sending/Receiving Messages
     • MPI places messages inside "envelopes"
     • Point-to-point messages are sent/received using
       - MPI_Send( buffer, count, type, dest, tag, comm );
       - MPI_Recv( buffer, count, type, src, tag, comm, &status );
     • These are blocking calls
       - They return when the buffer is available/full (see the sketch below)
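
A small sketch of a point-to-point exchange using these calls; the ranks, the tag 7, and the int payload are illustrative choices:

    #include <stdio.h>
    #include <mpi.h>

    /* Sketch: rank 0 sends one int to rank 1, which prints it.
       The ranks, tag 7, and payload are illustrative choices. */
    void ping( int myRank )
    {
        int payload = 123;
        MPI_Status status;

        if ( myRank == 0 ) {
            MPI_Send( &payload, 1, MPI_INT, 1, 7, MPI_COMM_WORLD );
        } else if ( myRank == 1 ) {
            MPI_Recv( &payload, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, &status );
            printf( "rank 1 got %d from rank 0\n", payload );
        }
    }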
