Message-Passing Programming
Worawan Diaz Carballo, PhD, Department of Computer Science, Faculty of Science and Technology, Thammasat University
Outline
• Introduction to Ocean and Tiger2 Clusters
• Introduction to MPI
• Collective Communication Operations
• Process Groups and Communicators
• Lab and demo
Chapter Objectives
• After finishing this module, students should..
• be able to log in and submit MPI jobs to execute on the Ocean and Tiger2 clusters.
• understand the core concepts of message-passing programming, and its pros and cons.
• be able to develop simple MPI applications using non-blocking send/receive and collective communication.
• understand the concepts of process groups and communicators.
Introduction to Ocean and Tiger2 Clusters
http://www.lsr.nectec.or.th/
Many thanks to our great neighbour for the computation power!
The Ocean Cluster
• 4 master nodes
• Each node comprises 4 Intel Xeon X7350 2.93 GHz processors; each processor has 4 cores
• Total RAM of 256 GB
• InfiniBand DDR 20 Gbps interconnect
• HP PolyServe SAN storage, 4.8 TB
• 750 GFLOPS (Rpeak) and 400 GFLOPS (Rmax)
The Tiger2 Cluster
http://www.lsr.nectec.or.th/images/e/e2/Itanium2_Cluster.jpg
• 32 master nodes (we have been allocated 4 nodes)
• Each node comprises two single-core Itanium 2 1.4 GHz processors
• Total RAM of 256 GB
• Hybrid interconnect combining InfiniBand and Gigabit Ethernet
• Network storage of 3 TB
• 358 GFLOPS (Rpeak) and 200 GFLOPS (Rmax)
High Performance Cluster • Constructed with many compute nodes and often a high-performance interconnect
Job Scheduling and Launching
• Both Ocean and Tiger2 offer a "batch computing facility"
• using PBS (Portable Batch System)
• Improves overall system efficiency
• Gives fair access to all users by maintaining a scheduling policy
• Provides protection against dead nodes
How PBS works • http://www.lsr.nectec.or.th/index.php/PBS_Job_Submission • User writes a batch script for the job and submits it to PBS with the qsub command. • PBS places the job into a queue based on its resource requests and runs the job when those resources become available. • The job runs until it either completes or exceeds one of its resource request limits. • PBS copies the job’s output into the directory from which the job was submitted and optionally notifies the user via email that the job has ended.
Step 1: Login and file transfer
• Off-campus machine:
• Connect via remote desktop to parlab.cs.tu.ac.th
• From parlab, use PuTTY to connect to Server1: ocean.lsr.nectec.or.th or Server2: tiger2.lsr.nectec.or.th
• On-campus machine:
• Use any secure shell client to connect to Server1: ocean.lsr.nectec.or.th or Server2: tiger2.lsr.nectec.or.th
• Use an SFTP client, such as WinSCP, to transfer files to the servers.
Step 2: Compile MPICH Programs
Syntax:
C : mpicc -o hello hello.c
C++ : mpiCC -o hello hello.cpp
("hello" is the resulting executable file.)
Note: Before compilation, make sure the MPICH library path is set, or use the export command as below:
export PATH=/opt/mpich/gnu/bin:$PATH
Step 3: Write a PBS batch script file
Ex1: A simple script file (pbs_script). A job named "HELLO" requests 2 nodes with 2 processors per node (4 processes in total) to run in parallel.
#PBS -N HELLO
#PBS -q parallel
#PBS -l nodes=2:ppn=2
#PBS -m ae
cat $PBS_NODEFILE > nodelist
/home1/mpich/mpirun -np 4 -machinefile nodelist /home2/supakit/HELLO
http://www.lsr.nectec.or.th/index.php/PBS_Job_Submission
Step 3.1: Submit a Job
• Use the PBS command qsub
Syntax : /opt/torque/bin/qsub <pbs_script_file>
Example : /opt/torque/bin/qsub wdc_script.pbs
returns a message such as 837.cluster.hpcc.nectec.or.th (837 is the job ID that PBS automatically assigns to your job)
Result after job completion • An error file and an output file are created. • The names are usually of the form: • jobfilename.o(jobid) • jobfilename.e(jobid) • Ex: wdc_script.e837 – Contains STDERR • wdc_script.o837 – Contains STDOUT • -j oe (combine standard output and standard error)
Introduction to MPI
Book: Chapter 5, Message-Passing Programming
Introduction to MPI
The message-passing programming model is based on the abstraction of a parallel computer with a distributed address space where each processor has a local memory to which it has exclusive access. The Message-Passing Interface (MPI) is a standardization of a message-passing library interface specification.
Recap: Message Passing Model
• A set of cooperating sequential processes, each with its own local address space
• Processes interact with explicit transactions (send, receive, ...)
• Advantage: the programmer controls data and work distribution
• Disadvantage: communication overhead for small transactions; hard to program!
• Example: MPI
Eun-Gyu Kim, Research Group Meeting 8/12/2004, University of Illinois at Urbana-Champaign, http://www.cs.uiuc.edu/homes/snir/PPP/index.html
Distributed vs. Shared Memory
• Message passing (distributed memory): multiple processes, each with its own memory, connected by an interconnection network; data is shared by passing messages.
• Threads (shared memory): a single process with concurrent execution; memory and resources are shared over a common bus; explicit threads, OpenMP*.
Message Passing Interface
• MPI is a library, not a language
• It is a library for inter-process communication and data exchange
• MPI is best suited to distributed-memory parallel systems
• The MPI library is large and complex, but the core is small
• MPI communication:
• Initialization/finalization
• Point-to-point, blocking
• Point-to-point, non-blocking
• Collective
• Communicator topologies
• User-defined data types
• Utilities (e.g., timing and initialization)
Common MPI Implementations • MPICH* (Argonne National Laboratory) • Most common MPI implementation • Derivatives • MPICH GM* – Myrinet* support (available from Myricom) • MVAPICH* – InfiniBand* support (available from Ohio State University) • Intel® MPI – version tuned to Intel Architecture systems • LAM/MPI* (Indiana University) • Contains many MPI 2.0 features • Daemon-based MPI implementation • MPI/Pro* (MPI Software Technology) • Scali MPI Connect* • Provides native support for most high-end interconnects
Core MPI
• 11 Out of 125 Total Functions
• Program startup and shutdown: MPI_Init, MPI_Finalize, MPI_Comm_size, MPI_Comm_rank
• Point-to-point communication: MPI_Send, MPI_Recv
• Non-blocking communication: MPI_Isend, MPI_Irecv, MPI_Wait
• Collective communication: MPI_Bcast, MPI_Reduce
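The non-blocking calls in the core set above (MPI_Isend, MPI_Irecv, MPI_Wait) are mentioned in the chapter objectives but are not demonstrated in the later examples. The following is a minimal sketch, not from the original slides; the variable names, the tag value 0, and the payload values are illustrative. It shows two processes exchanging an integer while leaving room to overlap computation with communication.

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char* argv[])
    {
        int myRank, numProc;
        int value, received = -1;
        MPI_Request sendReq, recvReq;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
        MPI_Comm_size(MPI_COMM_WORLD, &numProc);

        if (numProc >= 2 && myRank < 2) {
            int partner = 1 - myRank;          /* rank 0 talks to rank 1 and vice versa */
            value = 100 + myRank;              /* arbitrary payload */

            /* Post the send and the receive; both calls return immediately */
            MPI_Isend(&value, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &sendReq);
            MPI_Irecv(&received, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &recvReq);

            /* Useful computation could overlap with communication here */

            /* Block until both operations are locally complete */
            MPI_Wait(&sendReq, &status);
            MPI_Wait(&recvReq, &status);
            printf("Process %d received %d from process %d\n", myRank, received, partner);
        }

        MPI_Finalize();
        return 0;
    }

Each MPI_Wait blocks only until the corresponding operation is locally complete, matching the completion terminology used later in this deck.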
MPI Basic Steps
• Writing a program
• using "mpi.h" and some essential function calls
• Compiling your program
• using a compilation script
• Specify the machine file
"Hello, World" in MPI
• Demonstrates how to create, compile and run a simple MPI program on the lab cluster using the Intel MPI implementation

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char* argv[])
    {
        /* Initialize the MPI library */
        MPI_Init(&argc, &argv);

        /* Do some work! */
        printf("Hello world\n");

        /* Wrap it up: return the resources */
        MPI_Finalize();
        return 0;
    }
Compiling an MPI Program (in General)
• Most MPI implementations supply compilation scripts:
mpif77 mpi_prog.f
mpicc mpi_prog.c
mpif90 mpi_prog.f90
mpiCC mpi_prog.C
mpiicc mpi_prog.c (Intel MPI script)
• Manual compilation/linking is also possible:
cc mpi_prog.c -L/usr/local/mpich-1.2.5/lib -lmpich
Compiling MPI Programs (at TU) • mpicc: script to compile and link C+MPI programs • Flags: same meaning as C compiler • -O optimize • -o <file> where to put executable mpicc -O -o foo foo.c
MPI Machine File
• A text file telling MPI where to launch processes
• Put each machine name on a separate line
• Example:
compute-0-0
compute-0-1
compute-0-2
compute-0-3
• Check your implementation for multi-processor node formats
• Default file found in the MPI installation
Specifying Host Machines
• File .mpi-machines in the home directory lists host processors in order of their use
• Example .mpi-machines file contents:
compute-0-0
compute-0-1
compute-0-2
compute-0-3
Parallel Program Execution
• Launch scenario for mpirun
• Find the machine file (to know where to launch)
• Use SSH or RSH to execute a copy of the program on each node in the machine file
• Once launched, each copy establishes communication with the local MPI library (MPI_Init)
• Each copy ends MPI interaction with MPI_Finalize
Running MPI Programs • mpirun -np <p> <exec> <arg1> … • -np <p> number of processes • <exec> executable • <arg1>… command-line arguments
Starting an MPI Program
1. Execute on the front end (compute-0-0): mpirun -np 4 mpi_prog
2. mpirun checks the MPICH machine file: compute-0-0, compute-0-1, compute-0-2, compute-0-3
3. A copy of mpi_prog is launched on each listed node: rank 0 on compute-0-0, rank 1 on compute-0-1, rank 2 on compute-0-2, rank 3 on compute-0-3
Start up MPI
int MPI_Init(int* argc, char*** argv)
• MPI_Init prepares the system for MPI execution
• The call to MPI_Init may update the arguments in C (implementation dependent)
• No MPI functions may be called before MPI_Init
(Figure: CPUs, each with its own memory, connected by an interconnection network; the MPI processes form the communicator MPI_COMM_WORLD with ranks 0, 1, 2, 3.)
Shutting Down MPI
int MPI_Finalize(void)
• MPI_Finalize frees any memory allocated by the MPI library
• No MPI functions may be called after calling MPI_Finalize
• Exception: MPI_Initialized
Sizing the MPI Communicator
C: int MPI_Comm_size(MPI_Comm comm, int* size)
• MPI_Comm_size returns the number of processes in the specified communicator
• The communicator structure, MPI_Comm, is defined in mpi.h
Determining MPI Process Rank
C: int MPI_Comm_rank(MPI_Comm comm, int* rank)
• MPI_Comm_rank returns the rank of the calling process within the specified communicator
• Processes are numbered from 0 to N-1
Activity 2: "Hello, World" with IDs

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char* argv[])
    {
        int numProc, myRank;

        MPI_Init(&argc, &argv);                    /* Initialize the library */
        MPI_Comm_rank(MPI_COMM_WORLD, &myRank);    /* Who am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &numProc);   /* How many? */
        printf("Hello. Process %d of %d here.\n", myRank, numProc);
        MPI_Finalize();                            /* Wrap it up. */
        return 0;
    }
Predefined MPI Data Types
MPI_CHAR, MPI_INT, MPI_LONG, MPI_UNSIGNED, MPI_FLOAT, MPI_DOUBLE, MPI_BYTE
Sending and Receiving Messages
• A send/receive pair must specify: where to send, what to send, and how many items to send; where to receive, what to receive, and how many items to receive
• Two versions of call completion:
• Locally complete: the call has completed all of its own part in the operation
• Globally complete: all processes involved have completed their parts of the operation
MPI Blocking Send & Receive
• Blocking calls return when they are locally complete
• MPI_Send: the location used to hold the message can be reused or altered without affecting the message being sent; this does not mean that the message has been received
• MPI_Recv: the message has been received into the destination location and that location can be read
MPI Blocking Send & Receive
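The syntax that originally accompanied this slide is not present in the extracted text. As a reference, these are the standard C prototypes of the two blocking calls, taken from the MPI specification rather than from the original slide:

    int MPI_Send(void* buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm);
    /* buf: address of the data to send; count: number of elements;
       datatype: type of each element; dest: rank of the receiver;
       tag: message label; comm: communicator */

    int MPI_Recv(void* buf, int count, MPI_Datatype datatype,
                 int source, int tag, MPI_Comm comm, MPI_Status* status);
    /* source may be MPI_ANY_SOURCE and tag may be MPI_ANY_TAG;
       status records the actual source, tag and error code */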
MPI Send & Receive Example
• To send an integer x from process 0 to process 1 (see the sketch below)
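The code for this example is missing from the extracted slide. The following is a minimal sketch of the classic pattern, assuming a message tag of 0, MPI_COMM_WORLD, and an arbitrary value for x; it is a reconstruction, not the original slide's code:

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char* argv[])
    {
        int x, myRank;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myRank);

        if (myRank == 0) {
            x = 123;                                          /* value to transfer */
            MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (myRank == 1) {
            MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("Process 1 received x = %d\n", x);
        }

        MPI_Finalize();
        return 0;
    }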
Collective Communication Operations
Book: Chapter 5, Message-Passing Programming
A communication operation is called collective or global if all or a subset of the processes of a parallel program are involved.
Collective Communication Operations
Broadcasting Data
C: int MPI_Bcast(void* buf, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
• MPI_Bcast sends the specified data to all processes in the communicator
MPI_Bcast Example
(Figure: in MPI_COMM_WORLD, rank 0 starts with msg = 1 1 1 1 while ranks 1 and 2 start with msg = 0 0 0 0; after MPI_Bcast, every rank holds msg = 1 1 1 1.)

    /* Example: BroadcastData */
    #include "mpi.h"

    int main(int argc, char* argv[])
    {
        int msg[4];
        int myrank, numtasks;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

        if (myrank == 0)
            for (int i = 0; i < 4; i++) msg[i] = 1;   /* root holds the data */
        else
            for (int i = 0; i < 4; i++) msg[i] = 0;   /* other ranks start empty */

        MPI_Bcast(msg, 4, MPI_INT, 0, MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }
Data Reduction
C: int MPI_Reduce(void* sendbuf, void* recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
• MPI_Reduce performs the specified reduction operation on the specified data from all processes in the communicator, leaving the result on the root process
MPI Reduction Operations
MPI_MAX (maximum), MPI_MIN (minimum), MPI_SUM (sum), MPI_PROD (product), MPI_LAND (logical AND), MPI_BAND (bitwise AND), MPI_LOR (logical OR), MPI_BOR (bitwise OR), MPI_LXOR (logical exclusive OR), MPI_BXOR (bitwise exclusive OR), MPI_MAXLOC (maximum and location), MPI_MINLOC (minimum and location)
MPI_Reduce Example
(Figure: ranks 0, 1 and 2 hold msg = 0, 1 and 2 respectively; MPI_Reduce with MPI_SUM leaves 0 + 1 + 2 = 3 in sum on rank 0.)

    /* Example: ReduceData */
    #include "mpi.h"

    int main(int argc, char* argv[])
    {
        int msg, sum = 0;
        int myrank, numtasks;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

        msg = myrank;
        MPI_Reduce(&msg, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }