Understanding Parallel Programming Models: A Deep Dive into Amdahl’s Law and Architectures

Distributed SystemsCS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad HammoudandVinayKolar

Objectives Discussion on Programming Models MapReduce Message Passing Interface (MPI) Examples of parallel processing Traditional models of parallel programming Parallel computer architectures Why parallelism? Why parallelism?

Amdahl’s Law • We parallelize our programs in order to run them faster • How much faster will a parallel program run? • Suppose that the sequential execution of a program takes T1time units and the parallel execution on p processors takes Tp time units • Suppose that out of the entire execution of the program, s fraction of it is not parallelizable while 1-s fraction is parallelizable • Then the speedup (Amdahl’s formula):

Amdahl’s Law: An Example • Suppose that 80% of you program can be parallelized and that you use 4 processors to run your parallel version of the program • The speedup you can get according to Amdahl is: • Although you use 4 processors you cannot get a speedup more than 2.5 times (or 40% of the serial running time)

Real Vs. Actual Cases • Amdahl’s argument is too simplified to be applied to real cases • When we run a parallel program, there are a communication overhead and a workload imbalance among processes in general 20 80 20 80 Serial Serial Parallel 20 20 Parallel 20 20 Process 1 Process 1 Process 2 Process 2 Cannot be parallelized Process 3 Process 3 Cannot be parallelized Can be parallelized Communication overhead Process 4 Process 4 Can be parallelized Load Unbalance 2. Parallel Speed-up: An Actual Case 1. Parallel Speed-up: An Ideal Case

Guidelines • In order to efficiently benefit from parallelization, we ought to follow these guidelines: • Maximize the fraction of our program that can be parallelized • Balance the workload of parallel processes • Minimize the time spent for communication

Objectives Discussion on Programming Models MapReduce Message Passing Interface (MPI) Examples of parallel processing Traditional models of parallel programming Parallel computer architectures Parallel computer architectures Why parallelism?

Parallel Computer Architectures • We can categorize the architecture of parallel computers in terms of two aspects: • Whether the memory is physically centralized or distributed • Whether or not the address space is shared

Symmetric Multiprocessor • Symmetric Multiprocessor (SMP) architecture uses shared system resources that can be accessed equally from all processors • A single OS controls the SMP machine and it schedules processes and threads on processors so that the load is balanced Processor Processor Processor Processor Cache Cache Cache Cache Bus or Crossbar Switch Memory I/O

Massively Parallel Processors • Massively Parallel Processors (MPP) architecture consists of nodes with each having its own processor, memory and I/O subsystem • An independent OS runs at each node Interconnection Network Processor Processor Processor Processor Cache Cache Cache Cache Bus Bus Bus Bus Memory I/O Memory I/O Memory I/O Memory I/O

Non-Uniform Memory Access • Non-Uniform Memory Access (NUMA) architecture machines are built on a similar hardware model as MPP • NUMA typically provides a shared address space to applications using a hardware/software directory-based coherence protocol • The memory latency varies according to whether you access memory directly (local) or through the interconnect (remote). Thus the name non-uniform memory access • As in an SMP machine, a single OS controls the whole system

Objectives Discussion on Programming Models MapReduce Message Passing Interface (MPI) Examples of parallel processing Traditional Models of parallel programming Traditional Models of parallel programming Parallel computer architectures Why parallelizing our programs?

Models of Parallel Programming • What is a parallel programming model? • A programming model is an abstraction provided by the hardware to programmers • It determines how easily programmers can specify their algorithms into parallel unit of computations (i.e., tasks) that the hardware understands • It determines how efficiently parallel tasks can be executed on the hardware • Main Goal: utilize all the processors of the underlying architecture (e.g., SMP, MPP, NUMA) and minimize the elapsed time of your program

Traditional Parallel Programming Models Parallel Programming Models Shared Memory Message Passing Message Passing

Shared Memory Model • In the shared memory programming model, the abstraction is that parallel tasks can access any location of the memory • Parallel tasks can communicate through reading and writing common memory locations • This is similar to threads from a single process which share a single address space • Multi-threaded programs (e.g., OpenMP programs)are the best fit with shared memory programming model

Shared Memory Model Si = Serial Pj = Parallel Single Thread Multi-Thread S1 S1 Time Time Spawn P1 P3 P1 P2 P3 P2 Join P3 S2 Shared Address Space P4 S2 Process Process

Shared Memory Example begin parallel // spawn a child thread private int start_iter, end_iter, i; shared int local_iter=4, sum=0; shared double sum=0.0, a[], b[], c[]; shared lock_type mylock; start_iter = getid() * local_iter; end_iter = start_iter + local_iter; for (i=start_iter; i<end_iter; i++) a[i] = b[i] + c[i]; barrier; for (i=start_iter; i<end_iter; i++) if (a[i] > 0) { lock(mylock); sum = sum + a[i]; unlock(mylock); } barrier; // necessary end parallel // kill the child thread Print sum; for (i=0; i<8; i++) a[i] = b[i] + c[i]; sum = 0; for (i=0; i<8; i++) if (a[i] > 0) sum = sum + a[i]; Print sum; Sequential Parallel

Traditional Parallel Programming Models Parallel Programming Models Shared Memory Shared Memory Message Passing

Message Passing Model • In message passing, parallel tasks have their own local memories • One task cannot access another task’s memory • Hence, to communicate data they have to rely on explicit messages sent to each other • This is similar to the abstraction of processes which do not share an address space • MPI programs are the best fit with message passing programming model

Message Passing Model Message Passing Single Thread S = Serial P = Parallel S1 S1 S1 S1 S1 Time Time P1 P1 P1 P1 P1 P2 S2 S2 S2 S2 P3 P4 Process 0 Process 1 Process 2 Process 3 S2 Node 1 Node 2 Node 3 Node 4 Data transmission over the Network Process

Message Passing Example id = getpid(); local_iter = 4; start_iter = id * local_iter; end_iter = start_iter + local_iter; if (id == 0) send_msg (P1, b[4..7], c[4..7]); else recv_msg (P0, b[4..7], c[4..7]); for (i=start_iter; i<end_iter; i++) a[i] = b[i] + c[i]; local_sum = 0; for (i=start_iter; i<end_iter; i++) if (a[i] > 0) local_sum = local_sum + a[i]; if (id == 0) { recv_msg (P1, &local_sum1); sum = local_sum + local_sum1; Print sum; } else send_msg (P0, local_sum); for (i=0; i<8; i++) a[i] = b[i] + c[i]; sum = 0; for (i=0; i<8; i++) if (a[i] > 0) sum = sum + a[i]; Print sum; Sequential Parallel

Shared Memory Vs. Message Passing • Comparison between shared memory and message passing programming models:

Objectives Discussion on Programming Models MapReduce Message Passing Interface (MPI) Examples of parallel processing Examples of parallel processing Traditional Models of parallel programming Parallel computer architectures Why parallelizing our programs?

SPMD and MPMD • When we run multiple processes with message-passing, there are further categorizations regarding how many different programs are cooperating in parallel execution • We distinguish between two models: • Single Program Multiple Data (SPMD) model • Multiple Programs Multiple Data (MPMP) model

SPMD • In the SPMD model, there is only one program and each process uses the same executable working on different sets of data a.out Node 1 Node 3 Node 2

MPMD • The MPMD model uses different programs for different processes, but the processes collaborate to solve the same problem • MPMD has two styles, the master/worker and the coupled analysis a.out a.out b.out b.out c.out a.out= Structural Analysis, b.out = fluid analysis and c.out = thermal analysis Example Node 1 Node 1 Node 2 Node 2 Node 3 Node 3 1. MPMD: Master/Slave 2. MPMD: Coupled Analysis

3 Key Points • To summarize, keep the following 3 points in mind: • The purpose of parallelization is to reduce the time spent for computation • Ideally, the parallel program is p times faster than the sequential program, where p is the number of processes involved in the parallel execution, but this is not always achievable • Message-passing is the tool to consolidate what parallelization has separated. It should not be regarded as the parallelization itself

Objectives Discussion on Programming Models MapReduce Message Passing Interface (MPI) Message Passing Interface (MPI) Examples of parallel processing Traditional Models of parallel programming Parallel computer architectures Why parallelizing our programs?

Message Passing Interface • In this part, the following concepts of MPI will be described: • Basics • Point-to-point communication • Collective communication

What is MPI? • The Message Passing Interface (MPI) is a message passing library standard for writing message passing programs • The goal of MPI is to establish a portable, efficient, and flexible standard for message passing • By itself, MPI is NOT a library - but rather the specification of what such a library should be • MPI is not an IEEE or ISO standard, but has in fact, become the industry standardfor writing message passing programs on HPC platforms

Reasons for using MPI

Programming Model • MPI is an example of a message passing programming model • MPI is now used on just about any common parallel architecture including MPP, SMP clusters, workstation clusters and heterogeneous networks • With MPIthe programmer is responsible for correctly identifying parallelism and implementing parallel algorithms using MPI constructs

Communicators and Groups • MPI uses objects called communicators and groups to define which collection of processes may communicate with each other to solve a certain problem • Most MPI routines require you to specify a communicator as an argument • The communicator MPI_COMM_WORLD is often used in calling communication subroutines • MPI_COMM_WORLD is the predefined communicator that includes all of your MPI processes

Ranks • Within a communicator, every process has its own unique, integer identifier referred to as rank, assigned by the system when the process initializes • A rank is sometimes called a task ID. Ranks are contiguous and begin at zero • Ranks are used by the programmer to specify the source and destination of messages • Ranks are often also used conditionally by the application to control program execution (e.g., if rank=0 do this / if rank=1 do that)

Multiple Communicators • It is possible that a problem consists of several sub-problems where each can be solved concurrently • This type of application is typically found in the category of MPMD coupled analysis • We can create a new communicator for each sub-problem as a subset of an existing communicator • MPI allows you to achieve that by using MPI_COMM_SPLIT

Example of Multiple Communicators • Consider a problem with a fluid dynamics part and a structural analysis part, where each part can be computed in parallel MPI_COMM_WORLD Comm_Fluid Comm_Struct Rank=0 Rank=0 Rank=1 Rank=1 Rank=0 Rank=4 Rank=1 Rank=5 Rank=2 Rank=2 Rank=3 Rank=3 Rank=2 Rank=6 Rank=3 Rank=7 • Ranks within MPI_COMM_WORLD are printed in red • Ranks within Comm_Fluid are printed with green • Ranks within Comm_Struct are printed with blue

Next Class Discussion on Programming Models MapReduce Message Passing Interface (MPI) Message Passing Interface (MPI) Examples of parallel processing Traditional Models of parallel programming Parallel computer architectures Programming Models- Part II Why parallelizing our programs?

Message Passing Interface • In this part, the following concepts of MPI will be described: • Basics • Point-to-point communication • Collective communication

Point-to-Point Communication • MPI point-to-point operations typically involve message passing between two, and only two, different MPI tasks • One task performs a send operation and the other performs a matchingreceive operation • Ideally, every send operation would be perfectly synchronized with its matching receive • This is rarely the case. Somehow or other, the MPI implementation must be able to deal with storing data when the two tasks are out of sync Processor1 Processor2 Network sendA recvA

Two Cases • Consider the following two cases: • A send operation occurs 5 seconds before the receive is ready - where is the message stored while the receive is pending? • Multiple sends arrive at the same receiving task which can only accept one send at a time - what happens to the messages that are "backing up"?

Steps Involved in Point-to-Point Communication • The data is copied to the user buffer by the user • The user calls one of the MPI send routines • The system copies the data from the user buffer to the system buffer • The system sends the data from the system buffer to the destination process Process 0 Sender The user calls one of the MPI receive routines The system receives the data from the source process and copies it to the system buffer The system copies data from the system buffer to the user buffer The user uses data in the user buffer User Mode Kernel Mode sendbuf 1 sysbuf 2 Call a send routine Copying data from sendbuf to sysbuf 3 Send data from sysbuf to destination Now sendbuf can be reused 4 Data Process 1 Receiver User Mode Kernel Mode Receive data from source to sysbuf Call a recev routine 1 2 4 sysbuf Now recvbuf contains valid data Copying data from sysbuf to recvbuf recvbuf 3

Blocking Send and Receive • When we use point-to-point communication routines, we usually distinguish between blocking and non-blocking communication • A blocking send routine will only returnafter it is safe to modify the application buffer for reuse • Safe means that modifications will not affect the data intended for the receive task • This does not imply that the data was actually received by the receiver- it may be sitting in the system buffer at the sender side Safe to modify sendbuf Rank 0 Rank 1 sendbuf recvbuf Network recvbuf sendbuf

Blocking Send and Receive • A blocking send can be: • Synchronous: Means there is a handshaking occurring with the receive task to confirm a safe send • Asynchronous: Means the system buffer at the sender side is used to hold the data for eventual delivery to the receiver • A blocking receive only returns after the data has arrived (i.e., stored at the application recvbuf) and is ready for use by the program

Non-Blocking Send and Receive (1) • Non-blocking send and non-blocking receiveroutines behave similarly • They return almost immediately • They do not wait for any communication events to complete such as: • Message copying from user buffer to system buffer • Or the actual arrival of a message

Non-Blocking Send and Receive (2) • However, it is unsafe to modify the application buffer until you make sure that the requested non-blocking operation was actually performed by the library • If you use the application buffer before the copy completes: • Incorrect data may be copied to the system buffer (in case of non-blocking send) • Or your receive buffer does not contain what you want (in case of non-blocking receive) • You can make sure of the completion of the copy by using MPI_WAIT() after the send or receive operations

Why Non-Blocking Communication? • Why do we use non-blocking communication despite its complexity? • Non-blocking communication is generally faster than its corresponding blocking communication • We can overlap computations while the system is copying data back and forth between application and system buffers

MPI Point-To-Point Communication Routines

Message Order • MPI guarantees that messages will not overtake each other • If a sender sends two messages M1and M2in succession to the same destination, and both match the same receive, the receive operation will receive M1before M2 • If a receiver posts two receives R1 and R2, in succession, and both are looking for the same message, R1will receive the message before R2

Fairness • MPI does not guarantee fairness – itis up to the programmer to prevent operation starvation • For instance, if task 0 and task 1 send competing messages (i.e., messages that match the same receive) to task 2, only one of the sends will complete Task 0 Task 1 Msg A Msg A Task 2 ?

Unidirectional Communication • When you send a message from process 0 to process 1, there are four combinations of MPI subroutines to choose from • Blocking send and blocking receive • Non-blocking send and blocking receive • Blocking send and non-blocking receive • Non-blocking send and non-blocking receive Rank 0 Rank 1 sendbuf recvbuf recvbuf sendbuf

Understanding Parallel Programming Models: A Deep Dive into Amdahl’s Law and Architectures

Understanding Parallel Programming Models: A Deep Dive into Amdahl’s Law and Architectures

Presentation Transcript

Distributed Systems CS 15-440

Distributed Systems CS 15-440

Distributed Systems CS 15-440

Distributed Systems CS 15-440

Distributed Systems CS 15-440

Distributed Systems CS 15-440

Distributed Systems CS 15-440

Distributed Systems CS 15-440

Distributed Systems CS 15-440

Distributed Systems CS 15-440

Distributed Systems CS 15-440

Distributed Systems CS 15-440

Distributed Systems CS 15-440