Parallel Real-Time Systems Parallel Computing Overview
References (will be expanded as needed) • Website for Parallel & Distributed Computing: www.cs.kent.edu/~jbaker/PDC-F08/ • Selected slides from “Introduction to Parallel Computing” • Michael Quinn, Parallel Programming in C with MPI and OpenMP, McGraw Hill, 2004. Chapter 1 is posted on the website. • Selim Akl, Parallel Computation: Models and Methods, Prentice Hall, 1997. An updated online version is available on the website.
Outline • Why use parallel computing • Moore’s Law • Modern parallel computers • Flynn’s Taxonomy • Seeking Concurrency • Data clustering case study • Programming parallel computers
Why Use Parallel Computers • Solve compute-intensive problems faster • Make infeasible problems feasible • Reduce design time • Solve larger problems in same amount of time • Improve answer’s precision • Reduce design time • Increase memory size • More data can be kept in memory • Dramatically reduces the slowdown that occurs when accessing external storage increases computation time • Gain competitive advantage
1989 Grand Challenges to Computational Science Categories • Quantum chemistry, statistical mechanics, and relativistic physics • Cosmology and astrophysics • Computational fluid dynamics and turbulence • Materials design and superconductivity • Biology, pharmacology, genome sequencing, genetic engineering, protein folding, enzyme activity, and cell modeling • Medicine, and modeling of human organs and bones • Global weather and environmental modeling
Weather Prediction • The atmosphere is divided into 3D cells • Data includes temperature, pressure, humidity, wind speed and direction, etc. • Data is recorded at regular time intervals in each cell • There are about 5×10^3 cells of 1 mile cubes. • A modern computer would take over 100 days to perform the calculations needed for a 10-day forecast • Details in Ian Foster’s 1995 online textbook • Designing and Building Parallel Programs • Included in Parallel Reference List, which will be posted on website.
Moore’s Law • In 1965, Gordon Moore observed that the density of transistors on chips doubled every year. • That is, the chip area needed for a given number of transistors is halved yearly. • This is an exponential rate of increase. • By the late 1980s, the doubling period had slowed to 18 months. • Reducing the silicon area increases the speed of the processors. • Moore’s law is sometimes stated: “The processor speed doubles every 18 months”
[Figure: Microprocessor Revolution. Speed (log scale) versus time for supercomputers, mainframes, minis, and micros, illustrating Moore's Law.]
Some Definitions • Concurrent – Sequential events or processes which seem to occur or progress at the same time. • Parallel – Events or processes which occur or progress at the same time • Parallel computing provides simultaneous execution of operations within a single parallel computer • Distributed computing provides simultaneous execution of operations across a number of systems.
Flynn’s Taxonomy • Best known classification scheme for parallel computers. • Classifies a computer by the parallelism it exhibits in its • Instruction stream • Data stream • A sequence of instructions (the instruction stream) manipulates a sequence of operands (the data stream) • The instruction stream (I) and the data stream (D) can be either single (S) or multiple (M) • Four combinations: SISD, SIMD, MISD, MIMD
SISD • Single Instruction, Single Data • Usual sequential computer is primary example • i.e., uniprocessors • Note: co-processors don’t count as more processors • Concurrent processing allowed • Instruction prefetching • Pipelined execution of instructions • Independent concurrent tasks can execute different sequences of operations.
SIMD • Single instruction, multiple data • One instruction stream is broadcast to all processors • Each processor, also called a processing element (or PE), is very simple and is essentially an ALU • PEs neither store a copy of the program nor have a program control unit. • Individual processors can be inhibited from participating in an instruction (based on a data test).
SIMD (cont.) • All active processors execute the same instruction synchronously, but on different data • On a memory access, all active processors must access the same location in their local memory. • The data items form an array (or vector) and an instruction can act on the complete array in one cycle.
SIMD (cont.) • Quinn calls this architecture a processor array. • Examples include • The STARAN and MPP (Dr. Batcher architect) • Connection Machine CM2 (built by Thinking Machines).
How to View a SIMD Machine • Think of soldiers all in a unit. • The commander selects certain soldiers as active. • For example, every even numbered row. • The commander barks out an order to all the active soldiers, who execute the order synchronously.
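The commander analogy can be sketched in a few lines of Python. This is only an illustrative model, not code from the slides: the names `simd_step`, `local_data`, and `mask` are assumptions, and the list comprehension merely imitates what a processor array would do in a single synchronous step.

```python
# Sketch of one SIMD step: one broadcast instruction, many data items.
# simd_step, local_data, and mask are hypothetical names chosen for illustration.

def simd_step(instruction, local_data, mask):
    """Apply one broadcast instruction to the local value of every active PE.

    PEs whose mask entry is False are inhibited and keep their old value.
    """
    return [instruction(x) if active else x
            for x, active in zip(local_data, mask)]

local_data = [10, 11, 12, 13, 14, 15]                 # one value per PE
mask = [i % 2 == 0 for i in range(len(local_data))]   # even-numbered PEs are active
result = simd_step(lambda x: x * 2, local_data, mask)
print(result)  # [20, 11, 24, 13, 28, 15]
```

The mask plays the role of the data test that inhibits individual processors; the instruction itself is identical for every active PE.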
MISD • Multiple instruction streams, single data stream • Primarily corresponds to multiple redundant computation, say for reliability. • Quinn argues that a systolic array is an example of a MISD structure (pg 55-57) • Some authors include pipelined architecture in this category • This category does not receive much attention from most authors, so we won’t discuss it further.
MIMD • Multiple instruction, multiple data • Processors are asynchronous and can independently execute different programs on different data sets. • Communications are handled either • through shared memory (multiprocessors) • by use of message passing (multicomputers) • MIMDs are considered by many researchers to include the most powerful, least restricted computers.
MIMD (cont. 2/4) • Have major communication costs • When compared to SIMDs • Internal ‘housekeeping activities’ are often overlooked • Maintaining distributed memory & distributed databases • Synchronization or scheduling of tasks • Load balancing between processors • The SPMD method of programming MIMDs • All processors execute the same program. • SPMD stands for single program, multiple data. • Easy method to program when the number of processors is large. • While processors have the same code, each can be executing a different part of it at any point in time.
MIMD (cont 3/4) • A more common technique for programming MIMDs is to use multi-tasking • The problem solution is broken up into various tasks. • Tasks are distributed among processors initially. • If new tasks are produced during execution, these may be handled by the parent processor or distributed • Each processor can execute its collection of tasks concurrently. • If some of its tasks must wait for results from other tasks or for new data, the processor will focus on the remaining tasks. • Larger programs usually require a load balancing algorithm to rebalance tasks between processors • Dynamic scheduling algorithms may be needed to assign a higher execution priority to time-critical tasks • E.g., on critical path, more important, earlier deadline, etc.
MIMD (cont 4/4) • Recall, there are two principal types of MIMD computers: • Multiprocessors (with shared memory) • Multicomputers (message passing) • Both are important and will be covered in greater detail next.
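A minimal sketch of the SPMD idea, assuming Python threads stand in for processors: every worker runs the same function, and only its rank decides which data it touches. The names `spmd_program`, `run_spmd`, `rank`, and `size` are illustrative assumptions (they echo common MPI vocabulary but call no MPI routine), and threads in CPython share memory, so this models the programming style rather than a real multicomputer.

```python
import threading

def spmd_program(rank, size, data, partial_sums):
    """The single program that every processor runs.

    The rank selects this processor's share of the data (cyclic distribution).
    """
    my_items = data[rank::size]
    partial_sums[rank] = sum(my_items)

def run_spmd(size, data):
    """Start `size` copies of the same program and collect one result per rank."""
    partial_sums = [0] * size
    workers = [threading.Thread(target=spmd_program,
                                args=(rank, size, data, partial_sums))
               for rank in range(size)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return partial_sums

print(run_spmd(4, list(range(100))))  # [1200, 1225, 1250, 1275]
```

Because each copy runs asynchronously, different ranks may be at different points in `spmd_program` at any moment, which is the behavior the last bullet describes.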
Multiprocessors (Shared Memory MIMDs) • Consist of two types • Centralized Multiprocessors • Also called UMA (Uniform Memory Access) • Symmetric Multiprocessor or SMP • Distributed Multiprocessors • Also called NUMA (Nonuniform Memory Access)
Centralized Multiprocessors (SMPs) • Consist of identical CPUs connected by a bus to a common block of memory. • Each processor requires the same amount of time to access memory. • Usually limited to a few dozen processors due to memory bandwidth. • SMPs and clusters of SMPs are currently very popular
Distributed Multiprocessors (or NUMA) • Have a distributed memory system • Each memory location has the same address for all processors. • Access time to a given memory location varies considerably for different CPUs. • Normally use fast cache to reduce the problem of different memory access times for processors. • Caching creates the problem of ensuring that all copies of the same data in different memory locations are identical.
Multicomputers (Message-Passing MIMDs) • Processors are connected by a network • Usually an interconnection network • Also, may be connected by Ethernet links or a bus. • Each processor has a local memory and can only access its own local memory. • Data is passed between processors using messages, when specified by the program.
Multicomputers (cont) • Message passing between processors is controlled by a message-passing library or language (e.g., MPI, PVM) • The problem is divided into processes or tasks that can be executed concurrently on individual processors. • Each processor is normally assigned multiple tasks.
Multiprocessors vs Multicomputers • Programming disadvantages of message-passing • Programmers must make explicit message-passing calls in the code • This is low-level programming and is error prone. • Data is not shared but copied, which increases the total data size. • Data integrity: it is difficult to keep multiple copies of a data item correct.
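The message-passing style can be sketched as follows. This is an assumption-laden model rather than MPI or PVM code: two Python threads play the processors, a `queue.Queue` stands in for the network link, and the names `worker`, `run_two_processes`, and `local_memory` are hypothetical. Each worker reads and writes only its own dictionary; the one value that crosses between them travels as an explicit message.

```python
import queue
import threading

def worker(rank, link, local_memory):
    """A process with private local memory; sharing happens only via messages."""
    local_memory["total"] = sum(local_memory["values"])
    if rank == 1:
        link.put(local_memory["total"])        # explicit send to rank 0
    else:
        other_total = link.get()               # blocking receive from rank 1
        local_memory["grand_total"] = local_memory["total"] + other_total

def run_two_processes():
    link = queue.Queue()                       # stands in for the interconnection network
    memories = [{"values": [1, 2, 3]}, {"values": [4, 5, 6]}]
    threads = [threading.Thread(target=worker, args=(rank, link, memories[rank]))
               for rank in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return memories[0]["grand_total"]

print(run_two_processes())  # 21
```

The send and receive calls are spelled out by the programmer, which is the low-level, explicit style the comparison slides that follow refer to.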
Multiprocessors vs Multicomputers (cont) • Programming advantages of message-passing • No problem with simultaneous access to data. • Allows different PCs to operate on the same data independently. • Allows PCs on a network to be easily upgraded when faster processors become available. • Mixed “distributed shared memory” systems exist • An example is a cluster of SMPs.
Types of Parallel Execution • Data parallelism • Control/Job/Functional parallelism • Pipelining • Virtual parallelism
Data Parallelism • All tasks (or processors) apply the same set of operations to different data. • Example:
    for i ← 0 to 99 do
        a[i] ← b[i] + c[i]
    endfor
• Operations may be executed concurrently • Accomplished on SIMDs by having all active processors execute the operations synchronously. • Can be accomplished on MIMDs by assigning 100/p tasks to each processor and having each processor calculate its share asynchronously.
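The MIMD version of the vector addition example can be sketched in Python as below. The helper names `add_block` and `parallel_add` are assumptions, and the thread pool is only a stand-in for p processors (CPython threads will not actually speed this loop up); the point is the block decomposition, in which each worker computes roughly n/p of the elements independently.

```python
from concurrent.futures import ThreadPoolExecutor

def add_block(b, c, lo, hi):
    """One processor's share: a[i] = b[i] + c[i] for lo <= i < hi."""
    return [b[i] + c[i] for i in range(lo, hi)]

def parallel_add(b, c, p):
    """Split the index range into p contiguous blocks and add them concurrently."""
    n = len(b)
    bounds = [(r * n // p, (r + 1) * n // p) for r in range(p)]
    with ThreadPoolExecutor(max_workers=p) as pool:
        # map() returns the blocks in submission order, so the result stays ordered.
        blocks = pool.map(lambda lohi: add_block(b, c, *lohi), bounds)
        return [x for block in blocks for x in block]

b = list(range(100))
c = [2 * i for i in range(100)]
a = parallel_add(b, c, p=4)
print(a[:5])  # [0, 3, 6, 9, 12]
```

No worker reads a value that another worker writes, which is why the blocks can run asynchronously without any synchronization inside the loop.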
Supporting MIMD Data Parallelism • SPMD (single program, multiple data) programming is not really data parallel execution, as processors typically execute different sections of the program concurrently. • Data parallel programming can be strictly enforced when using SPMD as follows: • Processors execute the same block of instructions concurrently but asynchronously • No communication or synchronization occurs within these concurrent instruction blocks. • Each instruction block is normally followed by a synchronization and communication block of steps
MIMD Data Parallelism (cont.) • Strict data parallel programming is unusual for MIMDs, as the processors usually execute independently, running their own local program.
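The strict data parallel discipline described above (a compute block with no communication, followed by a synchronization and communication block) can be sketched with a barrier. This is a hedged illustration: `bulk_synchronous_sum` and `program` are hypothetical names, Python threads stand in for processors, and shared lists stand in for whatever communication mechanism a real machine would use.

```python
import threading

def bulk_synchronous_sum(data, p):
    """Each of p workers squares and sums its block, then all exchange results."""
    partial = [0] * p
    totals = [0] * p
    barrier = threading.Barrier(p)

    def program(rank):
        # Compute block: no communication or synchronization inside.
        lo, hi = rank * len(data) // p, (rank + 1) * len(data) // p
        partial[rank] = sum(x * x for x in data[lo:hi])
        # Synchronization step: wait until every worker has finished computing.
        barrier.wait()
        # Communication step: read the results produced by the other workers.
        totals[rank] = sum(partial)

    threads = [threading.Thread(target=program, args=(r,)) for r in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals

print(bulk_synchronous_sum(list(range(10)), 3))  # [285, 285, 285]
```

Without the barrier, a fast worker could read `partial` before a slow one had written its entry, so the synchronization step is what makes the asynchronous compute block safe.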
Data Parallelism Features • Each processor performs the same data computation on different data sets • Computations can be performed either synchronously or asynchronously • Defn: Grain size is the average number of computations performed between communication or synchronization steps • See Quinn textbook, page 411 • Data parallel programming usually results in smaller grain size computation • SIMD computation is considered to be fine-grain • MIMD data parallelism is usually considered to be medium grain
Control/Job/Functional Parallelism • Independent tasks apply different operations to different data elements
    a ← 2
    b ← 3
    m ← (a + b) / 2
    s ← (a^2 + b^2) / 2
    v ← s - m^2
• First and second statements may execute concurrently • Third and fourth statements may execute concurrently
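A small Python sketch of the same five statements, assuming a two-worker thread pool as a stand-in for two processors. The function names `mean`, `mean_square`, and `variance_of_pair` are illustrative choices, not names from the slides. The two independent tasks (m and s) are submitted together, and the final statement waits for both.

```python
from concurrent.futures import ThreadPoolExecutor

def mean(a, b):
    return (a + b) / 2

def mean_square(a, b):
    return (a * a + b * b) / 2

def variance_of_pair(a, b):
    """Run the two independent tasks concurrently, then combine their results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        m_future = pool.submit(mean, a, b)          # third statement
        s_future = pool.submit(mean_square, a, b)   # fourth statement
        m, s = m_future.result(), s_future.result()
    return s - m * m                                # fifth statement needs both

print(variance_of_pair(2, 3))  # 0.25
```

Unlike data parallelism, the two concurrent tasks here perform different operations, so the available concurrency is limited by the number of independent tasks rather than by the size of the data.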
Control Parallelism Features • Problem is divided into different non-identical tasks • Tasks are divided between the processors so that their workload is roughly balanced • Parallelism at the task level is considered to be coarse grained parallelism
Data Dependence Graph • Can be used to identify data parallelism and job parallelism. • See page 11. • Most realistic jobs contain both parallelisms • Can be viewed as branches in data parallel tasks • If there is no path from vertex u to vertex v, then job parallelism can be used to execute the tasks u and v concurrently. • If larger tasks can be subdivided into smaller identical tasks, data parallelism can be used to execute these concurrently.
For example, “mow lawn” becomes • Mow N lawn • Mow S lawn • Mow E lawn • Mow W lawn • If 4 people are available to mow, then data parallelism can be used to do these tasks simultaneously. • Similarly, if several people are available to “edge lawn” and “weed garden”, then we can use data parallelism to provide more concurrency.
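The path test on a data dependence graph can be sketched as below. The graph used here is an assumption built from the five-statement example earlier in the deck (a, b, m, s, v), and the helper names `has_path` and `can_run_concurrently` are hypothetical. The check is applied in both directions, since a dependence either way forces an ordering.

```python
def has_path(graph, src, dst):
    """Depth-first search: is dst reachable from src along dependence edges?"""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return False

def can_run_concurrently(graph, u, v):
    """Two tasks are independent when neither one can reach the other."""
    return not has_path(graph, u, v) and not has_path(graph, v, u)

# Edge x -> y means task y needs the result of task x.
dependences = {"a": ["m", "s"], "b": ["m", "s"], "m": ["v"], "s": ["v"], "v": []}

print(can_run_concurrently(dependences, "m", "s"))  # True
print(can_run_concurrently(dependences, "a", "v"))  # False
```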
Pipelining • Divide a process into stages • Produce several items simultaneously
Compute Partial Sums • Consider the for loop:
    p[0] ← a[0]
    for i ← 1 to 3 do
        p[i] ← p[i-1] + a[i]
    endfor
• This computes the partial sums:
    p[0] = a[0]
    p[1] = a[0] + a[1]
    p[2] = a[0] + a[1] + a[2]
    p[3] = a[0] + a[1] + a[2] + a[3]
• The loop is not data parallel, as there are dependencies. • However, we can stage the calculations in order to achieve some parallelism.
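One way to picture the staging is a clock-driven simulation in which stage i adds element i to the running sum it receives from stage i-1. The sketch below is an assumption about how such a pipeline could be modeled, not the slides' own figure: `partial_sums` is the sequential loop, and `pipelined_partial_sums` (a hypothetical name) pushes several equal-length input vectors through the stages so that different stages work on different vectors during the same tick.

```python
def partial_sums(a):
    """The sequential loop: p[i] = p[i-1] + a[i]."""
    p = [a[0]]
    for i in range(1, len(a)):
        p.append(p[i - 1] + a[i])
    return p

def pipelined_partial_sums(vectors):
    """Simulate one pipeline stage per element position.

    At clock tick t, stage i works on vector t - i, so several vectors
    are in flight at once. Returns the results and the number of ticks used.
    """
    n_stages = len(vectors[0])
    results = [[None] * n_stages for _ in vectors]
    ticks = len(vectors) + n_stages - 1
    for t in range(ticks):
        for i in range(n_stages):
            k = t - i                      # index of the vector at stage i
            if 0 <= k < len(vectors):
                prev = results[k][i - 1] if i > 0 else 0
                results[k][i] = prev + vectors[k][i]
    return results, ticks

results, ticks = pipelined_partial_sums([[1, 2, 3, 4], [10, 20, 30, 40]])
print(results)  # [[1, 3, 6, 10], [10, 30, 60, 100]]
print(ticks)    # 5
```

A single vector gains nothing from the pipeline, since its four additions still happen one after another; the benefit appears when a stream of vectors keeps every stage busy (here 5 ticks instead of the 8 sequential steps for two vectors).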
Virtual Parallelism • In data parallel applications, it is often simpler to initially design an algorithm or program assuming one data item per processor. • Particularly useful for SIMD programming • If more processors are needed than are actually available, each of the p processors is given a block of ⌈n/p⌉ or ⌊n/p⌋ of the n data items • Typically, requires a routine adjustment in program. • Will result in a slowdown in running time of at least n/p. • Called Virtual Parallelism since each processor plays the role of several processors. • A SIMD computer has been built that automatically converts code to handle n/p items per processor. • Wavetracer SIMD computer.
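The block assignment can be sketched as a small index calculation. The function name `block_range` and the choice to give the larger blocks to the lowest-numbered processors are assumptions made for illustration; other layouts are equally valid as long as every block has ⌈n/p⌉ or ⌊n/p⌋ items.

```python
def block_range(n, p, rank):
    """Half-open index range [lo, hi) of the data items owned by processor `rank`.

    The first n % p processors get one extra item, so every block has
    either ceil(n/p) or floor(n/p) items.
    """
    base, extra = divmod(n, p)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi

ranges = [block_range(10, 4, rank) for rank in range(4)]
print(ranges)  # [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Each physical processor then loops over its block, playing the role of the virtual processors assigned to it, which is where the slowdown of at least n/p comes from.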
Slides from Parallel Architecture Section See www.cs.kent.edu/~jbaker/PDC-F08/ s
References • Slides in this section are taken from the Parallel Architecture Slides at site www.cs.kent.edu/~jbaker/PDC-F08/ • Book reference is Chapter 2 of Quinn’s textbook.
Interconnection Networks • Uses of interconnection networks • Connect processors to shared memory • Connect processors to each other • Different interconnection networks define different parallel machines. • The interconnection network’s properties influence the type of algorithm used for various machines as it affects how data is routed.
Terminology for Evaluating Switch Topologies • We need to evaluate 4 characteristics of a network in order to help us understand its effectiveness • These are • The diameter • The bisection width • The edges per node • The constant edge length • We’ll define these and see how they affect algorithm choice. • Then we will introduce several different interconnection networks.
Terminology for Evaluating Switch Topologies • Diameter – Largest distance between two switch nodes. • A low diameter is desirable • It puts a lower bound on the complexity of parallel algorithms which require communication between arbitrary pairs of nodes.
Terminology for Evaluating Switch Topologies • Bisection width – The minimum number of edges between switch nodes that must be removed in order to divide the network into two halves. • Or within 1 node of one-half if the number of processors is odd. • High bisection width is desirable. • In algorithms requiring large amounts of data movement, the size of the data set divided by the bisection width puts a lower bound on the complexity of an algorithm.
Terminology for Evaluating Switch Topologies • Number of edges per node • It is best if the maximum number of edges/node is a constant independent of network size, as this allows the processor organization to scale more easily to a larger number of nodes. • Degree is the maximum number of edges per node. • Constant edge length? (yes/no) • Again, for scalability, it is best if the nodes and edges can be laid out in 3D space so that the maximum edge length is a constant independent of network size.
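Two of these measures, diameter and degree, can be computed directly from an adjacency list. The sketch below assumes a connected network in which every node is a switch, and uses an 8-node linear network as the example; the function names `bfs_distances`, `diameter`, and `degree` are illustrative choices.

```python
from collections import deque

def bfs_distances(adj, start):
    """Hop counts from `start` to every reachable node (breadth-first search)."""
    dist = {start: 0}
    frontier = deque([start])
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                frontier.append(v)
    return dist

def diameter(adj):
    """Largest shortest-path distance between any two nodes."""
    return max(max(bfs_distances(adj, s).values()) for s in adj)

def degree(adj):
    """Maximum number of edges at any node."""
    return max(len(neighbors) for neighbors in adj.values())

# Linear network of 8 nodes: node i is linked to i-1 and i+1 where they exist.
linear = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}

print(diameter(linear))  # 7
print(degree(linear))    # 2
```

Bisection width is harder to compute for an arbitrary graph, so it is usually derived by hand for each regular topology rather than by search.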
Three Important Interconnection Networks • We will consider the following three well known interconnection networks: • 2-D mesh • linear network • hypercube • All three of these networks have been used to build commercial parallel computers.
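For reference, the standard closed-form values of these measures for the three networks can be tabulated in a short Python sketch. The function names are assumptions, the 2-D mesh is taken to be a square k×k mesh without wraparound links and with k even, and the degree figures are the maximum for networks large enough to have interior nodes.

```python
import math

def hypercube_neighbors(node, d):
    """Neighbors in a d-dimensional hypercube differ from `node` in exactly one bit."""
    return [node ^ (1 << k) for k in range(d)]

def hypercube_distance(u, v):
    """Hop count between two hypercube nodes: number of differing address bits."""
    return bin(u ^ v).count("1")

def network_metrics(kind, n):
    """Diameter, bisection width, and degree for an n-node network of the given kind."""
    if kind == "linear":
        return {"diameter": n - 1, "bisection_width": 1, "degree": 2}
    if kind == "mesh2d":
        k = math.isqrt(n)
        if k * k != n:
            raise ValueError("mesh2d expects n to be a perfect square")
        return {"diameter": 2 * (k - 1), "bisection_width": k, "degree": 4}
    if kind == "hypercube":
        d = n.bit_length() - 1
        if 1 << d != n:
            raise ValueError("hypercube expects n to be a power of two")
        return {"diameter": d, "bisection_width": n // 2, "degree": d}
    raise ValueError(f"unknown network kind: {kind}")

for kind in ("linear", "mesh2d", "hypercube"):
    print(kind, network_metrics(kind, 16))
```

With 16 nodes the trade-off is already visible: the hypercube has the smallest diameter and largest bisection width, but its degree grows with network size, while the linear network and mesh keep a constant degree.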