An Overview of Parallel Computing

An Overview of Parallel Computing

Hardware • There are many varieties of parallel computing hardware and many different architectures • The original classification of parallel computers is popularly known as Flynn’s taxonomy. • Flynn classified systems according to the number of instruction streams and the number of data streams 1. SISD (Von Neumann machine) 2. MIMD (most general, a collection of autonomous processors operate on their own data streams) 3. SIMD 4. MISD

The Classical von Neumann Machine • Divided into a CPU and main memory • CPU is divided into a control unit and an ALU • The memory stores both instructions and data • The control unit directs the execution of programs • The ALU carries out the calculations. When being used by the program, instructions and data are stored in very fast memory location, called registers. Both data and instructions are moved between memory and registers in CPU via bus. • Bus is a collection of parallel wires, faster buses have more wires

The Classical von Neumann Machine • To be useful, some additional devices are needed including input devices, output devices, and extended storage devices (disk) • The bottleneck is the transfer of data and instructions between memory and CPU • Few computers use classical Neumann machine • Most machines now have a hierarchical memory. Cache is used to achieve faster access.

Pipeline and Vector Architecture The first widely used extension to Neumann model was pipelining. Suppose we have the following program float x[100], y[100], z[100]; for(i =0; i<100; i++) z[i]=x[i]+y[i]; Further a single addition consists of following operations: 1. Fetch the operands from memory; 2. Compare exponents; 3. Shift one operand; 4. Add; 5. Normalize the results; 6. Store results in memory.

Pipeline and Vector Architecture • A further improvement: add vector instructions • With vector instruction, each of the basic instruction only needs to be issued once. One short instruction encodes N operations. • Another improvement is the use of multiple memory banks. • Different authors regard vector processors as different categories (MISD, variant of SIMD, even not really parallel machines) • Examples: CRAY C90 and NEC SX4

Pipeline and Vector Architecture • Advantages: relatively easy to write programs to obtain very high performance, therefore very popular for high performance scientific computing • Disadvantages: Don’t work well for programs that use irregular structures or use many branches

SIMD Systems • A pure SIMD system is opposed to a vector processor since it has single CPU • During each instruction cycle, the control processor broadcasts an instruction to all of the subordinate processors. Each of them either executes the instruction or idle. Example: for (i=0; i< 1000; i++) if (y[i]!=0.0) z[i]=x[i]/y[i]; else z[i]=x[i];

SIMD Systems • Each subordinate processor would execute Time Step 1: Test local_y=0.0. Time Step 2: a. If local_y was nonzero, z[i]=x[i]/y[i]; b. If local_y was zero, do nothing. Time Step 3: a. If local_y was nonzero, do nothing. b. if local_y was zero, z[i]=x[i]. • It is completely synchronous execution. A given subordinate processor either active or idle at given instant of time

SIMD Systems • The disadvantage is clear: in a program with many conditional branches or long segments of code whose execution depends on conditionals, possibly many processes will be idle for long period of time • Easy program if underlying problem has a regular structure. • The most famous examples of SIMD machines are the CM-1 and CM-2 Connection Machines produced by Thinking Machines.

General MIMD Systems • The key difference between SIMD and MIMD: the processors are autonomous. • MIMD systems are asynchronous. Often no global clock; maybe no correspondence between different processors even if they execute the same program • MIMD systems consist of shared-memory (and distributed memory systems, also sometimes called multiprocessors and multicomputers.

Shared-Memory MIMD • The generic shared-memory architecture

Bus-based Architecture • Simplest interconnection network • If multiple processors access memory, bus will become saturated, thus long delays • A fairly large cache • Due to limited bandwidth of a bus, do not scale to large number of processors.

Switched-based Architecture • Most others rely on some type of switch-based network • A crossbar as a rectangular mesh of wires with switches at the point of intersection, and terminals on its left and top edges.

Switched-based Architecture • Processors or memory modules can be connected to the terminals • The switches can either allow a signal to pass through in both directions simultaneously, or they can redirect a signal from vertical to horizontal or vice versa. • Any other processor can simultaneously access any other memory module, therefore, don’t suffer from the problems of saturation • However, they are very expensive: an m*n crossbar needs mn hardware switches

Cache Coherence • Cache coherence is a problem for any shared-memory architecture • A processor accesses a shared variable in its cache, how will it know whether the value stored in the variable is current? • Example: assume x=2; //initially P1 P2 Time 0 y0=x; y1=3*x; Time 1 x=7; z=6; Time 2 y=5; z1=4*x; y0 ends up 2 and y1 ends up 6. How about z1?

Cache Coherence • The simplest solution is probably the snoopy protocol • Each CPU has a cache controller • The ache controller monitors the bus traffic. When a processor updates a shared variable, it also updated the corresponding main memory location. The cache controllers on the other processors detect the write to main memory and mark their copies of the variable as invalid • This approach is only suitable for bus-based shared-memory because any traffic on the bus can be monitored by all the controllers

Distributed-Memory MIMD • Each processor has its own private memory • Generic distributed-memory MIMD • If we view it as a graph, the edges are communication wires. Each vertex corresponds to a processor/memory pair (or node), or some vertices correspond to nodes and others correspond to switches • They are static networks and dynamic networks

Distributed-Memory MIMD • Different types of distributed systems (a) a static network (mesh) (b) a dynamic network (crossbar)

Dynamic Interconnection Networks • Dynamic interconnection networks Example: An omega network

Dynamic Interconnection Networks • A less expensive solution is to use the multistage switching network, such as omega network • If p nodes, plogp/2 switches are needed, less than the crossbar using p2 switches • The delay in transmitting a message is increased since logp switches must be set

Static Interconnection Networks • Fully connected interconnection network • Ideal case from the performance and programming • Communication has no delay • Costs are huge

Static Interconnection Networks • A linear array or a ring • Relatively inexpensive (p or p-1 wires) • Easy to increase the size of the network • Number of available wires is extremely limited • The longest path is p-1 or p/2

Static Interconnection Networks • Hypercube: practically closest to the fully connected network • A d-dimension hypercube has 2d nodes • Any two nodes traverse at most d wires • Drawback: relative lack of scalability

Static Interconnection Networks • Mesh or torus

Static Interconnection Networks • Mesh or torus is between hypercube and linear array • Scale better than hypercube • Quite popular

Communication and Routing • If two nodes are not directly connected or if a processor is not directly connected to a memory module, how is data transmitted between the two? • If there are multiple routes, how to decide? Is the route always the shortest path? Most systems use a deterministic shortest-path algorithm • How do intermediate nodes forward communications? Two basic approaches are store-and-forward routing and cut-through routing • Store-and-forward routing uses considerably more memory • Most systems use some variant of cut-through routing

Store-and-Forward Routing • Read in the entire message ,and then send to C

Cut-Through Routing • Immediately forward each identifiable pieces of the message

An Overview of Parallel Computing