Parallel Computing

Erik Robbins

Limits on single-processor performance • Over time, computers have become better and faster, but there are constraints to further improvement • Physical barriers • Heat and electromagnetic interference limit chip transistor density • Processor speeds constrained by speed of light • Economic barriers • Cost will eventually increase beyond price anybody will be willing to pay

Parallelism • Improvement of processor performance by distributing the computational load among several processors. • The processing elements can be diverse • Single computer with multiple processors • Several networked computers

Drawbacks to Parallelism • Adds cost • Imperfect speed-up. • Given n processors, perfect speed-up would imply an n-fold increase in performance. • Even a small portion of a program that cannot be parallelized will limit overall speed-up. • “The bearing of a child takes nine months, no matter how many women are assigned.” (Fred Brooks, The Mythical Man-Month)

Amdahl’s Law • This relationship is given by the equation: • S = 1 / ((1 – P) + P / n) • S is the speed-up of the program (as a factor of its original sequential runtime) • P is the fraction of the program that is parallelizable • n is the number of processors • As n grows without bound, S approaches its upper limit of 1 / (1 – P) • Web Applet – • http://www.cs.iastate.edu/~prabhu/Tutorial/CACHE/amdahl.html
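
Amdahl's Law can be checked numerically. The following is a minimal sketch, assuming the usual finite-processor form S(n) = 1 / ((1 – P) + P / n), whose limit for very large n is 1 / (1 – P). The function names `amdahl_speedup` and `amdahl_limit` are illustrative and do not come from the slides.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Speed-up on n processors when a fraction p of the work is parallelizable."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    # The serial part (1 - p) is unchanged; the parallel part p is divided by n.
    return 1.0 / ((1.0 - p) + p / n)


def amdahl_limit(p: float) -> float:
    """Upper bound on speed-up as the processor count grows without bound."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return float("inf") if p == 1.0 else 1.0 / (1.0 - p)


if __name__ == "__main__":
    # Illustrative run: 95% of the program is assumed parallelizable.
    for n in (2, 8, 64, 1024):
        print(f"n={n:5d}  speed-up={amdahl_speedup(0.95, n):.2f}")
    print(f"limit    speed-up={amdahl_limit(0.95):.2f}")
```

With P = 0.95 the serial 5% caps the speed-up at 20, however many processors are added.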

History of Parallel Computing – Examples • 1954 – IBM 704 • Gene Amdahl was a principal architect • Used fully automatic floating-point arithmetic commands. • 1962 – Burroughs Corporation D825 • Four-processor computer • 1967 – Amdahl and Daniel Slotnick publish debate about parallel computing feasibility • Amdahl’s Law coined • 1969 – Honeywell Multics system • Capable of running up to eight processors in parallel • 1970s – Cray supercomputers (SIMD architecture) • 1984 – Synapse N+1 • First bus-connected multi-processor with snooping caches

History of Parallel Computing – Overview of Evolution • 1950s – Interest in parallel computing began. • 1960s & 70s – Advancements surfaced in the form of supercomputers. • Mid-1980s – Massively parallel processors (MPPs) came to dominate the top end of computing. • Late 1980s – Clusters (a type of parallel computer built from large numbers of computers connected by a network) competed with & eventually displaced MPPs. • Today – Parallel computing has become mainstream, based on multi-core processors in home computers. Scaling of Moore’s Law predicts a transition from a few cores to many.

Multiprocessor Architectures • Instruction Level Parallelism (ILP) • Superscalar and VLIW • SIMD Architectures (single instruction stream, multiple data streams) • Vector Processors • MIMD Architectures (multiple instruction streams, multiple data streams) • Interconnection Networks • Shared Memory Multiprocessors • Distributed Computing • Alternative Parallel Processing Approaches • Dataflow Computing • Neural Networks (SIMD) • Systolic Arrays (SIMD) • Quantum Computing

Superscalar • A design methodology that allows multiple instructions to be executed simultaneously in each clock cycle. • Analogous to adding another lane to a highway. The “additional lanes” are called execution units. • Instruction Fetch Unit • Critical component. • Retrieves multiple instructions simultaneously from memory. Passes instructions to… • Decoding Unit • Determines whether the instructions have any type of dependency
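
The dependency check performed by the decoding unit can be sketched in software. This is a toy model only: real processors do this in hardware, and the `Instr` type, the `hazard` function and the RAW/WAW/WAR labels (the standard names for read-after-write, write-after-write and write-after-read dependencies) are assumptions that do not appear in the slides.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction: one destination register, any number of sources."""
    dest: str
    srcs: Tuple[str, ...]


def hazard(first: Instr, second: Instr) -> Optional[str]:
    """Return the dependency that keeps `second` from issuing with `first`, or None."""
    if first.dest in second.srcs:
        return "RAW"  # second reads a register that first writes
    if first.dest == second.dest:
        return "WAW"  # both write the same register
    if second.dest in first.srcs:
        return "WAR"  # second overwrites a register that first still reads
    return None       # independent: both can go to execution units together


if __name__ == "__main__":
    add = Instr("r1", ("r2", "r3"))
    mul = Instr("r4", ("r1", "r5"))   # needs the result of add
    sub = Instr("r6", ("r7", "r8"))   # unrelated to add
    print(hazard(add, mul), hazard(add, sub))
```

Only the pair for which `hazard` returns `None` could be dispatched in the same clock cycle in this model.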

VLIW • Superscalar processors rely on both hardware and the compiler. • VLIW processors rely entirely on the compiler. • They pack independent instructions into one long instruction which tells the execution units what to do. • Compiler cannot have an overall picture of the run-time code. • Is compelled to be conservative in its scheduling. • VLIW compiler also arbitrates all dependencies.

Vector Processors • Often referred to as supercomputers. (The Cray series is the most famous.) • Based on vector arithmetic. • A vector is a fixed-length, one-dimensional array of values, or an ordered series of scalar quantities. • Operations include addition, subtraction, and multiplication. • Each instruction specifies a set of operations to be carried out over an entire vector. • Vector registers – specialized registers that can hold several vector elements at one time. • Vector instructions are efficient for two reasons. • Machine fetches fewer instructions. • Processor knows it will have a continuous source of data – can pre-fetch pairs of values.
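
The claim that the machine fetches fewer instructions can be illustrated with a small software model. This is a sketch under stated assumptions: `VECTOR_LENGTH` is a hypothetical register size, and counting one issued instruction per register-sized chunk is a simplification, not a description of any real machine.

```python
from typing import List, Tuple

VECTOR_LENGTH = 4  # hypothetical number of elements one vector register holds


def vector_add(a: List[float], b: List[float],
               vl: int = VECTOR_LENGTH) -> Tuple[List[float], int]:
    """Add two arrays chunk by chunk; return the sums and the number of
    modelled vector instructions issued."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    result: List[float] = []
    issued = 0
    for start in range(0, len(a), vl):
        # One modelled vector instruction operates on up to `vl` element pairs.
        va, vb = a[start:start + vl], b[start:start + vl]
        result.extend(x + y for x, y in zip(va, vb))
        issued += 1
    return result, issued


if __name__ == "__main__":
    sums, issued = vector_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])
    print(sums, issued)
```

A scalar loop in the same model would issue one add per element, so the saving grows with the register length.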

MIMD Architectures • Communication is essential for synchronized processing and data sharing. • Manner of passing messages determines overall design. • Two approaches: • Shared Memory – one large memory accessed identically by all processors. • Interconnected Network – Each processor has its own memory, but processors are allowed to access each other’s memories via the network.

Interconnection Networks • Categorized according to topology, routing strategy, and switching technique. • Networks can be either static or dynamic, and either blocking or non-blocking. • Dynamic – Allows the path between two entities (two processors or a processor & memory) to change between communications. • Static – Paths are fixed and do not change between communications. • Blocking – Does not allow new connections in the presence of other simultaneous connections.

Network Topologies • The way in which the components are interconnected. • A major determining factor in the overhead of message passing. • Efficiency is limited by: • Bandwidth – information carrying capacity of the network • Message latency – time required for first bit of a message to reach its destination • Transport latency – time a message spends in the network • Overhead – message processing activities in the sender and receiver

Static Topologies • Completely Connected – All components are connected to all other components. • Expensive to build & difficult to manage. • Star – Has a central hub through which all messages must pass. • Excellent connectivity, but hub can be a bottleneck. • Linear Array or Ring – Each entity can communicate directly with its two neighbors. • Other communications have to go through multiple entities. • Mesh – Links each entity to four or six neighbors. • Tree – Arranges entities in a tree structure. • Potential for bottlenecks at the root. • Hypercube – Multidimensional extension of mesh networks in which each dimension has two processors.
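
The cost difference between these topologies can be made concrete by counting links. The sketch below assumes bidirectional links and hypercube nodes labelled with binary addresses, so that neighbors differ in exactly one bit. All function names are illustrative.

```python
from typing import List


def complete_links(n: int) -> int:
    """Links in a completely connected network of n nodes."""
    return n * (n - 1) // 2


def ring_links(n: int) -> int:
    """Links in a ring of n nodes (meaningful for n >= 3)."""
    return n


def hypercube_links(dims: int) -> int:
    """Links in a hypercube with 2**dims nodes, each having dims neighbors."""
    return dims * (1 << dims) // 2


def hypercube_neighbors(node: int, dims: int) -> List[int]:
    """Neighbors are the addresses that differ from `node` in exactly one bit."""
    return [node ^ (1 << d) for d in range(dims)]


def hypercube_hops(a: int, b: int) -> int:
    """Minimum number of hops equals the number of differing address bits."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    n, dims = 16, 4  # 16 nodes in each case
    print(complete_links(n), ring_links(n), hypercube_links(dims))
    print(hypercube_neighbors(0b0101, dims), hypercube_hops(0b0000, 0b0111))
```

For 16 nodes the completely connected network needs 120 links and the hypercube 32, which is one way to see why the former is expensive to build.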

Dynamic Topology • Dynamic networks use either a bus or a switch to alter routes through a network. • Bus-based networks are simplest and most efficient when the number of entities is moderate. • A bottleneck can result as the number of entities grows large. • Parallel buses can alleviate bottlenecks, but at considerable cost.

Switches • Crossbar Switches • Are either open or closed. • A crossbar network is a non-blocking network. • If only one switch at each crosspoint, n entities require n^2 switches. In reality, many switches may be required at each crosspoint. • Practical only in high-speed multiprocessor vector computers.

Switches • 2x2 Switches • Capable of routing their inputs to different destinations. • Two inputs and two outputs. • Four states • Through (inputs feed directly to outputs) • Cross (upper input directed to lower output & vice versa) • Upper broadcast (upper input broadcast to both outputs) • Lower broadcast (lower input broadcast to both outputs) • Through and Cross states are the ones relevant to interconnection networks.
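
The four states of a 2x2 switch map directly onto a small function. This is a minimal sketch: the `SwitchState` enum and the `route` function are illustrative names, and the return value is ordered as (upper output, lower output).

```python
from enum import Enum
from typing import Tuple, TypeVar

T = TypeVar("T")


class SwitchState(Enum):
    THROUGH = "through"
    CROSS = "cross"
    UPPER_BROADCAST = "upper broadcast"
    LOWER_BROADCAST = "lower broadcast"


def route(state: SwitchState, upper_in: T, lower_in: T) -> Tuple[T, T]:
    """Return (upper_out, lower_out) for a 2x2 switch in the given state."""
    if state is SwitchState.THROUGH:
        return upper_in, lower_in   # inputs feed straight through
    if state is SwitchState.CROSS:
        return lower_in, upper_in   # inputs swap sides
    if state is SwitchState.UPPER_BROADCAST:
        return upper_in, upper_in   # upper input copied to both outputs
    if state is SwitchState.LOWER_BROADCAST:
        return lower_in, lower_in   # lower input copied to both outputs
    raise ValueError(f"unknown state: {state!r}")


if __name__ == "__main__":
    for state in SwitchState:
        print(state.value, route(state, "A", "B"))
```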

Shared Memory Multiprocessors • Tightly coupled systems that use the same memory. • Global Shared Memory – a single memory shared by multiple processors. • Distributed Shared Memory – each processor has local memory, but that memory is shared with other processors. • Global Shared Memory with a separate cache at each processor.

UMA Shared Memory • Uniform Memory Access • All memory accesses take the same amount of time. • One pool of shared memory and all processors have equal access. • Scalability of UMA machines is limited. As the number of processors increases… • Switched networks quickly become very expensive. • Bus-based systems saturate when the bandwidth becomes insufficient. • Multistage networks run into wiring constraints and significant latency.

NUMA Shared Memory • Nonuniform Memory Access • Provides each processor its own piece of memory. • Processors see this memory as a contiguous addressable entity. • Nearby memory takes less time to read than memory that is farther away. Memory access time is thus inconsistent. • Prone to cache coherence problems. • Each processor maintains a private cache. • Modified data needs to be updated in all caches. • Special hardware units known as snoopy cache controllers monitor writes and keep the caches consistent. • Write-through with update – updates stale values in other caches. • Write-through with invalidation – removes stale values from other caches.
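
The two write-through policies can be compared with a toy simulation. This is a sketch under stated assumptions, not a model of real hardware: the `SnoopyBus` class is hypothetical, each cache is a plain dictionary, and every write is seen immediately by all other caches.

```python
from typing import Dict, List


class SnoopyBus:
    """Toy model: private caches kept coherent by snooping on writes."""

    def __init__(self, n_caches: int, policy: str) -> None:
        if policy not in ("update", "invalidate"):
            raise ValueError("policy must be 'update' or 'invalidate'")
        self.policy = policy
        self.memory: Dict[int, int] = {}
        self.caches: List[Dict[int, int]] = [{} for _ in range(n_caches)]

    def read(self, cpu: int, addr: int) -> int:
        cache = self.caches[cpu]
        if addr not in cache:
            # Cache miss: load from memory (unwritten addresses read as 0).
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, cpu: int, addr: int, value: int) -> None:
        self.caches[cpu][addr] = value
        self.memory[addr] = value  # write-through: memory is always current
        for other, cache in enumerate(self.caches):
            if other == cpu or addr not in cache:
                continue
            if self.policy == "update":
                cache[addr] = value  # refresh the stale copy in place
            else:
                del cache[addr]      # drop the stale copy; next read misses


if __name__ == "__main__":
    for policy in ("update", "invalidate"):
        bus = SnoopyBus(2, policy)
        bus.read(0, 0x10)
        bus.read(1, 0x10)
        bus.write(0, 0x10, 7)
        print(policy, bus.caches[1], bus.read(1, 0x10))
```

Under either policy the second processor ends up reading the new value. The difference is whether its cached copy is refreshed at write time or reloaded from memory on the next read.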

Distributed Computing • Means different things to different people. • In a sense, all multiprocessor systems are distributed systems. • Usually refers to a very loosely coupled multicomputer system. • Depends on a network for communication among processors.

Grid Computing • An example of distributed computing. • Uses the resources of many computers connected by a network (e.g., the Internet) to solve computational problems that are too large for any single supercomputer. • Global Computing • Specialized form of grid computing. Uses the computing power of volunteers whose computers work on a problem while the system is idle. • SETI@Home Screen Saver • Six-year run accumulated two million years of CPU time and 50 TB of data.

Questions?
