
Parallel Computing Platforms


Presentation Transcript


  1. Parallel Computing Platforms • Motivation: High Performance Computing • Dichotomy of Parallel Computing Platforms • Communication Model of Parallel Platforms • Physical Organization of Parallel Platforms • Communication Costs in Parallel Machines

  2. High Performance Computing • The computing power required to solve computationally intensive and/or data-intensive problems effectively in science, engineering, and other emerging disciplines • Provided using • Parallel computers and • Parallel programming techniques • Is this compute power not available otherwise?

  3. Elements of a Parallel Computer • Hardware • Multiple Processors • Multiple Memories • Interconnection Network • System Software • Parallel Operating System • Programming Constructs to Express/Orchestrate Concurrency • Application Software • Parallel Algorithms • Goal: Utilize the hardware, system, and application software to either • Achieve speedup S = Ts/Tp (ideally Tp = Ts/p, giving S = p) or • Solve problems requiring a large amount of memory.
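A quick worked example of the speedup goal above (the numbers are made up for illustration):

```latex
% Suppose serial time T_s = 120\,s, p = 8 processors, measured parallel time T_p = 20\,s.
S = \frac{T_s}{T_p} = \frac{120}{20} = 6, \qquad
E = \frac{S}{p} = \frac{6}{8} = 0.75.
% The ideal case T_p = T_s/p = 15\,s would give S = 8 (linear speedup).
```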

  4. Dichotomy of Parallel Computing Platforms • Logical Organization • The user’s view of the machine, as presented by its system software • Physical Organization • The actual hardware architecture • The physical architecture is, to a large extent, independent of the logical architecture

  5. Logical Organization • An explicitly parallel program must specify concurrency and interaction between concurrent tasks • That is, logically there are two critical components of parallel computing: • Control structure: how to express parallel tasks • Communication model: the mechanism for specifying interaction • Parallelism can be expressed at various levels of granularity, from the instruction level to processes.

  6. Control Structure of Parallel Platforms • Processing units in parallel computers either • operate under the centralized control of a single control unit or • work independently. • If there is a single control unit that dispatches the same instruction to various processors (that work on different data), the model is referred to as single instruction stream, multiple data stream (SIMD). • If each processor has its own control unit, each processor can execute different instructions on different data items. This model is called multiple instruction stream, multiple data stream (MIMD).

  7. SIMD and MIMD Processors A typical SIMD architecture (a) and a typical MIMD architecture (b).

  8. SIMD Processors [Figure: a single instruction stream drives processors A, B, and C; each processor has its own data input stream and data output stream.]

  9. MIMD Processors [Figure: processors A, B, and C each execute their own instruction stream (A, B, and C, respectively), each with its own data input stream and data output stream.]

  10. SIMD & MIMD Processors (cont’d) • SIMD relies on the regular structure of computations (such as those in image processing). • Require less hardware than MIMD computers (single control unit). • Require less memory • Are specialized: not suited to all applications. • In contrast to SIMD processors, MIMD processors can execute different programs on different processors. • Single program multiple data streams (SPMD) executes the same program on different processors. • SPMD and MIMD are closely related in terms of programming flexibility and underlying architectural support. ICS 573: High Performance Computing

  11. Logical Organization: Communication Model • There are two primary forms of data exchange between parallel tasks: • Accessing a shared data space and • Exchanging messages. • Platforms that provide a shared data space are called shared-address-space machines or multiprocessors. • Platforms that support messaging are also called message-passing platforms or multicomputers.

  12. Shared-Address-Space Platforms • Part (or all) of the memory is accessible to all processors. • Processors interact by modifying data objects stored in this shared address space. • If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) machine; otherwise, it is a non-uniform memory access (NUMA) machine.

  13. NUMA and UMA Shared-Address-Space Platforms Typical shared-address-space architectures: (a) uniform-memory-access shared-address-space computer; (b) uniform-memory-access shared-address-space computer with caches and memories; (c) non-uniform-memory-access shared-address-space computer with local memory only.

  14. NUMA and UMA Shared-Address-Space Platforms • The distinction between NUMA and UMA platforms is important from the point of view of algorithm design. • NUMA machines require locality from underlying algorithms for performance. • Programming these platforms is easier, since reads and writes are implicitly visible to other processors. • However, reads and writes to shared data must be coordinated. • Caches in such machines require coordinated access to multiple copies. • This leads to the cache coherence problem.

  15. Shared-Address-Space vs. Shared-Memory Machines • Shared address space is a programming abstraction • Shared memory is a physical machine attribute • It is possible to provide a shared address space using a physically distributed memory • Distributed shared-memory machines • Shared-address-space machines are commonly programmed using Pthreads and OpenMP
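A minimal OpenMP sketch in C of this programming style (the array size and contents are made up): every thread reads the shared array directly, and the one write that needs coordination, the shared sum, is handled with a reduction clause.

```c
/* omp_sum.c -- hypothetical shared-address-space illustration.
 * Compile: gcc -fopenmp omp_sum.c -o omp_sum
 */
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N];   /* shared: visible to all threads without explicit messaging */

int main(void) {
    double sum = 0.0;
    for (int i = 0; i < N; i++) a[i] = 1.0;

    /* Reads of a[] need no coordination; the shared accumulator does,
     * which OpenMP handles here via the reduction clause. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f (threads available: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```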

  16. Message-Passing Platforms • These platforms comprise a set of processors, each with its own (exclusive) memory • Instances of such a view come naturally from clustered workstations and non-shared-address-space multicomputers. • These platforms are programmed using (variants of) send and receive primitives. • Libraries such as MPI and PVM provide such primitives.
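A minimal sketch of the send/receive primitives in C with MPI (the payload and tag are invented): since no memory is shared, process 0 sends a value that process 1 must explicitly receive.

```c
/* sendrecv_demo.c -- hypothetical message-passing illustration.
 * Run with exactly two processes: mpirun -np 2 ./sendrecv_demo
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double value = 3.14;   /* made-up payload */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&value, 1, MPI_DOUBLE, 1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, /*tag=*/0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("Process 1 received %f from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```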

  17. Physical Organization: Interconnection Networks (ICNs) • Provide processor-to-processor and processor-to-memory connections • Networks are classified as: • Static • Dynamic • Static • Consist of a number of point-to-point links • direct networks • Historically used to link processors to processors • distributed-memory systems • Dynamic • The network consists of switching elements to which the various processors attach • indirect networks • Historically used to link processors to memory • shared-memory systems

  18. Static and Dynamic Interconnection Networks Classification of interconnection networks: (a) a static network; and (b) a dynamic network.

  19. Network Topologies [Figure: a taxonomy of interconnection networks. Dynamic networks are bus-based (single or multiple buses) or switch-based (crossbar, single-stage (SS), or multistage (MS)); static networks are classified by topology: 1-D, 2-D, or hypercube (HC).]

  20. Network Topologies: Static ICNs • Static (fixed) interconnection networks are characterized by having fixed paths, unidirectional or bidirectional, between processors. • Completely connected networks (CCNs): number of links O(N^2), delay complexity O(1). • Limited connection networks (LCNs): • Linear arrays • Ring (loop) networks • Two-dimensional arrays • Tree networks • Cube networks

  21. Network Topologies: Dynamic ICNs • A variety of network topologies have been proposed and implemented: • Bus-based • Crossbar • Multistage • etc. • These topologies trade off performance for cost. • Commercial machines often implement hybrids of multiple topologies for reasons of packaging, cost, and available components.

  22. Network Topologies: Buses • Shared medium. • Ideal for information broadcast • Distance between any two nodes is a constant • Bandwidth of the shared bus is a major bottleneck. • Local memories can improve performance • Scalable in terms of cost, unscalable in terms of performance.

  23. Network Topologies: Crossbars • Uses a p×m grid of switches to connect p inputs to m outputs in a non-blocking manner. • The cost of a crossbar of p processors grows as O(p^2) (e.g., p = 1,000 already implies on the order of a million switches). • Scalable in terms of performance, unscalable in terms of cost. A completely non-blocking crossbar network connecting p processors to b memory banks.

  24. Network Topologies: Multistage Networks • Strike a compromise between the cost and performance scalability of the bus and crossbar networks. The schematic of a typical multistage interconnection network.

  25. Network Topologies: Multistage Omega Network • One of the most commonly used multistage interconnects is the Omega network. • This network consists of log p stages, where p is the number of inputs/outputs. • At each stage, input i is connected to output j if: j = 2i for 0 ≤ i ≤ p/2 − 1, and j = 2i + 1 − p for p/2 ≤ i ≤ p − 1.

  26. Network Topologies: Multistage Omega Network Each stage of the Omega network implements a perfect shuffle as follows: A perfect shuffle interconnection for eight inputs and outputs.
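A small C sketch (file and function names invented) of the stage mapping above: for p = 2^k inputs, output j is the k-bit label of input i rotated left by one bit, which reproduces the piecewise rule j = 2i and j = 2i + 1 − p.

```c
/* shuffle.c -- hypothetical illustration of the Omega-network stage mapping. */
#include <stdio.h>

/* Perfect shuffle on p = 2^k inputs: rotate the k-bit label of i left by one. */
unsigned shuffle(unsigned i, unsigned p) {
    unsigned k = 0;
    while ((1u << k) < p) k++;            /* k = log2(p); p assumed a power of two */
    return ((i << 1) | (i >> (k - 1))) & (p - 1);
}

int main(void) {
    unsigned p = 8;
    for (unsigned i = 0; i < p; i++)      /* prints 0->0, 1->2, 2->4, 3->6, 4->1, ... */
        printf("input %u -> output %u\n", i, shuffle(i, p));
    return 0;
}
```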

  27. Network Topologies: Completely Connected and Star Networks • A completely connected network is the static counterpart of a crossbar network • Performance scales very well, but the hardware complexity is not realizable for large values of p. • A star network is the static counterpart of a bus network • The central processor is the bottleneck. (a) A completely-connected network of eight nodes; (b) a star-connected network of nine nodes.

  28. Network Topologies: Linear Arrays, Meshes, and k-d Meshes • In a linear array, each node has two neighbors, one to its left and one to its right. • If the nodes at either end are connected, we refer to it as a 1-D torus or a ring. • A generalization to 2 dimensions has nodes with 4 neighbors, to the north, south, east, and west. • A further generalization to d dimensions has nodes with 2d neighbors. • A special case of a d-dimensional mesh is a hypercube. Here, d = log p, where p is the total number of nodes.
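A short C sketch (function names invented) of how node labels map to neighbors on the 2-D mesh just described, with optional wraparound (torus) links:

```c
/* mesh_neighbors.c -- hypothetical illustration of 2-D mesh/torus neighbors. */
#include <stdio.h>

/* Node id = row * side + col on a side x side mesh.
 * Fills up to 4 neighbor ids into out[]; returns how many exist. */
int neighbors(int id, int side, int wraparound, int out[4]) {
    int row = id / side, col = id % side, n = 0;
    if (wraparound) {                      /* torus: every node has 4 neighbors */
        out[n++] = ((row + side - 1) % side) * side + col;  /* north */
        out[n++] = ((row + 1) % side) * side + col;         /* south */
        out[n++] = row * side + (col + side - 1) % side;    /* west  */
        out[n++] = row * side + (col + 1) % side;           /* east  */
    } else {                               /* plain mesh: border nodes have fewer */
        if (row > 0)        out[n++] = (row - 1) * side + col;
        if (row < side - 1) out[n++] = (row + 1) * side + col;
        if (col > 0)        out[n++] = row * side + (col - 1);
        if (col < side - 1) out[n++] = row * side + (col + 1);
    }
    return n;
}

int main(void) {
    int nb[4], k = neighbors(5, 4, 0, nb); /* node 5 on a 4x4 mesh, no wraparound */
    printf("node 5 has %d neighbors:", k);
    for (int i = 0; i < k; i++) printf(" %d", nb[i]);
    printf("\n");                          /* expected: 1 9 4 6 */
    return 0;
}
```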

  29. Network Topologies: Linear Arrays and Meshes Two- and three-dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound links (2-D torus); and (c) a 3-D mesh with no wraparound.

  30. Network Topologies: Hypercubes and their Construction Construction of hypercubes from hypercubes of lower dimension.

  31. Network Topologies: Properties of Hypercubes • The distance between any two nodes is at most log p. • Each node has log p neighbors. • The distance between two nodes is given by the number of bit positions at which the two nodes differ.
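A one-function C sketch of the distance property above: the hop count between two hypercube nodes is the Hamming distance of their labels (the bit-count builtin is GCC/Clang-specific).

```c
/* hcube_dist.c -- hypothetical illustration of hypercube distance. */
#include <stdio.h>

/* Distance = number of bit positions in which the two node labels differ. */
int hypercube_distance(unsigned a, unsigned b) {
    return __builtin_popcount(a ^ b);   /* GCC/Clang builtin: count set bits of XOR */
}

int main(void) {
    /* In a 3-D hypercube (p = 8, log p = 3), nodes 0b000 and 0b101 differ
     * in two bit positions, so the shortest path has length 2. */
    printf("distance(0, 5) = %d\n", hypercube_distance(0u, 5u));
    return 0;
}
```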

  32. Network Topologies: Tree-Based Networks Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network.

  33. Evaluation Metrics for ICNs • The following evaluation metrics are the criteria used to characterize the cost and performance of static ICNs • Diameter • The maximum distance between any two nodes • The smaller, the better. • Connectivity • The minimum number of arcs that must be removed to break the network into two disconnected networks • The larger, the better. • Bisection width • The minimum number of arcs that must be removed to partition the network into two equal halves • The larger, the better. • Cost • The number of links in the network • The smaller, the better.
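A quick worked example of these four metrics, using the standard values for two of the topologies above (a ring and a hypercube, each with p nodes):

```latex
% Ring of p nodes:
\text{diameter} = \lfloor p/2 \rfloor,\quad \text{connectivity} = 2,\quad
\text{bisection width} = 2,\quad \text{cost} = p.
% Hypercube of p nodes:
\text{diameter} = \log p,\quad \text{connectivity} = \log p,\quad
\text{bisection width} = p/2,\quad \text{cost} = (p \log p)/2.
```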

  34. Evaluating Static Interconnection Networks

  35. Evaluating Dynamic Interconnection Networks

  36. Communication Costs in Parallel Machines • Along with idling and contention, communication is a major overhead in parallel programs. • Communication cost depends on many features, including: • Network topology • Data handling • Routing, etc.

  37. Message Passing Costs in Parallel Computers • The communication cost of a data-transfer operation depends on: • Start-up time: ts • time to add headers/trailers and error-correction information, execute the routing algorithm, and establish the connection between source and destination • Per-hop time: th • time for the header to travel between two directly connected nodes • also called node latency • Per-word transfer time: tw • tw = 1/r, where r is the channel bandwidth in words per second

  38. Store-and-Forward Routing • A message traversing multiple hops is completely received at an intermediate hop before being forwarded to the next hop. • The total communication cost for a message of size m words to traverse l communication links is t_comm = ts + (m·tw + th)·l. • In most platforms, th is small, and the above expression can be approximated by t_comm = ts + m·l·tw.
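A quick worked example of the store-and-forward cost above (all parameter values made up for illustration):

```latex
% Assume t_s = 100, t_h = 1, t_w = 1 (time units per word), m = 1000 words, l = 4 hops.
t_{comm} = t_s + (m\,t_w + t_h)\,l = 100 + (1000 \cdot 1 + 1) \cdot 4 = 4104.
% Compare cut-through routing (see below): t_s + l\,t_h + m\,t_w = 100 + 4 + 1000 = 1104.
```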

  39. Routing Techniques Passing a message from node P0 to P3 (a) through a store-and-forward communication network; (b) and (c) extending the concept to cut-through routing. The shaded regions represent the time that the message is in transit. The startup time associated with this message transfer is assumed to be zero.

  40. Packet Routing • Store-and-forward makes poor use of communication resources. • Packet routing breaks messages into packets and pipelines them through the network. • Since packets may take different paths, each packet must carry routing information, error checking, sequencing, and other related header information. • The total communication time for packet routing is approximated by t_comm = ts + l·th + tw·m. • Here, the factor tw also accounts for overheads in packet headers.

  41. Cut-Through Routing • Takes the concept of packet routing to an extreme by further dividing messages into basic units called flits. • Since flits are typically small, the header information must be minimized. • This is done by forcing all flits to take the same path, in sequence. • A tracer message first programs all intermediate routers. All flits then take the same route. • Error checks are performed on the entire message, as opposed to flits. • No sequence numbers are needed.

  42. Cut-Through Routing • The total communication time for cut-through routing is approximated by t_comm = ts + l·th + tw·m. • This is identical in form to packet routing; however, tw is typically much smaller.
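A compact C sketch of the two cost models side by side (all parameter values hypothetical), showing how pipelining flits removes the m·l product that dominates store-and-forward:

```c
/* comm_cost.c -- hypothetical comparison of routing cost models. */
#include <stdio.h>

/* Store-and-forward: t_s + (m*t_w + t_h) * l -- whole message re-sent per hop. */
double cost_sf(double ts, double th, double tw, double m, double l) {
    return ts + (m * tw + th) * l;
}

/* Cut-through (and packet routing, with a larger t_w): t_s + l*t_h + m*t_w. */
double cost_ct(double ts, double th, double tw, double m, double l) {
    return ts + l * th + m * tw;
}

int main(void) {
    double ts = 100, th = 1, tw = 1, m = 1000, l = 4;   /* made-up values */
    printf("store-and-forward: %.0f\n", cost_sf(ts, th, tw, m, l)); /* 4104 */
    printf("cut-through:       %.0f\n", cost_ct(ts, th, tw, m, l)); /* 1104 */
    return 0;
}
```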

  43. Simplified Cost Model for Communicating Messages • The cost of communicating a message between two nodes l hops away using cut-through routing is given by t_comm = ts + l·th + tw·m. • In this expression, th is typically much smaller than ts and tw·m, so the second term on the right-hand side can be ignored, particularly when m is large. • Furthermore, it is often not possible to control routing and placement of tasks. • For these reasons, we can approximate the cost of message transfer by t_comm = ts + tw·m.

  44. Notes on the Simplified Cost Model • The given cost model allows the design of algorithms in an architecture-independent manner • However, the following assumptions are made: • Communication between any pair of nodes takes equal time • Underlying network is uncongested • Underlying network is completely connected • Cut-through routing is used
