
2. Machine Architectures


Presentation Transcript


  1. 2. Machine Architectures

  2. von Neumann Architecture • Based on the fetch-decode-execute cycle • The computer executes a single sequence of instructions that act on data. Both program and data are stored in memory. • [Diagram: a flow of instructions passing through stages A, B, and C, operating on data]
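The cycle can be made concrete with a small simulation. The C sketch below is illustrative only: the opcodes, the opcode*100+address encoding, and the tiny program are invented for this example. It keeps both the program and its data in a single memory array and repeatedly fetches, decodes, and executes instructions.

```c
#include <stdio.h>

/* Toy von Neumann machine: one memory array holds both the program
 * (encoded as opcode*100 + address, an invented encoding) and the data. */
enum { OP_LOAD = 1, OP_ADD = 2, OP_STORE = 3, OP_HALT = 9 };

int main(void) {
    int mem[16] = {
        /* program, starting at address 0 */
        OP_LOAD  * 100 + 10,   /* acc = mem[10]       */
        OP_ADD   * 100 + 11,   /* acc = acc + mem[11] */
        OP_STORE * 100 + 12,   /* mem[12] = acc       */
        OP_HALT  * 100,
        0, 0, 0, 0, 0, 0,
        /* data */
        5, 7, 0                /* mem[10]=5, mem[11]=7, mem[12]=result */
    };

    int pc = 0, acc = 0;
    for (;;) {
        int instr = mem[pc++];          /* fetch            */
        int op    = instr / 100;        /* decode: opcode   */
        int addr  = instr % 100;        /* decode: operand  */
        if (op == OP_HALT) break;       /* execute          */
        else if (op == OP_LOAD)  acc  = mem[addr];
        else if (op == OP_ADD)   acc += mem[addr];
        else if (op == OP_STORE) mem[addr] = acc;
    }
    printf("mem[12] = %d\n", mem[12]);  /* prints 12 */
    return 0;
}
```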

  3. Flynn's Taxonomy • Classifies computers according to… • The number of instruction streams • The number of data streams

  4. Single Instruction, Single Data (SISD) • A serial (non-parallel) computer • Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle • Single data: only one data stream is being used as input during any one clock cycle • Most PCs, single CPU workstations, …

  5. Single Instruction, Multiple Data (SIMD) • A type of parallel computer • Single instruction: All processing units execute the same instruction at any given clock cycle • Multiple data: Each processing unit can operate on a different data element • Best suited for specialized problems characterized by a high degree of regularity, such as image processing. • Examples: Connection Machine CM-2, Cray J90, Pentium MMX instructions
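As an illustration of the single-instruction, multiple-data idea, the C sketch below uses x86 SSE intrinsics, a later relative of the MMX instructions mentioned above (it assumes an x86 machine and a compiler flag such as gcc -msse): a single _mm_add_ps instruction adds four pairs of floats at once.

```c
#include <stdio.h>
#include <xmmintrin.h>

int main(void) {
    float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float b[4] = { 10.0f, 20.0f, 30.0f, 40.0f };
    float c[4];

    __m128 va = _mm_loadu_ps(a);        /* load 4 floats                */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);     /* one instruction, 4 data items */
    _mm_storeu_ps(c, vc);

    for (int i = 0; i < 4; i++)
        printf("%.1f ", c[i]);          /* 11.0 22.0 33.0 44.0 */
    printf("\n");
    return 0;
}
```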

  6. The Connection Machine 2 (SIMD) • The massively parallel Connection Machine 2 was a supercomputer produced by Thinking Machines Corporation, containing 32,768 (or more) 1-bit processors working in parallel.

  7. Multiple Instruction, Single Data (MISD) • Few actual examples of this class of parallel computer have ever existed • Some conceivable examples might be: • multiple frequency filters operating on a single signal stream • multiple cryptography algorithms attempting to crack a single coded message • the Data Flow Architecture

  8. Multiple Instruction, Multiple Data (MIMD) • Currently, the most common type of parallel computer • Multiple Instruction: every processor may be executing a different instruction stream • Multiple Data: every processor may be working with a different data stream • Execution can be synchronous or asynchronous, deterministic or non-deterministic • Examples: most current supercomputers, computer clusters, multi-processor SMP machines (inc. some types of PCs)
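A minimal MIMD sketch on a multi-core machine, using POSIX threads (assuming a pthreads-capable system and compilation with -pthread): the two threads run different instruction streams on different data streams.

```c
#include <pthread.h>
#include <stdio.h>

static int data_a[4] = { 1, 2, 3, 4 };
static int data_b[4] = { 7, 3, 9, 5 };
static int sum_result, max_result;

static void *sum_task(void *arg) {          /* instruction stream 1 */
    (void)arg;
    for (int i = 0; i < 4; i++) sum_result += data_a[i];
    return NULL;
}

static void *max_task(void *arg) {          /* instruction stream 2 */
    (void)arg;
    max_result = data_b[0];
    for (int i = 1; i < 4; i++)
        if (data_b[i] > max_result) max_result = data_b[i];
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, sum_task, NULL);
    pthread_create(&t2, NULL, max_task, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("sum = %d, max = %d\n", sum_result, max_result);  /* 10, 9 */
    return 0;
}
```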

  9. Earth Simulator Center – Yokohama, NEC SX (MIMD) • The Earth Simulator is a project to develop a 40 TFLOPS system for climate modeling. It performs at 35.86 TFLOPS. • The ES is based on: - 5,120 NEC CPUs at 500 MHz (640 8-way nodes) - 8 GFLOPS per CPU (41 TFLOPS total) - 2 GB RAM per CPU (10 TB total) - Shared memory inside each node - A 640 × 640 crossbar switch between the nodes - 16 GB/s inter-node bandwidth

  10. What about Memory? • The interface between CPUs and memory in parallel machines is of crucial importance • The bottleneck on the bus between memory and CPU is known as the von Neumann bottleneck • It limits how fast a machine can operate: the balance between computation and communication

  11. Communication in Parallel Machines • Programs act on data. • Quite important: how do processors access each other's data? • [Diagram: Shared Memory Model – several CPUs connected to a single shared memory; Message Passing Model – CPU/memory pairs connected by a network]

  12. Shared Memory • Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as a global address space • Multiple processors can operate independently but share the same memory resources • Changes in a memory location made by one processor are visible to all other processors • Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA
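The shared-memory model can be sketched with POSIX threads (assuming a pthreads-capable system and compilation with -pthread): every thread addresses the same global counter, so an update made by one thread is visible to the others, and a mutex provides the synchronization needed for correct concurrent updates.

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                       /* shared memory location */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);             /* synchronize access      */
        counter++;                             /* visible to all threads  */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);        /* prints 400000 */
    return 0;
}
```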

  13. Shared Memory (2) • [Diagram: a single 4-processor machine with all CPUs attached to one memory; a 3-processor NUMA machine with each CPU paired with its own memory, linked by a fast memory interconnect] • UMA: Uniform Memory Access • NUMA: Non-Uniform Memory Access

  14. Uniform Memory Access (UMA) • Most commonly represented today by Symmetric Multiprocessor (SMP) machines • Identical processors • Equal access and access times to memory • Sometimes called CC-UMA - Cache Coherent UMA. • Cache coherent means if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level. • Very hard to scale

  15. Non-Uniform Memory Access (NUMA) • Often made by physically linking two or more SMPs. One SMP can directly access the memory of another SMP. • Not all processors have equal access time to all memories • Sometimes called DSM – Distributed Shared Memory • Advantages: - User-friendly, shared-memory programming perspective - Data sharing between tasks is fast due to the proximity of memory and CPUs - More scalable than a single SMP • Disadvantages: - Lack of scalability between memory and CPUs - The programmer is responsible for the synchronization constructs that ensure "correct" access to global memory - Expensive: it becomes increasingly difficult and costly to design and produce shared memory machines with ever increasing numbers of processors

  16. UMA and NUMA • SGI Origin 3900 (NUMA): - 16 R14000A processors per brick, each brick with 32 GB of RAM - 12.8 GB/s aggregated memory bandwidth - Scales up to 512 processors and 1 TB of memory • The new Mac Pro (UMA) features 2 Intel Core 2 Duo processors that share a common central memory (up to 16 GB)

  17. Distributed Memory (DM) • Processors have their own local memory. • Memory addresses in one processor do not map to another processor (no global address space) • Because each processor has its own local memory, cache coherency does not apply • Requires a communication network to connect inter-processor memory • When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated • Synchronization between tasks is the programmer's responsibility • Very scalable • Cost effective: use of off-the-shelf processors and networking • Remote data access is slower than in UMA and NUMA machines
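A minimal sketch of explicit communication on a distributed-memory machine, assuming an MPI installation (e.g. mpicc and mpirun) and exactly two processes: rank 1 cannot address rank 0's memory, so the programmer moves the value with an explicit send/receive pair.

```c
/* Build and run with e.g.: mpicc send.c -o send && mpirun -np 2 ./send */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value;
    if (rank == 0) {
        value = 42;                                   /* local to rank 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```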

  18. Distributed Memory • [Diagram: several computers, each with its own CPU and memory, connected by a network interconnect] • TITAN@DEI, a PC cluster interconnected by FastEthernet

  19. Hybrid Architectures • Today, most systems are hybrids that combine shared and distributed memory. • Each node has several processors that share a central memory • A fast switch interconnects the nodes • In some cases the interconnect allows memory to be mapped across nodes; in most cases it provides a message passing interface • [Diagram: several multi-CPU shared-memory nodes connected by a fast network interconnect]
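A sketch of the hybrid style, assuming both an MPI installation and OpenMP support (e.g. mpicc -fopenmp): threads inside each node cooperate through shared memory, while the nodes combine their partial results by message passing over the interconnect.

```c
/* Build with e.g.: mpicc -fopenmp hybrid.c -o hybrid; run with mpirun. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Threads inside one node share local_sum (shared memory). */
    long local_sum = 0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += i % 7;

    /* Nodes combine their partial results over the interconnect. */
    long global_sum = 0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_LONG, MPI_SUM, 0,
               MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %ld\n", global_sum);
    MPI_Finalize();
    return 0;
}
```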

  20. ASCI White at the Lawrence Livermore National Laboratory • Each node is an IBM POWER3 375 MHz NH-2 16-way SMP (i.e. 16 processors/node) • Each node has 16 GB of memory • A total of 512 nodes, interconnected by a 2 GB/s node-to-node network • The 512 nodes contain a total of 8,192 processors and 8,192 GB of memory • It currently operates at 13.8 TFLOPS

  21. Summary
