CDA-5155 Computer Architecture Principles Fall 2000 Multiprocessor Architectures
Review • Protocols: reliable and heterogeneous networking • Interconnect technologies/topologies • Length, latency, diameter, blocking, deadlock, bisection BW, overheads, routing, congestion, connectionless? • CPU interface to memory hierarchy vs. network (SPEC) • Standardization key for LAN, WAN • Internetworking protocols used as LAN protocols • IC revolutionizing networks and processors • Switch is a specialized computer • Amdahl: High BW networks with high overheads
Overview • High performance computing • Parallelism • Taxonomy of multiprocessors • Programming models • Performance • ASCI – Accelerated Strategic Computing Initiative
High Performance Computing • Hardware and software • El Dorado - Attack of the killer micros • Microprocessor: the most cost-effective processor • Dynamic supercomputer market • Timesharing workloads • Multiprocessor vs. high performance uniprocessor • Performance and application domains • Throughput (multiprocessing workloads) • Timesharing, file, database, and web servers • Response time (parallel applications) • Single complex problem • Computation/communication = f(#processors, data size)
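The last bullet says the computation-to-communication ratio depends on the processor count and the data size. One common illustration, which is an assumption here and not taken from the slides, is a nearest-neighbour computation on an n x n grid split into square blocks over p processors: each processor computes n²/p points and exchanges about 4n/√p boundary points. The function name `comp_comm_ratio` is hypothetical.

```python
import math


def comp_comm_ratio(n: int, p: int) -> float:
    """Computation/communication ratio for an n x n grid on p processors.

    Assumes (illustration only) square block partitioning and a
    nearest-neighbour exchange along the four edges of each block.
    """
    side = n / math.sqrt(p)      # edge length of one processor's block
    computation = side * side    # grid points updated per processor
    communication = 4 * side     # boundary points exchanged per processor
    return computation / communication


# The ratio grows with data size and shrinks as processors are added.
print(comp_comm_ratio(1024, 16))
print(comp_comm_ratio(1024, 64))
print(comp_comm_ratio(4096, 64))
```

Under these assumptions the ratio works out to n/(4√p), so a larger problem amortizes communication better, while adding processors to a fixed problem makes communication relatively more expensive.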
Parallelism • Two or more things that happen at the same time • Granularity - size of computations performed at the same time between synchronizations • Carry lookahead adder • Pipelined processor • Two-way superscalar processor • Multiprocessor • COW • Levels of parallelism • Bit level • Instruction level • Thread level • Challenges (Amdahl’s law) • Limited amount of parallelism in programs • High cost of communication
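The Amdahl's law challenge can be made concrete with a short calculation. The sketch below uses the standard form of the law, speedup = 1 / ((1 - f) + f / N), where f is the fraction of the work that can run in parallel and N is the number of processors. The function name and the sample value f = 0.95 are illustrative choices, not from the slides.

```python
def amdahl_speedup(parallel_fraction: float, n_processors: int) -> float:
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


# With 5% serial work, the speedup can never exceed 1 / 0.05 = 20,
# no matter how many processors are added.
for n in (2, 16, 1024):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

This is why limited parallelism in programs is listed as a challenge: the serial fraction alone bounds the achievable speedup.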
Parallel Computers • Parallel computer: collection of processing elements that cooperate and communicate to solve large problems fast. • Questions about parallel computers: • How large a collection? • How powerful are processing elements? • How do they cooperate and communicate? • How are data transmitted? • What type of interconnection? • What are HW and SW primitives for programmer? • Does it translate into performance?
Taxonomy of Parallel Computers • Flynn: instruction (I) and data (D) streams
Shared Memory Model • Each processor can name every physical location in the machine via Load and Store • Data size: byte, word, ... or cache blocks • Process: a virtual address space (>= 1 thread of control) • Multiple processes can overlap (share), but ALL threads share a process address space • Writes to shared address space by one thread are visible to reads of other threads • Usual model: shared code, private stack, some shared heap, some private heap • Performance • Latency, BW, scalability when communicating?
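The usual model on this slide (private stack, some shared heap) can be sketched with threads inside one process. This is a minimal illustration using Python's `threading` module, not a description of any particular machine; `run_workers`, `shared_heap`, and the lock are hypothetical names. Each thread accumulates into a local variable (its private stack) and then publishes the result with an ordinary write to a shared object.

```python
import threading


def run_workers(n_threads: int, iterations: int) -> int:
    """Run n_threads workers that all update one shared counter."""
    shared_heap = {"counter": 0}   # shared heap: visible to every thread
    lock = threading.Lock()        # serializes updates to the shared counter

    def worker() -> None:
        local_sum = 0              # private: lives on this thread's stack
        for _ in range(iterations):
            local_sum += 1
        with lock:                 # write to shared memory under the lock
            shared_heap["counter"] += local_sum

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                   # after join, all writes are visible here
    return shared_heap["counter"]


print(run_workers(4, 1000))  # prints 4000
```

Communication here is implicit: no thread sends anything, the others simply read what was stored. The cost of that convenience is the need for explicit synchronization such as the lock.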
Message Passing Model • Nodes: whole computers (CPU, RAM, I/O) • Communication: explicit I/O operations • Send (local buffer, remote process) • Recv (local buffer, remote process) • Synchronization • When send completes • When buffer free • When request accepted • Necessary even for 1 processor
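The Send/Recv pair can be sketched as follows. As a simplifying assumption, two threads stand in for the two nodes, and they exchange data only through the two ends of a `multiprocessing.Pipe`, never through shared variables. The names `node_b` and `run_exchange` and the "ack" reply are hypothetical.

```python
import threading
from multiprocessing import Pipe


def node_b(conn, results):
    """Receiving node: waits for a message, works on it, acknowledges."""
    local_buffer = conn.recv()          # Recv: blocks until data arrives
    results.append(sum(local_buffer))   # compute on the received copy
    conn.send("ack")                    # tell the sender the request was accepted


def run_exchange(data):
    """Sending node: ships a buffer to node_b and waits for the ack."""
    end_a, end_b = Pipe()               # one connection end per node
    results = []
    receiver = threading.Thread(target=node_b, args=(end_b, results))
    receiver.start()
    end_a.send(list(data))              # Send: the buffer is copied into the message
    ack = end_a.recv()                  # synchronization: wait for acceptance
    receiver.join()
    return ack, results[0]


print(run_exchange([1, 2, 3]))  # prints ('ack', 6)
```

The explicit acknowledgement corresponds to the "when request accepted" synchronization point; returning from `send` corresponds to the weaker "when buffer free" point, since the sender may then reuse its local buffer.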
Shared Memory • Figure: three two-machine configurations (machine1, machine2), each machine drawn as a layer stack: Application, Language run-time system, Operating system, Hardware
Vector Addition • Figure labels: 2 load pipes & 1 store pipe; 2 load/store pipes
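A rough way to see why the pipe configuration matters for vector addition: computing c[i] = a[i] + b[i] needs two loads and one store per element. The sketch below counts memory cycles under strong simplifying assumptions that are not from the slides: each pipe completes one memory operation per cycle, and startup latency, chaining, and bank conflicts are ignored. The function names are hypothetical.

```python
import math


def cycles_dedicated(n: int, load_pipes: int = 2, store_pipes: int = 1) -> int:
    """Memory cycles for an n-element vector add with separate load and store pipes."""
    load_cycles = math.ceil(2 * n / load_pipes)    # two operands per element
    store_cycles = math.ceil(n / store_pipes)      # one result per element
    return max(load_cycles, store_cycles)          # the pipes work concurrently


def cycles_shared(n: int, pipes: int = 2) -> int:
    """Memory cycles when every pipe can either load or store."""
    return math.ceil(3 * n / pipes)                # three memory ops per element


n = 64
print(cycles_dedicated(n))  # 64: one element per cycle
print(cycles_shared(n))     # 96: 1.5 cycles per element
```

Under these assumptions, three memory operations per element over only two pipes make memory bandwidth the bottleneck, while the 2-load, 1-store arrangement matches the access pattern of vector addition exactly.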
Crossbar-Based SMP Sun Enterprise 10000
ASCI Program • Accelerated Strategic Computing Initiative • Major boost to the HPC industry • Architecture: clusters of RISC-based SMP nodes • Goals (1995 – 2004) • 1 Teraflops: Intel/Sandia ASCI Red • 3 Teraflops: SGI/LLNL ASCI Blue • 10 Teraflops: IBM/LLNL ASCI White • 30 Teraflops: ? • 100 Teraflops: ?
Intel/Sandia ASCI Red • 160 m² • 200-MHz Pentium Pro • Nodes: service, compute, I/O, and system • Six-link router chip (dimensional, wormhole routing) • Link BW: 400 MB/sec (full duplex)
Customer • Figure: chart with segments of 2%, 3%, 5%, 17%, 49%, and 24%; one segment labeled Government