Parallel Computing Systems Part I: Introduction

  1. Parallel Computing SystemsPart I: Introduction Dror Feitelson Hebrew University

  2. Topics • Overview of the field • Architectures: vectors, MPPs, SMPs, and clusters • Networks and routing • Scheduling parallel jobs • Grid computing • Evaluating performance

  3. Today (and next week?) • What is parallel computing • Some history • The Top500 list • The fastest machines in the world • Trends and predictions

  4. What is a Parallel System? In particular, what is the difference between parallel and distributed computing?

  5. What is a Parallel System? Chandy: it is related to concurrency. • In distributed computing, concurrency is part of the problem. • In parallel computing, concurrency is part of the solution.

  6. Distributed Systems • Concurrency because of physical distribution • Desktops of different users • Servers across the Internet • Branches of a firm • Central bank computer and ATMs • Need to coordinate among autonomous systems • Need to tolerate failures and disconnections

  7. Parallel Systems • High-performance computing: solve problems that are too big for a single machine • Get the solution faster (weather forecast) • Get a better solution (physical simulation) • Need to parallelize the algorithm • Need to control overhead • Can we assume a friendly system?

  8. The Convergence Use distributed resources for parallel processing • Networks of workstations – use available desktop machines within organization • Grids – use available resources (servers?) across organizations • Internet computing – use personal PCs across the globe (SETI@home)

  9. Some History

  10. Early HPC • Parallel systems in academia/research • 1974: C.mmp • 1974: Illiac IV • 1978: Cm* • 1983: Goodyear MPP

  11. Illiac IV • 1974 • SIMD: all processors do the same thing • Numerical calculations at NASA • Now in the Boston computer museum

  12. The Illiac IV in Numbers • 64 processors arranged as an 8 × 8 grid • Each processor has 10⁴ ECL transistors • Each processor has 2K 64-bit words (total is 8 Mbit) • Arranged in 2¹⁰ boards • Packed in 16 cabinets • 500 Mflops peak performance • Cost: $31 million

  13. Sustained vs. Peak • Peak performance: product of the clock rate and the number of functional units (a rate that the vendor guarantees will not be exceeded) • Sustained rate: what you actually achieve on a real application • Sustained is typically much lower than peak • Application does not require all functional units • Need to wait for data to arrive from memory • Need to synchronize • Best for dense matrix operations (Linpack)
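A worked example of the peak-rate arithmetic. The clock rate and peak figure are the Cray-1 numbers quoted on the slides below; the split into one add pipeline and one multiply pipeline, each producing a result per cycle, is an assumption used here for illustration.

    \[
    \text{peak} = f_{\text{clock}} \times (\text{floating-point results per cycle})
    \qquad
    \text{Cray-1: } 80~\text{MHz} \times 2~(\text{add} + \text{multiply pipes}) = 160~\text{Mflops}
    \]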

  14. Early HPC • Parallel systems in academia/research • 1974: C.mmp • 1974: Illiac IV • 1978: Cm* • 1983: Goodyear MPP • Vector systems by Cray and Japanese firms • 1976: Cray 1 rated at 160 Mflops peak • 1982: Cray X-MP, later Y-MP, C90, … • 1985: Cray 2, NEC SX-2

  15. Cray’s Achievements • Architectural innovations • Vector operations on vector registers • All memory is equally close: no cache • Trade off accuracy for speed • Packaging • Short, equal-length wires • Liquid cooling systems • Style

  16. Vector Supercomputers • Vector registers hold whole vectors of values for fast access • Vector instructions operate on whole vectors of values • Overhead of instruction decode is paid only once per vector • Pipelined execution of the instruction on the vector elements: one result per clock tick (once the pipeline is full) • Possible to chain vector operations: start feeding the second functional unit before the first one finishes
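For concreteness, the loop below (a minimal C sketch, not taken from the talk) is the kind of kernel these machines were built for: a vectorizing compiler loads strips of x and y into vector registers, issues one vector multiply and one vector add whose execution can be chained, and pays the instruction-decode overhead once per strip instead of once per element.

    /* DAXPY: y = a*x + y, the classic vectorizable kernel (and the core
       of the Linpack benchmark mentioned earlier). */
    void daxpy(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* one multiply and one add per element,
                                         pipelined and chained on a vector machine */
    }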

  17. Cray 1 • 1975 • 80 MHz clock • 160 Mflops peak • Liquid cooling • World’s most expensive love seat • Power supply and cooling under the seat • Available in red, blue, black… • No operating system

  18. Cray 1 Wiring • Round configuration for small and uniform distances • Longest wire: 4 feet • Wires connected manually by extra-small engineers

  19. Cray X-MP • 1982 • 1 Gflop • Multiprocessor with 2 or 4 Cray-1-like processors • Shared memory

  20. Cray X-MP

  21. Cray 2 • 1985 • Smaller and more compact than Cray 1 • 4 (or 8) processors • Total immersion liquid cooling

  22. Cray Y-MP • 1988 • 8 proc’s • Achieved 1 Gflop

  23. Cray Y-MP – Opened

  24. Cray Y-MP – From Back • Power supply and cooling

  25. Cray C90 • 1992 • 1 Gflop per processor • 8 or more processors

  26. The MPP Boom • 1985: Thinking Machines introduces the Connection Machine CM-1 • 16K single-bit processors, SIMD • Followed by CM-2, CM-200 • Similar machines by MasPar • mid ’80s: hypercubes become successful • Also: Transputers used as building blocks • Early ’90s: big companies join • IBM, Cray

  27. SIMD Array Processors • ’80s favorites • Connection Machine • MasPar • Very many single-bit processors with attached memory – proprietary hardware • Single control unit: everything is totally synchronized (SIMD = single instruction, multiple data) • Massive parallelism even with “correct counting” (i.e. dividing by 32: 64K single-bit processors count as 2K word-wide ones)

  28. Connection Machine CM-2 • Cube of 64K proc’s • Acts as backend • Hyper-cube topology • Data vault for parallel I/O

  29. Hypercubes • Early ’80s: Caltech 64-node Cosmic Cube • Mid to late ’80s: Commercialized by several companies • Intel iPSC, iPSC/2, iPSC/860 • nCUBE, nCUBE 2 (later turned into a VoD server…) • Early ’90s: replaced by mesh/torus • Intel Paragon – i860 processors • Cray T3D, T3E – Alpha processors

  30. Transputers • A microprocessor with built-in support for communication • Programmed using Occam • Used in Meiko and other systems

    PAR
      SEQ
        x := 13
        c ! x
      SEQ
        c ? y
        z := y      -- z is 13

  Synchronous communication: an assignment across processes

  31. Attack of the Killer Micros • Commodity microprocessors advance at a faster rate than vector processors • Takeover point was around year 2000 • Even before that, using many together could provide lots of power • 1992: TMC uses SPARC in CM-5 • 1992: Intel uses i860 in Paragon • 1993: IBM SP uses RS/6000, later PowerPC • 1993: Cray uses Alpha in T3D • Berkeley NoW project

  32. Connection Machine CM-5 • 1992 • SPARC-based • Fat-tree network • Dominant in early ’90s • Featured in Jurassic Park • Support for gang scheduling!

  33. Intel Paragon • 1992 • 2 i860 proc’s per node: one for computation, one for communication • Mesh interconnect with spiffy display

  34. Cray T3D/T3E • 1993 – Cray T3D • Uses commodity microprocessors (DEC Alpha) • 3D Torus interconnect • 1995 – Cray T3E

  35. IBM SP • 1993 • 16 RS/6000 processors per rack • Each runs AIX (full Unix) • Multistage network • Flexible configurations • First large IUCC machine

  36. Berkeley NoW • The building is the computer • Just need some glue software…

  37. Not Everybody is Convinced… • Japan’s computer industry continues to build vector machines • NEC • SX series of supercomputers • Hitachi • SR series of supercomputers • Fujitsu • VPP series of supercomputers • Albeit with less style

  38. Fujitsu VPP700

  39. NEC SX-4

  40. More Recent History • 1994 – 1995 slump • Cold war is over • Thinking Machines files for Chapter 11 • Kendall Square Research (KSR) files for Chapter 11 • Late ’90s much better • IBM, Cray retain the parallel machine market • Later also SGI, Sun, especially with SMPs • ASCI program is started • 21st century: clusters take over • Based on SMPs

  41. SMPs • Machines with several CPUs • Initially small scale: 8-16 processors • Later achieved large scale of 64-128 processors • Global shared memory accessed via a bus • Hard to scale further due to shared memory and cache coherence

  42. SGI Challenge • 1 to 16 processors • Bus interconnect • Dominated low end of Top500 list in mid ’90s • Not only graphics…

  43. SGI Origin • MIPS processors • Remote memory access
  (Photo: an Origin 2000 installed at IUCC)

  44. Architectural Convergence • Shared memory used to be uniform (UMA) • Based on bus or crossbar • Conventional load/store operations • Distributed memory used message passing • Newer machines support remote memory access • Nonuniform (NUMA): access to remote memory costs more • Put/get operations (but handled by NIC) • Cray T3D/T3E, SGI Origin 2000/3000
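To make the put/get style concrete, here is a minimal sketch using MPI's one-sided operations as a stand-in; the machines named above exposed their own vendor-specific remote-memory interfaces (such as Cray's SHMEM on the T3D/T3E), so the API below illustrates the model rather than what those machines shipped.

    /* A sketch of remote memory access: rank 0 writes a value directly
       into rank 1's memory without rank 1 issuing a receive.
       Run with at least two processes (e.g. mpirun -np 2). */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = 0.0;                      /* memory exposed for remote access */
        MPI_Win win;
        MPI_Win_create(&local, sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        if (rank == 0) {
            double value = 3.14;
            MPI_Put(&value, 1, MPI_DOUBLE,       /* what to put              */
                    1, 0, 1, MPI_DOUBLE, win);   /* into rank 1, at offset 0 */
        }
        MPI_Win_fence(0, win);                   /* after this, rank 1 sees 3.14 in local */

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }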

  45. The ASCI Program • 1996: nuclear test ban leads to need for simulation of nuclear explosions • Accelerated Strategic Computing Initiative: Moore’s law not fast enough… • Budget of a billion dollars

  46. The Vision • [Chart: performance vs. time, with labels for ASCI requirements, market-driven progress, PathForward, and technology transfer]

  47. ASCI Milestones • 1996 – ASCI Red: 1 TF Intel • 1998 – ASCI Blue Mountain: 3 TF • 1998 – ASCI Blue Pacific: 3 TF • 2001 – ASCI White: 10 TF • 2003 – ASCI Purple: 30 TF? (so far two thirds delivered)

  48. The ASCI Red Machine • 9260 processors – Pentium Pro 200 MHz • Arranged as 4-way SMPs in 86 cabinets • 573 GB memory total • 2.25 TB disk space total • 2 miles of cables • 850 kW peak power consumption • 44 tons (+300 tons of air-conditioning equipment) • Cost: $55 million

  49. Clusters vs. MPPs • Mix-and-match approach • PCs/SMPs/blades used as processing nodes • Fast switched network for interconnect • Linux on each node • MPI for software development • Some software for system management • Lower cost to set up • Non-trivial to operate effectively
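As a minimal sketch of the software side (plain C with MPI; nothing here is specific to any particular cluster), each node runs its own process and all interaction goes through explicit messages over the switched network:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank != 0) {
            /* every worker node reports in to process 0 */
            MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        } else {
            for (int i = 1; i < size; i++) {
                int who;
                MPI_Recv(&who, 1, MPI_INT, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("hello from node %d of %d\n", who, size);
            }
        }
        MPI_Finalize();
        return 0;
    }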

  50. SMP Nodes • PCs, workstations, or servers with several CPUs • Small scale (4-8) used as nodes in MPPs or clusters • Access to shared memory via shared L2 cache • SMP support (cache coherence) built into modern microprocessors
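In contrast to the message-passing sketch above, code on an SMP node can simply share memory. A minimal sketch, assuming OpenMP (which the slide does not mention): the threads touch the same array through ordinary loads and stores, and the cache-coherence hardware keeps each CPU's cached copies consistent.

    /* Scale a shared array in place on an SMP node; compile with -fopenmp. */
    void scale(int n, double a, double *x)
    {
        #pragma omp parallel for          /* roughly one thread per CPU */
        for (int i = 0; i < n; i++)
            x[i] = a * x[i];              /* plain loads/stores to shared memory */
    }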
