HTMT-class Latency Tolerant Parallel Architecture for Petaflops-scale Computation

Explore Dr. Thomas Sterling's HTMT architecture for achieving petaflops-scale computation using parallel and latency-tolerant design. This architecture is applicable to various fields, including rational drug design, tomographic reconstruction, neural networks, and more.

Presentation Transcript


  1. HTMT-class Latency Tolerant Parallel Architecture for Petaflops-scale Computation
     Dr. Thomas Sterling
     California Institute of Technology and NASA Jet Propulsion Laboratory
     October 1, 1999

  2. Dr. Thomas Sterling - HTMT Petaflops Architecture

  4. Rational Drug Design Nanotechnology Tomographic Reconstruction Phylogenetic Trees Biomolecular Dynamics Neural Networks Crystallography Fracture Mechanics MRI Imaging Reservoir Modelling Molecular Modelling Biosphere/Geosphere Diffraction Inversion Problems Distribution Networks Chemical Dynamics Atomic Scattering Electrical Grids Flow in Porous Media Pipeline Flows Data Assimilation Signal Processing Condensed Matter Electronic Structure Plasma Processing Chemical Reactors Cloud Physics Electronic Structure Boilers Combustion Actinide Chemistry Radiation CVD Graph Theoretic Fourier Methods Quantum Chemistry Reaction-Diffusion Chemical Reactors Cosmology Transport n-body Astrophysics Multiphase Flow Manufacturing Systems CFD Basic Algorithms & Numerical Methods Discrete Events Weather and Climate PDE Air Traffic Control Military Logistics Structural Mechanics Seismic Processing Population Genetics Monte Carlo ODE Multibody Dynamics Geophysical Fluids VLSI Design Transportation Systems Aerodynamics Raster Graphics Economics Fields Orbital Mechanics Nuclear Structure Ecosystems QCD Pattern Matching Symbolic Processing Neutron Transport Economics Models Genome Processing Virtual Reality Cryptography Astrophysics Electromagnetics Computer Vision Virtual Prototypes Intelligent Search Multimedia Collaboration Tools Computer Algebra Databases Magnet Design Computational Steering Scientific Visualization Data Mining Automated Deduction Number Theory CAD Intelligent Agents

  7. A 10 Gflops Beowulf
     172 Intel Pentium Pro microprocessors
     Center for Advanced Computing Research, California Institute of Technology

  8. Emergence of Beowulf Clusters

  9. 1st printing: May 1999; 2nd printing: Aug. 1999; MIT Press

  11. Beowulf Scalability

  12. Integrated SMP - WDM (figure labels)
      • VLIW/RISC core: 24 GFLOPS, 6 GHz (label shown twice, followed by "...")
      • 2nd-level cache: 96 MBytes, 64 bytes wide, 160 GBytes/sec (label shown twice, followed by "...")
      • DRAM: 4 GBytes, highly interleaved
      • Multi-lambda AON
      • Crossbar, coherence: 640 GBytes/sec

  13. COTS PetaFlop System (figure)
      • 128 die/box, 4 CPU/die
      • Multi-die multi-processor boxes numbered 1 to 64 around an ALL-OPTICAL SWITCH
      • I/O
      • 10 meters = 50 ns delay
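
The "10 meters = 50 ns delay" label matches light travelling through optical fiber at roughly two-thirds of its vacuum speed. A minimal check, assuming a group index of about 1.5 (a typical value for silica fiber; the slide does not state one):

```python
# One-way propagation delay over the 10 m fiber run labelled on the slide.
C_VACUUM_M_PER_S = 299_792_458  # speed of light in vacuum
GROUP_INDEX = 1.5               # assumption: typical silica fiber, not from the slide


def fiber_delay_ns(length_m: float, group_index: float = GROUP_INDEX) -> float:
    """Return the one-way propagation delay in nanoseconds."""
    return length_m * group_index / C_VACUUM_M_PER_S * 1e9


delay_ns = fiber_delay_ns(10.0)
print(f"{delay_ns:.1f} ns")
```

Under that assumption the delay comes out at about 5 ns per meter, so the stated figure is a cable-length effect rather than a switching cost.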

  14. COTS PetaFlops System
      • 8192 Dies (4 CPU/die minimum)
      • Each Die is 120 GFlops
      • 1 PetaFlop Peak
      • Power: 8192 x 200 Watts = 1.6 MegaWatts
      • Extra Main Memory: >3 MegaWatts (512 TBytes)
      • 15.36 TFlops/Rack (128 die)
      • 30 KWatts/Rack - thus 64 racks - 30 inch
      • Common System I/O
      • 2 Level Main Memory
      • Optical Interconnect
      • OC768 Channels (40 GHz)
      • 128 Channels per Die (DWDM): 5.12 THz
      • ALL Optical Switching
      • Bisection Bandwidth of 50 TBytes/sec
      • 15 TFlops/rack * 0.1 bytes/flop/sec * 32 racks
      • Rack Bandwidth: 15 TFlops * 0.1 = 12 THz
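
The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from the slide, with one labelled assumption: the "12 THz" rack bandwidth is read as 12 Tbit/s, i.e. 1.5 TBytes/sec at 8 bits per byte and one bit per Hz.

```python
# Inputs copied from the slide.
DIES = 8192
GFLOPS_PER_DIE = 120
WATTS_PER_DIE = 200
DIES_PER_RACK = 128
CHANNELS_PER_DIE = 128
GHZ_PER_CHANNEL = 40           # OC768
BYTES_PER_FLOP = 0.1
ROUNDED_TFLOPS_PER_RACK = 15   # the slide rounds 15.36 down to 15
BISECTION_RACKS = 32           # half of the 64 racks

peak_pflops = DIES * GFLOPS_PER_DIE / 1e6
cpu_power_mw = DIES * WATTS_PER_DIE / 1e6
racks = DIES // DIES_PER_RACK
tflops_per_rack = DIES_PER_RACK * GFLOPS_PER_DIE / 1e3
die_optical_thz = CHANNELS_PER_DIE * GHZ_PER_CHANNEL / 1e3

rack_tbytes_per_s = ROUNDED_TFLOPS_PER_RACK * BYTES_PER_FLOP
bisection_tbytes_per_s = rack_tbytes_per_s * BISECTION_RACKS
# Assumption: 8 bits per byte and 1 bit per Hz, which is how "12 THz" is read here.
rack_tbits_per_s = rack_tbytes_per_s * 8

print(f"peak            {peak_pflops:.3f} PFlops")
print(f"CPU power       {cpu_power_mw:.2f} MW")
print(f"racks           {racks}")
print(f"per rack        {tflops_per_rack:.2f} TFlops")
print(f"per die optics  {die_optical_thz:.2f} THz")
print(f"bisection       {bisection_tbytes_per_s:.0f} TBytes/sec")
print(f"rack bandwidth  {rack_tbits_per_s:.0f} Tbit/s")
```

The peak works out slightly under 1 PFlops and the bisection bandwidth to 48 TBytes/sec, which the slide rounds to 1 PetaFlop and 50 TBytes/sec.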

  15. The SIA CMOS Roadmap

  16. Requirements for High End Systems
      • Bulk capabilities
        • performance
        • storage capacities
        • throughput/bandwidth
        • cost, power, complexity
      • Efficiency
        • overhead
        • latency
        • contention
        • starvation/parallelism
      • Usability
        • generality
        • programmability
        • reliability

  17. Points of Inflection in the History of Computing
      • Heroic Era (1950)
        • technology: vacuum tubes, mercury delay lines, pulse transformers
        • architecture: accumulator based
        • model: von Neumann, sequential instruction execution
        • examples: Whirlwind, EDSAC
      • Mainframe (1960)
        • technology: transistors, core memory, disk drives
        • architecture: register bank based
        • model: virtual memory
        • examples: IBM 7090, PDP-1

  18. Points of Inflection in the History of Computing
      • Supercomputers (1980)
        • technology: ECL, semiconductor integration, RAM
        • architecture: pipelined
        • model: vector
        • example: Cray-1
      • Massively Parallel Processing (1990)
        • technology: VLSI, microprocessor
        • architecture: MIMD
        • model: Communicating Sequential Processes, message passing
        • examples: TMC CM-5, Intel Paragon
      • ? (2000)

  20. HTMT Objectives
      • Scalable architecture with high sustained performance in the presence of disparate cycle times and latencies
      • Exploit diverse device technologies to achieve a substantially superior operating point
      • Execution model to simplify parallel system programming and expand generality and applicability

  21. Hybrid Technology MultiThreaded Architecture (figure)
      Blocks: DRAM PIM, 3D Mem, I/O FARM, OPTICAL SWITCH, SRAM PIM, RSFQ Nodes
      Function labels, in transcript order:
      • Compress/Decompress
      • ECC/Redundancy
      • Compress/Decompress
      • Spectral Transforms
      • Compress/Decompress
      • Routing
      • Data Structure Initializations
      • “In the Memory” Operations
      • RSFQ Thread Management
      • Context Percolation
      • Scatter/Gather Indexing
      • Pointer chasing
      • Push/Pull Closures
      • Synchronization Activities

  23. Storage Capacity by Subsystem: 2007 Design Point

  25. HTMT Strategy
      • High performance
        • Superconductor RSFQ logic
        • Data Vortex optical interconnect network
        • PIM smart memory
      • Low power
        • Superconductor RSFQ logic
        • Optical holographic storage
        • PIM smart memory

  26. HTMT Strategy (cont)
      • Low cost
        • reduce wire count through chip-to-chip fiber
        • reduce processor count through x100 clock speed
        • reduce memory chips by 3/2 holographic memory layer
      • Efficiency
        • processor level multithreading
        • smart memory managed second stage context pushing multithreading
        • fine grain regular & irregular data parallelism exploited in memory
        • high memory bandwidth and low latency ops through PIM
        • memory to memory interactions without processor intervention
        • hardware mechanisms for synchronization, scheduling, data/context migration, gather/scatter
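
Why processor-level multithreading helps with latency can be sketched with a common textbook model, which is not taken from the slides. Each thread computes for `run_cycles`, then waits `latency_cycles` on memory, and context switches are assumed to be free. The 100-cycle run length and 100 ns memory latency below are illustrative assumptions; only the 100 GHz clock appears in the deck.

```python
import math


def utilization(threads: int, run_cycles: float, latency_cycles: float) -> float:
    """Fraction of processor cycles spent on useful work (zero-cost switches assumed)."""
    return min(1.0, threads * run_cycles / (run_cycles + latency_cycles))


def threads_to_saturate(run_cycles: float, latency_cycles: float) -> int:
    """Smallest thread count that keeps the processor busy all the time."""
    return math.ceil((run_cycles + latency_cycles) / run_cycles)


CLOCK_GHZ = 100            # from the deck's superconductor processor slide
MEMORY_LATENCY_NS = 100    # assumption for illustration
RUN_CYCLES = 100           # assumption: cycles of work between memory waits

latency_cycles = MEMORY_LATENCY_NS * CLOCK_GHZ  # ns * cycles/ns = 10,000 cycles

print(utilization(1, RUN_CYCLES, latency_cycles))
print(threads_to_saturate(RUN_CYCLES, latency_cycles))
```

Under these assumptions a single thread keeps the processor busy about 1% of the time, and on the order of a hundred concurrent threads are needed to hide the latency. The model ignores context-switch cost and bandwidth limits, so it gives a lower bound on the thread count.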

  27. HTMT Strategy (cont)
      • Programmability
        • global shared name space
        • hierarchical parallel thread flow control model
        • no explicit processor naming
        • automatic latency management
        • automatic processor load balancing
        • runtime fine grain multithreading
        • automatic context pushing for process migration (percolation)
        • configuration transparent, runtime scalable

  28. RSFQ Roadmap (VLSI Circuit Clock Frequency)

  29. RSFQ Building Block (figure labels: JJ1, JJ2, L1)

  31. Advantages
      • x100 clock speeds achievable
      • x100 power efficiency advantage
      • Easier fabrication
      • Leverage semiconductor fabrication tools
      • First technology to encounter ultra-high speed operation

  32. Superconductor Processor
      • 100 GHz clock, 33 GHz inter-chip
      • 0.8 micron Niobium on Silicon
      • 100K gates per chip
      • 0.05 watts per processor
      • 100 kW per Petaflops

  36. Data Vortex Optical Interconnect

  38. Data Vortex Latency Distribution (figure; network height = 1024)

  39. • Single-mode rib waveguides on silicon-on-insulator wafers‡
      • Hybrid sources and detectors
      • Mix of CMOS-like and ‘micromachining’-type processes for fabrication
      ‡ e.g.:
        R A Soref, J Schmidtchen & K Petermann, IEEE J. Quantum Electron. 27, p. 1971 (1991)
        A Rickman, G T Reed, B L Weiss & F Namavar, IEEE Photonics Technol. Lett. 4, p. 633 (1992)
        B Jalali, P D Trinh, S Yegnanarayanan & F Coppinger, IEE Proc. Optoelectron. 143, p. 307 (1996)

  40. PIM Provides Smart Memory
      • Merge logic and memory
      • Integrate multiple logic/mem stacks on single chip
      • Exposes high intrinsic memory bandwidth
      • Reduction of memory access latency
      • Low overhead for memory oriented operations
      • Manages data structure manipulation, context coordination and percolation
      Figure labels: Basic Silicon Macro (Sense Amps, Memory Stack, Decode, Node Logic); Single Chip

  41. Multithreaded PIM DRAM
      Multithreaded Control of PIM Functions
      • multiple operation sequences with low context switching overhead
      • maximize memory utilization and efficiency
      • maximize processor and I/O utilization
      • multiple banks of row buffers to hold data, instructions, and addresses
      • data parallel basic operations at row buffer
      • manages shared resources such as FP
      Direct PIM to PIM Interaction
      • memory communicates with memory within and across chip boundaries without external control processor intervention by “parcels”
      • exposes fine grain parallelism intrinsic to vector and irregular data structures
      • e.g. pointer chasing, block moves, synchronization, data balancing
      Figure labels: Boolean ALU, Memory Stack, Row Registers, GP - ALU, Context Registers, Row Buffers, Node Logic, Hi Speed Links (Firewire), Memory Bus I/F (PCI), FP (x2)
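
The pointer-chasing example can be illustrated with a toy simulation. The slides do not describe the parcel format or the address mapping, so everything below is a hypothetical sketch: the `Parcel` fields, the `owner` interleaving function, and the four-node configuration are assumptions made for illustration. The idea is that the work request travels to whichever memory node holds the next list element, so no central processor issues a request and waits for a reply for each element.

```python
from dataclasses import dataclass
from typing import Dict, Optional

NUM_NODES = 4  # hypothetical number of PIM nodes


def owner(addr: int) -> int:
    """Hypothetical address-to-node mapping (low-order interleaving)."""
    return addr % NUM_NODES


@dataclass
class Parcel:
    addr: int          # address of the element the parcel last visited
    hops: int = 0      # node-to-node transfers made so far
    visited: int = 0   # list elements traversed so far


def chase(memory: Dict[int, Optional[int]], head: int, start_node: int = 0) -> Parcel:
    """Follow next-pointers until None, moving the parcel to the node owning each element."""
    parcel = Parcel(addr=head)
    here = start_node
    addr: Optional[int] = head
    while addr is not None:
        node = owner(addr)
        if node != here:        # element lives elsewhere: forward the parcel
            parcel.hops += 1
            here = node
        parcel.visited += 1
        parcel.addr = addr
        addr = memory[addr]     # local read of the next-pointer at the owning node
    return parcel


# A four-element linked list: address -> next address (None terminates).
linked_list = {0: 5, 5: 9, 9: 2, 2: None}
result = chase(linked_list, head=0)
print(result.visited, result.hops)
```

In this example the parcel visits four elements with two node-to-node transfers, because consecutive elements on the same node need no network traffic. A processor-driven traversal would instead need a request and a reply for every element that is remote from the processor.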

  42. Silicon Budget for HTMT DRAM PIM
      • Designed to provide proper balance of memory & support for fiber bandwidth
      • Different Vortex configurations => different #s
      • In 2004, 16 TB = 4096 groups of 64 chips
      • Each Chip, By Area (chart labels): Fiber WDM Optical Receiver Interface HRAM & Vortex Output SuperScalar Core Memory Logic
      Figure labels: 32MB (x4), FtPt (x4), ASAP (x4)
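
The "16 TB = 4096 groups of 64 chips" line implies a per-chip capacity, worked out below. Binary units (1 TB = 2^40 bytes, 1 MB = 2^20 bytes) are an assumption; the slide does not say which convention it uses.

```python
TOTAL_TBYTES = 16      # from the slide
GROUPS = 4096          # from the slide
CHIPS_PER_GROUP = 64   # from the slide

chips = GROUPS * CHIPS_PER_GROUP
# Assumption: binary units, 1 TB = 2**40 bytes and 1 MB = 2**20 bytes.
mbytes_per_chip = TOTAL_TBYTES * 2**40 // chips // 2**20

print(chips, mbytes_per_chip)
```

That gives 262,144 chips at 64 MB each under the binary-unit assumption.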

  43. Holographic 3/2 Memory
      Performance Scaling (figure)
      Advantages
      • petabyte memory
      • competitive cost
      • 10 sec access time
      • low power
      • efficient interface to DRAM
      Disadvantages
      • recording rate is slower than the readout rate for LiNbO3
      • recording must be done in GB chunks
      • long term trend favors DRAM unless new materials and lasers are used

  44. Side View (figure). Labels: 4 K, 50 W, 77 K, Fiber/Wire Interconnects. Dimensions: 0.3 m, 1.4 m, 1 m, 1 m, 3 m, 0.5 m

  45. Side View (figure). Labels: Nitrogen, Helium, Tape Silo Array (400 Silos), Hard Disk Array (40 cabinets), 4 K, 50 W, 77 K, Fiber/Wire Interconnects, Front End Computer Server, Console, Cable Tray Assembly, 220 Volts (x2), Generator (x2), WDM Source, 980 nm Pumps (20 cabinets), Optical Amplifiers. Dimensions: 3 m, 3 m, 0.5 m

  46. HTMT Facility (Top View) (figure). Labels: Cryogenics Refrigeration Room. Dimensions: 15 m, 27 m, 27 m, 25 m

  47. Floor Area

  48. Power Dissipation by Subsystem: Petaflops Design Point

  49. Subsystem Interfaces: 2007 Design Point
      • Same colors indicate a connection between subsystems
      • Horizontal lines group interfaces within a subsystem
