
Trends in Supercomputing for the Next Five Years


Presentation Transcript


  1. Trends in Supercomputing for the Next Five Years Katherine Yelick http://www.cs.berkeley.edu/~yelick/cs267 Based in part on lectures by Horst Simon and David Bailey

  2. Five Computing Trends • Continued rapid processor performance growth following Moore’s law • Open software model (Linux) will become standard • Network bandwidth will grow at an even faster rate than Moore’s Law • Aggregation, centralization, colocation • Commodity products everywhere

  3. Overview • High-end machines in general • Processors • Interconnect • Systems software • Programming models • Look at the 3 Japanese HPCs • Examine the Top131

  4. History of High Performance Computers [chart, 1980–2010: aggregate systems performance, single-CPU performance, and CPU frequencies on log scales (1 Mflop/s to 1 Pflop/s; 10 MHz to 10 GHz); the widening gap between aggregate and single-CPU performance reflects increasing parallelism; systems shown include the CRAY-2, X-MP, Y-MP8, C90, T90, T3D, T3E, CM-5, Paragon, NWT/166, VPP500/700/800/5000, SX-2 through SX-6, SR2201, SR8000, ASCI Red, ASCI Blue, ASCI Blue Mountain, ASCI White, ASCI Q, and the Earth Simulator]

  5. Analysis of TOP500 Data • Annual total performance growth is about a factor of 1.82 • Two factors contribute almost equally to that growth: the number of processors per system grows on average by a factor of 1.30 per year, and per-processor performance grows by a factor of 1.40 per year (1.30 × 1.40 ≈ 1.82), compared to 1.58 per year for Moore's Law • Efficiency relative to hardware peak is declining (Strohmaier, Dongarra, Meuer, and Simon, Parallel Computing 25, 1999, pp. 1517-1544)

  6. Performance Extrapolation [chart: TOP500 performance extrapolated forward in time, with a laptop marked for comparison]

  7. Analysis of TOP500 Extrapolation Based on extrapolating these fits we predict: • First 100 TFlop/s system by 2005, about 1–2 years later than the ASCI Path Forward plans • No system smaller than 1 TFlop/s should be able to make the TOP500 • First petaflop system available around 2009 • Technologies used in HPC systems change rapidly, so a projection for architecture/technology is difficult • Continue to expect rapid cycles of re-definition (a rough extrapolation is sketched below)
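
A minimal sketch of the kind of extrapolation behind these predictions, assuming the roughly 1.82x-per-year growth quoted on the previous slide and an illustrative 2004 starting point (the starting value below is an assumption, not a figure from the list):

```c
#include <math.h>
#include <stdio.h>

/* Extrapolate TOP500-style growth: perf(year) = perf0 * growth^(year - year0).
 * The 1.82x annual factor is from the Strohmaier/Dongarra/Meuer/Simon analysis;
 * the 2004 starting value is illustrative only. */
int main(void) {
    const double growth = 1.82;   /* annual performance growth factor */
    const double perf0  = 35.0;   /* assumed Tflop/s of a leading system in 2004 */
    const int    year0  = 2004;

    for (int year = year0; year <= 2010; year++)
        printf("%d: ~%.0f Tflop/s\n", year, perf0 * pow(growth, year - year0));
    return 0;
}
```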

  8. What About Efficiency? • Talking about Linpack • What should the efficiency of a machine on the Top131 be? Percent of peak for Linpack >90%? >80%? >70%? >60%? … • Remember this is O(n³) operations on O(n²) data • Mostly matrix multiply (a small efficiency calculation is sketched below)
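
A minimal sketch of that efficiency number, assuming it is simply the achieved Linpack rate (Rmax) divided by the theoretical peak (Rpeak); the figures below are the widely published Earth Simulator values, used only as an example:

```c
#include <stdio.h>

/* Linpack efficiency = achieved Rmax / theoretical peak Rpeak.
 * Linpack on an n x n matrix performs roughly (2/3)n^3 + 2n^2 flops
 * on only n^2 data, which is why such high fractions of peak are possible. */
static double efficiency_pct(double rmax_gflops, double rpeak_gflops) {
    return 100.0 * rmax_gflops / rpeak_gflops;
}

int main(void) {
    double rmax = 35860.0, rpeak = 40960.0;   /* Earth Simulator, Gflop/s */
    printf("Linpack efficiency: %.1f%% of peak\n", efficiency_pct(rmax, rpeak));
    return 0;
}
```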

  9. Efficiency Is Declining Over Time • Analysis of the top 100 machines in 1994 and 2004 • Shows the number of machines in the top 100 that achieve a given efficiency on the Linpack benchmark • In 1994, 40 machines had >90% efficiency • In 2004, 50 machines have <50% efficiency

  10. [chart: performance vs. rank for the leading systems: ES, ASCI Q, VT-Apple, NCSA, PNNL, LANL Lightning, LLNL MCR, ASCI White, NERSC, LLNL]

  11. Architecture/Systems Continuum (loosely coupled to tightly coupled) • Commodity processor with commodity interconnect: clusters built from Pentium, Itanium, Opteron, Alpha, or PowerPC with GigE, Infiniband, Myrinet, Quadrics, or SCI; NEC TX7; HP Alpha; Bull NovaScale 5160 • Commodity processor with custom interconnect: SGI Altix (Intel Itanium 2), Cray Red Storm (AMD Opteron), IBM Regatta • Custom processor with custom interconnect: Cray X1, NEC SX-7, IBM Blue Gene/L (commodity IBM PowerPC core) • Note: commodity here means not designed solely for HPC

  12. Vibrant Field for High Performance Computers • Cray X1, SGI Altix, IBM Regatta, Sun, HP, Bull, Fujitsu PrimePower, Hitachi SR11000, NEC SX-7, Apple • Coming soon: Cray RedStorm, Cray BlackWidow, NEC SX-8, IBM Blue Gene/L

  13. Off-the-Shelf Processors • AMD Opteron: 2 GHz, 4 Gflop/s peak • HP Alpha EV68: 1.25 GHz, 2.5 Gflop/s peak • IBM PowerPC: 2 GHz, 8 Gflop/s peak • Intel Itanium 2: 1.5 GHz, 6 Gflop/s peak • Intel Pentium Xeon / Pentium EM64T: 3.2 GHz, 6.4 Gflop/s peak • MIPS R16000: 700 MHz, 1.4 Gflop/s peak • Sun UltraSPARC IV: 1.2 GHz, 2.4 Gflop/s peak (peak-rate arithmetic sketched below)
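
Each peak figure is just clock rate times floating-point operations per cycle; a small sketch, with the flops-per-cycle values inferred from the clock and peak numbers listed above:

```c
#include <stdio.h>

/* Peak Gflop/s = clock (GHz) * floating-point operations per cycle.
 * The flops/cycle values are inferred from the clocks and peaks on this slide. */
struct proc { const char *name; double ghz; int flops_per_cycle; };

int main(void) {
    struct proc procs[] = {
        { "AMD Opteron",       2.0,  2 },
        { "HP Alpha EV68",     1.25, 2 },
        { "IBM PowerPC",       2.0,  4 },
        { "Intel Itanium 2",   1.5,  4 },
        { "Intel Xeon",        3.2,  2 },
        { "MIPS R16000",       0.7,  2 },
        { "Sun UltraSPARC IV", 1.2,  2 },
    };
    for (unsigned i = 0; i < sizeof procs / sizeof procs[0]; i++)
        printf("%-18s %.1f Gflop/s peak\n",
               procs[i].name, procs[i].ghz * procs[i].flops_per_cycle);
    return 0;
}
```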

  14. Itanium 2 Processor • Floating-point loads bypass the level-1 cache • Bus is 128 bits wide and operates at 400 MHz, for 6.4 GB/s • 4 flops/cycle at 1.5 GHz • Linpack numbers (theoretical peak 6 Gflop/s): 100×100: 1.7 Gflop/s; 1000×1000: 5.4 Gflop/s

  15. Pentium 4 IA32 • Processor of choice for clusters • 1 flop/cycle, 2 with SSE2 • Intel Xeon 3.2 GHz, 400/533 MHz bus, 64 bits wide (3.2/4.2 GB/s) • Linpack numbers (peak 6.4 Gflop/s): 100×100: 1.7 Gflop/s; 1000×1000: 3.1 Gflop/s • Coming soon: "Pentium 4 EM64T" • 64-bit • 800 MHz bus, 64 bits wide • 3.6 GHz, 2 MB L2 cache • Peak 7.2 Gflop/s using SSE2

  16. Interconnects

  17. High Bandwidth vs. Commodity Systems • High-bandwidth systems have traditionally been vector computers • Designed for scientific problems; capability computing • Commodity processors are designed for web servers and the home PC market (we should be thankful the manufacturers keep 64-bit floating point) • Used in cluster-based computers to leverage their price point • Scientific computing needs are different: they require a better balance between data movement and floating-point operations, which results in greater efficiency

  18. Commodity Interconnects • Gig Ethernet • Myrinet (Clos) • Infiniband (fat tree) • QsNet (fat tree) • SCI (torus)

  19. Commodity Interconnects • Price/performance drives the commodity market • Bandwidth more than latency (see the cost-per-bandwidth sketch below)

      Interconnect  Switch topology  NIC $/node  Switch $/node  Total $/node  Latency (us) / MPI BW (MB/s)  MB/s per $
      GigE          Bus              $50         $50            $100          30  / 100                     1.0
      SCI           Torus            $1,600      $0             $1,600        5   / 300                     0.2
      QsNetII       Fat tree         $1,200      $1,700         $2,900        3   / 880                     0.3
      Myrinet       Clos             $700        $400           $1,100        6.5 / 240                     0.2
      IB 4x         Fat tree         $1,000      $400           $1,400        6   / 820                     0.6
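
A quick sketch of the last column, assuming it is simply MPI bandwidth divided by total per-node cost (the numbers are copied from the table above):

```c
#include <stdio.h>

/* MB/s per dollar = MPI bandwidth / total per-node cost (NIC + switch share). */
struct net { const char *name; double total_cost_usd, mpi_bw_mbs; };

int main(void) {
    struct net nets[] = {
        { "GigE",     100.0, 100.0 },
        { "SCI",     1600.0, 300.0 },
        { "QsNetII", 2900.0, 880.0 },
        { "Myrinet", 1100.0, 240.0 },
        { "IB 4x",   1400.0, 820.0 },
    };
    for (unsigned i = 0; i < sizeof nets / sizeof nets[0]; i++)
        printf("%-8s %.1f MB/s per $\n",
               nets[i].name, nets[i].mpi_bw_mbs / nets[i].total_cost_usd);
    return 0;
}
```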

  20. Interconnects Used: Efficiency for Linpack

      Interconnect   Largest node count  Min   Max   Average
      GigE           1024                17%   63%   37%
      SCI            120                 64%   64%   64%
      QsNetII        2000                68%   78%   74%
      Myrinet        1250                36%   79%   59%
      Infiniband 4x  1100                58%   69%   64%
      Proprietary    9632                45%   98%   68%

  21. [chart: the same leading systems as slide 10 (ES, ASCI Q, VT-Apple, NCSA, PNNL, LANL Lightning, LLNL MCR, ASCI White, NERSC, LLNL), with numeric annotations that did not survive extraction]

  22. Machines Built for HEC

  23. Cray X1: Parallel Vector Architecture • 12.8 Gflop/s vector processors • 4-processor nodes sharing up to 64 GB of memory • Single system image up to 4096 processors • 64 CPUs / 800 Gflop/s per liquid-cooled (LC) cabinet

  24. HW Resources Visible to Software [diagram contrasting Vector IRAM and a Pentium III: which hardware resources are visible to software and which are transparent to it] • Software (applications/compiler/OS) can control main memory, registers, and execution datapaths

  25. Special Purpose: GRAPE-6 • The 6th generation of the GRAPE (Gravity Pipe) project • Gravity (N-body) calculation for many particles, at 31 Gflop/s per chip • 32 chips per board: 0.99 Tflop/s per board • 64 boards in the full system installed at the University of Tokyo: 63 Tflop/s • On each board, all particle data are stored in SRAM; each target particle is injected into the pipeline and its acceleration is computed (see the sketch below) • No software! • Gordon Bell Prize at SC for a number of years (Prof. Makino, U. Tokyo)
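
For reference, a minimal software rendering of what one GRAPE pipeline evaluates in hardware: the gravitational acceleration on a target particle, summed by direct evaluation over the particles held in on-board memory (the softening parameter and data layout below are illustrative assumptions):

```c
#include <math.h>

/* Direct-summation gravitational acceleration on one target particle,
 * the computation a GRAPE pipeline performs for each injected target
 * against the particle set held in on-board SRAM. The softening term
 * eps avoids the singularity at zero separation; G is folded into the
 * unit system. */
typedef struct { double x, y, z, mass; } particle;

void accel(const particle *set, int n, const particle *target,
           double eps, double a[3]) {
    a[0] = a[1] = a[2] = 0.0;
    for (int j = 0; j < n; j++) {
        double dx = set[j].x - target->x;
        double dy = set[j].y - target->y;
        double dz = set[j].z - target->z;
        double r2 = dx*dx + dy*dy + dz*dz + eps*eps;
        double inv_r3 = 1.0 / (r2 * sqrt(r2));
        a[0] += set[j].mass * dx * inv_r3;
        a[1] += set[j].mass * dy * inv_r3;
        a[2] += set[j].mass * dz * inv_r3;
    }
}
```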

  26. Leveraging the Low End

  27. Characteristics of Blue Gene/L • Machine Peak Speed 180 Teraflop/s • Total Memory 16 Terabytes • Foot Print 2500 sq. ft. • Total Power 1.2 MW • Number of Nodes 65,536 • Power Dissipation/CPU 7 W • MPI Latency 5 microsec

  28. Building Blue Gene/L [image from LLNL]

  29. Sony PlayStation2 • Emotion Engine: 6 Gflop/s peak • Superscalar MIPS 300 MHz core + vector coprocessor + graphics/DRAM • About $200; 529M sold • 8 KB D-cache; 32 MB memory, not expandable (the OS lives there as well) • 32-bit floating point; not IEEE compliant • 2.4 GB/s to memory (0.38 bytes/flop; see the sketch below) • Potential 20 floating-point ops/cycle: FPU with FMAC+FDIV, VPU1 with 4 FMAC+FDIV, VPU2 with 4 FMAC+FDIV, EFU with FMAC+FDIV • See the PS2 cluster project at UIUC • What about PS3?
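
The bytes-per-flop balance quoted above is just memory bandwidth divided by peak flop rate; a tiny sketch using the Emotion Engine figures from this slide (the slide's 0.38 B/flop corresponds to a slightly higher effective flop rate than the 6 Gflop/s peak used here):

```c
#include <stdio.h>

/* Machine balance = memory bandwidth / peak floating-point rate.
 * Figures from the slide above: 2.4 GB/s to memory, ~6 Gflop/s peak. */
int main(void) {
    double bw_gb_per_s  = 2.4;
    double peak_gflop_s = 6.0;
    printf("balance: %.2f bytes/flop\n", bw_gb_per_s / peak_gflop_s);
    return 0;
}
```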

  30. High-Performance Chips for Embedded Applications • The driving market is gaming (PCs and game consoles), which is the main motivation for almost all of the technology developments • They demonstrate that arithmetic is quite cheap • It is not clear that they do much for scientific computing • Today there are several big problems with these apparently non-standard "off-the-shelf" chips: most have very limited memory bandwidth and little if any support for inter-node communication; integer-only or only 32-bit floating point; no software support to map scientific applications onto them; poor memory capacity for program storage • Developing "custom" software is much more expensive than developing custom hardware

  31. Choosing the Right Option • Good hardware options are available • There is a large national investment in scientific software that is dedicated to current massively parallel hardware architectures: the Scientific Discovery through Advanced Computing (SciDAC) initiative in DOE; the Accelerated Strategic Computing Initiative (ASCI) in DOE; the supercomputing centers of the National Science Foundation (NCSA, NPACI, Pittsburgh); cluster computing in universities and labs • There is a software cost for each hardware option, but the problem can be solved

  32. HPCS Program [timeline: Phase I, '03, 1 yr at $3M/yr; Phase II, '06, 3 yr at $18M/yr; Phase III, '10, 4 yr at $50M/yr; vendors shown include Cray, HP, IBM, SGI, and Sun, with "??" marking awards not yet decided] • Phase I (Concept Study): critical technology assessments; revolutionary HPCS concept solutions; new productivity metrics; requirements, scalable benchmark strategies, and metrics • Phase II (Research & Development): develop and evaluate groundbreaking technologies that can contribute to DARPA's productivity objectives; design reviews, risk-reduction prototypes, and demonstrations that contribute to a preliminary design; challenges and promising solutions identified during the concept study will be explored, developed, and simulated or prototyped • Phase III (Full-Scale Development & Manufacturing): pilot systems, Serial 001 in 2010

  33. Options for New Architectures [table comparing each option by software impact, cost, timeliness, and risk factors; the row entries did not survive extraction]

  34. Software

  35. Is MPI the Right Programming Model? • The programming model has not changed in 10 years • What's wrong with MPI? • Not bad for regular applications: bulk-synchronous code with balanced load (sketched below) • What about fine-grained programs? Pack/unpack • What about fine-grained asynchronous programs? Pack/unpack, prepost, check-for-done • No explicit notion of distributed data structures
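
A minimal sketch of the bulk-synchronous style the slide refers to: every rank computes on its local data, then exchanges boundary values with its neighbors in one synchronized step (the ring pattern and array size are illustrative assumptions, not from the slides):

```c
#include <mpi.h>
#include <stdio.h>

#define N 1024   /* local block size, illustrative */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local[N], halo = 0.0;
    for (int i = 0; i < N; i++) local[i] = rank;   /* local compute phase */

    /* Bulk-synchronous exchange: each rank sends its last element to the
     * right neighbor and receives one from the left, then all proceed together. */
    int right = (rank + 1) % size, left = (rank - 1 + size) % size;
    MPI_Sendrecv(&local[N - 1], 1, MPI_DOUBLE, right, 0,
                 &halo,         1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Barrier(MPI_COMM_WORLD);                   /* explicit global synchronization */

    printf("rank %d got halo %.0f from rank %d\n", rank, halo, left);
    MPI_Finalize();
    return 0;
}
```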

  36. Global Address Space Languages • Static parallelism (like MPI) in all 3 languages • The globally shared address space is partitioned • References (pointers) are either local or global (meaning possibly remote) • Distributed arrays and pointer-based structures (a plain-C sketch of the idea follows) [diagram: processors p0, p1, …, pn each own a partition of the global address space holding local (l:) and global (g:) references and objects such as x: 1, y: 2; object heaps are shared, program stacks are private]
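
Not any of the actual PGAS languages, but a hypothetical plain-C sketch of the distinction the slide draws: a global reference carries its owning process, so a dereference is either an ordinary local access or a remote fetch (the runtime hooks my_rank and remote_get are assumed for illustration, not real APIs):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical illustration of a partitioned global address space:
 * a "global pointer" records which process owns the data, so the
 * runtime can distinguish cheap local accesses from possibly-remote
 * ones. PGAS languages build this distinction into the type system. */
typedef struct {
    int   owner;   /* process that holds the data          */
    void *local;   /* address within the owner's partition */
} global_ptr;

/* assumed runtime hooks, for illustration only */
extern int  my_rank(void);
extern void remote_get(int owner, const void *src, void *dst, size_t len);

void gp_read(global_ptr p, void *dst, size_t len) {
    if (p.owner == my_rank()) {
        memcpy(dst, p.local, len);               /* local reference: plain copy */
    } else {
        remote_get(p.owner, p.local, dst, len);  /* global reference: one-sided fetch */
    }
}
```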

  37. What's Wrong with PGAS Languages? • Flat parallelism model, but machines are not flat: vectors, streams, SIMD, VLIW, FPGAs, PIMs, SMPs, nodes, … • No support for dynamic load balancing • They virtualize details of the memory structure, but there is no virtualization of the processor space • No fault tolerance • The SPMD model is not a good fit • Little understanding of scientific problems, although CAF and Titanium have multidimensional arrays and numeric debugging • The base languages are not that great • Nevertheless, they are the right next step

  38. To Virtualize or Not to Virtualize • Why virtualize: portability; fault tolerance; machine variability; application-level load imbalance • Why not to virtualize: deep memory hierarchies; expensive system overhead; performance for problems that match the hardware • But if we spend all our time on these problems, we'll always be in a niche

  39. Running HEC Centers

  40. NERSC's Strategy Until 2010: the Oakland Scientific Facility • New machine room: 20,000 ft2, with an option to expand to 40,000 ft2 • Includes ~50 offices and a 6-megawatt electrical supply • It's a deal: $1.40/ft2 when Oakland rents are >$2.50/ft2 and rising!

  41. The Oakland Facility Machine Room

  42. Power and cooling are major costs of ownership of modern supercomputers • Expandable to 6 megawatts

  43. Metropolis Center at LANL – home of the 30 Tflop/s Q machine

  44. Strategic Computing Complex at LANL • 303,000 gross sq. ft. • 43,500 sq. ft. of unobstructed computer room; Q consumes approximately half of this space • 1 Powerwall Theater (6×4 stereo = 24 screens) • 4 Collaboration rooms (3×2 stereo = 6 screens), 2 secure and 2 open (1 of each initially) • 2 Immersive rooms • Design Simulation Laboratories (200 classified, 100 unclassified) • 200-seat auditorium

  45. Earth Simulator Building

  46. For the Next Decade, the Most Powerful Supercomputers Will Increase in Size • Power and cooling are also increasingly problematic, but there are limiting forces in those areas • Increased power density and RF leakage power will limit clock frequency and the amount of logic [Shekhar Borkar, Intel] • So the linear extrapolation of operating temperatures to rocket-nozzle values by 2010 is likely to be wrong • [before/after images of machine rooms: "became this, and will get bigger"]

  47. “I used to think computer architecture was about how to organize gates and chips – not about building computer rooms” Thomas Sterling, Salishan, 2001

  48. The End

  49. Processor Trends (summary) • The Earth Simulator is a singular event • It may become a turning point for supercomputing technology in the US • Return to vectors is unlikely, but more vigorous investment in alternate technology is likely • Independent of architecture choice we will stay on Moore’s Law curve

  50. Five Computing Trends for the Next Five Years • Continued rapid processor performance growth following Moore’s law • Open software model (Linux) will become standard • Network bandwidth will grow at an even faster rate than Moore’s Law • Aggregation, centralization, colocation • Commodity products everywhere
