
CS 267 Applications of Parallel Computers Supercomputing: The Past and Future



Presentation Transcript


  1. CS 267 Applications of Parallel Computers. Supercomputing: The Past and Future. Kathy Yelick, www.cs.berkeley.edu/~yelick/cs267_s07

  2. Outline • Historical perspective (1985 to 2005) from Horst Simon • Recent past: what’s new in 2007 • Major challenges and opportunities for the future Slide source: Horst Simon

  3. Signpost System 1985 Cray-2 • 244 MHz (4.1 nsec) • 4 processors • 1.95 Gflop/s peak • 2 GB memory (256 MW) • 1.2 Gflop/s LINPACK Rmax • 1.6 m² floor space • 0.2 MW power Slide source: Horst Simon

  4. Signpost System in 2005 IBM BG/L @ LLNL • 700 MHz (x 2.86) • 65,536 nodes (x 16,384) • 180 (360) Tflop/s peak (x 92,307) • 32 TB memory (x 16,000) • 135 Tflop/s LINPACK (x 110,000) • 250 m² floor space (x 156) • 1.8 MW power (x 9) Slide source: Horst Simon

  5. 1985 versus 2005
  In 1985: • custom-built vector mainframe platforms • 30 Mflop/s sustained is good performance • vector Fortran • proprietary operating system • remote batch only • no visualization • no tools, hand tuning only • dumb terminals • remote access via 9600 baud • a single software developer develops and codes everything • serial, vectorized algorithms
  In 2005: • commodity massively parallel platforms • 1 Tflop/s sustained is good performance • Fortran/C with MPI, object orientation • Unix, Linux • interactive use • visualization • parallel debuggers, development tools • high performance desktop • remote access via 10 Gb/s; grid tools • large groups develop software, with code sharing and reuse • parallel algorithms
  Slide source: Horst Simon

  6. The Top 10 Major Accomplishments in Supercomputing in the Past 20 Years • Horst Simon's list from 2005 • Selected by "impact" and "change in perspective" • 10) The TOP500 list • 9) NAS Parallel Benchmarks • 8) The "grid" • 7) Hierarchical algorithms: multigrid and fast multipole • 6) HPCC initiative and Grand Challenge applications • 5) Attack of the killer micros Slide source: Horst Simon

  7. #10) TOP500 - Listing of the 500 most powerful computers in the world - Yardstick: Rmax from LINPACK (Ax=b, dense problem) - Updated twice a year: ISC'xy in Germany (June), SC'xy in the USA (November) - All data available from www.top500.org - Good and bad effects of this list/competition
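
A minimal sketch of this yardstick, assuming nothing about HPL itself: it times a naive dense solve of Ax=b in C and reports Gflop/s using the standard LINPACK operation count of 2/3·n³ + 2·n². The matrix size, the no-pivoting solver, and the diagonally dominant test matrix are illustrative choices only; real Rmax runs use the highly tuned HPL code at much larger n.

```c
/* Hedged sketch, not HPL: time a naive dense solve of Ax=b and report
 * Gflop/s using the LINPACK operation count 2/3*n^3 + 2*n^2. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1000

static void solve(double *A, double *b) {
    for (int k = 0; k < N; k++) {              /* Gaussian elimination, no pivoting */
        for (int i = k + 1; i < N; i++) {
            double m = A[i*N + k] / A[k*N + k];
            for (int j = k; j < N; j++)
                A[i*N + j] -= m * A[k*N + j];
            b[i] -= m * b[k];
        }
    }
    for (int i = N - 1; i >= 0; i--) {         /* back substitution */
        double s = b[i];
        for (int j = i + 1; j < N; j++)
            s -= A[i*N + j] * b[j];
        b[i] = s / A[i*N + i];
    }
}

int main(void) {
    double *A = malloc(sizeof(double) * N * N);
    double *b = malloc(sizeof(double) * N);
    for (int i = 0; i < N; i++) {
        b[i] = 1.0;
        for (int j = 0; j < N; j++)
            A[i*N + j] = (i == j) ? N : 1.0;   /* diagonally dominant, so no pivoting needed */
    }
    clock_t t0 = clock();
    solve(A, b);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    double flops = 2.0/3.0 * (double)N*N*N + 2.0 * (double)N*N;
    printf("n=%d  time=%.3f s  %.3f Gflop/s\n", N, secs, flops / secs / 1e9);
    free(A);
    free(b);
    return 0;
}
```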

  8. TOP500 list - data shown • Manufacturer: manufacturer or vendor • Computer: type indicated by manufacturer or vendor • Installation Site: customer • Location: location and country • Year: year of installation / last major update • Customer Segment: Academic, Research, Industry, Vendor, Class. • # Processors: number of processors • Rmax: maximal LINPACK performance achieved • Rpeak: theoretical peak performance • Nmax: problem size for achieving Rmax • N1/2: problem size for achieving half of Rmax • Nworld: position within the TOP500 ranking

  9. [TOP500 performance projection chart, y-axis from 10 MFlop/s to 1 Eflop/s: a 1 Pflop/s system with ~1M cores projected for 2008, common by 2015? (a 6-8 year lag); data from top500.org] Slide source: Horst Simon

  10. Petaflop with ~1M Cores in your PC by 2025?

  11. #4 Beowulf Clusters • Thomas Sterling et al. established vision of low cost, high end computing • Demonstrated effectiveness of PC clusters for some (not all) classes of applications • Provided software and conveyed findings to broad community (great PR) through tutorials and book (1999) • Made parallel computing accessible to large community worldwide; broadened and democratized HPC; increased demand for HPC • However, effectively stopped HPC architecture innovation for at least a decade; narrower market for commodity systems Slide source: Horst Simon

  12. #3 Scientific Visualization • NSF Report, "Visualization in Scientific Computing", established the field in 1987 (edited by B.H. McCormick, T.A. DeFanti, and M.D. Brown) • Change in point of view: transformed computer graphics from a technology-driven subfield of computer science into a medium for communication • Added an artistic element • The role of visualization is "to reveal concepts that are otherwise invisible" (Krzysztof Lenk) Slide source: Horst Simon

  13. Before Scientific Visualization (1985) • Computer graphics typical of the time: • 2 dimensional • line drawings • black and white • “vectors” used to display vector field • Images from a CFD report at Boeing (1985). Slide source: Horst Simon

  14. After scientific visualization (1992) • The impact of scientific visualization seven years later: • 3 dimensional • use of "ribbons" and "tracers" to visualize the flow field • color used to characterize updraft and downdraft • Images from "Supercomputing and the Transformation of Science" by Kaufmann and Smarr, 1992; visualization by NCSA; simulation by Bob Wilhelmson, NCSA Slide source: Horst Simon

  15. #2 Message Passing Interface (MPI) Slide source: Horst Simon

  16. Parallel Programming 1988 • At the 1988 "Salishan" conference there was a bake-off of parallel programming languages trying to solve five scientific problems • The "Salishan Problems" (ed. John Feo, published 1992) investigated four programming languages: Sisal, Haskell, Unity, LGDF • Significant research activity at the time • The early work on parallel languages is all but forgotten today Slide source: Horst Simon

  17. Parallel Programming 1990 • The availability of real parallel machines moved the discussion from the domain of theoretical CS to the pragmatic application area • In this presentation (ca. 1990) Jack Dongarra lists six approaches to parallel processing • Note that message passing libraries are a sub-item under 2) Slide source: Horst Simon

  18. Parallel Programming 1994

  19. #1 Scaled Speed-Up

  20. The argument against massive parallelism (ca. 1988) • Amdahl's Law: speed = base_speed / ((1 - f) + f/nprocs) • Infinitely parallel machine: base_speed = 0.1, nprocs = infinity • Cray YMP: base_speed = 2.4, nprocs = 8 • Then speed(infinitely parallel) > speed(Cray YMP) only if f > .994 Slide source: Horst Simon
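
A small check of the slide's arithmetic (my sketch; the 0.001 search step and the loop bounds are arbitrary): plugging the slide's numbers into Amdahl's Law and scanning for the crossover parallel fraction reproduces the slide's threshold of roughly f > 0.994.

```c
/* Sketch of the slide's Amdahl's Law argument: an "infinitely parallel"
 * machine with base speed 0.1 (in the slide's units) beats an
 * 8-processor Cray YMP with base speed 2.4 only once the parallel
 * fraction f passes roughly 0.994. */
#include <stdio.h>

static double amdahl(double base_speed, double f, double nprocs) {
    return base_speed / ((1.0 - f) + f / nprocs);
}

int main(void) {
    for (double f = 0.90; f < 1.0; f += 0.001) {
        double mpp  = 0.1 / (1.0 - f);   /* limit of amdahl(0.1, f, nprocs) as nprocs -> infinity */
        double cray = amdahl(2.4, f, 8.0);
        if (mpp > cray) {
            printf("massively parallel machine wins once f reaches about %.3f\n", f);
            break;
        }
    }
    return 0;
}
```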

  21. Challenges for the Future • Petascale computing • Multicore and the memory wall • Performance understanding at scale • Topology-sensitive interconnects • Programming models for the masses

  22. Application Status in 2005 Parallel job size at NERSC • A few Teraflop/s sustained performance • Scaled to 512 - 1024 processors

  23. How to Waste Machine $ 2) Use a programming model in which you can’t utilize bandwidth or “low” latency

  24. Integrated Performance Monitoring (IPM) • brings together multiple sources of performance metrics into a single profile that characterizes the overall performance and resource usage of the application • maintains low overhead by using a unique hashing approach which allows a fixed memory footprint and minimal CPU usage • open source, relies on portable software technologies and is scalable to thousands of tasks • developed by David Skinner at NERSC (see http://www.nersc.gov/projects/ipm/ )
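
IPM's own source is the reference here; the sketch below is not IPM code, only an illustration of the fixed-memory-footprint hashing idea described above: each profiled event is folded into one of a fixed number of slots keyed by an (event id, message size) pair, so the profile's memory use never grows with the length of the run. The slot count, hash function, and event encoding are all invented for this example.

```c
/* Toy fixed-footprint event profile (illustrative only, not IPM). */
#include <stdint.h>
#include <stdio.h>

#define SLOTS 4096   /* fixed table size: memory footprint is constant */

typedef struct { uint64_t key; uint64_t calls; double seconds; } slot_t;
static slot_t table[SLOTS];

static void profile_event(uint32_t event_id, uint32_t bytes, double elapsed) {
    uint64_t key = ((uint64_t)event_id << 32) | bytes;
    uint64_t h = (key * 2654435761u) % SLOTS;          /* simple multiplicative hash */
    while (table[h].key != 0 && table[h].key != key)   /* linear probing on collision */
        h = (h + 1) % SLOTS;
    table[h].key = key;
    table[h].calls++;
    table[h].seconds += elapsed;
}

int main(void) {
    /* Fake a few message-passing events to show the accumulation. */
    profile_event(1, 1024, 2.0e-5);
    profile_event(1, 1024, 3.0e-5);
    profile_event(2, 65536, 1.1e-3);
    for (int i = 0; i < SLOTS; i++)
        if (table[i].key)
            printf("event %u, %u bytes: %llu calls, %g s\n",
                   (unsigned)(table[i].key >> 32),
                   (unsigned)(table[i].key & 0xffffffffu),
                   (unsigned long long)table[i].calls, table[i].seconds);
    return 0;
}
```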

  25. Scaling Portability: Profoundly Interesting A high level description of the performance of cosmology code MADCAP on four well known architectures. Source: David Skinner, NERSC

  26. 16-way for 4 seconds (about 20 timestamps per second per task) × (1…4 contextual variables)

  27. 64-way for 12 seconds

  28. Applications on Petascale Systems will need to deal with (Assume nominal Petaflop/s system with 100,000 commodity processors of 10 Gflop/s each) Three major issues: • Scaling to 100,000 processors and multi-core processors • Topology sensitive interconnection network • Memory Wall

  29. Even today's machines are interconnect topology sensitive • Four 16-processor IBM Power 3 nodes with a Colony switch

  30. Application Topology [communication-pattern figures: 1024-way MILC, 336-way FVCAM, 1024-way MADCAP] If the interconnect is topology sensitive, mapping will become an issue (again). "Characterizing Ultra-Scale Applications Communications Requirements", by John Shalf et al., submitted to SC05

  31. Interconnect Topology BG/L

  32. Applications on Petascale Systems will need to deal with (Assume nominal Petaflop/s system with 100,000 commodity processors of 10 Gflop/s each) Three major issues: • Scaling to 100,000 processors and multi-core processors • Topology sensitive interconnection network • Memory Wall

  33. The Memory Wall Source: “Getting up to speed: The Future of Supercomputing”, NRC, 2004

  34. Characterizing Memory Access Memory Access Patterns/Locality Source: David Koester, MITRE

  35. Apex-MAP characterizes architectures through a synthetic benchmark
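
Apex-MAP's synthetic kernel sweeps parameters for spatial and temporal locality; the probe below is only a toy in that spirit, not Apex-MAP's actual code or definitions. Memory is visited in contiguous blocks of length L (spatial locality) whose start indices are drawn from a skewed distribution controlled by ALPHA (temporal reuse); the parameter names, the distribution, and the sizes are all illustrative choices. Compile with -lm.

```c
/* Toy locality probe in the spirit of Apex-MAP (not its real code). */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

#define N (1 << 23)      /* working set: 8M doubles = 64 MB */
#define L 64             /* spatial locality: contiguous block length */
#define ALPHA 0.25       /* skew: small values concentrate accesses, 1.0 is near uniform */
#define TOUCHES 1000000  /* number of blocks visited */

int main(void) {
    double *a = malloc(sizeof(double) * N);
    for (long i = 0; i < N; i++) a[i] = 1.0;

    double sum = 0.0;
    clock_t t0 = clock();
    for (long t = 0; t < TOUCHES; t++) {
        double r = (double)rand() / RAND_MAX;               /* uniform in [0,1] */
        long start = (long)(pow(r, 1.0 / ALPHA) * (N - L)); /* skewed toward low indices */
        for (int j = 0; j < L; j++) sum += a[start + j];    /* stream one block */
    }
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    double mbytes = (double)TOUCHES * L * sizeof(double) / 1e6;
    printf("checksum %g: touched %.0f MB in %.3f s (%.0f MB/s)\n",
           sum, mbytes, secs, mbytes / secs);
    free(a);
    return 0;
}
```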

  36. Apex-Map Sequential

  37. Apex-Map Sequential

  38. Apex-Map Sequential

  39. Apex-Map Sequential

  40. Multicore: Is MPI Really that Bad? • Experiments by the NERSC SDSA group (Shalf, Carter, Wasserman, et al) • Single Core vs. Dual-core AMD Opteron. • Data collected on jaguar (ORNL XT3 system) • Small pages used except for MADCAP • Moving from single to dual core nearly doubles performance • Worst case is MILC, which is 40% below this doubling

  41. How to Waste Machine $ • Build a memory system in which you can’t utilize bandwidth that is there

  42. Challenge 2010 - 2018: Developing a New Ecosystem for HPC From the NRC Report on “The Future of Supercomputing”: • Platforms, software, institutions, applications, and people who solve supercomputing applications can be thought of collectively as an ecosystem • Research investment in HPC should be informed by the ecosystem point of view - progress must come on a broad front of interrelated technologies, rather than in the form of individual breakthroughs. Pond ecosystem image from http://www.tpwd.state.tx.us/expltx/eft/txwild/pond.htm

  43. Exaflop Programming? • Start with two Exaflop apps • One easy: if anything scales, this will • One hard: plenty of parallelism, but it's irregular, adaptive, asynchronous • Rethink algorithms • Scalability at all levels (including algorithmic) • Reducing bandwidth (compress data structures; see the sketch after this slide); reducing latency requirements • Design programming model to express this parallelism • Develop technology to automate as much as possible (parallelism, HL constructs, search-based optimization) • Consider spectrum of hardware possibilities • Analyze at various levels of detail (eliminating options when they are clearly infeasible) • Early prototypes (expect 90% failures) to validate
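
As one concrete (and deliberately simplistic) instance of the "compress data structures" bullet, the sketch below compares the bytes that must move for a large sparse matrix stored densely versus in compressed sparse row (CSR) form; the matrix size and the 27-nonzeros-per-row density are made-up stand-ins for a 3-D stencil problem.

```c
/* Back-of-the-envelope data volume: dense storage vs. CSR for an
 * illustrative sparse matrix (sizes are assumptions, not measurements). */
#include <stdio.h>

int main(void) {
    double n   = 1e6;        /* 10^6 x 10^6 matrix */
    double nnz = 27.0 * n;   /* ~27 nonzeros per row, e.g. a 3-D stencil */

    double dense_bytes = n * n * sizeof(double);
    double csr_bytes   = nnz * (sizeof(double) + sizeof(int))   /* values + column indices */
                       + (n + 1) * sizeof(int);                 /* row pointers */

    printf("dense: %.3g bytes, CSR: %.3g bytes (%.0fx less data to move)\n",
           dense_bytes, csr_bytes, dense_bytes / csr_bytes);
    return 0;
}
```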

  44. Technical Challenges in Programming Models • Open problems in language runtimes • Virtualization: away from SPMD model for load balance, fault tolerance, OS noise, etc. • Resource management: thread scheduler • What we do know how to do: • Build systems with dynamic load balancing (Cilk) that do not respect locality • Build systems with rigid locality control (MPI, UPC, etc.) that run at the speed of the slowest component • Put the programmer in control of resources: message buffers, dynamic load balancing
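
The trade-off in the last two bullets can be seen in a toy simulation (below; the task costs, rank count, and greedy scheduler are invented for illustration and model no particular runtime): with a few very expensive tasks, a rigid block partition waits on its most loaded rank, while a dynamic scheduler that ignores locality finishes far sooner. The open problem the slide points at is getting locality and balance at the same time.

```c
/* Toy comparison: static block partition vs. greedy dynamic scheduling
 * of irregular task costs (all numbers are illustrative assumptions). */
#include <stdio.h>
#include <stdlib.h>

#define TASKS 6400
#define RANKS 64

int main(void) {
    static double cost[TASKS];
    double busy[RANKS] = {0};
    for (int i = 0; i < TASKS; i++)                 /* mostly cheap tasks, a few expensive ones */
        cost[i] = 1.0 + (rand() % 100 == 0 ? 50.0 : 0.0);

    /* Static block partition: rank r owns tasks [r*chunk, (r+1)*chunk). */
    int chunk = TASKS / RANKS;
    double static_makespan = 0.0;
    for (int r = 0; r < RANKS; r++) {
        double t = 0.0;
        for (int i = r * chunk; i < (r + 1) * chunk; i++) t += cost[i];
        if (t > static_makespan) static_makespan = t;
    }

    /* Greedy dynamic scheduling: each task goes to the currently least loaded rank. */
    for (int i = 0; i < TASKS; i++) {
        int best = 0;
        for (int r = 1; r < RANKS; r++) if (busy[r] < busy[best]) best = r;
        busy[best] += cost[i];
    }
    double dynamic_makespan = 0.0;
    for (int r = 0; r < RANKS; r++)
        if (busy[r] > dynamic_makespan) dynamic_makespan = busy[r];

    printf("static block partition finishes at t = %.0f\n", static_makespan);
    printf("dynamic scheduling finishes at t = %.0f\n", dynamic_makespan);
    return 0;
}
```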

  45. Challenge 2015 - 2025: Winning the Endgame of Moore’s Law (Assume nominal Petaflop/s system with 100,000 commodity processors of 10 Gflop/s each) Three major issues: • Scaling to 100,000 processors and multi-core processors • Topology sensitive interconnection network • Memory Wall

  46. Summary • Applications will face (at least) three challenges • Scaling to 100,000s of processors • Interconnect topology • Memory access • Three sets of tools (applications benchmarks, performance monitoring, quantitative architecture characterization) have been shown to provide critical insight into applications performance

  47. Vanishing Electrons (2016) [chart: electrons per device, 10^4 down to 10^-1, versus year, 1985-2020; parenthesized labels 4M through 16G give transistors per chip] Source: Joel Birnbaum, HP, Lecture at APS Centennial, Atlanta, 1999

  48. ITRS Device Review 2016 Data from ITRS ERD Section, quoted from Erik DeBenedictis, Sandia Lab.
