100 TF Sustained on Cray X Series


Presentation Transcript


  1. 100 TF Sustained on Cray X Series SOS 8 April 13, 2004 James B. White III (Trey) trey@ornl.gov

  2. Disclaimer • The opinions expressed here do not necessarily represent those of the CCS, ORNL, DOE, the Executive Branch of the Federal Government of the United States of America, or even UT-Battelle.

  3. Disclaimer (cont.) • Graph-free, chart-free environment • For graphs and charts: http://www.csm.ornl.gov/evaluation/PHOENIX/

  4. 100 Real TF on Cray Xn • Who needs capability computing? • Application requirements • Why Xn? • Laundry, Clean and Otherwise • Rants • Custom vs. Commodity • MPI • CAF • Cray

  5. Who needs capability computing? • OMB? • Politicians? • Vendors? • Center directors? • Computer scientists?

  6. Who needs capability computing? • Application scientists • According to scientists themselves

  7. Personal Communications • Fusion • General Atomics, Iowa, ORNL, PPPL, Wisconsin • Climate • LANL, NCAR, ORNL, PNNL • Materials • Cincinnati, Florida, NC State, ORNL, Sandia, Wisconsin • Biology • NCI, ORNL, PNNL • Chemistry • Auburn, LANL, ORNL, PNNL • Astrophysics • Arizona, Chicago, NC State, ORNL, Tennessee

  8. Scientists Need Capability • Climate scientists need simulation fidelity to support policy decisions • All we can say now is that humans cause warming • Fusion scientists need to simulate fusion devices • All we can do now is model decoupled subprocesses at disparate time scales • Materials scientists need to design new materials • Just starting to reproduce known materials

  9. Scientists Need Capability • Biologists need to simulate proteins and protein pathways • Baby steps with smaller molecules • Chemists need similar increases in complexity • Astrophysicists need to simulate nucleogenesis (high-res, 3D CFD, 6D neutrinos, long times) • Low-res, 3D CFD, approximate 3D neutrinos, short times

  10. Why Scientists Might Resist • Capacity also needed • Software isn’t ready • Coerced to run capability-sized jobs on inappropriate systems

  11. Capability Requirements • Sample DOE SC applications • Climate: POP, CAM • Fusion: AORSA, Gyro • Materials: LSMS, DCA-QMC

  12. Parallel Ocean Program (POP) • Baroclinic • 3D, nearest neighbor, scalable • Memory-bandwidth limited • Barotropic • 2D implicit system, latency bound • Ocean-only simulation • Higher resolution • Faster time steps • As ocean component for CCSM • Atmosphere dominates
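
A minimal sketch (not POP source) of the communication pattern behind the baroclinic bullet: each rank trades one-cell halos with its neighbors, so the step is bounded by memory and neighbor bandwidth rather than latency. The barotropic solver's latency-bound global sums are the pattern shown later in the MPI rant. The subroutine name and the east/west-only decomposition are illustrative assumptions.

```fortran
! Hypothetical sketch of a nearest-neighbor halo exchange in the x direction.
! The strided faces are copied to contiguous temporaries by the compiler,
! which is acceptable for blocking calls like MPI_Sendrecv.
subroutine halo_exchange_ew(field, nx, ny, nz, east, west, comm)
  use mpi
  implicit none
  integer, intent(in) :: nx, ny, nz, east, west, comm
  real(8), intent(inout) :: field(0:nx+1, ny, nz)   ! one-cell halo in x
  integer :: ierr

  ! Send the easternmost owned column east, receive the west halo, and vice versa.
  call MPI_Sendrecv(field(nx,:,:), ny*nz, MPI_REAL8, east, 1, &
                    field(0,:,:),  ny*nz, MPI_REAL8, west, 1, &
                    comm, MPI_STATUS_IGNORE, ierr)
  call MPI_Sendrecv(field(1,:,:),    ny*nz, MPI_REAL8, west, 2, &
                    field(nx+1,:,:), ny*nz, MPI_REAL8, east, 2, &
                    comm, MPI_STATUS_IGNORE, ierr)
end subroutine halo_exchange_ew
```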

  13. Community Atmospheric Model (CAM) • Atmosphere component for CCSM • Higher resolution? • Physics changes, parameterization must be retuned, model must be revalidated • Major effort, rare event • Spectral transform not dominant • Dramatic increases in computation per grid point • Dynamic vegetation, carbon cycle, atmospheric chemistry, … • Faster time steps

  14. All-Orders Spectral Algorithm (AORSA) • Radio-frequency fusion-plasma simulation • Highly scalable • Dominated by ScaLAPACK • Still in weak-scaling regime • But… • Expanded physics reducing ScaLAPACK dominance • Developing sparse formulation

  15. Gyro • Continuum gyrokinetic simulation of fusion-plasma microturbulence • 1D data decomposition • Spectral method - high communication volume • Some need for increased resolution • More iterations

  16. Locally Self-Consistent Multiple Scattering (LSMS) • Calculates electronic structure of large systems • One atom per processor • Dominated by local DGEMM • First real application to sustain a TF • But… moving to sparse formulation with a distributed solve for each atom
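
For concreteness, a hedged sketch of the kernel class the bullet names, not LSMS code: a local dense matrix multiply through the standard BLAS DGEMM interface. The matrix size is an arbitrary illustration; the point is that BLAS3 reuses each operand many times from cache and registers, which is why a DGEMM-dominated code could be the first real application to sustain a TF.

```fortran
! Sketch of a per-process dense kernel: C <- alpha*A*B + beta*C via DGEMM.
program local_dgemm
  implicit none
  integer, parameter :: n = 1024          ! illustrative size only
  real(8), allocatable :: a(:,:), b(:,:), c(:,:)
  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)
  c = 0.0d0
  ! Standard BLAS: DGEMM(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
  print *, 'c(1,1) =', c(1,1)
end program local_dgemm
```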

  17. Dynamic Cluster Approximation (DCA-QMC) • Simulates high-temp superconductors • Dominated by DGER (BLAS2) • Memory-bandwidth limited • Quantum Monte Carlo, but… • Fixed start-up per process • Favors fewer, faster processors • Needs powerful processors to avoid parallelizing each Monte-Carlo stream
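
Again a hedged sketch, not DCA-QMC code: the standard BLAS2 DGER rank-1 update performs only about two flops per matrix element it touches, so there is essentially no data reuse and the kernel tracks memory bandwidth rather than peak flops. The problem size is an arbitrary illustration.

```fortran
! Minimal sketch of the BLAS2 kernel class: A <- A + alpha*x*y^T.
! Every element of A is loaded and stored for ~2 flops, so performance
! is set by memory bandwidth, not by the processor's peak rate.
program rank1_update
  implicit none
  integer, parameter :: n = 4096          ! illustrative size only
  real(8), allocatable :: a(:,:), x(:), y(:)
  allocate(a(n,n), x(n), y(n))
  a = 0.0d0
  call random_number(x)
  call random_number(y)
  ! Standard BLAS: DGER(M, N, ALPHA, X, INCX, Y, INCY, A, LDA)
  call dger(n, n, 1.0d0, x, 1, y, 1, a, n)
  print *, 'a(1,1) =', a(1,1)
end program rank1_update
```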

  18. Few DOE SC Applications • Weak-ish scaling • Dense linear algebra • But moving to sparse

  19. Many DOE SC Applications • “Strong-ish” scaling • Limited increase in gridpoints • Major increase in expense per gridpoint • Major increase in time steps • Fewer, more-powerful processors • High memory bandwidth • High-bandwidth, low-latency communication

  20. Why X1? • “Strong-ish” scaling • Limited increase in gridpoints • Major increase in expense per gridpoint • Major increase in time steps • Fewer, more-powerful processors • High memory bandwidth • High-bandwidth, low-latency communication

  21. Tangent: Strongish* Scaling * Greg Lindahl, Vendor Scum • Firm • Semistrong • Unweak • Strongoidal • MSTW (More Strong Than Weak) • JTSoS (Just This Side of Strong) • WNS (Well-Nigh Strong) • Seak, Steak, Streak, Stroak, Stronk • Weag, Weng, Wong, Wrong, Twong

  22. X1 for 100 TF Sustained? • Uh, no • OS not scalable, fault-resilient enough for 10^4 processors • That “price/performance” thing • That “power & cooling” thing

  23. Xn for 100 TF Sustained • For DOE SC applications, YES • Most-promising candidate -or- • Least-implausible candidate

  24. Why X, again? • Most-powerful processors • Reduce need for scalability • Obey Amdahl’s Law • High memory bandwidth • See above • Globally addressable memory • Lowest, most hide-able latency • Scale latency-bound applications • High interconnect bandwidth • Scale bandwidth-bound applications

  25. The Bad News • Scalar performance • “Some tuning required” • Ho-hum MPI latency • See Rants

  26. Scalar Performance • Compilation is slow • Amdahl’s Law for single processes • Parallelization -> Vectorization • Hard to port GNU tools • GCC? Are you kidding? • GCC compatibility, on the other hand… • Black Widow will be better

  27. “Some Tuning Required” • Vectorization requires: • Independent operations • Dependence information • Mapping to vector instructions • Applications take a wide spectrum of steps to inhibit this • May need a couple of compiler directives • May need extensive rewriting
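
A hypothetical example of the "couple of compiler directives" case: an indirectly addressed update that the compiler cannot prove independent, vectorized by asserting independence with Cray's IVDEP directive. The subroutine and the assumption that idx contains no repeated entries are part of the sketch, not taken from any of the applications above.

```fortran
! The compiler must assume idx(i) may repeat, which would create a
! dependence between iterations; the directive asserts it does not.
subroutine scatter_add(n, idx, src, dest)
  implicit none
  integer, intent(in)    :: n, idx(n)
  real(8), intent(in)    :: src(n)
  real(8), intent(inout) :: dest(*)
  integer :: i
!DIR$ IVDEP
  do i = 1, n
     dest(idx(i)) = dest(idx(i)) + src(i)
  end do
end subroutine scatter_add
```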

  28. Application Results • Awesome • Indifferent • Recalcitrant • Hopeless

  29. Awesome Results • 256-MSP X1 already showing unique capability • Apps bound by memory bandwidth, interconnect bandwidth, interconnect latency • POP, Gyro, DCA-QMC, AGILE-BOLTZTRAN, VH1, Amber, … • Many examples from DoD

  30. Indifferent Results • Cray X1 is brute-force fast, but not cost effective • Dense linear algebra • Linpack, AORSA, LSMS

  31. Recalcitrant Results • Inherent algorithms are fine • Source code or ongoing code mods don’t vectorize • Significant code rewriting done, ongoing, or needed • CLM, CAM, Nimrod, M3D

  32. Aside: How to Avoid Vectorization • Use pointers to add false dependencies • Put deep call stacks inside loops • Put debug I/O operations inside compute loops • Did I mention using pointers?
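
A tongue-in-cheek sketch collecting slide 32's anti-patterns in a single loop; some_deep_physics is a hypothetical placeholder for a deep call stack, not a routine from any real code.

```fortran
! POINTER arguments may alias, the external call hides unknown side
! effects, and the debug WRITE forces scalar, in-order execution --
! three separate ways to keep this loop off the vector units.
subroutine how_not_to_vectorize(n, a, b)
  implicit none
  integer, intent(in) :: n
  real(8), pointer :: a(:), b(:)    ! pointers: compiler must assume a and b overlap
  integer :: i
  do i = 1, n
     call some_deep_physics(a(i))   ! deep call stack inside the loop
     write (*, *) 'debug: i =', i   ! I/O inside the compute loop
     b(i) = b(i) + a(i)
  end do
end subroutine how_not_to_vectorize
```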

  33. Aside: Software Design • In general, we don’t know how to systematically design efficient, maintainable HPC software • Vectorization imposes constraints on software design • Bad: Existing software must be rewritten • Good: Resulting software often faster on modern superscalar systems • “Some tuning required” for X series • Bad: You must tune • Good: Tuning is systematic, not a Black Art • Vectorization “constraints” may help us develop effective design patterns for HPC software

  34. Hopeless Results • Dominated by unvectorizable algorithms • Some benchmark kernels of questionable relevance • No known DOE SC applications

  35. Summary • DOE SC scientists do need 100 TF and beyond of sustained application performance • Cray X series is the least-implausible option for scaling DOE SC applications to 100 TF of sustained performance and beyond

  36. “Custom” Rant • “Custom vs. Commodity” is a Red Herring • CMOS is commodity • Memory is commodity • Wires are commodity • Cooling is independent of vector vs. scalar • PNNL liquid-cooling clusters • Vector systems may move to air-cooling • All vendors do custom packaging • Real issue: Software

  37. MPI Rant • Latency-bound apps often limited by “MPI_Allreduce(…, MPI_SUM, …)” • Not ping-pong! • An excellent abstraction that is eminently optimizable • Some apps are limited by point-to-point • Remote load/store implementations (CAF, UPC) have performance advantages over MPI • But MPI could be implemented using load/store, inlined, and optimized • On the other hand, easier to avoid pack/unpack with load/store model
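
The pattern the rant points at, as a minimal sketch: a distributed dot product whose parallel cost is one tiny allreduce per solver iteration, pure latency rather than bandwidth. The function name and the assumption that each rank holds matching local slices of x and y are illustrative.

```fortran
! One 8-byte value reduced across all ranks on every iteration of an
! iterative solver: this is what "latency-bound" means here, not ping-pong.
function global_dot(x, y, n, comm) result(gdot)
  use mpi
  implicit none
  integer, intent(in) :: n, comm
  real(8), intent(in) :: x(n), y(n)
  real(8) :: gdot, ldot
  integer :: ierr
  ldot = dot_product(x, y)                       ! local work: O(n) flops
  call MPI_Allreduce(ldot, gdot, 1, MPI_REAL8, & ! global work: one scalar sum
                     MPI_SUM, comm, ierr)
end function global_dot
```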

  38. Co-Array-Fortran Rant • No such thing as one-sided communication • It’s all two-sided: send+receive, sync+put+sync, sync+get+sync • Same parallel algorithms • CAF mods can be highly nonlocal • Adding CAF in a subroutine can have implications for its argument types, and thus for its callers, the callers’ callers, etc. • Rarely the case for MPI • We use CAF to avoid MPI-implementation performance inadequacies • Avoiding nonlocality by cheating with Cray pointers
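
The sync+put+sync point as a minimal sketch, written in standard (post-2004) coarray syntax rather than the exact Cray CAF spelling of the time: the one-sided put is only correct when bracketed by synchronization on both sides, which is the sense in which it is really two-sided.

```fortran
! Each image puts its data into its right neighbor's receive buffer.
program caf_put
  implicit none
  real(8) :: sendbuf(1000), recvbuf(1000)[*]   ! recvbuf is a co-array
  integer :: right
  right = mod(this_image(), num_images()) + 1  ! neighbor with wraparound
  sendbuf = real(this_image(), 8)
  sync all                           ! target's recvbuf is ready to be written
  recvbuf(:)[right] = sendbuf(:)     ! one-sided "put" into image right's memory
  sync all                           ! all puts complete; safe to read recvbuf
  print *, 'image', this_image(), 'received', recvbuf(1)
end program caf_put
```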

  39. Cray Rant • Cray XD1 (OctigaBay) follows in tradition of T3E

  40. Cray Rant • Cray XD1 (OctigaBay) follows in tradition of T3E • Very promising architecture • Dumb name • Interesting competitor with Red Storm

  41. Questions? James B. White III (Trey) trey@ornl.gov http://www.csm.ornl.gov/evaluation/PHOENIX/
