VTF Applications Performance and Scalability


Presentation Transcript


  1. VTF Applications Performance and Scalability Sharon Brunett, CACR/Caltech, ASCI Site Review, October 28-29, 2003

  2. ASCI Platform Specifics • LLNL’s IBM SP3 (Frost): 65-node SMP, 375 MHz Power3 Nighthawk-2 (16 CPUs/node), 16 GB memory/node, ~20 TB global parallel file system, SP Switch2 (Colony) interconnect, 2 GB/s bi-directional node-to-node bandwidth • LANL’s HP/Compaq AlphaServer ES45 (QSC): 256-node SMP, 1.25 GHz Alpha EV6 (4 CPUs/node), 16 GB memory/node, ~12 TB global file system, Quadrics (QsNet) interconnect, 2 μs latency, 300 MB/s bandwidth
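
Latency and bandwidth figures like QSC's ~2 μs and ~300 MB/s are the kind of numbers a two-rank ping-pong microbenchmark reports. The sketch below is illustrative only and is not part of the original study; the message sizes and repetition count are arbitrary choices.

```c
/* Two-rank ping-pong microbenchmark of the kind used to quote interconnect
 * latency and bandwidth.  Illustrative only; run with exactly 2 MPI ranks,
 * ideally placed on different nodes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 1000;
    for (int bytes = 8; bytes <= (1 << 20); bytes *= 8) {
        char *buf = malloc(bytes);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double oneway = (MPI_Wtime() - t0) / (2.0 * reps);  /* one-way time */
        if (rank == 0)
            printf("%8d bytes: %8.2f us  %8.1f MB/s\n",
                   bytes, oneway * 1e6, bytes / oneway / 1e6);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}
```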

  3. Multiscale Polycrystal Studies • Quantitative assessment of microstructural effects in macroscopic material response through the computation of full-field solutions of polycrystals • Inhomogeneous plastic deformation fields • Grain-boundary effects: • Stress concentration • Dislocation pile-up • Constraint-induced multislip • Size dependence: (inverse) Hall-Petch effect • Resolve (as opposed to model) mesoscale behavior exploiting the power of high-performance computing • Enable full-scale simulation of engineering systems incorporating micromechanical effects.

  4. Mesh Generation • In-grain subdivision behavior can be simulated in both single crystals and polycrystals • Texture simulation results agree well with experimental results • The mesh generation method preserves the topology of individual grain shapes • Enables effective interactions between grains • Increasing the grain count in polycrystals gives a more stable mechanical response [Figure: a single grain corresponding to a single cell in a crystal]

  5. 1.5 Million Element, 1241 Grain Multiscale Polycrystal Simulation Simulation carried out on 1024 processors of LLNL’s IBM SP3, frost

  6. Multiscale Polycrystal Performance • Aggregate parallel performance • LANL’s QSC • Floating-point operations 10.67% of peak • Integer operations 15.39% of peak • Memory operations 22.08% of peak • DCPI hardware counters used to collect data • Qopcounter tool used to analyze the DCPI database • LLNL’s Frost • L1 cache hit rate 98% (load/store instructions executed without a main-memory access) • Load/Store Unit idle 36% • Floating-point operations 4.47% of peak • Hpmcount tool used to count hardware events during program execution
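
Percent-of-peak figures like these come from dividing a measured floating-point rate by the machine's theoretical per-CPU peak. The sketch below shows that arithmetic; the peak assumptions (4 flops/cycle for the 375 MHz Power3, 2 flops/cycle for the 1.25 GHz Alpha) and the measured per-CPU rates are illustrative stand-ins for hpmcount/DCPI output, not values reported in the talk.

```c
/* Percent-of-peak arithmetic behind figures such as "4.47% of peak" on Frost.
 * A sketch only: the peak rates are assumed per-CPU values, and the measured
 * rates are placeholders standing in for hpmcount/DCPI counter output. */
#include <stdio.h>

static double pct_of_peak(double measured_gflops, double clock_ghz,
                          double flops_per_cycle)
{
    double peak = clock_ghz * flops_per_cycle;   /* per-CPU peak, GFLOP/s */
    return 100.0 * measured_gflops / peak;
}

int main(void)
{
    /* Hypothetical per-CPU rates read off the hardware counters. */
    double frost_measured = 0.067;   /* GFLOP/s per CPU (placeholder) */
    double qsc_measured   = 0.267;   /* GFLOP/s per CPU (placeholder) */

    printf("Frost: %.2f%% of peak\n", pct_of_peak(frost_measured, 0.375, 4.0));
    printf("QSC:   %.2f%% of peak\n", pct_of_peak(qsc_measured,   1.25, 2.0));
    return 0;
}
```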

  7. Multiscale Polycrystal Performance II • MPI routines can consume ~30% of runtime for large runs on Frost • Workload imbalance as grains are distributed across nodes • The MPI_Waitall issued every step dominates communication time • Nearest-neighbor sends take longer from nodes with computationally heavy grains • Routines taking the most CPU time on QSC • resolved_fcc_cuitino 18.85% • upslip_fcc_cuitino_explicit 11.74% • setafcc 9.16% • matvec 8.5% • ~50% of execution time spent in 4 routines • Room for performance improvement with better load balancing and routine-level optimization
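
The MPI_Waitall cost described above is characteristic of the nonblocking nearest-neighbor exchange pattern sketched below. This is a generic illustration, not the VTF source: the exchange() helper, the one-double-per-neighbor payload, and the buffer layout are invented for brevity.

```c
/* Generic nearest-neighbor exchange showing why load imbalance surfaces as
 * MPI_Waitall time: every rank posts its receives and sends, then blocks in
 * MPI_Waitall until its slowest neighbor has posted the matching send. */
#include <mpi.h>
#include <stdlib.h>

/* Exchange one double per neighbor; 'neighbors' lists the neighboring ranks. */
void exchange(const int *neighbors, int nneigh,
              double *sendbuf, double *recvbuf, MPI_Comm comm)
{
    MPI_Request *reqs = malloc(2 * nneigh * sizeof(MPI_Request));

    for (int i = 0; i < nneigh; i++)
        MPI_Irecv(&recvbuf[i], 1, MPI_DOUBLE, neighbors[i], 0, comm, &reqs[i]);

    /* ... heavy per-grain computation happens here; ranks holding heavy
     * grains post their sends late, so their neighbors sit in MPI_Waitall ... */

    for (int i = 0; i < nneigh; i++)
        MPI_Isend(&sendbuf[i], 1, MPI_DOUBLE, neighbors[i], 0, comm,
                  &reqs[nneigh + i]);

    MPI_Waitall(2 * nneigh, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}
```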

  8. Multiscale Polycrystal Scaling on LLNL’s IBM SP3, Frost [Scaling plot; x-axis: number of elements]

  9. Multiscale Polycrystal Scaling on LANL’s HP/Compaq, QSC [Scaling plot; x-axis: number of elements]

  10. Scaling for Polycrystalline Copper in a Shear Compression Specimen Configuration, LANL’s HP/Compaq QSC system [Scaling plot; x-axis: number of elements]

  11. 3D Converging Shock Simulations in a Wedge • 1024-processor ASCI Frost run of a converging shock. The interface is nominally a 2D ellipse perturbed with a prescribed spectrum and randomized phases • The 2D elliptical interface is computed using local shock polar analysis to yield a perfectly circular transmitted shock • Resolution: 2000x400x400, with over 1 TB of data generated [Figure panels: density, pressure]
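
A rough estimate shows how a 2000x400x400 run accumulates over 1 TB of output. The sketch below assumes roughly five double-precision field variables per cell and repeated full-field dumps; both assumptions are illustrative and are not figures from the talk.

```c
/* Back-of-the-envelope data-volume estimate for the 2000x400x400 wedge run.
 * The per-cell variable count is an assumption, not a value from the talk;
 * it only illustrates how "over 1 TB" of output accumulates. */
#include <stdio.h>

int main(void)
{
    long long nx = 2000, ny = 400, nz = 400;
    long long cells = nx * ny * nz;          /* 3.2e8 cells */
    int vars_per_cell = 5;                   /* assumed (e.g. density, momentum, energy) */
    double snapshot_bytes = (double)cells * vars_per_cell * sizeof(double);

    printf("One full-field snapshot: %.1f GB\n", snapshot_bytes / 1e9);
    printf("Snapshots to exceed 1 TB: about %.0f\n", 1e12 / snapshot_bytes + 1);
    return 0;
}
```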

  12. Density Field in a 3D Wedge Density field in the wedge. The transmitted shock front appears to be stable while the gas interface is Richtmyer-Meshkov unstable. The simulation ran on 1024 processors of LLNL’s IBM SP3, Frost, with a 2000x400x400 initial grid.

  13. Wedge3D Performance on LLNL’s IBM SP3, Frost • Aggregate parallel performance for a 1400x280x280 grid • LLNL’s Frost • Floating-point operations 5.8 to 10% of peak, depending on node • Hpmcount tool used to count hardware events during program execution • Most time-consuming communication calls: MPI_Wait() and MPI_Allreduce() • Accounting for 3 to 30% of runtime on a 128-way run • 175x70x70 grid per processor • Occasional high MPI time on a few nodes appears to be caused by system daemons competing for resources
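
One common reason an explicit shock code issues an MPI_Allreduce every step is the globally synchronized CFL timestep. The sketch below illustrates that pattern; it is a generic example, not the wedge3d source, and the global_timestep() helper and its arguments are invented for illustration.

```c
/* Generic per-step global timestep reduction: each rank finds its local
 * stable dt over its block (e.g. 175x70x70 cells), then all ranks agree on
 * the minimum.  Not the wedge3d source. */
#include <mpi.h>
#include <float.h>

double global_timestep(const double *local_dt_per_cell, long ncells,
                       MPI_Comm comm)
{
    double local_dt = DBL_MAX;
    for (long i = 0; i < ncells; i++)
        if (local_dt_per_cell[i] < local_dt)
            local_dt = local_dt_per_cell[i];

    /* Every rank blocks here until the slowest rank arrives, which is why
     * imbalance or daemon noise on a few nodes shows up as MPI_Allreduce time. */
    double dt;
    MPI_Allreduce(&local_dt, &dt, 1, MPI_DOUBLE, MPI_MIN, comm);
    return dt;
}
```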

  14. Wedge3D Scaling on LLNL’s IBM SP3, Frost [Scaling plot; x-axis: grid size X x Y x Z]

  15. Fragmentation 2D Scaling on LANL’s HP/Compaq, QSC [Scaling plots over levels of subdivision: 450K elements, 61K -> 915K elements, and 85K -> 1.1M elements (450K to 1.1M elements overall)]

  16. Crack Patterns in the Configuration Occurring During Scalability Studies on QSC

  17. Fragmentation 2D Performance on LANL’s HP/Compaq, QSC • Procedures with the highest CPU-cycle consumption • element_driver 14.9% • assemble 13.9% • NewNeohookean 8.12% • 16-processor run with 2 levels of subdivision (60K elements) • Dcpiprof tool used to profile the run • Problems processing DCPI-database FLOP rates for large runs, reported to LANL support • Small runs yield ~3% of FLOP peak • Only ~10% of time spent in fragmentation routines! • Much room for improvement in our I/O performance when dumping to the parallel file system (/scratch[1,2])
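
One standard way to improve dump performance on a parallel file system is a collective MPI-IO write, where each rank writes its contiguous slab of a single shared file in one call instead of issuing many small independent writes. The sketch below is a generic illustration, not the VTF I/O layer; the dump_state() helper, the file name under /scratch1, and the flat double-array layout are assumptions.

```c
/* Collective MPI-IO dump sketch: ranks compute their slab offsets with a
 * prefix sum, then write in one collective call.  Not the VTF I/O code;
 * the path and data layout are hypothetical. */
#include <mpi.h>

void dump_state(const double *local_data, long long nlocal, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* Offset of this rank's slab = sum of element counts on lower ranks. */
    long long offset_elems = 0;
    MPI_Exscan(&nlocal, &offset_elems, 1, MPI_LONG_LONG, MPI_SUM, comm);
    if (rank == 0) offset_elems = 0;   /* MPI_Exscan leaves rank 0 undefined */

    MPI_File fh;
    MPI_File_open(comm, "/scratch1/vtf_dump.bin",      /* hypothetical path */
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at_all(fh, offset_elems * (MPI_Offset)sizeof(double),
                          local_data, (int)nlocal, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}
```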
