VTF Applications Performance and Scalability
Sharon Brunett, CACR/Caltech
ASCI Site Review, October 28-29, 2003
ASCI Platform Specifics
• LLNL’s IBM SP3 (Frost)
  • 65-node SMP, 375 MHz Power3 Nighthawk-2 (16 CPUs/node)
  • 16 GB memory/node
  • ~20 TB global parallel file system
  • SP Switch2 (Colony switch), 2 GB/sec bi-directional node-to-node bandwidth
• LANL’s HP/Compaq AlphaServer ES45 (QSC)
  • 256-node SMP, 1.25 GHz Alpha EV6 (4 CPUs/node)
  • 16 GB memory/node
  • ~12 TB global file system
  • Quadrics (QsNet) interconnect, ~2 μs latency, 300 MB/sec bandwidth
Multiscale Polycrystal Studies • Quantitative assessment of microstructural effects in macroscopic material response through the computation of full-field solutions of polycrystals • Inhomogeneous plastic deformation fields • Grain-boundary effects: • Stress concentration • Dislocation pile-up • Constraint-induced multislip • Size dependence: (inverse) Hall-Petch effect • Resolve (as opposed to model) mesoscale behavior exploiting the power of high-performance computing • Enable full-scale simulation of engineering systems incorporating micromechanical effects.
Mesh Generation
• In-grain subdivision behavior can be simulated in both single crystals and polycrystals
  • Texture simulation results agree well with experimental results
• Mesh generation method preserves the topology of individual grain shapes
  • Enables effective interactions between grains
• Increasing the grain count in polycrystals gives a more stable mechanical response
Single grain corresponding to a single cell in a crystal
1.5 Million Element, 1241 Grain Multiscale Polycrystal Simulation
Simulation carried out on 1024 processors of LLNL’s IBM SP3, Frost
Multiscale Polycrystal Performance
• Aggregate parallel performance
• LANL’s QSC
  • Floating point operations: 10.67% of peak
  • Integer operations: 15.39% of peak
  • Memory operations: 22.08% of peak
  • DCPI hardware counters used to collect data
  • Qopcounter tool used to analyze the DCPI database
• LLNL’s Frost
  • L1 cache hit rate: 98% (load/store instructions executed without main memory access)
  • Load/store unit idle: 36%
  • Floating point operations: 4.47% of peak
  • Hpmcount tool used to count hardware events during program execution (percent-of-peak sketch below)
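For reference, a minimal sketch of how raw counter output can be turned into the percent-of-peak figures quoted above. The peak rate and run numbers here are assumptions (a 375 MHz Power3 with two FMA units taken as roughly 1.5 GFLOP/s per CPU), not values taken from the actual VTF runs.

```cpp
// Minimal sketch, not VTF code: convert a FLOP count reported by
// hpmcount/DCPI into a percent-of-peak figure.
#include <cstdio>

// flops_executed:     total floating point operations reported by the counters
// elapsed_seconds:    wall-clock time of the measured region
// peak_flops_per_cpu: assumed theoretical peak per CPU
double percent_of_peak(double flops_executed, double elapsed_seconds,
                       double peak_flops_per_cpu, int cpus)
{
    const double achieved = flops_executed / elapsed_seconds;  // FLOP/s
    const double peak     = peak_flops_per_cpu * cpus;         // aggregate peak
    return 100.0 * achieved / peak;
}

int main()
{
    // Hypothetical numbers, purely illustrative: 64 CPUs, each assumed to
    // peak at 1.5e9 FLOP/s (375 MHz Power3, two FMA units).
    std::printf("%.2f%% of peak\n",
                percent_of_peak(4.3e12, 1000.0, 1.5e9, 64));
    return 0;
}
```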
Multiscale Polycrystal Performance II
• MPI routines can consume ~30% of runtime for large runs on Frost
  • Workload imbalance as grains are distributed across nodes
  • MPI_Waitall at every step dominates communication time
  • Nearest-neighbor sends take longer from nodes with computationally heavy grains (see the timing sketch below)
• Routines taking the most CPU time on QSC
  • resolved_fcc_cuitino 18.85%
  • upslip_fcc_cuitino_explicit 11.74%
  • setafcc 9.16%
  • matvec 8.5%
  • ~50% of execution time spent in these 4 routines
• Room for performance improvement with better load balancing and routine-level optimization
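A minimal sketch of the kind of instrumentation that exposes this imbalance: time the per-step MPI_Waitall on each rank and reduce min/max across ranks. This is assumed instrumentation for illustration, not code from the polycrystal solver; the request array here is only a placeholder for the solver's real nonblocking nearest-neighbor exchange.

```cpp
// Minimal sketch, not the actual solver: per-rank MPI_Waitall timing to
// quantify load imbalance across nodes.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Placeholder: the real code posts nonblocking nearest-neighbor
    // sends/receives here and fills in the request array.
    MPI_Request requests[1];
    int nrequests = 0;

    double t0 = MPI_Wtime();
    MPI_Waitall(nrequests, requests, MPI_STATUSES_IGNORE);
    double wait = MPI_Wtime() - t0;

    // A large max/min spread means some ranks sit in MPI_Waitall while
    // others are still computing their heavier grains.
    double wmin, wmax;
    MPI_Reduce(&wait, &wmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&wait, &wmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::printf("MPI_Waitall time: min %.4fs  max %.4fs  (%d ranks)\n",
                    wmin, wmax, size);

    MPI_Finalize();
    return 0;
}
```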
Multiscale Polycrystal Scaling on LLNL’s IBM SP3, Frost
[Scaling plot; problem size in elements]
Multiscale Polycrystal Scaling on LANL’s HP/Compaq, QSC
[Scaling plot; problem size in elements]
Scaling for Polycrystalline Copper in a Shear Compression Specimen Configuration
[Scaling plot; problem size in elements; LANL’s HP/Compaq QSC system]
3D Converging Shock Simulations in a Wedge
• 1024-processor ASCI Frost run of a converging shock. The interface is nominally a 2D ellipse perturbed with a prescribed spectrum and randomized phases.
• The 2D elliptical interface is computed using local shock polar analysis to yield a perfectly circular transmitted shock.
• Resolution: 2000x400x400, with over 1 TB of data generated.
[Figure panels: density, pressure]
Density Field in a 3D Wedge
Density field in the wedge. The transmitted shock front appears to be stable, while the gas interface is Richtmyer-Meshkov unstable. The simulation ran on 1024 processors of LLNL’s IBM SP3, Frost, with a 2000x400x400 initial grid.
Wedge3D Performance on LLNL’s IBM SP3, Frost
• Aggregate parallel performance for a 1400x280x280 grid
  • Floating point operations: 5.8 to 10% of peak, depending on the node
  • Hpmcount tool used to count hardware events during program execution
• Most time-consuming communication calls: MPI_Wait() and MPI_Allreduce()
  • Account for 3 to 30% of runtime on a 128-way run
  • 175x70x70 grid per processor (see the decomposition sketch below)
• Occasional high MPI time on a few nodes appears to be caused by system daemons competing for resources
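A minimal sketch of the 3D block decomposition implied by the numbers above; this is an assumed MPI Cartesian split for illustration, not the actual Wedge3D partitioning code. On 128 ranks, MPI_Dims_create typically returns an 8x4x4 process grid, which carves the 1400x280x280 mesh into the quoted 175x70x70 block per processor.

```cpp
// Minimal sketch, not Wedge3D code: Cartesian decomposition of the
// performance-study grid across the available MPI ranks.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int global[3] = {1400, 280, 280};  // performance-study grid
    int dims[3]    = {0, 0, 0};
    int periods[3] = {0, 0, 0};              // non-periodic wedge domain
    MPI_Dims_create(size, 3, dims);          // 128 ranks -> e.g. 8x4x4

    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, &cart);

    if (rank == 0)
        std::printf("process grid %dx%dx%d -> local block %dx%dx%d\n",
                    dims[0], dims[1], dims[2],
                    global[0] / dims[0], global[1] / dims[1], global[2] / dims[2]);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```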
Wedge3D Scaling on LLNL’s IBM SP3, Frost
[Scaling plot; problem size given as grid dimensions X x Y x Z]
Fragmentation 2D Scaling on LANL’s HP/Compaq, QSC
[Scaling plots; problem sizes grown by levels of subdivision: 61K -> 915K elements, 85K -> 1.1M elements, 450K to 1.1M elements]
Crack Patterns in the Configuration Occurring During Scalability Studies on QSC
Fragmentation 2D Performance on LANL’s HP/Compaq, QSC
• Procedures with the highest CPU cycle consumption
  • element_driver 14.9%
  • assemble 13.9%
  • NewNeohookean 8.12%
  • 16-processor run with 2 levels of subdivision (60K elements)
  • Dcpiprof tool used to profile the run
• Problems processing FLOP rates from the DCPI database for large runs
  • Reported to LANL support
  • Small runs yield ~3% of peak FLOP rate
• Only ~10% of cycles spent in fragmentation routines!
• Much room for improvement in our I/O performance when dumping to the parallel file system (/scratch[1,2]); see the I/O timing sketch below
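A minimal sketch of how the achieved dump bandwidth could be measured with collective MPI-IO; this is assumed instrumentation for illustration, not the fragmentation code's actual output path. The file name under /scratch1 and the 64 MB-per-rank buffer are hypothetical.

```cpp
// Minimal sketch, not the production dump code: time a collective MPI-IO
// write to the parallel file system and report aggregate bandwidth.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const MPI_Offset block = 64 * 1024 * 1024;   // 64 MB per rank (assumed)
    std::vector<char> buf(block, 0);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/scratch1/io_test.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    double t0 = MPI_Wtime();
    MPI_File_write_at_all(fh, rank * block, buf.data(), (int)block,
                          MPI_CHAR, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);                         // include close/flush in the timing
    double t = MPI_Wtime() - t0;

    if (rank == 0)
        std::printf("aggregate write bandwidth: %.1f MB/s\n",
                    (double)block * size / t / 1.0e6);

    MPI_Finalize();
    return 0;
}
```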