Overview of Performance Analysis and Modeling in Recent SNL Activities
This document outlines the recent efforts and activities at Sandia National Laboratories (SNL) regarding application performance analysis and modeling. The goals include improving system performance and scalability, optimizing resource utilization, and guiding the design and purchase of computational systems. The report discusses advancements in various scientific fields, such as computational fluid dynamics and structural mechanics, and presents new frameworks for enhancing the efficiency of high-performance simulations. Insights from the Catamount and Red Storm supercomputer upgrades are also highlighted.
Overview of Performance Analysis and Modeling in Recent SNL Activities
E N D
Presentation Transcript
Application Performance Analysis and Modeling An Overview of Recent SNL Activities August 20, 2007 Sudip Dosanjh Computer and Software Systems Sandia National Laboratories sudip@sandia.gov Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company,for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Goals • Improve performance and scalability • Better utilization of an existing system • Determine which platform a given simulation should run on • Better utilization of a set of machines • Guide purchase of capacity and capability systems • Help design future supercomputers • Cost/benefit analysis • Focus R&D efforts
Science and Engineering Apps • Continuum • Computational fluid dynamics • Shock physics (CTH) • Arbitrary Lagrangian Eulerian (Alegra) • Structural mechanics • Combustion • Device simulations • E&M • Radiation • Enclosure radiation • DAE • Circuit Modeling • Particles • Molecular dynamics (LAMMPS) • Particle-in-cell
Informatics is an Emerging App Image Source: T. Coffman, S. Greenblatt, S. Marcus, Graph-based technologies for intelligence analysis, CACM, 47(3, March 2004): pp 45-47
Before Upgrade 10,880 2.0 GHz single-core AMD Opteron CPUs 43.52 TF/s peak SeaStar 1.2 2-4 GB per socket #9 on June 2006 Top 500 list Catamount LWK After Upgrade 13,600 2.4 GHz dual-core AMD Opteron CPUs 130.56 TF/s peak SeaStar 2.1 network Doubled NIC bandwidth 2-4 GB per socket #3 on current Top 500 list Catamount LWK with virtual node mode support Red Storm
Upgrade PerformanceSingle-Core vs Dual-Core • CTH - Shape Charge • Constant work/core • Speedup of at least 1.4 out to 2048 sockets • Sage - timing_c • Constant work/core • At scale, speedup is at least a factor of 1.6
Cluster more cost effective MPP more cost effective Efficiency ratio = Cost ratio = 1.8 Peak Linpack Scientific & Eng. Codes Sandia Study: 2004 • Where should codes run? Capability or Capacity? • Looked at the relative scaling efficiency between Red Storm and Cplant and then correlated efficiency to cost. • Answer:Capacity < 256 nodes Capability > 256 nodes Presto CTH DSMC
Red Storm & Thunderbird Performance Comparison (I.e. Capability vs Capacity) Turnover job size = 256
CTH Model with Load Imbalance T = E(κ,φ)N3 + C(λ + τkN2) + S(γlog(P)) +Limbal • T is the execution time per time step • N is size of an edge of a processor’s subdomain • C and S are number of exchanges and collectives • P is the number of processors • k is the number of variables in an exchange • λ and τ are latency and transfer cost • γ is the cost of one stage of collective • E(κ,φ) is the calculation time per cell • Limbal represents the effects of load imbalance
Monte-Carlo Particle Transport Code, ITS • Model built and verified on Red Storm, ASCI Red, Cplant, Vplant • Red Storm performance model used to predict 10K Proc Performance • Algorithm modification (new code) results in much improved scaling • Execution Time = Compute_time(num_hist) + Commu_time(msg_sizes, latency, BW, p) + I/O time • Master-Worker; Many to one; tally data sent to master • Tcomm.= {2 * T48 + T48000 + T432 + T16M + T368 } * num_batches * (p-1) • Input: Latency, Bandwidth(Pt-to-Pt), num_procs(p), num_batches • Model with Rabenseifner’s Algorithm: Tcomm. = 2*ln(p)*(latency) + 2*(p-1)*n*(Msg_transfer_time) + (p-1)*n*(local_reduction_time)/p • p = num processors • n = message size, Bytes
HW Architectural Results SW Structural Simulation Toolkit (SST) • Motivation • Currently developing a simulation environment to ... • Provide validated baseline for future exploration • Answer “What If” questions to guide future design efforts • Understand complex system-level interactions • Goals • Focus on parallel systems: HW & SW • Quick turnaround • Flexibility • Multiple front-ends • Execution driven • Trace driven • Multiple back-ends • Explore novel architectures (e.g. Multi-core, NIC, Memory) • Support conventional architectures (e.g. Single core, DDR) • Reusable, Extensible, & Parallelizable • Customers • Micro-architects • System Architects • Application Performance Analysis
SST: Structure • Front-Ends & Back-Ends Joined by Processor/Thread Interface • Enkidu “glues” back-end components
Applying SST: Red Storm SeaStar NIC • Architectural Features • Embedded 500 Mhz PPC440, local SRAM, DMA Engines, NIC Bus • High speed network interface to 3D mesh router • 800Mhz HyperTransport interface to CPU • Host/NIC communicate through memory • NIC Modeling • PPC 440: Used SimpleScalar • Local SRAM: Existing SST component • Tx/Rx DMA engines • Existing component • Respond to same commands at RS DMA • Flow controlled • HT Interface • Connects CPU/NIC • NIC Bus • Connects internal NIC components (PPC, SRAM, etc...) • HyperTransport Modeling • HyperTransport connection modeled at two components • HTLink models latency • HTLink_bw models link bandwidth • Models contention • Tracks backlog of requests w/ simple BW counting scheme • Implements flow-control with finite request queue • Queue depth set to cover round-trip times and allow full bandwidth
Validating SST: Latency & Bandwidth • Used MPI “ping-pong” and OSU streaming BW • Compared with real Seastar 1.2 and 2.1 chips • Latency, message rate, and bandwidth • within 5% for range of sizes
Validating SST: Primitives • Sources of Error • Small message optimization in Red Storm (<16 bytes) • Lack of cache-line invalidation instruction • Processor model?
HyperTransport • Vary NIC Clock Rate Effect On Latency: • 2X NIC Clock -> 30% improvement in latency • 4X NIC Clock -> 50% improvement • Vary HT Bus Effect On Bandwidth: • No effect to small messages • Does effect peak bandwidth Processor Speed NIC Bus • Vary NIC Bus Effect On Latency: • 1/2 Bus Latency -> 8% latency reduction • Vary HT Bus Effect On Latency: • MPI latency increases linearly with HT latency • 4 HT transactions per MPI message Varying NIC Clk Freq Design Space Exploration Varying HT BW Varying NIC Bus Latency Varying HT Latency
Better Finding the Bottleneck: Computation, Branches, or Memory • In the node, Memory performance is key bottleneck • Even perfect branch prediction and infinite FUs would be less valuable than improving memory latency. • Prefetching, caches don’t help emerging applications
Latency/Bandwidth Sensitivity Latency & Bandwidth are both constraining performance Informatics Applications Scientific Applications Emerging applications more sensitive to Latency and Bandwidth
Memory Operations Dominate • FP ops (“Real work”) < 10% of Sandia codes • Several Integer calculations, loads for each FP load • Memory and Integer Ops dominate • ...and most integer ops are computing memory addresses • Theme: processing is now cheap, data movement is expensive Integer Instruction Usage Instruction Mix
A Hybrid MPI Simulator • Execution driven network simulator • Uses Lamport’s virtual time concept to run application unperturbed • Uses MPI profiling interface • Easy to “hook” into existing applications • A relink is the only application porting demand • No need to instrument or analyze code • MPI Wrapper Library • Keeps track of virtual time • Send and receive MPI data as before • Manages events to coordinate network simulator with the application
MPI Simulator: Virtual Time • Lamport’s idea of synchronizing virtual time between nodes using messages • Events carry MPI envelope information and virtual send time • Algorithm adjusts local virtual time, • if send (Tx) plus network delay is within time of receiver (T1) • Tv = Tx + delay • Otherwise assume message was already here • Tv = T1
Hybrid MPI Simulator Node Architecture Collective Example Model Validation w/NAS Benchmarks Model Tuning
Future challenges • Data locality on chip • Impact of programming models • Accelerators