
A Look at Application Performance Sensitivity to the Bandwidth and Latency of Infiniband Networks


Presentation Transcript


  1. A Look at Application Performance Sensitivity to the Bandwidth and Latency of Infiniband Networks
  Darren J. Kerbyson
  Performance and Architecture Laboratory (PAL)
  http://www.c3.lanl.gov/pal
  Computer and Computational Sciences Division
  Los Alamos National Laboratory

  2. Performance and Architecture Lab
  • Performance analysis team at Los Alamos:
    • Measurement
    • Modeling
    • Simulation
  • Large scale:
    • Systems (10,000s to 100,000s of processors)
    • Applications
  • Analyze existing systems (or near-to-market systems)
  • Examine possible future systems, e.g. IBM PERCS (DARPA HPCS), next-generation Blue Gene, …
  • Recent work includes:
    • Modeling and optimization of ASCI Q (SC03 best paper)
    • Comparison of systems, e.g. Earth Simulator and Top 5 (CCPE05)
    • Blue Gene/L (SC04)
    • Large-scale optical circuit switch network (SC05)

  3. Assessing the impact of network performance
  • Context:
    • What would be the performance improvement if we had a network with higher bandwidth? A network with lower latency?
    • Is it worth procuring an enhanced configuration?
  • Approach (a minimal model sketch follows this list):
    • Use application performance models: an application abstraction encapsulating performance-related features
      • Compute factors: single-processor / node performance
      • Parallel factors: boundary exchanges, collectives, etc.
    • Parameterized in terms of system characteristics:
      • Node characteristics
      • Network characteristics (including bandwidth and latency)
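As a rough illustration of this approach (a minimal sketch in Python, not the actual PAL models, which capture each application's structure in far more detail), an iteration's cost can be composed from per-processor compute time plus a latency/bandwidth message cost:

    def comm_time(msg_bytes, latency_s, bandwidth_Bps):
        # First-order cost of one point-to-point message:
        # startup latency plus serialization time on the link.
        return latency_s + msg_bytes / bandwidth_Bps

    def iteration_time(t_compute_s, n_msgs, msg_bytes, latency_s, bandwidth_Bps):
        # One iteration: single-processor compute time plus the
        # boundary-exchange cost, parameterized by the network.
        return t_compute_s + n_msgs * comm_time(msg_bytes, latency_s, bandwidth_Bps)

Sweeping latency_s and bandwidth_Bps over candidate network configurations, with t_compute_s held fixed, gives sensitivity estimates of the kind shown on the later slides.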

  4. Applications
  • Three applications of interest to Los Alamos:
    • Sweep3D: kernel application representing the heart of a deterministic SN transport calculation
    • SAGE: AMR hydrocode for shock propagation
    • Partisn: deterministic SN transport code
  • Performance models previously developed and validated on large-scale systems including:
    • Blue Gene/L (Lawrence Livermore), up to 32K nodes
    • Red Storm (Sandia), up to 8K processors
    • ASCI Q (Los Alamos), up to 8K processors
  • Typical error ~10%
  • Once validated, the models can be used to explore performance on new systems

  5. Application characteristics

  6. Network characteristics
  • A latency of 4µs seems optimistic (currently)
  • A latency of 1.5µs is close to PathScale's (1.29µs)
  • Achievable bandwidth is assumed to be ~80% of peak
  • The Infiniband fabric is assumed to be a 12-ary fat-tree with a switch latency of 200ns
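One way to read these figures together (an assumption, not stated on the slide): if the quoted near-neighbor latencies already cover the adapter and the leaf switch, traffic that crosses additional levels of the fat-tree pays roughly one 200ns switch latency per extra crossing:

    SWITCH_LATENCY_S = 200e-9   # per switch crossing, from the slide

    def end_to_end_latency(near_neighbor_s, extra_hops):
        # Hypothetical hop model: near-neighbor latency plus one
        # switch latency per additional fat-tree crossing.
        return near_neighbor_s + extra_hops * SWITCH_LATENCY_S

    end_to_end_latency(1.5e-6, 2)   # two extra crossings -> 1.9e-06 s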

  7. Performance studies
  • Sensitivity to network bandwidth and latency:
    • 4x, 8x, and 12x bandwidths
    • 4µs and 1.5µs near-neighbor latency
  • Effect of node size:
    • Varying the number of processors in a node
    • Assumes single-core nodes, but the results are applicable to multi-core
    • Assumed node: 2GHz AMD Opterons
    • Uses measured single-processor performance
  • Varying system size:
    • From 1 processor up to 8,192 processors
    • Concentrating on 256-, 512-, and 1,024-processor clusters

  8. Communication cost - example
  (charts: communication cost for Sweep3D and for SAGE)
  • 4x, 8x, and 12x IB with a near-neighbor latency of 4µs
  • 4-way nodes
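Costs of this kind can be reproduced to first order from the link rates. A sketch, assuming 2005-era SDR data rates of 1, 2, and 3 GB/s peak for 4x, 8x, and 12x (these peaks are an assumption, not given on the slide), derated to the ~80% achievable fraction quoted earlier:

    PEAK_BW_GBps = {'4x': 1.0, '8x': 2.0, '12x': 3.0}  # assumed SDR data rates
    LATENCY_S = 4e-6                                   # near-neighbor, from the slide

    for width, peak in PEAK_BW_GBps.items():
        bw = 0.8 * peak * 1e9              # achievable bandwidth, bytes/s
        t_small = LATENCY_S + 1024 / bw    # Sweep3D-like ~1KB message
        t_large = LATENCY_S + 2**20 / bw   # illustrative 1MB SAGE-like exchange
        print(f"{width:>3}: 1KB {t_small * 1e6:5.2f} us   1MB {t_large * 1e6:7.1f} us")

The ~1KB message is dominated by latency, so widening the link barely helps it, while the large exchange scales almost linearly with link width; this is consistent with the per-application sensitivities on the following slides.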

  9. Performance sensitivity: Partisn
  • Relative to a baseline configuration: 4-way nodes, 4x IB with 4µs latency
  • X-axis position indicates node-size sensitivity (1- to 8-way)
  • Bar height indicates bandwidth sensitivity:
    • 4x = lowest bar value
    • 12x = highest bar value
    • 8x = white 'mid' line
  • The difference between solid and shaded bars indicates latency sensitivity (4µs vs. 1.5µs)
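In formula terms, each bar plots relative performance against that baseline (assuming the charts show speedup rather than runtime, which matches 12x being the highest bar):

    def sensitivity(t_baseline_s, t_candidate_s):
        # Relative performance versus the 4-way, 4x-IB, 4us baseline:
        # values above 1.0 mean the candidate configuration is faster.
        return t_baseline_s / t_candidate_s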

  10. Performance sensitivity: Partisn
  • 512-processor cluster
  • Highest sensitivity is to node size (multiple processors share a NIC)
  • More sensitive to bandwidth than to latency

  11. Performance sensitivity: Sweep3D
  • 512-processor cluster
  • Highest sensitivity is to latency (most messages are small, ~1KB)
  • Similar sensitivity to bandwidth and to node size

  12. Performance sensitivity: SAGE
  • 512-processor cluster
  • Similar sensitivity to bandwidth and node size (1- to 4-way)
  • No change from 4- to 8-way due to an application effect
  • Little sensitivity to latency

  13. Sensitivity summary
  • Note: the summary says nothing about cost or relative workload usage

  14. Conclusions
  • The performance improvement due to an enhanced network is application dependent:
    • bandwidth for SAGE
    • latency for Sweep3D
    • a mixture of node size and bandwidth for Partisn
  • Compute performance dampens any performance enhancement from the network; faster processors would increase sensitivity to the network
  • Performance modeling can be used to assess configurations prior to procurement
