Panel Discussion Presentation
Sandia CSRI Workshop on Next-generation Scalable Applications: When MPI-only is not enough
June 4, 2008
Kevin Pedretti, Scalable System Software Dept., Sandia National Laboratories
ktpedre@sandia.gov
System Architecture: Near, Medium, and Long-term Scalable Architectures
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
Near Term
• Odds are good, but the goods are odd...
  • Multi-core, many-core, mega-core
  • Heterogeneous ISAs, cores, systems
  • Accelerators: GPU, Cell, ClearSpeed, FPGA, etc.
  • Embedded: Tilera, SPI, Ambric (336-core), Tensilica
• Scalable Architectures
  • Peak FLOPS is not the bottleneck
  • Improving per-socket efficiency on real applications is the "low-hanging fruit"
  • Memory size & bandwidth per core are decreasing
  • Symbiosis of architecture and system software
Near Term (Cont.)
• Adapting MPI implementations for architecture
  • Shared memory copies vs. NIC
  • Cache pollution, injection
  • Leverage hierarchy / intra-node locality
• Adapting MPI applications for architecture
  • MPI + shared memory: LIBSM
  • MPI + something else for intra-node
    • OpenMP, Threading Building Blocks, ALF streaming, CUDA, RapidMind, PeakStream/Google, etc.
    • All incompatible, though some share similar concepts
• Adapting architecture for MPI?
  • Leveraging interconnect capabilities for PGAS
OS Scalability
At 8,192 nodes, CNL (2.0.44) is 49% worse than Catamount on this Partisn problem. It does not appear to be a bandwidth issue.
Task and Memory Placement
• No standard mechanisms; most punt and hope for the best
• Explicit vs. implicit mechanisms
• More important than node placement?
Virtual Memory: Nice, but Gets in the Way
[Figure: dashed lines = small pages, solid lines = large pages (dual-core Opteron); open shapes = existing logarithmic algorithm (Gibson/Bruck), solid shapes = new constant-time algorithm (Slepoy, Thompson, Plimpton).]
Unexpected behavior due to the TLB: TLB misses increased with large pages, but the time to service a miss decreased dramatically (10x). The page table fits in L1! (vs. 2MB per GB with small pages)
So, Is the Answer Large Pages?
• DRAM bank conflicts can be considerable, depending on data alignment
• OS-level and hardware mitigation strategies exist
Medium Term
• More accelerators, normalization
  • Attractive power and memory efficiency
  • Commodity processors will integrate GPUs on-chip
  • HPC-centric off-chip accelerators
• General-purpose cores not getting much faster
  • Leverage architecture for specific application domains
• Some common mechanism will/must emerge for dealing with data-parallel accelerators
• General-purpose cores become more lightweight, a better match for lightweight system software
• Chip stacking
• Off-chip optics
Long Term
• MPP-on-a-chip
• On- and off-chip optics
• More intelligent memory systems
• Application-driven architectures