Sequoia RFP and Benchmarking Status

UNCLASSIFIED Sequoia RFP and Benchmarking Status Scott Futral, Mark K. Seager, Tom Spelce Lawrence Livermore National Laboratory 2008 SciComp Summer Meeting

Overview • Sequoia Objectives • 25-50x BlueGene/L (367TF/s) on Science Codes • 12-24x Purple on Integrated Design Codes • Sequoia Procurement Strategy • Sequoia is actually a cluster of procurements • Risk management pervades everything • Sequoia Target Architecture • Driven by programmatic requirements and technical realities • Requires innovation on several fronts • Sequoia will deliver petascale computing for the mission and push the envelope by 10-100x in every dimension!

By leveraging industry trends, Sequoia will successfully deliver a petascale UQ engine for the stockpile • Sequoia Production Platform Programmatic Drivers • UQ Engine for mission deliverables in the 2011-2015 timeframe • Programmatic drivers require an unprecedented leap forward in computing power • Program needs both Capability and Capacity • 25-50x BGL (367TF/s) for science codes (knob removal) • 12-24x Purple for capability runs on Purple (8,192 MPI tasks UQ Engine) • These requirements, met with current industry trends, drive us to a different target architecture than Purple or BGL
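As a rough illustration of the "25-50x BGL" target, the multiplier can be applied to the 367 TF/s BlueGene/L figure quoted on the slide. This is only a sketch of the arithmetic: the function name is hypothetical, and treating the multiplier as a direct scale on the quoted figure is an assumption, since the slide states the target in terms of science-code performance rather than peak.

```python
BGL_TFLOPS = 367.0  # BlueGene/L figure quoted on the slide, in TF/s


def scaled_range_pflops(baseline_tflops, low_factor, high_factor):
    """Scale a TF/s baseline by two factors and return (low, high) in PF/s.

    Assumption: the multiplier is applied directly to the quoted figure.
    """
    low = baseline_tflops * low_factor / 1000.0
    high = baseline_tflops * high_factor / 1000.0
    return low, high


low, high = scaled_range_pflops(BGL_TFLOPS, 25, 50)
print(f"25-50x BGL corresponds to about {low:.3f} to {high:.3f} PF/s")
```

Under that assumption the range is on the order of 9 to 18 PF/s of BGL-equivalent throughput.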

Predicting stockpile performance drives five separate classes of petascale calculations • Quantifying uncertainty (for all classes of simulation) • Identifying and modeling missing physics • Improving accuracy in material property data • Improving models for known physical processes • Improving the performance of complex models and algorithms in macro-scale simulation codes • Each of these mission drivers requires petascale computing

Sequoia Strategy • Two major deliverables • Petascale Scaling “Dawn” Platform in 2009 • Petascale “Sequoia” Platform in 2011 • Lessons learned from previous capability and capacity procurements • Leverage best-of-breed for platform, file system, SAN and storage • Major Sequoia procurement is for long-term platform partnership • Three R&D partnerships to incentivize bidders to stretch goals • Risk reduction built into overall strategy from day one • Drive procurement with single peak mandatory • Target Peak+Sustained on marquee benchmarks • Timescale, budget, technical details as target requirements • Include TCO factors such as power

To Minimize Risk, Dawn Deployment Extends the Existing Purple and BG/L Integrated Simulation Environment • ASC Dawn is the initial delivery system for Sequoia • Code development platform and scaling for Sequoia • 0.5 petaFLOP/s peak for ASC production usage • Target production 2009-2014 • Dawn Component Scaling • Memory B:F = 0.3 • Mem BW B:F = 1.0 • Link BW B:F = 2.0 • Min Bisect B:F = 0.001 • SAN GB/s:PF/s = 384 • F is peak FLOP/s

Sequoia Target Architecture in Integrated Simulation Environment Enables a Diverse Production Workload • Diverse usage models drive platform and simulation environment requirements • Will be 2D ultra-res and 3D high-res Quantification of Uncertainty engine • 3D Science capability for known unknowns and unknown unknowns • Peak of 14 petaFLOP/s with option for 20 petaFLOP/s • Target production 2011-2016 • Sequoia Component Scaling • Memory B:F = 0.08 • Mem BW B:F = 0.2 • Link BW B:F = 0.1 • Min Bisect B:F = 0.03 • SAN BW GB/s:PF/s = 25.6 • F is peak FLOP/s
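The component scaling ratios on the Dawn and Sequoia slides can be turned into absolute targets by multiplying by peak. The sketch below does this for memory capacity and SAN bandwidth only. It assumes that "Memory B:F" means bytes of memory per peak FLOP/s and that the SAN ratio is GB/s per peak PF/s; the function name and dictionary keys are illustrative, not from the slides.

```python
def absolute_targets(peak_pflops, memory_bf, san_gbs_per_pf):
    """Convert B:F style ratios into absolute sizes.

    Assumptions: memory_bf is bytes of memory per peak FLOP/s, and
    san_gbs_per_pf is SAN bandwidth in GB/s per peak PF/s.
    """
    peak_flops = peak_pflops * 1e15
    return {
        "memory_TB": memory_bf * peak_flops / 1e12,
        "san_GB_per_s": san_gbs_per_pf * peak_pflops,
    }


# Ratios and peaks as quoted on the Dawn and Sequoia slides.
dawn = absolute_targets(peak_pflops=0.5, memory_bf=0.3, san_gbs_per_pf=384)
sequoia_14 = absolute_targets(peak_pflops=14.0, memory_bf=0.08, san_gbs_per_pf=25.6)
sequoia_20 = absolute_targets(peak_pflops=20.0, memory_bf=0.08, san_gbs_per_pf=25.6)

for name, t in [("Dawn", dawn), ("Sequoia 14 PF", sequoia_14), ("Sequoia 20 PF", sequoia_20)]:
    print(f"{name}: {t['memory_TB']:.0f} TB memory, {t['san_GB_per_s']:.1f} GB/s SAN")
```

Read this way, the lower B:F ratios for Sequoia still imply larger absolute memory and SAN bandwidth than Dawn, because peak grows by a much larger factor than the ratios shrink.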

Sequoia Targets A Highly Scalable Operating System • Lightweight kernel on compute node • Optimized for scalability and reliability • As simple as possible; full control • Extremely low OS noise • Direct access to interconnect hardware • OS features • Linux compatible with OS functions forwarded to I/O node OS • Support for dynamic libs runtime loading • Shared memory regions • Open source • Linux on I/O Node • Leverage huge Linux base & community • Enhance TCP offload, PCIe, I/O • Standard File Systems Lustre, NFSv4, etc • Factor to Simplify: • Aggregates N CN for I/O & admin • Open source

[Figure: software stack diagram. Compute nodes (1-N CN), each with labels: Application; NPTL Posix threads, OpenMP and SE/TM; glibc dynamic loading; GLIBC; MPI; ADI; Function Shipped syscalls; Futex; Shared Memory; RAS; SMP; hardware transport; Sequoia CN and Interconnect. I/O node with labels: FSD, SLURMD, Perf tools, totalview; Linux/Unix; Function Shipped syscalls; Lustre Client, NFSv4, LNet; UDP, TCP/IP; Sequoia ION and Interconnect.]

Sequoia Target Application Programming Model Leverages Factor and Simplify to Scale Applications to O(1M) Parallelism • MPI Parallelism at top level • Static allocation of MPI tasks to nodes and sets of cores+threads • Allow for MPI everywhere, just in case… • Effectively absorb multiple cores+threads in MPI task • Support multiple languages • C/C++/Fortran03/Python • Allow different physics packages to express node concurrency in different ways
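The "static allocation of MPI tasks to nodes and sets of cores+threads" can be pictured as a fixed mapping from MPI rank to a node and a block of cores on that node. The sketch below is a hypothetical layout function, not the Sequoia runtime: the node and core counts are made-up inputs, and a contiguous block assignment is just one possible policy.

```python
def static_layout(num_nodes, cores_per_node, tasks_per_node):
    """Map each MPI rank to (node index, list of core indices).

    Hypothetical block layout: ranks fill a node before moving on, and
    each task on a node gets an equal, contiguous set of cores.
    """
    if cores_per_node % tasks_per_node != 0:
        raise ValueError("tasks_per_node must divide cores_per_node evenly")
    cores_per_task = cores_per_node // tasks_per_node
    layout = {}
    for rank in range(num_nodes * tasks_per_node):
        node, slot = divmod(rank, tasks_per_node)
        first_core = slot * cores_per_task
        layout[rank] = (node, list(range(first_core, first_core + cores_per_task)))
    return layout


# Hybrid case: 4 MPI tasks per 16-core node, each absorbing 4 cores.
hybrid = static_layout(num_nodes=2, cores_per_node=16, tasks_per_node=4)
print(hybrid[5])

# "MPI everywhere" case: one MPI task per core.
flat = static_layout(num_nodes=2, cores_per_node=16, tasks_per_node=16)
print(flat[17])
```

The same function covers both ends of the slide's range: setting `tasks_per_node` equal to `cores_per_node` gives MPI everywhere, while smaller values leave cores inside each task for threads.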

With Careful Use of Node Concurrency We can Support A Wide Variety of Complex Applications • MPI Tasks on a node are processes (one shown) with multiple OS threads (Thread0-3 shown) • Thread0 is the “Main thread”; Thread1-3 are helper threads that morph from Pthread to OpenMP worker to TM/SE compiler-generated threads via runtime support • Hardware support to significantly reduce overheads for thread repurposing and OpenMP loops and locks • Pthreads born with MAIN • Only Thread0 calls functions to nest parallelism • Pthreads based MAIN calls OpenMP based Funct1 • OpenMP Funct1 calls TM/SE based Funct2 • Funct2 returns to OpenMP based Funct1 • Funct1 returns to Pthreads based MAIN

[Figure: timeline of Thread0-3 from MPI_INIT to MPI_FINALIZE, with labels MAIN, Funct1, Funct2, Exit, OpenMP, TM/SE, MPI Call, and W on the helper threads (1-3).]

Sequoia Distributed Software Stack Targets Familiar Environment for Easy Applications Port

[Figure: software stack diagram spanning User Space and Kernel Space, with labels: Code Development Tools; C/C++/Fortran Compilers, Python; APPLICATION; Parallel Math Libs; OpenMP, Threads, SE/TM; Optimized Math Libs; Clib/F03 runtime; SOCKETS; MPI2; ADI; TCP, UDP, IP; LNet; Lustre Client; Function Shipped syscalls; LWK, Linux; SLURM/Moab; RAS, Control System; Code Dev Tools Infrastructure; Interconnect Interface; External Network.]

Sequoia Platform Target Performance is a Combination of Peak and Application Sustained Performance • “Peak” of the machine is absolute maximum performance • FLOP/s = FLoating point OPerations per second • Sustained is a weighted average of the “Figure of Merit” (FOM) of five “marquee” benchmark codes • Four IDC package benchmarks and one “science workload” benchmark from SNL • FOM chosen to mimic “grind times” and factor out scaling issues

[Chart labels: BlueGene/L: 0.4 PF/s; Purple: 0.1 PF/s]

Sequoia Benchmarks have already incentivized the industry to work on problems relevant to our mission needs • What’s missing? • Hydrodynamics • Structural mechanics • Quantum MD

Validation and Benchmark Efforts • Platforms • Purple (IBM Power5, AIX) • BGL (IBM PPC440, LWK) • BGP (IBM PPC450, LWK, SMP) • ATLAS (AMD Opteron, TOSS) • Red Storm (AMD Opteron, Catamount) • Franklin (AMD Opteron, CNL) • Phoenix (Vector, UNICOS)

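The strategy for aggregating performance incentivizes vendors in two ways • 1. Peak (petaFLOP/s) • 2. #MPI tasks per node <= memory per node / 2 GB

$\mathrm{awFOM} = \mathrm{wFOM}_{\mathrm{AMG}} + \mathrm{wFOM}_{\mathrm{IRS}} + \mathrm{wFOM}_{\mathrm{SPhot}} + \mathrm{wFOM}_{\mathrm{UMT}} + \mathrm{wFOM}_{\mathrm{LAMMPS}}$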
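The two rules on this slide can be sketched in a few lines: the cap on MPI tasks per node from the 2 GB per task memory rule, and the aggregate figure of merit as the sum of the five weighted benchmark FOMs. The function names are hypothetical, and the numeric inputs below are made-up placeholders for illustration only, not benchmark results.

```python
BENCHMARKS = ("AMG", "IRS", "SPhot", "UMT", "LAMMPS")


def max_mpi_tasks_per_node(memory_per_node_gb, gb_per_task=2.0):
    """Largest task count satisfying #MPI per node <= memory per node / 2 GB."""
    return int(memory_per_node_gb // gb_per_task)


def aggregate_wfom(weighted_foms):
    """Sum the weighted FOMs of the five marquee benchmarks (awFOM)."""
    missing = [name for name in BENCHMARKS if name not in weighted_foms]
    if missing:
        raise KeyError(f"missing weighted FOM for: {missing}")
    return sum(weighted_foms[name] for name in BENCHMARKS)


# Placeholder inputs, chosen only to exercise the functions.
print(max_mpi_tasks_per_node(16.0))
example = {"AMG": 1.0, "IRS": 2.0, "SPhot": 3.0, "UMT": 4.0, "LAMMPS": 5.0}
print(aggregate_wfom(example))
```

How each individual wFOM is weighted is not given on the slide, so the sketch takes the already-weighted values as input.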

AMG Results

AMG message size distribution • An improved messaging rate would significantly impact AMG communication performance.

UMT and Sphot results

Observations of messaging rate for UMT indicate we need to have messaging rate as an interconnect requirement • Messaging is very bursty, and most messaging occurs at a high messaging rate.

IRS (Implicit Radiation Solver) results

IRS Load Imbalance has two components: compute and communications

[Figure: BG/L timeline with labels MPI, COM PREP, BC, COMPUTE, APPLICATION, wire, COMMUNICATION, COMPUTATION.]

IMBALANCE (MAX / AVG)
#PE Model Power5 BG/L Red Storm
512 1.1429 1.521 1.061
1,000 1.1111 1.487 1.092 1.064
2,197 1.0833 1.428 1.080 1.052
4,096 1.0667 1.352 1.067 1.030
8,000 1.0526 1.052
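The imbalance metric in the table is the maximum per-process value divided by the average. The sketch below computes that ratio and also reproduces the Model column. The slide does not state the model formula; the values happen to match n / (n - 1) with n the cube root of #PE (every #PE in the table is a perfect cube), so that formula is an inference from the numbers, and both function names are hypothetical.

```python
def imbalance(per_pe_values):
    """Load imbalance as MAX / AVG over per-process values (e.g. times)."""
    average = sum(per_pe_values) / len(per_pe_values)
    return max(per_pe_values) / average


def model_imbalance(num_pe):
    """Inferred model: n / (n - 1), where n is the cube root of num_pe.

    Assumption: this formula is reverse-engineered from the Model column;
    the slide itself does not give it.
    """
    n = round(num_pe ** (1.0 / 3.0))
    if n ** 3 != num_pe:
        raise ValueError("num_pe must be a perfect cube")
    return n / (n - 1)


for num_pe in (512, 1000, 2197, 4096, 8000):
    print(num_pe, round(model_imbalance(num_pe), 4))

# Made-up per-process times, only to show the MAX / AVG calculation.
print(imbalance([1.0, 1.0, 2.0]))
```

If the inferred formula is right, the model imbalance shrinks toward 1 as the process count grows.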

Summary • Sequoia is a carefully choreographed risk mitigation strategy to develop and deliver a huge leap forward in computing power to the National Stockpile Stewardship Program • Sequoia will work for weapons science and integrated design codes when delivered because of our evolutionary approach to yield a revolutionary advance on multiple fronts • The groundwork on system requirements, benchmarks, and SOW is in place for launch of a successful procurement competition for Sequoia
