ORNL Computing Story
Arthur Maccabe, Director, Computer Science and Mathematics Division
May 2010
Managed by UT-Battelle for the Department of Energy
Our vision for sustained leadership and scientific impact
• Provide the world's most powerful open resource for capability computing
• Follow a well-defined path for maintaining world leadership in this critical area
• Attract the brightest talent and partnerships from all over the world
• Deliver leading-edge science relevant to the missions of DOE and key federal and state agencies
• Invest in education and training
• Unique opportunity for multi-agency collaboration based on requirements and technology
ORNL has a role in providing a healthy HPC ecosystem for several agencies

Capability computing (>100 users): leadership computing at the exascale
• Most urgent, challenging, and important problems
• Scientific computing support; scalable applications developed and maintained
• Computational endstations for community codes
• High-speed WAN
• 2009: jobs with ~10^5 CPU cores

Capacity computing (>1,000 users): large hardware systems and mid-range computers (clusters)
• Applications having some scalability developed and maintained; portals; user support
• High-speed WAN
• 2009: jobs with ~10^3 CPU cores

Ubiquitous access to data and workstation-level computing (>10,000 users): mid-range computing (clouds or clusters) and data centers
• Software either commercially available or developed internally
• Knowledge discovery tools and problem-solving environment
• High-speed LAN and WAN
• 2009: jobs with ~1 to 10^2 CPU cores

Together these tiers span the full breadth of access (a small illustrative sketch of the tiering follows).
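As a purely illustrative aid (not ORNL software), the following Python sketch classifies a job by its core count into the three tiers above; the cutoff values are assumptions placed between the slide's approximate 2009 job sizes.

```python
# Hypothetical sketch: map a job's core count onto the three usage tiers above.
# The cutoffs (10,000 and 1,000 cores) are illustrative choices placed between
# the slide's approximate 2009 job sizes (~10^5, ~10^3, and ~1-10^2 cores).

def usage_tier(cores: int) -> str:
    """Classify a job by core count into one of the three ORNL usage tiers."""
    if cores >= 10_000:      # jobs around 10^5 cores: leadership-class work
        return "capability (leadership) computing"
    if cores >= 1_000:       # jobs around 10^3 cores: cluster-scale work
        return "capacity computing"
    return "ubiquitous access / workstation-level computing"

for n in (64, 2_048, 150_000):
    print(f"{n:>7} cores -> {usage_tier(n)}")
```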
ORNL's computing resources

Networks: ESnet, Internet2, TeraGrid, Science Data Net, and classified/national security networks, connected through dedicated network routers

Leadership and capability computing:
• DOE Jaguar (XT5), 2.3 PF: 224,256 cores, 300 TB memory
• DOE Jaguar (XT4), 240 TF: 31,328 cores, 62 TB memory
• NSF Kraken, 1 PF: 99,072 cores, 132 TB memory

Climate:
• NSF Athena, 164 TF: 18,060 cores, 18 TB memory
• NOAA system, 1 PF+ (TBD): cores and memory TBD

Capacity computing:
• ORNL Frost: 2,048 cores, 3 TB memory
• Oak Ridge Institutional Clusters (LLNL model): multiprogrammatic clusters
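For a rough sense of machine balance, the sketch below divides each system's listed memory by its core count. The pairings follow the reconstructed list above, and 1 TB is treated as 1,000 GB, so the results are approximations only.

```python
# Approximate memory per core for the systems listed above, using the slide's
# core counts and memory sizes (1 TB taken as 1,000 GB for simplicity).

systems = {
    "DOE Jaguar (XT5), 2.3 PF": (224_256, 300),
    "NSF Kraken, 1 PF":         (99_072, 132),
    "NSF Athena, 164 TF":       (18_060, 18),
    "DOE Jaguar (XT4), 240 TF": (31_328, 62),
    "ORNL Frost":               (2_048, 3),
}

for name, (cores, memory_tb) in systems.items():
    print(f"{name:26s} ~{memory_tb * 1_000 / cores:.2f} GB/core")
```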
We are DOE's lead laboratory for open scientific computing
• Jaguar performance today: 2,300+ TF
• World's most capable complex for computational science: infrastructure, staff, multiagency programs
• Outcome: sustained world leadership in transformational research and scientific discovery using advanced computing
[Diagram: Why ORNL? Strategy, leadership areas, infrastructure]
Advancing scientific discovery
Five of the top 10 ASCR science accomplishments in the Breakthroughs report used OLCF resources and staff (marked * below).
• Electron pairing in high-Tc cuprates*: The 2D Hubbard model exhibits a superconducting state for the cuprates and an electron-pairing mechanism most likely caused by spin-fluctuation exchange. PRL (2007, 2008)
• Carbon sequestration: Simulations of carbon dioxide sequestration show where the greenhouse gas flows when pumped into underground aquifers.
• Taming turbulent heat loss in fusion reactors*: Advanced understanding of energy loss in tokamak fusion reactor plasmas. PRL (vol. 99) and Physics of Plasmas (vol. 14)
• Stabilizing a lifted flame*: Elucidated the mechanisms that allow a flame to burn stably above burners, namely increasing the fuel or surrounding air co-flow velocity. Combustion and Flame (2008)
• How does a pulsar get its spin?*: Discovered the first plausible mechanism for a pulsar's spin that fits observations, namely the shock wave created when the star's massive iron core collapses. Nature, January 4, 2007
• Shining the light on dark matter*: A glimpse into the invisible world of dark matter, finding that dark matter evolving in a galaxy such as our Milky Way remains identifiable and clumpy. Nature, August 7, 2008
Utility infrastructure to support multiple data centers
• Build a 280 MW substation on the ORNL campus
• Upgrade building power to 25 MW
• Deploy a 10,000+ ton chiller plant
• Upgrade UPS and generator capability
ORNL's data center: designed for efficiency
• 13,800-volt power into the building saves on transmission losses
• 480-volt power to the computers saves $1M in installation costs and reduces losses
• Liquid cooling is 1,000 times more efficient than air cooling
• Variable-speed chillers save energy
• Vapor barriers and positive air pressure keep humidity out of the computer center
• Flywheel-based UPS for highest efficiency
Result: with a PUE of 1.25, ORNL has one of the world's most efficient data centers
Innovative best practices are needed to increase computer center efficiency
[Chart: ORNL's steps, adapted from "High Performance Buildings for High Tech Industries in the 21st Century," Dale Sartor, Lawrence Berkeley National Laboratory]
Today, ORNL's facility is among the world's most efficient data centers
Power utilization efficiency (PUE) = total data center power / IT equipment power
ORNL's PUE = 1.25
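A minimal sketch of the PUE formula above. The power figures are illustrative placeholders chosen so the ratio lands at ORNL's stated 1.25; they are not published facility measurements.

```python
# PUE sketch: total data center power divided by IT equipment power.
# The 12.5 MW / 10.0 MW inputs are made-up numbers that reproduce a 1.25 ratio.

def pue(total_facility_power_mw: float, it_equipment_power_mw: float) -> float:
    """Power utilization efficiency: total data center power / IT equipment power."""
    return total_facility_power_mw / it_equipment_power_mw

print(f"PUE = {pue(12.5, 10.0):.2f}")  # 1.0 would be ideal; lower is better
```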
ORNL's state-of-the-art, wholly owned network is directly connected to every major R&E network at multiple lambdas
Center-wide file system
• "Spider" provides a shared, parallel file system for all systems, based on the Lustre file system
• Demonstrated bandwidth of more than 240 GB/s
• Over 10 PB of RAID-6 capacity on 13,440 1-TB SATA drives and 192 storage servers
• Available from all systems via our high-performance, scalable InfiniBand I/O network
• Currently mounted on over 26,000 client nodes
(A rough per-server breakdown is sketched below.)
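A back-of-the-envelope view of the Spider figures above, assuming the bandwidth and drive pool are spread evenly across the 192 storage servers; these are simple averages, not measured values.

```python
# Even-split averages derived from the Spider figures quoted above.

servers = 192
drives = 13_440            # 1 TB SATA drives
bandwidth_gb_s = 240       # demonstrated aggregate bandwidth (GB/s)
capacity_pb = 10           # usable RAID-6 capacity (">10 PB" on the slide)

print(f"~{bandwidth_gb_s / servers:.2f} GB/s per storage server")
print(f"~{drives / servers:.0f} drives per storage server")
print(f"~{capacity_pb * 1_000 / drives:.2f} TB usable per drive (after RAID-6 overhead)")
```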
Completing the simulation environment to meet the science requirements
[Diagram: the XT5 and XT4 systems, login nodes, the Spider file system, a 25 PB data archive, the Everest Powerwall, and the application development, remote visualization, and end-to-end clusters, all connected over the Scalable I/O Network (SION), a 4x DDR InfiniBand backplane]
We have increased system performance by 1,000 times since 2004
• Cray X1: 3 TF
• 2005: Cray XT3, single-core: 26 TF
• 2006: Cray XT3, dual-core: 54 TF
• 2007: Cray XT4: 119 TF
• 2008: Cray XT4, quad-core: 263 TF
• 2009: Cray XT5 systems, 12-core, dual-socket SMP nodes: 1,000 TF and 2,000+ TF
Science requires a 1,000x advance in computational capability over the next decade
• 2009: Cray XT5, a 2+ PF leadership system for science
• 2011: OLCF-3, a 10-20 PF leadership system with some HPCS technology
• 2015: OLCF-4, 100-250 PF, based on DARPA HPCS technology
• 2018: OLCF-5, 1 EF
(The sustained growth rate this implies is sketched below.)
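The roadmap above implies a sustained growth rate that can be estimated directly from its endpoints; the sketch below uses the slide's 2009 and 2018 figures and is purely illustrative.

```python
# Implied growth rate from the roadmap above: ~2 PF in 2009 to 1 EF in 2018.

start_pf, start_year = 2.0, 2009
target_pf, target_year = 1_000.0, 2018     # 1 EF = 1,000 PF

factor = target_pf / start_pf
years = target_year - start_year
annual_growth = factor ** (1 / years)

print(f"Total increase needed: {factor:.0f}x over {years} years")
print(f"Implied sustained growth: ~{annual_growth:.2f}x per year")
```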
Jaguar: the world's most powerful computer, designed for science from the ground up
National Institute for Computational Sciences: a University of Tennessee and ORNL partnership
• 16,704 six-core AMD Opteron™ processors
• 1,042 teraflops
• 130 TB memory
• 3.3 PB disk space
• 48 service and I/O nodes
• World's most powerful academic supercomputer
An international, dedicated high-end computing project to revolutionize climate modeling
Project: use dedicated HPC resources (the Cray XT4 "Athena" at NICS) to simulate global climate change at the highest resolution ever, with six months of dedicated access.
Expected outcomes:
• Better understand global mesoscale phenomena in the atmosphere and ocean
• Understand the impact of greenhouse gases on the regional aspects of climate
• Improve the fidelity of models simulating mean climate and extreme events
ORNL / DOD HPC collaborations
• Facility: 25 MW of power, 8,000 tons of cooling, 32,000 ft² of raised floor
• Peta/exa-scale HPC technology collaborations in support of national security: system design, performance, and benchmark studies; wide-area network investigations
• Extreme Scale Software Center: focused on widening the usage and improving the productivity of the next generation of "extreme-scale" supercomputers; systems software, tools, environments, and applications development; large-scale system reliability, availability, and serviceability (RAS) improvements
We are partners in the $250M DARPA HPCS program; a prototype Cray system will be deployed at ORNL
Impact:
• Performance (time to solution): speed up critical national security applications by a factor of 10 to 40
• Programmability (idea to first solution): reduce the cost and time of developing application solutions
• Portability (transparency): insulate research and operational application software from the system
• Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, and programming errors
HPCS program focus areas include applications such as intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling, and biotechnology.
Goal: fill the critical technology and capability gap between today's HPC technology (rooted in the late 1980s) and future quantum/bio computing. (Slide courtesy of DARPA)
The next big climate challenge (Nature, vol. 453, no. 7193, May 15, 2008)
• Develop a strategy to revolutionize prediction of the climate through the 21st century to help address the threat of global climate change
• The current inability to provide robust estimates of risk to society is strongly influenced by limitations in computer power
• A World Climate Research Facility (WCRF) for climate prediction should be established to enable the national centers to accelerate progress in improving operational climate prediction at decadal to multi-decadal lead times
• A central component of the WCRF will be one or more dedicated high-end computing facilities that enable the revolution in climate prediction; systems at least 10,000 times more powerful than currently available computers are vital for regional climate predictions to underpin mitigation policies and adaptation needs
Oak Ridge Climate Change Science Institute
James Hack, Director; David Bader, Deputy Director
A new multidisciplinary, multi-agency organization bringing together ORNL's climate change programs to promote a cohesive vision of climate change science and technology at ORNL
• More than 100 staff members matrixed from across ORNL
• World's leading computing systems (>3 PF in 2009)
• Specialized facilities and laboratories
• Programs
Oak Ridge Climate Change Science Institute: integration of models, measurements, and analysis
• Earth system models from local to global scales: atmosphere, ocean, ice, terrestrial and marine biogeochemistry, land use, hydrologic cycle
• Process understanding (observation, experiment, theory): aerosols, water vapor, clouds, and atmosphere dynamics; ocean dynamics and biogeochemistry; ice dynamics; terrestrial ecosystem feedbacks and response; land-use trends and projections; extreme events, hydrology, and aquatic ecology
• Integrated assessment: adaptation, mitigation, infrastructure, energy and economics
Partnerships will be essential.
Facilities and infrastructure: high-performance computing (OLCF, NICS, NOAA); data systems, knowledge discovery, and networking; observation networks; experimental manipulation facilities
http://climatechangescience.ornl.gov
What we would like to be able to say about climate-related impacts within the next 5 years
• What specific changes will be experienced, and where?
• When will the changes begin, and how will they evolve over the next two to three decades?
• How severe will the changes be? How do they compare with historical trends and events?
• What will be the impacts over space and time?
  • People (e.g., food, water, health, employment, social structure)
  • Nature (e.g., biodiversity, water, fisheries)
  • Infrastructure (e.g., energy, water, buildings)
• What specific – and effective – adaptation tactics are possible?
• How might adverse impacts be avoided or mitigated?
High-resolution Earth system modeling: a necessary core capability
• Strategy: develop predictive global simulation capabilities for addressing the consequences of climate change
• Driver: higher-fidelity simulations with improved predictive skill on decadal time scales and regional space scales
• Objective: configurable, high-resolution, scalable atmospheric, ocean, terrestrial, cryospheric, and carbon component models to answer policy- and planning-relevant questions about climate change
• Impact: exploration of renewable energy resource deployment, carbon mitigation strategies, and climate adaptation scenarios (agriculture, energy and water resource management, protection of vulnerable infrastructure, national security)
[Images: net ecosystem exchange of CO2; mesoscale-resolved column-integrated water vapor and eddy-resolved sea surface temperature from Jaguar XT5 simulations]
NOAA collaboration as an example
• Interagency agreement with the DOE Oak Ridge Operations Office, signed August 6, 2009
• Five-year, $215M Work for Others agreement (initial investment of $73M) covering facilities and science activity
• Provides dedicated, specialized high-performance computing and collaborative services for climate modeling
• Builds on an existing three-year MOU with DOE
• Common research goal: develop, test, and apply state-of-the-science, computer-based global climate simulation models built on a strong scientific foundation
Collaboration with Oak Ridge National Laboratory offers:
• Synergy among research efforts across multiple disciplines and agencies
• Substantial existing infrastructure on campus
• Access to world-class network connectivity
• Leverage of the ORNL Extreme Scale Software Center
• Proven project management expertise
Delivering science
Domain science is a partnership of experiment, theory, and simulation working towards shared goals.
Having HPC facilities embedded in an R&D organization (CCSD) comprised of staff with expertise in computer science, math, scalable application development, knowledge discovery, and computational science enables integrated approaches to delivering new science.
• "Effective speed" increases come from faster hardware and improved algorithms
• Science Prospects and Benefits with HPC in the Next Decade: speculative requirements for scientific applications on HPC platforms (2010-2020)
• Architectures and applications can be co-designed in order to create synergy in their respective evolutions
[Diagram: domain science (e.g., nanoscience), theoretical and computational science, computer science, and applied math combining to deliver new science]
We have a unique opportunity to advance the math and computer science critical to mission success through multi-agency partnership
• Institute for Advanced Architectures and Algorithms (IAA): an ORNL-Sandia partnership, jointly funded by NNSA and SC in 2008 (~$7.4M). IAA is the medium through which architectures and applications can be co-designed in order to create synergy in their respective evolutions.
• Extreme Scale Software Development Center: $7M in 2008, aligned with DOE-SC interests
• Two national centers of excellence in HPC architecture and software were established in 2008, funded by DOE and DOD, a major breakthrough in recognition of our capabilities
Preparing for the exascale by analyzing long-term science drivers and requirements
We have recently surveyed, analyzed, and documented the science drivers and application requirements envisioned for exascale leadership systems in the 2020 timeframe. These studies help to:
• Provide a roadmap for the ORNL Leadership Computing Facility
• Uncover application needs and requirements
• Focus our efforts on the disruptive technologies and research areas most in need of attention from us and the broader HPC community
What will an EF system look like?
• All projections are daunting
• Based on projections of existing technology, both with and without "disruptive technologies"
• Assumed to arrive in the 2016-2020 timeframe
• Example 1: 400 cabinets; 115K nodes at 10 TF per node; 50-100 PB; optical interconnect; 150-200 GB/s injection bandwidth per node; 50 MW
• Examples 2-4: DOE "Townhall" report (www.er.doe.gov/ASCR/ProgramDocuments/TownHall.pdf)
(A quick consistency check of Example 1 follows.)
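A quick consistency check of Example 1, using only the projected figures quoted above (115K nodes, 10 TF per node, 50 MW).

```python
# Sanity check of "Example 1": peak performance and per-node power budget.

nodes = 115_000
tf_per_node = 10
system_power_mw = 50

peak_pf = nodes * tf_per_node / 1_000          # 1 PF = 1,000 TF
watts_per_node = system_power_mw * 1e6 / nodes

print(f"Projected peak: {peak_pf:,.0f} PF (~{peak_pf / 1_000:.2f} EF)")
print(f"Power budget: ~{watts_per_node:.0f} W per node")
```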
Moving to the exascale
• The U.S. Department of Energy requires exaflops computing by 2018 to meet the needs of the science communities that depend on leadership computing
• Our vision: provide a series of increasingly powerful computer systems and work with the user community to scale applications to each of the new systems
• Today: upgrade of Jaguar to 6-core processors is in progress
• OLCF-3 project: a new 10-20 petaflops computer based on early DARPA HPCS technology
OLCF roadmap from the 10-year plan (2008-2019): today's 1 PF and 2 PF (6-core) systems, OLCF-3 at 10-20 PF, an HPCS system at 100-250 PF, and future systems at 1 EF, housed in the ORNL Computational Sciences Building, the ORNL Multipurpose Research Facility, and the ORNL Multiprogram Computing and Data Center (140,000 ft²)
Multi-core era: a new paradigm in computing
We have always had inflection points where technology changed:
• Vector era (USA, Japan)
• Massively parallel era (USA, Japan, Europe)
• Multi-core era
What do the science codes need?
What system features do the applications need to deliver the science?
• 20 PF in the 2011-2012 time frame, with 1 EF by the end of the decade
• Applications want powerful nodes, not lots of weak nodes: lots of FLOPS and OPS, fast low-latency memory, memory capacity ≥ 2 GB/core, and a strong interconnect
[Chart: surveyed system features, including node peak FLOPS, memory bandwidth, interconnect latency, memory latency, interconnect bandwidth, node memory capacity, disk bandwidth, large storage capacity, disk latency, WAN bandwidth, MTTI, and archival capacity]
How will we deliver these features and address the power problem?
• Clock rates have reached a plateau and have even gone down
• Power and thermal constraints restrict socket performance
• Multi-core sockets are driving up required parallelism and scalability
• The DARPA Exascale Computing Study (Kogge et al.) concludes we can't get to the exascale without radical changes
• Future systems will get performance by integrating accelerators on the socket (already happening with GPUs): AMD Fusion™, Intel Larrabee, IBM Cell (Power core + synergistic processing units)
• This has happened before (3090 + array processor, 8086 + 8087, ...)
OLCF-3 system description
• Same number of cabinets, cabinet design, and cooling as Jaguar
• Operating system upgrade of today's Cray Linux Environment
• New Gemini interconnect: 3-D torus, globally addressable memory, advanced synchronization features
• New accelerated node design
• 10-20 PF peak performance
• Much larger memory
• 3x larger and 4x faster file system (a rough extrapolation appears below)
• ≈10 MW of power
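Applying the "3x larger and 4x faster" file system targets above to the Spider baseline quoted earlier (>10 PB, >240 GB/s) gives a rough sense of scale; these are simple extrapolations, not published OLCF-3 specifications.

```python
# Rough extrapolation of the OLCF-3 file system from the Spider baseline.

spider_capacity_pb = 10        # ">10 PB" today
spider_bandwidth_gb_s = 240    # ">240 GB/s" demonstrated today

olcf3_capacity_pb = 3 * spider_capacity_pb        # "3x larger"
olcf3_bandwidth_gb_s = 4 * spider_bandwidth_gb_s  # "4x faster"

print(f"OLCF-3 file system target: ~{olcf3_capacity_pb} PB, "
      f"~{olcf3_bandwidth_gb_s} GB/s (~{olcf3_bandwidth_gb_s / 1_000:.1f} TB/s)")
```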
OLCF-3 node description
Accelerated node design:
• Next-generation interconnect
• Next-generation AMD processor paired with a future NVIDIA accelerator
• Fat nodes with 70 GB of memory
• Very high-performance processors
• Very high memory bandwidth
[Diagram: two such nodes (Node 0 and Node 1) attached to the interconnect]
NVIDIA's commitment to HPC
From "NVIDIA Shifts GPU Clusters Into Second Gear," Michael Feldman, HPCwire, May 4, 2009:
"GPU-accelerated clusters are moving quickly from the 'kick the tires' stage into production systems, and NVIDIA has positioned itself as the principal driver for this emerging high performance computing segment. The company's Tesla S1070 hardware, along with the CUDA computing environment, are starting to deliver real results for commercial HPC workloads. For example, Hess Corporation has a 128-GPU cluster that is performing seismic processing for the company. The 32 S1070s (4 GPUs per board) are paired with dual-socket quad-core CPU servers and are performing at the level of about 2,000 dual-socket CPU servers for some of their workloads. For Hess, that means it can get the same computing horsepower for 1/20 the price and for 1/27 the power consumption."
Features for computing on GPUs:
• Added high-performance 64-bit arithmetic
• Adding ECC and parity that other GPU vendors have not added, which is critical for a large system
• Larger memories
• Dual copy engines for simultaneous execution and copy
• The S1070 has 4 GPUs dedicated exclusively to computing, with no video-out cables
• Development of CUDA and recently announced work with PGI on CUDA Fortran
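The Hess figures in the quoted article reduce to a simple per-GPU equivalence; the sketch below merely rearranges the article's numbers.

```python
# Per-GPU equivalence implied by the Hess example in the HPCwire quote above.

s1070_units = 32
gpus_per_unit = 4
cpu_servers_matched = 2_000    # dual-socket CPU servers, per the article

gpus = s1070_units * gpus_per_unit
print(f"{gpus} GPUs ~ {cpu_servers_matched} dual-socket CPU servers "
      f"(~{cpu_servers_matched / gpus:.1f} servers' worth of work per GPU)")
print("Reported ratios vs. the CPU-only approach: 1/20 the price, 1/27 the power")
```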
Our strengths in key areas will ensure success on the path from petaflops to exaflops and the science it enables
• Applications: multidisciplinary application development teams; partnerships to drive application performance; science base and thought leadership
• People: exceptional expertise and experience; in-depth applications expertise in house; strong partners; proven management team
• Software: broad system software development partnerships; experienced performance and optimization tool development teams; partnerships with vendors and agencies to lead the way; leveraging DOD and NSF investments
• Systems: driving the architectural innovation needed for the exascale; superior global and injection bandwidths; purpose-built for scientific computing; leverages DARPA HPCS technologies
• Facilities: power (reliability, availability, cost); space (current and growth path); global network access capable of 96 x 100 Gb/s
Oak Ridge National Laboratory: meeting the challenges of the 21st century
www.ornl.gov