
"Towards Petascale Grids as a Foundation of E-Science"


Presentation Transcript


  1. Satoshi Matsuoka Tokyo Institute of Technology / The NAREGI Project, National Institute of Informatics Oct. 1, 2007 EGEE07 Presentation @ Budapest, Hungary "Towards Petascale Grids as a Foundation of E-Science"

  2. Vision of Grid Infrastructure in the past… • Very divergent & distributed supercomputers, storage, etc., tied together & “virtualized”, OR • A bunch of networked PCs virtualized to be a supercomputer • The “dream” is for the infrastructure to behave as a virtual supercomputing environment with an ideal programming model for many applications

  3. But this is not meant to be… [picture here: Don Quixote, or a dog barking up the wrong tree]

  4. TSUBAME: the first 100 Teraflops Supercomputer for Grids, 2006-2010 • “Fastest Supercomputer in Asia”: 29th on the Top500 @ 48.88 TF; now 103 TeraFlops peak as of Oct. 31st! • Sun Galaxy 4 (Opteron dual-core, 8-socket): 10,480 cores / 655 nodes, 32-128 GB memory, 21.4 TeraBytes aggregate memory, 50.4 TeraFlops, OS: Linux (SuSE 9, 10), NAREGI Grid MW • ClearSpeed CSX600 SIMD accelerators: 360 (now 648) boards, 35 (now 52.2) TeraFlops, 60 GB/s • Voltaire ISR9288 InfiniBand, 10 Gbps x2: ~1310+50 ports, ~13.5 Terabits/s (3 Tbits/s bisection), unified IB network, 10 Gbps+ external network • NEC SX-8i (for porting) • Sun Blade integer workload accelerator (90 nodes, 720 CPUs) • Storage: 1.0 Petabyte (Sun “Thumper”, 48 x 500 GB disks per unit) + 0.1 Petabyte (NEC iStore), 1.5 PB; Lustre FS, NFS, CIFS, WebDAV (over IP); 50 GB/s aggregate I/O bandwidth

  5. TSUBAME Job Statistics, Dec. 2006 - Aug. 2007 (#Jobs) • 797,886 jobs (~3,270 daily) • 597,438 serial jobs (74.8%) • 121,108 jobs of <=8 processors (15.2%), i.e. 90% of all jobs are small • 129,398 ISV application jobs (16.2%) • However, >32-processor jobs account for 2/3 of cumulative CPU usage • Coexistence of ease-of-use for both short-duration parameter surveys and large-scale MPI: fits the TSUBAME design well

  6. In the supercomputing landscape, the Petaflops class is already here… in early 2008 • 2008: LLNL/IBM “BlueGene/P”: ~300,000 PPC cores, ~1 PFlops, ~72 racks, ~400 m2 floorspace, ~3 MW power, copper cabling • 2008Q1: TACC/Sun “Ranger”: ~52,600 “Barcelona” Opteron CPU cores, ~500 TFlops, ~100 racks, ~300 m2 floorspace, 2.4 MW power, 1.4 km IB cx4 copper cabling, 2 Petabytes HDD • Other Petaflops machines 2008/2009: LANL/IBM “Roadrunner”, JICS/Cray(?) (NSF Track 2), ORNL/Cray, ANL/IBM BG/P, EU machines (Jülich…) • >10 Petaflops, >1 million cores, 10s of Petabytes planned for 2011-2012 in the US, Japan, (EU), (other APAC)

  7. In fact we can build one now (!) • @Tokyo: one of the largest IDCs in the world (in Tokyo…) • Could easily fit a 10 PF machine here (> 20 Rangers) • Sits on top of a 55 kV / 6 GW substation • 150 m diameter (a small baseball stadium) • 140,000 m2 IDC floorspace • 70+70 MW power • The size of the entire Google(?) (~million LP nodes) • A source of “Cloud” infrastructure

  8. Gilder’s Law: will make thin-client access to servers essentially “free” • Optical fiber (bits per second): doubling time 9 months • Data storage (bits per square inch): doubling time 12 months • Silicon computer chips (number of transistors): doubling time 18 months • [Chart: performance per dollar spent vs. number of years, 0-5] (Original slide courtesy Phil Papadopoulos @ SDSC; Scientific American, January 2001)
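
These doubling times compound quickly. The short Python sketch below is a back-of-the-envelope illustration, not from the talk: growth is modeled simply as 2^(12*years/doubling_months), over the same 0-5 year horizon as the chart, to show why bandwidth per dollar pulls away from storage and compute.

    # Back-of-the-envelope sketch of the doubling times quoted above:
    # relative performance per dollar after a given number of years.

    def growth(doubling_months, years):
        """Relative performance per dollar, starting from 1.0 at year 0."""
        return 2.0 ** (12.0 * years / doubling_months)

    TECHNOLOGIES = {
        "optical fiber (9-month doubling)": 9,
        "disk storage (12-month doubling)": 12,
        "silicon chips (18-month doubling)": 18,
    }

    for years in range(6):                      # the chart's 0-5 year axis
        row = ", ".join(f"{name}: {growth(m, years):6.1f}x"
                        for name, m in TECHNOLOGIES.items())
        print(f"after {years} years: {row}")
    # After 5 years, fiber is ~100x vs ~32x for storage and ~10x for silicon,
    # which is why thin-client access to remote servers becomes essentially free.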

  9. DOE SC Applications Overview (following slides courtesy John Shalf @ LBL/NERSC)
     NAME    | Discipline       | Problem/Method     | Structure
     MADCAP  | Cosmology        | CMB Analysis       | Dense Matrix
     FVCAM   | Climate Modeling | AGCM               | 3D Grid
     CACTUS  | Astrophysics     | General Relativity | 3D Grid
     LBMHD   | Plasma Physics   | MHD                | 2D/3D Lattice
     GTC     | Magnetic Fusion  | Vlasov-Poisson     | Particle in Cell
     PARATEC | Material Science | DFT                | Fourier/Grid
     SuperLU | Multi-Discipline | LU Factorization   | Sparse Matrix
     PMEMD   | Life Sciences    | Molecular Dynamics | Particle

  10. Latency Bound vs. Bandwidth Bound?
     System          | Technology      | MPI Latency | Peak Bandwidth | Bandwidth-Delay Product
     SGI Altix       | NUMAlink-4      | 1.1 us      | 1.9 GB/s       | 2 KB
     Cray X1         | Cray custom     | 7.3 us      | 6.3 GB/s       | 46 KB
     NEC ES          | NEC custom      | 5.6 us      | 1.5 GB/s       | 8.4 KB
     Myrinet cluster | Myrinet 2000    | 5.7 us      | 500 MB/s       | 2.8 KB
     Cray XD1        | RapidArray/IB4x | 1.7 us      | 2 GB/s         | 3.4 KB
     • How large does a message have to be in order to saturate a dedicated circuit on the interconnect? • N1/2 from the early days of vector computing • Bandwidth-delay product in TCP • Bandwidth bound if message size > bandwidth x delay • Latency bound if message size < bandwidth x delay • Except if pipelined (unlikely with MPI due to overhead) • Cannot pipeline MPI collectives (but can in Titanium) (Original slide courtesy John Shalf @ LBL)
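
The classification above is easy to reproduce. Below is a minimal Python sketch using the latency and bandwidth figures quoted in the table; the 1 MB and 1 KB test sizes are just illustrative. It computes each interconnect's bandwidth-delay product and labels a message size as latency-bound or bandwidth-bound.

    # A minimal sketch: bandwidth-delay product (BDP) and latency/bandwidth
    # classification for the interconnects in the table above.
    INTERCONNECTS = {
        # name: (MPI latency in seconds, peak bandwidth in bytes/s)
        "SGI Altix (NUMAlink-4)":     (1.1e-6, 1.9e9),
        "Cray X1 (custom)":           (7.3e-6, 6.3e9),
        "NEC ES (custom)":            (5.6e-6, 1.5e9),
        "Myrinet 2000 cluster":       (5.7e-6, 0.5e9),
        "Cray XD1 (RapidArray/IB4x)": (1.7e-6, 2.0e9),
    }

    def bdp(latency_s, bandwidth_bps):
        """Bytes that must be in flight to saturate the link."""
        return latency_s * bandwidth_bps

    def classify(msg_bytes, latency_s, bandwidth_bps):
        """Bandwidth-bound if the message exceeds the BDP, else latency-bound."""
        return "bandwidth-bound" if msg_bytes > bdp(latency_s, bandwidth_bps) else "latency-bound"

    for name, (lat, bw) in INTERCONNECTS.items():
        print(f"{name:28s} BDP ~{bdp(lat, bw)/1024:5.1f} KB; "
              f"1 MB msg: {classify(1 << 20, lat, bw)}, "
              f"1 KB msg: {classify(1 << 10, lat, bw)}")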

  11. Message Size Distribution (MADBench-P2P) • 60% of messages > 1 MB • Bandwidth dominant: could be executed on a WAN (Original slide courtesy John Shalf @ LBL)

  12. Message Size Distribution (SuperLU-PTP) • >95% of messages < 1 KByte • Low latency, tightly coupled: LAN (Original slide courtesy John Shalf @ LBL)

  13. Collective Buffer Sizes: the demise of metacomputing • 95% latency bound! • For metacomputing, desktop and small-cluster grids are pretty much hopeless, except for parameter-sweep apps (Original slide courtesy John Shalf @ LBL)

  14. So what does this tell us? • A “grid” programming model for parallelizing a single app is not worthwhile • It is either a simple parameter sweep / workflow, or it will not work • We will have enough problems programming a single system with millions of threads (e.g., Jack’s keynote) • Grid programming should be at the “diplomacy” level • We must look at multiple applications, and at how they compete / coordinate • The applications’ execution environment should be virtualized, with the grid transparent to applications • Zillions of apps in the overall infrastructure, competing for resources • Hundreds to thousands of application components that coordinate (workflows, coupled multi-physics interactions, etc.) • NAREGI focuses on these scenarios

  15. Use case in NAREGI: RISM-FMO Coupled Simulation • The electronic structure of nano-scale molecules in solvent is calculated self-consistently by exchanging the solvent charge distribution and the partial charges of the solute molecules • RISM (solvent distribution, suited to SMP) and FMO (electronic structure, suited to clusters) are coupled via GridMPI through Mediators • The solvent charge distribution is transformed from regular to irregular meshes • The Mulliken charges are transferred as the partial charges of the solute molecules • *The original RISM and FMO codes are developed by the Institute for Molecular Science and the National Institute of Advanced Industrial Science and Technology, respectively.
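
As a rough illustration of this self-consistent exchange, here is a toy Python sketch. The three functions are hypothetical numerical stand-ins (the real RISM and FMO codes are the IMS/AIST applications above, coupled over GridMPI via the Mediator), and convergence is reduced to a simple max-difference test on the charges.

    # Toy sketch of the RISM-FMO self-consistent loop described above.
    # All three component functions are made-up stand-ins, not the real codes.

    def rism_solvent_distribution(solute_charges):
        # Stand-in for RISM: solvent response on its regular mesh.
        return [0.5 * q for q in solute_charges]

    def mediator_remap(regular_mesh_data):
        # Stand-in for the Mediator's regular-to-irregular mesh conversion.
        return list(regular_mesh_data)

    def fmo_mulliken_charges(solvent_field):
        # Stand-in for FMO: new Mulliken charges given the solvent field.
        return [0.2 + 0.5 * s for s in solvent_field]

    def coupled_rism_fmo(initial_charges, max_iter=50, tol=1e-6):
        charges = initial_charges
        for _ in range(max_iter):
            solvent = rism_solvent_distribution(charges)             # RISM side
            solvent_on_fmo_mesh = mediator_remap(solvent)            # Mediator
            new_charges = fmo_mulliken_charges(solvent_on_fmo_mesh)  # FMO side
            if max(abs(a - b) for a, b in zip(new_charges, charges)) < tol:
                return new_charges                                   # self-consistent
            charges = new_charges
        return charges

    print(coupled_rism_fmo([0.0, 0.0, 0.0]))   # converges to ~0.2667 per site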

  16. Registration & Deployment of Applications (PSE) • The application developer registers the application with the ACS (Application Contents Service): application summary, program source files, input files, resource requirements, etc. (step 1) • The PSE server selects a compiling host (2), using resource information from the Information Service, and compiles the application (3); the compiled application environment is sent back (4) • Deployment hosts are selected (5) and the application is deployed to servers #1-#3 (6), with test runs reporting OK or NG on each • The deployment information is registered (7) • Enables application sharing in research communities

  17. Description of Workflow and Job Submission Requirements • The workflow is described in NAREGI-WFML using a browser applet (program icons and data icons, e.g. Appli-A, Appli-B) served over http(s) by the Workflow Servlet (Tomcat behind an Apache web server) • The NAREGI JM I/F module translates the workflow into BPEL+JSDL, e.g. <invoke name=EPS-jobA> with JSDL-A, <invoke name=BES-jobA> with JSDL-A, … • The JSDL carries global file information (e.g. /gfarm/..); stdout/stderr are returned via GridFTP • The BPEL+JSDL document, together with application information, is passed to the SuperScheduler; the Information Service, DataGrid, and PSE server are consulted along the way
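
For concreteness, the sketch below builds a minimal JSDL-style job description of the kind the slide labels “JSDL-A”. The element names follow the OGF JSDL 1.0 schema, but the executable path, arguments, and CPU count are invented, and the NAREGI-WFML/BPEL wrapping produced by the Workflow Servlet is not reproduced here.

    # Minimal JSDL-style job description builder (OGF JSDL 1.0 element names;
    # all concrete values below are invented for illustration).
    import xml.etree.ElementTree as ET

    JSDL = "http://schemas.ggf.org/jsdl/2005/11/jsdl"
    POSIX = "http://schemas.ggf.org/jsdl/2005/11/jsdl-posix"

    def make_jsdl(executable, arguments, cpu_count):
        ET.register_namespace("jsdl", JSDL)
        ET.register_namespace("jsdl-posix", POSIX)
        job = ET.Element(f"{{{JSDL}}}JobDefinition")
        desc = ET.SubElement(job, f"{{{JSDL}}}JobDescription")
        app = ET.SubElement(desc, f"{{{JSDL}}}Application")
        posix = ET.SubElement(app, f"{{{POSIX}}}POSIXApplication")
        ET.SubElement(posix, f"{{{POSIX}}}Executable").text = executable
        for arg in arguments:
            ET.SubElement(posix, f"{{{POSIX}}}Argument").text = arg
        res = ET.SubElement(desc, f"{{{JSDL}}}Resources")
        count = ET.SubElement(res, f"{{{JSDL}}}TotalCPUCount")
        ET.SubElement(count, f"{{{JSDL}}}Exact").text = str(cpu_count)
        return ET.tostring(job, encoding="unicode")

    # Hypothetical "job A" of the workflow above: path and arguments are made up.
    print(make_jsdl("/usr/local/bin/appli-A", ["input.dat"], cpu_count=64))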

  18. Reservation-Based Co-Allocation • Co-allocation for heterogeneous architectures and applications • Used for advanced science applications, huge MPI jobs, real-time visualization on the grid, etc. • Flow: the client submits a workflow with abstract JSDL to the Super Scheduler; the Super Scheduler queries the Distributed Information Service (CIM, DAI) for resource information, produces concrete JSDL for each computing resource, and performs reservation-based co-allocation (reservation, submission, query, control) against the GridVM on each resource • Accounting is collected as UR/RUS records
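
At its core, reservation-based co-allocation means finding a time window that is simultaneously free on every requested resource. The sketch below is a deliberately simplified illustration of that idea; it is not the Super Scheduler's actual algorithm, and the resource names and time units are invented.

    # Simplified co-allocation sketch: earliest start time at which all
    # requested resources can be reserved together for the job's duration.

    def co_allocate(free_windows, duration):
        """
        free_windows: dict mapping resource name -> list of (start, end) tuples
                      (times as plain numbers, e.g. minutes from now).
        duration:     required reservation length.
        Returns the earliest common start time, or None if no window fits.
        """
        # Candidate starts are the starts of any free window on any resource.
        candidates = sorted(start for windows in free_windows.values()
                            for start, _ in windows)
        for t in candidates:
            if all(any(start <= t and t + duration <= end for start, end in windows)
                   for windows in free_windows.values()):
                return t
        return None

    windows = {
        "clusterA": [(0, 120), (300, 600)],
        "clusterB": [(60, 240), (360, 480)],
    }
    print(co_allocate(windows, duration=90))   # -> 360, first slot free on both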

  19. Communication Libraries and Tools • Modules • GridMPI: MPI-1 and 2 compliant, grid-ready MPI library • GridRPC: OGF GridRPC-compliant GridRPC library • Mediator: communication tool for heterogeneous applications • SBC: storage-based communication tool • Features • GridMPI: MPI for a collection of geographically distributed resources; high performance, optimized for high-bandwidth networks • GridRPC: task-parallel, simple, seamless programming • Mediator: communication library for heterogeneous applications; data format conversion • SBC: storage-based communication for heterogeneous applications • Supporting Standards • MPI-1 and 2 • OGF GridRPC
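
Since GridMPI is MPI-1/2 compliant, ordinary MPI programs run on it unchanged whether the ranks sit in one machine room or are spread across sites. The toy reduction below illustrates the point; mpi4py is used purely as an illustrative Python stand-in for an MPI binding (GridMPI itself is a C/Fortran MPI implementation), and the program is not NAREGI-specific.

    # Toy MPI program: a standards-compliant, grid-ready MPI (such as GridMPI)
    # runs code like this with no source changes.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Each rank contributes a partial sum; the reduction works the same whether
    # the ranks are co-located or geographically distributed.
    partial = sum(range(rank, 1000, size))
    total = comm.reduce(partial, op=MPI.SUM, root=0)

    if rank == 0:
        print(f"sum over {size} ranks = {total}")   # 499500 regardless of layout

Launched with, say, mpiexec -n 4, every placement of the four ranks prints the same total.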

  20. Grid-Ready Programming Libraries • Standards-compliant GridMPI and GridRPC • GridMPI: data parallel, MPI compatibility • GridRPC (Ninf-G2): task parallel, simple seamless programming • [Figure: RPC fan-out across resources, at the 100-500 CPU and 100,000 CPU scales]

  21. Communication Tools for Co-Allocation Jobs • Mediator: Application-1 and Application-2 communicate over GridMPI, with a Mediator on each side performing data format conversion • SBC (Storage-Based Communication): Application-3 and Application-2 communicate through the SBC library using the SBC protocol
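
The SBC idea can be illustrated with a toy sender/receiver pair that exchange data through a file on shared storage. The path, JSON encoding, and polling loop below are illustrative assumptions, not the SBC library's actual protocol.

    # Toy sketch of storage-based communication: two co-allocated jobs with no
    # direct network path exchange data via shared (grid) storage.
    import json
    import os
    import time

    SHARED_PATH = "/gfarm/shared/sbc_channel.json"   # hypothetical shared-storage path

    def sbc_send(data, path=SHARED_PATH):
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(data, f)
        os.replace(tmp, path)        # atomic rename so readers never see partial data

    def sbc_receive(path=SHARED_PATH, poll_interval=1.0):
        while not os.path.exists(path):
            time.sleep(poll_interval)   # wait until the peer has written the data
        with open(path) as f:
            return json.load(f)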

  22. Compete Scenario: MPI / VM Migration on the Grid (our ABARIS FT-MPI) • A resource manager, aware of individual application characteristics, places VM-encapsulated MPI jobs on clusters • Cluster A (fast CPUs, slow networks) hosts App B (CPU-bound); Cluster B (high bandwidth, large memory) hosts App A (high bandwidth) • VM job migration is used for power optimization and better placement, with the MPI communication log redistributed on migration
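
A toy sketch of the compete scenario: a resource manager that knows each application's dominant requirement places it on the best-fitting cluster. The cluster figures, the application profiles, and the absence of an actual migration hook are all simplifications for illustration.

    # Toy placement logic for the compete scenario above: match each application's
    # dominant requirement (network bandwidth vs. CPU speed) to the best cluster.

    CLUSTERS = {
        # name: (relative CPU speed, relative network bandwidth), figures invented
        "clusterA": (1.5, 0.5),   # fast CPUs, slow network
        "clusterB": (1.0, 2.0),   # high bandwidth, large memory
    }

    def best_cluster(app_profile):
        """Pick the cluster maximizing the metric the application cares about."""
        key = 1 if app_profile == "high-bandwidth" else 0
        return max(CLUSTERS, key=lambda name: CLUSTERS[name][key])

    def schedule(apps):
        """apps: name -> profile ('high-bandwidth' or 'cpu-bound')."""
        placement = {name: best_cluster(profile) for name, profile in apps.items()}
        # In the real scenario, a placement change would trigger VM live migration,
        # with the MPI communication log redistributed (cf. ABARIS FT-MPI).
        return placement

    print(schedule({"App A": "high-bandwidth", "App B": "cpu-bound"}))
    # -> {'App A': 'clusterB', 'App B': 'clusterA'}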
