The JST-CREST MegaScale Project: Towards Scaling to a Million CPUs (Sorry, not a NAREGI talk)


Presentation Transcript


  1. The JST-CREST MegaScale Project: Towards Scaling to a Million CPUs (Sorry, not a NAREGI talk). Satoshi Matsuoka, Tokyo Institute of Technology / National Institute of Informatics. CCGSC 2004, Lyon, France, Sept. 28th, 2004

  2. No “Free Lunch” in Grid Infrastructure • The easier the Grid is to use, the more it will be used (observed success) • But then you need a humongous infrastructure; how do you get it? • “Petascale Grids”: we need “total architecting”, R&D, and deployment spanning processors, machines, networks, lower-level middleware, programming and PSE-level middleware, all the way to applications, as well as people and system management

  3. Technical Elements of Petascale Grids • Large-scale Grid programming models and environments support • Large-scale Grid resource management middleware • Terabit lambda-based optical networking and its integration • Low-power, ultra-dense commodity clusters • Low-power, high-performance CPUs. [Slide diagram relates these elements to the NAREGI Project (Japanese National Research Grid Initiative), the JST “MegaScale” Project, petascale infrastructure management (TeraGrid, EGEE, etc.), and SuperSINET, OptIPuter, and parts of GFarm.]

  4. The JST-CREST “MegaScale” Project (2001-2006). Q: How do we achieve “PetaFlops” with “megascale” numbers of CPUs? PI: Hiroshi Nakashima (Toyohashi IT) - MegaScale cluster federation (Grid) programming. Co-PIs: Hiroshi Nakamura (U-Tokyo) - low-power processor architecture; Mitsuhisa Sato (Univ. Tsukuba) - low-power compiler and runtime; Taisuke Boku (Univ. Tsukuba) - dependable multi-way interconnect; Satoshi Matsuoka (Titech) - dependable and autonomous cluster middleware; i.e. the usual suspects in Japan. Ideas: 1. Low-power, commodity dense cluster design through architecture and software efforts 2. Cluster scalability, dependability and autonomy with many “dense” nodes 3. Federate clusters via Grid (resource-aware) programming 4. Build an actual dense, low-power commodity HPC cluster as proof of concept. • MegaScale Cluster Prototype, 1Q 2004: 670 GigaFlops, 150 GB/rack; 2 Gbps interconnect/node, 0.6 Terabits/rack; combined SAN/cluster/Grid interconnect • 2nd MegaScale Prototype, 2005: processor card upgrade, HyperTransport-PCI/X; 1.6 TeraFlops, 300 GB/rack; 1.7 TeraByte/s memory BW/rack

  5. MegaScale Background (2) • Based on commodity technologies: large-scale underlying basis of low-power/high-performance HW/SW techniques and low-cost/high-density packaging • Large-scale dependability: fault modeling/detection/recovery, autonomic/self-configuration => efficient and large-scale • Large-scale programmability: “programming in the large” must be “resource aware” → Grid programming

  6. MegaScale Overview (1). [Slide diagram with components: Grid (resource-aware) programming, LP-HP CPU, LP-HP compiler, coordinated low power, dependable interconnect, dependable/autonomic cluster, MegaProto, workload modeling; themes: low power/high density, dependability, programmability.]

  7. MegaScale Overview (2) • Low-power, high-performance compilation technologies • On-chip memory/cache optimization • Register optimization (less spillage) • Architecture/profile-driven compilation. [Slide diagram: optimizing compiler with hardware-coordinated design; application, cache profile, object code, reorganization; SCIMA processor with registers, ALU, FPU, SCM, cache, NIA.]

  8. MegaScale Overview (3) • FT / autonomic / scalable cluster middleware • Automated cluster configuration • Checkpoint optimizations • Cluster-aware fault injection. [Slide diagram: RI2N redundant network of processors (P) and switches (SW); static performance analyzer generating a static performance model from task code; performance profiler, runtime model, task scheduler.] MegaScript example:

    class HugeSim < Task
      def initialize(*arg)
        @exefile   = './huge-sim'
        @parameter = arg
      end
      def behavior
        n = @parameter
        FOR n compute(n*n); end
      end
    end

    BranchAndBound.exec(HugeSim, [10000, ...])

  9. FT/Autonomic Middleware - how to deal with commodity HW at a large scale • Luci: scalable and autonomic cluster configuration management (w/NEC) • Can configure 1000s of nodes in O(1) time, meta-packages • Automated management, configuration change detection • Component-based design of fault-aware MPI • Mix-and-match components, abstract fault models for large scale and heterogeneity • Checkpoint (CP) optimizations for large-scale clusters • Speculative CP, skewed CP • Scalable cluster-aware fault modeling • Cluster-aware fault injection (w/Bart Miller) • Cluster-aware (performance) fault modeling & detection • RI2N: redundant, fault-tolerant networking for large-scale clusters • Using commodity networks with redundant paths

  10. Scalable Checkpointing via Speculative Checkpointing & Skewed Checkpointing. Ikuhei Yamagata, Satoshi Matsuoka (GSIC, Tokyo Institute of Technology); Hiroshi Nakamura (RCAST, University of Tokyo)

  11. Background • MegaScale computing: a megascale number of system components, hence high failure rates, low availability, and the possibility of multiple simultaneous faults • Coordinated checkpointing is the most “solid” approach, but is feasible only for smaller clusters with low memory footprints • We need low checkpoint overhead and a way to deal with multiple faults

  12. Ordinary Pairwise (Buddy) Checkpointing • Popular strategy (1-mirror): store the checkpoint image both on the node itself and on a neighboring node • If both nodes fail at the same time, the checkpoint is lost (a sketch of the scheme follows below). [Slide diagram: nodes on a network, each mirroring its image to a neighbor.]
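
A rough illustration of the buddy scheme (a sketch only, not the project's middleware; the file names and the fixed image size are assumptions): each MPI rank writes its checkpoint image locally and exchanges a copy with its neighbour, so the image survives any single-node failure but not the failure of a node together with its buddy.

    /* Sketch of 1-mirror (buddy) checkpointing: each rank writes its
     * checkpoint image locally and also sends a copy to rank (r+1) mod N. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static void buddy_checkpoint(const char *img, int len, int rank, int nprocs)
    {
        char fname[64];
        /* 1. local copy */
        snprintf(fname, sizeof fname, "cp_local_rank%d.img", rank);
        FILE *f = fopen(fname, "wb");
        fwrite(img, 1, len, f);
        fclose(f);

        /* 2. mirror copy kept on behalf of the neighbouring rank */
        int buddy = (rank + 1) % nprocs;
        int from  = (rank - 1 + nprocs) % nprocs;
        char *remote = malloc(len);
        MPI_Sendrecv((void *)img, len, MPI_BYTE, buddy, 0,
                     remote, len, MPI_BYTE, from, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        snprintf(fname, sizeof fname, "cp_mirror_of_rank%d.img", from);
        f = fopen(fname, "wb");
        fwrite(remote, 1, len, f);
        fclose(f);
        free(remote);
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        char image[1024];                      /* stand-in for process state */
        memset(image, 'A' + rank % 26, sizeof image);
        buddy_checkpoint(image, sizeof image, rank, nprocs);

        MPI_Finalize();
        return 0;
    }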

  13. Past Work on Multiple-Node Failures • Checkpoint Mirroring (MIR) • Central File Server checkpointing (CFS) • 2-level Recovery Scheme

  14. MIR and CFS • MIR: for k-multiple failures, mirror (copy & save) checkpoints to k nodes (k-mirror MIR); ordinary pairwise CP is 1-mirror MIR • MIR characteristics: overhead for 1-mirror CP & recovery is small, but overhead for k-mirror CP & recovery grows large for large k • CFS: save the CPs of all nodes on a reliable shared filesystem (CFS/N denotes N CFSs for the entire system) • CFS characteristics: copes with multiple failures, but I/O contention during CP & recovery becomes a bottleneck; the number of CFSs per node largely determines overall performance; CFS is expensive. [Slide diagrams: MIR and CFS network layouts.]

  15. 2-Level Recovery Scheme • Employ MIR and CFS hierarchically • Recover quickly from highly probable single-node failures using 1-mirror MIR • Cope with multiple failures by performing a CFS CP every several MIR CPs • Recover from the MIR CP on a single-node failure, from the CFS CP on multiple-node failures • Characteristics: quick recovery from single-node failures; reduces the frequency of high-overhead CFS CP & recovery; but still needs expensive CFSs (a sketch of the schedule follows below)
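
The hierarchy can be sketched as a simple schedule (the function names and the CFS period are illustrative, not taken from the talk): cheap mirror checkpoints at every interval, with a CFS checkpoint layered on top every few intervals.

    /* Sketch of the 2-level schedule only: a MIR checkpoint every interval,
     * plus a CFS checkpoint every CFS_PERIOD intervals for multiple failures. */
    #include <stdio.h>

    #define CFS_PERIOD 5   /* one CFS checkpoint per 5 MIR checkpoints (assumed) */

    static void mir_checkpoint(int epoch) { printf("epoch %d: MIR (buddy) CP\n", epoch); }
    static void cfs_checkpoint(int epoch) { printf("epoch %d: CFS CP\n", epoch); }

    int main(void)
    {
        for (int epoch = 1; epoch <= 12; epoch++) {
            mir_checkpoint(epoch);             /* covers single-node failures   */
            if (epoch % CFS_PERIOD == 0)
                cfs_checkpoint(epoch);         /* covers multiple-node failures */
        }
        return 0;
    }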

  16. Proposal 1: Speculative Checkpointing • Objective: reduce the I/O of parallel CPs onto the CFS by distributing I/O along the time axis • Proposal and results: proposed speculative checkpointing, a variant of incremental checkpointing that distributes I/O; a prototype implementation exhibited a 41% speedup in a parallel CP environment

  17. Previous Method: Incremental CP • Incremental checkpointing [Feldman’89]: checkpoint only the modified (dirty) pages at each CP • Reduces checkpoint size, but does not solve I/O congestion in the time dimension (a sketch of the usual dirty-page tracking follows below). [Slide diagram: I/O load over time for PE0-PE2, with spikes at each CP.]
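
For reference, the usual mechanism behind incremental checkpointing in libckpt-style systems is page-protection-based dirty tracking. The following is a minimal sketch of that general technique, not the talk's code; error handling and address-range checks are omitted.

    /* Dirty-page tracking sketch: protect the data region read-only after
     * each checkpoint; the first write to a page faults, the handler records
     * the page as dirty and unprotects it. Only dirty pages are written. */
    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define NPAGES 64

    static char  *region;
    static size_t pagesz;
    static int    dirty[NPAGES];

    static void on_write_fault(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        size_t page = (size_t)((char *)si->si_addr - region) / pagesz;
        dirty[page] = 1;                                  /* remember the page  */
        mprotect(region + page * pagesz, pagesz, PROT_READ | PROT_WRITE);
    }

    static void incremental_checkpoint(FILE *out)
    {
        for (size_t p = 0; p < NPAGES; p++)
            if (dirty[p])
                fwrite(region + p * pagesz, 1, pagesz, out); /* dirty pages only */
        memset(dirty, 0, sizeof dirty);
        mprotect(region, NPAGES * pagesz, PROT_READ);        /* re-arm tracking  */
    }

    int main(void)
    {
        pagesz = (size_t)sysconf(_SC_PAGESIZE);
        region = mmap(NULL, NPAGES * pagesz, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = on_write_fault;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        mprotect(region, NPAGES * pagesz, PROT_READ);   /* start tracking      */
        region[3 * pagesz] = 1;                         /* dirties page 3 only */

        FILE *out = fopen("inc_cp.img", "wb");
        incremental_checkpoint(out);
        fclose(out);
        return 0;
    }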

  18. Ideal Checkpoint • Distribute the checkpoint load along the time axis to “spread out” the I/O load. [Slide diagram: I/O load over time for PE0-PE2, smoothed between CPs.]

  19. Our Proposal: Speculative CP • Perform asynchronous, speculative CPs in between the synchronous (incremental) CPs • Write early the pages that have changed but are speculated not to change until the next synchronous CP. [Slide diagram: I/O load over time for PE0-PE2, with speculative writes between CPs.]

  20. Prototype Speculative Checkpointer Implementation (2) • Speculate on pages that have not been modified for a certain period of time: once modified, they are expected to have a low re-modification probability, so speculative writes will not increase the CP size • The coordinated CP interval is currently adopted as the time-interval basis; e.g. pages that have not been modified for 2 successive CP intervals are subject to speculative CP. [Slide diagram: a page write followed by no changes across coordinated CP points t1-t4, triggering a speculative CP.]

  21. Prototype Speculative Checkpointer Implementation (1) • Add a page CP speculation algorithm to an existing incremental checkpointer, based on libckpt [Plank’95] • Page CP speculation is based on past access patterns and program locality • Speculation can fail in two ways: (1) failing to speculate, so I/O congestion is not avoided, and (2) speculating wrongly, causing an unnecessary increase in CP size; we currently focus on (2). A sketch of one possible reading of the speculation rule follows below.
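
The slides leave the exact rule open, so the following is only one possible reading of it (the names and the "tick" granularity are assumptions standing in for the coordinated-CP-interval basis): dirty pages that have gone quiet for a while are flushed early by a speculative CP, and the synchronous CP writes whatever is left, spreading I/O along the time axis.

    /* Sketch of a speculation rule: a dirty page untouched for QUIET_TICKS
     * is speculated to stay unchanged and is flushed early; the synchronous
     * CP only writes the remaining dirty pages. */
    #include <stdio.h>

    #define NPAGES      8
    #define QUIET_TICKS 2    /* stand-in for "2 successive CP intervals" */

    struct page {
        int dirty;           /* modified since the last synchronous CP     */
        int last_write;      /* tick of the most recent modification       */
        int saved;           /* already flushed speculatively this cycle   */
    };

    static struct page pg[NPAGES];

    static void touch(int p, int now)        /* application writes page p  */
    {
        pg[p].dirty = 1; pg[p].last_write = now; pg[p].saved = 0;
    }

    static void speculative_cp(int now)
    {
        for (int p = 0; p < NPAGES; p++)
            if (pg[p].dirty && !pg[p].saved && now - pg[p].last_write >= QUIET_TICKS) {
                pg[p].saved = 1;
                printf("tick %d: speculative flush of page %d\n", now, p);
            }
    }

    static void synchronous_cp(int now)
    {
        for (int p = 0; p < NPAGES; p++) {
            if (pg[p].dirty && !pg[p].saved)
                printf("tick %d: synchronous flush of page %d\n", now, p);
            pg[p].dirty = pg[p].saved = 0;   /* start a new checkpoint cycle */
        }
    }

    int main(void)
    {
        touch(1, 0); touch(4, 0);            /* written early, then quiet    */
        touch(6, 3);                         /* written late in the cycle    */
        speculative_cp(4);                   /* flushes pages 1 and 4 early  */
        synchronous_cp(5);                   /* only page 6 remains          */
        return 0;
    }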

  22. Comparative Scenarios • (inc120): incremental CP, interval = 120 s • (inc120+spc): incremental CP, interval = 120 s, plus synchronized speculative CPs at the 60 s midpoints • (inc120+spc distributed): incremental CP, interval = 120 s, plus distributed speculative CPs. [Slide timelines for the three scenarios.]

  23. Experimental Results • Speculative CP allowed a 41% speedup in the best case. [Chart; times in seconds.]

  24. Proposal 2: Skewed Checkpointing • Policies and ideas: same overhead as 1-mirror for single-node failures; cope with multiple failures flexibly • Keep the CP overhead low (it is always incurred) • Recovery overhead may be high (acceptable, since multiple failures do not happen often)

  25. Skewed Checkpointing: Algorithm • Extend 1-mirror MIR: perform a 1-mirror MIR at each CP cycle, but change the CP storage target each time • Save to the node at distance d = 2^k, i.e. d = 1, 2, 4, 8, … • Keep the past k CPs to cope with up to k-multiple failures (up to a maximum multiplicity m). [Slide diagrams: successive CP1/CP2/CP3 placements across the network.]

  26. Skewed Checkpointing: Example • Save to the node at distance 2^k: CP#1 at distance 1, CP#2 at distance 2, CP#3 at distance 4, and repeat cyclically • Multiple faults are then recoverable: e.g. if nodes 0 and 1 fail, CP#1 is lost but CP#2 and CP#3 survive; if nodes 0, 1 and 4 fail, CP#1 and CP#3 are lost but CP#2 survives (a sketch of the placement rule follows below). [Slide diagram: a ring of 8 nodes, 0-7.]
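
A minimal sketch of the placement rule, assuming a ring of N nodes and the distances 1, 2, 4 from the slide (the function name is made up):

    /* Skewed placement: in checkpoint cycle i, each node mirrors its image
     * to the node at distance 2^(i mod k) on the ring, so the last k images
     * of a node live at distances 1, 2, 4, ..., 2^(k-1). */
    #include <stdio.h>

    #define NNODES 8
    #define K      3          /* keep the past 3 checkpoints (distances 1,2,4) */

    static int cp_target(int node, int cycle)
    {
        int distance = 1 << (cycle % K);        /* 1, 2, 4, 1, 2, 4, ... */
        return (node + distance) % NNODES;
    }

    int main(void)
    {
        /* Reproduce the 8-node example: where do node 0's CP#1..#3 go? */
        for (int cycle = 0; cycle < K; cycle++)
            printf("CP#%d of node 0 is mirrored on node %d\n",
                   cycle + 1, cp_target(0, cycle));

        /* e.g. if nodes 0 and 1 fail, CP#1 of node 0 (on node 1) is lost,
           but CP#2 (node 2) and CP#3 (node 4) survive; if nodes 0, 1 and 4
           fail, CP#2 on node 2 still survives. */
        return 0;
    }

One attraction of the power-of-two distances is that a node's k most recent copies end up spread across the ring, so a group of adjacent failures is unlikely to destroy all of them; the conclusion slide notes the corresponding trade-off that longer-distance copies are more expensive to write and to recover from.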

  27. Skewed Checkpointing: Recovery • k = 3, so 3 sets of CPs are kept • Example: single-node failure immediately after CP3. [Slide diagram: the failed process recovers from its most recent surviving CP.]

  28. Skewed Checkpointing: Recovery (continued) • k = 3, so 3 sets of CPs are kept • Example: dual-node failure immediately after CP3. [Slide diagram.]

  29. Skewed Checkpointing: Recovery (continued) • k = 3, so 3 sets of CPs are kept • Example: dual-node failure immediately after CP3 (further recovery step). [Slide diagram.]

  30. Skewed Checkpointing: Characteristics • CP overhead equivalent to 1-mirror MIR • Recovery from a single-node failure is quick (equivalent to 1-mirror MIR, short distance) • Recovery from multiple-node failures is possible at a higher cost • Does not employ a CFS => can use cheap local storage; system cost equivalent to MIR

  31. Evaluation Methodology • Measure CP & recovery time for each CP methodology • Sample job: matrix multiply; CP size on each node is 500 MB • Compute the average overhead for an arbitrary execution time and an arbitrary failure rate λ, using the following measured values. [Parameter table.]

  32. CP & Recovery Time: Evaluation Environment and Results • CP & recovery time is assumed proportional to the number of nodes for CFS. [Tables of measured CP and recovery times.]

  33. CP Average Overhead vs. CP Frequency • Average overhead for 16 nodes for each CP methodology. [Graph.]

  34. Skewed Checkpointing Evaluation • Skewed checkpointing has the lowest average overhead • Advantageous for more than 1000 nodes • k = 4 and k = m show little difference here (in reality this may not be the case). [Graph of overhead vs. number of nodes.]

  35. Conclusion and Future Work • Proposed skewed checkpointing, a low-overhead CP methodology for multiple failures • Skewed checkpointing is the best fit for large cluster and Grid environments • Need to evaluate the effect of failure multiplicity and CP distance: long-distance CP operations are expensive, while near-neighbor CPs are exposed to group failures • MegaProto checkpointing: diskless nodes? • Need to combine skewed and speculative CP for selective usage

  36. SCIMA (Software Controlled Integrated Memory Architecture) • Addressable SCM (Software Controllable Memory) in addition to the ordinary cache • SCM occupies part of the logical address space and has no inclusion relation with the cache • SCM and cache are reconfigurable at the granularity of a way. [Slide diagram: SCIMA overview with registers, ALU, FPU, reconfigurable SCM/cache, NIA, DRAM memory, network, and the address space layout.]

  37. SCIMA Achieves High Performance and Low Power Consumption • Energy-delay product improves by 24% - 78%. [Energy-delay product chart.]
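
For readers unfamiliar with the metric: energy-delay product is consumed energy multiplied by execution time, so it penalizes designs that trade too much speed for their power savings. The numbers in this sketch are invented purely to show the arithmetic; they are not the SCIMA measurements.

    /* Energy-delay product: EDP = energy * time = power * time^2. */
    #include <stdio.h>

    static double edp(double avg_power_watts, double exec_time_sec)
    {
        double energy_joules = avg_power_watts * exec_time_sec;
        return energy_joules * exec_time_sec;
    }

    int main(void)
    {
        double base  = edp(20.0, 10.0);    /* hypothetical baseline            */
        double scima = edp(14.0, 10.5);    /* hypothetical SCIMA configuration */
        printf("EDP improvement: %.0f%%\n", 100.0 * (1.0 - scima / base));
        return 0;
    }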

  38. MegaProto Design Objectives (1) • Project claim = low power, high density → MegaProto as proof of concept • Also a testbed for middleware R&D: low-power, high-performance compilation; RI2N dependable networking; advanced cluster fault modeling/detection/tolerance; MegaScript Grid programming • How do we design a prototype ASAP?

  39. MegaProto Design Objectives (2) • Target power & performance per 19" x 42U rack: total power = 10 kW, peak performance = 1 TFlops, perf/power = 100 MFlops/W → roughly 10x the Earth Simulator • How do we scale to 1 PFlops? As is = 10 MW (within reason); a further 10x improvement = 1 MW → fits a common computing center

  40. Can Low-Power Commodity Processors Compete with Traditional High-Performance Commodity Processors? • Measuring system power: CT-30000 Hall device, connection box + A/D board • Precise and easy power measurement with dynamic resolution of a few microseconds • Non-intrusive power measurement (no need to cut power lines)

  41. Comparison of Target Platforms

  42. Benchmark Programs • Datascan: a synthetic program with repeated memory accesses, increasing the access area to affect cache hits/misses (see the sketch below) • Dhrystone 2.1: fits entirely into the I-cache • NPB (NAS Parallel Benchmarks 2.3): scientific parallel HPC workload • Matrix multiply: investigate the cache blocking optimization for power/performance
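
The talk does not show Datascan's source, so the following is only a guess at its general shape: repeatedly sweep a buffer while doubling the working set, so the access stream shifts from cache hits to cache misses and exposes the memory system's share of the measured power. The buffer sizes and stride are assumptions.

    /* Datascan-style microbenchmark sketch: repeated scans over a growing
     * working set, stepping by one cache line. */
    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_BYTES (8 * 1024 * 1024)   /* beyond any 2004-era L2 cache */
    #define SWEEPS    64

    int main(void)
    {
        volatile char *buf = calloc(MAX_BYTES, 1);
        long sink = 0;

        for (size_t area = 4 * 1024; area <= MAX_BYTES; area *= 2) {
            for (int s = 0; s < SWEEPS; s++)            /* repeated scans of   */
                for (size_t i = 0; i < area; i += 64)   /* one cache line step */
                    sink += buf[i];
            printf("working set %7zu KB scanned\n", area / 1024);
        }

        free((void *)buf);
        return (int)(sink & 1);   /* keep the compiler from optimizing away */
    }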

  43. P4 power consumption (Datascan)

  44. Crusoe power consumption (Datascan)

  45. Power Consumption (Datascan) • Comparison of P4, Pentium-M, XScale and Crusoe (the slide flags “large cooling device” on two of the platforms) • But will this really be effective in reducing overall power in real apps? [Slide photos/charts.]

  46. Achieving Low Power and High Performance with Cache Optimizations • Cache optimizations could allow both high performance and low total power • Q: can we “have the cake and eat it, too” for low-power processors? • Examine the effect of cache blocking optimizations on power consumption (a sketch of the blocked kernel follows below). [Figure: cache blocking optimization in matrix multiply.]
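
A minimal sketch of the kind of cache-blocked matrix multiply examined here (the matrix size and block size are placeholders, not the measured configuration): the loops are tiled so each B x B tile is reused from cache before being evicted.

    /* Cache-blocked matrix multiply sketch; B is the tunable blocking factor,
     * as in the "result of each blocking size" experiment. */
    #include <stdio.h>

    #define N 512
    #define B 64

    static double A[N][N], Bm[N][N], C[N][N];

    int main(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { A[i][j] = 1.0; Bm[i][j] = 2.0; }

        for (int ii = 0; ii < N; ii += B)
            for (int kk = 0; kk < N; kk += B)
                for (int jj = 0; jj < N; jj += B)
                    /* multiply one B x B tile; data stays cache-resident */
                    for (int i = ii; i < ii + B; i++)
                        for (int k = kk; k < kk + B; k++) {
                            double a = A[i][k];
                            for (int j = jj; j < jj + B; j++)
                                C[i][j] += a * Bm[k][j];
                        }

        printf("C[0][0] = %f\n", C[0][0]);   /* 2 * N = 1024 */
        return 0;
    }

Sweeping B, as in the per-blocking-size figure on the next slide, trades tile reuse against keeping the three active tiles within the processor's cache.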

  47. Cache Blocking on Crusoe • Cache blocking is very effective at reducing both execution time and power consumption • Low-power processors benefit from cache optimization to a greater degree than high-performance processors. [Figures: results for each blocking size on Crusoe; execution time of matrix multiply.]

  48. Prototype Low-Power Cluster Using Crusoe • First prototype • Comparative study vs. high-performance processors in a parallel setting • Investigate high-density packaging

  49. Crusoe Cluster Power Efficiency • Power-performance ratio: performance [Mop/s] / energy [Ws] (energy-delay-product style metric; a sketch of the calculation follows below) • The Crusoe cluster achieves better performance and efficiency than a single P4: 22.8% improvement in performance, 28% reduction in total power consumption, 58% improvement in power efficiency • High-density clusters of low-power processors can achieve better power/performance than high-performance processors. [Charts: matrix multiply results and NPB LU (Class A) results; efficiencies of 0.0122, 0.0231 and 0.0146 Mop/s/Ws appear on the chart.]
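
A small sketch of how the efficiency figure is computed, i.e. delivered Mop/s divided by consumed energy in watt-seconds; the inputs below are placeholders, not the measured cluster or P4 values.

    /* Power-performance ratio: (Mop/s) / (watt-seconds). */
    #include <stdio.h>

    static double power_efficiency(double mops, double exec_time_sec,
                                   double avg_power_watts)
    {
        double perf_mops_per_sec = mops / exec_time_sec;
        double energy_ws = avg_power_watts * exec_time_sec;  /* watt-seconds */
        return perf_mops_per_sec / energy_ws;                /* Mop/s per Ws */
    }

    int main(void)
    {
        /* hypothetical run: 60000 Mop in 100 s at an average of 80 W */
        printf("%.4f Mop/s per Ws\n", power_efficiency(60000.0, 100.0, 80.0));
        return 0;
    }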

  50. Crusoe Cluster Performance with DVS (LongRun) • Some fixed clock frequencies achieve better performance than DVS (LongRun) • Very low clock frequencies yield poor performance without much power benefit • In the IS and FT benchmarks there is little difference in performance between low and high clock frequencies • Optimal power efficiency (about 10% better) is achievable with a certain fixed clock → DVS needs improvement. [Charts: performance ratio in NPB Class A at various clocks; total power consumption and power efficiency in IS and LU Class A.]
