Cray XT3 Experience so far: Horizon Grows Bigger
Richard Alexander, 24 January 2006, raa@cscs.ch
Support Team
• Dominik Ulmer
• Hussein Harake
• Richard Alexander
• Davide Tacchella
• Neil Stringfellow
• Francesco Benvenuto
• Claudio Redaelli
• + Cray Support
Short History
• Handover from Cray on 8 July 05
• Early users
  • CSCS staff
  • PSI staff
  • Very early adopters
• Initially marked by great instability
What is an XT3?
• Massively parallel, Opteron-based, uniprocessor (UP) nodes
• Catamount lightweight kernel on compute nodes
• One executable per job
• No sockets, no fork, no shared memory
• SuSE Linux on "service nodes"
  • I/O nodes, login nodes, system database, boot node
  • "yod" or "pbs-mom" nodes
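A minimal sketch of that single-executable model, assuming only the standard MPI C interface (the file name and the launch description are illustrative, not from the talk):

/* xt3_hello.c - hypothetical minimal example of the single-executable model.
 * Under Catamount there is no fork(), no sockets and no shared memory,
 * so processes on different compute nodes communicate only through MPI. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);               /* one process per compute node */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's rank          */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* number of nodes in the job   */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

The same statically linked binary is started on every compute node by yod (directly or through a PBS job); the program cannot spawn further processes on a node.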
Current configuration
• 12 cabinets (racks) with approximately 96 compute nodes per rack
• 4 login nodes behind a round-robin DNS name
• Software maintenance workstation (SMW)
• SuSE Linux with 2.4 kernel
• PBS Pro
• TotalView
• Portland Group compilers 6.0.5
Changes
• 6 Unicos/LC upgrades - today 1.03.14 (1.3)
• Multiple hardware upgrades
  • Most important: raising the SeaStar ASIC voltage from 1.5 V to 1.6 V
• Multiple software upgrades
• Multiple versions of Lustre (CFS), the high-performance file system
• Multiple firmware upgrades - "Portals"
• We have been up for 1 week at a time!
Acceptance Period
• Began: 21 Nov 05
• Functional tests
• Performance tests
• Stability test for 30 days (7am - 6pm)
Cray XT3: Commissioning & Acceptance
• Acceptance completed 22 December 2005
• Put into production (i.e. opened to all users) 18 January; slow ramp-up planned
• Rebooted almost daily during acceptance, less often afterwards
• Current usage:
  • >75% node utilization
  • Jobs using 64-128 nodes are "typical"
• Two groups have told us they have done science they could not do before.
Cray XT3 Use Model: Fit the work to the hardware
• Palu - 1100 compute nodes: batch only
• Gele - 84 compute nodes: new users
  • compile, debug, scale
• Test system - 56 nodes: test environment
  • Delivery planned March 06
Open issues and next steps
• High-speed network not 100% stable
• Lustre high-speed file system young and immature
• Current batch system limited in functionality
• Bugs lead to intermittent node failures
High-Speed Network
• Stability improved significantly in the last month
• Stability still not satisfactory; Cray analyzing the problems
• Most errors affect/abort only single jobs
• Occasional errors require a reboot (~1/week?)
• No high-speed network => nothing works
File system: Lustre
• Genuine parallel file system
• Great performance
• Very young and immature feature set
• When Lustre is unhealthy (~once or twice a month) the system must be rebooted
• Cray is continuously improving Lustre; the date for a stable version is hard to predict
• Real Lustre errors are difficult to differentiate from high-speed network (HSN) problems
Job Scheduling
• PBS Pro batch system
• Current PBS cannot implement the desired management policies (priority to large jobs, backfilling, reservations)
• Cray is delivering a "special" version of PBS with a Tcl scheduler
  • Will implement priorities and backfilling (sketched below)
  • Testing begins February '06
• Roadmap for reservations still t.b.d.
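To make the missing policies concrete, here is a small, purely illustrative C sketch of "priority to large jobs plus backfilling"; the job names, node counts and priorities are invented, and this is not the PBS/Pro or Cray Tcl scheduler:

/* backfill_sketch.c - hypothetical illustration of the policies named above. */
#include <stdio.h>

struct job { const char *name; int nodes; int priority; };

/* Pick the next job to start, given 'free_nodes' idle compute nodes.
 * The queue is assumed sorted by descending priority, large jobs first. */
static int pick_next(const struct job *queue, int njobs, int free_nodes)
{
    if (njobs == 0)
        return -1;
    if (queue[0].nodes <= free_nodes)
        return 0;                         /* top-priority job fits: run it */
    for (int i = 1; i < njobs; i++)       /* otherwise backfill a smaller job */
        if (queue[i].nodes <= free_nodes)
            return i;
    return -1;                            /* nothing fits: keep nodes reserved */
}

int main(void)
{
    struct job queue[] = {
        { "climate-512", 512, 90 },       /* invented example jobs */
        { "md-128",      128, 50 },
        { "qcd-64",       64, 40 },
    };
    int i = pick_next(queue, 3, 200);     /* 200 nodes currently free */
    if (i >= 0)
        printf("start %s on %d nodes\n", queue[i].name, queue[i].nodes);
    return 0;
}

A production backfill scheduler would also compare each candidate's wall-clock limit against the reservation of the blocked large job, so that backfilled work never delays it.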
Problems of the Day: Intermittent node failures
• 3-15 times per day nodes die due to several bugs
• Cray working on each of these bugs
• Extremely difficult to diagnose
• Currently nodes stay down until the next machine reboot
• Single-node reboot available March - April 2006??
Near-Term Future
• Single-node reboot
• PBS scheduler - a beginning
• Upgrade Gele - the novice user platform
• Bring up the smallest system for systems work
• Try out dual core - upgrade part of the system
• Test Linux on compute nodes
Summary
• System in production and available for development work
• System gaining maturity, but still far from SX-5 or IBM Power-4 standards
• Main current open issues:
  • Parallel file system maturity
  • Scheduling system
  • Node failures
• More interruptions than we would like!