Cray XT3 Experience so far: Horizon Grows Bigger
Richard Alexander, 24 January 2006, raa@cscs.ch
Support Team
• Dominik Ulmer
• Hussein Harake
• Richard Alexander
• Davide Tacchella
• Neil Stringfellow
• Francesco Benvenuto
• Claudio Redaelli
• + Cray Support
Short History
• Handover from Cray on 8 July 05
• Early users
  • CSCS staff
  • PSI staff
  • Very early adopters
• Initially marked by great instability
What is an XT3?
• Massively parallel, Opteron-based, uniprocessor (UP) nodes
• Catamount lightweight kernel on compute nodes
• One executable per job
• No sockets, no fork, no shared memory
• SuSE Linux on "service nodes"
  • I/O nodes, login nodes, system database, boot node
  • "yod" or "pbs-mom" nodes
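A minimal sketch of that single-executable model, assuming only the standard MPI C interface (the file name and the launch description are illustrative, not from the talk):

/* xt3_hello.c - hypothetical minimal example of the single-executable model.
 * Under Catamount there is no fork(), no sockets and no shared memory,
 * so processes on different compute nodes communicate only through MPI. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);               /* one process per compute node */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's rank          */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* number of nodes in the job   */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

The same statically linked binary is started on every compute node by yod (directly or through a PBS job); the program cannot spawn further processes on a node.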
Current configuration
• 12 cabinets (racks) with approximately 96 compute nodes per rack
• 4 login nodes behind a round-robin DNS name
• Software maintenance workstation (SMW)
• SuSE Linux with 2.4 kernel
• PBS Pro
• TotalView
• Portland Group compilers 6.0.5
Changes
• 6 Unicos/LC upgrades - today 1.03.14 (1.3)
• Multiple hardware upgrades
  • Most important: raising the SeaStar ASIC voltage from 1.5 V to 1.6 V
• Multiple software upgrades
• Multiple versions of Lustre (CFS), the high-performance file system
• Multiple firmware upgrades - "Portals"
• We have been up for 1 week at a time!
Acceptance Period
• Began: 21 Nov 05
• Functional tests
• Performance tests
• Stability test for 30 days (7am - 6pm)
Cray XT3: Commissioning & Acceptance
• Acceptance completed 22 December 2005
• Put into production (i.e. opened to all users) 18 January; slow ramp-up planned
• Rebooted almost daily during acceptance, less often afterwards
• Current usage:
  • >75% node utilization
  • Jobs using 64-128 nodes are "typical"
• Two groups have told us they have done science they could not do before.
Cray XT3 Use Model: Fit the work to the hardware
• Palu - 1100 compute nodes: batch only
• Gele - 84 compute nodes: new users
  • compile, debug, scale
• Test system - 56 nodes: test environment
  • Delivery planned March 06
Open issues and next steps
• High-speed network not 100% stable
• Lustre high-speed file system young and immature
• Current batch system limited in functionality
• Bugs lead to intermittent node failures
High-Speed Network
• Stability improved significantly in the last month
• Stability still not satisfactory; Cray analyzing the problems
• Most errors affect/abort only single jobs
• Occasional errors require a reboot (~1/week?)
• No high-speed network => nothing works
File system: Lustre
• Genuine parallel file system
• Great performance
• Very young and immature feature set
• When Lustre is unhealthy (~once or twice a month) the system must be rebooted
• Cray is continuously improving Lustre; the date for a stable version is hard to predict
• Real Lustre errors are difficult to differentiate from high-speed network (HSN) problems
Job Scheduling
• PBS Pro batch system
• Current PBS cannot implement the desired management policies (priority to large jobs, backfilling, reservations)
• Cray is delivering a "special" version of PBS with a Tcl scheduler
  • Will implement priorities and backfilling (sketched below)
  • Testing begins February '06
• Roadmap for reservations still t.b.d.
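To make the missing policies concrete, here is a small, purely illustrative C sketch of "priority to large jobs plus backfilling"; the job names, node counts and priorities are invented, and this is not the PBS/Pro or Cray Tcl scheduler:

/* backfill_sketch.c - hypothetical illustration of the policies named above. */
#include <stdio.h>

struct job { const char *name; int nodes; int priority; };

/* Pick the next job to start, given 'free_nodes' idle compute nodes.
 * The queue is assumed sorted by descending priority, large jobs first. */
static int pick_next(const struct job *queue, int njobs, int free_nodes)
{
    if (njobs == 0)
        return -1;
    if (queue[0].nodes <= free_nodes)
        return 0;                         /* top-priority job fits: run it */
    for (int i = 1; i < njobs; i++)       /* otherwise backfill a smaller job */
        if (queue[i].nodes <= free_nodes)
            return i;
    return -1;                            /* nothing fits: keep nodes reserved */
}

int main(void)
{
    struct job queue[] = {
        { "climate-512", 512, 90 },       /* invented example jobs */
        { "md-128",      128, 50 },
        { "qcd-64",       64, 40 },
    };
    int i = pick_next(queue, 3, 200);     /* 200 nodes currently free */
    if (i >= 0)
        printf("start %s on %d nodes\n", queue[i].name, queue[i].nodes);
    return 0;
}

A production backfill scheduler would also compare each candidate's wall-clock limit against the reservation of the blocked large job, so that backfilled work never delays it.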
Problems of the Day: Intermittent node failures
• 3-15 times per day nodes die due to several bugs
• Cray working on each of these bugs
• Extremely difficult to diagnose
• Currently nodes stay down until the next machine reboot
• Single-node reboot available March - April 2006??
Near-Term Future
• Single-node reboot
• PBS scheduler - a beginning
• Upgrade Gele - the novice user platform
• Bring up the smallest system for systems work
• Try out dual core - upgrade part of the system
• Test Linux on compute nodes
Summary
• System in production and available for development work
• System gaining maturity, but still far from SX-5 or IBM Power-4 standards
• Main current open issues:
  • Parallel file system maturity
  • Scheduling system
  • Node failures
• More interruptions than we would like!