
RAL Site Report


Presentation Transcript


  1. RAL Site Report Martin Bly HEPiX @ SLAC – 11-13 October 2005

  2. Overview
  • Intro
  • Hardware
  • OS/Software
  • Services
  • Issues

  3. RAL T1
  • Rutherford Appleton Lab hosts the UK LCG Tier-1
  • Funded via the GridPP project from PPARC
  • Supports LCG and UK Particle Physics users
  • VOs:
    • LCG: Atlas, CMS, LHCb, (Alice), dteam
    • Babar
    • CDF, D0, H1, Zeus
    • Bio, Pheno
  • Expts:
    • Minos, Mice, SNO, UKQCD
    • Theory users
    • …

  4. Tier 1 Hardware
  • ~950 CPUs in batch service
    • 1.4GHz, 2.66GHz, 2.8GHz – P3 and P4/Xeon (HT off)
    • 1.0GHz systems retiring as they fail; phase out end Oct '05
  • New procurement
    • Aiming for 1400+ SPECint2000 per CPU
    • Systems for testing as part of evaluation of tender
    • First delivery early '06, second delivery April/May '06
  • ~40 systems for services (FEs, RB, CE, LCG servers, loggers, etc.)
  • 60+ disk servers
    • Mostly SCSI-attached IDE or SATA, ~220TB unformatted
    • New procurement: probably a PCI/SATA solution
  • Tape robot
    • 6K slots, 1.2PB, 10 drives

  5. Tape Robot / Data Store
  • Current data: 300TB; PP -> 200+TB (110TB Babar)
  • Castor 1 system trials
    • Many CERN-specifics
  • HSM (Hierarchical Storage Manager)
    • 500TB, DMF (Data Management Facility)
    • SCSI/FC
    • Real file system – data migrates to tape after inactivity
    • Not for PP data
    • Due November '05
  • Procurement for a new robot underway
    • 3PB, ~10 tape drives
    • Expect to order end Oct '05, delivery December '05
    • In service by March '06 (for SC4)
    • Castor system

  6. Networking
  • Tier-1 backbone at 4x1Gb/s
    • Upgrading some links to 10Gb/s
    • Multi-port 10Gb/s layer-2 switch stack as hub when available
  • 1Gb/s production link Tier-1 to RAL site
  • 1Gb/s link to SJ4 (internet)
    • 1Gb/s HW firewall
  • Upgrade of site backbone to 10Gb/s expected late '05/early '06
    • Link Tier-1 to site at 10Gb/s – possible mid-2006
    • Link site to SJ5 @ 10Gb/s – mid '06
    • Site firewall remains an issue – limited to 4Gb/s
  • 2x1Gb/s link to UKLight
    • Separate development network in the UK
    • Links to CERN @ 2Gb/s, Lancaster @ 1Gb/s (pending)
    • Managed ~90MB/s during SC2, less since
    • Problems with small packet loss causing traffic limitations (see the worked example below)
    • Tier-1 to UKLight upgrade to 4x1Gb/s pending, 10Gb/s possible
    • UKLight link to CERN requested @ 4Gb/s for early '06
    • Over-running hardware upgrade (4 days expanded to 7 weeks)
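
As a rough aside (not on the slide), the impact of small packet loss follows from the standard Mathis et al. single-stream TCP throughput estimate:

    rate ≈ MSS / (RTT · sqrt(p))

With illustrative numbers (assumed, not measured on the RAL link): MSS = 1460 bytes and a RAL–CERN RTT of ~20ms give MSS/RTT = 584kb/s, so a loss rate of just p = 0.01% (sqrt(p) = 0.01) caps a stream at about 58Mb/s – a tiny fraction of the 2Gb/s UKLight capacity – which is why even very small loss is worth chasing down.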

  7. Tier1 Network Core – SC3
  [Network diagram: dCache pools and gridftp servers on 5510-1/5510-2 switch stacks feeding 7i-1/7i-3 routers; Router A and the firewall carry 1Gb/s to SJ4 for the RAL site and non-SC hosts; ADS caches attached at N x 1Gb/s; the UKLight router carries 2x1Gb/s to CERN and 290Mb/s to Lancaster]

  8. OS/Software
  • Main services:
    • Batch, FEs, CE, RB…: SL3 (3.0.3, 3.0.4, 3.0.5)
    • LCG 2_6_0
    • Torque/MAUI – 1 job/CPU (see the sketch below)
  • Disk: RH72 custom, RH73 custom
    • Project to use SL4.n for disk servers underway
  • Some internal services on SL4 (loggers)
  • Solaris disk servers decommissioned
    • Most hardware sold
  • AFS on AIX (Transarc)
    • Project to move to Linux (SL3/4)
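
The 1 job/CPU policy is set in Torque's server nodes file. A minimal sketch, with hypothetical hostnames (the slide gives none):

    # server_priv/nodes – one job slot per physical CPU (HT off),
    # so a dual-CPU worker advertises np=2, i.e. 1 job/CPU.
    # Hostnames are hypothetical, not the actual RAL worker names.
    lcg0001 np=2
    lcg0002 np=2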

  9. Services (1) – Objyserv
  • Objyserv database service (Babar)
  • Old service on a traditional NFS server
    • Custom NFS, heavily loaded; unable to cope with increased activity on the batch farm due to threading issues in the server
    • Adding another server with the same technology was not tenable
  • New service:
    • Twin ams-based servers: 2 CPUs, HT on, 2GB RAM
    • SL3, RAID1 data disks
    • 4 servers per host system
    • Internal redirection using iptables to different server ports, depending on which of the 4 IP addresses was used to make the connection (sketch below)
    • Able to cope with some ease: 600+ clients
  • Contact: Chris Brew
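
A minimal sketch of the per-IP port redirection, assuming hypothetical addresses and ports (the slide gives neither). Each of the four IP aliases on the host maps the standard client port to a different local server instance:

    # Hypothetical addresses/ports – a sketch of the technique, not the
    # actual RAL configuration. Clients all connect to port 1234, but
    # each destination alias is redirected to a different local server.
    iptables -t nat -A PREROUTING -d 192.168.1.11 -p tcp --dport 1234 -j REDIRECT --to-ports 20001
    iptables -t nat -A PREROUTING -d 192.168.1.12 -p tcp --dport 1234 -j REDIRECT --to-ports 20002
    iptables -t nat -A PREROUTING -d 192.168.1.13 -p tcp --dport 1234 -j REDIRECT --to-ports 20003
    iptables -t nat -A PREROUTING -d 192.168.1.14 -p tcp --dport 1234 -j REDIRECT --to-ports 20004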

  10. Services (2) – Home file system
  • Home file system migration
  • Old system:
    • ~85GB on an A1000 RAID array
    • Sun Ultra10, Solaris 2.6, 100Mb/s NIC
    • Failed to cope with some forms of pathological use
  • New system:
    • ~270GB SCSI RAID5, 6-disk chassis
    • 2.4GHz Xeon, 1GB RAM, 1Gb/s NIC
    • SL3, ext3
    • Stable under I/O and quota testing, and during backup
  • Migration:
    • 3 weeks of planning
    • 1 week of nightly rsync followed by checksumming, to convince ourselves the rsync works (sketch below)
    • 1-day farm shutdown to migrate
    • 1 single file detected with a checksum error
    • Quotas for users unchanged…
    • Keep the old system on standby to restore its backups
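
A minimal sketch of a nightly sync plus verification pass of this kind; hostnames and paths are hypothetical:

    # Nightly mirror of the home tree to the new server: -a preserves
    # permissions/ownership/times, -H preserves hard links, --delete
    # keeps the copy exact.
    rsync -aH --delete /home/ newhome:/export/home/
    # Verification: checksum every file, independently on each side,
    # then compare the sorted lists.
    find /home -type f -exec md5sum {} \; | sort -k 2 > /tmp/old.md5
    # (run the same find on the new server, then:
    #  diff /tmp/old.md5 /tmp/new.md5)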

  11. Services (3) – Batch Server
  • Catastrophic disk failure late on a Saturday evening over a holiday weekend
    • Staff not expected back until 8:30am Wednesday
    • Problem noted Tuesday morning
  • Initial inspection: the disk was a total failure
    • No easy access to backups – the backup tape numbers were in logs on the failed disk!
    • No easy recovery solution with no other systems staff available
    • Jobs appeared happy – terminating OK, sending sandboxes to the gatekeeper, etc. – but no accounting data, and no new jobs started
  • Wednesday:
    • Hardware 'revised' with two disks, software RAID1, clean install of SL3
    • Backups located; batch/scheduling configs recovered from the tape store
    • System restarted with MAUI off to allow Torque to sort itself out (see the sketch below)
    • Queues came up closed
    • MAUI restarted; service picked up smoothly
  • Lessons:
    • Know where the backups are and how to identify which tapes are the right ones
    • Unmodified batch workers are not good enough for system services
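
A sketch of the rebuild and restart sequence using common Torque/MAUI commands; device, service, and queue names are assumptions, not the actual RAL configuration:

    # Mirror the two new disks with software RAID1 (device names hypothetical):
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    # Start Torque with MAUI left stopped, so it can recover job state
    # without any new work being dispatched:
    pbs_server
    qstat -Q    # queues come up closed; check their state before reopening
    # Once job state looks sane, reopen the queues and start the scheduler:
    qmgr -c "set queue prod enabled = true"    # 'prod' is a hypothetical queue
    qmgr -c "set queue prod started = true"
    service maui start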

  12. Issues
  • How to run resilient services on non-resilient hardware?
    • Committed to run 24x365 at 98%+ uptime
    • Modified batch workers with extra disks and hot-swap caddies as servers
    • Investigating HA-Linux (sketch below)
      • Batch server and scheduling experiments positive
      • RB, CE, BDII, R-GMA…
      • Databases
  • Building services maintenance
    • Aircon, power
    • Already two substantial shutdowns planned for 2006
    • New building
  • UKLight is a development project network
    • There have been problems with managing expectations for production services on a development network
    • Unresolved packet loss in CERN–RAL transfers – under investigation
  • 10Gb/s kit expensive
    • Components we would like are not yet affordable/available
    • Pushing against the LCG turn-on date
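
One hedged sketch of what an HA-Linux (Heartbeat v1) failover pair for a service like the batch server could look like; node names, the floating address, and the resource script are hypothetical:

    # /etc/ha.d/haresources – 'batch1' normally holds the floating IP and
    # runs pbs_server; 'batch2' takes both over if heartbeats stop.
    batch1 IPaddr::192.168.1.100 pbs_server
    # /etc/ha.d/ha.cf (excerpt):
    #   node batch1 batch2
    #   auto_failback on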

  13. Questions?
