
Tier1 Status Report


Presentation Transcript


  1. Tier1 Status Report
  Martin Bly, RAL, 27-28 April 2005

  2. Topics
  • Hardware
  • Atlas DataStore
  • Networking
  • Batch services
  • Storage
  • Service Challenges
  • Security

  3. Hardware
  • Approximately 550 CPU nodes
  • ~980 processors deployed in batch; the remainder are service nodes, servers, etc.
  • 220TB disk space across ~60 servers and ~120 arrays
  • Decommissioning
    • Majority of the P3/600MHz systems decommissioned Jan 05
    • P3/1GHz systems to be decommissioned in July/Aug 05 after commissioning of the Year 4 procurement
    • Babar SUN systems decommissioned by end Feb 05
    • CDF IBM systems decommissioned and sent to Oxford, Liverpool, Glasgow and London
  • Next procurement
    • 64-bit AMD or Intel CPU nodes – power and cooling considerations
    • Dual cores possibly too new
    • Infortrend arrays / SATA disks / SCSI connect
  • Future
    • Evaluate new disk technologies, dual-core CPUs, etc.

  4. Atlas DataStore
  • Evaluating new disk systems for staging cache
    • FC-attached SATA arrays
    • Additional 4TB/server, 16TB total
    • Existing IBM/AIX servers
  • Tape drives
    • Two additional 9940B drives, FC attached
    • 1 for ADS, 1 for the test CASTOR installation
  • Developments
    • Evaluating a test CASTOR installation
    • Stress testing ADS components to prepare for the Service Challenges
    • Planning for a new robot
    • Considering the next generation of tape drives
    • SC4 (2006) requires a step up in cache performance
    • Ancillary network rationalised

  5. Networking
  • Planned upgrades to the Tier1 production network
    • Started November 04
    • Based on Nortel 5510-48T 'stacks' for large groups of CPU and disk server nodes (up to 8 units/stack, 384 ports)
    • High-speed backbone inter-unit interconnect (40Gb/s bi-directional) within stacks
  • Multiple 1Gb/s uplinks aggregated to form the backbone
    • Currently 2 x 1Gb/s, max 4 x 1Gb/s
    • Upgrade to 10Gb/s uplinks and head node as cost falls
  • Uplink configuration, with links to separate units within each stack and to the head switch, will provide resilience
  • Ancillary links (APCs, disk arrays) on a separate network
  • Connected to UKLight for SC2 (see Service Challenges below)
    • 2 x 1Gb/s links aggregated from the Tier1 (capacity sketch below)
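As a rough illustration of what the aggregated uplinks can carry, the Python sketch below converts the 2 x and 4 x 1Gb/s configurations quoted above into approximate usable MB/s. The 80% efficiency factor and the helper name are assumptions for illustration, not measured RAL figures.

```python
# Rough uplink-capacity arithmetic for the aggregated 1Gb/s links described above.
# The efficiency factor is an assumption, not a measured figure.

def usable_mb_per_s(links, gbit_per_link=1.0, efficiency=0.8):
    """Convert N aggregated links into an approximate usable rate in MB/s."""
    bits_per_s = links * gbit_per_link * 1e9 * efficiency
    return bits_per_s / 8 / 1e6  # bits -> bytes -> MB

if __name__ == "__main__":
    for links in (2, 4):
        print(f"{links} x 1Gb/s aggregated ~ {usable_mb_per_s(links):.0f} MB/s usable")
    # For comparison: RAL sustained 80 MB/s over the 2 x 1Gb/s UKLight link during SC2,
    # well inside the ~200 MB/s that two links could carry at 80% efficiency.
```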

  6. Batch Services
  • Worker node configuration based on traditional-style batch workers with the LCG configuration on top
    • Running SL 3.0.3 with LCG 2_4_0
    • Provisioning by PXE/Kickstart
    • YUM/Yumit, Yaim, Sure, Nagios, Ganglia…
  • All rack-mounted workers are dual purpose, accessed via a single batch system PBS server (Torque)
  • The scheduler (MAUI) allocates resources for LCG, Babar and other experiments using Fair Share allocations from the User Board
  • Jobs are able to spill into allocations for other experiments, and from one 'side' to the other, when spare capacity is available, to make best use of the capacity
  • Some issues with jobs that use excess memory (memory leaks) not being killed by MAUI or Torque – under investigation (see the monitoring sketch below)
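As an illustration of the kind of check that could catch the runaway-memory jobs mentioned above, here is a minimal Python sketch that parses `qstat -f` output from Torque and reports jobs above a memory ceiling. The 2GB threshold, the report-only behaviour and the parsing details are assumptions for illustration; this is not the Tier1's actual monitoring tool.

```python
# Minimal sketch of a memory watchdog for Torque jobs: parse `qstat -f` output and
# report jobs whose resident memory exceeds a threshold.  Threshold and behaviour
# are assumptions for illustration only.

import re
import subprocess

MEM_LIMIT_KB = 2 * 1024 * 1024  # hypothetical 2 GB per-job ceiling

def jobs_over_limit():
    out = subprocess.run(["qstat", "-f"], capture_output=True, text=True).stdout
    offenders = []
    # `qstat -f` prints one "Job Id: ..." stanza per job with indented attributes.
    for stanza in out.split("Job Id:")[1:]:
        job_id = stanza.splitlines()[0].strip()
        m = re.search(r"resources_used\.mem = (\d+)kb", stanza)
        if m and int(m.group(1)) > MEM_LIMIT_KB:
            offenders.append((job_id, int(m.group(1))))
    return offenders

if __name__ == "__main__":
    for job_id, mem_kb in jobs_over_limit():
        print(f"{job_id}: {mem_kb / 1024 / 1024:.1f} GB used, over limit")
```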

  7. Service Systems
  • Service systems migrated to SL 3
    • Mail hub, NIS servers, UIs
    • Babar UIs configured as a DNS triplet
  • NFS / data servers
    • Customised RH7.n
    • Driver issues
    • NFS performance of SL 3 uninspiring compared with 7.n
  • dCache systems at SL 3
  • LCG service nodes at SL 3, LCG-2_4_0
    • Need to migrate to LCG-2_4_0 or lose work

  8. Storage
  • Moving from NFS to SRMs for data access
  • dCache successfully deployed in production
    • Used by CMS, ATLAS…
    • See talk by Derek Ross
  • Xrootd deployed in production
    • Used by Babar
    • Two 'redirector' systems handle requests
      • Selected by DNS pair (see the resolution sketch below)
      • Hand off requests to the appropriate server
    • Reduces NFS load on the disk servers
  • Load issues with the Objectivity server
    • Two additional servers being commissioned
  • Project to look at SL 4 for servers
    • 2.6 kernel, journaling file systems – ext3, XFS
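To show what "selected by DNS pair" means in practice, the sketch below resolves a round-robin alias and lists the distinct addresses a client might be handed. The alias hostname is hypothetical (the real name is not given in the slides); port 1094 is the conventional xrootd port.

```python
# Sketch of how a client lands on one of the two xrootd redirectors when they
# sit behind a single round-robin DNS alias.  The hostname is a placeholder.

import socket

REDIRECTOR_ALIAS = "xrootd-redirector.example.ac.uk"  # hypothetical alias

def resolve_pair(alias):
    """Return the distinct A-record addresses behind a round-robin alias."""
    infos = socket.getaddrinfo(alias, 1094, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})

if __name__ == "__main__":
    try:
        print(resolve_pair(REDIRECTOR_ALIAS))
    except socket.gaierror:
        print("alias does not resolve (placeholder hostname)")
```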

  9. Service Challenges I
  • The Service Challenges are a programme of infrastructure trials designed to test the LCG fabric at increasing levels of stress/capacity in the run-up to LHC operation
  • SC2 – March/April 05:
    • Aim: T0 -> T1s aggregate of >500MB/s sustained for 2 weeks
    • 2Gb/s link via UKLight to CERN
    • RAL sustained 80MB/s for two weeks to a dedicated (non-production) dCache
      • 11/13 gridftp servers
      • Limited by network issues
    • Internal testing reached 3.5Gb/s (~400MB/s) aggregate disk to disk (see the conversion sketch below)
    • Aggregate to the 7 participating sites: ~650MB/s
  • SC3 – July 05 – Tier1 expects:
    • CERN -> RAL at 150MB/s sustained for 1 month
    • T2s -> RAL (and RAL -> T2s?) at a yet-to-be-defined rate
      • Lancaster, Imperial…
      • Some on UKLight, some via SJ4
    • Production phase Sept–Dec 05
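The SC2 figures above mix Gb/s and MB/s; the short sketch below makes the unit conversions and the total transferred volume explicit, using only figures quoted on the slide.

```python
# Quick unit arithmetic behind the SC2 figures quoted above.

def gbps_to_mbs(gbps):
    """Convert a line rate in Gb/s to MB/s (1 Gb/s = 1e9 bits/s, 1 MB = 1e6 bytes)."""
    return gbps * 1e9 / 8 / 1e6

def volume_tb(rate_mbs, days):
    """Total data moved at a sustained rate, in TB."""
    return rate_mbs * 86400 * days / 1e6

if __name__ == "__main__":
    print(f"3.5 Gb/s disk-to-disk ~ {gbps_to_mbs(3.5):.0f} MB/s")          # ~437 MB/s, quoted as ~400
    print(f"2 Gb/s UKLight link   ~ {gbps_to_mbs(2.0):.0f} MB/s ceiling")   # 250 MB/s
    print(f"80 MB/s for 14 days   ~ {volume_tb(80, 14):.0f} TB transferred")  # ~97 TB
```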

  10. Service Challenges II
  • SC4 – April 06
    • CERN–RAL T0–T1 expects 220MB/s sustained for one month
    • RAL expects T2–T1 traffic at N x 100MB/s simultaneously
    • June 06 – Sept 06: production phase
  • Longer term:
    • There is some as-yet-undefined T1 -> T1 capacity needed; this could add 50 to 100MB/s
    • CMS production will require 800MB/s combined and sustained from batch workers to the storage systems within the Tier1
    • At some point there will be a sustained double-rate test – 440MB/s T0–T1 and whatever is then needed for T2–T1
  • It is clear that the Tier1 will be able to keep a significant part of a 10Gb/s link busy continuously, probably from late 2006 (rough estimate below)
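A back-of-envelope estimate, assuming N = 2 Tier-2 streams and the midpoint of the 50-100MB/s T1-T1 figure (both assumptions, since the slide leaves them open), supports the claim that the Tier1 could keep a sizeable fraction of a 10Gb/s link busy.

```python
# Back-of-envelope check on the 10 Gb/s claim above.  The T2 stream count and
# T1-T1 rate are assumptions; the 800 MB/s CMS figure is internal (workers to
# storage) and so is excluded from the WAN total.

LINK_GBPS = 10.0
T0_T1_MBS = 220          # SC4 target
T2_T1_MBS = 2 * 100      # assumed N = 2 for illustration
T1_T1_MBS = 75           # midpoint of the 50-100 MB/s estimate

total_mbs = T0_T1_MBS + T2_T1_MBS + T1_T1_MBS
total_gbps = total_mbs * 8 / 1000
print(f"WAN total ~ {total_mbs} MB/s ~ {total_gbps:.1f} Gb/s "
      f"({100 * total_gbps / LINK_GBPS:.0f}% of a 10 Gb/s link)")
```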

  11. Security
  • The Badguys™ are out there
  • Users are vulnerable to losing authentication data anywhere
    • Still some less-than-ideal practices
  • All local privilege-escalation exploits must be treated as high-priority must-fixes
  • Continuing programme of locking down and hardening exposed services and systems
  • You can only ever become more secure, never fully secure
  • See talk by Roman Wartel
