RAL Tier1 Operations Andrew Sansum 18th April 2012
Staffing
Staff changes since GridPP27:
Leavers
• Kier Hawker (Database Team Leader)
New Starters
• Orlin Alexandrov (Grid Team)
• Dimitrios (Fabric Team)
• Vasilij Savin (Fabric Team)
New Roles
• Ian Collier - "Grid Team" Leader
• Richard Sinclair - Database Team Leader
• James Adams - storage system development
Some Changes
• CVMFS in use for Atlas & LHCb:
  • The Atlas (NFS) software server used to give significant problems.
  • Some CVMFS teething issues, but overall much better!
• Virtualisation:
  • Starting to bear fruit. Uses Hyper-V.
  • Numerous test systems.
  • Production systems that do not require particular resilience.
• Quattor:
  • Large gains already made.
Database Infrastructure
We are making significant changes to the Oracle database infrastructure. Why?
• Old servers are out of maintenance
• Move from 32-bit to 64-bit databases
• Performance improvements
• Standby systems
• Simplified architecture
Database Disk Arrays - Future
(Diagram showing Oracle RAC nodes, Fibre Channel SAN, Data Guard, disk arrays, and power supplies on UPS.)
Castor
Changes since last GridPP meeting:
• Castor upgrade to 2.1.10 (March)
• Castor version 2.1.10-1 (July), needed for the higher-capacity "T10KC" tapes
• Updated garbage-collection algorithm to "LRU" rather than the default, which is based on size (July) - the sketch below illustrates the difference
• (Moved 'logrotate' to 1pm rather than 4am.)
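The policy change is easiest to see side by side. Below is a minimal Python sketch, purely illustrative and not CASTOR's actual garbage-collection code, contrasting LRU victim selection with a size-based one; the CachedFile record and function names are invented for the example.

```python
# Simplified illustration of two garbage-collection orderings for a full
# disk cache: least-recently-used (LRU) versus largest-file-first.
# This is NOT CASTOR's implementation, just a model of the policy change.
from dataclasses import dataclass

@dataclass
class CachedFile:
    name: str
    size_bytes: int
    last_access: float  # seconds since epoch

def gc_candidates_lru(files, bytes_needed):
    """LRU policy: evict the files that have gone unused the longest."""
    victims, freed = [], 0
    for f in sorted(files, key=lambda f: f.last_access):
        if freed >= bytes_needed:
            break
        victims.append(f)
        freed += f.size_bytes
    return victims

def gc_candidates_by_size(files, bytes_needed):
    """Size-based policy: evict the largest files first."""
    victims, freed = [], 0
    for f in sorted(files, key=lambda f: f.size_bytes, reverse=True):
        if freed >= bytes_needed:
            break
        victims.append(f)
        freed += f.size_bytes
    return victims
```

Under LRU, recently read files stay resident even when they are large, whereas a size-based ordering can evict a big, freshly written file while stale small files linger.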
Recent Developments (I)
• Hardware
• Procured and commissioned 2.6PB of disk
• Procured and commissioned 15kHS06 of CPU
• T10KC tape drives deployed and ATLAS data (1.5PB) migrated
• New head nodes and core infrastructure storage capacity
• Procured a new Tier-1 core network and new site network
• Oracle database hardware upgrade and re-organisation
• Rebuilding database SAN infrastructure
• Increased CASTOR database resilience: we now have two copies of the CASTOR database, kept in step by Oracle Data Guard
• Upgraded 3D service to Oracle 11
• Virtualisation infrastructure (Hyper-V) now approved for critical production systems (deployment starting)
CASTOR (significant improvements in latency)
• Upgraded to CASTOR 2.1.11-8 (major upgrade)
• Head node replacement
• EMI/UMD upgrades of Grid Middleware
Castor Issues
• Load-related issues on small/full service classes (e.g. AtlasScratchDisk, LHCbRawRDst)
• Load can become concentrated on one or two disk servers (illustrated by the toy model below)
• Exacerbated by an uneven distribution of disk server sizes
• Solutions:
  • Add more capacity; clean up
  • Changes to tape migration policies
  • Re-organisation of service classes
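The concentration effect can be seen with a toy model. The Python sketch below is an illustration only, not the CASTOR scheduler: it assumes new transfers go to servers in proportion to free space, with invented server names and capacities, and shows how one larger or emptier server ends up taking nearly all of the traffic once the rest of a small service class fills up.

```python
# Toy model of why a small, nearly full service class concentrates load:
# if new writes go to servers in proportion to free space, the last
# server(s) with room receive almost all of the traffic.
# Illustration only -- not the CASTOR scheduling algorithm.
import random
from collections import Counter

# Hypothetical service class: capacities and used space in TB.
servers = {
    "serverA": {"capacity": 40, "used": 39.5},
    "serverB": {"capacity": 40, "used": 39.8},
    "serverC": {"capacity": 80, "used": 45.0},  # newer, larger server
}

def pick_server(servers):
    """Choose a server for the next transfer, weighted by free space."""
    free = {name: s["capacity"] - s["used"] for name, s in servers.items()}
    return random.choices(list(free), weights=list(free.values()), k=1)[0]

hits = Counter(pick_server(servers) for _ in range(10_000))
print(hits)  # the big, emptier server takes almost every transfer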
Disk Server Outages by Cause (2011)
Double Disk Failures (2011)
In the process of updating the firmware on the particular batch of disk controllers.
Data Loss Incidents
Summary of losses since GridPP26. Total of 12 incidents logged:
• 1 - due to a disk server failure (loss of 8 files for CMS)
• 1 - due to a bad tape (loss of 3 files for LHCb)
• 1 - files in the Castor nameserver but with no location recorded (9 LHCb files)
• 9 - cases of corrupt files; in most cases the files were old (and pre-date Castor checksumming)
Checksumming is in place for tape and disk files. Daily and random checks are made on disk files (a sketch of such a spot-check follows).
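As a rough illustration of the disk-file checks, here is a minimal Python sketch of a random checksum spot-check. It assumes adler32 checksums (CASTOR's usual checksum type) and a hypothetical in-memory catalogue mapping file paths to expected values; it is not the Tier-1's actual verification script.

```python
# Minimal sketch of a random checksum spot-check on disk files.
# Assumes adler32 checksums and a hypothetical catalogue of expected
# values; not the production verification script.
import random
import zlib

def adler32_of_file(path, chunk_size=1 << 20):
    """Compute the adler32 checksum of a file, reading in 1 MiB chunks."""
    value = 1  # standard adler32 seed
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return value & 0xFFFFFFFF

def spot_check(catalogue, sample_size=100):
    """catalogue: dict of {path: expected_adler32}. Returns paths that mismatch."""
    sample = random.sample(list(catalogue), min(sample_size, len(catalogue)))
    return [p for p in sample if adler32_of_file(p) != catalogue[p]]
```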
T10KC Tapes In Production

Type  Capacity  In Use  Total Capacity
A     0.5TB     5570    2.2PB
B     1TB       2170    1.9PB (CMS)
C     5TB       -       -
T10000C Issues
• Failure of 6 out of 10 tapes (put in context by the rough arithmetic after this list)
• Current A/B failure rate is roughly 1 in 1000
• After writing part of a tape, an error was reported
• Concerns were threefold:
  • A high rate of write errors causes disruption
  • If tapes could not be filled, our capacity would be reduced
  • We were not 100% confident that data would be secure
• Updated firmware in the drives
• 100 tapes now successfully written without problem
• In contact with Oracle
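To put the 6-out-of-10 failures in context against the nominal 1-in-1000 A/B rate, the short Python sketch below works the binomial arithmetic; the inputs are the figures on this slide and the calculation is illustrative only.

```python
# Rough check: how likely are 6+ write failures in 10 tapes if the
# per-tape failure probability were the nominal 1-in-1000 seen on A/B media?
from math import comb

p, n = 1 / 1000, 10
prob_6_or_more = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(6, n + 1))
print(f"P(>=6 failures in {n} tapes at p={p}): {prob_6_or_more:.2e}")
# ~2e-16: effectively impossible by chance, pointing to a systematic fault.
```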
A couple of final comments
• Disk server issues are the main area of effort for hardware reliability/stability...
• ...but do not forget the network.
• Hardware that has performed reliably in the past may throw up a systematic problem.
Formal Operations Processes
(Diagram of the operations process flow: WLCG daily ops, exception review, requirements, production scheduling team, fault review, change review, exception handling, management meeting, SIR review, liaison meeting.)
Service Exceptions 2011
Definitions
• Service exception - a high-priority fault alert raising a pager call
• Callout - a service exception raised outside formal working hours
Operations Team
• Daytime - "Admin on Duty" (AoD): holds the pager, handles service exceptions and passes them on to the daytime teams.
• Night-time - Primary on-call (like the AoD): holds the pager, fixes easy problems and is operationally "in charge". Second-line on-call (one per team) guarantees a response. Some (not guaranteed) third-line support or escalation in serious incidents.
Exceptions Count in 2011 (average rates worked through in the sketch below)
• 461 service exceptions
• 265 callouts
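For a feel of the operational load these totals imply, the sketch below simply divides the 2011 figures by days in the year; it is back-of-the-envelope arithmetic only, not derived from any finer-grained logs.

```python
# Average 2011 rates from the totals on this slide (simple division only).
exceptions, callouts, days = 461, 265, 365

print(f"exceptions per day: {exceptions / days:.2f}")    # ~1.26
print(f"callouts per day  : {callouts / days:.2f}")      # ~0.73
print(f"callout fraction  : {callouts / exceptions:.0%}")  # ~57% of exceptions
```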
Plans for Future
• Oracle 11 upgrade for CASTOR/LFC/FTS needed by July
• CASTOR:
  • Switch on transfer manager (reduce transfer startup latency)
  • Upgrade to 2.1.11-9 (needed before the Oracle 11 upgrade)
  • Upgrade to 2.1.12
• Network (move Tier-1 backbone to 40Gb/s):
  • Site "front of house" network upgrade "early summer"
  • Tier-1 new routing and spine layer … DRI …