GridPP22 – Service Resilience and Disaster Planning

GridPP22 – Service Resilience and Disaster Planning David Britton, 1/Apr/09.

Resilience and Disaster Planning
• The Grid must be made resilient to failures and disasters over a wide scale, from simple disk failures up to major incidents like the prolonged loss of a whole site.
• One of the intrinsic characteristics of the Grid is the use of inherently unreliable and distributed hardware in a fault-tolerant infrastructure. Service resilience is about making this fault-tolerance a reality.
PLAN-A

Towards Plan-B
Fortifying the Service
• Increasing the hardware’s capacity to handle faults.
• Duplicating services or machines.
• Automatic restarts.
• Fast intervention.
• In-depth investigation of the reason for failure.
Disaster Planning
• Taking control early enough.
• (Pre-)establishing possible options.
• Understanding user priorities.
• Timely action.
• Effective communication.
See talks by Jeremy (today) and Andrew (tomorrow).
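As an illustration of the "automatic restarts" item, here is a minimal sketch of a supervisor loop with a bounded number of restarts and exponential back-off. Nothing here comes from the talk: the function name `run_with_restarts`, the `start` callable (assumed to run the service and return its exit code), the retry limit and the delay schedule are all hypothetical choices made for the example.

```python
import time
from typing import Callable, Tuple


def run_with_restarts(
    start: Callable[[], int],
    max_restarts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, int]:
    """Run a service, restarting it after a failure (hypothetical helper).

    `start` is assumed to block until the service exits and to return its
    exit code (0 meaning a clean exit). Returns (succeeded, attempts made).
    """
    attempts = 0
    while True:
        exit_code = start()
        attempts += 1
        if exit_code == 0:
            return True, attempts
        if attempts > max_restarts:
            # Give up so that a human can investigate the reason for failure.
            return False, attempts
        # Exponential back-off: 1x, 2x, 4x ... the base delay.
        sleep(base_delay * 2 ** (attempts - 1))


if __name__ == "__main__":
    # Fake service that fails twice and then exits cleanly.
    exit_codes = iter([1, 1, 0])
    ok, attempts = run_with_restarts(lambda: next(exit_codes), base_delay=0.01)
    print(f"succeeded={ok} after {attempts} attempts")
```

Capping the number of restarts matters: a service that keeps failing should be escalated for fast intervention rather than restarted indefinitely.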

Disasters: Not “if” but “when+where” wLCG weekly operations report, Feb-09

Disasters: Not “if” but “how big” A typical campus incident

Purpose of GridPP22
• To understand the experiment priorities and plans (insofar as they are defined) in the case of various disaster scenarios.
• To extract commonalities across our user-base, to inform our priorities and planning in such an event.
• To examine (and help crystallise) the current state of site (and experiment) resilience and disaster planning.
• To raise collective awareness and encourage collaboration and dissemination of best practice.
• An ounce of prevention is worth a pound of cure.

...and talking of quotes
• When anyone asks me how I can best describe my experience in nearly forty years at sea, I merely say, uneventful. Of course there have been winter gales, and storms and fog and the like. But in all my experience, I have never been in any accident... of any sort worth speaking about. I have seen but one vessel in distress in all my years at sea. I never saw a wreck and never have been wrecked nor was I ever in any predicament that threatened to end in disaster of any sort.
• E. J. Smith, 1907, Captain, RMS Titanic
(Who ordered the ICE? – E. J. Smith, 1912)

Status Update Swansea, Sep 08

WLCG Growth (September 2008 and March 2009)

A Magic Moment

Tier-1 Reliability
Last 6 months (Sep-Feb): RAL reliability = 98%. Target reliability for best 8 sites = 98% (from Jan). RAL was in the top 5.
But this was measured with the OPS VO...
ATLAS VO, last 6 months (Sep-Feb): RAL reliability = 90%. Target reliability for best 8 sites = 98%. RAL was 8th out of 11 sites.
But RAL was one of the best sites for both CMS and LHCb.

UK CPU Contribution (6-month plots)

UK Site Contributions, 2007(8) (1-year and 6-month plots)
NorthGrid: 34(22)%
London: 28(25)%
ScotGrid: 18(17)%
Tier-1: 13(15)%
SouthGrid: 7(16)%
GridIreland: 6.1% (~)

CPU Efficiencies (1-year and 6-month plots)

Storage (doh!)

UK Tier-2 Storage Integrals (08Q4)
Pledged: 1500 TB
Provided: 2700 TB
Used: 420 TB
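To put the three storage figures in proportion, the following sketch computes the ratios between them. It assumes the pledged, provided and used values are directly comparable quantities in TB; the variable names are illustrative only.

```python
# Figures from the 08Q4 slide, in TB.
pledged_tb, provided_tb, used_tb = 1500, 2700, 420

provided_over_pledged = provided_tb / pledged_tb  # delivery against the pledge
used_over_provided = used_tb / provided_tb        # fraction of deployed disk in use
used_over_pledged = used_tb / pledged_tb          # usage against the pledge

print(f"Provided / pledged: {provided_over_pledged:.0%}")
print(f"Used / provided:    {used_over_provided:.1%}")
print(f"Used / pledged:     {used_over_pledged:.0%}")
```

On these numbers the Tier-2s provided 1.8 times their pledge, while roughly one sixth of the provided storage was in use.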

Data Transfers

STEP09 (i.e. CCRC09)
• Currently, it seems likely that this will be in June.
• There may be conflicts with the (much delayed) move to R89.
• It raises issues to do with upgrades such as CASTOR 2.1.8.

Current Issues: CASTOR
• The current version (CASTOR 2.1.7) appears to function at an acceptable level, though there are a number of serious bugs that we are learning to work around (notably the BigID and CrossTalk problems).
• These problems have also been observed elsewhere, which adds pressure for them to be addressed.
• CASTOR 2.1.8 is under test at CERN and will shortly be under test at RAL. The consensus is that we need to be very cautious in moving to this version, even though it may address some of the 2.1.7 bugs and offer additional features (e.g. Xrootd functionality).
• Ultimately, this decision must be driven by the experiments (is a consensus possible?). We strongly prefer not to be the first non-CERN site to upgrade.
• Possible conflict with the STEP09 exercise (can we upgrade early enough not to risk participation? Does it make any sense to upgrade after?)
• Is there a bigger risk in not upgrading (degrading support for 2.1.7)?

Current Issues: R89
• Plan-A: R89 must be accepted by STFC by 1st May to allow a 2-week migration towards the end of June.
• Plan-B (if there is a small delay) is a 1-week migration of critical components only.
• Plan-C (if there is a longer delay) is to remain completely in the ATLAS building.
• Must balance establishing a stable service for LHC data with the advantages of moving to a better environment.
• Other factors are STEP09; Castor-upgrade; Costs; and Convenience.
• Hand-over delayed from Dec-22nd 2008 by a number of issues:
  • Cleanliness (addressed)
  • Inaudible fire alarms (addressed)
  • Cooling system (outstanding).

Tier-1 Hardware
• The FY2008 hardware procurement is currently waiting to be delivered pending resolution of the R89 situation:
  • CPU: ~2500 KSI2K to be added to the existing 4590 KSI2K.
  • Disk: ~1500 TB to be added to the existing 2222 TB.
  • Tape: up to 2000 TB can be added to the existing 2195 TB.
• The FY09 hardware procurement will start as soon as the experiments have determined revised requirements based on the new LHC schedule (i.e. soon).
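The capacity figures above imply the following post-delivery totals. This is a simple arithmetic sketch: because the additions are quoted as approximate ("~") or as an upper bound ("up to"), the totals and growth percentages are approximate too, and the dictionary layout is purely illustrative.

```python
# Existing capacity and FY2008 additions as quoted on the slide.
existing = {"CPU (KSI2K)": 4590, "Disk (TB)": 2222, "Tape (TB)": 2195}
added = {"CPU (KSI2K)": 2500, "Disk (TB)": 1500, "Tape (TB)": 2000}

# Total after delivery, and growth relative to the existing capacity.
totals = {name: existing[name] + added[name] for name in existing}
growth_pct = {
    name: round(100 * added[name] / existing[name], 1) for name in existing
}

for name in existing:
    print(
        f"{name}: {existing[name]} + {added[name]} = {totals[name]} "
        f"(+{growth_pct[name]}%)"
    )
```

On these figures CPU would rise to about 7090 KSI2K, disk to about 3722 TB, and tape to at most 4195 TB.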

Current Issues: EGI/NGI
• What follows on from EGEE-III in April 2010?
• Current idea is an EGI body (European Grid Infrastructure) coordinating a set of national NGIs, together with a middleware consortium and a set of Specialist Service Centres (e.g. one for HEP).
• EGI-DS underway.
• Timescales and transition are problematic.
• What is the UK NGI? Some evolution of the NGS with components of GridPP?
• Timescales and transition are problematic.
• Funding is complicated.
• Initial step: A joint GridPP/NGS working group to try and identify common services.
• See talks by John and Robin on the last afternoon.

Summary
• This meeting is about making the most of the window of opportunity before LHC data, to ensure that our Grid services are resilient and our disaster planning is in place, not just at RAL but also at the Tier-2s.
• Meanwhile, the UK continues to perform well and make incremental improvements in our delivery to the experiments and the wLCG.
• There are, and will continue to be, vexing issues and future uncertainties. We must all keep our eye on the ball.
• Remember, it’s not “if” a disaster strikes but “when and where”.
LHC Data TBD!
