GridPP22 – Service Resilience and Disaster Planning

Presentation Transcript


  1. GridPP22 – Service Resilience and Disaster Planning David Britton, 1/Apr/09.

  2. Resilience and Disaster Planning • The Grid must be made resilient to failures and disasters over a wide scale, from simple disk failures up to major incidents like the prolonged loss of a whole site. • One of the intrinsic characteristics of the Grid is the use of inherently unreliable and distributed hardware in a fault-tolerant infrastructure. Service resilience is about making this fault-tolerance a reality. Plan-A
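
A toy illustration of the point about building a fault-tolerant infrastructure out of unreliable components: if a file (or service) is replicated at n independent sites, each unavailable with probability p at any given moment, all copies are simultaneously unavailable only with probability p^n. The Python sketch below uses invented numbers; p and the replica counts are illustrative assumptions, not GridPP measurements.

```python
# Toy model of resilience through replication.
# p_site_down and the replica counts are invented illustration values,
# not measured GridPP/WLCG figures.
p_site_down = 0.05  # assumed probability that any single copy is unavailable

for n_replicas in (1, 2, 3):
    p_all_down = p_site_down ** n_replicas  # all independent copies down at once
    print(f"{n_replicas} replica(s): unavailable with probability {p_all_down:.6f}")
```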

  3. Towards Plan-B Fortifying the Service: • Increasing the hardware’s capacity to handle faults. • Duplicating services or machines. • Automatic restarts (see the watchdog sketch below). • Fast intervention. • In-depth investigation of the reason for failure. Disaster Planning: • Taking control early enough. • (Pre-)establishing possible options. • Understanding user priorities. • Timely action. • Effective communication. See talks by Jeremy (today) and Andrew (tomorrow).
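
The "automatic restarts" item amounts to a watchdog that probes a service and restarts it before anyone is paged. Below is a minimal Python sketch of that idea; the health-check URL, service unit name and thresholds are hypothetical placeholders, not GridPP's actual tooling.

```python
#!/usr/bin/env python
"""Minimal service watchdog: probe an endpoint, restart on repeated failure.

Illustrative sketch only; the URL, service name and restart command are
placeholders, not the actual GridPP/Tier-1 configuration.
"""
import subprocess
import time
import urllib.request

PROBE_URL = "http://localhost:8443/health"                      # hypothetical health endpoint
RESTART_CMD = ["systemctl", "restart", "example-grid-service"]  # placeholder service unit
MAX_FAILURES = 3       # restart only after several consecutive failed probes
PROBE_INTERVAL = 60    # seconds between probes


def probe_ok(url: str, timeout: int = 10) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False


def main() -> None:
    failures = 0
    while True:
        if probe_ok(PROBE_URL):
            failures = 0
        else:
            failures += 1
            if failures >= MAX_FAILURES:
                # Automatic restart; a real deployment would also raise a ticket
                # so the "in-depth investigation" step still happens.
                subprocess.run(RESTART_CMD, check=False)
                failures = 0
        time.sleep(PROBE_INTERVAL)


if __name__ == "__main__":
    main()
```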

  4. Disasters: Not “if” but “when+where” WLCG weekly operations report, Feb-09

  5. Disasters: Not “if” but “how big” A typical campus incident

  6. Purpose of GridPP22 • To understand the experiment priorities and plans (insofar as they are defined) in the case of various disaster scenarios. • To extract commonalities across our user-base, to inform our priorities and planning in such an event. • To examine (and help crystallise) the current state of site (and experiment) resilience and disaster planning. • To raise collective awareness and encourage collaboration and dissemination of best practice. • An ounce of prevention is worth a pound of cure.

  7. ...and talking of quotes • When anyone asks me how I can best describe my experience in nearly forty years at sea, I merely say, uneventful. Of course there have been winter gales, and storms and fog and the like. But in all my experience, I have never been in any accident... of any sort worth speaking about. I have seen but one vessel in distress in all my years at sea. I never saw a wreck and never have been wrecked nor was I ever in any predicament that threatened to end in disaster of any sort. • E. J. Smith, 1907, Captain, RMS Titanic (Who ordered the ICE? – E.J. Smith, 1912)

  8. Status Update Swansea, Sep 08

  9. WLCG Growth (September 2008 vs March 2009)

  10. A Magic Moment

  11. Tier-1 Reliability • OPS VO, last 6 months (Sep–Feb): RAL reliability = 98%; target reliability for the best 8 sites = 98% (from Jan); RAL was in the top 5. But this was measured with the OPS VO... • ATLAS VO, last 6 months (Sep–Feb): RAL reliability = 90%; target reliability for the best 8 sites = 98%; RAL was 8th out of 11 sites. But RAL was one of the best sites for both CMS and LHCb.
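
For reference, WLCG site reliability is derived from SAM test results: roughly speaking, availability is the fraction of all time the site's tests pass, while reliability excludes scheduled downtime from the denominator. A rough Python sketch of that calculation with invented hours (the real figures come from the SAM reports, not from this snippet):

```python
# Sketch of the WLCG-style availability/reliability calculation.
# The hours below are invented for illustration; real figures come from SAM tests.

def availability(up_hours: float, total_hours: float) -> float:
    """Fraction of all time the site passed its tests."""
    return up_hours / total_hours


def reliability(up_hours: float, total_hours: float, scheduled_down: float) -> float:
    """Like availability, but scheduled downtime is excluded from the denominator."""
    return up_hours / (total_hours - scheduled_down)


if __name__ == "__main__":
    total = 24.0 * 182               # roughly six months, in hours
    scheduled = 48.0                 # hypothetical scheduled downtime
    up = 0.98 * (total - scheduled)  # hypothetical "tests passing" time

    print(f"availability = {availability(up, total):.3f}")
    print(f"reliability  = {reliability(up, total, scheduled):.3f}")
```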

  12. UK CPU Contribution (6-month plots)

  13. UK Site Contributions 2007(8): NorthGrid 34(22)%; London 28(25)%; ScotGrid 18(17)%; Tier-1 13(15)%; SouthGrid 7(16)%; GridIreland 6.1% (~). [1-year and 6-month plots]

  14. CPU Efficiencies (1-year and 6-month plots)

  15. Storage (doh!)

  16. UK Tier-2 Storage Integrals (08Q4) Pledged: 1500 TB Provided: 2700 TB Used: 420 TB
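
The figures above imply the Tier-2s had deployed roughly 180% of the pledged disk, with only around 16% of the deployed capacity actually in use; a quick arithmetic check in Python:

```python
# Quick sanity check of the Tier-2 storage figures quoted above (08Q4).
pledged_tb = 1500.0
provided_tb = 2700.0
used_tb = 420.0

print(f"provided / pledged = {provided_tb / pledged_tb:.0%}")  # -> 180%
print(f"used / provided    = {used_tb / provided_tb:.0%}")     # -> 16%
```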

  17. Data Transfers

  18. STEP09 (i.e. CCRC09) • Currently, it seems likely that this will be in June. • There may be conflicts with the (much delayed) move to R89. • It raises issues to do with upgrades such as CASTOR 2.1.8.

  19. Current Issues: CASTOR • Current version (CASTOR 2.1.7) appears to function at an acceptable level, though there are a number of serious bugs that we are learning to work around (notably the BigID and CrossTalk problems). • These problems have also been observed elsewhere, which adds pressure for them to be addressed. • CASTOR 2.1.8 is under test at CERN and shortly at RAL. Consensus is that we need to be very cautious in moving to this version, even though it may address some of the 2.1.7 bugs and offer additional features (e.g. Xrootd functionality). • Ultimately, this decision must be driven by the experiments (is a consensus possible?). We strongly prefer not to be the first non-CERN site to upgrade. • Possible conflict with the STEP09 exercise (can we upgrade early enough not to risk participation? Does it make any sense to upgrade after?). • Is there a bigger risk of not upgrading (degrading support for 2.1.7?)

  20. Current Issues: R89 • Plan-A: R89 must be accepted by STFC by 1st May to allow a 2-week migration towards the end of June. • Plan-B (if there is a small delay) is a 1-week migration of critical components only. • Plan-C (if there is a longer delay) is to remain completely in the ATLAS building. • Must balance establishing a stable service for LHC data with the advantages of moving to a better environment. • Other factors are STEP09; Castor-upgrade; Costs; and Convenience. • Hand-over delayed from Dec-22nd 2008 by a number of issues: • Cleanliness (addressed) • Inaudible fire alarms (addressed) • Cooling system (outstanding).

  21. Tier-1 Hardware • The FY2008 hardware procurement is currently waiting to be delivered pending resolution of the R89 situation: • CPU: ~2500 KSI2K to be added to the existing 4590 KSI2K. • DISK: ~1500 TB to be added to the existing 2222 TB. • Tape: up to 2000 TB can be added to the existing 2195 TB. • The FY09 hardware procurement will start as soon as the experiments have determined revised requirements based on the new LHC schedule (i.e. soon).
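
Taken at face value, installing the pending FY2008 procurement would bring the Tier-1 to roughly 7090 KSI2K of CPU, about 3722 TB of disk and up to 4195 TB of tape; a trivial check of those sums:

```python
# Resulting Tier-1 capacity once the pending FY2008 procurement is installed
# (simple sums of the figures quoted above).
cpu_ksi2k = 4590 + 2500   # existing + new CPU
disk_tb = 2222 + 1500     # existing + new disk
tape_tb = 2195 + 2000     # existing + maximum new tape

print(f"CPU : ~{cpu_ksi2k} KSI2K")
print(f"Disk: ~{disk_tb} TB")
print(f"Tape: up to {tape_tb} TB")
```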

  22. Current Issues: EGI/NGI • What follows on from EGEE-III in April 2010? • Current idea is an EGI body (European Grid Infrastructure) coordinating a set of national NGIs, together with a middleware consortium and a set of Specialist Service Centres (e.g. one for HEP). • EGI-DS underway. • Timescales and transition are problematic. • What is the UK NGI? Some evolution of the NGS with components of GridPP? • Timescales and transition are problematic. • Funding is complicated. • Initial step: A joint GridPP/NGS working group to try and identify common services. • See talks by John and Robin on the last afternoon.

  23. Summary • This meeting is about making the most of the window of opportunity before LHC data, to ensure that our Grid services are resilient and our disaster planning is in place, not just at RAL but also at the Tier-2s. • Meanwhile, the UK continues to perform well and make incremental improvements in our delivery to the experiments and the WLCG. • There are, and will continue to be, vexing issues and future uncertainties. We must all keep our eye on the ball. • Remember, it’s not “if” a disaster strikes but “when and where”. LHC Data: TBD!
