This presentation reviews the incident that occurred at INFN-T1 on March 9, when a cooling problem led to a complete shutdown of the center. We discuss our on-call procedures, the response timeline, and the challenges faced during recovery. Key lessons learned include the need for clearer emergency procedures, better monitoring of critical systems, and improved communication among on-call staff. The incident highlighted the importance of preparedness and of a systematic approach to handling cooling failures in high-demand environments.
Lesson learned after our recent cooling problem
Michele Onofri, Stefano Zani, Andrea Chierici
HEPiX Spring 2014
Outline
• INFN-T1 on-call procedure
• Incident
• Recovery procedure
• What we learned
• Conclusions
On-call service
• CNAF staff are on-call on a weekly basis
  • each person serves 2 or 3 times per year
  • must live within 30 minutes of CNAF
• A service phone receives alarm SMSes
• Periodic training on security and intervention procedures
• 3 incidents in the last three years
  • only this last one required the site to be powered off completely
Service Dashboard
What happened on the 9th of March
• 01:08: fire alarm
  • the on-call person intervenes and calls the firefighters
• 02:45: fire extinguished
• 03:18: high-temperature warning
  • air conditioning blocked
  • the on-call person calls for help
• 04:40: decision is taken to shut down the center
• 12:00: chiller under maintenance
• 17:00: chiller fixed, center can be turned back on
• 21:00: farm back on-line, waiting for storage
10th of March
• 09:00: support call to switch storage back on
• 18:00: center open again for LHC experiments
• Next day: center fully open again
Chiller power supply
Incident representation (diagram): two control-system power supplies (Ctrl sys Pow 1 and Ctrl sys Pow 2) feeding the control logic of the six chillers (Chiller 1-6), plus the control system head.
Incident examination
• 6 chillers serve the computing room
• 5 share the same power supply for the control logic (we did not know that!)
• A fire in one of the control-logic units cut power to 5 chillers out of 6
• 1 chiller was still working and we were not aware of that!
• Could we have avoided turning the whole center off? Probably not! But a controlled shutdown could have been done.
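(Not in the original slides.) A minimal sketch of how this shared dependency could have been surfaced before the incident: record which control-power feed each chiller's control logic depends on, and flag any feed shared by more than one chiller. All names below are illustrative placeholders, not real INFN-T1 asset names.

```python
# Hypothetical inventory: which control-logic power feed each chiller depends on.
# Labels are illustrative, not the real INFN-T1 asset names.
chiller_control_power = {
    "chiller-1": "ctrl-pow-1",
    "chiller-2": "ctrl-pow-1",
    "chiller-3": "ctrl-pow-1",
    "chiller-4": "ctrl-pow-1",
    "chiller-5": "ctrl-pow-1",
    "chiller-6": "ctrl-pow-2",
}

def single_points_of_failure(mapping, threshold=1):
    """Return power feeds whose loss would stop more than `threshold` chillers."""
    by_feed = {}
    for chiller, feed in mapping.items():
        by_feed.setdefault(feed, []).append(chiller)
    return {feed: chillers for feed, chillers in by_feed.items()
            if len(chillers) > threshold}

if __name__ == "__main__":
    for feed, chillers in single_points_of_failure(chiller_control_power).items():
        print(f"WARNING: losing {feed} stops {len(chillers)} chillers: "
              f"{', '.join(sorted(chillers))}")
```

Run against the (hypothetical) inventory above, this prints a warning for ctrl-pow-1, which is exactly the dependency that was discovered only during the incident.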
Facility monitoring app
Chiller n. 4 (monitoring plot). Black: electric power in (kW); blue: water temperature in (°C); yellow: water temperature out (°C); cyan: chiller room temperature (°C).
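(Not in the original slides.) The same four quantities lend themselves to a simple "is this chiller actually running?" check, which is the information the on-call person was missing during the incident. The thresholds and the shape of the input sample below are assumptions for illustration only; real limits depend on the plant.

```python
# Illustrative thresholds only; real limits depend on the plant.
POWER_RUNNING_KW = 5.0   # below this, assume the chiller is not actually cooling
WATER_OUT_MAX_C = 18.0
ROOM_TEMP_MAX_C = 30.0

def check_chiller(sample):
    """sample: dict with the four quantities shown in the facility monitoring plot."""
    alarms = []
    if sample["power_in_kw"] < POWER_RUNNING_KW:
        alarms.append("chiller appears to be off (low power draw)")
    if sample["water_temp_out_c"] >= sample["water_temp_in_c"]:
        alarms.append("no cooling across the chiller (outlet not cooler than inlet)")
    if sample["water_temp_out_c"] > WATER_OUT_MAX_C:
        alarms.append("outlet water temperature too high")
    if sample["room_temp_c"] > ROOM_TEMP_MAX_C:
        alarms.append("chiller room temperature too high")
    return alarms

# Example with a made-up sample resembling a stopped chiller:
print(check_chiller({"power_in_kw": 1.2, "water_temp_in_c": 14.0,
                     "water_temp_out_c": 21.5, "room_temp_c": 33.0}))
```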
Incident seen from inside
Incident seen from outside
Recovery procedure
• Facility: support call for an emergency intervention on the chillers
  • recovered the burned bus and control logic n. 4
• Storage: support call
• Farming: took the chance to apply all security patches and the latest kernel to the nodes
• Switch-on order: LSF server, CEs, UIs
• For a moment we were thinking about upgrading to LSF 9
Failures (1)
• Old WNs
  • BIOS battery exhausted, configuration reset
    • PXE boot, hyper-threading, disk configuration (AHCI)
  • lost IPMI configuration (30% broken; see the sketch below)
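(Not in the original slides.) A minimal sketch of the kind of re-provisioning that helps after a BMC/BIOS reset: re-applying static IPMI LAN settings in-band with ipmitool, driven from an inventory. The node names, addresses and channel number are assumptions; a real procedure would pull them from the site's provisioning database.

```python
import subprocess

# Hypothetical per-node network data; in reality this would come from the
# site's inventory/provisioning database.
NODES = {
    "wn-001": {"ipaddr": "10.0.1.1", "netmask": "255.255.0.0", "gateway": "10.0.0.254"},
}

def restore_ipmi_lan(cfg, channel=1):
    """Re-apply static IPMI LAN settings in-band with ipmitool (run on the node itself)."""
    cmds = [
        ["ipmitool", "lan", "set", str(channel), "ipsrc", "static"],
        ["ipmitool", "lan", "set", str(channel), "ipaddr", cfg["ipaddr"]],
        ["ipmitool", "lan", "set", str(channel), "netmask", cfg["netmask"]],
        ["ipmitool", "lan", "set", str(channel), "defgw", "ipaddr", cfg["gateway"]],
    ]
    for cmd in cmds:
        subprocess.run(cmd, check=True)
    # Print the resulting configuration so the operator can verify it.
    subprocess.run(["ipmitool", "lan", "print", str(channel)], check=True)

if __name__ == "__main__":
    restore_ipmi_lan(NODES["wn-001"])
```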
Failures (2)
• Some storage controllers were replaced
• 1% of PCI cards (mainly 10 Gbit network) replaced
• Disks, power supplies and network switches were almost undamaged
We fixed our weak point (diagram): each of the six chillers (Chiller 1-6) now has its own control-system power supply (Ctrl sys Pow 1-6), plus the control system head.
We lack an emergency button
• Shutting the center down is not easy: a real "emergency shutdown" procedure is missing
• We could have avoided switching off the whole center if we had had more control
• Depending on the incident, some services may be left on-line
• The person on-call can't know all the site details
Hosted services
• Our computing room hosts services and nodes outside our direct supervision, over which it is difficult to have full control
• We need an emergency procedure for those too
• We need a better understanding of the SLAs
We benchmarked ourselves
• It took 2 days to get the center back on-line
  • less than one day to reopen for the LHC experiments
  • everyone knew what to do
• All worker nodes rebooted with a solid configuration
• A few nodes were reinstalled and put back on-line in a few minutes
Lesson learned
• We must have clearer evidence of which chillers are working at any given moment (the on-call person does not have it right now)
  • the new dashboard appears to be the right place for it
• We created a task force to implement a controlled shutdown procedure
  • establish a shutdown order: WNs switched off first, then disk-servers, grid and non-grid services, bastions, and finally network switches (a sketch follows below)
• In case of emergency, the on-call person is required to take a difficult decision
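(Not in the original slides, and not the task force's actual procedure.) A minimal sketch of a controlled shutdown driven by the order given above. The host names, the use of ssh, and the host-level shutdown command are assumptions; network switches are left out of this host-level sketch since they would typically be powered off last via PDU or console.

```python
import subprocess

# Shutdown order from the slide: WNs first, then disk-servers, grid and
# non-grid services, bastions, and finally network switches.
# Host lists here are placeholders; a real procedure would pull them from
# the site inventory.
SHUTDOWN_ORDER = [
    ("worker nodes",      ["wn-001", "wn-002"]),
    ("disk-servers",      ["ds-001"]),
    ("grid services",     ["ce-01", "lsf-master"]),
    ("non-grid services", ["ui-01"]),
    ("bastions",          ["bastion-01"]),
]

def controlled_shutdown(order, dry_run=True):
    """Shut down host groups in order; with dry_run=True only print the plan."""
    for group, hosts in order:
        for host in hosts:
            cmd = ["ssh", host, "shutdown", "-h", "now"]
            if dry_run:
                print("would run:", " ".join(cmd))
            else:
                # check=False: keep going even if one host is already down.
                subprocess.run(cmd, check=False)
        print(f"group '{group}' done")

if __name__ == "__main__":
    controlled_shutdown(SHUTDOWN_ORDER, dry_run=True)
```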
Testing the shutdown procedure
• The shutdown procedure we are implementing can't be easily tested
• How to perform a "simulation"? (one cheap option is sketched below)
• It doesn't seem right to switch the center off just to prove we can do it safely
• How do other sites address this?
• Should periodic BIOS battery replacements be scheduled?
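(Again not from the original slides.) One cheap way to exercise the procedure without powering anything off is to test the plan itself rather than the hosts: run the dry-run mode of the sketch above, and check the plan against the inventory for hosts that are missing or listed twice. The inventory list below is hypothetical and reuses the SHUTDOWN_ORDER structure from the previous example.

```python
def validate_plan(order, inventory):
    """Cheap checks that can run any time without switching anything off."""
    planned = [host for _, hosts in order for host in hosts]
    missing = set(inventory) - set(planned)          # hosts with no shutdown step
    duplicated = {h for h in planned if planned.count(h) > 1}
    return missing, duplicated

# Hypothetical inventory list for illustration.
missing, duplicated = validate_plan(SHUTDOWN_ORDER,
                                    ["wn-001", "wn-002", "ds-001", "ce-01"])
print("not covered by the plan:", missing)
print("listed more than once:", duplicated)
```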