This document outlines the alarm management procedures utilized in computer centers. Alarms are monitored 24/7 by operators and indicate necessary actions, ranging from logging events to summoning experts at odd hours. A structured workflow ensures that alarms are logged and addressed according to predefined procedures. Operators follow specific steps to identify appropriate actions, access necessary tools, and escalate issues when needed. The guidelines emphasize collaboration between Service Managers and SysAdmins, providing a framework for effective problem resolution and management of alarms.
Alarms in CC: a brief overview of alarm management in the computer centre
Introduction • All alarms of the computer centre are presented to the Operators 24/7 • An alarm means that an action is required • The action can range from simply logging the event to calling out an expert at 2:00 AM
Computers’ alarms workflow (diagram; labels: Contract type E, if not administered by SysAdmins, Contract type D)
A typical use case • Log each event (alarm) • Do not analyze the situation • Apply procedures! • Step by step • Tools must be provided, access granted • Fix and/or escalate problems • Allowed to call people outside working hours • Diagram labels: LAS, OPM, SDB
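The use case above can be read as a small loop: log the alarm, find its procedure, apply the steps in order, and escalate if nothing fixes the problem. The following is a minimal Python sketch of that loop, not the actual LAS/OPM/SDB implementation. All names (`Alarm`, `Outcome`, `handle_alarm`, the `procedures` mapping) are hypothetical, and escalating when no procedure exists is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Alarm:
    node: str   # machine that raised the alarm
    name: str   # alarm identifier (hypothetical naming)


@dataclass
class Outcome:
    log: List[str] = field(default_factory=list)
    escalated: bool = False


# A procedure is an ordered list of steps; a step returns True once the problem is fixed.
Procedure = List[Callable[[Alarm], bool]]


def handle_alarm(alarm: Alarm, procedures: Dict[str, Procedure]) -> Outcome:
    outcome = Outcome()
    # Log each event first, before anything else happens.
    outcome.log.append(f"alarm {alarm.name} on {alarm.node}")

    steps = procedures.get(alarm.name)
    if steps is None:
        # Assumption: with no procedure to apply, the operator escalates.
        outcome.log.append("no procedure found, escalating")
        outcome.escalated = True
        return outcome

    # Apply the procedure step by step, without analysing the situation.
    for number, step in enumerate(steps, start=1):
        fixed = step(alarm)
        outcome.log.append(f"step {number}: {'fixed' if fixed else 'not fixed'}")
        if fixed:
            return outcome

    outcome.log.append("procedure exhausted, escalating")
    outcome.escalated = True
    return outcome
```

Calling `handle_alarm(Alarm("node1", "high_load"), procedures)` returns an `Outcome` whose `log` records each step and whose `escalated` flag tells whether someone has to be called.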
Use case: steps 1 & 2 • OPM (web server) returns a ranked list of matching procedures • The Operator selects the most appropriate one • LAS is a web-based GUI
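The slides do not say how OPM ranks matching procedures. As an illustration only, the sketch below assumes a simple rule: a procedure written for the alarming node ranks above a generic one for the same alarm. `ProcedureEntry`, `rank_procedures` and the scoring values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ProcedureEntry:
    title: str
    alarm: str               # alarm name this entry covers
    nodes: Tuple[str, ...]   # nodes it applies to; empty means "any node" (assumption)


def rank_procedures(alarm: str, node: str,
                    catalogue: List[ProcedureEntry]) -> List[ProcedureEntry]:
    scored = []
    for entry in catalogue:
        if entry.alarm != alarm:
            continue                 # different alarm: not a match
        if node in entry.nodes:
            score = 2                # written for this node
        elif not entry.nodes:
            score = 1                # generic procedure for this alarm
        else:
            continue                 # written for other nodes only
        scored.append((score, entry))
    # Highest score first; the sort is stable, so ties keep catalogue order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored]
```

The operator would still pick from the returned list; the ranking only orders the candidates.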
Use case: steps 3 & 4 • SDB (Service DB) lists Service Managers • Use the short URL provided at the bottom of each page to reference them in procedures! • Procedure content • List of nodes (applies to) • One entry per alarm • Commands to type are highlighted • Support links to SDB
Service Managers’ Controls • Not covered • Do-it-yourself (tuning, corrective actions, etc.) • We pay for a number of alarms per month • Assistance needed (out of working hours, h/w faults, etc.)
Providing procedures • In general: • Only Service Managers know what to do if anything goes wrong on their service(s) • Simple or urgent actions → Operators • e.g. reboot machine, take it out of production, ... • More complex solutions → SysAdmins • e.g. regenerate certificates, look in log files, ... • Different Service Managers: • Application SM: service related procedures • Infrastructure SM: machine related procedures
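The division of work on this slide can be written down as two small rules: simple or urgent actions go to the Operators and more complex ones to the SysAdmins, while service related procedures come from the Application SM and machine related ones from the Infrastructure SM. The Python sketch below is only an illustration; the enum and function names are hypothetical, and reducing each decision to boolean flags is a simplifying assumption.

```python
from enum import Enum


class Handler(Enum):
    OPERATORS = "Operators"
    SYSADMINS = "SysAdmins"


class ProcedureOwner(Enum):
    APPLICATION_SM = "Application SM"
    INFRASTRUCTURE_SM = "Infrastructure SM"


def choose_handler(simple: bool, urgent: bool) -> Handler:
    """Simple or urgent actions go to Operators, everything else to SysAdmins."""
    return Handler.OPERATORS if (simple or urgent) else Handler.SYSADMINS


def procedure_owner(service_related: bool) -> ProcedureOwner:
    """Service related procedures belong to the Application SM,
    machine related ones to the Infrastructure SM."""
    if service_related:
        return ProcedureOwner.APPLICATION_SM
    return ProcedureOwner.INFRASTRUCTURE_SM
```

For example, a reboot would be passed as `simple=True` and end up with the Operators, while regenerating certificates would be passed as neither simple nor urgent and end up with the SysAdmins.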
Providing procedures: GUI • http://cern.ch/service-cc-opm/ (demo) • Hints / restrictions: • Quick help for the impatient (6 steps) • Start from the proposed template (Operators) • Save locally, edit, upload the new procedure • Validation! • Further edits can be done on-line (IE vs FF)
Useful links • Service Managers Guidelines • http://cern.ch/it-div-fio-sao/guides/SM_guidelines.htm • Lemon alarm system • http://cern.ch/lemon-status/ • Using the SysAdmin Service • http://cern.ch/service-cc-sysadmin/SM_guidelines.htm