EGEE Operation Procedures
This document outlines the operational procedures for the ROC-on-Duty, focusing on the monitoring and management of EGEE Grid operations. It covers the responsibilities of operators, weekly meetings, tools used for monitoring (such as SAM and gstat), ticket management processes, and escalation procedures. The tutorial emphasizes the coordination between different ROC centers and how to effectively respond to incidents within the grid infrastructure. Additionally, it provides guidance for creating and managing tickets and offers insights into best practices for site operators.
Presentation Transcript
EGEE Operation Procedures
Alexandre Duarte, CERN IT-GD-OPS
COD
• COD is Operator on Duty
  • global LCG/EGEE GRID monitoring
  • 1 (2) ROCs responsible for the whole GRID operations at a time
  • 12 ROCs involved
  • weekly rotation
• weekly WLCG-OSG-EGEE Operations meeting
  • ROCs, Tier1, experiments
  • all sites invited
Lost Island, First EELA ROC-on-Duty Tutorial, 29.11.2006
COD Procedures
• https://twiki.cern.ch/twiki/bin/view/EGEE/EGEEROperationalProcedures
• Look at the monitoring tools
  • SAM, gstat, Certificate Monitoring pages
• Open tickets using the COD Dashboard
• Escalate expired tickets
• Process site responses (update tickets accordingly)
• End of duty: hand-over notes
COD Dashboard
• summary of necessary monitoring information + tools for ticket processing
  • tickets linked to GGUS
  • GOCDB information
  • SAM + gstat results
  • ticket creation and management tool
  • tools for related e-mail
COD Dashboard
Escalation Procedure
• defines the steps to be taken during the lifetime of a ticket
• available on the CIC Operations Portal
  • (https://edms.cern.ch/document/701575)
• distinction between sites depending on the amount of resources
Escalation Steps
1. ticket creation
2. first mail (to: site + ROC)
3. second mail (to: site + ROC)
4. suspension from the GRID
• before 4.:
  • mail to ROC
  • weekly operations meeting call
  • mail to OMC for validation
Escalation Procedure
• site categories
  • low: CPU < 20
  • normal: 20 < CPU < 100
  • high: 100 < CPU
• time allowed between steps 2. and 3., and between steps 3. and 4.
  • low + normal: 3 days
  • high: 1 day
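The categories and deadlines above can be sketched as a small lookup. This is only an illustration, not part of the COD tooling: the function names are hypothetical, the slide does not say which category a site with exactly 20 or 100 CPUs belongs to (the sketch assumes the higher one), and "days" is assumed to mean calendar days.

```python
from datetime import date, timedelta


def site_category(cpu_count: int) -> str:
    """Classify a site by its CPU count.

    Assumption: sites with exactly 20 or 100 CPUs fall into the
    higher category; the slide leaves these boundaries open.
    """
    if cpu_count < 20:
        return "low"
    if cpu_count < 100:
        return "normal"
    return "high"


# Days allowed between escalation steps 2 -> 3 and 3 -> 4, per site category.
DAYS_BETWEEN_STEPS = {"low": 3, "normal": 3, "high": 1}


def next_deadline(cpu_count: int, step_date: date) -> date:
    """Return the date on which the next escalation step becomes due.

    Assumption: the deadline is counted in calendar days.
    """
    category = site_category(cpu_count)
    return step_date + timedelta(days=DAYS_BETWEEN_STEPS[category])


print(site_category(50))                       # normal
print(next_deadline(250, date(2006, 11, 29)))  # 2006-11-30
```

A site with 250 CPUs counts as "high", so a mail sent on 29.11.2006 would be followed up one day later, whereas a small site would get three days.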
Escalation Procedure
[Flowchart; only its labels survive in the transcript. Nodes: Create ticket; When deadline reached; Problem solved?; Close ticket; site responds; Extend deadline; last escalation?; Escalate; Suspend site. Edge labels: mail (three times), yes, no, no.]
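The arrows of the flowchart are not preserved here, so the decision taken when a ticket deadline is reached can only be sketched under assumptions. The following is one plausible reading of the labels: a solved problem closes the ticket, a site response extends the deadline, and otherwise the ticket is escalated until the last escalation step leads to suspension. The names `Action` and `action_at_deadline` are hypothetical, and the order of the checks is an assumption.

```python
from enum import Enum


class Action(Enum):
    CLOSE_TICKET = "close ticket"
    EXTEND_DEADLINE = "extend deadline"
    ESCALATE = "escalate (send mail)"
    SUSPEND_SITE = "suspend site (send mail)"


def action_at_deadline(problem_solved: bool,
                       site_responded: bool,
                       last_escalation: bool) -> Action:
    """Pick the next action when a ticket deadline is reached.

    Assumed reading of the flowchart labels, not a confirmed rule set.
    """
    if problem_solved:
        return Action.CLOSE_TICKET
    if site_responded:
        # Assumption: a responding site is given more time.
        return Action.EXTEND_DEADLINE
    if last_escalation:
        # No escalation steps are left, so the site is suspended.
        return Action.SUSPEND_SITE
    return Action.ESCALATE


print(action_at_deadline(False, False, False).value)
```

Keeping the decision in a single pure function makes it easy to adjust if the actual procedure document orders the checks differently.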
What a site should do
• Look at the monitoring tools (SAM)
  • try to notice & fix failures before the CODs
• COD notification about a failure
  • fix it ASAP
• Scheduled downtime
  • announce it in advance
  • announce when it's finished
• problems → contact the ROC
  • best way: create a ticket
• question → ask the ROC