Global Grid Operations and their Tools

Hélène Cordier EGEE/WLCG Operations IN2P3 Computing Centre Lyon (France) - helene.cordier@in2p3.fr

Contents
• Grid Operations Issues
• EGEE/WLCG ways of solving these issues
• Use case and daily operations
• WLCG specifics and current developments in Operations
• Future Work

CPU, countries, sites
• 35000 CPU
• 45 countries (31 partner countries)
• 237 sites (131 partner sites)
Ian Bird - OGF/EGEE User Forum - May 9th 2007

Workload
• 98000 jobs/day
• 13000 jobs/day
Ian Bird - OGF/EGEE User Forum - May 9th 2007

Operating a grid
• Middleware deployment – Availability, Functionality and Knowledge Management
• Management of Pre-production – Functionality
• Security issues – Reliability
• Monitoring sites and services – Availability, then Evaluation
• Accounting – Assessment
• Support for end users and sites – Support
• Interoperability: m/w, operations
• Supervising Production: operations responsibility – metrics – Dependability and Sustainability; putting it all together, communications between the various actors – Knowledge Management

Middleware and Certification
The goal is to produce a middleware distribution that can be deployed widely.
Certification testing includes:
• Installation and configuration
• Component (service) functionality
• System testing (trying to emulate real workloads and stress testing)
• Support, analysis, debugging
[Diagram labels: middleware providers (VDT/OSG, OMII-Europe, …); testing & certification; integration; pre-production service; production service; certification activities (CERT, M/W, CERT+M/W); operations]

Pre-production service
• The pre-production service now comprises ~27 sites in 16 countries
• Provides access to some 3000 CPU
• Some sites allow access to their full production batch systems for scale tests
• Sites install and test different configurations and sets of services
• Services may be initially demonstrated in this environment, before further development
• New VOs adapt their applications and gain experience (e.g. DILIGENT)

Use of the infrastructure
• >20k jobs running simultaneously
SA1 - Ian Bird - EGEE-II 1st EU Review - 15-16 May 2007

Operator Dashboard concept
• Many entry points: the operator consults monitoring tools #1 … #n, site information, a mail client and the ticketing system separately
• Single entry point: the dashboard gathers monitoring tools #1 … #n, site information, a mail sender and the ticketing system, and the operator works from the dashboard alone
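The single-entry-point idea can be sketched in a few lines of Python. This is only an illustration: the `Alarm` record, the `build_dashboard` function and the two stand-in tools are hypothetical names, not part of the actual Operations Portal, and each monitoring tool is assumed to be callable and to return its current alarms.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Alarm:
    site: str     # site name as reported by the tool
    source: str   # monitoring tool that raised the alarm
    message: str


# Assumption: each monitoring tool is a function returning its current alarms.
Tool = Callable[[], List[Alarm]]


def build_dashboard(tools: List[Tool]) -> Dict[str, List[Alarm]]:
    """Merge the alarms of every tool into one view, grouped by site."""
    view: Dict[str, List[Alarm]] = {}
    for tool in tools:
        for alarm in tool():
            view.setdefault(alarm.site, []).append(alarm)
    return view


# Two hypothetical tools standing in for "Monitoring tool #1 ... #n".
def tool_one() -> List[Alarm]:
    return [Alarm("SITE-A", "tool1", "job submission failed")]


def tool_two() -> List[Alarm]:
    return [
        Alarm("SITE-A", "tool2", "information system stale"),
        Alarm("SITE-B", "tool2", "storage element unreachable"),
    ]


dashboard = build_dashboard([tool_one, tool_two])
for site in sorted(dashboard):
    print(site, [a.message for a in dashboard[site]])
```

Grouping by site rather than by tool reflects the operator's point of view: one place to look per site, whichever tool detected the problem.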

Daily operations
[Diagram labels: actors (site, user, operator, regional centre); Operations Portal (cic.gridops.org) and integration tools; CIC DB; GOC DB; information on sites; information on VOs; GGUS (user support & ticketing system); monitoring tools; JS; IS; communication tools (broadcast)]

Repository for site information
• Keep a central repository of information on the components of the grid
• Site registry (name, location, contact information, administrator contact, security contact, …)
• Site status (candidate, uncertified, production, suspended, …)
• History of scheduled unavailability of the site
• Grid services operated by the site: computing elements, storage elements, file catalogue services, virtual organization management services, resource brokers, etc.
• Services that sites want to be monitored by the grid operators
• Updating this information is a shared responsibility between the site operator and the federation manager
• WLCG/EGEE
• Central repository of site information (a.k.a. Grid Operations Centre) developed and operated by Rutherford Appleton Laboratory (RAL), UK
• http://goc.grid-support.ac.uk/gridsite/gocdb
• This repository is used by the grid monitoring services (more on this later)
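As an illustration of what one record in such a repository might hold, here is a minimal Python sketch. The class and field names are hypothetical and do not reflect the real GOC DB schema. The rule that only production sites outside a scheduled downtime are monitored is an assumption made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List, Tuple


class SiteStatus(Enum):
    CANDIDATE = "candidate"
    UNCERTIFIED = "uncertified"
    PRODUCTION = "production"
    SUSPENDED = "suspended"


@dataclass
class Service:
    kind: str               # e.g. "computing element", "storage element"
    host: str
    monitored: bool = True  # the site asks for this service to be monitored


@dataclass
class Site:
    name: str
    location: str
    admin_contact: str
    security_contact: str
    status: SiteStatus
    services: List[Service] = field(default_factory=list)
    # History of scheduled unavailability as (start, end) pairs.
    downtimes: List[Tuple[datetime, datetime]] = field(default_factory=list)

    def in_scheduled_downtime(self, when: datetime) -> bool:
        return any(start <= when < end for start, end in self.downtimes)

    def services_to_monitor(self, when: datetime) -> List[Service]:
        # Assumption: skip sites that are not in production or are in downtime.
        if self.status is not SiteStatus.PRODUCTION:
            return []
        if self.in_scheduled_downtime(when):
            return []
        return [s for s in self.services if s.monitored]


site = Site(
    name="EXAMPLE-SITE",
    location="Lyon, France",
    admin_contact="admin@example.org",
    security_contact="security@example.org",
    status=SiteStatus.PRODUCTION,
    services=[
        Service("computing element", "ce.example.org"),
        Service("storage element", "se.example.org", monitored=False),
    ],
    downtimes=[(datetime(2007, 6, 1, 8), datetime(2007, 6, 1, 12))],
)
```

Keeping the downtime history next to the service list lets a monitoring service decide, from the repository alone, whether a failed test should raise an alarm.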

Monitoring
• Grid operators need to have a global view of the status of the infrastructure
• Grid information is highly dynamic
• Tools required to collect information on the grid component state
• Availability of resources and services, based on the static information stored in the central site repository
• Collection of metrics on availability of resources and services
• WLCG/EGEE
• Service of probes sent to every site to check it on a regular basis
• Service for regularly testing the consistency of the dynamic information published by the site in the grid information system
• Information on the result of those tests is available to grid operators, site managers and end-users
• Virtual Organization managers can use this information to select a set of sites they intend to use
• Monitoring services developed and operated by CERN, Academia Sinica (Taiwan), GridPP (UK) and INFN (Italy)
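A simple way to turn regular probe results into an availability metric, and to use it for site selection, is sketched below in Python. The definition used here (fraction of probes passed) and the threshold are assumptions for illustration; they are not the metric computed by the WLCG/EGEE monitoring services.

```python
from typing import Dict, List


def availability(results: List[bool]) -> float:
    """Fraction of probes that passed; 0.0 when no probe has run yet."""
    if not results:
        return 0.0
    return sum(results) / len(results)


def select_sites(history: Dict[str, List[bool]], threshold: float) -> List[str]:
    """Sites whose availability reaches the threshold, in alphabetical order."""
    return sorted(
        site for site, results in history.items()
        if availability(results) >= threshold
    )


# Hypothetical probe history: True means the probe passed.
history = {
    "SITE-A": [True, True, True, False],
    "SITE-B": [True, False, False, False],
    "SITE-C": [],
}

print(select_sites(history, threshold=0.7))
```

A Virtual Organization manager could apply the same kind of filter with a threshold suited to the application, since the test results are visible to operators, site managers and end users alike.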

Tickets workflow
[Diagram labels: Operations Portal with dashboard (IN2P3-CC, Lyon, France); GGUS (FZK, Karlsruhe, Germany); WSDL interfaces between the two; operator on duty; end user; problem detection & reporting; ticket follow-up; Regional Support Units (UK, FR, GER, IT, …)]

Operations – Global Grid User Support

Tracking incidents
• Incident tracking model
• Unique channel for opening tickets
• End users: e.g. job submission failures, failed data transfers
• Operators: e.g. job submission failures
• Classification and first assignment done by the ticket process manager
• Tickets are assigned to support units, one per domain of expertise
• Grid operators, applications, federations, m/w experts, …
• WLCG/EGEE
• Central incident tracking tool developed/operated by Forschungszentrum Karlsruhe (DE)
• https://gus.fzk.de/
• Same tool used by grid operators and end users
• E-mail and web interface
• Sites failing the tests are assigned a ticket
• Escalation procedure for solving site-related problems
• Involves the regional operator and the site operator
• Interface with ticket handling tools used by sites/federations (if needed)
• Tools for collecting metrics on the responsiveness of support units
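The classification and first-assignment step can be pictured as a lookup from a category to a support unit. The Python sketch below is hypothetical: the category keys, the `Ticket` fields and the `assign` function are invented for illustration and are not the GGUS data model or API.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: one support unit per domain of expertise, keyed by a category
# chosen by the ticket process manager.
SUPPORT_UNITS = {
    "operations": "Grid operators",
    "application": "Applications",
    "site": "Federations",
    "middleware": "Middleware experts",
}


@dataclass
class Ticket:
    ticket_id: int
    submitter: str                 # "end-user" or "operator"
    description: str
    category: Optional[str] = None
    support_unit: Optional[str] = None


def assign(ticket: Ticket, category: str) -> Ticket:
    """Classify the ticket and route it to the matching support unit."""
    if category not in SUPPORT_UNITS:
        raise ValueError(f"unknown category: {category}")
    ticket.category = category
    ticket.support_unit = SUPPORT_UNITS[category]
    return ticket


ticket = assign(Ticket(1, "end-user", "job submission failure"), "middleware")
print(ticket.support_unit)
```

Because end users and operators open tickets through the same channel, the same routing step applies whoever the submitter is.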

Putting it all together
• Web portal for integrating all the tools and sources of operations-related information in one single place
• Developed and operated by CC-IN2P3, failover instance at CNAF
• http://cic.gridops.org/
• Provides and maintains an integrated operations dashboard for the grid operator on duty
• Provides mechanisms for keeping the information needed for an appropriate handover between operators on duty
• Easy access to appropriate contact information for every actor involved in the operations of the grid
• Provides communication tools

Alarms Dashboard

Alarm Details

Service Interruptions

Tracking incidents

Opening tickets

GGUS Ticket 1/2

GGUS Ticket 2/2

Operations support model: operators' escalation process
• 1st level support: operator on duty
• Monitoring shows a problem
• The operator submits a GGUS ticket against the site's federation and CCs the site (when known)
• 2nd level support: federation and site
• Federation and site work to resolve the problem
• 3rd level support: Support Unit (experts)
• If the federation and site cannot resolve the problem, the Tier1/ROC contacts the relevant Support Unit for assistance
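The three support levels form a small escalation chain, which can be sketched in Python as follows. The `Level` enumeration and the `escalate` function are hypothetical names, and the model is deliberately simplified: a ticket moves up one level whenever the current level cannot resolve it, and stays with the experts once it reaches the third level.

```python
from enum import IntEnum


class Level(IntEnum):
    OPERATOR_ON_DUTY = 1     # detects the problem and opens the ticket
    FEDERATION_AND_SITE = 2  # federation and site work on the problem
    SUPPORT_UNIT = 3         # experts contacted by the Tier1/ROC


def escalate(level: Level, resolved: bool) -> Level:
    """Return the level that handles the ticket next."""
    if resolved or level is Level.SUPPORT_UNIT:
        return level  # nothing above the experts; a solved ticket stays put
    return Level(level + 1)


# Walk an unresolved ticket through the chain.
level = Level.OPERATOR_ON_DUTY
path = [level.name]
while level is not Level.SUPPORT_UNIT:
    level = escalate(level, resolved=False)
    path.append(level.name)
print(" -> ".join(path))
```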

Operations tickets vs. all GGUS tickets
• 25% of all GGUS tickets over almost 2 years
• Average of 200 tickets/month
• ENOC tickets since August 2006

ROC average solution time for GGUS tickets
• ROCs are attentive to operational tickets

Current Work and Summary
• Achieve a real 24x7 production-quality service: failover mechanisms
• Increase the automation of daily monitoring tools and of alarm handling
• Achieve a sustainable structure through WLCG production
• Achieve a scalable structure as the number of sites and the diversity of users keep increasing
• Diverse monitoring tools are developed throughout the federations because a grid cannot stand on its own; the causes of failure are numerous
• Site administrators and end users need to be able to check that services are available and reliable

Credits and References
• Gstat: http://goc.grid.sinica.edu.tw/gstat/
• GGUS: http://gus.fzk.de/
• GOC-DB: http://goc.grid-support.ac.uk/
• SAM: http://goc.grid.sinica.edu.tw/gocwiki/Service_Availability_Monitoring_Environment and https://WLCG-sam.cern.ch:8443/sam/sam.cgi
• CMS Dashboard: http://arda-dashboard.cern.ch/cms
• GridIce: http://grid.infn.it/gridice
• Lavoisier: http://grid.in2p3.fr/lavoisier
• Operations Portal: http://cic.gridops.org
• EGEE: http://www.eu-egee.org
• WLCG: http://www.cern.ch/WLCG
Numerous slides from:
• Ian Bird - OGF/EGEE User Forum - May 9th 2007
• Rob Quick, Workshop on Grid services Monitoring, HPDC’07 – June 27th 2007
