
Grid Operations SA1 Status Report

Steven Newhouse (substituting for Maite Barroso Lopez), EGEE-III Final Review, 23-24 June 2010





Presentation Transcript


1. Grid Operations SA1 Status Report
Steven Newhouse (substituting for Maite Barroso Lopez)
EGEE-III Final Review, 23-24 June 2010

2. SA1 Activity Overview
28 countries, 175 FTE

3. Grid Operations
Operate the EGEE production infrastructure, providing a high-quality service to all the application groups.
Enabled through support structures including:
• Regional Operation Centres (ROCs)
• Global Grid User Support (GGUS)
• Release and deployment management
• Grid security coordination
• Operational procedures and tools
Structural changes needed in order to transition to a sustainable model.

4. Size of the infrastructure
Number of EGEE-III certified sites (April 2010): 317 (in 52 countries)
Computing resources, doubled since last year (see the worked example below):
• 243,020 cores (139,000 last year)
• 500 MSI2k (212 MSI2k last year)
  (MSI2k = CPU time normalised to a reference value of Mega SPECint 2000)
Storage resources:
• 40 PB disk, 61 PB tape (25 PB disk, 38 PB tape last year)
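
To make the MSI2k figure concrete, here is a minimal sketch of how raw core counts can be turned into a normalised capacity number. The site list and per-core kSI2k ratings are invented for illustration; the slide's own numbers imply roughly 2 kSI2k per core on average.

```python
# Illustrative only: the sites and per-core kSI2k ratings below are invented.
# The slide's figures imply about 500,000 kSI2k / 243,020 cores ~ 2 kSI2k per core.

def total_msi2k(sites):
    """Aggregate normalised capacity in MSI2k (1 MSI2k = 1000 kSI2k)."""
    return sum(cores * ksi2k_per_core for cores, ksi2k_per_core in sites) / 1000.0

# (number of cores, kSI2k rating per core) for two hypothetical sites
sites = [(10000, 2.2), (4000, 1.8)]
print(f"{total_msi2k(sites):.1f} MSI2k")   # -> 29.2 MSI2k
```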

5. Usage of the infrastructure (I)
CPU utilization over time: consistent doubling every 12-18 months.
Usage from non-LHC communities is stable.
LHC usage is growing, with real activity: data processing, simulations and analysis.

6. Usage of the infrastructure (II)
Number of jobs:
• Average of 15 million jobs per month (14M last year)
• 525,000 jobs/day (480,000 jobs/day last year)
• Non-LHC VOs with a stable workload
• LHC running increasingly high workloads
  • Anticipate millions of jobs per day soon

7. Site availability and reliability (I)
No changes to the algorithm with respect to last year:
• A service is available if any of its service instances is available (logical OR)
• A site is available if all services are available (logical AND of all service types); see the sketch after this slide
EGEE League Tables (monthly reports) published and followed up with the ROCs
• Also available for the top 10 VOs
Underperforming sites are requested to report justifications, collected in a wiki page:
• https://twiki.cern.ch/twiki/bin/view/EGEE/MonthlyAvailability
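
A minimal sketch of the aggregation rule stated above (not the production SAM/Nagios code): a service type counts as available if any of its instances is available, and the site counts as available only if every service type is. The service names and states are example data.

```python
# OR across instances of a service type, AND across service types for the site.

def service_available(instance_states):
    """instance_states: list of booleans, one per instance of a service type."""
    return any(instance_states)

def site_available(services):
    """services: dict mapping service type -> list of instance states."""
    return all(service_available(states) for states in services.values())

site = {
    "CE":    [True, False],  # one of two compute elements up -> CE available
    "SE":    [True],         # storage element up
    "sBDII": [False],        # site BDII down -> whole site counted unavailable
}
print(site_available(site))  # False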

8. EGEE League Tables, April 2010
Most regions perform well above the targets (70% availability and 75% reliability).
April had the highest average availability of the last year: 91%.
New ROCs started with lower availability/reliability levels than experienced ones and took a few months to reach the targets.

9. Site availability and reliability (II)
Sites with availability <50% for three consecutive months are removed from the production infrastructure
• 7 sites removed, from AsiaPacific and Russia
Simplification of the downtime declaration rules, to enforce scheduled downtimes
• Unscheduled downtimes have a big impact on site availability, and on users! (see the sketch below for how the two metrics treat scheduled downtime)
• The ratio of scheduled to unscheduled downtimes doubled between May-09 and Apr-10
• Sites are doing a better job of following the procedures and pro-actively informing their user communities of planned maintenance
Most regions have a good performance, well above targets:
• 70% availability
• 75% reliability
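
A hedged sketch of the usual EGEE/WLCG-style metric definitions, showing why scheduled downtimes matter: a properly declared downtime still lowers availability but is excluded from reliability, whereas an unscheduled outage lowers both. The exact treatment of "unknown" monitoring periods in the monthly reports may differ; the numbers below are invented.

```python
# availability = uptime / total_time
# reliability  = uptime / (total_time - scheduled_downtime)

def availability(uptime_h, total_h):
    return uptime_h / total_h

def reliability(uptime_h, total_h, scheduled_downtime_h):
    return uptime_h / (total_h - scheduled_downtime_h)

total = 720.0       # hours in a 30-day month
scheduled = 24.0    # one day of announced maintenance
up = 660.0          # hours the site passed the monitoring tests
print(f"availability: {availability(up, total):.0%}")            # 92%
print(f"reliability:  {reliability(up, total, scheduled):.0%}")  # 95%
```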

10. Site availability seen by VOs
Views from the WLCG dashboards
• Built on top of VO-specific grid monitoring probe results
• Using the same grid monitoring framework

11. Site availability seen by VOs
Views from the Complex Systems VO monitoring
• Built on top of VO-specific grid monitoring probe results
• Using the same grid monitoring framework

12. Release and deployment management
Move to staged rollout as a transition to the EGI model
• Controlled deployment of middleware updates into production
• Get early feedback on the release under a production load
• Done by a few sites on behalf of the community (to avoid duplication)
• All ROCs agree with the concept, however few sites are involved so far
  • Requires a commitment to upgrade quickly in production
  • As it is in production, it might affect availability and reliability
Process for retiring old versions of middleware services
• Prototyped with the Resource Brokers
• Being implemented with gLite 3.1 services (replaced by gLite 3.2)
• Big/medium sites are able to follow new versions
• Small ones are slow, with different issues, e.g. obsolete hardware not running SL5
• Sites per software version: 48% gLite 3.2, 52% gLite 3.1
List of gLite service versions supported by SA1 Operations:
https://twiki.cern.ch/twiki/bin/view/EGEE/SupportedServiceVersions

13. Distributed Operational Security
Strong expertise in incident handling
• Defined timelines for response and resolution
• Security drills to educate and enforce procedures
Main source of incidents: failure to apply security patches (a toy patch-check sketch follows the slide)
• First alert in August 2009; EGEE completely patched after 2.5 months
  • First time a security patch was enforced at the sites
  • Escalated to the EGEE-III PMB and the WLCG Management Board
  • 60+ sites warned about suspension within 7 days, only 4 suspended
• Second alert in November 2009; EGEE patched after 1 month
  • 30+ sites warned about suspension, none suspended
Effective monitoring of security patching
• Security monitoring tools developed and deployed to identify security weaknesses at sites and prevent incidents
Extensive connections outside the project
• Build links between ROC security contacts and local NRENs
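
An illustrative toy sketch of the kind of patch-level check behind the enforcement described above. The package names, versions and hostnames are invented; a real tool would use rpm's own version ordering rather than plain string comparison.

```python
# Compare a site's reported package inventory against the minimum patched
# versions from a (hypothetical) security advisory and flag mismatches.

ADVISORY = {"kernel": "2.6.18-164.11.1.el5"}   # hypothetical patched release

site_inventory = {                             # e.g. collected by a local probe
    "wn042.example.org": {"kernel": "2.6.18-128.el5"},
    "wn043.example.org": {"kernel": "2.6.18-164.11.1.el5"},
}

for host, packages in site_inventory.items():
    for pkg, required in ADVISORY.items():
        current = packages.get(pkg, "not installed")
        if current != required:
            print(f"ALERT {host}: {pkg} is {current}, advisory requires {required}")
```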

14. Operational Security Support
Grid Security Vulnerability Group
• Helps identify security weaknesses in deployed Grid middleware in order to prevent incidents
• 33 issues handled during the last year (60 since the start of EGEE-III)
• Transition in progress to cover all the software distributed by EGI
Joint Security Policy Group
• 5 existing policies revised: VO Registration, VO Membership Management, Security Incident Response, Grid AUP, Site Registration
• 2 new policies: User-level Job Accounting, VO Portals
• Security Policy Framework developed to support the move to EGI
International Grid Trust Federation
• Bridged Research & Education federated ID and grid certificates
• New guidelines for key management and the use of automated clients published

15. Global Grid User Support (GGUS)
The EGEE helpdesk is a distributed system based on regional ticketing systems, with central coordination and a central entry point for user support.
It is the only front-end exposed to end users
• Developers & testers submit bugs directly to Savannah
• GGUS tickets leading to Savannah bugs are closed when the Savannah bug reaches the status "Ready for review"
• 44 tickets ended in Savannah bugs since January 2010
  • 28 of these for components under the responsibility of JRA1
Changes to support increased automation (a hypothetical routing sketch follows the slide)
• Direct routing to sites (bypassing 1st line support)
• Reduction in tickets handled manually in the last year (2023 versus 3912)
• ~80% of all 9419 tickets are now assigned directly
• Additional automation options implemented when justified
  • Direct assignment to some VOs and GGUS Support Units
  • ALARM and TEAM tickets
  • Ticket escalation by the submitter
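
A hypothetical routing sketch, not the actual GGUS implementation: tickets that already name a notified site are assigned straight to that site's support unit, bypassing first-line support (TPM), and some VOs get direct assignment. The field names and the VO list are assumptions for illustration.

```python
def route_ticket(ticket):
    """ticket: dict with optional 'notified_site' and 'vo' fields."""
    if ticket.get("notified_site"):
        return "site:" + ticket["notified_site"]   # direct routing to the site
    if ticket.get("vo") in ("atlas", "cms", "lhcb", "alice"):
        return "vo-support:" + ticket["vo"]        # direct VO assignment
    return "TPM"                                   # manual 1st-line triage

print(route_ticket({"notified_site": "SITE-A"}))   # site:SITE-A
print(route_ticket({"vo": "biomed"}))              # TPM
```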

16. Global Grid User Support
Originating community of tickets in Y2 (shown as a per-VO chart on the slide)
• "None" = tickets where the submitting user does not explicitly define the associated VO
• ATLAS is the major user (31%), followed by LHCb (9%), CMS (3%) and biomed (3%)
• Each VO has a different support model

17. Global Grid User Support
User support tickets created in Y2, by support unit type:
• 1st line support (TPM) also solves 4% of the tickets
• 2nd line Specialised Software Support Units solve 8% of the tickets
• Most tickets (87%) are solved at the ROCs/sites

18. Grid Operator on duty
Role of oversight and 1st level support for the grid production infrastructure
• Teams of operators on duty (COD) opening tickets on sites in response to grid monitoring alarms (sketched after the slide)
New model, based on devolution to the regions, fully implemented in summer '09
• First-line support done by each region, plus a common layer for procedures, tools and escalation (the R-COD)
Improvements:
• Regional teams are closer to the sites
  • Aware of the different site deployment scenarios
  • More incentivised to prevent incidents
• Shorter response time: 5.75 days (from ticket assignment to closing)
  • Before regionalisation the average response time was 8 days
• Distribution of responsibilities from ROCs to NGIs
• Concern: skills at ROC level will get partitioned when ROCs become NGIs
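
A simplified sketch of the operator-on-duty step described above: open a ticket for any monitoring alarm the site has not handled within a grace period. The grace period, alarm fields and test names are invented for illustration.

```python
from datetime import datetime, timedelta

GRACE = timedelta(hours=24)   # how long a site gets before a ticket is opened

def tickets_to_open(alarms, now):
    """alarms: dicts with 'site', 'test', 'raised' (datetime), 'acknowledged'."""
    return [a for a in alarms
            if not a["acknowledged"] and now - a["raised"] > GRACE]

now = datetime(2010, 6, 23, 12, 0)
alarms = [
    {"site": "SITE-A", "test": "CE-job-submit", "raised": now - timedelta(hours=30),
     "acknowledged": False},
    {"site": "SITE-B", "test": "SRM-put", "raised": now - timedelta(hours=2),
     "acknowledged": False},
]
for a in tickets_to_open(alarms, now):
    print(f"open GGUS ticket for {a['site']}: failing {a['test']}")   # only SITE-A
```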

19. Grid Operator on duty
[Charts on the slide: tickets opened in the last year; tickets opened per month; SAM-to-Nagios migration]
• 2570 tickets opened by the R-COD in the last year, with an average response time of 5.75 days
• Stable number of tickets opened per month, also after the migration from SAM to Nagios
  • The move to the new technology did not impact operations
• Expect a reduction in regional alarms as more sites start deploying Nagios
  • Sites see issues before they are reported to them by the R-CODs

20. Grid Operations Tools
Aims:
• Improve reliability and availability of sites via operational tools
• Prepare the operational tools for use in an EGI/NGI structure
Achievements:
• Better tools for site admins, giving faster problem detection and resolution
• Operations dashboard with regional views, used by the R-COD to raise alarms and GGUS tickets
• MyEGEE visualisation portal, in collaboration with OSG
• Availability/reliability now calculated from regional Nagios results
• Adoption and integration of mature open-source software
  • Site and regional monitoring via the Nagios monitoring framework
  • Integration of the operational tools via the ActiveMQ messaging system (a publishing sketch follows the slide)
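
A sketch of publishing one monitoring result onto the ActiveMQ message bus that ties the operational tools together. It uses the third-party stomp.py client; the broker host, credentials, destination name and message fields are placeholders, not the actual EGEE broker configuration.

```python
import json
import stomp   # third-party STOMP client (pip install stomp.py)

result = {
    "site": "EXAMPLE-SITE",
    "service": "CE",
    "metric": "job-submission-probe",
    "status": "OK",
    "timestamp": "2010-06-23T12:00:00Z",
}

# Connect to the broker over STOMP and publish the result to a topic that
# downstream tools (dashboard, availability calculation) could subscribe to.
conn = stomp.Connection([("msg-broker.example.org", 61613)])
conn.connect("probe-user", "secret", wait=True)
conn.send(destination="/topic/grid.probe.metricOutput", body=json.dumps(result))
conn.disconnect()
```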

21. Grid Operations Tools
[Diagrams on the slide: the operational tool landscape over the last 2 years versus now]

22. EGI: NGI prototyping
Integration of NGIs into the infrastructure has started
• Prototyped with 2 NGIs from 2 different multi-country ROCs
  • Poland from Central Europe
  • Greece from South East Europe
• Centrally: registration in the managerial and operational structures
• Locally: tool deployment and adaptation of processes (as required)
The process is now understood and documented, and is being followed by many other NGIs
• Issue: tight coordination is needed between tools and teams to enable the new NGI
• Issue: in these two examples the NGI staff overlapped with ROC staff with many years of accumulated knowledge/experience
  • Most NGIs will need close support and training

23. Lessons learnt
Regionalization is expensive due to the extra layers
• The move to nationally based infrastructures has a 'cheaper' core
Operational tools
• Importance of agreeing on formats for information exchange between tools
• Rely where possible on widely adopted public-domain solutions (e.g. Nagios, ActiveMQ)
• Need a more integrated operational tool development activity

24. Summary
Provided a stable and reliable infrastructure
• While approximately doubling in capacity
• While undergoing significant operational changes
Need to improve the operational tools
• Home-grown tools & infrastructures are costly to maintain and support
Good transition to EGI, no major issues left
• Full implementation of the ROC to NGI transition is ongoing
Global scientific collaborations rely on the infrastructure
• WLCG is the main user group; a solid base for ESFRI engagement
Collaboration was essential; it is important not to break this with regionalization/NGIs
