WLCG Operations

WLCG Operations • John Gordon, CCLRC • GridPP18, Glasgow, 21 March 2007

3 Grids • EGEE • OSG • NorduGrid

WLCG = 3 Grids • EGEE + OSG + NDGF • Would like it to be one seamless grid, but not yet • High-level tasks like Simulation Production can be split into 3 parts and farmed out • Interoperability has had some successes in job submission and information publishing • For us, WLCG Operations = EGEE Operations • Many parts to the infrastructure – concentrate here on the Production Service • How does it relate to you? • What action can you take?

The EGEE Infrastructure • Test-beds & Services: • Certification Testbeds (SA3) • Pre-production Service • Production Service • Support Structures: • Operations Coordination Centre • Regional Operations Centres • Global Grid User Support • EGEE Network Operations Centre (SA2) • Operational Security Coordination Team • Security & Policy Groups: • Joint Security Policy Group • EuGridPMA (& IGTF) • Grid Security Vulnerability Group • Operations Advisory Group (+NA4) • Infrastructure: • Physical test-beds & services • Support organisations & procedures • Policy groups

Middleware Release • Technical Coordination Group • Agrees the contents and priorities for what goes into the integration and testing process • Not all desired new components or updates may make the next distribution • Depends on priorities and urgency for other pieces • Moving away from big-bang releases to component upgrades • Concept of a baseline release and then updates and patches • New baseline when significant changes (dependencies, …)

Certification • Extensive certification test-bed: • Close to 100 machines involved • Main test-bed at CERN, test-beds for specific tasks at SA3 partner sites • Emulate the deployment environments • Or at least the main ones … • Certification testing: • Installation and configuration • Component (service) functionality • System testing (trying to emulate real workloads and stress testing) • Beginning to use virtualization to simplify the testing environment • Deployment into the pre-production system • Final step of certification – validation by real sites • Validation by applications – also allows apps to be prepared for new versions • Mostly hidden from you, but a lot of effort goes into it.

Operations • Operations Meetings • Weekly reports • GGUS • TPM, COD • Accounting • Monitoring

Grid management: structure • Operations Coordination Centre (OCC) • management, oversight of all operational and support activities • Regional Operations Centres (ROC) • providing the core of the support infrastructure, each supporting a number of resource centres within its region • Grid Operator on Duty • Resource centres • providing resources (computing, storage, network, etc.); • Grid User Support (GGUS) • At FZK, coordination and management of user support, single point of contact for users

Grid monitoring • The goal is to proactively monitor the operational state of the Grid and its performance, initiating corrective action to remedy problems arising with either core infrastructure or Grid resources • [Diagram: Monitoring shows a problem; Grid Operator on Duty (COD); OSCT; Regional Operations Centres; Resource Centres]

Grid Operator on Duty • Role: • Watch the problems detected by the grid monitoring tools • Problem diagnosis • Report these problems (GGUS tickets) • Follow and escalate them if needed (well defined procedure) • Provide help, propose solutions • Build and maintain a central knowledge database (WIKI) • Who? • 10 ROC teams working in pairs (one lead and one backup) on a weekly rotation

Grid monitoring tools • Tools used by the Grid Operator on Duty team to detect problems • Distributed responsibility • CIC portal • single entry point • Integrated view of monitoring tools • Site Functional Tests (SFT) -> Service Availability Monitoring (SAM) • Grid Operations Centre Core Database (GOCDB) • GIIS monitor (Gstat) • GOC certificate lifetime • GOC job monitor • Others

COD Tickets • Don’t ignore them! • If problems seem to fix themselves (BDII load) then keep some stats (tickets/interventions) and report to Jeremy/Philippa • Don’t just fix problems • Report trends, repeat problems, solutions • The problem at your site is often a symptom of an underlying problem • Middleware, deployment, configuration, documentation. • Your intervention might help to fix them

SAM Availability Algorithm • CE = OR of your CEs • SE = OR of your SEs • Site is Up if CE AND SE AND BDII AND SRM are all Up • Once Down, the site stays Down until the next Up result • Availability = % of time Up • Reliability = % of time Up, excluding Scheduled Downtime
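The rules on this slide can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the SAM implementation: the `Sample` record, its field names and the `scheduled_downtime` flag are hypothetical, and it assumes that "excluding Scheduled Downtime" means removing scheduled downtime from the denominator. Each test result is taken to hold until the next one arrives, which is how "Down until next Up" is read here.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One round of test results for a site (hypothetical record layout)."""
    time: float                 # hours since the start of the reporting window
    ce: list                    # one boolean per CE
    se: list                    # one boolean per SE
    bdii: bool
    srm: bool
    scheduled_downtime: bool = False


def site_up(s):
    # CE = OR of your CEs, SE = OR of your SEs, then AND across service types.
    return any(s.ce) and any(s.se) and s.bdii and s.srm


def availability_and_reliability(samples, window_end):
    """Return (availability %, reliability %) for time-ordered samples."""
    total = window_end - samples[0].time
    if total <= 0:
        raise ValueError("window_end must be later than the first sample")
    up = 0.0
    scheduled = 0.0
    # A result holds until the next sample (or the end of the window).
    next_times = [s.time for s in samples[1:]] + [window_end]
    for cur, nxt in zip(samples, next_times):
        span = nxt - cur.time
        if site_up(cur):
            up += span
        elif cur.scheduled_downtime:
            scheduled += span
    availability = 100.0 * up / total
    counted = total - scheduled   # assumption: scheduled downtime is excluded here
    reliability = 100.0 * up / counted if counted > 0 else 0.0
    return availability, reliability


# Made-up 24-hour window with results every 6 hours.
samples = [
    Sample(0.0, [True, False], [True], True, True),
    Sample(6.0, [False, False], [True], True, True),           # both CEs failing
    Sample(12.0, [True], [False], True, True, scheduled_downtime=True),
    Sample(18.0, [True], [True], True, True),
]
availability, reliability = availability_and_reliability(samples, 24.0)
print(f"availability {availability:.1f}%, reliability {reliability:.1f}%")
```

Note that one working CE out of two is enough for the CE term, but a single failing BDII or SRM takes the whole site Down.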

What to do? • SAM Monitoring will be used to judge your site in many ways • MoU, user satisfaction, Operations • Get used to it! • Complaining about the middleware doesn’t work • Continue to raise tickets and operations reports • Look for workrounds • Look at SAM failures for long-term fixes. • If you can’t reduce the number of problems, reduce their effect • Automation, alarms • Many other tools • Nagios? • Work on your problems but also work as a team.

Accounting • Each Tier1 submits a manual report of: • CPU time, wall-clock time, disk, tape • Allocated and used • Per LHC VO • Aggregated into a monthly report • Which accumulates through the year • Compared with MoU and installed capacity
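The monthly aggregation and comparison described on this slide might look roughly like the sketch below. The function name, the units and the idea of expressing the running total as a percentage of an annual MoU figure are assumptions made for illustration; the actual report layout is not given in the slides.

```python
def cumulative_usage(monthly_used, annual_pledge):
    """Accumulate monthly usage through the year and compare with a pledge.

    monthly_used  -- list of (month, used) pairs in calendar order
    annual_pledge -- hypothetical MoU figure in the same units as `used`
    Returns rows of (month, used, running_total, percent_of_pledge).
    """
    rows = []
    running = 0.0
    for month, used in monthly_used:
        running += used
        rows.append((month, used, running, 100.0 * running / annual_pledge))
    return rows


# Made-up CPU usage for one VO, in arbitrary units.
rows = cumulative_usage([("Jan", 100.0), ("Feb", 150.0), ("Mar", 50.0)], 1200.0)
for month, used, running, pct in rows:
    print(f"{month}: used {used:.0f}, year to date {running:.0f} ({pct:.1f}% of pledge)")
```

The same calculation would be repeated per VO and per resource type (CPU, disk, tape).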

Automated Accounting • This report is being automated • From March the results will be taken from APEL • Overlap with manual report for 3 months • Storage Accounting too (Greg’s talk) • Once automatic, easy to extend to Tier2s • Be warned!

What to do • Study APEL for your site • Look for gaps in data • Check SI2K values published • Compare with local records • Check Storage Accounts • If you are not being used by VOs, investigate
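Two of the checks above, looking for gaps in data and comparing with local records, can be sketched as follows. This does not query APEL: it assumes you have already copied the published daily CPU hours for your site into a dictionary by hand, and the function names, the 5% tolerance and all figures are hypothetical.

```python
from datetime import date, timedelta


def missing_days(published, start, end):
    """Return every day in [start, end] with no published accounting record."""
    gaps = []
    day = start
    while day <= end:
        if day not in published:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


def mismatched_days(published, local, tolerance=0.05):
    """Return days where published and local CPU hours differ by more than
    `tolerance`, expressed as a fraction of the local value."""
    bad = []
    for day in sorted(local):
        if day not in published:
            continue  # already reported by missing_days
        if abs(published[day] - local[day]) > tolerance * local[day]:
            bad.append(day)
    return bad


# Made-up CPU hours per day, for illustration only.
published = {
    date(2007, 3, 1): 480.0,
    date(2007, 3, 2): 510.0,
    date(2007, 3, 4): 300.0,
}
local = {
    date(2007, 3, 1): 482.0,
    date(2007, 3, 2): 505.0,
    date(2007, 3, 3): 495.0,
    date(2007, 3, 4): 490.0,
}

gaps = missing_days(published, date(2007, 3, 1), date(2007, 3, 4))
bad = mismatched_days(published, local)
print("missing:", gaps)
print("mismatched:", bad)
```

A day that is missing entirely and a day that is published but well below the local figure usually point to different faults, so the two lists are kept separate.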

Summary • Act on trouble tickets • Work on improving your SAM figures • Check your accounting

Message • Site view may be from the bottom up • We are motivated to put constituent parts in place and run them well • WLCG view is from the top down. • From up there they see the Tier1s clearly and are driving them • They’ll spot you soon, so be prepared. • Learn from the Tier1 • GridPP has been a success in delivering to LHC • … but the pressure will increase over 2007 • Keep up the good work!
