
WLCG Operations

John Gordon, CCLRC. GridPP18, Glasgow, 21 March 2007.



  1. WLCG Operations • John Gordon, CCLRC • GridPP18, Glasgow • 21 March 2007

  2. 3 Grids • EGEE • OSG • NorduGrid

  3. WLCG = 3 Grids • EGEE + OSG + NDGF • Would like it to be one seamless grid, but not yet • High-level tasks like Simulation Production can be split into 3 parts and farmed out • Interoperability has had some successes in job submission and information publishing • For us, WLCG Operations = EGEE Operations • Many parts to the infrastructure; concentrate here on the Production Service • How does it relate to you? • What action can you take?

  4. The EGEE Infrastructure • Test-beds & Services: Certification Testbeds (SA3), Pre-production Service, Production Service • Support Structures: Operations Coordination Centre, Regional Operations Centres, Global Grid User Support, EGEE Network Operations Centre (SA2), Operational Security Coordination Team • Security & Policy Groups: Joint Security Policy Group, EuGridPMA (& IGTF), Grid Security Vulnerability Group, Operations Advisory Group (+NA4) • Infrastructure: • Physical test-beds & services • Support organisations & procedures • Policy groups

  5. Middleware Release • Technical Coordination Group • Agrees the contents and priorities for what goes into the integration and testing process • Not all desired new components or updates may make the next distribution • Depends on priorities and urgency for other pieces • Moving away from big-bang releases to component upgrades • Concept of a baseline release and then updates and patches • New baseline when significant changes (dependencies, …)

  6. Certification • Extensive certification test-bed: • Close to 100 machines involved • Main test-bed at CERN, test-beds for specific tasks at SA3 partner sites • Emulate the deployment environments • Or at least the main ones … • Certification testing: • Installation and configuration • Component (service) functionality • System testing (trying to emulate real workloads and stress testing) • Beginning to use virtualization to simplify the testing environment • Deployment into the pre-production system • Final step of certification: validation by real sites • Validation by applications, which also lets apps be prepared for new versions • Mostly hidden from you, but a lot of effort goes into it.

  7. Operations • Operations Meetings • Weekly reports • GGUS • TPM, COD • Accounting • Monitoring

  8. Grid management: structure • Operations Coordination Centre (OCC) • management, oversight of all operational and support activities • Regional Operations Centres (ROC) • providing the core of the support infrastructure, each supporting a number of resource centres within its region • Grid Operator on Duty • Resource centres • providing resources (computing, storage, network, etc.) • Grid User Support (GGUS) • At FZK, coordination and management of user support, single point of contact for users

  9. Grid monitoring • The goal is to proactively monitor the operational state of the Grid and its performance, initiating corrective action to remedy problems arising with either core infrastructure or Grid resources • [Diagram: Monitoring shows a problem; Grid Operator on-duty (COD); OSCT; Regional Operations Centres; Resource Centres]

  10. Grid Operator on Duty • Role: • Watch the problems detected by the grid monitoring tools • Problem diagnosis • Report these problems (GGUS tickets) • Follow and escalate them if needed (well defined procedure) • Provide help, propose solutions • Build and maintain a central knowledge database (WIKI) • Who? • 10 ROC teams working in pairs (one lead and one backup) on a weekly rotation

  11. Grid monitoring tools • Tools used by the Grid Operator on Duty team to detect problems • Distributed responsibility • CIC portal • single entry point • Integrated view of monitoring tools • Site Functional Tests (SFT) -> Service Availability Monitoring (SAM) • Grid Operations Centre Core Database (GOCDB) • GIIS monitor (Gstat) • GOC certificate lifetime • GOC job monitor • Others

  12. COD Tickets • Don’t ignore them! • If problems seem to fix themselves (BDII load) then keep some stats (tickets/interventions) and report to Jeremy/Philippa • Don’t just fix problems • Report trends, repeat problems, solutions • The problem at your site is often a symptom of an underlying problem • Middleware, deployment, configuration, documentation. • Your intervention might help to fix them

  13. SAM Availability Algorithm • CE = OR of your CEs • SE = OR of your SEs • Up if CE.AND.SE.AND.BDII.AND.SRM • If Down Then Down until next Up • Availability = % of time Up • Reliability = % of time Up excluding Scheduled Downtime
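The algorithm on this slide can be sketched in Python. This is a minimal illustration, not the SAM implementation: the `Sample` record, the function names, and the `scheduled_down` flag are hypothetical, and reliability is interpreted here as time Up divided by the time outside scheduled downtime.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Sample:
    """One round of test results for a site (hypothetical format)."""
    hour: float                    # time of the test, in hours from the start of the period
    ces: Sequence[bool]            # one result per Computing Element
    ses: Sequence[bool]            # one result per Storage Element
    bdii: bool
    srm: bool
    scheduled_down: bool = False   # assumption: the interval after this test is scheduled downtime


def site_up(s: Sample) -> bool:
    # CE = OR of your CEs, SE = OR of your SEs, then AND with BDII and SRM.
    return any(s.ces) and any(s.ses) and s.bdii and s.srm


def availability_and_reliability(samples, period_end):
    """Return (availability, reliability) as fractions of time.

    Each sample's state is held until the next sample, so a site that
    tests Down stays Down until the next test that shows it Up.
    """
    up = total = scheduled = 0.0
    next_hours = [s.hour for s in samples[1:]] + [period_end]
    for current, next_hour in zip(samples, next_hours):
        span = next_hour - current.hour
        total += span
        if current.scheduled_down:
            scheduled += span      # assumption: never counted as Up
        elif site_up(current):
            up += span
    availability = up / total if total else 0.0
    unscheduled = total - scheduled
    reliability = up / unscheduled if unscheduled else 0.0
    return availability, reliability
```

Because the CEs and SEs are OR-ed, one working CE and one working SE are enough to keep the site Up, whereas a single failing BDII or SRM test takes the whole site Down for the interval.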

  14. What to do? • SAM Monitoring will be used to judge your site in many ways • MoU, user satisfaction, Operations • Get used to it! • Complaining about the middleware doesn’t work • Continue to raise tickets and operations reports • Look for workarounds • Look at SAM failures for long-term fixes. • If you can’t reduce the number of problems, reduce their effect • Automation, alarms • Many other tools • Nagios? • Work on your problems but also work as a team.

  15. Accounting • Each Tier1 submits a manual report of: • CPU time, wall-clock time, disk, tape • Allocated and used • Per LHC VO • Aggregated into a monthly report • Which accumulates through the year • Compared with MoU and installed capacity

  16. Automated Accounting • This report is being automated • From March the results will be taken from APEL • Overlap with the manual report for 3 months • Storage Accounting too (Greg’s talk) • Once automatic, easy to extend to Tier2s • Be warned!

  17. What to do • Study APEL for your site • Look for gaps in data • Check SI2K values published • Compare with local records • Check Storage Accounts • If you are not being used by VOs, investigate
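Two of these checks, looking for gaps and comparing with local records, can be sketched in Python. The sketch does not call any APEL interface: it assumes you have already extracted per-day job counts into plain dictionaries keyed by date, and both function names are hypothetical.

```python
from datetime import date, timedelta


def missing_days(published, start, end):
    """Return the days in [start, end] that have no published record."""
    have = set(published)
    day, gaps = start, []
    while day <= end:
        if day not in have:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


def compare_with_local(published_jobs, local_jobs):
    """Return {day: (published, local)} for days where the job counts disagree.

    A day missing from either side is treated as a count of 0.
    """
    days = set(published_jobs) | set(local_jobs)
    return {
        d: (published_jobs.get(d, 0), local_jobs.get(d, 0))
        for d in sorted(days)
        if published_jobs.get(d, 0) != local_jobs.get(d, 0)
    }
```

A day that appears in the local batch records but not in the published data shows up in both results, which is usually the first sign that a publishing step has stalled.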

  18. Summary • Act on trouble tickets • Work on improving your SAM figures • Check your accounting

  19. Message • Site view may be from the bottom up • We are motivated to put constituent parts in place and run them well • WLCG view is from the top down. • From up there they see the Tier1s clearly and are driving them • They’ll spot you soon, so be prepared. • Learn from the Tier1 • GridPP has been a success in delivering to LHC • … but the pressure will increase over 2007 • Keep up the good work!
