1 / 10

WLCG Service Report: Status and Challenges - February 2012

The WLCG Service Report for February 14, 2012, presents an overview of the current status and challenges of the service. While the system is running relatively smoothly, there are notable concerns including the delayed deployment of essential software and several incidents related to Oracle 11g upgrades. Key issues documented include alarm tickets related to ALICE and CMS services, and a suggestion for clearer management of Service Incident Reports (SIRs). Overall, the report highlights the need for improved documentation and investigation regarding service alerts and management.

yeriel
Télécharger la présentation

WLCG Service Report: Status and Challenges - February 2012

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. WLCG Service Report Jamie.Shiers@cern.ch ~~~ WLCG Management Board, 14th February 2012

  2. Introduction • The service is running rather smoothly, the “metrics” are working relatively well • At least one significant change in the pipeline: • EMI FTS deployment in production at Tier0 and Tier1s (well) prior to 2012 pp data taking • At last T1SCM the relevant m/w had not been released (due Feb 16) nor was roadmap clear to all (being prepared) • SIRs: one requested covering Oracle 11g upgrades; others due for the 2 alarm tickets of 2012

  3. WLCG Operations Report – Structure

  4. GGUS summary (5 weeks) 4

  5. This is what is recorded in the ticket. However, it is neither a complete nor accuratesummary due to some confusion between multiple incidents and human error in updating (closing) the wrong ticket. IMHO a SIR would be useful in clarifying this. ALICE ALARM->voms-proxy-init hangs GGUS:78739 WLCG MB Report WLCG Service Report

  6. CMS ALARM->no connect to CMSR db from remote PhEDEx agents GGUS:78843 Firewall misconfiguration immediately after Oracle 11g upgrade

  7. SIR by Area (Q4 2011)

  8. Time to Resolution

  9. “Serious” SIRs in Q4 2011

  10. Conclusions • The service is (chartreuse, pistachio, olive…) • SIRs and alarms: details regarding any problem should preferably be entered into / attached to the corresponding GGUS ticket • New rule: if there is an alarm ticket (justified) and the resolution / follow-up are not in the ticket they should be documented in a SIR • Quite probable that further investigation is required • Usability of SUM: few or no exceptions – there are currently too many “patches” on the reports for them to be useful • Change management: at least one “iceberg” ahead (EMI FTS deployment at Tier0 and Tier1s prior to 2012 data taking)

More Related