physics piquet services during the CERN annual closure

Presentation Transcript


  1. physics piquet services during the CERN annual closure

  2. Why this report?
  • Services in ground state, with little excitement
  • Simple use of services, mainly production users
  • No (or fewer) experts improving services
  • Easier to identify problems in our working practices (missing procedures, inter-service issues, etc.)
  • Piquet coverage for physics services provided
  • This report is based on the experiences of the PDB, GMoD and SMoD piquets
  • Experiences may be interesting for the Piquet Working Group

  3. Services support flows
  [Diagram: Lemon alarms and user/experiment problem reports feed the CC operators, the SysAdmin team, the service managers and the service experts]
  • CC operators: 24 x 7 coverage, 1st level alarm handling, driven by procedures
  • SysAdmin team: piquet service, manages hardware repairs for important machines
  • Service managers: PDB, GMoD, SMoD piquet; entry point for support lines
  • Still many of them! Problem reports come to the Service Managers via many different flows, using many different tools, directly and indirectly. This still needs some tuning (see the sketch after this slide)
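The boxes above describe an organisational flow rather than software, but the point that problem reports arrive through many different flows and tools can be illustrated with a small, purely hypothetical sketch in Python. The channel names, the routing rule and the function name are all invented for illustration; they do not correspond to any actual CERN tool.

    # Hypothetical sketch: several report channels feeding the same piquets.
    # All names and the routing rule are invented for illustration only.
    from dataclasses import dataclass

    @dataclass
    class ProblemReport:
        channel: str   # "lemon", "ggus", "remedy", "mail", ...
        service: str   # "rb", "lfc", "castor", ...
        summary: str

    GRID_SERVICES = {"rb", "ce", "bdii", "lb", "fts"}

    def route(report: ProblemReport) -> str:
        """Rough routing: hardware alarms to the SysAdmin piquet, grid
        services to the GMoD, everything else to the SMoD."""
        if report.channel == "lemon" and "hardware" in report.summary.lower():
            return "SysAdmin piquet"
        if report.service in GRID_SERVICES:
            return "GMoD"
        return "SMoD"

    # Each channel needs its own parser and follow-up tool, which is the
    # duplication behind "this still needs some tuning".
    print(route(ProblemReport("ggus", "rb", "rb114 unreachable")))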

  4. Number of operator alarms
  [Chart of operator alarm rates: 93 alarms/day, 26 alarms/day, 82 alarms/day …and one incident…]

  5. Activity on LXBATCH
  [Plot of LXBATCH activity during the closure: grid production activity!]

  6. CASTOR activity
  • CMS most active, data export of 100–250 MB/s
  • Steady ATLAS activity, with an interruption between Jan 6 and Jan 8

  7. General observations
  • Most services ran without particular problems…
  • Usual hardware and software failures handled in the usual way
  • CC operators and the SysAdmin piquet handled most of the alarms
  • Service infrastructure largely in place, including alarms, procedures, documentation
  • We have added some automatic recovery actions and procedures (a minimal sketch of such an action follows after this slide)
  • Experiments expressed their thanks
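As an illustration of what an automatic recovery action can look like, here is a minimal sketch in Python: it probes a local daemon and restarts it when the probe fails. The daemon name, port and restart command are hypothetical placeholders; this is not the actual action deployed on the physics services.

    # Minimal sketch of an automatic recovery action (hypothetical example).
    import socket
    import subprocess

    SERVICE = "mydaemon"   # placeholder daemon name
    PORT = 5010            # placeholder port the daemon listens on
    TIMEOUT = 5            # seconds before the probe is declared failed

    def daemon_responds(host="localhost", port=PORT, timeout=TIMEOUT):
        """Return True if the daemon accepts a TCP connection in time."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if not daemon_responds():
        # Restart through the init script and leave a trace for the piquet.
        subprocess.call(["/sbin/service", SERVICE, "restart"])
        print(f"{SERVICE} was not responding; restart attempted")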

  8. Grid Services
  • GMoDs handled quite a few different problems
  • System crash on rb114, hardware failure on gdrb06, …
  • Service restarts, high loads, full filesystems
  • RBs, CEs, BDII, LB, …
  • GGUS tickets, interaction with service experts
  • FTS: GMoD reports stable running, with ~100% service availability

  9. Oracle RAC
  • 5 service degradations between Dec 26 and Dec 31
  • One node in a cluster gets stuck (RHES-3?) and needs a reboot
  • Dec 26: itrac16 hung + reboot (ATLAS RAC)
  • Dec 26: itrac20 hung + reboot (ATLAS RAC); the node was problematic to boot
  • Dec 27: itrac16 hung + reboot (ATLAS RAC)
  • Dec 29: itrac04 hung + reboot (LCG RAC)
  • Dec 31: itrac11 hardware failure, affecting the LHCb RAC
  • All single-node failures, causing temporary service degradation
  • Normal procedures applied successfully
  • The H/W and O/S of these RACs are being upgraded

  10. Alice grid jobs on lxbatch
  • On Dec 29, 200 lxbatch nodes running alicesgm jobs went into a high load and needed to be rebooted
  • The same happened to 2 of the Alice VO boxes
  • The normal flow worked fine:
  • The operator alerted the SysAdmin, who escalated to the SMoD
  • The SMoD alerted Alice and asked for the nodes to be rebooted
  • The problem was understood by Alice, and solved
  • Shame about the other jobs on the worker nodes…

  11. LFC-lhcb degraded
  • Jan 2: lfc processes on lfc104 start to time out, but the monitoring does not pick this up… This degrades the service (2 nodes in a load-balanced pair); a sketch of an active probe follows after this slide
  • Two LHCb users report the problem:
  • Mail to lfc.support@cern.ch → Remedy ticket for the SMoD → service restored on Jan 4
  • GGUS ticket → GMoD, forwarded later to lfc.support@cern.ch
  • Q: Can we streamline the workflow?
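One way to catch this kind of degradation is an active probe that times how long each node of the load-balanced pair takes to answer and raises an alarm when a node is slow or unreachable, instead of only checking that the processes exist. The sketch below illustrates the idea; the host names, port and threshold are assumptions made for the example, and a real check would exercise an actual LFC request rather than a bare TCP connection.

    # Hypothetical active probe for a load-balanced pair (illustration only).
    import socket
    import time

    NODES = ["lfc103", "lfc104"]   # assumed pair; lfc104 was the degraded node
    PORT = 5010                    # assumed service port
    MAX_LATENCY = 10.0             # seconds before a node counts as degraded

    def probe(host):
        """Return connection latency in seconds, or None on failure/timeout."""
        start = time.time()
        try:
            with socket.create_connection((host, PORT), timeout=MAX_LATENCY):
                return time.time() - start
        except OSError:
            return None

    for node in NODES:
        latency = probe(node)
        if latency is None:
            # In a real setup this would raise a monitoring alarm for the piquet.
            print(f"ALARM: {node} not responding within {MAX_LATENCY}s")
        else:
            print(f"{node} OK ({latency:.2f}s)")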

  12. CASTORATLAS stager database
  • Two problems, interfering destructively…
  • Dec 28: vendor fixes a trivial hardware problem, but the machine remains out of the alarm handling…
  • Jan 1: a new hardware problem develops and goes unnoticed…
  • Jan 4: a high CPU load is investigated by DES; nothing found
  • Jan 5: the machine crashes; this is noticed by chance a few hours later
  • SMoD, CASTOR and Oracle experts check, and the service is partially restored
  • Jan 8: the reason for the high load is found & fixed
  • This required expert-level intervention!

  13. Conclusions
  • Services ran (in general) stably
  • And they were being used!
  • Few service degradations, spread over different services
  • Service infrastructure is in place, and it is working
  • Several targeted improvements deployed
  • No gaping holes, but some small ones
  • To do: make sure that it still works under normal conditions :-)
