
Support Ticket Summary for ALICE, ATLAS, CMS, and LHCb

This summary provides an overview of support-related events for ALICE, ATLAS, CMS, and LHCb teams over the past 3 weeks, including alarm tickets, ticket submissions, and system issues.

matney

Presentation Transcript


  1. GGUS summary (3 weeks)

     VO      User  Team  Alarm  Total
     ALICE      7     0      2      9
     ATLAS     41   126      8    175
     CMS       12     2      5     19
     LHCb       8    36      1     45
     Totals    68   164     16    248

     [Chart: total tickets over time for ALICE, ATLAS, CMS and LHCb]
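The row and column totals in the GGUS summary above can be cross-checked with a few lines of Python. This is only a minimal sketch: the figures are copied from the summary, while the dictionary layout and variable names are illustrative assumptions and not part of GGUS.

```python
# Ticket counts per VO, copied from the GGUS summary: (User, Team, Alarm).
# The data layout and names are illustrative assumptions.
tickets = {
    "ALICE": (7, 0, 2),
    "ATLAS": (41, 126, 8),
    "CMS": (12, 2, 5),
    "LHCb": (8, 36, 1),
}

# Per-VO totals: sum across the three ticket types.
row_totals = {vo: sum(counts) for vo, counts in tickets.items()}

# Per-type totals: sum across the VOs, in the order User, Team, Alarm.
column_totals = tuple(sum(column) for column in zip(*tickets.values()))

# Overall number of tickets in the period.
grand_total = sum(column_totals)

print(row_totals)
print(column_totals)
print(grand_total)
```

The computed values agree with the "Total" column and "Totals" row reported in the slide.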

  2. Support-related events since last MB

     • There were 9 real ALARM tickets since the 2011/09/20 MB (3 weeks): 4 submitted by ATLAS, 4 by CMS and 1 by ALICE. All are ‘solved’ and all but 1 are ‘verified’.
     • 7 ALARM tickets concerned CERN, 1 RAL and 1 ASGC.
     • 20 test ALARM tickets were submitted by the GGUS developers on release day, 2011/09/28, as part of the regular procedure.
     • Following this release, a flag regulating GGUS email notification was wrongly configured. As a result, GGUS intermittently generated duplicate email notifications to the supporters until Oct 7th (am).
     • On 2011/10/06 (pm) the GGUS interfaces with other ticketing systems using web services broke due to a KIT DNS problem, caused by an update of the intrusion prevention system (IPS). After this update the KIT DNS could not reach other DNS servers outside. After rolling back to the previous version of the IPS, it took some time until DNS communication worked correctly again.

     11/16/2019 WLCG MB Report WLCG Service Report

  3. ATLAS ALARM->CERN raw files vanish from Castor scratch space before merge and copy to tape GGUS:74448

     What time UTC     What happened
     2011/09/19 11:40  GGUS ALARM ticket, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.
     2011/09/19 11:49  Service mgr confirms in the ticket that the investigation has started.
     2011/09/19 11:55  Service mgr sets the ticket to status ‘solved’, explaining that a node was taken out of production for reasons unknown at that time and never recorded in the ticket.
     2011/09/19 12:15  The operator records in the ticket that “the sys. admin is working on it”.
     2011/09/19 13:08  Submitter sets the ticket to status ‘verified’.

  4. CMS ALARM->CERN LSF not starting T0 jobs GGUS:74456

     What time UTC     What happened
     2011/09/19 15:48  GGUS ALARM ticket, automatic email notification to cms-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.
     2011/09/19 15:57  Grid services’ expert, having seen the email, comments in the ticket that the problem was already known and being handled.
     2011/09/19 16:00  Operator records in the ticket that the sys. admin. was contacted.
     2011/09/19 16:25  Expert sets the ticket to status ‘solved’. The cmst0 queue priority was set to a higher value so that LSF allows more CMS jobs to run within a given cycle. A more permanent solution was promised but not recorded in this ticket.
     2011/09/19 17:04  Submitter observed the queues for 2.5 hrs until the number of jobs returned as failed decreased.
     2011/09/25 17:28  (Sunday) Submitter sets the ticket to status ‘verified’.

  5. ATLAS ALARM->T0 to RAL data exports fail GGUS:74686

     What time UTC     What happened
     2011/09/27 04:33  GGUS TEAM ticket, automatic email notification to lcg-support@gridpp.rl.ac.uk AND automatic assignment to NIG_UK.
     2011/09/27 06:45  TEAM ticket upgraded to ALARM. lcg-alarm@gridpp.rl.ac.uk notified. Automatic ALARM acknowledgement recorded in the ticket, promising an expert’s response within 2 hours.
     2011/09/27 07:23  Site admin records in the ticket that investigation is taking place with high priority.
     2011/09/27 08:53  Service expert at the site records that a Castor DB inconsistency was found. DB experts @ RAL contacted. The ATLAS Castor instance @ RAL put in downtime.
     2011/09/27 13:57  The expert at the site adds 4 comments correcting the diagnosis and recording in the ticket that the DB table needed to be rebuilt.
     2011/09/27 14:55  Service expert sets the ticket to status ‘solved’.
     2011/09/27 16:05  Submitter sets the ticket to status ‘verified’.
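Elapsed times between ticket events, for example from the ALARM upgrade to the ‘solved’ status, can be derived from the UTC timestamps of the GGUS:74686 timeline. The following is a minimal sketch using Python's standard `datetime` module; the event labels and the helper `elapsed_minutes` are hypothetical names introduced here for illustration, and only the timestamps come from the slide.

```python
from datetime import datetime

# Timestamp format used in the slides (all times are UTC).
FMT = "%Y/%m/%d %H:%M"

# Timestamps copied from the GGUS:74686 timeline.
# The keys are shorthand labels, not GGUS field names.
events = {
    "team_ticket": "2011/09/27 04:33",
    "alarm_upgrade": "2011/09/27 06:45",
    "solved": "2011/09/27 14:55",
    "verified": "2011/09/27 16:05",
}

times = {name: datetime.strptime(stamp, FMT) for name, stamp in events.items()}


def elapsed_minutes(start, end):
    """Return the whole minutes elapsed between two labelled events."""
    return int((times[end] - times[start]).total_seconds() // 60)


alarm_to_solved = elapsed_minutes("alarm_upgrade", "solved")
ticket_to_verified = elapsed_minutes("team_ticket", "verified")

print(alarm_to_solved, ticket_to_verified)
```

The same approach applies to any of the other ticket timelines, provided the timestamps are entered in the same format.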

  6. ATLAS ALARM->ASGC can’t get LFC replicas GGUS:74758

     What time UTC     What happened
     2011/09/28 19:22  GGUS TEAM ticket, automatic email notification to ops@lists.grid.sinica.edu.tw AND automatic assignment to ROC_Asia/Pacific. “Type of Problem (ToP)” 1st usage!!! ToP: Storage Systems.
     2011/09/28 20:26  Next shifter records in the ticket that the problem appears in the opposite direction as well.
     2011/09/28 20:27  CERN/IT/ES ATLAS supporter raises the ticket to an ALARM. Asgc-t1-op@lists.gird.sinica.edu.tw notified.
     2011/09/28 21:56  1st diagnosis shows a DOS caused by a panda user.
     2011/09/29 02:26  Site admin. sets the ticket ‘in progress’.
     2011/09/29 07:56  The ATLAS supporter from CERN confirms that ~10K concurrent jobs, each fetching 100MB from storage, were the reason for the DOS, bans the job submitter and sets the ticket to status ‘solved’.

  7. ATLAS ALARM->CERN T0MERGE inaccessible GGUS:74838

     What time UTC     What happened
     2011/09/30 13:01  GGUS ALARM ticket, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. ToP: File Access.
     2011/09/30 13:06  Operator records in the ticket that the Castor piquet was contacted.
     2011/09/30 13:06  Castor expert puts the ticket ‘in progress’.
     2011/09/30 13:37  Expert sets the ticket to status ‘solved’, recording that the known Transfer Manager problem was the cause. Stuck transfer requests were cleaned, but available patches should be installed.
     2011/09/30 14:36  Expert enters 2 more clarification comments.
     2011/10/03 06:41  Submitter sets the ticket to status ‘verified’.

  8. CMS ALARM->CERN CMSR DB down GGUS:74701

     What time UTC     What happened
     2011/09/27 12:39  GGUS ALARM ticket, automatic email notification to cms-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. ToP: other (not selected).
     2011/09/27 12:53  2nd Line support assigns the ticket to DB Instances 3rd Line.
     2011/09/27 13:30  The operator records that the ticket is received but calls nobody.
     2011/09/27 14:00  Service expert sets the ticket to status ‘solved’, confirming there was a problem with the DB but without explaining its cause.
     2011/09/27 14:17  Submitter sets the ticket to status ‘verified’.

  9. CMS ALARM->CERN Problem to open DB file GGUS:74709

     What time UTC     What happened
     2011/09/27 17:46  GGUS TEAM ticket, automatic email notification to grid-cern-prod-admins@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. ToP: other (not selected).
     2011/09/27 18:59  TEAM ticket upgraded to ALARM. cms-operator-alarm@cern.ch notified.
     2011/09/27 19:19  Operator records in the ticket that phyDB support was contacted.
     2011/09/27 19:27  Service expert sets the ticket to status ‘solved’ without explaining how.
     2011/09/27 21:15  Submitter sets the ticket to status ‘verified’.

  10. ALICE ALARM->CERN myproxy stopped working GGUS:75055

     What time UTC     What happened
     2011/10/06 17:35  GGUS ALARM ticket, automatic email notification to alice-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation does NOT appear in the GGUS ticket diary!!! This is due to the KIT DNS problem (see slide 2). ToP: middleware.
     2011/10/06 18:31  Operator records in the ticket that the IT PES PS piquet was contacted.
     2011/10/06 19:46  Service expert comments in the ticket that the problem is fixed. The diagnosis was already given by the submitter, i.e. a change of host cert. led to authorisation failures.
     2011/10/06 19:53  Submitter confirms that the problem went away.
     2011/10/07 10:31  Late appearance of the SNOW ticket number.
     2011/10/07 11:53  Service expert sets the ticket to status ‘solved’. A number of identical comments follow due to the duplicate email notifications explained in slide 2. They stop when the submitter sets the ticket to status ‘verified’.

  11. CMS ALARM->CERN myproxy stopped working GGUS:75056

     What time UTC           What happened
     2011/10/06 17:43        GGUS ALARM ticket, automatic email notification to cms-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation does NOT appear in the GGUS ticket diary!!! This is due to the KIT DNS problem (see slide 2). ToP: File transfer (different from the identical report by ALICE, see previous slide).
     2011/10/06 18:05        Service expert comments in the ticket that the problem is known and already fixed.
     2011/10/06 18:22        The same expert comments in the ticket that one of the 2 myproxy hosts still gives errors and is temporarily disabled for verification.
     2011/10/06 18:31        Operator records in the ticket that the IT PES PS piquet was contacted.
     2011/10/06 18:43-21:22  3 comments exchanged for debugging, followed by status change to ‘solved’ and ‘verified’.
     2011/10/07 10:35        Late appearance of the SNOW ticket number (reasons in the previous slide).
