The report covers ATLAS central services and site issues for the week of 17 to 23 July. Key points include failures writing to and reading from Castoratlas/t0atlas, the addition of two new servers to the LFC frontend, and the ongoing cleanup of Castoratlas/atlt3. At the Tier1s/Tier2s, it details transfer failures at PIC, bad ACLs on LFC directories for TRIUMF, a power cut at INFN-NAPOLI, and lost credentials at DESY-ZN. The report also highlights the need for better communication between GS and DB, and a change to the MaxTime attribute in SchedConfig that most probably caused the US cloud to drain jobs.
AMOD report
Alessandro Di Girolamo, Stephane Jezequel, Guido Negri
Tier0 – Central Services
• Castoratlas/t0atlas: a few failures in writing (Tue 17th, still not understood) and in reading (Wed 18th, a single server with problems)
• 2 new servers added to the LFC frontend, 7 in total. With 90 connections per server that is 90*7=630 possible connections, more than the 500-connection limit of the backend database; the DB team raised the limit to 900. Communication between GS and DB needs to improve
• Castoratlas/atlt3 being decommissioned (cleanup ongoing, TMPLOCALGROUPDISK removed from ToA)
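The connection arithmetic in the LFC item can be checked with a minimal sketch. The function names are hypothetical and chosen for illustration only; the numbers (7 servers, 90 connections each, backend limits of 500 and 900) are the ones quoted in the report.

```python
def frontend_capacity(servers: int, per_server: int) -> int:
    """Maximum number of connections the frontend servers can open in total."""
    return servers * per_server


def headroom(servers: int, per_server: int, backend_limit: int) -> int:
    """Backend connections left over; a negative value means the frontend can exceed the limit."""
    return backend_limit - frontend_capacity(servers, per_server)


if __name__ == "__main__":
    # Figures from the report: 7 frontend servers, 90 connections each.
    print(frontend_capacity(7, 90))   # 630
    print(headroom(7, 90, 500))       # -130: old backend limit is too low
    print(headroom(7, 90, 900))       # 270: new backend limit leaves margin
```

With the old limit the frontend could ask for 130 more connections than the database accepts; the new limit of 900 leaves a margin of 270.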
Tier1s/Tier2s
• PIC: many transfer failures on Monday (GGUS:84311). All dCache pools assigned to ATLAS filled up; new disk space has been assigned. PIC reported problems installing 1.7PB of new hardware, which should be done now
• TRIUMF: GGUS:84327, bad ACLs on some directories in LFC; asked the site to kindly change them (Kors' certificate, used to create them, is no longer valid)
• INFN-NAPOLI: power cut during the weekend (14th Jul); took 3 days to fully recover (back in prod on Tue 17th)
• DESY-ZN (20th-23rd): lost credentials during SE update
ATLAS internals
• US cloud draining jobs over the weekend (21-22 Jul), most probably because the MaxTime attribute in SchedConfig changed: formerly a static attribute filled in by hand, it is now the value collected from the BDII via AGIS. Alden reverted to the old values; he will now dump the different values to look for discrepancies and possible solutions
• Some GroupProduction jobs tried to run over ESD on tape (no more DISK copy at T1). This should not be done; they should run on RAW. Nurcan has been contacted
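The MaxTime comparison mentioned above (old hand-filled values against the ones collected via AGIS) could be sketched roughly as follows. This is only an illustration under stated assumptions: it does not use any real SchedConfig or AGIS interface, the function name is hypothetical, and the queue names and numbers are invented placeholders, not values from the report.

```python
def maxtime_discrepancies(static_values: dict, agis_values: dict) -> dict:
    """Return {queue: (static, agis)} for every queue whose two values differ.

    A queue present on only one side is reported with None for the missing value.
    """
    diffs = {}
    for queue in sorted(set(static_values) | set(agis_values)):
        old = static_values.get(queue)
        new = agis_values.get(queue)
        if old != new:
            diffs[queue] = (old, new)
    return diffs


if __name__ == "__main__":
    # Invented placeholder data; real values would come from the two dumps.
    static_values = {"QUEUE_A": 172800, "QUEUE_B": 86400, "QUEUE_C": 259200}
    agis_values = {"QUEUE_A": 172800, "QUEUE_B": 2880, "QUEUE_D": 4320}

    for queue, (old, new) in maxtime_discrepancies(static_values, agis_values).items():
        print(f"{queue}: static={old} agis={new}")
```

Listing queues that are missing on one side, as well as those with different values, would make it easier to tell a unit or convention mismatch from a simply absent entry.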