
LHCb: March/April Operational Report

This report outlines the recent activities of LHCb, including a review of GGUS tickets and the most worrying issues faced by the team. It also includes a review of T1 sites and highlights from the first LHCb T1 Jamboree.


Presentation Transcript


  1. LHCb: March/April Operational Report. Roberto Santinelli, on behalf of LHCb. GDB, 12th May 2010

  2. OUTLINE
  • Recent activities
  • GGUS review
  • Issues
    • Most worrying
  • T1 sites review
  • First LHCb T1 Jamboree

  3. 1 GB/s integrated throughput (export + replication)
  • Also small reconstructed/stripped files

      [lxplus235] ~ > dirac-dms-storage-usage-summary --Dir=/lhcb/data/2010/RAW
      DIRAC SE      Size (TB)   Files
      --------------------------------------------------
      CERN-RAW      18.6        11629
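
A small script can tabulate such storage reports for bookkeeping. This is a minimal sketch, assuming the command output keeps the column layout shown above; the invocation details are illustrative and require a configured DIRAC client environment.

    # Minimal sketch: parse the table printed by dirac-dms-storage-usage-summary
    # (assuming the layout shown above: header, dashed separator, then "SE size files" rows).
    import subprocess

    def storage_usage(directory):
        # Hypothetical invocation; needs a DIRAC client environment on the login node.
        out = subprocess.run(
            ["dirac-dms-storage-usage-summary", "--Dir=%s" % directory],
            capture_output=True, text=True, check=True,
        ).stdout
        usage = {}
        for line in out.splitlines():
            parts = line.split()
            # Data rows look like: "CERN-RAW  18.6  11629"
            if len(parts) == 3 and parts[1].replace(".", "", 1).isdigit():
                usage[parts[0]] = (float(parts[1]), int(parts[2]))
        return usage

    if __name__ == "__main__":
        for se, (size_tb, nfiles) in storage_usage("/lhcb/data/2010/RAW").items():
            print("%-12s %8.1f TB %8d files" % (se, size_tb, nfiles))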

  4. Reprocessing of the first data re-run several times due to several internal problems.
  • Pedestal of activity: due to many (and increasing) different users.
  • Need for an LHCb DAST.
  • No large MC activity, and in general no large activity (commissioning new workflows; real data is the focus).

  5. GGUS tickets (March/April/first week of May)
  • 91 GGUS tickets in total:
    • 8 normal tickets
    • 6 ALARM tickets (3 of them tests)
    • 77 TEAM tickets
  • 27 GGUS tickets with shared-area problems in total
  • 29 (real) GGUS tickets opened against T0/T1:
    • ALARM (CERN): FTS found not working…
    • ALARM (GRIDKA): no space left on the M-DST token
    • ALARM (CNAF): GPFS not working: all transfers failing
    • NL-T1: 8
    • CERN: 8
    • CNAF: 5
    • GRIDKA: 4
    • PIC: 3
    • IN2P3: 1
    • RAL: 0
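
As a sanity check, the per-site counts above do add up to the 29 tickets quoted against T0/T1; a trivial tally (counts copied from the slide, the check itself is only illustrative):

    # Tally of the per-site GGUS ticket counts quoted above (illustrative check only).
    tickets_per_site = {
        "NL-T1": 8, "CERN": 8, "CNAF": 5, "GRIDKA": 4, "PIC": 3, "IN2P3": 1, "RAL": 0,
    }
    assert sum(tickets_per_site.values()) == 29  # matches the "(real) tickets against T0/T1" total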

  6. (Most) worrying issues
  • FTS issue at CERN underlined the importance of effective communication:
    • the migration happened simultaneously with the LHCb Jamboree;
    • a simple alias change, as originally discussed, was not feasible in this case (service-draining issues);
    • explicit sign-off has already been implemented via the WLCG T1SCM, e.g. for the myproxy-fts retirement, and will be used for future migrations.
  • SVN service down and degraded performance.
  • NIKHEF file access issue for some users: files available and dccp works fine, but ROOT cannot open them.
    • m/w problem (see later), resolved by moving to dcap everywhere on dCache; also reported by ATLAS.
  • GridKA shared area: performance issue due to concurrent activity on the ATLAS software area; more NFS servers being added.
  • LHCb: banned PIC for two weeks for a missing option in the installation script (March).
  • CNAF and RAL: banned for a week because of a new connection string following a recent migration of their Oracle databases to new RACs.
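
To illustrate the protocol change behind the NIKHEF fix (not code from the report): with dCache the access protocol is encoded in the URL handed to ROOT, so moving from gsidcap to dcap means opening the same PNFS path through a different door. Host names and paths below are hypothetical placeholders.

    # Minimal PyROOT sketch of opening a dCache-resident file via the dcap protocol,
    # the workaround adopted after the gsidcap/ROOT incompatibility.
    import ROOT

    url = "dcap://dcache-door.example.org/pnfs/example.org/data/lhcb/some_file.dst"
    # Before the fix the same file would have been addressed via gsidcap, e.g.
    # "gsidcap://dcache-door.example.org/pnfs/example.org/data/lhcb/some_file.dst".
    f = ROOT.TFile.Open(url)
    if not f or f.IsZombie():
        raise RuntimeError("could not open %s" % url)
    print("opened", f.GetName())
    f.Close()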

  7. April: Production Mask. NB: the blue period marks unavailability, not a site issue.

  8. Site issues
  • CERN (March):
    • March 3rd: Offline CondDB schema corrupted. Needed to restore the schema to the previous configuration, but the apply process failed against 5 out of 6 T1s (all but CNAF).
    • March 7th: merging jobs at CERN affected: input data was not available (unavailable over the 3-day weekend).
    • March 11th: migration of central DIRAC services.
    • March 11th: LFC replication failing against all T1s according to SAM: a glitch with the AFS shared area prevented writing the lock files spawned by SAM.
    • March 17th: SVN reported to be extremely slow.
    • March 25th: started xrootd tests; found the server not properly set up (supporting only Kerberos).
    • March 29th: CASTOR: LHCb data written to the lhcbraw service had not been migrated for several days.
    • March 31st: FTS not working: wrong instance used.

  9. Site issues
  • CERN (April/May):
    • ~April: the issue with LFC-RO at CERN (due to the old CORAL LFC interface) has been fixed.
      • A patch was released and is now part of GAUDI. Good interaction between the experiment and the service providers.
      • The workaround based on a local DB-lookup XML file is still needed (some LHCb core applications are still based on the old GAUDI stack).
    • 15th: CASTOR data access problem: the lhcbmdst pool was running out of the maximum number of allowed parallel transfers (mitigated by adding more nodes).
      • Old issue of sizing pools in terms of number of servers (and not just TB provided); see the back-of-envelope sketch after this slide.
    • 29th: LHCb downstream capture for conditions was stuck for several hours.
    • May 4th: lost 72 files on the lhcb default pool. A diskserver was reinstalled and data not migrated since the 22nd of March was scrapped. In the end the loss was limited and only 10 files were unretrievable.
    • May 5th: default pool overloaded/unavailable due to an LXBATCH user putting too much load on it, combined with the pool's small size (5 diskservers, 200 transfers each).
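
A back-of-envelope illustration of the pool-sizing point, using the default-pool figures quoted above (the numbers come from the slide; the calculation is only illustrative):

    # Illustrative only: a pool's concurrent-transfer capacity scales with the number of
    # diskservers and their per-server transfer slots, not with the TB it provides.
    diskservers = 5          # default pool size quoted for May 5th
    slots_per_server = 200   # allowed parallel transfers per diskserver
    print(diskservers * slots_per_server)  # 1000 concurrent transfers, however many TB the pool holds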

  10. Site issues
  • CNAF: the large T2 resources were reported to be under-used. This is due to the LSF batch system and its fair-share mechanism [..]. It will be fixed by site-fine-tuned agents that submit directly at the sites.
    • 8th: problem with the LRMS, systematically failing submissions there.
    • 10th: StoRM upgraded; started to systematically fail the (critical) unit test: problem fixed with a new release of the unit-test code.
    • 18th: CREAM CE direct submission: FQAN issue. A first prototype is now working against CNAF.
    • 24th: too low a number of pool accounts defined for the pilot role.
    • 25th: StoRM problem with LCMAPS preventing data upload there.
    • 30th March: Oracle RAC intervention.
    • 8-14 April: site banned because the changed connection string to CondDB was preventing access, as APPCONFIG was not upgraded (the CondDB person was away).
    • 26th: CREAM CE failing all pilots: configuration issue.
    • 30th: glitch on StoRM.
    • May 5th: GPFS issue preventing data from being written to StoRM: ALARM ticket and problem fixed in ~1 hour.

  11. Site issues
  • GridKA:
    • 3rd: SQLite problems due to the usual nfslock mechanism getting stuck. The NFS server was restarted.
    • 5-14 (April): shared-area performance: site banned during real data taking. Concurrent high load from ATLAS; more hardware added to their NFS servers.
    • 26th April: MDST space full (ALARM ticket sent because of the missed response to the automatic notification over the weekend).
    • 28th April: PNFS to be restarted: 1 day off.
  • IN2P3:
    • In March, only instabilities of the SRM endpoint were reported, according to the SAM unit test.
    • 24-25 April: LRMS database down.
    • 26-27: major AFS issue.
    • SIGUSR1 signal sent instead of SIGTERM before sending SIGKILL (see the signal-handling sketch below).
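
For context on that last point (not code from the report): batch systems normally send a warning signal ahead of SIGKILL so the payload can shut down cleanly, and a job wrapper that only traps SIGTERM misses a site that sends SIGUSR1 instead. A minimal sketch of handling both, with the cleanup action purely hypothetical:

    # Minimal sketch: trap both SIGTERM and SIGUSR1 as "the batch system is about to kill us",
    # so the job can flush/upload its output before the unblockable SIGKILL arrives.
    import signal
    import sys
    import time

    def graceful_shutdown(signum, frame):
        # Hypothetical cleanup: close files, upload partial output, report status, ...
        print("received signal %d, shutting down cleanly" % signum)
        sys.exit(0)

    for sig in (signal.SIGTERM, signal.SIGUSR1):
        signal.signal(sig, graceful_shutdown)

    # Stand-in for the real payload.
    while True:
        time.sleep(1)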

  12. Site issues
  • NL-T1 (March):
    • March 1st: issue with the critical test for file access (after the test moved to the right version of ROOT, fixing the compatibility issue); due to a missing library (libgsitunnel) not deployed in the AA.
    • March 3rd: issue with HOME not set on some WNs (NIKHEF).
    • March 18th: NIKHEF reported a problem uploading output data from production jobs, no matter which destination.
    • March 20th: discovered that the LFC instance at SARA was wrongly configured (Oracle RAC issue).
    • March 30th: issue with accessing data at SARA-NIKHEF: discovered (many days later) to be due to a library incompatibility between Oracle and gsidcap, never spotted before because no activity was using CondDB and gsidcap concurrently (jobs either used just local SQLDB, or downloaded data first).

  13. Site issues
  • NL-T1 (April):
    • 1-13 April: site banned because of the issue with accessing data via gsidcap while concurrently accessing the ConditionDB. Found to be a clash of libraries; now working exclusively with dcap.
      • The issue had never been seen before because real data is the very first time LHCb accesses the ConditionDB and uses a file-access protocol simultaneously (usually data is downloaded first).
    • 27-28 April: NIKHEF CREAM CE issue killing the site. Received a patch to submit to $TMPDIR instead of $HOME (see the sketch below).
    • 29 April - 4 May: storage issue due to a variety of reasons (ranging from hardware to network, from some head nodes to an overloaded SRM).
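
To make the $TMPDIR/$HOME point concrete (an illustration, not the actual patch): a pilot or job wrapper can pick its scratch directory from $TMPDIR when the batch system provides one, falling back to $HOME only as a last resort, so shared home filesystems are not hammered.

    # Minimal sketch: choose a per-job scratch directory, preferring the batch system's
    # $TMPDIR over $HOME (the pattern behind the NIKHEF CREAM CE patch mentioned above).
    import os
    import tempfile

    scratch_root = os.environ.get("TMPDIR") or os.environ.get("HOME") or "/tmp"
    workdir = tempfile.mkdtemp(prefix="job_", dir=scratch_root)
    os.chdir(workdir)
    print("running in", workdir)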

  14. Site issues
  • PIC: banned for more than 2 weeks because of a problem with our own application.
    • March 18th: problem with the installation of one of the LHCb application software packages for SL5 64-bit. Resolved by using the --force option in the installation. Site banned 2 weeks for that.
    • March 29th: (announced) downtime causing major perturbation of user activities, because some of the critical DIRAC services are hosted there.
    • April 7th: issue with lhcbweb restarted accidentally. Backup host of the web portal at Barcelona University.
    • April 26-27: network intervention.
    • May 6th and 7th: accounting system hosted at PIC down twice in 24 hours.

  15. Site issues
  • RAL:
    • March 1st: disk server issue.
    • March 1st: issue with VOMS certificates.
    • March 9th: Streams replication apply-process failure.
    • 28th April: CASTOR Oracle DB upgrade.
    • 5-6 April: network issue.
    • 8-14 April: as for CNAF, the site was banned because the changed connection string was preventing access to the condition DB (upgraded APPCONFIG not available; the CondDB responsible was away).

  16. Outcome of the T1 Jamboree: highlights
  • Presentation of the computing model and the resources needed.
    • First big change about to come: T2-LAC facilities.
  • Interesting overview of DIRAC.
  • Plans: reconstruction/reprocessing/simulation.
    • Activity must be flexible depending on the LHC; sites should not have to ask each time for their CPUs to be occupied.
  • CREAM CE usage (direct submission about to come).
  • gLexec usage pattern in LHCb.
  • Most worrying issue at T1: file access, and possible solutions:
    • usage of xroot taken into consideration; testing it;
    • file download for production is the current solution;
    • parameter tuning at dCache sites (WLCG working group for file-access optimization);
    • for production, file download proved to be the best approach (despite some sites claiming it would be better to access data through the LAN);
    • test suite "hammer cloud style" to probe sites: READY (see the sketch below);
    • POSIX file access (LUSTRE and NFS 4.1).
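
A rough sketch of what a "hammer cloud"-style file-access probe can look like. This is entirely illustrative: the site names, file URLs and the use of PyROOT are assumptions, not details from the report.

    # Illustrative sketch of a hammer-cloud-style probe: for each site, try to open a test
    # file through its access protocol and record success and time-to-open.
    import time
    import ROOT

    # Hypothetical probe targets: (site name, test file URL).
    PROBES = [
        ("SITE-A", "dcap://door.site-a.example/pnfs/site-a.example/lhcb/probe.dst"),
        ("SITE-B", "root://xrootd.site-b.example//lhcb/probe.dst"),
    ]

    def probe(url):
        start = time.time()
        f = ROOT.TFile.Open(url)
        ok = bool(f) and not f.IsZombie()
        if f:
            f.Close()
        return ok, time.time() - start

    for site, url in PROBES:
        ok, dt = probe(url)
        print("%-8s %-4s %.2fs" % (site, "OK" if ok else "FAIL", dt))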

  17. Outcome of the T1 Jamboree: highlights
  • CPU/wallclock: sites reporting some inefficiency, with multiple reasons:
    • too many pilots submitted now that we are running in filling mode (a pilot commits suicide if no task is available, but only after a few minutes);
    • also problems with stuck connections (data upload, data streams with dcap/root servers, storage shortage, AFS outages, jobs hanging);
    • a very aggressive watchdog is in place that kills jobs that are stalled or no longer consuming CPU (i.e. <5% over a configurable number of minutes); see the sketch below.
  • Most worrying issue at T2 sites: shared area.
    • This is a critical service for LHCb and as such must be taken into account by sites.
  • Tape protection discussions.
  • T1 LRMS fair shares:
    • quick turn-around when there is low activity;
    • never fall down to zero.
  • Site round table on allocated resources and plans for 2010/2011.
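
A minimal sketch of such a low-CPU watchdog (illustrative only: the 5% threshold and the configurable window come from the slide; everything else, including the use of the third-party psutil package, is an assumption):

    # Minimal sketch of a watchdog that kills a job whose CPU usage stays below a threshold
    # (e.g. <5%) for a configurable number of minutes. Requires the third-party psutil package.
    import time
    import psutil

    def watchdog(pid, cpu_threshold=5.0, window_minutes=30, poll_seconds=60):
        proc = psutil.Process(pid)
        low_since = None
        while proc.is_running():
            cpu = proc.cpu_percent(interval=poll_seconds)  # % of one core over the poll interval
            if cpu < cpu_threshold:
                low_since = low_since or time.time()
                if time.time() - low_since > window_minutes * 60:
                    proc.kill()  # declare the job stalled and kill it
                    return
            else:
                low_since = None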

  18. Outcome of the T1 Jamboree: 2010 resources
  • CERN: full 2010 allocation in June or earlier. Full 2011 and 2012 allocations by April 1st of each year.
    • http://lcg.web.cern.ch/lcg/Resources/WLCGResources-2009-2010_12APR10.pdf (12 April): all resources seem to be declared as allocated.
  • CNAF: CPU to be ordered. Disk and tape: delivery in March.
  • FZK: it is assumed that CPU is fully allocated at the beginning of April. Disk and tape entirely allocated in May.
  • IN2P3: full 2010 allocation on 20/05/2010, plus T2 resources: 18.112 HEPSPEC-06 / 479 TB disk.
  • NL-T1: disk and tape fully available by the end of spring (<20th June).
  • PIC: plan to allocate the 2010 pledge by the end of April. Agreed to host 6% of MC data (an extra 50 TB).
  • RAL: full allocation in June; no problem foreseen in meeting it.
