
AMOD Report June 24-30, 2013

AMOD Report June 24-30, 2013. Torre Wenaus, BNL. July 2, 2013. Activities: stable operations, utilization tapering off on the weekend – few pending tasks; ~4.3M analysis jobs, 7M jobs total; ~560 analysis users. Ops issues in the week: recovering from BNL disk pool failure.


Presentation Transcript


  1. AMOD Report June 24-30, 2013. Torre Wenaus, BNL. July 2, 2013

  2. Activities
     • Stable operations, utilization tapering off on the weekend – few pending tasks
     • ~4.3M analysis jobs, 7M jobs total
     • ~560 analysis users
     • Ops issues in the week:
        • Recovering from BNL disk pool failure
        • Low disk space at many T1s, T2s
     [Plot label: 140k]

  3. Production & Analysis
     [Plots: Production; Analysis. Annotation: ~17k min – 37k max]

  4. Data transfers

  5. Tier 0, Central Services, ADC
     • Mon: HC DB problem from the previous weekend fixed with a DB restart. DB connections saturated when the session count grew with no release of connections. “A follow-up is being discussed.” GGUS:95033
     • Ongoing issue “CERN-PROD: file transfer failure from T2 sites due to SECURITY_ERROR” closed because it had been resolved in early June (as pointed out by Maria in the WLCG meeting). GGUS:92166
     • Smooth, incident-free interventions on Castor, Oracle production DBs, Bourricot, Tracer/Consistency Service
     • Problems (curl SSL failure) using pandamon cloud/site control on lxplus (SL6 issue?); experts investigating

  6. Tier 1
     • Mon: FZK-LCG2 transfer failures, “all ATLAS jobs/transfers are forced onto the same disk cluster, because all other disks are full to the brim. Consequently, the load cannot get distributed anymore and we now observe higher failure rate.” GGUS:95021
     • Mon-Thu: SARA-MATRIX problems with dest/source transfers. gPlazma service interruption, SRM problems fixed with restart. GGUS:95071
     • Tue: FZK-LCG2 missing AOD file reported; the site can find no trace of it, waiting for reply. GGUS:95092
     • Tue-Wed: IN2P3-CC file transfer failures, SRM crashed during the night, fixed in the morning with a restart. Site established auto recovery to avoid delays in such cases in the future. GGUS:95093

  7. Tier 1
     • Thu: BNL provided incident report on disk pool failure. Recovery worked on through the week
     • Fri-Sun: RAL-LCG2 DDM errors due to Castor problems on Fri, downtime over the weekend, cloud set brokeroff. Downtime ended Sunday when problems were resolved; restored to production. GGUS:95160
     • Sat: SARA-MATRIX storage errors due to full DATADISK, blacklisted. Space cleaned up over the weekend. GGUS:95175
     • Mon 7/1: IN2P3-CC NO_SPACE_LEFT errors but no auto blacklisting, site not publishing that it is full. Inconsistency in SRM DB found, bad space calculation, fixed. GGUS:95204
     • Several Tier 1s (and Tier 2s) over the week: low space. FZK, SARA, IN2P3

  8. Other
     • Clouds were running out of assigned tasks during the week. It would be very desirable to sustain a deeper todo queue of tasks.
        • [this was the first item on the ‘Other’ slide in my last (Feb) AMOD report; it still applies]
     • New manual whitelisting policy
        • Armen in last ADC weekly: “Consider an option of manual whitelisting (by expert shifter, AMOD), not reversible by SAAB. May be needed in some exceptional cases.”
        • Ueda has put this in place
           • “on” (whitelisting = ignore auto-exclusions) added as savannah site exclusion ticket option
           • dq2-set-location-status documentation for the “on” case added to the CentralizedSiteExclusion twiki

  9. Thanks!
     • Big thanks to the very attentive and effective ADCoS shifters
