
AMOD Report

ADC Weekly, 15/05/2012. AMOD Report. Simone Campana – CERN IT-ES.


Presentation Transcript


  1. ADC Weekly, 15/05/2012. AMOD Report. Simone Campana – CERN IT-ES

  2. CERN issues
     • Problem with the EOS namespace on Monday
     • It left many files on disk but not registered in the namespace
     • Problem cured on Tuesday, with some follow-up on Wednesday
     • AFAIK we lost no data

  3. T1 issues: NL-T1
     • SARA has been experiencing SRM overloads since last Tuesday (08/05/2012)
     • Failures are intermittent but annoying
     • The issue is locking in the PostgreSQL DB serving the namespace
     • Downtime scheduled for 23/05/2012 to move the DB to different hardware
       • Dedicated SAN, tuned for DB usage
     • Informed GDP so that they can decide whether throttling of NL activities is needed

  4. T1 issues: NL-T1
     • On Friday we spotted a problem in Canada-NIKHEF transfers
     • The problem had been there for at least 24h before we spotted it
     • Fixed by CA after some collaboration with NL
     • Masked by SARA's oscillating behavior, the absence of FTs (see later) and the LIP issue (see later)
     • The issue was not spotted by the OPN network people

  5. T1 issues: RAL
     • RAL storage started failing on Friday afternoon with authentication errors (CRL)
     • The CRL-fetch cron was run by hand once on Friday evening. This alleviated the problem but did not cure it (70% failures remaining)
     • The cron was re-run on Saturday, on both SRM and gridftp. This fixed the issues
     • Problem solved but not completely understood. RAL people are still investigating

  6. T1 issues
     • Problem with a disk server at ASGC
     • Data can be served, but the number of accesses to the server needs to be reduced
     • Some pools have been set read-only to reduce the traffic from puts
     • 140 TB of space will be unavailable for one week (not really an issue), but all files will remain available

  7. T1 issues – zero cost to AMOD
     • Many sites (generally T1s) are pretty good at finding problems before we (ADC) do
       • Problem with BNL storage on Friday
       • Problem with ASGC DPM on Tuesday
       • Problem with RAL network switch on Wednesday
     • Issues are normally reported through ELOG
       • Not passed to GGUS
       • Reported by AMOD at the ADC daily for the record
     • This relies a lot on a STRONG contact between site and ADC
       • Eh … if we could have a strong contact at every site…

  8. T1s non-issues
     • Long discussion on ANALY-PIC
       • Initially we thought no analysis was running at PIC because one CE was in downtime
       • In the end it turned out everything was normal
       • Only 5% analysis, because we asked the site to configure shares that way
     • Panda Sites and Queues need rationalization
       • Work in progress from A.DiGGi and A.Stradeling
       • Will ask for progress in the next ADC Dev (after CHEP)

  9. T2 issues
     • As always, T2s keep ADCoS very busy, but we are always discussing the same T2s
       • So the A/B/C/D categorization with a 1-month window is a good idea IMHO
     • Highlights of the week
       • Many T2s in the FR cloud (but in neither France nor Asia): problematic for the whole week. Ping-pong of blacklisting and whitelisting
       • One T2 in the ES cloud (but not in Spain) erased all data on its SE with local commands. Spectacular.

  10. Central Services
     • DDM SS moving to new hardware
     • Some unexpected SLS unavailability made Comp@P1 nervous
     • Backlog of datasets to freeze in Prodsys
       • Some quick fixes proposed by Rod, being followed up by Alexei. News?
     • DDM Functional Tests now run continuously
       • No 12h stop between Tuesday and Wednesday
       • This created issues in the transition phase: no T1-T1 and T1-T2 FTs for 2 days
     • Several DB blocking sessions over the weekend
       • In many DBs: ADCR (instance 1 and 2), ATLR, ATONR
       • Nothing observed from the application point of view
       • Roman and Gancho investigating

  11. Conclusions
     • An easy week, a somewhat complicated weekend
     • I insist, we need more “active” monitoring
       • Alarming, notifications …
     • I insist, we need a stronger contact with the sites
       • Particularly T2s
     • As usual, very nice work by Comp@P1 and ADCoS
