Oxford STEP09 Report: Storage Access Enhancement and Observations

Oxford STEP09 Report Ewan MacMahon/ Pete Gronbech HEPSYSMAN RAL 2nd July 2009

Storage access • Our main lesson from STEP09 was the big change in ATLAS’ use of storage from lots of small files to fewer large ones. • This moved the SE bottleneck from authentication on head node to bandwidth on the disk pools. • We implemented network channel bonding on the pools to immediate effect. • We had five disk pools taking most of the load; with more servers we should have higher overall bandwidth. Oxford STEP09 Report

Some pool nodes not stressed • Some smaller pool nodes were marked read only and hence had no STEP09 data Oxford STEP09 Report

Earlier tests stressed the head nodes network more. • STEP09 had little load on head node, and only sporadic network activity Oxford STEP09 Report

Data Transfer – small but successful • Oxford are on an 18% share of AODs. • MCDISK had over 10TB of old reconTotal: 337240 files in 548 datasets, size 10902.36GBc.f. Glasgow where we have 8 recon datasets. Oxford STEP09 Report

Observations - ATLAS • Apparently ATLAS were keeping an eLog, we weren’t aware of it (or forgot about it) until the test was essentially over. • ATLAS pilot jobs were coming in with Role=Production and Role=Pilot. It makes no sense and it breaks things; please stop. • ATLAS’ pilot factory got wedged by a dead CE. We have a dual redundant pair of CEs feeding the main SL4 system, t2ce02 and t2ce04, one of which died. Initially the factory was only using t2ce02, then, when reconfigured, refused to use t2ce04 because of the backlog of (dead) pilots that had been sent to t2ce02. • The direct (non rfcp file stager) access kills us; we can’t run many jobs like that at once. Oxford STEP09 Report

Observations - LHCb • We were sent very few jobs, as far as we know they ran fine. • Er. That’s it. Oxford STEP09 Report

What next? • It would be useful to run more tests: • on the individual access methods, one at a time, • Then a repeat of the mixture, but with different DNs for each method. • Likely upgrades: • More pools online. We’ve been waiting for DPM 1.7 to do some necessary maintenance. Once that’s done we should have about twelve active pool nodes rather than five. • 10Gb Ethernet? (Not soon, and probably only on new kit) Oxford STEP09 Report

Conclusions • Oxford has no local ATLAS expert so we suffered from a lack of awareness of the three different submission methods. • Have been working with Peter Love to understand why the pilot job submission method was failing. • We would like to have a post mortem meeting with Brian to make sure we understand what caused our other various problems. Oxford STEP09 Report

Oxford STEP09 Report: Storage Access Enhancement and Observations

Oxford STEP09 Report: Storage Access Enhancement and Observations

Presentation Transcript

Oxford

oxford plumbers,oxford plumber

Oxford University Particle Physics Site Report

Oxford University Particle Physics Site Report

CMS reprocessing tests at STEP09

Oxford PP Computing Site Report

Oxford University Particle Physics Grid Report

Oxford Particle Physics Site Report

Oxford Particle Physics Site Report

Oxford University Particle Physics Site Report

Oxford University Particle Physics Site Report

OXFORD

Report from Oxford on bus tapes

Analysis in STEP09 at TOKYO

Oxford

Oxford Particle Physics Site Report

OXFORD

Oxford University Particle Physics Site Report

CMS reprocessing tests at STEP09