Update on gLite WMS tests
Andrea Sciabà
WLCG-OSG-EGEE Operations meeting, September 21, 2006
Testing the gLite WMS
• RB installed with gLite 3.0.2 + various patches
• Dedicated machine at CERN (rb102.cern.ch)
  • 2 × Xeon 3.0 GHz
  • 4 GB of RAM
  • 3 RAID1 partitions for better I/O performance
• Closely monitored by GD, FIO and JRA1 people
• Tests run by CMS, GD and ATLAS
CMS test description
• Application
  • Fake analysis jobs (~30' of CPU time)
  • Run on CMS Tier-1s and Tier-2s
• Different submission methods
  • Network Server
  • WMProxy
  • Bulk submission (see the sketch below)
• Submission from 1-3 UIs in parallel
• VOMS proxies
• MyProxy renewal on
• Deep resubmission off
• Shallow resubmission ≤ 3
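As an illustration, a minimal Python sketch of what one such bulk submission could look like: it writes a JDL collection of 200 fake-analysis nodes and hands it to glite-wms-job-submit. The JDL attributes (Type, RetryCount, ShallowRetryCount, MyProxyServer, Requirements) and the CLI options -a/-o are quoted from memory; the script name, MyProxy host and CE choice are placeholders, not details taken from the test.

```python
#!/usr/bin/env python
"""Sketch of a bulk (collection) submission similar to the CMS test setup."""
import subprocess

CE_HOST = "ce01-lcg.cr.cnaf.infn.it"   # one CE from the test, chosen here as an example

def make_node(i):
    # One fake-analysis node of the collection (~30' of CPU time in the real test).
    return """  [
    Executable    = "fake_analysis.sh";              # placeholder script name
    Arguments     = "%d";
    StdOutput     = "job_%d.out";
    StdError      = "job_%d.err";
    InputSandbox  = {"fake_analysis.sh"};
    OutputSandbox = {"job_%d.out", "job_%d.err"};
  ]""" % (i, i, i, i, i)

def make_collection(n_jobs=200):
    nodes = ",\n".join(make_node(i) for i in range(n_jobs))
    return """[
  Type                = "collection";
  VirtualOrganisation = "cms";
  MyProxyServer       = "myproxy.example.org";       # renewal on; placeholder host
  RetryCount          = 0;                           # deep resubmission off
  ShallowRetryCount   = 3;                           # shallow resubmission <= 3
  Requirements        = other.GlueCEInfoHostName == "%s";
  Nodes               = {
%s
  };
]""" % (CE_HOST, nodes)

if __name__ == "__main__":
    with open("collection.jdl", "w") as jdl:
        jdl.write(make_collection())
    # Bulk submission through WMProxy (run voms-proxy-init -voms cms first);
    # -a = automatic proxy delegation, -o = file collecting the job identifiers.
    subprocess.call(["glite-wms-job-submit", "-a", "-o", "jobids.txt", "collection.jdl"])
```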
Latest results (I)
• Number of jobs: 3 UIs × 33 CEs × 200 jobs/collection ≈ 20000 jobs
• ~2.5 hours to submit all jobs
  • ~0.5 sec/job
• Submission failed for 6 collections
• ~17 hours to dispatch all jobs
  • Equivalent to ~26000 jobs/day (see the rate check below)
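As a rough cross-check of the quoted figures, the snippet below recomputes the rates. It assumes that the ~26000 jobs/day value excludes the 6 failed collections and that the 2.5 h and 17 h figures can be taken at face value; these assumptions are mine, not stated on the slide.

```python
# Rough sanity check of the quoted rates (assumptions noted in the lead-in).
n_ui, n_ce, jobs_per_collection = 3, 33, 200
submitted   = n_ui * n_ce * jobs_per_collection    # 19800, i.e. ~20000 jobs
dispatched  = submitted - 6 * jobs_per_collection  # minus the 6 failed collections
sec_per_job = 2.5 * 3600 / submitted               # ~0.45 s/job -> "~0.5 sec/job"
jobs_per_day = dispatched / 17.0 * 24              # ~26000 jobs/day
print(submitted, dispatched, round(sec_per_job, 2), int(round(jobs_per_day)))
```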
Latest results (II)
Job counts per CE and state (Sched = Scheduled, Done(S)/Done(F) = Done with success/failure, Abo = Aborted, Canc = Cancelled):

Site                          Submit    Wait   Ready   Sched     Run Done(S) Done(F)     Abo   Clear    Canc
cclcgceli02.in2p3.fr               0       0       0       0       0     200       0       0       0       0
ce01-lcg.cr.cnaf.infn.it           0       0       0       2     122       0       0      76       0       0
ce01-lcg.projects.cscs.ch          0       0       0     195       5       0       0       0       0       0
ce03-lcg.cr.cnaf.infn.it           0       0       0     200       0       0       0       0       0       0
ce04-lcg.cr.cnaf.infn.it           0      10       0       0       0       0      23       0       0     167
ce04.pic.es                        0       0       0       0       0     200       0       0       0       0
ce101.cern.ch                      0       0       0       0       0       0       0     200       0       0
ce102.cern.ch                      0       0       0       0       0       0       0     200       0       0
ce103.cern.ch                      0       9       0       0       0       0       1      16       0     174
ce104.cern.ch                      0      10       0       0       0       0      66      28       0      96
ce105.cern.ch                      0       0       0       0       0       0       0     200       0       0
ce106.cern.ch                      0       0       0       0       0       0       0     200       0       0
ceitep.itep.ru                     0       0       0     150       3      47       0       0       0       0
cmslcgce.fnal.gov                  0       0       0       0       0     200       0       0       0       0
cmsrm-ce01.roma1.infn.it           0       0       0     200       0       0       0       0       0       0
dgc-grid-40.brunel.ac.uk           0       0       0       0       0       0       0     200       0       0
egeece.ifca.org.es                 0       0       0       0       0     190      10       0       0       0
grid-ce1.desy.de                   0       0       0       1       0     199       0       0       0       0
grid-ce2.desy.de                   0       0       0     200       0       0       0       0       0       0
grid10.lal.in2p3.fr                0       0       0       0       0       0       0     200       0       0
grid109.kfki.hu                    0       0       0       0       0     189       0      11       0       0
gridba2.ba.infn.it                 0       0       0       0       1       0       0     199       0       0
gridce.iihe.ac.be                  0       9       0       0       0       0       3      15       0     173
gridce.pi.infn.it                  0       0       0     180      20       0       0       0       0       0
gw39.hep.ph.ic.ac.uk               0       0       0      86      11     103       0       0       0       0
lcg00125.grid.sinica.edu.tw        0       0       0     200       0       0       0       0       0       0
lcg02.ciemat.es                    0      10       0      12       2     150       2       0       0      24
lcg06.sinp.msu.ru                  0       1       0      34      11     154       0       0       0       0
lcgce01.gridpp.rl.ac.uk            0      10       0       0       0       0     158       0       0      32
lcgce01.jinr.ru                    0       1       0     199       0       0       0       0       0       0
polgrid1.in2p3.fr                  0       0       0       0       0       0       3     197       0       0
t2-ce-02.lnl.infn.it               0       0       0       0       0     200       0       0       0       0
Failure reasons
• Application errors
• Maradona errors
• "Got a job held event, reason: The PeriodicHold expression 'Matched =!= TRUE && CurrentTime > QDate + 900' evaluated to TRUE" errors
  • The WMS could not submit the job to a gLite CE
• Jobs remaining in Waiting status while Pending events are generated every 5 minutes with the error
  • mkfifo /tmp/…: File Exists
• Unspecified gridmanager error
  • Normally a batch system problem
  • Shallow resubmission often recovers the job, but if the error happens again the job is aborted (and sometimes appears as Cancelled)
• Authentication failed with the Belgian CE (expired CRL)
• Negligible fractions of other errors
  • Could not upload a sandbox file
  • Got a job held event, reason: Globus error 124: old job manager is still alive
  • Gatekeeper unreachable
(a sketch for tallying these reasons follows below)
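A sketch of how such failures could be tallied is given below: it scans the logging info of each job for the reason strings listed above and counts matches. glite-wms-job-logging-info is the real gLite command, but the -v option, its plain-text output format and the jobids.txt layout are assumptions here, so this is illustrative only.

```python
import subprocess
from collections import Counter

# Substrings taken from the failure reasons above, mapped to short labels.
BUCKETS = [
    ("maradona",                      "Maradona error"),
    ("periodichold",                  "PeriodicHold: WMS could not submit to gLite CE"),
    ("file exists",                   "Waiting: mkfifo /tmp/...: File Exists"),
    ("unspecified gridmanager error", "Unspecified gridmanager error (batch system)"),
    ("crl",                           "Authentication failed (expired CRL)"),
    ("globus error 124",              "Old job manager still alive"),
    ("upload",                        "Could not upload a sandbox file"),
    ("gatekeeper",                    "Gatekeeper unreachable"),
]

def classify(job_ids):
    """Tally failure reasons by scanning each job's logging info."""
    counts = Counter()
    for jid in job_ids:
        # -v 2 raises the verbosity; the plain-text output format is assumed here.
        text = subprocess.check_output(
            ["glite-wms-job-logging-info", "-v", "2", jid]).decode("utf-8", "replace")
        for needle, label in BUCKETS:
            if needle in text.lower():
                counts[label] += 1
    return counts

if __name__ == "__main__":
    # jobids.txt as written by glite-wms-job-submit -o (job IDs are https:// URLs).
    with open("jobids.txt") as f:
        ids = [line.strip() for line in f if line.startswith("https://")]
    for label, n in classify(ids).most_common():
        print("%6d  %s" % (n, label))
```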
Conclusions
• Very small fraction of failed jobs due to the WMS itself
  • Only the jobs remaining in Waiting status (O(100))
  • All other failures are due either to the application, to the CE, or to authentication problems (expired CRL)
• Performance seems to indicate a maximum rate of ~26000 jobs/day
  • Measured with "Job Robot" jobs; it may be different for other kinds of jobs
• The WMS looks reasonably fine now