
Update on gLite WMS tests

Andrea Sciabà, WLCG-OSG-EGEE Operations meeting, September 21, 2006. Testing the gLite WMS: RB installed with gLite 3.0.2 plus various patches on a dedicated machine at CERN (rb102.cern.ch) with 2 × Xeon 3.0 GHz, 4 GB of RAM, and 3 RAID1 partitions for better I/O performance.


Presentation Transcript


  1. Update on gLite WMS tests. Andrea Sciabà, WLCG-OSG-EGEE Operations meeting, September 21, 2006

  2. Testing the gLite WMS
     • RB installed with gLite 3.0.2 + various patches
     • Dedicated machine at CERN (rb102.cern.ch)
        • 2 × Xeon 3.0 GHz
        • 4 GB of RAM
        • 3 RAID1 partitions for better I/O performance
     • Closely monitored by GD, FIO and JRA1 people
     • Tests run by CMS, GD, ATLAS

  3. CMS test description
     • Application
        • Fake analysis jobs (~30 min of CPU time)
        • Run on CMS Tier-1s and Tier-2s
     • Different submission methods
        • Network server
        • WMProxy
        • Bulk submission
     • Submission from 1-3 UIs in parallel
     • VOMS proxies
     • MyProxy renewal on
     • Deep resubmission off
     • Shallow resubmission ≤ 3
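The resubmission settings above map onto attributes of the job description (JDL) file. The following is a minimal sketch, not the configuration used in the tests: it assumes the JDL attributes `RetryCount` (deep resubmission), `ShallowRetryCount` and `MyProxyServer`, and the executable name and MyProxy host are hypothetical placeholders.

```python
def make_jdl(executable, shallow_retries=3, deep_retries=0,
             myproxy_server="myproxy.example.org"):
    """Return a JDL job description as text.

    The executable and MyProxy host are placeholders, not values
    taken from the presentation.
    """
    if not 0 <= shallow_retries <= 3:
        raise ValueError("the tests used at most 3 shallow resubmissions")
    attributes = [
        ("Executable", f'"{executable}"'),
        ("RetryCount", str(deep_retries)),            # deep resubmission off
        ("ShallowRetryCount", str(shallow_retries)),  # shallow resubmission <= 3
        ("MyProxyServer", f'"{myproxy_server}"'),     # enables proxy renewal
    ]
    body = "\n".join(f"  {name} = {value};" for name, value in attributes)
    return "[\n" + body + "\n]"


jdl_text = make_jdl("fake_analysis.sh")
print(jdl_text)
```

Generating the text from a function makes it easy to vary the retry counts between test runs while keeping the rest of the description fixed.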

  4. Latest results (I)
     • No. of jobs = 3 UIs × 33 CEs × 200 jobs/collection ≈ 20000 jobs
     • ~2.5 hours to submit all jobs
        • ~0.5 s/job
        • Submission failed for 6 collections
     • ~17 hours to dispatch all jobs
        • Equivalent to ~26000 jobs/day
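The figures on this slide can be cross-checked with a few lines of arithmetic. This is a sketch under one assumption that the slide does not state: that the ~26000 jobs/day rate counts only the jobs from collections whose submission succeeded. With all 19800 jobs the rate would come out closer to 28000 jobs/day.

```python
N_UI = 3
N_CE = 33
JOBS_PER_COLLECTION = 200
FAILED_COLLECTIONS = 6
SUBMIT_HOURS = 2.5
DISPATCH_HOURS = 17

# One collection of 200 jobs per (UI, CE) pair.
total_jobs = N_UI * N_CE * JOBS_PER_COLLECTION

# Average submission time per job, in seconds.
submit_seconds_per_job = SUBMIT_HOURS * 3600 / total_jobs

# Assumption: jobs in the 6 failed collections never reached the WMS.
accepted_jobs = total_jobs - FAILED_COLLECTIONS * JOBS_PER_COLLECTION

# Dispatch rate extrapolated to 24 hours, with and without that assumption.
jobs_per_day_all = total_jobs / DISPATCH_HOURS * 24
jobs_per_day_accepted = accepted_jobs / DISPATCH_HOURS * 24

print(total_jobs, round(submit_seconds_per_job, 2))
print(round(jobs_per_day_all), round(jobs_per_day_accepted))
```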

  5. Latest results (II)

     Columns are job states: Submit = Submitted, Wait = Waiting, Sched = Scheduled, Run = Running, Done(S) = Done (success), Done(F) = Done (failed), Abo = Aborted, Clear = Cleared, Canc = Cancelled.

     Site                          Submit    Wait   Ready   Sched     Run Done(S) Done(F)     Abo   Clear    Canc
     cclcgceli02.in2p3.fr               0       0       0       0       0     200       0       0       0       0
     ce01-lcg.cr.cnaf.infn.it           0       0       0       2     122       0       0      76       0       0
     ce01-lcg.projects.cscs.ch          0       0       0     195       5       0       0       0       0       0
     ce03-lcg.cr.cnaf.infn.it           0       0       0     200       0       0       0       0       0       0
     ce04-lcg.cr.cnaf.infn.it           0      10       0       0       0       0      23       0       0     167
     ce04.pic.es                        0       0       0       0       0     200       0       0       0       0
     ce101.cern.ch                      0       0       0       0       0       0       0     200       0       0
     ce102.cern.ch                      0       0       0       0       0       0       0     200       0       0
     ce103.cern.ch                      0       9       0       0       0       0       1      16       0     174
     ce104.cern.ch                      0      10       0       0       0       0      66      28       0      96
     ce105.cern.ch                      0       0       0       0       0       0       0     200       0       0
     ce106.cern.ch                      0       0       0       0       0       0       0     200       0       0
     ceitep.itep.ru                     0       0       0     150       3      47       0       0       0       0
     cmslcgce.fnal.gov                  0       0       0       0       0     200       0       0       0       0
     cmsrm-ce01.roma1.infn.it           0       0       0     200       0       0       0       0       0       0
     dgc-grid-40.brunel.ac.uk           0       0       0       0       0       0       0     200       0       0
     egeece.ifca.org.es                 0       0       0       0       0     190      10       0       0       0
     grid-ce1.desy.de                   0       0       0       1       0     199       0       0       0       0
     grid-ce2.desy.de                   0       0       0     200       0       0       0       0       0       0
     grid10.lal.in2p3.fr                0       0       0       0       0       0       0     200       0       0
     grid109.kfki.hu                    0       0       0       0       0     189       0      11       0       0
     gridba2.ba.infn.it                 0       0       0       0       1       0       0     199       0       0
     gridce.iihe.ac.be                  0       9       0       0       0       0       3      15       0     173
     gridce.pi.infn.it                  0       0       0     180      20       0       0       0       0       0
     gw39.hep.ph.ic.ac.uk               0       0       0      86      11     103       0       0       0       0
     lcg00125.grid.sinica.edu.tw        0       0       0     200       0       0       0       0       0       0
     lcg02.ciemat.es                    0      10       0      12       2     150       2       0       0      24
     lcg06.sinp.msu.ru                  0       1       0      34      11     154       0       0       0       0
     lcgce01.gridpp.rl.ac.uk            0      10       0       0       0       0     158       0       0      32
     lcgce01.jinr.ru                    0       1       0     199       0       0       0       0       0       0
     polgrid1.in2p3.fr                  0       0       0       0       0       0       3     197       0       0
     t2-ce-02.lnl.infn.it               0       0       0       0       0     200       0       0       0       0
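A per-site summary can be derived from rows of this table. The sketch below uses four rows copied from the table as an excerpt. The "success fraction" defined here (Done(S) divided by all jobs in the row) is an illustrative choice and not necessarily the efficiency definition used in the efficiency tables of the presentation.

```python
STATES = ["Submit", "Wait", "Ready", "Sched", "Run",
          "Done(S)", "Done(F)", "Abo", "Clear", "Canc"]

# Four rows copied from the table above; the full table has 32 sites.
rows = {
    "cclcgceli02.in2p3.fr": [0, 0, 0, 0, 0, 200, 0, 0, 0, 0],
    "ce104.cern.ch":        [0, 10, 0, 0, 0, 0, 66, 28, 0, 96],
    "lcg02.ciemat.es":      [0, 10, 0, 12, 2, 150, 2, 0, 0, 24],
    "polgrid1.in2p3.fr":    [0, 0, 0, 0, 0, 0, 3, 197, 0, 0],
}

IN_FLIGHT_STATES = ("Submit", "Wait", "Ready", "Sched", "Run")


def summarize(counts):
    """Return total jobs, jobs not yet finished, and the Done(S) fraction."""
    by_state = dict(zip(STATES, counts))
    total = sum(counts)
    in_flight = sum(by_state[state] for state in IN_FLIGHT_STATES)
    return {
        "total": total,
        "in_flight": in_flight,
        "success_fraction": by_state["Done(S)"] / total,
    }


summary = {site: summarize(counts) for site, counts in rows.items()}
for site, info in summary.items():
    print(f"{site:28s} {info['total']:4d} {info['in_flight']:4d} "
          f"{info['success_fraction']:.2f}")
```

Jobs still in Scheduled or Running state at snapshot time lower this fraction without being failures, so the number is only meaningful once a site has no jobs in flight.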

  6. Failure reasons
     • Application errors
     • Maradona errors
     • “Got a job held event, reason: "The PeriodicHold expression 'Matched =!= TRUE && CurrentTime > QDate + 900' evaluated to TRUE"” errors
        • The WMS could not submit the job to a gLite CE
     • Jobs remaining in Waiting status while Pending events are generated every 5 minutes with error
        • Mkfifo /tmp/…: File Exists
     • Unspecified gridmanager error
        • Normally a batch system problem
     • Shallow resubmission often recovers, but if the error happens again, the job is aborted (but sometimes appears as Cancelled)
     • Authentication failed with Belgian CE (CRL expired)
     • Negligible fractions of other errors
        • Could not upload a sandbox file
        • Got a job held event, reason: Globus error 124: old job manager is still alive
        • Gatekeeper unreachable
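The PeriodicHold expression quoted above is a Condor ClassAd expression: a job is put on hold if it has not been matched within 900 seconds of its queue date. A minimal Python sketch of that logic follows. The function and parameter names are illustrative, and `None` is used here to model an undefined `Matched` attribute, since the ClassAd `=!=` operator ("is not identical to") also evaluates to true in that case.

```python
def periodic_hold(matched, current_time, qdate, timeout=900):
    """Mimic 'Matched =!= TRUE && CurrentTime > QDate + 900'.

    matched: True, False, or None (None models an undefined attribute).
    current_time, qdate: times in seconds since the epoch.
    """
    # '=!=' is true unless the attribute is exactly TRUE,
    # so both False and undefined count as "not matched".
    not_identical_to_true = matched is not True
    return not_identical_to_true and current_time > qdate + timeout


# A job queued at t=0 that is still unmatched after 15 minutes is held.
print(periodic_hold(None, current_time=901, qdate=0))
# A matched job is never held by this expression.
print(periodic_hold(True, current_time=5000, qdate=0))
```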

  7. Efficiency table (I)

  8. Efficiency table (II)

  9. Conclusions
     • Very small fraction of failed jobs due to the WMS
        • Only those remaining in Waiting status (O(100))
        • All other failures are due either to the application, to the CE or to authentication problems (expired CRL)
     • Performance seems to indicate a maximum rate of ~26000 jobs/day
        • Measured with “Job Robot” jobs; the rate may be different for other kinds of jobs
     • The WMS looks reasonably fine now
