
TESTING FAX USING SSS and FDR datasets



Presentation Transcript


  1. TESTING FAX USING SSS and FDR datasets 2nd April 2013

  2. DETAILS
     • Dataset: user.flegger.*.data12_8TeV.00212172.physics_Muons.merge.NTUP_SMWZ.f479_m1228_p1067_p1141_tid01007411_00 (500 GB)
     • WNs: UC3 and UCT3
     • Discovery: global redirector
     • Running against: fax.mwt2.org
     • Ramp-up: 4 jobs per minute
     • Full data copy, split into 138 jobs for each site
     • Average input size: 3.62 GB
     • Duration does not include the time for a job to start
     • Duration does not include dq2-put time
     Ilija Vukotic ivukotic@uchicago.edu
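The figures on this slide can be cross-checked with simple arithmetic. The Python sketch below assumes that jobs are released at a steady 4 per minute and that every job reads exactly the average 3.62 GB; the function names are illustrative and not part of the actual test harness.

```python
import math

JOBS_PER_SITE = 138      # from the slide
AVG_INPUT_GB = 3.62      # from the slide
RAMP_UP_PER_MIN = 4      # from the slide


def total_input_gb(jobs: int, avg_input_gb: float) -> float:
    """Total data read, assuming every job reads the average input size."""
    return jobs * avg_input_gb


def ramp_up_minutes(jobs: int, jobs_per_minute: int) -> int:
    """Whole minutes needed to release all jobs at a steady rate (assumed)."""
    return math.ceil(jobs / jobs_per_minute)


if __name__ == "__main__":
    print(f"total input: {total_input_gb(JOBS_PER_SITE, AVG_INPUT_GB):.1f} GB")
    print(f"ramp-up: {ramp_up_minutes(JOBS_PER_SITE, RAMP_UP_PER_MIN)} min")
```

Under these assumptions the total comes to about 499.6 GB, in line with the quoted 500 GB, and all jobs are released within roughly 35 minutes.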

  3. Jobs

  4. MWT2
     • 2 jobs hung: they finished with no error, but only the next day
     • UCT3 shows the same efficiency as UC3
     • Avg. CPU efficiency: 76.5%
     • Avg. duration: 5:59
     • Avg. rate: 290 kB/s
     • Total rate: 39 MB/s

  5. AGLT2
     • 4 jobs hung: they finished with no error, but only the next day
     • Avg. CPU efficiency: 70.5%
     • Avg. duration: 6 h 14 min
     • Avg. rate: 165 kB/s
     • Total rate: 22 MB/s

  6. BU
     • 18 jobs hung
     • Avg. CPU efficiency: 35%
     • Avg. duration: 11 h 2 min
     • Avg. rate: 108 kB/s
     • Total rate: 14 MB/s

  7. MWT2 (300 branches)
     • 48 jobs in parallel
     • Avg. CPU efficiency: 17%
     • Avg. duration: 3 h 20 min
     • Avg. rate: 926 kB/s
     • Total rate: 44 MB/s
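The per-job and total rates quoted on slides 4 to 7 look mutually consistent if the total is taken to be the per-job average multiplied by the number of jobs running at once. The Python sketch below checks that reading. Two things are assumptions here, not statements from the slides: that all 138 jobs of a site ran concurrently in the first three tests (48 is quoted for the last one), and that 1 MB = 1000 kB.

```python
# (label, avg. per-job rate in kB/s, concurrent jobs, reported total in MB/s)
# The concurrency of 138 is an assumption; 48 is quoted on the slide.
TESTS = [
    ("MWT2", 290, 138, 39),
    ("AGLT2", 165, 138, 22),
    ("BU", 108, 138, 14),
    ("MWT2, 300 branches", 926, 48, 44),
]


def estimated_total_mb_s(avg_kb_s: float, concurrent_jobs: int) -> float:
    """Aggregate rate if every concurrent job reads at the average rate."""
    return avg_kb_s * concurrent_jobs / 1000.0


for label, avg_kb_s, jobs, reported in TESTS:
    estimate = estimated_total_mb_s(avg_kb_s, jobs)
    print(f"{label:20s} estimated {estimate:5.1f} MB/s, reported {reported} MB/s")
```

Each estimate lands within a few percent of the reported total, slightly above it in every case. Hanging jobs that were not reading for part of the run may account for the difference, but the slides do not say so.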

  8. Conclusion 1
     • Rechecked that dq2-put times were not included.
     • Times seem to be properly measured.
     • Need to solve the mystery of the huge CPU times.
     • May have to move to the C++ version.

  9. SSS doing XRDCP
     • The same dataset.
     • But doing a simple xrdcp to /dev/null.
     • Up to 290 jobs in parallel (UC3 and UCT3)
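A copy-to-/dev/null test of this kind can be sketched in Python as below. This is not the script used for the test: the helper names are hypothetical, the choice of glrd.usatlas.org as the default endpoint is an assumption (the slides name it as the global redirector but do not say which endpoint the copy jobs used), and the file path passed in would have to be a real path in the federation.

```python
import subprocess
import time
from typing import List, Tuple

REDIRECTOR = "glrd.usatlas.org"  # assumed default; named in the slides


def xrdcp_command(path: str, redirector: str = REDIRECTOR) -> List[str]:
    """Build an xrdcp command that reads a remote file and discards the bytes.

    `path` is expected to start with "/", which yields the usual
    root://host//absolute/path form of an XRootD URL.
    """
    url = f"root://{redirector}/{path}"
    # -f lets xrdcp write to an existing target, here /dev/null
    return ["xrdcp", "-f", url, "/dev/null"]


def timed_copy(path: str, timeout_s: int = 3600) -> Tuple[int, float]:
    """Run the copy and return (exit code, elapsed seconds).

    Raises subprocess.TimeoutExpired if the copy exceeds timeout_s.
    """
    start = time.monotonic()
    result = subprocess.run(xrdcp_command(path), timeout=timeout_s)
    return result.returncode, time.monotonic() - start
```

Dividing the file size by the elapsed time from `timed_copy` gives a per-file rate that involves no analysis code, which is what makes this test a useful baseline against the job-based measurements.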

  10. SSS doing XRDCP
      • Wanted to test all sites that are in FAX and have the FDR dataset.
      • Most did not work:
        • When asked through glrd.usatlas.org.
        • Some of them even when asked directly.
        • Some work for 5-10 files but then give up.
        • Some work on repeated queries.
      • ML monitor not adequate anymore.
      • CERN, some UK sites sending all traffic
      • Something strange with AGLT2 numbers
      • Something wrong with ML

  11. SSS doing XRDCP
      Errors are mostly:
      • Last server error 10000 ('')
      • Error accessing path/file for … (BNL)
      A very strange error in setting up the environment, not FAX related:
        Created //.asetup. Please look and (optional) edit it.
        AtlasSetup(WARNING): Unable to write ${HOME} save file
        mkdir: cannot create directory `//workarea': Permission denied
        /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/utilities/createUserASetup.sh: line 40: //.asetup: Permission denied

  12. Results

  13. Conclusion 2
      • Automatic tests for SSB are not enough.
      • In the absence of users who would report problems, additional manual checks will be needed from time to time.
      • Monitoring needs to be validated from beginning to end.
      • Huge differences in rates: need a cost matrix ASAP.
      • Rates observed sound reasonable.
      • Our understanding would benefit hugely from perfSONAR tests over the same links.

  14. TESTING FAX USING HC and FDR datasets 2nd April 2013

  15. 20019750
      • RC pilot
      • Data from SLAC only

  16. 20019750
      • 7 worked
      • 3 did not start
      • 4 failed

  17. 20019750
      • SWT2_CPB: Log put error: Error copying the file: 256, cp: cannot create regular file /xrd/atlasuserdisk/user.gangarbt.hc20019750.ANALY_SWT2_CPB.25/user.gangarbt.32893735._
      • SLAC: Put error: Error copying the file: 256, cp: accessing `/xrootd/atlas/atlasuserdisk/user.gangarbt.hc20019750.ANALY_SLAC.43/user.gangarbt.32887595.EXT0._00418.HWWSkimmedNTUP.root?oss.cgroup=ATLASUSERDISK': Transport endpoint is not connected
      • QMUL: Get error: Staging input file failed
      • MWT2: Download: 2444 seconds
      • ROMA1: Finished: 44, Timed out: 12
      • FZK: Finished: 4, Timed out: 46. Get error: Staging input file failed
      • ECDF: Finished: 36, Failed: 11. pilotErrorDiag: Too little space left on local disk to run job
      • CERN: Get error: Staging input file failed
      • BU: Finished: 23, Failed: 12. Not enough local space for staging input files and run the job
      • AGLT: Finished: 17
      • BNL: Finished: 231, Failed: 8 (lost heartbeat or unspecified)
      • OU_OCHEP_SWT2, JINR, FZU: did not start
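For the sites on slide 17 that report both finished and unsuccessful job counts, a success fraction makes the spread easier to compare. The Python sketch below uses only the counts quoted on the slide; treating "timed out" and "failed" as the same kind of outcome is an assumption made for this comparison, and the function names are illustrative.

```python
# site -> (finished, timed out or failed), as quoted on the slide
COUNTS = {
    "ROMA1": (44, 12),
    "FZK": (4, 46),
    "ECDF": (36, 11),
    "BU": (23, 12),
    "BNL": (231, 8),
}


def success_fraction(finished: int, unsuccessful: int) -> float:
    """Fraction of jobs that finished; 0.0 if no jobs were counted."""
    total = finished + unsuccessful
    return finished / total if total else 0.0


def worst_first(counts=COUNTS):
    """Site names ordered from lowest to highest success fraction."""
    return sorted(counts, key=lambda site: success_fraction(*counts[site]))


for site in worst_first():
    finished, unsuccessful = COUNTS[site]
    print(f"{site:6s} {success_fraction(finished, unsuccessful):6.1%} "
          f"({finished} of {finished + unsuccessful})")
```

With these counts FZK comes out lowest (4 of 50 jobs finished) and BNL highest (231 of 239), which separates sites that mostly work from sites that need attention first.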

  18. 20019750

  19. 20019749
      • RC pilot
      • Data from anywhere

  20. 20019749
      The same idea as 20019750, but with many more sites and random files: user.flegger.*…
      Did not work as I expected: each site always ran against a randomly chosen, but always the same, dataset.

  21. 20019749

  22. Conclusion 3
      • While there are many failures, some seem easy to fix (not enough space on disk, etc.).
      • Some are the same ones observed in the SSS-based tests.
      • We need to look at performance. Often it is better to fail than to have very low performance. How low is unacceptably low?
      • Need to start looking at sites that are not part of FAX.

  23. Direct FDR HC jobs

  24. Conclusion
      • Testing:
        • Need faster turnaround.
        • Would it help:
          • Every 6 hours, one HC-submitted job at each ANALY queue
          • Against a very stable door
        • With the tools we have now there is no way to precisely stress test sites.
        • Fill in the table on slide 21; make it green.
      • Monitoring:
        • ML almost useless now.
        • Need full validation, especially of the CERN FAX dashboard.

  25. Systematic FDR load tests in progress: US cloud results
      10 jobs * 10 SMWZ files ~ 50 GB
      CPU limited
      Factors affecting spreads: pair-wise network latency, throughput, storage “business”

  26. Systematic FDR load tests in progress US cloud results

  27. Systematic FDR load tests in progress EU cloud results

  28. Systematic FDR load tests in progress EU cloud results
