The new CERN Analysis Facility (CAF) introduces significant enhancements over its predecessor. This document outlines the features, processing rates, and current statistics of CAF2, comparing it to CAF1. Key aspects include the upgraded user support, processing capabilities, and improved management of datasets, disks, and CPUs. The new system uses GSI-based authentication and dynamic load balancing for efficient operation. Statistics indicate higher processing speeds with fewer workers and underline the importance of resource fairness among user groups.
New CERN CAF facility: parameters, usage statistics, user support Marco MEONI, Jan Fiete GROSSE-OETRINGHAUS CERN - Offline Week – 24.10.2008
Outline • New CAF: features • CAF1 vs CAF2 • Processing Rate comparison • Current Statistics • Users, Groups • Machines, Files, Disks, Datasets, CPUs • Staging problems • Conclusions
New CAF • Timeline • 28.09 startup of the new CAF cluster • 01.10 1st day with users on the new cluster • 07.10 old CAF decommissioned by IT • Usage • 26 workers instead of 33 (but much faster, see later) • Head node is « alicecaf » instead of « lxb6046 » • GSI-based authentication, AliEn certificate needed • Announced since July, but many last-minute users whose AliEn account differs from their AFS account or whose server certificate is unknown • Datasets cleaned up; only the latest data production staged (First physics - stage 3) • AF v4-15 meta package redistributed
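For reference, connecting to the new head node from a ROOT session might look like the sketch below (a minimal example assuming the standard ROOT PROOF client; the exact connection string may differ per setup, GSI authentication is negotiated using the user's Grid certificate):

   // connect_caf.C -- minimal sketch: open a PROOF session on the new CAF head node
   void connect_caf()
   {
      // "alicecaf" is the new head node (was "lxb6046" on the old cluster)
      TProof *p = TProof::Open("alicecaf.cern.ch");
      if (!p || !p->IsValid()) {
         Error("connect_caf", "could not open a PROOF session");
         return;
      }
      p->Print();   // show session info, number of workers, etc.
   }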
Technical Differences • Cmsd (Cluster Management Service Daemon) • Why? olbd is no longer supported • What? Dynamic load balancing of files and of the data name-space • How? The stager daemon benefits from: • bulk prepare replaces the per-file touch • bulk prepare allows co-locating files on the same node • GSI authentication • Secure communication using user certificates and LDAP-based configuration management
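As an illustration of the bulk-prepare idea, a staging request can be issued for a whole list of files in one call rather than file by file (a sketch assuming ROOT's generic TFileStager interface; the endpoint and file paths are placeholders, the real CAF stager daemon drives this internally):

   // stage_bulk.C -- sketch: bulk staging request instead of touching files one by one
   void stage_bulk()
   {
      // Stager for the xrootd backend behind CAF (placeholder endpoint)
      TFileStager *stager = TFileStager::Open("root://alicecaf.cern.ch");
      if (!stager) return;

      // Collect the files to be staged and submit them in a single bulk prepare
      TList files;
      files.Add(new TObjString("root://alicecaf//placeholder/path/file1.root"));
      files.Add(new TObjString("root://alicecaf//placeholder/path/file2.root"));
      files.SetOwner(kTRUE);

      stager->Stage(&files);   // one request for the whole list
   }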
Architectural Differences • Why « only » 26 workers? • The cluster exposes up to 104 worker slots, so you could use 104 if you are alone • With 26 workers per user, 4 users can effectively run concurrently at full capacity (4 × 26 = 104) • Estimated average of 8 concurrent users… • Processing units are 6.5x faster than on the old CAF
Outline • CAF2: features • CAF1 vs CAF2 • Processing Rate comparison • Current Statistics • Users, Groups • Machines, Files, Disks, Datasets, CPUs • Staging problems • Conclusions
CAF1 vs CAF2 (Processing Rate) • Test Dataset • First physics (stage 3) pp, Pythia6, 5 kG (0.5 T), 10 TeV • /COMMON/COMMON/LHC08c11_10TeV_0.5T • 1840 files, 276k events • Tutorial task that runs over ESDs and displays the Pt distribution • Other comparison test: RAW data reconstruction (Cvetan)
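For orientation, running such a task on the test dataset could look roughly like this (a sketch assuming a generic TSelector-based tutorial task; the selector name is a placeholder, the dataset name is the one quoted above):

   // run_pt_task.C -- sketch: process the test dataset on CAF with a tutorial selector
   void run_pt_task()
   {
      TProof *p = TProof::Open("alicecaf.cern.ch");
      if (!p || !p->IsValid()) return;

      // "MyPtSelector.C" is a placeholder for the tutorial task that reads ESDs
      // and fills a Pt histogram
      p->Process("/COMMON/COMMON/LHC08c11_10TeV_0.5T", "MyPtSelector.C+");
   }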
Reminder • The test depends on the file distribution of the dataset used • Parallel code: • Creation of workers • File validation (workers opening the files) • Event loop (execution of the selector on the dataset) • Serial code: • Initialization of PROOF master, session and query objects • File lookup • Packetizer (distribution of file slices) • Merging (the biggest task)
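A rough way to see why the serial steps (merging in particular) matter for the final figure: the observed average rate is limited by them, roughly (a back-of-the-envelope relation, not a number taken from the measurements):

\[ R_{\text{avg}} = \frac{N_{\text{events}}}{T_{\text{serial}} + T_{\text{parallel}}}, \qquad T_{\text{parallel}} \approx \frac{T_{\text{parallel}}^{(1\,\text{worker})}}{N_{\text{workers}}} \]

so adding workers shrinks only the second term, and the serial part sets the ceiling.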
Processing Rate Comparison (1) • The final average rate is the only important figure • The final tail reflects workers stopping one by one • data unevenly distributed • A longer tail indicates a worker overloaded on the last packet(s) • At most 3 workers help on the same « slow » packet • [Plots: processing rate with 104 workers, 200k events and with 104 workers, 276k events]
Processing Rate Comparison (2) • [Plots: events/sec and MB/sec vs. number of events, for 104, 33 and 26 workers]
Outline • CAF2: features • CAF1 vs CAF2 • Processing Rate comparison • Current Statistics • Users/Groups • Machines, Files, Disks, Datasets, CPUs • Staging problems • Conclusions
CAF Usage • Available resources on CAF must be shared fairly • Disk and CPU usage receive the highest attention • Users are grouped (sub-detectors / physics working groups) • Each group • has a disk space quota used to stage datasets from AliEn • has a CPU fairshare target (priority) to regulate concurrent queries
CAF Groups • 19 registered groups • 145 (60) registered users • Numbers in brackets refer to the situation at the previous Offline Week
File Distribution • Max: 1863 files per node, Min: 1727 (max difference: 8%) • Nodes with more files can produce tails in the processing rate • Above a defined threshold, no further files are stored on a node
Disk Usage • Min: 105, Max: 116 per node • Max difference: 10%
Dataset Monitoring • 28 TB disk space for staging • PWG0: 4 TB • PWG1: 1 TB • PWG2: 1 TB • PWG3: 1 TB • PWG4: 1 TB • ITS: 0.2 TB • COMMON: 2 TB
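Users can inspect what is staged for their group directly from a PROOF session (a sketch using the standard dataset commands of the ROOT PROOF client; the dataset shown is the test dataset from the comparison above):

   // show_datasets.C -- sketch: inspect datasets registered on CAF from a ROOT session
   void show_datasets()
   {
      TProof *p = TProof::Open("alicecaf.cern.ch");
      if (!p || !p->IsValid()) return;

      p->ShowDataSets();                                      // list all registered datasets
      p->ShowDataSet("/COMMON/COMMON/LHC08c11_10TeV_0.5T");   // staging status of one dataset
   }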
CPU Quotas • The default group is no longer the top CPU consumer
Outline • CAF2: features • CAF1 vs CAF2 • processing rate comparison • Current Statistics • Users, Groups • Machines, Files, Disks, Datasets, CPUs • File Staging • Conclusions
File Stager • CAF intensively uses 'prepare' • 0-size files in Castor2 cannot be staged, but their replicas are ok • A check at stager level avoids spawning endless prepare requests for the same empty file that never gets online • Loop over the replicas (the CERN replica, if any, is taken first): • replica[i] in Castor && size == 0? → skip it and try the next replica • replica[i] not staged? → copy the replica (API service); if the file turns out to be corrupted, skip it • otherwise → add it to StageLIST and stop looping over replicas for this file • Finally, stage StageLIST in one bulk request
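The replica-selection logic above, written out as pseudocode (a sketch of the decision flow as read from the slide, not the actual stager source; the Replica struct and helper names are placeholders):

   #include <string>
   #include <vector>

   // Placeholder type standing in for the real stager/Castor interfaces
   struct Replica {
      std::string url;
      long long   size     = 0;
      bool        inCastor = false;
      bool        staged   = false;
   };

   // Placeholder: copy a remote replica via the API service;
   // returns false if the file turns out to be corrupted
   bool CopyReplica(const Replica &) { return true; }

   bool PrepareFile(const std::vector<Replica> &replicas, std::vector<Replica> &stageList)
   {
      // The CERN replica, if any, is assumed to come first in the list
      for (const Replica &r : replicas) {
         if (r.inCastor && r.size == 0)
            continue;                  // 0-size file in Castor2: cannot be staged, skip it

         if (!r.staged && !CopyReplica(r))
            continue;                  // copy failed / file corrupted: skip this replica

         stageList.push_back(r);       // good replica: schedule for the bulk stage request
         return true;                  // stop looping over replicas for this file
      }
      return false;                    // no usable replica found for this file
   }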
Outline • CAF2: features • CAF1 vs CAF2 • Processing Rate comparison • Current Statistics • Files Distribution • Users/Groups • Staging • Conclusions
Conclusions • If (ever) you cannot connect, just drop a mail and wait for… … « please try again » • CAF Usage • Subscribe to alice-project-analysis-task-force@cern.ch using CERN SIMBA (http://listboxservices.web.cern.ch/listboxservices) • Web page at http://aliceinfo.cern.ch/Offline/Analysis/CAF • CAF tutorial once a month • New CAF • Faster machines, more space, more fun • Shaky behaviour due to the higher user activity is under intensive investigation • Credits • PROOF Team and IT for the prompt support