The new CERN Analysis Facility (CAF) introduces significant enhancements over its predecessor. This document outlines the features, processing rates, and current statistics of CAF2, comparing it to CAF1. Key aspects include the upgraded user support, processing capabilities, and improved management of datasets, disks, and CPUs. The new system uses GSI-based authentication and dynamic load balancing for efficient operation. Statistics indicate higher processing speeds with fewer workers and underline the importance of resource fairness among user groups.
New CERN CAF facility: parameters, usage statistics, user support Marco MEONI, Jan Fiete GROSSE-OETRINGHAUS CERN - Offline Week – 24.10.2008
Outline • New CAF: features • CAF1 vs CAF2 • Processing Rate comparison • Current Statistics • Users, Groups • Machines, Files, Disks, Datasets, CPUs • Staging problems • Conclusions
New CAF • Timeline • 28.09 startup of the new CAF cluster • 01.10 1st day with users on the new cluster • 07.10 old CAF decommissioned by IT • Usage • 26 workers instead of 33 (but much faster, see later) • Head node is « alicecaf » instead of « lxb6046 » • GSI-based authentication, AliEn certificate needed • Announced since July, but many last-minute users whose AliEn account differs from their AFS account or whose server certificate is unknown • Datasets cleaned up; only the latest data production staged (First physics - stage 3) • AF v4-15 meta package redistributed
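For reference, connecting to the new head node from a ROOT session might look like the sketch below (a minimal example assuming the standard ROOT PROOF client; the exact connection string may differ per setup, GSI authentication is negotiated using the user's Grid certificate):

   // connect_caf.C -- minimal sketch: open a PROOF session on the new CAF head node
   void connect_caf()
   {
      // "alicecaf" is the new head node (was "lxb6046" on the old cluster)
      TProof *p = TProof::Open("alicecaf.cern.ch");
      if (!p || !p->IsValid()) {
         Error("connect_caf", "could not open a PROOF session");
         return;
      }
      p->Print();   // show session info, number of workers, etc.
   }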
Technical Differences • Cmsd (Cluster Management Service Daemon) • Why? olbd is no longer supported • What? Dynamic load balancing of files and of the data name-space • How? The stager daemon benefits from: • bulk prepare replaces the per-file touch • bulk prepare allows co-locating files on the same node • GSI authentication • Secure communication using user certificates and LDAP-based configuration management
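As an illustration of the bulk-prepare idea, a staging request can be issued for a whole list of files in one call rather than file by file (a sketch assuming ROOT's generic TFileStager interface; the endpoint and file paths are placeholders, the real CAF stager daemon drives this internally):

   // stage_bulk.C -- sketch: bulk staging request instead of touching files one by one
   void stage_bulk()
   {
      // Stager for the xrootd backend behind CAF (placeholder endpoint)
      TFileStager *stager = TFileStager::Open("root://alicecaf.cern.ch");
      if (!stager) return;

      // Collect the files to be staged and submit them in a single bulk prepare
      TList files;
      files.Add(new TObjString("root://alicecaf//placeholder/path/file1.root"));
      files.Add(new TObjString("root://alicecaf//placeholder/path/file2.root"));
      files.SetOwner(kTRUE);

      stager->Stage(&files);   // one request for the whole list
   }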
Architectural Differences • Why « only » 26 workers? • The cluster exposes up to 104 worker slots, so you could use 104 if you are alone • With 26 workers per user, 4 users can effectively run concurrently at full capacity (4 × 26 = 104) • Estimated average of 8 concurrent users… • Processing units are 6.5x faster than on the old CAF
Outline • CAF2: features • CAF1 vs CAF2 • Processing Rate comparison • Current Statistics • Users, Groups • Machines, Files, Disks, Datasets, CPUs • Staging problems • Conclusions
CAF1 vs CAF2 (Processing Rate) • Test Dataset • First physics (stage 3) pp, Pythia6, 5 kG (0.5 T), 10 TeV • /COMMON/COMMON/LHC08c11_10TeV_0.5T • 1840 files, 276k events • Tutorial task that runs over ESDs and displays the Pt distribution • Other comparison test: RAW data reconstruction (Cvetan)
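For orientation, running such a task on the test dataset could look roughly like this (a sketch assuming a generic TSelector-based tutorial task; the selector name is a placeholder, the dataset name is the one quoted above):

   // run_pt_task.C -- sketch: process the test dataset on CAF with a tutorial selector
   void run_pt_task()
   {
      TProof *p = TProof::Open("alicecaf.cern.ch");
      if (!p || !p->IsValid()) return;

      // "MyPtSelector.C" is a placeholder for the tutorial task that reads ESDs
      // and fills a Pt histogram
      p->Process("/COMMON/COMMON/LHC08c11_10TeV_0.5T", "MyPtSelector.C+");
   }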
Reminder • The test depends on the file distribution of the dataset used • Parallel code: • Creation of workers • File validation (workers opening the files) • Event loop (execution of the selector on the dataset) • Serial code: • Initialization of PROOF master, session and query objects • File lookup • Packetizer (distribution of file slices) • Merging (the biggest task)
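A rough way to see why the serial steps (merging in particular) matter for the final figure: the observed average rate is limited by them, roughly (a back-of-the-envelope relation, not a number taken from the measurements):

\[ R_{\text{avg}} = \frac{N_{\text{events}}}{T_{\text{serial}} + T_{\text{parallel}}}, \qquad T_{\text{parallel}} \approx \frac{T_{\text{parallel}}^{(1\,\text{worker})}}{N_{\text{workers}}} \]

so adding workers shrinks only the second term, and the serial part sets the ceiling.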
Processing Rate Comparison (1) • The final average rate is the only important figure • The final tail reflects workers stopping one by one • data unevenly distributed • A longer tail indicates a worker overloaded on the last packet(s) • At most 3 workers help on the same « slow » packet • [Plots: processing rate with 104 workers, 200k events and with 104 workers, 276k events]
Processing Rate Comparison (2) • [Plots: events/sec and MB/sec vs. number of events, for 104, 33 and 26 workers]
Outline • CAF2: features • CAF1 vs CAF2 • Processing Rate comparison • Current Statistics • Users/Groups • Machines, Files, Disks, Datasets, CPUs • Staging problems • Conclusions
CAF Usage • Available resources on CAF must be shared fairly • Disk and CPU usage receive the highest attention • Users are grouped (sub-detectors / physics working groups) • Each group • has a disk space quota used to stage datasets from AliEn • has a CPU fairshare target (priority) to regulate concurrent queries
CAF Groups • 19 registered groups • 145 (60) registered users • Numbers in brackets refer to the situation at the previous Offline Week
File Distribution • Max: 1863 files per node, Min: 1727 (max difference: 8%) • Nodes with more files can produce tails in the processing rate • Above a defined threshold, no further files are stored on a node
Disk Usage • Min: 105, Max: 116 per node • Max difference: 10%
Dataset Monitoring • 28 TB disk space for staging • PWG0: 4 TB • PWG1: 1 TB • PWG2: 1 TB • PWG3: 1 TB • PWG4: 1 TB • ITS: 0.2 TB • COMMON: 2 TB
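Users can inspect what is staged for their group directly from a PROOF session (a sketch using the standard dataset commands of the ROOT PROOF client; the dataset shown is the test dataset from the comparison above):

   // show_datasets.C -- sketch: inspect datasets registered on CAF from a ROOT session
   void show_datasets()
   {
      TProof *p = TProof::Open("alicecaf.cern.ch");
      if (!p || !p->IsValid()) return;

      p->ShowDataSets();                                      // list all registered datasets
      p->ShowDataSet("/COMMON/COMMON/LHC08c11_10TeV_0.5T");   // staging status of one dataset
   }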
CPU Quotas • The default group is no longer the top CPU consumer
Outline • CAF2: features • CAF1 vs CAF2 • processing rate comparison • Current Statistics • Users, Groups • Machines, Files, Disks, Datasets, CPUs • File Staging • Conclusions
File Stager • CAF intensively uses 'prepare' • 0-size files in Castor2 cannot be staged, but their replicas are ok • A check at stager level avoids spawning endless prepare requests for the same empty file that never gets online • Loop over the replicas (the CERN replica, if any, is taken first): • replica[i] in Castor && size == 0? → skip it and try the next replica • replica[i] not staged? → copy the replica (API service); if the file turns out to be corrupted, skip it • otherwise → add it to StageLIST and stop looping over replicas for this file • Finally, stage StageLIST in one bulk request
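The replica-selection logic above, written out as pseudocode (a sketch of the decision flow as read from the slide, not the actual stager source; the Replica struct and helper names are placeholders):

   #include <string>
   #include <vector>

   // Placeholder type standing in for the real stager/Castor interfaces
   struct Replica {
      std::string url;
      long long   size     = 0;
      bool        inCastor = false;
      bool        staged   = false;
   };

   // Placeholder: copy a remote replica via the API service;
   // returns false if the file turns out to be corrupted
   bool CopyReplica(const Replica &) { return true; }

   bool PrepareFile(const std::vector<Replica> &replicas, std::vector<Replica> &stageList)
   {
      // The CERN replica, if any, is assumed to come first in the list
      for (const Replica &r : replicas) {
         if (r.inCastor && r.size == 0)
            continue;                  // 0-size file in Castor2: cannot be staged, skip it

         if (!r.staged && !CopyReplica(r))
            continue;                  // copy failed / file corrupted: skip this replica

         stageList.push_back(r);       // good replica: schedule for the bulk stage request
         return true;                  // stop looping over replicas for this file
      }
      return false;                    // no usable replica found for this file
   }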
Outline • CAF2: features • CAF1 vs CAF2 • Processing Rate comparison • Current Statistics • Files Distribution • Users/Groups • Staging • Conclusions
Conclusions • If (ever) you cannot connect, just drop a mail and wait for… … « please try again » • CAF Usage • Subscribe to alice-project-analysis-task-force@cern.ch using CERN SIMBA (http://listboxservices.web.cern.ch/listboxservices) • Web page at http://aliceinfo.cern.ch/Offline/Analysis/CAF • CAF tutorial once a month • New CAF • Faster machines, more space, more fun • Shaky behaviour due to the higher user activity is under intensive investigation • Credits • PROOF Team and IT for the prompt support