Management of ATLAS jobs @ CC-IN2P3

Specificities, issues and advice

ATLAS production - J.Devemy / N.Lajili

Summary
- A few metrics
- Overview of local monitoring tools
- Concrete actions taken
- Issues
- Advice
- BQS point of view
- Questions

A few metrics - March 2007
Jobs:
- Total submitted jobs: 400 000
- Total submitted jobs, class Long: 200 000
Two separate farms:
- Pistoo: 56 CPUs (for parallel jobs)
- Anastasie: 1616 CPUs
Farm usage: 62 groups (experiment or laboratory) and 384 users

A few metrics - ATLAS @ March 2007
Jobs:
- Total submitted jobs: 52 000
- Total submitted jobs, class Long: 28 000 = 47 %
Memory use for jobs on class Long:
- Memory consistently requested: 2 GB
- 69 % used less than 1.5 GB
- 29 % used between 1.5 GB and 2 GB
CPU time use for jobs on class Long (in IN2P3 units):
- CPU time consistently requested: 2 000 000 s
- 97 % of jobs used less than 800 000 s

A few metrics - ATLAS @ April 2007
Jobs:
- Total submitted jobs: 62 000
- Total submitted jobs, class T: 36 000 = 59 % of total jobs
Memory use for jobs on class Long:
- Memory consistently requested: 2 GB
- 86 % used less than 1.5 GB
- 14 % used between 1.5 GB and 2 GB
CPU time use for jobs on class Long (in IN2P3 units):
- CPU time consistently requested: 2 000 000 s
- 98 % of jobs used less than 800 000 s

Monitoring production @ CC: MRTG
Real-time production status:
- Green: all ATLAS running jobs at CC-IN2P3
- Blue: all LHC running jobs at CC-IN2P3
- Orange: ((all LHC running jobs at CC-IN2P3) / (all ATLAS running jobs at CC-IN2P3)) * 100

Local job monitoring tool
Tools to detect problematic job behaviour:
1. "Slow" jobs: running jobs which do not consume CPU time.
2. "Early ended" jobs: bunches of jobs using much less CPU time than requested.
3. Alert mails: BQS sends a mail to the grid site admin in case of job failure.
4. Manual checks: by running scripts.
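The first two detection rules above can be sketched in a few lines of Python. This is only an illustration, not the actual CC-IN2P3 tool: the `Job` record, the state names, and all thresholds (`min_wallclock_s`, `min_cpu_ratio`, `max_used_fraction`) are assumptions, and CPU time and elapsed time are assumed to be expressed in comparable units.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record; field names are illustrative only."""
    job_id: str
    state: str              # "RUNNING" or "ENDED" (assumed state names)
    cpu_requested_s: float  # CPU time requested at submission
    cpu_used_s: float       # CPU time consumed so far
    wallclock_s: float      # elapsed time since the job started


def is_slow(job, min_wallclock_s=3600.0, min_cpu_ratio=0.05):
    """A running job that has been up for a while but barely uses CPU."""
    if job.state != "RUNNING" or job.wallclock_s < min_wallclock_s:
        return False
    return job.cpu_used_s / job.wallclock_s < min_cpu_ratio


def is_early_ended(job, max_used_fraction=0.01):
    """An ended job that used far less CPU time than it requested."""
    if job.state != "ENDED" or job.cpu_requested_s <= 0:
        return False
    return job.cpu_used_s / job.cpu_requested_s < max_used_fraction


def classify(jobs):
    """Group suspicious job IDs by the rule that flagged them."""
    report = {"slow": [], "early_ended": []}
    for job in jobs:
        if is_slow(job):
            report["slow"].append(job.job_id)
        elif is_early_ended(job):
            report["early_ended"].append(job.job_id)
    return report
```

In practice the thresholds would have to be tuned per job class, since a job waiting on a file transfer can legitimately show low CPU usage for some time.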


Concrete actions taken
- Find a detailed diagnosis of job failures, e.g.:
  - lack of resources, expired proxy, pending transfers, core LCG services unavailable
  - job environment settings for a given VO
- Find the job identity: LCG job IDs, BQS job IDs, Globus job IDs
- Inform the users or the VO admin
- Notify the administrators of the services involved, by mail or GGUS ticket
- Various production management tasks, including deleting jobs or holding them in the queue in case of problems

Concrete actions taken (continued)
- Increase the VO's quota: to cope with periods of intensive computing (DC, MC production...)
- Create BQS resources: to cope with the unavailability of internal services (HPSS, dCache)
- Set up VO agents: to automatically regulate job priorities and resources according to the VO requirements

Issues
- It is sometimes hard to find the user's email address
- Users are sometimes very slow to react
- Users are not well informed about the LCG service status
- Recurrent problems with file access or copy: remote SRM SE unavailable, LFC not responding, transfers failing...
- Jobs which are not submitted through Resource Brokers are hard to trace
- Lack of visibility on the status of core LCG services

Issues (continued)
- Zombie processes left by ended jobs on the worker nodes: solved in the next BQS version
- Lack of tools to manage VO priorities: solved in a future version (autumn 2007)
- Memory wasted by jobs submitted on the class Long

Advice
To have more running jobs:
- Always keep jobs in the queue, so that the number of running jobs stays high
- Limit the memory request for jobs submitted on the class Long
- Inform us as early as possible about critical production periods
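To illustrate why a smaller memory request can translate into more running jobs, here is a rough arithmetic sketch. It assumes, for illustration only, that the scheduler packs jobs onto a worker node according to their requested memory; the 8 GB node size is a hypothetical figure, not a CC-IN2P3 specification.

```python
def slots_per_node(node_memory_gb, requested_gb):
    """Number of jobs that fit on one node if packing is driven by requested memory."""
    return int(node_memory_gb // requested_gb)


def reserved_but_unused_gb(requested_gb, used_upper_bound_gb):
    """Lower bound on memory reserved but left idle by one job."""
    return max(requested_gb - used_upper_bound_gb, 0.0)


NODE_MEMORY_GB = 8.0  # hypothetical worker node

# Request of 2 GB, as consistently observed for class Long jobs.
slots_at_2gb = slots_per_node(NODE_MEMORY_GB, 2.0)

# Request of 1.5 GB, enough for the jobs that stayed below that usage.
slots_at_1_5gb = slots_per_node(NODE_MEMORY_GB, 1.5)

# A job requesting 2 GB but using less than 1.5 GB leaves at least 0.5 GB idle.
idle_gb = reserved_but_unused_gb(2.0, 1.5)
```

Under these assumptions the node hosts one more job when the request drops from 2 GB to 1.5 GB; the real gain depends on the actual node sizes and on the other scheduling constraints.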

BQS
BQS in a few words...
- Home-built batch system (10 years old)
- Works on all UNIX systems (GNU/Linux, Solaris, AIX...)
- Under continuous evolution: new functionality is added for scalability, robustness, reliability, and features required by users
- Grid compliant
- Very rich scheduling policy, including quotas, resource status, number of queued and running jobs...

BQS (2)
BQS philosophy:
- Dispatch of heterogeneous jobs on a worker node
- Usage of BQS resources (a kind of semaphore)
Current developments:
- Addition of grid functionalities:
  - Managing VOMS groups and roles
  - Storing more grid information in BQS
- New BQS servers (to easily absorb the growth of activity)
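The idea of resources acting as a kind of semaphore can be illustrated with a toy dispatcher. This is a sketch of the general technique only, not BQS code: the `Resource` and `Dispatcher` classes, their methods, and the resource names are all hypothetical. A job declares the resources it needs and is started only when each of them has a free unit; setting a resource's capacity to zero (for example while a storage service is down) keeps the dependent jobs queued without affecting the others.

```python
class Resource:
    """Counting-semaphore-like resource with an adjustable capacity."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.in_use = 0

    def available(self):
        return self.capacity - self.in_use


class Dispatcher:
    """Toy dispatcher: starts queued jobs whose resources all have a free unit."""

    def __init__(self, resources):
        self.resources = {r.name: r for r in resources}
        self.queued = []   # list of (job_id, needed resource names)
        self.running = {}  # job_id -> needed resource names

    def submit(self, job_id, needs=()):
        self.queued.append((job_id, tuple(needs)))

    def dispatch(self):
        """Start every queued job that can run now; return the started job IDs."""
        started, still_queued = [], []
        for job_id, needs in self.queued:
            if all(self.resources[n].available() > 0 for n in needs):
                for n in needs:
                    self.resources[n].in_use += 1  # take one unit per resource
                self.running[job_id] = needs
                started.append(job_id)
            else:
                still_queued.append((job_id, needs))
        self.queued = still_queued
        return started

    def finish(self, job_id):
        """Release the units held by a finished job."""
        for n in self.running.pop(job_id):
            self.resources[n].in_use -= 1
```

For example, with a hypothetical `hpss` resource at capacity 0 and a `dcache` resource at capacity 2, jobs needing `hpss` stay queued while the others start; raising the `hpss` capacity again lets the next `dispatch()` call release them.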

Comments / Questions
