Management of ATLAS jobs @ CC-IN2P3

Presentation Transcript


  1. Management of ATLAS jobs @ CC-IN2P3: specificities, issues and advice

  2. ATLAS production - J.Devemy / N.Lajili
     Summary
     - A few metrics
     - Overview of local monitoring tools
     - Concrete actions taken
     - Issues
     - Advice
     - BQS point of view
     - Questions

  3. A few metrics - March 2007
     - Jobs:
       - Total submitted jobs: 400 000
       - Total submitted jobs in class Long: 200 000
     - 2 separate farms:
       - Pistoo: 56 CPUs (for parallel jobs)
       - Anastasie: 1616 CPUs
     - Farm usage: 62 groups (experiment or laboratory) and 384 users

  4. A few metrics - ATLAS, March 2007
     - Jobs:
       - Total submitted jobs: 52 000
       - Total submitted jobs in class Long: 28 000 = 47 %
     - Memory use for jobs in class Long:
       - Memory consistently requested: 2 GB
       - 69 % used less than 1.5 GB
       - 29 % used more than 1.5 GB and less than 2 GB
     - CPU time use for jobs in class Long (in IN2P3 units):
       - CPU time consistently requested: 2 000 000 s
       - 97 % of jobs used less than 800 000 s

  5. A few metrics - ATLAS, April 2007
     - Jobs:
       - Total submitted jobs: 62 000
       - Total submitted jobs in class T: 36 000 = 59 % of total jobs
     - Memory use for jobs in class Long:
       - Memory consistently requested: 2 GB
       - 86 % used less than 1.5 GB
       - 14 % used more than 1.5 GB and less than 2 GB
     - CPU time use for jobs in class Long (in IN2P3 units):
       - CPU time consistently requested: 2 000 000 s
       - 98 % of jobs used less than 800 000 s
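The memory figures on the two ATLAS slides amount to binning each job's peak memory against the 2 GB request. A minimal Python sketch of that bookkeeping follows. The function names, the bin labels and the sample values are hypothetical; nothing here is taken from BQS or from the CC-IN2P3 accounting tools.

```python
from collections import Counter

REQUESTED_GB = 2.0   # memory requested by class Long jobs (from the slides)
THRESHOLD_GB = 1.5   # bin boundary used on the slides

BIN_NAMES = ("below_1.5", "1.5_to_2", "at_or_above_2")  # hypothetical labels


def memory_bins(peak_gb_per_job):
    """Return the percentage of jobs falling in each memory-usage bin."""
    if not peak_gb_per_job:
        raise ValueError("need at least one job")
    counts = Counter()
    for peak in peak_gb_per_job:
        if peak < THRESHOLD_GB:
            counts["below_1.5"] += 1
        elif peak < REQUESTED_GB:
            counts["1.5_to_2"] += 1
        else:
            counts["at_or_above_2"] += 1
    total = len(peak_gb_per_job)
    return {name: 100.0 * counts[name] / total for name in BIN_NAMES}


def unused_reserved_gb(peak_gb_per_job):
    """Sum of memory that was reserved by the request but never used."""
    return sum(max(0.0, REQUESTED_GB - peak) for peak in peak_gb_per_job)


if __name__ == "__main__":
    sample_peaks = [1.0, 1.2, 1.7, 0.5]  # made-up peak memory values, in GB
    print(memory_bins(sample_peaks))
    print(unused_reserved_gb(sample_peaks))
```

The second function gives a rough measure of the reserved but unused memory that the later "Issues" slide refers to, under the assumption that a job blocks its full request for its whole lifetime.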

  6. Monitoring production @ CC: MRTG
     Real-time production status
     - Green: all ATLAS running jobs at CC-IN2P3
     - Blue: all LHC running jobs at CC-IN2P3
     - Orange: ((all LHC running jobs at CC-IN2P3) / (all ATLAS running jobs at CC-IN2P3)) * 100

  7. Local job monitoring tool
     Tools to detect problematic job behaviour:
     1. "Slow" jobs: running jobs which do not consume CPU time
     2. "Early ended" jobs: a bunch of jobs using much less CPU time than requested
     3. Alert mails: BQS sends mail to the grid site admin in case of job failure
     4. Manual checks: by running scripts
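The first two checks on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual CC-IN2P3 tool: the `Job` record, the function names and every threshold (`min_cpu_delta_s`, `max_fraction`, `min_group`) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record built from two successive monitoring samples."""
    job_id: str
    state: str                     # "running" or "ended"
    cpu_requested_s: float         # CPU time requested at submission
    cpu_used_s: float              # CPU time consumed at the latest sample
    cpu_used_prev_s: float = 0.0   # CPU time consumed at the previous sample


def slow_jobs(jobs, min_cpu_delta_s=1.0):
    """Running jobs whose CPU time barely moved between two samples."""
    return [
        j.job_id
        for j in jobs
        if j.state == "running"
        and j.cpu_used_s - j.cpu_used_prev_s < min_cpu_delta_s
    ]


def early_ended_jobs(jobs, max_fraction=0.01, min_group=3):
    """Ended jobs that used a tiny fraction of the CPU time they requested.

    The list is reported only when at least `min_group` jobs match, since
    the slide talks about a group of such jobs rather than a single one.
    """
    suspects = [
        j.job_id
        for j in jobs
        if j.state == "ended"
        and j.cpu_requested_s > 0
        and j.cpu_used_s / j.cpu_requested_s < max_fraction
    ]
    return suspects if len(suspects) >= min_group else []
```

Comparing two samples rather than looking at the absolute CPU time keeps long-running but healthy jobs out of the "slow" list.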

  8. Local job monitoring tool

  9. Concrete actions taken
     - Find a detailed diagnosis of job failures, e.g.:
       - Lack of resources, expired proxy, pending transfers, core LCG services unavailable
       - Job environment settings for a given VO
     - Find the job identity:
       - LCG job IDs, BQS job IDs, Globus job IDs
     - Inform the users or the VO admin
     - Notify the administrators of the services involved: by mail, GGUS ticket
     - Various tasks for managing the production, including:
       - Jobs can be deleted, or locked in the queue, in case of problems

  10. Concrete actions taken
      - Increasing the VO's quota
        - To cope with intensive computing: DC, MC production...
      - Creating BQS resources
        - To cope with the unavailability of internal services (HPSS, dCache)
      - Setting up VO agents
        - To automatically regulate job priorities and resources according to the VO requirements

  11. Issues
      - Sometimes it is hard to find the user's email address
      - Sometimes users are very slow to react
      - Users are not well informed about the LCG service status
      - Recurrent problems with file access or copy: remote SRM SE unavailable, LFC not responding, transfers failing...
      - Hard to trace jobs which are not submitted through Resource Brokers
      - Lack of visibility about core LCG services status

  12. Issues
      - Zombie processes left by ended jobs on the worker nodes
        - Solved in the next BQS version
      - Lack of tools allowing us to manage VO priorities
        - Solved in a future version (autumn 2007)
      - Memory wasted by jobs submitted to class Long

  13. Advice
      To have more running jobs:
      - Always keep some jobs queued, to reach a good number of running jobs
      - Limit the memory request for jobs submitted to class Long
      - Keep us informed as early as possible about critical production periods

  14. BQS
      BQS in a few words...
      - Home-built batch system (10 years old)
      - Works on all UNIX systems (GNU/Linux, Solaris, AIX...)
      - Under continuous evolution; new functionalities are added, enabling:
        - Scalability, robustness, reliability
        - Functionality required by users
      - GRID compliant
      - Very rich scheduling policy, including: quotas, resources status, number of queued and running jobs...

  15. BQS (2)
      - BQS philosophy:
        - Dispatch of heterogeneous jobs on a worker node
        - Usage of BQS resources (a kind of semaphore)
      - Current developments:
        - Addition of GRID functionalities:
          - Managing VOMS groups and roles
          - Storing more GRID information into BQS
        - New BQS servers (to easily absorb the growth of activity)
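The slides describe BQS resources only as "a kind of semaphore" and give no interface, so the following Python sketch is a generic counting-semaphore model and not BQS code. The class `ResourcePool`, the function `dispatch` and the resource names are hypothetical. It shows how setting a resource to zero (for example while HPSS is down) keeps the jobs that need it in the queue while other jobs still start.

```python
class ResourcePool:
    """Named counters that behave like counting semaphores."""

    def __init__(self, capacities):
        # Free units per resource, e.g. {"hpss": 0, "dcache": 2}.
        self.free = dict(capacities)

    def try_acquire(self, needed):
        """Take one unit of every needed resource, or take nothing at all."""
        if any(self.free.get(name, 0) < 1 for name in needed):
            return False
        for name in needed:
            self.free[name] -= 1
        return True

    def release(self, needed):
        """Give the units back when the job ends."""
        for name in needed:
            self.free[name] += 1


def dispatch(queue, pool):
    """Split queued jobs into those that can start and those that must wait.

    `queue` is a list of (job_id, [resource names]) pairs.
    """
    started, waiting = [], []
    for job_id, needed in queue:
        if pool.try_acquire(needed):
            started.append(job_id)
        else:
            waiting.append(job_id)
    return started, waiting
```

In this model an operator who sets a resource's count to zero effectively locks the dependent jobs in the queue without touching the jobs themselves, which matches the use described on the "Concrete actions taken" slide.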

  16. Comments / Questions
