AliEn development


  1. AliEn development Miguel Martinez Pedreira

  2. Contents
     • Basics and concepts
     • CVMFS migration
     • Recent changes
     • ToDos / Ideas

  3. What is AliEn?
     • AliEn (ALICE Environment) is a lightweight Open Source Grid Framework built around other Open Source components, combining a Web Service model with a Distributed Agent model
        • designed for the offline computing world of a HEP experiment
        • massive amounts of data imply distributing their storage and processing
     • It started within the ALICE Off-line Project at CERN and constitutes the production environment for simulation, reconstruction, and analysis of physics data of the ALICE Experiment
     • The current status of ALICE grid operations can be found in the MonALISA Grid Monitoring

  4. Virtual Organisations
     • Users (job submission for data analysis) + Central management (the “brain” of the GRID) + Sites (the “muscle” of the GRID)
     • ALICE numbers: ~35K running jobs on average (~200K/day), ~80 sites, ~60 SEs, ~1000M entries in the catalogue

  5. Distributed Analysis

  6. ALICE Grid

  7. AliEn summary
     • 3-layer system that leverages the deployed resources of the underlying WLCG infrastructures and services
     • Interfaces to AliRoot via a ROOT plugin (TAlien) that implements the AliEn API
     • Complex workflows, including distributed analysis, built on top of the AliEn API
     • Used by ALICE and PANDA

  8. CVMFS

  9. CVMFS migration
     • Started in the second half of August 2013
     • First sites: test CE at CERN (local) and BITP (grid)
        • Running production-like JDLs (alitrain/aliprod)
     • CVMFS setup then tested on a ‘real’ site
        • Some combinations of packages showed problems with the environment being set for the process
     • First fixes
        • IP list from APISCONFIG
        • Paths
     • Wiki created with the steps to follow before starting the migration

  10. CVMFS migration
      • Then started with some other sites
         • Mainly big sites, one by one, overall quite smooth
      • Decided to push everyone
         • Spam-time (sorry)
         • I haven’t counted myself, but as of 7 November ~800 mails had been sent (according to Maarten)
      • Reactions
         • Many sites reinstalling to WLCG SL6
         • Sites waiting quite a while for CVMFS to be installed...
         • Several interface updates: CONDOR, ARC
         • CREAM was also updated to add robustness and deal with specific setups

  11. CVMFS migration
      • Deadline moved after the October push: end of 2013

  12. CVMFS migration
      • PackMan service not needed anymore
      • Issues
         • Fix for the ROOT/AliRoot paths when a local ROOT is installed (conflict)
         • Missing libraries: libtcl, gfortran, libtermcap...
         • Some sites don’t use HEP_OS_libs
         • ‘Warm-up’ time with ERROR_E ?
         • I/O Error
            • squid and cache misconfigurations? not easy to debug...
      • Efficiencies ?
      • Thanks to admins ;-) !

  13. CVMFS migration

  14. CVMFS migration

  15. CVMFS migration

  16. Recent changes: /tmp issue
      • Detected some jobs that were analyzing the wrong data
         • From the file ‘wn.xml’
      • Job tokens
         • Not unique
         • Now using a new service that creates unique tokens
      • Other ideas led to a more robust JobAgent
         • Check sandbox use and creation
         • Check chdir
         • Print more info
         • Single open/read of the XML file
      • The problem turned out not to be in AliEn
         • jobs waiting for the same CVMFS content were writing concurrently
         • helped to create a model of the JobAgent flow, still to be ‘digitized’
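The slide above notes that job tokens were not unique and that a new service now creates unique ones. The slides do not say how that service works, so the following is only a minimal Python sketch of the idea, assuming random tokens checked against a set of already issued values. The function name `new_job_token` and the in-memory `issued_tokens` set are hypothetical and not part of AliEn.

```python
import secrets


def new_job_token(issued, nbytes=16):
    """Return a random hex token that is not yet in `issued`, and record it.

    `issued` is a hypothetical in-memory registry; a real service would
    need a persistent store with a uniqueness constraint.
    """
    while True:
        token = secrets.token_hex(nbytes)  # 2 * nbytes hex characters
        if token not in issued:
            issued.add(token)
            return token


issued_tokens = set()
tokens = [new_job_token(issued_tokens) for _ in range(1000)]
```

The explicit membership check makes uniqueness a guarantee of the registry instead of relying only on the low collision probability of random values.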

  17. Recent changes
      • Investigating the errors
         • ERROR_E
            • TTL
            • Memory
            • Idle
            • Couldn’t get Catalogue...
         • ERROR_V is still the winner
      • Focusing on what we don’t understand, or on the problems that most affect the system overall

  18. Recent changes: ERROR_E
      • Many jobs failing because they run over the TTL
         • ProxyTTL=“1”
         • Job runtime set to the proxy timeleft on the job minus 10 minutes
      • Some still keep failing...
         • Proxy timeleft unavailable
         • Memory issues (or other)
      • Saving the output to check logs
         • OutputErrorE, same as output
         • then registerOutput <jobId>
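The runtime rule on this slide (job runtime set to the proxy timeleft minus 10 minutes) can be written as a small calculation. The sketch below is an illustration in Python, not AliEn code. It assumes times in seconds, that the requested TTL acts as an upper bound, and that the requested TTL is kept unchanged when the proxy timeleft is unavailable. The name `effective_ttl` is hypothetical.

```python
from typing import Optional

SAFETY_MARGIN_S = 10 * 60  # the 10-minute margin mentioned on the slide


def effective_ttl(requested_ttl_s: int, proxy_timeleft_s: Optional[int]) -> int:
    """Cap the job runtime at (proxy timeleft - 10 minutes).

    Assumption: when the proxy timeleft is unknown (None) there is
    nothing to cap with, so the requested TTL is returned as-is.
    """
    if proxy_timeleft_s is None:
        return requested_ttl_s
    capped = proxy_timeleft_s - SAFETY_MARGIN_S
    # Never return a negative runtime, and never exceed the request.
    return max(0, min(requested_ttl_s, capped))
```

With a 10-hour request and a proxy that has 2 hours left, the sketch gives 2 hours minus 10 minutes. The `None` branch corresponds to the "Proxy timeleft unavailable" case listed among the jobs that still fail.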

  19. Recent changes: ERROR_E
      • Some jobs fail to get a catalogue instance
      • Found a race condition
         • Jobs WAITING for a while were moved to ZOMBIE just after being ASSIGNED
         • Optimizer using a DB field based on a timestamp
            • that wasn’t properly updated...
      • A small portion still fail
         • Added a full trace of the catalogue creation in the JobAgent
         • “Bad hostname” or “Host undefined”
         • Stuck after getting the JDL
         • Under investigation

  20. Recent changes: INSERTION
      [Flow diagram; text labels recovered from the figure:]
      Inserting; Submit; Job Man; copyInput; Check JDL, inserting; `whereis` per file; Splitting; sizes check; Analyse split fields; InputDownload-Workdirectorysize; Inserting Opt; add SE requirements; Create baskets; Splitting Opt; Getting all files; check SE-CE compatibility; `whereis` per file; sizes check; If maxsize, `ls` per file; insert jobagent; Submit subjobs; Waiting; Split

  21. Recent changes: INSERTION
      [Flow diagram; text labels recovered from the figure:]
      Submit; Job Man; Check JDL, inserting; Splitting; Analyse split fields; Create baskets; Getting all files; Splitting Opt; `whereis` per file (and cached!); sizes check (no InputDownload or InputBox); add/check SE-CE compatibility; Subjobs to WAITING; Split; Waiting

  22. Recent changes: INSERTION

  23. Recent changes: INSERTION

  24. Recent changes: baskets
      • Some months ago, the file/job distribution was found to be very ugly
      • Issue in catalogue: entries with duplicated pfns
         • cleanup ?
         • fix on the optimizer deals with it
      • Improved the basket creation
         • Step 1: file transfers to have data in the same SEs (Markus/Jan)
            • not so good for the grid balance
         • Step 2: creating big collections for several runs (Costin)
            • balanced grid
      • FileBroker not needed anymore...
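The slides do not spell out the basket algorithm, so the following Python sketch is only one plausible reading: files whose replicas live on the same set of SEs are grouped together, duplicated replica entries are collapsed, and each group is cut into baskets of at most `max_files` files. The function `make_baskets`, its parameters, and the sample data are all hypothetical.

```python
from collections import defaultdict


def make_baskets(replicas, max_files):
    """Group files by the set of SEs holding them, then chunk each group.

    `replicas` maps a file name to a list of SE names. Using a frozenset
    as the key collapses duplicated entries and ignores their order.
    """
    by_ses = defaultdict(list)
    for lfn, ses in replicas.items():
        by_ses[frozenset(ses)].append(lfn)

    baskets = []
    for ses, lfns in by_ses.items():
        lfns.sort()
        for start in range(0, len(lfns), max_files):
            baskets.append((ses, lfns[start:start + max_files]))
    return baskets


example = {
    "f1": ["SE_A", "SE_B"],
    "f2": ["SE_B", "SE_A"],
    "f3": ["SE_A"],
    "f4": ["SE_A", "SE_B", "SE_A"],  # duplicated replica entry
}
baskets = make_baskets(example, max_files=2)
```

In this example `f1`, `f2` and `f4` share the same SE set and are split into two baskets, while `f3` gets a basket of its own.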

  25. File Catalogue

  26. File Access Monitoring Service
      • FAMoS monitors the attributes of file accesses and records their values in a database in an organized manner
      • Counts accesses not only to individual files but also to sets of AOD and ESD files of the LHC periods, called categories (e.g. LHC10f6a_ESD, LHC10h_AOD)
      • It also provides information on the categories accessed by individual users (like: alidaq, aliprod, alitrain)
      • Information gathered from the Authen and API servers
         • since August
      • Web interface under development
         • http://aligrid.yerphi.am/famos/monitoring
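The per-category and per-user counting described for FAMoS can be illustrated with a short Python sketch. It assumes each access record is already reduced to a (user, LHC period, data type) triple; how FAMoS actually extracts these fields from the Authen and API server information is not stated in the slides, and the sample records are made up.

```python
from collections import Counter


def category(period, kind):
    """Build a category name such as 'LHC10h_AOD' from period and data type."""
    return f"{period}_{kind}"


# Hypothetical access records: (user, LHC period, data type).
accesses = [
    ("alitrain", "LHC10h", "AOD"),
    ("alitrain", "LHC10h", "AOD"),
    ("aliprod", "LHC10f6a", "ESD"),
    ("alidaq", "LHC10h", "AOD"),
]

# Accesses per category, and per (user, category) pair.
per_category = Counter(category(p, k) for _, p, k in accesses)
per_user = Counter((u, category(p, k)) for u, p, k in accesses)
```

The two counters correspond to the two views mentioned on the slide: accesses per category, and categories accessed by each user.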

  27. File Access Monitoring Service

  28. File Access Monitoring Service

  29. File Access Monitoring Service

  30. ToDos / Ideas
      • Unifying AliEn versions
         • v2-19, v2-20, v2-21, trunk, central, API
         • Also making installations/tests work
      • JDL optimization
         • Millions of jobs, big JDL text
         • Done in v2-21
            • Storing compressed JDL
            • and only diff tags in resultsJdl
         • We could also create the subjobs’ JDL from the father’s
      • Optimizers investigations
         • periodicity ?
         • jobs not expiring (SPLIT without pending subjobs e.g.)
         • queries failing ?
      • Catalogue cleanups/modifications
         • orphan entries, duplicated pfns
         • unused tables
         • new solutions (EOS?), apply (v2-20?) improvements
         • more caching
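The two JDL ideas above (storing the JDL compressed, and keeping only the tags that differ) can be sketched in Python as follows. This is an illustration under assumptions: the JDL is treated as plain text for compression and as a dictionary of tags for the diff, `zlib` stands in for whatever compression AliEn v2-21 uses, and the tag names and values are invented.

```python
import zlib


def compress_jdl(jdl_text):
    """Compress the JDL text for storage."""
    return zlib.compress(jdl_text.encode("utf-8"))


def decompress_jdl(blob):
    """Recover the original JDL text."""
    return zlib.decompress(blob).decode("utf-8")


def diff_tags(parent, child):
    """Keep only the tags that are new or changed in `child`.

    Tags removed in the child are not tracked in this sketch.
    """
    return {k: v for k, v in child.items() if parent.get(k) != v}


def rebuild(parent, diff):
    """Recreate the child's tags from the parent's tags plus the diff."""
    return {**parent, **diff}


files = ",".join(f'"LF:/alice/data/file{i}.root"' for i in range(200))
jdl_text = f"InputData = {{{files}}};"
blob = compress_jdl(jdl_text)

parent_tags = {"Executable": "run.sh", "Split": "se"}
child_tags = {"Executable": "run.sh", "Split": "se", "InputData": "file1.root"}
stored_diff = diff_tags(parent_tags, child_tags)
```

The same `rebuild` step is what creating the subjob JDL from the father's would amount to: only the differing tags need to be stored per subjob.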

  31. ToDos / Ideas
      • zip64
         • currently using standard zip
         • can’t deal with >4GB
      • IPv6 readiness
         • this summer
      • Broker queries
         • packages matching queries, long list
      • JSON services
         • SOAP is used in production, but JSON has been ready since v2-20 and is being used by PANDA
         • + performance / - backward incompatibility (in critical parts...)
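For the zip64 item: the classic ZIP format cannot hold archives or members larger than 4 GiB, which is what the ZIP64 extensions address. As an illustration only (this uses Python's standard `zipfile` module, not the archiver AliEn uses), the sketch below writes an archive with ZIP64 extensions allowed. The member name and content are placeholders.

```python
import io
import zipfile

buffer = io.BytesIO()

# allowZip64=True lets the writer switch to ZIP64 records when a member
# or the archive grows beyond the classic limits. With allowZip64=False,
# zipfile raises LargeZipFile instead of writing such an archive.
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED,
                     allowZip64=True) as archive:
    archive.writestr("stdout.log", "job output placeholder\n")

# Read the archive back to check its content.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    content = archive.read("stdout.log").decode("utf-8")
```

The small member here does not need ZIP64 records. The flag only matters once the size limits are reached, so readers of the output archives must also support ZIP64 before large outputs are produced.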

  32. ToDos / Ideas
      • find
         • new flag to sort files according to the position of the request
         • for better reading
      • Long-desired command fixes
         • ps, top, masterjob...
      • glExec
         • utility allowing user separation in multi-user pilot jobs
         • let each user task run under a corresponding account
         • PayLoad + user certs?
      • Supercomputing...?
         • interfacing with PanDA

  33. ToDos / Ideas
      • Machine/Job features Task Force
         • aiming to adapt to virtualized environments (including cloud)
         • also related to multi-core slot queues, dynamic CPU availability...
         • concerns about overloading the meta-service when communicating features via “magic-IP”
      • Expand to more IaaS systems

  34. jAlien

  35. jAlien

  36. jAlien

  37. Sorry if I bored you ;-) Any questions ?
