
Experiment Software Installation in the LHC Computing Grid



  1. Experiment Software Installation in the LHC Computing Grid e-Science 2005, December 8th www.eu-egee.org Roberto Santinelli EGEE is a project funded by the European Union under contract IST-2003-508833

  2. Outline • The problem - Bird's-eye view - Requirements from the site administrators - Requirements from the experiments • The general mechanism (since LCG phase 1) - Limits of the LCG-1 mechanism • Tank & Spark • Experimental results from the INFN activity. Experiment Software Installation

  3. The Problem Experiment Software Installation

  4. How can I run my private application in a world-wide Grid? How can a user be sure that the software he/she needs is actually installed on a remote resource where he/she might not even have an account? Experiment Software Installation

  5. The problem • The middleware is installed and configured by the system administrators at the sites via customized tools such as YAIM (in the past LCFGng), Quattor or Rocks, which also provide centralized management of the entire computing facility (root access required), while • gigabytes of Virtual Organization (VO) specific software need to be installed and managed at each site too. Experiment Software Installation

  6. The problem • The difficulty here is mainly connected to the existence of disparate ways of installing, configuring and validating experiment software. • Each VO (LHC experiments and others) wants to maintain its own proprietary method for distributing the software on LCG (scram, pacman, rpm, tarball and so on). • Furthermore, the system for accomplishing this task might change with time even inside the same VO. • Site admins might not be familiar with all the VO software requirements and release procedures. • HEP application software is upgraded continuously (high release frequency). Experiment Software Installation

  7. Concerns from site administrators • No daemons running on the WNs: the WN belongs to the site, not to the Grid. • No root privileges for VO software installation on the WN. • Traceable access to the site. • Inbound/outbound connectivity is to be avoided. • Site policies applied when external actions are triggered. • Strong authentication required. • No shared file system area should be assumed to serve the experiment software area at the site. Experiment Software Installation

  8. Requirements from the experiments • Only special users are allowed to install software at a site. • The experiment should be free to install software at a site whenever it wants. • Management of simultaneous and concurrent installations should be in place. • Installation/validation/removal process failure recovery. • Publication of the relevant information about the software. • Installation/validation/removal of software should be possible not only through a grid job but also via a direct service request, in order to assure the right precedence with respect to “normal” jobs. • Validation of the installed software may be done in different steps and at different times with respect to the installation. • The software must be “relocatable”. • It is up to the experiment to provide the scripts for installation/validation/removal of the software (freedom). Experiment Software Installation

  9. General Mechanism “In order to comply with these requirements and concerns, a general mechanism relying upon a set of key features has been in place since LCG-1…” Experiment Software Installation

  10. General LCG-1 Mechanism: The VO_<EXP>_SW_AREA Sites can choose between providing VO space in a shared file system or allowing for local software installation on the WN. This is indicated through the local environment variable VO_<EXP>_SW_DIR. If it is equal to “.” (in Linux, the current path), there is no file system shared among the WNs and everything is installed locally, in the working directory, and later cleaned up by the local batch system (to avoid the exhaustion of disk space on the node). Otherwise the software can be installed permanently in the shared area and made permanently accessible to all WNs. Experiment Software Installation
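
A minimal shell sketch of how an installation job could use this variable (for illustration only: the VO name “cms”, the variable VO_CMS_SW_DIR and the bundle name are assumptions, and this is not the actual LCG tooling):

#!/bin/sh
# Sketch: pick the installation target from the site-provided variable.
SW_DIR="${VO_CMS_SW_DIR:-.}"
if [ "$SW_DIR" = "." ]; then
    # No shared file system: install into the job's scratch working directory;
    # the local batch system will clean it up when the job finishes.
    INSTALL_DIR="$PWD/vo_sw_install"
else
    # Shared file system: install permanently into the VO area seen by all WNs.
    INSTALL_DIR="$SW_DIR"
fi
mkdir -p "$INSTALL_DIR"
tar -xzf vo-software-bundle.tar.gz -C "$INSTALL_DIR"   # hypothetical bundle name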

  11. General LCG-1 Mechanism: The ESM Only a few people in a given collaboration are eligible to install software on behalf of the VO. These special users, the Experiment Software Managers (ESMs), have write access to the VO space (in case of a shared file system) and the privileges to publish tags in the Information System. The VO manager creates the software administrators group; people belonging to this group are mapped to a special administrative VO account in the local grid-mapfile at all sites. Experiment Software Installation
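
As a purely illustrative example, a single mapping line in a site’s local grid-mapfile could look roughly like the sketch below; the DN and the local account name “cmssgm” are invented for the example:

"/C=CH/O=CERN/OU=EIS/CN=Jane Doe" cmssgm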

  12. General LCG-1 Mechanism: Tag publication For the sites on which the installation/certification has run successfully, the ESM adds to the InfoSys of the site a tag indicating that this version of the VO software has passed the tests. This information allows real user jobs - requiring a given version of the VO software - to be steered to the validated sites. Experiment Software Installation
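
As a hedged illustration: in the GLUE schema used by LCG such tags are published through the GlueHostApplicationSoftwareRunTimeEnvironment attribute, so a query against the site Information System might look roughly like the sketch below (the BDII host name, port and tag value are placeholders):

# Sketch: list the software tags published by a site (placeholder host and tag).
ldapsearch -x -H ldap://site-bdii.example.org:2170 -b o=grid \
    '(objectClass=GlueSubCluster)' GlueHostApplicationSoftwareRunTimeEnvironment
# A successful installation/validation would show an entry such as:
#   GlueHostApplicationSoftwareRunTimeEnvironment: VO-cms-CMSSW_1_2_0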

  13. General LCG-1 Mechanism: The workflow • The ESM checks the available resources and decides where the new release must be installed. • The ESM creates one or more tarballs containing the scripts to install (and later validate) the software and the bundles of experiment software to be installed. • The ESM runs a “special lcg-tool” to install the software on a per-site basis by submitting “normal grid jobs” explicitly to these sites. • The ESM runs further verification steps by submitting other grid jobs that certify the software previously installed. • The ESM finally publishes a tag in the Information System of the site; these tags indicate that the software is available to subsequent jobs from “normal users”. Experiment Software Installation
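
A rough sketch of such a targeted submission with a hand-written JDL and the LCG-2 user-interface command edg-job-submit; the CE identifier, queue, script and bundle names are placeholders, and this is not the actual “special lcg-tool” (lcg-asis) described later:

#!/bin/sh
# Sketch: submit the installation job explicitly to one chosen CE.
cat > install-sw.jdl <<'EOF'
Executable    = "install_sw.sh";
Arguments     = "SOFTWARE_1_2_0";
InputSandbox  = {"install_sw.sh", "sw-bundle.tar.gz"};
StdOutput     = "install.out";
StdError      = "install.err";
OutputSandbox = {"install.out", "install.err"};
EOF
# -r forces the job onto the given CE instead of letting the broker choose.
edg-job-submit -r ce01.example.infn.it:2119/jobmanager-lcgpbs-cms install-sw.jdl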

  14. How does this general mechanism behave with respect to the initial list of requirements and concerns? Experiment Software Installation

  15. Limits: from the site administrators’ point of view • No daemons running on the WNs: the WN belongs to the site, not to the Grid. • No root privileges for VO software installation on the WN. • Inbound/outbound connectivity is to be avoided. • Strong authentication required. • Site policies applied when external actions are triggered. • No shared file system area should be assumed to serve the experiment software area at the site. • Traceable access to the site. Experiment Software Installation

  16. Limits: from the experiments’ point of view • Only special users are allowed to install software at a site. • Installation/validation/removal process failure recovery. • Publication of the relevant information about the software. • It is up to the experiment to provide the scripts for installation/validation/removal of the software (freedom). • The experiment needs to install its own software very frequently: it should be free to install software at a site whenever it wants. • Management of simultaneous and concurrent installations should be in place. • Installation/validation/removal of software should be possible not only through a grid job but also via a direct service request, in order to assure the right precedence with respect to “normal” jobs (special priority). • Validation of the installed software may be done in different steps and at different times with respect to the installation. • THE SOFTWARE MUST BE PROPAGATED TO ALL WNs WHENEVER THERE IS NO SHARED FILE SYSTEM. Experiment Software Installation

  17. Tank & Spark “In order to FULLY comply with these requirements, LCG moved from a simple set of scripts to a structured service: Tank & Spark, integrated into this general framework through a multi-layered toolkit.” Experiment Software Installation

  18. Structure of the Experiment Software Installation toolkit

Tank & Spark: mainly used to propagate the software to the other WNs. • installation by-passing the grid-job submission • high prioritization of software management • tracks who is installing what • complies with the external policies set by the site administrators • can manage all possible file-system topologies (shared, non-shared, AFS, or a mix of them!) • asynchronous (currently) and synchronous installation • failure-recovery capabilities (retry of the installation on the node) • possible roll-back of a given installation (not in place yet) • exhaustive notification about the installation process (node by node) to the ESM and (automatically) to the site admin (remember the final wizard at the beginning of this talk?) • keeps information about a given piece of software, internally identified through GUIDs: date, size, owner, path, status, dependencies and so on • automatic farm management (Is a node down? Is a new node there? Is the file system up and running?) • rough monitoring of the installation progress (using the Information System) • SSL-embedded synchronization, in case the SE is outside the site firewall.

lcg-ManageSoftware: the steering script invoked on the WN for installing/removing/validating application software. • It checks the local environment and decides the workflow to be performed: Is another process running for the same software and VO? Is there a shared file system or not? Is it an AFS file system (requiring the conversion of GSI credentials into AFS tokens)? Is Tank & Spark installed? • It downloads the sources of the software to be installed (if specified) to the local WN, by-passing the outbound-connectivity restriction through the local SE. • It invokes the experiment-specific script (provided by the ESM) and checks its result. • It creates a temporary directory used to install/validate/remove the software and later cleans it up. • It publishes tags in the Information System whose flavour depends on the action being performed and on its result (monitoring of the process: remember the scroll bar at the beginning of this talk?).

lcg-ManageVOTag: the module used by lcg-ManageSoftware to add/list/remove the tags published in the Information System.

lcg-asis: a user-friendly interface (on the UI) that simplifies the ESM’s life (remember the wizard steering you at the beginning of this talk?). • It uploads (if specified) the sources of the software onto the Grid (rpms, archives and so on). • It loops over all the sites available to the VO (complying with the requirements provided by the ESM in terms of CPU, memory, disk space and CPU time) and, for each site, checks through the Information System whether another software-management process is running there, automatically creates the JDL for that site, submits the jobs and stores the job information.

gssklog: allows the conversion of GSI credentials into valid KRB5 AFS tokens in case the site provides the shared area through AFS.

(Diagram: lcg-asis runs on the UI; lcg-ManageSoftware, lcg-ManageVOTag and gssklog run on the WN together with Spark; Tank runs on the CE.) For more information visit http://grid-deployment.web.cern.ch/grid-deployment/eis/docs/configuration_of_tankspark Experiment Software Installation
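
To make the role of the steering script more concrete, here is a heavily condensed shell sketch of the kind of checks listed above; it is not the real lcg-ManageSoftware, and the variable, lock-file and script names are invented:

#!/bin/sh
# Condensed sketch of a WN-side steering script (all names are invented).
SW_DIR="${VO_CMS_SW_DIR:-.}"

# Is another installation for the same VO/software already running?
if [ -f "$SW_DIR/.install.lock" ]; then
    echo "another software-management process is running" >&2
    exit 1
fi

# Shared file system or local scratch?
[ "$SW_DIR" = "." ] && echo "no shared area: installing in the local working directory"

# AFS area? Then the GSI proxy must first be converted into an AFS token.
case "$SW_DIR" in /afs/*) gssklog ;; esac   # site-specific options omitted

# Run the experiment-provided script in a scratch directory and record the outcome.
touch "$SW_DIR/.install.lock"
TMP_DIR=$(mktemp -d)
sh ./experiment_install.sh "$TMP_DIR" && STATUS=OK || STATUS=FAILED
rm -rf "$TMP_DIR"
rm -f "$SW_DIR/.install.lock"
echo "installation finished with status $STATUS"   # a tag with this flavour would be published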

  19. Tank & Spark It is a client-server application written entirely in C++ and it consists of three different components: Tank: a multi-threaded (gSOAP-based) service, running on the CE, listening for GSI-authenticated (and non-authenticated) connections and interfaced to a MySQL back-end. Spark: a client application running on each WN which talks to Tank to retrieve/insert/delete software information and to trigger the synchronization. An rsync server, running on another machine (an SE, for instance), acting as the central repository of the VO software. Experiment Software Installation
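
As a rough illustration of the third component, the synchronization triggered on each WN essentially boils down to an rsync pull from the central repository; the host name, module name and target directory below are placeholders:

# Sketch: mirror the VO software area from the central repository on the SE
# to the local area of this WN (--delete keeps the copy an exact mirror).
rsync -a --delete rsync://se01.example.infn.it/vo-software/ "${VO_CMS_SW_DIR}/"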

  20. Tank & Spark: workflow (diagram) • A JDL installation job from the ESM arrives at the CE; the ESM request ends up on a WN, which acts as the Spark node. • The Spark client program is called; the delegated credentials of the ESM are checked by Tank; Spark asks for a software-tag registration in the Tank central DB. • The software (here labelled “c”) is installed locally through the middle layer lcg-ManageSoftware; a pre-validation is highly recommended before triggering the propagation. • Tank registers the new tag and synchronizes, through rsync, the new directory created by Spark into a central repository (an SE, possibly outside the site firewall). • Tank is contacted by all the WNs, one at a time: external conditions are checked, special site policies can be taken into account and the local installation on each WN is triggered. No further authentication is required: each WN trusts Tank. • At the end of the whole process Tank e-mails the ESM with the result of the installation and the Information System is updated according to the result of the process. Experiment Software Installation

  21. Experimental results Experiment Software Installation

  22. Experimental Results: Goals • Checking all the functionalities of the service (Pisa test). • Installing and synchronizing ~3 GB of real application software (ATLAS) among 50 machines (Padova test). • Reproducing the same installation test several times, proving that T&S is suitable for the typical HEP use case (reliability test). • Testing the removal capability of Tank & Spark. • Testing the system in a production environment: compatibility with the existing production infrastructure (impact on the CE load, data safety and integrity), non-expert users. • Proving the scalability of the system up to O(100) machines and the stability of the system over time (Legnaro test). Experiment Software Installation

  23. Experimental results: Pisa test Installing a (fake) package tagged CMSIM-145.1.2-0, with 50% of the machines using a shared area and 50% using local disk. (Screenshots: the Tank DB on the CE at the end of the process, a Spark WN at the beginning of the process, another WN at the end.) Experiment Software Installation

  24. Experimental Results: Padova test (Plot: load on the Storage Element acting as rsync server, CPU = 3 GHz, Mem = 2 GB; status after ~3 hours.) Experiment Software Installation

  25. Experimental results: Padova tests Setup: 50 WNs, 3 GHz dual Xeon, 2 GB of memory, connected via Fast Ethernet. (Plots: load on a Padova WN; CE load, showing that Tank does not affect the CPU: what is visible is the normal CE activity.) Experiment Software Installation

  26. Experimental results: Legnaro Tests O(100) dual Xeon machines (2.4-3.0 GHz) with 2 GB of RAM, configured with a shared file system; similar hardware for the CE and the SE. The farm is connected to a Gigabit Ethernet switch. • The Tank server ran for more than 30 days under the same load without any watchdog intervention: very stable. • The Tank server showed no memory problems, its consumption staying stable at 9 MB with only 0.1% of the CPU used (smaller than an emacs process!). • We tested the CPU load of the rsync server in the critical checksumming operation for the Geant4 software (1.3 GB): rsync scales. • The system scaled up to O(100) machines without any problem! Experiment Software Installation

  27. Conclusion • Tank & Spark has been deployed on LCG since last year. • It offers a versatile system for propagating gigabytes of experiment-specific software to LCG computing centres and tackles all the limits of the previous mechanism. • It copes with the overwhelming majority of site-admin and end-user requirements. • It adds extra features that only preliminary tests and close work with the HEP experiments showed to be needed. • It has proved to be reliable and robust, and its scalability has been demonstrated up to farms of O(100) machines with 3 GB of software successfully synchronized. • The next step will consist of performing the same test against O(1000)-machine computing centres. Experiment Software Installation

  28. Acknowledgements Special thanks to the following Italian site administrators (within the INFN collaboration) who made these tests possible: Tommaso Boccali (PISA), Sergio Fantinel and Massimo Biasotto (LEGNARO), Rosario Turrisi (PADOVA). Experiment Software Installation
