1. K.Harrison, CERN, 23rd October 2002

   HOW TO COMMISSION A NEW CENTRE FOR LHCb PRODUCTION

   - Overview of LHCb distributed production system
   - Configuration of access machine
   - Job handling
   - Setting up Cambridge as a (small-scale) production centre:
     - Configuration for summer 2002
     - Problems encountered
     - Future plans

2. LHCb distributed production system

   - Production manager stores details of participating sites in two places:
     - in a Java servlet that produces job scripts
     - in the PVSS system used for job management
   - Each production site must define and configure an access machine
     - The access machine deals with requests from PVSS, and distributes jobs between all machines available at a site
     - In EDG terms, the access machine acts as a Computing Element, and the machines where jobs are run act as Worker Nodes
   - Job scripts are produced by a Servlet Runner that must have write access to the area where a site's job scripts are created
     - May be able to use the CERN Servlet Runner (requires afs access), or may need a Servlet Runner installed at the remote site (see the sketch below)
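The practical question for a new site in the overview above is whether the CERN Servlet Runner can write into the site's job-script area, or whether a local Servlet Runner is needed. A minimal sketch of such a check follows; the script-area path is an illustrative placeholder, not a real LHCb path.

```bash
# Minimal sketch: check whether the area where this site's job scripts are
# created is writable by the account the Servlet Runner runs under.
# SCRIPT_AREA is a placeholder path for illustration only.
SCRIPT_AREA=/lhcb/prod/scripts

if touch "$SCRIPT_AREA/.write_test" 2>/dev/null; then
    rm -f "$SCRIPT_AREA/.write_test"
    echo "Servlet Runner can write job scripts to $SCRIPT_AREA"
else
    echo "No write access to $SCRIPT_AREA - a Servlet Runner may be needed at the remote site" >&2
fi
```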

3. Configuration of access machine

   - Main steps for configuring the access machine are as follows (see the sketch below):
     - Install PVSS tools
     - Define environment variable LHCBPRODROOT to point to the root directory of the production area
     - Download and run the mcsetup installation script
     - Customise site-specific scripts
       - Customisation basically defines the site identity, the command for job submission, and what to do with output
     - Set up a Servlet Runner if not using the CERN Servlet Runner
   - More details available at:
     http://lhcb-wdqa.web.cern.ch/lhcb-wdqa/distribution
     http://lhcb-comp.web.cern.ch/lhcb-comp/ComputingModel/datachallenges/slice.doc
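As a rough sketch of these steps, assuming the commands are run on the access machine: the production-area path is a placeholder, and mcsetup is shown without options since its exact interface is described at the links above, not here.

```bash
# Hedged sketch of the access-machine configuration steps listed above.
# The production-area path is a placeholder; mcsetup is run bare here
# purely for illustration - see the distribution page above for details.

export LHCBPRODROOT=/home/lhcbprod/production   # root of production area (placeholder)
mkdir -p "$LHCBPRODROOT"
cd "$LHCBPRODROOT"

# Download mcsetup from the distribution area listed above, then run it.
./mcsetup

# Site-specific customisation then defines, broadly:
#   - the site identity
#   - the command used to submit a job to the local batch system
#   - what to do with the job output
```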

4. Job handling

   - Basic job handling is as follows (using the CERN Servlet Runner; see the sketch below):
     - Specify the job request by filling in the web form at:
       http://lhcb-comp.web.cern.ch/lhcb-comp/SICB/pcsf/html/mcbrunel.htm
     - Parameters are passed to the Servlet Runner, which produces the job scripts
     - Submit jobs either through PVSS or locally using the script submit-all-scripts installed by mcsetup
     - When jobs are completed, update the central database and transfer data to CASTOR using the script transfer-all installed by mcsetup
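A hedged outline of the local (non-PVSS) route, assuming the scripts are invoked from the production area with no arguments; their exact interfaces are defined by the mcsetup installation and are not reproduced on the slide.

```bash
# Hedged outline of local job handling on the access machine; the argument
# lists are assumptions, since the scripts are installed by mcsetup and
# their exact interfaces are not shown here.

cd "$LHCBPRODROOT"

# 1. The Servlet Runner has already produced job scripts from the web form
#    and written them into this site's job-script area.

# 2. Submit all pending job scripts to the local batch system.
./submit-all-scripts

# 3. Once the jobs have finished, update the central database and transfer
#    the output data to CASTOR at CERN.
./transfer-all
```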

5. Cambridge: Summer 2002 (1)

   - Jobs for the summer production were run on 10 desktop machines with Red Hat Linux 7.1 installed:
     - 5 x P3 (0.9-1.0 GHz, 256-512 MB)
     - 5 x P4 (1.8-2.0 GHz, 256-512 MB)
   - Desktop machines are used by people who work interactively, and may submit other jobs; production jobs were run on low-priority batch queues
     - Made use of otherwise-idle CPU cycles
   - Each machine used has 10-20 GB local scratch space; additionally had 20 GB for LHCb production on the central file server
   - LHCb production tools and software were installed only on the access machine
   - Access machine submitted jobs to an NQS pipe queue, for distribution among all production nodes (see the sketch below)
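For illustration, handing a job script to the NQS pipe queue could look like the following; the queue name is a placeholder, and the actual Cambridge queue configuration is not shown on the slide.

```bash
# Illustrative only: submit a job script produced by the Servlet Runner to
# the site's NQS pipe queue, which routes it to a low-priority batch queue
# on one of the production nodes. 'lhcb_prod' is a placeholder queue name.

JOBSCRIPT=$1
qsub -q lhcb_prod "$JOBSCRIPT"
```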

6. Cambridge: Summer 2002 (2)

   - A script executed at job startup determined where to run the applications (see the sketch below):
     - If the local scratch area had at least 5 GB free, the LHCb software was copied to a new directory in this area, and run there
     - If there was insufficient free space locally, the LHCb software was copied to a new directory in the LHCb area of the central file server, and run there
   - When a job completed, its output was stored on the file server, then the directory where the job was run was deleted
   - Log files and DSTs were copied to CERN, using bbftp and locally written tools
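A hedged reconstruction of that startup decision follows; the directory names and the df-based free-space test are assumptions, not the actual Cambridge script.

```bash
# Hedged reconstruction of the startup logic described above; paths and the
# free-space test are illustrative placeholders.

SCRATCH=/scratch/lhcb                  # local scratch area (placeholder)
FILESERVER=/data/lhcb                  # LHCb area on the central file server (placeholder)
NEED_KB=$((5 * 1024 * 1024))           # require at least 5 GB free locally

# Free space (in KB) on the local scratch partition.
FREE_KB=$(df -k "$SCRATCH" | awk 'NR==2 {print $4}')

if [ "$FREE_KB" -ge "$NEED_KB" ]; then
    WORKDIR=$SCRATCH/job_$$            # enough room: run on local scratch
else
    WORKDIR=$FILESERVER/job_$$         # otherwise fall back to the file server
fi

mkdir -p "$WORKDIR"
# ... copy the LHCb software into $WORKDIR, run the applications there,
# store the output on the file server, then delete $WORKDIR ...
```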

7. Cambridge: Problems encountered (1)

   - Configuration process was very drawn out, as all changes had to be made centrally
     - With the new installation tools, site configuration is simpler and almost everything is done locally
   - Information concerning production was not always communicated quickly to sites outside CERN
     - Situation improved now that the lhcb-production mailing list has been set up

8. Cambridge: Problems encountered (2)

   - Had problems during production when afs was unavailable, with the sequence as follows (see the sketch below):
     - Job fails to retrieve parameter files needed by SICBMC
     - SICBMC complains, but runs anyway
     - Job fails to retrieve options files needed by Brunel
     - Brunel core dumps
     - Large amounts of CPU time wasted (SICBMC producing unusable events); human intervention needed after job crash
     - Problem solved with the new system, where reliance on afs is removed
   - Brunel v13r1 used a lot of memory (around 200 MB)
     - Some jobs had to be killed as they prevented other users from working
     - Improvements with newer versions of Brunel?
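The wasted CPU time came from the applications starting despite missing afs-resident inputs. A guard of the kind sketched below would have failed the job immediately instead; the file list is a placeholder, and the actual fix in the new system was to remove the afs dependence altogether.

```bash
# Sketch of an early sanity check; REQUIRED_FILES is a placeholder for the
# afs-resident parameter and options files needed by SICBMC and Brunel.

REQUIRED_FILES="$SICBMC_PARAM_FILES $BRUNEL_OPTIONS_FILES"

for f in $REQUIRED_FILES; do
    if [ ! -r "$f" ]; then
        echo "Cannot read $f (afs unavailable?) - aborting before SICBMC/Brunel run" >&2
        exit 1
    fi
done
```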

9. Cambridge: Future plans

   - Participation in the summer 2002 production has been a positive experience
     - Gained experience with production tools, and with running simulation and reconstruction jobs using the latest versions of the software
     - Produced 37k events that have been copied to CASTOR, and are being used locally in physics studies
   - Aim to maintain participation in data challenges at least at the current (low) level
   - An additional 20 x P3 (1.1 GHz, 256 MB) machines are available in the Cambridge HEP Group if we are able to use Grid tools (Globus or EDG)
     - Will be exploring possibilities in the coming months
