
LHCb computing for the analysis: a naive user point of view


Presentation Transcript


  1. LHCb computing for the analysis: a naive user point of view
Marie-Hélène Schune, LAL-Orsay, for LHCb-France
• Framework, data flow, computing model …
• A few specific points
• Practically? And today?
References: LHCb computing TDR; talks from Ph. Charpentier (e.g. LHCC, February 2008) and N. Brook (DESY computing seminar).
Many thanks to my LHCb colleagues for their help in preparing the talk (in particular M.N. Minard, S. Poss, A. Tsaregorodtsev). Any mistake is mine, I am not at all an expert!
Workshop analyse cc-in2p3, 17 April 2008

  2. The LHCb collaboration:
• 15 countries
• 47 institutes
• ~600 physicists
LHCb physics goals:
• search for New Physics signals in flavour physics (B and D)
• CP violation studies
• rare B decay studies
In 1 year: 2 fb⁻¹

  3. An LHCb RAW data file: 2 GBytes, 60k events, ~30 s of data taking on average.
• RAW data: transferred from Online to Tier0 (CERN Castor), then copied from Tier0 to one of the Tier1s.
• Reconstruction (track reconstruction, clusters, PID …): run at Tier0 and at the Tier1s. A priori the reconstruction is foreseen to be run twice a year: in quasi real time and after the LHC shutdown. The output is a reduced DST, stored locally at the Tier0 and the Tier1s.
• Preselection code (stripping of the events), developed by the physics groups: data streams (RAW + DST + TAG) are created at Tier0 and the Tier1s, and distributed to all Tier1s.

  4. Event sizes: RAW data 35 kB/evt, reduced DST 20 kB/evt, DST 110 kB/evt.
• A priori the preselection is foreseen to be run four times a year.
• The preselection code produces physics streams (stream 1, stream 2, …, stream N), each containing DST + RAW, plus an Event Tag Collection to allow quick access to the data.
• For 120 days of running: ~6 TB for each stream. Numbers are based on the computing TDR, assuming a factor 10 overall reduction.
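As a rough cross-check of these numbers (my own back-of-envelope, not from the talk): the 2 GB / 60k-event RAW file of the previous slide is consistent with the 35 kB/evt quoted here, and the 6 TB per stream can be turned into an approximate number of events per stream if one assumes that each stored event carries DST + RAW.

# Back-of-envelope consistency checks (illustrative only; the "events per
# stream" figure ASSUMES each stored event carries DST + RAW, which is an
# assumption, not a statement from the talk).
kB = 1024.0
GB = 1024.0**3
TB = 1024.0**4

raw_per_event = 35 * kB       # RAW size per event (this slide)
dst_per_event = 110 * kB      # full DST size per event (this slide)
stream_size   = 6 * TB        # quoted size of one physics stream for 120 days

# Consistency with slide 3: a 2 GB RAW file holding 60k events
print(2 * GB / 60000 / kB)    # ~35 kB/event, matching the 35 kB/evt above

# Rough number of events per stream under the DST + RAW assumption
print(stream_size / (dst_per_event + raw_per_event))   # ~4.4e7 events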

  5. Tier0, Tier1s and Tier2s: Monte Carlo production
• Simulation is done using non-Tier1 CPU resources.
• MC data are stored at Tier0 and the Tier1s; there is no permanent storage at the Tier2s.

  6. The analysis chain in the LHCb computing TDR:
• Input: physics stream i (DST + RAW) and its Event Tag Collection.
• The analysis code runs at CERN and at the 6 Tier1s.
• Output: user DST, RooTuple and user Event Tag Collection.
• The final analysis code (cuts …) is then run on these to produce the result!

  7. Data access through the GRID:
• For the users: the GANGA front-end is used to prepare and submit jobs.
• DIRAC wraps all the GRID (and non-GRID) resources for LHCb; it is not used directly by the users.
• DIRAC can be viewed as a (very) large batch system: accounting, priority mechanism, fair share.
A GANGA job:
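(The original slide showed a screenshot of a GANGA job at this point. As a stand-in, here is a minimal sketch of what a GANGA job definition looks like at the GANGA Python prompt; the executable, arguments and backend choice are illustrative, and the real analysis configuration is the DaVinci/DIRAC script shown on slide 11.)

# Minimal GANGA job sketch (illustrative; typed at the GANGA IPython prompt).
# It runs a trivial executable, first on the local machine and then, by
# switching the backend, through DIRAC on the GRID.
j = Job(name='hello_grid_test')
j.application = Executable(exe='echo', args=['Hello from the GRID'])
j.backend = Local()          # quick local test
j.submit()

j2 = j.copy()                # same job ...
j2.backend = Dirac()         # ... but sent to the GRID through DIRAC
j2.submit()

jobs                         # list the status of all jobs, as on slide 11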

  8. A few specific points:
• LHCb does not see cc-in2p3 directly: it appears that none of the French physicists doing analysis in LHCb logs on to cc-in2p3.
• For an LHCb user, where the job runs is fully transparent.
• After CERN, cc-in2p3 will be the largest center for analysis in LHCb.
• So the use of cc-in2p3 is in fact dictated by the presence of the MC sample being analyzed.
• Data access is the main problem raised by the users: e.g. out of 2 million events, only about a quarter could be analyzed (after several trials).

  9. Practically:
• Create the list of datafile locations from the LHCb Bookkeeping web interface.
• Set up the environment (versions …).
• Tell GANGA to work interactively.
• Do a final check of the code ☺
• Tell GANGA to send the jobs to the GRID using DIRAC.
• Have a few coffees.
• Look at the monitoring page (http://lhcb.pic.es/DIRAC/Monitoring/Analysis/).
• When the jobs have ended, copy the RooTuples.

  10.
• Create the list of datafile locations from the LHCb Bookkeeping web interface.
• Set up the environment (versions …).
• Tell GANGA to work interactively.
• Do a final check of the code ☺
• Tell GANGA to send the jobs to the GRID using DIRAC.
Through the web interface a large file with all the requested data is obtained:

//-- GAUDI data cards generated on 3/25/08 10:27 AM
//-- For Event Type = 11124001 / Data type = DST 1
//-- Configuration = DC06 - phys-lumi2
//-- DST 1 datasets produced by Brunel - v30r14
//-- From DIGI 1 datasets produced by Boole - v12r10
//-- From SIM 1 datasets produced by Gauss - v25r7
//-- Database version = v30r14
//-- Cards content = logical
//--
//-- Datasets replicated at ANY
//-- 158 dataset(s) - NbEvents = 78493
//--
EventSelector.Input = {
  "DATAFILE='LFN:/lhcb/production/DC06/phys-lumi2/00001558/DST/0000/00001558_00000001_5.dst' TYP='POOL_ROOTTREE' OPT='READ'",
  "DATAFILE='LFN:/lhcb/production/DC06/phys-lumi2/00001558/DST/0000/00001558_00000002_5.dst' TYP='POOL_ROOTTREE' OPT='READ'",
  "DATAFILE='LFN:/lhcb/production/DC06/phys-lumi2/00001558/DST/0000/00001558_00000003_5.dst' TYP='POOL_ROOTTREE' OPT='READ'",
  "DATAFILE='LFN:/lhcb/production/DC06/phys-lumi2/00001558/DST/0000/00001558_00000004_5.dst' TYP='POOL_ROOTTREE' OPT='READ'",
  "DATAFILE='LFN:/lhcb/production/DC06/phys-lumi2/00001558/DST/0000/00001558_00000005_5.dst' TYP='POOL_ROOTTREE' OPT='READ'",
  "DATAFILE='LFN:/lhcb/production/DC06/phys-lumi2/00001558/DST/0000/00001558_00000006_5.dst' TYP='POOL_ROOTTREE' OPT='READ'",
  "DATAFILE='LFN:/lhcb/production/DC06/phys-lumi2/00001558/DST/0000/00001558_00000007_5.dst' TYP='POOL_ROOTTREE' OPT='READ'",
  "DATAFILE='LFN:/lhcb/production/DC06/phys-lumi2/00001558/DST/0000/00001558_00000008_5.dst' TYP='POOL_ROOTTREE' OPT='READ'",
  "DATAFILE='LFN:/lhcb/production/DC06/phys-lumi2/00001558/DST/0000/00001558_00000009_5.dst' TYP='POOL_ROOTTREE' OPT='READ'",
  "DATAFILE='LFN:/lhcb/production/DC06/phys-lumi2/00001558/DST/0000/00001558_00000010_5.dst' TYP='POOL_ROOTTREE' OPT='READ'",
  "DATAFILE='LFN:/lhcb/production/DC06/phys-lumi2/00001558/DST/0000/00001558_00000011_5.dst' TYP='POOL_ROOTTREE' OPT='READ'",
  "DATAFILE='LFN:/lhcb/production/DC06/phys-lumi2/00001558/DST/0000/00001558_00000012_5.dst' TYP='POOL_ROOTTREE' OPT='READ'",
  …
};

  11. This file is given to GANGA with:

DaVinciVersion = 'v19r9'
myJobName = 'Bu2LLK_bb1'
myApplication = DaVinci()
myApplication.version = DaVinciVersion
myApplication.cmt_user_path = '/afs/cern.ch/user/m/mschune/cmtuser/DaVinci_v19r9'
myApplication.masterpackage = 'PhysSel/Bu2LLK/v3r2'
myApplication.optsfile = File( '/afs/cern.ch/user/m/mschune/cmtuser/DaVinci_v19r9/PhysSel/Bu2LLK/v3r2/options/myBd2Kstaree-bb1.opts' )
mySplitter = DiracSplitter( filesPerJob = 4, maxFiles = -1 )
myMerger = None
myInputsandbox = []
myBackend = Dirac( CPUTime = 1000 )
j = Job( name = myJobName,
         application = myApplication,
         splitter = mySplitter,
         merger = myMerger,
         inputsandbox = myInputsandbox,
         backend = myBackend )
j.submit()

This automatically splits the job into N subjobs, where (in this case) N = NDataSets / 4.

In GANGA:

In [32]: jobs
Out[32]: Statistics: 4 jobs
--------------
# id   status     name        subjobs  application  backend  backend.actualCE
# 104  completed  Bu2LLK_bb3   12      DaVinci      Dirac
# 105  running    Bu2LLK_bb2   50      DaVinci      Dirac
# 106  completed  Bu2LLK_bb1   50      DaVinci      Dirac
# 116  submitted  Bs2DsX        3      DaVinci      Dirac

Depending on the user, the number of datafiles per job varies from 1 to ~10:
• 1 file per job: a lot of jobs to handle, but a low failure rate …
• ~10 files per job: too high a failure rate.

  12.
• Create the list of datafile locations from the LHCb Bookkeeping web interface.
• Set up the environment (versions …).
• Tell GANGA to work interactively.
• Do a final check of the code ☺
• Tell GANGA to send the jobs to the GRID using DIRAC.
• Have a few coffees.
• Look at the monitoring page (http://lhcb.pic.es/DIRAC/Monitoring/Analysis/).
• When the jobs have ended, copy the RooTuples (python scripts).
Everything is available in :
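(The slide does not show the scripts themselves. As a rough illustration of the last step, here is a minimal sketch, mine rather than the talk's, of how one might collect the RooTuples from the output directories of a completed GANGA job; the job id, file pattern and destination directory are just examples.)

# Sketch: copy the RooTuples produced by the subjobs of a completed GANGA job
# into a local analysis area. Run from the GANGA prompt, where 'jobs' exists.
# The job id (106), the '*.root' pattern and the destination are illustrative.
import glob, os, shutil

destination = '/path/to/my/ntuples'     # hypothetical local analysis area
if not os.path.isdir(destination):
    os.makedirs(destination)

j = jobs(106)                           # a completed analysis job (see slide 11)
for sj in j.subjobs:
    for ntuple in glob.glob(os.path.join(sj.outputdir, '*.root')):
        # prefix with the subjob id so files from different subjobs do not clash
        target = os.path.join(destination, '%s_%s' % (sj.id, os.path.basename(ntuple)))
        shutil.copy(ntuple, target)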

  13. Two ways of working:
• Use a generic LHCb code which works for any analysis and stores all the needed information …:
  + no need to write any code
  − (very) large RooTuples: the RooTuple analysis will require some CPU (plus a lot of disk space)
• Write your own analysis code:
  + small RooTuples which can then be read interactively with ROOT (a small example is sketched below)
  − need to know a little bit more about the LHCb code and C++
Two ways of working … still at the experimental stage; time will show which way the users prefer.
The first approach raises more sharply the question of where to do the analysis of the large RooTuples: Tier1s, Tier2s, Tier3s? A significant amount of disk space is needed to store the RooTuples (cc-in2p3, labs, laptops … ?). Some students are already using ~100 GB.
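(As an illustration of the second way of working, here is a minimal sketch, not from the talk, of reading a small RooTuple interactively with ROOT from Python; the tree path, branch name and file names are purely hypothetical.)

# Sketch: interactive look at a small RooTuple with PyROOT. The tree path
# ('Bu2LLK/ntuple'), the branch name ('B_mass') and the file names are
# illustrative placeholders only.
import ROOT

chain = ROOT.TChain('Bu2LLK/ntuple')          # hypothetical tree path
chain.Add('106_0_ntuple.root')                # files copied back from the GRID
chain.Add('106_1_ntuple.root')

histo = ROOT.TH1F('hmass', 'B candidate mass;m [MeV/c^{2}];entries',
                  100, 4800., 5800.)
chain.Draw('B_mass >> hmass', 'B_mass > 0')   # fill the histogram with a cut

canvas = ROOT.TCanvas('c1')
histo.Draw()
canvas.SaveAs('bmass.png')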

  14. Numbers for analysis jobs … Three examples (French physicists) for the period March 2008 to April 2008:

Example 1:  all sites   340 jobs,   1 stalled, 19 failed  |  cc-in2p3   10 jobs,   1 stalled, 0 failed
Example 2:  all sites  1049 jobs, 317 stalled, 29 failed  |  cc-in2p3  253 jobs, 252 stalled, 0 failed
Example 3:  all sites   294 jobs,  62 stalled, 54 failed  |  cc-in2p3    0 jobs,   0 stalled, 0 failed

All users, by site:
CERN      5086 jobs,  664 stalled,  645 failed
cc-in2p3  1419 jobs,  940 stalled,   14 failed
CNAF       163 jobs,   84 stalled,   41 failed
NIKHEF      84 jobs,   15 stalled,   24 failed
RAL        349 jobs,   68 stalled,   51 failed

NB: failed jobs can be the user's fault …
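(To make the comparison between sites a bit more readable, here is a small helper, mine and not from the talk, that turns the "all users" numbers above into stalled and failed fractions.)

# Stalled and failed fractions per site, from the 'all users' numbers on this slide.
sites = {
    'CERN':     (5086, 664, 645),
    'cc-in2p3': (1419, 940,  14),
    'CNAF':     ( 163,  84,  41),
    'NIKHEF':   (  84,  15,  24),
    'RAL':      ( 349,  68,  51),
}

for site, (total, stalled, failed) in sites.items():
    print('%-9s %5d jobs  stalled %4.0f%%  failed %4.0f%%'
          % (site, total, 100.0 * stalled / total, 100.0 * failed / total))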

  15. Final remarks: working with the GRID brings a little bit of dream into our everyday life ….
• A significant amount of “know how” is needed to run on the GRID (tutorials and documentation are usually not enough: other users’ help is needed!).
• Compared with my previous experiments (ALEPH and BaBar), it is an additional level of complexity.
• On the monitoring web page you can see that your job is in the waiting state … but why? And how are the other users' jobs doing? You need to know the name of somebody running the same kind of jobs to find out how it goes for them!
• Once you have found the correct set of packages … it runs fast!
