Dataflow/workflow with real data through Tiers - PowerPoint PPT Presentation (24 slides)

Presentation Transcript

  1. Tutorial: Dataflow/workflow with real data through Tiers • N. De Filippis, Department of Physics and INFN Bari

  2. Outline
  • Computing facilities in the control room, at Tier-0 and at the Central Analysis Facility (CAF):
    • e.g. the Tracker Analysis Centre (TAC)
  • Local storage and automatic processing at the TAC
    • how to register files in DBS/DLS
  • Automatic data shipping and remote processing at Tier-1/Tier-2
    • injection into PhEDEx for the transfer
  • Re-reconstruction and skimming with ProdAgent
  • Data analysis in a distributed environment via CRAB
  • Simulation of cosmics at a Tier-2 site

  3. What is expected in the CMS Computing Model
  Dataflow/workflow from Point 5 to the Tiers:
  • The CAF will support:
    • diagnostics of detector problems, trigger performance services,
    • derivation of calibration and alignment data,
    • reconstruction services, interactive and batch analysis facilities.
  • Most of these tasks have to be performed at remote Tier sites in a distributed environment.

  4. Computing facilities in the control room, at Tier-0 and at Central Analysis facilities

  5. Example of a facility for the Tracker
  • The TAC is a dedicated Tracker control room
    • to serve the needs of collecting and analysing the data from the 25% Tracker test at the Tracker Integration Facility (TIF)
    • in use since Oct. 1st 2006 by DAQ and detector people
  • Computing elements at the TAC:
    • 1 disk server: CMSTKSTORAGE
    • 1 DB server: CMSTIBDB
    • 1 wireless/wired router
    • 12 PCs:
      • 2 DAQ (CMSTAC02 and CMSTAC02)
      • 3 DQM, 1 visualization (CMSTKMON, CMSTAC04 and CMSTAC05)
      • 2 TIB/TID (CMSTAC00 and CMSTAC01)
      • 3 DCS (PCCMSTRDCS10, PCCMSTRDCS11 and PCCMSTRDCS12)
      • 2 TEC+ (CMSTAC06 and CMSTAC07)
      • + 1 private PC
  The TAC is like a control room + Tier-0 + CAF "in miniature"

  6. Local storage and processing at the TAC
  • A dedicated PC (CMSTKSTORAGE) is devoted to storing the data temporarily:
    • it currently has 2.8 TB of local fast disk (no redundancy)
    • it allows local caching of about 10 days of data taking (300 GB/day expected for the 25% test)
  • CMSTKSTORAGE is also used to perform the following tasks:
    • run o2o for connection and pedestal runs to fill the offline DB
    • convert RU files into EDM-compliant formats
    • write files to CASTOR when ready; areas in CASTOR created under …/store/…:
      • /castor/cern.ch/cms/store/TAC/PIXEL
      • /castor/cern.ch/cms/store/TAC/TIB
      • /castor/cern.ch/cms/store/TAC/TOB
      • /castor/cern.ch/cms/store/TAC/TEC
    • register files in the Data Bookkeeping Service (DBS) and the Data Location Service (DLS)
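The per-subdetector CASTOR areas above suggest a straightforward mapping from a locally cached EDM file to its destination path. A minimal Python sketch (the function name and layout are illustrative, not the actual TAC script):

```python
import os
import posixpath

# CASTOR areas listed on the slide, keyed by subdetector.
CASTOR_BASE = "/castor/cern.ch/cms/store/TAC"
SUBDETECTORS = ("PIXEL", "TIB", "TOB", "TEC")

def castor_destination(local_path, subdetector):
    """Map a locally cached EDM file to its CASTOR destination path."""
    if subdetector not in SUBDETECTORS:
        raise ValueError("unknown subdetector: %s" % subdetector)
    return posixpath.join(CASTOR_BASE, subdetector, os.path.basename(local_path))
```

For example, a converted TIB file `EDM0000505_000.root` would land under `/castor/cern.ch/cms/store/TAC/TIB/`.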

  7. How to register files in DBS/DLS (1)
  • A grid certificate with CMS Role=production is needed:
    • voms-proxy-init -voms cms:/cms/Role=production
  • DBS and DLS APIs:
    • cvs co -r DBS_0_0_3a DBS
    • cvs co -r DLS_0_1_2 DLS
  • One DBS and one DLS instance: please use
    • MCLocal_4/Writer for DBS
    • prod-lfc-cms-central.cern.ch/grid/cms/DLS/MCLocal_4 for DLS
  • The following info about your EDM-compliant file is needed (one processed dataset per run):
    • --PrimaryDataset=TAC-TIB-120-DAQ-EDM
    • --ProcessedDataset=CMSSW_1_2_0-RAW-Run-0000505
    • --DataTier=RAW
    • --LFN=/store/TAC/TIB/edm_2007_01_29/EDM0000505_000.root
    • --Size=205347982
    • --TotalEvents=3707
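The per-file options above can be collected in one place before calling the registration script. A small Python sketch that assembles the `dbsCgiCHWriter.py` command line from the metadata listed on this slide (the helper function itself is hypothetical; the option names mirror the slide):

```python
# Hypothetical helper: build the dbsCgiCHWriter.py argument list from a
# metadata dict whose keys match the options shown on the slide.
def build_dbs_register_command(meta):
    args = ["python", "dbsCgiCHWriter.py",
            "--DBSInstance=MCLocal_4/Writer",
            "--DBSURL=http://cmsdbs.cern.ch/cms/prod/comp/DBS/CGIServer/prodquery"]
    for key in ("PrimaryDataset", "ProcessedDataset", "DataTier",
                "LFN", "Size", "TotalEvents"):
        args.append("--%s=%s" % (key, meta[key]))
    return args

# The example file from this slide:
meta = {
    "PrimaryDataset": "TAC-TIB-120-DAQ-EDM",
    "ProcessedDataset": "CMSSW_1_2_0-RAW-Run-0000505",
    "DataTier": "RAW",
    "LFN": "/store/TAC/TIB/edm_2007_01_29/EDM0000505_000.root",
    "Size": 205347982,
    "TotalEvents": 3707,
}
```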

  8. How to register files in DBS/DLS (2)
  • Further per-file information:
    • --GUID=38ACFC35-06B0-DB11-B463 (extracted with EdmFileUtil -u file:file.root)
    • --CheckSum=4264158233 (extracted with the cksum command)
    • --CMSSWVersion=CMSSW_1_2_0
    • --ApplicationName=FUEventProcess
    • --ApplicationFamily=Online
    • --PSetHash=4cff1ae0-1565-43f8-b1e9-82ee0793cc8c (extracted with uuidgen)
  • Run the script for the registration in DBS:
    python dbsCgiCHWriter.py --DBSInstance=MCLocal_4/Writer --DBSURL="http://cmsdbs.cern.ch/cms/prod/comp/DBS/CGIServer/prodquery" --PrimaryDataset=$primdataset --ProcessedDataset=$procdataset --DataTier=RAW --LFN=$lfn --Size=$size --TotalEvents=$nevts --GUID=$guid --CheckSum=$cksum --CMSSWVersion=CMSSW_1_2_0 --ApplicationName=FUEventProcess --ApplicationFamily=Online --PSetHash=$psethash
  • Closure of blocks in DBS:
    python closeDBSFileBlock.py --DBSAddress=MCLocal_4/Writer --datasetPath=$dataset
  • The two scripts dbsCgiCHWriter.py and closeDBSFileBlock.py can be found in /afs/cern.ch/user/n/ndefilip/public/Registration/
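The `--CheckSum` value comes from the POSIX `cksum` command. For illustration, a pure-Python version of that checksum (non-reflected CRC-32, polynomial 0x04C11DB7, with the byte length fed in afterwards), useful for cross-checking a value when the command is not at hand:

```python
_POLY = 0x04C11DB7  # CRC polynomial used by POSIX cksum

def _step(crc, byte):
    """Feed one byte into the non-reflected CRC-32 used by POSIX cksum."""
    crc ^= byte << 24
    for _ in range(8):
        crc = ((crc << 1) ^ _POLY) if crc & 0x80000000 else (crc << 1)
        crc &= 0xFFFFFFFF
    return crc

def posix_cksum(data):
    """Checksum that the POSIX `cksum` command prints for `data`."""
    crc = 0
    for byte in data:              # first, the file contents
        crc = _step(crc, byte)
    n = len(data)                  # then the length, least-significant byte first
    while n:
        crc = _step(crc, n & 0xFF)
        n >>= 8
    return ~crc & 0xFFFFFFFF       # final complement
```

For a real file you would pass `open(path, "rb").read()`; for an empty input this yields 4294967295, matching `cksum /dev/null`.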

  9. How to register files in DBS/DLS (3)
  • With the data registered in DBS, run the script for the registration of blocks of files in DLS:
    python dbsread.py --datasetPath=$dataset
  • or, for each block of files:
    dls-add -i DLS_TYPE_LFC -e prod-lfc-cms-central.cern.ch/grid/cms/DLS/MCLocal_4 /TAC-TIB-120-DAQ-EDM/CMSSW_1_2_0-RAW-Run-0000505#497a013d-3b49-43ad-a80f-dbc590e593d7 srm.cern.ch
    where srm.cern.ch is the name of the SE
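A block path like the one passed to `dls-add` packs three identifiers into one string: the primary dataset, the processed dataset and the block GUID. A small parsing sketch (the helper name is illustrative):

```python
def parse_block_path(block):
    """Split a DBS file-block path into its dataset components and block GUID.

    Expected shape: /<PrimaryDataset>/<ProcessedDataset>#<BlockGUID>
    """
    dataset, _, guid = block.partition("#")
    _, primary, processed = dataset.split("/")
    return {"PrimaryDataset": primary,
            "ProcessedDataset": processed,
            "BlockGUID": guid}
```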

  10. Results in the Data Discovery page: http://cmsdbs.cern.ch/discovery/expert (Tracker data, MTCC data)

  11. Automatic data shipping and remote processing at Tier-1/Tier 2

  12. PhEDEx injection (1)
  • Data published in DBS and DLS are ready to be transferred via the CMS official data movement tool, PhEDEx
  • Injection, i.e. the procedure that writes the data into the PhEDEx transfer database, has in principle to be run from CERN, where the data are collected
    • but it can also be run at a remote Tier-1/Tier-2 site hosting PhEDEx
  • It runs at Bari via an official PhEDEx agent and a component of ProdAgent modified to "close" blocks at the end of the transfer, in order to enable automatic publishing in DLS (the same procedure used for Monte Carlo data)
    • complete automation is reached with a script that watches for new tracker-related entries in DBS/DLS
  • Once data are injected into PhEDEx, any Tier-1 or Tier-2 can subscribe to them
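The watcher script mentioned above is not shown in the slides. A heavily simplified sketch of one polling pass, with the DBS query and the injection step abstracted as caller-supplied callables (everything here is an assumption about its structure, not the Bari script itself):

```python
def watch_for_new_datasets(list_datasets, inject, seen=None, pattern="TAC-"):
    """One polling pass of a hypothetical DBS watcher.

    `list_datasets` returns the dataset paths currently in DBS and `inject`
    performs the PhEDEx injection for one dataset; both are supplied by the
    caller, since the real script's interfaces are not shown in the slides.
    Returns the updated set of already-handled datasets.
    """
    seen = set() if seen is None else seen
    for dataset in list_datasets():
        if pattern in dataset and dataset not in seen:
            inject(dataset)       # e.g. run dbsinjectTMDB.py for this dataset
            seen.add(dataset)
    return seen
```

A real deployment would call this in a loop with a sleep between passes.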

  13. PhEDEx injection (2)
  • ProdAgent_v0XX is needed:
    • configure PA to use the PhEDEx dropbox /dir/state/inject-tmdb/inbox:
      prodAgent-edit-config --component=PhEDExInterface --parameter=PhEDExDropBox --value=/dropboxdir/
    • start the PhEDExInterface component of PA:
      prodAgentd --start --component=PhEDExInterface
  • PhEDEx_2.4 is needed:
    • configure the inject-tmdb agent in your Config file:
      ### AGENT LABEL=inject-tmdb PROGRAM=Toolkit/DropBox/DropTMDBPublisher
      -db ${PHEDEX_DBPARAM}
      -node TX_NON_EXISTENT_NODE
    • start the inject-tmdb agent of PhEDEx:
      ./Master -config Config start inject-tmdb

  14. PhEDEx injection (3)
  • For each dataset path of a run (script in /afs/cern.ch/user/n/ndefilip/public/Registration):
    python dbsinjectTMDB.py --datasetPath=$dataset --injectdir=logs/
  • In the PhEDEx log you will find messages like the following:
    2007-01-31 07:55:05: TMDBInject[18582]: (re)connecting to database
    Connecting to database
    Reading file information from /home1/prodagent/state/inject-tmdb/work/_TAC-TIB-120-DAQ-EDM_CMSSW_1_2_0-RAW-Run-0000520_353a3ae2-30a0-4f30-86df-e08ba9ac6869-1170230102.09/_TAC-TIB-120-DAQ-EDM_CMSSW_1_2_0-RAW-Run-0000520_353a3ae2-30a0-4f30-86df-e08ba9ac6869.xml
    Processing dbs http://cmsdbs.cern.ch/cms/prod/comp/DBS/CGIServer/prodquery?instance=MCLocal_4/Writer (204)
    Processing dataset /TAC-TIB-120-DAQ-EDM/RAW (1364)
    Processing block /TAC-TIB-120-DAQ-EDM/CMSSW_1_2_0-RAW-Run-0000520#353a3ae2-30a0-4f30-86df-e08ba9ac6869 (7634)
    :+/ 1 new files, 1 new replicas
    2007-01-31 07:55:08: DropTMDBPublisher[5828]: stats: _TAC-TIB-120-DAQ-EDM_CMSSW_1_2_0-RAW-Run-0000520_353a3ae2-30a0-4f30-86df-e08ba9ac6869-1170230102.09 3.04r 0.18u 0.08s success
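When injecting many runs it helps to tally the counters from logs like the one above. A small sketch that sums the "N new files, M new replicas" lines (the helper name is illustrative):

```python
import re

def count_new_replicas(log_text):
    """Sum the 'N new files, M new replicas' counters from a TMDBInject log."""
    files = replicas = 0
    for m in re.finditer(r"(\d+) new files, (\d+) new replicas", log_text):
        files += int(m.group(1))
        replicas += int(m.group(2))
    return files, replicas
```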

  15. Results in PhEDEx page: http://cmsdoc.cern.ch/cms/aprom/phedex http://cmsdoc.cern.ch/cms/aprom/phedex/prod/Data::Replicas?filter=TAC-T;view=global;dexp=1364;rows=;node=6;node=19;node=44;nvalue=Node%20files#d1364

  16. “Official” reconstruction/skimming (1)
  • Goal: to run the reconstruction of raw data in a standard and official way, typically using code from a CMSSW release (no prerelease, no user patch)
  • The ProdAgent tool was evaluated to perform reconstruction with the same procedures as for Monte Carlo samples
  • ProdAgent can be run anywhere, but preferably at a Tier-1/Tier-2
  • Running with ProdAgent ensures that RECO data are automatically registered in DBS and DLS, ready to be shipped to Tier-1 and Tier-2 sites and analysed via the computing tools
  • In the near future the standard reconstruction, calibration and alignment tasks will run on Central Analysis Facility (CAF) machines at CERN, as expected in the Computing Model.

  17. “Official” reconstruction/skimming (2)
  • Input data are processed run by run, and new processed datasets are created as output, one for each run
  • ProdAgent uses the DatasetInjector component to be aware of the input files to be processed
  • You need to create the workflow file from the cfg for reconstruction:
    • the following example is for DIGI-RECO processing starting from GEN-SIM input files
    • no pileup, StartUp and LowLumi pileup can be set for the digitization
    • splitting of the input files can be done either by event or by file
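With event-based splitting, the number of jobs per run follows directly from the run's event count and the chosen split size; e.g. the 3707-event run of slide 7 with `--split-size=1000` gives 4 jobs:

```python
import math

def number_of_jobs(total_events, split_size):
    """Jobs created for one run when splitting the input by event."""
    return math.ceil(total_events / split_size)
```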

  18. “Official” reconstruction/skimming (3)
  • Creating the workflow file for the no-pileup case:
    python $PRODAGENT_ROOT/util/createProcessingWorkflow.py --dataset=/TAC-TIB-120-DAQ-EDM/RAW/CMSSW_1_2_0-RAW-Run-0000530 --cfg=DIGI-RECO-NoPU-OnSel.cfg --version=CMSSW_1_2_0 --category=mc --dbs-address=MCLocal_4/Writer --dbs-url=http://cmsdbs.cern.ch/cms/prod/comp/DBS/CGIServer/prodquery --dls-type=DLS_TYPE_DLI --dls-address=lfc-cms-test.cern.ch/grid/cms/DLS/MCLocal_4 --same-primary-dataset --only-closed-blocks --fake-hash --split-type=event --split-size=1000 --pileup-files-per-job=1 --pileup-dataset=/mc-csa06-111-minbias/GEN/CMSSW_1_1_1-GEN-SIM-1164410273 --name=TAC-TIB-120-DAQ-EDM-Run-0000530-DIGI-RECO-NoPU
  • Submitting jobs:
    python PRODAGENT/test/python/IntTests/InjectTestSkimLCG.py --workflow=/yourpath/TAC-TIB-120-DAQ-EDM-Run-0000530-DIGI-RECO-NoPU-Workflow.xml --njobs=300

  19. Data analysis via CRAB at the Tiers (1)
  • Data published in DBS/DLS can be processed remotely via CRAB, using the distributed-environment tools
  • users have to edit crab.cfg and insert the dataset path of the run to be analysed, as obtained from the DBS info
  • users have to provide their CMSSW cfg, set up the environment and compile their code via scramv1
  • the offline DB accessed via Frontier at Tier-1/2 was already tested during CSA06 with alignment data
  • an example cfg performing the reconstruction chain starting from raw data can be found in /afs/cern.ch/user/n/ndefilip/public/Registration/TACAnalysis_Run2048.cfg
  • Thanks to D. Giordano for the support
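The crab.cfg edit can be scripted. A sketch using Python's configparser, assuming the usual `[CMSSW]` section and `datasetpath` key (check the conventions of your CRAB version; these names are an assumption here):

```python
import configparser
import io

def set_dataset_path(cfg_text, dataset):
    """Return crab.cfg text with the datasetpath key updated.

    Assumes the [CMSSW] section / datasetpath key layout of crab.cfg.
    """
    cfg = configparser.ConfigParser()
    cfg.read_string(cfg_text)
    if not cfg.has_section("CMSSW"):
        cfg.add_section("CMSSW")
    cfg.set("CMSSW", "datasetpath", dataset)
    out = io.StringIO()
    cfg.write(out)
    return out.getvalue()
```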

  20. Data analysis via CRAB at the Tiers (2)
  • The piece of the cfg used to access the offline DB via Frontier
  • The output files produced with CRAB are not registered in DBS/DLS (but the implementation of the code is under development)
  • Further details about CRAB in the tutorial of F. Fanzago.

  21. “Official” cosmics simulation (1)
  • Goal: to make a standard simulation of cosmics with official code from a CMSSW release (no patches, no prereleases)
  • CMSSW_1_2_2 is needed:
    • download AnalysisExamples/SiStripDetectorPerformance:
      cvs co -r CMSSW_1_2_2 AnalysisExamples/SiStripDetectorPerformance
  • Complete geometry of CMS, no magnetic field, a cosmic filter implemented to get muons triggered by the scintillators: AnalysisExamples/SiStripDetectorPerformance/src/CosmicTIFFilter.cc
  • The configuration file is AnalysisExamples/SiStripDetectorPerformance/test/cosmic_tif.cfg; run it
    • interactively: cmsRun cosmic_tif.cfg
    • or by using ProdAgent to make large-scale and fully automated productions
  • Thanks to L. Fanò

  22. “Official” cosmics simulation (2)
  • ProdAgent_v012:
    • create the workflow from the cfg file for GEN-SIM-DIGI:
      python $PRODAGENT_ROOT/util/createProductionWorkflow.py --cfg /your/path/cosmic_tif.cfg --version CMSSW_1_2_0 --fake-hash
  • Warnings:
    • when using createPreProdWorkflow.py, the PoolOutputModule name in the cfg should be compliant with the conventions to reflect the data tier the output file contains (i.e. GEN-SIM, GEN-SIM-DIGI, FEVT)
    • so download the modified cfg from /afs/cern.ch/user/n/ndefilip/public/Registration/COSMIC_TIF.cfg
    • the workflow can be found in /afs/cern.ch/user/n/ndefilip/public/Registration/COSMIC_TIF-Workflow.xml
  • Submit jobs via the standard ProdAgent scripts:
    python $PRODAGENT_ROOT/test/python/IntTests/InjectTestLCG.py --workflow=/your/path/COSMIC_TIF-Workflow.xml --run=30000001 --nevts=10000 --njobs=100

  23. Pros and cons
  • Advantages of the CMS computing approach:
    • data are officially published and processed with official tools, so results are reproducible
    • access to a large number of distributed resources
    • profit from the experience of the computing teams
  • Cons:
    • initial effort to learn the official computing tools
    • possible problems at remote sites: storage issues, instability of grid components (RB, CE), etc.
    • competition between analysis jobs and production jobs: policy/prioritization to be set at the remote sites.

  24. Conclusions
  • The first real data registered in DBS/DLS are officially available to the CMS community
  • Data are moved between sites and published by using official tools
  • Reconstruction, re-reconstruction and skimming can be "standardized" using ProdAgent
  • Data analysis is performed by using CRAB
  • Cosmics simulation for the detector communities can be officially addressed
  • Many thanks to the people of the TAC team (Fabrizio, Giuseppe, Domenico, Livio, Tommaso, Subir, …)