CMS tools for distributed analysis

N. De Filippis - LLR-Ecole Polytechnique

CMS tools tutorial organization
• morning session: overview of CMS tools
  • concepts of the CMS computing/analysis model
  • overview of the analysis workflow and tools
  • overview of the analysis job monitoring system
• afternoon session: practical examples
  • demonstration of data discovery
  • demonstration of job submission
  • job monitoring and troubleshooting

CMS tools: overview session

CMS dataflow and workflow
• Data are collected, filtered online, stored, reconstructed with HLT information at Tier-0 and registered in the Dataset Bookkeeping System (DBS) at CERN.
• RECO data are moved from Tier-0 to the Tier-1 centres via PhEDEx.
• Data are filtered (into a reduced AOD format) at Tier-1 according to the physics analysis group selections for skimming; the skim output is shipped to Tier-2 via PhEDEx.
• Data are analysed at Tier-2 and Tier-3 by physics group users and published in a DBS instance dedicated to the physics group.
[Diagram labels: CERN Computer Centre, Tier 0, DBS; Tier 1: France Regional Centre (IN2P3), Italy Regional Centre (CNAF), FermiLab; Tier 2; Tier 3; sites shown: GRIF, LNL, Wisconsin, Rome, LLR; PhEDEx links between tiers]

CMS analysis model (CTDR)
• In the computing model CMS stated that the analysis resources are:
  • Central Analysis Facility (CAF) at CERN:
    • intended for specific varieties of analysis that require low-latency access to the data (data are on the CASTOR disk pool)
    • CAF is a very large resource, but its users and its use are subject to policy decisions
    • processing at CAF: calibration and alignment, a few physics analyses
  • Tier-2 resources:
    • two groups of analysis users to be supported in Tier-2:
      1. support for analysis groups/specific analyses
      2. support for local communities
    • 1. is addressed by using the distributed analysis system
    • 2. is addressed via local batch queues/access and distributed systems
  • Tier-3 resources: local and interactive analysis, private resources

Tier-2 resources
• A nominal Tier-2 should have:
  • 0.9 MSI2k of computing power, 200 TB of disk and 1 Gb/s of WAN
  • that means several hundred batch slots and disk for large skim samples
• A reasonably large fraction of the nominal resources is devoted to analysis group activities and the remainder is assignable to the local community
• The proposal under discussion is ~50% of the nominal processing resources for simulation, ~40% for analysis groups (specific and well-organized analyses), and the remainder for local users
• 10% of the local storage for simulation, 60% for analysis groups and 30% for local communities
• for 2008, a lower guideline of 60 TB of disk and 0.4 MSI2k for analysis groups
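The proposed shares can be turned into absolute numbers for a nominal Tier-2. The following is a minimal sketch of that arithmetic only; it takes the "~" percentages as exact and reads the "remainder" of the processing resources as 10%, both of which are assumptions made for illustration.

```python
# Nominal Tier-2 figures quoted on the slide.
NOMINAL_CPU_MSI2K = 0.9
NOMINAL_DISK_TB = 200

# Proposed shares in percent. The 10% CPU share for local users is the
# "remainder" after simulation and analysis groups (assumed exact here).
CPU_SHARE_PCT = {"simulation": 50, "analysis_groups": 40, "local_users": 10}
DISK_SHARE_PCT = {"simulation": 10, "analysis_groups": 60, "local_users": 30}


def split(total, shares_pct):
    """Return the absolute allocation per activity for a given total."""
    if sum(shares_pct.values()) != 100:
        raise ValueError("shares must add up to 100%")
    return {name: total * pct / 100 for name, pct in shares_pct.items()}


cpu = split(NOMINAL_CPU_MSI2K, CPU_SHARE_PCT)
disk = split(NOMINAL_DISK_TB, DISK_SHARE_PCT)
for name in CPU_SHARE_PCT:
    print(f"{name:16s} {cpu[name]:.2f} MSI2k  {disk[name]:.0f} TB")
```

Under these assumptions the analysis groups of a nominal Tier-2 get 0.36 MSI2k of processing power and 120 TB of disk.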

Distributed analysis in Tier-2
• via virtual organization (VO) authorized access
• via data discovery (DBS)
• via the analysis job builder (CRAB)
• via data, job and resource monitoring (Dashboard)
• via automatic storage allocation for physics groups and local user data
• via data movement tools (PhEDEx)

CMS Data Discovery
The Dataset Bookkeeping System (DBS) provides the means to define, discover and use CMS event data. The main features that DBS provides are:
• Data description: keeps the dataset definition along with attributes characterising the dataset, such as the type of content resulting from the degree of processing applied to the data (RAW, RECO, etc.). The DBS also provides information regarding the “provenance” of the data it describes.
• Data discovery: stores information about (real and simulated) CMS data in a queryable format. The supported queries allow users to discover which data are available and how they are organized (logically) in terms of packaging units (files and file-blocks). Answers the question: “Which data exist?”
• Data location: provides the means to locate replicas of data in the distributed computing system by providing the names of the Storage Elements of the sites hosting the data. Answers the question: “Where do the data exist?”

CMS Data Discovery: DBS
Data discovery page: https://cmsweb.cern.ch/dbs_discovery/_navigator?userMode=user
• A sample of data is identified by a string called the datasetpath:
  • /Primarydataset/Processeddataset/DataTier
  • Ex: /Njet_2j_80_140-alpgen/CMSSW_1_6_7-CSA07-1200571375/RECO
  • Primary dataset: name that describes the physics channel
  • Processed dataset: name that describes the kind of processing applied
  • Data Tier: describes the kind of event information stored from each step in the simulation and reconstruction chain. Examples: RAW and RECO, and for MC, GEN, SIM and DIGI.
• File-related concepts:
  • Logical File Name (LFN): a site-independent name for a file. It contains neither the actual protocol used to read the file nor any site-specific information about the place where it is located.
  • Physical File Name (PFN): a site-dependent name for a file that allows local access to the file at a site. Logical file names are mapped into physical file names via the local trivial file catalog (TFC).
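The three-field structure of a datasetpath can be made concrete with a small parser. This is an illustrative sketch only: the names `DatasetPath` and `parse_datasetpath` are hypothetical helpers and not part of DBS or CRAB.

```python
from typing import NamedTuple


class DatasetPath(NamedTuple):
    primary: str    # physics channel
    processed: str  # kind of processing applied
    tier: str       # data tier, e.g. RAW, RECO, GEN, SIM, DIGI


def parse_datasetpath(path: str) -> DatasetPath:
    """Split '/Primarydataset/Processeddataset/DataTier' into its fields."""
    parts = path.split("/")
    # A valid path starts with "/" and has exactly three non-empty fields.
    if len(parts) != 4 or parts[0] != "" or not all(parts[1:]):
        raise ValueError(f"not a /Primary/Processed/Tier path: {path!r}")
    return DatasetPath(*parts[1:])


example = parse_datasetpath(
    "/Njet_2j_80_140-alpgen/CMSSW_1_6_7-CSA07-1200571375/RECO"
)
print(example.primary, example.processed, example.tier)
```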

CMS Data Discovery (2)
Data discovery page: https://cmsweb.cern.ch/dbs_discovery/_navigator?userMode=user
[Screenshot of the data discovery page, annotated: "Logical file names: click on this"; "Site: SE name"]
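The mapping from logical to physical file names can be pictured as an ordered list of pattern rules, tried in turn until one matches. The sketch below illustrates that idea under stated assumptions: the rule layout, the host names under `example.org` and the sample LFN are all made up, and the actual trivial file catalog format is not described in these slides.

```python
import re

# Hypothetical rules: (protocol, LFN pattern, PFN template).
# Host names and paths are placeholders, not real site settings.
RULES = [
    ("direct", r"/+store/(.*)", r"/pnfs/example.org/data/cms/store/\1"),
    ("srm", r"/+store/(.*)",
     r"srm://se.example.org:8443/pnfs/example.org/data/cms/store/\1"),
]


def lfn_to_pfn(lfn, protocol, rules=RULES):
    """Return the PFN given by the first rule matching protocol and LFN."""
    for rule_protocol, pattern, template in rules:
        if rule_protocol != protocol:
            continue
        match = re.fullmatch(pattern, lfn)
        if match:
            # Substitute the captured part of the LFN into the template.
            return match.expand(template)
    raise LookupError(f"no rule maps {lfn!r} for protocol {protocol!r}")


pfn = lfn_to_pfn("/store/mc/sample/file.root", "direct")
print(pfn)
```

The same LFN yields a different PFN for each protocol and each site, which is why only the LFN is stored in the catalogue.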

CRAB (CMS Remote Analysis Builder)
• CRAB is a user-oriented tool for:
  • job preparation, submission and (basic) monitoring of CMS analysis jobs at remote sites by using the GRID infrastructure (EGEE and OSG)
• Features:
  • user settings provided via a configuration file (dataset, data type)
  • data discovery by querying DBS for remote sites
  • automatic job splitting (number of jobs or events per job to be provided)
  • jobs run where the data are
  • GRID details mostly hidden from the user
  • status monitoring, job tracking and output management of jobs
  • publishing of the output
• Use cases supported:
  • official and private code analysis of official CMS data
  • but also private production and skimming
The aim of CRAB is to hide as much of the grid complexity as possible from the user.
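Automatic job splitting, where the user gives either the number of jobs or the events per job, can be sketched as follows. This is not CRAB's actual algorithm; `split_jobs` is a hypothetical function, and a real tool may also need to respect file and block boundaries, which are ignored here.

```python
def split_jobs(total_events, events_per_job=None, number_of_jobs=None):
    """Return a list of (first_event, n_events) tuples, one per job."""
    if (events_per_job is None) == (number_of_jobs is None):
        raise ValueError("give exactly one of events_per_job, number_of_jobs")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if events_per_job is None:
        if number_of_jobs <= 0:
            raise ValueError("number_of_jobs must be positive")
        # Ceiling division, so the requested number of jobs covers all
        # events; this can yield fewer jobs than requested.
        events_per_job = -(-total_events // number_of_jobs)
    if events_per_job <= 0:
        raise ValueError("events_per_job must be positive")

    jobs = []
    first = 0
    while first < total_events:
        count = min(events_per_job, total_events - first)
        jobs.append((first, count))
        first += count
    return jobs


by_size = split_jobs(2500, events_per_job=1000)
by_count = split_jobs(10, number_of_jobs=4)
print(by_size)
print(by_count)
```

In both modes the last job simply takes whatever events are left over.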

The user analysis basic model

The user analysis workflow
• The user provides:
  • Dataset (runs, #events, ..) taken from DBS
  • Private CMSSW code
• CRAB discovers the data and the sites hosting them by querying DBS
• CRAB prepares, splits and submits jobs to the Resource Broker/Workload Management System (RB/WMS)
• The RB/WMS sends jobs to sites hosting the data, provided the CMS software is installed there
• CRAB automatically retrieves the logs and the output files of the jobs; it is possible to store the output files in an SE (best solution)
• CRAB can publish the output of the jobs in DBS to make the output data officially available for subsequent processing.
[Diagram labels: UI; DataSet Catalogue (DBS); CRAB job submission tool; Resource Broker (RB/WMS), Workload Management System; Computing Element; Worker node (CMSSW); Storage Element]
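The site selection step of the workflow (jobs go to sites that host the data and have the CMS software installed) amounts to intersecting two pieces of information. The sketch below shows only that logic; the function name, the site names and the catalogue contents are hypothetical and do not reflect how the RB/WMS is implemented.

```python
def eligible_sites(dataset, release, data_location, installed_releases):
    """Return the sites that host `dataset` and have `release` installed."""
    hosting = data_location.get(dataset, set())
    return sorted(
        site for site in hosting
        if release in installed_releases.get(site, set())
    )


# Hypothetical catalogue contents, for illustration only.
DATASET = "/Njet_2j_80_140-alpgen/CMSSW_1_6_7-CSA07-1200571375/RECO"
data_location = {DATASET: {"site_A", "site_B", "site_C"}}
installed_releases = {
    "site_A": {"CMSSW_1_6_7"},
    "site_B": {"OTHER_RELEASE"},
    "site_C": {"CMSSW_1_6_7", "OTHER_RELEASE"},
}

candidates = eligible_sites(
    DATASET, "CMSSW_1_6_7", data_location, installed_releases
)
print(candidates)
```

Here `site_B` hosts the data but lacks the requested release, so it is not a candidate.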

CRAB standalone and server

CRAB documentation and support
• CRAB home page: http://cmsdoc.cern.ch/cms/ccs/wm/www/Crab/
• CRAB twiki: https://twiki.cern.ch/twiki/bin/view/CMS/CRAB

CRAB frequently asked questions https://twiki.cern.ch/twiki/bin/view/CMS/CrabFaq

CMS job monitoring: Dashboard
• Most of the CMS job submission systems, including CRAB, are instrumented to send monitoring information to the CMS Dashboard.
• The CMS Dashboard also collects information from the Grid monitoring systems.
• Monitoring data are stored in a central database; a web interface running on top of it allows CMS users to follow the progress of their jobs.
• Web interface: http://arda-dashboard.cern.ch/cms/

Monitor of analysis jobs (last month)
[Plots: analysis jobs by site, by user and by dataset]

Monitor of analysis tasks
Choose your identity in the "Select a User" window, then select a time window. The page then lists all the tasks you submitted within the chosen time range.

Monitor of site availability for analysis
http://lxarda16.cern.ch/dashboard/request.py/samvisualization
A simple analysis test is run continuously at every site to check its availability for analysis.

Statistics of failures
• All Tiers: ~400 Kjobs, failure rate 20%
• Failures are mostly due to storage problems at sites
• Tier-2 only (32%): ~130 Kjobs, failure rate 20%
https://twiki.cern.ch/twiki/bin/view/CMS/JobExitCodes

Storage management at Tier-2
• CMS users/physics groups can produce and store large quantities of data:
  • due to re-processing of skims
  • due to common preselection processing and iterated reprocessing
  • due to private productions (especially fast simulation)
  • due to end-user analysis ROOT-tuples
• so care is needed in managing the utilization of the storage
• Tier-2 disk-based storage will be of fixed size, so care is needed to ensure that every site can meet its obligations to the collaboration and that analysis users are treated fairly:
  • user quotas and policies are being defined
  • physics analysis data namespace (official physics groups)
  • user data namespace (private users)

Physics analysis data namespace

User analysis data namespace

CMS data movement
http://cmsdoc.cern.ch/cms/aprom/phedex/
• Request a replica of a dataset at one site
• Request samples to be transferred; requests are approved by site administrators according to the policies of the physics groups

Glossary

Demonstration about CMS analysis workflow in the afternoon
