Data Operations • US CMS Collaboration Meeting 2010 • 6 May 2010
Outline • Introduction to Data Operations • T0 processing • T1 processing • T2 MC production • Transfer Operation • Release Validation • How to stay informed about samples, MC, etc. • Service work and other contributions
CMS Computing • Traditional view of the tiered computing infrastructure • Data Operations handles all central tasks at the T0, T1 and T2 levels • The project is led by Markus Klute (MIT) and Oliver Gutsche (FNAL) • More and more T3 sites (especially in the US) have been added or are being added • [Diagram: the T0 at CERN feeding the seven T1 sites (USA, Italy, France, Germany, Spain, UK, Taiwan), each fanning out to multiple T2 sites]
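As a rough illustration of the fan-out in the diagram, the tier architecture can be written down as a simple mapping from the T0 to the seven T1 sites, each serving a set of associated T2 sites. This is a toy sketch only; the site names are illustrative placeholders, not the actual CMS site registry.

```python
# Toy sketch of the CMS tiered topology: one T0 at CERN, seven T1 centres,
# each fanning out to several T2 sites. Site names are illustrative placeholders.
TIERS = {
    "T0_CERN": {
        "T1_US_FNAL":    ["T2_US_MIT", "T2_US_Wisconsin"],
        "T1_IT_CNAF":    ["T2_IT_Bari"],
        "T1_FR_CCIN2P3": ["T2_FR_GRIF"],
        "T1_DE_KIT":     ["T2_DE_DESY"],
        "T1_ES_PIC":     ["T2_ES_IFCA"],
        "T1_UK_RAL":     ["T2_UK_London_IC"],
        "T1_TW_ASGC":    ["T2_TW_Taipei"],
    }
}

def associated_t1(t2_site):
    """Return the T1 associated with a given T2 in this toy topology."""
    for t1, t2s in TIERS["T0_CERN"].items():
        if t2_site in t2s:
            return t1
    return None

print(associated_t1("T2_US_Wisconsin"))  # -> T1_US_FNAL
```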
T0 processing Task leaders Josh Bendavid (MIT) Marco Zanetti (MIT) Integration Dave Mason (FNAL) Stephen Gowdy (CERN) T0 at CERN • CPU • Process all data coming from the detector • Express (latency 1 hour) • Bulk: Repacking, PromptReco (latency 24 hours plus conditions hold) • PromptSkimming is dispatched by the T0 infrastructure, although it runs at the T1 sites • Tape • Custodial copy of data coming from the detector • Inactive "cold" copy of all data, not meant to be accessed, only archived • Network • Transfer all data from the detector to T1 sites for archival • Time critical; otherwise the buffers at the T0 overflow
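A minimal sketch of how the latency targets quoted above (1 hour for Express, 24 hours for PromptReco before the conditions hold) could be checked. The timestamps and the checking code are illustrative, not the actual T0 monitoring software.

```python
from datetime import datetime, timedelta

# Latency targets from the slide: Express ~1 hour, PromptReco ~24 hours
# (the conditions hold is not modelled here). Timestamps are made up.
LATENCY_TARGET = {"Express": timedelta(hours=1), "PromptReco": timedelta(hours=24)}

def check_latency(stream, taken_at, processed_at):
    """Return (latency, within_target) for one processed chunk of data."""
    latency = processed_at - taken_at
    return latency, latency <= LATENCY_TARGET[stream]

taken = datetime(2010, 5, 6, 12, 0)
lat, ok = check_latency("Express", taken, datetime(2010, 5, 6, 12, 40))
print("Express latency %s, within target: %s" % (lat, ok))
```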
Data taking 2010 • Current acquisition era: Commissioning10 • 1 physics Primary Dataset (PD) • Various Secondary Datasets (SD) and Central Skims (CS) • All datasets have a custodial copy at one of the T1 sites • FNAL has a replica of ALL data • Next acquisition era: Run2010A • Will be put in place at L_inst > 1E29 • 7 physics PDs: JetMETTauMonitor, JetMETTau, EGMonitor, EG, MuMonitor, Mu, MinimumBias • FNAL will have a copy of all data
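The PDs listed above end up as bookkeeping entries following the usual /PrimaryDataset/AcquisitionEra-Processing-vN/DataTier naming pattern. The short sketch below just assembles such names for the Run2010A PDs; the processing string "PromptReco-v1" and the data tiers shown are illustrative placeholders.

```python
# Assemble illustrative dataset names for the Run2010A physics PDs listed above.
# The "/PD/Era-Processing-vN/Tier" pattern follows the usual CMS convention;
# the processing string and the tiers are placeholders for this example.
ERA = "Run2010A"
PDS = ["JetMETTauMonitor", "JetMETTau", "EGMonitor", "EG",
       "MuMonitor", "Mu", "MinimumBias"]

for pd in PDS:
    for tier in ("RAW", "RECO"):
        print("/%s/%s-PromptReco-v1/%s" % (pd, ERA, tier))
```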
T1 processing Task leaders Kristian Hahn (MIT/FNAL) Guillelmo Gomez-Ceballos (MIT) Integration Jose Hernandez (CIEMAT) Claudio Grandi (INFN Bologna) T1 site • Tape • Custodial copy of data and MC • Non-custodial replica of data and MC • CPU • PromptSkims (produce SD/CS) • Re-reconstruction passes on data and MC, including SD/CS production • MC production if resources are free and the T2 level is fully utilized • Network • Serve data and MC samples to T2 sites • Archive data from the T0, other T1 sites and the T2 level (MC)
T1 operation in 2010 • Summer09 MC: re-digitization / re-reconstruction • Input: • ~575 Million events, 450 TB • Processing: • ~500 workflows • ~500,000 processing jobs • 90% of the events processed in ~5 days • Tails finished after 2 weeks • Output: • ~1500 datasets • ~400 TB RAW, ~220 TB RECO, ~65 TB AOD • 2010 data re-reconstruction • Requested when the release or global tag changes at the T0 • 2 passes until now: Apr1ReReco & Apr20ReReco • PD: MinimumBias & ZeroBias plus associated SD/CS • Also re-reconstruction of the corresponding MinBias MC samples • A full request currently takes ~2-3 days due to long tails • NEW: train model • Run a new re-reconstruction pass every week • Use a stable release and pick up the latest conditions and all added statistics • The train leaves the station on Thursdays at 8 PM CEST • [Plot: processing jobs per day]
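For the train model the departure time is fixed (Thursdays, 8 PM CEST); a small sketch of computing the next departure from an arbitrary date, treating CEST as a fixed UTC+2 offset for simplicity (no DST handling).

```python
from datetime import datetime, timedelta, timezone

CEST = timezone(timedelta(hours=2))  # simplification: fixed UTC+2, no DST handling

def next_train(now):
    """Next re-reconstruction 'train' departure: Thursday 20:00 CEST."""
    days_ahead = (3 - now.weekday()) % 7          # Monday=0, so Thursday=3
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=20, minute=0, second=0, microsecond=0)
    if candidate <= now:                          # already past this week's departure
        candidate += timedelta(days=7)
    return candidate

print(next_train(datetime(2010, 5, 6, 12, 0, tzinfo=CEST)))  # 2010-05-06 was a Thursday
```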
T2 MC production T2 site Task leaders Ajit Mohapatra (U Wisconsin) Valentina Dutta (MIT) • CPU • 50% for analysis • 50% for MC production • Standard MC production in multiple steps (GEN-SIM-RAW, GEN-SIM-RECO, … ) • Newer workflows using LHE datasets as input • PileUp workflows using MinBias or data samples for PileUp mixing • Network • Archive produced MC samples at T1 sites for custodial storage • Samples are moved from the various T2 sites to one T1 site to consolidate the distributed production
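A toy sketch of the multi-step chaining mentioned above, where the output dataset of one production step becomes the input of the next. The step names (GEN-SIM-RAW, GEN-SIM-RECO) are from the slide; the dataset names and the "submission" are illustrative stand-ins.

```python
# Toy chaining of MC production steps: each step's output dataset feeds the next.
STEPS = ["GEN-SIM-RAW", "GEN-SIM-RECO"]

def run_step(step, input_dataset):
    """Stand-in for submitting one production step; returns its output dataset."""
    output = "%s/%s" % (input_dataset, step)
    print("submit %-12s reading %-30s writing %s" % (step, input_dataset, output))
    return output

dataset = "/MinBias_7TeV/Spring10"   # illustrative starting point (generator request)
for step in STEPS:
    dataset = run_step(step, dataset)
```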
T2 MC production in 2009/2010 • Record of over 300 Million events in 1 month • Production volume varied a lot over the year depending on the request situation • Currently requests get produced quickly as there is essentially no queue
Transfer Operation PhEDEx • Central • Handles dataset transfers between all CMS sites • Database to keep track of files, blocks and their location • Links between sites which can be used for transfers • Central scheduling of transfers and balancing between source sites • Infrastructure to submit transfer or deletion requests (Webpage) Task leaders Paul Rossman (FNAL) Si Xie (MIT) • Per site • Agents that handle • Transfers • Deletions • Verification / consistency checks
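PhEDEx exposes its bookkeeping through a web data service, so the location of a dataset's blocks can be looked up programmatically. Below is a hedged sketch against the JSON instance on cmsweb; the endpoint and field names are as commonly documented but should be checked against the PhEDEx data service documentation, and the dataset name is only an example.

```python
# Hedged sketch: list block replicas for a dataset via the PhEDEx data service
# (JSON instance on cmsweb). Endpoint/field names may differ from this sketch,
# and a real query may require grid (X.509) authentication.
import json
import urllib.request

URL = ("https://cmsweb.cern.ch/phedex/datasvc/json/prod/blockreplicas"
       "?dataset=/MinimumBias/Commissioning10-PromptReco-v1/RECO")  # illustrative name

with urllib.request.urlopen(URL) as resp:
    data = json.loads(resp.read().decode("utf-8"))

for block in data.get("phedex", {}).get("block", []):
    nodes = [r["node"] for r in block.get("replica", [])]
    print(block["name"], "->", ", ".join(nodes))
```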
Transfer operation in 2009/2010 • Total data volume transferred in the last 52 weeks: 17 PB • Site with the highest incoming and outgoing volume: FNAL • Significant effort has to be spent debugging transfer problems • [Plots: transfer volume in the last 52 weeks by destination and by source]
Release Validation CERN & FNAL Task leaders Oliver Gutsche (FNAL, interim) Special Operator Diego Reyes (U Los Andes, Colombia) • Release validation • Produce special MC samples and data reconstructions for all releases • Standard set: • Turnaround 24 hours on 500 slots at CERN • Now mostly run at FNAL using opportunistic cycles in parallel to T1 production • Run for all releases except patch releases • High statistics set: • Turnaround 1 week for higher statistics samples • PileUp and HeavyIon samples • Produced in parallel to standard set outside the 24 hour window
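A back-of-the-envelope check of what a 24-hour turnaround on 500 slots allows, under an assumed per-event processing time; the 30 s/event figure is an illustrative assumption, not a measured number.

```python
# Rough capacity estimate for the standard RelVal set: 500 slots for 24 hours.
slots = 500
hours = 24
sec_per_event = 30.0   # illustrative assumption

events = slots * hours * 3600 / sec_per_event
print("~%.1f million events in one 24-hour RelVal cycle" % (events / 1e6))
# -> ~1.4 million events under these assumptions
```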
Release Validation in 2009/2010 • Significant contribution to software validation and to the stability of releases for production • High visibility task with reports in all major offline & computing meetings • [Plot: RelVal production statistics*] * double counting RAW, RECO and AOD events
User advice • To stay informed about current and future samples: • Requests are submitted and acknowledged in • hn-cms-dataopsrequests@cern.ch • Samples are announced in • hn-cms-datasets@cern.ch • Both are low-traffic lists • We strongly discourage asking questions or replying to threads on these lists
Your contribution to Data Operations • Data Operations can award service credit for all its tasks • Graduate students and post-docs can spend 25% of their time working for Data Operations as operators • Interest in computing and high-scale data processing is required • The training gives detailed insight into the computing infrastructure and software and is ideal preparation for analysis work • Very talented graduate students and post-docs can spend 50% of their time filling one of the 5 task leader positions • High visibility in the collaboration • Significant contribution to the success of the experiment • Closely connected to everything related to data and MC, good for analysis • Data Operations is constantly looking for talented people to replenish its current manpower • We are urgently looking for leaders of the Release Validation task • Please contact Markus and Oliver if you are interested or have questions.
The Data Operations Team • Project lead: • Markus Klute & Oliver Gutsche • Task leaders: • Josh Bendavid, Marco Zanetti, Kristian Hahn, Guillelmo Gomez Ceballos Retuerto, Ajit Mohapatra, Valentina Dutta, Paul Rossman, Si Xie • Operators • Andrew Lahiff, Andrey Tsyganov, Aresh Vedaee, Ariel Gomez Diaz, Arnaud Willy J Pin, Derek Barge, Dorian Kcira, Gilles De Lentdecker, Jeff Haas, Jen Adelman-McCarthy, Joseph Mccartin, Julien Caudron, Junhui Liao, Lukas Vanelderen, Nicolas Dominique Schul, Petri Lehtonen, Subir Sarkar, Vincenzo Spinoso, Xavier Janssen
Summary & Outlook • Data Operations handles all central processing and production tasks on the T0, T1 and T2 levels of the distributed computing infrastructure • Current performance in the areas of data taking, skimming, re-reconstruction, MC production, transfers and release validation is excellent • We are always looking for interested and talented people to help us get the data as quickly and reliably as possible to all CMS collaborators. • Don't miss the Computing Shift presentation on Friday
Glossary • Express • Low-latency processing of extra express stream(s) from the detector (40 Hz), latency 1 hour • Repacking • Binary streamer files from the detector are translated into the ROOT format • PromptReco • RAW ROOT files are reconstructed promptly at the T0 • PromptSkimming • As soon as blocks of data (groups of files, constrained either by runs or by number of files (1000)) are completely stored on tape at the custodial T1 site, the PromptSkimming system sends jobs to this site to run the skimming workflows
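A minimal sketch of the PromptSkimming trigger condition described above: a block is dispatched for skimming only once all of its files are on tape at the custodial T1. The block records, closure rules and the "submission" are stand-ins, not the actual PromptSkimming system.

```python
# Toy version of the PromptSkimming dispatch condition: only blocks that are
# fully on tape at the custodial T1 get skim jobs submitted there.
def block_complete_on_tape(block):
    """A block is ready when every file in it is flagged as archived on tape."""
    return all(f["on_tape"] for f in block["files"])

def dispatch_skims(blocks):
    for block in blocks:
        if block_complete_on_tape(block):
            # stand-in for submitting the skimming workflow to the custodial site
            print("submit skim jobs for %s at %s" % (block["name"], block["custodial_t1"]))

blocks = [
    {"name": "/MinimumBias/.../RECO#block1", "custodial_t1": "T1_US_FNAL",
     "files": [{"on_tape": True}, {"on_tape": True}]},
    {"name": "/MinimumBias/.../RECO#block2", "custodial_t1": "T1_US_FNAL",
     "files": [{"on_tape": True}, {"on_tape": False}]},
]
dispatch_skims(blocks)   # only block1 is dispatched
```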
Glossary • Primary Dataset (PD) • Data stream from P5 is split into PDs according to trigger selections with minimal overlap • Needed for processing in distributed computing infrastructure • Produced at T0, re-reconstructed at T1 sites • Secondary Dataset (SD) • More restrictive trigger selection than PD • Produced at T1 sites with PromptSkimming system, also produced after re-reconstruction passes at T1 sites • Central Skims (CS) • In addition to a more restrictive trigger selection than the parent PD, reconstruction level selections are applied • Same processing as SD
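A toy illustration of the PD/SD selection chain: events are routed into PDs by their trigger bits, and an SD applies a tighter trigger selection on top of its parent PD. The trigger names and the mapping are invented for the example.

```python
# Toy routing of events into Primary Datasets by trigger bit, plus a tighter
# Secondary Dataset selection on top of one PD. Trigger names are invented.
PD_TRIGGERS = {
    "Mu":          {"HLT_Mu9", "HLT_Mu15"},
    "EG":          {"HLT_Ele10", "HLT_Photon15"},
    "MinimumBias": {"HLT_MinBias"},
}
SD_TRIGGERS = {"MuHighPt": {"HLT_Mu15"}}   # SD: tighter selection than its parent PD "Mu"

def route_event(fired_triggers):
    """Return the PDs (and SDs) an event with these fired triggers belongs to."""
    pds = [pd for pd, trig in PD_TRIGGERS.items() if trig & fired_triggers]
    sds = [sd for sd, trig in SD_TRIGGERS.items() if trig & fired_triggers]
    return pds, sds

print(route_event({"HLT_Mu15"}))        # -> (['Mu'], ['MuHighPt'])
print(route_event({"HLT_MinBias"}))     # -> (['MinimumBias'], [])
```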