CMS computing: model, status and plans
CMS computing: model, status and plans
C. Charlot / LLR, 2nd LCG-France Colloquium, March 2007
The problem: data volume
• RAW
  • Detector data + L1 and HLT results after online formatting
  • Includes factors for poor understanding of the detector, compression, ...
  • 1.5 MB/evt @ 150 Hz: ~4.5 PB/year (two copies, one distributed; see the check below)
• RECO
  • Reconstructed objects with their associated hits
  • 250 kB/evt: ~1.5 PB/year (including 3 reprocessing versions)
• AOD
  • The main analysis format: clusters, tracks, particle ID, ...
  • 50 kB/evt: ~2 PB/year; whole copy at each Tier-1 (e.g. CC-IN2P3)
• TAG
  • High-level physics objects, run info (event directory), <10 kB/evt
• FEVT
  • Bundling of RAW+RECO for distribution and storage
• Plus MC data in an estimated 1:1 ratio with the experimental data
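As a cross-check of the RAW figure, taking the usual rule of thumb of about 10^7 live seconds per LHC year (the live time is my assumption; the slide quotes only the result):

    \[ 1.5\ \mathrm{MB/evt} \times 150\ \mathrm{Hz} \times 10^{7}\ \mathrm{s/yr} \approx 2.25\ \mathrm{PB/yr},
       \qquad 2\ \mathrm{copies} \Rightarrow \approx 4.5\ \mathrm{PB/yr}. \]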
Data processing
• We aim for prompt data reconstruction and analysis
  • Backlogs are the real killer
• Prioritisation will be important
  • At the beginning, the computing system will not be 100% efficient
  • Cope with backlogs without delaying critical data
  • Reserve the possibility of 'prompt calibration' using low-latency data
• Streaming
  • Rule #1 of hadron collider physics: understanding your trigger and selection is everything
  • LHC analyses rarely mix inclusive triggers
  • Classifying events early allows prioritisation
  • Crudest example: express line of 'hot' / calibration events
  • Propose O(50) 'primary datasets', immutable but allowed to overlap (10% assumed); a classification sketch follows below
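A minimal sketch of the classification idea in Python; the dataset names and HLT path names are hypothetical, not the real CMS menu, and the point is only that an event may satisfy more than one primary dataset (hence the ~10% overlap):

    # Classify events into primary datasets by the HLT paths they fired.
    # Dataset and trigger names below are invented for illustration.

    PRIMARY_DATASETS = {
        "SingleElectron": {"HLT_Ele15", "HLT_Ele25"},
        "SingleMuon":     {"HLT_Mu11", "HLT_Mu19"},
        "JetMET":         {"HLT_Jet60", "HLT_MET35"},
        "Express":        {"HLT_Express"},  # 'hot' / calibration events
    }

    def classify(fired_paths):
        """Return every primary dataset whose trigger menu intersects the
        fired HLT paths; overlap between datasets is allowed by construction."""
        fired = set(fired_paths)
        return [name for name, menu in PRIMARY_DATASETS.items() if menu & fired]

    # An event firing an electron path and a jet path lands in two datasets:
    print(classify({"HLT_Ele15", "HLT_Jet60"}))  # ['SingleElectron', 'JetMET']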
Tier-0 Centre
• Prompt reconstruction (24/200), FEVT storage, data distribution
• Provided by the IT division
• CPU: 4.6 MSI2K, Disk: 0.4 PB, MSS: 4.9 PB, WAN: >5 Gbps
Tier-1 Centres
• Data storage, heavy processing (re-reco, skimming, AOD extraction), raw data access, Tier-2 support
• 7 Tier-1s: ASCC, CCIN2P3, FNAL, GridKa, CNAF, PIC, RAL
• Nominally, CPU: 2.5 MSI2K, Disk: 1.2 PB, MSS: 2.8 PB, WAN: >10 Gbps
Tier-2 Centres
• Analysis, MC production, specialised support tasks
• Local + common use
• Nominally, CPU: 0.9 MSI2K, Disk: 0.2 PB, no MSS, WAN: >1 Gbps
CMS-CAF
• Latency-critical services, analysis, Tier-1 functionality
• CERN responsibility, open to all collaborators
• Roughly: Tier-1 MSS + 2 Tier-2s
Resource evolution
• We should be frightened by these numbers
• Revised LHC planning
• Keep the integrated data volume roughly the same by increasing the trigger rate
Tier-1/Tier-2 Associations
• Associated Tier-1: hosts the MC production + reference for AOD serving
• Full AOD sample at the Tier-1 (after T1→T1 transfers of the re-reconstructed AODs)
• Stream "allocation" ~ available disk storage at the centre (a sketch follows below)
[Figure: Tier-1/Tier-2 association map, showing CCIN2P3-AF and GRIF]
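The "allocation ~ available disk" rule amounts to a proportional share; a small sketch, with placeholder pledge numbers that are not the real 2007 figures:

    # Share O(50) primary-dataset streams across Tier-1s in proportion
    # to their disk pledges. The pledge values are placeholders.

    disk_pledge_tb = {"CCIN2P3": 1200, "FNAL": 2400, "RAL": 800, "CNAF": 1000}

    def allocate(n_streams, pledges):
        """Streams per site, proportional to the site's disk pledge.
        (Rounding can leave a small remainder; redistribution omitted.)"""
        total = sum(pledges.values())
        return {site: round(n_streams * tb / total) for site, tb in pledges.items()}

    print(allocate(50, disk_pledge_tb))
    # {'CCIN2P3': 11, 'FNAL': 22, 'RAL': 7, 'CNAF': 9}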
Transfer Rates
• These are raw rates: no catch-up, no overhead
• T1→T1: total AOD size / replication period (currently 14 days)
• T1→T2: T2 capacity / refresh period at the T2 (currently 30 days); a worked example follows below
• Average rate; the worst-case peak for a T1 is the sum of its T2 transfer capacities
  • Weighted by the data fraction at the T1
[Plot, rates in MB/s: OPN in = FEVT (T0→T1) + AOD (T1→T1); OPN out = AOD (T1→T1); T2 in = FEVTsim + AODsim (T2→T1); T2 out = FEVT + AOD (T1→T2)]
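For scale, a worked example under the assumption of a 200 TB Tier-2 allocation (a hypothetical figure; the slide gives only the formula):

    \[ \frac{200\ \mathrm{TB}}{30\ \mathrm{days}}
       = \frac{2\times 10^{8}\ \mathrm{MB}}{2.6\times 10^{6}\ \mathrm{s}}
       \approx 77\ \mathrm{MB/s}, \]

and the T1→T1 term scales the same way: total AOD size divided by the 14-day replication period.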
Tier-0 Status (CSA06)
• Prompt reconstruction at 40 Hz
  • Ran at 50 Hz for 2 weeks, then 100 Hz
  • Peak rate: >300 Hz for >10 hours
  • 207M events in total
• Uptime goal: 80% over the best 2 weeks
  • Achieved for 100% of the 4 weeks
• Use of Frontier for DB access to the prompt-reconstruction conditions
  • The CSA challenge was the first opportunity to test this at large scale with the developed reconstruction software
  • Initial difficulties encountered during commissioning, but patches and reduced logging allowed full inclusion in CSA
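The point of Frontier is to place HTTP caches (squid) between the jobs and the central conditions DB, so that thousands of identical queries are served locally. A stand-in illustration of that caching layer in plain Python; nothing here reproduces the actual Frontier client or protocol:

    # In-process stand-in for a Frontier/squid cache: the first query
    # pays the 'Oracle' latency, repeats are served from the cache.
    import functools
    import time

    @functools.lru_cache(maxsize=None)
    def get_conditions(tag: str, run: int) -> str:
        """Stand-in for an expensive conditions-DB query."""
        time.sleep(0.5)  # simulated DB latency
        return f"payload({tag}, run={run})"

    get_conditions("EcalPedestals", 1234)      # miss: ~0.5 s
    t0 = time.time()
    get_conditions("EcalPedestals", 1234)      # hit: served from cache
    print(f"cached lookup took {time.time() - t0:.4f} s")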
Data Processing & Placement
• Reminder: in the CMS model, each Tier-1 gets only a fraction of the total RAW+RECO
• Tier-1 destinations were chosen to meet analysis interest while not exceeding site storage capacity or bandwidth from the Tier-0 (a placement sketch follows below)
[Figure: data flow from the Tier-0, including the express stream]
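A sketch of that placement decision as a greedy assignment; the dataset sizes, site capacities and interest lists are invented for illustration:

    # Greedy placement: biggest datasets first, each assigned to the
    # interested Tier-1 with the most free storage. Numbers are invented.

    capacity_tb = {"CCIN2P3": 500, "FNAL": 900, "RAL": 300}
    interest = {                       # sites that asked for each dataset
        "SingleElectron": ["CCIN2P3", "FNAL"],
        "JetMET":         ["FNAL", "RAL"],
        "SingleMuon":     ["CCIN2P3", "RAL"],
    }
    size_tb = {"SingleElectron": 250, "JetMET": 400, "SingleMuon": 200}

    used = {site: 0 for site in capacity_tb}
    placement = {}
    for ds, tb in sorted(size_tb.items(), key=lambda kv: -kv[1]):
        fits = [s for s in interest[ds] if used[s] + tb <= capacity_tb[s]]
        if fits:  # pick the interested site with the most headroom
            best = max(fits, key=lambda s: capacity_tb[s] - used[s])
            used[best] += tb
            placement[ds] = best

    print(placement)
    # {'JetMET': 'FNAL', 'SingleElectron': 'CCIN2P3', 'SingleMuon': 'RAL'}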
Tier-0→Tier-1 Transfers
• Goal was to sustain 150 MB/s to the T1s
  • Twice the expected 40 Hz output rate (see the check below)
• Averages in the final week hit 350 MB/s (daily) and 650 MB/s (hourly), i.e. exceeded 2008 levels for ~10 days (with some backlog observed)
[Plot: monthly T1 transfer rates; annotations: signals start, target rate, min-bias only at start; T0 rates: 54, 110, 170, 160 Hz]
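The factor of two works out from the event sizes on the data-volume slide (taking FEVT ≈ RAW + RECO; the arithmetic is mine, not on the slide):

    \[ 40\ \mathrm{Hz} \times (1.5 + 0.25)\ \mathrm{MB/evt} = 70\ \mathrm{MB/s},
       \qquad 2 \times 70\ \mathrm{MB/s} \approx 150\ \mathrm{MB/s}. \]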
Tier-1 Transfer Performance goals
• 6 of 7 Tier-1s exceeded 90% availability for 30 days
• The U.S. Tier-1 (FNAL) hit 2× the goal
• 5 sites stored data to MSS (tape)
Tier-1 Skim Jobs
• Tested the workflow that reduces primary datasets to manageable sizes for analyses
• Computing provided a centralised skim-job workflow at the T1s
  • 4 production teams
• Secondary datasets are registered in the Dataset Bookkeeping Service (DBS) and accessed like any other data
• Common skim-job tools prepared, based on MC truth or reconstruction (both types tested); a filter sketch follows below
• Overwhelming response from the CSA analysis demos
  • About 25 filters producing ~37 (+21 jet) datasets!
  • Variety of output formats (FEVT, RECO, AOD, custom)
  • Selected event fractions range from <1% to 100% (for the Jets split)
  • Sizes range from <0.001 TB to 2.5 TB
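A minimal sketch of what such a skim filter does; the event model and the selection are hypothetical stand-ins, not CMS EDM classes:

    # Skim: stream a primary dataset through a predicate and keep the
    # passing events as a secondary dataset (which DBS then registers).

    def dielectron_filter(event):
        """Keep events with >= 2 electrons above a (made-up) 20 GeV cut."""
        electrons = [p for p in event["particles"] if p["id"] == "e"]
        return sum(1 for e in electrons if e["pt"] > 20.0) >= 2

    def skim(events, predicate):
        for event in events:
            if predicate(event):
                yield event

    primary = [
        {"particles": [{"id": "e", "pt": 35.0}, {"id": "e", "pt": 28.0}]},
        {"particles": [{"id": "mu", "pt": 50.0}]},
    ]
    secondary = list(skim(primary, dielectron_filter))
    print(f"selected {len(secondary)}/{len(primary)} events")  # 1/2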
Job Execution on the Grid
• >50K jobs/day submitted on all but one day of the final week
  • >30K/day robot jobs
  • 90% job-completion efficiency
• Robot jobs have the same mechanics as user job submissions via CRAB (an illustrative configuration follows below)
  • 2 submission teams set up
• Mostly T2 centres, as expected
  • OSG carried a large proportion
• Scaling issues encountered, but subsequently solved
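For flavour, an illustrative CRAB configuration of that era, using the CSA06 jets dataset named on the next slide; the section and field names are recalled from the CRAB documentation of the time and should be treated as assumptions:

    # crab.cfg (illustrative; exact field names are assumptions)
    [CRAB]
    jobtype   = cmssw
    scheduler = glite

    [CMSSW]
    datasetpath            = /CSA06-106-os-Jets0-0/RECO/CMSSW_1_0_6-RECO
    pset                   = analysis.cfg
    total_number_of_events = -1
    events_per_job         = 10000

    [USER]
    return_data = 1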
CMS Tier-2: data transfers (CSA06)
• 24 Tier-2 sites participated
• GRIF imported ~3 TB of /CSA06-106-os-Jets0-0/RECO/CMSSW_1_0_6-RECO for a study of the fake rate of electrons from jets
T1 Re-Reconstruction
• Demonstrated re-reconstruction at the T1 centres, with access to the offline DB using new constants
  • 4 teams set up to run 100K events at each T1
• Re-reconstruction demonstrated on >100K events at 6 T1s
  • 100% efficiency at CCIN2P3 (although on a small sample)
• Initially ran into a problem with a couple of reconstruction modules when first attempted
  • Had to drop pixel tracks and vertices (out of ~100 modules) due to a technical issue with getting the products stored in the Event; see the configuration sketch below
• For the Tracker and ECAL calibration exercises, the new constants inserted into the DB were used for re-reconstruction, and the dataset was published and accessed
  • A full reprocessing workflow!
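The workaround of dropping products is expressed through the output module's keep/drop commands. A sketch in the Python configuration style of later CMSSW releases (CSA06-era configs used the older .cfg language); it runs only inside a CMSSW environment, and the module labels are assumptions:

    # CMSSW output configuration dropping pixel tracks and vertices.
    import FWCore.ParameterSet.Config as cms

    process = cms.Process("RERECO")
    process.out = cms.OutputModule("PoolOutputModule",
        fileName = cms.untracked.string("rereco.root"),
        outputCommands = cms.untracked.vstring(
            "keep *",
            "drop *_pixelTracks_*_*",    # the products dropped in CSA06
            "drop *_pixelVertices_*_*",
        ),
    )
    process.ep = cms.EndPath(process.out)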
2007 MC production
• 1_2_0 validation production completed
• 03/07: production for HLT (1_3_0)
• 04-05/07: production for physics (1_4_0)
• Stage-out problems encountered
CMS Computing timeline 2007
• Computing support for the preparation of the 2008 papers
  • Large-scale MC production: March-May 2007
  • Analysis: autumn 2007
  • Core software, final procedures and algorithms: autumn 2007
• Computing, Software and Analysis challenge, CSA07
  • Computing model at ~50% scale
  • Data production and distribution at the Tier-1s
  • Skimming and re-reco at the Tier-1s, distribution to the Tier-2s
  • Analysis at the Tier-2s, together with MC production
  • July 2007
• Data taking: end of 2007
Conclusions
• CMS is preparing for data taking
• Activities at the CC-IN2P3 Tier-1 will refocus on its primary missions
  • CSA07 is the number-one objective of the first half of the year
  • Participation in the MC production as well
• The emphasis is now on the Tier-2s
  • Ramp-up of GRIF
    • MC production
    • Local analysis
  • Tier-2 at CC-IN2P3
  • Significant needs for analysis in Q2-Q3 2007