
CMS Stress Test Report Marco Verlato (INFN-Padova)

  1. INFN-GRID Testbed Meeting, 17 January 2003. CMS Stress Test Report. Marco Verlato (INFN-Padova)

  2. Motivations and goals
  • Purpose of the “stress test”:
    • Verify how well the EDG middleware supports CMS Production
    • Verify the portability of the CMS production environment to a grid environment
    • Produce a reasonable amount of the PRS-requested events
  • Goals:
    • Aim for 1 million events (only FZ files, no Objectivity)
    • Measure performance, efficiency and the reasons for job failures
    • Try to make the system stable
  • Organization:
    • Operations started November 30th and ended at Christmas (~3 weeks)
    • The joint effort involved CMS, EDG and LCG people (~50 people, 17 from INFN)
    • Mailing list: <cms-stress-test@cern.ch>

  3. Software and middleware
  • The CMS software used is the official production version
    • CMKIN and CMSIM: installed as rpm on all the sites
  • EDG middleware releases:
    • 1.3.4 (before 9/12)
    • 1.4.0 (after 9/12)
  • Tools used (on the EDG “User Interface”):
    • Modified IMPALA/BOSS system to allow Grid submission of jobs (a JDL sketch follows this slide)
    • Scripts and ad-hoc tools to:
      • replicate files
      • collect monitoring information from EDG and from the jobs
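
For illustration, a minimal sketch (in Python) of the kind of JDL job description that a grid-enabled IMPALA could generate for one CMSIM job. The wrapper script, sandbox file names and the Requirements/Rank expressions are hypothetical placeholders, not the ones actually used by the production scripts.

```python
#!/usr/bin/env python
"""Sketch: generate an EDG JDL description for a single CMSIM production job.
All file names and the software tag are hypothetical examples."""

def make_jdl(job_id, kine_file):
    """Return JDL text for one CMSIM job that reads a CMKIN output file."""
    return "\n".join([
        'Executable    = "cmsim_wrapper.sh";',                 # hypothetical job wrapper
        'Arguments     = "%s";' % job_id,
        'StdOutput     = "cmsim_%s.out";' % job_id,
        'StdError      = "cmsim_%s.err";' % job_id,
        'InputSandbox  = {"cmsim_wrapper.sh", "%s"};' % kine_file,
        'OutputSandbox = {"cmsim_%s.out", "cmsim_%s.err"};' % (job_id, job_id),
        # Assumed requirement: run only on CEs publishing the CMS software tag.
        'Requirements  = Member("CMS-CMSIM", other.RunTimeEnvironment);',
        'Rank          = other.FreeCPUs;',                     # assumed rank expression
    ])

if __name__ == "__main__":
    jdl = make_jdl("000123", "cmkin_000123.ntpl")
    open("cmsim_000123.jdl", "w").write(jdl + "\n")
    print(jdl)
```

The resulting .jdl file would then be handed to the EDG UI submission command and tracked through BOSS.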

  4. Production architecture (diagram). Components shown: the UI running IMPALA with the BOSS DB and the RefDB; JDL job submission through the GRID SERVICES (including the RC); CEs with the CMS sw installed and WNs running the JobExecuter and dbUpdator (runtime monitoring parameters, job output filtering); data written to the SEs and registered in the RC.

  5. Resources
  • The production is managed from 4 UIs:
    • Bologna / CNAF
    • Ecole Polytechnique
    • Imperial College
    • Padova
    (using several UIs reduces the bottleneck due to the BOSS DB)
  • Several RBs seeing the same Computing and Storage Elements:
    • CERN (dedicated to CMS) (EP UI)
    • CERN (common to all applications) (backup!)
    • CNAF (common to all applications) (Padova UI)
    • CNAF (dedicated to CMS) (CNAF UI)
    • Imperial College (dedicated to CMS and BABAR) (IC UI)
    (using several RBs reduces the bottleneck due to intensive use of the RB and the 512-owner limit in Condor-G)

  6. Resources

  7. Data management
  • Two practical approaches:
    • Bologna, Padova: FZ files (~230 MB each) are stored directly at CNAF and Legnaro
    • EP, IC: FZ files are stored where they have been produced and later replicated to a dedicated SE at CERN. Goal: test the creation of file replicas (see the sketch after this slide)
  • All sites use disk for file storage, but:
    • CASTOR at CERN: FZ files replicated to CERN are also automatically copied into CASTOR (thanks to a new staging daemon from WP2)
    • HPSS in Lyon: FZ files stored in Lyon are automatically copied into HPSS
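
A minimal sketch of the EP/IC replication step, assuming hypothetical SE hostnames and a placeholder command name; the test itself relied on the WP2 replica-management tools shipped with the EDG release, whose exact invocation is not reproduced here.

```python
#!/usr/bin/env python
"""Sketch: replicate freshly produced FZ files from the local SE to the
dedicated CERN SE and record the copies in the Replica Catalog.
SE hostnames and the command name are placeholders."""
import os

LOCAL_SE = "se.example.site"        # hypothetical source SE
CERN_SE = "cms-se.example.cern.ch"  # hypothetical dedicated CERN SE

def replicate(lfn):
    """Replicate one logical file name (LFN) and register the new replica."""
    # Placeholder invocation: substitute the actual WP2 replica-manager command.
    cmd = "replica-manager-replicate %s %s %s" % (lfn, LOCAL_SE, CERN_SE)
    return os.system(cmd) == 0

if __name__ == "__main__":
    fz_files = ["cmsim_%06d.fz" % i for i in range(1, 4)]  # example LFNs
    for lfn in fz_files:
        print("%s %s" % (lfn, "replicated" if replicate(lfn) else "FAILED"))
```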

  8. Online Monitoring (MDS based)
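
The online monitoring is built on Globus MDS, which publishes GRIS/GIIS information over LDAP (port 2135 by default), so a page like this can be fed by plain LDAP queries. Below is a minimal sketch with a hypothetical host and an assumed attribute name; the base DN shown is the common GRIS default, while a real GIIS would publish under its own mds-vo-name.

```python
#!/usr/bin/env python
"""Sketch: query an MDS GRIS/GIIS over LDAP, as an online monitoring page might.
Requires the OpenLDAP ldapsearch client; hostname and attribute are assumptions."""
import subprocess

GIIS_HOST = "giis.example.org"          # hypothetical information index host
BASE_DN = "mds-vo-name=local,o=grid"    # common MDS default suffix

def query_mds(attribute):
    """Run an anonymous ldapsearch and return the raw LDIF output."""
    cmd = ["ldapsearch", "-x", "-LLL",
           "-H", "ldap://%s:2135" % GIIS_HOST,
           "-b", BASE_DN, "(objectClass=*)", attribute]
    return subprocess.run(cmd, capture_output=True, text=True).stdout

if __name__ == "__main__":
    print(query_mds("FreeCPUs"))  # assumed attribute name for free CPUs per CE
```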

  9. Events vs. time (CMKIN)

  10. Events vs. time (CMSIM): ~7 sec/event on average; ~2.5 sec/event at peak (12-14 Dec)
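
For scale, converting the quoted CMSIM rates into events per day (assuming they describe the aggregate throughput of the whole testbed, as on the plot):

```python
# Convert the quoted CMSIM rates (sec/event) into events per day.
SECONDS_PER_DAY = 86400

for label, sec_per_event in [("average", 7.0), ("peak (12-14 Dec)", 2.5)]:
    print("%-18s ~%.0f events/day" % (label, SECONDS_PER_DAY / sec_per_event))

# average            ~12343 events/day
# peak (12-14 Dec)   ~34560 events/day
```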

  11. Final results (preliminary!)

  12. Main issues

  13. Chronology
  • 29/11 – 2/12: reasonably smooth
  • 3/12 – 5/12: “inefficiency” due to the CMS week
  • 6/12: RC problems begin; new collections created; Nagios monitoring online
  • 7/12 – 8/12: Information Index (II) in very bad shape
  • 9/12 – 10/12: deployment of 1.4.0; still problems with the RC; CNAF and Legnaro resources not available; problems with the CNAF RB
  • 11/12: top-level MDS stuck because of a CE in Lyon
  • 14/12 – 15/12: II stuck, most submitted jobs aborted
  • 16/12: failure in the grid-mapfile update because the NIKHEF VO LDAP server was not reachable

  14. Conclusions
  • Job failures are dominated by:
    • “Standard output of job wrapper does not contain useful data”:
      • many different causes
      • affects mainly “long” jobs
      • some patches with possible solutions have been implemented
    • Replica Catalog stops responding: no real solution yet, but RLS will be used soon
    • Information System (GRIS, GIIS, dbII): hopefully R-GMA will solve these problems
    • Lots of smaller problems (Globus, Condor-G, machine configuration, defective disks, etc.)
  • Short-term actions:
    • EDG 1.4.3 released on 14/1 and deployed on the PRODUCTION testbed
    • The test continues in “no-stress” mode:
      • in parallel with the review preparation (the testbed will remain stable)
      • it will measure the effect of the new GRAM-PBS script and the JSS-Maradona patches
