
LHCb DC’04 Status



  1. LHCb DC’04 Status. R. Graciani, September 8, 2004, GDB.

  2. LHCb DC’04 team
  • DIRAC: Andrei Tsaregorodtsev, Vincent Garonne, Ian Stokes-Rees.
  • Production management: Joel Closier, R. G. (LCG), Johan Blouw, Andrew Pickford … and the LHCb site managers.
  • LHCb Bookkeeping, Monitoring & Accounting: Markus Frank, Carmine Cioffi, Manuel Sanchez, Rubén Vizcaya.
  • LCG–LHCb liaison: Flavia Donno, Roberto Santinelli.
  • The LCG-GDA team: Ian Bird, Laurence Field, Maarten Litmaath, Markus Schulz, David Smith, Zdenek Sekera, Marco Serra…
  • The LCG site managers.
  GDB: LHCb DC'04 Status

  3. LHCb DC’04 aims
  Gather information for the LHCb Computing TDR.
  Physics goals:
  • HLT studies, consolidating efficiencies.
  • B/S studies, consolidating background estimates and background properties.
  This requires a quantitative increase in the number of signal and background events:
  • 30×10⁶ signal events (~80 physics channels).
  • 15×10⁶ specific backgrounds.
  • 125×10⁶ background (B inclusive + minimum bias, 1:1.8).
  DC’04 is split into 3 phases:
  • Production: MC simulation (done).
  • Stripping: event pre-selection (to start soon).
  • Analysis (in preparation).

  4. Phase 1 Completed
  186 M produced events.
  [Plot: cumulative production rate. Annotations: 1.8×10⁶ events/day with DIRAC alone, 3–5×10⁶ events/day with LCG in action; periods where LCG paused and restarted are marked.]

  5. LCG DIRAC Job
  • Input SandBox: small bash script (~50 lines).
  • Check environment: site, hostname, CPU, memory, disk space…
  • Install DIRAC: download the DIRAC tarball (~1 MB) and deploy DIRAC on the WN.
  • DIRAC Agent: requests a DIRAC task (LHCb simulation job).
  • Execute task:
    • Install LHCb software if not present in the VO shared area.
    • Event generation.
    • Detector simulation.
    • Digitization.
    • Reconstruction.
    • Check steps.
  • Upload results: log files, data files, bookkeeping reports.
  • Report status; mail developers on ERROR.
  • Retrieval of the SandBox.
  • Analysis of the retrieved Output SandBox.
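The pilot-agent flow above can be sketched as a minimal Python illustration. This is an assumption-laden sketch of the idea, not the real DIRAC code: the function names (`check_environment`, `run_pilot`), the task-queue shape, and the step structure are all invented for illustration.

```python
# Illustrative sketch of the DIRAC pilot-agent flow: check the worker node,
# request a task from the WMS, run its steps, stop on the first failure.
# All names here are hypothetical, not the actual DIRAC API.
import platform
import shutil

def check_environment():
    """Gather the worker-node facts the pilot reports (host, CPU, disk)."""
    usage = shutil.disk_usage("/")
    return {
        "hostname": platform.node(),
        "machine": platform.machine(),
        "free_gb": usage.free // 2**30,
    }

def run_pilot(task_queue, min_free_gb=1):
    """Request a simulation task and execute its steps in order."""
    env = check_environment()
    if env["free_gb"] < min_free_gb:          # not enough local disk: give up
        return {"status": "NO_RESOURCES", "done": []}
    if not task_queue:                        # WMS had no matching task
        return {"status": "NO_TASK", "done": []}
    task = task_queue.pop(0)
    done = []
    for step in task["steps"]:                # generation -> simulation -> ...
        if not step["run"]():
            return {"status": "FAILED:" + step["name"], "done": done}
        done.append(step["name"])
    return {"status": "DONE", "done": done}
```

A "check steps" failure mid-chain leaves the earlier steps recorded, which is what lets the agent report exactly where a job broke.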

  6. LCG Strategy
  Use all the available resources:
  • Test sites: each site is tested with special and production-like jobs.
  • Add the site to the production mask:
    • LCG job submission (BDII and/or JDL).
    • DIRAC WMS.
  • Submit jobs continuously, via a cron job on a UI:
    • Submit jobs.
    • Retrieve Sandboxes.
    • Check aborted jobs.
    • Check std.out.
  • At the end of August 4 RBs were in use (2 at CERN, 1 at RAL, 1 at CNAF).
  • Always keep jobs in the queues.
  • Use a limited proxy to avoid runaway jobs (Geant4 precision bug, hung network connections…).
  • Cancel jobs queued too long (to avoid proxy-expiration issues).
  • Closely follow problems and work with the LCG-GDA team & site managers.
  • All LCG-related data are kept on an AFS volume.
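The cron-driven loop above (keep jobs queued, cancel jobs queued too long) can be sketched roughly as follows. The function names and data shapes are assumptions for illustration, not the actual DC'04 scripts.

```python
# Sketch of the two queue-management rules: top up the waiting-job count,
# and cancel jobs queued long enough to risk proxy expiration.
import time

def top_up(waiting_jobs, target, submit):
    """Always keep jobs in the queues: submit pilots until `target` wait."""
    needed = max(0, target - waiting_jobs)
    for _ in range(needed):
        submit()
    return needed

def stale_jobs(queued_since, max_age_s, now=None):
    """Jobs queued longer than max_age_s risk proxy expiry: flag for cancel."""
    now = time.time() if now is None else now
    return [job for job, t in queued_since.items() if now - t > max_age_s]
```

Run from cron, these two calls plus sandbox retrieval reproduce the submit / retrieve / check cycle the slide lists.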

  7. Phase 1 Statistics (I)
  20 DIRAC sites, 43 LCG sites (plot).

  8. Phase 1 Statistics (II)
  20 DIRAC sites, 43 LCG sites; 424 CPU·years (plot).

  9. Phase 1 Statistics (III)

  10. DIRAC–LCG Share
  424 CPU·years in total (DIRAC : LCG):
  • May: 89% : 11% (11% of DC’04)
  • Jun: 80% : 20% (25% of DC’04)
  • Jul: 77% : 23% (22% of DC’04)
  • Aug: 27% : 73% (42% of DC’04)

  11. LCG Performance (I)
  [Plot: submitted, cancelled, and aborted-before-running jobs.]
  • 211 k submitted jobs.
  • After running: 113 k Done (successful), 34 k Aborted.

  12. LCG Performance (II)
  LCG Job Submission Summary Table. LCG efficiency: 61%.

  13. LCG Performance (III)
  LCG Job Submission Summary Plot.

  14. LCG Performance (IV)
  LCG running & waiting jobs during the LHCb DC (from the GOC LCG monitoring pages).

  15. LCG Performance (V)
  Output Sandbox analysis: 69 k successful jobs. Failure modes seen:
  • Missing python, failed DIRAC installation, failed connection to the DIRAC servers, failed software installation…
  • Errors while running applications (hardware, system, LHCb software…).
  • Errors while transferring or registering output data (can be recovered by retry).
  LHCb Accounting: 81 k LCG successful jobs.

  16. DIRAC Problems
  • Most DIRAC issues are due to temporary unavailability of the servers:
    • Information Server,
    • WorkLoad Manager Server (Receiver, Optimizer, Matcher),
    • Monitoring Server,
    • File Catalogue Server (AliEn or BK interface),
    • HTTP, SFTP, GRIDFTP, BBFTP servers.
  • Unavailability is due to hardware errors, software errors, and machine overloads. In most cases these errors were fatal for the running job. Additional redundancy and retry mechanisms have been put in place during the DC.
  • Need to monitor servers and functionality for fast response.
  • Small fraction of human errors: jobs submitted with the wrong software version, or to non-authorized sites.
  • Very small fraction of application errors: in many cases traced to some WN problem (disk full, cwd unavailable…).
  • Missing Monitoring/Accounting info from some jobs.
  • These problems affect both LCG and DIRAC production sites, but appear when taking the system to its limits (> 3500 concurrent jobs).
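The redundancy-and-retry idea mentioned above can be sketched in a few lines. This is a hedged illustration, not DIRAC code: the server list, the `request` callable, and the back-off schedule are all assumptions.

```python
# Sketch of retrying a request across redundant servers so that transient
# server unavailability is no longer fatal to the running job.
import time

def call_with_retry(servers, request, attempts=3, base_delay=0.0):
    """Try each redundant server in turn; if all fail, back off and retry
    the whole list; raise only after every attempt is exhausted."""
    last_error = None
    for attempt in range(attempts):
        for server in servers:
            try:
                return request(server)
            except ConnectionError as exc:       # transient unavailability
                last_error = exc
        time.sleep(base_delay * 2 ** attempt)    # exponential back-off
    raise last_error
```

With a single server and no retry, every one of these transient errors kills the job; with the wrapper above, only a sustained outage does.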

  17. LCG Problems
  • Large fraction of jobs aborted due to site misconfigurations before the LHCb jobs start to run (20%); in many cases even before the LCG wrapper starts to run.
  • Large fraction of jobs aborted after they start “Running” (18.5%).
  • Large fraction of “Successful” jobs that did not produce an Output SandBox: 69 k from LCG SandBoxes vs 81 k from LHCb Accounting.
  • Job scheduling problems due to “wrong” info from the CEs.
  • Resource ranking:
    • First based on Estimated Waiting Time (the default; not very useful).
    • Later based on waiting jobs, running jobs and total CPUs.
    • FuzzyRank overloads small sites; NormalRank increases the effect of problematic sites.
  • Error debugging: an incredibly difficult and time-consuming task; too many IDs and logs to be searched.
  • Hardware, software, hacker… problems in the dedicated UI and RB machines.
  • We have not tested the Proxy Server, RM…
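The two ranking behaviours noted above can be illustrated with a rough sketch. The formulas and weighting here are assumptions for illustration, not the exact EDG RB expressions: a deterministic rank steers every job to the single "best" site (amplifying one bad site), while a rank-weighted random choice spreads load but can still hit a small site whose few free slots fill instantly.

```python
# Illustrative rank expressions: a deterministic "spare slots" rank and a
# fuzzy, rank-weighted random choice among candidate sites.
import random

def normal_rank(site):
    """Deterministic rank: spare slots on the CE (higher is better)."""
    return site["cpus"] - site["running"] - site["waiting"]

def fuzzy_choice(sites, rng=random):
    """Rank-weighted random pick; weights are clamped to stay positive."""
    weights = [max(normal_rank(s), 0) + 1 for s in sites]
    return rng.choices(sites, weights=weights, k=1)[0]
```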

  18. DC’04 Lesson: DIRAC
  • Improve server availability:
    • Service monitoring.
    • Redundancy.
    • Error recovery.
  • Minimize the need for human intervention.
  • Improve error handling and reporting.
  • Test the hardware capacity of the servers for the expected workload (CPU, memory, disk, network…).

  19. DC’04 Lesson: LCG (I)
  • Limit the strong dependence on the LSRM: the wrapper outputs and job reports should carry all necessary info by themselves.
  • Improve the OutputSandBox upload/retrieval mechanism: it should also be available for Failed and Aborted jobs.
  • Improve the reliability of CE status-collection methods (timestamps?).
  • Add intelligence on the CE or RB to detect and avoid large numbers of jobs aborted on start-up: avoid a misconfigured site becoming a black hole.
  • Need to collect LCG log info and a tool to navigate it (including the different JobIDs).
  • Need a way to limit the CPU (and wall-clock) time: the LCG wrapper must issue appropriate signals to the user job to allow graceful termination.
  • Need to be able to allocate local disk space for the running job.
  • How-to manuals: clear instructions to site managers on the procedure to shut down a site (for maintenance and/or upgrade).
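The graceful-termination point above can be sketched as a minimal Python illustration of the idea (not the actual LCG wrapper or an LHCb job): the job installs a SIGTERM handler that only sets a flag, and checks that flag between processing steps, so on a CPU-limit signal it finishes the current step and exits cleanly instead of being killed mid-write.

```python
# Sketch of graceful termination: a signal from the wrapper sets a flag,
# and the job stops at the next safe point between steps.
import signal

class GracefulJob:
    """Run work step by step; on SIGTERM, finish the current step, then stop."""
    def __init__(self):
        self.stopping = False
        signal.signal(signal.SIGTERM, self._on_term)

    def _on_term(self, signum, frame):
        self.stopping = True          # flag is checked between steps

    def run(self, steps):
        results = []
        for step in steps:
            if self.stopping:         # graceful termination point
                break
            results.append(step())
        return results
```

The key design point is that the handler does no work itself; all cleanup happens in normal code at a known-safe point.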

  20. DC’04 Lesson: LCG (II)
  Personal comments:
  • Suggestion to change the definition of the job “Running” state: it should only be declared once the wrapper, or even the user job, is safely running on the WN.
  • Review the basic “edg-job-submit”, “edg-job-status”, “edg-job-get-output”:
    • e.g. if the RB is down and several jobs are included in the query, they may take hours to complete.
  • RB performance needs to be carefully monitored, and 24x7 support is needed for a production system:
    • Graphic tools: http://www.hep.ph.ic.ac.uk/~mp801/applet/
    • Automatic scripts.
  • Job assignment to a resource (queue on a CE) must be delayed until previously assigned jobs are submitted to the site (or fail).
  For production-like tasks (large numbers of jobs), improve responsiveness:
  • “edg-job-*” commands with bulk operations, creating an authenticated channel with the RBs to handle operations on all relevant jobs, with the possibility to retrieve the output as “python objects”, C++ structures, …
  • Re-use of previously uploaded Input Sandboxes.

  21. Status and Outlook
  • LHCb DC’04 Phase 1 is over.
  • The production target has been achieved:
    • 186 M events in 424 CPU·years.
    • ~50% on LCG resources (75–80% in the last weeks).
  • Right LHCb strategy: submitting “empty” DIRAC Agents to LCG has proven very flexible, allowing a success rate above that of LCG alone.
  • Big room for improvements, both in DIRAC and LCG:
    • DIRAC needs to improve the reliability of its servers (a big step was already taken during the DC).
    • LCG needs improvement in the single-job efficiency: ~40% of jobs aborted, ~10% did the work but failed from the LCG viewpoint.
    • In both cases extra protection against external failures (network, unexpected shutdowns…) must be built in.
  • Congratulations and warm thanks to the complete LCG team for their support and dedication.

  22. Outlook
  • Phase 2:
    • Stripping is in the last steps of preparation.
    • Need to run over 65 TB of data distributed over 4 Tier-1 sites (CERN, CNAF, FZK, PIC), with “small” CPU requirements.
  • Phase 3:
    • End-user analysis will follow.
    • GANGA tools in preparation.
  • (Phase 1): keep a continuous rate of production activity with programmed mini-DCs (e.g. a few days once a month).
