D0 Grid Data Production Phase 2 Progress Report

Version 1.0 (meeting edition), 05 February 2009
Rob Kennedy and Adam Lyon
Attending: RDK, AL, …
D0 Grid Data Production Initiative: Coordination Mtg

D0 Grid Data Production Overview • News and Summary • System running less smoothly over the past week • <list of causes> → Work these into plans too, where missing. • System is not keeping up with Raw Data logging (higher L). ~2x more CPU/job. • Need more CPU, improved/stable config, more CPU, high resource utilization, more CPU • Lull at times in January as old runs processed… now full impact of high L being felt. • Seeking a CPB time slot at next opportunity – last meeting cancelled • Brief summary of Phase 1 progress, Phase 2 Work list, and seeking D0 input on priorities • Exec Mtg w/D0 Spokespeople and Vicky on Feb. 6 • Phase 1 close-out with more operational experience, events/day = f(luminosity), etc. • Phase 2 proposal (at a roll-up level, with some time estimates and assignments) • Agenda • Phase 2 Work List (discussion 3) • Roll-up level with work tasks assigned where known, else planning task in place • Rough time estimates and assignments needed to kick this off. • Focus on getting some tasks moving now rather than polishing the plan before acting.

D0 Grid Data Production Phase 2 Work List Outline • 2.1 Capacity Management: Data Prod is not keeping up with data logging, period. • Capacity Planning: Model nEvents per Day – forecast CPU needed • Capacity Deployment: Procure, acquire, borrow CPU. Infrastructure capable? • Resource Utilization: Use what we have as much as possible. • 2.2 Configuration Management: Expanded system will need higher reliability • Decoupling: @ SAM station, @ Durable Storage, Durable Storage capacity, FWD4a/b • Stability, Reduced Effort: Deeper Queues, Scripted workflow vs PBS behavior • Resilience: Eliminate SPOF, add/improve redundancy • Management: Capture configuration and artifacts in CVS consistently. • 2.3 Operations-Driven Projects • Monitoring: Can we “see” above in real-time? Can we trace jobs “up” as well as down? • Issues: Address “stuck state” issue affecting both Data and MC Production • Features: Add state at queuing node (from Phase 1). Distribute jobs “evenly” across FWD. • Processes: Enable REX/Ops to deploy new Condor. Revisit capacity/config 2/year? • Phase 1 Follow-up: auto-update of gridmap files, FWD3 reboot

D0 Grid Data Production 2.1 Capacity Management • 2.1.1 Capacity Planning: Model nEvents per Day – improve to handle different Tevatron store profiles, etc. • Goal: What is required to keep up with data logging by 01 April 2009? • Goal: What is required to reduce backlog to 1 week’s worth by 01 August 2009? • Goal: What infrastructure is required to handle the CPU capacity determined above? • Major cause of difference between calculated max events/day and actual? (8.0E6 vs 6.5E6) → MD looking into • CPU(Reco) / CPU(Total): Only 70-85% of CPU may actually be used for reconstructing events. Rest = unwinding tarballs, I/O, and other overhead tasks. This is rather unexpected since reconstruction is thought to be so CPU intensive, but bears investigation now. • CPU/event = f(Reco version): p20.05 (current) consumes up to 10% more CPU at higher luminosity than p20.02 (whose CPU=f(L) was used above). • These two effects, if confirmed, could explain most of the difference between observed and calculated production capacity. • Reco Overhead vs. Reco “calculations” CPU: Relative proportion? Latencies impacted CPU utilization? → MD, Adam looking into • If large latencies, can greater CPU utilization be achieved by running more than 1 job per core? • Adam is mining XML database… LOTS of useful information on job timing, failures, etc. • Ling Ho: 8 core servers (2008) have scratch/data on system disk (only 1 disk). Have observed contention slowing jobs down. This may only be true for a small fraction now, BUT CRITICAL to consider for FY09 purchases! • Data Handling tuning (AL, RI): tuning SAM station, investigate if CPU waiting on data. • 2.1.2 Capacity Deployment: Goals to be determined by capacity planning • MEV: May be ~300 slots available from CDF old node retirements in next week or two. Just a little help, but worth the effort. • Re-org queues on CAB to make use of idle analysis CPU for weeks at a time… turn back over before conferences? • “Faster” Reco executable possible? We may ask D0 to consider this, since the price tag for expanded capacity may be high too. • 2.1.3 Resource Utilization: top-down investigation as well as bottom-up investigation. Examples… • Goal: > 90% of available CPU used by Data Prod (assuming demand limited at all times) • Goal: > 90% of available job slots used by Data Prod averaged over time (assuming demand limited at all times) • Goal: TBD… Uptime goal, downtimes limited • Outcome of Capacity Planning investigations: Immediate improvements, fixes possible?
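The claim that the two effects "could explain most of the difference" can be checked with simple arithmetic. The sketch below is illustrative only: it assumes the calculated maximum of 8.0E6 events/day was derived with 100% of CPU time spent in reconstruction at the p20.02 CPU cost, which the slide does not state. The function name is hypothetical.

```python
# Sketch only. Assumption (not stated on the slide): the calculated maximum
# of 8.0E6 events/day assumed 100% of CPU went to reconstruction at the
# p20.02 CPU cost per event.

CALCULATED_MAX = 8.0e6  # events/day, from the slide
OBSERVED = 6.5e6        # events/day, from the slide


def expected_events_per_day(calculated_max, reco_cpu_fraction, reco_cpu_increase):
    """Scale the calculated maximum by the two effects under investigation.

    reco_cpu_fraction: CPU(Reco) / CPU(Total), quoted as 0.70 to 0.85
    reco_cpu_increase: extra CPU/event of p20.05 over p20.02, quoted as up to 0.10
    """
    return calculated_max * reco_cpu_fraction / (1.0 + reco_cpu_increase)


if __name__ == "__main__":
    for fraction in (0.70, 0.85):
        for increase in (0.0, 0.10):
            estimate = expected_events_per_day(CALCULATED_MAX, fraction, increase)
            print(f"fraction={fraction:.2f} increase={increase:.2f} "
                  f"-> {estimate:.2e} events/day")
    print(f"observed / calculated = {OBSERVED / CALCULATED_MAX:.4f}")
```

Under these assumptions, an 85% reconstruction fraction gives roughly 6.2E6 to 6.8E6 events/day depending on the reco version penalty, which brackets the observed 6.5E6. A 70% fraction would undershoot the observation, so the measured fraction matters.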

D0 Grid Data Production 2.2 Configuration Management • 2.2.1 Decoupling • @ SAM station • @ Durable Storage • Durable Storage capacity given above. • Virtualization & FWD4 (FWD5 was used for decoupling; would like to retire it) • Estimate for FEF readiness to deploy virtualization? • 2.2.2 Queue Optimization for Stability, Utilization, Reduced Effort • Goal: System runnable day-to-day by Shifter, supervised by expert Coordinator • Deeper Queues, Scripted workflow vs PBS behavior • Few Long Batch Jobs holding Grid Job Slots – address with alt. config? • 2.2.3 Resilience • Eliminate SPOF, add/improve redundancy • 2.2.4 Management • Capture configuration and artifacts in CVS consistently.

D0 Grid Data Production 2.3 Operations-Driven Projects • 2.3.1 Monitoring • Can we “see” above in real-time? • Collect what we all have, define requirements (within resources available), and execute. • Can we trace jobs “up” as well as down? • Enhance existing script to automate batch job to grid job “drill-up”? • 2.3.2 Issues • Address “stuck state” issue affecting both Data and MC Production • Other issues to consider: From Joel? Large job issue from Mike? Context Server issues? • 2.3.3 Features • Add state at queuing node (from Phase 1) • Readiness of Condor fix? Kluge and follow-up w/official release later? • FWD Load Balancing: Distribute jobs “evenly” across FWD… however easiest to do or approximate. • 2.3.4 Processes • Enable REX/Ops to deploy new Condor. • Revisit capacity/config 2/year? • Continuous Service Improvement process development • 2.3.5 Phase 1 Follow-up • Enable auto-update of gridmap files on Queuing nodes • Enable monitoring on Queuing nodes • FWD3 not rebooted yet, so it has not picked up the ulimit-FileMaxPerProcess increase. • Lessons Learned from recent ops experience: <to be discussed>

D0 Grid Data Production Phase 2 Work List – Feb old slide • Work in Progress or Leftover from Phase 1 (continued) • AL’s Queuing Node monitoring – deploy on all QUE. Defer productizing until broader evaluation. • Planning/Observation/Investigation/Testing Work • Overall CPU Capacity and Utilization on CAB • Can usable CPUs be added to the system? Can more use of idle cycles be made without too negative an impact? • Future Configuration of CAB to maximize resource utilization (without degrading analysis user experience) • See Keith and Steve’s recommendations related to configuration. • Potential for opportunistic usage where cycles free? Concern over coupling to less stable workflows again, though. • Virtualization: ready yet? How to best apply to this system? FWD4 → FWD4a, FWD4b? • Monitoring: Gather “what we have” and viable requirements • Plan/prioritize next layer of decoupling and/or capacity increase: SAM Station and/or Durable Storage • Understand PBS capabilities: can automated scripts be written to submit reliably over a weekend? • Related to configuration optimization issue. Could be a moot issue if scripts can keep queues full w/o intervention • Evaluate New Ops Issues: easy to fix? (yes: do so; no: plan for later) • Large jobs killed due to lack of thumbnail access late Sat/early Sun. • Context Server issues… follow-up by MD&RI. • Tasks to simplify operation by shifters (overseen by coordinator) based on Heidi’s experience • (MD and JS): Condor not reporting the proper status of jobs under certain conditions. It appears that a state change for a job is overlooked (message lost?) and then the conceptual job state remains stuck at that point, even if the physical job is successful and ends. Requires manual checking for the physical job at the CE… ouch! Impacts both Data Prod and MC Prod. • JS: Will check MC Prod kluge list to look for other potential large-benefit, low-cost issue candidates for Initiative. • Plan out the Dependent Work… watch to see if/when can be addressed: List on next slide
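The manual check described for the "stuck state" issue (compare the state the grid layer reports with what the batch system at the CE actually knows) could in principle be scripted. The following is a minimal sketch, not the existing tooling: the state labels and the dictionary layout are hypothetical, and it assumes a mapping from grid job id to batch job already exists, so both inputs are keyed by grid job id.

```python
# Sketch only: state names and input layout are hypothetical, not taken from
# SAM-Grid, Condor, or PBS. Both dictionaries are assumed to be keyed by the
# grid job id (i.e. the grid-to-batch id mapping is already available).

ACTIVE_GRID_STATES = {"running"}                 # hypothetical grid-side label
TERMINAL_BATCH_STATES = {"completed", "exited"}  # hypothetical batch-side labels


def find_stuck_jobs(grid_states, batch_states):
    """Return grid job ids still reported as active by the grid layer although
    the batch job has ended or is no longer listed by the batch system."""
    stuck = []
    for job_id, grid_state in grid_states.items():
        if grid_state not in ACTIVE_GRID_STATES:
            continue
        batch_state = batch_states.get(job_id)  # None: batch system no longer lists it
        if batch_state is None or batch_state in TERMINAL_BATCH_STATES:
            stuck.append(job_id)
    return sorted(stuck)


if __name__ == "__main__":
    grid = {"g1": "running", "g2": "running", "g3": "done", "g4": "running"}
    batch = {"g1": "running", "g2": "completed", "g3": "completed"}
    print(find_stuck_jobs(grid, batch))
```

A report like this would only flag candidates for a shifter to inspect; it does not address the lost state change itself.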

D0 Grid Data Production Dependent Work – Feb old slide: Waiting on some condition • Waiting on new Condor release with requisite functionality • New SAM-Grid Release with support for new Job status value at Queuing node • Defect in new Condor (Qedit) prevents feature from working. Dev done and tested, feature disabled by default. • Kluge alternative or wait for Condor fix? Fix is in new Condor (? check), will be in new VDT release in a few weeks… • If fix is in new Condor, then we wait on that propagating to us via VDT. PM to check if fix in new Condor. • Status: New Condors (7.0, 7.2) do NOT have a fix. PM is requesting an estimate. Is kluge more acceptable to make progress? • Waiting on PBS Head Node upgrade OR new Condor release with requisite functionality • Optimize Configurations separately for Data and MC Production • January 2009: first trial of increasing “nJobs in queue on CAB” limits: trade-off with scheduler polling overhead • PBS head node upgrade may mitigate impact of polling • New version of Condor may lessen the polling itself • Increase Data Production “queue” depth to avoid empty queue conditions. Goal?: “Fill once for the weekend.” • Desire to have either ONE of the FWD nodes able to fill the running slots. • Status: PBS Head nodes installed, not tested yet. • May first be used to test recent PBS behavior and as a means to add redundancy with a failover mechanism • Waiting on “push to do” by RDK/AL, before major re-installations involving QUE1 or CAB system • Formalize transfer of QUE1 (samgrid.fnal.gov) to FEF from FGS (before an OS upgrade) and [complementary transfer of batch system from FEF to FGS – No, too much risk: QL] • Needs preparation, have each system meet requirements of new stewards. • Waiting on demonstrated need for Initiative to be involved rather than CD and/or D0 groups in place • MC Production-specific issues… so far REX/Ops handling with D0 experimenters’ involvement. However, issues impacting both MC and Data Prod will be given higher priority for consideration (see previous slide).

D0 Grid Data Production Phase 2 Work List – Mar old slide • General Grid System Work • Implement FWD4 → FWD4a (Data) and FWD4b (MC) • Retire FWD5 to test stand duty… or whatever. • Next Layer of decoupling/capacity increases • Decouple SAM Station: 1 each for Data and MC Prod = SRM use decoupled • Decouple Data vs MC local durable storage and add capacity if needed. • Next REX/Ops deployable Condor Upgrades (VDT/Condor product config) • Model in use elsewhere, takes Developers out of the loop. • Broader Work based on previous planning tasks • FermiGrid/CAB config changes per plan being developed in Feb. • May be worthwhile to wait on PBS head node upgrade • Also, consider queue configuration (more than one queue/user with different priorities) • Simplify working urgent data processing requests ahead of steady-state processing • May integrate with opportunistic processing ideas as well. • Monitoring Design, early Implementation • Slow FWD-CAB Job Transition: Install Monitoring and Alarm (alarm less called for nowadays) • Enable alarms, monitoring for all fcpd and Durable Storage Services • Workflow diagram-oriented monitoring – “see” bottlenecks, debugging.

D0 Grid Data Production Phase 2 Work List – Apr old slide • Finish Task Chains in Progress • Monitoring Implementation, robustness to change in system over time. • Example: we exploit all of d0farm being on cab2 for simple plotting. • Address Issues for Full-Load System? • FWD load balancing or approximation • Look at Preserving Progress over Time • Processes for Change Management, Continuous Service Improvement, Dev and Ops roles remaining as intended, … • Provisioning for dev/int/prd environments • Virtualization to provide 32/64-bit dev environment • Node retirements (life cycle) and impact on capacity • Enable/train REX/Ops to do limited small-scale development

D0 Grid Data Production Phase 2 Deferred Work List: Do not do until proven necessary and worthwhile • No known need at present (but track for historical record) • Uniform OS’s: Upgrade FWD1-3 and QUE1 to latest SLF 4.0, same as FWD4-5. • Only a minor OS version difference. Wait until this is needed to avoid another disruption and risk.

D0 Grid Data Production Background Slides: Original 16+4 Issues List

D0 Grid Data Production Issues List (p.1/4) (Red = not treated in Phase 1, Green = treated or non-issue, Yellow = notes) • 1) Unreliable state information returned by SAM-Grid: SAM-Grid under some circumstances does not return correct state information for jobs. Fixing this may entail adding some logic to SAM-Grid. • SAM-Grid Job Status development (see discussion on earlier slides). Delayed by Condor defect. • 2) Cleanup of globus-job-managers on forwarding nodes, a.k.a. “stale jobs”: The globus job managers on the forwarding nodes are sometimes left running long after the jobs have actually terminated. This eventually blocks new jobs from starting. • AL: Improved script to identify them and treat symptoms. Has not happened recently. But why is it happening at all? • Not specific to SAM-Grid Grid Production • 3) Scriptrunner on samgrid needs to be controlled, a.k.a. the Periodic Expression problem: This is now locking us out of all operation for ~1 hour each day. This is due to a feature in Condor 6 which we do not use, but which cannot be fully disabled either. Developers say this is fixed in Condor 7, but this has not been proven yet. • Condor 7 Upgrade – RESOLVED! • 4) CORBA communication problems with SAM station: The actual source of all CORBA problems is hard to pin down, but at least some of them seem to be associated with heavy load on samgfwd01 where the SAM station runs. Since the forwarding nodes are prone to getting bogged down at times, the SAM station needs to be moved to a separate node. • Move SAM station off of FWD1 – DONE! Context Server move as well – DONE! • 5) Intelligent job matching to forwarding nodes: SAM-Grid appears to assign jobs to the forwarding nodes at random without regard to the current load on the forwarding nodes. It will assign jobs to a forwarding node that has reached CurMatch max even if another forwarding node has job slots available. • Nothing in Phase 1. Later Phase may include a less effort-intensive approach to accomplish same result.
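Issue 5 contrasts random assignment with load-aware matching. As an illustration of the latter, the sketch below picks the forwarding node with the most free match slots and refuses to assign when every node is at its limit. It is not how SAM-Grid or Condor matchmaking is implemented; the function name, node names, and numbers are hypothetical.

```python
# Sketch only: illustrates load-aware selection, not the SAM-Grid matchmaker.
# Node names and limits below are hypothetical.


def pick_forwarding_node(nodes):
    """Choose the forwarding node with the most free match slots.

    nodes maps a node name to (current_matches, max_matches).
    Returns the chosen node name, or None if every node is at its limit.
    """
    best_name, best_free = None, 0
    for name in sorted(nodes):  # sorted for a deterministic tie-break
        current, limit = nodes[name]
        free = limit - current
        if free > best_free:
            best_name, best_free = name, free
    return best_name


if __name__ == "__main__":
    load = {"fwd_a": (10, 10), "fwd_b": (4, 10)}
    print(pick_forwarding_node(load))  # fwd_a is full, so fwd_b is chosen
```

Choosing by free slots rather than at random is one simple way to approximate "even" distribution; a cheaper approximation, such as round-robin that skips full nodes, may be sufficient in practice.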

D0 Grid Data Production Issues List (p.2/4) (Red = not treated in Phase 1, Green = treated or non-issue, Yellow = notes) • 6) Capacity of durable location servers: Merge jobs frequently fail due to delivery timeouts of the unmerged thumbnails. We need to examine carefully what functions the durable location servers are providing and limit activity here to production operations. Note that when we stop running Recocert as part of the merge this problem will worsen. • Nothing in Phase 1. Later Phase may include decoupling of durable location servers? • No automatic handling of hardware failure. System keeps trying even if storage server down. • 7) CurMatch limit on forwarding nodes: We need to increase this limit which probably implies adding more forwarding nodes. We would also like to have MC and data production separated on different forwarding nodes so response is more predictable. • Decouple FWD nodes between Data and MC Production and tune separately for each – DONE. • Can now tune to optimize for Data Production. • 8) Job slot limit on forwarding nodes: The current limit of 750 job slots handled by each forwarding node has to be increased. Ideally this would be large enough that one forwarding node going down only results in slower throughput to CAB rather than a complete cutoff of half the processing slots. Could be addressed by optimizing fwd node config for data production. • Decouple FWD nodes between Data and MC Production and tune separately for each – DONE. • Can now tune to optimize for Data Production. • 9) Input queues on CAB: We have to be able to fill the input queues on CAB to their capacity of ~1000 jobs. The configuration coupling between MC and data production that currently limits this to ~200 has to be removed. Could be addressed by optimizing fwd node config for data production. • Decouple FWD nodes between Data and MC Production and tune separately for each – DONE. • Can now tune to optimize for Data Production.

D0 Grid Data Production Issues List (p.3/4) (Red = not treated in Phase 1, Green = treated or non-issue, Yellow = notes) • 10) 32,001 Directory problem: Acceptable band-aid is in place, but we should follow up with Condor developers to communicate the scaling issue of storing job state in a file system given the need to retain job state for tens of thousands of jobs in a large production system. • Already a cron job to move information into sub-directories to avoid this. • 11) Spiral of Death problem: See for example reports from 19-21 July 2008. Rare, but stops all processing. We do not understand the underlying cause yet. The only known way to address this situation is to do a complete kill/cold-stop and restart of the system. • Condor 7 Upgrade? May be different causes in other episodes... Only one was understood. • Decouple FWD nodes between Data and MC Production and tune separately for each. (mitigation only) • 12) Various Globus errors: We have repeated episodes where a significant number of jobs lose all state information and fall into a "Held" state due to various Globus errors. These errors are usually something like "Job state file doesn't exist", "Couldn't open std out or std err", "Unspecified job manager error". Mike doesn't think we have ever clearly identified the source of these errors. His guess is they have a common cause. The above errors tend to occur in clusters (about half a dozen showed up last night, that's what brought it to mind). They usually don't result in the job failing, but such jobs have to be tracked by hand until complete and in some cases all log information is lost. • Later Phase to include more detailed debugging with more modern software in use. • At least some issues are not SAM-Grid specific and are known not to be fixed by VDT 1.10.1m (KC). • For example: GAHP server... Part of Condor • 13) Automatic restart of services on reboot: Every node in the system (samgrid, samgfwd, d0cabosg, etc) needs to be set up to automatically restart all necessary services on reboot. We have lost a lot of time when nodes reboot and services do not come back up. SAM people appear to not get any notification when some of these nodes reboot. • Done during Evaluation Phase. Make sure this is set up on new nodes as well. – DONE!
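The band-aid for issue 10 (a cron job that moves job information into sub-directories) can be sketched as follows. This is not the deployed script. It assumes the underlying problem is a file-system cap on the number of subdirectories in one directory, and the bucket naming scheme and function names are hypothetical. The sketch does not check whether a job is finished before moving its directory; a real job would need that guard.

```python
# Sketch only: bucket naming and layout are hypothetical. No check is made
# that a job has finished before its directory is moved.
import os
import shutil

BUCKET_PREFIX = "bucket_"  # hypothetical naming scheme


def _bucket_path(root, index):
    return os.path.join(root, f"{BUCKET_PREFIX}{index:05d}")


def bucket_job_dirs(root, bucket_size=1000):
    """Move job directories found directly under root into numbered buckets.

    Each bucket holds at most bucket_size entries, keeping every directory
    well below the per-directory limit. Returns the number of directories moved.
    """
    names = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    jobs = [n for n in names if not n.startswith(BUCKET_PREFIX)]
    buckets = [n for n in names if n.startswith(BUCKET_PREFIX)]
    # Resume with the last existing bucket so repeated runs keep filling it.
    index = max(len(buckets) - 1, 0)
    for name in jobs:
        os.makedirs(_bucket_path(root, index), exist_ok=True)
        while len(os.listdir(_bucket_path(root, index))) >= bucket_size:
            index += 1
            os.makedirs(_bucket_path(root, index), exist_ok=True)
        shutil.move(os.path.join(root, name),
                    os.path.join(_bucket_path(root, index), name))
    return len(jobs)
```

Run periodically, this keeps the top-level directory count near the number of buckets rather than the number of jobs, which is why it works as a band-aid without addressing the scaling issue raised with the Condor developers.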

D0 Grid Data Production Issues List (p.4/4) (Red = not treated in Phase 1, Green = treated or non-issue, Yellow = notes) • 14) SRM needs to be cleanly isolated from the rest of the operation: This might come about as a natural consequence of some of the other decoupling actions. A more global statement would be that we have to ensure that problems at remote sites cannot stop our local operations (especially if the problematic interaction has nothing to do with the data processing operation). • Nothing in Phase 1. Later Phase to include decoupling of SAM stations, 1 each for Data and MC Production. • 15) Lack of Transparency: No correlation between the distinct grid and PBS id’s and inadequate monitoring mean it is very difficult to track a single job through the entire grid system, especially important for debugging. • Tool identified in Evaluation Phase to help with this. Consider refinement in later Phase. • 16) Periods of Slow Fwd node to CAB Job transitions: related to Spiral of Death issue? • Condor 7 Upgrade and increase ulimit-OpenFileMaxPerProcess to high value used elsewhere. • Cures all observed cases? Not yet sure. • MC-specific Issue #1) File Delivery bottlenecks: use of SRM at site helps • Out of scope for Phase 1. SRM specification mechanism inadequate. Should go by the site name or something more specific. • MC-specific Issue #2) Redundant SAM caches needed in the field • Out of scope for Phase 1 • MC-specific Issue #3) ReSS improvements needed, avoid problem sites, … • Out of scope for Phase 1. PM sent doc, met with Joel. • MC-specific Issue #4) Get LCG forwarding nodes up and running reliably • Out of scope for Phase 1. This is being worked on outside of Initiative Phase 1 though.
