The RunJob Project aims to automate job creation and submission for various High Energy Physics (HEP) applications, while seamlessly extending to new environments and applications. By linking jobs in tree-like data flows and integrating with catalog services, the project ensures robust tracking and production control. The proposal includes detailed requirements, development plans, and manpower estimates necessary for optimal integration with experiments like CDF, CMS, and DZero. Emphasizing rigorous testing and documentation, this initiative seeks to meet the evolving needs of HEP projects.
The RunJob Project: A Proposal. Greg Graham, FNAL CD
What is RunJob? • Automatic Job Creation and Submission • Metadata description of job steps • Produces jobs for a variety of environments • Easy to extend to new applications or new environments • Metadata model extends to catalogs or services to do production control or tracking • Links jobs together in tree-like dataflow arrangements
Who Uses RunJob? • DZero • Monte Carlo Challenges (CHEP 2000, CHEP 2001) • User Monte Carlo production • SAMGrid production (Under construction) • Data Reprocessing • CMS • Monte Carlo Challenges (CHEP 2003) • USCMS Grid, LCG based Grid production • User Monte Carlo Production (Under construction) • Data Reprocessing
The RunJob Pilot Project • Begun in early Spring 2003 to “merge” the then-divergent DZero and CMS versions • ShahKar package created and developed during Summer 2003 with input from DZero and CMS reps. • ShahKar merged with CMS variant MCRunjob in Fall 2003 • but not propagated. • DZero integration pushed back to April 2004.
Proposal for a Full RunJob Project • Increase manpower to accomplish • better integration with experiments’ planning processes: CDF, CMS, DZero, others? • integration of codebases with the ShahKar code base from pilot project • features and core development happen in RunJob project to satisfy experiments’ needs and schedule • rigorous testing and debugging support, documentation, and release management
Requirements and Features • The RunJob Project Plan that was distributed comes with some very generally stated “requirements” and a lot of very specific work items • Reflects the need to begin talking to the experiments to tighten up the requirements and map them to specific work items. • The work items reflect developments ongoing within the RunJob pilot project • 12 man-years of experience building production processing systems for HEP applications in many different environments.
Requirements and Features • “It automatically generates jobs to run my application(s) in a variety of environments” • scriptObject design is a way to better abstract the job descriptions away from the jobs themselves and therefore away from the environments. These are like internal “sandboxes”. (Critical, needed by all) • Development work will include modules tailored for specific environments such as LSF, FBS, PBS, Condor, etc. (Critical, needed by all) • Development work will also include Grid environments and Web Services design work. (TBD, who needs this and when?)
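The separation between a job description and the environment it runs in can be sketched as follows. This is a minimal illustration, not the actual scriptObject design: the names `JobStep`, `render_local`, `render_pbs`, and `make_job` are hypothetical, and only a plain shell script and a PBS script are shown.

```python
from dataclasses import dataclass, field


@dataclass
class JobStep:
    """Environment-neutral description of one job step (hypothetical model)."""
    name: str
    executable: str
    arguments: list = field(default_factory=list)

    def command_line(self) -> str:
        # The command itself never mentions the batch system.
        return " ".join([self.executable, *self.arguments])


def render_local(step: JobStep) -> str:
    """Render the step as a plain shell script."""
    return f"#!/bin/sh\n{step.command_line()}\n"


def render_pbs(step: JobStep) -> str:
    """Render the step as a PBS batch script; '#PBS -N' sets the job name."""
    return f"#!/bin/sh\n#PBS -N {step.name}\n{step.command_line()}\n"


# One renderer per environment; supporting a new environment means
# adding an entry here, without touching the job description.
RENDERERS = {"local": render_local, "pbs": render_pbs}


def make_job(step: JobStep, environment: str) -> str:
    """Produce the script text for the given environment."""
    try:
        renderer = RENDERERS[environment]
    except KeyError:
        raise ValueError(f"no renderer for environment {environment!r}") from None
    return renderer(step)
```

For example, `make_job(JobStep("gen", "generate_events", ["--nevents", "100"]), "pbs")` yields a PBS script, and the same `JobStep` passed with `"local"` yields a plain shell script. The executable name here is made up for illustration.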
Requirements and Features • “Later, I can go back and determine how the job was configured.” • Physics parameters and defaults should not come from RunJob itself (Critical) • Contexts are documents that can record suitable defaults for various applications, groups of applications, or environments. (Critical) • Contexts can currently be combined in a rudimentary fashion; better (algebraic) combination rules lead to more expressiveness and better control in complex environments (TBD; who needs it and when?)
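One simple combination rule for contexts is "later overrides earlier", while remembering which context supplied each value so the configuration can be reconstructed afterwards. The sketch below assumes contexts can be represented as plain dictionaries; `combine_contexts` and the parameter names are hypothetical, and this is not the combination rule RunJob actually implements.

```python
def combine_contexts(named_contexts):
    """Combine contexts given as (name, dict) pairs, lowest priority first.

    Returns two dicts: the effective values, and the name of the
    context that supplied each value.
    """
    values = {}
    origin = {}
    for name, context in named_contexts:
        for key, value in context.items():
            values[key] = value   # a later context overrides an earlier one
            origin[key] = name    # record where the effective value came from
    return values, origin


# Hypothetical contexts for an application, a site, and a user.
application = {"generator": "pythia", "nevents": 1000}
site = {"nevents": 250, "scratch_dir": "/tmp"}
user = {"nevents": 50}

values, origin = combine_contexts(
    [("application", application), ("site", site), ("user", user)]
)
```

Here `values["nevents"]` comes from the user context and `origin` records that fact, which is the kind of information needed to answer "how was this job configured?" later.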
Requirements and Features • “I need to build jobs across datasets listed in a catalog using parameters in a control DB.” • Observation: everyone comes around to doing this eventually ;-) • Uniform interfaces to catalogs and control databases potentially decrease maintenance costs for all experiments and increase adaptability to new systems. (TBD, who needs it and when?) • Interfaces to specific catalogs and control DBs are an integration task. (Critical.)
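A uniform interface of this kind might look like the following sketch, where job-building code depends only on two small interfaces and each experiment supplies its own implementations. The `Catalog` and `ControlDB` protocols, the in-memory stand-ins, and `build_jobs` are all hypothetical names introduced for illustration.

```python
from typing import Iterable, Protocol


class Catalog(Protocol):
    """Anything that can list dataset names."""
    def datasets(self) -> Iterable[str]: ...


class ControlDB(Protocol):
    """Anything that can return job parameters for a dataset."""
    def parameters(self, dataset: str) -> dict: ...


class InMemoryCatalog:
    """Stand-in for an experiment-specific catalog."""
    def __init__(self, names):
        self._names = list(names)

    def datasets(self):
        return iter(self._names)


class InMemoryControlDB:
    """Stand-in for an experiment-specific control database."""
    def __init__(self, defaults, overrides=None):
        self._defaults = dict(defaults)
        self._overrides = dict(overrides or {})

    def parameters(self, dataset):
        # Per-dataset overrides win over the defaults.
        return {**self._defaults, **self._overrides.get(dataset, {})}


def build_jobs(catalog: Catalog, control: ControlDB) -> list:
    """Build one job description per dataset listed in the catalog."""
    return [
        {"dataset": name, "parameters": control.parameters(name)}
        for name in catalog.datasets()
    ]
```

Because `build_jobs` only uses the two protocol methods, replacing the in-memory classes with adapters for a real catalog or control database would not require changes to the job-building code.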
Requirements and Features • “I need to resubmit jobs when they fail” • Specification of RunJob state just before job creation/submission; this is the “XML” milestone. (Critical) • Storage of RunJob state specifications in an XML database or filesystem. (Critical) • Interface to specific job tracking systems designed by the experiments to do this. (TBD, who needs this and when?)
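The idea of capturing state just before submission, so that a failed job can be recreated, can be sketched with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`runjob_state`, `parameter`, `job`, `name`) are assumptions made for this example and do not reflect the actual RunJob XML format.

```python
import xml.etree.ElementTree as ET


def state_to_xml(job_name: str, parameters: dict) -> str:
    """Serialize a job's parameters to an XML string (hypothetical schema)."""
    root = ET.Element("runjob_state", {"job": job_name})
    for key, value in sorted(parameters.items()):
        # Values are stored as text, so they come back as strings.
        ET.SubElement(root, "parameter", {"name": key}).text = str(value)
    return ET.tostring(root, encoding="unicode")


def state_from_xml(document: str):
    """Recover the job name and its parameters from a stored XML string."""
    root = ET.fromstring(document)
    parameters = {
        element.get("name"): element.text or ""
        for element in root.findall("parameter")
    }
    return root.get("job"), parameters
```

A resubmission step would read the stored document back with `state_from_xml` and hand the recovered name and parameters to whatever code created the job in the first place. Note that all values round-trip as strings, so type information would need to be stored separately if it matters.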
Requirements and Features • “I need feature X working by my experiment’s milestone Y.” • These need to be worked out during the negotiation phase this Spring. • The specific work items listed in the plan probably cover the foreseeable requirements that will come up during the negotiation phase. • … so on to the manpower estimates ;-)
Manpower Estimates • My favorite quote: “The plan is OK except possibly for the schedule and the manpower.” • For each listed milestone/deliverable/feature, a SWAG estimate was produced. The SWAGs were then summed, and the result was inflated by 25%. • 40 man-months total effort, not including management or testing. • The Level of Effort (LOE) measure was used • essentially equal to the average number of “warm bodies” active for the duration of the schedule. • Total FTE = LOE × project duration.
Schedule Changes • Deferment cost estimates • Project management and essential functions LOE remain constant • Development-driven functions scale against schedule length • Adjusted average LOE = 1.6 + 4.2/(length) • Risk: Can we satisfy the experiments’ milestones?
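The effect of stretching the schedule can be worked through numerically. The sketch below evaluates the slide's formula, adjusted average LOE = 1.6 + 4.2/(length), and the earlier relation total effort = LOE × project duration. The slides do not state the unit of "length"; years are assumed here. One observation in favor of that reading, which is an inference and not something the slides say: if the 40 man-months is the pre-inflation sum, then 40 × 1.25 = 50 man-months, or about 4.2 man-years, which matches the scaled term. The function names are hypothetical.

```python
CONSTANT_LOE = 1.6        # management and essential functions (from the slide)
DEVELOPMENT_EFFORT = 4.2  # development effort spread over the schedule (from the slide)


def adjusted_loe(length: float) -> float:
    """Average level of effort for a schedule of the given length.

    The unit of `length` is assumed to be years.
    """
    if length <= 0:
        raise ValueError("schedule length must be positive")
    return CONSTANT_LOE + DEVELOPMENT_EFFORT / length


def total_effort(length: float) -> float:
    """Total effort = LOE * project duration."""
    return adjusted_loe(length) * length


# Compare a few schedule lengths (years, by assumption).
for years in (1.0, 1.5, 2.0):
    print(f"{years:.1f}: LOE={adjusted_loe(years):.2f}, total={total_effort(years):.2f}")
```

Under this formula a longer schedule lowers the average headcount but raises the total effort, because the constant 1.6 term accrues for the whole duration while the 4.2 development term stays fixed.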
Schedule Changes • Cutting Work Items • Analysis cannot really be done without experiments’ input • Cutting Project Roles (e.g., dedicated testing) • Analysis cannot really be done without experiments’ input • There are probably some savings here: development could be pushed further up the integration food chain and into the experiments’ variant codebases themselves. • We recommend against this because it dilutes the benefits of cooperation.
Conclusion • The RunJob project is an exciting opportunity for the RunII experiments and CMS to collaborate on software. • DZero and CMS already use fairly closely related variants. • The RunJob project can build upon • the experience of many people who have already been working on it for years • a successful pilot project that already minimally satisfies many requirements • We are eager to work with the experiments to gather their requirements and milestones and address them coherently across the experiments.