Dynamic DAGMan with ClassAds
This document outlines the integration of Dynamic DAGMan and ClassAds for effective workflow management in Condor. DAGMan facilitates the management of job dependencies in a directed acyclic graph format, ensuring orderly execution of tasks. With dynamic capabilities, it allows for runtime decision-making and conditional execution based on job states. The ClassAds framework provides a flexible way to define job attributes, resources, and conditions, enhancing the matchmaking process within the execution environment. This synergy promotes efficient job handling and recovery in complex workflows.
Presentation Transcript
Himani Apte Dynamic DAGMan with ClassAds
Outline • DAGMan workflow management • Motivation for dynamic DAGMan • ClassAds • Putting together: DAGMan + ClassAds • Looking ahead
DAGMan • Directed Acyclic Graph Manager • Meta-scheduler for Condor • DAG: set of jobs with dependencies • Manages submission of DAG jobs • Enforces execution order • DAGMan itself is a Condor job!
Example DAG
    Job A A.condor
    Job B B.condor
    Job C C.condor
    Job D D.condor
    Parent A Child B C
    Parent B C Child D
    Script PRE A input.sh
    Script POST D output.sh
[Diagram: diamond-shaped DAG with A at the top, B and C in the middle, and D at the bottom]
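To make the dependency structure of the example concrete, here is a minimal Python sketch that parses the DAG description above and reports which nodes are ready to run. The parser is a toy written for this illustration; it handles only the Job, Parent/Child, and Script keywords shown on the slide and is not DAGMan's own parser. The names `parse_dag` and `ready_nodes` are hypothetical.

```python
from collections import defaultdict

DAG_TEXT = """\
Job A A.condor
Job B B.condor
Job C C.condor
Job D D.condor
Parent A Child B C
Parent B C Child D
Script PRE A input.sh
Script POST D output.sh
"""

def parse_dag(text):
    """Return (jobs, parents, scripts) parsed from DAG-file text."""
    jobs = {}                    # node name -> submit file
    parents = defaultdict(set)   # node name -> names of its parent nodes
    scripts = {}                 # ("PRE" or "POST", node name) -> script
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        keyword = words[0].upper()
        if keyword == "JOB":
            jobs[words[1]] = words[2]
        elif keyword == "PARENT":
            # Everything before "Child" is a parent, everything after is a child.
            split = [w.upper() for w in words].index("CHILD")
            for child in words[split + 1:]:
                parents[child].update(words[1:split])
        elif keyword == "SCRIPT":
            scripts[(words[1].upper(), words[2])] = words[3]
    return jobs, parents, scripts

def ready_nodes(jobs, parents, done):
    """Nodes that are not done yet and whose parents have all completed."""
    return sorted(
        name for name in jobs
        if name not in done and parents.get(name, set()) <= done
    )

jobs, parents, scripts = parse_dag(DAG_TEXT)
print(ready_nodes(jobs, parents, set()))        # only A has no parents
print(ready_nodes(jobs, parents, {"A"}))        # B and C become ready
print(ready_nodes(jobs, parents, {"A", "B"}))   # D still waits for C
```

D becomes ready only after both B and C are done, which is the execution order the Parent/Child lines enforce.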
Simplified state diagram of a DAG node • States shown: Waiting, Pre-running, Submitted, Post-running, Done, Failed
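The node states above can be sketched as a small state machine. The transcript lists only the state names, so the transition table below is an assumption made for this illustration: a node moves from Waiting through an optional PRE script, job submission, and an optional POST script to Done, and any active stage may end in Failed. `NodeState`, `ALLOWED`, and `advance` are hypothetical names.

```python
from enum import Enum

class NodeState(Enum):
    WAITING = "Waiting"
    PRE_RUNNING = "Pre-running"
    SUBMITTED = "Submitted"
    POST_RUNNING = "Post-running"
    DONE = "Done"
    FAILED = "Failed"

# Assumed transitions; the arrows of the original diagram are not in the transcript.
ALLOWED = {
    NodeState.WAITING: {NodeState.PRE_RUNNING, NodeState.SUBMITTED},
    NodeState.PRE_RUNNING: {NodeState.SUBMITTED, NodeState.FAILED},
    NodeState.SUBMITTED: {NodeState.POST_RUNNING, NodeState.DONE, NodeState.FAILED},
    NodeState.POST_RUNNING: {NodeState.DONE, NodeState.FAILED},
    NodeState.DONE: set(),
    NodeState.FAILED: set(),
}

def advance(current, new):
    """Return the new state, or raise ValueError if the move is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new

# A node with both a PRE and a POST script that completes successfully.
state = NodeState.WAITING
for step in (NodeState.PRE_RUNNING, NodeState.SUBMITTED,
             NodeState.POST_RUNNING, NodeState.DONE):
    state = advance(state, step)
print(state.value)
```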
DAGMan: important properties • Monitors job state using Condor logs • Simple and clean recovery model • Rescue DAG: saves state at failure • Restart: reconstruct internal state • Scripts allow “lazy” planning • Throttling parameters
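As an illustration of the throttling idea, the following Python sketch caps how many nodes may be in the queue at once. It is a simplified model, not DAGMan's implementation; the function name `select_for_submission`, the parameter `max_jobs`, and the convention that 0 means "no limit" are assumptions of this sketch.

```python
def select_for_submission(ready, num_submitted, max_jobs):
    """Pick ready nodes to submit without exceeding max_jobs queued jobs.

    ready         -- names of nodes whose dependencies are satisfied
    num_submitted -- number of node jobs currently in the queue
    max_jobs      -- throttle; 0 is treated as "no limit" in this sketch
    """
    ready = list(ready)
    if max_jobs <= 0:
        return ready
    free_slots = max(0, max_jobs - num_submitted)
    return ready[:free_slots]

# With a limit of 3 and 2 jobs already queued, only one more node is submitted.
print(select_for_submission(["B", "C"], num_submitted=2, max_jobs=3))
```

The remaining ready nodes simply wait until a queued job finishes and frees a slot.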
Outline • DAGMan workflow management • Motivation for dynamic DAGMan • ClassAds • Putting together: DAGMan + ClassAds • Looking ahead
Motivation for dynamic DAGMan • DAG: complete execution order • Flexibility to make run-time decisions • Which subset of DAG nodes should execute? • When should node X execute? • Conditional DAGs • Associate a condition with DAG edges • Simplest condition: successful completion of parent nodes
Conditional DAG: examples • Example 1: node A with condition A.x == true; the Yes branch leads to B, the No branch leads to C • Example 2: parents P1 and P2 with condition P1.x OR P2.x; child C
Motivation for dynamic DAGMan • Scripts can be leveraged for lazy planning • For simple conditions • E.g. exit value of job • Modify DAG structure • E.g. convert branch-not-taken to no-op/empty • We want a generic solution • Supported by “Dynamic DAGMan”
Outline • DAGMan workflow management • Motivation for dynamic DAGMan • ClassAds • Putting together: DAGMan + ClassAds • Looking ahead
ClassAds • Classified advertisements • Used extensively in Condor • Define jobs, machines, resources • Define conditions, triggers, requirements • Maintain internal state
ClassAds • List of attribute-value pairs • Simple value types: integers, strings • Complex types: lists, expressions, nested ClassAds • Matchmaking framework • Tests for a match between two ClassAds • Using the “Requirements” expression • Great fit for Dynamic DAGMan
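The matchmaking idea can be sketched in Python by modelling each ClassAd as a dictionary whose "Requirements" entry is a predicate over the ad itself and the other ad. This is only an analogy: real ClassAds use their own expression language, and the attribute names `NeededMemoryMB`, `MemoryMB`, and `Owner`, as well as the function `matches`, are hypothetical.

```python
def matches(ad_a, ad_b):
    """Two ads match when each ad's Requirements accepts the other ad."""
    return bool(ad_a["Requirements"](ad_a, ad_b)) and \
           bool(ad_b["Requirements"](ad_b, ad_a))

# A job ad that needs a certain amount of memory (hypothetical attributes).
job_ad = {
    "Owner": "some_user",
    "NeededMemoryMB": 512,
    "Requirements": lambda my, other: other.get("MemoryMB", 0) >= my["NeededMemoryMB"],
}

# Two machine ads; each accepts any job that states an owner.
big_machine = {
    "MemoryMB": 1024,
    "Requirements": lambda my, other: "Owner" in other,
}
small_machine = {
    "MemoryMB": 256,
    "Requirements": lambda my, other: "Owner" in other,
}

print(matches(job_ad, big_machine))    # the large machine satisfies the job
print(matches(job_ad, small_machine))  # the small machine does not
```

The check is symmetric by design: a match requires both Requirements expressions to hold, each evaluated against the other ad.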
Outline • DAGMan workflow management • Motivation for dynamic DAGMan • ClassAds • Putting together: DAGMan + ClassAds • Looking ahead
Putting together: DAGMan + ClassAds • Dynamic DAGMan research project • Work-in-progress • Not yet available in Condor • DAG nodes have associated classAds • Basic node attributes • Job identifier, name, type • Status (Waiting, Submitted, Done, etc.)
Dynamic DAGMan: attributes • Execution characteristics of job • Exit value • Wall-clock time • CPU utilization (local and remote) • Network statistics (bytes sent / received) • Information about files transferred (for vanilla universe) • Attributes maintained by Condor for a job
Dynamic DAGMan: conditions • Requirements expression • Defines trigger condition for the node • Arbitrarily complex expression • Defined on the attributes of parent nodes • Use matchmaking to determine if a node can be submitted
Dynamic DAG: example
    Job A A.condor
    Job B B.condor
    Job C C.condor
    Parent A Child B \
        COND [ ( other.job == A && other.x == true ) ]
    Parent A Child C \
        COND [ ( other.job == A && other.x == false ) ]
[Diagram: A evaluates the condition x == true; the Yes branch leads to B, the No branch leads to C]
Dynamic DAGMan: example
    Job P1 P1.condor
    Job P2 P2.condor
    Job C C.condor
    Parent P1 P2 Child C \
        COND [ (other.job == P1 && other.x == true) ||
               (other.job == P2 && other.x == true) ]
[Diagram: P1 and P2 both point to C; condition: P1.x OR P2.x]
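The two conditional examples can be mimicked in Python to show how a COND expression might gate submission of a child node. Since Dynamic DAGMan is described here as a work in progress, this is a sketch under assumptions rather than its actual evaluation rule: each parent is represented by a dictionary standing in for its ClassAd, `other` is bound to one parent ad at a time, and a child is considered triggered when any completed parent's ad satisfies the condition. The names `can_submit`, `cond_b`, `cond_c`, and `cond_p` are hypothetical.

```python
def cond_b(other):
    # COND [ ( other.job == A && other.x == true ) ]
    return other.get("job") == "A" and other.get("x") is True

def cond_c(other):
    # COND [ ( other.job == A && other.x == false ) ]
    return other.get("job") == "A" and other.get("x") is False

def cond_p(other):
    # COND [ (other.job == P1 && other.x == true) ||
    #        (other.job == P2 && other.x == true) ]
    return (other.get("job") == "P1" and other.get("x") is True) or \
           (other.get("job") == "P2" and other.get("x") is True)

def can_submit(cond, parent_ads):
    """True if some completed parent's ad satisfies the edge condition."""
    return any(ad.get("status") == "Done" and cond(ad) for ad in parent_ads)

# First example: A finished with x == true, so B runs and C does not.
a_ad = {"job": "A", "status": "Done", "x": True}
print(can_submit(cond_b, [a_ad]), can_submit(cond_c, [a_ad]))

# Second example: P1.x is false but P2.x is true, so C may run.
p1_ad = {"job": "P1", "status": "Done", "x": False}
p2_ad = {"job": "P2", "status": "Done", "x": True}
print(can_submit(cond_p, [p1_ad, p2_ad]))
```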
Dynamic DAGMan • Recovery model is still the same • Rescue DAG: saves node state at failure • ClassAd attribute-values can be re-generated from Condor logs • Flexibility to make run-time decisions • Which subset of nodes in the DAG should be executed? • When should node X be executed?
Outline • DAGMan workflow management • Motivation for dynamic DAGMan • ClassAds • Putting together: DAGMan + ClassAds • Looking ahead
Looking ahead • DAG with only implicit edges • Parent-child relations embedded in classAds • Nodes specify • Trigger condition • Preference for child nodes to run • On-the-fly dependency formation based on previous node execution • DAGMan collaborates with Quill • Getting attributes from persistent storage
Looking ahead • Allow a job to modify or add its own attributes • Determine what happens after the job exits • Global state control • Throttling expression/parameters • Global DAG ClassAd • Statistics on running, successful, and failed jobs • E.g. if (number of failed jobs > N) run a cleanup node
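The global DAG ClassAd idea can be illustrated with a short Python sketch that aggregates node states into DAG-wide statistics and checks the cleanup rule from the last bullet. This is a speculative illustration of a proposed feature; the attribute names `running`, `successful`, and `failed`, and the functions `global_dag_ad` and `should_run_cleanup`, are hypothetical.

```python
from collections import Counter

def global_dag_ad(node_states):
    """Summarize per-node states (name -> state string) into DAG-wide counts."""
    counts = Counter(node_states.values())
    return {
        "running": counts["Submitted"],
        "successful": counts["Done"],
        "failed": counts["Failed"],
    }

def should_run_cleanup(dag_ad, max_failed):
    """Trigger rule from the slide: run cleanup if failed jobs exceed N."""
    return dag_ad["failed"] > max_failed

states = {"A": "Done", "B": "Failed", "C": "Failed", "D": "Waiting"}
dag_ad = global_dag_ad(states)
print(dag_ad)
print(should_run_cleanup(dag_ad, max_failed=1))
```

Keeping the statistics in one DAG-level ad would let the same expression mechanism used for node conditions also drive global decisions such as throttling or cleanup.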
Thank you! We would be interested to hear your suggestions.