Reliability and Troubleshooting with Condor

Douglas Thain
Condor Project, University of Wisconsin
PPDG Troubleshooting Workshop, 12 December 2002

Condor Reliability
• Condor was designed for idle machines:
  • Reclaim, reboot, crash, out of memory...
• Sounds much like the grid!
  • US-CMS testbed
  • Distributed ownership, control, and resources.
  • (War stories abound.)
• Condor tools add controlled reliability.
  • Not absolute reliability, but:
    • A finite amount of retry.
    • A notification/recovery strategy.
    • Logging and book-keeping.
    • Known state after a failure.

US-CMS Physical Structure
[Diagram labels: MOP Master; Public Internet; three sites, each labeled Head Node, Workers, and Private Network.]

US-CMS Logical Structure
[Diagram, color-coded red and green. Labels: Master, Site, Worker; Impala, Globus, MOP, Condor, DAGMan, Real Work, Condor-G.]
• Red items expect a reliable environment.
• Green items create a reliable environment.

Condor-G
[Diagram labels: End-User Tools; Condor-G (transaction interface); Condor-G Submitter; Job Queue (jobs marked Run or Idle); Job Log; System Log; Grid Managers; GAHP-Server; GRAM; Head Node; Gatekeeper; Job Managers; Local Resource Manager.]

Directed Acyclic Graph Manager (DAGMan)
• Condor-G deals with system failures; DAGMan deals with application and user failures.
• PRE and POST scripts may be used to validate inputs and outputs.
• A “Rescue DAG” describes what is left unexecuted.
• DAG nodes may themselves be DAGs.
[Diagram: a DAG with nodes A, B, C, and D; node C has pre.pl and post.pl attached.]
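The PRE/POST and Rescue DAG ideas can be illustrated with a minimal Python sketch. This is not DAGMan's implementation or its input format: the function `run_dag`, its arguments, and the diamond-shaped graph (A before B and C, both before D) are assumptions made for illustration, and plain Python callables stand in for Condor jobs and the pre.pl and post.pl scripts. A node counts as done only if its PRE step, its job, and its POST step all succeed; everything else is reported as left unexecuted.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(parents, jobs, pre=None, post=None):
    """Run nodes in dependency order; return (done, not_done).

    parents:  dict mapping each node to the set of nodes it depends on
    jobs:     dict mapping each node to a zero-argument callable
              that returns True on success
    pre/post: optional dicts mapping a node to a validation callable
              that returns True on success (hypothetical stand-ins
              for PRE and POST scripts)
    """
    pre = pre or {}
    post = post or {}
    done, not_done = [], []
    for node in TopologicalSorter(parents).static_order():
        # Skip a node if any of its parents did not complete.
        if any(p in not_done for p in parents.get(node, ())):
            not_done.append(node)
            continue
        steps = (pre.get(node), jobs[node], post.get(node))
        # all() stops at the first failing step, so a failed PRE
        # step prevents the job itself from running.
        if all(step() for step in steps if step is not None):
            done.append(node)
        else:
            not_done.append(node)
    return done, not_done


# Assumed diamond-shaped DAG: A -> B, A -> C, B -> D, C -> D.
parents = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
ran = []


def make_job(name):
    def job():
        ran.append(name)
        return True
    return job


jobs = {name: make_job(name) for name in parents}

# Node C's output check fails, so C and its child D are left over.
done, rescue = run_dag(
    parents,
    jobs,
    pre={"C": lambda: True},
    post={"C": lambda: False},
)
print(sorted(done))    # ['A', 'B']
print(sorted(rescue))  # ['C', 'D']
```

The second return value plays the role of the Rescue DAG in this sketch: it names the work that remains, so a later run can resume from a known state instead of starting over.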

Fault Tolerant Shell (FTSH)
• Standard shell scripts are very error-prone.
• FTSH adds time limits, retry, logging, and clean termination.
• “Exceptions for scripts”: unexpected errors cannot accidentally be ignored.

    try 10 times
        try for 15 minutes
            globus_url_copy A B
        end
        try for 1 hour
            run-simulation < B > C
            gzip < C > D
        end
        try for 15 minutes
            globus_url_copy D E
        end
    end
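The nested try blocks above can be approximated in Python as a rough sketch. It is not how FTSH is implemented: the names `try_block` and `TryFailed`, the fixed `delay` between attempts, and the `flaky` test function are all assumptions made for illustration. The sketch also checks its time budget only between attempts, so unlike FTSH it does not terminate a step that is still running when the limit expires.

```python
import time


class TryFailed(Exception):
    """Raised when a try block runs out of attempts or time."""


def try_block(body, times=None, seconds=None, delay=0.0):
    """Call body() until it returns without raising an exception.

    times:   maximum number of attempts (None means no attempt limit)
    seconds: wall-clock budget across all attempts (None means no limit)
    delay:   pause between attempts; a fixed pause is an assumption
             of this sketch
    """
    if times is None and seconds is None:
        raise ValueError("give at least one of times or seconds")
    deadline = None if seconds is None else time.monotonic() + seconds
    attempts = 0
    while True:
        attempts += 1
        try:
            return body()
        except Exception as err:
            last = err  # keep the error; it cannot be silently ignored
        out_of_tries = times is not None and attempts >= times
        out_of_time = (deadline is not None
                       and time.monotonic() + delay >= deadline)
        if out_of_tries or out_of_time:
            raise TryFailed(f"gave up after {attempts} attempt(s)") from last
        time.sleep(delay)


# A hypothetical step that fails twice before succeeding.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


# Inner block: retry for up to 5 seconds. Outer block: up to 10 tries.
result = try_block(lambda: try_block(flaky, seconds=5), times=10)
print(result, calls["n"])
```

Because an exhausted block raises `TryFailed`, a failure in an inner block propagates to the enclosing one, which then retries the whole body. That mirrors the "exceptions for scripts" idea: an error must be handled or passed up, never dropped.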

Hawkeye
[Diagram labels: Probe Modules (three); ClassAd Data; Hawkeye Manager (example Hawkeye page); Policy Manager; Trigger Exprs; ClassAd Queries; actions: Submit Repair Job, Contact Sysadmin, Log Event.]
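As the labels suggest, probe data is matched against trigger expressions, and a trigger that fires leads to an action. The following Python sketch shows only that matching step. It does not use Hawkeye or the ClassAd language: dictionaries stand in for ClassAds, Python predicates stand in for trigger expressions, and every name in it (`evaluate_triggers`, `Machine`, `DiskFreeMB`, `LoadAvg`, the trigger names and thresholds) is hypothetical.

```python
def evaluate_triggers(ads, triggers):
    """Return (machine, trigger_name, action) for each trigger that fires.

    ads:      list of dicts, one per machine (stand-ins for ClassAds)
    triggers: list of (name, predicate, action) tuples, where
              predicate takes one ad and returns True or False
    """
    fired = []
    for ad in ads:
        for name, predicate, action in triggers:
            if predicate(ad):
                fired.append((ad["Machine"], name, action))
    return fired


# Hypothetical data, as probe modules might report it.
ads = [
    {"Machine": "node1", "DiskFreeMB": 120, "LoadAvg": 0.4},
    {"Machine": "node2", "DiskFreeMB": 9000, "LoadAvg": 7.5},
    {"Machine": "node3", "DiskFreeMB": 8000, "LoadAvg": 0.2},
]

# Hypothetical policy: each trigger maps a condition to an action.
triggers = [
    ("low_disk", lambda ad: ad["DiskFreeMB"] < 500, "submit repair job"),
    ("high_load", lambda ad: ad["LoadAvg"] > 5.0, "contact sysadmin"),
]

fired = evaluate_triggers(ads, triggers)
for machine, name, action in fired:
    print(f"{machine}: {name} -> {action}")
```

Keeping the policy as data, separate from the probes that collect it, means a new condition or action can be added without changing the collection code.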

For More Info...
• Condor-G
  • http://www.cs.wisc.edu/condor/condorg
• DAGMan
  • http://www.cs.wisc.edu/condor/dagman
• Fault Tolerant Shell
  • http://www.cs.wisc.edu/~thain/research/ftsh
• Hawkeye
  • http://www.cs.wisc.edu/condor/hawkeye
• Philosophy of Error Management
  • http://www.cs.wisc.edu/condor/doc/error-scope.pdf
• The Condor Project
  • http://www.cs.wisc.edu/condor
