Condor Tutorial
NCSA Alliance ‘98
Presented by: The Condor Team, University of Wisconsin-Madison
Email: condor-admin@cs.wisc.edu
URL: http://www.cs.wisc.edu/condor
Welcome to the Condor Tutorial!
• Introductions
• What is Condor?
  • A system for High Throughput Computing
Condor Tutorial, NCSA Alliance '98, April 27th 1998
The “Religion” behind High Throughput Computing
Key Concepts:
• High Throughput Computing (HTC)
• Distributively owned resources
Performance vs. Throughput
• High Performance - Very large amounts of processing capacity over short time periods (FLOPS - Floating Point Operations Per Second)
• High Throughput - Large amounts of processing capacity sustained over very long time periods (FLOPY - Floating Point Operations Per Year)
FLOPY ≠ 30758400 * FLOPS
Distributed Ownership
• Due to the dramatic decrease in the cost-performance ratio of hardware, powerful computing resources are owned today by individuals, groups, departments, …
• Huge increase in the aggregate processing capacity owned by the organization
• Much smaller increase in the capacity accessible to a single person
The Challenge and Motivation behind Condor
Turn large collections of existing distributively owned (and perhaps non-dedicated) computing resources into effective High Throughput Computing Environments
Minimize Wait while Idle
Road Block: Sociology
Make owners (& system administrators) happy.
• Give owners full control over
  • when and by whom private resources are used for HTC
  • impact of HTC on private Quality of Service
  • membership and information on HTC related activities
• Require no changes to existing software, and make it easy to install, configure, monitor, and maintain
Happy owners → more resources → higher throughput
Road Block: Robustness
To be effective, an HTC environment must run as a 24-7-365 operation.
• Customers count on it
• Debugging and fault isolation may be very time-consuming processes
• In a large distributed system, everything that might go wrong will go wrong.
Robust system → less down time → higher throughput
Road Block: Portability
To be effective, the HTC software must run on and support the latest and greatest hardware and software.
• Owners select hardware and software according to their needs and tradeoffs
• Customers expect it to be there.
• Application developers expect only a few (if any) changes to their applications.
Portability → more platforms → higher throughput
Condor’s unique mechanisms for HTC
• Matchmaking - enables requests for services and offers to provide services to find each other.
• Checkpointing - enables preemptive-resume scheduling (go ahead and use it as long as it is available!).
• Remote I/O - enables remote (from execution site) access to local (at submission site) data.
Condor Viewpoints
• Owner
  • Creates resource offers
• User
  • Creates resource requests
• Administrator
  • Drinks Coffee
  • Manages the pool-wide configuration
  • Could also be the Owner
Condor Agents
• Condor Resource Agent
  • condor_startd daemon
  • allows a machine to execute Condor jobs
  • enforces owner policy
• Condor User Agent
  • condor_schedd daemon
  • allows a machine to submit jobs to a pool
The Tutorial Installation
[Diagram; labels: Central Manager, Alliance ‘98 Pool, schedd, Your Workstation, startd]
The Tutorial Installation
[Diagram; labels: Central Manager (×2), UW-Madison Pool, Alliance ‘98 Pool, schedd (×2), Your Workstation, startd]
Hands-on: Example #1
Joining the UW-Madison CS Condor Pool as a Submit-only node
Overview of Submitting a Job to Condor
• Create a Submit-Description File
• Run condor_compile to relink your program with the Condor Libraries, if Condor’s Checkpointing or Remote I/O support is desired
• Run condor_submit
  • sends your request to the User Agent (condor_schedd)
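The first step above can be sketched in Python as a helper that renders the text of a small submit-description file. This is only an illustration: the helper name render_submit_description, the program name my_program, and the file names job.out, job.err, and job.log are hypothetical, and the slides do not show an actual submit file. The keywords used (universe, executable, output, error, log, queue) are standard condor_submit commands.

```python
def render_submit_description(executable, universe="vanilla", queue_count=1):
    """Return the text of a minimal submit-description file.

    All file names below are hypothetical placeholders.
    """
    lines = [
        f"universe   = {universe}",
        f"executable = {executable}",
        "output     = job.out",   # where the job's stdout is written
        "error      = job.err",   # where the job's stderr is written
        "log        = job.log",   # Condor's event log for the job
        f"queue {queue_count}",   # number of job instances to submit
    ]
    return "\n".join(lines) + "\n"


# A program relinked with condor_compile would use the standard universe.
submit_text = render_submit_description("my_program", universe="standard")
print(submit_text)
```

The resulting text would be saved to a file and passed to condor_submit, which hands the request to the condor_schedd.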
Condor System Structure
Hands-on: Example #2
Submit Jobs to Condor
Condor Universes
A Universe specifies a Condor runtime environment:
• STANDARD
  • Supports Checkpointing
  • Supports Remote System Calls
  • Has some limitations…
• VANILLA
  • Any Unix executable (shell scripts, etc.)
  • No Condor Checkpointing or Remote I/O
Hands-on: Example #3
Tour of User Tools/Commands
User Priorities in Condor
• Each active user in the pool has a user priority
  • Viewed or changed with condor_userprio
  • Like golf: the lower, the better
• A given user’s share of available machines is inversely related to the ratio between user priorities.
  • Example: Fred’s priority is 10, Joe’s is 20. Fred will be allocated twice as many machines as Joe.
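The Fred and Joe example can be checked with a few lines of Python. This sketch assumes each user's share is exactly proportional to the inverse of the priority value, which is one reading of "inversely related"; the function name machine_shares and the pool size of 30 machines are made up for illustration and are not Condor's actual allocation code.

```python
def machine_shares(priorities, machines):
    """Split `machines` among users in proportion to 1 / priority.

    Assumption: a lower priority value means a proportionally larger share.
    """
    weights = {user: 1.0 / value for user, value in priorities.items()}
    total = sum(weights.values())
    return {user: machines * weight / total for user, weight in weights.items()}


shares = machine_shares({"fred": 10, "joe": 20}, machines=30)
print(shares)  # Fred's share comes out at twice Joe's
```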
User Priorities in Condor, cont.
• Condor continuously adjusts user priorities over time
  • If machines allocated > priority, priority worsens
  • If machines allocated < priority, priority improves
• Priority Preemption
  • Higher priority users will grab machines away from lower priority users (thanks to Checkpointing…)
  • Starvation is prevented
  • Priority “thrashing” is prevented
Parallel Jobs in Condor
Condor can run parallel applications (written to the popular PVM message passing library)
Master-Worker Paradigm
Condor-PVM is designed to run PVM applications which follow the master-worker paradigm.
• Master
  • has a pool of work, sends pieces of work to the workers, manages the work and the workers
• Worker
  • gets a piece of work, does the computation, sends the result back
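The division of labor described above can be illustrated with a small Python sketch. It uses threads and in-process queues rather than PVM, so it shows only the shape of the paradigm, not Condor-PVM itself; the function names and the "square a number" task are hypothetical, and the sketch does not model workers leaving the pool.

```python
import queue
import threading


def worker(tasks, results):
    """Take pieces of work until the stop sentinel arrives; send results back."""
    while True:
        item = tasks.get()
        if item is None:  # sentinel from the master: no more work
            break
        results.put((item, item * item))  # the "computation" is a stand-in


def master(work_items, n_workers=3):
    """Hand out the pool of work, wait for the workers, collect the results."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for item in work_items:
        tasks.put(item)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    collected = {}
    while not results.empty():
        key, value = results.get()
        collected[key] = value
    return collected


squares = master(range(5))
print(squares)
```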
What does Condor-PVM do?
Condor acts as the PVM resource manager.
• All pvm_addhost requests get re-mapped to Condor.
• Condor dynamically constructs PVM virtual machines out of non-dedicated desktop machines.
• When a machine leaves the pool, the user gets notified via the normal PVM notification mechanisms.
How to compile and submit Condor-PVM jobs
• Binary Compatible
  • Compile and link with the PVM library just as for normal PVM applications. No need to link with Condor.
• Submit
  • In the submit file set:
    universe = PVM
    machine_count = <min>..<max>
Classified Advertisements
• ClassAds
  • Language for expressing attributes
  • Semantics for evaluating them
• Intuitively, a ClassAd is a set of named expressions
  • Each named expression is an attribute
• Expressions are similar to C …
  • Constants, attribute references, operators
Classified Advertisements: Example
MyType = "Machine"
TargetType = "Job"
Name = "froth.cs.wisc.edu"
StartdIpAddr = "<128.105.73.44:33846>"
Arch = "INTEL"
OpSys = "SOLARIS251"
VirtualMemory = 225312
Disk = 35957
KFlops = 21058
Mips = 103
LoadAvg = 0.011719
KeyboardIdle = 12
Cpus = 1
Memory = 128
Requirements = LoadAvg <= 0.300000 && KeyboardIdle > 15 * 60
Rank = 0
Classified Advertisements: Matching
• ClassAds are always considered in pairs
Does ClassAd A match ClassAd B (and vice versa)?
Classified Advertisements: Examples
ClassAd A
MyType = "Apartment"
TargetType = "ApartmentRenter"
SquareArea = 3500
RentOffer = 1000
HeatIncluded = False
OnBusLine = True
Rank = UnderGrad==False + TARGET.RentOffer
Requirements = MY.RentOffer - TARGET.RentOffer < 150
ClassAd B
MyType = "ApartmentRenter"
TargetType = "Apartment"
UnderGrad = False
RentOffer = 900
Rank = 1/(TARGET.RentOffer + 100.0) + 50*HeatIncluded
Requirements = OnBusLine && SquareArea > 2700
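To make the two-way match concrete, the sketch below hand-translates the two Requirements expressions into Python and evaluates each against the other ad. It is not a ClassAd interpreter: the dictionaries and function names are hypothetical, the Rank expressions are left out, and it assumes that the unqualified names OnBusLine and SquareArea in ClassAd B resolve against the other ad because B does not define them itself.

```python
apartment = {"SquareArea": 3500, "RentOffer": 1000,
             "HeatIncluded": False, "OnBusLine": True}
renter = {"UnderGrad": False, "RentOffer": 900}


def apartment_requirements(my, target):
    # MY.RentOffer - TARGET.RentOffer < 150
    return my["RentOffer"] - target["RentOffer"] < 150


def renter_requirements(my, target):
    # OnBusLine && SquareArea > 2700 (assumed to refer to the apartment ad)
    return target["OnBusLine"] and target["SquareArea"] > 2700


def ads_match(apartment_ad, renter_ad):
    """Two ads match only if each side's Requirements accepts the other."""
    return (apartment_requirements(apartment_ad, renter_ad)
            and renter_requirements(renter_ad, apartment_ad))


matched = ads_match(apartment, renter)
print(matched)
```

With the values on the slide, the rent gap is 100 (below 150) and the apartment is on a bus line with more than 2700 square feet, so both sides accept.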
ClassAds in the Condor System
• ClassAds allow Condor to be a general system
  • Constraints and ranks on matches expressed by entities themselves
  • Only priority logic integrated into Manager
• All principal entities in the Condor system are represented by ClassAds
  • Machines, Jobs, Submitters
ClassAds in Condor: Requirements and Rank (Example)
Friend = Owner == "tannenba" || Owner == "wright"
ResearchGroup = Owner == "jbasney" || Owner == "raman"
Trusted = Owner != "rival" && Owner != "riffraff"
Requirements = Trusted && ( ResearchGroup || LoadAvg < 0.3 && KeyboardIdle > 15*60 )
Rank = Friend + ResearchGroup*10
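The policy above can be traced by hand with a direct Python translation. This is a sketch for checking the logic, not Condor code: the function name evaluate_policy is hypothetical, and it assumes Owner is the owner of the candidate job while LoadAvg and KeyboardIdle describe the machine. As in C, && binds tighter than ||, so the load and keyboard tests are grouped together.

```python
def evaluate_policy(owner, load_avg, keyboard_idle):
    """Return (requirements, rank) for a job owner on this machine."""
    friend = owner in ("tannenba", "wright")
    research_group = owner in ("jbasney", "raman")
    trusted = owner not in ("rival", "riffraff")
    # ResearchGroup || (LoadAvg < 0.3 && KeyboardIdle > 15*60)
    requirements = trusted and (
        research_group or (load_avg < 0.3 and keyboard_idle > 15 * 60)
    )
    # Booleans count as 0 or 1 in the Rank arithmetic.
    rank = int(friend) + int(research_group) * 10
    return requirements, rank


# Research group members may run even on a busy machine and rank highest.
print(evaluate_policy("raman", load_avg=0.9, keyboard_idle=0))
# Friends need an idle machine and rank just above strangers.
print(evaluate_policy("wright", load_avg=0.1, keyboard_idle=1000))
```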
Hands-on: Example #4
Submit Jobs with ClassAd Constraints
Resource Owner’s Viewpoint
Owner is King
• In Condor, the owner of the resource (machine owner) can dictate the terms and conditions under which that resource can be used
• How? Configure the Resource Agent’s Policy (condor_startd configuration)
Resource Agent Configuration Expressions
• START expression
  • When TRUE, Condor can start a job
  • True = Unclaimed State
  • False = Owner State
• SUSPEND expression
  • When TRUE, Condor suspends any job running on this machine
• CONTINUE expression
  • When TRUE, will continue a suspended job
Resource Agent Configuration Expressions, cont.
• VACATE expression
  • When TRUE, kick the job off of the machine (via a Checkpoint if possible)
• KILL expression
  • When TRUE, kill the job immediately
  • No Checkpoint
  • On UNIX: a “kill -9”
Resource Agent Configuration Expressions, cont.
[Flow diagram; nodes: START, WANT SUSPEND, SUSPEND, WANT VACATE, VACATE, KILL; edges labeled True (×5) and False (×2)]
Resource Agent Configuration Expressions, cont.
• Default Setup
  WANT_VACATE : True
  WANT_SUSPEND : True
  START : Keyboard_Idle && CPU_Idle
  SUSPEND : Keyboard_Busy || CPU_Busy
  CONTINUE : Keyboard and CPU idle again
  VACATE : If Suspended > 10 minutes
  KILL : If spent > 10 minutes in VACATE state
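The default setup can be read as a small state machine. The Python sketch below is a simplified model under several assumptions that are not in the slides: the state names ("idle", "running", "suspended", "vacating", "killed") are invented, a job is assumed to be always waiting to start, CONTINUE is checked before VACATE, and WANT_SUSPEND and WANT_VACATE are both taken as True. It is not how condor_startd is implemented.

```python
TEN_MINUTES = 10


def next_state(state, keyboard_idle, cpu_idle, minutes_in_state):
    """Advance the simplified policy model by one evaluation step."""
    machine_idle = keyboard_idle and cpu_idle
    if state == "idle":
        # START: Keyboard_Idle && CPU_Idle
        return "running" if machine_idle else "idle"
    if state == "running":
        # SUSPEND: Keyboard_Busy || CPU_Busy
        return "running" if machine_idle else "suspended"
    if state == "suspended":
        if machine_idle:
            return "running"  # CONTINUE: keyboard and CPU idle again
        if minutes_in_state > TEN_MINUTES:
            return "vacating"  # VACATE: suspended for more than 10 minutes
        return "suspended"
    if state == "vacating":
        # KILL: more than 10 minutes spent vacating
        return "killed" if minutes_in_state > TEN_MINUTES else "vacating"
    return state


# The owner touches the keyboard and stays busy for more than 10 minutes.
state = next_state("running", keyboard_idle=False, cpu_idle=True, minutes_in_state=0)
print(state)
state = next_state(state, keyboard_idle=False, cpu_idle=True, minutes_in_state=11)
print(state)
```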
Hands-on: Example #5
UW-Madison CS Pool Startd Policy
Condor Administrator Features
• The condor_master is the administrator’s best friend
  • Watches/restarts other daemons
  • Sends email if it notices suspicious problems
  • Runs condor_preen
  • Provides administrator remote control
Condor Administrator Commands
• Administrator Commands
  • condor_off [ hostname … ]
    • Down entire pool: condor_off `cat machines-file`
  • condor_on
  • condor_restart
  • condor_reconfig (“on-the-fly” reconfiguration)
  • condor_vacate
• These commands could be used by the Owner as well, if desired
Condor Host-based Access Control
• HOST_ALLOW and HOST_DENY to grant machines (subnets, domains) different access levels:
  • READ access
  • WRITE access
  • ADMINISTRATOR access
  • OWNER access
Example: Simple Host-based Access Control
HOSTDENY_READ = *.mil
HOSTALLOW_WRITE = *.ncsa.uiuc.edu
HOSTDENY_WRITE = ppp*.ncsa.uiuc.edu, 172.44.*
HOSTALLOW_ADMINISTRATOR = bigcheese.ncsa.uiuc.edu
HOSTALLOW_OWNER = $(FULL_HOSTNAME), $(HOSTALLOW_ADMINISTRATOR)
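One way to reason about such a policy is to model it as allow and deny pattern lists per access level. The Python sketch below does this under assumptions the slide does not state: a deny entry wins over an allow entry, an empty allow list admits any host that is not denied, and host names and IP addresses are both matched as plain wildcard strings. The POLICY table, the is_allowed function, and the test host names are hypothetical, and the OWNER level is omitted because it depends on macro expansion.

```python
from fnmatch import fnmatchcase

# Pattern lists copied from the example above (OWNER level omitted).
POLICY = {
    "READ": {"allow": [], "deny": ["*.mil"]},
    "WRITE": {"allow": ["*.ncsa.uiuc.edu"],
              "deny": ["ppp*.ncsa.uiuc.edu", "172.44.*"]},
    "ADMINISTRATOR": {"allow": ["bigcheese.ncsa.uiuc.edu"], "deny": []},
}


def is_allowed(level, host):
    """Check a host name or IP string against one access level."""
    rules = POLICY[level]
    if any(fnmatchcase(host, pattern) for pattern in rules["deny"]):
        return False  # assumption: deny takes precedence over allow
    if not rules["allow"]:
        return True   # assumption: no allow list means everyone else is allowed
    return any(fnmatchcase(host, pattern) for pattern in rules["allow"])


print(is_allowed("WRITE", "modi4.ncsa.uiuc.edu"))   # inside the allowed domain
print(is_allowed("WRITE", "ppp12.ncsa.uiuc.edu"))   # dial-up hosts are denied
```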
Configuration File Hierarchy
• condor_config
  • Pool-wide default
  • Condor pool administrator’s requirements
• condor_config.local
  • Overrides for a specific machine
  • Reflects Owner’s requirements
• condor_config.root
  • System Administrator requirements
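The override relationship between the pool-wide file and the machine-local file can be sketched as layered dictionaries in Python. This is a toy model: the parser handles only simple "NAME = value" lines, the function names and sample settings are hypothetical, and condor_config.root is left out because the slide does not say where it falls in the order.

```python
def parse_config(text):
    """Parse simple 'NAME = value' lines, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        settings[name.strip()] = value.strip()
    return settings


def effective_config(*layers):
    """Merge config texts in order; later layers override earlier ones."""
    merged = {}
    for layer in layers:
        merged.update(parse_config(layer))
    return merged


pool_wide = """
# condor_config: pool-wide defaults
HOSTALLOW_WRITE = *.ncsa.uiuc.edu
WANT_VACATE = True
"""
machine_local = """
# condor_config.local: this machine's owner overrides one setting
WANT_VACATE = False
"""

config = effective_config(pool_wide, machine_local)
print(config)
```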
Future Directions
• Condor for Windows NT
• SMP support
• More parallel job support
• Checkpoint parallel jobs
• MPI, MPI-2
• Flocking …
Obtaining Condor
• Condor can be downloaded from the Condor web site at: http://www.cs.wisc.edu/condor
• Complete Users and Administrators manual available: http://www.cs.wisc.edu/condor/manual
• Contracted Support is available
• Questions? Email: condor-admin@cs.wisc.edu
Thank You!!
Thank you for your interest!
The Condor Team: Miron Livny, Marvin Solomon, Todd Tannenbaum, Derek Wright, Bin Song, Rajesh Raman, Tom Stanis, Jim Basney, Adiel Yoaz