
Condor by Example


Presentation Transcript


  1. Condor by Example

  2. Lecture Format: • In each lecture: • Lecture to whole group. • Workshop and examples at computer. • Oops! • Some items are filled in at the last minute. • Please fill the _______ with notes.

  3. Outline • Overview • Submitting Jobs, Getting Feedback • Setting Requirements with ClassAds • Which Universe? • Move to Workshop

  4. What is Condor? • Condor converts a collection of unrelated workstations into a high-throughput computing facility. • Condor uses matchmaking to make sure that everyone is happy.

  5. What is High-Throughput Computing? • High-performance: CPU cycles/second under ideal circumstances. • “How fast can I run simulation X on this machine?” • High-throughput: CPU cycles/day (week, month, year?) under non-ideal circumstances. • “How many times can I run simulation X in the next week using all available machines?”

  6. What is High-Throughput Computing? • Condor does whatever it takes to run your jobs, even if some machines… • Crash! • Are disconnected • Run out of disk space • Are removed from or added to the pool • Are put to other uses

  7. What is Matchmaking? • Condor uses Matchmaking to make sure that work gets done within the constraints of both users and owners. • Users (jobs) have constraints: • “I need an Alpha with 256 MB RAM” • Owners (machines) have constraints: • “Only run jobs when I am away from my desk and never run jobs owned by Bob.”
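A minimal sketch of two-sided matching, written in Python rather than in Condor's own ClassAd language. The attribute names (`OwnerAway`, `WantArch`, `WantMemoryMB`), the sample machines, and the helper functions are hypothetical illustrations of the two constraints quoted above, not Condor's actual implementation.

```python
# Hypothetical ads, modeled as plain dictionaries (not real ClassAds).
job = {"Owner": "alice", "WantArch": "ALPHA", "WantMemoryMB": 256}
machines = [
    {"Name": "m1", "Arch": "ALPHA", "MemoryMB": 256, "OwnerAway": True},
    {"Name": "m2", "Arch": "ALPHA", "MemoryMB": 128, "OwnerAway": True},
    {"Name": "m3", "Arch": "ALPHA", "MemoryMB": 512, "OwnerAway": False},
]


def job_accepts(job, machine):
    """User constraint: "I need an Alpha with 256 MB RAM"."""
    return (machine["Arch"] == job["WantArch"]
            and machine["MemoryMB"] >= job["WantMemoryMB"])


def machine_accepts(machine, job):
    """Owner constraint: only when the owner is away, and never for Bob."""
    return machine["OwnerAway"] and job["Owner"] != "bob"


def matches(job, machines):
    """Return the names of machines where both sides agree."""
    return [m["Name"] for m in machines
            if job_accepts(job, m) and machine_accepts(m, job)]


print(matches(job, machines))
```

A match requires both predicates to hold, so a machine with plenty of memory is still skipped while its owner is at the console.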

  8. Who uses Condor? • Hundreds of universities and companies around the world! • University of Wisconsin, USA • 682 CPUs in one building • Computer architecture simulations • National Institute of Physics, Italy • 200 CPUs in many cities • Reconstruction of collider events • And many others!

  9. What can Condor do for me? Condor can… • …increase your throughput. • …do your housekeeping. • …improve reliability. • …give performance feedback.

  10. Cluster Overview • Server: 512 MB, 800 MHz, 20 GB disk. • 100 Mb/s network. • Five clients, each: 128 MB, 666 MHz, 10 GB disk.

  11. How many machines now? • The map is out of date! • The system is always changing. • First example: What machines (and of what kind) are in the pool now?

  12. How Many Machines?

          % condor_status

          Name           OpSys        Arch   State      Activity  LoadAv  Mem
          lxpc1.na.infn  LINUX-GLIBC  INTEL  Unclaimed  Idle      0.000    30
          axpd21.pd.inf  OSF1         ALPHA  Owner      Idle      0.266    96
          vlsi11.pd.inf  SOLARIS26    SUN4u  Claimed    Busy      0.000   256
          . . .

                             Machines  Owner  Claimed  Unclaimed  Matched  Preempting
          ALPHA/OSF1              115     67       46          1        0           1
          INTEL/LINUX              53     18        0         35        0           0
          INTEL/LINUX-GLIBC        16      7        0          9        0           0
          SUN4u/SOLARIS251          1      1        0          0        0           0
          SUN4u/SOLARIS26           6      2        0          4        0           0
          SUN4u/SOLARIS27           1      1        0          0        0           0
          SUN4x/SOLARIS26           2      1        0          1        0           0
          Total                   194     97       46         50        0           1

  13. Machine States • Most machines will be: • Owner: • The machine’s owner is busy at the console, so no Condor jobs may run. • Claimed: • Condor has selected the machine to run jobs for other users.

  14. Machine States • Only a few should be: • Unclaimed: • The owner is gone, but Condor has not yet selected the machine. • Matched: • Between claimed and unclaimed. • Preempting: • Condor is busy removing a job.

  15. More Things to Try

          % condor_status -help
          % condor_status -avail
          % condor_status -run
          % condor_status -total
          % condor_status -pool condor.cs.wisc.edu

  16. Submitting Jobs

  17. Steps to Running a Job • Re-link for Condor. • Submit the job. • Watch the progress. • Receive email when done.

  18. Example Job • Integrate sin(x) from 0 to 10, using 10 million slices. • Simple program takes a few seconds.

          % ./integrate 10 10000000
          2.0445075

  19. Source of the example program (integrate.f):

                PROGRAM INTEGRATE
                CHARACTER STR*10
                REAL X, SLICES, LIMIT
                CALL GETARG(1,STR)
                READ (STR,*) LIMIT
                CALL GETARG(2,STR)
                READ (STR,*) SLICES
                TOTAL=0
                STEP=LIMIT/SLICES
                DO X=0, LIMIT, STEP
                   TOTAL = TOTAL + SIN(X)*STEP
                END DO
                PRINT *, TOTAL
                END
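For readers who want to check the arithmetic without a Fortran compiler, here is a rough Python sketch of the same idea. It assumes a left Riemann sum over exactly `slices` intervals in double precision; the Fortran program uses single-precision REAL values and a real-valued loop counter, so its printed result may differ from this one. The function name `integrate` is chosen here for illustration.

```python
import math


def integrate(limit: float, slices: int) -> float:
    """Approximate the integral of sin(x) from 0 to limit with a left Riemann sum."""
    step = limit / slices
    total = 0.0
    for i in range(slices):
        # Evaluate sin at the left edge of slice i and add the slice's area.
        total += math.sin(i * step) * step
    return total


if __name__ == "__main__":
    approx = integrate(10.0, 1_000_000)
    exact = 1.0 - math.cos(10.0)  # closed form of the integral of sin on [0, 10]
    print(f"approx={approx:.7f} exact={exact:.7f}")
```

Computing `i * step` from an integer counter avoids the rounding drift that builds up when a floating-point loop variable is incremented millions of times.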

  20. Re-link for Condor • If you normally compile like this: • g77 integrate.f -o integrate • Then compile for Condor like this: • condor_compile g77 integrate.f -o integrate

  21. Submit the Job • Create a submit file: • emacs integrate.submit & • Submit the job: • condor_submit integrate.submit • Contents of integrate.submit:

          Executable = integrate
          Arguments = 10 10000000
          Output = integrate.out
          Log = integrate.log
          queue

  22. Watch the Progress

          % condor_q
          -- Submitter: axpbo8.bo.infn.it : <131.154.10.29:1038> :
          ID    OWNER  SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
          5.0   thain  6/21 12:40   0+00:00:15  R   0    2.5   fib 40

      • ID: Each job gets a unique number. • ST: Status: Unexpanded, Running or Idle. • SIZE: Size of program image (MB).

  23. Receive E-mail When Done

          This is an automated email from the Condor system
          on machine "axpbo8.bo.infn.it". Do not reply.

          Your condor job
                  /tmp_mnt/usr/users/ccl/thain/test/fib 40
          exited with status 0.

          Submitted at:   Wed Jun 21 14:24:42 2000
          Completed at:   Wed Jun 21 14:36:36 2000
          Real Time:      0 00:11:54
          Run Time:       0 00:06:52
          Committed Time: 0 00:01:37
          . . .

  24. Running Many Processes • 100 processes are almost as easy as 1. • Each condor_submit makes one cluster of one or more processes. • Add the number of processes to run to the Queue statement. • Use the $(PROCESS) variable to give each process slightly different instructions.

  25. Running Many Processes • Run the same program on 50 different intervals. • $(PROCESS) counts from 0, so output goes in integrate.out.0, integrate.out.1, and so on, up to integrate.out.49.

          Executable = integrate
          Arguments = $(PROCESS) 10000000
          Output = integrate.out.$(PROCESS)
          Log = integrate.log
          Queue 50
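To see what each of the 50 processes receives, the substitution can be mimicked in a few lines of Python. This is only an illustration of the expansion, assuming process numbers start at 0; the function `expand_processes` is hypothetical and is not part of Condor, which performs the substitution itself when the submit file is read.

```python
def expand_processes(count: int) -> list[dict[str, str]]:
    """Expand $(PROCESS) for `count` queued processes, numbered from 0."""
    template = {
        "Arguments": "$(PROCESS) 10000000",
        "Output": "integrate.out.$(PROCESS)",
    }
    return [
        {key: value.replace("$(PROCESS)", str(proc))
         for key, value in template.items()}
        for proc in range(count)
    ]


jobs = expand_processes(50)
print(jobs[0])   # first process in the cluster
print(jobs[49])  # last process in the cluster
```

Note that the first process integrates from 0 to 0; if that interval is not wanted, the program itself would have to offset its argument.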

  26. Running Many Processes

          % condor_q
          -- Submitter: axpbo8.bo.infn.it : <131.154.10.29:1038>
          ID    OWNER  SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
          9.3   thain  6/23 10:47   0+00:05:40  R   0    2.5   fib 3
          9.6   thain  6/23 10:47   0+00:05:11  R   0    2.5   fib 6
          9.7   thain  6/23 10:47   0+00:05:09  R   0    2.5   fib 7
          . . .
          21 jobs; 2 idle, 19 running, 0 held

      • In each ID, the part before the dot is the cluster number and the part after it is the process number.

  27. Where Are They Running?

          % condor_q -run
          -- Submitter: axpbo8.bo.infn.it : <131.154.10.29:1038> :
          ID     OWNER  SUBMITTED    RUN_TIME    HOST(S)
          9.47   thain  6/23 10:47   0+00:07:03  ax4bbt.bo.infn.it
          9.48   thain  6/23 10:47   0+00:06:51  pewobo1.bo.infn.it
          9.49   thain  6/23 10:47   0+00:06:30  osde01.pd.infn.it

      • HOST(S): Current location.

  28. Help! I’m buried in Email! • By default, Condor sends one email for each completed process. • To cut this down, add one of these lines to your submit file: • notification = error • notification = never • To send it to someone else: • notify_user = thain@cs.wisc.edu

  29. Removing Processes • Remove one process: • condor_rm 9.47 • Remove a whole cluster: • condor_rm 9 • Remove everything! • condor_rm -a

  30. Getting Feedback

  31. What have I done? • The user log file (fib.log) shows a chronological list of everything important that happened to a job.

          001 (007.035.000) 06/21 17:03:44 Job executing on host: <140.105.6.155:2219>
          004 (007.035.000) 06/21 17:04:58 Job was evicted.
          009 (007.035.000) 06/21 17:05:10 Job was aborted by the user.
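Because each log entry starts with a fixed-format header, the log is easy to scan with a script. The sketch below parses the three sample lines from the slide. It assumes the parenthesized field is cluster.process.subprocess and that every event header fits on one line; the regular expression and the function `parse_events` are illustrative helpers, not a Condor API.

```python
import re

# One event header: 3-digit event code, (cluster.proc.subproc), date, time, message.
EVENT_LINE = re.compile(
    r"^(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.(?P<subproc>\d+)\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<message>.*)$"
)

SAMPLE_LOG = """\
001 (007.035.000) 06/21 17:03:44 Job executing on host: <140.105.6.155:2219>
004 (007.035.000) 06/21 17:04:58 Job was evicted.
009 (007.035.000) 06/21 17:05:10 Job was aborted by the user.
"""


def parse_events(text):
    """Return one dictionary per recognized event header line."""
    events = []
    for line in text.splitlines():
        match = EVENT_LINE.match(line.strip())
        if match:
            event = match.groupdict()
            event["cluster"] = int(event["cluster"])
            event["proc"] = int(event["proc"])
            events.append(event)
    return events


events = parse_events(SAMPLE_LOG)
for event in events:
    job_id = f'{event["cluster"]}.{event["proc"]}'
    print(event["code"], job_id, event["time"], event["message"])
```

Lines that do not match the header pattern are skipped, so extra detail lines in a real log would be ignored rather than cause an error.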

  32. What have I done?

          % condor_history
          ID     OWNER  SUBMITTED    CPU_USAGE   ST  COMPLETED    CMD
          9.3    thain  6/23 10:47   0+00:00:00  C   6/23 10:58   fib 3
          9.40   thain  6/23 10:47   0+00:00:24  C   6/23 10:59   fib 40
          9.10   thain  6/23 10:47   0+00:00:00  C   6/23 11:01   fib 10
          9.47   thain  6/23 10:47   0+00:05:45  C   6/23 11:01   fib 47
          9.7    thain  6/23 10:47   0+00:00:00  C   6/23 11:01   fib 7

  33. Brief I/O Summary

          % condor_q -io
          -- Schedd: c01.cs.wisc.edu : <128.105.146.101:2016>
          ID      OWNER  READ      WRITE     SEEK  XPUT       BUFSIZE   BLKSIZE
          756.15  joe    244.9 KB  379.8 KB    71  1.3 KB/s   512.0 KB  32.0 KB
          758.24  joe    198.8 KB  219.5 KB    78  45.0 B /s  512.0 KB  32.0 KB
          758.26  joe     44.7 KB   22.1 KB  2727  13.0 B /s  512.0 KB  32.0 KB
          3 jobs; 0 idle, 3 running, 0 held

  34. Complete I/O Summary in Email (Only since Condor Version 6.1.11)

          Your condor job "/usr/joe/records.remote input output" exited with status 0.

          Total I/O:
              104.2 KB/s effective throughput
              5 files opened
              104 reads totaling 411.0 KB
              316 writes totaling 1.2 MB
              102 seeks

          I/O by File:
              buffered file /usr/joe/input
                  opened 2 times
                  100 reads totaling 398.6 KB
                  311 write totaling 1.2 MB
                  101 seeks

  35. Complete I/O Summary in Email • The summary helps identify performance problems. Even advanced users don't know exactly how their programs and libraries operate.

  36. Complete I/O Summary in Email • Example: • CMSSIM - collider simulation • “Why is this job so slow?” • Data summary: • read 250 MB from 20 MB file. • Very high SEEK total -> random access. • Solution: Increase buffer to 20 MB.
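The reasoning behind the fix can be spelled out with a little arithmetic. The sketch below uses only the two figures from the slide (250 MB read, 20 MB file); the helper names are hypothetical, and the 0.5 MB comparison value is simply the 512 KB BUFSIZE shown in the brief I/O summary, which CMSSIM may or may not have been using.

```python
def reread_factor(total_read_mb: float, file_size_mb: float) -> float:
    """Average number of times each byte of the file was read."""
    return total_read_mb / file_size_mb


def file_fits_in_buffer(file_size_mb: float, buffer_mb: float) -> bool:
    """True when the whole file can stay in the buffer after the first read."""
    return buffer_mb >= file_size_mb


factor = reread_factor(250.0, 20.0)
print(f"Each byte of the file was read about {factor:.1f} times on average.")
print("Fits in a 0.5 MB buffer:", file_fits_in_buffer(20.0, 0.5))
print("Fits in a 20 MB buffer:", file_fits_in_buffer(20.0, 20.0))
```

With a buffer as large as the file, repeated random reads can be served from memory instead of being fetched again.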

  37. Who Uses Condor?

          % condor_q -global

          -- Schedd: to02xd.to.infn.it : <192.84.137.2:1030>
          ID     OWNER     SUBMITTED    RUN_TIME     ST  PRI  SIZE  CMD
          127.0  garzelli  6/21 18:45   1+14:18:16   R   0    17.2  tosti2trisdn

          -- Schedd: quark.ts.infn.it : <140.105.6.101:3908>
          ID     OWNER     SUBMITTED    RUN_TIME     ST  PRI  SIZE  CMD
          600.0  dellaric  4/10 14:57   55+09:20:31  R   0    9.1   john p2.dat
          665.0  dellaric  6/2  11:14   20+03:27:30  R   0    9.2   john p1.dat
          788.0  pamela    6/20 09:27   3+04:41:43   R   0    15.4  montepamela

  38. Who Uses Condor?

          % condor_status -submitters

          Name                 Machine     Running  IdleJobs  MaxJobsRunning
          rebuzzin@pv.infn.it  decux1.pv.       22        34             200
          pamela@ts.infn.it    quark.ts.i        6         1             200
          giunti@to.infn.it    to05xd.to.       21        49             200
          . . .

                               RunningJobs  IdleJobs
          cattaneo@pv.infn.it            0         1
          pamela@ts.infn.it              6         1
          rebuzzin@pv.infn.it           22        34
          Total                         59        86

  39. Who Uses Condor?

          % condor_userprio
          Last Priority Update: 6/23 16:27

                                         Effective
          User Name                      Priority
          ------------------------------ ---------
          meucci@pv.infn.it                   0.50
          longof@ts.infn.it                   0.50
          thain@bo.infn.it                    0.50
          dellaric@ts.infn.it                 2.00
          clueoff@pd.infn.it                  3.00
          pamela@ts.infn.it                   5.81
          rebuzzin@pv.infn.it                18.18
          giunti@to.infn.it                  19.72
          ------------------------------ ---------
          Number of users shown: 8

  40. Who Uses Condor? • The user priority is computed by Condor to estimate how much of the pool’s CPU resources have been used by each submitter. • Lighter users receive a lower priority value, and a lower value is better: they will be allocated CPUs before heavy users. • Users consuming the same amount of CPU will be allocated an equal amount.
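One way to picture the effect of these numbers is to assume, as a simplification not stated in the slides, that each submitter's share of the pool is inversely proportional to their priority value. The sketch below applies that assumption to three of the priorities listed by condor_userprio; the function `fair_shares` is hypothetical and does not reproduce Condor's actual negotiation algorithm.

```python
# Priority values taken from the condor_userprio listing (lower is better).
priorities = {
    "thain@bo.infn.it": 0.50,
    "dellaric@ts.infn.it": 2.00,
    "giunti@to.infn.it": 19.72,
}


def fair_shares(priorities):
    """Split the pool in inverse proportion to each user's priority value."""
    weights = {user: 1.0 / value for user, value in priorities.items()}
    total = sum(weights.values())
    return {user: weight / total for user, weight in weights.items()}


shares = fair_shares(priorities)
for user, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{user:22s} {share:6.1%}")
```

Under this assumption a user with priority 0.50 would be offered four times as many CPUs as a user with priority 2.00.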

  41. Measuring Goodput • Goodput is the amount of time a workstation spends making forward progress on work assigned by Condor. • This is a big topic all by itself: http://www.cs.wisc.edu/condor/goodput

  42. Measuring Goodput

          % condor_q -goodput
          -- Submitter: coral.cs.wisc.edu : <128.105.175.116:45697> : coral.cs.wisc.edu
          ID      OWNER  SUBMITTED    RUN_TIME    GOODPUT  CPU_UTIL  Mb/s
          719.74  thain  6/23 07:35   2+20:47:59   100.0%     87.6%  0.00
          719.75  thain  6/23 07:35   2+20:38:45    40.5%     99.8%  0.00
          719.76  thain  6/23 07:35   2+20:38:16    96.9%     98.7%  0.00
          719.77  thain  6/23 07:35   2+21:10:06   100.0%     99.8%  0.00
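The percentages become more concrete when converted back into time. This sketch assumes that RUN_TIME is formatted as days+HH:MM:SS and that the GOODPUT column is the fraction of that run time spent on forward progress; both helper functions are hypothetical and written only for this illustration.

```python
def run_time_seconds(run_time: str) -> int:
    """Convert a RUN_TIME value such as '2+20:47:59' (days+HH:MM:SS) to seconds."""
    days, clock = run_time.split("+")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return int(days) * 86400 + hours * 3600 + minutes * 60 + seconds


def productive_seconds(run_time: str, goodput_percent: float) -> float:
    """Estimate the seconds of forward progress within the given run time."""
    return run_time_seconds(run_time) * goodput_percent / 100.0


for job_id, run_time, goodput in [
    ("719.74", "2+20:47:59", 100.0),
    ("719.75", "2+20:38:45", 40.5),
]:
    hours = productive_seconds(run_time, goodput) / 3600.0
    print(f"job {job_id}: about {hours:.1f} productive hours")
```

Seen this way, job 719.75 ran for nearly as long as job 719.74 but, under the stated assumption, kept well under half of that time as useful work.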

  43. Setting Requirements • We believe that Condor must allow both users (jobs) and owners (machines) to set requirements. • This is an absolute necessity in order to convince people to participate in the community.

  44. ClassAds • ClassAds are a simple language for describing both the properties and the requirements of jobs and machines. • Condor stores nearly everything in ClassAds -- use the -l option to condor_q and condor_status to get the full details.

  45. ClassAd for a Machine

          % condor_status -l axpbo8
          MyType = "Machine"
          TargetType = "Job"
          Name = "axpbo8.bo.infn.it"
          START = TRUE
          VirtualMemory = 342696
          Disk = 28728536
          Memory = 160
          Cpus = 1
          Arch = "ALPHA"
          OpSys = "OSF1"

  46. ClassAd for a Job

          % condor_q -l 9.49
          MyType = "Job"
          TargetType = "Machine"
          Owner = "thain"
          Cmd = "/tmp_mnt/usr/users/ccl/thain/test/fib"
          Out = "fib.out.49"
          Args = "49"
          ImageSize = 2544
          DiskUsage = 2544
          Requirements = (Arch == "ALPHA") && (OpSys == "OSF1") && (Disk >= DiskUsage) && (VirtualMemory >= ImageSize)

  47. Default Requirements • By default, Condor assumes the requirements for your job are: “I need a machine with…” • The same operating system and architecture as my workstation. • Enough disk to store the program. • Enough virtual memory to run the program.
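The default requirements can be read as a simple predicate over a job ad and a machine ad. The Python sketch below mirrors the Requirements expression shown for job 9.49, using attribute values copied from the two sample ClassAds; the dictionaries and the function `default_requirements` are illustrative stand-ins, since Condor evaluates the real expression in its own ClassAd language.

```python
# Attribute values copied from the sample machine and job ClassAds.
machine = {
    "Arch": "ALPHA",
    "OpSys": "OSF1",
    "VirtualMemory": 342696,
    "Disk": 28728536,
    "Memory": 160,
    "Cpus": 1,
}
job = {"ImageSize": 2544, "DiskUsage": 2544}


def default_requirements(job, machine, arch="ALPHA", opsys="OSF1"):
    """Same platform as the submitting workstation, enough disk, enough virtual memory."""
    return (
        machine["Arch"] == arch
        and machine["OpSys"] == opsys
        and machine["Disk"] >= job["DiskUsage"]
        and machine["VirtualMemory"] >= job["ImageSize"]
    )


print(default_requirements(job, machine))
```

The `arch` and `opsys` defaults stand in for the submitting workstation's platform, which is what Condor fills in when the submit file gives no requirements of its own.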

  48. ClassAd Requirements • Similar to C/C++/Java expressions: • Symbols: Arch, OpSys, Memory, Mips • Values: 15, 6.5, "LINUX" • Operators: • ==, <, >, <=, >= • &&, || • ( )

  49. Adding Requirements • In the submit file, add a line beginning with "Requirements = ":

          Executable = fib
          Arguments = 40
          Output = fib.out
          Log = fib.log
          Requirements = (Memory > 64)
          queue

  50. Example Requirements • (Memory > 64) • (Machine == "axpbo3.bo.infn.it") • (Mips > 100) || (Kflops > 10000) • (Subnet != "131.154.10") && (Disk > 20000000)
