
High Throughput Computing with Condor at Notre Dame

High Throughput Computing with Condor at Notre Dame. Douglas Thain 30 April 2009. Today’s Talk. High Level Introduction (20 min) What is Condor? How does it work? What is it good for? Hands-On Tutorial (30 min) Finding Resources Submitting Jobs Managing Jobs Ideas for Scaling Up.


Presentation Transcript


  1. High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009

  2. Today’s Talk • High Level Introduction (20 min) • What is Condor? • How does it work? • What is it good for? • Hands-On Tutorial (30 min) • Finding Resources • Submitting Jobs • Managing Jobs • Ideas for Scaling Up

  3. The Cooperative Computing Lab • We create software that enables the reliable sharing of cycles and storage capacity between cooperating people. • We conduct research on the effectiveness of various systems and strategies for large scale computing. • We collaborate with others that need to use large scale computing, so as to find the real problems and make an impact on the world. • We operate systems like Condor that directly support research and collaboration at ND. http://www.cse.nd.edu/~ccl

  4. What is Condor? • Condor is software from UW-Madison that harnesses idle cycles from existing machines. (Most workstations are ~90% idle!) • With the assistance of CSE, OIT, and CRC staff, Condor has been installed on ~700 cores in Engineering and Science since early 2005. • The Condor pool expands the capabilities of researchers to perform both cycle-intensive and storage-intensive research. • New users and contributors are welcome to join! http://condor.cse.nd.edu

  5. [Diagram: Condor Distributed Batch System (~700 cores)] • Components shown: batch users, www portals, login nodes, db server, central mgr, and “flocking” to other Condor pools (Purdue ~10k cores, Wisconsin ~5k cores). • Machine groups shown: green house; netscale 16x2; cclsun 16x2; compbio 1x8; CSE 170; ccl 8x1; Fitzpatrick 130; iss 44x2; loco 32x2; cvrl 32x2; sc0 32x2; CHEG 25; EE 10; netscale 1x32; Nieu 20; DeBart 10. • Usage labels shown: Storage Research, Network Research, Timeshared Collaboration, Batch Capacity, Biometrics, Hadoop, MPI, Personal Workstations, Primary Interactive Users.

  6. http://www.cse.nd.edu/~ccl/viz

  7. The Condor Principle • Machine Owners Have Absolute Control • Set who can use the machine, for what, and when. • Can kick jobs off manually at any time. • Default policy that satisfies most people: • Start a job if the console has been idle > 15 minutes. • Suspend the job if the console is used or the CPU is busy. • Kick off the job if it has been suspended > 10 minutes. • After that, jobs run in this order: owner, research group, Notre Dame, elsewhere. For the full technical details, see: http://www.cse.nd.edu/~ccl/operations/condor/policy.shtml
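
The default policy on this slide can be read as a small state machine. The sketch below is only an illustration in plain shell, not Condor configuration: `policy_action` and its arguments are hypothetical names, and "console used" is modeled as zero minutes of console idle time.

```sh
# Hypothetical helper: what would the default policy do next?
#   $1 = job state on this machine: idle (not started), running, suspended
#   $2 = minutes since the console was last used
#   $3 = 1 if the owner's work is keeping the CPU busy, else 0
#   $4 = minutes the job has been suspended (optional, default 0)
policy_action() {
    state=$1
    console_idle=$2
    cpu_busy=$3
    suspended_for=${4:-0}
    case $state in
        idle)
            # Start job if console idle > 15 minutes.
            if [ "$console_idle" -gt 15 ]; then echo start; else echo wait; fi
            ;;
        running)
            # Suspend job if console used or CPU busy.
            if [ "$console_idle" -eq 0 ] || [ "$cpu_busy" -eq 1 ]; then
                echo suspend
            else
                echo continue
            fi
            ;;
        suspended)
            # Kick off job if suspended > 10 minutes.
            if [ "$suspended_for" -gt 10 ]; then echo kick; else echo wait; fi
            ;;
        *)
            echo "unknown state: $state" >&2
            return 1
            ;;
    esac
}
```

For example, `policy_action idle 20 0` prints `start`, and `policy_action suspended 0 0 11` prints `kick`.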

  8. What’s the value proposition? • If you install Condor on your workstations, servers, or clusters, then: • You retain immediate, preemptive priority on your machines, both batch and interactive. • You gain access to the unused cycles available on other machines. • By the way, other people get to use your machines when you are not using them.

  9. http://condor.cse.nd.edu

  10. http://condor.cse.nd.edu

  11. http://condor.cse.nd.edu

  12. Condor Architecture • schedd: represents a user with jobs to run. (“I want an INTEL CPU with > 3GB RAM.”) • startd: represents an available machine. (“I prefer to run jobs owned by user ‘joe’.”) • matchmaker, to both: “You two should talk to each other.” • schedd, to startd: “Run job with files X, Y.”

  13. ~700 CPUs at Notre Dame • [Diagram: a single matchmaker connected to many schedd and startd daemons.]

  14. Flocking to Other Sites • University of Wisconsin: 2000 CPUs • Purdue University: 20,000 CPUs • Notre Dame: 700 CPUs

  15. What is Condor Good For? • Condor works well on large workflows of sequential jobs, provided that they match the machines available to you. • Ideal workload: • One million jobs that require one hour each. • Doesn’t work at all: • An 8-node MPI job that must run now. • Many workloads can be converted into the ideal form, with varying degrees of effort.
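
To put the "ideal workload" in perspective, here is a back-of-the-envelope estimate in shell arithmetic. It assumes every one of the ~700 cores is available the whole time, which an idle-cycle pool does not promise, so the result is a best case rather than a prediction.

```sh
# Rough estimate for the "ideal workload" on the slide.
njobs=1000000       # one million jobs (from the slide)
hours_per_job=1     # one hour each (from the slide)
cores=700           # approximate pool size quoted in this talk

# Assumption: all cores are continuously available (best case only).
total_cpu_hours=$(( njobs * hours_per_job ))
wall_hours=$(( (total_cpu_hours + cores - 1) / cores ))   # round up
wall_days=$(( wall_hours / 24 ))                          # round down
serial_years=$(( total_cpu_hours / (24 * 365) ))          # round down

echo "CPU-hours needed: $total_cpu_hours"
echo "Best case on $cores cores: $wall_hours hours (about $wall_days days)"
echo "On a single core: about $serial_years years"
```

Under these assumptions the workload takes about two months on the pool, compared with more than a century on one core.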

  16. High Throughput Computing • Condor is not High Performance Computing • HPC: Run one program as fast as possible. • Condor is High Throughput Computing • HTC: Run as many programs as possible before my paper deadline on May 1st.

  17. Intermission and Questions

  18. Getting Started: If your shell is tcsh: % setenv PATH /afs/nd.edu/user37/condor/software/bin:$PATH If your shell is bash: % export PATH=/afs/nd.edu/user37/condor/software/bin:$PATH Then, create a temporary working space: % mkdir /tmp/YOURNAME % cd /tmp/YOURNAME

  19. Viewing Available Resources • Condor Status Web Page: • http://condor.cse.nd.edu • Command Line Tool: • condor_status • condor_status -constraint '(Memory>2048)' • condor_status -constraint '(Arch=="INTEL")' • condor_status -constraint '(OpSys=="LINUX")' • condor_status -run • condor_status -submitters • condor_status -pool boilergrid.rcac.purdue.edu

  20. A Simple Script Job #!/bin/sh echo $@ date uname -a % vi simple.sh % chmod 755 simple.sh % ./simple.sh hello world

  21. A Simple Submit File % vi simple.submit universe = vanilla executable = simple.sh arguments = hello condor output = simple.stdout error = simple.stderr should_transfer_files = yes when_to_transfer_output = on_exit log = simple.logfile queue

  22. Submitting and Watching a Job • Submit the job: • condor_submit simple.submit • Look at the job queue: • condor_q • Remove a job: • condor_rm <#> • See where the job went: • tail -f simple.logfile
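
The steps from the last three slides can be strung together in one script. This is a sketch, assuming the Condor tools are on your PATH as set up under Getting Started. It uses `condor_wait` in place of `tail -f` so that the script blocks until the job finishes, and it skips the Condor steps when the tools are not installed.

```sh
# Write the job script from the "Simple Script Job" slide.
cat > simple.sh <<'EOF'
#!/bin/sh
echo $@
date
uname -a
EOF
chmod 755 simple.sh

# Write the submit file from the "Simple Submit File" slide.
cat > simple.submit <<'EOF'
universe = vanilla
executable = simple.sh
arguments = hello condor
output = simple.stdout
error = simple.stderr
should_transfer_files = yes
when_to_transfer_output = on_exit
log = simple.logfile
queue
EOF

# Only talk to Condor if its tools are available.
if command -v condor_submit >/dev/null 2>&1; then
    condor_submit simple.submit
    condor_q
    # Block until the job recorded in the log file has completed.
    condor_wait simple.logfile
    cat simple.stdout
else
    echo "condor_submit not found; set PATH as shown in Getting Started" >&2
fi
```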

  23. Submitting Lots of Jobs % vi simple.submit universe = vanilla executable = simple.sh arguments = hello $(PROCESS) output = simple.stdout.$(PROCESS) error = simple.stderr.$(PROCESS) should_transfer_files = yes when_to_transfer_output = on_exit log = simple.logfile queue 50
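
With 50 jobs it helps to check which ones actually produced output. Because `queue 50` numbers `$(PROCESS)` from 0 to 49, a short loop can report the gaps. `missing_outputs` is a hypothetical helper written for this example, not a Condor tool.

```sh
# Hypothetical helper: print each process number (0 .. count-1) whose
# output file simple.stdout.<n> is missing or empty.
#   $1 = directory holding the output files
#   $2 = number of jobs that were queued
missing_outputs() {
    dir=$1
    count=$2
    n=0
    while [ "$n" -lt "$count" ]; do
        # -s is true only if the file exists and is not empty.
        [ -s "$dir/simple.stdout.$n" ] || echo "$n"
        n=$(( n + 1 ))
    done
}
```

Once the jobs have left the queue, `missing_outputs . 50` lists the process numbers to investigate in simple.logfile and the matching simple.stderr files.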

  24. What Happened to All My Jobs? • http://condorlog.cse.nd.edu

  25. Setting Requirements • By default, Condor will only run your job on a machine with the same CPU and OS as the submitter. • Use requirements to send your job to other kinds of machines: • requirements = (Memory>2084) • requirements = (Arch=="INTEL" || Arch=="X86_64") • requirements = (MachineGroup=="fitzlab") • requirements = (UidDomain!="nd.edu") • (Hint: Try out your requirements expressions using condor_status as above.)
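
Following the hint, one way to work is to keep the expression in a shell variable, test it against the live pool, and then paste the same text into the submit file. This is a sketch: the combined expression is just an example, and the live check is skipped when `condor_status` is not installed.

```sh
# Example expression built from attributes used on the slides.
REQ='(Memory>2048) && (Arch=="X86_64")'

if command -v condor_status >/dev/null 2>&1; then
    # List the machines that currently satisfy the expression.
    condor_status -constraint "$REQ"
else
    echo "condor_status not found; skipping the live check" >&2
fi

# The same expression goes into the submit file unchanged.
printf 'requirements = %s\n' "$REQ"
```

If the live check lists no machines, a job submitted with that expression will sit idle in the queue, so it is worth loosening the expression before submitting.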

  26. Setting Requirements • By default, Condor will assume any machine that satisfies your requirements is sufficient. • Use the rank expression to indicate which machines you prefer: • rank = (Memory>1024) • rank = (MachineGroup=="fitzlab") • rank = (Arch=="INTEL")*10 + (Arch=="X86_64")*20
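
The last rank expression relies on a true comparison counting as 1 and a false one as 0, with higher rank values preferred. The sketch below mimics that arithmetic in shell to show the value each kind of machine would receive. `rank_for_arch` is a hypothetical helper, and `OTHER` stands in for any architecture not named in the expression.

```sh
# Hypothetical helper: evaluate
#   rank = (Arch=="INTEL")*10 + (Arch=="X86_64")*20
# for one machine, treating true as 1 and false as 0.
rank_for_arch() {
    arch=$1
    intel=0
    x86_64=0
    if [ "$arch" = "INTEL" ]; then intel=1; fi
    if [ "$arch" = "X86_64" ]; then x86_64=1; fi
    echo $(( intel * 10 + x86_64 * 20 ))
}

for arch in INTEL X86_64 OTHER; do
    echo "$arch -> rank $(rank_for_arch "$arch")"
done
```

So among machines that satisfy the requirements, an X86_64 machine (20) is preferred over an INTEL machine (10), and either is preferred over anything else (0).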

  27. File Transfer • Notes to keep in mind: • Condor cannot write to AFS. (no creds) • Not all machines in Condor have AFS. • So, you must specify what files your job needs, and Condor will send them there: • transfer_input_files = x.dat, y.calib, z.library • By default, all files created by your job will be sent home automatically.
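
As a sketch, a submit file for a job that needs the three input files named above might look like the following. The `analyze.*` names are placeholders for illustration. Working from a local directory such as the /tmp space created under Getting Started sidesteps the AFS restrictions.

```sh
# Sketch of a submit file that ships three input files with the job.
# analyze.sh and the other analyze.* names are placeholders.
cat > analyze.submit <<'EOF'
universe = vanilla
executable = analyze.sh
transfer_input_files = x.dat, y.calib, z.library
output = analyze.stdout
error = analyze.stderr
should_transfer_files = yes
when_to_transfer_output = on_exit
log = analyze.logfile
queue
EOF

# Check locally that every input file exists before submitting.
for f in x.dat y.calib z.library; do
    [ -f "$f" ] || echo "missing input file: $f" >&2
done
```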

  28. In-Class Assignment • Execute 50 jobs that run on a machine not at Notre Dame that has >1GB RAM.
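
One possible starting point combines pieces from the earlier slides. It assumes that `Memory` is reported in megabytes, that `UidDomain` distinguishes Notre Dame machines as in the requirements examples, and that flocking to other pools is enabled for your account. The `assignment.*` file names are placeholders.

```sh
# Sketch: 50 jobs restricted to non-Notre Dame machines with > 1 GB RAM.
# The quoted 'EOF' keeps $(PROCESS) literal so Condor expands it, not sh.
cat > assignment.submit <<'EOF'
universe = vanilla
executable = simple.sh
arguments = hello $(PROCESS)
output = assignment.stdout.$(PROCESS)
error = assignment.stderr.$(PROCESS)
should_transfer_files = yes
when_to_transfer_output = on_exit
log = assignment.logfile
requirements = (UidDomain != "nd.edu") && (Memory > 1024)
queue 50
EOF
```

Since simple.sh prints `uname -a`, the output files give a quick way to see which hosts the jobs actually ran on.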
