
Integrating GPUs into Condor






Presentation Transcript


  1. Integrating GPUs into Condor Timothy Blattner Marquette University Milwaukee, WI April 22, 2009

  2. Outline • Background and Vision • Graphics Cards • Condor Approach • Problems • Conclusions and Future Work

  3. Graphics cards • Powerful – NVIDIA Tesla C1060 • 240 massively parallel processing cores • 4 GB GDDR3 • CUDA Capable • ~993 gigaflops • ~$1,300 • Cheap – NVIDIA 9800 GT • 112 massively parallel processing cores • 512 MB GDDR3 • CUDA Capable • ~$120

  4. Vision and Focus • Pool of computers containing graphics cards, managed by Condor • Provide users the ability to utilize graphics cards identified by the Condor Central Manager

  5. Opportunities Resources may already be there • The majority of machines have graphics cards in them • GPU resources sit idle while Condor runs on the CPU Similar work • GPUGRID.net • Distributed computing project using NVIDIA graphics cards for all-atom molecular simulations of proteins • Uses a GPU-enabled BOINC client

  6. Prototype Implementation • Linux only • Script queries the operating system and graphics card • Hawkeye Cron job manager runs the script • Script outputs graphics card information in ClassAd format • A binary for NVIDIA cards gathers more specific information
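The probe script itself is not reproduced in the transcript; a minimal sketch of the idea in shell, assuming `lspci` is available (the real gpu.sh and the exact attributes it emits may differ):

```shell
#!/bin/sh
# Sketch of a Hawkeye probe: ask the OS whether an NVIDIA card is
# present and report the answer as a ClassAd attribute on stdout.
probe_gpu() {
    if lspci 2>/dev/null | grep -qi 'nvidia'; then
        echo "HasGpu = True"
        # A vendor-specific binary would be run here to report details
        # such as core count, memory size, and CUDA capability.
    else
        echo "HasGpu = False"
    fi
}
probe_gpu
```

Hawkeye captures the script's stdout and merges each `name = value` pair into the machine's ClassAd.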

  7. Graphics Card Architecture

  8. Graphics card APIs • Favor general-purpose computations • CUDA (NVIDIA) • Brook (ATI) • OpenCL (Khronos Group)

  9. CUDA Programming Model • Kernels are functions run on the device (GPU) • Host (CPU) code invokes kernels and determines • Number of threads • Thread block structure for organizing threads • Kernel invocations are asynchronous • Control returns to the CPU immediately • CUDA provides synchronization primitives • Some CUDA calls (e.g. memory allocation) are synchronous

  10. Hawkeye Cron Job Manager • Provides mechanism for collecting, storing, and using information about computers • Periodically executes specified program(s) • Program outputs in form of ClassAd • Outputs are added to machine's ClassAd

  11. Hawkeye Implementation • Added to local configuration file • Runs script every minute • Condor user must be granted graphics card privileges in order to query the card

STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST), UPDATEGPU
STARTD_CRON_UPDATEGPU_EXECUTABLE = gpu.sh
STARTD_CRON_UPDATEGPU_PERIOD = 1m
STARTD_CRON_UPDATEGPU_MODE = Periodic
STARTD_CRON_UPDATEGPU_KILL = True

  12. Script Output

HasGpu = True
NGpu = 1
Gpu0 = "Quadro FX 3700"
Gpu0CudaCapable = True
Gpu0_Major = 1
Gpu0_Minor = 1
Gpu0Mem = 536150016
Gpu0Procs = 14
Gpu0Cores = 112
Gpu0ShareMem = 16384
Gpu0ThreadsPerBlock = 512
Gpu0ClockRate = 1.24
HasCuda = True
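Because the probe output is plain `name = value` ClassAd text, individual attributes can be pulled out with standard tools when checking a probe by hand. A small sketch (the attribute names come from the slide; the `classad_get` helper is hypothetical, not part of the prototype):

```shell
#!/bin/sh
# Extract one attribute's value from ClassAd "name = value" text on stdin.
classad_get() {
    awk -v a="$1" -F' = ' '$1 == a { print $2 }'
}

# Query the sample probe output for the core count.
printf 'HasGpu = True\nGpu0Cores = 112\nGpu0Mem = 536150016\n' | classad_get Gpu0Cores
# prints 112
```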

  13. Job Submission • Users can submit jobs with GPU requirements into Condor • Portable across Linux distros

Universe = vanilla
Executable = tests/CudaJob
Initialdir = gpuJobs
Requirements = (HasGpu == true) && (Gpu0CudaCapable == true)
Log = gpu_test.log
Error = gpu_test.stderr
Output = gpu_test.stdout
Queue

condor_submit gpu_job.submit

  14. Access Control • /dev/nvidiactl and /dev/nvidia* devices need read/write access by the submitting/running user • Could be • Nobody (open access) • Controlled by a Unix group containing a limited set of users • Integrated more directly with Condor user control (slot users)
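The group-based option above amounts to a couple of commands on each execute node. A sketch, assuming a hypothetical `gpuusers` group (running this on the real device nodes requires root):

```shell
#!/bin/sh
# Restrict the NVIDIA device nodes to a Unix group of GPU users.
grant_gpu_access() {
    group="$1"; shift
    for dev in "$@"; do
        chgrp "$group" "$dev"   # hand the node to the GPU users' group
        chmod 660 "$dev"        # rw for owner and group, none for others
    done
}
# Example (as root): grant_gpu_access gpuusers /dev/nvidiactl /dev/nvidia0
```

Condor's execution user (or the slot users) would then be added to that group so the probe script and GPU jobs can open the devices.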

  15. Problems • Preemption • Jobs running in a GPU kernel cannot be interrupted reliably by Unix signals • Watchdog timer • After 5 seconds, the job is killed • A solution: use the general-purpose card as a secondary device rather than the primary display • Memory Security • Malicious users, interrupting a job between GPU kernel calls, have the opportunity to overwrite or copy GPU memory

  16. Summary • Condor based approach for advertising GPU resources • Linux-based prototype implementation • Can access available GPUs • Works best on dedicated machines, with no need for preemption • Current Limitations • Doesn’t report GPU usage • Lack of preemption • Limited OS and video card support

  17. Future Work • Create benchmark and testing suite • Handle preemption • Investigate how watchdog works • GPU usage reporting • Integrate memory protection • Support more Operating Systems • Windows and Mac OS X • Support alternative architectures and APIs • Brook and OpenCL

  18. Questions? Contact: timothy.blattner@marquette.edu craig.struble@marquette.edu https://sourceforge.net/projects/condorgpu/
