Job Submission on WestGrid Feb 15 2005 on Access Grid

Presentation Transcript


  1. Job Submission on WestGrid, Feb 15 2005, on Access Grid

  2. Introduction • Simon Sharpe, one member of the WestGrid support team • The best way to contact us is to email support@westgrid.ca • This seminar tells you; • How to run, monitor, or cancel your jobs • How to select the best site for your job • How to adapt your job submission for different sites • How to get your jobs running as quickly as possible • Feel free to interrupt if you have questions

  3. Getting into the Queue • HPC Resources are valuable research tools • A batch queuing system is needed to • Match jobs to resources • Deliver maximum bang for the research buck • Distribute jobs and collect output across parallel CPUs • Ensure a fair sharing of resources

  4. Getting into the Queue • WestGrid compute sites use TORQUE/Moab • Based on PBS (Portable Batch System) • You need just a few commands common to WestGrid machines • There are important differences in job submission among sites you need to know about • With the diversity of WestGrid, it is possible that there is more than one machine suitable for your job

  5. A Simple Sample • This example shows how to run a serial job on Glacier, which is a good choice for serial jobs • The qsub command tells TORQUE to run the job described in the script file serialhello.pbs • The script file serialhello.pbs tells TORQUE how to run the C program serialhello • When your job completes, TORQUE creates two new files in the current directory capturing; • standard error from the job • standard output
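A serial submission of the kind this slide describes might look like the sketch below. The script contents are an assumption, because the slide's screenshot is not part of this transcript; only the names serialhello.pbs and serialhello come from the slide. The -N job name and the cd to $PBS_O_WORKDIR are ordinary TORQUE usage, not something confirmed for Glacier.

```shell
# Write a minimal job script. Its contents are assumed, not copied from the slide.
cat > serialhello.pbs <<'EOF'
#!/bin/bash
#PBS -N serialhello
# TORQUE starts the job in the home directory; move to where qsub was run.
cd "$PBS_O_WORKDIR"
./serialhello
EOF

# Submit only where the TORQUE client is installed.
if command -v qsub >/dev/null 2>&1; then
    qsub serialhello.pbs
else
    echo "qsub not found; wrote serialhello.pbs without submitting"
fi
```

With the job name set this way, TORQUE's default names for the two output files are serialhello.o<jobid> and serialhello.e<jobid>.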

  6. End of Seminar • Thanks for coming • I wish it was that easy

  7. HPC: One Size Does Not Fit All • When the only tool you have is a hammer, every job looks like a nail • Things that affect system selection; • System dictated by executable or licensing • MPI or OpenMP • Availability: How busy is the system? • Amount of RAM required • Speed or number of processors

  8. HPC: One Size Does Not Fit All • Things that affect system selection (continued); • Scalability of your application • Inter-processor communication requirements • Queue limits (walltime, number of CPUs) • Inertia: It is where we’ve always run it • More information at; • http://www.westgrid.ca/support/System_Status • http://www.westgrid.ca/support/Facilities • http://www.westgrid.ca/support/software

  9. Uses of WestGrid Machines

  10. TORQUE and Moab Commands
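The command table from this slide did not survive in the transcript. As a rough stand-in, the sketch below strings together standard TORQUE and Moab client commands for running, monitoring, and cancelling a job; it is not necessarily the list the slide showed, and options may behave differently from site to site. The function name job_lifecycle is hypothetical, and serialhello.pbs is the script named on slide 5.

```shell
# Submit a job, inspect it, then cancel it. Takes the job script as $1.
job_lifecycle() {
    if ! command -v qsub >/dev/null 2>&1; then
        echo "TORQUE client commands not found"
        return 0
    fi
    jobid=$(qsub "$1")      # submit; qsub prints the new job's identifier
    qstat "$jobid"          # status of that one job
    qstat -u "$USER"        # all of your own jobs
    showq                   # Moab's view of the queue
    qdel "$jobid"           # cancel the job
}

job_lifecycle serialhello.pbs
```

In real use you would not cancel a job straight after submitting it; the steps are shown together only to keep the example short.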

  11. Sample MPI job on Glacier

  12. MPI Submission on Glacier • We need to tell TORQUE how many processors we need • This asks for 2 nodes and 2 processors per node (4 CPUs) • Similar script to last time, but now calling program parallelized with MPI • Adding the walltime estimate helps TORQUE schedule the job • Note that we can pass directives; • on the command line or • in the script • This time we wait in the queue
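The request on this slide can be sketched as follows. Only the 2 nodes by 2 processors per node request and the idea of a walltime estimate come from the slide; the ten-minute walltime, the mpihello file names (borrowed from the Cortex slide), and the mpiexec launcher are assumptions, and the launcher in particular differs between sites.

```shell
# Job script contents are assumed; see the note above for what comes from the slide.
cat > mpihello.pbs <<'EOF'
#!/bin/bash
#PBS -l nodes=2:ppn=2
#PBS -l walltime=00:10:00
cd "$PBS_O_WORKDIR"
# The MPI launcher differs between sites; mpiexec is an assumption here.
mpiexec ./mpihello
EOF

if command -v qsub >/dev/null 2>&1; then
    # The same requests could be passed on the command line instead:
    #   qsub -l nodes=2:ppn=2,walltime=00:10:00 mpihello.pbs
    qsub mpihello.pbs
else
    echo "qsub not found; wrote mpihello.pbs without submitting"
fi
```

Keeping the directives in the script records the request alongside the job, while the command-line form is handy for trying a different node count without editing the file.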

  13. Sample MPI job on Matrix

  14. Running MPI Jobs on Matrix • For Matrix, use nodes and processors per node (ppn) to tell TORQUE how many CPUs your job needs • Matrix machines have 2 CPUs per node • A minimal TORQUE script to run a parallel MPI job on Matrix • Standard and error output are dropped into the directory we submitted from

  15. Sample MPI job on Lattice

  16. Running MPI Jobs on Lattice • For Lattice, use nodes and processors per node to set the number of processors • Lattice has 4 processors on each node • In this case we ask for 2 CPUs on one box and 2 on another • A minimal TORQUE script to run a parallel MPI job on Lattice • Standard and error output are dropped into the directory we submitted from

  17. Sample Serial Job on Lattice • Lattice has a high-speed Quadrics interconnect • If your job is serial, it does not take advantage of the Quadrics interconnect • Glacier may be an alternative • Having said that, many serial jobs are run on Lattice

  18. Running Serial Jobs on Lattice • On Lattice, we tell TORQUE to run the job described in the script file serialhello.pbs • A minimal TORQUE script to run a serial job on Lattice • Standard and error output are dropped into the directory we submitted from

  19. Sample Parallel job on Cortex

  20. Running Parallel Jobs on Cortex • On Cortex, we tell TORQUE to run the job described in the script file mpihello.pbs • The script describes how we want Cortex to run the parallel program mpihello • The standard output file is dropped into our working directory

  21. Sample Parallel Job on Nexus • Nexus is a collection of SGI SMP machines • Several sizes serviced by different queues. • Test on smaller machines, heavy lifting on large ones • A good home for parallel jobs with intense communication requirements and/or large memory needs • More information at; http://www.ualberta.ca/AICT/RESEARCH/PBS/index.westgrid.html

  22. Running OpenMP Jobs on Nexus • You can try trivial OpenMP jobs from the command line • This job ran interactively on the head node • You should not use more than 2 processors for interactive jobs • To run jobs requiring real processing, you must submit them to TORQUE • For Nexus, match ncpus with OMP_NUM_THREADS • In this case we ask for 8 CPUs on the Helios machine (8-32 CPUs)
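One way to keep ncpus and OMP_NUM_THREADS matched is to generate both from a single value, as in the sketch below. The 8-CPU request comes from the slide; the executable name omphello is hypothetical, and the queue selection needed to land on Helios is left out because it does not appear in this transcript.

```shell
# One variable feeds both the ncpus request and OMP_NUM_THREADS so they match.
NCPUS=8

# Unquoted heredoc: $NCPUS expands now; \$PBS_O_WORKDIR is left for job run time.
cat > omphello.pbs <<EOF
#!/bin/bash
#PBS -l ncpus=$NCPUS
cd "\$PBS_O_WORKDIR"
export OMP_NUM_THREADS=$NCPUS
# omphello is a hypothetical OpenMP executable name.
./omphello
EOF

# Show the generated script for inspection before submitting it with qsub.
cat omphello.pbs
```

Changing NCPUS in one place then updates both the scheduler request and the thread count, which avoids asking for more CPUs than the program will actually use.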

  23. Sample Serial Job on Robson • Robson is a new 56 processor Power5 system • 64-bit Linux • Good for serial work, may be suitable for some parallel processing. • Message passing through MPI • More info at; http://www.westgrid.ca/support/robson

  24. Running Serial Jobs on Robson • This is a minimal serial job submission script for Robson • It runs the executable “hello” • A more elaborate script example is available; http://www.westgrid.ca/support/robson • Robson also runs MPI parallel jobs, as described on the above web page

  25. Shortening HPC Cycle • Try your jobs at different sites • Test your process on small jobs • Give realistic walltimes, memory requirements • Apply for a larger Resource Allocation • http://www.westgrid.ca/manage_rac.html

  26. Summary • HPC jobs have differing requirements • WestGrid provides an increasing variety of tools • Use the system that is best for your job • Start off simple and small • Find out how well your job scales • Getting help • Because of implementation differences, “man qsub” might not be your best source of help • Support pages as listed throughout this presentation • Email support@westgrid.ca
