This document provides an overview of running jobs on SDSC resources, including using DataStar, the IA64 cluster, and the HPSS resource. It covers the specifics of different nodes, job queues, and access methods, addressing essential tools like Load Leveler and PBS. It also highlights the need for effective communication with job managers and offers guidelines for accessing HPC resources efficiently. Users will find information on batch and interactive computing, job scheduling, and special command line submissions tailored for various needs.
Running jobs on SDSC Resources Krishna Muriki Oct 31, 2006 kmuriki@sdsc.edu SDSC User Services
Agenda • Using DataStar • Using the IA64 cluster • Using the HPSS resource
DataStar Overview • P655 :: (8-way, 16GB) 176 nodes • P655+ :: (8-way, 32GB) 96 nodes • P690 :: (32-way, 64GB) 2 nodes • P690 :: (32-way, 128GB) 4 nodes • P690 :: (32-way, 256GB) 2 nodes Total: 280 nodes, 2,432 processors
Batch/Interactive computing • Batch job queues: • Job queue manager – Load Leveler (IBM tool) • Job queue scheduler – Catalina (SDSC internal tool) • Job queue monitoring – various tools (commands) • Job accounting – job filter (SDSC internal Perl scripts)
DataStar Access • Three login nodes, each with an intended usage mode: • dslogin.sdsc.edu :: Production runs (P690, 32-way, 64GB) • dspoe.sdsc.edu :: Test/debug runs (P655, 8-way, 16GB) • dsdirect.sdsc.edu :: Special needs (P690, 32-way, 256GB) Note: the division into usage modes above is not strictly enforced.
Test/debug runs (usage from dspoe) [dspoe.sdsc.edu :: P655, 8-way, 16GB] • Access to two queues: • P655 nodes [shared] • P655 nodes [not shared] • Job queues have Job filter + Load Leveler only (very fast) • Special command-line submission (along with a job script); see the sketch below.
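A minimal Load Leveler script sketch for a short test run submitted from dspoe (the class name, node count, and executable are assumptions; check the workshop examples for the exact keywords used at SDSC):

#!/bin/csh
# Sketch of a test/debug job script; the 'express' class name is an assumption
#@ job_type = parallel
#@ class = express
#@ node = 1
#@ tasks_per_node = 8
#@ wall_clock_limit = 00:10:00
#@ output = test.$(jobid).out
#@ error = test.$(jobid).err
#@ queue
poe ./a.out

Submit it with 'llsubmit test.cmd'; on dspoe only the job filter and Load Leveler process it, so it should start quickly.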
Production runs (usage from dslogin) [dslogin.sdsc.edu :: P690, 32-way, 64GB] • Data transfer, source editing, compilation, etc. • Two queues: • Onto p655/p655+ nodes [not shared] • Onto p690 nodes [shared] • Job queues have Job filter + Load Leveler + Catalina (slow updates); a sample production script follows below.
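For comparison, a hedged sketch of a multi-node production script submitted from dslogin (the class name, node count, and network keyword values are placeholders; adapt them from the workshop examples):

#!/bin/csh
# Production run sketch; 'normal' targets p655/p655+ nodes, 'normal32' the p690 nodes
#@ job_type = parallel
#@ class = normal
#@ node = 4
#@ tasks_per_node = 8
#@ wall_clock_limit = 02:00:00
#@ network.MPI = sn_all,not_shared,US
#@ output = prod.$(jobid).out
#@ error = prod.$(jobid).err
#@ queue
poe ./a.out

Submit with 'llsubmit prod.cmd'; since Catalina schedules these queues, expect slower status updates than on dspoe.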
Special needs (usage from dsdirect) [dsdirect.sdsc.edu :: P690, 32-way, 256GB] • All visualization needs • All post-run data analysis needs • Shared node (with 256 GB of memory) • Process accounting in place • Fully interactive usage (run your a.out directly) • No Job filter, no Load Leveler, no Catalina
Suggested usage model • Start with dspoe (test/debug queues) • Do production runs from dslogin (normal & normal32 queues) • Use the express queues from dspoe when you need to run right away • Use dsdirect for special needs
Accounting • reslist –u user_name • reslist –a account_name
Now let's do it! • Example files are located here: • /gpfs/projects/workshop/running_jobs • Copy the whole directory (tcsh); see the commands below • Use the Makefile to compile the source code • Edit the parameters in the job submission scripts • Communicate with the job manager in its language
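For example, from a tcsh login shell (the Makefile target and script names are whatever the workshop directory provides):

cp -r /gpfs/projects/workshop/running_jobs ~/    # copy the examples to your home area
cd ~/running_jobs
make                                             # build the example source with the provided Makefile
# edit the job submission scripts, then submit them as shown on the next slides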
Job Manager language • Ask it to show the queue: llq • Ask it to submit your job to the queue: llsubmit • Ask it to cancel your job in the queue: llcancel • Special (more useful) commands from SDSC's in-house tool, Catalina (please bear with me, it updates slowly): • 'showq' to look at the status of the queue • 'show_bf' to look at backfill window opportunities • A sample session follows below.
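A hedged sample session (the job-step identifier passed to llcancel is made up; use the one llq reports for your job):

llq -u $USER              # show your jobs in the queue
llsubmit run.cmd          # submit a job script
llcancel ds002.12345.0    # cancel a job by its step id
showq                     # Catalina's view of the queue
show_bf                   # current backfill window opportunities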
Access to HPSS - 1 • What is HPSS: the centralized, long-term data storage system at SDSC is the High Performance Storage System (HPSS) • Currently stores more than 3 PB of data (as of June 2006) • Total system capacity of 7.2 PB • Data added at an average rate of 100 TB per month (between Aug '05 and Feb '06)
Access to HPSS - 2 • First, set up your authentication: • run the 'get_hpss_keytab' script • Learn the HPSS language to talk to it: • hsi • htar • Example commands are sketched below.
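A few hedged examples of the two commands (file and archive names are made up):

get_hpss_keytab                 # one-time authentication setup
hsi put results.dat             # store a single file in HPSS
hsi get results.dat             # retrieve it later
htar -cvf run42.tar run42/      # bundle a whole directory into an archive in HPSS
htar -xvf run42.tar             # extract that archive back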
IA64 cluster overview • Around 265 nodes. • 2-way nodes • 4GB memory per node. • Batch job environment • Job Manager – PBS (Open source tool) • Job Scheduler – Catalina (SDSC internal tool) • Job Monitoring – Various commands & ‘Clumon’
IA64 Access • IA64 login nodes • tg-login1.sdsc.edu (alias to tg-login.sdsc.edu) • tg-login2.sdsc.edu • tg-c127.sdsc.edu, tg-c128.sdsc.edu, tg-c129.sdsc.edu & tg-c130.sdsc.edu
Queues & Nodes. • Total around 260 nodes • With 2 processors each. • All in single batch queue – ‘dque’ • That’s sufficient now lets do it! • Example files in • /gpfs/projects/workshop/running_jobs • PBS commands – qstat, qsub, qdel
Running Interactive • Interactive use is via PBS: qsub -I -V -l walltime=00:30:00 -l nodes=4:ppn=2 • This request is for 4 nodes for interactive use (2 cpus/node) with a maximum wall-clock time of 30 minutes. Once the scheduler can honor the request, PBS responds with "ready" and gives the node names. • Once nodes are assigned, the user can run any interactive command. For example, to run an MPI program, parallel-test, on the 4 nodes (8 cpus): mpirun -np 8 -machinefile $PBS_NODEFILE parallel-test
References • See all web links at • http://www.sdsc.edu/user_services • Reach us at consult@sdsc.edu