 
                
Running jobs on SDSC Resources
Krishna Muriki, SDSC User Services
Oct 31, 2006
kmuriki@sdsc.edu

Agenda
• Using DataStar
• Using the IA64 cluster
• Using the HPSS resource

DataStar Overview
• P655: 8-way, 16 GB, 176 nodes
• P655+: 8-way, 32 GB, 96 nodes
• P690: 32-way, 64 GB, 2 nodes
• P690: 32-way, 128 GB, 4 nodes
• P690: 32-way, 256 GB, 2 nodes
Total: 280 nodes, 2,432 processors.

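The totals on this slide can be checked with a little shell arithmetic. This is only a sanity check of the numbers above; the variable names are made up for the illustration.

```shell
#!/bin/sh
# Node counts per type, taken from the slide.
p655_nodes=$(( 176 + 96 ))      # 8-way P655 and P655+ nodes
p690_nodes=$(( 2 + 4 + 2 ))     # 32-way P690 nodes

total_nodes=$(( p655_nodes + p690_nodes ))
total_procs=$(( p655_nodes * 8 + p690_nodes * 32 ))

echo "nodes: $total_nodes, processors: $total_procs"
```
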
Batch/Interactive computing
• Batch job queues:
  • Job queue manager: LoadLeveler (tool from IBM)
  • Job queue scheduler: Catalina (SDSC internal tool)
  • Job queue monitoring: various tools (commands)
  • Job accounting: job filter (SDSC internal Perl scripts)

DataStar Access
• Three login nodes, each with a typical usage mode (platform in parentheses):
  • dslogin.sdsc.edu: production runs (P690, 32-way, 64 GB)
  • dspoe.sdsc.edu: test/debug runs (P655, 8-way, 16 GB)
  • dsdirect.sdsc.edu: special needs (P690, 32-way, 256 GB)
Note: this division of usage modes is not strictly enforced.

Test/debug runs (usage from dspoe)
[dspoe.sdsc.edu :: P655, 8-way, 16 GB]
• Access to two queues:
  • P655 nodes [shared]
  • P655 nodes [not shared]
• Job queues use the job filter + LoadLeveler only (very fast)
• Special command-line submission (along with the job script)

Production runs (usage from dslogin)
[dslogin.sdsc.edu :: P690, 32-way, 64 GB]
• Data transfer, source editing, compilation, etc.
• Two queues:
  • Onto p655/p655+ nodes [not shared]
  • Onto p690 nodes [shared]
• Job queues use the job filter + LoadLeveler + Catalina (slow updates)

All special needs (usage from dsdirect)
[dsdirect.sdsc.edu :: P690, 32-way, 256 GB]
• All visualization needs
• All post-run data analysis needs
• Shared node (with 256 GB of memory)
• Process accounting in place
• Fully interactive (a.out) usage
• No job filter, no LoadLeveler, no Catalina

Suggested usage model
• Start with dspoe (test/debug queues).
• Do production runs from dslogin (normal & normal32 queues).
• Use the express queues from dspoe when a job needs to run right away.
• Use dsdirect for special needs.

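The usage model above can be written down as a small lookup. This is a sketch only: `pick_login_node` is a hypothetical helper, not an SDSC tool, and the example assumes the login nodes are reached with ssh.

```shell
#!/bin/sh
# Hypothetical helper: map a kind of work to the login node the slides suggest.
pick_login_node() {
    case "$1" in
        test|debug|express) echo "dspoe.sdsc.edu" ;;
        production)         echo "dslogin.sdsc.edu" ;;
        special)            echo "dsdirect.sdsc.edu" ;;
        *)                  echo "unknown kind of work: $1" >&2; return 1 ;;
    esac
}

# Print the ssh command rather than running it.
echo "ssh $(pick_login_node test)"
```
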
Accounting
• reslist -u user_name
• reslist -a account_name

Now let's do it!
• Example files are located here:
  • /gpfs/projects/workshop/running_jobs
• Copy the whole directory (tcsh).
• Use the Makefile to compile the source code.
• Edit the parameters in the job submission scripts.
• Communicate with the job manager using his language.

Job Manager language
• Ask him to show the queue: llq
• Ask him to submit your job to the queue: llsubmit
• Ask him to cancel your job in the queue: llcancel
• Special, more useful commands from SDSC's in-house tool, Catalina ("please bear with me, I'm slow"):
  • 'showq' to look at the status of the queue
  • 'show_bf' to look at the backfill window opportunities

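As a rough illustration of the submit step, the sketch below writes a minimal LoadLeveler job command file and hands it to llsubmit when that command is available. The `normal` class comes from the slides; the job name, node counts, time limit, file names, and the `hello_mpi` executable are all placeholders, and the workshop's own example scripts may set additional keywords.

```shell
#!/bin/sh
# Write a small LoadLeveler job command file. Every value below is a placeholder.
job_file="datastar_job.cmd"

# The quoted 'EOF' keeps the shell from expanding $(jobid), which is a
# LoadLeveler variable, not a shell command substitution.
cat > "$job_file" <<'EOF'
#!/bin/sh
#@ job_name         = hello_mpi
#@ job_type         = parallel
#@ class            = normal
#@ node             = 2
#@ tasks_per_node   = 8
#@ wall_clock_limit = 00:30:00
#@ output           = hello_mpi.$(jobid).out
#@ error            = hello_mpi.$(jobid).err
#@ queue

# Hypothetical MPI executable, started through IBM's poe launcher.
poe ./hello_mpi
EOF

# Submit only where LoadLeveler is installed; otherwise just report the file.
if command -v llsubmit >/dev/null 2>&1; then
    llsubmit "$job_file"
else
    echo "llsubmit not found; wrote $job_file"
fi
```

Once a job is submitted, `llq -u $USER` narrows the queue listing to your own jobs, and `llcancel` followed by the job ID reported by llq removes one.
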
Access to HPSS - 1
• What is HPSS: the High Performance Storage System (HPSS) is the centralized, long-term data storage system at SDSC.
• Currently stores more than 3 PB of data (as of June 2006)
• Total system capacity of 7.2 PB
• Data added at an average rate of 100 TB per month (between Aug '05 and Feb '06)

Access to HPSS - 2
• First, set up your authentication:
  • Run the 'get_hpss_keytab' script.
• Learn the HPSS language to talk to it:
  • hsi
  • htar

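A typical htar/hsi session might look like the sketch below. It assumes authentication is already set up, uses hypothetical names (`results`, `results.tar`), and defaults to a dry run that only prints the commands; set DRY_RUN=0 on a machine with the HPSS clients installed to actually run them.

```shell
#!/bin/sh
# Dry-run by default so the sketch is safe to try anywhere.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Bundle a directory into an archive stored in HPSS, then verify it.
archive_results() {
    dir="$1"
    archive="$2"
    run htar -cvf "$archive" "$dir"   # create the archive in HPSS
    run htar -tvf "$archive"          # list the archive's contents
    run hsi ls -l "$archive"          # confirm the archive file exists in HPSS
}

archive_results results results.tar
```

To bring the data back later, `htar -xvf results.tar` extracts the archive into the current directory.
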
IA64 cluster overview
• Around 265 nodes
• 2-way nodes
• 4 GB memory per node
• Batch job environment:
  • Job manager: PBS (open source tool)
  • Job scheduler: Catalina (SDSC internal tool)
  • Job monitoring: various commands & 'Clumon'

IA64 Access
• IA64 login nodes:
  • tg-login1.sdsc.edu (alias to tg-login.sdsc.edu)
  • tg-login2.sdsc.edu
  • tg-c127.sdsc.edu, tg-c128.sdsc.edu, tg-c129.sdsc.edu & tg-c130.sdsc.edu

Queues & Nodes
• Around 260 nodes in total
• With 2 processors each
• All in a single batch queue: 'dque'
• That's sufficient; now let's do it!
• Example files in:
  • /gpfs/projects/workshop/running_jobs
• PBS commands: qstat, qsub, qdel

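For batch use, a PBS job script for the 'dque' queue might look roughly like the sketch below. The resource request and mpirun line are borrowed from the interactive example in this deck; the job name and file name are placeholders, and the workshop's example scripts may differ.

```shell
#!/bin/sh
# Write a small PBS job script; the file name is a placeholder.
job_file="ia64_job.pbs"

# The quoted 'EOF' leaves $PBS_O_WORKDIR and $PBS_NODEFILE unexpanded,
# so PBS fills them in when the job runs.
cat > "$job_file" <<'EOF'
#!/bin/sh
#PBS -N parallel-test
#PBS -q dque
#PBS -l nodes=4:ppn=2
#PBS -l walltime=00:30:00
#PBS -V

# Run from the directory the job was submitted from.
cd "$PBS_O_WORKDIR"

# 4 nodes x 2 processors per node = 8 MPI processes.
mpirun -np 8 -machinefile "$PBS_NODEFILE" ./parallel-test
EOF

# Submit only where PBS is installed; otherwise just report the file.
if command -v qsub >/dev/null 2>&1; then
    qsub "$job_file"
else
    echo "qsub not found; wrote $job_file"
fi
```

After submission, `qstat -u $USER` shows your jobs and `qdel` followed by the job ID removes one.
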
Running Interactive
Interactive use is via PBS:
    qsub -I -V -l walltime=00:30:00 -l nodes=4:ppn=2
• This request is for 4 nodes for interactive use (using 2 CPUs per node) for a maximum wall-clock time of 30 minutes. Once the scheduler can honor the request, PBS responds with "ready" and gives the node names.
• Once nodes are assigned, the user can run any interactive command. For example, to run an MPI program, parallel-test, on the 4 nodes (8 CPUs):
    mpirun -np 8 -machinefile $PBS_NODEFILE parallel-test

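Rather than hard-coding 8, the process count can be derived from the node file, which PBS fills with one line per assigned CPU slot. The sketch below is an illustration under that assumption: `count_cpus` is a hypothetical helper, and outside a PBS session it builds a stand-in node file with made-up node names so it can still be tried.

```shell
#!/bin/sh
# Count the CPUs PBS assigned: the node file has one line per CPU slot.
count_cpus() {
    # Arithmetic expansion strips any padding wc adds to its output.
    echo $(( $(wc -l < "$1") ))
}

# Outside a PBS job there is no node file, so build a stand-in for the demo:
# 4 hypothetical node names, each listed twice (ppn=2).
if [ -z "${PBS_NODEFILE:-}" ]; then
    PBS_NODEFILE=$(mktemp)
    for n in 1 2 3 4; do
        echo "node$n"
        echo "node$n"
    done > "$PBS_NODEFILE"
fi

np=$(count_cpus "$PBS_NODEFILE")

# Print the launch command rather than running it.
echo "mpirun -np $np -machinefile $PBS_NODEFILE parallel-test"
```
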
References
• See all web links at http://www.sdsc.edu/user_services
• Reach us at consult@sdsc.edu