

Presentation Transcript


  1. ISTeC Cray High-Performance Computing System
  Richard Casey, PhD, RMRCE, CSU Center for Bioinformatics

  2. System Architecture
  (Slide shows front and back views of the cabinet, with labels for: compute blades (batch compute nodes), compute blades (interactive compute nodes), SeaStar 2+ interconnect, login node, boot node, and Lustre file system node.)

  3. XT6m Compute Node Architecture
  (Slide diagram shows 24 "Greyhound" cores per node arranged in four 6-core dies, each die with 6 MB of L3 cache and DDR3 memory channels, linked by HyperTransport 3 (HT3), with an HT link to the interconnect.)
  • Each compute node contains 2 processors (2 sockets)
  • 64-bit AMD Opteron "Magny-Cours" 1.9 GHz processors
  • 1 NUMA processor = 6 cores
  • 4 NUMA processors per compute node
  • 24 cores per compute node
  • 4 NUMA processors per compute blade
  • 32 GB RAM (shared) per compute node = 1.664 TB total RAM (ECC DDR3 SDRAM)
  • 1.33 GB RAM per core

  4. Compute Node Status
  • Check whether interactive and batch compute nodes are up or down:
    xtprocadmin

    NID   (HEX)   NODENAME     TYPE      STATUS   MODE
    12    0xc     c0-0c0s3n0   compute   up       interactive
    13    0xd     c0-0c0s3n1   compute   up       interactive
    14    0xe     c0-0c0s3n2   compute   up       interactive
    15    0xf     c0-0c0s3n3   compute   up       interactive
    16    0x10    c0-0c0s4n0   compute   up       interactive
    17    0x11    c0-0c0s4n1   compute   up       interactive
    18    0x12    c0-0c0s4n2   compute   up       interactive
    42    0x2a    c0-0c1s2n2   compute   up       batch
    43    0x2b    c0-0c1s2n3   compute   up       batch
    44    0x2c    c0-0c1s3n0   compute   up       batch
    45    0x2d    c0-0c1s3n1   compute   up       batch
    61    0x3d    c0-0c1s7n1   compute   up       batch
    62    0x3e    c0-0c1s7n2   compute   up       batch
    63    0x3f    c0-0c1s7n3   compute   up       batch

  • Currently:
    1,248 batch compute cores (fluctuates somewhat)
    192 interactive compute cores (fluctuates somewhat)
  • Node naming convention: Cabinet X-Y, Cage X, Slot X, Node X; e.g. c0-0c0s3n0 = Cabinet 0-0, Cage 0, Slot 3, Node 0

  5. Compute Node Status
  • Check the state of interactive and batch compute nodes and whether they are already allocated to other users' jobs:
    xtnodestat

    Current Allocation Status at Tue Apr 19 08:15:02 2011
         C0-0
      n3 -------B
      n2 -------B
      n1 --------
    c1n0 --------
      n3 SSSaa;--
      n2    aa;--
      n1    aa;--
    c0n0 SSSaa;--
        s01234567

    Legend:
      (blank)  nonexistent node
      S  service node (login, boot, lustrefs)
      ;  free interactive compute node
      -  free batch compute node
      A  allocated, but idle compute node
      ?  suspect compute node
      X  down compute node
      Y  down or admindown service node
      Z  admindown compute node

    Available compute nodes: 4 interactive, 38 batch

  (Slide annotations label the cabinet ID, cage/node rows, slot columns (= blades), service nodes, and the free vs. allocated interactive and batch compute nodes in the display.)

  6. Batch Queues
  • Current batch queue configuration
  • Under re-evaluation; may change in the future to fair-share queues

    Queue name        Priority   Max runtime (wallclock)   Max jobs per user
    small             high         1 hr.                          20
    medium            medium      24 hrs.                          2
    large             low        168 hrs. (1 week)                 1
    ccm_queue         ---        ---                             ---
    priority_queue    ---        ---                             ---
    batch             ---        ---                             ---
    woodward          ---        ---                             ---
    woodward_ccm      ---        ---                             ---
    EFS               ---        ---                             ---

  7. Batch Jobs
  • PBS/Torque/Moab batch queue management system
  • For submission and management of jobs in batch queues
  • Use for jobs with large resource requirements (long-running, number of cores, memory, etc.)
  • List all available queues:
    qstat -Q    (brief)
    qstat -Qf   (full)

    rcasey@cray2:~> qstat -Q
    Queue            Max Tot Ena Str Que Run Hld Wat Trn Ext T
    ---------------- --- --- --- --- --- --- --- --- --- --- -
    batch              0   0 yes yes   0   0   0   0   0   0 E

  • Show the status of jobs in all queues:
    qstat               (all queued jobs)
    qstat -u username   (only queued jobs for "username")
    (Note: if there are no jobs running in any of the batch queues, this command shows nothing and just returns the Linux prompt.)

    rcasey@cray2:~/lustrefs/mpi_c> qstat
    Job id                    Name             User            Time Use S Queue
    ------------------------- ---------------- --------------- -------- - -----
    1753.sdb                  mpic.job         rcasey                 0 R batch

  8. Batch Jobs
  • Common job states:
    Q: job is queued
    R: job is running
    E: job is exiting after having run
    C: job is completed after having run
  • Submit a job to the default batch queue:
    qsub filename
    "filename" is the name of a file that contains batch queue commands.
    Command-line directives override batch script directives, e.g. in "qsub -N newname script", "newname" overrides the "-N name" directive in the batch script.
  • Delete a job from the batch queues:
    qdel jobid
    "jobid" is the job ID number as displayed by the "qstat" command. You must be the owner of the job in order to delete it.

  9. Sample Batch Job Script

    #!/bin/bash
    #PBS -N jobname
    #PBS -j oe
    #PBS -l mppwidth=24
    #PBS -l mppdepth=1
    #PBS -l walltime=1:00:00
    #PBS -q small

    cd $PBS_O_WORKDIR
    date
    export OMP_NUM_THREADS=1
    aprun -n24 -d1 executable

  • Batch queue directives:
    -N            name of the job
    -j oe         combine standard output and standard error in a single file
    -l mppwidth   number of cores to allocate to the job (MPI tasks)
    -l mppdepth   number of threads per core (OpenMP)
    -l walltime   maximum wall clock time for the job to run (hh:mm:ss)
    -q            queue to submit the job to (if none is specified, the job is sent to the small queue)

  10. Sample Batch Job Script
  • The PBS_O_WORKDIR environment variable is set by Torque/PBS. It contains the absolute path to the directory from which you submitted your job, and it is required for Torque/PBS to find your executable files.
  • Linux commands and environment variables can be included in the batch job script.
  • The value of the aprun "-n" parameter should match the value of the PBS "mppwidth" directive, e.g. "#PBS -l mppwidth=24" paired with "aprun -n 24 exe".
  • Request the proper resources:
    If "-n" or "mppwidth" > 1,248, the job is held in the queued state for a while and then deleted.
    If "mppwidth" < "-n", you get the error message "apsched: claim exceeds reservation's node-count".
    If "mppwidth" > "-n", the job runs OK.

  11. Sample Batch Job Script
  • For MPI code:
    ALPS places MPI tasks sequentially on cores within a compute node.
    If mppwidth (= -n) > 24, ALPS places MPI tasks on multiple compute nodes.

    #PBS -N mpicode
    #PBS -j oe
    #PBS -l mppwidth=12
    #PBS -l walltime=00:10:00
    #PBS -q small
    # mppwidth = -n = number of cores

    cd $PBS_O_WORKDIR
    cc -o mpicode mpicode.c
    aprun -n12 ./mpicode
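  The contents of mpicode.c are not shown in the deck; the following is a minimal sketch of what such a program might look like (the program body and output format are assumptions, only the file name comes from the slide). The Cray cc compiler wrapper normally links the MPI library automatically.

    /* mpicode.c - hypothetical minimal MPI example */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);                  /* start up MPI */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this task's ID (0..size-1) */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total tasks = aprun -n = mppwidth */

        printf("Hello from MPI task %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }

  With the script above, aprun -n12 launches 12 copies of this program, one MPI task per core.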

  12. Sample Batch Job Script
  • For OpenMP code:
    ALPS places OpenMP threads sequentially on cores within a compute node.
    mppdepth = OMP_NUM_THREADS = -d <= 24
    If -d exceeds 24, you get the error message "apsched: -d value cannot exceed largest node size".

    #PBS -N openmpcode
    #PBS -j oe
    #PBS -l mppdepth=6
    #PBS -l walltime=00:10:00
    #PBS -q small
    # mppdepth = OMP_NUM_THREADS = -d <= 24 (number of cores)

    cd $PBS_O_WORKDIR
    export OMP_NUM_THREADS=6
    cc -o openmpcode openmpcode.c
    aprun -d6 ./openmpcode
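  As above, openmpcode.c is not part of the deck; a minimal hypothetical sketch follows. Depending on which compiler module is loaded, an OpenMP compile flag may also be required; the slide does not cover that detail.

    /* openmpcode.c - hypothetical minimal OpenMP example */
    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        /* OMP_NUM_THREADS (set in the batch script) controls the team size */
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();        /* this thread's ID */
            int nthreads = omp_get_num_threads();  /* total threads = -d = mppdepth */
            printf("Hello from OpenMP thread %d of %d\n", tid, nthreads);
        }
        return 0;
    }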

  13. Sample Batch Job Script
  • For hybrid MPI / OpenMP code:
    ALPS places MPI tasks sequentially on cores within a compute node and launches OpenMP threads per MPI task.
    By default, ALPS places one OpenMP thread per MPI task; use mppdepth = OMP_NUM_THREADS = -d to change the number of threads per task.

    #PBS -N hybrid
    #PBS -j oe
    #PBS -l mppwidth=6
    #PBS -l mppdepth=2
    #PBS -l walltime=00:10:00
    #PBS -q small
    # mppwidth = -n = number of MPI tasks
    # mppdepth = OMP_NUM_THREADS = -d <= 24 (number of OpenMP threads per MPI task)

    cd $PBS_O_WORKDIR
    export OMP_NUM_THREADS=2
    cc -o hybridcode hybridcode.c
    aprun -n6 -d2 ./hybridcode
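  hybridcode.c is likewise not shown; the sketch below is an assumed minimal hybrid program matching the script above (6 MPI tasks with 2 OpenMP threads each, 12 cores in total).

    /* hybridcode.c - hypothetical minimal hybrid MPI/OpenMP example */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);                  /* one MPI task per aprun -n */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* each MPI task spawns OMP_NUM_THREADS (= aprun -d = mppdepth) threads */
        #pragma omp parallel
        {
            printf("MPI task %d of %d, OpenMP thread %d of %d\n",
                   rank, size, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }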
