
Lattice QCD Clusters


Presentation Transcript


  1. Lattice QCD Clusters Amitoj Singh Fermi National Accelerator Laboratory

  2. Introduction • The LQCD Clusters • Cluster monitoring and response • Cluster jobs: types; submission, scheduling and allocation; execution • Wish List • Questions and Answers

  3. The LQCD Clusters

  4. pion and qcd clusters [photos: pion cluster front, pion cluster back, qcd cluster back]

  5. kaon cluster [photos: kaon cluster front, kaon cluster back, kaon head-nodes & Infiniband spine]

  6. Cluster monitoring
  • Worker node — nannies monitor critical components/processes such as:
  • health (CPU/system temperature, CPU/system fan speeds)
  • batch queue client (PBS mom) *
  • disk space
  • NFS mount points
  • high-speed interconnects
  For the item marked *, a corrective action is defined; for all others, the nannies report any anomalies by email. A corrective action must be well defined, with sufficient decision paths to fully automate the error diagnosis and recovery process. Users are sophisticated enough to report any performance-related issues.
  • Head-node — a nanny monitors critical processes such as:
  • mrtg graph-plotting scripts *
  • automated scripts that generate the cluster status pages *
  • batch queue server (PBS server)
  • NFS server *
  Except for the items marked *, the nanny restarts processes that have exited abnormally. All unhealthy nodes are shown blinking on the cluster status pages; cluster administrators can then analyze the mrtg plots to isolate the problem.
  • Network fabric — for the high-speed network interconnects:
  • Nannies monitor and plot the health of critical components (switch-blade temperature, chassis fan speeds) on the 128-port Myrinet spine switch. No automated corrective action has been defined for any anomalies that may occur.
  • Cluster administrators can run Infiniband cluster-administration tools to locate bad Infiniband cables, failing spine or leaf switch ports, and failing Infiniband HCAs. The Infiniband hardware has been reliable.
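The worker-node monitoring described above can be sketched as a single check cycle: most anomalies are only reported, while the PBS mom check has a defined corrective action. This is a minimal illustration; the thresholds, reading names, and `check_node` helper are assumptions, not Fermilab's actual nanny configuration.

```python
# Hypothetical sketch of one worker-node "nanny" check cycle.
# Thresholds and health-reading names are illustrative assumptions.

CPU_TEMP_LIMIT_C = 70      # assumed alarm threshold
MIN_FREE_DISK_GB = 10      # assumed scratch-space floor

def check_node(health):
    """Return (anomalies_to_email, corrective_actions).

    `health` is a dict of readings, e.g.
    {"cpu_temp_c": 65, "free_disk_gb": 42, "pbs_mom_running": True}.
    Only the PBS mom check has a defined corrective action (restart);
    every other anomaly is merely reported by email.
    """
    anomalies, actions = [], []
    if health.get("cpu_temp_c", 0) > CPU_TEMP_LIMIT_C:
        anomalies.append("cpu temperature high")
    if health.get("free_disk_gb", 0) < MIN_FREE_DISK_GB:
        anomalies.append("scratch disk nearly full")
    if not health.get("nfs_mounted", True):
        anomalies.append("NFS mount point missing")
    if not health.get("pbs_mom_running", True):
        actions.append("restart pbs_mom")  # the one well-defined recovery path
    return anomalies, actions
```

A real nanny would loop over such a check on a timer, mail the anomaly list, and execute the corrective actions.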

  7. Cluster job types • A large fraction of the jobs run on the LQCD clusters are limited by: • memory bandwidth • network bandwidth [charts: memory-bandwidth-bound and network-bandwidth-bound performance]
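Whether a job is memory-bandwidth bound can be framed as a roofline-style estimate: achievable throughput is the smaller of the compute peak and bandwidth times arithmetic intensity. The numbers in the sketch below are illustrative assumptions, not measurements from the LQCD clusters, and `bound_by` is a hypothetical helper.

```python
# Back-of-the-envelope roofline check: is a kernel limited by
# memory bandwidth or by compute? All figures are assumed inputs.

def bound_by(flops_per_byte, peak_gflops, mem_bw_gbs):
    """Achievable GFLOP/s is the lesser of the compute peak and
    memory bandwidth (GB/s) times arithmetic intensity (flop/byte)."""
    achievable = min(peak_gflops, mem_bw_gbs * flops_per_byte)
    limit = "memory-bandwidth" if achievable < peak_gflops else "compute"
    return achievable, limit
```

The same framing applies to network bandwidth: replace memory bandwidth with per-node interconnect bandwidth and intensity with flops per byte communicated.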

  8. Cluster job execution • Open PBS (Torque) and the Maui scheduler schedule jobs using a "FIFO" algorithm, as follows: • Jobs are queued in order of submission. • Maui runs the highest (oldest) jobs in the queue in order, except that it will not start a job if any of the following hold: (a) the job would put the number of running jobs by a particular user over the limit; (b) the job would put the total number of nodes used by a particular user over the limit; (c) the job specifies resources that cannot currently be fulfilled (e.g. a specific set of nodes requested by the user). • If a job is blocked by any of the above, Maui runs the next eligible job. • Under certain conditions, Maui may run the next eligible job even when only limit (c) holds. This is called backfilling. Maui looks at the state of the queue and the running jobs and, based on the requested and used wall-clock times, predicts when the job blocked by (c) will be able to run. If job(s) lower in the queue can run without delaying the start time of the blocked job, Maui runs that job (those jobs). • Once a job is ready to run, a set of nodes is allocated to it exclusively for the requested wall time. Almost all jobs run on the LQCD clusters are MPI jobs; users can refer explicitly to the PBS_NODEFILE environment variable, or it is consumed by the mpirun launch script.
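The backfill rule above can be sketched in a few lines: a job lower in the queue may start only if it fits in the currently free nodes and its requested wall time ends before the predicted start of the blocked head job. This is a simplified stand-in for Maui's policy engine; the queue representation and `pick_jobs` helper are assumptions.

```python
# Minimal sketch of FIFO-with-backfill. `queue` is a FIFO-ordered
# list of dicts with 'name', 'nodes', and 'walltime'; the head job
# is assumed blocked by unavailable resources (limit (c)).

def pick_jobs(queue, free_nodes, blocked_start_time, now=0):
    """Backfill: start lower-queued jobs only when doing so cannot
    delay the predicted start time of the blocked head job."""
    started = []
    for job in queue[1:]:                       # skip the blocked head job
        fits = job["nodes"] <= free_nodes
        done_in_time = now + job["walltime"] <= blocked_start_time
        if fits and done_in_time:               # safe to backfill
            started.append(job["name"])
            free_nodes -= job["nodes"]          # nodes now held by this job
    return started
```

In this sketch a 20-hour job is refused even though nodes are free, because it would still be running when the blocked head job is predicted to start.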

  9. Cluster job execution (cont’d) • Typical user jobs are 8, 16 or 32 nodes and run for a maximum wall time of 24 hours. • A user nanny job running on the head-node executes job streams. Each job stream is a PBS job which: • on the job head-node (MPI node 0), copies a lattice (problem) stored in dCache to the local scratch disk; • divides the lattice among the nodes and copies the sub-lattices to each node's local scratch disk; • launches an MPI process on each node, which computes its sub-lattice; • via the main process (MPI process 0), gathers the results from each node onto the job head-node (MPI node 0) and copies the output into dCache; • marks checkpoints at regular intervals for error recovery. • The output of one job stream is the input lattice for the next job stream. • If a job stream fails, the nanny job restarts the stream from the most recent saved checkpoint.
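The nanny-driven job-stream loop above can be sketched as: run each stream, feed its output lattice to the next stream, and on failure retry from the checkpointed input. The `run_stream` callable and the lattice file names are hypothetical stand-ins for the actual PBS job and dCache paths.

```python
# Hedged sketch of the head-node nanny's job-stream loop.
# run_stream(input_lattice) models one PBS job stream: it returns
# the output lattice name, or raises RuntimeError on failure.

def drive_streams(run_stream, n_streams, initial="lattice_000"):
    """Chain job streams: each stream's output is the next stream's
    input; a failed stream is restarted from the same checkpointed
    input lattice."""
    lattice = initial                  # hypothetical initial input in dCache
    for _ in range(n_streams):
        while True:
            try:
                lattice = run_stream(lattice)  # one complete job stream
                break                          # success: feed the next stream
            except RuntimeError:
                pass                           # restart from the checkpoint
    return lattice
```

A production nanny would also resubmit the PBS job, back off between retries, and cap the retry count rather than looping forever.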

  10. Wish List • A link between the monitoring process and the scheduler is missing; the scheduler could do better by being node- and network-aware. • The ability to monitor factors that are critical to application performance (e.g. thermal instabilities can throttle CPU speed, which ultimately affects performance). • Very few automated corrective actions are defined for the components and processes that are currently being monitored. • Using current health data, the ability to predict node failures rather than just updating mrtg plots.
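The last wish-list item — predicting failures from health data instead of only plotting it — could start as simply as extrapolating a trend in recent readings. The linear one-step trend, the threshold, and the `at_risk` helper below are purely illustrative assumptions.

```python
# Illustrative sketch: flag a node as at-risk from a temperature
# trend rather than waiting for the limit to be crossed.
# Extrapolation method and thresholds are assumptions.

def at_risk(temps, limit_c=70, horizon=3):
    """Given recent temperature samples (one per monitoring
    interval), linearly extrapolate `horizon` intervals ahead and
    flag the node if the projected value exceeds the limit."""
    if len(temps) < 2:
        return False                    # not enough history for a trend
    slope = temps[-1] - temps[-2]       # simple one-step trend
    projected = temps[-1] + slope * horizon
    return projected > limit_c
```

A real predictor would fit longer histories and combine several health signals (fan speeds, ECC counts), but even this form turns the mrtg data into an actionable early warning.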
