Challenges in Executing Large Parameter Sweep Studies across Widely Distributed Computing Environments




Presentation Transcript


  1. Challenges in Executing Large Parameter Sweep Studies across Widely Distributed Computing Environments Edward Walker Chona S. Guiang

  2. Outline
  • Two applications
    • Zeolite structure search
    • Binding energy calculations
  • Solutions
    • Workflow
    • Submission system
    • Exported file system
  • Resources aggregated
  • Example week of work

  3. What are Zeolites?
  • Crystalline micro-porous material
  • Structures exhibit regular arrays of channels from 0.3 to 1.5 nm
  • When the channels are filled with water (or another substance), they make excellent molecular sieves for industrial processes and commercial products, e.g. the deodorant in cat litter.
  • The acid form also has useful catalytic properties, e.g. ZSM-5 is used as a co-catalyst in crude oil refinement.
  • The basic building block is a TO4 tetrahedron
    • T = Si, Al, P, etc.
  • Prior to this study, only 180 structures were known

  4. Scientific goals
  • Goal 1: Discover as many thermodynamically feasible zeolite structures as possible.
  • Goal 2: Populate a public database of these new structures for materials scientists to synthesize and experiment with.

  5. Computational methodology
  • General strategy: Create a potential cell structure and solve its energy function
  • Approach:
    • Group potential cell structures with a similar template structure into space groups (230 groups in total)
    • Each cell structure in a space group is further characterized by the space variables (a, b, c, α, β, γ)
    • Solve the multi-variable energy function for each cell structure using simulated annealing
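The step above, minimizing a multi-variable energy function by simulated annealing, can be sketched as follows. This is a generic annealer over a toy quadratic "energy"; the real zeolite force field, move sizes, and cooling schedule are not given in the slides, so everything numeric here is illustrative.

```python
import math
import random

def simulated_annealing(energy, x0, steps=5000, t0=1.0, cooling=0.999, seed=0):
    """Minimize a multi-variable energy function by simulated annealing.

    `energy` maps a parameter vector to a scalar; `x0` is the starting
    cell-parameter vector (e.g. a, b, c, alpha, beta, gamma).
    """
    rng = random.Random(seed)
    x, e = list(x0), energy(x0)
    best_x, best_e = list(x), e
    t = t0
    for _ in range(steps):
        # Perturb one randomly chosen variable (illustrative move size).
        cand = list(x)
        i = rng.randrange(len(cand))
        cand[i] += rng.gauss(0, 0.1)
        e_cand = energy(cand)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if e_cand < e or rng.random() < math.exp((e - e_cand) / t):
            x, e = cand, e_cand
            if e < best_e:
                best_x, best_e = list(x), e
        t *= cooling  # geometric cooling schedule
    return best_x, best_e

# Toy "energy" with its minimum at a reference set of cell parameters.
ref = [10.0, 10.0, 10.0, 90.0, 90.0, 90.0]
toy_energy = lambda p: sum((pi - ri) ** 2 for pi, ri in zip(p, ref))
x, e = simulated_annealing(toy_energy, [9.0, 11.0, 10.5, 85.0, 95.0, 92.0])
```

Because each candidate structure is annealed independently, this is exactly the kind of serial, per-structure task that the later slides distribute across many machines.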

  6. Ligand binding energy calculations
  • Binding energy is a quantitative measure of a ligand's affinity for its receptor
  • Important in docking of a ligand to a protein
  • Ligand binding energies can be used as a basis for scoring ligand-receptor interactions (used in structure-based drug design)

  7. Scientific goals
  • Calculate binding energies between trypsin and benzamidine at different values of the force-field parameters
  • Compare the calculated binding energies with experimental values
  • Validate force-field parameters based on the comparison
  • Apply the approach to different ligand-receptor systems

  8. Computational methodology
  Binding energy is calculated from:
  • Molecular dynamics (MD) simulations of the ligand "disappearing" in water
  • MD simulations of ligand extinction in the solvated ligand-protein complex
  • MD calculations were performed with Amber
  • Extinction is parameterized by the coupling parameter λ
  • Each job is characterized by a different λ and different force-field parameters
  (Diagram: thermodynamic cycle for the binding reaction E(aq) + S(aq) → E-S(aq), coupling the solvated species to their "disappeared" states)
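The slides describe running one MD job per value of the coupling parameter λ; a standard way to combine the per-λ results into a free energy is thermodynamic integration, ΔG = ∫₀¹ ⟨∂U/∂λ⟩ dλ (the slides do not state which estimator was actually used, and the λ grid and ensemble averages below are made-up numbers, not Amber output):

```python
def free_energy_ti(lambdas, dudl_means):
    """Thermodynamic integration by the trapezoidal rule:
    dG = integral over [0, 1] of <dU/dlambda> with respect to lambda.

    `lambdas` are the coupling-parameter windows (sorted, in [0, 1]);
    `dudl_means` are the per-window ensemble averages of dU/dlambda.
    """
    dg = 0.0
    for i in range(len(lambdas) - 1):
        width = lambdas[i + 1] - lambdas[i]
        dg += 0.5 * (dudl_means[i] + dudl_means[i + 1]) * width
    return dg

# Hypothetical per-window averages; in practice each value comes from
# one long, independent Amber MD job (one job per lambda).
lambdas = [0.0, 0.25, 0.5, 0.75, 1.0]
dudl = [12.0, 8.0, 5.0, 3.0, 2.0]
dG = free_energy_ti(lambdas, dudl)
```

The key operational point is the one the slides make: the per-λ simulations are independent of each other, so they parallelize trivially across machines even though each one is serial.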

  9. Computational usage: zeolite search
  • Ran on TeraGrid
  • Allocated over two million service units
    • One million to the Purdue Condor pool
    • One million to all other HPC resources on TeraGrid

  10. Computational usage: ligand binding energy calculation
  • Running on a departmental cluster, the TACC Condor cluster, and Lonestar
  • Each 2.5 ns simulation takes more than two weeks
  • Will require additional CPU time

  11. Challenge 1: Hundreds of thousands of simulations need to be run
  • The energy function for every potential cell structure needs to be solved.
  • A feasible solution of the energy function indicates a feasible structure.
  • Many sites limit the number of jobs that can be submitted to a local queue.

  12. Challenge 2: Each simulation task is intrinsically serial
  • The simulated annealing method is intrinsically serial.
  • Each MD simulation (a function of λ and the force-field parameters) is serial and independent.
  • Many TeraGrid sites prioritize parallel jobs.
  • There are limited slots for serial jobs.

  13. Challenge 3: Wide variability in execution times
  • Zeolite search
    • The pseudo-random solution method iterates over 100 seeds.
    • Potential run times range from 10 minutes to 10 hours.
    • Some computations may never complete.
    • It is inefficient to request a CPU for 10 hours when the computation may not need it.
    • The computation is therefore re-factored into tasks of at most 2 hours.
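Re-factoring an open-ended computation into bounded 2-hour tasks generally relies on checkpointing: each batch job advances the computation a bounded amount, saves its state, and a successor job resumes from that state. A minimal sketch, where the JSON state file, the `step` function, and the chunk size are all hypothetical stand-ins for the real annealing code:

```python
import json
import os
import tempfile

def run_chunk(state_path, work_fn, max_iters_per_chunk):
    """Run one bounded slice of a long computation.

    Loads the checkpoint if one exists, advances at most
    `max_iters_per_chunk` iterations, and writes the state back so the
    next (<= 2-hour) batch job can resume where this one stopped.
    `work_fn` advances one iteration and returns (new_state, done).
    """
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)
    else:
        state = {"iter": 0}  # fresh start
    for _ in range(max_iters_per_chunk):
        state, done = work_fn(state)
        if done:
            break
    with open(state_path, "w") as f:
        json.dump(state, f)  # checkpoint for the successor job
    return state

# Demo: a toy "computation" that finishes after 10 iterations,
# executed as three successive bounded batch jobs of 4 iterations each.
def step(state):
    state["iter"] += 1
    return state, state["iter"] >= 10

path = os.path.join(tempfile.gettempdir(), "sweep_state.json")
if os.path.exists(path):
    os.remove(path)
final = None
for _ in range(3):  # three successive "batch jobs"
    final = run_chunk(path, step, 4)
```

In a workflow system, "three successive batch jobs" becomes a chain of dependent tasks, which is exactly how the Level 3 sub-task chain on the workflow slide is organized.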

  14. Challenge 3: Wide variability in execution times
  • Ligand binding energy calculation
    • Each MD simulation calculates dynamics to 2.5 ns.
    • Each 2.5 ns of simulation time takes more than two weeks.
    • Convergence is not assured after 2.5 ns.

  15. Workflow: zeolite search
  • Level 1 is an ensemble of workflows, each evaluating a space group
    • 230 space groups evaluated
  • Level 2 evaluates a candidate structure
    • 6,000 to 30,000 structures per space group
    • A main task generates a solution
    • A post-processing task checks the sanity of the result
    • Retries up to 5 times if the results are wrong
  • Level 3 solves the energy function for a candidate structure
    • Chain of 5 sub-tasks
    • Each sub-task computes over 20 seeds, consuming at most 2 hours of compute time
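The three-level structure above maps naturally onto Condor DAGMan (the submission system described later in the deck). The sketch below emits a DAG description for one space group: a generator node whose result is sanity-checked by a POST script and retried up to 5 times, followed by a chain of 5 annealing sub-tasks. The submit-file and script names (generate.sub, check_result.sh, anneal.sub) are illustrative placeholders, not the authors' actual files.

```python
def zeolite_dag(space_group, n_structures):
    """Build Condor DAGMan text for one space group.

    Per candidate structure: a generator job with a POST-script sanity
    check (DAGMan RETRY re-runs the node up to 5 times on failure),
    then a chain of 5 bounded annealing sub-tasks.
    """
    lines = []
    for s in range(n_structures):
        gen = f"gen_g{space_group}_s{s}"
        lines.append(f"JOB {gen} generate.sub")
        lines.append(f"SCRIPT POST {gen} check_result.sh")  # sanity-check the solution
        lines.append(f"RETRY {gen} 5")                      # retry if the check fails
        prev = gen
        for k in range(5):  # chain of 5 sub-tasks, each <= 2 hours
            sub = f"anneal_g{space_group}_s{s}_t{k}"
            lines.append(f"JOB {sub} anneal.sub")
            lines.append(f"PARENT {prev} CHILD {sub}")
            prev = sub
    return "\n".join(lines) + "\n"

# Two candidate structures from a (hypothetical) space group 1.
dag_text = zeolite_dag(space_group=1, n_structures=2)
```

Generating the DAG programmatically like this is what makes a sweep of 230 space groups times thousands of structures manageable: one script emits the whole ensemble, and DAGMan handles ordering and retries.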

  16. Workflow: ligand binding energy calculations
  • The Condor cluster has no maximum run time limit.
  • Lonestar has a 24-hour run time limit.
  • MD jobs therefore need to be restarted.
  • Workflow jobs need to be submitted to Lonestar.

  17. Challenge 4: Application is dynamically linked
  • Amber was built with the Intel shared libraries.
  • These libraries may not be installed on the backend nodes.
  • The shared libraries can be copied to the backend, but this wastes space ($HOME on some systems is limited).

  18. Challenge 5: Output file needs to be monitored
  • Some MD simulations do not converge.
  • Non-convergence can often be detected by 2 ns.
  • Jobs that have not converged by 2 ns are terminated.
  • No global file system exists on some systems.
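The early-termination decision reduces to a predicate over values parsed from the MD output file. The 2 ns threshold comes from the slide; the flatness criterion, window, and tolerance below are purely illustrative, since the slides do not say how convergence was judged:

```python
def should_terminate(sim_time_ns, series, window=50, tol=0.5):
    """Return True if a running MD job should be killed early.

    Hypothetical criterion: once the trajectory has passed 2 ns, give up
    if the monitored quantity still spans more than `tol` over the last
    `window` samples (i.e. it is not settling toward convergence).
    """
    if sim_time_ns < 2.0:
        return False  # too early to judge; keep running
    recent = series[-window:]
    return (max(recent) - min(recent)) > tol

# A monitoring process would periodically re-parse the output file and
# call should_terminate(), issuing condor_rm (or the local scheduler's
# equivalent) when it returns True.
flat = [100.0 + 0.01 * (i % 3) for i in range(200)]   # settled trajectory
noisy = [100.0 + (i % 17) for i in range(200)]        # still fluctuating
```

Note the dependence on the "exported file system" item in the outline: on systems with no global file system, the monitor can only see the output file if it is exported to (or streamed from) the compute node.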

  19. Submission system
  • Want to run many simple jobs/workflows of serial tasks
    • Condor DAGMan is an excellent tool for this
    • But it requires a Condor pool
  • How to form a Condor pool from HPC systems?
    • Form a virtual cluster managed by Condor using MyCluster
    • Submit jobs/workflows to this virtual cluster

  20. MyCluster overview
  • Creates a personal virtual cluster for a user
    • From one system or from pieces of different systems
  • Schedules user jobs onto this cluster
    • User can pick one of several workload managers: Condor, SGE, OpenPBS
    • Condor is the one currently supported on TeraGrid
  • User submits all their jobs to this workload manager
  • Deployed on TeraGrid
    • http://www.teragrid.org/userinfo/jobs/mycluster.php

  21. Starting MyCluster
  • Log in to a system with MyCluster installed
    • The majority of TeraGrid systems
    • Can also be installed on other systems
  • Execute vo-login to start a session
    • You are now in a MyCluster shell
  (Diagram, step 1: a MyCluster shell is created from the user's workstation)

  22. Configuring MyCluster
  • The personal cluster is defined in a user-specified configuration file
    • Identifies which clusters can be part of the personal cluster
    • Specifies limits on the portion of those clusters to use
  • The personal workload manager is started
    • Condor in this case (condor_schedd, condor_collector, condor_negotiator)
  (Diagram, step 2: MyCluster is configured to span an LSF cluster and a PBS cluster)

  23. Submitting Work to MyCluster
  • Jobs are submitted to the personal workload manager
    • For workflows, DAGMan jobs are submitted that in turn submit the individual Condor jobs
    • DAGMan is configured to submit at most 380 jobs at a time
  • The personal workload manager manages jobs as on any other cluster
  (Diagram, step 3: the user submits DAGMan jobs to the configured MyCluster)

  24. MyCluster Resource Management
  • MyCluster submits parallel jobs to the clusters
  • These jobs start personal workload manager daemons
    • condor_startd in this case
  • The daemons contact the personal workload manager to report that they have resources available
  • MyCluster grows and shrinks the size of its virtual cluster
    • Based on the number of jobs it is managing
  • The file system on the workstation may be mounted on the backend
  (Diagram, steps 4-5: MyCluster submits and manages the workload manager daemons, and uses XUFS to mount the workstation file system on the remote resources)

  25. Example MyCluster login session
  % vo-login
  Enter GRID passphrase:                ← GRAM or SSH login
  Spawning on lonestar.tacc.utexas.edu
  Spawning on tg-login2.ncsa.teragrid.org
  Setting up VO participants ......Done
  Welcome to your MyCluster/Condor environment
  To shutdown environment, type "gexit"
  To detach from environment, type "detach"
  mycluster(gtcsh.9676)% condor_status
  Name          OpSys  Arch  State     Activity LoadAv Mem  ActvtyTime
  32020@compute LINUX  INTEL Unclaimed Idle     0.000  2026 [?????]
  ...
  32021@tg-c383 LINUX  IA64  Unclaimed Idle     0.000  2026 [?????]
              Machines Owner Claimed Unclaimed Matched Preempting
  INTEL/LINUX        2     0       0         2       0          0
  IA64/LINUX         2     0       0         2       0          0
  Total              4     0       0         4       0          0

  26. Systems aggregated with MyCluster

  27. Expanding and shrinking Condor cluster created with MyCluster (1 week period)

  28. Running and pending jobs in a personal cluster using MyCluster (1 week period)

  29. Project Conclusion
  • The allocation was completely consumed in Jan 2007.
  • Over 3 million new structures have been found.
    • http://www.hypotheticalzeolites.net/DATABASE/DEEM/index.php
  • Ligand binding energy calculations are deployed on rodeo and Lonestar
    • Will be deployed on other TeraGrid systems
    • Still ongoing...

  30. References
  • J. R. Boisseau, M. Dahan, E. Roberts, and E. Walker, "TeraGrid User Portal Ensemble Manager: Automatically Provisioning Parameter Sweeps in a Web Browser"
  • E. Walker, D. J. Earl, and M. W. Deem, "How to Run a Million Jobs in Six Months on the NSF TeraGrid"
    • http://www.usenix.org/events/worlds06/tech/prelim_papers/walker/walker.pdf
  • http://www.tacc.utexas.edu/services/userguides/mycluster/
  • Please contact ewalker@tacc.utexas.edu
