
Grid tool integration within the eMinerals project


Presentation Transcript


  1. Grid tool integration within the eMinerals project Mark Calleja

  2. Background
  • Model the atomistic processes involved in environmental issues (radioactive waste disposal, pollution, weathering).
  • Use many different codes/methods: large-scale molecular dynamics, lattice dynamics, ab initio DFT, quantum Monte Carlo.
  • Jobs can last from minutes to weeks and require a few MB to a few GB of memory.
  • The project has 12 postdocs (4 scientists, 4 code developers, 4 grid engineers).
  • Spread over a number of sites: Bath, Cambridge, Daresbury, Reading, RI and UCL.

  3. Minigrid Resources
  • Two Condor pools: a large one at UCL (~930 Windows boxes) and a small one at Cambridge (~25 heterogeneous nodes).
  • Three Linux clusters, each with one master + 16 nodes running under PBS queues.
  • An IBM pSeries platform with 24 processors under LoadLeveler.
  • A number of Storage Resource Broker (SRB) instances, providing ~3 TB of distributed, transparent storage.
  • An application server, including the SRB Metadata Catalogue (MCAT), and a database cluster at Daresbury.
  • All accessed via Globus (see the sketch below).
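
For illustration, the kind of Globus client commands used to reach these resources might look like the following sketch; the gatekeeper hostname is illustrative, not a real eMinerals machine.

  # Create a proxy from your certificate, then run a trivial test job
  # through a gatekeeper's fork jobmanager (hostname illustrative).
  grid-proxy-init
  globus-job-run gatekeeper.example.ac.uk/jobmanager-fork /bin/hostname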

  4. Condor
  • Developed by the Department of Computer Sciences, University of Wisconsin-Madison.
  • Allows the formation of pools from a heterogeneous mix of architectures and operating systems.
  • Excellent for utilising idle CPU cycles.
  • Highly configurable, allowing very flexible pool policies (when to run Condor jobs, with what priority, for how long, etc.).
  • All our department desktops are in our pool.
  • Provides Condor-G, a client tool for submitting jobs to Globus gatekeepers.
  • Also provides DAGMan, a meta-scheduler for Condor which allows workflows to be built (a minimal example follows below).
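
As a minimal sketch of what a DAGMan workflow looks like (node and file names are illustrative, not taken from the slides), a two-node DAG that stages data before running a computation could be written as:

  # workflow.dag -- stage data first, then compute (names illustrative)
  JOB  stage    stage.submit
  JOB  compute  compute.submit
  PARENT stage CHILD compute

It would be submitted locally with condor_submit_dag workflow.dag, after which DAGMan runs each node's submit file in dependency order.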

  5. Storage Resource Broker (SRB)
  • Developed at the San Diego Supercomputer Center.
  • SRB is client-server middleware that provides a uniform interface for connecting to heterogeneous data resources over a network and accessing replicated data sets.
  • In conjunction with the Metadata Catalogue (MCAT), it provides a way to access data sets and resources based on their attributes and/or logical names rather than their physical names and locations.
  • Provides a number of user interfaces: command line (useful for scripting), Jargon (Java toolkit), inQ (Windows GUI) and MySRB (web browser).
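
A typical command-line session with the S-commands might look like the sketch below; the collection and file names are illustrative only.

  Sinit              # authenticate to the SRB/MCAT
  Smkdir test        # create a collection (name illustrative)
  Scd test           # move into it
  Sput A             # upload local input file A into the current collection
  Sput B             # upload local input file B
  Sls                # list the collection contents
  Sget res           # later: retrieve an output file
  Sexit              # end the session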

  6. Typical work process
  • Start by uploading the input data into the SRB using one of the three client tools:
    a) S-commands (command-line tools)
    b) inQ (Windows)
    c) MySRB (web browser)
    Data in the SRB can be annotated using the Metadata Editor and then searched using the CCLRC DataPortal. This is especially useful for the output data.
  • Construct the relevant Condor/DAGMan submit script/workflow.
  • Launch onto the minigrid using the Condor-G client tools (an end-to-end sketch follows below).
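
Put together, the day-to-day sequence from a user's desktop might look like this sketch; the file names are illustrative, and the submit script is of the kind shown on the my_condor_submit slide.

  Sput input.dat              # 1. stage the input data into the SRB
  condor_submit job.submit    # 2. launch the Condor-G job/workflow onto the minigrid
  condor_q                    # 3. monitor progress from the local submit machine
  Sget results.dat            # 4. retrieve the output once it has been ingested into the SRB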

  7. Job workflow
  • On the remote gatekeeper, run a jobmanager-fork job to create a temporary directory and extract the input files from the SRB.
  • Submit the next node in the workflow to the relevant jobmanager, e.g. PBS or Condor, to actually perform the required computational job.
  • On completion of the job, run another jobmanager-fork job on the relevant gatekeeper to ingest the output data into the SRB and clean up the temporary working area.
  [Slide diagram: the gatekeeper's Condor and PBS jobmanagers sit between the SRB and the Condor pool/PBS cluster.]
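
Expressed as a DAGMan workflow, this three-stage pattern could be sketched as follows (the submit-file names are illustrative):

  # stage-in -> compute -> stage-out (names illustrative)
  JOB  stagein   srb_extract.submit   # jobmanager-fork: make temp dir, Sget inputs
  JOB  compute   run_code.submit      # jobmanager-pbs or jobmanager-condor
  JOB  stageout  srb_ingest.submit    # jobmanager-fork: Sput outputs, clean up
  PARENT stagein CHILD compute
  PARENT compute CHILD stageout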

  8. my_condor_submit
  • Together with SDSC, we approached Wisconsin about absorbing this SRB functionality into condor_submit.
  • However, Wisconsin would seem to prefer to use Stork, their grid data-placement scheduler (currently a stand-alone beta version).
  • In the meantime, we have provided our own wrapper for these workflows, called my_condor_submit.
  • This takes as its argument an ordinary Condor, or Condor-G, submit script, but also recognises some SRB-specific extensions.
  • Limitations: the SRB extensions currently can't make use of Condor macros, e.g. job.$$(OpSys).
  • We are also currently developing a job-submission portal (see the related talk at the AHM).

  9. my_condor_submit
  # Example submit script for a remote Condor pool
  Universe        = globus
  Globusscheduler = lake.esc.cam.ac.uk/jobmanager-condor-INTEL-LINUX
  Executable      = add.pl
  Notification    = NEVER
  GlobusRSL       = (condorsubmit=(transfer_files ALWAYS)(universe vanilla)(transfer_input_files A, B))(arguments=A B res)
  Sget            = A, B        # Or just "Sget = *"
  Sput            = res
  Sdir            = test
  Sforce          = true
  Output          = job.out
  Log             = job.log
  Error           = job.error
  Queue

  # To turn this into a PBS job, replace with:
  #
  # Globusscheduler = lake.esc.cam.ac.uk/jobmanager-pbs
  # GlobusRSL       = (arguments=A B res)(job_type=single)
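
Assuming the wrapper is invoked like condor_submit (the previous slide says it takes the submit script as its argument), usage would look something like:

  my_condor_submit job.submit   # SRB extensions (Sget/Sput/Sdir/Sforce) are expanded
                                # into the stage-in / compute / stage-out workflow
  condor_q                      # the resulting Condor-G jobs can be monitored as usual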

  10. Experience of setting up a minigrid
  • Providing desktop client machines has not been seamless: it requires major firewall reconfiguration, and the tools have not always been simple to install, though this has improved lately.
  • Mixed reception by users, especially when we started: "why should I have to hack your scripts to run one simple job?".
  • Jobmanager modules have needed tweaking, e.g. to allow users to nominate a particular flavour of MPI.
  • Load balancing across the minigrid is still an issue, though we have provided some simple tools to monitor the state of the queues.
  • Similarly, job monitoring is an issue: "what's the state of health of my week-old simulation?". Again, some tools have been provided.

  11. Configuration and administration
  • Job submission to the clusters is only via Globus...
  • ...except for one cluster, which allows gsissh access and direct job submission to facilitate code development and porting (see the sketch below).
  • To share out trouble-shooting work fairly we introduced a ticket-based helpdesk system, OTRS. Users can email problems to helpdesk@eminerals.org.
  • eMinerals members use UK eScience certificates, but for foreign colleagues we have set up our own CA to enable access to our minigrid.
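
For a developer, that gsissh access might look like the following sketch; the hostname and port are illustrative, and authentication uses the same grid proxy as the other Globus tools.

  grid-proxy-init                                # proxy from a UK eScience or project-CA certificate
  gsissh -p 2222 devel.esc.example.ac.uk         # interactive login to the development cluster (host/port illustrative)
  gsiscp code.tar.gz devel.esc.example.ac.uk:    # copy source over for porting work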

  12. Summary and Outlook
  • The eMinerals minigrid is now in full service.
  • It can meet the vast majority of our requirements (except perhaps for very low-latency MPI jobs, e.g. those needing Myrinet).
  • The integration of the SRB with the job-execution components of the minigrid has provided the most obvious added value to the project.
  • The next major step will be when the ComputePortal goes online.
  • We would also like to make job submission to non-minigrid facilities transparent, e.g. the NGS clusters.
  • We are keeping an eye on WSRF developments and GT4.
  • We intend to migrate to SRB v3 "soon".
