1 / 18

A SUSTAINABLE NATIONAL GATEWAY FOR BIOLOGICAL COMPUTATION

A SUSTAINABLE NATIONAL GATEWAY FOR BIOLOGICAL COMPUTATION. XSEDE 13 July 24, 2013. Galaxy Team: james.taylor@emory.edu anton@bx.psu.edu nate@bx.psu.edu PSC Team: blood@psc.edu ropelews@psc.edu josephin@psc.edu yanovich@psc.edu rbudden@psc.edu zhihui@psc.edu sergiu@psc.edu. Overview.

onofre
Télécharger la présentation

A SUSTAINABLE NATIONAL GATEWAY FOR BIOLOGICAL COMPUTATION

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A SUSTAINABLE NATIONAL GATEWAY FOR BIOLOGICAL COMPUTATION XSEDE 13 July 24, 2013

  2. Galaxy Team: james.taylor@emory.edu anton@bx.psu.edu nate@bx.psu.edu PSC Team:blood@psc.edu ropelews@psc.edu josephin@psc.edu yanovich@psc.edu rbudden@psc.edu zhihui@psc.edu sergiu@psc.edu

  3. Overview • Galaxy: The Need, the Framework, the Popularity Bottleneck • Distributing Galaxy work and data flows to XSEDE systems: first steps • The Future: Galaxy Gateway(s)

  4. 643 HiSeqs = 6.5 Pb/year

  5. The Challenge Biology has rapidly become data intensive,and dependent on computational methodsHow can we ensure that these methods are accessible to researchers?...while also ensuring that scientific resultsremain reproducible?

  6. Galaxy A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storageOpen-source software allowing anyone to freely deploy or extend this platformA community of users and developers

  7. Galaxy integrates existing tools into a uniform framework • Mostly command line tools, a declarative XML description of the interface, how to generate a command line • Designed to be as easy as possible for tool authors, while still allowing rigorous reasoning • Workflows can be constructed from scratch or extracted from existing analysis histories • Facilitate reuse, as well as providing precise reproducibility of a complex analysis

  8. Analysis environment automatically and transparently tracks details Describe analysis tool behavior abstractly Workflow system for complex analysis, constructed explicitly or automatically Pervasive sharing, and publication of documents with integrated analysis

  9. The Popularity Bottleneck

  10. So, send jobs to XSEDE! • Entire Galaxy workflows or component tasks. • Especially, tasks that requireHPC, e.g. de-novo assembly applications Velvet (of genome) and Trinity (of transcriptome) to PSC Blacklight (up to 16 TB of coherent shared memory per process). • Should be transparent to the user of usegalaxy.org .

  11. Problems to be solved • Data Migration: Galaxy currently relies on a shared filesystem between the instance host and the execution server to store the reference and user data required by the workflow. This is implemented via NFS. • Remote Job Submission:Galaxyjob execution currently requires a direct interface with the resource manager on the execution server.

  12. Initial Galaxy Data Staging to PSC Transferred 470TB in 21days from PSU to PSC (average ~22TB/day; peak 40 TB/day) rsync used to initially stage and synchronize subsequent updates Data copy maintained in PSC in /arc file system available from compute nodes Data Generation Nodes Storage Penn State 10gigE link PSC Data SuperCell

  13. Galaxy Remote Data Architecture Access is identical from PSU and PSC to the shared dataset via /galaxys2 SLASH2 file system handles consistency and multiple residency coherency and presence Local copies are maintained for performance Jobs run on PSC compute resources such as Blacklight and at PSU Data Generation and Processing Nodes /galaxys2 Penn State SLASH2 Wide-Area Common File system GalaxyFS PSC Data Generation and Processing Nodes /galaxys2

  14. Underlying SLASH2 Architecture I/O servers are very lightweight Can use most backing file systems (ZFS, ext4fs, etc.) Metadata Server (MDS) I/O Servers (IOS) I/O Servers (IOS) Data consistency updates I/O Servers (IOS) I/O Servers (IOS) One at PSU and one at PSC for performance Converts pathnames to object IDs Schedules updates when copies become inconsistent Consistency protocol to avoid incoherent data Residency and network scheduling policies enforced All other file ops (RENAME, SYMLINK, etc.) READ and WRITE Clients Clients are compute resources & dedicated front ends Dataset residency requests issued from administrators and/or users

  15. Submitting to Blacklight • Created a new Galaxy job-running plugin for submission via a remote shell program and PSC “SIMON” Torque command line: CLI runner. • Velvet and Trinity have been incorporated into the Galaxy web platform using Galaxy’s XML interface. • Test jobs have been successfully submitted from Penn State and executed on Blacklight using the data replicated via SLASH2 from Penn State to PSC.

  16. Future Work (1) • Integrate this work with the production public Galaxy site, usegalaxy.org • Dynamic job submission, allowing the selection of appropriate remote or local resources (cores, memory, walltime, etc.) based on individual job requirements, using an Open Grid Services Architecture Basic Execution Service compatible service, such as Unicore.

  17. Future Work (2) • Galaxy-controlled data management, to intelligently create replicas as close as possible to the compute resource that will use the data. • Authentication with Galaxy instances, using XSEDE or other credentials, e.g., InCommon/CILogon. • Additional data transfer capabilities in Galaxy, such as IRODS and Globus Online.

  18. The Vision Ultimately, we envision that any Galaxy instance (in any lab, not just Galaxy Main) will be able to spawn jobs, access data, and share data on external infrastructure whether this is an XSEDE resource, a cluster of Amazon EC2 machines, a remote storage array, etc.

More Related