
Distributed Data for Science Workflows






Presentation Transcript


  1. Distributed Data for Science Workflows: Data Architecture Progress Report, December 2008

  2. Challenges and Opportunities
     • TeraGrid is larger than ever before, meaning data is more widely distributed and needs to be more mobile
     • As previously reported, the balance of FLOPS to available storage has drastically changed
     • The TeraGrid user portal and science gateways have matured, and interfaces to TG resources have diversified
     • Need greater emphasis on unified interfaces to data, and integration of data into common workflows

  3. Constraints on the Architecture
     • We cannot address the issue of available storage
     • Limited opportunity to improve data transfer performance at the high end
     • Cannot introduce drastic changes to TG infrastructure at this stage of the project
     • Remain dependent on the availability of technology and resources for wide-area file systems

  4. Goals for the Data Architecture
     • Improve the experience of working with data in the TeraGrid for the majority of users
     • Reliability, ease of use, performance
     • Integrate data management into the user workflow
     • Balance performance goals against usability
     • Avoid overdependence on data location
     • Support the most common use cases as transparently as possible
     • Move data in, run job, move data out as basic pattern
     • Organize, search, and retrieve data from large “collections”

  5. Areas of Interest
     • Simplifying command-line data movement
     • Extending the reach of WAN file systems
     • Develop unified data replication and management infrastructure
     • Extend and unify user portal interfaces to data
     • Integrate data into scheduling and workflows
     • Provide common access mechanisms to diverse, distributed data resources

  6. Command-line tools
     • Many users are still oriented towards shell access
     • GridFTP is too difficult to use
     • SSH is widely known but has limited usefulness in current configuration
     • We need a new approach and/or tool to provide common, easy-to-use data movement, without compromising on performance

  7. Extending Wide-Area File Systems
     • A “Wide-Area” file system is available on multiple resources
     • A “Global” file system is available on all TeraGrid resources
     • Indiana and SDSC each have a WAN-FS in production now
     • Need to honestly assess the potential for Global file systems, while making WAN file systems available on more resources
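The two definitions on this slide can be written down as a small check. This is only an illustrative sketch: the function name, the "local" label for a file system mounted on a single resource, and the placeholder resource names are assumptions, not part of the presentation.

```python
def classify_file_system(mounted_on, all_resources):
    """Classify a file system using the slide's two definitions.

    "global"    -> available on every resource in all_resources
    "wide-area" -> available on more than one resource, but not all
    "local"     -> anything else (a label assumed here for completeness)
    """
    everything = set(all_resources)
    # Ignore mounts on resources that are not part of the grid at all.
    mounted = set(mounted_on) & everything
    if everything and mounted == everything:
        return "global"
    if len(mounted) > 1:
        return "wide-area"
    return "local"


# Hypothetical resource names, used only to exercise the function.
resources = ["resource_a", "resource_b", "resource_c"]
print(classify_file_system(["resource_a", "resource_b"], resources))
print(classify_file_system(resources, resources))
```

Under these definitions, making a WAN file system "available on more resources" moves it toward the "global" case without changing how it is classified until every resource mounts it.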

  8. Unified Data Management
     • Management of both data and metadata, which may be stored at one or more locations in TeraGrid
     • Multiple sites support data collections using SRB, iRODS, databases, web services, etc.
     • This diversity is good, but also confusing to new users
     • Need a single service, which may utilize multiple technologies, to provide a common entry point for users
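One way to picture "a single service, which may utilize multiple technologies" is a thin front end that forwards a query to whatever backend each site runs. The sketch below is a generic facade under stated assumptions: every class, method, and site name is hypothetical, and the in-memory backend merely stands in for a real SRB, iRODS, database, or web-service collection. It does not use or imitate any of those systems' actual APIs.

```python
from typing import Dict, List, Protocol


class CollectionBackend(Protocol):
    """Anything that can answer a simple text query (hypothetical interface)."""

    def find(self, query: str) -> List[str]:
        ...


class InMemoryBackend:
    """Stand-in for a site's real collection technology."""

    def __init__(self, records: List[str]) -> None:
        self.records = records

    def find(self, query: str) -> List[str]:
        # Substring match keeps the example small; real systems query metadata.
        return [r for r in self.records if query in r]


class UnifiedDataService:
    """Single entry point that fans a query out to every registered site."""

    def __init__(self) -> None:
        self._backends: Dict[str, CollectionBackend] = {}

    def register(self, site: str, backend: CollectionBackend) -> None:
        self._backends[site] = backend

    def find(self, query: str) -> List[str]:
        results = []
        # Sorted order makes the output independent of registration order.
        for site in sorted(self._backends):
            for record in self._backends[site].find(query):
                results.append(f"{site}:{record}")
        return results


service = UnifiedDataService()
service.register("site_b", InMemoryBackend(["climate/run2.nc"]))
service.register("site_a", InMemoryBackend(["climate/run1.nc", "astro/sky.fits"]))
print(service.find("climate"))
```

The user sees one `find` call regardless of how many sites or technologies sit behind it, which is the "common entry point" the slide asks for.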

  9. Interfaces to Data
     • SSH and “ls” are not effective interfaces to large, complex datasets
     • Portal and Gateway interfaces to data have proven useful and popular, but:
       • They may not be able to access all resources, and may require significant gateway developer effort
     • Extend user portal to support WAN file systems and distributed data management
     • Possible to expose user portal internals to ease development of gateways?

  10. Integrating Data into Workflows
     • Almost all tasks run on TeraGrid require some data management and multiple storage resources
       • Moving data into an HPC system
       • Moving results to an analysis or viz system
       • Moving results to an archive
     • Need to make these tasks less human-intensive
     • Users should be able to include these steps as part of their job submission
     • Tools such as DMOVER and PetaShare already exist but are not widely available in TeraGrid
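The "move data in, run job, move data out" pattern can be sketched as a single function so that the data steps travel with the job rather than being done by hand. This is a minimal illustration, not how DMOVER or PetaShare work: the function name and arguments are hypothetical, local file copies stand in for wide-area transfers, and a plain subprocess stands in for a scheduler submission.

```python
import shutil
import subprocess
from pathlib import Path


def run_with_staging(inputs, scratch, command, outputs, archive):
    """Stage inputs in, run the job, then stage the named outputs out."""
    scratch = Path(scratch)
    archive = Path(archive)
    scratch.mkdir(parents=True, exist_ok=True)
    archive.mkdir(parents=True, exist_ok=True)

    # Stage in: copy each input file next to where the job will run.
    for src in inputs:
        shutil.copy2(src, scratch / Path(src).name)

    # Run: check=True raises CalledProcessError if the job fails,
    # so nothing is archived from a failed run.
    subprocess.run(command, cwd=scratch, check=True)

    # Stage out: copy the requested results to the archive location.
    archived = []
    for name in outputs:
        dest = archive / name
        shutil.copy2(scratch / name, dest)
        archived.append(dest)
    return archived
```

A caller passes the input paths, a scratch directory, the command as an argument list, the output file names to keep, and an archive directory. Swapping the copy calls for a real transfer tool would leave the three-step shape unchanged.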

  11. Some Implementation Plans
     • Extend current iRODS-based data management infrastructure to additional sites
     • Test use of REDDNET for distributed data storage and access in TeraGrid
     • Provide a TGUP interface to Lustre-WAN
     • Provide a TGUP interface to distributed data and metadata management
     • Extend current production IU Lustre-WAN and GPFS-WAN to as many compatible resources as possible

  12. More Implementation Plans
     • Port DMOVER to additional schedulers, deploy across TeraGrid
     • Develop and execute plan for PSC-based Lustre-WAN and GPFS/pNFS testing and eventual production deployment (already underway)
     • Work with Gateways group to provide appropriate interfaces to data movement through UP or other mechanisms
     • Simple changes to SSH/SCP configuration:
       • Support SCP-based access to data mover nodes
       • Support simpler addressing of data resources

  13. The Cutting, not the Bleeding Edge
     • Primary goal is to improve the availability of robust, production technologies for data
     • Balancing performance, usability and reliability will always be a challenge
     • Need to be agile in assessing new technologies or improvements on old technologies
     • Data Working Group should focus on improvements to configuration for a few production components
     • Make consistent, well-planned efforts to evaluate new components

  14. To-Do List for December
     • Understanding level of required vs. available effort
     • Work with other areas/WGs to place Data Architecture in context (CTSS, Gateways, etc.)
     • Setting of priorities and ordering of tasks
     • Development of timelines and milestones for execution
     • Presentation of integrated Data Architecture Description and Plan in early January
