
Welcome to Research Group Meeting on Reliability and Robustness in Grid Computing Systems

Chris Dabrowski (cdabrowski@nist.gov), Geoff Fox (gcf@indiana.edu). OGF21, Seattle, Washington, USA, October 17, 2007.


Presentation Transcript


  1. Welcome to Research Group Meeting on Reliability and Robustness in Grid Computing Systems. Chris Dabrowski (cdabrowski@nist.gov), Geoff Fox (gcf@indiana.edu). OGF21, Seattle, Washington, USA, October 17, 2007

  2. Proposed Meeting Agenda
     I. Introduction
     II. Presentation/Review of Draft OGF Informational Document “Reliability in Grid Computing Systems” (a work in progress)
     III. Discussion
     IV. Close

  3. Grid Reliability and Robustness RG
     • Purpose: Make recommendations and explore methods for improving reliability and robustness of standards-based grid systems.
     • Main Product: Produce an OGF Informational Document that summarizes the state of work on grid system reliability and identifies reliability and robustness issues/requirements for grid systems
       • First draft in progress
       • Contributions, review needed!
     • Additional Products:
       • Facilitate collaborations between researchers on grid reliability
       • Preliminary requirements for reliability measurement methods and tools
     • Web pages and reflector
       • Official: https://forge.gridforum.org/sf/projects/gridrel-rg
       • Unofficial: http://gridreliability.nist.gov/
       • List of resources (in progress)
       • Reflector: gridrel-rg@ogf.org

  4. OGF Informational Document
     • Title: Reliability in Grid Computing Systems
     • Purpose:
       • Summarizes the state of work on grid system reliability based on input from grid system practitioners/researchers
       • Identifies issues that must be addressed/solved to ensure reliability and robustness in grid systems
       • Provides a basis for identifying requirements for establishing and maintaining high levels of reliability in large-scale grids
       • Provides a basis for preliminary requirements for methods and tools to measure grid system reliability
     • Focuses on current practices and research that provide insight into how WS and grid specifications may affect grid reliability
     • Serves as a resource on reliability issues for OGF working groups developing specifications and for grid developers

  5. Document basis: previous workshops on grid reliability. First workshop (GGF16, Athens, Greece)
     • Site Assessment and Probabilistic Risk Analysis (PRA) of Grid Computing Facilities, by Joe Higgins and Robert Sewell of Sun Microsystems
       • Methods for analyzing risks involved in deploying and configuring grid computing sites
     • Reliable Messaging for Grids and Web Services, by Geoffrey Fox, Shrideep Pallickara, Damodar Yemme, Hasan Bulut and Sima Patel, Community Grids Lab, Indiana University
       • NaradaBrokering: scalable, standards-based management architecture for fault-tolerant grids
     • Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH), by Heon Y. Yeom of Distributed Computing Systems Laboratory, Seoul National University
       • Fault-tolerant MPI (FT-MPICH) with coordinated checkpointing of interacting, parallel processes
     • QoS-Aware Fault Tolerance in Grid Computing, by L. Valcarenghi, F. Cugini, F. Paolucci, and P. Castoldi, Scuola Superiore of Sant’Anna and CNIT, Pisa, Italy
       • Fault tolerance through integrating replicated services and a QoS-capable network protocol layer
     • A Program of Work for Understanding Emergent Behavior in Global Grid Systems, by Kevin Mills and Chris Dabrowski, of the U.S. NIST
       • Developing methods for understanding and controlling complex systems behavior in grids

  6. Document basis: previous workshops on grid reliability. Second workshop (OGF19, Chapel Hill, USA)
     • Using a Large-Scale Survivability Architecture to Control Grids: A Status Report, by Zach Hill, Jonathan Rowanhill, Jim Basney, Glenn Wasson, John Knight, Anh Nguyen-Tuong, Andrew Grimshaw and Marty Humphrey, University of Virginia and NCSA/University of Illinois, Urbana-Champaign
       • Reconfigurable Grid system architecture (Willow) for promoting survivability & dependability
     • Platform Symphony Reliability, by Nick Werstiuk, Platform Computing
       • Grid architecture for promoting reliability & dependability through failure detection and failover
     • Managing Grid and Web Services and their exchanged messages, by Harshawardhan Gadgil, Geoffrey Fox, Shrideep Pallickara, and Marlon Pierce, Indiana University
       • Results showing performance, scalability and cost-effectiveness of NaradaBrokering architecture
     • Reliability Assessment of Grid Software Systems Using Emergent Features, by Carol Song, Umut Topkara, Jungha Woo, and Sang Phill Park, Purdue University
       • Method for identifying centralized software components likely to impact grid system reliability
     • Reflections on Reliability Issues in OGSA, by Matti Hiltunen, AT&T Labs
       • Summary of requirements for ensuring reliability and availability of OGSA-based services

  7. Document Outline: Reliability in Grid Computing Systems
     • Introduction
     • Definitions
     • Current Practices on Grid System Reliability
     • Reliability of Grid Applications
     • Reliability Issues and Preliminary Requirements
     • Reliability Metrics and Preliminary Measurement Requirements
     • Summary
     • Resources

  8. 2. Definitions:
     • Source
       • Avizienis, A., Laprie, J., Randell, B., and Landwehr, C. “Basic Concepts and Taxonomy of Dependable and Secure Computing,”
     • Key definitions:
       • Reliability, availability, dependability, and fault tolerance
       • Grid resources
     • Decomposition of Grid Reliability concerns
       • Hardware and Software computing resources accessible via grid
       • Core infrastructure and resource management services
         • Allocate and manage grid resources
         • Example: discovery, negotiation, execution management, notification, security, etc.
       • Underlying connection and data transport facilities: grid network
       • Overall system perspective

  9. 3. Current practices/research on grid system reliability
     • Some main points: grid reliability methods
       • Still leverage redundancy
       • In deployed systems, are based on methods used in cluster computing
       • Must face scalability & administrative boundary issues
     • Areas covered
       • Fault tolerance of grid resources
         • Fault detection
         • Recovery methods for grid resources: checkpoint and recovery through process migration, grid resource replication, replication in data grids
         • Fault removal through testing and code certification
       • Reliability of supporting infrastructure and management services
       • Grid connection and transport reliability
         • Specifications, fault tolerant grid networks, reliable multicasting
       • Reliability from overall system perspective
         • Architectural perspective, complex systems perspective
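
The point that grid reliability methods still leverage redundancy can be illustrated with a small calculation. The sketch below is not taken from the slides or the draft document: it assumes replicas fail independently and that a request succeeds if at least one replica succeeds, and the function name `replicated_reliability` is hypothetical.

```python
def replicated_reliability(r: float, n: int) -> float:
    """Probability that at least one of n replicas succeeds.

    Assumption (not from the slides): each replica succeeds
    independently with the same probability r.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be a probability in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    # All n replicas fail with probability (1 - r) ** n.
    return 1.0 - (1.0 - r) ** n


if __name__ == "__main__":
    # Compare one resource against two and three replicas of it.
    for replicas in (1, 2, 3):
        print(replicas, replicated_reliability(0.9, replicas))
```

The independence assumption is the weak spot in a real grid: replicas within one administrative domain or behind one network link can fail together, which is one reason the scalability and administrative boundary issues listed above matter.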

  10. 4. Reliability of grid applications
     • Some main points:
       • Grid applications may/should ensure their reliability themselves (perspective of GCPR WG?)
       • Merging of grid user/client FT methods and provider FT methods?
       • What’s being done for FT in grid workflows?
     • Areas covered
       • Fault tolerance of remote application processes
       • Fault tolerance of grid resource compositions and workflows
         • Workflows composed with languages/tools for grid environments
         • Workflows composed with languages/tools for generic web service environments
       • Merging application and provider fault tolerance strategies

  11. 4. Reliability issues and preliminary requirements
     • Fault removal
       • Cost-benefits of testing grid components to determine which functions and kind of tests needed (component, integration, or interaction tests)
     • Fault Tolerance
       • Fault detection: need for scalability of methods, fault taxonomies
       • Recovery: tradeoffs between methods, understanding which methods to use and when, and coordinated checkpoint methods.
     • Special requirements for infrastructure and resource management services
       • Criticality of services leads to different tradeoff dynamics
     • Fault tolerance for grid networking and data transport
       • FT/control in overlays, combining overlays, dedicated networks, enhance specs for reliability(?), reliable multicasting?
     • Fault tolerance of grid applications
       • User vs provider FT, FT considerations for workflow languages?

  12. 5. Metrics and preliminary measurement requirements
     • Preliminary work on grid reliability metrics
       • OGF Network measurement working group (2004), analysis of reliability of a grid by Xie and colleagues (2004).
     • Preliminary requirements for metrics, three classes:
       • OGF NM WG
       • Metrics to measure availability and reliability of individual grid resources (needed by grid users for evaluation purposes)
       • Metrics to measure reliability of entire grid or significant subsections (as above)
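
As an illustration of what a metric for an individual grid resource might look like, the sketch below computes availability, mean time to failure (MTTF) and mean time to repair (MTTR) from a log of outages. The slides do not define any metric, so this uses the common textbook definitions as an assumption; the `Outage` record and the function `availability_metrics` are hypothetical names, and the outages are assumed not to overlap.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    start: float  # hours since the observation window began
    end: float    # hours since the observation window began


def availability_metrics(observed_hours: float, outages: list[Outage]) -> dict[str, float]:
    """Summarize one resource's outage log (assumed non-overlapping outages)."""
    if observed_hours <= 0:
        raise ValueError("observed_hours must be positive")
    downtime = sum(o.end - o.start for o in outages)
    uptime = observed_hours - downtime
    failures = len(outages)
    return {
        # Fraction of the window during which the resource was usable.
        "availability": uptime / observed_hours,
        # Mean uptime per failure; infinite if no failure was observed.
        "mttf": uptime / failures if failures else float("inf"),
        # Mean repair duration; zero if no failure was observed.
        "mttr": downtime / failures if failures else 0.0,
    }


if __name__ == "__main__":
    log = [Outage(100.0, 104.0), Outage(500.0, 506.0)]
    print(availability_metrics(1000.0, log))
```

With these definitions availability equals MTTF / (MTTF + MTTR) whenever at least one failure was observed. A metric for an entire grid or a significant subsection would need to combine such per-resource figures, which this sketch does not attempt.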

  13. 6. Summary
     • TBD
     7. Resources
     • Over 180 cited
     • Organized topically in an appendix
     • Additional sources to be worked in

  14. Presentation Summary
     • Document work in progress
     • Please review and comment!
     • Please contribute!
