Enhancing Scientific Computing with NetSolve: A Robust Infrastructure for Resource Sharing
NetSolve, developed by Henri Casanova and Jack Dongarra at the University of Tennessee and Oak Ridge National Laboratory, provides a comprehensive framework for harnessing vast computational resources across networks. It aims to simplify scientific computing by reducing installation overhead, masking complexities of distributed computing, and ensuring platform independence. Key features include extensibility, load balancing, fault tolerance, and support for multiple programming languages. NetSolve facilitates efficient computation-sharing models and promotes convenience and reliability for the scientific community.
Presentation Transcript
NetSolve
Henri Casanova and Jack Dongarra
University of Tennessee and Oak Ridge National Laboratory
http://www.cs.utk.edu/netsolve
Objectives
• Harnessing vast computational resources on the network
  • Hardware
  • Software
• Convenient for the scientific computing community
• Reducing installation and programming overhead
• Masking complexity related to distributed computing
Computation-Sharing Models: Proxy Computing
[Diagram: Client and Server; labels Data (twice) and Code (twice); computation on the server]
Computation-Sharing Models: Code Shipping
[Diagram: Client and Server; labels Code (twice) and Data (once); computation on the client]
Computation-Sharing Models: Remote Computation
[Diagram: Client and Server; labels Data (twice) and Code (once); computation on the server]
Design issues
• Platform independence to accommodate heterogeneity
• User friendly
• Extensibility
• Load balancing
• Fault tolerance
NetSolve Architecture
[Diagram; surviving labels: "OS", Resources]
NetSolve Client Interface
C, Fortran, Java, Matlab, and Mathematica

Blocking call (Matlab):
>> a = rand(100); b = rand(100,1);
>> x = netsolve('ax = b', a, b);

Non-blocking call (Matlab):
>> a = rand(100); b = rand(100,1);
>> request = netsolve_nb('send', 'ax = b', a, b);
>> x = netsolve_nb('probe', request);
Not ready
>> x = netsolve_nb('wait', request);
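The send / probe / wait pattern of the non-blocking interface can be imitated locally. The following is a minimal Python sketch, not NetSolve code: a thread pool stands in for the remote servers, and the names `send`, `probe`, `wait`, and `solve_2x2` are hypothetical helpers chosen to mirror the three `netsolve_nb` actions on the slide.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: a local thread pool plays the role of the remote servers.
_pool = ThreadPoolExecutor(max_workers=4)


def solve_2x2(a, b):
    """Solve a 2x2 linear system a x = b by Cramer's rule (toy solver)."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if det == 0:
        raise ValueError("singular matrix")
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - b[0] * a[1][0]) / det
    return [x0, x1]


def send(fn, *args):
    """Start the computation and return a request handle immediately."""
    return _pool.submit(fn, *args)


def probe(request):
    """Return True if the result is ready, without blocking."""
    return request.done()


def wait(request):
    """Block until the computation finishes and return its result."""
    return request.result()
```

Because `send` returns at once, a client can issue several requests before calling `wait` on any of them, which is how the non-blocking interface exposes parallelism.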
NetSolve Wrappers
• Problem description file for extensibility

@PROBLEM ipars
@INCLUDE "ipars.h"
@LIB /home/user/lib/libipars.a
@DESCRIPTION Parallel Sub-Surface Flow Simulator
@INPUT 2
@OBJECT STRING CHAR model
@OBJECT FILE CHAR infile

• Compiled into wrappers around scientific libraries
• XDR for platform-independent data transfer
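To illustrate why XDR gives platform-independent transfer, here is a small Python sketch of XDR-style encoding built on the standard `struct` module. It follows the general XDR conventions (big-endian 4-byte lengths, big-endian IEEE doubles, strings padded to a multiple of 4 bytes); the function names are hypothetical, and this is not the wire format NetSolve itself uses for any particular problem.

```python
import struct


def pack_uint(n):
    """Encode an unsigned integer as 4 big-endian bytes."""
    return struct.pack(">I", n)


def pack_string(s):
    """Encode a string as length + bytes, zero-padded to a 4-byte boundary."""
    data = s.encode("ascii")
    pad = (-len(data)) % 4
    return pack_uint(len(data)) + data + b"\x00" * pad


def pack_double_array(values):
    """Encode a variable-length array as count + big-endian 8-byte doubles."""
    return pack_uint(len(values)) + struct.pack(">%dd" % len(values), *values)


def unpack_string(buf, offset=0):
    """Decode a string; return (value, offset of the next item)."""
    (n,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    value = buf[start:start + n].decode("ascii")
    return value, start + n + ((-n) % 4)


def unpack_double_array(buf, offset=0):
    """Decode a double array; return (values, offset of the next item)."""
    (n,) = struct.unpack_from(">I", buf, offset)
    values = list(struct.unpack_from(">%dd" % n, buf, offset + 4))
    return values, offset + 4 + 8 * n
```

Since byte order and sizes are fixed by the encoding rather than by the host, a message packed on one architecture decodes to the same values on another.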
NetSolve Load Balancing
• Assigning a task to the "best" machine
• Establishing a performance model
  Network delay, server properties, task properties
• Measuring and monitoring dynamic system states
• Load balancing at a finer granularity
• Parallelism through non-blocking interface
• Task migration
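One way to picture "assigning a task to the best machine" is a cost estimate that combines the three ingredients listed above. The Python sketch below is an illustration under assumed formulas, not NetSolve's actual performance model: the `Server` fields, the linear load penalty, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    latency_s: float        # network delay to reach the server, in seconds
    bandwidth_bps: float    # bytes per second between client and server
    flops: float            # peak floating-point operations per second
    load: float             # number of competing jobs (dynamic state)


def estimated_time(server, data_bytes, task_flops):
    """Estimate total time: data transfer plus load-adjusted compute time."""
    transfer = server.latency_s + data_bytes / server.bandwidth_bps
    # Assumption: each competing job takes an equal share of the CPU.
    compute = task_flops * (1.0 + server.load) / server.flops
    return transfer + compute


def pick_best(servers, data_bytes, task_flops):
    """Return the server with the smallest estimated completion time."""
    return min(servers, key=lambda s: estimated_time(s, data_bytes, task_flops))
```

With a model of this shape, a compute-heavy task tends to go to a fast but distant server, while a data-heavy task with little computation tends to go to a nearby one; a rise in `load` can shift the choice at run time.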
NetSolve Fault Tolerance
• Inter-server fault tolerance
  Fault tolerance among NetSolve servers
• Intra-server fault tolerance
  Fault tolerance within a NetSolve server
NetSolve Fault Tolerance: Inter-server Fault Tolerance
Performed by NetSolve agents
• Basic approach
  • Failure detection + task reallocation
  • Overload detection + task migration
• Introducing NetSolve storage servers
  • Store checkpoints or any information related to fault tolerance (must be platform-independent)
  • No reliance on failed or overloaded server for task migration
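The basic approach of failure detection plus task reallocation can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the agent's real protocol: servers are plain callables, a failure is signalled by the hypothetical `ServerFailure` exception, and candidates are tried in the order given.

```python
class ServerFailure(Exception):
    """Raised by a (simulated) server that crashed or became unreachable."""


def run_with_reallocation(task_args, servers):
    """Try each candidate server in turn until one completes the task.

    `servers` is a list of (name, callable) pairs, ordered best first.
    Returns (name of the server that succeeded, result, list of failures).
    """
    failures = []
    for name, run in servers:
        try:
            result = run(*task_args)
        except ServerFailure as exc:
            # Failure detected: record it and reallocate to the next server.
            failures.append((name, str(exc)))
            continue
        return name, result, failures
    raise RuntimeError("all candidate servers failed: %r" % (failures,))
```

In this sketch a reallocated task restarts from scratch; storing platform-independent checkpoints on a separate storage server would let the next server resume instead, without contacting the failed one.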
NetSolve Fault Tolerance: Intra-server Fault Tolerance
• Not a new problem
• Could be invisible to NetSolve
• Can take advantage of platform-specific features for fault tolerance
• Possible integration with inter-server fault tolerance
Diskless Checkpointing: Checksums and Reverse Computation
• Diskless checkpointing eliminates the need for stable storage
• N servers + a checkpointing server
• At any point, consistent checkpoints taken at N servers (stored in memory)
• A checksum of checkpoints stored at the checkpointing server
• Rollback using reverse computation
• State recovery using the checksum
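State recovery from a checksum can be shown with a small Python sketch. It assumes the checksum is a bytewise XOR parity over equal-length in-memory checkpoints, which is one common encoding; the slides do not say which checksum is used, and the function names are hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("checkpoints must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def make_checksum(checkpoints):
    """Checksum held by the checkpointing server: XOR of all N checkpoints."""
    return reduce(xor_bytes, checkpoints)


def recover(surviving_checkpoints, checksum):
    """Rebuild the one lost checkpoint from the survivors and the checksum."""
    return reduce(xor_bytes, surviving_checkpoints, checksum)
```

Each checkpoint appears exactly once in the parity, so XOR-ing the parity with the N-1 surviving checkpoints cancels them and leaves the missing one. This tolerates a single failure among the N servers; the survivors still need to roll back to the same consistent checkpoint.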
Applications
• MCell with NetSolve
  Large code, small data
• Matlab with NetSolve
  Tradeoffs between parallelism and overhead
• IPARS with NetSolve
• ImageVision with NetSolve
Conclusion
• An interesting infrastructure for sharing computational resources
  Both software and hardware
• Convenience, performance, and reliability
• Playground for fault tolerance
  Both general and specific