Cluster Operating System Support For Parallel Autonomic Computing

Andrzej M. Goscinski, J. Silcock, M. Hobbs
School of Information Technology, Deakin University
Geelong, Vic 3217, Australia

A Need for More than Execution Performance
• Performance is a critical assessment criterion
• Security, reliability, and ease of programming are neglected
• Furthermore
  • Parallel computers are seen as user-unfriendly
  • Parallel processing is not used on a daily basis
  • Ordinary users have to be involved in programming activities that are of an operating system nature
  • Ordinary engineers, managers, etc. do not have, and should not have, the specialized knowledge needed to program operating-system-oriented activities
COSET’2004

Aim of Our Research
• IBM has launched a comprehensive program
  • “to re-examine an obsession with faster, smaller, and more powerful”
  • “to look at the evolution of computing from a more holistic perspective”
• IBM’s Autonomic Computing is one of the Grand Challenges
• Parallel processing on non-dedicated clusters could benefit from the Autonomic Computing vision
• Aim: to show a general design of services and an initial implementation of a system that moves parallel processing on clusters into the computing mainstream using the Autonomic Computing vision

IBM’s Autonomic Computing
• The name “autonomic” has not caught on everywhere, if only because it is IBM’s
  • Microsoft prefers “trustworthy”
  • Others prefer the more generic “self-managing”
• Many see “autonomic computing” as one of the basic parts of a revolutionary technology that
  • Will start the new .com boom
  • Will move parallel computing on clusters into the computing mainstream

IBM’s Autonomic Computing
• Characteristics of an autonomic computing system
  • knows itself
  • configures and reconfigures itself under varying and unpredictable conditions
  • optimizes its working
  • performs something akin to healing
  • provides self-protection
  • knows its surrounding environment
  • exists in an open (non-hermetic) environment
  • anticipates the optimized resources needed while keeping its complexity hidden

Related Work
• A number of projects related to Autonomic Computing are mentioned on the IBM website
• While many of the reported projects engage in some aspects of Autonomic Computing, none engages in research to develop a system that has all eight of the required characteristics
• None of the projects addresses parallel processing, in particular parallel processing on non-dedicated clusters

Design of Autonomic Elements (Services) Providing Autonomic Computing on Non-dedicated Clusters
• We have proposed and designed a set of autonomic elements that must be provided to develop an autonomic computing environment on a non-dedicated cluster
• Three component levels
  • Services
  • Computers
  • Non-dedicated cluster
• Note: we have not addressed
  • Hardware aspects
  • Administration aspects

Cluster Knows Itself
• A need for resource discovery
• This autonomic element runs on each computer
• Activities
  • Acquires knowledge of static parameters of computers
    • processor type (e.g., speed)
    • memory size
    • available software
  • Acquires knowledge of dynamic parameters of clusters
    • computers’ load
    • available memory
    • communication pattern and volume
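The static and dynamic parameters listed above could be held in a per-computer record such as the one below. This is a minimal sketch and not the Holos implementation: the class names, field names and units (`StaticParameters`, `processor_speed_mhz`, `messages_sent`, and so on) are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StaticParameters:
    """Parameters that stay fixed while a computer runs (hypothetical fields)."""
    processor_type: str
    processor_speed_mhz: int
    memory_size_mb: int
    available_software: tuple


@dataclass
class DynamicParameters:
    """Parameters that are sampled repeatedly (hypothetical fields and units)."""
    load: float
    available_memory_mb: int
    # Communication pattern and volume: peer name -> bytes sent so far.
    messages_sent: dict = field(default_factory=dict)


class ResourceDiscovery:
    """One instance per computer; keeps the latest view of its resources."""

    def __init__(self, name, static):
        self.name = name
        self.static = static
        self.dynamic = DynamicParameters(
            load=0.0, available_memory_mb=static.memory_size_mb
        )

    def record_sample(self, load, available_memory_mb):
        # Overwrite the previous sample with the newest measurement.
        self.dynamic.load = load
        self.dynamic.available_memory_mb = available_memory_mb

    def record_message(self, peer, size_bytes):
        # Accumulate the volume sent to each peer.
        sent = self.dynamic.messages_sent
        sent[peer] = sent.get(peer, 0) + size_bytes

    def report(self):
        return {"computer": self.name, "static": self.static, "dynamic": self.dynamic}
```

A caller would create one `ResourceDiscovery` object per computer, feed it samples, and read `report()` when placement decisions are made.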

Resource Discovery Service Design
[Figure: on each computer (Computer i, Computer j), a Resource Discovery element monitors the CPU, main memory and computation elements, gathering computational load and parameters together with the communication pattern and the local and remote communication load.]

Cluster Configures and Reconfigures Itself under Varying and Unpredictable Conditions
• In a non-dedicated cluster there are times when
  • Some computers are lightly loaded or idle
  • Some computers cannot be used because
    • their owners have removed them from the shared pool of resources
    • they are heavily loaded
• To offer high availability, i.e., to configure and reconfigure itself, the system
  • Forms parallel virtual clusters adaptively and dynamically
  • Bases their formation on load and changing resources
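Forming a virtual cluster from load and availability data can be sketched as a filter followed by a sort. The function name `form_virtual_cluster`, the dictionary keys `shared` and `load`, and the `max_load` threshold are all assumptions of this sketch; the actual selection policy is not described on the slide.

```python
def form_virtual_cluster(computers, needed, max_load=0.5):
    """Pick up to `needed` computers that are shared and lightly loaded.

    `computers` maps a computer name to a dict with the hypothetical keys
    'shared' (False once the owner has withdrawn the machine) and
    'load' (0.0 means idle).
    """
    candidates = [
        (info["load"], name)
        for name, info in computers.items()
        if info["shared"] and info["load"] <= max_load
    ]
    # Least-loaded computers first; ties are broken by name.
    candidates.sort()
    return [name for _, name in candidates[:needed]]


# Snapshot at time t0: c2 is heavily loaded, c3 has been withdrawn.
t0 = {
    "c1": {"shared": True, "load": 0.1},
    "c2": {"shared": True, "load": 0.9},
    "c3": {"shared": False, "load": 0.0},
    "c4": {"shared": True, "load": 0.3},
}
cluster_t0 = form_virtual_cluster(t0, needed=2)
```

Calling the function again on a later snapshot yields a different virtual cluster, which is the adaptive behaviour the slide describes.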

Availability Service Design
[Figure: the Availability Services use the Resource Discovery (RD) elements on the computers to form a different Virtual Parallel Cluster at each of the times t0 < t1 < t2 < t3.]

Cluster Should Optimize Its Working
• Application computation elements should be placed optimally
• To improve performance there is a need for information on
  • Computation load
  • Available memory
  • Communication costs
• To optimize the cluster’s working there is
  • Static allocation and load balancing
  • The ability to change performance indices that reflect user objectives
  • Computation element migration, creation and duplication
  • Setting of computation priorities of applications

High Performance Service Design
[Figure: the Global Scheduler works with the Availability Services on a Virtual Parallel Cluster of computers C1, C2, C3, …, Cn. Static Allocation decides { where: P1 → C1, P2 → C2, …, {Pi, Pj} → Cn }; Load Balancing decides { where, which, when: Pi : Cn → C3 }, i.e. a migration of process Pi from Cn to C3.]
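The two decisions of the Global Scheduler, initial placement ("where") and migration ("where, which, when"), can be illustrated with a simple greedy policy. This is a sketch under stated assumptions, not the Holos scheduler: the greedy least-loaded rule, the per-process load estimates and the `threshold` parameter are all hypothetical.

```python
def static_allocation(processes, computers):
    """Map each process to the least-loaded computer so far (greedy).

    `processes` maps a process name to an estimated load; `computers` is a
    list of computer names. Returns (placement, load per computer).
    """
    load = {c: 0.0 for c in computers}
    placement = {}
    # Place the heaviest processes first; ties are broken by name.
    for proc, cost in sorted(processes.items(), key=lambda kv: (-kv[1], kv[0])):
        target = min(computers, key=lambda c: (load[c], c))
        placement[proc] = target
        load[target] += cost
    return placement, load


def pick_migration(placement, processes, load, threshold=1.0):
    """Return (process, source, destination), or None if no move is needed."""
    source = max(load, key=lambda c: (load[c], c))
    dest = min(load, key=lambda c: (load[c], c))
    if load[source] - load[dest] <= threshold:
        return None  # "when": the imbalance is not large enough yet
    movable = [p for p, c in placement.items() if c == source]
    if not movable:
        return None  # the load on the source is not caused by our processes
    # "which": move the smallest process so the destination is not overloaded.
    proc = min(movable, key=lambda p: (processes[p], p))
    return proc, source, dest


processes = {"P1": 3.0, "P2": 2.0, "P3": 1.0}
placement, load = static_allocation(processes, ["C1", "C2"])
```

In use, `pick_migration` would be re-evaluated whenever fresh load data arrives, since in a non-dedicated cluster the load also changes because of the owners' own work.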

Cluster Should Perform Something Akin To Healing
• Hardware and software faults can occur
• Failures lead to the termination of computations
• To provide something akin to healing
  • Faults are identified and reported
  • Checkpointing of the parallel computation elements of applications is provided
  • Recovery from failures is employed
  • Applications are migrated automatically from faulty computers to healthy computers
  • Redundant/replicated services are provided

Self-Healing Service Design
[Figure: coordinated checkpointing of Computation Element i across computers C1, C2, Cj and Ck, with checkpoints for the element held on other computers and on a recovery disk, and the element shown after crash recovery.]
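The checkpoint-then-recover cycle can be shown for a single computation element. The sketch below deliberately leaves out the coordination between elements that the design calls for; `CheckpointStore` stands in for the recovery disk, and every name and parameter here (`run_with_recovery`, `crash_at`, `checkpoint_every`) is hypothetical.

```python
import pickle


class CheckpointStore:
    """Stands in for the recovery disk: latest checkpoint per element."""

    def __init__(self):
        self._saved = {}

    def save(self, element_id, state):
        # Serialise so later changes to `state` cannot alter the checkpoint.
        self._saved[element_id] = pickle.dumps(state)

    def restore(self, element_id):
        return pickle.loads(self._saved[element_id])


def run_with_recovery(element_id, steps, store, crash_at=None, checkpoint_every=2):
    """Sum `steps`, checkpointing periodically and surviving one simulated crash."""
    state = {"next": 0, "total": 0}
    store.save(element_id, state)
    crashed = False
    while state["next"] < len(steps):
        if crash_at == state["next"] and not crashed:
            crashed = True
            # Work since the last checkpoint is lost; resume from the saved
            # state, as if restarted on a healthy computer.
            state = store.restore(element_id)
            continue
        state["total"] += steps[state["next"]]
        state["next"] += 1
        if state["next"] % checkpoint_every == 0:
            store.save(element_id, state)
    return state["total"]


store = CheckpointStore()
result = run_with_recovery("element_i", [1, 2, 3, 4, 5], store, crash_at=3)
```

The run with a crash produces the same result as an undisturbed run because the lost steps are simply re-executed from the last checkpoint.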

Clusters Should Provide Self-Protection
• Computation elements of parallel applications are distributed
• Computation elements communicate using messages
• They are subject to passive and active attacks
• To provide self-protection
  • Virus detection and recovery must be offered
  • Resource protection should be a mandatory service
  • Encryption, as a countermeasure against passive attacks, should be used
  • Authentication, as a countermeasure against active attacks, should be used
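Authentication of messages between computation elements can be illustrated with a keyed hash (HMAC) from the Python standard library. This is only a sketch of the countermeasure against active attacks: it assumes both elements already share a secret key, it is not the mechanism used by the authors, and it does not encrypt the payload (encryption against passive attacks would need a cipher, which the standard library does not provide). The function names `seal` and `open_sealed` are hypothetical.

```python
import hashlib
import hmac

TAG_LEN = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def seal(key: bytes, payload: bytes) -> bytes:
    """Prefix the payload with an HMAC-SHA256 tag computed under `key`."""
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload


def open_sealed(key: bytes, message: bytes) -> bytes:
    """Return the payload if the tag verifies, otherwise raise ValueError."""
    tag, payload = message[:TAG_LEN], message[TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking how many leading bytes matched.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("message failed authentication")
    return payload


key = b"shared-secret-assumed-to-be-distributed-securely"
sealed = seal(key, b"migrate P1")
```

A receiver that calls `open_sealed` rejects any message that was modified in transit or produced without the shared key.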

To Allow a System to Know Its Surrounding Environment and to Prevent a System From Existing in a Hermetic Environment
• There are applications that require
  • More computation power
  • Specialized software
  • Unique peripheral devices, etc.
• Many owners cannot afford such resources
• Some owners can offer their services and resources to appropriate users

To Allow a System to Know Its Surrounding Environment and to Prevent a System From Existing in a Hermetic Environment
• To benefit from existing unique resources
  • Resource discovery of other clusters is provided
  • Advertising of services is in place
  • Systems are able to cooperate
  • Negotiation is in use
  • Brokerage of resources and services is used
  • Resources are shared in a distributed manner
  • “The move toward a grid” should be in place

Grid-like Service Design
[Figure: Cluster 1, Cluster 2, Cluster 3, …, Cluster n, each with Brokerage Services. The Brokerage Services handle advertisement, exporting and withdrawal of services and import requests; the services shown are computational, storage/memory, printer and information services.]
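The export, withdrawal and import operations of the brokerage services can be sketched as a small registry per cluster. The `Broker` class and its method names are hypothetical, and negotiation between clusters is left out.

```python
class Broker:
    """Brokerage service of one cluster (illustrative sketch only)."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.exported = set()   # services this cluster currently advertises
        self.peers = []         # brokers of other clusters, in contact order

    def connect(self, other):
        # Brokers learn about each other in both directions.
        self.peers.append(other)
        other.peers.append(self)

    def export(self, service):
        self.exported.add(service)

    def withdraw(self, service):
        self.exported.discard(service)

    def import_request(self, service):
        """Return the name of a peer cluster exporting `service`, or None."""
        for peer in self.peers:
            if service in peer.exported:
                return peer.cluster
        return None


b1, b2, b3 = Broker("Cluster 1"), Broker("Cluster 2"), Broker("Cluster 3")
b1.connect(b2)
b1.connect(b3)
b2.export("printer")
b3.export("storage")
provider = b1.import_request("storage")
```

Once a service is withdrawn, later import requests no longer find it, which models an owner taking a resource back from the shared pool.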

A Cluster Should Anticipate the Optimized Resources Needed While Keeping Its Complexity Hidden
• The scarcity of software to assist ordinary programmers limits the harnessing of the computing power of non-dedicated clusters
• This implies a need for
  • A programming environment that is simple to use
  • No required knowledge of resource distribution
  • Transparent support for message passing and shared memory programming

Easy Programming Service Design
[Figure: the Programming Environment offers Message Passing or PVM / MPI, built on communication primitives, and Shared Memory, built on the DSM system; both rest on the system services and kernel services of an operating system.]

The Holos Services for Autonomic Computing Clusters
• Holos is built to demonstrate that it is possible to develop an autonomic non-dedicated cluster that
  • could be routinely employed by ordinary engineers, managers, etc.
  • is able to support next-generation application software executing on clusters
• We followed the recommendations of IBM’s vision regarding autonomic elements
• We decided to view autonomic elements as processes
  • Each computer is a multi-process system with its own objectives
  • A cluster is a set of multi-process systems with its own objectives

Holos
[Figure: Holos architecture. Parallel processes (MP / PVM / MPI Process, DSM Process) run above the system servers (Brokerage Server, Global Scheduler, Execution Server, Migration Server, Checkpoint Server, Resource Discovery Server, DSM Server), the kernel servers (Space Manage Server, IPC Server, Process Manage Server) and the GENESIS Microkernel.]
• Holos was developed based on the P2P and microkernel paradigms
• The microkernel provides services such as
  • local IPC
  • basic paging operations
  • interrupt handling
  • context switching
• Three groups of processes
  • kernel servers
  • system servers
  • application processes
• Kernel and system servers are stationary; application processes are mobile
• All processes communicate using messages

System Servers Form a Basis of an Autonomic Operating System for Non-dedicated Clusters
• Resource Discovery Server - collects data about computation and communication load
• Availability Server - dynamically and adaptively forms a parallel virtual cluster for the application
• Global Scheduling Server - maps application processes onto the computers of the virtual parallel cluster using static allocation and dynamic load balancing

System Servers Form a Basis of an Autonomic Operating System for Non-dedicated Clusters
• Execution Server - coordinates the single, multiple and group creation and duplication of application processes on both local and remote computers
• Migration Server - coordinates moving application processes to other computers
• DSM Server - hides the distributed nature of the cluster’s memory and allows code to be written as though it used physically shared memory

System Servers Form a Basis of an Autonomic Operating System for Non-dedicated Clusters
• Checkpoint Server - coordinates the creation of checkpoints for an executing application
• Fault Recovery Server - recovers application processes and applications using checkpoints
• IAC Server - supports remote interprocess communication and group communication within sets of application processes
• Brokerage Server - supports advertising and sharing of services through service exporting, importing and revoking

Holos Possesses the Autonomic Computing Characteristics

Conclusion
• Autonomic computing has been shown to be a basic part of a revolutionary technology that
  • Could move parallel computing on non-dedicated clusters into the computing mainstream
  • (Whether it will start the new .com boom remains to be shown)
• The development of the Holos cluster operating system demonstrates that it is possible to build an autonomic non-dedicated cluster
• The Holos cluster operating system has been built from scratch
