
Enforcing User-defined Management Logic in Large Scale Systems


Presentation Transcript


  1. Enforcing User-defined Management Logic in Large Scale Systems Srinath Perera Indiana University, Bloomington

  2. Outline • Motivation & the Problem • Related Work • Proposed Architecture • Scalability Results • Robustness • Contributions

  3. Motivation: Large-Scale Systems • IT is becoming a part of our everyday life. • Increases the size of the potential user base of systems (Google, Facebook, Amazon …). • Information avalanche. • National- and global-scale data collection. • Success in this setting is decided by our ability to make sense of this data – scale matters (Google!). • Technological advances. • Connectivity, SOA, and complex systems are possible. • Computing power everywhere (multicore, smart phones). • Cloud - lowers the barrier to scale. We have the need and the means to build large-scale systems.

  4. Building Them is Feasible, but Keeping Them Running? • Changes are the norm rather than the exception - 10,000 servers, each with an MTTF of a thousand days => 10 failures/day [Jeff Dean]. • High operational cost - when a system scales up, complexity increases. • More than 75% of TCO (Total Cost of Ownership), based on Patterson et al.'s data (dominated by salaries). • 50% of the IT budget is spent on recovering from failures [Ganek et al.]. • Unreliable middleware - Grid reliability across all operations is 55%-80% [Khalili et al.]. The success rate of a service or a workflow with 6 grid operations is then only about 0.26 (0.8^6, using the optimistic 80% figure)!!! • Efforts to avoid failures have been unsuccessful - not a problem to be solved, but a fact to cope with [Patterson]. System management is a potential solution to this problem!!

  5. A Management Framework for Large-Scale Systems Should • Support user-defined management logic. • Management use cases differ from system to system • => only big organizations can afford to build system-specific frameworks • => user-defined management logic is needed. • Ease of authoring management logic is important. • Be scalable. • Be robust – changes are the norm rather than the exception! • Be dynamic - resources often join and leave. We need a dynamic and robust management framework that supports user-defined management logic.

  6. The Problem • Large-scale systems need many managers • One manager is neither scalable nor robust. • Each manager has a partial view of the system • a subset of resources is assigned to each manager. • But a global view is preferred (ease of authoring logic) • Logic that works only on local data requires emergent properties, which are hard for users to author. • We all think in terms of global properties. • Example: “If the system does not have 5 message brokers, create new brokers and connect them to the broker network.”: detect <5 brokers, find the best place to create a new one, create it, and connect it to the existing brokers. Problem: How do we enforce user-defined management logic (that depends on a global view) on large-scale systems? And how do we apply such a framework to manage systems?

  7. Related Work • Systems without global control • Centralized management systems (e.g. Rainbow). • Managers that act independently (e.g. Extreme (Kx), DREAM) and manual coordination (e.g. IBM Tivoli). • Systems with global control • Decentralized control - DMonA and Deugo et al. • Monitor and run a state machine of the system - Dubey et al. • Consistent shared view - Georgiadis et al.: component managers collaborate via totally ordered multicast to maintain a system according to architectural constraints.

  8. Related Work (Contd.) • Systems with global control (contd.) • Management hierarchy • A management hierarchy where the topmost layer is replicated (e.g. MonALISA, Gadgil et al.). • Typically, aggregation is used at each level. • Aggregation hides information about individual resources. • Hierarchy with policies • WildCat - an agent-group-based hierarchy that communicates via whiteboards and uses policies to control agents. The authors are concerned about the scalability of the whiteboards. • Cooperating managers - no global control loop • Schoenwaelder - a group of cooperating agents and a master agent (IP multicast). • ANDREA - creates dynamic hierarchies and delegates tasks to other managers via delegate statements in the management logic.

  9. Outline of the Evidence • Proposed solution: the Hasthi architecture. • Useful • Application to a large-scale e-Science project (LEAD). • Sound • Scalable (empirical results). • Robust and dynamic (proof + empirical results). • Main contribution: “Proposing, implementing, and analyzing a dynamic and robust management architecture, which can manage large-scale systems by enforcing user-defined management logic that depends on a global view of the managed system state, and the application of that management logic to manage systems.”

  10. Big Picture (Hasthi) • Hasthi has three parts: • Manager Cloud – a distributed architecture that binds the managers and resources in the system into one cohesive unit. • Meta-model that represents the system state. • Decision framework.

  11. Manager Cloud • Managers form a P2P network (Pastry), which is used for initialization and recovery (elections; a toy election sketch follows below). • Normal operations use SOAP over HTTP.
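The election mechanism itself is not detailed on this slide. As a purely illustrative, hedged sketch (not Hasthi's actual algorithm, and without the real FreePastry API): one common ring-overlay convention is to let the live manager whose identifier is numerically closest to a well-known key act as coordinator. All names below (ElectionSketch, COORDINATOR_KEY) are hypothetical.

import java.math.BigInteger;
import java.security.SecureRandom;
import java.util.List;

// Hypothetical, Pastry-free illustration of a ring-style election:
// the live manager whose id is closest to a well-known key wins.
public class ElectionSketch {

    static final BigInteger RING_SIZE = BigInteger.ONE.shiftLeft(128);   // 128-bit id space
    static final BigInteger COORDINATOR_KEY = BigInteger.valueOf(42);    // assumed well-known key

    // Distance between two ids on the ring, taking the shorter way around.
    static BigInteger ringDistance(BigInteger a, BigInteger b) {
        BigInteger d = a.subtract(b).mod(RING_SIZE);
        return d.min(RING_SIZE.subtract(d));
    }

    // Returns the id of the live manager closest to the coordinator key.
    static BigInteger electCoordinator(List<BigInteger> liveManagerIds) {
        BigInteger winner = null;
        for (BigInteger id : liveManagerIds) {
            if (winner == null || ringDistance(id, COORDINATOR_KEY)
                    .compareTo(ringDistance(winner, COORDINATOR_KEY)) < 0) {
                winner = id;
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        SecureRandom rnd = new SecureRandom();
        List<BigInteger> managers = List.of(
                new BigInteger(128, rnd), new BigInteger(128, rnd), new BigInteger(128, rnd));
        System.out.println("Elected coordinator id: " + electCoordinator(managers));
    }
}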

  12. Meta-Model • The meta-model represents the monitoring data collected from the system. The summarized meta-model provides a global view. • Delta-consistency – changes are reflected within a bounded time (a concept borrowed from shared-memory multiprocessors [see Singla et al.]); a small sketch follows below.
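As a minimal sketch of what delta-consistency means in practice (class and field names are assumptions, not Hasthi's API): a meta-object carries the time of its last refresh, and a reader can check whether its copy is still within the agreed bound.

import java.time.Duration;
import java.time.Instant;

// Hypothetical meta-object: one resource's monitored state plus the time it was last refreshed.
public class MetaObjectSketch {
    private final String resourceId;
    private String state;
    private Instant lastUpdated;

    public MetaObjectSketch(String resourceId, String state) {
        this.resourceId = resourceId;
        this.state = state;
        this.lastUpdated = Instant.now();
    }

    public synchronized void update(String newState) {
        this.state = newState;
        this.lastUpdated = Instant.now();   // a manager calls this when a heartbeat arrives
    }

    // Delta-consistency check: is this copy no older than the agreed bound?
    public synchronized boolean isFresh(Duration delta) {
        return Duration.between(lastUpdated, Instant.now()).compareTo(delta) <= 0;
    }

    public static void main(String[] args) {
        MetaObjectSketch broker = new MetaObjectSketch("broker-1", "RunningState");
        // With a 30-second bound, any decision based on this copy tolerates up to 30 s of staleness.
        System.out.println(broker.resourceId + " (" + broker.state + ") fresh within 30s? "
                + broker.isFresh(Duration.ofSeconds(30)));
    }
}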

  13. Decision Framework • Users define management logic as rules: local and global. • Manager control loops evaluate the partial meta-models using local rules. • The coordinator control loop evaluates the summarized meta-model using global rules (global view). • Actions triggered by the rules analyze the meta-model and decide on solutions. (A skeleton of the two control loops is sketched below.)
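A hedged skeleton of the two-level control loops described here, with rule evaluation left as a pluggable callback (in Hasthi it is a Drools session; everything named below is an assumption):

import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical two-level control loop: managers evaluate partial meta-models with local
// rules; the coordinator evaluates the summarized meta-model with global rules.
public class ControlLoopSketch {

    static ScheduledExecutorService startLoop(String name, List<String> metaModel,
                                              Consumer<List<String>> evaluateRules, long epochSeconds) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(() -> {
            System.out.println(name + ": evaluating " + metaModel.size() + " meta-objects");
            evaluateRules.accept(metaModel);   // fire local or global rules over the meta-objects
        }, 0, epochSeconds, TimeUnit.SECONDS);
        return timer;
    }

    public static void main(String[] args) throws InterruptedException {
        // Short 5-second epochs for this demo; the experiments on later slides used 30 seconds.
        ScheduledExecutorService managerLoop =
                startLoop("manager-loop", List.of("service-1", "service-2"), m -> { /* local rules */ }, 5);
        ScheduledExecutorService coordinatorLoop =
                startLoop("coordinator-loop", List.of("summary-of-all-managers"), m -> { /* global rules */ }, 5);

        Thread.sleep(12_000);       // let a couple of epochs run
        managerLoop.shutdown();
        coordinatorLoop.shutdown();
    }
}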

  14. Management Rules

rule "RestartFailedServices"
when
    service : ManagedService(state == "CrashedState");
    host : Host(state != "CrashedState", service.host == name);
then
    system.invoke(new RestartAction(service), new ActionCallback() {
        public void actionSucessful(ManagementAction action) {
            .....
        }
        public void actionFailed(ManagementAction action, Throwable e) {
            service.setState("UnRepairableState");
            system.invoke(new UserInteractionAction(system, service, action, e));
        }
    });
end

• When the condition given in the object query language is met, the actions in the then-clause are carried out. • The Rete algorithm is used to evaluate meta-objects and execute corrective actions - a tradeoff between space and time. Rules (Drools) evaluate meta-objects (which represent resources) and execute actions, which analyze the meta-objects and decide on solutions.

  15. Management Actions • Action types: • Create a new service. • Restart a running service or recover a failed service. • Relocate a service. • Tune and configure a resource – change the configuration of a resource or change the structure of the system. • User-interaction action. • Action implementations: • Use shell scripts (e.g. service start or stop) and execute them through a Host Agent running on each host (a sketch of script execution follows below). • Use a Hasthi agent integrated with each resource. • Hasthi provides default management actions, but users can write their own.
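A minimal sketch of the shell-script route, assuming a hypothetical restart script on the target host (this is plain JDK ProcessBuilder usage, not the actual Host Agent code):

import java.io.IOException;

// Hypothetical sketch of a host agent running a restart script for a service.
public class RestartScriptSketch {

    // Runs a shell script with the service name as an argument; true if it exits with status 0.
    static boolean runScript(String scriptPath, String serviceName)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder("/bin/sh", scriptPath, serviceName)
                .inheritIO()     // show the script's output on this process's console
                .start();
        return p.waitFor() == 0;
    }

    public static void main(String[] args) throws Exception {
        boolean ok = runScript("./restart-service.sh", "message-broker-1");   // assumed script name
        System.out.println(ok ? "restart succeeded" : "restart failed; escalate to the user");
    }
}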

  16. Management Complexities • Even with a global view, management can go wrong in many ways. The following are some complexities and proposed remedies (see Chapter 7 for details). • Failed management actions – Hasthi uses the resource lifecycle, which sets the resource state to “Unrecoverable” if an action fails and asks for user help. • Lost system structure (broken links) – services can use the “dependency-discovery” operation to find other services. • Lost state – Hasthi does not preserve state but helps resources locate their storage locations (a resource exposes the location as a property, and Hasthi passes it as an argument when it recovers the service). • Lost messages – retry + session-level checkpoints (see the sketch after this list). • False positives (custom failure detectors) & network partitions.
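The "retry + session-level checkpoints" remedy can be illustrated with a small bounded-retry helper (a sketch with assumed names; the checkpointing itself, i.e. resuming from the last acknowledged step, is not shown):

import java.util.concurrent.Callable;

// Hypothetical bounded-retry wrapper for an idempotent management call.
public class RetrySketch {

    static <T> T withRetry(Callable<T> call, int maxAttempts, long backoffMillis) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                System.out.println("attempt " + attempt + " failed: " + e.getMessage());
                Thread.sleep(backoffMillis * attempt);   // back off a little longer each time
            }
        }
        throw last;   // give up: the action is marked failed and the user is asked for help
    }

    public static void main(String[] args) throws Exception {
        String reply = withRetry(() -> "ok", 3, 500);    // stands in for a SOAP call to a resource
        System.out.println(reply);
    }
}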

  17. Application of Hasthi • Find the 10% of errors that happen 90% of the time. • Figure out how to preserve state across changes.

  18. LEAD Usecase • LEAD services are either stateless or have persistent state, and data storage is best-effort, so we can recover by restarting services. • Recover from host & service failures – restart the failed services. • Recover workflows - detect when the system has failed and recovered, and resurrect any failed workflows.

  19. Scalability: Test Setup • Coordinator test setup: • A Test-Manager simulates all messages generated by a normal manager managing a set of resources. • We simulated a large-scale system using Test-Managers; the coordinator does not see a difference. • Main test setup: • A large-scale deployment of LEAD with multiple replicas of the complete LEAD stack. • Each service simulates a management workload using a randomized algorithm. • A set of rules manages the system, and each test ran for 1 hour with a 30-second epoch time.

  20. Measurements (Metrics)

  21. One-Manager Overhead (resource heartbeat latency, manager loop overhead, manager heartbeat latency) and Managers' Overhead (coordinator loop, manager heartbeat) • One manager scales to 5000-8000 resources; Hasthi scales further with added managers. More tests are needed to find the limits.

  22. Coordinator Limit: (manager heartbeat latency, coordinator loop overhead) vs. resource count • The overhead is close to linear; the coordinator scales to 100,000 resources and 1000 managers, and the number of managers does not make much difference. • Why? (1) Summarization, (2) only changes are transferred (see the sketch below), (3) the Rete algorithm, which only evaluates changes (a tradeoff between speed and memory).
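"Only transfer changes" can be pictured as each manager keeping a dirty set and shipping just the meta-objects that changed since the last epoch (a sketch with assumed names, not Hasthi's wire protocol):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical delta-transfer bookkeeping inside a manager.
public class DeltaTransferSketch {
    private final Map<String, String> metaObjects = new HashMap<>();
    private final Set<String> dirty = new HashSet<>();

    void update(String resourceId, String newState) {
        String old = metaObjects.put(resourceId, newState);
        if (old == null || !old.equals(newState)) {
            dirty.add(resourceId);                       // remember what changed this epoch
        }
    }

    // Called once per epoch: return only the changed entries and clear the dirty set.
    Map<String, String> drainChanges() {
        Map<String, String> delta = new HashMap<>();
        for (String id : dirty) {
            delta.put(id, metaObjects.get(id));
        }
        dirty.clear();
        return delta;
    }

    public static void main(String[] args) {
        DeltaTransferSketch manager = new DeltaTransferSketch();
        manager.update("service-1", "RunningState");
        manager.update("service-2", "CrashedState");
        System.out.println("send to coordinator: " + manager.drainChanges());
        System.out.println("next epoch, nothing changed: " + manager.drainChanges());
    }
}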

  23. Manager Independence: (resource heartbeat, manager loop, manager heartbeat) vs. resources per manager • We measured the limit of a single manager and the limit of the coordinator. • Hypothesis: a manager's overhead depends only on the resources assigned to that manager, not on other managers or resources in the system • therefore we can scale up Hasthi (e.g. 100 managers with 1000 resources each). • Verifying the hypothesis: • A scatter plot of overhead vs. number of resources per manager. • Points with the same X values are reasonably close to each other. • The hypothesis holds up to at least 2000 resources per manager. • Why? Managers do not usually interact with other managers; they only talk to the coordinator.

  24. Scalability: Summary • One manager scales to 5000-8000 resources. • Managers depend only on the resources assigned to them (at least up to 2000 resources) and are not affected by other managers in the system. • The coordinator scales to 100,000 resources and 1000 managers (100-1000 resources per manager, below the 2000-resource limit from point #2). • The system scales to 100,000 resources.

  25. Robustness: Correctness Proof • Self-stabilization = the system reaches a safe state regardless of the initial state and remains in that state. • Proof outline: we took all states and proved that for any state there is a forced sequence that recovers the system within a bounded time. • We proved (in Chapter 5) that, given a system managed with Hasthi, there exists a constant h for that system such that Hasthi self-stabilizes if managers do not join or leave and communication failures do not happen for a continuous time interval of length h.

  26. Availability of Hasthi • Availability = MTTF / (MTTF + MTTR) … (1). The proof provides the recovery time; let us use it to calculate availability as a function of the MTTF of a single manager. • Assume a system managed with n independent managers, each manager having an MTTF (Mean Time To Failure) of Ѳ. • Managers are independent => we can model their failures with an exponential distribution (Srinivasan [143]). • Then p, the probability that no failure happens within a unit (second) of time, is given by Equation (2) (by Srinivasan [143]). • The MTTF of Hasthi is Ѳ/n (according to Baumann [108]) … (3).

  27. • Definition: NF(r) = the time elapsed until the first continuous failure-free interval of length r. • Then h_c = E[NF(r)]. • E[NF(r)] is the same as the expected time for r consecutive HEADS to occur with a biased coin that has probability p of a HEAD. • A known closed form exists for this expectation … (4). • Using (2) and (4), we can calculate h_c = E[NF(r)].

  28. • A similar result holds for recovering from manager failures: h_m = E[NF(m)]. • We have 1 coordinator and n-1 managers; therefore MTTR follows as given in Equation (5). • Using h_m and h_c we can therefore find MTTR. • We know both MTTR (by Equation 5) and MTTF (by Equation 3); therefore, we know availability = MTTF / (MTTF + MTTR) as a function of Ѳ (the MTTF of one manager). A reconstruction of Equations (1)-(4) is sketched below.
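Equations (2), (4), and (5) were figures in the original slides and do not survive in this transcript. Under the slides' own assumptions (n independent managers with exponentially distributed failures, Ѳ expressed in seconds), a plausible reconstruction of (1)-(4) is sketched below; Equation (5), which combines h_c and h_m into MTTR, is not reconstructed here because its exact form is not shown.

\begin{align}
\text{Availability} &= \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}} && (1) \\
p &= \Pr[\text{no failure in one second}] = e^{-n/\theta} && (2)\ \text{(assumed form)} \\
\text{MTTF}_{\text{Hasthi}} &= \theta / n && (3) \\
E[\mathit{NF}(r)] &= \frac{1 - p^{r}}{(1 - p)\,p^{r}} && (4)\ \text{(standard waiting time for } r \text{ consecutive successes)}
\end{align}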

  29. Parameters • Ѳ = MTTF of a manager. • r, m = the continuous time intervals defined by the proof. • n = the number of managers in the system. • Since our proof provides an upper bound on the recovery time, the result is a lower bound on availability.

  30. Availability vs. Manager MTTF • Availability classes defined by Gray et al.: • Managed systems (83 hours of downtime/year). • Well-managed systems (9 hours of downtime/year). • Fault-tolerant systems (1 hour of downtime/year).

  31. Robustness: Empirical Results • We instrumented Hasthi to generate events about its status, added a new manager, killed the current coordinator, and measured the time to detect the failure, to recover Hasthi, and to rebuild the meta-model. • We ran the test 100 times. Detection time decreases (O(1/n)), election time increases (O(log(n))), recovery time increases, and the overall time decreases!! Recovery takes about 80 seconds.

  32. Availability of the Managed System • With LEAD, recovery took about 2 minutes (60 + 20 + 30 sec). • By our calculation, the availability of LEAD with Hasthi is 0.995 - 0.999, which is about 10-40 hours of downtime per year (a quick arithmetic check follows below).
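A quick arithmetic check of the downtime range, assuming an 8760-hour year:

\[
\text{downtime per year} = (1 - A) \times 8760\ \text{h}:\qquad
(1 - 0.999) \times 8760 \approx 8.8\ \text{h},\qquad
(1 - 0.995) \times 8760 \approx 43.8\ \text{h},
\]

which roughly matches the slide's 10-40 hours of downtime per year.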

  33. Implications of Our Results • With a global view of the system, users can author management logic the same way they reason about the system (easy and intuitive). • There is a tradeoff between scalability and explicit management logic, but Hasthi covers most use cases while supporting explicit user-defined management logic. • When building generic management frameworks, it is possible to enforce user-defined global and local management logic in most real-world use cases.

  34. Contributions • Problem: enforcing user-defined management logic (that depends on a global view of the managed system) on large-scale systems, and applying such a framework to manage systems. • Proposed an architecture to solve this problem (the “Manager-Cloud algorithm” + monitoring information as a meta-model of the system that exhibits delta-consistency + the decision framework). • Proved its robustness analytically and verified it empirically. • Implemented the architecture and empirically demonstrated that it can scale to manage most real-world use cases • a demonstration that, despite its dependency on a global view, a management framework can scale to manage most real-world use cases. • Analyzed applications of user-defined management logic to manage systems, proposed solutions to the management complexities that arise from these applications, and applied it to manage a large-scale e-science project.

  35. Questions

  36. Future Work • Graphical composition of management logic, to simplify management-logic authoring. • Building a distributed service container on top of Hasthi. • Making the coordinator lightweight, to try to raise the scalability limit of Hasthi. • Further exploring the application of management frameworks.

  37. Backup Slides

  38. Sensitivity: Rules • To find the sensitivity to rules, we used 7 rule sets, each with more rules than the one before, at 40,000 resources. • The overhead is almost linear and appears stable. We also verified this by running 100,000 resources against the most complex rule set.

  39. Sensitivity: Epoch Time • Epoch times are the periods between heartbeats, control-loop evaluations, etc., and they decide how fast Hasthi reacts to failures. • Why does the overhead decrease with a smaller epoch? The Rete algorithm remembers old results and only evaluates new ones. A small epoch means fewer changes per epoch, which means less overhead!!

  40. Sensitivity: Workload • We increased failures in the system (increasing the workload on Hasthi) and measured with 40,000 resources. • Hasthi is stable. Why? Hasthi uses a job queue to execute actions asynchronously, and can therefore withstand higher workloads and surges.

  41. Useful: LEAD Integration • We integrated Hasthi with LEAD. Hasthi recovers LEAD from service and host failures and recovers failed workflows. • A) We killed a service, B) we killed a host, and we measured the time to detect the failure, trigger actions, have new resources join, and detect healthy conditions. It takes about 2 minutes to recover the system and to know it is healthy.

  42. Comparison with Gadgil et al. • CGLM evaluates each resource in parallel; Hasthi does it as a batch. • Hasthi creates an HTTP connection each time, whereas CGLM uses a pool of connections.

  43. Comparison With Gadgil et al. Contd.

  44. Resource LifeCycle

  45. Types of Management Agents

  46. In Memory Agent Implementation

  47. Management Action Implementation
