This work delves into understanding and managing operator errors in Internet services, which often lead to service outages. The study reveals that many outages stem from operator actions, a consequence of complex architectures, constantly evolving systems, and a lack of effective tools. By categorizing 42 operator mistakes, the authors examine their impact on service availability and propose strategies for prevention, recovery, and automation. They also developed a validation framework that allows operators to test their actions before deployment, significantly reducing the impact of the mistakes observed in practice.
Understanding and dealing with operator mistakes in Internet services
K. Nagaraja, F. Oliveira, R. Bianchini, R. Martin, T. Nguyen, Rutgers University
OSDI 2004
Vivo Project: http://vivo.cs.rutgers.edu
(based on slides from the authors' OSDI presentation)
Fabián E. Bustamante, Winter 2006
Motivation
• Internet services are ubiquitous, e.g., Google, Yahoo!, eBay
• Users expect 24x7 availability, but service outages still happen!
• A significant number of outages in Internet services result from operator actions:
  1: Architecture is complex
  2: Systems are constantly evolving
  3: Operators lack tools to reason about the impact of their actions (offline testing, emulation, simulation)
• Very little detail on operator mistakes is publicly available
• Details are closely guarded by companies and administrators
CS 395/495 Autonomic Computing Systems, EECS, Northwestern University
This work
• Understanding: gather detailed data on operators' mistakes
  • What categories of mistakes?
  • What's the impact on the service?
  • How do mistakes correlate with experience and impact?
  • Caveat: this is not a complete study of operator behavior
• Approaches to deal with operator mistakes: prevention, recovery, automation
• Validation: allow operators to evaluate the correctness of their actions before exposing them to the service
  • Like offline testing, but:
    • Virtual environment (extension of the online environment)
    • Real workload
    • Migration back and forth with minimal operator involvement
Contributions
• Detailed information on operator tasks and mistakes
  • 43 experiments yielding detailed data on operator behavior, including 42 mistakes
  • 64% immediately degraded throughput
  • 57% were software configuration mistakes
  • Human experiments are possible and valuable!
• Designed and prototyped a validation infrastructure
  • Implemented on 2 cluster-based services: a cooperative Web server (PRESS) and a multi-tier auction service
  • 2 techniques to allow operators to validate their actions
• Demonstrated validation is a promising technique for reducing the impact of operator mistakes
  • 66% of all mistakes observed in the operator study were caught
  • 6/9 mistakes caught in live operator experiments with validation
  • Successfully tested with synthetically injected mistakes
Talk outline
• Approach and contributions
• Operator study: understanding the mistakes
  • Representative environment
  • Choice of human subjects and experiments
  • Results
• Validation: preventing exposure of mistakes
• Conclusion and future work
Multi-tiered Internet services
• Online auction service, similar to eBay
• Client emulator exercises the service
• [Diagram: Tier 1 – Web servers; Tier 2 – application servers; Tier 3 – database]
• Code from the DynaServer project!
Tasks, operators & training
• Tasks – two categories
  • Scheduled maintenance tasks (proactive), e.g., upgrading software
  • Diagnose-and-repair tasks (reactive), e.g., a disk failure
• Operator composition
  • 14 computer science graduate students
  • 5 professional programmers (Ask Jeeves)
  • 2 sysadmins from our department
• Categorization of operators – via a filled-in questionnaire
  • 11 novices – some familiarity with the setup
  • 5 intermediates – experience with a similar service
  • 5 experts – in charge of a service requiring high uptime
• Operator training
  • Novice operators given warm-up tasks
  • Material describing the service, and detailed steps for tasks
Experimental setup
• Service
  • 3-tier auction service and client emulator from Rice University's DynaServer project
  • Loaded at 35% of capacity
• Machines
  • 2 Web servers (Apache)
  • 5 application servers (Tomcat)
  • 1 database machine (MySQL)
• Operator assistance & data capture
  • Monitor service throughput
  • Modified bash shell for command and result tracing
  • Manual observation
    • Noting anomalies in operator behavior
    • Bailing out 'lost' operators
Example trace
• Task: add an application server
• Mistake: Apache misconfiguration
• Impact: degraded throughput
• [Trace annotations: first Apache misconfigured and restarted; second Apache misconfigured and restarted; application server added]
Sampling of other mistakes
• Adding a new application server
  • Omission of the new application server from the backend member list
  • Syntax errors, duplicate entries, wrong hostnames
  • Launching the wrong version of software
• Migrating the database for a performance upgrade
  • Incorrect privileges for accessing the database
  • Security vulnerability
  • Database installed on the wrong disk
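Several of the "adding a new application server" mistakes above are mechanically checkable. A minimal sketch of a config sanity checker for that class of mistake follows; the function name, config shape, and hostname rule are illustrative assumptions, not part of the paper's testbed.

```python
# Hedged sketch: flag the kinds of member-list misconfigurations the study
# observed (missing new server, duplicate entries, malformed hostnames).
import re

def check_backend_list(backends, new_server):
    """Return a list of problems found in a backend member list."""
    problems = []
    if new_server not in backends:
        problems.append(f"new server {new_server!r} missing from member list")
    seen = set()
    for host in backends:
        if host in seen:
            problems.append(f"duplicate entry: {host!r}")
        seen.add(host)
        # crude hostname syntax check: letters, digits, dots, hyphens only
        if not re.fullmatch(r"[A-Za-z0-9.-]+", host):
            problems.append(f"malformed hostname: {host!r}")
    return problems

# Example: the operator forgot the new server, duplicated an old entry,
# and mistyped a hostname -- three findable problems.
issues = check_backend_list(["app1", "app1", "bad host!"], "app6")
for issue in issues:
    print(issue)
```

A checker like this covers only syntactic mistakes; the study's point is that many errors (e.g., launching the wrong software version) only surface under a realistic workload, which motivates validation.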
Operator mistakes: category vs. impact
• 64% of all mistakes had an immediate impact on service performance
• 36% resulted in latent faults
• Obs. #1: a significant number of mistakes can be caught by testing in a realistic environment
• Obs. #2: undetectable latent errors will still require online-recovery techniques
Operator mistakes
• Misconfigurations account for 57% of all errors
• Configuration mistakes spanning multiple components (global misconfigurations) are more likely
• Obs. #1: tools to manipulate & check configurations are crucial
• Obs. #2: carefully maintain multiple versions of software
Operator categories
• Experts also made mistakes!
• But the tasks executed by experts were more complex
Summary of operator study
• 43 experiments, 42 mistakes
• 27 (64%) mistakes caused an immediate impact on service performance
• 24 (57%) were software configuration mistakes
• Mistakes were made across all operator categories
• Traces of operator commands & service performance for all experiments
  • Available at http://vivo.cs.rutgers.edu
Talk outline
• Approach and contributions
• Operator study: understanding the mistakes
• Validation: preventing exposure of mistakes
  • Technique
  • Experimental evaluation
• Conclusion and future work
Validation of operator's actions
• Validation
  • Allow operators to check the correctness of their actions before exposing their impact at the service interface (to clients)
• Correctness is tested by:
  • Migrating the component(s) to a virtual sandbox environment,
  • Subjecting it to a real load,
  • Comparing its behavior to a known correct one, and
  • Migrating it back to the online environment
• Types of validation:
  • Replica-based: compare with an online replica (in real time)
  • Trace-based: compare with logged behavior
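The replica-based flow above can be sketched as a shunt loop: each live request is answered by the online slice and duplicated into the validation slice, and a comparator flags divergence. This is a toy sketch; the function names (`shunt_and_compare`, `online`, `candidate`) are illustrative, and the real system performs shunting inside the middleware layer, not at application level.

```python
# Hedged sketch of replica-based validation: duplicate each request to the
# component under validation and compare its answer to the online replica's.
def shunt_and_compare(requests, online, candidate, compare):
    """Forward each request to both slices; collect mismatches."""
    mismatches = []
    for req in requests:
        live = online(req)       # online slice answers the client as usual
        shadow = candidate(req)  # shunt duplicates the request into the sandbox
        if not compare(live, shadow):
            mismatches.append((req, live, shadow))
    return mismatches

# Toy scenario: the misconfigured candidate component drops query strings,
# so only the request carrying one diverges from the online replica.
online = lambda r: f"page:{r}"
candidate = lambda r: f"page:{r.split('?')[0]}"
bad = shunt_and_compare(["/a", "/b?x=1"], online, candidate, lambda a, b: a == b)
print(len(bad))
```

Trace-based validation follows the same loop with `online` replaced by a lookup into previously logged responses, trading real-time comparison for lower overhead on the live slice.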
Validating a component: replica-based
• [Diagram: client requests enter the online slice (Tier 1 Web servers, Tier 2 application servers, Tier 3 database); a shunt duplicates requests into the validation slice through Web server and database proxies, and application state and responses are compared against the online replica]
Validating a component: trace-based
• [Diagram: client requests and state are logged at the shunt in the online slice; the validation slice replays the log through Web server and database proxies, comparing responses and state against the recorded behavior]
Implementation details
• Shunting performed in the middleware layer
  • Each request tagged with a unique ID along the entire request path
• Component proxies can be constructed with little effort (the MySQL proxy is ~384 NCSL, vs. the ~402k NCSL of MySQL itself)
  • Reuse discovery and communication interfaces, common messaging core
• State management requires a well-defined export and import API
  • Stateful servers often support such an API
• Comparator functions to detect errors
  • Simple throughput, flow, and content comparators
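Two of the comparators named above might look as follows. This is a minimal sketch, not the paper's code: the 10% throughput tolerance is an assumed threshold, and hashing full bodies down to fixed-size digests is in the spirit of the 64-byte response-summary optimization mentioned on the overheads slide.

```python
# Hedged sketch of a throughput comparator and a content comparator.
import hashlib

def throughput_ok(online_rps, validated_rps, tolerance=0.1):
    """Accept if the validated component's throughput is within
    `tolerance` (fraction) of the online component's."""
    return validated_rps >= online_rps * (1.0 - tolerance)

def content_ok(online_body, validated_body):
    """Compare fixed-size digests instead of full response bodies,
    so only a small summary must cross into the validation slice."""
    digest = lambda b: hashlib.sha256(b).digest()
    return digest(online_body) == digest(validated_body)

print(throughput_ok(100.0, 95.0))        # True: within 10%
print(content_ok(b"ok", b"ok"))          # True: identical content
print(content_ok(b"ok", b"corrupt"))     # False: content diverged
```

Flow comparators (did the request traverse the expected components?) would similarly compare the per-request unique IDs recorded along the request path.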
Validating our prototype: results
• Live operator experiments
  • Operators given a choice of validation type and duration, and the option to skip validation
  • Validation caught 6 of the 9 mistakes made in the 8 experiments with validation
• Mistake-injection experiments
  • Validation caught errors in data content (inaccessible files, corrupted files) and configuration mistakes (an incorrect number of workers in the Web server, which degraded throughput)
• Operator-emulation experiments
  • Operator command scripts derived from the 42 observed operator mistakes
  • Both trace-based and replica-based validation caught 22 mistakes
  • Multi-component validation caught 4 latent (component-interaction) mistakes
Reduction in impact with validation
Fewer mistakes with validation
Shunting & buffering overheads
• Shunting overhead for replica-based validation: 39% additional CPU
  • All requests and responses are captured and forwarded to the validation slice
• Trace-based validation is slightly better: 32% additional CPU
  • Overhead is incurred on a single component, and only during validation
• Various optimizations can reduce the overhead to 13-22%
  • Examples: response summaries (64 bytes), sampling (at session boundaries)
• Buffering capacity during state checkpointing and duplication
  • Only about 150 requests needed to be buffered for small state sizes
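The buffering point above can be made concrete with a small sketch: while a component's state is checkpointed and copied into the validation slice, requests arriving at that component are queued and replayed afterward. The class below is illustrative, not the paper's implementation; the default capacity of 150 echoes the figure reported for small state sizes.

```python
# Hedged sketch of request buffering during state checkpointing/duplication.
from collections import deque

class CheckpointBuffer:
    def __init__(self, capacity=150):
        self.capacity = capacity
        self.pending = deque()

    def enqueue(self, req):
        """Buffer a request arriving mid-checkpoint; fail if the
        checkpoint outlasts the buffer."""
        if len(self.pending) >= self.capacity:
            raise OverflowError("buffer full; checkpoint took too long")
        self.pending.append(req)

    def replay(self, handler):
        """Checkpoint finished: drain buffered requests in arrival order."""
        while self.pending:
            handler(self.pending.popleft())

# Three requests arrive during the checkpoint, then get replayed in order.
buf = CheckpointBuffer(capacity=3)
for r in ["r1", "r2", "r3"]:
    buf.enqueue(r)
served = []
buf.replay(served.append)
print(served)
```

The bounded capacity matters: it ties the tolerable checkpoint duration to the request arrival rate, which is why the technique is practical mainly for components with small state.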
Caveats, limitations & open issues
• Non-determinism increases the complexity of comparators and proxies
  • E.g., choice of back-end server, remote cache vs. local disk, pseudo-random session IDs, timestamps
• Hard-state management may require operator intervention
  • Components require initialization prior to online migration
• Bootstrapping the validation
  • Validating an intended modification of service behavior – nothing to compare with!
• How long to validate? What types of validation?
  • Time spent in validation implies reduced online capacity
• Future work: taking validation further…
  • Validate operator actions on databases and network components
  • Combine validation with diagnosis to assist operators
  • Other validation techniques: model-based validation