This work delves into understanding and managing operator errors in Internet services, which often lead to service outages. The study reveals that many outages stem from operator actions, a consequence of complex architectures, constantly evolving systems, and a lack of effective tools. By categorizing 42 operator mistakes, the authors examine their impact on service availability and propose strategies for prevention, recovery, and automation. They also developed a validation framework that allows operators to test their actions before deployment, significantly reducing the impact of the mistakes observed in practice.
Understanding and dealing with operator mistakes in Internet services
K. Nagaraja, F. Oliveira, R. Bianchini, R. Martin, T. Nguyen, Rutgers University
OSDI 2004
Vivo Project: http://vivo.cs.rutgers.edu
(based on slides from the authors' OSDI presentation)
Fabián E. Bustamante, Winter 2006
Motivation
• Internet services are ubiquitous, e.g., Google, Yahoo!, eBay
• Users expect 24x7 availability, but service outages still happen!
• A significant number of outages in Internet services result from operator actions:
  1: Architecture is complex
  2: Systems are constantly evolving
  3: Operators lack tools to reason about the impact of their actions (offline testing, emulation, simulation)
• Very little detail on operator mistakes is publicly available
• Details are closely guarded by companies and administrators
CS 395/495 Autonomic Computing Systems, EECS, Northwestern University
This work
• Understanding: gather detailed data on operators' mistakes
  • What categories of mistakes?
  • What's the impact on the service?
  • How do mistakes correlate with experience and impact?
  • Caveat: this is not a complete study of operator behavior
• Approaches to deal with operator mistakes: prevention, recovery, automation
• Validation: allow operators to evaluate the correctness of their actions before exposing them to the service
  • Like offline testing, but:
    • Virtual environment (extension of the online environment)
    • Real workload
    • Migration back and forth with minimal operator involvement
Contributions
• Detailed information on operator tasks and mistakes
  • 43 experiments yielding detailed data on operator behavior, including 42 mistakes
  • 64% immediately degraded throughput
  • 57% were software configuration mistakes
  • Human experiments are possible and valuable!
• Designed and prototyped a validation infrastructure
  • Implemented on 2 cluster-based services: a cooperative Web server (PRESS) and a multi-tier auction service
  • 2 techniques to allow operators to validate their actions
• Demonstrated validation is a promising technique for reducing the impact of operator mistakes
  • 66% of all mistakes observed in the operator study were caught
  • 6/9 mistakes caught in live operator experiments with validation
  • Successfully tested with synthetically injected mistakes
Talk outline
• Approach and contributions
• Operator study: understanding the mistakes
  • Representative environment
  • Choice of human subjects and experiments
  • Results
• Validation: preventing exposure of mistakes
• Conclusion and future work
Multi-tiered Internet services
• Online auction service, similar to eBay
• Client emulator exercises the service
• [Diagram: Tier 1 – Web servers; Tier 2 – application servers; Tier 3 – database]
• Code from the DynaServer project!
Tasks, operators & training
• Tasks – two categories
  • Scheduled maintenance tasks (proactive), e.g., upgrading software
  • Diagnose-and-repair tasks (reactive), e.g., a disk failure
• Operator composition
  • 14 computer science graduate students
  • 5 professional programmers (Ask Jeeves)
  • 2 sysadmins from our department
• Categorization of operators – via a filled-in questionnaire
  • 11 novices – some familiarity with the setup
  • 5 intermediates – experience with a similar service
  • 5 experts – in charge of a service requiring high uptime
• Operator training
  • Novice operators given warm-up tasks
  • Material describing the service, and detailed steps for tasks
Experimental setup
• Service
  • 3-tier auction service and client emulator from Rice University's DynaServer project
  • Loaded at 35% of capacity
• Machines
  • 2 Web servers (Apache)
  • 5 application servers (Tomcat)
  • 1 database machine (MySQL)
• Operator assistance & data capture
  • Monitor service throughput
  • Modified bash shell for command and result tracing
  • Manual observation
    • Noting anomalies in operator behavior
    • Bailing out 'lost' operators
Example trace
• Task: add an application server
• Mistake: Apache misconfiguration
• Impact: degraded throughput
• [Trace annotations: first Apache misconfigured and restarted; second Apache misconfigured and restarted; application server added]
Sampling of other mistakes
• Adding a new application server
  • Omission of the new application server from the backend member list
  • Syntax errors, duplicate entries, wrong hostnames
  • Launching the wrong version of software
• Migrating the database for a performance upgrade
  • Incorrect privileges for accessing the database
  • Security vulnerability
  • Database installed on the wrong disk
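Several of the "adding a new application server" mistakes above are mechanically checkable. A minimal sketch of a config sanity checker for that class of mistake follows; the function name, config shape, and hostname rule are illustrative assumptions, not part of the paper's testbed.

```python
# Hedged sketch: flag the kinds of member-list misconfigurations the study
# observed (missing new server, duplicate entries, malformed hostnames).
import re

def check_backend_list(backends, new_server):
    """Return a list of problems found in a backend member list."""
    problems = []
    if new_server not in backends:
        problems.append(f"new server {new_server!r} missing from member list")
    seen = set()
    for host in backends:
        if host in seen:
            problems.append(f"duplicate entry: {host!r}")
        seen.add(host)
        # crude hostname syntax check: letters, digits, dots, hyphens only
        if not re.fullmatch(r"[A-Za-z0-9.-]+", host):
            problems.append(f"malformed hostname: {host!r}")
    return problems

# Example: the operator forgot the new server, duplicated an old entry,
# and mistyped a hostname -- three findable problems.
issues = check_backend_list(["app1", "app1", "bad host!"], "app6")
for issue in issues:
    print(issue)
```

A checker like this covers only syntactic mistakes; the study's point is that many errors (e.g., launching the wrong software version) only surface under a realistic workload, which motivates validation.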
Operator mistakes: category vs. impact
• 64% of all mistakes had an immediate impact on service performance
• 36% resulted in latent faults
• Obs. #1: a significant number of mistakes can be caught by testing in a realistic environment
• Obs. #2: undetectable latent errors will still require online-recovery techniques
Operator mistakes
• Misconfigurations account for 57% of all errors
• Configuration mistakes spanning multiple components (global misconfigurations) are more likely
• Obs. #1: tools to manipulate & check configurations are crucial
• Obs. #2: carefully maintain multiple versions of software
Operator categories
• Experts also made mistakes!
• But the tasks executed by experts were more complex
Summary of operator study
• 43 experiments, 42 mistakes
• 27 (64%) mistakes caused an immediate impact on service performance
• 24 (57%) were software configuration mistakes
• Mistakes were made across all operator categories
• Traces of operator commands & service performance for all experiments
  • Available at http://vivo.cs.rutgers.edu
Talk outline
• Approach and contributions
• Operator study: understanding the mistakes
• Validation: preventing exposure of mistakes
  • Technique
  • Experimental evaluation
• Conclusion and future work
Validation of operator's actions
• Validation
  • Allow operators to check the correctness of their actions before exposing their impact at the service interface (to clients)
• Correctness is tested by:
  • Migrating the component(s) to a virtual sandbox environment,
  • Subjecting it to a real load,
  • Comparing its behavior to a known correct one, and
  • Migrating it back to the online environment
• Types of validation:
  • Replica-based: compare with an online replica (in real time)
  • Trace-based: compare with logged behavior
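The replica-based flow above can be sketched as a shunt loop: each live request is answered by the online slice and duplicated into the validation slice, and a comparator flags divergence. This is a toy sketch; the function names (`shunt_and_compare`, `online`, `candidate`) are illustrative, and the real system performs shunting inside the middleware layer, not at application level.

```python
# Hedged sketch of replica-based validation: duplicate each request to the
# component under validation and compare its answer to the online replica's.
def shunt_and_compare(requests, online, candidate, compare):
    """Forward each request to both slices; collect mismatches."""
    mismatches = []
    for req in requests:
        live = online(req)       # online slice answers the client as usual
        shadow = candidate(req)  # shunt duplicates the request into the sandbox
        if not compare(live, shadow):
            mismatches.append((req, live, shadow))
    return mismatches

# Toy scenario: the misconfigured candidate component drops query strings,
# so only the request carrying one diverges from the online replica.
online = lambda r: f"page:{r}"
candidate = lambda r: f"page:{r.split('?')[0]}"
bad = shunt_and_compare(["/a", "/b?x=1"], online, candidate, lambda a, b: a == b)
print(len(bad))
```

Trace-based validation follows the same loop with `online` replaced by a lookup into previously logged responses, trading real-time comparison for lower overhead on the live slice.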
Validating a component: replica-based
• [Diagram: client requests enter the online slice (Tier 1 Web servers, Tier 2 application servers, Tier 3 database); a shunt duplicates requests into the validation slice through Web server and database proxies, and application state and responses are compared against the online replica]
Validating a component: trace-based
• [Diagram: client requests and state are logged at the shunt in the online slice; the validation slice replays the log through Web server and database proxies, comparing responses and state against the recorded behavior]
Implementation details
• Shunting performed in the middleware layer
  • Each request tagged with a unique ID along the entire request path
• Component proxies can be constructed with little effort (the MySQL proxy is ~384 NCSL, vs. the ~402k NCSL of MySQL itself)
  • Reuse discovery and communication interfaces, common messaging core
• State management requires a well-defined export and import API
  • Stateful servers often support such an API
• Comparator functions to detect errors
  • Simple throughput, flow, and content comparators
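Two of the comparators named above might look as follows. This is a minimal sketch, not the paper's code: the 10% throughput tolerance is an assumed threshold, and hashing full bodies down to fixed-size digests is in the spirit of the 64-byte response-summary optimization mentioned on the overheads slide.

```python
# Hedged sketch of a throughput comparator and a content comparator.
import hashlib

def throughput_ok(online_rps, validated_rps, tolerance=0.1):
    """Accept if the validated component's throughput is within
    `tolerance` (fraction) of the online component's."""
    return validated_rps >= online_rps * (1.0 - tolerance)

def content_ok(online_body, validated_body):
    """Compare fixed-size digests instead of full response bodies,
    so only a small summary must cross into the validation slice."""
    digest = lambda b: hashlib.sha256(b).digest()
    return digest(online_body) == digest(validated_body)

print(throughput_ok(100.0, 95.0))        # True: within 10%
print(content_ok(b"ok", b"ok"))          # True: identical content
print(content_ok(b"ok", b"corrupt"))     # False: content diverged
```

Flow comparators (did the request traverse the expected components?) would similarly compare the per-request unique IDs recorded along the request path.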
Validating our prototype: results
• Live operator experiments
  • Operators given a choice of validation type and duration, and the option to skip validation
  • Validation caught 6 of the 9 mistakes made in the 8 experiments with validation
• Mistake-injection experiments
  • Validation caught errors in data content (inaccessible files, corrupted files) and configuration mistakes (an incorrect number of workers in the Web server, which degraded throughput)
• Operator-emulation experiments
  • Operator command scripts derived from the 42 observed operator mistakes
  • Both trace-based and replica-based validation caught 22 mistakes
  • Multi-component validation caught 4 latent (component-interaction) mistakes
Reduction in impact with validation
Fewer mistakes with validation
Shunting & buffering overheads
• Shunting overhead for replica-based validation: 39% additional CPU
  • All requests and responses are captured and forwarded to the validation slice
• Trace-based validation is slightly better: 32% additional CPU
  • Overhead is incurred on a single component, and only during validation
• Various optimizations can reduce the overhead to 13-22%
  • Examples: response summaries (64 bytes), sampling (at session boundaries)
• Buffering capacity during state checkpointing and duplication
  • Only about 150 requests needed to be buffered for small state sizes
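The buffering point above can be made concrete with a small sketch: while a component's state is checkpointed and copied into the validation slice, requests arriving at that component are queued and replayed afterward. The class below is illustrative, not the paper's implementation; the default capacity of 150 echoes the figure reported for small state sizes.

```python
# Hedged sketch of request buffering during state checkpointing/duplication.
from collections import deque

class CheckpointBuffer:
    def __init__(self, capacity=150):
        self.capacity = capacity
        self.pending = deque()

    def enqueue(self, req):
        """Buffer a request arriving mid-checkpoint; fail if the
        checkpoint outlasts the buffer."""
        if len(self.pending) >= self.capacity:
            raise OverflowError("buffer full; checkpoint took too long")
        self.pending.append(req)

    def replay(self, handler):
        """Checkpoint finished: drain buffered requests in arrival order."""
        while self.pending:
            handler(self.pending.popleft())

# Three requests arrive during the checkpoint, then get replayed in order.
buf = CheckpointBuffer(capacity=3)
for r in ["r1", "r2", "r3"]:
    buf.enqueue(r)
served = []
buf.replay(served.append)
print(served)
```

The bounded capacity matters: it ties the tolerable checkpoint duration to the request arrival rate, which is why the technique is practical mainly for components with small state.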
Caveats, limitations & open issues
• Non-determinism increases the complexity of comparators and proxies
  • E.g., choice of back-end server, remote cache vs. local disk, pseudo-random session IDs, timestamps
• Hard-state management may require operator intervention
  • Components require initialization prior to online migration
• Bootstrapping the validation
  • Validating an intended modification of service behavior – nothing to compare with!
• How long to validate? What types of validation?
  • Time spent in validation implies reduced online capacity
• Future work: taking validation further…
  • Validate operator actions on databases and network components
  • Combine validation with diagnosis to assist operators
  • Other validation techniques: model-based validation