This paper presents FATE (Failure Testing Service) and DESTINI (Declarative Testing Specification) to improve failure recovery in cloud systems. With the increasing complexity and frequency of failures, traditional testing methods are insufficient. FATE systematically explores thousands of failure combinations, while DESTINI enables concise recovery specifications. Applied to HDFS, ZooKeeper, and Cassandra, our approach identified 16 new bugs and reproduced 74, highlighting issues like inconsistency and data loss. This innovative framework aims to enhance cloud service reliability through robust failure recovery protocols.
Towards Automatically Checking Thousands of Failures with Micro-Specifications Haryadi S. Gunawi, Thanh Do†, Pallavi Joshi, Joseph M. Hellerstein, Andrea C. Arpaci-Dusseau†, Remzi H. Arpaci-Dusseau†, Koushik Sen University of California, Berkeley † University of Wisconsin, Madison
Cloud Era
• Solve bigger human problems
• Use clusters of thousands of machines
Failures in the Cloud
• “The future is a world of failures everywhere” - Garth Gibson
• “Recovery must be a first-class operation” - Raghu Ramakrishnan
• “Reliability has to come from the software” - Jeffrey Dean
Why Is Failure Recovery Hard?
• Testing is not advanced enough for complex failures
• Failures are diverse, frequent, and multiple
• Example: the Facebook photo loss
• Recovery is under-specified
• Failure recovery behaviors need to be specified as customized, well-grounded protocols
• Example: Paxos Made Live – An Engineering Perspective [PODC ’07]
Our Solutions
• FTS (“FATE”) – Failure Testing Service
• A new abstraction for failure exploration
• Systematically exercises 40,000 unique combinations of failures
• DTS (“DESTINI”) – Declarative Testing Specification
• Enables concise recovery specifications
• We have written 74 checks (3 lines per check)
• Note: the names have changed since the paper
Summary of Findings
• Applied FATE and DESTINI to three cloud systems: HDFS, ZooKeeper, and Cassandra
• Found 16 new bugs
• Reproduced 74 old bugs
• Problems found: inconsistency, data loss, broken rack awareness, unavailability
Outline
• Introduction
• FATE
• DESTINI
• Evaluation
• Summary
Goal: Exercise Different Failure Recovery Paths
[Diagram: an HDFS write pipeline (master M, client C, datanodes 1–3) with failures X1–X3 injected at different points of the setup stage (allocation request) and the data transfer stage.]
• Failures at DIFFERENT STAGES lead to DIFFERENT FAILURE BEHAVIORS
• Setup stage recovery: recreate a fresh pipeline
• Data transfer stage recovery: continue on the surviving nodes
• A bug lurks in data transfer stage recovery
FATE
[Diagram: failures (X) injected at I/O points between M, C, and datanodes 1–3.]
• A failure injection framework targeting I/O points
• Systematically explores failures, including multiple failures
• A new abstraction of a failure scenario: the failure ID
• Remembers injected failures to increase failure coverage
Failure ID
[Diagram: failure IDs attached to the pipeline’s I/O points; the backup slide at the end shows their full contents.]
How Do Developers Build Failure IDs?
• FATE intercepts all I/Os
• AspectJ is used to collect information at every I/O point:
• I/O buffers (e.g., file buffers, network buffers)
• the target I/O (e.g., file name, IP address)
• Domain-specific information is reverse-engineered from this data
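Below is a minimal sketch of what such AspectJ-based interception could look like. It is illustrative only, not FATE’s actual code: FailureDecider, the failure-ID format, and the pointcut details are assumptions.

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

// Hypothetical interface to the failure server (see the architecture slide).
interface FailureDecider {
    boolean shouldFail(String failureId);
}

@Aspect
public class IoInterceptor {
    // In a real deployment this would be an RPC stub to the failure server;
    // the default here never injects a failure.
    static FailureDecider server = id -> false;

    // Intercept every call to InputStream.read(...) in the target system.
    @Around("call(int java.io.InputStream+.read(..))")
    public Object atIoPoint(ProceedingJoinPoint jp) throws Throwable {
        // Static information: which source-level I/O point this is.
        String staticId = jp.getStaticPart().getSourceLocation().toString();
        // Domain-specific information (file name, source/destination IP)
        // is reverse-engineered from the target stream and its buffers;
        // the target's class name stands in for it here.
        Object target = jp.getTarget();
        String domainId = (target == null) ? "static" : target.getClass().getName();
        String failureId = staticId + "|" + domainId;
        if (server.shouldFail(failureId)) {
            // Injected failure: surface as an I/O error at this point.
            throw new java.io.IOException("FATE-injected failure: " + failureId);
        }
        return jp.proceed(); // no injection: perform the real I/O
    }
}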
Exploring the Failure Space
[Diagram: the pipeline (M, C, 1–3) with failure IDs A, B, and C at different I/O points; each experiment injects a different one.]
• Exp #1: inject A
• Exp #2: inject B
• Exp #3: inject C
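As a concrete illustration of this loop, here is a minimal Java sketch of brute-force failure-space exploration (Workload and all names are assumptions, not FATE’s API): each experiment injects one not-yet-exercised failure ID, and new IDs observed during a run are queued for later experiments.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical harness: runs the workload once (e.g., hdfs.write),
// optionally injecting a failure at the given ID, and reports every
// failure ID encountered during the run.
interface Workload {
    List<String> run(String failureIdToInject); // null = no injection
}

public class FailureExplorer {
    public static void explore(Workload workload) {
        Set<String> explored = new HashSet<>();
        // Seed the frontier with the IDs seen in a failure-free run.
        Deque<String> frontier = new ArrayDeque<>(workload.run(null));
        while (!frontier.isEmpty()) {
            String target = frontier.poll();
            if (!explored.add(target)) continue; // already exercised
            for (String seen : workload.run(target)) {
                if (!explored.contains(seen)) frontier.add(seen);
            }
        }
    }
}

Exploring multiple failures generalizes the frontier from single IDs to sequences of IDs, which is the source of the combinatorial growth discussed on the “New Challenges” slide.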
Outline
• Introduction
• FATE
• DESTINI
• Evaluation
• Summary
DESTINI
• Enables concise recovery specifications
• Checks whether expected behaviors match actual behaviors
• Important elements: expectations, facts, failure events, and check timing
• Interposes network and disk protocols
Writing Specifications
“A violation occurs if an expectation differs from the actual facts”

violationTable() :- expectationTable(), NOT-IN actualTable();

Datalog syntax: “:-” denotes derivation; “,” denotes AND
Correct vs. Incorrect Recovery
[Diagram: after a crash (X) in the pipeline M, C, 1–3, correct recovery keeps the block on the expected nodes; incorrect recovery does not.]

incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);

The rule has two ingredients: BUILD EXPECTATIONS (expectedNodes) and CAPTURE FACTS (actualNodes)
Building Expectations
[Diagram: the client asks the master “Give me the list of nodes for B”; the master replies [Node 1, Node 2, Node 3].]

expectedNodes(B, N) :- getBlockPipe(B, N);
Updating Expectations

setupAcks(B, Pos, Ack) :- cdpSetupAck(B, Pos, Ack);
goodAcksCnt(B, COUNT<Ack>) :- setupAcks(B, Pos, Ack), Ack == “OK”;
nodesCnt(B, COUNT<Node>) :- pipeNodes(B, _, N, _);
writeStage(B, Stg) :- nodesCnt(B, NCnt), goodAcksCnt(B, ACnt), NCnt == ACnt, Stg := “Data Transfer”;

DEL expectedNodes(B, N) :- fateCrashNode(N), writeStage(B, Stage), Stage == “Data Transfer”, expectedNodes(B, N);

• “The client has received all acks from the setup stage” → writeStage enters the Data Transfer stage
• Precise failure events matter: different stages → different recovery behaviors → different specifications
• FATE and DESTINI must work hand in hand
Capturing Facts
[Diagram: correct recovery leaves the block with the latest generation stamp (B_gs2) on the surviving nodes; incorrect recovery leaves a stale copy (B_gs1).]

actualNodes(B, N) :- blocksLocation(B, N, Gs), latestGenStamp(B, Gs);
Violations and Check Timing
• There are points in time when recovery is still ongoing, so specifications are legitimately violated
• We need precise events to decide when a check should run; in this example, upon block completion:

incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N), cnpComplete(B);
Capture Facts and Build Expectations from I/O Events
• No need to interpose internal functions
• Specification reuse: for the first check, the rules-to-checks ratio is 16:1; overall, it is 3:1
Outline
• Introduction
• FATE
• DESTINI
• Evaluation
• Summary
Evaluation
• FATE: 3,900 lines; DESTINI: 1,200 lines
• Applied FATE and DESTINI to three cloud systems: HDFS, ZooKeeper, and Cassandra
• Exercised 40,000 unique combinations of failures
• Found 16 new bugs; reproduced 74 old bugs
• Wrote 74 recovery specifications (3 lines per check)
Bugs Found
• Reduced availability and performance
• Data loss due to multiple failures
• Data loss in the log recovery protocol
• Data loss in the append protocol
• The rack-awareness property is broken
Conclusion
• FATE systematically explores multiple failures
• DESTINI enables concise recovery specifications
• FATE and DESTINI form a unified framework:
• testing recovery specifications requires a failure service
• a failure service needs recovery specifications to catch recovery bugs
Thank You! QUESTIONS?
Berkeley Orders of Magnitude: http://boom.cs.berkeley.edu
The Advanced Systems Laboratory: http://www.cs.wisc.edu/adsl
Download our full TR paper from these websites
New Challenges
• The number of multiple-failure combinations grows exponentially
• FATE exercised 40,000 failure combinations in 80 hours
FATE Architecture
[Diagram: the target system (HDFS, instrumented with Java SDK filters) reports each I/O point on the failure surface to the Failure Server, which answers Fail/No Fail; a workload driver loops:]

while (server injects new failureIDs) {
    runWorkload(); // e.g., hdfs.write
}
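A matching sketch of the failure server’s decision logic (hypothetical names again, not FATE’s implementation): each experiment registers the failure IDs to inject, the server answers Fail/No Fail at every intercepted I/O point, and it records every ID it sees so the driver can plan new experiments.

import java.util.HashSet;
import java.util.Set;

public class FailureServer {
    private final Set<String> plan = new HashSet<>();     // IDs to inject this run
    private final Set<String> observed = new HashSet<>(); // IDs seen this run

    public synchronized void startExperiment(Set<String> failureIds) {
        plan.clear();
        plan.addAll(failureIds); // one ID for single failures, several for multiple
        observed.clear();
    }

    // Called (e.g., from the AspectJ filters) at every intercepted I/O point.
    public synchronized boolean shouldFail(String failureId) {
        observed.add(failureId); // remember the point even when we do not fail it
        return plan.contains(failureId);
    }

    public synchronized Set<String> observedIds() {
        return new HashSet<>(observed);
    }
}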
DESTINI Architecture
[Diagram: DESTINI sits beside FATE, evaluating Datalog rules of the form stateY(..) :- cnpEv(..), state(X); over events from the target system.]
Current State of the Art
• Failure exploration
• rarely deals with multiple failures, or uses a random approach
• System specifications
• unit-test checking: cumbersome
• WiDS, Pip: not integrated with a failure service
Failure ID Examples (Backup)
[Diagram: the write pipeline (M, C, 1–3) with failures X1–X3; Recovery 1 recreates a fresh pipeline, Recovery 2 continues on the surviving nodes, and a bug lurks in Recovery 2.]

• Static: InputStream.read(); Domain: Src = Node 1, Dest = Node 2, Type = Setup
• Static: InputStream.read(); Domain: Src = Node 2, Dest = Node 3, Type = Data Transfer
• Static: InputStream.read(); Domain: Src = Node 1, Dest = Node 2, Type = Data Transfer
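The examples above suggest that a failure ID could be modeled as a simple value type combining the static I/O point with the domain information (a sketch; the field names are assumptions). Value equality is what would let FATE remember which failure IDs it has already injected.

import java.util.Objects;

public final class FailureId {
    final String staticIoPoint; // e.g., "InputStream.read()"
    final String src;           // e.g., "Node 1"
    final String dest;          // e.g., "Node 2"
    final String type;          // e.g., "Setup" or "Data Transfer"

    public FailureId(String staticIoPoint, String src, String dest, String type) {
        this.staticIoPoint = staticIoPoint;
        this.src = src;
        this.dest = dest;
        this.type = type;
    }

    // Two IDs denote the same point in the failure space iff all parts match.
    @Override public boolean equals(Object o) {
        if (!(o instanceof FailureId)) return false;
        FailureId f = (FailureId) o;
        return staticIoPoint.equals(f.staticIoPoint) && src.equals(f.src)
            && dest.equals(f.dest) && type.equals(f.type);
    }

    @Override public int hashCode() {
        return Objects.hash(staticIoPoint, src, dest, type);
    }
}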