This paper presents FATE (Failure Testing Service) and DESTINI (Declarative Testing Specification) to improve failure recovery in cloud systems. With the increasing complexity and frequency of failures, traditional testing methods are insufficient. FATE systematically explores thousands of failure combinations, while DESTINI enables concise recovery specifications. Applied to HDFS, ZooKeeper, and Cassandra, our approach identified 16 new bugs and reproduced 74, highlighting issues like inconsistency and data loss. This innovative framework aims to enhance cloud service reliability through robust failure recovery protocols.
Towards Automatically Checking Thousands of Failures with Micro-Specifications Haryadi S. Gunawi, Thanh Do†, Pallavi Joshi, Joseph M. Hellerstein, Andrea C. Arpaci-Dusseau†, Remzi H. Arpaci-Dusseau†, Koushik Sen University of California, Berkeley † University of Wisconsin, Madison
Cloud Era
• Solve bigger human problems
• Use clusters of thousands of machines
Failures in the Cloud
• “The future is a world of failures everywhere” - Garth Gibson
• “Recovery must be a first-class operation” - Raghu Ramakrishnan
• “Reliability has to come from the software” - Jeffrey Dean
Why Is Failure Recovery Hard?
• Testing is not advanced enough for complex failures
• Failures are diverse, frequent, and multiple
• Example: the Facebook photo loss
• Recovery is under-specified
• Failure recovery behaviors need to be specified as customized, well-grounded protocols
• Example: Paxos Made Live – An Engineering Perspective [PODC ’07]
Our Solutions
• FTS (“FATE”) – Failure Testing Service
• A new abstraction for failure exploration
• Systematically exercises 40,000 unique combinations of failures
• DTS (“DESTINI”) – Declarative Testing Specification
• Enables concise recovery specifications
• We have written 74 checks (3 lines per check)
• Note: the names have changed since the paper
Summary of Findings
• Applied FATE and DESTINI to three cloud systems: HDFS, ZooKeeper, and Cassandra
• Found 16 new bugs
• Reproduced 74 old bugs
• Problems found: inconsistency, data loss, broken rack awareness, unavailability
Outline
• Introduction
• FATE
• DESTINI
• Evaluation
• Summary
Goal: Exercise Different Failure Recovery Paths
[Diagram: an HDFS write pipeline (master M, client C, datanodes 1–3) with failures X1–X3 injected at different points of the setup stage (allocation request) and the data transfer stage.]
• Failures at DIFFERENT STAGES lead to DIFFERENT FAILURE BEHAVIORS
• Setup stage recovery: recreate a fresh pipeline
• Data transfer stage recovery: continue on the surviving nodes
• A bug lurks in data transfer stage recovery
FATE
[Diagram: failures (X) injected at I/O points between M, C, and datanodes 1–3.]
• A failure injection framework targeting I/O points
• Systematically explores failures, including multiple failures
• A new abstraction of a failure scenario: the failure ID
• Remembers injected failures to increase failure coverage
Failure ID
[Diagram: failure IDs attached to the pipeline’s I/O points; the backup slide at the end shows their full contents.]
How Do Developers Build Failure IDs?
• FATE intercepts all I/Os
• AspectJ is used to collect information at every I/O point:
• I/O buffers (e.g., file buffers, network buffers)
• the target I/O (e.g., file name, IP address)
• Domain-specific information is reverse-engineered from this data
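Below is a minimal sketch of what such AspectJ-based interception could look like. It is illustrative only, not FATE’s actual code: FailureDecider, the failure-ID format, and the pointcut details are assumptions.

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

// Hypothetical interface to the failure server (see the architecture slide).
interface FailureDecider {
    boolean shouldFail(String failureId);
}

@Aspect
public class IoInterceptor {
    // In a real deployment this would be an RPC stub to the failure server;
    // the default here never injects a failure.
    static FailureDecider server = id -> false;

    // Intercept every call to InputStream.read(...) in the target system.
    @Around("call(int java.io.InputStream+.read(..))")
    public Object atIoPoint(ProceedingJoinPoint jp) throws Throwable {
        // Static information: which source-level I/O point this is.
        String staticId = jp.getStaticPart().getSourceLocation().toString();
        // Domain-specific information (file name, source/destination IP)
        // is reverse-engineered from the target stream and its buffers;
        // the target's class name stands in for it here.
        Object target = jp.getTarget();
        String domainId = (target == null) ? "static" : target.getClass().getName();
        String failureId = staticId + "|" + domainId;
        if (server.shouldFail(failureId)) {
            // Injected failure: surface as an I/O error at this point.
            throw new java.io.IOException("FATE-injected failure: " + failureId);
        }
        return jp.proceed(); // no injection: perform the real I/O
    }
}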
Exploring the Failure Space
[Diagram: the pipeline (M, C, 1–3) with failure IDs A, B, and C at different I/O points; each experiment injects a different one.]
• Exp #1: inject A
• Exp #2: inject B
• Exp #3: inject C
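As a concrete illustration of this loop, here is a minimal Java sketch of brute-force failure-space exploration (Workload and all names are assumptions, not FATE’s API): each experiment injects one not-yet-exercised failure ID, and new IDs observed during a run are queued for later experiments.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical harness: runs the workload once (e.g., hdfs.write),
// optionally injecting a failure at the given ID, and reports every
// failure ID encountered during the run.
interface Workload {
    List<String> run(String failureIdToInject); // null = no injection
}

public class FailureExplorer {
    public static void explore(Workload workload) {
        Set<String> explored = new HashSet<>();
        // Seed the frontier with the IDs seen in a failure-free run.
        Deque<String> frontier = new ArrayDeque<>(workload.run(null));
        while (!frontier.isEmpty()) {
            String target = frontier.poll();
            if (!explored.add(target)) continue; // already exercised
            for (String seen : workload.run(target)) {
                if (!explored.contains(seen)) frontier.add(seen);
            }
        }
    }
}

Exploring multiple failures generalizes the frontier from single IDs to sequences of IDs, which is the source of the combinatorial growth discussed on the “New Challenges” slide.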
Outline
• Introduction
• FATE
• DESTINI
• Evaluation
• Summary
DESTINI
• Enables concise recovery specifications
• Checks whether expected behaviors match actual behaviors
• Important elements: expectations, facts, failure events, and check timing
• Interposes network and disk protocols
Writing Specifications
“A violation occurs if an expectation differs from the actual facts”

violationTable() :- expectationTable(), NOT-IN actualTable();

Datalog syntax: “:-” denotes derivation; “,” denotes AND
Correct vs. Incorrect Recovery
[Diagram: after a crash (X) in the pipeline M, C, 1–3, correct recovery keeps the block on the expected nodes; incorrect recovery does not.]

incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);

The rule has two ingredients: BUILD EXPECTATIONS (expectedNodes) and CAPTURE FACTS (actualNodes)
Building Expectations
[Diagram: the client asks the master “Give me the list of nodes for B”; the master replies [Node 1, Node 2, Node 3].]

expectedNodes(B, N) :- getBlockPipe(B, N);
Updating Expectations

setupAcks(B, Pos, Ack) :- cdpSetupAck(B, Pos, Ack);
goodAcksCnt(B, COUNT<Ack>) :- setupAcks(B, Pos, Ack), Ack == “OK”;
nodesCnt(B, COUNT<Node>) :- pipeNodes(B, _, N, _);
writeStage(B, Stg) :- nodesCnt(B, NCnt), goodAcksCnt(B, ACnt), NCnt == ACnt, Stg := “Data Transfer”;

DEL expectedNodes(B, N) :- fateCrashNode(N), writeStage(B, Stage), Stage == “Data Transfer”, expectedNodes(B, N);

• “The client has received all acks from the setup stage” → writeStage enters the Data Transfer stage
• Precise failure events matter: different stages → different recovery behaviors → different specifications
• FATE and DESTINI must work hand in hand
Capturing Facts
[Diagram: correct recovery leaves the block with the latest generation stamp (B_gs2) on the surviving nodes; incorrect recovery leaves a stale copy (B_gs1).]

actualNodes(B, N) :- blocksLocation(B, N, Gs), latestGenStamp(B, Gs);
Violations and Check Timing
• There are points in time when recovery is still ongoing, so specifications are legitimately violated
• We need precise events to decide when a check should run; in this example, upon block completion:

incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N), cnpComplete(B);
Capture Facts and Build Expectations from I/O Events
• No need to interpose internal functions
• Specification reuse: for the first check, the rules-to-checks ratio is 16:1; overall, it is 3:1
Outline
• Introduction
• FATE
• DESTINI
• Evaluation
• Summary
Evaluation
• FATE: 3,900 lines; DESTINI: 1,200 lines
• Applied FATE and DESTINI to three cloud systems: HDFS, ZooKeeper, and Cassandra
• Exercised 40,000 unique combinations of failures
• Found 16 new bugs; reproduced 74 old bugs
• Wrote 74 recovery specifications (3 lines per check)
Bugs Found
• Reduced availability and performance
• Data loss due to multiple failures
• Data loss in the log recovery protocol
• Data loss in the append protocol
• The rack-awareness property is broken
Conclusion
• FATE systematically explores multiple failures
• DESTINI enables concise recovery specifications
• FATE and DESTINI form a unified framework:
• testing recovery specifications requires a failure service
• a failure service needs recovery specifications to catch recovery bugs
Thank You! QUESTIONS?
Berkeley Orders of Magnitude: http://boom.cs.berkeley.edu
The Advanced Systems Laboratory: http://www.cs.wisc.edu/adsl
Download our full TR paper from these websites
New Challenges
• The number of multiple-failure combinations grows exponentially
• FATE exercised 40,000 failure combinations in 80 hours
FATE Architecture
[Diagram: the target system (HDFS, instrumented with Java SDK filters) reports each I/O point on the failure surface to the Failure Server, which answers Fail/No Fail; a workload driver loops:]

while (server injects new failureIDs) {
    runWorkload(); // e.g., hdfs.write
}
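A matching sketch of the failure server’s decision logic (hypothetical names again, not FATE’s implementation): each experiment registers the failure IDs to inject, the server answers Fail/No Fail at every intercepted I/O point, and it records every ID it sees so the driver can plan new experiments.

import java.util.HashSet;
import java.util.Set;

public class FailureServer {
    private final Set<String> plan = new HashSet<>();     // IDs to inject this run
    private final Set<String> observed = new HashSet<>(); // IDs seen this run

    public synchronized void startExperiment(Set<String> failureIds) {
        plan.clear();
        plan.addAll(failureIds); // one ID for single failures, several for multiple
        observed.clear();
    }

    // Called (e.g., from the AspectJ filters) at every intercepted I/O point.
    public synchronized boolean shouldFail(String failureId) {
        observed.add(failureId); // remember the point even when we do not fail it
        return plan.contains(failureId);
    }

    public synchronized Set<String> observedIds() {
        return new HashSet<>(observed);
    }
}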
DESTINI Architecture
[Diagram: DESTINI sits beside FATE, evaluating Datalog rules of the form stateY(..) :- cnpEv(..), state(X); over events from the target system.]
Current State of the Art
• Failure exploration
• rarely deals with multiple failures, or uses a random approach
• System specifications
• unit-test checking: cumbersome
• WiDS, Pip: not integrated with a failure service
Failure ID Examples (Backup)
[Diagram: the write pipeline (M, C, 1–3) with failures X1–X3; Recovery 1 recreates a fresh pipeline, Recovery 2 continues on the surviving nodes, and a bug lurks in Recovery 2.]

• Static: InputStream.read(); Domain: Src = Node 1, Dest = Node 2, Type = Setup
• Static: InputStream.read(); Domain: Src = Node 2, Dest = Node 3, Type = Data Transfer
• Static: InputStream.read(); Domain: Src = Node 1, Dest = Node 2, Type = Data Transfer
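The examples above suggest that a failure ID could be modeled as a simple value type combining the static I/O point with the domain information (a sketch; the field names are assumptions). Value equality is what would let FATE remember which failure IDs it has already injected.

import java.util.Objects;

public final class FailureId {
    final String staticIoPoint; // e.g., "InputStream.read()"
    final String src;           // e.g., "Node 1"
    final String dest;          // e.g., "Node 2"
    final String type;          // e.g., "Setup" or "Data Transfer"

    public FailureId(String staticIoPoint, String src, String dest, String type) {
        this.staticIoPoint = staticIoPoint;
        this.src = src;
        this.dest = dest;
        this.type = type;
    }

    // Two IDs denote the same point in the failure space iff all parts match.
    @Override public boolean equals(Object o) {
        if (!(o instanceof FailureId)) return false;
        FailureId f = (FailureId) o;
        return staticIoPoint.equals(f.staticIoPoint) && src.equals(f.src)
            && dest.equals(f.dest) && type.equals(f.type);
    }

    @Override public int hashCode() {
        return Objects.hash(staticIoPoint, src, dest, type);
    }
}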