Hector Garcia-Molina

CS 347: Parallel and DistributedData ManagementNotes06: Reliable DistributedDatabase Management Hector Garcia-Molina Notes06

Reliable distributed database management • Reliability • Failure models • Scenarios Notes06

Reliability • Correctness • Serializability • Atomicity • Persistence • Availability Notes06

Types of failures • Processor failures • Halt, delay, restart, bezerk, ... • Storage failures • Volatile, non-volatile, atomic write, transient errors, spontaneous failures • Network failures • Lost message, out-of-order messages, partitions, bounded delay Notes06

More Types of failures • Malevolent failures • Multiple failures • Detectable failures Notes06

Failure models • Cannot protect against everything • Unlikely failures (e.g., flooding in the Sahara) • Expensive to protect failures (e.g., earthquake) • Failures we know how to protect against (e.g., message sequence numbers; stable storage) Notes06

Failure model: Desired Events Expected Undesired Unexpected Notes06

Node models (1) Fail-stop nodes time perfect halted recovery perfect Volatile memory lost Stable storage ok Notes06

Node models (2) Byzantine nodes A Perfect Perfect Arbitrary failure Recovery B C At any given time, at most some fraction f of nodes failed (typically f < 1/2 or f < 1/3) Notes06

Network models (1) Reliable network - in order messages - no spontaneous messages - timeout TD I.e., no lost messages, except for node failures Destination down (not paused) If no ack in TD sec. Notes06

Variation of reliable net: • Persistent messages • If destination down, net will eventually deliver message • Simplifies node recovery, but leads to inefficiencies (hides too much) • Not considered here Notes06

Network models (2) Partitionable network - In order messages - No spontaneous messages - no timeout; nodes can have different view of failures Notes06

Scenarios • Reliable network • Fail-stop nodes • No data replication (1) • Data replication (2) • Partitionable network • Fail-stop nodes (3) Notes06

No Data Replication • Reliable network, fail-stop nodes • Basic idea: node P controls X P net Item X Notes06

No Data Replication • Reliable network, fail-stop nodes • Basic idea: node P controls X P net Item X - Single control point simplifies concurrency control, recovery - Not an availability hit: if P down, X unavailable too! Notes06

“P controls X” means - P does concurrency control for X - P does recovery for X Notes06

Say transaction T wants to access X: req PT is process that represents T at this node PT Local DMBS X Lock mgr LOG Notes06

Process models (A) Cohorts Spawn process Communication Data Access USER T1 T2 T3 Local DMBS Local DMBS Local DMBS Notes06

Process models (B) Transaction servers (manager) USER Trans MGR Trans MGR Trans MGR Local DMBS Local DMBS Local DMBS Notes06

Cohorts: application code responsible for remote access • Transaction manager: “system” handles distribution, remote access Notes06

Distributed commit problem Transaction T . Action: a1,a2 Action: a3 Action: a4,a5 Notes06

Centralized two-phase commit Coordinator Participant I I exec nok go exec* nok abort* exec ok W A W A ok* commit * abort - commit - C C Notes06

Notation: Incoming message Outgoing message ( * = everyone) • When participant enters “W” state: • it must have acquired all resources • it can only abort or commit if so instructed by a coordinator • Coordinator only enters “C” state if all participants are in “W”, i.e., it is certain that all will eventually commit Notes06

Handling node failures • Coordinator and participant logs are used to reconstruct state before failure Notes06

-> Example: after participant fails: Log: T1 X undo/redo info T1 Y info T1 “W” state ... ... Notes06

 At recovery: • T1 is in “W” state • Obtain X,Y write locks (no read locks!) • Wait for message from coordinator (or ask coordinator for outcome) Notes06

 Other examples: • No “W” record on log  abort T1 • See “C” record on log  finish T1 Notes06

Next • Add timeouts to cope with messages lost during crashes • Add finish (“F”) state for coordinator – all done, can forget outcome Notes06

I nok abort* _go_ exec* Coordinator W A ok* commit* nok* - C F c-ok* - t=timeout cping=coord. ping Notes06

ping abort ping - _t_ cping _t_ abort* ping commit _t_ cping I nok abort* _go_ exec* Coordinator W A ok* commit* nok* - C F c-ok* - t=timeout cping=coord. ping Notes06

exec nok I Participant exec ok abort nok W A commit c-ok C Notes06

cping done cping , _t - ping cping done “done” message counts as either c-ok or n-ok for coordinator exec nok I Participant exec ok abort nok W A commit c-ok C Notes06

cping done cping , _t - ping equivalent to finish state cping done “done” message counts as either c-ok or n-ok for coordinator exec nok I Participant exec ok abort nok W A commit c-ok C Notes06

Presumed abort protocol • “F” and “A” states combined in coordinator • Saves persistent space (forget quicker) • Presumed commit is analogous Notes06

Presumed abort-coordinator(participant unchanged) I ping abort nok, t abort* _go_ exec* ping - A/F W ok* commit* c-ok* - C ping commit _t_ cping Notes06

T1 start part={a,b} T1 OK from a RCV ... ... Remember: all state transitions must be logged Example: tracking who has sent “OK” msgs Log at coord: • After failure, we know still waiting for OK from node b • Alternative: do not log receipts of “OK”s abort T1 ... Notes06

Example: logging receipt of C-OK messages • If logged, can recover state • If not logged: • resend commit * • participants reply “done” if duplicate Notes06

2PC is blocking Sample scenario: Coord P2 W P1 P3 W P4 W Notes06

coord P2 w P3 P1 w w P4 Case I: P1 “W”; coordinator sent commits P1 “C” Case II: P1 NOK; P1 A  P2, P3, P4 (surviving participants) cannot safely abort or commit transaction Notes06

Variants of 2PC ok ok ok • Linear Coord • Hierarchical commit commit commit Notes06

Variants of 2PC • Distributed • Nodes broadcast all messages • Every node knows when to commit Notes06

3PC = non-blocking commit • Assume: failed node is down forever • Key idea: before committing, coordinator tells participants everyone is ok Notes06

3PC I I exec nok _exec_ ok _go_ exec* Coordinator Participant nok abort* W A W A abort - pre ack ok* pre * P P ack** commit* commit - ** means allnon-failed nodes C C Notes06

3PC recovery rules: termination protocol • Survivors try to complete transaction, based on their current states • Goal: • If dead nodes committed or aborted, then survivors should not contradict! • Else, survivors can do as they please... survivors Notes06

survivors • Let {S1,S2,…Sn} be survivor sites • If one or more Si = COMMIT  COMMIT T • If one or more Si = ABORT  ABORT T • If one or more Si = PREPARE  T could not have aborted  COMMIT T • If no Si = PREPARE (or COMMIT) T could not have committed  ABORT T Notes06

? ? W W P Example: Notes06

? ? W W I Example: Notes06

? ? C P P Example: Notes06

? ? W A P Example: Notes06

 Once survivors make decision, they must select new coordinator to continue 3PC P P C C W P C C W P P C Decide to commit Time 1 Time 2 Time 3 Time 4 Notes06

Hector Garcia-Molina

Hector Garcia-Molina

Presentation Transcript

Garcia-Molina, Ullman, and Widom

Garcia-Molina, Ullman, and Widom

Hector Berlioz

Hector Guzman

Hector Gonzalez

Hector Berloiz

Melissa Molina

(Slides by Hector Garcia-Molina, www-db.stanford/~hector/cs245/notes.htm)

Hector Garcia-Molina Zoltan Gyongyi

Sixto MOLINA

Hector

L3: DDBS Design (Tailored from the Slides by Prof. Hector Garcia-Molina )

Combating Web Spam with TrustRank ( Zoltan Gyongyi , Hector Garcia-Molina, Jan Pedersen )

Mor Naaman, Yee Jiun Song, Andreas Paepcke, Hector Garcia-Molina

HECTOR

Hector Lopez

(Slides by Hector Garcia-Molina, www-db.stanford/~hector/cs245/notes.htm)

HECTOR