Presentation Transcript

  1. D3S: Debugging Deployed Distributed Systems • Xuezheng Liu et al., Microsoft Research, NSDI 2008 • Presenter: Shuo Tang, CS525@UIUC

  2. Debugging distributed systems is difficult • Bugs are difficult to reproduce • Many machines executing concurrently • Machines/network may fail • Consistent snapshots are not easy to get • Current approaches • Multi-threaded debugging • Model-checking • Runtime-checking

  3. State of the Art • Example: distributed reader-writer locks • Log-based debugging • Step 1: add logs
        void ClientNode::OnLockAcquired(…) {
            …
            print_log( m_NodeID, lock, mode );
        }
     • Step 2: collect logs • Step 3: write checking scripts
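
Step 3 is where most of the manual effort goes. Purely as an illustration (not code from the paper), an offline checking script over the collected logs could look like the sketch below; the log line format ("acquire/release <node> <lock> <mode>") and the conflict rule are assumptions:

    // Illustrative offline checker for collected lock logs.
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>

    int main() {
        // lock -> (node -> mode); rebuilt as acquire/release events are replayed
        std::map<std::string, std::map<std::string, char>> holders;
        std::string line;
        while (std::getline(std::cin, line)) {
            std::istringstream in(line);
            std::string event, node, lock;
            char mode = 'S';
            if (!(in >> event >> node >> lock)) continue;
            in >> mode;                      // mode only present on acquire lines
            if (event == "acquire") {
                holders[lock][node] = mode;
                const auto& h = holders[lock];
                bool exclusive = false;
                for (const auto& p : h) exclusive |= (p.second == 'E');
                // Conflict: an exclusive holder coexists with any other holder.
                if (exclusive && h.size() > 1)
                    std::cout << "conflict on lock " << lock << "\n";
            } else if (event == "release") {
                holders[lock].erase(node);
            }
        }
        return 0;
    }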

  4. Problems • Too much manual effort • Difficult to anticipate what to log • Too much? • Too little? • Checking a large system is challenging • A central checker cannot keep up • Snapshots must be consistent

  5. D3S Contribution • A simple language for writing distributed predicates • Programmers can change what is being checked on-the-fly • Failure tolerant consistent snapshot for predicate checking • Evaluation with five real-world applications

  6. D3S Workflow • (Figure) Each process exposes its state; checkers evaluate the predicate "no conflicting locks" over the exposed states and report a conflict as a violation

  7. Glance at D3S Predicate
        V0: exposer { ( client: ClientID, lock: LockID, mode: LockMode ) }
        V1: V0 → { ( conflict: LockID ) } as final
        after (ClientNode::OnLockAcquired) addtuple ($0->m_NodeID, $1, $2)
        after (ClientNode::OnLockReleased) deltuple ($0->m_NodeID, $1, $2)

        class MyChecker : vertex<V1> {
            virtual void Execute( const V0::Snapshot & snapshot ) {
                …  // Invariant logic, written in sequential style
            }
            static int64 Mapping( const V0::tuple & t );  // guidance for partitioning
        };
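
The Execute body is elided on the slide. Setting the D3S plumbing aside, the sequential-style invariant it describes ("no lock has an exclusive holder alongside any other holder") could look like the following self-contained sketch; the tuple layout and mode values are assumptions for illustration:

    // Standalone sketch of the "no conflicting locks" invariant, independent of
    // the D3S framework. A snapshot is modeled as a flat list of exposed tuples.
    #include <cstdint>
    #include <map>
    #include <vector>

    enum LockMode { SHARED, EXCLUSIVE };
    struct Tuple { std::uint64_t client, lock; LockMode mode; };

    // Returns the IDs of locks held exclusively by someone while another holder exists.
    std::vector<std::uint64_t> FindConflicts(const std::vector<Tuple>& snapshot) {
        std::map<std::uint64_t, std::vector<const Tuple*>> byLock;
        for (const Tuple& t : snapshot) byLock[t.lock].push_back(&t);

        std::vector<std::uint64_t> conflicts;
        for (const auto& [lock, holders] : byLock) {
            bool exclusive = false;
            for (const Tuple* t : holders) exclusive |= (t->mode == EXCLUSIVE);
            if (exclusive && holders.size() > 1) conflicts.push_back(lock);
        }
        return conflicts;
    }

    // Analogue of MyChecker::Mapping: partition tuples by lock ID so that one
    // checker sees every holder of a given lock.
    std::int64_t Mapping(const Tuple& t) { return static_cast<std::int64_t>(t.lock); }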

  8. D3S Parallel Predicate Checker • (Figure) Lock clients expose states individually, e.g. (C1, L1, E), (C2, L3, S), (C5, L1, S), … • Snapshots SN1, SN2, … are reconstructed and partitioned by the key LockID, so one checker receives (C1, L1, E), (C5, L1, S) for L1 while another receives (C2, L3, S)

  9. Summary of Checking Language • Predicate • Any property calculated from a finite number of consecutive state snapshots • Highlights • Sequential programs (w/ mapping) • Reuse app types in the script and C++ code • Binary instrumentation • Support for reducing the overhead (in the paper) • Incremental checking • Sampling the time or snapshots

  10. Snapshots • Use Lamport clocks • Instrument the network library • 1000 logical clock ticks per second • Problem: how does the checker know whether it has received all necessary states for a snapshot?
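
For reference, a minimal sketch of the Lamport-clock rules the instrumented network library would follow (illustrative, not D3S code; D3S additionally paces the clock so that roughly 1000 ticks correspond to one second of wall time):

    #include <algorithm>
    #include <cstdint>

    class LamportClock {
    public:
        // Stamp a locally exposed state or an outgoing message.
        std::uint64_t Tick() { return ++now_; }

        // Merge the timestamp carried by an incoming message.
        std::uint64_t OnReceive(std::uint64_t ts) {
            now_ = std::max(now_, ts) + 1;
            return now_;
        }

    private:
        std::uint64_t now_ = 0;
    };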

  11. Consistent Snapshot • (Figure) Process A exposes { (A, L0, S) } at ts=2, an empty set at ts=10, and { (A, L1, E) } at ts=16; process B exposes { (B, L1, E) } at ts=6 and then fails (detected around ts=12) • The checker combines the membership M(ts) with each member's most recent exposed state: with M(6)={A,B} it carries SA(2) forward as SA(6) to run check(6), carries SB(6) forward as SB(10) to run check(10), and once M(16)={A} it runs check(16) without B • Membership questions: what if a process has no state to expose for a long time? What if a checker fails?
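
The slide's answer, sketched below under assumed data structures (this is not D3S code): every process reports periodically, even with an empty set, and a snapshot at time t becomes checkable once every live member has reported a timestamp at or beyond t, with missing values carried forward from each member's latest earlier report.

    #include <cstdint>
    #include <iterator>
    #include <map>
    #include <optional>
    #include <set>
    #include <string>
    #include <vector>

    using Timestamp = std::uint64_t;
    using State = std::vector<std::string>;   // exposed tuples, kept abstract here

    struct ProcessHistory {
        std::map<Timestamp, State> reports;   // ts -> exposed set at that ts
    };

    // "members" stands in for the membership service M(t) on the slide.
    bool SnapshotReady(Timestamp t,
                       const std::set<std::string>& members,
                       const std::map<std::string, ProcessHistory>& histories) {
        for (const auto& name : members) {
            auto it = histories.find(name);
            if (it == histories.end()) return false;
            const auto& reports = it->second.reports;
            // Need some report at or after t, so nothing for t is still in flight.
            if (reports.lower_bound(t) == reports.end()) return false;
        }
        return true;
    }

    // State of one member at time t: its latest report at or before t.
    std::optional<State> StateAt(Timestamp t, const ProcessHistory& h) {
        auto it = h.reports.upper_bound(t);
        if (it == h.reports.begin()) return std::nullopt;
        return std::prev(it)->second;
    }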

  12. Experimental Method • Debugging five real systems • Can D3S help developers find bugs? • Are predicates simple to write? • Is the checking overhead acceptable? • Case: Chord implementation – i3 • Uses predecessor and successor lists to stabilize • "Holes" and overlaps

  13. Chord Overlay • Consistency vs. availability: cannot get both • Global measure of both factors • See the tradeoff quantitatively for performance tuning • Capable of checking detailed key coverage • Perfect ring: • No overlap, no hole • Aggregated key coverage is 100%
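
As a rough illustration of the "perfect ring" predicate (not the authors' checker), aggregated key coverage can be computed from each node's claimed range (predecessor, self]; coverage of exactly 100% with no range claimed twice means no holes and no overlaps. The 2^64 ring size and range convention below are assumptions:

    #include <cstdint>
    #include <vector>

    struct NodeView { std::uint64_t predecessor, self; };  // one node's snapshot

    static long double RangeLen(const NodeView& n) {
        std::uint64_t len = n.self - n.predecessor;   // unsigned wrap-around = ring math
        // A node whose predecessor equals itself claims the entire ring.
        return len == 0 ? 18446744073709551616.0L : static_cast<long double>(len);
    }

    // Coarse aggregated coverage: > 1.0 indicates overlap, < 1.0 indicates holes
    // (claims counted with multiplicity).
    double KeyCoverage(const std::vector<NodeView>& nodes) {
        long double covered = 0;
        for (const auto& n : nodes) covered += RangeLen(n);
        const long double ring = 18446744073709551616.0L;  // 2^64
        return static_cast<double>(covered / ring);
    }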

  14. Summary of Results • (Table) Bugs found by D3S in data-center applications and wide-area applications

  15. Overhead (PacificA) • Less than 8%, in most cases less than 4%. • I/O overhead < 0.5% • Overhead is negligible in other checked systems

  16. Related Work • Log analysis • Magpie[OSDI’04], Pip[NSDI’06], X-Trace[NSDI’07] • Predicate checking at replay time • WiDS Checker[NSDI’07], Friday[NSDI’07] • P2-based online monitoring • P2-monitor[EuroSys’06] • Model checking • MaceMC[NSDI’07], CMC[OSDI’04]

  17. Conclusion • Predicate checking is effective for debugging deployed & large-scale distributed systems • D3S enables: • Changing what is monitored on-the-fly • Checking with multiple checkers • Specifying predicates in a sequential & centralized manner

  18. Thank You • Thanks to the authors for providing some of the slides

  19. PNUTS: Yahoo!’s Hosted Data Serving Platform • Brian F. Cooper et al. @ Yahoo! Research • Presented by Ying-Yi Liang • * Some slides come from the authors’ version

  20. What is the Problem • The web era: web applications • Users are picky – low latency; high availability • Enterprises are greedy – high scalability • Things move fast – new ideas expire very soon • Two ways of developing a cool web application • Making your own fire: quick, cool, but tiring and error-prone • Using huge “powerful” building blocks: wonderful, but the market may have shifted away by the time you are done • Neither way scales very well… • Something is missing – an infrastructure specially tailored for web applications!

  21. Web Application Model • Object sharing: blogs, Flickr, Web Picasa, YouTube, … • Social: Facebook, Twitter, … • Listing: Yahoo! Shopping, del.icio.us, news • They require: • High scalability, availability and fault tolerance • Acceptable latency w.r.t. geographically distributed requests • A simplified query API • Some consistency (weaker than SC)

  22. PNUTS – DB in the Cloud
        CREATE TABLE Parts (
            ID VARCHAR,
            StockNumber INT,
            Status VARCHAR
            …
        )
     • (Figure) A table of records such as (A, 42342, E), (B, 42521, W), (C, 66354, W), (D, 12352, E), (E, 75656, C), (F, 15677, E), split into fragments and replicated across sites • Parallel database • Geographic replication • Indexes and views • Structured, flexible schema • Hosted, managed infrastructure

  23. Basic Concepts • (Figure) A table laid out as records and fields: each record is identified by a primary key, each column is a typed field, and contiguous ranges of records form tablets

  24. A view from 10,000-ft

  25. PNUTS Storage Architecture • (Figure) Components: clients, REST API, routers, tablet controller, storage units, message broker

  26. Geographic Replication • (Figure) The per-region stack (REST API, routers, tablet controller, storage units) is replicated in Region 1, Region 2 and Region 3; clients enter through the REST API and regions are kept in sync via the message broker

  27. In-region Load Balance • (Figure) Tablets are moved between storage units within a region to balance load

  28. Data and Query Models • Simplified relational data model: tables of records • Typed columns • Typical data types plus the blob type • Does not enforce inter-table relationships • Operations: selection, projection (no join, aggregation, …) • Options: point access, range query, multiget
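
To make the access options concrete, here is a hypothetical client surface; every name in it is invented for illustration and is not the actual PNUTS API:

    #include <map>
    #include <optional>
    #include <string>
    #include <vector>

    using Record = std::map<std::string, std::string>;   // field name -> value

    struct Client {
        // Point access: fetch one record by primary key.
        std::optional<Record> Get(const std::string& table, const std::string& key);

        // Multiget: fetch several keys in one round trip.
        std::vector<Record> MultiGet(const std::string& table,
                                     const std::vector<std::string>& keys);

        // Range query on an ordered table: records with lo <= key < hi, optionally
        // projecting a subset of fields (no joins or aggregation).
        std::vector<Record> Scan(const std::string& table,
                                 const std::string& lo, const std::string& hi,
                                 const std::vector<std::string>& projection = {});
    };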

  29. Record Assignment • (Figure) The router holds an interval mapping from key ranges to storage units: MIN–Canteloupe → SU1, Canteloupe–Lime → SU3, Lime–Strawberry → SU2, Strawberry–MAX → SU1 • Records such as Apple, Avocado, Banana, Blueberry, Canteloupe, Grape, Kiwi, Lemon, Lime, Mango, Orange, Strawberry, Tomato, Watermelon are placed on storage units 1–3 according to this mapping
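
A minimal sketch of the interval lookup the router performs, reusing the slide's tablet boundaries; the std::map-based structure is an assumption for illustration:

    #include <iostream>
    #include <iterator>
    #include <map>
    #include <string>

    class Router {
    public:
        Router() {
            // Lower bound of each tablet's key range -> owning storage unit.
            tablets_[""] = "SU1";              // MIN .. Canteloupe
            tablets_["Canteloupe"] = "SU3";    // Canteloupe .. Lime
            tablets_["Lime"] = "SU2";          // Lime .. Strawberry
            tablets_["Strawberry"] = "SU1";    // Strawberry .. MAX
        }

        // Point lookup: the tablet whose lower bound is the greatest key <= k.
        const std::string& Locate(const std::string& k) const {
            return std::prev(tablets_.upper_bound(k))->second;
        }

    private:
        std::map<std::string, std::string> tablets_;
    };

    int main() {
        Router r;
        std::cout << r.Locate("Grape") << "\n";   // Canteloupe..Lime -> SU3
        std::cout << r.Locate("Tomato") << "\n";  // Strawberry..MAX  -> SU1
    }

A range query is answered the same way, by visiting every tablet whose interval intersects the requested range, which is what slide 31 shows for Grapefruit…Pear.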

  30. Single Point Update • (Figure) A write to key k travels from the client through the routers to the record’s master storage unit, is committed by publishing to the message brokers, which assign a sequence number for key k, and a SUCCESS response is returned to the client; the numbered arrows (1–8) trace this path

  31. Range Query • (Figure) A range query for Grapefruit…Pear is split by the router against the same interval map (MIN–Canteloupe → SU1, Canteloupe–Lime → SU3, Lime–Strawberry → SU2, Strawberry–MAX → SU1): Grapefruit…Lime goes to storage unit 3 and Lime…Pear goes to storage unit 2, which scan their tablets for matching records

  32. Relaxed Consistency • ACID transactions • Sequential consistency: too strong • Non-trivial overhead for asynchronous settings • Users can tolerate stale data in many cases • Go hybrid: eventual consistency + mechanism for SC • Use versioning to cope with asynchrony • (Figure) One record’s timeline: an insert followed by updates producing versions v.1 through v.8 (generation 1) and eventually a delete

  33. Relaxed Consistency: read_any() • (Figure) On the same version timeline (v.1 … v.8, generation 1), read_any() may return any version, stale or current

  34. Relaxed Consistency: read_latest() • (Figure) read_latest() always returns the current version

  35. Relaxed Consistency: read_critical(“v.6”) • (Figure) read_critical(“v.6”) returns a version at least as new as v.6

  36. Relaxed Consistency: write() • (Figure) write() appends a new current version to the record’s timeline

  37. Relaxed Consistency: test_and_set_write(v.7) • (Figure) test_and_set_write(v.7) performs the write only if the current version is still v.7; here it fails with ERROR
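
Pulling slides 33–37 together, the following toy single-node model illustrates the semantics of the named calls; it is not PNUTS code, which applies these operations per record across replicas through the record’s master:

    // Toy in-memory model of one record's version timeline. Assumes the record
    // has at least one version before reads are issued; version numbers play
    // the role of v.1 ... v.8.
    #include <cstdint>
    #include <map>
    #include <optional>
    #include <string>

    class RecordTimeline {
    public:
        // May legitimately return any (possibly stale) version; here, the oldest kept.
        std::string read_any() const { return versions_.begin()->second; }

        // Always returns the current (most recent) version.
        std::string read_latest() const { return versions_.rbegin()->second; }

        // Returns a version at least as new as min_version, if one exists yet.
        std::optional<std::string> read_critical(std::uint64_t min_version) const {
            if (versions_.lower_bound(min_version) == versions_.end())
                return std::nullopt;               // nothing that new is visible yet
            return versions_.rbegin()->second;     // the latest certainly qualifies
        }

        // Unconditionally appends a new version and returns its number.
        std::uint64_t write(const std::string& value) {
            std::uint64_t v = versions_.empty() ? 1 : versions_.rbegin()->first + 1;
            versions_[v] = value;
            return v;
        }

        // Writes only if the caller's expected version is still the latest
        // (the ERROR case on slide 37 is the `false` branch).
        bool test_and_set_write(std::uint64_t expected, const std::string& value) {
            if (versions_.empty() || versions_.rbegin()->first != expected) return false;
            write(value);
            return true;
        }

    private:
        std::map<std::uint64_t, std::string> versions_;   // version number -> value
    };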

  38. Membership Management • Record timelines should be coherent for each replica • Updates must be applied to the latest version • Use mastership • Per-record basis • Only one replica has mastership at any time • All update requests are sent to the master to get ordered • Routers & YMB maintain mastership information • The replica receiving frequent write requests gets the mastership • Leader election service provided by ZooKeeper
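
A sketch of the per-record mastership rule described above (illustrative only, with invented types): a replica orders and applies an update only if it currently holds mastership for that key, and otherwise forwards the request to the master.

    #include <cstdint>
    #include <map>
    #include <string>

    struct Update { std::string key, value; };

    class ReplicaRegion {
    public:
        explicit ReplicaRegion(std::string name) : name_(std::move(name)) {}

        // The routers/YMB would maintain this mapping; here it is set directly.
        void SetMaster(const std::string& key, const std::string& region) {
            master_of_[key] = region;
        }

        void ApplyOrForward(const Update& u) {
            const std::string master = master_of_[u.key];
            if (master == name_) {
                // We hold mastership: assign the next per-record sequence number
                // and commit (in PNUTS the update would then be published via the
                // message broker so other replicas apply it in the same order).
                ++seq_[u.key];
                committed_[u.key] = u.value;
            } else {
                Forward(u, master);   // hand the request to the master region
            }
        }

    private:
        void Forward(const Update&, const std::string& /*master*/) { /* RPC elided */ }

        std::string name_;
        std::map<std::string, std::string> master_of_;
        std::map<std::string, std::uint64_t> seq_;
        std::map<std::string, std::string> committed_;
    };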

  39. ZooKeeper • A distributed system is like a zoo: someone needs to be in charge of it • ZooKeeper is a highly available, scalable coordination service • ZooKeeper plays two roles in PNUTS • Coordination service • Publish/subscribe service • Guarantees: • Sequential consistency; single system image • Atomicity (as in ACID); durability; timeliness • A tiny kernel for upper-level building blocks

  40. ZooKeeper: High Availability • High availability via replication • A fault-tolerant persistent store • Providing sequential consistency

  41. ZooKeeper: Services • Publish/subscribe service • Contents stored in ZooKeeper are organized as directory trees • Publish: write to a specific znode • Subscribe: read a specific znode • Coordination via automatic name resolution • By appending a sequence number to names • CREATE(“/…/x-”, host, EPHEMERAL | SEQUENCE) • “/…/x-1”, “/…/x-2”, … • Ephemeral nodes: znodes that live only as long as the session

  42. ZooKeeper Example: Lock
        1) id = create(“…/locks/x-”, SEQUENCE | EPHEMERAL);
        2) children = getChildren(“…/locks”, false);
        3) if (children.head == id) exit();
        4) test = exists(name of last child before id, true);
        5) if (test == false) goto 2);
        6) wait for modification to “…/locks”;
        7) goto 2);
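
The same recipe expressed in C++ against a hypothetical ZooKeeper client wrapper; ZkClient and its methods are invented to mirror the slide’s pseudocode rather than the real client API, and are declared but left undefined here:

    #include <algorithm>
    #include <iterator>
    #include <string>
    #include <vector>

    struct ZkClient {
        // Creates a znode and returns its full path (with the sequence suffix).
        std::string create(const std::string& prefix, bool ephemeral, bool sequence);
        std::vector<std::string> getChildren(const std::string& path);
        bool exists(const std::string& path, bool watch);   // sets a watch if true
        void waitForWatch();                                 // blocks until the watch fires
    };

    void AcquireLock(ZkClient& zk, const std::string& dir /* e.g. ".../locks" */) {
        // 1) Ephemeral + sequential znode: its suffix is our position in the queue.
        std::string id = zk.create(dir + "/x-", /*ephemeral=*/true, /*sequence=*/true);
        for (;;) {
            // 2) List the contenders in queue order.
            std::vector<std::string> children = zk.getChildren(dir);
            std::sort(children.begin(), children.end());
            // 3) The lowest sequence number holds the lock.
            if (dir + "/" + children.front() == id) return;
            // 4) Watch only the znode immediately before ours (avoids a thundering herd).
            auto me = std::find(children.begin(), children.end(), id.substr(dir.size() + 1));
            std::string prev = dir + "/" + *std::prev(me);
            // 5)-7) If it still exists, wait for it to change, then re-check.
            if (zk.exists(prev, /*watch=*/true)) zk.waitForWatch();
        }
    }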

  43. ZooKeeper Is Powerful • Many core services in distributed systems are built on ZooKeeper • Consensus • Distributed locks (exclusive, shared) • Membership • Leader election • Job tracker binding • … • More information at http://hadoop.apache.org/zookeeper/

  44. Experimental Setup • Production PNUTS code • Enhanced with ordered table type • Three PNUTS regions • 2 west coast, 1 east coast • 5 storage units, 2 message brokers, 1 router • West: Dual 2.8 GHz Xeon, 4GB RAM, 6 disk RAID 5 array • East: Quad 2.13 GHz Xeon, 4GB RAM, 1 SATA disk • Workload • 1200-3600 requests/second • 0-50% writes • 80% locality

  45. Scalability

  46. Sensitivity to R/W Ratio

  47. Sensitivity to Request Dist.

  48. Related Work • Google BigTable/GFS • Fault-tolerance and consistency via Chubby • Strong consistency – Chubby not scalable • Lack of geographic replication support • Targeting analytical workloads • Amazon Dynamo • Unstructured data • Peer-to-peer style solution • Eventual consistency • Facebook Cassandra (still kind of a secret) • Structured storage over peer-to-peer network • Eventual consistency • Always writable property – success even in the face of a failure

  49. Discussion • Can all web applications tolerate stale data? • Is doing replication entirely across the WAN a good idea? • Single-level router vs. B+-tree-style router hierarchy • Tiny service kernel vs. standalone services • Is relaxed consistency just right or too weak? • Is exposing record versions to applications a good idea? • Should security be integrated into PNUTS? • Using the pub/sub service as undo logs