CS 294-8 • Distributed Data Structures • http://www.cs.berkeley.edu/~yelick/294
Agenda • Overview • Interface Issues • Implementation Techniques • Fault Tolerance • Performance
Overview • Distributed data structures are an obvious abstraction for distributed systems. Right? • What do you want to hide within one? • Data layout? • When communication is required? • Number and location of replicas • Load balancing
Distributed Data Structures • Most of these are containers • Two fundamentally different kinds: • Those with iterators or the ability to look at all container elements • Arrays, meshes, databases*, graphs* and trees* (sometimes) • Those with only single-element ops • Queue, directory (hash table or tree), all starred (*) items above
DDS in Ninja • Described in Gribble, Brewer, Hellerstein, Culler • A distributed data structure (DDS) is a self-managing layer for persistent data. • High availability, concurrency, consistency, durability, fault tolerance, scalability • A distributed hash table is an example • Uses two-phase commits for consistency • Partitioning for scalability
Scheduling Structures • In serial code, most scheduling is done with a stack (often implicit), a FIFO queue, or a priority queue • Do all of these make sense in a distributed setting? • Are there others?
Distributed Queues • Load balancing (work stealing…) • Push new work onto a stack • Execute locally by popping from the stack • Steal remotely by removing from the bottom of the stack (FIFO)
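The push/pop/steal discipline above can be sketched with a double-ended queue. This is a minimal single-process illustration, not a distributed implementation: the class name `WorkStealingDeque` and the coarse lock are assumptions made for the example, and real work-stealing schedulers typically use finer-grained or lock-free synchronization.

```python
from collections import deque
import threading


class WorkStealingDeque:
    """One worker's task pool (hypothetical name, illustration only).

    The owner treats it as a stack (LIFO); thieves take from the
    opposite end, so they see the oldest work first (FIFO).
    """

    def __init__(self):
        self._tasks = deque()
        self._lock = threading.Lock()  # coarse lock, chosen for simplicity

    def push(self, task):
        """Owner: push new work onto the top of the stack."""
        with self._lock:
            self._tasks.append(task)

    def pop(self):
        """Owner: execute locally by popping the newest task."""
        with self._lock:
            return self._tasks.pop() if self._tasks else None

    def steal(self):
        """Thief: remove the oldest task from the bottom of the stack."""
        with self._lock:
            return self._tasks.popleft() if self._tasks else None


pool = WorkStealingDeque()
for task in ["a", "b", "c"]:
    pool.push(task)

local_task = pool.pop()     # newest task, taken by the owner
stolen_task = pool.steal()  # oldest task, taken by a remote thief
```

Stealing from the bottom tends to hand thieves the oldest tasks, which are often the largest in divide-and-conquer codes, while the owner keeps working on recently created, cache-warm tasks.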
Interfaces (1) • Blocking atomic interfaces: operations happen between invocation and return • Internally each operation performs locking or other form of synchronization • Non-blocking “atomic” interfaces: operation happens sometime after invocation • Often paired with completion synchronization • Request/response for each operation • Wait for all “my” operations to complete • Wait for all operations in the world to complete
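A rough sketch of the two interface styles above, assuming a single-process table backed by a thread pool. The class `AsyncTable` and its method names are hypothetical; `put_blocking` stands for the blocking atomic interface, `put_async` for the non-blocking one, and `wait_all_mine` for "wait for all my operations to complete".

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait


class AsyncTable:
    """Hypothetical table offering blocking and non-blocking puts."""

    def __init__(self, workers=4):
        self._data = {}
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._pending = []  # futures issued by this (single) client

    def put_blocking(self, key, value):
        """Blocking atomic: the update takes effect before we return."""
        with self._lock:
            self._data[key] = value

    def put_async(self, key, value):
        """Non-blocking: the update happens sometime after invocation."""
        future = self._pool.submit(self.put_blocking, key, value)
        self._pending.append(future)
        return future  # per-operation request/response handle

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def wait_all_mine(self):
        """Completion synchronization over all operations issued so far."""
        pending, self._pending = self._pending, []
        wait(pending)

    def close(self):
        self._pool.shutdown(wait=True)


table = AsyncTable()
handle = table.put_async("x", 1)
handle.result()            # wait for this one operation
for i in range(10):
    table.put_async(i, i * i)
table.wait_all_mine()      # wait for everything this client issued
table.close()
```

Waiting for all operations "in the world" would need a global barrier across clients, which this single-client sketch does not attempt.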
Interfaces (2) • Non-atomic interface: use external synchronization • Undefined under certain kinds of (or all) concurrency • May be paired with bracketing synchronization • Acquire-insert-lock, insert, insert, Release-insert-lock • Begin-transaction… • Operations with no semantics (no-ops) • Prefetch, Flush copies, … • Operations that allow for failures • Signal “failed”
DDS Interfaces • Contrast: • RDBMSs provide ACID semantics on transactions • Distributed file systems: NFS weak, Frangipani and AFS stronger • DDS: • All operations on elements are atomic (indivisible, all or nothing) • This seems to mean that the hash table operations that involve a single element are atomic • One-copy equivalence: replication of elements is invisible • No transaction across elements or operations
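The acquire/insert/insert/release bracket can be sketched as a context manager. All names here (`UnsyncSet`, `insert_phase`, `prefetch`) are made up for illustration; the point is only that `insert` does no synchronization itself and relies on the caller to bracket it.

```python
import threading
from contextlib import contextmanager


class UnsyncSet:
    """Hypothetical container with a non-atomic insert."""

    def __init__(self):
        self._items = set()
        self._insert_lock = threading.Lock()

    @contextmanager
    def insert_phase(self):
        """Bracketing synchronization: acquire-insert-lock ... release."""
        with self._insert_lock:
            yield self

    def insert(self, item):
        """No internal locking; behavior is undefined outside a bracket
        if other threads insert concurrently."""
        self._items.add(item)

    def prefetch(self, item):
        """Operation with no semantics: a performance hint only."""
        return None

    def __contains__(self, item):
        return item in self._items


s = UnsyncSet()
with s.insert_phase():
    s.insert("a")
    s.insert("b")
s.prefetch("c")  # has no visible effect on the contents
```

One lock acquisition covers a whole batch of inserts, which is the usual motivation for pushing synchronization outside the data structure.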
Implementation Strategies (1) • Two simple techniques • Partitioning: • Used when the data structure is large • Used when writes/updates are frequent • Replication: • Used when writes are infrequent and reads are very frequent • Used to tolerate failures • Full static replication is extreme; dynamic partial replication is more common • Many hybrids and variations
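As a small illustration of combining the two techniques, the sketch below hashes each key to exactly one partition and keeps a fixed number of in-memory replicas per partition. The function `partition_of`, the class `PartitionedReplicatedStore`, and the use of SHA-256 for placement are assumptions made for the example, not something the slides prescribe.

```python
import hashlib


def partition_of(key: str, num_partitions: int) -> int:
    """Stable placement: hash the key and reduce modulo the partition count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class PartitionedReplicatedStore:
    """Hypothetical store: each dict stands in for one replica on one node."""

    def __init__(self, num_partitions: int, replicas_per_partition: int):
        self.partitions = [
            [{} for _ in range(replicas_per_partition)]
            for _ in range(num_partitions)
        ]

    def _group(self, key: str):
        return self.partitions[partition_of(key, len(self.partitions))]

    def put(self, key: str, value):
        """Writes touch every replica of one partition (cost grows with replicas)."""
        for replica in self._group(key):
            replica[key] = value

    def get(self, key: str, replica_index: int = 0):
        """Reads can be served by any single replica of the partition."""
        group = self._group(key)
        return group[replica_index % len(group)].get(key)


store = PartitionedReplicatedStore(num_partitions=4, replicas_per_partition=3)
store.put("alice", 1)
value = store.get("alice", replica_index=2)
```

More partitions spread write load and capacity; more replicas per partition spread read load and tolerate failures, at the price of more work per write.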
Implementation Strategies (2) • Moving data to computation good for: • dynamic load balancing • I.e., idle processors grab work • smaller objects in ops involving > 1 object • Moving computation to data good for: • large data structures • Other?
DDS: Distributed Hash Table • Operations include: • Create, Destroy • Put, Get, and Remove • Built with storage “bricks” • Each manages a single-node, network-visible hash table • Each contains a buffer cache, lock manager, network stubs and skeletons • Data is partitioned, and partitions are replicated • Replica groups are used for each partition
DDS: Distributed Hash Table • Operations on elements: • Get – use any replica in the appropriate group • Put or remove – update all replicas in the group using two-phase commit • The DDS library is the commit coordinator • If an individual node crashes during the commit phase, it is removed from the replica group • If the DDS library fails during the commit phase, the individual nodes coordinate among themselves: if any have committed, all must
DDS: Hash Table • [Figure: lookup of example key 110011 through the DP map (branches labeled 0/1) and the RG map]
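The put path can be sketched as a two-phase commit over one replica group. This is a simplified in-memory model, not the Ninja implementation: `Replica`, `put`, `get`, and the `fail_commit` flag are hypothetical, and aborting when a replica is already down during the prepare phase is an assumption the slide does not state. Recovery when the coordinator itself fails is not modeled.

```python
import random


class Replica:
    """Hypothetical stand-in for one storage brick in a replica group."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.staged = {}
        self.alive = True
        self.fail_commit = False  # test hook: crash during the commit phase

    def prepare(self, key, value):
        if not self.alive:
            return False
        self.staged[key] = value
        return True

    def commit(self, key):
        if self.fail_commit or not self.alive:
            self.alive = False
            raise ConnectionError(self.name)
        self.data[key] = self.staged.pop(key)

    def abort(self, key):
        self.staged.pop(key, None)


def put(group, key, value):
    """Coordinator side (the DDS library's role): update all replicas."""
    # Phase 1: every replica must stage the update (abort here is assumed).
    if not all(r.prepare(key, value) for r in group):
        for r in group:
            r.abort(key)
        return False
    # Phase 2: commit; a replica that crashes now is dropped from the group.
    for r in list(group):
        try:
            r.commit(key)
        except ConnectionError:
            group.remove(r)
    return True


def get(group, key):
    """Reads may use any replica in the group."""
    return random.choice(group).data.get(key)


group = [Replica("a"), Replica("b"), Replica("c")]
ok = put(group, "k", 1)
```

Because every surviving replica has committed by the time `put` returns, a later `get` sees the same value regardless of which replica it picks, which is the one-copy equivalence property in miniature.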
Example: Aleph Directory • Maps names to mobile objects • Files, locks (?), processes,… • Interested in performance at scale, not reliability • Two basic protocols: • Home: each object has a fixed “home” PE that keeps track of cache copies • Arrow: based on path-reversal idea
Path Reversal Find
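A minimal sequential sketch of a path-reversal find, assuming the arrow variant in which every node keeps one pointer along a tree and a node whose pointer refers to itself is the current tail (owner). The function name `arrow_find` and the dictionary representation are illustrative; concurrent requests and the actual message transport are not modeled.

```python
def arrow_find(link, requester):
    """Follow arrows from `requester` to the current tail, reversing each one.

    `link[v]` is the neighbor that node v points to, or v itself if v is
    the tail. Returns (previous_tail, hops), where hops counts the request
    messages sent along the path.
    """
    if link[requester] == requester:
        return requester, 0          # requester already holds the object

    prev, cur = requester, link[requester]
    link[requester] = requester      # requester becomes the new tail
    hops = 1
    while link[cur] != cur:
        nxt = link[cur]
        link[cur] = prev             # flip the arrow back toward the requester
        prev, cur = cur, nxt
        hops += 1
    link[cur] = prev                 # the old tail now points toward the requester
    return cur, hops


# Chain 0 - 1 - 2 - 3 with the object initially at node 3.
link = {0: 1, 1: 2, 2: 3, 3: 3}
owner, hops = arrow_find(link, 0)
```

After the find, every arrow on the path leads to the requester, so a later request from any node reaches the new tail by following pointers again.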
Aleph Directory Performance • Aleph is implemented as Java packages on top of RMI (and UDP?) • Run on small systems (up to 16 nodes) • Assumed that the centralized “home” solution would be faster at this scale • 2 messages to request; 2 to retrieve • Arrow was actually faster • log2 p messages to request; 1 to retrieve • In practice, only 2 to request (on the counter example)
Hybrid Directory Protocol • Essentially the same as the “home” protocol, except • Link waiting processors into a chain (across the processors) • Each keeps the id of the processor ahead of it in the chain • Under high contention, resource moves down the chain • Performance: • Faster than home and arrow on counter benchmark and some others…
How Many Data Structures? • Gribble et al. claim: • “We believe that given a small set of DDS types (such as a hash table, a tree, and an administrative log), authors will be able to build a large class of interesting and sophisticated servers.” • Do you believe this? • What does it imply about tools vs. libraries?
Administrivia • Gautam Kar and Joe L. Hellerstein speaking Thursday • Papers online • Contact me about meeting with them • Final projects: • Send mail to schedule meeting with me • Next week: • Tuesday: guest lecture by Aaron Brown on benchmarks; related to Kar and Hellerstein work. • Still to come: Gray, Lamport, and Liskov