

  1. CPE 401 / 601 Computer Network Systems Distributed Hash Tables Mehmet Gunes Modified from Ashwin Bharambe, Robert Morris, Krishna Gummadi and Jinyang Li

  2. Peer to peer systems • Symmetric in node functionalities • Anybody can join and leave • Everybody gives and takes • Great potential • Huge aggregate network capacity, storage, etc. • Resilient to single point of failure and attack • E.g., Napster, Gnutella, BitTorrent, Skype, Joost

  3. The lookup problem • A publisher somewhere in the Internet does Put(key=“title”, value=file data…) and a client does Get(key=“title”): which of the nodes N1…N6 should hold the data, and how does the client find it? • At the heart of all DHTs

  4. Centralized lookup (Napster) • The publisher at N4 registers SetLoc(“title”, N4) with a central DB; the client asks the DB Lookup(“title”) and then fetches key=“title”, value=file data… from N4 • Simple, but O(N) state and a single point of failure • Not scalable • Central service needs $$$

  5. Improved #1: Hierarchical lookup • Performs hierarchical lookup (like DNS) • More scalable than a central site • Root of the hierarchy is still a single point of vulnerability

  6. Flooded queries (Gnutella) • The client floods Lookup(“title”) to its neighbors, which forward it until the publisher holding key=“title” is reached • Too much overhead traffic • Limited in scale • No guarantee for results • Robust, but worst case O(N) messages per lookup

  7. Improved #2: super peers • E.g. Kazaa, Skype • Lookup only floods super peers • Still no guarantees for lookup results

  8. DHT approach: routing • Structured: • put(key, value); value = get(key) • maintain routing state for fast lookup • Symmetric: • all nodes have identical roles • Many protocols: • Chord, Kademlia, Pastry, Tapestry, CAN, Viceroy

  9. Routed queries (Freenet, Chord, etc.) • The client’s Lookup(“title”) is routed hop by hop through intermediate nodes until it reaches the publisher holding key=“title”

  10. Hash Table • Name-value pairs (or key-value pairs) • E.g., “Mehmet Hadi Gunes” and mgunes@unr.edu • E.g., “http://cse.unr.edu/” and the Web page • E.g., “HitSong.mp3” and “12.78.183.2” • Hash table • Data structure that associates keys with values: value = lookup(key)
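As a minimal illustration, a plain single-machine hash table in Python already provides the value = lookup(key) behavior described above; the sample entries simply reuse the examples from this slide.

```python
# Minimal, single-machine hash table: the behavior a DHT later distributes.
table = {
    "Mehmet Hadi Gunes": "mgunes@unr.edu",    # name -> e-mail
    "http://cse.unr.edu/": "<the Web page>",  # URL -> content
    "HitSong.mp3": "12.78.183.2",             # file name -> host address
}

def lookup(key):
    # value = lookup(key); returns None if the key is absent
    return table.get(key)

print(lookup("HitSong.mp3"))  # -> 12.78.183.2
```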

  11. Key Values: Examples Amazon: • Key: customerID • Value: customer profile (e.g., buying history, credit card, ..) Facebook, Twitter: • Key: UserID • Value: user profile (e.g. posting history, photos, friends, …) • iCloud/iTunes: • Key: Movie/song name • Value: Movie, Song

  12. System Examples Amazon • Dynamo: internal key value store used to power Amazon.com (shopping cart) • Simple Storage System (S3) BigTable/HBase/Hypertable: distributed, scalable data storage Cassandra: “distributed data management system” (developed by Facebook) Memcached: in-memory key-value store for small chunks of arbitrary data (strings, objects) eDonkey/eMule: peer-to-peer sharing system …

  13. Distributed Hash Table • Hash table spread over many nodes • Distributed over a wide area • Main design goals • Decentralization • no central coordinator • Scalability • efficient even with large # of nodes • Fault tolerance • tolerate nodes joining/leaving

  14. Distributed hash table (DHT) • Layering: a distributed application calls put(key, data) and get(key) on the distributed hash table, which in turn uses a lookup service: lookup(key) → node IP address • DHT distributes data storage over many nodes • Each node knows little information • Low computational/memory overhead
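A small sketch of this layering, assuming a hypothetical lookup(key) service and a send(addr, message) transport (both illustrative stand-ins, not a real DHT API):

```python
# Sketch of the DHT layering: application -> put/get -> lookup service.
class DHTClient:
    def __init__(self, lookup, send):
        self.lookup = lookup    # lookup(key) -> IP address of responsible node
        self.send = send        # send(addr, message) -> reply (RPC stand-in)

    def put(self, key, data):
        addr = self.lookup(key)                  # find the node holding the key
        return self.send(addr, ("put", key, data))

    def get(self, key):
        addr = self.lookup(key)
        return self.send(addr, ("get", key))
```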

  15. Key Value Store • Also called a Distributed Hash Table (DHT) • Main idea: partition the set of key-value pairs across many machines

  16. DHT: basic idea • Each node stores a subset of the (K, V) pairs

  17. DHT: basic idea • Neighboring nodes are “connected” at the application-level

  18. DHT: basic idea • Operation: take key as input; route messages to node holding key

  19. DHT: basic idea • insert(K1,V1) enters the overlay and is routed toward the node responsible for K1

  20. DHT: basic idea • insert(K1,V1) is forwarded hop by hop toward the responsible node

  21. DHT: basic idea • (K1,V1) is now stored at the responsible node

  22. DHT: basic idea • retrieve(K1) is routed the same way to the node holding K1

  23. Distributed Hash Table • Two key design decisions • How do we map names on to nodes? • How do we route a request to that node?

  24. Hash Functions • Hashing • Transform the key into a number • And use the number to index an array • Example hash function • Hash(x) = x mod 101, mapping to 0, 1, …, 100 • Challenges • What if there are more than 101 nodes? Fewer? • Which nodes correspond to each hash value? • What if nodes come and go over time?
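A toy sketch of the slide's mod-based hashing and of the rehashing challenge it raises; Python's built-in hash() stands in for a real hash function, and the key strings are just the earlier examples.

```python
# Hash(x) = x mod 101: maps any key to a bucket in 0..100.
def bucket(key, num_buckets=101):
    return hash(key) % num_buckets   # hash() turns arbitrary keys into integers

# The challenge on this slide: changing the number of buckets moves most keys.
keys = ["HitSong.mp3", "mgunes@unr.edu", "http://cse.unr.edu/"]
before = {k: bucket(k, 101) for k in keys}
after = {k: bucket(k, 102) for k in keys}    # one bucket/node added
moved = [k for k in keys if before[k] != after[k]]
print(moved)   # with plain mod hashing, almost every key is remapped
```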

  25. Consistent Hashing • “view” = subset of hash buckets that are visible • For this conversation, “view” is O(n) neighbors • But don’t need strong consistency on views • Desired features • Balanced: in any one view, load is equal across buckets • Smoothness: little impact on hash bucket contents when buckets are added/removed • Spread: small set of hash buckets that may hold an object regardless of views • Load: across views, # objects assigned to hash bucket is small

  26. Challenges Fault Tolerance: handle machine failures without losing data and without degradation in performance Scalability: • Need to scale to thousands of machines • Need to allow easy addition of new machines Consistency: maintain data consistency in face of node failures and message losses Heterogeneity (if deployed as peer-to-peer systems): • Latency: ms to secs • Bandwidth: kb/s to Gb/s …

  27. Key Fundamental Design Idea - I • Consistent Hashing • Map keys and nodes (e.g., A, B, C, D) to an identifier space (0000000000 … 1111111111); responsibility for keys is assigned implicitly • Mapping performed using hash functions (e.g., SHA-1) • Spread nodes and keys uniformly throughout the space
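A brief sketch of this mapping step, assuming a truncated 16-bit identifier space for readability (deployed systems typically keep the full 160-bit SHA-1 output); node names and keys are hashed into the same space.

```python
import hashlib

M = 16   # identifier space 0 .. 2^M - 1 (assumed small here for readability)

def ident(name, m=M):
    # SHA-1 the name and keep the top m bits as its identifier.
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") >> (160 - m)

# Nodes and keys land in the same identifier space, spread roughly uniformly.
for name in ["A", "B", "C", "D", "HitSong.mp3"]:
    print(name, format(ident(name), "016b"))
```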

  28. Fundamental Design Idea - II • Prefix / Hypercube routing: route from the source toward the destination by matching successively longer prefixes of the destination identifier at each hop

  29. Definition of a DHT • Hash table supports two operations • insert(key, value) • value = lookup(key) • Distributed • Map hash-buckets to nodes • Requirements • Uniform distribution of buckets • Cost of insert and lookup should scale well • Amount of local state (routing table size) should scale well

  30. DHT ideas • Scalable routing needs a notion of “closeness” • Ring structure [Chord]

  31. Different notions of “closeness” • Tree-like structure [Pastry, Kademlia, Tapestry] • Node IDs such as 01100100, 01100101, 01100110, 01100111 sit in subtrees identified by prefixes such as 011000**, 011001**, 011010**, 011011** • Distance measured as the length of the longest matching prefix with the lookup key (e.g., key 01100101)
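A small sketch of prefix distance, assuming identifiers are given as bit strings of equal length:

```python
# Prefix "closeness" (Pastry/Kademlia/Tapestry style): the longer the shared
# prefix with the lookup key, the closer the node.
def shared_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

key = "01100101"
for node in ["01100100", "01100110", "01100111"]:
    print(node, shared_prefix_len(node, key))
# 01100100 shares 7 bits with the key, so it is the closest of the three
```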

  32. Different notions of “closeness” • Cartesian space [CAN]: the unit square from (0,0) to (1,1) is split into rectangular zones, e.g., (0,0)–(0.5,0.5), (0.5,0)–(0.75,0.25), (0.5,0.25)–(0.75,0.5), (0.75,0)–(1,0.5), (0,0.5)–(0.5,1), (0.5,0.5)–(1,1), each owned by a node • Distance measured as geometric distance
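A toy sketch of CAN-style closeness; the zone rectangles mirror the partition on this slide, while the node names and the owner()/distance() helpers are illustrative, not CAN's actual protocol:

```python
import math

# CAN-style "closeness": keys hash to points in the unit square and each
# node owns a rectangular zone.
zones = {                      # node -> (x1, y1, x2, y2)
    "A": (0.0, 0.0, 0.5, 0.5),
    "B": (0.5, 0.0, 0.75, 0.25),
    "C": (0.5, 0.25, 0.75, 0.5),
    "D": (0.75, 0.0, 1.0, 0.5),
    "E": (0.0, 0.5, 0.5, 1.0),
    "F": (0.5, 0.5, 1.0, 1.0),
}

def owner(point):
    # The node whose zone contains the point is responsible for the key.
    x, y = point
    for node, (x1, y1, x2, y2) in zones.items():
        if x1 <= x < x2 and y1 <= y < y2:
            return node

def distance(p, q):
    # Geometric (Euclidean) distance, used to route greedily toward a point.
    return math.dist(p, q)

print(owner((0.6, 0.3)), round(distance((0.6, 0.3), (0.1, 0.9)), 3))
```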

  33. Directory-Based Architecture • Have a node maintain the mapping between keys and the machines (nodes) that store the values associated with the keys • Example: the master/directory holds K5 → N2, K14 → N3, K105 → N50; a put(K14, V14) arrives at the master, which forwards it to N3, where (K14, V14) is stored

  34. Directory-Based Architecture • Have a node maintain the mapping between keys and the machines (nodes) that store the values associated with the keys • Example: a get(K14) arrives at the master/directory, which looks up K14 → N3, fetches V14 from N3, and returns V14 to the client
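A compact sketch of this directory-based design in the recursive style (the master relays puts and gets); Master and Node here are illustrative stand-ins for real machines and RPCs:

```python
# Directory-based key-value store, recursive style: all requests go
# through the master, which keeps the key -> node mapping.
class Node:
    def __init__(self):
        self.store = {}                 # local key -> value map

    def put(self, key, value):
        self.store[key] = value

    def get(self, key):
        return self.store.get(key)

class Master:
    def __init__(self, nodes):
        self.nodes = nodes              # name -> Node
        self.directory = {}             # key -> node name

    def put(self, key, value, node_name):
        self.directory[key] = node_name           # record the mapping
        self.nodes[node_name].put(key, value)     # relay the value

    def get(self, key):
        node_name = self.directory[key]
        return self.nodes[node_name].get(key)     # relay the request

nodes = {"N2": Node(), "N3": Node(), "N50": Node()}
master = Master(nodes)
master.put("K14", "V14", "N3")
print(master.get("K14"))   # -> V14
```

The next slides contrast this relayed (recursive) style with an iterative variant in which the master only returns the node name.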

  35. Directory-Based Architecture • Having the master relay the requests → recursive query • Another method: iterative query • Return the node to the requester and let the requester contact the node • Example: the master records K14 → N3 and answers put(K14, V14) with N3; the requester then sends put(K14, V14) to N3 directly

  36. Directory-Based Architecture • Having the master relay the requests → recursive query • Another method: iterative query • Return the node to the requester and let the requester contact the node • Example: the master answers get(K14) with N3; the requester then sends get(K14) to N3, which returns V14

  37. Discussion: Iterative vs. Recursive Query • Recursive Query • Advantages: • Faster, as typically the master/directory is closer to the nodes • Easier to maintain consistency, as the master/directory can serialize puts()/gets() • Disadvantages: scalability bottleneck, as all values go through the master/directory • Iterative Query • Advantages: more scalable • Disadvantages: slower, harder to enforce data consistency
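To make the tradeoff concrete, a sketch of the message flow in each style (traces only, no real networking); in the recursive case the value itself passes through the master, which is the scalability bottleneck mentioned above:

```python
# Message traces for the two query styles; "master" and "node" are the
# directory and the storage node from the previous sketch.
def recursive_get_trace(key):
    return [f"client -> master: get({key})",
            f"master -> node:   get({key})",     # master relays the request
            "node   -> master: value",           # value flows through master
            "master -> client: value"]

def iterative_get_trace(key):
    return [f"client -> master: lookup({key})",
            "master -> client: node address",    # master only resolves the key
            f"client -> node:   get({key})",
            "node   -> client: value"]           # value bypasses the master

for line in recursive_get_trace("K14") + iterative_get_trace("K14"):
    print(line)
```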

  38. Fault Tolerance • Replicate each value on several nodes • Usually, place replicas on different racks in a datacenter to guard against rack failures • Example: the directory maps K14 → N1, N3, and the put(K14, V14) is propagated so that both N1 and N3 store V14 (recursive replication)

  39. Fault Tolerance • Again, we can have • Recursive replication (previous slide) • Iterative replication (this slide) • Example: the master returns the replica set N1, N3, and the requester sends put(K14, V14) to both N1 and N3 itself

  40. Fault Tolerance • Or we can use recursive query and iterative replication… • The put(K14, V14) goes through the master, which then writes the value to each replica (N1 and N3) separately
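A minimal sketch of writing to a replica set, with dictionaries standing in for the replica nodes and a missing entry standing in for an unreachable node:

```python
# Replicating a put() over a replica set: contact each replica and count
# acknowledgements. Dicts stand in for remote nodes.
replicas = {"N1": {}, "N3": {}}          # live replica nodes

def replicated_put(key, value, replica_set):
    acks = 0
    for name in replica_set:
        try:
            replicas[name][key] = value  # stand-in for an RPC to the replica
            acks += 1
        except KeyError:                 # replica unreachable / failed
            pass
    return acks

print(replicated_put("K14", "V14", ["N1", "N3"]))   # -> 2 acknowledgements
```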

  41. Scalability Storage: use more nodes Number of requests: • Can serve requests from all nodes on which a value is stored in parallel • Master can replicate a popular value on more nodes Master/directory scalability: • Replicate it • Partition it, so different keys are served by different masters/directories Q: How do you partition?

  42. Scalability: Load Balancing Directory keeps track of storage availability at each node • Preferentially insert new values on nodes with more storage available What happens when a new node is added? • Move values from heavily loaded nodes to the new node What happens when a node fails? • Need to replicate values from the failed node to other nodes

  43. Consistency Need to make sure that a value is replicated correctly How do we know a value has been replicated on every node? • Wait for acknowledgements from every node What happens if a node fails during replication? • Pick another node and try again What happens if a node is slow? • Slow down the entire put()? Pick another node? In general, with multiple replicas • Slow puts and fast gets

  44. Consistency (cont’d) If there are concurrent updates (i.e., puts to the same key), we may need to make sure that the updates happen in the same order • Example: put(K14, V14’) and put(K14, V14’’) reach N1 and N3 in reverse order, so the replicas end up with different final values • What does get(K14) return? • Undefined!

  45. Consistency Key consistency: All nodes agree on who holds a particular key Data consistency: All nodes agree on the value associated with a key Reachability consistency: All nodes are connected and can be reached in a lookup. Chord aims to provide eventual reachability consistency, upon which key and data consistency are built.

  46. Quorum Consensus Improve put() and get() operation performance • Define a replica set of size N • put() waits for acknowledgements from at least W replicas • get() waits for responses from at least R replicas • W+R > N Why does it work? • Any read quorum overlaps any write quorum, so at least one node in the read set contains the latest update
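A small sketch of the quorum rule under the slide's parameters, with dictionaries standing in for the replica nodes; because W + R > N, any R responses must include at least one replica that acknowledged the latest write:

```python
import random

# Quorum sketch: N replicas, writes wait for W acks, reads ask R replicas.
N, W, R = 3, 2, 2
replicas = {name: {} for name in ["N1", "N3", "N4"]}    # replica set for K14

def quorum_put(key, value, alive):
    acked = [n for n in replicas if n in alive]          # reachable replicas
    if len(acked) < W:
        raise RuntimeError("write quorum not reached")
    for n in acked:
        replicas[n][key] = value
    return acked

def quorum_get(key):
    asked = random.sample(list(replicas), R)             # any R replicas
    answers = [replicas[n].get(key) for n in asked]
    return next((v for v in answers if v is not None), None)

quorum_put("K14", "V14", alive={"N1", "N4"})             # put on N3 fails
print(quorum_get("K14"))                                 # still returns V14
```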

  47. Quorum Consensus Example • N=3, W=2, R=2 • Replica set for K14: {N1, N3, N4} • Assume the put() on N3 fails • The put(K14, V14) still succeeds once N1 and N4 acknowledge (W=2 ACKs), so K14 → V14 is stored at N1 and N4

  48. Quorum Consensus Example • Now, issuing get() to any two nodes out of three will return the answer • e.g., get(K14) sent to N3 and N4 returns nil from N3 but V14 from N4

  49. Scaling Up Directory Challenge: • Directory contains a number of entries equal to the number of (key, value) tuples in the system • Can be hundreds of billions of entries in the system! Solution: consistent hashing Associate to each node a unique id in a one-dimensional space 0..2^m-1 • Partition this space across the machines • Assume keys are in the same one-dimensional space • Each (Key, Value) is stored at the node with the smallest ID larger than Key
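A brief sketch of this consistent-hashing rule, with SHA-1 reduced to a 32-bit identifier space and illustrative node names; each key is stored at the node whose ID is the smallest one larger than the key's ID, wrapping around the ring:

```python
import bisect
import hashlib

M = 32                          # identifier space 0 .. 2^M - 1 (assumed)

def h(name):
    # Hash a node name or key into the identifier space.
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big") % (2 ** M)

node_names = ["N1", "N2", "N3", "N50"]
node_ids = sorted(h(n) for n in node_names)
id_to_node = {h(n): n for n in node_names}

def responsible_node(key):
    # Smallest node ID larger than the key's ID, wrapping around the ring.
    kid = h(key)
    i = bisect.bisect_right(node_ids, kid)
    return id_to_node[node_ids[i % len(node_ids)]]

print(responsible_node("K14"))     # which node stores (K14, V14)
```

When a node joins or leaves, only the keys between it and its predecessor on the ring change owner, which is the "smoothness" property discussed earlier.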
