
Distributed Architectures


Presentation Transcript


  1. Distributed Architectures

  2. Introduction • Computing everywhere: • Desktop, Laptop, Palmtop • Cars, Cellphones • Shoes? Clothing? Walls? • Connectivity everywhere: • Rapid growth of bandwidth in the interior of the net • Broadband to the home and office • Wireless technologies such as CDMA, satellite, laser • There is a need for persistent storage. • Is it possible to provide an Internet-based distributed storage system?

  3. System Requirements • Key requirements: • Be able to deal with the intermittent connectivity of some computing devices • Information must be kept secure from theft and denial of service attacks. • Information must be extremely durable. • Information must be separate from location. • Access to information should be uniform and highly available.

  4. Implications of Requirements • Don’t want to worry about backup • Don’t want to worry about obsolescence • Need lots of resources to make data secure and highly available, BUT don’t want to own them • Outsourcing of storage already becoming popular • Pay monthly fee and your “data is out there” • Simple payment interface: one bill from one company

  5. Global Persistent Store • The persistent store should be: • Transparent: permits behavior to be independent of the devices themselves • Consistent: allows users to safely access the same information from many different devices simultaneously • Reliable: devices can be rebooted or replaced without losing vital configuration information

  6. Applications • Groupware and Personal Information Management Tools • Examples: Calendar, email, contact lists, distributed design tools. • Must allow for concurrent updates from many people • Users must see an ever-progressing view of shared information even when conflicts occur. • Digital libraries and repositories for scientific data. • Require massive quantities of storage. • Replication for durability and availability is desirable.

  7. Questions about Persistent Information • Where is persistent information stored? • Want: Geographic independence for availability, durability, and freedom to adapt to circumstances • How is it protected? • Want: Encryption for privacy, signatures for authenticity, and Byzantine commitment for integrity • Can we make it indestructible? • Want: Redundancy with continuous repair and redistribution for long-term durability • Is it hard to manage? • Want: Automatic optimization, diagnosis and repair • Who owns the aggregate resources? • Want: Utility infrastructure!

  8. OceanStore: Utility-based Infrastructure • Transparent data service provided by a federation of companies: • Monthly fee paid to one service provider • Companies buy and sell capacity from each other • [Figure: a federation of providers such as Canadian OceanStore, Sprint, AT&T, IBM, and Pac Bell exchanging capacity]

  9. OceanStore: Everyone’s Data, One Big Utility. “The data is just out there” • How many files in the OceanStore? • Assume 10^10 people in world • Say 10,000 files/person (very conservative?) • So 10^14 files in OceanStore! • If files are 1 gig each (ok, a stretch), that is 10^23 bytes, roughly one mole of bytes (one mole ≈ 6.02 × 10^23)!

  10. OceanStore Goals • Untrusted infrastructure • Only clients can be trusted • Servers can crash, or leak information to third parties • Most of the servers are working correctly most of the time • Class of trusted servers that can carry out protocols on the clients’ behalf (financially liable for integrity of data) • Nomadic Data Access • Data can be cached anywhere, anytime (promiscuous caching) • Continuous introspective monitoring to locate data close to the user

  11. OceanStore: Everyone’s Data, One Big Utility. “The data is just out there” • Separate information from location • Locality is only an optimization (an important one!) • Wide-scale coding and replication for durability • All information is globally identified • Unique identifiers are hashes over names & keys • Single uniform lookup interface replaces: DNS, server location, data location • No centralized namespace required

  12. OceanStore Assumptions • Untrusted Infrastructure: • The OceanStore is composed of untrusted components • Only ciphertext within the infrastructure • Responsible Party: • Some organization (e.g., a service provider) guarantees that your data is consistent and durable • Not trusted with content of data, merely its integrity • Mostly Well-Connected: • Data producers and consumers are connected to a high-bandwidth network most of the time • Exploit multicast for quicker consistency when possible • Promiscuous Caching: • Data may be cached anywhere, anytime • Optimistic Concurrency via Conflict Resolution: • Avoid locking in the wide area • Applications use object-based interface for updates

  13. Naming • At the lowest level, OceanStore objects are identified by a globally unique identifier (GUID). • A GUID is a pseudorandom, fixed-length bit string. • The GUID does not contain any location information and is not human readable. • Passwd vs. 12agfs237fundfhg666abcdefg999ldfnhgga • Generation of GUIDs is adapted from the concept of self-certifying pathnames, which inherently specify all the information necessary to communicate securely with remote file servers (e.g., a network address and a public key).

  14. Naming • Create hierarchies using “directory” objects • To allow arbitrary directory hierarchies to be built, OceanStore allows directories to contain pointers to other directories. • A user can choose several directories as “roots” and secure those directories through external methods, such as a public key authority. • These root directories are only roots with respect to the clients using them. The system itself has no root. • An object GUID is a secure hash (160-bit SHA-1) of the owner’s key and some human-readable name
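
A minimal sketch of this derivation in Python (the key encoding and the concatenation order are illustrative assumptions; the slides only specify a secure hash over the owner's key and a name):

    import hashlib

    def make_guid(owner_public_key: bytes, name: str) -> bytes:
        # 160-bit GUID: SHA-1 over the owner's public key and a
        # human-readable name. The concatenation is an assumed encoding.
        return hashlib.sha1(owner_public_key + name.encode("utf-8")).digest()

    guid = make_guid(b"<owner public key bytes>", "passwd")
    print(guid.hex())  # 40 hex digits = 160 bits; no location info, not human readable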

  15. Naming • Is 160 bits enough? • Good enough for now • By the birthday bound, over 2^80 unique objects (the square root of the 2^160 namespace) are required before collisions become worrisome.

  16. Access Control • Restricting Readers • To prevent unauthorized reads, all data in the system that is not completely public is encrypted, and the encryption key is distributed to those readers with read permission. • To revoke read permission: • Data should be deleted from all replicas • Data should be re-encrypted • New keys should be distributed • Clients can still access old data until it is deleted from all replicas.
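
A toy sketch of this read-control scheme (OceanStore does not prescribe a particular cipher; Fernet, an authenticated symmetric scheme from the Python cryptography package, is a stand-in here):

    from cryptography.fernet import Fernet

    # One symmetric key per object; possessing the key is read permission.
    read_key = Fernet.generate_key()
    ciphertext = Fernet(read_key).encrypt(b"private object contents")

    # Servers hold only ciphertext; authorized readers receive read_key
    # out of band and decrypt locally.
    plaintext = Fernet(read_key).decrypt(ciphertext)

    # Revocation: re-encrypt under a fresh key and distribute it. Until old
    # replicas are deleted, revoked readers can still decrypt cached copies.
    new_key = Fernet.generate_key()
    ciphertext = Fernet(new_key).encrypt(plaintext)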

  17. Access Control • Restricting Writers • All writes are signed to prevent unauthorized writes. • Validity is checked against Access Control Lists (ACLs). • The owner can securely choose the ACL x for an object foo by providing a signed certificate that translates to “Owner says use ACL x for object foo” • If A trusts B, B trusts C and C trusts D, then does A trust D? • Depends on the policy! • Needs work
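
A hedged sketch of signed writes checked against an ACL, using Ed25519 from the Python cryptography package (representing the ACL as a set of raw public keys is an assumption for illustration):

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives import serialization
    from cryptography.hazmat.primitives.asymmetric.ed25519 import (
        Ed25519PrivateKey, Ed25519PublicKey)

    writer_key = Ed25519PrivateKey.generate()
    writer_pub = writer_key.public_key().public_bytes(
        serialization.Encoding.Raw, serialization.PublicFormat.Raw)

    acl = {writer_pub}  # toy ACL: raw public keys authorized to write "foo"

    def accept_write(pub: bytes, update: bytes, signature: bytes) -> bool:
        if pub not in acl:                 # writer not on the object's ACL
            return False
        try:                               # signature must match the update
            Ed25519PublicKey.from_public_bytes(pub).verify(signature, update)
            return True
        except InvalidSignature:
            return False

    update = b"foo := bar"
    assert accept_write(writer_pub, update, writer_key.sign(update))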

  18. Data Routing and Location • Entities are free to reside on any of the OceanStore servers. • This provides maximum flexibility in selecting policies for replication, availability, caching and migration. • It also makes the process of locating and interacting with objects more complex.

  19. Routing • Address an object by its GUID • Message label: GUID, random number; no destination IP address. • OceanStore routes the message to the closest GUID replica. • OceanStore combines data location and routing: • No name service to attack • Save one round-trip for location discovery

  20. Two-Tiered Approach to Routing • Fast, probabilistic search: • Built from attenuated Bloom filters • Why? This works well if frequently accessed items reside close to where they are being used. • Slow, guaranteed search based on the Plaxton mesh data structure used as the underlying routing infrastructure • Messages begin by routing from node to node along the distributed data structure until the destination is discovered.

  21. Attenuated Bloom Filter • Fast, probabilistic search algorithm based on Bloom filters. • A Bloom filter is a randomized data structure • Strings are stored using multiple hash functions • It can be queried to check the presence of a string • Membership queries result in rare false positives but never false negatives.

  22. Attenuated Bloom Filter • Bloom filter • Assume k hash functions h1, h2, …, hk • A Bloom filter is a bit vector of length w. • On an input X, the filter computes the k hash values h1(X), …, hk(X) and sets the corresponding bits. • [Figure: input X hashed by h1, h2, h3, …, hk, each hash setting one bit of the vector]

  23. Attenuated Bloom Filter • To determine whether the set represented by a Bloom filter contains a given element, that element is hashed and the corresponding bits in the filter are examined. • If all of the bits are set, the set may contain the object. • False positive probability decreases exponentially with linear increase in the number of hash functions and memory.
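
A compact illustration of these properties (the width and hash count are arbitrary; the k hash functions are simulated by salting SHA-1 with the function index):

    import hashlib

    class BloomFilter:
        """Bit vector of width w queried through k hash functions."""
        def __init__(self, w: int = 1024, k: int = 4):
            self.w, self.k, self.bits = w, k, 0

        def _positions(self, item: str):
            # Derive k hash positions by salting SHA-1 with the index i.
            for i in range(self.k):
                h = hashlib.sha1(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(h, "big") % self.w

        def add(self, item: str):
            for p in self._positions(item):
                self.bits |= 1 << p

        def may_contain(self, item: str) -> bool:
            # True may be a false positive; False is never wrong.
            return all(self.bits >> p & 1 for p in self._positions(item))

    f = BloomFilter()
    f.add("guid-1234")
    assert f.may_contain("guid-1234")      # no false negatives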

  24. Attenuated Bloom Filter • Attenuated Bloom filter • An array of d Bloom filters • Associate each neighbor link with an attenuated Bloom filter. • The first filter in the array summarizes documents available from that neighbor (one hop along the link). • The i-th Bloom filter is the union of the Bloom filters of all nodes that are a distance of i hops along the link.

  25. Attenuated Bloom Filter • [Figure: example network of nodes n1–n4 and their filters; rounded boxes show each node’s local Bloom filter, unrounded boxes the attenuated Bloom filters on its links; numbered arrows 1, 2, 3, 4a, 4b, 5 trace the query described on the next slide]

  26. Attenuated Bloom Filter • Example • Assume that an object whose GUID hashes to bits 0, 1, and 3 is being searched for. The steps are as follows: • The local Bloom filter for n1 shows that it does not have the object. • The attenuated Bloom filter at n1 is used to determine which neighbor may have the object. • The query moves to n2, whose Bloom filter indicates that it does not have the document. • Examining the attenuated Bloom filter shows that n4 doesn’t have the document but that n3 may. • The query is forwarded to n3, which verifies that it has the object.

  27. Attenuated Bloom Filter • Routing • Performing a location query: • The querying node examines the first level of each of its neighbors’ attenuated Bloom filters. • If one of the filters matches, it is likely that the desired data item is only one hop away, and the query is forwarded to the matching neighbor closest to the current node in network latency. • If no filter matches, the querying node looks for a match in the second level of every filter. • If a match is found, the query is forwarded to the matching neighbor of lowest latency. • If the data can’t be found this way, fall back to the deterministic search.
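
A sketch of that query loop, building on the BloomFilter class above (the neighbor-table layout and the latency tie-breaking are simplifying assumptions):

    class AttenuatedBloomFilter:
        """Array of d Bloom filters for one link; level i summarizes
        documents reachable i+1 hops along that link."""
        def __init__(self, d: int):
            self.levels = [BloomFilter() for _ in range(d)]

    def next_hop(neighbors: dict, item: str):
        """neighbors: neighbor id -> (latency, AttenuatedBloomFilter).
        Scan shallow levels first; break ties by lowest latency."""
        depth = max(len(f.levels) for _, f in neighbors.values())
        for level in range(depth):
            matches = [(lat, n) for n, (lat, f) in neighbors.items()
                       if level < len(f.levels)
                       and f.levels[level].may_contain(item)]
            if matches:
                return min(matches)[1]     # lowest-latency matching neighbor
        return None   # no probabilistic hit: fall back to deterministic search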

  28. Deterministic Routing and Location • Based on Plaxton trees • A Plaxton tree is a distributed tree structure where every node is the root of a tree. • Every server in the system is assigned a random node identifier (node-ID). • Each object has a root node: • f(GUID) = RootID • The root node is responsible for storing the object’s location

  29. Deterministic Routing and Location • Local routing maps are used by each node. These are referred to as neighbor maps. • Nodes keep “nearest” neighbor pointers differing in one digit. • Each node is connected to other nodes via neighbor links of various levels. • Level-1 edges from a given node connect to the 16 nodes closest (in network latency) with different values in the lowest digit of their addresses.

  30. Deterministic Routing and Location • Level-2 edges connect to the 16 closest nodes that match in the lowest digit and have different second digits, and so on. • Neighborhood links provide a route from every node to every other node in the system: • Resolve the destination node address one digit at a time, using a level-1 edge for the first digit, a level-2 edge for the second, etc. • A Plaxton mesh refers to a set of trees where there is a tree rooted at every node.

  31. Deterministic Routing and Location • Lookup map for node 0642 (hexadecimal IDs; only next-digit values 0–7 shown). Columns run from 3 suffix digits equal down to 0 digits equal; each row fixes the most significant unequal digit:

      3 equal  2 equal  1 equal  0 equal
      0642     x042     xx02     xxx0
      1642     x142     xx12     xxx1
      2642     x242     xx22     xxx2
      3642     x342     xx32     xxx3
      4642     x442     xx42     xxx4
      5642     x542     xx52     xxx5
      6642     x642     xx62     xxx6
      7642     x742     xx72     xxx7

  • Example: route from ID 0694 to 5442: 0694 → 0692 → 0642 → 0442 → 5442 • Source: Tapestry: An Infrastructure for Fault-tolerant Wide-area Location & Routing, J. Kubiatowicz and A. Joseph, UC Berkeley Technical Report

  32. Global Routing Algorithm • Move to the closest node that shares at least one more digit with the target destination • Example: route from 0325 to 4598: 0325 → B4F8 (L1) → 9098 (L2) → 7598 (L3) → 4598 (L4) • [Figure: routing mesh also showing nodes 3E98, 2BB8, 0098, 1598, D598, 2118 and 87CA on alternative paths]

  33. Deterministic Routing and Location • This routing method guarantees that any existing unique node in the system will be found in at most log_b N logical hops, in a system with an N-size namespace using IDs of base b.
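
A small sketch of this digit-at-a-time suffix resolution (the neighbor-map layout is an assumed simplification, and the toy map below contains only the entries needed to reproduce the slide 31 example):

    def shared_suffix_len(a: str, b: str) -> int:
        # Number of trailing digits a and b agree on (IDs of equal length).
        n = 0
        while n < len(a) and a[-1 - n] == b[-1 - n]:
            n += 1
        return n

    def route(source: str, target: str, neighbor_maps: dict) -> list:
        # neighbor_maps[node][level][digit]: nearest neighbor sharing
        # `level` suffix digits with the target and having `digit` next.
        path, node = [source], source
        while node != target:
            level = shared_suffix_len(node, target)   # digits already resolved
            digit = target[-1 - level]                # next digit to resolve
            node = neighbor_maps[node][level][digit]
            path.append(node)
        return path

    maps = {"0694": {0: {"2": "0692"}}, "0692": {1: {"4": "0642"}},
            "0642": {2: {"4": "0442"}}, "0442": {3: {"5": "5442"}}}
    print(route("0694", "5442", maps))
    # ['0694', '0692', '0642', '0442', '5442']: one digit fixed per hop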

  34. Deterministic Routing and Location • A server S publishes that it has an object O by routing a message to the “root node” of O. • The publishing process consists of sending a message toward the root node. • At each hop along the way, the publish message stores location information in the form of a mapping <Object-ID(O), Server-ID(S)>. • These mappings are simply pointers to the server S where O is being stored.

  35. Deterministic Routing and Location • During a location query, clients send messages to objects. A message destined for O is initially routed toward O’s root. • At each step, if the message encounters a node that contains the location mapping for O, it is immediately redirected to the server containing the object. • Otherwise the message is forwarded one step closer to the root.
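
The publish/locate pair from slides 34 and 35 can be sketched as follows (the paths would come from the routing sketch above; an in-memory dictionary stands in for per-node pointer storage):

    location_map = {}   # node id -> {object GUID: server id} pointer cache

    def publish(server: str, guid: str, path_to_root: list):
        # Leave a <Object-ID, Server-ID> pointer at every hop to the root.
        for node in path_to_root:
            location_map.setdefault(node, {})[guid] = server

    def locate(guid: str, path_to_root: list):
        # Walk toward the root; redirect at the first pointer encountered.
        for node in path_to_root:
            server = location_map.get(node, {}).get(guid)
            if server is not None:
                return server
        return None   # unreachable if the object was published to this root

    publish("S1", "guid-1234", ["0692", "0642", "0442", "5442"])
    print(locate("guid-1234", ["0642", "0442", "5442"]))   # 'S1'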

  36. Deterministic Routing and Location • Where multiple copies of data exist in Plaxton, each node en route to the root node stores locations of all replicas. • The replica chosen may be defined by the application.

  37. Achieving Fault Tolerance • If each object has a single root then it becomes a single point of failure, the potential subject of denial of service attacks, and an availability problem. • This is addressed by having multiple roots which can be achieved by hashing each GUID with a small number of different salt values. • The result maps to several different root nodes, thus gaining redundancy and making it difficult to target for denial of service.
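
A sketch of the salted-root idea (the salt count and encoding are illustrative; the point is that one GUID deterministically maps to several independent roots):

    import hashlib

    def root_ids(guid: bytes, n_roots: int = 5) -> list:
        # Hash the GUID with small salt values to derive several root IDs;
        # an attacker must now take down every root, not just one.
        return [hashlib.sha1(guid + bytes([salt])).hexdigest()
                for salt in range(n_roots)]

    print(root_ids(b"guid-1234"))   # five independent, repeatable root IDs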

  38. Achieving Fault Tolerance • This scheme is sensitive to corruption in the links and pointers. • Bad links can be immediately detected and routing can be continued by jumping to a random neighbor node. • Each tree spans every node, hence any node should be able to reach the root.

  39. Update Model and Consistency Resolution • Depending on the application, there could be a high degree of write sharing. • We do not really want to do wide-area locking.

  40. Update Model and Consistency Resolution • The process of conflict resolution starts with a series of updates, chooses a total order among them, then applies them atomically in that order. • The easiest way to compute this order is to require that all updates pass through a master replica. • All other replicas can be thought of as read caches. • Problems: • Does not scale • Vulnerable to denial of service (DoS) attacks • Requires trusting a single replica

  41. Update Model and Consistency Resolution • The notion of distributed consistency solves the DoS/trust issues: • Replace the master replica with a primary tier of replicas. • These replicas cooperate with one another in a Byzantine agreement protocol to choose the final commit order for updates. • Problem: • The message traffic needed by Byzantine agreement protocols is high.

  42. Update Model and Consistency Resolution • A two-tier architecture is used. • The primary tier is where replicas use Byzantine agreement. • A secondary tier acts as a distributed read/write cache. • The secondary tier is kept up-to-date via “push” or “pull”

  43. The Path of an Update • Client creates and signs an update against an object • Client sends the update to the primary tier and several random secondary tier replicas • While the primary tier applies the update against the master copy, the secondary tier propagates the update epidemically. • The primary tier’s decision is passed to all replicas

  44. Update Model and Consistency Resolution • The secondary-tier replicas communicate among themselves and with the primary tier using a protocol that implements an epidemic algorithm. • Epidemic algorithms propagate updates to all replicas in as few messages as possible. • A process P picks another server Q at random and subsequently exchanges updates with Q.
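
A minimal anti-entropy sketch of this pick-a-random-peer exchange (real protocols exchange compact summaries rather than whole update sets; this is a simplification):

    import random

    def anti_entropy_round(replicas: dict):
        # One epidemic round: every replica P picks a random peer Q, and
        # the pair reconcile by taking the union of their update sets.
        nodes = list(replicas)
        for p in nodes:
            q = random.choice([n for n in nodes if n != p])
            merged = replicas[p] | replicas[q]
            replicas[p] = replicas[q] = merged

    replicas = {"A": {"update-1"}, "B": {"update-2"}, "C": set()}
    anti_entropy_round(replicas)
    print(replicas)   # updates spread to all replicas within a few rounds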

  45. The Path of an Update

  46. Update Model and Consistency Resolution • Note that secondary replicas contain both tentative and committed data. • The epidemic-style communication pattern is used to quickly spread tentative commits among the secondary replicas and to pick a tentative serialization order, which is based on optimistically timestamping their updates.

  47. Byzantine Agreement • A set of processes need to agree on a value, after one or more processes have proposed what that value (decision) should be • Processes may be correct, crashed, or they may exhibit arbitrary (Byzantine) failures • Messages are exchanged on a one-to-one basis, and they are not signed

  48. Problems of Agreement • The general goal of distributed agreement algorithms is to have all the nonfaulty processes reach consensus on some issue, and to establish that consensus within a finite number of steps. • What if processes exhibit Byzantine failures? • The setting is often compared to armies of the Byzantine Empire, in which conspiracies, intrigue and untruthfulness were alleged to be common in ruling circles. • A traitor is analogous to a fault.

  49. Byzantine Agreement • We will illustrate with an example of 4 generals, one of whom is a traitor. • Step 1: • Every general sends a (reliable) message to every other general announcing his troop strength. • Loyal generals tell the truth. • Traitors tell every other general a different lie. • Example: general 1 reports 1K troops, general 2 reports 2K troops, general 3 lies to everyone (giving x, y, z respectively) and general 4 reports 4K troops.
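
A much-simplified sketch of where this example ends up (the transcript breaks off before the relay step, in which generals forward the vectors they received; below, each loyal general simply takes a per-element majority over the claims about what each general said):

    from collections import Counter

    def majority(values):
        # Majority value if one exists, else an explicit "unknown" marker.
        value, count = Counter(values).most_common(1)[0]
        return value if count > len(values) / 2 else "UNKNOWN"

    # Step 1: what each loyal general (1, 2, 4) received; general 3 is the
    # traitor and told each of them something different (x, y, z).
    received = {
        1: {1: "1K", 2: "2K", 3: "x", 4: "4K"},
        2: {1: "1K", 2: "2K", 3: "y", 4: "4K"},
        4: {1: "1K", 2: "2K", 3: "z", 4: "4K"},
    }

    # After relaying vectors, every loyal general can take a majority per
    # general: they agree on (1K, 2K, UNKNOWN, 4K) despite the traitor.
    for g in (1, 2, 3, 4):
        print(g, majority([vec[g] for vec in received.values()]))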

