
OceanStore: An Architecture for Global-Scale Persistent Storage







  1. OceanStore: An Architecture for Global-Scale Persistent Storage

  2. Introduction • Vision: ubiquitous computing devices • Goal: transparency • Where to store persistent information? • How to protect against system failures? • How to upgrade components without losing configuration info? • How to manage consistency?

  3. Introduction • Requirements • Intermittent connectivity • Secure from theft and denial-of-service • Durable information • Automatic and reliable archival services • Information divorced from location • Geographically distributed servers • Caching close to clients • Information can migrate to wherever it is needed • Scale: 10^10 users, each with 10,000 files

  4. OceanStore: A True Data Utility • Utility model: consumers pay a monthly fee in exchange for access to persistent storage • Highly available data from anywhere • Automatic replication for disaster recovery • Strong security • Providers would buy and sell capacity among themselves for mobile users • Deep archival storage: use excess storage space to ease data management

  5. Two Unique Goals • Use untrusted infrastructure • May crash without warning • Information in the infrastructure is encrypted • A responsible party is financially liable for the integrity of the data • Support nomadic data • Data can be cached anywhere, anytime • Continuous introspective monitoring to manage caching and locality

  6. System Overview • The fundamental unit in OceanStore: a persistent object • Named by a globally unique identifier (GUID) • Replicated and stored on multiple servers • Independent of the server (floating replicas) • Two mechanisms to locate a replica • Probabilistically probe neighboring machines • Slower deterministic algorithm

  7. OceanStore Updates • Each update (or groups of updates) to an object creates a new version • Consistency is based on versioning • No need for backup • Pointers are permanent

  8. OceanStore Objects • An active object is the latest version of its data • An archival object is a permanent, read-only version of the object • Encoded with an erasure code • Any m out of n fragments can reconstruct the original data • Can support either weak or strong consistency models
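The "any m out of n fragments" property above can be illustrated with a minimal single-parity XOR code: n = m + 1 fragments, any m of which rebuild the original data. This is only a toy stand-in for the stronger Reed-Solomon-style erasure codes OceanStore actually uses, and the function names are hypothetical.

```python
from functools import reduce

def xor_bytes(a, b):
    """Bytewise XOR of two equal-length fragments."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(fragments):
    """Append one XOR parity fragment: any m of the resulting
    n = m + 1 pieces can rebuild the m data fragments."""
    parity = reduce(xor_bytes, fragments)
    return fragments + [parity]

def reconstruct(pieces):
    """pieces: list of (index, fragment) with exactly one of the
    n = m + 1 fragments missing (index m denotes the parity).
    The XOR of all n fragments is zero, so the missing fragment
    equals the XOR of the survivors."""
    m = len(pieces)
    missing = (set(range(m + 1)) - {i for i, _ in pieces}).pop()
    recovered = dict(pieces)
    recovered[missing] = reduce(xor_bytes, (frag for _, frag in pieces))
    return [recovered[i] for i in range(m)]  # data fragments only
```

A real deployment would use an (m, n) code with n much larger than m, so many fragments can be lost before the data becomes unrecoverable.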

  9. Applications • Groupware: calendar, email, contact lists, distributed design tools • Allow concurrent updates • Provide ways to merge information and detect conflicts

  10. Applications • Digital libraries • Require massive quantities of storage • Replication for durability and availability • Deep archival storage to survive disaster • Seamless migration of data to where it is needed • Sensor data aggregation and dissemination

  11. Naming • GUID: pseudo-random fixed-length bit string • Naming facility • Decentralized • Self-certifying path names • GUID = hash(user key, file name) • Multiple roots in OceanStore • GUID of a server is a secure hash of its key • GUID of a data fragment is a secure hash of the data content
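The self-certifying scheme above can be sketched in a few lines. SHA-256 stands in for the unspecified "secure hash", and the function names are illustrative only:

```python
import hashlib

def object_guid(owner_key: bytes, name: str) -> str:
    """Self-certifying path name: GUID = secure hash(owner key, name).
    Anyone holding the owner's key and the name can recompute and
    verify the GUID without trusting a directory service."""
    h = hashlib.sha256()
    h.update(owner_key)
    h.update(name.encode())
    return h.hexdigest()

def fragment_guid(data: bytes) -> str:
    """A fragment's GUID is a secure hash of its content, so any
    fetched fragment can be checked against the name used to fetch it."""
    return hashlib.sha256(data).hexdigest()
```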

  12. Access Control • Reader restriction • Encrypt all data • Revocation • Delete all replicas • Encrypt all replicas with a new key • A server can use old keys to access cached old data

  13. Access Control • Writer restriction • Writes are signed • Reads are restricted at clients • Writes are restricted at servers

  14. Data Location and Routing • Objects can reside on any of the OceanStore servers • Use query routing to locate objects

  15. Distributed Routing in OceanStore • Every object is identified by one or more GUIDs • Different replicas of the same object have the same GUID • OceanStore messages are labeled with • A destination GUID (built on top of IP) • A random number • A small predicate

  16. Bloom Filters • Based on the idea of hill-climbing • If a query cannot be satisfied by a server, local information is used to route the query to a likely neighbor • Via a modified version of a Bloom filter

  17. Bloom Filter • A Bloom filter • Represents a set S = {S1, … Sn} • Is represented by an m-bit array, filter[m] • Uses r independent hash functions • h1…hr • for i = 1…n • for j = 1…r • filter[hj(Si)] = 1

  18. Insertion Example • m = 6, r = 3 • To insert word x • h1(x) = 0 • h2(x) = 3 • h3(x) = 5 • filter[] = {1, 0, 0, 1, 0, 1}

  19. Insertion Example • m = 6, r = 3 • To insert word y • h1(y) = 1 • h2(y) = 3 • h3(y) = 5 • filter[] = {1, 1, 0, 1, 0, 1}

  20. Testing Example • filter[] = {1, 1, 0, 1, 0, 1} • Does x belong to the set? • filter[h1(x)] = filter[0] = 1 • filter[h2(x)] = filter[3] = 1 • filter[h3(x)] = filter[5] = 1 • Does z belong to the set? • filter[h1(z)] = filter[2] = 0 → no • filter[h2(z)] = filter[3] = 1 • filter[h3(z)] = filter[5] = 1

  21. False Positives • If filter[i] = 0, it’s not in S • If filter[i] = 1, it’s probably in S • False positive rate depends on • Number of hash functions • Array size • Number of unique elements in S
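Slides 17–21 can be condensed into a minimal sketch. The r hash functions are derived here by salting one cryptographic hash with the function index, which is one common construction (an assumption, not from the slides):

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int, r: int):
        self.m, self.r = m, r
        self.bits = [0] * m        # the m-bit array filter[m]

    def _hashes(self, item: str):
        # r "independent" hash functions h1..hr via salted SHA-256
        for j in range(self.r):
            d = hashlib.sha256(f"{j}:{item}".encode()).digest()
            yield int.from_bytes(d, "big") % self.m

    def insert(self, item: str):
        for idx in self._hashes(item):
            self.bits[idx] = 1

    def maybe_contains(self, item: str) -> bool:
        # False -> definitely not in S; True -> probably in S
        return all(self.bits[idx] for idx in self._hashes(item))
```

The false-positive rate falls as m grows and rises as the set fills up; r has an optimum in between, which is the trade-off slide 21 describes.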

  22. Attenuated Bloom Filters • An attenuated Bloom filter of depth D is an array of D normal Bloom filters • The ith Bloom filter is the union of the Bloom filters of all nodes at distance i • One filter per network edge

  23. Attenuated Bloom Filters • Lookup 11010
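A lookup such as the "11010" example consults the attenuated filter on each outgoing edge and forwards toward the edge that matches at the smallest depth. The sketch below uses an exact set as a stand-in for a real Bloom filter (so no false positives); names and structure are illustrative, not from the paper:

```python
class SetFilter:
    """Stand-in for a Bloom filter: anything with maybe_contains() works."""
    def __init__(self, items):
        self._items = set(items)
    def maybe_contains(self, guid):
        return guid in self._items

class AttenuatedFilter:
    """Per-edge attenuated filter: level i summarizes the documents
    reachable within i+1 hops along that edge."""
    def __init__(self, per_level_filters):
        self.levels = per_level_filters
    def first_match_depth(self, guid):
        for i, f in enumerate(self.levels):
            if f.maybe_contains(guid):
                return i            # probably within i+1 hops
        return None                 # not summarized on this edge

def route(edge_filters, guid):
    """Hill-climbing step: pick the neighbor whose filter matches
    the GUID at the smallest depth."""
    best = None
    for edge, af in edge_filters.items():
        d = af.first_match_depth(guid)
        if d is not None and (best is None or d < best[1]):
            best = (edge, d)
    return best   # (edge, depth), or None -> fall back to the global algorithm
```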

  24. The Global Algorithm: Wide-Scale Distributed Data Location • Plaxton’s randomized hierarchical distributed data structure • Resolve one digit of the node id at a time

  25. The Global Algorithm: Wide-Scale Distributed Data Location

  26. Achieving Locality • Each new replica only needs to traverse O(log(n)) hops to reach the root, where n is the number of the servers

  27. Achieving Fault Tolerance • Avoid failures at roots • Each root GUID is hashed with a small number of different salt values • Make it difficult to target a single GUID for DoS attacks • If failures are detected, just jump to any node to reach the root • OceanStore continually monitors and repairs broken pointers
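Hashing each root GUID with a small set of well-known salts, as slide 27 describes, yields several independent root locations for the same object. A minimal sketch (the salt count and hash are assumptions):

```python
import hashlib

def root_guids(guid: str, num_salts: int = 4):
    """Replicated roots: hash the object GUID with each well-known
    salt value; a client that finds one root dead tries another,
    and an attacker must take down all of them."""
    return [hashlib.sha256(f"{guid}:{salt}".encode()).hexdigest()
            for salt in range(num_salts)]
```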

  28. Advantages of Distributed Information • Redundant paths to roots • Scalable with a combination of probabilistic and global algorithms • Easy to locate and recover failed components • Plaxton links form a natural substrate for admission controls and multicasts

  29. Achieving Maintenance-Free Operation • Recursive node insertion and removal • Replicated roots • Use beacons to detect faults • Time-to-live fields to update routes • Second-chance algorithm to avoid false diagnoses of failed components • Avoid the cost of recovering lost nodes • Automatic reconstruction of data for failed servers

  30. Update Model • Conflict resolution update model • Challenge: • Untrusted infrastructure • Access only to ciphertext

  31. Update Format and Semantics • An update: a list of predicates associated with actions • If any of the predicates evaluates to be true, the actions associated with the earliest true predicate are atomically applied • Everything is logged

  32. Extending the Model to Work over Ciphertext • Supported predicates • Compare version (unencrypted metadata) • Compare size (unencrypted metadata) • Compare block • Compare a hash of the encrypted block • Search • Returns only yes/no • Cannot be initiated by the server • Replace/insert/delete/append block
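The predicate/action semantics of slides 31–32 can be sketched directly: an update is an ordered list of (predicate, actions) pairs, and the actions of the earliest true predicate are applied atomically. Logging and the ciphertext-only predicates are elided; the representation here is an assumption for illustration.

```python
import copy

def apply_update(update, obj):
    """update: ordered list of (predicate, actions) pairs.
    Apply the actions of the earliest true predicate atomically;
    if no predicate holds, the update aborts."""
    for predicate, actions in update:
        if predicate(obj):
            staged = copy.deepcopy(obj)   # stage every change first...
            for action in actions:
                action(staged)
            obj.clear()
            obj.update(staged)            # ...then commit all at once
            return "committed"
    return "aborted"
```

A compare-version update, for example, carries a predicate on the (unencrypted) version metadata and a replace-block action, so untrusted servers can apply it without reading plaintext.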

  33. Serializing Updates in an Untrusted Infrastructure • Use a small primary tier of replicas to serialize updates • Minimize communication • Meanwhile, a secondary tier of replicas optimistically propagate updates among themselves • Final ordering from primary tier is multicasted to secondary replicas

  34. A Direct Path to Clients and Archival Storage • Updates flow directly from a client to the primary tier, where they are serialized and then multicast to the secondary servers • Updates are tightly coupled with archival • Archival fragments are generated at serialization time and distributed with updates

  35. Efficiency of the Consistency Protocol • For updates > 4Kbytes, network overhead < 100% • Approximate latency per update < 1 second

  36. Deep Archival Storage • Erasure encoded block fragments • Use small and widely distributed fragments to increase reliability • Administrative domains are ranked by their reliability and trustworthiness • Avoid locations with correlated failures

  37. The OceanStore API • Session: a sequence of reads and writes to potentially different objects • Session guarantees: define the level of consistency • Updates • Callback: for user defined events (commit) • Façade: an interface to the conventional API • UNIX file system, transactional databases, WWW gateways

  38. Introspection • Observation modules monitor the activity of a running system and track system behavior • Optimization modules adjust the computation • (figure: the introspection cycle — computation, observation, optimization)

  39. Uses of Introspection • Cluster recognition • Identify related files • Replica management • Adjust replication factors • Migrate floating replicas

  40. Related Work • Space/time trade-offs in hash coding with allowable errors. In Communications of the ACM, 13(7), pp. 422-426, July 1970 • The Bayou architecture: Support for data sharing among mobile users. In Proc. of IEEE Workshop on Mobile Computing Systems and Applications, Dec 1994

  41. Related Work • A tutorial on Reed-Solomon coding for fault tolerance in RAID-like systems. Software Practice and Experience, 27(9), pp. 995-1012, September 1997 • Accessing nearby copies of replicated objects in a distributed environment. In Proc. of ACM SPAA, June 1997 • Search on encrypted data. IEEE SRSP, May 2000
