
OceanStore Global-Scale Persistent Storage

OceanStore: Global-Scale Persistent Storage. Ying Lu, CSCE496/896, Spring 2011. Credits: many slides are from John Kubiatowicz, University of California at Berkeley. I have modified them and added new slides.





Presentation Transcript


  1. OceanStore: Global-Scale Persistent Storage • Ying Lu, CSCE496/896, Spring 2011

  2. Give Credits • Many slides are from John Kubiatowicz, University of California at Berkeley • I have modified them and added new slides

  3. Motivation • Personal Information Mgmt is the Killer App • Not corporate processing but management, analysis, aggregation, dissemination, filtering for the individual • Automated extraction and organization of daily activities to assist people • Information Technology as a Utility • Continuous service delivery, on a planetary-scale, on top of a highly dynamic information base

  4. OceanStore Context: Ubiquitous Computing • Computing everywhere: • Desktop, Laptop, Palmtop, Cars, Cellphones • Shoes? Clothing? Walls? • Connectivity everywhere: • Rapid growth of bandwidth in the interior of the net • Broadband to the home and office • Wireless technologies such as CDMA, Satellite, laser • Rise of the thin-client metaphor: • Services provided by interior of network • Incredibly thin clients on the leaves • MEMS devices: sensors + CPU + wireless net in 1 mm³ • Mobile society: people move and devices are disposable

  5. What do we need for personal information management?

  6. Questions about information: • Where is persistent information stored? • The 20th-century tie between location and content is outdated • How is it protected? • Can a disgruntled employee of an ISP sell your secrets? • Can’t trust anyone (how paranoid are you?) • Can we make it indestructible? • Want our data to survive “the big one”! • Highly resistant to hackers (denial of service) • Wide-scale disaster recovery • Is it hard to manage? • Worst failures are human-related • Want automatic (introspective) diagnosis and repair

  7. First Observation: Want Utility Infrastructure • Mark Weiser from Xerox: Transparent computing is the ultimate goal • Computers should disappear into the background • In storage context: • Don’t want to worry about backup, obsolescence • Need lots of resources to make data secure and highly available, BUT don’t want to own them • Outsourcing of storage already very popular • Pay monthly fee and your “data is out there” • Simple payment interface: one bill from one company

  8. Second Observation: Need wide-scale deployment • Many components with geographic separation • System not disabled by natural disasters • Can adapt to changes in demand and regional outages • Wide-scale use and sharing also require wide-scale deployment • Bandwidth increasing rapidly, but latency bounded by speed of light • Handling many people with the same system leads to economies of scale

  9. OceanStore: Everyone’s data, One big Utility • “The data is just out there” • Separate information from location • Locality is only an optimization (an important one!) • Wide-scale coding and replication for durability • All information is globally identified • Unique identifiers are hashes over names & keys • Single uniform lookup interface • No centralized namespace required
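As a rough illustration of "unique identifiers are hashes over names & keys", the sketch below derives a 160-bit identifier by hashing an owner key together with a human-readable name using SHA-1. The function name `make_guid`, the separator byte, and the ordering of the inputs are assumptions made for this example; the slides do not specify the exact layout.

```python
import hashlib


def make_guid(name: str, owner_public_key: bytes) -> str:
    """Return a 160-bit identifier as 40 hex digits (hypothetical layout)."""
    h = hashlib.sha1()
    h.update(owner_public_key)
    h.update(b"\x00")  # assumed separator so key and name cannot run together
    h.update(name.encode("utf-8"))
    return h.hexdigest()


# The same (name, key) pair always yields the same identifier, so any node
# can compute it locally without consulting a central namespace.
guid = make_guid("/home/alice/notes.txt", b"alice-public-key-bytes")
other = make_guid("/home/alice/notes.txt", b"bob-public-key-bytes")
```

Because the key is part of the hash input, two owners can use the same name without colliding, which is one way a flat namespace can avoid central coordination.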

  10. Amusing back-of-the-envelope calculation (courtesy Bill Bolosky, Microsoft) • How many files in the OceanStore? • Assume 10^10 people in the world • Say 10,000 files/person (very conservative?) • So 10^14 files in OceanStore! • If 1 gig files (not likely), get roughly 1 mole of bytes! • Truly impressive number of elements… • … but small relative to physical constants
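The arithmetic behind this estimate can be checked in a few lines. The sketch assumes "1 gig" means 10^9 bytes per file and compares the total byte count with Avogadro's number; the variable names are illustrative only.

```python
people = 10**10               # assumed world population from the slide
files_per_person = 10**4      # 10,000 files per person
files = people * files_per_person

bytes_per_file = 10**9        # assumption: "1 gig" = 10**9 bytes
total_bytes = files * bytes_per_file

AVOGADRO = 6.02214076e23      # elements per mole
moles_of_bytes = total_bytes / AVOGADRO
```

Under these assumptions the total is 10^23 bytes, which is within an order of magnitude of one mole.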

  11. Utility-based Infrastructure • Service provided by confederation of companies • Monthly fee paid to one service provider • Companies buy and sell capacity from each other [Figure: providers labeled Canadian OceanStore, Sprint, AT&T, IBM, Pac Bell]

  12. Outline • Motivation • Properties of the OceanStore • Specific Technologies and approaches: • Naming and Data Location • Conflict resolution on encrypted data • Replication and Deep archival storage • Introspective computing for optimization and repair • Economic models • Conclusion

  13. Ubiquitous Devices ⇒ Ubiquitous Storage • Consumers of data move, change from one device to another, work in cafes, cars, airplanes, the office, etc. • Properties REQUIRED for OceanStore storage substrate: • Strong Security: data encrypted in the infrastructure; resistance to monitoring and denial of service attacks • Coherence: too much data for naïve users to keep coherent “by hand” • Automatic replica management and optimization: huge quantities of data cannot be managed manually • Simple and automatic recovery from disasters: probability of failure increases with size of system • Utility model: world-scale system requires cooperation across administrative boundaries

  14. OceanStore Technologies I: Naming and Data Location • Requirements: • System-level names should help to authenticate data • Route to nearby data without global communication • Don’t inhibit rapid relocation of data • OceanStore approach: Two-level search with embedded routing • Underlying namespace is flat and built from secure cryptographic hashes (160-bit SHA-1) • Search process combines quick, probabilistic search with slower guaranteed search

  15. Universal Location Facility • Takes a 160-bit unique identifier (GUID) and returns the nearest object that matches [Figure labels: Universal Name; Active Data Name (OID); Floating Replica; Root Structure; Commit Logs; Update OID; Checkpoint OID; Version OID; Archive versions (Version OID1, Version OID2, Version OID3); Global Object Resolution; Archival copy or snapshot; Erasure Coded]

  16. Routing: Two-tiered approach • Fast probabilistic routing algorithm • Entities that are accessed frequently are likely to reside close to where they are being used (ensured by introspection) • Slower, guaranteed hierarchical routing method • Self-optimizing

  17. Probabilistic Routing Algorithm • Bloom filter on each node; attenuated Bloom filter on each directed edge • Self-optimizing on the depth of the attenuated Bloom filter array [Figure: nodes n1 to n4 with 5-bit Bloom filters and per-edge attenuated filters (1st, 2nd, 3rd levels); example query for X (11010)]
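A minimal sketch of the idea on this slide, assuming that level i of an edge's attenuated Bloom filter summarizes the objects reachable i hops away through that edge. A query is forwarded along the edge whose filter array matches at the shallowest level. The filter size `M`, hash count `K`, and all function names are assumptions for illustration, not the OceanStore implementation.

```python
import hashlib

M = 256  # bits per filter (assumed)
K = 3    # bit positions set per item (assumed)


def bit_positions(item: str) -> list:
    """Derive K bit positions for an item from one SHA-1 digest."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % M for i in range(K)]


def make_filter(items) -> int:
    """Build a Bloom filter, stored as an integer bitmask."""
    bits = 0
    for item in items:
        for p in bit_positions(item):
            bits |= 1 << p
    return bits


def may_contain(bits: int, item: str) -> bool:
    """True if every bit for the item is set (false positives are possible)."""
    return all((bits >> p) & 1 for p in bit_positions(item))


def first_matching_depth(attenuated: list, item: str):
    """Return the shallowest level (1-based) whose filter matches, or None."""
    for depth, bits in enumerate(attenuated, start=1):
        if may_contain(bits, item):
            return depth
    return None


def choose_edge(edges: dict, item: str):
    """Pick the neighbor whose attenuated filter matches at the least depth."""
    best = None
    for neighbor, attenuated in edges.items():
        depth = first_matching_depth(attenuated, item)
        if depth is not None and (best is None or depth < best[1]):
            best = (neighbor, depth)
    return best


# Edge to n2: "A" is one hop away, "B" and "X" are two hops away.
edges = {
    "n2": [make_filter(["A"]), make_filter(["B", "X"])],
    "n3": [make_filter(["C"]), make_filter(["D"])],
}
best = choose_edge(edges, "X")
```

Barring a false positive, the query for "X" would be sent toward n2. When no edge matches at any level, the query would fall back to the slower, guaranteed hierarchical method.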

  18. Hierarchical Routing Algorithm • Based on Plaxton scheme • Every server in the system is assigned a random node-ID • Object’s root • Each object is mapped to a single node whose node-ID matches the object’s GUID in the most bits (starting from the least significant) • Information about the GUID (such as location) is stored at its root
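The root-selection rule can be sketched as follows. For readability the example compares hexadecimal digits rather than individual bits, and it breaks ties by taking the largest ID; both choices are assumptions for this illustration (the Plaxton scheme defines its own deterministic tie-break).

```python
def shared_suffix_len(a: str, b: str) -> int:
    """Count matching digits, starting from the least significant one."""
    n = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        n += 1
    return n


def find_root(guid: str, node_ids: list) -> str:
    """Return the node whose ID shares the longest suffix with the GUID."""
    # Tie-break on the ID itself (assumed rule) so the result is deterministic.
    return max(node_ids, key=lambda nid: (shared_suffix_len(nid, guid), nid))


nodes = ["79FE", "23FE", "993E", "44FE", "F990"]
root = find_root("43FE", nodes)  # "23FE" shares the three-digit suffix "3FE"
```

Every node applying this rule to the same set of IDs arrives at the same root, which is what lets location information for a GUID be published and found without a directory.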

  19. Construct Plaxton Mesh [Figure: mesh of nodes with IDs such as 0265, 1215, 1624, 2344, 3714, 5724, 7144, 9834, x431, x633, x742, x927, with links labeled by level 1 to 4; entries 0324, 1324, …]

  20. Basic Plaxton Mesh: Incremental suffix-based routing [Figure: routing toward GUID 0x43FE across nodes 0x79FE, 0x23FE, 0x993E, 0x43FE, 0x73FE, 0x44FE, 0xF990, 0x035E, 0x04FE, 0x13FE, 0x555E, 0xABFE, 0x9990, 0x239E, 0x73FF, 0x1290, 0x423E; links labeled with levels 1 to 4]
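Incremental suffix-based routing can be sketched as a loop in which each hop moves to a node that matches the destination in at least one more low-order digit. The sketch uses a global list of node IDs for simplicity; in a real mesh each node would consult only its own neighbor table. The function names and the rule for picking among candidates are assumptions for illustration.

```python
def shared_suffix_len(a: str, b: str) -> int:
    """Count matching digits, starting from the least significant one."""
    n = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        n += 1
    return n


def route(start: str, guid: str, node_ids: list) -> list:
    """Return the hop-by-hop path from start toward the node matching guid."""
    path = [start]
    current = start
    while True:
        have = shared_suffix_len(current, guid)
        candidates = [n for n in node_ids if shared_suffix_len(n, guid) > have]
        if not candidates:
            return path  # no node matches more digits: current is the root
        # Assumed choice: take the smallest improvement, then the smallest ID.
        current = min(candidates, key=lambda n: (shared_suffix_len(n, guid), n))
        path.append(current)


nodes = ["1290", "035E", "993E", "04FE", "79FE", "13FE", "23FE", "43FE"]
path = route("1290", "43FE", nodes)
# path == ["1290", "035E", "04FE", "13FE", "43FE"]
```

Each hop fixes one more digit (…E, …FE, …3FE, 43FE), so the number of hops is bounded by the number of digits in the identifier.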

  21. Use of Plaxton Mesh: Randomization and Locality

  22. OceanStore Enhancements of the Plaxton Mesh • Documents have multiple roots (salted hash of GUID) • Each node has multiple neighbor links • Searches proceed along multiple paths • Tradeoff between reliability, performance and bandwidth? • Dynamic node insertion and deletion algorithms • Continuous repair and incremental optimization of links • Properties: self-healing, self-optimizing, self-configuring
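One way to picture "multiple roots (salted hash of GUID)" is to hash the GUID together with a small salt value, once per desired root, and treat each result as a separate identifier to route toward. The salt encoding, the number of roots, and the function name below are assumptions for illustration.

```python
import hashlib


def salted_root_ids(guid_hex: str, num_roots: int = 3) -> list:
    """Derive several independent 160-bit root identifiers from one GUID."""
    guid = bytes.fromhex(guid_hex)
    # Assumed scheme: append a one-byte salt (0, 1, 2, ...) before hashing.
    return [hashlib.sha1(guid + bytes([salt])).hexdigest()
            for salt in range(num_roots)]


guid_hex = hashlib.sha1(b"example object").hexdigest()
root_ids = salted_root_ids(guid_hex)
```

Because the derived identifiers are unrelated to one another, the corresponding root nodes are spread across the mesh, so the loss of any single root does not make the object unreachable.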

  23. OceanStore Technologies II: Rapid Update in an Untrusted Infrastructure • Requirements: • Scalable coherence mechanism which can operate directly on encrypted data without revealing information • Handle Byzantine failures • Rapid dissemination of committed information • OceanStore Approach: • Operations-based interface using conflict resolution • Modeled after Xerox Bayou ⇒ update packets include predicate/action pairs which operate on encrypted data • User signs updates and principal party signs commits • Committed data multicast to clients
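A minimal sketch of a Bayou-style update made of predicate/action pairs: the replica evaluates each predicate in order and applies the action of the first one that holds, otherwise the update aborts. The predicate and action shown here (`version_is`, `append_block`) are hypothetical examples that only touch metadata and opaque ciphertext, and signing is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class StoredObject:
    version: int = 0
    # Opaque ciphertext blocks; the server never decrypts them.
    blocks: List[bytes] = field(default_factory=list)


Predicate = Callable[[StoredObject], bool]
Action = Callable[[StoredObject], None]


def version_is(expected: int) -> Predicate:
    """Hypothetical predicate: holds if the object is at the expected version."""
    return lambda obj: obj.version == expected


def append_block(ciphertext: bytes) -> Action:
    """Hypothetical action: append one encrypted block and bump the version."""
    def act(obj: StoredObject) -> None:
        obj.blocks.append(ciphertext)
        obj.version += 1
    return act


def apply_update(obj: StoredObject, pairs: List[Tuple[Predicate, Action]]) -> bool:
    """Apply the first action whose predicate holds; return False on abort."""
    for predicate, action in pairs:
        if predicate(obj):
            action(obj)
            return True
    return False


obj = StoredObject(version=1, blocks=[b"c0"])
committed = apply_update(obj, [(version_is(1), append_block(b"c1"))])
stale = apply_update(obj, [(version_is(1), append_block(b"c2"))])
```

The second update is rejected because its predicate was written against version 1, which shows how conflicts can be detected without any replica reading the plaintext.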

  24. Update Model • Concurrent updates w/o wide-area locking • Conflict resolution • Update serialization • A master replica? • Role of primary tier of replicas • All updates are submitted to the primary tier of replicas, which chooses a final total order by following a Byzantine agreement protocol • A secondary tier of replicas • The result of the updates is multicast down the dissemination tree to all the secondary replicas

  25. Agreement • Need agreement in distributed systems (DS): leader, commit, synchronize • Distributed agreement algorithm: all non-faulty processes achieve consensus in a finite number of steps • Perfect processes, faulty channels: two-army problem • Faulty processes, perfect channels: Byzantine generals problem

  26. Two-Army Problem

  27. Possible Consensus • Agreement is possible in a synchronous DS [e.g., Lamport et al.] • Messages can be guaranteed to be delivered within a known, finite time • Byzantine Generals Problem • A synchronous DS can distinguish a slow process from a crashed one

  28. Byzantine Generals Problem

  29. Byzantine Generals: Example (1) • The Byzantine generals problem for 3 loyal generals and 1 traitor • (a) The generals announce the time to launch the attack (by messages marked by their ids) • (b) The vectors that each general assembles based on (a) • (c) The vectors that each general receives, where every general passes his vector from (b) to every other general

  30. Byzantine Generals: Example (2) • The same as in the previous slide, except now with 2 loyal generals and 1 traitor

  31. Byzantine Generals • Given three processes, if one fails, consensus is impossible • Given N processes, if F processes fail, consensus is impossible if N ≤ 3F
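The bound can be illustrated with a small simulation of a two-round vector exchange: every general announces a value, assembles a vector of what it heard, forwards that vector to everyone else, and then takes a per-element majority over the vectors it received. The traitor behavior (a different lie to each peer) and all names in the sketch are assumptions for illustration.

```python
from collections import Counter

UNKNOWN = None


def majority(values):
    """Return the value held by a strict majority, or UNKNOWN."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else UNKNOWN


def run(n, traitors, values):
    """Simulate the exchange; return each loyal general's result vector."""
    generals = range(n)
    # Round 1: each general g records what every sender s announced to it.
    # A traitor tells each recipient something different.
    vectors = {
        g: [values[s] if (s == g or s not in traitors) else f"lie:{s}->{g}"
            for s in generals]
        for g in generals
    }
    # Round 2: each general forwards its vector; a traitor forwards junk.
    results = {}
    for g in generals:
        if g in traitors:
            continue
        received = []
        for s in generals:
            if s == g:
                continue
            if s in traitors:
                received.append([f"junk:{s}->{g}:{j}" for j in generals])
            else:
                received.append(vectors[s])
        results[g] = [majority([vec[j] for vec in received]) for j in generals]
    return results


four = run(4, {2}, ["06:00", "07:00", "08:00", "09:00"])   # N = 4 > 3F
three = run(3, {2}, ["06:00", "07:00", "08:00"])           # N = 3 <= 3F
```

With four generals and one traitor, every loyal general ends up with the same vector, in which only the traitor's entry is unknown. With three generals and one traitor, no entry reaches a majority, so the loyal generals learn nothing they can agree on.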

  32. Tentative Updates: Epidemic Dissemination

  33. Committed Updates: Multicast Dissemination

  34. Data Coding Model • Two distinct forms of data: active and archival • Active Data in Floating Replicas • Latest version of the object • Archival Data in Erasure-Coded Fragments • A permanent, read-only version of the object • During commit, the previous version is coded with an erasure code and spread over 100s or 1000s of nodes • Advantage: any 1/2 or 1/4 of the fragments regenerates the data
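To show how "any 1/2 of the fragments regenerates the data" can work, here is a toy rate-1/2 erasure code based on polynomial interpolation over a small prime field. It treats the k data bytes as the values of a degree k-1 polynomial at x = 0..k-1 and emits 2k evaluations as fragments; any k of them determine the polynomial. This is a teaching sketch under those assumptions, not the code used by OceanStore.

```python
P = 257  # prime just above the byte range, so every byte is a field element


def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial through the given (xi, yi) points."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+).
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data: bytes, n: int):
    """Produce n fragments (x, value); the first len(data) carry the data itself."""
    points = list(enumerate(data))
    return [(x, lagrange_eval(points, x)) for x in range(n)]


def decode(fragments, k: int) -> bytes:
    """Rebuild the k original bytes from any k surviving fragments."""
    points = fragments[:k]
    return bytes(lagrange_eval(points, x) for x in range(k))


data = b"ocean"
fragments = encode(data, 2 * len(data))           # 10 fragments, rate 1/2
from_last_half = decode(fragments[5:], len(data))
from_odd_ones = decode(fragments[1::2], len(data))
```

Both subsets recover `b"ocean"` even though neither contains all of the original bytes. Doubling the fragment count again would give a rate-1/4 code in which any quarter of the fragments suffices.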

  35. Floating Replica and Deep Archival Coding [Figure: three floating replicas, each holding a full copy, a version list (Ver1: 0x34243, Ver2: 0x49873, Ver3: …), and conflict resolution logs, alongside erasure-coded fragments]

  36. Proactive Self-Maintenance • Continuous testing and repair of information • Slow sweep through all information to make sure there are sufficient erasure-coded fragments • Continuously reevaluate risk and redistribute data • Slow sweep and repair of metadata/search trees • Continuous online self-testing of HW and SW • Detects flaky, failing, or buggy components via: • fault injection: triggering hardware and software error handling paths to verify their integrity/existence • stress testing: pushing HW/SW components past normal operating parameters • scrubbing: periodic restoration of potentially “decaying” hardware or software state • Automates preventive maintenance

  37. OceanStore Technologies IV: Introspective Optimization • Requirements: • Reasonable job on global-scale optimization problem • Take advantage of locality whenever possible • Sensitivity to limited storage and bandwidth at endpoints • Repair of data structures, increasing of redundancy • Stability in chaotic environment ⇒ Active Feedback • OceanStore Approach: • Introspective monitoring and analysis of relationships to cluster information by relatedness • Time-series analysis of user and data motion • Rearrangement and replication in response to monitoring • Clustered prefetching: fetch related objects • Proactive prefetching: get data there before needed • Rearrangement in response to overload and attack

  38. Example: Client Introspection • Client observer and optimizer components • Greedy agents working on behalf of the client • Watch client activity and combine it with historical info • Perform clustering and time-series analysis • Forward results to infrastructure (privacy issues!) • Monitor state of network to adapt behaviour • Typical Actions: • Cluster related files together • Prefetch files that will be needed soon • Create/destroy floating replicas
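A very small sketch of what a client-side observer might do: record which file tends to be opened right after another, then suggest the most frequent successors as prefetch candidates. The access log, function names, and the "most frequent successor" heuristic are assumptions for illustration; the slides do not specify the analysis used.

```python
from collections import Counter, defaultdict


def successor_counts(log):
    """Count, for each file, which files were accessed immediately after it."""
    follows = defaultdict(Counter)
    for current, nxt in zip(log, log[1:]):
        if current != nxt:
            follows[current][nxt] += 1
    return follows


def prefetch_candidates(follows, current, limit=2):
    """Return up to `limit` files most often accessed right after `current`."""
    return [name for name, _ in follows[current].most_common(limit)]


# Hypothetical access history observed on one client.
log = ["inbox", "calendar", "inbox", "calendar", "inbox", "todo"]
follows = successor_counts(log)
suggestion = prefetch_candidates(follows, "inbox", limit=1)
```

Here "calendar" followed "inbox" twice and "todo" once, so "calendar" is the suggested prefetch. The same counts could also drive clustering of related files onto nearby replicas.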

  39. OceanStore Conclusion • The time is now for a Universal Data Utility • Ubiquitous computing and connectivity are (almost) here! • Confederation of utility providers is the right model • OceanStore holds all data, everywhere • Local storage is a cache on global storage • Provides security in an untrusted infrastructure • Exploits economies of scale to: • Provide high availability and extreme survivability • Lower maintenance cost: • self-diagnosis and repair • Insensitivity to technology changes: just unplug one set of servers, plug in others
