

  1. Pond The OceanStore Prototype

  2. Introduction • Problem: Rising cost of storage management • Observations: • Universal connectivity via the Internet • $100-per-terabyte storage within three years • Solution: OceanStore

  3. OceanStore • Internet-scale • Cooperative file system • High durability • Universal availability • Two-tier storage system • Upper tier: powerful servers • Lower tier: less powerful hosts

  4. OceanStore

  5. More on OceanStore • Unit of storage: data object • Applications: email, UNIX file system • Requirements for the object interface • Information universally accessible • Balance between privacy and sharing • Simple and usable consistency model • Data integrity

  6. OceanStore Assumptions • Infrastructure untrusted except in aggregate • Most nodes are neither faulty nor malicious • Infrastructure constantly changing • Resources enter and exit the network without prior warning • Self-organizing, self-repairing, self-tuning

  7. OceanStore Challenges • Expressive storage interface • High durability on untrusted and changing base

  8. Data Model • The view of the system that is presented to client applications

  9. Storage Organization • OceanStore data object ~= file • Ordered sequence of read-only versions • Every version of every object kept forever • Can be used as backup • An object contains metadata, data, and references to previous versions

  10. Storage Organization • A stream of objects identified by AGUID • Active globally-unique identifier • Cryptographically-secure hash of an application-specific name and the owner’s public key • Prevents namespace collisions

  11. Storage Organization • Each version of a data object is stored in a B-tree-like data structure • Each block has a BGUID • Cryptographically-secure hash of the block content • Each version has a VGUID • Two versions may share blocks
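
A minimal Java sketch of this naming scheme, assuming SHA-1 as the secure hash and a simple name-plus-key byte layout (Pond's exact encoding may differ): the BGUID is a hash of a block's contents, and the AGUID is a hash of an application-specific name together with the owner's public key.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Guids {

    // BGUID: cryptographically-secure hash of the block's contents.
    static byte[] bguid(byte[] blockContents) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("SHA-1").digest(blockContents);
    }

    // AGUID: hash of an application-specific name plus the owner's public key,
    // which prevents namespace collisions between different owners.
    static byte[] aguid(String appName, byte[] ownerPublicKey) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        md.update(appName.getBytes(StandardCharsets.UTF_8));
        md.update(ownerPublicKey);
        return md.digest();
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] fakeKey = {0x01, 0x02, 0x03};  // stand-in for a real public key
        System.out.println("AGUID: " + hex(aguid("alice/mbox", fakeKey)));
        System.out.println("BGUID: " + hex(bguid("hello block".getBytes(StandardCharsets.UTF_8))));
    }
}

Because a BGUID is a hash of its content, identical blocks in two versions hash to the same BGUID, which is what lets versions share blocks.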

  12. Storage Organization

  13. Application-Specific Consistency • An update is the operation of adding a new version to the head of a version stream • Updates are applied atomically • Represented as an array of potential actions • Each guarded by a predicate

  14. Application-Specific Consistency • Example actions • Replacing some bytes • Appending new data to an object • Truncating an object • Example predicates • Check for the latest version number • Compare bytes
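
A hedged Java sketch of this update model; the Version type, the helper names, and the apply-first-satisfied-guard semantics are illustrative assumptions rather than Pond's actual interface.

import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

public class GuardedUpdate {

    record Version(int number, byte[] data) {}

    // One guarded action: the action runs only if its predicate holds on the current version.
    record Guarded(Predicate<Version> predicate, UnaryOperator<Version> action) {}

    // Apply the first action whose guard is satisfied; otherwise leave the object unchanged.
    static Version apply(Version current, List<Guarded> update) {
        for (Guarded g : update) {
            if (g.predicate().test(current)) {
                return g.action().apply(current);   // applied atomically at the primary
            }
        }
        return current;                             // no guard held: update aborts
    }

    public static void main(String[] args) {
        Version v5 = new Version(5, "old".getBytes());

        // "Append only if I am still looking at version 5": a compare-version-number guard.
        List<Guarded> update = Arrays.asList(
            new Guarded(v -> v.number() == 5,
                        v -> new Version(v.number() + 1,
                                         (new String(v.data()) + "+new").getBytes())));

        Version next = apply(v5, update);
        System.out.println("version " + next.number() + ": " + new String(next.data()));
    }
}

The examples on the following slides are just different choices of guard: a version-number or byte-compare predicate for ACID-style updates versus an always-true predicate for a mailbox append.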

  15. Application-Specific Consistency • To implement ACID semantics • Check for readers • If none, apply the update • To append to a mailbox • No checking needed • No explicit locks or leases

  16. Application-Specific Consistency • Predicates for reads • Examples • Can't read data older than 30 seconds • Can only read data from a specific time frame

  17. System Architecture • Unit of synchronization: data object • Changes to different objects are independent

  18. Virtualization through Tapestry • Resources are virtual and not tied to particular hardware • A virtual resource has a GUID, globally unique identifier • Use Tapestry, a decentralized object location and routing system • Scalable overlay network, built on TCP/IP

  19. Virtualization through Tapestry • Use GUIDs to address hosts and resources • Hosts publish the GUIDs of their resources in Tapestry • Hosts also can unpublish GUIDs and leave the network
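
For orientation, a small Java interface capturing the three operations Pond needs from its location layer; the names below are hypothetical, not Tapestry's real API.

// Hypothetical decentralized object location and routing interface.
interface LocationAndRouting {

    // Announce that this host stores a replica or fragment named by guid.
    void publish(byte[] guid);

    // Withdraw the announcement, e.g. before leaving the network.
    void unpublish(byte[] guid);

    // Route a message toward some host that has published guid.
    void routeToObject(byte[] guid, byte[] message);
}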

  20. Replication and Consistency • A data object is a sequence of read-only versions, consisting of read-only blocks named by BGUIDs • Read-only blocks raise no consistency issues for replication • The mapping from AGUID to the latest VGUID may change • Use primary-copy replication for that mapping

  21. Replication and Consistency • The primary copy • Enforces access control • Serializes concurrent updates

  22. Archival Storage • Simple replication: 2x storage tolerates only one failure • Erasure coding gives far higher durability for the same storage • A block is divided into m fragments • The m fragments are encoded into n > m fragments • Any m fragments can reconstruct the original block
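
A hypothetical Java interface for the m-of-n coder described above (not Pond's actual archival code): with m = 16 and n = 32, the raw overhead is n/m = 2x, yet the block survives the loss of any n - m = 16 fragments.

// A fragment carries its index so the decoder knows its position in the code.
record Fragment(int index, byte[] data) {}

// Hypothetical m-of-n erasure coder interface.
interface ErasureCoder {

    // Encode one block into n fragments; any m of them suffice to rebuild it.
    Fragment[] encode(byte[] block, int m, int n);

    // Reconstruct the original block from any m surviving fragments.
    byte[] decode(Fragment[] anyMFragments, int m, int n);
}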

  23. Caching of Data Objects • Reconstructing a block from erasure code is an expensive process • Need to locate m fragments from m machines • Use whole-block caching for frequently-read objects

  24. Caching of Data Objects • To read a block, look for a cached copy of the whole block first • If none is available • Find m block fragments • Decode the fragments • Publish that this host now caches the block • Amortizes the cost of erasure encoding/decoding
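
A hedged Java sketch of that read path; the cache, the fragment lookup, and the decoder below are stand-ins, not Pond's code.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class BlockReader {

    // Whole-block cache, keyed by the block's BGUID (hex-encoded here).
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();

    byte[] read(String bguid) {
        byte[] block = cache.get(bguid);
        if (block != null) {
            return block;                                  // fast path: cache hit
        }
        Fragment[] fragments = locateFragments(bguid, 16); // fetch m fragments from m hosts
        block = decodeFragments(fragments);                // expensive erasure decode
        cache.put(bguid, block);                           // keep the whole block...
        publishCachedBlock(bguid);                         // ...and advertise it in the overlay
        return block;
    }

    // Stubs standing in for the archive lookup, the erasure decoder,
    // and publication of the cached copy through the location layer.
    record Fragment(int index, byte[] data) {}
    Fragment[] locateFragments(String bguid, int m) { throw new UnsupportedOperationException(); }
    byte[] decodeFragments(Fragment[] fragments) { throw new UnsupportedOperationException(); }
    void publishCachedBlock(String bguid) { /* announce the cached copy */ }
}

Publishing the reconstructed block lets later readers find the cached copy instead of decoding again, which is how the decoding cost is amortized.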

  25. Caching of Data Objects • Updates are pushed to secondary replicas via application-level multicast tree

  26. The Full Update Path • Serialized updates are disseminated via the multicast tree for an object • At the same time, updates are encoded and fragmented for long-term storage

  27. The Full Update Path

  28. The Primary Replica • Primary servers run a Byzantine agreement protocol • Needs more than 2/3 of the participants to be non-faulty • The number of messages required grows quadratically with the number of participants

  29. Public-Key Cryptography • Too expensive • Use symmetric-key message authentication codes (MACs) instead • Two to three orders of magnitude faster • Downside: can't prove the authenticity of a message to a third party • MACs are used only within the inner ring • Public-key cryptography is used for the outer ring
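
The trade-off can be illustrated with the standard Java crypto APIs (algorithm names and default key sizes here are assumptions): a shared-key MAC that only the key holders can verify, versus a public-key signature that any third party can check.

import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;
import javax.crypto.KeyGenerator;
import javax.crypto.Mac;
import javax.crypto.SecretKey;

public class MacVsSignature {
    public static void main(String[] args) throws Exception {
        byte[] msg = "serialized update #42".getBytes();

        // Symmetric MAC: fast, but only holders of the shared key can verify it,
        // so it cannot convince a third party of the message's origin.
        SecretKey macKey = KeyGenerator.getInstance("HmacSHA256").generateKey();
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(macKey);
        byte[] tag = mac.doFinal(msg);

        // Public-key signature: orders of magnitude slower,
        // but anyone holding the public key can verify it.
        KeyPair pair = KeyPairGenerator.getInstance("RSA").generateKeyPair();
        Signature signer = Signature.getInstance("SHA256withRSA");
        signer.initSign(pair.getPrivate());
        signer.update(msg);
        byte[] sig = signer.sign();

        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(pair.getPublic());
        verifier.update(msg);
        System.out.println("MAC tag bytes: " + tag.length
                + ", signature valid: " + verifier.verify(sig));
    }
}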

  30. Proactive Threshold Signatures • Byzantine agreement guarantees correctness only if no more than 1/3 of the servers fail during the life of the system • Not practical for a long-lived system • Servers need to be rebooted at regular intervals • Key holders are fixed

  31. Proactive Threshold Signatures • Proactive threshold signatures • More flexibility in choosing the membership of the inner ring • A public key is paired with a number of private keys • Each server uses its key to generate a signature share

  32. Proactive Threshold Signatures • Any k shares may be combined to produce a full signature • To change membership of an inner ring • Regenerate signature shares • No need to change the public key • Transparent to secondary hosts
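
A hypothetical Java interface for such a k-of-n scheme; the method names are illustrative, not the library Pond uses.

interface ThresholdSignature {

    // Each inner-ring server signs the agreed-upon result with its private key share.
    byte[] signShare(byte[] message, byte[] privateKeyShare);

    // Any k valid shares combine into one full signature.
    byte[] combine(byte[][] anyKShares);

    // On a membership change, regenerate the shares; the public key stays the same,
    // so secondary hosts never notice.
    byte[][] regenerateShares(byte[][] oldShares, int newN, int k);

    // Clients verify with the ring's single, long-lived public key.
    boolean verify(byte[] message, byte[] fullSignature, byte[] ringPublicKey);
}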

  33. The Responsible Party • Who chooses the inner ring? • Responsible party: • A server that publishes sets of failure-independent nodes • Through offline measurement and analysis

  34. Software Architecture • Java atop the Staged Event-Driven Architecture (SEDA) • Each subsystem is implemented as a stage • Each with its own state and thread pool • Stages communicate through events • ~50,000 semicolons, written by five graduate students and many undergraduate interns
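
A minimal Java sketch of one SEDA-style stage as described above (the class shape and Handler interface are illustrative assumptions): each stage owns an event queue and a thread pool, and other stages interact with it only by enqueuing events.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class Stage<E> {
    private final BlockingQueue<E> events = new LinkedBlockingQueue<>();
    private final ExecutorService threads;

    public interface Handler<E> { void handle(E event); }

    public Stage(int threadCount, Handler<E> handler) {
        threads = Executors.newFixedThreadPool(threadCount);
        for (int i = 0; i < threadCount; i++) {
            threads.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        handler.handle(events.take());   // each stage drains its own queue
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
    }

    // Other stages communicate with this one purely by enqueuing events.
    public void enqueue(E event) { events.add(event); }

    public void shutdown() { threads.shutdownNow(); }

    public static void main(String[] args) throws InterruptedException {
        Stage<String> log = new Stage<>(2, ev -> System.out.println("handled: " + ev));
        log.enqueue("update received");
        log.enqueue("fragments archived");
        Thread.sleep(100);
        log.shutdown();
    }
}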

  35. Software Architecture

  36. Language Choice • Java: speed of development • Strongly typed • Garbage collected • Reduced debugging time • Support for events • Easy to port multithreaded code in Java • Ported to Windows 2000 in one week

  37. Language Choice • Problems with Java: • Unpredictability introduced by garbage collection • Every thread in the system is halted while the garbage collector runs • Any ongoing process stalls for ~100 milliseconds • May add several seconds to requests that travel across machines

  38. Experimental Setup • Two test beds • Local cluster of 42 machines at Berkeley • Each with two 1.0 GHz Pentium III CPUs • 1.5 GB PC133 SDRAM • Two 36 GB hard drives in RAID 0 • Gigabit Ethernet adapter • Linux 2.4.18 SMP

  39. Experimental Setup • PlanetLab, ~100 nodes across ~40 sites • 1.2 GHz Pentium III, 1GB RAM • ~1000 virtual nodes

  40. Storage Overhead • For a 16-out-of-32 erasure encoding • 2.7x for data > 8 KB • For a 16-out-of-64 erasure encoding • 4.8x for data > 8 KB

  41. The Latency Benchmark • A single client submits updates of various sizes to a four-node inner ring • Metric: time from just before the request is signed until the signature over the result is verified • Updates 40 MB of data over 1,000 updates, with 100 ms between updates

  42. The Latency Benchmark

  43. The Throughput Microbenchmark • A number of clients submit updates of various sizes to disjoint objects on a four-node inner ring • The clients • Create their objects • Synchronize themselves • Update the objects as many times as possible for 100 seconds

  44. The Throughput Microbenchmark

  45. Archive Retrieval Performance • Populate the archive by submitting updates of various sizes to a four-node inner ring • Delete all copies of the data in its reconstructed form • A single client submits reads

  46. Archive Retrieval Performance • Throughput: • 1.19 MB/s (PlanetLab) • 2.59 MB/s (local cluster) • Latency: • ~30-70 milliseconds

  47. The Stream Benchmark • Ran 500 virtual nodes on PlanetLab • Inner ring in the SF Bay Area • Replicas clustered in the 7 largest PlanetLab sites • Streams updates to all replicas • One writer (the content creator) repeatedly appends to the data object • The others read new versions as they arrive • Measures network resource consumption

  48. The Stream Benchmark

  49. The Tag Benchmark • Measures the latency of token passing • OceanStore is 2.2 times slower than TCP/IP

  50. The Andrew Benchmark • File system benchmark • 4.6x slower than NFS in read-intensive phases • 7.3x slower in write-intensive phases
