
myS3 Fabrizio Manfredi Furuholmen Federico Mosca


Presentation Transcript


  1. myS3 Fabrizio Manfredi Furuholmen, Federico Mosca

  2. Agenda • Introduction • Goals • Principles • myS3 • Architecture • Internals • Sub projects • Conclusion • Developments

  3. Unsolved problem

  4. Web Interface “Amazon S3 provides a simple web-services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web…”

  5. S3 • Every file you upload to Amazon S3 is stored in a container called a bucket. • Each bucket name must be globally unique. • Each bucket can contain an unlimited number of objects (key/value pairs). • Buckets cannot be nested: you cannot create a bucket within a bucket. • Object: id, version, metadata, subresources, ACL • HTTP REST calls: byte-range transfer, parallel transfer
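To make the key/value model concrete, here is a minimal sketch (not from the slides) of what a byte-range GET for an S3 object looks like on the wire; the bucket and key names are just examples:

```python
# Sketch: raw HTTP request for a byte-range GET of an S3 object.
# Bucket/key names are illustrative, not from the deck.
def build_range_get(bucket, key, first_byte, last_byte):
    """Build a GET that fetches only bytes first..last of an object."""
    return (
        f"GET /{bucket}/{key} HTTP/1.1\r\n"
        f"Host: s3.amazonaws.com\r\n"
        f"Range: bytes={first_byte}-{last_byte}\r\n"
        "\r\n"
    )

req = build_range_get("mybucket", "puppy.jpg", 0, 1023)
```

Issuing several such requests for disjoint ranges is what enables the parallel transfer the slide mentions.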

  6. myS3 Translates S3 requests to the local disk

  7. Mapping • An S3 bucket is a directory in the AFS space • An S3 object is a file or a directory • S3 ACLs are a fake object: AFS ACL permissions and Unix permissions are returned as S3 metadata • All other S3 features are not implemented

  8. S3 Request Objects in the same bucket don't have any relation!!! There is no hierarchy: GET /mybucket/puppy.jpg retrieves the object, but in GET /mybucket/yesterday/puppy.jpg the "yesterday" directory doesn't exist.
  GET /mybucket/puppy.jpg HTTP/1.1
  User-Agent: dotnet
  Host: s3.amazonaws.com
  Date: Tue, 15 Jan 2008 21:20:27 +0000
  x-amz-date: Tue, 15 Jan 2008 21:20:27 +0000
  Authorization: AWS AKIAIOSFODNN7EXAMPLE:k3nL7gH3+PadhTEVn5EXAMPLE

  9. S3 Request • To retrieve directory content: set the prefix to the parent directory and the delimiter to '/' • To create a directory: use an object name with '/' at the end
  <ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>ExampleBucket</Name>
  <Prefix>/mydir/</Prefix>
  <Marker></Marker>
  <MaxKeys>1000</MaxKeys>
  <Delimiter>/</Delimiter>
  <IsTruncated>false</IsTruncated>
  <Contents>
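The prefix/delimiter trick can be sketched in a few lines of Python: given a flat key space, everything after the prefix up to the first delimiter is rolled up into a "directory" entry, just as the ListBucket call does. The keys below are made-up examples:

```python
# Sketch: how prefix + delimiter produce a directory-style listing
# from a flat key space. Key names are illustrative.
def list_bucket(keys, prefix="", delimiter="/"):
    contents, common_prefixes = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Roll everything below the first delimiter into one entry.
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            contents.append(key)
    return contents, sorted(common_prefixes)

keys = ["mydir/a.txt", "mydir/sub/", "mydir/sub/b.txt"]
contents, prefixes = list_bucket(keys, prefix="mydir/")
# contents == ["mydir/a.txt"], prefixes == ["mydir/sub/"]
```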

  10. AWS Auth
  Authorization = "AWS" + " " + AWSAccessKeyId + ":" + Signature;
  Signature = Base64( HMAC-SHA1( YourSecretAccessKeyID, UTF-8-Encoding-Of( StringToSign ) ) );
  StringToSign = HTTP-Verb + "\n" + Content-MD5 + "\n" + Content-Type + "\n" + Date + "\n" + CanonicalizedAmzHeaders + CanonicalizedResource;
  CanonicalizedResource = [ "/" + Bucket ] + <HTTP-Request-URI, from the protocol name up to the query string> + [ subresource, if present. For example "?acl", "?location", "?logging", or "?torrent" ];
  CanonicalizedAmzHeaders = <described below>
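The signing scheme above can be sketched with only the standard library. This is a simplified version (the x-amz-* header canonicalization is left out), using the example access key from the earlier request and the well-known documentation secret key:

```python
# Sketch of AWS v2 request signing: Base64(HMAC-SHA1(secret, StringToSign)).
# Simplified: CanonicalizedAmzHeaders is omitted here.
import base64
import hashlib
import hmac

def sign_v2(secret_key, string_to_sign):
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")

string_to_sign = ("GET\n"                                 # HTTP-Verb
                  "\n"                                    # Content-MD5 (empty)
                  "\n"                                    # Content-Type (empty)
                  "Tue, 15 Jan 2008 21:20:27 +0000\n"     # Date
                  "/mybucket/puppy.jpg")                  # CanonicalizedResource
signature = sign_v2("wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", string_to_sign)
authorization = "AWS " + "AKIAIOSFODNN7EXAMPLE" + ":" + signature
```

The resulting header value has the same shape as the Authorization line in the slide 8 request.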

  11. Authentication • IP based: computer account; user authentication is handled by an internal DB • Impersonate: forge the ticket for the user on the server side; authentication is handled by an internal DB • Token generation: web interface authentication (krb auth), one-time AWS token generation

  12. Server Architecture • Interfaces: S3 interface, Web interface • Managers: Bucket Manager, Storage Manager, Token Manager, Auth Manager • Drivers/Plugins: Storage Driver, Cache • Backend: /afs

  13. Internal DB • Bucket DB: contains the map between the bucket name and the AFS path, e.g. Myhome -> /afs/beolink/home/manfred • Token DB: contains the access key and secret key for Amazon authentication; with web-based authentication the DB contains the Kerberos token

  14. Storage Manager • NFS style: most operations are performed on temporary files (.NFSXXX) • Caching: saves temporary files in non-AFS space • NoWait: returns OK as soon as the file is on the S3 server • Mem: keeps transferred files in memory (max 100 MB) • ACL: enables write operations on AFS ACLs • MD5: enables or disables MD5

  15. TODO • Parallel transfer • Locking • Kerberos token based auth • Chunk transfer (HTTP 100) / byte-range transfer • Create an interface for CloudStack • Automatic volume release

  16. RestFS

  17. GOAL Create a framework for testing new technologies and paradigms

  18. Principle 1/3 “Moving Computation is Cheaper than Moving Data”

  19. Principle 2/3 “There is always a failure waiting around the corner” (Werner Vogels)

  20. Principle 3/3 “Decompose into small, loosely coupled, stateless building blocks” (Chad Fowler, ‘Leaving a Legacy System Revisited’)

  21. Five pillars

  22. RestFS Keywords

  23. Object • An object is data plus metadata • Data: blocks 1..n, grouped into segments; each block carries a hash and a serial • Metadata: properties, extended properties, ACL, segments, and attributes set by the user, each with a serial

  24. Bucket Discovery • The client performs a DNS lookup on the bucket name and gets the cell resource-locator IP list (N servers) • The client queries the servers and receives a server list plus load info • The result is a priority-ordered server list (diagram shows two cells)

  25. RestFS Cache (client side) • Server list: DNS resource locator • Tokens: federated auth, temporary • Pub/Sub list: callbacks, locks • Metadata cache: RestFS metadata, persistent • Block cache: RestFS blocks

  26. Server Architecture • Interfaces: S3, RestFS RPC, Auth, Token, Resource Locator, Sub/Pub • Services: Auth Service, Token Service, RL Service, Callback Service, Meta Service, Block Service, Locks Service • Managers: Storage Mgr, Meta Mgr, Locks Mgr, Auth Manager, Token Manager, Resource Manager, Callbacks Manager • Distributed cache • Drivers/Plugins: Storage Driver, Meta Driver, Locks Driver, Resource Driver, Callbacks Driver, Auth Driver, Token Driver • Backends

  27. Mounting [diagram: two cells, each containing a bucket with N objects]

  28. Object Versioning • The segment contains the diff to the upstream object • Each object knows the previous and the next version; the current object knows the previous and the last [diagram: a cell with a bucket of N object versions]

  29. Block Storage

  30. Backend: Consistent Hashing • Number of keys to move when adding/removing a node: Keys/Nodes = keys to relocate • Blocks are collected in shards http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/
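A minimal consistent-hash ring (with virtual nodes) shows the Keys/Nodes relocation property from the slide; the node names and key counts below are made up for illustration:

```python
# Sketch: consistent hashing with virtual nodes. Adding one node
# relocates only about Keys/Nodes of the keys, not all of them.
import bisect
import hashlib

def _h(value):
    # Map a string onto the hash ring.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=100):
        # Each physical node gets `vnodes` points to even out load.
        self._ring = sorted((_h(f"{n}-{i}"), n)
                            for n in nodes for i in range(vnodes))
        self._points = [p for p, _ in self._ring]

    def node_for(self, key):
        # The first ring point clockwise from the key's hash owns the key.
        idx = bisect.bisect(self._points, _h(key)) % len(self._points)
        return self._ring[idx][1]

keys = [f"block-{i}" for i in range(1000)]
ring3 = Ring(["n1", "n2", "n3"])
ring4 = Ring(["n1", "n2", "n3", "n4"])
moved = sum(ring3.node_for(k) != ring4.node_for(k) for k in keys)
# Roughly Keys/Nodes (about a quarter of 1000) relocate when n4 joins.
```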

  31. Block Storage • AFS: a volume stores a range of hashes; each chunk is written to 3 volumes/servers • PISA: cluster of nodes; communication based on ZeroMQ; consensus based on Raft • CEPH: uses Ceph nodes directly

  32. Backend: Storage • 3 copies • Configurable read and write consistency levels and security: 2W1R, 2W2R, 1W1R, … • Neighbors are monitored in small clusters of 3 nodes (gossip) • Mini-cluster election and key-space reclaim coordinate replicas when nodes leave or join the cluster

  33. Protocols

  34. RestFS Protocol • WebSocket is a web technology for multiplexing bi-directional, full-duplex communication channels over a single TCP connection, on the standard HTTP/HTTPS ports:
  GET /mychat HTTP/1.1
  Host: server.example.com
  Upgrade: websocket
  Connection: Upgrade
  Sec-WebSocket-Key: x3JJHMbDL1EzLkh9GBhXDw==
  Sec-WebSocket-Protocol: chat
  Sec-WebSocket-Version: 13
  Origin: http://example.com
  • JSON-RPC is a lightweight remote procedure call protocol similar to XML-RPC. It's designed to be simple, and messages are simple to convert into a Python dict:
  --> { "method": "readBlock", "params": ["…"], "id": 1 }
  <-- { "result": [..], "error": null, "id": 1 }
  • BSON, short for Binary JSON, is a binary-encoded serialization of JSON-like documents. BSON can be compared to binary interchange formats:
  {"hello": "world"} → "\x16\x00\x00\x00\x02hello\x00\x06\x00\x00\x00world\x00\x00"
  *Compression is a long story…
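The "simple to convert into a Python dict" point can be shown directly: a JSON-RPC frame round-trips through the standard json module with no extra machinery. The field values below are illustrative:

```python
# Sketch: JSON-RPC framing as used by the protocol described above.
# Each frame converts straight into a Python dict on the receiving side.
import json

request = json.dumps({"method": "readBlock", "params": ["..."], "id": 1})
response = json.dumps({"result": [], "error": None, "id": 1})

msg = json.loads(request)     # -> {'method': 'readBlock', ...}
reply = json.loads(response)  # <- {'result': [], 'error': None, 'id': 1}
```

Matching the "id" field pairs a reply with its request when frames are multiplexed over one WebSocket connection.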

  35. Protocols Metadata
  Collecting per segment, parallel requests per segment:
  { "method": "readBlock", "params": [ bucket_name: test, segment: 1, blocks: [1,2,3,4] ], "id": 1 }
  Check cached data:
  { "method": "getSegmentVer", "params": [ bucket_name: test, segment: 1 ], "id": 1 }
  <-- { "result": [ ver: 1335519328.091779 ], "error": null, "id": 1 }
  Block hash list for a specific segment:
  { "method": "getSegmentHash", "params": [ bucket_name: test, segment: 1 ], "id": 1 }
  <-- { "result": [ 1:16db0420c9cc29a9d89ff89cd191bd2045e47378, 2:9bcf720b1d5aa9b78eb1bcdbf3d14c353517986c, … ], "error": null, "id": 1 }

  36. NOSQL DB

  37. Redis performance
  $ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -P 16 -q
  SET: 552028.75 requests per second
  GET: 707463.75 requests per second
  LPUSH: 767459.75 requests per second
  LPOP: 770119.38 requests per second

  38. Code

  39. Pluggable interfaces, dynamically loaded

  40. Support

  41. Thank you http://restfs.beolink.org manfred.furuholmen@gmail.com fege85@gmail.com

  42. Bucket

  43. Bucket The bucket has many properties; the property element is a collection of object information. With this element you can retrieve the default values for the bucket (logging level, security level, etc.).
  Property objects: Property, Property Ext, Property ACL, Property Stats
  Example bucket "zebra" (stored as a Python dict):
  BucketName: zebra
  Property: segment_size=512, block_size=16k, max_read=1000, bucket_size=0, bucket_quota=10000, storage_class=STANDARD, compression=none, logging=enable, bucket_type=fs, …
  Default parameters: Filesystem (the bucket is used as a filesystem), Logging (logs operations done on the specific bucket), Replica RO (bucket shadow replication), … plus custom definitions
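Since the slide notes the property set is a Python dict, a sketch of the "zebra" bucket's defaults as shown above might look like this (key spellings are an assumption; only the values come from the slide):

```python
# Sketch: bucket default properties as a Python dict, using the
# values from the "zebra" example. Key names are assumed spellings.
bucket = {
    "BucketName": "zebra",
    "Property": {
        "segment_size": 512,
        "block_size": "16k",
        "max_read": 1000,
        "bucket_size": 0,
        "bucket_quota": 10000,
        "storage_class": "STANDARD",
        "compression": "none",
        "logging": "enable",
        "bucket_type": "fs",  # "fs": the bucket is used as a filesystem
    },
}
```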

  44. Objects

  45. Object • An object is data plus metadata • Data: blocks 1..n, grouped into segments; each block carries a hash and a serial • Metadata: properties, extended properties, ACL, segments, and attributes set by the user, each with a serial

  46. Metadata: Properties
  Object types: Data (contains files), Folder (special object that contains other objects), Mount point (contains the name of the buckets), Link (contains the name of the objects), Immutable (gold image), Custom (defined by the users)
  An object is addressed by bucket name plus object id; the special id bucket_name.ROOT is the starting point of the file system.
  Example object zebra.c1d2197420bd41ef24fc665f228e2c76e98da247:
  Property: object_type=data, segment_size=512, block_size=16k, content_type=, md5=ab86d732d11beb65ed0183d6a87b9b0 (object hash, replaced by a Merkle tree), max_read=1000, storage_class=STANDARD, compression=none, name="my first object", object_size=10000, object_prev=zebra.c1d2197420bd41ef24fc665f228e2c76e98dartg (pointer to the previous object), …, vers:1335519328.091779 (object version)

  47. Metadata: Segment
  Data_size / (block_size * segment_size) = total segments
  Segment element (a Python dict), e.g. Segment-1:
  1:16db0420c9cc29a9d89ff89cd191bd2045e47378
  2:9bcf720b1d5aa9b78eb1bcdbf3d14c353517986c
  3:158aa47df63f79fd5bc227d32d52a97e1451828c
  4:1ee794c0785c7991f986afc199a6eee1fa4
  5:c3c662928ac93e206e025a1b08b14ad02e77b29d
  …
  vers:1335519328.091779
  Each entry is block position: integrity hash. The version is based on a timestamp plus an increment, useful for vector-clock conflict resolution.
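The segment-count formula can be sketched in Python with the sizes used elsewhere in the deck (block_size 16 KiB, segment_size 512 blocks, so one segment holds 8 MiB):

```python
# Sketch of the formula above: data_size / (block_size * segment_size),
# rounded up to whole segments. Sizes match the deck's defaults.
import math

def total_segments(data_size, block_size=16 * 1024, segment_size=512):
    return math.ceil(data_size / (block_size * segment_size))

# One segment holds 512 blocks of 16 KiB = 8 MiB,
# so a 20 MiB object needs 3 segments.
n = total_segments(20 * 1024 * 1024)
```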

  48. RestFS IDs • Bucket id: plain-text DNS name • Object id: randomly generated UUID, unique inside the bucket • Segment and block ids: based on the position of the content • Chunk data on the storage: SHA-1 hash of the concatenation Bucket.object.segment.block_id
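The id scheme above can be sketched directly; uuid4 stands in for the random object-id generation the slide mentions, and the exact separator is an assumption based on the "Bucket.object.segment.block_id" wording:

```python
# Sketch: chunk id as SHA-1 of "bucket.object.segment.block_id".
# The '.' separator is assumed from the slide's notation.
import hashlib
import uuid

def chunk_id(bucket, object_id, segment, block):
    raw = f"{bucket}.{object_id}.{segment}.{block}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()

obj = str(uuid.uuid4())          # object id: randomly generated UUID
cid = chunk_id("zebra", obj, 1, 1)
# cid is a 40-character hex digest locating the chunk on the storage
```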

  49. Mounting [diagram: two cells, each containing a bucket with N objects]

  50. Object Versioning • The segment contains the diff to the upstream object • Each object knows the previous and the next version; the current object knows the previous and the last [diagram: a cell with a bucket of N object versions]
