
Distributed File Systems



Presentation Transcript


  1. Distributed File Systems – Group E

  2. Topics to be covered • GlusterFS - Aditya Jayanta Undirwadkar • LustreFS - Nameeta Doshi • Swift - Deepthy Rathinasabapathy • Cloudy - Ananya Srinivas

  3. GlusterFS

  4. GlusterFS – How it differs from other file systems • Open source • Scalability to petabytes and beyond • Software-only • Fully distributed architecture • No metadata server • FUSE-based client, POSIX compliant • High availability • Self-healing mechanism • Replication to survive hardware failure • Stackable user-space design • Can be layered on top of native file systems (e.g. ext4, XFS) • No kernel dependencies • Easy installation procedure • Adapts to the workload, e.g. Adaptive Least Usage (ALU), Non-Uniform File Allocation (NUFA)

  5. GlusterFS – Terminologies • Components • Brick • A storage filesystem assigned to a volume • Client • A machine that accesses a volume • Server • A machine that hosts a volume • Sub-volume • A brick after it has been processed by at least one translator • Volume • The collection of sub-volumes after passing through all translators • Translator • Connects to one or more sub-volumes, performs an operation on them, and itself offers a sub-volume connection • Distribute • Collects all sub-volumes and distributes files across them

  6. GlusterFS – Architecture (diagram) • Client side: GlusterFS clients, plus a storage gateway exporting NFS/Samba over TCP/IP for compatibility with MS Windows and other devices • Clients run a clustered volume manager and clustered I/O scheduler • Transport: RDMA over InfiniBand, or GigE/10GigE • Server side: storage bricks 1…N, each exporting a GlusterFS volume

  7. GlusterFS - Architecture • GlusterFS Server • No metadata or master server, unlike HDFS and GFS • Flexible backend • Clustered storage with a global namespace • A collection of bricks built on underlying (typically commodity) hardware • NFS-like on-disk layout • Dynamic addition and removal of bricks • GlusterFS Client • POSIX compliant • Built-in quota service • POSIX ACLs, applied through directories

  8. GlusterFS – Volume Management • Storage volumes are built from the underlying hardware and can grow, shrink, or be migrated across physical systems as needed • Bricks can be added or removed dynamically to balance load across the cluster • Volume types • Distributed • Replicated • Distributed + Striped

  9. GlusterFS – Volume Management • Distributed Volume • Files are distributed across multiple bricks on multiple servers • No striping, no replication • (Diagram: the client's files are placed whole on individual bricks: Brick 1: Server01/data, Brick 2: Server02/data, Brick 3: Server03/data)

  10. GlusterFS – Volume Management • Replicated Volume • Replicates the data on two or more nodes • High availability and high reliability • Improved read performance • At least twice the storage capacity is needed • (Diagram: File_1.txt, File_2.txt and File_3.txt are each stored on two of the three bricks: Server01/data, Server02/data, Server03/data)

  11. GlusterFS – Volume Management • Distributed Striped Volume • Stripes the data across two or more nodes • Files become inaccessible if one brick fails • Recommended for large files and highly concurrent environments • (Diagram: File_1.txt, File_2.txt and File_3.txt are each split into three blocks, with one block on each of Server01/data, Server02/data, Server03/data)

  12. GlusterFS – File Distribution • No metadata server • A hashing algorithm translates file names to locations, improving performance • Each file gets a unique hash tag • Faster reads, low latency • Uses translators and the Distribute function to place files • Translators can be reconfigured dynamically as the system's needs change
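A minimal Python sketch of the idea behind metadata-free placement: every client hashes the file name and maps it onto a brick, so no lookup server is consulted. Real GlusterFS uses its elastic hashing algorithm over per-directory hash ranges; the brick names and the MD5 hash here are illustrative assumptions only.

```python
import hashlib

# Hypothetical brick list; real GlusterFS assigns hash ranges per directory
# via extended attributes. This is only a simplified illustration.
BRICKS = ["server01:/data", "server02:/data", "server03:/data"]

def locate_brick(filename: str, bricks=BRICKS) -> str:
    """Map a file name to a brick purely by hashing -- no metadata lookup."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return bricks[int(digest, 16) % len(bricks)]

# Every client computes the same placement independently:
for name in ("File_1.txt", "File_2.txt", "File_3.txt"):
    print(name, "->", locate_brick(name))
```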

  13. GlusterFS – Translators • Storage Translators • POSIX: a normal POSIX file system as the backend • BDB: Berkeley DB as the backend • Performance Translators • Read Ahead: pre-fetches data • Write Behind: allows data to be written in the background • I/O Threads: background file I/O operations • I/O Cache: caches data that has been read • Booster: in some cases allows direct access to GlusterFS without the client software, improving performance • Miscellaneous Translators • Trace: for debugging • Rot13: demonstrates how encryption can be done within GlusterFS • Transport Translators • TCP: uses TCP between client and server • IB-Verbs: uses the Verbs API to run transport over InfiniBand
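To illustrate the stackable translator idea, here is a rough Python sketch of a write-behind-style translator wrapping a POSIX-style storage backend. The class names and the local file path are invented for the example; this is not the actual GlusterFS translator API.

```python
import threading
import queue

class PosixStorage:
    """Stand-in for the POSIX storage translator: writes go to a local file."""
    def write(self, path, data):
        with open(path, "ab") as f:
            f.write(data)

class WriteBehind:
    """Sketch of a write-behind translator: acknowledge the write immediately
    and flush it to the underlying sub-volume from a background thread."""
    def __init__(self, subvolume):
        self.subvolume = subvolume
        self.pending = queue.Queue()
        threading.Thread(target=self._flush, daemon=True).start()

    def write(self, path, data):
        self.pending.put((path, data))   # returns to the caller right away

    def _flush(self):
        while True:
            path, data = self.pending.get()
            self.subvolume.write(path, data)
            self.pending.task_done()

# Translator stack: caller -> write-behind -> POSIX storage backend
volume = WriteBehind(PosixStorage())
volume.write("/tmp/demo.txt", b"hello\n")
volume.pending.join()   # wait for the background flush in this tiny demo
```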

  14. GlusterFS – Distribution & Load Balancing • Distribution in GlusterFS • The type of clustering translator decides how files are stored • Distribute: collects information about the heterogeneous hardware • Unify: combines all available sub-volumes into a single storage space • Load Balancers • Adaptive Least Usage (ALU): considers disk usage, read usage, write usage, open-file usage and disk speed • Round Robin (RR): a round-robin loop per client • Random: randomly selects a volume for storage • NUFA (Non-Uniform File Allocation): stores files on the local node when the client is also a server • A separate namespace volume per client

  15. GlusterFS – Fault Tolerance • Replication first, then distribution • GlusterFS uses Automatic File Replication (AFR) for replication and DHT to distribute files across the volumes • A file is fed to AFR to obtain a set of replicas, and DHT is then applied to distribute them, giving better recovery performance • Geo-replication • Asynchronous replication across geographically separated volumes • Used for disaster recovery • Self-healing • Replicated copies can self-heal
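A hedged sketch of "replicate first, then distribute": bricks are grouped into replica sets (the AFR step) and a hash picks one set per file (the DHT step). The server names and the two-way grouping are assumptions for illustration, not the real GlusterFS layout code.

```python
import hashlib

# Hypothetical layout: bricks are grouped into replica pairs (AFR),
# and DHT distributes files across those pairs.
REPLICA_SETS = [
    ("server01:/data", "server02:/data"),
    ("server03:/data", "server04:/data"),
]

def place_file(filename: str):
    """Return every brick that should hold a copy of the file."""
    digest = int(hashlib.md5(filename.encode()).hexdigest(), 16)
    replica_set = REPLICA_SETS[digest % len(REPLICA_SETS)]   # DHT step
    return list(replica_set)                                 # AFR step: all replicas

print(place_file("File_1.txt"))   # both bricks of one replica pair
```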

  16. GlusterFS – Self Healing explained • A native client module configured with replication creates a file (touch test_file.txt) • The write is sent to the Gluster server modules on both storage servers using a two-phase-commit-style protocol: 1. Pre-op 2. Op 3. Post-op • Each server records the operation in a change log

  17. GlusterFS – Self Healing explained • The client (native client module configured with replication) opens the file (vim test_file.txt) • A lookup operation checks the file status, permissions and change log on the Gluster server modules of both storage servers

  18. GlusterFS – Self Healing explained • Any stale copy is brought up to date from the healthy one • Afterwards both storage servers hold a consistent copy of test_file.txt

  19. GlusterFS – Why? • Common solutions offered • Large scale File storage • File sharing • High Performance Computing Storage • IaaS Storage layer • Disaster Recovery • Backup & Restore • Private or Hybrid Cloud Solutions

  20. References • www.gluster.org • GlusterFS Architecture: Petascale Cloud Filesystem – A. B. Periasamy • Wikipedia • Concepts of Gluster – Eric B. Boyer, Matthew C. Broomfield, Terrell A. Perrotti

  21. Lustre File System

  22. Features of Lustre File System • Open source • High-performance heterogeneous networking • High availability • Security • Interoperability • Object-based architecture • Scalability: supports petabytes of storage • POSIX compliant • OSS addition (capacity grows by adding object storage servers) • Internal monitoring and instrumentation interfaces

  23. LustreFS Architecture • A single Metadata Server (MDS) • Has a single Metadata Target (MDT) • Stores namespace metadata such as filenames and directories • One or more Object Storage Servers (OSS) • Each has Object Storage Targets (OSTs), typically 2–8 per OSS • Each OST manages a single local disk filesystem • Clients • Access and use the data

  24. LustreFS Architecture • MDT data is stored in a single local disk filesystem; the MDS controls file access and tells client nodes which object(s) make up a file • An OST is a dedicated object-based filesystem exported for read/write operations • The capacity of a Lustre file system is the sum of the capacities of its OSTs

  25. Interaction between Lustre Subsystems

  26. Lustre Networking (LNET) • LNET is a custom networking API that provides the communication infrastructure that handles metadata and file I/O data • Support for many commonly-used network types such as InfiniBand and TCP/IP • Simultaneous availability of multiple network types with routing between them.

  27. Lustre Storage • A file stored on the MDT points to one or more objects that hold the data • If the MDT file points to more than one object, the file data is striped across those objects (using RAID 0) and each object is stored on a different OST • The inode contains the file attributes such as owner, access permissions, Lustre striping layout, access time and access control • Multiple filenames can point to the same inode

  28. Lustre Striping • Data is striped across multiple OSTs using a round-robin algorithm • Striping allows chunks of a file's data to be stored on different OSTs • A RAID 0 pattern is used for striping • Data is striped across a number of objects; the number of objects in a single file is called the stripe_count • When the chunk of data written to a particular object exceeds the stripe_size, the next chunk is stored in the next object
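The round-robin RAID 0 layout described above implies a simple mapping from a byte offset in the file to the object that holds it. The Python sketch below writes that mapping out directly; it assumes the stripe_size/stripe_count parameters from the slide and is not taken from the Lustre code.

```python
def locate_stripe(offset: int, stripe_size: int, stripe_count: int):
    """Map a byte offset in a file to (object index, offset inside that object)
    under a round-robin RAID 0 striping layout."""
    stripe_number = offset // stripe_size          # which stripe of the file
    object_index = stripe_number % stripe_count    # which object (OST) holds it
    stripe_on_object = stripe_number // stripe_count
    object_offset = stripe_on_object * stripe_size + (offset % stripe_size)
    return object_index, object_offset

# Example: 1 MiB stripes over 3 objects; byte 3.5 MiB of the file
print(locate_stripe(offset=3_670_016, stripe_size=1 << 20, stripe_count=3))
# -> (0, 1572864): the fourth stripe wraps back to the first object
```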

  29. Lustre I/O • When a client opens a file, the open operation transfers the file layout from the MDS to the client • The client then uses this layout to perform I/O on the file, interacting directly with the OSSs

  30. Lustre Read • The client gets the information about a file from the MDS and directly interacts with the OSS. • Each OST runs a lock server and manages the locking for the stripes of data which reside on that OST • Read request for a file is serviced in two phases: • A lock request • Actual read request • After receiving a lock for the file object, the client can then read the data.
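A very rough sketch of the two-phase read flow described above, with entirely hypothetical class and method names (the real Lustre client is a kernel module, not a Python API):

```python
class OST:
    """Stand-in for an object storage target that runs its own lock server."""
    def __init__(self, objects):
        self.objects = objects          # object_id -> bytes
        self.locks = set()

    def acquire_read_lock(self, object_id, byte_range):
        self.locks.add((object_id, byte_range))               # phase 1: lock request
        return True

    def read(self, object_id, start, length):
        return self.objects[object_id][start:start + length]  # phase 2: actual read

def client_read(ost, object_id, start, length):
    # The layout was already obtained from the MDS at open(); the read itself
    # never touches the MDS, only the OST holding the object.
    if ost.acquire_read_lock(object_id, (start, start + length)):
        return ost.read(object_id, start, length)

ost = OST({"obj-1": b"lustre stripe contents"})
print(client_read(ost, "obj-1", 7, 6))   # b'stripe'
```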

  31. Lustre Writes • A two-step procedure similar to reads • Write requests are only complete when the data is actually flushed to the OSTs • In busy clusters this imposes a considerable delay on every file write operation • A possible solution is a writeback cache • For OSTs: file writes are committed asynchronously • For the MDS: updates are first written to the cache and flushed to the MDS later, which improves the update latency seen by clients

  32. LustreFS Updates • Two distinct types of operations • File system metadata updates on the MDS • Actual file data updates on the OSTs • Namespace operations are confined to the MDS so that they do not impact the performance of operations that only manipulate actual object (file) data

  33. Recovery Mechanism • For the Metadata Server • A backup MDS is maintained • Clients incur a timeout when accessing data, then query an LDAP server for information about the replacement server • Infrastructure for Lustre recovery • Service controller: epoch number • Metadata server: incarnation number • Connections between a client and other systems: generation number • For an Object Storage Target • An error is generated when data cannot be accessed • Lustre adapts automatically

  34. Scalability

  35. Performance

  36. References • lustre.org • Wikipedia • Lustre: A Scalable, High-Performance File System • Understanding Lustre Filesystem Internals

  37. Swift

  38. Swift – How it differs from other file systems • Open-source, object-based storage system • Not a block-based distributed file system • Scales as the cluster grows • No single point of failure • Metadata is distributed and replicated across the cluster • Designed and optimized for the storage of small files • Multi-tenancy

  39. Swift - Features • All objects stored in Swift have a URL • Object data can be located anywhere in the cluster • The cluster scales by adding additional nodes – without sacrificing performance, which allows a more cost-effective linear storage expansion vs. fork-lift upgrades • Data doesn’t have to be migrated to an entirely new storage system • New nodes can be added to the cluster without downtime • Failed nodes and disks can be swapped out with no downtime • Runs on industry-standard hardware, such as Dell, HP, Supermicro etc.

  40. Swift – Architecture • Proxy Server: handles all incoming requests and routes them to the other servers • Storage servers • Object Server: stores objects (currently files of less than 5 GB) • Container Server: keeps track of the objects in each container • Account Server: keeps track of all the containers • Authorization Server: ensures your cloud storage is contained and authorized • Other features • The Ring: maps data to disks • Replication: keeps the entire system consistent despite problems such as network outages or drive failures • Updaters: process failed updates • Integrity audits: quarantine objects when the crawling auditor finds a problem and replace the bad file with a replica

  41. Swift – Architecture • (Diagram: the proxy tier processes incoming API requests)

  42. Swift - Architecture • Swift Client • The OpenStack clients are command-line interfaces (CLIs) that issue OpenStack API calls • The OpenStack APIs are RESTful APIs • The Auth System • The client passes an auth token with each request • Swift validates each token with the auth system and caches the result • The token does not change from request to request, but it does expire
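Because the APIs are RESTful, an object can be uploaded and fetched with plain HTTP plus the X-Auth-Token header. The sketch below uses Python's requests library; the storage URL, token, container and object names are placeholders, not live endpoints.

```python
import requests

# Placeholder values: a real deployment obtains the storage URL and token
# from the auth system (e.g. Keystone); these are not live endpoints.
STORAGE_URL = "https://swift.example.com/v1/AUTH_demo"
HEADERS = {"X-Auth-Token": "AUTH_tk_replace_me"}

# Every object is addressable by URL: /v1/<account>/<container>/<object>
object_url = f"{STORAGE_URL}/photos/cat.jpg"

# Upload (PUT) an object into the 'photos' container
with open("cat.jpg", "rb") as f:
    resp = requests.put(object_url, headers=HEADERS, data=f)
    resp.raise_for_status()          # 201 Created on success

# Download (GET) it back
resp = requests.get(object_url, headers=HEADERS)
print(resp.status_code, len(resp.content))
```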

  43. Swift - Replication • Asynchronous, peer-to-peer replicator processes balance load across physical disks • Replication uses a push model: records and files are generally only copied from local to remote replicas • Replica placement is handled by the ring • Every deleted record or file in the system is marked by a tombstone • All objects are replicated 3x in as-unique-as-possible zones, which can be defined as a group of drives, a node, a rack, etc.

  44. Swift – Replication • All objects are replicated 3x in as-unique-as-possible zones, which can be defined as a group of drives, a node, a rack, etc. • Two types of replication • DB replication: replicates accounts and containers • Object replication: replicates object data
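A toy Python sketch of what "the ring places 3 replicas in as-unique-as-possible zones" means. The real ring works with partitions and device weights; the device list and zone numbers here are invented for illustration.

```python
import hashlib

# Toy ring: each device belongs to a zone. Real Swift builds the ring from
# partitions and device weights; this only illustrates replica placement.
DEVICES = [
    {"id": "d1", "zone": 1}, {"id": "d2", "zone": 1},
    {"id": "d3", "zone": 2}, {"id": "d4", "zone": 2},
    {"id": "d5", "zone": 3}, {"id": "d6", "zone": 3},
]

def place_object(name: str, replicas: int = 3):
    """Pick `replicas` devices for an object, preferring distinct zones."""
    start = int(hashlib.md5(name.encode()).hexdigest(), 16) % len(DEVICES)
    rotated = DEVICES[start:] + DEVICES[:start]
    chosen, zones_used = [], set()
    for dev in rotated:
        if dev["zone"] in zones_used:
            continue
        chosen.append(dev["id"])
        zones_used.add(dev["zone"])
        if len(chosen) == replicas:
            break
    return chosen

print(place_object("AUTH_demo/photos/cat.jpg"))   # three devices, three zones
```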

  45. Swift – Object Versioning • Object versioning in Swift is enabled by setting a flag on the container that tells Swift to version all objects in it (flag: X-Versions-Location) • GET on a versioned object returns the current version without any request redirects or metadata lookups • POST on a versioned object updates the object metadata as normal but does not create a new version • DELETE on a versioned object only removes the current version; if there are 5 versions in total, the object must be deleted 5 times to remove it completely
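A short, hedged example of enabling versioning through the X-Versions-Location container flag, again using Python's requests library; the endpoint, token and container names are placeholders.

```python
import requests

# Placeholder endpoint/token; the header names are the standard Swift ones.
STORAGE_URL = "https://swift.example.com/v1/AUTH_demo"
HEADERS = {"X-Auth-Token": "AUTH_tk_replace_me"}

# 1. Create the archive container that will hold old versions
requests.put(f"{STORAGE_URL}/photos_versions", headers=HEADERS)

# 2. Flag the main container so Swift versions every object in it
requests.post(
    f"{STORAGE_URL}/photos",
    headers={**HEADERS, "X-Versions-Location": "photos_versions"},
)

# 3. Overwriting photos/cat.jpg now archives the previous copy into
#    photos_versions; each DELETE restores the next-newest version.
requests.put(f"{STORAGE_URL}/photos/cat.jpg",
             headers=HEADERS, data=b"new image bytes")
```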

  46. Swift – Fault Tolerance • Fault tolerant: Swift implements availability zones within a single geographic region • Eventually consistent: Swift achieves high scalability by relaxing constraints on consistency • Read-your-writes consistency for new objects • Reading an object that has been overwritten with new data may return an older version of the object data • The client can request the most up-to-date version at the cost of extra request latency

  47. Swift – Load Balancing • A SwiftStack cluster consists of three basic tiers: a load-balancing tier, a proxy tier and a storage tier • Each proxy/storage node has its own IP, configured during the SwiftStack node install process • Each SwiftStack cluster also has a Cluster API IP address, which is used by the load balancer • The Cluster API IP address should be on the same network as each node's outward-facing IP, but different from all nodes' outward-facing IPs • If the built-in SwiftStack load balancer is used, this IP address is set up automatically and no additional bindings are needed on the nodes

  48. Summary – Why Swift • Scalable: can be grown by just adding drives and nodes • Extremely durable: triple replication, 2 out of 3 writes ensured, and defined failure zones • Uses a REST API similar to the Amazon S3 API and compatible with the Rackspace Cloud Files API • Can be deployed in-house or as-a-service • Open-source software • No vendor lock-in • Community support • Large ecosystem

  49. References • COSBench: Cloud Object Storage Benchmark – Yaguang Wang, Jian Zhang, Jiangang Duan, Intel Asia-Pacific R&D Ltd. • SwiftBench: http://swiftstack.com/training/operating-swift/benchmarking/ • OpenStack Swift: http://swift.openstack.org/ • Swift: Fast, Reliable, Loosely Coupled Parallel Computation – Yong Zhao, Mihael Hategan, Ben Clifford, Ian Foster et al. • Intercloud Object Storage Service: Colony – Shigetoshi Yokoyama, Nobukazu Yoshioka

  50. Cloudy – A Modular Cloud Storage System
