
The Google File System



  1. The Google File System • Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung • Google • SOSP 2003 (19th ACM Symposium on Operating Systems Principles)

  2. Contents • Introduction • GFS Design • Measurements • Conclusion

  3. Introduction (1/2) • What is a File System? • A method of storing and organizing computer files and their data. • Used on data storage devices such as hard disks or CD-ROMs to maintain the physical location of files. • What is a Distributed File System? • Makes it possible for multiple users on multiple machines to share files and storage resources via a computer network. • Transparency in Distributed Systems • Makes a distributed system as easy to use and manage as a centralized system • Gives a single-system image • Typically implemented as network software operating as a client-server system

  4. Introduction (2/2) • What is the Google File System? • A scalable distributed file system for large distributed data-intensive applications. • Shares many of the same goals as previous distributed file systems • Performance, scalability, reliability, availability • GFS departs from traditional design choices • Component failures are the norm rather than the exception • Files are huge by traditional standards • Multi-GB files are common • Most files are mutated by appending new data rather than overwriting existing data • Co-designing the applications and the file system API benefits the overall system by increasing flexibility

  5. Contents • Introduction • GFS Design (1. Design Assumption 2. Architecture 3. Features 4. System Interactions 5. Master Operation 6. Fault Tolerance) • Measurements • Conclusion

  6. GFS Design 1. Design Assumption • Component failures are the norm • Built from large numbers of inexpensive, unreliable commodity components that often fail • Scale up vs. scale out • Problems: application bugs, operating system bugs, human errors, and failures of disks, memory, connectors, networking, and power supplies • Solutions: constant monitoring, error detection, fault tolerance, and automatic recovery • (Image: Google server computer)

  7. GFS Design 1. Design Assumption • Files are HUGE • Multi-GB file sizes are the norm • Parameters for I/O operations and block sizes have to be revisited • File access model: read / append only (not overwriting) • Most reads are sequential • Large streaming reads and small random reads • Data streams are continuously generated by running applications • The workloads also contain many large, sequential writes that append data to files • Appending becomes the focus of performance optimization and atomicity guarantees, while caching data blocks in the client loses its appeal.

  8. GFS Design 1. Design Assumption • Multiple clients concurrently append to the same file • Atomicity with minimal synchronization overhead is essential • High sustained bandwidth is more important than low latency • Co-designing the applications and the file system API benefits the overall system • Increases flexibility.

  9. GFS Design 2. Architecture • GFS Cluster Components • 1. a single master • 2. multiple chunkservers • 3. multiple clients

  10. GFS Design 2. Architecture • GFS Master • Maintains all file system metadata • Namespace, access control info, file-to-chunk mappings, chunk (including replica) locations, etc. • This enables the master to make sophisticated chunk placement and replication decisions using global knowledge • Periodically communicates with chunkservers via HeartBeat messages to give instructions and check state • Needs to minimize its involvement in data operations so it does not become a bottleneck

  11. GFS Design 2. Architecture • GFS Chunkserver • Files are broken into chunks • Each chunk has an immutable, globally unique 64-bit chunk handle • The chunk handle is assigned by the master at chunk creation • Chunk size is 64 MB (fixed-size chunks) • Pros • Reduces interactions between client and master • Reduces network overhead between client and chunkserver • Reduces the size of the metadata stored on the master • Cons • A small file stored in a single chunk can become a hot spot • Each chunk is replicated on three chunkservers by default
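
Because chunks have a fixed 64 MB size, a client can translate a file byte offset into a chunk index locally and ask the master only for that chunk's handle and replica locations. A minimal sketch of that translation (the function name and example offset are illustrative, not from the paper):

```python
# Illustrative sketch (not GFS source): mapping a file byte offset to a
# chunk index and an offset within that chunk, given 64 MB fixed chunks.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

def locate(byte_offset: int) -> tuple[int, int]:
    """Return (chunk_index, offset_within_chunk) for a file byte offset."""
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

# Example: byte 150,000,000 falls in chunk 2, about 15 MB into that chunk.
print(locate(150_000_000))  # (2, 15782272)
```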

  12. GFS Design 2. Architecture • GFS Client • Linked into applications as a library implementing the file system API • Communicates with the master and chunkservers for reading and writing • Master interactions only for metadata • Chunkserver interactions for data • Caches only metadata • File data is too large to cache.

  13. GFS Design 3. Features • Metadata • The master stores three major types of metadata • The file and chunk namespaces • The mapping from files to chunks • The locations of each chunk’s replicas • All metadata is kept in the master’s memory (less than 64 bytes per 64 MB chunk) • For recovery, the first two types are kept persistent by logging mutations to an operation log replicated on remote machines • The master periodically scans its entire metadata state in the background • Used for chunk garbage collection, re-replication after failures, and chunk migration for load balancing
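
A minimal sketch of how those three metadata tables might be laid out in the master's memory; the class and field names are assumptions for illustration, not GFS code. It highlights that only the first two tables are persisted through the operation log, while replica locations are rebuilt by asking chunkservers:

```python
# Illustrative sketch (not GFS source) of the master's in-memory metadata.
from dataclasses import dataclass, field

@dataclass
class MasterMetadata:
    # 1. File and chunk namespaces (persisted via the operation log)
    namespace: set[str] = field(default_factory=set)
    # 2. File -> ordered list of chunk handles (persisted via the operation log)
    file_to_chunks: dict[str, list[int]] = field(default_factory=dict)
    # 3. Chunk handle -> replica locations (NOT persisted: rebuilt by asking
    #    chunkservers at startup and kept fresh via HeartBeat messages)
    chunk_locations: dict[int, list[str]] = field(default_factory=dict)
```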

  14. GFS Design 3. Features • Operation log • A historical record of critical metadata changes • Defines the order of concurrent operations (identified by logical timestamps) • Critical! • Replicated on multiple remote machines • The master responds to a client operation only after flushing the corresponding log record to disk both locally and remotely • Checkpoints • The file system state is recovered by replaying the operation log • The master checkpoints its state whenever the log grows beyond a certain size, so replay starts from the latest checkpoint • Keeps a few older checkpoints and log files to guard against catastrophes
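
A minimal sketch of that write-ahead discipline, assuming hypothetical log-replica and master-state objects and an arbitrary checkpoint threshold; the point is only that the record is durable locally and remotely before the mutation is applied and acknowledged:

```python
# Illustrative sketch (not GFS source) of logging a metadata mutation.
CHECKPOINT_THRESHOLD = 64 * 1024 * 1024  # assumed trigger size, not from the paper

def log_mutation(record: bytes, local_log, remote_replicas, master_state):
    local_log.write(record)
    local_log.flush()                     # flush the record to local disk
    for replica in remote_replicas:
        replica.append(record)            # replicate the record remotely
    master_state.apply(record)            # only now apply and reply to the client
    if local_log.tell() > CHECKPOINT_THRESHOLD:
        master_state.checkpoint()         # compact the log with a checkpoint
```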

  15. GFS Design 4. System Interactions • Mutation • Changes the contents or metadata of a chunk • A write or an append operation • Performed at all of the chunk’s replicas • Uses leases to maintain a consistent mutation order across the replicas • Minimizes management overhead • Lease • Granted by the master to one of the replicas, which becomes the primary • The primary picks a serial order for mutations and all replicas follow it • Global mutation order is defined first by the lease grant order chosen by the master • Within a lease, by the serial numbers assigned by the primary • 60-second timeout, can be extended • Can be revoked
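
A minimal sketch of such a lease record, assuming a hypothetical Lease class; the 60-second timeout and the HeartBeat-piggybacked extension follow the slide, everything else is illustrative:

```python
# Illustrative sketch (not GFS source) of a chunk lease held by the primary.
import time

LEASE_TIMEOUT = 60.0  # seconds, extendable while mutations continue

class Lease:
    def __init__(self, chunk_handle: int, primary: str):
        self.chunk_handle = chunk_handle
        self.primary = primary
        self.expires = time.monotonic() + LEASE_TIMEOUT

    def valid(self) -> bool:
        return time.monotonic() < self.expires

    def extend(self) -> None:
        # Extension requests and grants are piggybacked on HeartBeat messages.
        self.expires = time.monotonic() + LEASE_TIMEOUT
```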

  16. GFS Design 4. System Interactions ※ Data write • Client • Requests a new file to write (1) • Master • Adds the file to the namespace • Selects 3 chunkservers • Designates a primary replica and grants it a lease • Replies to the client (2) • Client • Sends data to all replicas (3) • Notifies the primary when sent (4) • Primary • Writes the data in order • Increments the chunk version • Sequences the secondary writes (5) • Secondaries • Write data in the sequence order • Increment the chunk version • Notify the primary when the write is finished (6) • Primary • Notifies the client when the write is finished (7)
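
A minimal sketch of the client side of steps (1)-(7), with every object and method name assumed for illustration; it separates the data push in step (3) from the control handoff to the primary in steps (4)-(7):

```python
# Illustrative sketch (not GFS source) of the write flow numbered (1)-(7).
def client_write(master, data: bytes):
    # (1)-(2) Ask the master which replicas hold the chunk and which one
    # holds the lease (the primary); the client caches this answer.
    primary, secondaries = master.find_lease_holder("example-chunk-handle")
    # (3) Push the data to all replicas; each chunkserver buffers it until
    # it is used or aged out.
    for server in [primary, *secondaries]:
        server.push_data(data)
    # (4) Ask the primary to commit once all replicas have the data.
    # (5) The primary assigns the mutation a serial number and forwards the
    #     request to the secondaries, which apply it in that same order.
    # (6) The secondaries report completion back to the primary.
    # (7) The primary replies to the client (errors cause a client retry).
    return primary.commit_write()
```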

  17. GFS Design 4. System Interactions • Atomic record append operation • The client specifies only the data • In a traditional write, the client specifies the offset at which data is to be written • GFS appends the record at least once atomically • As one continuous sequence of bytes • Returns an offset of GFS’s choosing to the client • Snapshot • Makes a copy of a file or a directory tree almost instantaneously, while minimizing interruptions • Steps • Revokes outstanding leases • Duplicates the metadata, pointing to the same chunks • When a client first wants to write to a chunk after the snapshot operation, the chunkserver creates a real duplicate locally (copy-on-write)
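
A minimal sketch of the primary's record-append decision, assuming a hypothetical chunk object; it shows the two outcomes the slide describes: either the record lands at an offset GFS chooses, or the chunk is padded and the client retries on the next chunk:

```python
# Illustrative sketch (not GFS source) of record-append at the primary.
CHUNK_SIZE = 64 * 1024 * 1024

def record_append(chunk, record: bytes):
    if chunk.used + len(record) > CHUNK_SIZE:
        chunk.pad_to_end()             # all replicas pad the remaining space
        return None                    # client retries on the next chunk
    offset = chunk.used                # GFS, not the client, picks the offset
    chunk.write_at(offset, record)     # applied at all replicas at this offset
    chunk.used += len(record)
    return offset                      # offset is returned to the client
```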

  18. GFS Design 5. Master Operation • New chunk creation policy • Chunks are created and placed by the master • New replicas go to chunkservers with below-average disk utilization • Limit the number of “recent” creations on each chunkserver • Spread replicas of a chunk across racks • A chunk is re-replicated as soon as the number of available replicas falls below a user-specified goal • Re-replication happens, for example, when a chunkserver becomes unavailable • Rebalancing • Periodically rebalance replicas for better disk space usage and load balancing • A new chunkserver is gradually filled up
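
A minimal sketch of the creation-time placement policy described above, with an assumed scoring rule and an arbitrary cap on recent creations; it is not the master's actual algorithm:

```python
# Illustrative sketch (not GFS source) of choosing chunkservers for new replicas.
def place_new_replicas(chunkservers, n=3, recent_cap=5):
    avg_util = sum(s.disk_utilization for s in chunkservers) / len(chunkservers)
    # Prefer below-average disk utilization and few recent chunk creations.
    candidates = [s for s in chunkservers
                  if s.disk_utilization <= avg_util and s.recent_creations < recent_cap]
    chosen, racks_used = [], set()
    for s in sorted(candidates, key=lambda s: s.disk_utilization):
        if s.rack not in racks_used:        # spread replicas across racks
            chosen.append(s)
            racks_used.add(s.rack)
        if len(chosen) == n:
            break
    return chosen
```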

  19. GFS Design 5. Master Operation • Garbage collection • When a client deletes a file, the master logs the deletion like other changes and renames the file to a hidden name • The master removes hidden files older than 3 days during its regular scan of the file system namespace • Their metadata is erased at that point • In HeartBeat messages, each chunkserver reports a subset of the chunks it holds, and the master replies with the chunks that no longer exist in its metadata • The chunkserver is then free to delete those replicas on its own
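
A minimal sketch of this lazy-deletion flow, assuming the hidden name encodes the deletion time and that the namespace is a simple dictionary; both are illustrative choices, not GFS internals:

```python
# Illustrative sketch (not GFS source) of lazy file deletion and namespace scan.
import time

GRACE_PERIOD = 3 * 24 * 3600  # hidden files older than 3 days are reclaimed

def delete_file(namespace: dict, path: str) -> None:
    # Deletion is just a rename to a hidden name stamped with the delete time.
    namespace[f".deleted.{path}.{int(time.time())}"] = namespace.pop(path)

def scan_namespace(namespace: dict) -> None:
    now = time.time()
    for hidden in [n for n in namespace if n.startswith(".deleted.")]:
        deleted_at = int(hidden.rsplit(".", 1)[1])
        if now - deleted_at > GRACE_PERIOD:
            del namespace[hidden]   # metadata erased; orphaned chunk replicas
                                    # are reclaimed later via HeartBeat replies
```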

  20. GFS Design 6. Fault Tolerance • High Availability • Fast recovery • The master and chunkservers can restart in seconds • Chunk Replication • Multiple chunkservers on different racks • Master Replication • The log and checkpoints are replicated • Master failures? • Monitoring infrastructure outside GFS starts a new master process • Clients reach it via a DNS alias

  21. GFS Design 6. Fault Tolerance • Data Integrity • Uses checksums to detect data corruption • A chunk is broken up into 64 KB blocks, each with a 32-bit checksum • No error propagation • For reads, the chunkserver verifies the checksum before returning data • If a block does not match the recorded checksum, the chunkserver returns an error • The requestor then reads from another replica • Record append • Incrementally updates the checksum for the last block • Errors are detected when the block is read • During idle periods, chunkservers scan and verify the contents of inactive chunks
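
A minimal sketch of block-level checksumming and read-time verification, using zlib.crc32 as a stand-in for whatever 32-bit checksum GFS actually uses:

```python
# Illustrative sketch (not GFS source): 64 KB blocks, one 32-bit checksum each.
import zlib

BLOCK_SIZE = 64 * 1024

def build_checksums(chunk_bytes: bytes) -> list[int]:
    return [zlib.crc32(chunk_bytes[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_bytes), BLOCK_SIZE)]

def verified_read(chunk_bytes: bytes, checksums: list[int], block: int) -> bytes:
    data = chunk_bytes[block * BLOCK_SIZE:(block + 1) * BLOCK_SIZE]
    if zlib.crc32(data) != checksums[block]:
        # The chunkserver reports an error; the requestor reads another replica.
        raise IOError("checksum mismatch on block %d" % block)
    return data
```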

  22. GFS Design 6. Fault Tolerance • Master Failure • Operations log • Persistent record of changes to master metadata • Used to replay events on failure • Replicated to multiple machines for recovery • Flushed to disk before responding to the client • The master state is checkpointed at intervals to keep the operations log file small • Master recovery requires • The latest checkpoint file • The subsequent operations log • Master recovery was initially a manual operation • Then automated outside of GFS to within 2 minutes • Now down to tens of seconds
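
A minimal sketch of that recovery sequence, assuming hypothetical checkpoint and state objects; the key property is that only the log suffix written after the latest checkpoint has to be replayed:

```python
# Illustrative sketch (not GFS source) of master recovery from checkpoint + log.
def recover_master(latest_checkpoint, log_records_after_checkpoint):
    state = latest_checkpoint.load()        # load the compact checkpointed state
    for record in log_records_after_checkpoint:
        state.apply(record)                 # replay only the log suffix
    return state                            # recovery stays fast because the
                                            # replayed suffix is kept small
```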

  23. GFS Design 6. Fault Tolerance • Chunk Server Failure • Heartbeats are sent from chunkservers to the master • The master detects chunkserver failure • If a chunkserver goes down: • The chunk replica count is decremented on the master • The master re-replicates missing chunks as needed • 3 chunk replicas is the default (may vary) • Priority for chunks with lower replica counts • Priority for chunks that are blocking clients • Re-replication is throttled per cluster and per chunkserver • No difference between normal and abnormal termination • Chunkservers are routinely killed for maintenance
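
A minimal sketch of how re-replication work might be ordered, with an assumed priority rule: the larger the replica deficit, the sooner the chunk is copied, and chunks blocking client progress are boosted:

```python
# Illustrative sketch (not GFS source) of ordering chunks for re-replication.
def rereplication_order(chunks, goal=3):
    needy = [c for c in chunks if c.live_replicas < goal]
    # Sort by replica deficit, breaking ties in favor of chunks that are
    # currently blocking client progress; throttling would be applied on top.
    return sorted(needy,
                  key=lambda c: (goal - c.live_replicas, c.blocking_client),
                  reverse=True)
```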

  24. Contents • Introduction • GFS Design • Measurements • Conclusion

  25. Measurements (1/5) • Micro-benchmarks • The GFS cluster consists of • 1 master, 2 master replicas • 16 chunkservers • 16 clients • Machines are configured with • Dual 1.4 GHz Pentium III processors • 2 GB of RAM • Two 80 GB 5,400 RPM disks • 100 Mbps full-duplex Ethernet connection to an HP 2524 switch • The two switches are connected with a 1 Gbps link.

  26. Measurements (2/5) • Micro-benchmarks • Cluster A: • Used by over a hundred engineers • A typical task is initiated by a user and runs for a few hours • The task reads MBs to TBs of data, transforms/analyzes the data, and writes the results back • Cluster B: • Used for production data processing • A typical task runs much longer than a Cluster A task • Continuously generates and processes multi-TB data sets • Human users are rarely involved • Both clusters had been running for about a week when the measurements were taken.

  27. Measurements (3/5) • Micro-benchmarks • Many computers in each cluster • On average, Cluster B’s file size is triple Cluster A’s file size • Metadata at the chunkservers: • Chunk checksums • Chunk version numbers • Metadata at the master is small (48 MB and 60 MB, respectively) • So the master recovers from a crash within seconds.

  28. Measurements (4/5) • Micro-benchmarks (performance metrics for two GFS clusters) • More reads than writes • Both clusters were in the middle of heavy read activity • Cluster B was in the middle of a burst of write activity • In both clusters, the master was receiving 200-500 operations per second, so the master is not a bottleneck • Killed a single chunkserver in B • 15,000 chunks containing 600 GB of data • All chunks were restored in 23.2 minutes, at an effective replication rate of 440 MB/s

  29. Measurements (5/5) • Micro-benchmarks • Chunkserver workload • Bimodal distribution of small and large files • Ratio of write to append operations: 3:1 to 8:1 • Virtually no overwrites • Master workload • Most requests are for chunk locations and file opens • Reads achieve 75% of the network limit • Writes achieve 50% of the network limit

  30. Contents • Introduction • GFS Design • Measurements • Conclusion

  31. Conclusion • GFS demonstrates how to support large-scale processing workloads on commodity hardware • GFS has different points in the design space • Component failures as the norm • Optimize for huge files • Most files are mutated by appending new data • GFS provides fault tolerance • Constant monitoring • Replicating data • Fast and automatic recovery • Checksumming • High aggregate throughput
