Google File System

Google File System • 100062142 陳仕融 • 100062118 黃振凱 • 100062124 林佑恩

Outline • Introduction • Design Overview • System Interactions • Master Operation • Fault Tolerance and Diagnosis • Summary

Introduction • A scalable distributed file system • For large distributed data-intensive applications • Fault tolerant while running on cheap hardware • High aggregate performance for a large number of clients

Introduction: GFS in Use • GFS is… • Widely deployed within Google • Provides hundreds of TBs of storage • Used for services, research, and development

Introduction • Observations of data usage in Google's applications • Files are typically large; multi-GB files are common • Access pattern: most writes are appends and most reads are sequential (so caching brings little benefit) • Co-designing the applications and the file system benefits the whole system by increasing flexibility • Component failures are the norm

Design Overview: Assumptions • Cheap hardware often fails; failures must be detected, tolerated, and recovered from • Large files (GB~TB) must be managed efficiently • Reads: large streaming reads and small random reads • Writes: large sequential appends • High sustained bandwidth matters more than low response time

Design Overview: Interface • Synchronization and atomicity are encapsulated in the GFS library • Create • Delete • Open • Close • Read • Write • Snapshot • Record Append

Design Overview: Environment • Linux machines running user-level applications • One master, multiple chunkservers • Files are divided into fixed-size (64 MB) chunks • Each chunk has a 64-bit handle (identifier) • By default, each chunk is stored as 3 replicas • HeartBeat messages: check whether a chunkserver is alive • Neither clients nor chunkservers cache file data
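
Because chunks have a fixed size, a client can turn a byte range of a file into chunk indices with plain arithmetic before it asks the master anything. The following is a minimal sketch of that translation; the function name `chunk_ranges` and its return shape are assumptions made for illustration, not part of the GFS client library.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the fixed chunk size from the slide


def chunk_ranges(offset, length):
    """Split a file byte range into (chunk_index, offset_in_chunk, nbytes) pieces.

    Hypothetical helper: it only illustrates the offset-to-chunk arithmetic.
    """
    pieces = []
    end = offset + length
    while offset < end:
        index = offset // CHUNK_SIZE           # which chunk the byte falls in
        start = offset % CHUNK_SIZE            # position inside that chunk
        nbytes = min(CHUNK_SIZE - start, end - offset)
        pieces.append((index, start, nbytes))
        offset += nbytes
    return pieces
```

A read that straddles a chunk boundary comes back as two pieces, one per chunk, so each piece can be sent to a replica of the right chunk.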

Design Overview: Data Flow

Design Overview: Master • File system metadata • Metadata is small enough to keep in memory, which simplifies the design and improves performance • File namespace • File-to-chunk mapping table • System-wide operations • A GFS cluster has only one master
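
The master's in-memory metadata can be pictured as two lookup tables: pathname to chunk handles, and chunk handle to replica locations. This is a rough sketch under assumptions: the class `MasterMetadata`, its method names, and the counter-based handle allocation are all hypothetical and only show the shape of the data.

```python
import itertools


class MasterMetadata:
    """Toy in-memory metadata tables (hypothetical names, illustration only)."""

    def __init__(self):
        self.files = {}       # full pathname -> ordered list of chunk handles
        self.locations = {}   # chunk handle -> list of chunkserver names
        self._next_handle = itertools.count(1)  # assumption: simple counter

    def add_chunk(self, path, servers):
        # Keep the handle within 64 bits, matching the handle size on the slide.
        handle = next(self._next_handle) & 0xFFFFFFFFFFFFFFFF
        self.files.setdefault(path, []).append(handle)
        self.locations[handle] = list(servers)
        return handle

    def lookup(self, path, chunk_index):
        # What a client asks the master: (file name, chunk index) -> handle + replicas.
        handle = self.files[path][chunk_index]
        return handle, self.locations[handle]
```

After one lookup the client can talk to the chunkservers directly, which keeps the single master off the data path.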

System interaction • Goal: minimize the master's involvement in all operations • Leases and Mutation Order • Data Flow • Atomic Record Appends • Snapshot

System interaction • Leases and Mutation Order • Mutation: an operation (e.g., a write or an append) that changes the contents or metadata of a chunk • Each mutation is performed at all of the chunk's replicas

System interaction • Lease • The master grants a chunk lease to one of the replicas, which becomes the primary • The primary then picks a serial order for all mutations to the chunk (without the master's intervention)
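
The point of the lease is that one replica, the primary, assigns serial numbers and every other replica applies mutations in that order. Below is a minimal sketch of that idea; the classes `Primary` and `Secondary` and their methods are hypothetical, and lease expiry, failures, and the network are left out.

```python
class Secondary:
    """A replica that applies mutations in the order the primary dictates."""

    def __init__(self):
        self.log = []

    def apply(self, serial, mutation):
        self.log.append((serial, mutation))


class Primary:
    """The replica holding the lease; it alone picks the serial order."""

    def __init__(self, secondaries):
        self.secondaries = secondaries
        self.next_serial = 0
        self.log = []

    def mutate(self, mutation):
        serial = self.next_serial
        self.next_serial += 1
        self.log.append((serial, mutation))      # apply locally first
        for replica in self.secondaries:         # then forward in the same order
            replica.apply(serial, mutation)
        return serial
```

Since only the primary hands out serial numbers, all replicas end up with the same mutation order even when several clients write concurrently.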

System interaction

System interaction • Data flow • Fully utilize each machine's network bandwidth: data is pushed linearly along a chain of chunkservers • Avoid network bottlenecks: each machine forwards the data to the closest machine that has not yet received it • Minimize the latency of pushing all the data: the transfer is pipelined over TCP connections
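
The "forward to the closest machine" rule amounts to building the chain greedily, one nearest neighbour at a time. Here is a small sketch of that chain construction; the function `forwarding_chain` and the `distance` callback are assumptions for illustration, and how distance is actually measured is outside this sketch.

```python
def forwarding_chain(client, servers, distance):
    """Order chunkservers so each hop goes to the nearest one not yet reached.

    `distance(a, b)` is a hypothetical callback returning a network distance.
    """
    chain = []
    current = client
    remaining = set(servers)
    while remaining:
        # sorted() makes ties deterministic; min() then picks the closest server.
        nearest = min(sorted(remaining), key=lambda s: distance(current, s))
        chain.append(nearest)
        remaining.remove(nearest)
        current = nearest
    return chain
```

For example, with machines placed on a line at positions client=0, s1=5, s2=1, s3=2 and distance taken as the absolute difference, the chain is s2, s3, s1: the client sends to s2, which forwards to s3, which forwards to s1.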

System interaction • Atomic Record Appends • The client specifies only the data • GFS appends the data to the file at least once, atomically, at an offset of GFS's choosing, and returns that offset to the client • Guarantee: the record is written at least once as an atomic unit, but replicas are not guaranteed to be bytewise identical
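
One way to picture "an offset of GFS's choosing": if the record does not fit in the current chunk, the chunk is padded and the record goes to the start of the next one. The sketch below models a single replica only; the class `AppendOnlyFile` is hypothetical, and the tiny chunk size is chosen just to make the behaviour visible.

```python
class AppendOnlyFile:
    """Single-replica toy model of record append (illustration only)."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.chunks = [bytearray()]

    def record_append(self, record):
        if len(record) > self.chunk_size:
            raise ValueError("record larger than a chunk")
        last = self.chunks[-1]
        if len(last) + len(record) > self.chunk_size:
            # Record would straddle a chunk boundary: pad and start a new chunk.
            last.extend(b"\0" * (self.chunk_size - len(last)))
            last = bytearray()
            self.chunks.append(last)
        offset = (len(self.chunks) - 1) * self.chunk_size + len(last)
        last.extend(record)
        return offset  # the system, not the caller, chose this offset
```

A client that sees a failure simply retries the append, which is why a record can appear more than once and why readers have to tolerate padding and duplicates.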

System interaction • Snapshot • Makes a copy of a file or a directory tree • Snapshot implementation • The master receives a snapshot request • The master revokes outstanding leases on the chunks in the files it is about to snapshot • The master logs the operation • The master duplicates the metadata for the source file or directory tree • Newly created snapshot files point to the same chunks as the source files • A new copy of a chunk is created only when a client first writes to it after the snapshot (copy-on-write)
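
The snapshot steps can be sketched with a reference count per chunk: a snapshot copies only metadata, and a chunk is duplicated the first time someone writes to it while it is still shared. The class `SnapshotMaster` and its methods are hypothetical, and leases, logging, and chunkservers are omitted.

```python
import itertools


class SnapshotMaster:
    """Toy copy-on-write snapshot bookkeeping (illustration only)."""

    def __init__(self):
        self.files = {}      # path -> list of chunk handles
        self.refcount = {}   # chunk handle -> number of files pointing at it
        self._handles = itertools.count(1)

    def create(self, path, nchunks):
        handles = [next(self._handles) for _ in range(nchunks)]
        for h in handles:
            self.refcount[h] = 1
        self.files[path] = handles

    def snapshot(self, src, dst):
        # Only metadata is duplicated; both files now share the same chunks.
        self.files[dst] = list(self.files[src])
        for h in self.files[dst]:
            self.refcount[h] += 1

    def chunk_for_write(self, path, index):
        h = self.files[path][index]
        if self.refcount[h] > 1:
            # Shared chunk: give this file its own copy before writing.
            self.refcount[h] -= 1
            new = next(self._handles)
            self.refcount[new] = 1
            self.files[path][index] = new
            return new
        return h
```

The snapshot itself is cheap because no data moves; the copying cost is paid later and only for chunks that are actually modified.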

Master operation • Namespace management and locking • GFS represents its namespace as a lookup table mapping full pathnames to metadata • Each node in the namespace tree has an associated read-write lock • Locking scheme • No write lock is required on the parent directory, which allows concurrent mutations in the same directory • A read lock on the directory name prevents the directory from being deleted, renamed, or snapshotted
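
The locking scheme can be sketched as a lock set per operation: read locks on every ancestor pathname and a write lock on the full pathname being changed. The helpers `locks_for` and `conflict` below are hypothetical names for illustration; real lock acquisition and ordering are left out.

```python
def locks_for(path):
    """Return (read_locks, write_locks) for mutating `path`.

    Read locks go on every ancestor name, the write lock on the full name.
    """
    parts = path.strip("/").split("/")
    names = ["/" + "/".join(parts[:i + 1]) for i in range(len(parts))]
    return set(names[:-1]), {names[-1]}


def conflict(op_a, op_b):
    """Two operations conflict if one write-locks a name the other locks."""
    read_a, write_a = op_a
    read_b, write_b = op_b
    return bool(write_a & (read_b | write_b)) or bool(write_b & read_a)
```

Creating /home/user/foo and /home/user/bar at the same time does not conflict, because each takes only a read lock on /home/user. Snapshotting /home/user does conflict with both, because it needs a write lock on that name.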

Master operation • Replica placement • Creation, Re-replication, Rebalancing • Chunk replicas are created for three reasons • Chunk creation • Re-replication • Rebalancing

Master operation • Garbage collection • After a file is deleted, GFS does not immediately reclaim the physical storage • The file is renamed to a hidden name that includes the deletion timestamp • Hidden files that have existed for more than three days are removed • Until it is removed, the file can still be read and undeleted • Orphaned chunks • Any replica not known to the master is garbage • Collection is done as part of the master's regular background activities
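
Lazy deletion can be sketched as a rename plus a periodic scan. In the following illustration the class `Namespace`, the `.deleted.<timestamp>` naming pattern, and the `orphans` helper are all assumptions; only the three-day grace period comes from the slide.

```python
GRACE_SECONDS = 3 * 24 * 60 * 60  # three days, as on the slide


class Namespace:
    """Toy namespace with lazy deletion (hypothetical names)."""

    def __init__(self):
        self.files = {}    # path -> list of chunk handles
        self.hidden = {}   # hidden name -> (original path, deletion time, chunks)

    def delete(self, path, now):
        name = f"{path}.deleted.{int(now)}"  # assumption: naming pattern
        self.hidden[name] = (path, now, self.files.pop(path))
        return name

    def undelete(self, name):
        path, _, chunks = self.hidden.pop(name)
        self.files[path] = chunks

    def scan(self, now):
        # Background pass: drop hidden files older than the grace period.
        expired = [n for n, (_, t, _) in self.hidden.items()
                   if now - t > GRACE_SECONDS]
        for n in expired:
            del self.hidden[n]
        return expired


def orphans(reported_handles, known_handles):
    """Chunks a chunkserver reports that the master does not know about."""
    return set(reported_handles) - set(known_handles)
```

Until the scan removes the hidden entry, `undelete` restores the file with its chunks untouched, which is the safety net against accidental deletion.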

Master operation • Stale replica detection • The master maintains a chunk version number for each chunk • The master removes stale replicas in its regular garbage collection • Clients and chunkservers check the version number so that they always access up-to-date data
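
Version numbers make a stale replica easy to spot: the version is bumped whenever a new lease is granted, and a replica that was down at that moment keeps the old number. A minimal sketch follows, with hypothetical function names and replica versions kept in a plain dict.

```python
def grant_lease(master_version, replica_versions, live):
    """Bump the chunk version and record it on every replica that is up.

    `replica_versions` maps chunkserver name -> version; `live` is the set
    of chunkservers reachable right now. Both shapes are assumptions.
    """
    new_version = master_version + 1
    updated = dict(replica_versions)
    for server in live:
        updated[server] = new_version
    return new_version, updated


def stale_replicas(master_version, replica_versions):
    """Replicas whose version lags behind the master's are stale."""
    return sorted(s for s, v in replica_versions.items() if v < master_version)
```

When the missed chunkserver comes back and reports its chunks, the lower version number tells the master that this replica missed mutations and can be garbage collected.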

Fault tolerance and diagnosis • High availability • Fast recovery • Chunk replication • Master replication • Master state is replicated for reliability: its operation log and checkpoints are replicated on multiple machines • Shadow masters: provide read-only access even when the primary master is down

Fault tolerance and diagnosis • Data integrity • Checksum • Each chunkserver independently verifies the integrity of its own copy by maintaining checksums • A chunk is broken up into 64 KB blocks; each block has a corresponding 32-bit checksum • The chunkserver verifies the checksum before returning data • On a mismatch, the requestor reads from another replica, and the master clones a correct replica and then instructs the chunkserver to delete the corrupted one
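
Per-block checksumming can be sketched with a 32-bit CRC over each 64 KB block. Using `zlib.crc32` is an assumption made for this example; the slides only say that each block has a 32-bit checksum, not which function computes it.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB blocks, as on the slide


def block_checksums(chunk):
    """One 32-bit checksum per 64 KB block (CRC-32 chosen for illustration)."""
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]


def corrupted_blocks(chunk, checksums):
    """Indices of blocks whose current checksum no longer matches the stored one."""
    return [i for i, c in enumerate(block_checksums(chunk)) if c != checksums[i]]
```

Because checksums are kept per block, a read only has to verify the blocks it touches, and corruption is pinned down to a 64 KB region rather than the whole 64 MB chunk.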

Fault tolerance and diagnosis • Diagnostic tools • Diagnostic logs that record many significant events and all RPC (Remote Procedure Call) requests and replies

Summary • GFS is widely used within Google as the storage platform for research and development as well as production data processing • The Google File System is no doubt one of the key forces that pushed Google to become the world's top search engine!

Thanks for listening
