
Understanding the Google File System: Design, Architecture, and Data Management

This presentation by Kim Youngjin delves into the intricacies of the Google File System (GFS), outlining its architecture, file handling, and data management strategies. It addresses common component failures and the prevalence of multi-GB files, emphasizing a design that optimally manages chunk replication, metadata, and consistency models. The presentation covers master operations, namespace management, and fault tolerance, illustrating how GFS maintains high availability and data integrity. Key concepts like atomic record appends and network bandwidth optimization are also examined.


Presentation Transcript


  1. The Google File System • Presenter: Kim, Youngjin

  2. Introduction • Component failures are the norm rather than the exception • Multi-GB files are common • Most files are mutated by appending new data rather than overwriting existing data

  3. Interface • Create, delete, open, close, read and write • snapshot • record append

  4. Architecture

  5. Single Master • Simplifies the design • Enables chunk placement and replication decisions using global knowledge • Potential bottleneck -> minimize the master's involvement in reads and writes

  6. Chunk Size • One of the key design parameters • 64MB, much larger than typical file system block sizes • Advantages vs. disadvantages
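The fixed chunk size means a client can translate a file offset into a chunk index with simple arithmetic before asking the master for that chunk's location. A minimal sketch (not actual GFS code, just the arithmetic the 64 MB figure implies):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # the fixed 64 MB chunk size from the slide

def locate(byte_offset: int) -> tuple[int, int]:
    """Return (chunk_index, offset_within_chunk) for a file byte offset."""
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

# A read at byte 150 MB falls 22 MB into the third chunk (index 2):
assert locate(150 * 1024 * 1024) == (2, 22 * 1024 * 1024)
```

One benefit of the large chunk size follows directly: a client working within one chunk needs only a single master lookup for many reads and writes.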

  7. Metadata • Three major types of metadata • The file and chunk namespaces • The mapping from files to chunks • The locations of each chunk's replicas

  8. Metadata (cont'd) • In-memory data structures • Chunk locations • Operation log
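The three metadata types above can be pictured as two in-memory maps on the master. This is a toy sketch using plain dicts (the real master uses compact custom structures); the paths and server names are made up for illustration:

```python
# file path -> list of chunk handles (durable: recorded in the operation log)
namespace = {"/data/log1": ["chunk_a", "chunk_b"]}

# chunk handle -> chunkserver replicas (NOT persisted: the master polls
# chunkservers for this at startup and keeps it fresh via heartbeats)
locations = {
    "chunk_a": ["cs1", "cs2", "cs3"],
    "chunk_b": ["cs2", "cs3", "cs4"],
}

def replicas_for(path: str) -> list[list[str]]:
    """Resolve a file path to the replica sets of its chunks."""
    return [locations[handle] for handle in namespace[path]]
```

Keeping chunk locations out of the operation log is a deliberate simplification: chunkservers are the authority on what they store, so the master never has to keep that map consistent across failures.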

  9. Consistency Model • GFS has a relaxed consistency model • Write: data is written at an application-specified file offset • Record append: data is appended atomically at least once, at an offset of GFS's choosing
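The "at least once" wording has a concrete consequence: because the system picks the offset and clients retry on failure, a record may land twice and readers must tolerate duplicates. A toy single-chunk sketch of that semantic (an assumption for illustration, not GFS's implementation):

```python
class Chunk:
    """Toy chunk that models record-append semantics."""

    def __init__(self) -> None:
        self.records: list[bytes] = []

    def record_append(self, data: bytes) -> int:
        """Append data at an offset the system chooses; return that offset."""
        offset = sum(len(r) for r in self.records)
        self.records.append(data)
        return offset

c = Chunk()
off1 = c.record_append(b"event-1")
off2 = c.record_append(b"event-1")  # a client retry creates a duplicate
assert off1 != off2                 # both copies exist; readers must dedupe
```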

  10. System Interaction • Minimize the master's involvement • Leases and mutation order • The primary replica defines a serial order for mutations
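The lease mechanism keeps the master out of the mutation path: it grants a time-limited lease on a chunk to one replica (the primary), and refuses to hand out a second lease while the first is unexpired. A toy sketch, assuming the paper's 60-second initial lease timeout and explicit clock values for testability:

```python
LEASE_SECONDS = 60.0  # initial lease timeout from the GFS paper

class Master:
    """Toy master that grants at most one unexpired lease per chunk."""

    def __init__(self) -> None:
        self.leases: dict[str, tuple[str, float]] = {}  # chunk -> (primary, expiry)

    def grant_lease(self, chunk: str, replica: str, now: float) -> str:
        primary, expiry = self.leases.get(chunk, ("", 0.0))
        if now < expiry:
            return primary  # an unexpired lease stays with the current primary
        self.leases[chunk] = (replica, now + LEASE_SECONDS)
        return replica

m = Master()
assert m.grant_lease("c1", "cs1", now=0.0) == "cs1"
assert m.grant_lease("c1", "cs2", now=30.0) == "cs1"   # lease still held
assert m.grant_lease("c1", "cs2", now=100.0) == "cs2"  # expired; re-granted
```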

  11. System Interaction(cont’d)

  12. Data Flow • Goal: fully utilize each machine's network bandwidth, avoid network bottlenecks and high-latency links, and minimize the latency to push through all the data
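GFS meets this goal by pushing data linearly along a pipelined chain of chunkservers rather than fanning out from the client; the paper's rough cost model is B/T + R*L for B bytes at link rate T over R replicas with per-hop latency L. A sketch of that arithmetic (the sample numbers are assumptions, not from the slide):

```python
def push_time(num_bytes: int, rate: float, replicas: int, hop_latency: float) -> float:
    """Approximate pipelined-chain transfer time: B/T + R*L."""
    return num_bytes / rate + replicas * hop_latency

# 1 MB over a 100 Mbps (12.5 MB/s) link to 3 replicas at 1 ms per hop
# comes out around 83 ms -- latency, not bandwidth, is the small term:
t = push_time(1_000_000, 12.5e6, replicas=3, hop_latency=0.001)
assert 0.08 < t < 0.09
```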

  13. Atomic Record Appends • Traditional write vs Record append

  14. Snapshot • Makes a copy of a file or a directory tree almost instantaneously • Used to checkpoint the current state so changes can later be committed or rolled back

  15. Master Operation • Goals • Keeping chunks fully replicated • Balancing load across all the chunkservers • Reclaiming unused storage

  16. Namespace Management and Locking • Locks over regions of the namespace ensure proper serialization • A read-write lock per namespace node • Allows concurrent mutations in the same directory
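The per-node locking rule works like this: an operation on /d1/d2/leaf takes read locks on each ancestor directory and a read or write lock on the full path. A sketch that just computes the lock sets (plain Python sets stand in for the real lock table):

```python
def locks_for(path: str, write: bool) -> tuple[set[str], set[str]]:
    """Return (read_locks, write_locks) for an operation on `path`."""
    parts = path.strip("/").split("/")
    # Read-lock every proper ancestor: /d1, /d1/d2, ...
    read_locks = {"/" + "/".join(parts[:i]) for i in range(1, len(parts))}
    leaf = "/" + "/".join(parts)
    if write:
        return read_locks, {leaf}
    read_locks.add(leaf)
    return read_locks, set()

# Creating /home/user/foo write-locks only the leaf, so two file creations
# in the same directory can run concurrently: they share (compatible) read
# locks on /home and /home/user rather than write-locking the directory.
r, w = locks_for("/home/user/foo", write=True)
assert w == {"/home/user/foo"}
assert "/home/user" in r and "/home" in r
```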

  17. Replica Placement • Maximize data reliability and availability, and maximize network bandwidth utilization • Spread chunk replicas across racks

  18. Creation, Re-replication, Rebalancing • Chunk replicas are created for three reasons • Creation • Re-replication • Rebalancing

  19. Garbage Collection • A deleted file is renamed to a hidden name that includes the deletion timestamp • The hidden file is kept for 3 days before reclamation • Orphaned chunks -> garbage • Advantages vs. disadvantages
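The lazy-deletion flow above can be sketched in a few lines. This toy version uses a dict as the namespace and an explicit clock; the hidden-name format is an assumption for illustration (the slide only says the name includes the deletion timestamp):

```python
RETENTION = 3 * 24 * 3600  # the slide's 3-day window, in seconds

files: dict[str, bytes] = {"/data/old": b"..."}

def delete(path: str, now: float) -> None:
    """Deletion is just a rename to a hidden, timestamped name."""
    files[f"{path}.deleted.{int(now)}"] = files.pop(path)

def gc_scan(now: float) -> None:
    """Background scan: reclaim hidden files older than the retention window."""
    for name in list(files):
        if ".deleted." in name:
            ts = int(name.rsplit(".", 1)[1])
            if now - ts > RETENTION:
                del files[name]  # its chunks become orphaned, then reclaimable

delete("/data/old", now=0)
gc_scan(now=2 * 24 * 3600)   # too early: the hidden file survives
assert len(files) == 1
gc_scan(now=4 * 24 * 3600)   # past 3 days: reclaimed
assert len(files) == 0
```

Until the scan runs, the file is still readable under its hidden name and deletion can be undone by renaming it back, which is one of the advantages the slide alludes to.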

  20. Fault Tolerance and Diagnosis • High availability • Fast recovery • Replication • Chunk replication • Master replication

  21. Data Integrity • Checksums are used by each chunkserver to detect corruption of stored data • Kept in memory -> fast lookup / comparison • Optimized for record append
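The paper checksums each 64 KB block of a chunk independently, so a read only has to verify the blocks it touches. A sketch of that scheme, using CRC32 as a stand-in (the slide does not name the actual checksum algorithm):

```python
import zlib

BLOCK = 64 * 1024  # checksum granularity: 64 KB blocks, per the GFS paper

def checksums(chunk: bytes) -> list[int]:
    """One CRC32 per 64 KB block of the chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verify(chunk: bytes, sums: list[int]) -> bool:
    """Recompute and compare: any mismatch means stored data is corrupt."""
    return checksums(chunk) == sums

data = bytes(200 * 1024)  # a 200 KB chunk -> 4 blocks (3 full + 1 partial)
sums = checksums(data)
assert verify(data, sums)

# Flip one byte inside the second block; only that block's checksum changes,
# so the corruption is detected without rescanning the whole chunk.
corrupted = data[:70 * 1024] + b"\x01" + data[70 * 1024 + 1:]
assert not verify(corrupted, sums)
```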
