
Large Scale Sharing


Presentation Transcript


  1. Large Scale Sharing The Google File System PAST: Storage Management & Caching – Presented by Chi H. Ho

  2. Introduction • A next step from network file systems. • How large? • GFS: • > 1000 storage nodes • > 300 TB disk storage • Hundreds of client machines • PAST: • Internet-scale

  3. The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

  4. Goals • Performance • Scalability • Reliability • Availability • Highly tuned for: • Google’s back-end file service • Workloads: multiple-producer/single-consumer, many-way merging

  5. Assumptions • H/W: inexpensive components that often fail. • Files: modest number of large files. • Reads/Writes: 2 kinds • Large streaming: common case => optimized. • Small random: supported but need not be efficient. • Concurrency: hundreds of concurrent appends. • Performance: high sustained bandwidth is more important than low latency.

  6. Interface • Usual operations: create, delete, open, close, read, and write. • GFS-specific operations: • snapshot: creates a copy of a file or a directory tree at low cost. • record append: allows multiple clients to append to the same file concurrently and atomically.

  7. Architecture

  8. Architecture (figure) The master, the chunkservers, and the clients all run as user-level processes.

  9. Architecture (Files) Files are divided into fixed-size chunks, each replicated on multiple (by default 3) chunkservers as a plain Linux file. Each chunk is identified by an immutable and globally unique chunk handle, assigned by the master at chunk-creation time. Clients read and write chunk data by specifying a <chunk handle, byte range> pair.
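
A minimal sketch (Python, with illustrative names) of how a client might translate a byte offset within a file into the per-chunk byte ranges implied by the <chunk handle, byte range> interface, assuming the default 64 MB chunk size; real chunk handles would come from the master's file-to-chunk mapping, so plain chunk indices stand in for them here:

    CHUNK_SIZE = 64 * 1024 * 1024  # default GFS chunk size: 64 MB

    def to_chunk_requests(file_offset: int, length: int):
        """Split a (file offset, length) read into per-chunk byte ranges."""
        requests = []
        end = file_offset + length
        while file_offset < end:
            chunk_index = file_offset // CHUNK_SIZE      # which chunk of the file
            offset_in_chunk = file_offset % CHUNK_SIZE   # where to start inside it
            n = min(CHUNK_SIZE - offset_in_chunk, end - file_offset)
            requests.append((chunk_index, offset_in_chunk, n))
            file_offset += n
        return requests

    # Example: a 4 MB read straddling the boundary between chunks 0 and 1
    print(to_chunk_requests(62 * 1024 * 1024, 4 * 1024 * 1024))
    # [(0, 65011712, 2097152), (1, 0, 2097152)]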

  10. Architecture (Master) • Maintains metadata: • Namespace • Access control information • Mapping from files to chunks • Current locations of chunks • Controls system-wide activities: • Chunk lease management • Garbage collection • Chunk migration • Communicates with chunkservers via HeartBeat messages

  11. Architecture (Client) • Interacts with the master for metadata. • Communicates directly with chunkservers for data.

  12. Architecture (Notes) • No data cache is needed: Why? • Client: ??? • Chunkservers: ???

  13. Architecture (Notes) • No data cache is needed: Why? • Client: most applications stream through huge files or have working sets too large to be cached. • Chunkservers: already have Linux cache.

  14. Single Master • Bottleneck? • Single point of failure?

  15. Single Master • Bottleneck? No: • Clients never read or write file data through the master; • they only ask the master for chunk locations, • and they prefetch the locations of multiple chunks and cache them. • Single point of failure? No: • The master's state is replicated on multiple machines. • Mutations of the master's state are atomic. • "Shadow" masters temporarily serve reads while the primary master is down.
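
A hypothetical sketch of the client-side behavior described above: ask the master only for chunk locations, request several chunks per round trip, and cache the answers. The master stub, its find_locations call, and the reply format are assumptions for illustration, not the actual GFS API:

    class ChunkLocationCache:
        """Client-side cache of (filename, chunk index) -> replica locations."""

        def __init__(self, master, prefetch=8):
            self.master = master      # stand-in for the real master RPC stub
            self.prefetch = prefetch  # extra consecutive chunks to ask for at once
            self.cache = {}           # (filename, chunk_index) -> [chunkserver addrs]

        def locate(self, filename, chunk_index):
            key = (filename, chunk_index)
            if key not in self.cache:
                # One master round trip covers several consecutive chunks, which is
                # part of why the single master does not become a bottleneck.
                reply = self.master.find_locations(filename, chunk_index, self.prefetch)
                self.cache.update(reply)  # assumed reply: {(filename, idx): [addrs]}
            return self.cache[key]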

  16. Chunk Size • Large: 64 MB. • Advantages: • Reduces client-master interaction. • Reduces network overhead (a persistent TCP connection to a chunkserver can be reused). • Reduces the amount of metadata, so it can be kept in the master's memory. • Disadvantage: • Small files (with few chunks) may become hot spots. • Solutions: • Store small files with more replicas. • Let clients read such files from other clients.

  17. Metadata • Three major types, all kept in the master's memory: • file and chunk namespaces, • file-to-chunk mapping, • locations of each chunk's replicas. • Persistence: • Namespaces and the mapping: operation log stored on multiple machines. • Chunk locations: not persisted; polled when the master starts and when chunkservers join, then kept current by HeartBeat messages.

  18. Operation Log • At the heart of GFS: • the only persistent record of metadata, • the logical timeline that orders concurrent operations. • Operations are committed atomically. • Recovery of the master's state is done by replaying the operations in the log.
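
A sketch of recovery by log replay, under an assumed record format of (operation, arguments) tuples; the slide does not specify the real log layout:

    def replay_log(log_records):
        """Rebuild the master's namespace and file-to-chunk mapping from the log."""
        namespace = set()        # file and directory paths
        file_to_chunks = {}      # path -> ordered list of chunk handles

        for op, args in log_records:   # records are replayed in logged order
            if op == "create":
                namespace.add(args["path"])
                file_to_chunks[args["path"]] = []
            elif op == "delete":
                namespace.discard(args["path"])
                file_to_chunks.pop(args["path"], None)
            elif op == "add_chunk":
                file_to_chunks.setdefault(args["path"], []).append(args["chunk_handle"])
        return namespace, file_to_chunks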

  19. Consistency • Metadata: mutations are solely controlled by the master. • Data: regions are consistent after successful mutations: • the same order of mutations is applied on all replicas; • stale replicas (ones that missed mutations) are detected and eliminated. • Terminology: a region is consistent if all clients see the same data regardless of which replica they read; it is defined if, in addition, clients see what the mutation wrote in its entirety.

  20. Leases and Mutation Order • Lease: a chunk-level access-control mechanism granted by the master; the replica holding the lease is the primary. • Global mutation order = lease-grant order, plus serial numbers chosen by the primary (lease holder) within a lease. • A mutation proceeds roughly as follows:
    (1) The client asks the master which chunkserver holds the lease for the chunk.
    (2) The master locates the lease holder (or grants a lease if none exists) and replies with the locations of the primary and secondary replicas; the client caches them.
    (3) The client pushes the data to all replicas; each stores it in an LRU buffer and acknowledges, and the client waits for all acknowledgments.
    (4) The client sends the write request to the primary.
    (5) The primary assigns a serial number to the request and forwards the write request to the secondary replicas.
    (6) The secondaries report back to the primary once they have completed the request.
    (7) The primary replies to the client (the reply may report errors on some replicas).
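
The same flow compressed into a client-side sketch; every call name below is a placeholder for an interaction the paper describes only in prose:

    def write_chunk(master, data, filename, chunk_index):
        # Steps 1-2: ask the master who holds the lease and where the replicas are.
        primary, secondaries = master.get_lease_holder(filename, chunk_index)

        # Step 3: push the data to every replica; each buffers it in an LRU cache and acks.
        for replica in [primary] + secondaries:
            replica.push_data(data)

        # Steps 4-6: send the write request to the primary, which assigns a serial
        # number and forwards the request to the secondaries in that order.
        reply = primary.write(filename, chunk_index)

        # Step 7: the primary's reply may report errors on some replicas;
        # the client retries the mutation in that case.
        return reply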

  21. Special Ops Revisited • Atomic Record Appends • The primary chooses the offset. • Upon failure: pad the failed replica(s), then have the client retry. • Guarantee: the record is appended to the file atomically at least once. • Snapshot • Copy-on-write. • Used to make a copy of a file or directory tree quickly.
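
The at-least-once guarantee comes from a simple client-side retry loop; a sketch with a hypothetical try_append call:

    def record_append(client, filename, record, max_retries=5):
        """Append `record` at an offset chosen by the primary, at least once.

        On failure the failed replica(s) are padded and the client retries, so
        readers must tolerate padding and occasional duplicate records.
        """
        for _ in range(max_retries):
            ok, offset = client.try_append(filename, record)  # hypothetical RPC
            if ok:
                return offset
        raise IOError("record append failed after retries")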

  22. Master Operations • Namespace management and locking, • to support concurrent master operations. • Replica placement, • to avoid correlated failures and to exploit network bandwidth. • Creation, re-replication, rebalancing, • for better disk utilization, load balancing, and fault tolerance. • Garbage collection, • lazy deletion: simple, efficient, and supports undelete. • Stale replica detection, • to identify obsolete replicas so they can be garbage collected.

  23. Fault Tolerance Sum Up • Master fails? • Chunkservers fail? • Disks corrupted? • Network noise?

  24. Micro-benchmarks • Configuration: • 1 master with 2 master replicas • 16 chunkservers • 16 clients • Each machine: dual 1.4 GHz PIII CPUs, 2 GB memory, two 80 GB 5400 rpm disks, full-duplex 100 Mbps NIC. • The server machines and the client machines each hang off their own switch; the two switches are connected by a 1 Gbps link.

  25. Micro-benchmark Tests and Results • Reads: N clients read simultaneously from random locations in a 320 GB file set; each client reads 1 GB in total, 4 MB per read. • Writes: N clients write simultaneously to N distinct files; each client writes 1 GB in total, 1 MB per write. • Record appends: N clients append simultaneously to a single file.

  26. Real-World Clusters • Cluster A: research and development for over 100 engineers. • Typical task: initiated by a human user, runs up to several hours, reads MBs to TBs of data, processes it, and writes the results back. • Cluster B: production data processing. • Tasks are long-lived and continuously generate and process multi-TB data sets, with only occasional human intervention.

  27. Real World Measurements • The table shows: • sustained high throughput, • light workload on the master. • Recovery: • A full recovery of one failed chunkserver takes 23.2 minutes. • Prioritized recovery to a state that can tolerate one more failure takes 2 minutes.

  28. Workload Breakdown

  29. Conclusion • The design is narrowly tailored to Google's applications. • Most of the challenges are in implementation: more a development effort than a research one. • However, GFS is a complete, deployed solution. • Any opinions/comments?

  30. Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility Antony Rowstron, Peter Druschel

  31. What is PAST? • An Internet-based, P2P global storage utility. • An archival storage and content-distribution utility, not a general-purpose file system. • Nodes form a self-organizing overlay network. • Nodes may contribute storage. • Files are inserted and retrieved; each file is identified by a fileID and, optionally, protected by a key. • Files are immutable. • PAST itself has no lookup service; it is built on top of one, such as Pastry.

  32. Goals • Strong persistence, • High availability, • Scalability, • Security.

  33. Background – Pastry • A P2P routing substrate. • Given (fileID, msg), routes msg to the node whose nodeID is numerically closest to fileID. • Routing cost: ceiling(log_{2^b} N) steps. • Eventual delivery is guaranteed unless floor(l/2) nodes with adjacent nodeIDs fail simultaneously. • Per-node routing state: (2^b - 1) * ceiling(log_{2^b} N) + 2l entries, each mapping a nodeID to an IP address. • Recovering from a node failure takes O(log_{2^b} N) messages.
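
Worked numbers for the formulas above, purely as an illustration (b = 4, l = 32, and N = 100,000 nodes are assumed values):

    import math

    b, l, N = 4, 32, 100_000

    steps = math.ceil(math.log(N, 2 ** b))                           # expected routing hops
    entries = (2 ** b - 1) * math.ceil(math.log(N, 2 ** b)) + 2 * l  # per-node state

    print(steps, entries)   # 5 hops, (15 * 5) + 64 = 139 entries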

  34. Pastry – A closer look… • Routing: • forward a message addressed to fileID to a routing-table node whose nodeID shares a longer digit prefix with fileID than the current node's does; • if no such node is known, forward to a node whose nodeID shares an equally long prefix but is numerically closer to fileID. • Other nice properties: fault resilient, self-organizing, scalable, efficient. (The slide's figure shows an example with b = 2, l = 8.)
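
A minimal sketch of the per-hop routing decision, assuming IDs are hex-digit strings (i.e., b = 4) and representing the routing table as a dict keyed by (shared-prefix length, next digit); routing-table maintenance is elided:

    def shared_prefix_len(a: str, b: str) -> int:
        """Number of leading digits two IDs share."""
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    def next_hop(file_id, node_id, routing_table, leaf_set):
        if file_id == node_id:
            return node_id                               # already at the destination
        p = shared_prefix_len(file_id, node_id)
        candidate = routing_table.get((p, file_id[p]))   # node with a longer shared prefix
        if candidate is not None:
            return candidate
        # Fall back to a known node with an equal-length prefix but numerically closer ID.
        return min(leaf_set, key=lambda n: abs(int(n, 16) - int(file_id, 16)))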

  35. PAST Operations • Insert • fileID := SHA-1(filename, public key, salt), which is unique with very high probability. • A file certificate is issued. • The client's quota is charged. • Lookup • Based on fileID. • The node returns the file's contents and its certificate. • Reclaim • The client issues a reclaim certificate for authentication. • The client's quota is credited, double-checked against a reclaim receipt.
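
A sketch of the fileID computation; the exact serialization of the three inputs is not given on the slide, so the simple concatenation below is an assumption:

    import hashlib

    def file_id(filename: str, owner_pub_key: bytes, salt: bytes) -> str:
        """fileID := SHA-1(filename, public key, salt); yields a 160-bit ID."""
        h = hashlib.sha1()
        h.update(filename.encode("utf-8"))
        h.update(owner_pub_key)
        h.update(salt)
        return h.hexdigest()

    print(file_id("report.pdf", b"...owner public key bytes...", b"salt-0"))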

  36. Security Overview • Each node and each user holds a smartcard. • Security model: • it is infeasible to break the cryptosystems; • most nodes are well-behaved; • smartcards cannot be controlled by an attacker. • The smartcards generate the various certificates and receipts that enforce security: • file certificates, reclaim certificates, reclaim receipts, etc.

  37. Storage Management • Assumptions: • Storage capacities of individual nodes differ by no more than two orders of magnitude. • Advertised capacity is the basis for admitting nodes. • Two conflicting responsibilities: • balance the free storage among nodes as utilization grows, • keep k copies of each file on the k nodes whose nodeIDs are closest to its fileID.

  38. I) Load Balancing • What causes load imbalance? Differences in: • the number of files per node (due to the distribution of nodeIDs and fileIDs), • the size distribution of inserted files, • the storage capacity of nodes. • What does the solution aim for? • Blur these differences by redistributing data: • Replica diversion: local scale (relocate a single replica within a leaf set). • File diversion: global scale (relocate all replicas by re-inserting the file under a different fileID).

  39. Replica and File Diversion (flowchart) Notation: SD = size of file D, FN = free space on node N, tpri = primary (replica diversion) threshold, tdiv = diversion threshold. When node N receives a replica of D:
    (1) If SD / FN <= tpri: store D locally, issue a store receipt, and forward D to the other k-1 nodes.
    (2) Otherwise attempt replica diversion: choose N' as the node with the most free space among N's leaf set such that N' is not among the k nodes closest to D's fileID and does not already hold a diverted replica.
    (3) If such an N' exists and SD / FN' <= tdiv: store D on N'; both N and the (k+1)-st closest node keep a pointer to N'.
    (4) Otherwise (no suitable N' exists, or SD / FN' > tdiv): fall back to file diversion, i.e., re-insert the file under a different fileID.
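
The same decision logic as a code sketch; the node objects and their helpers (free_space, in_k_closest, has_diverted_replica, store, store_pointer) are hypothetical:

    def handle_store(node, D, size_D, t_pri, t_div, k):
        """Return 'stored', 'diverted', or 'file_diversion' for replica D at node N."""
        if size_D / node.free_space <= t_pri:
            node.store(D)                                 # common case: store locally
            return "stored"

        # Replica diversion: the leaf-set node with the most free space that is not
        # among the k closest to the fileID and holds no diverted replica already.
        candidates = [n for n in node.leaf_set
                      if not n.in_k_closest(D.file_id, k)
                      and not n.has_diverted_replica(D.file_id)]
        if candidates:
            n_prime = max(candidates, key=lambda n: n.free_space)
            if size_D / n_prime.free_space <= t_div:
                n_prime.store(D)
                node.store_pointer(D.file_id, n_prime)    # N (and the (k+1)-st node) point to N'
                return "diverted"

        # No suitable N': the file is re-inserted under a new fileID (file diversion).
        return "file_diversion"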

  40. II) Maintaining k Replicas • Problem: nodes join and leave. • On joining: • add a pointer to the replaced node (similar to replica diversion), • then gradually migrate the replicas back as a background job. • On leaving: • each affected node picks a new k-th closest node, updates its leaf set, and forwards replicas to it. • Notes: • Under extreme conditions the leaf set is "expanded" to 2l. • It is impossible to maintain k replicas if total storage keeps decreasing.

  41. Optimizations • Storage: file encoding. • E.g. Reed-Solomon encoding: instead of whole-file replicas, store m checksum blocks for every n data blocks, so any n of the n + m fragments reconstruct the data. • Performance: caching. • Goals: reduce client access latency, maximize query throughput, and balance the query load. • Algorithm: GreedyDual-Size (GD-S). • Upon a hit (or insertion) of file d: Hd = c(d) / s(d). • Eviction: evict the file v with the minimum Hv, then subtract Hv from the remaining H values.
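
A minimal sketch of the GD-S policy as stated on the slide, taking the cost c(d) to be a uniform 1 (an assumption):

    class GreedyDualSizeCache:
        """GD-S: H(d) = c(d)/s(d) on hit/insert; evict min-H and subtract it from the rest."""

        def __init__(self, capacity_bytes):
            self.capacity = capacity_bytes
            self.used = 0
            self.entries = {}   # file_id -> (H value, size in bytes)

        def access(self, file_id, size, cost=1.0):
            if file_id in self.entries:            # hit: reset H(d) = c(d)/s(d)
                self.entries[file_id] = (cost / size, size)
                return
            while self.used + size > self.capacity and self.entries:
                victim = min(self.entries, key=lambda f: self.entries[f][0])
                h_min, s = self.entries.pop(victim)
                self.used -= s
                # Age the survivors by subtracting the evicted file's H value.
                self.entries = {f: (h - h_min, sz) for f, (h, sz) in self.entries.items()}
            if size <= self.capacity:              # files larger than the cache are not cached
                self.entries[file_id] = (cost / size, size)
                self.used += size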

  42. Experiments – Setup • Workload 1: • 8 web proxy logs from NLANR, • 4 million entries, • referencing 1,863,055 unique URLs, • 18.7 GB of content, • mean = 10,517 bytes, median = 1,312 bytes, max = 138 MB, min = 0 bytes. • Workload 2: • file names and sizes combined from several file systems, • 2,027,908 files, • 166.6 GB, • mean = 88,233 bytes, median = 4,578 bytes, max = 2.7 GB, min = 0 bytes. • System: • k = 5, b = 4, N = 2250 nodes, • per-node storage contributions drawn from four normal distributions.

  43. Experiment 0 • Disable replica and file diversion: • tpri = 1, • tdiv = 0, • reject an insertion upon the first failure. • Results: • 51.1% of file insertions failed, • storage utilization reached only 60.8%.

  44. Storage Contribution & Leaf Set Size • Experiment: • Workload 1, • tpri = 0.1, • tdiv = 0.05. • Results (see figures): • insertion failures and storage utilization; • larger leaf sets perform better; • d2 is best.

  45. Sensitivity of the Replica Diversion Parameter tpri • Experiment: • Workload 1, • l = 32, • tdiv = 0.05, • tpri varied. • Results: the figures show how the rate of successful insertions and the achieved storage utilization change as tpri varies.

  46. Sensitivity of the File Diversion Parameter tdiv • Experiment: • Workload 1, • l = 32, • tpri = 0.1, • tdiv varied. • Results: the figures show how successful insertions and storage utilization change as tdiv varies; • tpri = 0.1 with tdiv = 0.05 yields the best result.

  47. Diversions • File diversions are negligible as long as storage utilization stays below 83%. • Replica diversions impose acceptable overhead.

  48. Insertion Failures w/ Respect to File Size (figures) • Workload 1: tpri = 0.1, tdiv = 0.05. • Workload 2: tpri = 0.1, tdiv = 0.05.

  49. Experiments – Caching • Even as storage utilization approaches 99% and replica diversions increase the load, caching remains effective, largely because most files are small.

  50. Conclusion • PAST achieves its goals • But: • Application specific • Hard to deploy: what is the incentive for the nodes to contribute storage? • Additional comments?
