Agenda

Taming aggressive replication in the Pangaea wide-area file system
Authors: Yasushi Saito, Christos Karamanolis, Magnus Karlsson, Mallik Mahalingam
Presented by: Ying Chen

Agenda
• Introduction
• Aggressive replication
• A structural overview
• Replica set management
• Propagating updates
• Failure recovery
• System evaluation
• Conclusion
• Q&A

Introduction
What is Pangaea?
Pangaea is a wide-area file system that enables ad-hoc collaboration in multi-national corporations or in distributed groups of users.

Introduction
• Goals
  • Speed
  • Availability and autonomy
  • Network economy
• Key technique
  • Aggressive replication (pervasive replication)
  • Pangaea aggressively creates a replica of a file or directory whenever and wherever it is accessed.
  • Replicas exchange updates among themselves in a peer-to-peer fashion.

Aggressive replication
• Main advantages of aggressive replication
  • Provides fault tolerance, which is stronger for popular files;
  • Hides network latency;
  • Supports disconnected operation by containing a user’s working set in a single server.
• Challenges
  • Keeping track of a large number of files and replicas in a decentralized way;
  • Propagating updates reliably yet efficiently.
How can pervasive replication be implemented?

Aggressive replication
Strategies to implement aggressive replication
• Graph-based replica management
  • Each file has a sparse, yet strongly connected and randomized graph of replicas;
  • The graph is used both to propagate updates and to discover other replicas during replica addition and removal.
• Optimistic replica coordination
  • Updates can be issued on any replica at any time;
  • This maximizes availability at the cost of strong consistency.

A structural overview
• Structure of a server
  • NFS protocol handler: receives requests from applications, updates local replicas, and generates requests for the replication engine.
  • Replication engine: accepts requests from the NFS protocol handler and from the replication engines running on other nodes. It creates, modifies, or removes replicas, and forwards requests to other nodes if necessary.
  • Log module: implements transaction-like semantics for local disk updates via redo logging. The server logs all replica-update operations using this service, allowing them to survive crashes.
  • Membership module: maintains the status of other nodes, including their liveness, available disk space, the locations of root-directory replicas, the list of regions in the system, the set of nodes in each region, and an RTT estimate between every pair of regions.
Figure 1: The structure of the Pangaea server. (Diagram labels: application I/O request, NFS client, kernel, user space, Pangaea server, NFS protocol handler, replication engine, log, membership, inter-node communication.)

A structural overview
• Structure of a file system
• Notes:
  • Node = server;
  • Pangaea replicates data at the granularity of files;
  • Directories are treated as files with special contents;
  • Pangaea has two types of replicas: gold and bronze. Both can be read and written by users at any time, and both run an identical update-propagation protocol. Gold replicas play an additional role in maintaining the hierarchical name space;
  • Each replica stores a backpointer that indicates its location in the file-system name space. A backpointer includes the parent directory’s ID and the file’s name within the directory.
Figure 2: An example of the Pangaea file system. (Diagram labels: /joe, /joe/foo, gold replica, bronze replica, peer edge, downlinks, backpointer.)

Replica set management
• File creation
  • The creation of gold replicas;
  • The creation of backpointers and downlinks.
• Replica addition
  • Find the gold replicas in the directory entry during the name-space lookup;
  • Perform short-cut replica creation to transfer data;
  • Gold replicas act as starting points;
  • Integrate the new copy into the file’s replica graph.
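The replica-addition steps can be sketched as follows. This is a minimal illustration, not Pangaea's code: the function name, the `rtt_ms` table, and the `extra_edges` parameter are hypothetical, and it assumes that short-cut creation copies data from the closest existing replica while the graph edges start from a gold replica.

```python
import random

def add_bronze_replica(graph, gold_nodes, new_node, rtt_ms, rng, extra_edges=1):
    """Add new_node as a bronze replica; return the node it copied data from.

    graph: dict node -> set of neighbour nodes (undirected peer edges).
    gold_nodes: gold replicas found in the parent directory entry.
    rtt_ms: dict node -> assumed round-trip time from new_node (hypothetical).
    """
    existing = sorted(graph)
    # Short-cut creation (assumed): copy contents from the closest replica.
    source = min(existing, key=lambda n: rtt_ms[n])
    # A gold replica is the starting point for joining the graph.
    gold = rng.choice(sorted(gold_nodes))
    others = [n for n in existing if n != gold]
    peers = {gold, *rng.sample(others, min(extra_edges, len(others)))}
    # Integrate the new copy: edges are undirected, so record both ends.
    graph[new_node] = set()
    for peer in peers:
        graph[new_node].add(peer)
        graph[peer].add(new_node)
    return source

graph = {"g1": {"g2", "b1"}, "g2": {"g1"}, "b1": {"g1"}}
source = add_bronze_replica(graph, {"g1", "g2"}, "b2",
                            rtt_ms={"g1": 80, "g2": 120, "b1": 5},
                            rng=random.Random(0))
```

Here the new replica `b2` copies its data from the nearby bronze replica `b1`, but its graph edges always include at least one gold replica.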

Replica set management
• Bronze replica removal
  • The server sends notices to the replica’s graph neighbors;
  • Each neighbor initiates a random walk to establish a replacement edge with another live replica.
• Name-space containment
  • For every replica of a file, its parent directories should also be replicated on the same node;
  • This simplifies conflict resolution for directory operations and supports disconnected operation;
  • But this requirement increases the storage overhead by 1.5% to 25%.
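Bronze replica removal can be sketched as below. This is an illustration under stated assumptions: the graph is an in-memory dict, the walk length `steps` is a hypothetical parameter, and all function names are invented for the example.

```python
import random

def random_walk(graph, start, steps, rng):
    """Follow `steps` random edges from start and return the node reached."""
    current = start
    for _ in range(steps):
        neighbours = sorted(graph[current])
        if not neighbours:
            break
        current = rng.choice(neighbours)
    return current

def remove_bronze_replica(graph, victim, rng, steps=3):
    """Remove victim, then let each former neighbour find a replacement edge."""
    neighbours = graph.pop(victim)
    for n in neighbours:
        graph[n].discard(victim)          # the notice sent to each neighbour
    for n in sorted(neighbours):
        target = random_walk(graph, n, steps, rng)
        if target != n:                   # a walk may end where it started
            graph[n].add(target)
            graph[target].add(n)

def is_connected(graph):
    """Depth-first search from any node; True if every node is reachable."""
    start = next(iter(graph))
    seen, stack = {start}, [start]
    while stack:
        for peer in graph[stack.pop()]:
            if peer not in seen:
                seen.add(peer)
                stack.append(peer)
    return len(seen) == len(graph)

graph = {"A": {"B", "E"}, "B": {"A", "C"}, "C": {"B", "D"},
         "D": {"C", "E"}, "E": {"D", "A"}}
remove_bronze_replica(graph, "C", random.Random(1))
```

The walk only restores edge density; in this sketch connectivity after removal depends on the remaining graph already being connected, which holds for the ring used here.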

Propagating updates
• Optimistic replication brings 3 challenges:
  • Efficient and reliable update propagation;
  • Handling concurrent updates;
  • The lack of strong consistency guarantees.
• Solutions to these challenges:
  • Optimizations for efficient update propagation
  • Conflict resolution
  • Controlling replica divergence

Optimization
• Delta propagation
  • Pangaea propagates only a small, semantic description of the change, called a delta;
  • Each delta carries two timestamps.
• Harbingers
  • A harbinger is a small message that contains only the timestamps of the update;
  • Harbingers are flooded along the graph edges; the update body is sent only when requested by other nodes.
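Both optimizations can be illustrated with a small sketch. It assumes that a delta's two timestamps are the replica version it was computed against (`old_ts`) and the version it produces (`new_ts`), and that a node pulls the update body from whichever neighbour sent it the first harbinger. The class and function names, and the "append text" delta, are hypothetical.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Replica:
    content: str = ""
    ts: int = 0

    def apply_delta(self, old_ts, new_ts, appended):
        """Apply a delta only if it was computed against our current version."""
        if self.ts != old_ts:
            return False          # caller would fetch the full contents instead
        self.content += appended
        self.ts = new_ts
        return True

def flood_harbinger(graph, origin):
    """Flood a harbinger; return {node: neighbour it pulls the body from}."""
    pull_from = {origin: None}
    queue = deque([origin])
    while queue:
        sender = queue.popleft()
        for peer in sorted(graph[sender]):
            if peer not in pull_from:     # first harbinger seen for this update
                pull_from[peer] = sender
                queue.append(peer)
    return pull_from

graph = {"A": {"B", "C"}, "B": {"A", "C", "D"},
         "C": {"A", "B", "D"}, "D": {"B", "C"}}
plan = flood_harbinger(graph, "A")
body_transfers = sum(1 for src in plan.values() if src is not None)
# Upper bound if every node pushed the body on every one of its edges.
naive_transfers = sum(len(peers) for peers in graph.values())

replica = Replica("hello", ts=1)
fresh = replica.apply_delta(old_ts=1, new_ts=2, appended=" world")
stale = replica.apply_delta(old_ts=1, new_ts=3, appended="!")
```

In this four-node graph the body crosses 3 edges instead of up to 10, and the second delta is rejected because it was computed against a version the replica no longer holds.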

Optimization
• Exploiting physical topology
  • Pangaea dynamically builds a spanning tree whose shape closely matches the physical network topology. This can greatly reduce the use of wide-area networks.
(Figure: a replica graph over nodes A to F and the spanning tree built over it.)

Conflict resolution
• Conflicts on the contents of a regular file can be resolved in 2 ways:
  • The “last-writer-wins” rule;
  • Manual resolution by the user.
• Conflicts regarding file attributes or directory entries are more difficult to handle; they fall into 2 categories:
  • Conflict between 2 directory-update operations;
  • Conflict between “rmdir” and any other operation.

Conflict resolution
Example 1: a rename-rename conflict.
Example 2: an rmdir-update conflict.

Conflict resolution
• Solution:
  • Pangaea lets the “child” file have the final say on the conflict resolution, using the “last-writer-wins” rule;
  • Directory operations are implemented as a change to the file’s backpointer(s).
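A rename-rename conflict resolved through the child's backpointer might look like the sketch below. The `(clock, node_id)` timestamp format, the in-memory directory table, and all names are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Backpointer:
    parent_dir: str
    name: str
    ts: tuple        # (clock, node_id): assumed format, node_id breaks ties

def last_writer_wins(a, b):
    """Return the backpointer with the later timestamp."""
    return a if a.ts >= b.ts else b

def reconcile_directories(dirs, file_id, winner):
    """Make directory entries agree with the file's winning backpointer."""
    for entries in dirs.values():
        for name in [n for n, f in entries.items() if f == file_id]:
            del entries[name]
    dirs.setdefault(winner.parent_dir, {})[winner.name] = file_id

# /joe/foo was renamed concurrently to /alice/foo and to /bob/foo.
dirs = {"/joe": {}, "/alice": {"foo": "file-1"}, "/bob": {"foo": "file-1"}}
on_node_a = Backpointer("/alice", "foo", (5, "nodeA"))
on_node_b = Backpointer("/bob", "foo", (5, "nodeB"))
winner = last_writer_wins(on_node_a, on_node_b)
reconcile_directories(dirs, "file-1", winner)
```

Because the file decides where it lives, the directories never have to negotiate with each other: each one simply keeps or drops its entry to match the winning backpointer.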

Failure recovery
• Recovering from temporary failures
  • The majority of failures are temporary;
  • The goal is to reduce the recovery cost;
  • A node retries logged updates upon reboot or after it detects another node’s recovery.
• Recovering from permanent failures
  • The goal is to clean up all data structures associated with the failed node so that the system runs as if the node had never existed in the first place;
  • Permanent failures are handled by a garbage collection module.
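Retrying logged updates after a temporary failure can be sketched as follows. The in-memory dict stands in for the on-disk redo log, and the class, its methods, and the `send` callback are hypothetical.

```python
class UpdateLog:
    """Tracks updates that have not yet been delivered to each peer."""

    def __init__(self):
        self.pending = {}                 # (update_id, peer) -> payload

    def record(self, update_id, peers, payload):
        for peer in peers:
            self.pending[(update_id, peer)] = payload

    def retry(self, send, only_peer=None):
        """Resend pending updates; drop each one that send() reports delivered."""
        delivered = 0
        for (update_id, peer), payload in list(self.pending.items()):
            if only_peer is not None and peer != only_peer:
                continue
            if send(peer, update_id, payload):
                del self.pending[(update_id, peer)]
                delivered += 1
        return delivered

up = {"B": True, "C": False}              # C is temporarily down
received = []

def send(peer, update_id, payload):
    if up[peer]:
        received.append((peer, update_id))
    return up[peer]

log = UpdateLog()
log.record(1, ["B", "C"], "delta-1")
first = log.retry(send)                   # only B is reachable
up["C"] = True                            # C's recovery is detected
second = log.retry(send, only_peer="C")
```

Only the entries for the recovered node are replayed, which keeps the recovery cost proportional to what that node actually missed.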

System evaluation
Figure: Performance of a personal workload in WANs.

System evaluation
Figure: The average time needed to read a new file in a collaborative environment.

System evaluation
Figure: Availability analysis using a file-system trace.

Conclusions
• Pangaea is a wide-area file system; it assumes trusted servers.
• 3 design principles:
  • Pervasive replication to provide low access latency and high availability;
  • Randomized graph-based replica management that adapts to changes in the system and conserves WAN bandwidth;
  • Optimistic consistency that allows users to access data at any time, from anywhere.
• In heterogeneous environments, Pangaea outperforms existing systems in 3 aspects: access latency, efficient usage of WAN bandwidth, and file availability.

Q & A
• Pangaea shares many goals (decentralization, availability, and autonomy) with recent p2p data-sharing systems such as PAST. These p2p systems build flat distributed tables using randomization techniques. Could Pangaea also use this method?
No. Pangaea must maintain a graph of replicas explicitly, because in Pangaea:
  • Replicas are placed by user activity, not by randomization;
  • Files are updated frequently and are structured hierarchically.

Q & A
• Why does the harbinger algorithm shrink the effective window of replica inconsistency?
The harbinger-propagation delay is independent of the actual update size, so the chance of a user seeing stale file contents is greatly reduced.

Q & A
• Conflict resolution using backpointers requires that each file can perform a (local or remote) update to a replica of the directory that the backpointer refers to. One approach, adopted in Pangaea's earlier implementation, is to embed pointers to (some of) the replicas of the parent directory in the backpointer and modify the parent directory using remote procedure calls. What is the problem with this design?
This design turned out to be unwieldy: the backpointer is used to initiate a change in the directory, but its directory links must be changed when the directory’s replica set changes. Because of this circular control structure, the information in the backpointer and the parent directory could not easily be kept properly synchronized.

Thank you!
