Pastis: a peer-to-peer file system for persistent large-scale storage
Jean-Michel Busca, Fabio Picconi, Pierre Sens
LIP6, Université Paris 6 – CNRS, Paris, France
INRIA, Rocquencourt, France
JTE HPC/FS
Outline
• DHT-based File Systems
• Pastis
• Performance evaluation
Distributed file systems
[Figure: file systems compared by architecture and scalability (number of nodes)]
* uses a Distributed Hash Table (DHT) to store data
Distributed Hash Tables
[Figure: DHT nodes with identifiers 91, 40, 18, 75, 66, 52, 32, 24 and 83]
DHTs
[Figure: overlay network of nodes 91, 40, 18, 75, 66, 52, 32, 24 and 83 in a logical address space; the nodes are physically located in Asia, Europe, North America, South America and Australia]
High latency and low bandwidth between logical neighbors
Insertion of blocks in DHT
[Figure: put(8959, block) is routed through the address space (nodes 04F2, 3A79, 5230, 834B, 8909, 8954, 8957, 895D, 8BB2, AC78, C52A, E25A) to the root of key 8959; the block is stored and a first replica is created]
PAST: Storage System
• PAST: Cooperative, archival file storage and distribution
• Layered on top of Pastry
• Goals:
  • Strong persistence of the data
  • High availability
  • Scalability of the system
  • Reduced cost (no backup)
  • Efficient use of pooled resources
Insertion of blocks in DHT (cont.)
[Figure: same put(8959, block) diagram, with the block stored at the root of key 8959 and two replicas shown]
Insertion of blocks in DHT (cont.)
[Figure: get(8959, block) is routed through the same address space toward key 8959, where the block and its replicas are stored]
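The put/get figures above can be approximated with a few lines of Python. This is only a sketch under stated assumptions: the class name ToyDHT, the 16-bit circular identifier space, and the rule "root = node numerically closest to the key, replicas = next closest nodes" are illustrative choices, and hop-by-hop Pastry routing is skipped entirely. The node identifiers are the ones visible in the figure.

```python
SPACE = 1 << 16  # assumption: 16-bit circular ID space (4 hex digits, as in the figure)

def distance(a, b):
    """Distance between two identifiers on the circular address space."""
    d = abs(a - b)
    return min(d, SPACE - d)

class ToyDHT:
    """Hypothetical in-memory stand-in for a DHT; no network, no routing hops."""

    def __init__(self, node_ids, replicas=3):
        self.stores = {n: {} for n in node_ids}  # node ID -> local block store
        self.replicas = replicas

    def replica_set(self, key):
        # The nodes closest to the key hold the block; the first one is the root.
        return sorted(self.stores, key=lambda n: distance(n, key))[:self.replicas]

    def root(self, key):
        return self.replica_set(key)[0]

    def put(self, key, block):
        for node in self.replica_set(key):
            self.stores[node][key] = block

    def get(self, key):
        # Any member of the replica set can answer; return None if nobody has it.
        for node in self.replica_set(key):
            if key in self.stores[node]:
                return self.stores[node][key]
        return None

NODES = [0x04F2, 0x3A79, 0x5230, 0x834B, 0x8909, 0x8954,
         0x8957, 0x895D, 0x8BB2, 0xAC78, 0xC52A, 0xE25A]
```

With these identifiers, ToyDHT(NODES).root(0x8959) is node 8957, since it is the numerically closest node to the key.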
P2P File systems architecture
FS layer (Ivy / Pastis): open(), read(), write(), close(), etc.
• files and directories
• read-write access semantics
• security and access control
Interface between the layers: block = get(key), put(key, block)
DHT layer (DHash / Past)
• block store (DHT): scalability, fault-tolerance, self-organization
• message routing
DHT-based file systems
• Ivy [OSDI’02]
  • log-based, one log per user
  • fast writes, slow reads
  • limited to small number of users
• Oceanstore [FAST’03]
  • updates serialized by primary replicas
  • partially centralized system
  • BFT agreement protocol requires well-connected primary replicas
[Figure: Ivy stores one log per user (User A’s log, User B’s log, User C’s log) as DHT objects; Oceanstore distinguishes primary replicas from secondary replicas]
Pastis
Pastis design
Design goals
• simple
• completely decentralized
• scalable (network size and number of users)
[Figure: layer stack. Pastis FS on top, calling put(key, block) and block = get(key) on the Past storage DHT, which runs on Pastry routing]
Pastis data structures
Data structures similar to the Unix file system
• inodes are stored in modifiable DHT blocks (UCBs)
• file contents are stored in immutable DHT blocks (CHBs)
[Figure: the inode key locates the file inode (a UCB holding metadata and block addresses); the block addresses locate the file contents (CHB1, CHB2); each block is stored on its replica set in the DHT address space]
Pastis data structures (cont.)
• directories contain <file name, inode key> entries
• use indirect blocks for large files
[Figure: the directory inode (UCB: metadata, block addresses) points to a CHB holding the entries file1, key1; file2, key2; …; key1 locates the file1 inode (UCB: metadata, block addresses), which points to file contents CHBs directly and through an indirect block CHB; CHBs holding old contents are also shown]
Content Hash Block (CHB)
block key = Hash( block contents )
• block has to be immutable
Solution to check and prevent modification
• block contents determine block key
• can detect if block is modified
[Figure: a CHB is a data block holding the block contents]
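A minimal sketch of the CHB rule above, assuming SHA-1 as the hash function (the slides only say "Hash") and a plain dict standing in for the DHT. The function names chb_key, put_chb and get_chb are hypothetical.

```python
import hashlib

def chb_key(contents: bytes) -> str:
    """Block key = Hash(block contents). SHA-1 is an assumption of this sketch."""
    return hashlib.sha1(contents).hexdigest()

def put_chb(store: dict, contents: bytes) -> str:
    """Insert an immutable block; the key is derived from the contents."""
    key = chb_key(contents)
    store[key] = contents
    return key

def get_chb(store: dict, key: str) -> bytes:
    """Fetch a block and reject it if the contents no longer match the key."""
    contents = store[key]
    if chb_key(contents) != key:
        raise ValueError("CHB integrity check failed")
    return contents
```

Because the key is recomputed on every read, a storage node that alters the contents cannot go undetected; changing a file therefore means inserting a new block under a new key.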
User Certificate Blocks (UCBs)
block key = Hash( KBpub )
UCBs are modifiable by the block owner.
Question: how to check that the file is modified only by its owner?
Protocol
• a key pair (KBpub, KBpriv) is associated with each block
• the owner signs the block using KBpriv
Authentication
• verify the signature of the UCB using KBpub
[Figure: a UCB holds the inode contents, a timestamp and sign(KBpriv)]
UCBs: Multiple User Edits
block key = Hash( KBpub )
Multiple users should be able to edit a block without sharing its private key.
• (KBpub, KBpriv) associated with each block
• (KUpub, KUpriv) associated with each user
Certificate
• grants write access to a given user (identified by KUpub)
• issued by the file owner
• expiration date allows access revocation
Authentication
• verify the signature of the certificate using the storage key (KBpub)
• verify the signature of the UCB using KUpub
[Figure: a UCB holds KBpub, a certificate (KUpub, expiration date, sign(KBpriv)), the inode contents, a timestamp and sign(KUpriv)]
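The two-step check above can be sketched as follows. Everything here is an assumption made for illustration: the slides do not name a signature scheme, so the sketch uses toy textbook RSA over small Mersenne primes (NOT secure, standard library only), SHA-1 for the block key, and hypothetical helper names and field layouts. Only the order of the checks comes from the slide.

```python
import hashlib

E = 65537  # public exponent used by every toy key pair

def make_keypair(p, q):
    """Toy RSA key pair from two primes. Illustration only, NOT secure."""
    n, phi = p * q, (p - 1) * (q - 1)
    return (n, E), (n, pow(E, -1, phi))  # (public key, private key)

def digest(message, n):
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % n

def sign(priv, message):
    n, d = priv
    return pow(digest(message, n), d, n)

def verify(pub, message, signature):
    n, e = pub
    return pow(signature, e, n) == digest(message, n)

def block_key(kb_pub):
    """Block key = Hash(KBpub); SHA-1 over repr() is an assumption."""
    return hashlib.sha1(repr(kb_pub).encode()).hexdigest()

def make_certificate(kb_priv, ku_pub, expiration):
    """The owner grants write access to user KUpub until `expiration`."""
    body = repr((ku_pub, expiration)).encode()
    return {"ku_pub": ku_pub, "expiration": expiration,
            "sig": sign(kb_priv, body)}

def make_ucb(kb_pub, cert, ku_priv, inode, timestamp):
    """The certified user signs the inode contents and timestamp."""
    body = repr((inode, timestamp)).encode()
    return {"kb_pub": kb_pub, "cert": cert, "inode": inode,
            "timestamp": timestamp, "sig": sign(ku_priv, body)}

def ucb_is_valid(key, ucb, now):
    cert = ucb["cert"]
    cert_body = repr((cert["ku_pub"], cert["expiration"])).encode()
    ucb_body = repr((ucb["inode"], ucb["timestamp"])).encode()
    return (key == block_key(ucb["kb_pub"])                    # key matches KBpub
            and verify(ucb["kb_pub"], cert_body, cert["sig"])  # certificate signed by owner
            and now <= cert["expiration"]                      # certificate not expired
            and verify(cert["ku_pub"], ucb_body, ucb["sig"]))  # UCB signed by the user
```

A storage node or reader only needs the block key and the UCB itself to run ucb_is_valid; neither private key ever leaves its holder. In this sketch the single-owner case corresponds to the owner issuing a certificate for a key pair of their own.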
Pastis – Update handling
File update
• insert the new file contents (CHBs)
• reinsert the file inode (UCB)
• replace data blocks (CHBs)
[Figure: initial state. The directory inode (UCB1: metadata, @CHB1, @CHB2, …) points to the directory contents (CHB1: file1 @UCB2, file2 @UCB3, file3 …); the file inode (UCB2: metadata, @CHB3, …) points to the file contents (CHB3: "foo")]
Pastis – Update handling (cont.)
[Figure: step 1, insert new CHB into the DHT. The new file contents "foo bar" are stored in CHB4; the file inode (UCB2) still points to @CHB3]
Pastis – Update handling (cont.)
[Figure: step 2, update file inode to point to new CHB. The file inode (UCB2) now holds @CHB4 instead of @CHB3]
Pastis – Update handling (cont.)
[Figure: step 3, reinsert inode UCB into the DHT. The updated file inode (UCB2), pointing to @CHB4, is stored again under its unchanged key]
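The three update steps shown in these slides can be sketched in a few lines. This is a simplified illustration, not the Pastis implementation: a dict stands in for the DHT, the inode is an unsigned dict with hypothetical fields "version" and "blocks", the 4-byte block size is arbitrary, and SHA-1 is assumed for CHB keys.

```python
import hashlib

def put_chb(dht, contents):
    """Insert an immutable content-hash block and return its key."""
    key = hashlib.sha1(contents).hexdigest()
    dht[key] = contents
    return key

def update_file(dht, inode_key, new_contents, block_size=4):
    """Rewrite a file: new CHBs first, then the inode UCB under the same key."""
    chunks = [new_contents[i:i + block_size]
              for i in range(0, len(new_contents), block_size)]
    addresses = [put_chb(dht, chunk) for chunk in chunks]  # step 1: insert new CHBs
    inode = dict(dht[inode_key])                           # work on a copy of the inode
    inode["blocks"] = addresses                            # step 2: point to the new CHBs
    inode["version"] += 1
    dht[inode_key] = inode                                 # step 3: reinsert the inode UCB
    return inode

def read_file(dht, inode_key):
    """Follow the inode's block addresses and concatenate the contents."""
    return b"".join(dht[address] for address in dht[inode_key]["blocks"])
```

Inserting the data blocks before the inode means a reader never finds an inode that points to blocks not yet in the DHT. In this sketch the old CHBs simply stay in the store; how and when they are reclaimed is outside its scope.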
Pastis – Consistency
Strict consistency → too expensive, requires too many network accesses
Close-to-open consistency
• open(): returns the latest version of the file committed by close()
• between open() and close(): the user only sees his own updates
• defer writes until the file is closed
[Figure: timeline. Client A: open, read ‘1’, write ‘2’, close. Client B: open, read ‘1’, close, open, read ‘2’. A's write is cached until close (CHBs and inode UCB are stored in a local buffer); at close, write ‘2’ is sent to the network (CHBs and UCB are inserted into the DHT). A “close-to-open” path from A's close to B's second open makes the update visible: B retrieves the inode from the DHT]
Still quite expensive: an open requires retrieving the most up-to-date inode replica
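The close-to-open behaviour in the timeline can be mimicked with a small client class. This is a sketch under simplifying assumptions: a shared dict maps a file key to a (version, contents) pair and stands in for the DHT, whole-file contents replace the CHB/UCB machinery, and the class name CloseToOpenClient is hypothetical.

```python
class CloseToOpenClient:
    """Buffers writes locally and publishes them only on close()."""

    def __init__(self, dht):
        self.dht = dht       # shared store: file key -> (version, contents)
        self.buffers = {}    # file key -> contents as seen by this client
        self.dirty = set()   # file keys written since they were opened

    def open(self, key):
        # An open fetches the latest committed version from the DHT.
        _, contents = self.dht.get(key, (0, b""))
        self.buffers[key] = contents

    def read(self, key):
        # Between open and close the client sees only its own updates.
        return self.buffers[key]

    def write(self, key, contents):
        # The write is cached until close; nothing is sent to the network yet.
        self.buffers[key] = contents
        self.dirty.add(key)

    def close(self, key):
        contents = self.buffers.pop(key)
        if key in self.dirty:
            self.dirty.discard(key)
            version, _ = self.dht.get(key, (0, b""))
            self.dht[key] = (version + 1, contents)  # commit makes the update visible
```

Replaying the slide's scenario, client B keeps reading ‘1’ until it reopens the file after client A has closed it.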
Pastis – Consistency
Read-your-writes consistency
• relaxation of the close-to-open model
• read() must reflect previous local writes only
• writes from other clients may or may not be visible
[Figure: timeline. Client A: open, read ‘1’, close, open, read ‘1’. Client B: open, write ‘2’, close, open, read ‘2’. A's read may not reflect B's writes; B's read must reflect its previous local writes]
An open does not require retrieving the most up-to-date inode replica: it just fetches one inode replica not older than those accessed previously
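The replica-selection rule for read-your-writes can be sketched as follows. The names RywClient, fetch_inode and note_write are hypothetical, and replicas are modelled as (timestamp, inode) pairs listed in the order the client happens to contact them; the only rule taken from the slide is "accept any replica not older than those accessed previously".

```python
class RywClient:
    """Remembers, per inode, the newest timestamp this client has read or written."""

    def __init__(self):
        self.last_seen = {}  # inode key -> newest timestamp accessed so far

    def fetch_inode(self, key, replicas):
        """Return the first replica that is not older than what was seen before."""
        floor = self.last_seen.get(key, 0)
        for timestamp, inode in replicas:
            if timestamp >= floor:
                self.last_seen[key] = timestamp
                return timestamp, inode
        raise LookupError("no sufficiently recent replica reachable")

    def note_write(self, key, timestamp):
        """Record a local write so that later reads cannot go back in time."""
        self.last_seen[key] = max(self.last_seen.get(key, 0), timestamp)

def fetch_latest(replicas):
    """Close-to-open style lookup for comparison: every replica must be examined."""
    return max(replicas, key=lambda replica: replica[0])
```

The saving is that fetch_inode can stop at the first acceptable replica, whereas fetch_latest has to look at all of them; the cost is that a client may be served a stale inode written by someone else.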
Class materials and Bibliography
Everything at http://www-sop.inria.fr/members/Frederic.Giroire/enseignement/p2p/
• Slides
• Paper and technical report: Pastis: a Highly-Scalable Multi-User Peer-to-Peer File System, Busca, Picconi, and Sens, EuroPar 2005.
Evaluation
Evaluation
Prototype
• programmed in Java
• Client interface: NFS, FUSE
• Test program: Andrew Benchmark
  • Phase 1: create subdirectories
  • Phase 2: copy files
  • Phase 3: read file attributes
  • Phase 4: read file contents
  • Phase 5: make command
Emulation
• LAN with one DHT node per machine
• Dummynet router emulates WAN latencies
Simulation
• discrete event simulator: LS3
• simulates overlay network latency
Pastis performance with concurrent clients
[Figure: normalized execution time [sec.]]
Configuration
• 16 DHT nodes
• 100 ms constant inter-node latency (Dummynet)
• 4 replicas per object
• close-to-open consistency
• every user reading and writing to the FS (each running an independent benchmark)
Ivy’s read overhead increases rapidly with the number of users (the client must retrieve the records of more logs)
Pastis consistency models
[Figure: execution time [sec.] per benchmark phase (dirs, write, attr., read, make) for Pastis (close-to-open), Pastis (read-your-writes) and NFSv3; annotation: performance penalty compared to NFS (close-to-open)]
Configuration
• 16 DHT nodes
• 100 ms constant inter-node latency (Dummynet)
• 4 replicas per object
Evaluation: consistency models
N = 32768, sphere topology, max. latency: 300 ms, k = 16
[Figure: results for CTO, RYW, and RYW with 10% stale UCB replicas]
Conclusion
• Pastis
  • simple
  • completely decentralized (cf. Oceanstore)
  • scalable number of users (cf. Ivy)
  • good performance thanks to:
    • PAST-Pastry’s locality properties
    • relaxed consistency models (close-to-open, read-your-writes)
• Future work
  • explore new consistency models
  • flexible replica location
  • evaluation in a wide-area testbed (PlanetLab)
Links
• Pastis: http://regal.lip6.fr/projects/pastis
• Pastry, Past: http://freepastry.rice.edu
• LS3: http://regal.lip6.fr/projects/pastis/ls3
Questions?
Pastis design
[Figure: Pastis FS runs on the Past / Pastry overlay over the Internet; blocks are distributed to a root node and replicated]