GMount: An Ad-hoc and Locality-Aware Distributed File System by using SSH and FUSE
Graduate School of Information Science and Technology, The University of Tokyo
Nan Dun, Kenjiro Taura, Akinori Yonezawa
CCGrid 2009, Shanghai, China
Today You May Have
• Computing resources across different administration domains
  • InTrigger (JP), Tsubame (JP), T2K-Tokyo (JP)
  • Grid5000 (FR), D-Grid (DE), INFN Grid (IT), National Grid Services (UK)
  • Open Science Grid (US)
• Workloads to run on all the available resources
  • Finding supernovae
  • Gene decoding
  • Weather simulation, etc.
Scenario I
How to share your data among arbitrary machines across different domains?
Ways of Sharing
• Option 1: Staging your data
  • Too troublesome: SCP, FTP, GridFTP, etc.
• Option 2: Conventional DFSs
  • Ask your administrators!
  • Which one? NFS, OpenAFS, PVFS, GPFS, Lustre, GoogleFS, Gfarm?
  • Only for you? Believe me, they won't do so
  • Quota, security, policy? Headaches…
  • Configuration and installation take time, even if the admins are supposed to handle them...
• Option 3: GMount
  • Build a DFS by yourself, on the fly!
Scenario II
You have many clients/resources, and you want more servers
Ways of Scaling
• Option 1: Conventional DFSs
  • File servers are fixed at deploy time
    • Fixed number of MDS (metadata servers)
    • Fixed number of DSS (data storage servers)
  • Ask your administrators again to append more DSS!
• Option 2: GMount
  • No metadata server
  • File servers scale with the clients
    • As long as you add more nodes, you get more DSS
  • Especially beneficial if your workloads perform a large amount of local writes
Scenario III
What happens when clients access nearby files in a wide-area environment?
File Lookup in Wide-Area
• High latency: DFSs with a central MDS
  • The central MDS is far away from some clients
• Locality-aware: GMount
  • Search nearby nodes first
  • Send a high-latency message only if the target file cannot be found locally
Impression of Usage
• Prerequisites
  • You can log in to some nodes via SSH
  • Each node has an export directory containing the data you want to share
  • Specify a mountpoint via which the DFS can be accessed
    • Simply make an empty directory on each node
Impression of Usage
• Just one command and you are done!
  • gmnt /export/directory /mountpoint
• GMount creates a DFS at the mountpoint: a union of all export directories that can be mutually accessed by all nodes
[Figure: two hosts (Host001, Host002), each with an export directory (dir1, dir2, dat1-dat4) and a mountpoint; after mounting, both mountpoints show the same merged directory tree, giving mutual access]
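As a concrete illustration of the usage above, a minimal, hedged sketch; it assumes every node exports its data at /export, /mount is an empty directory on every node, and gmnt takes exactly the two arguments shown on the slide:
$ mkdir -p /mount              # empty mountpoint, created on every node
$ gmnt /export /mount          # build the DFS on the fly
$ ls /mount                    # the union of every node's /export is now visible here
$ cp /mount/dir1/dat3 /tmp/    # data exported by other nodes can be read like local files
Running the same ls on any other node should show the same union, which is what "mutual access" in the figure refers to.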
Enabling Techniques
• Building blocks
  • FUSE, SSHFS, and SSHFS-MUX
    • To create a basic userspace file system
    • To utilize existing SSH authentication and data transfer features
  • Grid and Cluster Shell (GXP)
    • To efficiently execute commands in parallel
• Core ideas
  • Scalable All-Mount-All algorithm
    • To let all nodes share with each other hierarchically and simultaneously
  • Locality-aware optimization
    • To make file accesses prefer closer files
FUSE and SSHFS Magic
• FUSE [fuse.sf.net]
  • Framework for quickly building userspace file systems
  • Widely available (Linux kernel > 2.6.14)
• SSHFS [fuse.sf.net/sshfs.html]
  • Manipulate files on remote hosts as if they were local files
  • $ sshfs myhost.net:/export /mount
  • Limitation: can mount only one host at a time
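For readers unfamiliar with SSHFS, a small sketch of the single-host case (myhost.net and the paths are placeholders); fusermount -u is the standard way to detach a FUSE mount:
$ sshfs myhost.net:/export /mount    # remote /export appears under the local /mount
$ ls /mount                          # browse remote files as if they were local
$ fusermount -u /mount               # detach the FUSE mount when finished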
FUSE and SSHFS Magic (cont.)
• Manipulate multiple hosts simultaneously: SSHFS-MUX
  • A$ sshfsm B:/export C:/export /mount
• Priority lookup
  • E.g., C:/export will be accessed before B:/export
[Figure: B's /export and C's /export each contain dir1 and dir2 with different dat files; A's /mount shows the merged tree, with C's entries taking priority]
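A hedged sketch of the priority lookup above, run on host A (host and file names are placeholders): with the mount command shown, entries found under C:/export shadow identically named entries under B:/export.
A$ sshfsm B:/export C:/export /mount
A$ cat /mount/dir1/dat2    # resolved in C:/export first; only if missing there does the lookup fall back to B:/export
A$ fusermount -u /mount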
Problem Setting
• INPUT: the export directory at each node (E), e.g. /export, holding the data to share
• OUTPUT: the DFS mount directory at each node (M), e.g. /mount, through which the DFS is accessed
[Figure: 3 example nodes, each with data to export at /export and the DFS mounted at /mount]
A Straightforward Approach
• Execution examples for 3 nodes
  1$ sshfsm 1:/export 2:/export 3:/export /mount
  2$ sshfsm 1:/export 2:/export 3:/export /mount
  3$ sshfsm 1:/export 2:/export 3:/export /mount
• What if we have 100 nodes? Scalability!
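To see why this does not scale, a hedged sketch of the same approach generalized to N nodes (hosts.txt is a hypothetical file listing all participating hosts, one per line): every node mounts every other node, so roughly N^2 SSH connections are opened in total.
# run the same command on every node
$ sshfsm $(awk '{printf "%s:/export ", $1}' hosts.txt) /mount    # N branches per node, about N*N connections overall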
Scalable Approach: Phase I
• Phase I: One-Mount-All
  1$ sshfsm 1:/export 2:/export 3:/export /mount
Scalable Approach: Phase II
• Phase II: All-Mount-One
  2$ sshfsm 1:/mount /mount
  3$ sshfsm 1:/mount /mount
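Taken together, the two phases give every node the same view at a fraction of the cost; a quick sanity check, sketched for the same 3-node setup:
2$ ls /mount    # the union of 1:/export, 2:/export and 3:/export, reached through node 1
3$ ls /mount    # identical view on node 3
# Connection count: the root opens about N-1 SSH connections and every other node opens 1,
# i.e. roughly 2(N-1) in total instead of N(N-1) for the all-to-all approach.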
Comparison
[Figure: the straightforward all-to-all mount graph vs. the two-phase spanning tree, where K is the number of children per node in the tree]
• In total, the straightforward approach opens about N(N-1) SSH connections, while the two-phase approach opens about 2(N-1)
Further Optimization
• Locality-Aware Lookup
• Before:
  1$ sshfsm 1:/export 2:/export 3:/export /mount
  2$ sshfsm 1:/mount /mount
  3$ sshfsm 1:/mount /mount
• After: each node also mounts its own export with higher priority
  1$ sshfsm 1:/export 2:/export 3:/export /mount
  2$ sshfsm 1:/mount 2:/export /mount
  3$ sshfsm 1:/mount 3:/export /mount
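With the optimized mounts above, the priority rule of SSHFS-MUX (later branches are searched first) gives the locality-aware behavior: a lookup on node 2 checks its own 2:/export before asking node 1. A hedged sketch with hypothetical file names:
2$ cat /mount/dir1/local.dat     # stored in 2:/export, so it is resolved without contacting node 1
2$ cat /mount/dir1/remote.dat    # not in 2:/export, so the lookup is forwarded to 1:/mount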
Hierarchical Grouping, Sharing, and Lookup
• Nodes share with each other at the same level
• They export their union to the upper level
• File lookup happens in the local group first
• The lookup then goes upward if the file is not found locally
• The DFS is constructed recursively and hierarchically
How to Execute Many Mounts in Parallel?
• Grid and Cluster Shell (GXP) [Taura '04]
  • Simultaneously operates hundreds of nodes
  • Scalable and efficient
  • Works across different administration domains
  • Install it on one node; it deploys itself to all nodes
  • Also a useful tool for daily Grid interaction
  • Programmable parallel execution framework
• In GMount
  • Efficiently executes SSHFS-MUX in parallel on many nodes
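A hedged sketch of how GXP is typically driven from the command line (the gxpc front end invocations and the host name pattern below are assumptions; consult the GXP documentation for the exact syntax):
$ gxpc explore hongo[[000-015]]    # grab nodes over SSH (hypothetical host name pattern)
$ gxpc e hostname                  # run a command on all grabbed nodes in parallel
$ gxpc e mkdir -p /mount           # e.g. prepare the mountpoint everywhere before running gmnt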
Summary of GMount Execution
• Grab nodes with GXP
  • Assign the starting node as the master, the others as workers
• The master gathers information and makes a mount plan
  • Gets the number of nodes
  • Gets the information of each node
  • Makes a spanning tree and a mount plan, and sends them to the workers
• Execute the plan
  • Workers execute the mount plan and send the results back to the master
  • The master aggregates the results and reports to the user
Dealing with Real Environments
• Utilize network topology information
  • Group nodes based on implicit/explicit network affinity:
    • Using IP address affinity
    • Using network topology information if available
• NAT/Firewall
  • Overcome by cascade mount
  • Specify gateways as the roots of internal nodes and cascade inside-outside traffic (see the sketch below)
[Figure: a cluster inside a LAN behind a NAT/firewall and an outside node; the gateway serves as the root for the internal nodes]
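A hedged sketch of the cascade mount for the NAT case (host names are placeholders): gw is the gateway reachable from both sides, p1 and p2 are behind the NAT, and ext is an outside node. The gateway unions the internal exports, and outside nodes reach them only through it.
gw$  sshfsm p1:/export p2:/export gw:/export /mount    # gateway acts as the root of the internal nodes
ext$ sshfsm gw:/mount ext:/export /mount               # outside traffic to the LAN is cascaded through the gateway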
Evaluation
• Experimental environment
  • InTrigger, a distributed cluster of clusters with 15 sites in Japan
• Experiments
  • Performance of the building block (SSHFS-MUX)
    • I/O performance
    • Metadata performance
  • File system construction time vs. system size
    • Mount time
    • Unmount time
  • I/O performance vs. spanning tree shape
  • Metadata performance vs. fraction of local accesses
InTrigger Platform http://www.intrigger.jp
• Over 300 nodes across 12 sites
• Representative platform for wide-area environments
  • Heterogeneous wide-area links
  • NAT enabled at 2 sites
• Unified software environment
  • Linux 2.6.18
  • FUSE 2.7.3
  • OpenSSH 4.3p2
  • SSHFS-MUX 1.1
  • GXP 3.03
[Figure: map of InTrigger sites with node counts per site]
File System Construction Time
• Less than 10 seconds for 329 nodes nationwide
[Figure: construction time vs. number of sites (and nodes)]
Parallel I/O Performance
• The limited SSH transfer rate is the primary bottleneck
• Performance also depends on the tree shape
Metadata Operation Performance
• Gfarm: a wide-area DFS with a central metadata server
  • Clients first query the metadata server for file locations
  • Clients may be distant from the metadata server
• Locality awareness
  • Clients prefer to access files stored on nodes close to them (within the same cluster/LAN)
  • Percentage of local accesses, where a local access is an access to a node within the same cluster/LAN
Metadata Performance in WAN
[Figure: metadata performance of GMount in WAN vs. Gfarm in WAN vs. Gfarm in LAN; the locality-aware lookup saves network latency]
Highlights
Future Work
• SFTP limitations
  • Not fully POSIX compatible
    • Rename and link operations
  • Limited receive buffer [Rapier et al. '08]
    • Low data transfer rate on long-fat networks
  • SFTP extended attribute support
    • Piggyback file locations during lookup
• Performance enhancement
  • SSHFS-MUX local mount operation (done!)
• Fault tolerance
  • Tolerate connection drops
Available as OSS
• SSHFS-MUX: http://sshfsmux.googlecode.com/
• Grid and Cluster Shell (GXP): http://sourceforge.net/projects/gxp/
Thank You!