
GMount: An Ad-hoc and Locality-Aware Distributed File System by using SSH and FUSE






Presentation Transcript


  1. GMount: An Ad-hoc and Locality-Aware Distributed File System by using SSH and FUSE Graduate School of Information Science and Technology, The University of Tokyo Nan Dun, Kenjiro Taura, Akinori Yonezawa

  2. Today You may Have • Computing resources across different administration domains • InTrigger (JP), Tsubame (JP), T2K-Tokyo (JP) • Grid5000 (FR), D-Grid (DE), INFN Grid (IT), National Grid Services (UK) • Open Science Grid (US) • Workloads to run on all the available resources • Finding supernovae • Gene decoding • Weather simulation, etc.

  3. Scenario I How to share your data among arbitrary machines across different domains?

  4. Ways of Sharing • Option 1: Staging your data • Too troublesome: SCP, FTP, GridFTP, etc. • Option 2: Conventional DFSs • Ask your administrators! • Which one? NFS, OpenAFS, PVFS, GPFS, Lustre, GoogleFS, Gfarm • Only for you? Believe me, they won’t do so • Quota, security, policy? Headaches… • Configure and install, even if admins are supposed to do their job ... • Option 3: GMount • Build a DFS by yourself, on the fly!

  5. Scenario II You have many clients/resources, and you want more servers

  6. Ways of Scaling • Option 1: Conventional DFSs • File servers are fixed at deploy time • Fixed number of MDS (Metadata Servers) • Fixed number of DSS (Data Storage Servers) • Ask your administrators again! • Append more DSS • Option 2: GMount • No metadata server • File servers scale with the clients • As long as you have more servers, you have more DSS • Especially beneficial if your workloads involve a large amount of local writes

  7. Scenario III What happens when clients access nearby files in wide-area environments?

  8. File Lookup in Wide-Area • High latency: DFSs with a central MDS • The central MDS is far away from some clients • Locality-aware: GMount • Search nearby nodes first • Send a high-latency message only if the target file cannot be found locally

  9. Impression of Usage • Prerequisites • You can SSH-login to some nodes • Each node has an export directory containing the data you want to share • Specify a mountpoint via which the DFS can be accessed • Simply make an empty directory on each node

  10. Impression of Usage • Just one command, and you are done! • gmnt /export/directory /mountpoint • GMount will create a DFS at the mountpoint: a UNION of all export directories that can be mutually accessed by all nodes [Figure: Host001 and Host002 each see, under their own mount directory, the union of both hosts' export directories (mutual access)]
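
A minimal end-to-end sketch of the usage above, assuming the nodes have already been grabbed as described later; the directory names are the illustrative ones from the slides:

    # On every node: an empty mount point (slide 9's prerequisite).
    mkdir -p /mountpoint
    # One command from the starting node builds the DFS.
    gmnt /export/directory /mountpoint
    # Any node now sees the union of all export directories.
    ls /mountpoint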

  11. Enabling Techniques • Building blocks • FUSE, SSHFS and SSHFS-MUX • To create a basic userspace file system • To utilize existing SSH authentication and data transfer features • Grid and Cluster Shell (GXP) • To efficiently execute commands in parallel • Core ideas • Scalable All-Mount-All algorithms • To let all nodes hierarchically and simultaneously share with each other • Locality-aware optimization • To make file accesses prefer closer files

  12. FUSE and SSHFS Magic • FUSE [fuse.sf.net] • Framework for quickly building a userspace FS • Widely available (Linux kernel > 2.6.14) • SSHFS [fuse.sf.net/sshfs.html] • Manipulate files on remote hosts as local files • $ sshfs myhost.net:/export /mount • Limitation: can only mount one host at a time

  13. FUSE and SSHFS Magic (cont.) • Manipulate multiple hosts simultaneously with SSHFS-MUX:
  A$ sshfsm B:/export C:/export /mount
  • Priority lookup • E.g. C:/export will be accessed before B:/export [Figure: A's /mount is the union of B's /export and C's /export; entries present in both are served from C's /export]
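
A small sketch of the priority lookup just described (B and C as in the slide; the file name is hypothetical, and fusermount -u is the standard way to detach a FUSE mount):

    A$ sshfsm B:/export C:/export /mount
    A$ ls /mount                   # union of B:/export and C:/export
    A$ cat /mount/dir1/file        # a path present in both exports is served from C:/export
    A$ fusermount -u /mount        # unmount when done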

  14. Problem Setting • INPUT: the export directory E at each node (the data to export, e.g. /export; example with 3 nodes) • OUTPUT: the DFS mount directory M at each node (the DFS mounted at /mount)

  15. A Straightforward Approach • Execution example for 3 nodes:
  1$ sshfsm 1:/export 2:/export 3:/export /mount
  2$ sshfsm 1:/export 2:/export 3:/export /mount
  3$ sshfsm 1:/export 2:/export 3:/export /mount
  What if we have 100 nodes? Scalability!
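
A bash sketch of why this blows up, assuming a hypothetical hosts.txt with one hostname per line: each of the N nodes issues a mount that opens SSH connections to all N exports, i.e. N x N connections in total.

    SRCS=$(sed 's|$|:/export|' hosts.txt | tr '\n' ' ')   # "host1:/export host2:/export ..."
    while read -r h; do
      echo "$h\$ sshfsm ${SRCS}/mount"    # every node mounts every export
    done < hosts.txt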

  16. Scalable Approach: Phase I • Phase I: One-Mount-All
  1$ sshfsm 1:/export 2:/export 3:/export /mount

  17. Scalable Approach: Phase II • Phase II: All-Mount-One
  2$ sshfsm 1:/mount /mount
  3$ sshfsm 1:/mount /mount
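
The corresponding plan for the two-phase scheme, again assuming a hypothetical hosts.txt; the first host plays the role of node 1 (the root), so only one wide fan-out at the root plus one connection per remaining node is needed:

    mapfile -t HOSTS < hosts.txt
    ROOT=${HOSTS[0]}
    SRCS=$(printf '%s:/export ' "${HOSTS[@]}")
    echo "${ROOT}\$ sshfsm ${SRCS}/mount"           # Phase I: the root mounts all exports
    for h in "${HOSTS[@]:1}"; do
      echo "${h}\$ sshfsm ${ROOT}:/mount /mount"    # Phase II: everyone else mounts the root's union
    done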

  18. Comparison • K is the number of children [Figure: the flat all-to-all mounting vs. the spanning tree built by the two-phase approach]

  19. Further Optimization • Locality-Aware Lookup • Without local exports:
  1$ sshfsm 1:/export 2:/export 3:/export /mount
  2$ sshfsm 1:/mount /mount
  3$ sshfsm 1:/mount /mount
  • With each node's own export added (listed last, so it has lookup priority):
  1$ sshfsm 1:/export 2:/export 3:/export /mount
  2$ sshfsm 1:/mount 2:/export /mount
  3$ sshfsm 1:/mount 3:/export /mount

  20. Hierarchical grouping, sharing and lookup • Nodes share with each other at the same level • Export their union to the upper level • File lookup happens in the local group first • Then looks upward if not found • Constructed recursively and hierarchically
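
Combining this with the priority rule of slide 13, a worker node's single mount in such a hierarchy would look roughly as follows (host names are illustrative; the exact plan is computed by GMount as summarized on the next slides):

    # host042 is a worker whose group root is host001.
    host042$ sshfsm host001:/mount host042:/export /mount
    # Lookup order on host042 (the last branch has priority):
    #   1. host042:/export   -- its own local files
    #   2. host001:/mount    -- the rest of the group and, through the root, the upper level
    # A high-latency wide-area message is sent only when the file is not found locally.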

  21. How to execute many mounts in parallel? • Grid and Cluster Shell (GXP) [Taura ’04] • Simultaneously operates hundreds of nodes • Scalable and efficient • Works across different administration domains • Install it at one node; it deploys itself to all nodes • Also a useful tool for daily Grid interaction • Programmable parallel execution framework • In GMount: efficiently executes SSHFS-MUX in parallel on many nodes
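
GMount uses GXP for this step; purely to illustrate the idea (this is plain ssh, not GXP's interface), the mounts of a plan could be fired in parallel and their results collected like this, assuming a hypothetical mountplan.txt of "host command" lines:

    while read -r host cmd; do
      ssh -n "$host" "$cmd" && echo "$host OK" || echo "$host FAILED" &
    done < mountplan.txt
    wait    # gather all workers before reporting back to the user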

  22. Summary of GMount Executions • Grab nodes by GXP • Assign the starting node as master, the others as workers • The master gathers info and makes a mount plan • Get the number of nodes • Get the information of each node • Make a spanning tree and a mount plan and send them to the workers • Execute the plan • Workers execute the mount plan and send results back to the master • The master aggregates the results and prompts the user

  23. Deal with Real Environments • Utilize network topology information • Group nodes based on implicit/explicit network affinity: • Using IP address affinity • Using network topology information if available • NAT/Firewall • Overcome by cascade mount • Specify gateways as the roots of internal nodes and cascade inside-outside traffic [Figure: a LAN behind a NAT/firewall, with the gateway node relaying traffic between internal nodes and the outside]
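
A minimal sketch of the implicit (IP address affinity) grouping, assuming a hypothetical nodes.txt of "hostname IPv4" lines; GMount's real heuristic may differ:

    declare -A GROUP
    while read -r host ip; do
      GROUP[${ip%.*}]+="$host "        # key on the first three octets (/24 affinity)
    done < nodes.txt
    for prefix in "${!GROUP[@]}"; do
      echo "group ${prefix}.0/24: ${GROUP[$prefix]}"
    done

For a NATed cluster, its gateway would then be the natural choice of group root, so that inside-outside traffic cascades through it.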

  24. Evaluation • Experimental environment • InTrigger, a distributed cluster of clusters across 15 sites in Japan • Experiments • Performance of the building block (SSHFS-MUX) • I/O performance • Metadata performance • File system construction time vs. system size • Mount time • Umount time • I/O performance vs. spanning-tree shape • Metadata performance vs. percentage of local accesses

  25. InTrigger Platform http://www.intrigger.jp • Over 300 nodes across 12 sites • Representative platform for wide-area environments • Heterogeneous wide-area links • NAT enabled in 2 sites • Unified software environment • Linux 2.6.18 • FUSE 2.7.3 • OpenSSH 4.3p2 • SSHFS-MUX 1.1 • GXP 3.03 [Figure: map of InTrigger sites with per-site node counts]

  26. File System Construction Time • Less than 10 seconds for 329 nodes nation-wide [Chart: construction time vs. number of sites (number of nodes)]

  27. Parallel I/O Performance • Limited SSH transfer rate is the primary bottleneck • Performance also depends on the tree shape

  28. Metadata Operation Performance • Gfarm: a wide-area DFS • Central metadata server • Clients first query the metadata server for file locations • Clients may be distant from the metadata server • Locality awareness • Clients prefer to access files stored on nodes close to them (within the same cluster/LAN) • Metric: percent of local access, where a local access is an access to a node within the same cluster/LAN

  29. Metadata Performance in WAN [Chart: GMount in WAN vs. Gfarm in WAN and Gfarm in LAN; GMount's locality awareness saves network latency]

  30. Highlights

  31. Future Work • SFTP limitations • Not fully POSIX compatible • Rename and link operations • Limited receive buffer [Rapier et al. ’08] • Low data transfer rate in long-fat networks • SFTP extended attribute support • Piggybacking file locations during lookup • Performance enhancement • SSHFS-MUX local mount operation (Done!) • Fault tolerance • Tolerating connection drops

  32. Available as OSS • SSHFS-MUX: http://sshfsmux.googlecode.com/ • Grid and Cluster Shell (GXP): http://sourceforge.net/projects/gxp/

  33. Thank You! CCGrid 2009, Shanghai, China
