Building simple, easy-to-use grids with Styx Grid Services and SSH

Jon Blower, Keith Haines. Reading e-Science Centre, Environmental Systems Science Centre, University of Reading

Motivation • Grid computing is “distributed computing performed transparently across multiple administrative domains” • Implies both ease of use and security • hard to get both simultaneously! • Ease of setup and maintenance also highly desirable • Difficulty in achieving this is major block to uptake of Grid computing • Currently hard for science projects to build their own Grids without very significant technical help • In this talk: • Ease of use (transparency) comes from Styx Grid Services • Security comes from SSH

Running jobs on a Grid • Basic use of a Grid boils down to: • uploading the required input files for a job • running the job • downloading the output • (More advanced use includes workflows, delegation, etc) • Many users ask – why not just use SSH? • File transfer with SFTP/SCP • Execution through SSH exec or SSH login
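The upload / run / download cycle above can be scripted with stock SSH tools. The Python sketch below only builds the `scp` and `ssh` command lines and does not run them. The host name, remote directory and file names are hypothetical placeholders, not part of the talk.

```python
import shlex


def plain_ssh_job(host, remote_dir, program, inputs, outputs):
    """Return the commands for one upload / run / download cycle over SSH.

    Nothing is executed here; pass each command to subprocess.run to use it.
    All names are illustrative assumptions.
    """
    commands = []
    # 1. Upload each input file with scp.
    for name in inputs:
        commands.append(["scp", name, f"{host}:{remote_dir}/{name}"])
    # 2. Run the program remotely inside the working directory.
    remote = f"cd {shlex.quote(remote_dir)} && {program}"
    commands.append(["ssh", host, remote])
    # 3. Download each output file with scp.
    for name in outputs:
        commands.append(["scp", f"{host}:{remote_dir}/{name}", name])
    return commands


if __name__ == "__main__":
    for cmd in plain_ssh_job("user@myhost", "/tmp/job1",
                             "myprog -i input.dat -o output.dat",
                             ["input.dat"], ["output.dat"]):
        print(shlex.join(cmd))
```

Even in this small case the user has to know the remote host, the remote directory and which files to move, which is the bookkeeping that SGS is meant to hide.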

Advantages of SSH • Trusted and understood by system administrators • Very widely used, bugs get fixed quickly • Lots of implementations and good tools • e.g. WinSCP (Windows Explorer-like interface to remote systems) • Choice of authentication methods including: • password • public-private key pair • pluggable (e.g. GSI-SSH for Globus logins) • Can mount remote filesystems exposed through SSH • sshfs for Linux (analogous to NFS) • SftpDrive for Windows • Hence can work on a remote file without downloading all of it: potentially important in environmental sciences • Can execute remote programs with SSH • Hence SSH can be the nucleus of a simple Grid

What’s wrong with Globus security? • Globus uses X.509 certificates and time-limited proxies • Proxies can be used to temporarily delegate authority to a third party • Certificates have typical life-time of 1 year • High level of security but proven usability problem • Lots of certificate formats – different tools require different formats • Users can’t “remember” the certificate so need to have a copy on every computer they use (or on USB stick or shared disk) • Illegal sharing of certificates: “mine doesn’t work, can I use yours?” • Users known to run SSH as a grid job, then log on to that to get a familiar environment! • Therefore poor usability leads to poor security in practice • and annoys users no end • Conclusion – user certificates should be avoided if possible • MyProxy can help with these problems • cf. NERC DataGrid • but is an extra server to manage

Styx Grid Services • Simple, lightweight system for exposing executables as a service • Executable is installed on a service provider (host) • SGSs are executed just like local programs • myprog -i input.dat -o output.dat • (myprog is a wrapper script that masquerades as the original executable) • files transferred automatically, user doesn’t have to know where • Supports interactive use • including computational steering • But executables exposed through SGS must be non-graphical • “Workflows” can be constructed with shell scripts • data can be streamed directly between the services • extract | process | render • Supported by Taverna • Emphasis on ease of deployment and use, not feature completeness

How SGS works • Server contains complete description of executable in XML • includes input and output files, command-line parameters • SGSRun program downloads XML description and parses the command line • Creates new service instance and uploads input files • Starts the service and monitors progress • Uploads stdin and downloads stdout and stderr as the service runs, redirecting them from and to the console • Downloads output files when the service finishes
Example service description:
<gridservice name="gulp">
  <params>
    <param type="unflaggedOption" name="inputfile"/>
    <param type="unflaggedOption" name="outputfile"/>
  </params>
  <inputs>
    <input type="fileFromParam" name="inputfile"/>
  </inputs>
  <outputs>
    <output type="fileFromParam" name="outputfile"/>
    <output type="stream" name="stderr"/>
  </outputs>
</gridservice>
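To illustrate how a client could combine the XML description with the command line, here is a minimal Python sketch. It is not the SGSRun implementation (which is Java). It assumes that `unflaggedOption` parameters are bound to positional arguments in declaration order; the function name `plan_transfers` and the file names in the usage note are hypothetical.

```python
import xml.etree.ElementTree as ET

# The service description from the slide above.
DESCRIPTION = """
<gridservice name="gulp">
  <params>
    <param type="unflaggedOption" name="inputfile"/>
    <param type="unflaggedOption" name="outputfile"/>
  </params>
  <inputs>
    <input type="fileFromParam" name="inputfile"/>
  </inputs>
  <outputs>
    <output type="fileFromParam" name="outputfile"/>
    <output type="stream" name="stderr"/>
  </outputs>
</gridservice>
"""


def plan_transfers(description_xml, argv):
    """Work out which files to upload and download for one invocation.

    Assumption: unflagged options take the positional arguments in the
    order in which they are declared in <params>.
    """
    root = ET.fromstring(description_xml.strip())
    names = [p.get("name") for p in root.findall("./params/param")
             if p.get("type") == "unflaggedOption"]
    if len(argv) != len(names):
        raise ValueError(f"expected arguments: {' '.join(names)}")
    values = dict(zip(names, argv))

    # Inputs and outputs of type fileFromParam take their file name
    # from the value of the parameter with the same name.
    uploads = [values[i.get("name")]
               for i in root.findall("./inputs/input")
               if i.get("type") == "fileFromParam"]
    downloads = [values[o.get("name")]
                 for o in root.findall("./outputs/output")
                 if o.get("type") == "fileFromParam"]
    # Stream outputs are relayed to the console while the job runs.
    streams = [o.get("name")
               for o in root.findall("./outputs/output")
               if o.get("type") == "stream"]
    return {"service": root.get("name"), "uploads": uploads,
            "downloads": downloads, "streams": streams}
```

Calling `plan_transfers(DESCRIPTION, ["in.gin", "out.gout"])` would mark `in.gin` for upload, `out.gout` for download and `stderr` for streaming, so the user never has to list the transfers explicitly.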

SGS and security • SGS server can be run in two modes: • Daemon mode: • Standalone server (a container for services) • Traffic optionally encrypted through Secure Sockets (SSL) • Authentication through custom protocol • need to maintain own user database • Jobs run as a generic user • Tunnelled mode: • Server process executed through Secure Shell (SSH) • Client and server communicate down the encrypted channel • Authentication through SSH • No separate user database – just need login on host system • Jobs run with permissions of the specific user • analogous to other systems e.g. Subversion • Client interface is the same in both cases • Choice is purely down to service providers

SGS + SSH = … • You can execute remote jobs with SSH alone, but only stdin, stdout and stderr are communicated down the line • Need to upload and download input and output files "manually" • Styx allows an arbitrary number of channels to be sent down the secure line … • Data streams • Input and output files • Progress and status messages • Steering messages • … through use of the Styx protocol for distributed systems • File-sharing protocol similar to NFS • We have pure-Java implementation of Styx (http://jstyx.sf.net) • Any resource can be represented as a URL: styx+ssh://myhost/myservice/instances/1/outputs/stdout
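The URL form above can be taken apart with standard tools. The following Python sketch splits a `styx+ssh://` URL into protocol, transport, host and path segments. The helper name `parse_sgs_url` is hypothetical, and the sketch deliberately does not assign any meaning to the individual path segments beyond what the slide shows.

```python
from urllib.parse import urlsplit


def parse_sgs_url(url):
    """Split a URL such as styx+ssh://host/path into its parts."""
    parts = urlsplit(url)
    # "styx+ssh" -> protocol "styx" carried over transport "ssh";
    # a plain "styx" scheme has no transport component.
    protocol, _, transport = parts.scheme.partition("+")
    segments = [s for s in parts.path.split("/") if s]
    return {
        "protocol": protocol,
        "transport": transport or None,
        "host": parts.hostname,
        "path": segments,
    }


if __name__ == "__main__":
    print(parse_sgs_url(
        "styx+ssh://myhost/myservice/instances/1/outputs/stdout"))
```

Because every resource is addressed as a file-like path, one channel per path is enough to carry streams, files and status messages over the same SSH connection.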

Demo 1: A basic Grid job • Remote execution of GULP (General Utility Lattice Program) • Julian Gale • Calculates lots of properties of crystal lattices • e.g. Helmholtz free energy • Reads input from stdin, prints output to stdout • gulp < infile • Running remote job exactly the same as running locally • Client-side stub and server-side SGS framework communicate through Styx messages on the secure channel
[Diagram: the client's GULP stub and the server-side GULP SGS exchange Styx messages on an SSH channel]

Demo 2: Condor job • SGS system can be installed on a Condor submit host • If user specifies a directory of input files instead of a single file, jobs are split across worker nodes in the pool • gulp inputs outputs • One job per file in the inputs directory • SGS system automatically creates Condor submit file and monitors progress • Progress is displayed on the client's console • Easy way to specify parameter sweep jobs, ensemble data processing etc. • Could apply to Sun GridEngine and other DRMs • Interactive use may not be possible depending on DRM
[Diagram: the client's GULP stub connects over SSH to the GULP SGS on the Condor submit host, which farms jobs out to the Condor worker nodes]
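The "one job per input file" idea can be sketched as a generated Condor submit description. The Python below is an illustration only, not the submit file that SGS actually writes. It assumes a program that, like GULP, reads stdin and writes stdout, and the log file name and output naming scheme are hypothetical.

```python
from pathlib import Path


def condor_submit_description(executable, input_dir, output_dir):
    """Build a submit description that queues one job per input file.

    Assumptions: vanilla universe, program reads stdin and writes
    stdout, outputs are named <input name>.out / <input name>.err.
    """
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        "log = sweep.log",
        "",
    ]
    out_dir = Path(output_dir)
    for infile in sorted(Path(input_dir).iterdir()):
        if not infile.is_file():
            continue
        # Each input file becomes the stdin of its own job.
        lines += [
            f"input = {infile.as_posix()}",
            f"output = {(out_dir / (infile.name + '.out')).as_posix()}",
            f"error = {(out_dir / (infile.name + '.err')).as_posix()}",
            "queue",
            "",
        ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Prints only the header if the current directory has no files.
    print(condor_submit_description("gulp", ".", "outputs"))
```

The resulting text could be written to a file and passed to `condor_submit` on the submit host; the client would then only need to poll the jobs and relay progress to the console.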

Submission to Globus resources • Two options: • Use GSI-SSH instead of SSH • SSH with Globus authentication • (thanks to CCLRC for Java code to GSI-SSHTerm) • doesn’t quite work yet… ;-) • Submit to Condor-G instead of Condor (right) • OxGrid uses Condor-G to submit jobs to National Grid Service • Very similar to normal Condor operation
[Diagram: the client's GULP stub connects over SSH to the GULP SGS on the Condor-G submit host, which submits jobs to Globus resources]

Long-running jobs and robustness • Client might disconnect the SSH connection deliberately or accidentally • This might bring down the SGS server process! • Client would not be able to re-connect • (In daemon mode this is less of a problem as the server is persistent) • We have designed but not yet implemented a solution to this • A little coding and a lot of thinking and testing is required! • This is also needed to support workflows properly (services need to connect to one another to transfer data directly) • In progress!

Case study: GCEP project • Grid for Coupled-model Ensemble Prediction • Uses clusters in Reading, British Antarctic Survey and RAL • Run climate models (MPI jobs) then analyse output (single-machine jobs) • Focusses on ensembles, so want to run same program over different input • Scientists write programs in whatever language they like • Deploy on the GCEP servers and create the XML description • Anyone with SSH access to the servers can then run the programs through SGS as if they were local • programs can be run on clusters through Sun Grid Engine • Data transfers happen automatically • MPI jobs on clusters • Trivially parallel jobs on Condor pool of ordinary desktops (Reading Campus Grid)

Limitations • Robustness • Slow data transfers because encrypted • could use alternative transport • There are ways to improve this but need more testing • SGS does not provide a resource broker • But can use Condor-G for this • Users can't (yet) submit arbitrary executables • Complex executables (that spawn other exes) might be hard to deploy in SGS • But we haven't really tried yet • Can't deploy a GUI app as an SGS

Conclusions • To use SGS-SSH all you need is: • An SSH login to the remote system • The SGS software (5MB of pure Java libraries) • Users run Grid jobs securely just like ordinary local programs • Can submit to Condor, Globus and other DRMs • Can create "workflows" of Styx Grid Services with shell scripts • Data can be transferred directly between services • SGS already available: SGS-SSH needs more work • Version 0.2.0 of JStyx downloaded 218 times so far • (most of them probably just want Styx implementation, not SGS)

Future work • Case studies! • Robustness • Optimize data transfer speed • GridSAM integration (possible) • already has framework for submission to various DRMs • but limited by JSDL limitation of “one job at a time” • Compare with my_condor_submit • From e-Minerals project

Acknowledgements and references • Thanks to… • David Wallom of OERC for helping to integrate with OxGrid • Tom Oinn of Taverna project for Taverna integration • Vita Nuova Holdings Ltd for technical help with Styx protocol • See also… • Reading e-Science Centre booth • Papers in AHM proceedings 2004, 2005 and 2006 • http://jstyx.sf.net
