Dr. David Wallom Experience of Setting up and Running a Production Grid on a University Campus July 2004
Outline • The Centre for e-Research Bristol & its place in national efforts • University of Bristol Grid • Available tool choices • Support models for a distributed system • Problems encountered • Summary
Centre for e-Research Bristol • Established as a Centre of Excellence in visualisation. • Currently has one full-time member of staff with several shared resources. • Intended to lead the University e-Research effort, including as many departments and non-traditional computational users as possible.
NGS (www.ngs.ac.uk) UK National Grid Service • ‘Free’ dedicated resources accessible only through Grid interfaces, i.e. GSI-SSH, Globus Toolkit • Compute clusters (York & Oxford) • 64 dual CPU Intel 3.06 GHz nodes, 2GB RAM • Gigabit & Myrinet networking • Data clusters (Manchester & RAL) • 20 dual CPU Intel 3.06 GHz nodes, 4GB RAM • Gigabit & Myrinet networking • 18TB Fibre SAN • Also national HPC resources: HPC(x), CSAR • Affiliates: Bristol, Cardiff, …
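For illustration, a session on such Grid-interface-only resources uses the standard Globus Toolkit 2 and GSI-SSH client tools; the host name below is a placeholder rather than a real NGS node:
    grid-proxy-init                                         # create a short-lived proxy from your e-Science certificate
    gsissh <ngs-node>                                       # interactive login over GSI-SSH
    globus-job-run <ngs-node>/jobmanager-pbs /bin/hostname  # run a simple job via the Globus gatekeeper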
The University of Bristol Grid • Established as a way of leveraging extra use from existing resources. • Planned to consist of ~400 CPUs ranging from 1.2 to 3.2 GHz arranged in 6 clusters; currently about 100 CPUs in 3 clusters. • Initially running legacy OSes, though all are now moving to Red Hat Enterprise Linux 3. • Based in and maintained by several different departments.
The University of Bristol Grid • Decided to construct a campus grid to gain experience with middleware & system management before formally joining the NGS. • Central services all run on Viglen servers: • Resource Broker • Monitoring and Discovery Service & systems monitoring • Virtual Organisation Management • Storage Resource Broker Vault • myProxy Server • The choice of software to provide these was led by personal experience & other UK efforts to standardise.
The University of Bristol Grid, 2 • Based in and maintained by several different departments. • Each system with a different System Manager! • Different OSes, initially just Linux & Windows, though others will come. • Linux versions initially legacy, though all now moving to Red Hat Enterprise Linux.
System Installation Model Draw it on the board!
Middleware • Virtual Data Toolkit. • Chosen for stability and support structure. • Widely used in other European production grid systems. • Contains the standard Globus Toolkit version 2.4 with several enhancements.
Resource Brokering • Uses the Condor-G job distribution mechanism. • A custom script determines resource priority. • Integrated the Condor job submission system with the Globus Monitoring and Discovery Service.
Accessing the Grid with Condor-G • Condor-G allows the user to treat the Grid as a local resource, with the same command-line tools performing basic job management such as: • Submit a job, indicating an executable, input and output files, and arguments • Query a job's status • Cancel a job • Be informed when events happen, such as normal job termination or errors • Obtain access to detailed logs that provide a complete history of a job • Condor-G extends basic Condor functionality to the Grid, providing resource management together with fault tolerance and exactly-once execution semantics.
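A minimal Condor-G submit description of this era (the "globus" universe) might look like the sketch below; the gatekeeper host and file names are purely illustrative:
    universe        = globus
    globusscheduler = <gatekeeper-host>/jobmanager-pbs
    executable      = sim.exe
    output          = sim.out
    error           = sim.err
    log             = sim.log
    queue
The job is then handled with the familiar local-pool commands: condor_submit to submit, condor_q to query and condor_rm to cancel.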
Limitations of Condor-G • Submitting jobs to run under Globus has not yet been perfected. Known limitations include: • No checkpointing. • No job exit codes are returned. • Limited platform availability: Condor-G is only available on Linux, Solaris, Digital UNIX, and IRIX; HP-UX support will hopefully follow later.
Load Management • The broker only sees the raw numbers of jobs running, idle & with problems. • It has little measure of the relative performance of nodes within the grid. • Once a job has been allocated to a remote cluster, rescheduling it elsewhere is difficult.
Provision of a Shared Filesystem • Running a grid makes it beneficial to provide a shared file system. • The newest machines come with a minimum of 80GB hard drives, of which only a small part is needed for the OS & user scratch space. • The system will have a 1TB Storage Resource Broker Vault as one of the core services. • Take this one step further by partitioning the system drives on core servers: • Create a virtual disk of ~400GB using the spare space on them all! • Install the SRB client on all machines so that they can directly access the shared storage.
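As a sketch of what that direct access would look like, the SRB client provides Unix-style "S-commands"; the file name here is hypothetical:
    Sinit              # start an SRB session using the user's ~/.srb client configuration
    Sput results.dat   # copy a local file into the current SRB collection
    Sls                # list the collection contents
    Sget results.dat   # retrieve the file again
    Sexit              # close the session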
Automation of Processes for Maintenance • Installation • Grid state monitoring • System maintenance • User control • Grid Testing
Individual System Installation • Simple shell scripts for overall control. • Ensures middleware, monitoring and user software are all installed in a consistent place. • Ensures ease of system upgrades. • Ensures system managers have a chance to review the installation method beforehand.
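The real scripts are site-specific; a minimal sketch of this kind of wrapper, with entirely hypothetical paths and helper names, is:
    #!/bin/sh
    # Sketch only: install middleware, monitoring client and user software
    # under one agreed prefix so every cluster looks the same.
    PREFIX=/opt/uobgrid                     # hypothetical standard install root
    mkdir -p $PREFIX
    ./install-vdt.sh      --prefix $PREFIX  # hypothetical helper: VDT middleware
    ./install-bbclient.sh --prefix $PREFIX  # hypothetical helper: Big Brother client
    ./install-userapps.sh --prefix $PREFIX  # hypothetical helper: user application software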
Ensuring System Availability • Uses the Big Brother™ system. • Monitoring occurs through a client-server model. • The server maintains the alert limits and pings the listed resources. • Clients record system information and report it to the server over a secure port.
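For illustration only (hosts and test tags are placeholders), the server side is driven by a bb-hosts file listing each resource and the network checks to run against it:
    # bb-hosts (illustrative): IP address, hostname, then the tests to apply
    10.0.0.10   grid-rb.example.ac.uk    # conn ssh
    10.0.0.11   grid-mds.example.ac.uk   # conn ssh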
Grid Middleware Testing • Uses the Grid Interface Test Script (GITS) developed for the ETF. • Tests the following: • Globus Gatekeeper running and available. • Globus job submission system. • Presence of the machine within the Monitoring & Discovery Service. • Ability to retrieve and distribute files through GridFTP. • Run within the UoB grid every 3 hours. • Latest results available on the Service webpage. • Only downside is that it needs to run as a standard user rather than a system account.
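The three-hourly run is simply a cron job on the test server; the paths below are hypothetical:
    # crontab entry (illustrative): run GITS every 3 hours as an ordinary grid user
    0 */3 * * *  /home/gridtest/gits/run-gits.sh >> /home/gridtest/gits/gits.log 2>&1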
Authorisation And Authentication on the University of Bristol Grid • Makes use of the standard UK e-Science Certification Authority. • Bristol is an authorised Registration Authority for this CA. • Uses X.509 certificates and proxies for user AAA. • May be replaced at a later date, dependent on how the current system scales.
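In day-to-day use this means the standard Globus proxy workflow, sketched below; the server name and username are placeholders:
    grid-proxy-init -hours 12                       # create a 12-hour proxy from the UK e-Science certificate
    grid-proxy-info                                 # check the subject, issuer and remaining lifetime
    myproxy-init -s <myproxy-server> -l <username>  # optionally delegate a longer-lived credential to the myProxy server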
User Management • Globus uses a mapping from the Distinguished Name (DN) in a user's digital certificate to a local username on each resource. • Located in controlled disk space. • It is important that, for each resource a user expects to use, their DN is mapped locally. • Distributing this mapping is Virtual Organisation Management.
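The mapping itself is the Globus grid-mapfile (conventionally /etc/grid-security/grid-mapfile), one line per authorised user; the DN below is made up:
    # grid-mapfile entry (illustrative): quoted certificate DN, then the local account
    "/C=UK/O=eScience/OU=Bristol/L=IS/CN=jane bloggs" jbloggs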
Virtual Organisation Management and Resource Usage Monitoring/Accounting
Virtual Organisation Management and Resource Usage Monitoring/Accounting, 2 • The server (previous slide) runs as a grid service using the ICENI framework. • Clients are located on the machines that form part of the Virtual Organisation. • The current drawback is that this service must run using a personal certificate instead of the machine certificate that would be ideal. • A fix is coming in new versions from OMII.
Locally Supporting a Distributed System • Within the university, the first point of contact is always the Information Services Helpdesk. • They are given a preset list of questions to ask and log files to check, where available. • They are not expected to do any actual debugging. • Problems are passed on to the Grid experts, who then pass them on a system-by-system basis to the relevant maintenance staff. • As one of the UK e-Science Centres we also have access to the Grid Operations and Support Centre.
Supporting a Distributed System • Having a system that is well defined simplifies the support model. • Trying to define a Service Level Description from each department to the UoBGrid, as well as an overall UoBGrid Service Level Agreement to users. • Defines hardware support levels and availability. • Defines, at a basic level, the software support that will also be available.
Problems Encountered • Some of the middleware we have been trying to use has not been as reliable as we would have hoped. • MDS is a prime example, where the need for reliability has defined our usage model. • More software than we would like still has to run as a user with an individual DN. This must change for a production system. • Getting time and effort from some already overworked System Managers has been tricky, with sociological barriers: • “Won’t letting other people use my system just mean I will have less available for me?”
Notes to think about! • Choose your test application carefully. • Choose your first test users even more carefully! • One user with a bad experience outweighs 10 with good experiences. • The Grid has been very over-hyped, so people expect it all to work first time, every time!
Future Directions within Bristol • Make sure the rest of the University's clusters are installed and running on the UoBGrid as quickly as possible. • Ensure that the ~600 Windows CPUs currently part of Condor pools are integrated as soon as possible; this will give ~800 CPUs in total. • Start accepting users from outside the University as part of our commitment to the National Grid Service. • Run the Bristol systems as part of the WUNGrid.
Further Information • Centre for e-Research Bristol: http://escience.bristol.ac.uk • Email: david.wallom@bristol.ac.uk • Telephone: +44 (0)117 928 8769