Dr. David Wallom Experience of Setting up and Running a Production Grid on a University Campus July 2004
Outline • The Centre for e-Research Bristol & its place in national efforts • University of Bristol Grid • Available tool choices • Support models for a distributed system • Problems encountered • Summary
Centre for e-Research Bristol • Established as a Centre of Excellence in visualisation. • Currently has one full-time member of staff with several shared resources. • Intended to lead the University e-Research effort, including as many departments and non-traditional computational users as possible.
NGS (www.ngs.ac.uk) UK National Grid Service • ‘Free’ dedicated resources accessible only through Grid interfaces, i.e. GSI-SSH, Globus Toolkit • Compute clusters (York & Oxford) • 64 dual CPU Intel 3.06 GHz nodes, 2GB RAM • Gigabit & Myrinet networking • Data clusters (Manchester & RAL) • 20 dual CPU Intel 3.06 GHz nodes, 4GB RAM • Gigabit & Myrinet networking • 18TB Fibre SAN • Also national HPC resources: HPC(x), CSAR • Affiliates: Bristol, Cardiff, …
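For illustration, a session on such Grid-interface-only resources uses the standard Globus Toolkit 2 and GSI-SSH client tools; the host name below is a placeholder rather than a real NGS node:
    grid-proxy-init                                         # create a short-lived proxy from your e-Science certificate
    gsissh <ngs-node>                                       # interactive login over GSI-SSH
    globus-job-run <ngs-node>/jobmanager-pbs /bin/hostname  # run a simple job via the Globus gatekeeper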
The University of Bristol Grid • Established as a way of leveraging extra use from existing resources. • Planned to consist of ~400 CPUs ranging from 1.2 to 3.2 GHz arranged in 6 clusters; currently about 100 CPUs in 3 clusters. • Initially running legacy OSes, though all are now moving to Red Hat Enterprise Linux 3. • Based in and maintained by several different departments.
The University of Bristol Grid • Decided to construct a campus grid to gain experience with middleware & system management before formally joining the NGS. • Central services all run on Viglen servers: • Resource Broker • Monitoring and Discovery Service & systems monitoring • Virtual Organisation Management • Storage Resource Broker Vault • myProxy Server • The choice of software to provide these was led by personal experience & other UK efforts to standardise.
The University of Bristol Grid, 2 • Based in and maintained by several different departments. • Each system with a different System Manager! • Different OSes, initially just Linux & Windows, though others will come. • Linux versions initially legacy, though all now moving to Red Hat Enterprise Linux.
System Installation Model Draw it on the board!
Middleware • Virtual Data Toolkit. • Chosen for stability and support structure. • Widely used in other European production grid systems. • Contains the standard Globus Toolkit version 2.4 with several enhancements.
Resource Brokering • Uses the Condor-G job distribution mechanism. • A custom script determines resource priority. • Integrated the Condor job submission system with the Globus Monitoring and Discovery Service.
Accessing the Grid with Condor-G • Condor-G allows the user to treat the Grid as a local resource, with the same command-line tools performing basic job management such as: • Submit a job, indicating an executable, input and output files, and arguments • Query a job's status • Cancel a job • Be informed when events happen, such as normal job termination or errors • Obtain access to detailed logs that provide a complete history of a job • Condor-G extends basic Condor functionality to the Grid, providing resource management together with fault tolerance and exactly-once execution semantics.
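A minimal Condor-G submit description of this era (the "globus" universe) might look like the sketch below; the gatekeeper host and file names are purely illustrative:
    universe        = globus
    globusscheduler = <gatekeeper-host>/jobmanager-pbs
    executable      = sim.exe
    output          = sim.out
    error           = sim.err
    log             = sim.log
    queue
The job is then handled with the familiar local-pool commands: condor_submit to submit, condor_q to query and condor_rm to cancel.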
Limitations of Condor-G • Submitting jobs to run under Globus has not yet been perfected. Known limitations include: • No checkpointing. • No job exit codes are returned. • Limited platform availability: Condor-G is only available on Linux, Solaris, Digital UNIX, and IRIX; HP-UX support will hopefully follow later.
Load Management • The broker only sees the raw numbers of jobs running, idle & with problems. • It has little measure of the relative performance of nodes within the grid. • Once a job has been allocated to a remote cluster, rescheduling it elsewhere is difficult.
Provision of a Shared Filesystem • Running a grid makes it beneficial to provide a shared file system. • The newest machines come with a minimum of 80GB hard drives, of which only a small part is needed for the OS & user scratch space. • The system will have a 1TB Storage Resource Broker Vault as one of the core services. • Take this one step further by partitioning the system drives on core servers: • Create a virtual disk of ~400GB using the spare space on them all! • Install the SRB client on all machines so that they can directly access the shared storage.
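As a sketch of what that direct access would look like, the SRB client provides Unix-style "S-commands"; the file name here is hypothetical:
    Sinit              # start an SRB session using the user's ~/.srb client configuration
    Sput results.dat   # copy a local file into the current SRB collection
    Sls                # list the collection contents
    Sget results.dat   # retrieve the file again
    Sexit              # close the session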
Automation of Processes for Maintenance • Installation • Grid state monitoring • System maintenance • User control • Grid Testing
Individual System Installation • Simple shell scripts for overall control. • Ensures middleware, monitoring and user software are all installed in a consistent place. • Ensures ease of system upgrades. • Ensures system managers have a chance to review the installation method beforehand.
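The real scripts are site-specific; a minimal sketch of this kind of wrapper, with entirely hypothetical paths and helper names, is:
    #!/bin/sh
    # Sketch only: install middleware, monitoring client and user software
    # under one agreed prefix so every cluster looks the same.
    PREFIX=/opt/uobgrid                     # hypothetical standard install root
    mkdir -p $PREFIX
    ./install-vdt.sh      --prefix $PREFIX  # hypothetical helper: VDT middleware
    ./install-bbclient.sh --prefix $PREFIX  # hypothetical helper: Big Brother client
    ./install-userapps.sh --prefix $PREFIX  # hypothetical helper: user application software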
Ensuring System Availability • Uses the Big Brother™ system. • Monitoring occurs through a client-server model. • The server maintains the alert limits and pings the listed resources. • Clients record system information and report it to the server over a secure port.
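For illustration only (hosts and test tags are placeholders), the server side is driven by a bb-hosts file listing each resource and the network checks to run against it:
    # bb-hosts (illustrative): IP address, hostname, then the tests to apply
    10.0.0.10   grid-rb.example.ac.uk    # conn ssh
    10.0.0.11   grid-mds.example.ac.uk   # conn ssh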
Grid Middleware Testing • Uses the Grid Interface Test Script (GITS) developed for the ETF. • Tests the following: • Globus Gatekeeper running and available. • Globus job submission system. • Presence of the machine within the Monitoring & Discovery Service. • Ability to retrieve and distribute files through GridFTP. • Run within the UoB grid every 3 hours. • Latest results available on the Service webpage. • Only downside is that it needs to run as a standard user rather than a system account.
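The three-hourly run is simply a cron job on the test server; the paths below are hypothetical:
    # crontab entry (illustrative): run GITS every 3 hours as an ordinary grid user
    0 */3 * * *  /home/gridtest/gits/run-gits.sh >> /home/gridtest/gits/gits.log 2>&1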
Authorisation And Authentication on the University of Bristol Grid • Makes use of the standard UK e-Science Certification Authority. • Bristol is an authorised Registration Authority for this CA. • Uses X.509 certificates and proxies for user AAA. • May be replaced at a later date, dependent on how the current system scales.
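In day-to-day use this means the standard Globus proxy workflow, sketched below; the server name and username are placeholders:
    grid-proxy-init -hours 12                       # create a 12-hour proxy from the UK e-Science certificate
    grid-proxy-info                                 # check the subject, issuer and remaining lifetime
    myproxy-init -s <myproxy-server> -l <username>  # optionally delegate a longer-lived credential to the myProxy server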
User Management • Globus uses a mapping from the Distinguished Name (DN) in a user's digital certificate to a local username on each resource. • Located in controlled disk space. • It is important that, for each resource a user expects to use, their DN is mapped locally. • Distributing this mapping is Virtual Organisation Management.
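The mapping itself is the Globus grid-mapfile (conventionally /etc/grid-security/grid-mapfile), one line per authorised user; the DN below is made up:
    # grid-mapfile entry (illustrative): quoted certificate DN, then the local account
    "/C=UK/O=eScience/OU=Bristol/L=IS/CN=jane bloggs" jbloggs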
Virtual Organisation Management and Resource Usage Monitoring/Accounting
Virtual Organisation Management and Resource Usage Monitoring/Accounting, 2 • The server (previous slide) runs as a grid service using the ICENI framework. • Clients are located on the machines that form part of the Virtual Organisation. • The current drawback is that this service must run using a personal certificate instead of the machine certificate that would be ideal. • A fix is coming in new versions from OMII.
Locally Supporting a Distributed System • Within the university, the first point of contact is always the Information Services Helpdesk. • They are given a preset list of questions to ask and log files to check, where available. • They are not expected to do any actual debugging. • Problems are passed on to the Grid experts, who then pass them on a system-by-system basis to the relevant maintenance staff. • As one of the UK e-Science Centres we also have access to the Grid Operations and Support Centre.
Supporting a Distributed System • Having a system that is well defined simplifies the support model. • Trying to define a Service Level Description from each department to the UoBGrid, as well as an overall UoBGrid Service Level Agreement to users. • Defines hardware support levels and availability. • Defines, at a basic level, the software support that will also be available.
Problems Encountered • Some of the middleware we have been trying to use has not been as reliable as we would have hoped. • MDS is a prime example, where the need for reliability has defined our usage model. • More software than we would like still has to run as a user with an individual DN. This must change for a production system. • Getting time and effort from some already overworked System Managers has been tricky, with sociological barriers: • “Won’t letting other people use my system just mean I will have less available for me?”
Notes to think about! • Choose your test application carefully. • Choose your first test users even more carefully! • One user with a bad experience outweighs 10 with good experiences. • The Grid has been very over-hyped, so people expect it all to work first time, every time!
Future Directions within Bristol • Make sure the rest of the University's clusters are installed and running on the UoBGrid as quickly as possible. • Ensure that the ~600 Windows CPUs currently part of Condor pools are integrated as soon as possible; this will give ~800 CPUs in total. • Start accepting users from outside the University as part of our commitment to the National Grid Service. • Run the Bristol systems as part of the WUNGrid.
Further Information • Centre for e-Research Bristol: http://escience.bristol.ac.uk • Email: david.wallom@bristol.ac.uk • Telephone: +44 (0)117 928 8769