This document outlines the implementation of the SCS Program, initiated in January 2003, with a focus on developing cluster support services for scientific computing. It discusses the program's strategic goals, methodologies for support, staffing requirements, cost management, and the importance of standardization. Key success factors include effective steering, funding, and collaborative efforts to enhance the computing capabilities of scientists. The document also highlights challenges like scalability, technology integration, and fostering collaboration among users.
Building a Cluster Support Service
Implementation of the SCS Program
UC Computing Services Conference
Gary Jung, SCS Project Manager
http://scs.lbl.gov/
August 8, 2005

Agenda
SCS Program
• Overview
• Implementation
• Areas for collaboration
UCCSC – August 8, 2005

Background
• The 1990s – Computing at the desktop
• The "gap" between desktops and NERSC
• 2001 – MRC Working Group
  • Large institutional system originally envisioned by the working group
  • Users not ready to share a large system
  • Recommendation to support Linux clusters
• December 2002 – SCS Program approved
  • $1.3M four-year program started January 2003
  • Ten strategic science projects selected
  • IT Division provides support for Linux clusters
• Goals
  • Enable our scientists to use and take advantage of computing
  • HPC that works. Avoid security incidents, lost time, and expensive mistakes
  • More effective science

Strategy
• Planning
  • Formal project management methods
  • Steering Committee
• Support Methodology
  • Use proven technical approaches that enable us to quickly provide production capability
  • Adopt standards to facilitate scaling support to several clusters
• Staffing
  • Need to develop a core of expertise to address changes in technology (e.g. 64-bit Linux, kernel hacking, cluster management, Myrinet, MPI, schedulers)
• Costs
  • Drive down support costs

Support Methodology: Balance Choice vs. Standardization
• Users have choice over the components that are important to them (e.g. CPU, memory, interconnect)
• We standardize on the aspects that allow us to scale support and reduce costs
• Leading, but not bleeding. No exotic stuff (e.g. no Lustre yet)
• On the other hand, tightly coupled parallel systems are a must to push the paradigm shift
• Remember that the goal is a production system
• The real trick is in the integration: making the correct choices so that the components all work together and perform well

Support Methodology: The Standard
• Hardware – ia32 or AMD64
• Interconnect – GigE, Myrinet, or InfiniBand
• Operating system – Red Hat Enterprise Linux or CentOS
• LBNL Warewulf Cluster Toolkit (http://warewulf-cluster.org)
• MPI implementation – LAM-MPI
• Scheduler – Sun Grid Engine or Torque
• Monitoring – Nagios, Ganglia (http://metacluster.lbl.gov)
• Cybersecurity – Host-based measures, PIX Firewall, OTP, specific user policies

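One way to picture how a standard like this supports scaling is to treat it as a small table of allowed choices and check each proposed cluster against it. The sketch below is illustrative only: the dictionary layout, the `STANDARD` name, and the `nonstandard_choices` helper are assumptions made for this example, not part of the SCS tooling. Only the component values come from the slide, and the comparison ignores letter case.

```python
# Hypothetical encoding of the standard listed on the slide.
# The structure and names are assumptions made for illustration.
STANDARD = {
    "hardware": {"ia32", "amd64"},
    "interconnect": {"gige", "myrinet", "infiniband"},
    "os": {"red hat enterprise linux", "centos"},
    "mpi": {"lam-mpi"},
    "scheduler": {"sun grid engine", "torque"},
}


def nonstandard_choices(cluster):
    """Return {component: value} for every choice outside the standard.

    A missing component is reported with the value None.
    """
    problems = {}
    for component, allowed in STANDARD.items():
        value = cluster.get(component)
        if value is None or value.lower() not in allowed:
            problems[component] = value
    return problems


# Example: a proposed cluster that picks an MPI outside the standard.
proposed = {
    "hardware": "AMD64",
    "interconnect": "Myrinet",
    "os": "CentOS",
    "mpi": "MPICH",
    "scheduler": "Torque",
}
print(nonstandard_choices(proposed))
```

A check like this keeps the user's free choices (hardware, interconnect) inside a known set while flagging anything the support team has not agreed to carry.
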
Staffing
• Need a team with specialized skills to meet technical expertise requirements
• Limited funding, tight timeline
• Team roles – Division of responsibilities
  • Project management, facilities planning
  • Technology and procurement
  • Cluster architect, OS, kernel, MPI expert
  • Scheduler expert
  • Cluster installation and support
• 1.6 FTE total – 10 SCS clusters, 295 nodes

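The staffing figures on this slide imply some support ratios that the slide does not spell out. The short calculation below derives them from the three quoted numbers (1.6 FTE, 10 clusters, 295 nodes); the variable names are chosen for this example only, and the ratios are simple averages, not figures reported by the program.

```python
# Figures quoted on the slide.
fte_total = 1.6
clusters = 10
nodes = 295

# Derived averages (not stated in the source, computed here).
nodes_per_cluster = nodes / clusters
nodes_per_fte = nodes / fte_total
clusters_per_fte = clusters / fte_total

print(f"Average cluster size: {nodes_per_cluster:.1f} nodes")
print(f"Nodes supported per FTE: {nodes_per_fte:.0f}")
print(f"Clusters supported per FTE: {clusters_per_fte:.2f}")
```

On these numbers the average cluster has about 30 nodes, and each FTE supports roughly 184 nodes across a little more than six clusters.
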
Costs: Driving Down Costs
• Standardization of components is critical
• In-house integration reduces hardware costs and facilitates standards
• Leverage relations with the open source community
• Outsourcing of various pieces – wiring, seismic
• Long-term planning for electrical infrastructure saves on cost
• Develop lower-cost staff – college interns
• Competitive bid procurement
• Benchmarking costs – other national labs, private industry

Success Factors
• Adherence to standards
• Effective Steering Committee
• Initial funding key to get started
• Prominent scientists were our customers
  • Funding, visibility, ROI
• Talented, motivated staff
  • Creativity allowed, but the focus is on production use
• Transparent costing model

Collaboration: What do we have from this?
• Methodology for cluster support
• New consulting offerings
  • Cluster architecture
  • Procurement specification
  • Facilities planning
• Development of cluster support business
  • Effort/cost analysis
  • Recharge model
• LBNL Warewulf software
  • GPL, 20,000 downloads

Collaboration: Challenges
• Larger systems
  • Scalability issues, e.g. parallel filesystems, power and cooling
  • Moving up the technology curve – InfiniBand, PCI-E
  • Assessing integration risks – Will it work? How will it perform?
  • Harder problems to debug
• Getting scientists to share a system
• New services – User facilities, application support
• Computer room space
• Funding and funding models
• Driving down costs further
• Charting the path forward