Addressing the crisis that incompatible tools have created in computer centers by developing standardized interfaces for the effective management and utilization of terascale computational resources.
Scalable Systems Software for Terascale Computer Centers
Coordinator: Al Geist
Participating Organizations: ORNL, ANL, LBNL, PNNL, SNL, LANL, Ames, NCSA, PSC, SDSC, IBM, Compaq, SGI, Scyld, Intel, Unlimited Scale
www.scidac.org/ScalableSystems
The Problem Today
System administrators and managers of terascale computer centers are facing a crisis:
• Computer centers use an incompatible, ad hoc set of systems tools
• Present tools are not designed to scale to multi-teraflop systems
• Commercial solutions are not emerging because business forces drive industry towards servers, not HPC
Scope of the Effort
• Submit jobs to batch queue
• Resource & queue management
• Accounting & user management
• Allocation management
• Fault tolerance
• Checkpoint/restart
• Security
• Job monitoring
• System monitoring
• System build & configure
• Start parallel processes
• Job management
Goals
• Collectively (with industry) agree on and specify standardized interfaces between system components in order to promote interoperability, portability, and long-term usability. The specification will proceed through a series of open meetings following a format similar to that used by the MPI Forum.
• Produce a fully integrated suite of systems software and tools for the effective management and utilization of terascale computational resources, particularly those at the DOE facilities.
• Research and develop more advanced versions of the components required to support the scalability, fault tolerance, and performance requirements of large science applications.
• Carry out a software lifecycle plan for support and maintenance of the systems software suite.
Impact
• Fundamentally change the way future high-end systems software is developed and distributed
• Reduced facility management costs
    • reduce need to support ad hoc software
    • better systems tools available
    • able to get machines up and running faster and keep them running
• More effective use of machines by scientific applications
    • scalable launch of jobs and checkpoint/restart
    • job monitoring and management tools
    • allocation management interface
Four Working Groups to interact with
• Node build, configuration, and information service
• Resource management, scheduling, and allocation
• Process management, system monitoring, and checkpointing
• Validation and integration
Electronic Notebooks keep the working groups on track
A main notebook holds general information and meeting notes, and each working group has its own notebook.
• Allows groups to keep track of other groups' progress and comment on the items of overlap
• Allows Center members and interested parties to see what is being defined and implemented
Interactions
Principal customers are sysadmins and supercomputer managers.
CCA looks to Scalable Systems to provide services to launch parallel components on large systems and to provide event services for fault detection and monitoring.
DOE Science Grid will be involved with Scalable Systems through its integration of Grid tools with the monitoring and resource management services layer of the systems software.
Applications using the terascale SciDAC resources, including climate, accelerator design, and astrophysics, will use the job submission, job monitoring, user-assisted checkpointing, and allocation tools developed by the Center.
Other organizations and vendors are participating in the Scalable Systems effort even though they are not funded by SciDAC.