SCMS: An Extensible Cluster Management Tool for Beowulf Cluster
Putchong Uthayopas, Thara Angsakul, Jullawadee Maneesilp
Parallel Research Group, Computer and Network System Research Laboratory
Department of Computer Engineering, Faculty of Engineering
Kasetsart University, Bangkok, Thailand
Phone: (662) 942 8555 Ext. 1416 Fax: (662) 5614621
Email: pu@smile.cpe.ku.ac.th
Motivation
• Beowulf clusters have become one of the most widely used platforms for high performance computing
• Very large and complex Beowulf clusters are starting to appear
• System management is still a challenging task. There is a need for
  • An effective way to navigate and interact with cluster components
  • Mechanisms and tools to perform collective commands
  • Services such as monitoring, fault detection, and recovery
  • Special software tools that recognize the special characteristics and needs of the cluster administration task
SCMS: An Extensible Cluster Management Tool for Beowulf Cluster
• A collection of system management tools for Beowulf clusters
• The package includes
  • Portable real-time monitoring
  • Parallel Unix commands
  • Alarm system
  • A large collection of graphical user interface tools for users and system administrators
    • Checking user status
    • Remote software installation
    • System disk space and process space status
    • Booting up and shutting down nodes
    • Changing node configuration remotely
  • Web/VRML interface
• The current version, 1.1, supports only RedHat Linux
Portable Real-time Monitoring
• Provides global access to node information
  • Interfaces with the local OS to obtain node information
  • Collects the information at a single point
• Provides heartbeat and node health diagnostics
• Provides an API for applications to access the information. The API is available in C, Java, and Tcl/Tk.
• System architecture
  • Client/server
  • Layered architecture
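The heartbeat and node health idea above can be illustrated with a minimal sketch. The slides do not describe how SCMS implements heartbeats, so the class name `HeartbeatTable`, the `timeout` parameter, and the "up"/"down" labels are all assumptions made for illustration. The sketch is in Python for brevity, although the SCMS API itself is offered in C, Java, and Tcl/Tk.

```python
import time
from dataclasses import dataclass, field


@dataclass
class HeartbeatTable:
    """Hypothetical table of the last heartbeat time seen from each node."""
    timeout: float = 10.0  # assumed: seconds of silence before a node counts as down
    last_seen: dict = field(default_factory=dict)

    def beat(self, node, now=None):
        """Record a heartbeat from a node (now is injectable for testing)."""
        self.last_seen[node] = time.time() if now is None else now

    def status(self, now=None):
        """Classify each known node as 'up' or 'down' by heartbeat age."""
        now = time.time() if now is None else now
        return {
            node: "up" if now - seen <= self.timeout else "down"
            for node, seen in self.last_seen.items()
        }
```

A management node would call `beat` whenever a node reports in and `status` whenever it needs a health summary; a real diagnostic would likely look at more than heartbeat age.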
System Architecture
[Diagram: layered stack, as labeled: Configuration Management, Task Scheduling, Performance Monitoring, Parallel Unix command; Resource Management API (C, TCL, Java); SMA with System Information Repository; multiple CMAs; HAL API; HAL; local OS (Linux)]
• CMA - Control and Monitoring Agent
  • Gets system information from the local operating system on each node
  • Portability is achieved using the HAL (Hardware Abstraction Layer)
• SMA - System Management Agent
  • Runs on the management node to collect information from the CMAs
• RMI - Resource Management Interface
  • Library that provides an interface to the functionality of the SMA
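The layering described above (HAL under each CMA, CMAs feeding the SMA's information repository) can be sketched as follows. This is an illustration of the layering only, not the SCMS code: the class and method names are hypothetical, the agents call each other in-process instead of over the network, and reading `/proc/loadavg` and `/proc/meminfo` is one assumed way a Linux HAL might obtain its data.

```python
from abc import ABC, abstractmethod


class HAL(ABC):
    """Hypothetical hardware abstraction layer: the only OS-specific part."""

    @abstractmethod
    def load_average(self) -> float: ...

    @abstractmethod
    def mem_free_kb(self) -> int: ...


class LinuxHAL(HAL):
    """Assumed Linux implementation that reads the /proc file system."""

    def load_average(self) -> float:
        with open("/proc/loadavg") as f:
            return float(f.read().split()[0])

    def mem_free_kb(self) -> int:
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemFree:"):
                    return int(line.split()[1])
        raise RuntimeError("MemFree not found in /proc/meminfo")


class FakeHAL(HAL):
    """Stand-in HAL with fixed values, useful for trying the sketch anywhere."""

    def __init__(self, load: float, mem_free_kb: int):
        self._load, self._mem = load, mem_free_kb

    def load_average(self) -> float:
        return self._load

    def mem_free_kb(self) -> int:
        return self._mem


class CMA:
    """Per-node agent: asks its HAL for data and packages a report."""

    def __init__(self, node: str, hal: HAL):
        self.node, self.hal = node, hal

    def report(self) -> dict:
        return {"node": self.node,
                "load": self.hal.load_average(),
                "mem_free_kb": self.hal.mem_free_kb()}


class SMA:
    """Management-node agent: gathers CMA reports into one repository."""

    def __init__(self):
        self.repository = {}

    def collect(self, agents):
        for agent in agents:
            report = agent.report()
            self.repository[report["node"]] = report

    def query(self, node: str) -> dict:
        return self.repository[node]
```

Porting to another operating system would, under this design, mean writing another `HAL` subclass while `CMA` and `SMA` stay unchanged.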
Parallel Unix Command
[Diagram labels: pps -aux; command sent to each node; ps -aux on each node; data returned]
• Parallel versions of commonly used Unix commands, such as pps, pls, and prm
• Follows the scalable Unix tool model (Lusk and Gropp 1994)
• Graphical user interface for these commands
  • Ease of use
  • Filtering of output data
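The fan-out pattern behind a command like pps can be sketched as: send the same command to every node, gather each node's output, and optionally filter it. The slides do not say how SCMS transports commands, so the use of `ssh` in `ssh_runner` (assuming passwordless ssh is configured) is an assumption, and the names `parallel_command`, `runner`, and `keep` are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_runner(node, command):
    """Run a command on one node via ssh (assumed transport, not SCMS's own)."""
    result = subprocess.run(["ssh", node, command],
                            capture_output=True, text=True, timeout=30)
    return result.stdout


def parallel_command(nodes, command, runner=ssh_runner, keep=None):
    """Run the same command on all nodes concurrently.

    Returns {node: [output lines]}; if keep is given, only lines for which
    keep(line) is true are retained (a simple form of output filtering).
    """
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        outputs = list(pool.map(lambda node: runner(node, command), nodes))
    results = {}
    for node, output in zip(nodes, outputs):
        lines = output.splitlines()
        if keep is not None:
            lines = [line for line in lines if keep(line)]
        results[node] = lines
    return results
```

For example, `parallel_command(["node1", "node2"], "ps -aux", keep=lambda l: "alice" in l)` would list one user's processes across two nodes. Passing a different `runner` lets the sketch be tried without a cluster.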
Alarm System
[Diagram labels: Config; Alarm Manager; Detectors; Notification/action]
• A set of daemons that monitor important system parameters
  • Processor utilization, memory usage, main board temperature, and more
• Users can specify the alarm condition and the action to be taken
• Issues the alarm and shuts down part of the system if needed
• Notification is sent by email. Future releases will include pager, ICQ, and speech synthesis
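The pairing of user-specified conditions with actions can be sketched as a list of rules that an alarm manager evaluates against readings from the detectors. The SCMS configuration format is not shown in the slides, so `AlarmRule`, `AlarmManager`, the parameter names, and the thresholds in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AlarmRule:
    """Hypothetical rule: when condition(value) holds, run the action."""
    parameter: str                               # e.g. "cpu_util", "board_temp_c"
    condition: Callable[[float], bool]           # user-specified alarm condition
    action: Callable[[str, str, float], None]    # called as action(node, parameter, value)


class AlarmManager:
    """Evaluates every rule against one node's detector readings."""

    def __init__(self, rules):
        self.rules = list(rules)

    def check(self, node, readings):
        """Fire matching rules and return the names of the parameters that fired."""
        fired = []
        for rule in self.rules:
            value = readings.get(rule.parameter)
            if value is not None and rule.condition(value):
                rule.action(node, rule.parameter, value)
                fired.append(rule.parameter)
        return fired
```

A rule such as `AlarmRule("board_temp_c", lambda t: t > 70.0, notify)` would call a user-supplied `notify` function (for instance, one that sends email) when a node reports a board temperature above an assumed 70 degree threshold.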
SCMS Utilities
SCMS comes with many GUI utilities
• Node status
• Control Panel
• Disk Space
• Process Status
• Shutdown/Reboot
• Remote login
• User status
• Package Installation
Web and VRML based Interface for SCMS
[Diagram labels: Web Generator; Web Tree; System Config; Web server; VRML World Generator; VRML World; External Network; Real time Monitoring; KCAP]
• Two versions of the Web interface are available
  • KCAP: Normal web interface
  • KCAP-VR: VRML interface that allows you to walk through and interact with your cluster
• A Java applet is used to report real-time system information
Future Works
[Diagram labels: Application; MPI; KSIX (Kasetsart System Interconnect eXecutive); Node OS; Node Hardware; Interconnection Network]
• KSIX: A framework to support parallel tools and applications
• Offers features such as
  • Process control and signal delivery
  • Naming services
  • Event-based communication
SQMS: SMILE Queuing Management System
[Diagram labels: Submitter; Task; Task Queue; Remote Queue; Scheduler; Node Allocator; Cluster Nodes]
• Batch scheduler for sequential and parallel tasks
• Static and dynamic load balancing
• Reconfigurable scheduling policy
• Auto docking between clusters
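One way to read "reconfigurable scheduling policy" is that the queue ordering is a pluggable function, with a node allocator handing out free nodes to tasks in that order. The slides give no detail on how SQMS does this, so the sketch below is purely illustrative: `Task`, `Scheduler`, and the `fifo` and `smallest_first` policies are hypothetical names, and load balancing and remote queues are not modeled.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Task:
    name: str
    nodes_needed: int   # 1 for a sequential task, more for a parallel task
    submit_order: int


def fifo(task):
    """Policy: run tasks in submission order."""
    return task.submit_order


def smallest_first(task):
    """Policy: prefer tasks that need fewer nodes, ties broken by submission order."""
    return (task.nodes_needed, task.submit_order)


class Scheduler:
    def __init__(self, free_nodes, policy=fifo):
        self.free_nodes = list(free_nodes)
        self.policy = policy            # can be swapped at any time
        self.queue = []
        self._order = count()

    def submit(self, name, nodes_needed):
        self.queue.append(Task(name, nodes_needed, next(self._order)))

    def dispatch(self):
        """Start every queued task that fits, visiting tasks in policy order.

        Tasks that do not fit are skipped and stay queued, so a later,
        smaller task may start ahead of them.
        """
        started = {}
        for task in sorted(self.queue, key=self.policy):
            if task.nodes_needed <= len(self.free_nodes):
                started[task.name] = [self.free_nodes.pop(0)
                                      for _ in range(task.nodes_needed)]
                self.queue.remove(task)
        return started
```

Changing `scheduler.policy` between calls to `dispatch` changes the order in which waiting tasks are considered without touching the allocation code.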