This document provides an in-depth look at managing high-performance computing (HPC) systems with the Cluster Systems Management (CSM) framework at the San Diego Supercomputer Center (SDSC). It outlines the functionalities of DataStar, including the software environment (AIX 5.2 ML3 and CSM 1.3.3.1), node setup, system monitoring, command-line management, and event-driven notifications. Additionally, it covers how to handle configuration synchronization, security monitoring, and condition setups for proactive system management.
High End Computing at SDSC CSM Cluster Management Eva Hocks San Diego Supercomputer Center 2007
Managing the HPC systems: DataStar
• System Software:
  • AIX 5.2 ML3
  • CSM 1.3.3.1
  • RSCT 2.3.3.3
• System Management with CSM:
  • Node setup
  • Node Groups
    • Per frame
    • Per function (NPACI, TG, POE, login, batch)
CSM setup nodes
• Configure Nodes
• lshwinfo -p hmc -c dshmc07.sdsc.edu > /tmp/fr8_9
• vi /tmp/fr8_9 : replace noname with cec_name
  no_hostname::hmc::dshmc07.sdsc.edu::fr9-cg13::001::7039::651::02151FF
  ds100::hmc::dshmc07.sdsc.edu::fr8-cg1::001::7039::651::021519
• definenode -f /tmp/fr8_9 InstallOSName=AIX
• systemid -p hmc hscroot
• getadapters -n ds100 -z /tmp/ds100_adapters
  write to CSM database, include Federation_switch adapters
• csm2nimnodes -n 'ds100' type='standalone' network_name='sdsc_net' platform='chrp' netboot_kernel='mp'
• netboot -n ds100
• updatenode -n ds100
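The manual `vi` edit of the hardware list could also be scripted. The following is a minimal sketch, assuming the file layout shown on the slide (fields separated by `::`, CEC name in the fourth field) and assuming the placeholder is replaced with the node's host name, as the `ds100` line suggests. The function name `set_node_hostname` and the host name `ds101` are hypothetical and not part of CSM.

```shell
# Replace the no_hostname placeholder on the line whose fourth
# "::"-separated field matches the given CEC name.
# Hypothetical helper for illustration; it is not a CSM command.
set_node_hostname() {
    file=$1; cec=$2; host=$3
    awk -F'::' -v cec="$cec" -v host="$host" '
        BEGIN { OFS = "::" }
        # Only touch placeholder lines that belong to the requested CEC.
        $1 == "no_hostname" && $4 == cec { $1 = host }
        { print }
    ' "$file"
}
```

It might be used as `set_node_hostname /tmp/fr8_9 fr9-cg13 ds101 > /tmp/fr8_9.new`, with the result reviewed before it is passed to `definenode -f`.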
ds100:
    MAC_address=00096B34E093
    adapter_duplex=full
    adapter_speed=100
    cable_type=N/A
    install_server=192.168.236.31
    interface_name=en0
    location=U1.32-P1-H1/E1
    machine_type=install
    netaddr=
    network_type=en
    subnet_mask=
ds100:
    machine_type=secondary
    interface_name=sn1
    network_type=sn
    netaddr=
    subnet_mask=
    location=U1.5-P1-H1/Q2
ds100:
    machine_type=secondary
    interface_name=sn0
    network_type=sn
    netaddr=
    subnet_mask=
    location=U1.5-P1-H1/Q1
CSM_ADAPTERS_STANZA_FILE
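Several `netaddr=` entries in the stanza file above are empty. A quick check for such entries could look like the sketch below. It assumes the file holds one `node:` header line per stanza followed by one `attribute=value` pair per line; the function name `empty_netaddr` is hypothetical.

```shell
# Print "node interface" for every stanza whose netaddr value is empty.
# Assumes a "node:" header line followed by attr=value lines.
empty_netaddr() {
    awk '
        function report() {
            if (node != "" && !has_addr) print node, ifname
        }
        # A header line such as "ds100:" starts a new stanza.
        /^[^ \t=]+:[ \t]*$/ {
            report()
            node = $1; sub(/:$/, "", node)
            ifname = "?"; has_addr = 0
            next
        }
        { line = $0; sub(/^[ \t]+/, "", line) }
        line ~ /^interface_name=/ { ifname = line; sub(/^interface_name=/, "", ifname) }
        line ~ /^netaddr=[^ \t]/  { has_addr = 1 }
        END { report() }
    ' "$1"
}
```

Running `empty_netaddr /tmp/ds100_adapters` would list the adapters that still need an address filled in.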
Managing the HPC systems: DataStar
• System Management with CSM:
  • Management through command line
    • rpower
      • Power on/off, query node status
      • Install node: netboot -n ds100
    • dsh
      • Install updates on nodes (installp, rpm, emgr)
      • Monitor processes on nodes
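As an illustration of monitoring processes with `dsh`, the sketch below filters `dsh`-style output, where each line is prefixed with `node: `. The function name `nodes_missing_process` is hypothetical, and the remote command in the comment is an assumption about how a per-node process count could be collected, not the procedure used on DataStar.

```shell
# Read dsh-style output ("node: count") on stdin and print the nodes
# whose reported process count is 0.
# Possible producer (assumption, runs the pipeline on each node):
#   dsh -a 'ps -e -o comm= | grep -c LoadL_startd'
nodes_missing_process() {
    # Compare as a string so that lines without a count never match.
    awk -F': *' '$2 == "0" { print $1 }'
}
```

Piping the `dsh` output into `nodes_missing_process` would then give a short list of nodes to investigate.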
Managing the HPC systems: DataStar continued…
• System configuration: cfmupdatenode
  • Synchronize system configuration modifications with nodes and system admins
  • Run pre/post scripts to capture security risks and send notification
• System monitoring: Distributed Monitoring responds (GUI configured)
  • Event-driven email notification for on-call personnel
  • GUI monitoring for operations personnel
CSM Event Monitoring
• GUI Event Monitoring
  • Critical Conditions:
    • AnyNodeTmpFull
    • AnyNodeVarSpace
    • AnyNodeSwitchResponds
    • LoadLeverProcess
    • hostResponds (see setting up ERRM Conditions)
  • Warning Conditions:
    • Processor State
CSM Event Monitoring: setting up ERRM Conditions
• hostResponds ERRM condition (redbook SG24-6953 page 193)
• mkcondition -r IBM.ManagedNode \
    -e "Status!=1" -E "Status==1" \
    -d "Node hostResponds down" \
    -D "Node hostResponds up" \
    -m l hostResponds
• mkresponse -n LogStatustoFIFO \
    -s /usr/local/bin/LogStatusData \
    -E "STATUS_FILE=/var/adm/spmondata" LogStatusData
• mkcondresp "hostResponds" "LogStatusData"
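The slides do not show the contents of `/usr/local/bin/LogStatusData`. The sketch below illustrates what a response script of this kind could do: append one line per event to the file named by `STATUS_FILE`. The `ERRM_*` variables are set by RSCT when it runs a response script; the function name `log_status_data` and the `|`-separated line layout are assumptions, not the format used at SDSC.

```shell
# Append one status line per event to the file named by STATUS_FILE.
# ERRM_TIME, ERRM_NODE_NAME, ERRM_COND_NAME and ERRM_TYPE come from the
# environment RSCT provides to response scripts; the output layout is
# hypothetical.
log_status_data() {
    out=${STATUS_FILE:-/var/adm/spmondata}
    printf '%s|%s|%s|%s\n' \
        "${ERRM_TIME:-unknown}" \
        "${ERRM_NODE_NAME:-unknown}" \
        "${ERRM_COND_NAME:-unknown}" \
        "${ERRM_TYPE:-unknown}" >> "$out"
}
```

Because `ERRM_TYPE` distinguishes the event from the rearm event, a single script can record both the "down" and the "up" transition of `hostResponds`.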
Warning Event email:
=====================================
Monday 07/26/04 19:12:34
Condition Name: LoadLProcess
Severity: Warning
Event Type: Event
Expression: Processes.CurPidCount <= 0
Resource Name: ProgramName == 'LoadL_startd' && Filter == 'ruser== root '
Resource Class: IBM.Program
Data Type: CT_SD_PTR
Data Value: [0,1,{},{282654}]
Node Name: ds243
Node NameList: {ds243}
Resource Type: 0
=====================================
Rearm email:
=====================================
Monday 07/26/04 19:13:32
Condition Name: LoadLProcess
Severity: Warning
Event Type: Rearm event
Expression: Processes.CurPidCount > 0
Resource Name: ProgramName == 'LoadL_startd' && Filter == 'ruser== root '
Resource Class: IBM.Program
Data Type: CT_SD_PTR
Data Value: [1,0,{270492},{270492}]
Node Name: ds243
Node NameList: {ds243}
Resource Type: 0
=====================================
Event notification
CSM Information
• CSM Guide for the PSSP Systems Administrator SG24-6953
  • Useful scripts for ERRM conditions
  • Command cross reference
• IBM CSM for AIX 5L Administration Guide SA22-7918
  • CSM error messages
• Web Sites
  • http://www-124.ibm.com/developerworks/oss/mailman/listinfo/csm