Tim Smith (CERN/IT) presents the Performance and Exception Monitoring (PEM) project at HEPiX at JLab, covering motivation, objectives, analysis and design, prototyping, and future prospects. The project addresses configuration, collection, transport, repository management, and display. It aims to provide end-to-end service views, link alarms to monitoring and corrective actions, and establish a uniform monitoring infrastructure. The slides cover the monitoring architecture, the correlation engine, event handling, and the outlook for integrating the system into GRID technologies.
Performance and Exception Monitoring Project
Tim Smith, CERN/IT
Overview
• Motivation
• Objectives
• Analysis and Design
• Prototyping
• Perspective and Future
Tim Smith: HEPiX @ JLab
Motivation
[Diagram labels: Alarm, Recovery action, Monitoring System, Local, Remote, Process killer, Console, Resource planning, Accounting, Security, Inventory]
• Independent systems
• No single overview
• Duplicated collection
• Host based: Want Service
• Perceived problems not real
• Scalability
Motivation
[Diagram labels: Alarm, Recovery action, Monitoring System, Local, Remote, Console, Resource planning, Accounting, Security, Inventory]
• Configuration
• Collection
• Transport
• Repository mgmt
• Display
Objectives
• To provide tools in which the alarms and displays are orientated to the overall service provided:
  • User end-to-end views, Quality of service views
  • Managerial views of resource usage / evolution / failure rates
  • Service provider views, and detailed machine views
• Link the alarms to both the monitoring and corrective actions
• To provide service level metrics
• To provide a uniform monitoring infrastructure
  • Coordinated central repositories + Common logging format
  • Averaging and archiving of logged information
  • Correlations between logged information
  • Multiple input routes; extensible monitoring clients
  • Modular tools; demonstrated scalability
Process
• Analysis
  • User Requirements Document
  • Current Tools survey
    • Enterprise/Cluster mgmt, Pub domain, other labs, building blocks, DAQ, Run Control, Slow Control
  • Goal / Question / Metric formalism
  • System Requirements Document
• Design
  • Interfaces Document
• Prototyping
Goal / Question / Metric
• Ensure quality of Interactive Service
  • Sufficient nodes?
  • Low enough load?
  • Slow to respond to commands?
  • Contactable via network
    • Network daemons alive
    • No nologin
    • Free ptys
    • Connection test from remote node
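
The goal / question / metric breakdown on this slide can be sketched as a small tree. This is only an illustration: the class and method names (`GqmSketch`, `Goal`, `Question`, `metricsToCollect`) are hypothetical, and attaching the four metrics to the "Contactable via network" question is an assumption about how the slide was laid out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GqmSketch {
    /** A question refined into zero or more measurable metrics. */
    static final class Question {
        final String text;
        final List<String> metrics;

        Question(String text, String... metrics) {
            this.text = text;
            this.metrics = Arrays.asList(metrics);
        }
    }

    /** A goal refined into the questions that decide whether it is met. */
    static final class Goal {
        final String text;
        final List<Question> questions = new ArrayList<>();

        Goal(String text) {
            this.text = text;
        }

        Goal ask(Question q) {
            questions.add(q);
            return this;
        }

        /** Flatten the tree into the metrics that would have to be collected. */
        List<String> metricsToCollect() {
            List<String> all = new ArrayList<>();
            for (Question q : questions) {
                all.addAll(q.metrics);
            }
            return all;
        }
    }

    /** The slide's example; the question-to-metric assignment is assumed. */
    static Goal interactiveService() {
        return new Goal("Ensure quality of Interactive Service")
            .ask(new Question("Sufficient nodes?"))
            .ask(new Question("Low enough load?"))
            .ask(new Question("Slow to respond to commands?"))
            .ask(new Question("Contactable via network?",
                "Network daemons alive",
                "No nologin",
                "Free ptys",
                "Connection test from remote node"));
    }

    public static void main(String[] args) {
        Goal goal = interactiveService();
        System.out.println(goal.text);
        for (Question q : goal.questions) {
            System.out.println("  " + q.text + " -> " + q.metrics);
        }
    }
}
```

Flattening the tree gives the list of metrics to collect, which is the point of the formalism: each metric is traceable back to a service-level goal.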
PEM Architecture
[Diagram: Monitoring Agent, Monitoring Broker, Measurement Repository, Configuration Repository, Correlation Engine, User Interface, Access Server, connected with 1 and 1..n multiplicities; a boundary is marked "Outside PEM"]
Configuration Repository: Loading the DB
[Diagram labels: XML documents (<TAG> </TAG>), Parser, XML-DBMS, jdbc, RDBMS]
• XML Schema: Host, Host type, Metrics, Services
• XML-DBMS freeware (tried XSU from Oracle)
• Viewers
• Xerces from Apache
Configuration Repository: Querying the DB
[Diagram labels: RDBMS, jdbc, XML-DBMS, XML documents (<TAG> </TAG>), Parser, DB Configuration Items, Java Objects]
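
As a rough illustration of the last step on this slide (XML configuration turned into Java objects), the sketch below parses a small XML document with the JDK's standard JAXP DOM parser. It deliberately skips the XML-DBMS and RDBMS/jdbc stages shown in the diagram. The element and attribute names (`config`, `host`, `metric`, `name`, `type`) and the `HostConfig` class are hypothetical; the actual PEM schema is not given in the slides.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class ConfigLoaderSketch {
    /** Hypothetical configuration item: one host, its type, and its metrics. */
    static final class HostConfig {
        final String name;
        final String type;
        final List<String> metrics;

        HostConfig(String name, String type, List<String> metrics) {
            this.name = name;
            this.type = type;
            this.metrics = metrics;
        }
    }

    /** Made-up sample document; not the real PEM schema. */
    static final String SAMPLE =
        "<config>"
        + "<host name=\"node1\" type=\"interactive\">"
        + "<metric name=\"load\"/><metric name=\"free_ptys\"/>"
        + "</host>"
        + "<host name=\"node2\" type=\"batch\">"
        + "<metric name=\"load\"/>"
        + "</host>"
        + "</config>";

    /** Parse the XML text into HostConfig objects. */
    static List<HostConfig> parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<HostConfig> hosts = new ArrayList<>();
            NodeList hostNodes = doc.getElementsByTagName("host");
            for (int i = 0; i < hostNodes.getLength(); i++) {
                Element host = (Element) hostNodes.item(i);
                List<String> metrics = new ArrayList<>();
                NodeList metricNodes = host.getElementsByTagName("metric");
                for (int j = 0; j < metricNodes.getLength(); j++) {
                    metrics.add(((Element) metricNodes.item(j)).getAttribute("name"));
                }
                hosts.add(new HostConfig(
                    host.getAttribute("name"), host.getAttribute("type"), metrics));
            }
            return hosts;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("invalid configuration XML", e);
        }
    }

    public static void main(String[] args) {
        for (HostConfig h : parse(SAMPLE)) {
            System.out.println(h.name + " (" + h.type + "): " + h.metrics);
        }
    }
}
```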
Correlation Engine
• To correlate metrics from the MRS according to configuration in the CRS
• Metric collections: trends + multiple machines
• Samplings: Union for read efficiency from MRS
• Example Java Classes:
  • Correlation coordinator
  • Sampling cache
  • Evaluators
  • Timers
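
One way to read "union for read efficiency" is that evaluators needing the same sampling share a single read from the MRS. The sketch below illustrates that idea with a sampling cache and one evaluator that averages a metric over several machines. All names here (`SamplingCache`, `meanAcrossHosts`, the `host:metric` key format, the demo values) are hypothetical and do not come from the PEM code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

public class CorrelationSketch {
    /** Caches samplings so evaluators that share one trigger a single read. */
    static final class SamplingCache {
        private final ToDoubleFunction<String> repository; // stand-in for an MRS read
        private final Map<String, Double> cache = new HashMap<>();
        int repositoryReads = 0;

        SamplingCache(ToDoubleFunction<String> repository) {
            this.repository = repository;
        }

        double get(String key) {
            Double value = cache.get(key);
            if (value == null) {
                repositoryReads++;
                value = repository.applyAsDouble(key);
                cache.put(key, value);
            }
            return value;
        }
    }

    /** Evaluator: mean of one metric over several machines. */
    static double meanAcrossHosts(SamplingCache cache, String metric, List<String> hosts) {
        double sum = 0;
        for (String host : hosts) {
            sum += cache.get(host + ":" + metric);
        }
        return sum / hosts.size();
    }

    /** Demo repository with made-up values: n1 reports 1.0, everything else 3.0. */
    static SamplingCache demoCache() {
        return new SamplingCache(key -> key.startsWith("n1:") ? 1.0 : 3.0);
    }

    public static void main(String[] args) {
        SamplingCache cache = demoCache();
        List<String> hosts = Arrays.asList("n1", "n2");
        System.out.println("mean load: " + meanAcrossHosts(cache, "load", hosts));
        System.out.println("mean load again: " + meanAcrossHosts(cache, "load", hosts));
        System.out.println("repository reads: " + cache.repositoryReads);
    }
}
```

A real engine would also need the timers and coordinator listed on the slide to expire cached samplings; this sketch never invalidates the cache.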
Events
• Publish / Subscribe : Java RMI
• Interfaces Document
[Diagram: Monitoring Agent, Monitoring Broker, Measurement Repository, Configuration Repository, Correlation Engine, User Interface, Access Server, exchanging metricstream, metricvalue, exception and configuration events]
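
The publish / subscribe pattern named on this slide can be sketched in-process as follows. The `Subscriber` interface extends `java.rmi.Remote` and declares `RemoteException`, so an implementation could in principle be exported over RMI, but this sketch does no remote export or registry lookup. The `Broker` and `Subscriber` names and the string payload are hypothetical; only the topic names (`metricstream`, `metricvalue`, `exception`, `configuration`) come from the slide.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PubSubSketch {
    /** RMI-style callback interface; hypothetical, not the PEM interface. */
    interface Subscriber extends Remote {
        void onEvent(String topic, String payload) throws RemoteException;
    }

    /** Minimal in-process broker that routes events by topic name. */
    static final class Broker {
        private final Map<String, List<Subscriber>> subscribers = new HashMap<>();

        void subscribe(String topic, Subscriber subscriber) {
            subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(subscriber);
        }

        /** Deliver to every subscriber of the topic; returns how many received it. */
        int publish(String topic, String payload) {
            List<Subscriber> targets = subscribers.get(topic);
            if (targets == null) {
                return 0;
            }
            int delivered = 0;
            for (Subscriber s : targets) {
                try {
                    s.onEvent(topic, payload);
                    delivered++;
                } catch (RemoteException e) {
                    // An unreachable subscriber is skipped; others still get the event.
                }
            }
            return delivered;
        }
    }

    public static void main(String[] args) {
        Broker broker = new Broker();
        broker.subscribe("exception", (topic, payload) ->
            System.out.println("[" + topic + "] " + payload));
        broker.publish("exception", "example payload");
        broker.publish("metricvalue", "nobody is subscribed to this topic");
    }
}
```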
Monitoring Agent/Broker I
• SNMP
  • extended existing infrastructure
  • Multithreaded broker loading DB
• JMX / JDMK
  • JMX public specification: managed resources
  • Pluggable agents
  • Reported several important bugs
  • Demo at JavaOne conference
  • Linux/NT remote reset
  • Netlogger instrumentation
  • Opened up license negotiations
Monitoring Agent/Broker II
• Not yet … DMTF
• DMI, CMI
[Diagram, labelled "C" and "Low overhead": SNMP, Spool, /proc, netlogger, Script, Monitoring Process, Spool Manager, Monitoring Broker]
PEM Futures
• Today: CERN CC needs it
  • Prototype for ALICE MDC III in January
• Tomorrow: Tier-0 RC / GRID node need it
  • More complete management solutions
  • Integrate into the Fabric Management WP
• ‘GRIDification’
  • Rapidly evolving technologies
  • Lots of middleware
  • Lots of companies wanting collaboration
  • still need framework
PEM in Perspective
[Diagram labels: Configuration Management, Monitoring, Alarm, Recovery Actions, Inventory, Resource Planning, Security, Application Inst/Update, OS Configuration/Update, OS Installation/Update, Power Mgmt/Remote Reset, Console Mgmt, PC Hardware]