Lemon

Lemon: Computer Monitoring at CERN. Miroslav Siket, CERN-IT/FIO-FS. Outline: what Lemon is, structure, functionality, metrics, alarms, web visualization. Lemon – LHC Era Monitoring.

Presentation Transcript


  1. Lemon: Computer Monitoring at CERN
     Miroslav Siket, CERN-IT/FIO-FS

  2. Outline
     • Lemon – what is it?
     • Structure
     • Functionality
     • Metrics
     • Alarms
     • Web visualization
     Sysadmin Introduction at CERN

  3. Lemon – LHC Era Monitoring
     • Lemon is a software package containing tools for monitoring the status and performance of computers (currently limited to Linux and Solaris)
     • It contains the following components:
        • Sensors (they measure individual metrics [values])
        • MSA (Monitoring Sensor Agent)
        • Monitoring Repository (a daemon that receives the metrics)
        • Monitoring Repository Backend (storage)
        • LRF (Lemon RRD tool framework – caching and web presentation tools)
        • Correlation Engines
        • Lemon Client (tool for retrieving data)
        • LAG (Laser Alarm Gateway – tool for passing alarms to the Laser system)
     • See http://cern.ch/lemon for more info

  4. Lemon – schema
     [Diagram. Components shown: Sensors and the Monitoring Agent on the nodes, Monitoring Repository, Repository backend (SQL), Correlation Engines, RRDTool / PHP with apache, Lemon CLI, Web browser, User. Link labels: TCP/UDP, SOAP, HTTP.]

  5. Sensor (MS) and Sensor Agent (MSA)
     • A sensor measures data in response to requests from the MSA
     • The MSA receives the data from the sensor through a pipe
     • The MSA sends the data to the Monitoring Repository (MR) over a UDP socket
     • Typical communication between the two:
        • MSA forks the sensor
        • MSA: INI 1 LoadAvg
        • MSA: GET 1
        • Sensor: PUT 1 0.42
        • MSA: sends UDP packet to MR
     • The MSA controls the sampling frequency and status of the individual sensors (there are several of them)
     • You can write sensors yourself (bash, C++, Perl,…)
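The exchange on this slide can be sketched as a toy sensor loop. This is an illustration only, not Lemon's sensor API: it assumes one request per line in exactly the `INI <id> <metric>` / `GET <id>` / `PUT <id> <value>` shape shown above, and the names `run_sensor`, `METRIC_READERS` and `read_load_average` are hypothetical.

```python
import io
import os


def read_load_average():
    """Return the 1-minute load average (0.0 if the platform cannot report it)."""
    try:
        return os.getloadavg()[0]
    except (AttributeError, OSError):
        return 0.0


# Hypothetical registry: metric name as used in INI -> function that measures it.
METRIC_READERS = {"LoadAvg": read_load_average}


def run_sensor(requests, replies, readers=METRIC_READERS):
    """Answer INI/GET requests read from `requests` with PUT lines on `replies`.

    In a real deployment `requests` and `replies` would be the two ends of the
    pipe to the agent; here they are any line-iterable and any writable object.
    """
    registered = {}  # metric id -> measuring function
    for line in requests:
        parts = line.split()
        if len(parts) == 3 and parts[0] == "INI" and parts[2] in readers:
            # "INI 1 LoadAvg": bind id 1 to the LoadAvg reader.
            registered[parts[1]] = readers[parts[2]]
        elif len(parts) == 2 and parts[0] == "GET" and parts[1] in registered:
            # "GET 1": measure now and answer "PUT 1 <value>".
            value = registered[parts[1]]()
            replies.write(f"PUT {parts[1]} {value:.2f}\n")
        # Anything else is ignored in this sketch.


def demo():
    """Replay the exchange from the slide with a fixed reading of 0.42."""
    requests = io.StringIO("INI 1 LoadAvg\nGET 1\n")
    replies = io.StringIO()
    run_sensor(requests, replies, {"LoadAvg": lambda: 0.42})
    return replies.getvalue()
```

Calling `demo()` replays the slide's exchange against in-memory streams; passing `sys.stdin` and `sys.stdout` to `run_sensor` instead would let a parent process drive it through a pipe.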

  6. Metrics
     • Measured metrics (about 255):
        • Status: OS, disk DMA, RPM ok?, ethlink,…
        • Daemons: sshd, ntpd, syslogd, friod,… alive
        • File sizes: /etc/nologin, /afs/cern.ch,…
        • Security: sshd md5chksum,…
        • Performance: CPU utilization, memory utilization, network bandwidth use,…
        • Misc: number of jobs per virtual organization, SMART status, temperature,… (see the list at http://cern.ch/lemon-status/metric_descriptions.php)
     • The status of the MSA can be checked on each machine in /var/log/edg-fmon-agent.log (the log file of the edg-fmon-agent daemon)

  7. Lemon at CERN
     • Lemon monitors about 2100 computers in 100 clusters
     • On average it collects about 70 metrics from each host
     • Part of ELFms
     • Integrated with the Sure alarm system
     • Collects about 1 GB of data per day
     • Integrated with CDB
     [Diagram labels: Node Configuration Management, Node Management]

  8. Sure system
     • The Sure sensor compares the values of individual metrics against reference values and raises an alarm when the conditions are met
     • Examples:
        • Loadavg > 20 – raises Load_high alarm
        • # of sshd daemons < 1 – raises sshd_dead alarm
        • # of SMART failures in /var/log/messages > 0 – raises smart_failure alarm
     • Alarms are sent to the Sure servers
     • Operators acknowledge alarms, log them and, if unable to resolve them, notify the responsible person
     • Sysadmins receive ITCM tickets – for each alarm there is a procedure describing how to handle it
     • Special case – NO_CONTACT alarm
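The threshold checks listed on this slide can be written as a small rule table. This is a minimal sketch, not the Sure sensor itself: the alarm names and reference values come from the slide, while the metric keys (`loadavg`, `sshd_count`, `smart_failures`) and the function `raised_alarms` are made up for illustration.

```python
import operator

# (alarm name, metric key, comparison, reference value)
# Alarm names and thresholds are the slide's examples; metric keys are hypothetical.
RULES = [
    ("Load_high", "loadavg", operator.gt, 20),
    ("sshd_dead", "sshd_count", operator.lt, 1),
    ("smart_failure", "smart_failures", operator.gt, 0),
]


def raised_alarms(metrics, rules=RULES):
    """Return the names of all alarms whose condition holds for `metrics`.

    Metrics that were not measured are skipped rather than treated as failures.
    """
    alarms = []
    for alarm, metric, compare, reference in rules:
        if metric in metrics and compare(metrics[metric], reference):
            alarms.append(alarm)
    return alarms
```

For example, `raised_alarms({"loadavg": 25.0, "sshd_count": 1, "smart_failures": 0})` yields only `Load_high`. Keeping the rules as data rather than code means a new alarm is one more tuple in the table.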

  9. Web visualization and framework
     • LRF pre-processes part of the data from the Monitoring Repository and stores it in RRD files for fast visualization
     • It groups the logical units (nodes) into clusters based on:
        • CDB [configuration database] definition
        • user-defined clusters
        • HW type
        • Racks
     • A PHP-based web interface displays the pre-processed data on demand and, together with CDB and status information, gives a general overview
     • Check it at http://cern.ch/lemon-status
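The grouping step can be pictured as bucketing nodes by one attribute at a time. The sketch below is illustrative only and does not reflect how LRF is implemented: the node records, attribute names (`cdb_cluster`, `hw_type`, `rack`) and the function `group_nodes` are all assumptions.

```python
from collections import defaultdict

# Hypothetical node records; real data would come from CDB and the repository.
NODES = [
    {"name": "node001", "cdb_cluster": "batch", "hw_type": "typeA", "rack": "R01"},
    {"name": "node002", "cdb_cluster": "batch", "hw_type": "typeB", "rack": "R01"},
    {"name": "node003", "cdb_cluster": "disk", "hw_type": "typeB", "rack": "R02"},
]


def group_nodes(nodes, key):
    """Group node names by the given attribute (for example "rack" or "hw_type")."""
    clusters = defaultdict(list)
    for node in nodes:
        clusters[node[key]].append(node["name"])
    return dict(clusters)
```

The same node list can then be viewed per CDB cluster, per hardware type or per rack by changing only the `key` argument, e.g. `group_nodes(NODES, "rack")`.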

  10. Summary
     • Lemon provides monitoring information about the computers in the Computer Center at CERN
     • Thanks to its integration with Sure (the alarm system), it allows fast and easy identification and repair of problems
     • Combined with CDB, it gives an easier overview of services and a visualization of their performance
     • Combined with Remedy (ITCM), it gives an overview of the problems for a given service
