
Integrating Lemon Monitoring and Alarming System with the new CERN Agile Infrastructure

P. Andrade, L. Cons, I. Fedorko, B. Fiorini, A. Iribarren, V. Lefebure, G. Mccance, O. Pera, M. Paladin, I. Reguero, M. Dos Santos, S. Traylen. CERN, HEPiX Fall 2012 Workshop.




Presentation Transcript


  1. Integrating Lemon Monitoring and Alarming System with the new CERN Agile Infrastructure
     P. Andrade, L. Cons, I. Fedorko, B. Fiorini, A. Iribarren, V. Lefebure, G. Mccance, O. Pera, M. Paladin, I. Reguero, M. Dos Santos, S. Traylen
     CERN, HEPiX Fall 2012 Workshop

  2. CERN Agile Infrastructure Monitoring, HEPiX Spring 2012
     • High Level Architecture
     • View of shared architecture
     Lemon – LHC Era Monitoring System
     • Is Lemon only about “performance monitoring”?
     • Why architecture evolution rather than replacement by existing monitoring tool(s)?
     Agile Infrastructure for Monitoring
     • Shared Infrastructure
     • Use cases: Data store, Visualization
     • Event processing and management
     • Status of the components

  3. Lemon LHC Era Monitoring System
     In-house developed, multi-component, client/server-based monitoring system.
     Individually configurable nodes with autonomous recovery actions.
     Diagram labels: Measurement Repository; RRD tool / Python; Repository Backend; Application Server; Apache/PHP; Chain of tools based on DB backend; Oracle Database; User Interfaces; Node Monitoring; TCP/UDP; HTTP; Monitoring Agent; SQL; Local Cache; Lemon CLI (command line tool to access data); Web Browser; Lemon-host-check (command line tool for node exceptions); Sensor, Sensor, Sensor

  4. Lemon: Performance, application and facility monitoring
     • Time-series processing
     • Hierarchy clustering
       • Cluster
       • Sub-cluster
       • Node
     Diagram labels: Node monitoring, e.g. CPU Load; On behalf monitoring; Historical data export; Smart Power Distribution Units

  5. Lemon: Service availability and alarming
     e.g. Service Level Status
     Monitoring repository export with guaranteed reliability and data processing
     • Node monitoring
       • Disk occupancy
       • Number of processes
       • Log file parse matched
     • Correction action on the node
       • Run script locally to clean vardir
       • After 3rd attempt, var occupancy > 90%: var_full alarm
     Diagram labels: System administrator; Support ticket
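The escalation path on this slide (try a local corrective action first, raise an alarm only if the condition persists) can be sketched roughly as follows. This is a minimal illustration, not Lemon's actual implementation: the function name, the callback interface, and the reading of "after 3rd attempt" as "three cleanup runs followed by a final check" are all assumptions.

```python
from typing import Callable

THRESHOLD_PERCENT = 90.0  # from the slide: var occupancy > 90%
MAX_ATTEMPTS = 3          # from the slide: alarm after the 3rd attempt


def handle_var_occupancy(
    read_occupancy: Callable[[], float],
    clean_var: Callable[[], None],
    raise_alarm: Callable[[str], None],
) -> bool:
    """Hypothetical recovery loop.

    Returns True if occupancy ends at or below the threshold,
    False if the alarm had to be raised.
    """
    for _ in range(MAX_ATTEMPTS):
        if read_occupancy() <= THRESHOLD_PERCENT:
            return True
        clean_var()  # local corrective action on the node
    # Final check after the last cleanup attempt.
    if read_occupancy() <= THRESHOLD_PERCENT:
        return True
    raise_alarm("var_full")  # escalate, e.g. towards a support ticket
    return False
```

Passing the measurement, cleanup and alarm steps in as callables keeps the sketch independent of any real sensor or ticketing interface.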

  6. Lemon Monitoring @ Large scale
     • Experience
       • No single solution replacement
     • Requirements
       • Tools chain
         • e.g. data mining interface different from time series trending
       • Flexible migration
         • e.g. compatible with lemon node client
       • Large scale ready
         • Current system:
           • ~11k monitored entities
           • ~150 metrics/entity
         • Expected scale: ~300k entities
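To put the scale figures in perspective, the short calculation below multiplies them out. The slide gives ~150 metrics per entity only for the current system; applying the same figure to the expected ~300k entities is an assumption made here for illustration.

```python
current_entities = 11_000      # ~11k monitored entities (from the slide)
metrics_per_entity = 150       # ~150 metrics/entity (from the slide)
expected_entities = 300_000    # ~300k entities (from the slide)

# Distinct metric streams in the current system.
current_metrics = current_entities * metrics_per_entity

# Assumption: metrics/entity stays at ~150 at the expected scale.
expected_metrics = expected_entities * metrics_per_entity

# Growth factor in number of monitored entities.
growth = expected_entities / current_entities

print(current_metrics, expected_metrics, round(growth, 1))
```

Under that assumption the system would go from roughly 1.65 million to roughly 45 million metric streams, a factor of about 27.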

  7. Agile Infrastructure with performance monitoring
     Planned components.
     Diagram labels: Views; Data store; Operations; Visualization and correlation; Ticketing; SMS gateway; Dashboard; Cluster processing (High load for >50% of cluster); Message Bus; Lemon to messaging; Custom script; Lemon agent; Monitoring XYZ
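The cluster processing example on this slide ("High load for >50% of cluster") is a rule over many nodes rather than a single one. A minimal sketch of such a rule is shown below; the function name, the per-node load mapping and the `high_load` threshold are hypothetical, since the slide does not define what counts as "high".

```python
from typing import Mapping


def cluster_high_load(
    loads: Mapping[str, float],
    high_load: float,
    min_fraction: float = 0.5,
) -> bool:
    """Return True if more than `min_fraction` of nodes exceed `high_load`.

    `loads` maps node name to its latest load value (hypothetical input format).
    """
    if not loads:
        return False  # an empty cluster cannot be overloaded
    high_nodes = sum(1 for value in loads.values() if value > high_load)
    return high_nodes / len(loads) > min_fraction
```

For example, with a threshold of 8.0, a four-node cluster with two busy nodes does not trigger the rule (exactly 50%), while three busy nodes do.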

  8. Storing and visualization
     Possible options; R&D on-going.
     Diagram labels: Visualization; Lemon web; NoSQL; Oracle; RRD visualization; Splunk; Message Bus; Data mining (batch processing); Data mining; Visualization; Correlation

  9. NoSQL-based data store for monitoring
     • Example from Data Storage Service
     • Log parsing and processing based on the NoSQL DB
     • Prototyped by CERN IT/DSS
     • Shared infrastructure

  10. Splunk for data mining/visualization (under testing)
      • High precision data mining in the current system solved by dedicated exports
      • ~1.5 years of Lemon raw data (~4.5 TB in Oracle) → ~2.5 TB Splunk data with metadata information (~43 billion entries)
      • One year period of basic metrics on node → on the fly browsing capability with high time granularity
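The storage figures on this slide can be turned into a rough per-entry size. The calculation below is only indicative: it assumes decimal terabytes, and the Oracle and Splunk numbers may not cover exactly the same overheads (indexes, metadata), which the slide does not detail.

```python
TB = 10**12  # assumption: decimal terabytes

oracle_bytes = 4.5 * TB   # ~4.5 TB of Lemon raw data in Oracle (from the slide)
splunk_bytes = 2.5 * TB   # ~2.5 TB of Splunk data with metadata (from the slide)
entries = 43 * 10**9      # ~43 billion entries (from the slide)

# Average on-disk size of one entry in Splunk.
bytes_per_entry = splunk_bytes / entries

# Splunk footprint relative to the Oracle footprint.
size_ratio = splunk_bytes / oracle_bytes

print(round(bytes_per_entry), round(size_ratio, 2))
```

This works out to roughly 58 bytes per entry, with the Splunk copy a little over half the size of the Oracle data.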

  11. Example of Splunk Dashboard (under testing)
      • Lemon data with entity cluster hierarchy
      • Metric - Time - Match entity name
      • Sum of running jobs over time split by entities

  12. Event processing and management (concept; event process prototype)
      Diagram labels: Service Now; Incident ticket; Incident process; Event record; Ticketing system; Monitoring infrastructure; Event processing (e.g. Heartbeat checking, e.g. Load over cluster); Node monitoring; Metrics; Metric correlation
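Heartbeat checking, one of the event processing examples named on this slide, can be reduced to comparing each node's last report time against a timeout. The sketch below assumes a simple mapping of node name to last-seen timestamp; the function name, input format and timeout value are hypothetical and not taken from the presentation.

```python
from typing import List, Mapping


def missing_heartbeats(
    last_seen: Mapping[str, float],
    now: float,
    timeout_s: float,
) -> List[str]:
    """Return the sorted names of nodes silent for longer than `timeout_s` seconds.

    `last_seen` maps node name to the timestamp (in seconds) of its last metric.
    """
    return sorted(
        node for node, timestamp in last_seen.items()
        if now - timestamp > timeout_s
    )
```

Each returned node would then become an event record that the event process can correlate or forward to the ticketing system.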

  13. Possible use of Splunk for event processing
      • Alarming → on the fly information processing in time windows
      • Aggregated notification if counter > 3 events in a 5 min time window
      • In production for backup TSM service @CERN
      Diagram labels: Splunk; SplunkAutomate; Monitoring; Notification; Aggregated Notification; event; time
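The aggregation rule on this slide (notify once more than 3 events fall inside a 5 minute window) can be illustrated outside Splunk with a small sliding-window counter. This is a plain Python sketch, not Splunk configuration: the class name is hypothetical, timestamps are assumed to arrive in non-decreasing order, and treating the window as sliding rather than fixed is an assumption the slide does not settle.

```python
from collections import deque
from typing import Deque


class WindowedNotifier:
    """Hypothetical sliding-window event counter."""

    def __init__(self, window_s: float = 300.0, max_count: int = 3) -> None:
        self.window_s = window_s    # 5 min time window (from the slide)
        self.max_count = max_count  # notify if counter > 3 (from the slide)
        self._events: Deque[float] = deque()

    def add_event(self, timestamp: float) -> bool:
        """Record an event; return True if an aggregated notification is due."""
        self._events.append(timestamp)
        # Discard events that have fallen out of the window.
        while self._events and timestamp - self._events[0] >= self.window_s:
            self._events.popleft()
        if len(self._events) > self.max_count:
            self._events.clear()  # one notification per burst of events
            return True
        return False
```

Clearing the buffer after a notification means a sustained burst produces one aggregated notification per four events instead of one per event.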

  14. Configuration status and transition period (prototype)
      Diagram labels: Metric Management; Lemon metric management; Quattor configuration; Puppet configuration; Puppet; Quattor managed node; Puppet managed node; Lemon application server (one/data centre); AI monitoring

  15. Component status
      Legend: prototyping/testing/using; planned/R&D on-going.
      Diagram labels: Data store (Hadoop); Operations; Visualization and correlation (Splunk); Ticketing; SMS gateway; Dashboard; Cluster processing (High load for >50% of cluster); Apollo; Lemon to messaging; Custom script; Lemon agent; Monitoring XYZ

  16. Summary
      • No single solution replacement of the current Lemon system
      • Shared Agile Infrastructure → Modular concept
        • covering all the CERN Computer Centre monitoring domains
        • continuous development and deployment
      • Transition plan in place
      • Steady progress in implementation
