Lemon/LAS for System Administrators

Overview. Miroslav Siket, CERN-IT/FIO-FD. http://cern.ch/lemon


Presentation Transcript


  1. Lemon/LAS for System Administrators: Overview
     Miroslav Siket, CERN-IT/FIO-FD
     http://cern.ch/lemon

  2. Lemon - schema
     [Architecture diagram; only its labels survive: Nodes, Sensor (x3), Monitoring Agent, TCP/UDP, Application Server, Repository backend, SQL, Oracle Database, apache, RRDTool / PHP, HTTP, Web browser, Lemon CLI, Lemon-host-check, User]
     Lemon Tutorial

  3. Lemon/LAS building blocks
     • Oracle DB server
       • Runs the LAS logic and stores LAS data (PL/SQL)
     • Lemon-server: application server
       • Inserts exceptions into the Oracle DB
     • Web server
       • Provides access to LAS data from the Oracle DB to the LAS GUI (business logic) and lemon-cli
     • Remote monitoring: ping, http
     • SURE gateways for UIMON/AFS

  4. Lemon/LAS hardware
     • Two independent instances
       • Primary
         • Oracle DB: lemonrac (dbsrvd102, dbsrvd103)
         • Application server (lemon-server): lxmred0803/0603
         • Web server: lemonweb (lxmrec1601)
       • Secondary
         • Oracle DB and OraMon: lemondb2
         • Web server: lemonweb02
     • Remote monitoring machines
       • lxmred0803 and lxmred0603

  5. lemon-cli
     • Command-line tool for extracting raw (uninterpreted) data from Lemon.
     • Information can be extracted from the local cache (/var/spool/edg-fmon-agent) or from a remote server.
     • Limitations
       • The local cache is limited to seven days' worth of history (purged every day by the agent).
       • The local cache contains much more information than is recorded at the server.
         • Why? Smoothing!
         • Smoothing is a mechanism that allows the agent to be selective about the information it sends to the central servers.
       • If the information you want is less than 7 days old, use the local cache!
     • Full documentation at: http://cern.ch/lemon/doc/components/lemon-cli.shtml
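To make the effect of smoothing concrete, here is a minimal sketch of why the local cache ends up holding more samples than the server. It is not the agent's actual algorithm: the class name and the `max_diff` and `max_age` parameters are hypothetical, chosen only to illustrate "forward a sample when it changed noticeably, or when nothing has been sent for a while".

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Sample = Tuple[int, float]  # (unix timestamp, value)


@dataclass
class SmoothingAgent:
    """Toy agent: caches every sample locally, forwards only some of them."""
    max_diff: float  # hypothetical: smallest change worth reporting
    max_age: int     # hypothetical: seconds after which a sample is sent anyway
    local_cache: List[Sample] = field(default_factory=list)
    sent_to_server: List[Sample] = field(default_factory=list)
    _last_sent: Optional[Sample] = None

    def record(self, timestamp: int, value: float) -> bool:
        """Store a sample locally; return True if it was also forwarded."""
        self.local_cache.append((timestamp, value))
        last = self._last_sent
        changed = last is None or abs(value - last[1]) > self.max_diff
        stale = last is not None and timestamp - last[0] >= self.max_age
        if changed or stale:
            self._last_sent = (timestamp, value)
            self.sent_to_server.append((timestamp, value))
            return True
        return False


agent = SmoothingAgent(max_diff=0.5, max_age=300)
for t, v in [(0, 1.0), (60, 1.1), (120, 1.2), (180, 3.0), (240, 3.1)]:
    agent.record(t, v)
# All five samples are in the local cache; only two were forwarded.
print(len(agent.local_cache), len(agent.sent_to_server))  # prints: 5 2
```

In this toy run the server never sees the small fluctuations between 1.0 and 1.2, which is why the local cache is the better source for recent, fine-grained history.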

  6. lemon-cli (II) - Examples
     • Resolving a metric id to a name
       • lemon-cli -m syslog
       • Displays all the metrics whose name contains 'syslog'
     • Referencing time periods (--end, --start), e.g.
       • 1h = 1 hour
       • 2d3h36m44s = 2 days, 3 hours, 36 minutes and 44 seconds
       • Also supports log file timestamps, e.g. Thu 02 Nov 2006 10:45:00 (no guarantees!)
     • If querying remotely, -n accepts the same node name expansion criteria as wassh, e.g.
       • lemon-cli -m 10005 -n lxb[0001-1000] --server
     • All alarms can be seen on the machine using
       • lemon-cli --class "alarm.exception"
     • 1 005, 1 135 and 1 000 are alarms
     • lemon-host-check interprets all the codes for you!
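The two notations on this slide, relative time periods such as 2d3h36m44s and node ranges such as lxb[0001-1000], can be unpacked with a few lines of code. The sketch below is an illustration only, not lemon-cli's or wassh's own parser: the function names are hypothetical, and the range expansion handles just the single bracketed numeric range shown on the slide (wassh may accept other forms).

```python
import re
from typing import List

_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def parse_duration(text: str) -> int:
    """Convert a period such as '2d3h36m44s' into a number of seconds."""
    if not re.fullmatch(r"(?:\d+[dhms])+", text):
        raise ValueError(f"unrecognised duration: {text!r}")
    parts = re.findall(r"(\d+)([dhms])", text)
    return sum(int(number) * _UNIT_SECONDS[unit] for number, unit in parts)


def expand_nodes(pattern: str) -> List[str]:
    """Expand 'lxb[0001-1000]' into lxb0001 ... lxb1000 (zero padding kept)."""
    match = re.fullmatch(r"([^\[\]]*)\[(\d+)-(\d+)\]([^\[\]]*)", pattern)
    if match is None:
        return [pattern]  # plain host name, nothing to expand
    prefix, first, last, suffix = match.groups()
    width = len(first)
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(first), int(last) + 1)
    ]


print(parse_duration("1h"))          # prints: 3600
print(parse_duration("2d3h36m44s"))  # prints: 185804
nodes = expand_nodes("lxb[0001-1000]")
print(len(nodes), nodes[0], nodes[-1])  # prints: 1000 lxb0001 lxb1000
```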

  7. lemon-host-check (I)
     • Aim: to provide a command-line tool for viewing the status of all active alarms on a given machine by querying the edg-fmon-agent.
     • Uses the information recorded in the agent's local cache (requires /var/ to be writeable!).
     • Makes sure that the information reported to you is up to date (fresh!).
     • Checks that all sensors are running, and that one and only one agent process is running.
     • You must be logged in as root!
     • Full documentation at: http://cern.ch/lemon/doc/components/lemon-host-check.shtml

  8. lemon-host-check (II) - Examples
     • Check for active alarms on the machine
       • lemon-host-check
     • Disable the "syslogd" and "klogd" alarms
       • lemon-host-check --disable "30023,30032"
     • Show alarms even if they are disabled
       • lemon-host-check --force
     • Disable all alarms for the next 1 hour, 30 minutes and 23 seconds
       • lemon-host-check --disable-all --duration 1h30m23s "demo intervention"
     • View a list of all disabled alarms
       • lemon-host-check --list
     • Enable all alarms
       • lemon-host-check --enable-all
     • Some alarms are "hard" disabled! Making them visible again requires a CDB reconfiguration and an ncm-ncd --co fmonagent run.

  9. lemon-host-check (III)
     • Pre-alarms
       • A recent concept added to Lemon.
       • Aimed at dealing with transient alarms.
     • Real use case:
       • high_load (30008) has pre-alarm capabilities. When high load is detected on the machine, a pre-alarm is raised (not visible on LAS). If the alarm persists for more than 10 minutes, it becomes a proper alarm. This allows high load spikes on machines/clusters such as lxplus to be ignored.
       • Pre-alarms are not visible by default in lemon-host-check.
     • Caution:
       • If you have a high_load alarm and restart the agent, the alarm will disappear! If the root problem hasn't been corrected, the alarm will resurface 10 minutes later (a new ITCM ticket).
       • Don't restart the agent unless you absolutely need to (reconfiguration, errors in the edg-fmon-agent.log, ...).
       • If you have to restart, use 'lemon-host-check --show-all' afterwards. Note: make sure to check the status of the alarm! You need to ignore the disabled ones, if any.
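The pre-alarm behaviour described on this slide, including why an agent restart makes an alarm vanish for 10 minutes, can be pictured as a small state machine. The following is a sketch under stated assumptions, not the agent's implementation: the class and method names are hypothetical, and only the 10-minute promotion delay is taken from the slide.

```python
from dataclasses import dataclass
from typing import Optional

PROMOTION_DELAY = 10 * 60  # seconds; the 10-minute figure comes from the slide


@dataclass
class PreAlarmTracker:
    """Hypothetical in-memory state the agent keeps for one exception."""
    first_seen: Optional[int] = None  # when the condition was first detected

    def update(self, now: int, condition_active: bool) -> str:
        """Return 'ok', 'pre-alarm' or 'alarm' for the current observation."""
        if not condition_active:
            self.first_seen = None
            return "ok"
        if self.first_seen is None:
            self.first_seen = now
        if now - self.first_seen > PROMOTION_DELAY:
            return "alarm"
        return "pre-alarm"


tracker = PreAlarmTracker()
print(tracker.update(0, True))    # prints: pre-alarm
print(tracker.update(601, True))  # prints: alarm

# Restarting the agent discards the in-memory state, so the clock starts again
# even though the load is still high.
tracker = PreAlarmTracker()
print(tracker.update(700, True))   # prints: pre-alarm
print(tracker.update(1301, True))  # prints: alarm
```

Because the state lives only in memory in this sketch, a restart is indistinguishable from the problem having gone away and come back, which matches the caution above.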

  10. lemon-host-check (IV)
      • Common errors:
      No monitoring agent process running / Too many monitoring agent processes running
        • service edg-fmon-agent restart
        • If that fails, contact project-elfms-lemon@cern.ch
      Possible false exception
        • lemon-host-check has given up (after 60 seconds) trying to get information from the agent on the machine. If it failed to find out whether an alarm was present for a particular exception, it assumes the worst-case scenario: that an alarm does exist but may not be real (possibly false).
        • Why?
          • The agent may be too busy to answer lemon-host-check.
          • Some sensors may have failed to retrieve the necessary information.
        • Solution
          • Re-run lemon-host-check.
          • If it still fails, check /var/log/edg-fmon-agent.log for any errors about sensors or missing metrics. If there are any, run spma_wrapper.sh on the machine to get the latest sensor code (if any), then ncm-ncd --co fmonagent to reconfigure the agent.
          • Try again.
          • If it is still failing, contact the service manager and CC project-elfms-lemon@cern.ch
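The "possible false exception" rule is a worst-case default: no answer is treated as a likely alarm rather than as all clear. A minimal sketch of that decision follows. The function name and the use of `None` for "no answer before the timeout" are assumptions made for illustration, and the exception ids are simply ones that appear elsewhere in this tutorial.

```python
from typing import Dict, Optional


def classify(answers: Dict[int, Optional[bool]]) -> Dict[int, str]:
    """Label each exception id from the agent's answer.

    True  -> the agent reported an alarm
    False -> the agent reported no alarm
    None  -> no answer before the timeout (hypothetical encoding)
    """
    labels = {}
    for exception_id, raised in answers.items():
        if raised is None:
            # Worst case: assume an alarm exists, but flag it as unconfirmed.
            labels[exception_id] = "possible false exception"
        elif raised:
            labels[exception_id] = "alarm"
        else:
            labels[exception_id] = "ok"
    return labels


print(classify({30008: True, 30023: False, 30032: None}))
# prints: {30008: 'alarm', 30023: 'ok', 30032: 'possible false exception'}
```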

  11. FAQ
      Are monitored machines running only Linux (e.g. SLC3/4, RHEL 3/4)?
        • Linux (lemon agent, ping, http check)
        • Solaris (lemon agent, UIMON)
        • Windows (ping, http)
      Are there any limitations that we should be aware of on the other OSes / platforms?
        • AFS machines have their own monitoring tools: no information available
        • UIMON-monitored machines run a UIMON process and a multiplexer to send alarms
      Is there any load balancing (DNS) and/or redundancy? Are the front end and back end part of the failover?
        • Yes, HA for lemon-server on lxmred0803 and lxmred0603
        • Oracle RAC (dbsrvd102/103)
        • Two independent instances (lemondb2/lemonweb02 and lemonrac/lxmred0803/0603)

  12. FAQ (II)
      What should we do in the case of a piquet call about a failure on these server(s)?
        • The operators' LAS procedures do not define any piquet actions. All other failures are covered by the standard OS/hardware procedures that they already have. There is nothing LAS-specific for them.
      How should the correlation rules be interpreted? Could you explain the syntax found in the Remedy ticket?
        • Full documentation with examples at http://lemon.web.cern.ch/lemon/doc/sensors/exception.shtml
        • Example: (lxs5013:9104:1[/tmp] eq /tmp) && (lxs5013:9104:5[90] > 80)
      Is there a means to detect when a node started to be "alarmed" and when this stopped?
        • The /var/log/ncm/component-setodesiredstate.log* log files on the machine in question
      What should be expected from them if no alarm can be displayed any more at 3:00 AM and they have been called by the operator?
        • No piquet service is defined for LAS. If LAS does not work, operators have procedures for finding out the state of the LAS: check http://lemon.web.cern.ch/lemon/cern/las_procedures.shtml
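As a reading aid for the correlation example on this slide, the sketch below splits each term into its parts. It rests on assumptions rather than on the documented grammar: each term is read as node:metric id:field, the bracketed value is taken to be the value observed when the ticket was created, `&&` is taken to mean that all terms must hold, and only the two operators that appear in the example (`eq` and `>`) are handled. All names are hypothetical, and the authoritative syntax is in the exception sensor documentation linked above.

```python
import re
from typing import List, NamedTuple


class Term(NamedTuple):
    node: str
    metric_id: int
    field_index: int
    observed: str   # assumed: value seen when the ticket was raised
    op: str
    threshold: str


_TERM_RE = re.compile(
    r"(?P<node>[\w.-]+):(?P<metric>\d+):(?P<field>\d+)"
    r"\[(?P<observed>[^\]]*)\]\s*(?P<op>eq|>)\s*(?P<threshold>[^\s()]+)"
)


def parse_terms(rule: str) -> List[Term]:
    """Extract every 'node:metric:field[value] op threshold' term."""
    return [
        Term(m["node"], int(m["metric"]), int(m["field"]),
             m["observed"], m["op"], m["threshold"])
        for m in _TERM_RE.finditer(rule)
    ]


def term_holds(term: Term) -> bool:
    """Re-evaluate one term using the observed value embedded in it."""
    if term.op == "eq":
        return term.observed == term.threshold
    if term.op == ">":
        return float(term.observed) > float(term.threshold)
    raise ValueError(f"operator not covered by this sketch: {term.op}")


rule = "(lxs5013:9104:1[/tmp] eq /tmp) && (lxs5013:9104:5[90] > 80)"
terms = parse_terms(rule)
for term in terms:
    print(term)
print(all(term_holds(term) for term in terms))  # prints: True
```

Read this way, the example says that on lxs5013, metric 9104 reported /tmp in field 1 and 90 in field 5, and the alarm fired because 90 exceeds the threshold of 80.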
