Service Level Status Overview project Sebastian Lopienski CERN, IT/FIO HEPiX meeting, Jefferson Lab, October 10th, 2006
Agenda • Overview of the project • Concepts • service, subservice, metaservice • availability vs. status • Key Performance Indicators • Demonstration • Your own SLS instance?
The need • What is the current availability of the CVS service? • Which services are still affected by last night's power cut? • If my service is in maintenance, which other services will be affected? • What is the overall status of all services used by the ATLAS experiment?
Service Level Status Overview (SLS) • Aim: • To provide a web-based tool that dynamically shows availability, basic information and statistics about IT services, as well as dependencies between them. • For whom? • service users • department and CERN management • other service providers • manager of the given service
Features • collecting and displaying service information, status and availability • dependencies and reverse dependencies • service incidents, scheduled interventions • hierarchical structure of services • configurable views of services • charts of availability trends over time • statistics of availability (and other values) • Key Performance Indicators (KPIs)
Architecture • We collect and display information, but we don’t generate it!
Agenda • Overview of the project • Concepts • service, subservice, metaservice • availability vs. status • Key Performance Indicators • Demonstration • Your own SLS instance?
What is service availability? • Service availability indicates to what extent a given service is accessible and useful for its users • Services should be monitored from users’ point of view • a user doesn’t care about alarms on machines running the service • In SLS, service availability is a number N: 0 ≤ N ≤ 100
Service availability and status • Service fully (100%) available • Service available at 95%, still marked as fully available (above the highest threshold) • Service available at 87%, marked as affected (below the highest threshold) • Service available at 50%, marked as degraded (below the medium threshold) • Service available at 13%, marked as not available (below the lowest threshold) • Service info expired, update not available • Scheduled outage or maintenance • Different status thresholds mean different statuses for services with the same availability (more at http://cern.ch/SLS/help.php)
Key Performance Indicators • KPIs are metrics that indicate whether a service meets its requirements (performance or other) • Examples of Key Performance Indicators: • % of availability of CPU servers (how many machines in production out of total) • % of AFS volumes and servers available, also broken down by VO • CPU delivered to VO as compared to quota, % of usage from Grid • A KPI is a pair of values: measured and target
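The "measured and target" pair can be sketched in a few lines of Python. This is only an illustration, not SLS code: the `KPI` class, the `is_met` rule (higher is better) and the machine counts are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KPI:
    """A Key Performance Indicator as a (measured, target) pair."""
    name: str
    measured: float
    target: float

    def is_met(self) -> bool:
        # Assumption: higher is better, so the KPI is met
        # when the measured value reaches the target.
        return self.measured >= self.target


def percent_in_production(in_production: int, total: int) -> float:
    """Percentage of machines in production out of the total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * in_production / total


# Hypothetical numbers: 190 of 200 CPU servers in production, target 98%.
cpu_kpi = KPI("CPU servers in production",
              measured=percent_in_production(190, 200),
              target=98.0)
print(cpu_kpi.name, cpu_kpi.measured, cpu_kpi.is_met())
```

A KPI where lower is better (for example a queue length) would need the comparison reversed.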
Agenda • Overview of the project • Concepts • service, subservice, metaservice • availability vs. status • Key Performance Indicators • Demonstration • Your own SLS instance?
SLS instance at CERN • http://cern.ch/SLS (NICE password required) • all availabilities shown there are real and up to date • inline SLS view for a given service (e.g. at http://cern.ch/CVS)
SLS instance at CERN Most IT services are covered by SLS: • Administrative applications • Windows, Mail, Web services • AFS, lxbatch, lxplus, Backup, Tapes, Remedy, Lemon • CVS services, J2EE Public Service, EDMS • databases • LCG Tier-0 and 1 sites • Indico, CDS, CRBS, VRVS etc. Metaservices and views: • logical structure, group structure, VO-oriented structure
Agenda • Overview of the project • Concepts • service, subservice, metaservice • availability vs. status • Key Performance Indicators • Demonstration • Your own SLS instance?
Setting up an SLS instance • Simple installation from an RPM • for SLC3 and SLC4 • see: https://twiki.cern.ch/twiki/bin/view/FIOgroup/SLSAdminDocumentation • No CERN-specific dependencies • Requirements • Apache, Python, PHP (with DOM and OCI8 extensions) • Xerces-C >= 2.3 • JpGraph and GD library • cx_Oracle (for the database functionality) • Comes with one service predefined – SLS itself • Released under the EU DataGrid software license • a BSD-style license
Adding a new service • The service manager has to: • have an idea of how to measure service availability • and a piece of code that calculates the availability percentage value (0..100) • Then, follow two simple steps: • prepare a static service description XML file and send it to us (once) • make service update XMLs available via HTTP • The SLS Manual for Service Managers provides detailed instructions and many examples of XMLs: https://twiki.cern.ch/twiki/bin/view/FIOgroup/SLSManualForSM
Minimal static XML example
<?xml version="1.0" encoding="UTF-8"?>
<service xmlns="http://sls.cern.ch/SLS/XML/static">
  <id>DFS</id>
  <fullname>DFS (Distributed File System)</fullname>
  <datasource>
    <url>https://websvc02.cern.ch/winservices-soap/...</url>
  </datasource>
</service>
Example of static service description XML with more information: https://twiki.cern.ch/twiki/bin/view/FIOgroup/SLSManualForSM#Static_XML_with_more_information
Service managers
…
<servicemanagers>
  <servicemanager main="true" login="ungil">Carlos Ungil</servicemanager>
  <servicemanager>Maciej Stepniewski</servicemanager>
  <servicemanager login="wtomlin">William Tomlin</servicemanager>
</servicemanagers>
…
Contact data from LDAP
Service dependencies
…
<dependencies>
  <dependency level="dependson">AFS</dependency>
  <dependency level="uses">Castor</dependency>
</dependencies>
…
• Two different levels of dependency: • dependson - means that the service will not work if AFS is down • uses - means that the service uses Castor (for example for backup), but will work fine (or almost fine) even if Castor is not available
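As an illustration of how dependencies and reverse dependencies relate, here is a minimal Python sketch that reads the `<dependency>` entries from a static XML and inverts them. It is not the SLS implementation: the service id `MYSERVICE`, the placement of `<dependencies>` directly under `<service>`, and both function names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Namespace taken from the static XML example; the prefix "s" is arbitrary.
STATIC_NS = {"s": "http://sls.cern.ch/SLS/XML/static"}

# Hypothetical static description; only the <dependencies> fragment
# comes from the slide, the surrounding layout is assumed.
SAMPLE_STATIC_XML = """
<service xmlns="http://sls.cern.ch/SLS/XML/static">
  <id>MYSERVICE</id>
  <dependencies>
    <dependency level="dependson">AFS</dependency>
    <dependency level="uses">Castor</dependency>
  </dependencies>
</service>
"""


def read_dependencies(static_xml):
    """Return (service id, [(level, target service), ...])."""
    root = ET.fromstring(static_xml)
    service_id = root.findtext("s:id", namespaces=STATIC_NS)
    deps = [(dep.get("level"), (dep.text or "").strip())
            for dep in root.iterfind("s:dependencies/s:dependency", STATIC_NS)]
    return service_id, deps


def reverse_dependencies(deps_by_service):
    """Invert {service: [(level, target)]} into {target: [(level, service)]}."""
    reverse = defaultdict(list)
    for service_id, deps in deps_by_service.items():
        for level, target in deps:
            reverse[target].append((level, service_id))
    return dict(reverse)


service_id, deps = read_dependencies(SAMPLE_STATIC_XML)
reverse = reverse_dependencies({service_id: deps})
print(service_id, deps)
print(reverse)
```

The reverse map answers questions such as "if AFS is in maintenance, which services are affected?" by looking up the `dependson` entries under `AFS`.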
Status thresholds
[Figure: availability scale from 0 to 100 with thresholds marked at 30, 70 and 80]
…
<availabilitythresholds>
  <threshold level="available">80</threshold>
  <threshold level="affected">70</threshold>
  <threshold level="degraded">30</threshold>
</availabilitythresholds>
…
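The mapping from an availability number to a status can be sketched as follows, using the threshold values from this slide. This is an assumption-laden illustration rather than SLS code: the function name is hypothetical, and treating a value exactly equal to a threshold as belonging to the better status is a guess, since the slides do not say how boundaries are handled.

```python
# Thresholds from the slide's <availabilitythresholds> example.
THRESHOLDS = {"available": 80, "affected": 70, "degraded": 30}


def status_from_availability(availability, thresholds=THRESHOLDS):
    """Map an availability value (0..100) to a status name."""
    if not 0 <= availability <= 100:
        raise ValueError("availability must be between 0 and 100")
    # Assumption: a value equal to a threshold counts as meeting it.
    if availability >= thresholds["available"]:
        return "available"
    if availability >= thresholds["affected"]:
        return "affected"
    if availability >= thresholds["degraded"]:
        return "degraded"
    return "not available"


for value in (95, 75, 50, 13):
    print(value, status_from_availability(value))
```

Because thresholds are set per service, the same availability can give a different status elsewhere: with an `available` threshold of 90, an availability of 87 would be reported as affected rather than available.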
Minimal update XML example
<?xml version="1.0" encoding="utf-8"?>
<serviceupdate xmlns="http://sls.cern.ch/SLS/XML/update">
  <id>CVS</id>
  <availability>100</availability>
  <timestamp>2006-03-14T14:20:27+01:00</timestamp>
</serviceupdate>
Example of availability update XML with more information: https://twiki.cern.ch/twiki/bin/view/FIOgroup/SLSManualForSM#Update_XML_with_more_information
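A minimal update XML of this shape can be produced with the Python standard library alone. The sketch below is one possible way to do it, not an official SLS tool: `build_update_xml` is a hypothetical helper, and only the three elements shown in the minimal example are emitted.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

UPDATE_NS = "http://sls.cern.ch/SLS/XML/update"


def build_update_xml(service_id, availability, when=None):
    """Return a minimal update XML document as UTF-8 bytes."""
    if not 0 <= availability <= 100:
        raise ValueError("availability must be between 0 and 100")
    if when is None:
        when = datetime.now(timezone.utc)
    # Writing xmlns as a plain attribute puts all children in that namespace.
    root = ET.Element("serviceupdate", {"xmlns": UPDATE_NS})
    ET.SubElement(root, "id").text = service_id
    ET.SubElement(root, "availability").text = str(availability)
    # Timezone-aware ISO 8601 timestamp, as in the slide's example.
    ET.SubElement(root, "timestamp").text = (
        when.replace(microsecond=0).isoformat())
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


# Reproduce the values from the slide: 14:20:27 on 2006-03-14, UTC+01:00.
example_time = datetime(2006, 3, 14, 14, 20, 27,
                        tzinfo=timezone(timedelta(hours=1)))
update_xml = build_update_xml("CVS", 100, example_time)
print(update_xml.decode("utf-8"))
```

The availability value itself still has to come from the service manager's own measurement code; this helper only wraps it in XML.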
Making update XML accessible via HTTP • Generate update XMLs with any server-side language / technology / platform: • PHP, Perl, Python, CGI, ASP • .Net: C#, J2EE: Servlets, JSP • or: Refresh a file periodically (from a cron job) and make it available via HTTP • or: Write a Lemon sensor providing service availability • Advice and examples in the SLS Manual for Service Managers
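For the cron-based option, one detail worth handling is that the web server should never serve a half-written file. The sketch below writes to a temporary file and renames it into place; it is an illustration under assumptions, not part of SLS: `publish_atomically` is a hypothetical helper, and the demo writes to a temporary directory rather than a real web server document root.

```python
import os
import tempfile

# Sample content copied from the minimal update XML example.
SAMPLE_UPDATE = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b'<serviceupdate xmlns="http://sls.cern.ch/SLS/XML/update">'
    b'<id>CVS</id><availability>100</availability>'
    b'<timestamp>2006-03-14T14:20:27+01:00</timestamp>'
    b'</serviceupdate>\n'
)


def publish_atomically(xml_bytes, target_path):
    """Write xml_bytes to target_path so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(target_path))
    # The temporary file lives in the same directory so that the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(xml_bytes)
        os.chmod(tmp_path, 0o644)  # make it readable by the web server
        os.replace(tmp_path, target_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Demo in a throwaway directory; a real cron job would target a path
# served by the web server instead.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "update.xml")
    publish_atomically(SAMPLE_UPDATE, demo_path)
    with open(demo_path, "rb") as published_file:
        published = published_file.read()
    leftovers = os.listdir(demo_dir)

print(leftovers)
```

A crontab entry would then run a script like this at the desired refresh interval; the interval is a choice for the service manager.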
Observations • Trusting service managers • there is no way to cross-check availability figures provided by services • User expectations • Is it really real-time? • My mailbox/CVS repository/J2EE container doesn’t work, but the service is green! • Surprisingly, convincing service managers to join in was not that difficult
Summary • SLS shows availability and status of services as seen by users • SLS is a flexible and informative display covering the entirety of computing services • SLS collects and displays information provided by the services • SLS is available for use outside CERN
Thank you! SLS instance at CERN (password protected): http://cern.ch/SLS • Sebastian.Lopienski@cern.ch • Questions?