
The Performance and Exception Monitoring Project

Tim Smith, IT/PDP. Contents: requirements, current systems' inadequacies, views and global metrics, GQM and correlations, framework, scalability issues, project status, tools survey, details from Alessandro…





Presentation Transcript


  1. The Performance and Exception Monitoring Project. Tim Smith, IT/PDP

  2. Contents • Requirements • Current systems' inadequacies • Views + global metrics • GQM + correlations • Framework • Scalability issues • Project status • Tools survey • Details from Alessandro… Tim Smith: FNAL workshop

  3. Current systems' inadequacies • Independent alarm/monitoring systems • A system snapshot requires multiple displays • Independent agents which monitor local / monitor remote / restart / alarm • The same information is calculated several times and used differently • Host-based, so no correlations • Hosts complain about the perceived problem, not the real one • Operator only follows precise instructions • Automation! (+ manual Remedy entry) • Separate static config DBs for alarms and machines

  4. Visions of the Future • One tool, many purposes… Views: • End-to-end, user, sysadmin, resource planning • 1000s of PCs per cluster • Living with failures + scalable solutions! • Assure a service: a quorum of machines, NOT the full complement • High-level correlations; impact on a service • Quality of Service measures; global metrics
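
The "quorum of machines, not the full complement" idea can be illustrated as a service-level check that tolerates individual node failures. This is a minimal Python sketch, assuming each node already reports a single healthy/unhealthy flag; `service_status`, `ServiceStatus`, `min_healthy` and the node names are hypothetical and not part of PEM.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ServiceStatus:
    healthy: int      # nodes currently passing their checks
    total: int        # full complement of nodes in the service
    min_healthy: int  # quorum: nodes needed to assure the service (hypothetical)

    @property
    def assured(self) -> bool:
        # The service is assured while the quorum holds, even if some nodes are down.
        return self.healthy >= self.min_healthy


def service_status(node_health: Mapping[str, bool], min_healthy: int) -> ServiceStatus:
    """Summarise per-node health into one service-level verdict (hypothetical API)."""
    if not 0 < min_healthy <= len(node_health):
        raise ValueError("min_healthy must be between 1 and the number of nodes")
    healthy = sum(1 for ok in node_health.values() if ok)
    return ServiceStatus(healthy=healthy, total=len(node_health), min_healthy=min_healthy)


# Hypothetical cluster: one node down out of four, quorum of three.
status = service_status(
    {"node01": True, "node02": False, "node03": True, "node04": True}, min_healthy=3
)
print(status.assured, f"{status.healthy}/{status.total}")
```

In a design like this, an alarm would be raised on the service when `assured` turns false, rather than on every host that fails.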

  5. Global Metrics • Honour service definitions • “Availability of usable 3000 CUs of batch” • Machines up + FATMEN + LSF + licence servers • “Availability of an interactive facility” • ASIS available + low response time for trivial commands • “Job turnaround time expectations” • “Time to service a tape request” + disk/network bandwidths + CPU/memory utilisations
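
The first two global metrics combine several lower-level checks into one service-level answer. A minimal Python sketch, assuming each dependency (FATMEN, LSF, licence servers, ASIS) has already been reduced to a boolean and each batch machine reports its capacity in CUs; the function names, `BatchNode` and the response-time threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BatchNode:
    name: str
    up: bool
    capacity_cu: float  # CPU capacity in CUs (illustrative values)


def batch_available(
    nodes: Iterable[BatchNode],
    fatmen_ok: bool,
    lsf_ok: bool,
    licence_servers_ok: bool,
    required_cu: float = 3000.0,
) -> bool:
    """Global metric: "availability of usable 3000 CUs of batch"."""
    # Only machines that are up contribute usable capacity.
    usable_cu = sum(node.capacity_cu for node in nodes if node.up)
    # The metric holds only if the capacity AND every service dependency are there.
    return usable_cu >= required_cu and fatmen_ok and lsf_ok and licence_servers_ok


def interactive_available(
    asis_ok: bool, trivial_command_seconds: float, max_seconds: float = 1.0
) -> bool:
    """Global metric: "availability of an interactive facility"."""
    # max_seconds is a hypothetical threshold for "low response time".
    return asis_ok and trivial_command_seconds <= max_seconds
```

The point of the sketch is that a single failed dependency (for example LSF) makes the whole batch metric false even when every machine is up, which is what "honour service definitions" asks for.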

  6. Goal / Question / Metric • PDP services, e.g. monitor the quality of the Interactive Service • Sufficient nodes? • Low enough load? • Slow to respond to commands? • Contactable via network • Network daemons alive • No nologin file • Free ptys
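
The Goal / Question / Metric breakdown for the Interactive Service could be encoded as one goal, a few questions, and the metrics that answer them. A minimal Python sketch; it assumes (the slide does not say so) that the four per-node metrics decide whether a node counts towards "Sufficient nodes?", and all thresholds and names are hypothetical.

```python
from typing import Dict, Mapping

# Per-node metrics from the slide. Treating them as the checks that decide
# whether a node counts towards "Sufficient nodes?" is an assumption.
NODE_METRICS = ("contactable_via_network", "network_daemons_alive", "no_nologin", "free_ptys")


def node_usable(metrics: Mapping[str, bool]) -> bool:
    """A node is usable only if every per-node metric passes; missing ones fail."""
    return all(metrics.get(name, False) for name in NODE_METRICS)


def interactive_service_report(
    per_node: Mapping[str, Mapping[str, bool]],
    load_average: float,
    trivial_command_seconds: float,
    min_nodes: int,
    max_load: float,
    max_response_seconds: float,
) -> Dict[str, bool]:
    """Goal: monitor the quality of the Interactive Service.

    Keys are the slide's questions; a value of True means the service is
    healthy with respect to that question. All thresholds are hypothetical.
    """
    usable = sum(1 for metrics in per_node.values() if node_usable(metrics))
    return {
        "Sufficient nodes?": usable >= min_nodes,
        "Low enough load?": load_average <= max_load,
        # True here means the service is NOT slow to respond.
        "Slow to respond to commands?": trivial_command_seconds <= max_response_seconds,
    }
```

The goal is then met only when every answer is healthy, e.g. `all(report.values())`, and a failing question points directly at the metrics behind it.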

  7. Correlations • Examples: • Web server on “SUN cluster” • Interactive Service
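
One way to read "correlations" is to use service dependencies to separate the real problem from the hosts that merely complain about it. A minimal Python sketch, assuming an acyclic dependency map and a set of components currently in alarm; `correlate`, the topology and the component names are hypothetical, and it treats every listed dependency as required (a real model might combine this with a quorum).

```python
from typing import Dict, List, Set, Tuple


def transitive_dependencies(component: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Everything `component` needs, directly or indirectly (iterative DFS)."""
    seen: Set[str] = set()
    stack = list(depends_on.get(component, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, []))
    return seen


def correlate(depends_on: Dict[str, List[str]], failing: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Split raw alarms into probable root causes and impacted components.

    A failing component is a root cause only if nothing it depends on is also
    failing; everything that fails or depends on a failure is impacted.
    Assumes the dependency graph has no cycles.
    """
    components = set(depends_on) | set(failing)
    for deps in depends_on.values():
        components.update(deps)
    closure = {c: transitive_dependencies(c, depends_on) for c in components}
    root_causes = {c for c in failing if not closure[c] & failing}
    impacted = {c for c in components if c in failing or closure[c] & failing}
    return root_causes, impacted


# Hypothetical topology: a web service on two SUN cluster nodes sharing one network.
topology = {"web-service": ["sun01", "sun02"], "sun01": ["network"], "sun02": ["network"]}
roots, impacted = correlate(topology, failing={"sun01", "sun02", "network"})
print(sorted(roots), sorted(impacted))
```

Here both SUN nodes raise alarms, but only the network is reported as the cause, while the web service is flagged as impacted.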

  8. Framework Diagram

  9. Scalability • Avoid bottlenecks by allowing multiple instances of every component • Guiding principle: do not let “possible” performance worries constrain the PEM design
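
Allowing a multiplicity of every component means, for example, that monitored hosts can be shared among several collectors instead of reporting to one central process. A minimal Python sketch using rendezvous (highest-random-weight) hashing, a technique chosen here for illustration and not taken from the PEM design; the collector and host names are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List, Sequence


def _weight(collector: str, host: str) -> int:
    # Deterministic pseudo-random score for a (collector, host) pair.
    digest = hashlib.sha256(f"{collector}|{host}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def assign_collector(host: str, collectors: Sequence[str]) -> str:
    """Rendezvous hashing: the collector with the highest score for this host wins."""
    if not collectors:
        raise ValueError("at least one collector is required")
    return max(collectors, key=lambda collector: _weight(collector, host))


def partition(hosts: Iterable[str], collectors: Sequence[str]) -> Dict[str, List[str]]:
    """Spread monitored hosts over several collectors instead of one central one."""
    assignment: Dict[str, List[str]] = {collector: [] for collector in collectors}
    for host in hosts:
        assignment[assign_collector(host, collectors)].append(host)
    return assignment
```

With this scheme, removing a collector only moves the hosts it was handling; every other host keeps its collector, so adding or losing an instance does not reshuffle the whole cluster.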

  10. Project Status • Approved as a divisional project • Interest in EFF and GRID projects • Documents produced: • User Requirements • Tools survey • Goal / Question / Metric • Analysis (end April) • Design (end May) • http://cern.ch/proj-pem > Progress > Analysis

  11. Tools Survey • Enterprise / Cluster Management • Tivoli, Unicenter TNG, Patrol, PCP, SCADA, Alinka, SCMS, MosixMON • Public Domain Tools • MAT, GAP, Ranger (SLAC), VAMOS (DESY), rls (IN2P3) • Building blocks • SNMP (Scotty, Advent, MRTG, UCD), JDMK • PIKT, NetLogger, bonobo
