
Monitoring: Grid, Fabric, Network


Presentation Transcript


  1. Monitoring: Grid, Fabric, Network
  Jennifer M. Schopf, Argonne National Lab
  PPDG Review, 28 April 2003, Fermilab

  2. Monitoring and PPDG
  • Many monitoring tools are currently available
    • Different use cases
    • Different strengths
    • Legacy systems
  • Much of PPDG's monitoring work is done by non-funded collaborators:
    • Les Cottrell, SLAC: IEPM-BW
    • Iosif Legrand, Caltech: MonALISA
    • Brian Tierney, LBNL: NetLogger, PyGMA, NTAF
    • Wisconsin group: Hawkeye

  3. Tools in a nutshell

  4. PPDG Role in Monitoring
  • Deployment and evaluation
    • Use on production testbeds
    • Requirements fed back to developers
    • Additional information sources
    • Realistic use cases
  • Furthering of interoperability goals
    • GLUE schema
    • Common interfaces

  5. Deployment

  6. Interoperability between Efforts
  X = currently available, U = under consideration

  7. Overview
  • Examples of interfacing between tools
    • STAR use of Ganglia/MDS
    • Ganglia extension in ATLAS
    • MonALISA interfaces to Hawkeye and MDS in CMS
  • Scalability analysis
  • Some future steps

  8. Ganglia-MDS Interface: STAR Efforts and Use
  Stratos Efstathiadis, BNL
  • Developed a modified version of the Ganglia information provider (see the sketch below)
    • Perl-based
    • Matches the current GLUE CE schema
    • Can connect to either the Ganglia Meta Daemon or the Ganglia Monitoring Daemon
    • Simpler and more flexible
  • Currently being tested at PDSF and BNL
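
A minimal sketch of what such an information provider does, for illustration only (the actual STAR provider is written in Perl): it reads the XML that a Ganglia daemon serves on its TCP port (8649 by default) and re-emits per-host metrics as LDIF attributes for MDS. The attribute names below are GLUE-flavored placeholders, not the exact schema names.

    import socket
    import xml.etree.ElementTree as ET

    GMOND_HOST, GMOND_PORT = "localhost", 8649  # gmond's default XML port

    def read_ganglia_xml(host, port):
        # gmond dumps its full XML state to any client that connects.
        chunks = []
        with socket.create_connection((host, port)) as sock:
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
        return b"".join(chunks)

    def emit_glue_ldif(xml_bytes):
        # Map a few per-host Ganglia metrics onto GLUE-flavored attributes
        # (placeholder names, chosen for illustration).
        root = ET.fromstring(xml_bytes)
        for host in root.iter("HOST"):
            metrics = {m.get("NAME"): m.get("VAL") for m in host.iter("METRIC")}
            print(f"dn: GlueHostName={host.get('NAME')}, mds-vo-name=local, o=grid")
            print(f"GlueHostLoadOne: {metrics.get('load_one', 'unknown')}")
            print(f"GlueHostMemFree: {metrics.get('mem_free', 'unknown')}")
            print()

    if __name__ == "__main__":
        emit_glue_ldif(read_ganglia_xml(GMOND_HOST, GMOND_PORT))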

  9. Ganglia Extensions in ATLAS
  • Monitor cluster health
  • The information added through Ganglia creates an additional level, combining different clusters into a “metacluster”

  10. MonALISA in CMS
  • MonALISA (Caltech)
    • Dynamic information/resource discovery using intelligent agents
    • Java/Jini, with interfaces to SNMP, MDS, Ganglia, and Hawkeye
    • WSDL/SOAP with UDDI
  • Aim to incorporate it into a “Grid Control Room” service
  • Integration with MDS and Hawkeye

  11. Scalability Comparison of MDS, R-GMA, and Hawkeye
  • Zhang, Freschl, and Schopf, “A Performance Study of Monitoring and Information Services for Distributed Systems”, to appear in HPDC 2003
  • How many users can query an information server at a time? (see the load-test sketch below)
  • How many users can query a directory server?
  • How does an information server scale with the amount of data in it?
  • How does an aggregator scale with the number of information servers registered to it?
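
The first question above can be made concrete with a small load-test sketch, assuming a hypothetical HTTP query endpoint (the study itself queried MDS2, R-GMA, and Hawkeye through their own protocols): N concurrent "users" query the server repeatedly and the achieved throughput is recorded.

    import time
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    INFO_SERVER = "http://infoserver.example.org:8080/query"  # hypothetical endpoint
    USERS = 50              # concurrent users
    QUERIES_PER_USER = 20

    def one_user(_):
        # Each simulated user issues its queries back to back.
        ok = 0
        for _ in range(QUERIES_PER_USER):
            try:
                with urllib.request.urlopen(INFO_SERVER, timeout=10) as resp:
                    resp.read()
                    ok += 1
            except OSError:
                pass    # failed queries simply don't count toward throughput
        return ok

    start = time.time()
    with ThreadPoolExecutor(max_workers=USERS) as pool:
        completed = sum(pool.map(one_user, range(USERS)))
    elapsed = time.time() - start
    print(f"{completed} queries in {elapsed:.1f}s "
          f"({completed / elapsed:.1f} queries/s with {USERS} users)")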

  12. Overall Results
  • Performance can be a matter of deployment
    • Effect of background load
    • Effect of network bandwidth
  • Performance can be affected by the underlying infrastructure
    • LDAP/Java strengths and weaknesses
  • Performance can be improved using standard techniques
    • Caching, multi-threading, etc. (caching sketched below)
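
As a sketch of the caching technique mentioned above (a generic illustration, not code from any of these tools), a small time-to-live cache lets an information server answer repeated queries without re-running the underlying probe each time:

    import time

    class TTLCache:
        """Answer repeated queries from a time-bounded cache."""

        def __init__(self, ttl_seconds=30.0):
            self.ttl = ttl_seconds
            self._store = {}    # key -> (timestamp, value)

        def get(self, key, fetch):
            # Return the cached value if still fresh; otherwise call
            # fetch() to re-collect it and remember the result.
            now = time.time()
            hit = self._store.get(key)
            if hit is not None and now - hit[0] < self.ttl:
                return hit[1]
            value = fetch()
            self._store[key] = (now, value)
            return value

With this in place, a call such as cache.get("cpu_load", collect_cpu_load) runs the (hypothetical) collect_cpu_load probe at most once per TTL window, no matter how many clients ask.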

  13. MonALISA Performance
  Test: a large SNMP query (~200 metric values) on a 500-node farm every 60 s; ~1600 metric values collected per second from one MonALISA service. CPU usage measured on a Dell I8100 (~1 GHz); run on the “lxshare” cluster at CERN (~600 nodes).
  [Charts of CPU, thread, and IO usage not reproduced in the transcript]

  14. Future: OGSA and Monitoring
  • The Open Grid Services Architecture (OGSA) defines standard interfaces and behaviors for distributed system integration, especially:
    • A standard XML-based service information model
    • Standard interfaces for push- and pull-mode access to service data
    • Notification and subscription (see the sketch below)
  • Every service has its own service data
  • OGSA provides a common mechanism to expose a service instance’s state data to service requestors for query, update, and change notification
  • Monitoring data is “baked right in”
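
The push- and pull-mode access and the notification/subscription interfaces can be pictured with a short sketch. This shows the pattern OGSA standardizes, not the OGSA API itself:

    from typing import Callable, Dict, List

    class ServiceData:
        """A service instance's state data, exposed to requestors."""

        def __init__(self):
            self._data: Dict[str, object] = {}
            self._subscribers: Dict[str, List[Callable]] = {}

        def query(self, name):
            # Pull mode: a requestor asks for a service data element.
            return self._data.get(name)

        def subscribe(self, name, callback):
            # A requestor registers interest in change notifications.
            self._subscribers.setdefault(name, []).append(callback)

        def update(self, name, value):
            # Push mode: a change is propagated to all subscribers.
            self._data[name] = value
            for cb in self._subscribers.get(name, []):
                cb(name, value)

    if __name__ == "__main__":
        sd = ServiceData()
        sd.subscribe("free_cpus", lambda k, v: print(f"notified: {k} = {v}"))
        sd.update("free_cpus", 42)      # pushes to the subscriber
        print(sd.query("free_cpus"))    # pulls on demand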

  15. OGSA-Compatible Monitoring
  • MDS3
    • Part of GT3, the OGSA reference implementation
    • The release will include full data in the GLUE schema for the CE; service data from RFT, RLS, and GRAM; GridFTP server data; and software version and path data
    • The simplest higher-level service is the caching index service, much like the GIIS in MDS 2.x
  • MonALISA
    • Will be compatible with OGSI-spec registration/subscription services
    • Plans adapters that can interface to OGSI service data
  • LBNL tools are also adapting to the OGSI spec

  16. Future Work - Interoperability
  • Efforts will continue to make the tools interoperate more
  • Many tools already have the hooks to do this; it’s just a matter of filling in the slots
  • We need a better understanding of the requirements coming from the applications

  17. Summary
  • Many monitoring solutions are in use by different experiments
  • Additional experience is leading toward common uses and deployments
  • Work is ongoing toward common tools, common schemas, and naming conventions
  • We still need better identification of requirements, and the involvement of application groups, to build a common, consistent infrastructure together

  18. Additional Details

  19. GLUE-Schema Effort
  • Part of the HICB/JTB GLUE framework
  • Addresses the need for common schemas between projects
    • Framework-independent
    • Something to translate into, not a requirement within the fabric layer
  • Mailing list: glue-schema@hicb.org
  • www.hicb.org/glue/glue-schema/schema.html

  20. GLUE Schema Status
  • Compute Element (CE) schema:
    • Currently being used in EDG (MDS) and MDS2
    • A couple of minor omissions were found; they will be added in the next version
    • Will be in MDS-3
  • SE schema:
    • Lots of good discussion at CHEP to finalize this
    • Will start to be used in the EDG (R-GMA) testbed 2 later this month
  • NE schema:
    • Merged ideas from EDG (UK group) and DataTAG (Italian group)
    • The GGF NM-WG is now working on this too

  21. Globus MDS2: Monitoring and Discovery Service
  • MDS has been accepted as the core software for monitoring and presentation of information at the Grid level
  • A GIIS was set up as part of the collaboration with iVDGL (an example query follows this slide)
    • Presents an overall picture of the state of the Grid sites
  • Work is continuing on interfacing it to local monitoring systems
    • Each site/experiment has its preferred local solutions
    • The GLUE schema was needed to make this happen
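
Because MDS2 is LDAP-based, querying a GIIS looks like an ordinary LDAP search. Below is a sketch using the python-ldap package, with a placeholder host name; 2135 is the customary MDS2 port and "mds-vo-name=local, o=grid" the usual search base, though the GLUE class and attribute names shown should be checked against the deployed schema:

    import ldap    # the python-ldap package

    server = ldap.initialize("ldap://giis.example.org:2135")  # placeholder host
    results = server.search_s(
        "mds-vo-name=local, o=grid",    # typical MDS2 search base
        ldap.SCOPE_SUBTREE,
        "(objectClass=GlueCE)",         # GLUE compute-element entries
    )
    for dn, attrs in results:
        print(dn, attrs.get("GlueCEStateFreeCPUs"))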

  22. MDS-3 in the June Release
  • All the data currently in core MDS-2
    • Full data in the GLUE schema for the CE
    • Service data from RFT, RLS, and GRAM
    • GridFTP server data; software version and path data
  • The simplest higher-level service is the caching index service
    • Much like the GIIS in MDS 2.x
    • Will have configurability like a GIIS hierarchy
    • Will also have PHP-style scripts, much as available today

  23. MonALISA Current Status
  • MonALISA has been running for several months at all the US-CMS production sites and at CERN. It has proved to be stable and scalable (at CERN it is monitoring ~600 nodes)
  • It is used to monitor several major Internet connections (CERN-US, CERN-Geant, Taiwan-Chicago, the DataTAG link, …)
  • MonALISA is a prototype service under development. It is based on the code-mobility paradigm, which provides the mechanism for consistent, dynamic invocation of components in large, distributed systems
  • http://monalisa.cern.ch/MONALISA

  24. Hawkeye
  • Developed by the Condor group
  • Focus: automatic problem detection
  • The underlying infrastructure builds on Condor ClassAd technology
    • The Condor ClassAd language identifies resources in a pool
    • ClassAd matchmaking, which executes jobs based on the attribute values of resources, is used to identify problems in a pool (see the sketch below)
    • The schema-free representation allows users to easily add new types of information to Hawkeye
  • Information probes run on individual cluster nodes and report to a central collector
    • Easy to add new information probes
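
As a rough Python stand-in for the matchmaking idea (ClassAds have their own expression language; this is not ClassAd syntax), problem detection amounts to evaluating a predicate over each node's attribute ad:

    # Hypothetical attribute ads reported by Hawkeye probes on two nodes.
    nodes = [
        {"Name": "node01", "LoadAvg": 0.4, "FreeDiskMB": 120_000},
        {"Name": "node02", "LoadAvg": 9.7, "FreeDiskMB": 310},
    ]

    # A "problem" written as a predicate over node attributes, analogous
    # to a ClassAd Requirements expression such as: other.FreeDiskMB < 1000
    def disk_nearly_full(ad):
        return ad["FreeDiskMB"] < 1_000

    for ad in nodes:
        if disk_nearly_full(ad):
            print(f"ALERT: {ad['Name']} is low on disk ({ad['FreeDiskMB']} MB free)")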

  25. Hawkeye Recent Accomplishments
  • A release candidate for version 1.0 has been released
  • Used to monitor the USCMS testbed
  • Used to monitor the University of Wisconsin-Madison Condor pool

  26. PingER
  High-Performance Network Research (SciDAC/Base). PingER: active end-to-end performance monitoring for the Research and Education communities. PI: Les Cottrell, SLAC.

  Novel ideas:
  • Low-impact network performance measurements to most of the Internet-connected world, providing delay, loss, and connectivity information over long time periods
  • Network AND application high-throughput performance measurements, allowing comparisons and identification of bottlenecks
  • Continuous, robust measurement, analysis, and web-based reporting of results, available worldwide
  • Simple infrastructure enabling rapid deployment, location within an application host, and local site management to avoid security issues

  Tasks:
  • Develop/deploy a simple, robust, ssh-based active end-to-end measurement and management infrastructure (a probe sketch follows this slide)
  • Develop analysis/reporting tools
  • Integrate new application and network measurement tools into the infrastructure
  • Compare and validate the various tools, and determine their regions of applicability

  Milestones (Mon/Yr planned; completion noted where done):
  • Infrastructure development
    • Develop simple window-tuning tool: 08/01 (done 08/01)
    • Initial infrastructure developed: 12/01 (done 12/01)
    • Infrastructure installed at one site: 01/02 (done 01/02)
    • Improve and extend infrastructure: 06/02
    • Deploy at a 2nd site: 08/02
    • Evaluate GIMI/DMF alternatives: 10/02
    • Extend deployment to PPDG sites: 03/03
  • Develop analysis/reporting tools
    • First version for standard apps: 02/02
  • Integrate new apps & net tools
    • GridFTP and demo: 05/05
    • INCITE tools: 08/02
    • BW measurement tools (e.g. pathload): 01/03
  • Compare & validate tools
    • GridFTP: 09/02
    • BW tools: 04/03

  Impact:
  • Increase network and Grid application bulk throughput over high-delay, high-bandwidth networks (like DOE’s ESnet)
  • Provide troubleshooting information for network operators and users by identifying the onset and magnitude of performance changes, and whether they appear in the application or the network
  • Provide a network performance database, analysis, and navigable reports from active monitoring

  Connections:
  • SciDAC: High Energy Nuclear Physics, Bandwidth Estimation, Data Grid, INCITE
  • Base: Network Monitoring, Data Grid, Transport Protocols

  www-iepm.slac.stanford.edu (date prepared: 1/7/02)
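
A hypothetical sketch of a PingER-style low-impact probe (not PingER's actual code): ping each monitored host periodically and record loss and average round-trip time, essentially the delay/loss time series the project reports. The flags and output parsing assume a Unix ping.

    import re
    import subprocess
    import time

    SITES = ["www-iepm.slac.stanford.edu"]   # example monitored host
    INTERVAL = 30 * 60                       # seconds between probe cycles

    def probe(host, count=10):
        # Run the system ping and parse packet loss and average RTT
        # from its summary lines.
        out = subprocess.run(["ping", "-c", str(count), host],
                             capture_output=True, text=True).stdout
        loss = re.search(r"(\d+(?:\.\d+)?)% packet loss", out)
        rtt = re.search(r"= [\d.]+/([\d.]+)/", out)   # min/avg/max summary
        return (float(loss.group(1)) if loss else 100.0,
                float(rtt.group(1)) if rtt else None)

    while True:
        for host in SITES:
            loss, avg_rtt = probe(host)
            print(f"{time.strftime('%F %T')} {host} loss={loss}% avg_rtt={avg_rtt} ms")
        time.sleep(INTERVAL)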

  27. IEPM-BW Status
  • Now measuring to about 55 sites (mainly Grid, HENP, and major networking sites)
  • 10 measuring sites in 5 countries; 5 are in production
  • Data and analyzed results are available at http://www.slac.stanford.edu/comp/net/bandwidth-tests/antonia/html/slac_wan_bw_tests.html
  • PingER results have been plugged into MDS
  • IEPM-BW and PingER data are available via web services; we are aligning the naming with the GGF NM-WG and emerging GGF schemas
  • We will incorporate and evaluate different tests (e.g. tsunami, GridFTP, UDPmon, new bandwidth estimators, a new quick iperf)
  • We are also focusing on making the data useful, working with the Internet2 PiPES project on long- and short-term predictions and troubleshooting

  28. SAM-Grid
