Monitoring and Accounting Dave Kant CCLRC e-Science Centre, UK GridPP 12 Jan 31st - Feb 1st 2005
Overview • GOC Database • Monitoring Tools • Accounting • Issues • Future Plans
GOC Database • What features? • Configuration of monitoring tools • Security • Organisations • Administrative Roles • Replication • What role will it play in the future? • New site registration procedure • BDII generation
GRID Configuration Database • Secure database management via HTTPS / X.509 • Stores a subset of the Grid Information System (IS): people, contact information, resources, maintenance bit • Monitoring Services • Operations Maps • Configure other Tools • Resource Provider • Organisation Structures • Secure services • Site News • Self Certification • Accounting • GOC DB can also contain information that is not present in the IS, such as scheduled maintenance, news, organisational structures, and geographic coordinates for maps. • [Diagram labels: GOCDB, GridSite, MySQL server, SQL, https, Resource Centre (RC), Resources & Site Information, EDG, LCG-1, LCG-2, …, bdii, ce, se, rb]
EGEE ROC Structure • EGEE is made up of regions. • Each region contains many computing centres. • Regional Operations Centres (ROCs) are a focus for operational activities.
Organisational Structures • Developed a tool to manage organisational structures. • Modelled on the GridPP Tier1/2 structure (materialised path encoding) • Provides ROCs with a package to monitor the resources in their region • Tailored monitoring • Administrative roles for the coordinators in GOCDB
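The materialised path encoding named on this slide can be sketched as follows. Each node stores the full path from the root, so everything below a region or a Tier-2 is found with a simple prefix match (in SQL, a `LIKE 'prefix/%'` query). The path separator, the tier names and the site names are illustrative assumptions, not the actual GOCDB schema.

```python
# Hypothetical organisation tree stored as materialised paths.
# The "/" separator and all node names are assumptions for illustration.
PATHS = [
    "EGEE",
    "EGEE/UKI",
    "EGEE/UKI/GridPP",
    "EGEE/UKI/GridPP/Tier1",
    "EGEE/UKI/GridPP/Tier1/SiteA",
    "EGEE/UKI/GridPP/Tier2-X",
    "EGEE/UKI/GridPP/Tier2-X/SiteB",
    "EGEE/UKI/GridPP/Tier2-X/SiteC",
]


def descendants(paths, prefix):
    """Return every path that lies strictly below `prefix`."""
    root = prefix.rstrip("/") + "/"
    return sorted(p for p in paths if p.startswith(root))


def ancestors(path):
    """Return the chain of parent paths, from the root downwards."""
    parts = path.split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts))]


if __name__ == "__main__":
    print(descendants(PATHS, "EGEE/UKI/GridPP/Tier2-X"))
    print(ancestors("EGEE/UKI/GridPP/Tier1/SiteA"))
```

The appeal of this encoding is that a coordinator's view ("all resources in my Tier-2") needs no recursive query, only a prefix comparison.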
Adaptive Job Brokering Based on the Monitoring System • Generation of the BDII configuration file via feedback into the IS • Monitoring services: GOC DB site info, Gstat data, Site Functional Tests, GOC hourly tests • 100s of sites; environments: Production, VO, GridPP, … • [Diagram labels: Total List of all sites, RGMA, GOC Bit, Sites pass core tests, Black List, White List, Trusted Sites, BDII] • Total List of all sites is derived from GOCDB (via RGMA) • GOC bit: sites which have opted out, e.g. for scheduled maintenance • White List: sites that failed one or more core tests but are well supported are put back in, e.g. a Tier1 site • Core tests are a subset of the site functional tests run by CERN every day • Black List: sites that are not trusted
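The selection rules on this slide can be sketched as set operations. This is a minimal illustration, not the GOC implementation: the function name and site names are hypothetical, and two precedence choices are assumptions (an opted-out site stays out even if white-listed, and the black list overrides everything else).

```python
def select_trusted_sites(all_sites, opted_out, passed_core_tests,
                         white_list, black_list):
    """Return the sorted list of sites to write into the BDII configuration.

    Assumed precedence (not stated on the slide):
      1. sites with the GOC bit set (opted out) are dropped first;
      2. a remaining site is kept if it passed the core tests
         or if it is on the white list;
      3. black-listed sites are always removed.
    """
    candidates = set(all_sites) - set(opted_out)
    kept = (candidates & set(passed_core_tests)) | (candidates & set(white_list))
    return sorted(kept - set(black_list))


if __name__ == "__main__":
    sites = ["SiteA", "SiteB", "SiteC", "SiteD", "SiteE"]
    print(select_trusted_sites(
        all_sites=sites,
        opted_out={"SiteB"},                      # scheduled maintenance
        passed_core_tests={"SiteA", "SiteB", "SiteE"},
        white_list={"SiteC"},                     # failed a test, well supported
        black_list={"SiteE"},                     # not trusted
    ))
```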
How Are New Sites Added? • JSPG have written a “Site Registration Policy & Procedure” document • https://edms.cern.ch/document/503198/ • New GOCDB portal to streamline the site registration process. • Parties involved: Site, ROC, GOCDB, EGEE • Site states: “candidate” site, “uncertified” site, “certified” site • Steps: site and ROC liaise; site installs middleware; certification testing
Replication • Two replicas, each with different security considerations • “Services” replica managed by Taipei • Direct connections to the database by the monitoring tools from known hosts • “Users” replica to be set up at IN2P3 • Web portal based on X.509 certificates • CIC on duty
Monitoring Tools • What are the main tools that are used in the day-to-day operations of the LCG Grid? • GPPMON • GSTAT • Site Functional Tests • Other monitoring tools exist, but I won’t discuss them here • GridICE
Operations Map: Job Submission Tests • GPPMON displays the results of tests against sites. • Test: Job Submission • The job is a simple test of the grid middleware components, e.g. the Gatekeeper service, the RB service, and the Information System via JDL requirements. This kind of test deals with the functional behaviour of core grid services: do simple jobs run? They are lightweight tests which run hourly. However, they have certain limitations, e.g. Dteam VO; WN reach (specialised monitoring queues).
Operations Map: Certificate Lifetime • GPPMON displays the results of tests against sites. • Test: Certificate Lifetime • Many grid services require a valid certificate for security. By probing the host certificates on CEs and SEs at sites with a simple SSL client, we can identify certificates which are due to expire and send an early warning to the sites concerned. A predictive tool!
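The expiry check behind this test can be sketched as below. The sketch covers only the date arithmetic: it assumes the certificate's `notAfter` string has already been obtained, for example from `SSLSocket.getpeercert()` in Python's standard `ssl` module. The 30-day threshold and the function names are assumptions, not values taken from GPPMON.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Days left before a certificate expires.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    e.g. "Feb 11 12:00:00 2005 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    now = now or datetime.now(timezone.utc)
    return (expires - now.timestamp()) / 86400.0


def needs_warning(not_after, threshold_days=30, now=None):
    """True if the certificate expires within `threshold_days` (assumed value)."""
    return days_until_expiry(not_after, now) < threshold_days


if __name__ == "__main__":
    check_time = datetime(2005, 2, 1, 12, 0, 0, tzinfo=timezone.utc)
    print(days_until_expiry("Feb 11 12:00:00 2005 GMT", now=check_time))
    print(needs_warning("Dec 31 12:00:00 2005 GMT", now=check_time))
```

Passing `now` explicitly keeps the check reproducible; an hourly probe would simply omit it.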
GIIS Monitor • Developed by Min Tsai (GOC Taipei) • Tool to display and check information published by the site GIIS (sanity checks, fault detection) • http://goc.grid.sinica.edu.tw/gstat/ • Regional Plot: http://map.gridpp.ac.uk
Site Certification Service • In terms of middleware, the installation and configuration of a site is quite a complicated procedure. • When there is a new release, sites don’t upgrade at the same time • Some upgrades don’t go smoothly • Unexpected things happen (who turned off the power?) • Day-to-day problems; robustness of service under load? • It’s necessary to actively hunt for problems • Site certification testing is run by the CERN deployment team on a daily basis. The first step toward providing this service involves running a series of replica manager tests which register files onto the grid, move them around, and delete them, plus 3rd party copies from a remote SE. • Unlike the simple job submission tests implemented in GPPMON, these tests are more heavyweight and attempt to simulate the life cycle of real applications.
Certification Test Results http://lcg-testzone-reports.web.cern.ch/lcg-testzone-reports/cgi-bin/listreports.cgi
Syndication of Monitoring Information • GOC generates RSS feeds which clients can pull using an RSS aggregator. • How can we integrate feeds and ticketing systems? • [Screenshot: RSSReader aggregator (Windows client)]
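One way a client, or a ticketing system, could consume such a feed is sketched below using Python's standard XML parser. The feed content, the item titles and the "FAIL" keyword are hypothetical; the real GOC feeds may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed; titles and wording are invented for illustration.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Hypothetical GOC test results</title>
    <item><title>SiteA: job submission OK</title></item>
    <item><title>SiteB: job submission FAIL</title></item>
  </channel>
</rss>"""


def items_matching(feed_xml, keyword):
    """Return the titles of feed items whose title contains `keyword`."""
    root = ET.fromstring(feed_xml)
    titles = (item.findtext("title", default="") for item in root.iter("item"))
    return [t for t in titles if keyword in t]


if __name__ == "__main__":
    # A ticketing integration might open one ticket per matching item.
    print(items_matching(SAMPLE_FEED, "FAIL"))
```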
Real Time Grid Monitor http://www.hep.ph.ic.ac.uk/e-science/projects/demo/index.html Why are jobs failing? Why are jobs queued at sites while others are empty? A Visualisation tool to track jobs currently running on the grid. Applet queries the logging and bookkeeping service to get information about grid jobs.
Problems with Existing Tools • There are many monitoring tools, and they share two weaknesses: - the information they generate is hidden away or difficult to access - interfaces are limited: the data can only be accessed in specific ways • It is therefore difficult to build “on-demand” services that let communities (“players”) interact with the data. • The idea is for the services to collect information and put it into a common repository such as an RGMA Archiver, so that the information can be shared and is accessible to all. • Services (in EGEE parlance, ROC and CIC services) process the data and present it to the community. • How much CPU is in the UKI ROC? • How much in GridPP? • How much in each Tier2? => Integrate data from different sources to provide this information
Monitoring Paradigm • A better way to unify monitoring information. • GOC Services collect information and publish it into an archiver. • ROC/CIC Services provide a means for the community to interact with this information on demand. • GOC provides services tailored to the requirements of the community. • [Diagram labels: Communities (VOs, ROCs, EGEE, Sites, Organisations); GOC Services (Monitoring, GSTAT, Testing, Accounting, Self Certification); Information Repository (RGMA); ROC Services; CIC Services]
Use Cases • Monitoring services which use RGMA as the backbone for data transport and data location via the registry service. • Grid Event Monitoring System • “Site Functional Test” Reporting Tool • Accounting
Use Cases - GEMS • Grid Event Monitoring System • List of resources to monitor is provided by GOCDB • Alert system that uses RGMA • Looks for changes of state in the monitoring data tables • Generates an alert and displays it on the GEMS console. • Notification features • Event filtering
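The "look for changes of state, then alert" behaviour can be sketched by comparing two snapshots of the monitoring data. This is only an illustration of the idea: the snapshot layout, state names and filter are assumptions, and the real system reads its data from RGMA tables rather than Python dictionaries.

```python
def detect_state_changes(previous, current):
    """Compare two {resource: state} snapshots.

    Returns a list of (resource, old_state, new_state) for every resource
    whose state differs between the snapshots. Resources seen for the
    first time are not reported (an assumption of this sketch).
    """
    alerts = []
    for resource in sorted(current):
        old = previous.get(resource)
        new = current[resource]
        if old is not None and old != new:
            alerts.append((resource, old, new))
    return alerts


def filter_alerts(alerts, interesting_states):
    """Event filtering: keep only alerts whose new state is of interest."""
    return [a for a in alerts if a[2] in interesting_states]


if __name__ == "__main__":
    before = {"ce.site-a": "OK", "se.site-a": "OK", "ce.site-b": "FAIL"}
    after = {"ce.site-a": "FAIL", "se.site-a": "OK", "ce.site-b": "OK"}
    changes = detect_state_changes(before, after)
    print(changes)
    print(filter_alerts(changes, {"FAIL"}))
```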
Reporting Tool Prototype Organisational Identities taken from GOCDB
Accounting • Information collected at each site from batch logs, gatekeeper logs, etc. • Information joined at site level to select grid jobs and stored in a database on the R-GMA MON box at the site. • Information published through R-GMA and collected centrally in an R-GMA archive at GOC • Web site presents various views of this data • Information schema based on GGF Usage Group • Structure of the Grid taken from GOC DB, the grid configuration database. • Only normalised CPU time collected (at the moment)
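The site-level join described above can be sketched as follows: batch records are matched against gatekeeper records, and only jobs present in both are counted as grid jobs. The field names (`local_job_id`, `vo`, `end_date`, `cpu_seconds`), the join key and the VO names are assumptions for illustration; they are not the actual log formats or the accounting schema.

```python
from collections import defaultdict


def select_grid_jobs(batch_records, gatekeeper_records):
    """Join batch and gatekeeper records on the (assumed) local job id.

    A batch job with no matching gatekeeper record is treated as a
    local, non-grid job and is dropped.
    """
    by_job = {g["local_job_id"]: g for g in gatekeeper_records}
    joined = []
    for b in batch_records:
        g = by_job.get(b["local_job_id"])
        if g is not None:
            joined.append({
                "vo": g["vo"],
                "month": b["end_date"][:7],   # "YYYY-MM" from "YYYY-MM-DD"
                "cpu_seconds": b["cpu_seconds"],
            })
    return joined


def sum_cpu_per_vo_month(jobs):
    """Sum CPU seconds per (VO, month)."""
    totals = defaultdict(int)
    for job in jobs:
        totals[(job["vo"], job["month"])] += job["cpu_seconds"]
    return dict(totals)


if __name__ == "__main__":
    batch = [
        {"local_job_id": "1", "cpu_seconds": 100, "end_date": "2005-01-15"},
        {"local_job_id": "2", "cpu_seconds": 50, "end_date": "2005-01-20"},
        {"local_job_id": "3", "cpu_seconds": 70, "end_date": "2005-01-21"},
    ]
    gatekeeper = [
        {"local_job_id": "1", "vo": "dteam"},
        {"local_job_id": "3", "vo": "vo_a"},
    ]
    print(sum_cpu_per_vo_month(select_grid_jobs(batch, gatekeeper)))
```

Normalisation of the CPU time is deliberately left out, since the slides do not state the formula used.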
GOC Accounting Services • http://goc.grid-support.ac.uk/gridsite/accounting/index.html • On-demand services to the EGEE community • Simple interface to customise views of data: VO, time frame and region (default = EGEE) • Views: BaseCpuSeconds aggregated across EGEE; each region, per VO, per month; each site, per VO, per month • Other distributions: normalised CPU, number of jobs
Web form to apply selection criteria on the data • Select date range • Select VOs (default = all) • Aggregate data across an organisation structure (default = all ROCs)
Summed CPU (Seconds) consumed by resources in selected Region VO Index Selected Date Range
List of Sites Belonging to the Selected ROC A breakdown of the resource usage per Site, per VO, per Month
Deployment • Package was released to LCG in August 2004 and certified soon afterwards. • There was no LCG release after that until LCG2_3_0 on 18th December 2004 • Today there are still very few 2_3_0 sites. There are 28 sites producing accounting records today. • The 2_3_0 release has some bugs which are fixed in a new release that is available on the accounting home page • Recommend that sites upgrade accounting to version APEL 3.4.40 available on the accounting homepage http://goc.grid-support.ac.uk/gridsite/accounting/index.html
Future Plans • Support for the LSF batch system. • Understand Normalisation issues; do we have faith in the numbers we present? • Extend accounting schema to include information about the worker node, Job efficiency and globalJobID. • Integrate the LCG schema with de-facto grid accounting standards, namely GGF • Share data with other Grid Communities • NorduGrid, Grid03
Summary • GOCDB to take a more important role in the operations environment • A shift in the monitoring paradigm, which relies on sharing data through RGMA • Accounting: information gathering infrastructure and reporting web site • Development towards on-demand services to provide the community with up-to-date information, aggregated at different levels. • Development of visualisation tools to enhance our understanding of the grid. • Adaptive job brokering based on the monitoring system