This presentation discusses the OSG operations model and implementation, including monitoring instruments, support workflows, and conclusions.
Operations and Support for Grid Environments Leigh Grundhoefer Indiana University leighg@indiana.edu
Agenda • Introduction to OSG Operations model and implementation • Monitoring Instruments • Support Workflows • Conclusions
Defining Grid Support • What kind of infrastructure? • Definition of “instrumentation” software • Deployment policies and procedures • Error handling methods • What is the structure for the support? • Try to reduce duplication of effort • Integration of grid support with a varied set of existing resource provider support mechanisms • Interfacing support staff and grid experts
Integrating grid support (diagram) • Parties shown: NOC, facility, machine operators, support, security czar, grid ops, network admins, system admins, resources
OSG landscape (diagram) • Support Centers Technical Group (TG) oversees the Operations Activity (Ops) • Technical Groups shown: TG Mon&Info, TG Policy, TG Storage, TG Security, TG Support Centers • Other labels shown: Ops, Storage, Security, Integration, site admins, VOs & apps, Arch, MIS, Policy, OSG deployment, Chairs
Operations Scope • Runs the grid-wide services, including provisioning and installation of middleware, and provides operational support for those services, resource providers and VOs running on OSG. • Coordinates with other Grids and between support organizations. • Applies user and service agreements • Provides a repository for collected registrations and agreements of participating organizations
Engineering • Maintain grid-controlled software packages and cache • Provide common grid software support through VDT • Verify software compatibilities • Provision releases of the OSG middleware and services • Troubleshoot service failures • Provide deployment guidance and assistance • Act as liaison to other service support centers • Monitor status of grid resources • Publish monitoring information for grid resources
Infrastructure • Trouble Ticketing system and interface • Monitoring tools development and maintenance • Accounting services • Discovery services • Identity services • Grid information index • Grid Catalog • VO-level services for monitoring services • Knowledge base • Mailing Lists • Formal and collaborative web information repositories
Provisioning Tasks • Set up the pre-release candidates for production installation tests • Add version control to production release • Deploy and validate auxiliary services • Adjust middleware configuration setup • Pre-release testing of production installation • Pre-release test of services • Full documentation preparation • Installation Manuals • Release Notes, Change Logs, Patches, Upgrades • Description of the services provided for the release and access information
Support Services • Coordinates and tracks: • Problems for service providers • Security incidents • Requests for assistance • Schedules grid service and middleware changes • Monitors policy compliance • Detailed later in this talk
Agenda • Introduction to OSG Operations model and implementation • Monitoring Instruments • Support Workflows • Conclusions
OSG Service Integration • Grid Catalog -- GridCat • MonALISA • MIS-CI • New “core” monitoring services
Integrated Monitoring Framework • Globus Monitoring and Discovery Service, MDS (LDAP directory) • MonALISA, Monitoring Agents using a Large Integrated Services Architecture (Pub/Sub) • MonALISA repository (WS/WAP) • Ganglia performance monitoring (Multicast/Hierarchical) • Job Monitoring System at the Advanced Center for Distributed Computing (non-invasive archive) • GridCat, the Grid Site Status Cataloging System at iGOC (human/automatic managed DB)
Grid Telemetry • Information • Site list • Test result • Load • Jobs running • Jobs queued • Heterogeneous • Redundant
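One way to picture "heterogeneous" and "redundant" telemetry: several monitors report overlapping facts about the same site, and a consumer keeps the freshest value for each fact. The sketch below is only an illustration under that assumption; the `SiteReport` type, its field names, and the `merge_reports` helper are hypothetical and do not come from GridCat, MonALISA, or Ganglia.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SiteReport:
    """One monitor's view of one site (hypothetical field names)."""
    site: str
    source: str                       # which monitor produced the report
    timestamp: float                  # seconds since the epoch
    test_passed: Optional[bool] = None
    load: Optional[float] = None
    jobs_running: Optional[int] = None
    jobs_queued: Optional[int] = None


FIELDS = ("test_passed", "load", "jobs_running", "jobs_queued")


def merge_reports(reports):
    """For each field, keep the value from the most recent report that has it.

    Returns a dict mapping field name to a (value, source) pair.
    """
    merged = {}
    for report in sorted(reports, key=lambda r: r.timestamp, reverse=True):
        for name in FIELDS:
            value = getattr(report, name)
            if name not in merged and value is not None:
                merged[name] = (value, report.source)
    return merged


# Three monitors, each seeing only part of the picture for the same site.
reports = [
    SiteReport("site_a", "gridcat", 100.0, test_passed=True),
    SiteReport("site_a", "monalisa", 160.0, load=0.75,
               jobs_running=12, jobs_queued=3),
    SiteReport("site_a", "ganglia", 130.0, load=0.5),
]
view = merge_reports(reports)
```

Recording the source next to each value keeps the redundancy visible: when two monitors disagree, an operator can see which one supplied the number being shown.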
What is GridCat? • A Grid Site Status Cataloging System - a web application • High-level, simple status map • Computing Resource Information Collector/Presenter • Static and dynamic information about all sites • Simple grid status presentation on the web • Identifies site readiness • A web application that is easy to develop and deploy • Displays disk space and CPU slots • Back end: parallel information collecting, storing, and archiving among sites • Front end: web pages in templated HTML + PHP + JS
iGOC GridCat View
iGOC MonALISA View
iGOC MonALISA View (partial)
OSG MIS advancement • Monitoring and Information Services - Core Infrastructure (MIS-CI) • MIS Compute Element and Storage Element • Discovery Service • Consumer Interface • Resource Repository • Information Gathering
MIS-CI Architecture (diagram) • Resource side: Gatekeeper, gridftp, cron jobs, schema evaluation, resource throttling, self-monitor • jobmanager-mis entries as listed: -profile (default), jobs, gridftp, diskspace, -accounting, policy, environment, software, VO, statistics, security • Storage: SQLite database and SQLite database backup, SQLite local historical repository, SQLite remote VO historical repository, custom remote repository (marked "?") • Consumer side: MIS-CI consumer throttling, Discovery Service, user, grid scheduler, OSG auditing/accounting
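The architecture slide pairs periodic collection (cron jobs) with SQLite repositories. A minimal sketch of that pattern follows, using Python's standard `sqlite3` module. The table layout, the function names, and the metric names are assumptions made for illustration; they are not the MIS-CI schema.

```python
import sqlite3


def open_repository(path=":memory:"):
    """Open (or create) a local repository; the schema here is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        " collected_at INTEGER NOT NULL,"   # collection time, epoch seconds
        " profile TEXT NOT NULL,"           # e.g. 'jobs', 'gridftp', 'diskspace'
        " metric TEXT NOT NULL,"
        " value REAL NOT NULL)"
    )
    return conn


def record(conn, collected_at, profile, metrics):
    """Store one collection run for a profile (what a cron job might call)."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO samples VALUES (?, ?, ?, ?)",
            [(collected_at, profile, name, value)
             for name, value in metrics.items()],
        )


def latest(conn, profile):
    """Return the metrics from the most recent run of a profile."""
    rows = conn.execute(
        "SELECT metric, value FROM samples"
        " WHERE profile = ? AND collected_at ="
        " (SELECT MAX(collected_at) FROM samples WHERE profile = ?)",
        (profile, profile),
    )
    return dict(rows.fetchall())


conn = open_repository()
record(conn, 1000, "diskspace", {"free_gb": 120.0})
record(conn, 1900, "diskspace", {"free_gb": 95.5})
record(conn, 1900, "jobs", {"running": 12, "queued": 3})
```

Because older rows are never overwritten, the same table serves both the "current status" query shown here and historical queries over a time range.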
Agenda • Introduction to OSG Operations model and implementation • Monitoring Instruments • Support Workflows • Conclusions
Grid Operations Center (diagram) • Operations at the Indiana iGOC: Provisioning, Ops procedures, Coordination • Shown alongside: VO Support Centers, Service Support Centers, Resource Provider Support Centers
Leveraging the NOC • Global NOC at Indiana University • The Global NOC provides 24x7 network engineering and operations services for research and education networks and international interconnections, including Internet2 Abilene, National LambdaRail, TransPAC and AMPATH networks, the STAR TAP and MANLAN layer 3 international exchange points, and the STAR LIGHT optical exchange. In addition, the Global NOC supports activities of the iVDGL Grid Operations Center and the REN-ISAC cybersecurity Watch Desk. By virtue of the R&E network, grid, and cybersecurity activities, the Global NOC possesses a unique and embracing view of R&E cyberinfrastructure.
Monitoring the GOC services (diagram) • Grid systems and services are checked every 15 minutes • Elements shown: GOC, NOC monitoring, Nagios, Contact DB, Trouble Tickets (example: Ticket 894)
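Nagios plugins report their result through an exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus one line of text. A check of a GOC service in that style might look like the sketch below. The function name, the "GOC_SERVICE" label, and the thresholds are assumptions chosen to match a 15-minute cycle, not the NOC's actual configuration.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_age(age_seconds, warn_after=15 * 60, crit_after=30 * 60):
    """Classify a service by how long ago it last reported.

    Thresholds are illustrative: warn after one missed 15-minute cycle,
    go critical after two. Returns (exit_code, status_line).
    """
    if age_seconds is None or age_seconds < 0:
        status, detail = UNKNOWN, "no valid timestamp"
    elif age_seconds >= crit_after:
        status, detail = CRITICAL, f"last update {age_seconds}s ago"
    elif age_seconds >= warn_after:
        status, detail = WARNING, f"last update {age_seconds}s ago"
    else:
        status, detail = OK, f"last update {age_seconds}s ago"
    return status, f"GOC_SERVICE {LABELS[status]} - {detail}"
```

A wrapper script would print the status line and call `sys.exit()` with the code, which is how Nagios decides whether to notify anyone.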
http://www.ivdgl.org/grid3
Problem to Trouble Ticket • Scope: a single resource or multiple resources; application-wide; VO-wide; grid-wide; operations resource or operations service • Severity: Critical, High, Elevated, Normal • Problem Owner • Problem Contact • Problem Description
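The fields on this slide map naturally onto a small record type. The sketch below takes the scope and severity values from the slide; the class names, the numeric severity ordering, and the sample tickets are assumptions for illustration and do not describe the GOC ticketing system.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class Scope(Enum):
    SINGLE_RESOURCE = "single resource"
    MULTIPLE_RESOURCES = "multiple resources"
    APPLICATION_WIDE = "application wide"
    VO_WIDE = "VO wide"
    GRID_WIDE = "grid wide"
    OPERATIONS = "operations resource or service"


class Severity(IntEnum):
    # Numeric values are an assumption so that tickets can be sorted.
    NORMAL = 1
    ELEVATED = 2
    HIGH = 3
    CRITICAL = 4


@dataclass
class TroubleTicket:
    scope: Scope
    severity: Severity
    owner: str
    contact: str
    description: str


def by_urgency(tickets):
    """Most severe tickets first; ties keep their original order."""
    return sorted(tickets, key=lambda t: t.severity, reverse=True)


# Hypothetical tickets used only to exercise the sketch.
queue = [
    TroubleTicket(Scope.SINGLE_RESOURCE, Severity.NORMAL,
                  "site admin", "admin@example.org", "disk nearly full"),
    TroubleTicket(Scope.GRID_WIDE, Severity.CRITICAL,
                  "GOC", "goc@example.org", "catalog service unreachable"),
    TroubleTicket(Scope.VO_WIDE, Severity.HIGH,
                  "VO support", "vo@example.org", "authorization failures"),
]
ordered = by_urgency(queue)
```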
Monitoring Event (diagram) • Trigger: a site fails the Grid Catalog test (run every 5 hours) • NOC monitors the Grid Catalog map • Trouble Tickets (example: Ticket 854) • GOC monitoring: GridCat, MonALISA • Other elements shown: Grid Experts, Contact DB, Security/Incident Handling, VO support, Resource or Facility
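The flow on this slide (a failed Grid Catalog test leads to a trouble ticket, with the Contact DB supplying who to notify) can be sketched in a few lines. Everything named here is hypothetical: the function, the dictionary shapes, and the rule that a site with an open ticket does not get a second one are assumptions, not documented GOC behavior.

```python
def tickets_for_failures(test_results, contact_db, open_sites, fallback="goc"):
    """Create ticket records for failing sites that have no open ticket.

    test_results: dict of site name -> True (passed) or False (failed)
    contact_db:   dict of site name -> contact to notify
    open_sites:   set of sites that already have an open ticket
    """
    tickets = []
    for site, passed in sorted(test_results.items()):
        if passed or site in open_sites:
            continue  # healthy, or already being handled
        tickets.append({
            "site": site,
            # Fall back to the GOC when the site has no registered contact.
            "contact": contact_db.get(site, fallback),
            "description": f"{site} failed the Grid Catalog test",
        })
    return tickets


# Hypothetical inputs for one test cycle.
results = {"site_a": True, "site_b": False, "site_c": False, "site_d": False}
contacts = {"site_b": "admin-b@example.org", "site_d": "admin-d@example.org"}
new_tickets = tickets_for_failures(results, contacts, open_sites={"site_d"})
```

Skipping sites that already have an open ticket matters for a test that repeats on a schedule: without it, a site that stays down would generate a new ticket every cycle.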
Reactive Support workflow (diagram) • Entry points: igoc@ivdgl.org, GOC web form and telephone • Requests from users and admins: application failure, planned outages, security problems, installation help, configuration assistance, identity management, authorization problems • Trouble Tickets (examples: Ticket 803, Ticket 823, Ticket 833, Ticket 843) • Elements shown: Grid Experts, Web Docs, Developers, Contact DB, other Support Centers, Security/Incident Handling
Agenda • Introduction to OSG Operations model and implementation • Monitoring Instruments • Support Workflows • Conclusions
Operations Enables Applications • Provide operational services that give applications the “instruments” to: • Publish site policies and environment • Know the status of grid middleware on sites • Know the job queue for compute resources • Know the status and load of grid resources • Access historical monitoring information • Manage grid services • Keep apprised of security incidents in the collaboration
Lessons Learned • Configuration management efforts in the development and deployment areas are rewarded many times over during production. • A monitoring infrastructure gives a significant problem-solving advantage, especially when the monitoring is redundant. • Establishing clear communications between resource providers, users and Virtual Organizations is hard.
More Lessons Learned • Human interactions in grid building are costly • Keeping resource provider requirements light led to heavy loads on gatekeeper hosts (monitoring framework) • A diverse set of resource configurations made exchanging job requirements difficult • Troubleshooting: efficiency for submitted jobs was not as high as we’d like.
Upcoming Challenges • Shared problem handling with application-centric and VO-centric support structures • Ticket passing to and from other Grid environments • Establishing a working monitoring framework for distributed storage resources and virtual data cataloging infrastructure
Thank You - leighg@indiana.edu