Grid Operations Lessons Learned

Discover the lessons learned at the Open Science Grid Operations Center (GOC), including critical infrastructure support, communication hub, security response, and central software caches.

Presentation Transcript


  1. Grid Operations Lessons Learned • Rob Quick • Open Science Grid Operations Center, Indiana University

  2. Outline • How We Operate • Lessons Learned • Lessons Not Yet Learned R. Quick "WLCG-OSG-EGEE Interop" 26 Jan 2007

  3. The Open Science Grid Operations Center (GOC) • Critical Infrastructure Support • Communication Hub • Security Response • Central Software Caches

  4. OSG Operations Infrastructure • Monitoring/Status: VORS, MonALISA, GridCat, VOMS Monitor, CEMon/BDII, Integrated Information Server • Services: VOMS (6 VOs), RSS News Feed, GOC Informational Pages, Trouble Ticketing Exchange with Peering Grids and Support Centers, Scheduled Downtime Tool, OSG Software Cache, Registration DB • Duplicate Infrastructure for the OSG ITB

  5. Communication Hub • Operator Available 24/7/365 to Receive Calls/Emails and Open/Route Tickets • Trouble Ticketing • ~3500 Tickets since the GOC's Inception • ~30 New Tickets Opened Per Week • Automated Exchange of Tickets with GGUS, FNAL, VDT, ATLAS, CMS • Weekly Operations Call • OSG-Operations Mailing List

  6. Security Response • Technician on call 24/7/365 to evaluate security incidents • Critical incidents are immediately addressed with the OSG Security Officer • security@, incident@, and abuse@opensciencegrid.org • 24/7/365 phone availability

  7. Software Caches • OSG and ITB Caches • Compute Element • Configuration of Condor, PBS, LSF, SGE • Worker Node Client • Client • VOMS • GUMS

  8. Lesson One • Event: Release of OSG 0.4.0 Software • Situation: Winter 2006. The OSG software stack has been validated and is ready for release, but the documentation is in horrible shape. • Solution: 3 people work non-stop for 2 weeks to get the baseline documents in shape. • Lesson: Documentation is as important as Validation, Integration, and Deployment. • Corollary: Incorrect documentation is often worse than no documentation.

  9. Lesson Two • Event: A Java service is using resources poorly • Situation: MonALISA monitoring, used on a large group of grid resources, consumes tremendous amounts of I/O • Solution: The GOC is asked to beef up the hardware • Lesson: The fix for poor software performance is better hardware. • Wait: that didn't work!!! • Real Lesson: A bigger hammer will still not drive nails into rocks.

  10. Lesson Three • Event: DZero Production Run • Situation: DZero has 250 million events to process and merge. • Solution: OSG resources are urged to support the DZero VO, and the troubleshooting team works with the application developers. Original goal: ~3M events/day. Up to ~7.7M events/day were processed. • Lesson: There's nothing you can't do if you have a Swiss Army Knife, a roll of duct tape, and your wits. • Actual Lesson: The resources are available on OSG, but effort is still needed to coordinate large runs.
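  To put the DZero figures in perspective, the short calculation below estimates how long 250 million events would take at the original goal rate and at the peak rate quoted on the slide. It is only a back-of-the-envelope sketch: it assumes a constant daily rate (the slide says "up to" ~7.7M events/day, so the second figure is a lower bound on the run time), it ignores the merge step, and the function and constant names are illustrative.

```python
def days_to_process(total_events: float, events_per_day: float) -> float:
    """Days needed to process total_events at a constant daily rate."""
    if events_per_day <= 0:
        raise ValueError("events_per_day must be positive")
    return total_events / events_per_day


# Figures taken from the slide; a constant rate is an assumption.
TOTAL_EVENTS = 250e6   # events to process
GOAL_RATE = 3e6        # original goal, events/day
PEAK_RATE = 7.7e6      # best observed rate, events/day

goal_days = days_to_process(TOTAL_EVENTS, GOAL_RATE)
peak_days = days_to_process(TOTAL_EVENTS, PEAK_RATE)

print(f"At the goal rate: about {goal_days:.0f} days")
print(f"At the peak rate: about {peak_days:.0f} days")
```

  Under these assumptions the run takes roughly 83 days at the goal rate and no less than about 32 days at the peak rate.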

  11. Lesson Four • Event: Joint WLCG/OSG/EGEE Operations Meeting • Situation: We need a way to seamlessly exchange problem tickets between peering grids. • Solution: Develop a translator between the EGEE GGUS ticketing system and the OSG FootPrints system. • Lesson: Communication is the key to grid interoperability. • Alternate Lesson: If you can't be at the World Cup, Geneva is the next best option.

  12. Lesson Five • Event: High-Level Collaborator Resigns • Situation: An OSG collaborator running a critical status and availability service suddenly resigns, and the service is turned off. • Solution: Several developers design equivalent services. • Lesson: Critical services should have multiple administrators and be located centrally, or co-located at the GOC. • Alternate Lesson: If a potential security incident happens during a first date and pulls you away, there will probably not be a second.

  13. Lesson Six • Event: Chicago Marathon • Situation: Halfway point, 13.1 miles to go… a spectator is offering bananas and tequila. • Solution: Take some of both. • Lesson: Sometimes motivation comes in the most unlikely form. • Corollary: If someone offers you tequila, no matter what the situation… drink it!

  14. Lessons That Need to Be Learned • How to accurately advertise VO support • How to efficiently interoperate with peering grids • How to understand and advertise site policy to users • What services are necessary to provide users with all of the information they need to effectively use the OSG • How to handle an explosion in user base

  15. Thank You • Special thanks to the GOC Team: John Rosheck, Tim Silvers, Kyle Gross, and Arvind Gopu • www.opensciencegrid.org • www.grid.iu.edu
