
GOCDB failover status and plans

COD-19, 01/04/2009. G. Mathieu, A. Cavalli, C. Peter, P. Sologna.


Presentation Transcript


  1. GOCDB failover status and plans
     COD-19, 01/04/2009
     G. Mathieu, A. Cavalli, C. Peter, P. Sologna

  2. Assessment and progress
     • Last week's outage at RAL
       • A good (!) use case for testing our procedures and listing improvements
     • DNS aspect
       • New DNS machine at CNAF

  3. Last RAL outage
     • Timeline (all times UTC)
       • 5:20 - Power glitch at RAL
       • 8:00 - Start of failover process
       • 9:20 - DNS switch complete
       • 10:00 - Failover working properly
       • 13:25 - Reverse DNS switch
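
The intervals between the timeline entries above can be worked out with a few lines of Python. This is only an illustrative sketch: the timestamps are copied from the slide, while the names `EVENTS`, `minutes_between` and `intervals` are invented here for the example.

```python
from datetime import datetime

# Timestamps (UTC) copied from the timeline slide; labels are paraphrased.
EVENTS = [
    ("power glitch at RAL", "5:20"),
    ("start of failover process", "8:00"),
    ("DNS switch complete", "9:20"),
    ("failover working properly", "10:00"),
    ("reverse DNS switch", "13:25"),
]


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes between two same-day H:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def intervals(events):
    """Pair each event with the next one and compute the gap in minutes."""
    return [
        (first[0], second[0], minutes_between(first[1], second[1]))
        for first, second in zip(events, events[1:])
    ]


if __name__ == "__main__":
    for first, second, minutes in intervals(EVENTS):
        print(f"{first} -> {second}: {minutes} min")
```

From the start of the failover process (8:00) to a working failover (10:00) is 120 minutes, which matches the 2h figure quoted in the post mortem.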

  4. Post mortem
     • Good things
       • Failover worked
       • DNS swap quick, efficient and transparent
       • Good synchronisation
       • CNAF IRC channel was useful
     • Encountered problems
       • Problems with CNAF DB schema
       • DB connection from ITWM to RAL
       • SSL issues
       • The overall process to swap completely took a rather long time (2h)

  5. Proposed improvements (1)
     • Improve manual process
       • Reduce the number of people needed: different people should each be able to carry out the whole chain alone
       • Create scripts to reduce the number of actions
       • Sort out CNAF schema issue
       • Improve current synchronisation mechanism
     • Contacts and documentation
       • Keep a list of phone contacts, or alternative mail addresses, to use in case the main mail system does not work
       • Document all processes

  6. Proposed improvements (2)
     • Regular tests
       • Test CNAF replica DB
       • ITWM web interface
       • All possible scenarios
     • Configuration improvements
       • Simplify configuration file
       • Have the service itself publish the fact that it is in read-only mode
     • Automation
       • Work with OAT monitoring group
       • Automate DB switch
       • Automate portal switch the same way
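
As an illustration of the "service publishes that it is in read-only mode" idea, the sketch below builds a small JSON status document that a client or monitoring probe could read. Nothing here reflects how GOCDB actually does or did this: the field names `mode` and `serving_site`, the site labels and both function names are assumptions made up for the example.

```python
import json


def status_document(read_only: bool, serving_site: str) -> str:
    """Build a JSON status document (hypothetical format) for the service."""
    return json.dumps(
        {
            # "mode" and "serving_site" are invented field names.
            "mode": "read-only" if read_only else "read-write",
            "serving_site": serving_site,
        },
        sort_keys=True,
    )


def is_read_only(document: str) -> bool:
    """Let a client or monitoring probe check the published mode."""
    return json.loads(document)["mode"] == "read-only"


if __name__ == "__main__":
    doc = status_document(read_only=True, serving_site="replica")
    print(doc)
    print(is_read_only(doc))
```

Publishing the mode in a machine-readable form would let monitoring pick it up without anyone having to announce the failover by hand.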

  7. Actions list (1)
     • Doc and processes
       • Gilles to draft process + test documentation
       • Christian to add goc@itwm tests to ITWM procedures
       • All: provide contacts (phone, alternate mail, etc.)
     • Access to machines
       • Christian to give failover team access to gocdb@itwm
       • Gilles to give failover team access to gocdb@ral
     • Scripting
       • Gilles to write scripts to change GOC portal conf
       • Peter/Ale to write DNS configuration scripts
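
A configuration-switch script of the kind listed under "Scripting" could look roughly like the sketch below. The slides do not describe the GOC portal configuration format, so this assumes a plain `key = value` text file; the option name `db_host`, the host names and the function `set_option` are all hypothetical.

```python
def set_option(conf_text: str, key: str, value: str) -> str:
    """Return conf_text with the given key set to value.

    Assumes a simple 'key = value' format (an assumption, not the real
    portal format). Raises KeyError if the key is not present, so a typo
    cannot silently leave the configuration unchanged.
    """
    out = []
    found = False
    for line in conf_text.splitlines():
        name, sep, _ = line.partition("=")
        if sep and name.strip() == key:
            out.append(f"{key} = {value}")
            found = True
        else:
            # Comments and unrelated options are kept untouched.
            out.append(line)
    if not found:
        raise KeyError(key)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    conf = "# portal settings\ndb_host = primary.example.org\n"
    print(set_option(conf, "db_host", "replica.example.org"), end="")
```

Working on the text and returning a new string keeps the function easy to test; writing the result back to disk would be a separate step.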

  8. Actions list (2)
     • Improvements on CNAF-RAL DB sync
       • Gilles to provide a dump to CNAF whenever the schema changes
       • Peter/Ale/Gilles to study encryption solution to secure the dump
       • Gilles to check the dump solution is valid
       • Peter/Ale to implement new procedures
       • Ale to do speed tests in different scenarios
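
One small piece of a dump-transfer procedure can be sketched independently of whichever encryption solution is chosen: checking that the dump received at the replica site is byte-for-byte the one that was sent. The example below uses a SHA-256 digest for that. It is not encryption and is not taken from the slides; the function names are invented for illustration.

```python
import hashlib
import hmac


def dump_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a dump's contents."""
    return hashlib.sha256(data).hexdigest()


def dump_is_intact(data: bytes, expected_digest: str) -> bool:
    """Compare the received dump against the digest published by the sender."""
    return hmac.compare_digest(dump_digest(data), expected_digest)


if __name__ == "__main__":
    sent = b"CREATE TABLE site (id INT);"
    digest = dump_digest(sent)          # computed where the dump is made
    print(dump_is_intact(sent, digest))  # checked where the dump is loaded
```

A digest only detects accidental or unauthorised modification; keeping the dump confidential in transit still needs the encryption step mentioned in the action list.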

  9. Actions list (3)
     • Test
     • Test again
     • Re-test
     • Test
     • Test
     • Test (if there is some time left)
