D0 Grid Data Production Initiative: Coordination Mtg 10

Version 1.0 (meeting edition), 13 November 2008. Rob Kennedy and Adam Lyon. Attending: …


Presentation Transcript


  1. D0 Grid Data Production Initiative: Coordination Mtg 10
     Version 1.0 (meeting edition), 13 November 2008
     Rob Kennedy and Adam Lyon
     Attending: …

  2. Outline
     • Summary and News
     • Deployment “Feature List”
       • No change since last week
     • Task Status (4 slides)
       • Focus on individual task status, what is needed
     • Deployment 1 Plan
       • Focus on overall schedule, task order

  3. Summary and News
     • Summary
       • Initiative Deployment 1 Planning Mtg held Monday
       • Jobs successfully run through FWD4, FWD5
       • Deployment scheduled for today-Monday. Running about 3 days behind this, but no roadblocks
     • News and Notes:
       • D0 Collaboration Meeting this week
       • ITSM all-day Workshops last Thu-Fri, this Tue-Fri, next Tue-Fri
         • Rob K. co-leads series next week, will not be here
       • Joe B. vacation next Mon-Tue

  4. Current Deployment “Feature” Lists
     • Deployment 1: Split Data/MC Production Services (NO CHANGE)
       • Time frame: November 13-17, with 1 week+ observation before holidays
       • 1. Config: Basic Splitting of Fwd, Que Services between Data and MC Production with 2 Fwd nodes assigned to each, plus 1 Fwd dedicated to all Merging
       • 2. Fwd4 deployed (w/o virtualization)
       • 3. Fwd5 deployed
       • 4. Que2 deployed, with client software to enable parallel use of 2 QUE nodes
       • 5. New SAM Station (moved off of FWD1)
       • 6. Condor 7 via “new” 1.10.1m official release from UWisc
       • 7. FileMax increase on all Fwd nodes to handle large nJob actions
       • 8. D0Runjob Upgrade for Data Production: Prerequisite for deploying new SAM-Grid release
     • Deployment 2: Optimize Data and MC Production Configurations (NO CHANGE)
       • Time frame: December 8-10, with 1 week+ observation before holidays
       • 1. Config: Optimize Configurations separately for Data and MC Production, especially to increase Data Production “queue” length
       • 2. New SAM-Grid Release with support for new Job status value at Queuing node
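
The Deployment 1 split in item 1 can be pictured as a small role map. The following is only a sketch: the slides state that two Fwd nodes go to each of Data and MC Production and one to Merging, but not which node takes which role, so the order-based assignment, the lowercase node names, and the function name `split_fwd_nodes` are all hypothetical.

```python
from typing import Dict, List


def split_fwd_nodes(nodes: List[str]) -> Dict[str, List[str]]:
    """Assign five forwarding nodes to roles: 2 Data, 2 MC, 1 Merging.

    The order-based assignment is an assumption for illustration only;
    the slides do not say which node serves which role.
    """
    if len(nodes) != 5:
        raise ValueError("the Deployment 1 split assumes exactly 5 Fwd nodes")
    return {
        "data": nodes[0:2],     # two nodes for Data Production
        "mc": nodes[2:4],       # two nodes for MC Production
        "merging": nodes[4:5],  # one node dedicated to all Merging
    }


# Hypothetical node names; the real host names are not given in the slides.
roles = split_fwd_nodes(["fwd1", "fwd2", "fwd3", "fwd4", "fwd5"])
```

Keeping the split in one place like this would make the Deployment 2 re-tuning a matter of changing the map rather than touching each node's description.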

  5. Task Status (1 of 4) (Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
     • 1.1.1 Forwarding Node 4 (Fwd4)
       • <Snip some completed tasks>
       • 1.1.1.15 Fwd4: Few Jobs OpenFileMax=As-Is Single Job Test AL JB Mon 11/3/08 Mon 11/10/08 6d
       • 1.1.1.9 Fwd4: Few Jobs OpenFileMax=As-Is Large-Scale Tests AL JB,MD,JS Tue 11/11/08 Wed 11/12/08 2d
       • 1.1.1.14 Fwd4: Increase OpenFileMax to 16k AL FEF Tue 11/11/08 Wed 11/12/08 2d
       • 1.1.1.10 Fwd4: Pre-Deployment OpenFileMax=16k Large-Scale Test AL JS,MD,JB Thu 11/13/08 Fri 11/14/08 2d
       • 1.1.1.11 Milestone: Fwd4 Ready to Deploy AL Fri 11/14/08 Fri 11/14/08 0d
     • 1.1.2 Forwarding Node 5 (Fwd5)
       • <Snip some completed tasks>
       • 1.1.2.11 Fwd5: Few Jobs OpenFileMax=As-Is Single Job Test AL JB Mon 11/3/08 Mon 11/10/08 6d
       • 1.1.2.7 Fwd5: Few Jobs OpenFileMax=As-Is Large-Scale Tests AL JB,MD,JS Tue 11/11/08 Wed 11/12/08 2d
       • 1.1.2.12 Fwd5: Increase OpenFileMax to 16k AL FEF Tue 11/11/08 Wed 11/12/08 2d
       • 1.1.2.8 Fwd5: Pre-Deployment OpenFileMax=16k Large-Scale Test AL JS,MD,JB Thu 11/13/08 Fri 11/14/08 2d
       • 1.1.2.9 Milestone: Fwd5 Ready to Deploy AL Fri 11/14/08 Fri 11/14/08 0d
     • 1.1.8 FWD and QUE Packaging with Version-Based Umbrella Product
       • <Snip some completed tasks>
       • 1.1.8.6 Umbrella Product: Update FWD Installation Procedure AL JB Wed 11/19/08 Thu 11/20/08 2d
     • Notes: RDK – will enter the monitoring related JIRA tasks into schedule as well, not critical path IMHO since can be done in parallel with testing functionality.

  6. Task Status (2 of 4) (Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
     • 1.1.8 FWD and QUE Packaging with Version-Based Umbrella Product
       • 1.1.8.10 Umbrella Product: Update QUE Installation Procedure AL JB Wed 11/19/08 Thu 11/20/08 2d
       • 1.1.8.13 Umbrella Product: FWD and QUE Installation Proc. archived AL REX Fri 11/21/08 Mon 11/24/08 2d
     • 1.1.3 Queuing Node 2 (Que2)
       • <Snip some completed tasks>
       • 1.1.3.10 Que2: Jim_Client 2-QUE Support: Client Deployment AL REX Wed 11/5/08 Wed 11/5/08 1d
       • 1.1.3.8 Que2: Regression Test w/1-QUE Client AL JB Thu 11/6/08 Fri 11/7/08 2d
       • 1.1.3.9 Que2: Integration Test w/2-QUE Client AL JB Mon 11/10/08 Fri 11/14/08 5d
       • 1.1.3.11 Milestone: Que2 Ready to Deploy AL Fri 11/14/08 Fri 11/14/08 0d
     • 1.1.5 New Distinct SAM Station
       • <Snip some completed tasks>
       • 1.1.5.4 SAM Station: Install and Setup Station AL RI Thu 11/6/08 Thu 11/6/08 1d
       • 1.1.5.10 SAM Station: Request and Install SRM-related certs AL RI Fri 11/7/08 Wed 11/12/08 4d
       • 1.1.5.5 SAM Station: Pre-Deployment Test AL RI Thu 11/13/08 Thu 11/13/08 1d
       • 1.1.5.6 SAM Station: Deactivate old station, Activate new station AL AL Fri 11/14/08 Fri 11/14/08 1d
       • 1.1.5.7 Milestone: SAM Station Ready to Deploy AL Fri 11/14/08 Fri 11/14/08 0d
     • Notes: RDK – will enter the monitoring related JIRA tasks into schedule as well, not critical path IMHO since can be done in parallel with testing functionality.
       • JIRA “Figure out what to do with SRMs” contains “Request and Install SRM-related certs”

  7. Deployment 1 Tasks (3 of 4) (Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
     • 1.1.6 Deployment Stage 1
       • 1.1.6.1 Deployment 1: Plan: Split Data/MC Production Svcs AL ALL Mon 11/10/08 Wed 11/12/08 3d
       • <Pre-deployment Testing in progress>
       • 1.1.6.2 Deployment 1: Execute AL REX Wed 11/19/08 Thu 11/20/08 2d
       • 1.1.6.3 Deployment 1: Monitor AL REX Fri 11/21/08 Mon 11/24/08 2d
       • 1.1.6.4 Deployment 1: Sign-off AL REX Tue 11/25/08 Tue 11/25/08 1d
       • 1.1.6.5 MILE 1: Deployment 1 Completed AL Tue 11/25/08 Tue 11/25/08 0d
     • Meeting on Monday 10 November produced a schedule (next slide); need more fall-back planning
       • End of day 17 November 2008 was the drop-dead date to be deployed; it allowed 7 days to observe.
       • The soonest we can deploy, IMHO, is Wed-Thu 19-20 November, which allows 4 days to observe.
       • We cannot deploy later than 20 November (Thursday)… no deploy on Friday or holiday week.
     • Past Notes:
       • SAM Station – prefer downtime for switch. Prefer to drain off for install (and un-install if it comes to that)
       • New Condor is in this deployment too. THIS is a major risk. (Will be on FWD1-3, QUE1 as well)
       • Full wipe/re-install versus marginal application-only install. Treat existing nodes differently than new nodes?
       • How does umbrella product installation interact with existing products on old nodes (assuming not a full wipe/re-install)?
       • Taking working nodes and possibly making them non-working, with no fall-back versions.
       • Alternative: upgrade old nodes in Dec… perhaps via rolling upgrades.
       • (new) Marginal application-only install on existing nodes in Nov. and a full wipe/re-install in Dec.? Appliance view.
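
The observation-window figures on this slide (7 days for a 17 November deployment, 4 days for 20 November) can be reproduced with a little date arithmetic. This is a sketch under an assumption: the counting convention below, whole calendar days strictly between the end of deployment and the Tuesday 25 November sign-off, is inferred because it matches both numbers, and is not stated in the slides. The function names are hypothetical, and weekends and the holiday week are not modelled beyond the stated Thursday cut-off.

```python
from datetime import date

SIGN_OFF = date(2008, 11, 25)     # Tue 25 Nov 2008, task 1.1.6.4
LAST_DEPLOY = date(2008, 11, 20)  # Thu 20 Nov 2008, stated cut-off


def observation_days(deploy_done: date, sign_off: date = SIGN_OFF) -> int:
    """Whole calendar days strictly between end of deployment and sign-off.

    Assumed convention: neither the deployment day nor the sign-off day counts.
    """
    return max((sign_off - deploy_done).days - 1, 0)


def is_allowed_deploy_day(day: date) -> bool:
    """Apply the slide's rule: no deploy on a Friday, none after 20 November."""
    return day.weekday() != 4 and day <= LAST_DEPLOY


original_window = observation_days(date(2008, 11, 17))
revised_window = observation_days(date(2008, 11, 20))
```

Under that convention the three-day slip costs three observation days, which is why the slide treats 20 November as a hard limit.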

  8. Deployment 1 Schedule (include more fall-backs and specific personnel assignments)
     • Mon 10 Nov 2008
       • Finish FWD4,5 tests, Finish QUE2 tests
     • Tues 11 Nov 2008
       • Finish FWD4,5 tests, Finish QUE2 tests
       • Jim Client tweak (per JS,MD) - unnecessary
       • SAM/SRM certs received (12 Nov)
       • Forewarn of SAM/SRM changes coming
     • Wed 12 Nov 2008
       • Jim Client redeploy - unnecessary
       • OpenFileMax change on FWD4,5
       • Large-scale FWD4,5 tests
       • 2-queue node tests: submit to both w/qual
       • Test SAM/SRM w/new SRM certs
     • Thu 13 Nov 2008
       • SAM/SRM Installation, tests done.
     • Fri 14 Nov 2008
       • SAM Station move off of FWD1
       • Deactivate SAM station on FWD1
       • Activate new SAM station
       • Downtime-less change-over
       • Fall-back: use old config, same way
     • Sat/Sun 15-16 Nov 2008
       • Test FWD4,5, QUE2 w/ new OpenFileMax
       • Drain off SRMs connected to old SAM station
       • Observe new SAM station in production
     • Mon 17 – Thu 20 Nov 2008
       • NB: GCC Power Out sometime this week? Alt Resources to start this on Mon?
       • Mon 17 Nov: Need a Depl Plan Mtg 2, 9am?
       • Tue 18 Nov: FWD2 certs expire.
       • FWD1-3 wipe/install via umbrella package; Increase OpenFileMax
         • FWD2 first, then FWD3, then… Do FWD1 last in case SRMs still connected
       • QUE1 “~wipe”/install via umbrella package (QUE1 has brokering, web page)
         • AL: Be careful NOT to wipe state of old jobs… Brokering, Web page should not be touched? We have not fully tested the new deployment of these.
       • Prepare for sign off Tues 25 Nov.
       • Thu 20 Nov 2008: Deadline for SRM switch
       • Validate the configuration, finish deployment.
     • Fri 21 – Mon 24 Nov 2008
       • Observe system in production
       • Post-Deployment Work: move context server?
     • Tues 25 Nov 2008
       • Sign-off on D0 Grid Production System
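
The ordering rule in the Mon 17 – Thu 20 Nov entry (FWD2 first, then FWD3, FWD1 last in case SRMs are still connected) amounts to deferring any node that may still have dependents. A minimal sketch, assuming the set of nodes to defer is known up front; the function name `upgrade_order` and the `defer` parameter are hypothetical and not part of any tool named in the slides.

```python
from typing import Iterable, List, Set


def upgrade_order(nodes: Iterable[str], defer: Set[str]) -> List[str]:
    """Return nodes in their given order, with deferred nodes moved to the end."""
    ordered = list(nodes)
    first = [n for n in ordered if n not in defer]
    last = [n for n in ordered if n in defer]
    return first + last


# FWD1 is deferred because SRMs may still be connected to its SAM station.
order = upgrade_order(["FWD1", "FWD2", "FWD3"], defer={"FWD1"})
```

The relative order of the non-deferred nodes is preserved, so listing FWD2 ahead of FWD3 in the input keeps the node with expiring certs at the front.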

  9. Task Status (4 of 4) (Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
     • 1.3.1 SAM-Grid Job Status Info
       • 1.3.1.1 Use "Same" Proxy for Gridftps GG PM by Fri 11/21/08
       • 1.3.1.2 New Job Status Value at QUE Node GG PM by Fri 11/21/08
       • Work needs to restart in order to be ready for December deployment
     • 1.3.2 Slow Fwd-CAB Job Transition
       • Note: FileMax change requires a schedd restart (ST). Work into deployment plans.
     • 1.3.3 Improved H/w Uptime
     • 1.4 Metrics
       • nSubmissions plot for Sep ’08 Mike?
