WP2: Infrastructure and Service Management
Status Report, ETICS All-Hands – 23 October 2006
CERN: Marian Zurek
INFN: Matteo Selmi
UW-Madison: Peter Couvares, Becky Gietzel, Andy Pavlo
Personnel News
• Changes @ UW-Madison
  • Tolya Karp replaced by Andy Pavlo and Becky Gietzel
  • Peter still here :)
• Carlos to join WP2 @ CERN in November
  • Much needed sysadmin help for Marian!
Deliverables
• D2.2 - Infrastructure installation and usage documentation (PM06)
  • Delivered (a little late -- PM07)
• D2.3 - Status of certification, integration and validation testbed setup (prototype) (PM12)
  • Document not yet started -- but will contain positive news: prototype testbeds are up and have been operational for >6 months.
Major Tasks Performed
• Certification, Integration and Validation Infrastructure Expansion: CERN Facility
  • Due entirely to Marian’s ongoing hard work, WP2 has expanded the NMI Build/Test Facility at CERN and improved its operation.
  • etics.cern.ch: official ETICS WS/submission node, production host
    • 19 CPUs: ia32, x86_64, ia64, ppc
    • SLC3, SLC4, RHES3, Deb3, FC3, FC4, FC5, WinXP, MacOS
    • 2500+ jobs (as of 17 October 2006) vs. 1300+ jobs (as of 22 May 2006)
  • etics-test.cern.ch: test submission node
    • a few machines with SLC3, SLC4 on ia32
    • 2200+ jobs (as of 17 October 2006) vs. 450+ jobs (as of 22 May 2006)
  • etics-dev.cern.ch: development node, non-stable
    • a few machines with SLC3, SLC4 on ia32
    • 1650+ jobs (as of 17 October 2006)
  • etics-hd.cern.ch: new host for SLC4 WS/submission node prototype
  • Operational setup
    • WNs status page: http://etics.cern.ch/nmi/?page=pool/index
    • Job status page: http://etics.cern.ch/nmi/?page=results/overview
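To put the job counts above in perspective, the growth between the two snapshot dates can be worked out directly. This is a rough sketch only: the slide quotes lower bounds ("1300+", "2500+"), so the results are approximations, and the `growth` helper and `SNAPSHOTS` table are illustrative names, not part of ETICS or NMI.

```python
from datetime import date

# Lower-bound job counts quoted on the slide: (22 May 2006, 17 October 2006).
SNAPSHOTS = {
    "etics.cern.ch": (1300, 2500),
    "etics-test.cern.ch": (450, 2200),
}
START, END = date(2006, 5, 22), date(2006, 10, 17)


def growth(before, after, days):
    """Return (new jobs, growth factor, average new jobs per day)."""
    added = after - before
    return added, after / before, added / days


days = (END - START).days
report = {host: growth(b, a, days) for host, (b, a) in SNAPSHOTS.items()}

for host, (added, factor, per_day) in report.items():
    print(f"{host}: +{added} jobs, x{factor:.1f}, ~{per_day:.1f} jobs/day")
```

On these figures the test node grew proportionally faster than the production node over the same period, although both numbers are only as precise as the "+" counts allow.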
Major Tasks Performed
• Certification, Integration and Validation Infrastructure Expansion (Cont.)
  • INFN Facility
    • Thanks to Matteo, WP2 has also expanded the NMI Build/Test Facility at INFN
    • etics-01.cnaf.infn.it: ETICS WS/submission node
    • 5 CPUs: ia32, x86_64, ppc
    • SLC3/SLC4/CentOS4/MacOSX
    • 330+ jobs
  • UW-Madison Facility
    • 100+ CPUs, 43+ platforms, and still growing…
    • Thanks to Becky, local ETICS WS currently being deployed
Major Tasks Performed
• Parallel Testing Feature Delivered
  • Allows co-scheduling of multiple heterogeneous resources, e.g. to dynamically deploy a custom testbed for testing client/server or p2p s/w.
  • Originally an end of Q4 goal, delivered ~5 months early in response to gLite demands
• D2.2 Infrastructure Installation and Usage Document Completed
  • Thanks to all of WP2 for content & reviewers for helpful feedback
• gLite System Testing Prototype
  • To be described in detail by Marian tomorrow…
• Continued Improvements to NMI Infrastructure
  • Many are a result of Marian & Matteo’s feedback & experiences setting up facilities at CERN and INFN.
• Additional NMI documentation
  • New NMI website (http://nmi.cs.wisc.edu)
  • LISA ‘06 NMI paper, etc.
Major Tasks Performed
• Implemented short-term solution for root-level testing @ CERN
  • Initial approach is only loosely integrated with NMI
  • To be replaced by future NMI virtual machine capability?
• Participation in OMII-Europe
  • Continued involvement to ensure infrastructure harmony
  • Cross-site job migration is also a top OMII-Europe goal
• And last but not least: Boring system administration … every day
  • OS updates/upgrades, reboots, backups, disk space mgmt., disappearing WNs, crashes, power outages, filesystem failures, etc.
  • As CERN is the facility with the most usage, most of this falls onto Marian
  • The etics.cern.ch service is highly available. No significant downtime was caused by the WP2 infrastructure.
Issues
• Capacity Planning / Scalability
  • Marian: “How many more needed?”
    • Good question! I have no idea.
  • Major new users/projects may need to provide new resources.
  • We need to better understand how easy/quick it is to add resources to an existing facility, and how many can be added in the same manner before new scalability issues arise.
  • NMI has been demonstrated to scale to 100’s of nodes, and Condor to 1000’s… but ETICS + NMI + Condor? It also depends on specific workload…
• Additional ETICS Testbeds for Development
  • Marian: “Does every developer need their own ETICS installation?”
  • Combined deployment of NMI submit node + ETICS WS is not trivial or fully automated (no simple RPM or “plug’n’play”)
  • WP2 needs help from other WPs to better automate their deployment
Issues
• Uneven Facility Utilization
  • Was an issue in May, still an issue today
  • 3/3 sites set up, 1/3 in use
    • CERN facility set up, already in use, production-ready
    • INFN facility set up, but less used
    • UW facility set up, but not yet in regular use by ETICS
  • Why? Two reasons:
    • Minor: CERN facility known to work, other facilities less stress-tested.
    • Major: inconvenience of submitting to multiple ETICS sites with multiple DBs & WS interfaces
  • Upcoming cross-site job migration capabilities should largely address both issues -- if jobs automatically migrate, users don’t need to think about it, and all three pools will be exercised
    • To be described in more detail by Andy tomorrow…
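The actual migration design is left to the separate presentation. Purely to illustrate why automatic placement would exercise all three pools, here is a minimal sketch of one possible selection rule: send each job to the pool that offers the requested platform and has the most free slots. `Pool`, `choose_pool`, the platform sets for UW-Madison, and all free-slot numbers are hypothetical and are not taken from ETICS, NMI, or Condor.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    platforms: set      # platforms the pool can build/test on
    free_slots: int     # hypothetical count of idle CPUs


def choose_pool(pools, platform):
    """Return the pool with the most free slots that offers `platform`, or None."""
    candidates = [p for p in pools if platform in p.platforms and p.free_slots > 0]
    return max(candidates, key=lambda p: p.free_slots, default=None)


# Illustrative state only: a busy CERN pool and idle capacity elsewhere.
pools = [
    Pool("CERN", {"SLC3", "SLC4"}, free_slots=0),
    Pool("INFN", {"SLC4", "CentOS4"}, free_slots=2),
    Pool("UW-Madison", {"SLC4", "FC5"}, free_slots=10),
]

target = choose_pool(pools, "SLC4")
print(target.name if target else "no pool available")
```

With a rule of this kind the user submits once and never picks a site, which is the point made on the slide; a real implementation would also have to deal with the multiple DBs and WS interfaces mentioned above.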
Issues
• Communication
  • Evening in Europe == Morning in Madison
  • Bi-weekly calls stopped happening over summer
    • I’ve been slow to address the problem
  • Matteo in May:
    • “I think we need more coordination among the three sites. It is quite difficult for us at INFN to understand what are the urgent operations to be done.”
  • Marian in October: same complaint!
• Sysadmin Work
  • Only one person @ CERN
  • Frequent OS updates/upgrades
  • Reboots
    • because of power cuts (too hot), kernel updates/upgrades, HW failures
  • Marian: “I know it is not interesting for you, but this must work !! !! !!”
  • Heterogeneous clusters inherently harder to manage than homogeneous clusters of the same size
  • Complex s/w stack: ETICS client -> ETICS WS -> NMI -> Condor -> OS
Workplan
• Q4 Top Priorities
  • Develop/deploy/test cross-facility job migration capability.
    • …and increase utilization of INFN and UW-Madison pools as a result.
  • Keep up with increasing sysadmin demands -- keep infrastructure running smoothly for ETICS users & developers
    • Responding to Hardware/OS/Service issues
    • Automation of currently manual tasks
    • Deployment of new systems & services
    • Scalability work
  • Prepare D2.3 report on infrastructure status.
Workplan
• Q4/Q5 Unprioritized (next steps and/or resources unclear):
  • Hardware Virtualisation
    • WoD (WindowsOnDemand) service, VMWare and/or Xen
  • Service Monitoring (Service Level Status)
    • see already http://sls.cern.ch/sls/service.php?id=ETICS
    • Your feedback is needed
  • Security issues
    • Passwords present in the CVS
  • Public / private resource allocation
    • A project wants to use ETICS, brings in its own private nodes, and wants their full power kept private
    • Steering the project's jobs to these nodes, preventing other jobs from landing there
    • Already supported by NMI/Condor, needs to be documented/customized for ETICS
  • Steering jobs to/identifying nodes with specific resources
    • Already supported by NMI/Condor, needs to be documented/customized for ETICS
  • Documentation
    • Needs to be updated & improved
    • ETICS-generic WS installation & configuration docs
    • CERN/INFN/UW facility-specific configuration & administration docs
    • Extracting info from Savannah issue DB
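The two steering items above both come down to two-sided matching: a node states which jobs it will accept, and a job states which resources it needs. The sketch below models that idea in plain Python. It is not Condor or NMI configuration; `Node`, `Job`, `eligible_nodes`, and the node, project, and resource names are all hypothetical, chosen only to show the policy that would need documenting for ETICS.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    resources: set = field(default_factory=set)
    private_to: Optional[str] = None   # owning project, or None if public


@dataclass
class Job:
    project: str
    needs: set = field(default_factory=set)


def node_accepts(node, job):
    """Node-side policy: a private node only runs its owner's jobs."""
    return node.private_to is None or node.private_to == job.project


def job_accepts(job, node):
    """Job-side policy: the node must offer every resource the job needs."""
    return job.needs <= node.resources


def eligible_nodes(job, nodes):
    """Names of nodes where both sides agree, the project's own nodes first."""
    matches = [n for n in nodes if node_accepts(n, job) and job_accepts(job, n)]
    matches.sort(key=lambda n: n.private_to != job.project)  # stable sort
    return [n.name for n in matches]


nodes = [
    Node("public-1", {"root"}),
    Node("proj-1", {"root", "mysql"}, private_to="projA"),
]

print(eligible_nodes(Job("projA", {"root"}), nodes))
print(eligible_nodes(Job("projB", {"root"}), nodes))
```

Here a job from the owning project lands on its private node first, while another project's job can only use the public node, which is the behaviour the private-allocation bullets ask for.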
Metrics
• Bugs, jobs, tasks
  • 15 open NMI/Condor bugs/issues
  • 14 closed/addressed bugs/issues
    • Details available at: https://savannah.cern.ch/bugs/?group=etics and select category=NMI
  • 5 open tasks, 1 closed
    • Details available at: https://savannah.cern.ch/task/?group=etics and select category=NMI
Conclusion
• Discussion/Questions/Etc.