
WP4 report



Presentation Transcript


  1. WP4 report Plans for testbed 2 Olof.Barring@cern.ch

  2. Summary • Reminder on how it all fits together • What’s in R1.2 (deployed and not-deployed but integrated) • Piled up software from R1.3, R1.4 • Timeline for R2 developments and beyond • A WP4 problem • Conclusions

  3. How it all fits together (job management) [architecture diagram]:
     • Other WPs: Resource Broker (WP1), Grid Info Services (WP3), Data Mgmt (WP2), Grid Data Storage (WP5) (mass storage, disk pools)
     • WP4 subsystems: Gridification, Monitoring, Resource Management; local farms Farm A (LSF) and Farm B (PBS), also used by local users
     • Flow: the grid user submits a job; the Resource Broker makes an optimized selection of a site; Gridification authorizes the user and maps grid to local credentials; Resource Management selects an optimal batch queue, submits the job, and returns job status and output; the fabric publishes resource and accounting information to the grid information services
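A compact sketch of the job flow on this slide, from grid submission down to the local batch system, may help make the diagram concrete. Every function and value below is a hypothetical stand-in for the WP1/WP3/WP4 components named in the diagram, not their real interfaces.

```python
# Hypothetical stand-ins for the components named on the slide.

def select_site(job, info_services):
    """Resource Broker (WP1): optimized selection of a site via WP3 info services."""
    return max(info_services, key=lambda site: info_services[site]["free_cpus"])

def gridify(job, grid_dn):
    """Gridification (WP4): authorize the user and map grid to local credentials."""
    return {"job": job, "local_user": "gridpool001"}   # made-up mapping

def submit_to_batch(local_job, farms):
    """Resource Management (WP4): select an optimal batch queue and submit."""
    return {"farm": farms[0], "status": "QUEUED"}      # trivial choice for the sketch

info = {"Farm A (LSF)": {"free_cpus": 120}, "Farm B (PBS)": {"free_cpus": 40}}
site = select_site("myjob.jdl", info)
local_job = gridify("myjob.jdl", "/O=Grid/CN=Some User")
print(site, submit_to_batch(local_job, [site]))        # job status is returned to the user
```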

  4. How it all fits together (system mgmt) [architecture diagram]:
     • WP4 subsystems: Monitoring & Fault Tolerance, Resource Management, Configuration Management, Automation, Installation & Node Mgmt; information and invocation flows connect them to Farm A (LSF) and Farm B (PBS)
     • Node malfunction detected: remove node from queue, wait for running jobs(?), trigger repair (e.g. restart, reboot, reconfigure, …), update configuration templates
     • Node OK detected: put node back in queue
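A minimal sketch of the detect / drain / repair / re-enable loop described above. All helper functions are hypothetical stand-ins for the WP4 monitoring, resource-management and node-management components; none of this is a real WP4 interface.

```python
import time

def node_status(node):
    """Stub for a monitoring check; the real agent would evaluate real metrics."""
    return "OK"

def drain_node(node):
    """Remove the node from the batch queue and wait for running jobs."""
    print("draining", node)

def repair(node):
    """Trigger a repair action, e.g. restart a daemon, reboot or reconfigure."""
    print("repairing", node)

def enable_node(node):
    """Update configuration templates and put the node back in the queue."""
    print("re-enabling", node)

def automation_loop(nodes, interval=60):
    drained = set()
    while True:
        for node in nodes:
            status = node_status(node)
            if status == "MALFUNCTION" and node not in drained:
                drain_node(node)
                repair(node)
                drained.add(node)
            elif status == "OK" and node in drained:
                enable_node(node)
                drained.discard(node)
        time.sleep(interval)
```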

  5. How it all fits together (node autonomy) [architecture diagram]:
     • Central (distributed) services: Monitoring Measurement Repository with correlation engines, Configuration Data Base
     • On the node: monitoring buffer (a buffer copy is forwarded to the repository), automation, node mgmt components, node profile and configuration cache (fed from the CDB)
     • Local recovery if possible (e.g. restarting daemons)
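The node-autonomy idea can be illustrated with a small sketch: metrics are kept in a local buffer, a copy is forwarded to the central measurement repository when it is reachable, and simple problems such as a dead daemon are repaired locally without central involvement. The function names here (forward_to_repository, daemon_is_alive, restart_daemon) are invented for the example, not the actual WP4 monitoring API.

```python
import collections

local_buffer = collections.deque(maxlen=10000)   # survives repository outages

def forward_to_repository(samples):
    """Send a copy of buffered samples to the central measurement repository."""
    return True   # stub: pretend the transfer succeeded

def daemon_is_alive(name):
    return True   # stub: e.g. check a pid file or /proc

def restart_daemon(name):
    print("restarting", name)

def agent_cycle(samples, watched_daemons=("sshd", "pbs_mom")):
    local_buffer.extend(samples)                 # always buffer locally first
    if forward_to_repository(list(local_buffer)):
        local_buffer.clear()                     # central copy confirmed
    for daemon in watched_daemons:               # local recovery if possible
        if not daemon_is_alive(daemon):
            restart_daemon(daemon)
```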

  6. What’s in R1.2 (and deployed) • Gridification: • Library implementation of LCAS
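To illustrate what a library implementation of authorization means in this context, here is a sketch in the spirit of LCAS: the gatekeeper calls into a library that runs a chain of policy checks before a grid credential is mapped to a local account. This is not the real LCAS C API; all names below are made up for the example.

```python
def allowed_users_policy(user_dn, job_description):
    # e.g. check the DN against a local allowed-users list
    return user_dn in {"/O=Grid/CN=Some User"}

def ban_list_policy(user_dn, job_description):
    return user_dn not in set()   # empty ban list in this sketch

POLICY_PLUGINS = [allowed_users_policy, ban_list_policy]

def authorize(user_dn, job_description):
    """Return True only if every policy plug-in accepts the request."""
    return all(policy(user_dn, job_description) for policy in POLICY_PLUGINS)

if __name__ == "__main__":
    print(authorize("/O=Grid/CN=Some User", "JDL..."))   # True in this sketch
```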

  7. What’s in R1.2 but not used/deployed • Resource management • Information provider for Condor (not fully tested because you need a complete testbed including a Condor cluster) • Monitoring • Agent + first prototype repository server + basic linuxproc sensors • No LCFG object, so it is not deployed • Installation mgmt • LCFG light exists in R1.2. Please give us feedback on any problems you have with it.

  8. Piled up software from R1.3, R1.4 • Everything mentioned here is ready, unit tested and documented (and rpms are built by autobuild) • Gridification • LCAS with dynamic plug-ins. (already in R1.2.1???) • Resource mgmt • Complete prototype enterprise level batch system management with proxy for PBS (see next slide). Includes LCFG object. • Monitoring • New agent. Production quality. Already used on CERN production clusters sampling some 110 metrics/node. Has also been tested on Solaris. • LCFG object • Installation mgmt • Next generation LCFG: LCFGng for RH6.2 (RH7.2 almost ready)
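As a flavour of what a linuxproc sensor does, here is a toy sampler that reads a few metrics from /proc on Linux. The metric names and the sample format are invented for the example; the production agent mentioned above samples on the order of 110 metrics per node.

```python
import time

def sample_linuxproc():
    with open("/proc/loadavg") as f:
        load1, load5, load15 = f.read().split()[:3]
    meminfo = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            meminfo[key] = int(value.split()[0])   # values in kB
    return {
        "timestamp": int(time.time()),
        "LoadAvg1": float(load1),
        "MemFreeKB": meminfo.get("MemFree", 0),
        "SwapFreeKB": meminfo.get("SwapFree", 0),
    }

if __name__ == "__main__":
    print(sample_linuxproc())
```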

  9. Enterprise level batch system mgmt prototype (R1.3) [architecture diagram]:
     • Globus components: the Grid Gatekeeper (Globus or WP4) accepts jobs 1..n and hands each to a job manager (JM 1..n)
     • RMS components: the job managers submit new jobs to user queues (stopped, visible for users); the Runtime Control System gets job info, consults the local fabric scheduler about queues and resources, and moves scheduled jobs to the execution queue (started, invisible for users) of the batch system (PBS, LSF, etc.), where they execute on the PBS/LSF cluster
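The two-level queue idea on this slide can be sketched as follows: jobs arrive in per-user queues that are visible but stopped, and a runtime control component moves scheduled jobs into a started-but-invisible execution queue of the local batch system. The scheduler below is a trivial FIFO stand-in and PBS/LSF specifics are omitted; this is not the actual prototype code.

```python
import collections

user_queues = {"user_queue_1": collections.deque(), "user_queue_2": collections.deque()}
execution_queue = collections.deque()
free_slots = 4   # pretend the cluster has 4 free execution slots

def submit(queue_name, job):
    user_queues[queue_name].append(job)          # stopped, visible for users

def schedule():
    """Move jobs into the execution queue while resources are available."""
    global free_slots
    for queue in user_queues.values():
        while queue and free_slots > 0:
            job = queue.popleft()
            execution_queue.append(job)          # started, invisible for users
            free_slots -= 1

submit("user_queue_1", "job 1")
submit("user_queue_2", "job 2")
schedule()
print(list(execution_queue))                     # ['job 1', 'job 2']
```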

  10. Timeline for R2 developments • Configuration management: complete central part of framework • High Level Definition Language: 30/9/2002 • PAN compiler: 30/9/2002 • Configuration Database (CDB): 31/10/2002 • Installation mgmt • LCFGng for RH72: 30/9/2002 • Monitoring: Complete final framework • TCP transport: 30/9/2002 • Repository server: 30/9/2002 • Repository API WSDL: 30/9/2002 • Oracle DB support: 31/10/2002 • Alarm display: 30/11/2002 • Open Source DB (MySQL or PostgreSQL): mid-December 2002
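To picture the "TCP transport" item above: a monitoring agent serialises buffered samples and ships them to the repository server over a plain TCP socket. The host, port and wire format below are invented for the example and do not describe the actual WP4 protocol.

```python
import json, socket

REPOSITORY_HOST = "monitoring-repository.example.org"   # hypothetical
REPOSITORY_PORT = 9099                                   # hypothetical

def ship_samples(samples):
    payload = json.dumps(samples).encode() + b"\n"
    with socket.create_connection((REPOSITORY_HOST, REPOSITORY_PORT), timeout=5) as s:
        s.sendall(payload)

# ship_samples([{"node": "lxbatch001", "LoadAvg1": 0.42, "timestamp": 1030000000}])
```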

  11. Timeline for R2 developments • Resource mgmt • GLUE info providers: 15/9/2002 • Maintenance support API (e.g. enable/disable a node in the queue): 30/9/2002 • Provide accounting information to WP1 accounting group: 30/9/2002 • Support Maui as scheduler • Fault tolerance framework • Various components already delivered • Complete framework by end of November
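The "maintenance support API" item above (enable/disable a node in the queue) could, on a PBS-style cluster, boil down to something like the sketch below, assuming `pbsnodes -o <node>` marks a node offline and `pbsnodes -c <node>` clears that state. The wrapper itself is hypothetical and not the planned WP4 interface.

```python
import subprocess

def disable_node(node):
    """Drain a node so no new jobs are scheduled on it."""
    subprocess.run(["pbsnodes", "-o", node], check=True)

def enable_node(node):
    """Put a drained node back into production."""
    subprocess.run(["pbsnodes", "-c", node], check=True)

# disable_node("lxbatch042"); enable_node("lxbatch042")
```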

  12. Beyond release 2 • Conclusion from the WP4 workshop, June 2002: LCFG is not the future for EDG (see the WP4 quarterly report for 2Q02) because: • Inherent LCFG constraints on the configuration schema (per-component config) • LCFG is a project of its own and our objectives do not always coincide • We have learned a lot from the LCFG architecture and we continue to collaborate with the LCFG team • EDG future: first release by end-March 2003 • Proposal for a common schema for all fabric configuration information to be stored in the configuration database, implemented using the HLDL. • New configuration client and node management replacing the LCFG client (the server side is already delivered in October). • New software package management (replacing updaterpms) split into two modules: an OS-independent part and an OS-dependent part (packager).
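A sketch of what the proposed split could look like: an OS-independent part computes the difference between the desired and installed package sets, and an OS-dependent "packager" applies it (rpm in this example). Class and function names are invented for the illustration; this is not the actual replacement for updaterpms.

```python
import subprocess

class RpmPackager:
    """OS-dependent part: query and change installed packages via rpm."""
    def installed(self):
        out = subprocess.run(["rpm", "-qa", "--qf", "%{NAME}\n"],
                             capture_output=True, text=True, check=True).stdout
        return set(out.split())

    def install(self, rpm_files):
        if rpm_files:
            subprocess.run(["rpm", "-Uvh"] + list(rpm_files), check=True)

    def remove(self, package_names):
        if package_names:
            subprocess.run(["rpm", "-e"] + list(package_names), check=True)

def converge(desired_names, repository, packager):
    """OS-independent part: decide what to add or remove; the packager does it.

    `repository` is a hypothetical mapping from package name to rpm file path.
    """
    installed = packager.installed()
    to_install = set(desired_names) - installed
    to_remove = installed - set(desired_names)
    packager.install([repository[name] for name in sorted(to_install)])
    packager.remove(sorted(to_remove))
```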

  13. Global schema tree [tree diagram]: top-level branches include system, sw, cluster and hardware; hardware breaks down into CPU, harddisk, memory, …; system into hostname, architecture, partitions, services, …; sw into packages, known_repositories, edg_lcas, …; harddisk into hda1, hda2, …; component-specific configuration (e.g. edg_lcas) hangs under services; leaf attributes include sys_name, interface_type, size, type, id, version, repositories, …. The population of the global schema is an ongoing activity: http://edms.cern.ch/document/352656/1
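One plausible reading of the flattened tree diagram, written as a nested dict purely for readability. The authoritative layout is the EDMS document cited on the slide; only labels visible in the diagram are shown here, with None standing for elided parts ("….").

```python
global_schema = {
    "hardware": {"CPU": None, "harddisk": {"hda1": None, "hda2": None}, "memory": None},
    "system": {"hostname": None, "architecture": None, "partitions": None,
               "services": {"edg_lcas": "component-specific configuration"}},
    "sw": {"packages": None, "known_repositories": None, "edg_lcas": None},
    "cluster": None,
}
# Leaf attributes visible in the diagram: sys_name, interface_type, size,
# type, id, version, repositories, ...
```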

  14. Global schema example: SW repository structure (maintained by repository managers):
      /sw/known_repositories/Arep/url = (host, protocol, prefix dir)
                                 /owner =
                                 /extras =
                                 /directories/dir_name_X/path = (asis)
                                                         /platform = (i386_rh61)
                                                         /packages/pck_a/name = (kernel)
                                                                        /version = (2.4.9)
                                                                        /release = 31.1.cern
                                                                        /architecture = (i686)
                                             /dir_name_Y/path = (sun_system)
                                                         /platform = (sun4_58)
                                                         /packages/pck_b/name = (SUNWcsd)
                                                                        /version = 11.7.0
                                                                        /release = 1998.09.01.04.16
                                                                        /architecture = (?)
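The same repository example written out as a nested dict, plus a small helper that lists the packages available for a given platform. The field values are taken from the slide; the helper function is only an illustration, not part of the schema.

```python
arep = {
    "url": ("host", "protocol", "prefix dir"),
    "owner": None,
    "extras": None,
    "directories": {
        "dir_name_X": {
            "path": "asis",
            "platform": "i386_rh61",
            "packages": {
                "pck_a": {"name": "kernel", "version": "2.4.9",
                          "release": "31.1.cern", "architecture": "i686"},
            },
        },
        "dir_name_Y": {
            "path": "sun_system",
            "platform": "sun4_58",
            "packages": {
                "pck_b": {"name": "SUNWcsd", "version": "11.7.0",
                          "release": "1998.09.01.04.16", "architecture": None},
            },
        },
    },
}

def packages_for_platform(repo, platform):
    return [pkg["name"]
            for d in repo["directories"].values() if d["platform"] == platform
            for pkg in d["packages"].values()]

print(packages_for_platform(arep, "i386_rh61"))   # ['kernel']
```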

  15. Problem • Very little of the delivered WP4 software is of any interest to the EDG application WPs, possibly with the exception of producing nice colour plots of the CPU loads when a job was run… • This is normal, but… • Site administrators do not grow on trees. Because of the lack of good system admin tools, like the ones WP4 tries to develop, the configuration, installation and supervision of the testbed installations require a substantial amount of manual work. • However, thanks to Bob's new priority list the need for automated configuration and installation has bubbled up the required-features stack to become absolutely vital for assuring good quality.

  16. Summary • Substantial amount of s/w piled up from R1.3, R1.4 to be deployed now • R2 also includes two large components: • LCFGng – the migration is non-trivial, but we already perform as much of the non-trivial part as possible ourselves, so TB integration should be smooth • Complete monitoring framework • Beyond R2: LCFG is not the future for EDG WP4. First version of the new configuration and node management system in March 2003
