
Service planning and monitoring in T2 - Prague

This overview covers service planning, current status, capacities, networking, personnel, monitoring, hardware and software, middleware, and service at the Prague Tier-2 centre, as presented at a GDB meeting.



Presentation Transcript


  1. Service planning and monitoring in T2 - Prague • GDB meeting, BNL • Milos Lokajicek

  2. Overview
  • Introduction
  • Service planning and current status
  • Capacities
  • Networking
  • Personnel
  • Monitoring
  • HW and SW
  • Middleware
  • Service
  • Remarks

  3. Introduction
  • Czech Republic’s LHC activities
    • ATLAS, target 3% of authors -> activities
    • ALICE, target 1%
    • TOTEM, a much smaller experiment, relative target higher
    • (non-LHC: HERA/H1, TEVATRON/D0, AUGER)
  • Institutions (mentioning just the big groups)
    • Academy of Sciences of the Czech Republic
      • Institute of Physics
      • Nuclear Physics Institute
    • Charles University in Prague
      • Faculty of Mathematics and Physics
    • Czech Technical University in Prague
      • Faculty of Nuclear Sciences and Physical Engineering
  • HEP manpower (2005): 145 people
    • 59 physicists
    • 22 engineers
    • 21 technicians
    • 43 undergraduate and PhD students

  4. Service planning
  • Table based on the LCG MoU for ATLAS and ALICE and our anticipated share
  • Project proposals to various grant systems in the Czech Republic
  • Preparing a bigger project proposal for a CZ GRID together with CESNET
    • For the LHC needs
    • In 2010, add 3x more capacity for Czech non-HEP scientists, financed from state resources and EU structural funds
  • All proposals include new personnel (up to 10 new persons)
  • Today: regular financing, sufficient for D0
    • currently 250 cores, 150 kSI2k, 40 TB disk space, no tapes

  5. Networking
  • Local connection of institutes in Prague
    • Optical 1 Gbps E2E lines
  • WAN
    • Optical E2E lines to Fermilab and Taipei; new line to FZK (from 1 Sept 06)
    • Connection Prague – Amsterdam now through GN2
    • Planning further lines to other T1s
  (Sima @ CEF Networks workshop, Prague, May 30th, 2006)

  6. Personnel
  • Now 4 persons to run the T2
    • Jiri Kosina – middleware (leaving, looking for replacement), storage (FTS), monitoring
    • Tomas Kouba – middleware, monitoring
    • Jan Svec – basic HW, OS, storage, networking, monitoring
    • Lukas Fiala – basic HW, networking, web services
    • Jiri Chudoba – liaison to ATLAS and ALICE, running the jobs and reporting errors, service monitoring
  • Further information is based on their experience

  7. Monitoring
  • HW and basic SW
    • installation and testing of new hardware: normally choose proven HW; HW installation by the delivery firm; install the operating system and solve problems with the delivery firm; install middleware; test it for some time outside the production service
  • Nagios
    • worker node access via ping
    • disks – how full the partitions are
    • load average
    • whether the pbs_mom process is running
    • number of running processes
    • whether the ssh daemon is running
    • how full the swap is
    • ….
  • Limits for warning and error
  • Distribution of mails or SMS to admins – fixing problems remotely
  • Regular check of the Nagios web page for red dots
  • Regular automatic (cron) checks and restarts for some daemons
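The cron-driven daemon checks mentioned on this slide can be sketched as a small shell script. This is a hypothetical illustration, not the site's actual tooling; the daemon name and init-script path are assumptions.

```shell
#!/bin/sh
# Hypothetical sketch of a cron-driven daemon check with automatic restart.
# check_daemon prints "ok" when a process with the given name is running,
# otherwise prints "restarting" and invokes the given (illustrative) init script.
check_daemon() {
    name="$1"
    init_script="$2"
    if pgrep -x "$name" > /dev/null 2>&1; then
        echo "ok"
    else
        echo "restarting"
        "$init_script" start > /dev/null 2>&1
    fi
}

# Example: watch pbs_mom; the init-script path is an assumption.
check_daemon pbs_mom /etc/init.d/pbs_mom
```

In practice a line like `*/5 * * * * /usr/local/sbin/check-pbs-mom.sh` in root's crontab would run such a check every few minutes, complementing the Nagios alerts described above.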

  8. Monitoring
  • PBS
    • job count (via RRD and mrtg)
    • local tools for monitoring the number of jobs per machine over a chosen period
  • APEL
    • not very useful; might be set up to provide more useful info
  • GridICE
  • ATLAS
    • checks and statistics from the ATLAS database
  • ALICE – MonALISA – very useful
    • monitors pool accounts and actual user certificates
  • Networking
    • network traffic to FZK, SARA, CERN in certain IP ranges
    • with the help of IP accounting (utility ipac-ng), http://golias100.farm.particle.cz/ipac/
  • SFT – site functional tests – very useful
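The per-site traffic accounting described above boils down to summing byte counts per destination range. A minimal sketch of that aggregation step, assuming a simplified "label bytes" record format rather than ipac-ng's real output:

```shell
#!/bin/sh
# Hypothetical sketch of ipac-ng-style per-destination traffic accounting.
# sum_bytes totals the byte counts for one destination label; the record
# format ("label bytes" per line) and the labels are assumptions.
sum_bytes() {
    # $1 = destination label to sum over; records are read from stdin
    awk -v net="$1" '$1 == net { total += $2 } END { print total + 0 }'
}

# Example records: destination label and bytes transferred in one interval
printf '%s\n' \
    'fzk 1000' \
    'cern 2500' \
    'fzk 500' | sum_bytes fzk
```

Totals produced this way can then be fed to RRD/mrtg for the kind of Max/Average/Total graphs shown on the next slide.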

  9. Network traffic graphs
  • outgoing to fzk1 – Max: 37M, Average: 6M, Total: 129G
  • outgoing to internet – Max: 61M, Average: 8M, Total: 164G

  10. Updates and patches
  • YAIM + automated updates on all farm nodes using the simple BEX script toolkit (takes care of upgrading a node that was switched off during the deployment/upgrade phase ... keeps all nodes in sync automatically)
  • ftp://atrey.karlin.mff.cuni.cz/pub/local/mj/linux/bex-2.0.tar.gz, info in the README file
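The key idea behind the BEX-style sync, keeping a per-node record of the last applied update so a node that was powered off simply catches up on the next pass, can be sketched as follows. This is an illustration of the principle only, not BEX itself; all paths and file layouts are assumptions.

```shell
#!/bin/sh
# Minimal sketch of keeping a node in sync with a series of numbered update
# scripts, in the spirit of the BEX toolkit. A state file records the last
# update that completed, so missed updates are replayed on the next run.
UPDATES_DIR="./updates"      # numbered update scripts: 01-foo.sh, 02-bar.sh ...
STATE_FILE="./last_applied"  # marker of the last successfully applied update

apply_pending() {
    last=$(cat "$STATE_FILE" 2>/dev/null || echo "")
    for u in "$UPDATES_DIR"/*.sh; do
        [ -e "$u" ] || continue
        name=$(basename "$u")
        # skip everything up to and including the recorded update
        if [ -n "$last" ] && [ "$(expr "$name" \> "$last")" = "0" ]; then
            continue
        fi
        # run the update; only record it as applied if it succeeded
        sh "$u" && echo "$name" > "$STATE_FILE"
    done
}
```

Run from cron (or at boot) on every node, each pass is idempotent: an up-to-date node does nothing, while a node that missed a deployment applies the outstanding scripts in order.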

  11. Service monitoring
  • Using the checks described above and their combinations
  • Rely on useful monitors supported centrally or by the experiments
  • We would appreciate an early warning if jobs on some site/worker nodes start to fail quickly after submission
  • Service requirements for T2s in “extended” working hours
    • No special plan today
    • Try to provide an architecture such that the responsible people can even travel and still do as much as possible remotely (e.g. network console access)
    • Future computing capacities will probably require new arrangements
