CASTOR Monitoring
Olof Bärring, Jan van Eldik, Miguel Marques Coelho Dos Santos, Ignacio Reguero
CERN Castor operations team
CASTOR F2F 2007, CNAF Bologna, October 29, 2007

Outline
• Monitoring
• Support Infrastructure
• Common Operational Tasks
• Tools for Disk Server Draining and Cleanup
• Tools for Debugging
• Checkreplicas
• Filters
• File Consistency Checks

Support Infrastructure
• HelpDesk, GGUS
• Operator
• Service Manager On Duty
• SysAdmin
• CASTOR Service Expert
• CASTOR Developer

System-level alarms
• Operator
  • 1st level alarm handling
  • 24 x 7 coverage on site
  • Driven by procedures
• SysAdmin
  • 2nd level alarm handling
  • 24 x 7 coverage, on-call out of working hours
  • Problem determination
  • Manage hardware repairs
• CASTOR Service Expert
  • Service responsible
  • Applies software upgrades and configuration changes, and provides procedures
  • Manages disruptive interventions
  • Handles problematic situations

User support
• HelpDesk, GGUS
  • 1st level user support
  • Handle common user questions
  • Triage
• Service Manager On Duty
  • 2nd level user support
  • Handle common problems
  • Procedure driven
• CASTOR Service Expert
  • Handle uncommon and complex problems
  • Provide procedures

Number of calls per week
    HelpDesk, GGUS             Operator
    127                        18
    Service Manager On Duty    SysAdmin
    6                          5
    Castor Service Expert
    0.5
    Castor Developer
• Plus ~10 calls via support lists, direct e-mails, phone calls, …

Management tools: Quattor/LEAF
• Automated node installation, configuration
  • We have developed a set of Quattor templates that describe all production servers
  • We reuse many existing configuration components (incl. yaim, grid mapfiles, LSF7). We have developed only Castor-specific components, in particular
    • Castorconf
    • Castor2_migHunter
• Exploiting namespaces to implement stages
  • Directory-like namespaces: /prod, /test, /pps
  • Useful for preproduction testing
• Disk server state management with LEAF
  • SMS status linked to Castor status
      Sms set standby draining 'Vendor Call' lxffrc4041
  • Failing servers remove themselves from production
• The management of host certificates has been largely automated
Aim: box management done by SysAdmins

Management tools: Monitoring
• Monitoring metrics in Lemon
  • Generate operator alarms, run automatic recovery actions, display performance
  • In addition we have developed Castor-specific Lemon metrics, probes, etc.
• New: Castor service views with new version
  • Per service class, stager, all
  • In test
  • Short demo of https://lemonweb03.cern.ch/lemon-web/info.php?entity=c2atlas/default&cluster=1&type=host&detailed=yes
• Also: we have developed internal Castor monitoring in PLSQL
  • To be taken over by castor-dev

Management tools
• Service Level Status Display
  • Availability information for end users
  • Service manager provides functional test
  • Sometimes too honest
  • Display of config and other information
  • Link to Lemon
  • Short demo of https://lemonweb.cern.ch/sls/index.php

Common operational tasks
• Disk pool reconfigurations
  • Adding disk servers to a service class
  • The SysAdmin team already handles disk server installations
  • Castor Service Expert only handles central Castor configuration
• Draining disk servers for OS upgrade, repairs, …
  • Interventions requiring only temporary downtime are now handled by the SysAdmin team
  • Removing a server and its files from the disk cache requires Castor Service Expert
• Repair procedures on disk cache
  • Regular ‘cleaning’ of the database, repairing tape recalls/migrations, consistency checks
  • Problems diminishing as software matures
  • Requesting this functionality as part of the system

Tools for Disk Server Draining
• Servers need to be taken out of production for repair, upgrade, … while preserving user files
• The disknodeShutdown tool copies files not yet migrated to tape to other servers in the same service class
  • Implemented by moving the server to the “recovery” service class and requesting the files from their original class
  • This triggers disk-to-disk copies
  • The number of active replications is limited
• In the case of disk-only (D1T0) service classes all files have to be replicated

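The selection rule described on this slide can be sketched in a few lines of Python. This is only an illustration, not the disknodeShutdown implementation: the `DiskCopy` record and the function name are hypothetical, and it assumes that the CANBEMIGR status marks a disk copy that has not yet been migrated to tape.

```python
from dataclasses import dataclass


@dataclass
class DiskCopy:
    """Hypothetical record for one file copy on the server being drained."""
    fileid: int
    status: str  # e.g. "STAGED" or "CANBEMIGR"


def files_to_replicate(copies, disk_only):
    """Return the file ids that must be copied elsewhere before draining.

    Assumption: CANBEMIGR means "not yet on tape". In a disk-only (D1T0)
    service class there is no tape copy at all, so every file qualifies.
    """
    if disk_only:
        return [c.fileid for c in copies]
    return [c.fileid for c in copies if c.status == "CANBEMIGR"]
```

For a tape-backed class only the unmigrated files are selected; for a D1T0 class the whole list comes back.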
Tools for Disk Server Draining
• Options are provided to
  • Choose the status of files to be replicated
  • Exclude faulty filesystems
  • Reuse the list of files to be replicated from a previous run
• Typical usage
    disknodeShutdown -n node001 -s default -v 'CANBEMIGR|STAGED'
• The goal is to integrate the tool with the monitoring infrastructure, so it is prepared for hands-off operation:
  • Extensive validation of the results
  • If needed, a list of problem files is produced for further analysis

Tools for Disk Server Draining
• Fixstageout <diskserver …> generates a script and DB commands to clean up and, if possible, recover files in STAGEOUT for the most common cases
  • Files may be left in STAGEOUT because the “mover” did not finish correctly, but also because of other problems
• Example case: correct ns, correct DB, do putDone
    # http://savannah.cern.ch/bugs/?29491
    nssetfssize $nsfilepath $rfdirfilesize
    update castorfile set filesize=$rfdirfilesize where filesize=$nslsfilesize and id=$castorfile;
    call putDoneFunc($castorfile,$rfdirfilesize,0);
    commit;

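As an illustration of how such a command template might be filled in, the Python sketch below renders the four lines of the example case for one file. It is not the Fixstageout code: the function name and argument names are hypothetical, and it only substitutes values into the template shown on the slide without running anything.

```python
def stageout_fix_commands(nsfilepath, castorfile, ns_size, disk_size):
    """Render the slide's recovery template for one file stuck in STAGEOUT.

    ns_size is the size recorded in the name server and stager DB,
    disk_size is the size reported by rfdir on the disk server.
    The commands are returned as text; nothing is executed.
    """
    return [
        f"nssetfssize {nsfilepath} {disk_size}",
        f"update castorfile set filesize={disk_size} "
        f"where filesize={ns_size} and id={castorfile};",
        f"call putDoneFunc({castorfile},{disk_size},0);",
        "commit;",
    ]
```

Keeping the generator side-effect free matches the approach on the slide, where the output is a script that an operator can inspect before running it.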
Tools for Disk Server Draining
• checkdiskserver <diskserver…> prints a summary status with the number of files in each state
    lxfsrc3802.cern.ch default DISKSERVER_PRODUCTION STAGED=5831 CANBEMIGR=257 STAGEOUT=780 INVALID=2 WAITDISK2DISKCOPY=2

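A summary in this form is easy to feed into further tooling. The Python sketch below parses it, assuming the layout seen on the slide: host, service class and server status followed by whitespace-separated `STATE=count` fields. The function name is hypothetical and the real checkdiskserver output may be laid out differently.

```python
def parse_summary(text):
    """Split a checkdiskserver-style summary into its parts.

    Assumes three leading fields (host, service class, server status)
    followed by STATE=count pairs, all separated by whitespace.
    """
    fields = text.split()
    host, svcclass, server_status = fields[:3]
    counts = {}
    for item in fields[3:]:
        state, _, value = item.partition("=")
        counts[state] = int(value)
    return host, svcclass, server_status, counts
```

With the per-state counts in a dictionary, a drain script could, for example, refuse to proceed while `counts["STAGEOUT"]` is non-zero.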
Tools for Debugging
• checkreplicas <fileid|castorpath…> displays file status with
  • Nameserver status
  • Migration status
  • Fileclass
  • Stager status
  • Diskcopies
  • Filesystem/fileserver status for the diskcopies
  • Recall status if the file is being recalled
  • Suggested fix if the file is in STAGEOUT

Checkreplicas: Staged 2 copies
    [c2cmssrv02] ~ > ./checkreplicas /castor/cern.ch/cms/archive/ecal/h4tb.pool-SM06/h4b.00016915.A.0.0.root
    NSfileid=97095676 Castorfile=764005331 :
    nsls -l --class: 8 mr--r--r-- 1 ecalhsm zh 100341562 Oct 17 2006 /castor/cern.ch/cms/archive/ecal/h4tb.pool-SM06/h4b.00016915.A.0.0.root
    File migrated
    nsls -T: - 1 1 I50204 715 0001f9c3 100341562 102 /castor/cern.ch/cms/archive/ecal/h4tb.pool-SM06/h4b.00016915.A.0.0.root
    FileClass ID=8 NAME=cms Tape fileclass with 1 copy
    Tape volume status = I50204 I50204 IBM_LIB1 500GC aul cmsfamily_new1 0B 20071025 FULL
    DiskcopyId=766238217 DISKCOPY_STAGED lxfsrc3804.cern.ch:/srv/castor/02/76/97095676@castorns.766238217 DISKSERVER_PRODUCTION FILESYSTEM_PRODUCTION DiskPool=default
    DiskcopyId=764005334 DISKCOPY_STAGED lxfsrc3801.cern.ch:/srv/castor/01/76/97095676@castorns.764005334 DISKSERVER_PRODUCTION FILESYSTEM_PRODUCTION DiskPool=default

Checkreplicas: Disk only file
    [c2cmssrv02] ~ > ./checkreplicas /castor/cern.ch/cms/store/unmerged/CSA07/2007/10/15/thewholestew/GEN-SIM-DIGI-RAW/0000/5E36BD8A-997E-DC11-9410-001617C3B6C4.root
    NSfileid=159587975 Castorfile=754819735 :
    nsls -l --class: 58 -rw-r--r-- 1 cmsprod zh 42327603 Oct 20 09:27 /castor/cern.ch/cms/store/unmerged/CSA07/2007/10/15/thewholestew/GEN-SIM-DIGI-RAW/0000/5E36BD8A-997E-DC11-9410-001617C3B6C4.root
    File not migrated
    FileClass ID=58 NAME=temp No tape fileclass
    DiskcopyId=754819739 DISKCOPY_STAGED lxfsre1906.cern.ch:/srv/castor/02/75/159587975@castorns.754819739 DISKSERVER_PRODUCTION FILESYSTEM_PRODUCTION DiskPool=t0export

Checkreplicas: Disk only file
    NSfileid=160893627 Castorfile=773391840 :
    nsls -l --class: 58 -rw-r--r-- 1 ceballos zh 26124698 Oct 29 12:56 /castor/cern.ch/cms/store/unmerged/Generators/2007/10/29/RelVal-3j_80_140-alpgen-1193640863/GEN-SIM/0080/0EF7A4D4-0D86-DC11-B382-0016177CA778.root
    File not migrated
    FileClass ID=58 NAME=temp No tape fileclass
    DiskcopyId=774012855 DISKCOPY_STAGED lxfsrc3805.cern.ch:/srv/castor/04/27/160893627@castorns.774012855 DISKSERVER_PRODUCTION FILESYSTEM_PRODUCTION DiskPool=default
    DiskcopyId=774060478 DISKCOPY_STAGED lxfsrd1002.cern.ch:/srv/castor/03/27/160893627@castorns.774060478 DISKSERVER_PRODUCTION FILESYSTEM_PRODUCTION DiskPool=default
    DiskcopyId=773391841 DISKCOPY_STAGED lxfsrc5005.cern.ch:/srv/castor/01/27/160893627@castorns.773391841 DISKSERVER_PRODUCTION FILESYSTEM_PRODUCTION DiskPool=cmsprod

Checkreplicas: Not in stager
    [c2cmssrv02] ~ > ./checkreplicas /castor/cern.ch/user/s/stage/ignatestcms/pruigna
    File (133761510) /castor/cern.ch/user/s/stage/ignatestcms/pruigna not in stager
    nsls -l --class: 95 mrw-r--r-- 1 stage st 183 Jul 06 14:05 /castor/cern.ch/user/s/stage/ignatestcms/pruigna
    File migrated
    nsls -T: - 1 1 I06216 998 0006330c 183 0 /castor/cern.ch/user/s/stage/ignatestcms/pruigna
    FileClass ID=95 NAME=largeuser Tape fileclass with 1 copy
    Tape volume status = I06216 I06216 IBM_LIB1 700GC aul user_new 0B 20071022 FULL

Checkreplicas -r: STAGEOUT
    NSfileid=133581711 Castorfile=528965656 :
    DiskcopyId=528965659 DISKCOPY_STAGEOUT lxfsrc3804.cern.ch:/srv/castor/04/11/133581711@castorns.528965659 DISKSERVER_PRODUCTION FILESYSTEM_PRODUCTION DiskPool=default
    rfdir: -rw------- 1 stage st 286214755 Jul 06 12:13 lxfsrc3804.cern.ch:/srv/castor/04/11/133581711@castorns.528965659
    # file in STAGEOUT
    # filesize 0 in ns and not 0 in diskserver
    ./nssetfssize 286214755
    update castorfile set filesize=286214755 where filesize=0 and id=528965656;
    commit;
    call putDoneFunc(528965656,286214755,0);
    commit;

Stuck Tape Recall
    File migrated
    nsls -T: - 1 1 I04171 2924 001a07f5 104886582 100 /castor/cern.ch/user/a/akyriaki/SingleParticle/CMSSW131/Pi0/Ebin_50_100/Pi0_Emin_50_Emax_100_9.root
    Warning: NSLS SIZE=405362521 SEGMENT SIZE=104886582,
    FileClass ID=95 NAME=largeuser Tape fileclass
    DiskcopyId=467771903 DISKCOPY_FAILED lxfsrc3803:/srv/castor/03/57/127675657@castorns.467771903 DISKSERVER_PRODUCTION FILESYSTEM_PRODUCTION DiskPool=default
    DiskcopyId=469785764 DISKCOPY_WAITTAPERECALL lxfsra1205:/srv/castor/03/57/127675657@castorns.469785764 DISKSERVER_PRODUCTION FILESYSTEM_PRODUCTION DiskPool=default
    Tape volume status = I04171 I04171 IBM_LIB2 700GC aul user_new 0B 20070611 FULL
    Tape queue status = nothing in showqueues for VID=I04171
    Tape ID=467771910 VID=I04171 linked with unprocessed segments STATUS=2
    TapeCopy ID=469785766 STATUS=4
    Segment ID=469785768 STATUS=0
    SubRequest STATUS COUNT(*)
    4 1
    SubRequest ID=467771891, STATUS=4
    DiskcopyId=467234297 DISKCOPY_CANBEMIGR lxfsra1208:/srv/castor/01/57/127675657@castorns.467234297 DISKSERVER_DRAINING FILESYSTEM_PRODUCTION DiskPool=default

Good Tape Recall
    [c2cmssrv02] ~ > diskServer_qry lxfsrc3805.cern.ch | grep RECALL | awk '{print $2}' | ./disktofileid | xargs ./checkreplicas
    NSfileid=153850073 Castorfile=775044419 :
    nsls -l --class: 95 mrw-r--r-- 1 bellan zh 3421229897 Sep 23 04:55 /castor/cern.ch/user/b/bellan/ZReco/CMSSW_1_6_0/Zmumu_146.root
    File migrated
    nsls -T: - 1 1 I06663 847 0006f65b 3421229897 102 /castor/cern.ch/user/b/bellan/ZReco/CMSSW_1_6_0/Zmumu_146.root
    FileClass ID=95 NAME=largeuser Tape fileclass with 1 copy
    Tape volume status = I06663 I06663 IBM_LIB2 700GC aul user_new 86.32GB 20071030 RDONLY
    DiskcopyId=775044431 DISKCOPY_WAITTAPERECALL lxfsrc3805.cern.ch:/srv/castor/01/73/153850073@castorns.775044431 DISKSERVER_PRODUCTION FILESYSTEM_PRODUCTION DiskPool=default
    Tape queue status = Q 3592B2 I06663 R 16211526 (stage,st)@c2cmssrv02.cern.ch 16276
    Tape ID=669090436 VID=I06663 STATUS=2
    TapeCopy ID=775044429 STATUS=4
    Segment ID=775044430 STATUS=0 pointing to tape ID=669090436 VID=I06663
    SubRequest STATUS COUNT(*)
    4 1
    SubRequest ID=775044410, STATUS=4

Stuck recall with disabled FS
    ./checkreplicas 123608654
    Fileid=123608654 :
    mrw-r--r-- 1 dmangeol zh 215539641 May 23 04:24 /castor/cern.ch/user/d/dmangeol/HLTSamples/HLT_LMD1_NoPU.root
    DiskcopyId=453961697 DISKCOPY_STAGED lxfsra1208:/srv/castor/03/54/123608654@castorns.453961697 DISKSERVER_DRAINING FILESYSTEM_DISABLED
    DiskcopyId=465439868 DISKCOPY_WAITTAPERECALL none :none 54/123608654@castorns.465439868 DISKSERVER_NONE FILESYSTEM_NONE
    Tape volume status = I04168 I04168 IBM_LIB2 700GC aul user_new 387.56GB 20070530 RDONLY
    Tape queue status = nothing in showqueues for VID=I04168
    TapeCopy ID=465439875 STATUS=4
    SubRequest STATUS COUNT(*)
    5 13
    4 1
    9 3
    SubRequest ID=465440547, STATUS=4

Disk to Disk Copy (replication)
    [c2cmssrv02] ~ > diskServer_qry lxfsrc3805.cern.ch | grep DISK2 | awk '{print $2}' | ./disktofileid | xargs ./checkreplicas
    NSfileid=150952768 Castorfile=774606128 :
    nsls -l --class: 8 mrw-r--r-- 1 mlmiller zh 4882993267 Sep 05 15:40 /castor/cern.ch/cms/store/Production/2007/9/3/PreCSA07-BtoJpsi-A1/0001/B03305DA-AD5B-DC11-A45C-000423D9506C.root
    File migrated
    nsls -T: - 1 1 T05921 89 000e13d1 4882993267 102 /castor/cern.ch/cms/store/Production/2007/9/3/PreCSA07-BtoJpsi-A1/0001/B03305DA-AD5B-DC11-A45C-000423D9506C.root
    FileClass ID=8 NAME=cms Tape fileclass with 1 copy
    Tape volume status = T05921 T05921 SL8600_0 500GC aul cms_t10k 0B 20071030 FULL
    DiskcopyId=776153609 DISKCOPY_WAITDISK2DISKCOPY lxfsrc3805.cern.ch:/srv/castor/02/68/150952768@castorns.776153609 DISKSERVER_PRODUCTION FILESYSTEM_PRODUCTION DiskPool=default
    DiskcopyId=774606131 DISKCOPY_STAGED lxfsrc3805.cern.ch:/srv/castor/03/68/150952768@castorns.774606131 DISKSERVER_PRODUCTION FILESYSTEM_PRODUCTION DiskPool=default

Filters
• Other filters have been produced to make it easier to work with file lists
  • Disktofileid
  • Disktons
  • Donsls
  • DonslsT
  • Stagerqry <svcclass>
  • Stagerget <svcclass>
• Example
    diskServer_qry lxfsrc4401 | grep STAGEOUT | awk '{print $1}' | ./disktofileid | xargs ./checkreplicas -r

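The first stages of the example pipeline (select lines by status, then turn a disk copy path into a file id) can be approximated in Python as below. This is a sketch, not the disktofileid script: the regular expression is inferred from the `host:/dir/fileid@nameserver.diskcopyid` paths in the checkreplicas listings, and the function name and input lines are hypothetical because the diskServer_qry output format is not shown in the slides.

```python
import re

# Path layout seen in the checkreplicas listings:
#   host:/some/dir/<fileid>@<nameserver>.<diskcopyid>
DISKCOPY_PATH = re.compile(
    r"(?P<host>[\w.-]+):(?P<dir>/\S*?)/(?P<fileid>\d+)"
    r"@(?P<nshost>[\w-]+)\.(?P<diskcopyid>\d+)"
)


def fileids_with_status(lines, status):
    """Yield the file id of every line that mentions `status`.

    Mirrors `grep STATUS | disktofileid`: a plain substring test on
    the line, then extraction of the id from the disk copy path.
    """
    for line in lines:
        if status not in line:
            continue
        match = DISKCOPY_PATH.search(line)
        if match:
            yield int(match.group("fileid"))
```

The resulting ids could then be handed to checkreplicas in the same way the shell example uses xargs.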
File Consistency Checks
• We run a tool from acrontab that collects the status of the files on the disk servers
  • A recursive ls (rfdir)
  • Stores the listings in an AFS buffer
• This is then cross-checked against the stager DB, looking for
  • misplaced files
  • files with wrong size
  • files on a disk server but not in the stager DB
  • files in the stager DB but not on any disk server
• The tool also checks whether
  • the file is in the name server
  • the file has a "tape" file class
  • the file has been migrated to tape

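The cross-check between the disk listing and the stager DB amounts to comparing two mappings. The Python sketch below shows one way to do it under stated assumptions: both sides are keyed by disk copy id and hold a `(location, size)` pair, and "misplaced" is taken to mean that the location on disk differs from the one recorded in the DB. The function and category keys are illustrative and are not taken from the actual tool.

```python
def crosscheck(on_disk, in_stager):
    """Compare a disk listing with the stager DB view.

    Both arguments map a disk copy id to a (location, size) tuple.
    Returns the disk copy ids that fall into each problem category.
    """
    report = {
        "misplaced": [],
        "sizemismatch": [],
        "notinstagerondisk": [],
        "instagernotondisk": [],
    }
    for dcid, (location, size) in on_disk.items():
        if dcid not in in_stager:
            report["notinstagerondisk"].append(dcid)
            continue
        db_location, db_size = in_stager[dcid]
        if location != db_location:
            report["misplaced"].append(dcid)
        if size != db_size:
            report["sizemismatch"].append(dcid)
    # Entries known to the stager that no disk server reported.
    report["instagernotondisk"] = [d for d in in_stager if d not in on_disk]
    return report
```

A single entry can appear in more than one category, for instance when a file is both in the wrong place and has the wrong size.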
File Consistency Checks
• The tool generates the commands needed for recovery and cleanup of the problem cases
• It has been used in the production environment to recover from
  • bugs in early CASTOR versions
  • hardware problems
• We are considering extending the program to verify the checksums of migrated files

File Consistency Checks
• Outputs a set of scripts as follows
    cms/1192543667/samediskcopyid/oneindb
    cms/1192543667/samediskcopyid/differentoneindb
    cms/1192543667/samediskcopyid/noneindb
    cms/1192543667/samediskcopyid
    cms/1192543667/sizemismatch
    cms/1192543667/misplaced/inns
    cms/1192543667/misplaced/notinns
    cms/1192543667/misplaced
    cms/1192543667/instagernotondisk/innsmigrated
    cms/1192543667/instagernotondisk/innsnotmigratednotapefileclass
    cms/1192543667/instagernotondisk/innsnotmigratedtapefileclass
    cms/1192543667/instagernotondisk/notinns
    cms/1192543667/instagernotondisk
    cms/1192543667/notinstagerondisk/innsmigrated
    cms/1192543667/notinstagerondisk/innsnotmigratednotapefileclass
    cms/1192543667/notinstagerondisk/innsnotmigratedtapefileclass
    cms/1192543667/notinstagerondisk/notinns
    cms/1192543667/notinstagerondisk

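The leaf names in this listing appear to encode three nameserver facts about each file: whether it is in the name server, whether it has been migrated, and whether its file class has a tape copy. The Python sketch below reproduces that naming under those assumptions. The function names are hypothetical, and reading the first two path components as a stager instance name and a run timestamp is a guess.

```python
def ns_category(in_ns, migrated, tape_fileclass):
    """Map nameserver facts about a file to the leaf name used above."""
    if not in_ns:
        return "notinns"
    if migrated:
        return "innsmigrated"
    if tape_fileclass:
        return "innsnotmigratedtapefileclass"
    return "innsnotmigratednotapefileclass"


def script_path(instance, run_id, problem, in_ns, migrated, tape_fileclass):
    """Build the script path for one problem class and nameserver state."""
    leaf = ns_category(in_ns, migrated, tape_fileclass)
    return f"{instance}/{run_id}/{problem}/{leaf}"
```

Splitting the cases this way keeps the safe ones (a lost disk copy of a file that is already on tape) apart from those where the disk copy may be the only one.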
Cleanup Actions
• ./notinstagerondisk/innsmigrated
    # not in stager in ns migrated. size=17172963
    /usr/bin/rfrm lxfsre2106.cern.ch:/srv/castor/03/95/129303495@castorns.653840190
• ./instagernotondisk/innsmigrated
    /usr/bin/cleanLostFiles 530973710 # missing diskcopy lxfsrk3905.cern.ch:/srv/castor/01/58/112646158@castorns.530973710 in ns migrated. size=13139

PLSQL Check
• 1193738430/PROCEDURE/ANYSEGMENTSFORTAPE.sql
• 1193738430/PROCEDURE/ANYTAPECOPYFORSTREAM.sql
• 1193738430/PROCEDURE/ARCHIVESUBREQ.sql
• 1193738430/PROCEDURE/BASICINPUTFORSTREAMPOLICY.sql
• 1193738430/PROCEDURE/BESTFILESYSTEMFORSEGMENT.sql
• 1193738430/PROCEDURE/BESTTAPECOPYFORSTREAM.sql
• 1193738430/PROCEDURE/BUILDPATHFROMFILEID.sql
• 1193738430/PROCEDURE/BULKDELETE.sql
• 1193738430/PROCEDURE/CHANGESTREAMSSTATUS.sql
• 1193738430/PROCEDURE/CHECKFILEFORREPACK.sql
• 1193738430/PROCEDURE/CHECKFSBACKINPROD.sql
• 1193738430/PROCEDURE/CHECKPERMISSION.sql
• 1193738430/PROCEDURE/CREATEPPUT.sql
• 1193738430/PROCEDURE/DELETEARCHIVEDREQUESTS.sql
• 1193738430/PROCEDURE/DELETECASTORFILE.sql
• 1193738430/PROCEDURE/DELETEOUTOFDATEDISKCOPIES.sql
• 1193738430/PROCEDURE/DELETEOUTOFDATEREQUESTS.sql
• 1193738430/PROCEDURE/DELETEREQUEST.sql
• 1193738430/PROCEDURE/DELETEREQUESTEFFICIENTLY.sql
• 1193738430/PROCEDURE/DELETEREQUESTS.sql
• 1193738430/PROCEDURE/DELETETAPECOPIES.sql
• 1193738430/PROCEDURE/DESCRIBEDISKPOOL.sql
• 1193738430/PROCEDURE/DESCRIBEDISKPOOLS.sql

PLSQL Check
    [lxplus206] /afs/cern.ch/user/r/reguero/project/castor/deployment/tools/reps/atlas/plsql > diff -r 1193389223 1193738430
    diff -r 1193389223/PROCEDURE/ARCHIVESUBREQ.sql 1193738430/PROCEDURE/ARCHIVESUBREQ.sql
    22c22,24
    < SELECT count(request) INTO nb FROM SubRequest WHERE castorFile = cfId AND status IN (9, 11);
    ---
    > SELECT /*+ INDEX(a I_SUBREQUEST_CASTORFILE) */
    > count(a.request) INTO nb
    > FROM SubRequest a WHERE a.castorFile = cfId AND a.status IN (9, 11);
    24c26
    < FOR sr IN (SELECT request INTO rid
    ---
    > FOR sr IN (SELECT /*+ INDEX(a I_SUBREQUEST_CASTORFILE) */ request INTO rid
    diff -r 1193389223/PROCEDURE/SELECTTAPECOPIESFORMIGRATION.sql 1193738430/PROCEDURE/SELECTTAPECOPIESFORMIGRATION.sql
    30,32c30,32
    < UPDATE TapeCopy
    < SET status = 2 -- WAITINSTREAMS
    < WHERE id MEMBER OF tcIds;
    ---
    > UPDATE /*+ INDEX(b I_TAPECOPY_ID) NO_INDEX_FFS(b I_TAPECOPY_ID) */ TapeCopy b
    > SET b.status = 2 -- WAITINSTREAMS
    > WHERE b.id MEMBER OF tcIds;
    37c37
    < SELECT /*+ INDEX(b I_TAPECOPY_ID) CARDINALITY(b 10) */ * FROM TapeCopy b
    ---
    > SELECT /*+ INDEX(b I_TAPECOPY_ID) NO_INDEX_FFS(b I_TAPECOPY_ID) */ * FROM TapeCopy b

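The check above relies on keeping timestamped dumps of the stored procedures and comparing two of them with `diff -r`. A rough Python equivalent of the "which procedures changed" step is sketched below using the standard `filecmp` module. The function name is hypothetical, and it assumes each snapshot is a flat directory with one `.sql` file per procedure, as in the listing.

```python
import filecmp
import os


def compare_snapshots(old_dir, new_dir):
    """Report which files differ between two snapshot directories.

    Files are compared by content (shallow=False), so a procedure
    rewritten with the same length is still detected as changed.
    """
    old = set(os.listdir(old_dir))
    new = set(os.listdir(new_dir))
    common = sorted(old & new)
    _same, changed, errors = filecmp.cmpfiles(
        old_dir, new_dir, common, shallow=False
    )
    return {
        "changed": sorted(changed + errors),
        "added": sorted(new - old),
        "removed": sorted(old - new),
    }
```

Calling it on the `PROCEDURE` subdirectories of two dumps would list, for the example above, ARCHIVESUBREQ.sql and SELECTTAPECOPIESFORMIGRATION.sql as changed; the line-level differences still need `diff` or `difflib`.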
Overview of Upgrade to 2.1.4
• Review of https://twiki.cern.ch/twiki/bin/view/FIOgroup/CastorUpgrade21324to214