
Tape Operations Update


  1. Tape Operations Update
     Vladimír Bahyl, IT FIO-TSI, CERN

  2. Agenda
     • Progress on issues (since the last meeting)
     • Current equipment and challenges
     • Development changes
     • Operational changes
     • Conclusion

  3. Progress on issues
     • NI_FAILURE
        • Problem still present
        • A simple procedure exists, so there is no need to reinstall
     • tplabel command
        • By default, existing labels are not overwritten
        • -f option introduced to force relabelling
     • Cmonitd
        • No longer used at CERN

  4. Equipment today
     • 25 PB total (around 50% free)
     • IBM
        • 2 libraries
        • ~12 000 slots; 700 GB each
        • 60 TS1120 drives
     • Sun
        • 4 libraries
        • ~36 000 slots; 500 GB each
        • 60 T10000A drives
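As a rough cross-check of the figures on this slide, the nominal capacity can be recomputed from the slot counts and cartridge sizes. This is only a sketch: it assumes every slot holds a full-size cartridge and uses decimal units (1 PB = 1 000 000 GB), neither of which the slide states.

```python
# Nominal capacity per vendor, assuming every slot is populated with a
# full-size cartridge (an assumption; the slide gives only slot counts).
GB_PER_PB = 1_000_000  # decimal units, also an assumption

libraries = {
    "IBM": {"slots": 12_000, "cartridge_gb": 700},
    "Sun": {"slots": 36_000, "cartridge_gb": 500},
}


def nominal_capacity_pb(slots, cartridge_gb):
    """Return the capacity in PB if all slots held full cartridges."""
    return slots * cartridge_gb / GB_PER_PB


per_vendor = {name: nominal_capacity_pb(**spec) for name, spec in libraries.items()}
total_pb = sum(per_vendor.values())

for name, capacity in per_vendor.items():
    print(f"{name}: {capacity:.1f} PB")
print(f"Total: {total_pb:.1f} PB")
```

Under these assumptions the result is 8.4 PB for IBM and 18.0 PB for Sun, 26.4 PB in total, which is in line with the quoted "25 PB total" given that the slot counts are approximate.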

  5. Equipment near future
     • Tape space sufficient for 2008
        • Unbalanced
     • New drives
        • IBM TS1130: ~160 MB/s, 1 TB cartridges
        • Sun T10000B: ~130 MB/s, 1 TB cartridges
     • IBM high-density frame

  6. Challenges
     • ATLAS write rate is low, partially because of additional mounts caused by a CASTOR policy bug
     • ALICE rate is affected by small files from users writing to the default pool

  7. Development 1/3
     • Patch-free kernel version (2.1.6-8)
        • Goal: by SLC5, do not use any CASTOR-specific kernel patches
        • All necessary settings moved to the CASTOR tape layer
        • New SCSI tape driver options introduced:
            TAPE ST_ASYNC_WRITES 0
            TAPE ST_BUFFER_WRITES 0
            TAPE ST_LONG_TIMEOUT 3600
            TAPE ST_READ_AHEAD 0
            TAPE ST_TIMEOUT 900
        • Testing already under way on a few SLC4 machines
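The options on this slide follow a "CATEGORY NAME VALUE" layout. A minimal sketch of reading such lines into a lookup table is shown below; the function name `parse_castor_conf` is hypothetical, and treating `#` as a comment marker is an assumption rather than something the slide states.

```python
def parse_castor_conf(text):
    """Parse 'CATEGORY NAME VALUE' lines into {(category, name): value}.

    Hypothetical helper for illustration; '#' comments are an assumption.
    """
    options = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split(None, 2)  # the value may itself contain spaces
        if len(parts) < 3:
            continue  # skip lines that do not carry a value
        category, name, value = parts
        options[(category, name)] = value
    return options


sample = """
TAPE ST_ASYNC_WRITES 0
TAPE ST_BUFFER_WRITES 0
TAPE ST_LONG_TIMEOUT 3600
TAPE ST_READ_AHEAD 0
TAPE ST_TIMEOUT 900
"""

options = parse_castor_conf(sample)
print(options[("TAPE", "ST_TIMEOUT")])
```

Splitting at most twice keeps multi-word values, such as `retry 3 300`, intact as a single string.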

  8. Development 2/3
     • Library failure handling (2.1.7-3)
        • Now possible to overcome short temporary failures of Sun libraries
        • Options introduced:
            TAPE ACS_MOUNT_LIBRARY_FAILURE_HANDLING retry 3 300
            TAPE ACS_UNMOUNT_LIBRARY_FAILURE_HANDLING retry 3 300
     • Use of non-labeled tapes (2.1.7-3)
        • By default, we use AUL (American National Standard label and American National Standard user label) tape labels
        • NL tapes are now also supported
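The behaviour behind the `retry 3 300` setting can be sketched as a retry loop. The slide does not define the two numbers; the code below assumes they mean "up to 3 retries, 300 seconds apart". Both function names are hypothetical, and `OSError` stands in for whatever error a library failure would actually raise.

```python
import time


def parse_failure_handling(value):
    """Split a 'retry <count> <seconds>' value (meaning assumed, see above)."""
    action, count, interval = value.split()
    if action != "retry":
        raise ValueError(f"unsupported action: {action}")
    return int(count), int(interval)


def call_with_retries(operation, retries, interval_s, sleep=time.sleep):
    """Run operation(); on OSError wait interval_s and retry up to `retries` times."""
    attempt = 0
    while True:
        try:
            return operation()
        except OSError:
            if attempt >= retries:
                raise  # retries exhausted: let the failure propagate
            attempt += 1
            sleep(interval_s)
```

Passing `sleep` as a parameter keeps the sketch testable without waiting; a caller would normally leave it at the default.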

  9. Development 3/3
     • Option to log to SysLog (2.1.7-4)
        • See the talk of Giuseppe Lo Re
        • Can log to DLF since the last meeting
        • SysLog now also supported
        • Uses local0 and local1 facilities
        • Options needed:
            TAPE TPLOGGER SYSLOG
            local0.info;local1.info @castortapelog
            local0.*;local1.* /var/log/castor-tape.log
        • Log example:
            Jun 6 15:52:23 tpsrv623 rtcpd[16828]: "TYPE"="RT044 – Request statistics", "FUNC"="rtcpd_FreeResources", "MESSAGE"="Request statistics", "REQUESTTYPE"="READ", "VID"="T07106", "MOUNTTIME"="163", "SERVICETIME"="209", "WAITTIME"="164", "TRANSFERTIME"="7", "POSITIONTIME"="36", "DATAVOLUMEMB"="115.570068", "DATARATEMBS"="16.510010", "FILES"="1", "DGN"="T10KR1", "VOLREQID"="77219", "CLIENTNAME"="stage", "CLIENTUID"="14029", "CLIENTGID"="1474", "CLIENTHOST"="c2publicsrv102.cern.ch", "TPVID"="T07106", "REQUESTSTATE"="successful"
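The log example consists of a syslog prefix followed by quoted "KEY"="VALUE" pairs, which makes it straightforward to parse. The sketch below is one possible way to do that; `parse_tape_log_line` is a hypothetical name, the regular expressions are derived only from the single sample line, and values are assumed not to contain embedded double quotes.

```python
import re

# Syslog prefix: timestamp, host, daemon[pid]:, then the message body.
PREFIX_RE = re.compile(
    r'^(?P<timestamp>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+'
    r'(?P<daemon>\w+)\[(?P<pid>\d+)\]:\s*(?P<body>.*)$'
)
# One "KEY"="VALUE" pair; assumes values contain no double quotes.
PAIR_RE = re.compile(r'"([^"]+)"="([^"]*)"')


def parse_tape_log_line(line):
    """Return the prefix fields plus a dict of the quoted key/value pairs."""
    match = PREFIX_RE.match(line)
    if match is None:
        raise ValueError("line does not look like a tape syslog entry")
    record = match.groupdict()
    body = record.pop("body")
    record["fields"] = dict(PAIR_RE.findall(body))
    return record


# Shortened excerpt of the sample line from the slide.
sample = (
    'Jun 6 15:52:23 tpsrv623 rtcpd[16828]: "REQUESTTYPE"="READ", '
    '"VID"="T07106", "TRANSFERTIME"="7", "DATAVOLUMEMB"="115.570068", '
    '"DATARATEMBS"="16.510010", "REQUESTSTATE"="successful"'
)

record = parse_tape_log_line(sample)
fields = record["fields"]
rate = float(fields["DATAVOLUMEMB"]) / float(fields["TRANSFERTIME"])
print(record["host"], fields["VID"], f"{rate:.2f} MB/s")
```

In the sample, DATAVOLUMEMB divided by TRANSFERTIME reproduces DATARATEMBS, which suggests the reported rate covers the transfer phase only and excludes mount and positioning time.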

  10. Operational changes 1/2
      • RTCPD self monitor enabled
         • RTCP daemon sometimes gets stuck
         • Self monitor terminates the job and does proper cleanup
             RTCOPYD SELF_MONITOR YES
             RTCOPYD MOUNT_TIME 900
      • SNMP traps handling
         • IBM libraries send SNMP traps directly
             Volser CLN168JA, A Enterprise Tape cleaning cartridge has expired.
         • ACSLS sends traps on behalf of Sun libraries
             ACSLS info Lsm 0,7 number of drives changed from 6 to 7. Lsm will be updated.
         • LEMON creates alarms

  11. Operational changes 2/2
      • TSMOD (Tape Service Manager on Duty)
         • Receives daily report
             TD01E | Drive Down Without Reason | DN 3592B2 35922005@tpsrv135 DOWN 20530 (No_dedication) None
             TD03E | Job running for too long | DA 994BR0 994B0618@tpsrv635 RUNNING 27769 (No_dedication) P17080 P17080 R 30726 (stage,st)@lxmrrk2707.cern.ch
             TQ01E | DGN Queue Wait Time Long | Average queue wait time in T10KR1 is 14729 seconds
             TQ02E | Queue Request Too Old | Q T10KR1 T13388 R 143229 (stage,st)@c2cmssrv102.cern.ch 37990
         • Follows procedures according to the error code
         • Handles most other common issues
            • E.g. contacting vendors for problems
         • Weekly rotation
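Each line of the daily report has the shape "CODE | description | details", so it can be split and grouped by error code before a procedure is looked up. The sketch below assumes exactly that three-field layout; `ReportEntry`, `parse_report` and `group_by_code` are hypothetical names, and the report format is inferred only from the four sample lines.

```python
from collections import defaultdict, namedtuple

ReportEntry = namedtuple("ReportEntry", "code description details")


def parse_report(text):
    """Split 'CODE | description | details' lines into ReportEntry tuples."""
    entries = []
    for line in text.splitlines():
        parts = [part.strip() for part in line.split("|", 2)]
        if len(parts) != 3 or not parts[0]:
            continue  # skip blank or malformed lines
        entries.append(ReportEntry(*parts))
    return entries


def group_by_code(entries):
    """Group entries by error code so each code maps to its occurrences."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry.code].append(entry)
    return dict(grouped)


sample = """
TD01E | Drive Down Without Reason | DN 3592B2 35922005@tpsrv135 DOWN 20530 (No_dedication) None
TQ01E | DGN Queue Wait Time Long | Average queue wait time in T10KR1 is 14729 seconds
TQ02E | Queue Request Too Old | Q T10KR1 T13388 R 143229 (stage,st)@c2cmssrv102.cern.ch 37990
"""

grouped = group_by_code(parse_report(sample))
for code, items in sorted(grouped.items()):
    print(code, len(items), items[0].description)
```

Keying on the code rather than the description mirrors the slide's point that the procedure to follow is chosen by error code.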

  12. Conclusion
      • Tape capacity sufficient for 2008
      • New tape-related CASTOR features are constantly being put into production
      • We are trying to simplify our setup and automate problem handling
