CASTOR@CNAF October 29 2007
Storage classes @ CNAF • Implementation of 3 (2) Storage Classes needed for LHC • Disk0Tape1 (D0T1) CASTOR • Space managed by the system • Data migrated to tape and deleted from disk when the staging area is full • Disk1Tape0 (D1T0) GPFS/StoRM • Space managed by the VO • Disk1Tape1 (D1T1) CASTOR • Space managed by the VO (i.e. if the disk is full, the copy fails) • Large disk buffer with a tape back-end and no garbage collection
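A minimal sketch of how the three storage classes differ, using hypothetical Python structures (the class name and fields are illustrative, not a CASTOR or StoRM API):

    from dataclasses import dataclass

    @dataclass
    class StorageClass:
        name: str
        backend: str           # system implementing the class
        disk_copies: int       # "D" in DnTm
        tape_copies: int       # "T" in DnTm
        space_managed_by: str  # "system" or "VO"
        garbage_collected: bool

    STORAGE_CLASSES = [
        # D0T1: tape holds the permanent copy; disk is a system-managed staging area
        StorageClass("D0T1", "CASTOR", 0, 1, "system", True),
        # D1T0: disk-only; the VO manages its own space
        StorageClass("D1T0", "GPFS/StoRM", 1, 0, "VO", False),
        # D1T1: large VO-managed disk buffer with tape back-end, no garbage collection
        StorageClass("D1T1", "CASTOR", 1, 1, "VO", False),
    ]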
CASTOR • 1 production instance + 1 test instance • Shared name server • LEMON for monitoring • NAGIOS for proactive alarms • E-mail notification (SMS in progress) • Exclusion of problematic servers from service (via DNS alias) • Test instance • Used to test upgrades and SRM 2.2, SLS, new LEMON • Recently moved to the production instance • 1 stager, 2 disk servers (3 servers for SRM v2.2), only 1 used • Production instance • 4 core servers (stager, LSF, DLF, name server) • 3 DB servers (for name server, stager, DLF), MIGRATING TO CLUSTER • 1 server for ACSLS (library management) • 1 library: STK L5500 (1 PB on-line) • Tendering for a new, larger library • 15 tape servers (10 9940B drives, 6 LTO2 drives) • 8 servers for SRM v1.1 • 50 disk servers
CASTOR Setup (1) • Core services are on machines with SCSI disks, hardware RAID-1 and redundant power supplies • Tape servers and disk servers have lower-level hardware, like WNs • ACSLS 7.0 runs on a Sun Blade v100 (Solaris 9.0) with 2 internal IDE disks in software RAID-1 • 50 disk servers attached to a fully redundant SAN: FC 2 Gb/s or 4 Gb/s (latest) connections, dual-controller hardware and QLogic SANsurfer Path Failover software or vendor-specific software • Disk storage: STK FlexLine 600, IBM FAStT900, EMC CLARiiON, … • STK L5500 silo (5500 slots, partitioned with 2 form-factor slots: about 2000 for LTO2 and 3500 for 9940B, 200 GB cartridges, total capacity ~1.1 PB uncompressed) • 6 LTO2 + 10 9940B drives, 2 Gb/s FC interface, 20-30 MB/s rate (3 more 9940B drives to be acquired) • 15 tape servers on the SAN
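As a quick consistency check of the quoted capacity (plain arithmetic, no CASTOR specifics assumed):

    # Library capacity: 5500 slots, all holding 200 GB (uncompressed) cartridges
    lto2_slots, slots_9940b = 2000, 3500
    cartridge_gb = 200
    total_tb = (lto2_slots + slots_9940b) * cartridge_gb / 1000.0
    print(total_tb)   # 1100.0 TB, i.e. ~1.1 PB, matching the slide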
CASTOR Setup (2) • CASTOR core services v2.1.3-24 on 4 machines: • castor-6: rhserver, stager, rtcpclientd, MigHunter, cleaningDaemon, expert • castorlsf01: LSF master, rmmaster, rmMasterDaemon • dlf01: dlfserver, Cmonit, Oracle for DLF • castor-8: nsdaemon, vmgr, vdqm, msgd, cupvd • 2 more machines for the Name Server DB (Oracle 9.2) and the Stager DB (Oracle 10.2) • 3 SRM v1 endpoints, DNS balanced: • srm://castorsrm.cr.cnaf.infn.it:8443 (used for disk pools with tape back-end) • srm://sc.cr.cnaf.infn.it:8443 (used for the disk-only pool for ATLAS) • srm://srm-lhcb-durable.sc.cr.cnaf.infn.it (used as the disk-only pool for LHCb) • 1 SRM v2.2 endpoint: • srm://srm-v2.cr.cnaf.infn.it:8443
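To see what a DNS-balanced endpoint resolves to, one can list the addresses behind the alias; a minimal sketch using only the Python standard library (the hostname is taken from the slide, everything else is generic):

    import socket

    # Resolve a DNS-balanced SRM alias to its member hosts.
    # Successive lookups may return the addresses in a different order (round-robin).
    alias = "castorsrm.cr.cnaf.infn.it"
    name, aliases, addresses = socket.gethostbyname_ex(alias)
    print(name, aliases)
    for addr in addresses:
        print(addr)   # one line per server currently behind the alias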
CASTOR blues • Upgrade from v2.1.1-9 to v2.1.3-15 (July 9-11) • New LSF plugin: load on the Oracle DB lowered • Previously frequent rmmaster meltdowns with # PEND jobs > 1000 • Increased stability (new job limit 30000) • Good performance • Upgrade to v2.1.3-24 (September 19-20) • D1T0 pool management problematic (v2.1.4-x?) • Disks with no space left not properly managed (requests still queued) • Nearly impossible for the user to delete files from disk • Name Server and Stager inconsistencies when a user deletes a file only from the NS • Tape recall not yet efficient enough (v2.1.5?) • No recall policies (to keep aggressive users from taking all drives) • One disk server preferred by a user gets stressed (need to replicate SOME files on other disk servers without draining) • Non-negligible administrative effort for DB clean-up • Repack not yet supported (v2.1.4-x?) • Full SL(C)4.x 64-bit compatibility missing (castor-gridftp runs in compatibility mode; tape server problems with the STK SSI)
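The NS/Stager inconsistency is essentially a set difference between two catalogs; a rough sketch of the kind of check an administrator might run (both fetch functions are hypothetical placeholders, not CASTOR tools):

    def fetch_ns_fileids():
        # Hypothetical: query the Name Server DB for all known file IDs
        return set()

    def fetch_stager_fileids():
        # Hypothetical: query the Stager DB for file IDs it still references
        return set()

    ns_ids = fetch_ns_fileids()
    stager_ids = fetch_stager_fileids()

    # Files deleted from the NS only: the stager still holds an entry for them
    orphaned_in_stager = stager_ids - ns_ids
    for fid in sorted(orphaned_in_stager):
        print("orphaned stager entry:", fid)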
(Other) issues • Shortage of hardware resources at CNAF • Very simple setup: only 1 disk pool per service class per VO • We would need to: • Increase the # of disk servers (but we currently have a constraint on the total amount of resources) • Define different disk pools for local access vs. import/export? • Have separate stager instances for the larger VOs? • Need feedback from CERN and RAL • Migration issue • Management issue • CASTOR does not seem to cope well with a shortage of resources • Are the same problems hidden (i.e. not relevant) at CERN? • Our manpower dedicated to CASTOR is not enough (~2 FTE + DB management)
Issue about management • At the RAL f2f meeting (Jan 2006) the idea was proposed of seriously developing high-level administration commands for CASTOR 2 which could "do all that is necessary (i.e. modify the stager tables) in a coherent way" • After about 2 years, admins still have to log in to the database and run SQL lines and customized scripts, which usually leave the DB tables with dirty entries (so it is then necessary to run other scripts, reset entries, etc.) • What about these commands?
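What such a high-level command might look like, as a rough sketch only: a single verb that wraps, in one transaction, the SQL an admin runs by hand today. The table and column names below are hypothetical stand-ins, not the real stager schema:

    import sqlite3   # stand-in for the Oracle client a real tool would use

    def remove_dirty_request(db, request_id):
        """Hypothetical 'one verb' admin command: delete a stuck request
        and every row that references it, in one transaction, so the
        stager tables stay coherent (no dirty entries left behind)."""
        with db:   # commit on success, roll back on any error
            db.execute("DELETE FROM subrequest WHERE request_id = ?", (request_id,))
            db.execute("DELETE FROM request WHERE id = ?", (request_id,))

    # Usage: a hypothetical CLI 'castor-admin remove-request <id>' around this
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE request (id INTEGER)")
    db.execute("CREATE TABLE subrequest (request_id INTEGER)")
    remove_dirty_request(db, 42)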
Issue about repack • In CASTOR 1, repack was a very simple (800 lines of C code) but essential utility: sequential stage-in of the ACTIVE segments, sequential tpwrite on new media, update of the entries in the CASTOR catalog (the loop is sketched below) • In CASTOR 2 the development of the repack utility seems endless. What is the problem? The optimization of resources is fundamental at a Tier1 site: we are wasting a lot of tape space
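A rough sketch of that CASTOR 1 repack loop as the slide describes it; the three helper functions are hypothetical placeholders for the stage-in, tpwrite and catalog-update steps, not real CASTOR calls:

    def stage_in(segment):
        # Hypothetical: recall one ACTIVE segment from the old tape to disk
        return "/staging/%s" % segment

    def tpwrite(path, new_tape):
        # Hypothetical: write the staged file sequentially onto the new medium
        return {"tape": new_tape, "file": path}

    def update_catalog(segment, location):
        # Hypothetical: point the CASTOR catalog entry at the new tape copy
        print("catalog:", segment, "->", location)

    def repack(active_segments, new_tape):
        """The CASTOR 1 scheme from the slide: stage in each ACTIVE segment,
        rewrite it sequentially on new media, update the catalog entry."""
        for segment in active_segments:
            path = stage_in(segment)
            location = tpwrite(path, new_tape)
            update_catalog(segment, location)

    repack(["seg001", "seg002"], "T00042")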
Issue about migration • In our model a single disk server has 10-20 TB of space attached. This is the standard with new hardware (hard disks have a 750-1000 GB capacity) • rfiod and gridftp can access the disk server's disk back-end (and memory) concurrently if a minimum # of slots is permitted • The migration stream from an "over-loaded(?)" disk server suffers: the disk=>tape stream bandwidth falls • New drives will have a nominal rate of 100 MB/s • If the rfiod disk=>tape child has no mechanism to win the concurrent access to disk server resources, the tape nominal rate will be reached only on an empty disk server (see the arithmetic below)
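A back-of-the-envelope illustration of that contention (the disk back-end bandwidth is an assumed example figure, not a measurement from the slide):

    # Assume a disk server back-end sustaining ~160 MB/s in total (illustrative),
    # shared fairly among all active streams (rfiod, gridftp, migration).
    disk_bandwidth_mb_s = 160.0
    drive_nominal_mb_s = 100.0   # nominal rate of the new tape drives

    for concurrent_streams in (1, 2, 4, 8):
        per_stream = disk_bandwidth_mb_s / concurrent_streams
        tape_rate = min(per_stream, drive_nominal_mb_s)
        print(concurrent_streams, "streams ->", tape_rate, "MB/s to tape")
    # With two or more streams competing, the migration stream alone can no
    # longer feed the drive at its 100 MB/s nominal rate.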
Issue about operation • We report 2 problems (just examples): • The rfiod disk=>tape child (rtcpd) sometimes hangs • The SL 64-bit ACSLS toolkit has some incompatibility (NI_FAILURE) • Solution implemented at CERN: • Systematically kill the hung rtcpd (?) • Systematically reinstall the tape server (?) • These are workarounds, not real solutions. If we want to run CASTOR 2 in production at a Tier1 we need stability, not randomness. This is a crucial point
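The "systematically kill" workaround amounts to a watchdog. A minimal sketch of one, using generic process handling from the Python standard library; the process name is from the slide, while the age threshold (a crude proxy for "hung") is an assumption:

    import os, signal, subprocess, time

    PROCESS_NAME = "rtcpd"    # daemon named on the slide
    MAX_AGE_SECONDS = 3600    # assumed: older than this = presumed hung

    def pids_of(name):
        # Ask pgrep for the PIDs of processes with exactly this name
        out = subprocess.run(["pgrep", "-x", name],
                             capture_output=True, text=True)
        return [int(p) for p in out.stdout.split()]

    def process_age(pid):
        # Approximate age from the /proc/<pid> mtime (Linux-specific)
        return time.time() - os.stat("/proc/%d" % pid).st_mtime

    for pid in pids_of(PROCESS_NAME):
        if process_age(pid) > MAX_AGE_SECONDS:
            os.kill(pid, signal.SIGKILL)   # the blunt CERN-style workaround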
Issue about monitoring • The monitoring daemon Cmonitd in CASTOR 1 provided real-time info about the disk <=> tape streams • It can be used in CASTOR 2, but some info (e.g. on disk servers) is wrong (needs a fix?) • CERN won't support it in the future, adding the info to the LEMON system and DLF instead. The same answer was given at the RAL f2f 2 years ago, and still no Cmonitd info is available in these two tools • This is a very bad idea. Cmonitd can be very useful in some debug cases (e.g. the rfiod hangs of the previous slide) and it is the only tool that can report in real time what a tape drive is doing (otherwise you need to run rtcpd in debug mode, collect the logs, and extract the throughput yourself)
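"Extract the throughput yourself" would look roughly like this; the log line format below is invented for illustration, since the real rtcpd debug format is not shown on the slide:

    import re

    # Invented example format: "<epoch-seconds> WRITE <bytes>" per transfer chunk
    LINE = re.compile(r"^(\d+) WRITE (\d+)$")

    def throughput_mb_s(log_lines):
        """Sum the bytes written and divide by the time span of the log."""
        times, total_bytes = [], 0
        for line in log_lines:
            m = LINE.match(line.strip())
            if m:
                times.append(int(m.group(1)))
                total_bytes += int(m.group(2))
        if len(times) < 2:
            return 0.0
        return total_bytes / (times[-1] - times[0]) / 1e6

    sample = ["1000 WRITE 50000000", "1010 WRITE 50000000", "1020 WRITE 50000000"]
    print(throughput_mb_s(sample), "MB/s")   # 150 MB over 20 s -> 7.5 MB/s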
[Cmonitd screenshot: stage pool status and real-time disk-to-tape stream performance. Supported in the future? Implemented in CASTOR 2?]
Issue about monitoring • LEMON is a monitoring system. It can extract the overall network performance of a tape server at a fixed granularity, but it cannot tell anything about the real-time data flows • DLF is the logging facility. It is very useful for collecting logs and tracing stager activity, but it would need to be substantially modified to provide information about rtcpd and the connected rfiod streams with real-time and statistical throughput • Info about the overall and instantaneous throughput of each migration/recall, with disk file (diskserver:/filesystem/filename), tape drive, tape, segment on tape, CASTOR filename, etc., is necessary for tracing operational problems (a possible record layout is sketched below)
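A sketch of the per-stream record the slide asks for, just to make the wish list concrete (the field names are our own, not a LEMON or DLF schema):

    from dataclasses import dataclass

    @dataclass
    class StreamRecord:
        """One disk<=>tape stream, with the fields the slide lists as needed."""
        direction: str             # "migration" or "recall"
        disk_file: str             # diskserver:/filesystem/filename
        tape_drive: str
        tape: str
        segment: int               # segment number on the tape
        castor_filename: str
        instantaneous_mb_s: float  # current throughput
        overall_mb_s: float        # average over the whole stream

    r = StreamRecord("migration", "diskserv-01:/data1/f123", "drive05",
                     "T00042", 7, "/castor/cnaf.infn.it/user/f123", 24.0, 19.5)
    print(r)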
Issue about support • Support is one of the hot topics in the castor@tier1 implementation. The feedback is very good, but the point is: • If the software is stable, well documented and "rock solid", support is needed only in extreme cases or to report specific bugs or limits due to the specific Tier1 environment • If the software lacks one of the above features and continuously needs to be looked after, support from developers or expert operators is also continuously needed. We can skill ourselves up, but we will never be able to match the operational and development experience at CERN, nor the # of FTEs • So we confirm that we need stability, stability and stability
SRM 2.2 status • Some problems still not solved • Some related to instability of the test stager • Moved SRM 2.2 to the production instance last week • All basic tests work; great help from Shaun • Still problems with the information system, investigating • To be put in production during November • After CERN
Needed features • Repack • Recall Policies • Monitoring • …