
SRM CCRC-08 and Beyond


Presentation Transcript


  1. SRM CCRC-08 and Beyond Shaun de Witt CASTOR Face-to-Face

  2. Introduction • Problems in 1.3-X • And what we are doing about them • Positives • Setups • Recommendations • Future Developments • Release Procedures

  3. Problems - Database • Deadlocks • Observed at CERN and ASGC (CNAF too?) • Not at RAL – not sure why • Two types (loosely) • Daemon/daemon deadlocks • Server/daemon deadlocks • Startup problems • Too many connections • ORA-00600 errors

  4. Daemon/Daemon deadlocks • Found ‘accidentally’ at CERN • Caused by multiple back-ends talking to the same database • Leads to database deadlocks in GC • In 2.7 the GC has moved into the database as a procedure • Could be ported to 1.3, but not planned

  5. Server/Daemon deadlocks • Caused by using the CASTOR fillObj() API • When filling subrequests, multiple calls can lead to two threads blocking one another • Daemon and server both need to check status and possibly modify subrequest info • Solution proposed is to take a lock on the request (sketched below) • This would stop deadlocks • But could lead to lengthy locks
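
  The proposed fix is easiest to see in a sketch. This is illustrative only: the real code is C++ against the CASTOR database API, and the table and column names below are invented, not the actual CASTOR schema. The point is simply that server and daemon both take the lock on the parent request row first, so their subrequest updates serialise instead of deadlocking.

      # Illustrative only: invented table/column names, not the CASTOR schema.
      import cx_Oracle

      def update_subrequest(conn, request_id, subreq_id, new_status):
          cur = conn.cursor()
          # Both server and daemon lock the parent request row first...
          cur.execute("SELECT id FROM srm_request WHERE id = :rid FOR UPDATE",
                      rid=request_id)
          # ...so the check-and-modify on subrequests can no longer interleave
          # between the two actors and deadlock. The cost: the request row
          # stays locked until commit - the 'lengthy locks' risk noted above.
          cur.execute("UPDATE srm_subrequest SET status = :st "
                      "WHERE id = :sid AND request = :rid",
                      st=new_status, sid=subreq_id, rid=request_id)
          conn.commit()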

  6. Problems - Database • Start up problems • Seen often at CNAF, infrequently at RAL • TNS – ‘no listener’ error • Need to check logs at startup • No solution at the moment • Restarting cures the problem • Could add monitoring to watch for this error, or a start-up check like the sketch below
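
  No such start-up check exists today; a minimal sketch of one, assuming cx_Oracle and treating ORA-12541 (the ‘TNS: no listener’ error) as the only retryable case:

      # Sketch: wait for the Oracle listener before starting the SRM daemon.
      import time
      import cx_Oracle

      def wait_for_listener(user, password, dsn, retries=5, delay=30):
          for _ in range(retries):
              try:
                  cx_Oracle.connect(user, password, dsn).close()
                  return True                      # listener is up
              except cx_Oracle.DatabaseError as exc:
                  error, = exc.args
                  if error.code != 12541:          # ORA-12541: no listener
                      raise                        # some other problem
                  time.sleep(delay)                # not up yet; retry
          return False                             # give up; alert an operator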

  7. Problems - Database • Too many connections • Seen at CERN • Partly down to configuration • Many SRMs talking to the same database instance. • Two solutions • More database hardware • Fewer SRMs on same instance • But expensive • Reduce Threads on server and daemon • May cause TCP timeout errors under load (server) or cause put/get requests to be processed too slowly (daemon) • More on configuration later

  8. Problems - Database • ORA-00600 (internal error) problems • Seen at RAL and CERN • Oracle internal error • Will render the SRM useless • Fix available from Oracle • RAL has not seen it since applying the fix • Gordon Brown at RAL can provide details

  9. Problems - Network • Intermittent CGSI errors • Terminal CGSI errors • SRM ‘lock-ups’

  10. Problems - Network • Intermittent CGSI-gSOAP errors • CGSI-gSOAP errors reported in logs and to the client • Seen 2-10 times per hour (at RAL) • Correlation in time between front ends • Both will get an error at about the same time • Cause is unclear • No solution at the moment • Seems to affect < 0.1% of requests at RAL

  11. Problems - Network • Terminal CGSI-gSOAP errors • All threads end up returning CGSI-gSOAP errors • Can affect only one of the front ends • Cause unknown • Does not seem correlated with load or request type • No solution at the moment • ASGC site report indicated it may be correlated with database deadlocks(?) • Need monitoring to detect this in the log file (a sketch follows) • Restart of the affected front end normally clears the problem • New version of the CGSI plug-in available, but not yet tested
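
  That monitoring does not exist yet; a minimal sketch of it, assuming a log location and error signature (both would need matching to the real deployment):

      # Sketch: flag a front end whose log shows a burst of CGSI-gSOAP errors.
      import time

      LOGFILE = "/var/log/srm/srm2.log"        # assumed location
      PATTERN = "CGSI-gSOAP"                   # assumed error signature

      def watch(threshold=10, window=60):
          """Warn if more than `threshold` matching lines arrive in `window` s."""
          hits = []
          with open(LOGFILE) as log:
              log.seek(0, 2)                   # start tailing from the end
              while True:
                  line = log.readline()
                  if not line:
                      time.sleep(1)
                      continue
                  if PATTERN in line:
                      now = time.time()
                      hits = [t for t in hits if now - t < window] + [now]
                      if len(hits) > threshold:
                          print("front end looks terminal - restart advised")
                          hits.clear()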

  12. Problems - Network • SRM becomes unresponsive • Debugging indicates all threads stuck in recv() • Cause unknown • May have been cause of ATLAS ‘blackouts’ during first CCRC • New releases include recv() and send() timeouts • Should stop this • Two new configurable parameters in srm2.conf
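
  A sketch of what the new srm2.conf entries might look like; the parameter names appear on slide 25, the key/value syntax is assumed, and 60 s follows the ‘guesstimates’ given there:

      SOAPRECVTIMEOUT 60   # abandon a recv() blocked for 60 s
      SOAPSENDTIMEOUT 60   # likewise for send()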

  13. Problems - Other • Interactions with CASTOR • Behaviour when CASTOR is slow • Needless RFIO calls loading job slots • Bulk Removal requests • Use of MSG field in DLF

  14. Problems - Other • Behaviour when CASTOR becomes slow • See error “Too many threads busy with CASTOR” • Can block new requests coming in • But useful diagnostic of CASTOR problems • Solution is to decrease STAGERTIMEOUT in srm2.conf (example below) • Default 900 secs too long • Most clients give up after 180 secs • No ‘hard and fast’ rule about what it should be • Somewhere between 60 and 180 is the best guess • Pin time • Implementation ‘miscommunication’ – too heavy a weight applied • Fixed in 1.3-27 • Also reduce pin lifetime in srm2.conf
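
  For example (syntax assumed; 120 s sits inside the suggested 60-180 s range, and the PINTIME value is purely illustrative since the slides only say to keep it low):

      STAGERTIMEOUT 120    # well under the 900 s default; most clients give up at ~180 s
      PINTIME 600          # illustrative only - keep low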

  15. Problems - Other • Needless RFIO calls • Identified by CERN • Take up job slots on CASTOR • Timeout after 60 seconds • On all GETs without a space token • Introduced when support for multiple default spaces was introduced • Fix already in CVS • For release 2.7 • Duplicates the code path used when a space token is provided • Could be backported to 1.3

  16. Problems - Other • Bulk removal requests • Sometimes produce CGSI-gSOAP errors for large numbers of files (>50) • But deletion does work – problem on send()? • May be load related • On one day 4/6 tests with 100 files produced this error • The next day 0/6 tests with 1000 files produced this error • Some discussion about removing stager_rm and just doing nsrm • May help speed up processing • But would leave more work for the CASTOR cleaning daemon

  17. Problems - Other • Lots of MSG fields left blank • Problem for monitoring • Addressed in 2.7 • Will not be backported • Occasional crashes • Traced to use of strtok (not the re-entrant strtok_r) • Fixed in 1.3-27

  18. Positives • Request rate • At RAL on one CMS front end with 50 threads: • 21K requests/hr (roughly 6 requests/sec) • Distribution of type of request not known • Processing speed • Again using CMS at RAL • Daemon running 10/5 threads • Put requests in 1-5 seconds • Same for GET requests w/o tape recall

  19. Positives • Front end quite stable • At RAL few interventions required

  20. Setups • Different sites have different hardware setups • Hope you can fill the gaps…!

  21. RAL Setup • [diagram] Four front ends – SRM-CMS, SRM-ATLAS, SRM-LHCb, SRM-ALICE – sharing a 3-node RAC

  22. CERN Setup • [diagram] srm-cms, srm-alice, srm-dteam and srm-ops on shared-db (a single machine); srm-atlas on atlas-db; srm-lhcb on lhcb-db

  23. CNAF Setup • [diagram] srm-cms on cms-db; srm-shared on shared-db (a single machine)

  24. ASGC Setup • [diagram] A single srm front end; srm-db, castor-db and dlf-db on a 3-node RAC

  25. Useful Configuration Parameters • Based on your setup, you will need to tune some or all of the following parameters (a worked example follows this list): • SERVERTHREADS • CASTORTHREADS • REQTHREADS • POLLTHREADS • COPYTHREADS • The more SRM instances on a single database instance, the fewer threads should be assigned to each SRM • Need to balance request and processing rates on daemon and server • SOAPBACKLOG • SOAPRECVTIMEOUT • SOAPSENDTIMEOUT • Number of queued SOAP requests, and timeouts applied to recv() and send() • Best ‘guesstimates’ for these are 100, 60, 60 • TIMEOUT • Stager timeout in castor.conf • Best ‘guesstimate’ 60-180 seconds • PINTIME • Keep low
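
  Pulling this together, a sketch of the tunable section of srm2.conf. The key/value syntax is assumed; the thread counts echo the RAL numbers from slide 18 rather than being recommendations; only the 100/60/60 values and the 60-180 s range come from this slide:

      # Thread pools - scale these down when several SRMs share one DB instance
      SERVERTHREADS 50
      CASTORTHREADS 10
      REQTHREADS    10
      POLLTHREADS   5
      COPYTHREADS   5

      # SOAP backlog and network timeouts - the 'best guesstimate' values
      SOAPBACKLOG     100
      SOAPRECVTIMEOUT 60
      SOAPSENDTIMEOUT 60

      # Stager timeout (set in castor.conf) and pin lifetime - keep both low
      TIMEOUT 120
      PINTIME 600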

  26. Future Developments • Move to SL4 • Move to castor clients 2.1.7 • New MoU

  27. Move to SLC4 • URGENT • No support for SLC3 • Support effort for SL3 dwindling • Have built and tested one version • In 1.3 series • All new developments (2.7-X) on SL4 • No new development in 1.3 series

  28. Move to 2.1.7 clients • URGENT • Addresses security vulnerability with regards to proxy certificates • Much better error messaging • Fewer ‘unknown error’ messages • 2.1.3 clients no longer supported or developed • Since this requires a schema change, releases in this series will be 2.7-X

  29. New MoU • Major new features: • srmPurgeFromSpace • Used to remove disk copies from a space • Initial implementation will only remove files currently also on tape • VOMS based security • This will be implemented in CASTOR but may need changes to SRM/CASTOR interface.

  30. Future Development Summary • New features will be put into 2.7-X or later releases. • 2.7-X releases only on SLC4 • Is port of 1.3-X to SLC4 required? • Esp. given security hole in 1.3 • Will require 2.1.7 clients installed on SRM nodes • Timescale? • End June. Tall order!

  31. Release Procedures • Following problems just after CCRC • SRM seemed to pass all tests • But daemon failed immediately in production (CERN and RAL) • Brought about by a ‘simple’ change which only affected recalls when no space token was passed • Clear need for additional tests before release • Public s2 not enough

  32. Pre-Release Procedures • (Re)developing a shell test tool which will be delivered with the SRM • To include basic tests of all SRM functions • Will include testing of tape recalls if possible (i.e. not if only using a Disk1Tape0 system) • New tests added when we find missing cases • Will require the tester to have a certificate (i.e. cannot be run as root) • Looking at running the FULL s2 test suite • This includes tests of a number of invalid requests • Not normally run since VERY time consuming

  33. Pre-Release Procedures • As now, s2 tests will be run over 1 week to try and ensure stability • Problem still is stress testing • No dedicated stress tests exist • But this is most likely to catch database problems • Could develop simple ones (a sketch follows) • But would they be realistic enough?
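
  A simple one could look like the sketch below. Everything here is hypothetical: srmPut stands in for whichever client command the tool would wrap, and the endpoint and paths are invented. The only design point is many concurrent clients hammering one front end, since that is what exposes the database problems.

      # Hypothetical stress test: 50 concurrent clients, 1000 put requests.
      import subprocess
      from concurrent.futures import ThreadPoolExecutor

      ENDPOINT = "srm://srm.example.org:8443"      # invented endpoint

      def one_put(i):
          # 'srmPut' is a placeholder for the real client command.
          cmd = ["srmPut", f"{ENDPOINT}/dteam/stress/file{i}"]
          return subprocess.run(cmd, capture_output=True).returncode

      with ThreadPoolExecutor(max_workers=50) as pool:
          failures = sum(rc != 0 for rc in pool.map(one_put, range(1000)))
      print(f"{failures} failed requests out of 1000")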
