Improving ARC backends: Condor and SGE/GE LRMS interface
Adrian Taga, University of Oslo
LRMS backends in ARC
• Supported: PBS and variants, LSF, Condor, SGE, LL
• Perl/sh wrappers around the command-line interface of the LRMS
• Advantage: easy to modify
• No clear structure
Scripts for:
• Populating the InfoSystem
• Job control (submit, get status, kill)
Queues in Condor
Why have queues?
• Inhomogeneous clusters
Problem: Condor has no notion of queues.
Solution: use the ClassAd mechanism to partition the cluster.
Example: queues defined by memory size
[queue/large]
requirements="(Opsys == "linux" && Arch == "intel")"
requirements=" && (Disk > 30000000 && Memory > 2000)"
[queue/small]
requirements="(Opsys == "linux" && Arch == "intel")"
requirements=" && (Disk > 30000000 && Memory <= 2000 && Memory > 1000)"
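To make the partitioning concrete, here is a minimal sketch that mirrors the two example expressions in plain Python. It is not ARC or Condor code: the names QUEUES, is_linux_intel and queue_for are hypothetical, the machine is modelled as a simple dict of ClassAd-style attributes, and the units are assumed to follow Condor machine ClassAds (Memory in MB, Disk in KB).

```python
from typing import Callable, Dict, Optional

Machine = Dict[str, object]

# Hypothetical table: one predicate per queue, copied from the example
# thresholds above. Assumed units: Memory in MB, Disk in KB.
QUEUES: Dict[str, Callable[[Machine], bool]] = {
    "large": lambda m: m["Disk"] > 30_000_000 and m["Memory"] > 2000,
    "small": lambda m: m["Disk"] > 30_000_000 and 1000 < m["Memory"] <= 2000,
}


def is_linux_intel(machine: Machine) -> bool:
    """Common part of both expressions; compared case-insensitively."""
    return (
        str(machine["OpSys"]).lower() == "linux"
        and str(machine["Arch"]).lower() == "intel"
    )


def queue_for(machine: Machine) -> Optional[str]:
    """Return the name of the first queue whose predicate matches, else None."""
    if not is_linux_intel(machine):
        return None
    for name, matches in QUEUES.items():
        if matches(machine):
            return name
    return None
```

Because the two memory ranges do not overlap, a machine falls into at most one queue; a machine with 1000 MB or less, or with too little disk, belongs to neither.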
Error reporting
Many LRMS backends
• Inconsistent reporting of LRMS errors
• Clear diagnosis is needed for ATLAS prodsys
Scenario: job exceeded resource limits
• PBS & Co: reported natively
  • WallTime, CpuTime, or Memory limit exceeded
  • Exit code < 256: returned by the job itself
  • Exit code ≥ 256: a resource limit was hit
• Condor, SGE: no clear diagnosis provided; workarounds needed
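The PBS exit-code convention above can be sketched as a small classifier. This is an illustration only: the 256 threshold comes from the slide, while the function name classify_exit_code and the returned labels are hypothetical and are not part of ARC or PBS.

```python
# Threshold taken from the slide: codes below it come from the job itself.
LRMS_LIMIT_THRESHOLD = 256


def classify_exit_code(exit_code: int) -> str:
    """Label a PBS-style exit code as coming from the job or from the LRMS.

    Returns "job" for codes below 256 and "resource_limit" otherwise.
    """
    if exit_code < 0:
        raise ValueError("exit code must be non-negative")
    if exit_code < LRMS_LIMIT_THRESHOLD:
        return "job"
    return "resource_limit"
```

For Condor and SGE no such single threshold is available, which is why the workarounds mentioned above are needed.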