
Distributed Access Management

David Groep, Nikhef. Recovering control over compute in the wake of community-run scheduling services. OGF28 Security Workshop.


Presentation Transcript


  1. Distributed Access Management – Recovering control over compute in the wake of community-run scheduling services. David Groep, Nikhef

  2. Overview • Grid ‘infrastructure’ view • Workload models are changing • ‘traditional’ job submission models and brokering • VO-centric workload management and multi-user pilot jobs • impact on traceability and incident handling • Recovering control • policy actions and liability; containing mechanisms • Distributing your control point: gLExec • sources and early deployments • Towards integrated authorization – choices to make

  3. e-Infrastructure model for Grid • Users and user communities, sites and resources: where do you schedule your work? • Policy Coordination: user and VO AUPs, operations, trust, enabling AAA • Access negotiation: VO meta-data, ACLs, operational environment needs • Submitting work: the user pushes work to the selected resource

  4. Job Submission Scenario

  5. Securing the entrance • Resource boundary – enforcing access control • Graphics: OGSA 1.0 GFD, Frank Siebenlist, Globus and ANL

  6. Classic job submission models • In the submission models shown, submission of the user job to the batch system is done with the original job owner’s mapped (uid, gid) identity • grid-to-local identity mapping is done only on the front-end system (CE) • batch system accounting provides per-user records • inspection shows per-user Unix processes on the worker nodes and in the batch queue
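To make the ‘grid-to-local identity mapping’ on the CE concrete, here is a minimal sketch of the classic grid-mapfile lookup; the file path, DN and account name are illustrative, and a real CE performs this inside the gatekeeper / LCMAPS stack rather than in a standalone script.

```python
# Minimal sketch of classic grid-to-local identity mapping on a CE front-end,
# assuming a grid-mapfile with lines of the form:
#   "/DC=org/DC=example/CN=Some User" dteamuser
# (path, DN and account names are illustrative only)
import pwd
import re

def map_grid_identity(subject_dn, gridmap="/etc/grid-security/grid-mapfile"):
    """Return the local (uid, gid) for a grid subject DN, or None if unmapped."""
    with open(gridmap) as fh:
        for line in fh:
            m = re.match(r'^"([^"]+)"\s+\.?(\S+)', line)
            if m and m.group(1) == subject_dn:
                # Pool-account allocation (a leading '.' in the mapfile) is
                # elided here; we simply resolve the named local account.
                entry = pwd.getpwnam(m.group(2))
                return entry.pw_uid, entry.pw_gid  # the job runs under these ids
    return None  # no mapping: the CE refuses the job
```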

  7. Late binding: pilot jobs Job submission is getting more and more intricate … • Late binding of jobs to job slots via pilot jobs: some users and communities develop, and prefer to use, proprietary, VO- or user-specific scheduling & job management • the ‘visible’ job is a placeholder that downloads the real job • the placeholders first establish an overlay network • subsequent scheduling and starting of jobs is faster • the placeholder is not committed to any particular task on launch • perhaps not even bound to a particular user! • Scheduling within this overlay is orthogonal to the site-provided systems

  8. User or VO overlay network

  9. Every user a pilot: nothing really new ‘User WMS’ • This is happening today if you allow (outbound) network connections! • Indistinguishable from ‘apparent’ use of your resources • But does introduce additional attack surfaces (the user WMS)

  10. Pilot job incentives Some Pros: • Worker node validation and matching to task properties • Intra-VO priorities can be reshuffled on the fly without involving site administrators • Avoid jobs sitting in queues when they could run elsewhere From: https://wlcg-tf.hep.ac.uk/wiki/Multi_User_Pilot_Jobs • For any kind of pilot job: • Frameworks such as Condor glide-in, DIRAC, PANDA, … or Topos are popular because they are ‘easy’ (that’s why there are so many of them!) • Single-user pilot jobs are no different from other jobs when you allow network connections to and from the WNs • Of course: any framework used to distribute payload gives additional attack surface

  11. Multi-user pilot jobs • All pilot jobs are submitted by a single individual (or a few individuals) from a user community (VO) • creating an overlay network of waiting pilot jobs • The VO maintains a task queue to which people (presumably from the VO) can submit their work • users put their programs up on the task queue • A pilot job on the worker node looks for work on that task queue to get its payload • Pilot jobs can execute work for one or more users in sequence, until the wall time is consumed
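To illustrate the ‘pull’ model described on this slide, below is a minimal sketch of a multi-user pilot’s main loop; the task-queue endpoint, payload fields and walltime value are hypothetical and not taken from any particular framework (DIRAC, PANDA, …).

```python
# Illustrative multi-user pilot main loop: pull payloads from a VO task queue
# and run them in sequence until the batch-granted walltime is used up.
# The endpoint and payload structure below are hypothetical.
import subprocess
import time

TASK_QUEUE = "https://taskqueue.example-vo.org/next"  # hypothetical VO endpoint
WALLTIME = 24 * 3600                                   # seconds granted by the site

def fetch_payload():
    """Ask the VO task queue for the next user payload (None when empty);
    a real framework would authenticate here with the pilot's proxy."""
    return None  # stub

start = time.time()
while time.time() - start < WALLTIME:
    payload = fetch_payload()
    if payload is None:
        time.sleep(60)  # no work yet: keep the slot and poll again
        continue
    # Without gLExec every payload runs under the *pilot's* uid, so the site's
    # batch system and accounting only ever see the pilot submitter.
    subprocess.run(payload["command"], shell=True)
```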

  12. VO overlay networks: multi-user pilot jobs (MUPJ)

  13. Pros and cons of MU pilot jobs In the current ‘you only see the VO pilot submitter’ model: • Loss of control over who gets to use your site • Loss of control over scheduling/workload assignment: the site admin cannot adjust the share of a specific user who is overloading e.g. the Storage Element (only the pilots are seen by the batch system) and might need to: • ban the entire VO, instead of the user, from the SE and/or CE, or • reduce the entire VO share • Is that acceptable in case of a non-confirmed incident? • Traceability and incident handling issues Also some apparent advantages for providers: • you only see & need to configure a single user • needs no software or config – since the VO does that ... Extensive list of technical issues (both pros and cons): https://wlcg-tf.hep.ac.uk/wiki/Multi_User_Pilot_Jobs

  14. Traceability and compromises • Post-factum, in case of security incidents: • Complete & confirmed compromise is simple: ban the VO • In case of suspicion: to ban or not to ban, that’s the question • There is no ‘commensurate’ way to contain compromises • Do you know which users are inside the VO? No: the list is largely private. No: it takes a while for a VO to respond to ‘is this user known?’. No: the VO will ban a user only if they think (s)he is malicious – and that may differ from your view, or from the AIVD’s view, or ... • So: the VO may or may not block • The site is left in the cold: there is no ‘easy’ way out except blocking the entire VO, which then likely is not ‘acceptable’

  15. Traceability and compromises Ante-factum requirements • Sites may need proof of the identity of whoever has used (or is about to use!) the resources at any time, in particular the identities involved in any ongoing incidents • Information supplied by the VO may be (legally) insufficient or too late • Privacy laws might hamper the flow of such information back and forth • cf. the German government’s censorship bill, with the list of domains that a DNS server must block, but which cannot be published by the enforcing ISP • or other government requirements or ‘requests’ that need to be cloaked

  16. Traceability and compromises • Protecting the user payload, other users, and the pilot framework itself from malicious payloads • To some extent a problem for the VO framework, not for the site • It is not clear which payload caused the problem: all of them are suspect • User proxies (when used) can be stolen by rogue payloads • … or the proxy of the pilot job submitter itself can be stolen • Risk of another user being held legally accountable • Cross-infection of users by modifying key scripts and the environment of the framework users at each site • compromise of any user using the MUPJ framework ‘compromises’ the entire framework • Seeing distinguished users helps site administrators understand which user is causing a problem and remedy it

  17. Recovering control – policy, cooperation, tools

  18. Recovering control: policy • Draft a policy and try to ensure compliance • e.g. in EGEE: https://edms.cern.ch/document/855383 • implemented to varying degrees • actual policy requires use of fine-grained control tools • Collaboration with the VOs and frameworks: you cannot do without them! • vulnerability assessment of the framework software • work jointly to implement and honour controls • where relevant: ‘trust, but verify’ • Provide middleware control mechanisms • supporting site requirements on honouring policy • support VOs in maintaining framework integrity • protect against ‘unfortunate’ user mistakes

  19. Recovering control: mechanisms • Unix-level sandboxing • POSIX user-id and group-id mechanisms for protection • enforced by the ‘job accepting elements’: • Gatekeeper in EGEE (Globus and lcg-CE), TeraGrid and selected HPC sites • Unicore TSI • gLite CREAM-CE via sudo • VM sandboxing • not widely available yet • only helps either the VO or the site, but not both, since VMs cannot (yet) be nested ... a slight technical digression on (1) follows ...
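As a short digression on the mechanics of (1): the POSIX sandboxing step is simply a privilege drop before the payload starts. The sketch below is illustrative (the account name is made up, and it must be run by a root-owned component such as the gatekeeper or gLExec).

```python
# Sketch of the POSIX uid/gid sandboxing step: drop from root to a target
# account before handing control to the payload. Order matters: supplementary
# groups and gid first, uid last, or the drop could be reverted.
import os
import pwd

def run_as(account, argv):
    entry = pwd.getpwnam(account)      # e.g. a pool account such as "pool001"
    os.setgroups([entry.pw_gid])       # shed the caller's supplementary groups
    os.setgid(entry.pw_gid)
    os.setuid(entry.pw_uid)            # irreversible once we are no longer root
    os.execvp(argv[0], argv)           # replace this process with the payload

# run_as("pool001", ["/bin/sh", "-c", "./user_payload.sh"])  # requires root
```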

  20. Pushing access control downwards Multi-user pilot jobs hiding in the classic model

  21. Pushing access control downwards Making multi-user pilot jobs explicit with distributed Site Access Control (SAC) - on a cooperative basis -

  22. Recovering Control • Make the pilot job subject to the normal site policies for jobs • the VO submits a pilot job to the batch system • the VO ‘pilot job’ submitter is responsible for the pilot’s behaviour; this might be a specific role in the VO, or a locally registered ‘special’ user at each site • the pilot job obtains the true user job, and presents the user credentials and the job (executable name) to the site (glexec) to request a decision on a cooperative basis • Preventing ‘back-manipulation’ of the pilot job • make sure the user workload cannot manipulate the pilot • protect sensitive data in the pilot environment (the proxy!) • by changing the uid for the target workload away from the pilot’s
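From the pilot’s side, ‘presenting the user credentials and the job to the site (glexec)’ can look roughly like the sketch below; the GLEXEC_* environment-variable names and the file paths are assumptions based on the commonly documented gLExec interface and should be checked against the deployed version.

```python
# Sketch of a pilot handing a user payload to gLExec for a site decision.
# The variable names (GLEXEC_CLIENT_CERT, GLEXEC_SOURCE_PROXY,
# GLEXEC_TARGET_PROXY) and paths are assumptions, not verified interface.
import os
import subprocess

env = dict(os.environ,
           GLEXEC_CLIENT_CERT="/tmp/payload_user_proxy.pem",  # delegated user proxy
           GLEXEC_SOURCE_PROXY="/tmp/pilot_proxy.pem",        # the pilot's own credential
           GLEXEC_TARGET_PROXY="/tmp/user_proxy_copy.pem")    # where the copy should land

result = subprocess.run(
    ["/usr/sbin/glexec", "/bin/sh", "-c", "./user_payload.sh"], env=env)

if result.returncode != 0:
    # Site policy (LCAS/LCMAPS) denied this user or executable; the pilot must
    # drop the payload rather than run it under its own identity.
    pass
```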

  23. Recovering control: gLExec

  24. What is gLExec? gLExec is a thin layer to change Unix domain credentials based on grid identity and attribute information. You can think of it as • ‘a replacement for the gatekeeper’ • ‘a griddy version of Apache’s suexec’ • ‘a program wrapper around LCAS, LCMAPS or GUMS’

  25. What gLExec does … Inputs: the user’s grid credential (subject name, VOMS attributes, …), cryptographically protected by a CA or VO attribute-authority certificate; the command to execute; and a check that the current uid is allowed to execute gLExec. • Authorization (‘LCAS’): check white/blacklist, VOMS-based ACLs, is the executable allowed?, … • Credential acquisition (LCMAPS): voms-poolaccount, localaccount, GUMS, …; ‘do it’: LDAP account, posixAccount, AFS, … • Then execute the command with its arguments as the mapped user (uid, pgid, sgids, …)
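The flow on this slide can be summarised in a few schematic lines; authorized() and acquire_local_account() below are illustrative stubs standing in for the LCAS and LCMAPS plugin stacks, not real APIs.

```python
# Schematic of the gLExec decision flow: authorise (LCAS-style), map to a
# local account (LCMAPS-style), then switch Unix credentials and execute.
import os
import pwd

def authorized(grid_credential, command):
    """LCAS-style checks: white/blacklist, VOMS-based ACLs, executable allowed?"""
    return True  # stub

def acquire_local_account(grid_credential):
    """LCMAPS-style mapping: voms-poolaccount, localaccount, GUMS callout, ..."""
    return "pool001"  # stub

def glexec_like(grid_credential, command, argv):
    if not authorized(grid_credential, command):
        raise PermissionError("site policy denies this user or executable")
    entry = pwd.getpwnam(acquire_local_account(grid_credential))
    os.setgid(entry.pw_gid)            # switch Unix domain credentials ...
    os.setuid(entry.pw_uid)            # ... to the mapped user
    os.execvp(command, argv)           # run the payload as that user
```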

  26. Pieces of the solution VO-supplied pilot jobs must observe and honour the same policies the site uses for normal job execution (e.g. banned individual users). Three pieces that go together: • glexec on the worker-node deployment • the mechanism for pilot jobs to submit themselves and their payload to site policy control • gives ‘incontrovertible’ evidence of who is running on which node at any one time (in mapping mode) • gives the ability to attribute actions to an individual • by asking the VO to present the associated delegation for each user • The VO should want this • to keep user jobs from interfering with each other, or with the pilot • honouring site ban lists for individuals may help in not banning the entire VO in case of an incident

  27. Pieces of the solution • glexec on the worker-node deployment • keep the pilot jobs to their word • mainly: monitor for compromised pilot submitters’ credentials • process or system-call level auditing of the pilot jobs • logging and log analysis • gLExec cannot do better than what the OS/batch system does • ‘internal accounting should now be done by the VO’ • the regular site accounting mechanisms are via the batch system, and these will see the pilot job identity • the site can easily show from those logs the usage by the pilot job • accounting based on gLExec-invoked jobs requires a large and unknown effort • time accrual and the process tree remain intact across the invocation • but, just like today, users can escape from both anyway!

  28. gLExec deployment modes • Identity Mapping Mode – ‘just like on the CE’ • have the VO query (and by policy honour) all site policies • actually change uid based on the true user’s grid identity • enforce per-user isolation and auditing using uids and gids • requires gLExec to have setuid capability • Non-Privileged Mode – declare only • have the VO query (and by policy honour) all site policies • do not actually change uid: no isolation or auditing per user • pilot and framework remain vulnerable • the gLExec invocation will be logged, with the user identity • does not require setuid powers – the job keeps running in pilot space • ‘Empty Shell’ – do nothing but execute the command…

  29. Installation • Actually, only identity mapping mode really helps. Otherwise: • back-compromise (and worm infections) remain possible unless you change ID or sandbox the payload • attributing actions to users on the WN is impossible (that needs a uid change) • interesting back-compromises of the entire framework are possible (.bashrc or .curlrc infections, &c) • But a setuid installation triggers other worries • stability and security of the tool itself • integration with the scheduling systems Which choice would you make??

  30. Conclusions • The grid access model changed fundamentally with the introduction of MUPJs • There are both benefits and drawbacks to the new situation • But this reality can neither be ignored nor banned • There are good reasons to retain control • ability to react commensurately • implementation of site-local or national controls • systems-operational reasons • Regaining control through a combination of: • Policy: the JSPG policy document, reviews of VO workload management systems • Systems tooling: gLExec or similar tools in strict sandboxing mode • Monitoring: compliance verification of the VO All the options are there – how do you react to the changed world?
