Keep the Adversary Guessing: Agent Security by Policy Randomization Praveen Paruchuri University of Southern California paruchur@usc.edu
Motivation: The Prediction Game • Police vehicle • Patrols 4 regions • Can you predict the patrol pattern? • Pattern 1 • Pattern 2 • Randomization decreases predictability • Increases security
Domains • Police patrolling groups of houses • Scheduled activities at airports, such as security checks and refueling • Adversary monitors activities • Randomized policies
Problem Definition • Problem: Security for agents in uncertain adversarial domains • Assumptions for agent/agent-team: • Variable information about adversary • Adversary cannot be modeled (Part 1) • Action/payoff structure unavailable • Adversary is partially modeled (Part 2) • Probability distribution over adversaries • Assumptions for adversary: • Knows the agent's plan/policy • Exploits the action predictability
Outline • Security via Randomization • Contributions: New, efficient algorithms • No Adversary Model: Randomization + quality constraints (MDP/Dec-POMDP) • Partial Adversary Model: Mixed strategies (Bayesian Stackelberg Games)
No Adversary Model: Solution Technique • Intentional policy randomization for security • Information Minimization Game • MDP/POMDP: Sequential decision making under uncertainty • POMDP: Partially Observable Markov Decision Process • Maintain quality constraints • Resource constraints (time, fuel etc.) • Frequency constraints (likelihood of crime, property value)
Randomization with quality constraints • Ex: Fuel used < Threshold
No Adversary Model: Contributions • Two main contributions • Single Agent Case: • Nonlinear program: Entropy based metric • Hard to solve (Exponential) • Convert to Linear Program: BRLP (Binary search for randomization) • Multi Agent Case: RDR (Rolling Down Randomization) • Randomized policies for decentralized POMDPs
MDP-based single-agent case • MDP is the tuple <S, A, P, R> • S – Set of states • A – Set of actions • P – Transition function • R – Reward function • Basic terms used: • x(s,a): expected number of times action a is taken in state s • Policy (as a function of MDP flows): π(s,a) = x(s,a) / Σ_a' x(s,a')
Entropy: Measure of randomness • Randomness or information content quantified using entropy (Shannon 1948) • Entropy for an MDP: • Additive Entropy – add the entropies of each state • Weighted Entropy – weigh each state's entropy by its contribution to the total flow
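As a concrete reading of these two definitions, here is a minimal NumPy sketch (function names and representation are mine, not from the talk) that computes both entropies from the flow variables x(s,a):

```python
import numpy as np

def policy_from_flows(x):
    """pi(s, a) = x(s, a) / sum_a' x(s, a'); states with no flow get a uniform policy."""
    x = np.asarray(x, dtype=float)
    totals = x.sum(axis=1, keepdims=True)
    safe = np.where(totals > 0, totals, 1.0)
    return np.where(totals > 0, x / safe, 1.0 / x.shape[1])

def state_entropies(pi):
    """Shannon entropy (in bits) of the action distribution at each state."""
    logs = np.where(pi > 0, np.log2(np.where(pi > 0, pi, 1.0)), 0.0)
    return -(pi * logs).sum(axis=1)

def additive_entropy(x):
    """Additive Entropy: add the entropies of each state."""
    return state_entropies(policy_from_flows(x)).sum()

def weighted_entropy(x):
    """Weighted Entropy: weigh each state's entropy by its share of the total flow."""
    x = np.asarray(x, dtype=float)
    state_flow = x.sum(axis=1)
    return (state_flow / state_flow.sum() * state_entropies(policy_from_flows(x))).sum()

# A deterministic flow has zero entropy; a uniform flow has maximal entropy.
print(weighted_entropy([[1.0, 0.0], [2.0, 0.0]]))  # 0.0
print(weighted_entropy([[0.5, 0.5], [1.0, 1.0]]))  # 1.0
```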
Randomized Policy Generation • Non-linear Program: Max entropy, Reward above threshold • Exponential Algorithm • Linearize: Obtain Poly-time Algorithm • BRLP (Binary Search for Randomization LP) • Entropy as function of flows
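In my own notation (the exact constraint form in the talk may differ, e.g. with discounting), the non-linear program pairs the weighted entropy, written as a function of the flows, with the standard dual-LP flow constraints and a reward threshold E_min:

```latex
% Max-entropy non-linear program (sketch, my notation):
% alpha_j = initial-state distribution, p = transition function, E_min = reward threshold.
\begin{align*}
\max_{x \ge 0}\quad & H_W(x) \;=\; -\sum_{s} \frac{\sum_{a} x(s,a)}{\sum_{s',a'} x(s',a')}
      \sum_{a} \frac{x(s,a)}{\sum_{a'} x(s,a')} \log \frac{x(s,a)}{\sum_{a'} x(s,a')} \\
\text{s.t.}\quad & \sum_{a} x(j,a) \;=\; \alpha_j + \sum_{s}\sum_{a} p(j \mid s,a)\, x(s,a)
      && \forall j \in S \\
& \sum_{s}\sum_{a} r(s,a)\, x(s,a) \;\ge\; E_{\min}
\end{align*}
```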
BRLP: Efficient Randomized Policy • Inputs: a baseline flow x̂ and a target reward • x̂ can be the flow of any high-entropy policy (e.g., the uniform policy) • LP for BRLP (see sketch below) • Entropy is controlled with the parameter β
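A sketch of the BRLP linear program as I understand it, in the same notation as above (not taken verbatim from the talk): for a fixed β ∈ [0,1], every flow is kept at least a β-fraction of the high-entropy baseline x̂, and a binary search over β finds the most randomized solution that still meets the target reward.

```latex
% BRLP inner LP for a fixed beta (sketch, my notation):
\begin{align*}
\max_{x \ge 0}\quad & \sum_{s}\sum_{a} r(s,a)\, x(s,a) \\
\text{s.t.}\quad & \sum_{a} x(j,a) \;=\; \alpha_j + \sum_{s}\sum_{a} p(j \mid s,a)\, x(s,a)
      && \forall j \in S \\
& x(s,a) \;\ge\; \beta\, \hat{x}(s,a) && \forall s \in S,\ a \in A
\end{align*}
% beta = 0 recovers the deterministic max-reward policy; beta = 1 forces x = x-hat
% (maximum entropy); the binary search returns the largest beta whose optimum still
% meets the target reward.
```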
BRLP in Action • [Figure: a scale of increasing β, from β = 0 (deterministic, maximum-reward policy) to β = 1 (maximum-entropy policy); the binary search (here at β = .5) moves along this scale until the target reward is met]
Results (averaged over 10 MDPs) • For a given reward threshold: • Highest entropy: Weighted Entropy – 10% average gain over BRLP • Fastest: BRLP – 7-fold average speedup over Expected Entropy
Multi Agent Case: Problem • Maximize entropy for agent teams subject to a reward threshold • For the agent team: • Decentralized POMDP framework • No communication between agents • For the adversary: • Knows the agents' policy • Exploits the action predictability
Policy trees: Deterministic vs. Randomized • [Figure: two policy trees over actions A1, A2 and observations O1, O2 – the Deterministic Policy Tree selects a single action at every node, while the Randomized Policy Tree assigns a probability distribution over actions at each node]
RDR: Rolling Down Randomization • Input: • Best (local or global) deterministic policy • Percent of reward loss • d parameter – determines the number of turns each agent gets • Ex: d = .5 => number of steps = 1/d = 2 • Each agent gets one turn (for the 2-agent case) • Single-agent MDP problem at each step
RDR with d = .5 (M = maximum joint reward): • Step 1 – Agent 1's turn: fix Agent 2's policy, maximize joint entropy, joint reward ≥ 90% of M • Step 2 – Agent 2's turn: fix Agent 1's policy, maximize joint entropy, joint reward ≥ 80% of M
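The alternation above generalizes to any d. The skeleton below is my own sketch of the RDR control loop, not the original implementation; the per-turn single-agent solver is left as a caller-supplied function, and the threshold rolls down by an equal share of the allowed reward loss each turn (matching the 90% / 80% example above):

```python
def rdr(agents, max_joint_reward, reward_loss, d, solve_turn):
    """Sketch of the RDR loop (assumed interface, not the original code).

    agents           : agent ids in turn order, e.g. [1, 2]
    max_joint_reward : M, reward of the best deterministic joint policy
    reward_loss      : total fraction of reward we may give up (e.g. 0.2)
    d                : step parameter; number of turns = 1 / d
    solve_turn       : callable(agent, threshold) -> new randomized policy for
                       'agent' that maximizes joint entropy with the other
                       agents' policies held fixed and joint reward >= threshold
    """
    n_turns = int(round(1.0 / d))
    policies = {}
    for turn in range(1, n_turns + 1):
        agent = agents[(turn - 1) % len(agents)]
        # Threshold rolls down by an equal share each turn:
        # with reward_loss = 0.2 and d = 0.5 this gives 90% of M, then 80% of M.
        threshold = max_joint_reward * (1.0 - reward_loss * d * turn)
        policies[agent] = solve_turn(agent, threshold)
    return policies
```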
RDR Details • To derive the single-agent MDP: • New transition, observation and belief update rules are needed • Original belief update rule vs. new belief update rule (sketch below)
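For intuition only, here is a sketch in my own notation of the two updates being contrasted; the exact rules derived for RDR may differ. The original update is the standard POMDP belief update; the new one, used when agent 2's randomized policy π₂ is held fixed, marginalizes over agent 2's possible actions:

```latex
% Standard POMDP belief update after taking action a and observing o (sketch):
\begin{align*}
b'(s') \;\propto\; O(o \mid s', a) \sum_{s} P(s' \mid s, a)\, b(s)
\end{align*}
% Update for agent 1 when agent 2's randomized policy \pi_2 is fixed (sketch):
% the belief marginalizes over agent 2's actions, weighted by \pi_2.
\begin{align*}
b'(s') \;\propto\; \sum_{a_2} \pi_2(a_2)\; O(o_1 \mid s', a_1, a_2)
        \sum_{s} P(s' \mid s, a_1, a_2)\, b(s)
\end{align*}
```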
Experimental Results: Reward Threshold vs. Weighted Entropy (averaged over 10 instances)
Security with Partial Adversary Modeled • Police agent patrolling a region. • Many adversaries (robbers) • Different motivations, different times and places • Model (Action & Payoff) of each adversary known • Probability distribution known over adversaries • Modeled as Bayesian Stackelberg game
Bayesian Game • It contains: • Set of agents: N (police and robbers) • A set of types θ_m (police and robber types) • Set of strategies σ_i for each agent i • Probability distribution over types Π_j : θ_j → [0,1] • Utility function: U_i : θ_1 × θ_2 × σ_1 × σ_2 → R
Stackelberg Game • Agent as leader • Commits to a strategy first: the patrol policy • Adversaries as followers • Optimize against the leader's fixed strategy • Observe patrol patterns to leverage information • Example (Agent vs. Adversary payoffs): the Nash equilibrium <a,a> gives [2,1]; if the leader commits to the uniform random strategy {.5,.5}, the follower plays b and the payoffs are [3.5,1]
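The payoffs quoted above match the standard 2×2 illustration from Conitzer and Sandholm; assuming that matrix (my reconstruction, it is not shown on this slide, entries are leader payoff, follower payoff), the commitment advantage works out as follows:
• Leader a: (2, 1) against follower a, (4, 0) against follower b
• Leader b: (1, 0) against follower a, (3, 2) against follower b
Without commitment, <a, a> is the pure-strategy Nash equilibrium, giving payoffs [2, 1]. If the leader commits to {.5, .5} over a and b, the follower expects .5·1 + .5·0 = .5 from a and .5·0 + .5·2 = 1 from b, so it plays b; the leader then expects .5·4 + .5·3 = 3.5, i.e., payoffs [3.5, 1].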
Previous work: Conitzer, Sandholm AAAI'05, EC'06 • MIP-Nash (AAAI'05): Efficient best-Nash procedure • Multiple LPs Method (EC'06): Given a normal-form game, finds the optimal leader strategy to commit to • Bayesian to normal-form game • Harsanyi transformation: Exponential adversary strategies • NP-hard • For every joint pure strategy j of the adversary, solve an LP (sketch below), where R, C are the agent's and adversary's payoff matrices
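The per-strategy LP behind the Multiple LPs method (Conitzer & Sandholm, EC'06), written in my notation: for each follower pure strategy j, find the leader mixed strategy x that maximizes the leader's payoff while keeping j a best response, then keep the best feasible j.

```latex
% LP solved for each follower pure strategy j (sketch):
\begin{align*}
\max_{x}\quad & \sum_{i} R_{ij}\, x_i \\
\text{s.t.}\quad & \sum_{i} C_{ij}\, x_i \;\ge\; \sum_{i} C_{ij'}\, x_i && \forall j' \\
& \sum_{i} x_i = 1, \qquad x_i \ge 0
\end{align*}
% In the Bayesian case, j ranges over the exponentially many joint pure strategies
% of the Harsanyi-transformed adversary, which is what DOBSS avoids.
```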
Bayesian Stackelberg Game: Approach • Two approaches: • Heuristic solution • ASAP: Agent Security via Approximate Policies • Exact solution • DOBSS: Decomposed Optimal Bayesian Stackelberg Solver • Exponential savings • No Harsanyi transformation • No exponential number of LPs • One MILP (Mixed Integer Linear Program)
ASAP vs DOBSS • ASAP: Heuristic • Control probability of strategy • Discrete probability space • Generates k-uniform policies • k = 3 => Probability = {0, 1/3, 2/3, 1} • Simple and easy to implement • DOBSS: Exact • Modify ASAP Algorithm • Discrete to continuous probability space • Focus of rest of talk
DOBSS Details • Previous work: • Fix adversary (joint) pure strategy • Solve LP to find best agent strategy • My approach: • For each agent mixed strategy • Find adversary best response • Advantages: • Decomposition technique • Given agent strategy • Each adversary can find Best-response independently • Mathematical technique obtains single MILP
Obtaining the MILP • Decompose by adversary type • Substitute variables to linearize the leader/follower product terms (see sketch below)
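A sketch of that substitution as I understand the DOBSS construction (my notation; the exact constraint set in the talk may differ). For adversary type l with prior p^l, the leader's expected reward contains bilinear terms x_i·q^l_j between the leader's mixed strategy x and the binary follower response q^l; replacing them with z^l_{ij} = x_i·q^l_j and adding linear linking constraints yields a single MILP:

```latex
% DOBSS-style MILP sketch (my notation): R^l, C^l are the leader's and type-l
% adversary's payoff matrices, p^l the prior over types, M a large constant.
\begin{align*}
\max_{z,\,q,\,a,\,x}\quad & \sum_{l} p^l \sum_{i}\sum_{j} R^l_{ij}\, z^l_{ij} \\
\text{s.t.}\quad
& \sum_{j} z^l_{ij} = x_i \ \ \text{(same leader strategy for every type)},
  \qquad z^l_{ij} \le q^l_j, \qquad z^l_{ij} \ge 0 && \forall i, j, l \\
& \sum_{i} x_i = 1, \qquad \sum_{j} q^l_j = 1, \qquad q^l_j \in \{0,1\} && \forall l \\
& 0 \;\le\; a^l - \sum_{i} C^l_{ij}\, x_i \;\le\; (1 - q^l_j)\, M && \forall j, l
\end{align*}
% With q^l binary, these constraints force z^l_{ij} = x_i q^l_j, and the last row
% makes each type's chosen strategy a best response to x.
```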
Experiments: Domain • Patrolling Domain: Security agent and robbers • Security agent patrols houses • Ex: Visit house a • Observe house and its neighbor • Plan for patrol length 2 • 6 or 12 strategies: 3 or 4 houses • Robbers can attack any house • 3 possible choices for 3 houses • Reward depends on house and agent position • Joint strategy space of robbers is exponential • Ex: 3^10 joint strategies for 3 houses and 10 robbers
Sample Patrolling Domain: 3 & 4 houses • 3 houses – Multiple LPs: 7 followers, DOBSS: 20 followers • 4 houses – Multiple LPs: 6 followers, DOBSS: 12 followers
Conclusion • Agent cannot model adversary • Intentional randomization algorithms for MDP/Dec-POMDP • Agent has partial model of adversary • Efficient MILP solution for Bayesian Stackelberg games
Vision • Incorporating machine learning • Dynamic environments • Resource constrained agents • Constraints might be unknown in advance • Developing real world applications • Police patrolling, Airport security
Thank You • Any comments/questions ?