
Tutorial on Optimization with Submodular Functions


Presentation Transcript


  1. Tutorial on Optimization with Submodular Functions. Andreas Krause.

  2. Acknowledgements • Partly based on material presented at IJCAI 2009 together with Carlos Guestrin (CMU) • MATLAB toolbox with the algorithms implemented available at http://las.ethz.ch/

  3. Discrete Optimization. Many discrete optimization problems can be formulated as follows: given a finite set V, we wish to select a subset A ⊆ V (subject to some constraints) maximizing a utility function F(A). These problems are the focus of this tutorial.

  4. Example: Sensor Placement [with Leskovec, Guestrin, VanBriesen, Faloutsos, J Wat Res Mgt ’08]. Contamination of drinking water could affect millions of people. Place sensors (Hach sensors, ~$14K each) to detect contaminations; contamination events are simulated with a water flow simulator from the EPA ("Battle of the Water Sensor Networks" competition). Where should we place sensors to quickly detect contamination?

  5. Example: Robotic monitoring of rivers and lakes [with Singh, Guestrin, Kaiser, Journal of AI Research ’08]. Need to monitor large spatial phenomena: temperature, nutrient distribution, fluorescence, … Use robotic sensors such as NIMS (Kaiser et al., UCLA) to cover large areas. We can only make a limited number of measurements and must predict the phenomenon (e.g., temperature as a function of location and depth across a lake) at unobserved locations. Where should we sense to get the most accurate predictions?

  6. Example: Model-based sensing • Model predicts impact of contaminations • For water networks: water flow simulator from EPA • For lake monitoring: statistical, data-driven models • For each subset A of the set V of all network junctions, compute the sensing quality F(A); a sensor reduces impact through early detection, so a poor placement has low sensing quality (e.g., F(A) = 0.01) while a good placement has high sensing quality (e.g., F(A) = 0.9).

  7. Sensor placement. Given: a finite set V of locations and a sensing quality function F. Want: A* = argmax_{A ⊆ V, |A| ≤ k} F(A). NP-hard! How well can this simple heuristic do? Greedy algorithm: start with A = {}; for i = 1 to k: s* := argmax_s F(A ∪ {s}); A := A ∪ {s*}. (See the Python sketch below.)
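
  A minimal Python sketch of this greedy rule (an illustration only, not the MATLAB toolbox implementation; the name greedy_maximize is mine):

      def greedy_maximize(V, F, k):
          """Greedy maximization of a monotone submodular set function.
          V: iterable of elements; F: callable on sets; k: cardinality budget (k <= |V|)."""
          A = set()
          for _ in range(k):
              # Pick the element with the largest marginal gain F(A ∪ {s}) - F(A).
              s_star = max((s for s in V if s not in A), key=lambda s: F(A | {s}) - F(A))
              A.add(s_star)
          return A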

  8. Performance of greedy algorithm. On a small subset of the water networks data, the greedy score is empirically close to optimal (plot: population protected, higher is better, vs. number of sensors placed, comparing greedy with the optimal solution). Why?

  9. Key property 1: Monotonicity. Compare placement A = {S1, S2} with placement B = {S1, S2, S3, S4}. F is monotonic if F(A) ≤ F(B) whenever A ⊆ B: adding sensors can only help.

  10. Key property 2: Diminishing returns. Compare placement A = {S1, S2} with placement B = {S1, S2, S3, S4} ⊇ A. Adding a new sensor S’ to the small placement A helps a lot; adding S’ to the larger placement B doesn’t help much. Submodularity: F(A ∪ {S’}) − F(A) ≥ F(B ∪ {S’}) − F(B) (large improvement for A, small improvement for B).

  11. One reason submodularity is useful. Theorem [Nemhauser et al. ’78]: Suppose F is monotonic and submodular. Then the greedy algorithm gives a constant-factor approximation: F(A_greedy) ≥ (1 − 1/e) max_{|A| ≤ k} F(A), i.e., at least ~63% of the optimal value. • Greedy algorithm gives a near-optimal solution! • In general, this guarantee is the best possible unless P = NP!
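
  As a hedged aside (not spelled out on the slide), the key step of the standard proof compares the greedy set A_i after i steps with an optimal set A* of size k:

      F(A^*) \le F(A^* \cup A_i)
             \le F(A_i) + \sum_{s \in A^* \setminus A_i} \bigl( F(A_i \cup \{s\}) - F(A_i) \bigr)
             \le F(A_i) + k \bigl( F(A_{i+1}) - F(A_i) \bigr)

  using monotonicity, then submodularity, then the greedy choice; unrolling this over i = 0, …, k−1 gives F(A_k) ≥ (1 − (1 − 1/k)^k) max_{|A| ≤ k} F(A) ≥ (1 − 1/e) max_{|A| ≤ k} F(A).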

  12. In this tutorial you will • See examples and properties of submodular functions • Learn how submodularity helps to find good solutions • Find out about current research directions

  13. Set functions • Finite set V = {1,2,…,n} • Set function F: 2^V → ℝ • Will always assume F({}) = 0 (w.l.o.g.) • Assume a black box that can evaluate F for any input A • Approximate (noisy) evaluation of F is ok. (Example: V is the set of all network junctions; a poor placement has low sensing quality F(A) = 0.01, a good one has high sensing quality F(A) = 0.9.)

  14. AU B AB Submodular set functions • Set function F on V is called submodularif • Equivalent diminishing returns characterization: + ≥ + B A A + S Large improvement Submodularity: B + S Small improvement

  15. Submodularity and supermodularity • F is called supermodular if −F is submodular • F is called modular if F is both sub- and supermodular; a modular ("additive") F has the form F(A) = Σ_{i ∈ A} w(i) for some weights w.

  16. Example: Set cover. Want to cover a floorplan with discs: place sensors in a building at possible locations V; each node covers (predicts values at) the positions within some radius. For A ⊆ V: F(A) = "area covered by sensors placed at A". Formally: W is a finite set, with a collection of n subsets S_i ⊆ W; for A ⊆ V define F(A) = |∪_{i ∈ A} S_i|. A Python sketch of such a coverage function follows.
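
  A minimal sketch of a coverage function (make_coverage is my name), usable directly with the greedy sketch from slide 7:

      def make_coverage(subsets):
          """subsets: maps each location i in V to the set S_i ⊆ W it covers.
          Returns F with F(A) = |union of S_i for i in A|."""
          def F(A):
              covered = set()
              for i in A:
                  covered |= subsets[i]
              return len(covered)
          return F

      # Tiny example on W = {1, ..., 6} with three candidate locations.
      F = make_coverage({"a": {1, 2, 3}, "b": {3, 4}, "c": {5, 6}})
      assert F({"a"}) + F({"b"}) >= F({"a", "b"}) + F(set())   # submodularity on one pair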

  17. Set cover is submodular. For A = {S1, S2} and B = {S1, S2, S3, S4}, the area a new disc S’ adds on top of A is at least the area it adds on top of B: F(A ∪ {S’}) − F(A) ≥ F(B ∪ {S’}) − F(B).

  18. A better model for sensing. Y_s: temperature at location s; X_s: sensor value at location s, with X_s = Y_s + noise. Joint probability distribution: P(X_1,…,X_n, Y_1,…,Y_n) = P(Y_1,…,Y_n) P(X_1,…,X_n | Y_1,…,Y_n) (prior times likelihood).

  19. Utility of having sensors at a subset A of all locations: the information gain, i.e., the uncertainty about temperature Y before sensing minus the uncertainty about Y after sensing. For example, A = {1,2,3} has high value F(A), while A = {1,4,5} has low value F(A).

  20. Example: Submodularity of information gain. Let Y_1,…,Y_m, X_1,…,X_n be discrete random variables and F(A) = I(Y; X_A) = H(Y) − H(Y | X_A). • F(A) is always monotonic • However, it is NOT always submodular. Theorem [Krause & Guestrin UAI ’05]: If the X_i are all conditionally independent given Y, then F(A) is submodular!

  21. Proof: Submodularity of information gain. With Y_1,…,Y_m, X_1,…,X_n discrete random variables, F(A) = I(Y; X_A) = H(Y) − H(Y | X_A), and the X_i independent given Y, the marginal gain is F(A ∪ {s}) − F(A) = H(X_s | X_A) − H(X_s | Y). The first term is nonincreasing in A (for A ⊆ B: H(X_s | X_A) ≥ H(X_s | X_B), "information never hurts"), and the second term is constant (independent of A). Hence Δ(s | A) = F(A ∪ {s}) − F(A) is monotonically nonincreasing, so F is submodular.

  22. Example: Influence in social networks [Kempe, Kleinberg, Tardos KDD ’03]. Who should get free cell phones? V = {Alice, Bob, Charlie, Dorothy, Eric, Fiona}, connected by a graph whose edges carry probabilities of influencing (e.g., 0.2, 0.3, 0.4, 0.5). F(A) = expected number of people influenced when targeting A.

  23. Influence in social networks is submodular [Kempe, Kleinberg, Tardos KDD ’03]. Key idea: flip the coins c in advance, yielding "live" edges. F_c(A) = number of people influenced under outcome c (a set cover function!), and F(A) = Σ_c P(c) F_c(A) is submodular as well!
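
  A hedged Monte Carlo sketch of this live-edge construction (estimate_influence and the edge format are my own choices, not from the paper):

      import random

      def estimate_influence(edges, A, samples=1000, seed=0):
          """Estimate F(A): expected number of nodes reachable from seed set A when each
          directed edge (u, v, p) is independently 'live' with probability p."""
          rng = random.Random(seed)
          total = 0
          for _ in range(samples):
              live = {}                                # adjacency of the sampled live-edge graph
              for u, v, p in edges:
                  if rng.random() < p:
                      live.setdefault(u, []).append(v)
              reached, frontier = set(A), list(A)      # traverse the live-edge graph from A
              while frontier:
                  u = frontier.pop()
                  for v in live.get(u, []):
                      if v not in reached:
                          reached.add(v)
                          frontier.append(v)
              total += len(reached)
          return total / samples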

  24. Closedness properties. Let F_1,…,F_m be submodular functions on V and λ_1,…,λ_m > 0. Then F(A) = Σ_i λ_i F_i(A) is submodular! Submodularity is closed under nonnegative linear combinations — an extremely useful fact: • F(A, θ) submodular for each θ ⇒ Σ_θ P(θ) F(A, θ) submodular (expectations stay submodular) • Multicriterion optimization: F_1,…,F_m submodular, λ_i > 0 ⇒ Σ_i λ_i F_i(A) submodular. (A tiny illustration follows.)
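
  A tiny self-contained illustration of this closedness fact (all names and data are mine):

      # Nonnegative combinations of submodular functions are again submodular.
      subsets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5, 6}}
      coverage  = lambda A: len(set().union(*[subsets[i] for i in A]))   # set cover
      diversity = lambda A: len(A) ** 0.5       # concave function of |A|, hence submodular
      combined  = lambda A: 2.0 * coverage(A) + 0.5 * diversity(A)       # still submodular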

  25. Submodularity and Concavity. Suppose g: ℕ → ℝ and F(A) = g(|A|). Then F(A) is submodular if and only if g is concave! E.g., g could say "buying in bulk is cheaper" (plot: a concave g(|A|) as a function of |A|).

  26. Maximum of submodular functions. Suppose F1(A) and F2(A) are submodular. Is F(A) = max(F1(A), F2(A)) submodular? No: max(F1, F2) is not submodular in general (in the plotted example, F1 and F2 are concave functions of |A| whose pointwise maximum is not concave).

  27. Minimum of submodular functions. Well, maybe F(A) = min(F1(A), F2(A)) instead? Counterexample on V = {a, b} (e.g., F1(A) = 1 iff a ∈ A and F2(A) = 1 iff b ∈ A, both modular and hence submodular): then F({b}) − F({}) = 0 < F({a,b}) − F({a}) = 1, violating diminishing returns. So min(F1, F2) is not submodular in general!

  28. Submodularity and convexity. Key result [Lovász ’83]: Every submodular function F: 2^V → ℝ induces a convex function g on ℝ^n_+ (its Lovász extension), such that • F(A) = g(w_A) for all A ⊆ V, where w_A is the indicator vector of A • min_A F(A) = min_w g(w) s.t. w ∈ [0,1]^n, and the minimum is attained at a corner! • g is efficient to evaluate. (Illustration on V = {a, b}: g(1,0) = F({a}), g(1,1) = F({a,b}).) Can use the ellipsoid algorithm to minimize submodular functions! A sketch of evaluating g follows.
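
  A minimal sketch of evaluating this extension via the standard greedy ordering (lovasz_extension is my name; F({}) = 0 is assumed):

      def lovasz_extension(F, V, w):
          """Evaluate the Lovász extension g(w): sort elements by decreasing weight
          and telescope the marginal values of F along that order."""
          order = sorted(V, key=lambda i: -w[i])
          g, prev, A = 0.0, 0.0, set()
          for i in order:
              A.add(i)
              g += w[i] * (F(A) - prev)   # weight times marginal value of the next element
              prev = F(A)
          return g

  For an indicator vector w_A this returns exactly F(A), matching the first bullet above.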

  29. Minimization of submodular functions

  30. Minimizing submodular functions. Theorem [Iwata & Orlin, SODA 2009]: There is a fully combinatorial, strongly polynomial algorithm for minimizing submodular functions that runs in time O(n^6 log n). Polynomial-time = practical?? Is there a more practical alternative?

  31. The submodular polyhedron P_F = {x ∈ ℝ^V : x(A) ≤ F(A) for all A ⊆ V}. Example on V = {a, b} (drawn in the (x({a}), x({b})) plane): P_F is the region satisfying x({a}) ≤ F({a}), x({b}) ≤ F({b}), and x({a,b}) = x({a}) + x({b}) ≤ F({a,b}).

  32. A more practical alternative? [Fujishige ’91, Fujishige et al. ’11]. Minimum norm point algorithm: 1. Find x* = argmin ||x||_2 s.t. x ∈ B_F, where B_F is the base polytope, the face of P_F on which x({a,b}) = F({a,b}). 2. Return A* = {i : x*(i) < 0}. In the two-element example, x* = [-1, 1] and A* = {a}. Theorem [Fujishige ’91]: A* is an optimal solution! Note: step 1 can be solved using Wolfe’s algorithm; its runtime is finite but unknown! A sketch of the linear-optimization subroutine over B_F follows.
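
  The workhorse inside Wolfe’s algorithm is linear optimization over B_F, which Edmonds’ greedy algorithm solves exactly; a hedged sketch (greedy_base_vertex is my name; F({}) = 0 is assumed):

      def greedy_base_vertex(F, V, direction):
          """Return the vertex x of the base polytope B_F minimizing <direction, x>:
          sort elements by increasing direction value and assign marginal gains."""
          order = sorted(V, key=lambda i: direction[i])
          x, A, prev = {}, set(), 0.0
          for i in order:
              A.add(i)
              x[i] = F(A) - prev          # marginal value of i given the earlier elements
              prev = F(A)
          return x

  Roughly, Wolfe’s procedure alternates calls to this oracle with affine minimization over the vertices found so far until it reaches the minimum norm point x*, and then thresholds: A* = {i : x*(i) < 0}.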

  33. Empirical comparison [Fujishige et al. ’06]. On cut functions from the DIMACS Challenge, the minimum norm algorithm is orders of magnitude faster (plot: running time in seconds vs. problem size from 64 to 1024, both on log scales). Our implementation can solve n = 10,000 in under 6 minutes!

  34. Example: Image denoising

  35. Example: Image denoising. X_i: noisy pixels; Y_i: "true" pixels. Pairwise Markov random field: P(x_1,…,x_n, y_1,…,y_n) = ∏_{i,j} ψ_{i,j}(y_i, y_j) ∏_i φ_i(x_i, y_i). Want argmax_y P(y | x) = argmax_y log P(x, y) = argmin_y Σ_{i,j} E_{i,j}(y_i, y_j) + Σ_i E_i(y_i), where E_{i,j}(y_i, y_j) = −log ψ_{i,j}(y_i, y_j). When is this MAP inference efficiently solvable (in high-treewidth graphical models)?

  36. MAP inference in Markov random fields. Energy E(y) = Σ_{i,j} E_{i,j}(y_i, y_j) + Σ_i E_i(y_i). Suppose the y_i are binary; define F(A) = E(y^A) where y^A_i = 1 iff i ∈ A. Then min_y E(y) = min_A F(A). Theorem [cf. Hammer ’65; Kolmogorov et al., PAMI ’04]: the MAP inference problem is solvable by graph cuts ⟺ for all i, j: E_{i,j}(0,0) + E_{i,j}(1,1) ≤ E_{i,j}(0,1) + E_{i,j}(1,0) ⟺ each E_{i,j} is submodular. "Efficient if we prefer that neighboring pixels have the same color." (A min-cut sketch for a special case follows.)
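
  A hedged Python sketch for the special case of pure disagreement penalties, using a networkx min cut (denoise_binary and the data format are my own; the fully general submodular case needs the Kolmogorov–Zabih construction):

      import networkx as nx

      def denoise_binary(unary, pairwise, s="source", t="sink"):
          """MAP for a binary pairwise MRF with nonnegative energies via min cut.
          unary: node -> (E_i(0), E_i(1)); pairwise: (i, j) -> penalty if y_i != y_j."""
          G = nx.DiGraph()
          for i, (e0, e1) in unary.items():
              G.add_edge(s, i, capacity=e1)   # this edge is cut iff i gets label 1
              G.add_edge(i, t, capacity=e0)   # this edge is cut iff i gets label 0
          for (i, j), lam in pairwise.items():
              G.add_edge(i, j, capacity=lam)  # cut iff i and j get different labels
              G.add_edge(j, i, capacity=lam)
          cut_value, (source_side, _) = nx.minimum_cut(G, s, t)
          return {i: (0 if i in source_side else 1) for i in unary}, cut_value

  The cut value then equals the minimum energy: each pixel pays its chosen unary energy and each disagreeing neighbor pair pays its penalty.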

  37. Constrained minimization. Have seen: if F is submodular on V, we can solve A* = argmin_A F(A) in polynomial time. What about A* = argmin_A F(A) subject to constraints, e.g., balanced cut, …? In general, not much is known about constrained minimization. However, we can do • A* = argmin F(A) s.t. 0 < |A| < n • A* = argmin F(A) s.t. |A| is odd/even [Goemans & Ramakrishnan ’95] • A* = argmin F(A) s.t. A ∈ argmin G(A) for G submodular [Fujishige ’91]. Some recent lower bounds by Svitkina & Fleischer ’08.

  38. State of the art in minimization • Faster algorithms for special cases via modern results for convex minimization [Stobbe & Krause NIPS ’10] • Approximation algorithms for constrained minimization: cooperative cuts [Jegelka & Bilmes CVPR ’11] • Fast approximate minimization via iterative reduction to graph cuts [Jegelka & Bilmes NIPS ’11] • Many open questions in principled, efficient, large-scale submodular minimization!

  39. Maximizing submodular functions

  40. Maximizing submodular functions. Minimizing convex functions: polynomial time solvable! Minimizing submodular functions: polynomial time solvable! Maximizing convex functions: NP-hard! Maximizing submodular functions: NP-hard! But we can get approximation guarantees.

  41. Maximizing non-monotonic functions • Suppose we want max_A F(A) for a non-monotonic submodular F • Example: F(A) = U(A) − C(A), where U(A) is a submodular utility and C(A) is a supermodular cost function. E.g.: trading off utility and privacy in personalized search [Krause & Horvitz AAAI ’08] • In general: NP-hard. Moreover, if F(A) can take negative values, it is as hard to approximate as maximum independent set (i.e., NP-hard to get an O(n^{1−ε}) approximation).

  42. Maximizing positive submodular functions [Feige, Mirrokni, Vondrak FOCS ’07]. Theorem: There is an efficient randomized local search procedure that, given a positive submodular function F with F({}) = 0, returns a set A_LS such that F(A_LS) ≥ (2/5) max_A F(A). • Picking a random set gives a ¼ approximation (½ approximation if F is symmetric!) • Current best approximation [Chekuri, Vondrak, Zenklusen]: ~0.41 via simulated annealing • We cannot get better than a ½ approximation unless P = NP.
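
  A hedged sketch of a plain add/remove local search in this spirit (the procedure in the paper is randomized and also considers the complement of the returned set; local_search is my simplification):

      def local_search(V, F, eps=1e-6):
          """Toggle single elements in or out while doing so improves F by more than eps."""
          A, improved = set(), True
          while improved:
              improved = False
              for s in V:
                  B = A ^ {s}              # add s if absent, remove it if present
                  if F(B) > F(A) + eps:
                      A, improved = B, True
          return A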

  43. Scalarization vs. constrained maximization. Given a monotonic utility F(A) and a cost C(A), optimize either: Option 1 ("scalarization"): max_A F(A) − C(A). Can get a 2/5 approximation … but only if F(A) − C(A) ≥ 0 for all sets A, and positiveness is a strong requirement. Option 2 ("constrained maximization"): max_A F(A) subject to a bound on C(A). The greedy algorithm gets a (1 − 1/e) ≈ 0.63 approximation, and there are many more results! Focus for the rest of today.

  44. Battle of the Water Sensor Networks Competition [with Leskovec, Guestrin, VanBriesen, Faloutsos, J Wat Res Mgt 2008] • Real metropolitan area network (12,527 nodes) • Water flow simulator provided by EPA • 3.6 million contamination events • Multiple objectives: Detection time, affected population, … • Place sensors that detect well “on average”

  45. Reward function is submodular • Claim: the reward function is monotonic submodular • Consider event i: R_i(u_k) = benefit from sensor u_k in event i, and R_i(A) = max_{u_k ∈ A} R_i(u_k) ⇒ R_i is submodular • Overall objective: R(A) = Σ_i Prob(i) R_i(A) ⇒ R is submodular ⇒ can use the greedy algorithm to solve it!
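
  A minimal sketch of this facility-location-style objective (names and toy data are mine), which again plugs into the greedy sketch from slide 7:

      def make_event_reward(event_probs, benefit):
          """R(A) = sum_i Prob(i) * max_{u in A} benefit[i][u], with R({}) = 0."""
          def R(A):
              return sum(p * max((benefit[i].get(u, 0.0) for u in A), default=0.0)
                         for i, p in event_probs.items())
          return R

      # Toy instance: two contamination events, three candidate sensor locations.
      R = make_event_reward({"e1": 0.7, "e2": 0.3},
                            {"e1": {"u1": 0.9, "u2": 0.2}, "e2": {"u2": 0.8, "u3": 0.5}})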

  46. Bounds on the optimal solution [Krause et al., J Wat Res Mgt ’08]. (Plot on water networks data: population protected F(A), higher is better, vs. number of sensors placed, comparing the greedy solution with the offline Nemhauser bound.) The (1 − 1/e) bound is quite loose … can we get better bounds?

  47. Data-dependent bounds [Minoux ’78] • Suppose A is a candidate solution to argmax F(A) s.t. |A| ≤ k, and A* = {s_1,…,s_k} is an optimal solution • Then F(A*) ≤ F(A ∪ A*) = F(A) + Σ_i (F(A ∪ {s_1,…,s_i}) − F(A ∪ {s_1,…,s_{i−1}})) ≤ F(A) + Σ_i (F(A ∪ {s_i}) − F(A)) = F(A) + Σ_i Δ_{s_i} • For each s ∈ V \ A, let Δ_s = F(A ∪ {s}) − F(A), and order the elements so that Δ_1 ≥ Δ_2 ≥ … ≥ Δ_n. Then: F(A*) ≤ F(A) + Σ_{i=1}^{k} Δ_i.
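
  A short sketch that computes this bound for any candidate set (minoux_bound is my name); applied to a greedy solution, it certifies on the given instance how far greedy can be from optimal:

      def minoux_bound(V, F, A, k):
          """Data-dependent upper bound on max_{|S| <= k} F(S), given a candidate set A:
          F(A*) <= F(A) + sum of the k largest single-element gains outside A."""
          gains = sorted((F(A | {s}) - F(A) for s in V if s not in A), reverse=True)
          return F(A) + sum(gains[:k])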

  48. Bounds on the optimal solution [Krause et al., J Wat Res Mgt ’08]. (Same plot as before — sensing quality F(A), higher is better, vs. number of sensors placed — now also showing the data-dependent bound, which is much tighter than the offline Nemhauser bound.) Submodularity gives data-dependent bounds on the performance of any algorithm.

  49. BWSN Competition results • 13 participants (Gueli; Trachtman; Guan et al.; Berry et al.; Dorini et al.; Huang et al.; Wu & Walski; Preis & Ostfeld; Propato & Piller; Ghimire & Barkdoll; Ostfeld & Salomons; Eliades & Polycarpou; and the greedy approach) • Performance measured in 30 different criteria (total score, higher is better) • Approaches included genetic algorithms (G), domain knowledge (D), "exact" methods / MIP (E), and other heuristics (H) • The greedy approach achieved 24% better performance than the runner-up!

  50. What was the trick? All 3.6M contamination events were simulated over 2 weeks on 40 processors, producing 152 GB of data on disk (16 GB in main memory, compressed). This gives a very accurate computation of F(A), but very slow evaluation of F(A): about 30 hours for 20 sensors, i.e., 6 weeks for all 30 settings. (Plot: running time in minutes, lower is better, vs. number of sensors selected, comparing exhaustive search over all subsets with naive greedy; submodularity to the rescue.)
