Exact Inference on Graphical Models

Samson Cheung

Outline • What is inference? • Overview • Preliminaries • Three general algorithms for inference • Elimination Algorithm • Belief Propagation • Junction Tree

What is inference? Given a fully specified joint distribution (database), inference means querying information about some random variables $X_F$, given knowledge about other random variables (the evidence $x_E$). • Query: information about $X_F$ • Evidence: $x_E$

Conditional/Marginal Prob. • Ex. Visual tracking: you want to compute the conditional to quantify the uncertainty in your tracking. • Query: the conditional of $X_F$, i.e. $p(x_F \mid x_E)$ • Evidence: $x_E$

Maximum A Posteriori (MAP) Estimate • Ex. Error control: we care about the decoded symbol. It is difficult to compute the error probability in practice due to high bandwidth. • Query: the most likely value of $X_F$, i.e. $\arg\max_{x_F} p(x_F \mid x_E)$ • Evidence: $x_E$

Inferencing is not easy • Potential: $\psi(p,q) = \exp(-|p-q|)$ • Marginal: $P(p,q) = \sum_{G \setminus \{p,q\}} p(G)$ • Computing marginals or MAP requires global communication! [Figure: graph with evidence nodes]

Inference Algorithms
EXACT
• General graph: ELIMINATION ALGORITHM, JUNCTION TREE
• Polytrees: BELIEF PROPAGATION
APPROXIMATE
• Iterative Conditional Modes • EM • Mean field • Variational techniques • Structural Variational techniques • Monte-Carlo • Expectation Propagation • Loopy belief propagation
Typical problem sizes
• 10-100 nodes: expert systems, diagnostics, simulation
• >1000 nodes: image processing, vision, physics
General inference algorithms: NP-hard

Outline • What is inference? • Overview • Preliminaries • Three general algorithms for inference • Elimination Algorithm • Belief Propagation • Junction Tree

Calculating Marginal: Introducing evidence • Inferencing: summing or maxing "part" of the joint distribution • In order not to be sidetracked by the evidence nodes, we roll them into the joint by considering $p^E(x) = p(x) \prod_{j \in E} \delta(x_j = \bar{x}_j)$, where $\bar{x}_j$ is the observed value of evidence node $j$ • Hence we will be summing or maxing the entire joint distribution

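Moralization • Every directed graph can be represented as an undirected graph by linking up parents who have the same child. • Deal only with undirected graphs • Directed factorization: $P(X_1)\,P(X_2 \mid X_1)\,P(X_3 \mid X_1)\,P(X_4 \mid X_1)\,P(X_5 \mid X_2,X_3)\,P(X_6 \mid X_3,X_4)$ • Undirected potentials after moralization: $\psi(X_1,X_2,X_3)\,\psi(X_1,X_3,X_4)\,\psi(X_2,X_3,X_5)\,\psi(X_3,X_4,X_6)$ [Figure: the directed graph on $X_1,\dots,X_6$ and its moral graph]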
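The moralization rule can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `moralize` and the `{child: [parents]}` input format are assumptions made for the example. The graph is the six-node example on this slide.

```python
from itertools import combinations

def moralize(parents):
    """Return the moral (undirected) graph of a DAG given as {child: [parents]}.

    The result is an adjacency dict {node: set_of_neighbours}.
    """
    nodes = set(parents)
    for ps in parents.values():
        nodes.update(ps)
    adj = {v: set() for v in nodes}
    for child, ps in parents.items():
        # Keep every directed edge, dropping its direction.
        for p in ps:
            adj[p].add(child)
            adj[child].add(p)
        # Link up ("marry") every pair of parents that share this child.
        for p, q in combinations(ps, 2):
            adj[p].add(q)
            adj[q].add(p)
    return adj

# The six-node example: P(X1) P(X2|X1) P(X3|X1) P(X4|X1) P(X5|X2,X3) P(X6|X3,X4)
dag = {
    "X1": [],
    "X2": ["X1"],
    "X3": ["X1"],
    "X4": ["X1"],
    "X5": ["X2", "X3"],
    "X6": ["X3", "X4"],
}
moral = moralize(dag)
```

Moralization adds exactly the edges X2-X3 and X3-X4 here, which is why each conditional probability fits inside one of the four three-node cliques.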

(X1,X2,X3,X4) (X2,X3,X5) (X3,X4,X6) π X1 X1 X2 X2 X3 X3 X4 X4 X5 X5 X6 X6 Adding edges is “okay” • The pdf of an undirected graph can ALWAYS be expressed by the same graph with extra edges added. • A graph with more edge • Lose important conditional independence information (okay for inferencing, not good for parameter est.) • Use more storage (why?) (X1,X2,X3) (X1,X3,X4) (X2,X3,X5) (X3,X4,X6)

C5 Separator: C1 C3={X2,X3} C6 C1 C2 C4 C3 Undirected graph and Clique graph C1(X1,X2,X3) C2(X1,X3,X4) C3(X2,X3,X5) C4(X3,X4,X6) C5(X7,X8,X9) C6(X1,X7) • Clique graph • Each node is a clique from the parametrization • An edge between two nodes (cliques) if the two nodes (cliques) share common variables X9 X7 X8 X1 X2 X3 X4 X5 X6

Computing Marginal • To obtain $p(x_1)$ we need to marginalize $x_2, x_3, x_4, x_5$ • We need to sum $N^5$ terms ($N$ is the number of symbols for each r.v.) • Can we do better? [Figure: graph on $X_1,\dots,X_5$]

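Elimination (Marginalization) Order • Try to marginalize in this order: $x_5, x_4, x_3, x_2$ (C = computation, S = storage)
1. Sum out $x_5$ to get $m_5(x_2,x_3)$: C: $O(N^3)$, S: $O(N^2)$
2. Sum out $x_4$ to get $m_4(x_2)$: C: $O(N^2)$, S: $O(N)$
3. Sum out $x_3$ to get $m_3(x_1,x_2)$: C: $O(N^3)$, S: $O(N^2)$
4. Sum out $x_2$ to get $m_2(x_1)$: C: $O(N^2)$, S: $O(N)$
• Complexity: $O(KN^3)$, Storage: $O(N^2)$, where $K$ = # r.v.s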
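The elimination order above can be tried out with a small pure-Python sketch. Everything here is an assumption made for illustration: the factor representation `(scope, table)`, the helper names, the choice of N = 3 symbols, and the use of the potential exp(-|p-q|) on every edge of the five-node graph with cliques (X1,X2), (X1,X3), (X2,X5), (X3,X5), (X2,X4).

```python
from itertools import product
import math

N = 3  # number of symbols per random variable (assumed for the example)
edges = [(1, 2), (1, 3), (2, 5), (3, 5), (2, 4)]  # pairwise cliques of the graph

def psi(a, b):
    # Pairwise potential borrowed from the earlier slide: exp(-|p - q|).
    return math.exp(-abs(a - b))

def make_factor(scope, fn):
    """A factor is (scope, table): a tuple of variables and a dict over their values."""
    return (scope, {vals: fn(*vals) for vals in product(range(N), repeat=len(scope))})

def eliminate(factors, var):
    """Multiply all factors that mention `var`, sum `var` out, return the new factor list."""
    touched = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    scope = tuple(sorted({v for vs, _ in touched for v in vs if v != var}))
    table = {}
    for vals in product(range(N), repeat=len(scope)):
        assign = dict(zip(scope, vals))
        total = 0.0
        for x in range(N):  # the summation over the eliminated variable
            assign[var] = x
            term = 1.0
            for vs, t in touched:
                term *= t[tuple(assign[v] for v in vs)]
            total += term
        table[vals] = total
    return rest + [(scope, table)]  # the new entry is the message m_var(scope)

factors = [make_factor(e, psi) for e in edges]
for var in (5, 4, 3, 2):  # the elimination order from the slide
    factors = eliminate(factors, var)

unnorm = factors[0][1]  # m_2(x_1), the unnormalized marginal of X1
Z = sum(unnorm.values())
p_x1 = [unnorm[(x,)] / Z for x in range(N)]
```

Each call to `eliminate` touches at most three variables at once, which is where the $O(N^3)$ per-step cost comes from, compared with $N^5$ terms for the naive sum.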

MAP is the same • Just replace summation with max • Note: • All the messages $m$ are different from those of the marginal computation • Need to remember the best configuration as you go

Graphical Interpretation • List of active potential functions:
Start: $C_1(X_1,X_2)$ $C_2(X_1,X_3)$ $C_3(X_2,X_5)$ $C_4(X_3,X_5)$ $C_5(X_2,X_4)$
Kill $X_5$: $C_1(X_1,X_2)$ $C_2(X_1,X_3)$ $C_5(X_2,X_4)$ $m_5(X_2,X_3)$
Kill $X_4$: $C_1(X_1,X_2)$ $C_2(X_1,X_3)$ $m_4(X_2)$ $m_5(X_2,X_3)$
Kill $X_3$: $C_1(X_1,X_2)$ $m_4(X_2)$ $m_3(X_1,X_2)$
Kill $X_2$: $m_2(X_1)$
[Figure: the graph on $X_1,\dots,X_5$ after each elimination step]

First real link to graph theory • Reconstituted graph = the graph that contains all the extra edges added during the elimination • It depends on the elimination order! • The complexity of graph elimination is $O(N^W)$, where $W$ is the size of the largest clique in the reconstituted graph. Proof: Exercise [Figure: reconstituted graph on $X_1,\dots,X_5$]

Finding the optimal order
• Minimizing the clique size turns out to be NP-hard [1]
• Greedy algorithm [2]:
1. Find the node v in G that connects to the least number of neighbors
2. Eliminate v and connect all its neighbors
3. Go back to 1 until G becomes a clique
• Current best techniques use simulated annealing [3] or approximation algorithms [4]
[1] S. Arnborg, D.G. Corneil, A. Proskurowski, Complexity of finding embeddings in a k-tree, SIAM J. Algebraic and Discrete Methods 8 (1987) 277–284.
[2] D. Rose, Triangulated graphs and the elimination process, J. Math. Anal. Appl. 32 (1970) 597–609.
[3] U. Kjærulff, Triangulation of graphs: algorithms giving small total state space, Technical Report R 90-09, Department of Mathematics and Computer Science, Aalborg University, Denmark, 1990.
[4] A. Becker, D. Geiger, A sufficiently fast algorithm for finding close to optimal clique trees, Artificial Intelligence 125 (2001) 3–17.
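The greedy steps can be sketched as follows. This is an illustrative version only: the function name, the tie-breaking rule (smallest node label) and the choice to keep eliminating until no node is left (rather than stopping once G is a clique, which yields no further fill-in edges anyway) are assumptions, not part of the slide.

```python
def greedy_elimination_order(adj):
    """Min-degree heuristic: return (elimination order, fill-in edges added)."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    order, fill_in = [], []
    while adj:
        # Step 1: node with the fewest remaining neighbours (ties broken by label).
        v = min(adj, key=lambda u: (len(adj[u]), u))
        nbrs = adj.pop(v)
        for a in nbrs:
            adj[a].discard(v)
        # Step 2: eliminate v and connect all of its neighbours to each other.
        for a in nbrs:
            for b in nbrs:
                if a < b and b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill_in.append((a, b))
        order.append(v)
    return order, fill_in

# Five-node graph with edges 1-2, 1-3, 2-5, 3-5, 2-4 (the elimination example).
graph = {1: {2, 3}, 2: {1, 4, 5}, 3: {1, 5}, 4: {2}, 5: {2, 3}}
order, fill_in = greedy_elimination_order(graph)
```

The original edges together with `fill_in` form the reconstituted graph for the returned order, so the largest clique in that graph bounds the cost of running elimination with it.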

This is serious • One of the most commonly used graphical models in vision is the Markov Random Field • Try to find an elimination order for this model. • Potential: $\psi(p,q) = \exp(-|p-q|)$; pixel: $I(x,y)$ [Figure: grid-structured MRF, annotated "Largest clique: 4. Grows linearly with dimension (?)"]

What about other marginals? • We have just computed $P(X_1)$. • What if I need to compute another marginal, such as $P(X_5)$? • Definitely, some part of the calculation can be reused! Ex. $m_5(X_2,X_3)$ is the same for every query in which $X_5$ is eliminated first. [Figure: graph on $X_1,\dots,X_5$]

Focus on trees • Focus on tree-like structures: undirected trees and directed trees (a directed tree = an undirected tree after moralization) • Why trees?

Why trees? • No moralization is necessary • There is a natural elimination ordering with query node as root • Depth first search : all children before parent • All sub-trees with no evidence nodes can be ignored (Why? Exercise for the undirected graph)

Elimination on trees • When we eliminate node $j$, the new potential function must be • A function of $x_i$ • Any other nodes? • nothing in the sub-tree below $j$ (already eliminated) • nothing from other sub-trees, since the graph is a tree • only $i$, from $\psi(x_i,x_j)$, which relates $i$ and $j$ • Think of the new potential function as a message $m_{ji}(x_i)$ from node $j$ to node $i$

What is in the message? This message is created by summing over $x_j$ the product of $\psi(x_i,x_j)$, all earlier messages $m_{kj}(x_j)$ sent to $j$, and $\psi^E(x_j)$ (if $j$ is an evidence node): $m_{ji}(x_i) = \sum_{x_j} \Big( \psi^E(x_j)\, \psi(x_i,x_j) \prod_{k \in c(j)} m_{kj}(x_j) \Big)$ • $c(j)$ = children of node $j$ • $\psi^E(x_j) = \delta(x_j = \bar{x}_j)$ if $j$ is an evidence node; 1 otherwise

Elimination = Passing message upward • After passing the messages up to the query (root) node $f$, we compute the conditional: $p(x_f \mid \bar{x}_E) \propto \psi^E(x_f) \prod_{k \in c(f)} m_{kf}(x_f)$, where $\psi^E$ is the evidence potential and $c(f)$ are the children of $f$ • What about answering other queries? [Figure: tree with the query node highlighted; the query needs 3 messages]

Messages are reused! • We can compute all possible messages in only double the amount of work it takes to do one query. • Then we take the product of relevant messages to get marginals. • Even though the naive approach (rerunning Elimination) needs to compute N(N-1) messages to find marginals for all N query nodes, there are only 2(N-1) possible messages.

Computing all possible messages • Idea: respect the following Message-Passing-Protocol: A node can send a message to a neighbour only when it has received messages from all its other neighbours. • Protocol is realizable: designate one node (arbitrarily) as the root. • Collect messages inward to root then distribute back out to leaves.

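Belief Propagation [Figure: node $j$ with neighbours $i$, $k$, $l$; messages $m_{ij}$, $m_{ji}$, $m_{jk}$, $m_{kj}$, $m_{jl}$, $m_{lj}$ flow along each edge in both directions]

Belief Propagation (sum-product) • Choose a root node (arbitrarily or as first query node). • If $j$ is an evidence node, $\psi^E(x_j) = \delta(x_j = \bar{x}_j)$, else $\psi^E(x_j) = 1$ • Pass messages from leaves up to root and then back down using: $m_{ji}(x_i) = \sum_{x_j} \Big( \psi^E(x_j)\, \psi(x_i,x_j) \prod_{k \in \mathcal{N}(j) \setminus i} m_{kj}(x_j) \Big)$, where $\mathcal{N}(j)$ is the set of neighbours of $j$ • Given messages, compute marginals using: $p(x_i \mid \bar{x}_E) \propto \psi^E(x_i) \prod_{k \in \mathcal{N}(i)} m_{ki}(x_i)$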
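A minimal sum-product sketch on a small tree is given below. The tree, the evidence, the number of states K and the pairwise potential exp(-|a-b|) are all hypothetical choices made for the example. Instead of an explicit leaves-to-root and root-to-leaves schedule, the sketch computes each message recursively and caches it, which respects the same protocol: a message from j to i is built only from the messages j received from its other neighbours.

```python
from functools import lru_cache
import math

K = 3  # states per node (assumed)
tree = {1: [2, 3], 2: [1, 4, 5], 3: [1], 4: [2], 5: [2]}  # hypothetical tree
evidence = {4: 0}  # hypothetical observation: node 4 is in state 0

def psi(a, b):
    # Pairwise potential on every edge (assumed): exp(-|a - b|).
    return math.exp(-abs(a - b))

def psi_E(j, xj):
    # Evidence potential: delta on the observed value, 1 for unobserved nodes.
    return 1.0 if j not in evidence or evidence[j] == xj else 0.0

@lru_cache(maxsize=None)
def message(j, i):
    """Message m_ji as a tuple indexed by x_i; each directed edge is computed once."""
    out = []
    for xi in range(K):
        total = 0.0
        for xj in range(K):
            term = psi_E(j, xj) * psi(xi, xj)
            for k in tree[j]:
                if k != i:  # all neighbours of j except the recipient i
                    term *= message(k, j)[xj]
            total += term
        out.append(total)
    return tuple(out)

def marginal(i):
    """Conditional p(x_i | evidence): normalized product of incoming messages."""
    belief = [
        psi_E(i, xi) * math.prod(message(k, i)[xi] for k in tree[i])
        for xi in range(K)
    ]
    z = sum(belief)
    return [b / z for b in belief]

marginals = {i: marginal(i) for i in tree}
```

After all marginals are computed, the cache holds one entry per directed edge, i.e. 2(N-1) messages for a tree with N nodes, rather than the N(N-1) messages that rerunning elimination for every query would compute.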

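MAP is the same (max-product) • Choose a root node arbitrarily. • If $j$ is an evidence node, $\psi^E(x_j) = \delta(x_j = \bar{x}_j)$, else $\psi^E(x_j) = 1$ • Pass messages from leaves up to root using: $m^{\max}_{ji}(x_i) = \max_{x_j} \Big( \psi^E(x_j)\, \psi(x_i,x_j) \prod_{k \in c(j)} m^{\max}_{kj}(x_j) \Big)$, where $c(j)$ are the children of $j$ • Remember which choice of $x_j = x_j^*$ yielded the maximum. • Given messages, compute the max value using any node $i$ (with $\mathcal{N}(i)$ its neighbours, once messages from all of them are available): $\max_{x_i} \psi^E(x_i) \prod_{k \in \mathcal{N}(i)} m^{\max}_{ki}(x_i)$ • Retrace steps from root back to leaves, recalling the best $x_j^*$, to get the maximizing argument (configuration) $x^*$.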
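The max-product steps, including the bookkeeping needed to retrace the best configuration, can be sketched as below. The tree, evidence, K and the potential exp(-|a-b|) are hypothetical, and the function and variable names are made up for this illustration.

```python
import math

K = 3  # states per node (assumed)
tree = {1: [2, 3], 2: [1, 4, 5], 3: [1], 4: [2], 5: [2]}  # hypothetical tree
evidence = {4: 2}  # hypothetical observation: node 4 is in state 2

def psi(a, b):
    return math.exp(-abs(a - b))  # pairwise potential (assumed)

def psi_E(j, xj):
    return 1.0 if j not in evidence or evidence[j] == xj else 0.0

def map_configuration(root):
    """Return (best configuration, its unnormalized score) via max-product."""
    best_state = {}  # (j, x_parent) -> maximizing x_j, remembered for the retrace

    def upward(j, parent):
        # Combine the evidence potential with the max-messages from j's children.
        child_msgs = [upward(k, j) for k in tree[j] if k != parent]
        local = [psi_E(j, xj) * math.prod(m[xj] for m in child_msgs)
                 for xj in range(K)]
        if parent is None:
            return local
        msg = []
        for xi in range(K):  # xi is the state of the parent
            scores = [psi(xi, xj) * local[xj] for xj in range(K)]
            xj_star = max(range(K), key=scores.__getitem__)
            best_state[(j, xi)] = xj_star
            msg.append(scores[xj_star])
        return msg

    root_scores = upward(root, None)
    x = {root: max(range(K), key=root_scores.__getitem__)}
    max_value = root_scores[x[root]]

    def downward(j, parent):
        # Retrace from the root to the leaves using the remembered argmaxes.
        for k in tree[j]:
            if k != parent:
                x[k] = best_state[(k, x[j])]
                downward(k, j)

    downward(root, None)
    return x, max_value

x_star, max_value = map_configuration(root=1)
```

With this potential, which favours equal neighbouring states, clamping node 4 to state 2 pulls every other node to state 2 as well.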

“Tree”-like graphs work too • Pearl (1988) shows that BP works for factor trees • See Jordan Chapter 4 for more details [Figure: a directed graph that is not a directed tree; after moralization, the corresponding factor graph IS A TREE]

What about arbitrary graphs? • BP only works on tree-like graphs • Question: Is there an algorithm for general graphs? • Also, after BP, we get the marginal for each INDIVIDUAL random variable • But the graph is characterized by cliques • Question: Can we get the marginal for every clique?

Mini-outline • Back to Reconstituted Graph • Three equivalent concepts • Triangulated graph – easy to validate • Decomposable graph – link to probability • Junction Tree – computational inference • Junction Tree Algorithm • Example

Back to Reconstituted graph • The reconstituted graph is a very important type of graph: a triangulated (chordal) graph • Definition: A graph is triangulated if every cycle with 4 or more nodes has a chord. • All trees are triangulated • All cliques are triangulated [Figure: a non-triangulated graph and a triangulated graph]
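One simple way to test the definition, sketched here under stated assumptions, uses the link to elimination: a graph is triangulated exactly when its nodes can be eliminated one by one without adding any edge, i.e. when at every step some node's neighbours already form a clique. The function name is made up, and this brute-force version favours clarity over speed (faster tests exist).

```python
from itertools import combinations

def is_triangulated(adj):
    """Check chordality by repeatedly removing a node whose neighbours form a clique."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    while adj:
        simplicial = next(
            (v for v in adj
             if all(b in adj[a] for a, b in combinations(adj[v], 2))),
            None,
        )
        if simplicial is None:
            return False  # every remaining node would need a fill-in edge
        for n in adj.pop(simplicial):
            adj[n].discard(simplicial)
    return True

square = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}               # 4-cycle, no chord
square_with_chord = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}                        # a tree
```

The chordless 4-cycle fails the test, while the same cycle with one chord, any tree, and any complete graph pass, matching the examples on the slide.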

Added during eliminationchordal Proof • Prove for any N-node graph, the reconstituted graph after elimination is triangulated. • Proof: By induction • N=1 : trivial • Assume N=k is true. • N=k+1 case: Reconstituted graph with k nodes  triangulated v v = first node eliminated

Lessons from graph theory • Graph coloring problem: find the smallest number of vertex colors so that adjacent colors are different = chromatic number • Sample application 1: Scheduling • Node = tasks • Edge = two tasks are not compatible • Coloring = Number of parallel tasks • Sample application 2 : Communication • Node = symbols • Edge = two symbols may produce the same output due to transmission error • Largest set of vertices with the same color = number of symbols that can be reliably sent

Lesson from graph theory • Determining the chromatic number  is NP-hard • Not so for a general type of graph called Perfect Graph • Definition:  = the size of the largest clique • Triangulated graph is an important type of perfect graphs. • Strong Perfect Graph Conjecture was proved in 2002 (148-page!) • Bottom line: Triangulated graph is “algorithmically friendly” – very easy to check whether a graph is triangulated and to compute properties from such a graph.

Link to Probability: Graph Decomposition • Definition: Given a graph G, a triple (A,B,S) with Vertex(G) = ABS is a decomposition G if • S separates A and B (i.e. every path from aA to bB must past through S. • S is a clique • Definition: G is decomposable if • G is complete or • There exist a decomposition (A,B,S) of G such that AS and BS are decomposable. S A B

What’s the big deal? A decomposable graph can be parametrized by marginals! If $G$ is decomposable, then $p(x) = \dfrac{\prod_{i=1}^{N} p(x_{C_i})}{\prod_{j=1}^{N-1} p(x_{S_j})}$, where $C_1, C_2, \dots, C_N$ are cliques in $G$, and $S_1, S_2, \dots, S_{N-1}$ are (special) separators between cliques. Notice there is one fewer separator than cliques. Equivalently, we can say that $G$ can be parametrized by marginals $p(x_C)$ and ratios of marginals, $p(x_C)/p(x_S)$.
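As a small worked instance (a hypothetical three-node chain, not an example from the slides), take the graph $X_1 - X_2 - X_3$. Its cliques are $C_1 = \{X_1, X_2\}$ and $C_2 = \{X_2, X_3\}$, with the single separator $S_1 = \{X_2\}$. Because $X_2$ separates $X_1$ from $X_3$, the two are conditionally independent given $X_2$, and the joint can be written with marginals only:

```latex
\begin{align*}
p(x_1, x_2, x_3)
  &= p(x_1, x_2)\, p(x_3 \mid x_1, x_2) \\
  &= p(x_1, x_2)\, p(x_3 \mid x_2)
     && \text{since } X_3 \perp X_1 \mid X_2 \\
  &= \frac{p(x_1, x_2)\, p(x_2, x_3)}{p(x_2)}
   = \frac{p(x_{C_1})\, p(x_{C_2})}{p(x_{S_1})} .
\end{align*}
```

Two cliques and one separator appear, in line with the count of one fewer separator than cliques.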

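This is not true in general • Example: the 4-cycle $A - B - C - D - A$, which is not decomposable. • If the distribution could be expressed as a product of marginals or ratios of marginals, at least one of the potentials would be a marginal, say $p(x_A, x_B)$. • Summing the joint over $x_C$ and $x_D$ would then give $p(x_A,x_B) = p(x_A,x_B)\, f(x_{AB})$, where $f(x_{AB})$ collects the remaining potentials, so $f(x_{AB})$ would have to be constant. • However, $f(x_{AB})$ is not a constant in general. [Figure: 4-cycle on $A$, $B$, $C$, $D$]

Proof • Proof by induction: $G$ can be decomposed into $A$, $B$ and $S$, where $A \cup S$ and $B \cup S$ are decomposable; $S$ separates $A$ and $B$ and is complete. • All cliques are subsets of either $A \cup S$ or $B \cup S$. • Since $S$ separates $A$ and $B$, $X_A$ and $X_B$ are conditionally independent given $X_S$, so $p(x) = p(x_{A \cup S})\, p(x_{B \cup S}) / p(x_S)$. [Figure: $A$ and $B$ separated by $S$]

Continue • Recursively apply the same argument to $A \cup S$ and $B \cup S$, based on the induction assumption.

So what? It turns out that Triangulated Graph $\Leftrightarrow$ Decomposable Graph • Triangulated graph: nice algorithmically • Decomposable graph: parametrized by marginals

Decomposable $\Rightarrow$ Triangulated • Prove by induction: if $G$ is complete, it is triangulated. • Otherwise, by the induction assumption, $G_{A \cup S}$ and $G_{B \cup S}$ are triangulated and thus all cycles in them have a chord. • The case we need to consider is a cycle that spans $A$, $B$ and $S$. Such a cycle must cross $S$ at least twice, once on each way between $A$ and $B$. But $S$ is complete, so the two crossing nodes are adjacent and the cycle must have a chord! QED [Figure: $A$ and $B$ separated by $S$]

Triangulated $\Rightarrow$ Decomposable • Prove by induction. Let $G$ be a triangulated graph with $N$ nodes. Show that $G$ can be decomposed into $(A,B,S)$. • If $G$ is complete, done. • If not, choose non-adjacent $a$ and $b$. • $S$ = smallest set that intersects all paths between $a$ and $b$. • $A$ = all nodes in $G \setminus S$ reached by $a$ • $B$ = all nodes in $G \setminus S$ reached by $b$ • Clearly $A$ and $B$ are separated by $S$. [Figure: $a \in A$ and $b \in B$ separated by $S$]

Triangulated $\Rightarrow$ Decomposable (continued) • Need to prove $S$ is complete. • Consider arbitrary $c, d \in S$. There is a path $a \to c \to b$ such that $c$ is the only node of the path in $S$. If not, then $S$ is not minimal, as $c$ can be put into either $A$ or $B$. Similarly, there is a path $a \to d \to b$. • Now we have a cycle. Since $G$ is triangulated, this cycle must have a chord. Since $S$ separates $A$ and $B$, the chord must lie entirely in $A \cup S$ or $B \cup S$. • Keep shrinking the cycle and eventually there must be a chord between $c$ and $d$, hence $S$ must be complete. [Figure: paths from $a$ to $b$ through $c$ and through $d$, forming a cycle across $S$]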
