
Learning factor graphs in polynomial time & sample complexity


Presentation Transcript


  1. Learning factor graphs in polynomial time & sample complexity Pieter Abbeel Daphne Koller Andrew Y. Ng Stanford University

  2. Introduction Overview • First polynomial time & sample complexity learning algorithm for factor graphs, • a superset of Bayesian nets, Markov nets. • Applicable to any factor graph of bounded factor size and connectivity, • including intractable networks (e.g., grids). • New technical ideas: • Parameter learning: closed-form; parameterization with low-dimensional frequencies only. • Structure learning: results about guaranteed-approximate Markov blankets from sample data.

  3. Introduction Factor graph distributions • A Bayesian network can be converted to a factor graph with 1 factor per conditional probability table. • A Markov random field can be converted to a factor graph with 1 factor per clique.

  4. Introduction Factor graph distributions • A factor graph defines the distribution P(x1:n) = (1/Z) ∏j fj(xCj), where each factor fj is defined over a subset of variables Cj ⊆ X1:n, xCj denotes the instantiation x1:n restricted to Cj, and Z is the partition function. • [Figure: example factor graph over variables 1, 2, 3 with factor nodes {1}, {2}, {3}, {1,2}, {2,3}; squares denote factor nodes, circles denote variable nodes.]
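
A minimal sketch (not from the paper) of how a factor graph over binary variables defines a distribution as the normalized product of its factors. The variable indices and pairwise scopes loosely mirror the example above; the factor values themselves are made up for illustration.

import itertools

# Each factor maps an assignment of its scope (a tuple of 0/1 values) to a positive number.
# Scopes are tuples of variable indices; these particular values are invented for the example.
factors = {
    (0, 1): {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0},  # factor over {X1, X2}
    (1, 2): {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 1.0, (1, 1): 1.0},  # factor over {X2, X3}
}

def unnormalized(x):
    # Product of all factors, each evaluated on x restricted to its scope.
    p = 1.0
    for scope, table in factors.items():
        p *= table[tuple(x[i] for i in scope)]
    return p

# Partition function Z: sum of the unnormalized measure over all full instantiations.
Z = sum(unnormalized(x) for x in itertools.product((0, 1), repeat=3))

def prob(x):
    # P(x1:n) = (1/Z) * prod_j f_j(x restricted to C_j)
    return unnormalized(x) / Z

print(prob((1, 1, 1)))  # probability of one full instantiation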

  5. Introduction Related work • Our work: first poly time & sample complexity solutions for parameter estimation & structure learning of factor graphs. • Current practice for parameter learning: max likelihood. • Expensive, and applies only to tractable networks. • Current practice for structure learning: local search heuristics or heuristic learning of bounded tree-width model. • Slow to evaluate, and no performance guarantees. [4], [5], [6], [7], [8] • References: [1] Chow & Liu, 1968; [2] Srebro, 2001; [3] Narasimhan & Bilmes, 2004; [4] Della Pietra et al., 1997; [5] McCallum, 2003; [6] Malvestuto, 1991; [7] Bach & Jordan, 2002; [8] Deshpande et al., 2001.

  6. Parameter learning Canonical parameterization • Consider the factor graph: [Figure: example factor graph over 16 variables, labeled 1–16.] • The Hammersley-Clifford theorem gives a parameterization of the distribution in terms of canonical factors.

  7. Parameter learning Canonical factors • By inclusion-exclusion, canonical factors contain no lower-order interactions: take the complete interaction, subtract the lower-order interactions, and compensate for double counting. • Canonical factors are built from frequencies only, with an equal number of positive and negative terms. • Closed-form parameter learning? NO. (Not yet.) The frequencies P(X1:16 = (x1, x2, 0, …, 0)) involve full instantiations and are thus expensive to estimate from samples.
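
As a concrete illustration, the canonical factor for a set of variables C can be written by inclusion-exclusion over subsets U ⊆ C, using an all-zeros default assignment: f*C(xC) = exp( ΣU⊆C (−1)^|C\U| log P(XU = xU, Xrest = 0) ). Below is a minimal sketch of that computation (not the paper's code); joint_prob is a hypothetical callable returning P(X1:n = x) for a full instantiation, which is exactly why these factors are expensive to estimate from samples.

import itertools
import math

def canonical_factor(C, x_C, n, joint_prob):
    # C: tuple of variable indices; x_C: their values; n: total number of variables.
    log_f = 0.0
    for r in range(len(C) + 1):
        for U in itertools.combinations(range(len(C)), r):
            x = [0] * n                       # start from the all-zeros default assignment
            for pos in U:                     # overwrite the variables in U with their values in x_C
                x[C[pos]] = x_C[pos]
            sign = (-1) ** (len(C) - len(U))  # inclusion-exclusion sign, (-1)^{|C \ U|}
            log_f += sign * math.log(joint_prob(tuple(x)))
    return math.exp(log_f)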

  8. Parameter learning Markov blanket canonical factors (MB: Markov blanket.) • Take the positive and negative terms of a canonical factor and transform each into a conditional probability given the Markov blanket: by conditional independence, the terms involving variables outside the factor and its Markov blanket cancel. • The result involves only low-dimensional distributions.

  9. Parameter learning Markov blanket canonical factors • {Cj*}: all subfactors of the given structure. • Each Markov blanket canonical factor is computed from the distribution over Cj* and MB(Cj*) only. • These are low-dimensional distributions, • so they allow efficient estimation from samples.
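
A sketch of a Markov blanket canonical factor under the same assumptions as above (binary variables, all-zeros default assignment): the same inclusion-exclusion sum, but each full-instantiation frequency is replaced by a low-dimensional conditional probability of the factor's variables given their Markov blanket held at the default value. Here cond_prob(C, vals, MB, mb_vals) is a hypothetical estimator of P(XC = vals | XMB = mb_vals) computed from samples.

import itertools
import math

def mb_canonical_factor(C, x_C, MB, cond_prob):
    # C: tuple of variable indices; x_C: their values; MB: Markov blanket of C (variable indices).
    log_f = 0.0
    for r in range(len(C) + 1):
        for U in itertools.combinations(range(len(C)), r):
            # variables of C in positions U take their values from x_C, the rest take the default 0
            vals = tuple(x_C[pos] if pos in U else 0 for pos in range(len(C)))
            sign = (-1) ** (len(C) - len(U))
            log_f += sign * math.log(cond_prob(C, vals, MB, (0,) * len(MB)))
    return math.exp(log_f)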

  10. Parameter learning Parameter learning • Algorithm: • Estimate the Markov blanket canonical factors from data. • Return their product (normalized) as the learned distribution. • Theorem. The parameter learning algorithm • runs in polynomial time, • uses a polynomial # of samples, and guarantees: the KL divergence D(true distribution || learned distribution) is small with high probability. • No dependence on tree-width of the network!
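
A sketch of this algorithm's outer loop, assuming the given structure is a list of factor scopes with known Markov blankets; estimate_mb_factor is a hypothetical routine (e.g., the sketch above driven by empirical conditional frequencies) that returns the estimated Markov blanket canonical factor table for one scope.

def learn_parameters(scopes, blankets, samples, estimate_mb_factor):
    # scopes: list of factor scopes (tuples of variable indices); blankets: dict scope -> Markov blanket.
    learned = {C: estimate_mb_factor(C, blankets[C], samples) for C in scopes}
    # The learned distribution is the product of these factor tables divided by its partition function.
    return learned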

  11. Parameter learning Graceful degradation • Theorem. When • the true distribution is a factor graph G, • but the structure used for parameter learning is a different factor graph G′ (≠ G), then the additional error consists of two terms: • (1) Canonical factors capture only the residual highest-order interactions, so this error is small when the relevant subfactors are in G′. • (2) If the Markov blankets in G′ are a good approximation of the Markov blankets in G, this error will be small. (See structure learning.)

  12. Structure learning Structure learning • Assume factor size ≤ k. • Is structure learning the same as parameter learning with the structure "all factors of size ≤ k"? • NO: estimating Markov blanket canonical factors requires knowledge of the Markov blankets, but if we knew the Markov blankets, the structure learning problem would already be solved.

  13. Structure learning Recovering the Markov blankets • Applying a Markov blanket criterion to the true distribution yields the true Markov blankets. • Applying the same criterion to sample data yields, at best, an approximate Markov blanket. • Key for parameter learning: the desired property for an approximate Markov blanket is that it can stand in for the true Markov blanket with small additional error.

  14. Structure learning Conditional entropy • Conditional entropy: H(X | Y) = − Σx,y P(x, y) log P(x | y). • For any candidate Markov blanket Y: by conditional independence, and because conditioning reduces entropy (for any X, Y, Z: H(X | Y, Z) ≤ H(X | Y)), the true Markov blanket minimizes the conditional entropy. • Thus the conditional entropy criterion applied to the true distribution recovers the true Markov blankets. • What about the conditional entropy computed from sample data?

  15. Structure learning Conditional entropy • Theorem. Empirical conditional entropy estimates are a good approximation of the true conditional entropy, even with a polynomial number of samples. • Theorem. Conditional entropy satisfies the desired approximate Markov blanket property: for any ε > 0, if a candidate set looks like a Markov blanket under the empirical conditional entropy criterion, then it can be used as a Markov blanket for learning with small additional error.
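
A minimal sketch (not the paper's implementation) of the empirical conditional entropy criterion: score a candidate Markov blanket Y for a set of variables C by the plug-in estimate of H(XC | XY) computed from samples, and prefer candidates with a smaller score. Variables are assumed discrete, and samples is assumed to be a list of full instantiations (tuples).

from collections import Counter
import math

def empirical_conditional_entropy(samples, C, Y):
    # Plug-in estimate of H(X_C | X_Y) in nats from a list of full instantiations.
    n = len(samples)
    joint = Counter((tuple(s[i] for i in C), tuple(s[i] for i in Y)) for s in samples)
    marg_y = Counter(tuple(s[i] for i in Y) for s in samples)
    h = 0.0
    for (c_val, y_val), count in joint.items():
        h -= (count / n) * math.log(count / marg_y[y_val])   # -P(c, y) * log P(c | y)
    return h

def best_candidate_blanket(samples, C, candidates):
    # Return the candidate set Y (a tuple of variable indices) minimizing the empirical H(X_C | X_Y).
    return min(candidates, key=lambda Y: empirical_conditional_entropy(samples, C, Y))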

  16. Structure learning Structure learning algorithm • Assume factor size ≤ k, Markov blanket size ≤ b. • For all subsets of variables Cj* of size ≤ k: • Find the Markov blankets from the empirical conditional entropy. • Estimate the Markov blanket canonical factors from data (parameter learning). • Discard factors that are close to the trivial "all ones" factor (simplify structure). • Return the remaining factors as the learned factor graph.
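
A high-level sketch of this loop under the stated assumptions (factor size ≤ k, Markov blanket size ≤ b, discrete variables); conditional_entropy, estimate_mb_factor, and is_close_to_trivial are placeholders for the empirical entropy score, the Markov blanket canonical factor estimator, and a hypothetical threshold test on a factor table.

import itertools

def learn_structure(samples, n, k, b, conditional_entropy, estimate_mb_factor, is_close_to_trivial):
    learned = {}
    for size in range(1, k + 1):
        for C in itertools.combinations(range(n), size):
            # Step 1: pick an approximate Markov blanket by minimizing empirical conditional
            # entropy over candidate sets of size <= b that are disjoint from C.
            rest = [i for i in range(n) if i not in C]
            candidates = [Y for r in range(b + 1) for Y in itertools.combinations(rest, r)]
            MB = min(candidates, key=lambda Y: conditional_entropy(samples, C, Y))
            # Step 2: estimate the Markov blanket canonical factor for C from the data.
            factor = estimate_mb_factor(C, MB, samples)
            # Step 3: keep only factors that are not close to the trivial all-ones factor.
            if not is_close_to_trivial(factor):
                learned[C] = factor
    return learned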

  17. Structure learning Structure learning theorem • Assume fixed: factor size k, Markov blanket size b. • Theorem. The structure learning algorithm • runs in polynomial time, • uses a polynomial # of samples, and guarantees: the KL divergence D(true distribution || learned distribution) is small with high probability. • Note: • Computational and sample complexity depend exponentially on the factor size and Markov blanket size. • Bounded connectivity implies bounded factor and Markov blanket size. • No dependence on tree-width of the network!

  18. Structure learning Graceful degradation • Theorem. Let G be the factor graph of the true distribution. When, in the true distribution, the max factor size > k or the max Markov blanket size > b, the additional error consists of three terms: • (1) Canonical factors capture only the residual highest-order interactions; this error is small when the true interactions of order > k are small. • (2) If the approximate Markov blankets are a good approximation of the true Markov blankets, this error will be small. • (3) An error from factors that are trivial in the true distribution but estimated as non-trivial because their Markov blanket size is larger than b.

  19. Structure learning Consequences for Bayesian networks • A Bayesian network can be written as a factor graph with 1 factor per conditional probability table; bounded fan-in and fan-out imply bounded factor size and bounded Markov blanket size. • Given samples from PBN with unknown structure, structure learning yields a factor graph distribution P with D(PBN || P) ≤ ε. • Learning a factor graph (not a Bayesian network) gives efficient learning of the distribution from finite data.
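
A small sketch of the first point: converting a Bayesian network into a factor graph with one factor per conditional probability table (CPT). The CPT representation used here (a dict mapping (x_i, x_parents) to P(x_i | x_parents)) is a made-up convention for illustration; the product of the resulting factors equals the Bayesian network's joint distribution, so the partition function is 1.

def bn_to_factor_graph(cpts):
    # cpts: dict mapping variable index i -> (parents tuple, table dict {(x_i, x_parents): prob}).
    factors = {}
    for i, (parents, table) in cpts.items():
        scope = (i,) + tuple(parents)
        # The factor over {i} and its parents takes the CPT entries as its values.
        factors[scope] = {(xi,) + tuple(x_pa): p for (xi, x_pa), p in table.items()}
    return factors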

  20. Structure learning Related work • Finding the highest scoring, bounded in-degree Bayesian network is NP-hard (Chickering, Meek & Heckerman, 2003). • Our algorithm recovers a factor graph representation only. • The (difficult) acyclicity constraint is avoided. • Learning a factor graph (not a Bayesian network) gives efficient learning of the distribution from finite data. • Note: Spirtes, Glymour & Scheines (2000) and Chickering & Meek (2002) do recover Bayesian network structure, but only with access to the true distribution (infinite sample size).

  21. Conclusion Discussion and conclusion • First polynomial time & polynomial sample complexity learning algorithm for factor graphs. • Applicable to any factor graph of bounded factor size and connectivity, • including intractable networks (e.g., grids). • Practical drawbacks of the proposed algorithm: • Parameters are estimated from only a small fraction of the data. • Structure learning: the algorithm enumerates all possible Markov blankets. • Complexity is exponential in the Markov blanket size.

  22. Done ... • Additional and outdated slides follow.

  23. Detailed theorem statements Parameter learning theorem

  24. Detailed theorem statements Structure learning theorem

  25. Learning factor graphs in polynomial time & sample complexity • Factor graphs: a superset of Markov and Bayesian networks. • Markov network (MN) → factor graph: 1 factor per clique; Bayesian network (BN) → factor graph: 1 factor per conditional probability table. • Current practice in Markov network learning: • parameter learning: max likelihood, only applicable in tractable MNs. • structure learning: local-search heuristics or heuristic learning of bounded tree-width models. No performance guarantees. • Finding the highest scoring BN is NP-hard (Chickering et al., 2003). Pieter Abbeel, Daphne Koller and Andrew Y. Ng

  26. Learning factor graphs in polynomial time & sample complexity • First polynomial time & sample complexity learning algorithm for factor graphs. • Applicable to any factor graph of bounded factor size and connectivity, • including intractable networks (e.g., grids). • New technical ideas: • Parameter learning: in closed form, using a parameterization with low-dimensional frequencies only. • Structure learning: results about guaranteed-approximate Markov blankets from sample data. Pieter Abbeel, Daphne Koller and Andrew Y. Ng

  27. Structure learning Relation to Narasimhan & Bilmes (2004) • n × n grid: treewidth = n + 1, Markov blanket size = 6. • n-star graph: treewidth = 2, Markov blanket size = n.

  28. Canonical parameterization

  29. Canonical parameterization (2)

  30. Canonical parameterization (3)

  31. Markov blanket canonical factors

  32. Markov blanket canonical parameterization

  33. Approximate Markov blankets

  34. Structure learning algorithm

  35. Structure learning algorithm
