
Learning Bayesian Network Structure from Massive Datasets: The “Sparse Candidate” Algorithm






Presentation Transcript


  1. Learning Bayesian Network Structure from Massive Datasets: The “Sparse Candidate” Algorithm Nir Friedman, Iftach Nachman, and Dana Pe’er Presenter: Kyu-Baek Hwang

  2. Abstract • Learning a Bayesian network can be posed as an optimization problem (the machine-learning view) or as constraint satisfaction (the statistics view). • The search space of structures is extremely large. • The search procedure spends most of its time examining highly unreasonable candidate structures. • If we can reduce the search space, faster learning becomes possible. • The idea: restrict, for each variable, the set of candidate parent variables. • Motivating application: bioinformatics.

  3. Learning Bayesian Network Structures • Constraint-satisfaction approach: independence tests, e.g. the χ²-test • Optimization approach: scores such as BDe or MDL; learning means finding the structure that maximizes the score • Search techniques • Finding an optimal structure is NP-hard in general • Greedy hill-climbing, simulated annealing: O(n²) candidate moves per step • When both the number of examples and the number of attributes are large, the computational cost becomes prohibitive

  4. Combining Statistical Properties • Most of the candidates considered during the search can be eliminated in advance, based on statistical understanding of the domain • If X and Y are almost independent in the data, we may decide not to consider Y as a parent of X • Dependence is measured with mutual information • Restrict the possible parents of each variable to at most k candidates, with k << n − 1 (see the sketch after this slide) • The key idea is to use the network structure found in the previous iteration to find better candidate parents
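Concretely, the first Restrict pass can be implemented as below. This is a minimal sketch, not the authors' code: mutual_information and candidate_parents are illustrative names, and the data is assumed to be a NumPy matrix of discrete values with one row per case.

```python
import numpy as np

def mutual_information(data, i, j):
    """Empirical I(Xi; Xj) from a matrix of discrete samples (rows = cases)."""
    xi, xj = data[:, i], data[:, j]
    n = len(xi)
    joint = {}
    for a, b in zip(xi, xj):
        joint[(a, b)] = joint.get((a, b), 0) + 1
    p_i = {a: np.sum(xi == a) / n for a in np.unique(xi)}
    p_j = {b: np.sum(xj == b) / n for b in np.unique(xj)}
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * np.log(p_ab / (p_i[a] * p_j[b]))
    return mi

def candidate_parents(data, k):
    """For each variable, the k most informative candidate parents."""
    n_vars = data.shape[1]
    candidates = []
    for i in range(n_vars):
        scored = [(mutual_information(data, i, j), j)
                  for j in range(n_vars) if j != i]
        scored.sort(reverse=True)
        candidates.append([j for _, j in scored[:k]])
    return candidates
```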

  5. Background • A Bayesian network over X = {X1, X2, …, Xn} is a pair B = ⟨G, Θ⟩ • The problem of learning a Bayesian network: given a training set D = {x1, x2, …, xN} of instances, find the B that best matches D • Scores such as BDe and MDL decompose over families: Score(G:D) = Σi Score(Xi | Pa(Xi) : N_{Xi, Pa(Xi)}), where N_{Xi, Pa(Xi)} are the sufficient statistics of Xi’s family • Greedy hill-climbing search: at each step, every possible local change is examined and the change that brings the maximal gain in score is applied (sketched below) • Calculating the sufficient statistics is the computational bottleneck
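To make the decomposability point concrete, here is a hedged sketch of greedy hill-climbing that caches family scores, so each arc addition or deletion re-scores only the affected child. family_score (standing in for a BDe or MDL family term) is an assumed helper; edge reversals and the acyclicity check are omitted for brevity.

```python
def hill_climb(n_vars, family_score, candidates=None):
    """Greedy hill-climbing over parent sets; acyclicity check omitted."""
    parents = [set() for _ in range(n_vars)]   # start from the empty graph
    cached = [family_score(i, parents[i]) for i in range(n_vars)]
    while True:
        best_delta, best_move = 0.0, None
        for child in range(n_vars):
            pool = candidates[child] if candidates else range(n_vars)
            for p in pool:
                if p == child:
                    continue
                # toggle the arc p -> child
                new_pa = (parents[child] - {p} if p in parents[child]
                          else parents[child] | {p})
                delta = family_score(child, new_pa) - cached[child]
                if delta > best_delta:
                    best_delta, best_move = delta, (child, new_pa)
        if best_move is None:                  # no improving move left
            return parents
        child, new_pa = best_move
        parents[child] = new_pa
        cached[child] += best_delta
```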

  6. Simple Intuitions • Using mutual information or correlation • If the true structure is X → Y → Z, then I(X;Z) > 0, I(Y;Z) > 0, I(X;Y) > 0, and I(X;Z | Y) = 0 • Basic idea of the “Sparse Candidate” algorithm: for each variable X, find a set of variables Y1, Y2, …, Yk that are the most promising candidate parents for X • This yields a much smaller search space • The main drawback: a mistake in the initial stage can lead to an inferior-scoring network • The remedy: iterate the basic procedure, using the previously constructed network to reconsider the candidate parents

  7. Outline of the Sparse Candidate Algorithm • The algorithm alternates a Restrict step, which chooses a small candidate parent set Ci for each variable Xi, and a Maximize step, which searches for the highest-scoring network whose parents are drawn from those sets, repeating until the score stops improving. A sketch follows.
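Since the slide's algorithm box did not survive extraction, here is a sketch of the loop as described on the surrounding slides. restrict and maximize are assumed helpers: on the first call B is None, so restrict would fall back to plain mutual information.

```python
def sparse_candidate(data, k, restrict, maximize):
    """Alternate Restrict and Maximize until the score stops improving."""
    B, prev_score = None, float("-inf")
    while True:
        C = restrict(data, B, k)       # candidate parent sets C1, ..., Cn
        B, score = maximize(data, C)   # best network with parents in the Ci
        if score <= prev_score:        # no gain: converged
            return B
        prev_score = score
```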

  8. Convergence Properties of the Sparse Candidate Algorithm • We require that in the Restrict step, the selected candidates for Xi’s parents include Xi’s current parents: Pa_Gn(Xi) ⊆ Ci^(n+1) • This requirement implies that the winning network Bn is still a legal structure in iteration n + 1, so Score(Bn+1 | D) ≥ Score(Bn | D) • Stopping criterion: Score(Bn) = Score(Bn−1)

  9. Mutual Information • I(X;Y) = Σ_{x,y} P̂(x,y) log [ P̂(x,y) / (P̂(x) P̂(y)) ], where P̂ is the empirical distribution • Example (figure: a small example network over A, B, C, D): I(A;C) > I(A;D) > I(A;B)

  10. Discrepancy Test • The initial iteration uses mutual information; later iterations use a discrepancy measure that compares the empirical distribution with the distribution predicted by the current network, so that dependencies the network already explains are not re-selected (a sketch follows).
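A hedged sketch of such a test: the KL divergence between the empirical pairwise joint and the pairwise joint the current network assigns. The empirical_joint and model_joint dictionaries are assumed inputs, not an API from the paper.

```python
import math

def discrepancy(empirical_joint, model_joint, eps=1e-12):
    """D_KL(P_hat || P_B) over one pair's joint; eps guards empty model cells."""
    d = 0.0
    for pair, p_hat in empirical_joint.items():
        if p_hat > 0:
            d += p_hat * math.log(p_hat / max(model_joint.get(pair, 0.0), eps))
    return d
```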

  11. Other Tests • Conditional mutual information, e.g. I(Xi; Xj | Xi’s current parents), which discounts dependencies already “shielded” by the network (sketched below) • Measures that penalize structures with more parameters
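For the first of these, a minimal sketch of an empirical conditional mutual information estimator for discrete data; the argument names are illustrative, and in practice zs would encode Xi's current parent configuration.

```python
from collections import Counter
import math

def conditional_mi(xs, ys, zs):
    """Empirical I(X; Y | Z) for three aligned sequences of discrete values."""
    n = len(xs)
    c_xyz = Counter(zip(xs, ys, zs))
    c_xz = Counter(zip(xs, zs))
    c_yz = Counter(zip(ys, zs))
    c_z = Counter(zs)
    cmi = 0.0
    for (x, y, z), c in c_xyz.items():
        # p(x,y,z) * log [ p(x,y,z) p(z) / (p(x,z) p(y,z)) ], in counts
        cmi += (c / n) * math.log(c * c_z[z] / (c_xz[(x, z)] * c_yz[(y, z)]))
    return cmi
```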

  12. Learning with Small Candidate Sets • Standard heuristics • Unconstrained • Space: O(C(n, k)) parent sets per variable • Time: O(n²) moves per step • Constrained by small candidate sets • Space: O(2^k) parent sets per variable • Time: O(kn) moves per step (compared below) • Divide-and-conquer heuristics
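For a feel of the gap, a small illustrative computation (n = 100 and k = 5 are arbitrary choices, not figures from the paper):

```python
from math import comb

n, k = 100, 5
unconstrained = sum(comb(n - 1, i) for i in range(k + 1))  # any <= k of n - 1
constrained = 2 ** k                                       # subsets of a size-k Ci
print(unconstrained, constrained)  # ~75 million vs. 32
```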

  13. Strongly Connected Components • Decomposing H, the digraph of candidate-parent edges, into strongly connected components takes linear time (see the sketch below).
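For completeness, a sketch of Tarjan's linear-time SCC decomposition; graph is assumed to map each variable to its successors in the candidate digraph H.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; returns a list of SCCs (each a list of nodes)."""
    index, low = {}, {}
    stack, on_stack, sccs = [], set(), []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v roots a strongly connected component
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in graph:
        if v not in index:
            visit(v)
    return sccs
```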

  14. Separator Decomposition (figure: a separator S splitting H into components H1 and H2, with subcomponents H’1 and H’2 and variables X, Y) • The bottleneck is S • We can order the variables in S so as to disallow any cycle in H1 ∪ H2

  15. Experiments on Synthetic Data

  16. Experiments on Real-Life Data

  17. Conclusions • Sparse candidate sets enable an efficient search for good structures • A better criterion for selecting candidate parents is still needed • The authors applied these techniques to Spellman’s cell-cycle gene-expression data • Exploiting the structure of H during the search needs to be improved
