EECS 800 Research Seminar Mining Biological Data

Presentation Transcript


  1. EECS 800 Research Seminar: Mining Biological Data. Instructor: Luke Huan. Fall 2006

  2. Overview • Bayesian networks and other probabilistic graphical models

  3. Bayesian networks (informal) • A simple, graphical notation for conditional independence assertions and hence for compact specification of full joint distributions • Syntax: • a set of nodes, one per variable • a directed, acyclic graph (link ≈ "directly influences") • a conditional distribution for each node given its parents: P (Xi | Parents (Xi)) • In the simplest case, conditional distribution represented as a conditional probability table (CPT) giving the distribution over Xi for each combination of parent values
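
The CPT idea can be made concrete with a minimal Python sketch (not from the slides; the variable names and numbers below are illustrative, borrowed from the burglary example that follows):

    # Hypothetical CPT for a Boolean node Alarm with parents Burglary and Earthquake.
    # Each key is a (burglary, earthquake) parent-value combination; the value is
    # P(Alarm = true | parents). P(Alarm = false | parents) is implicitly 1 - p.
    cpt_alarm = {
        (True, True): 0.95,
        (True, False): 0.94,
        (False, True): 0.29,
        (False, False): 0.001,
    }

    def p_alarm(alarm, burglary, earthquake):
        p_true = cpt_alarm[(burglary, earthquake)]
        return p_true if alarm else 1.0 - p_true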

  4. Example • Topology of network encodes conditional independence assertions: • Weather is independent of the other variables • Toothache and Catch are conditionally independent given Cavity

  5. Example • I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes it's set off by minor earthquakes. Is there a burglar? • Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls • Network topology reflects "causal" knowledge: • A burglar can set the alarm off • An earthquake can set the alarm off • The alarm can cause Mary to call • The alarm can cause John to call

  6. Example contd.

  7. Semantics • The full joint distribution is defined as the product of the local conditional distributions: P(X1, …, Xn) = Πi=1..n P(Xi | Parents(Xi)) • e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e)
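
The product-of-CPTs semantics can be checked numerically. A sketch in Python, using the standard textbook values for the burglary network's CPTs (they are assumptions here, since the slide's figure is not reproduced in this transcript):

    # One CPT per node: each key is a tuple of parent values (empty for root nodes),
    # each value is P(X = true | parents).
    P_B = {(): 0.001}                                        # P(Burglary)
    P_E = {(): 0.002}                                        # P(Earthquake)
    P_A = {(True, True): 0.95, (True, False): 0.94,
           (False, True): 0.29, (False, False): 0.001}       # P(Alarm | B, E)
    P_J = {(True,): 0.90, (False,): 0.05}                    # P(JohnCalls | A)
    P_M = {(True,): 0.70, (False,): 0.01}                    # P(MaryCalls | A)

    def p(table, value, parents=()):
        # P(X = value | parents) for a Boolean variable.
        pt = table[parents]
        return pt if value else 1.0 - pt

    # P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
    joint_value = (p(P_J, True, (True,)) * p(P_M, True, (True,)) *
                   p(P_A, True, (False, False)) * p(P_B, False) * p(P_E, False))
    print(joint_value)   # ≈ 6.3e-07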

  8. Inference • Given the data that “neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call”, how do we make a decision about the following four possible explanations: • Nothing at all • Burglary but not Earthquake • Earthquake but not Burglary • Burglary and Earthquake
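
One way to score these four explanations is inference by enumeration: for each (Burglary, Earthquake) combination, sum the full joint over the hidden Alarm variable with the evidence JohnCalls = true, MaryCalls = false fixed, then normalize. A sketch that continues the hypothetical CPTs defined above:

    from itertools import product

    def full_joint(b, e, a, j, m):
        # Product of the local CPTs, as on the semantics slide.
        return (p(P_B, b) * p(P_E, e) * p(P_A, a, (b, e)) *
                p(P_J, j, (a,)) * p(P_M, m, (a,)))

    scores = {}
    for b, e in product([False, True], repeat=2):
        # Marginalize out Alarm; John called (j=True), Mary did not (m=False).
        scores[(b, e)] = sum(full_joint(b, e, a, True, False) for a in (False, True))

    total = sum(scores.values())
    for (b, e), s in scores.items():
        print(f"Burglary={b}, Earthquake={e}: posterior {s / total:.4f}")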

  9. Learning • Suppose we only have a joint distribution; how do we “learn” the topology of a BN?
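
The slide leaves this as an open question; one common ingredient (an assumption here, not the slide's answer) is to test candidate conditional independencies directly against the joint distribution, since the topology must respect them. A toy Python sketch over a joint table stored as a dict from full assignments to probabilities:

    def conditional(joint_table, target, given):
        # Estimate P(X_target | X_given) from the joint table.
        # joint_table: {(x_1, ..., x_n): probability}; target and given are variable indices.
        num, den = {}, {}
        for assignment, prob in joint_table.items():
            g = tuple(assignment[i] for i in given)
            key = (assignment[target],) + g
            num[key] = num.get(key, 0.0) + prob
            den[g] = den.get(g, 0.0) + prob
        return {k: v / den[k[1:]] for k, v in num.items() if den[k[1:]] > 0}

    def indep_given(joint_table, x, y, z, tol=1e-9):
        # True if X_x is conditionally independent of X_y given the variables in z.
        with_y = conditional(joint_table, x, (y,) + tuple(z))
        without_y = conditional(joint_table, x, tuple(z))
        return all(abs(v - without_y[(k[0],) + k[2:]]) < tol for k, v in with_y.items())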

  10. Application: Clustering Users • Input: TV shows that each user watches • Output: TV show “clusters” • Assumption: shows watched by same users are similar • Class 1 • Power rangers • Animaniacs • X-men • Tazmania • Spider man • Class 2 • Young and restless • Bold and the beautiful • As the world turns • Price is right • CBS eve news • Class 3 • Tonight show • Conan O’Brien • NBC nightly news • Later with Kinnear • Seinfeld • Class 5 • Seinfeld • Friends • Mad about you • ER • Frasier • Class 4 • 60 minutes • NBC nightly news • CBS eve news • Murder she wrote • Matlock

  11. App.: Finding Regulatory Networks • Model: P(Level | Module, Regulators) • Expression level in each module is a function of the expression of its regulators • Question posed: what module does gene “g” belong to? • [Figure: a network over Experiment, Module, and Gene Expression variables; the expression Level is predicted from the expression levels of Regulator1, Regulator2, and Regulator3 in each experiment (regulators shown include HAP4, CMK1, BMH1, GIC2)]

  12. App.: Finding Regulatory Networks • [Figure: inferred regulatory network over numbered gene modules; regulators shown include Hap4, Cmk1, Bmh1, Xbp1, Msn4, Gat1, Tpk1, Tpk2, Sip2, and Gcn20; enriched cis-regulatory motifs shown include STRE, HAC1, GCN4, MCM1, and HSF; the legend distinguishes modules (by number), regulators that are signaling molecules, regulators that are transcription factors, inferred regulation, regulation supported in the literature, experimentally tested regulators, and enriched cis-regulatory motifs; module groups include DNA and RNA processing, energy and cAMP signaling, amino acid metabolism, and nuclear]

  13. Constructing Bayesian networks • Base: • We know the joint distribution of X = X1, …, Xn • We know the “topology” of X: for each Xi ∈ X, we know the parents of Xi • Goal: we want to create a Bayesian network that captures the joint distribution according to the topology • Theorem: such a BN exists

  14. Prove by Construction • A leaf in X is an Xi ∈ X such that Xi has no child • For each Xi: • add Xi to the network • select parents from X1, …, Xi-1 such that P(Xi | Parents(Xi)) = P(Xi | X1, …, Xi-1) • X = X – {Xi} • This choice of parents guarantees: P(X1, …, Xn) = Πi=1..n P(Xi | X1, …, Xi-1) (chain rule) = Πi=1..n P(Xi | Parents(Xi)) (by construction)

  15. Compactness • A CPT for Boolean Xi with k Boolean parents has 2^k rows for the combinations of parent values • Each row requires one number p for Xi = true (the number for Xi = false is just 1 − p) • If each variable has no more than k parents, the complete network requires O(n · 2^k) numbers • I.e., grows linearly with n, vs. O(2^n) for the full joint distribution • For the burglary net, 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 − 1 = 31)
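
The arithmetic on this slide can be reproduced with a throwaway Python snippet (the parent counts are read off the burglary network):

    # Parents per node: Burglary 0, Earthquake 0, Alarm 2, JohnCalls 1, MaryCalls 1.
    parent_counts = [0, 0, 2, 1, 1]
    bn_numbers = sum(2 ** k for k in parent_counts)     # 1 + 1 + 4 + 2 + 2 = 10
    full_joint_numbers = 2 ** len(parent_counts) - 1    # 2^5 - 1 = 31
    print(bn_numbers, full_joint_numbers)                # 10 31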

  16. Reasoning: Probability Theory • Well understood framework for modeling uncertainty • Partial knowledge of the state of the world • Noisy observations • Phenomenon not covered by our model • Inherent stochasticity • Clear semantics • Can be learned from data

  17. Probability Theory • A (discrete) probability P over (Ω, S = 2^Ω) is a mapping from elements in S to real values such that: • Ω is the set of all possible outcomes (sample space) in a probabilistic experiment; S is a set of “events” • P(α) ≥ 0 for all α ∈ S • P(Ω) = 1 • If α, β ∈ S and α ∩ β = ∅, then P(α ∪ β) = P(α) + P(β) • Conditional probability: P(α | β) = P(α ∩ β) / P(β) • Chain rule: P(α ∩ β) = P(α | β) P(β) • Bayes rule: P(α | β) = P(β | α) P(α) / P(β) • Conditional independence: α is independent of β given γ if P(α | β ∩ γ) = P(α | γ)
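
These rules are easy to sanity-check on a toy distribution; the numbers below are made up purely for illustration:

    # Toy joint over two Boolean events A and B.
    P_joint = {(True, True): 0.12, (True, False): 0.18,
               (False, True): 0.28, (False, False): 0.42}

    P_a = sum(v for (a, b), v in P_joint.items() if a)    # P(A) = 0.30
    P_b = sum(v for (a, b), v in P_joint.items() if b)    # P(B) = 0.40
    P_a_given_b = P_joint[(True, True)] / P_b              # conditional probability
    P_b_given_a = P_joint[(True, True)] / P_a

    # Chain rule: P(A ∩ B) = P(A | B) P(B)
    assert abs(P_joint[(True, True)] - P_a_given_b * P_b) < 1e-12
    # Bayes rule: P(A | B) = P(B | A) P(A) / P(B)
    assert abs(P_a_given_b - P_b_given_a * P_a / P_b) < 1e-12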

  18. Random Variables & Notation • Random variable X: a function on Ω; its distribution assigns a non-negative value to each possible value of X, and these values sum to 1 • Val(X) – set of possible values of RV X • Upper case letters denote RVs (e.g., X, Y, Z) • Upper case bold letters denote sets of RVs (e.g., X, Y) • Lower case letters denote RV values (e.g., x, y, z) • Lower case bold letters denote RV set values (e.g., x) • E.g., P(X = x); P(X) = {P(X = x) | x ∈ Val(X)}

  19. Joint Probability Distribution • Given a group of random variables X = X1, …, Xn, where each Xi takes values from a set Val(Xi), the joint probability distribution is a function that maps each element of Ω = Π Val(Xi) to a non-negative value such that the summation of all the values is 1 • For example, RV Weather takes four values “sunny, rainy, cloudy, snow” and RV Cavity takes two values “true, false”; P(Weather, Cavity) is a 4 × 2 matrix of values:

    Weather =        sunny   rainy   cloudy   snow
    Cavity = true    0.144   0.02    0.016    0.02
    Cavity = false   0.576   0.08    0.064    0.08

  20. Marginal Probability • Given a set of RVs X and its joint probabilities, the marginal probability distribution over X' ⊆ X is P(X') = Σ over all values of X \ X' of P(X), i.e. sum the joint over the variables not in X' • From the table above: P(Weather = sunny) = 0.144 + 0.576 = 0.72; P(Cavity = true) = 0.144 + 0.02 + 0.016 + 0.02 = 0.2
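
The same marginalization, written out as a small Python check over the Weather/Cavity table (the dict layout is an assumption):

    P_wc = {('sunny', True): 0.144, ('rainy', True): 0.02,
            ('cloudy', True): 0.016, ('snow', True): 0.02,
            ('sunny', False): 0.576, ('rainy', False): 0.08,
            ('cloudy', False): 0.064, ('snow', False): 0.08}

    p_sunny = sum(v for (w, c), v in P_wc.items() if w == 'sunny')   # 0.72
    p_cavity = sum(v for (w, c), v in P_wc.items() if c)             # 0.2
    print(p_sunny, p_cavity)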

  21. Independence • Two RVs X, Y are independent, denoted X ⊥ Y, if P(X, Y) = P(X) P(Y) for all values of X and Y • Conditional independence: X is independent of Y given Z if P(X | Y, Z) = P(X | Z), equivalently P(X, Y | Z) = P(X | Z) P(Y | Z)
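
Incidentally, in the table above every joint entry equals the product of its two marginals, so Weather and Cavity are independent there. A short check reusing the P_wc dict from the previous sketch:

    def marg_weather(w):
        return sum(v for (w2, c), v in P_wc.items() if w2 == w)

    def marg_cavity(c):
        return sum(v for (w, c2), v in P_wc.items() if c2 == c)

    assert all(abs(v - marg_weather(w) * marg_cavity(c)) < 1e-9
               for (w, c), v in P_wc.items())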

  22. Representing Joint Distributions • Random variables: X1, …, Xn • P is a joint distribution over X1, …, Xn • If X1, …, Xn are binary, we need 2^n parameters to describe P • Can we represent P more compactly? • Key: exploit independence properties

  23. Independent Random Variables • If X and Y are independent then: • P(X, Y) = P(X|Y)P(Y) = P(X)P(Y) • If X1, …, Xn are independent then: • P(X1, …, Xn) = P(X1)…P(Xn) • O(n) parameters • All 2^n probabilities are implicitly defined • Cannot represent many types of distributions • We may need to consider conditional independence

  24. Conditional Parameterization • S = Score on test, Val(S) = {s0, s1} • I = Intelligence, Val(I) = {i0, i1} • G = Grade, Val(G) = {g0, g1, g2} • Assume that G and S are independent given I • Joint parameterization: 2·2·3 = 12 values, so 12 − 1 = 11 independent parameters • Conditional parameterization: P(I,S,G) = P(I)P(S|I)P(G|I,S) = P(I)P(S|I)P(G|I) • P(I) – 1 independent parameter • P(S|I) – 2·1 = 2 independent parameters • P(G|I) – 2·2 = 4 independent parameters • 7 independent parameters in total

  25. Naïve Bayes Model • Class variable C, Val(C) = {c1, …, ck} • Evidence variables X1, …, Xn • Naïve Bayes assumption: evidence variables are conditionally independent given C • Applications in medical diagnosis, text classification • Used as a classifier: P(C | X1, …, Xn) ∝ P(C) Πi P(Xi | C) • Problem: double counting correlated evidence
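
A minimal naïve Bayes classifier sketch in Python, assuming Boolean evidence variables; the class names and probabilities are illustrative only:

    import math

    def naive_bayes_posterior(prior, likelihoods, evidence):
        # prior: {class: P(C = class)}
        # likelihoods: {class: [P(X_i = true | class) for each evidence variable]}
        # evidence: [bool, ...] observed values of X_1..X_n
        scores = {}
        for c, p_c in prior.items():
            log_score = math.log(p_c)
            for p_xi, x in zip(likelihoods[c], evidence):
                log_score += math.log(p_xi if x else 1.0 - p_xi)
            scores[c] = math.exp(log_score)
        total = sum(scores.values())
        return {c: s / total for c, s in scores.items()}   # P(C | X_1..X_n)

    print(naive_bayes_posterior(
        prior={'c1': 0.6, 'c2': 0.4},
        likelihoods={'c1': [0.9, 0.2, 0.7], 'c2': [0.3, 0.8, 0.4]},
        evidence=[True, False, True]))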

  26. Bayesian Network: A Formal Study • A Bayesian network on a group of random variables X = X1, …, Xn is a tuple (T, P) such that • the topology T ⊆ X × X is a directed acyclic graph • P is a joint distribution such that for all i ∈ [1, n] and all possible values xi and xS: P(Xi = xi | XS = xS) = P(Xi = xi | Parents(Xi) = xS), where S is the set of non-descendants of Xi in X • In other words, Xi is conditionally independent of any of its non-descendant variables given Parents(Xi)

  27. Factorization Theorem • If G is an Independence-Map (I-map) of P, then P(X1, …, Xn) = Πi P(Xi | Pa(Xi)) • Proof: • Let X1, …, Xn be an ordering consistent with G • By the chain rule: P(X1, …, Xn) = Πi P(Xi | X1, …, Xi-1) • Since G is an I-Map, (Xi ⊥ NonDesc(Xi) | Pa(Xi)) ∈ I(P) • Because the ordering is consistent with G, {X1, …, Xi-1} ⊆ NonDesc(Xi) and Pa(Xi) ⊆ {X1, …, Xi-1}, so P(Xi | X1, …, Xi-1) = P(Xi | Pa(Xi))
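
The two steps of the proof, written out in LaTeX for an ordering X_1, …, X_n consistent with G:

    \begin{align*}
    P(X_1,\dots,X_n) &= \prod_{i=1}^{n} P(X_i \mid X_1,\dots,X_{i-1})
        && \text{(chain rule)} \\
    &= \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i))
        && \text{(}(X_i \perp \mathrm{NonDesc}(X_i) \mid \mathrm{Pa}(X_i)) \in I(P)
           \text{ and } \{X_1,\dots,X_{i-1}\} \subseteq \mathrm{NonDesc}(X_i)\text{)}
    \end{align*}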

  28. Factorization Implies I-Map • If P factorizes according to G, i.e. P(X1, …, Xn) = Πi P(Xi | Pa(Xi)), then G is an I-Map of P • Proof: • Need to show that P(Xi | ND(Xi)) = P(Xi | Pa(Xi)) • where D is the set of descendants of node i, and ND is the set of all nodes except i and D

  29. Probabilistic Graphical Models • Tool for representing complex systems and performing sophisticated reasoning tasks • Fundamental notion: Modularity • Complex systems are built by combining simpler parts • Why have a model? • Compact and modular representation of complex systems • Ability to execute complex reasoning patterns • Make predictions • Generalize from particular problem

  30. Probabilistic Graphical Models • Increasingly important in Machine Learning • Many classical probabilistic problems in statistics, information theory, pattern recognition, and statistical mechanics are special cases of the formalism • Graphical models provide a common framework • Advantage: specialized techniques developed in one field can be transferred between research communities

  31. Representation: Graphs • Intuitive data structure for modeling highly-interacting sets of variables • Explicit model for modularity • Data structure that allows for design of efficient general-purpose algorithms

  32. Reference • “Bayesian Networks and Beyond”, Daphne Koller (Stanford) & Nir Friedman (Hebrew U.)
