
Bayesian Networks I: Static Models & Multinomial Distributions By Peter Woolf (pwoolf@umich.edu)


Presentation Transcript


  1. Bayesian Networks I: Static Models & Multinomial Distributions. By Peter Woolf (pwoolf@umich.edu), University of Michigan. Michigan Chemical Process Dynamics and Controls Open Textbook, version 1.0, Creative Commons.

  2. Workflow: • Existing plant measurements • Physics, chemistry, and chemical engineering knowledge & intuition • Bayesian network models to establish connections • Patterns of likely causes & influences • Efficient experimental design to test combinations of causes • ANOVA & probabilistic models to eliminate irrelevant or uninteresting relationships • Process optimization (e.g. controllers, architecture, unit optimization, sequencing, and utilization) • Dynamical process modeling

  3. More scenarios where Bayesian Networks can help • Inferential sensing: how do you sense the state of something you don’t see? • Sensor redundancy: if multiple sensors disagree, what can you say about the state of the system? • Noisy systems: if your system is highly variable, how can you model it?

  4. Stages of knowing a model: • Topology and parameters are known (e.g. solve a given ODE) • Topology is known and we have data to learn parameters (e.g. fit parameters to an ODE using optimization) • Only data are known, must learn topology and parameters • Only partial data are known, must learn topology and parameters • Model is unknown and nonstationary (more research needed). Each stage is more realistic than the last; Bayesian networks address the stages where topology and/or parameters must be learned from data.

  5. Probability tables for a two-node network (nodes A and B). Note: rows sum to 1, but columns don’t.
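
A minimal sketch of such a table, assuming a two-state parent A and child B (the numbers below are illustrative placeholders, not the values from the slide):

```python
# Hypothetical conditional probability table P(B | A) for a two-node network.
# Each row (one value of A) is a distribution over B and must sum to 1;
# the columns (a fixed value of B across rows) need not sum to 1.
cpt_B_given_A = {
    "high": {"on": 0.3, "off": 0.7},
    "low":  {"on": 0.8, "off": 0.2},
}

for a, row in cpt_B_given_A.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9, f"row for A={a} must sum to 1"
```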

  6. Bayesian Networks (example network with P(C+) = P(C-) = 0.5) • Graphical form of Bayes’ Rule • Conditional independence • Decomposition of joint probability: P(C+, S-, R+, W+) = P(C+) P(S-|C+) P(R+|C+) P(W+|S-,R+) • Causal networks • Inference on a network vs. inference of a network
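
The decomposition is just a product of numbers looked up from the network’s probability tables. A minimal sketch, assuming placeholder values for every term except P(C+) = 0.5 (the only number preserved in the transcript):

```python
# Joint probability by the chain-rule decomposition on the slide:
#   P(C+, S-, R+, W+) = P(C+) * P(S-|C+) * P(R+|C+) * P(W+|S-,R+)
p_C = 0.5              # P(C+), from the slide
p_S_given_C = 0.9      # P(S-|C+), assumed placeholder
p_R_given_C = 0.8      # P(R+|C+), assumed placeholder
p_W_given_SR = 0.9     # P(W+|S-,R+), assumed placeholder

p_joint = p_C * p_S_given_C * p_R_given_C * p_W_given_SR
print(p_joint)         # probability of the single joint state (C+, S-, R+, W+)
```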

  7. Inference on a network • Exact vs. approximate calculation: • In some cases you can exactly calculate probabilities on a BN given some data. This can be done directly, or with fairly complex algorithms for faster execution. • For large networks, exact inference is impractical.

  8. Inference on a network (nodes A and B). Given a value of A, say A=high, what is B? P(B=on) = 0.3, P(B=off) = 0.7. The answer is a probability!

  9. Inference on a network (nodes A and B). Given a value of B, say B=on, what is A? This is what GeNIe is doing in the wiki examples.
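
Inference in this direction runs Bayes’ rule backward through the arrow. A minimal sketch, reusing the hypothetical table from the slide 5 sketch and an assumed prior over A (neither is from the lecture):

```python
# P(A | B=on) is proportional to P(B=on | A) * P(A)
prior_A = {"high": 0.6, "low": 0.4}           # assumed prior P(A)
p_B_on_given_A = {"high": 0.3, "low": 0.8}    # assumed P(B=on | A)

unnormalized = {a: p_B_on_given_A[a] * prior_A[a] for a in prior_A}
total = sum(unnormalized.values())
posterior_A = {a: v / total for a, v in unnormalized.items()}
print(posterior_A)  # {'high': 0.36, 'low': 0.64}; the answer is again a distribution
```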

  10. Inference on a network • Approximate inference via Markov Chain Monte Carlo sampling • Given partial data, use your conditional probabilities to sample a value around the observed values and head nodes • Repeat, sampling outward, until you fill the network. • Start over and gather averages.

  11. Inference on a network • Approximate inference via Markov Chain Monte Carlo sampling • Given partial data, use your conditional probabilities to sample a value around the observed values and head nodes • Repeat, sampling outward, until you fill the network. • Start over and gather averages. [Figure: network being filled in; * = observed data, e1 = sample estimates in round 1]

  12. Inference on a network • Approximate inference via Markov Chain Monte Carlo sampling • Given partial data, use your conditional probabilities to sample a value around the observed values and head nodes • Repeat, sampling outward, until you fill the network. • Start over and gather averages. • The method always works in the limit of infinite samples… [Figure: network after two rounds; * = observed data, e1, e2 = sample estimates in rounds 1 and 2]
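
A minimal sketch of the idea using Gibbs sampling (one common form of MCMC) on a hypothetical three-node chain A -> B -> C with C observed; all probability tables are made-up placeholders, not the lecture’s example:

```python
import random

p_A = {"+": 0.5, "-": 0.5}                                            # P(A)
p_B_given_A = {"+": {"+": 0.7, "-": 0.3}, "-": {"+": 0.2, "-": 0.8}}  # P(B|A), indexed [a][b]
p_C_given_B = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.4, "-": 0.6}}  # P(C|B), indexed [b][c]

def sample(weights):
    """Draw a key from an unnormalized {value: weight} dictionary."""
    r = random.random() * sum(weights.values())
    for k, w in weights.items():
        r -= w
        if r <= 0:
            return k
    return k

c_obs = "+"                      # the observed (partial) data: C = +
a, b = "+", "+"                  # arbitrary starting values for the unobserved nodes
counts = {"+": 0, "-": 0}
burn_in, n_samples = 1000, 20000

for i in range(burn_in + n_samples):
    # Resample each unobserved node from its conditional given the current values of its neighbors.
    a = sample({x: p_A[x] * p_B_given_A[x][b] for x in p_A})
    b = sample({x: p_B_given_A[a][x] * p_C_given_B[x][c_obs] for x in ("+", "-")})
    if i >= burn_in:
        counts[a] += 1           # gather averages after an initial burn-in

print({k: v / n_samples for k, v in counts.items()})   # approximates P(A | C=+)
```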

  13. Example scenario. This can be interpreted as a Bayesian network! The network is the same as saying: [expression shown on the slide]

  14. Recall: note that these are equivalence classes and are a fundamental property of observed data. Causality can only be determined from observational data to some extent! The network A->B<-C is fundamentally different (prove it to yourself with Bayes’ rule), and can be distinguished with observational data.
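
For the “prove it to yourself” step, a sketch of the standard Bayes’-rule algebra (not reproduced from the slide):

```latex
% Chain, fork, and reversed chain are algebraically interchangeable:
\begin{align*}
P(A,B,C) &= P(A)\,P(B \mid A)\,P(C \mid B)   && \text{(A $\to$ B $\to$ C)} \\
         &= P(B)\,P(A \mid B)\,P(C \mid B)   && \text{(A $\leftarrow$ B $\to$ C)} \\
         &= P(C)\,P(B \mid C)\,P(A \mid B)   && \text{(A $\leftarrow$ B $\leftarrow$ C)}
\end{align*}
% Each line follows from the previous one by Bayes' rule, e.g. P(A)P(B|A) = P(B)P(A|B).
% All three encode the same statement: A and C are independent given B.
% The collider (v-structure) factorizes differently:
\begin{align*}
P(A,B,C) &= P(A)\,P(C)\,P(B \mid A, C)       && \text{(A $\to$ B $\leftarrow$ C)}
\end{align*}
% Here A and C are marginally independent but become dependent once B is observed,
% the opposite independence pattern, so this structure can be distinguished from
% the other three using observational data alone.
```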

  15. FUNDAMENTAL PROPERTY! Equivalent models if we just observe A, B, and C. If we intervene and change A, B, or C, we can distinguish between them. OR we can use our knowledge to choose the direction. No arrangement of this last model will produce the upper 3 models.

  16. Example scenario

  17. (1) Given these data, what is the probability of observing a set of 9 temperature readings of which 4 are high, 2 are medium, and 3 are low? Note that these are independent readings and we don’t care about the ordering of the readings, just the probability of observing a set of 9 readings with this property. Here we can use the multinomial distribution and the probabilities in the table above: Compare to the binomial distribution we discussed previously (k=2)
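
The multinomial distribution referred to here is the standard formula below (the temperature probability table itself is not reproduced in this transcript):

```latex
P(n_1, n_2, \dots, n_k) \;=\; \frac{n!}{n_1!\, n_2! \cdots n_k!}\; p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k},
\qquad n = \sum_{i=1}^{k} n_i
% For this problem: n = 9 readings with n_high = 4, n_med = 2, n_low = 3, and the
% p_i taken from the temperature probability table on the slide.  With k = 2 this
% reduces to the binomial distribution discussed previously.
```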

  18. (1) Given these data, what is the probability of observing a set of 9 temperature readings of which 4 are high, 2 are medium, and 3 are low? Note that these are independent readings and we don’t care about the ordering of the readings, just the probability of observing a set of 9 readings with this property. Here we can use the multinomial distribution and the probabilities in the table above: For this problem we find:

  19. (2) After gathering these 9 temperature readings, what is the most likely next temperature reading you will see? Why? The next most likely temperature reading is medium, because this has the highest probability of 0.4. The previous sequence of temperature readings does not matter, assuming these are independent readings, as mentioned above.

  20. (3) What is the probability of sampling a set of 9 observations with 7 of them catalyst A and 2 of them catalyst B? Here again, order does not matter. Here we can use the two-state case of the multinomial distribution (the binomial distribution):
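
A minimal numerical sketch of both calculations; the category probabilities below are assumed placeholders, since the temperature and catalyst tables are not reproduced in the transcript:

```python
from math import factorial

def multinomial_pmf(counts, probs):
    """Probability of an unordered set of counts under a multinomial distribution."""
    coeff = factorial(sum(counts))
    for c in counts:
        coeff //= factorial(c)
    p = float(coeff)
    for c, pi in zip(counts, probs):
        p *= pi ** c
    return p

# Question 3 (binomial = two-state multinomial): 7 of catalyst A, 2 of catalyst B.
p_cat = [0.4, 0.6]                                  # assumed P(A), P(B)
print(multinomial_pmf([7, 2], p_cat))               # ~0.021 with these assumed values

# Question 1 (three-state multinomial): 4 high, 2 medium, 3 low readings.
p_temp = [0.35, 0.40, 0.25]                         # assumed P(high), P(med), P(low)
print(multinomial_pmf([4, 2, 3], p_temp))           # ~0.047 with these assumed values
```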

  21. (4) What is the probability of observing the following yield values? Note here we have the temperature and catalyst values, so we can use the conditional probability values. As before, the order of observations does not matter, but the association of temperature and catalyst with yield does matter. For this part, just write down the expression you would use; you don’t need to do the full calculation.

  22. Calculation method 1: First we will calculate the probability of this set for a particular ordering: The number of orderings of identical items is the factorial term in the multinomial: Thus the total probability is 0.00071048

  23. Calculation method 2: The probabilities can be interpreted here as another multinomial term. For example, for the first group of observations, we could ask: what is the probability of observing 4 high, 0 medium, and 0 low yields for a system with a high temperature and catalyst A? Using the multinomial distribution we would find: The combination term is the same, 1260. Note that this matches the result in calculation method 1 exactly. We can repeat this for the second case to find p(0H,0M,2L|T=med, Cat=B)=0.032, which is again the same as above. Taking the product of the combinations and probabilities, we find the same total probability of 0.00071048.

  24. Note that the joint probability model here is p(temperature, catalyst, yield) = p(temperature)*p(catalyst)*p(yield | temperature, catalyst) = 0.047*0.0212*0.00071 = 7.07e-7 (Note: p(temp) and p(cat) were calculated earlier in the lecture.) This term is the probability of the data given a model and parameters: P(data | model, parameters). The absolute value of this probability is not very informative by itself, but it could be if it were compared to something else.

  25. As an example, let’s say that you try another model where yield only depends on temperature. This model is shown graphically below: What is the conditional probability model? P(temperature, cat, yield) = p(temp)p(cat)p(yield | temp) (call this model 2)

  26. P(temperature, cat, yield)=p(temp)p(cat)p(yield|temp) (call this model 2) How do we change this table to get p(yield|temp)?
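
The transcript does not include the slide’s answer, but one standard way to obtain p(yield | temp) from p(yield | temp, cat), assuming catalyst and temperature are independent as in the model, is to average over the catalyst:

```latex
P(\text{yield} \mid \text{temp}) \;=\; \sum_{\text{cat}} P(\text{cat})\; P(\text{yield} \mid \text{temp}, \text{cat})
% i.e., each row of the new table is a P(cat)-weighted average of the rows of the
% original p(yield | temp, cat) table that share the same temperature.
```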

  27. Now what?

  28. So which model is better?

  29. A Bayes factor (BF) is like a p-value in probability or Bayesian terms. BF near 1 = both models are nearly equal. BF far from 1 = the models are different.
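
A minimal sketch of the comparison: the Bayes factor is just the ratio of the probability of the data under the two models. The model-1 value is taken from slide 24; the model-2 value below is an assumed placeholder, since the transcript does not include it:

```python
p_data_model1 = 7.07e-7   # P(data | model 1): yield depends on temperature and catalyst (slide 24)
p_data_model2 = 3.5e-7    # P(data | model 2): yield depends on temperature only (assumed placeholder)

bayes_factor = p_data_model1 / p_data_model2
print(bayes_factor)       # near 1: the models explain the data about equally well
                          # far from 1: one model is clearly favored
```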

  30. Limitations: • The analysis is based on only 9 data points. Even so, this is useful for identifying unusual behavior: for example, in this case we might conclude that catalysts A and B still have distinct properties even though, say, they have been recycled many times. • We don’t always have parameters like the truth table to start with.

  31. Constraints: • There are a total of 100 samples drawn, thus 100 = H + M + L • For the maximum likelihood case, H = 51, so the relationship between M and L is 100 = 51 + M + L → M = 49 - L • At some lower value of H we get the expression M = (100 - H) - L. Integrate by summing (see the sketch below)! 51 H, 8 M, and 41 L. [Figure: allowed region in the (L, M) plane]
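
A sketch of “integrate by summing”: enumerate every (H, M, L) combination allowed by the constraint H + M + L = 100 and accumulate the multinomial probability of each. The category probabilities and the region of interest below are assumed placeholders, since the transcript does not state them:

```python
from math import factorial

def multinomial_pmf(counts, probs):
    """P(counts) under a multinomial with the given category probabilities."""
    coeff = factorial(sum(counts))
    for c in counts:
        coeff //= factorial(c)
    p = float(coeff)
    for c, pi in zip(counts, probs):
        p *= pi ** c
    return p

p = [0.5, 0.1, 0.4]                # assumed P(H), P(M), P(L); not the lecture's values

# "Integrate by summing": add up the probability of every allowed (H, M, L)
# combination in the region of interest (here, illustratively, H >= 51).
total = 0.0
for H in range(51, 101):           # region of interest (assumed)
    for L in range(0, 101 - H):
        M = 100 - H - L            # constraint: the three counts must sum to 100
        total += multinomial_pmf([H, M, L], p)

print(total)
```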

  32. Take Home Messages • Using a Bayesian network you can describe complex relationships between variables • Multinomial distributions allow you to handle variables with more than 2 states • Using the rules of probability (Bayes’ rule, marginalization, and independence), you can infer states on a Bayesian network
