
Lecture 3 Bayesian Reasoning 第 3 讲 贝叶斯推理


Presentation Transcript


  1. Lecture 3 Bayesian Reasoning 第3讲 贝叶斯推理 3.1 Bayes' Rule (贝叶斯规则) 3.2 Naïve Bayes Model 3.3 Bayesian Network

  2. Sources of Uncertainty • Information is partial • Information is not fully reliable. • Representation language is inherently imprecise. • Information comes from multiple sources and it is conflicting. • Information is approximate • Non-absolute cause-effect relationships exist

  3. Sources of Uncertainty • Uncertain data (noise) • Uncertain knowledge (e.g., causal relations) • A disorder may cause any and all POSSIBLE manifestations in a specific case • A manifestation can be caused by more than one POSSIBLE disorder • Uncertain reasoning results • Abduction (溯因) and induction (归纳) are inherently uncertain • Default reasoning, even in deductive fashion, is uncertain • Incomplete deductive inference may be uncertain • Incomplete knowledge and data

  4. Probabilistic Reasoning • Evidence • What we know about a situation. • Hypothesis • What we want to conclude. • Compute • P( Hypothesis | Evidence )

  5. Bayes' Theorem • P(H | E) = P(E|H) · P(H) / P(E) This can be derived from the definition of conditional probability. (后验概率 = 先验概率 × 似然 / 证据因子) • Posterior = (Prior · Likelihood) / Evidence

  6. Bayes' Formula: P(H|E) = P(H, E) / P(E) = P(E|H) · P(H) / P(E)
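
As a quick illustration, Bayes' rule can be written as a one-line function; this is only a minimal sketch, and the numbers in the example call are made up.

```python
def posterior(prior_h, likelihood_e_given_h, evidence_e):
    """Bayes' rule: P(H|E) = P(E|H) * P(H) / P(E)."""
    return likelihood_e_given_h * prior_h / evidence_e

# Made-up numbers: P(H) = 0.1, P(E|H) = 0.8, P(E) = 0.25
print(posterior(0.1, 0.8, 0.25))  # 0.32
```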

  7. A classic application of Bayes' Theorem: the rule of succession (相继律). A shooter fires n shots at a target and hits T times. How should we estimate θ, the probability that the shooter hits the target? Is T/n the right estimate?

  8. Prior distribution (先验分布): assume θ is uniformly distributed on [0, 1], i.e. p(θ) = 1. Likelihood (似然度): given hit probability θ, the probability of hitting exactly T times in n shots is P(T | θ) = C(n, T) θ^T (1 − θ)^(n − T)

  9. Posterior probability density (后验概率密度): p(θ | T) = P(T | θ) p(θ) / P(T) ∝ θ^T (1 − θ)^(n − T), i.e. θ | T ~ Beta(T + 1, n − T + 1)

  10. Estimating θ by the expected value of the posterior distribution gives the rule of succession (相继律): θ̂ = E[θ | T] = (T + 1) / (n + 2)
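
A minimal sketch (assuming the uniform prior above) comparing the rule-of-succession estimate with the raw frequency T/n; the shot counts used here are made up for illustration.

```python
def rule_of_succession(hits, shots):
    """Posterior mean of theta under a uniform prior: (T + 1) / (n + 2)."""
    return (hits + 1) / (shots + 2)

def frequency_estimate(hits, shots):
    """The raw frequency T / n."""
    return hits / shots

# With only a few shots the two estimates differ noticeably (made-up counts):
print(frequency_estimate(3, 3))   # 1.0 -- claims the shooter never misses
print(rule_of_succession(3, 3))   # 0.8 -- moderated by the uniform prior
```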

  11. Naïve Bayes Model (朴素贝叶斯模型)

  12. Outline • Independence and Conditional Independence (条件独立) • Naïve Bayes Model • Application: Spam(垃圾邮件)Detection

  13. Probability of Events • Sample space and events • Sample space S: (e.g., all people in an area) • Events E1 ⊆ S: (e.g., all people having cough) E2 ⊆ S: (e.g., all people having cold) • Prior (marginal) probabilities of events • P(E) = |E| / |S| (frequency interpretation) • P(E) = 0.1 (subjective probability) • 0 <= P(E) <= 1 for all events • Two special events: ∅ and S: P(∅) = 0 and P(S) = 1.0 • Boolean operators between events (to form compound events) • Conjunctive (intersection): E1 ^ E2 (E1 ∩ E2) • Disjunctive (union): E1 v E2 (E1 ∪ E2) • Negation (complement): ~E (~E = S – E)

  14. [Venn diagram: sample space with events E, ~E, E1, E2, and the intersection E1 ^ E2] • Probabilities of compound events • P(~E) = 1 – P(E) because P(~E) + P(E) = 1 • P(E1 v E2) = P(E1) + P(E2) – P(E1 ^ E2) • But how to compute the joint probability P(E1 ^ E2)? • Conditional probability (of E1, given E2) • How likely E1 occurs in the subspace of E2: P(E1 | E2) = P(E1 ^ E2) / P(E2)

  15. Independence assumption • Two events E1 and E2 are said to be independent of each other if P(E1 | E2) = P(E1) (given E2 does not change the likelihood of E1) • It can simplify the computation: the joint probability factors as P(E1 ^ E2) = P(E1) P(E2) • Mutually exclusive (ME) and exhaustive (EXH) set of events E1, …, En • ME: P(Ei ^ Ej) = 0 for all i ≠ j • EXH: P(E1 v E2 v … v En) = 1

  16. Independence: Intuition • Events are independent if one has nothing whatever to do with the others. Therefore, for two independent events, knowing that one has happened does not change the probability of the other event happening. • One toss of a coin is independent of another toss (assuming it is a fair coin). • The price of tea in England is independent of the result of a general election in Canada.

  17. Independence: Definition • Events A and B are independent iff: P(A, B) = P(A) · P(B), which is equivalent to P(A|B) = P(A) and P(B|A) = P(B) when P(A, B) > 0. Example: T1: the first toss is a head; T2: the second toss is a tail. P(T2|T1) = P(T2)

  18. Conditional Independence • Dependent events can become independent given certain other events. • Example: shoe size and vocabulary size are dependent (both grow as a child gets older), but they are conditionally independent given age. • Two events A, B are conditionally independent given a third event C iff P(A|B, C) = P(A|C)

  19. Conditional Independence: Definition • Let E1 and E2 be two events; they are conditionally independent given E iff P(E1|E, E2) = P(E1|E), that is, the probability of E1 is not changed by knowing E2, given that E is true. • Equivalent formulations: P(E1, E2|E) = P(E1|E) P(E2|E) P(E2|E, E1) = P(E2|E)
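
A small sketch that checks these equivalent formulations numerically: the joint distribution below is constructed (with invented numbers) so that E1 and E2 are conditionally independent given E.

```python
import itertools

# Toy distribution: P(E) and, conditionally on E, independent E1 and E2.
p_e = {True: 0.3, False: 0.7}
p_e1_given_e = {True: 0.9, False: 0.2}   # P(E1 = True | E)
p_e2_given_e = {True: 0.6, False: 0.1}   # P(E2 = True | E)

def joint(e, e1, e2):
    """P(E, E1, E2) built as P(E) * P(E1|E) * P(E2|E)."""
    p1 = p_e1_given_e[e] if e1 else 1 - p_e1_given_e[e]
    p2 = p_e2_given_e[e] if e2 else 1 - p_e2_given_e[e]
    return p_e[e] * p1 * p2

def prob(predicate):
    """Probability of the event described by the predicate over (E, E1, E2)."""
    return sum(joint(e, e1, e2)
               for e, e1, e2 in itertools.product([True, False], repeat=3)
               if predicate(e, e1, e2))

# Check P(E1 | E, E2) == P(E1 | E) for E = True, E2 = True.
lhs = prob(lambda e, e1, e2: e and e1 and e2) / prob(lambda e, e1, e2: e and e2)
rhs = prob(lambda e, e1, e2: e and e1) / prob(lambda e, e1, e2: e)
print(round(lhs, 6), round(rhs, 6))  # both 0.9
```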

  20. Naïve Bayes Method • Knowledge Base contains • A set of hypotheses • A set of evidences • Probability of an evidence given a hypothesis • Given • A subset of the evidences known to be present in a situation • Find • the hypothesis with the highest posterior probability: P(H|E1, E2, …, Ek).

  21. Naïve Bayes Method • Assumptions • Hypotheses are exhaustive and mutually exclusive • H1 v H2 v … v Ht • ¬ (Hi ^ Hj) for any i≠j • Evidences are conditionally independent given a hypothesis • P(E1, E2,…, Ek|H) = P(E1|H)…P(Ek|H) • P(H | E1, E2,…, Ek) = P(E1, E2,…, Ek, H)/P(E1, E2,…, Ek) = P(E1, E2,…, Ek|H)P(H)/P(E1, E2,…, Ek)

  22. Naïve Bayes Method • The goal is to find H that maximize P(H | E1, E2,…, Ek)(最大后验概率 假设 ) • Since P(H | E1, E2,…, Ek) = P(E1, E2,…, Ek|H)P(H)/P(E1, E2,…, Ek) and P(E1, E2,…, Ek) is the same for different hypotheses, • Maximizing P(H | E1, E2,…, Ek) is equivalent to maximizing P(E1, E2,…, Ek|H)P(H)= P(E1|H)…P(Ek|H)P(H) • Naïve Bayes Method • Find a hypothesis that maximizes P(E1|H)…P(Ek|H)P(H)

  23. Maximum a posteriori (MAP) (最大后验概率假设): H_MAP = argmax_H P(H | E1, E2, …, Ek) = argmax_H P(E1|H) … P(Ek|H) P(H)

  24. Example: Play Tennis? Predict playing tennis when <sunny, cool, high, true> What probability should be used to make the prediction? How to compute the probability? H={+, -}

  25. Probabilities of Individual Attributes • Given the training set, we can compute the conditional probability of each attribute value for each class (+ and −), e.g. P(sunny | +) = 2/9

  26. P(+ | sunny, cool, high, true) = P(sunny, cool, high, true | +) · P(+) / P(sunny, cool, high, true) = P(sunny|+) · P(cool|+) · P(high|+) · P(true|+) · P(+) / P(sunny, cool, high, true) = (2/9) · (3/9) · (3/9) · (3/9) · (9/14) / P(sunny, cool, high, true), where P(sunny, cool, high, true) = P(sunny, cool, high, true | +) · P(+) + P(sunny, cool, high, true | −) · P(−). So P(+ | sunny, cool, high, true) = (2/9)(3/9)(3/9)(3/9)(9/14) / [(2/9)(3/9)(3/9)(3/9)(9/14) + (3/5)(1/5)(4/5)(3/5)(5/14)] ≈ 0.0053 / (0.0053 + 0.0206) ≈ 0.205
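
The arithmetic on this slide can be reproduced with a few lines; the conditional probabilities are the ones read off the training set above, and the sketch is only illustrative.

```python
# Probabilities read off the training set (see slide 25).
p_pos, p_neg = 9/14, 5/14
likelihood_pos = (2/9) * (3/9) * (3/9) * (3/9)   # P(sunny, cool, high, true | +)
likelihood_neg = (3/5) * (1/5) * (4/5) * (3/5)   # P(sunny, cool, high, true | -)

evidence = likelihood_pos * p_pos + likelihood_neg * p_neg
print(round(likelihood_pos * p_pos / evidence, 3))   # P(+ | sunny, cool, high, true) ~ 0.205
```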

  27. Exercise (思考题): compute P(− | sunny, cool, high, true)

  28. Example 2 • Suppose we have a set of data on credit authorization with 10 training instances, each classified into one of 4 classes: • C1 =authorize • C2 =authorize after identification • C3 =do not authorize • C4 =do not authorize; call the police

  29. Example 2 • Training data: [the training data table is not reproduced here; the class priors and conditional probabilities computed from it are given on the next slides]

  30. Example 2 • P(C1) = 6 / 10 = 0.6 • P(C2) = 2 / 10 = 0.2 • P(C3) = 1 / 10 = 0.1 • P(C4) = 1 / 10 = 0.1

  31. Example 2 • P(x1 = 4 | C1) = P(x1 = 4 and C1) / P(C1) = 2/6 = 0.33 • P(x1 = 3 | C1) = 2/6 = 0.33 • P(x1 = 2 | C1) = 2/6 = 0.33 • P(x1 = 1 | C1) = 0/6 = 0 • Similarly, we have • P(x1 = 2 | C2) = 1/2 = 0.5 • P(x1 = 1 | C2) = 1/2 = 0.5 • P(x1 = 3 | C3) = 1/1 = 1 • P(x1 = 1 | C4) = 1/1 = 1 • All other probabilities P(x1 | Cj) = 0

  32. Example 2 • P(x2 = “Excellent” | C1) = 3/6 = 0.5 • P(x2 = “Good” | C1) = 3/6 = 0.5 • P(x2 = “Bad” | C1) = 0/6 = 0 • Similarly, we have • P(x2 = “Good” | C2) = 1/2 = 0.5 • P(x2 = “Bad” | C2) = 1/2 = 0.5 • P(x2 = “Bad” | C3) = 1/1 = 1 • P(x2 = “Bad” | C4) = 1/1 = 1 • All other probabilities P(x2 | Cj) = 0

  33. Example 2 • Suppose now we want to classify a tuple t = {3, “Excellent”}. We have: P(t | C1) = ∏ P(xik | C1) = P(x1 = 3 | C1) P(x2 = “Excellent” | C1) = 0.33 * 0.5 = 0.17 P(t | C2) = 0 * 0 = 0 P(t | C3) = 1 * 0 = 0 P(t | C4) = 0 * 0 = 0 • P(t) = Σj P(t | Cj) P(Cj) = 0.17 * 0.6 + 0 + 0 + 0 = 0.1

  34. Example 2 • Then we can calculate P(Cj | t) for each class. • P(C1 | t) = P(t |C1) P(C1) / P(t) = 0.17 * 0.6 / 0.1 = 1 • P(C2 | t) =0 • P(C3 | t) =0 • P(C4 | t) =0 • Therefore, tuple t is classified as class C1 because it has the highest probability.

  35. Example 2 • Suppose now we want to classify another tuple t = {2, “Good”}. We have: P(t | C1) = ∏ P(xik | C1) = P(x1 = 2 | C1) P(x2 = “Good” | C1) = 0.33 * 0.5 = 0.17 P(t | C2) = 0.5 * 0.5 = 0.25 P(t | C3) = 0 * 0 = 0 P(t | C4) = 0 * 0 = 0 • P(t) = Σj P(t | Cj) P(Cj) = 0.17 * 0.6 + 0.25 * 0.2 + 0 + 0 = 0.15

  36. Example 2 • Then we can calculate P(Cj | t) for each class. • P(C1 | t) = P(t |C1) P(C1) / P(t) = 0.17 * 0.6 / 0.15 = 0.67 • P(C2 | t) = P(t |C2) P(C2) / P(t) = 0.25 * 0.2 / 0.15 = 0.33 • P(C3 | t) =0 • P(C4 | t) =0 • Therefore, tuple t is classified as class C1 because it has the highest probability.
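
A sketch of Example 2 in code: the priors and conditional probability tables from slides 30–32 are stored as dictionaries (the variable names are mine), and each tuple is classified by the naïve Bayes score P(x1|C) · P(x2|C) · P(C).

```python
# Class priors and conditional probability tables copied from the slides.
priors = {"C1": 0.6, "C2": 0.2, "C3": 0.1, "C4": 0.1}

p_x1 = {  # P(x1 = value | class); unlisted combinations are 0
    "C1": {4: 0.33, 3: 0.33, 2: 0.33},
    "C2": {2: 0.5, 1: 0.5},
    "C3": {3: 1.0},
    "C4": {1: 1.0},
}
p_x2 = {  # P(x2 = value | class)
    "C1": {"Excellent": 0.5, "Good": 0.5},
    "C2": {"Good": 0.5, "Bad": 0.5},
    "C3": {"Bad": 1.0},
    "C4": {"Bad": 1.0},
}

def classify(x1, x2):
    """Return the class maximizing P(x1|C) * P(x2|C) * P(C), plus all scores."""
    scores = {c: p_x1[c].get(x1, 0.0) * p_x2[c].get(x2, 0.0) * priors[c]
              for c in priors}
    return max(scores, key=scores.get), scores

print(classify(3, "Excellent"))  # C1, as on slide 34
print(classify(2, "Good"))       # C1, as on slide 36
```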

  37. Application: Spam Detection(垃圾邮件检测) • Spam • Dear sir, We want to transfer to overseas ($ 126,000.000.00 USD) One hundred and Twenty six million United States Dollars) from a Bank in Africa, I want to ask you to quietly look for a reliable and honest person who will be capable and fit to provide either an existing …… • Legitimate email • Ham: for lack of better name.

  38. Example 3 • Hypotheses: {Spam, Ham} • Evidence: a document • The document is treated as a set (or bag) of words • Knowledge • P(Spam) • The prior probability of an e-mail message being a spam. • How to estimate this probability? • P(w|Spam) • the probability that a word is w, given that the word is drawn from a spam message. • How to estimate this probability?

  39. Training: learn the probabilities from a set of documents D labelled with categories C
      Let V be the vocabulary of all words in the documents in D
      For each category ci ∈ C
          Let Di be the subset of documents in D in category ci
          P(ci) = |Di| / |D|
          Let Ti be the concatenation of all the documents in Di
          Let ni be the total number of word occurrences in Ti
          For each word wj ∈ V
              Let nij be the number of occurrences of wj in Ti
              Let P(wj | ci) = (nij + 1) / (ni + |V|)

  40. Classification: given a test document X, let n be the number of word occurrences in X. Return the category ci ∈ C that maximizes P(ci) ∏ P(ai | ci) over i = 1..n, where ai is the word occurring in the ith position in X
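
A compact, runnable sketch of the training and classification steps on slides 39–40, with Laplace (add-one) smoothing; the tiny corpus at the bottom is invented purely for illustration.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (list_of_words, category). Returns priors and smoothed word probs."""
    categories = {c for _, c in docs}
    vocab = {w for words, _ in docs for w in words}
    priors, word_probs = {}, {}
    for c in categories:
        in_c = [words for words, cat in docs if cat == c]
        priors[c] = len(in_c) / len(docs)
        counts = Counter(w for words in in_c for w in words)
        n_c = sum(counts.values())
        # Laplace (add-one) smoothing: (n_ij + 1) / (n_i + |V|)
        word_probs[c] = {w: (counts[w] + 1) / (n_c + len(vocab)) for w in vocab}
    return priors, word_probs

def classify(words, priors, word_probs):
    """Return argmax_c of log P(c) + sum_i log P(a_i | c); unseen words are skipped."""
    def score(c):
        return math.log(priors[c]) + sum(
            math.log(word_probs[c][w]) for w in words if w in word_probs[c])
    return max(priors, key=score)

# Invented toy corpus, for illustration only.
docs = [(["win", "money", "now"], "spam"),
        (["cheap", "money", "win"], "spam"),
        (["meeting", "schedule", "notes"], "ham")]
priors, word_probs = train(docs)
print(classify(["win", "money"], priors, word_probs))   # expected: spam
```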

  41. Minimum Description Length (MDL) (最小描述长度) • Occam’s razor (奥卡姆剃刀): “prefer the shortest (simplest) hypothesis” • MDL: prefer the hypothesis h that minimizes L_C1(h) + L_C2(D | h), the number of bits needed to describe the hypothesis plus the number of bits needed to describe the data given the hypothesis

  42. Minimum Description Length: C1 is the optimal encoding for the hypothesis h, and C2 is the optimal encoding for the data D given h

  43. To explain this, we first introduce a basic result from information theory: • Suppose we want to design a code for randomly transmitted messages, where message i occurs with probability pi • We want the most compact code, i.e. the one that minimizes the expected number of bits needed to transmit a random message • To minimize the expected code length, more probable messages must be assigned shorter codes • Shannon & Weaver (1949) showed that in an optimal code (the one minimizing the expected message length), message i is encoded with −log2 pi bits • Example: an optimal code for “cbabcccd”: the symbol probabilities are c: 1/2, b: 1/4, a: 1/8, d: 1/8, so the optimal code lengths are 1, 2, 3, and 3 bits (e.g. c = 0, b = 10, a = 110, d = 111)
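
The −log2 pi code lengths for the example string can be checked directly; this is a small illustrative sketch.

```python
import math
from collections import Counter

message = "cbabcccd"
counts = Counter(message)
total = len(message)

# Optimal per-symbol code length is -log2(p_i) bits (Shannon & Weaver).
for symbol, count in sorted(counts.items(), key=lambda kv: -kv[1]):
    p = count / total
    print(symbol, p, -math.log2(p))   # c 0.5 1.0, b 0.25 2.0, a 0.125 3.0, d 0.125 3.0

expected_bits = sum((c / total) * -math.log2(c / total) for c in counts.values())
print("expected bits per symbol:", expected_bits)   # 1.75
```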

  44. Summary • Conditional independence (条件独立) • Bayes' rule (贝叶斯规则) • Rule of succession (相继律) • Naïve Bayes method (朴素贝叶斯方法) • Maximum a posteriori hypothesis (最大后验假设) • Minimum description length (最小描述长度) • Next lecture: Bayesian networks (贝叶斯网)

  45. Bayesian networks (贝叶斯网) • Probabilistic networks (概率网) • Causal networks (因果网) • Belief networks (信度网) (different names for the same model)

  46. Probabilistic Belief • There are several possible worlds that are indistinguishable to an agent given some prior evidence. • The agent believes that a logic sentence B is True with probability p and False with probability 1-p. B is called a belief. • In the frequency interpretation of probabilities, this means that the agent believes that the fraction of possible worlds that satisfy B is p • The distribution (p, 1-p) is the strength of B

  47. Bayesian Networks: Definition • Bayesian networks are directed acyclic graphs (DAGs). • Nodes in Bayesian networks represent random variables, which are normally assumed to take on discrete values. • The links of the network represent direct probabilistic influence (直接概率影响). • The structure of the network represents the probabilistic dependence/independence relationships between the random variables represented by the nodes.

  48. Bayesian Network: Probabilities • The nodes and links are quantified with probability distributions. • The root nodes (those with no ancestors) are assigned prior probability distributions. • Each other node is assigned the conditional probability distribution of the node given its parents.
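
A minimal sketch of how these quantities might be stored for a two-node network A → B; the network and all numbers are invented for illustration.

```python
# Root node A gets a prior distribution; B gets P(B | A) for each value of A.
prior_a = {"true": 0.2, "false": 0.8}
cpt_b_given_a = {
    "true":  {"true": 0.9, "false": 0.1},   # P(B | A = true)
    "false": {"true": 0.3, "false": 0.7},   # P(B | A = false)
}

# The joint factorizes along the graph structure: P(A, B) = P(A) * P(B | A).
def joint(a, b):
    return prior_a[a] * cpt_b_given_a[a][b]

# Marginal P(B = true) by summing out A.
print(sum(joint(a, "true") for a in prior_a))   # 0.2*0.9 + 0.8*0.3 = 0.42
```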
