Bayesian Decision Theory (Sections 2.1-2.2)

• Decision problem posed in probabilistic terms
• Bayesian Decision Theory: Continuous Features
• All the relevant probability values are known

Probability Density Jain CSE 802, Spring 2013

Course Outline
• Model information: complete or incomplete
• Complete: Bayes Decision Theory
• Incomplete: Supervised Learning or Unsupervised Learning, each with a Parametric Approach and a Nonparametric Approach
• Methods at the leaves of the outline: “Optimal” Rules, Plug-in Rules, Density Estimation, Geometric Rules (K-NN, MLP), Mixture Resolving, Cluster Analysis (Hard, Fuzzy)

Introduction
• From the sea bass vs. salmon example to an “abstract” decision-making problem
• State of nature; a priori (prior) probability
• The state of nature $\omega$ (which type of fish will be observed next) is unpredictable, so it is treated as a random variable: $\omega = \omega_1$ for sea bass, $\omega = \omega_2$ for salmon
• If the catch of salmon and sea bass is equiprobable:
  $P(\omega_1) = P(\omega_2)$ (uniform priors)
  $P(\omega_1) + P(\omega_2) = 1$ (exclusivity and exhaustivity)
• The prior probability reflects our prior knowledge about how likely we are to observe a sea bass or a salmon; these probabilities may depend on the time of year or the fishing area!

Bayes decision rule with only the prior information
• Decide $\omega_1$ if $P(\omega_1) > P(\omega_2)$; otherwise decide $\omega_2$
• Error rate $= \min\{P(\omega_1), P(\omega_2)\}$
• Suppose now we have a measurement or feature on the state of nature, say the fish lightness value $x$
• Use of the class-conditional probability density
• $p(x \mid \omega_1)$ and $p(x \mid \omega_2)$ describe the difference in the lightness feature between the populations of sea bass and salmon

The amount of overlap between the class-conditional densities determines the “goodness” of the feature

Maximum likelihood decision rule
• Assign input pattern $x$ to class $\omega_1$ if $p(x \mid \omega_1) > p(x \mid \omega_2)$; otherwise assign it to $\omega_2$
• How does the feature $x$ influence our attitude (prior) concerning the true state of nature?
• Bayes decision rule

Posterior probability, likelihood, evidence
• Joint density: $p(\omega_j, x) = P(\omega_j \mid x)\, p(x) = p(x \mid \omega_j)\, P(\omega_j)$
• Bayes formula: $P(\omega_j \mid x) = \dfrac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)}$, where $p(x) = \sum_{j=1}^{2} p(x \mid \omega_j)\, P(\omega_j)$
• Posterior = (Likelihood × Prior) / Evidence
• The evidence $p(x)$ can be viewed as a scale factor that guarantees that the posterior probabilities sum to 1
• $p(x \mid \omega_j)$ is called the likelihood of $\omega_j$ with respect to $x$; other things being equal, the category $\omega_j$ for which $p(x \mid \omega_j)$ is large is more likely to be the true category
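
The Bayes formula can be checked numerically. The following is a minimal Python sketch and is not part of the original slides: the Gaussian lightness models, their parameters, the priors, and the helper names `gaussian_pdf` and `posteriors` are all assumptions chosen for illustration.

```python
import math

def gaussian_pdf(x, mean, std):
    """Univariate normal density N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posteriors(likelihoods, priors):
    """Bayes formula: P(w_j | x) = p(x | w_j) P(w_j) / p(x)."""
    joint = [lik * prior for lik, prior in zip(likelihoods, priors)]
    evidence = sum(joint)  # p(x) = sum_j p(x | w_j) P(w_j)
    return [j / evidence for j in joint]

# Hypothetical class-conditional models (assumed, not from the slides):
# class 1 lightness ~ N(4, 1^2), class 2 lightness ~ N(7, 1.5^2).
priors = [2.0 / 3.0, 1.0 / 3.0]
x = 5.5
likelihoods = [gaussian_pdf(x, 4.0, 1.0), gaussian_pdf(x, 7.0, 1.5)]
post = posteriors(likelihoods, priors)

# Decide the class with the larger posterior.
decision = 1 if post[0] > post[1] else 2
```

With these made-up numbers the likelihood alone favours class 2, but the larger prior of class 1 tips the posterior the other way, which is the difference between the maximum likelihood rule and the Bayes rule.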

• $P(\omega_1 \mid x)$ is the probability of the state of nature being $\omega_1$ given that feature value $x$ has been observed
• A decision based on the posterior probabilities is called the optimal Bayes decision rule. For a given observation (feature value) $x$:
  if $P(\omega_1 \mid x) > P(\omega_2 \mid x)$, decide $\omega_1$
  if $P(\omega_1 \mid x) < P(\omega_2 \mid x)$, decide $\omega_2$
• To justify the above rule, calculate the probability of error:
  $P(\text{error} \mid x) = P(\omega_1 \mid x)$ if we decide $\omega_2$
  $P(\text{error} \mid x) = P(\omega_2 \mid x)$ if we decide $\omega_1$

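• So, for a given $x$, we can minimize the probability of error: decide $\omega_1$ if $P(\omega_1 \mid x) > P(\omega_2 \mid x)$; otherwise decide $\omega_2$
• Therefore: $P(\text{error} \mid x) = \min[P(\omega_1 \mid x), P(\omega_2 \mid x)]$
• Thus, for each observation $x$, the Bayes decision rule minimizes the probability of error
• Unconditional error: $P(\text{error})$ is obtained by integrating over all $x$ with respect to $p(x)$: $P(\text{error}) = \int P(\text{error} \mid x)\, p(x)\, dx$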
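
The unconditional error can be approximated numerically, since $P(\text{error} \mid x)\, p(x) = \min_j\, p(x \mid \omega_j) P(\omega_j)$. The sketch below is an illustration only: the two unit-variance Gaussian classes, the equal priors, the integration range, and the function names are assumptions, and the midpoint rule is just one simple way to do the integral.

```python
import math

def gaussian_pdf(x, mean, std):
    """Univariate normal density N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def bayes_error(pdf1, pdf2, prior1, prior2, lo, hi, steps=20000):
    """Midpoint-rule estimate of P(error) = integral of min_j p(x|w_j) P(w_j) dx."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        total += min(pdf1(x) * prior1, pdf2(x) * prior2)
    return total * h

# Assumed example: classes N(0, 1) and N(2, 1) with equal priors.
err = bayes_error(lambda x: gaussian_pdf(x, 0.0, 1.0),
                  lambda x: gaussian_pdf(x, 2.0, 1.0),
                  0.5, 0.5, lo=-10.0, hi=12.0)

# For this symmetric case the decision boundary is x = 1 and the exact
# error is the standard normal tail probability beyond 1.
exact = 0.5 * math.erfc(1.0 / math.sqrt(2.0))
```

In this symmetric setup the numerical estimate can be compared against the closed-form tail probability, which gives a quick sanity check on the integration.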

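Optimal Bayes decision rule
• Decide $\omega_1$ if $P(\omega_1 \mid x) > P(\omega_2 \mid x)$; otherwise decide $\omega_2$
• Special cases:
  (i) $P(\omega_1) = P(\omega_2)$: decide $\omega_1$ if $p(x \mid \omega_1) > p(x \mid \omega_2)$, otherwise $\omega_2$
  (ii) $p(x \mid \omega_1) = p(x \mid \omega_2)$: decide $\omega_1$ if $P(\omega_1) > P(\omega_2)$, otherwise $\omega_2$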

Bayesian Decision Theory: Continuous Features
• Generalization of the preceding formulation
• Use of more than one feature ($d$ features)
• Use of more than two states of nature ($c$ classes)
• Allowing other actions besides deciding on the state of nature
• Introduce a loss function which is more general than the probability of error

• Allowing actions other than classification primarily allows the possibility of rejection
• Rejection means refusing to make a decision when it is difficult to decide between two classes or in noisy cases!
• The loss function specifies the cost of each action

• Let $\{\omega_1, \omega_2, \ldots, \omega_c\}$ be the set of $c$ states of nature (or “categories”)
• Let $\{\alpha_1, \alpha_2, \ldots, \alpha_a\}$ be the set of $a$ possible actions
• Let $\lambda(\alpha_i \mid \omega_j)$ be the loss incurred for taking action $\alpha_i$ when the true state of nature is $\omega_j$
• The general decision rule $\alpha(x)$ specifies which action to take for every possible observation $x$

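• For a given $x$, suppose we take the action $\alpha_i$; if the true state is $\omega_j$, we incur the loss $\lambda(\alpha_i \mid \omega_j)$
• $P(\omega_j \mid x)$ is the probability that the true state is $\omega_j$, but any one of the $c$ states is possible for the given $x$
• Conditional risk: $R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x)$
• Overall risk: $R$ is the expected value of $R(\alpha(x) \mid x)$ with respect to $p(x)$, that is, $R = \int R(\alpha(x) \mid x)\, p(x)\, dx$
• Minimizing $R$ is achieved by minimizing the conditional risk $R(\alpha_i \mid x)$ over $i = 1, \ldots, a$ for every $x$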

• Select the action $\alpha_i$ for which $R(\alpha_i \mid x)$ is minimum
• The overall risk $R$ is then minimized, and the resulting risk is called the Bayes risk; it is the best performance that can be achieved!
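
The minimum-risk rule can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the loss table, the cost of 0.25 for a third "reject" action, the posterior values, and the function names are not taken from the slides.

```python
def conditional_risks(loss, posteriors):
    """R(a_i | x) = sum_j loss[i][j] * P(w_j | x), one value per action."""
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, posteriors))
            for row in loss]

def bayes_action(loss, posteriors):
    """Return (index of the minimum-risk action, list of all conditional risks)."""
    risks = conditional_risks(loss, posteriors)
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks

# Rows are actions, columns are true states (hypothetical values):
# action 0 = decide class 1, action 1 = decide class 2, action 2 = reject.
loss = [
    [0.0, 1.0],
    [1.0, 0.0],
    [0.25, 0.25],
]

ambiguous_action, ambiguous_risks = bayes_action(loss, [0.6, 0.4])
confident_action, confident_risks = bayes_action(loss, [0.9, 0.1])
```

Under these assumed costs, an ambiguous observation is rejected because the fixed rejection cost is lower than the risk of either classification, while a confident observation is classified.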

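Two-category classification
• $\alpha_1$: deciding $\omega_1$; $\alpha_2$: deciding $\omega_2$
• $\lambda_{ij} = \lambda(\alpha_i \mid \omega_j)$: loss incurred for deciding $\omega_i$ when the true state of nature is $\omega_j$
• Conditional risk:
  $R(\alpha_1 \mid x) = \lambda_{11} P(\omega_1 \mid x) + \lambda_{12} P(\omega_2 \mid x)$
  $R(\alpha_2 \mid x) = \lambda_{21} P(\omega_1 \mid x) + \lambda_{22} P(\omega_2 \mid x)$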

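• The Bayes decision rule is stated as: if $R(\alpha_1 \mid x) < R(\alpha_2 \mid x)$, take action $\alpha_1$: “decide $\omega_1$”
• This results in the equivalent rule: decide $\omega_1$ if
  $(\lambda_{21} - \lambda_{11})\, p(x \mid \omega_1)\, P(\omega_1) > (\lambda_{12} - \lambda_{22})\, p(x \mid \omega_2)\, P(\omega_2)$
  and decide $\omega_2$ otherwise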

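Likelihood ratio
• Assuming $\lambda_{21} > \lambda_{11}$, the preceding rule is equivalent to the following rule: if
  $\dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \dfrac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \dfrac{P(\omega_2)}{P(\omega_1)}$
  then take action $\alpha_1$ (decide $\omega_1$); otherwise take action $\alpha_2$ (decide $\omega_2$)
• Note that the posterior probabilities are scaled by the loss differences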

Interpretation of the Bayes decision rule
• “If the likelihood ratio of class $\omega_1$ and class $\omega_2$ exceeds a threshold value (that is independent of the input pattern $x$), the optimal action is to decide $\omega_1$”
• Maximum likelihood decision rule: with a 0-1 loss function and equal class prior probabilities, the threshold value is 1
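
The threshold form of the rule can be sketched as follows. This is an illustrative sketch, not the slides' own code: the function names and the numeric losses, priors, and likelihood values are assumptions, and the loss is indexed as `loss[i][j]` for deciding class i+1 when class j+1 is true. The comparison is written as `lik1 > threshold * lik2` so that a zero likelihood for class 2 does not cause a division by zero.

```python
def lr_threshold(loss, priors):
    """Threshold (l12 - l22) / (l21 - l11) * P(w2) / P(w1) for the likelihood ratio."""
    (l11, l12), (l21, l22) = loss
    if l21 <= l11:
        raise ValueError("assumes an error on class 1 costs more than a correct decision")
    return (l12 - l22) / (l21 - l11) * priors[1] / priors[0]

def decide(lik1, lik2, loss, priors):
    """Decide class 1 if p(x|w1) / p(x|w2) exceeds the threshold, else class 2."""
    return 1 if lik1 > lr_threshold(loss, priors) * lik2 else 2

zero_one_loss = [[0.0, 1.0], [1.0, 0.0]]
costly_loss = [[0.0, 2.0], [1.0, 0.0]]   # wrongly deciding class 1 costs double
equal_priors = [0.5, 0.5]

# Same (hypothetical) likelihood values, two different loss functions.
ml_decision = decide(0.3, 0.2, zero_one_loss, equal_priors)
cautious_decision = decide(0.3, 0.2, costly_loss, equal_priors)
```

With the 0-1 loss and equal priors the threshold is 1 and the rule reduces to the maximum likelihood rule; doubling the cost of wrongly deciding class 1 raises the threshold and flips the decision for the same observation.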

Bayesian Decision Theory (Sections 2.3-2.5)
• Minimum Error Rate Classification
• Classifiers, Discriminant Functions and Decision Surfaces
• The Normal Density

Minimum Error Rate Classification
• Actions are decisions on classes: if action $\alpha_i$ is taken and the true state of nature is $\omega_j$, then the decision is correct if $i = j$ and in error if $i \neq j$
• Seek a decision rule that minimizes the probability of error, i.e. the error rate

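• Zero-one (0-1) loss function: no loss for a correct decision and a unit loss for any error:
  $\lambda(\alpha_i \mid \omega_j) = 0$ if $i = j$ and $1$ if $i \neq j$, for $i, j = 1, \ldots, c$
• The conditional risk can now be simplified as:
  $R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$
• “The risk corresponding to the 0-1 loss function is the average probability of error”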

• Minimizing the risk requires maximizing the posterior probability $P(\omega_i \mid x)$, since $R(\alpha_i \mid x) = 1 - P(\omega_i \mid x)$
• For minimum error rate: decide $\omega_i$ if $P(\omega_i \mid x) > P(\omega_j \mid x)$ for all $j \neq i$

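Decision boundaries and decision regions
• In the two-category case, decide $\omega_1$ if $\dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \theta_\lambda$, where $\theta_\lambda = \dfrac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \dfrac{P(\omega_2)}{P(\omega_1)}$
• If $\lambda$ is the 0-1 loss function, then the threshold involves only the priors: $\theta_\lambda = \dfrac{P(\omega_2)}{P(\omega_1)}$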

Classifiers, Discriminant Functions and Decision Surfaces
• There are many different ways to represent pattern classifiers; one of the most useful is in terms of discriminant functions
• The multi-category case
• Set of discriminant functions $g_i(x)$, $i = 1, \ldots, c$
• The classifier assigns a feature vector $x$ to class $\omega_i$ if $g_i(x) > g_j(x)$ for all $j \neq i$

Network Representation of a Classifier

• The Bayes classifier can be represented in this way, but the choice of discriminant functions is not unique
• $g_i(x) = -R(\alpha_i \mid x)$ (max. discriminant corresponds to min. risk!)
• For the minimum error rate, we take $g_i(x) = P(\omega_i \mid x)$ (max. discriminant corresponds to max. posterior!)
• Equivalent choices that lead to the same decisions:
  $g_i(x) = p(x \mid \omega_i)\, P(\omega_i)$
  $g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$ (ln: natural logarithm!)

• The effect of any decision rule is to divide the feature space into $c$ decision regions: if $g_i(x) > g_j(x)$ for all $j \neq i$, then $x$ is in $R_i$ (region $R_i$ means assign $x$ to $\omega_i$)
• The two-category case
• Here a classifier is a “dichotomizer” that has two discriminant functions $g_1$ and $g_2$
• Let $g(x) \equiv g_1(x) - g_2(x)$; decide $\omega_1$ if $g(x) > 0$, otherwise decide $\omega_2$

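• So, a “dichotomizer” computes a single discriminant function $g(x)$ and classifies $x$ according to whether $g(x)$ is positive or not
• Computation of $g(x) = g_1(x) - g_2(x)$, using either of two equivalent forms:
  $g(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x)$
  $g(x) = \ln \dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)} + \ln \dfrac{P(\omega_1)}{P(\omega_2)}$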

The Normal Density
• Univariate density: $N(\mu, \sigma^2)$
• The normal density is analytically tractable
• Continuous density
• A number of processes are asymptotically Gaussian
• Patterns (e.g., handwritten characters, speech signals) can be viewed as randomly corrupted versions of a single typical or prototype pattern (Central Limit Theorem)
  $p(x) = \dfrac{1}{\sqrt{2\pi}\, \sigma} \exp\left[ -\dfrac{1}{2} \left( \dfrac{x - \mu}{\sigma} \right)^2 \right]$
  where:
  $\mu$ = mean (or expected value) of $x$
  $\sigma^2$ = variance (or expected squared deviation) of $x$

Multivariate density: $N(\mu, \Sigma)$
• Multivariate normal density in $d$ dimensions:
  $p(x) = \dfrac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\left[ -\dfrac{1}{2} (x - \mu)^t\, \Sigma^{-1} (x - \mu) \right]$
  where:
  $x = (x_1, x_2, \ldots, x_d)^t$ ($t$ stands for the transpose of a vector)
  $\mu = (\mu_1, \mu_2, \ldots, \mu_d)^t$ is the mean vector
  $\Sigma$ is the $d \times d$ covariance matrix
  $|\Sigma|$ and $\Sigma^{-1}$ are the determinant and inverse of $\Sigma$, respectively
• The covariance matrix is always symmetric and positive semidefinite; we assume $\Sigma$ is positive definite, so the determinant of $\Sigma$ is strictly positive
• The multivariate normal density is completely specified by $d + d(d+1)/2$ parameters
• If variables $x_1$ and $x_2$ are statistically independent, then the covariance of $x_1$ and $x_2$ is zero
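
The multivariate normal density can be evaluated with nothing but the standard library. The sketch below is one possible approach and not the slides' method: it uses a Cholesky factorization (an implementation choice assumed here) to obtain both the quadratic form and the determinant, and the function names and test matrices are hypothetical.

```python
import math

def cholesky(a):
    """Lower-triangular L with L L^t = a, for a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    raise ValueError("covariance matrix is not positive definite")
                low[i][j] = math.sqrt(d)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def forward_solve(low, b):
    """Solve low * y = b for y by forward substitution."""
    y = []
    for i in range(len(b)):
        s = sum(low[i][k] * y[k] for k in range(i))
        y.append((b[i] - s) / low[i][i])
    return y

def mvn_pdf(x, mean, cov):
    """Density of N(mean, cov) at x."""
    d = len(mean)
    low = cholesky(cov)
    diff = [xi - mi for xi, mi in zip(x, mean)]
    y = forward_solve(low, diff)
    maha_sq = sum(v * v for v in y)  # (x - mean)^t cov^{-1} (x - mean)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(d))  # ln |cov|
    return math.exp(-0.5 * (d * math.log(2.0 * math.pi) + log_det + maha_sq))

# Standard bivariate normal at its mean: the density is 1 / (2 pi).
p_center = mvn_pdf([0.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Working in the log domain before the final `exp` is a common way to avoid underflow of the normalizing constant in higher dimensions.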

Multivariate Normal density
• Samples drawn from a normal population tend to fall in a single cloud or cluster; the cluster center is determined by the mean vector and its shape by the covariance matrix
• The loci of points of constant density are hyperellipsoids whose principal axes are the eigenvectors of $\Sigma$

Transformation of Normal Variables
• Linear combinations of jointly normally distributed random variables are normally distributed
• A coordinate transformation can convert an arbitrary multivariate normal distribution into a spherical one
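
The two statements about transformations can be written out as a short worked derivation. The notation is an assumption borrowed from the standard textbook treatment rather than from the slides: $A$ is a $d \times k$ matrix, $\Phi$ is the matrix whose columns are the orthonormal eigenvectors of $\Sigma$, $\Lambda$ is the diagonal matrix of the corresponding eigenvalues, and $A_w$ denotes the whitening transform.

```latex
% A linear transformation of a normal vector is again normal:
x \sim N(\mu, \Sigma), \quad y = A^{t} x
\;\Longrightarrow\;
y \sim N\!\left(A^{t}\mu,\; A^{t}\Sigma A\right)

% Whitening: with \Sigma = \Phi \Lambda \Phi^{t} and \Phi^{t}\Phi = I,
A_w = \Phi \Lambda^{-1/2}
\;\Longrightarrow\;
A_w^{t}\,\Sigma\,A_w
  = \Lambda^{-1/2}\,\Phi^{t}\left(\Phi \Lambda \Phi^{t}\right)\Phi\,\Lambda^{-1/2}
  = \Lambda^{-1/2}\,\Lambda\,\Lambda^{-1/2}
  = I
```

After the transform $y = A_w^{t} x$ the covariance is the identity matrix, so the contours of constant density become hyperspheres.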

Bayesian Decision Theory (Sections 2.6-2.9)
• Discriminant Functions for the Normal Density
• Bayes Decision Theory: Discrete Features

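Discriminant Functions for the Normal Density
• Minimum error-rate classification can be achieved by the discriminant function $g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$
• In the case of multivariate normal densities, $p(x \mid \omega_i) \sim N(\mu_i, \Sigma_i)$:
  $g_i(x) = -\dfrac{1}{2} (x - \mu_i)^t\, \Sigma_i^{-1} (x - \mu_i) - \dfrac{d}{2} \ln 2\pi - \dfrac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i)$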

Case 1: $\Sigma_i = \sigma^2 I$ ($I$ is the identity matrix)
• Features are statistically independent and each feature has the same variance $\sigma^2$

• A classifier that uses linear discriminant functions is called “a linear machine”
• The decision surfaces for a linear machine are pieces of hyperplanes defined by the linear equations $g_i(x) = g_j(x)$

• The hyperplane separating $R_i$ and $R_j$ is orthogonal to the line linking the means!
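
For the case $\Sigma_i = \sigma^2 I$, dropping the terms of the normal discriminant that are the same for every class leaves a linear function $g_i(x) = w_i^t x + w_{i0}$ with $w_i = \mu_i / \sigma^2$ and $w_{i0} = -\mu_i^t \mu_i / (2\sigma^2) + \ln P(\omega_i)$. The Python sketch below illustrates this linear machine; the function names, means, priors, and variance are assumptions chosen for the example and do not come from the slides.

```python
import math

def linear_discriminant(mean, prior, sigma_sq):
    """Return (w, w0) so that g(x) = w . x + w0 when every covariance is sigma_sq * I."""
    w = [m / sigma_sq for m in mean]
    w0 = -sum(m * m for m in mean) / (2.0 * sigma_sq) + math.log(prior)
    return w, w0

def discriminant_value(x, w, w0):
    """Evaluate the linear discriminant g(x) = w . x + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def classify(x, means, priors, sigma_sq):
    """Index of the class with the largest linear discriminant."""
    scores = [discriminant_value(x, *linear_discriminant(m, p, sigma_sq))
              for m, p in zip(means, priors)]
    return max(range(len(scores)), key=scores.__getitem__)

def boundary_point(mean_i, mean_j, prior_i, prior_j, sigma_sq):
    """Point on the hyperplane g_i(x) = g_j(x) that lies on the line through the means."""
    diff = [a - b for a, b in zip(mean_i, mean_j)]
    norm_sq = sum(v * v for v in diff)
    shift = sigma_sq / norm_sq * math.log(prior_i / prior_j)
    return [0.5 * (a + b) - shift * v for a, b, v in zip(mean_i, mean_j, diff)]

# Hypothetical two-class problem in two dimensions.
means = [[0.0, 0.0], [4.0, 0.0]]
label_near_first = classify([1.0, 0.0], means, [0.5, 0.5], 1.0)
label_near_second = classify([3.0, 0.5], means, [0.5, 0.5], 1.0)
x0_equal = boundary_point(means[0], means[1], 0.5, 0.5, 1.0)
x0_skewed = boundary_point(means[0], means[1], 0.9, 0.1, 1.0)
```

With equal priors the boundary passes through the midpoint of the means, so the rule is a minimum Euclidean distance classifier; with unequal priors the boundary shifts toward the mean of the less likely class.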

Case 2: $\Sigma_i = \Sigma$ (covariance matrices of all classes are identical but otherwise arbitrary!)
• Hyperplane separating $R_i$ and $R_j$
• The hyperplane separating $R_i$ and $R_j$ is generally not orthogonal to the line between the means!
• To classify a feature vector $x$ when the priors are equal, measure the squared Mahalanobis distance $(x - \mu_i)^t\, \Sigma^{-1} (x - \mu_i)$ from $x$ to each of the $c$ means and assign $x$ to the category of the nearest mean
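
A shared-covariance classifier can be sketched as below, using the discriminant $g_i(x) = -\frac{1}{2}(x - \mu_i)^t \Sigma^{-1} (x - \mu_i) + \ln P(\omega_i)$, which reduces to the nearest-mean rule in Mahalanobis distance when the priors are equal. To stay within the standard library, the sketch is restricted to two features and inverts the covariance matrix in closed form; that restriction, the function names, and all numbers are assumptions made for illustration.

```python
import math

def inverse_2x2(m):
    """Closed-form inverse of a 2x2 positive definite matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if a <= 0.0 or det <= 0.0:
        raise ValueError("covariance matrix is not positive definite")
    return [[d / det, -b / det], [-c / det, a / det]]

def mahalanobis_sq(x, mean, cov_inv):
    """Squared Mahalanobis distance (x - mean)^t cov_inv (x - mean) in 2-D."""
    d0, d1 = x[0] - mean[0], x[1] - mean[1]
    return (d0 * (cov_inv[0][0] * d0 + cov_inv[0][1] * d1)
            + d1 * (cov_inv[1][0] * d0 + cov_inv[1][1] * d1))

def classify_shared_cov(x, means, priors, cov):
    """Index of the class maximizing -0.5 * Mahalanobis^2 + ln prior."""
    cov_inv = inverse_2x2(cov)
    scores = [-0.5 * mahalanobis_sq(x, m, cov_inv) + math.log(p)
              for m, p in zip(means, priors)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical data: the first feature varies much more than the second.
cov = [[4.0, 0.0], [0.0, 0.25]]
means = [[0.0, 0.0], [3.0, 1.0]]
x = [2.0, 0.0]

label = classify_shared_cov(x, means, [0.5, 0.5], cov)
euclid_sq = [sum((xi - mi) ** 2 for xi, mi in zip(x, m)) for m in means]
```

In this example the point is closer to the second mean in Euclidean distance but closer to the first mean in Mahalanobis distance, so the shared-covariance rule assigns it to the first class.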
