## Bayesian Decision Theory (Sections 2.1-2.2)


Decision problem posed in probabilistic terms (Bayesian decision theory with continuous features); all the relevant probability values are assumed known.

**Probability Density**

Jain, CSE 802, Spring 2013

**Course Outline**

MODEL INFORMATION:
• COMPLETE: Bayes Decision Theory ("optimal" rules)
• INCOMPLETE:
  • Supervised Learning: parametric approach (plug-in rules) or nonparametric approach (density estimation; geometric rules such as k-NN, MLP)
  • Unsupervised Learning: parametric approach (mixture resolving) or nonparametric approach (cluster analysis: hard, fuzzy)

**Introduction**

• From the sea bass vs. salmon example to an "abstract" decision-making problem
• State of nature; a priori (prior) probability
• The state of nature (which type of fish will be observed next) is unpredictable, so it is a random variable
• If the catch of salmon and sea bass is equiprobable, then P(ω1) = P(ω2) (uniform priors)
• P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity)
• The prior probability reflects our prior knowledge of how likely we are to observe a sea bass or a salmon; these probabilities may depend on the time of year or the fishing area!

**Bayes Decision Rule with Only the Prior Information**

• Decide ω1 if P(ω1) > P(ω2), otherwise decide ω2
• Error rate = min {P(ω1), P(ω2)}
• Suppose now we have a measurement, or feature, on the state of nature, say the fish lightness value x
• Use the class-conditional probability densities: p(x | ω1) and p(x | ω2) describe the difference in lightness between the populations of sea bass and salmon
• The amount of overlap between the densities determines the "goodness" of the feature

**Maximum Likelihood Decision Rule**

• Assign input pattern x to class ω1 if p(x | ω1) > p(x | ω2), otherwise to ω2
• How does the feature x influence our attitude (prior) concerning the true state of nature? This is answered by the Bayes decision rule.

**Posterior Probability, Likelihood, Evidence**

• p(ωj, x) = P(ωj | x) p(x) = p(x | ωj) P(ωj)
• Bayes formula: P(ωj | x) = p(x | ωj) P(ωj) / p(x), i.e., Posterior = (Likelihood × Prior) / Evidence
• The evidence p(x) can be viewed as a scale factor that guarantees that the posterior probabilities sum to 1
• p(x | ωj) is called the likelihood of ωj with respect to x; the category ωj for which p(x | ωj) is large is more likely to be the true category
• P(ω1 | x) is the probability of the state of nature being ω1 given that feature value x has been observed

**Optimal Bayes Decision Rule**

• Decision based on the posterior probabilities is called the optimal Bayes decision rule. For a given observation (feature value) x:
  decide ω1 if P(ω1 | x) > P(ω2 | x); decide ω2 if P(ω1 | x) < P(ω2 | x)
• To justify this rule, calculate the probability of error:
  P(error | x) = P(ω1 | x) if we decide ω2; P(error | x) = P(ω2 | x) if we decide ω1
• So, for a given x, we can minimize the probability of error: decide ω1 if P(ω1 | x) > P(ω2 | x), otherwise decide ω2. Therefore P(error | x) = min [P(ω1 | x), P(ω2 | x)]
• Thus, for each observation x, the Bayes decision rule minimizes the probability of error
• The unconditional error P(error) is obtained by integrating P(error | x) over all x w.r.t. p(x)
• Special cases:
  (i) P(ω1) = P(ω2): decide ω1 if p(x | ω1) > p(x | ω2), otherwise ω2
  (ii) p(x | ω1) = p(x | ω2): decide ω1 if P(ω1) > P(ω2), otherwise ω2

**Bayesian Decision Theory – Continuous Features**

• Generalization of the preceding formulation: use more than one feature (d features); use more than two states of nature (c classes); allow actions other than deciding on the state of nature; introduce a loss function more general than the probability of error
• Allowing actions other than classification primarily allows the possibility of rejection: refusing to make a decision when it is difficult to decide between two classes or in noisy cases!
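As a concrete illustration of the Bayes formula and the resulting decision rule, here is a minimal Python sketch for the two-class fish example. The Gaussian lightness densities and their parameters (means 6.0 and 3.0, unit variance, equal priors) are invented for illustration, not taken from the lecture.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical class-conditional lightness densities:
# omega_1 = sea bass, omega_2 = salmon (parameters are made up).
PRIORS = {"omega_1": 0.5, "omega_2": 0.5}  # equiprobable catch
LIKELIHOODS = {
    "omega_1": lambda x: gaussian_pdf(x, mu=6.0, sigma=1.0),
    "omega_2": lambda x: gaussian_pdf(x, mu=3.0, sigma=1.0),
}

def posteriors(x):
    """Bayes formula: P(omega_j | x) = p(x | omega_j) P(omega_j) / p(x)."""
    joint = {w: LIKELIHOODS[w](x) * PRIORS[w] for w in PRIORS}
    evidence = sum(joint.values())  # scale factor p(x)
    return {w: joint[w] / evidence for w in joint}

def decide(x):
    """Optimal Bayes rule: pick the class with the larger posterior."""
    post = posteriors(x)
    return max(post, key=post.get)

# A dark (low-lightness) fish falls on the salmon side of the boundary here.
print(decide(2.5))  # omega_2
print(decide(6.5))  # omega_1
```

Because the evidence p(x) is the same for both classes at a given x, it never changes which posterior is larger; it only rescales them to sum to 1.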
• The loss function specifies the cost of each action

**Loss Function and Conditional Risk**

• Let {ω1, ω2, …, ωc} be the set of c states of nature (or "categories")
• Let {α1, α2, …, αa} be the set of a possible actions
• Let λ(αi | ωj) be the loss incurred for taking action αi when the true state of nature is ωj
• A general decision rule α(x) specifies which action to take for every possible observation x
• For a given x, suppose we take action αi; if the true state is ωj, we incur the loss λ(αi | ωj). P(ωj | x) is the probability that the true state is ωj, but any one of the c states is possible for a given x
• Conditional risk: R(αi | x) = Σj λ(αi | ωj) P(ωj | x)
• Overall risk R = expected value of R(α(x) | x) w.r.t. p(x)
• Minimizing R: minimize R(αi | x) for i = 1, …, a
• Select the action αi for which R(αi | x) is minimum; the overall risk R is then minimized, and the resulting risk is called the Bayes risk; it is the best performance that can be achieved!

**Two-Category Classification**

• α1: deciding ω1; α2: deciding ω2
• λij = λ(αi | ωj) is the loss incurred for deciding ωi when the true state of nature is ωj
• Conditional risks:
  R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
  R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)
• The Bayes decision rule is stated as: if R(α1 | x) < R(α2 | x), take action α1 ("decide ω1")
• This results in the equivalent rule: decide ω1 if (λ21 − λ11) p(x | ω1) P(ω1) > (λ12 − λ22) p(x | ω2) P(ω2), and decide ω2 otherwise

**Likelihood Ratio**

• The preceding rule is equivalent to: if p(x | ω1) / p(x | ω2) > [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)], take action α1 (decide ω1); otherwise take action α2 (decide ω2)
• Note that the posterior probabilities are scaled by the loss differences
• Interpretation of the Bayes decision rule: "if the likelihood ratio of class ω1 and class ω2 exceeds a threshold value (that is independent of the input pattern x), the optimal action is to decide ω1"
• Maximum likelihood decision rule: the threshold value is 1; this holds for the 0-1 loss function with equal class prior probabilities

**Bayesian Decision Theory (Sections 2.3-2.5)**

Minimum Error Rate Classification; Classifiers, Discriminant Functions and Decision Surfaces; The Normal Density

**Minimum Error Rate Classification**

• Actions are decisions on classes: if action αi is taken and the true state of nature is ωj, the decision is correct if i = j and in error if i ≠ j
• Seek a decision rule that minimizes the probability of error, i.e., the error rate
• Zero-one (0-1) loss function: no loss for a correct decision and a unit loss for any error, λ(αi | ωj) = 0 if i = j and 1 if i ≠ j
• The conditional risk then simplifies to R(αi | x) = Σ_{j ≠ i} P(ωj | x) = 1 − P(ωi | x): "the risk corresponding to the 0-1 loss function is the average probability of error"
• Minimizing the risk requires maximizing the posterior probability P(ωi | x), since R(αi | x) = 1 − P(ωi | x)
• For minimum error rate: decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i

**Decision Boundaries and Decision Regions**

• If λ is the 0-1 loss function, then the likelihood-ratio threshold involves only the priors: θ = P(ω2) / P(ω1)

**Classifiers, Discriminant Functions and Decision Surfaces**

• There are many ways to represent pattern classifiers; one of the most useful is in terms of discriminant functions
• The multi-category case: a set of discriminant functions gi(x), i = 1, …, c; the classifier assigns a feature vector x to class ωi if gi(x) > gj(x) for all j ≠ i
• The Bayes classifier can be represented this way, but the choice of discriminant function is not unique:
  gi(x) = −R(αi | x) (maximum discriminant corresponds to minimum risk!)
  for minimum error rate, gi(x) = P(ωi | x) (maximum discriminant corresponds to maximum posterior!)
  gi(x) = p(x | ωi) P(ωi)
  gi(x) = ln p(x | ωi) + ln P(ωi) (ln: natural logarithm!)
• The effect of any decision rule is to divide the feature space into c decision regions: if gi(x) > gj(x) for all j ≠ i, then x is in Ri (region Ri means: assign x to ωi)
• The two-category case: here the classifier is a "dichotomizer" with two discriminant functions g1 and g2; let g(x) ≡ g1(x) − g2(x) and decide ω1 if g(x) > 0, otherwise decide ω2
• So a "dichotomizer" computes a single discriminant function g(x) and classifies x according to whether g(x) is positive or not
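The likelihood-ratio rule with a general loss function can be sketched in a few lines of Python. The loss matrices and likelihood values below are hypothetical, chosen only to show how the threshold [(λ12 − λ22) / (λ21 − λ11)] · P(ω2)/P(ω1) moves the decision.

```python
def likelihood_ratio_rule(p1, p2, prior1, prior2, loss):
    """Decide omega_1 iff the likelihood ratio p(x|w1)/p(x|w2) exceeds the
    fixed threshold (l12 - l22)/(l21 - l11) * P(w2)/P(w1).
    loss[(i, j)] is the cost of deciding omega_i when the truth is omega_j."""
    threshold = ((loss[(1, 2)] - loss[(2, 2)]) / (loss[(2, 1)] - loss[(1, 1)])
                 * prior2 / prior1)
    return 1 if p1 / p2 > threshold else 2

# 0-1 loss with equal priors: the threshold is 1, i.e. the maximum
# likelihood rule; a ratio of 0.3/0.1 = 3 exceeds it.
zero_one = {(1, 1): 0, (1, 2): 1, (2, 1): 1, (2, 2): 0}
print(likelihood_ratio_rule(p1=0.3, p2=0.1, prior1=0.5, prior2=0.5,
                            loss=zero_one))  # 1

# Penalize mistaking omega_2 for omega_1 ten times more: the threshold
# rises to 10, so the same likelihoods now yield the opposite decision.
asymmetric = {(1, 1): 0, (1, 2): 10, (2, 1): 1, (2, 2): 0}
print(likelihood_ratio_rule(p1=0.3, p2=0.1, prior1=0.5, prior2=0.5,
                            loss=asymmetric))  # 2
```

The key point the example makes concrete: the threshold depends only on the losses and priors, never on the input pattern x.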
• Computation of g(x) = g1(x) − g2(x)

**The Normal Density**

• Univariate density N(μ, σ²):
  p(x) = [1 / (√(2π) σ)] exp[−(x − μ)² / (2σ²)]
  where μ = mean (or expected value) of x and σ² = variance (or expected squared deviation) of x
• The normal density is analytically tractable and continuous
• A number of processes are asymptotically Gaussian
• Patterns (e.g., handwritten characters, speech signals) can be viewed as randomly corrupted versions of a single typical or prototype pattern (Central Limit Theorem)

**Multivariate Density N(μ, Σ)**

• Multivariate normal density in d dimensions:
  p(x) = [1 / ((2π)^(d/2) |Σ|^(1/2))] exp[−(1/2) (x − μ)^t Σ⁻¹ (x − μ)]
  where x = (x1, x2, …, xd)^t (t stands for the transpose of a vector), μ = (μ1, μ2, …, μd)^t is the mean vector, and Σ is the d×d covariance matrix; |Σ| and Σ⁻¹ are the determinant and inverse of Σ, respectively
• The covariance matrix is always symmetric and positive semidefinite; we assume Σ is positive definite, so the determinant of Σ is strictly positive
• The multivariate normal density is completely specified by d + d(d+1)/2 parameters
• If variables x1 and x2 are statistically independent, then the covariance of x1 and x2 is zero

**Multivariate Normal Density**

• Samples drawn from a normal population tend to fall in a single cloud or cluster; the cluster center is determined by the mean vector and its shape by the covariance matrix
• The loci of points of constant density are hyperellipsoids whose principal axes are the eigenvectors of Σ

**Transformation of Normal Variables**

• Linear combinations of jointly normally distributed random variables are normally distributed
• A coordinate transformation can convert an arbitrary multivariate normal distribution into a spherical one

**Bayesian Decision Theory (Sections 2.6-2.9)**

Discriminant Functions for the Normal Density; Bayes Decision Theory – Discrete Features

**Discriminant Functions for the Normal Density**

• Minimum error-rate classification can be achieved by the discriminant function gi(x) = ln p(x | ωi) + ln P(ωi)
• For multivariate normal densities: gi(x) = −(1/2) (x − μi)^t Σi⁻¹ (x − μi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)

**Case 1: Σi = σ²I (I is the identity matrix)**

• Features are statistically independent and each feature has the same variance σ²
• The discriminant becomes linear: gi(x) = wi^t x + wi0, with wi = μi / σ² and wi0 = −μi^t μi / (2σ²) + ln P(ωi)
• A classifier that uses linear discriminant functions is called a "linear machine"
• The decision surfaces for a linear machine are pieces of hyperplanes defined by the linear equations gi(x) = gj(x)
• The hyperplane separating Ri and Rj is orthogonal to the line linking the means!

**Case 2: Σi = Σ (covariance matrices of all classes are identical but otherwise arbitrary!)**

• gi(x) = −(1/2) (x − μi)^t Σ⁻¹ (x − μi) + ln P(ωi)
• The hyperplane separating Ri and Rj is generally not orthogonal to the line between the means!
• To classify a feature vector x, measure the squared Mahalanobis distance (x − μi)^t Σ⁻¹ (x − μi) from x to each of the c means; with equal priors, assign x to the category of the nearest mean
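A minimal sketch of the shared-covariance (Case 2) classifier using the squared Mahalanobis distance; the means, covariance matrix, and priors below are invented for the example, and unequal priors are folded in by subtracting 2 ln P(ωi) from each squared distance.

```python
import numpy as np

def mahalanobis_sq(x, mu, sigma_inv):
    """Squared Mahalanobis distance (x - mu)^t Sigma^{-1} (x - mu)."""
    d = x - mu
    return float(d @ sigma_inv @ d)

def classify(x, means, sigma, priors=None):
    """Case Sigma_i = Sigma: with equal priors, assign x to the class whose
    mean is nearest in squared Mahalanobis distance; unequal priors shift
    each score by -2 ln P(omega_i)."""
    sigma_inv = np.linalg.inv(sigma)
    scores = []
    for i, mu in enumerate(means):
        s = mahalanobis_sq(x, mu, sigma_inv)
        if priors is not None:
            s -= 2.0 * np.log(priors[i])
        scores.append(s)
    return int(np.argmin(scores))

# Illustrative 2-D example (parameters are made up for the sketch).
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])  # shared, non-diagonal covariance
print(classify(np.array([1.0, 1.0]), means, sigma))  # 0
print(classify(np.array([3.5, 3.0]), means, sigma))  # 1
```

Because the covariance is shared, the −½ ln |Σ| term is the same for every class and drops out, which is exactly why the comparison reduces to Mahalanobis distances (plus prior terms).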