Bayesian Methods in Statistical Pattern Recognition
Learn about Bayesian methods for classifying patterns based on probabilities, features, and minimizing errors. Understand the basics, estimation of probabilities, and decision rules for effective pattern recognition.
Chapter 11: Supervised Learning: Statistical Methods
Cios / Pedrycz / Swiniarski / Kurgan
Outline
• Bayesian Methods
• Basics of Bayesian Methods
• Bayesian Classification – General Case
• Classification that Minimizes Risk
• Decision Regions and Probability of Errors
• Discriminant Functions
• Estimation of Probability Densities
• Probabilistic Neural Network
• Constraints in Classifier Design
Outline
• Regression
• Data Models
• Simple Linear Regression
• Multiple Regression
• General Least Squares and Multiple Regression
• Assessing Quality of the Multiple Regression Model
Bayesian Methods
Statistical processing based on Bayes decision theory is a fundamental technique for pattern recognition and classification. Bayes decision theory provides a framework for classifying patterns into classes based on the probabilities of patterns and of their features.
Basics of Bayesian Methods
Let us assume an experiment involving recognition of two kinds of birds: an eagle and a hawk.
• States of nature: C = { "an eagle", "a hawk" }
• Values of C: { c1, c2 } = { "an eagle", "a hawk" }
We may assume that, among a large number N of prior observations, it was concluded that n_eagle of them belonged to class c1 ("an eagle") and n_hawk belonged to class c2 ("a hawk"), with n_eagle + n_hawk = N.
Basics of Bayesian Methods
• A priori (prior) probability P(ci): the (unconditional) probability that an object belongs to class ci, without any further information about the object.
• Estimation of the priors from the N observations:
P(c1) ≈ n_eagle / N,  P(c2) ≈ n_hawk / N
Basics of Bayesian Methods
The a priori probabilities P(c1) and P(c2) represent our initial knowledge (in statistical terms) of how likely it is that an eagle or a hawk may appear, even before a bird is actually observed.
• Natural and best decision: assign a bird to class c1 if P(c1) > P(c2); otherwise, assign it to class c2.
• The probability of classification error:
P(classification error) = P(c2) if we decide C = c1, and P(c1) if we decide C = c2
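The prior-only decision rule above can be sketched in a few lines. This is an illustrative sketch, not code from the chapter; the eagle/hawk priors reuse the values that appear in the chapter's later example.

```python
# Prior-only Bayes decision: with no feature measurement, always predict
# the class with the larger prior probability.
def classify_by_prior(priors):
    """Return (predicted_class, probability_of_error) given {class: P(ci)}."""
    best = max(priors, key=priors.get)
    # All probability mass of the other classes is misclassified.
    error = 1.0 - priors[best]
    return best, error

priors = {"eagle": 0.8, "hawk": 0.2}
pred, err = classify_by_prior(priors)
# pred is "eagle"; err is 0.2 (we are wrong whenever a hawk appears)
```

Note that this rule always predicts the same class; features are introduced next precisely to improve on it.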
Involving Object Features in Classification
• Feature variable (feature) x
• It characterizes an object and allows for better discrimination of one class from another.
• We assume it to be a continuous random variable taking values from a given range.
• The variability of the random variable x can be expressed in probabilistic terms.
• We represent the distribution of the random variable x by the class-conditional probability density function (the state-conditional probability density function) p(x|ci).
Involving Object Features in Classification
Examples of probability densities
Involving Object Features in Classification
• Probability density function p(x|ci)
• also called the likelihood of class ci with respect to the value x of the feature variable
• the likelihood that an object belongs to class ci is larger if p(x|ci) is larger
• Joint probability density function p(ci, x)
• the probability density that an object is in class ci and has feature value x
• A posteriori (posterior) probability P(ci|x)
• the conditional probability P(ci|x) (i = 1, 2), which specifies the probability that the object's class is ci given that the measured value of the feature variable is x
Involving Object Features in Classification
• Bayes' rule (Bayes' theorem), from probability theory (see Appendix B):
P(ci|x) = p(x|ci) P(ci) / p(x)
• The unconditional probability density function:
p(x) = Σi p(x|ci) P(ci)
Involving Object Features in Classification
• Bayes' rule
• "The conditional probability P(ci|x) can be expressed in terms of the a priori probability P(ci), together with the class-conditional probability density function p(x|ci)."
Involving Object Features in Classification
• Bayes' decision rule: decide C = c1 if P(c1|x) > P(c2|x); otherwise decide C = c2.
• The probability of classification error:
P(classification error | x) = P(c2|x) if we decide C = c1, and P(c1|x) if we decide C = c2
• This statistical classification rule is best in the sense of minimizing the probability of misclassification (the probability of classification error).
• Bayes' classification rule guarantees minimization of the average probability of classification error.
Involving Object Features in Classification
• Example
• Let us consider a bird classification problem with P(c1) = P("an eagle") = 0.8, P(c2) = P("a hawk") = 0.2, and known probability density functions p(x|c1) and p(x|c2).
• Assume that, for a new bird, we have measured its size x = 45 cm, and for this value we computed p(45|c1) = 2.2828 × 10^-2 and p(45|c2) = 1.1053 × 10^-2.
• The classification rule predicts class c1 ("an eagle") because p(x|c1) P(c1) > p(x|c2) P(c2), that is, 2.2828 × 10^-2 × 0.8 > 1.1053 × 10^-2 × 0.2.
• Let us assume that the value of the unconditional density p(x) is known to be p(45) = 0.3. The probability of classification error is then
P(error | x = 45) = p(45|c2) P(c2) / p(45) = (1.1053 × 10^-2 × 0.2) / 0.3 ≈ 0.0074
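The arithmetic of this example can be checked with a short sketch using the numbers given above (this code is illustrative; the class names `c1`/`c2` mirror the slide's notation).

```python
# Reproducing the eagle/hawk example for x = 45 cm.
P = {"c1": 0.8, "c2": 0.2}                   # priors: eagle, hawk
lik = {"c1": 2.2828e-2, "c2": 1.1053e-2}     # given likelihoods p(45 | ci)
px = 0.3                                     # given unconditional density p(45)

scores = {c: lik[c] * P[c] for c in P}       # p(x|ci) P(ci) for each class
decision = max(scores, key=scores.get)       # Bayes rule: largest product wins

# Probability of error given x = 45: posterior of the rejected class.
p_error = min(scores.values()) / px
# decision is "c1" (an eagle); p_error is about 0.0074
```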
Bayesian Classification – General Case
• Bayes' Classification Rule for Multiclass, Multifeature Objects
• The real-valued features of an object form an n-dimensional column vector x ∈ R^n:
x = [x1, x2, …, xn]^T
• The object may belong to one of l distinct classes (l distinct states of nature):
C = { c1, c2, …, cl }
Bayesian Classification – General Case
• Bayes' Classification Rule for Multiclass, Multifeature Objects
• Bayes' theorem:
P(ci|x) = p(x|ci) P(ci) / p(x)
• A priori probability: P(ci) (i = 1, 2, …, l)
• Class-conditional probability density function: p(x|ci)
• A posteriori (posterior) probability: P(ci|x)
• Unconditional probability density function:
p(x) = Σ (i = 1..l) p(x|ci) P(ci)
Bayesian Classification – General Case
• Bayes' Classification Rule for Multiclass, Multifeature Objects
• Bayes classification rule: assign an object with a given value x of the feature vector to class cj when
P(cj|x) > P(ci|x) for all i ≠ j,
or equivalently (multiplying both sides by p(x)):
p(x|cj) P(cj) > p(x|ci) P(ci) for all i ≠ j
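The multiclass rule is a simple argmax over p(x|ci) P(ci). Below is a minimal sketch with likelihoods supplied as callables; the three Gaussian class densities and their parameters are illustrative assumptions, not values from the chapter.

```python
# Multiclass Bayes classifier: argmax_i p(x|ci) P(ci),
# which is equivalent to the argmax of the posteriors P(ci|x).
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density, used here as an example likelihood."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def bayes_classify(x, priors, likelihoods):
    return max(priors, key=lambda c: likelihoods[c](x) * priors[c])

priors = {"c1": 0.5, "c2": 0.3, "c3": 0.2}
likelihoods = {
    "c1": lambda x: gaussian_pdf(x, 0.0, 1.0),
    "c2": lambda x: gaussian_pdf(x, 3.0, 1.0),
    "c3": lambda x: gaussian_pdf(x, 6.0, 1.0),
}
label = bayes_classify(0.2, priors, likelihoods)   # near mean of c1
```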
Classification that Minimizes Risk
• Basic idea: to incorporate the fact that misclassifications of some classes are more costly than others, we define a classification based on a minimization criterion that involves a loss associated with a given classification decision for a given true state of nature.
• A loss function L(cj|ci) is the cost (penalty, weight) of assigning an object to class cj when in fact the true class is ci.
Classification that Minimizes Risk
• A loss matrix: for an l-class classification problem, the loss function values form an l × l matrix L = [Lij].
• Expected (average) conditional loss (conditional risk) of deciding class cj given x:
R(cj|x) = Σ (i = 1..l) L(cj|ci) P(ci|x)
Classification that Minimizes Risk
• Overall risk: the overall risk R can be used as a classification criterion for minimizing the risk related to a classification decision.
• Bayes risk: the minimal overall risk R. Minimizing the overall risk leads to a generalization of Bayes' rule for minimizing the probability of classification error.
Classification that Minimizes Risk
• Bayes' classification rule with Bayes risk: choose the decision (class) cj for which the conditional risk is minimal:
R(cj|x) ≤ R(ci|x) for all i = 1, 2, …, l
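The risk-minimizing rule can be sketched directly from the conditional-risk formula. The posteriors and the asymmetric loss values below are illustrative assumptions (missing a "hawk" is taken to be ten times as costly as missing an "eagle") to show how the decision can differ from the plain maximum-posterior rule.

```python
# Risk-minimizing decision: choose the class cj with the smallest
# conditional risk R(cj|x) = sum_i L(cj|ci) * P(ci|x).
def min_risk_decision(posteriors, loss):
    """posteriors: {ci: P(ci|x)}; loss[cj][ci]: cost of deciding cj when truth is ci."""
    def risk(cj):
        return sum(loss[cj][ci] * p for ci, p in posteriors.items())
    return min(posteriors, key=risk)

posteriors = {"eagle": 0.7, "hawk": 0.3}
loss = {
    "eagle": {"eagle": 0.0, "hawk": 10.0},  # deciding eagle when truth is hawk is costly
    "hawk":  {"eagle": 1.0, "hawk": 0.0},
}
decision = min_risk_decision(posteriors, loss)
# Despite the lower posterior, the costly hawk error tips the decision to "hawk":
# R(eagle|x) = 10 * 0.3 = 3.0  versus  R(hawk|x) = 1 * 0.7 = 0.7
```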
Classification that Minimizes Risk
• Bayesian Classification Minimizing the Probability of Error
• Symmetrical zero-one conditional loss function: L(cj|ci) = 0 if i = j, and 1 otherwise.
• With this loss the conditional risk becomes R(cj|x) = 1 − P(cj|x), so minimizing the conditional risk is the same as minimizing the average probability of classification error.
• The average probability of classification error is thus used as the minimization criterion for selecting the best classification decision.
Classification that Minimizes Risk
• Generalization of the Maximum Likelihood Classification
• Generalized likelihood ratio for classes cj and ci:
Λ(x) = p(x|cj) / p(x|ci)
• Generalized threshold value (combining losses and priors):
θ = [ (L(cj|ci) − L(ci|ci)) P(ci) ] / [ (L(ci|cj) − L(cj|cj)) P(cj) ]
• The maximum likelihood classification rule: "Decide class cj if Λ(x) > θ."
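For two classes this test is compact enough to sketch directly. The function below is illustrative; the loss arguments default to the symmetrical zero-one case, and the demonstration reuses the likelihood and prior values from the eagle/hawk example.

```python
# Two-class likelihood-ratio test: decide c1 when
# p(x|c1)/p(x|c2) exceeds a threshold built from priors and losses.
def likelihood_ratio_decide(px_c1, px_c2, p_c1, p_c2,
                            l12=1.0, l21=1.0, l11=0.0, l22=0.0):
    """lij: loss of deciding class i when the true class is j (zero-one by default)."""
    ratio = px_c1 / px_c2
    theta = ((l12 - l22) * p_c2) / ((l21 - l11) * p_c1)
    return "c1" if ratio > theta else "c2"

# Eagle/hawk numbers: ratio ≈ 2.065, theta = 0.2 / 0.8 = 0.25, so decide c1.
decision = likelihood_ratio_decide(2.2828e-2, 1.1053e-2, 0.8, 0.2)
```

With zero-one losses the threshold reduces to P(c2)/P(c1), recovering the plain Bayes rule.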
Decision Regions and Probability of Errors
• Decision regions
• A classifier divides the feature space into l disjoint decision regions (subspaces) R1, R2, …, Rl.
• The region Ri is a subspace such that each realization x of an object's feature vector falling into this region is assigned to class ci.
Decision Regions and Probability of Errors
• Decision boundaries (decision surfaces) are the surfaces separating adjacent decision regions.
• "The task of classifier design is to find classification rules that guarantee division of the feature space into optimal decision regions R1, R2, …, Rl (with optimal decision boundaries) that minimize a selected classification performance criterion."
Decision Regions and Probability of Errors
• Decision boundaries
Decision Regions and Probability of Errors
• Optimal classification with decision regions
• Average probability of correct classification:
P(classification_correct) = Σ (i = 1..l) ∫_Ri p(x|ci) P(ci) dx
• "The classification problem can be stated as choosing the decision regions Ri (thus defining a classification rule) that maximize the probability of correct classification P(classification_correct), this probability being the optimization criterion."
Discriminant Functions
• Discriminant functions: d1(x), d2(x), …, dl(x)
• A discriminant-type classifier assigns an object with a given value x of the feature vector to class cj if dj(x) > di(x) for all i ≠ j.
• Classification rule for a discriminant-function-based classifier:
• Compute the numerical values of all discriminant functions for x.
• Choose as the prediction of the true class the class cj for which the value of the associated discriminant function dj(x) is largest:
dj(x) = max { di(x) }, i = 1, 2, …, l
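The two-step rule above is just an argmax over the discriminant values. A minimal sketch, with two hand-picked linear discriminants standing in for the Bayesian ones derived later:

```python
# Discriminant-type classifier: evaluate every di(x) and take the argmax.
def discriminant_classify(x, discriminants):
    """discriminants: {class: callable d_i}; returns the class with largest d_i(x)."""
    scores = {c: d(x) for c, d in discriminants.items()}
    return max(scores, key=scores.get)

# Illustrative linear discriminants for two classes of a scalar feature.
discriminants = {
    "c1": lambda x: 2.0 * x - 1.0,
    "c2": lambda x: -1.0 * x + 3.0,
}
label = discriminant_classify(2.0, discriminants)
# d1(2) = 3 > d2(2) = 1, so the classifier predicts "c1"
```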
Discriminant Functions
• Discriminant classifier
Discriminant Functions
• Discriminant-type classifier for Bayesian classification
• The natural choice of discriminant function is the a posteriori conditional probability: di(x) = P(ci|x).
• A practical version using Bayes' theorem (the denominator p(x) is the same for all classes and can be dropped): di(x) = p(x|ci) P(ci)
• The Bayesian discriminant in natural-logarithmic form: di(x) = ln p(x|ci) + ln P(ci)
Discriminant Functions
• Characteristics of discriminant functions
• Discriminant functions define the decision boundaries that separate the decision regions.
• Generally, the decision boundary between neighboring decision regions is the set of points where the corresponding discriminant function values are equal.
• The decision boundaries are unaffected by any monotonically increasing transformation of the discriminant functions.
Discriminant Functions
• Bayesian Discriminant Functions for Two Classes
• General case
• Two discriminant functions: d1(x) and d2(x).
• Two decision regions: R1 and R2.
• The decision boundary: d1(x) = d2(x).
• Using a dichotomizer
• A single discriminant function: d(x) = d1(x) − d2(x); decide c1 if d(x) > 0, and c2 otherwise.
Discriminant Functions
• Quadratic and Linear Discriminants Derived from the Bayes Rule
• Quadratic discriminant
• Assumption: a multivariate normal (Gaussian) distribution of the feature vector x within each class.
• The Bayesian discriminant (from the previous section): di(x) = ln p(x|ci) + ln P(ci)
Discriminant Functions
• Quadratic and Linear Discriminants Derived from the Bayes Rule
• Quadratic discriminant
• Gaussian probability density function:
p(x|ci) = (2π)^(−n/2) |Σi|^(−1/2) exp( −(1/2) (x − μi)^T Σi^(−1) (x − μi) )
• Quadratic discriminant function (constant terms common to all classes dropped):
di(x) = −(1/2) (x − μi)^T Σi^(−1) (x − μi) − (1/2) ln |Σi| + ln P(ci)
• Decision boundaries: hyperquadratic surfaces in the n-dimensional feature space (hyperspheres, hyperellipsoids, hyperparaboloids, etc.)
Discriminant Functions
• Quadratic and Linear Discriminants Derived from the Bayes Rule
• Given: a pattern x, values of the state-conditional probability densities p(x|ci), and the a priori probabilities P(ci)
• Compute the values of the mean vectors μi and the covariance matrices Σi for all classes i = 1, 2, …, l based on the training set.
• Compute the values of the discriminant function di(x) for all classes.
• Choose as the prediction of the true class the class cj for which the value of the associated discriminant function dj(x) is largest:
dj(x) = max { di(x) }, i = 1, 2, …, l
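The procedure above can be sketched end to end: estimate means and covariances from training data, then take the argmax of the quadratic discriminant. The two tiny synthetic Gaussian training sets are illustrative assumptions.

```python
# Quadratic discriminant under the Gaussian assumption:
# di(x) = -1/2 (x-mu_i)^T Sigma_i^{-1} (x-mu_i) - 1/2 ln|Sigma_i| + ln P(ci)
import numpy as np

def fit_class(X):
    """Estimate the mean vector and covariance matrix from training patterns."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def quadratic_discriminant(x, mu, sigma, prior):
    diff = x - mu
    inv = np.linalg.inv(sigma)
    return (-0.5 * diff @ inv @ diff
            - 0.5 * np.log(np.linalg.det(sigma))
            + np.log(prior))

rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))   # class c1 centered at (0, 0)
X2 = rng.normal([4.0, 4.0], 1.0, size=(50, 2))   # class c2 centered at (4, 4)
params = {"c1": (*fit_class(X1), 0.5), "c2": (*fit_class(X2), 0.5)}

x = np.array([0.5, 0.5])
label = max(params, key=lambda c: quadratic_discriminant(x, *params[c]))
```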
Discriminant Functions
• Quadratic and Linear Discriminants Derived from the Bayes Rule
• Linear discriminant
• Assumption: equal covariances for all classes, Σi = Σ.
• The quadratic discriminant then reduces (after dropping terms common to all classes) to a linear form of discriminant function:
di(x) = μi^T Σ^(−1) x − (1/2) μi^T Σ^(−1) μi + ln P(ci)
Discriminant Functions
• Quadratic and Linear Discriminants Derived from the Bayes Rule
• Linear discriminant: the decision boundaries between classes i and j, for which di(x) = dj(x), are pieces of hyperplanes in the n-dimensional feature space.
Discriminant Functions
• Quadratic and Linear Discriminants Derived from the Bayes Rule
• The classification process using linear discriminants:
• Compute, for a given x, the numerical values of the discriminant functions for all classes.
• Choose the class cj for which the value of the discriminant function dj(x) is largest:
dj(x) = max { di(x) }, i = 1, 2, …, l
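The linear-discriminant classification process can be sketched as below. The shared covariance matrix, class means, and priors are illustrative assumptions chosen for the demonstration.

```python
# Linear discriminant with a shared covariance Sigma:
# di(x) = mu_i^T Sigma^{-1} x - 1/2 mu_i^T Sigma^{-1} mu_i + ln P(ci)
import numpy as np

def linear_discriminant(x, mu, inv_sigma, prior):
    w = inv_sigma @ mu                            # linear weight vector
    w0 = -0.5 * mu @ inv_sigma @ mu + np.log(prior)
    return w @ x + w0

inv_sigma = np.linalg.inv(np.array([[1.0, 0.2],
                                    [0.2, 1.0]]))
means = {"c1": np.array([0.0, 0.0]), "c2": np.array([3.0, 3.0])}
priors = {"c1": 0.5, "c2": 0.5}

x = np.array([2.5, 2.9])                          # closer to the mean of c2
label = max(means, key=lambda c: linear_discriminant(x, means[c], inv_sigma, priors[c]))
```

Because every di(x) is affine in x, the boundary d1(x) = d2(x) here is a straight line, matching the hyperplane statement above.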
Discriminant Functions
• Quadratic and Linear Discriminants
• Example
• Let us assume that the following two-feature patterns x ∈ R^2, from two classes c1 = 0 and c2 = 1, have been drawn according to Gaussian (normal) distributions:
Discriminant Functions
• Quadratic and Linear Discriminants
• Example
• The estimates of the symmetric covariance matrices for both classes
• The linear discriminant functions for both classes
Discriminant Functions
• Quadratic and Linear Discriminants
• Example
• Two-class, two-feature pattern dichotomizer.
Discriminant Functions
• Quadratic and Linear Discriminants
• Minimum Mahalanobis Distance Classifier
• Assumptions
• Equal covariances for all classes: Σi = Σ (i = 1, 2, …, l)
• Equal a priori probabilities for all classes: P(ci) = P
• Discriminant function (terms common to all classes dropped):
di(x) = −(1/2) (x − μi)^T Σ^(−1) (x − μi)
Discriminant Functions
• Quadratic and Linear Discriminants
• Minimum Mahalanobis Distance Classifier
• The classifier selects the class cj whose mean vector μj is nearest to x in the sense of the Mahalanobis distance. This classifier is called a minimum Mahalanobis distance classifier.
• A linear version of the minimum Mahalanobis distance classifier is obtained by expanding the quadratic form and dropping the term x^T Σ^(−1) x, which is common to all classes.
Discriminant Functions
• Quadratic and Linear Discriminants
• Minimum Mahalanobis Distance Classifier
• Given: the mean vectors μi for all classes (i = 1, 2, …, l) and a given value x of the feature vector
• Compute the numerical values of the Mahalanobis distances between x and the means μi for all classes:
D_M(x, μi) = sqrt( (x − μi)^T Σ^(−1) (x − μi) )
• Choose as the prediction of the true class the class cj for which the associated Mahalanobis distance attains its minimum:
D_M(x, μj) = min { D_M(x, μi) }, i = 1, 2, …, l
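The Mahalanobis-distance procedure above can be sketched as follows; the class means and the shared covariance matrix are illustrative assumptions.

```python
# Minimum Mahalanobis distance classifier: with equal covariances and equal
# priors, assign x to the class whose mean is nearest in Mahalanobis distance.
import numpy as np

def mahalanobis(x, mu, inv_sigma):
    diff = x - mu
    return float(np.sqrt(diff @ inv_sigma @ diff))

def classify_mahalanobis(x, means, inv_sigma):
    return min(means, key=lambda c: mahalanobis(x, means[c], inv_sigma))

inv_sigma = np.linalg.inv(np.array([[2.0, 0.5],
                                    [0.5, 1.0]]))
means = {"c1": np.array([0.0, 0.0]), "c2": np.array([5.0, 1.0])}
label = classify_mahalanobis(np.array([4.0, 1.2]), means, inv_sigma)
```

Unlike plain Euclidean distance, this metric stretches the space according to Σ, so features with large variance (or strong correlation) count for less.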
Discriminant Functions
• Quadratic and Linear Discriminants
• Linear Discriminant for Statistically Independent Features
• Assumptions
• Equal covariances for all classes: Σi = Σ (i = 1, 2, …, l)
• The features are statistically independent, with equal variances: Σ = σ² I
• Discriminant function:
di(x) = −‖x − μi‖² / (2σ²) + ln P(ci)
Discriminant Functions
• Quadratic and Linear Discriminants
• Linear Discriminant for Statistically Independent Features
• Quadratic discriminant formula:
di(x) = −(x − μi)^T (x − μi) / (2σ²) + ln P(ci)
• Linear discriminant formula (after dropping x^T x, common to all classes):
di(x) = (1/σ²) μi^T x − (1/(2σ²)) μi^T μi + ln P(ci)
Discriminant Functions
• Quadratic and Linear Discriminants
• Linear Discriminant for Statistically Independent Features
• "Neural network" style, as a linear threshold machine:
di(x) = wi^T x + wi0, where wi = μi / σ² and wi0 = −μi^T μi / (2σ²) + ln P(ci)
• The decision surfaces for the linear discriminants are pieces of hyperplanes defined by the equations di(x) = dj(x).
Discriminant Functions
• Quadratic and Linear Discriminants
• Minimum Euclidean Distance Classifier
• Assumptions
• Equal covariances for all classes: Σi = Σ = σ² I (i = 1, 2, …, l)
• The features are statistically independent
• Equal a priori probabilities for all classes: P(ci) = P
• Discriminants:
di(x) = −‖x − μi‖², or the equivalent linear form di(x) = μi^T x − (1/2) μi^T μi
Discriminant Functions
• Quadratic and Linear Discriminants
• Minimum Euclidean Distance Classifier
• The minimum distance classifier (minimum Euclidean distance classifier) selects the class cj whose mean vector μj is nearest to x.
• A linear version of the minimum distance classifier uses the equivalent form di(x) = μi^T x − (1/2) μi^T μi.
Discriminant Functions
• Quadratic and Linear Discriminants
• Given: the mean vectors μi for all classes (i = 1, 2, …, l) and a given value x of the feature vector
• Compute the numerical values of the Euclidean distances between x and the means μi for all classes:
D_E(x, μi) = ‖x − μi‖
• Choose as the prediction of the true class the class cj for which the associated Euclidean distance is smallest:
D_E(x, μj) = min { D_E(x, μi) }, i = 1, 2, …, l
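The minimum Euclidean distance procedure reduces to "nearest class mean," and can be sketched in a few lines; the three class means are illustrative assumptions.

```python
# Minimum Euclidean distance classifier: with Sigma_i = sigma^2 * I and equal
# priors, the Bayes rule reduces to assigning x to the nearest class mean.
import math

def classify_nearest_mean(x, means):
    """means: {class: mean point}; returns the class with the nearest mean."""
    return min(means, key=lambda c: math.dist(x, means[c]))

means = {"c1": (0.0, 0.0), "c2": (4.0, 4.0), "c3": (0.0, 5.0)}
label = classify_nearest_mean((0.5, 4.2), means)
# nearest mean is (0.0, 5.0), so the classifier predicts "c3"
```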