
ECES 690 – Statistical Pattern Recognition




Presentation Transcript


  1. ECES 690 – Statistical Pattern Recognition Lecture 2-Linear Classifiers 1/14-16/2013

  2. Recap • Pattern recognition is about classifying data • The classes may be known a priori – supervised, or they may be unknown – unsupervised • As you plan for your term project, it should probably contain some aspect related to classification.

  3. Nonparametric Estimation • In words: place a segment of length h at x and count the kN points that fall inside it; the estimate is p̂(x) ≈ kN/(N·h). • If p(x) is continuous, p̂(x) converges to p(x) as N → ∞, provided that h → 0, kN → ∞ and kN/N → 0.

  4. Example – histograms! • MATLAB: • x = normrnd(0,1,N,1); % N samples from N(0,1) • hist(x,k) % histogram with k bins – count of frequencies • PDF – histogram relationship: • [nHist,c] = hist(x,100); • nPDF = nHist./(N*(c(2)-c(1))); % counts divided by N and by the bin width approximate the pdf

  5. Parzen Windows • Place at x a hypercube of side length h and count the points that fall inside it.

  6. Define φ(x) = 1 if |xj| ≤ 1/2 for j = 1, …, ℓ, and 0 otherwise. • That is, φ is 1 inside a unit-side hypercube centered at 0, and the resulting estimate is p̂(x) = (1/(N·h^ℓ)) Σ φ((xi − x)/h), the sum running over the N training points xi. • The problem: the discontinuous φ produces a discontinuous estimate, so φ is generalized to smooth functions with φ(x) ≥ 0 and ∫φ(x)dx = 1. • Parzen windows – kernels – potential functions
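For illustration, a minimal MATLAB sketch of a one-dimensional kernel (Parzen) estimate using a Gaussian kernel in place of the hypercube; the data, the grid and all variable names below are illustrative assumptions, not part of the original slides:

    % Gaussian-kernel Parzen estimate in one dimension (illustrative sketch)
    N = 1000; h = 0.2;
    x = randn(N,1);                    % training samples drawn from N(0,1)
    xGrid = linspace(-4, 4, 200);      % points at which the pdf is estimated
    pHat = zeros(size(xGrid));
    for i = 1:numel(xGrid)
        u = (x - xGrid(i)) / h;        % scaled distances to the current point
        pHat(i) = sum(exp(-u.^2/2)/sqrt(2*pi)) / (N*h);   % (1/(N*h)) * sum of kernel values
    end
    plot(xGrid, pHat)

Rerunning the sketch with different h values reproduces the tradeoff discussed on the next slides: a smaller h gives a spikier (higher-variance) estimate, a larger h a smoother (more biased) one.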

  7. Mean value: E[p̂(x)] = (1/h^ℓ) E[φ((xi − x)/h)] = ∫ (1/h^ℓ) φ((x′ − x)/h) p(x′) dx′, and as h → 0 the scaled kernel tends to a delta function, so E[p̂(x)] → p(x). Hence the estimator is unbiased in the limit. • The bias of an estimator is the difference between the estimator's expected value and the true value of the parameter being estimated.

  8. [Plots of the estimate for h = 0.1, N = 1000 and for h = 0.8, N = 1000] • Variance • The smaller the h, the higher the variance of the estimate.

  9. [Plot of the estimate for h = 0.1, N = 10000] • The larger the N, the better the accuracy.

  10. If h → 0 and N → ∞ with N·h^ℓ → ∞, the estimate is asymptotically unbiased and consistent. • The method: estimate p(x|ωi) for each class ωi from its training vectors and use the estimates in the Bayes rule. • Remember: decide ω1 if p(x|ω1)P(ω1) > p(x|ω2)P(ω2).

  11. CURSE OF DIMENSIONALITY • In all the methods so far, we saw that the higher the number of points, N, the better the resulting estimate. • If in the one-dimensional space an interval filled with N points is adequate (for good estimation), in the two-dimensional space the corresponding square will require N² points and in the ℓ-dimensional space the ℓ-dimensional cube will require N^ℓ points. • The exponential increase in the number of necessary points is known as the curse of dimensionality. This is a major problem one is confronted with in high-dimensional spaces.
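As a hedged numerical illustration of the above (the numbers are assumed, not from the slides): if N = 100 points give an adequate estimate in one dimension, then at the same resolution a two-dimensional problem needs about 100² = 10⁴ points and a 10-dimensional one about 100¹⁰ = 10²⁰ points, far beyond any realistic training set.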

  12. An Example:

  13. NAIVE – BAYES CLASSIFIER • Let x = [x1, …, xℓ]^T; the goal is to estimate p(x|ωi), i = 1, 2, …, M. For a “good” estimate of the pdf one would need, say, N^ℓ points. • Assume x1, x2, …, xℓ mutually independent. Then: p(x|ωi) = Π p(xj|ωi), the product running over the ℓ features. • In this case, one would require, roughly, N points for each one-dimensional pdf; thus, a number of points of the order N·ℓ would suffice. • It turns out that the Naïve – Bayes classifier works reasonably well even in cases that violate the independence assumption.
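A minimal MATLAB sketch of a naïve-Bayes classifier that models each feature by a one-dimensional Gaussian per class (the Gaussian choice, the data and all names are illustrative assumptions):

    % Gaussian naive-Bayes sketch: one 1-D density per feature and per class
    l = 5; N = 200;
    Xtrain = [randn(N,l); randn(N,l)+1];        % class 1 around 0, class 2 around 1
    ytrain = [ones(N,1); 2*ones(N,1)];
    xTest  = 0.8*ones(1,l);                     % an arbitrary test vector
    logLik = zeros(2,1);
    for c = 1:2
        mu  = mean(Xtrain(ytrain==c,:));        % per-feature sample means
        sig = std(Xtrain(ytrain==c,:));         % per-feature sample std deviations
        % product of 1-D Gaussians = sum of 1-D log-likelihoods
        logLik(c) = sum(-0.5*((xTest-mu)./sig).^2 - log(sig));
    end
    [~, predictedClass] = max(logLik)           % equal priors assumed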

  14. K Nearest Neighbor Density Estimation • In Parzen: • The volume is constant • The number of points in the volume is varying • Now: • Keep the number of points, k, constant • Let the volume V(x) vary, so that p̂(x) = k/(N·V(x))
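A minimal one-dimensional MATLAB sketch of this estimate, where V(x) is the length of the smallest interval around x that contains k samples (data and names are illustrative assumptions):

    % k-NN density estimate in one dimension (illustrative sketch)
    N = 1000; k = 30;
    x = randn(N,1);                    % training samples
    xGrid = linspace(-4, 4, 200);
    pHat = zeros(size(xGrid));
    for i = 1:numel(xGrid)
        d = sort(abs(x - xGrid(i)));   % distances to the current grid point
        V = 2*d(k);                    % interval just large enough to hold k points
        pHat(i) = k/(N*V);             % fixed k, varying volume
    end
    plot(xGrid, pHat)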

  15. The Nearest Neighbor Rule • Choose k out of the N training vectors and identify the k nearest ones to x • Out of these k, identify the ki that belong to class ωi and assign x to the class with the maximum ki • The simplest version: k = 1 !!! • For large N this is not bad. It can be shown that, asymptotically, if PB is the optimal Bayesian error probability, then the nearest-neighbor error PNN satisfies: PB ≤ PNN ≤ PB(2 − (M/(M−1))PB) ≤ 2PB
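A minimal MATLAB sketch of the rule itself on assumed 2-D data (all names and numbers are illustrative):

    % k-NN classification: majority vote among the k nearest training vectors
    k = 3; N = 100;
    Xtrain = [randn(N,2); randn(N,2)+2];    % two 2-D classes
    ytrain = [ones(N,1); 2*ones(N,1)];
    xTest  = [1 1];                         % the point to classify
    d = sum((Xtrain - xTest).^2, 2);        % squared distances to all training vectors
    [~, idx] = sort(d);
    kLabels = ytrain(idx(1:k));             % labels of the k nearest neighbours
    predictedClass = mode(kLabels)          % k = 1 gives the simple NN rule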

  16. For small PB: PNN ≈ 2PB and P3NN ≈ PB + 3(PB)². • An example:

  17. Voronoi tessellation

  18. BAYESIAN NETWORKS • Bayes Probability Chain Rule: p(x1, x2, …, xℓ) = p(xℓ|xℓ−1, …, x1) p(xℓ−1|xℓ−2, …, x1) ··· p(x2|x1) p(x1) • Assume now that the conditional dependence for each xi is limited to a subset Ai of the features appearing in each of the product terms. That is: p(xi|xi−1, …, x1) = p(xi|Ai), where Ai ⊆ {xi−1, xi−2, …, x1}

  19. For example, if ℓ = 6, then we could assume: p(x6|x5, …, x1) = p(x6|x5, x4). Then: A6 = {x5, x4} ⊂ {x5, x4, …, x1} • The above is a generalization of the Naïve – Bayes. For the Naïve – Bayes the assumption is: Ai = Ø, for i = 1, 2, …, ℓ

  20. A graphical way to portray conditional dependencies is given below • According to this figure we have that: • x6 is conditionally dependent on x4, x5 • x5 on x4 • x4 on x1, x2 • x3 on x2 • x1, x2 are conditionally independent of the other variables. • For this case: p(x1, x2, …, x6) = p(x6|x4, x5) p(x5|x4) p(x4|x1, x2) p(x3|x2) p(x1) p(x2)

  21. Bayesian Networks • Definition: A Bayesian Network is a directed acyclic graph (DAG) where the nodes correspond to random variables. Each node is associated with a set of conditional probabilities (densities), p(xi|Ai), where xi is the variable associated with the node and Ai is the set of its parents in the graph. • A Bayesian Network is specified by: • The marginal probabilities of its root nodes. • The conditional probabilities of the non-root nodes, given their parents, for ALL possible values of the involved variables.

  22. The figure below is an example of a Bayesian Network corresponding to a paradigm from the medical applications field. • This Bayesian network models conditional dependencies for an example concerning smokers (S), tendencies to develop cancer (C) and heart disease (H), together with variables corresponding to heart (H1, H2) and cancer (C1, C2) medical tests.

  23. Once a DAG has been constructed, the joint probability can be obtained by multiplying the marginal (root nodes) and the conditional (non-root nodes) probabilities. • Training: Once a topology is given, probabilities are estimated via the training data set. There are also methods that learn the topology. • Probability Inference: This is the most common task that Bayesian networks help us to solve efficiently. Given the values of some of the variables in the graph, known as evidence, the goal is to compute the conditional probabilities for some of the other variables, given the evidence.
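As a hedged illustration of such an inference (the chain x → y → w and all probability tables below are hypothetical, not those of the slide's figure), a conditional such as P(w|x) is obtained by multiplying the factors of the joint and summing out the unobserved variable:

    % Inference by enumeration on a hypothetical chain x -> y -> w (binary variables)
    Px   = [0.6 0.4];               % [P(x=0) P(x=1)]              (hypothetical)
    Py_x = [0.7 0.3; 0.2 0.8];      % row i: [P(y=0|x=i-1) P(y=1|x=i-1)]
    Pw_y = [0.9 0.1; 0.4 0.6];      % row j: [P(w=0|y=j-1) P(w=1|y=j-1)]
    xObs = 2;                       % evidence: x = 1 (MATLAB index 2)
    % P(w|x=1) = sum over y of P(y|x=1) * P(w|y)
    Pw_given_x = Py_x(xObs,:) * Pw_y    % [P(w=0|x=1) P(w=1|x=1)]

For a singly connected DAG the same quantities are obtained more efficiently by the message-passing schemes mentioned on the next slide.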

  24. Example: Consider the Bayesian network of the figure: a) If x is measured to be x = 1 (x1), compute P(w = 0|x = 1) [P(w0|x1)]. b) If w is measured to be w = 1 (w1), compute P(x = 0|w = 1) [P(x0|w1)].

  25. For a), a set of calculations is required that propagates from node x to node w. It turns out that P(w0|x1) = 0.63. • For b), the propagation is reversed in direction. It turns out that P(x0|w1) = 0.4. • In general, the required inference information is computed via a combined process of “message passing” among the nodes of the DAG. • Complexity: • For singly connected graphs, message-passing algorithms amount to a complexity linear in the number of nodes.

  26. Bayesian networks and functional graphical models • Pearl, J., Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.

  27. Intermission • Chapter 2 • Bayes classification • Estimating the distribution • Isolating dependencies • Chapter 3 • Perceptrons • Linear Classifiers • Neural networks (1) • Support Vector Machines (linear?)

  28. LINEAR CLASSIFIERS • The Problem: Consider a two-class task with classes ω1, ω2 and a linear discriminant function g(x) = w^T x + w0; the decision hyperplane is defined by g(x) = 0.

  29. Hence: w determines the orientation of the decision hyperplane (it is orthogonal to it), the distance of a point x from the hyperplane is |g(x)|/||w||, and the distance of the hyperplane from the origin is |w0|/||w||.

  30. The Perceptron Algorithm • Assume linearly separable classes, i.e., there exists w* such that w*^T x > 0 for every x ∈ ω1 and w*^T x < 0 for every x ∈ ω2. • The case of a hyperplane with a threshold, w^T x + w0, falls under the above formulation, since the vectors can be augmented: x′ = [x^T, 1]^T and w′ = [w^T, w0]^T, so that w′^T x′ = w^T x + w0.

  31. Our goal: Compute a solution, i.e., a hyperplane w, so that w^T x > 0 for x ∈ ω1 and w^T x < 0 for x ∈ ω2. • The steps • Define a cost function to be minimized. • Choose an algorithm to minimize the cost function. • The minimum corresponds to a solution.

  32. The Cost Function • J(w) = Σ δx w^T x, the sum running over Y, where Y is the subset of the vectors wrongly classified by w and δx = −1 if x ∈ ω1, δx = +1 if x ∈ ω2 (so every term in the sum is non-negative). When Y = Ø (empty set) a solution is achieved and J(w) = 0.

  33. J(w) is piecewise linear (WHY?) • The Algorithm • The philosophy of gradient descent is adopted: w(t+1) = w(t) − ρt (∂J(w)/∂w), evaluated at w = w(t).

  34. Wherever the gradient is valid, ∂J(w)/∂w = Σ δx x (the sum over Y), so the update becomes w(t+1) = w(t) − ρt Σ δx x. This is the celebrated Perceptron Algorithm.
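A minimal MATLAB sketch of this gradient (batch) form on assumed linearly separable 2-D data (the data, learning rate and iteration cap are illustrative assumptions):

    % Perceptron algorithm in its gradient (batch) form (illustrative sketch)
    N = 50;
    X = [randn(N,2)+2; randn(N,2)-2];   % two (typically) linearly separable classes
    X = [X ones(2*N,1)];                % augment with 1 to absorb the threshold w0
    delta = [-ones(N,1); ones(N,1)];    % delta_x = -1 for class 1, +1 for class 2
    w = zeros(3,1); rho = 0.1;
    for t = 1:1000
        Y = (delta .* (X*w) >= 0);      % wrongly classified vectors: delta_x * w'x >= 0
        if ~any(Y), break; end          % Y empty: a solution has been found
        grad = X(Y,:)' * delta(Y);      % sum over Y of delta_x * x
        w = w - rho * grad;             % w(t+1) = w(t) - rho * gradient
    end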

  35. An example: • The perceptron algorithm converges in a finite number of iteration steps to a solution if the classes are linearly separable and the sequence ρt is chosen properly, e.g., ρt = c/t.

  36. A useful variant of the perceptron algorithm operates on one training vector at a time: w(t+1) = w(t) + ρ x(t) if x(t) ∈ ω1 and w^T(t) x(t) ≤ 0; w(t+1) = w(t) − ρ x(t) if x(t) ∈ ω2 and w^T(t) x(t) ≥ 0; w(t+1) = w(t) otherwise. • It is a reward and punishment type of algorithm.

  37. The perceptron • The network is called a perceptron or neuron. • It is a learning machine that learns from the training vectors via the perceptron algorithm.

  38. Example: At some stage t the perceptron algorithm results in a weight vector w(t); the corresponding hyperplane (for ρ = 0.7) is shown in the figure.

  39. Least Squares Methods • If the classes are linearly separable, the perceptron output results in ±1 on the two sides of the separating hyperplane. • If the classes are NOT linearly separable, we shall compute the weights w so that the difference between • the actual output of the classifier, w^T x, and • the desired outputs, e.g., +1 for ω1 and −1 for ω2, • is SMALL.

  40. SMALL, in the mean square error sense, means to choose w so that the cost function J(w) = E[(y − x^T w)²] is minimum.

  41. Minimizing J(w) yields ∂J(w)/∂w = 0 ⇒ Rx ŵ = E[x y], i.e., ŵ = Rx⁻¹ E[x y], where Rx ≡ E[x x^T] is the autocorrelation matrix and E[x y] the cross-correlation vector.
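A minimal MATLAB sketch of this solution using sample estimates of Rx and of the cross-correlation vector (the data and the ±1 desired responses are illustrative assumptions):

    % MSE weights from sample estimates of Rx = E[x x'] and p = E[x y] (sketch)
    N = 200;
    X = [randn(N,2)+1; randn(N,2)-1];   % two 2-D classes, not necessarily separable
    X = [X ones(2*N,1)];                % augmented vectors
    y = [ones(N,1); -ones(N,1)];        % desired responses +1 / -1
    Rx = (X'*X)/(2*N);                  % sample autocorrelation matrix
    p  = (X'*y)/(2*N);                  % sample cross-correlation vector
    w  = Rx \ p;                        % w = Rx^(-1) E[x y]
    trainingErrorRate = mean(sign(X*w) ~= y)   % classify by the sign of w'x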

  42. Multi-class generalization • The goal is to compute M linear discriminant functions gi(x) = wi^T x, i = 1, …, M, according to the MSE criterion. • Adopt as desired responses yi: yi = 1 if x ∈ ωi and yi = 0 otherwise. • Let y = [y1, …, yM]^T • And the matrix W = [w1, …, wM]

  43. The goal is to compute W so that E[||y − W^T x||²] is minimized. • The above is equivalent to M MSE minimization problems. That is: design each wi so that its desired output is 1 for x ∈ ωi and 0 for any other class. • Remark: The MSE criterion belongs to a more general class of cost functions with the following important property: • The value of gi(x) is an estimate, in the MSE sense, of the a-posteriori probability P(ωi|x), provided that the desired responses used during training are yi = 1 for x ∈ ωi and 0 otherwise.

  44. Mean square error regression: Let y, x be jointly distributed random vectors with a joint pdf p(y, x). • The goal: Given the value of x, estimate the value of y. In the pattern recognition framework, given x one wants to estimate the respective label y. • The MSE estimate ŷ of y, given x, is defined as: ŷ = arg min over ỹ of E[||y − ỹ||²]. • It turns out that: ŷ = E[y|x]. The above is known as the regression of y given x and it is, in general, a non-linear function of x. If p(y, x) is Gaussian, the MSE regressor is linear.

  45. SMALL, in the sum of error squares sense, means choosing w to minimize J(w) = Σ (yi − xi^T w)², the sum running over the N training pairs (yi, xi): the input xi and its corresponding class label yi (±1).

  46. Pseudoinverse Matrix • Define the N×ℓ matrix X = [x1, x2, …, xN]^T, whose rows are the training vectors, and the vector y = [y1, y2, …, yN]^T of the corresponding desired responses.

  47. Thus, minimizing J(w) gives X^T X ŵ = X^T y ⇒ ŵ = (X^T X)⁻¹ X^T y ≡ X⁺ y, where X⁺ is the pseudoinverse of X. • Assume N = ℓ and X square and invertible. Then X⁺ = (X^T X)⁻¹ X^T = X⁻¹ (X^T)⁻¹ X^T = X⁻¹.

  48. Assume N > ℓ. Then, in general, there is no solution w that satisfies all the equations X w = y simultaneously. • The “solution” ŵ = X⁺ y corresponds to the minimum sum of error squares solution.
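A minimal MATLAB sketch of this least-squares solution via the pseudoinverse (the data are illustrative assumptions):

    % Minimum sum-of-error-squares solution of the overdetermined system Xw = y
    N = 100;
    X = [randn(N,2)+1, ones(N,1); randn(N,2)-1, ones(N,1)];   % 2N > l augmented rows
    y = [ones(N,1); -ones(N,1)];                              % desired labels
    w = pinv(X)*y;        % w = (X'X)^(-1) X'y
    % equivalently: w = X \ y;   or   w = (X'*X) \ (X'*y);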

  49. Example:
