Midterm Review CS479/679 Pattern Recognition, Spring 2019 – Dr. George Bebis
Reminders • Graduate students need to select a paper for presentation (about a 15-minute presentation) • Send me your top three choices by Thursday, March 28th • Presentations will be scheduled on April 30th and May 2nd • Check the posted guidelines when preparing your presentation • Guest Lectures • Dr. Tin Nguyen, March 28th (Bioinformatics) • Dr. Emily Hand, April 4th (Face Recognition) • Colloquium • Dr. George Vasmatzis, Mayo Clinic, April 19th
Midterm Material • Intro to Pattern Recognition • Math review (probabilities, linear algebra) • Bayesian Decision Theory • Bayesian Networks • Parameter Estimation (ML and Bayesian) • Dimensionality Reduction • Feature Selection • Case studies are also included in the midterm
Intro to Pattern Recognition (PR) • Definitions • Pattern, Class, Class model • Classification vs Clustering • PR applications • What are the main classification approaches? • Generative • Model p(x, ω); estimate P(ω/x) • Discriminative • Estimate P(ω/x) directly x: features ω: class
Some Important Issues • Feature Extraction • Model Selection (i.e., simple vs complex) • Generalization
Feature Extraction • Which features? • Discriminative • How many? • Curse of dimensionality • Dimensionality reduction • Feature selection • Missing features • Marginalization (i.e., compute P(ωi/xg))
Simple vs Complex Models • Complex models are tuned to the particular training samples rather than to the characteristics of the true underlying model (overfitting or memorization).
Generalization • The ability of a classifier to produce correct results on novel patterns. • How can we improve generalization performance? • More training examples (i.e., they lead to better model estimates). • Simpler models usually yield better performance.
Probabilities • Prior and conditional probabilities • Law of total probability • Bayes rule • Random variables • pdf/pmf and PDF • Independence • Marginalization • Multivariate Gaussian • Covariance matrix decomposition • Whitening transformation
Linear Algebra • Vector dot product • Orthogonal/Orthonormal vectors • Linear dependence/independence • Space spanning • Vector basis • Matrices (diagonal, symmetric, transpose, inverse, trace, rank) • Eigenvalues/Eigenvectors • Matrix diagonalization/decomposition
Decision Rule Using Bayes Rule • Decide ω1 if P(ω1/x) > P(ω2/x); otherwise decide ω2 • The Bayes rule is optimum (i.e., it minimizes the average probability of error): P(error) = ∫ P(error/x) p(x) dx, where P(error/x) = min[P(ω1/x), P(ω2/x)]
Decision Rule Using Conditional Risk • Suppose λ(αi/ωj) is the loss (or cost) incurred for taking action αi when the true class is ωj • The expected loss (or conditional risk) of taking action αi: R(αi/x) = Σj λ(αi/ωj) P(ωj/x)
Decision Rule Using Conditional Risk • Bayes decision rule minimizes overall risk R by: • Computing R(αi /x) for every αi given an x • Choosing the action αi with the minimum R(αi /x)
Zero-One Loss Function • Assign the same loss to all errors: λ(αi/ωj) = 0 if i = j, 1 if i ≠ j • The conditional risk corresponding to this loss function: R(αi/x) = Σj≠i P(ωj/x) = 1 − P(ωi/x) • Decide ω1 if P(ω1/x) > P(ω2/x); otherwise decide ω2
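To make the minimum-risk decision rule concrete, here is a minimal NumPy sketch (not part of the original slides); the priors, likelihood values, and loss matrix below are hypothetical numbers chosen only for illustration. It computes the posteriors with Bayes rule, evaluates R(αi/x) for each action, and picks the action with the smallest conditional risk; under a zero-one loss this reduces to choosing the class with the largest posterior.

```python
import numpy as np

# Hypothetical two-class setup (values chosen only for illustration).
priors = np.array([0.6, 0.4])          # P(w1), P(w2)
likelihoods = np.array([0.3, 0.7])     # p(x/w1), p(x/w2) evaluated at a given x
loss = np.array([[0.0, 2.0],           # lambda(a_i/w_j): rows = actions,
                 [1.0, 0.0]])          # columns = true classes

# Posteriors via Bayes rule: P(w_j/x) = p(x/w_j) P(w_j) / p(x)
posteriors = likelihoods * priors
posteriors /= posteriors.sum()

# Conditional risk of each action: R(a_i/x) = sum_j lambda(a_i/w_j) P(w_j/x)
risks = loss @ posteriors

print("posteriors:", posteriors)            # zero-one loss would pick argmax here
print("minimum-risk action:", np.argmin(risks))
```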
Discriminant Functions • Assign a feature vector x to class ωi if gi(x) > gj(x) for all j ≠ i • Examples: gi(x) = P(ωi/x), gi(x) = p(x/ωi)P(ωi), gi(x) = ln p(x/ωi) + ln P(ωi)
Discriminant Function for Multivariate Gaussian • Σi = σ²I → linear discriminant: gi(x) = wi^T x + wi0, with wi = μi/σ² and wi0 = −μi^T μi/(2σ²) + ln P(ωi) • Need to know how to derive the decision boundary, a hyperplane w^T(x − x0) = 0 with w = μi − μj • Special case (equal priors): x0 = ½(μi + μj), i.e., a minimum distance classifier
Discriminant Function for Multivariate Gaussian (cont’d) • Σi = Σ → linear discriminant: gi(x) = wi^T x + wi0, with wi = Σ^-1 μi and wi0 = −½ μi^T Σ^-1 μi + ln P(ωi) • Need to know how to derive the decision boundary, again a hyperplane w^T(x − x0) = 0 • Special case (equal priors): Mahalanobis distance classifier
Discriminant Function for Multivariate Gaussian (cont’d) • Σi = arbitrary → quadratic discriminant: gi(x) = x^T Wi x + wi^T x + wi0, with Wi = −½ Σi^-1, wi = Σi^-1 μi, and wi0 = −½ μi^T Σi^-1 μi − ½ ln|Σi| + ln P(ωi) • Need to know how to derive the decision boundary • Decision boundaries are hyperquadrics
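As a sketch of the general (Σi arbitrary) case, the NumPy snippet below evaluates the quadratic discriminant for each class and assigns x to the class with the largest value; the means, covariances, and priors are made-up numbers for illustration only, and the class-independent −(d/2) ln 2π term is dropped.

```python
import numpy as np

def quadratic_discriminant(x, mean, cov, prior):
    """g_i(x) for a Gaussian class model with arbitrary covariance
    (the class-independent -(d/2) ln(2*pi) term is omitted)."""
    diff = x - mean
    cov_inv = np.linalg.inv(cov)
    return (-0.5 * diff @ cov_inv @ diff
            - 0.5 * np.log(np.linalg.det(cov))
            + np.log(prior))

# Hypothetical 2D example (numbers are illustrative only).
x = np.array([1.0, 2.0])
params = [  # (mean, covariance, prior) per class
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    (np.array([3.0, 3.0]), np.array([[2.0, 0.3], [0.3, 1.0]]), 0.5),
]
scores = [quadratic_discriminant(x, m, S, P) for m, S, P in params]
print("decide class", int(np.argmax(scores)) + 1)
```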
Bayesian Nets • How is it defined? • Directed acyclic graph (DAG) • Each node represents one of the system variables. • Each variable can assume certain values (i.e., states) and each state is associated with a probability (discrete or continuous). • A link joining two nodes is directional and represents a causal influence (e.g., X depends on A or A influences X).
Bayesian Nets (cont’d) • Why are they useful? • They allow us to decompose a high-dimensional probability density into a product of lower-dimensional conditional densities, e.g.: P(a3, b1, x2, c3, d2) = P(a3)P(b1)P(x2/a3,b1)P(c3/x2)P(d2/x2)
Bayesian Nets (cont’d) • What is the Markov property? • “Each node is conditionally independent of its ancestors given its parents” • Why is the Markov property important?
Computing Joint Probabilities • We can compute the probability of any configuration of variables in the joint density distribution: e.g., P(a3, b1, x2, c3, d2)=P(a3)P(b1)P(x2 /a3,b1)P(c3 /x2)P(d2 /x2)= 0.25 x 0.6 x 0.4 x 0.5 x 0.4 = 0.012
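As a tiny worked check of the joint-probability computation (the numeric values are the ones quoted on the slide; the variable names are just descriptive placeholders), the factorization is evaluated by multiplying the corresponding conditional-probability-table entries:

```python
# Entries of the factorization P(a3)P(b1)P(x2/a3,b1)P(c3/x2)P(d2/x2)
p_a3, p_b1 = 0.25, 0.6
p_x2_given_a3_b1 = 0.4
p_c3_given_x2 = 0.5
p_d2_given_x2 = 0.4

joint = p_a3 * p_b1 * p_x2_given_a3_b1 * p_c3_given_x2 * p_d2_given_x2
print(joint)  # 0.012
```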
Inference Example • Compute probabilities when some information is missing (i.e., some variables are unobserved)
Bayesian Nets (cont’d) • Given a problem, you should know how to: • Design the structure of the Bayesian Network (i.e., identify variables and their dependences) • Compute various probabilities (i.e., inference)
Parameter Estimation • What is the goal of parameter estimation? • Estimate the parameters of the class probability models. • What are the main parameter estimation methods we discussed in class? • Maximum Likelihood (ML) • Bayesian Estimation (BE)
Parameter Estimation (cont’d) • Compare ML with BE: • ML assumes that the values of the parameters θ are fixed but unknown • The best estimate is obtained by maximizing p(D/θ) • BE assumes that the parameters θ are random variables with some known a-priori distribution p(θ) • BE estimates a distribution rather than making a point estimate like ML • Note: the estimated distribution p(x/D) might not be of the assumed form
ML Estimation • Using the independence assumption: p(D/θ) = Πk p(xk/θ) • Using the log-likelihood: ln p(D/θ) = Σk ln p(xk/θ) • Find θ̂ by maximizing ln p(D/θ): ∇θ ln p(D/θ) = 0
Maximum A-Posteriori Estimator (MAP) • Assuming a known prior p(θ), MAP maximizes p(D/θ)p(θ), i.e., the posterior p(θ/D) • Find θ̂MAP by maximizing ln[p(D/θ)p(θ)] • When is MAP equivalent to ML? • When p(θ) is uniform
Multivariate Gaussian Density, θ = μ • ML estimate: μ̂ = (1/n) Σk xk • MAP estimate: maximize ln[p(D/μ)p(μ)]; with a Gaussian prior on μ, the solution shrinks the sample mean toward the prior mean μ0
Multivariate Gaussian Density, θ = (μ, Σ) • ML estimates: μ̂ = (1/n) Σk xk and Σ̂ = (1/n) Σk (xk − μ̂)(xk − μ̂)^T
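A minimal NumPy sketch (not from the slides) of the ML estimates for θ = (μ, Σ): the sample mean and the biased (1/n) sample covariance. The random data standing in for the training set D is purely illustrative.

```python
import numpy as np

def gaussian_ml_estimate(X):
    """ML estimates for a multivariate Gaussian from an n x d data matrix X."""
    n = X.shape[0]
    mu_hat = X.mean(axis=0)             # (1/n) sum_k x_k
    diff = X - mu_hat
    sigma_hat = (diff.T @ diff) / n     # (1/n) sum_k (x_k - mu)(x_k - mu)^T
    return mu_hat, sigma_hat

# Hypothetical usage with random data standing in for D.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
mu_hat, sigma_hat = gaussian_ml_estimate(X)
```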
BE Estimation • Step 1: Compute p(θ/D): p(θ/D) = p(D/θ)p(θ) / ∫ p(D/θ)p(θ) dθ • Step 2: Compute p(x/D): p(x/D) = ∫ p(x/θ) p(θ/D) dθ
Interpretation of BE Solution • The BE solution implies that if we are uncertain about the exact value of θ, we should consider a weighted average of p(x/θ) over all possible values of θ. • The samples D exert their influence on p(x/D) through p(θ/D).
Incremental Learning • p(θ/D) can be computed recursively: p(θ/D^n) ∝ p(xn/θ) p(θ/D^(n−1)), n = 1, 2, …, with p(θ/D^0) = p(θ)
Relation to ML solution • If p(D/θ) peaks sharply at θ̂ (i.e., the ML solution), then p(θ/D) will, in general, peak sharply at θ̂ too (assuming p(θ) is broad and smooth) • Therefore, ML can be viewed as a special case of BE!
Univariate Gaussian, θ = μ • p(μ/D) ~ N(μn, σn²) with μn = (nσ0² μ̂n + σ² μ0)/(nσ0² + σ²) and σn² = σ0²σ²/(nσ0² + σ²), where μ̂n is the sample mean • p(x/D) ~ N(μn, σ² + σn²) • μn → μ̂n (the ML estimate) and σn² → 0 as n → ∞
Multivariate Gaussian, θ = μ • The BE solution converges to the ML solution as n → ∞
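To illustrate the recursive BE update for a univariate Gaussian mean with known variance (a sketch under the usual conjugate-Gaussian assumptions; the prior N(0, 10), true mean 2, and known variance 1 are arbitrary choices), each step treats the previous posterior as the new prior, exactly as in p(θ/D^n) ∝ p(xn/θ) p(θ/D^(n−1)). The posterior mean approaches the sample mean (the ML estimate) and the posterior variance shrinks toward 0.

```python
import numpy as np

def update_posterior(mu_prior, var_prior, x, var_known):
    """One conjugate-Gaussian step: combine the previous posterior N(mu_prior, var_prior)
    with the likelihood of a new sample x drawn from N(theta, var_known)."""
    precision = 1.0 / var_prior + 1.0 / var_known
    var_post = 1.0 / precision
    mu_post = var_post * (mu_prior / var_prior + x / var_known)
    return mu_post, var_post

# Hypothetical run: prior N(0, 10), true mean 2, known variance 1.
rng = np.random.default_rng(1)
mu_n, var_n = 0.0, 10.0
for x in rng.normal(loc=2.0, scale=1.0, size=50):
    mu_n, var_n = update_posterior(mu_n, var_n, x, var_known=1.0)
print(mu_n, var_n)   # mu_n approaches the sample mean; var_n -> 0
```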
Computational Complexity (dimensionality d, n training samples, c classes) • The BE approach has higher learning complexity than the ML approach • Both have the same classification complexity
Main Sources of Error in Classifier Design • Bayes error • The error due to overlapping densities p(x/ωi) • Model error • The error due to choosing an incorrect model. • Estimation error • The error due to incorrectly estimated parameters.
Dimensionality Reduction • What is the goal of dimensionality reduction and why is it useful? • Reduce the dimensionality of the data by eliminating redundant and irrelevant features • Fewer training samples needed, faster classification • How is dimensionality reduction performed? • Map the data to a lower-dimensional sub-space through a linear (or non-linear) transformation y = U^T x, where x ∈ R^N, U is N×K, and y ∈ R^K • Alternatively, select a subset of the original features.
PCA and LDA • What is the main difference between PCA and LDA? • PCA seeks a projection that preserves as much information in the data as possible. • LDA seeks a projection that best separates the data.
PCA • What is the PCA solution? • “Largest” eigenvectors (i.e., corresponding to the largest eigenvalues - principal components) of the covariance matrix of the training data. • You need to know the steps of PCA, its geometric interpretation, and how to choose the number of principal components.
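The PCA steps can be summarized in a short NumPy sketch (illustrative only, not the course's exact code): center the data, form the sample covariance, take the eigenvectors with the largest eigenvalues as the principal components, and project y = U^T(x − mean). A common way to choose k is to keep enough eigenvalues to preserve a target fraction (e.g., 90–95%) of the total variance.

```python
import numpy as np

def pca(X, k):
    """PCA on an n x d data matrix X: returns the k largest eigenvectors of the
    sample covariance, the projected data, and the sorted eigenvalues."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = (Xc.T @ Xc) / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # sort descending
    U = eigvecs[:, order[:k]]                # d x k principal components
    Y = Xc @ U                               # n x k projections y = U^T (x - mean)
    return U, Y, eigvals[order]
```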
Face Recognition Using PCA • You need to know how to apply PCA for face recognition and face detection. • What practical issue arises when applying PCA for face recognition? How do we deal with it? • The covariance matrix AA^T is typically very large (i.e., N²×N² for N×N images) • Consider the alternative matrix A^T A, which is only M×M (M is the number of training face images)
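A sketch of the A^T A trick (a hypothetical helper following the standard eigenfaces derivation): if v is an eigenvector of A^T A with eigenvalue λ, then Av is an eigenvector of AA^T with the same eigenvalue, so the top eigenfaces can be obtained from the small M×M problem and then normalized.

```python
import numpy as np

def eigenfaces(A, k):
    """A: (N^2 x M) matrix whose columns are centered training face images.
    Computes the top-k eigenvectors of A A^T via the small M x M matrix A^T A."""
    small = A.T @ A                          # M x M instead of N^2 x N^2
    eigvals, V = np.linalg.eigh(small)
    order = np.argsort(eigvals)[::-1][:k]
    U = A @ V[:, order]                      # eigenvectors of A A^T: u_i = A v_i
    U /= np.linalg.norm(U, axis=0)           # normalize each eigenface
    return U, eigvals[order]
```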