
Midterm Review


Presentation Transcript


  1. Midterm Review CS479/679 Pattern Recognition, Spring 2019 – Dr. George Bebis

  2. Reminders • Graduate students need to select a paper for presentation (about a 15-minute presentation) • Send me your top three choices by Thursday, March 28th • Presentations will be scheduled on April 30th and May 2nd • Check the posted guidelines when preparing your presentation • Guest Lectures • Dr. Tin Nguyen, March 28th (Bioinformatics) • Dr. Emily Hand, April 4th (Face Recognition) • Colloquium • Dr. George Vasmatzis, Mayo Clinic, April 19th

  3. Midterm Material • Intro to Pattern Recognition • Math review (probabilities, linear algebra) • Bayesian Decision Theory • Bayesian Networks • Parameter Estimation (ML and Bayesian) • Dimensionality Reduction • Feature Selection • Case studies are also included in the midterm

  4. Intro to Pattern Recognition (PR)

  5. Intro to Pattern Recognition (PR) • Definitions • Pattern, Class, Class model • Classification vs Clustering • PR applications • What are the main classification approaches? • Generative • Model p(x, ω); estimate P(ω/x) • Discriminative • Estimate P(ω/x) directly (x: features, ω: class)

  6. PR phases: Training & Testing

  7. Some Important Issues • Feature Extraction • Model Selection (i.e., simple vs complex) • Generalization

  8. Feature Extraction • Which features? • Discriminative • How many? • Curse of dimensionality • Dimensionality reduction • Feature selection • Missing features • Marginalization (i.e., compute P(ωi/xg))
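A minimal sketch of marginalizing a missing ("bad") feature, using a hypothetical discrete joint table; all probability values below are made up for illustration:

```python
import numpy as np

# Hypothetical joint probabilities P(omega_i, x_g, x_b) for two classes,
# two values of the observed ("good") feature x_g and two values of the
# missing ("bad") feature x_b. All numbers are made up for illustration.
joint = np.array([
    # x_b=0  x_b=1
    [[0.10, 0.05],    # omega_1, x_g = 0
     [0.25, 0.15]],   # omega_1, x_g = 1
    [[0.05, 0.15],    # omega_2, x_g = 0
     [0.10, 0.15]],   # omega_2, x_g = 1
])                    # shape (class, x_g, x_b); entries sum to 1

x_g = 1  # observed value of the good feature; x_b is missing

# Marginalize out x_b: P(omega_i, x_g) = sum_b P(omega_i, x_g, x_b)
p_class_and_good = joint[:, x_g, :].sum(axis=1)

# Normalize by P(x_g) to get P(omega_i / x_g)
posterior = p_class_and_good / p_class_and_good.sum()
print(posterior)  # approx. [0.615, 0.385]
```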

  9. Simple vs Complex Models • Complex models are tuned to the particular training samples, rather than to the characteristics of the true model (overfitting or memorization).

  10. Generalization • The ability of a classifier to produce correct results on novel patterns. • How can we improve generalization performance? • More training examples (i.e., they lead to better model estimates). • Simpler models usually yield better performance.

  11. Math Review

  12. Probabilities • Prior and conditional probabilities • Law of total probability • Bayes rule • Random variables • pdf/pmf and PDF • Independence • Marginalization • Multivariate Gaussian • Covariance matrix decomposition • Whitening transformation
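A minimal numpy sketch of covariance decomposition and the whitening transformation on synthetic data (the covariance values are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated 2-D Gaussian data (illustrative parameters only)
true_cov = np.array([[3.0, 1.2],
                     [1.2, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=1000)

# Covariance decomposition: Sigma = Phi Lambda Phi^T
Sigma = np.cov(X, rowvar=False)
eigvals, Phi = np.linalg.eigh(Sigma)          # eigh: Sigma is symmetric

# Whitening transform: Aw = Phi Lambda^(-1/2); y = Aw^T (x - mean)
Aw = Phi @ np.diag(1.0 / np.sqrt(eigvals))
Y = (X - X.mean(axis=0)) @ Aw

print(np.cov(Y, rowvar=False))  # approximately the identity matrix
```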

  13. Linear Algebra • Vector dot product • Orthogonal/Orthonormal vectors • Linear dependence/independence • Space spanning • Vector basis • Matrices (diagonal, symmetric, transpose, inverse, trace, rank) • Eigenvalues/Eigenvectors • Matrix diagonalization/decomposition
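A short numpy sketch of eigendecomposition and diagonalization of a symmetric matrix (the matrix entries are arbitrary):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])          # symmetric matrix (arbitrary example)

eigvals, V = np.linalg.eigh(A)      # columns of V are orthonormal eigenvectors
Lambda = np.diag(eigvals)

# Diagonalization: A = V Lambda V^T  (V^{-1} = V^T since A is symmetric)
print(np.allclose(A, V @ Lambda @ V.T))   # True
print(np.allclose(V.T @ V, np.eye(2)))    # True: orthonormal eigenvectors
```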

  14. Bayesian Decision Theory

  15. Decision Rule Using Bayes Rule • Decide ω1 if P(ω1/x) > P(ω2/x); otherwise decide ω2 • The Bayes rule is optimum (i.e., it minimizes the average probability of error P(error) = ∫ P(error/x) p(x) dx), where P(error/x) = min[P(ω1/x), P(ω2/x)]
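A small sketch of this decision rule for two hypothetical 1-D Gaussian classes (all parameters are assumed for illustration):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical priors and class-conditional densities
priors = np.array([0.6, 0.4])                 # P(omega_1), P(omega_2)
likelihoods = [norm(loc=0.0, scale=1.0),      # p(x / omega_1)
               norm(loc=2.0, scale=1.5)]      # p(x / omega_2)

def posteriors(x):
    """P(omega_i / x) via Bayes rule."""
    joint = np.array([priors[i] * likelihoods[i].pdf(x) for i in range(2)])
    return joint / joint.sum()

x = 1.1
post = posteriors(x)
decision = np.argmax(post) + 1   # decide omega_1 if P(omega_1/x) > P(omega_2/x)
p_error = post.min()             # P(error/x) = min[P(omega_1/x), P(omega_2/x)]
print(decision, post, p_error)
```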

  16. Decision Rule Using Conditional Risk • Suppose λ(αi/ωj) is the loss (or cost) incurred for taking action αi when the true class is ωj • The expected loss (or conditional risk) of taking action αi: R(αi/x) = Σj λ(αi/ωj) P(ωj/x)

  17. Decision Rule Using Conditional Risk • Bayes decision rule minimizes overall risk R by: • Computing R(αi /x) for every αi given an x • Choosing the action αi with the minimum R(αi /x)
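A small sketch of the minimum-risk rule, assuming a hypothetical loss matrix and hypothetical posteriors:

```python
import numpy as np

# Hypothetical loss matrix: rows = actions alpha_i, columns = true classes omega_j
loss = np.array([[0.0, 2.0],    # lambda(alpha_1/omega_1), lambda(alpha_1/omega_2)
                 [1.0, 0.0]])   # lambda(alpha_2/omega_1), lambda(alpha_2/omega_2)

# Hypothetical posteriors P(omega_j / x) for some observed x
posteriors = np.array([0.7, 0.3])

# Conditional risk: R(alpha_i / x) = sum_j lambda(alpha_i/omega_j) P(omega_j/x)
risks = loss @ posteriors
best_action = np.argmin(risks)   # Bayes rule: choose the minimum-risk action
print(risks, best_action)        # risks = [0.6, 0.7] -> choose alpha_1
```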

  18. Zero-One Loss Function • Assign the same loss to all errors: λ(αi/ωj) = 0 if i = j, 1 if i ≠ j • The conditional risk corresponding to this loss function: R(αi/x) = Σj≠i P(ωj/x) = 1 − P(ωi/x) • Decide ω1 if P(ω1/x) > P(ω2/x); otherwise decide ω2

  19. Discriminant Functions • Assign a feature vector x to class ωi if gi(x) > gj(x) for all j ≠ i • Examples: gi(x) = P(ωi/x), or any monotonically increasing function of it, e.g., gi(x) = ln p(x/ωi) + ln P(ωi)

  20. Discriminant Function for Multivariate Gaussian • Case Σi = σ²I – linear discriminant: gi(x) = −||x − μi||²/(2σ²) + ln P(ωi) • Decision boundary: a hyperplane (need to know how to derive it) • Special case of equal priors: minimum-distance classifier (assign x to the class with the nearest mean)

  21. Discriminant Function for Multivariate Gaussian (cont’d) • Case Σi = Σ – linear discriminant: gi(x) = −(1/2)(x − μi)^T Σ^-1 (x − μi) + ln P(ωi) • Decision boundary: a hyperplane (need to know how to derive it) • Special case of equal priors: Mahalanobis-distance classifier

  22. Discriminant Function for Multivariate Gaussian (cont’d) • Case Σi = arbitrary – quadratic discriminant: gi(x) = −(1/2)(x − μi)^T Σi^-1 (x − μi) − (1/2) ln|Σi| + ln P(ωi) • Decision boundaries are hyperquadrics (need to know how to derive them)
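A sketch of evaluating the quadratic discriminant for two classes with assumed (made-up) parameters; the constant −(d/2) ln 2π term is dropped since it does not affect the decision:

```python
import numpy as np

def quadratic_discriminant(x, mu, Sigma, prior):
    """g_i(x) = -1/2 (x-mu)^T Sigma^-1 (x-mu) - 1/2 ln|Sigma| + ln P(omega_i)."""
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(Sigma) @ diff
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

# Hypothetical 2-D class parameters (made up for illustration)
mu1, Sigma1, P1 = np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 2.0]]), 0.5
mu2, Sigma2, P2 = np.array([2.0, 2.0]), np.array([[2.0, 0.5], [0.5, 1.0]]), 0.5

x = np.array([1.0, 1.5])
g1 = quadratic_discriminant(x, mu1, Sigma1, P1)
g2 = quadratic_discriminant(x, mu2, Sigma2, P2)
print("decide omega_1" if g1 > g2 else "decide omega_2")
```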

  23. Receiver Operating Characteristic (ROC) Curve

  24. Bayesian Belief Networks

  25. Bayesian Nets • How is it defined? • Directed acyclic graph (DAG) • Each node represents one of the system variables. • Each variable can assume certain values (i.e., states) and each state is associated with a probability (discrete or continuous). • A link joining two nodes is directional and represents a causal influence (e.g., X depends on A or A influences X).

  26. Bayesian Nets (cont’d) • Why are they useful? • They allow us to decompose a high-dimensional joint probability density function into lower-dimensional (conditional) densities, e.g., P(a3, b1, x2, c3, d2)=P(a3)P(b1)P(x2 /a3,b1)P(c3 /x2)P(d2 /x2)

  27. Bayesian Nets (cont’d) • What is the Markov property? • “Each node is conditionally independent of its ancestors given its parents” • Why is the Markov property important?

  28. Computing Joint Probabilities • We can compute the probability of any configuration of variables in the joint density distribution: e.g., P(a3, b1, x2, c3, d2)=P(a3)P(b1)P(x2 /a3,b1)P(c3 /x2)P(d2 /x2)= 0.25 x 0.6 x 0.4 x 0.5 x 0.4 = 0.012
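A tiny sketch reproducing this computation by multiplying the factors read off the network (the CPT values are the ones shown in the slide):

```python
# Factors from the network for the configuration (a3, b1, x2, c3, d2)
P_a3 = 0.25
P_b1 = 0.60
P_x2_given_a3_b1 = 0.40
P_c3_given_x2 = 0.50
P_d2_given_x2 = 0.40

# P(a3, b1, x2, c3, d2) = P(a3) P(b1) P(x2/a3,b1) P(c3/x2) P(d2/x2)
joint = P_a3 * P_b1 * P_x2_given_a3_b1 * P_c3_given_x2 * P_d2_given_x2
print(joint)  # 0.012
```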

  29. Inference Example • Compute probabilities when some variables are unobserved (missing information)

  30. Bayesian Nets (cont’d) • Given a problem, you should know how to: • Design the structure of the Bayesian Network (i.e., identify variables and their dependences) • Compute various probabilities (i.e., inference)

  31. Parameter Estimation

  32. Parameter Estimation • What is the goal of parameter estimation? • Estimate the parameters of the class probability models. • What are the main parameter estimation methods we discussed in class? • Maximum Likelihood (ML) • Bayesian Estimation (BE)

  33. Parameter Estimation (cont’d) • Compare ML with BE: • ML assumes that the values of the parameters θ are fixed but unknown • The best estimate is obtained by maximizing p(D/θ) • BE assumes that the parameters θ are random variables with some known a-priori distribution p(θ) • BE estimates a distribution p(θ/D) rather than making a point estimate like ML • Note: the estimated distribution p(x/D) might not be of the assumed form

  34. ML Estimation • Using the independence assumption: p(D/θ) = Πk p(xk/θ) • Using the log-likelihood: ln p(D/θ) = Σk ln p(xk/θ) • Find θ̂ by maximizing ln p(D/θ), i.e., set ∇θ ln p(D/θ) = 0

  35. Maximum A-Posteriori Estimator (MAP) • Assuming a known prior p(θ), MAP maximizes p(θ/D) ∝ p(D/θ)p(θ) • Find θ̂ by maximizing ln [p(D/θ)p(θ)] • When is MAP equivalent to ML? • When p(θ) is uniform

  36. Multivariate Gaussian Density, θ = μ (Σ known) • ML estimate: μ̂ = (1/n) Σk xk (the sample mean) • MAP estimate: a weighted combination of the sample mean and the prior mean; it reduces to the ML estimate when the prior p(μ) is uniform (broad)

  37. Multivariate Gaussian Density, θ = (μ, Σ) • ML estimates: μ̂ = (1/n) Σk xk and Σ̂ = (1/n) Σk (xk − μ̂)(xk − μ̂)^T (note the 1/n, i.e., biased, covariance estimate)
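A brief numpy sketch of these ML estimates on synthetic data (the true parameters are made up; note the 1/n covariance normalization):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training samples from a 2-D Gaussian (illustrative parameters)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=500)   # n x d

n = X.shape[0]
mu_hat = X.mean(axis=0)                  # mu_hat = (1/n) sum_k x_k
centered = X - mu_hat
Sigma_hat = (centered.T @ centered) / n  # ML covariance: 1/n, not 1/(n-1)

print(mu_hat)
print(Sigma_hat)
```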

  38. BE Estimation • Step 1: Compute p(θ/D) = p(D/θ)p(θ) / ∫ p(D/θ)p(θ) dθ • Step 2: Compute p(x/D) = ∫ p(x/θ) p(θ/D) dθ

  39. Interpretation of BE Solution • The BE solution implies that if we are less certain about the exact value of θ, we should consider a weighted average of p(x/θ) over the possible values of θ • The samples D exert their influence on p(x/D) through p(θ/D)

  40. Incremental Learning • p(θ/D) can be computed recursively: p(θ/D^n) ∝ p(xn/θ) p(θ/D^(n-1)), n = 1, 2, …, with p(θ/D^0) = p(θ)
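A sketch of this recursive update for a univariate Gaussian mean with known variance and a Gaussian prior (a conjugate case; the prior, the known variance, and the true mean are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

sigma2 = 1.0                    # known variance of p(x / mu)
mu_n, sigma2_n = 0.0, 10.0      # prior p(mu) = N(0, 10): broad, assumed for illustration
true_mu = 3.0                   # used only to generate synthetic samples

for n in range(1, 51):
    x_n = rng.normal(true_mu, np.sqrt(sigma2))
    # p(mu/D^n) proportional to p(x_n/mu) p(mu/D^(n-1)); Gaussian x Gaussian stays Gaussian
    new_sigma2_n = 1.0 / (1.0 / sigma2_n + 1.0 / sigma2)
    mu_n = new_sigma2_n * (mu_n / sigma2_n + x_n / sigma2)
    sigma2_n = new_sigma2_n

print(mu_n, sigma2_n)   # mu_n approaches the sample mean (ML estimate); sigma2_n shrinks
```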

  41. Relation to ML Solution • If p(D/θ) peaks sharply at θ̂ (i.e., the ML solution), then p(θ/D) will, in general, peak sharply at θ̂ too (assuming p(θ) is broad and smooth) • Therefore, ML can be viewed as a special (limiting) case of BE!

  42. Univariate Gaussian, θ = μ • The posterior p(μ/D) is Gaussian with mean μn and variance σn² • As n → ∞, μn approaches the sample mean (the ML estimate) and σn² decreases roughly as a constant over n, so the posterior peaks sharply around the ML estimate

  43. Multivariate Gaussian, θ = μ • The BE solution converges to the ML solution as n → ∞

  44. Computational Complexity (d: dimensionality, n: # training samples, c: # classes) • BE has higher learning complexity than ML • Both have the same classification complexity

  45. Main Sources of Error in Classifier Design • Bayes error • The error due to overlapping densities p(x/ωi) • Model error • The error due to choosing an incorrect model. • Estimation error • The error due to incorrectly estimated parameters.

  46. Dimensionality Reduction

  47. Dimensionality Reduction • What is the goal of dimensionality reduction and why is it useful? • Reduce the dimensionality of the data by eliminating redundant and irrelevant features • Fewer training samples needed, faster classification • How is dimensionality reduction performed? • Map the data to a lower-dimensional sub-space through a linear (or non-linear) transformation y = U^T x, where x ∈ R^N, U is N×K, and y ∈ R^K • Alternatively, select a subset of the original features

  48. PCA and LDA • What is the main difference between PCA and LDA? • PCA seeks a projection that preserves as much information in the data as possible. • LDA seeks a projection that best separates the data.

  49. PCA • What is the PCA solution? • “Largest” eigenvectors (i.e., corresponding to the largest eigenvalues - principal components) of the covariance matrix of the training data. • You need to know the steps of PCA, its geometric interpretation, and how to choose the number of principal components.
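A compact numpy sketch of the PCA steps on synthetic data: center, eigendecompose the covariance, sort by eigenvalue, choose K (here by preserving ~95% of the variance, an assumed threshold), and project:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))            # synthetic data: 200 samples, 5 features

# 1. Center the data
mean = X.mean(axis=0)
Xc = X - mean

# 2. Covariance matrix and its eigendecomposition
Sigma = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma)          # ascending order

# 3. Sort eigenvectors by decreasing eigenvalue (principal components)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Choose K, e.g. to preserve ~95% of the total variance (assumed threshold)
explained = np.cumsum(eigvals) / eigvals.sum()
K = int(np.searchsorted(explained, 0.95) + 1)

# 5. Project: y = U^T (x - mean), with U of size N x K
U = eigvecs[:, :K]
Y = Xc @ U
print(K, Y.shape)
```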

  50. Face Recognition Using PCA • You need to know how to apply PCA for face recognition and face detection. • What practical issue arises when applying PCA for face recognition? How do we deal with it? • The covariance matrix AA^T is typically very large (i.e., N²×N² for N×N images) • Consider the alternative matrix A^T A, which is only M×M (M is the number of training face images)
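A sketch of this trick: eigendecompose the M×M matrix A^T A and map its eigenvectors vi back to eigenfaces ui = A vi (image size and number of training faces are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

M, N = 20, 64                          # assumed: 20 training faces of size 64x64
faces = rng.random((M, N * N))         # each row: a vectorized face image

mean_face = faces.mean(axis=0)
A = (faces - mean_face).T              # A is N^2 x M (columns are centered faces)

# Instead of the huge N^2 x N^2 matrix A A^T, eigendecompose the M x M matrix A^T A
small = A.T @ A                        # M x M
eigvals, V = np.linalg.eigh(small)     # ascending eigenvalues
order = np.argsort(eigvals)[::-1]      # largest first
eigvals, V = eigvals[order], V[:, order]

# If (A^T A) v = lambda v, then (A A^T)(A v) = lambda (A v)
U = A @ V                              # columns are eigenvectors ("eigenfaces") of A A^T
U /= np.linalg.norm(U, axis=0)         # normalize to unit length

print(U.shape)                         # (4096, 20)
```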
