
A Survey on Distance Metric Learning (Part 1)




Presentation Transcript


  1. A Survey on Distance Metric Learning (Part 1) Gerry Tesauro IBM T.J. Watson Research Center

  2. Acknowledgement • Lecture material shamelessly adapted/stolen from the following sources: • Kilian Weinberger: • “Survey on Distance Metric Learning” slides • IBM summer intern talk slides (Aug. 2006) • Sam Roweis slides (NIPS 2006 workshop on “Learning to Compare Examples”) • Yann LeCun talk slides (NIPS 2006 workshop on “Learning to Compare Examples”)

  3. Outline Part 1 • Motivation and Basic Concepts • ML tasks where it’s useful to learn a distance metric • Overview of Dimensionality Reduction • Mahalanobis Metric Learning for Clustering with Side Info (Xing et al.) • Pseudo-metric online learning (Shalev-Shwartz et al.) • Neighbourhood Components Analysis (Goldberger et al.), Metric Learning by Collapsing Classes (Globerson & Roweis) • Metric Learning for Kernel Regression (Weinberger & Tesauro) • Metric learning for RL basis function construction (Keller et al.) • Similarity learning for image processing (LeCun et al.) Part 2

  4. Motivation • Many ML algorithms and tasks require a distance metric (equivalently, “dissimilarity” metric) • Clustering (e.g. k-means) • Classification & regression: • Kernel methods • Nearest neighbor methods • Document/text retrieval • Find most similar fingerprints in DB to given sample • Find most similar web pages to document/keywords • Nonlinear dimensionality reduction methods: • Isomap, Maximum Variance Unfolding, Laplacian Eigenmaps, etc.

  5. Motivation (2) • Many problems may lack a well-defined, relevant distance metric • Incommensurate features ⇒ Euclidean distance not meaningful • Side information ⇒ Euclidean distance not relevant • Learning distance metrics may thus be desirable • A sensible similarity/distance metric may be highly task-dependent or semantic-dependent • What do these data points “mean”? • What are we using the data for?

  6. Which images are most similar?

  7. It depends ... (right / centered / left)

  8. It depends ... (male / female)

  9. ... what you are looking for (student / professor)

  10. ... what you are looking for (nature background / plain background)

  11. Key DML Concept: Mahalanobis distance metric • The simplest mapping is a linear transformation x ↦ Lx, which induces the distance D(xi, xj) = ‖L(xi − xj)‖

  12. PSD Mahalanobis distance metric • The simplest mapping is a linear transformation x ↦ Lx • The induced squared distance is D²(xi, xj) = (xi − xj)ᵀ M (xi − xj), where M = LᵀL is positive semi-definite (PSD) • Algorithms can learn either matrix (L or M)
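To make the definition concrete, here is a minimal NumPy sketch (not from the slides) of the distance induced by a linear map L; the function name and toy values are illustrative only. Writing M = LᵀL makes the PSD property automatic.

```python
import numpy as np

def mahalanobis_dist(x_i, x_j, L):
    """Distance induced by the linear map x -> Lx.

    Equivalently D^2 = (x_i - x_j)^T M (x_i - x_j) with M = L^T L,
    which is positive semi-definite by construction.
    """
    diff = x_i - x_j
    return np.linalg.norm(L @ diff)

# Toy illustration: a diagonal L re-weights incommensurate features.
x_i = np.array([1.0, 200.0])   # e.g. [metres, grams]
x_j = np.array([2.0, 180.0])
L = np.diag([1.0, 0.01])       # hypothetical learned weighting
print(mahalanobis_dist(x_i, x_j, L))
```

Learning L directly keeps M PSD for free, whereas learning M directly requires an explicit PSD constraint, which is why both parameterizations appear in the algorithms below.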

  13. A >5-Minute Introduction to Dimensionality Reduction

  14. How can the dimensionality be reduced? • eliminate redundant features • eliminate irrelevant features • extract low dimensional structure

  15. Notation • Input: x1, …, xn ∈ ℝᵈ (high-dimensional) • Output: y1, …, yn ∈ ℝʳ with r ≪ d • Embedding principle: nearby points remain nearby, distant points remain distant • Estimate r.

  16. Two classes of DR algorithms Linear Non-Linear

  17. Linear dimensionality reduction

  18. Principal Component Analysis (Jolliffe 1986) Project data into subspace of maximum variance.

  19. Facts about PCA • Eigenvectors of covariance matrix C • Minimizes sum-of-squared reconstruction error • Dimensionality r can be estimated from eigenvalues of C • PCA requires meaningful scaling of input features
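As a concrete illustration of these facts, a minimal NumPy sketch of PCA via the eigenvectors of the covariance matrix C (function and variable names are my own, not from the slides):

```python
import numpy as np

def pca(X, r):
    """PCA via eigenvectors of the covariance matrix C.

    X: (n, d) data matrix; r: target dimensionality.
    Returns the r-dimensional projection and the eigenvalues of C,
    which can be inspected to choose r.
    """
    X_centered = X - X.mean(axis=0)
    C = np.cov(X_centered, rowvar=False)          # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]             # sort descending
    top = eigvecs[:, order[:r]]                   # top-r principal directions
    return X_centered @ top, eigvals[order]

# Example: project 5-D data onto its 2 directions of maximum variance.
X = np.random.randn(100, 5) * np.array([5.0, 2.0, 1.0, 0.1, 0.1])
Y, spectrum = pca(X, r=2)
print(Y.shape, spectrum)
```

The toy data also illustrates the last bullet: if the features were on wildly different scales for no meaningful reason, the leading eigenvectors would simply follow the scaling.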

  20. Multidimensional Scaling (MDS)

  21. Multidimensional Scaling (MDS)

  22. Multidimensional Scaling (MDS) • equivalent to PCA • use eigenvectors of inner-product matrix • requires only pairwise distances
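A short sketch of classical MDS, assuming only a pairwise distance matrix is available; double-centring the squared distances recovers an inner-product (Gram) matrix whose top eigenvectors give the embedding, which is why classical MDS on Euclidean distances is equivalent to PCA. Names are illustrative.

```python
import numpy as np

def classical_mds(D, r):
    """Classical MDS from an n x n pairwise distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    G = -0.5 * J @ (D ** 2) @ J                   # inner-product (Gram) matrix
    eigvals, eigvecs = np.linalg.eigh(G)
    order = np.argsort(eigvals)[::-1][:r]         # top-r eigenpairs
    lam = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(lam)

# Usage: only pairwise distances are needed, not raw coordinates.
X = np.random.randn(50, 10)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = classical_mds(D, r=2)
print(Y.shape)
```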

  23. Non-linear dimensionality reduction

  24. Non-linear dimensionality reduction

  25. From subspace to submanifold We assume the data is sampled from some manifold with fewer degrees of freedom. How can we find a truthful embedding?

  26. Approximate manifold with neighborhood graph

  27. Approximate manifold with neighborhood graph

  28. Isomap (Tenenbaum et al. 2000) • Compute shortest paths between all inputs along the neighborhood graph • Create geodesic distance matrix • Perform MDS with geodesic distances
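A compact sketch of the three Isomap steps, using SciPy's shortest-path routine for the geodesic distances; it assumes the k-nearest-neighbour graph is connected, and the helper name is illustrative rather than the authors' code.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors, r):
    """Minimal Isomap sketch: k-NN graph -> geodesic distances -> MDS."""
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # 1. Keep only edges to each point's k nearest neighbours
    #    (zero entries mean "no edge" for scipy's csgraph routines).
    W = np.zeros((n, n))
    nn = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]
    for i in range(n):
        W[i, nn[i]] = D[i, nn[i]]
    W = np.maximum(W, W.T)                        # symmetrise the graph

    # 2. Geodesic distances = shortest paths along the graph (Dijkstra).
    G_dist = shortest_path(W, method="D", directed=False)

    # 3. Classical MDS on the geodesic distance matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * J @ (G_dist ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(G)
    order = np.argsort(eigvals)[::-1][:r]
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0, None))
```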

  29. Maximum Variance Unfolding (MVU) Weinberger and Saul 2004

  30. Maximum Variance Unfolding (MVU) Weinberger and Saul 2004

  31. Optimization problem • Unfold the data by maximizing pairwise distances • Preserve local distances

  32. Optimization problem • Center the output (fixes the translational degree of freedom)

  33. Optimization problem • Problem: this optimization over the embedding is non-convex, with multiple local minima

  34. Optimization problem • Solution: a change of variables from the embedding Y to its inner-product (Gram) matrix K = YYᵀ turns the problem into a semidefinite program with a single global minimum
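A minimal sketch of the resulting semidefinite program, written with the generic solver cvxpy rather than the authors' implementation; the change of variables to the Gram matrix K is what makes every constraint and the objective linear in the variable, hence convex. Names are illustrative.

```python
import numpy as np
import cvxpy as cp

def mvu_embedding(X, neighbor_pairs, r):
    """Minimal MVU sketch: optimise over the Gram matrix K = Y Y^T.

    After centering, maximising the summed pairwise output distances
    equals maximising trace(K); distances to the given neighbour pairs
    are preserved exactly, and K must be positive semi-definite.
    """
    n = X.shape[0]
    K = cp.Variable((n, n), PSD=True)             # Gram matrix of the embedding
    constraints = [cp.sum(K) == 0]                # centering / translation invariance
    for i, j in neighbor_pairs:
        d2 = float(np.sum((X[i] - X[j]) ** 2))    # local input distance to preserve
        constraints.append(K[i, i] + K[j, j] - 2 * K[i, j] == d2)
    cp.Problem(cp.Maximize(cp.trace(K)), constraints).solve()

    # Recover the embedding from the top-r eigenvectors of K.
    eigvals, eigvecs = np.linalg.eigh(K.value)
    order = np.argsort(eigvals)[::-1][:r]
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0, None))
```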

  35. Unfolding the swiss-roll

  36. Mahalanobis Metric Learning for Clustering with Side Information (Xing et al. 2003) • Exemplars {xi, i=1,…,N} plus two types of side info: • “Similar” set S = { (xi, xj) } s.t. xi and xj are “similar” (e.g. same class) • “Dissimilar” set D = { (xi, xj) } s.t. xi and xj are “dissimilar” • Learn optimal Mahalanobis matrix M: D²(xi, xj) = (xi − xj)ᵀ M (xi − xj) (global distance function) • Goal: keep all pairs of “similar” points close, while separating all “dissimilar” pairs • Formulate as a constrained convex programming problem: • minimize the distance between the data pairs in S • subject to the data pairs in D being well separated

  37. MMC-SI (Cont’d) • Objective of learning: minimize Σ(i,j)∈S D²(xi, xj) subject to Σ(i,j)∈D D(xi, xj) ≥ 1 • M is positive semi-definite • Ensures non-negativity and the triangle inequality of the metric • The number of parameters is quadratic in the number of features • Difficult to scale to a large number of features • Significant danger of overfitting small datasets
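A sketch of this constrained convex program, again written with cvxpy as a stand-in solver (the original paper uses a projected-gradient method rather than an off-the-shelf solver); the pair lists and function name are illustrative.

```python
import numpy as np
import cvxpy as cp

def learn_mahalanobis(X, similar_pairs, dissimilar_pairs):
    """Sketch of the MMC-SI convex program for the Mahalanobis matrix M.

    Minimise the summed squared distances over the similar set S, subject
    to the dissimilar set D staying well separated and M being PSD.
    """
    d = X.shape[1]
    M = cp.Variable((d, d), PSD=True)            # Mahalanobis matrix

    def sq_dist(i, j):
        diff = X[i] - X[j]
        return diff @ M @ diff                   # affine in the entries of M

    # sum over S of squared distances: pulls similar pairs together.
    pull = sum(sq_dist(i, j) for i, j in similar_pairs)

    # sum over D of (non-squared) distances >= 1: keeps dissimilar pairs apart.
    push = sum(cp.sqrt(sq_dist(i, j)) for i, j in dissimilar_pairs)

    cp.Problem(cp.Minimize(pull), [push >= 1]).solve()
    return M.value
```

The quadratic growth of parameters mentioned above is visible here: M has d² entries, so both memory and the danger of overfitting grow quickly with the number of features.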

  38. Mahalanobis Metric for Clustering (MMC-SI) Xing et al., NIPS 2002

  39. MMC-SI Move similarly labeled inputs together

  40. MMC-SI Move differently labeled inputs apart

  41. Convex optimization problem

  42. Convex optimization problem target: Mahalanobis matrix

  43. Convex optimization problem pushing differently labeled inputs apart

  44. Convex optimization problem pulling similar points together

  45. Convex optimization problem ensuring positive semi-definiteness

  46. Convex optimization problem: the complete program is CONVEX, so it has a single global optimum

  47. Gradient Alternating Projection

  48. Gradient Alternating Projection Take step along gradient.

  49. Gradient Alternating Projection Take step along gradient. Project onto constraint-satisfying subspace.

  50. Gradient Alternating Projection Take step along gradient. Project onto constraint-satisfying subspace. Project onto PSD cone.
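A toy NumPy sketch of the iteration these slides build up. The PSD-cone projection (clip negative eigenvalues to zero) is shown exactly; for brevity the projection onto the constraint-satisfying subspace is folded into a plain gradient step on the two pair sets, so this is only an approximation of the actual Xing et al. routine, and all names and step sizes are illustrative.

```python
import numpy as np

def project_psd(M):
    """Project a matrix onto the PSD cone: symmetrise, then clip
    negative eigenvalues to zero."""
    M = (M + M.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(M)
    return (eigvecs * np.clip(eigvals, 0.0, None)) @ eigvecs.T

def metric_projected_gradient(X, similar, dissimilar, lr=0.01, n_iter=200):
    """Toy projected-gradient loop: take a gradient step that pulls
    similar pairs together and pushes dissimilar pairs apart, then
    project M back onto the PSD cone."""
    d = X.shape[1]
    M = np.eye(d)
    for _ in range(n_iter):
        grad = np.zeros((d, d))
        for i, j in similar:                     # pull similar pairs together
            diff = (X[i] - X[j])[:, None]
            grad += diff @ diff.T                # gradient of diff^T M diff
        for i, j in dissimilar:                  # push dissimilar pairs apart
            diff = (X[i] - X[j])[:, None]
            dist = np.sqrt(float(diff.T @ M @ diff)) + 1e-12
            grad -= (diff @ diff.T) / (2.0 * dist)   # gradient of sqrt(diff^T M diff)
        M = project_psd(M - lr * grad)
    return M
```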
