
Presentation Transcript


  1. Low-Rank Kernel Learning with Bregman Matrix Divergences. Brian Kulis, Matyas A. Sustik and Inderjit S. Dhillon. Journal of Machine Learning Research 10 (2009) 341-376. Presented by: Peng Zhang, 4/15/2011

  2. Outline • Motivation • Major Contributions • Preliminaries • Algorithms • Discussions • Experiments • Conclusions

  3. Motivation • Low-rank matrix nearness problems • Learning low-rank positive semidefinite (kernel) matrices for machine learning applications • Divergences (distances) between data objects • Find divergence measures suited to particular classes of matrices • Efficiency • Positive semidefinite (PSD), often low-rank, matrices are common in machine learning with kernel methods • Existing learning techniques enforce the positive semidefinite constraint explicitly, which is computationally expensive • Goal: bypass the explicit constraint by using divergences that enforce positive semidefiniteness automatically

  4. Major Contributions • Goal • Efficient algorithms that find a PSD (kernel) matrix as 'close' as possible to an input PSD matrix under equality or inequality constraints • Proposals • Use the LogDet divergence and the von Neumann divergence for PSD matrix learning • Use Bregman projections to handle the constraints under these divergences • Properties of the proposed algorithms • Range-space preserving: the rank of the output equals the rank of the input, so the rank is never decreased • Computationally efficient: per-iteration running times are linear in the number of data points n and quadratic in the rank of the input kernel

  5. Preliminaries • Kernel methods • Inner products in feature space • The only information needed is the kernel matrix K • K is always PSD • If K is low rank, a low-rank decomposition can be used to improve computational efficiency (see the note below) • This motivates low-rank kernel matrix learning
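
As background (not shown on the slide, but standard and consistent with the paper's setup), the kernel matrix stores all pairwise inner products in feature space, and a low-rank kernel admits a cheap factored representation:

  K_ij = φ(x_i)^T φ(x_j),    K = G G^T with G ∈ R^(n×r), r ≤ n

Storing G takes O(nr) memory instead of O(n^2), and products such as K v = G (G^T v) cost O(nr) instead of O(n^2) time.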

  6. Preliminaries • Bregman vector divergences • Extension to Bregman matrix divergences • Intuitively, a Bregman divergence can be thought of as the difference between the value of the generating function F at point x and the value of the first-order Taylor expansion of F around point y, evaluated at point x.
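
The divergence definitions on this slide appear only as images. For reference, writing the generating convex function as φ (the slide's F) and its matrix extension as Φ, the standard definitions consistent with the paper's setup are:

  D_φ(x, y) = φ(x) − φ(y) − ∇φ(y)^T (x − y)
  D_Φ(X, Y) = Φ(X) − Φ(Y) − tr( ∇Φ(Y)^T (X − Y) )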

  7. Preliminaries • Special Bregman matrix divergences • The von Neumann divergence (DvN) • The LogDet divergence (Dld) • Both are defined here for full-rank matrices
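
The formulas for these two divergences are likewise missing from the transcript. For full-rank positive definite n × n matrices X and Y they are standardly given by (log denotes the matrix logarithm):

  D_vN(X, Y) = tr( X log X − X log Y − X + Y )
  D_ld(X, Y) = tr( X Y^(−1) ) − log det( X Y^(−1) ) − n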

  8. Preliminaries • Important properties of DvN and Dld • The divergences are defined only over positive definite matrices X, so no explicit positive definiteness constraint is needed • Range-space preserving property • Scale invariance of LogDet • Transformation invariance • Others • They allow going beyond the transductive setting: the learned kernel function can be evaluated on new data points

  9. Preliminaries • Spectral Bregman matrix divergences • The generating convex function is a spectral function: a scalar convex function applied to the eigenvalues of the matrix • The resulting Bregman matrix divergence can be expressed in terms of the eigenvalues and eigenvectors of the two matrices

  10. Preliminaries • Kernel matrix learning problem of this paper: learn a kernel matrix over all data points from side information (labels or constraints) • With an explicit rank constraint the problem is non-convex • It becomes convex when using the LogDet or von Neumann divergence, because the rank constraint is enforced implicitly by the divergence • The constraints of interest bound squared Euclidean distances between pairs of points • Each such constraint matrix A is rank one, and the problem can be written as sketched below
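
The optimization problem itself is shown only as an image on the slide. A sketch of its general form, using assumed notation (K_0 the input kernel, A_i and b_i the linear constraints, c the number of constraints, r the target rank), is:

  minimize_K   D_φ(K, K_0)
  subject to   tr(K A_i) ≤ b_i,   i = 1, ..., c
               rank(K) ≤ r,   K ⪰ 0

For a squared-distance constraint between points i and j, A = (e_i − e_j)(e_i − e_j)^T, so tr(K A) = K_ii + K_jj − 2 K_ij, the squared Euclidean distance between the two points in feature space.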

  11. Preliminaries • Bregman projections • A method for solving the version of the previous problem without the rank constraint • Cycle through the constraints, choosing one constraint at a time • Perform a Bregman projection so that the current solution satisfies that constraint (the general projection condition is sketched after this slide) • With the LogDet and von Neumann divergences, these projections can be computed efficiently • Convergence is guaranteed, but may require many iterations
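
The projection formula on the slide is not reproduced in the transcript. In general terms, and hedged as a reconstruction from the standard Bregman projection framework rather than the paper's exact statement, the projection of the current iterate X_t onto a single constraint tr(X A_i) ≤ b_i satisfies a gradient correction

  ∇Φ(X_(t+1)) = ∇Φ(X_t) + α A_i,

where the scalar α is chosen so that X_(t+1) meets the constraint (with a dual-variable correction to keep α feasible for inequality constraints).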

  12. Preliminaries • Bregman divergences for low-rank matrices • Must handle matrices with zero eigenvalues • The divergence may become infinite when the range spaces of the two matrices are incompatible • A finite divergence therefore implicitly constrains the range space, and hence the rank

  13. Preliminaries • Rank-deficient LogDet and von Neumann divergences • Rank-deficient Bregman projections • von Neumann: (projection formula shown on the slide) • LogDet: (projection formula shown on the slide)

  14. Algorithm Using LogDet • Cyclic projection algorithm using the LogDet divergence • The update for each projection can be simplified (see the sketch after this slide) • The range space is unchanged and no eigen-decomposition is required • The full-matrix update (equation (21) in the paper) costs O(n^2) operations per iteration • Efficiency is improved by maintaining a factored n x r matrix G • The factored update can be done with a Cholesky rank-one update, giving O(r^3) complexity • A further refinement, combining the Cholesky rank-one update with the matrix multiplication, reduces the cost to O(r^2)
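
The update equations themselves are not reproduced in this transcript. Purely as an illustration, the following is a minimal NumPy sketch of the full-matrix O(n^2) LogDet Bregman projection onto a single rank-one equality constraint z^T K z = b; the factored O(r^2) version described on the slide maintains the factor G instead of K, and the function and variable names used here are hypothetical, not the paper's.

import numpy as np

def logdet_projection(K, z, b):
    """Project the PSD kernel K onto {K' : z^T K' z = b} under the LogDet divergence.

    Setting grad(K') = grad(K) + alpha * z z^T for the LogDet generating
    function and applying the Sherman-Morrison formula yields a rank-one
    correction K' = K + beta * (K z)(K z)^T with beta = (b - p) / p^2,
    where p = z^T K z is the current constraint value.
    """
    Kz = K @ z                      # O(n^2) matrix-vector product
    p = float(z @ Kz)               # current value of z^T K z
    beta = (b - p) / (p * p)        # chosen so that z^T K' z = b
    return K + beta * np.outer(Kz, Kz)

# Toy usage: pull the squared distance between points 0 and 1 toward 0.5.
K = np.eye(4)
z = np.zeros(4)
z[0], z[1] = 1.0, -1.0              # encodes K_00 + K_11 - 2*K_01
K = logdet_projection(K, z, b=0.5)
print(z @ K @ z)                    # approximately 0.5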

  15. Algorithm Using LogDet • Each projection updates the factor as G ← G L, where L comes from the Cholesky rank-one update • Over all iterations, G = G0 B, where B is the product of all the L matrices and X0 = G0 G0^T • L can be determined implicitly

  16. Algorithm Using LogDet • The slide shows the pseudocode with cost annotations of O(cr^2) and O(nr^2) • What are the constraints, and how is convergence determined? • Convergence is checked by how much v has changed • May require a large number of iterations

  17. Algorithm Using von Neumann • Cyclic projection algorithm using the von Neumann divergence • The update for each projection (shown on the slide) can be rewritten in a modified form • To compute the projection parameter, find the unique root of a scalar function (see the sketch after this slide)
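
The update equations here are also images. Hedged as a reconstruction from the projection condition sketched earlier, not the paper's exact notation, the von Neumann projection onto tr(X A) = b takes the form

  X_(t+1) = exp( log X_t + α A ),

where exp and log are the matrix exponential and logarithm, and α is the unique root of the scalar function f(α) = tr( exp(log X_t + α A) A ) − b, found numerically.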

  18. Algorithm Using von Neumann • The slide shows the pseudocode (Algorithm 3) • Slightly slower than Algorithm 2: each projection is still O(r^2), but the root-finding step slows the process down

  19. Discussions • Limitations of Algorithm 2 and Algorithm 3 • The initial kernel matrix must be low-rank • Not applicable to dimensionality reduction • The number of iterations may be large • This paper only optimizes the computation within each iteration; reducing the total number of iterations is left as future work • Handling new data points • Transductive setting: all data points are available up front, and some of them have labels or other supervision • When a new data point is added, the entire kernel matrix would have to be re-learned • Workaround: view B as a linear transformation and apply B to new points

  20. Discussions • Generalizations to other constraints • Slack variables • When the number of constraints is large, the Bregman divergence minimization problem may have no feasible solution • Introducing slack variables allows constraints to be violated, at a penalty • Similarity constraints and distance constraints (formulas shown on the slide) cost O(r^2) per projection • Arbitrary linear constraints cost O(nr) per projection

  21. Discussions • Special cases • The DefiniteBoost optimization problem • Online PCA • The nearest correlation matrix problem • Connection between LogDet divergence minimization and semidefinite programming (SDP) • The SDP relaxation of the min-balanced-cut problem can be solved via LogDet divergence minimization

  22. Experiments • Transductive learning and clustering • Data sets • Digits: handwritten samples of digits 3, 8 and 9 from the UCI repository • GyrB: protein data set with three bacteria species • Spambase: 4601 email messages with 57 attributes and spam/not-spam labels • Nursery: 12960 instances with 8 attributes and 5 class labels • Classification: k-nearest-neighbor classifier • Clustering: kernel k-means algorithm, evaluated with the normalized mutual information (NMI) measure

  23. Experiments • Learn a kernel matrix using only constraints • The low-rank kernels learned by the proposed algorithms attain accurate clustering and classification • The original data are used to form the initial kernel matrix • The more constraints used, the more accurate the results • Convergence • von Neumann divergence: convergence was attained in 11 cycles for 30 constraints and 105 cycles for 420 constraints • LogDet divergence: between 17 and 354 cycles

  24. Simulation Results • The slide shows significant improvements, with 0.948 classification accuracy • For comparison, DefiniteBoost required 3220 cycles to converge

  25. Simulation Results • Results shown for rank-57 and rank-8 kernels • LogDet needs fewer constraints • LogDet converges much more slowly (addressing this is future work), but often has lower overall running time

  26. Simulation Results • Metric learning and large-scale experiments • Learning a low-rank kernel with the same range space is equivalent to learning a linear transformation of the input data • The proposed algorithms are compared with metric learning algorithms: • Metric learning by collapsing classes (MCML) • Large-margin nearest neighbor metric learning (LMNN) • A squared Euclidean distance baseline

  27. Conclusions • Developed LogDet and von Neumann divergence based algorithms for low-rank matrix nearness problems • Per-iteration running times are linear in the number of data points and quadratic in the rank of the kernel • The algorithms can be used in conjunction with a number of kernel-based learning algorithms

  28. Thank you
