
Density Ratio Estimation: A New Versatile Tool for Machine Learning


Presentation Transcript


  1. Density Ratio Estimation: A New Versatile Tool for Machine Learning. Masashi Sugiyama, Department of Computer Science, Tokyo Institute of Technology. sugi@cs.titech.ac.jp, http://sugiyama-www.cs.titech.ac.jp/~sugi (Talk given at CMU.)

  2. Overview of My Talk (1) • Consider the ratio of two probability densities: r(x) = p_nu(x) / p_de(x). • If this ratio can be estimated, a wide variety of machine learning problems can be solved: • Non-stationarity adaptation, domain adaptation, multi-task learning, outlier detection, change detection in time series, feature selection, dimensionality reduction, independent component analysis, conditional density estimation, classification, and two-sample testing.

  3. Overview of My Talk (2) • Vapnik's principle: when solving a problem of interest, one should not solve a more general problem as an intermediate step. • Estimating the density ratio is substantially easier than estimating the two densities separately. • Various direct density-ratio estimation methods have been developed recently. [Diagram: knowing the two densities implies knowing the ratio, but not vice versa.]

  4. Organization of My Talk • Usage of Density Ratios: • Covariate shift adaptation • Outlier detection • Mutual information estimation • Conditional probability estimation • Methods of Density Ratio Estimation

  5. Covariate Shift Adaptation Shimodaira (JSPI2000) • Training and test input distributions differ, but the target function remains unchanged: p_tr(x) ≠ p_te(x) while p(y|x) is common — an "extrapolation" setting. [Figure: training and test input densities, the target function, a fitted function, and training/test samples.]

  6. Adaptation Using Density Ratios • Ordinary least squares is not consistent under covariate shift. • Least squares weighted by the density ratio w(x) = p_te(x) / p_tr(x) is consistent. • The same importance-weighting trick is applicable to any likelihood-based method! (A code sketch follows below.)
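As an illustration (a minimal sketch, not code from the talk), density-ratio-weighted least squares has a closed form; the toy 1-d data, the np.sinc target, and the oracle Gaussian weights below are illustrative assumptions standing in for an estimated ratio:

```python
import numpy as np

def iw_least_squares(X, y, weights, reg=1e-6):
    """Importance-weighted least squares for a linear-in-parameters model.

    Each training loss term is multiplied by the density-ratio weight
    w(x) = p_test(x) / p_train(x), which restores consistency under
    covariate shift.  Closed form: theta = (X^T W X)^{-1} X^T W y.
    """
    W = np.diag(weights)
    A = X.T @ W @ X + reg * np.eye(X.shape[1])
    b = X.T @ W @ y
    return np.linalg.solve(A, b)

# Toy demo: train where x is concentrated near 1, test near 2.
rng = np.random.default_rng(0)
x_tr = rng.normal(1.0, 0.5, 200)
y_tr = np.sinc(x_tr) + 0.1 * rng.normal(size=200)

def gauss(x, m, s):
    return np.exp(-(x - m) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

# Oracle weights p_te/p_tr of two known Gaussians (in practice these
# would come from a direct density-ratio estimator).
w = gauss(x_tr, 2.0, 0.5) / gauss(x_tr, 1.0, 0.5)
Phi = np.stack([np.ones_like(x_tr), x_tr], axis=1)  # linear basis
theta = iw_least_squares(Phi, y_tr, w)
```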

  7. Model Selection • Controlling the bias-variance trade-off is important in covariate shift adaptation: • No weighting: low variance, high bias. • Density-ratio weighting: low bias, high variance. • Density-ratio weighting also plays an important role in unbiased model selection: • Akaike information criterion (regular models) Shimodaira (JSPI2000) • Subspace information criterion (linear models) Sugiyama & Müller (Stat&Dec.2005) • Cross-validation (arbitrary models) Sugiyama, Krauledat & Müller (JMLR2007). (See the cross-validation sketch below.)
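A minimal sketch of importance-weighted cross-validation, assuming the density-ratio weights have already been estimated; the synthetic data, the placeholder weights, and the KernelRidge candidates are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.kernel_ridge import KernelRidge

def iwcv_score(make_model, X, y, weights, n_splits=5):
    """Importance-weighted cross-validation (IWCV).

    Held-out squared errors are reweighted by the density ratio
    w(x) = p_test(x) / p_train(x), giving an (almost) unbiased
    estimate of the test error under covariate shift.
    """
    losses = []
    for tr, te in KFold(n_splits, shuffle=True, random_state=0).split(X):
        model = make_model().fit(X[tr], y[tr])
        err = (model.predict(X[te]) - y[te]) ** 2
        losses.append(np.mean(weights[te] * err))  # weighted held-out loss
    return np.mean(losses)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
w = np.ones(100)  # placeholder weights; use an estimated ratio in practice
best_alpha = min([0.01, 0.1, 1.0],
                 key=lambda a: iwcv_score(lambda: KernelRidge(alpha=a), X, y, w))
```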

  8. Active Learning • Covariate shift naturally occurs in active learning scenarios: • The training input distribution is designed by the user. • The test input distribution is determined by the environment. • Density-ratio weighting plays an important role in low-bias active learning: • Population-based methods: Wiens (JSPI2000), Kanamori & Shimodaira (JSPI2003), Sugiyama (JMLR2006) • Pool-based methods: Kanamori (NeuroComp2007), Sugiyama & Nakajima (MLJ2009)

  9. Real-world Applications • Age prediction from faces (with NECsoft): illumination and angle change (KLS&CV). Ueki, Sugiyama & Ihara (ICPR2010) • Speaker identification (with Yamaha): voice quality change (KLR&CV). Yamada, Sugiyama & Matsui (SP2010) • Brain-computer interface: mental condition change (LDA&CV). Sugiyama, Krauledat & Müller (JMLR2007), Li, Kambara, Koike & Sugiyama (IEEE-TBE2010)

  10. Real-world Applications (cont.) • Japanese text segmentation (with IBM): domain adaptation from general conversation to medical text (CRF&CV). Example segmentation: こんな・失敗・は・ご・愛敬・だ・よ ("a slip like this is forgivable"). Tsuboi, Kashima, Hido, Bickel & Sugiyama (JIP2008) • Semiconductor wafer alignment (with Nikon): optical marker allocation (AL). Sugiyama & Nakajima (MLJ2009) • Robot control: dynamic motion update (KLS&CV) Hachiya, Akiyama, Sugiyama & Peters (NN2009), Hachiya, Peters & Sugiyama (ECML2009); optimal exploration (AL) Akiyama, Hachiya & Sugiyama (NN2010)

  11. Organization of My Talk • Usage of Density Ratios: • Covariate shift adaptation • Outlier detection • Mutual information estimation • Conditional probability estimation • Methods of Density Ratio Estimation

  12. Inlier-based Outlier Detection Hido, Tsuboi, Kashima, Sugiyama & Kanamori (ICDM2008), Smola, Song & Teo (AISTATS2009) • Test samples whose tendency differs from that of the regular (inlier) samples are regarded as outliers. • The density ratio between the inlier density and the test density serves as an outlier score: test points with small ratio values are flagged. [Figure: inlier cloud with one outlier marked.]
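A sketch of one common formulation, scoring test points by an estimated ratio between the inlier and test densities; here the ratio comes from a logistic-regression classifier (the probabilistic classification approach discussed later in the talk), and the Gaussian/uniform toy data are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def outlier_scores(X_inlier, X_test, C=1.0):
    """Inlier-based outlier scores via a density ratio.

    Estimate r(x) = p_inlier(x) / p_test(x) by discriminating the two
    sample sets; test points with small r(x) look unlike the regular
    data and are flagged as outliers.
    """
    X = np.vstack([X_inlier, X_test])
    y = np.r_[np.ones(len(X_inlier)), np.zeros(len(X_test))]  # 1 = inlier set
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    p = clf.predict_proba(X_test)[:, 1]
    # Bayes' rule: p_in(x)/p_te(x) = [P(in|x)/P(te|x)] * (n_te/n_in)
    return (p / (1 - p)) * (len(X_test) / len(X_inlier))

rng = np.random.default_rng(2)
X_in = rng.normal(0, 1, size=(300, 2))
X_te = np.vstack([rng.normal(0, 1, size=(95, 2)),
                  rng.uniform(4, 6, size=(5, 2))])   # 5 planted outliers
scores = outlier_scores(X_in, X_te)
print("5 most outlying test indices:", np.argsort(scores)[:5])
```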

  13. Real-world Applications • Steel plant diagnosis (with JFE Steel) • Printer roller quality control (with Canon) • Loan customer inspection (with IBM) • Sleep therapy from biometric data. Takimoto, Matsugu & Sugiyama (DMSS2009), Hido, Tsuboi, Kashima, Sugiyama & Kanamori (KAIS2010), Kawahara & Sugiyama (SDM2009)

  14. Organization of My Talk • Usage of Density Ratios: • Covariate shift adaptation • Outlier detection • Mutual information estimation • Conditional probability estimation • Methods of Density Ratio Estimation

  15. Mutual Information Estimation Suzuki, Sugiyama & Tanaka (ISIT2009) • Mutual information (MI): MI(X, Y) = ∬ p(x, y) log [ p(x, y) / (p(x) p(y)) ] dx dy. • MI as an independence measure: MI(X, Y) = 0 if and only if x and y are statistically independent. • MI can be approximated using the density ratio r(x, y) = p(x, y) / (p(x) p(y)): estimate r directly, then average log r(x, y) over the joint samples.
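As a sketch (not the estimator from the talk, which uses KL/least-squares ratio matching), MI can be approximated by estimating r(x, y) with a classifier that discriminates joint samples from permutation-decoupled samples, then averaging log r over the joint samples; the quadratic features and toy data are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def mi_estimate(x, y, seed=0):
    """Density-ratio-based MI estimation (sketch).

    r(x, y) = p(x, y) / (p(x) p(y)) is estimated by discriminating joint
    samples (x_i, y_i) from samples (x_i, y_perm(i)) with y permuted,
    which breaks the dependence.  Then MI ~= (1/n) sum_i log r(x_i, y_i).
    Quadratic features let the model express Gaussian log-ratios.
    """
    rng = np.random.default_rng(seed)
    joint = np.column_stack([x, y])
    prod = np.column_stack([x, rng.permutation(y)])  # ~ p(x) p(y)
    Z = np.vstack([joint, prod])
    lab = np.r_[np.ones(len(joint)), np.zeros(len(prod))]
    clf = make_pipeline(PolynomialFeatures(2),
                        LogisticRegression(max_iter=1000)).fit(Z, lab)
    p = clf.predict_proba(joint)[:, 1]
    return np.mean(np.log(p / (1 - p)))  # class sizes equal, prior ratio 1

rng = np.random.default_rng(3)
x = rng.normal(size=2000)
y = x + 0.5 * rng.normal(size=2000)        # dependent pair
print(mi_estimate(x, y))                   # clearly positive
print(mi_estimate(x, rng.permutation(y)))  # near zero
```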

  16. Usage of MI Estimator • Independence between input and output: feature selection Suzuki, Sugiyama, Sese & Kanamori (BMC Bioinformatics 2009), sufficient dimension reduction Suzuki & Sugiyama (AISTATS2010). • Independence between inputs: independent component analysis. Suzuki & Sugiyama (ICA2009) • Independence between input and residual: causal inference. Yamada & Sugiyama (AAAI2010) [Diagram: x–y for feature selection and sufficient dimension reduction, x–x' for independent component analysis, y–ε for causal inference.]

  17. Organization of My Talk • Usage of Density Ratios: • Covariate shift adaptation • Outlier detection • Mutual information estimation • Conditional probability estimation • Methods of Density Ratio Estimation

  18. Conditional Probability Estimation • Conditional probability as a density ratio: p(y|x) = p(x, y) / p(x). • When y is continuous: conditional density estimation, e.g., mobile robots' transition estimation. Sugiyama, Takeuchi, Suzuki, Kanamori, Hachiya & Okanohara (AISTATS2010) • When y is categorical: probabilistic classification. Sugiyama (Submitted) [Figure: a pattern assigned to Class 1 with probability 80% and Class 2 with probability 20%.]

  19. Organization of My Talk • Usage of Density Ratios • Methods of Density Ratio Estimation: • Probabilistic Classification Approach • Moment Matching Approach • Ratio Matching Approach • Comparison • Dimensionality Reduction

  20. Density Ratio Estimation • Density ratios have been shown to be versatile. • In practice, however, the density ratio must be estimated from data. • Naïve approach: estimate the two densities p_nu and p_de separately and take their ratio.

  21. Vapnik's Principle • When solving a problem, do not solve a more difficult problem as an intermediate step. • Estimating the density ratio is substantially easier than estimating the two densities. • We therefore estimate the density ratio directly, without going through density estimation. [Diagram: knowing the densities implies knowing the ratio, but not vice versa.]

  22. Organization of My Talk • Usage of Density Ratios • Methods of Density Ratio Estimation: • Probabilistic Classification Approach • Moment Matching Approach • Ratio Matching Approach • Comparison • Dimensionality Reduction

  23. Probabilistic Classification Qin (Biometrika1998) • Separate numerator and denominator samples with a probabilistic classifier; by Bayes' theorem, p_nu(x) / p_de(x) = [p(y = nu | x) / p(y = de | x)] × (n_de / n_nu). • Logistic regression achieves the minimum asymptotic variance when the model is correct. (A sketch follows below.)
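A minimal runnable sketch of this approach with scikit-learn's LogisticRegression; the toy Gaussian data are an assumption:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ratio_by_classification(X_nu, X_de, C=1.0):
    """Density-ratio estimation via probabilistic classification.

    Train a classifier to separate numerator samples (label 1) from
    denominator samples (label 0); Bayes' theorem turns the posterior
    into the ratio:  r(x) = [P(1|x) / P(0|x)] * (n_de / n_nu).
    """
    X = np.vstack([X_nu, X_de])
    y = np.r_[np.ones(len(X_nu)), np.zeros(len(X_de))]
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)

    def r(X_query):
        p = clf.predict_proba(X_query)[:, 1]
        return (p / (1 - p)) * (len(X_de) / len(X_nu))
    return r

rng = np.random.default_rng(4)
X_de = rng.normal(0.0, 1.0, size=(1000, 1))
X_nu = rng.normal(1.0, 1.0, size=(1000, 1))
r = ratio_by_classification(X_nu, X_de)
# True ratio of N(1,1)/N(0,1) is exp(x - 0.5), i.e., 1 at x = 0.5:
print(r(np.array([[0.5]])))
```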

  24. Organization of My Talk • Usage of Density Ratios • Methods of Density Ratio Estimation: • Probabilistic Classification Approach • Moment Matching Approach • Ratio Matching Approach • Comparison • Dimensionality Reduction

  25. Moment Matching • Match the moments of p_nu(x) and of r(x) p_de(x), where r is the ratio model. • Example, matching the mean: ∫ x p_nu(x) dx = ∫ x r(x) p_de(x) dx.

  26. Kernel Mean Matching Huang, Smola, Gretton, Borgwardt & Schölkopf (NIPS2006) • Matching a finite number of moments does not yield the true ratio, even asymptotically. • Kernel mean matching (KMM): all moments are matched at once in a Gaussian RKHS by minimizing ‖ (1/n_de) Σ_i r(x_i^de) φ(x_i^de) − (1/n_nu) Σ_j φ(x_j^nu) ‖² over the ratio values, where φ is the kernel feature map. • The ratio values at the denominator samples can then be estimated by a quadratic program. • Variants that learn the entire ratio function and use other loss functions are also available. Kanamori, Suzuki & Sugiyama (arXiv2009) (A sketch follows below.)
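A simplified sketch of KMM under these definitions; the original QP also constrains the weights to average approximately to one, which is omitted here for brevity, and a generic bounded optimizer stands in for a dedicated QP solver:

```python
import numpy as np
from scipy.optimize import minimize

def kmm_weights(X_de, X_nu, sigma=1.0, B=10.0):
    """Kernel mean matching (sketch).

    Find nonnegative, bounded weights w on the denominator samples so
    that the weighted kernel mean of X_de matches the kernel mean of
    X_nu.  Expanding the RKHS norm gives the quadratic objective
        min_w  0.5 w' K w - kappa' w,   0 <= w_i <= B,
    with K_ij = k(x_i^de, x_j^de) and
    kappa_i = (n_de / n_nu) * sum_j k(x_i^de, x_j^nu).
    """
    def gram(A, C):
        d2 = ((A[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    n_de, n_nu = len(X_de), len(X_nu)
    K = gram(X_de, X_de)
    kappa = (n_de / n_nu) * gram(X_de, X_nu).sum(axis=1)
    obj = lambda w: 0.5 * w @ K @ w - kappa @ w
    grad = lambda w: K @ w - kappa
    res = minimize(obj, np.ones(n_de), jac=grad,
                   bounds=[(0, B)] * n_de, method="L-BFGS-B")
    return res.x  # estimated ratio values at the denominator samples
```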

  27. Organization of My Talk • Usage of Density Ratios • Methods of Density Ratio Estimation: • Probabilistic Classification Approach • Moment Matching Approach • Ratio Matching Approach • Kullback-Leibler Divergence • Squared Distance • Comparison • Dimensionality Reduction

  28. Ratio Matching Sugiyama, Suzuki & Kanamori (RIMS2010) • Match a ratio model r_θ with the true ratio r* under the Bregman divergence generated by a convex function f: BR_f = ∫ p_de(x) [ f(r*(x)) − f(r_θ(x)) − f′(r_θ(x)) (r*(x) − r_θ(x)) ] dx. • Ignoring the constant term, its empirical approximation is (1/n_de) Σ_i [ f′(r_θ(x_i^de)) r_θ(x_i^de) − f(r_θ(x_i^de)) ] − (1/n_nu) Σ_j f′(r_θ(x_j^nu)).
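The empirical criterion translates directly into code; this generic helper (an illustrative sketch) is specialized by the KL and SQ choices of f on the following slides:

```python
import numpy as np

def br_empirical(f, fprime, r_de, r_nu):
    """Empirical Bregman ratio-matching loss (constant term dropped).

    r_de: ratio-model values at the denominator samples x^de
    r_nu: ratio-model values at the numerator samples x^nu
    f, fprime: the convex generator and its derivative.
    KL corresponds to f(t) = t*log(t) - t; squared distance to
    f(t) = (t - 1)**2 / 2 (see the next slides).
    """
    return np.mean(fprime(r_de) * r_de - f(r_de)) - np.mean(fprime(r_nu))
```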

  29. Organization of My Talk • Usage of Density Ratios • Methods of Density Ratio Estimation: • Probabilistic Classification Approach • Moment Matching Approach • Ratio Matching Approach • Kullback-Leibler Divergence • Squared Distance • Comparison • Dimensionality Reduction

  30. Kullback-Leibler Divergence (KL) Sugiyama, Nakajima, Kashima, von Bünau & Kawanabe (NIPS2007), Nguyen, Wainwright & Jordan (NIPS2007) • Put f(t) = t log t − t. Then BR yields the KL criterion: (1/n_de) Σ_i r_θ(x_i^de) − (1/n_nu) Σ_j log r_θ(x_j^nu). • For a linear model r_θ(x) = Σ_l θ_l φ_l(x) with nonnegative basis functions (e.g., Gaussian kernels), minimization of the KL criterion is convex.
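A minimal sketch of this criterion with a Gaussian-kernel linear model fitted by projected gradient descent; note that the speaker's KLIEP algorithm uses a slightly different constrained formulation, and the bandwidth and step size below are arbitrary assumptions:

```python
import numpy as np

def kl_ratio_fit(X_de, X_nu, sigma=0.5, lr=0.1, iters=2000):
    """Direct ratio estimation with the KL criterion (minimal sketch).

    Model: r_theta(x) = sum_l theta_l * k(x, c_l), Gaussian kernels
    centred at the numerator samples, theta >= 0.  Minimize
        (1/n_de) sum_i r(x_i^de) - (1/n_nu) sum_j log r(x_j^nu)
    by projected gradient descent.
    """
    def Phi(X, C):
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    P_de, P_nu = Phi(X_de, X_nu), Phi(X_nu, X_nu)
    theta = np.ones(len(X_nu)) / len(X_nu)
    for _ in range(iters):
        r_nu = P_nu @ theta
        grad = P_de.mean(axis=0) - (P_nu / r_nu[:, None]).mean(axis=0)
        theta = np.maximum(theta - lr * grad, 1e-12)  # keep theta >= 0
    return lambda X: Phi(X, X_nu) @ theta
```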

  31. Properties Nguyen, Wainwright & Jordan (NIPS2007), Sugiyama, Suzuki, Nakajima, Kashima, von Bünau & Kawanabe (AISM2008) • Parametric case: the learned parameter converges to the optimal value at order n^{-1/2}. • Non-parametric case: the learned function converges to the optimal function at an order slightly slower than n^{-1/2} (depending on the covering number or the bracketing entropy of the function class).

  32. Experiments • Setup: d-dimensional Gaussians with identity covariance: • Denominator: mean (0, 0, 0, …, 0). • Numerator: mean (1, 0, 0, …, 0). • Ratio of kernel density estimators (RKDE): estimate the two densities separately and take their ratio; the Gaussian width is chosen by cross-validation. • KL method: estimate the density ratio directly; the Gaussian width is chosen by cross-validation. (A sketch of the setup appears below.)
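A sketch of this setup, assuming kl_ratio_fit from the earlier KL sketch is in scope; fixed bandwidths stand in for cross-validation, for brevity:

```python
import numpy as np
from scipy.stats import gaussian_kde

# d-dim Gaussians with identity covariance, denominator mean 0,
# numerator mean (1, 0, ..., 0).  Compare the naive ratio of KDEs
# with the direct KL estimator.
def true_ratio(X):
    return np.exp(X[:, 0] - 0.5)  # N((1,0,...,0), I) / N(0, I)

d, n = 5, 500
rng = np.random.default_rng(5)
X_de = rng.normal(0, 1, size=(n, d))
X_nu = rng.normal(0, 1, size=(n, d))
X_nu[:, 0] += 1.0

# Naive: ratio of two kernel density estimates.
kde_nu, kde_de = gaussian_kde(X_nu.T), gaussian_kde(X_de.T)
r_kde = kde_nu(X_de.T) / kde_de(X_de.T)

# Direct: KL-based ratio estimation (kl_ratio_fit defined earlier).
r_direct = kl_ratio_fit(X_de, X_nu, sigma=1.5)(X_de)

t = true_ratio(X_de)
print("RKDE   NMSE:", np.mean((r_kde - t) ** 2) / np.mean(t ** 2))
print("Direct NMSE:", np.mean((r_direct - t) ** 2) / np.mean(t ** 2))
```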

  33. Accuracy as a Function of Input Dimensionality • RKDE suffers from the "curse of dimensionality". • KL works well even as the dimensionality grows. [Plot: normalized MSE versus input dimensionality for RKDE and KL.]

  34. Variations • EM algorithms for the KL criterion are also available with: • log-linear models Tsuboi, Kashima, Hido, Bickel & Sugiyama (JIP2009) • Gaussian mixture models Yamada & Sugiyama (IEICE2009) • probabilistic PCA mixture models Yamada, Sugiyama, Wichern & Jaak (Submitted)

  35. Organization of My Talk • Usage of Density Ratios • Methods of Density Ratio Estimation: • Probabilistic Classification Approach • Moment Matching Approach • Ratio Matching Approach • Kullback-Leibler Divergence • Squared Distance • Comparison • Dimensionality Reduction

  36. Squared Distance (SQ) Kanamori, Hido & Sugiyama (JMLR2009) • Put f(t) = (t − 1)² / 2. Then BR yields the squared-distance criterion: (1/2n_de) Σ_i r_θ(x_i^de)² − (1/n_nu) Σ_j r_θ(x_j^nu) (+ constant). • For a linear model r_θ(x) = θᵀφ(x), the minimizer of SQ is analytic: θ̂ = (Ĥ + λI)⁻¹ ĥ, with Ĥ = (1/n_de) Σ_i φ(x_i^de) φ(x_i^de)ᵀ and ĥ = (1/n_nu) Σ_j φ(x_j^nu). • Computationally efficient!
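A minimal sketch of the analytic solution (mirroring, in simplified form, the unconstrained least-squares approach of Kanamori, Hido & Sugiyama); kernel centers at the numerator samples and the fixed hyperparameters are assumptions:

```python
import numpy as np

def sq_ratio_fit(X_de, X_nu, sigma=0.5, lam=1e-3):
    """Squared-distance (least-squares) ratio fitting, analytic solution.

    Linear model r_theta(x) = theta' phi(x) with Gaussian kernel basis
    centred at the numerator samples.  Minimizing the empirical SQ
    criterion plus an l2 penalty gives
        theta = (H + lam * I)^{-1} h,
    H = (1/n_de) sum_i phi(x_i^de) phi(x_i^de)',
    h = (1/n_nu) sum_j phi(x_j^nu).
    """
    def Phi(X, C):
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    P_de, P_nu = Phi(X_de, X_nu), Phi(X_nu, X_nu)
    H = P_de.T @ P_de / len(X_de)
    h = P_nu.mean(axis=0)
    theta = np.linalg.solve(H + lam * np.eye(len(h)), h)
    return lambda X: np.maximum(Phi(X, X_nu) @ theta, 0)  # clip negatives

rng = np.random.default_rng(6)
X_de = rng.normal(0, 1, size=(500, 1))
X_nu = rng.normal(1, 1, size=(500, 1))
r = sq_ratio_fit(X_de, X_nu, sigma=1.0)
print(r(np.array([[0.5]])))  # true value exp(0.5 - 0.5) = 1
```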

  37. Properties Kanamori, Hido & Sugiyama (JMLR2009) Kanamori, Suzuki & Sugiyama (arXiv2009) • Optimal parametric/non-parametric rates of convergence are achieved. • Leave-one-out cross-validation score can be computed analytically. • SQ has the smallest condition number. • Analytic solution allows computation of the derivative of mutual information estimator: • Sufficient dimension reduction • Independent component analysis • Causal inference

  38. Organization of My Talk • Usage of Density Ratios • Methods of Density Ratio Estimation: • Probabilistic Classification Approach • Moment Matching Approach • Ratio Matching Approach • Comparison • Dimensionality Reduction

  39. Density Ratio Estimation Methods • Comparing the approaches above (LR, KMM, KL, SQ), SQ would be preferable in practice: it has an analytic solution, analytic cross-validation, and favorable numerical properties.

  40. Organization of My Talk • Usage of Density Ratios • Methods of Density Ratio Estimation: • Probabilistic Classification Approach • Moment Matching Approach • Ratio Matching Approach • Comparison • Dimensionality Reduction

  41. Direct Density Ratio Estimation with Dimensionality Reduction (D3) • Direct density-ratio estimation, without going through density estimation, is promising. • However, for high-dimensional data, density ratio estimation is still challenging. • Idea: combine direct density-ratio estimation with dimensionality reduction!

  42. Hetero-distributional Subspace (HS) • Key assumption: p_nu(x) and p_de(x) differ only within a subspace, called the hetero-distributional subspace (HS). • Decompose x into within-subspace and complementary components, u = Ux and v = Vx, with [Uᵀ Vᵀ]ᵀ full-rank and orthogonal; the ratio then depends only on u: r(x) = p_nu(u) / p_de(u). • This allows us to estimate the density ratio only within the low-dimensional HS!

  43. HS Search Based on Supervised Dimension Reduction Sugiyama, Kawanabe & Chui (NN2010) • Within the HS, p_nu and p_de differ, so numerator and denominator samples should be separable there. • We may therefore use any supervised dimension reduction method for HS search, treating numerator/denominator membership as a class label. • Local Fisher discriminant analysis (LFDA) Sugiyama (JMLR2007): • can handle within-class multimodality, • places no limitation on the reduced dimensionality, • has an analytic solution.

  44. Pros and Cons of Supervised Dimension Reduction Approach • Pros: SQ+LFDA gives an analytic solution for D3. • Cons: • Rather restrictive implicit assumption that the within-subspace component u and its complement v are independent. • Separability does not always identify the HS (e.g., two overlapping but different densities differ without being separable). [Figure: separable vs. non-separable examples of the HS.]

  45. Characterization of HS Sugiyama, Hara, von Bünau, Suzuki, Kanamori & Kawanabe (SDM2010) • The HS is given as the maximizer, with respect to the projection U, of the Pearson divergence PE[p_nu(u), p_de(u)] = (1/2) ∫ p_de(u) ( p_nu(u) / p_de(u) − 1 )² du. • Thus we can identify the HS by finding the maximizer of PE with respect to U. • PE can be analytically approximated by the SQ estimator!

  46. HS Search • Gradient descent. • Natural gradient descent Amari (NeuralComp1998): utilizes the Stiefel-manifold structure for efficiency Plumbley (NeuroComp2005). • Givens rotation: choose a pair of axes and rotate in the spanned subspace. • Subspace rotation: ignore rotation within the subspaces for efficiency. [Diagram: rotation across vs. within the hetero-distributional subspace.]

  47. Experiments [Figures: true ratio (2d), samples (2d), plain SQ estimate (2d), D3-SQ estimate (2d); and a comparison of plain SQ vs. D3-SQ as the dimensionality is increased by adding noisy dimensions.]

  48. Conclusions • Many machine learning tasks can be solved via density ratio estimation: non-stationarity adaptation, domain adaptation, multi-task learning, outlier detection, change detection in time series, feature selection, dimensionality reduction, independent component analysis, conditional density estimation, classification, and two-sample testing. • Directly estimating density ratios, without going through density estimation, is the key. • Main tools: LR, KMM, KL, SQ, and D3.

  49. Books • Quiñonero-Candela, Sugiyama, Schwaighofer & Lawrence (Eds.), Dataset Shift in Machine Learning, MIT Press, 2009. • Sugiyama, von Bünau, Kawanabe & Müller, Covariate Shift Adaptation in Machine Learning, MIT Press (coming soon!) • Sugiyama, Suzuki & Kanamori, Density Ratio Estimation in Machine Learning, Cambridge University Press (coming soon!)

  50. The World of Density Ratios • Real-world applications: brain-computer interfaces, robot control, image understanding, speech recognition, natural language processing, bioinformatics. • Machine learning algorithms: covariate shift adaptation, outlier detection, conditional probability estimation, mutual information estimation. • Density ratio estimation: fundamental algorithms (LR, KMM, KL, SQ); large scale, high dimensionality, stabilization, robustification. • Theoretical analysis: convergence, information criteria, numerical stability.
