Statistical Machine Learning

Statistical Machine Learning • Qing Tao • June 2006 • Qing.tao@mail.ia.ac.cn • Institute of Automation, Chinese Academy of Sciences

Outline • Data mining and statistical learning; • Some algorithms and their interpretation; • Statistical learning algorithms for one-class problems.

Challenging problems • We are drowning in information and starving for knowledge!

Data Mining • Challenges in the areas of data storage, organization and searching have led to the new field of data mining. • Vast amounts of data are being generated in many fields, and our job is to extract important patterns and trends and to understand what the data says. This is called learning from data.

Machine Learning • The learning problems can be roughly categorized as supervised and unsupervised. • Supervised: classification, regression and ranking; • Unsupervised: one-class, clustering and PCA.

Application in PR • Pattern recognition system: sampling, preprocessing, feature extraction, classification.

Difference • Statistical pattern recognition: in terms of distributions, deductive inference; • Statistical machine learning: in terms of finite samples, inductive inference.

Biometrics • Biometrics refers to the automatic identification of a person based on his/her physiological or behavioral characteristics.

Bioinformatics • In the last few decades, advances in molecular biology and the equipment available for research in this field have allowed the increasingly rapid sequencing of large portions of the genomes. Popular sequence databases have been growing at exponential rates. • This deluge of information has necessitated the careful storage, organization and indexing of sequence information. Information science has been applied to biology to produce the field called Bioinformatics.

ISI • Intelligence and Security Informatics is an emerging field of study aimed at developing advanced information technologies, systems, algorithms, and databases for national- and homeland-security-related applications.

Confusion • Many researchers claim that they are studying statistical machine learning methods. • Perhaps their interest is only in applying statistical machine learning algorithms.

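Basic research in statistical machine learning • Study the general laws of statistical machine learning, or of one of its branches: theoretical foundations, algorithm implementation, use of prior knowledge, etc.; • Not restricted to any specific application, but providing theoretical and algorithmic support for applications in general; • These lectures are restricted to statistical machine learning itself!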

Machine learning community • To solve existing problems in machine learning; • To analyze different algorithms theoretically; • To develop new theory and learning algorithms for new problems.

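A personal note • Since 2001, I have been formally trained in statistical machine learning; • Yet to this day, every time I read papers in the Journal of Machine Learning Research, I feel the gap to the international level of machine learning research growing wider; • The main purpose of these lectures is to present my own impressions, experience and some preliminary research.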

Performance • Theory: generalization ability; • Experiments: test error.

More • Not only in terms of generalization; • But also in terms of implementation, speed, understandability, etc.

Theoretical Analysis • Model selection: estimating the performance of different models in order to choose the best one. • Model assessment: having chosen the model, estimating its prediction error on new data.

Ian Hacking • The quiet statisticians have changed our world: • not by discovering new facts or technical developments, but by changing the ways that we reason, experiment and form our opinions……

Statistical learning • Try to explain the algorithms in a statistical framework. • Not limited to statistical learning by Vapnik.

Andreas Buja • There is no true interpretation of anything: • Interpretation is a vehicle in the service of human comprehension. The value of interpretation is in enabling others to fruitfully think about the idea.

Interpretation of Algorithms • Almost all the learning algorithms can be illustrated theoretically and intuitively; • The probabilistic and geometric explanations not only help us to understand the algorithms theoretically and intuitively, but also motivate us to develop elegant and practical new algorithms.

Main references • N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge: Cambridge University Press. 2000. • T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning. Springer. 2001.

Main kinds of theory • Bayesian decision theory; • Bias + Variance decomposition; • Generalization bound theory; • MDL (Minimum Description Length) • …….

Definition of classification • Assumption: (x_i, y_i) i.i.d. • Hypothesis space: H • Loss function: • Objective function:

Definition of regression • Assumption: (x_i, y_i) i.i.d. • Hypothesis space: H • Loss function: • Objective function:
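
The loss and objective formulas on the two definition slides above did not survive extraction. A common formulation, given here as an assumption and not as the slides' own notation, uses the 0-1 loss for classification, the squared loss for regression, and the expected risk over the hypothesis space H as the objective, with the empirical risk as its finite-sample counterpart:

```latex
\begin{align*}
\text{Classification:}\quad & L(y, f(x)) = \mathbb{1}[\, y \neq f(x) \,], \\
\text{Regression:}\quad & L(y, f(x)) = (y - f(x))^2, \\
\text{Expected risk:}\quad & R(f) = \mathbb{E}_{(x,y)}\, L(y, f(x)), \qquad f \in H, \\
\text{Empirical risk:}\quad & R_{\mathrm{emp}}(f) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i)).
\end{align*}
```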

Several well-known algorithms • K-Nearest Neighbor; • LMS (Least Mean Square); • Ridge regression; • Fisher Discriminant Analysis; • Neural Networks; • Support Vector Machines and boosting.

Framework of algorithms • The selection of hypothesis space: from simple to complex. Linear to nonlinear (neural networks, kernel); single to ensemble (boosting); pointwise to continuous (one-class problems); local to global (KNN and LMS).

Design of algorithms • Usually, an algorithm for a more complex hypothesis space should reduce to a specific algorithm for the simple hypothesis space. • The algorithm for the simple hypothesis space serves as the starting point of the complete framework.

Bayesian: classification • Under the Bayes rule: • Minimizing the average error rate.

Bayesian: regression • EPE: expected prediction error • Optimal solution:
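
The formulas on the two Bayesian slides above are missing. The standard results, in the notation of The Elements of Statistical Learning from the main references, are sketched below. The choice of 0-1 loss for classification and squared loss for regression is an assumption about what the slides showed:

```latex
\begin{align*}
\text{Classification (0-1 loss):}\quad & f^*(x) = \arg\max_{k} P(Y = k \mid X = x), \\
\text{Regression (squared loss):}\quad & \mathrm{EPE}(f) = \mathbb{E}\,(Y - f(X))^2, \qquad f^*(x) = \mathbb{E}[\, Y \mid X = x \,].
\end{align*}
```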

Estimating densities • Knowledge of the density functions would allow us to solve any problem that can be solved on the basis of the available data; • Vapnik's principle: never solve a problem that is more general than the one you actually need to solve.

KNN • KNN rule for regression: • KNN rule for classification: majority voting
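
The two KNN rules can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and the usual rules (average of the k nearest responses for regression, majority vote for classification); the function names and the toy data are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_indices(X_train, x0, k):
    """Indices of the k training points closest to x0 (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x0, axis=1)
    return np.argsort(dists)[:k]

def knn_regress(X_train, y_train, x0, k):
    """KNN regression: average the responses of the k nearest neighbours."""
    return float(np.mean(y_train[knn_indices(X_train, x0, k)]))

def knn_classify(X_train, y_train, x0, k):
    """KNN classification: majority vote among the k nearest labels."""
    votes = Counter(y_train[knn_indices(X_train, x0, k)].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): points on a line with y = 2x and a binary label.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_reg = np.array([0.0, 2.0, 4.0, 6.0, 8.0])
y_cls = np.array([0, 0, 0, 1, 1])

pred_reg = knn_regress(X, y_reg, np.array([1.1]), k=3)   # neighbours x = 1, 2, 0
pred_cls = knn_classify(X, y_cls, np.array([3.6]), k=3)  # neighbours x = 4, 3, 2
```

With an odd k and two classes the vote cannot tie; for other settings a tie-breaking rule would have to be chosen.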

Interpretation: KNN • KNN: • Assuming that the classifier is well approximated by a locally constant function, and conditioning at a point is relaxed to conditioning on some region close to the target point.

LMS • LMS:

Interpretation: LMS • LMS: • Assuming that the classifier is well approximated by a globally linear function, and the expectation is approximated by averages over the training data.
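
The interpretation above can be made concrete with a small sketch. It assumes that "LMS" here means the batch least-squares fit of a linear function (not the iterative Widrow-Hoff update), so the expected squared error is replaced by its average over the training data. The function name and the toy data are hypothetical.

```python
import numpy as np

def fit_least_squares(X, y):
    """Return (intercept, weights) minimising the average squared error."""
    X1 = np.column_stack([np.ones(len(X)), X])      # prepend a constant column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)   # solves min ||X1 b - y||^2
    return beta[0], beta[1:]

# Noise-free toy data (hypothetical): y = 1 + 2*x1 - 3*x2
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + X @ np.array([2.0, -3.0])

b0, w = fit_least_squares(X, y)
```

Because the toy data are exactly linear, the fit recovers the generating coefficients up to rounding error; with noisy data it returns the best linear approximation in the squared-error sense.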

Fisher Discriminant Analysis • To seek a direction for which the projected samples are well separated. [Figure: classes ω1 and ω2 in the (x1, x2) plane and their projections y1, y2 onto a direction w.]

Interpretation: FDA • Generally speaking, it is not optimal. • FDA is the Bayes optimal solution if the two classes are normally distributed with equal covariance matrices.

FDA and LMS • FDA bears strong connections to least-squares approaches for classification. • The solution to the least-squares problem is in the same direction as the solution of Fisher's discriminant.
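
The connection stated above can be checked numerically. The sketch below assumes the textbook two-class setting: the Fisher direction is taken as S_w^{-1}(m2 - m1), and the least-squares fit uses labels coded as -1/+1 with an intercept. The toy data and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two Gaussian classes sharing one covariance matrix (hypothetical toy data).
cov = np.array([[2.0, 0.8], [0.8, 1.0]])
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X2 = rng.multivariate_normal([2.0, 1.0], cov, size=300)

# Fisher direction: w proportional to S_w^{-1} (m2 - m1),
# where S_w is the pooled within-class scatter matrix.
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
w_fda = np.linalg.solve(S_w, m2 - m1)

# Least squares on class labels coded as -1 / +1, with an intercept.
X = np.vstack([X1, X2])
y = np.concatenate([-np.ones(len(X1)), np.ones(len(X2))])
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
w_ls = beta[1:]

# Cosine of the angle between the two directions.
cosine = w_fda @ w_ls / (np.linalg.norm(w_fda) * np.linalg.norm(w_ls))
```

The cosine equals 1 up to rounding error, so the two weight vectors differ only by a positive scale factor. The thresholds of the two classifiers can still differ.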

FDA: a novel interpretation • T. Centeno, N. Lawrence. Optimizing kernel parameters and regularization coefficients for non-linear discriminant analysis. JMLR, 7 (2006). • A novel Bayesian interpretation of FDA relating Rayleigh’s coefficient to a noise model that minimizes a cost based on the most probable class centers and that abandons the ‘regression to the labels’ assumption used by other algorithms.

FDA: parameters • Going further, with the use of a Gaussian process prior, they show the equivalence of their model to a regularized kernel FDA. • A key advantage of their approach is the ability to determine the kernel parameters and the regularization coefficient by optimizing the marginal log-likelihood of the data.

FDA: framework of algorithms • Qing Tao, et al. The Theoretical Analysis of FDA and Applications. Pattern Recognition. 39(6):1199-1204. • Similar in spirit to the maximal margin algorithm, FDA with zero within-class variance is proved to serve as the starting point of the complete FDA framework.

Disadvantage • Motivation; • Inspired by …… ; • Not really based on.

Bias and variance analysis • The bias-variance decomposition is a very powerful and widely-used tool for understanding machine-learning algorithms; • It was originally developed for squared loss.

Bias-Variance Decomposition • Consider the regression model Y = f(X) + ε
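
The decomposition itself is missing from the slide. For squared loss it takes the following standard form, assuming E[ε] = 0 and Var(ε) = σ_ε², with the estimate written as \hat f and the expectation taken over training sets and noise (these symbols are assumptions, not the slide's notation):

```latex
\begin{align*}
\mathrm{Err}(x_0) &= \mathbb{E}\big[(Y - \hat f(x_0))^2 \mid X = x_0\big] \\
&= \sigma_\varepsilon^2
 + \big[\mathbb{E}\hat f(x_0) - f(x_0)\big]^2
 + \mathbb{E}\big[\hat f(x_0) - \mathbb{E}\hat f(x_0)\big]^2 \\
&= \text{irreducible error} + \text{bias}^2 + \text{variance}.
\end{align*}
```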

Bias-Variance Tradeoff • Often, the variance can be significantly reduced by deliberately introducing a small amount of bias.

Interpretation: KNN • Parameter K: bias-variance tradeoff.
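
For KNN regression the tradeoff can be written out explicitly. The expression below is the standard one from The Elements of Statistical Learning, assuming the training inputs are fixed, the noise has variance σ_ε², and x_{(ℓ)} denotes the ℓ-th nearest neighbour of x_0:

```latex
\mathrm{Err}(x_0) = \sigma_\varepsilon^2
 + \Big[f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\Big]^2
 + \frac{\sigma_\varepsilon^2}{k}
```

The variance term σ_ε²/k shrinks as k grows, while the squared bias typically grows because more distant neighbours enter the average.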

Ridge regression • Hoerl and Kennard 1970 • LMS: ill-posed problem; • Compared with LMS under certain assumptions, it introduces a small amount of bias.

Interpretation: ridge regression • Analytic solution • The technique is a way to simultaneously reduce the risk and increase the numerical stability of LMS.
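
The analytic solution and the stability claim can be illustrated with a short sketch. It assumes the usual closed form (XᵀX + λI)⁻¹Xᵀy without an intercept; the function name, the value λ = 1 and the nearly collinear toy data are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X^T X + lam I)^{-1} X^T y (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Nearly collinear columns (hypothetical data) make plain least squares ill-posed.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([x, x + 1e-4 * rng.normal(size=100)])
y = x + 0.1 * rng.normal(size=100)

cond_ls = np.linalg.cond(X.T @ X)                     # condition number without the penalty
cond_ridge = np.linalg.cond(X.T @ X + 1.0 * np.eye(2))
w_ls = ridge_fit(X, y, 0.0)      # lam = 0 reduces to ordinary least squares
w_ridge = ridge_fit(X, y, 1.0)   # lam = 1 shrinks the weights
```

Adding λI raises the smallest eigenvalue of the matrix being inverted, which lowers its condition number, and the ridge weights always have a smaller norm than the least-squares weights.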

Interpretation: parameter • Referred to as a shrinkage parameter: Singular value decomposition; • Effective degrees of freedom: experiment analysis; • The key result is dramatic reduction of parameter variance.
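
The link between the shrinkage parameter, the singular value decomposition and the effective degrees of freedom can be sketched as follows. The formula df(λ) = Σ_j d_j² / (d_j² + λ), with d_j the singular values of the centred input matrix, is the standard one from The Elements of Statistical Learning; the function name and the random data are hypothetical.

```python
import numpy as np

def ridge_effective_df(X, lam):
    """df(lam) = sum_j d_j^2 / (d_j^2 + lam), d_j = singular values of X."""
    d = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(d**2 / (d**2 + lam)))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)   # centre the columns, as is usual for ridge regression

lams = (0.0, 1.0, 10.0, 100.0, 1e6)
df_values = [ridge_effective_df(X, lam) for lam in lams]
```

At λ = 0 the effective degrees of freedom equal the number of inputs, and they decrease monotonically towards zero as λ grows, since each direction is shrunk by the factor d_j² / (d_j² + λ).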

A note • A new class of generalized Bayes minimax ridge regression estimators. The Annals of Statistics. 2005, 33(4). • The risk reduction aspect of ridge regression was often observed in simulations but was not theoretically justified. Almost all theoretical results on ridge regression in the literature depend on normality.
