
Information Theoretic Learning


Presentation Transcript


  1. Information Theoretic Learning Jose C. Principe Yiwen Wang Computational NeuroEngineering Laboratory Electrical and Computer Engineering Department University of Florida www.cnel.ufl.edu principe@cnel.ufl.edu

  2. Acknowledgments • Dr. Deniz Erdogmus • My students: Puskal Pokharel, Weifeng Liu, Jianwu Xu, Kyu-Hwa Jeong, Sudhir Rao, Seungju Han • NSF ECS – 0300340 and 0601271 (Neuroengineering program)

  3. Resources • CNEL Website www.cnel.ufl.edu • Front page, go to ITL resources (tutorial, examples, MATLAB code) • Publications

  4. Information Filtering Deniz Erdogmus and Jose Principe, “From Linear Adaptive Filtering to Nonlinear Information Processing,” IEEE Signal Processing Magazine, November 2006

  5. Outline • Motivation • Renyi’s entropy definition • A sample by sample estimator for entropy • Projections based on mutual information • Applications • Optimal Filtering • Classification • Clustering • Conclusions

  6. Information Data is everywhere! Wireless communications, remote sensing, speech processing, biomedical applications, sensor arrays.

  7. From Data to Models • Optimal Adaptive Models: y = f(x,w) • [Block diagram: input data x enters the adaptive system, its output is compared with the desired data d to form the error e, which drives the cost function and the learning algorithm]

  8. From Linear to Nonlinear Mappings • Wiener showed us how to compute optimal linear projections. The LMS/RLS algorithms showed us how to find the Wiener solution sample by sample. • Neural networks brought us the ability to work non-parametrically with nonlinear function approximators. • Linear regression → nonlinear regression • Optimum linear filtering → TLFNs • Linear Projections (PCA) → Principal Curves • Linear Discriminant Analysis → MLPs

  9. Adapting Linear and NonLinear Models • The goal of learning is to optimize the performance of the parametric mapper according to some cost function. • In classification, minimize the probability of error. • In regression, the goal is to minimize the error in the fit. • The cost function most widely used has been the mean square error (MSE). It provides the Maximum Likelihood solution when the error is Gaussian distributed. • In NONLINEAR systems this is hardly ever the case.

  10. Beyond Second Order Statistics • We submit that the goal of learning should be to transfer as much information as possible from the inputs to the weights of the system (no matter if unsupervised or supervised). • As such, the learning criterion should be based on entropy (single data source) or divergence (multiple data sources). • Hence the challenge is to find nonparametric, sample-by-sample estimators for these quantities.

  11. ITL: Unifying Learning Scheme • Normally supervised and unsupervised learning are treated differently, but there is no need to do so. One can come up with a general class of cost functions based on Information Theory that apply to both learning schemes. • Cost function (minimize, maximize, or nullify): 1. Entropy (a single group of RVs) 2. Divergence (two or more groups of RVs)

  12. ITL: Unifying Learning Scheme • Function Approximation: minimize error entropy • Classification: minimize error entropy; maximize mutual information between class labels and outputs • Jaynes’ MaxEnt: maximize output entropy • Linsker’s Maximum Information Transfer: maximize MI between input and output • Optimal Feature Extraction: maximize MI between desired and output • Independent Component Analysis: maximize output entropy; minimize mutual information among outputs

  13. ITL: Unifying Learning Scheme

  14. Information Theory Is a probabilistic description of random variables that quantifies the very essence of the communication process. It has been instrumental in the design and quantification of communication systems. Information theory provides a quantitative and consistent framework to describe processes with partial knowledge (uncertainty).

  15. Information Theory Not all random events are equally random! How do we quantify this fact? Shannon proposed the concept of ENTROPY.

  16. Formulation of Shannon’s Entropy • Hartley Information (1928) • Large probability → small information • Small probability → large information • Two identical channels should have twice the capacity of one • Log2 is a natural measure for additivity
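To make these bullets concrete, Hartley’s information for an event of probability p_k can be written as follows (a standard form; the slide’s own equation is not in the transcript):

```latex
I(x_k) = \log_2 \frac{1}{p_k} = -\log_2 p_k
```

Additivity follows directly: for two independent events, I(x_j, x_k) = I(x_j) + I(x_k), which is why two identical channels carry twice the information of one.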

  17. Formulation of Shannon’s Entropy • Expected value of Hartley Information • Communications: ultimate data compression (H) and channel capacity for asymptotically error-free communication • Measure of (relative) uncertainty • Shannon used a principled approach to define entropy

  18. Review of Information Theory • Shannon Entropy: • Mutual Information: • Kullback-Leibler Divergence:
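The definitions listed on this slide appear only as images in the transcript; in standard discrete form they are:

```latex
H_S(X) = -\sum_x p_X(x)\,\log p_X(x), \qquad
I_S(X;Y) = \sum_{x,y} p_{XY}(x,y)\,\log \frac{p_{XY}(x,y)}{p_X(x)\,p_Y(y)}, \qquad
D_{KL}(p \,\|\, q) = \sum_x p(x)\,\log \frac{p(x)}{q(x)}
```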

  19. Properties of Shannon’s Entropy • Discrete RV’s • H(X) ≥ 0 • H(X) ≤ log N, equality iff X is uniform • H(Y|X) ≤ H(Y), equality iff X, Y indep. • H(X,Y) = H(X) + H(Y|X) • Continuous RV’s • Replace summation with integral • Differential entropy • Minimum entropy: a sum of delta functions • Maximum entropy: • Fixed variance → Gaussian • Fixed upper/lower limits → uniform

  20. Properties of Mutual Information • I_S(X;Y) = H(X) + H(Y) − H(X,Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) • I_S(X;Y) = I_S(Y;X) • I_S(X;X) = H_S(X) • [Venn diagram relating H_S(X), H_S(Y), H_S(X,Y), H_S(X|Y), H_S(Y|X), and I_S(X;Y)]

  21. A Different View of Entropy • Shannon’s Entropy • Renyi’s Entropy • Fisher’s Entropy (local) Renyi’s entropy becomes Shannon’s as α → 1.
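Renyi’s definition and its Shannon limit, written out here because the slide’s equations are missing (standard forms):

```latex
H_\alpha(X) = \frac{1}{1-\alpha}\,\log \int p^\alpha(x)\,dx, \qquad
\lim_{\alpha \to 1} H_\alpha(X) = -\int p(x)\,\log p(x)\,dx = H_S(X)
```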

  22. Renyi’s Entropy • Norm of the pdf: • Entropies in terms of V
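In terms of the α-norm of the pdf, V, which the rest of the talk builds on, the same quantities read (standard ITL notation, reconstructed here):

```latex
V_\alpha(X) = \int p^\alpha(x)\,dx = E\!\left[p^{\alpha-1}(X)\right], \qquad
H_\alpha(X) = \frac{1}{1-\alpha}\,\log V_\alpha(X), \qquad
H_2(X) = -\log V_2(X)
```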

  23. Geometrical Illustration of α-Entropy

  24. Properties of Renyi’s Entropy • (a) Continuous function of all probabilities • (b) Permutationally symmetric • (c) H(1/n, …, 1/n) is an increasing function of n • (d) Recursivity • (e) Additivity: if p and q are independent, the entropy of the joint is the sum of the entropies

  25. Properties of Renyi’s entropy • Renyi’s entropy provides both an upper and a lower bound on the probability of error in classification, unlike Shannon’s entropy, which provides only a lower bound (Fano’s inequality, the tightest such bound).

  26. Nonparametric Entropy Estimators (Only continuous variables are interesting…) • Plug in estimates • Integral estimates • Resubstitution estimates • Splitting data estimates • Cross validation estimates • Sample spacing estimates • Nearest Neighbor distances

  27. Parzen Window Method • Put a kernel over each sample, normalize, and add. Entropy becomes a function of a continuous RV. • A kernel is a positive function that integrates to 1 and peaks at the sample location (e.g., the Gaussian).
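As a concrete sketch of the method just described, a one-dimensional Parzen estimator with a Gaussian kernel could look like this (function name and kernel size are illustrative, not from the slides):

```python
import numpy as np

def parzen_pdf(x_query, samples, sigma):
    """Parzen window estimate of the pdf at x_query with a Gaussian kernel:
    p_hat(x) = (1/N) * sum_i G_sigma(x - x_i)."""
    samples = np.asarray(samples, dtype=float)
    diff = x_query - samples                                    # distance to every sample
    kernel = np.exp(-0.5 * (diff / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return kernel.mean()                                        # average of kernels that each integrate to 1

# Example: estimate the density of 200 standard normal samples at x = 0
rng = np.random.default_rng(0)
data = rng.normal(size=200)
print(parzen_pdf(0.0, data, sigma=0.3))                         # should be near 1/sqrt(2*pi) ~ 0.4
```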

  28. Parzen Windows [Figure: Parzen window estimates with Laplacian and uniform kernels]

  29. Parzen Windows • Smooth estimator • Arbitrarily close fit as N → ∞, σ → 0 • Curse of Dimensionality • Previous pictures are for d = 1 dimension • For a linear increase in d, an exponential increase in N is required for an equally “good” approximation. In ITL we use Parzen windows not to estimate the PDF but to estimate the 2-norm of the PDF, which corresponds to the first moment of the PDF.

  30. Renyi’s Quadratic Entropy Estimation • Quadratic Entropy (α = 2) • Information Potential • Use Parzen window pdf estimation with a (symmetric) Gaussian kernel. Information potential: think of the samples as particles (gravity or electrostatic field) that interact with others with a law given by the kernel shape.
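Plugging the Parzen estimate into V_2 and using the fact that the convolution of two Gaussians of width σ is a Gaussian of width σ√2 gives the estimator these bullets refer to (a standard ITL result, reconstructed because the slide’s equations are not in the transcript):

```latex
\hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} G_\sigma(x - x_i)
\;\;\Rightarrow\;\;
\hat{V}_2(X) = \int \hat{p}^2(x)\,dx
             = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} G_{\sigma\sqrt{2}}(x_j - x_i), \qquad
\hat{H}_2(X) = -\log \hat{V}_2(X)
```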

  31. IP as an Estimator of Quadratic Entropy • Information Potential (IP): V_2(X)

  32. IP as an Estimator of Quadratic Entropy • There is NO approximation in computing the Information Potential for α = 2 besides the choice of the kernel. • This result is the kernel trick used in Support Vector Machines. • It means that we never explicitly estimate the PDF, which greatly improves the applicability of the method.

  33. Information Force (IF) • Between two information particles (IPTs) • Overall force on a particle
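Differentiating the IP estimator with respect to a sample gives the force expressions the slide alludes to (standard ITL derivation, reconstructed here since the slide’s equations are missing):

```latex
F(x_i; x_j) = \frac{\partial}{\partial x_i} G_{\sigma\sqrt{2}}(x_i - x_j)
            = -\frac{x_i - x_j}{2\sigma^2}\, G_{\sigma\sqrt{2}}(x_i - x_j), \qquad
F(x_i) = \frac{\partial \hat{V}_2(X)}{\partial x_i}
       = \frac{2}{N^2} \sum_{j=1}^{N} F(x_i; x_j)
```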

  34. Calculation of IP & IF
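A compact numerical sketch of that calculation, using the Gaussian-kernel estimator and force above (variable and function names are illustrative):

```python
import numpy as np

def ip_and_force(x, sigma):
    """Quadratic information potential V2, the information force on each sample
    (the gradient of V2 with respect to that sample), and the Renyi quadratic
    entropy estimate H2 = -log(V2)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    s = sigma * np.sqrt(2.0)                      # kernel width after the Gaussian convolution
    diff = x[:, None] - x[None, :]                # pairwise differences x_i - x_j
    g = np.exp(-0.5 * (diff / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
    v2 = g.sum() / n**2                           # information potential
    force = (2.0 / n**2) * (-diff / (2.0 * sigma**2) * g).sum(axis=1)   # IF on each sample
    return v2, force, -np.log(v2)

rng = np.random.default_rng(1)
v2, force, h2 = ip_and_force(rng.normal(size=100), sigma=0.5)
print(v2, h2)
```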

  35. Central “Moments” • Mean • Variance • Entropy

  36. Moment Estimation • Mean • Variance • Entropy
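The sample estimators this slide pairs with those three descriptors, with entropy estimated through the IP (standard forms, filling in for the slide’s equation images):

```latex
\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad
\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu})^2, \qquad
\hat{H}_2(X) = -\log\!\left( \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} G_{\sigma\sqrt{2}}(x_j - x_i) \right)
```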

  37. Which of the two Extremes? • Must the pdf estimate be accurate for practical ITL? • Or does ITL (minimization/maximization) not require an accurate pdf estimate at all? Neither extreme holds, but the question is still not fully characterized.

  38. How to select the kernel size • Different values of σ produce different entropy estimates. We suggest using 3σ ≈ 10% of the dynamic range (interaction among roughly 10 samples). • Or use Silverman’s rule. • Kernel size is just a scale parameter. A stands for the minimum of the empirical data standard deviation and the data interquartile range scaled by 1.34.
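Silverman’s rule with the A just described is commonly written as σ = 0.9 A N^(-1/5); a small sketch (the 0.9 constant is the usual choice, and the function name is illustrative):

```python
import numpy as np

def silverman_sigma(x):
    """Kernel size from Silverman's rule: sigma = 0.9 * A * N**(-1/5),
    where A = min(std(x), IQR(x) / 1.34)."""
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    a = min(x.std(), (q75 - q25) / 1.34)          # the A described on the slide
    return 0.9 * a * x.size ** (-1.0 / 5.0)

rng = np.random.default_rng(2)
print(silverman_sigma(rng.normal(size=500)))      # roughly 0.26 for 500 N(0,1) samples
```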

  39. Extension to any kernel • We do not need to use Gaussian kernels in the Parzen estimator. • We can use any kernel that is symmetric and differentiable (k(0) > 0, k'(0) = 0 and k''(0) < 0). • We normally work with kernels scaled from a unit-size kernel.

  40. Extension to any α • Redefine the Information Potential as • Using the Parzen estimator we obtain • This estimator corresponds exactly to the quadratic estimator (α = 2) with the proper kernel width σ.
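The plug-in estimator these bullets refer to, in standard ITL notation (reconstructed here; the slide’s equations are not in the transcript):

```latex
\hat{V}_\alpha(X) = \frac{1}{N} \sum_{j=1}^{N} \left[ \frac{1}{N} \sum_{i=1}^{N} \kappa_\sigma(x_j - x_i) \right]^{\alpha-1}, \qquad
\hat{H}_\alpha(X) = \frac{1}{1-\alpha} \log \hat{V}_\alpha(X)
```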

  41. Extension to any α, kernel • The α-information potential • The α-information force • where F_2(X) is the quadratic IF. Hence we see that the “fundamental” definition is the quadratic IP and IF, and the “natural” kernel is the Gaussian.

  42. Kullback Leibler Divergence • KL Divergence measures the “distance” between pdfs (Csiszar and Amari) • Relative entropy • Cross entropy • Information for discrimination

  43. Mutual Information & KL Divergence • Shannon’s Mutual Information • Kullback-Leibler Divergence • Statistical Independence
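The relation these bullets point to: Shannon’s mutual information is the KL divergence between the joint pdf and the product of the marginals, and it vanishes exactly at statistical independence (standard identities, written out since the slide’s equations are missing):

```latex
I_S(X;Y) = D_{KL}\!\left( p_{XY} \,\|\, p_X\, p_Y \right)
         = \iint p_{XY}(x,y) \log \frac{p_{XY}(x,y)}{p_X(x)\,p_Y(y)}\,dx\,dy, \qquad
I_S(X;Y) = 0 \iff p_{XY}(x,y) = p_X(x)\,p_Y(y)
```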

  44. KL Divergence is NOT a distance [Figure: three pdfs f1, f2, f3] • Ideally, a distance is: non-negative; null only if the pdfs are equal; symmetric; and satisfies the triangular inequality. • In reality, the KL divergence is non-negative and null only if the pdfs are equal, but it is not symmetric and does not satisfy the triangular inequality.

  45. New Divergences and Quadratic Mutual Information • Euclidean distance between pdfs (Quadratic Mutual Information, ED-QMI) • Cauchy-Schwarz divergence and CS-QMI
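The two divergences and their QMI versions, in standard ITL form (reconstructed here; QMI applies each divergence to the joint pdf and the product of the marginals):

```latex
D_{ED}(p, q) = \int \bigl(p(x) - q(x)\bigr)^2 dx, \qquad
D_{CS}(p, q) = -\log \frac{\left( \int p(x)\,q(x)\,dx \right)^2}{\int p^2(x)\,dx \,\int q^2(x)\,dx}
```

with I_ED(X;Y) = D_ED(p_XY, p_X p_Y) and I_CS(X;Y) = D_CS(p_XY, p_X p_Y).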

  46. Geometrical Explanation of MI

  47. One Example [Figure: worked example with probability values 0.4, 0, 0.6]

  48. One Example

  49. One Example

  50. One Example
