  1. Learning, Generalization, and Regularization: A Glimpse of Statistical Machine Learning Theory – Part 2 Assoc. Prof. Nguyen Xuan Hoai, HANU IT R&D Center

  2. Contents VC Theory: Reminders from the last lecture. ERM Consistency (asymptotic results). ERM generalization bounds (non-asymptotic results). Structural Risk Minimization (SRM). Other theoretical frameworks for machine learning.

  3. Probabilistic Setting of ML

  4. Empirical Risk Minimization (ERM) Loss function:

  5. Empirical Risk Minimization (ERM) Risk Function:

  6. Empirical Risk Minimization (ERM) Risk Minimization Principle:

  7. Empirical Risk Minimization (ERM) Empirical Risk Minimization Principle:
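
A minimal sketch of the ERM principle, assuming a 0-1 loss and a finite class of one-dimensional threshold classifiers; the data-generating process, the hypothesis class, and every name below are illustrative, not part of the lecture:

    # ERM sketch: pick the hypothesis with the smallest empirical risk.
    import numpy as np

    rng = np.random.default_rng(0)

    def empirical_risk(theta, x, y):
        y_pred = (x > theta).astype(int)        # threshold classifier f_theta
        return np.mean(y_pred != y)             # R_emp[f] = average 0-1 loss on the sample

    # i.i.d. sample z_i = (x_i, y_i) from an unknown distribution
    x = rng.uniform(0.0, 1.0, size=200)
    y = (x > 0.3).astype(int)                   # "true" concept, unknown to the learner
    noise = rng.random(200) < 0.1               # 10% label noise
    y = np.where(noise, 1 - y, y)

    # ERM over the finite hypothesis class {f_theta : theta in candidates}
    candidates = np.linspace(0.0, 1.0, 101)
    risks = [empirical_risk(t, x, y) for t in candidates]
    theta_hat = candidates[int(np.argmin(risks))]
    print(theta_hat, min(risks))                # empirical minimizer and its R_emp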

  8. Regularization Regularized Empirical Risk:
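
As a hedged companion to the definition above, the sketch below writes a regularized empirical risk for a linear model, assuming a squared loss and an L2 penalty Omega(f) = ||w||^2 with weight lambda_; these choices are illustrative, not the slide's formula:

    # Regularized empirical risk: R_reg[f] = R_emp[f] + lambda * Omega(f)
    import numpy as np

    def regularized_empirical_risk(w, X, y, lambda_):
        residuals = X @ w - y
        r_emp = np.mean(residuals ** 2)         # empirical risk term (squared loss)
        penalty = lambda_ * np.sum(w ** 2)      # regularizer Omega(f) = ||w||^2
        return r_emp + penalty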

  9. Empirical Risk Minimization (ERM) Questions for ERM (Statistical Learning Theory): Is ERM consistent? (Consistency: weak convergence of the ERM solution to the true one.) How fast is the convergence? How can generalization be controlled?

  10. ERM Consistency The ERM solution f_n is an estimator of the true solution, obtained from a sample of size n, and Remp[f] is an estimator of R[f]. ERM therefore combines two estimators. Is this combination consistent?

  11. ERM Consistency

  12. ERM Consistency

  13. ERM Consistency Consistency Definition for ERM:

  14. ERM Consistency Consistency Definition for ERM:

  15. ERM Consistency Do we need both limits to hold true? Counterexample: the Q(z, α) are indicator functions. Each function of this set is equal to 1 for all z except on a finite number of intervals of measure ε, where it is equal to 0. The parameters α define the intervals on which the function equals zero. The set of functions Q(z, α) is such that for any finite set of points z1,...,zl one can find a function that takes the value zero on all of these points. Let F(z) be the uniform distribution function on the interval [0,1].

  16. ERM Consistency Do we need both limits to hold true? We have: R(α) = 1 − ε for every α, so R(α_l) → inf R(α) holds trivially; yet for any sample z1,...,zl there is a function with zero empirical risk, so Remp(α_l) = 0 does not converge to inf R(α) = 1 − ε. Hence one of the two limits can hold while the other fails.

  17. ERM Consistency Strict Consistency: the problem of minorization of function sets: if the set contains a function that minorizes all the others, ERM always selects it and consistency is satisfied trivially.

  18. ERM Consistency Strict Consistency: (note: only the second limit is needed)

  19. ERM Consistency Two-sided Empirical Processes: Uniform convergence:

  20. ERM Consistency One-sided Empirical Processes:

  21. ERM Consistency Concentration Inequality: Hoeffding’s Inequality

  22. ERM Consistency Concentration Inequality: Hoeffding's Inequality Hoeffding's inequality is distribution independent. It describes the rate of convergence of frequencies to their probabilities. When a = 1 and b = 0 it reduces to Chernoff's inequality. It and its generalizations have been used extensively for analyzing randomized algorithms and in learning theory.
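
As a rough numerical illustration, assuming a loss bounded in [0, 1] so that the two-sided bound reads P(|Remp − R| > ε) ≤ 2 exp(−2 l ε²), the inequality can be inverted to read off a sufficient sample size for accuracy ε at confidence 1 − δ (the function name is made up for the example):

    # Invert Hoeffding's inequality: find l with 2*exp(-2*l*eps^2) <= delta.
    import math

    def hoeffding_sample_size(eps, delta):
        return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

    print(hoeffding_sample_size(0.05, 0.01))   # about 1060 samples suffice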

  23. ERM Consistency Key Theorem of Learning:
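
For reference, the key theorem of learning theory (Vapnik and Chervonenkis), in its standard formulation for a set of bounded loss functions, says that ERM is (nontrivially) consistent if and only if the one-sided uniform convergence

$$ \lim_{l \to \infty} P\Big\{ \sup_{\alpha} \big( R(\alpha) - R_{\mathrm{emp}}(\alpha) \big) > \varepsilon \Big\} = 0 \qquad \text{for every } \varepsilon > 0 $$

holds over the whole class; this is why the following slides study uniform convergence case by case.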

  24. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 1: Finite set of functions Let our set of events contain a finite number N of events A_k = {z : Q(z, α_k) > 0}, k = 1, 2, ..., N. For this set, uniform convergence does hold.

  25. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 1: Finite set of functions Let our set of events contain a finite number N of events A_k = {z : Q(z, α_k) > 0}, k = 1, 2, ..., N. For this set, uniform convergence does hold. For uniform convergence:
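
The claim follows from Hoeffding's inequality combined with a union bound over the N functions; for a loss bounded in [0, 1]:

$$ P\Big\{ \max_{1 \le k \le N} \big| R(\alpha_k) - R_{\mathrm{emp}}(\alpha_k) \big| > \varepsilon \Big\} \;\le\; \sum_{k=1}^{N} P\big\{ \big| R(\alpha_k) - R_{\mathrm{emp}}(\alpha_k) \big| > \varepsilon \big\} \;\le\; 2 N e^{-2 \varepsilon^{2} l} \;\longrightarrow\; 0 \quad (l \to \infty). $$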

  26. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 2: Sets of Indicator Functions Entropy of a set of functions: Given a sample {z1, z2, …, zl}, for each α we have a binary vector q(α) = (Q(z1, α), …, Q(zl, α)). Each q(α) is a vertex of the l-dimensional unit hypercube.

  27. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 2: Sets of Indicator Functions Entropy of a set of functions: N(z1,…,zl) is the number of distinct vertices obtained this way; we have N(z1,…,zl) ≤ 2^l. Define the random entropy H(z1,…,zl) = ln N(z1,…,zl). The entropy of the function set is then its expectation over samples: H(l) = E ln N(z1,…,zl).
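
A small sketch of how far below 2^l the count N(z1, …, zl) can be, using the toy class of one-dimensional threshold indicators Q(z, α) = 1[z > α]; the class, the sample, and the names below are assumptions for the illustration:

    # Count the distinct binary vectors q(alpha) a threshold class induces on a sample.
    import numpy as np

    def num_dichotomies(z, alphas):
        vertices = {tuple((z > a).astype(int)) for a in alphas}
        return len(vertices)                       # N(z_1, ..., z_l)

    z = np.sort(np.random.default_rng(1).uniform(size=10))
    # one threshold below, one above, and one between each pair of consecutive points
    alphas = np.concatenate(([-1.0], (z[:-1] + z[1:]) / 2.0, [2.0]))
    print(num_dichotomies(z, alphas), 2 ** len(z))  # 11 versus 1024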

  28. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 2: Sets of Indicator Functions

  29. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 3: Real-valued Bounded Functions Consider a set of functions Q such that |Q(z, α)| < C. As with indicator functions, given a sample Z = {z1, …, zl}, for each α the vector q(α) = (Q(z1, α), …, Q(zl, α)) is a point in an l-dimensional cube.

  30. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 3: Real-valued Bounded Functions

  31. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 3: Real-valued Bounded Functions Define N(ε; z1,…,zl) to be the number of vectors in a minimal ε-net of the set of vectors q(α) (as α varies). The random ε-entropy of Q(z, α) is defined as H(ε; z1,…,zl) = ln N(ε; z1,…,zl), and the ε-entropy of the set is its expectation over samples: H(ε; l) = E H(ε; z1,…,zl).
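
One way to get a feel for these quantities numerically: a greedy covering in the sup norm gives an upper bound on the minimal ε-net size N(ε; z1, …, zl). The toy bounded class Q(z, α) = clip(α − z, 0, 1) and all names below are assumptions for the sketch:

    # Greedy epsilon-net in the sup norm over the vectors q(alpha).
    import numpy as np

    def greedy_eps_net_size(vectors, eps):
        net = []
        for v in vectors:
            if all(np.max(np.abs(v - u)) > eps for u in net):
                net.append(v)                  # v is not yet eps-covered: add it
        return len(net)                        # upper bound on N(eps; z_1, ..., z_l)

    rng = np.random.default_rng(0)
    z = rng.uniform(size=20)                                    # sample z_1, ..., z_l
    alphas = np.linspace(0.0, 1.0, 500)
    q = np.array([np.clip(a - z, 0.0, 1.0) for a in alphas])    # bounded, |Q| <= 1
    print(np.log(greedy_eps_net_size(q, eps=0.1)))              # ln of an upper bound on N(eps; ...)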

  32. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 3: Real-valued Bounded Functions

  33. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 4: Functions with bounded expectations

  34. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 4: Functions with bounded expectations

  35. ERM Consistency Conditions of One-Sided Convergence:

  36. ERM Consistency Conditions of One-Sided Convergence:

  37. ERM Consistency Three milestones in learning theory: For pattern recognition (a space of indicator functions), we have: Entropy: H(l) = E ln N(z1,…,zl) Annealed entropy: Hann(l) = ln E N(z1,…,zl) Growth function: G(l) = ln sup_{z1,…,zl} N(z1,…,zl) We have: H(l) ≤ Hann(l) ≤ G(l)

  38. ERM Consistency Three milestones in learning theory: First milestone: H(l)/l → 0 (as l → ∞) is a sufficient condition for consistency. Second milestone: Hann(l)/l → 0 is a sufficient condition for a fast rate of convergence. Third milestone: G(l)/l → 0 is a necessary and sufficient condition for consistency under any probability measure, together with a fast rate of convergence.

  39. ERM Generalization Bounds Non-asymptotic results: Consistency is an asymptotic result; it says nothing about the speed of convergence or about the confidence one can place in the result of ERM on a finite sample.

  40. ERM Generalization Bounds Non-asymptotic results: Consider the finite case, when the set Q(z, α) contains only N (indicator) functions. For this case (using Chernoff's inequality), ERM is consistent and converges fast.

  41. ERM Generalization Bounds Non-asymptotic results: With probability 1 − η: With probability 1 − 2η:
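
For a finite class of N functions with loss values in [0, 1], one standard form of such a bound (a direct consequence of Hoeffding's inequality and the union bound, quoted here as a reference point rather than as the slide's exact formula) is: with probability at least 1 − η, simultaneously for all functions in the class,

$$ R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha) \;+\; \sqrt{\frac{\ln N + \ln (1/\eta)}{2 l}} . $$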

  42. ERM Generalization Bounds Indicator Functions: Distribution-Dependent

  43. ERM Generalization Bounds Indicator Functions: Distribution-Dependent With probability 1 − η: With probability 1 − 2η: Where:

  44. ERM Generalization Bounds Indicator Functions: Distribution-Independent Reminder: Growth function: G(l) = ln sup_{z1,…,zl} N(z1,…,zl) We have: H(l) ≤ Hann(l) ≤ G(l) G(l) does not depend on the distribution, so if we substitute G(l) for H(l) we obtain distribution-free bounds on the generalization error.

  45. ERM Generalization Bounds Indicator Functions: VC dimension
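
A standard result worth keeping in mind here (quoted for context, not taken from the slide itself): if the class has finite VC dimension h, then G(l) = l ln 2 for l ≤ h, while for l > h

$$ G(l) \;\le\; h \left( 1 + \ln \frac{l}{h} \right) , $$

so G(l)/l → 0 and the distribution-free bounds above become nontrivial.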

  46. ERM Generalization Bounds Indicator Functions: VC dimension Example: linear functions

  47. ERM Generalization Bounds Indicator Functions: VC dimension Example: linear functions

  48. ERM Generalization Bounds Indicator Functions: VC dimension Example: linear functions
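
To make the linear-function example concrete, the sketch below brute-force checks shattering in the plane: every labeling of three non-collinear points is realized by a linear classifier, while the XOR labeling of four points is not, which illustrates the well-known fact that the VC dimension of linear classifiers in R^2 is 3. Using a linear SVM as the separability test, and all names below, are assumptions for this illustration (it requires scikit-learn):

    # Brute-force shattering check for linear classifiers in R^2.
    from itertools import product
    import numpy as np
    from sklearn.svm import SVC

    def shattered(points):
        points = np.asarray(points, dtype=float)
        for labels in product([0, 1], repeat=len(points)):
            if len(set(labels)) < 2:
                continue                  # a constant labeling is always realizable
            clf = SVC(kernel="linear", C=1e6).fit(points, list(labels))
            if clf.score(points, list(labels)) < 1.0:
                return False              # this dichotomy is not linearly separable
        return True

    three = [(0, 0), (1, 0), (0, 1)]          # non-collinear triple
    four = [(0, 0), (1, 1), (1, 0), (0, 1)]   # XOR configuration
    print(shattered(three))   # True:  the 3 points can be shattered
    print(shattered(four))    # False: the 4 points cannot all be shattered by a line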

  49. ERM Generalization Bounds Indicator Functions: VC dimension Example:

  50. ERM Generalization Bounds Indicator Functions: VC dimension Example:
