Effective Model Selection Strategies for Reducing Overfitting in Machine Learning
This outline covers the motivation behind model selection, the dangers of overfitting in machine learning, and strategies like structural risk minimization and cross-validation. Theoretical models, penalties, and performance metrics are discussed, highlighting the importance of balancing complexity with training error. Minimum description length as a penalty and the application of the Minimum Message Length principle are also explored.
Effective Model Selection Strategies for Reducing Overfitting in Machine Learning
E N D
Presentation Transcript
Outline • Motivation • Overfitting • Structural Risk Minimization • Cross Validation • Minimum Description Length
Motivation: • Suppose we have a class of infinite Vcdim • We have too few examples • How can we find the best hypothesis • Alternatively, • Usually we choose the hypothesis class • How should we go about doing it?
Overfitting • Concept class: Intervals on a line • Can classify any training set • Zero training error: The only goal?!
Overfitting: Intervals • Can always get zero error • Are we interested?! • Recall Occam Razor!
Overfitting • Simple concept plus noise • A very complex concept • insufficient number of examples + noise 1/3
Theoretical Model • Nested Hypothesis classes • H1H2H3 … Hi • Let VC-dim(Hi)=I • For simplicity |Hi| = 2i • There is a target function c(x), • For some i, c Hi • e(h) = Pr [ h c] • ei = minhHi e(h) • e* = miniei
Theoretical Model • Training error • obs(h) = Pr [ h c] • obsi = minhHi obs(h) • Complexity of h • d(h) = mini {h Hi} • Add a penalty for d(h) • minimize: obs(h)+penalty(h)
Structural Risk Minimization • Penalty based. • Chose the hypothesis which minimizes: • obs(h)+penalty(h) • SRM penalty:
SRM: Performance • THEOROM • With probability 1- • h* : best hypothesis • g* : SRM choice • e(h*) e(g*) e(h*)+ 2 penalty(h*) • Claim: The theorem is “tight” • Hiincludes 2i coins
Proof • Bounding the error in Hi • Bounding the error across Hi
Cross Validation • Separate sample to training and selection. • Using the training • Select from each Hi a candidate gi • Using the selection sample • select between g1, … ,gm • The split size • (1-)m training set • m selection set
Cross Validation: Performance • Errors • ecv(m), eA(m) • Theorem: with probability 1- • Is CV always near-optimal ?!
Minimum Description length • Penalty: size of h • Related to MAP • size of h: log(Pr[h]) • errors: log(Pr[D|h])