Effective Model Selection Strategies for Reducing Overfitting in Machine Learning

Model Selection

Outline • Motivation • Overfitting • Structural Risk Minimization • Cross Validation • Minimum Description Length

Motivation: • Suppose we have a class of infinite VC-dimension • We have too few examples • How can we find the best hypothesis? • Alternatively: usually we choose the hypothesis class • How should we go about choosing it?

Overfitting • Concept class: Intervals on a line • Can classify any training set • Zero training error: The only goal?!

Overfitting: Intervals • Can always get zero error • Are we interested?! • Recall Occam's Razor!
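To make the zero-error claim concrete, here is a minimal Python sketch, assuming the concept class is unions of closed intervals on the line and that all training points are distinct. The function names (fit_intervals, predict, training_error) and the sample data are hypothetical, chosen only for illustration.

```python
def fit_intervals(points, labels):
    """Return closed intervals (lo, hi) that cover exactly the positive points.

    Assumes all points are distinct; labels are 0 or 1.
    """
    intervals = []
    start = end = None
    for x, y in sorted(zip(points, labels)):
        if y == 1:
            if start is None:
                start = x          # open a new run of positives
            end = x                # extend the current run
        elif start is not None:
            intervals.append((start, end))  # a negative point closes the run
            start = None
    if start is not None:
        intervals.append((start, end))
    return intervals


def predict(intervals, x):
    """Label x as 1 if it falls inside any interval, else 0."""
    return int(any(lo <= x <= hi for lo, hi in intervals))


def training_error(intervals, points, labels):
    """Fraction of training points that the union of intervals mislabels."""
    wrong = sum(predict(intervals, x) != y for x, y in zip(points, labels))
    return wrong / len(points)


# Hypothetical training set with alternating labels.
points = [0.1, 0.2, 0.35, 0.5, 0.7, 0.9]
labels = [1, 0, 1, 1, 0, 1]
intervals = fit_intervals(points, labels)
```

The fit always reaches zero training error, but the number of intervals grows with every label flip in the sample, which is exactly the complexity that Occam's Razor warns against.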

Overfitting: Intervals

Overfitting • Simple concept plus noise • A very complex concept • insufficient number of examples + noise 1/3

Theoretical Model • Nested hypothesis classes • H_1 ⊆ H_2 ⊆ H_3 ⊆ … ⊆ H_i • Let VC-dim(H_i) = i • For simplicity |H_i| = 2^i • There is a target function c(x) • For some i, c ∈ H_i • e(h) = Pr[h ≠ c] • e_i = min_{h ∈ H_i} e(h) • e* = min_i e_i

Theoretical Model • Training error • obs(h) = Pr[h ≠ c], measured on the training sample • obs_i = min_{h ∈ H_i} obs(h) • Complexity of h • d(h) = min{i : h ∈ H_i} • Add a penalty for d(h) • Minimize: obs(h) + penalty(h)

Structural Risk Minimization • Penalty based • Choose the hypothesis which minimizes: • obs(h) + penalty(h) • SRM penalty:
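The penalty-based rule can be sketched in a few lines of Python. The penalty formula used here, sqrt((d ln 2 + ln(1/δ)) / m), is an assumption chosen for illustration (a typical Occam-style bound for |H_d| = 2^d); the slide's own formula did not survive. The names srm_penalty and srm_select and the candidate numbers are likewise hypothetical.

```python
import math


def srm_penalty(d, m, delta):
    """Assumed penalty (not taken from the slides): sqrt((d ln 2 + ln(1/delta)) / m)."""
    return math.sqrt((d * math.log(2) + math.log(1 / delta)) / m)


def srm_select(candidates, m, delta=0.05):
    """Pick the candidate minimizing obs(h) + penalty(h).

    candidates: list of (name, obs, d) where obs is the training error
    and d is the index of the smallest class containing the hypothesis.
    """
    return min(candidates, key=lambda c: c[1] + srm_penalty(c[2], m, delta))


# Hypothetical candidates: lower training error comes with higher complexity.
candidates = [("g1", 0.30, 1), ("g5", 0.12, 5), ("g40", 0.00, 40)]
chosen = srm_select(candidates, m=100)
```

With these made-up numbers and m = 100, the zero-training-error candidate g40 loses to g5 because its penalty outweighs its advantage on the sample.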

SRM: Performance • Theorem • With probability 1-δ • h* : best hypothesis • g* : SRM choice • e(h*) ≤ e(g*) ≤ e(h*) + 2 penalty(h*) • Claim: The theorem is "tight" • H_i includes 2^i coins

Proof • Bounding the error in Hi • Bounding the error across Hi

Cross Validation • Split the sample into a training set and a selection set • Using the training set • Select from each H_i a candidate g_i • Using the selection set • Select between g_1, … , g_m • The split size • (1-γ)m training set • γm selection set
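The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method from the lecture: the nested classes H_i are taken to be labelings of [0, 1) that are constant on each of 2^i equal-width bins, training is majority vote per bin, and the target, noise rate, and all function names are hypothetical.

```python
import random


def split_sample(sample, gamma, rng):
    """Shuffle, then return ((1-gamma)*m training examples, gamma*m selection examples)."""
    shuffled = sample[:]
    rng.shuffle(shuffled)
    n_select = int(round(gamma * len(shuffled)))
    return shuffled[n_select:], shuffled[:n_select]


def train_histogram(train, i):
    """Candidate g_i: majority label in each of 2**i equal bins of [0, 1)."""
    bins = 2 ** i
    votes = [0] * bins
    for x, y in train:
        votes[min(int(x * bins), bins - 1)] += 1 if y == 1 else -1
    return [1 if v > 0 else 0 for v in votes]


def predict(h, x):
    """Look up the label of the bin that contains x."""
    return h[min(int(x * len(h)), len(h) - 1)]


def error(h, sample):
    """Fraction of examples in sample that h mislabels."""
    return sum(predict(h, x) != y for x, y in sample) / len(sample)


def cross_validate(sample, gamma, max_i, rng):
    """Train one candidate per class on the training part, pick the best on the selection part."""
    train, select = split_sample(sample, gamma, rng)
    candidates = [train_histogram(train, i) for i in range(1, max_i + 1)]
    errors = [error(g, select) for g in candidates]
    best = min(range(len(candidates)), key=errors.__getitem__)
    return best + 1, candidates[best], errors


# Hypothetical data: target is the interval [0.25, 0.75), labels flipped with probability 0.1.
rng = random.Random(0)
sample = []
for _ in range(200):
    x = rng.random()
    y = int(0.25 <= x < 0.75)
    if rng.random() < 0.1:
        y = 1 - y
    sample.append((x, y))

best_i, g, errors = cross_validate(sample, gamma=0.3, max_i=6, rng=rng)
```

The selection set is never used for training, so the selection errors are unbiased estimates for each candidate; the price is that every g_i is trained on only (1-γ)m examples.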

Cross Validation: Performance • Errors • e_cv(m), e_A(m) • Theorem: with probability 1-δ • Is CV always near-optimal?!

Minimum Description Length • Penalty: size of h • Related to MAP • size of h: -log(Pr[h]) • errors: -log(Pr[D|h])
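The link to MAP can be checked numerically: minimizing the total code length -log Pr[h] - log Pr[D|h] picks the same hypothesis as maximizing Pr[h] · Pr[D|h]. The sketch below uses made-up priors and likelihoods for three hypothetical hypotheses; the name description_length is also an assumption.

```python
import math


def description_length(prior, likelihood):
    """Bits to encode h plus bits to encode the data given h."""
    return -math.log2(prior) - math.log2(likelihood)


# Hypothetical (Pr[h], Pr[D|h]) pairs: simpler hypotheses get a larger prior
# but explain the data less well.
hypotheses = {
    "simple": (0.5, 0.01),
    "medium": (0.25, 0.10),
    "complex": (0.01, 0.50),
}

# MDL: shortest total description.
mdl_choice = min(hypotheses, key=lambda h: description_length(*hypotheses[h]))

# MAP: largest (unnormalized) posterior Pr[h] * Pr[D|h].
map_choice = max(hypotheses, key=lambda h: hypotheses[h][0] * hypotheses[h][1])
```

Because -log is monotonically decreasing, the two criteria always agree; here both select the "medium" hypothesis.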
