
Lecture 12 – Model Assessment and Selection








  1. Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006

  2. Summary • Bias, variance, model complexity • Optimism of training error rate • Estimates of in-sample prediction error, AIC • Effective number of parameters • The Bayesian approach and BIC • Vapnik-Chervonenkis dimension • Cross-Validation • Bootstrap method

  3. Model Selection Criteria • Training error: err = (1/N) Σ_i L(yi, f̂(xi)), the average loss over the training sample • Loss function: L(Y, f̂(X)), e.g., squared error (Y − f̂(X))² or 0-1 loss • Generalization (test) error: Err = E[L(Y, f̂(X))], the expected loss on independent new data

  4. Training Error vs. Test Error

  5. Model Selection and Assessment • Model selection: estimating the performance of different models in order to choose the best one • Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data • If we were data-rich, we would split the data into three parts: a training set, a validation set, and a test set

  6. Bias-Variance Decomposition • As we have seen before, for Y = f(X) + ε with Var(ε) = σ² and squared-error loss, Err(x0) = σ² + [E f̂(x0) − f(x0)]² + E[f̂(x0) − E f̂(x0)]² = irreducible error + bias² + variance • The first term is the variance of the target around its true mean f(x0); the second term is the squared amount by which the average of our estimate differs from the true mean; the last term is the variance of f̂(x0) • The more complex we make the model f̂, the lower the bias, but the higher the variance
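The decomposition can be checked numerically. Below is a minimal Monte Carlo sketch (not from the lecture): the true function sin(2πx), the noise level σ = 0.3, the sample size, and the degree-3 polynomial fit are all illustrative choices, and the three terms are estimated at a single point x0.

```python
# Minimal Monte Carlo sketch of Err(x0) = sigma^2 + bias^2 + variance.
# The true function, noise level, and degree-3 polynomial model are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(2 * np.pi * x)              # "true" mean function (assumed)

sigma = 0.3                                   # noise standard deviation (assumed)
x0, N, reps, degree = 0.25, 50, 2000, 3       # evaluation point, sample size, replications, degree

preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 1, N)
    y = f(x) + rng.normal(0, sigma, N)        # Y = f(X) + eps
    coefs = np.polyfit(x, y, degree)          # least-squares polynomial fit
    preds[r] = np.polyval(coefs, x0)          # f_hat(x0) for this training set

bias2 = (preds.mean() - f(x0)) ** 2           # squared bias at x0
var = preds.var()                             # variance of f_hat(x0)
print(f"bias^2={bias2:.4f}  variance={var:.4f}  Err(x0)={sigma**2 + bias2 + var:.4f}")
```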

  7. Bias-Variance Decomposition (cont’d) • For k-nearest neighbors: Err(x0) = σ² + [f(x0) − (1/k) Σ_{ℓ=1..k} f(x_(ℓ))]² + σ²/k, where x_(ℓ) are the k nearest neighbors of x0 • For linear regression with p inputs: Err(x0) = σ² + [f(x0) − E f̂_p(x0)]² + ||h(x0)||² σ²

  8. Bias-Variance Decomposition (cont’d) • For linear regression, h(x0) = X(XᵀX)⁻¹x0 is the vector of weights that produces the fit f̂_p(x0) = x0ᵀ(XᵀX)⁻¹Xᵀy, and hence Var[f̂_p(x0)] = ||h(x0)||² σ² • This variance changes with x0, but its average over the sample values xi is (p/N) σ²
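As a quick numerical check of the last bullet (my own illustration, using a random Gaussian design with no intercept), the average of ||h(xi)||² over the training inputs equals p/N exactly, because Σ_i ||h(xi)||² = trace(H) = p for the hat matrix H.

```python
# Sketch: the average in-sample variance of a linear fit is (p/N)*sigma^2,
# since sum_i ||h(x_i)||^2 = trace(H) = p for the hat matrix H = X (X^T X)^{-1} X^T.
import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 20                                  # sample size and number of predictors (assumed)
X = rng.normal(size=(N, p))                    # random design, no intercept (assumed)

H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix; its i-th row is h(x_i)
avg_norm2 = np.mean(np.sum(H ** 2, axis=1))    # average of ||h(x_i)||^2 over the sample

print(avg_norm2, p / N)                        # both equal p/N = 0.4 (up to round-off)
```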

  9. Example • 50 observations and 20 predictors, uniformly distributed in the hypercube [0,1]²⁰ • Left: Y is 0 if X1 ≤ 1/2 and 1 otherwise; k-NN is applied • Right: Y is 1 if Σ_{j=1..10} Xj is greater than 5 and 0 otherwise; best-subset linear regression of size p is used • (Figure: prediction error, squared bias, and variance as functions of model complexity)

  10. Example – loss function • (Figure: prediction error, squared bias, and variance curves for the same examples)

  11. Optimism of Training Error • The training error err = (1/N) Σ_i L(yi, f̂(xi)) is typically less than the true (generalization) error, because the same data are used to fit and to assess the model • The in-sample error Err_in is the average error at the training inputs xi when new responses are drawn at those same inputs • The optimism is defined as op = Err_in − err • For squared error, 0-1, and other loss functions, one can show in general that ω ≡ E_y(op) = (2/N) Σ_i Cov(ŷi, yi)

  12. Optimism (cont’d) • Thus, the amount by which err underestimates the true error depends on how strongly yi affects its own prediction • For a linear fit with d inputs or basis functions, Σ_i Cov(ŷi, yi) = d σ² • For the additive model Y = f(X) + ε this gives E_y(Err_in) = E_y(err) + 2·(d/N)·σ², and thus the optimism increases linearly with the number of inputs or basis functions d, and decreases as the training size N increases
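A small simulation makes the 2·(d/N)·σ² term concrete. The sketch below is my own illustration: it fixes a random design and a true linear model with Gaussian noise, and compares the Monte Carlo average of Err_in − err with 2dσ²/N.

```python
# Sketch: for a linear model with additive Gaussian noise, the expected optimism
# E[Err_in - err] is approximately 2*d*sigma^2/N.  Design, sizes, and sigma are illustrative.
import numpy as np

rng = np.random.default_rng(2)
N, d, sigma, reps = 100, 10, 1.0, 5000
X = rng.normal(size=(N, d))
beta = rng.normal(size=d)
f_true = X @ beta                              # fixed design, fixed true mean
H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix of the least-squares fit

optimism = np.empty(reps)
for r in range(reps):
    y = f_true + rng.normal(0, sigma, N)       # training responses
    y_hat = H @ y
    err = np.mean((y - y_hat) ** 2)            # training error
    y_new = f_true + rng.normal(0, sigma, N)   # fresh responses at the same inputs
    err_in = np.mean((y_new - y_hat) ** 2)     # in-sample prediction error
    optimism[r] = err_in - err

print(optimism.mean(), 2 * d * sigma ** 2 / N) # both are close to 0.2
```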

  13. How to account for the optimism? • One approach is to estimate the optimism and add it to the training error; this is what criteria such as AIC and BIC do • Bootstrap and cross-validation instead give more direct estimates of the prediction error

  14. Estimates of In-Sample Prediction Error • The general form of the in-sample estimate is Êrr_in = err + ω̂, the training error plus an estimate of the optimism • Cp statistic: for an additive error model, when d parameters are fit under squared-error loss, Cp = err + 2·(d/N)·σ̂², where σ̂² is an estimate of the noise variance • Using this criterion, the training error is adjusted by a factor proportional to the number of basis functions used • The Akaike Information Criterion (AIC) is a similar but more generally applicable estimate of Err_in, used when a log-likelihood loss function is adopted
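The Cp formula translates directly into code. The sketch below is an illustration with synthetic data; the polynomial models and the use of a deliberately flexible degree-10 fit to estimate σ̂² are assumptions of the example, not part of the lecture.

```python
# Sketch: C_p = err + 2*(d/N)*sigma2_hat, used here to compare polynomial fits.
# sigma2_hat comes from a low-bias (degree-10) model; data are synthetic.
import numpy as np

rng = np.random.default_rng(3)
N = 50
x = rng.uniform(-1, 1, N)
y = 1 + 2 * x - x ** 2 + rng.normal(0, 0.5, N)       # additive-error data (illustrative)

def poly_fit_predict(x, y, degree):
    return np.polyval(np.polyfit(x, y, degree), x)

resid_big = y - poly_fit_predict(x, y, 10)           # flexible fit => roughly unbiased residuals
sigma2_hat = np.sum(resid_big ** 2) / (N - 11)       # noise-variance estimate

for degree in range(1, 7):
    d = degree + 1                                   # number of fitted parameters
    err = np.mean((y - poly_fit_predict(x, y, degree)) ** 2)
    cp = err + 2.0 * d / N * sigma2_hat
    print(f"degree={degree}  err={err:.3f}  Cp={cp:.3f}")
```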

  15. Akaike Information Criterion (AIC) • AIC relies on a relationship that holds asymptotically as N → ∞: −2·E[log Pr_θ̂(Y)] ≈ −(2/N)·E[loglik] + 2·(d/N) • Here Pr_θ(Y) is a family of densities for Y (containing the “true” density), θ̂ is the maximum likelihood estimate of θ, and loglik = Σ_i log Pr_θ̂(yi) is the maximized log-likelihood

  16. AIC (cont’d) • For the Gaussian model, AIC is equivalent to Cp • For logistic regression, using the binomial log-likelihood, we have AIC = −(2/N)·loglik + 2·(d/N) • Choose the model that produces the smallest AIC • What if we don’t know d? And what about models with tuning parameters?
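For a Gaussian model fit by least squares, the maximized log-likelihood can be written in terms of the residual sum of squares, so AIC on the −(2/N)·loglik + 2·d/N scale is easy to compute. The helper below is a sketch under that assumption; counting only the mean parameters in d is a bookkeeping convention of this example.

```python
# Sketch: AIC = -(2/N)*loglik + 2*d/N for a Gaussian least-squares fit,
# with the maximized log-likelihood expressed through the residual sum of squares.
import numpy as np

def gaussian_aic(y, y_hat, d):
    """AIC for a least-squares fit with d fitted mean parameters (sigma^2 not counted in d)."""
    N = len(y)
    rss = np.sum((y - y_hat) ** 2)
    sigma2_mle = rss / N
    loglik = -0.5 * N * (np.log(2 * np.pi * sigma2_mle) + 1)   # maximized log-likelihood
    return -2.0 / N * loglik + 2.0 * d / N

# Illustrative use: smaller AIC is better.
rng = np.random.default_rng(11)
x = rng.uniform(-1, 1, 60)
y = 1 + 2 * x + rng.normal(0, 0.4, 60)
for degree in (1, 3, 6):
    y_hat = np.polyval(np.polyfit(x, y, degree), x)
    print(degree, round(gaussian_aic(y, y_hat, degree + 1), 3))
```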

  17. AIC (cont’d) • Given a set of models f_α(x) indexed by a tuning parameter α, denote by err(α) and d(α) the training error and the effective number of parameters • The function AIC(α) = err(α) + 2·(d(α)/N)·σ̂² provides an estimate of the test error curve, and we pick the tuning parameter α that minimizes it • Caution: by choosing the best-fitting model among all models with d inputs, the effective number of parameters fit is more than d

  18. AIC – Example: Phoneme recognition

  19. The effective number of parameters • Generalize the number of parameters to regularized (linear) fits of the form ŷ = Sy, where S depends on the input vectors xi but not on the yi • The effective number of parameters is d(S) = trace(S) • The in-sample error estimate is then Êrr_in = err + 2·(d(S)/N)·σ̂²

  20. The effective number of parameters • Thus, for a regularized linear fit ŷ = Sy of an additive-error model, Σ_i Cov(ŷi, yi) = trace(S)·σ² • Hence the optimism is ω = 2·(trace(S)/N)·σ² • and d(S) = trace(S) plays exactly the role that the number of parameters d played before
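For a concrete linear smoother, take ridge regression, S = X(XᵀX + λI)⁻¹Xᵀ; the ridge choice is my example, since the formula above only requires a fit of the form ŷ = Sy. The sketch shows d(S) = trace(S) shrinking from p toward 0 as the penalty λ grows.

```python
# Sketch: effective number of parameters d(S) = trace(S) for a ridge smoother
# y_hat = S y with S = X (X^T X + lambda*I)^{-1} X^T.  Design and lambdas are illustrative.
import numpy as np

rng = np.random.default_rng(4)
N, p = 50, 20
X = rng.normal(size=(N, p))

for lam in [0.0, 1.0, 10.0, 100.0]:
    S = X @ np.linalg.inv(X.T @ X + lam * np.eye(p)) @ X.T
    print(f"lambda={lam:6.1f}  d(S) = trace(S) = {np.trace(S):.2f}")  # p at lambda=0, smaller as lambda grows
```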

  21. The Bayesian Approach and BIC • The Bayesian information criterion (BIC) is BIC = −2·loglik + (log N)·d • BIC/2 is also known as the Schwarz criterion • BIC is proportional to AIC (and Cp), with the factor 2 replaced by log N • Since log N > 2 for N > e² ≈ 7.4, BIC penalizes complex models more heavily, preferring simpler models

  22. BIC (cont’d) • BIC is asymptotically consistent as a selection criterion: given a family of models that includes the true one, the probability of selecting the true model approaches 1 as N → ∞ • Suppose we have a set of candidate models M_m, m = 1,…,M, with corresponding model parameters θ_m, and we wish to choose the best model • Assuming a prior distribution Pr(θ_m|M_m) for the parameters of each model M_m, compute the posterior probability of each model!

  23. BIC (cont’d) • The posterior probability is Pr(M_m|Z) ∝ Pr(M_m)·Pr(Z|M_m), where Z represents the training data • To compare two models M_m and M_l, form the posterior odds Pr(M_m|Z)/Pr(M_l|Z) = [Pr(M_m)/Pr(M_l)]·[Pr(Z|M_m)/Pr(Z|M_l)] • If the posterior odds are greater than one, choose model m; otherwise choose model l

  24. BIC (cont’d) • The Bayes factor BF(Z) = Pr(Z|M_m)/Pr(Z|M_l) is the rightmost term in the posterior odds • We need to approximate Pr(Z|M_m) = ∫ Pr(Z|θ_m, M_m)·Pr(θ_m|M_m) dθ_m • A Laplace approximation to the integral gives log Pr(Z|M_m) ≈ log Pr(Z|θ̂_m, M_m) − (d_m/2)·log N + O(1) • θ̂_m is the maximum likelihood estimate and d_m is the number of free parameters of model M_m • If the loss function is set to −2·log Pr(Z|θ̂_m, M_m), this is equivalent to the BIC criterion

  25. BIC (cont’d) • Thus, choosing the model with minimum BIC is equivalent to choosing the model with the largest (approximate) posterior probability • If we compute the BIC criterion for a set of M models, obtaining BIC_m, m = 1,…,M, then the posterior probability of each model is estimated as Pr(M_m|Z) ≈ e^(−BIC_m/2) / Σ_{l=1..M} e^(−BIC_l/2) • Thus, we can not only pick the best model but also assess the relative merits of the models considered
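The posterior-probability formula is a one-liner in code. The BIC values below are made-up numbers used only to show the computation; subtracting the minimum before exponentiating is a standard numerical-stability step and does not change the result.

```python
# Sketch: approximate posterior model probabilities exp(-BIC_m/2) / sum_l exp(-BIC_l/2).
import numpy as np

bic = np.array([230.4, 228.1, 233.9])        # hypothetical BIC_m for M = 3 candidate models
w = np.exp(-0.5 * (bic - bic.min()))         # shift by the minimum for numerical stability
posterior = w / w.sum()
print(posterior)                             # relative merits of the candidate models
```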

  26. Vapnik-Chervonenkis Dimension • It is often difficult to specify the number of parameters of a model class • The Vapnik-Chervonenkis (VC) dimension provides a general measure of complexity and associated bounds on the optimism • Consider a class of functions {f(x,α)} indexed by a parameter vector α, with x ∈ R^p • Assume f is an indicator function, taking the values 0 or 1 • If α = (α0, α1) and f is the linear indicator function I(α0 + α1ᵀx > 0), then it seems reasonable to say its complexity is p+1 parameters • But what about f(x,α) = I(sin(α·x) > 0)?

  27. VC Dimension (cont’d)

  28. VC Dimension (cont’d) • The Vapnik-Chervonenkis dimension is a way of measuring the complexity of a class of functions by assessing how wiggly its members can be • The VC dimension of the class {f(x,α)} is defined to be the largest number of points (in some configuration) that can be shattered by members of {f(x,α)}

  29. VC Dimension (cont’d) • A set of points is shattered by a class of functions if, no matter how we assign a binary label to each point, some member of the class can separate them perfectly • Example: the VC dimension of linear indicator functions in the plane is 3, since three points in general position can be shattered but no four points can
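Shattering can be checked by brute force for small point sets. The sketch below is entirely my own illustration: it samples random parameter vectors of the linear indicator I(a0 + a1·x1 + a2·x2 > 0) and records which labelings are realized. Three points in general position are shattered, while the four corners of a square are not (the XOR labeling is impossible), consistent with a VC dimension of 3.

```python
# Sketch: approximate brute-force shattering check for linear indicator functions in 2D.
import numpy as np
from itertools import product

rng = np.random.default_rng(5)

def can_shatter(points, trials=20000):
    """True if randomly sampled linear indicators realize every 0/1 labeling of `points`."""
    n = len(points)
    X = np.hstack([np.ones((n, 1)), points])         # prepend the intercept term
    A = rng.normal(size=(trials, 3))                 # random parameter vectors (a0, a1, a2)
    realized = {tuple((X @ a > 0).astype(int)) for a in A}
    return all(lab in realized for lab in product([0, 1], repeat=n))

three = np.array([[0, 0], [1, 0], [0, 1]])           # three points in general position
four = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])    # corners of a square
print(can_shatter(three))   # True: all 8 labelings are realized
print(can_shatter(four))    # False: the XOR labeling cannot be separated linearly
```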

  30. VC Dimension (cont’d) • Using the concept of VC dimension, one can prove results about the optimism of the training error when using a class of functions, e.g.: • If we fit N data points using a class of functions {f(x,α)} having VC dimension h, then with probability at least 1 − η over training sets, the test error is bounded above by the training error plus a complexity term that grows with h·(log(N/h) + 1)/N and with log(1/η) • For regression, the constants a1 = a2 = 1 in the bound are recommended (Cherkassky and Mulier, 1998)

  31. VC Dimension (cont’d) • The bounds suggest that the optimism increases with h and decreases with N, in qualitative agreement with the AIC correction d/N • The VC results are stronger, however: they give probabilistic upper bounds that hold for all functions f(x,α), and hence allow searching over the whole class

  32. VC Dimension (cont’d) • Vapnik’s Structural Risk Minimization (SRM) is built around the bounds described above • SRM fits a nested sequence of models of increasing VC dimension h1 < h2 < …, and then chooses the model with the smallest value of the upper bound • The main drawback is the difficulty of computing the VC dimension of a class of functions; often only a crude upper bound is available, and that may not be adequate

  33. Example – AIC, BIC, SRM

  34. Cross-Validation (CV) • The most widely used method for estimating prediction error • Directly estimates the generalization error by applying the fitted model to held-out test samples • K-fold cross-validation: use one part of the data to build the model and a different part to test it • Do this for k = 1,2,…,K and average the prediction error obtained when predicting the kth part

  35. CV (cont’d) • An indexing function κ: {1,…,N} → {1,…,K} assigns each observation to one of the K groups • Let f̂^(−k)(x) denote the function fitted with the kth part of the data removed • The CV estimate of prediction error is CV = (1/N) Σ_i L(yi, f̂^(−κ(i))(xi)) • If K = N, this is called leave-one-out CV • Given a set of models f(x,α), let f̂^(−k)(x,α) denote the αth model fit with the kth part removed; for this set of models we have CV(α) = (1/N) Σ_i L(yi, f̂^(−κ(i))(xi,α))
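The CV formula maps directly onto a short loop. The sketch below implements K-fold CV from scratch for squared-error loss; the polynomial least-squares model, the synthetic data, and K = 5 are illustrative stand-ins for whatever family of models f(x,α) is being compared.

```python
# Sketch: K-fold CV estimate (1/N) * sum_i L(y_i, f_hat^{-kappa(i)}(x_i)) with squared-error loss.
import numpy as np

def k_fold_cv(x, y, degree, K=5, seed=0):
    """K-fold CV estimate of prediction error for a polynomial fit of the given degree."""
    N = len(y)
    rng = np.random.default_rng(seed)               # same fold assignment for every model compared
    kappa = rng.permutation(np.arange(N) % K)       # kappa: {1,...,N} -> {1,...,K}
    losses = np.empty(N)
    for k in range(K):
        train, test = kappa != k, kappa == k
        coefs = np.polyfit(x[train], y[train], degree)         # fit with the k-th part removed
        losses[test] = (y[test] - np.polyval(coefs, x[test])) ** 2
    return losses.mean()

rng = np.random.default_rng(7)
x = rng.uniform(-1, 1, 100)
y = np.sin(np.pi * x) + rng.normal(0, 0.3, 100)
for degree in range(1, 8):              # choose the degree (the tuning parameter) minimizing CV
    print(degree, round(k_fold_cv(x, y, degree), 4))
```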

  36. CV (cont’d) • CV(α) should be minimized over α • What should we choose for K? • With K = N, CV is approximately unbiased, but it can have high variance because the N training sets are almost identical to one another • The computational cost is also high, since the model must be refit N times

  37. CV (cont’d)

  38. CV (cont’d) • With smaller K, CV has lower variance, but bias can become a problem! • The most common choices are 5-fold and 10-fold CV!

  39. CV (cont’d) • Generalized cross-validation (GCV) is a convenient approximation to leave-one-out CV for linear fits under squared-error loss, i.e., fits of the form ŷ = Sy • For many linear fits, (1/N) Σ_i [yi − f̂^(−i)(xi)]² = (1/N) Σ_i [(yi − f̂(xi)) / (1 − Sii)]², where Sii is the ith element of the diagonal of S • The GCV approximation is GCV = (1/N) Σ_i [(yi − f̂(xi)) / (1 − trace(S)/N)]² • GCV may be advantageous in settings where the trace of S can be computed more easily than the individual Sii
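The sketch below compares the Sii-based leave-one-out shortcut with the GCV approximation for a ridge smoother ŷ = Sy; the ridge model, the synthetic data, and the value of λ are illustrative choices.

```python
# Sketch: leave-one-out CV via the S_ii shortcut versus the GCV approximation
# for a linear smoother y_hat = S y (ridge regression used as the example).
import numpy as np

rng = np.random.default_rng(8)
N, p, lam = 50, 20, 5.0
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + rng.normal(0, 1.0, N)

S = X @ np.linalg.inv(X.T @ X + lam * np.eye(p)) @ X.T
resid = y - S @ y

loo = np.mean((resid / (1 - np.diag(S))) ** 2)        # shortcut using the individual S_ii
gcv = np.mean((resid / (1 - np.trace(S) / N)) ** 2)   # GCV: replace each S_ii by trace(S)/N
print(loo, gcv)                                       # compare the two estimates
```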

  40. Bootstrap • Denote the training set by Z = (z1,…,zN) where zi = (xi, yi) • Randomly draw datasets of size N with replacement from the training data • This is done B times (e.g., B = 100), producing B bootstrap datasets • Refit the model to each of the bootstrap datasets and examine the behavior of the fits over the B replications • From the bootstrap samples we can estimate any aspect of the distribution of S(Z), where S(Z) can be any quantity computed from the data

  41. Bootstrap – Schematic • For example, the variance of S(Z) can be estimated by Var̂[S(Z)] = (1/(B−1)) Σ_b [S(Z*b) − S̄*]², where Z*b is the bth bootstrap dataset and S̄* is the average of the S(Z*b)
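The schematic corresponds to a very small loop in code. In the sketch below, S(Z) is taken to be the median of a sample (an arbitrary choice for illustration), and B = 100 bootstrap replications are used to estimate its variance with the formula above.

```python
# Sketch: basic bootstrap loop estimating the variance of a statistic S(Z).
import numpy as np

rng = np.random.default_rng(9)
y = rng.lognormal(size=200)                    # training data z_1, ..., z_N (illustrative)
B = 100

stats = np.empty(B)
for b in range(B):
    idx = rng.integers(0, len(y), len(y))      # draw N observations with replacement
    stats[b] = np.median(y[idx])               # S(Z*b): the statistic on bootstrap sample b

var_boot = np.sum((stats - stats.mean()) ** 2) / (B - 1)   # bootstrap variance estimate
print(var_boot)
```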

  42. Bootstrap (Cont’d) • Using the bootstrap to estimate prediction error, the naive estimate is Êrr_boot = (1/B)(1/N) Σ_b Σ_i L(yi, f̂*b(xi)), the average error of the bootstrap fits on the original training points • Êrr_boot does not provide a good estimate • Each bootstrap dataset acts as both training and test set, and the two share common observations • The overfit predictions on these shared observations look unrealistically good • By mimicking CV, better bootstrap estimates can be obtained: only keep track of predictions from bootstrap samples that do not contain the observation being predicted

  43. Bootstrap (Cont’d) • The leave-one-out bootstrap estimate of prediction error is Êrr^(1) = (1/N) Σ_i (1/|C^(−i)|) Σ_{b∈C^(−i)} L(yi, f̂*b(xi)) • C^(−i) is the set of indices of the bootstrap samples b that do not contain observation i • We either have to choose B large enough to ensure that all of the |C^(−i)| are greater than zero, or simply leave out the terms corresponding to |C^(−i)| = 0
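A sketch of Êrr^(1): each observation is scored only by fits from bootstrap samples that do not contain it, and observations with |C^(−i)| = 0 are dropped, as described above. The 1-nearest-neighbor regressor and the synthetic data are illustrative assumptions.

```python
# Sketch: leave-one-out bootstrap estimate of prediction error with squared-error loss.
import numpy as np

rng = np.random.default_rng(10)
N, B = 80, 200
x = rng.uniform(-1, 1, N)
y = np.sin(np.pi * x) + rng.normal(0, 0.3, N)

def fit_predict_1nn(x_tr, y_tr, x_te):
    """Predict each test point by the response of its nearest training point."""
    nearest = np.abs(x_te[:, None] - x_tr[None, :]).argmin(axis=1)
    return y_tr[nearest]

per_obs = [[] for _ in range(N)]                        # squared errors, indexed by observation i
for b in range(B):
    idx = rng.integers(0, N, N)                         # bootstrap sample b
    out = np.setdiff1d(np.arange(N), idx)               # observations NOT in sample b
    if out.size == 0:
        continue
    preds = fit_predict_1nn(x[idx], y[idx], x[out])
    for i, pred in zip(out, preds):
        per_obs[i].append((y[i] - pred) ** 2)           # L(y_i, f*b(x_i)) for b in C^{-i}

# Average within each observation (dropping those never left out), then across observations.
err1 = np.mean([np.mean(losses) for losses in per_obs if losses])
print(err1)
```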

  44. Bootstrap (Cont’d) • The leave-one-out bootstrap solves the overfitting problem, but it has a training-set-size bias • The average number of distinct observations in each bootstrap sample is about 0.632·N • Thus, if the learning curve still has a considerable slope at sample size N/2, the leave-one-out bootstrap will be biased upward as an estimate of the error • A number of methods have been proposed to alleviate this problem, e.g., the .632 estimator and the .632+ estimator based on the no-information error rate and the relative overfitting rate

  45. Bootstrap (Example) • Five-fold CV and the .632 estimate for the same problems as before • Any of the measures could be biased, but this does not matter much for model selection as long as the relative performance of the models is unaffected
