
What Makes a Good Model? Statistical Reasoning, Common Sense, Human Fallibility. Richard Shiffrin, Woojae Kim


Presentation Transcript


  1. What Makes a Good Model? Statistical Reasoning, Common Sense, Human Fallibility. Richard Shiffrin, Woojae Kim

  2. What makes a good model? • How do scientists judge? • How should scientists judge?

  3. Model selection involves many high-level factors, but let me begin with a narrower focus on statistical inference: • model comparison • model estimation • data prediction

  4. I will focus today on quantitative models, models that make quantitative predictions for quantitative data, predictions that are exact once all parameters are assigned values. • Non-experts often find modern model selection an intimidating subject, filled with arcane terminology and difficult and complex methods for implementation. And experts argue endlessly about the merits of the many approaches.

  5. SOME METHODS: • ML (Maximum Likelihood) • AIC (Akaike Information Criterion) • BIC (Bayesian Information Criterion) • BMS (Bayesian Model Selection) • FIA (Fisher Information Approximation) • NML (Normalized Maximum Likelihood) • Prequential Prediction • Cross-validation • Extrapolation • Generalization • PBCM (Parametric Bootstrap Cross-fitting Method)

  6. Let me start by discussing the two ‘best’ methods: • MDL (Minimum Description Length) • BMS (Bayesian Model Selection) • (and cross-validation)

  7. Good source: • Peter Grunwald: The Minimum Description Length Principle (2007) [For background and a great deal of insight into Minimum Description Length (MDL) and its relation to Bayesian Model Selection (BMS), we highly recommend a book by Peter Grunwald: The Minimum Description Length Principle, a 2007 MIT Press book that makes reasonably successful attempts to describe much of its material in side boxes and chapters that are less technical.]

  8. A quantitative model for a given task or tasks specifies the probability of each data set, for all possible data sets that could have been found. • ‘model’ denotes a given multidimensional parameter—i.e. with all parameter values specified • ‘model class’ denotes a collection of such models • Thus y = ax+b is a class of linear models, and y = 2x+4 is a model in that class

  9. A hierarchical model usually has some parameters that assign probabilities to other parameters. • But all of the values for the parameters and hyperparameters are captured as a single multidimensional parameter (one column of the descriptive matrix I will present shortly). • All of the data for all subjects are captured as a single multidimensional data description (one row of the matrix).

  10. Statistical model selection in its most advanced form is at heart very simple, basing goodness on the joint probability of the data and the model: • P(D_i, θ_j) • In BMS, P(D_i, θ_j) = P(D_i | θ_j) P_0(θ_j) • P_0(θ_j) is termed the ‘prior’ probability of model θ_j
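
To make the matrix concrete, here is a minimal sketch (not from the slides) of a joint probability matrix in Python with NumPy/SciPy. The binomial model class, the discrete parameter grid, and the prior are illustrative assumptions; rows are data outcomes, columns are models, and each entry is the likelihood times the parameter prior.

```python
import numpy as np
from scipy.stats import binom

# Illustrative model class: binomial success probability, discretized to a few
# parameter values so the matrix is small enough to inspect.
N = 10
thetas = np.array([0.2, 0.5, 0.8])        # the 'models' in the class
prior = np.array([0.25, 0.50, 0.25])      # parameter ('model') prior, sums to 1
outcomes = np.arange(N + 1)               # all possible data sets: k = 0..N successes

# Likelihood matrix: rows = data outcomes D_i, columns = models theta_j
likelihood = binom.pmf(outcomes[:, None], N, thetas[None, :])

# Joint probability matrix: P(D_i, theta_j) = P(D_i | theta_j) * P_0(theta_j)
joint = likelihood * prior[None, :]
print(joint.sum())                        # ~1.0: the matrix is a probability distribution
```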

  11. [Slide figure] MODEL Classes I and II: Joint Probability Matrix. Columns: the models of Class I and Class II, with their parameter priors; rows: the possible data outcomes, with the data ‘Prior’. Table entries give the joint values: the probability of a given data outcome AND the particular parameter value.

  12. The entries are the joint probability of the model and the data. Where do these come from? In traditional BMS, they are simply the prior times the likelihood: The probability of the data given the model times the prior probability of the model. • Although one might think the joint probability should also reflect the prior probability of the data, doing so in any simple way will distort the definition of the model and the model class, so we will keep the traditional approach.

  13. Model selection is based on a comparison of two (or more) model classes. • The classes typically differ in complexity. E.g. a data set could be fit by a linear model (simpler) or a seventh-degree polynomial (more complex). • How do we compare? • Judge a model class by its best member (MDL/NML)? • Judge by a weighted average of its members (BMS)? • How do we balance good fit and complexity?

  14. BMS and NML both use the joint probability matrix for model selection. It is of course equivalent to separately give the conditional probabilities and the prior probabilities of the models, but I find it simpler to couch discussion directly in terms of the joint probabilities.

  15. [Slide figure, repeated from slide 11] MODEL Classes I and II: Joint Probability Matrix. Columns: models with their parameter priors; rows: data outcomes with the data ‘Prior’. Table entries give the joint values: the probability of a given data outcome AND the particular parameter value.

  16. Of course it is critical to take prior probabilities into account to carry out sensible inference.

  17. You are all familiar with the rare disease example: A test is 80% accurate: 80% of the time you have the disease the test says so; 80% of the time you do not have the disease the test says so. • The test says you have the disease. Should you be worried? • The incidence of the disease in the population is 1 in 1000. This is the ‘prior’ probability, and needs to be taken into account: P(disease | positive test) ≈ 0.004 (not 0.8).
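
As a quick check of the arithmetic, here is the Bayes' rule computation behind the 0.004 figure (a minimal sketch; the variable names are mine):

```python
# Rare-disease example: posterior probability of disease given a positive test.
sensitivity = 0.8        # P(test positive | disease)
false_positive = 0.2     # P(test positive | no disease)
prior = 0.001            # P(disease): 1 in 1000

p_positive = sensitivity * prior + false_positive * (1 - prior)
posterior = sensitivity * prior / p_positive
print(round(posterior, 4))   # ~0.004 = P(disease | positive test)
```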

  18. In general, what we know about data from history and what we know about parameters from history will NOT be consistent with each other, because they are typically based on different sources of prior knowledge. Also, the dimensionality of the two priors differs markedly; models are used to ‘compress’ the data. Proper inference requires that both priors be taken into account, but the field has not taken up ways to carry out inference when these are not mutually consistent. I will soon suggest a way to take data priors into account, but for now let us follow convention and focus only on the parameter (i.e. model) prior, ignoring any data prior.

  19. When we consider the models together, we must not be confused by the fact that a given column of joint probabilities might be identical in two model classes (if the parameter priors are the same for those columns) or related by a constant multiplier (if the parameter priors for those columns differ). {E.g. A column in each model class might be identical}. This situation occurs routinely, as when one model class is nested inside another. When we realize we are selecting model classes, not a particular model, the problem dissolves. It may help to think of two identical models in different model classes as just very similar (differing by an infinitesimal amount). We are now ready to describe BMS and MDL in terms of the joint probability matrix:

  20. The BMS Model selection criterion is now simple: Sum the joint probabilities in the row for the observed data for model class 1, and separately form this sum for model class 2. • We prefer the class with the larger sum. More precisely, the posterior probability for class 1 is its sum divided by the sum of both sums.
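
A minimal sketch of this criterion, reusing the toy binomial setup from the earlier matrix sketch (the two classes, their priors, and the observed data are illustrative assumptions):

```python
import numpy as np
from scipy.stats import binom

N, k_obs = 10, 7                          # assumed observed data: 7 successes in 10 trials
outcomes = np.arange(N + 1)

def joint_matrix(thetas, prior):
    """Rows: data outcomes; columns: models; entries: P(D_i, theta_j)."""
    like = binom.pmf(outcomes[:, None], N, np.asarray(thetas)[None, :])
    return like * np.asarray(prior)[None, :]

# Class I: a single 'null' model; Class II: a richer class (illustrative choices).
joint_I = joint_matrix([0.5], [1.0])
joint_II = joint_matrix([0.1, 0.3, 0.5, 0.7, 0.9], [0.2] * 5)

# BMS: sum the observed-data row within each class, then normalize across classes.
score_I = joint_I[k_obs].sum()
score_II = joint_II[k_obs].sum()
print(score_I / (score_I + score_II))     # posterior probability of Class I
print(score_II / (score_I + score_II))    # posterior probability of Class II
```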

  21. [Slide figure, repeated] MODEL Classes I and II: Joint Probability Matrix. Columns: models with their parameter priors; rows: data outcomes with the data ‘Prior’. Table entries give the joint values: the probability of a given data outcome AND the particular parameter value.

  22. I will return to BMS after first discussing model selection approaches that base inference on the maximum probability assigned to a given data outcome within a given model class. The best such method (see Grunwald) is MDL as approximated by a particular form of NML (normalized maximum likelihood). • This is easy to describe with our matrix:

  23. [Slide figure] MODEL Class I: Joint Probability Matrix, with the maximum entry in each row highlighted (‘Max in this row’). Columns: models with their parameter priors; rows: data outcomes with the data ‘Prior’. Table entries give the joint values: the probability of a given data outcome AND the particular parameter value.

  24. All of the modern model selection methods balance good fit and complexity. It is easy to see how NML does this: The max fit for the observed data represents good fit: larger is better. But this is divided by the sum of maxes for all possible data outcomes: We dislike models that predict everything, and want the grand sum to be as small as possible.
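
In the same toy setup, the NML-style score described here (following the slides' joint-matrix formulation) can be sketched as the row maximum for the observed data divided by the sum of row maxima over all possible data outcomes. With the uniform parameter prior assumed below, the maximum of the joint values is proportional to the maximum likelihood, and the constant cancels in the ratio.

```python
import numpy as np
from scipy.stats import binom

N, k_obs = 10, 7                           # assumed observed data, as before
outcomes = np.arange(N + 1)

# Illustrative class: discrete parameter grid with a uniform parameter prior.
thetas = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
prior = np.full(len(thetas), 1 / len(thetas))
joint = binom.pmf(outcomes[:, None], N, thetas[None, :]) * prior[None, :]

row_max = joint.max(axis=1)                # best fit achievable for each possible data outcome
print(row_max[k_obs] / row_max.sum())      # fit to observed data / fit to everything
```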

  25. The way that BMS balances good fit and complexity is called the Bayesian Occam’s Razor, and operates similarly, though it can be harder to see. • It is easiest to see the close connection of BMS and NML by re-describing the BMS model selection criterion in a new way that is nonetheless mathematically equivalent to the usual description.

  26. In the joint probability table, take the mean joint probability value for the observed data and divide by the sum of such means for all data outcomes. (Within a class, every row mean is the row sum divided by the same number of models, so this ratio equals the row sum divided by the total prior mass; when the parameter prior sums to 1, it is just the usual BMS sum for the observed data.)

  27. [Slide figure] MODEL Class I: Joint Probability Matrix, with models M1, M2, M3, …, Mn as columns (with their parameter priors) and data outcomes as rows (with the data ‘Prior’). Table entries give the joint values. The BMS score for Model Class I is the mean for the observed data (say Row 1) divided by the sum of means for all rows.

  28. We end with a rather simple and fairly remarkable conceptual convergence of the NML and BMS methods: • Both use the joint probability matrix. Both divide a statistic for the observed data by a sum of those statistics for all data outcomes. • The statistic for NML is the max of the distribution, and the statistic for BMS is the mean (both of the joint probability values).

  29. The description in terms of max and mean allows us to compare the two approaches easily. • Occam’s Razor becomes clear in both BMS and NML: • Fit to the observed data is Good, • Fit to all possible data is Bad.

  30. The way BMS and NML balance fit and complexity has many connections to another model selection criterion, prediction, often implemented in one or another form of cross-validation: • A model class is good if the fit to the current set of data predicts new data well. • Thus we might split the data and fit the first half, and prefer the model class that, based on that fit, predicts the other half best, as in the sketch below.
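
A minimal split-half sketch of that idea (the linear vs seventh-degree comparison, the interleaved split, and the Gaussian scoring rule are all illustrative assumptions): fit each candidate class to one half of the data, then score it by the log-likelihood it assigns to the held-out half.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: a noisy line, so the simpler class should predict better.
x = np.linspace(-1, 1, 40)
y = 1.5 * x + 0.5 + rng.normal(scale=0.3, size=x.size)

fit_x, fit_y = x[::2], y[::2]          # fitting half (interleaved split, assumed)
test_x, test_y = x[1::2], y[1::2]      # held-out half

def holdout_log_likelihood(degree):
    """Fit a polynomial of the given degree, then score the held-out half."""
    coefs = np.polyfit(fit_x, fit_y, degree)
    resid = fit_y - np.polyval(coefs, fit_x)
    sigma = resid.std() if resid.std() > 0 else 1e-6      # plug-in noise estimate
    test_resid = test_y - np.polyval(coefs, test_x)
    return -0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (test_resid / sigma) ** 2)

for degree in (1, 7):                  # linear class vs seventh-degree polynomial class
    print(degree, holdout_log_likelihood(degree))
```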

  31. Note that BMS, MDL, and goodness of prediction (e.g. cross validation) are different criteria. They usually make similar model selection choices, but are not identical (I will say more about this later). • E.g. One can predict using Bayesian Model Averaging (integrating predictions over the posterior), but this will not necessarily produce the ‘best’ predictions. (Some recent research by Grunwald shows how to ‘fix’ BMS to predict better).

  32. I have noted the need for inference to include prior knowledge. There has been much ‘philosophical’ argumentation about the Bayesian interpretation of priors. • E.g. Is it sensible to assign degrees of belief to a model we know is wrong? Thus Grunwald calls the priors ‘weights’ and does not assume they must add to 1.0. • But since BMS and NML both divide a quantity by a sum of like quantities, only the relative size of the weights/prior probabilities matters. We might as well think of the priors as weights.

  33. Because all our models are known to be wrong, you may dislike assigning posterior degrees of belief to such models, as is done in BMS. If this bothers you, use the MDL/NML justification for model selection, and consider BMS a close approximation that is easier to calculate.

  34. It has been claimed that BMS does not depend on the intent of the experimenter (the Likelihood Principle) but NML does. • However, if the difference between the two approaches is one of max vs mean, then the difference due to intent is limited to differences in max vs mean calculations.

  35. E.g. one can carry out a Binomial study: N trials of successes and failures, observing a string of N-1 failures and then a success, or carry out a Negative Binomial study sampling until a first success occurs, also observing N-1 failures and then a success. • Given the same data, the BMS model selection score is of course the same for the two intents. • It is generally the case that this is not true for NML. • However, if the difference between the two approaches is one of max vs mean, then the difference due to intent is limited to differences in max vs mean calculations. Such differences are typically modest and we therefore regard the NML intent differences to be an aside rather than of deep fundamental importance. • (We will discuss later situations in which intent really ought to matter, though that issue is orthogonal to the present one.)
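
A sketch of this contrast under stated assumptions (not the authors' demonstration): a discretized parameter grid with a uniform prior, scipy's geometric distribution standing in for the negative binomial stopping rule, and a truncation of the unbounded outcome space for the NML normalizer. The normalized posterior over the parameter, and any Bayes-factor comparison built from it, is the same under both intents because the two likelihoods differ only by a constant; the NML normalizers differ because they sum over different outcome spaces.

```python
import numpy as np
from scipy.stats import binom, geom

N = 10                                    # observed: N-1 failures, then a success
p_grid = np.linspace(0.01, 0.99, 99)      # discretized parameter, uniform prior

# Likelihoods of the observed data under the two intents.
lik_binomial = binom.pmf(1, N, p_grid)    # Binomial intent: 1 success in N fixed trials
lik_negbinom = geom.pmf(N, p_grid)        # Negative Binomial intent: first success at trial N

# BMS: the likelihoods differ only by the constant N, so the normalized
# posteriors over p (and Bayes factors built from them) are identical.
print(np.allclose(lik_binomial / lik_binomial.sum(),
                  lik_negbinom / lik_negbinom.sum()))   # True

# NML: the normalizer sums the best achievable fit over all *possible* outcomes,
# and the outcome spaces differ (k = 0..N vs n = 1, 2, ...). The unbounded
# negative-binomial space is truncated here purely for illustration.
binom_norm = sum(binom.pmf(k, N, p_grid).max() for k in range(N + 1))
negbi_norm = sum(geom.pmf(n, p_grid).max() for n in range(1, 200))
print(lik_binomial.max() / binom_norm, lik_negbinom.max() / negbi_norm)
```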

  36. Generalizing Statistical Model Selection: I: Data Priors

  37. One could imagine data priors and parameter priors that are consistent: Take the joint probability matrix: the column sums are the parameter (model) priors and the row sums are the data priors, and these are then consistent with each other. • This raises the question: From where do the joint probabilities arise?
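
In the toy joint matrix from the earlier sketch, this consistency is just marginalization in the two directions (an illustrative check, not from the slides):

```python
import numpy as np
from scipy.stats import binom

# Same toy joint matrix as in the earlier sketch (assumed setup).
N = 10
thetas = np.array([0.2, 0.5, 0.8])
prior = np.array([0.25, 0.50, 0.25])
outcomes = np.arange(N + 1)
joint = binom.pmf(outcomes[:, None], N, thetas[None, :]) * prior[None, :]

print(np.allclose(joint.sum(axis=0), prior))  # column sums recover the parameter prior
data_prior = joint.sum(axis=1)                # row sums give the implied, consistent data prior
print(round(data_prior.sum(), 6))             # 1.0
```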

  38. Going into an experiment what we know about models and what we know about data are (almost) always based on different sources of knowledge, and will not be consistent with each other. • In actual practice, we usually know more and are more confident about probable data outcomes than model parameter values. After all, our models are reflections of, and attempts to characterize, the real world– i.e. data.

  39. No model selection methods, including BMS and MDL, provide a means for dealing with data priors. • There are several ways we have considered for doing so. This is research in progress. Let me mention one reasonable possibility. Consider BMS first.

  40. Suppose our knowledge of likely data is not based on an earlier replication of the present study, but instead on vague inference from general knowledge and prior studies in other paradigms. • Such knowledge has two main dimensions: • The relative shape of data outcomes • The strength of belief in such inference

  41. We can represent both by imagining we had a prior study: • assume the prior study had m trials (representing the strength of knowledge) • assign different probabilities to data outcomes of that study (representing shape knowledge)

  42. For expanded inference, we select one of the prior study outcomes, call it D*j, and combine that data with each of the actual and potential data outcomes of the present study: i.e. the i rows of the matrix which previously represented Di now represent Di + D*j. • We now carry out Bayesian inference on this matrix as usual, obtaining a posterior based on both the present study's data and one of the imagined data samples.

  43. We do this for every imagined data sample, obtaining M* posteriors, the probability of each given by the data prior. • These posteriors are weighted by the data prior probabilities and averaged. • The model selection criterion is, as usual, the sum of the resultant (average) posterior across the models in a class: BMS_DP(K) = Σ_k Σ_i p′(D_obs, D*_i, θ_k) p_0(D*_i), where i runs over the imagined outcomes and k over the models in class K.
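
One possible reading of this proposal as code (a sketch under stated assumptions, not the authors' implementation): the data prior is represented by a few imagined outcomes of a prior study of m trials, each imagined study is treated as conditionally independent of the present study given the parameter, and the class score averages the usual BMS sums over those imagined outcomes, weighted by the data prior.

```python
import numpy as np
from scipy.stats import binom

# Present study (assumed): k_obs successes in N trials; a discrete model class.
N, k_obs = 10, 7
thetas = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
theta_prior = np.full(len(thetas), 1 / len(thetas))

# Imagined prior study of m trials standing in for the data prior: a few
# representative outcomes and their prior weights (all illustrative assumptions).
m = 20
imagined_ks = np.array([6, 10, 14])        # representative imagined outcomes D*_i
data_prior = np.array([0.25, 0.5, 0.25])   # p_0(D*_i), sums to 1

# p'(D_obs, D*_i, theta_k): present-study likelihood times imagined-study
# likelihood times the parameter prior (imagined study treated as independent
# of the present study given theta -- an assumption of this sketch).
score = 0.0
for k_star, weight in zip(imagined_ks, data_prior):
    joint = binom.pmf(k_obs, N, thetas) * binom.pmf(k_star, m, thetas) * theta_prior
    score += weight * joint.sum()          # sum over the models in the class

print(score)   # BMS_DP score for this class; compare across classes as usual
```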

  44. To do this as stated would generally not be computationally feasible, due to the large number of imagined data outcomes. • I believe, though this is not yet confirmed, that the proposed system will work pretty well if we represent the data prior with just a few representative imagined data outcomes.

  45. There is an analogous expanded formulation within NML, but I will not discuss that today, to save time.

  46. To summarize, we represent prior data knowledge by imagining a prior study with size of study representing our strength of knowledge (relative to the present study), and with outcome structure representing the form of the knowledge. • We assume one of those imagined outcomes occurred, combine that outcome with the actual and possible outcomes of the present study, and carry out BMS, obtaining a posterior. • Then these posteriors are averaged.

  47. Generalizing Statistical Model Selection: II: Data Validity

  48. In any real-life model selection situation we not only have to consider inference based on the observed and virtual data, but also the reality that the observed data might be invalid. For example, programming errors might have been made anywhere from experimental design to implementation to data analysis. • All of us have experienced cases where a research assistant brings us results we do not believe, and most often further checking reveals problems that show such data to have been invalid.

  49. Other common cases occur with study replications where the outcomes are inconsistent to a degree unlikely to have occurred by chance. We probably trust our own study more than a study by someone else, but in truth we should allow for the possibility that either is invalid (or even that both are). • Of course our validity inferences should be governed in part by the number of studies whose results are consistent with each other: One deviant study among n consistent studies is likely the invalid one.
