
Who Nose What Tomorrow Brings?


Presentation Transcript


  1. Who Nose What Tomorrow Brings? David J. Weiss

  2. Some Predictions from Swami Weiss • There will be a severe hurricane threat to Florida during the second week of October 2014 • The LA Lakers will fail to make the playoffs in the 2013-2014 season • The married folks in the audience will buy an expensive gift for their spouse during December 2014 • During 2015, Michael Birnbaum will publish an excellent paper presenting results that cannot be accounted for by SEU or prospect theory

  3. The Expert Forecaster • Weatherperson, sports bookmaker, investment advisor, intelligence analyst, safety engineer, personnel analyst, admissions director, marriage counselor, parole board member, custody judge • The first two professions make predictions about short-term, unitary events. • The others make predictions about events that will play out over a relatively long term. They also usually recommend actions based on their probability estimates.

  4. Everyone Forecasts • Amateurs also make predictions. • Are the professionals really expert? • How can we tell?

  5. Three General Approaches • Credentials • Experience • Performance-Based Assessment • Scoring outcomes (prediction accuracy) • This talk examines some of the challenges in scoring outcomes (Hammond’s “Correspondence”)

  6. Technical Matters • Specificity of the prediction • Determining whether the predicted event did in fact occur • Duration of the observation period. For a prediction that unfolds over time, the ultimate result can change • These ambiguities can usually be resolved (Tetlock), but in practice are often overlooked • Announcing the prediction can affect the outcome

  7. Scoring Index • Percent correct (batting average) is the easy solution, but • Scoring over the person’s predictions assumes the events are comparable • (as Moneyball highlighted, baseball batting average has a similar shortcoming) • Some of the Swami’s predictions may be more likely to come true than others
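
A minimal sketch of the point above, with invented forecasts and prior probabilities: plain percent correct treats every prediction as equally hard, while a difficulty-weighted score gives bold calls more weight. The weighting scheme is only illustrative.

    # Percent correct vs. a difficulty-weighted score; all numbers are hypothetical.
    forecasts = [            # (came_true, assumed prior probability of the event)
        (True, 0.90),        # easy call that came true
        (True, 0.85),        # easy call that came true
        (False, 0.10),       # bold call that missed
        (True, 0.05),        # bold call that came true
    ]

    hits = sum(came_true for came_true, _ in forecasts)
    batting_average = hits / len(forecasts)              # 0.75

    # Weight each prediction by how surprising the event was (1 - prior),
    # so easy calls contribute little and bold correct calls contribute a lot.
    weights = [1 - prior for _, prior in forecasts]
    weighted_score = sum(w for (came_true, _), w in zip(forecasts, weights) if came_true) / sum(weights)

    print(f"batting average: {batting_average:.2f}")
    print(f"difficulty-weighted score: {weighted_score:.2f}")   # about 0.57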

  8. The Playing Field is Not Level • Base-rate differences • Weather is easy to predict in Fullerton • The road not taken • Almost every applicant accepted by Princeton graduates, not so at Cal State • Would Princeton’s rejects have graduated from Cal State?
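
To put numbers on the base-rate point (with made-up graduation rates and generic school names): a predictor who says "will graduate" about every applicant looks impressive wherever nearly everyone graduates, without using any information about individual students.

    # Hypothetical graduation base rates at two schools.
    base_rates = {"Selective U": 0.97, "Open U": 0.55}

    for school, p_graduate in base_rates.items():
        # Always predicting "graduates" is correct exactly as often as students graduate.
        accuracy_of_always_yes = p_graduate
        print(f"{school}: accuracy of always predicting graduation = {accuracy_of_always_yes:.0%}")

    # The 97% vs. 55% gap reflects the schools, not the forecaster's skill,
    # and says nothing about how Selective U's rejects would have fared elsewhere.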

  9. Calibration Saves the Day? • When an expert repeatedly makes predictions of similar events, one can evaluate accuracy at a finer level. • Calibration imposes a different standard from batting average • “Well-calibrated” sounds like “expert” • Is calibration a clever way to discount errors?
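
A minimal calibration check, on fabricated data: stated probabilities are grouped into bins, and each bin's average forecast is compared with the observed relative frequency of the event in that bin.

    # Calibration table sketch; forecasts and outcomes are invented.
    forecasts = [0.1, 0.1, 0.2, 0.3, 0.3, 0.7, 0.7, 0.8, 0.9, 0.9]   # stated probabilities
    outcomes  = [0,   0,   1,   0,   0,   1,   0,   1,   1,   1  ]   # 1 = event occurred

    bins = [(0.0, 0.5), (0.5, 1.0)]   # coarse bins; a real analysis would use finer ones
    for lo, hi in bins:
        in_bin = [(f, o) for f, o in zip(forecasts, outcomes) if lo <= f < hi]
        mean_forecast = sum(f for f, _ in in_bin) / len(in_bin)
        observed_freq = sum(o for _, o in in_bin) / len(in_bin)
        print(f"bin {lo}-{hi}: mean forecast {mean_forecast:.2f}, observed frequency {observed_freq:.2f}")

    # A well-calibrated forecaster's bins line up (here 0.20 vs 0.20 and 0.80 vs 0.80),
    # which says nothing about whether the errors within each bin were equally costly.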

  10. Remembering Our Founder • Experts and the public resist forecasts expressed in probabilistic terms • They may be right to do so, because calibration makes sense only when both kinds of errors are equally costly. As Ward kept telling us, utilities are the proper basis for decisions. This limitation applies similarly to sophisticated “skill scores” such as those of Murphy (1988) and Stewart and Lusk (1994). • Predictions also vary in importance
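
To make the utility point concrete, the sketch below scores the same hypothetical forecasts twice: once by counting errors symmetrically, and once with assumed asymmetric costs where a miss is far more expensive than a false alarm. The cost figures are placeholders, not recommendations.

    # Symmetric error count vs. utility-weighted cost; all numbers are illustrative.
    COST_MISS = 100.0        # assumed cost of failing to warn before an event that happens
    COST_FALSE_ALARM = 5.0   # assumed cost of warning before an event that doesn't happen

    # (warned, event_occurred) pairs for two hypothetical forecasters
    forecaster_a = [(True, True), (True, False), (True, False), (False, False)]     # warns readily
    forecaster_b = [(False, True), (False, False), (False, False), (False, False)]  # never warns

    def error_count(record):
        return sum(warned != occurred for warned, occurred in record)

    def total_cost(record):
        return sum(COST_MISS if (occurred and not warned)
                   else COST_FALSE_ALARM if (warned and not occurred)
                   else 0.0
                   for warned, occurred in record)

    for name, record in [("A", forecaster_a), ("B", forecaster_b)]:
        print(f"forecaster {name}: errors = {error_count(record)}, cost = {total_cost(record)}")

    # Counting errors, B (one miss) beats A (two false alarms);
    # weighting by cost, B's single miss makes B far worse.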

  11. The “Bold Prediction” • If calibration overweights the mundane, then should we judge forecasters by how well they predict the spectacular? We might care more about predicting hurricanes than partly cloudy days • Be sure to address false alarms (crying “wolf” too often leads to being ignored) • Because rare events are, well, rare, they provide little data
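
One way to keep false alarms visible is to report hit rate and false-alarm rate separately rather than a single percent correct. The counts below are invented; the point is only that with genuinely rare events the hit rate rests on very few cases.

    # Signal-detection-style summary for a rare event; all counts are hypothetical.
    hits = 2                  # rare event predicted and it occurred
    misses = 1                # rare event occurred without warning
    false_alarms = 40         # warnings issued for events that never came
    correct_rejections = 2000

    hit_rate = hits / (hits + misses)                              # based on only 3 events
    false_alarm_rate = false_alarms / (false_alarms + correct_rejections)

    print(f"hit rate: {hit_rate:.2f} (from only {hits + misses} events)")
    print(f"false-alarm rate: {false_alarm_rate:.3f} (from {false_alarms + correct_rejections} non-events)")
    # A 0.67 hit rate built on three occurrences is thin evidence of expertise,
    # and 40 false alarms may already have exhausted the audience's patience.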

  12. Is Forecasting Even Possible? • Taleb’s turkey highlights the danger in using past results to predict future outcomes • But what else is there to guide us but the past? • Perspective matters: When the event to be predicted is under the control of a human (such as killing a turkey or planting a bomb), someone with knowledge about that human might be able to predict it without historical information.

  13. An Expert Turkey • Learns about the farmer’s (or other farmers’) plans • Observes that there are no old turkeys around, and draws an inference • These methods do not use observations of the focal event (previous turkey beheadings) to predict the future.

  14. Two Kinds of Environment (Taleb) • Mediocristan, where single observations do little to change the aggregate. Processes are stationary, and statistical models are descriptive. • Regression-based prediction works in Mediocristan. Experts can potentially use better prediction models. Better can mean either a more accurate model or better parameter estimates. One can learn from results.
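
A small sketch of learning from results in a stationary world, using synthetic data: fit a simple linear rule to past observations and check it on later ones. In Mediocristan the old fit keeps working on new cases, which is what makes outcome feedback useful.

    import random

    random.seed(1)

    # Synthetic stationary process: y depends linearly on x plus modest noise.
    def observe(x):
        return 2.0 * x + 1.0 + random.gauss(0, 0.5)

    past = [(x, observe(x)) for x in range(20)]
    future = [(x, observe(x)) for x in range(20, 30)]

    # Ordinary least squares for one predictor, done by hand.
    n = len(past)
    mean_x = sum(x for x, _ in past) / n
    mean_y = sum(y for _, y in past) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in past) / sum((x - mean_x) ** 2 for x, _ in past)
    intercept = mean_y - slope * mean_x

    future_error = sum(abs(slope * x + intercept - y) for x, y in future) / len(future)
    print(f"fitted rule: y ~ {slope:.2f}x + {intercept:.2f}; mean error on later data: {future_error:.2f}")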

  15. The More Challenging Environment • Extremistan, where the total can be significantly impacted by a single observation • The world of the Black Swan, where the past is not a good guide to the future • Even successful prediction may represent a case of being fooled by randomness (no re-test is available)
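
A numerical contrast between the two environments, with made-up figures: one more person's height barely moves a total of heights, while one extreme fortune swamps a total of net worths.

    # Mediocristan quantity: heights (cm). Extremistan quantity: net worths (dollars). Both lists are invented.
    heights = [165, 170, 172, 168, 175, 180, 160, 174, 169, 171]
    net_worths = [50_000, 80_000, 120_000, 30_000, 60_000, 90_000, 40_000, 70_000, 55_000, 65_000]

    def newcomer_share(values, newcomer):
        """Fraction of the new total contributed by a single additional observation."""
        return newcomer / (sum(values) + newcomer)

    print(f"very tall newcomer's share of total height: {newcomer_share(heights, 210):.1%}")        # about 11%
    print(f"one billionaire's share of total net worth: {newcomer_share(net_worths, 10**9):.1%}")   # about 99.9%
    # Only in the second case does a single observation dominate the aggregate.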

  16. The Environment and Evaluation • Scoring outcomes is feasible in Mediocristan. Probabilistic forecasts can be compared to observed relative frequencies. • Aggregation over instances is meaningful because the utilities are comparable. • But not so in Extremistan, where a single inaccurate prediction may be much more consequential than a host of accurate ones. • Probabilistic forecasts in Extremistan are opinions, and cannot be compared to observed frequencies

  17. Which Environment Do Experts Inhabit? • Both, of course. But Extremistan is where some really important events reside, and where the accuracy of previous predictions need not be informative. • While weather forecasting is certainly useful, it is not typical of the kind of forecast we get excited about, namely predicting the black swan. For example, the furor over the failure of US intelligence to anticipate 9/11 was an indictment of prediction in Extremistan. • Predicting black swans that are the result of human action calls for getting an insider’s perspective. Meehl’s “broken leg” cue, insider trading, infiltration

  18. Sorry, Ken Hammond • Because scoring outcomes in Extremistan is so problematic, I suggest evaluating performance by examining coherence (process) instead. • The expert turkey was able to predict catastrophe by doing what looks like what intelligence officers are supposed to do – learning about plans (spying) and drawing causal inferences from observations.

  19. Evaluating Coherence • If we think we know the correct process, we could evaluate the expert on the basis of adherence to that process. • If we lack that confidence, what can be done? • We can examine the discrimination and consistency in the predictions (you may have heard this before). All this really means is that predictions should be responsive to the relevant evidence.
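
One way to operationalize discrimination and consistency (in the spirit of the Weiss-Shanteau CWS ratio) is to compare how much a judge's answers vary across different cases with how much they vary on repeated presentations of the same case. The judgments below are fabricated.

    # Discrimination (between-case variance of mean judgments) over
    # inconsistency (within-case variance across repeated judgments of the same case).
    judgments = {                       # case -> [first judgment, repeat judgment]
        "case 1": [0.20, 0.25],
        "case 2": [0.70, 0.65],
        "case 3": [0.40, 0.40],
        "case 4": [0.90, 0.85],
    }

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    case_means = [sum(reps) / len(reps) for reps in judgments.values()]
    discrimination = variance(case_means)                                                # different cases, different answers?
    inconsistency = sum(variance(reps) for reps in judgments.values()) / len(judgments)  # same case, same answer?

    print(f"discrimination / inconsistency = {discrimination / inconsistency:.1f}")
    # A high ratio means the judgments track differences in the evidence rather than noise,
    # which is all this criterion claims: responsiveness to the relevant evidence.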

  20. Reality Check • It is unlikely that users of predictions will be satisfied to know that their forecaster discriminates and is consistent. People like correspondence. Most will demand a track record of accurate predictions. • Unfortunately, such a record is unlikely ever to be available in Extremistan.

  21. Cart Before Horse • Millions of tax dollars are currently being spent on an IARPA project whose goal is to determine how best to aggregate predictions made by individual forecasters. • IMHO, that money is going down a well. Without better understanding of how to evaluate predictions in Extremistan, one cannot say whether one aggregation method is better than another. IARPA assumes all environments are Mediocristan.

  22. Summary • Batting average, while basically appropriate for evaluating expert predictors in Mediocristan, should be improved by incorporating utilities • Calibration, a sophisticated version of batting average, might be generalized to include utilities. • Probabilities cannot be compared to observed frequencies in Extremistan. Predictors need to realize when they have crossed the border.
