
PM2.5 Model Performance: Lessons Learned and Recommendations


Presentation Transcript


  1. PM2.5 Model Performance: Lessons Learned and Recommendations Naresh Kumar Eladio Knipping EPRI February 11, 2004

  2. Acknowledgements • Atmospheric & Environmental Research, Inc. (AER) • Betty Pun, Krish Vijayaraghavan and Christian Seigneur • Tennessee Valley Authority (TVA) • Elizabeth Bailey, Larry Gautney, Qi Mao and others • University of California, Riverside • Zion Wang, Chao-Jung Chien and Gail Tonnesen

  3. Overview • Model Performance Issues • Need for Performance Guidelines/Benchmarking • Review of Statistics • Summary

  4. Model Performance Issues • Evaluation of Modeling Systems • Local vs. Regional Evaluation • Daily/Episodic/Seasonal/Annual Averaging • Threshold and Outliers • What Species to Evaluate? • Sampling/Network Issues

  5. Examples from Model Applications • Two applications of CMAQ-MADRID • Southeastern U.S. (SOS 1999 Episode) • Big Bend National Park, Texas (BRAVO); Four-Month Study • Statistical performance for SO₄²⁻, EC, OM, PM2.5

  6. Application in Southeastern U.S. • Southern Oxidant Study (SOS 1999) • June 29 to July 10, 1999 • Meteorology processed from MM5 simulations using MCIP2.2 • Emissions files courtesy of TVA • Simulation • Continental U.S. Domain • 32-km horizontal resolution without nesting

  7. Application to Big Bend National Park [maps of the REMSAD and CMAQ-MADRID modeling domains] • The Georgia Tech/Goddard Global Ozone Chemistry Aerosol Radiation Transport (GOCART) model prescribed boundary conditions for SO₂ and SO₄²⁻ to the REMSAD domain. • The preliminary Base Case simulation used boundary conditions as prescribed from a simulation of the larger outer domain by REMSAD. • SO₂ and SO₄²⁻ concentrations were scaled at the CMAQ-MADRID boundary according to CASTNet and IMPROVE network observations.

  8. BRAVO Monitoring Network [map of the BRAVO monitoring sites across Texas, including Big Bend (K-Bar), Guadalupe Mtns, McDonald, Wichita Mtns, Hagerman, Padre Island, Laguna Atascosa and other sites]

  9. Local vs. Regional (SOS 1999) [map of rural, suburban and urban SOS sites: Yorkville (YRK), North Birmingham (BHM), Jefferson Street (JST), Centreville (CTR), Oak Grove (OAK), Outlying Landing Field #8 (OLF), Gulfport (GFP), Pensacola (PNS)]

  10. Local vs. Regional (BRAVO) [maps of the spatial distribution of mean normalized bias and mean normalized error for SO₄²⁻]

  11. Daily SO₄²⁻ P:O Pairs with Different Averaging [four scatter plots; MNB/MNE by panel: 37%/65%, 28%/43%, 29%/35%, 26%/26%; regression fits: y = 1.32x - 0.11 (R² = 0.56), y = 1.11x + 0.54 (R² = 0.47), y = 1.73x - 1.41 (R² = 0.87), y = 1.30x + 0.01 (R² = 0.50)]

  12. Daily SO₄²⁻ P:O Pairs for Each Month [four scatter plots; MNB/MNE by panel: 2%/52%, 8%/49%, 37%/61%, 85%/92%; regression fits: y = 0.71x + 0.46 (R² = 0.26), y = 0.99x + 0.22 (R² = 0.58), y = 1.08x + 0.46 (R² = 0.51), y = 1.53x + 0.77 (R² = 0.51)]

  13. Effect of Threshold

  14. Mean Normalized/Fractional Statistics

  15. Need for Model Performance Guidelines • If no guidelines exist • Conduct model simulation with best estimate of emissions and meteorology • Perform model evaluation using favorite statistics • Difficult to compare across models • State that model performance is “quite good” or “adequate” or “reasonable” or “not bad” or “as good as it gets” • Use relative reduction factors • With guidelines for ozone modeling • If model didn’t perform within specified guidelines • Extensive diagnostics performed to understand poor performance • Improved appropriate elements of modeling system • Enhanced model performance

  16. Issues with Defining Performance Guidelines for PM2.5 Models • What is “reasonable”, “acceptable” or “good” model performance? • Past experience: How well have current models done? • What statistical measures should be used to evaluate the models?

  17. Criteria to Select Statistical Measures I • Simple yet Meaningful • Easy to Interpret • Relevant to Air Quality Modeling Community • Properties of Statistics • Normalized vs. Absolute • Paired vs. Unpaired • Non-Fractional vs. Fractional • “Symmetry” • Underestimates and overestimates must be perceived equally • Underestimates and overestimates must be weighted equally • Scalable: biases scale appropriately in statistics

  18. Criteria to Select Statistical Measures II • Statistics that can attest to • Bias • Error • Ability to capture variability • Peak accuracy (to some extent) • Normalizes daily predictions paired with corresponding daily observations • Inherently minimizes effect of outliers • Some statistics/figures may be preferable for EVALUATION, whereas others may be preferred for DIAGNOSTICS

  19. Problems with Thresholds & Outliers • Issues with addressing low-end comparisons via threshold • Instrumental uncertainty: detection limit, signal-to-noise • Operational uncertainty • Additional considerations: network, background concentration, geography, demographics • Inspection for outliers • Outlier vs. valid high observation • Definition of outlier must be objective and unambiguous • Clear guidance necessary for performance analysis.
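
For concreteness, a minimal Python sketch of how such a low-end threshold could be applied to prediction:observation pairs before computing statistics; the helper name, the pair layout and the 0.5 cutoff are illustrative assumptions, not part of the presentation.

    # Hypothetical helper: drop P:O pairs whose observation falls below a cutoff
    # (e.g. an instrumental detection limit) before any statistics are computed.
    def apply_threshold(pairs, cutoff=0.5):
        """pairs: iterable of (prediction, observation); cutoff in the data's units."""
        return [(p, o) for p, o in pairs if o >= cutoff]

    # Bias/error statistics are then computed only on the retained pairs.
    kept = apply_threshold([(3.1, 2.8), (0.4, 0.1), (1.2, 0.9)])  # second pair dropped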

  20. Review of Statistics • Ratio of Means (Bias of Means) or Quantile-Quantile Comparisons • Defeats purpose of daily observations: completely unpaired • "Hides" any measure of true model performance • Normalized Mean Statistics (not to be confused with Mean Normalized) • Defeats purpose of daily observations: weighs all errors equally regardless of the magnitude of individual daily observations • Masks results in bias (e.g., numerator zero effect) • Based on Linear Regressions • Slope of Least Squares Regression; Root (Normalized) Mean Square Error • Slope of Least Median of Squares Regression (Rousseeuw regression) • Can be skewed; neglects magnitude of observations; good for cross-comparisons. • Fractional Statistics • Taints integrity of statistics by placing predictions in the denominator; not scalable
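
A minimal Python sketch of the statistics criticized above, using their standard textbook definitions (the presentation does not spell out its formulas, so the exact forms are an assumption):

    # Ratio of means: completely unpaired, so compensating over- and
    # under-predictions cancel and daily performance is hidden.
    def ratio_of_means(pred, obs):
        return sum(pred) / sum(obs)

    # Normalized mean bias: normalizes by the sum of observations, so every
    # residual carries the same weight regardless of the size of its paired
    # daily observation.
    def normalized_mean_bias(pred, obs):
        return sum(p - o for p, o in zip(pred, obs)) / sum(obs)

    # Fractional bias: places the prediction in the denominator as well;
    # bounded between -2 and +2, but no longer scales linearly with the bias.
    def fractional_bias(pred, obs):
        return 2.0 / len(obs) * sum((p - o) / (p + o) for p, o in zip(pred, obs))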

  21. Bias Statistics • Mean Normalized Bias/Arithmetic Bias Factor • Same statistic: ABF is the style for "symmetric" perception • ABF = 2:1 for 100% MNB, ABF = 1:2 for –50% MNB • MNB in % can be useful during diagnostics due to its simple and meaningful comparison to MNE, but the comparison is flawed. • These statistics give less weight to underpredictions than to overpredictions. • Logarithmic Bias Factor/Logarithmic-Mean Normalized Bias • Wholly symmetric representation of bias that satisfies all criteria • Can be written in "factor" form or in "percentage" form
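
A sketch contrasting the arithmetic and logarithmic bias statistics; the formulas below are the usual definitions (mean of (P - O)/O for MNB, geometric-mean ratio for the bias factor), assumed here because they reproduce the factor/percentage correspondences quoted on the slide:

    import math

    # Mean normalized bias: a 2:1 overprediction contributes +100% while a
    # 1:2 underprediction contributes only -50%, hence the asymmetry.
    def mean_normalized_bias(pred, obs):
        return sum((p - o) / o for p, o in zip(pred, obs)) / len(obs)

    # Logarithmic bias factor: geometric-mean ratio of prediction to observation;
    # 2:1 and 1:2 discrepancies offset exactly (+ln 2 and -ln 2).
    def log_bias_factor(pred, obs):
        return math.exp(sum(math.log(p / o) for p, o in zip(pred, obs)) / len(obs))

    # One 2:1 overprediction paired with one 1:2 underprediction:
    pred, obs = [2.0, 1.0], [1.0, 2.0]
    print(mean_normalized_bias(pred, obs))  # 0.25 -> a spurious +25% net bias
    print(log_bias_factor(pred, obs))       # 1.0  -> no net bias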

  22. Error Statistics • Mean Normalized Error • Each data point is normalized by its paired observation • Same flaw as the Arithmetic Mean Normalized Bias: the statistic gives less weight to underpredictions than to overpredictions. • Logarithmic Error Factor/Logarithmic-Mean Normalized Error • Satisfies all criteria • Comparisons between logarithm-based statistics (bias and error) are most meaningful when expressed in "factor" form
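
The corresponding error statistics, sketched under the same assumed definitions (mean of |P - O|/O for MNE, exp of the mean |ln(P/O)| for the logarithmic error factor):

    import math

    # Mean normalized error: each absolute residual is normalized by its paired
    # observation, so underpredictions can contribute at most 100% while
    # overpredictions are unbounded.
    def mean_normalized_error(pred, obs):
        return sum(abs(p - o) / o for p, o in zip(pred, obs)) / len(obs)

    # Logarithmic error factor: under- and over-predictions by the same
    # multiplicative factor carry identical weight.
    def log_error_factor(pred, obs):
        return math.exp(sum(abs(math.log(p / o)) for p, o in zip(pred, obs)) / len(obs))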

  23. Comparing Bias and Error Statistics Based on Arithmetic and Logarithmic Means

  24. Mean Normalized/Fractional Statistics

  25. Logarithmic/Arithmetic Statistics

  26. Bias Statistics Error Statistics Logarithmic/Arithmetic Statistics Note: MNB/ABF & MNE use 95% data interval. FB, FE, LMNB/LBF and LMNE/LEF use 100% of data.

  27. Relating Criteria for LBF/LMNB and LEF/LMNE • Criterion for Logarithmic EF/MNE can be Established from Criterion for Logarithmic BF/MNB • For example: Error twice the amplitude of Bias • Logarithmic Bias Factor/Logarithmic-Mean Normalized Bias • LBF: 1.25:1 to 1:1.25 = LMNB: 25% to -20% • Logarithmic Error Factor/Logarithmic-Mean Normalized Error • LEF: ≤ 1.56 = LMNE: ≤ 56%

  28. Relating Criteria for LBF/LMNB and LEF/LMNE • Criterion for Logarithmic EF/MNE can be Established from Criterion for Logarithmic BF/MNB • For example: Error twice the amplitude of Bias • Logarithmic Bias Factor/Logarithmic-Mean Normalized Bias • LBF: 1.50:1 to 1:1.50 = LMNB: 50% to -33% • Logarithmic Error Factor/Logarithmic-Mean Normalized Error • LEF: ≤ 2.25 = LMNE: ≤ 125%
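
Slides 27 and 28 are consistent with taking the error criterion as the square of the upper bias factor (twice the amplitude in log space); a short check of that reading:

    # Assumed relationship: ln(LEF) = 2 * ln(LBF), i.e. LEF = LBF ** 2.
    for lbf in (1.25, 1.50):
        lef = lbf ** 2
        print(f"LBF {lbf}:1 to 1:{lbf} -> LMNB +{lbf - 1:.0%} to {1 / lbf - 1:.0%}; "
              f"LEF <= {lef:.2f} -> LMNE <= {lef - 1:.0%}")
    # LBF 1.25 -> LMNB +25% to -20%; LEF <= 1.56 -> LMNE <= 56%
    # LBF 1.50 -> LMNB +50% to -33%; LEF <= 2.25 -> LMNE <= 125%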

  29. Variability Statistics • Coefficient of Determination: R2 • Should not be used in absence of previous statistics • Coefficient of Determination of Linear Regressions • Least Squares Regression through Origin: Ro2 • Used by some in global model community as a measure of performance and ability to capture variability • Least Median of Squares Regression • More robust, inherently minimizes effects of outliers • Comparison of Coefficients of Variation • Comparison of Standard Deviation/Mean of predictions and observations • Other statistical metrics?
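
A sketch of two of the variability measures listed above, using their standard definitions (again an assumption; the presentation gives no formulas):

    import math

    # Slope of a least-squares regression of predictions on observations forced
    # through the origin (the regression underlying the Ro^2 measure).
    def slope_through_origin(pred, obs):
        return sum(p * o for p, o in zip(pred, obs)) / sum(o * o for o in obs)

    # Coefficient of variation (standard deviation / mean); comparing the CV of
    # predictions with the CV of observations indicates whether the model
    # captures the observed variability.
    def coefficient_of_variation(x):
        mean = sum(x) / len(x)
        return math.sqrt(sum((v - mean) ** 2 for v in x) / len(x)) / mean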

  30. Summary Items for Discussion • What spatial scales to use for model performance? • Single Site; Local/Region of Interest • Large Domain/Continental • What statistics should be used? • What are the guidelines/benchmarks for performance evaluation? • Should the same guidelines be used for all components: • Sulfate, Nitrate, Carbonaceous, PM2.5 • Ammonium, Organic Mass, EC, “Fine Soil”, “Major Metal Oxides” • How are network considerations taken into account in guidelines? • Should models meet performance guidelines for an entire year and/or other time scales (monthly, seasonal)? • Should there be separate guidelines for different time scales? • Statistics based on daily P:O pairs • Average daily results to create weekly, monthly, seasonal or annual statistics
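
One reading of the last two bullets, as a hypothetical sketch (the month-keyed layout and the example numbers are purely illustrative): statistics are always built from daily P:O pairs, and longer-period figures come from averaging the daily or monthly results.

    # Hypothetical aggregation of daily P:O pairs into monthly and seasonal statistics.
    def mean_normalized_bias(pairs):
        return sum((p - o) / o for p, o in pairs) / len(pairs)

    daily = {"Jul": [(5.1, 4.3), (6.0, 6.2)], "Aug": [(4.4, 5.0), (3.9, 3.5)]}
    monthly = {month: mean_normalized_bias(pairs) for month, pairs in daily.items()}
    seasonal = sum(monthly.values()) / len(monthly)  # average of the monthly results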

  31. More Examples • More examples of Comparison of Statistics: • Fractional • Arithmetic-Mean Normalized • Logarithmic-Mean Normalized

  32. Mean Normalized/Fractional Statistics

  33. Logarithmic/Arithmetic Statistics

  34. Mean Normalized/Fractional Statistics

  35. Logarithmic/Arithmetic Statistics

  36. Mean Normalized/Fractional Statistics

  37. Logarithmic/Arithmetic Statistics
