## Beyond MARLAP: New Statistical Tests For Method Validation


NAREL – ORIA – US EPA
Laboratory Incident Response Workshop at the 53rd Annual RRMC

**Outline**

- The method validation problem
- MARLAP’s test
  - And its peculiar features
- New approach – testing mean squared error (MSE)
- Two possible tests of MSE
  - Chi-squared test
  - Likelihood ratio test
- Power comparisons
- Recommendations and implications for MARLAP

**The Problem**

- We’ve prepared spiked samples at one or more activity levels
- A lab has performed one or more analyses of the samples at each level
- Our task: Evaluate the results to see whether the lab and method can achieve the required uncertainty ($u_{\mathrm{Req}}$) at each level

**MARLAP’s Test**

- In 2003 the MARLAP work group developed a simple test for MARLAP Chapter 6
- Chose a very simple criterion
- Original criterion was whether every result was within $\pm 3u_{\mathrm{Req}}$ of the target
- Modified slightly to keep false rejection rate ≤ 5 % in all cases

**Equations**

- Acceptance range is $TV \pm k\,u_{\mathrm{Req}}$, where
  - $TV$ = target value (true value)
  - $u_{\mathrm{Req}}$ = required uncertainty at $TV$, and
  - $k = z_{(1 + (1-\alpha)^{1/n})/2}$, a quantile of the standard normal distribution
- E.g., for $n = 21$ measurements (7 reps at each of 3 levels), with $\alpha = 0.05$, we get $k = z_{0.99878} = 3.03$
- For smaller $n$ we get slightly smaller $k$

**Required Uncertainty**

- The required uncertainty, $u_{\mathrm{Req}}$, is a function of the target value:

$$u_{\mathrm{Req}} = \begin{cases} u_{\mathrm{MR}}, & TV \le \mathrm{UBGR} \\ \varphi_{\mathrm{MR}} \cdot TV, & TV > \mathrm{UBGR} \end{cases}$$

- Where $u_{\mathrm{MR}}$ is the required method uncertainty at the upper bound of the gray region (UBGR)
- $\varphi_{\mathrm{MR}} = u_{\mathrm{MR}} / \mathrm{UBGR}$ is the corresponding relative method uncertainty

**Alternatives**

- We considered a chi-squared ($\chi^2$) test as an alternative in 2003
  - Accounted for uncertainty of target values using “effective degrees of freedom”
- Rejected at the time because of complexity and lack of evidence for performance
- Kept the simple test that now appears in MARLAP Chapter 6

But we didn’t forget about the $\chi^2$ test

**Peculiarity of MARLAP’s Test**

- Power to reject a biased but precise method decreases with number of analyses performed ($n$)
- Because we adjusted the acceptance limits to keep false rejection rates low
  - Acceptance range gets wider as $n$ gets larger

**Biased but Precise**

[Image: borrowed and edited for the RRMC workshop presentation. The original is at despair.com: http://www.despair.com/consistency.html]

**Best Use of Data?**

- It isn’t just about bias
- MARLAP’s test uses data inefficiently – even to evaluate precision alone (its original purpose)
- The statistic – in effect – is just the worst normalized deviation from the target value
- Wastes a lot of useful information

**Example: The MARLAP Test**

- Suppose we perform a level D method validation experiment
  - UBGR = AL = 100 pCi/L
  - $u_{\mathrm{MR}}$ = 10 pCi/L
  - $\varphi_{\mathrm{MR}}$ = 10/100 = 0.10, or 10 %
- Three activity levels ($L = 3$)
  - 50 pCi/L, 100 pCi/L, and 300 pCi/L
- Seven replicates per level ($N = 7$)
- Allow 5 % false rejections ($\alpha = 0.05$)

**Example (continued)**

- For 21 measurements, calculate $k = z_{(1 + 0.95^{1/21})/2} = z_{0.99878} = 3.03$
- When evaluating measurement results for target value $TV$, require for each result $X_j$: $|X_j - TV| \le k\,u_{\mathrm{Req}}$
- Equivalently, require $|Z_j| \le k$, where $Z_j = (X_j - TV)/u_{\mathrm{Req}}$

**Example (continued)**

- We’ll work through calculations at just one target value
- Say $TV$ = 300 pCi/L
- This value is greater than UBGR (100 pCi/L)
- So, the required uncertainty is 10 % of 300 pCi/L
  - $u_{\mathrm{Req}}$ = 30 pCi/L

**Example (continued)**

- Suppose the lab produces 7 results $X_j$ shown at the right
- For each result, calculate the “Z score” $Z_j = (X_j - TV)/u_{\mathrm{Req}}$
- We require $|Z_j| \le 3.0$ for each $j$

**Example (continued)**

- Every $|Z_j|$ is smaller than 3.0
- The method is obviously biased (~15 % low)
- But it passes the MARLAP test

**2007**

- In early 2007 we were developing the new method validation guide
- Applying MARLAP guidance, including the simple test of Chapter 6
- Someone suggested presenting power curves in the context of bias
- Time had come to reconsider MARLAP’s simple test

**Bias and Imprecision**

- Which is worse: bias or imprecision?
- Either leads to inaccuracy
- Both are tolerable if not too large
- When we talk about uncertainty (à la GUM), we don’t distinguish between the two

**Mean Squared Error**

- When characterizing a method, we often consider bias and imprecision separately
- Uncertainty estimates combine them
- There is a concept in statistics that also combines them: mean squared error

**Definition of MSE**

- If $X$ is an estimator for a parameter $\theta$, the mean squared error of $X$ is
  - $\mathrm{MSE}(X) = E\big((X - \theta)^2\big)$ by definition
- It also equals
  - $\mathrm{MSE}(X) = V(X) + \mathrm{Bias}(X)^2 = \sigma^2 + \delta^2$
- If $X$ is unbiased, $\mathrm{MSE}(X) = V(X) = \sigma^2$
- We tend to think in terms of the root MSE, which is the square root of MSE

**New Approach**

- For the method validation guide we chose a new conceptual approach:
  - A method is adequate if its root MSE at each activity level does not exceed the required uncertainty at that level
- We don’t care whether the MSE is dominated by bias or imprecision

**Root MSE v. Standard Uncertainty**

- Are root MSE and standard uncertainty really the same thing?
- Not exactly, but one can interpret the GUM’s treatment of uncertainty in such a way that the two are closely related
- We think our approach – testing uncertainty by testing MSE – is reasonable

**Chi-squared Test Revisited**

- For the new method validation document we simplified the $\chi^2$ test proposed (and rejected) in 2003
  - Ignore uncertainties of target values, which should be small
  - Just use a straightforward $\chi^2$ test
- Presented as an alternative in App. E
- But the document still uses MARLAP’s simple test

**The Two Hypotheses**

- We’re now explicitly testing the MSE
  - Null hypothesis ($H_0$): $\sqrt{\mathrm{MSE}} \le u_{\mathrm{Req}}$
  - Alternative hypothesis ($H_1$): $\sqrt{\mathrm{MSE}} > u_{\mathrm{Req}}$
- In MARLAP the 2 hypotheses were not clearly stated
  - Assumed any bias ($\delta$) would be small
  - We were mainly testing variance ($\sigma^2$)

**A χ² Test for Variance**

- Imagine we really tested variance only
  - $H_0$: $\sigma \le u_{\mathrm{Req}}$
  - $H_1$: $\sigma > u_{\mathrm{Req}}$
- We could calculate a $\chi^2$ statistic

$$\chi^2 = \frac{1}{u_{\mathrm{Req}}^2} \sum_{j=1}^{N} (X_j - \bar{X})^2$$

- Chi-squared with $N - 1$ degrees of freedom
- Presumes there may be bias but doesn’t test for it

**MLE for Variance**

- The maximum-likelihood estimator (MLE) for $\sigma^2$ when the mean is unknown is:

$$\hat{\sigma}^2 = \frac{1}{N} \sum_{j=1}^{N} (X_j - \bar{X})^2$$

- Notice similarity to $\chi^2$ from preceding slide

**Another χ² Test for Variance**

- We could calculate a different $\chi^2$ statistic

$$\chi^2 = \frac{1}{u_{\mathrm{Req}}^2} \sum_{j=1}^{N} (X_j - TV)^2$$

- $N$ degrees of freedom
- Can be used to test variance if there is no bias
- Any bias increases the rejection rate

**MLE for MSE**

- The MLE for the MSE is:

$$\widehat{\mathrm{MSE}} = \frac{1}{N} \sum_{j=1}^{N} (X_j - TV)^2$$

- Notice similarity to $\chi^2$ from preceding slide
- In the context of biased measurements, $\chi^2$ seems to assess MSE rather than variance

**Our Proposed χ² Test for MSE**

- For a given activity level ($TV$), calculate a $\chi^2$ statistic $W$:

$$W = \frac{1}{u_{\mathrm{Req}}^2} \sum_{j=1}^{N} (X_j - TV)^2$$

- Calculate the critical value of $W$ as follows:

$$w_C = \chi^2_{1-\alpha}(N)$$

- $N$ = number of replicate measurements
- $\alpha$ = max false rejection rate at this level

**Multiple Activity Levels**

- When testing at more than one activity level, calculate the critical value as follows:

$$w_C = \chi^2_{(1-\alpha)^{1/L}}(N)$$

- Where $L$ is the number of levels and $N$ is the number of measurements at each level
- Now $\alpha$ is the maximum overall false rejection rate

**Evaluation Criteria**

- To perform the test, calculate $W_i$ at each activity level $TV_i$
- Compare each $W_i$ to $w_C$
- If $W_i > w_C$ for any $i$, reject the method
- The method must pass the test at each spike activity level
  - Don’t allow bad performance at one level just because of good performance at another

**Lesson Learned**

- Don’t test at too many levels
- Otherwise you must choose:
  - High false acceptance rate at each level,
  - High overall false rejection rate, or
  - Complicated evaluation criteria
- Prefer to keep error rates low
- Need a low level and a high level
- But probably not more than three levels ($L = 3$)

**Better Use of Same Data**

- The $\chi^2$ test makes better use of the measurement data than the MARLAP test
- The statistic $W$ is calculated from all the data at a given level – not just the most extreme value

**Caveat**

- The distribution of $W$ is not completely determined by the MSE
  - Depends on how MSE is partitioned into variance and bias components
- Our test looks like a test of variance
  - As if we know $\delta = 0$ and we’re testing $\sigma^2$ only
- But we’re actually using it to test MSE

**False Rejections**

- If $w_C < N$, the maximum false rejection rate (100 %) occurs when $\delta = \pm u_{\mathrm{Req}}$ and $\sigma = 0$
  - But you’ll never have this situation in practice
- If $w_C \ge N + 2$, the maximum false rejection rate occurs when $\sigma = u_{\mathrm{Req}}$ and $\delta = 0$
  - This is the usual situation
  - Why we can assume the null distribution is $\chi^2$
- Otherwise the maximum false rejection rate occurs when both $\delta$ and $\sigma$ are nonzero
  - This situation is unlikely in practice

**To Avoid High Rejection Rates**

- We must have $w_C \ge N + 2$
- This will always be true if $\alpha < 0.08$, even if $L = N = 1$
- Ensures the maximum false rejection rate occurs when $\delta = 0$ and the MSE is just $\sigma^2$
- Not stated explicitly in App. E, because:
  - We didn’t have a proof at the time
  - Not an issue if you follow the procedure
- Now we have a proof

**Example: Critical Value**

- Suppose $L = 3$ and $N = 7$
- Let $\alpha = 0.05$
- Then the critical value for $W$ is $w_C = \chi^2_{0.95^{1/3}}(7) = 17.1$
- Since $w_C \ge N + 2 = 9$, we won’t have unexpectedly high false rejection rates
- Since $\alpha < 0.08$, we didn’t really have to check

**Some Facts about the Power**

- The power always increases with $|\delta|$
- The power increases with $\sigma$ if or if
- For a given bias $\delta$ with , there is a positive value of $\sigma$ that minimizes the power
- If , even this minimum power exceeds 50 %
- Power increases with $N$

**Power Comparisons**

- We compared the tests for power
  - Power to reject a biased method
  - Power to reject an imprecise method
- The $\chi^2$ test outperforms the simple MARLAP test on both counts
- Results of comparisons at end of this presentation

**False Rejection Rates**

[Figure: regions labeled $H_0$ and $H_1$, with the labels “Rejection rate = α”, “Rejection rate < α”, and “Rejection rate = 0”]

**Region of Low Power**

[Figure: regions labeled $H_0$ and $H_1$, with the label “Rejection rate = α”]

**Region of Low Power (MARLAP)**

[Figure: regions labeled $H_0$ and $H_1$, with the label “Rejection rate = α”]

**Example: Applying the χ² Test**

- Return to the scenario used earlier for the MARLAP example
  - Three levels ($L = 3$)
  - Seven measurements per level ($N = 7$)
  - 5 % overall false rejection rate ($\alpha = 0.05$)
- Consider results at just one level, $TV$ = 300 pCi/L, where $u_{\mathrm{Req}}$ = 30 pCi/L

**Example (continued)**

- Reuse the data from our earlier example
- Calculate the $\chi^2$ statistic: $W = \frac{1}{u_{\mathrm{Req}}^2} \sum_{j=1}^{7} (X_j - TV)^2 = 17.4$
- Since $W > w_C$ (17.4 > 17.1), the method is rejected
- We’re using all the data now – not just the worst result

**Likelihood Ratio Test for MSE**

- We also discovered a statistical test published in 1999, which directly addressed MSE for analytical methods
- By Danish authors Erik Holst and Poul Thyregod
- It’s a “likelihood ratio” test, which is a common, well-accepted approach to hypothesis testing

**Likelihood Ratio Tests**

- To test a hypothesis about a parameter $\theta$, such as the MSE
- First find a likelihood function $L(\theta)$, which tells how “likely” a value of $\theta$ is, given the observed experimental data
- Based on the probability mass function or probability density function for the data

**Test Statistic**

- Maximize $L(\theta)$ on all possible values of $\theta$ and again on all values of $\theta$ that satisfy the null hypothesis $H_0$
- Can use the ratio of these two maxima, $\Lambda$, as a test statistic
- The authors actually use $\lambda = -2 \ln(\Lambda)$ as the statistic for testing MSE

**Critical Values**

- It isn’t simple to derive equations for $\lambda$, or to calculate percentiles of its distribution, but Holst and Thyregod did both
- They used numerical integration to approximate percentiles of $\lambda$, which serve as critical values

**Equations**

- For the two-sided test statistic, $\lambda$:
- Where is the unique real root of the cubic polynomial
- See Holst & Thyregod for details

**One-Sided Test**

- We actually need the one-sided test statistic:
- This is equivalent to:

**Issues**

- The distribution of either $\lambda$ or $\lambda^*$ is not completely determined by the MSE
- Under $H_0$ with , the percentiles $\lambda_{1-\alpha}$ and $\lambda^*_{1-\alpha}$ are maximized when $\sigma \to 0$ and $|\delta| \to u_{\mathrm{Req}}$
- To ensure the false rejection rate never exceeds $\alpha$, use the maximum value of the percentile as the critical value
- Apparently we improved on the authors’ method of calculating this maximum
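
To make the comparison between MARLAP’s simple test and the $\chi^2$ test of MSE concrete, here is a minimal Python sketch of both. It is an illustration under stated assumptions, not the procedure from the validation guide. The formulas for $k$ and $w_C$ are reconstructions chosen to reproduce the values quoted in the slides ($k = 3.03$ for $n = 21$, $w_C = 17.1$ for $L = 3$, $N = 7$, $\alpha = 0.05$). All function names are hypothetical. The seven results in the test data are invented for illustration, because the presentation’s actual data table is not preserved.

```python
import math
from statistics import NormalDist


def required_uncertainty(tv, ubgr, u_mr):
    """Assumed rule: u_MR at or below the UBGR, relative uncertainty above it."""
    return u_mr if tv <= ubgr else u_mr * tv / ubgr


def marlap_k(n_total, alpha):
    """Assumed multiplier k for the acceptance range TV +/- k*u_Req."""
    p = 0.5 + 0.5 * (1.0 - alpha) ** (1.0 / n_total)
    return NormalDist().inv_cdf(p)


def chi2_cdf(w, df):
    """Chi-squared CDF via the series for the regularized lower incomplete gamma."""
    if w <= 0:
        return 0.0
    a, x = df / 2.0, w / 2.0
    term = total = 1.0 / a
    n = 1
    while term > 1e-16 * total:
        term *= x / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))


def chi2_quantile(p, df):
    """Invert chi2_cdf by bisection (assumes 0 < p < 1)."""
    lo, hi = 0.0, float(df)
    while chi2_cdf(hi, df) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def critical_value(n_reps, n_levels, alpha):
    """Assumed critical value w_C: chi-squared quantile at (1 - alpha)^(1/L), N d.f."""
    return chi2_quantile((1.0 - alpha) ** (1.0 / n_levels), n_reps)


def w_statistic(results, tv, u_req):
    """Sum of squared normalized deviations from the target value."""
    return sum(((x - tv) / u_req) ** 2 for x in results)


def passes_marlap(results, tv, u_req, k):
    """Simple test: every result must lie within TV +/- k*u_Req."""
    return all(abs(x - tv) <= k * u_req for x in results)


def passes_chi2(results, tv, u_req, w_c):
    """Chi-squared test of MSE: reject when W exceeds w_C."""
    return w_statistic(results, tv, u_req) <= w_c


if __name__ == "__main__":
    L, N, alpha, tv = 3, 7, 0.05, 300.0
    u_req = required_uncertainty(tv, ubgr=100.0, u_mr=10.0)
    k = marlap_k(L * N, alpha)
    w_c = critical_value(N, L, alpha)
    # HYPOTHETICAL results (pCi/L), invented for illustration only.
    results = [255.0, 250.0, 248.0, 260.0, 240.0, 258.0, 252.0]
    print(f"u_Req={u_req:.1f}  k={k:.2f}  w_C={w_c:.1f}")
    print("MARLAP test passes:", passes_marlap(results, tv, u_req, k))
    print("chi-squared test passes:", passes_chi2(results, tv, u_req, w_c))
```

With these made-up results, which are biased low but tightly grouped, the simple test accepts the method while the $\chi^2$ test rejects it. This is the same qualitative outcome the slides report for their own data. The chi-squared quantile is computed with a small series-plus-bisection routine only to keep the sketch free of third-party dependencies; a statistics library would normally be used instead.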