Wilcoxon Rank Sum Test: Misconceptions and Application

The Wilcoxon rank sum test: What does it really test? George W. Divine, Henry Ford Hospital, Detroit, MI (gdivine1@hfhs.org); Anna Baron, Colorado School of Public Health; Elizabeth Juarez-Colunga, Colorado School of Public Health; H. James Norton, Carolinas Medical Center, Charlotte, NC

“It's not what you don't know that hurts you, it's what you know that ain't so!”—Will Rogers

Have you ever seen any of the following in a textbook or in the documentation for a statistical package? • The null hypothesis for the Wilcoxon rank sum test is that the medians of the two populations are equal. • The null hypothesis for the Wilcoxon rank sum test is that random samples have been drawn from populations having identical distributions. • The null hypothesis for the Wilcoxon signed rank sum test is that the pair differences come from a population having median zero.

Some textbook quotes Re: WMW • “The null hypothesis is that the two populations have the same median.” In: “Statistics: Concepts and Applications for Science”, by David LeBlanc, Jones and Bartlett Publishers, Sudbury, MA, 2004. • “The two samples come from populations with equal medians.” In: “Biostatistics for the Biological and Health Sciences”, by Marc M. Triola, M.D. & Mario F. Triola, Pearson Education, Addison Wesley, N.Y., 2006.

Other quotes Re: WMW • From the Minitab manual, Stat > Nonparametrics > Mann-Whitney: “You can perform a 2-sample rank test (also called the Mann-Whitney test, or the two-sample Wilcoxon rank sum test) of the equality of two population medians, and calculate the corresponding point estimate and confidence interval. The hypotheses are H0: η1 = η2 versus H1: η1 ≠ η2, where η is the population median.”

Ralph O’Brien Quote: • “Even worthy statistics books (and knowledgeable statisticians!) state that the WMW test compares the two medians, but this is only true in the rarest of cases in which the population distributions of the two groups are merely shifted versions of each other (i.e., differing only in location, and not shape or scale).”

Which statistical test has as its null hypothesis: 1. The medians of the two (or more) populations are equal? The median test for independent samples. 2. The pair differences come from a population having median zero? The sign test.

Aromatherapy for PONV (Postoperative Nausea & Vomiting) • Primary outcome measure: a four-level verbal descriptive scale (VDS) for nausea • The PONV scale has as possible values: 0: none, 1: some, 2: a lot, 3: severe

Patients Randomized to Four Treatment Arms • Normal Saline (placebo) • Isopropyl Alcohol • Essential Oil of Ginger • A Blend of Essential Oils of Ginger, Spearmint, Peppermint and Cardamom

WSR test: p < 0.0001 • Ginger vs. Alcohol, WRS test: p = 0.017

What does the WMW do? • The Wilcoxon-Mann-Whitney test is a “non-parametric” test. • It does not depend upon any particular distributional form (or parameters) in order to generate the test statistic and p-value • However, it does make assumptions about the distributions being compared • (In fact, the whole distributions are being compared, rather than their parameters)

What a WMW Test Really Does • The WMW procedure tests the null hypothesis that Prob(X<Y) = 0.5, where X and Y are random observations from the two populations being compared • Conveniently, the Mann-Whitney U statistic divided by mn (the product of the two sample sizes) gives an estimate of Prob(X<Y)

The WMW Test Calculation(s) • m and n: sample sizes for groups 1 and 2 • Compute the Mann-Whitney U statistic • Compute p'' = U/mn, the estimate of Prob(X<Y)
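The calculation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `mann_whitney_u` and `prob_x_less_y` and the toy data are hypothetical, and tied pairs are assumed to count one half, the usual convention.

```python
def mann_whitney_u(xs, ys):
    """Count the (x, y) pairs with x < y; a tied pair contributes one half."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x < y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


def prob_x_less_y(xs, ys):
    """Estimate Prob(X<Y) as U divided by mn."""
    return mann_whitney_u(xs, ys) / (len(xs) * len(ys))


# Hypothetical toy data, not taken from the PONV study.
group1 = [1, 3, 5]
group2 = [2, 4, 6]

u = mann_whitney_u(group1, group2)
p_hat = prob_x_less_y(group1, group2)
print(f"U = {u}, U/mn = {p_hat:.3f}")
```

Note that neither function ever looks at a median: the statistic depends only on how the pooled observations are ordered across the two groups.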

Table 1A: Pr(X < Y) and Associated Odds of a Greater Reduction in PONV Score • Comparing Ginger vs. Alcohol: p'' = 0.60, Odds = 0.6/0.4 = 1.5

The WMW as a Median test? • Problem 1: The WMW calculation/test statistic is not a function of the sample medians nor their difference • Problem 2: The assumption of identical distribution shapes very often (usually?) fails to hold. (For example, for bounded outcomes such as Likert scales.) • Problem 3: If G(x) = F(x+Δ), Δ is equal to the difference in the medians, but it is also equal to the difference in the means, or in the 40th percentiles, or in the 5th percentiles, or in the 43rd percentiles, etc., etc.
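Problem 3 can be checked numerically. The sketch below assumes a pure shift model, in which the second sample is the first with a constant Δ subtracted (so that G(x) = F(x+Δ)); the sample values and the value of `delta` are hypothetical, chosen only for illustration.

```python
import statistics

# Hypothetical skewed sample and shift, for illustration only.
x = [0.3, 1.1, 1.9, 2.4, 3.8, 5.0, 7.2, 9.5, 12.1]
delta = 2.5
y = [v - delta for v in x]  # Y distributed as X - delta, i.e. G(x) = F(x + delta)


def percentile_cuts(data):
    """Return the 5th, 10th, ..., 95th percentiles (19 cut points)."""
    return statistics.quantiles(data, n=20, method="inclusive")


mean_shift = statistics.mean(x) - statistics.mean(y)
median_shift = statistics.median(x) - statistics.median(y)
quantile_shifts = [qx - qy for qx, qy in zip(percentile_cuts(x), percentile_cuts(y))]

print(mean_shift, median_shift)
print(quantile_shifts)
```

Under a pure shift every one of these differences equals Δ (up to floating-point rounding), so nothing singles out the median as the quantity being tested.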

Counterexample 1: Equal Medians, but a significant difference • We can have equal medians, … but a significant WMW test. Consider • Population 1 {1-7, 14.5, 16-22} • Population 2 {8-14, 14.5, 23-30} • Observations from Popn 1 clearly tend to be lower than those from Popn 2, i.e. U = 7x7 + 7.5 + 7x15 = 161.5 and U/mn = 161.5/225 = 0.72 = Pr(X<Y), so the WMW test is significant (z=2.01, p=0.042) • Both populations have 15 observations, and the 8th (median) observation is 14.5 for both
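The arithmetic in Counterexample 1 can be reproduced with a short sketch. Two assumptions are made here and flagged in the comments: the upper run of Population 2 is taken as 24-30 (seven values), because the range as printed has eight values while the slide states 15 observations and mn = 225; and the z statistic uses the normal approximation with a tie-corrected variance and a continuity correction, which may not be exactly the procedure behind the reported p-value.

```python
import math
from collections import Counter
from statistics import NormalDist, median

pop1 = list(range(1, 8)) + [14.5] + list(range(16, 23))
# ASSUMPTION: seven values (24-30), to match the stated 15 observations.
pop2 = list(range(8, 15)) + [14.5] + list(range(24, 31))

m, n = len(pop1), len(pop2)

# Mann-Whitney U: pairs with x < y count 1, tied pairs count 0.5.
u = sum(1.0 if x < y else 0.5 if x == y else 0.0 for x in pop1 for y in pop2)
p_hat = u / (m * n)

# Normal approximation with tie-corrected variance (ASSUMED procedure).
total = m + n
tie_sizes = Counter(pop1 + pop2).values()
tie_term = sum(t**3 - t for t in tie_sizes) / (total * (total - 1))
variance = m * n / 12 * ((total + 1) - tie_term)

# Continuity correction of 0.5 (ASSUMED).
z = (abs(u - m * n / 2) - 0.5) / math.sqrt(variance)
p_value = 2 * (1 - NormalDist().cdf(z))

print(f"medians: {median(pop1)} vs {median(pop2)}")
print(f"U = {u}, U/mn = {p_hat:.2f}, z = {z:.2f}, p = {p_value:.3f}")
```

Under these assumptions the sketch gives U = 161.5 and z of about 2.01, as on the slide. The p-value comes out below 0.05 but can differ from the reported 0.042 in the third decimal, depending on which corrections are applied.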

Counterexample 2: Unequal Medians & no significant difference • We can have very unequal medians, … but a non-significant WMW test … consider • Population 1 {1-7, 14.5, 123-130} • Population 2 {8-14, 114.5, 116-122} • The medians are quite different: 14.5 vs 114.5 • But otherwise, overall the observations from Popn 1 are no higher or lower than those from Popn 2: U = 7x7 + 8 + 7x8 = 113 and U/mn = 113/225 = 0.502 = Pr(X<Y) • Hence, median 1 << median 2, despite a non-significant WMW test result (z=0.0, p=1.0)

Counterexample 3: Unequal Medians, a significant test, & they disagree! • We can have very unequal medians, … but a WMW test in the other direction … consider • Population 1 {1-7, 114.5, 116-122} • Population 2 {8-14, 14.5, 123-130} • The Popn 1 median is much higher than that for Popn 2, 114.5 >> 14.5 • However, overall the observations from Popn 1 are much lower than those from Popn 2, i.e. U = 7x7 + 7 + 7x15 = 161, U/mn = 161/225 = 0.72 = Pr(X<Y), and z=1.99, p=0.044
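Counterexamples 2 and 3 can be checked side by side, comparing each pair of medians with the direction indicated by U/mn. As a labeled assumption, the runs printed as 123-130 are entered as 124-130 (seven values), since the slides use 15 observations per group and mn = 225. The helper names `seq` and `u_stat` are hypothetical.

```python
from statistics import median


def seq(lo, hi):
    """Integers lo..hi inclusive."""
    return list(range(lo, hi + 1))


def u_stat(xs, ys):
    """Mann-Whitney U: pairs with x < y count 1, tied pairs count 0.5."""
    return sum(1.0 if x < y else 0.5 if x == y else 0.0 for x in xs for y in ys)


# ASSUMPTION: upper runs taken as 124-130 so each group has 15 observations.
ce2_pop1 = seq(1, 7) + [14.5] + seq(124, 130)
ce2_pop2 = seq(8, 14) + [114.5] + seq(116, 122)

ce3_pop1 = seq(1, 7) + [114.5] + seq(116, 122)
ce3_pop2 = seq(8, 14) + [14.5] + seq(124, 130)

for label, xs, ys in (("Counterexample 2", ce2_pop1, ce2_pop2),
                      ("Counterexample 3", ce3_pop1, ce3_pop2)):
    u = u_stat(xs, ys)
    p_hat = u / (len(xs) * len(ys))
    print(f"{label}: medians {median(xs)} vs {median(ys)}, "
          f"U = {u}, U/mn = {p_hat:.3f}")
```

In the first case the medians differ by 100 while U/mn sits almost exactly at 0.5. In the second, Popn 1 has the larger median yet U/mn is well above 0.5, meaning its observations tend to be the smaller ones.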

Counterexample 4: • Prof. Chase M. Itail and his assistant (M.C.E.) observed the data below (observations, then median) • Set A: {1, 7, 8, 9}, median 7.5 • Set B: {2, 3, 10, 11}, median 6.5 • Set C: {4, 5, 6, 12}, median 5.5 • For A vs B, U = 2 + 2x4 = 10, p'' = 10/16 = 0.625 • For B vs C, U = 2x3 + 4 = 10, p'' = 10/16 = 0.625 • For C vs A, U = 0 + 3x3 = 9, p'' = 9/16 = 0.5625

Counterexample 4 (continued) • For medians: A > B > C • But with 41 observations tied at each value shown (i.e. m = n = 164), the WMW test p-values associated with p''(A vs B), p''(B vs C), and p''(C vs A) are <0.001, <0.001 and 0.0485, respectively • Therefore, according to the WMW test, A < B < C < A
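The non-transitive cycle can be reproduced directly. The sketch below replicates each listed value 41 times and applies the normal approximation with a tie-corrected variance and no continuity correction. That choice is an assumption, as are the helper names `u_stat` and `wmw_two_sided_p`; it is the variant that yields approximately 0.0485 for C vs A.

```python
import math
from collections import Counter
from statistics import NormalDist, median


def u_stat(xs, ys):
    """Mann-Whitney U: pairs with x < y count 1, tied pairs count 0.5."""
    return sum(1.0 if x < y else 0.5 if x == y else 0.0 for x in xs for y in ys)


def wmw_two_sided_p(xs, ys):
    """Two-sided p-value, normal approximation with tie correction (ASSUMED)."""
    m, n = len(xs), len(ys)
    total = m + n
    tie_term = sum(t**3 - t for t in Counter(xs + ys).values()) / (total * (total - 1))
    variance = m * n / 12 * ((total + 1) - tie_term)
    z = (u_stat(xs, ys) - m * n / 2) / math.sqrt(variance)
    return 2 * (1 - NormalDist().cdf(abs(z)))


reps = 41  # 41 tied observations at each value, so m = n = 164
set_a = [1, 7, 8, 9] * reps
set_b = [2, 3, 10, 11] * reps
set_c = [4, 5, 6, 12] * reps

pairs = {"A vs B": (set_a, set_b), "B vs C": (set_b, set_c), "C vs A": (set_c, set_a)}
results = {}
for label, (xs, ys) in pairs.items():
    p_hat = u_stat(xs, ys) / (len(xs) * len(ys))
    results[label] = (p_hat, wmw_two_sided_p(xs, ys))
    print(f"{label}: p'' = {p_hat:.4f}, p-value = {results[label][1]:.4f}")

print("medians:", median(set_a), median(set_b), median(set_c))
```

Each comparison puts p'' above 0.5, so each first-named set tends to be the smaller of its pair, all the way around the cycle, while the medians order the sets A > B > C.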

A Visual “Proof” from M.C.E.?

What is going on? • The WMW procedure tests the location of one distribution relative to another • But this relative location relationship is only defined when the two distributions are specified and ranked together • (A very rough analogy might be age standardization, where results can vary greatly depending upon the standard population selected)

Why is the WMW considered a test of Medians? • It “is” if the assumptions hold. (At least it is of the population medians). • If the assumptions are mostly true, then this assertion is not that far off? • Authoritative sources say that the WMW tests medians.

Why is the WMW considered a test of Medians? • Flawed syllogism? #1: For normally distributed data, means and t-tests should be reported. For skewed data, medians and WMW tests should be reported. Therefore, the WMW is connected to medians, so it must be good to test medians!

Why is the WMW considered a test of Medians? • Flawed syllogism? #2: The median is computed by ordering observations. The WMW test is a rank statistic. Therefore, the WMW tests medians!

Why is the WMW considered a test of Medians? • A final conjecture: We want (and need!) a test of central tendency when data are not normally distributed. If the WMW isn’t this, we’re left with no satisfactory alternative?

Summary • The WMW only tests medians conceptually, not computationally • It even fails to test medians conceptually, unless the true target populations differ only in location • The WMW really tests Pr(X<Y)=0.5 (both conceptually and computationally)
