A Comparison of Statistical Significance Tests for Information Retrieval Evaluation

CIKM ’07, November 2007

Summary
• Motivation
• Significance Testing
• General Approach
• Significance Tests: Randomization test, Wilcoxon test, Sign test, Bootstrap test, Student’s t-test
• Results
• Discussion
• Conclusions

Motivation
• Goal => Promote retrieval methods that are truly better, rather than methods that perform better by chance given the set of topics, judgments, and documents used in the evaluation.
• Given two information retrieval (IR) systems, how can we determine which one is better?
• Common approaches such as TREC use the difference in Mean Average Precision (MAP). The problem: an observed difference may be due to chance. The solution: use significance tests!
• Which significance test should IR researchers use? Student’s paired t-test? The Wilcoxon signed-rank test? The sign test? The bootstrap? Fisher’s randomization test?

Significance Testing
• A significance test has three ingredients:
• 1. A test statistic or criterion by which to judge the two systems. IR researchers commonly use the difference in mean average precision (MAP) or the difference in the mean of another IR metric.
• 2. A distribution of the test statistic given a null hypothesis. A typical null hypothesis is that there is no difference between the two systems.
• 3. A significance level (p-value), computed by taking the value of the test statistic for the experimental systems and determining how likely a value that large or larger is to occur under the null hypothesis.
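The three ingredients can be written compactly. This is a sketch under stated assumptions: the test is two-sided, the statistic is the difference in MAP, and the symbols n (number of topics), AP_A(i) and AP_B(i) (average precision of each system on topic i), T (the statistic under the null hypothesis H_0), and t_obs (its observed value) are notation introduced here, not taken from the slides.

```latex
t_{\text{obs}} = \mathrm{MAP}_A - \mathrm{MAP}_B
             = \frac{1}{n}\sum_{i=1}^{n}\bigl(\mathrm{AP}_A(i) - \mathrm{AP}_B(i)\bigr),
\qquad
p = P\bigl(|T| \ge |t_{\text{obs}}| \,\bigm|\, H_0\bigr)
```

The tests on the following slides differ mainly in how they obtain the distribution of T under H_0.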

General Approach

Randomization test p-value = 0.0138
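A minimal sketch of a paired randomization test, assuming per-topic scores for two systems and the difference in means as the test statistic. Under the null hypothesis the system labels are exchangeable on each topic, so the sketch randomly flips the sign of each per-topic difference. The function name, the Monte Carlo sample count, and the seed are illustrative choices, not the authors' code, and this sketch does not reproduce the p-value shown on the slide.

```python
import random


def randomization_test(scores_a, scores_b, num_samples=10000, seed=0):
    """Two-sided paired randomization test on the difference in means.

    scores_a and scores_b hold one score per topic (e.g. average precision)
    for the two systems, in the same topic order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems need one score per topic")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)

    extreme = 0
    for _ in range(num_samples):
        # Swapping the two systems' labels on a topic flips the sign
        # of that topic's difference.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if abs(permuted) >= observed:
            extreme += 1
    # Fraction of sampled relabelings at least as extreme as the observation.
    return extreme / num_samples
```

Sampling relabelings, rather than enumerating all 2^n of them, keeps the cost manageable for typical topic-set sizes; more samples give a more stable estimate.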

Wilcoxon Test p-value = 0.0560
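A rough sketch of the Wilcoxon signed-rank test, to show what it actually measures: the ranks of the absolute per-topic differences, not their magnitudes. The assumptions here are that zero differences are dropped, tied absolute differences share their average rank, and the p-value uses the large-sample normal approximation without tie or continuity corrections. The function name is hypothetical, and a statistics package would normally be used instead.

```python
from statistics import NormalDist


def wilcoxon_signed_rank(scores_a, scores_b):
    """Two-sided Wilcoxon signed-rank p-value (normal approximation)."""
    # Zero differences carry no sign information and are dropped.
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    # Test statistic: sum of the ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = (n * (n + 1) * (2 * n + 1) / 24) ** 0.5
    z = (w_plus - mean) / sd
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

Because only ranks enter the statistic, a single large difference counts no more than its rank allows, which is why this test answers a different question than a test on the difference in means.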

Sign Test p-value = 0.3222 p-value = 0.3604
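A minimal sketch of a two-sided sign test, assuming tied topics are dropped and the p-value comes from an exact binomial distribution with success probability 0.5. How ties are handled is an assumption of this sketch; the slide does not say how its two p-values were obtained. The function name is illustrative.

```python
from math import comb


def sign_test(scores_a, scores_b):
    """Two-sided exact sign test on per-topic wins and losses."""
    wins = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    losses = sum(1 for a, b in zip(scores_a, scores_b) if a < b)
    n = wins + losses  # tied topics are ignored
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of k or fewer successes under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the smaller tail for a two-sided p-value, capped at 1.
    return min(1.0, 2 * tail)
```

The test only looks at which system wins each topic, so a win by 0.001 counts the same as a win by 0.5. For example, five wins and no losses give a p-value of 2/32 = 0.0625.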

Bootstrap Test p-value = 0.0107
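A sketch of a bootstrap test using the shift method, under these assumptions: topics are resampled with replacement, the statistic is the plain (unstudentized) difference in means, and the bootstrap distribution is shifted so that its mean is zero to approximate the null distribution. The function name, sample count, and seed are illustrative, and the exact variant behind the slide's p-value may differ.

```python
import random


def bootstrap_shift_test(scores_a, scores_b, num_samples=10000, seed=0):
    """Two-sided bootstrap test (shift method) on the difference in means."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems need one score per topic")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n

    # Resample topics with replacement and record the mean difference.
    boot_means = [sum(rng.choices(diffs, k=n)) / n for _ in range(num_samples)]

    # Shift the bootstrap distribution so that it is centered on zero,
    # which is what the null hypothesis of no difference predicts.
    center = sum(boot_means) / num_samples
    extreme = sum(1 for m in boot_means if abs(m - center) >= abs(observed))
    return extreme / num_samples
```

Unlike the randomization test, this procedure treats the topics as a random sample from a larger population of topics.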

Student’s Paired t-test p-value = 0.0153
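For reference, the standard paired t statistic can be written as follows. The notation is introduced here and is not from the slides: d_i is the per-topic score difference between the two systems and n is the number of topics.

```latex
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i,
\qquad
s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\bigl(d_i - \bar{d}\bigr)^2},
\qquad
t = \frac{\bar{d}}{s_d / \sqrt{n}}
```

The two-sided p-value is then read from Student's t distribution with n - 1 degrees of freedom.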

Results

Discussion
• Sign and Wilcoxon tests:
  • These tests should not be used because they test criteria that do not match the criterion of interest.
• Randomization and Bootstrap tests:
  • These tests can use whatever criterion we specify, while the other tests are fixed in their test statistics.
• Bootstrap test and Student’s t-test:
  • Both tests assume that the scores from the two IR systems are random samples from a single population. In practice, test topics are not random samples from the population of topics but are hand-selected to meet various criteria.
• Student’s t-test:
  • This test can only be used for the difference between means, not for the median or other test statistics.
  • At smaller sample sizes, violations of normality may result in errors in the t-test.

Conclusion
• The Randomization test is the recommended test for comparing two IR systems.
• The Wilcoxon signed-rank test and the Sign test should no longer be used in this context.
• The Randomization test, the Bootstrap shift method test, and Student’s t-test all produced comparable significance values => there is no practical difference between them!
• The Wilcoxon signed-rank test and the Sign test both produced very different p-values => they can incorrectly predict significance and can fail to detect significant results.
