
A Comparison of Statistical Significance Tests for Information Retrieval Evaluation

A Comparison of Statistical Significance Tests for Information Retrieval Evaluation. CIKM ’07, November 2007. Summary: Motivation; Significance Testing; General Approach; Significance Tests (Randomization test, Wilcoxon test, Sign test, Bootstrap test, Student’s t-test); Results; Discussion.


Presentation Transcript


  1. A Comparison of Statistical Significance Tests for Information Retrieval Evaluation • CIKM ’07, November 2007

  2. Summary • Motivation • Significance Testing • General Approach • Significance Tests: Randomization test, Wilcoxon test, Sign test, Bootstrap test, Student’s t-test • Results • Discussion • Conclusions

  3. Motivation • Goal: promote retrieval methods that truly are better, rather than methods that perform better by chance given the set of topics, judgments, and documents used in the evaluation. • Given two information retrieval (IR) systems, how can we determine which one is better? • Common evaluations such as TREC compare systems by the difference in Mean Average Precision (MAP). The problem: a difference in MAP alone does not tell us whether it could have arisen by chance. The solution: use significance tests. • Which significance test should IR researchers use? • Student’s paired t-test? Wilcoxon signed-rank test? Sign test? Bootstrap? Fisher’s randomization test?

  4. Significance Testing • A significance test consists of three ingredients: • 1. A test statistic or criterion by which to judge the two systems. IR researchers commonly use the difference in mean average precision (MAP) or the difference in the mean of another IR metric. • 2. A distribution of the test statistic given a null hypothesis. A typical null hypothesis is that there is no difference between our two systems. • 3. A significance level (p-value), computed by taking the value of the test statistic for our experimental systems and determining how likely a value at least that large is to occur under the null hypothesis.
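The three ingredients can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the per-topic average precision scores are made up, and the helper names `map_difference` and `p_value` are hypothetical.

```python
from statistics import mean

# Hypothetical per-topic average precision (AP) scores; illustrative only.
ap_a = [0.30, 0.45, 0.10, 0.62, 0.25]
ap_b = [0.35, 0.50, 0.08, 0.70, 0.31]


def map_difference(scores_a, scores_b):
    """Ingredient 1, the test statistic: difference in MAP (B minus A)."""
    return mean(scores_b) - mean(scores_a)


def p_value(observed, null_samples):
    """Ingredient 3: two-sided p-value from a sampled null distribution.

    null_samples holds values of the test statistic generated under the
    null hypothesis (ingredient 2); how they are generated depends on the test.
    """
    extreme = sum(1 for s in null_samples if abs(s) >= abs(observed))
    return extreme / len(null_samples)


observed = map_difference(ap_a, ap_b)
```

The tests compared in the following slides differ mainly in how they obtain the null distribution and in which test statistic they allow.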

  5. General Approach

  6. Randomization test p-value = 0.0138
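A rough sketch of a paired randomization (permutation) test follows, assuming a Monte Carlo approximation rather than full enumeration of all label assignments. Under the null hypothesis the two system labels are interchangeable on every topic, so each topic's pair of scores is swapped with probability 1/2. The function name, the `statistic` parameter, and the defaults are assumptions made for illustration.

```python
import random
from statistics import mean


def randomization_test(scores_a, scores_b, statistic=mean, trials=10000, seed=0):
    """Two-sided Monte Carlo randomization test on paired per-topic scores."""
    rng = random.Random(seed)
    observed = abs(statistic(scores_b) - statistic(scores_a))
    extreme = 0
    for _ in range(trials):
        perm_a, perm_b = [], []
        for a, b in zip(scores_a, scores_b):
            # Under the null, either system could have produced either score.
            if rng.random() < 0.5:
                a, b = b, a
            perm_a.append(a)
            perm_b.append(b)
        if abs(statistic(perm_b) - statistic(perm_a)) >= observed:
            extreme += 1
    return extreme / trials
```

Because the criterion is passed in as `statistic`, the same sketch works for other criteria, for example `statistics.median` instead of the mean.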

  7. Wilcoxon Test p-value = 0.0560
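For reference, the Wilcoxon signed-rank statistic can be sketched as below. This is a simplified version under stated assumptions: zero differences are dropped, tied absolute differences receive average ranks, and the two-sided p-value uses the normal approximation with no tie or continuity correction, so it is only approximate for small topic sets. The function name is hypothetical.

```python
from statistics import NormalDist


def wilcoxon_signed_rank(scores_a, scores_b):
    """Return (W+, approximate two-sided p-value) for paired scores."""
    diffs = [b - a for a, b in zip(scores_a, scores_b) if b != a]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Rank the absolute differences, giving tied values their average rank.
    ordered = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[ordered[j + 1]]) == abs(diffs[ordered[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg_rank
        i = j + 1
    # W+ is the sum of the ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mu = n * (n + 1) / 4
    sigma = (n * (n + 1) * (2 * n + 1) / 24) ** 0.5
    z = (w_plus - mu) / sigma
    return w_plus, 2 * (1 - NormalDist().cdf(abs(z)))
```

Note that the statistic depends only on the ranks of the differences, not on their magnitudes, which is why its criterion differs from a difference in means.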

  8. Sign Test p-value = 0.3222 p-value = 0.3604
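A minimal sketch of an exact two-sided sign test is shown below, assuming ties (topics where both systems score the same) are simply discarded. It is not meant to reproduce either p-value on the slide; the function name is hypothetical.

```python
from math import comb


def sign_test(scores_a, scores_b):
    """Exact two-sided sign test p-value on paired scores, ignoring ties."""
    wins = sum(1 for a, b in zip(scores_a, scores_b) if b > a)
    losses = sum(1 for a, b in zip(scores_a, scores_b) if b < a)
    n = wins + losses
    if n == 0:
        return 1.0
    # Under the null, wins follow a Binomial(n, 1/2) distribution.
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Only the direction of each per-topic difference enters the computation, so a large gain on one topic counts exactly as much as a tiny gain on another.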

  9. Bootstrap Test p-value = 0.0107
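One way to sketch a bootstrap test with the shift method is shown below. The assumptions here are: topics are resampled with replacement, the test statistic is the mean per-topic difference, and the bootstrap distribution is shifted so that its mean is zero to stand in for the null distribution. The function name and defaults are hypothetical.

```python
import random
from statistics import mean


def bootstrap_shift_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided bootstrap p-value for the mean difference (shift method)."""
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    observed = mean(diffs)
    # Resample topics with replacement and record each sample's mean difference.
    boot = [mean(rng.choices(diffs, k=len(diffs))) for _ in range(trials)]
    # Shift the bootstrap distribution so that it is centred on zero.
    center = mean(boot)
    extreme = sum(1 for m in boot if abs(m - center) >= abs(observed))
    return extreme / trials
```

Resampling treats the topics as a random sample from a larger population of topics, which is the assumption the discussion slide questions.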

  10. Student’s Paired t-test p-value = 0.0153
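The paired t statistic itself is straightforward to compute; the sketch below stops at the statistic and its degrees of freedom. Turning it into a p-value requires the Student t distribution with n - 1 degrees of freedom, which the Python standard library does not provide, so a statistical table or a statistics package would be needed for that step. The function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for the paired t-test on per-topic scores."""
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # t = mean difference divided by its estimated standard error.
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1
```

The sketch assumes at least two topics and that the differences are not all identical, since otherwise the standard error is undefined or zero.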

  11. Results

  12. Discussion • Sign and Wilcoxon tests: • These tests should not be used because they test criteria that do not match the criterion of interest. • Randomization and Bootstrap tests: • These tests can use whatever criterion we specify, while the other tests are fixed in their test statistics. • Bootstrap test and Student’s t-test: • Both assume that the scores from the two IR systems are random samples from a single population. In practice, test topics are not random samples from the population of topics but are hand-selected to meet various criteria. • Student’s t-test: • This test can only be used for the difference between means, not for the median or other test statistics. • At smaller sample sizes, violations of normality may result in errors in the t-test.

  13. Conclusion • The Randomization test is the recommended test for comparing two IR systems. • The Wilcoxon signed-rank test and the Sign test should no longer be used in this context. • The Randomization test, the Bootstrap shift method test, and Student’s t-test all produced comparable significance values => there is no practical difference between them. • The Wilcoxon signed-rank test and the Sign test both produced very different p-values => they can incorrectly predict significance and can fail to detect significant results.
