
A new measure of retrieval effectiveness (Or: What’s wrong with precision and recall)






Presentation Transcript


1. A new measure of retrieval effectiveness (Or: What’s wrong with precision and recall)
Stefano Mizzaro
Department of Mathematics and Computer Science, University of Udine
mizzaro@dimi.uniud.it
http://www.dimi.uniud.it/~mizzaro

2. Outline
• Introduction: measures of retrieval effectiveness... motivation for...
• ...a new measure: the Average Distance Measure (ADM)
• Discussion
  • Theoretical and practical adequacy of ADM
  • ADM vs. precision and recall
    • Problems with P & R
• Conclusions and future work

3. From binary to continuous relevance & retrieval
[Figure: the documents database partitioned along two binary axes — retrieved vs. not retrieved and relevant vs. not relevant [Salton & McGill, 84] — extended with “more”/“less” gradations on both dimensions.]

4. Continuous relevance & retrieval
• SRE = System Relevance Estimate (aka RSV)
• URE = User Relevance Estimate
[Figure: the URE–SRE plane; both axes run from 0 (“less” relevant / “less” retrieved) through 0.5 to 1.0 (“more” relevant / “more” retrieved).]

5. Thresholds on URE & SRE: why?
R = RetRel / (RetRel + NRetRel)
P = RetRel / (RetRel + RetNRel)
[Figure: the URE–SRE plane cut at 0.5 on both axes into four questioned quadrants: retrieved & relevant?, retrieved & nonrelevant?, nonretrieved & relevant?, nonretrieved & nonrelevant?]
... and historical reasons.
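As a concrete illustration of the thresholding step (a minimal Python sketch; the function and variable names are mine, not from the slides): continuous URE and SRE values must be binarized, conventionally at 0.5, before precision and recall can even be computed.

    def precision_recall(sre, ure, t_ret=0.5, t_rel=0.5):
        """Binarize continuous SRE/URE at the (conventional) 0.5
        thresholds, then compute precision and recall."""
        retrieved = [s >= t_ret for s in sre]
        relevant = [u >= t_rel for u in ure]
        ret_rel = sum(r and v for r, v in zip(retrieved, relevant))
        p = ret_rel / sum(retrieved) if any(retrieved) else 0.0
        r = ret_rel / sum(relevant) if any(relevant) else 0.0
        return p, r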

6. Average Distance Measure (ADM)
• SRE: assigns each document in D a system relevance estimate in [0, 1]
• URE: assigns each document in D a user relevance estimate in [0, 1]
• ADM = average “distance” between URE and SRE values:
  ADM = 1 − (Σ_{d ∈ D} |SRE(d) − URE(d)|) / |D|
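A minimal sketch of the measure as just defined (the function name is mine): one minus the mean absolute SRE–URE difference, so a system that agrees exactly with the user scores 1 and the worst possible system scores 0.

    def adm(sre, ure):
        """Average Distance Measure: 1 minus the mean absolute
        difference between system (SRE) and user (URE) relevance
        estimates, both assumed to lie in [0, 1]."""
        assert len(sre) == len(ure) and len(ure) > 0
        return 1 - sum(abs(s - u) for s, u in zip(sre, ure)) / len(ure)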

7. ADM: graphical representation
[Figure: the URE–SRE unit square; documents whose SRE exactly matches their URE lie on the diagonal SRE = URE (“exactly evaluated”), and a document’s contribution to ADM is its distance from that diagonal.]

8. ADM: An example

  Docs.   d1    d2    d3    ADM
  URE     0.8   0.4   0.1
  IRS1    0.9   0.5   0.2   0.9
  IRS2    1.0   0.6   0.3   0.8
  IRS3    0.8   0.4   1.0   0.7
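The table can be reproduced with the adm sketch above (values match up to floating-point rounding):

    ure = [0.8, 0.4, 0.1]
    print(adm([0.9, 0.5, 0.2], ure))  # IRS1 -> 0.9
    print(adm([1.0, 0.6, 0.3], ure))  # IRS2 -> 0.8
    print(adm([0.8, 0.4, 1.0], ure))  # IRS3 -> 0.7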

9. Adequacy of ADM
• One single number
• Allows a complete ordering of different performances
• ...
• ADM vs. P & R:
  • No hyper-sensitiveness to small variations close to the borders
  • No lack of sensitiveness to big variations inside the “equivalence” regions
  • No wrong thresholds

10. Hyper-sensitiveness: three very similar IRSs

          P      R      E      ADM
  IRS1    0.67   1      0.84   0.83
  IRS2    1      0.5    0.75   0.83
  IRS3    0.5    0.5    0.5    0.826

[Figure: the URE–SRE plane with SRE values at 0.49 and 0.5 marked; the band around the 0.5 threshold is labelled unstable, the rest stable.]
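The slide’s exact document configurations are not recoverable from the transcript, but the effect is easy to reproduce with the two sketches above on hypothetical data of my own: nudging SREs across the 0.5 threshold by 0.02 flips precision and recall from perfect to zero, while ADM barely moves.

    ure = [1.0, 0.0]                 # hypothetical judgments
    a = [0.51, 0.49]                 # scores just above/below the threshold
    b = [0.49, 0.51]                 # the same scores, each nudged by 0.02
    print(precision_recall(a, ure))  # (1.0, 1.0)
    print(precision_recall(b, ure))  # (0.0, 0.0)
    print(adm(a, ure), adm(b, ure))  # ~0.51 vs ~0.49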

11. Lack of sensitiveness: two very different IRSs

          P    R    E    ADM
  IRS1    1    1    1    1
  IRS2    1    1    1    0.5

[Figure: the URE–SRE plane with stable and unstable regions marked; IRS1’s estimates sit at SRE = 1.0, IRS2’s at SRE = 0.5, yet both fall on the retrieved side of the threshold.]
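This case is directly checkable with the sketches above: a system answering SRE = 1.0 and one answering SRE = 0.5 for fully relevant documents get identical precision and recall, yet very different ADM, matching the table.

    ure = [1.0, 1.0]  # all documents fully relevant
    print(precision_recall([1.0, 1.0], ure), adm([1.0, 1.0], ure))  # (1.0, 1.0) 1.0
    print(precision_recall([0.5, 0.5], ure), adm([0.5, 0.5], ure))  # (1.0, 1.0) 0.5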

12. Again on the thresholds...
[Figure: the URE–SRE plane cut at 0.5 into the four questioned quadrants, as on slide 5: retrieved & relevant?, retrieved & nonrelevant?, nonretrieved & relevant?, nonretrieved & nonrelevant?]

13. The “right” thresholds
E = CE / (OE + UE)
[Figure: the URE–SRE plane split along the SRE = URE diagonal into three regions: OverEvaluated (SRE above URE), Correctly Evaluated (on or near the diagonal), UnderEvaluated (SRE below URE).]
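A sketch of the partition behind this E, under an assumption of mine: the slide only shows the three regions around the SRE = URE diagonal, so the tolerance eps deciding when an estimate counts as “correct” is hypothetical, as is the handling of a system with no errors.

    def e_measure(sre, ure, eps=0.05):
        """Count over-, under-, and correctly evaluated documents
        relative to the SRE = URE diagonal, then compute the
        slide's E = CE / (OE + UE). The eps tolerance is an
        assumption, not from the slides."""
        oe = sum(s - u > eps for s, u in zip(sre, ure))  # over-evaluated
        ue = sum(u - s > eps for s, u in zip(sre, ure))  # under-evaluated
        ce = len(sre) - oe - ue                          # correctly evaluated
        return ce / (oe + ue) if oe + ue else float("inf")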

14. ADM in practice
• How to get URE values? Either:
  • ask the judge(s) to directly express continuous relevance judgments (feasible; there is evidence in the literature), or
  • average dichotomous/discrete relevance judgments (a sketch follows)
• UREs for all the documents in the database? Impossible!
  • Sampling
  • (which takes place with P & R too, anyway)
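The second option is simple to realize; a minimal sketch (the names are mine): the continuous URE of a document is the mean of several judges’ binary votes.

    def ure_from_binary(votes):
        """Average several judges' 0/1 relevance votes for one
        document into a continuous URE."""
        return sum(votes) / len(votes)

    print(ure_from_binary([1, 1, 0, 1]))  # 0.75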

15. Conclusions
• ADM, a new measure of retrieval effectiveness
• Adequacy
• Improvements w.r.t. P & R: avoids hyper-sensitiveness and lack of sensitiveness
• Practical usability (continuous relevance judgments, sampling)
• Very preliminary work

16. Future work
• Theoretical variations and improvements
  • Standard deviation in place of the average of absolute differences?
  • Which sampling?
• Re-examine the data of some evaluation experiments (any volunteers?)
• Using ADM in real life
