Understanding Retrieval Evaluation in IR Systems: Techniques and Benchmarks

Lecture 3: Retrieval Evaluation
Maya Ramanath

Benchmarking IR Systems: Result Quality
• Data collection
  • Ex: Archives of the NYTimes
• Query set
  • Provided by experts, identified from real search logs, etc.
• Relevance judgements
  • For a given query, is the document relevant?

Evaluation for Large Collections
• Cranfield/TREC paradigm
  • Pooling of results
• A/B testing
  • Possible for search engines
• Crowdsourcing
  • Let users decide
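Pooling can be illustrated with a small sketch: only the union of the top-ranked documents from each participating system is sent to the judges, rather than the whole collection. The function name `build_pool`, the `depth` parameter, and the toy runs are hypothetical and chosen only for illustration.

```python
def build_pool(runs, depth):
    """Return the union of the top-`depth` documents of every system's ranking."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


# Toy ranked lists from two hypothetical systems for one query.
runs = {
    "system_a": ["d1", "d2", "d3", "d4"],
    "system_b": ["d3", "d5", "d1", "d6"],
}

# Only the pooled documents would be given to the relevance judges.
pool = build_pool(runs, depth=2)
print(sorted(pool))
```

Documents outside the pool are typically treated as not relevant, which is an assumption of the paradigm rather than a measured fact.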

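Precision and Recall
• Relevance judgements are binary: “relevant” or “not relevant”.
• Partition the collection into 2 parts: relevant and not relevant documents.
• Precision: fraction of retrieved documents that are relevant, $P = |\text{relevant} \cap \text{retrieved}| \,/\, |\text{retrieved}|$
• Recall: fraction of relevant documents that are retrieved, $R = |\text{relevant} \cap \text{retrieved}| \,/\, |\text{relevant}|$
Can a search engine guarantee 100% recall?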
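A minimal sketch of set-based precision and recall under binary judgements. The function name `precision_recall` and the document ids are made up for illustration. The last call hints at the recall question: returning the entire collection trivially yields 100% recall, but at very low precision.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query with binary judgements."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = ["d1", "d3", "d7"]

# A result list that finds two of the three relevant documents.
p, r = precision_recall(["d1", "d2", "d3", "d4"], relevant)

# Returning the whole (toy) collection: recall is 1.0, precision drops.
collection = [f"d{i}" for i in range(1, 11)]
p_all, r_all = precision_recall(collection, relevant)

print(p, r)
print(p_all, r_all)
```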

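F-measure
• F-Measure: Weighted harmonic mean of Precision and Recall
  • $F_\beta = \dfrac{(1 + \beta^2)\, P\, R}{\beta^2 P + R}$; with $\beta = 1$ this is the balanced $F_1 = \dfrac{2PR}{P + R}$
Why use harmonic mean instead of arithmetic mean?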
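The sketch below computes the weighted harmonic mean and contrasts it with the arithmetic mean on a "return everything" result (low precision, perfect recall). The function name `f_measure` and the example values are illustrative assumptions.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F_beta)."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0.0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


# A system that returns everything: precision 0.3, recall 1.0.
precision, recall = 0.3, 1.0
f1 = f_measure(precision, recall)
arithmetic = (precision + recall) / 2

print(f1, arithmetic)
```

The harmonic mean stays close to the smaller of the two values, so a system cannot score well by maximising only one of precision or recall; the arithmetic mean would reward it.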

Precision-Recall Curves
• Using precision and recall to evaluate ranked retrieval
Source: Introduction to Information Retrieval. Manning, Raghavan and Schuetze, 2008
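One way to obtain the points of such a curve is to walk down the ranking and record (recall, precision) after each result. The sketch also includes interpolated precision, taken here as the highest precision at any recall level at or above the requested one. The function names and the toy ranking are assumptions for illustration.

```python
def pr_points(ranking, relevant):
    """(recall, precision) after each rank position of a ranked result list."""
    relevant = set(relevant)
    points = []
    hits = 0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return points


def interpolated_precision(points, recall_level):
    """Highest precision observed at any recall >= recall_level (0.0 if none)."""
    candidates = [p for r, p in points if r >= recall_level]
    return max(candidates, default=0.0)


ranking = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d7"]  # d7 is never retrieved in this toy example

points = pr_points(ranking, relevant)
for recall, precision in points:
    print(f"recall={recall:.2f} precision={precision:.2f}")

print(interpolated_precision(points, 0.5))
```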

Single measures
• Precision at k: P@10, P@100, etc.
• and others…
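Precision at k only looks at the first k results of the ranking. A minimal sketch, with a hypothetical function name and toy data:

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant)
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    # Dividing by k (not by the number of returned results) penalises
    # systems that return fewer than k documents.
    return hits / k


ranking = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d7"]

print(precision_at_k(ranking, relevant, 2))
print(precision_at_k(ranking, relevant, 3))
```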

Graded Relevance – NDCG
• Highly relevant documents should have more importance
• The higher the rank of a relevant document, the more valuable it is to the user
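Both ideas can be seen in a small NDCG sketch. It assumes one common formulation: the gain of a result is its grade, discounted by log2(rank + 1); other formulations use a gain of 2^grade − 1. It also assumes that the ideal ordering is obtained by sorting the grades of the same result list, which is only exact when every judged document appears in that list. Function names and grades are illustrative.

```python
import math


def dcg(grades):
    """Discounted cumulative gain: grade at rank i is divided by log2(i + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(grades, start=1))


def ndcg(grades, k=None):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    k = len(grades) if k is None else k
    ideal = sorted(grades, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    return dcg(grades[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0


# Grades of the results in ranked order (0 = not relevant, 3 = highly relevant).
good_ranking = [3, 2, 1, 0]
poor_ranking = [0, 1, 2, 3]

print(ndcg(good_ranking))
print(ndcg(poor_ranking))
```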

Inter-judge Agreement – Fleiss’ Kappa
• N: number of results
• n: number of ratings per result
• k: number of grades
• $n_{ij}$: number of judges who agree that the i-th result should have grade j
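With these symbols, the standard definition of Fleiss' kappa is $\kappa = (\bar{P} - \bar{P}_e) / (1 - \bar{P}_e)$, where $\bar{P}$ is the mean observed agreement per result and $\bar{P}_e$ is the agreement expected by chance. The sketch below follows that definition; the function name and the toy rating tables are assumptions for illustration.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = number of judges giving grade j to result i."""
    N = len(counts)        # number of results
    n = sum(counts[0])     # ratings per result (assumed equal for all results)
    k = len(counts[0])     # number of grades
    if n < 2:
        raise ValueError("need at least two ratings per result")

    # Share of all ratings that fall into grade j.
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    # Observed agreement for each result.
    P = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]

    P_bar = sum(P) / N
    P_e = sum(pj * pj for pj in p)
    if P_e == 1.0:
        raise ValueError("kappa is undefined when only one grade is ever used")
    return (P_bar - P_e) / (1 - P_e)


# Two results, three judges, two grades.
full_agreement = [[3, 0], [0, 3]]
split_votes = [[2, 1], [1, 2]]

print(fleiss_kappa(full_agreement))
print(fleiss_kappa(split_votes))
```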

Tests of Statistical Significance
• Wilcoxon signed rank test
• Student’s paired t-test
• …and more
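Both tests compare two systems on the same set of queries using the per-query score differences. The sketch computes only the test statistics with the standard library; turning them into p-values requires the t distribution or Wilcoxon tables (or a statistics package), which is left out here. Function names and the per-query scores are hypothetical.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t statistic and degrees of freedom for paired per-query scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1


def wilcoxon_statistic(scores_a, scores_b):
    """Smaller of the positive and negative signed-rank sums."""
    # Zero differences are dropped, as in the classic formulation.
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        # Tied absolute differences share the average of ranks i+1..j.
        for t in range(i, j):
            ranks[t] = (i + 1 + j) / 2
        i = j
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return min(w_plus, w_minus)


# Hypothetical per-query scores (e.g. P@10) of two systems on five queries.
system_a = [0.50, 0.62, 0.41, 0.70, 0.55]
system_b = [0.45, 0.60, 0.43, 0.61, 0.50]

t_stat, dof = paired_t_statistic(system_a, system_b)
w_stat = wilcoxon_statistic(system_a, system_b)
print(t_stat, dof, w_stat)
```

The t-test assumes the differences are roughly normally distributed; the Wilcoxon test uses only their ranks and signs, so it makes weaker assumptions.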

end of module “IR from 20000ft”
