
Retrieval Evaluation



  1. Retrieval Evaluation

  2. Introduction • In computer science, implementations are often evaluated in terms of time and space complexity. • With large document sets or large content types, such performance evaluations remain valid. • In information retrieval, we also care about retrieval performance evaluation – that is, how well the retrieved documents match the goal of the search.

  3. Retrieval Performance Evaluation • We discussed overall system evaluation previously • Traditional vs. berry-picking models of retrieval activity • Metrics include time to complete task, user satisfaction, user errors, time to learn system • But how can we compare how well different algorithms do at retrieving documents?

  4. Precision and Recall • Consider a document collection, a query and its result set, and a task with its set of relevant documents. • [Diagram: within the document collection, the relevant documents |R| and the retrieved documents (answer set) |A| overlap; their intersection is the set of relevant documents in the answer set, |Ra|.]

  5. Precision • Precision – the percentage of retrieved documents that are relevant. • Precision = |Ra| / |A|

  6. Recall • Recall – the percentage of relevant documents that are retrieved. • Recall = |Ra| / |R|
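A minimal sketch in Python (not part of the original slides; the helper name and the example sets are illustrative) showing how both measures are computed from a set of relevant documents R and a retrieved answer set A:

```python
# Illustrative sketch: precision and recall for a single query,
# given the relevant set R and the retrieved answer set A.

def precision_recall(relevant, retrieved):
    """Return (precision, recall) for one query."""
    ra = relevant & retrieved                       # Ra: relevant docs in the answer set
    precision = len(ra) / len(retrieved) if retrieved else 0.0
    recall = len(ra) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 3 of the 5 retrieved documents are relevant,
# out of 10 relevant documents in the collection.
R = {f"d{i}" for i in range(1, 11)}
A = {"d10", "d27", "d7", "d44", "d3"}
p, r = precision_recall(R, A)
print(f"precision = {p:.2f}, recall = {r:.2f}")     # precision = 0.60, recall = 0.30
```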

  7. Precision/Recall Trade-Off • We can guarantee 100% recall by returning every document in the collection … • Obviously, this is a bad idea! • We can get high precision by returning only documents we are sure of. • This, too, may be a bad idea, since recall will suffer. • So retrieval algorithms are characterized by their precision/recall curves.

  8. Plotting Precision/Recall Curve • 11-Level Precision/Recall Graph • Plot precision at 0%, 10%, 20%, …, 100% recall. • Normally, averages over a set of standard queries are used: Pavg(r) = Σi Pi(r) / Nq, where Pi(r) is the precision at recall level r for query i and Nq is the number of queries. • Example (using one query): • Relevant Documents (Rq) = {d1, d2, d3, d4, d5, d6, d7, d8, d9, d10} • Ordered Ranking by Retrieval Algorithm (Aq) = {d10, d27, d7, d44, d35, d3, d73, d82, d19, d4, d29, d33, d48, d54, d1}
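As an illustration (not from the slides; pr_points is a hypothetical helper), the following Python sketch walks the ranked answer list of this example query and records a (recall, precision) point each time a relevant document is found; these are the points from which the 11-level graph is plotted:

```python
# Illustrative sketch: observed (recall, precision) points for the
# first example query, scanning the ranking from the top.

def pr_points(relevant, ranking):
    points, found = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
            # recall and precision after seeing the top `rank` documents
            points.append((found / len(relevant), found / rank))
    return points

Rq = {f"d{i}" for i in range(1, 11)}                # {d1, ..., d10}
Aq = ["d10", "d27", "d7", "d44", "d35", "d3", "d73", "d82",
      "d19", "d4", "d29", "d33", "d48", "d54", "d1"]

for recall, precision in pr_points(Rq, Aq):
    print(f"recall {recall:.0%}  precision {precision:.1%}")
# recall 10%  precision 100.0%
# recall 20%  precision 66.7%
# recall 30%  precision 50.0%
# recall 40%  precision 40.0%
# recall 50%  precision 33.3%
```

Only 5 of the 10 relevant documents appear in the ranking, so no precision value is observed beyond 50% recall (the higher levels are typically taken as 0 when plotting).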

  9. Plotting Precision/Recall Curve • Example (second query): • Relevant Documents (Rq) = {d1, d7, d82} • Ordered Ranking by Retrieval Algorithm (Aq) = {d10, d27, d7, d44, d35, d3, d73, d82, d19, d4, d29, d33, d48, d54, d1} • Here the observed recall levels (33%, 67%, 100%) do not fall on the 11 standard levels, so we need to interpolate (see the sketch after this slide). • Then plot the average over a set of queries that matches the expected usage and query distribution.
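A sketch of the interpolation and averaging step (helper names are illustrative, not from the slides), using the common convention that the interpolated precision at a standard recall level is the maximum precision observed at any recall greater than or equal to that level; averaging the 11 interpolated values over the Nq queries gives Pavg(r):

```python
# Illustrative sketch: 11-level interpolation and averaging over queries.

def pr_points(relevant, ranking):
    """Observed (recall, precision) pairs while scanning the ranking."""
    points, found = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
            points.append((found / len(relevant), found / rank))
    return points

def interpolate_11pt(points):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    return [max((p for r, p in points if r >= lvl / 10), default=0.0)
            for lvl in range(11)]

Aq = ["d10", "d27", "d7", "d44", "d35", "d3", "d73", "d82",
      "d19", "d4", "d29", "d33", "d48", "d54", "d1"]
queries = [
    {f"d{i}" for i in range(1, 11)},   # first example query: Rq = {d1, ..., d10}
    {"d1", "d7", "d82"},               # second example query
]

curves = [interpolate_11pt(pr_points(Rq, Aq)) for Rq in queries]
p_avg = [sum(level) / len(curves) for level in zip(*curves)]   # Pavg(r), Nq = 2
print([round(p, 2) for p in p_avg])
```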

  10. Evaluating Interactive Systems • Empirical data involving human users is time-consuming to gather and difficult to draw universal conclusions from. • Evaluation metrics for user interfaces: • Time required to learn the system • Time to achieve goals on benchmark tasks • Error rates • Retention of the use of the interface over time • User satisfaction
