
Evaluation in Information Retrieval


Presentation Transcript


  1. Evaluation in Information Retrieval Speaker: Ruihua Song Web Data Management Group, MSR Asia

  2. Outline • Basics of IR evaluation • Introduction to TREC (Text Retrieval Conference) • One selected paper • Select-the-Best-Ones: A new way to judge relative relevance

  3. Motivating Examples • Which set is better? • S1 = {r, r, r, n, n} vs. S2 = {r, r, n, n, n} • S3 = {r} vs. S4 = {r, r, n} • Which ranking list is better? • L1 = <r, r, r, n, n> vs. L2 = <n, n, r, r, r> • L3 = <r, n, r, n, h> vs. L4 = <h, n, n, r, r> • r: relevant n: non-relevant h: highly relevant

  4. Precision & Recall • Precision is the fraction of the retrieved documents that are relevant • Recall is the fraction of the relevant documents that have been retrieved • [Figure: Venn diagram of the relevant set R, the answer set A, and their intersection Ra]
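
These are the standard definitions, written in the slide's Venn-diagram notation (Ra = R ∩ A), together with the F1-measure referenced on the next slide, in LaTeX notation:

    P = \frac{|R_a|}{|A|}, \qquad
    R = \frac{|R_a|}{|R|}, \qquad
    F_1 = \frac{2 \cdot P \cdot R}{P + R}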

  5. Precision & Recall (cont.) • Assume there are 10 relevant documents in the judgments • Example 1: S1 = {r, r, r, n, n} vs. S2 = {r, r, n, n, n} • P1 = 3/5 = 0.6; R1 = 3/10 = 0.3 • P2 = 2/5 = 0.4; R2 = 2/10 = 0.2 • S1 > S2 • Example 2: S3 = {r} vs. S4 = {r, r, n} • P3 = 1/1 = 1; R3 = 1/10 = 0.1 • P4 = 2/3 = 0.667; R4 = 2/10 = 0.2 • ? (F1-Measure) • Example 3: L1 = <r, r, r, n, n> vs. L2 = <n, n, r, r, r> • ? • r: relevant n: non-relevant h: highly relevant
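
A minimal Python sketch of these computations, assuming labels 'r' and 'h' count as relevant and 10 relevant documents exist in total; the helper names are illustrative, not from the slides:

    def precision_recall(results, total_relevant=10):
        """Precision and recall for a retrieved set of relevance labels."""
        relevant_retrieved = sum(1 for label in results if label in ('r', 'h'))
        return relevant_retrieved / len(results), relevant_retrieved / total_relevant

    def f1(p, r):
        """Harmonic mean of precision and recall."""
        return 0.0 if p + r == 0 else 2 * p * r / (p + r)

    # Example 2: S3 = {r} vs. S4 = {r, r, n}
    p3, r3 = precision_recall(['r'])            # P3 = 1.0,   R3 = 0.1
    p4, r4 = precision_recall(['r', 'r', 'n'])  # P4 = 0.667, R4 = 0.2
    print(f1(p3, r3), f1(p4, r4))               # about 0.18 vs. 0.31, so S4 wins on F1

Example 3 cannot be decided by set-based measures at all; it needs a rank-aware measure, which the next slide's Average Precision provides.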

  6. Mean Average Precision • Defined as the mean of Average Precision for a set of queries • Example 3: L1=<r, r, r, n, n> vs. L2=<n, n, r, r, r> • AP1=(1/1+2/2+3/3)/10=0.3 • AP2=(1/3+2/4+3/5)/10=0.143 • L1 > L2
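
Average Precision sums the precision at the rank of each relevant document and divides by the total number of relevant documents (10 here); a sketch that reproduces the slide's numbers (the function name is illustrative):

    def average_precision(ranked_labels, total_relevant=10):
        """AP: precision at each relevant document's rank, averaged over
        all relevant documents (unretrieved relevant documents contribute zero)."""
        hits, precision_sum = 0, 0.0
        for rank, label in enumerate(ranked_labels, start=1):
            if label in ('r', 'h'):
                hits += 1
                precision_sum += hits / rank
        return precision_sum / total_relevant

    ap1 = average_precision(['r', 'r', 'r', 'n', 'n'])  # (1/1 + 2/2 + 3/3) / 10 = 0.3
    ap2 = average_precision(['n', 'n', 'r', 'r', 'r'])  # (1/3 + 2/4 + 3/5) / 10 ≈ 0.143
    # MAP is the mean of AP over a set of queries.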

  7. Other Metrics based on Binary Judgments • P@10 (Precision at 10) is the fraction of relevant documents among the top 10 documents in the ranked list returned for a topic • e.g. if 3 of the top 10 retrieved documents are relevant, P@10 = 0.3 • MRR (Mean Reciprocal Rank) is the mean of the Reciprocal Rank over a set of queries • RR is the reciprocal of the rank of the first relevant document in the ranked list returned for a topic • e.g. if the first relevant document is ranked No. 4, RR = 1/4 = 0.25
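
The same kind of sketch for P@k and Reciprocal Rank, with binary judgments and illustrative helper names:

    def precision_at_k(ranked_labels, k=10):
        """Fraction of the top-k retrieved documents that are relevant."""
        return sum(1 for label in ranked_labels[:k] if label == 'r') / k

    def reciprocal_rank(ranked_labels):
        """1 / rank of the first relevant document (0 if none is retrieved)."""
        for rank, label in enumerate(ranked_labels, start=1):
            if label == 'r':
                return 1.0 / rank
        return 0.0

    # e.g. the first relevant document at rank 4 gives RR = 0.25;
    # MRR averages RR over all queries, just as MAP averages AP.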

  8. Metrics based on Graded Relevance • Example 4: L3 = <r, n, r, n, h> vs. L4 = <h, n, n, r, r> • r: relevant n: non-relevant h: highly relevant • Which ranking list is better? • Cumulated Gains based metrics • CG, DCG, and nDCG • Two assumptions about the ranked result list • Highly relevant documents are more valuable • The further down the list a relevant document is ranked, the less valuable it is to the user

  9. CG • Cumulated Gains • From graded-relevance judgments to gain vectors • Example 4: L3 = <r, n, r, n, h> vs. L4 = <h, n, n, r, r> • G3 = <1, 0, 1, 0, 2>, G4 = <2, 0, 0, 1, 1> • CG3 = <1, 1, 2, 2, 4>, CG4 = <2, 2, 2, 3, 4>
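
Mapping the labels to gains (n → 0, r → 1, h → 2) and accumulating them reproduces the CG vectors above; a minimal sketch:

    from itertools import accumulate

    gains = {'n': 0, 'r': 1, 'h': 2}                       # graded label -> gain
    g3 = [gains[x] for x in ['r', 'n', 'r', 'n', 'h']]     # G3 = [1, 0, 1, 0, 2]
    g4 = [gains[x] for x in ['h', 'n', 'n', 'r', 'r']]     # G4 = [2, 0, 0, 1, 1]
    cg3, cg4 = list(accumulate(g3)), list(accumulate(g4))  # [1, 1, 2, 2, 4], [2, 2, 2, 3, 4]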

  10. DCG • Discounted Cumulated Gains • Discount function (not reproduced in the transcript; see the sketch below) • Example 4: L3 = <r, n, r, n, h> vs. L4 = <h, n, n, r, r> • G3 = <1, 0, 1, 0, 2>, G4 = <2, 0, 0, 1, 1> • DG3 = <1, 0, 0.63, 0, 0.86>, DG4 = <2, 0, 0, 0.5, 0.43> • CG3 = <1, 1, 2, 2, 4>, CG4 = <2, 2, 2, 3, 4> • DCG3 = <1, 1, 1.63, 1.63, 2.49>, DCG4 = <2, 2, 2, 2.5, 2.93>
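
The discount formula itself is not in the transcript, but the numbers above match the standard Järvelin & Kekäläinen formulation, in which the gain at rank i is divided by log2(i) for i >= 2 and left undiscounted at rank 1. A sketch under that assumption:

    import math
    from itertools import accumulate

    def discounted_gains(g):
        """Divide the gain at rank i by log2(i) for i >= 2; rank 1 is undiscounted."""
        return [gain / max(1.0, math.log2(rank))
                for rank, gain in enumerate(g, start=1)]

    dg3 = discounted_gains([1, 0, 1, 0, 2])   # [1, 0, 0.63, 0, 0.86]
    dg4 = discounted_gains([2, 0, 0, 1, 1])   # [2, 0, 0, 0.5, 0.43]
    dcg3 = list(accumulate(dg3))              # [1, 1, 1.63, 1.63, 2.49]
    dcg4 = list(accumulate(dg4))              # [2, 2, 2, 2.5, 2.93]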

  11. nDCG • Normalized Discounted Cumulated Gains • Ideal (D)CG vector • Example 4: L3 = <r, n, r, n, h> vs. L4 = <h, n, n, r, r> • Lideal = <h, r, r, n, n> • Gideal = <2, 1, 1, 0, 0>; DGideal = <2, 1, 0.63, 0, 0> • CGideal = <2, 3, 4, 4, 4>; DCGideal = <2, 3, 3.63, 3.63, 3.63>

  12. nDCG • Normalized Discounted Cumulated Gains • Normalized (D)CG • Example 4: L3 = <r, n, r, n, h> vs. L4 = <h, n, n, r, r> • DCGideal = <2, 3, 3.63, 3.63, 3.63> • nDCG3 = <1/2, 1/3, 1.63/3.63, 1.63/3.63, 2.49/3.63> = <0.5, 0.33, 0.45, 0.45, 0.69> • nDCG4 = <2/2, 2/3, 2/3.63, 2.5/3.63, 2.93/3.63> = <1, 0.67, 0.55, 0.69, 0.81> • L3 < L4
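
nDCG at each rank is the actual DCG divided by the ideal DCG at that rank; continuing the sketch above:

    def ndcg(dcg, ideal_dcg):
        """Element-wise ratio of actual DCG to the DCG of the ideal ordering."""
        return [d / ideal for d, ideal in zip(dcg, ideal_dcg)]

    dcg_ideal = [2, 3, 3.63, 3.63, 3.63]               # from L_ideal = <h, r, r, n, n>
    ndcg3 = ndcg([1, 1, 1.63, 1.63, 2.49], dcg_ideal)  # [0.5, 0.33, 0.45, 0.45, 0.69]
    ndcg4 = ndcg([2, 2, 2, 2.5, 2.93], dcg_ideal)      # [1.0, 0.67, 0.55, 0.69, 0.81]
    # L4 dominates L3 at every rank, so L3 < L4.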

  13. Something Important • Dealing with small data sets • Cross-validation • Significance testing • Paired, two-tailed t-test • Is Green < Yellow? Is the difference significant, or just due to chance? [Figure: score distributions p(.) for the two systems being compared]
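
A minimal sketch of such a paired, two-tailed t-test over per-query scores, using scipy; the score lists below are made-up placeholders, not data from the slides:

    from scipy import stats

    # Hypothetical per-query scores for the two systems being compared.
    green  = [0.42, 0.51, 0.38, 0.60, 0.47, 0.55, 0.33, 0.49]
    yellow = [0.45, 0.58, 0.41, 0.59, 0.52, 0.61, 0.37, 0.50]

    t_stat, p_value = stats.ttest_rel(green, yellow)   # paired t-test, two-tailed by default
    print("significant" if p_value < 0.05 else "could be chance", round(p_value, 3))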

  14. Any questions?

  15. Introduction to TREC By Ruihua Song Web Data Management Group, MSR Asia March 30, 2010

  16. Text Retrieval Conference • Homepage: http://trec.nist.gov/ • Goals • To encourage retrieval research based on large test collections • To increase communication among industry, academia, and government • To speed the transfer of technology from research labs into commercial products • To increase the availability of appropriate evaluation techniques for use by industry and academia

  17. Yearly Cycle of TREC

  18. The TREC Tracks

  19. TREC 2009 • Tracks • Blog track • Chemical IR track • Entity track • Legal track • “Million Query” track • Relevance Feedback track • Web track • Participants • 67 groups representing 19 different countries

  20. TREC 2010 • Schedule • By Feb 18 – submit your application to participate in TREC 2010 • Beginning March 2 • Nov 16-19: TREC 2010 at NIST in Gaithersburg, Md. USA • What’s new • Session track • To test whether systems can improve their performance for a given query by using a previous query • To evaluate system performance over an entire query session instead of a single query • Track web page: http://ir.cis.udel.edu/sessions

  21. Why TREC • To obtain public data sets (most frequently used in IR papers) • Pooling makes judgments unbiased for participants • To exchange ideas in emerging areas • A strong Program Committee • A healthy comparison of approaches • To influence evaluation methodologies • By feedback or proposals

  22. TREC 2009 Program Committee Ellen Voorhees, chair James Allan Chris Buckley Gord Cormack Sue Dumais Donna Harman Bill Hersh David Lewis Doug Oard John Prager Stephen Robertson Mark Sanderson Ian Soboroff Richard Tong

  23. Any questions?

  24. Select-the-Best-Ones: A new way to judge relative relevance Ruihua Song, Qingwei Guo, Ruochi Zhang, Guomao Xin, Ji-Rong Wen, Yong Yu, Hsiao-Wuen Hon Information Processing and Management, 2010

  25. Absolute Relevance Judgments

  26. Relative Relevance Judgments • Problem formulation • Connections between Absolute and Relative • A can be transformed to R (see the sketch below) • R can be transformed to A, if the assessors assign a relevance grade to each set
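
The transformation itself is not reproduced in the transcript; one natural reading is that a document with a higher absolute grade is preferred over any document with a lower grade, while ties yield no preference. A hedged sketch of that reading (the helper name and grade encoding are illustrative):

    def absolute_to_relative(grades):
        """Given absolute grades {doc_id: grade}, emit the implied
        preference pairs (better_doc, worse_doc)."""
        pairs = []
        docs = list(grades.items())
        for i, (d1, g1) in enumerate(docs):
            for d2, g2 in docs[i + 1:]:
                if g1 > g2:
                    pairs.append((d1, d2))
                elif g2 > g1:
                    pairs.append((d2, d1))
        return pairs

    # e.g. {'a': 2, 'b': 1, 'c': 1} -> [('a', 'b'), ('a', 'c')]; ties yield no pair.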

  27. Quick-Sort: a pairwise strategy [Figure: illustration of the Quick-Sort judging strategy, in which documents are ordered through repeated pairwise comparisons]
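
A rough sketch of how a quicksort-style pairwise judging loop might look, with ask_preference standing in for a hypothetical assessor interface; this illustrates the general idea, not the paper's exact procedure:

    def quicksort_judge(docs, ask_preference):
        """Order documents by relevance using pairwise assessor preferences.
        ask_preference(a, b) returns True if a is judged more relevant than b."""
        if len(docs) <= 1:
            return list(docs)
        pivot, rest = docs[0], docs[1:]
        better, worse = [], []
        for d in rest:                       # one assessor question per document
            (better if ask_preference(d, pivot) else worse).append(d)
        return quicksort_judge(better, ask_preference) + [pivot] + \
               quicksort_judge(worse, ask_preference)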

  28. Select-the-Best-Ones: a proposed new strategy [Figure: illustration of the Select-the-Best-Ones judging strategy]
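
The strategy's name and figure suggest a round-based loop: the assessor repeatedly selects the best documents from the remaining pool, and each round of picks forms the next relevance level. A hedged sketch of that reading, with select_best as a hypothetical assessor interface (again, not the paper's exact procedure):

    def sbo_judge(docs, select_best):
        """Repeatedly ask the assessor to pick the best documents from the pool;
        each round of picks forms the next (lower) relevance level."""
        remaining, levels = list(docs), []
        while remaining:
            best = select_best(remaining)    # subset the assessor marks as best
            if not best:                     # guard against an empty selection
                break
            levels.append(best)
            remaining = [d for d in remaining if d not in best]
        return levels                        # levels[0] holds the most relevant documents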

  29. User Study • Experiment Design • Latin Square design to minimize possible practice and order effects • Each tool was used to judge all three query sets • Each query was judged by three subjects • Each subject used every tool and judged every query, but never judged the same query with two different tools

  30. User Study • Experiment Design • 30 Chinese queries are divided into three balanced sets, covering both popular and long-tail queries

  31. Scene of User Study

  32. Basic Evaluation Results • Efficiency • Majority agreement • Discriminative power

  33. Further Analysis on Discriminative Power • Three grades, ‘Excellent’, ‘Good’, and ‘Fair’, are split, while ‘Perfect’ and ‘Bad’ are not • More queries are affected in SBO than in QS, and the splitting is distributed more evenly in SBO

  34. Evaluation Experiment on Judgment Quality • Collecting experts' judgments • 5 experts, 15 Chinese queries • Partial orders • Judged individually, then discussed as a group • Experimental results

  35. Discussion • Absolute relevance judgment method • Fast and easy-to-implement • Loses some useful order information • Quick-sort method • Light cognitive load and scalable • High complexity and unstable standard • Select-the-Best-Ones method • Efficient with good discriminative power • Heavy cognitive load and not scalable

  36. Conclusion • We propose a new strategy called Select-the-Best-Ones to address the problem of relative relevance judgment • A user study and an evaluation experiment show that the SBO method • Outperforms the absolute method in terms of agreement and discriminative power • Dramatically improves efficiency over the pairwise relative QS strategy • Reduces the number of discordant pairs by half, compared to the QS method

  37. Thank you! rsong@microsoft.com
