
Evaluation in Information Retrieval


Presentation Transcript


  1. Evaluation in Information Retrieval Speaker: Ruihua Song Web Data Management Group, MSR Asia

  2. Outline • Basics of IR evaluation • Introduction to TREC (Text Retrieval Conference) • One selected paper • Select-the-Best-Ones: A new way to judge relative relevance

  3. Motivating Examples • Which set is better? • S1 = {r, r, r, n, n} vs. S2 = {r, r, n, n, n} • S3 = {r} vs. S4 = {r, r, n} • Which ranking list is better? • L1 = <r, r, r, n, n> vs. L2 = <n, n, r, r, r> • L3 = <r, n, r, n, h> vs. L4 = <h, n, n, r, r> • r: relevant n: non-relevant h: highly relevant

  4. Precision & Recall • Precision is the fraction of the retrieved documents that are relevant • Recall is the fraction of the relevant documents that have been retrieved • [Figure: Venn diagram of the relevant set R, the answer set A, and their intersection Ra]
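
These are the standard definitions, written in the slide's Venn-diagram notation (Ra = R ∩ A), together with the F1-measure referenced on the next slide, in LaTeX notation:

    P = \frac{|R_a|}{|A|}, \qquad
    R = \frac{|R_a|}{|R|}, \qquad
    F_1 = \frac{2 \cdot P \cdot R}{P + R}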

  5. Precision & Recall (cont.) • Assume there are 10 relevant documents in the judgments • Example 1: S1 = {r, r, r, n, n} vs. S2 = {r, r, n, n, n} • P1 = 3/5 = 0.6; R1 = 3/10 = 0.3 • P2 = 2/5 = 0.4; R2 = 2/10 = 0.2 • S1 > S2 • Example 2: S3 = {r} vs. S4 = {r, r, n} • P3 = 1/1 = 1; R3 = 1/10 = 0.1 • P4 = 2/3 = 0.667; R4 = 2/10 = 0.2 • ? (F1-Measure) • Example 3: L1 = <r, r, r, n, n> vs. L2 = <n, n, r, r, r> • ? • r: relevant n: non-relevant h: highly relevant
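
A minimal Python sketch of these computations, assuming labels 'r' and 'h' count as relevant and 10 relevant documents exist in total; the helper names are illustrative, not from the slides:

    def precision_recall(results, total_relevant=10):
        """Precision and recall for a retrieved set of relevance labels."""
        relevant_retrieved = sum(1 for label in results if label in ('r', 'h'))
        return relevant_retrieved / len(results), relevant_retrieved / total_relevant

    def f1(p, r):
        """Harmonic mean of precision and recall."""
        return 0.0 if p + r == 0 else 2 * p * r / (p + r)

    # Example 2: S3 = {r} vs. S4 = {r, r, n}
    p3, r3 = precision_recall(['r'])            # P3 = 1.0,   R3 = 0.1
    p4, r4 = precision_recall(['r', 'r', 'n'])  # P4 = 0.667, R4 = 0.2
    print(f1(p3, r3), f1(p4, r4))               # about 0.18 vs. 0.31, so S4 wins on F1

Example 3 cannot be decided by set-based measures at all; it needs a rank-aware measure, which the next slide's Average Precision provides.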

  6. Mean Average Precision • Defined as the mean of Average Precision for a set of queries • Example 3: L1=<r, r, r, n, n> vs. L2=<n, n, r, r, r> • AP1=(1/1+2/2+3/3)/10=0.3 • AP2=(1/3+2/4+3/5)/10=0.143 • L1 > L2
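
Average Precision sums the precision at the rank of each relevant document and divides by the total number of relevant documents (10 here); a sketch that reproduces the slide's numbers (the function name is illustrative):

    def average_precision(ranked_labels, total_relevant=10):
        """AP: precision at each relevant document's rank, averaged over
        all relevant documents (unretrieved relevant documents contribute zero)."""
        hits, precision_sum = 0, 0.0
        for rank, label in enumerate(ranked_labels, start=1):
            if label in ('r', 'h'):
                hits += 1
                precision_sum += hits / rank
        return precision_sum / total_relevant

    ap1 = average_precision(['r', 'r', 'r', 'n', 'n'])  # (1/1 + 2/2 + 3/3) / 10 = 0.3
    ap2 = average_precision(['n', 'n', 'r', 'r', 'r'])  # (1/3 + 2/4 + 3/5) / 10 ≈ 0.143
    # MAP is the mean of AP over a set of queries.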

  7. Other Metrics based on Binary Judgments • P@10 (Precision at 10) is the fraction of relevant documents among the top 10 documents in the ranked list returned for a topic • e.g. if 3 of the top 10 retrieved documents are relevant, P@10 = 0.3 • MRR (Mean Reciprocal Rank) is the mean of the Reciprocal Rank over a set of queries • RR is the reciprocal of the rank of the first relevant document in the ranked list returned for a topic • e.g. if the first relevant document is ranked No. 4, RR = 1/4 = 0.25
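
The same kind of sketch for P@k and Reciprocal Rank, with binary judgments and illustrative helper names:

    def precision_at_k(ranked_labels, k=10):
        """Fraction of the top-k retrieved documents that are relevant."""
        return sum(1 for label in ranked_labels[:k] if label == 'r') / k

    def reciprocal_rank(ranked_labels):
        """1 / rank of the first relevant document (0 if none is retrieved)."""
        for rank, label in enumerate(ranked_labels, start=1):
            if label == 'r':
                return 1.0 / rank
        return 0.0

    # e.g. the first relevant document at rank 4 gives RR = 0.25;
    # MRR averages RR over all queries, just as MAP averages AP.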

  8. Metrics based on Graded Relevance • Example 4: L3 = <r, n, r, n, h> vs. L4 = <h, n, n, r, r> • r: relevant n: non-relevant h: highly relevant • Which ranking list is better? • Cumulated Gains based metrics • CG, DCG, and nDCG • Two assumptions about the ranked result list • Highly relevant documents are more valuable • The further down the list a relevant document is ranked, the less valuable it is to the user

  9. CG • Cumulated Gains • From graded-relevance judgments to gain vectors • Example 4: L3 = <r, n, r, n, h> vs. L4 = <h, n, n, r, r> • G3 = <1, 0, 1, 0, 2>, G4 = <2, 0, 0, 1, 1> • CG3 = <1, 1, 2, 2, 4>, CG4 = <2, 2, 2, 3, 4>
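
Mapping the labels to gains (n → 0, r → 1, h → 2) and accumulating them reproduces the CG vectors above; a minimal sketch:

    from itertools import accumulate

    gains = {'n': 0, 'r': 1, 'h': 2}                       # graded label -> gain
    g3 = [gains[x] for x in ['r', 'n', 'r', 'n', 'h']]     # G3 = [1, 0, 1, 0, 2]
    g4 = [gains[x] for x in ['h', 'n', 'n', 'r', 'r']]     # G4 = [2, 0, 0, 1, 1]
    cg3, cg4 = list(accumulate(g3)), list(accumulate(g4))  # [1, 1, 2, 2, 4], [2, 2, 2, 3, 4]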

  10. DCG • Discounted Cumulated Gains • Discount function (not reproduced in the transcript; see the sketch below) • Example 4: L3 = <r, n, r, n, h> vs. L4 = <h, n, n, r, r> • G3 = <1, 0, 1, 0, 2>, G4 = <2, 0, 0, 1, 1> • DG3 = <1, 0, 0.63, 0, 0.86>, DG4 = <2, 0, 0, 0.5, 0.43> • CG3 = <1, 1, 2, 2, 4>, CG4 = <2, 2, 2, 3, 4> • DCG3 = <1, 1, 1.63, 1.63, 2.49>, DCG4 = <2, 2, 2, 2.5, 2.93>
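
The discount formula itself is not in the transcript, but the numbers above match the standard Järvelin & Kekäläinen formulation, in which the gain at rank i is divided by log2(i) for i >= 2 and left undiscounted at rank 1. A sketch under that assumption:

    import math
    from itertools import accumulate

    def discounted_gains(g):
        """Divide the gain at rank i by log2(i) for i >= 2; rank 1 is undiscounted."""
        return [gain / max(1.0, math.log2(rank))
                for rank, gain in enumerate(g, start=1)]

    dg3 = discounted_gains([1, 0, 1, 0, 2])   # [1, 0, 0.63, 0, 0.86]
    dg4 = discounted_gains([2, 0, 0, 1, 1])   # [2, 0, 0, 0.5, 0.43]
    dcg3 = list(accumulate(dg3))              # [1, 1, 1.63, 1.63, 2.49]
    dcg4 = list(accumulate(dg4))              # [2, 2, 2, 2.5, 2.93]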

  11. nDCG • Normalized Discounted Cumulated Gains • Ideal (D)CG vector • Example 4: L3 = <r, n, r, n, h> vs. L4 = <h, n, n, r, r> • Lideal = <h, r, r, n, n> • Gideal = <2, 1, 1, 0, 0>; DGideal = <2, 1, 0.63, 0, 0> • CGideal = <2, 3, 4, 4, 4>; DCGideal = <2, 3, 3.63, 3.63, 3.63>

  12. nDCG • Normalized Discounted Cumulated Gains • Normalized (D)CG • Example 4: L3 = <r, n, r, n, h> vs. L4 = <h, n, n, r, r> • DCGideal = <2, 3, 3.63, 3.63, 3.63> • nDCG3 = <1/2, 1/3, 1.63/3.63, 1.63/3.63, 2.49/3.63> = <0.5, 0.33, 0.45, 0.45, 0.69> • nDCG4 = <2/2, 2/3, 2/3.63, 2.5/3.63, 2.93/3.63> = <1, 0.67, 0.55, 0.69, 0.81> • L3 < L4
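
nDCG at each rank is the actual DCG divided by the ideal DCG at that rank; continuing the sketch above:

    def ndcg(dcg, ideal_dcg):
        """Element-wise ratio of actual DCG to the DCG of the ideal ordering."""
        return [d / ideal for d, ideal in zip(dcg, ideal_dcg)]

    dcg_ideal = [2, 3, 3.63, 3.63, 3.63]               # from L_ideal = <h, r, r, n, n>
    ndcg3 = ndcg([1, 1, 1.63, 1.63, 2.49], dcg_ideal)  # [0.5, 0.33, 0.45, 0.45, 0.69]
    ndcg4 = ndcg([2, 2, 2, 2.5, 2.93], dcg_ideal)      # [1.0, 0.67, 0.55, 0.69, 0.81]
    # L4 dominates L3 at every rank, so L3 < L4.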

  13. Something Important • Dealing with small data sets • Cross-validation • Significance testing • Paired, two-tailed t-test • Is Green < Yellow? Is the difference significant, or just due to chance? [Figure: score distributions p(.) for the two systems being compared]
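
A minimal sketch of such a paired, two-tailed t-test over per-query scores, using scipy; the score lists below are made-up placeholders, not data from the slides:

    from scipy import stats

    # Hypothetical per-query scores for the two systems being compared.
    green  = [0.42, 0.51, 0.38, 0.60, 0.47, 0.55, 0.33, 0.49]
    yellow = [0.45, 0.58, 0.41, 0.59, 0.52, 0.61, 0.37, 0.50]

    t_stat, p_value = stats.ttest_rel(green, yellow)   # paired t-test, two-tailed by default
    print("significant" if p_value < 0.05 else "could be chance", round(p_value, 3))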

  14. Any questions?

  15. Introduction to TREC By Ruihua Song Web Data Management Group, MSR Asia March 30, 2010

  16. Text Retrieval Conference • Homepage: http://trec.nist.gov/ • Goals • To encourage retrieval research based on large test collections • To increase communication among industry, academia, and government • To speed the transfer of technology from research labs into commercial products • To increase the availability of appropriate evaluation techniques for use by industry and academia

  17. Yearly Cycle of TREC

  18. The TREC Tracks

  19. TREC 2009 • Tracks • Blog track • Chemical IR track • Entity track • Legal track • “Million Query” track • Relevance Feedback track • Web track • Participants • 67 groups representing 19 different countries

  20. TREC 2010 • Schedule • By Feb 18 – submit your application to participate in TREC 2010 • Beginning March 2 • Nov 16-19: TREC 2010 at NIST in Gaithersburg, Md. USA • What’s new • Session track • To test whether systems can improve their performance for a given query by using a previous query • To evaluate system performance over an entire query session instead of a single query • Track web page: http://ir.cis.udel.edu/sessions

  21. Why TREC • To obtain public data sets (most frequently used in IR papers) • Pooling makes judgments unbiased for participants • To exchange ideas in emerging areas • A strong Program Committee • A healthy comparison of approaches • To influence evaluation methodologies • By feedback or proposals

  22. TREC 2009 Program Committee Ellen Voorhees, chair James Allan Chris Buckley Gord Cormack Sue Dumais Donna Harman Bill Hersh David Lewis Doug Oard John Prager Stephen Robertson Mark Sanderson Ian Soboroff Richard Tong

  23. Any questions?

  24. Select-the-Best-Ones: A new way to judge relative relevance Ruihua Song, Qingwei Guo, Ruochi Zhang, Guomao Xin, Ji-Rong Wen, Yong Yu, Hsiao-Wuen Hon Information Processing and Management, 2010

  25. Absolute Relevance Judgments

  26. Relative Relevance Judgments • Problem formulation • Connections between Absolute and Relative • A can be transformed to R (see the sketch below) • R can be transformed to A, if the assessors assign a relevance grade to each set
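
The transformation itself is not reproduced in the transcript; one natural reading is that a document with a higher absolute grade is preferred over any document with a lower grade, while ties yield no preference. A hedged sketch of that reading (the helper name and grade encoding are illustrative):

    def absolute_to_relative(grades):
        """Given absolute grades {doc_id: grade}, emit the implied
        preference pairs (better_doc, worse_doc)."""
        pairs = []
        docs = list(grades.items())
        for i, (d1, g1) in enumerate(docs):
            for d2, g2 in docs[i + 1:]:
                if g1 > g2:
                    pairs.append((d1, d2))
                elif g2 > g1:
                    pairs.append((d2, d1))
        return pairs

    # e.g. {'a': 2, 'b': 1, 'c': 1} -> [('a', 'b'), ('a', 'c')]; ties yield no pair.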

  27. Quick-Sort: a pairwise strategy [Figure: illustration of the Quick-Sort judging strategy, in which documents are ordered through repeated pairwise comparisons]
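
A rough sketch of how a quicksort-style pairwise judging loop might look, with ask_preference standing in for a hypothetical assessor interface; this illustrates the general idea, not the paper's exact procedure:

    def quicksort_judge(docs, ask_preference):
        """Order documents by relevance using pairwise assessor preferences.
        ask_preference(a, b) returns True if a is judged more relevant than b."""
        if len(docs) <= 1:
            return list(docs)
        pivot, rest = docs[0], docs[1:]
        better, worse = [], []
        for d in rest:                       # one assessor question per document
            (better if ask_preference(d, pivot) else worse).append(d)
        return quicksort_judge(better, ask_preference) + [pivot] + \
               quicksort_judge(worse, ask_preference)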

  28. Select-the-Best-Ones: a proposed new strategy [Figure: illustration of the Select-the-Best-Ones judging strategy]
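
The strategy's name and figure suggest a round-based loop: the assessor repeatedly selects the best documents from the remaining pool, and each round of picks forms the next relevance level. A hedged sketch of that reading, with select_best as a hypothetical assessor interface (again, not the paper's exact procedure):

    def sbo_judge(docs, select_best):
        """Repeatedly ask the assessor to pick the best documents from the pool;
        each round of picks forms the next (lower) relevance level."""
        remaining, levels = list(docs), []
        while remaining:
            best = select_best(remaining)    # subset the assessor marks as best
            if not best:                     # guard against an empty selection
                break
            levels.append(best)
            remaining = [d for d in remaining if d not in best]
        return levels                        # levels[0] holds the most relevant documents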

  29. User Study • Experiment Design • Latin Square design to minimize possible practice and order effects • Each tool was used to judge all three query sets • Each query was judged by three subjects • Each subject used every tool and judged every query, but never judged the same query with two different tools

  30. User Study • Experiment Design • 30 Chinese queries are divided into three balanced sets, covering both popular and long-tail queries

  31. Scene of User Study

  32. Basic Evaluation Results • Efficiency • Majority agreement • Discriminative power

  33. Further Analysis on Discriminative Power • Three grades, ‘Excellent’, ‘Good’, and ‘Fair’, are split, while ‘Perfect’ and ‘Bad’ are not • More queries are affected in SBO than in QS, and the splitting is distributed more evenly in SBO

  34. Evaluation Experiment on Judgment Quality • Collecting experts' judgments • 5 experts, 15 Chinese queries • Partial orders • Judged individually, then discussed as a group • Experimental results

  35. Discussion • Absolute relevance judgment method • Fast and easy-to-implement • Loses some useful order information • Quick-sort method • Light cognitive load and scalable • High complexity and unstable standard • Select-the-Best-Ones method • Efficient with good discriminative power • Heavy cognitive load and not scalable

  36. Conclusion • We propose a new strategy called Select-the-Best-Ones to address the problem of relative relevance judgment • A user study and an evaluation experiment show that the SBO method • Outperforms the absolute method in terms of agreement and discriminative power • Dramatically improves efficiency over the pairwise relative QS strategy • Reduces the number of discordant pairs by half, compared to the QS method

  37. Thank you! rsong@microsoft.com
