
Chapter 3 Retrieval Evaluation






  1. Chapter 3 Retrieval Evaluation Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University

  2. Evaluation • Function analysis • Time and space • The shorter the response time, the smaller the space used, the better the system is • Performance evaluation (for data retrieval) • Performance of the indexing structure • The interaction with the operating systems • The delays in communication channels • The overheads introduced by software layers • Performance evaluation (for information retrieval) • Besides time and space, retrieval performance is an issue

  3. Retrieval Performance Evaluation • Retrieval task • Batch mode • The user submits a query and receives an answer back • How the answer set is generated • Interactive mode • The user specifies his information need through a series of interactive steps with the system • Aspects • User effort • characteristics of interface design • guidance provided by the system • duration of the session

  4. Recall and Precision • Recall: the fraction of the relevant documents which have been retrieved • Precision: the fraction of the retrieved documents which are relevant • With R the set of relevant documents in the collection, A the answer set, and Ra = R ∩ A the relevant documents in the answer set: Recall = |Ra| / |R|, Precision = |Ra| / |A|
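As a small illustration (not part of the original slides), the two set-based measures can be computed directly from the relevant set R and the answer set A; the document identifiers below reuse the example of the next slide:

```python
def recall_and_precision(relevant, answer):
    """Set-based recall and precision for a single query.

    relevant: the set R of relevant documents in the collection
    answer:   the answer set A returned by the system
    """
    ra = relevant & answer              # Ra = R ∩ A
    recall = len(ra) / len(relevant)    # |Ra| / |R|
    precision = len(ra) / len(answer)   # |Ra| / |A|
    return recall, precision

# Document identifiers taken from the example on the next slide
R = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
A = {"d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187",
     "d25", "d38", "d48", "d250", "d113", "d3"}
print(recall_and_precision(R, A))   # (0.5, 0.3333...): 5 of the 10 relevant docs were retrieved
```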

  5. Precision versus recall curve • The user is not usually presented with all the documents in the answer set A at once • Example: Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123} • Ranking for query q by a retrieval algorithm (• marks a relevant document):
1. d123 •    6. d9 •      11. d38
2. d84       7. d511      12. d48
3. d56 •     8. d129      13. d250
4. d6        9. d187      14. d113
5. d8        10. d25 •    15. d3 •
• (precision, recall) points observed at the relevant documents: (100%, 10%), (66%, 20%), (50%, 30%), (40%, 40%), (33%, 50%)
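A minimal sketch (function and variable names are mine, not from the slides) that reproduces the (precision, recall) points above by walking down the ranking:

```python
def precision_recall_points(ranking, relevant):
    """(precision, recall) observed each time a relevant document appears in the ranking."""
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / rank, hits / len(relevant)))
    return points

Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
print(precision_recall_points(ranking, Rq))
# [(1.0, 0.1), (0.666..., 0.2), (0.5, 0.3), (0.4, 0.4), (0.333..., 0.5)]
```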

  6. 11 standard recall levels for a query • precision versus recall based on 11 standard recall levels: 0%, 10%, 20%, …, 100% • [Figure: interpolated precision versus recall curve for a single query; axes: precision (%) versus recall (%)]

  7. 11 standard recall levels for several queries • average the precision figures at each recall level: P(r) = (1/Nq) Σi Pi(r) • P(r): the average precision at the recall level r • Nq: the number of queries used • Pi(r): the precision at recall level r for the i-th query
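A short sketch of this averaging step (not from the slides), assuming each query's precision figures have already been interpolated to the 11 standard recall levels as described on the next slide:

```python
def average_precision_over_queries(per_query_precisions):
    """P(r) = (1/Nq) * sum_i Pi(r) at each of the 11 standard recall levels.

    per_query_precisions: a list of Nq lists, each holding Pi(r) at r = 0.0, 0.1, ..., 1.0
    """
    n_q = len(per_query_precisions)
    return [sum(p_i[j] for p_i in per_query_precisions) / n_q for j in range(11)]

# Hypothetical two-query example
q1 = [1.0, 1.0, 0.8, 0.8, 0.6, 0.6, 0.4, 0.4, 0.2, 0.2, 0.2]
q2 = [1.0, 0.8, 0.8, 0.6, 0.6, 0.4, 0.4, 0.2, 0.2, 0.2, 0.0]
print(average_precision_over_queries([q1, q2]))   # the element-wise mean of q1 and q2
```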

  8. Necessity of an interpolation procedure • Rq = {d3, d56, d129} • Ranking (• marks a relevant document):
1. d123      6. d9        11. d38
2. d84       7. d511      12. d48
3. d56 •     8. d129 •    13. d250
4. d6        9. d187      14. d113
5. d8        10. d25      15. d3 •
• (precision, recall) points: (33.3%, 33.3%), (25%, 66.6%), (20%, 100%)
• How about the precision figures at the recall levels 0, 0.1, 0.2, 0.3, …, 1?

  9. Interpolation procedure • rj (j ∈ {0, 1, 2, …, 10}): a reference to the j-th standard recall level (e.g., r5 references the recall level 50%) • P(rj) = max P(r) for rj ≤ r ≤ rj+1 • Example (• marks a relevant document): d56 • (33.3%, 33.3%), d129 • (25%, 66.6%), d3 • (20%, 100%) • Interpolated precision at the 11 standard recall levels:
r0: (33.33%, 0%)    r1: (33.33%, 10%)   r2: (33.33%, 20%)   r3: (33.33%, 30%)
r4: (25%, 40%)      r5: (25%, 50%)      r6: (25%, 60%)      r7: (20%, 70%)
r8: (20%, 80%)      r9: (20%, 90%)      r10: (20%, 100%)
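A sketch of the 11-point interpolation for one query. The rule above is interpreted here as "the interpolated precision at standard level rj is the maximum precision observed at any recall ≥ rj", which reproduces the example values:

```python
def precision_recall_points(ranking, relevant):
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / rank, hits / len(relevant)))
    return points

def eleven_point_interpolated(ranking, relevant):
    """Interpolated precision at the 11 standard recall levels 0.0, 0.1, ..., 1.0."""
    points = precision_recall_points(ranking, relevant)
    interpolated = []
    for j in range(11):
        r_j = j / 10
        # maximum precision at any recall level >= r_j (0 if recall r_j is never reached)
        candidates = [p for p, r in points if r >= r_j - 1e-9]
        interpolated.append(max(candidates) if candidates else 0.0)
    return interpolated

ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
print([round(p, 3) for p in eleven_point_interpolated(ranking, {"d3", "d56", "d129"})])
# [0.333, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.2, 0.2, 0.2, 0.2]
```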

  10. Precision versus recall figures: compare the retrieval performance of distinct retrieval algorithms over a set of example queries • The curve of precision versus recall which results from averaging the results for various queries • [Figure: averaged precision versus recall curves; axes: precision (%) versus recall (%)]

  11. Average Precision at Given Document Cutoff Values • Compute the average precision when 5, 10, 15, 20, 30, 50 or 100 documents have been retrieved (i.e., precision at fixed document cutoff values) • Provides additional information on the retrieval performance of the ranking algorithm
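A sketch of precision at fixed document cutoff values, reusing the ranking and relevant set from the slide-5 example:

```python
def precision_at_k(ranking, relevant, k):
    """Precision after the first k documents of the ranking have been examined."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
print([precision_at_k(ranking, Rq, k) for k in (5, 10, 15)])   # [0.4, 0.4, 0.333...]
```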

  12. Single Value Summaries: compare the retrieval performance of a retrieval algorithm for individual queries • Average precision at seen relevant documents • Generate a single value summary of the ranking by averaging the precision figures obtained after each new relevant document is observed • Example (• marks a relevant document; the precision observed at that point is in parentheses):
1. d123 • (1)       6. d9 • (0.5)      11. d38
2. d84              7. d511            12. d48
3. d56 • (0.66)     8. d129            13. d250
4. d6               9. d187            14. d113
5. d8               10. d25 • (0.4)    15. d3 • (0.33)
• (1 + 0.66 + 0.5 + 0.4 + 0.33) / 5 = 0.57
• Favors systems which retrieve relevant documents quickly
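A sketch of this single value summary; following the slide, the average is taken over the relevant documents actually seen in the ranking (five here), not over all |R| relevant documents:

```python
def average_precision_at_seen_relevant(ranking, relevant):
    """Average of the precision values observed at each relevant document in the ranking."""
    precisions, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
print(average_precision_at_seen_relevant(ranking, Rq))
# 0.58 with unrounded precisions (the slide rounds each term first and reports 0.57)
```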

  13. Single Value Summaries (Continued) • Reciprocal Rank (RR) • Equal to the precision at the position of the first retrieved relevant document • Useful for tasks that need only one relevant document, e.g., question answering • Mean Reciprocal Rank (MRR) • The mean of RR over several queries
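A sketch of RR and MRR (the function names and the two-query example are mine, not from the slides):

```python
def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document retrieved (0 if none is retrieved)."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Mean of the reciprocal ranks over several queries; runs holds (ranking, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

runs = [(["d123", "d84", "d56", "d6"], {"d56"}),   # first relevant at rank 3 -> RR = 1/3
        (["d9", "d25", "d3"], {"d9", "d3"})]       # first relevant at rank 1 -> RR = 1
print(mean_reciprocal_rank(runs))                  # (1/3 + 1) / 2 = 0.666...
```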

  14. Single Value Summaries (Continued) • R-Precision • Generate a single value summary of the ranking by computing the precision at the R-th position in the ranking, where R is the total number of relevant documents for the current query • Example 1, R = 10 (• marks a relevant document):
1. d123 •    6. d9 •
2. d84       7. d511
3. d56 •     8. d129
4. d6        9. d187
5. d8        10. d25 •
R = 10 and # relevant in the top 10 = 4, so R-precision = 4/10 = 0.4
• Example 2, R = 3:
1. d123
2. d84
3. d56 •
R = 3 and # relevant in the top 3 = 1, so R-precision = 1/3 = 0.33
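A sketch of R-precision that reproduces the two examples above:

```python
def r_precision(ranking, relevant):
    """Precision at position R of the ranking, where R is the total number of relevant documents."""
    r = len(relevant)
    return sum(1 for doc in ranking[:r] if doc in relevant) / r

ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
print(r_precision(ranking, {"d3", "d5", "d9", "d25", "d39",
                            "d44", "d56", "d71", "d89", "d123"}))  # R = 10 -> 4/10 = 0.4
print(r_precision(ranking, {"d3", "d56", "d129"}))                 # R = 3  -> 1/3 ≈ 0.33
```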

  15. Single Value Summaries (Continued) • Precision Histograms • An R-precision graph for several queries • Compare the retrieval history of two algorithms • RPA/B(i) = RPA(i) − RPB(i): the difference between the R-precision values of algorithms A and B for the i-th query • RPA/B(i) = 0: both algorithms have equivalent performance for the i-th query • RPA/B(i) > 0: A has better retrieval performance for query i • RPA/B(i) < 0: B has better retrieval performance for query i

  16. Single Value Summaries (Continued) • [Figure: precision histogram for ten example queries; x-axis: query number (1–10), y-axis: RPA/B, ranging from −1.5 to 1.5]

  17. Summary Table Statistics • Statistical summary regarding the set of all the queries in a retrieval task • the number of queries used in the task • the total number of documents retrieved by all queries • the total number of relevant documents which were effectively retrieved when all queries are considered • the total number of relevant documents which could have been retrieved by all queries • …

  18. Precision and Recall Appropriateness • Estimation of maximal recall requires knowledge of all the documents in the collection • Recall and precision capture different aspects of the set of retrieved documents • Recall and precision measure the effectiveness over queries in batch mode • Recall and precision are defined under the enforcement of linear ordering of the retrieved documents

  19. The Harmonic Mean • harmonic mean F(j) of recall and precision: F(j) = 2 / (1/R(j) + 1/P(j)) = 2 · P(j) · R(j) / (P(j) + R(j)) • R(j): the recall for the j-th document in the ranking • P(j): the precision for the j-th document in the ranking

  20. Example (harmonic mean) • Rq = {d3, d56, d129}; ranking as on slide 8 (• marks a relevant document):
1. d123      6. d9        11. d38
2. d84       7. d511      12. d48
3. d56 •     8. d129 •    13. d250
4. d6        9. d187      14. d113
5. d8        10. d25      15. d3 •
• (precision, recall) at the relevant documents: (33.3%, 33.3%), (25%, 66.6%), (20%, 100%)
• Harmonic mean at those positions: F(3) ≈ 33.3%, F(8) ≈ 36.4%, F(15) ≈ 33.3%

  21. The E Measure • E evaluation measure: E(j) = 1 − (1 + b²) / (b²/R(j) + 1/P(j)) • Allows the user to specify, through the parameter b, whether he is more interested in recall or in precision • b = 1 makes E(j) = 1 − F(j), the complement of the harmonic mean
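A combined sketch of the harmonic mean F(j) from slide 19 and the E measure, assuming the E formula given above; with b = 1, E reduces to 1 − F:

```python
def harmonic_mean_f(precision, recall):
    """F(j) = 2 / (1/R(j) + 1/P(j))."""
    if precision == 0 or recall == 0:
        return 0.0
    return 2 / (1 / recall + 1 / precision)

def e_measure(precision, recall, b=1.0):
    """E(j) = 1 - (1 + b^2) / (b^2 / R(j) + 1 / P(j)); b is chosen by the user."""
    if precision == 0 or recall == 0:
        return 1.0
    return 1 - (1 + b ** 2) / (b ** 2 / recall + 1 / precision)

# (precision, recall) pairs from the example on slide 20
for p, r in [(1 / 3, 1 / 3), (0.25, 2 / 3), (0.2, 1.0)]:
    print(round(harmonic_mean_f(p, r), 3), round(e_measure(p, r, b=1.0), 3))
# 0.333 0.667  /  0.364 0.636  /  0.333 0.667
```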

  22. User-oriented measures • Basic assumption of previous evaluation • The set of relevant documents for a query is the same, independent of the user • User-oriented measures • coverage ratio • novelty ratio • relative recall • recall effort

  23. User-oriented measures (Continued) • Let A be the answer set proposed by the system, U the set of relevant documents known to the user, Rk the relevant documents known to the user which were retrieved, and Ru the relevant documents previously unknown to the user which were retrieved • coverage ratio = |Rk| / |U|; a high coverage ratio means the system finds most of the relevant documents the user expected to see • novelty ratio = |Ru| / (|Ru| + |Rk|); a high novelty ratio means the system reveals many new relevant documents which were previously unknown • relative recall = # of relevant documents found / # of relevant documents the user expected to find • recall effort = # of relevant documents the user expected to find / # of documents examined to find the expected relevant documents
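A sketch of the two ratio measures under the set definitions given above (the document identifiers below are hypothetical):

```python
def coverage_ratio(answer, known_relevant):
    """|Rk| / |U|: fraction of the relevant documents known to the user that were retrieved."""
    rk = answer & known_relevant
    return len(rk) / len(known_relevant)

def novelty_ratio(answer, relevant, known_relevant):
    """|Ru| / (|Ru| + |Rk|): fraction of the retrieved relevant documents that are new to the user."""
    retrieved_relevant = answer & relevant
    ru = retrieved_relevant - known_relevant
    rk = retrieved_relevant & known_relevant
    return len(ru) / (len(ru) + len(rk))

answer = {"d1", "d2", "d3", "d4", "d5", "d6"}     # A: documents proposed by the system
relevant = {"d1", "d3", "d5", "d7", "d9"}         # all relevant documents
known = {"d1", "d3", "d7", "d9"}                  # U: relevant documents the user already knew
print(coverage_ratio(answer, known))              # Rk = {d1, d3} -> 2/4 = 0.5
print(novelty_ratio(answer, relevant, known))     # Ru = {d5}     -> 1/3 ≈ 0.33
```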

  24. A More Modern Relevance Metric for Web Search • Normalized Discounted Cumulated Gain (NDCG) • K. Järvelin and J. Kekäläinen (TOIS 2002) • Gain: the relevance of a document is no longer binary • Sensitive to the position of the highest-rated documents • Log-based discounting of gains according to their positions • Normalize the DCG with the “ideal set” DCG.

  25. NDCG Example • Assume that the relevance scores 0 – 3 are used. G’=<3, 2, 3, 0, 0, 1, 2, 2, 3, 0, …> • Cumulated Gain (CG) CG’=<3, 5, 8, 8, 8, 9, 11, 13, 16, 16, …>

  26. NDCG Example (Continued) • Discounted Cumulated Gain (DCG): let b = 2; gains at ranks i ≥ b are divided by logb i, so DCG’ = <3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61, …> • Normalized Discounted Cumulated Gain (NDCG): divide DCG’ by the DCG of the ideal ranking • Ideal vector I’ = <3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, …> • CGI’ = <3, 6, 9, 11, 13, 15, 16, 17, 18, 19, 19, 19, 19, …> • DCGI’ = <3, 6, 7.89, 8.89, 9.75, 10.52, 10.88, 11.21, 11.53, 11.83, 11.83, …> • NDCG’ = DCG’ / DCGI’ = <1, 0.83, 0.87, 0.78, 0.71, 0.69, 0.73, 0.77, 0.83, 0.81, …>
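A sketch that reproduces the whole example, using the b-base logarithmic discounting described above (ranks below b are not discounted); names are mine:

```python
import math

def cumulated_gain(gains):
    """CG[i] = CG[i-1] + G[i]."""
    out, total = [], 0
    for g in gains:
        total += g
        out.append(total)
    return out

def discounted_cumulated_gain(gains, b=2):
    """DCG[i] = CG[i] for i < b, otherwise DCG[i-1] + G[i] / log_b(i)."""
    out, total = [], 0.0
    for i, g in enumerate(gains, start=1):
        total += g if i < b else g / math.log(i, b)
        out.append(total)
    return out

def ndcg(gains, ideal_gains, b=2):
    """NDCG[i] = DCG[i] / DCG_ideal[i]."""
    return [d / di for d, di in zip(discounted_cumulated_gain(gains, b),
                                    discounted_cumulated_gain(ideal_gains, b))]

G = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]          # G' from slide 25
I = [3, 3, 3, 2, 2, 2, 1, 1, 1, 1]          # ideal vector I'
print(cumulated_gain(G))                                     # [3, 5, 8, 8, 8, 9, 11, 13, 16, 16]
print([round(x, 2) for x in discounted_cumulated_gain(G)])   # [3.0, 5.0, 6.89, ..., 9.61]
print([round(x, 2) for x in ndcg(G, I)])                     # [1.0, 0.83, 0.87, ..., 0.81]
```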

  27. Test Collections • Components • Document set (document collection) • Queries (topics) • Relevance judgments • Uses • Design and development: system testing • Evaluation: measuring system effectiveness • Comparison: comparing different systems and different techniques • Benchmarking • The benchmark items depend on the purpose of the evaluation • Quantitative measures, such as precision and recall

  28. Test Collections (Continued) • Small test collections • Early: Cranfield • English: SMART Collections, OHSUMED, Cystic Fibrosis, LISA, … • Japanese: BMIR-J2 • Large-scale evaluation environments, providing test collections and a forum for discussion • United States: TREC • Japan: NTCIR, IREX • Europe: AMARYLLIS, CLEF

  29. Early test collections • Short bibliographic records, such as titles, abstracts, and keywords • Specialized subject domains • Recent test collections • Full text on many subjects, with detailed topics • Large scale

  30. Cranfield II (ftp://ftp.cs.cornell.edu/pub/smart/cran/) • Compared the retrieval effectiveness of 33 different indexing methods • Collected 1,400 documents (in abstract form) on aerodynamics; each author was asked to pose queries based on these documents and on the topic he was researching at the time, and screening produced more than 200 queries • Sample query: • .I 001 • .W • what similarity laws must be obeyed when constructing aeroelastic models of heated high speed aircraft?

  31. Cranfield II (Continued) • The relevance judgments of the Cranfield II collection were built in four steps • First, the person who posed each query judged the relevance of the citations and references attached to the documents • Next, five graduate students in the field checked every query against every document, spending 1,500 hours and making more than 500,000 relevance judgments, in the hope of finding all the relevant documents • To catch anything still missed, bibliographic coupling was used to compute the relatedness between documents and uncover further possibly relevant documents; two or more documents that cite one or more papers in common are said to be coupled • Finally, all the documents found in the previous steps were sent back to the original authors for verification (the first three steps find possibly relevant documents; the last step verifies them)

  32. TREC: Overview • TREC: Text REtrieval Conference • Sponsors: NIST and DARPA; one of the subprojects of the TIPSTER text program • Leader: Donna Harman (Manager of the Natural Language Processing and Information Retrieval Group of the Information Access and User Interfaces Division, NIST) • Document collection • More than 5 GB • Several million documents

  33. History • TREC-1 (Text REtrieval Conference), Nov 1992 • TREC-2, Aug 1993 • TREC-3 • TREC-7: January 16, 1998 -- submit application to NIST. Beginning February 2 -- document disks distributed to those new participants who have returned the required forms. June 1 -- 50 new test topics for the ad hoc task distributed. August 3 -- ad hoc results due at NIST. September 1 -- latest track submission deadline. September 4 -- speaker proposals due at NIST. October 1 -- relevance judgments and individual evaluation scores due back to participants. Nov. 9-11 -- TREC-7 conference at NIST in Gaithersburg, Md. • TREC-8 (1999), TREC-9 (2000), TREC-10 (2001), …

  34. The Test Collection • the documents • the example information requests (called topics in TREC) • the relevance judgments (the right answers)

  35. The Documents • Disk 1 (1 GB) • WSJ: Wall Street Journal (1987, 1988, 1989) • AP: AP Newswire (1989) • ZIFF: Articles from Computer Select disks (Ziff-Davis Publishing) • FR: Federal Register (1989) • DOE: Short abstracts from DOE publications • Disk 2 (1 GB) • WSJ: Wall Street Journal (1990, 1991, 1992) • AP: AP Newswire (1988) • ZIFF: Articles from Computer Select disks • FR: Federal Register (1988)

  36. The Documents (Continued) • Disk 3 (1 GB) • SJMN: San Jose Mercury News (1991) • AP: AP Newswire (1990) • ZIFF: Articles from Computer Select disks • PAT: U.S. Patents (1993) • Statistics • document lengths: DOE (very short documents) vs. FR (very long documents) • range of document lengths: AP (similar in length) vs. WSJ and ZIFF (wider range of lengths)

  37. TREC document collections • [Figure: statistics of the TREC document collections] • DOE (very short documents) vs. FR (very long documents) • AP (similar in length) vs. WSJ and ZIFF (wider range of lengths)

  38. Document Format (in Standard Generalized Mark-up Language, SGML) <DOC> <DOCNO>WSJ880406-0090</DOCNO> <HL>AT&T Unveils Services to Upgrade Phone Networks Under Global Plan </HL> <AUTHOR>Janet Guyon (WSJ staff) </AUTHOR> <DATELINE>New York</DATELINE> <TEXT> American Telephone & Telegraph Co. introduced the first of a new generation of phone services with broad implications for computer and communications . . </TEXT> </DOC>

  39. TREC document markup • [Figure: example of TREC document tagging]

  40. The Topics • Issue 1 • allow a wide range of query construction methods • keep the topic (user need) distinct from the query (the actual text submitted to the system) • Issue 2 • increase the amount of information available about each topic • include with each topic a clear statement of what criteria make a document relevant • TREC • 50 topics/year, 400 topics (TREC-1~TREC-7)

  41. Sample Topics used in TREC-1 and TREC-2
<top>
<head>Tipster Topic Description
<num>Number: 066
<dom>Domain: Science and Technology
<title>Topic: Natural Language Processing
<desc>Description: (one-sentence description)
Document will identify a type of natural language processing technology which is being developed or marketed in the U.S.
<narr>Narrative: (complete description of document relevance for assessors)
A relevant document will identify a company or institution developing or marketing a natural language processing technology, identify the technology, and identify one or more features of the company’s product.
<con>Concepts: (a mini-knowledge base about the topic, such as a real searcher might possess)
1. natural language processing
2. translation, language, dictionary, font
3. software applications

  42. Sample Topics used in TREC-1 and TREC-2 (Continued)
<fac>Factors: (allow easier automatic query building by listing specific items from the narrative that constrain the documents that are relevant)
<nat>Nationality: U.S.
</fac>
<def>Definition(s):
</top>

  43. TREC-1 and TREC-2 topics (figure)

  44. TREC-3 topics (figure)

  45. Sample Topics used in TREC-3
<num>Number: 168
<title>Topic: Financing AMTRAK
<desc>Description: A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK)
<narr>Narrative: A relevant document must provide information on the government’s responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.

  46. Features of topics in TREC-3 • The topics are shorter. • The topics lack the complex structure of the earlier topics. • The concept field has been removed. • The topics were written by the same group of users that did the assessments. • Summary: • TREC-1 and 2 (topics 1-150): suited to the routing task • TREC-3 (topics 151-200): suited to the ad hoc task

  47. TREC-4 topics • TREC-4 kept only the topic field; TREC-5 adjusted the topics back to a structure similar to TREC-3, but with a shorter average length.

  48. TREC topics • Topic structure and length • Topic construction • Topic selection • pre-search • judging the number of relevant documents

  49. The topic selection procedure of TREC-6 (figure)

  50. The Relevance Judgments • For each topic, compile a list of relevant documents. • Approaches • full relevance judgments (impossible): judge over 1M documents for each topic, resulting in 100M judgments • random sample of documents (insufficient relevance sample): relevance judgments done on the random sample only • TREC approach (pooling method): make relevance judgments on the sample of documents selected by the various participating systems; assumption: the vast majority of relevant documents have been found, and documents that have not been judged can be assumed to be not relevant • pooling method • Take the top 100 documents retrieved by each system for a given topic. • Merge them into a pool for relevance assessment. • The pool is given to human assessors for relevance judgments.
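A sketch of the pool construction step under these assumptions (the system rankings and the pool depth below are hypothetical; TREC used a depth of 100):

```python
def build_pool(runs, depth=100):
    """Merge the top-`depth` documents of every participating system's ranking
    for one topic into a single pool; each pooled document is judged only once."""
    pool = set()
    for ranking in runs:
        pool.update(ranking[:depth])
    return pool

# Hypothetical rankings from three participating systems for one topic
runs = [["d1", "d7", "d3", "d9"],
        ["d7", "d2", "d1", "d8"],
        ["d5", "d1", "d6", "d4"]]
print(sorted(build_pool(runs, depth=3)))   # ['d1', 'd2', 'd3', 'd5', 'd6', 'd7']
```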
