CS 430: Information Discovery

CS 430: Information Discovery Lecture 7 Evaluation of Retrieval Effectiveness: Cranfield and TREC

Course administration
• Cancellation of Discussion Class 3
• There was a delay in setting up the email address cs430-1@cornell.edu. Until this address is available, send your assignments to cs430@cs.cornell.edu.

Oxford English Dictionary

Tries: Search for Substring
Basic concept
The text is divided into unique semi-infinite strings, or sistrings. Each sistring has a starting position in the text, and continues to the right until it is unique.
The sistrings are stored in (the leaves of) a tree, the suffix tree. Common parts are stored only once.
Each sistring can be associated with a location within a document where the sistring occurs. Subtrees below a certain node represent all occurrences of the substring represented by that node.
Suffix trees have a size of the same order of magnitude as the input documents.
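The idea that the subtree below a node represents all occurrences of a substring can be sketched in a few lines of Python. This is a minimal illustration only: it builds an uncompressed trie of every suffix (quadratic in the text length), not the compact suffix tree described above, and the names build_suffix_trie and find_all are assumptions made for this example.

```python
def build_suffix_trie(text):
    """Insert every suffix of `text` into a character trie.

    Each node records the start positions of all suffixes that pass
    through it, i.e. all occurrences of the substring spelled by the
    path from the root to that node.
    """
    root = {"children": {}, "positions": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(
                ch, {"children": {}, "positions": []}
            )
            node["positions"].append(start)
    return root


def find_all(root, pattern):
    """Return the 0-based start positions of `pattern`, or [] if absent."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []
    return node["positions"]


trie = build_suffix_trie("abracadabra")
print(find_all(trie, "abra"))  # start positions of every occurrence
```

A search follows the pattern down from the root and reads off the positions stored at the node it reaches, so the cost depends on the pattern length rather than on the text length.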

Tries: Suffix Tree
Example: suffix tree for the following words: begin, beginning, between, bread, break
b
  e
    gin
      _
      ning
    tween
  rea
    d
    k

Tries: Sistrings
A binary example
String: 01 100 100 010 111
Sistrings:
1  01 100 100 010 111
2  11 001 000 101 11
3  10 010 001 011 1
4  00 100 010 111
5  01 000 101 11
6  10 001 011 1
7  00 010 111
8  00 101 11

Tries: Lexical Ordering
7  00 010 111
4  00 100 010 111
8  00 101 11
5  01 000 101 11
1  01 100 100 010 111
6  10 001 011 1
3  10 010 001 011 1
2  11 001 000 101 11
(In the original slide, the unique prefix of each sistring is indicated in blue.)
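The binary example can be reproduced with a short Python sketch: generate the first eight sistrings of the bit string, sort them lexically, and compute for each the shortest prefix that distinguishes it from the others. The function names are assumptions made for this illustration, and the sketch assumes that no sistring is a prefix of another, which holds for this example.

```python
def sistrings(bits, count):
    """Map each 1-based start position to the sistring that begins there."""
    return {i + 1: bits[i:] for i in range(count)}


def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def unique_prefixes(by_position):
    """Sort sistrings lexically and return (position, unique prefix) pairs.

    In sorted order the longest shared prefix is always with a neighbor,
    so one more bit than that is enough to make the sistring unique.
    """
    ordered = sorted(by_position.items(), key=lambda item: item[1])
    result = []
    for k, (pos, s) in enumerate(ordered):
        shared = 0
        if k > 0:
            shared = max(shared, common_prefix_len(s, ordered[k - 1][1]))
        if k + 1 < len(ordered):
            shared = max(shared, common_prefix_len(s, ordered[k + 1][1]))
        result.append((pos, s[: shared + 1]))
    return result


TEXT = "01100100010111"
for pos, prefix in unique_prefixes(sistrings(TEXT, 8)):
    print(pos, prefix)
```

The positions come out in the order 7, 4, 8, 5, 1, 6, 3, 2, matching the lexical ordering above, and the length of each unique prefix is the depth at which that sistring becomes a leaf in a binary trie.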

Trie: Basic Concept
[Figure: binary trie of the eight sistrings; edges are labeled 0 or 1 and the leaves are the sistring numbers 1 to 8.]

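Patricia Tree
[Figure: Patricia tree for the eight sistrings; internal nodes carry bit numbers, edges are labeled with bits, and the leaves are the sistring numbers 1 to 8.]
Single-descendant nodes are eliminated. Each node is labeled with a bit number.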

Discussion Class 3 Stemming Algorithms

Question 1: Conflation methods
(a) Define the terms: stem, suffix, prefix, conflation, morpheme
(b) Define the terms in the following diagram:
Conflation methods
  Manual
  Automatic (stemmers)
    Affix removal
      Longest match
      Simple removal
    Successor variety
    Table lookup
    n-gram

Question 2: CATALOG System
Search Term: users
Term        Occurrences
1. user     15
2. users    1
3. used     3
4. using    2
Which term (0 = none, CR = all):
(a) The CATALOG stemmer differs in a fundamental way from other tools that we have seen in this course. What is it?
(b) What impact does this have on measurements of precision and recall?

Question 3: Successor variety methods
Test word: FINDABLE
Corpus: ABLE, APE, DAB, DABBLE, FIN, FIND, FINDABLE, FINDER, FOUND, FINISH, FIXABLE
Prefix      Successor Variety   Letters
F
FI
FIN
FIND
FINDA
FINDAB
FINDABL
FINDABLE
(a) Fill in this table

Question 3 (continued): Successor variety methods
(b) Segment FINDABLE using the complete word segmentation method.
(c) Segment FINDABLE using the peak and plateau method.

Question 4: n-gram methods
(a) Explain the following notation:
statistics => st ta at ti is st ti ic cs
unique digrams => at cs ic is st ta ti
statistical => st ta at ti is st ti ic ca al
unique digrams => al at ca ic is st ta ti
(b) Calculate the similarity using Dice's coefficient:
S = 2C / (A + B)
A is the number of unique digrams in the first term
B is the number of unique digrams in the second term
C is the number of shared unique digrams
(c) Explain the statement that it is a bit confusing to call this a "stemming method".
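The digram notation and Dice's coefficient can be checked with a small Python sketch. The function names are assumptions made for this illustration; the formula is the one given in part (b).

```python
def unique_digrams(term):
    """Return the set of distinct adjacent letter pairs in `term`."""
    return {term[i:i + 2] for i in range(len(term) - 1)}


def dice(first, second):
    """Dice's coefficient S = 2C / (A + B) over unique digrams."""
    a = unique_digrams(first)
    b = unique_digrams(second)
    if not a and not b:
        return 0.0  # neither term has a digram; avoid dividing by zero
    shared = a & b
    return 2 * len(shared) / (len(a) + len(b))


print(sorted(unique_digrams("statistics")))
print(sorted(unique_digrams("statistical")))
print(dice("statistics", "statistical"))
```

For the two terms in the question, A = 7, B = 8 and C = 6, so S = 12 / 15 = 0.8.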

Question 5: Porter's algorithm (a) What is an iterative, longest match stemmer? (b) How is longest match achieved in the Porter algorithm?

Question 6: Porter's algorithm
Conditions   Suffix   Replacement   Examples
(m > 0)      eed      ee            feed -> feed, agreed -> agree
(*v*)        ed       null          plastered -> plaster, bled -> bled
(*v*)        ing      null          motoring -> motor, sing -> sing
(a) Explain this table
(b) How does this table apply to: "exceeding", "ringed"?
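The three rules in the table can be sketched in Python as follows. This covers only these rules, assumes lower-case input, and omits the cleanup that the full Porter algorithm applies after removing "ed" or "ing"; the function names are assumptions made for this illustration. Here m counts the vowel-to-consonant transitions in the stem, and (*v*) means that the stem contains a vowel.

```python
VOWELS = "aeiou"


def is_consonant(word, i):
    """Porter's definition: 'y' counts as a vowel when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """m: the number of vowel-run to consonant-run transitions in the stem."""
    kinds = [is_consonant(stem, i) for i in range(len(stem))]
    return sum(1 for prev, cur in zip(kinds, kinds[1:]) if not prev and cur)


def contains_vowel(stem):
    """The (*v*) condition."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def apply_rules(word):
    """Apply the rule with the longest matching suffix; only that rule is tried."""
    if word.endswith("eed"):
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem if contains_vowel(stem) else word
    return word


for w in ("feed", "agreed", "plastered", "bled", "motoring", "sing"):
    print(w, "->", apply_rules(w))
```

Testing "eed" before "ed" is what gives the longest match: "feed" matches the "eed" rule, fails the condition m > 0, and is left unchanged rather than falling through to the "ed" rule.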

Question 7: Evaluation (a) In Web search engines, the tendency is not to use stemming. Why? (There are at least three answers.) (b) Does your answer to part (a) mean that stemming is no longer useful?


Retrieval Effectiveness
Designing an information retrieval system involves many decisions:
Manual or automatic indexing?
Natural language or controlled vocabulary?
What stoplists?
What stemming methods?
What query syntax?
etc.
How do we know which of these methods are most effective? Is everything a matter of judgment?

Studies of Retrieval Effectiveness
• The Cranfield Experiments, Cyril W. Cleverdon, Cranfield College of Aeronautics, 1957-1968
• SMART System, Gerald Salton, Cornell University, 1964-1988
• TREC, Donna Harman, National Institute of Standards and Technology (NIST), 1992-

Cranfield Experiments (Example)
Comparative efficiency of indexing systems: (Universal Decimal Classification, alphabetical subject index, a special facet classification, Uniterm system of co-ordinate indexing)
Four indexes prepared manually for each document in three batches of 6,000 documents -- total 18,000 documents, each indexed four times. The documents were reports and papers in aeronautics.
Indexes for testing were prepared on index cards and other cards.
Very careful control of indexing procedures.

Cranfield Experiments (continued)
Searching:
• 1,200 test questions, each satisfied by at least one document
• Reviewed by expert panel
• Searches carried out by 3 expert librarians
• Two rounds of searching to develop testing methodology
• Subsidiary experiments at English Electric Whetstone Laboratory and Western Reserve University

The Cranfield Data
The Cranfield data was made widely available and used by other researchers
• Salton used the Cranfield data with the SMART system (a) to study the relationship between recall and precision, and (b) to compare automatic indexing with human indexing
• Sparck Jones and van Rijsbergen used the Cranfield data for experiments in relevance weighting, clustering, definition of test corpora, etc.

Some Cranfield Results
• The various manual indexing systems have similar retrieval efficiency
• Automatic indexing can be at least as effective for retrieval as manual indexing with controlled vocabularies
-> original results from the Cranfield experiments
-> considered counter-intuitive
-> other results since then have supported this conclusion

Cranfield Experiments -- Analysis
Cleverdon introduced recall and precision, based on the concept of relevance.
[Figure: plot with axes recall (%) and precision (%), marking the region occupied by practical systems.]
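The slide does not spell out the two measures. Under the standard definitions, precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. A minimal Python sketch, with a hypothetical function name and made-up document identifiers:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: the documents returned by the system
    relevant:  the documents judged relevant to the query
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 5 relevant, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        ["d2", "d4", "d7", "d9", "d10"])
print(p, r)
```

Both values depend entirely on the set of documents judged relevant, so they inherit any uncertainty in those relevance judgments.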

The Cranfield methodology
• Recall and precision: depend on concept of relevance
-> Is relevance a context-, task-independent property of documents?
"Relevance is the correspondence in context between an information requirement statement (a query) and an article (a document), that is, the extent to which the article covers the material that is appropriate to the requirement statement."
F. W. Lancaster, 1979

Relevance
• Recall and precision values are for a specific set of documents and a specific set of queries
• Relevance is subjective, but experimental evidence suggests that for textual documents different experts have similar judgments about relevance
• Estimates of relevance level are less consistent
• Query types are important, depending on specificity
-> subject-heading queries
-> title queries
-> paragraphs
Tests should use realistic queries

Text Retrieval Conferences (TREC)
• Led by Donna Harman (NIST), with DARPA support
• Annual since 1992
• Corpus of several million textual documents, total of more than five gigabytes of data
• Researchers attempt a standard set of tasks
-> search the corpus for topics provided by surrogate users
-> match a stream of incoming documents against standard queries
• Participants include large commercial companies, small information retrieval vendors, and university research groups.

The TREC Corpus
Source                           Size (Mbytes)   # Docs    Median words/doc
Wall Street Journal, 87-89       267             98,732    245
Associated Press newswire, 89    254             84,678    446
Computer Selects articles        242             75,180    200
Federal Register, 89             260             25,960    391
abstracts of DOE publications    184             226,087   111
Wall Street Journal, 90-92       242             74,520    301
Associated Press newswire, 88    237             79,919    438
Computer Selects articles        175             56,920    182
Federal Register, 88             209             19,860    396

The TREC Corpus (continued)
Source                           Size (Mbytes)   # Docs    Median words/doc
San Jose Mercury News, 91        287             90,257    379
Associated Press newswire, 90    237             78,321    451
Computer Selects articles        345             161,021   122
U.S. patents, 93                 243             6,711     4,445
Financial Times, 91-94           564             210,158   316
Federal Register, 94             395             55,630    588
Congressional Record, 93         235             27,922    288
Foreign Broadcast Information    470             130,471   322
LA Times                         475             131,896   351

The TREC Corpus (continued)
Notes:
1. The TREC corpus consists mainly of general articles. The Cranfield data was in a specialized engineering domain.
2. The TREC data is raw data:
-> No stop words are removed; no stemming
-> Words are alphanumeric strings
-> No attempt made to correct spelling, sentence fragments, etc.

TREC Topic Statement
<num> Number: 409
<title> legal, Pan Am, 103
<desc> Description: What legal actions have resulted from the destruction of Pan Am Flight 103 over Lockerbie, Scotland, on December 21, 1988?
<narr> Narrative: Documents describing any charges, claims, or fines presented to or imposed by any court or tribunal are relevant, but documents that discuss charges made in diplomatic jousting are not relevant.
A sample TREC topic statement

TREC Experiments
1. NIST provides the text corpus on CD-ROM. Each participant builds an index using its own technology.
2. NIST provides 50 natural language topic statements. Each participant converts them to queries (automatically or manually).
3. Each participant runs the search and returns up to 1,000 hits to NIST. NIST analyzes the results for recall and precision (all TREC participants use rank-based methods of searching).

Relevance Assessment
For each query, a pool of potentially relevant documents is assembled, using the top 100 ranked documents from each participant.
The human expert who set the query looks at every document in the pool and determines whether it is relevant. Documents outside the pool are not examined.
In a TREC-8 example, with 71 participants:
7,100 documents in the pool
1,736 unique documents (eliminating duplicates)
94 judged relevant
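The pooling step described above can be sketched in Python. This is an illustration under assumed names and toy data, not the program NIST uses: runs maps a hypothetical participant name to its ranked list of document identifiers, and depth is the number of top-ranked documents taken from each run (100 in TREC, 2 in the toy example).

```python
def build_pool(runs, depth=100):
    """Union of the top `depth` documents from every participant's ranking."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


# Toy data: three hypothetical participants, pool depth 2.
runs = {
    "A": ["d1", "d2", "d3"],
    "B": ["d2", "d4", "d1"],
    "C": ["d5", "d1", "d6"],
}
depth = 2
submitted = sum(len(ranking[:depth]) for ranking in runs.values())
pool = build_pool(runs, depth)
print(submitted, len(pool), sorted(pool))
```

In the TREC-8 example the same arithmetic gives 71 x 100 = 7,100 submitted documents, which reduce to 1,736 unique documents because many participants rank the same documents highly. Only the documents in the pool are judged, so a relevant document that no participant ranks in its top 100 is never counted.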

A Cornell Footnote The TREC analysis uses a program developed by Chris Buckley, who spent 17 years at Cornell before completing his Ph.D. in 1995. Buckley has continued to maintain the SMART software and has been a participant at every TREC conference. SMART is used as the basis against which other systems are compared. During the early TREC conferences, the tuning of SMART with the TREC corpus led to steady improvements in retrieval efficiency, but after about TREC-5 a plateau was reached. TREC-8, in 1999, was the final year for this experiment.
