CS 430: Information Discovery


  1. CS 430: Information Discovery Lecture 4 Ranking

  2. Course Administration • The slides for Lecture 3 have been reposted with slightly revised notation. • The reading for Discussion Class 2 requires a computer connected to the network with a Cornell IP address. • Teaching assistants do not have office hours. If your query cannot be addressed by email, ask to meet with them or come to my office hours. • Assignment 1 is an individual assignment. Discuss the concepts and the choice of methods with your colleagues, but the actual programs and report must be individual work.

  3. Choice of Weights

  query q:   ant dog

  document   text                           terms
  d1         ant ant bee                    ant bee
  d2         dog bee dog hog dog ant dog    ant bee dog hog
  d3         cat gnu dog eel fox            cat dog eel fox gnu

  Weights to be chosen (? marks a term that appears):

       ant  bee  cat  dog  eel  fox  gnu  hog
  q    ?              ?
  d1   ?    ?
  d2   ?    ?         ?                   ?
  d3             ?    ?    ?    ?    ?

  What weights lead to the best information retrieval?

  4. Methods for Selecting Weights Empirical Test a large number of possible weighting schemes with actual data. (This lecture, based on work of Salton, et al.) Model based Develop a mathematical model of word distribution and derive weighting scheme theoretically. (Probabilistic model of information retrieval.)

  5. Weighting 1: Term Frequency
  Suppose term j appears f_ij times in document i. What weighting should be given to term j?
  Term Frequency: Concept
  A term that appears many times within a document is likely to be more important than a term that appears only once.

  6. Term Frequency: Free-text Document
  A simple method (as illustrated in Lecture 3) is to use f_ij as the term frequency.
  ...but, in free-text documents, terms are likely to appear more often in long documents. Therefore f_ij should be scaled by some variable related to the length of document i.

  7. Term Frequency: Free-text Document
  Standard method for free-text documents
  Scale f_ij relative to the frequency of other terms in the document. This partially corrects for variations in the length of the documents.
  Let m_i = max_j (f_ij), i.e., m_i is the maximum frequency of any term in document i.
  Term frequency (tf):  tf_ij = f_ij / m_i  when f_ij > 0
  Note: There is no special justification for taking this form of term frequency except that it works well in practice and is easy to calculate.
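  As an illustration, here is a minimal sketch of this tf calculation in Python (the function name and the sample document are illustrative, not from the lecture):

```python
from collections import Counter

def term_frequencies(tokens):
    """Compute tf_ij = f_ij / m_i for every term j in one document i."""
    counts = Counter(tokens)        # f_ij for each term j
    m_i = max(counts.values())      # m_i: frequency of the most common term
    return {term: f / m_i for term, f in counts.items()}

# Example: document d2 from slide 3 ("dog bee dog hog dog ant dog")
print(term_frequencies("dog bee dog hog dog ant dog".split()))
# {'dog': 1.0, 'bee': 0.25, 'hog': 0.25, 'ant': 0.25}
```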

  8. Weighting 2: Inverse Document Frequency
  Suppose term j appears f_ij times in document i. What weighting should be given to term j?
  Inverse Document Frequency: Concept
  A term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents.

  9. Inverse Document Frequency
  Suppose there are n documents and that the number of documents in which term j occurs is n_j.
  A possible method might be to use n/n_j as the inverse document frequency.
  Standard method
  The simple method over-emphasizes small differences. Therefore use a logarithm.
  Inverse document frequency (idf):  idf_j = log2 (n/n_j) + 1  when n_j > 0
  Note: There is no special justification for taking this form of inverse document frequency except that it works well in practice and is easy to calculate.
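  The formula translates directly into code; a one-line sketch (the function name is illustrative):

```python
import math

def inverse_document_frequency(n, n_j):
    """idf_j = log2(n / n_j) + 1, defined only when n_j > 0."""
    return math.log2(n / n_j) + 1

print(inverse_document_frequency(1000, 100))  # 4.32...
```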

  10. Example of Inverse Document Frequency
  Example: n = 1,000 documents

  term j    n_j      idf_j
  A         100      4.32
  B         500      2.00
  C         900      1.13
  D         1,000    1.00

  From: Salton and McGill

  11. Full Weighting: Standard Form of tf.idf
  Practical experience has demonstrated that weights of the following form perform well in a wide variety of circumstances:
  (weight of term j in document i) = (term frequency) * (inverse document frequency)
  The standard tf.idf weighting scheme, for free-text documents, is:
  t_ij = tf_ij * idf_j = (f_ij / m_i) * (log2 (n/n_j) + 1)  when n_j > 0
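  A minimal sketch putting the two factors together over a small collection; the function name is illustrative and the sample documents are the three from slide 3:

```python
import math
from collections import Counter

def tf_idf(documents):
    """Weight t_ij = (f_ij / m_i) * (log2(n / n_j) + 1) for each doc and term."""
    n = len(documents)
    token_lists = [doc.split() for doc in documents]
    # n_j: number of documents in which term j occurs
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    weights = []
    for tokens in token_lists:
        counts = Counter(tokens)
        m_i = max(counts.values())
        weights.append({term: (f / m_i) * (math.log2(n / doc_freq[term]) + 1)
                        for term, f in counts.items()})
    return weights

docs = ["ant ant bee", "dog bee dog hog dog ant dog", "cat gnu dog eel fox"]
for w in tf_idf(docs):
    print(w)
```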

  12. Structured Text
  Structured texts, e.g., queries or catalog records, have a different distribution of terms from free-text. A modified expression for the term frequency is:
  tf_ij = K + (1 - K) * f_ij / m_i  when f_ij > 0
  K is a parameter between 0 and 1 that can be tuned for a specific collection.
  Query
  To weight terms in the query, Salton and Buckley recommend K equal to 0.5.
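  A sketch of this modified term frequency, assuming the formula above (the function name and default value are illustrative; K = 0.5 matches the query recommendation):

```python
def structured_tf(f_ij, m_i, K=0.5):
    """tf_ij = K + (1 - K) * f_ij / m_i when f_ij > 0; K is tuned per collection."""
    return K + (1 - K) * f_ij / m_i if f_ij > 0 else 0.0

print(structured_tf(1, 2))  # 0.75
```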

  13. Similarity
  The similarity between query q and document i is given by the cosine of the angle between the corresponding weighted term vectors d_q and d_i:

  cos(d_q, d_i) = ( Σ (k = 1 to n) t_qk * t_ik ) / ( |d_q| * |d_i| )

  where the components in dimension k (corresponding to term k) are given by:
  t_qk = (0.5 + 0.5 * f_qk / m_q) * (log2 (n/n_k) + 1)  when f_qk > 0
  t_ik = (f_ik / m_i) * (log2 (n/n_k) + 1)  when f_ik > 0
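  A minimal sketch of the cosine calculation, representing each weighted vector as a dict from term to weight (the representation and the sample weights are illustrative assumptions):

```python
import math

def cosine_similarity(wq, wi):
    """cos(d_q, d_i): dot product of the weight vectors over the product
    of their lengths; terms absent from a dict have weight 0."""
    dot = sum(w * wi.get(term, 0.0) for term, w in wq.items())
    norm_q = math.sqrt(sum(w * w for w in wq.values()))
    norm_i = math.sqrt(sum(w * w for w in wi.values()))
    return dot / (norm_q * norm_i) if norm_q and norm_i else 0.0

print(cosine_similarity({"ant": 1.0, "dog": 0.8}, {"dog": 1.6, "hog": 0.4}))
```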

  14. Boolean Queries
  Boolean query: two or more search terms, related by logical operators, e.g., and, or, not
  Examples:
  abacus and actor
  abacus or actor
  (abacus and actor) or (abacus and atoll)
  not actor

  15. Boolean Diagram
  [Venn diagram of two overlapping sets A and B, showing the regions A and B, A or B, and not (A or B)]

  16. Adjacent and Near Operators
  abacus adj actor
  Terms abacus and actor are adjacent to each other, as in the string "abacus actor"
  abacus near 4 actor
  Terms abacus and actor are near to each other, as in the string "the actor has an abacus"
  Some systems support other operators, such as with (two terms in the same sentence) or same (two terms in the same paragraph).

  17. Evaluation of Boolean Operators
  Precedence of operators must be defined:
  adj, near   (highest)
  and, not
  or          (lowest)
  Example
  A and B or C and B
  is evaluated as
  (A and B) or (C and B)
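  Treating each term's postings as a set makes the precedence example concrete; the postings values below are made up for illustration:

```python
# Hypothetical postings sets for terms A, B, C (illustrative only).
A = {3, 19, 22}
B = {2, 19, 29}
C = {5, 19, 34}

# "A and B or C and B": and binds tighter than or,
# so the query parses as (A and B) or (C and B).
print((A & B) | (C & B))  # {19}
```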

  18. Inverted File Inverted file: A list of search terms that are used to index a set of documents. The inverted file is organized for associative look-up, i.e., to answer the question, "In which documents does a specified search term appear?" In practical applications, the inverted file contains related information, such as the location within the document where the search terms appear.

  19. Inverted File -- Basic Concept

  Word     Documents
  abacus   3, 19, 22
  actor    2, 19, 29
  aspen    5
  atoll    11, 34

  Stop words are removed and stemming carried out before building the index.
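  A minimal sketch of building such an index, assuming stop words and stemming have already been handled (the function name and mini-collection are illustrative):

```python
from collections import defaultdict

def build_inverted_file(documents):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical mini-collection reproducing part of the slide's index
docs = {3: "abacus", 19: "abacus actor", 22: "abacus", 2: "actor", 29: "actor"}
print(build_inverted_file(docs))
# {'abacus': [3, 19, 22], 'actor': [2, 19, 29]}
```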

  20. Inverted List -- Concept
  Inverted List: All the entries in an inverted file that apply to a specific word, e.g., for "abacus":
  abacus   3, 19, 22
  Posting: An entry in an inverted list, e.g., there are three postings for "abacus".

  21. Evaluating a Boolean Query
  Example: abacus and actor
  Postings for abacus:  3, 19, 22
  Postings for actor:   2, 19, 29
  Document 19 is the only document that contains both terms, "abacus" and "actor".
  To evaluate the and operator, merge the two inverted lists with a logical AND operation.
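  A sketch of that merge, assuming both inverted lists are sorted by document id so a single linear pass suffices (the function name is illustrative):

```python
def merge_and(postings_a, postings_b):
    """Logical AND by merging two sorted inverted lists in one linear pass."""
    result, i, j = [], 0, 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

print(merge_and([3, 19, 22], [2, 19, 29]))  # [19]
```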

  22. Enhancements to Inverted Files -- Concept
  Location: The inverted file holds information about the location of each term within the document.
  Uses:
  • adjacency and near operators
  • user interface design -- highlight location of search term
  Frequency: The inverted file includes the number of postings for each term.
  Uses:
  • term weighting
  • query processing optimization

  23. Inverted File -- Concept (Enhanced)

  Word     Postings   Document   Location
  abacus   4          3          94
                      19         7
                      19         212
                      22         56
  actor    3          2          66
                      19         213
                      29         45
  aspen    1          5          43
  atoll    3          11         3
                      11         70
                      34         40

  24. Evaluating an Adjacency Operation
  Example: abacus adj actor
  Postings for abacus (document, location):  (3, 94), (19, 7), (19, 212), (22, 56)
  Postings for actor (document, location):   (2, 66), (19, 213), (29, 45)
  Document 19, locations 212 and 213, is the only occurrence of the terms "abacus" and "actor" adjacent to each other.
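  A minimal sketch of evaluating adj over positional postings, using the (document, location) pairs from the enhanced inverted file on slide 23 (the function name and the convention that the second term must immediately follow the first are assumptions):

```python
def evaluate_adj(postings_a, postings_b):
    """Evaluate 'a adj b' over positional postings: lists of
    (document, location) pairs."""
    positions_b = set(postings_b)
    # a is adjacent to b when b appears in the same document
    # at the immediately following location
    return [(doc, loc) for doc, loc in postings_a
            if (doc, loc + 1) in positions_b]

abacus = [(3, 94), (19, 7), (19, 212), (22, 56)]
actor = [(2, 66), (19, 213), (29, 45)]
print(evaluate_adj(abacus, actor))  # [(19, 212)]
```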
