CS533 Information Retrieval

CS533 Information Retrieval Dr. Michal Cutler Lecture #7 February 15, 2000

Probabilistic information retrieval • The model • Binary independence model • Non-binary independence models

Optimal Retrieval • Given a query Q and a collection • Optimal document retrieval principle • Arrange documents in descending order of probability of relevance to Q • Let OP-list denote the resulting list • If k documents are required • Take the first k documents from the OP-list
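The retrieval principle above can be sketched in a few lines. This is only an illustration: the function name `top_k` and the sample probabilities are hypothetical, and it assumes the probabilities of relevance are already known, which the next slide notes is not the case in practice.

```python
from typing import Dict, List


def top_k(prob_relevance: Dict[str, float], k: int) -> List[str]:
    """Return the first k document ids of the OP-list."""
    # Build the OP-list: descending probability of relevance.
    # Ties are broken by document id so the result is deterministic.
    op_list = sorted(prob_relevance, key=lambda d: (-prob_relevance[d], d))
    return op_list[:k]


# Hypothetical probabilities of relevance for three documents
probs = {"D1": 0.20, "D2": 0.75, "D3": 0.50}
print(top_k(probs, 2))
```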

The model • The probability of relevance of each document to the query is not available in practice • The following slides show that the exact probabilities are not needed

The model cont. • Let rel denote the event that a document is relevant to the user • P(rel| Di) is the probability that Di is relevant to the user • We need a similarity function s so that: • P(rel|Di)>P(rel|Dj) iff s(Q, Di)>s(Q, Dj)

The similarity function

The similarity function cont.

The similarity function cont. • For every document D we need to compute the ratio P(D|rel) / P(D|nonrel)

Some History • Maron and Kuhns 1960 • Robertson and Sparck Jones 1976 • Croft and Harper 1979 • Yu, Meng, and Park 1989 • Other models (Robertson and Walker, Kwok)

The model • The set of all documents is partitioned with respect to the query Q into the sets rel and nonrel. • The sets rel and nonrel change from query to query

An Independence Assumption • Terms are distributed independently of one another in both rel and nonrel • Very strong assumption • Q = “What is happening with the impeachment trial?” • The occurrence of “impeachment” in relevant documents is assumed to be independent of the occurrence of “trial”

The model • xi = di is the event that D has di occurrences of term i. • From the independence assumption: P(x|rel) = P(x1 = d1|rel) × … × P(xt = dt|rel), and likewise for nonrel

The model • Let g(x)=log(P(x|rel)/P(x|nonrel)) • The logarithm is used to make the calculations simpler by changing multiplications to sums
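As a worked step, the definition of g(x) combined with the independence assumption turns the product over terms into a sum. The slide's own formula was not preserved, so this is a reconstruction; it assumes t index terms, as in the later slides:

```latex
g(x) = \log \frac{P(x \mid rel)}{P(x \mid nonrel)}
     = \log \prod_{i=1}^{t} \frac{P(x_i = d_i \mid rel)}{P(x_i = d_i \mid nonrel)}
     = \sum_{i=1}^{t} \log \frac{P(x_i = d_i \mid rel)}{P(x_i = d_i \mid nonrel)}
```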

Computing g(x)

Binary independence model • In this model a term occurs or does not occur in a document • Let x = (x1,…,xt) denote any document in the collection, where xi is 1 or 0.

Computing g(x)

Computing the rank of D

Term relevance weights tri • The first sum in the formula for g(x = D): • Depends only on pi and qi, the probabilities that the ith term occurs in the relevant and the nonrelevant documents • Is independent of which terms occur in document D

Term relevance weights tri • The second sum depends on the actual terms which appear in D.

Term relevance weights tri • tri can be interpreted as the power of term i to discriminate between the relevant and the nonrelevant documents
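The two sums referred to above can be written out as follows. The slide formulas did not survive, so this is the standard binary independence derivation rather than the lecture's exact notation; it assumes p_i = P(x_i = 1 | rel) and q_i = P(x_i = 1 | nonrel):

```latex
g(x) = \sum_{i=1}^{t} \left[ x_i \log \frac{p_i}{q_i}
       + (1 - x_i) \log \frac{1 - p_i}{1 - q_i} \right]
     = \sum_{i=1}^{t} \log \frac{1 - p_i}{1 - q_i}
       + \sum_{i=1}^{t} x_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)},
\qquad
tr_i = \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}
```

The first sum is the same for every document, so ranking by g(x) is equivalent to ranking by the sum of tr_i over the terms that occur in D.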

Notation • N is the total number of documents • R is the total number of relevant documents • ri is the number of relevant documents containing term i • dfi is the number of documents in which term i occurs

Term frequencies
Occurrence   Relevant documents   Nonrelevant documents   Total
xi = 1       ri                   dfi - ri                dfi
xi = 0       R - ri               N - R - dfi + ri        N - dfi
Total        R                    N - R                   N

Computing tri • We can use the counts on the previous slide to compute: • pi = ri / R • qi = (dfi - ri) / (N - R) • 1 - pi = (R - ri) / R • 1 - qi = (N - R - dfi + ri) / (N - R)
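The estimates above can be turned into a small calculation. This is a sketch only: the function name and the sample counts are hypothetical, it uses tri = log(pi / (1 - pi)) + log((1 - qi) / qi) as stated later in the lecture, and the choice of natural logarithm is an assumption (the slides do not fix a base).

```python
import math


def term_relevance_weight(N: int, R: int, df: int, r: int) -> float:
    """Compute tr_i from the contingency counts N, R, df_i, r_i."""
    p = r / R                # p_i = r_i / R
    q = (df - r) / (N - R)   # q_i = (df_i - r_i) / (N - R)
    # The weight is undefined when p or q is exactly 0 or 1.
    if not (0 < p < 1 and 0 < q < 1):
        raise ValueError("p_i and q_i must lie strictly between 0 and 1")
    # Natural log is an assumption; any base preserves the ranking.
    return math.log(p / (1 - p)) + math.log((1 - q) / q)


# Hypothetical counts: 1000 documents, 10 relevant,
# term occurs in 50 documents, 5 of them relevant.
print(round(term_relevance_weight(N=1000, R=10, df=50, r=5), 4))
```

With these counts pi = 1/2 and (1 - qi) / qi = 21, so the weight is log 21.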

Computing tri

The weight of a term - example

Estimating pi and qi • ri and R are not known until the system has been used extensively and has collected relevance judgments • Various proposals have been made for estimating pi and qi. • We discuss some of them

Estimating qi • Most documents in which term i occurs are non relevant to the average query • N is large

Estimating qi • qi can be estimated by the occurrence probability of the term in the entire collection • qi= dfi / N

Estimating tri • tri = log(pi / (1 - pi)) + log((1 - qi) / qi) = C + log((1 - qi) / qi) = C + log((1 - dfi / N) / (dfi / N)) = C + log((N - dfi) / dfi), where C = log(pi / (1 - pi)) • When pi is close to 1, C is very large

Estimating pi • Assume no relevance information • The probability of term i either occurring or not occurring in the smaller set of relevant documents can be assumed to be equal. • So pi=1/2

Estimating tri • In this case the term-relevance weight is: • tri = log 1 + log((N - dfi) / dfi) = log((N - dfi) / dfi) • This formula is a form of idf
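A minimal sketch of this idf-like weight follows. The function name and the sample values are hypothetical, and the natural logarithm is an assumption, since the slides do not specify a base.

```python
import math


def idf_like_weight(N: int, df: int) -> float:
    """tr_i = log((N - df_i) / df_i), the weight when p_i = 1/2."""
    # The formula is undefined for terms in no document or in every document.
    if not 0 < df < N:
        raise ValueError("df must satisfy 0 < df < N")
    return math.log((N - df) / df)


# Rare terms get large weights; a term in half the collection gets weight 0.
for df in (10, 100, 500):
    print(df, round(idf_like_weight(1000, df), 3))
```

Note that a term occurring in more than half of the documents receives a negative weight under this formula.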

Estimating r • If term i is an ideal indexing term, it occurs only in relevant documents. • In this case ri = dfi, • pi = dfi/ R and • qi= 0.

Estimating r • If term i is a poor indexing term, it is sprinkled evenly among the relevant and nonrelevant documents. In this case, ri can be estimated by • ri = (R / N) dfi

Estimating r • We can assume that an index term lies in between an ideal and a poor one • The constants a, b, and c in the resulting estimate take intermediate values

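Estimating r • r = a·df for 0 <= df <= R, with R/N < a < 1 • r = b + c·df for R < df < N, with 0 < c < R/N • [Figure: r plotted against df, showing the lines r = df and r = (R/N)df; R is marked on the r axis, and R and N on the df axis]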
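The piecewise estimate can be sketched as below. The slides do not say how b and c are chosen, so this sketch makes two assumptions of its own: the two segments meet at df = R, and r reaches R at df = N. Under those assumptions c = R(1 - a) / (N - R) and b = R - cN, which satisfies the stated bound 0 < c < R/N whenever R/N < a < 1. The function name is hypothetical.

```python
def estimate_r(df: float, N: int, R: int, a: float) -> float:
    """Piecewise-linear estimate of r between the ideal and poor cases."""
    if not (R / N < a < 1):
        raise ValueError("a must satisfy R/N < a < 1")
    if not (0 <= df <= N):
        raise ValueError("df must satisfy 0 <= df <= N")
    if df <= R:
        return a * df  # first segment: r = a * df
    # Assumed constraints (not from the slides):
    # continuity at df = R and r = R at df = N.
    c = R * (1 - a) / (N - R)
    b = R - c * N
    return b + c * df  # second segment: r = b + c * df


# Hypothetical collection: N = 1000 documents, R = 10 relevant, a = 0.5
print(estimate_r(4, 1000, 10, 0.5))
print(estimate_r(500, 1000, 10, 0.5))
```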

Robertson and Sparck Jones

Croft’s probabilistic model • Introduced term frequency into probabilistic model • Relevance estimated by including probability that a term appears in a document

Croft’s probabilistic model

Croft’s probabilistic model • Initial search • wijk = (C + idfi)fik where • i denotes the ith term in query j and document k, • C is a constant

Croft’s probabilistic model • fik = K + (1 - K) (freqik / max freqik) • freqik is the frequency of occurrence of term i in document k • max freqik is the maximum frequency of any term in document k. • K is a constant
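The initial-search weight can be sketched as follows. The slides do not define idfi, so the form log(N / dfi) with a natural logarithm is an assumption, as are the function names and the sample values of C and K.

```python
import math


def normalized_tf(freq: int, max_freq: int, K: float) -> float:
    """f_ik = K + (1 - K) * freq_ik / max freq_ik."""
    return K + (1 - K) * freq / max_freq


def initial_weight(freq: int, max_freq: int, df: int, N: int,
                   C: float, K: float) -> float:
    """w_ijk = (C + idf_i) * f_ik for the initial search."""
    idf = math.log(N / df)  # assumed idf definition, not given in the slides
    return (C + idf) * normalized_tf(freq, max_freq, K)


# Hypothetical values: the term is the most frequent one in the document.
print(round(initial_weight(freq=5, max_freq=5, df=10, N=1000, C=1.0, K=0.5), 4))
```

Since freqik never exceeds max freqik, fik ranges from K (term absent) to 1 (term is the most frequent in the document), so K controls how much weight term frequency carries.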

Croft’s probabilistic model • Search using feedback • pij (qij) is the probability that term i occurs in the set of relevant (non relevant) documents for query j

Croft’s probabilistic model

Croft’s probabilistic model • It is assumed that nonretrieved documents are not relevant • So both R and r can be estimated from the provided feedback • A problem arises when the user indicates that no relevant documents were retrieved
