Modeling
Modeling • Modern Information Retrieval, by R. Baeza-Yates and B. Ribeiro-Neto, Addison-Wesley, 1999 (Chapter 2)
Introduction • Ranking algorithms • The central problem regarding IR systems is the issue of predicting which documents are relevant and which are not. • Taxonomy of IR Models • Boolean: set theoretic • Vector: algebraic • Probabilistic
Retrieval • Ad hoc • the documents in the collection remain relatively static while new queries are submitted to the system • Filtering (Routing) • the queries remain relatively static while new documents come into the system • construction of user profile
Basic Concepts • In the classic models • each document is described by a set of representative keywords called index terms • index terms are mainly nouns • distinct index terms have varying relevance • index term weights are usually assumed to be mutually independent
Boolean Model • Binary decision criterion • Data retrieval model • A query is a Boolean expression which can be represented as a disjunction of conjunctive vectors • Advantage • clean formalism, simplicity • Disadvantage • exact matching may lead to retrieval of too few or too many documents
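The binary decision criterion on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy collection, the term names ka, kb, kc, and the helper names `matches` and `boolean_retrieve` are hypothetical and do not come from the slides. The query ka AND (kb OR NOT kc) is written out as a disjunction of conjunctive components.

```python
from typing import Dict, List, Set

# A conjunctive component maps each index term to True (must occur)
# or False (must not occur). A query is a disjunction of such components.
Conjunct = Dict[str, bool]


def matches(doc_terms: Set[str], query_dnf: List[Conjunct]) -> bool:
    """Return True if the document satisfies at least one conjunctive component."""
    return any(
        all((term in doc_terms) == required for term, required in conj.items())
        for conj in query_dnf
    )


def boolean_retrieve(docs: Dict[str, Set[str]], query_dnf: List[Conjunct]) -> List[str]:
    """Return the ids of all matching documents. There is no ranking."""
    return sorted(doc_id for doc_id, terms in docs.items() if matches(terms, query_dnf))


# Hypothetical toy collection (assumption, for illustration only).
docs = {
    "d1": {"ka", "kb", "kc"},
    "d2": {"ka"},
    "d3": {"kb", "kc"},
    "d4": {"ka", "kc"},
}

# ka AND (kb OR NOT kc) in disjunctive normal form over (ka, kb, kc):
# (1,1,1) OR (1,1,0) OR (1,0,0)
query = [
    {"ka": True, "kb": True, "kc": True},
    {"ka": True, "kb": True, "kc": False},
    {"ka": True, "kb": False, "kc": False},
]

result = boolean_retrieve(docs, query)
```

A document either matches or it does not, so d4 (which contains ka and kc but not kb) is rejected outright even though it shares a term with the query. That is the exact-matching drawback listed above.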
Vector Model (1/4) • Index terms are assigned non-binary weights • Term weights are used to compute the degree of similarity between each document and the user query • Retrieved documents are then sorted in decreasing order of similarity • Definition: in the vector model, the weight $w_{i,j}$ is associated with the pair (term $k_i$, document $d_j$)
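Vector Model (2/4) • Degree of similarity: the cosine of the angle between the document vector $\vec{d_j} = (w_{1,j}, \ldots, w_{t,j})$ and the query vector $\vec{q} = (w_{1,q}, \ldots, w_{t,q})$ • $sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2}\,\sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$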
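As a rough sketch of the degree of similarity, the cosine measure can be computed directly from two weight vectors of equal length. The vectors below and the function names `cosine_similarity` and `rank` are assumptions made for illustration. Returning 0.0 for an all-zero vector is also a choice of this sketch, not something stated on the slide.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(d: Sequence[float], q: Sequence[float]) -> float:
    """Cosine of the angle between a document vector and a query vector."""
    if len(d) != len(q):
        raise ValueError("document and query vectors must have the same length")
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0  # assumption: an empty vector is treated as similar to nothing
    return dot / (norm_d * norm_q)


def rank(docs: Dict[str, Sequence[float]], q: Sequence[float]) -> List[Tuple[str, float]]:
    """Score every document and sort in decreasing order of similarity."""
    scored = [(doc_id, cosine_similarity(vec, q)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical weight vectors over three index terms.
doc_vectors = {
    "d1": [1.0, 0.0, 1.0],
    "d2": [1.0, 1.0, 0.0],
    "d3": [0.0, 0.0, 2.0],
}
query_vector = [1.0, 1.0, 0.0]

ranking = rank(doc_vectors, query_vector)
```

Here d1 shares only one of the two query terms and still receives a nonzero score, which is the partial matching behaviour that distinguishes this model from the Boolean one.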
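Vector Model (3/4) • Salton • IR vs. clustering • intra-cluster similarity: tf factor (term frequency) • inter-cluster dissimilarity: idf factor (inverse document frequency) • Definition • normalized frequency: $f_{i,j} = freq_{i,j} / \max_l freq_{l,j}$ • inverse document frequency: $idf_i = \log(N / df_i)$, where N is the total number of documents and $df_i$ is the number of documents containing $k_i$ • term-weighting scheme: $w_{i,j} = f_{i,j} \times \log(N / df_i)$ • query-term weights: $w_{i,q} = (0.5 + 0.5\, freq_{i,q} / \max_l freq_{l,q}) \times \log(N / df_i)$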
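The tf and idf factors can be combined as in the sketch below. It assumes the common form in which the raw term frequency is normalized by the largest frequency in the document and the idf factor is log(N / df). The tokenized toy documents and the function name `tf_idf_weights` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_weights(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Compute one {term: weight} dictionary per document."""
    n_docs = len(docs)

    # Document frequency: the number of documents that contain each term.
    df: Counter = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        freq = Counter(tokens)
        max_freq = max(freq.values(), default=1)
        weights.append(
            {
                # normalized tf times idf
                term: (count / max_freq) * math.log(n_docs / df[term])
                for term, count in freq.items()
            }
        )
    return weights


# Hypothetical tokenized collection.
collection = [
    ["ir", "model", "model"],
    ["ir", "vector"],
    ["ir", "probabilistic", "model"],
]
weights = tf_idf_weights(collection)
```

The term "ir" occurs in every document, so its idf factor is log(1) = 0 and it contributes nothing to any document vector. Rare terms such as "vector" receive the largest weights.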
Vector Model (4/4) • Advantages • its term-weighting scheme improves retrieval performance • its partial matching strategy allows retrieval of documents that approximate the query conditions • its cosine ranking formula sorts the documents according to their degree of similarity to the query • Disadvantage • The assumption of mutual independence between index terms
Probabilistic Model (1/7) • Introduced by Robertson and Sparck Jones, 1976 • Also called the binary independence retrieval (BIR) model • Idea: given a user query q, there is an ideal answer set containing exactly the relevant documents; the problem is to specify the properties of this set • i.e., the probabilistic model tries to estimate the probability that the user will find the document $d_j$ relevant, and ranks documents by the ratio P($d_j$ relevant to q) / P($d_j$ nonrelevant to q)
Probabilistic Model (2/7) • Definition • All index term weights are binary, i.e., $w_{i,j} \in \{0,1\}$ • Let R be the set of documents known to be relevant to query q • Let $\bar{R}$ be the complement of R • Let $P(R|d_j)$ be the probability that the document $d_j$ is relevant to the query q • Let $P(\bar{R}|d_j)$ be the probability that the document $d_j$ is nonrelevant to the query q
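Probabilistic Model (3/7) • The similarity sim($d_j$, q) of the document $d_j$ to the query q is defined as the ratio $sim(d_j, q) = \frac{P(R|d_j)}{P(\bar{R}|d_j)}$, with $\bar{R}$ the complement of R • Using Bayes' rule, $sim(d_j, q) = \frac{P(d_j|R)\, P(R)}{P(d_j|\bar{R})\, P(\bar{R})}$ • P(R) stands for the probability that a document randomly selected from the entire collection is relevant • $P(d_j|R)$ stands for the probability of randomly selecting the document $d_j$ from the set R of relevant documents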
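Probabilistic Model (4/7) • Assuming independence of index terms and given the binary document vector $d_j = (d_1, d_2, …, d_t)$, • $sim(d_j, q) \sim \frac{P(d_j|R)}{P(d_j|\bar{R})} = \prod_{i=1}^{t} \frac{\Pr(k_i|R)^{d_i}\, \Pr(\bar{k_i}|R)^{1-d_i}}{\Pr(k_i|\bar{R})^{d_i}\, \Pr(\bar{k_i}|\bar{R})^{1-d_i}}$
Probabilistic Model (5/7) • $\Pr(k_i|R)$ stands for the probability that the index term $k_i$ is present in a document randomly selected from the set R • $\Pr(\bar{k_i}|R)$ stands for the probability that the index term $k_i$ is not present in a document randomly selected from the set R • let $\Pr(k_i|R) = p_i$, so that $\Pr(\bar{k_i}|R) = 1 - p_i$ • $d_i$ is either 0 or 1 • 0: $k_i$ is absent from $d_j$ • 1: $k_i$ is present in $d_j$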
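The step from the product over index terms to a per-term weight can be sketched as follows. This is the standard log-odds manipulation and not a transcription of a slide. It assumes that $d_i$ marks the presence of $k_i$ in the document, and it introduces $q_i = \Pr(k_i|\bar{R})$ as a shorthand by analogy with $p_i$.

```latex
\begin{align*}
\log \frac{P(d_j \mid R)}{P(d_j \mid \bar{R})}
  &= \sum_{i=1}^{t} \left[ d_i \log \frac{p_i}{q_i}
     + (1 - d_i) \log \frac{1 - p_i}{1 - q_i} \right] \\
  &= \sum_{i=1}^{t} d_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}
     + \sum_{i=1}^{t} \log \frac{1 - p_i}{1 - q_i}
\end{align*}
```

The second sum does not depend on the document, so it can be dropped for ranking purposes. What remains is a sum of one weight per index term present in the document.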
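Probabilistic Model (7/7) • The retrieval value of each $k_i$ present in a document (i.e., $d_i = 1$) is the term relevance weight $\log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}$, where $q_i = \Pr(k_i|\bar{R})$ • Initial estimates: $p_i = 0.5$, $q_i = df_i / N$, with $df_i$ the number of documents containing $k_i$ and N the total number of documents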
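A minimal sketch of ranking with the term relevance weight, using the initial estimates p = 0.5 and q = df / N from this slide. The toy collection and the names `term_relevance_weight` and `bir_rank` are hypothetical. Skipping terms that occur in every document (where q = 1 and the weight is undefined) is an assumption of this sketch, not part of the model as presented.

```python
import math
from typing import Dict, List, Set, Tuple


def term_relevance_weight(p: float, q: float) -> float:
    """log( p(1-q) / (q(1-p)) ), defined for 0 < p < 1 and 0 < q < 1."""
    return math.log((p * (1.0 - q)) / (q * (1.0 - p)))


def bir_rank(docs: Dict[str, Set[str]], query: Set[str]) -> List[Tuple[str, float]]:
    """Score each document by summing the weights of query terms it contains."""
    n_docs = len(docs)
    scores = {}
    for doc_id, terms in docs.items():
        score = 0.0
        for term in query & terms:
            df = sum(1 for other in docs.values() if term in other)
            q_i = df / n_docs  # initial estimate of Pr(k_i | non-relevant)
            if q_i < 1.0:      # assumption: ignore terms present in every document
                score += term_relevance_weight(0.5, q_i)
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical collection of five documents, each a set of index terms.
docs = {
    "d1": {"probabilistic", "model"},
    "d2": {"vector", "model"},
    "d3": {"boolean"},
    "d4": {"probabilistic", "retrieval"},
    "d5": {"index"},
}
ranking = bir_rank(docs, {"probabilistic", "model"})
```

With p = 0.5 the weight reduces to log((1 - q) / q), so a term that occurs in more than half of the collection gets a negative weight under these initial estimates.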