CS533 Information Retrieval


Presentation Transcript


  1. CS533 Information Retrieval Dr. Michal Cutler Lecture #7 February 15, 2000

  2. Probabilistic information retrieval • The model • Binary independence model • Non-binary independence models

  3. Optimal Retrieval • Given a query Q and a document collection • Optimal document retrieval principle: arrange the documents in descending order of their probability of relevance to Q • Let OP-list denote the resulting list • If k documents are required, take the first k documents from the OP-list
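
A minimal sketch of the principle, assuming the per-document relevance probabilities were somehow available (the next slides explain that in practice they are not):

```python
def retrieve_top_k(prob_rel, k):
    """Return the first k documents of the OP-list: document ids sorted in
    descending order of their (estimated) probability of relevance to Q."""
    op_list = sorted(prob_rel, key=prob_rel.get, reverse=True)
    return op_list[:k]

# Illustrative probabilities only
print(retrieve_top_k({"d1": 0.12, "d2": 0.85, "d3": 0.40}, k=2))  # ['d2', 'd3']
```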

  4. The model • The probability of relevance of each document to the query is not available in practice • The next slides show that the exact probabilities are not needed

  5. The model cont. • Let rel denote the event that a document is relevant to the user • P(rel|Di) is the probability that Di is relevant to the user • We need a similarity function s so that: • P(rel|Di) > P(rel|Dj) iff s(Q, Di) > s(Q, Dj)

  6. The similarity function

  7. The similarity function cont.

  8. The similarity function cont.

  9. The similarity function cont. • For every document D we need to compute:
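
The formulas on slides 6–9 did not survive the transcript. A standard reconstruction of the argument, consistent with the definition of g on slide 14 (a sketch, not necessarily the slides' exact notation):

\[
P(rel\mid D_i)=\frac{P(D_i\mid rel)\,P(rel)}{P(D_i)},\qquad
\frac{P(rel\mid D_i)}{P(nonrel\mid D_i)}
=\frac{P(D_i\mid rel)}{P(D_i\mid nonrel)}\cdot\frac{P(rel)}{P(nonrel)} .
\]

Since P(rel)/P(nonrel) is the same for every document, ranking by P(rel|Di) is equivalent to ranking by P(Di|rel)/P(Di|nonrel), so for every document D it suffices to compute

\[
g(D)=\log\frac{P(D\mid rel)}{P(D\mid nonrel)} .
\]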

  10. Some History • Maron and Kuhns 1960 • Robertson and Sparck Jones 1976 • Croft and Harper 1979 • Yu Meng and Park 1989 • Other models (Robertson and Walker, Kwok)

  11. The model • The set of all documents is partitioned with respect to the query Q into the sets rel and nonrel. • The sets rel and nonrel change from query to query

  12. An Independence Assumption • The terms are assumed to be distributed independently in both rel and nonrel • This is a very strong assumption • Q = "What is happening with the impeachment trial?" • The occurrence of "impeachment" in relevant documents is assumed to be independent of the occurrence of "trial"

  13. The model • xi = di is the event that D has di occurrences of term i • From the independence assumption: P(D|rel) = P(x1=d1|rel) · … · P(xt=dt|rel), and similarly for P(D|nonrel)

  14. The model • Let g(x)=log(P(x|rel)/P(x|nonrel)) • The logarithm is used to make the calculations simpler by changing multiplications to sums

  15. Computing g(x)

  16. Binary independence model • In this model a term either occurs or does not occur in a document • Let x = (x1,…,xt) denote any document in the collection, where xi is 1 if term i occurs in the document and 0 otherwise

  17. Computing g(x)

  18. Computing the rank of D

  19. Computing the rank of D
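
Slides 17–19 lost their formulas in the transcript; a standard sketch of the computation under the binary independence model, writing pi = P(xi = 1 | rel) and qi = P(xi = 1 | nonrel) as on the following slides:

\[
g(x)=\sum_{i=1}^{t}\left[x_i\log\frac{p_i}{q_i}+(1-x_i)\log\frac{1-p_i}{1-q_i}\right]
=\sum_{i=1}^{t}\log\frac{1-p_i}{1-q_i}
+\sum_{i=1}^{t}x_i\log\frac{p_i(1-q_i)}{q_i(1-p_i)} .
\]

The first sum does not depend on the document; the second sum, in effect, runs only over the terms that occur in D, each contributing the weight tri = log(pi(1-qi)/(qi(1-pi))) used on the next slides.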

  20. Term relevance weights tri • The first sum in the formula for g(x = D) depends only on pi and qi, the probabilities that term i occurs in the relevant and in the non-relevant documents, respectively • It is independent of which terms occur in document D

  21. Term relevance weights tri • The second sum depends on the actual terms which appear in D.

  22. Term relevance weights tri • tri can be interpreted as the power of term i to discriminate between the relevant and the nonrelevant documents

  23. Notation • N is the total number of documents • R is the total number of relevant documents • ri is the number of relevant documents containing term i • dfi is the number of documents in which term i occurs

  24. Term frequencies

      Occurrence   Relevant documents   Non-relevant documents   Total
      xi = 1       ri                   dfi - ri                 dfi
      xi = 0       R - ri               N - R - dfi + ri         N - dfi
      Total        R                    N - R                    N

  25. Computing tri • We can use the previous foil to compute: • pi = ri / R • qi = (dfi - ri) / (N - R) • 1 - pi = (R - ri) / R • 1 - qi = (N - R - dfi + ri) / (N - R)

  26. Computing tri
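
The formula on this slide did not survive the transcript. A minimal sketch of the computation using the estimates from slide 25 (names are illustrative, and zero counts would need smoothing, as on slide 38):

```python
import math

def term_relevance_weight(N, R, df_i, r_i):
    """tr_i = log( p_i (1 - q_i) / (q_i (1 - p_i)) ), with
    p_i = r_i / R and q_i = (df_i - r_i) / (N - R) as on slide 25."""
    p_i = r_i / R
    q_i = (df_i - r_i) / (N - R)
    return math.log((p_i * (1 - q_i)) / (q_i * (1 - p_i)))

# Illustrative counts: 1000 documents, 20 relevant, term in 50 documents, 15 of them relevant
print(term_relevance_weight(N=1000, R=20, df_i=50, r_i=15))
```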

  27. The weight of a term - example

  28. Estimating pi and qi • ri and R are not known before the system has been used extensively and relevance results have been collected • Various proposals have been made for estimating pi and qi • We discuss some of them

  29. Estimating qi • Most documents in which term i occurs are not relevant to the average query • N is large

  30. Estimating qi • qi can be estimated by the occurrence probability of the term in the entire collection • qi= dfi / N

  31. Estimating tri • tri = log(pi / (1 - pi)) + log((1 - qi) / qi) = C + log((1 - qi) / qi) = C + log((1 - dfi / N) / (dfi / N)) = C + log((N - dfi) / dfi) • where C = log(pi / (1 - pi)) • When pi is close to 1, C is very large

  32. Estimating pi • Assume no relevance information is available • The probability that term i occurs in the (small) set of relevant documents can be assumed equal to the probability that it does not • So pi = 1/2

  33. Estimating tri • In this case the term-relevance weight is: • tri = log 1 + log((N - dfi) / dfi) = log((N - dfi) / dfi) • This formula is a form of idf
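
A small numeric illustration of the estimate, assuming only N and dfi are known (values are made up):

```python
import math

def tr_no_feedback(N, df_i):
    """Weight with no relevance information: p_i = 1/2 and q_i = df_i / N
    give tr_i = log((N - df_i) / df_i)."""
    return math.log((N - df_i) / df_i)

# For terms that are not too frequent this is close to idf = log(N / df_i)
N, df_i = 100000, 50
print(tr_no_feedback(N, df_i), math.log(N / df_i))
```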

  34. Estimating r • If term i is an ideal indexing term, it occurs only in relevant documents. • In this case ri = dfi, • pi = dfi/ R and • qi= 0.

  35. Estimating r • If term i is a poor indexing term, it is sprinkled evenly among the relevant and non-relevant documents • In this case ri can be estimated by ri = (R / N) dfi

  36. Estimating r • We can assume that a typical index term falls somewhere between an ideal and a poor one • The constants a, b, and c on the next slide take intermediate values

  37. Estimating r • r = a·df for 0 <= df <= R, where R/N < a < 1 • r = b + c·df for R < df < N, where 0 < c < R/N • [Figure: estimated r plotted against df, lying between the ideal line r = df and the poor-term line r = (R/N)·df]
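
A sketch of the piecewise estimate; the slide does not give values for a, b, and c, so the defaults below are placeholders chosen only to satisfy R/N < a < 1 and 0 < c < R/N:

```python
def estimate_r(df, N, R, a=0.8, b=1.0, c=None):
    """Piecewise estimate of r_i from df_i (slide 37); a, b, c are placeholders."""
    if c is None:
        c = 0.5 * R / N        # some value strictly between 0 and R/N
    if df <= R:
        return a * df          # near the ideal line r = df
    return b + c * df          # near the poor-term line r = (R/N) * df
```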

  38. Robertson and Sparck Jones
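
The formula on this slide was lost in the transcript; the weight commonly attributed to Robertson and Sparck Jones (1976), obtained by adding 0.5 to each cell of the slide-24 table to avoid zero counts, is (as an assumption about what the slide showed):

\[
tr_i=\log\frac{(r_i+0.5)\,/\,(R-r_i+0.5)}{(df_i-r_i+0.5)\,/\,(N-df_i-R+r_i+0.5)} .
\]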

  39. Croft’s probabilistic model • Introduced term frequency into the probabilistic model • Relevance is estimated by including the probability that a term appears in a document

  40. Croft’s probabilistic model

  41. Croft’s probabilistic model • Initial search • wijk = (C + idfi)fik where • i denotes the ith term in query j and document k, • C is a constant

  42. Croft’s probabilistic model • fik = K + (1 - K) freqik / maxfreqk • freqik is the frequency of occurrence of term i in document k • maxfreqk is the maximum frequency of any term in document k • K is a constant
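
Putting slides 41 and 42 together, a minimal sketch of the initial-search weight; C and K are unspecified constants, and idfi = log(N / dfi) is an assumed (common) choice not given on the slides:

```python
import math

def croft_initial_weight(freq_ik, max_freq_k, df_i, N, C=1.0, K=0.3):
    """w_ijk = (C + idf_i) * f_ik, with
    f_ik = K + (1 - K) * freq_ik / max_freq_k  (slide 42)."""
    idf_i = math.log(N / df_i)                  # assumed idf definition
    f_ik = K + (1 - K) * freq_ik / max_freq_k   # normalized term frequency
    return (C + idf_i) * f_ik
```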

  43. Croft’s probabilistic model • Search using feedback • pij (qij) is the probability that term i occurs in the set of relevant (non relevant) documents for query j

  44. Croft’s probabilistic model

  45. Croft’s probabilistic model • It is assumed that nonretrieved documents are not relevant • So both R and r can be estimated from the provided feedback • A problem arises when the user indicates that no relevant documents were retrieved
