Non-binary Independence Models in Information Retrieval

This lecture covers non-binary independence models, term relationships in indexing, and fuzzy information retrieval.


  1. CS533 Information Retrieval Dr. Michal Cutler Lecture #8 February 17, 1999

  2. This lecture • Non-binary independence probabilistic models • Term relationship in indexing • Fuzzy information retrieval

  3. The non-binary independence model • Yu, Meng and Park • wi depends on the term frequency di and on how the frequency of occurrence is distributed in the sets of relevant and nonrelevant documents

  4. The non-binary independence model

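  5. The non-binary independence model • When di = 0, wi(di = 0) is not necessarily 0 • We can shift the weights so that wi’ = wi - wi(di = 0) • Subtracting the same constant from every weight of a term does not change the relevance order, and it makes the computation more efficient because terms absent from a document contribute 0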

  6. The weight of a term - example • Assume: R = 6. N=14. • 3 relevant documents have di = 2, 2 have di = 1, and one has di = 0 • One nonrelevant document with di = 2, one with di = 1 and 6 with di = 0

  7. The weight of a term - example • Let p2, p1, p0 denote the probabilities that relevant documents have 2, 1, 0 occurrences of a term • Let q2, q1, q0 denote the probabilities that nonrelevant documents have 2, 1, 0 occurrences of a term

  8. The weight of a term - example • p2 = 3/6, p1 = 2/6, and p0 = 1/6 • q2 = 1/8, q1 = 1/8, and q0 = 6/8 • w2 = log((3/6)/(1/8)) = log 4 • w1 = log((2/6)/(1/8)) = log(8/3) • w0 = log((1/6)/(6/8)) = log(2/9) • w2’ = w2 - w0 = log 18, w1’ = w1 - w0 = log 12, w0’ = 0
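
The worked example on slides 6 to 8 can be reproduced with a short script. This is only a sketch: the function name `term_weights` and the list encoding of the documents are illustrative assumptions, and it assumes every frequency value occurs in both the relevant and the nonrelevant set (no smoothing), as in the example.

```python
from collections import Counter
from math import log

def term_weights(relevant_tfs, nonrelevant_tfs):
    """Return (raw, shifted) weights keyed by term frequency d.

    raw[d]     = log(p_d / q_d)
    shifted[d] = raw[d] - raw[0], so that an absent term has weight 0.
    Assumes each frequency occurs in both sets (otherwise the ratio is undefined).
    """
    rel, non = Counter(relevant_tfs), Counter(nonrelevant_tfs)
    n_rel, n_non = len(relevant_tfs), len(nonrelevant_tfs)
    raw = {}
    for d in sorted(set(rel) & set(non)):
        p_d = rel[d] / n_rel   # share of relevant documents with frequency d
        q_d = non[d] / n_non   # share of nonrelevant documents with frequency d
        raw[d] = log(p_d / q_d)
    shifted = {d: w - raw[0] for d, w in raw.items()}
    return raw, shifted

# Slide 6: R = 6 relevant and N - R = 8 nonrelevant documents.
relevant = [2, 2, 2, 1, 1, 0]
nonrelevant = [2, 1, 0, 0, 0, 0, 0, 0]
raw, shifted = term_weights(relevant, nonrelevant)
for d in (2, 1, 0):
    print(d, round(raw[d], 4), round(shifted[d], 4))
```

Because the shift subtracts the same constant w0 for every document, the sum of shifted weights differs from the sum of raw weights only by a per-query constant, which is why the ranking is preserved.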

  9. Term relationship in indexing • Assume that terms do not occur in text independently

  10. Term relationship in indexing

  11. Fuzzy Boolean Models • Limitations of the Boolean model • Introduction to fuzzy sets • Fuzzy models • basic • MMM • Paice • p-norm

  12. Boolean model limitations 1. AND query Given the query: • “fuzzy” AND “logic” AND “approximate” AND “reasoning” AND “possibility” AND “theory”, • D is not retrieved when indexed by all the terms except “possibility”

  13. Boolean model limitations 2. OR query Given for example the query: • (“fuzzy” AND “logic”) OR (“approximate” AND “reasoning”), • D1 indexed by all the terms • D2 indexed only by “fuzzy” and “logic” • D1, D2 retrieved in arbitrary order

  14. Boolean model limitations 3. Query term importance • Searchers can rate term importance • If query term A is more important than term B, D1 with only term A should rank higher than D2 that contains only B

  15. Boolean model limitations • During Boolean indexing a term is either chosen to represent a document or not. • We would like to be able to represent the importance of a term to a document.

  16. Introduction to fuzzy sets • We discuss • the difference between conventional (crisp) sets and fuzzy sets • fuzzy set operations

  17. Crisp sets • Sets in which an object is either a member of a set or not are called crisp sets • In fuzzy sets an item may be a partial member of a set • Each object in the universe can be partially compatible with some attribute

  18. Limitations of crisp sets • To decide membership in sets such as TALL, OLD, or WEALTHY, we need a threshold • For example, $1,000,000 for a wealthy person • So a person with $999,999 is not wealthy (poor) • With fuzzy sets, a person with $500,000 belongs to WEALTHY with some degree of membership

  19. Fuzzy sets • The degree of membership of x in fuzzy set A is μA(x), with μA: X → [0,1] • where X is the universal set and • [0,1] denotes the interval of real numbers from 0 to 1

  20. Example • The discrete fuzzy set TALL • Let the universe of heights be U={4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8} • TALL={0/4.5, 0.2/5, .5/5.5, .7/6, 1/6.5, 1/7, 1/7.5, 1/8} • The first number in each pair is membership degree

  21. Example • In reality height is a continuous variable • The next slide describes the fuzzy set TALL as a continuous function

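  22. A membership function • [Figure: membership function μTALL plotted against height in feet; vertical axis from 0 to 1.0 with marks at 0.5 and 0.7, horizontal axis marked at 4.5, 5.5, 6, and 6.5]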
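
One way to obtain a continuous membership function like the one plotted is to interpolate linearly between the points of the discrete TALL set from slide 20. The sketch below assumes linear interpolation, which the slides do not state; the names `POINTS` and `tall` are illustrative.

```python
from bisect import bisect_right

# (height in feet, membership degree), taken from the discrete set TALL on slide 20.
POINTS = [(4.5, 0.0), (5.0, 0.2), (5.5, 0.5), (6.0, 0.7), (6.5, 1.0)]

def tall(height):
    """Membership degree of a height in TALL, by linear interpolation (an assumption)."""
    if height <= POINTS[0][0]:
        return 0.0
    if height >= POINTS[-1][0]:
        return 1.0
    xs = [x for x, _ in POINTS]
    i = bisect_right(xs, height)          # index of the first point to the right
    (x0, y0), (x1, y1) = POINTS[i - 1], POINTS[i]
    return y0 + (y1 - y0) * (height - x0) / (x1 - x0)

print(tall(5.75))  # halfway between 5.5 (0.5) and 6.0 (0.7)
```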

  23. Fuzzy set operations • Set operations can be defined in a variety of ways • The most common definitions of the membership function of A∩B are: μA∩B(x) = min{μA(x), μB(x)} or μA∩B(x) = μA(x)μB(x), for all x ∈ X

  24. Fuzzy set operations • Usually fuzzy operations are compatible with crisp set operations • If A and B are crisp, μA∩B(x) = 1 iff x ∈ A∩B, and μA∩B(x) = 0 iff x ∉ A∩B • This is satisfied by both definitions

  25. Fuzzy set operations • The membership function of A∪B is: μA∪B(x) = max{μA(x), μB(x)} or μA∪B(x) = μA(x) + μB(x) - μA(x)μB(x) • If A and B are crisp, μA∪B(x) = 1 iff x ∈ A∪B, and μA∪B(x) = 0 iff x ∉ A∪B

  26. Fuzzy set operations • The membership function of the complement A’ is: μA’(x) = 1 - μA(x) • For crisp A, μA’(x) = 1 iff x ∉ A, and μA’(x) = 0 iff x ∈ A
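
The intersection, union, and complement definitions from slides 23 to 26 are small enough to check directly. The sketch below implements both variants of intersection and union and verifies that they reduce to ordinary set logic when memberships are 0 or 1; the function names are illustrative.

```python
def and_min(a, b):
    return min(a, b)

def and_product(a, b):
    return a * b

def or_max(a, b):
    return max(a, b)

def or_prob_sum(a, b):
    return a + b - a * b

def complement(a):
    return 1 - a

# Crisp compatibility: with memberships restricted to {0, 1},
# both definitions of each operation agree with Boolean logic.
for a in (0, 1):
    assert complement(a) == int(not a)
    for b in (0, 1):
        assert and_min(a, b) == and_product(a, b) == int(bool(a) and bool(b))
        assert or_max(a, b) == or_prob_sum(a, b) == int(bool(a) or bool(b))

# With partial memberships the two variants differ.
a, b = 0.7, 0.5
print(and_min(a, b), and_product(a, b))   # min is never smaller than the product
print(or_max(a, b), or_prob_sum(a, b))    # max is never larger than the probabilistic sum
```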

  27. Information retrieval • A document D is represented by a weight vector (w1,…,wt), where wi = μTi(D) is the “degree of membership” of D in the fuzzy set for concept Ti • Example: POLITICS = {μpolitics(D1)/D1, μpolitics(D2)/D2, …, μpolitics(DN)/DN}

  28. Information retrieval • User can specify a fuzzy value for each query term • To calculate fuzzy weights for document terms use statistical measures tf, idf, normalization, etc

  29. Basic fuzzy Boolean model • The query (Ti AND Tj) is computed for document D by min(wi, wj), • (Ti OR Tj) is computed by max(wi, wj), • (NOT Ti) is computed by 1-wi

  30. Retrieval examples • D1: elephant/0.8 + mammals/0.5 + Asia/0.2 + ... • D2: elephant/0.3 + mammals/0.5 + Asia/0.3 + ... • Q1= elephants • D1 similarity 0.8 • D2 similarity 0.3

  31. Basic fuzzy Boolean model • The model does not solve the first three limitations of Boolean retrieval • A document will not be retrieved for an AND query if one term has 0 weight • Single value dependency: the score is determined by one term weight only (the minimum or the maximum)

  32. Retrieval examples • D1: elephant/1 + Asia/0.2 + ... • D2: elephant/0.2 + Asia/0.2 + ... • Q2 = elephants AND Asia • D1 retrieved with min(1, 0.2) = 0.2 • D2 retrieved with min(0.2, 0.2) = 0.2 • D1 is the better match, yet both documents get the same score

  33. Basic fuzzy Boolean model • A document with all OR query terms may be retrieved with a smaller weight than a document that contains only one query term • User’s subjective value of query terms ignored

  34. Retrieval examples • D1: elephant/0.8 + hunting/0.1 + ... • D2: elephant/0.7 + hunting/0.7 + ... • Q3 = elephants OR hunting • D1 retrieved with max(0.8, 0.1) = 0.8 • D2 retrieved with max(0.7, 0.7) = 0.7 • D2 is the better match, yet it gets the lower score

  35. Retrieval examples • D1: mammals/0.5+Asia/0.2+... • D2: mammals/0.51+Asia/0.49+... • Q4 = (mammals AND NOT Asia) • D1 min(0.5, 1-0.2) = 0.5 • D2 min(0.51, 1-0.49) = 0.51
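
The retrieval examples on slides 32 to 35 can be checked with a small recursive evaluator for the basic fuzzy Boolean model. This is a sketch under assumptions: queries are encoded as nested tuples, documents as dictionaries of term weights, and the query term "elephants" is assumed to be stemmed to the index term "elephant".

```python
def score(query, doc):
    """Evaluate a query against a document under the basic fuzzy Boolean model.

    query: a term (str) or a tuple ("AND" | "OR" | "NOT", subquery, ...).
    doc:   dict mapping index terms to weights in [0, 1]; missing terms weigh 0.
    """
    if isinstance(query, str):
        return doc.get(query, 0.0)
    op, *args = query
    values = [score(arg, doc) for arg in args]
    if op == "AND":
        return min(values)
    if op == "OR":
        return max(values)
    if op == "NOT":
        return 1 - values[0]
    raise ValueError(f"unknown operator: {op}")

# Slide 32: AND ignores everything except the smallest weight.
q2 = ("AND", "elephant", "Asia")
print(score(q2, {"elephant": 1.0, "Asia": 0.2}),
      score(q2, {"elephant": 0.2, "Asia": 0.2}))

# Slide 34: OR ignores everything except the largest weight.
q3 = ("OR", "elephant", "hunting")
print(score(q3, {"elephant": 0.8, "hunting": 0.1}),
      score(q3, {"elephant": 0.7, "hunting": 0.7}))

# Slide 35: NOT is the complement of the term weight.
q4 = ("AND", "mammals", ("NOT", "Asia"))
print(score(q4, {"mammals": 0.5, "Asia": 0.2}),
      score(q4, {"mammals": 0.51, "Asia": 0.49}))
```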

  36. Mixed min and max model • The MMM model (Fox) takes into account the maximum value for an AND query and the minimum for an OR query. • Deals with missing term limitation of Boolean

  37. Mixed min and max model • AND or OR query • QAND=(A1 AND A2 AND … AND An) • SIM(QAND, D)= CAND1*min(wA1, wA2,…, wAn)+ CAND2*max(wA1, wA2,…, wAn) • CAND1 > CAND2 and CAND1 + CAND2 =1

  38. Mixed min and max model • QOR =(A1 OR A2 OR … OR An) • SIM(QOR ,D)= COR1*max(wA1, wA2,…, wAn)+ COR2*min(wA1, wA2,…, wAn) • COR1 > COR2 and COR1 + COR2 =1

  39. Mixed min and max model • A document that is missing a term of an AND query is still retrieved, with a value that depends on the maximum weight • Similarly, the value of an OR query is reduced if query terms are missing

  40. Retrieval example • D1 fuzzy/0.8+logic/0.2+sets/0.2+... • D2 fuzzy/0.8+logic/0.7+sets/0.2+... • D3 fuzzy/0.8+logic/0.7+sets/0+... • CAND1 = 0.6 (so CAND2 = 0.4) • Q3 = fuzzy AND logic AND sets • D1 and D2 get the same score, 0.6*0.2 + 0.4*0.8 = 0.44, although D2 is the better match • D3 score is 0.6*0 + 0.4*0.8 = 0.32

  41. Retrieval example • D1 fuzzy/0.8+logic/0.2+sets/0.2+... • D2 fuzzy/0.8+logic/0.7+sets/0.2+... • D3 fuzzy/0.8+logic/0.7+sets/0+... • COR1 = 0.6 (so COR2 = 0.4) • Q4 = fuzzy OR logic OR sets • D1 and D2 get the same score, 0.6*0.8 + 0.4*0.2 = 0.56 • D3 score is 0.6*0.8 + 0.4*0 = 0.48
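
The MMM similarity formulas from slides 37 and 38 translate directly into code. The sketch below uses the coefficient 0.6 from the two retrieval examples and reproduces their scores; the function names and the list encoding of the documents are illustrative.

```python
def mmm_and(weights, c_and1=0.6):
    """SIM(QAND, D) = CAND1 * min + CAND2 * max, with CAND2 = 1 - CAND1."""
    return c_and1 * min(weights) + (1 - c_and1) * max(weights)

def mmm_or(weights, c_or1=0.6):
    """SIM(QOR, D) = COR1 * max + COR2 * min, with COR2 = 1 - COR1."""
    return c_or1 * max(weights) + (1 - c_or1) * min(weights)

# Weights of the query terms (fuzzy, logic, sets) in each document.
docs = {
    "D1": [0.8, 0.2, 0.2],
    "D2": [0.8, 0.7, 0.2],
    "D3": [0.8, 0.7, 0.0],
}
for name, weights in docs.items():
    print(name, round(mmm_and(weights), 2), round(mmm_or(weights), 2))
```

Only the smallest and largest weights enter the score, so D1 and D2 cannot be told apart even though they differ in the weight of "logic".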

  42. Paice model • Improves on both the basic model and the MMM model • Takes into account all query terms • AND or OR queries. • QAND=(A1 AND A2 AND … AND An), • QOR =(A1 OR A2 OR … OR An)

  43. Paice model • Values are sorted in ascending order for AND queries • Descending order for OR queries. • Slower • sorting query terms (O(nlogn)) • and computing exponents

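  44. Paice model • When r = 1, Sim(Q, D) is the average of the term weights • When r < 1, Sim(Q, D) is determined mainly by the terms with low exponents, that is, those that come first in the sorted order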

  45. Paice Model • Experiments determined: • r=1 for AND queries (average) • r=0.7 for OR queries • No query weights.

  46. Retrieval • D1 fuzzy/0.8+logic/0.2+sets/0.2+... • D2 fuzzy/0.8+logic/0.7+sets/0.2+... • D3 fuzzy/0.8+logic/0.7+sets/0+... • Q3 = fuzzy AND logic AND sets • D1 (0.8+0.2+0.2)/3 = 0.4 • D2 (0.8+0.7+0.2)/3 ≈ 0.57 • D3 (0.8+0.7+0)/3 = 0.5
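
The similarity formula itself did not survive in this transcript. The sketch below assumes the commonly cited form of the Paice model, Sim(Q, D) = Σ r^(i-1) wi / Σ r^(i-1), with the weights sorted in ascending order for AND and descending order for OR. It is consistent with the statements above (r = 1 gives the average, sorting is required) but should be read as an assumption rather than the lecture's exact formula; the function names are illustrative.

```python
def paice(weights, r, descending):
    """Weighted mean of the sorted term weights with coefficients r**0, r**1, ..."""
    ordered = sorted(weights, reverse=descending)
    coeffs = [r ** i for i in range(len(ordered))]
    return sum(c * w for c, w in zip(coeffs, ordered)) / sum(coeffs)

def paice_and(weights, r=1.0):
    # Ascending order: the smallest weights get the largest coefficients.
    return paice(weights, r, descending=False)

def paice_or(weights, r=0.7):
    # Descending order: the largest weights get the largest coefficients.
    return paice(weights, r, descending=True)

docs = {
    "D1": [0.8, 0.2, 0.2],
    "D2": [0.8, 0.7, 0.2],
    "D3": [0.8, 0.7, 0.0],
}
for name, weights in docs.items():
    print(name, round(paice_and(weights), 2), round(paice_or(weights), 2))
```

Unlike the MMM model, every term weight contributes, so D1 and D2 receive different scores for both the AND and the OR query.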

  47. Retrieval • D1 fuzzy/0.8+logic/0.2+sets/0.2+... • D2 fuzzy/0.8+logic/0.7+sets/0.2+... • D3 fuzzy/0.8+logic/0.7+sets/0+... • Q4= fuzzy OR logic OR sets • D1:

  48. The P-norm model • Salton and Fox • Weights for both query terms and document terms • Very good retrieval results • Drawback: computation time • http://www.individual.com/

  49. Two Term Queries • OR query • If both query terms have 0 weight, do not retrieve D • Similarity is the distance of the document vector from (0,0)

  50. Two Term Queries • AND query • If both query terms have high weights (close to 1), retrieve D • Similarity is 1 minus the distance of the document vector from (1,1)
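
The two-term geometry can be sketched as follows. The code assumes equal query-term weights and divides by the number of terms so that scores stay in [0, 1]; with p = 2 the distance is Euclidean. The parameter name `p` and the function names are illustrative, and the weighted-query form of the model is not covered.

```python
def pnorm_or(w1, w2, p=2.0):
    """Normalized distance of the document vector (w1, w2) from (0, 0)."""
    return ((w1 ** p + w2 ** p) / 2) ** (1 / p)

def pnorm_and(w1, w2, p=2.0):
    """One minus the normalized distance of (w1, w2) from (1, 1)."""
    return 1 - (((1 - w1) ** p + (1 - w2) ** p) / 2) ** (1 / p)

# A document with both terms scores higher than one with a single term,
# for the OR query as well as for the AND query.
print(pnorm_or(0.8, 0.7), pnorm_or(0.8, 0.0))
print(pnorm_and(0.8, 0.7), pnorm_and(0.8, 0.0))
```

Because the score varies smoothly with both weights, a document missing one AND term still gets a nonzero score, and a document containing both OR terms outranks one containing only the stronger term.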
