1 / 32

The Boolean Model

The Boolean Model. Simple model based on set theory Queries specified as boolean expressions precise semantics neat formalism q = k a  (k b  k c ) Terms are either present or absent. Thus, w ij  {0,1} Consider q = k a  (k b  k c )

sunila
Télécharger la présentation

The Boolean Model

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. The Boolean Model • Simple model based on set theory • Queries specified as boolean expressions • precise semantics • neat formalism • q = ka (kb  kc) • Terms are either present or absent. Thus, wij {0,1} • Consider • q = ka (kb  kc) • vec(qdnf) = (1,1,1)  (1,1,0)  (1,0,0) • vec(qcc) = (1,1,0) is a conjunctive component • Each query can be transformed in DNF form

  2. The Boolean Model Ka Kb • q = ka (kb  kc) • sim(q,dj) = 1, if document satisfies the boolean query • 0 otherwise • - no in-between, only 0 or 1 (1,1,0) (1,0,0) (1,1,1) Kc

  3. Exercise D1 = “computer information retrieval” D2 = “computer retrieval” D3 = “information” D4 = “computer information” Q1 = “information  retrieval” Q2 = “information ¬computer”

  4. Exercise กำหนด Index term ของแต่ละเอกสาร D1 = {love, need, person, possess, understand} D2 = {heart, listen, love, practice, suffer} D3 = {compassion, love, mind, person, practice} D4 = {death, health, languor, life, suffer} D5 = {energy, love, nourish, practice, teach} Q = {love ^ suffer}

  5. Drawbacks of the Boolean Model • Retrieval based on binary decision criteria with no notion of partial matching • No ranking of the documents is provided (absence of a grading scale) • Information need has to be translated into a Boolean expression which most users find awkward • The Boolean queries formulated by the users are most often too simplistic • As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query

  6. Drawbacks of the Boolean Model • The Boolean model imposes a binary criterion for deciding relevance • The question of how to extend the Boolean model to accomodate partial matching and a ranking has attracted considerable attention in the past • Two extensions of boolean model: • Fuzzy Set Model • Extended Boolean Model

  7. Set Theoretic Models

  8. Algebraic Set Theoretic Generalized Vector Lat. Semantic Index Neural Networks Structured Models Fuzzy Extended Boolean Non-Overlapping Lists Proximal Nodes Classic Models Probabilistic Boolean Vector Probabilistic Inference Network Belief Network Browsing Flat Structure Guided Hypertext IR Models U s e r T a s k Retrieval: Adhoc Filtering Browsing

  9. Set Theoretic Models • The Boolean model imposes a binary criterion for deciding relevance • The question of how to extend the Boolean model to accomodate partial matching and a ranking has attracted considerable attention in the past • Two set theoretic models for this: • Fuzzy Set Model • Extended Boolean Model

  10. Fuzzy Set Model • Queries and docs represented by sets of index terms: matching is approximate from the start • This vagueness can be modeled using a fuzzy framework, as follows: • with each term is associated a fuzzy set • each doc has a degree of membership in this fuzzy set • This interpretation provides the foundation for many models for IR based on fuzzy theory • In here, we discuss the model proposed by Ogawa, Morita, and Kobayashi (1991)

  11. Fuzzy Set Theory • Framework for representing classes whose boundaries are not well defined • Key idea is to introduce the notion of a degree of membership associated with the elements of a set • This degree of membership varies from 0 to 1 and allows modeling the notion of marginal membership • Thus, membership is now a gradual notion, contrary to the crispy notion enforced by classic Boolean logic

  12. Fuzzy Set Theory • Model • A query term: a fuzzy set • A document: degree of membership in this test • Membership function • Associate membership function with the elements of the class • 0: no membership in the test • 1: full membership • 0 ~1: marginal elements of the test documents

  13. Fuzzy Set Theory for query term a class • A fuzzy subsetA of a universe of discourse U is characterized by a membership function µA: U[0,1] which associates with each element u of U a number µA(u) in the interval [0,1] • complement: • union: • intersection: document collection

  14. Examples • Assume U={d1, d2, d3, d4, d5, d6} • Let A and B be {d1, d2, d3} and {d2, d3, d4}, respectively. • Assume A={d1:0.8, d2:0.7, d3:0.6, d4:0, d5:0, d6:0} and B={d1:0, d2:0.6, d3:0.8, d4:0.9, d5:0, d6:0} • = {d1:0.2, d2:0.3, d3:0.4, d4:1, d5:1, d6:1} • = {d1:0.8, d2:0.7, d3:0.8, d4:0.9, d5:0, d6:0} • ={d1:0, d2:0.6, d3:0.6, d4:0, d5:0, d6:0}

  15. Fuzzy Information Retrieval • basic idea • Expand the set of index terms in the query with related terms (from the thesaurus) such that additional relevant documents can be retrieved • A thesaurus can be constructed by defining a term-term correlation matrix c whose rows and columns are associated to the index terms in the document collection keyword connection matrix

  16. Fuzzy Information Retrieval(Continued) • normalized correlation factor ci,l between two terms ki and kl (0~1) • In the fuzzy set associated to each index term ki, a document dj has a degree of membership µi,j ni is # of documents containing term ki where nl is # of documents containing term kl ni,l is # of documents containing ki and kl

  17. Fuzzy Information Retrieval(Continued) • physical meaning • A document dj belongs to the fuzzy set associated to the term ki if its own terms are related to ki, i.e., i,j=1. • If there is at least one index term kl of dj which is strongly related to the index ki, then i,j1. ki is a good fuzzy index • When all index terms of dj are only loosely related to ki, i,j0. ki is not a good fuzzy index

  18. Example q = (ka (kb  kc))= (ka kb  kc)  (ka kb   kc) (ka kb kc)= cc1+cc2+cc3 Da: the fuzzy set of documents associated to the index ka cc2 cc3 Da djDa has a degree of membership a,j > a predefined threshold K cc1 Db Da: the fuzzy set of documents associated to the index ka (the negation of index term ka) Dc

  19. Example Query q=ka (kb   kc) disjunctive normal form qdnf=(1,1,1)  (1,1,0)  (1,0,0) (1) the degree of membership in a disjunctive fuzzy set is computed using an algebraic sum (instead of max function) more smoothly (2) the degree of membership in a conjunctive fuzzy set is computed using an algebraic product (instead of min function) Recall

  20. Fuzzy Set Model • Q: “gold silver truck”D1: “Shipment of gold damaged in a fire”D2: “Delivery of silver arrived in a silver truck”D3: “Shipment of gold arrived in a truck” • IDF (Select Keywords) • a = in = of = 0 = log 3/3arrived = gold = shipment = truck = 0.176 = log 3/2damaged = delivery = fire = silver = 0.477 = log 3/1 • 8 Keywords (Dimensions) are selected • arrived(1), damaged(2), delivery(3), fire(4), gold(5), silver(6), shipment(7), truck(8)

  21. Fuzzy Set Model

  22. Fuzzy Set Model

  23. Fuzzy Set Model • Sim(q,d): Alternative 1 Sim(q,d3) > Sim(q,d2) > Sim(q,d1) • Sim(q,d): Alternative 2 Sim(q,d3) > Sim(q,d2) > Sim(q,d1)

  24. Extended Boolean Model • Disadvantages of “Boolean Model” : • No term weight is used • Counterexample: query q=Kx AND Ky. Documents containing just one term, e,g, Kx is considered as irrelevant as another document containing none of these terms. • The size of the output might be too large or too small

  25. Extended Boolean Model • The Extended Boolean model was introduced in 1983 by Salton, Fox, and Wu • The idea is to make use of term weight as vector space model. • Strategy: Combine Boolean query with vector space model. • Why not just use Vector Space Model? • Advantages: It is easy for user to provide query.

  26. Extended Boolean Model • Each document is represented by a vector (similar to vector space model.) • Remember the formula. • Query is in terms of Boolean formula. • How to rank the documents?

  27. Fig. Extended Boolean logicconsidering the space composed of two terms kx and ky only. ky ky kx kx

  28. Extended Boolean Model • For query q=Kx or Ky, (0,0) is the point we try to avoid. Thus, we can use to rank the documents • The bigger the better.

  29. Extended Boolean Model • For query q=Kx and Ky, (1,1) is the most desirable point. • We use to rank the documents. • The bigger the better.

  30. Extend the idea to m terms • qor=k1 p k2 p … p Km • qand=k1 p k2 p … p km

  31. Properties • The p norm as defined above enjoys a couple of interesting properties as follows. First, when p=1 it can be verified that • Second, when p= it can be verified that • Sim(qor,dj)=max(xi) • Sim(qand,dj)=min(xi)

  32. Example • For instance, consider the query q=(k1 k2)  k3. The similarity sim(q,dj) between a document dj and this query is then computed as • Any boolean can be expressed as a numeral formula.

More Related