
Lecture 5: Boolean and Extended Boolean

Prof. Ray Larson University of California, Berkeley School of Information. Lecture 5: Boolean and Extended Boolean. Principles of Information Retrieval. Today. Review IR Components Inverted Files IR Models The Boolean Model Fuzzy sets, Rubric, P-norm, etc. Structure of an IR System.


Presentation Transcript


  1. Prof. Ray Larson University of California, Berkeley School of Information Lecture 5: Boolean and Extended Boolean Principles of Information Retrieval

  2. Today • Review • IR Components • Inverted Files • IR Models • The Boolean Model • Fuzzy sets, Rubric, P-norm, etc.

  3. Structure of an IR System Storage Line Interest profiles & Queries Documents & data Information Storage and Retrieval System Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language Formulating query in terms of descriptors Indexing (Descriptive and Subject) Storage of profiles Storage of Documents Store1: Profiles/ Search requests Store2: Document representations Comparison/ Matching Potentially Relevant Documents Search Line Adapted from Soergel, p. 19

  4. Document Processing Steps

  5. Boolean Implementation: Inverted Files • We will look at “Vector files” in detail later. But conceptually, an Inverted File is a vector file “inverted” so that rows become columns and columns become rows

  6. How Are Inverted Files Created • Documents are parsed to extract words (or stems) and these are saved with the Document ID. • Doc 1: Now is the time for all good men to come to the aid of their country • Doc 2: It was a dark and stormy night in the country manor. The time was past midnight • [Figure: text processing steps]

  7. How Inverted Files are Created • After all documents have been parsed the inverted file is sorted

  8. How Inverted Files are Created • Multiple term entries for a single document are merged and frequency information added

  9. Inverted Files • The file is commonly split into a Dictionary and a Postings file
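
The four steps on slides 6 to 9 (parse, sort, merge with frequencies, split into dictionary and postings) can be sketched as follows. This is a minimal illustration, not the implementation behind the slides: the tokenizer (lowercased alphabetic runs, no stemming), the function names, and the dictionary layout (`offset`, `n_docs`) are all assumptions made for the example.

```python
import re
from itertools import groupby

# The two example documents from slide 6.
docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

def build_inverted_file(docs):
    # Step 1: parse each document into (term, doc_id) pairs.
    pairs = []
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            pairs.append((term, doc_id))
    # Step 2: sort by term, then by document ID.
    pairs.sort()
    # Steps 3 and 4: merge repeated (term, doc_id) pairs into a frequency,
    # and split the result into a dictionary and a postings file.
    dictionary, postings = {}, []
    for (term, doc_id), group in groupby(pairs):
        freq = len(list(group))
        if term not in dictionary:
            dictionary[term] = {"offset": len(postings), "n_docs": 0}
        dictionary[term]["n_docs"] += 1
        postings.append((doc_id, freq))
    return dictionary, postings

def lookup(term, dictionary, postings):
    # Return the (doc_id, frequency) postings for one term.
    entry = dictionary.get(term)
    if entry is None:
        return []
    start = entry["offset"]
    return postings[start:start + entry["n_docs"]]

dictionary, postings = build_inverted_file(docs)
```

Calling `lookup("country", dictionary, postings)` returns the postings for both documents, while `lookup("manor", dictionary, postings)` returns only document 2.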

  10. Inverted files • Permit fast search for individual terms • The search result for each term is a list of document IDs (and optionally, frequency and/or positional information) • These lists can be used to solve Boolean queries: • country: d1, d2 • manor: d2 • country and manor: d2
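
As a small illustration of the last bullet, the example postings can be held as Python sets and combined with set operators. The `postings` mapping and the helper names are assumptions for this sketch; real systems usually work on sorted lists rather than in-memory sets.

```python
# Postings from the slide: country -> d1, d2; manor -> d2.
postings = {"country": {1, 2}, "manor": {2}}

def boolean_and(t1, t2, postings):
    # Documents containing both terms (set intersection).
    return sorted(postings.get(t1, set()) & postings.get(t2, set()))

def boolean_or(t1, t2, postings):
    # Documents containing either term (set union).
    return sorted(postings.get(t1, set()) | postings.get(t2, set()))
```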

  11. Inverted Files • Lots of alternative implementations • E.g.: Cheshire builds within-document frequency using a hash table during document parsing. Then Document IDs and frequency info are stored in a BerkeleyDB B-tree index keyed by the term.

  12. Btree (conceptual) F | | P | | Z | B | | D | | F | H | | L | | P | R | | S | | Z | Devils Hawkeyes Hoosiers Minors Panthers Seminoles Aces Boilers Cars Flyers

  13. Btree with Postings F | | P | | Z | B | | D | | F | H | | L | | P | R | | S | | Z | Devils Hawkeyes Hoosiers Minors Panthers Seminoles Aces Boilers Cars Flyers 2,4,8,12 2,4,8,12 2,4,8,12 2,4,8,12 8,120 2,4,8,12 2,4,8,12 5, 7, 200 2,4,8,12 2,4,8,12

  14. Inverted files • Permit fast search for individual terms • The search result for each term is a list of document IDs (and optionally, frequency and/or positional information) • These lists can be used to solve Boolean queries: • country: d1, d2 • manor: d2 • country and manor: d2

  15. Today • Review • IR Components • Inverted Files • IR Models • The Boolean Model • Fuzzy sets, Rubric, P-norm, etc.

  16. IR Models • Set Theoretic Models • Boolean • Fuzzy • Extended Boolean • Vector Models (Algebraic) • Probabilistic Models (probabilistic) • Others (e.g., neural networks, etc.)

  17. Boolean Model for IR • Based on Boolean Logic (Algebra of Sets). • Fundamental principles established by George Boole in the 1850’s • Deals with set membership and operations on sets • Set membership in IR systems is usually based on whether (or not) a document contains a keyword (term)

  18. Boolean Operations on Sets • Intersection – Boolean ‘AND’ • Union – Boolean ‘OR’ • Negation – Boolean ‘NOT’ • Usually means “AND NOT” in IR • Exclusive OR – ‘XOR’ – seldom used, • Instead

  19. Boolean Logic A B

  20. Query Languages • A way to express the query (formal expression of the information need) • Types: • Boolean • Natural Language • Stylized Natural Language • Form-Based (GUI)

  21. Simple query language: Boolean • Terms + Connectors • terms • words • normalized (stemmed) words • phrases • thesaurus terms • connectors • AND • OR • NOT • parentheses (for grouping operations)

  22. Boolean Queries • Cat • Cat OR Dog • Cat AND Dog • (Cat AND Dog) • (Cat AND Dog) OR Collar • (Cat AND Dog) OR (Collar AND Leash) • (Cat OR Dog) AND (Collar OR Leash)

  23. Boolean Queries • (Cat OR Dog) AND (Collar OR Leash) • Each of the following combinations works:

  24. Boolean Queries • (Cat OR Dog) AND (Collar OR Leash) • None of the following combinations works:

  25. Boolean Queries • Usually expressed as INFIX operators in IR • ((a AND b) OR (c AND b)) • NOT is UNARY PREFIX operator • ((a AND b) OR (c AND (NOT b))) • AND and OR can be n-ary operators • (a AND b AND c AND d) • Some rules - (De Morgan revisited) • NOT(a) AND NOT(b) = NOT(a OR b) • NOT(a) OR NOT(b)= NOT(a AND b) • NOT(NOT(a)) = a
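
The three rules at the end of this slide can be checked on concrete sets. The sketch below assumes a small hypothetical collection of ten document IDs as the universe, since NOT is only defined relative to the whole collection; the sets `a` and `b` are arbitrary.

```python
# Hypothetical collection: document IDs 1..10.
UNIVERSE = frozenset(range(1, 11))

def NOT(s):
    # Complement relative to the whole collection.
    return UNIVERSE - s

a = frozenset({1, 2, 3, 4})
b = frozenset({3, 4, 5, 6})

# NOT(a) AND NOT(b) = NOT(a OR b)
law1 = (NOT(a) & NOT(b)) == NOT(a | b)
# NOT(a) OR NOT(b) = NOT(a AND b)
law2 = (NOT(a) | NOT(b)) == NOT(a & b)
# NOT(NOT(a)) = a
law3 = NOT(NOT(a)) == a
```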

  26. Boolean Searching • Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete • Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P) • [Figure: Venn diagram of the four concepts Cracks (C), Beams (B), Width measurement (W), Prestressed concrete (P)]
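
The relaxed query on this slide is the OR of every 3-term AND drawn from the four concepts, which amounts to "at least three of the four terms match". A minimal sketch, assuming each document is represented as a set of single-letter concept codes (the function names are hypothetical):

```python
from itertools import combinations

TERMS = ["C", "B", "W", "P"]  # Cracks, Beams, Width measurement, Prestressed concrete

def relaxed_clauses(terms, k):
    # Every AND-clause of k terms; the relaxed query is the OR of these.
    return list(combinations(terms, k))

def matches_relaxed(doc_terms, terms, k):
    # True if at least one k-term AND-clause is fully satisfied.
    doc_terms = set(doc_terms)
    return any(set(clause) <= doc_terms for clause in combinations(terms, k))
```

With `k=4` the same function reproduces the strict formal query, so the relaxation is controlled by a single parameter.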

  27. Boolean Logic • [Figure: Venn diagram of three terms t1, t2, t3, with documents D1–D11 distributed over the eight minterm regions m1–m8]

  28. Precedence Ordering • In what order do we evaluate the components of the Boolean expression? • Parentheses are evaluated first • (a or b) and (c or d) • (a or (b and c) or d) • Usually start from the left and work right (in case of ties) • Usually (if there are no parentheses) • NOT before AND • AND before OR
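
One way to see this precedence ordering in action is a small hand-written recursive-descent evaluator: parentheses bind tightest, then NOT, then AND, then OR, with ties resolved left to right. The sketch assumes a well-formed query, case-insensitive operators, single-word terms, and a document given as a set of lowercase terms; the function names are hypothetical.

```python
import re

def tokenize(query):
    # Split into parentheses and whitespace-delimited words.
    return re.findall(r"\(|\)|[^\s()]+", query)

def evaluate(query, doc_terms):
    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos].upper() if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():  # lowest precedence
        value = parse_and()
        while peek() == "OR":
            advance()
            rhs = parse_and()  # always parse the right side, then combine
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "AND":
            advance()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():  # NOT, parentheses, and plain terms
        if peek() == "NOT":
            advance()
            return not parse_not()
        if peek() == "(":
            advance()
            value = parse_or()
            advance()  # consume the closing ")"
            return value
        return advance().lower() in doc_terms

    return parse_or()
```

For example, `a OR b AND c` against a document containing only `a` evaluates to true, because AND is grouped before OR; a strict left-to-right reading would give false.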

  29. Faceted Boolean Query • Strategy: break query into facets (polysemous with earlier meaning of facets) • conjunction of disjunctions: (a1 OR a2 OR a3) AND (b1 OR b2) AND (c1 OR c2 OR c3 OR c4) • each facet expresses a topic: (“rain forest” OR jungle OR amazon) AND (medicine OR remedy OR cure) AND (Smith OR Zhou)
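
A faceted query can be represented as a list of facets, each a list of alternative terms; a document matches when every facet has at least one hit. The sketch assumes the document is a set of lowercase terms in which phrases such as "rain forest" already appear as single entries.

```python
# The example facets from the slide, lowercased.
FACETS = [
    ["rain forest", "jungle", "amazon"],
    ["medicine", "remedy", "cure"],
    ["smith", "zhou"],
]

def matches_faceted(doc_terms, facets):
    # AND across facets, OR within each facet.
    return all(any(term in doc_terms for term in facet) for facet in facets)
```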

  30. Ordering of Retrieved Documents • Pure Boolean has no ordering • In practice: • order chronologically • order by total number of “hits” on query terms • What if one term has more hits than others? • Is it better to have one of each term or many of one term? • Fancier methods have been investigated • p-norm is most famous • usually impractical to implement • usually hard for user to understand

  31. Faceted Boolean Query • Query still fails if one facet missing • Alternative: • Coordination level ranking • Order results in terms of how many facets (disjuncts) are satisfied • Also called Quorum ranking, Overlap ranking, and Best Match • Problem: Facets still undifferentiated • Alternative: • Assign weights to facets
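
Coordination level ranking can be sketched by counting how many facets each document satisfies and sorting on that count. The data layout (a dict from document ID to a set of terms) and the tie-breaking by document ID are assumptions for this example; documents that satisfy no facet are dropped.

```python
def coordination_level(doc_terms, facets):
    # Number of facets with at least one matching term.
    return sum(1 for facet in facets if any(t in doc_terms for t in facet))

def rank_by_coordination(docs, facets):
    # Highest coordination level first; ties broken by document ID.
    scored = [(coordination_level(terms, facets), doc_id)
              for doc_id, terms in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]
```

Unlike the strict faceted query, a document missing one facet is still returned, only lower in the list.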

  32. Boolean Processing • Boolean Processing (classic Boolean) • Data structures for Query representation and Boolean Operations • Boolean processing logic and algorithms • Extended Boolean Models • Fuzzy Logic • Others

  33. Boolean Processing • All processing takes place on postings lists • Different methods can be used for sorted or unsorted postings lists

  34. Boolean Query Processing • The query must be parsed to determine what the search words, optional field or index qualifications, and Boolean operators are, and how they relate to one another • Typical parsing uses lexical analysers (like lex or flex) along with parser generators like YACC, BISON or Llgen • These produce code to be compiled into programs. • Example…

  35. Z39.50 Query Structure (ASN-1 Notation) -- Query Definitions Query ::= CHOICE{ type-0 [0] ANY, type-1 [1] IMPLICIT RPNQuery, type-2 [2] OCTET STRING, type-100 [100] OCTET STRING, type-101 [101] IMPLICIT RPNQuery, type-102 [102] OCTET STRING}

  36. Z39.50 RPN Query (ASN-1 Notation) -- Definitions for RPN query RPNQuery ::= SEQUENCE{ attributeSet AttributeSetId, rpn RPNStructure}

  37. RPN Structure RPNStructure ::= CHOICE{ op [0] Operand, rpnRpnOp [1] IMPLICIT SEQUENCE{ rpn1 RPNStructure, rpn2 RPNStructure, op Operator } }

  38. Operand Operand ::= CHOICE{ attrTerm AttributesPlusTerm, resultSet ResultSetId, -- If version 2 is in force: -- - If query type is 1, one of the above two must be chosen; -- - resultAttr (below) may be used only if query type is 101. resultAttr ResultSetPlusAttributes}

  39. Operator Operator ::= [46] CHOICE{ and [0] IMPLICIT NULL, or [1] IMPLICIT NULL, and-not [2] IMPLICIT NULL, -- If version 2 is in force: -- - For query type 1, one of the above three must be chosen; -- - prox (below) may be used only if query type is 101. prox [3] IMPLICIT ProximityOperator}

  40. Parse Result (Query Tree) • Z39.50 queries… Title XXX and Subject YYY Operator: AND right left Operand: Index = Subject Value = YYY Operand: Index = Title Value = XXX

  41. Parse Results • Subject XXX and (title yyy and author zzz) Op: AND Oper: Index: Subject Value: XXX Op: AND Oper: Index: Title Value: YYY Oper: Index: Author Value: ZZZ
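
A query tree like the ones on slides 40 and 41 can be evaluated recursively: operands look up a postings set, operators combine the results of their two subtrees. The classes below loosely mirror the Operand/Operator split of the RPN structure but are hypothetical, as are the postings values; this is not the Z39.50 or Cheshire data model.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Operand:
    index: str   # e.g. "subject", "title", "author"
    value: str   # the search term

@dataclass
class Operator:
    op: str      # "and", "or", or "and-not"
    left: "Node"
    right: "Node"

Node = Union[Operand, Operator]

def evaluate_tree(node, postings):
    # Leaves return the postings set for (index, value).
    if isinstance(node, Operand):
        return set(postings.get((node.index, node.value), ()))
    left = evaluate_tree(node.left, postings)
    right = evaluate_tree(node.right, postings)
    if node.op == "and":
        return left & right
    if node.op == "or":
        return left | right
    if node.op == "and-not":
        return left - right
    raise ValueError(f"unknown operator: {node.op}")

# Slide 41: Subject XXX and (Title YYY and Author ZZZ)
tree = Operator("and",
                Operand("subject", "xxx"),
                Operator("and", Operand("title", "yyy"), Operand("author", "zzz")))

# Made-up postings for illustration only.
example_postings = {
    ("subject", "xxx"): {1, 2, 3},
    ("title", "yyy"): {2, 3, 4},
    ("author", "zzz"): {3, 5},
}
```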

  42. Boolean AND (Sorted) Algorithm • Choose the shortest list (why?) • Create new list the same length as the short list • For each item in the short list • Compare next item in longer list • If greater than – go to next item in longer list • If equal - add to new list and go to next item in both lists • If less than - go to next item in short list
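
The sorted AND steps above can be sketched as a two-pointer merge. The sketch assumes both postings lists are sorted in ascending order with no duplicates; it grows the result list as needed instead of preallocating it, which is a simplification of the slide's description. Choosing the shorter list bounds the size of the result.

```python
def and_sorted(a, b):
    # Walk the shorter list against the longer one.
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    result = []
    i = j = 0
    while i < len(short) and j < len(long_):
        if short[i] > long_[j]:
            j += 1                      # advance in the longer list
        elif short[i] == long_[j]:
            result.append(short[i])     # match: keep it, advance both
            i += 1
            j += 1
        else:
            i += 1                      # advance in the short list
    return result
```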

  43. Boolean AND Algorithm = AND

  44. Boolean OR (Sorted) Algorithm • Choose the longer list • Create new list the same length as both lists combined • For each item in the longer list • If less than or equal to the first item in the short list • Add to new list • Otherwise • Add item from short list • Compare next items in short and long lists • If long item less than short item add long item and go to next long item • Otherwise – add from short list and go to next short item • Once the short list runs out, add the remaining items in the long list
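
A sorted OR is essentially the merge step of merge sort. The sketch below assumes both lists are sorted ascending, and it emits a document ID that appears in both lists only once; that de-duplication is an assumption of this example, since the slide does not spell out how equal items are handled.

```python
def or_sorted(a, b):
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            result.append(a[i])
            i += 1
        elif a[i] > b[j]:
            result.append(b[j])
            j += 1
        else:
            result.append(a[i])   # same ID in both lists: emit once
            i += 1
            j += 1
    # One list has run out; append whatever remains of the other.
    result.extend(a[i:])
    result.extend(b[j:])
    return result
```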

  45. Boolean OR Algorithm = OR

  46. Boolean AND NOT (Sorted) Algorithm • Create new list the same length as the left-hand list • For each item in the left-hand list • Compare with next item in NOT list • If less than – add to new list and go to next item in left-hand list • If equal - go to next item in both lists • If greater than - go to next item in NOT list • Once the NOT list runs out, add the remaining items in the left-hand list
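
A sorted AND NOT keeps every left-hand item that does not occur in the NOT list. The sketch assumes both lists are sorted ascending and compares the current left-hand item against the current NOT item; the function name is hypothetical.

```python
def and_not_sorted(left, not_list):
    result = []
    i = j = 0
    while i < len(left):
        if j >= len(not_list) or left[i] < not_list[j]:
            result.append(left[i])   # cannot appear later in the NOT list
            i += 1
        elif left[i] == not_list[j]:
            i += 1                   # excluded: skip in both lists
            j += 1
        else:
            j += 1                   # NOT item is smaller: advance it
    return result
```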

  47. Boolean AND NOT Algorithm = AND NOT

  48. Hashed Boolean AND (unsorted) • Put each item in shortest list into hash table • For each item in other lists • If hash entry exists, set flag in hash table entry (or increment counter) • Scan hash table contents • If flag set (or counter == number of lists) add to new list
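
The hashed AND can be sketched with a Python dict standing in for the hash table, using the counter variant so it works for any number of lists. The sketch assumes at least one list is passed; wrapping each list in `set()` guards against a document ID being counted twice within one list.

```python
def hashed_and(*lists):
    # Seed the table from the shortest list; each entry starts at count 1.
    shortest_idx = min(range(len(lists)), key=lambda k: len(lists[k]))
    table = {doc_id: 1 for doc_id in lists[shortest_idx]}
    # Count, for every other list, which table entries it also contains.
    for idx, lst in enumerate(lists):
        if idx == shortest_idx:
            continue
        for doc_id in set(lst):
            if doc_id in table:
                table[doc_id] += 1
    # Keep only entries seen in every list.
    return [doc_id for doc_id, count in table.items() if count == len(lists)]
```

The result follows the order of the shortest input list, so sort it if later steps expect sorted postings.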

  49. Hashed Boolean OR (unsorted) • Put each item in EACH list into hash table • If match increment counter (optional) • Scan hash table contents and add to new list
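
A hashed OR is the simplest of the three: every item from every list goes into the table, and the keys are the answer. The optional counter from the slide is kept here as the dict value; the function names are hypothetical.

```python
def hashed_or_counts(*lists):
    # Map each document ID to the number of times it was seen.
    table = {}
    for lst in lists:
        for doc_id in lst:
            table[doc_id] = table.get(doc_id, 0) + 1
    return table

def hashed_or(*lists):
    # The union is simply the set of keys in the table.
    return list(hashed_or_counts(*lists))
```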

  50. Hashed Boolean AND NOT (unsorted) • Put each item in left-hand list into hash table • For each item in NOT list • If hash entry exists, remove it • Scan hash table contents and add to new list
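
The hashed AND NOT follows the same pattern, again with a dict standing in for the hash table: load the left-hand list, delete anything found in the NOT list, and return what is left. The function name is hypothetical.

```python
def hashed_and_not(left, not_list):
    # Load the left-hand list into the table (values are unused).
    table = dict.fromkeys(left)
    # Remove every entry that also occurs in the NOT list.
    for doc_id in not_list:
        table.pop(doc_id, None)
    return list(table)
```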
