Lecture 5: Boolean and Extended Boolean • Principles of Information Retrieval • Prof. Ray Larson, University of California, Berkeley, School of Information
Today • Review • IR Components • Inverted Files • IR Models • The Boolean Model • Fuzzy sets, Rubric, P-norm, etc.
Structure of an IR System (adapted from Soergel, p. 19) • Search line: Interest profiles & queries → Formulating query in terms of descriptors → Storage of profiles → Store 1: Profiles / search requests • Storage line: Documents & data → Indexing (descriptive and subject) → Storage of documents → Store 2: Document representations • Store 1 and Store 2 feed Comparison / Matching → Potentially relevant documents • Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language)
Boolean Implementation: Inverted Files • We will look at “Vector files” in detail later. But conceptually, an Inverted File is a vector file “inverted” so that rows become columns and columns become rows
How Inverted Files are Created • Documents are parsed to extract words (or stems) and these are saved with the Document ID. • Doc 1: Now is the time for all good men to come to the aid of their country • Doc 2: It was a dark and stormy night in the country manor. The time was past midnight • [Figure: text processing steps]
How Inverted Files are Created • After all documents have been parsed, the inverted file is sorted
How Inverted Files are Created • Multiple term entries for a single document are merged and frequency information added
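The three steps above (parse, sort, merge) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular system: the function names `parse` and `build_inverted_file` are made up for the example, and the "parser" simply lowercases the text and keeps runs of letters, with no stemming or stopword removal.

```python
import re
from itertools import groupby

# The two example documents from the slide.
docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

def parse(doc_id, text):
    """Step 1: extract lowercase words and pair each one with its document ID."""
    return [(word, doc_id) for word in re.findall(r"[a-z]+", text.lower())]

def build_inverted_file(docs):
    # Step 1: parse every document into (term, doc_id) pairs.
    pairs = [pair for doc_id, text in docs.items() for pair in parse(doc_id, text)]
    # Step 2: sort by term, then by document ID.
    pairs.sort()
    # Step 3: merge repeated (term, doc_id) entries and record the frequency.
    return [(term, doc_id, len(list(group)))
            for (term, doc_id), group in groupby(pairs)]

inverted = build_inverted_file(docs)
for entry in inverted:
    print(entry)
```

Each printed triple is (term, document ID, within-document frequency); for instance, "the" occurs twice in each of the two documents.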
Inverted Files • The file is commonly split into a Dictionary and a Postings file
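One possible way to picture the split: the dictionary holds one entry per term with a pointer into the postings file, and the postings file holds the (document ID, frequency) entries back to back. The layout below (an offset plus a document count per term) is an assumption chosen for illustration; real systems differ in what they store in each file.

```python
# Merged (term, doc_id, frequency) triples, sorted by term and then by doc_id.
# This is an illustrative excerpt, not the full file for the example documents.
inverted = [
    ("country", 1, 1), ("country", 2, 1),
    ("manor", 2, 1),
    ("time", 1, 1), ("time", 2, 1),
]

def split_inverted_file(inverted):
    """Split merged triples into a dictionary and a flat postings file."""
    dictionary = {}  # term -> (offset into postings, number of documents)
    postings = []    # flat list of (doc_id, frequency) entries
    for term, doc_id, freq in inverted:
        if term not in dictionary:
            dictionary[term] = (len(postings), 0)
        offset, n_docs = dictionary[term]
        dictionary[term] = (offset, n_docs + 1)
        postings.append((doc_id, freq))
    return dictionary, postings

def lookup(term, dictionary, postings):
    """Return the postings for one term, or an empty list if it is unknown."""
    if term not in dictionary:
        return []
    offset, n_docs = dictionary[term]
    return postings[offset:offset + n_docs]

dictionary, postings = split_inverted_file(inverted)
print(dictionary)
print(lookup("manor", dictionary, postings))
```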
Inverted files • Permit fast search for individual terms • The search result for each term is a list of document IDs (and optionally, frequency and/or positional information) • These lists can be used to solve Boolean queries: • country: d1, d2 • manor: d2 • country AND manor: d2
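The "country AND manor" example maps directly onto set operations. The sketch below treats each postings list as a Python set, which is a simplification: it ignores frequency and positional information.

```python
# Postings for the two terms in the slide's example, as sets of document IDs.
postings = {
    "country": {"d1", "d2"},
    "manor": {"d2"},
}

country_and_manor = postings["country"] & postings["manor"]  # intersection
country_or_manor = postings["country"] | postings["manor"]   # union
country_not_manor = postings["country"] - postings["manor"]  # AND NOT

print(sorted(country_and_manor))
print(sorted(country_or_manor))
print(sorted(country_not_manor))
```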
Inverted Files • Lots of alternative implementations • E.g.: Cheshire builds within-document frequency using a hash table during document parsing. Then Document IDs and frequency info are stored in a BerkeleyDB B-tree index keyed by the term.
Btree (conceptual) • [Figure: B-tree with root keys F, P, Z; second-level nodes with keys B, D, F / H, L, P / R, S, Z; leaf entries Aces, Boilers, Cars, Devils, Flyers, Hawkeyes, Hoosiers, Minors, Panthers, Seminoles]
Btree with Postings • [Figure: the same B-tree, with a postings list of document IDs (e.g., 2, 4, 8, 12) attached to each leaf entry]
Today • Review • IR Components • Inverted Files • IR Models • The Boolean Model • Fuzzy sets, Rubric, P-norm, etc.
IR Models • Set Theoretic Models • Boolean • Fuzzy • Extended Boolean • Vector Models (Algebraic) • Probabilistic Models (probabilistic) • Others (e.g., neural networks, etc.)
Boolean Model for IR • Based on Boolean Logic (Algebra of Sets). • Fundamental principles established by George Boole in the 1850s • Deals with set membership and operations on sets • Set membership in IR systems is usually based on whether (or not) a document contains a keyword (term)
Boolean Operations on Sets • Intersection – Boolean ‘AND’ (A ∩ B) • Union – Boolean ‘OR’ (A ∪ B) • Negation – Boolean ‘NOT’ (¬A) • Usually means “AND NOT” in IR • Exclusive OR – ‘XOR’ – seldom used, • Instead
Boolean Logic • [Figure: Venn diagram of sets A and B]
Query Languages • A way to express the query (formal expression of the information need) • Types: • Boolean • Natural Language • Stylized Natural Language • Form-Based (GUI)
Simple query language: Boolean • Terms + Connectors • terms • words • normalized (stemmed) words • phrases • thesaurus terms • connectors • AND • OR • NOT • parentheses (for grouping operations)
Boolean Queries • Cat • Cat OR Dog • Cat AND Dog • (Cat AND Dog) • (Cat AND Dog) OR Collar • (Cat AND Dog) OR (Collar AND Leash) • (Cat OR Dog) AND (Collar OR Leash)
Boolean Queries • (Cat OR Dog) AND (Collar OR Leash) • Each of the following combinations works:
Boolean Queries • (Cat OR Dog) AND (Collar OR Leash) • None of the following combinations works:
Boolean Queries • Usually expressed with infix operators in IR • ((a AND b) OR (c AND b)) • NOT is a unary prefix operator • ((a AND b) OR (c AND (NOT b))) • AND and OR can be n-ary operators • (a AND b AND c AND d) • Some rules (De Morgan revisited) • NOT(a) AND NOT(b) = NOT(a OR b) • NOT(a) OR NOT(b) = NOT(a AND b) • NOT(NOT(a)) = a
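The three rules can be checked on concrete sets. In the sketch below, `ALL_DOCS`, `a`, and `b` are hypothetical document-ID sets invented for the example, and NOT is taken as the complement relative to the whole collection.

```python
# Hypothetical collection of eight documents and two postings sets.
ALL_DOCS = set(range(1, 9))
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}

def NOT(docs):
    """Complement of a set of documents relative to the whole collection."""
    return ALL_DOCS - docs

rule_1 = (NOT(a) & NOT(b)) == NOT(a | b)   # NOT(a) AND NOT(b) = NOT(a OR b)
rule_2 = (NOT(a) | NOT(b)) == NOT(a & b)   # NOT(a) OR NOT(b) = NOT(a AND b)
rule_3 = NOT(NOT(a)) == a                  # NOT(NOT(a)) = a

print(rule_1, rule_2, rule_3)
```

Checking one pair of sets is not a proof, but it makes the identities easy to experiment with.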
Boolean Searching • Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete • Facets: Cracks (C), Beams (B), Width measurement (W), Prestressed concrete (P) • Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
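The relaxed query is "any three of the four terms". A sketch of how it could be generated rather than written out by hand follows; the postings sets are invented for the example and the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical postings (sets of document IDs) for the four query terms.
postings = {
    "cracks": {1, 2, 3, 5},
    "beams": {1, 2, 4, 5},
    "width_measurement": {1, 3, 4},
    "prestressed_concrete": {1, 2, 3, 4},
}

def strict_and(terms):
    """Formal query: every term must be present."""
    return set.intersection(*(postings[t] for t in terms))

def relaxed(terms, k):
    """Relaxed query: OR together the AND of every k-term combination."""
    result = set()
    for combo in combinations(terms, k):
        result |= strict_and(combo)
    return result

terms = list(postings)
print(sorted(strict_and(terms)))   # documents containing all four terms
print(sorted(relaxed(terms, 3)))   # documents containing at least three terms
```

With this data only document 1 satisfies the formal query, while documents 1 to 4 satisfy the relaxed one; document 5 has just two of the terms and is still excluded.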
Boolean Logic • [Figure: Venn diagram of three terms t1, t2, t3, with documents D1–D11 placed in the eight regions m1–m8, each region a conjunction of t1, t2, t3 or their negations]
Precedence Ordering • In what order do we evaluate the components of the Boolean expression? • Parentheses get done first • (a or b) and (c or d) • (a or (b and c) or d) • Usually start from the left and work right (in case of ties) • Usually (if there are no parentheses) • NOT before AND • AND before OR
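These precedence rules are what a small recursive-descent evaluator encodes: one function per precedence level, with the loosest-binding operator (OR) at the top. The sketch below is an illustration only; `POSTINGS` and `ALL_DOCS` are hypothetical data, NOT is taken as the complement relative to `ALL_DOCS`, and there is no error handling for malformed queries.

```python
import re

# Hypothetical postings and document collection.
POSTINGS = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4}, "d": {4, 5}}
ALL_DOCS = {1, 2, 3, 4, 5}

def evaluate(query):
    tokens = re.findall(r"\(|\)|\w+", query)
    pos = 0

    def peek():
        return tokens[pos].upper() if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def parse_or():            # lowest precedence: evaluated last
        result = parse_and()
        while peek() == "OR":  # the loop gives left-to-right order for ties
            advance()
            result = result | parse_and()
        return result

    def parse_and():           # AND binds tighter than OR
        result = parse_not()
        while peek() == "AND":
            advance()
            result = result & parse_not()
        return result

    def parse_not():           # NOT and parentheses bind tightest
        if peek() == "NOT":
            advance()
            return ALL_DOCS - parse_not()
        if peek() == "(":
            advance()
            result = parse_or()
            advance()          # consume the closing ")"
            return result
        return POSTINGS.get(advance().lower(), set())

    return parse_or()

print(sorted(evaluate("a OR b AND c")))     # AND is applied before OR
print(sorted(evaluate("(a OR b) AND c")))   # parentheses override that
print(sorted(evaluate("NOT a AND b")))      # NOT is applied before AND
```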
Faceted Boolean Query • Strategy: break query into facets (polysemous with earlier meaning of facets) • conjunction of disjunctions: (a1 OR a2 OR a3) AND (b1 OR b2) AND (c1 OR c2 OR c3 OR c4) • each facet expresses a topic: (“rain forest” OR jungle OR amazon) AND (medicine OR remedy OR cure) AND (Smith OR Zhou)
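A faceted query is a union within each facet followed by an intersection across facets. The sketch below uses the slide's three facets with made-up postings; phrase matching for "rain forest" is assumed to have been done already, so the phrase is just another key.

```python
# The three facets from the slide; each inner list is one OR-group.
facets = [
    ["rain forest", "jungle", "amazon"],
    ["medicine", "remedy", "cure"],
    ["smith", "zhou"],
]

# Hypothetical postings (sets of document IDs) for each term or phrase.
postings = {
    "rain forest": {1}, "jungle": {2, 3}, "amazon": {4},
    "medicine": {1, 2}, "remedy": {5}, "cure": {3},
    "smith": {1, 3, 5}, "zhou": {4},
}

def facet_docs(facet):
    """OR within a facet: union of the postings of its terms."""
    result = set()
    for term in facet:
        result |= postings.get(term, set())
    return result

def faceted_and(facets):
    """AND across facets: intersection of the per-facet unions."""
    return set.intersection(*(facet_docs(f) for f in facets))

print(sorted(faceted_and(facets)))
```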
Ordering of Retrieved Documents • Pure Boolean has no ordering • In practice: • order chronologically • order by total number of “hits” on query terms • What if one term has more hits than others? • Is it better to have one of each term or many of one term? • Fancier methods have been investigated • p-norm is most famous • usually impractical to implement • usually hard for user to understand
Faceted Boolean Query • Query still fails if one facet missing • Alternative: • Coordination level ranking • Order results in terms of how many facets (disjuncts) are satisfied • Also called Quorum ranking, Overlap ranking, and Best Match • Problem: Facets still undifferentiated • Alternative: • Assign weights to facets
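Coordination level ranking replaces the all-or-nothing AND across facets with a count of satisfied facets, and facet weights turn that count into a weighted sum. The sketch below is one way to do this under assumed data: the postings, the `weights` parameter, and the tie-break by document ID are all choices made for the example.

```python
# The slide's facets with hypothetical postings (sets of document IDs).
facets = [
    ["rain forest", "jungle", "amazon"],
    ["medicine", "remedy", "cure"],
    ["smith", "zhou"],
]
postings = {
    "rain forest": {1}, "jungle": {2, 3}, "amazon": {4},
    "medicine": {1, 2}, "remedy": {5}, "cure": {3},
    "smith": {1, 3, 5}, "zhou": {4},
}

def coordination_rank(facets, weights=None):
    """Score each document by the (weighted) number of facets it satisfies."""
    weights = weights or [1] * len(facets)
    scores = {}
    for facet, weight in zip(facets, weights):
        # A facet is satisfied if the document matches any of its terms.
        matched = set().union(*(postings.get(term, set()) for term in facet))
        for doc in matched:
            scores[doc] = scores.get(doc, 0) + weight
    # Highest score first; ties broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

print(coordination_rank(facets))                     # unweighted
print(coordination_rank(facets, weights=[2, 1, 1]))  # first facet counts double
```

Documents that miss one facet are no longer dropped; they simply rank below the documents that satisfy all three.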
Boolean Processing • Boolean Processing (classic Boolean) • Data structures for Query representation and Boolean Operations • Boolean processing logic and algorithms • Extended Boolean Models • Fuzzy Logic • Others
Boolean Processing • All processing takes place on postings lists • Different methods can be used for sorted or unsorted postings lists
Boolean Query Processing • The query must be parsed to determine what the following are and how they relate to one another: • Search words • Optional field or index qualifications • Boolean operators • Typical parsing uses lexical analyzers (like lex or flex) along with parser generators like YACC, Bison, or LLgen • These produce code to be compiled into programs. • Example…
Z39.50 Query Structure (ASN.1 Notation)
    -- Query Definitions
    Query ::= CHOICE{
        type-0   [0]   ANY,
        type-1   [1]   IMPLICIT RPNQuery,
        type-2   [2]   OCTET STRING,
        type-100 [100] OCTET STRING,
        type-101 [101] IMPLICIT RPNQuery,
        type-102 [102] OCTET STRING}
Z39.50 RPN Query (ASN.1 Notation)
    -- Definitions for RPN query
    RPNQuery ::= SEQUENCE{
        attributeSet AttributeSetId,
        rpn          RPNStructure}
RPN Structure
    RPNStructure ::= CHOICE{
        op       [0] Operand,
        rpnRpnOp [1] IMPLICIT SEQUENCE{
            rpn1 RPNStructure,
            rpn2 RPNStructure,
            op   Operator }
        }
Operand
    Operand ::= CHOICE{
        attrTerm   AttributesPlusTerm,
        resultSet  ResultSetId,
        -- If version 2 is in force:
        --   - If query type is 1, one of the above two must be chosen;
        --   - resultAttr (below) may be used only if query type is 101.
        resultAttr ResultSetPlusAttributes}
Operator
    Operator ::= [46] CHOICE{
        and     [0] IMPLICIT NULL,
        or      [1] IMPLICIT NULL,
        and-not [2] IMPLICIT NULL,
        -- If version 2 is in force:
        --   - For query type 1, one of the above three must be chosen;
        --   - prox (below) may be used only if query type is 101.
        prox    [3] IMPLICIT ProximityOperator}
Parse Result (Query Tree) • Z39.50 queries… • Title XXX and Subject YYY
    Operator: AND
        left:  Operand: Index = Title, Value = XXX
        right: Operand: Index = Subject, Value = YYY
Parse Results • Subject XXX and (title yyy and author zzz)
    Op: AND
        Oper: Index: Subject, Value: XXX
        Op: AND
            Oper: Index: Title, Value: YYY
            Oper: Index: Author, Value: ZZZ
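A parse result like this is a binary tree whose leaves are operands (index plus value) and whose inner nodes are operators, so it can be evaluated bottom-up. The classes below are a hypothetical in-memory representation, loosely mirroring the Operand / Operator split in the RPN structure; they are not a Z39.50 library API, and the `INDEXES` data is invented.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Operand:
    index: str   # e.g. "subject", "title", "author"
    value: str   # the search term

@dataclass
class Operator:
    op: str      # "and", "or", or "and-not"
    left: "Node"
    right: "Node"

Node = Union[Operand, Operator]

# Hypothetical per-index postings: (index, value) -> set of document IDs.
INDEXES = {
    ("subject", "xxx"): {1, 2, 3},
    ("title", "yyy"): {2, 3, 4},
    ("author", "zzz"): {3, 5},
}

def evaluate(node):
    """Evaluate a query tree bottom-up, returning a set of document IDs."""
    if isinstance(node, Operand):
        return INDEXES.get((node.index, node.value), set())
    left, right = evaluate(node.left), evaluate(node.right)
    if node.op == "and":
        return left & right
    if node.op == "or":
        return left | right
    if node.op == "and-not":
        return left - right
    raise ValueError(f"unknown operator: {node.op}")

# Subject XXX and (title yyy and author zzz)
tree = Operator("and",
                Operand("subject", "xxx"),
                Operator("and", Operand("title", "yyy"), Operand("author", "zzz")))
print(sorted(evaluate(tree)))
```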
Boolean AND (Sorted) Algorithm • Choose the shortest list (why?) • Create new list the same length as the short list • For each item in the short list • Compare next item in longer list • If greater than – go to next item in longer list • If equal - add to new list and go to next item in both lists • If less than - go to next item in short list
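The steps above translate almost line for line into a two-pointer merge. This is a sketch that assumes both postings lists are sorted in ascending order and contain no duplicates; the function name is made up. Picking the shorter list as the driver is useful because the result can never be longer than it, which bounds the size of the new list.

```python
def and_sorted(list_a, list_b):
    """Intersect two sorted postings lists with a two-pointer merge."""
    short, long_ = sorted((list_a, list_b), key=len)
    result = []
    i = j = 0
    while i < len(short) and j < len(long_):
        if short[i] > long_[j]:
            j += 1                    # greater than: next item in longer list
        elif short[i] == long_[j]:
            result.append(short[i])   # equal: keep it, advance both lists
            i += 1
            j += 1
        else:
            i += 1                    # less than: next item in short list
    return result

print(and_sorted([2, 4, 8, 12], [1, 2, 3, 5, 8, 13, 21]))
```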
Boolean AND Algorithm • [Figure: two postings lists combined with AND]
Boolean OR (Sorted) Algorithm • Choose the longer list • Create new list the same length as both lists combined • Compare the next items in the short and long lists • If the long item is less than the short item – add the long item and go to the next long item • If equal – add the item once and go to the next item in both lists • Otherwise – add the short item and go to the next short item • Once either list runs out, add the remaining items in the other list
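A sorted OR is the classic merge of two sorted lists, with one extra rule so that a document ID present in both lists is emitted only once. The sketch assumes ascending, duplicate-free input lists; the function name is made up.

```python
def or_sorted(list_a, list_b):
    """Merge two sorted postings lists, keeping each document ID once."""
    result = []
    i = j = 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] < list_b[j]:
            result.append(list_a[i])
            i += 1
        elif list_a[i] > list_b[j]:
            result.append(list_b[j])
            j += 1
        else:
            result.append(list_a[i])  # in both lists: add once, advance both
            i += 1
            j += 1
    # One list has run out; copy whatever remains of the other.
    result.extend(list_a[i:])
    result.extend(list_b[j:])
    return result

print(or_sorted([1, 3, 5, 7], [2, 3, 6]))
```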
Boolean OR Algorithm • [Figure: two postings lists combined with OR]
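Boolean AND NOT (Sorted) Algorithm • Create new list the same length as the left-hand list • For each item in the left-hand list • Compare it with the next item in the NOT list • If the left-hand item is less than the NOT item – add it to the new list and go to the next item in the left-hand list • If equal – go to the next item in both lists • If greater than – go to the next item in the NOT list • Once the NOT list runs out, add the remaining items in the left-hand list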
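A sorted AND NOT walks both lists together and keeps a left-hand item only once it is known that the NOT list cannot contain it. The sketch assumes ascending, duplicate-free lists; the function name is made up.

```python
def and_not_sorted(left, not_list):
    """Return the items of `left` that do not appear in `not_list`."""
    result = []
    i = j = 0
    while i < len(left):
        if j >= len(not_list) or left[i] < not_list[j]:
            # No remaining NOT item can match this one, so keep it.
            result.append(left[i])
            i += 1
        elif left[i] == not_list[j]:
            i += 1   # excluded: skip it in both lists
            j += 1
        else:
            j += 1   # NOT item is smaller: advance the NOT list
    return result

print(and_not_sorted([1, 2, 3, 5, 8], [2, 4, 8, 12]))
```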
Boolean AND NOT Algorithm • [Figure: two postings lists combined with AND NOT]
Hashed Boolean AND (unsorted) • Put each item in shortest list into hash table • For each item in other lists • If hash entry exists, set flag in hash table entry (or increment counter) • Scan hash table contents • If flag set (or counter == number of lists) add to new list
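The counter variant of the hashed AND could look like the sketch below, with a Python dict standing in for the hash table. The function name is made up, at least one list is assumed to be passed, and each other list is wrapped in `set()` so that a document ID repeated within one list is not counted twice.

```python
def hashed_and(*lists):
    """Intersect any number of unsorted postings lists using a hash table."""
    # Put each item of the shortest list into the table with a count of 1.
    shortest = min(range(len(lists)), key=lambda k: len(lists[k]))
    counts = {item: 1 for item in lists[shortest]}
    # For each item in the other lists, bump the counter if the entry exists.
    for k, other in enumerate(lists):
        if k == shortest:
            continue
        for item in set(other):
            if item in counts:
                counts[item] += 1
    # Keep the entries that were seen in every list.
    return [item for item, n in counts.items() if n == len(lists)]

print(hashed_and([8, 2, 12, 4], [5, 8, 1, 2, 13], [2, 8, 9]))
```

The result comes back in the order of the shortest list, not sorted, which is the price of skipping the sort.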
Hashed Boolean OR (unsorted) • Put each item in EACH list into hash table • If match increment counter (optional) • Scan hash table contents and add to new list
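For OR, every item of every list goes into the table, and the optional counter records how many lists each document appeared in. A minimal sketch with a dict as the hash table (the function name is made up):

```python
def hashed_or(*lists):
    """Union of unsorted postings lists; returns {doc_id: match count}."""
    counts = {}
    for postings in lists:
        for item in postings:
            # A new item creates an entry; a match increments its counter.
            counts[item] = counts.get(item, 0) + 1
    return counts

counts = hashed_or([3, 1], [1, 7])
print(list(counts))  # the result list: every document seen in any list
print(counts)        # the optional counters
```

The counters come for free and could feed a simple ranking by number of matching terms.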
Hashed Boolean AND NOT (unsorted) • Put each item in left-hand list into hash table • For each item in NOT list • If hash entry exists, remove it • Scan hash table contents and add to new list
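The hashed AND NOT is the shortest of the three: load the left-hand list, delete whatever the NOT list mentions, and read out what is left. A sketch with a dict as the hash table (the function name is made up):

```python
def hashed_and_not(left, not_list):
    """Items of `left` that are absent from `not_list` (lists may be unsorted)."""
    table = dict.fromkeys(left)   # put each left-hand item into the table
    for item in not_list:
        table.pop(item, None)     # remove the entry if it exists
    return list(table)            # scan the table contents

print(hashed_and_not([5, 1, 8, 3], [8, 2, 5]))
```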