
CSE 535 Information Retrieval






Presentation Transcript


  1. CSE 535 Information Retrieval. Lecture 2: Boolean Retrieval Model

  2. Term-document incidence: 1 if play contains word, 0 otherwise.
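The incidence matrix can be sketched in a few lines; the plays and words here are a small illustrative sample, not the lecture's full example. Boolean queries become bitwise operations on the matrix rows:

```python
# A toy term-document incidence matrix (plays and word lists are illustrative).
plays = {
    "Antony and Cleopatra": "Antony Brutus Caesar Cleopatra",
    "Julius Caesar":        "Antony Brutus Caesar Calpurnia",
    "The Tempest":          "mercy worser",
    "Hamlet":               "Brutus Caesar mercy worser",
}

vocab = sorted({w for text in plays.values() for w in text.split()})
docs = list(plays)

# incidence[t][d] = 1 if play d contains word t, 0 otherwise
incidence = {t: [1 if t in plays[d].split() else 0 for d in docs] for t in vocab}

# Answer "Brutus AND Caesar AND NOT Calpurnia" by bitwise ops on the rows.
answer = [b & c & (1 - k) for b, c, k in
          zip(incidence["Brutus"], incidence["Caesar"], incidence["Calpurnia"])]
matches = [d for d, hit in zip(docs, answer) if hit]
print(matches)   # ['Antony and Cleopatra', 'Hamlet']
```

The matrix is tiny here, but for a realistic collection it is far too sparse to store explicitly, which is what motivates the inverted index on the next slides.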

  3. Inverted index: a Dictionary of terms, each pointing to a Postings list. Linked lists generally preferred to arrays: dynamic space allocation; insertion of terms into documents is easy; the cost is the space overhead of pointers. Example postings: Brutus → 2, 4, 8, 16, 32, 64, 128; Calpurnia → 1, 2, 3, 5, 8, 13, 21, 34; Caesar → 13, 16. Sorted by docID (more later on why).

  4. Inverted index construction. Documents to be indexed ("Friends, Romans, countrymen.") → Tokenizer → Token stream (Friends Romans Countrymen) → Linguistic modules (more on these later) → Modified tokens (friend roman countryman) → Indexer → Inverted index (e.g., friend → 2, 4; roman → 1, 2; countryman → 13, 16).

  5. Indexer steps: produce a sequence of (Modified token, Document ID) pairs. Doc 1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me." Doc 2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious."

  6. Sort by terms. Core indexing step.

  7. Multiple term entries in a single document are merged. Frequency information is added. Why frequency? Will discuss later.

  8. The result is split into a Dictionary file and a Postings file.
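The indexer steps on slides 5-8 can be sketched end to end; the crude `tokenize` below is only a stand-in for the real linguistic modules:

```python
# Indexer sketch: (token, docID) pairs -> sort -> merge duplicates with
# frequencies -> split into dictionary and postings.
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    # crude normalization stand-in for the linguistic modules
    return [w.strip(".;,'").lower() for w in text.split() if w.strip(".;,'")]

# 1. Sequence of (modified token, docID) pairs
pairs = [(tok, doc_id) for doc_id, text in docs.items() for tok in tokenize(text)]

# 2. Sort by term, then docID (the core indexing step)
pairs.sort()

# 3. Merge multiple entries per document, recording term frequency
postings = defaultdict(list)          # term -> [(docID, term_freq), ...]
for term, doc_id in pairs:
    if postings[term] and postings[term][-1][0] == doc_id:
        d, tf = postings[term][-1]
        postings[term][-1] = (d, tf + 1)
    else:
        postings[term].append((doc_id, 1))

# 4. Dictionary (term -> document frequency) and the postings file
dictionary = {term: len(pl) for term, pl in postings.items()}
print(postings["caesar"])   # [(1, 1), (2, 2)]
```

Storing only document frequencies in the dictionary and the (docID, tf) pairs in the postings file is exactly the split slide 8 describes.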

  9. Where do we pay in storage? In the dictionary (terms) and in the postings (pointers). Will quantify the storage later.

  10. Today's focus: how do we process a Boolean query using the index we just built? Later: what kinds of queries can we process?

  11. Query processing. Consider processing the query: Brutus AND Caesar. Locate Brutus in the Dictionary; retrieve its postings (2, 4, 8, 16, 32, 64, 128). Locate Caesar in the Dictionary; retrieve its postings (1, 2, 3, 5, 8, 13, 21, 34). "Merge" the two postings: actually, intersect them.

  12. The merge. Walk through the two postings simultaneously, in time linear in the total number of postings entries: Brutus → 2, 4, 8, 16, 32, 64, 128; Caesar → 1, 2, 3, 5, 8, 13, 21, 34; intersection → 2, 8. If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID.

  13. Basic postings intersection
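The basic intersection is the standard two-pointer merge over docID-sorted lists; a minimal sketch:

```python
# Two-pointer intersection of docID-sorted postings lists, O(x+y) time.
def intersect(p1, p2):
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID in both lists: keep it
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]
```

Each step advances at least one pointer, so the loop runs at most x+y times; this only works because both lists are sorted by docID.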

  14. Boolean queries: Exact match. Queries use AND, OR and NOT together with query terms. Views each document as a set of words. Is precise: a document matches the condition or it does not. The primary commercial retrieval tool for 3 decades. Professional searchers (e.g., lawyers) still like Boolean queries: you know exactly what you're getting.

  15. Example: WestLaw (http://www.westlaw.com/). Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992). About 7 terabytes of data; 700,000 users. Majority of users still use Boolean queries. Example query: What is the statute of limitations in cases involving the federal tort claims act? LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM. Long, precise queries; proximity operators; incrementally developed; not like web search.

  16. More general merges. Exercise: adapt the merge for the queries: Brutus AND NOT Caesar; Brutus OR NOT Caesar. Can we still run through the merge in time O(x+y)?
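One possible answer for the AND NOT case (a sketch, not the lecture's official solution): keep docIDs that appear in the first list but not the second, still advancing two pointers in O(x+y) time.

```python
# "p1 AND NOT p2" over docID-sorted postings lists, still O(x+y).
def and_not(p1, p2):
    answer = []
    i = j = 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])   # this docID can no longer appear in p2
            i += 1
        elif p1[i] == p2[j]:
            i += 1                 # excluded by the NOT term
            j += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(and_not(brutus, caesar))     # [4, 16, 32, 64, 128]
```

OR NOT is the harder half of the exercise: "NOT Caesar" alone matches nearly every document, so the answer cannot be bounded by the two list lengths.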

  17. Merging. What about an arbitrary Boolean formula? (Brutus OR Caesar) AND NOT (Antony OR Cleopatra). Can we always merge in "linear" time? Linear in what? Can we do better?

  18. Query optimization. What is the best order for query processing? Consider a query that is an AND of t terms. For each of the t terms, get its postings, then AND them together. Query: Brutus AND Calpurnia AND Caesar. Postings: Brutus → 2, 4, 8, 16, 32, 64, 128; Calpurnia → 1, 2, 3, 5, 8, 16, 21, 34; Caesar → 13, 16.

  19. Query optimization example. Process in order of increasing freq: start with the smallest set, then keep cutting further. This is why we kept freq in the dictionary. Brutus → 2, 4, 8, 16, 32, 64, 128; Calpurnia → 1, 2, 3, 5, 8, 16, 21, 34; Caesar → 13, 16. Execute the query as (Caesar AND Brutus) AND Calpurnia.
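The frequency-ordered strategy can be sketched as follows, reusing the two-pointer intersection; since AND results only shrink, starting from the rarest term keeps every intermediate result small, and an empty intermediate result ends the query early:

```python
# Frequency-ordered conjunctive query processing (sketch).
def intersect(p1, p2):
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def intersect_terms(postings):
    """postings: term -> sorted docID list. AND all terms, rarest first."""
    terms = sorted(postings, key=lambda t: len(postings[t]))
    result = postings[terms[0]]
    for t in terms[1:]:
        if not result:
            break                  # empty intermediate result: stop early
        result = intersect(result, postings[t])
    return result

postings = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Calpurnia": [1, 2, 3, 5, 8, 16, 21, 34],
    "Caesar":    [13, 16],
}
print(intersect_terms(postings))   # [16]
```

Sorting by list length is exactly the "process in order of increasing freq" rule: Caesar (2 entries) is intersected with Brutus first, and only the tiny result is merged against Calpurnia.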

  20. Query optimization

  21. More general optimization. e.g., (madding OR crowd) AND (ignoble OR strife). Get freq's for all terms. Estimate the size of each OR by the sum of its freq's (conservative). Process in increasing order of OR sizes.
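The heuristic above amounts to sorting the OR groups by their summed frequencies; the frequency values below are made up for illustration:

```python
# Conservative ordering heuristic for ANDs of ORs (frequencies are hypothetical).
freqs = {"madding": 10, "crowd": 50, "ignoble": 5, "strife": 20}
query = [("madding", "crowd"), ("ignoble", "strife")]   # (a OR b) AND (c OR d)

# Upper-bound each OR group's result size by the sum of its term frequencies,
# then process groups in increasing order of that estimate.
order = sorted(query, key=lambda group: sum(freqs[t] for t in group))
print(order)   # [('ignoble', 'strife'), ('madding', 'crowd')]
```

The sum is conservative because an OR result can never be larger than the combined lengths of its postings lists, so the true order can only be at least this favorable.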

  22. Exercise: recommend a query processing order for (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes).

  23. Beyond Boolean term search. What about phrases? Proximity: find Gates NEAR Microsoft; the index needs to capture position information in docs (more later). Zones in documents: find documents with (author = Ullman) AND (text contains automata).

  24. Evidence accumulation. 1 vs. 0 occurrences of a search term; 2 vs. 1 occurrences; 3 vs. 2 occurrences, etc. Need term frequency information in docs. Used to compute a score for each document; matching documents are rank-ordered by this score.
