
Information Retrieval and Web Search


Presentation Transcript


  1. Information Retrieval and Web Search. Vasile Rus, PhD. vrus@memphis.edu, www.cs.memphis.edu/~vrus/teaching/ir-websearch/

  2. Outline: Indexing and Searching

  3. Searching
     - Sequential online text searching over the raw text: appropriate for small texts or fast-changing texts
     - Index structures for fast access: appropriate for large and semi-static collections
     - Hybrid methods: online + indexed search

  4. Indexing
     - Inverted files
     - Suffix arrays
     - Signature files

  5. IR System Architecture (diagram). The user's need enters through the User Interface and passes through Text Operations to form a logical view; Query Operations turn it into a query, which Searching runs against the index. Indexing, coordinated by the Database Manager, builds the inverted file over the Text Database. Retrieved docs are scored by Ranking, and ranked docs are returned to the user, whose feedback can drive further query operations.

  6. IR System Components
     - Text Operations form index words (tokens): tokenization, stopword removal, stemming
     - Indexing constructs an inverted index of word-to-document pointers: a mapping from keywords to document ids
     Example documents:
       Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
       Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
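As an illustration (not from the slides), a minimal keyword-to-document-id mapping for the two example documents could be sketched in Python as follows; stopword removal and stemming are omitted for brevity:

    from collections import defaultdict

    docs = {
        1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
        2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
    }

    index = defaultdict(set)  # keyword -> set of ids of documents containing it
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token.strip(".,;'")].add(doc_id)  # strip trailing punctuation

    print(sorted(index["caesar"]))  # [1, 2]
    print(sorted(index["brutus"]))  # [1, 2]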

  7. IR System Components
     - Searching retrieves documents that contain a given query token from the inverted index
     - Ranking scores all retrieved documents according to a relevance metric
     - User Interface manages interaction with the user: query input and document output, relevance feedback, visualization of results
     - Query Operations transform the query to improve retrieval: query expansion using a thesaurus, query transformation using relevance feedback

  8. Comments on Vector Space Models
     - Simple, mathematically based approach
     - Considers both local (tf) and global (idf) word importance measures
     - Provides partial matching and ranked results
     - Tends to work quite well in practice despite obvious weaknesses
     - Allows efficient implementation for large document collections

  9. Problems with the Vector Space Model
     - Missing semantic information (e.g. word sense)
     - Missing syntactic information (e.g. phrase structure, word order, proximity information)
     - Assumption of term independence (e.g. ignores synonymy)
     - Lacks the control of a Boolean model (e.g., requiring a term to appear in a document): given a two-term query "A B", it may prefer a document containing A frequently but not B over a document that contains both A and B, but both less frequently

  10. Naïve Implementation
     - Convert all documents in collection D to tf-idf weighted vectors, dj, over the keyword vocabulary V
     - Convert the query to a tf-idf weighted vector q
     - For each dj in D, compute the score sj = cosSim(dj, q)
     - Sort documents by decreasing score and present the top-ranked documents to the user
     - Time complexity: O(|V|·|D|), which is bad for large V and D: |V| = 10,000 and |D| = 100,000 give |V|·|D| = 1,000,000,000
     (A sketch of this naïve loop follows.)
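A minimal sketch of this naïve loop, assuming document and query vectors are plain dicts mapping terms to tf-idf weights (the helper names are illustrative, not from the slides):

    import math

    def cos_sim(d, q):
        # dot product over shared terms, normalized by the two vector lengths
        dot = sum(w * q.get(t, 0.0) for t, w in d.items())
        norm = math.sqrt(sum(w * w for w in d.values())) * \
               math.sqrt(sum(w * w for w in q.values()))
        return dot / norm if norm else 0.0

    def naive_retrieve(doc_vectors, q):
        # scores every document in D, even those sharing no terms with q
        scores = [(doc_id, cos_sim(d, q)) for doc_id, d in doc_vectors.items()]
        return sorted(scores, key=lambda pair: pair[1], reverse=True)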

  11. Practical Implementation
     - Based on the observation that documents containing none of the query keywords do not affect the final ranking
     - Try to identify only those documents that contain at least one query keyword
     - This is where an actual implementation of an inverted index comes in

  12. Step 1: Preprocessing
     - Implement the preprocessing functions: tokenization, stop word removal, stemming (map everything to lowercase; keep only sequences of alphabetical characters)
     - Input: documents that are read one by one from the collection
     - Output: tokens to be added to the index (no punctuation, no stop words, stemmed)
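A minimal preprocessing sketch under the slide's constraints (lowercase, sequences of alphabetical characters only). The stopword list and the stemmer are illustrative stubs; a real system would use a full stopword list and, e.g., the Porter stemmer:

    import re

    STOPWORDS = {"the", "a", "an", "of", "to", "and", "i", "was", "me"}  # illustrative subset

    def stem(token):
        # stub stemmer; a real system would apply Porter's algorithm here
        return token[:-1] if token.endswith("s") else token

    def preprocess(text):
        # keep only sequences of alphabetical characters, lowercased
        tokens = re.findall(r"[a-z]+", text.lower())
        return [stem(t) for t in tokens if t not in STOPWORDS]

    print(preprocess("I was killed i' the Capitol; Brutus killed me."))
    # ['killed', 'capitol', 'brutu', 'killed']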

  13. Step 2: Indexing
     - Build an inverted index, with an entry for each word in the vocabulary
     - Input: tokens obtained from the preprocessing module
     - Output: an inverted index for fast access

  14. Step 2 (cont'd)
     - Many data structures are appropriate for fast access: B-trees, skip lists, hashtables
     - We need one entry for each word in the vocabulary
     - For each such entry, keep a list of all the documents where the word appears, together with the corresponding frequency in each: TF
     - For each such entry, keep the number of documents in which the word occurs: DF

  15. Step 2 (cont'd). Example index file (index terms with their df and posting lists of (Dj, tfj) entries):

     Index term | df | posting list (Dj, tfj)
     computer   | 3  | (D7, 4), ...
     database   | 2  | (D1, 3), ...
     science    | 4  | (D2, 4), ...
     system     | 1  | (D5, 2)
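A minimal sketch of building this structure around a hashtable (a Python dict; the field names are illustrative). The whitespace tokenizer stands in for the Step 1 preprocessing module:

    from collections import defaultdict

    def build_index(docs):
        # term -> {"df": document frequency, "postings": {doc_id: term frequency}}
        index = defaultdict(lambda: {"df": 0, "postings": {}})
        for doc_id, text in docs.items():
            for token in text.lower().split():  # stand-in for preprocess()
                postings = index[token]["postings"]
                if doc_id not in postings:
                    index[token]["df"] += 1  # first occurrence of token in this document
                    postings[doc_id] = 0
                postings[doc_id] += 1  # per-document term frequency (TF)
        return index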

  16. Step 2 (cont'd)
     - TF and IDF for each token can be computed in one pass
     - Cosine similarity also requires document lengths, so a second pass is needed to compute document vector lengths
     - Remember that the length of a document vector is the square root of the sum of the squares of the weights of its tokens, and that the weight of a token is TF * IDF
     - Therefore, we must wait until the IDFs are known (and therefore until all documents are indexed) before document lengths can be determined
     - Do a second pass over all documents: keep a list or hashtable with all document ids, and for each document determine its length

  17. Computing IDF
     Let N be the total number of documents
     For each token T in V:
       Determine the total number of documents, m, in which T occurs
       Set the IDF for T to log(N/m)
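Given the index sketched in Step 2 (its df field is the m above), the IDF computation might look like:

    import math

    def compute_idf(index, num_docs):
        # idf(T) = log(N / m), where m is the number of documents containing T
        return {term: math.log(num_docs / entry["df"]) for term, entry in index.items()}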

  18. Computing Document Lengths
     Assume the lengths of all document vectors are initialized to 0.0
     For each token T in V:
       Let I be the IDF weight of T
       For each document D in which T occurs:
         Let C be the count of T in D
         Increment the length of D by (I*C)^2
     For each document D:
       Set the length of D to the square root of the currently stored length
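A matching sketch, assuming the index and idf structures from the previous sketches:

    import math
    from collections import defaultdict

    def compute_doc_lengths(index, idf):
        # accumulate (I*C)^2 per document, then take square roots
        lengths = defaultdict(float)
        for term, entry in index.items():
            for doc_id, tf in entry["postings"].items():
                lengths[doc_id] += (idf[term] * tf) ** 2
        return {doc_id: math.sqrt(total) for doc_id, total in lengths.items()}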

  19. Time Complexity of Indexing
     - The complexity of creating a vector and indexing a document of n tokens is O(n), so indexing m such documents is O(m n)
     - Computing token IDFs for a vocabulary V is O(|V|)
     - Computing vector lengths is also O(m n)
     - Since |V| ≤ m n, the complete process is O(m n), which is also the complexity of just reading in the corpus

  20. Step 3: Retrieval
     - Use the inverted index (from Step 2) to find the limited set of documents that contain at least one of the query words
     - Incrementally compute the cosine similarity of each indexed document as the query words are processed one by one
     - To accumulate a total score for each retrieved document, store the retrieved documents in a hashtable, where the document id is the key and the partial accumulated score is the value
     - Input: query and inverted index (from Step 2)
     - Output: similarity values between the query and the documents

  21. Retrieval with an Inverted Index
     - Tokens that are not in both the query and the document do not affect cosine similarity: the product of the token weights is zero and does not contribute to the dot product
     - Usually the query is fairly short, and therefore its vector is extremely sparse
     - Use the inverted index to find the limited set of documents that contain at least one of the query words

  22. Inverted Query Retrieval Efficiency
     - Assume that, on average, a query word appears in B documents
     - Then retrieval time is O(|Q| B), which is typically much better than naïve retrieval, which examines all N documents in O(|V| N), because |Q| << |V| and B << N
     (Diagram: each term qi of the query Q = q1 q2 ... qn points to its posting list of B documents Di,1 ... Di,B)

  23. Processing the Query
     - Incrementally compute the cosine similarity of each indexed document as the query words are processed one by one
     - To accumulate a total score for each retrieved document, store the retrieved documents in a hashtable, where the DocumentReference is the key and the partial accumulated score is the value

  24. Inverted-Index Retrieval Algorithm
     Initialize an empty hashtable R of retrieved documents and their scores
     For each token T in query Q:
       Let I be the IDF of T, and K be the count of T in Q
       Set the weight of T in Q: W = K * I
       Let L be the posting list of T (pairs of a document and the tf of T in it)
       For each entry O in L:
         Let D be the document of O, and C be the count of O (tf of T in D)
         If D is not already in R (D was not previously retrieved), then add D to R and initialize its score to 0.0
         Increment D's score by W * I * C (the product of T's weight in Q and T's weight in D)

  25. Retrieval Algorithm (cont'd)
     Compute the length L of the vector Q (the square root of the sum of the squares of its weights)
     For each retrieved document D in R:
       Let S be the current accumulated score of D (S is the dot product of D and Q)
       Let Y be the length of D
       Normalize D's final score to S/(L * Y)
     Sort the retrieved documents in R by final score and return the results
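A minimal sketch of slides 24 and 25 combined, assuming the index, idf, and lengths structures from the earlier sketches; query_tokens is the preprocessed list of query words:

    import math
    from collections import Counter, defaultdict

    def retrieve(query_tokens, index, idf, lengths):
        scores = defaultdict(float)  # R: doc_id -> partial accumulated score
        q_weights = {}
        for term, count in Counter(query_tokens).items():
            if term not in index:
                continue  # terms absent from the collection contribute nothing
            w = count * idf[term]  # W = K * I
            q_weights[term] = w
            for doc_id, tf in index[term]["postings"].items():
                scores[doc_id] += w * idf[term] * tf  # W * I * C
        q_len = math.sqrt(sum(w * w for w in q_weights.values()))  # length of Q
        # normalize each score by the query and document lengths, then sort
        ranked = [(doc_id, s / (q_len * lengths[doc_id])) for doc_id, s in scores.items()]
        return sorted(ranked, key=lambda pair: pair[1], reverse=True)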

  26. Step 4: Ranking
     - Sort the hashtable of retrieved documents by the value of the cosine similarity, e.g. in Perl: sort {$retrieved{$b} <=> $retrieved{$a}} keys %retrieved
     - Return the documents in descending order of their relevance
     - Input: similarity values between the query and the documents
     - Output: list of documents, ranked in descending order of relevance
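For comparison, the same descending sort in Python, assuming retrieved maps document ids to accumulated similarity scores:

    ranked = sorted(retrieved, key=retrieved.get, reverse=True)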

  27. Summary: Indexing and Searching

  28. Next: Intro to Web Search
