
Indexing and Searching

Indexing and Searching. Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto, Chapter 8. Outline: Inverted Files, Other Indices for Text, Sequential Searching, Pattern Matching, Compression.


Presentation Transcript


  1. Indexing and Searching • Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto • Chapter 8

  2. Outline • Inverted Files • Other Indices for Text • Sequential Searching • Pattern Matching • Compression

  3. Inverted Files • An inverted file (or inverted index) is a word-oriented mechanism for indexing a text collection in order to speed up the searching task. • Structure: vocabulary and occurrences • Block addressing • The text is divided into blocks, and the occurrences point to the blocks • Full inverted indices: exact occurrences
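
The two addressing schemes can be contrasted in a few lines. This is a minimal sketch, not code from the book: the function name `build_inverted_index`, the `block_size` parameter, the character-based blocks, and the 0-based offsets are all assumptions made for illustration.

```python
import re
from collections import defaultdict

def build_inverted_index(text, block_size=None):
    """Map each word to its sorted list of occurrences.

    With block_size=None the occurrences are exact character offsets
    (a full inverted index). With an integer block_size they are block
    numbers (block addressing): smaller, but less precise.
    """
    index = defaultdict(list)
    for match in re.finditer(r"\w+", text.lower()):
        pos = match.start()
        entry = pos if block_size is None else pos // block_size
        postings = index[match.group()]
        # A block number is stored once even if the word repeats in it.
        if not postings or postings[-1] != entry:
            postings.append(entry)
    return dict(index)

text = "This is a text. A text has many words. Words are made from letters."
full = build_inverted_index(text)                   # exact occurrences
blocks = build_inverted_index(text, block_size=16)  # block addressing
```

Here `full["words"]` holds two exact offsets, while `blocks["words"]` holds a single block number because both occurrences fall in the same 16-character block; locating them exactly would require scanning that block.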

  4. Inverted Files • The search algorithm on an inverted index • Vocabulary search • Retrieval of occurrences • Manipulation of occurrences • Construction (split the index into two files) • Posting file: the lists of occurrences are stored contiguously • The vocabulary is stored in lexicographical order, and each word points to its list.
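
The three search steps and the two-file layout might look as follows. This is a sketch under stated assumptions: the toy `vocabulary`, `pointers`, and `posting_file` data and the helper names `occurrences` and `near` are hypothetical, and a proximity query is used only as one example of manipulating occurrences.

```python
from bisect import bisect_left

# Hypothetical toy index. The vocabulary is kept in lexicographical
# order; each entry points into one contiguous posting file through
# an (offset, length) pair.
vocabulary = ["letters", "many", "text", "words"]
pointers = [(0, 1), (1, 1), (2, 2), (4, 2)]
posting_file = [59, 27, 10, 18, 32, 39]

def occurrences(word):
    """Vocabulary search (binary search), then retrieval of the list."""
    i = bisect_left(vocabulary, word)
    if i == len(vocabulary) or vocabulary[i] != word:
        return []
    start, length = pointers[i]
    return posting_file[start:start + length]

def near(word_a, word_b, max_gap):
    """Manipulation of occurrences: positions of word_a that are
    followed by word_b starting at most max_gap characters later."""
    later = occurrences(word_b)
    return [p for p in occurrences(word_a)
            if any(0 < q - p <= max_gap for q in later)]
```

For example, `near("many", "words", 10)` keeps the occurrence of "many" at offset 27 because "words" starts 5 characters later.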

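  5. Inverted Files • For large texts • Partial indices are built whenever memory is exhausted and are merged afterwards • Merging two indices consists of merging the sorted vocabularies and, whenever a word appears in both, merging its two lists of occurrences.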
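
Merging two partial indices is essentially the merge step of merge sort applied to the vocabularies. The sketch below assumes each partial index is a sorted list of `(word, occurrences)` pairs and that the second index covers text that comes after the first, so shared words simply have their lists concatenated; the name `merge_indices` is illustrative.

```python
def merge_indices(first, second):
    """Merge two partial indices given as sorted lists of
    (word, occurrences) pairs. `second` is assumed to index text that
    follows the text of `first`, so its occurrences are appended."""
    merged, i, j = [], 0, 0
    while i < len(first) and j < len(second):
        (word_a, occ_a), (word_b, occ_b) = first[i], second[j]
        if word_a == word_b:
            merged.append((word_a, occ_a + occ_b))
            i += 1
            j += 1
        elif word_a < word_b:
            merged.append((word_a, occ_a))
            i += 1
        else:
            merged.append((word_b, occ_b))
            j += 1
    merged.extend(first[i:])   # at most one of these two is non-empty
    merged.extend(second[j:])
    return merged

part1 = [("a", [8]), ("is", [5]), ("text", [10]), ("this", [0])]
part2 = [("a", [16]), ("has", [23]), ("text", [18])]
merged = merge_indices(part1, part2)
```

Because both inputs are read strictly left to right, the same loop works when the partial indices are streamed from disk rather than held in memory.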

  6. Other Indices for Text • Suffix Trees • Suffix Arrays • Signature Files

  7. Suffix Trees and Suffix Arrays • Each position in the text is considered as a text suffix • Index points are selected from the text; they mark the beginnings of the text positions that will be retrievable

  8. Suffix Arrays • The main drawback of suffix arrays is their costly construction process. • They allow binary searches, done by comparing the text that each pointer refers to. • Supra-indices (for large suffix arrays)
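
Both points can be seen in a small sketch. It assumes every character position is an index point and uses a deliberately naive construction that sorts whole suffixes, which makes the cost visible; the function names are illustrative and this is not the construction algorithm from the book.

```python
def build_suffix_array(text):
    """Naive construction: sort all suffix start positions by the
    suffix they point to. Simple, but costly for large texts."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def search(text, suffix_array, pattern):
    """Return the sorted text positions where pattern occurs, using
    two binary searches that compare the text behind each pointer."""
    m = len(pattern)
    # First suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # First suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])
```

All suffixes that start with the pattern are contiguous in the array, which is why two boundary searches are enough to answer a prefix query.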

  9. Construction of Suffix Arrays for Large Texts

  10. Signature Files • Word-oriented index structures based on hashing • Maps words to bit masks of B bits • Divides the text into blocks of b words each • The mask of a block is obtained by bitwise ORing the signatures of all the words in the text block. • Hash the query to a bit mask W • If W & Bi = W, the text block may contain the word
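
The test W & Bi = W only says a block may contain the word, so candidates must still be checked against the text. A minimal sketch follows; the values of `B`, `BITS_SET`, and `BLOCK_WORDS`, the use of SHA-256 as the hash, and all function names are assumptions chosen for illustration.

```python
import hashlib

B = 64           # signature width in bits (assumed value)
BITS_SET = 3     # bits set per word signature (assumed value)
BLOCK_WORDS = 4  # words per text block, the b of the slide (assumed value)

def signature(word):
    """Hash a word to a B-bit mask with up to BITS_SET bits set."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    mask = 0
    for k in range(BITS_SET):
        mask |= 1 << (digest[k] % B)
    return mask

def build_signature_file(words):
    """OR together the signatures of the words in each block."""
    masks = []
    for start in range(0, len(words), BLOCK_WORDS):
        mask = 0
        for word in words[start:start + BLOCK_WORDS]:
            mask |= signature(word)
        masks.append(mask)
    return masks

def candidate_blocks(masks, word):
    """Blocks whose mask covers the query mask W (may include false drops)."""
    w = signature(word)
    return [i for i, block_mask in enumerate(masks) if (w & block_mask) == w]

def search(words, masks, word):
    """Filter candidates by scanning the block text itself."""
    hits = []
    for i in candidate_blocks(masks, word):
        if word in words[i * BLOCK_WORDS:(i + 1) * BLOCK_WORDS]:
            hits.append(i)
    return hits
```

A block that truly contains the word always passes the mask test, so the filter never misses a match; it can only report extra blocks, which the final scan discards.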

  11. Sequential Searching • Brute Force • Knuth-Morris-Pratt • Boyer-Moore Family • Shift-Or • Suffix Automaton • Backward DAWG matching (BDM) • BNDM

  12. Knuth-Morris-Pratt
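
Knuth-Morris-Pratt never moves backwards in the text: on a mismatch it consults a precomputed failure table to decide how much of the pattern is still matched. The following is a minimal sketch of the standard algorithm, not a transcription of the slide; the function names are illustrative.

```python
def failure_table(pattern):
    """fail[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    """Return the start positions of all occurrences of pattern."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits, k = [], 0  # k = number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]  # keep going to find overlapping matches
    return hits
```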

  13. Boyer-Moore Family
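
Algorithms of the Boyer-Moore family compare the pattern against a text window from right to left and then skip ahead by more than one position when possible. As one simple member of the family, the sketch below implements the Horspool variant, which shifts according to the text character aligned with the last pattern position; this choice of variant and the function name are assumptions, not something the slide specifies.

```python
def horspool_search(text, pattern):
    """Return the start positions of all occurrences of pattern."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Distance from the rightmost occurrence of each character
    # (ignoring the final pattern position) to the pattern end.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    hits, pos = [], 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1  # compare right to left
        if j < 0:
            hits.append(pos)
        # Characters absent from the pattern allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return hits
```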

  14. Shift-Or
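
Shift-Or uses bit-parallelism: the set of pattern prefixes that currently match is kept in one machine word and updated with a shift and an OR per text character. The sketch below is a minimal version of the standard algorithm; it relies on Python's unbounded integers, so the usual limit of one machine word on the pattern length does not show up, and the function name is illustrative.

```python
def shift_or_search(text, pattern):
    """Return the start positions of all occurrences of pattern."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # masks[c] has bit i cleared exactly when pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)
    # Bit i of state is 0 when pattern[:i + 1] matches the text
    # ending at the current position.
    state = all_ones
    hits = []
    for pos, ch in enumerate(text):
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if (state & (1 << (m - 1))) == 0:
            hits.append(pos - m + 1)
    return hits
```

The shift brings in a 0 at the lowest bit, which stands for the empty prefix that always matches; the OR then sets every bit whose pattern character disagrees with the current text character.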

  15. Suffix Automaton

  16. Pattern Matching • Searching allowing errors • Dynamic Programming • Automaton • Regular Expressions and Extended patterns • Pattern Matching Using Indices • Inverted files • Suffix Trees and Suffix Arrays

  17. Dynamic Programming
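
For searching allowing errors, the classical dynamic programming solution keeps one column of edit distances and updates it for each text character; the first cell stays at zero so that a match may start anywhere in the text. The sketch below follows that standard recurrence; the function name and the choice to report 0-based end positions are assumptions.

```python
def approximate_search(text, pattern, k):
    """Return the 0-based text positions where an occurrence of
    pattern with at most k errors (edit distance) ends."""
    m = len(pattern)
    # column[i] = fewest errors to match pattern[:i] against some
    # substring of the text ending at the current position.
    column = list(range(m + 1))
    hits = []
    for pos, ch in enumerate(text):
        prev_diag = column[0]  # column[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            old = column[i]
            if pattern[i - 1] == ch:
                column[i] = prev_diag
            else:
                column[i] = 1 + min(prev_diag, old, column[i - 1])
            prev_diag = old
        if column[m] <= k:
            hits.append(pos)
    return hits
```

Searching for "survey" in "surgery" with k = 2 reports the last three text positions, since the final row of the matrix is 2 at each of them. The cost is proportional to the product of the pattern and text lengths.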

  18. Automaton

  19. Regular Expressions

  20. Pattern Matching Using Indices • Inverted Files • Queries such as suffix or substring queries, searching allowing errors, and regular expressions are solved by a sequential search over the vocabulary • Restriction: approximate matches or regular expressions that span several words cannot be found this way.

  21. Pattern Matching Using Indices • Suffix Trees • Suffix trees are able to perform complex searches • Word, prefix, suffix, substring, and range queries • Regular expressions • Unrestricted approximate string matching • Useful in specific areas • Find the longest substring that appears more than once • Find the most common substring of a fixed size

  22. Pattern Matching Using Indices • Suffix Arrays • Some patterns can be searched directly in the suffix array without simulating the suffix tree • Word, prefix, suffix, subword search and range search

  23. Compression • Compressed text: Huffman coding • Taking words as symbols • Use an alphabet of bytes instead of bits • Compressed indices • Inverted Files • Suffix Trees and Suffix Arrays • Signature Files
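
Taking words as symbols means the Huffman tree is built over the vocabulary rather than over characters. The sketch below shows this with a plain binary (bit-oriented) code for brevity; the byte-oriented variant mentioned on the slide would use a tree of higher degree instead. Function names are illustrative, and separators between words are ignored.

```python
import heapq
from collections import Counter
from itertools import count

def word_huffman_codes(words):
    """Build a binary Huffman code whose symbols are whole words."""
    freq = Counter(words)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    tiebreak = count()  # keeps heap entries comparable on equal weights
    heap = [(n, next(tiebreak), {w: ""}) for w, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        # Joining two subtrees prepends one bit to every code in them.
        merged = {w: "0" + c for w, c in left.items()}
        merged.update({w: "1" + c for w, c in right.items()})
        heapq.heappush(heap, (n1 + n2, next(tiebreak), merged))
    return heap[0][2]

def encode(words, codes):
    return "".join(codes[w] for w in words)

def decode(bits, codes):
    by_code = {c: w for w, c in codes.items()}
    words, current = [], ""
    for bit in bits:
        current += bit
        if current in by_code:  # safe because the code is prefix-free
            words.append(by_code[current])
            current = ""
    return words
```

Frequent words receive the shortest codes, and because each codeword stands for a whole word, a query word can be encoded once and then searched for directly in the compressed text.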
