Indexing and Searching Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto Chapter 8
Outline • Inverted Files • Other Indices for Text • Sequential Searching • Pattern Matching • Compression
Inverted Files • An inverted file (or inverted index) is a word-oriented mechanism for indexing a text collection in order to speed up the searching task. • Structure: vocabulary and occurrences • Block addressing • The text is divided into blocks, and the occurrences point to the blocks • Full inverted indices: exact occurrences
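A minimal sketch of the two addressing schemes on this slide. The function name, the tokenizer (`\w+` on lowercased text) and the use of character offsets as positions are assumptions made for illustration, not details from the chapter.

```python
import re
from collections import defaultdict

def build_inverted_index(text, block_size=None):
    """Map each word of the vocabulary to its sorted list of occurrences.

    block_size=None -> full inverted index: exact character offsets.
    block_size=k    -> block addressing: the text is cut into blocks of
                       k characters and occurrences are block numbers.
    """
    index = defaultdict(list)
    for match in re.finditer(r"\w+", text.lower()):
        position = match.start()
        entry = position if block_size is None else position // block_size
        postings = index[match.group()]
        # With block addressing a block is recorded only once per word.
        if not postings or postings[-1] != entry:
            postings.append(entry)
    return dict(index)

text = "to be or not to be that is the question"
full = build_inverted_index(text)                   # exact occurrences
blocks = build_inverted_index(text, block_size=16)  # block addressing
print(full["be"], blocks["be"])
```

Block addressing trades precision for space: the lists are shorter, but a hit only says which block to scan.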
Inverted Files • The search algorithm on an inverted index • Vocabulary search • Retrieval of occurrences • Manipulation of occurrences • Construction (split the index into two files) • Posting file: the lists of occurrences are stored contiguously • The vocabulary is stored in lexicographical order, and each word points to its list.
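The three search steps can be sketched over a two-part layout like the one described above. The sample data, the `(word, start, length)` vocabulary entries and the function names are hypothetical. Intersection is shown as one possible manipulation of occurrences, here for a conjunctive query.

```python
from bisect import bisect_left

# Hypothetical layout: one contiguous postings array, plus a vocabulary in
# lexicographical order whose entries point into it as (word, start, length).
postings = [1, 4, 7, 2, 4, 4, 7, 9]
vocabulary = [("index", 0, 3), ("search", 3, 2), ("text", 5, 3)]
words = [entry[0] for entry in vocabulary]

def lookup(word):
    """Steps 1 and 2: binary search in the vocabulary, then fetch the list."""
    i = bisect_left(words, word)
    if i == len(words) or words[i] != word:
        return []
    _, start, length = vocabulary[i]
    return postings[start:start + length]

def intersect(a, b):
    """Step 3: merge-style intersection of two sorted occurrence lists."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

def search_all(*query_words):
    """Occurrences shared by every query word (shortest list first)."""
    lists = sorted((lookup(w) for w in query_words), key=len)
    if not lists:
        return []
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result

print(search_all("index", "text"))
```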
Inverted Files • For large texts • Partial indices • Merging two indices consists of merging the sorted vocabularies and, whenever a word appears in both, merging its two lists of occurrences.
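A sketch of merging two partial indices, each represented (as an assumption) by a list of `(word, occurrences)` pairs sorted by word. It also assumes the second index covers text that comes after the first, so concatenating two lists keeps them sorted.

```python
def merge_indices(left, right):
    """Merge two partial indices given as sorted lists of (word, postings)."""
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        left_word, left_postings = left[i]
        right_word, right_postings = right[j]
        if left_word == right_word:
            # Same word in both vocabularies: join the occurrence lists.
            # Assumes every position in `right` follows those in `left`.
            merged.append((left_word, left_postings + right_postings))
            i += 1
            j += 1
        elif left_word < right_word:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

first = [("be", [0]), ("to", [1])]
second = [("be", [5]), ("or", [4])]
print(merge_indices(first, second))
```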
Other Indices for Text • Suffix Trees • Suffix Arrays • Signature Files
Suffix Trees and Suffix Arrays • Each position in the text is considered as a text suffix • Index points are selected from the text; they mark the beginnings of the text positions that will be retrievable
Suffix arrays • The main drawback of suffix arrays is their costly construction process. • Allow binary searches done by comparing the contents of each pointer. • Supra-indices (for large suffix arrays)
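A small sketch of a suffix array and the binary search it allows. It assumes every character position is an index point. The construction is deliberately naive (it sorts whole suffixes), which illustrates the costly-construction drawback rather than avoiding it. Function names are illustrative.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """Two binary searches delimit the suffixes that start with `pattern`."""
    m = len(pattern)

    def prefix(k):
        # Contents reached through pointer k, cut to the pattern length.
        start = suffix_array[k]
        return text[start:start + m]

    lo, hi = 0, len(suffix_array)
    while lo < hi:                      # first suffix with prefix >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:                      # first suffix with prefix > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

text = "abracadabra"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "abra"))
```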
Signature Files • Word-oriented index structures based on hashing • Maps words to bit masks of B bits • Divides the text into blocks of b words each • The mask of a block is obtained by bitwise ORing the signatures of all the words in the text block. • Hash the query to a bit mask W • If W & Bi = W, where Bi is the mask of block i, the text block may contain the word
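The signature-file test can be sketched as follows. The values of B and b, the number of bits set per word, and the use of SHA-256 as the hash are all assumptions chosen for the example. Because the mask test only says a block may contain the word, candidate blocks are verified against the text.

```python
import hashlib

B = 64             # bits per mask (assumed value)
BITS_PER_WORD = 3  # bits set in each word signature (assumed value)
BLOCK_WORDS = 4    # b, words per text block (assumed value)

def word_signature(word):
    """Hash a word to a B-bit mask with up to BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    mask = 0
    for k in range(BITS_PER_WORD):
        mask |= 1 << (digest[k] % B)
    return mask

def build_signature_file(words):
    """Split the text into blocks of b words and OR the word signatures."""
    blocks = [words[i:i + BLOCK_WORDS]
              for i in range(0, len(words), BLOCK_WORDS)]
    masks = []
    for block in blocks:
        mask = 0
        for word in block:
            mask |= word_signature(word)
        masks.append(mask)
    return blocks, masks

def search(word, blocks, masks):
    """Blocks that pass the mask test and really contain the word."""
    w = word_signature(word)
    candidates = [i for i, mask in enumerate(masks) if (mask & w) == w]
    # A candidate may be a false drop, so check the block itself.
    return [i for i in candidates if word in blocks[i]]

words = "to be or not to be that is the question".split()
blocks, masks = build_signature_file(words)
print(search("that", blocks, masks))
```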
Sequential Searching • Brute Force • Knuth-Morris-Pratt • Boyer-Moore Family • Shift-Or • Suffix Automaton • Backward DAWG matching (BDM) • BNDM
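As one example from this list, here is a sketch of Shift-Or for exact matching. It relies on Python integers as bit vectors, so the pattern length is not limited by a machine word as it would be in a typical implementation. The function name is illustrative.

```python
def shift_or_search(text, pattern):
    """Start positions of all exact occurrences of `pattern` in `text`."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # masks[c] has bit i equal to 0 exactly when pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, all_ones) & ~(1 << i)
    # Bit i of state is 0 when pattern[0..i] matches the text ending here.
    state = all_ones
    matches = []
    for pos, c in enumerate(text):
        state = ((state << 1) | masks.get(c, all_ones)) & all_ones
        if (state & (1 << (m - 1))) == 0:
            matches.append(pos - m + 1)
    return matches

print(shift_or_search("abracadabra", "abra"))
```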
Pattern Matching • Searching allowing errors • Dynamic Programming • Automaton • Regular Expressions and Extended patterns • Pattern Matching Using Indices • Inverted files • Suffix Trees and Suffix Arrays
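A sketch of the dynamic programming approach to searching allowing errors, assuming the error measure is edit distance (insertions, deletions and substitutions). Only one column of the matrix is kept. Its first cell stays 0 so that a match may start at any text position. Names are illustrative.

```python
def approximate_search(text, pattern, k):
    """End positions in `text` where `pattern` matches with <= k errors."""
    m = len(pattern)
    column = list(range(m + 1))  # distances of pattern prefixes to empty text
    ends = []
    for j, c in enumerate(text):
        diagonal = column[0]     # previous column, row i-1 (row 0 is always 0)
        for i in range(1, m + 1):
            old = column[i]      # previous column, row i
            if pattern[i - 1] == c:
                column[i] = diagonal
            else:
                # column[i - 1] already holds the current column, row i-1.
                column[i] = 1 + min(diagonal, old, column[i - 1])
            diagonal = old
        if column[m] <= k:
            ends.append(j)
    return ends

print(approximate_search("abcdefg", "cxe", 1))
```

The cost is proportional to the text length times the pattern length, which is why the automaton-based alternatives on this slide matter for long texts.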
Pattern Matching Using Indices • Inverted Files • Queries such as suffix or substring queries, searching allowing errors, and regular expressions are solved by a sequential search over the vocabulary • The restriction is that approximate matches or regular expressions that span many words cannot be found this way.
Pattern Matching Using Indices • Suffix Trees • Suffix trees are able to perform complex searches • Word, prefix, suffix, substring, and range queries • Regular expressions • Unrestricted approximate string matching • Useful in specific areas • Find the longest repeated substring • Find the most common substring of a fixed size
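To illustrate one of these specialised queries, the sketch below finds the longest substring that occurs at least twice. It does not build a suffix tree: it sorts the suffixes and compares neighbours, which gives the same answer on small inputs at a much higher cost. The function name is illustrative.

```python
def longest_repeated_substring(text):
    """Longest substring occurring at least twice (overlaps allowed)."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    best = ""
    # Any repeated substring is a common prefix of two adjacent suffixes.
    for a, b in zip(suffixes, suffixes[1:]):
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        if n > len(best):
            best = a[:n]
    return best

print(longest_repeated_substring("banana"))
```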
Pattern Matching Using Indices • Suffix Arrays • Some patterns can be searched directly in the suffix array without simulating the suffix tree • Word, prefix, suffix, subword search and range search
Compression • Compressed text--Huffman coding • Taking words as symbols • Use an alphabet of bytes instead of bits • Compressed indices • Inverted Files • Suffix Trees and Suffix Arrays • Signature Files
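A sketch of Huffman coding with words as symbols. For brevity it emits a binary code as a string of "0" and "1" characters rather than the byte-oriented alphabet mentioned on the slide, and it ignores separators between words. All names are illustrative.

```python
import heapq
from collections import Counter
from itertools import count

def word_huffman_codes(words):
    """Build a binary Huffman code whose symbols are whole words."""
    freq = Counter(words)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    tie = count()  # tie-breaker so the heap never compares the dict payloads
    heap = [(f, next(tie), {w: ""}) for w, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Joining two subtrees prepends one bit to every code below them.
        merged = {w: "0" + code for w, code in left.items()}
        merged.update({w: "1" + code for w, code in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

def encode(words, codes):
    return "".join(codes[w] for w in words)

def decode(bits, codes):
    """Decode by growing a bit string until it equals a (prefix-free) code."""
    inverse = {code: w for w, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out

words = "to be or not to be that is the question to be".split()
codes = word_huffman_codes(words)
bits = encode(words, codes)
print(len(bits), decode(bits, codes) == words)
```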