This guide provides an overview of constructing inverted lists, a fundamental data structure in web-based information retrieval systems. It covers building indices, parsing documents, identifying document boundaries, and processing each document to extract data such as term frequencies and word positions. It also gives example figures for build time and memory use, discusses data structures such as hashtables and B-trees, and explains stopword recognition and the index format.
20-760 Web-based Information Architectures How to Construct an Inverted List
Parsing & Indexing: Overview
• Tasks
  • Build a set of indices
    • inverted list, idf, document id, normalized tf, word positions, …
• Speed (example)
  • On a PC with a 750 MHz CPU and 256 MB of memory, a C++ program that builds indices without positions runs in 46-56 seconds on the 50 MB HTML collection. (The cleaned-up collection is 30 MB.)
  • A few seconds for your Java program on the Reuters-1000 collection
• Memory
  • 1-5% of the size of the total uncompressed documents
  • E.g. 128 MB RAM for 2 GB of text
Document Parsing
• Read the corpus file "reut2-1000.plain"
• Identify the document boundary
  • <REUTERS ID="document id">
• Process each document:
  • Extract the document ID
  • Segment the text into tokens
    • e.g. Apple, REUTERS, U.S. …
    • In our case, split the text on whitespace and newlines
  • Case conversion (make all tokens lowercase)
  • Discard stopwords and other non-content words (e.g. numbers)
  • Word stemming
  • Count term frequencies, record positions
  • Update indices
• Write the index out to a file in alphabetical order, from a to z
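The parsing steps above can be sketched in Java roughly as follows. This is a minimal sketch, not the assignment's reference solution. It assumes that a document runs from one `<REUTERS ID="...">` marker to the next, and that numeric tokens still advance the position counter even though they are not indexed. The class and method names are hypothetical, and stopword removal and stemming are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DocumentParserSketch {
    // Assumed boundary marker: everything up to the next marker belongs to one document.
    private static final Pattern BOUNDARY = Pattern.compile("<REUTERS ID=\"([^\"]*)\">");

    /** Splits the corpus text into document id -> raw text, in corpus order. */
    static Map<String, String> splitDocuments(String corpus) {
        Map<String, String> docs = new LinkedHashMap<>();
        Matcher m = BOUNDARY.matcher(corpus);
        String currentId = null;
        int bodyStart = 0;
        while (m.find()) {
            if (currentId != null) {
                docs.put(currentId, corpus.substring(bodyStart, m.start()));
            }
            currentId = m.group(1);
            bodyStart = m.end();
        }
        if (currentId != null) {
            docs.put(currentId, corpus.substring(bodyStart));
        }
        return docs;
    }

    /** Maps each lowercased term to its 1-based positions; the list size is the term frequency. */
    static Map<String, List<Integer>> termPositions(String text) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int position = 0;
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // only happens for an empty document
            }
            position++;
            String token = raw.toLowerCase(Locale.ROOT);
            if (token.matches("\\d+")) {
                continue; // non-content token: counted for position, not indexed
            }
            positions.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
        }
        return positions;
    }
}
```

For a document body such as "Apple buys 3 apple farms", `termPositions` records "apple" at positions 1 and 4, so its term frequency is 2.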
Data Structure
• You can use whatever you like, but a hashtable is simple to implement
• Hashtable
  • Java provides such classes in java.util
  • Perl has hashes as a datatype, e.g. %words
  • C++ implements the associated list in the Standard Template Library (STL). The template class is called map. It keeps its keys in sorted order and is typically implemented as a balanced binary search tree (a red-black tree) rather than as a hash or B-tree; the hash-based counterpart is unordered_map (hash_map in older, pre-standard libraries).
  • You can also implement your own hashtable (see Ch. 13 of "Information Retrieval: Data Structures & Algorithms" by William B. Frakes and Ricardo Baeza-Yates)
  • Searching is fast, O(1) on average, but scanning the keys in sorted order is not possible without sorting them separately
• B-tree and B+ tree (see section 2.3 of the above book for details)
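The trade-off between fast lookup and ordered scanning can be illustrated with the two map classes in java.util. The sketch below counts terms in a `HashMap` and then copies them into a `TreeMap` to obtain the alphabetical order needed when writing the index. Note that `TreeMap` is a red-black tree, used here as a stand-in for an ordered structure such as a B-tree; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SortedScanSketch {
    /** Returns the keys of a hash-based map in alphabetical order by copying them into a TreeMap. */
    static List<String> termsInAlphabeticalOrder(Map<String, Integer> counts) {
        return new ArrayList<>(new TreeMap<>(counts).keySet());
    }

    public static void main(String[] args) {
        // Constant-time (on average) insert and lookup, but no guaranteed iteration order.
        Map<String, Integer> counts = new HashMap<>();
        for (String term : new String[] {"reuters", "apple", "trade", "apple"}) {
            counts.merge(term, 1, Integer::sum);
        }
        System.out.println(counts.get("apple"));              // prints 2
        System.out.println(termsInAlphabeticalOrder(counts)); // prints [apple, reuters, trade]
    }
}
```

Building with a hash table and sorting once at the end keeps insertion cheap during parsing; using an ordered map from the start would avoid the extra copy at the cost of O(log n) insertions.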
Associated List
• An associated list (also called an associative array or map) is a data structure made up of a list of pairs. Each pair is composed of a key and a value. The value can be a complex data structure.
• In our case: key/value -> term / associated posting list
• Accessing an associated list: you have the key, and you want to access the associated value quickly.
• There are many ways of implementing an associated list: hash, B-tree, array
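One way to picture the term / posting list pairing is a map from each term to a list of per-document entries. The sketch below is an illustration under assumptions: the `Posting` class, its fields, and the requirement that documents are added one after another are hypothetical choices, not something the slides prescribe.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class InvertedIndexSketch {
    /** One entry of a posting list: a document and the positions of the term in it. */
    static class Posting {
        final String docId;
        final List<Integer> positions = new ArrayList<>();

        Posting(String docId) {
            this.docId = docId;
        }

        int termFrequency() {
            return positions.size();
        }
    }

    // Key: term. Value: posting list for that term.
    private final Map<String, List<Posting>> index = new HashMap<>();

    /** Records one occurrence of a term. Assumes all tokens of one document are added consecutively. */
    void add(String term, String docId, int position) {
        List<Posting> postings = index.computeIfAbsent(term, k -> new ArrayList<>());
        Posting last = postings.isEmpty() ? null : postings.get(postings.size() - 1);
        if (last == null || !last.docId.equals(docId)) {
            last = new Posting(docId);
            postings.add(last);
        }
        last.positions.add(position);
    }

    /** Returns the posting list for a term, or an empty list if the term is unknown. */
    List<Posting> postingsFor(String term) {
        return index.getOrDefault(term, new ArrayList<>());
    }
}
```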
Hashtable
• A hashtable provides insertion of and access to the associated value in constant time (on average)
• A hashtable uses a hash function to map the key to the address at which the associated value is stored: Hash(key) → value
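To make the Hash(key) mapping concrete, here is a toy hash table with separate chaining. It is a sketch for illustration only: it reuses Java's built-in `String.hashCode()` as the hash function, never resizes, and all names are hypothetical. In practice `java.util.HashMap` does this work for you.

```java
import java.util.ArrayList;
import java.util.List;

class ToyHashTable {
    private static class Entry {
        final String key;
        int value;

        Entry(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    // Each bucket holds the entries whose keys hash to the same index.
    private final List<List<Entry>> buckets = new ArrayList<>();

    ToyHashTable(int bucketCount) {
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    /** Hash(key): maps a key to a bucket index in the range [0, bucketCount). */
    int bucketIndex(String key) {
        return Math.floorMod(key.hashCode(), buckets.size());
    }

    /** Inserts the key, or overwrites its value if it is already present. */
    void put(String key, int value) {
        List<Entry> bucket = buckets.get(bucketIndex(key));
        for (Entry e : bucket) {
            if (e.key.equals(key)) {
                e.value = value;
                return;
            }
        }
        bucket.add(new Entry(key, value));
    }

    /** Returns the stored value, or null if the key is absent. */
    Integer get(String key) {
        for (Entry e : buckets.get(bucketIndex(key))) {
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null;
    }
}
```

Lookups stay close to constant time only while the buckets remain short, which is why real implementations grow the bucket array as entries are added.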
Indices
• Format
  • <term> <idf> <doc id>:<normalized tf>:<tf>:<positions>
  • Positions are separated by commas
  • IDF(t) = log2(N/n), where N is the number of documents in the whole collection and n is the number of documents that contain the term t
  • TFnorm = TF/TFmax
• Sample
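The two formulas and the posting format can be sketched as below. Two points are assumptions rather than something the slides state: TFmax is taken to be the highest term frequency within the same document, and the normalized tf is printed with four decimal places. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

class IndexFormatSketch {
    /** IDF(t) = log2(N/n), computed via the natural logarithm. */
    static double idf(int totalDocs, int docsWithTerm) {
        return Math.log((double) totalDocs / docsWithTerm) / Math.log(2.0);
    }

    /** TFnorm = TF / TFmax; TFmax is assumed to be the largest tf in the same document. */
    static double normalizedTf(int tf, int maxTfInDoc) {
        return (double) tf / maxTfInDoc;
    }

    /** Formats one posting as <doc id>:<normalized tf>:<tf>:<positions>. */
    static String formatPosting(String docId, int maxTfInDoc, List<Integer> positions) {
        int tf = positions.size();
        String joined = positions.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(","));
        return String.format(Locale.ROOT, "%s:%.4f:%d:%s",
                docId, normalizedTf(tf, maxTfInDoc), tf, joined);
    }
}
```

For example, a term that occurs in 250 of 1000 documents has IDF = log2(4) = 2, and a posting for document 12 with positions 3 and 17 and TFmax = 4 is formatted as `12:0.5000:2:3,17`.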
Stopword Recognition
• There are usually fewer than 500 stopwords
  • Some systems have very few
• Every word token is checked, so the test should be very fast
• Store the stopword list in a hash table
  • Since stopword lists evolve slowly, calculate a perfect hash code
• Look up each word token in the hash table
  • If found, the token is a stopword, so discard it
• Document length & word locations should count stopwords
  • Example: "Library of Congress" has a length of 3, with locations 1 2 3
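The lookup and the location rule can be sketched as follows. The five-word stopword list is a made-up placeholder, and a plain `HashSet` stands in for the perfect hash mentioned above; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class StopwordSketch {
    // Tiny illustrative list (an assumption); a real list would hold a few hundred words.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the"));

    /** Maps each 1-based location to the content term found there; stopwords are dropped but still counted. */
    static Map<Integer, String> contentTermsByLocation(String text) {
        Map<Integer, String> kept = new LinkedHashMap<>();
        int location = 0;
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // only happens for empty input
            }
            location++; // every token advances the location, stopword or not
            String token = raw.toLowerCase(Locale.ROOT);
            if (!STOPWORDS.contains(token)) {
                kept.put(location, token);
            }
        }
        return kept;
    }

    /** Document length counts every token, stopwords included. */
    static int documentLength(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }
}
```

For "Library of Congress" this keeps "library" at location 1 and "congress" at location 3, while the document length stays 3.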
Good Luck! • Due at 7:00 pm on July 19.