Processing of large document collections

Processing of large document collections Part 5

In this part: • Indexing • querying • index construction

Indexing • An index is a mechanism for locating a given term in a text • Index in a book • it is possible to find information without browsing the pages • in large document collections (gigabytes) a page-by-page search would be practically impossible

Indexing • It is supposed that • a document collection consists of a set of separate documents • each document is described by a set of representative terms • index must be capable of identifying all documents that contain combinations of specified terms • document is the unit of text that is returned in response to queries

Indexing • What is a document? • E.g. emails • sender, recipient, subject, message body • one email, one field, a set of emails?

Indexing • Granularity of the index = the resolution to which term locations are recorded within each document • e.g. 1 email = 1 document, but the index could be capable of ascertaining a more exact location within the document of each term • which documents contain terms ’tax’ and ’avoidance’ in the same sentence?

Indexing • If the granularity of the index is taken to be one word, then the index will record the exact location of every word in the collection • the original text can be recovered from the index • the index takes more space than the original text

Indexing • Choice of representative terms • simplest choice: each word that appears in the documents is included verbatim as a term in the index • the number of terms is then huge • so some transformations are usually applied • case folding • stemming (reduction to a base word) • removal of stopwords
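
The transformations listed above can be sketched as a small normalization step. This is a minimal sketch only: the function name `index_terms` and the tiny stopword list are hypothetical, and stemming is left out because it needs a real stemming algorithm.

```python
import re

# Hypothetical, deliberately tiny stopword list; real systems use longer ones.
STOPWORDS = {"the", "of", "and", "a", "is", "in", "to"}

def index_terms(text):
    """Return the index terms of a text after case folding and stopword removal."""
    # Case folding (lower()) followed by a simple split into alphanumeric words.
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Stopword removal; stemming is intentionally not attempted here.
    return [w for w in words if w not in STOPWORDS]

print(index_terms("The Compression and Retrieval of Text"))
```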

Inverted file indexing • An inverted file contains, for each term in the lexicon, an inverted list that stores a list of pointers to all occurrences of that term in the main text • each pointer is the number of a document in which that term appears • a lexicon: a list of all terms that appear in the document collection • supports mapping from terms to their corresponding inverted lists
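
A document-level inverted file can be sketched in a few lines. The sketch assumes whitespace-separated words and documents numbered from 1; `build_inverted_file` is an illustrative name, not something defined in the slides.

```python
from collections import defaultdict

def build_inverted_file(documents):
    """Map each term to the ascending list of document numbers containing it."""
    inverted = defaultdict(list)
    for doc_number, text in enumerate(documents, start=1):
        # set() ensures each document number is recorded once per term.
        for term in set(text.lower().split()):
            inverted[term].append(doc_number)
    return dict(inverted)

docs = ["text compression", "image compression", "text retrieval"]
index = build_inverted_file(docs)
lexicon = sorted(index)  # the lexicon: all terms that appear in the collection

print(lexicon)
print(index["compression"])
```

Because documents are processed in increasing order, every inverted list comes out sorted by document number.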

Inverted file indexing • A query involving a single term is answered by scanning its inverted list and retrieving every document that it cites • for conjunctive Boolean queries of the form ’term AND term AND … AND term’, the intersection of the terms’ inverted lists is formed • for disjunction (OR): union of lists • for negation (NOT): complement

Inverted file indexing • The inverted lists are usually stored in order of increasing document number • various merging operations can be performed in a time that is linear in the size of the lists
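
The linear-time merging mentioned above can be sketched for AND and OR as follows. Both functions assume their inputs are ascending lists of document numbers without duplicates; the function names are illustrative.

```python
def intersect(a, b):
    """AND: documents present in both sorted lists, in one linear pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """OR: documents present in either sorted list, in one linear pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])  # at most one of these two tails is non-empty
    out.extend(b[j:])
    return out

print(intersect([3, 5, 20, 21], [5, 21, 76]))
print(union([3, 5], [5, 21]))
```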

Inverted file indexing: granularity • A coarse-grained index might identify only a block of text, where each block stores several documents • a moderate-grain index will store locations in terms of document numbers • a fine-grained index will return a sentence or word number

Inverted file indexing: granularity • Coarse indexes • require less storage, but during retrieval, more of the plain text must be scanned to find terms • multiterm queries are more likely to give rise to false matches, where each of the desired terms appears somewhere in the block, but not all within the same document

Inverted file indexing: granularity • Word-level indexing • enables queries involving adjacency and proximity to be answered quickly because the desired relationship can be checked before the text is retrieved • adding precise locational information expands the index • more pointers in the index • each pointer requires more bits of storage
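
To illustrate why word-level pointers make adjacency queries cheap, here is a minimal sketch in which each term maps to word positions per document. The data layout and the names `build_word_level_index` and `adjacent` are assumptions made for illustration.

```python
from collections import defaultdict

def build_word_level_index(documents):
    """term -> {document number -> ascending list of word positions}."""
    index = defaultdict(dict)
    for doc_number, text in enumerate(documents, start=1):
        for position, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_number, []).append(position)
    return index

def adjacent(index, first, second):
    """Documents in which `second` occurs immediately after `first`."""
    hits = []
    for doc_number, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_number, []))
        # The check uses only the index; no document text is read.
        if any(p + 1 in following for p in positions):
            hits.append(doc_number)
    return hits

docs = ["tax avoidance is legal", "avoidance of tax", "no tax no avoidance"]
index = build_word_level_index(docs)
print(adjacent(index, "tax", "avoidance"))
```

All three documents contain both words, but only the first has them side by side, which a document-level index could not tell without scanning the text.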

Inverted file indexing: granularity • Unless a significant fraction of the queries are expected to be proximity-based, the usual granularity is to individual documents • phrase-based queries can be handled by the slightly slower method of a postretrieval scan

Inverted file compression • Uncompressed inverted files can consume considerable space • 50-100% of the space of the text itself • the size of an inverted file can be reduced considerably by compressing it • key for compression: • each inverted list can without loss of generality be stored as an ascending sequence of integers

Inverted file compression • Suppose that some term appears in 8 documents of a collection; the term is described in the inverted file by a list: • <8; 3, 5, 20, 21, 23, 76, 77, 78> the address of which is contained in the lexicon • more generally, the list for a term t stores the number of documents f_t in which the term appears and then a list of f_t document numbers

Inverted file compression • the list of document numbers within each inverted list is in ascending order, and all processing is sequential from the beginning of the list • -> the list can be stored as an initial position followed by a list of d-gaps • the list for the term above: • <8; 3, 2, 15, 1, 2, 53, 1, 1>
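
The conversion between document numbers and d-gaps is easy to check in code. A minimal sketch, with illustrative function names:

```python
from itertools import accumulate

def to_d_gaps(doc_numbers):
    """Replace each document number by its distance from the previous one."""
    previous = [0] + doc_numbers[:-1]
    return [cur - prev for prev, cur in zip(previous, doc_numbers)]

def from_d_gaps(gaps):
    """Recover document numbers as running sums of the gaps."""
    return list(accumulate(gaps))

doc_numbers = [3, 5, 20, 21, 23, 76, 77, 78]
gaps = to_d_gaps(doc_numbers)
print(gaps)
print(from_d_gaps(gaps))
```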

Inverted file compression • The two forms are equivalent, but it is not obvious that any saving has been achieved • the largest d-gap in the second presentation is still potentially the same as the largest document number in the first • if there are N documents in the collection and a flat binary encoding is used to represent the gap sizes, both methods require log N bits per stored pointer

Inverted file compression • Considering each inverted list as a list of d-gaps, the sum of which is bounded by N, allows improved representation • -> it is possible to code inverted lists using on average substantially fewer than log N bits per pointer

Inverted file compression • many specific models have been proposed • global methods • every inverted list is compressed using the same common model • local methods • adjusted according to some parameter, usually frequency • tend to outperform global ones, but are more complex to implement
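
The slides do not name a specific code. As one illustration of a global method, the sketch below uses the Elias gamma code, a well-known parameterless code that gives short codewords to small gaps. The variant shown writes the binary form of the gap preceded by one zero for every bit after the leading one; bit strings are used instead of packed bits purely for readability, and the decoder assumes well-formed input.

```python
def gamma_encode(x):
    """Elias gamma code of a positive integer, as a string of '0' and '1'."""
    if x < 1:
        raise ValueError("the gamma code is defined for positive integers only")
    binary = bin(x)[2:]  # binary digits; the leading digit is always 1
    return "0" * (len(binary) - 1) + binary

def gamma_decode(bits):
    """Decode a concatenation of gamma codewords into a list of integers."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":  # count the zeros that announce the length
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values

gaps = [3, 2, 15, 1, 2, 53, 1, 1]
encoded = "".join(gamma_encode(g) for g in gaps)
print(encoded, len(encoded))
print(gamma_decode(encoded))
```

For the example list the eight gaps take 30 bits in total, under 4 bits per pointer on average.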

Querying • How to use an index to locate information in the text it describes?

Boolean queries • A Boolean query comprises a list of terms that are combined using the connectives AND, OR, and NOT • the answers to the query are those documents that satisfy the condition

Boolean queries • e.g. ’text’ AND ’compression’ AND ’retrieval’ • all three words must occur somewhere in every answer (no particular order) • ”the compression and retrieval of large amounts of text is an interesting problem” • ”this text describes the fractional distillation scavenging technique for retrieving argon from compressed air”

Boolean queries • A problem with all retrieval systems: • non-relevant answers are returned • must be filtered out manually • broad query -> high recall • narrow query -> high precision

Boolean queries • Small variations in a query can generate very different results • data AND compression AND retrieval • text AND compression AND retrieval • the user should be able to pose complex queries like • (text OR data OR image) AND (compression OR compaction OR decompression) AND (archiving OR retrieval OR storage)
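
A query of the form (… OR …) AND (… OR …) AND (… OR …) can be evaluated as an intersection of unions. The sketch below uses Python sets for brevity and an entirely hypothetical set of inverted lists; the helper names are illustrative.

```python
# Hypothetical inverted lists (term -> set of document numbers).
index = {
    "text": {1, 2, 5}, "data": {3, 5}, "image": {4},
    "compression": {1, 3, 4}, "compaction": {5}, "decompression": set(),
    "archiving": {4}, "retrieval": {1, 2}, "storage": {3},
}

def any_of(index, terms):
    """OR: union of the terms' document sets."""
    result = set()
    for term in terms:
        result |= index.get(term, set())
    return result

def all_groups(index, groups):
    """AND over groups, where the terms inside each group are OR-ed."""
    result = None
    for group in groups:
        docs = any_of(index, group)
        result = docs if result is None else result & docs
    return result if result is not None else set()

narrow_text = all_groups(index, [["text"], ["compression"], ["retrieval"]])
narrow_data = all_groups(index, [["data"], ["compression"], ["retrieval"]])
broad = all_groups(index, [
    ["text", "data", "image"],
    ["compression", "compaction", "decompression"],
    ["archiving", "retrieval", "storage"],
])
print(narrow_text, narrow_data, broad)
```

With this toy data, swapping 'text' for 'data' changes the answer from one document to none, while the broad query returns three documents.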

Ranked queries • Non-professional users might prefer simply giving a list of words that are of interest and letting the retrieval system supply the documents that seem most relevant, rather than seeking exact Boolean answers • text, data, image, compression, compaction, archiving, storage, retrieval...

Ranked queries • It would be useless to convert a list of words to a Boolean query • connect with AND -> too few documents • connect with OR -> too many documents • solution: a ranked query • a heuristic that is applied to measure the similarity of each document to the query • r most closely matching documents are returned

Ranking strategies • Simple techniques • count the number of query terms that appear somewhere in the document • a document that contains 5 query terms is ranked higher than a document that contains 3 query terms • more advanced techniques • cosine measure • takes into account the lengths of the documents etc.
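
The cosine measure is only named above. In its standard form it is the inner product of the query and document vectors divided by the product of their Euclidean lengths. The sketch below assumes plain 0/1 term vectors, which is a simplification; the weighting scheme is not specified in this part.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

query = [1, 1, 0, 0, 0, 0]
short_doc = [1, 1, 0, 0, 0, 0]   # contains exactly the two query terms
long_doc = [1, 1, 1, 1, 1, 1]    # contains the query terms and four others

print(cosine(query, short_doc))
print(cosine(query, long_doc))
```

Both documents contain both query terms, but the division by vector length ranks the short document higher.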

Accessing the lexicon • The lexicon for an inverted file index stores • the terms that can be used to search the collection • information needed to allow queries to be processed • address in the inverted file (of the corresponding list of document numbers) • the number of documents containing the term

Access structures • A simple structure • an array of records, each comprising a string along with two integer fields • if the lexicon is sorted, a word can be located by a binary search of the strings • consumes a lot of space • e.g. a lexicon of a million terms (for a collection of ~5GB), stored as 20-byte strings, with a 4-byte inverted file address and a 4-byte frequency value -> 28MB
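
The simple structure can be sketched as a sorted list of records searched with the standard `bisect` module. The record contents (addresses and frequencies) are made up for illustration.

```python
from bisect import bisect_left

# Hypothetical records: (term, inverted file address, document frequency),
# kept sorted by term so that binary search is possible.
lexicon = [
    ("compression", 0, 2),
    ("image", 17, 1),
    ("retrieval", 25, 1),
    ("text", 40, 2),
]
terms = [record[0] for record in lexicon]

def lookup(term):
    """Return the record for `term`, or None if it is not in the lexicon."""
    i = bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return lexicon[i]
    return None

print(lookup("retrieval"))
print(lookup("zebra"))
```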

Access structures • The space for the strings is reduced if they are all concatenated into one long contiguous string • an array of 4-byte character pointers is used for access • each term: its exact number of characters + 4 for the pointer • it is not necessary to store string lengths: the next pointer indicates the end of the string • in the lexicon of a million terms, the memory saving is 8 MB -> 20 MB in total
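
A minimal sketch of the concatenated layout: one long string plus an array of start offsets, with each term's end given by the next offset. The names `pack`, `term_at` and `find` are illustrative.

```python
def pack(sorted_terms):
    """Concatenate the terms and record where each one starts."""
    pointers = []
    position = 0
    for term in sorted_terms:
        pointers.append(position)
        position += len(term)
    return "".join(sorted_terms), pointers

def term_at(blob, pointers, i):
    """The i-th term; its end is the start of the next term."""
    end = pointers[i + 1] if i + 1 < len(pointers) else len(blob)
    return blob[pointers[i]:end]

def find(blob, pointers, term):
    """Binary search over the pointer array; returns the term number or None."""
    lo, hi = 0, len(pointers) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        candidate = term_at(blob, pointers, mid)
        if candidate == term:
            return mid
        if candidate < term:
            lo = mid + 1
        else:
            hi = mid - 1
    return None

blob, pointers = pack(["compression", "image", "retrieval", "text"])
print(blob, pointers)
print(find(blob, pointers, "retrieval"))
```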

Access structures • The memory required can be further reduced by eliminating many of the string pointers • 1 word in 4 is indexed, and each stored word is prefixed by a 1-byte length field • the length field allows the start of the next string to be identified and the block of strings traversed

Access structures • In each group of 4 words, 12 bytes of pointers are saved • at the cost of including 4 bytes of length information • for a million-word lexicon: a saving of 2 MB -> 18 MB in total

Access structures • Blocking makes the search process more complex: to look up a term • the array of string pointers is binary-searched to locate the correct block of words • the block is scanned in a linear fashion to find the term • the term’s ordinal term number is inferred from the combination of the block number and the position within the block • the frequency value and inverted file address are accessed using the ordinal term number
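
The blocked lookup can be sketched as follows. The sketch assumes ASCII terms shorter than 256 bytes so that a 1-byte length field suffices; the function names and the sample terms are illustrative.

```python
BLOCK = 4  # one string pointer per group of four terms

def pack_blocked(sorted_terms):
    """Store length-prefixed terms and a pointer to every 4th one."""
    blob = bytearray()
    block_pointers = []
    for i, term in enumerate(sorted_terms):
        if i % BLOCK == 0:
            block_pointers.append(len(blob))
        encoded = term.encode("ascii")
        blob.append(len(encoded))  # 1-byte length field
        blob += encoded
    return bytes(blob), block_pointers

def read_term(blob, offset):
    """Return the term stored at `offset` and the offset of the next term."""
    length = blob[offset]
    start = offset + 1
    return blob[start:start + length].decode("ascii"), start + length

def lookup(blob, block_pointers, n_terms, term):
    """Return the ordinal term number of `term`, or None."""
    # Binary search for the last block whose first term is <= the query term.
    lo, hi, block = 0, len(block_pointers) - 1, -1
    while lo <= hi:
        mid = (lo + hi) // 2
        first, _ = read_term(blob, block_pointers[mid])
        if first <= term:
            block = mid
            lo = mid + 1
        else:
            hi = mid - 1
    if block < 0:
        return None
    # Linear scan inside the block, following the length fields.
    offset = block_pointers[block]
    for k in range(min(BLOCK, n_terms - block * BLOCK)):
        candidate, offset = read_term(blob, offset)
        if candidate == term:
            return block * BLOCK + k
    return None

terms = ["archiving", "compaction", "compression", "data", "image", "retrieval"]
blob, block_pointers = pack_blocked(terms)
print(block_pointers)
print(lookup(blob, block_pointers, len(terms), "compression"))
```

The returned ordinal number would then index parallel arrays holding the frequency values and inverted file addresses.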

Access structures • Consecutive words in a sorted list are likely to share a common prefix • front coding: • 2 integers are stored with each word • one to indicate how many prefix characters are the same as the previous word • the other to record how many suffix characters remain when the prefix is removed • the integers are followed by the suffix characters
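
Complete front coding can be sketched as below. Each word becomes a triple of prefix count, suffix length and suffix characters; the tuple representation and function names are illustrative choices rather than a storage format.

```python
def front_encode(sorted_terms):
    """Encode each term relative to the one before it."""
    coded = []
    previous = ""
    for term in sorted_terms:
        prefix = 0
        limit = min(len(term), len(previous))
        while prefix < limit and term[prefix] == previous[prefix]:
            prefix += 1
        suffix = term[prefix:]
        coded.append((prefix, len(suffix), suffix))
        previous = term
    return coded

def front_decode(coded):
    """Rebuild the terms; each one needs the fully decoded previous term."""
    terms = []
    previous = ""
    for prefix, _suffix_length, suffix in coded:
        term = previous[:prefix] + suffix
        terms.append(term)
        previous = term
    return terms

words = ["compaction", "compress", "compression", "compressor"]
coded = front_encode(words)
print(coded)
print(front_decode(coded))
```

Decoding has to start from the beginning of the list, since every entry depends on its predecessor.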

Access structures • Front coding yields a net saving of about 40 percent of the space required for string storage in a typical lexicon for the English language • problem with the complete front coding: • binary search is no longer possible • solution: partial 3-in-4 front coding

Access structures • Partial 3-in-4 front coding • every 4th word (the one indexed by the block pointer) is stored without front coding, so that binary search can proceed • on a large lexicon, expected to save about 4 bytes on each of three words, at the cost of 2 extra bytes of prefix-length information • a net gain of 10 bytes per 4-word block • for a million-word lexicon: -> 15.5 MB
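
The memory figures quoted for the four lexicon layouts follow from simple per-term arithmetic. The sketch below recomputes them from the numbers given in the slides, assuming decimal megabytes (1 MB = 1,000,000 bytes).

```python
TERMS = 1_000_000  # lexicon size used in the slides

# Average bytes per term under each layout, using the slides' figures.
fixed_records = 20 + 4 + 4               # 20-byte string + address + frequency
concatenated = fixed_records - 8         # slides: 8 MB saved over a million terms
blocked = concatenated - (12 - 4) / 4    # per 4 words: 12 pointer bytes saved, 4 length bytes added
front_coded = blocked - (3 * 4 - 2) / 4  # per 4 words: 3 x 4 bytes saved, 2 bytes added

layouts = [
    ("array of fixed records", fixed_records),
    ("concatenated strings", concatenated),
    ("blocked, 1 pointer per 4 words", blocked),
    ("partial 3-in-4 front coding", front_coded),
]
for name, bytes_per_term in layouts:
    print(f"{name}: {bytes_per_term * TERMS / 1_000_000:.1f} MB")
```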

Disk-based lexicon storage • The amount of primary memory required by the lexicon can be reduced by putting the lexicon on disk • just enough information is retained in primary memory to identify the disk block corresponding to each term

Disk-based lexicon storage • To locate the information corresponding to a given term, the in-memory index is searched to determine a block number • the block is read into a buffer • search is continued within the block • B-tree etc. can be used

Disk-based lexicon storage • This approach is simple and requires a minimal amount of primary memory • a disk-based lexicon is many times slower to access than a memory-based one • one disk access per lookup is required • extra time is tolerable when just a few terms are being looked up (like in normal query processing, fewer than 50 terms) • not suitable for the index construction process

Boolean query processing • Processing a query • the lexicon is searched for each term in the query • each inverted list is retrieved and decoded • lists are merged, taking the intersection, union, or complement, as appropriate • finally, the documents are retrieved and displayed

Conjunctive queries • text AND compression AND retrieval • a conjunctive query of r terms is processed as follows • each term is stemmed and located in the lexicon • if the lexicon is on disk, one disk access per term is required • the terms are sorted by increasing frequency

Conjunctive queries • The inverted list for the least frequent term is read into memory • the list = a set of candidates (documents that have not yet been eliminated and might be answers to the query) • all remaining inverted lists are processed against this set of candidates, in increasing order of term frequency

Conjunctive queries • In a conjunctive query, a candidate cannot be an answer unless it appears in all inverted lists • -> the size of the set of candidates is non-increasing • to process a term, each document in the set of candidates is checked and removed if it does not appear in the term’s inverted list • the remaining candidates are the answers

Term processing order • reasons to select the least frequent term to initialize the set of candidates (and also later): • to minimize the amount of temporary memory space required during query processing • the number of candidates may be quickly reduced, even to zero, after which no further processing is required
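
The candidate-set procedure for conjunctive queries can be sketched as below. Stemming and disk access are omitted, the index is a plain dictionary of ascending document-number lists, and membership is tested through a temporary set for brevity (a real system would more likely merge or search the sorted list directly).

```python
def conjunctive_query(index, terms):
    """Answer 'term AND term AND ...' using a shrinking set of candidates."""
    # A term that is missing from the lexicon makes the conjunction empty.
    if not terms or any(term not in index for term in terms):
        return []
    # Process terms in order of increasing frequency (inverted list length).
    ordered = sorted(terms, key=lambda term: len(index[term]))
    candidates = list(index[ordered[0]])  # list of the least frequent term
    for term in ordered[1:]:
        if not candidates:
            break  # nothing left to eliminate, so stop early
        postings = set(index[term])
        candidates = [doc for doc in candidates if doc in postings]
    return candidates

# Hypothetical inverted lists.
index = {
    "text": [1, 2, 5, 7, 9],
    "compression": [1, 3, 5, 9],
    "retrieval": [5, 9],
}
print(conjunctive_query(index, ["text", "compression", "retrieval"]))
```

Starting from 'retrieval' means the candidate set never holds more than two documents in this example.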

Processing ranked queries • How to assign a similarity measure to each document that indicates how closely it matches a query?

Coordinate matching • Count the number of query terms that appear in each document • the more terms that appear, the more likely it is that the document is relevant • a hybrid query between a conjunctive AND query and a disjunctive OR query • a document that contains any of the terms is a potential answer, but preference is given to documents that contain all or most of them
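
Coordinate matching can be sketched with one counter per document. The index contents are hypothetical, and breaking ties by document number is an arbitrary choice made here so that the output is deterministic.

```python
from collections import Counter

def coordinate_match(index, query_terms, r):
    """Return the r documents matching the most query terms, as (doc, count)."""
    scores = Counter()
    for term in set(query_terms):
        for doc in index.get(term, []):
            scores[doc] += 1  # one point per distinct query term present
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:r]

# Hypothetical inverted lists.
index = {
    "text": [1, 2, 5, 7, 9],
    "compression": [1, 3, 5, 9],
    "retrieval": [5, 9],
}
query = ["text", "compression", "retrieval", "image"]
print(coordinate_match(index, query, 3))
```

Document 1 contains only two of the query terms, so a pure AND query would reject it, yet it still appears in the ranking below the full matches.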

Inner product similarity • Coordinate matching can be formalized as an inner product of a query vector with a set of document vectors • the similarity measure of query Q with document Dd is expressed as • M(Q, Dd) = Q · Dd • the inner product of two n-vectors X = (x_1, ..., x_n) and Y = (y_1, ..., y_n): $X \cdot Y = \sum_{i=1}^{n} x_i y_i$
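
With 0/1 vectors over the lexicon, the inner product M(Q, Dd) counts exactly the query terms that occur in the document. A minimal sketch, using a hypothetical four-term vocabulary:

```python
def inner_product(x, y):
    """Sum of the products of corresponding components of two n-vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(xi * yi for xi, yi in zip(x, y))

vocabulary = ["compression", "image", "retrieval", "text"]

def binary_vector(terms):
    """1 where the vocabulary term is present, 0 elsewhere."""
    present = set(terms)
    return [1 if term in present else 0 for term in vocabulary]

query = binary_vector(["text", "compression", "retrieval"])
document = binary_vector(["text", "image", "compression"])
print(query, document, inner_product(query, document))
```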

Drawbacks of coordinate matching • Takes no account of term frequency • documents with many occurrences of a term should be favored • takes no account of term scarcity • rare terms should have more weight? • long documents with many terms are automatically favored • they are likely to contain more of any given list of query terms
