
Indexing


Presentation Transcript


  1. Indexing

  2. Overview of the Talk
  • Inverted File Indexing
  • Compression of inverted files
  • Signature files and bitmaps
  • Comparison of indexing methods
  • Conclusion

  3. Inverted File Indexing
  • Inverted file index
    • contains a list of terms that appear in the document collection (called a lexicon or vocabulary)
    • and, for each term in the lexicon, stores a list of pointers to all occurrences of that term in the document collection. This list is called an inverted list.
  • Granularity of an index determines the accuracy of representation of the location of the word
  • Coarse-grained index requires less storage and more query processing to eliminate false matches
  • Word-level index enables queries involving adjacency and proximity, but has higher space requirements
  • Usual granularity is document-level, unless a significant fraction of the queries are expected to be proximity-based.

  4. Inverted File Index: Example

  Doc  Text
  1    Pease porridge hot, pease porridge cold,
  2    Pease porridge in the pot,
  3    Nine days old.
  4    Some like it hot, some like it cold,
  5    Some like it in the pot,
  6    Nine days old.

  Term      Documents
  ---------------------
  cold      <2; 1, 4>
  days      <2; 3, 6>
  hot       <2; 1, 4>
  in        <2; 2, 5>
  it        <2; 4, 5>
  like      <2; 4, 5>
  nine      <2; 3, 6>
  old       <2; 3, 6>
  pease     <2; 1, 2>
  porridge  <2; 1, 2>
  pot       <2; 2, 5>
  some      <2; 4, 5>
  the       <2; 2, 5>

  Notation: N: number of documents (= 6); n: number of distinct terms (= 13); f: number of index pointers (= 26)
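
A minimal sketch of building the document-level index above; the tokenizer and the <frequency; documents> tuple layout are illustrative choices, not prescribed by the slides.

```python
from collections import defaultdict
import re

def build_inverted_index(docs):
    """Build a document-level inverted index: term -> (frequency, sorted doc numbers)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    # Store each inverted list as <frequency; d1, d2, ...>
    return {t: (len(ds), sorted(ds)) for t, ds in sorted(index.items())}

docs = {
    1: "Pease porridge hot, pease porridge cold,",
    2: "Pease porridge in the pot,",
    3: "Nine days old.",
    4: "Some like it hot, some like it cold,",
    5: "Some like it in the pot,",
    6: "Nine days old.",
}
index = build_inverted_index(docs)
print(index["pease"])   # (2, [1, 2])
print(index["some"])    # (2, [4, 5])
```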

  5. Inverted File Compression
  • Each inverted list has the form <f_t; d_1, d_2, ..., d_f_t>
  • A naïve representation results in a storage overhead of f · ⌈log N⌉ bits, since each pointer requires fewer than ⌈log N⌉ bits
  • This can also be stored as <f_t; d_1, d_2 - d_1, ..., d_f_t - d_f_t-1>. Each difference is called a d-gap.
  • The d-gaps in a list sum to at most N, so on average they are much smaller than raw document numbers and compress well
  • Assume d-gap representation for the rest of the talk, unless stated otherwise
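
A minimal sketch of converting between document numbers and d-gaps; the sample list is illustrative.

```python
def to_dgaps(doc_ids):
    """Convert a sorted list of document numbers into d-gaps."""
    return [d - prev for prev, d in zip([0] + doc_ids, doc_ids)]

def from_dgaps(gaps):
    """Recover document numbers from d-gaps by prefix summation."""
    docs, total = [], 0
    for g in gaps:
        total += g
        docs.append(total)
    return docs

print(to_dgaps([3, 5, 20, 21, 23, 76, 77, 78]))   # [3, 2, 15, 1, 2, 53, 1, 1]
print(from_dgaps([3, 2, 15, 1, 2, 53, 1, 1]))     # [3, 5, 20, 21, 23, 76, 77, 78]
```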

  6. Text Compression
  • Two classes of text compression methods
  • Symbolwise (or statistical) methods
    • Estimate probabilities of symbols - modeling step
    • Code one symbol at a time - coding step
    • Use shorter code for the most likely symbol
    • Usually based on either arithmetic or Huffman coding
  • Dictionary methods
    • Replace fragments of text with a single code word (typically an index to an entry in the dictionary)
    • e.g. Ziv-Lempel coding, which replaces strings of characters with a pointer to a previous occurrence of the string
    • No probability estimates needed
  • Symbolwise methods are more suited for coding d-gaps

  7. Models
  [Diagram: text goes through an encoder to produce compressed text, which a decoder turns back into text; the same model feeds both encoder and decoder]
  • Models can be static, semi-static or adaptive.
  • Information content of a symbol s, denoted by I(s), is given by Shannon's formula: I(s) = -log Pr[s] (bits, with the log taken base 2)
  • Entropy, the average amount of information per symbol over the whole alphabet, denoted H, is given by H = Σ_s Pr[s] · I(s) = -Σ_s Pr[s] log Pr[s]
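
As a quick illustration of the two formulas, here is the entropy of the symbol distribution used in the Huffman example on slides 8-10; linking the two slides this way is my choice, not the transcript's.

```python
import math

probs = {"A": 0.05, "B": 0.05, "C": 0.1, "D": 0.2, "E": 0.3, "F": 0.2, "G": 0.1}
info = {s: -math.log2(p) for s, p in probs.items()}   # I(s) = -log2 Pr[s]
H = sum(p * info[s] for s, p in probs.items())        # H = sum Pr[s] * I(s)
print({s: round(i, 2) for s, i in info.items()})
print("H =", round(H, 3), "bits/symbol")              # about 2.55 bits/symbol
```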

  8. Huffman Coding: Example
  [Figure: the seven symbols laid out as leaves with their probabilities: A 0.05, B 0.05, C 0.1, D 0.2, E 0.3, F 0.2, G 0.1]

  9. Huffman Coding: Example
  [Figure: first merge step of the Huffman algorithm: the two least likely symbols, A (0.05) and B (0.05), are combined into a node of weight 0.1]

  10. Huffman Coding: Example

  Symbol  Code  Probability
  A       0000  0.05
  B       0001  0.05
  C       001   0.1
  D       01    0.2
  E       10    0.3
  F       110   0.2
  G       111   0.1

  [Figure: the complete Huffman tree, with internal node weights 0.1, 0.2, 0.3, 0.4, 0.6 and root weight 1.0]
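
A runnable sketch of Huffman's algorithm on the slide's probabilities. Tie-breaking in the heap may assign different codewords than the slide's tree, but the resulting code is equally optimal: the average code length comes out at 2.6 bits/symbol either way.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Build a Huffman code; returns {symbol: bitstring}."""
    tick = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(p, next(tick), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # two least likely subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tick), merged))
    return heap[0][2]

probs = {"A": 0.05, "B": 0.05, "C": 0.1, "D": 0.2, "E": 0.3, "F": 0.2, "G": 0.1}
codes = huffman_code(probs)
for sym in sorted(codes):
    print(sym, codes[sym])
avg = sum(probs[s] * len(codes[s]) for s in probs)
# Average code length is 2.6 bits/symbol, the same as the slide's code,
# even though individual codewords may differ because of tie-breaking.
print("average code length:", round(avg, 2))
```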

  11. Arithmetic Coding: Example
  String = bccb; Alphabet = {a, b, c}; Code = 0.64
  [Figure: three panels (A, B, C) showing the coding interval narrowing as each symbol of "bccb" is coded with an adaptive model; the symbol probabilities start at 1/3 each and grow to Pr[a]=1/6, Pr[b]=2/6, Pr[c]=3/6, the interval used to code the first b is [0.3333, 0.6667), and the final interval represents the whole output, with 0.64 a number inside it]

  12. Arithmetic Coding: Conclusions
  • High probability events do not reduce the size of the interval in the next step very much, whereas low-probability events do.
  • A small final interval requires many digits to specify a number guaranteed to be in the interval.
  • Number of bits required is proportional to the negative logarithm of the size of the interval.
  • A symbol s of probability Pr[s] contributes -log Pr[s] bits to the output.
  • Arithmetic coding produces near-optimal codes, given an accurate model.
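
A sketch of the modelling step only (no bit-level output), assuming an adaptive model in which every symbol count starts at 1; under that assumption it reproduces the interval narrowing of the bccb example on slide 11, and 0.64 falls inside the final interval.

```python
from fractions import Fraction

def narrow_interval(text, alphabet):
    """Track how arithmetic coding narrows the coding interval for `text`
    under a simple adaptive model (all symbol counts start at 1).
    Only the interval is tracked; no actual code bits are emitted."""
    low, high = Fraction(0), Fraction(1)
    counts = {s: 1 for s in alphabet}
    for ch in text:
        total = sum(counts.values())
        width = high - low
        cum = 0                      # cumulative count of symbols below ch
        for s in alphabet:
            if s == ch:
                break
            cum += counts[s]
        high = low + width * Fraction(cum + counts[ch], total)
        low = low + width * Fraction(cum, total)
        counts[ch] += 1              # adapt the model
        print(f"after '{ch}': [{float(low):.4f}, {float(high):.4f})")
    return low, high

low, high = narrow_interval("bccb", "abc")
# Any number inside the final interval identifies "bccb"; the slide uses 0.64.
print(low <= Fraction(64, 100) < high)   # True
```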

  13. Methods for Inverted File Compression
  • Methods for compressing d-gap sizes can be classified into
    • global: each list is compressed using the same model
    • local: the model for compressing an inverted list is adjusted according to some parameter, like the frequency of the term
  • Global methods can be divided into
    • non-parameterized: probability distribution for d-gap sizes is predetermined
    • parameterized: probability distribution is adjusted according to certain parameters of the collection
  • By definition, local methods are parameterized.

  14. Non-parameterized models
  • Unary code: an integer x > 0 is coded as (x-1) '1' bits followed by a '0' bit.
  • γ code: number x is coded as a unary code for 1 + ⌊log₂ x⌋, followed by a code of ⌊log₂ x⌋ bits that represents x - 2^⌊log₂ x⌋ in binary.
  • δ code: like γ, except that the number of bits needed to write x in binary (1 + ⌊log₂ x⌋) is itself represented using the γ code rather than unary.
  • For small integers, δ codes are longer than γ codes, but for large integers, the situation reverses.
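
A sketch of the three codes exactly as defined above (logs base 2); the helper names are mine.

```python
def unary(x):
    """Unary code: (x - 1) one-bits followed by a zero-bit."""
    return "1" * (x - 1) + "0"

def gamma(x):
    """Elias gamma code: unary(1 + floor(log2 x)), then x - 2^floor(log2 x) in binary."""
    b = x.bit_length()                                  # 1 + floor(log2 x)
    rest = format(x - (1 << (b - 1)), "b").zfill(b - 1) if b > 1 else ""
    return unary(b) + rest

def delta(x):
    """Elias delta code: like gamma, but the length prefix is gamma-coded."""
    b = x.bit_length()
    rest = format(x - (1 << (b - 1)), "b").zfill(b - 1) if b > 1 else ""
    return gamma(b) + rest

# delta is longer than gamma for small integers, shorter for large ones
for x in [1, 2, 5, 9, 1000]:
    print(x, gamma(x), delta(x))
```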

  15. Non-parameterized models
  • Each code has an underlying probability distribution, which can be derived using Shannon's formula (see the derivation below).
  • The probability assumed by the unary code decays exponentially with the gap size, which is far too small for the d-gaps that occur in practice.
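
The implied distributions, derived from the code lengths via Shannon's formula on slide 7; the γ and δ expressions below are the usual approximations and are not spelled out in the transcript.

```latex
% A code of length |c(x)| bits is optimal when Pr[x] = 2^{-|c(x)|}:
\begin{align*}
\text{unary: } |c(x)| = x
  &\;\Rightarrow\; \Pr[x] = 2^{-x} \\
\gamma:\ |c(x)| \approx 1 + 2\log_2 x
  &\;\Rightarrow\; \Pr[x] \approx \frac{1}{2x^2} \\
\delta:\ |c(x)| \approx 1 + \log_2 x + 2\log_2\log_2 2x
  &\;\Rightarrow\; \Pr[x] \approx \frac{1}{2x(\log_2 2x)^2}
\end{align*}
```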

  16. Global parameterized models
  • Probability that a random document contains a random term: p = f / (N · n)
  • Assuming a Bernoulli process, the d-gaps follow a geometric distribution: Pr[gap = x] = (1 - p)^(x-1) · p
  • Arithmetic coding: code each gap directly against this distribution
  • Huffman-style coding (Golomb coding): a parameterized code whose codeword lengths approximate the same distribution (a sketch follows)
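
A sketch of Golomb coding for d-gaps under the Bernoulli model. The choice of b (roughly 0.69/p) and the minimal-binary remainder follow the standard construction, which the transcript only names; the sample gaps are illustrative.

```python
import math

def minimal_binary(r, b):
    """Minimal binary code for r in [0, b): some codewords get one bit fewer."""
    if b == 1:
        return ""
    k = (b - 1).bit_length()          # ceil(log2 b)
    t = (1 << k) - b                  # how many (k-1)-bit codewords exist
    if r < t:
        return format(r, "b").zfill(k - 1)
    return format(r + t, "b").zfill(k)

def golomb_encode(x, b):
    """Golomb code for a d-gap x >= 1: unary quotient, then remainder."""
    q, r = divmod(x - 1, b)
    return "1" * q + "0" + minimal_binary(r, b)

# Bernoulli model, using the notation from slide 4: p = f / (N * n)
N, n, f = 6, 13, 26
p = f / (N * n)
b = max(1, math.ceil(math.log(2 - p) / -math.log(1 - p)))   # roughly 0.69 / p
print("p =", round(p, 3), " b =", b)
for gap in [1, 2, 3, 7]:
    print(gap, "->", golomb_encode(gap, b))
```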

  17. Global observed frequency model
  • Use the exact observed distribution of d-gap values, and then use arithmetic or Huffman coding
  • Only slightly better than γ or δ code
  • Reason: pointers are not scattered randomly in the inverted file
  • Need local methods for any improvement

  18. Local methods
  • Local Bernoulli
    • Use a different p for each inverted list
    • Use γ code for storing the per-list term frequency from which p is derived
  • Skewed Bernoulli
    • Local Bernoulli model is bad for clusters
    • Use a cross between γ and Golomb, with b = median gap size
    • Need to store b (use γ representation)
  • This is still a static model; need an adaptive model that is good for clusters

  19. Interpolative code
  • Consider an inverted list in which documents 8, 9, 11, 12 and 13 form a cluster
  • Gap-based codes such as γ or Golomb still spend several bits on every pointer in the cluster
  • Can do better with a minimal binary code: code the middle pointer within the known range, then recurse on each half; inside a cluster the ranges become very tight, so pointers cost few bits, sometimes none (see the sketch below)
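
A sketch of interpolative coding. The inverted list and the collection size N = 20 are illustrative assumptions (only the cluster 8, 9, 11, 12, 13 is given in the transcript); the sketch reports the range each pointer is coded in and its minimal-binary cost, rather than emitting bits.

```python
import math

def interpolative(docs, lo, hi):
    """List (doc, lo, hi) triples in the order an interpolative coder emits them:
    the middle pointer is coded in minimal binary within [lo, hi], then the two
    halves are coded recursively in the narrowed ranges."""
    if not docs:
        return []
    mid = len(docs) // 2
    d = docs[mid]
    lo_d = lo + mid                        # `mid` smaller pointers must fit below d
    hi_d = hi - (len(docs) - 1 - mid)      # and the remaining pointers above it
    return ([(d, lo_d, hi_d)]
            + interpolative(docs[:mid], lo, d - 1)
            + interpolative(docs[mid + 1:], d + 1, hi))

# Hypothetical inverted list containing the cluster 8, 9, 11, 12, 13
docs = [3, 8, 9, 11, 12, 13, 17]
for d, lo, hi in interpolative(docs, 1, 20):
    bits = math.ceil(math.log2(hi - lo + 1))
    print(f"doc {d:2d} coded within [{lo}, {hi}]: at most {bits} bits")
```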

  20. Performance of index compression methods
  [Table: compression of inverted files in bits per pointer for the methods discussed; the figures themselves are not preserved in the transcript]

  21. Signature Files
  • Each document is given a signature that captures its content
    • Hash each document term to get several hash values
    • Bits corresponding to those values are set to 1
  • Query processing:
    • Hash each query term to get several hash values
    • If a document has all bits corresponding to those values set to 1, it may contain the query term
  • To keep false matches rare:
    • set several bits for each term
    • make the signatures sufficiently long
  • Naïve representation: may have to read the entire signature file for each query term
    • Use bitslicing to save on disk transfer time
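
A sketch of superimposed coding for signatures; the signature width, number of hash functions, and the md5-based hashing are illustrative choices, not from the slides.

```python
import hashlib

SIG_BITS = 64      # signature width (illustrative)
NUM_HASHES = 3     # bits set per term (illustrative)

def term_bits(term):
    """Hash a term to several bit positions."""
    return {int(hashlib.md5(f"{i}:{term}".encode()).hexdigest(), 16) % SIG_BITS
            for i in range(NUM_HASHES)}

def signature(text):
    """Document signature: OR together the bit patterns of all its terms."""
    sig = 0
    for term in text.lower().split():
        for pos in term_bits(term):
            sig |= 1 << pos
    return sig

def may_contain(sig, term):
    """True if all of the term's bits are set; can still be a false match."""
    return all(sig >> pos & 1 for pos in term_bits(term))

doc_sig = signature("pease porridge in the pot")
print(may_contain(doc_sig, "porridge"))   # True
print(may_contain(doc_sig, "nine"))       # almost certainly False, but a
                                          # false match is possible by design
```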

  22. Signature files: Conclusion
  • Design involves many tradeoffs
    • wide, sparse signatures reduce the number of false matches
    • short, dense signatures require more disk accesses
  • For reasonable query times, requires more space than a compressed inverted file
  • Inefficient for documents of varying sizes
    • Blocking makes simple queries difficult to answer
  • Text is not random

  23. Bitmaps
  • Simple representation: for each term in the lexicon, store a bitvector of length N. A bit is set if and only if the corresponding document contains the term.
  • Efficient for boolean queries
  • Enormous storage requirement, even after removing stop words
  • Have been used to represent common words

  24. Compression of signature files and bitmaps
  • Signature files are already in compressed form
    • Decompression affects query time substantially
    • Lossy compression results in false matches
  • Bitmaps can be compressed by a significant amount, e.g. hierarchically:

    Bitmap (16 blocks of 4 bits):
      0000 0010 0000 0011  1000 0000 0100 0000  0000 0000 0000 0000  0000 0000 0000 0000
    Compressed code:
      1100 : 0101, 1010 : 0010, 0011, 1000, 0100
    (1100 marks which groups of four blocks are non-zero, 0101 and 1010 mark the non-zero blocks within those groups, and the non-zero blocks themselves follow.)
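
A sketch of the two-level hierarchical compression the example appears to use; this is my reading of the compressed code, since the transcript does not spell the scheme out.

```python
def compress_bitmap(bits, block=4, fanout=4):
    """Two-level hierarchical bitmap compression: a top-level vector marks which
    groups of `fanout` blocks are non-zero, per-group vectors mark the non-zero
    blocks within those groups, and only the non-zero blocks are kept."""
    blocks = [bits[i:i + block] for i in range(0, len(bits), block)]
    groups = [blocks[i:i + fanout] for i in range(0, len(blocks), fanout)]
    top = "".join("1" if any("1" in b for b in g) else "0" for g in groups)
    group_vecs, kept = [], []
    for g in groups:
        if not any("1" in b for b in g):
            continue                          # all-zero group: nothing stored
        group_vecs.append("".join("1" if "1" in b else "0" for b in g))
        kept.extend(b for b in g if "1" in b)
    return top, group_vecs, kept

bitmap = ("0000 0010 0000 0011 1000 0000 0100 0000 "
          "0000 0000 0000 0000 0000 0000 0000 0000").replace(" ", "")
top, group_vecs, kept = compress_bitmap(bitmap)
print(top, ":", ", ".join(group_vecs), ":", ", ".join(kept))
# -> 1100 : 0101, 1010 : 0010, 0011, 1000, 0100
```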

  25. Comparison of indexing methods
  • All indexing methods are variations of the same basic idea!
  • Signature files and inverted files require an order of magnitude less secondary storage than bitmaps
  • Signature files cause unnecessary accesses to the document collection unless the signature width is large
  • Signature files are disastrous when record lengths vary a lot
  • Advantages of signature files
    • no need to keep the lexicon in memory
    • better for conjunctive queries involving common terms
  • Compressed inverted files are the most useful for indexing a collection of variable-length text documents

  26. Conclusion
  • For practical purposes, the best index compression algorithm is the local Bernoulli method (using Golomb coding)
  • In practice, compressed inverted indices are almost always better than signature files and bitmaps, in terms of both space and query response time
