
Index Compression

Ferrol Aderholdt


Presentation Transcript


  1. Index Compression Ferrol Aderholdt

  2. Motivation • Uncompressed indexes are large • Compression could let some modern devices support information retrieval techniques that they could not support with uncompressed indexes

  3. Motivation (cont.) • Disk I/O is slow

  4. Types of Compression • Lossy • Compression that involves the removal of data. • Lossless • Compression that involves no removal of data.

  5. Overview • A lossy compression scheme • Static Index Pruning • Lossless compression • Elias Codes • n-s encoding • Golomb encoding • Variable Byte Encoding (vByte) • Fixed Binary Codewords • CPSS-Tree

  6. Static Index Pruning • Goal is to reduce the size of the index without reducing precision, such that a human can’t tell the difference between a pruned index and a non-pruned index • Focuses on the top k or top δ results • Assumes there is a scoring function • Assumes the function is based on some table A such that A(t,d) > 0 if term t occurs in document d and A(t,d) = 0 otherwise

  7. Static Index Pruning (cont.) • Two approaches • Uniform pruning • The removal of “all posting entries whose corresponding table values are bounded above by some fixed cutoff threshold” • Could prune a term’s entire posting list • Term-based pruning • An approach that attempts to guarantee that every term will have at least some entries remaining in the index
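
The two pruning strategies above could be sketched as follows. This is only an illustration under assumptions: the index is a dictionary of (doc_id, A(t,d)) postings, and the per-term cutoff (eps times the k-th highest value in a term's list) is one plausible reading of term-based pruning, since the slides do not spell out the rule. The names `uniform_prune`, `term_based_prune`, `tau`, `k`, and `eps` are hypothetical.

```python
from typing import Dict, List, Tuple

Posting = Tuple[int, float]            # (doc_id, table value A(t, d))
Index = Dict[str, List[Posting]]

def uniform_prune(index: Index, tau: float) -> Index:
    """Drop every posting whose value is at most the single global cutoff tau."""
    return {t: [(d, a) for d, a in plist if a > tau]
            for t, plist in index.items()}

def term_based_prune(index: Index, k: int, eps: float) -> Index:
    """Assumed rule: per-term cutoff = eps * (k-th highest value in that list)."""
    pruned: Index = {}
    for t, plist in index.items():
        ranked = sorted((a for _, a in plist), reverse=True)
        z = ranked[k - 1] if len(ranked) >= k else 0.0
        tau_t = eps * z
        pruned[t] = [(d, a) for d, a in plist if a > tau_t]
    return pruned

# Toy index (made-up values).
index = {
    "rare": [(1, 0.2), (4, 0.05)],
    "common": [(1, 0.9), (2, 0.8), (3, 0.3)],
}
uniform = uniform_prune(index, tau=0.25)
term_based = term_based_prune(index, k=1, eps=0.5)
```

In this toy index, uniform pruning empties the posting list for "rare", while the term-based variant keeps that term's best entry.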

  8. Static Index Pruning (cont.) • Scoring functions are fuzzy • Only need to find some scoring function S’ such that S’ is within a factor of epsilon of S • Carmel et al. [1] proved this mathematically for both uniform and term-based methods

  9. Static Index Pruning (cont.)

  10. Static Index Pruning results • Found that the idealized top k pruning algorithm did not work very well • The smallest value in the posting list was almost always above their threshold, so little pruning was done • Modified the algorithm to apply a shift • Subtracted the smallest value from all positive scores within the list • Greatly increased the pruning
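
The effect of the shift can be illustrated with a small sketch. It assumes the same hypothetical per-term cutoff as before (eps times the k-th highest score), which the slides do not state explicitly, and it does not address edge cases such as lists whose positive scores are all equal. `prune_list` and its parameters are illustrative names.

```python
from typing import List, Tuple

Posting = Tuple[int, float]  # (doc_id, score)

def prune_list(plist: List[Posting], k: int, eps: float, shift: bool) -> List[Posting]:
    """Prune one posting list, optionally subtracting the smallest positive score first."""
    positive = [(d, a) for d, a in plist if a > 0]
    if not positive:
        return []
    low = min(a for _, a in positive) if shift else 0.0
    adjusted = [(d, a, a - low) for d, a in positive]
    ranked = sorted((s for _, _, s in adjusted), reverse=True)
    z = ranked[k - 1] if len(ranked) >= k else 0.0
    tau = eps * z                      # assumed cutoff rule
    return [(d, a) for d, a, s in adjusted if s > tau]

# Made-up scores that are all well above half of the top score.
plist = [(1, 10.0), (2, 9.0), (3, 8.0), (4, 7.0)]
without_shift = prune_list(plist, k=1, eps=0.5, shift=False)
with_shift = prune_list(plist, k=1, eps=0.5, shift=True)
```

Without the shift the cutoff is 5.0 and nothing is pruned; with the shift the scores become 3, 2, 1, 0 and the cutoff 1.5, so half the list is removed.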

  11. Static Index Pruning results (cont.)

  12. Static Index Pruning results (cont.)

  13. Static Index Pruning results (cont.)

  14. Overview • Lossless Compression

  15. Elias Codes • Non-parameterized bitwise method of coding integers • Gamma Codes • Represent a positive integer k by 1 + ⌊log2 k⌋ stored as a unary code, followed by the binary representation of k without its most significant bit • Not efficient for numbers larger than 15
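
A minimal sketch of gamma coding, using bit strings for readability. It assumes the common unary convention in which a length n is written as n - 1 one-bits followed by a zero; the function names are illustrative.

```python
from typing import Tuple

def gamma_encode(k: int) -> str:
    """Elias gamma code of a positive integer, as a string of '0'/'1'."""
    if k < 1:
        raise ValueError("gamma codes are defined for positive integers only")
    binary = bin(k)[2:]                # binary representation, MSB first
    n = len(binary)                    # n = 1 + floor(log2 k)
    return "1" * (n - 1) + "0" + binary[1:]

def gamma_decode(bits: str, pos: int = 0) -> Tuple[int, int]:
    """Decode one gamma code starting at pos; return (value, next position)."""
    n = 1
    while bits[pos] == "1":            # read the unary length
        n += 1
        pos += 1
    pos += 1                           # skip the terminating '0'
    value = int("1" + bits[pos:pos + n - 1], 2)   # restore the dropped MSB
    return value, pos + n - 1

code_for_5 = gamma_encode(5)           # '110' (length 3) + '01'
```

Under this convention a code takes 2⌊log2 k⌋ + 1 bits, so 15 needs 7 bits and 16 already needs 9.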

  16. Elias Codes (cont.) • Delta Codes • Represent a positive integer k by 1 + ⌊log2 k⌋ stored as a gamma code, followed by the binary representation of k without its most significant bit • Not efficient for small values
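
Delta coding can be sketched the same way. The helper `_gamma` is a compact gamma encoder included only to keep the example self-contained, with unary lengths written as n - 1 one-bits followed by a zero (an assumed convention); all names are illustrative.

```python
from typing import Tuple

def _gamma(k: int) -> str:
    """Elias gamma code: unary length, then the binary digits without the MSB."""
    binary = bin(k)[2:]
    return "1" * (len(binary) - 1) + "0" + binary[1:]

def delta_encode(k: int) -> str:
    """Elias delta code: the bit length is gamma-coded instead of unary-coded."""
    if k < 1:
        raise ValueError("delta codes are defined for positive integers only")
    binary = bin(k)[2:]
    return _gamma(len(binary)) + binary[1:]

def delta_decode(bits: str, pos: int = 0) -> Tuple[int, int]:
    """Decode one delta code starting at pos; return (value, next position)."""
    length_bits = 1
    while bits[pos] == "1":            # unary part of the gamma-coded length
        length_bits += 1
        pos += 1
    pos += 1                           # skip the terminating '0'
    n = int("1" + bits[pos:pos + length_bits - 1], 2)
    pos += length_bits - 1
    value = int("1" + bits[pos:pos + n - 1], 2)
    return value, pos + n - 1

small = (len(_gamma(2)), len(delta_encode(2)))        # gamma wins for small k
large = (len(_gamma(1000)), len(delta_encode(1000)))  # delta wins for large k
```

The two length pairs show the trade-off named on the slide: delta spends an extra bit on very small values and saves bits on large ones.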

  17. n-s coding • Parameterized, bitwise encoding • Uses blocks of n bits followed by s stop bits. • Also takes a parameter b, the base of the number: the value stored in each n-bit block must be less than b.

  18. n-s coding example • Let n=3, s=2, and the base be 6. • Valid data blocks are 000, 001, 010, 011, 100, and 101. • 101 100 001 11 would have the value 541 in base 6 (205 in decimal)
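
The example can be checked with a short sketch. It assumes the stop marker is s one-bits and that the decoder spots it by peeking at the next s bits, which works here because no valid base-6 block starts with 11. The function names are hypothetical, and the scheme in [2] may differ in detail.

```python
def ns_encode(value: int, n: int, s: int, base: int) -> str:
    """Write value as base-`base` digits, n bits per digit, then s stop bits."""
    digits = []
    while True:
        value, digit = divmod(value, base)
        digits.append(digit)
        if value == 0:
            break
    blocks = "".join(format(d, "b").zfill(n) for d in reversed(digits))
    return blocks + "1" * s

def ns_decode(bits: str, n: int, s: int, base: int) -> int:
    """Read n-bit digit blocks until the next s bits are the stop marker."""
    stop = "1" * s
    value, pos = 0, 0
    while bits[pos:pos + s] != stop:
        digit = int(bits[pos:pos + n], 2)
        if digit >= base:
            raise ValueError("block value must be smaller than the base")
        value = value * base + digit
        pos += n
    return value

decoded = ns_decode("10110000111", n=3, s=2, base=6)   # 5*36 + 4*6 + 1
```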

  19. n-s coding (cont.) • [2] used n-s codes with prefix omission and run-length encoding • Ex.

  20. n-s coding (cont.) • Run-length encoding is the process of replacing non-initial elements of a sequence with differences between adjacent elements. E.g.
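
As a small illustration of the transformation described on this slide, the sketch below replaces every non-initial element with its difference from the previous one and then restores the original sequence. The input values are made up, and the function names are illustrative.

```python
from itertools import accumulate
from typing import List

def to_differences(values: List[int]) -> List[int]:
    """Keep the first element; replace each later one by its gap to the previous."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

def from_differences(gaps: List[int]) -> List[int]:
    """Invert the transformation with a running sum."""
    return list(accumulate(gaps))

positions = [3, 8, 15, 16, 30]          # hypothetical increasing sequence
gaps = to_differences(positions)        # [3, 5, 7, 1, 14]
restored = from_differences(gaps)
```

The differences are smaller than the original values, which is what makes the integer codes in the rest of the deck effective.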

  21. n-s coding results

  22. Golomb coding • Better compression and faster retrieval than Elias codes • Is parameterized • The parameter is usually stored separately, using some other compression scheme
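
The slides do not define the code itself, so the following is a sketch of the textbook formulation rather than anything taken from the deck: for parameter b, the quotient (k - 1) div b is written in unary and the remainder in truncated binary. The names are illustrative.

```python
from typing import Tuple

def _fixed(value: int, width: int) -> str:
    """value as exactly `width` bits; the empty string when width is 0."""
    return format(value, "b").zfill(width) if width > 0 else ""

def golomb_encode(k: int, b: int) -> str:
    """Golomb code of k >= 1 with parameter b >= 1, as a bit string."""
    if k < 1 or b < 1:
        raise ValueError("k and b must be positive")
    q, r = divmod(k - 1, b)
    c = (b - 1).bit_length()           # ceil(log2 b)
    cutoff = (1 << c) - b              # remainders below this use c - 1 bits
    unary = "1" * q + "0"
    if r < cutoff:
        return unary + _fixed(r, c - 1)
    return unary + _fixed(r + cutoff, c)

def golomb_decode(bits: str, b: int, pos: int = 0) -> Tuple[int, int]:
    """Decode one Golomb code starting at pos; return (value, next position)."""
    q = 0
    while bits[pos] == "1":
        q += 1
        pos += 1
    pos += 1                           # skip the terminating '0'
    c = (b - 1).bit_length()
    cutoff = (1 << c) - b
    r = int(bits[pos:pos + c - 1], 2) if c > 1 else 0
    pos += max(c - 1, 0)
    if c > 0 and r >= cutoff:          # long remainder: read one more bit
        r = ((r << 1) | int(bits[pos])) - cutoff
        pos += 1
    return q * b + r + 1, pos

codes_b3 = [golomb_encode(k, 3) for k in range(1, 5)]
```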

  23. vByte coding • A very simple bytewise compression scheme • Uses 7 bits to code the data portion and the most significant bit is reserved as a flag bit.
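
A minimal vByte sketch follows. The slide only says that the most significant bit is a flag; treating a set flag as "this is the last byte of the integer" and emitting the most significant 7-bit group first are assumptions made for this example.

```python
from typing import List

def vbyte_encode(k: int) -> bytes:
    """Encode a non-negative integer in 7-bit groups, one group per byte."""
    if k < 0:
        raise ValueError("vByte encodes non-negative integers")
    groups = []
    while True:
        groups.append(k & 0x7F)        # low 7 bits
        k >>= 7
        if k == 0:
            break
    groups.reverse()                   # most significant group first (assumed)
    groups[-1] |= 0x80                 # flag bit marks the final byte (assumed)
    return bytes(groups)

def vbyte_decode(data: bytes) -> List[int]:
    """Decode a concatenation of vByte-coded integers."""
    values, current = [], 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if byte & 0x80:                # flag set: the integer is complete
            values.append(current)
            current = 0
    return values

stream = b"".join(vbyte_encode(v) for v in [5, 300, 0])
decoded = vbyte_decode(stream)
```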

  24. Scholer et al. • Defined an inverted list as a sequence of entries of the form <freq,doc,[offsets]> • Example inverted list for term “Matthew”: <3,7,[6,51,117]><1,44,[12]><2,117,[14,1077]> • Uses different coding schemes per part • E.g. Golomb for freq, Gamma for doc, and vByte for offset

  25. Scholer et al. (cont.) • One optimization is to require the encoding to be byte aligned so that decompression can be faster • Another optimization, for Boolean or ranked queries, is to skip over the offsets by inspecting only their flag bits instead of decoding them. • Referred to as scanning

  26. Scholer et al. (cont.) • Third optimization is called signature blocks. • An eight-bit block that stores the flag bits of up to eight blocks that follow. • For example: 11100101 • Represents 5 integers that are stored in the eight blocks • Requires more space but allows the data blocks to use all 8 bits instead of 7.
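
The signature-block idea might look like this in code. The sketch assumes that a flag bit of 1 marks the last block of an integer and that flags are read from the most significant bit down; that reading is chosen because it yields five integers for the signature 11100101, matching the slide. The data bytes are invented.

```python
from typing import List

def decode_signature_group(signature: int, blocks: bytes) -> List[int]:
    """Decode up to eight full 8-bit data blocks governed by one signature byte."""
    values, current = [], 0
    for i, block in enumerate(blocks):
        current = (current << 8) | block       # all 8 bits carry data
        if (signature >> (7 - i)) & 1:         # flag 1: integer ends here (assumed)
            values.append(current)
            current = 0
    return values

signature = 0b11100101
blocks = bytes([1, 2, 3, 0, 1, 0, 1, 44])      # made-up payload
values = decode_signature_group(signature, blocks)
```

Blocks 4 to 6 combine into one integer and blocks 7 to 8 into another, so the eight blocks hold five integers.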

  27. Scholer et al. results

  28. Scholer et al. results (cont.)

  29. Scholer et al. results (cont.)

  30. Fixed Binary Codes • Oftentimes the inverted list will be stored as a series of differences (d-gaps) between successive document IDs • This reduces the number of bits required to represent a document ID on average

  31. Fixed Binary Codes (cont.) • Take for example the following list of d-gaps: <12; 38, 17, 13, 34, 6, 4, 1, 3, 1, 2, 3, 1> • If a fixed-width binary code were used to encode this list, every codeword would take 6 bits, although most of the gaps need far fewer

  32. Fixed Binary Codes (cont.) • Instead encode as spans: <12; (6,4 : 38, 17, 13, 34),(3,1 : 6), (2,7 : 4, 1, 3, 1, 2, 3, 1)> where the notation (w,s : ...) indicates that w-bit binary codes are to be used to code each of the next s values. • Similar to the approach of Anh and Moffat
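
The saving from spans can be checked with simple arithmetic. The sketch below counts data bits only, ignoring whatever is needed to store each (w, s) pair, and it assumes a gap g is stored as g - 1 so that, for example, 4 fits in 2 bits; that offset is an assumption, not something stated on the slides.

```python
gaps = [38, 17, 13, 34, 6, 4, 1, 3, 1, 2, 3, 1]
spans = [(6, 4), (3, 1), (2, 7)]               # (w, s) pairs from the slide

# Every gap coded with the width of the largest one.
fixed_bits = 6 * len(gaps)

# Each span contributes w bits for each of its s values.
span_bits = sum(w * s for w, s in spans)

# Check that every gap fits its span width when stored as g - 1 (assumed).
fits, start = True, 0
for w, s in spans:
    fits = fits and all(g - 1 < (1 << w) for g in gaps[start:start + s])
    start += s
```

With these numbers the data portion drops from 72 bits to 41 bits before selector overhead is added.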

  33. Anh and Moffat • Uses a selector followed by a data representation for encoding • A selector can be thought of as the unary portion of gamma codes • Data representation would be the binary portion of gamma codes • The selector uses a table of values where each case is determined by the w-value and is relative to the previous case.

  34. Anh and Moffat (cont.)

  35. Anh and Moffat (cont.) • Using this list and assuming s1 = 1, s2 = 2, and s3 = 4 • From the table on the previous slide we get the following • With each selector as 4 bits (2 bits for w ± 3, 2 bits to choose s1-s3) it takes 16 bits plus the sum of all of the w × s products. So, 57 bits are used to encode this list. It would take 60 bits for gamma code.

  36. Anh and Moffat (cont.)

  37. Anh and Moffat (cont.) • Parsing is used to discover the segments. • A graph is used in combination with shortest path labeling • Each node is a d-gap and the width to code it • Each outgoing edge is a different way in which a selector might be used to cover some subsequent gaps.
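
The shortest-path view of parsing can be sketched with dynamic programming over list positions. This is a deliberate simplification and not Anh and Moffat's actual algorithm: it charges a flat 4-bit selector per span, allows any width for any span (no relative-width table), permits only span lengths 1, 2, and 4, and stores a gap g as g - 1. All names are hypothetical.

```python
from typing import List, Sequence, Tuple

def parse_spans(gaps: List[int], lengths: Sequence[int] = (1, 2, 4),
                selector_bits: int = 4) -> Tuple[int, List[Tuple[int, int]]]:
    """Return (total bits, [(w, s), ...]) for a cheapest segmentation."""
    if 1 not in lengths:
        raise ValueError("span length 1 must be allowed so a parse always exists")
    widths = [max(1, (g - 1).bit_length()) for g in gaps]   # bits per gap
    n = len(gaps)
    best = [float("inf")] * (n + 1)    # best[i]: cheapest cost to code gaps[i:]
    choice = [0] * (n + 1)             # span length chosen at position i
    best[n] = 0
    for i in range(n - 1, -1, -1):     # positions are nodes, spans are edges
        for s in lengths:
            if i + s > n:
                continue
            w = max(widths[i:i + s])
            cost = selector_bits + w * s + best[i + s]
            if cost < best[i]:
                best[i], choice[i] = cost, s
    spans, i = [], 0
    while i < n:                       # follow the recorded choices
        s = choice[i]
        spans.append((max(widths[i:i + s]), s))
        i += s
    return int(best[0]), spans

gaps = [38, 17, 13, 34, 6, 4, 1, 3, 1, 2, 3, 1]
total_bits, spans = parse_spans(gaps)
```

Because the cost model here differs from the one on the slides, the total it reports is not expected to match the 57 bits quoted there.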

  38. Anh and Moffat (cont.) • A multiplier m is used since every list can be different but the values for s1, s2, and s3 are fixed. • For example, if m = 2 and s1 = 1, s2 = 2, and s3 = 4 (written 1-2-4), the effective span lengths become 2-4-8. • An escape sequence can also be used on lists that have runs of gaps longer than s3 would allow. • This is the addition of an extra 4 bits stating that up to 15m gaps can be placed under one selector

  39. Anh and Moffat results (cont.)

  40. Anh and Moffat results (cont.)

  41. Anh and Moffat

  42. Speeding up decoding • Need to exploit the cache and reduce both cache misses and TLB misses • Use CSS-trees or CPSS-trees • CSS-trees are cache-sensitive search trees that are a variation on m-ary trees. • Storing the nodes contiguously removes the need for child pointers • This allows each node to fit into a cache line (32 or 64 bytes)
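
The pointer-free idea can be illustrated with an implicit m-ary search tree stored in a single array, where the children of node i are found by arithmetic at nodes i(m+1)+1 through i(m+1)+m+1. This is a simplified sketch in the spirit of CSS-trees, not the exact structure from the literature, and Python lists do not model cache lines; the names are illustrative.

```python
from typing import List

def build_implicit_tree(sorted_keys: List[int], m: int) -> List[int]:
    """Lay out sorted keys so node j occupies slots j*m .. j*m + m - 1."""
    n = len(sorted_keys)
    tree: List[int] = [0] * n
    keys = iter(sorted_keys)

    def fill(node: int) -> None:       # in-order walk assigns keys in sorted order
        if node * m >= n:
            return
        for i in range(m):
            fill(node * (m + 1) + i + 1)
            slot = node * m + i
            if slot < n:
                tree[slot] = next(keys)
        fill(node * (m + 1) + m + 1)

    fill(0)
    return tree

def search(tree: List[int], m: int, target: int) -> int:
    """Return the slot holding target, or -1; child positions are computed."""
    n, node = len(tree), 0
    while node * m < n:
        child = m                      # default: rightmost child
        for i in range(m):
            slot = node * m + i
            if slot >= n or target < tree[slot]:
                child = i
                break
            if tree[slot] == target:
                return slot
        node = node * (m + 1) + child + 1
    return -1

keys = list(range(0, 200, 2))
tree = build_implicit_tree(keys, m=4)
```

Each node's keys sit next to each other in the array, so a lookup touches one contiguous group of m keys per level and never follows a stored pointer.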

  43. CSS-Tree vs m-ary Tree

  44. CPSS-trees • The main purpose of cache/page-sensitive search trees is to reduce the number of cache/TLB misses during random searches • Accomplished by making each node, except the root, 4 KB in size and having it contain several CSS-trees • The CSS-trees are the same size as a cache line and contain the postings • Either 32 or 64 bytes

  45. CPSS-trees results

  46. CPSS-trees results (cont.)

  47. Compressed CPSS-trees results

  48. Compressed CPSS-tree results

  49. Questions?

  50. References • [1] David Carmel, Doron Cohen, Ronald Fagin, Eitan Farchi, Michael Herscovici, Yoelle S. Maarek, Aya Soffer. Static Index Pruning for Information Retrieval Systems. SIGIR ’01: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 43-50, 2001. • [2] Gordon Linoff, Craig Stanfill. Compression of Indexes with Full Positional Information in Very Large Text Databases. SIGIR ’93: Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 88-95, 1993.
