
Search engines 2




Presentation Transcript


  1. Search engines 2 • Øystein Torbjørnsen, Fast Search and Transfer

  2. Outline • Inverted index • Constructing inverted indexes • Compression • Succinct index (Holger Bast) • Hierarchical inverted indexes • Skip lists

  3. Inverted index [Figure: a dictionary (terms such as a, cal, dark, darker, drill, excellent, zebra) pointing into a posting file; each posting list entry holds docid, frequency, and position list]

  4. Inverted index • Posting list is sorted on docid • Usually 2 disk IOs to look up one term, O(1) • One to read the dictionary entry • One to read the posting list (possibly large)
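The dictionary and posting-list layout described above can be sketched in memory as follows. This is a minimal illustration, not the presenter's implementation: the function name `build_index`, the whitespace tokenizer, and the sample documents are all assumptions, and a Python dict stands in for the on-disk dictionary and posting file.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to a posting list of (docid, frequency, positions).

    Documents are visited in ascending docid order, so every posting
    list ends up sorted on docid.
    """
    index = defaultdict(list)
    for docid in sorted(docs):
        positions = defaultdict(list)
        # Hypothetical tokenizer: lowercase and split on whitespace.
        for pos, term in enumerate(docs[docid].lower().split()):
            positions[term].append(pos)
        for term, plist in positions.items():
            index[term].append((docid, len(plist), plist))
    return dict(index)

# Hypothetical sample documents, keyed by docid.
docs = {2: "dark zebra", 1: "a dark dark drill", 3: "excellent drill"}
index = build_index(docs)
postings_for_dark = index["dark"]  # one dictionary lookup, then one posting list
```

Looking up a term is a single dictionary access followed by reading one posting list, which mirrors the two disk IOs mentioned on the slide.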

  5. Construction • Create sorted subfiles • Merge the subfiles into one large file • Needs twice the disk storage of the final index
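The two construction steps can be sketched as a sort-then-merge pipeline. This is a simplified sketch under stated assumptions: in-memory lists stand in for the sorted subfiles, postings are reduced to `(term, docid)` pairs, and the names `make_sorted_runs` and `merge_runs` are hypothetical.

```python
import heapq

def make_sorted_runs(postings, run_size):
    """Split (term, docid) pairs into sorted runs of at most run_size entries.

    Each run plays the role of one sorted subfile.
    """
    return [sorted(postings[i:i + run_size])
            for i in range(0, len(postings), run_size)]

def merge_runs(runs):
    """K-way merge of the sorted runs into one sorted sequence."""
    return list(heapq.merge(*runs))

# Hypothetical input: unsorted (term, docid) pairs as they arrive.
postings = [("zebra", 3), ("a", 1), ("drill", 2), ("a", 2), ("cal", 1)]
runs = make_sorted_runs(postings, run_size=2)
merged = merge_runs(runs)
```

While the merge runs, both the subfiles and the growing output exist at the same time, which is where the doubled storage requirement comes from.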

  6. Compression • Basic idea: • Use knowledge of value distribution to compress data • Costly to compress and decompress, but • Less disk IO • More data fits in main memory • Better locality in memory • Many different schemes: • Delta coding • vByte • PFOR-DELTA • Huffman, Golomb, Rice, Simple9, Simple16

  7. Delta coding • Works on sorted lists • Encoded as difference from previous entry • To be combined with other compression • Example: original 17 31 62 88 89 97 113 187 199 → deltas 17 14 31 26 1 8 16 74 12
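A short sketch of delta coding, using the docid list from the slide as input. The function names are illustrative; the first value is stored as its difference from an implicit 0.

```python
from itertools import accumulate

def delta_encode(sorted_values):
    """Replace each value by its difference from the previous one."""
    deltas = []
    prev = 0  # the first entry is stored relative to 0
    for value in sorted_values:
        deltas.append(value - prev)
        prev = value
    return deltas

def delta_decode(deltas):
    """Rebuild the original values as a running sum of the deltas."""
    return list(accumulate(deltas))

docids = [17, 31, 62, 88, 89, 97, 113, 187, 199]
deltas = delta_encode(docids)
```

The deltas are smaller than the original values, which is what makes a follow-up scheme such as vByte effective.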

  8. vByte • Variable-byte encoding • Using full bytes • 1 marker bit + 7 value bits • Fast encoding and decoding • Example (marker bit 1 = end of value): 0 1001100 → 76 × 128 × 128 = 1245184; 0 0111001 → 57 × 128 = 7296; 1 1101010 → 106; total = 1252586
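The byte layout in the slide's example (most significant 7-bit group first, marker bit set on the last byte of a value) can be sketched as below. Other vByte variants order the groups or use the marker bit differently; this sketch only follows the convention shown in the example, and the function names are assumptions.

```python
def vbyte_encode(n):
    """Encode a non-negative integer as 7-bit groups, most significant first.

    The high bit is set only on the final byte, marking the end of the value.
    """
    if n < 0:
        raise ValueError("vByte encodes non-negative integers only")
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()
    groups[-1] |= 0x80  # end marker on the last byte
    return bytes(groups)

def vbyte_decode(data):
    """Decode a concatenation of vByte-encoded integers."""
    values = []
    current = 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if byte & 0x80:  # end marker: the value is complete
            values.append(current)
            current = 0
    return values

encoded = vbyte_encode(1252586)
```

Because each value ends on a byte boundary, decoding needs only shifts and masks, which is why the scheme is fast.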

  9. PFOR-DELTA • Combination of three techniques • P=Prefix suppression • FOR=Frame Of Reference • DELTA = delta coding • Blocks of e.g. 128 values • Fixed number of bits per value • Exception list for outliers
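The block layout (fixed number of bits per value plus an exception list) can be illustrated with a heavily simplified sketch. Assumptions: the input block already holds deltas, the block minimum serves as the frame of reference, slots are kept as plain integers rather than packed bits, and exceptions are stored as `(index, value)` pairs. Production PFOR-DELTA codecs pack the bits and organize exceptions differently; the function names here are hypothetical.

```python
def pfor_encode(block, bits):
    """Encode one block with `bits` bits per value and an exception list.

    Values are stored as offsets from the block minimum (the frame of
    reference). Offsets that do not fit in `bits` bits become exceptions.
    """
    base = min(block)
    limit = 1 << bits
    slots, exceptions = [], []
    for i, value in enumerate(block):
        offset = value - base
        if offset < limit:
            slots.append(offset)
        else:
            slots.append(0)                 # placeholder slot for an outlier
            exceptions.append((i, value))   # outlier kept uncompressed
    return base, bits, slots, exceptions

def pfor_decode(base, bits, slots, exceptions):
    """Rebuild the block, then patch in the exceptions."""
    values = [base + s for s in slots]
    for i, value in exceptions:
        values[i] = value
    return values

# The deltas from the delta coding slide, stored with 5 bits per value.
deltas = [17, 14, 31, 26, 1, 8, 16, 74, 12]
encoded_block = pfor_encode(deltas, bits=5)
```

With 5 bits per value only the delta 74 falls outside the range, so a single outlier no longer forces a wider bit width on the whole block.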

  10. Succinct index • Variation of inverted index • Index ranges of words • Prefix and range search • Smaller dictionary • Longer lists to process • Better compression • Fewer disk IOs • Disk position vs. transfer times

  11. Hierarchical inverted indexes • Incremental indexing • Build vs lookup time

  12. Never merge • Just keep subfiles and never merge into large file • Construction is O(n) • Fastest possible construction time • Slow lookup with many files O(n)

  13. Hierarchy [Figure: subfiles arranged in three levels, Level 1 to Level 3; n=3]

  14. Merging strategy • Merge into same level • Merge to level above [Figure: m=2, n=3]
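One way to picture the "merge to level above" strategy is sketched below. The transcript does not define the slide's parameters m and n, so this sketch uses a single hypothetical `fanout` parameter: whenever a level accumulates `fanout` subfiles, they are merged into one subfile on the next level. Sorted in-memory lists stand in for subfiles, and the class name is an assumption.

```python
import heapq

class HierarchicalIndex:
    """Toy model of a multi-level index built from sorted subfiles."""

    def __init__(self, fanout=3):
        self.fanout = fanout   # hypothetical: subfiles per level before merging
        self.levels = [[]]     # levels[0] is the lowest level

    def add_subfile(self, postings):
        """Add a new subfile at the lowest level, merging upward as needed."""
        self.levels[0].append(sorted(postings))
        level = 0
        while len(self.levels[level]) >= self.fanout:
            merged = list(heapq.merge(*self.levels[level]))
            self.levels[level] = []
            if level + 1 == len(self.levels):
                self.levels.append([])
            self.levels[level + 1].append(merged)
            level += 1

    def file_count(self):
        """Number of subfiles a lookup would have to consult."""
        return sum(len(files) for files in self.levels)

hier = HierarchicalIndex(fanout=3)
for docid in range(9):
    hier.add_subfile([docid])
```

After nine single-entry subfiles, two cascaded merges leave one file on the third level, so the number of files a lookup must consult rises and falls as new subfiles arrive.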

  15. Issues • Needs twice the space • Merge of upper layer takes a long time • Larger initial files lead to fewer merges • Lookup time varies over time depending on the number of files at each level

  16. Column organization • Field selection • Based on query • Phrase queries and proximity scoring need position • Simple boolean queries do not need position and frequency • Relevance scoring needs frequency • Don’t decompress what you don’t need • Don’t read from disk what you don’t need • Locality

  17. More than text search • Context info • Meta data • Values [Figure: columns beyond plain text postings: docid, frequency, position list, context; docid with date, size, owner, or URI; docid and position with person, zip code, or company]

  18. Skipping • Search engine and skipping • Used in merging (AND queries) • Semi sequential access • Direct lookup • Disk based • Skip list • Vs Btree • Variants
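How skipping helps when merging posting lists for an AND query can be sketched as an intersection with fixed-stride skips. Assumptions: docid lists are sorted and free of duplicates, skip pointers sit at every position that is a multiple of a stride of roughly the square root of the list length, and everything lives in memory. The function name is hypothetical.

```python
import math

def intersect_with_skips(a, b):
    """Intersect two sorted docid lists, jumping ahead via skip pointers."""
    skip_a = max(1, math.isqrt(len(a)))
    skip_b = max(1, math.isqrt(len(b)))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Take the skip only if its target does not overshoot b[j].
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result

evens = list(range(0, 100, 2))
threes = list(range(0, 100, 3))
common = intersect_with_skips(evens, threes)
```

Skips pay off most when one list is much shorter than the other, since long stretches of the longer list can be passed over without being decompressed or compared.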

  19. Skip list • 0 < p < 1 (e.g. p=1/2 or p=1/4) • Lookup and insertion are O(log n) • Size vs speed
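A minimal in-memory skip list shows the role of p: each inserted node is promoted one more level with probability p, so a smaller p means fewer pointers (less space) but longer walks per level. This is a generic textbook-style sketch rather than the structure used in any particular engine; the class names, `max_level`, and the `seed` parameter are assumptions. The O(log n) bounds hold in expectation.

```python
import random

class _Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level  # one next-pointer per level

class SkipList:
    def __init__(self, p=0.5, max_level=16, seed=None):
        self.p = p
        self.max_level = max_level
        self.level = 1
        self.head = _Node(None, max_level)
        self._rng = random.Random(seed)

    def _random_level(self):
        """Promote to the next level with probability p."""
        level = 1
        while level < self.max_level and self._rng.random() < self.p:
            level += 1
        return level

    def insert(self, key):
        # update[i] is the last node before the insert position on level i.
        update = [self.head] * self.max_level
        node = self.head
        for i in reversed(range(self.level)):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        new_level = self._random_level()
        self.level = max(self.level, new_level)
        new_node = _Node(key, new_level)
        for i in range(new_level):
            new_node.forward[i] = update[i].forward[i]
            update[i].forward[i] = new_node

    def first_at_least(self, key):
        """Return the smallest stored key >= key, or None if there is none."""
        node = self.head
        for i in reversed(range(self.level)):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
        nxt = node.forward[0]
        return None if nxt is None else nxt.key

skiplist = SkipList(p=0.5, seed=1)
for docid in [88, 17, 199, 62, 113, 31, 97, 187, 89]:
    skiplist.insert(docid)
```

`first_at_least` is the operation an AND query needs: given a candidate docid from one list, find the next docid at or after it in another list.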

  20. Issues • Compression • Can be skewed

  21. Skip list vs B-Tree • Skip list: main-memory structure, less space • B-Tree: disk based structure, better locality

  22. Variations • Deterministic skip list • 1 level skips • Separate skip table
