
CS533 Information Retrieval



  1. CS533 Information Retrieval Dr. Michal Cutler Lecture #14 March 10, 1999

  2. The university as seen from my window

  3. This lecture • Creating an inverted index file

  4. Building an inverted file • Some size and time assumptions (Managing Gigabytes chapter 5) • The methods

  5. Sizes

  6. Times and main memory

  7. Methods for Creating an inverted file • Memory based inversion • Sort based methods • Use external sort • Uncompressed • Compressing the temporary files • Multiway merge and compressed • In-place multiway merging

  8. Additional Methods for Creating an inverted file • Lexicon-based partitioning (FAST-INV) • Text based partitioning

  9. Compression in IR • The dictionary • The inverted file

  10. Fixed length index compression (Grossman) • Entries in an inverted list are sorted by document number (4 bytes each) • To save space, the offset (gap) between consecutive document numbers is stored instead • Compression: the two leftmost bits store the number of bytes; the offset is then stored in the remaining 6, 14, 22 or 30 bits

  11. Fixed length index compression

  12. Example The inverted list is: 1, 3, 7, 70, 250 After computing gaps: 1, 2, 4, 63, 180 Number of bytes reduced from 4*5 = 20 to 6
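
The scheme on slide 10 can be sketched as follows (a minimal illustration, not the actual implementation; the function names and the big-endian byte layout are assumptions):

```python
def encode_gap(gap):
    # Two leftmost bits of the first byte store the byte count (as nbytes - 1);
    # the remaining 6, 14, 22 or 30 bits store the gap itself.
    for nbytes, bits in ((1, 6), (2, 14), (3, 22), (4, 30)):
        if gap < (1 << bits):
            return (((nbytes - 1) << bits) | gap).to_bytes(nbytes, "big")
    raise ValueError("gap does not fit in 30 bits")

def encode_postings(doc_ids):
    out, prev = bytearray(), 0
    for d in doc_ids:
        out += encode_gap(d - prev)   # store gaps, not absolute doc numbers
        prev = d
    return bytes(out)
```

On the example list 1, 3, 7, 70, 250 the gaps 1, 2, 4, 63 each fit in 6 bits (one byte) and 180 needs two bytes, giving the 6 bytes quoted on slide 12.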

  13. Elias γ encoding • An integer x >= 1 is represented with 2⌊lg x⌋ + 1 bits • The first ⌊lg x⌋ bits are the unary representation of ⌊lg x⌋, as ⌊lg x⌋ ones • The next bit is a stop bit of 0 • At this point the highest power of 2 that does not exceed x is represented

  14. Elias γ encoding • The next ⌊lg x⌋ bits represent the remainder x - 2^⌊lg x⌋ in binary • Let x = 14. ⌊lg x⌋ = 3 • x - 2^⌊lg x⌋ = 14 - 8 = 6 • So 14 is represented by 111 0 110 • Let x = 1,000,000. ⌊lg x⌋ = 19 • 19 ones, then 0, then 1,000,000 - 2^19 in 19 bits

  15. Example The inverted list is: 1, 3, 7, 70, 250 After computing gaps: 1, 2, 4, 63, 180 Number of bits reduced from 8*4*5 = 160 to 35
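
A minimal sketch of γ encoding as a bit string (illustrative; the function name is an assumption):

```python
def elias_gamma(x):
    # γ(x): ⌊lg x⌋ ones, a 0 stop bit, then x - 2^⌊lg x⌋ in ⌊lg x⌋ bits.
    assert x >= 1
    n = x.bit_length() - 1                              # ⌊lg x⌋
    rem = format(x - (1 << n), f"0{n}b") if n else ""   # binary remainder
    return "1" * n + "0" + rem
```

This reproduces the slide's numbers: γ(14) = 1110110, and the gaps 1, 2, 4, 63, 180 take 1 + 3 + 5 + 11 + 15 = 35 bits in total.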

  16. Elias δ encoding • The unary prefix of the γ encoding is itself encoded in γ encoding • More precisely, 1 + ⌊lg x⌋ is encoded using γ encoding • Let x = 9. After γ encoding it is 1110 001. Using γ encoding on the prefix value 4, we get 11000 001

  17. Elias δ encoding • The number of bits is: 1 + 2⌊lg(1 + ⌊lg x⌋)⌋ (for the γ encoding of 1 + ⌊lg x⌋) + ⌊lg x⌋ (for the remainder of the original γ encoding) = 1 + 2⌊lg lg 2x⌋ + ⌊lg x⌋ • Better than γ for large values of x

  18. Elias δ decoding • First decode ⌊lg x⌋ + 1 from 11000, which is 2^2 + 0 = 4, so ⌊lg x⌋ = 4 - 1 = 3. Now compute 2^3 + 001 = 9 • Let x = 1,000,000. lg x = 19.93, so the γ code would start with 19 ones, followed by a 0, followed by a 19-bit remainder (39 bits). With δ, γ-encoding 20 gives 11110 0100, requiring 9 bits (9 + 19 = 28 bits)
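
The δ code can be sketched the same way (a hypothetical helper, reusing the γ encoder from above):

```python
def elias_gamma(x):
    # γ(x): ⌊lg x⌋ ones, stop bit 0, then the ⌊lg x⌋-bit remainder.
    n = x.bit_length() - 1
    return "1" * n + "0" + (format(x - (1 << n), f"0{n}b") if n else "")

def elias_delta(x):
    # δ(x): γ-encode 1 + ⌊lg x⌋, then append the ⌊lg x⌋-bit remainder.
    n = x.bit_length() - 1
    return elias_gamma(n + 1) + (format(x - (1 << n), f"0{n}b") if n else "")
```

This matches the slide: δ(9) = 11000 001 (8 bits) and δ(1,000,000) takes 9 + 19 = 28 bits.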

  19. Golomb code • Given b, x > 0 is coded in 2 parts. First q + 1 in unary, where q = ⌊(x - 1)/b⌋; then r = x - qb - 1 is coded in binary, requiring either ⌊lg b⌋ or ⌈lg b⌉ bits • Let b = 3. The remainders 0, 1, 2 are coded as 0, 10, 11
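
A sketch of the encoder, assuming the standard truncated-binary rule for choosing between ⌊lg b⌋ and ⌈lg b⌉ bits (the slide does not spell that rule out):

```python
def golomb(x, b):
    # Unary part: q + 1 where q = ⌊(x - 1)/b⌋; then r = x - q*b - 1 in
    # truncated binary: short (⌊lg b⌋-bit) codes for small r, else ⌈lg b⌉ bits.
    q, r = divmod(x - 1, b)
    k = (b - 1).bit_length()        # ⌈lg b⌉ for b > 1
    t = (1 << k) - b                # number of short codewords
    if r < t:
        rem = format(r, f"0{k - 1}b") if k > 1 else ""
    else:
        rem = format(r + t, f"0{k}b")
    return "1" * q + "0" + rem
```

With b = 3 the remainders 0, 1, 2 come out as 0, 10, 11, as on the slide.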

  20. Local Bernoulli model • Frequent words are coded with small values of b • Words that appear in 10% of the documents in the collection have b = 7 • Rare words get very large values of b
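
The 10% → b = 7 figure follows from the standard optimality condition for the Golomb parameter (a sketch; the slide does not give this formula, and the function name is illustrative):

```python
import math

def golomb_parameter(p):
    # Smallest b with (1 - p)^b + (1 - p)^(b + 1) <= 1, where p is the
    # probability that a given document contains the term; roughly 0.69 / p.
    return math.ceil(math.log(2 - p) / -math.log(1 - p))
```

For p = 0.1 this gives b = 7; for a rare word with p = 0.0001 it gives b in the thousands.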

  21. Compression of the temporary file • Compress the <t, d, f_t,d> triples • Since runs are sorted by t, we can compute t-gaps and use γ encoding • The <d, f_t,d> pairs can be compressed to 1 byte (on average) even with simple compression methods such as γ and δ

  22. Compression • Internal sorting is done concurrently with parsing the text • The dictionary is stored in memory • Initial runs become smaller, and there are more passes • Instead of 7 disk-intensive passes there are 9 processor-intensive passes

  23. Compressing the temporary file Time = B*tr + F*tp (read and index) + R(1.2k lg k)tc + I'*(tr + td) (sort runs) + ⌈log R⌉*(2I'(tr + td) + f*tc) (merge in ⌈log R⌉ passes) + (I' + I)*(td + tr) (recompress) ~ 26 hours. I' ~ 1.35*I, temp file 680 megabytes

  24. Multiway merging • The merge is now processor- rather than disk-intensive • R-way merge (a 400-way merge with buffers of 100 Kbytes) • A 540-Mbyte compressed file requires 5400 transfers and 5400 seeks
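
The single-pass R-way merge can be sketched with a heap (a minimal illustration; in the actual method each run is read through a 100-Kbyte disk buffer rather than a Python list):

```python
import heapq

def multiway_merge(runs):
    # Merge R sorted runs of (t, d, f) triples in one pass; heapq.merge keeps
    # an R-entry heap and streams the runs, so each triple costs ~log R
    # comparisons and only one buffer per run is resident at a time.
    yield from heapq.merge(*runs)
```
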

  25. Time - multiway merging Time = B*tr + F*tp (read and index) + R(1.2k lg k)tc + I'*(tr + td) (sort runs, compress and write) + f ⌈log R⌉ tc + I'*(ts/b + tr + td) (merge in one pass) + (I' + I)*(td + tr) (recompress) ~ 11 hours. I' ~ 1.35*I, temp file 540 megabytes

  26. In-place multiway merging • All blocks are padded to exactly b bytes • Each output block is written back into a vacant block of the temp file • To keep track of the output, a block table is generated • The block table is used to create a sequentially sorted file • Requires less additional memory

  27. Time - in-place multiway merging Time = B*tr + F*tp (read and index) + R(1.2k lg k)tc + I'*(tr + td) (sort runs, compress and write) + f ⌈log R⌉ tc + 2I'*(ts/b + tr + td) (merge and write into empty blocks) + 2I'*(ts/b + tr) (permute) + (I' + I)*(td + tr) (recompress) ~ 13 hours. I' ~ 1.35*I

  28. Large memory inversion • The machine has a large memory (this method needs about 1.5 Gbytes of memory instead of 4 Gbytes; with better compression, about 420 Mbytes) • Saves space in 2 ways: 1. no need for pointers, 2. uses compression

  29. Large memory inversion • The size of the inverted file is computed based on the size required for each inverted list • A pass over the collection will be needed to compute this data • An array of this size is allocated.

  30. Large memory inversion • The lexicon has a pointer to the start of each inverted list and, during inversion, to the current empty location in the list • The size of each inverted list is df_t * ⌈lg N⌉ for its d components + df_t * ⌈lg maxf_t⌉ for its f_t,d components • Better compressed sizes can be used
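
The two-pass scheme of slides 29-31 can be sketched as follows (simplified to store only document numbers; the slides also store f_t,d and compress both components, and all names here are illustrative):

```python
def invert_in_memory(docs):
    # docs: {doc_id: [term, ...]}.
    # Pass 1: df per term fixes the size of each inverted list.
    df = {}
    for terms in docs.values():
        for t in set(terms):
            df[t] = df.get(t, 0) + 1
    # Lexicon: start of each list in one shared array; cur is the next
    # empty slot, advanced as postings are appended.
    start, total = {}, 0
    for t in sorted(df):
        start[t] = total
        total += df[t]
    postings = [0] * total
    cur = dict(start)
    # Pass 2: a second scan of the collection fills the lists in doc order.
    for doc_id in sorted(docs):
        for t in set(docs[doc_id]):
            postings[cur[t]] = doc_id
            cur[t] += 1
    return start, postings
```
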

  31. 0 1 2 3 4 5 6 7 8 9 1 1 1 1 1 0 1 2 3 4 Term 1 Term 2 Term 12 Term 4 df start current 2 0 1 1 2 2 2 3 4 4 5 8 1 9 9 2 10 10 1 12 13 1 13 13 1 14 14 1 2 3 4 5 6 7 8 9 D1: 1, 4 D2: 12, 4 D3: 2, 4 ...

  32. Time Time = B*tr + F*tp (read and parse) + B*tr + F*tp + 2I'*td + I*(td + tr) (invert) ~ 12 hours

  33. Text based partitioning • Inverted files are generated for chunks of text and then merged • Each chunk uses the previous method and is processed completely in memory • This method uses very little extra disk space (34 Mbytes)

  34. Text based partitioning • The merging of each new chunk can be done in place by copying the list for every term to its correct location on disk • Takes about 16 hours

  35. Lexicon-based partitioning - FAST-INV • Developed by Fox • Inverts the file without an external sort • Main idea: dictionary-based partitioning

  36. FAST-INV • Divide input into j load files: • Each can be loaded into main memory • Each has about the same number of concepts

  37. FAST-INV • j is as small as possible • Concept numbers in load file i are greater than the concept numbers in load file k, for all k < i

  38. Doc 1: New York slows its rate of tax growth But residents pay more than the other 49 states. State and local taxes went up less than the inflation rate in New York between 1994 and 1996, although they are still the highest in the nation, a new report shows…

  39. Doc 2: Block that refund Income tax refunds will break last year's record of $114 billion. They are nice to get, but many Americans pay too much. Instead of loaning money to Uncle Sam for a year you could invest it.

  40. Creating the temp file • Each document is converted into a list of Docid/ConceptIds pairs • All DocId/ConceptId pairs are stored in a temp file in secondary memory

  41. The temp file (DocId, ConceptId) pairs: 1, 1 (growth); 1, 2 (New York); 1, 3 (pay); 1, 4 (resident); 1, 5 (state); 1, 6 (tax); 2, 7 (income); 2, 8 (loan); 2, 3 (pay); 2, 9 (refund); 2, 6 (tax). By DocId: 1: 1, 2, 3, 4, 5, 6; 2: 7, 8, 3, 9, 6

  42. Phase 1 • Initialize the concept counts to 0 • Read the temp file and increment the counts • Compute the number of docs (df) per concept • Start building Conptr and the Load-Table
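
Phase 1 can be sketched as a counting pass plus prefix sums (illustrative names; each (doc, concept) pair appears at most once in the temp file, so the count per concept is its df):

```python
def phase1(pairs, num_concepts):
    # pairs: (doc_id, concept_id) records read from the temp file.
    counts = [0] * (num_concepts + 1)
    for _, c in pairs:
        counts[c] += 1
    # Offsets (prefix sums) give each concept's start in the inverted file.
    offsets, total = [0] * (num_concepts + 1), 0
    for c in range(1, num_concepts + 1):
        offsets[c] = total
        total += counts[c]
    return counts, offsets
```

On the temp file of slide 41 this gives df = 2 for "pay" (concept 3) and "tax" (concept 6), and df = 1 for the rest.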

  43. Complete Conptr • Add load numbers using • the counts and • the amount of free memory space • Compute the offsets

  44. The Conptr file

  45. Build Load-Table • When a load is determined, a row is added to the Load-Table.
