
CS533 Information Retrieval


Presentation Transcript


  1. CS533 Information Retrieval Dr. Michal Cutler Lecture #12 March 3, 1999

  2. This lecture • Evaluation • Creating an inverted index file (sources: Managing Gigabytes, Witten, Moffat, and Bell, chapter 5; Information Retrieval, Grossman and Frieder, pages 137-142)

  3. Fallout • Fallout = (number of non-relevant documents retrieved) / (total number of non-relevant documents in the collection) • A good system should have high recall and low fallout
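
A minimal Python sketch of the two measures, using made-up counts in the spirit of the 200-document example that appears later in the lecture (the numbers and function names are illustrative, not from the slides):

def recall(relevant_retrieved, relevant_in_collection):
    # Fraction of the relevant documents that were retrieved.
    return relevant_retrieved / relevant_in_collection

def fallout(nonrelevant_retrieved, nonrelevant_in_collection):
    # Fraction of the non-relevant documents that were (wrongly) retrieved.
    return nonrelevant_retrieved / nonrelevant_in_collection

# Hypothetical query: 200 documents, 6 relevant; the system retrieves 10
# documents, 4 of them relevant (so 6 of the retrieved documents are non-relevant).
print(recall(4, 6))       # ~0.67 -> should be high
print(fallout(6, 194))    # ~0.03 -> should be low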

  4. Relevance judgment • Exhaustive? • Assume 100 queries • 750,000 documents in collection • Requires 75 million relevance judgments

  5. Relevance judgment • Sampling • With an average of 200 and a maximum of 900 relevant documents per query, the sample of the collection needed for good estimates is still too large

  6. Relevance judgment • Pooling • 33 runs of the top 200 documents each, an average of 2,398 documents judged per topic

  7. Calculating recall & precision • 200 documents in collection • 6 relevant documents for query
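
The recall/precision tables on the next two slides did not survive the transcript. A small Python sketch of the computation they tabulate, assuming we know for each rank whether the retrieved document is relevant (the ranking below is hypothetical, not the one from the slides):

def recall_precision_at_ranks(ranked_relevance, total_relevant):
    # For each rank k, recall and precision over the top k retrieved documents.
    points = []
    relevant_seen = 0
    for k, is_relevant in enumerate(ranked_relevance, start=1):
        relevant_seen += int(is_relevant)
        points.append((relevant_seen / total_relevant, relevant_seen / k))
    return points

# Hypothetical ranking for the setup above (200 documents, 6 relevant):
ranking = [True, True, False, True, False, False, True, False, False, False]
for r, p in recall_precision_at_ranks(ranking, 6):
    print(f"recall={r:.2f}  precision={p:.2f}")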

  8. Recall & precision

  9. Recall & precision

  10. Interpolated values • The interpolated precision at a recall level is the maximum precision at that and all higher recall levels
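
A sketch of this interpolation rule in Python, assuming raw (recall, precision) points like those produced by the previous sketch; the 11 standard recall levels are an assumption, not stated on the slide:

def interpolated_precision(points, levels=tuple(i / 10 for i in range(11))):
    # Interpolated precision at recall level r = maximum precision over all
    # observed points with recall >= r (0.0 if no such point exists).
    return [(r, max((p for rec, p in points if rec >= r), default=0.0))
            for r in levels]

# Hypothetical raw (recall, precision) points:
points = [(0.17, 1.00), (0.33, 1.00), (0.33, 0.67), (0.50, 0.75), (0.67, 0.57)]
for r, p in interpolated_precision(points):
    print(f"recall level {r:.1f}: interpolated precision {p:.2f}")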

  11. Precision-recall graph [figure: precision (1.0, 0.75, 0.57, 0.5, 0.46, ...) on the vertical axis against recall (0.0 to 1.0) on the horizontal axis for the example query]

  12. Precision: interpolated values [figure: the same precision-recall points with the interpolated precision values marked]

  13. Interpolation graphs for 2 queries [figure: interpolated precision-recall curves for Query 1 and Query 2]

  14. Averaging performance • Average recall/precision for a set of queries is either user oriented or system oriented • User oriented: obtain the recall/precision values for each query and then average over all queries

  15. Averaging performance • System oriented - use the following totals for all queries: • relevant documents, • relevant retrieved, • total retrieved • User oriented is commonly used

  16. User oriented recall-level average • Average at each recall level after interpolation
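
A sketch of the user-oriented recall-level average: interpolate each query separately and then average the precision values level by level. The data layout here is an assumption made for the example:

def average_over_queries(per_query_points):
    # per_query_points: one list per query of (recall_level, precision) pairs,
    # every query using the same recall levels in the same order.
    n = len(per_query_points)
    levels = [level for level, _ in per_query_points[0]]
    return [(level, sum(query[i][1] for query in per_query_points) / n)
            for i, level in enumerate(levels)]

query1 = [(0.0, 1.00), (0.5, 0.75), (1.0, 0.40)]   # hypothetical
query2 = [(0.0, 0.90), (0.5, 0.55), (1.0, 0.20)]   # hypothetical
print(average_over_queries([query1, query2]))
# [(0.0, 0.95), (0.5, 0.65), (1.0, 0.3)] up to floating-point rounding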

  17. Building an inverted file • Some size and time assumptions (Managing Gigabytes chapter 5) • The methods

  18. Sizes

  19. Times and main memory

  20. Methods for Creating an inverted file • Sort based methods • Memory based inversion • Use external sort • Uncompressed • Compressing the temporary files • Multiway merge and compressed • In-place multiway merging

  21. Additional Methods for Creating an inverted file • Lexicon-based partitioning (FAST-INV) • Text based partitioning

  22. Inverted file - creating a temporary file • Each document is parsed • Stop words are removed * • Words are stemmed * • Every keyword with its document identifier, tf or location is stored in a record • A dictionary is generated

  23. The dictionary • Binary search tree • Worst case O(dictionary-size) • Average O(lg(dictionary-size)) • Needs space for left and right pointers • A sorted list is generated by traversal

  24. The dictionary • A sorted array • Binary search to find term in array O(log(size-dictionary)) • Insertion is slow O(size-dictionary)

  25. The dictionary • A hash table • Search is fast O(1) • Does not generate a sorted dictionary

  26. The parsed collection

  27. The inverted lists • Data stored in an inverted list: • The term, df, list of DocIds • government, 3, <5, 18, 26> • List of pairs of DocId and tf • government, 3, <(5, 2), (18, 1), (26, 2)> • List of DocId and positions • government, 3, <5, 25, 56>, <18, 4>, <26, 12, 43>
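
These three formats map directly onto simple in-memory structures; a Python sketch reusing the slide's "government" example (the tuple layout is an illustration, not a prescribed storage format):

# (1) term, df, list of DocIds
entry_docids = ("government", 3, [5, 18, 26])

# (2) term, df, list of (DocId, tf) pairs
entry_with_tf = ("government", 3, [(5, 2), (18, 1), (26, 2)])

# (3) term, df, list of (DocId, [positions within that document])
entry_with_positions = ("government", 3, [(5, [25, 56]), (18, [4]), (26, [12, 43])])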

  28. After sorting

  29. The inverted file • A dictionary • Stored in memory or • Secondary storage • Each record contains a pointer to inverted list, the term, (and possibly df, a term number) • A postings file - a sequential file with inverted lists sorted by term number

  30. The dictionary and postings files [figure: dictionary entries pointing to their doc-id lists in the postings file]

  31. Memory Based Inversion • Creates a dictionary where each term points to a linked list • Each record in the list contains <d, f_t,d> and a pointer to the next node [figure: example dictionary with the terms file, search, spider, tool, web and their linked posting lists]

  32. Pseudo Code for Memory Based Inversion
Create an empty dictionary S
/* Index collection and create data structure */
for documents d = 1 to N
    Read d, index it and compute the f_t,d values
    for each term t of d
        If t is not in the dictionary, insert it into S
        Append a node <d, f_t,d> to the list for t

  33. Pseudo Code for Memory Based Inversion
/* Output the inverted file */
for all terms t in S
    start a new inverted file entry
    copy all <d, f_t,d> pairs
    compress and append
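
A runnable Python sketch of the memory-based inversion on the last two slides, assuming a toy tokenizer (lower-case, split on whitespace) and skipping stop-word removal and stemming:

from collections import Counter, defaultdict

def memory_based_inversion(documents):
    # documents: iterable of (doc_id, text) pairs, read in doc_id order.
    index = defaultdict(list)                    # the dictionary S
    for d, text in documents:
        freqs = Counter(text.lower().split())    # the f_t,d values for document d
        for t, f_td in freqs.items():
            index[t].append((d, f_td))           # append <d, f_t,d> to t's list
    return index

# Output phase: one inverted-file entry per term, in sorted term order.
docs = [(1, "web search tool"), (2, "web spider"), (3, "file search")]
index = memory_based_inversion(docs)
for t in sorted(index):
    print(t, index[t])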

  34. Time
Time = B * tr + F * tp (read and index, 5.1 hrs)
     + I * (td + tr) (write compressed inverted file, 0.6 hrs)
     ~ 5.75 hours
Each node in a list takes 10 bytes.
Main memory needed: 10 * 400,000,000 = 4 gigabytes

  35. Linked lists stored on disk • If we don’t have sufficient memory an alternative is to store the linked list records on disk • The problem is that when lists are traversed they are distributed in different locations on disk requiring a seek for each record

  36. Time - linked lists on disk
Time = B * tr + F * tp (read and index, ~5.1 hrs)
     + 10 * f * tr (store lists on disk, 0.6 hrs)
     + f * ts (read lists from disk, 6.1 weeks)
     + I * (td + tr) (write compressed inverted file, 0.6 hrs)
     = 6.64 weeks

  37. Pseudo Code for External Sort Based Inversion
Create an empty dictionary S and an empty temp file
/* Index collection and store triples */
for document d = 1 to N
    Read d, index it and compute the f_t,d values
    for each term t of d
        If t is not in the dictionary, insert it into S
        Write <t, d, f_t,d> to the temp file

  38. External Sort Based Inversion
/* Create runs of k records */
while there are more unsorted records
    read k records from the temp file
    sort them by t and d and write them back to the file
/* Merge */
merge pairs of runs until one sorted run remains

  39. External Sort Based Inversion
/* Output the inverted file */
for all terms t in S
    start a new inverted file entry
    read all triples <t, d, f_t,d> for t
    compress and append
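
A compact Python sketch of the whole sort-based method, with an in-memory list standing in for the temporary disk file and heapq.merge standing in for the pairwise merging of runs (on a real collection the runs would be written to and read back from disk):

import heapq
from collections import Counter
from itertools import groupby

def sort_based_inversion(documents, run_size=4):
    # Phase 1: parse and append <t, d, f_t,d> triples to the "temp file".
    triples = []
    for d, text in documents:
        for t, f_td in Counter(text.lower().split()).items():
            triples.append((t, d, f_td))

    # Phase 2: sort runs of run_size records, then merge the sorted runs.
    runs = [sorted(triples[i:i + run_size])
            for i in range(0, len(triples), run_size)]
    merged = heapq.merge(*runs)

    # Phase 3: group the sorted triples by term into inverted-file entries.
    return {t: [(d, f) for _, d, f in group]
            for t, group in groupby(merged, key=lambda triple: triple[0])}

docs = [(1, "web search tool"), (2, "web spider"), (3, "file search")]
for t, postings in sort_based_inversion(docs).items():
    print(t, postings)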

  40. Space for external sort • To do external sort we need a temporary file of 4Gbytes • At the peak of merge there are 2 copies of the temporary file requiring 8Gbytes

  41. Merging the runs • Since main memory is 40 Mbytes, each run is at most 40 Mbytes, so there are 100 runs • ⌈lg 100⌉ = 7 merge passes over a 4 gigabyte file

  42. Time
1. B * tr + F * tp (read and index, ~5.1 hrs)
2. + 10 * f * tr (store on disk, 0.6 hrs)
3. + 20 * f * tr + R * (1.2 k lg k) * tc (sort runs, ~4 hrs)
4. + ⌈lg R⌉ * (20 * f * tr + f * tc) (merge pairs of runs, ~8 hrs, dominated by disk transfer time)

  43. Time
5. + 10 * f * tr + I * (td + tr) (read and write compressed inverted file, ~1 hr)
Total time ~ 19.45 hours
