CS533 Information Retrieval Dr. Michal Cutler Lecture #12 March 3, 1999
This lecture • Evaluation • Creating an inverted index file (sources: Managing Gigabytes, Witten, Moffat, and Bell, chapter 5; Information Retrieval, Grossman and Frieder, pages 137-142)
Fallout • Fallout = (number of non-relevant documents retrieved) / (total number of non-relevant documents in the collection) • A good system should have high recall and low fallout
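As a minimal sketch (not from the slides), all three measures can be computed from the retrieved and relevant document sets; the names below are illustrative:

def evaluation_measures(retrieved, relevant, collection_size):
    # retrieved, relevant: sets of document ids
    # collection_size: total number of documents in the collection
    relevant_retrieved = len(retrieved & relevant)
    recall = relevant_retrieved / len(relevant)
    precision = relevant_retrieved / len(retrieved)
    fallout = (len(retrieved) - relevant_retrieved) / (collection_size - len(relevant))
    return recall, precision, fallout

# Toy query: 200-document collection, 6 relevant, 10 retrieved (4 of them relevant)
print(evaluation_measures(set(range(10)), {0, 1, 2, 3, 150, 151}, 200))
# (0.666..., 0.4, 0.0309...)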
Relevance judgment • Exhaustive? • Assume 100 queries • 750,000 documents in collection • Requires 75 million relevance judgments
Relevance judgment • Sampling • With an average of 200 and a maximum of 900 relevant documents per query, the sample of the collection needed for good results is still too large
Relevance judgment • Pooling • Pool the top 200 documents from each of 33 runs: an average of 2,398 documents per topic
Calculating recall & precision • 200 documents in collection • 6 relevant documents for query
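A small sketch of the computation, assuming (hypothetically, reconstructed from the figure labels below) that the 6 relevant documents appear at ranks 1, 2, 4, 7, 10, and 13 of the ranked output:

def recall_precision_points(relevant_ranks, num_relevant):
    # One (recall, precision) point after each relevant document is retrieved
    points = []
    for i, rank in enumerate(sorted(relevant_ranks), start=1):
        points.append((i / num_relevant, i / rank))
    return points

# Hypothetical ranking for the 200-document, 6-relevant example
print(recall_precision_points([1, 2, 4, 7, 10, 13], 6))
# precisions: 1.0, 1.0, 0.75, 0.57, 0.5, 0.46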
Interpolated values • The interpolated precision at a recall level is the maximum precision at this and all higher recall levels (see the sketch after the figures below)
[Figure: recall-precision graph for the example; as recall rises from 1/6 to 1.0, the precision values are 1.0, 1.0, 0.75, 0.57, 0.5, and 0.46]
[Figure: the same graph with the interpolated values, the maximum precision at each recall level and above]
[Figure: interpolation graphs for 2 queries, Query 1 and Query 2, on shared recall-precision axes]
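A sketch of the interpolation rule, applied to the (recall, precision) points of the example above at the 11 standard recall levels; the function name is illustrative:

def interpolated_precision(points, levels):
    # For each recall level, the maximum precision achieved at that
    # recall or any higher one (the definition above)
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]

points = [(1/6, 1.0), (2/6, 1.0), (3/6, 0.75),
          (4/6, 0.57), (5/6, 0.5), (1.0, 0.46)]
standard_levels = [i / 10 for i in range(11)]
print(interpolated_precision(points, standard_levels))
# [1.0, 1.0, 1.0, 1.0, 0.75, 0.75, 0.57, 0.5, 0.5, 0.46, 0.46]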
Averaging performance • Average recall/precision for a set of queries is either user or system oriented • User oriented • Obtain the recall/precision values for each query and • then average over all queries
Averaging performance • System oriented - use the following totals for all queries: • relevant documents, • relevant retrieved, • total retrieved • User oriented is commonly used
User oriented recall-level average • Average at each recall level after interpolation
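A minimal sketch of the user-oriented recall-level average: interpolate each query first, then average precision across queries at each standard recall level (the per-query values here are hypothetical):

def recall_level_average(per_query):
    # per_query: one list of 11 interpolated precision values per query,
    # at recall levels 0.0, 0.1, ..., 1.0
    return [sum(vals) / len(per_query) for vals in zip(*per_query)]

query1 = [1.0, 1.0, 1.0, 1.0, 0.75, 0.75, 0.57, 0.5, 0.5, 0.46, 0.46]
query2 = [1.0, 0.8, 0.8, 0.6, 0.6, 0.5, 0.4, 0.4, 0.3, 0.2, 0.2]
print(recall_level_average([query1, query2]))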
Building an inverted file • Some size and time assumptions (Managing Gigabytes chapter 5) • The methods
Methods for Creating an inverted file • Sort based methods • Memory based inversion • Use external sort • Uncompressed • Compressing the temporary files • Multiway merge and compressed • In-place multiway merging
Additional Methods for Creating an inverted file • Lexicon-based partitioning (FAST-INV) • Text based partitioning
Inverted file - creating a temporary file • Each document is parsed • Stop words are removed * • Words are stemmed * • Every keyword with its document identifier, tf or location is stored in a record • A dictionary is generated
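A toy sketch of the parsing step; the stop word list and stemmer here are stand-ins (a real system would use a full list and, e.g., Porter's stemmer):

import re

STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}

def stem(word):
    # toy stemmer: strip a trailing 's'
    return word[:-1] if word.endswith("s") else word

def parse_document(doc_id, text):
    # Yield one <term, doc_id, tf> record per distinct term of the document
    counts = {}
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if word not in STOP_WORDS:
            term = stem(word)
            counts[term] = counts.get(term, 0) + 1
    for term, tf in counts.items():
        yield term, doc_id, tf

print(list(parse_document(1, "The spiders search the web")))
# [('spider', 1, 1), ('search', 1, 1), ('web', 1, 1)]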
The dictionary • Binary search tree • Worst case O(dictionary-size) • Average O(lg(dictionary-size)) • Needs space for left and right pointers • A sorted list is generated by an in-order traversal
The dictionary • A sorted array • Binary search to find a term in the array: O(lg(dictionary-size)) • Insertion is slow: O(dictionary-size)
The dictionary • A hash table • Search is fast O(1) • Does not generate a sorted dictionary
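A minimal sketch of the hash-table option (Python dicts are hash tables): O(1) average lookups while indexing, with the sort deferred to output time:

dictionary = {}  # term -> term number

def lookup_or_insert(term):
    if term not in dictionary:
        dictionary[term] = len(dictionary)  # assign the next term number
    return dictionary[term]

for term in ["web", "spider", "search", "web"]:
    lookup_or_insert(term)

# The table itself is unordered; the O(n lg n) sort is paid once, at output time
print(sorted(dictionary))  # ['search', 'spider', 'web']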
The inverted lists • Data stored in an inverted list: • The term, df, and a list of DocIds: government, 3, <5, 18, 26> • A list of (DocId, tf) pairs: government, 3, <(5, 2), (18, 1), (26, 2)> • A list of DocIds with positions: government, 3, <(5: 25, 56), (18: 4), (26: 12, 43)>
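The three formats for the "government" list (df = 3), written as plain Python tuples; the variable names are illustrative:

docids_only     = ("government", 3, [5, 18, 26])
docid_tf_pairs  = ("government", 3, [(5, 2), (18, 1), (26, 2)])
docid_positions = ("government", 3, [(5, [25, 56]), (18, [4]), (26, [12, 43])])

# tf is implicit in the positions format: the number of positions per document
term, df, postings = docid_positions
print([(d, len(pos)) for d, pos in postings])  # [(5, 2), (18, 1), (26, 2)]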
The inverted file • A dictionary • Stored in memory or • Secondary storage • Each record contains a pointer to its inverted list and the term (and possibly df and a term number) • A postings file - a sequential file with the inverted lists, sorted by term number
The dictionary and postings files
[Figure: dictionary entries pointing into a postings file; each posting stores a Doc-id with its tf]
Memory Based Inversion • Creates a dictionary where each term points to a linked list • Each record in the list contains <d, f_td> (f_td is the frequency of term t in document d) and a pointer to the next node
[Figure: dictionary with the terms file, search, spider, tool, and web, each pointing to a linked list of <d, f_td> nodes such as <1,1>, <2,1>, <3,1>]
Pseudo Code for Memory Based Inversion
Create an empty dictionary S
/* Index the collection and create the data structure */
for document d = 1 to N
    Read d, index it, and compute the f_td's
    for each term t of d
        If t is not in the dictionary, insert it into S
        Append a node <d, f_td> to the list for t
Pseudo Code for Memory Based Inversion
/* Output the inverted file */
for all terms t in S
    start a new inverted file entry
    copy all <d, f_td> pairs from the list for t
    compress and append
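A runnable sketch of the pseudocode above, with Python lists standing in for the linked lists:

def memory_based_inversion(documents):
    index = {}  # term -> list of (d, f_td) nodes
    for d, terms in documents:
        freqs = {}
        for t in terms:                      # index d and compute the f_td's
            freqs[t] = freqs.get(t, 0) + 1
        for t, f_td in freqs.items():
            index.setdefault(t, []).append((d, f_td))
    # Output phase: a real system would compress each list and append
    # it to the inverted file here
    return {t: index[t] for t in sorted(index)}

docs = [(1, ["web", "search"]), (2, ["web", "spider", "web"])]
print(memory_based_inversion(docs))
# {'search': [(1, 1)], 'spider': [(2, 1)], 'web': [(1, 1), (2, 2)]}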
Time
Time = B * tr + F * tp      (read and index, 5.1 hrs)
     + I * (td + tr)        (write compressed inverted file, .6 hrs)
     ~ 5.75 hours
Each node in a list takes 10 bytes, so the main memory needed is
10 * 400,000,000 = 4 gigabytes
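A hedged sketch of the cost model; the reading of the symbols (B bytes of text read at tr seconds per byte, F term occurrences parsed at tp each, I bytes of compressed index written at td + tr per byte) follows the slide, but the numeric values below are placeholders, not the measured figures:

def inversion_time(B, tr, F, tp, I, td):
    read_and_index = B * tr + F * tp
    write_inverted = I * (td + tr)
    return read_and_index + write_inverted  # seconds

# Placeholder parameters only; plug in the real machine constants
seconds = inversion_time(B=2e9, tr=1e-6, F=1e9, tp=1e-8, I=4e8, td=1e-6)
print(seconds / 3600, "hours")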
Linked lists stored on disk • If we don't have sufficient memory, an alternative is to store the linked-list records on disk • The problem is that a traversed list is scattered over different locations on disk, requiring a seek for each record
Time - linked lists on disk
Time = B * tr + F * tp      (read and index, ~5.1 hrs)
     + 10 * f * tr          (store on disk, .6 hrs)
     + f * ts               (read lists from disk, 6.1 weeks)
     + I * (td + tr)        (write compressed inverted file, .6 hrs)
     = 6.64 weeks
Pseudo Code for External Sort Based Inversion
Create an empty dictionary S and an empty temp file
/* Index the collection and store the triples */
for document d = 1 to N
    Read d, index it, and compute the f_td's
    for each term t of d
        If t is not in the dictionary, insert it into S
        Write <t, d, f_td> to the temp file
External Sort Based Inversion
/* Create runs of k records */
while there are more unsorted records
    read k records from the temp file
    sort them by t and d and write them back to the file
/* Merge */
merge pairs of runs until one sorted run remains
External Sort Based Inversion
/* Output the inverted file */
for all terms t in S
    start a new inverted file entry
    read all triples <t, d, f_td> for t
    compress and append
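A runnable sketch of the whole method, using tiny in-memory runs and heapq.merge for the merge phase; a real system would use memory-sized runs and compressed temporary files:

import heapq, pickle, tempfile

def external_sort_inversion(documents, run_size=4):
    runs, buffer = [], []

    def flush():                       # sort a run by (t, d) and write it out
        buffer.sort()
        run = tempfile.TemporaryFile()
        pickle.dump(list(buffer), run)
        runs.append(run)
        buffer.clear()

    for d, terms in documents:         # index and store the <t, d, f_td> triples
        freqs = {}
        for t in terms:
            freqs[t] = freqs.get(t, 0) + 1
        for t, f_td in freqs.items():
            buffer.append((t, d, f_td))
            if len(buffer) == run_size:
                flush()
    if buffer:
        flush()

    def read_run(run):
        run.seek(0)
        yield from pickle.load(run)

    index = {}                         # merge the sorted runs term by term
    for t, d, f_td in heapq.merge(*(read_run(r) for r in runs)):
        index.setdefault(t, []).append((d, f_td))
    return index

docs = [(1, ["web", "search"]), (2, ["web", "spider", "web"])]
print(external_sort_inversion(docs))
# {'search': [(1, 1)], 'spider': [(2, 1)], 'web': [(1, 1), (2, 2)]}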
Space for external sort • To do the external sort we need a temporary file of 4 Gbytes • At the peak of the merge there are 2 copies of the temporary file, requiring 8 Gbytes
Merging the runs • Since main memory is 40 Mbytes, each run is at most 40 Mbytes, so there are 100 runs • ceil(lg 100) = 7 merge passes over a 4-gigabyte file
Time
1. B * tr + F * tp                        (read and index, ~5.1 hrs)
2. + 10 * f * tr                          (store on disk, .6 hrs)
3. + 20 * f * tr + R * (1.2 k lg k) * tc  (sort runs, ~4 hrs)
4. + lg R * (20 * f * tr + f * tc)        (merge pairs of runs, ~8 hrs, dominated by disk transfer time)
Time
5. + 10 * f * tr + I * (td + tr)          (read and write compressed inverted file, ~1 hr)
Total time ~ 19.45 hours