IR Homework #1

IR Homework #1 By J. H. Wang Mar. 5, 2008

Programming Exercise #1: Indexing
• Goal: to build an index for a text collection using inverted files
• Input: a set of documents concatenated into a single large file
  • (to be described later)
• Output: inverted index files
  • (exact format to be described later)

Input: the Test Collection
• Test collections held at University of Glasgow: http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/
  • LISA, NPL, CACM, CISI, Cranfield, Time, Medline, ADI, each in a different format
  • Ex: The Time Collection: 423 documents (1.5MB)
• You have to do some preprocessing for different test collections
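As a rough illustration of the preprocessing step, the sketch below splits one concatenated file into documents and records where each document body starts in the file. The `*TEXT` marker line and the function name `split_documents` are assumptions made for this example; check the actual delimiter of whichever collection you use.

```python
from typing import Iterator, Tuple


def split_documents(text: str, marker: str = "*TEXT") -> Iterator[Tuple[int, int, str]]:
    """Yield (doc_number, start_offset, body) for each document in `text`.

    Assumption: every document is introduced by a line starting with
    `marker`; the body runs up to the next marker line or the end of text.
    Offsets are character positions in the whole file, not in the document.
    """
    doc_no = 0
    start = None  # file offset where the current document body begins
    pos = 0       # file offset of the line being examined
    for line in text.splitlines(keepends=True):
        if line.startswith(marker):
            if start is not None:
                yield doc_no, start, text[start:pos]
            doc_no += 1
            start = pos + len(line)
        pos += len(line)
    if start is not None:
        yield doc_no, start, text[start:pos]


sample = "*TEXT 1\nhello world\n*TEXT 2\nfoo\n"
for doc_no, start, body in split_documents(sample):
    print(doc_no, start, repr(body))
```

When reading the real file, opening it with `open(path, newline="")` keeps line endings untranslated, so the offsets match the file contents.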

Output: Inverted Index
• Two files
  • Vocabulary file: a sorted list of words (each word on a separate line)
  • Occurrences file: for each word, a list of occurrences in the original text
    • [word#] [term freq.] [(doc#, char#) pairs]
    • 1 7 (1, 12) (1, 28) (3, 31) (8, 39) (8, 65) (10, 16) (11, 91)
    • 2 2 (3, 44) (8, 72)
    • …
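The two output files can be produced along the following lines. This is a minimal sketch, not the required implementation: it assumes that word# is the 1-based line number of the word in the vocabulary file, and the names `build_index` and `write_index` are made up for this example.

```python
import io
from collections import defaultdict


def build_index(postings):
    """Group (word, doc_no, char_no) triples into word -> list of (doc#, char#)."""
    index = defaultdict(list)
    for word, doc_no, char_no in postings:
        index[word].append((doc_no, char_no))
    return index


def write_index(index, vocab_out, occ_out):
    """Write the sorted vocabulary and the matching occurrence lines.

    Assumption: word# is the 1-based position of the word in the
    sorted vocabulary file.
    """
    for word_no, word in enumerate(sorted(index), start=1):
        occurrences = sorted(index[word])
        vocab_out.write(word + "\n")
        pairs = " ".join(f"({d}, {c})" for d, c in occurrences)
        occ_out.write(f"{word_no} {len(occurrences)} {pairs}\n")


# Usage with in-memory files; real code would pass files opened for writing.
index = build_index([("b", 3, 44), ("a", 1, 12), ("a", 1, 28), ("b", 8, 72)])
vocab_file, occ_file = io.StringIO(), io.StringIO()
write_index(index, vocab_file, occ_file)
print(vocab_file.getvalue(), occ_file.getvalue(), sep="")
```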

Implementation Issues
• Note: char# means the character position in the FILE (not the document)
  • This makes later steps after indexing easier to implement
• Document preprocessing should be handled with care
  • Digits, hyphens, punctuation marks, …
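One possible way to get file-level character positions is to add the offset of the document within the file to each token's position within the document. The token pattern below (letters and digits, with internal hyphens kept), the lowercasing, and the 0-based positions are all assumptions for this sketch; the assignment leaves these choices to you.

```python
import re

# Assumed token definition: runs of letters/digits, optionally joined by hyphens.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize(body, doc_start):
    """Yield (word, char_no) where char_no is the position in the FILE.

    `doc_start` is the offset of `body` within the concatenated file.
    """
    for match in TOKEN_RE.finditer(body):
        yield match.group().lower(), doc_start + match.start()


# A document body that starts at file offset 100.
for word, char_no in tokenize("Mid-1963: U.S. aid!", 100):
    print(word, char_no)
```

With this pattern "Mid-1963" stays one token while "U.S." falls apart into "u" and "s", which is the kind of case the preprocessing note asks you to think about.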

Implementation Issues
• You can use a separate data structure (e.g. a trie, which can be more efficient) in your program to store the vocabulary and occurrences and speed up the indexing process, but the output must be in the designated format
• Optional functionality
  • Stopword removal
  • Stemming
  • It should be possible to turn these off with a parameter
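A trie for the vocabulary, together with switchable stopword removal and stemming, might look like the sketch below. The class and function names are hypothetical, and the stemmer is passed in as a plain function so that any stemming algorithm can be plugged in or left out.

```python
from typing import Callable, Dict, Iterator, List, Optional, Set, Tuple


class TrieNode:
    __slots__ = ("children", "occurrences")

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.occurrences: List[Tuple[int, int]] = []


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, word: str, doc_no: int, char_no: int) -> None:
        """Record one occurrence of `word` at (doc#, char#)."""
        node = self.root
        for ch in word:
            nxt = node.children.get(ch)
            if nxt is None:
                nxt = node.children[ch] = TrieNode()
            node = nxt
        node.occurrences.append((doc_no, char_no))

    def items(self) -> Iterator[Tuple[str, List[Tuple[int, int]]]]:
        """Yield (word, occurrences) in sorted word order (pre-order walk)."""
        stack = [("", self.root)]
        while stack:
            prefix, node = stack.pop()
            if node.occurrences:
                yield prefix, node.occurrences
            # Push children in reverse so the smallest character is popped first.
            for ch in sorted(node.children, reverse=True):
                stack.append((prefix + ch, node.children[ch]))


def normalize(word: str,
              stopwords: Optional[Set[str]] = None,
              stemmer: Optional[Callable[[str], str]] = None) -> Optional[str]:
    """Return the index term for `word`, or None if it is a stopword.

    Passing None for `stopwords` or `stemmer` turns that feature off.
    """
    if stopwords is not None and word in stopwords:
        return None
    return stemmer(word) if stemmer is not None else word


trie = Trie()
for word, doc_no, char_no in [("time", 1, 5), ("tim", 2, 9), ("the", 1, 0)]:
    term = normalize(word, stopwords={"the"})
    if term is not None:
        trie.add(term, doc_no, char_no)
print(list(trie.items()))
```

Because `items()` walks the trie in sorted order, the vocabulary file can be written directly from it without a separate sort.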

Submission
• Your submission *should* include
  • The source code (and optionally your executable file)
  • A one-page description that includes the following
    • Major features of your work (ex: high efficiency, low storage, able to deal with multiple formats, …)
    • Major difficulties encountered
    • Special requirements for the execution environment (ex: Java Runtime Environment)
    • For team work, the name and responsibilities of each member should be clearly identified
• Due: two weeks (Mar. 19, 2008)

Submission Instructions
• Programs or homework in electronic files must be submitted directly to the TA by e-mail as follows
  • Before submission: pack everything into one single compressed file (including source code and documentation), for example, 9659xxxx-HW1.ZIP
  • Remember to specify your name and student ID in the files and documentation
  • E-mail of TA: alowblow@hotmail.com
  • You will get a confirmation e-mail from the TA after your submission is received
  • If you cannot successfully e-mail your work, please contact the TA or the instructor

Evaluation
• Minimum requirement: the Time Collection as provided on the Web page will be used as input, and the inverted index generated by your program will be checked for correctness
• Optional features such as stemming and stopword removal will count as bonus
• You may be required to give a demo if the TA is unable to run the submitted program

Questions?
