IR Homework #1

IR Homework #1. By J. H. Wang Mar. 5, 2008. Programming Exercise #1: Indexing. Goal: to build an index for a text collection using inverted files Input : a set of documents concatenated into a single large file (to be described later) Output : inverted index files


Presentation Transcript


  1. IR Homework #1 By J. H. Wang Mar. 5, 2008

  2. Programming Exercise #1: Indexing
     • Goal: to build an index for a text collection using inverted files
     • Input: a set of documents concatenated into a single large file (to be described later)
     • Output: inverted index files (exact format to be described later)

  3. Input: the Test Collection
     • Test collections held at the University of Glasgow: http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/
     • LISA, NPL, CACM, CISI, Cranfield, Time, Medline, ADI, each in a different format
     • Example: the Time Collection, 423 documents (1.5 MB)
     • You have to do some preprocessing for the different test collections

  4. Output: Inverted Index
     • Two files
       • Vocabulary file: a sorted list of words (one word per line)
       • Occurrences file: for each word, a list of its occurrences in the original text
     • Line format of the occurrences file: [word#] [term freq.] [(doc#, char#) pairs]
     • Example:
         1 7 (1, 12) (1, 28) (3, 31) (8, 39) (8, 65) (10, 16) (11, 91)
         2 2 (3, 44) (8, 72)
         …
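A minimal Python sketch of how the two output files might be produced from an in-memory index. It assumes (the slides do not say so explicitly) that word# is the 1-based position of the word in the sorted vocabulary and that the pairs are listed in ascending (doc#, char#) order. The names `format_index` and `write_index` and the file paths are hypothetical.

```python
def format_index(postings):
    """postings maps word -> list of (doc#, char#) pairs.

    Returns (vocabulary_lines, occurrence_lines).
    Assumption: word# is the 1-based rank of the word in the sorted vocabulary.
    """
    vocabulary = sorted(postings)
    occurrence_lines = []
    for word_no, word in enumerate(vocabulary, start=1):
        pairs = sorted(postings[word])
        pair_text = " ".join(f"({doc}, {pos})" for doc, pos in pairs)
        # Term frequency is the total number of occurrences of the word.
        occurrence_lines.append(f"{word_no} {len(pairs)} {pair_text}")
    return vocabulary, occurrence_lines


def write_index(postings, vocab_path, occ_path):
    """Write the vocabulary file and the occurrences file (paths are hypothetical)."""
    vocabulary, occurrence_lines = format_index(postings)
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(vocabulary) + "\n")
    with open(occ_path, "w", encoding="utf-8") as f:
        f.write("\n".join(occurrence_lines) + "\n")
```

For instance, `format_index({"b": [(3, 44), (8, 72)], "a": [(1, 12), (1, 28)]})` yields the vocabulary `a`, `b` and the occurrence lines `1 2 (1, 12) (1, 28)` and `2 2 (3, 44) (8, 72)`.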

  5. Implementation Issues
     • Note: char# is the character position in the FILE (not in the document)
       • This makes the later steps after indexing easier to implement
     • Document preprocessing should be handled with care
       • Digits, hyphens, punctuation marks, …
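One way to obtain file-level character positions is to add each document's starting offset in the file to the offset of the token inside the document. The sketch below is only an illustration under assumptions the slides leave open: tokens are runs of ASCII letters, everything else (digits, hyphens, punctuation) acts as a separator, words are lowercased, and positions are 0-based. `tokenize_with_offsets` and `base_offset` are hypothetical names.

```python
import re

# Assumption: a token is a run of ASCII letters; digits, hyphens and
# punctuation marks are treated as separators.
WORD_RE = re.compile(r"[A-Za-z]+")


def tokenize_with_offsets(text, base_offset=0):
    """Yield (word, char#) pairs for one document.

    base_offset is the position in the FILE at which this document's text
    starts, so the yielded char# is relative to the whole file.
    """
    for match in WORD_RE.finditer(text):
        yield match.group().lower(), base_offset + match.start()
```

With this rule a hyphenated word is split into two tokens and numbers are dropped entirely, which is exactly the kind of decision the preprocessing step has to make deliberately.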

  6. Implementation Issues
     • You can use a separate data structure (e.g. a trie, which is more efficient) to store the vocabulary and occurrences in your program and speed up the indexing process, but the output must be in the designated format
     • Optional functionality
       • Stopword removal
       • Stemming
     • It should be possible to turn these off with a parameter
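As a rough illustration of the trie idea, the sketch below stores the (doc#, char#) pairs at the node where each word ends and walks the trie in sorted order, so the vocabulary comes out already sorted. It is one possible design, not the required one. The class and function names, the `remove_stopwords` parameter, and the tiny stopword list are all hypothetical; stemming is not shown.

```python
class TrieNode:
    __slots__ = ("children", "occurrences")

    def __init__(self):
        self.children = {}     # next character -> TrieNode
        self.occurrences = []  # (doc#, char#) pairs of the word ending here


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word, doc_no, char_no):
        """Record one occurrence of word at (doc#, char#)."""
        node = self.root
        for ch in word:
            child = node.children.get(ch)
            if child is None:
                child = node.children[ch] = TrieNode()
            node = child
        node.occurrences.append((doc_no, char_no))

    def items(self):
        """Yield (word, occurrences) in sorted word order (pre-order walk)."""
        stack = [("", self.root)]
        while stack:
            prefix, node = stack.pop()
            if node.occurrences:
                yield prefix, node.occurrences
            # Push children in reverse order so the smallest is popped first.
            for ch in sorted(node.children, reverse=True):
                stack.append((prefix + ch, node.children[ch]))


# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = frozenset({"a", "an", "and", "of", "the", "to"})


def index_tokens(trie, doc_no, tokens, remove_stopwords=False):
    """Add (word, char#) tokens to the trie; stopword removal is off by default."""
    for word, char_no in tokens:
        if remove_stopwords and word in STOPWORDS:
            continue
        trie.add(word, doc_no, char_no)
```

Because `items()` already returns words in sorted order, its output can be written to the vocabulary and occurrences files directly, without a separate sorting pass.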

  7. Submission
     • Your submission *should* include
       • The source code (and optionally your executable file)
       • A one-page description that includes the following
         • Major features of your work (e.g. high efficiency, low storage, ability to deal with multiple formats, …)
         • Major difficulties encountered
         • Special requirements for the execution environment (e.g. Java Runtime Environment)
         • For team work, the name of each member and the parts each member was responsible for, clearly identified
     • Due: in two weeks (Mar. 19, 2008)

  8. Submission Instructions
     • Programs or homework in electronic files must be submitted directly to the TA by e-mail as follows
       • Before submission: pack everything into one single compressed file (including source code and documentation), for example, 9659xxxx-HW1.ZIP
       • Remember to specify your name and student ID in the files and documentation
       • E-mail of TA: alowblow@hotmail.com
     • You will get a confirmation e-mail from the TA after your submission is received
     • If you cannot successfully e-mail your work, please contact the TA or the instructor

  9. Evaluation
     • Minimum requirement: the Time Collection as provided on the Web page will be used as input, and the inverted index generated by your program will be checked for correctness
     • Optional features such as stemming and stopword removal will be counted as a bonus
     • You may be required to give a demo if the TA is unable to run the submitted program

  10. Questions?
