
2ID10: Information Retrieval Lecture 10: Assignments




Presentation Transcript


  1. 2ID10: Information Retrieval, Lecture 10: Assignments
  Lora Aroyo, 30th May 2006

  2. Course Topics
  • Basic Information Retrieval Terminology
  • Query Languages & Operations
  • Precision & Recall in Search Engines
  • Relevance Feedback
  • Language modeling for IR
  • Search engines
  • Reference Structures in IR
  • Multimedia Information Retrieval
  • Publishing of enriched, structured content

  3. Assignment Submission
  • Final Assignment – 7th July 2006
  • Submit to: IR@listserver.tue.nl
  • Register at: http://listserver.tue.nl/mailman/listinfo/IR
  • Subject: [Group #] [Assignment #]
  • Files: group#.ass#.title.extension
  • URL: http://...../group#/assignment#
  • Provide a URL with the running application & documentation
  • In all files, include your Group# and Assignment#, as well as the names of all group members

  4. Assignment Submission
  Each assignment should include:
  • a detailed report on the modeling, algorithmic and implementation aspects (if applicable) of the solution
  • a clear problem description
  • a clear solution description and justification
  • the literature and relevant material used to solve the problem
  • the significance and benefit of your solution
  • a URL with the report, running implementation and source code

  5. Assignment 1: How to improve an existing NL parser
  • Related Lecture: # 9 - C-content
  • The behaviour of the NL parser needs to be tuned, based upon the results of a pre-defined set of queries on a pre-defined document collection.
  • Identify problematic queries or query types leading to low precision/recall results.

  6. Assignment 1: Improving an existing NL parser
  What do we have?
  • A natural language parser, which performs:
  • spelling checking
  • term identification
  • search term checking
  • syntactic analysis
  • semantic expansion
  • query base generation
  What is the problem?
  • The behaviour needs to be tuned, based upon the results of a pre-defined set of queries on a pre-defined document collection.

  7. Assignment 1: Improving an existing NL parser
  Assignment
  • Identify problematic queries or query types (requires at least one Dutch native speaker in the team)
  • Types of problems:
  • low precision
  • low recall
  • misunderstood queries
  • E.g., does ‘information on cars and traffic jams’ mean that only documents containing both ‘cars’ and ‘traffic jams’ should be found, or also documents containing either ‘cars’ or ‘traffic jams’?
  • …

  8. Assignment 1: Improving an existing NL parser
  Proposed steps
  • Define a set of natural language queries, including expected query results:
  • define the Boolean query to be executed
  • list the documents to be found by the query
  • Run the queries through the on-line website
  • Identify mismatch areas

  9. Assignment 1: Improving an existing NL parser
  Proposed steps
  • Define a set of natural language queries, including expected query results
  • Run the queries through the on-line website
  • Perform the queries on the information portal
  • Identify mismatch areas

  10. Assignment 1: Improving an existing NL parser
  Proposed steps
  • Define a set of natural language queries, including expected query results
  • Run the queries through the on-line website
  • Identify mismatch areas:
  • keep track of precision and recall percentages
  • if these percentages are low, what is wrong with the Boolean query?
  • how could this be improved?
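The precision/recall bookkeeping in the steps above can be sketched in Python; the document IDs and relevance judgments below are invented for illustration:

```python
def precision_recall(retrieved, relevant):
    """Compare one query's result set against the expected (judged) documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical Boolean query output vs. the documents we expected to find:
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```

If either percentage is low, the mismatch between the two sets points at the part of the Boolean query that needs tuning.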

  11. Assignment 2: Identify context/metadata relations between documents in the legal domain
  • Related Lecture: # 9 - C-content
  What do we have?
  • Document collections which:
  • contain metadata
  • have contextual relations between documents
  What is the problem?
  • How to construct relations between documents based on existing metadata?

  12. Assignment 2: Identify context/metadata relations between documents in the legal domain
  Assignment
  • Design and implement an algorithm to identify the possible relations between documents, including a link certainty indicator (requires at least one Dutch native speaker in the team)
  Basic information
  • An ideal document collection relevant to one starting document
  • A collection mixing ideal and ‘noise’ documents
  • A description of the metadata that can be used to identify the relations

  13. Assignment 2: Identify context/metadata relations between documents in the legal domain
  Proposed steps
  • Study the description and the document collections delivered
  • Design the algorithm to solve the problem
  • Implement the algorithm in a simple programming module
  • Analyze possible problems due to ambiguity in the context/metadata
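A minimal sketch of such an algorithm, assuming the metadata is available as sets of attribute:value tags. The tag names and the Jaccard-style certainty measure are illustrative choices, not the assignment's prescribed method:

```python
def link_certainty(meta_a, meta_b):
    """Jaccard overlap of metadata values as a crude link-certainty indicator."""
    a, b = set(meta_a), set(meta_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def related_documents(start_meta, collection, threshold=0.3):
    """Score every candidate against the starting document's metadata and
    keep those above the certainty threshold, best first."""
    scored = [(doc_id, link_certainty(start_meta, meta))
              for doc_id, meta in collection.items()]
    return sorted((d for d in scored if d[1] >= threshold),
                  key=lambda d: -d[1])
```

Ambiguous metadata values (the last step) show up here as inflated overlaps between otherwise unrelated documents.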

  14. Assignment 4: User Alert Service
  • Related Lecture: # 9 - C-content
  What do we have?
  • Document collections which:
  • contain metadata
  • have contextual relations between documents
  • The structure of the context/metadata relation (assignment 2)
  What is the problem?
  • How to construct a user alert service based on the user profile information and the resulting document relations?

  15. Assignment 4: User Alert Service
  Assignment
  • Design and implement the user profile structure and the user alert service
  Proposed steps
  • Design the user profile structure
  • Design the algorithm for the user alert service
  • Implement the algorithm in a simple programming module
  • Analyze the complexity of the user profile structure and its consequences for the user alert service
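One way the alert decision could look, assuming a profile of interest terms plus followed documents, and the document relations produced in assignment 2. All field names here are hypothetical, not part of the assignment materials:

```python
def should_alert(profile, doc, relations):
    """Alert when a new document matches the user's interest terms directly,
    or is related (via the assignment-2 relation) to a followed document."""
    # direct match on the new document's metadata terms
    if profile["interests"] & doc["metadata"]:
        return True
    # indirect match through the document-relation structure
    return any(doc["id"] in relations.get(followed, ())
               for followed in profile["followed"])
```

A richer profile structure (weighted interests, nested topics) makes this check, and hence the service, correspondingly more expensive, which is exactly the trade-off the last step asks you to analyze.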

  16. Assignment 3: Using Relevance Feedback during Information Retrieval
  • Related lecture: # 3 by Theo van der Weide
  • Consider the cooperators of the Informatics Department.
  • Use user feedback for query modification with the Rocchio technique:
  • Download the Perlfect Search engine sources (http://www.perlfect.com/freescripts/search/).
  • After a query, offer the searcher the top-10 ranking documents and ask for feedback.
  • Use this feedback to construct the modified query using the Rocchio technique.
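The Rocchio modification itself can be sketched as follows. Queries and documents are simple term-weight dictionaries, and the α, β, γ defaults are common textbook values, not taken from the Perlfect sources:

```python
from collections import defaultdict

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q' = alpha*q + beta*centroid(relevant) - gamma*centroid(nonrelevant).
    Vectors are dicts term -> weight; negative weights are dropped."""
    new = defaultdict(float)
    for t, w in query.items():
        new[t] += alpha * w
    for doc in relevant:                      # pull towards judged-relevant docs
        for t, w in doc.items():
            new[t] += beta * w / len(relevant)
    for doc in nonrelevant:                   # push away from judged-nonrelevant docs
        for t, w in doc.items():
            new[t] -= gamma * w / len(nonrelevant)
    return {t: w for t, w in new.items() if w > 0}
```

The modified query both reweights the original terms and expands the query with terms from the documents the searcher marked relevant.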

  17. Assignment 3: Using Relevance Feedback during Information Retrieval

  18. Assignment 5: Language identification is an important basic component for multilingual search engines
  • Related lecture: # 8 by Wessel Kraaij (TNO)
  • Develop a language identification module using generative character models for the 21 EU languages.
  • Develop and test the language identification module on the EU constitution corpus.
  • Test the language identification module on some non-EU languages (e.g. taken from the KDE corpus on the same site) and adjust the classifier in such a way that it can recognize unknown languages.
  • Optional: produce a similarity matrix between the languages using the identification module. Use the similarity matrix to build a tree using hierarchical agglomerative techniques (related languages will be grouped).

  19. Assignment 5
  • Download the EU constitution in 21 languages from http://logos.uio.no/opus/
  • Split each language corpus into a training, test and evaluation set.
  • Train classifiers for each language using character trigram models smoothed with character bigram models.
  • Optimize the smoothing parameter on the test data.
  • Compute the accuracy of the classifier by taking 100 sentences from the evaluation set of each language.
  • After adjusting the classifier, test its accuracy on the same evaluation set plus 100 sentences from five non-European languages.
  • Present results per language.
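A toy version of such a classifier: character trigram models linearly interpolated with a crude add-one bigram floor. The padding scheme, interpolation weight and floor are illustrative simplifications of the smoothing the assignment asks you to optimize:

```python
import math
from collections import Counter

def char_ngrams(text, n):
    text = f"  {text} "          # pad so word boundaries are modelled
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class TrigramModel:
    """Character trigram model with a bigram-based floor so that unseen
    trigrams never receive probability zero."""
    def __init__(self, training_text, lam=0.8):
        self.lam = lam
        self.tri = Counter(char_ngrams(training_text, 3))
        self.bi = Counter(char_ngrams(training_text, 2))
        self.bi_total = sum(self.bi.values())

    def log_prob(self, text):
        lp = 0.0
        for tri in char_ngrams(text, 3):
            hist = tri[:2]
            p_tri = self.tri[tri] / self.bi[hist] if self.bi[hist] else 0.0
            p_floor = (self.bi[tri[1:]] + 1) / (self.bi_total + 1)
            lp += math.log(self.lam * p_tri + (1 - self.lam) * p_floor)
        return lp

def identify(sentence, models):
    """Pick the language whose model scores the sentence highest."""
    return max(models, key=lambda lang: models[lang].log_prob(sentence))
```

Recognizing *unknown* languages would additionally require a threshold on the winning log-probability rather than always returning the argmax.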

  20. Assignment 5: Language identification is an important basic component for multilingual search engines
  • Optional: produce a similarity matrix, using the previous test data, by averaging probabilities across test sentences and averaging P(A|B) and P(B|A) (the similarity matrix must be symmetric). Use CLUTO to cluster the languages, e.g. by applying hierarchical agglomerative clustering.

  21. Assignment 6: A general assumption in CLIR research is that parallel corpora can be exploited to improve monolingual search
  • Related lecture: # 8 by Wessel Kraaij (TNO)
  • Test the assumption by training statistical thesauri and performing experiments on public IR test collections.
  • Investigate whether a combination of statistical thesauri trained on different parallel texts can be used to improve retrieval performance.

  22. Assignment 6: A general assumption in CLIR research is that parallel corpora can be exploited to improve monolingual search
  • Choose three European languages other than English: X, Y, Z. Train statistical thesauri English=>{X|Y|Z}=>English on several parallel corpora available through the OPUS website, by cascading two translation models trained with the EGYPT toolkit (choose IBM model 1).

  23. Assignment 6: A general assumption in CLIR research is that parallel corpora can be exploited to improve monolingual search
  • Construct various combined thesauri, e.g. by interpolating three individual thesauri trained on a single corpus, or by combining thesauri trained on different corpora. To be effective, such thesauri are usually interpolated with the identity matrix (Xu et al. 2002, TREC 2001).
  • Implement a CLIR system using LEMUR, with XLingRetMethod (http://www.lemurproject.org/doxygen/lemur-3.1/html/classXLingRetMethod.html) as the retrieval method.
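The identity-matrix interpolation can be illustrated on a dictionary-of-dictionaries thesaurus; the term names and the α value below are made up:

```python
def interpolate_with_identity(thesaurus, alpha=0.7):
    """t'(s, t) = alpha * [s == t] + (1 - alpha) * t(s, t): keep part of the
    probability mass on the source term itself (the identity matrix)."""
    out = {}
    for source, targets in thesaurus.items():
        row = {t: (1 - alpha) * p for t, p in targets.items()}
        row[source] = row.get(source, 0.0) + alpha   # identity component
        out[source] = row
    return out
```

Without the identity component, a noisy thesaurus can translate a query term entirely away from itself; the interpolation keeps the original term dominant while still adding related terms.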

  24. Assignment 6: A general assumption in CLIR research is that parallel corpora can be exploited to improve monolingual search
  • Perform monolingual IR experiments (measure mean average precision) on the CACM and CRANFIELD collections. Experiments must include:
  • a baseline run (standard generative language model)
  • individual thesauri (with and without interpolation with the identity matrix)
  • combined thesauri (with and without interpolation with the identity matrix)
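Mean average precision, the evaluation measure named above, can be computed as follows; the rankings in the test are invented:

```python
def average_precision(ranked, relevant):
    """AP for one query: mean precision@k over the ranks k that hold a
    relevant document (missed relevant docs contribute zero)."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_results, relevant_docs) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Comparing MAP between the baseline run and the thesaurus runs is how the assumption in the slide title gets tested.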

  25. Assignment 7: Self-learning Dutch meta-search
  • Related lecture: # 10 by Nils Rooijmans (ilse media)
  • Build a meta-search engine
  • Learn which algorithm works best for which type of queries:
  • single-word vs. multi-word queries
  • general vs. specific queries (roughly based on the number of results in the result set)
  • Improve ranking

  26. Assignment 7: Self-learning Dutch meta-search
  • Build a website based on APIs:
  • Ilse API (mail searchengine@ilse.net for documentation)
  • Google API: http://www.google.com/apis/index.html
  • Yahoo API: http://developer.yahoo.net/web/V1/webSearch.html
  • Merge result sets (handle duplicates, hide the originating API(s))
  • Build a feedback mechanism:
  • score the different APIs for different query types
  • use the scores when merging results
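A sketch of the merge-with-feedback-scores step. The reciprocal-rank weighting is one plausible choice, not prescribed by the assignment, and the API names and scores are placeholders:

```python
def merge_results(result_lists, api_scores):
    """Merge ranked URL lists from several APIs: de-duplicate by URL and
    weight each rank position by the API's learned score for this query type."""
    merged = {}
    for api, urls in result_lists.items():
        weight = api_scores.get(api, 1.0)
        for rank, url in enumerate(urls, start=1):
            # reciprocal-rank contribution; duplicates accumulate score
            merged[url] = merged.get(url, 0.0) + weight / rank
    return sorted(merged, key=lambda u: -merged[u])
```

Because duplicates accumulate score instead of being dropped, a URL returned by several engines is boosted, which is usually the desired behaviour in meta-search.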

  27. Assignment 8: Web-services for Multimedia Indexing
  • Related lecture: # 8 by Arjen de Vries (CWI)
  • 8a) Implement the EM algorithm to train a model on a given image.
  • 8b) Make a web-service that takes an image URL and uses this program to train the model (includes 8a).
  • 8c) Make a web-service that visualizes a trained model on a given image, where the results for each component are presented in SVG as an overlay of their pixel blocks, with transparency depending on the block likelihood (includes 8a and 8b).
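As a warm-up for 8a, here is EM for a two-component 1-D Gaussian mixture in pure Python. The real assignment trains a model on image blocks, so this is only a structural sketch of the E- and M-steps:

```python
import math

def em_gmm_1d(data, iters=50):
    """EM for a two-component 1-D Gaussian mixture (toy stand-in for the
    image block model). Returns means, variances and mixing weights."""
    mu = [min(data), max(data)]          # spread the initial means apart
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each data point
        resp = []
        for x in data:
            p = [pi[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 / math.sqrt(2 * math.pi * var[k]) for k in (0, 1)]
            z = p[0] + p[1]
            resp.append([p[0] / z, p[1] / z])
        # M-step: re-estimate weights, means and variances from responsibilities
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, data)) / nk, 1e-6)
    return mu, var, pi
```

For 8a the data points become per-block feature vectors and the likelihoods of the trained components are what 8c maps to overlay transparency.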

  28. Assignment 9: Digital Accessibility on Dutch Species
  • Related lecture: # 6 by Trezorix
  • www.soortenregister.nl
  • www.nederlandsesoorten.nl (NSR)
  • www.w3.org/2004/02/skos (SKOS)
  • www.w3.org/TR/owl-features (OWL)
  • www.openrdf.org (Sesame)
  • lucene.apache.org (Lucene)

  29. Assignment 9: Digital Accessibility on Dutch Species
  • The Dutch Species Register (Nederlands Soortenregister, NSR):
  • a thesaurus structure with information on Dutch plants, animals and fungi
  • contains about 40,000 biological concepts (taxa)
  • the web version of the NSR is linked to various databases, such as:
  • a preservation status database
  • an image library
  • article libraries, etc.
  • the editorial maintenance of the site is done by the Dutch National Museum of Natural History Naturalis
  • the technical infrastructure is developed and maintained by Trezorix

  30. Assignment 9: Digital Accessibility on Dutch Species
  • Make a website for digital access to (part of) the collection of the NSR.
  • The site should illustrate the use of reference structures for browsing and searching digital collections.
  • The NSR structure is a complicated structure of thesaurus-like semantic relations combined with naming data for each concept.

  31. Assignment 9: Digital Accessibility on Dutch Species
  • Describe an NSR data model.
  • Represent the structure part (thesaurus) of the NSR in SKOS.
  • Represent as much of the ‘extra data’ (see below) as possible in SKOS.
  • Represent the data which do not fit into SKOS in OWL.
  • Use the open-source RDF framework Sesame for storage of the structures.
  • Use the open-source search engine Lucene for findability of NSR elements.
  • Explain how you deal with the complexity of the NSR structure.

  32. Assignment 9: Digital Accessibility on Dutch Species
  Applicable ‘extra data’:
  • Change record data - a per-concept history of mutations of the naming data.
  • Data for the species counter - bottom-up accumulative data about the number of species under a certain node.
  • Data about the availability of photographs of species.
  What will be supplied by Trezorix?
  • A limited NSR dataset, for instance songbirds, including scientific names, synonyms, change record data, etc.
  • Extra data: data for the species counter and data about the availability of photographs.
  • Relevant photographs.
  • Technical background.

  33. Course Goal
  • dimensions of the IR "problem":
  • functions of an IR system
  • components of an IR system
  • factors which optimize the IR process
  • examine current research issues in IR
  • explore examples of industrial & research IR applications
  • form a broad picture of the IR field
  • build experience working with IR systems
