CS798: Information Retrieval

CS798: Information Retrieval • Charlie Clarke • claclark@plg.uwaterloo.ca • Information retrieval is concerned with representing, searching, and manipulating large collections of human-language data.

Housekeeping Web page:http://plg.uwaterloo.ca/~claclark/cs798 Area: “Applications/Databases” Meeting times: Mondays, 2:00-5:00, MC2036

NLP DB IR ML

Topics • Basic techniques • Searching, browsing, ranking, retrieval • Indexing algorithms and data structures • Evaluation • Application areas

1. Basic Techniques • Text representation & Tokenization • Inverted indices • Phrase searching example • Vector space model • Boolean retrieval • Simple proximity ranking • Test collections & Evaluation

2. Retrieval and Ranking • Probabilistic retrieval and Okapi BM25F • Language modeling • Divergence from randomness • Passage retrieval • Classification • Learning to rank • Implicit user feedback

3. Indexing • Algorithms and data structures • Index creation • Dynamic update • Index compression • Query processing • Query optimization

4. Evaluation • Statistical foundations of evaluation • Measuring Efficiency • Measuring Effectiveness • Recall/Precision • NDCG • Other measures • Building a test collection

5. Application Areas • Parallel retrieval architectures • Web search (Link analysis/Pagerank) • XML retrieval • Filesystem search • Spam filtering

Other Topics (student projects) • Image/video/speech retrieval • Web spam • Cross- and multi-lingual IR • Clustering • Advertising/Recommendation • Distributed IR/Meta-search • Question answering • etc.

Resources Textbook (partial draft on Website): Büttcher, Clarke & Cormack. Information Retrieval: Data Structures, Algorithms and Evaluation. (start reading ch. 1-3) Wumpus: www.wumpus-search.org

Grading • Short homework exercises from text (10%) • A literature review based on a topic area selected by the student with the agreement of the instructor (30%) • 30-minute presentation on your selected topic (20%) • Class project (40%) – details coming up..

“Documents” • Documents are the basic units of retrieval in an IR system. • In practice they might be: Web pages, email messages, LaTeX files, news articles, phone message, etc. • Update: add, delete, append(?), modify(?) • Passages and XML elements are other possible units of retrieval.

Probability Ranking Principle If an IR system’s response to a query is a ranking of the documents in the collection in order of decreasing probability of relevance, the overall effectiveness of the system to its users will be maximized.

Evaluating IR systems • Efficiency vs. effectiveness • Manual evaluation • Topic creation and judging • TREC (Text REtreival Conference) • Google Has 10,000 Human Evaluators? • Evaluation through implicit user feedback • Specificity vs. exhaustivity

<topic> • <title> shark attacks </title> • <desc> • Where do shark attacks occur in the world? • </desc> • <narr> • Are there beaches or other areas that are particularly prone to shark attacks? Documents comparing areas and providing statistics are relevant. Documents describing shark attacks at a single location are not relevant. • </narr> • </topic>

Class Project:Wikipedia Search • Can we outperform Google on the Wikipedia? • Basic project: Build a search engine for the Wikipedia (using any tools you can find). • Ideas: Pagerank, spelling, structure, element retrieval, summarization, external information, user interfaces

Class Project: Evaluation • Each student will create and judge n topics. • The value of n depends on the number of students. (But workload stays the same.) • Quantitative measure of effectiveness. • Qualitative assessment of user interfaces. • Volunteer needed to operate the judging interface (for credit).

Class Project: Organization • You may work in groups (check with me). • You may work individually (check with me). • You may create and share tools with other students. You get the credit. (e.g. Volunteer needed to set up a class wiki.) • Programming can’t be avoided, but can be minimized. ☺ • Programming can also be maximized.

Class Project: Grading • Topic creation and judging: 10% • Other project work: 30% • You are responsible for submitting one experimental run for evaluation. • Other activities are up to you.

One line?

Tokenization • For English text: Treat each string of alphanumeric characters as a token. • Number sequentially from the start of the text collection. • For non-English text: Depends on the language (possible student projects) • Other considerations: Stemming, stopwords, etc.

Inverted Indices • Basic data structure • More next day…

Plan • Sept 17: • Inverted indices (from Chapter 3) • Index construction/Wumpus (Stefan) • Sept 24: • Vector space model, Boolean retrieval, proximity • Basic evaluation methods • October 1: • Probabilistic retrieval, language modeling • Start topic creation for class project • October 8: Web search

CS798: Information Retrieval

CS798: Information Retrieval

Presentation Transcript

Text summarization

An Overview of the Indri Search Engine

INFO624 -- Week 9 Effective Information Retrieval

Applications (1 of 2): Information Retrieval

Information Retrieval to Knowledge Retrieval , one more step

Information Retrieval

Practical Online Retrieval Evaluation SIGIR 2011 Tutorial

SIMS 247 Information Visualization and Presentation

Text Information Retrieval and Applications – Advanced Topics

Memory

Text-retrieval Systems

Cross-Language Retrieval

Laboratory Information Management Systems (LIMS)

Information Retrieval and Recommendation Techniques

Text Information Retrieval and Applications – Advanced Topics

Information Retrieval and Search Engines

Information Retrieval and Search Engines

Event-based Notification VL Web-basierte Informationssysteme

Learning Embeddings for Similarity-Based Retrieval

Modeling the Internet and the Web: Text Analysis

Text information storage and retrieval and the CDS/ISIS program