Information Retrieval

Information Retrieval For the MSc Computer Science Programme Lecture 1 Introduction to Information Retrieval (Manning et al. 2007) Chapter 1 Dell Zhang Birkbeck, University of London

What is IR? • IR is about search engines. Database System Structured Data (Tables) SQL Queries Search Engine Unstructured Data (Text) Keyword Queries VS Structured data has been the big commercial success (e.g., Oracle) but unstructured data is now becoming dominant in a large and increasing range of activities.

Search is Cool • Text everywhere • books, documents (ms-word, pdf, etc.), articles (journal, magazine, newspaper, etc.), Web pages, emails, SMS, chat, … • Much more than text • music, photos, videos, … • Much more than the Web • personal • enterprise, institutional and domain-specific

Search is Cool • Not only search/finding, but also organization and mining • clustering • classification • …… http://commons.wikimedia.org/wiki/Image:Bookshelf.jpg For example, given thousaunds of CVs, …

Search is Hot • Most people’s means of information access • In the 1990s: other people • Nowadays: Web search • “92% of Internet users say the Internet is a good place to go for getting everyday information” – (2004 Pew Internet Survey) • The Search Engine War • Google, Yahoo, Microsoft, … Video: How Big Can You Think?

Data: Unstructured vs. Structured in 1996

Data: Unstructured vs. Structured in2006

Free Textbook Online Head of Yahoo! Research • C. D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008. • http://www-csli.stanford.edu/~schuetze/information-retrieval-book.html Don’t remember. Search! Another reason for you to come to this module.

A Simple Search Engine “to be, or not to be” • http://www.rhymezone.com/shakespeare/ • Which plays of Shakespeare contain the words BrutusANDCaesar but NOTCalpurnia? • One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia? • Slow (for large corpora) • NOTCalpurnia is non-trivial • Other operations (e.g., find the word Romans nearcountrymen) not feasible How do you search in a book?

Boolean Queries • Boolean Queries are queries using AND, OR and NOT together with query terms. • Each document is viewed as a set of words. • Precise: a document matches condition or not. • Primary commercial retrieval tool for 3 decades. • Professional searchers still like Boolean queries - you know exactly what you’re getting. • For example, http://www.westlaw.com/.

Term-Document Incidence 1 if play contains word, 0 otherwise BrutusANDCaesar but NOTCalpurnia

Incidence Vectors • So we have a 0/1 vector for each term. • To answer the query • take the vectors for Brutus, Caesar and Calpurnia(complemented)  bitwise AND. • 110100 & 110111 & 101111 = 100100

Query Results Antony and Cleopatra, Act III, Scene ii …… Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain. …… Hamlet, Act III, Scene ii ……. Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. ……

Bigger Corpora • Consider • n = 1M documents • each with about 1K terms. • Avg 6 bytes/term incl spaces/punctuation • 6GB of data in the documents. • Say there are m = 500K distinct terms among these.

Bigger Corpora • The 500K x 1M matrix has • half-a-trillion 0’s and 1’s, • but no more than one billion 1’s. • The matrix is extremely sparse. • Can’t build the matrix straightforwardly. • What’s a better representation? • Only record the 1 positions. Why?

2 4 8 16 32 64 128 1 2 3 5 8 13 21 34 Inverted Index • For each term T, we must store a list of all documents that contain T. • Do we use an array or a list for this? Brutus Calpurnia Caesar 13 16 What happens if the word Caesar is added to document 14?

Brutus Calpurnia Caesar Dictionary Postings Inverted Index • Linked lists generally preferred to arrays • Dynamic space allocation • Insertion of terms into documents easy • Space overhead of pointers 2 4 8 16 32 64 128 1 2 3 5 8 13 21 34 13 16 (sorted by docID)

Documents (tobe indexed) Friends, Romans, countrymen. Tokenization Friends Romans Countrymen Token Stream Linguistic Preprocessing friend roman countryman Modified Tokens friend Indexing roman 2 4 countryman Inverted Index 1 2 16 13 Inverted Index - Construction

Linguistic Preprocessing • Case-folding • Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization. • Removal of stopwords • Very common words like the, of, to, etc. • List of stopwords (stop lists) • http://www.dcs.gla.ac.uk/idom/ir_resources/ linguistic_utils/stop_words • http://www.ranks.nl/tools/stopwords.html

Linguistic Preprocessing • Stemming • Reduce terms to their “roots” before indexing, e.g., automate(s), automatic, automation all reduced to automat. • Porter’s stemmer • Implementations in C, Java, Perl, Python, etc. • http://www.tartarus.org/~martin/PorterStemmer/ • ……

Indexing • Input • a sequence of (term, docID) pairs Doc 1 Doc 2 I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Sort • by terms. This is the core indexing step,

Merge • multiple term entries in a single document. • Add • frequency information. Why? Will discuss later.

Split • the result into • a dictionary file and a postings file.

The index we just built Where do we pay in storage? How do we process a query?

2 4 8 16 32 64 1 2 3 5 8 13 21 Query Processing • Consider processing the query: BrutusANDCaesar • Locate Brutus in the dictionary; • Retrieve its postings. • Locate Caesar in the dictionary; • Retrieve its postings. • “Merge” the two postings: 128 Brutus Caesar 34

Brutus Caesar 13 128 2 2 4 4 8 8 16 16 32 32 64 64 8 1 1 2 2 3 3 5 5 8 8 21 21 13 34 Query Processing - Merge Walk through the two postings simultaneously, in time linear in the total number of postings entries. In other words, if the posting list lengths are x and y, the merge takes O(x+y)operations. 128 2 34 What’s crucial: postings must be sorted by docID.

Query Processing - Exercise • Adapt the mergealgorithm for the queries: • BrutusAND NOTCaesar • BrutusOR NOTCaesar Can we still run through the merge in time O(x+y)?

Query Processing - Exercise • What about an arbitrary Boolean formula? • (Brutus OR Caesar) AND NOT (Antony OR Cleopatra) • Can we always merge in “linear” time? • The time complexity is linear in what? • Can we do better?

Query Processing - Exercise • Extend the merge algorithm to an arbitrary Boolean query. Can we always guarantee execution in time linear in the total postings size? [Hint] Begin with the case of a Boolean formula query, then each query term appears only once in the query.

Take Home Messages • IR is about search engines. • Search is hot! Search is cool! • Free textbook online! • http://www.dcs.bbk.ac.uk/~dell/teaching/ir/ • Inverted Index • dictionary + postings • Boolean Search • query processing

Information Retrieval