Advanced topics in Computer Science

Advanced topics in Computer Science Jiaheng Lu Department of Computer Science Renmin University of China www.jiahenglu.net

Course purpose • Teach in English most of the time • Introduce senior undergraduate students to some advanced topics in computer science

Course contents • Introduction to information retrieval • Approximate string processing • XML data management • Cloud computing

Lecturer: Academic experience • 2006.9 ~ 2008.6 University of California, Irvine, Postdoc researcher, Supervisor: Prof. Chen Li • 2002.8 ~ 2006.8 National University of Singapore, PhD candidate, Supervisor: Prof. Ling Tok Wang • 1998.9 ~ 2001.1 Shanghai Jiao Tong University, Master candidate

University of California, Irvine

Research in Postdoc • Data integration in medical system [US patent] • Approximate string search [ICDE08]

National University of Singapore

Course grading • Presentation in English/Chinese only 40% • Programming only 40% • In-class presence and quiz 20%

Any questions or comments?

Evaluating Information Retrieval

Online textbook: Introduction to Information Retrieval http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html

Search engines • Do you have any comments about search engines? • Baidu • Google • Sogou • Yahoo

Measures for a search engine • How fast does it index • Number of documents/hour • (Average document size) • How fast does it search • Latency as a function of index size • Expressiveness of query language • Speed on complex queries

Measures for a search engine • All of the preceding criteria are measurable: we can quantify speed/size; we can make expressiveness precise • The key measure: user happiness • What is this? • Speed of response/size of index are factors • But blindingly fast, useless answers won’t make a user happy • Need a way of quantifying user happiness

Measuring user happiness • Issue: who is the user we are trying to make happy? • Depends on the setting • Web engine: user finds what they want and returns to the engine • Can measure rate of return users • eCommerce site: user finds what they want and makes a purchase • Is it the end-user, or the eCommerce site, whose happiness we measure? • Measure time to purchase, or fraction of searchers who become buyers?

Measuring user happiness • Enterprise (company/govt/academic): Care about “user productivity” • How much time do my users save when looking for information? • Many other criteria having to do with breadth of access, secure access … more later

Happiness: elusive to measure • Most common proxy: relevance of search results • But how do you measure relevance? • Will detail a methodology here, then examine its issues • Requires 3 elements: • A benchmark document collection • A benchmark suite of queries • A binary assessment of either Relevant or Irrelevant for each query-doc pair

Evaluating an IR system • Note: information need is translated into a query • Relevance is assessed relative to the information need, not the query • E.g., Information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine. • Query: wine red white heart attack effective

Standard relevance benchmarks • TREC: the National Institute of Standards and Technology (NIST) has run a large IR benchmark for many years • Reuters and other benchmark doc collections used • “Retrieval tasks” specified • sometimes as queries • Human experts mark, for each query and for each doc, Relevant or Irrelevant • or at least for the subset of docs that some system returned for that query

Precision and Recall • Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved) • Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant) • Precision P = tp/(tp + fp) • Recall R = tp/(tp + fn)

Accuracy – a different measure • Given a query, an engine classifies each doc as “Relevant” or “Irrelevant”. • Accuracy of an engine: the fraction of these classifications that are correct.

Why not just use accuracy? • How to build a 99.9999% accurate search engine on a low budget…. • People doing information retrieval want to find something and have a certain tolerance for junk.
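The low-budget trick can be made concrete with a small sketch. The collection size and the single relevant document below are made-up numbers chosen for illustration, and the `accuracy` helper is a hypothetical name, not part of any library. An engine that returns nothing labels every doc "Irrelevant" and is right about almost all of them:

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of Relevant/Irrelevant classifications that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical collection: 1,000,000 docs, only 1 of them relevant to the query.
# An engine that retrieves nothing classifies every doc as "Irrelevant".
tp, fp = 0, 0          # nothing retrieved
fn, tn = 1, 999_999    # the one relevant doc is missed; the rest are correctly rejected

lazy_accuracy = accuracy(tp, fp, fn, tn)
lazy_recall = tp / (tp + fn)

print(f"accuracy = {lazy_accuracy:.4%}, recall = {lazy_recall:.0%}")
```

Accuracy is dominated by the huge number of true negatives, which is why precision and recall, which ignore them, are used instead.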

Precision/Recall • Can get high recall (but low precision) by retrieving all docs for all queries! • Recall is a non-decreasing function of the number of docs retrieved • Precision usually decreases as the number of docs retrieved (or recall) increases (in a good system)
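A small sketch of how precision and recall move as more of a ranked result list is retrieved. The ranking and the total of 4 relevant docs are invented for illustration, and `precision_recall_at_k` is a hypothetical helper name:

```python
def precision_recall_at_k(ranked_relevance, total_relevant):
    """Return (k, precision, recall) for every prefix of a ranked result list."""
    rows = []
    hits = 0
    for k, is_relevant in enumerate(ranked_relevance, start=1):
        hits += is_relevant  # True counts as 1, False as 0
        rows.append((k, hits / k, hits / total_relevant))
    return rows

# Made-up ranking: True marks a relevant doc; assume 4 relevant docs exist in total.
ranking = [True, True, False, True, False, False, True, False]
rows = precision_recall_at_k(ranking, total_relevant=4)

for k, p, r in rows:
    print(f"k={k}: precision={p:.2f}  recall={r:.2f}")
```

In this toy ranking recall never goes down as k grows, while precision drifts downward once irrelevant docs start to appear.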

Difficulties in using precision/recall • Should average over large corpus/query ensembles • Need human relevance assessments • People aren’t reliable assessors • Assessments have to be binary • Nuanced assessments? • Heavily skewed by corpus/authorship • Results may not translate from one domain to another

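A combined measure: F • Combined measure that assesses the precision/recall tradeoff is the F measure (weighted harmonic mean): $F = \frac{1}{\alpha \frac{1}{P} + (1-\alpha)\frac{1}{R}} = \frac{(\beta^2+1)PR}{\beta^2 P + R}$, where $\beta^2 = (1-\alpha)/\alpha$ • People usually use the balanced F1 measure • i.e., with $\beta = 1$ or $\alpha = 1/2$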
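A minimal sketch of the F measure, using the standard weighted harmonic mean F = (β² + 1)PR / (β²P + R). The function name `f_measure` and the zero-division convention are assumptions made for this example:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 gives balanced F1."""
    if precision == 0 and recall == 0:
        return 0.0  # convention assumed here to avoid dividing by zero
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# Retrieving everything gives perfect recall but very low precision.
# The harmonic mean stays low, whereas an arithmetic mean would be about 0.5.
retrieve_all = f_measure(precision=0.01, recall=1.0)
balanced = f_measure(precision=0.5, recall=0.5)

print(f"retrieve-all F1 = {retrieve_all:.4f}, balanced F1 = {balanced:.2f}")
```

Values of beta above 1 weight recall more heavily and values below 1 weight precision more heavily.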

Any questions or comments? 2014/10/9

Precision and Recall • Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved) • Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant) • Precision P = tp/(tp + fp) • Recall R = tp/(tp + fn)

Precision and Recall Quiz • Precision P = tp/(tp + fp) = 10/13 ≈ 77% • Recall R = tp/(tp + fn) = 10/15 ≈ 67%
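The quiz arithmetic can be checked in a few lines. The counts tp = 10, fp = 3 and fn = 5 are inferred from the fractions 10/13 and 10/15 on the slide:

```python
# Counts inferred from the quiz fractions 10/13 and 10/15.
tp = 10  # relevant docs that were retrieved
fp = 3   # irrelevant docs that were retrieved
fn = 5   # relevant docs that were missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"precision = {precision:.0%}, recall = {recall:.0%}")
```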

Introduction to Information Retrieval System

Query • Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? • Could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia? • Slow (for large corpora) • NOT Calpurnia is non-trivial • Other operations (e.g., find the phrase Romans and countrymen) not feasible

Term-document incidence 1 if play contains word, 0 otherwise

Incidence vectors • So we have a 0/1 vector for each term. • To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented), then bitwise AND them. • 110100 AND 110111 AND 101111 = 100100.
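The bitwise AND above can be reproduced directly with integers used as bit vectors. The Calpurnia vector 010000 is inferred from its complement 101111 on the slide, and the 6-bit mask reflects the six plays in the example:

```python
# One bit per play, leftmost bit = first play in the incidence matrix.
brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000   # inferred: its complement is the 101111 used on the slide

NUM_PLAYS = 6
mask = (1 << NUM_PLAYS) - 1              # 111111, keeps the complement to 6 bits

not_calpurnia = ~calpurnia & mask        # complement restricted to the 6 plays
answer = brutus & caesar & not_calpurnia # Brutus AND Caesar AND NOT Calpurnia

print(format(answer, "06b"))
```

The set bits of the result identify the plays that satisfy the query.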

Answers to query • Antony and Cleopatra, Act III, Scene ii • Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, • When Antony found Julius Caesar dead, • He cried almost to roaring; and he wept • When at Philippi he found Brutus slain. • Hamlet, Act III, Scene ii • Lord Polonius: I did enact Julius Caesar I was killed i' the • Capitol; Brutus killed me.

Bigger document collections • Consider N = 1 million documents, each with about 1K terms. • Avg 6 bytes/term incl. spaces/punctuation • 6GB of data in the documents. • Say there are M = 500K distinct terms among these.

Can’t build the matrix • 500K x 1M matrix has half-a-trillion 0’s and 1’s. • But it has no more than one billion 1’s. • matrix is extremely sparse. • What’s a better representation? • We only record the 1 positions. Why?
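The sizes quoted on the last two slides follow from simple arithmetic, sketched here with the slide's own figures (variable names are chosen for this example):

```python
N = 1_000_000          # documents
terms_per_doc = 1_000  # about 1K terms each
bytes_per_term = 6     # average, including spaces/punctuation
M = 500_000            # distinct terms

corpus_bytes = N * terms_per_doc * bytes_per_term  # total text size
matrix_cells = M * N                               # cells in the term-document matrix
max_ones = N * terms_per_doc                       # each term occurrence sets at most one cell
density = max_ones / matrix_cells                  # upper bound on the fraction of 1s

print(f"corpus: {corpus_bytes / 1e9:.0f} GB")
print(f"matrix cells: {matrix_cells:.1e}, at most {max_ones:.1e} ones")
print(f"at most {density:.1%} of the cells are 1")
```

Since at most a fraction of a percent of the cells can be 1, storing only the 1 positions is far cheaper than storing the full matrix.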

Inverted index • For each term T: store a list of all documents that contain T. • Do we use an array or a list for this? • Brutus → 2 4 8 16 32 64 128 • Calpurnia → 1 2 3 5 8 13 21 34 • Caesar → 13 16 • What happens if the word Caesar is added to document 14?

Inverted index • Linked lists generally preferred to arrays • Dynamic space allocation • Insertion of terms into documents easy • Space overhead of pointers • Dictionary: Brutus, Calpurnia, Caesar • Postings: Brutus → 2 4 8 16 32 64 128; Calpurnia → 1 2 3 5 8 13 21 34; Caesar → 13 16 • Sorted by docID (more later on why).

Inverted index construction • Documents to be indexed: Friends, Romans, countrymen. • Tokenizer → token stream: Friends Romans Countrymen • Linguistic modules (more on these later) → modified tokens: friend roman countryman • Indexer → inverted index: friend → 2 4; roman → 1 2; countryman → 13 16

Indexer steps • Sequence of (modified token, document ID) pairs. • Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. • Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Sort by terms. Core indexing step.

Multiple term entries in a single document are merged. • Frequency information is added. Why frequency? Will discuss later.

The result is split into a Dictionary file and a Postings file.
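The indexer steps on the last few slides can be sketched end to end on Doc 1 and Doc 2. This is an illustration under simplifying assumptions: `tokenize` is a crude stand-in for the tokenizer and linguistic modules (it only lowercases and splits on non-letters), and the dictionary and postings are in-memory dicts rather than files:

```python
import re
from collections import Counter

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    """Crude stand-in for tokenizer + linguistic modules: lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

# Step 1: sequence of (modified token, docID) pairs.
pairs = [(tok, doc_id) for doc_id, text in docs.items() for tok in tokenize(text)]

# Step 2: sort by term, then by docID (the core indexing step).
pairs.sort()

# Step 3: merge repeated (term, docID) entries and keep the term frequency.
freqs = Counter(pairs)

# Step 4: split into postings (term -> [(docID, tf), ...]) and a
# dictionary (term -> number of docs containing the term).
postings = {}
for (term, doc_id), tf in sorted(freqs.items()):
    postings.setdefault(term, []).append((doc_id, tf))
dictionary = {term: len(plist) for term, plist in postings.items()}

print(dictionary["caesar"], postings["caesar"])
```

Because the pairs are sorted before merging, every postings list comes out ordered by docID, which the merge-based query processing later relies on.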

Where do we pay in storage? • Terms • Pointers • Will quantify the storage later.

The index we just built • How do we process a Boolean query? (Today’s focus) • Later: what kinds of queries can we process?

Query processing • Consider processing the query: Brutus AND Caesar • Locate Brutus in the Dictionary; • Retrieve its postings. • Locate Caesar in the Dictionary; • Retrieve its postings. • “Merge” the two postings: • Brutus → 2 4 8 16 32 64 128 • Caesar → 1 2 3 5 8 13 21 34

The merge • Walk through the two postings simultaneously, in time linear in the total number of postings entries • Brutus → 2 4 8 16 32 64 128 • Caesar → 1 2 3 5 8 13 21 34 • Result → 2 8 • If the list lengths are x and y, the merge takes O(x+y) operations. • Crucial: postings sorted by docID.

Basic postings intersection
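A minimal sketch of the two-pointer merge for an AND query, assuming postings are Python lists of docIDs sorted in increasing order. The function name `intersect` is chosen for this example, and the lists are the Brutus and Caesar postings from the query processing slide:

```python
def intersect(p1, p2):
    """Intersect two docID lists sorted in increasing order in O(x + y) steps."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID appears in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the list with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]

print(intersect(brutus, caesar))
```

Each step advances at least one pointer, so the loop runs at most x + y times. This only works because both lists are sorted by docID.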

Boolean queries: Exact match • Queries using AND, OR and NOT together with query terms • Views each document as a set of words • Is precise: document matches condition or not. • Primary commercial retrieval tool for 3 decades. • Professional searchers (e.g., Lawyers) still like Boolean queries: • You know exactly what you’re getting.

Example: WestLaw http://www.westlaw.com/ • Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992) • About 7 terabytes of data; 700,000 users • Majority of users still use Boolean queries • Example query: • What is the statute of limitations in cases involving the federal tort claims act? • LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM • Long, precise queries; proximity operators; incrementally developed; not like web search

More general merges • Exercise: Adapt the merge for the queries: • Brutus AND NOT Caesar • Brutus OR NOT Caesar • Can we still run through the merge in time O(x+y)?
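One possible approach to the first query, offered as a sketch rather than the intended solution: the same two-pointer walk, keeping docIDs from the first list that do not occur in the second. The name `and_not` and the sample lists (the Brutus and Caesar postings from the query processing slide) are choices made for this example:

```python
def and_not(p1, p2):
    """DocIDs in p1 that are absent from p2; both lists sorted in increasing order."""
    answer = []
    i = j = 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])  # p2 has no entry equal to p1[i]
            i += 1
        elif p1[i] == p2[j]:
            i += 1                # present in both lists: exclude it
            j += 1
        else:
            j += 1                # skip p2 entries smaller than p1[i]
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]

print(and_not(brutus, caesar))
```

This walk is still O(x+y). For Brutus OR NOT Caesar the answer includes every document that does not contain Caesar, so the size of the result depends on the collection size N rather than only on x and y.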
