
Web Search and Information Retrieval


Presentation Transcript


  1. Web Search and Information Retrieval

  2. Definition of information retrieval • Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need within large collections (usually stored on computers)

  3. Structured vs unstructured data • Structured data: information in “tables”, e.g.:

  Employee   Manager   Salary
  Smith      Jones     50000
  Chang      Smith     60000
  Ivy        Smith     50000

  Structured data typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.

  4. Unstructured data • Typically refers to free text • Allows • Keyword-based queries including operators • More sophisticated “concept” queries, e.g., • find all web pages dealing with drug abuse

  5. Ultimate Focus of IR • Satisfying user information need • Emphasis is on retrieval of information (not data) • Predicting which documents are relevant, and then linearly ranking them.

  6. Basic assumptions of Information Retrieval Collection: a fixed set of documents. Goal: retrieve documents with information that is relevant to the user’s information need and helps the user complete a task. (SIGIR 2005)

  7. The classic search model (diagram) Task → Info need → Verbal form → Query → Search engine → Results, drawn from a corpus, with query refinement feeding results back into the query. Each translation step can go wrong: misconception (task → info need), mistranslation (info need → verbal form), misformulation (verbal form → query). Example: Task: get rid of mice in a politically correct way. Info need: info about removing mice without killing them. Verbal form: “How do I trap mice alive?” Query: mouse trap.

  8. Boolean Queries • Some simple query examples • Documents containing the word “Java” • Documents containing the word “Java” but not the word “coffee” • Documents containing the phrase “Java beans” or the term “API” • Documents where “Java” and “island” occur in the same sentence • The last two queries are called proximity queries

  9. Before processing the queries… • Documents in the collection should be tokenized in a suitable manner • We need to decide what terms should be put in the index

  10. Tokens and Terms

  11. Tokenization Input: “Friends, Romans and Countrymen” Output: tokens Friends, Romans, Countrymen. Each such token is now a candidate for an index entry, after further processing (described below).

  12. Why tokenization is difficult – even in English Example: Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing. Tokenize this sentence
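
As an illustration of why this is hard, here is a small Python sketch (not from the slides; both the tokenization rules and the regular expression are my assumptions) that tokenizes the example sentence two different ways:

```python
import re

sentence = "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing."

# Naive approach: split on whitespace, then strip surrounding punctuation.
# Note that "Mr." loses its period, while "boys'" keeps its trailing apostrophe.
naive = [tok.strip(".,;!?") for tok in sentence.split()]

# Slightly smarter: allow word-internal apostrophes and periods (O'Neill, aren't, Mr.).
# This keeps "Mr." but also keeps the sentence-final period on "amusing." and drops
# the apostrophe of "boys'" -- there is no single obviously correct answer.
pattern = re.compile(r"[A-Za-z]+(?:[.'][A-Za-z]+)*\.?")
smarter = pattern.findall(sentence)

print(naive)
print(smarter)
```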

  13. One word or two? (or several) • fault-finder • co-education • state-of-the-art • data base • San Francisco • cheap San Francisco-Los Angeles fares

  14. Tokenization: language issues Chinese and Japanese have no spaces between words: 莎拉波娃現在居住在美國東南部的佛羅里達。 A unique tokenization is not always guaranteed.

  15. Ambiguous segmentation in Chinese The two characters 和尚 can be treated as one word meaning ‘monk’ or as a sequence of two words meaning ‘and’ and ‘still’.

  16. Normalization • Need to “normalize” terms in indexed text as well as query terms into the same form. • Example: We want to match U.S.A. and USA • Two general solutions • We most commonly implicitly define equivalence classes of terms. • Alternatively: do asymmetric expansion • window → window, windows • windows → Windows, windows • Windows (no expansion) • More powerful, but less efficient
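
A minimal sketch of the equivalence-class approach (my own illustration; the two normalization rules below are assumptions, not a standard): the same mapping is applied at index time and at query time, so U.S.A. and USA fall into the same class.

```python
def normalize(term: str) -> str:
    """Map a term to its equivalence-class representative.

    The class is defined implicitly by two rules: case folding
    (see the next slide) and removing periods inside acronyms.
    """
    term = term.lower()           # case folding
    term = term.replace(".", "")  # U.S.A. -> USA -> usa
    return term

# Applied to both indexed text and query terms, so both sides match.
assert normalize("U.S.A.") == normalize("USA") == "usa"
```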

  17. Case folding Reduce all letters to lower case exception: upper case in mid-sentence? Fed vs. fed Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization…

  18. Lemmatization Reduce inflectional/variant forms to base form. E.g., am, are, is → be; car, cars, car's, cars' → car; “the boy's cars are different colors” → “the boy car be different color”. Lemmatization implies doing “proper” reduction to dictionary headword form.
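
As a toy illustration only (a hand-built lookup table standing in for a real dictionary-based lemmatizer with part-of-speech information), the slide's own examples can be reproduced like this:

```python
# Toy lemmatizer: an exception table covering just the forms listed on the slide.
# A real lemmatizer uses a full dictionary plus POS tags to reach the headword.
LEMMA_TABLE = {"am": "be", "are": "be", "is": "be",
               "cars": "car", "car's": "car", "cars'": "car"}

def lemmatize(word: str) -> str:
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

print([lemmatize(w) for w in "the boy's cars are different colors".split()])
# -> ['the', "boy's", 'car', 'be', 'different', 'colors']  (only listed forms are mapped)
```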

  19. Stemming Definition of stemming: a crude heuristic process that chops off the ends of words in the hope of achieving what “principled” lemmatization attempts to do with a lot of linguistic knowledge. Reduce terms to their “roots” before indexing. “Stemming” suggests crude affix chopping and is language dependent, e.g., automate(s), automatic, automation are all reduced to automat. For example, “compressed and compression are both accepted as equivalent to compress” is stemmed to “compress and compress ar both accept as equival to compress”.

  20. Porter algorithm • Most common algorithm for stemming English • Results suggest that it is at least as good as other stemming options • Phases are applied sequentially • Each phase consists of a set of commands. • Sample command: Delete final “ement” if what remains is longer than 1 character • replacement → replac • cement → cement • Sample convention: Of the rules in a compound command, select the one that applies to the longest suffix.

  21. Porter stemmer: a few rules (rule → example) • SSES → SS: caresses → caress • IES → I: ponies → poni • SS → SS: caress → caress • S → (removed): cats → cat
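
The four rules above can be coded directly; this is only the step shown on the slide, not the full Porter stemmer, and the order of the checks implements the longest-matching-suffix convention from the previous slide:

```python
def porter_step_1a(word: str) -> str:
    """Apply the four rules above; the longest matching suffix wins."""
    if word.endswith("sses"):
        return word[:-4] + "ss"   # SSES -> SS   caresses -> caress
    if word.endswith("ies"):
        return word[:-3] + "i"    # IES  -> I    ponies   -> poni
    if word.endswith("ss"):
        return word               # SS   -> SS   caress   -> caress
    if word.endswith("s"):
        return word[:-1]          # S    ->      cats     -> cat
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```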

  22. Other stemmers Other stemmers exist, e.g., Lovins stemmer http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm Single-pass, longest suffix removal (about 250 rules) Full morphological analysis – at most modest benefits for retrieval Do stemming and other normalizations help? English: very mixed results. Helps recall for some queries but harms precision on others E.g., Porter Stemmer equivalence class oper contains all of operate operating operates operation operative operatives operational Definitely useful for Spanish, German, Finnish, …

  23. Thesauri • Handle synonyms and homonyms • Hand-constructed equivalence classes, e.g., car = automobile, color = colour • Rewrite to form equivalence classes and index such equivalences: when the document contains automobile, index it under car as well (usually, also vice-versa) • Or expand the query: when the query contains automobile, look under car as well
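
A sketch of the two options just mentioned, index-time expansion vs. query-time expansion, using a small hand-constructed synonym table (the table and function names are this sketch's assumptions):

```python
# Hand-constructed equivalence classes: representative -> all members.
THESAURUS = {"car": {"car", "automobile"}, "color": {"color", "colour"}}

# Reverse map: any member of a class -> its representative.
CANONICAL = {syn: rep for rep, syns in THESAURUS.items() for syn in syns}

def index_terms(term: str) -> set:
    """Index-time option: index the document under every member of the class."""
    rep = CANONICAL.get(term, term)
    return THESAURUS.get(rep, {term})

def expand_query(term: str) -> set:
    """Query-time option: look under every member of the class."""
    return index_terms(term)  # symmetric in this simple sketch

print(index_terms("automobile"))  # {'car', 'automobile'}
print(expand_query("colour"))     # {'color', 'colour'}
```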

  24. Stop words(1) Stop words = extremely common words which would appear to be of little value in helping select documents matching a user need. They have little semantic content. Examples: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with. Without suitable compression techniques, indexing stop words takes a lot of space. Stop word elimination used to be standard in older IR systems.

  25. Stop words(2) • But the trend is away from doing this: • Good compression techniques mean the space for including stopwords in a system is very small • Good query optimization techniques mean you pay little at query time for including stop words. • You need them for: • Phrase queries: “King of Denmark” • Various song titles, etc.: “Let it be”, “To be or not to be” • ‘can’ as a verb is not very useful for keyword queries, but ‘can’ as a noun could be central to a query • Most web search engines index stop words

  26. Start to process Boolean queries (1) The information contained in Doc 1 and Doc 2 can be represented in the table on the right of the slide. Doc 1: “I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.” Doc 2: “So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious”

  27. Start to process Boolean queries (2) • The table mentioned above is called POSTING • By using a table like this, it is simple to answer the queries using SQL • Documents containing the word “Java” • select did from POSTING where tid='java' • Documents containing the word “Java” but not the word “coffee” • (select did from POSTING where tid='java') except (select did from POSTING where tid='coffee')

  28. Start to process Boolean queries (3) • Documents containing the phrase “Java beans” or the term “API” • with D_JAVA(did, pos) as (select did, pos from POSTING where tid='java'), D_BEANS(did, pos) as (select did, pos from POSTING where tid='beans'), D_JAVABEANS(did) as (select D_JAVA.did from D_JAVA, D_BEANS where D_JAVA.did = D_BEANS.did and D_JAVA.pos + 1 = D_BEANS.pos), D_API(did) as (select did from POSTING where tid='api') (select did from D_JAVABEANS) union (select did from D_API) • Documents where “Java” and “island” occur in the same sentence • If sentence terminators are well defined, one can keep a sentence counter and maintain sentence positions as well as token positions in the POSTING table.
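
For completeness, a small runnable sketch (not part of the slides) that builds the POSTING table in SQLite and runs the “Java but not coffee” query; the column layout POSTING(tid, did, pos) is inferred from the queries above:

```python
import sqlite3

# Two tiny example documents (my own, for illustration only).
docs = {
    1: "java beans are a java api",
    2: "coffee and java the island",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE POSTING (tid TEXT, did INTEGER, pos INTEGER)")
for did, text in docs.items():
    for pos, tid in enumerate(text.split()):
        conn.execute("INSERT INTO POSTING VALUES (?, ?, ?)", (tid, did, pos))

# Documents containing "java" but not "coffee".
rows = conn.execute(
    "SELECT DISTINCT did FROM POSTING WHERE tid='java' "
    "EXCEPT SELECT did FROM POSTING WHERE tid='coffee'"
).fetchall()
print([r[0] for r in rows])   # -> [1]
```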

  29. Is it efficient? • Although the three-column table makes it easy to write keyword queries, it wastes a great deal of space. • To reduce the storage space • Document-term matrix -> term-document matrix • Inverted index • For each term T, we must store a list of all documents that contain T.

  30. Inverted index: the basic concept

  31. Inverted index For each term, the dictionary points to a postings list of docIDs, sorted by docID. Example: Brutus → 2, 4, 8, 16, 32, 64, 128; Caesar → 1, 2, 3, 5, 8, 13, 21, 34; Calpurnia → 13, 16. Linked lists are generally preferred to arrays: dynamic space allocation and easy insertion of new postings, at the cost of the space overhead of pointers.
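
A minimal in-memory version of this structure (a sketch using Python lists in place of linked lists; the tokenization is deliberately crude):

```python
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# dictionary: term -> postings list of docIDs, kept sorted and without duplicates
index = defaultdict(list)
for doc_id in sorted(docs):                      # visiting docs in docID order keeps
    for token in docs[doc_id].lower().split():   # each postings list sorted by docID
        term = token.strip(".,;'")
        if not index[term] or index[term][-1] != doc_id:
            index[term].append(doc_id)

print(index["brutus"])     # [1, 2]
print(index["ambitious"])  # [2]
```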

  32. Query processing: AND Consider processing the query Brutus AND Caesar: locate Brutus in the dictionary and retrieve its postings; locate Caesar in the dictionary and retrieve its postings; “merge” (intersect) the two postings lists: Brutus → 2, 4, 8, 16, 32, 64, 128 and Caesar → 1, 2, 3, 5, 8, 13, 21, 34.

  33. The merge Walk through the two postings lists simultaneously, in time linear in the total number of postings entries. For Brutus → 2, 4, 8, 16, 32, 64, 128 and Caesar → 1, 2, 3, 5, 8, 13, 21, 34 the result is 2, 8. If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID.
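
The linear-time merge can be written as a two-pointer intersection, assuming both lists are sorted by docID as the slide requires; this sketch reproduces the Brutus/Caesar example:

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(len(p1) + len(p2))."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1   # advance the pointer at the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]
```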

  34. Index construction The first step produces the sequence of (modified token, document ID) pairs. Doc 1: “I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.” Doc 2: “So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious”

  35. Sort by terms This is the core indexing step. A large-scale indexer uses an external sort (N-way merge sort).

  36. Multiple term entries in a single document are merged. Frequency information is added. Why frequency? Will discuss later.

  37. The result is split into a Dictionary file and a Postings file.
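
Slides 34-37 describe sort-based index construction; the sketch below imitates the four steps in memory (a real large-scale indexer would use an external N-way merge sort instead of an in-memory sort, and the example documents are my own):

```python
from itertools import groupby

docs = {1: "so let it be with caesar caesar", 2: "the noble caesar was ambitious"}

# Step 1 (slide 34): sequence of (token, docID) pairs.
pairs = [(tok, did) for did, text in docs.items() for tok in text.split()]

# Step 2 (slide 35): sort by term, then by docID (in-memory stand-in for external sort).
pairs.sort()

# Step 3 (slide 36): merge duplicate (term, docID) entries and record term frequency.
postings = [(term, did, len(list(grp))) for (term, did), grp in groupby(pairs)]

# Step 4 (slide 37): split into a dictionary and the postings lists.
dictionary = sorted({term for term, _, _ in postings})
postings_lists = {t: [(d, f) for term, d, f in postings if term == t] for t in dictionary}

print(postings_lists["caesar"])   # [(1, 2), (2, 1)] -- (docID, frequency)
```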

  38. Distributed indexing For web-scale indexing (don’t try this at home!): must use a distributed computing cluster Individual machines are fault-prone Can unpredictably slow down or fail How do we exploit such a pool of machines?

  39. Google data centers • Google data centers mainly contain commodity machines. • Data centers are distributed around the world. • Estimate: a total of 1 million servers, 3 million processors/cores (Gartner 2007) • Estimate: Google installs 100,000 servers each quarter. • Based on expenditures of 200–250 million dollars per year

  40. Distributed indexing Maintain a master machine directing the indexing job – considered “safe”. Break up indexing into sets of (parallel) tasks. Master machine assigns each task to an idle machine from a pool.

  41. Parallel tasks We will use two sets of parallel tasks: parsers and inverters. Break the input document corpus into splits; each split is a subset of documents.

  42. Parsers Master assigns a split to an idle parser machine Parser reads a document at a time and emits (term, doc) pairs Parser writes pairs into j partitions Each partition is for a range of terms’ first letters (e.g., a-f, g-p, q-z) – here j=3. Now to complete the index inversion

  43. Inverters An inverter collects all (term,doc) pairs (= postings) for one term-partition. Sorts and writes to postings lists

  44. Data flow (diagram) The master assigns splits to parsers and partitions to inverters. Map phase: each parser reads its split and writes segment files partitioned by term range (a-f, g-p, q-z). Reduce phase: each inverter collects one partition (a-f, g-p, or q-z) from all segment files and writes its postings.

  45. MapReduce The index construction algorithm we just described is an instance of MapReduce. MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing … … without having to write code for the distribution part.
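
A toy, single-process imitation of the parser/inverter scheme from slides 41-44 (the partition-by-first-letter scheme with j=3 follows the slides; the function names, data layout, and example documents are this sketch's assumptions):

```python
from collections import defaultdict

def partition_of(term):
    """Route a term to one of j=3 partitions by its first letter (a-f, g-p, q-z)."""
    c = term[0]
    return "a-f" if c <= "f" else ("g-p" if c <= "p" else "q-z")

def parse(split):                       # "map" task (a parser)
    """Emit (term, docID) pairs for one split, bucketed by partition."""
    segments = defaultdict(list)
    for doc_id, text in split:
        for term in text.lower().split():
            segments[partition_of(term)].append((term, doc_id))
    return segments

def invert(pairs):                      # "reduce" task (an inverter)
    """Collect, sort, and merge all pairs of one partition into postings lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted(pairs):
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    return dict(postings)

splits = [[(1, "caesar was ambitious")], [(2, "brutus killed caesar")]]
segments = defaultdict(list)
for split in splits:                    # map phase: one parser per split
    for part, pairs in parse(split).items():
        segments[part].extend(pairs)
index = {part: invert(pairs) for part, pairs in segments.items()}  # reduce phase
print(index["a-f"]["caesar"])           # [1, 2]
```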

  46. MapReduce Index construction was just one phase. Another phase: transforming a term-partitioned index into a document-partitioned index. Term-partitioned: one machine handles a subrange of terms. Document-partitioned: one machine handles a subrange of documents. (As we discuss in the web part of the course, most search engines use a document-partitioned index: better load balancing, etc.)

  47. Dynamic indexing Up to now, we have assumed that collections are static. They rarely are: Documents come in over time and need to be inserted. Documents are deleted and modified. This means that the dictionary and postings lists have to be modified: Postings updates for terms already in dictionary New terms added to dictionary

  48. Simplest approach • Maintain a “big” main index • Insertions: new docs go into a “small” auxiliary index; search across both and merge the results • Deletions: keep an invalidation bit-vector for deleted docs and filter the docs in a search result against it • Periodically, re-index into one main index
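
A sketch of this simplest approach (the class and its data structures are assumptions of this illustration; the logic, main index plus auxiliary index plus invalidation of deleted docs plus periodic re-indexing, follows the slide):

```python
class DynamicIndex:
    """Main index + small auxiliary index + invalidation set for deletions."""

    def __init__(self):
        self.main = {}        # term -> postings list (the "big" index)
        self.aux = {}         # term -> postings list for newly added docs
        self.deleted = set()  # invalidation "bit-vector" as a set of docIDs

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.aux.setdefault(term, []).append(doc_id)

    def delete(self, doc_id):
        self.deleted.add(doc_id)

    def search(self, term):
        # Search both indexes, merge, then filter out invalidated docs.
        hits = self.main.get(term, []) + self.aux.get(term, [])
        return sorted(d for d in set(hits) if d not in self.deleted)

    def reindex(self):
        # Periodic rebuild: fold the auxiliary index into the main one.
        for term, docs in self.aux.items():
            merged = set(self.main.get(term, [])) | set(docs)
            self.main[term] = sorted(d for d in merged if d not in self.deleted)
        self.aux, self.deleted = {}, set()

idx = DynamicIndex()
idx.add(1, "brutus killed caesar")
idx.add(2, "caesar was ambitious")
idx.delete(2)
print(idx.search("caesar"))   # [1]
```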

  49. Dynamic indexing at search engines • All the large search engines now do dynamic indexing • Their indices have frequent incremental changes • News items, new topical web pages • But (sometimes/typically) they also periodically reconstruct the index from scratch • Query processing is then switched to the new index, and the old index is then deleted

  50. Something about dictionary
