This material covers the fundamental concepts of information retrieval, focusing on the Boolean and Vector Space models. It starts with simple text processing and Boolean queries, then delves into web searching, indexing, and advanced retrieval models. The presentation discusses the strengths and weaknesses of the Boolean model, including binary decision criteria and lack of document ranking. The Vector Space model is introduced, highlighting term weighting schemes like TF-IDF and the use of cosine similarity for document retrieval, demonstrating how these models enhance search effectiveness and match user queries more accurately.
Information Retrieval CSE 8337 (Part B) Spring 2009 • Some material for these slides obtained from: • Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, http://www.sims.berkeley.edu/~hearst/irbook/ • Data Mining: Introductory and Advanced Topics by Margaret H. Dunham, http://www.engr.smu.edu/~mhd/book • Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, http://informationretrieval.org
CSE 8337 Outline • Introduction • Simple Text Processing • Boolean Queries • Web Searching/Crawling • Indexes • Vector Space Model • Matching • Evaluation
Modeling TOC (Vector Space and Other Models) • Introduction • Classic IR Models • Boolean Model • Vector Model • Probabilistic Model • Extended Boolean Model • Vector Space Scoring • Vector Model and Web Search
IR Models (figure: taxonomy) • User task: Retrieval (Ad hoc, Filtering) and Browsing • Classic models: Boolean, Vector, Probabilistic • Set theoretic: Fuzzy, Extended Boolean • Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks • Probabilistic: Inference Network, Belief Network • Structured models: Non-Overlapping Lists, Proximal Nodes • Browsing: Flat, Structure Guided, Hypertext
The Boolean Model • Simple model based on set theory • Queries specified as Boolean expressions • precise semantics and neat formalism • Terms are either present or absent. Thus, wij ∈ {0,1} • Consider • q = ka ∧ (kb ∨ ¬kc) • qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0) • qcc = (1,1,0) is a conjunctive component
The Boolean Model (figure: Venn diagram of ka, kb, kc marking the conjunctive components (1,1,1), (1,1,0), and (1,0,0)) • q = ka ∧ (kb ∨ ¬kc) • sim(q,dj) = 1 if ∃ qcc | (qcc ∈ qdnf) ∧ (∀ki, gi(dj) = gi(qcc)); 0 otherwise
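The DNF form suggests a direct way to compute this similarity: a document matches the query exactly when its binary term vector equals one of the conjunctive components. A minimal sketch in Python (illustrative only; the document vectors below are made up):

    # q = ka AND (kb OR NOT kc), given by its conjunctive components over (ka, kb, kc)
    Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

    def boolean_sim(doc_bits, q_dnf=Q_DNF):
        """sim(q,dj) = 1 if the document's binary term vector matches some
        conjunctive component of the query's DNF, else 0."""
        return 1 if tuple(doc_bits) in q_dnf else 0

    docs = {
        "d1": (1, 1, 0),  # ka and kb present, kc absent -> matches
        "d2": (1, 0, 1),  # kb absent and kc present     -> no match
        "d3": (0, 1, 1),  # ka absent                    -> no match
    }
    for name, bits in docs.items():
        print(name, boolean_sim(bits))  # d1 1, d2 0, d3 0

The simple membership test works because each conjunctive component fixes a value for every index term, so matching a component means the whole vectors agree.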
Drawbacks of the Boolean Model • Retrieval based on binary decision criteria with no notion of partial matching • No ranking of the documents is provided • Information need has to be translated into a Boolean expression • The Boolean queries formulated by the users are most often too simplistic • As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query
The Vector Model • Use of binary weights is too limiting • Non-binary weights provide consideration for partial matches • These term weights are used to compute a degree of similarity between a query and each document • Ranked set of documents provides for better matching
The Vector Model • wij > 0 whenever ki appears in dj • wiq >= 0 associated with the pair (ki,q) • dj = (w1j, w2j, ..., wtj) • q = (w1q, w2q, ..., wtq) • To each term ki is associated a unitary vector i • The unitary vectors i and j are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents) • The t unitary vectors i form an orthonormal basis for a t-dimensional space where queries and documents are represented as weighted vectors
The Vector Model (figure: angle θ between the document vector dj and the query vector q) • sim(q,dj) = cos(θ) = (dj · q) / (|dj| × |q|) = Σi (wij × wiq) / (|dj| × |q|) • Since wij ≥ 0 and wiq ≥ 0, 0 ≤ sim(q,dj) ≤ 1 • A document is retrieved even if it matches the query terms only partially
Weights wij and wiq? • One approach is to examine the frequency of occurrence of a word in a document: • Absolute frequency: • tf factor, the term frequency within a document • freqi,j - raw frequency of ki within dj • Both high-frequency and low-frequency terms may not actually be significant • Relative frequency: tf divided by the number of words in the document • Normalized frequency: fi,j = freqi,j / (maxl freql,j)
Inverse Document Frequency • Importance of a term may depend more on how well it can distinguish between documents. • Quantification of inter-document separation • Dissimilarity, not similarity • idf factor, the inverse document frequency
IDF • Let N be the total number of docs in the collection • Let ni be the number of docs that contain ki • The idf factor is computed as • idfi = log(N/ni) • the log is used to make the values of tf and idf comparable; it can also be interpreted as the amount of information associated with the term ki • IDF example (base-10 log): • N = 1000, n1 = 100, n2 = 500, n3 = 800 • idf1 = 3 − 2 = 1 • idf2 = 3 − 2.7 ≈ 0.3 • idf3 = 3 − 2.9 ≈ 0.1
The Vector Model • The best term-weighting schemes take both into account. • wij = fi,j * log(N/ni) • This strategy is called a tf-idf weighting scheme
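A minimal sketch of this tf-idf weighting (not from the slides; the toy document and document-frequency numbers are assumptions chosen to echo the idf example above):

    import math
    from collections import Counter

    def tfidf_weights(doc_tokens, doc_freq, n_docs):
        """wij = f_ij * log(N / n_i), with f_ij the frequency of term i in doc j
        normalized by the most frequent term in j (base-10 log, as in the idf example)."""
        counts = Counter(doc_tokens)
        max_freq = max(counts.values())
        return {
            term: (freq / max_freq) * math.log10(n_docs / doc_freq[term])
            for term, freq in counts.items()
        }

    # N = 1000 docs; document frequencies mirror n1=100, n2=500, n3=800 above
    doc_freq = {"car": 100, "insurance": 500, "the": 800}
    doc = ["car", "insurance", "car", "the", "the", "the"]
    print(tfidf_weights(doc, doc_freq, 1000))
    # "the" ends up with a small weight despite being frequent, since its idf is only ~0.1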
The Vector Model • For the query term weights, a suggestion is • wiq = (0.5 + 0.5 × freqi,q / maxl freql,q) × log(N/ni) • The vector model with tf-idf weights is a good ranking strategy with general collections • The vector model is usually as good as any known ranking alternatives. • It is also simple and fast to compute.
The Vector Model • Advantages: • term-weighting improves quality of the answer set • partial matching allows retrieval of docs that approximate the query conditions • cosine ranking formula sorts documents according to degree of similarity to the query • Disadvantages: • Assumes independence of index terms (??); not clear that this is bad though
The Vector Model: Examples I–III (figures: example collections of seven documents d1–d7 over three index terms k1, k2, k3, with their term-weight tables)
Probabilistic Model • Objective: to capture the IR problem using a probabilistic framework • Given a user query, there is an ideal answer set • Querying as specification of the properties of this ideal answer set (clustering) • But, what are these properties? • Guess at the beginning what they could be (i.e., guess initial description of ideal answer set) • Improve by iteration
Probabilistic Model • An initial set of documents is retrieved somehow • User inspects these docs looking for the relevant ones (in truth, only the top 10-20 need to be inspected) • IR system uses this information to refine the description of the ideal answer set • By repeating this process, it is expected that the description of the ideal answer set will improve • Keep in mind that the description of the ideal answer set must be guessed at the very beginning • Description of the ideal answer set is modeled in probabilistic terms
Probabilistic Ranking Principle • Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant). Ideal answer set is referred to as R and should maximize the probability of relevance. Documents in the set R are predicted to be relevant. • But, • how to compute probabilities? • what is the sample space?
The Ranking • Probabilistic ranking computed as: • sim(q,dj) = P(dj relevant-to q) / P(dj non-relevant-to q) • This is the odds of the document dj being relevant • Taking the odds minimizes the probability of an erroneous judgement • Definitions: • wij ∈ {0,1} • P(R | dj): probability that the given doc is relevant • P(¬R | dj): probability that the doc is not relevant
The Ranking • sim(dj,q) = P(R | dj) / P(¬R | dj) = [P(dj | R) × P(R)] / [P(dj | ¬R) × P(¬R)] ~ P(dj | R) / P(dj | ¬R) • P(dj | R): probability of randomly selecting the document dj from the set R of relevant documents
The Ranking • sim(dj,q) ~ P(dj | R) / P(dj | ¬R) ~ [ ∏(gi(dj)=1) P(ki | R) × ∏(gi(dj)=0) P(¬ki | R) ] / [ ∏(gi(dj)=1) P(ki | ¬R) × ∏(gi(dj)=0) P(¬ki | ¬R) ] • P(ki | R): probability that the index term ki is present in a document randomly selected from the set R of relevant documents
The Ranking • sim(dj,q) ~ log { [ ∏ P(ki | R) × ∏ P(¬ki | R) ] / [ ∏ P(ki | ¬R) × ∏ P(¬ki | ¬R) ] } ~ K + Σi [ log( P(ki | R) / P(¬ki | R) ) + log( P(¬ki | ¬R) / P(ki | ¬R) ) ], where P(¬ki | R) = 1 − P(ki | R) and P(¬ki | ¬R) = 1 − P(ki | ¬R)
The Initial Ranking • sim(dj,q) ~ Σi wiq × wij × ( log( P(ki | R) / (1 − P(ki | R)) ) + log( (1 − P(ki | ¬R)) / P(ki | ¬R) ) ) • Probabilities P(ki | R) and P(ki | ¬R)? • Estimates based on assumptions: • P(ki | R) = 0.5 • P(ki | ¬R) = ni / N • Use this initial guess to retrieve an initial ranking • Improve upon this initial ranking
Improving the Initial Ranking • Let • V: set of docs initially retrieved • Vi: subset of docs retrieved that contain ki • Reevaluate estimates: • P(ki | R) = Vi / V • P(ki | ¬R) = (ni − Vi) / (N − V) • Repeat recursively
Improving the Initial Ranking • To avoid problems with V = 1 and Vi = 0: • P(ki | R) = (Vi + 0.5) / (V + 1) • P(ki | ¬R) = (ni − Vi + 0.5) / (N − V + 1) • Also, • P(ki | R) = (Vi + ni/N) / (V + 1) • P(ki | ¬R) = (ni − Vi + ni/N) / (N − V + 1)
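A compact sketch of the whole loop, initial estimates, ranking, then re-estimation with the smoothed formulas above (the toy documents, query terms, and the choice of treating the top two results as V are assumptions for illustration):

    import math

    def bim_scores(docs, query_terms, p_rel, p_nonrel):
        """sim(dj,q) ~ sum over query terms present in dj of
        log(P(ki|R)/(1-P(ki|R))) + log((1-P(ki|~R))/P(ki|~R))."""
        scores = {}
        for name, terms in docs.items():
            s = 0.0
            for t in query_terms:
                if t in terms:
                    s += math.log(p_rel[t] / (1 - p_rel[t]))
                    s += math.log((1 - p_nonrel[t]) / p_nonrel[t])
            scores[name] = s
        return scores

    docs = {"d1": {"gold", "silver", "truck"},
            "d2": {"shipment", "gold", "fire"},
            "d3": {"delivery", "silver", "arrived"}}
    query = ["gold", "silver"]
    N = len(docs)
    n = {t: sum(t in d for d in docs.values()) for t in query}

    # initial guesses: P(ki|R) = 0.5, P(ki|~R) = ni / N
    p_rel = {t: 0.5 for t in query}
    p_nonrel = {t: n[t] / N for t in query}
    ranked = sorted(bim_scores(docs, query, p_rel, p_nonrel).items(),
                    key=lambda kv: kv[1], reverse=True)

    # re-estimate from the top-ranked docs (the set V), with the 0.5 smoothing above
    V_docs = [name for name, _ in ranked[:2]]
    V = len(V_docs)
    Vi = {t: sum(t in docs[d] for d in V_docs) for t in query}
    p_rel = {t: (Vi[t] + 0.5) / (V + 1) for t in query}
    p_nonrel = {t: (n[t] - Vi[t] + 0.5) / (N - V + 1) for t in query}
    print(bim_scores(docs, query, p_rel, p_nonrel))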
Pluses and Minuses • Advantages: • Docs ranked in decreasing order of probability of relevance • Disadvantages: • need to guess initial estimates for P(ki | R) • method does not take into account tf and idf factors
Brief Comparison of Classic Models • Boolean model does not provide for partial matches and is considered to be the weakest classic model • Salton and Buckley did a series of experiments that indicate that, in general, the vector model outperforms the probabilistic model with general collections • This seems also to be the view of the research community
Extended Boolean Model • Boolean model is simple and elegant. • But, no provision for a ranking • As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership • Extend the Boolean model with the notions of partial matching and term weighting • Combine characteristics of the Vector model with properties of Boolean algebra
The Idea • The Extended Boolean Model (introduced by Salton, Fox, and Wu, 1983) is based on a critique of a basic assumption in Boolean algebra • Let, • q = kx ∧ ky • wxj = fxj × (idfx / maxi idfi) be the weight associated with the pair [kx,dj] • Further, let wxj = x and wyj = y
The Idea: qand = kx ∧ ky; wxj = x and wyj = y (figure: dj plotted as the point (x, y) in the kx–ky plane; the AND query measures the distance from the point (1,1)) • sim(qand,dj) = 1 − sqrt( ((1−x)² + (1−y)²) / 2 )
The Idea: qor = kx ∨ ky; wxj = x and wyj = y (figure: dj plotted as the point (x, y); the OR query measures the distance from the origin (0,0)) • sim(qor,dj) = sqrt( (x² + y²) / 2 )
Generalizing the Idea • We can extend the previous model to consider Euclidean distances in a t-dimensional space • This can be done using p-norms, which extend the notion of distance to include p-distances, where 1 ≤ p ≤ ∞ is a new parameter
Generalizing the Idea • sim(qor,dj) = ( (x1^p + x2^p + . . . + xm^p) / m )^(1/p) • sim(qand,dj) = 1 − ( ((1−x1)^p + (1−x2)^p + . . . + (1−xm)^p) / m )^(1/p) • A generalized disjunctive query is given by • qor = k1 ∨p k2 ∨p . . . ∨p kt • A generalized conjunctive query is given by • qand = k1 ∧p k2 ∧p . . . ∧p kt
Properties • If p = 1 then (vector-like) • sim(qor,dj) = sim(qand,dj) = (x1 + . . . + xm) / m • If p = ∞ then (fuzzy-like) • sim(qor,dj) = max(xi) • sim(qand,dj) = min(xi) • By varying p, we can make the model behave as a vector model, as a fuzzy model, or as an intermediate model
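A short sketch of these p-norm similarities, showing the vector-like behavior at p = 1 and the fuzzy-like behavior at p = ∞ (the term weights are made up):

    def sim_or(x, p):
        if p == float("inf"):
            return max(x)                              # fuzzy-like
        return (sum(xi ** p for xi in x) / len(x)) ** (1 / p)

    def sim_and(x, p):
        if p == float("inf"):
            return min(x)                              # fuzzy-like
        return 1 - (sum((1 - xi) ** p for xi in x) / len(x)) ** (1 / p)

    x = [0.9, 0.4, 0.0]                                # weights of one doc for k1, k2, k3
    for p in (1, 2, float("inf")):
        print(p, round(sim_or(x, p), 3), round(sim_and(x, p), 3))
    # p = 1: OR and AND both reduce to the average (vector-like);
    # p = inf: max / min (fuzzy-like); p = 2 is the Euclidean case of the earlier slides.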
Properties 2 • This is quite powerful and is a good argument in favor of the extended Boolean model • q = (k1 ∧ k2) ∨ k3: k1 and k2 are to be used as in a vector retrieval while the presence of k3 is required. • sim(q,dj) = sqrt( [ (1 − sqrt( ((1−x1)² + (1−x2)²) / 2 ))² + x3² ] / 2 )
Conclusions • Model is quite powerful • Properties are interesting and might be useful • Computation is somewhat complex • However, distributivity does not hold for the ranking computation: • q1 = (k1 ∨ k2) ∧ k3 • q2 = (k1 ∧ k3) ∨ (k2 ∧ k3) • sim(q1,dj) ≠ sim(q2,dj)
Vector Space Scoring • First cut: distance between two points • ( = distance between the end points of the two vectors) • Euclidean distance? • Euclidean distance is a bad idea . . . • . . . because Euclidean distance is large for vectors of different lengths.
Why distance is a bad idea The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
Use angle instead of distance • Thought experiment: take a document d and append it to itself. Call this document d′. • “Semantically” d and d′ have the same content • The Euclidean distance between the two documents can be quite large • The angle between the two documents is 0, corresponding to maximal similarity. • Key idea: Rank documents according to angle with query.
From angles to cosines • The following two notions are equivalent. • Rank documents in decreasing order of the angle between query and document • Rank documents in increasing order of cosine(query,document) • Cosine is a monotonically decreasing function on the interval [0°, 180°]
Length normalization • A vector can be (length-) normalized by dividing each of its components by its length – for this we use the L2 norm: ||x||2 = sqrt( Σi xi² ) • Dividing a vector by its L2 norm makes it a unit (length) vector • Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
cosine(query, document) • cos(q,d) = (q · d) / (|q| × |d|) = Σi qi di / ( sqrt(Σi qi²) × sqrt(Σi di²) ) • qi is the tf-idf weight of term i in the query • di is the tf-idf weight of term i in the document • cos(q,d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d • For length-normalized (unit) vectors, the cosine is simply the dot product q · d
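A minimal sketch of cosine scoring with L2 length normalization, as described on the previous slides (the weight vectors are made up):

    import math

    def l2_normalize(vec):
        """Divide each component by the vector's L2 length."""
        norm = math.sqrt(sum(v * v for v in vec))
        return [v / norm for v in vec] if norm else vec

    def cosine(q, d):
        """cos(q,d) = q.d / (|q||d|); for normalized vectors it is just the dot product."""
        return sum(qi * di for qi, di in zip(l2_normalize(q), l2_normalize(d)))

    q = [0.0, 2.3, 0.0, 1.2]          # tf-idf weights of the query over 4 terms
    d = [1.1, 1.9, 0.4, 0.0]          # tf-idf weights of a document
    print(round(cosine(q, d), 3))
    # scaling d (e.g., the d-appended-to-itself thought experiment) leaves the score unchanged
    print(round(cosine(q, [2 * x for x in d]), 3))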
Cosine similarity amongst 3 documents • How similar are the novels SaS: Sense and Sensibility, PaP: Pride and Prejudice, and WH: Wuthering Heights? • (table of term frequencies (counts) for selected terms in the three novels)
3 documents example contd. • (tables: log-frequency weights and the weights after length normalization for the three novels) • cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94 • cos(SaS,WH) ≈ 0.79 • cos(PaP,WH) ≈ 0.69 • Why do we have cos(SaS,PaP) > cos(SaS,WH)?
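The numbers above can be reproduced with a few lines of code. The term counts used here (for affection, jealous, gossip, wuthering) are taken from the corresponding example in Introduction to Information Retrieval and are an assumption, since the counts table itself is not shown in this transcript:

    import math

    def log_tf(counts):
        """Log-frequency weighting: w = 1 + log10(tf) if tf > 0, else 0."""
        return [1 + math.log10(c) if c > 0 else 0.0 for c in counts]

    def normalize(w):
        n = math.sqrt(sum(x * x for x in w))
        return [x / n for x in w]

    counts = {"SaS": [115, 10, 2, 0], "PaP": [58, 7, 0, 0], "WH": [20, 11, 6, 38]}
    vecs = {name: normalize(log_tf(c)) for name, c in counts.items()}

    def cos(a, b):
        return sum(x * y for x, y in zip(vecs[a], vecs[b]))

    print(round(cos("SaS", "PaP"), 2))  # ~0.94
    print(round(cos("SaS", "WH"), 2))   # ~0.79
    print(round(cos("PaP", "WH"), 2))   # ~0.69

SaS and PaP score highest because they put their weight on the same two terms, which answers the question on the slide.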
tf-idf weighting has many variants • (table of tf, df, and normalization weighting variants) • Columns headed 'n' are acronyms for weight schemes • Why is the base of the log in idf immaterial?