
Ranking in Information Retrieval Systems


Presentation Transcript


  1. Ranking in Information Retrieval Systems Prepared by: Mariam John CSE 6392 03/23/2006

  2. Introduction • Basic assumptions: • A document is a bag of words • A keyword query is a small set of words • The three main ranking models in IR: • Vector Space Model • Probabilistic IR Model • Language Model

  3. Factors that impact Ranking in IR • Relative frequency of occurrence of query keywords in a document. • Proximity of query keywords within a document (this cannot be captured by a bag-of-words model; it requires a sequence-of-words model). • Specificity/importance of query keywords, e.g., given the query (Microsoft, Corporation), Microsoft is more specific than Corporation.

  4. Factors that impact Ranking in IR • Links provide more structure to documents (unlike earlier IR systems, which conformed to the bag-of-words model). This is especially relevant in the web context. • Popularity of a page with respect to the relevance of the query. • Look at the popularity of pages and links to capture people's opinions, instead of relying on other ad hoc methods. • All of the previous methods are approximations of this notion of popularity.

  5. Vector Space Model • Was used in the earliest IR systems; uses the relative frequency and specificity of query keywords to come up with a ranking function. • Extends the idea of a vector space to language in order to express the simple heuristic idea of the bag-of-words model. • The idea is to view the space of documents as a set of points in a very high-dimensional space.

  6. Relative Frequency in the Vector Space Model • Consider a matrix where each column represents a distinct word in the English language and each row represents a document/page. • How do you represent which words a document contains? • A Boolean matrix is used to model the vector space, where: • '0' means a particular word does not belong to that document • '1' means a particular word belongs to that document • This information can be used to find the relative frequency of query keywords. [Slide figure: a Boolean matrix with rows 1 … N for documents and columns for words w1, w2, w7, w12, w14, …]
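The Boolean incidence matrix above can be sketched in a few lines. A minimal example over a tiny hypothetical corpus (document names and contents are made up for illustration):

```python
# A sketch of the Boolean term-document matrix: rows are documents,
# columns are distinct words; 1 means the word occurs in the document.
docs = {
    "d1": "microsoft corporation announces earnings",
    "d2": "the corporation filed a report",
    "d3": "microsoft released new software",
}

# Vocabulary: one column per distinct word.
vocab = sorted({w for text in docs.values() for w in text.split()})

# Boolean matrix, one row per document.
matrix = {
    doc: [1 if w in text.split() else 0 for w in vocab]
    for doc, text in docs.items()
}

def doc_frequency(word):
    """Document frequency of a word = number of rows with a 1 in its column."""
    col = vocab.index(word)
    return sum(row[col] for row in matrix.values())

print(doc_frequency("microsoft"))  # occurs in d1 and d3 -> 2
```

The matrix is dense here for clarity; a real system would store only the nonzero entries, a concern slide 22 returns to.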

  7. Specificity in Vector Space Model • E.g., Consider the query: (Microsoft, Corporation) • We assume that Microsoft is more specific than Corporation. • The number of documents to which a word belongs is inversely proportional to the specificity of the word. This is called ‘Inverse Document Frequency (IDF)’.

  8. Inverse Document Frequency • Let the document frequency df(w) be the number of documents to which 'w' belongs. • Inverse Document Frequency: IDF(w) = N / df(w), where N is the total number of documents. • This is too strong a definition and we need to dampen it, because some words can be more/less important than others. • Using a logarithmic dampener: IDF(w) = log(N / df(w)).
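The two IDF definitions above can be written out directly. A small sketch, assuming a hypothetical collection of 1000 documents and made-up document frequencies:

```python
import math

N = 1000  # total number of documents in the (hypothetical) collection

def idf_raw(df):
    """Undampened IDF: inversely proportional to document frequency."""
    return N / df

def idf(df):
    """Log-dampened IDF, so common words are not weighted too harshly."""
    return math.log(N / df)

# Suppose "corporation" is common (df = 500) and "microsoft" rarer (df = 50):
print(idf(500))  # ~0.69
print(idf(50))   # ~3.00 -> the rarer word is more specific
```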

  9. Inverse Document Frequency • An IDF value is associated with every word. • IDF cannot be captured using the matrix representation in slide 6 (that representation captures the term frequency and selectivity of query keywords). • Use another vector to capture Inverse Document Frequency (together, these capture the relative frequency of query keywords).

  10. Term Frequency • Term Frequency is defined for a (word, document) pair. • Given a word 'w' and a document 'd': • Term Frequency TF(w, d) = (number of times 'w' occurs in 'd') / (number of words in 'd').
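The term-frequency definition above, sketched on a toy document:

```python
def term_frequency(w, d):
    """TF(w, d) = (occurrences of w in d) / (total number of words in d)."""
    words = d.split()
    return words.count(w) / len(words)

d = "to be or not to be"
print(term_frequency("to", d))  # 2 occurrences out of 6 words -> 1/3
```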

  11. Scoring function • A mathematical function defining the score of a document with respect to a query: • Assume a query is a bunch of keywords. • Let Q = {w1, w2, …, wk} be a query. • What are the arguments/domain on which the scoring function works? • Documents and words in queries

  12. Scoring function • A vector in a 2-dimensional space is represented as ax + by. Similarly, vectors can be represented in a high-dimensional space. • The coefficients 'a' and 'b' define the vector: if we move 'a' units along the x-axis and 'b' units along the y-axis, we hit the tip of the vector ax + by.

  13. Vector • A vector is an ordered list of coefficients such that each item in the list defines the strength of the vector in that particular dimension. • Each word represents a dimension in the vector space model. If there are 'n' words, then this will be an n-dimensional vector. • Each document represents a vector in the high-dimensional vector space.

  14. Vector Space Model • The coefficients of each document vector correspond to term frequency. • Term Frequency – tells us how important a word is with respect to a document. • Inverse Document Frequency – tells us in how many documents a word occurs, and hence how important/specific it is. • Score(q, d) – 1) use the angle between the query and document vectors, or 2) use the distance between the tips of the two vectors.

  15. Scoring function • If we treat a query as a small document, then it too can be thought of as a vector. • Only the words in the query will have a '1'; all other words will have a '0' entry. [Slide figure: query vector q among document vectors d1, d2, d3]

  16. Geometric Interpretation of Score • Intuitively, if the score of a document is high, then the document is closer to the query. • There are two ways of finding this score: • Find the angle between the query vector and the document vector. • Find the distance between the tips of the document vector and the query vector, and take the closest. • Do these two methods produce the same result? Maybe not.

  17. Geometric Interpretation of Score • This is because vectors are of arbitrary length. • SOLUTION: Make the vectors the same length. • Why do vectors have different lengths? • If one document has 20 words and another has 2 words, the lengths will differ. • How can we make the vectors the same length? • Using a unit sphere: rescale all vectors so that all of them sit on the surface of the unit sphere.

  18. Normalizing Vectors • If all the vectors sit on the surface of the unit sphere, then sorting them by their angles from the query vector and sorting them by their distances from it give the same result. • Use the angle measurement after normalizing all vectors to the same length. • This entire process is called 'TF-IDF ranking with cosine similarity'. [Slide figure: normalized vectors d1, d2, d3 and q on the unit sphere]
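The point from slides 16–18, that angle and tip distance can disagree before normalization but agree afterwards, can be checked numerically. A sketch with made-up 2-dimensional vectors:

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    """Rescale v to unit length: same direction, tip on the unit sphere."""
    n = norm(v)
    return [x / n for x in v]

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

q = [1.0, 1.0]
d1 = [50.0, 50.0]  # same direction as q, but a much "longer" document
d2 = [1.0, 0.0]    # different direction, but similar length to q

# Before normalizing, the two criteria disagree:
print(cosine(q, d1) > cosine(q, d2))      # True: d1 is a perfect angular match
print(distance(q, d1) < distance(q, d2))  # False: by raw distance, d2 looks closer

# After normalizing, distance and angle give the same ordering:
qn, d1n, d2n = normalize(q), normalize(d1), normalize(d2)
print(distance(qn, d1n) < distance(qn, d2n))  # True
```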

  19. Scoring function • A small angle is good (it means the document is very similar to the query). • A small angle means a large cosine. Hence, replace 'angle' with 'cosine' so that now large means good. • Use this in scoring functions, since the dot product of unit vectors equals the cosine of the angle between them.

  20. Calculating the Score • Let Q = {w1, w2, …, wm} be the query, with an IDF value associated with each word. • To find the score of a document: • Normalize the vectors • Find the dot product of the document vector and the query vector. [Slide figure: the N-row term-frequency matrix over words w1, w2, …, wm, together with the query Q and the IDF vector]

  21. Calculating the Score of a document • Normalizing vectors: divide each document vector by its length, d̂ = d / |d|. • Finding the dot product of vectors: Score(Q, d) = sum over words 'w' in Q of IDF(w) · TF(w, d) / |d|. • For every word in the query, find the corresponding entry in the normalized document vector, take the product of each pair of terms, and sum them all up to get the score.
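The whole pipeline of slides 20–21 (TF, IDF, normalization, dot product) can be sketched end to end. The corpus and query below are hypothetical, and the score uses the simple log-dampened IDF from slide 8:

```python
import math

docs = {
    "d1": "microsoft corporation earnings report",
    "d2": "the corporation and the corporation report",
    "d3": "microsoft software update",
}
N = len(docs)

def tf(w, words):
    return words.count(w) / len(words)

def idf(w):
    df = sum(1 for text in docs.values() if w in text.split())
    return math.log(N / df) if df else 0.0

def score(query, doc_text):
    """Dot product of the normalized TF-IDF document vector with the query."""
    words = doc_text.split()
    vec = {w: tf(w, words) * idf(w) for w in set(words)}
    length = math.sqrt(sum(x * x for x in vec.values())) or 1.0
    return sum(vec.get(w, 0.0) / length for w in query)

query = ["microsoft", "corporation"]
ranked = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
print(ranked[0])  # d1, the only document containing both query words
```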

  22. Conclusion • Given a collection of documents and an incoming query, how will you find the top-k documents? • Do the preprocessing earlier, by creating the required data structures. • Concerns with this approach: • Do you run the query against all documents and words, especially when there are lots of sparse entries? • How do you take the ranking function and speed up this implementation? • How will you create these data structures? Do you create data structures for all words, yielding a sparse matrix, or not?
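One standard answer to the sparsity concern above is an inverted index, so that only documents containing at least one query word are ever scored. A minimal sketch; the corpus is made up and the score is a toy count rather than the full TF-IDF function:

```python
from collections import defaultdict
import heapq

docs = {
    "d1": "microsoft corporation earnings",
    "d2": "corporation report",
    "d3": "unrelated text entirely",
}

# Inverted index: word -> set of documents containing it.
# Only nonzero entries are stored, avoiding a dense sparse matrix.
index = defaultdict(set)
for doc, text in docs.items():
    for w in text.split():
        index[w].add(doc)

def top_k(query, k):
    # Candidates: union of the posting lists for the query words.
    candidates = set().union(*(index[w] for w in query if w in index))
    # Toy score: how many query words the document contains.
    scored = [(sum(w in docs[d].split() for w in query), d) for d in candidates]
    return heapq.nlargest(k, scored)

print(top_k(["microsoft", "corporation"], 2))  # d3 is never touched
```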
