
Information Retrieval and Search Engines

Presentation Transcript


  1. Information Retrieval and Search Engines Lecture 8: Matrix Decompositions and Latent Semantic Indexing Prof. Michael R. Lyu

  2. Motivation of This Lecture • Could we move beyond the vector space model to discover semantics among words? • Could we find a representation that is capable of capturing synonymy? • E.g., “car” and “automobile”; “boat” and “ship” • How could we apply low-rank approximation in IR?

  3. Motivation of This Lecture • How can we capture the semantics of an object? • Answer: remove details (color) and pay attention to the important dimension (shape) • (Figure: an image of a grape, recognizable by its shape alone)

  4. Outline (Ch. 18 of IR Book) • Recap • Linear Algebra Review • Singular Value Decompositions • Low-Rank Approximations • Latent Semantic Indexing

  5. The Document Ranking Problem • Ranked retrieval setup • Given a collection of documents, the user issues a query, and an ordered list of documents is returned • Assume a binary notion of relevance • R_d,q is a random dichotomous variable • R_d,q = 1, if document d is relevant w.r.t. query q • R_d,q = 0, otherwise • Probabilistic ranking orders documents decreasingly by their estimated probability of relevance w.r.t. the query: P(R = 1|d, q)

  6. Binary Independence Model • Probabilistic retrieval strategy • Find measurable statistics (term frequency, document frequency, document length) • Relevance of each document is independent of other documents • Estimate the probability of document relevance • Order documents by decreasing estimated probability of relevance P(R|d,q)

  7. Deriving a Ranking Function for Query Terms • Documents can equivalently be ranked by the logarithm of the odds-ratio product from the BIM derivation, since log is a monotonic function • The resulting quantity used for ranking is called the Retrieval Status Value (RSV) in this model

  8. Okapi BM25: A Nonbinary Model • If the query is long, we might also use similar weighting for query terms • RSV_d = Σ_{t∈q} log(N/df_t) · [(k1+1)·tf_td] / [k1·((1-b) + b·(L_d/L_ave)) + tf_td] · [(k3+1)·tf_tq] / [k3 + tf_tq] • tf_td: term frequency in the document d • tf_tq: term frequency in the query q • k1/k3: tuning parameters controlling term frequency scaling of the document/query • b: tuning parameter controlling scaling of the document length (L_d: length of d, L_ave: average document length) • The three factors correspond to the document frequency, the document term frequency with document length scaling, and the term frequency in the query
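As an illustration of this weighting scheme, here is a minimal Python sketch of the extended BM25 formula above. The function name and the dictionary-based inputs (term-frequency maps, a document-frequency table, corpus size N, document lengths) are hypothetical choices for the example, not part of the lecture.

```python
import math

def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, df, N,
               k1=1.2, k3=8.0, b=0.75):
    """Extended Okapi BM25 with query-term weighting (sketch of the slide's formula)."""
    score = 0.0
    for term, tf_tq in query_tf.items():
        tf_td = doc_tf.get(term, 0)
        if tf_td == 0 or term not in df:
            continue
        idf = math.log(N / df[term])                       # document-frequency factor
        doc_part = ((k1 + 1) * tf_td) / (
            k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf_td)   # doc tf + length scaling
        query_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)     # term frequency in the query
        score += idf * doc_part * query_part
    return score
```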

  9. Language Models • How to come up with good queries? • Think of words that would likely appear in a relevant document • E.g.: we would like to know when and where the Olympic games started, a good query could be “Olympic history” • Language modeling approach to information retrieval • A document is a good match to a query if the document model is likely to generate the query • Rank documents based on the probability P(q|Md)

  10. Query Likelihood Language Model • Construct for each document d a language model M_d • Rank documents by P(d|q): the likelihood that d is relevant to the query q • P(d|q) = P(q|d)P(d)/P(q) • P(q): the same for all documents • P(d): prior probability; could include criteria such as authority, length, etc. For simplicity, we assume it is uniform • P(q|d): the probability that the query would be observed as a random sample from the document model M_d
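A small sketch of ranking by query likelihood under a unigram document model. The slide does not specify a smoothing method, so Jelinek-Mercer interpolation with the collection model is assumed here; the function and argument names are illustrative only.

```python
import math

def query_log_likelihood(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    """Log P(q|Md) under a unigram model with (assumed) Jelinek-Mercer smoothing."""
    log_p = 0.0
    for t in query_terms:
        p_doc = doc_tf.get(t, 0) / doc_len        # maximum-likelihood estimate from d
        p_coll = coll_tf.get(t, 0) / coll_len     # collection-level estimate
        # Assumes every query term occurs somewhere in the collection (p_coll > 0)
        log_p += math.log(lam * p_doc + (1 - lam) * p_coll)
    return log_p   # rank documents by this value; P(d) is assumed uniform
```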

  11. Extended Language Modeling • Three ways of developing the language modeling approach • (a) Query likelihood: P(q|Md) • (b) Document likelihood: P(d|Mq) • (c) Model comparison: KL(Md||Mq)

  12. Outline (Ch. 18 of IR Book) • Recap • Linear Algebra Review • Singular Value Decompositions • Low-Rank Approximations • Latent Semantic Indexing

  13. Linear Algebra Review • C: M*N matrix with real-valued entries, e.g., term-document matrix • Rank of a matrix: number of linearly independent rows (or columns) in it, rank(C) <= min{M, N} • Diagonal matrix: square r*r matrix all of whose off-diagonal entries are zero, rank equal to number of nonzero diagonal entries • Identity matrix: all r diagonal entries of a diagonal matrix are 1
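A quick numeric check of these definitions with NumPy; the example matrices below are assumed for illustration, not taken from the slides.

```python
import numpy as np

C = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])        # second row = 2 * first row
print(np.linalg.matrix_rank(C))        # 1, which is <= min(M, N) = 2

D = np.diag([3.0, 0.0, 5.0])           # diagonal matrix
print(np.linalg.matrix_rank(D))        # 2 = number of nonzero diagonal entries

print(np.eye(3))                       # 3*3 identity matrix
```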

  14. Linear Algebra Review • Eigenvalue • C: M*M matrix • x: an M-vector that is not all zeros (a right eigenvector) • λ: an eigenvalue of C if Cx = λx • Principal eigenvector: the eigenvector corresponding to the eigenvalue of largest magnitude • Calculate eigenvalues by solving the characteristic equation det(C - λI_M) = 0, where det(·) denotes the determinant of a square matrix
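A small NumPy sketch of these definitions; the 2*2 matrix is an assumed example, not one used in the lecture.

```python
import numpy as np

C = np.array([[2.0, 1.0],
              [1.0, 2.0]])                        # assumed 2*2 example matrix

# Eigenvalues solve the characteristic equation det(C - lambda*I) = 0
eigvals, eigvecs = np.linalg.eig(C)
print(eigvals)                                    # 3.0 and 1.0 (order may vary)

# Principal eigenvector: the one whose eigenvalue has the largest magnitude
i = np.argmax(np.abs(eigvals))
x = eigvecs[:, i]
print(np.allclose(C @ x, eigvals[i] * x))         # C x = lambda x  -> True
```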

  15. Linear Algebra Review • Example: S is a 3*3 matrix with rank(S) = 3 and eigenvalues λ1 = 30, λ2 = 20, λ3 = 1 • Multiply S by an arbitrary vector: the smaller the eigenvalue, the lesser its effect on the product

  16. Linear Algebra Review • If x is an arbitrary vector, the effect of multiplication by S is determined by the eigenvalues and eigenvectors of S • The effect of small eigenvalues (and their eigenvectors) on a matrix-vector product is small • For a symmetric matrix S, eigenvectors corresponding to distinct eigenvalues are orthogonal
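To make this concrete, the sketch below uses a diagonal matrix whose eigenvalues match the previous slide (30, 20, 1); the concrete matrix and the test vector are assumed for illustration.

```python
import numpy as np

S = np.diag([30.0, 20.0, 1.0])       # assumed matrix with the slide's eigenvalues
x = np.array([2.0, 4.0, 6.0])        # an arbitrary vector

y = S @ x
print(y)                             # [60. 80.  6.]

# Zeroing the smallest eigenvalue barely changes the product:
S_approx = np.diag([30.0, 20.0, 0.0])
print(S_approx @ x)                  # [60. 80.  0.]
print(np.linalg.norm(y - S_approx @ x) / np.linalg.norm(y))   # ~0.06 relative change
```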

  17. In Class Practice 1 • Link

  18. Linear Algebra Review • Matrix decomposition: a square matrix can be factored into the product of matrices derived from its eigenvectors • Eigendecomposition: if S is a square real-valued M*M matrix with M linearly independent eigenvectors, there exists an eigendecomposition S = UΛU^-1 • The columns of U are the eigenvectors of S, and Λ is a diagonal matrix whose diagonal entries are the eigenvalues of S in decreasing order
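A minimal NumPy check of the eigendecomposition S = UΛU^-1, using an assumed 2*2 example matrix:

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])                           # assumed example matrix

eigvals, U = np.linalg.eig(S)                        # columns of U are eigenvectors of S
order = np.argsort(eigvals)[::-1]                    # eigenvalues in decreasing order
eigvals, U = eigvals[order], U[:, order]
Lam = np.diag(eigvals)                               # Lambda: diagonal matrix of eigenvalues

print(np.allclose(S, U @ Lam @ np.linalg.inv(U)))    # S = U Lambda U^-1  -> True
```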

  19. In Class Practice 2 • Link

  20. Linear Algebra Review • Symmetric diagonal decomposition: if S is a square, symmetric real-valued M*M matrix with M linearly independent eigenvectors, there exists a symmetric diagonal decomposition S = QΛQ^T • The columns of Q are the orthogonal and normalized (unit-length, real) eigenvectors of S, and Λ is the diagonal matrix whose entries are the eigenvalues of S. Q^-1 = Q^T
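The same check for the symmetric case, again with an assumed example matrix; np.linalg.eigh returns orthonormal eigenvectors for symmetric input, so Q^-1 = Q^T holds numerically.

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])                      # symmetric example matrix (assumed)

eigvals, Q = np.linalg.eigh(S)                  # eigh is meant for symmetric matrices
Lam = np.diag(eigvals)

print(np.allclose(S, Q @ Lam @ Q.T))            # S = Q Lambda Q^T  -> True
print(np.allclose(np.linalg.inv(Q), Q.T))       # Q^-1 = Q^T        -> True
```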

  21. Outline (Ch. 18 of IR Book) • Recap • Linear Algebra Review • Singular Value Decompositions • Low-Rank Approximations • Latent Semantic Indexing

  22. Singular Value Decompositions • Term-document matrix C: M*N, rank(C) = r • Singular value decomposition: C = UΣV^T • U: M*M matrix whose columns are the unit-length orthogonal eigenvectors of CC^T • V: N*N matrix whose columns are the unit-length orthogonal eigenvectors of C^TC • The eigenvalues λ1, λ2, ..., λr of CC^T are the same as the eigenvalues of C^TC • For 1 <= i <= r, let σi = √λi, with λi >= λi+1. Then the M*N matrix Σ is composed by setting Σii = σi for 1 <= i <= r, and zero otherwise • The σi are the singular values of C
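A sketch verifying these properties on an assumed tiny term-document matrix (not the lecture's example):

```python
import numpy as np

# Assumed tiny term-document matrix: M = 4 terms, N = 3 documents
C = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

U, sigma, VT = np.linalg.svd(C)                  # full SVD: C = U Sigma V^T
Sigma = np.zeros_like(C)
Sigma[:len(sigma), :len(sigma)] = np.diag(sigma)
print(np.allclose(C, U @ Sigma @ VT))            # True

# Singular values are the square roots of the eigenvalues of C^T C (and of C C^T)
eigvals = np.sort(np.linalg.eigvalsh(C.T @ C))[::-1]
print(np.allclose(sigma, np.sqrt(eigvals)))      # True
```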

  23. Reduced SVD • Represent Σ as an r*r matrix with the singular values on the diagonal, because all entries outside this submatrix are zero • Omit the rightmost M-r columns of U, which correspond to the omitted rows of Σ • Likewise, omit the rightmost N-r columns of V because they correspond in V^T to the rows that will be multiplied by the N-r columns of zeros in Σ
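In NumPy the reduced (thin) form is obtained with full_matrices=False; the example matrix below is assumed and has full rank, so r = min(M, N) here.

```python
import numpy as np

C = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])                  # same assumed 4*3 example, rank 3

# Reduced SVD: keep only the first r columns of U and V
U_r, sigma_r, VT_r = np.linalg.svd(C, full_matrices=False)
print(U_r.shape, sigma_r.shape, VT_r.shape)      # (4, 3) (3,) (3, 3)
print(np.allclose(C, U_r @ np.diag(sigma_r) @ VT_r))   # True
```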

  24. Singular Value Decompositions • (Example figure) Each column vector is unit length; column and row vectors are orthogonal • Reduced SVD: omit two rows in the middle matrix Σ and the corresponding two columns in the left matrix U

  25. In Class Practice 3 • Link

  26. C = UΣV^T : The matrix C • This is a standard term-document matrix. Actually, we use an unweighted matrix here to simplify the example.

  27. C = UΣV^T : The matrix U • One row per term, one column per topic • This is an orthonormal matrix: • Column vectors have unit length • Any two distinct column vectors are orthogonal to each other • Think of the dimensions as “semantic” dimensions that capture distinct topics like politics, sports, economics • Each number uij in the matrix indicates how strongly related term i is to the topic represented by semantic dimension j • (Figure: the example U matrix, with its two semantic dimensions labeled “nature” and “sport”)

  28. C = UΣV^T : The matrix Σ • The diagonal consists of the singular values of C • The magnitude of the singular value measures the importance of the corresponding semantic dimension • We’ll make use of this by omitting unimportant dimensions

  29. C = UΣV^T : The matrix V^T • One column per document, one row per semantic dimension (there are min(M, N) of them, where M is the number of terms and N is the number of documents) • Orthonormal matrix: • Row vectors have unit length • Any two distinct row vectors are orthogonal to each other • Each number vij in the matrix indicates how strongly related document j is to the topic represented by semantic dimension i

  30. C = UΣV^T : All four matrices • (Figure: the matrices C, U, Σ, and V^T shown together, with the semantic dimensions highlighted)

  31. Outline (Ch. 18 of IR Book) • Recap • Linear Algebra Review • Singular Value Decompositions • Low-Rank Approximations • Latent Semantic Indexing

  32. Low-Rank Approximations • Given an M*N matrix C, a positive integer k, find an M*N matrix Ck of rank at most k, so as to minimize the Frobenius norm of the matrix difference X = C - Ck, ||X||_F = √(Σi Σj Xij^2) • If r is the rank of C, clearly Cr = C. When k is far smaller than r, refer to Ck as a low-rank approximation

  33. Low-Rank Approximations • Solve low-rank matrix approximation using SVD • Given C, construct its SVD: C = UΣV^T • Derive Σk from Σ by replacing with zeros the r-k smallest singular values on the diagonal of Σ • Compute and output Ck = UΣkV^T as the rank-k approximation to C • By the Eckart and Young theorem, the above procedure yields the matrix of rank k with the lowest possible Frobenius error • (Figure: dashed (inner) boxes indicate the matrix entries affected by zeroing out the smallest singular values; physical meaning: omit those semantic dimensions)
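A minimal sketch of this procedure, with a check of the Eckart and Young result (the Frobenius error of the rank-k approximation equals the root of the sum of the squared discarded singular values); the example matrix is random and purely illustrative.

```python
import numpy as np

def low_rank_approx(C, k):
    """Rank-k approximation of C via truncated SVD (sketch of the slide's procedure)."""
    U, sigma, VT = np.linalg.svd(C, full_matrices=False)
    sigma_k = sigma.copy()
    sigma_k[k:] = 0.0                            # zero out all but the k largest singular values
    return U @ np.diag(sigma_k) @ VT             # Ck = U Sigma_k V^T

C = np.random.default_rng(0).random((6, 4))      # arbitrary example matrix
C2 = low_rank_approx(C, 2)

# Frobenius error equals sqrt of the sum of the discarded sigma_i^2
sigma = np.linalg.svd(C, compute_uv=False)
print(np.isclose(np.linalg.norm(C - C2, 'fro'),
                 np.sqrt(np.sum(sigma[2:] ** 2))))   # True
```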

  34. Outline (Ch. 18 of IR Book) • Recap • Linear Algebra Review • Singular Value Decompositions • Low-Rank Approximations • Latent Semantic Indexing

  35. Term Document Matrix • This matrix is the basis for computing the similarity between documents and queries. Today: Can we transform this matrix so that we get a better measure of similarity between documents and queries?

  36. Latent Semantic Indexing • Latent semantic indexing (LSI) • Approximate a term-document matrix C by one of lower rank using SVD • The low-rank approximation to C yields a new representation for each document in the collection • Casting queries into the low-rank representation, we compute query-document similarity in this low-rank representation

  37. Latent Semantic Indexing • LSI takes documents that are semantically similar (= talk about the same topics) but are not similar in the vector space (because they use different words) • It re-represents them in a reduced vector space in which they have higher similarity • LSI addresses the problems of synonymy and semantic relatedness • Standard vector space: synonyms contribute nothing to document similarity • Desired effect of LSI: synonyms contribute strongly to document similarity

  38. Latent Semantic Indexing • The dimensionality reduction forces us to omit a lot of “detail” • We have to map different words (= different dimensions of the full space) to the same dimension in the reduced space • The “cost” of mapping synonyms to the same dimension is much less than the cost of collapsing unrelated words • SVD selects the “least costly” mapping. Thus, it will map synonyms to the same dimension. But it will avoid doing that for unrelated words

  39. Advantage of Vector Space Representation • Advantages of vector space representation • Uniform treatment of queries and documents as vectors • Score computation based on cosine similarity • Ability to weight different terms differently • Extension beyond document retrieval to other applications, e.g. clustering, classification

  40. Disadvantage of Vector Space Representation • Disadvantages of vector space representation • Synonymy • Two different words (car, automobile) have the same meaning. The vector space representation fails to capture the relationship between synonymous terms • The computed similarity between a query (car) and a document containing both (car) and (automobile) underestimates the true similarity a user would perceive • Polysemy • A term has multiple meanings • The computed similarity overestimates the similarity a user would perceive

  41. The Problem • Example: vector space model (from Lillian Lee) • Doc 1: auto, engine, bonnet, tyres, lorry, boot • Doc 2: car, emissions, hood, make, model, trunk • Doc 3: make, hidden, Markov, model, emissions, normalize • Synonymy: Doc 1 and Doc 2 will have a small cosine but are related • Polysemy: Doc 2 and Doc 3 will have a large cosine but are not truly related

  42. Latent Semantic Indexing • Latent semantic indexing was proposed to address these two problems with the vector space model for IR

  43. Latent Semantic Indexing • LSI • The original term-document matrix C is likely to have several tens of thousands of rows and columns, and a rank in the tens of thousands • Map each row/column (term/document) to a k-dimensional space • The space is defined by the k principal eigenvectors (corresponding to the largest eigenvalues) of CC^T and C^TC • Use the new k-dimensional LSI representation to compute similarities between vectors • A query or document vector q is mapped into its representation in the LSI space by the transformation q_k = Σk^-1 Uk^T q • Note: the reduced SVD Ck = UkΣkVk^T can be rearranged to Vk^T = Σk^-1 Uk^T Ck; the query mapping q_k is just this same transformation applied to a single vector, i.e., a special case of a column of Vk^T
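A sketch of the whole LSI pipeline on an assumed toy matrix: compute the reduced SVD, map a query with q_k = Σk^-1 Uk^T q, and rank documents by cosine similarity against their reduced representations (the columns of Vk^T). All data here are illustrative, not the lecture's example.

```python
import numpy as np

# Assumed tiny term-document matrix: M = 5 terms, N = 4 documents
C = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0, 1.0]])

k = 2
U, sigma, VT = np.linalg.svd(C, full_matrices=False)
U_k, Sigma_k, VT_k = U[:, :k], np.diag(sigma[:k]), VT[:k, :]   # rank-k factors

# Map a query vector into the k-dimensional LSI space: q_k = Sigma_k^-1 U_k^T q
q = np.array([1.0, 1.0, 0.0, 0.0, 0.0])          # query containing terms 1 and 2
q_k = np.linalg.inv(Sigma_k) @ U_k.T @ q

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Documents in the LSI space are the columns of V_k^T; rank them by cosine similarity
scores = [cosine(q_k, VT_k[:, j]) for j in range(C.shape[1])]
print(np.argsort(scores)[::-1])                  # documents ordered by LSI similarity
```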

  44. How We Use the SVD in LSI • Key property: each singular value tells us how important its dimension is • By setting less important dimensions to zero, we keep the important information but get rid of the “details” • These details may be noise; in that case, the reduced LSI representation is better because it is less noisy • They may also make things dissimilar that should be similar; again, the reduced LSI representation is better because it represents similarity better

  45. Analogy for “Fewer Details is Better” • Image of a bright red flower • Image of a black and white flower • Omitting color makes it easier to see the similarity

  46. SVD of C = UΣV^T • (Figure: the SVD term matrix U and the SVD document matrix V^T for the example matrix C)

  47. Zeroing Out All but the Two Largest Singular Values • Actually, we only zero out singular values in Σ. This has the effect of setting the corresponding dimensions in U and V^T to zero when computing the product C = UΣV^T.

  48. Reducing the Dimensionality to 2

  49. Original Matrix C vs. Reduced C2 = UΣ2V^T • We can view C2 as a two-dimensional representation of the matrix. We have performed a dimensionality reduction to two dimensions.

  50. Why the Reduced Matrix is “Better” • Similarity of d2 and d3 in the original space: 0 • Similarity of d2 and d3 in the reduced space: 0.52 * 0.28 + 0.36 * 0.16 + 0.72 * 0.36 + 0.12 * 0.20 + (-0.39) * (-0.08) ≈ 0.52
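The slide's arithmetic can be reproduced directly from the two reduced document columns it lists (the five entries of d2 and d3 in C2, taken from the slide):

```python
import numpy as np

# Columns of the reduced matrix C2 for d2 and d3, using the values from the slide
d2 = np.array([0.52, 0.36, 0.72, 0.12, -0.39])
d3 = np.array([0.28, 0.16, 0.36, 0.20, -0.08])

# Dot-product similarity in the reduced space: nonzero, unlike in the original C
print(round(float(d2 @ d3), 2))    # 0.52
```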
