Latent Semantic Indexing


  1. Latent Semantic Indexing • These slides are compiled from the work of Thomas Hofmann (Brown University), Wikipedia, and the other sites cited in the slide titles.

  2. Outline • Eigenvalues and eigenvectors – a reminder • Not absolutely necessary for understanding the Singular Value Decomposition (SVD), but closely related to it. • Latent Semantic Indexing • The need for it and a coarse summary • Singular Value Decomposition

  3. Latent Semantic Analysis via SVD

  4. Vector Space Model: Pros • Automatic selection of index terms • Partial matching of queries and documents (dealing with the case where no document contains all search terms) • Ranking according to similarity score (dealing with large result sets) • Term weighting schemes (improve retrieval performance) • Various extensions • Document clustering • Relevance feedback (modifying the query vector) • Geometric foundation

  5. Problems with Vector Space Models • Ambiguity and association in natural language • Polysemy: Words often have a multitude of meanings and different types of usage (more urgent for very heterogeneous collections). • The vector space model is unable to discriminate between different meanings of the same word. • Synonymy: Different terms may have an identical or a similar meaning (weaker: words indicating the same topic). • No associations between words are made in the vector space representation.

  6. Polysemy and Context • Document similarity on the single-word level: polysemy and context • [Figure: two word clusters – meaning 1 (planet): ring, jupiter, space, voyager, saturn; meaning 2 (car company): dodge, ford. A shared word contributes to similarity if used in the 1st meaning, but not if used in the 2nd.]

  7. Summarized from http://www.puffinwarellc.com/index.php/news-and-articles/articles/33-latent-semantic-analysis-tutorial.html • Latent Semantic Indexing (LSI) means analyzing documents to find the underlying/latent meaning/semantics or concepts of those documents. • The fundamental difficulty in finding relevant documents from search words is that what we really want is to compare the meanings or concepts behind the words. • LSA attempts to solve this problem by mapping both words and documents into a "concept" space and doing the comparison in this space.

  8. Summarized from http://www.puffinwarellc.com/index.php/news-and-articles/articles/33-latent-semantic-analysis-tutorial.html • Documents are represented as "bags of words", where the order of the words in a document is not important, only how many times each word appears in a document. • Concepts are represented as patterns of words that usually appear together in documents. For example, "jaguar", "car", and "speed" might usually appear in documents about sports cars, whereas "jaguar", "animal", "hunting" might refer to the concept of jaguar the animal. • LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. • LSI uses Singular Value Decomposition for the mapping of terms to concepts.

  9. Advantages of LSI • LSI overcomes two of the most problematic constraints of Boolean keyword queries: • multiple words that have similar meanings (synonymy) • words that have more than one meaning (polysemy). • Text does not need to be in sentence form for LSI to be effective. It can work with lists, free-form notes, email, web content, etc. • LSI is also used to perform automated document categorization and clustering. In fact, several experiments have demonstrated that there are a number of correlations between the way LSI and humans process and categorize text.

  10. mm mn V is nn The diagonal entries Σi,iof Σ are known as thesingular valuesofM. Themcolumns ofUand thencolumns ofVare called the left singular vectorsandright singular vectorsofA, respectively. Singular values. Singular Value Decomposition For an m n matrix Aof rank rthere exists a factorization (Singular Value Decomposition = SVD) as follows: orthonormal orthonormal

  11. SVD example • Let A be the 3×2 matrix shown on the slide, thus m = 3, n = 2. Its SVD A = UΣVᵀ is shown on the slide. • Typically, the singular values are arranged in decreasing order.
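As a small illustration of the factorization on the previous two slides (the slide's own matrix is only visible as an image, so the numbers below are made up), a minimal NumPy sketch:

```python
import numpy as np

# A hypothetical 3x2 matrix (m = 3, n = 2); these values are illustrative only.
A = np.array([[1.0, -1.0],
              [0.0,  1.0],
              [1.0,  0.0]])

# Full SVD: U is m x m, there are min(m, n) singular values, Vt is n x n.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

print(U.shape, Vt.shape)                  # (3, 3) (2, 2)
print(s)                                  # singular values, largest first
print(np.all(np.diff(s) <= 0))            # True: arranged in decreasing order
print(np.allclose(U.T @ U, np.eye(3)))    # True: columns of U are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(2)))  # True: columns of V are orthonormal

# Rebuild A = U * Sigma * V^T with Sigma as an m x n diagonal matrix.
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)
print(np.allclose(U @ Sigma @ Vt, A))     # True
```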

  12. Rank of a Matrix • The column rank of a matrix A is the maximum number of linearly independent column vectors of A. • The row rank of a matrix A is the maximum number of linearly independent row vectors of A. • Column rank and row rank are always equal; this common value is the rank of the matrix.
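A quick NumPy sketch of the definition above, using a small made-up matrix:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [2, 4, 6],   # twice row 1, so it adds nothing to the row rank
              [0, 1, 1]])

# Column rank and row rank always coincide, so numpy reports a single number.
print(np.linalg.matrix_rank(A))     # 2
print(np.linalg.matrix_rank(A.T))   # 2 -- the transpose has the same rank
```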

  13. Singular Value Decomposition • Illustration of SVD dimensions and sparseness

  14. SVD vs Eigen Decomposition • The singular value decomposition is very general in the sense that it can be applied to any m × n matrix, whereas eigenvalue decomposition can only be applied to certain classes of square matrices. Nevertheless, the two decompositions are related. • Given an SVD of M, as described above, the following two relations hold: M*M = V (Σ*Σ) V* and MM* = U (ΣΣ*) U*, where M* is the conjugate transpose of M. • The non-zero singular values of M (found on the diagonal entries of Σ) are the square roots of the non-zero eigenvalues of both M*M and MM*.

  15. mm mn V is nn Singular Value Decomposition M and A are the same in here From the above, we can derive that and extract them as such: The columns of V are orthogonal eigenvectors of ATA. The columns of U are orthogonal eigenvectors of AAT.

  16. A non-negative real number σ is a singular value for M if and only if there exist unit-length vectors u in Kᵐ and v in Kⁿ such that Mv = σu and M*u = σv. • The vectors u and v are called left-singular and right-singular vectors for σ, respectively. • Similar to the eigenvalue and eigenvector relation.
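A small NumPy sketch, with an illustrative matrix, checking both properties stated on the last two slides (real case, so M* is simply Mᵀ):

```python
import numpy as np

# A small hypothetical real matrix.
M = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [1.0, 1.0]])

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Non-zero singular values are the square roots of the eigenvalues of M^T M (and M M^T).
eig = np.sort(np.linalg.eigvalsh(M.T @ M))[::-1]
print(np.allclose(s, np.sqrt(np.clip(eig, 0.0, None))))   # True

# Singular value / singular vector relations: M v = sigma u and M^T u = sigma v.
for i, sigma in enumerate(s):
    u, v = U[:, i], Vt[i, :]
    print(np.allclose(M @ v, sigma * u), np.allclose(M.T @ u, sigma * v))
```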

  17. Low Rank Approximation

  18. Singular Value Decomposition • Illustration of SVD dimensions and sparseness

  19. Without knowing the rank, we can first find S as m×n and then reduce its dimensions by throwing away the zero columns and rows, obtaining an r×r matrix, where r is the rank of A. • The SVD then becomes A (m×n) = U (m×r) · S (r×r) · Vᵀ (r×n).
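A minimal NumPy sketch of this reduced SVD, assuming a small rank-deficient example matrix:

```python
import numpy as np

# A 4x3 matrix of rank r = 2: the third column is the sum of the first two.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [2.0, 0.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Throw away the (numerically) zero singular values and the matching columns/rows.
r = int(np.sum(s > 1e-10))
Ur, Sr, Vtr = U[:, :r], np.diag(s[:r]), Vt[:r, :]

print(r)                               # 2
print(Ur.shape, Sr.shape, Vtr.shape)   # (4, 2) (2, 2) (2, 3)
print(np.allclose(Ur @ Sr @ Vtr, A))   # True: A(mxn) = U(mxr) S(rxr) V^T(rxn)
```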

  20. LSI computes the term and document vector spaces by transforming the term-frequency matrix A into three other matrices: • an m by r term-concept vector matrix T, • an r by r singular values matrix S, • an n by r concept-document vector matrix D, where m is the number of unique terms and n is the number of documents, and the matrices satisfy A = T S Dᵀ, with T and D having orthonormal columns.

  21. At this point, we can further reduce dimensionality by zeroing the lowest entries in the S matrix, leaving only k non-zero entries, with k << r. • This is the same thing as having U, S and Vᵀ as m×k, k×k and k×n matrices.

  22. Low-rank Approximation • Solution via SVD: set the smallest r − k singular values to zero. • In column notation, A_k is a sum of k rank-1 matrices: A_k = σ₁u₁v₁ᵀ + … + σ_k u_k v_kᵀ.
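A short NumPy sketch of the truncation described on the last two slides, using a made-up matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 5))                 # a hypothetical 6x5 matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                  # keep only the k largest singular values

# Truncated reconstruction: the smallest r - k singular values are set to zero.
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Equivalent "column notation": A_k is the sum of the k leading rank-1 matrices.
Ak_as_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(k))

print(np.allclose(Ak, Ak_as_sum))      # True
print(np.linalg.matrix_rank(Ak))       # 2
```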

  23. After computing the SVD, the LSI approach reduces the rank of the singular value matrix S to size k << r, • typically with k on the order of 100 to 300 dimensions. • The SVD operation, along with this reduction, has the effect of preserving the most important semantic information in the text while reducing noise and other undesirable artifacts of the original space of A.

  24. Low-rank Approximation • Approximation problem: given k, find A_k of rank k that minimizes the Frobenius-norm error ‖A − X‖_F over all rank-k matrices X, where ‖X‖_F = √(Σᵢ Σⱼ Xᵢⱼ²). • A_k and X are both m×n matrices. • Typically, we want k << r.
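As a sanity check of the Frobenius-norm view (a sketch with a random matrix): the error of the rank-k truncation equals the square root of the sum of the discarded squared singular values.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((8, 6))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius norm of the approximation error vs. the discarded singular values.
print(np.isclose(np.linalg.norm(A - Ak, 'fro'), np.sqrt(np.sum(s[k:] ** 2))))  # True
```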

  25. What it is • From term-doc matrix A, we compute the approximation Ak. • There is a row for each term and a column for each doc in Ak • Thus terms and docs live in a space of k<<r dimensions • These dimensions are not the original axes • But why?

  26. Latent Semantic Analysis • Latent semantic space: illustrating example courtesy of Susan Dumais

  27. Nice full example: http://www.puffinwarellc.com/index.php/news-and-articles/articles/33-latent-semantic-analysis-tutorial.html?start=4

  28. Terms and Documents • Here are the 9 documents (book titles in this case), with index words underlined in the original slide: • The Neatest Little Guide to Stock Market Investing • Investing For Dummies, 4th Edition • The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns • The Little Book of Value Investing • Value Investing: From Graham to Buffett and Beyond • Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not! • Investing in Real Estate, 5th Edition • Stock Investing For Dummies • Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss

  29. Term-Document Frequencies
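The frequency table itself is only shown as an image, but it can be rebuilt from the titles on the previous slide. A sketch follows, with the caveat that the index-word list below is an assumption (the underlining is not visible in this transcript), so it is illustrative rather than the tutorial's exact choice:

```python
import re
import numpy as np

# The 9 book titles from the previous slide.
docs = [
    "The Neatest Little Guide to Stock Market Investing",
    "Investing For Dummies, 4th Edition",
    "The Little Book of Common Sense Investing: The Only Way to Guarantee "
    "Your Fair Share of Stock Market Returns",
    "The Little Book of Value Investing",
    "Value Investing: From Graham to Buffett and Beyond",
    "Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor "
    "and the Middle Class Do Not!",
    "Investing in Real Estate, 5th Edition",
    "Stock Investing For Dummies",
    "Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of "
    "Finding Hidden Profits Most Investors Miss",
]

# Assumed index-word list (hypothetical; the slide's underlined words may differ).
index_words = ["book", "dads", "dummies", "estate", "guide", "investing",
               "market", "real", "rich", "stock", "value"]

def tokenize(text):
    # lowercase and drop apostrophes so that "Dad's" matches the index word "dads"
    return re.findall(r"[a-z]+", text.lower().replace("'", ""))

# Term-document frequency matrix: one row per index word, one column per title.
A = np.array([[tokenize(d).count(w) for d in docs] for w in index_words])
print(A)
```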

  30. SVD and Low Rank Approximation • Rank-3 approximation

  31. In the example, the first dimension is thrown away, with the following explanation: • For documents, the first dimension correlates with the length of the document. • For words, it correlates with the number of times that word has been used in all documents. • How do we compute the representation of the query or of new documents in this space?

  32. The LSI Space of Terms and Docs • [Plot: red squares are terms, blue dots are documents.] • How do we generate this plot?

  33. Performing the maps • Each row and column of A gets mapped into the k-dimensional LSI space by the SVD. • Claim: this is not only the mapping with the best (Frobenius-error) approximation to A, but it in fact improves retrieval. • A query q is also mapped into this space, by q_k = qᵀ U_k S_k⁻¹. • The mapped query is NOT a sparse vector.

  34. Since A = U S Vᵀ, we have V = Aᵀ U S⁻¹ (just taking a transpose and using the orthonormality of U). • In other words, each row of V_k (i.e. each column of V_kᵀ, the representation of one document) is found by multiplying the corresponding document (a row of Aᵀ) with U_k S_k⁻¹.
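A NumPy sketch of both maps, assuming a made-up term-document matrix and query; it checks that AᵀU_kS_k⁻¹ reproduces V_k and then applies the same map to a query:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.integers(0, 3, size=(10, 6)).astype(float)   # hypothetical term-document counts

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Sk = U[:, :k], np.diag(s[:k])
Vk = Vt[:k, :].T                                      # one row per document

# Documents: V_k = A^T U_k S_k^{-1}, i.e. each document (a column of A, a row of A^T)
# is mapped by U_k S_k^{-1}.
docs_k = A.T @ Uk @ np.linalg.inv(Sk)
print(np.allclose(docs_k, Vk))                        # True

# A query is mapped the same way: q_k = q^T U_k S_k^{-1}; the result is dense.
q = np.zeros(10)
q[[1, 4]] = 1.0                                       # hypothetical query containing terms 1 and 4
q_k = q @ Uk @ np.linalg.inv(Sk)
print(q_k)
```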

  35. Example

  36. Example taken from www.miislita.com/.../latent-semantic-indexing-fast-track-tutorial.pdf • A "collection" consists of the following "documents": • d1: Shipment of gold damaged in a fire. • d2: Delivery of silver arrived in a silver truck. • d3: Shipment of gold arrived in a truck. • First approach: • stop words were not ignored • text was tokenized and lowercased • no stemming was used • terms were sorted alphabetically

  37. Problem: Use Latent Semantic Indexing (LSI) to rank these documents for the query gold silver truck. • Step 1: Set term weights and construct the term-document matrix A and the query matrix q.

  38. Step 2: Decompose matrix A using SVD and find the U, S and V matrices, where A = USVᵀ.

  39. [The resulting U, S and V matrices are shown on the slide; the columns of Vᵀ are labeled d1, d2, d3.]

  40. Step 3: Implement a rank-2 approximation by keeping the first two columns of U and V and the first two columns and rows of S.

  41. This is the truncated representation, where the 0 entries are not shown.

  42. Step 4: Find the new document vector coordinates in this reduced 2-dimensional space. • The columns of Vᵀ hold the coordinates of the individual document vectors: • d1 = (-0.4945, 0.6492) • d2 = (-0.6458, -0.7194) • d3 = (-0.5817, 0.2469)

  43. Step 5: Find the new query vector coordinates in the reduced 2-dimensional space: • q_k = qᵀ U_k S_k⁻¹

  44. Step 6: Rank documents in decreasing order of query-document cosine similarities.
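The whole worked example (Steps 1 to 6) can be reproduced in a few lines of NumPy; a sketch is shown below. Note that NumPy may flip the signs of some singular vectors relative to the numbers on the slides; the cosine similarities and the resulting ranking are unaffected.

```python
import numpy as np

docs = ["Shipment of gold damaged in a fire.",
        "Delivery of silver arrived in a silver truck.",
        "Shipment of gold arrived in a truck."]
query = "gold silver truck"

# Step 1: raw term frequencies; stop words kept, no stemming, terms sorted
# alphabetically (as stated on the earlier slide).
def tokenize(text):
    return text.lower().replace(".", "").split()

terms = sorted({t for d in docs for t in tokenize(d)})
A = np.array([[tokenize(d).count(t) for d in docs] for t in terms], dtype=float)
q = np.array([tokenize(query).count(t) for t in terms], dtype=float)

# Step 2: SVD, A = U S V^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Step 3: rank-2 approximation -- keep two columns of U and V and a 2x2 S.
k = 2
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Step 4: document coordinates are the columns of V^T_k (rows of V_k).
doc_coords = Vtk.T

# Step 5: map the query with q_k = q^T U_k S_k^{-1}.
q_k = q @ Uk @ np.linalg.inv(Sk)

# Step 6: rank documents by query-document cosine similarity, best first.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sims = [cosine(q_k, d) for d in doc_coords]
for i in np.argsort(sims)[::-1]:
    print(f"d{i + 1}: {sims[i]:+.4f}")
```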

  45. Try this on http://www.bluebit.gr/matrix-calculator/calculate.aspx with A = [ 1 1 0 ; 1 0 1 ; 1 0 1 ]

  46. U_k = [ -0.460  0.888 ;  -0.628  -0.325 ;  -0.628  -0.325 ], S_k = [ 2.175  0.000 ;  0.000  1.126 ], V_kᵀ = [ -0.789  -0.211  -0.577 ;  0.211  0.789  -0.577 ]. Multiplying out: U_k S_k = [ -1.001  1.000 ;  -1.366  -0.366 ;  -1.366  -0.366 ], and U_k S_k V_kᵀ = [ 1.001  1.000  0.001 ;  1.001  -0.001  0.999 ;  1.001  -0.001  0.999 ] ≈ A.
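Instead of the online calculator, the same exercise can be checked with a few lines of NumPy (a sketch):

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(A)
print(np.round(s, 3))            # approx. [2.175, 1.126, 0.0] -- A has rank 2

k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(Ak, 3))           # recovers A up to rounding, since rank(A) = 2
```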

  47. Latent Semantic Analysis - Summary • Perform a low-rank approximation of document-term matrix (typical rank 100-300) • General idea • Map documents (and terms) to a low-dimensional representation. • Design a mapping such that the low-dimensional space reflects semantic associations (latent semantic space). • Compute document similarity based on the inner product in this latent semantic space • Goals • Similar terms map to similar location in low dimensional space • Noise reduction by dimension reduction

  48. LSA has proved to be a valuable tool in many areas of NLP as well as IR: • summarization • cross-language IR • topic segmentation • text classification • question answering • more

  49. LSI has many other applications • In many settings in pattern recognition and retrieval, we have a feature-object matrix. • For text, the terms are features and the docs are objects. • Could be opinions and users … • This matrix may be redundant in dimensionality. • Can work with low-rank approximation. • If entries are missing (e.g., users’ opinions), can recover if dimensionality is low. • Powerful general analytical technique • Close, principled analog to clustering methods.
