EE3J2 Data Mining, Lecture 8: Latent Semantic Indexing (Martin Russell)
Objectives • To explain the vector representation of documents • To understand cosine distance • To understand, intuitively, Latent Semantic Indexing (LSI) • To understand, intuitively, how to do Latent Semantic Analysis (LSA)
Vector Notation for Documents • Suppose that we have a set of documents D = {d1, d2, …, dN} (think of this as the corpus for IR) • Suppose that the number of different words in the whole corpus is V • Now suppose a document d in D contains M different terms: {ti(1), ti(2), …, ti(M)} • Finally, suppose term ti(m) occurs ni(m) times
Vector Notation • The document d can be represented exactly as the V-dimensional vector vec(d) = (0, …, 0, ni(1), 0, …, 0, ni(2), 0, …, 0, ni(M), 0, …, 0), where ni(m) appears in the i(m)th place and every other entry is zero • Notice that ni(m) is the term frequency fi(m),d from text IR
Example • d1 = the cat sat on the mat • d2 = the dog chased the cat • d3 = the mouse stayed at home • Vocabulary: the, cat, sat, on, mat, dog, chased, mouse, stayed, at, home • vec(d1) = (2,1,1,1,1,0,0,0,0,0,0) • vec(d2) = (2,1,0,0,0,1,1,0,0,0,0) • vec(d3) = (1,0,0,0,0,0,0,1,1,1,1)
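The vectors on this slide can be built mechanically: count each vocabulary term in each document. The sketch below (plain Python, not part of the original slides) reproduces the example; the function name `doc_vector` is our own choice.

```python
# Vocabulary order follows the slide: the, cat, sat, on, mat, ...
vocab = ["the", "cat", "sat", "on", "mat", "dog",
         "chased", "mouse", "stayed", "at", "home"]

def doc_vector(text, vocab):
    """Return the V-dimensional term-count vector vec(d)."""
    words = text.split()
    return [words.count(term) for term in vocab]

d1 = doc_vector("the cat sat on the mat", vocab)
d2 = doc_vector("the dog chased the cat", vocab)
d3 = doc_vector("the mouse stayed at home", vocab)

print(d1)  # [2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
```

Each vector has one component per vocabulary term, so all three documents live in the same 11-dimensional space, exactly as on the slide.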
Document length revisited • Recall that the length of a vector v = (v1, v2, …, vV) is given by |v| = √(v1² + v2² + … + vV²)
Document length • In the case of a ‘document vector’, |vec(d)| = √(ni(1)² + ni(2)² + … + ni(M)²), since every other entry is zero • For example, |vec(d1)| = √(2² + 1² + 1² + 1² + 1²) = √8
Document Similarity • Suppose d is a document and q is a query • If d and q contain the same words in the same proportions, then vec(d) and vec(q) will point in the same direction • If d and q contain different words, then vec(d) and vec(q) will point in different directions • Intuitively, the greater the angle between vec(d) and vec(q), the less similar the document d is to the query q
Cosine similarity • Define the Cosine Similarity between document d and query q by: CSim(q,d) = cos(θ), where θ is the angle between vec(q) and vec(d) • Similarly, define the Cosine Similarity between documents d1 and d2 by: CSim(d1,d2) = cos(θ), where θ is the angle between vec(d1) and vec(d2)
Cosine Similarity & Similarity • Let u = (x1, y1) and v = (x2, y2) be vectors in 2 dimensions, and let θ be the angle between them; then cos(θ) = (u·v) / (|u||v|) = (x1x2 + y1y2) / (√(x1² + y1²) √(x2² + y2²)) • In fact, this result holds for vectors in any N-dimensional space
Cosine Similarity & Similarity • Hence, if q is a query, d is a document, and θ is the angle between vec(q) and vec(d), then: CSim(q,d) = cos(θ) = (vec(q)·vec(d)) / (|vec(q)||vec(d)|) • Since cos(θ) decreases as the angle θ grows, cosine similarity ‘is related to’ similarity: the more similar q and d are, the smaller the angle and the larger CSim(q,d)
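The formula above can be checked on the three example documents from earlier. The sketch below (plain Python, not part of the original slides; the helper name `cosine_similarity` is our own) computes CSim directly from the term-count vectors:

```python
import math

def cosine_similarity(u, v):
    """CSim(u, v) = (u . v) / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return dot / (norm_u * norm_v)

# Term-count vectors for d1, d2, d3 from the earlier example.
d1 = [2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # the cat sat on the mat
d2 = [2, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0]   # the dog chased the cat
d3 = [1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]   # the mouse stayed at home

print(round(cosine_similarity(d1, d2), 3))  # 0.668 -- share 'the' and 'cat'
print(round(cosine_similarity(d1, d3), 3))  # 0.316 -- share only 'the'
```

As the slide predicts, d1 and d2 (which share more words) get a larger cosine similarity, i.e. a smaller angle, than d1 and d3.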
Latent Semantic Analysis (LSA) • Suppose we have a real corpus with a large number of documents • For each document d the dimension of the vector vec(d) will typically be several (tens of) thousands • Let’s focus on just 2 of these dimensions, corresponding, say, to the words ‘sea’ and ‘beach’ • Intuitively, often, when a document d includes ‘sea’ it will also include ‘beach’
LSA continued • Equivalently, if vec(d) has a non-zero entry in the ‘sea’ component, it will often have a non-zero entry in the ‘beach’ component • [Figure: document vectors plotted against the ‘sea’ and ‘beach’ axes; documents “about the seaside” lie along a common direction]
Latent Semantic Classes • [Figure: the “about the seaside” cluster in the ‘sea’–‘beach’ plane] • If we can detect this type of structure, then we can discover relationships between words automatically, from data • In the example we have found an equivalence set of terms, including ‘beach’ and ‘sea’, which is “about the seaside”
Finding Latent Semantic Classes • LSA involves some advanced linear algebra; the description here is just an outline • First construct the ‘word-document’ matrix A • Then decompose A using Singular Value Decomposition (SVD) • SVD is a standard technique from matrix algebra • Packages such as MATLAB have SVD functions: >> [U,S,V] = svd(A)
‘Word-Document’ Matrix • A has one row per term and one column per document • The entry in row n, column m is the number of times term tn occurs in document dm
Singular Value Decomposition (SVD) • SVD factorises the word-document matrix as A = U S Vᵀ • The columns of U give the directions of the most significant correlations between terms, and the corresponding diagonal entries of S give the ‘strength’ of each correlation
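The MATLAB call `[U,S,V]=svd(A)` on the previous slide can be mirrored in NumPy. The sketch below (our own illustration, assuming NumPy is available; not part of the original slides) decomposes the word-document matrix of the three-document example and keeps only the strongest latent directions:

```python
import numpy as np

# Word-document matrix A for the earlier three-document example:
# rows = terms (the, cat, sat, on, mat, dog, chased, mouse, stayed, at, home),
# columns = documents (d1, d2, d3); entry (n, m) = count of term n in doc m.
A = np.array([
    [2, 2, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
])

# NumPy returns the singular values as a vector s and V transposed.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Singular values come out in decreasing order of 'strength';
# the columns of U are the corresponding term-correlation directions.
print(bool(s[0] >= s[1] >= s[2]))  # True

# A rank-k truncation keeps only the k strongest latent directions --
# this is the step that merges related terms such as 'sea' and 'beach'.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

Note that NumPy returns Vᵀ and a vector of singular values rather than MATLAB's diagonal matrix S; multiplying the factors back together recovers A exactly.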
More about LSA • See: Landauer, T. K. and Dumais, S. T., “A solution to Plato’s problem: The Latent Semantic Analysis theory of the acquisition, induction and representation of knowledge”, Psychological Review 104(2), 211-240 (1997)
Summary • Vector space representation of documents • Intuitive description of Latent Semantic Analysis