210 likes | 429 Vues
EE3J2 Data Mining Lecture 8 Latent Semantic Indexing Martin Russell. Objectives. To explain vector representation of documents To understand cosine distance To understand, intuitively, Latent Semantic Indexing (LSI) To understand, intuitively, how to do Latent Semantic Analysis (LSA).
E N D
EE3J2 Data MiningLecture 8Latent Semantic IndexingMartin Russell EE3J2 Data Mining
Objectives • To explain vector representation of documents • To understand cosine distance • To understand, intuitively, Latent Semantic Indexing (LSI) • To understand, intuitively, how to do Latent Semantic Analysis (LSA) EE3J2 Data Mining
Vector Notation for Documents • Suppose that we have a set of documents D = {d1, d2, … ,dN} think of this as the corpus for IR • Suppose that the number of different words in the whole corpus is V • Now suppose a document din D contains M different terms: {ti(1), ti(2),…, ti(M)) • Finally, suppose term ti(m)occurs ni(m)times EE3J2 Data Mining
Notice that this is the term frequencyfi(1)dfrom text IR Vector Notation • The document d can be represented exactly as the V dimensional vector vec(d) i(1)th place i(2)th place i(M)th place EE3J2 Data Mining
Example • d1 = the cat sat on the mat • d2= the dog chased the cat • d3 = the mouse stayed at home • Vocabulary: • the, cat, sat, on, mat, dog, chased, mouse, stayed, at, home • d1 = (2,1,1,1,1,0,0,0,0,0,0) • d2 = (2,1,0,0,0,1,1,0,0,0,0) • d3= (1,0,0,0,0,0,0,1,1,1,1) EE3J2 Data Mining
Document length revisited • Recall that the length of a vector is given by: EE3J2 Data Mining
Document length • In the case of a ‘document vector’ EE3J2 Data Mining
Document Similarity • Suppose d is a document and q is a query • If d and q contain the same words in the sameproportions, then vec(d) and vec(q) will point in the same direction • If d and q contain the different words, then vec(d) and vec(q) will point in different directions • Intuitively, the greater the angle between vec(d) and vec(q) the less similar the document d is with the query q EE3J2 Data Mining
Cosine similarity • Define the Cosine Similarity between document d and query q by: CSim(q,d) = cos where is the angle between vec(q) and vec(d) • Similarly, define the Cosine Similarity between documents d1 and d2 by: CSim(d1,d2) = cos where is the angle between vec(d1) and vec(d2) EE3J2 Data Mining
u v Cosine Similarity & Similarity • Let u=(x1,y1) and v=(x2,y2) be vectors in 2 dimensions, then • In fact, this result holds for vectors in any N dimensional space EE3J2 Data Mining
Cosine similarity Similarity ‘is related to’ Cosine Similarity & Similarity • Hence, if q is a query, d is a document, and is the angle between vec(q) and vec(d), then: EE3J2 Data Mining
Latent Semantic Analysis (LSA) • Suppose we have a real corpus with a large number of documents • For each document d the dimension of the vector vec(d) will typically be several (tens of) thousands • Let’s focus on just 2 of these dimensions, corresponding, say, to the words ‘sea’ and ‘beach’ • Intuitively, often, when a document d includes ‘sea’ it will also include ‘beach’ EE3J2 Data Mining
“about the seaside” sea beach LSA continued • Equivalently, if vec(d) has a non-zero entry in the ‘sea’ component, it will often have a non-zero entry in the ‘beach’ component EE3J2 Data Mining
“about the seaside” sea beach Latent Semantic Classes • If we can detect this type of structure, then we can discover relationships between words automatically, from data • In the example we have found an equivalence set of terms, including ‘beach’ and ‘sea’, which is ‘about the seaside’ EE3J2 Data Mining
Finding Latent Semantic Classes • LSA involves some advanced linear algebra - the description here is just an outline • First construct the ‘word-document’ matrix A • Then decompose A using Singular Value Decomposition (SVD) • SVD is a standard technique from matrix algebra • Packages such as MATLAB have SVD functions: >>[U,S,V]=svd(A) EE3J2 Data Mining
Number of times tn occurs in dm ‘Word-Document’ Matrix term document EE3J2 Data Mining
Direction of most significant correlation ‘Strength’ of most significant correlation Singular Value Decomposition (SVD) EE3J2 Data Mining
More about LSA • See: Landauer, T.K. and Dumais, S.T., “A solution to Platos problem: The Latent Semantic Analysis theory of the acquisition, induction and representation of knowledge”, Psychological Review 104(2), 211-240 (1997) EE3J2 Data Mining
Summary • Vector space representation of documents • Intuitive description of Latent Semantic Analysis EE3J2 Data Mining