IR Models: Latent Semantic Analysis

Presentation Transcript


  1. IR Models: Latent Semantic Analysis

  2. IR Model Taxonomy • [Figure: taxonomy of IR models organized by user task] • User task - Retrieval (ad hoc, filtering) and Browsing • Classic Models - boolean, vector, probabilistic • Set Theoretic - Fuzzy, Extended Boolean • Algebraic - Generalized Vector, Latent Semantic Indexing, Neural Networks • Probabilistic - Inference Network, Belief Network • Structured Models - Non-Overlapping Lists, Proximal Nodes • Browsing - Flat, Structure Guided, Hypertext

  3. Vocabulary Problem • The “vocabulary problem” can lead classic IR models to poor retrieval: • Polysemy - the same term means many things, so unrelated documents might be included in the answer set • Leads to poor precision • Synonymy - different terms mean the same thing, so relevant documents that do not contain any of the query’s index terms are not retrieved • Leads to poor recall

  4. Latent Semantic Indexing • Retrieval based on index terms is vague and noisy • The user information need is more related to concepts and ideas than to index terms • A document that shares concepts with another document known to be relevant might be of interest

  5. Latent Semantic Indexing • The key idea • Map documents and queries into a lower dimensional space • Lower dimensional space represents higher level concepts which are fewer in number than the index terms • Retrieval in this reduced concept space might be superior to retrieval in the space of index terms

  6. Latent Semantic Indexing • Definitions • Let t be the total number of index terms • Let N be the number of documents • Let (Mij) be a term-document matrix with t rows and N columns • Each element of this matrix is assigned a weight wij associated with the pair [ki,dj] • The weight wij can be based on a tf-idf weighting scheme
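
A minimal sketch of building such a term-document matrix with tf-idf weights, assuming a toy three-document corpus and whitespace tokenization (both illustrative, not from the slides):

```python
import numpy as np

# Illustrative toy corpus (an assumption, not from the slides)
docs = ["gold silver truck",
        "shipment of gold damaged in a fire",
        "delivery of silver arrived in a silver truck"]

terms = sorted({w for d in docs for w in d.split()})   # index terms k_i
t, N = len(terms), len(docs)

M = np.zeros((t, N))                                   # term-document matrix (M_ij)
for j, d in enumerate(docs):
    for w in d.split():
        M[terms.index(w), j] += 1.0                    # raw term frequency tf_ij

df = np.count_nonzero(M, axis=1)                       # document frequency of term k_i
M *= np.log(N / df)[:, None]                           # w_ij = tf_ij * idf_i (a simple tf-idf)
```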

  7. Singular Value Decomposition • The matrix (Mij) can be decomposed into 3 matrices: • (Mij) = (K) (S) (D)t • (K) is the matrix of eigenvectors derived from (M)(M)t • (D)t is the matrix of eigenvectors derived from (M)t(M) • (S) is an r x r diagonal matrix of singular values where • r = min(t,N) that is, the rank of (Mij)
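
A minimal sketch of this decomposition using NumPy's SVD; the small random matrix stands in for any t x N term-document matrix:

```python
import numpy as np

t, N = 6, 4
M = np.random.default_rng(0).random((t, N))            # stand-in term-document matrix

K, sigma, Dt = np.linalg.svd(M, full_matrices=False)   # K: t x r,  sigma: r values,  Dt: r x N
S = np.diag(sigma)                                      # r x r diagonal matrix of singular values
r = len(sigma)                                          # r = min(t, N)

assert np.allclose(M, K @ S @ Dt)                       # M = K S D^t
```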

  8. Latent Semantic Indexing • In the matrix (S), select only the s largest singular values • Keep the corresponding columns of (K) and rows of (D)t • The resultant matrix is called (M)s and is given by • (M)s = (K)s (S)s (D)st • where s, s < r, is the dimensionality of the concept space • The parameter s should be • large enough to allow fitting the characteristics of the data • small enough to filter out the non-relevant details
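
A minimal sketch of the rank-s truncation, continuing the same notation; NumPy returns the singular values in descending order, so keeping the first s columns/rows keeps the largest ones:

```python
import numpy as np

M = np.random.default_rng(0).random((6, 4))              # stand-in term-document matrix
K, sigma, Dt = np.linalg.svd(M, full_matrices=False)

s = 2                                                     # chosen concept-space dimensionality, s < r
K_s, S_s, Dt_s = K[:, :s], np.diag(sigma[:s]), Dt[:s, :]
M_s = K_s @ S_s @ Dt_s                                    # (M)s = (K)s (S)s (D)st
```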

  9. LSI Ranking • The user query can be modelled as a pseudo-document in the original (M) matrix • Assume the query is modelled as the document numbered 0 in the (M) matrix • The matrix (M)st (M)s quantifies the relationship between any two documents in the reduced concept space • The first row of this matrix provides the rank of all the documents with regard to the user query (represented as the document numbered 0)
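
A minimal end-to-end sketch of this ranking step; the corpus, the query, and the choice s = 2 are illustrative assumptions:

```python
import numpy as np

docs  = ["gold silver truck",
         "shipment of gold damaged in a fire",
         "delivery of silver arrived in a silver truck"]
query = "gold silver"                                    # illustrative user query

terms = sorted({w for d in docs for w in d.split()})
def tf_vector(text):
    v = np.zeros(len(terms))
    for w in text.split():
        if w in terms:
            v[terms.index(w)] += 1.0
    return v

# Query modelled as pseudo-document number 0 of M
M = np.column_stack([tf_vector(query)] + [tf_vector(d) for d in docs])

K, sigma, Dt = np.linalg.svd(M, full_matrices=False)
s = 2
M_s = K[:, :s] @ np.diag(sigma[:s]) @ Dt[:s, :]

ranks = (M_s.T @ M_s)[0, 1:]                             # first row of (M)st (M)s, skipping the query itself
print(np.argsort(-ranks))                                # documents ordered by decreasing rank w.r.t. the query
```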

  10. Latent Semantic Analysis as Model of Human Language Learning • Psycho-linguistic model: • LSA acquires word meanings the way children do - not through explicit definitions but by observing how words are used. • LSA is a pale reflection of how humans learn language, but it is a reflection. • LSA offers an explanation of how people can agree enough to share meaning.

  11. LSA Applications • In addition to typical query systems, LSA has been used for: • Cross-language search • Reviewer assignment at conferences • Finding experts in an organization • Identifying the reading level of documents

  12. Concept-based IR Beyond LSA • LSA/LSI uses principal component analysis • Principal components are not necessarily good for discrimination in classification • Linear Discriminant Analysis (LDA) identifies linear transformations • maximizing between-class variance while • minimizing within-class variance • LDA requires training data

  13. Linear Discriminant Analysis • Projecting a 2D space to 1 PC (from slides by Shaoqun Wu) • [Figure: 2D points for classes A and B projected onto a single axis w]

  14. Linear Discriminant Analysis • LDA discovers a discriminating projection • [Figure: the same 2D data for classes A and B projected by PCA and by LDA; only the LDA axis w separates the classes]

  15. LDA results • LDA reduces number of dimensions (concepts) required for classification tasks
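
A minimal sketch of such a supervised projection using scikit-learn's LinearDiscriminantAnalysis; the library choice and the two-class toy data are assumptions, since the slides do not prescribe an implementation:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),       # class A points (toy training data)
               rng.normal(1.5, 0.3, size=(20, 2))])      # class B points
y = np.array([0] * 20 + [1] * 20)                        # class labels: the training data LDA requires

lda = LinearDiscriminantAnalysis(n_components=1)         # project 2D points onto one discriminating axis
X_1d = lda.fit_transform(X, y)                           # maximizes between-class vs. within-class variance
```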

  16. Conclusions • Latent semantic indexing provides an intermediate concept representation to aid IR, mitigating the vocabulary problem. • It generates a representation of the document collection that the user might explore. • Alternative methods for identifying the reduced dimensions (e.g. LDA) may improve results.
