
Using the Cell to Perform Latent Semantic Indexing

This guide explores latent semantic indexing (LSI) in information retrieval, focusing on Principal Component Analysis (PCA) to reduce the dimensionality of term-document matrices. By grouping semantically related terms and documents, LSI offers insight into document relationships and term associations. The guide covers the methodology from constructing a term-document matrix to computing its Singular Value Decomposition (SVD). Although SVD is computationally expensive, existing libraries and potential parallelization projects can make it efficient enough for real-world applications.


Presentation Transcript


  1. Using the Cell to Perform Latent Semantic Indexing Luke Georgalas Andrew Raim

  2. What Is LSI? • Information retrieval technique performed on a corpus of documents • Perform Principal Component Analysis (PCA) to reduce dimensionality of a term-document matrix • Semantically related terms and documents get grouped together

  3. What Can You Do With LSI? • Compare documents in a corpus to see how related they are • Compare terms in a corpus to see how related they are • Compare terms with documents • Incorporate new documents into the concept space and find closest matches (queries can be considered documents)

  4. How Does LSI Work? • Term-document matrix is constructed • Singular Value Decomposition (SVD) is performed • M = T S D′, where M is the t × d term-document matrix, T is t × r, S is r × r (diagonal), and D′ is r × d * Diagram is from Dr. Charles Nicholas’ lecture notes for CMSC676
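The decomposition on this slide can be sketched in a few lines of NumPy (an illustration, not code from the presentation; the toy term-document matrix and the term labels are made up for the example):

```python
import numpy as np

# Hypothetical term-document count matrix M (t = 4 terms x d = 4 documents)
M = np.array([
    [1, 0, 1, 0],   # "cell"
    [1, 1, 0, 0],   # "matrix"
    [0, 1, 1, 1],   # "index"
    [0, 0, 1, 1],   # "query"
], dtype=float)

# SVD: M = T @ diag(S) @ D', with T (t x r), S (r,), Dt = D' (r x d)
T, S, Dt = np.linalg.svd(M, full_matrices=False)

# Rank-k truncation keeps only the k strongest semantic dimensions;
# M_k is the best rank-k approximation of M in the least-squares sense
k = 2
T_k, S_k, Dt_k = T[:, :k], S[:k], Dt[:k, :]
M_k = T_k @ np.diag(S_k) @ Dt_k
```

The truncation step is what gives LSI its dimensionality reduction: terms and documents that co-occur end up close together in the k-dimensional concept space even if they never appear in the same document.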

  5. How Does LSI Work? • D, a representation of M in r dimensions • T, a matrix for transforming new documents • Diagonal matrix S gives relative importance of dimensions • Dimensions represent semantic concepts * This slide is from Dr. Charles Nicholas’ lecture notes for CMSC676
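The "transforming new documents" step (folding a query into the concept space, as slide 3 notes queries can be treated as documents) can be sketched as follows. This is an illustration in NumPy with a made-up matrix, not code from the slides:

```python
import numpy as np

# Same toy term-document matrix as before (4 terms x 4 documents)
M = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
T, S, Dt = np.linalg.svd(M, full_matrices=False)
k = 2
T_k, S_k = T[:, :k], S[:k]
D_k = Dt[:k, :].T                    # each row: one document in k dims

# Fold a query (a pseudo-document of term counts) into the concept
# space: q_hat = q' T_k S_k^{-1}
q = np.array([1, 0, 1, 0], dtype=float)   # query using terms 0 and 2
q_hat = (q @ T_k) / S_k

# Rank documents by cosine similarity to the folded-in query
sims = (D_k @ q_hat) / (np.linalg.norm(D_k, axis=1)
                        * np.linalg.norm(q_hat))
best = int(np.argmax(sims))
```

The same cosine comparison in the reduced space also supports document-document and term-term similarity (compare rows of `D_k` or rows of `T_k`), which is how the comparisons on slide 3 are carried out.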

  6. On The Cell? • Term-document matrices get very large with large corpora • One reason LSI is not commonly used in search engines is that SVD is slow

  7. Existing Serial Libraries • LAPACK (University of Tennessee linear algebra library) • QR algorithm, O(n³) • DQDS algorithm – only calculates S • Divide-and-conquer – only calculates S • Bisection method and inverse iteration – only calculates singular values and vectors of interest
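These LAPACK routines are accessible from SciPy, which makes it easy to compare them on a serial baseline before porting anything to the Cell. A sketch (SciPy is my addition, not mentioned on the slides; `gesdd` is LAPACK's divide-and-conquer driver and `gesvd` the classic QR-based one):

```python
import numpy as np
from scipy.linalg import svd, svdvals

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))

# Divide-and-conquer driver (usually faster on large matrices)
U, s_dc, Vt = svd(A, lapack_driver="gesdd")

# Classic QR-based driver
_, s_qr, _ = svd(A, lapack_driver="gesvd")

# When only the singular values S are needed (as for several of the
# algorithms listed above), skip computing U and V entirely:
s_only = svdvals(A)
```

Both drivers return the same singular values; the choice only affects speed and numerical robustness.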

  8. Existing Serial Libraries • GSL (GNU Scientific Library) • Golub-Reinsch algorithm • Modified Golub-Reinsch algorithm for M >> N • Jacobi orthogonalization

  9. Existing Serial Libraries • Stream Hestenes SVD • Modified Hestenes algorithm to support stream processing • Requires R² processing elements to work most efficiently * http://www.lcs.mit.edu/publications/pubs/pdf/MIT-LCS-TM-641.pdf

  10. Possible Projects • Algorithm – SVD only • Start with a serial routine • See what we can parallelize • See what we can vectorize • Application - indexer • Create a cell-optimized version of SVD • Index a corpus • Run queries
