
Latent Semantic Indexing

Latent Semantic Indexing. Journal Article Comparison Al Funk CS 5604 / Information Retrieval. What is LSI?. Use similarities between concepts to map documents and determine their proximity in concept space





Presentation Transcript


  1. Latent Semantic Indexing: Journal Article Comparison • Al Funk • CS 5604 / Information Retrieval

  2. What is LSI? • Uses similarities between concepts to map documents and determine their proximity in concept space • “Singular Value Decomposition” (SVD): a popular statistical method for generating the concept space via dimensionality reduction • The mapping results from SVD’s spatial analysis of a collection of documents and requires no human intervention to generate
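The idea on this slide can be sketched in a few lines. This is a minimal illustration, not the method of either article: it assumes NumPy, a made-up five-term, four-document matrix, and hypothetical helper names (`lsi_fit`, `fold_in_query`, `cosine`). The query is folded into concept space with the common convention q_hat = S_k^{-1} U_k^T q.

```python
import numpy as np

# Hypothetical toy term-document matrix (rows = terms, columns = documents).
# Row order: car, automobile, engine, flower, petal.
A = np.array([
    [1, 0, 1, 0],   # car
    [0, 1, 1, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 0, 1],   # flower
    [0, 0, 0, 1],   # petal
], dtype=float)

def lsi_fit(A, k):
    """Keep the k largest singular triplets of the term-document matrix."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def fold_in_query(q, U_k, s_k):
    """Map a term-space query into the k-dimensional concept space."""
    return (U_k.T @ q) / s_k

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

U_k, s_k, Vt_k = lsi_fit(A, k=2)

# Query containing only the term "car".
q = np.array([1, 0, 0, 0, 0], dtype=float)
q_hat = fold_in_query(q, U_k, s_k)

# Column j of Vt_k holds document j's coordinates in concept space.
scores = [cosine(q_hat, Vt_k[:, j]) for j in range(A.shape[1])]
print(scores)
```

In this toy setup, document 1 contains "automobile" and "engine" but not "car", yet it still scores highly for the query, while the "flower"/"petal" document scores near zero.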

  3. Strengths of LSI • Increased relevance of information retrieval, as concepts are recognized rather than keywords • Larger result sets due to retrieval of texts that do not include the specific query keywords • LSI recognizes that keywords are related • Minimal human intervention to generate mappings

  4. Weaknesses of LSI • Storage requirements for indexes • Computation time • In essence, the high dimensionality of the document representation can make searching resource intensive. LSI can reduce these costs but can also incur some of its own. • Q: Is there a way to maintain the benefits of LSI and reduce resource requirements?

  5. Two Solutions Identified? • Many journal articles focus on mitigating the resource intensity of LSI by reducing dimensionality. Two approaches: • Article 1: Use “random projection” to lower the dimensionality of the concept space while trying to prevent erosion of the relationships between vectors • Article 2: Replace SVD with “Semidiscrete Matrix Decomposition,” an approximation that reduces dimensionality but still retains the bulk of the relationships

  6. Random Projection • Traditional methods of dimensionality reduction analyze the dataset itself in order to minimize the loss of variation. Two such methods are: • Principal Component Analysis (PCA) • Singular Value Decomposition (SVD) • SVD is the primary method for document retrieval because it performs well with sparse matrices. • PCA and SVD are both computationally expensive, particularly for large datasets.

  7. Random Projection • Random Projection (RP) attempts to solve these problems by creating a random matrix and using it to project the document observation vectors onto a lower dimensional space. • Random projection can be used before SVD, enabling the expensive algorithm to operate on a matrix of lower dimension. • Bingham and Mannila’s results indicate that RP has an acceptable impact on the data while significantly reducing required computation.
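A rough sketch of the projection step follows. It assumes NumPy, synthetic random data in place of real document vectors, and a dense Gaussian projection matrix; the article may use a different random matrix, and the function names here are illustrative only.

```python
import numpy as np

def random_projection(X, k, seed=0):
    """Project the rows of X (n_docs x d) onto k dimensions.

    Entries of R are drawn from N(0, 1/k), so squared lengths are
    preserved in expectation.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R = rng.standard_normal((d, k)) / np.sqrt(k)
    return X @ R

def pairwise_distances(X):
    """Euclidean distances between all pairs of rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.sqrt(np.maximum(d2, 0.0))

# Synthetic stand-in for 20 document vectors over 2000 terms.
rng = np.random.default_rng(42)
X = rng.random((20, 2000))
Y = random_projection(X, k=300)

# Compare each pairwise distance before and after projection.
iu = np.triu_indices(X.shape[0], k=1)
ratios = pairwise_distances(Y)[iu] / pairwise_distances(X)[iu]
mean_ratio = float(ratios.mean())
print(Y.shape, mean_ratio)
```

The projection is a single matrix multiplication and never inspects the data, which is why it is cheap compared with PCA or SVD; an SVD could then be run on the smaller matrix `Y`.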

  8. SDD vs. SVD • Kolda and O’Leary propose replacing the expensive SVD algorithm with “Semidiscrete Matrix Decomposition” (SDD) • Lower computation time when searching • Lower storage requirements • They claim that the method is as accurate as SVD but less resource intensive

  9. What is SVD? • In LSI, the SVD is truncated to rank k, which yields the “closest rank-k matrix to the term-document matrix in the Frobenius measure”. • Essentially, it creates a lower-rank matrix that best approximates the original m x n term-document matrix.
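The "closest rank-k matrix" property can be checked numerically. The sketch below assumes NumPy and a small random matrix in place of a real term-document matrix; `rank_k_approx` is a name chosen for illustration. It relies on the standard result that the Frobenius error of the truncated SVD equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

def rank_k_approx(A, k):
    """Return the rank-k truncated SVD of A and all singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return A_k, s

# Random stand-in for an m x n term-document matrix.
rng = np.random.default_rng(0)
A = rng.random((8, 6))
k = 2

A_k, s = rank_k_approx(A, k)

# Frobenius-norm error of the approximation ...
err = np.linalg.norm(A - A_k, "fro")
# ... compared with the value predicted from the discarded singular values.
expected = np.sqrt(np.sum(s[k:] ** 2))
print(err, expected)
```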

  10. What is SDD? • SDD is an alternative matrix decomposition for LSI that pursues the same goals as SVD • Like SVD, SDD creates a lower-rank approximation, but it restricts the entries of its vectors to –1, 0 or 1 • As a result of this restriction, SDD is computationally more expensive up front
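To make the restriction concrete, here is a simplified greedy sketch that approximates A by a sum of terms d * x * y^T with x and y restricted to –1, 0 or 1. It assumes NumPy; the function names, the starting vector, and the fixed number of inner iterations are choices made for this illustration and should not be read as the exact procedure from the article.

```python
import numpy as np

def best_ternary(s):
    """Return x in {-1, 0, 1}^n maximizing (x @ s)**2 / (x @ x).

    The optimum keeps the signs of the J largest |s| entries for the best J.
    """
    order = np.argsort(-np.abs(s))
    csum = np.cumsum(np.abs(s)[order])
    J = int(np.argmax(csum ** 2 / np.arange(1, len(s) + 1))) + 1
    x = np.zeros(len(s))
    x[order[:J]] = np.sign(s[order[:J]])
    return x

def sdd(A, k, inner_iters=10):
    """Greedy rank-k semidiscrete approximation: A ~ (X * D) @ Y.T."""
    R = A.astype(float).copy()          # residual still to be explained
    m, n = R.shape
    X, D, Y = np.zeros((m, k)), np.zeros(k), np.zeros((n, k))
    for i in range(k):
        if not np.any(R):
            break                       # residual is exactly zero
        # Start from the unit vector selecting the heaviest residual column.
        y = np.zeros(n)
        y[np.argmax(np.sum(R ** 2, axis=0))] = 1.0
        # Alternate: fix y and solve for x, then fix x and solve for y.
        for _ in range(inner_iters):
            x = best_ternary(R @ y)
            y = best_ternary(R.T @ x)
        d = (x @ R @ y) / ((x @ x) * (y @ y))
        X[:, i], D[i], Y[:, i] = x, d, y
        R -= d * np.outer(x, y)
    return X, D, Y

rng = np.random.default_rng(1)
A = rng.random((6, 5))
X, D, Y = sdd(A, k=4)
residual = np.linalg.norm(A - (X * D) @ Y.T, "fro")
print(residual, np.linalg.norm(A, "fro"))
```

The alternating inner loop is where the extra up-front cost comes from: each term requires several passes over the residual, whereas the stored result consists only of ternary vectors and one scalar per term.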

  11. Benefits of SDD • Despite higher up-front processing times, updates to the matrix can be made rapidly to accommodate changing collections • Searching is more efficient (taking as little as ½ the time) • Storage requirements are lower, as SDD can store each matrix value in 2 bits (rather than multiple bytes for a floating-point value)
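The 2-bit storage claim can be illustrated with a small packing routine. The bit codes used here (00 for 0, 01 for 1, 10 for –1) are an assumption made for this sketch, not necessarily the encoding used in the article, and the function names are hypothetical.

```python
def pack_ternary(values):
    """Pack values from {-1, 0, 1} into bytes, four values per byte."""
    codes = {0: 0b00, 1: 0b01, -1: 0b10}
    out = bytearray((len(values) + 3) // 4)
    for i, v in enumerate(values):
        out[i // 4] |= codes[v] << (2 * (i % 4))
    return bytes(out)

def unpack_ternary(data, count):
    """Recover the first `count` values from packed bytes."""
    decode = {0b00: 0, 0b01: 1, 0b10: -1}
    return [decode[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(count)]

# One hypothetical SDD vector with nine entries.
column = [1, 0, -1, -1, 0, 1, 1, 0, -1]
packed = pack_ternary(column)

# 2 bits per entry versus 8 bytes per entry for double-precision floats.
print(len(packed), "bytes packed vs", 8 * len(column), "bytes as float64")
```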

  12. Conclusions • Both articles report a quantifiable increase in performance over traditional LSI techniques • The techniques could potentially be used together, as both tackle the related issues of performance and dimensionality reduction

  13. Article Links • Kolda and O’Leary (SDD): http://doi.acm.org/10.1145/291128.291131 • Bingham and Mannila (random projection): http://doi.acm.org/10.1145/502512.502546
