Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand



Presentation Transcript


  1. Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand. Adam Carlson

  2. Outline • Discourse Segmentation • LSI Motivation • Math - How to do LSI • Applications • More Math - Why does it work • Wacky Ideas CS590Q W99 - Latent Semantic Indexing - Adam Carlson

  3. Discourse Segmentation • Some collections (like the web) have high variance in document length • Sometimes things like sentences or paragraphs work, sometimes they don’t • Would like to segment documents according to topic

  4. TextTiling • Break document into units of fixed length • Score cohesion between units • Look for patterns of low cohesion surrounded by high cohesion • Indicates change of subject • Found good agreement with human judges • Possible application for LSI measures of coherence
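
The TextTiling steps on this slide can be sketched roughly as follows. This is a simplified illustration, not the published algorithm: the function names, the unit length of 20 tokens, and the use of plain bag-of-words cosine as the cohesion score are all assumptions made for the example, and a boundary is taken to be any gap whose score is lower than both of its neighbours.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cohesion_scores(tokens, unit_len=20):
    """Split tokens into fixed-length units and score each adjacent pair."""
    units = [Counter(tokens[i:i + unit_len])
             for i in range(0, len(tokens), unit_len)]
    return [cosine(left, right) for left, right in zip(units, units[1:])]


def topic_boundaries(scores):
    """Indices of gaps with low cohesion surrounded by higher cohesion."""
    return [i for i in range(1, len(scores) - 1)
            if scores[i] < scores[i - 1] and scores[i] < scores[i + 1]]


# Toy document (hypothetical): two topics of 40 tokens each.
tokens = ["cat", "dog"] * 20 + ["stock", "bond"] * 20
scores = cohesion_scores(tokens, unit_len=20)
boundaries = topic_boundaries(scores)
print(scores, boundaries)
```

Gap `i` sits between unit `i` and unit `i + 1`, so a reported boundary at gap 1 corresponds to the topic change after the second unit. Replacing `cosine` with a similarity computed in LSI space would be one way to try the "possible application" mentioned in the last bullet.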

  5. Using Co-occurrence Information • Major problems with word-matching • Synonymy (one meaning, many words) • Polysemy (one word, many meanings) • Solutions • Concept search • Query expansion • Clustering • Latent Semantic Indexing (almost)

  6. Latent Semantic Indexing is ... • Latent • Captures underlying structure of corpus • Semantic • Groups words by “conceptual” similarity • Cool • Lots of neat applications • Not Silver Bullet • Not really semantic, just MDS, expensive

  7. What is LSI • Restructures vector space so that co-occurrences are mapped together • Captures transitive co-occurrence relations • Application of dimensional reduction to term-document matrix • Throw out da noise, bring in da regularities • Form of clustering

  8. Document vector space

  9. Semantic Space

  10. Singular Value Decomposition

  11. Term-Document Matrix Approximation

  12. Properties of Â • Best least-squares approximation of A given only k dimensions • Terms and documents which were similar in A are more similar in Â • This measure of similarity is transitive • So what can we do with this?
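
A minimal sketch of how Â might be computed with NumPy, assuming A is a dense term-document matrix and writing the decomposition as A = U D Vᵀ. The small matrix below is made up for illustration. The check at the end uses the standard result that the least-squares (Frobenius) error of the rank-k truncation equals the root of the sum of the squared discarded singular values.

```python
import numpy as np


def rank_k_approximation(A, k):
    """Return A_hat, the rank-k truncation of the SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep only the k largest singular values and their vectors.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


# Hypothetical 5-term x 4-document count matrix.
A = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 1, 1, 1]], dtype=float)

k = 2
A_hat = rank_k_approximation(A, k)

singular_values = np.linalg.svd(A, compute_uv=False)
error = np.linalg.norm(A - A_hat)                      # Frobenius norm
expected_error = np.sqrt(np.sum(singular_values[k:] ** 2))
print(error, expected_error)
```

For a realistically sized corpus a sparse, truncated solver would normally be used instead of a full dense SVD; the dense call here only keeps the example short.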

  13. LSI Tricks and Tips • Use Â to query using standard cosine measure • Use U_k·D_k for term similarity • Use D_k·V_k^T for document similarity
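
The first two tips can be sketched as below, assuming terms are rows and documents are columns of A. The five-term vocabulary and four documents are invented for the example, and the function names are hypothetical. The toy corpus is built so that "car" and "automobile" never occur in the same document but both co-occur with "engine".

```python
import numpy as np


def truncated_svd(A, k):
    """Return U_k, the k largest singular values, and V_k^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


def query_scores(A, q, k):
    """Cosine of query vector q against each document column of A_hat."""
    U_k, s_k, Vt_k = truncated_svd(A, k)
    A_hat = (U_k * s_k) @ Vt_k
    norms = np.linalg.norm(A_hat, axis=0) * np.linalg.norm(q)
    return (q @ A_hat) / np.where(norms == 0, 1.0, norms)


def term_similarity(A, k, i, j):
    """Cosine between rows i and j of U_k D_k."""
    U_k, s_k, _ = truncated_svd(A, k)
    T = U_k * s_k
    return float(T[i] @ T[j] / (np.linalg.norm(T[i]) * np.linalg.norm(T[j])))


# Rows: car, automobile, engine, flower, petal. Columns: documents d0..d3.
A = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)

q = np.array([1, 0, 0, 0, 0], dtype=float)   # query: "car"
scores = query_scores(A, q, k=2)
car_vs_automobile = term_similarity(A, 2, 0, 1)
print(scores, car_vs_automobile)
```

In the raw matrix the query shares no term with d1, so plain word matching gives it a cosine of zero. After truncation to k = 2, d1 gets a clearly positive score and the two synonyms end up pointing in the same direction, which is the synonymy effect the earlier slides describe.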

  14. Applications • Information Retrieval: improve retrieval, cross-language retrieval, document routing/filtering, measuring text coherence • Cognitive Science: learning synonyms, subject matter knowledge, word sorting behavior, lexical priming • Education: essay grading, text selection

  15. Standard Vector Space Retrieval in LSI Space • Improves recall at expense of precision • Compared to term-document vector space, SMART and Voorhees [Deerwester et al. 1990] • LSI did best on MED dataset • SMART did best on CISI dataset • but LSI was comparable to SMART when stemming was added

  16. Cross Language Retrieval • Train on multilingual corpora using “combined” documents • Add in single language documents • Query in LSI space • [Landauer & Littman 1990] French & English • [Landauer, Littman & Stornetta 1992] Japanese & English • [Young 1994] Greek & English • [Dumais, Landauer & Littman 1996] Comparisons between LSI, no-LSI and Machine Translation

  17. Document Routing/Filtering • Match reviewers with papers to be reviewed based on reviewers’ publications [Dumais & Nielsen 1992] • Select papers for researchers to read based on other papers they liked [Foltz & Dumais 1992]

  18. LSI goes to college • Train LSI on encyclopedia articles • Test against TOEFL synonym test • Results comparable to (non-native) college applicants [Landauer & Dumais 1996] • Train on introductory Psychology texts • Receive passing grade on multiple-choice questions (but did worse than students) [Landauer, Foltz & Laham 1998]

  19. Essay Grading • Several techniques • Use essay (or sentences from essay) to query into textbook or database of graded essays • Grade based on cosine from text or closest graded essay • More consistent than expert human graders • Is that good? [Landauer, Laham & Foltz 1998]

  20. Routing Meets Education • Run LSA on a bunch of texts at different levels of sophistication • Have student write short essay about topic • Use essay as query to select most appropriate text for student [Wolfe, Schreiner, Rehder, Laham, Foltz, Kintsch and Landauer 1998]

  21. Measuring Text Coherence • Use LSI to compute cosine of each sentence with following one [Foltz, Kintsch & Landauer 1998] • Correlates highly with established methods • Can indicate where coherence breaks down • Can be used to measure how semantic content changes across a text (discourse segmentation?)
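
A rough sketch of the sentence-to-sentence measure, assuming a sentence is represented as the sum of the LSI vectors of its terms. That representation, the function names, and the hand-written two-dimensional term vectors are all assumptions for illustration; in practice the term vectors would come from a trained LSI space (for example the rows of U_k D_k).

```python
import numpy as np


def sentence_vector(tokens, vocab, term_vectors):
    """Sum the LSI vectors of the known terms in a sentence."""
    rows = [vocab[t] for t in tokens if t in vocab]
    if not rows:
        return np.zeros(term_vectors.shape[1])
    return term_vectors[rows].sum(axis=0)


def adjacent_coherence(sentences, vocab, term_vectors):
    """Cosine of each sentence vector with the one that follows it."""
    vecs = [sentence_vector(s, vocab, term_vectors) for s in sentences]
    scores = []
    for a, b in zip(vecs, vecs[1:]):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        scores.append(float(a @ b / denom) if denom else 0.0)
    return scores


# Hypothetical 2-dimensional LSI space: one "vehicle" axis, one "plant" axis.
vocab = {"car": 0, "automobile": 1, "engine": 2, "flower": 3, "petal": 4}
term_vectors = np.array([[0.7, 0.0],
                         [0.7, 0.0],
                         [1.4, 0.0],
                         [0.0, 1.0],
                         [0.0, 1.0]])

sentences = [["car", "engine"], ["automobile"], ["flower", "petal"]]
coherence = adjacent_coherence(sentences, vocab, term_vectors)
print(coherence)
```

The first pair of sentences shares no word yet scores as coherent, while the drop at the second pair marks the place where the subject changes. That drop is the kind of signal a discourse segmenter could use.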

  22. Outline • Discourse Segmentation • LSI Motivation • Math - How to do LSI • Applications • More Math - Why does it work • Wacky Ideas

  23. Least Squares Approximation: Why does it work? 1st Attempt • Â is best least-squares approximation to A using just k dimensions

  24. Least Squares cont. • Why does this work • Are these the regularities we want to capture • Why approximate at all? (hint: overfitting) • Not very convincing

  25. Neural Network Explanation: Why does it work? 2nd Attempt • Consider fully connected 3-layer network • First layer is terms • Middle layer has k units • Last layer is documents • Weights on hidden layer will adjust to group terms that appear in similar documents and documents containing similar terms • This is analogous to the SVD matrices

  26. Spectral Analysis: Why does it work? 3rd Attempt • Kleinberg’s “Authoritative Sources” • A link provides evidence of authority • Authoritative sources are pointed to by hubs • Hubs point to authoritative sources • Give every page some “weight” • Move weight back and forth across links • Stabilizes with authority and hubs • Equivalent to spectral analysis - eigenstuff
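
The "move weight back and forth" procedure can be sketched as a power iteration. The link matrix below is a made-up four-page example, the function name and iteration count are assumptions, and L[i, j] = 1 is taken to mean that page i links to page j. The final comparison illustrates the "eigenstuff" remark: the authority weights converge to the leading right singular vector of L, which is the same kind of object the SVD in LSI produces.

```python
import numpy as np


def hubs_and_authorities(L, iterations=50):
    """Alternate authority and hub updates, normalising after each step."""
    n = L.shape[0]
    hubs = np.ones(n)
    auth = np.ones(n)
    for _ in range(iterations):
        auth = L.T @ hubs              # authorities collect weight from hubs
        auth /= np.linalg.norm(auth)
        hubs = L @ auth                # hubs collect weight from authorities
        hubs /= np.linalg.norm(hubs)
    return hubs, auth


# Hypothetical web: page 0 links to 2 and 3, page 1 links to 2.
L = np.array([[0, 0, 1, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)

hubs, auth = hubs_and_authorities(L)

# Leading right singular vector of L (sign is arbitrary, so compare magnitudes).
_, _, Vt = np.linalg.svd(L)
print(auth, np.abs(Vt[0]))
```

Swapping the link matrix for a term-document matrix gives the co-occurrence reading on the next slide.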

  27. Spectral Analysis cont. • Co-occurrence instead of authority • Links are documents with the same word • Similar documents have many similar words • Similar words occur in similar documents • Turn Kleinberg crank and get: • Authoritative sources = similar documents • Hubs = words that occur in similar documents • Doesn’t exactly fit (asymmetric)

  28. More Eigenexplanation: Why does it work? 4th Attempt • Rank of a matrix is a measure of how much information it contains • Rows which are linear combinations of each other can be removed • In this case, some singular values will be 0

  29. Eigenvalues cont. • Consider vectors of terms X, Y and Z • X = [1 1 0 0 1 0 ... ] • Y = [0 0 1 1 0 0 ... ] • Z = [1 1 2 2 0 1 ... ] • Z ≈ X + 2Y • Some singular value of A is low • By forcing that singular value to 0, we merge X, Y and Z
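
The example can be checked numerically. The slide elides the tails of X, Y and Z, so the sketch below assumes each vector consists of only the six components shown; with the full vectors the numbers would differ. It stacks the three vectors as rows, looks at the singular values, and then forces the smallest one to zero.

```python
import numpy as np

# Only the six components printed on the slide (the "..." tails are unknown).
X = np.array([1, 1, 0, 0, 1, 0], dtype=float)
Y = np.array([0, 0, 1, 1, 0, 0], dtype=float)
Z = np.array([1, 1, 2, 2, 0, 1], dtype=float)

M = np.vstack([X, Y, Z])
U, s, Vt = np.linalg.svd(M, full_matrices=False)
print("singular values:", s)

# Force the smallest singular value to 0 and rebuild the matrix.
s_forced = s.copy()
s_forced[-1] = 0.0
M_merged = (U * s_forced) @ Vt

rank_before = np.linalg.matrix_rank(M)
rank_after = np.linalg.matrix_rank(M_merged)
print(rank_before, rank_after)
```

Because Z is only approximately X + 2Y, the original matrix has full rank 3 but one singular value is noticeably smaller than the other two. After it is zeroed, the three rows lie in a common two-dimensional subspace, which is the sense in which the slide says X, Y and Z are merged.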

  30. LSI Theory • Under certain assumptions • Corpus has k topics • Each topic has n>l unique terms • Documents can cover multiple topics • 95% of content words in document are on-topic • LSI is guaranteed to separate documents into proper topics • Speedup with random projection [Papadimitriou, Raghavan, Tamaki & Vempala 1998]

  31. Related Techniques • PCA/Factor analysis/Multi-dimensional scaling • Neural nets • Kohonen Maps

  32. Dimensionality Reduction • Dimensionality reduction takes high-dimensional data and re-expresses it in a lower dimension • PCA • If you were only allowed 1 line to represent all the data, what would it be • The one that explains the greatest variance • Recur
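
The "one line that explains the greatest variance" idea can be sketched with an SVD of the mean-centred data. The four two-dimensional points are invented for the example and lie close to the line y = 2x; the function name is an assumption.

```python
import numpy as np


def principal_components(X, n_components):
    """Directions of greatest variance and the variance along each."""
    X_centered = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    variances = s ** 2 / (len(X) - 1)
    return Vt[:n_components], variances[:n_components]


# Hypothetical data: points scattered slightly around y = 2x.
X = np.array([[0.0, 0.1],
              [1.0, 1.9],
              [2.0, 4.2],
              [3.0, 5.8]])

components, variances = principal_components(X, 2)
line_direction = np.array([1.0, 2.0]) / np.sqrt(5.0)
alignment = abs(float(components[0] @ line_direction))
print(components[0], variances, alignment)
```

The first row of `components` is the single best line. The "Recur" step on the slide corresponds to the later rows, each of which is the best line in what remains after the earlier directions are removed.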

  33. PCA cont.

  34. Wacky ideas • Hierarchical concept clustering • Measure spatial deviations • Communication barriers • Language drift • Statistical/Symbolic Hybrids

  35. Hierarchical Concept Clustering • LSI doesn’t handle polysemy well • Find subspaces which separate polysemous words into different clusters • Hopefully those subspaces correspond to topics • Lather, rinse, repeat

  36. Finding Communication Barriers • Want to find terms which have different meanings in different corpora • Judge words by the company they keep • Look for words which are in cohesive clusters in both corpora but the terms in those clusters are different

  37. Communication Barriers cont. • Tried with pro-choice/pro-life corpora • Poor results • Didn’t use cohesive clusters • Not enough data • Highly variable data • Possible fix - start with baseline corpus and measure drift as other corpora are merged in

  38. Tracking Language Drift • Follow changes in clusters as a corpus grows • Hierarchical Agglomerative Clustering may have discontinuities • Use these to mark significant changes

  39. Hybrid Approach • Merge statistical analysis (LSI) with symbolic analysis (MindNet) • Use LSI term similarity metric to assign strengths to MindNet relations • Incorporate syntactic information • Preprocess documents, adding POS or attachment information to words • Time-N Flies-V Like-AVP An-Det Arrow-N
