Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand. Adam Carlson
Outline • Discourse Segmentation • LSI Motivation • Math - How to do LSI • Applications • More Math - Why does it work • Wacky Ideas CS590Q W99 - Latent Semantic Indexing - Adam Carlson
Discourse Segmentation • Some collections (like the web) have high variance in document length • Sometimes things like sentences or paragraphs work, sometimes they don’t • Would like to segment documents according to topic
TextTiling • Break document into units of fixed length • Score cohesion between units • Look for patterns of low cohesion surrounded by high cohesion • Indicates change of subject • Found good agreement with human judges • Possible application for LSI measures of coherence
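The TextTiling steps above can be sketched in a few lines. This is a simplified illustration, not Hearst's full algorithm: the unit length, the raw term-count cosine used as the cohesion score, the "lower than both neighbours" boundary rule, and the toy token stream are all assumptions made for the example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count dictionaries."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cohesion_scores(tokens, unit_len=4):
    """Split tokens into fixed-length units and score each adjacent pair."""
    units = [Counter(tokens[i:i + unit_len])
             for i in range(0, len(tokens), unit_len)]
    return [cosine(units[i], units[i + 1]) for i in range(len(units) - 1)]

def boundaries(scores):
    """Indices of gaps whose cohesion is lower than both neighbouring gaps."""
    return [i for i in range(1, len(scores) - 1)
            if scores[i] < scores[i - 1] and scores[i] < scores[i + 1]]

# Made-up token stream: the subject changes halfway through.
tokens = ("cat dog cat dog dog cat dog cat cat cat dog dog "
          "stock bond stock bond bond stock bond stock stock stock bond bond").split()
scores = cohesion_scores(tokens, unit_len=4)
print(scores, boundaries(scores))
```

Swapping the raw term-count cosine for a cosine computed in LSI space is one way to read the "possible application for LSI" bullet.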
Using Co-occurrence Information • Major problems with word-matching • Synonymy (one meaning, many words) • Polysemy (one word, many meanings) • Solutions • Concept search • Query expansion • Clustering • Latent Semantic Indexing (almost)
Latent Semantic Indexing is ... • Latent • Captures underlying structure of corpus • Semantic • Groups words by “conceptual” similarity • Cool • Lots of neat applications • Not a Silver Bullet • Not really semantic, just multi-dimensional scaling (MDS), and expensive
What is LSI • Restructures vector space so that co-occurrences are mapped together • Captures transitive co-occurrence relations • Application of dimensionality reduction to term-document matrix • Throw out da noise, bring in da regularities • Form of clustering
Document vector space
Semantic Space
Singular Value Decomposition
Term-Document Matrix Approximation
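The figures for the last two slides did not survive, so here is a hedged sketch of the standard construction they name: factor the term-document matrix as A = U·D·V^T, keep only the k largest singular values, and multiply back to get a rank-k approximation. The toy matrix, the term labels in the comments, and the helper name `truncated_svd` are made up for illustration.

```python
import numpy as np

# Hypothetical toy term-document matrix: rows are terms, columns are documents.
A = np.array([
    [1, 1, 0, 0],   # "car"
    [1, 0, 0, 0],   # "automobile"
    [0, 1, 0, 0],   # "engine"
    [0, 0, 1, 1],   # "banana"
    [0, 0, 1, 0],   # "fruit"
], dtype=float)

def truncated_svd(A, k):
    """Return U_k, the k largest singular values, and V_k^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

k = 2
Uk, sk, Vkt = truncated_svd(A, k)
A_hat = Uk @ np.diag(sk) @ Vkt   # rank-k approximation of A
print(np.round(A_hat, 2))
```

The approximation has the same shape as A, so it can stand in for A anywhere a term-document matrix is expected.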
Properties of Â • Best least-squares approximation of A given only k dimensions • Terms and documents which were similar in A are more similar in Â • This measure of similarity is transitive • So what can we do with this?
LSI Tricks and Tips • Use Â to query using standard cosine measure • Use U_k·D_k for term similarity • Use D_k·V_k^T for document similarity
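A minimal sketch of these three tips, assuming a made-up five-term, four-document matrix in which "automobile" and "engine" never share a document but both co-occur with "car". The variable names and the toy data are illustrative assumptions, not taken from the talk.

```python
import numpy as np

terms = ["car", "automobile", "engine", "banana", "fruit"]
A = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
], dtype=float)

def cosine(x, y):
    """Standard cosine measure; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom else 0.0

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Dk, Vkt = U[:, :k], np.diag(s[:k]), Vt[:k, :]

term_vecs = Uk @ Dk          # row i represents term i (U_k·D_k)
doc_vecs = (Dk @ Vkt).T      # row j represents document j (columns of D_k·V_k^T)
A_hat = Uk @ Dk @ Vkt        # rank-k approximation used for querying

# Term similarity: raw rows of A versus rows of U_k·D_k.
raw_sim = cosine(A[1], A[2])
lsi_sim = cosine(term_vecs[1], term_vecs[2])

# Document similarity between the first two documents.
doc_sim = cosine(doc_vecs[0], doc_vecs[1])

# Query containing only "automobile", scored against the columns of A_hat.
q = np.array([0, 1, 0, 0, 0], dtype=float)
query_scores = [cosine(q, A_hat[:, j]) for j in range(A_hat.shape[1])]
print(raw_sim, round(lsi_sim, 3), round(doc_sim, 3), np.round(query_scores, 3))
```

In this toy case the second document does not contain "automobile" at all, yet it receives a positive query score, which is the transitive co-occurrence effect described earlier.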
Applications • Information Retrieval: improve retrieval, cross-language retrieval, document routing/filtering, measuring text coherence • Cognitive Science: learning synonyms, subject matter knowledge, word sorting behavior, lexical priming • Education: essay grading, text selection
Standard Vector Space Retrieval in LSI Space • Improves recall at expense of precision • Compared to term-document vector space, SMART and Voorhees [Deerwester et al. 1990] • LSI did best on MED dataset • SMART did best on CISI dataset • but LSI was comparable to SMART when stemming was added
Cross Language Retrieval • Train on multilingual corpora using “combined” documents • Add in single language documents • Query in LSI space • [Landauer & Littman 1990] French & English • [Landauer, Littman & Stornetta 1992] Japanese & English • [Young 1994] Greek & English • [Dumais, Landauer & Littman 1996] Comparisons between LSI, no-LSI and Machine Translation
Document Routing/Filtering • Match reviewers with papers to be reviewed based on reviewers’ publications [Dumais & Nielsen 1992] • Select papers for researchers to read based on other papers they liked [Foltz & Dumais 1992]
LSI goes to college • Train LSI on encyclopedia articles • Test against TOEFL synonym test • Results comparable to (non-native) college applicants [Landauer & Dumais 1996] • Train on introductory Psychology texts • Receive passing grade on multiple-choice questions (but did worse than students) [Landauer, Foltz & Laham 1998]
Essay Grading • Several techniques • Use essay (or sentences from essay) to query into textbook or database of graded essays • Grade based on cosine from text or closest graded essay • More consistent than expert human graders • Is that good? [Landauer, Laham & Foltz 1998]
Routing Meets Education • Run LSA on a bunch of texts at different levels of sophistication • Have student write short essay about topic • Use essay as query to select most appropriate text for student [Wolfe, Schreiner, Rehder, Laham, Foltz, Kintsch and Landauer 1998]
Measuring Text Coherence • Use LSI to compute cosine of each sentence with following one [Foltz, Kintsch & Landauer 1998] • Correlates highly with established methods • Can indicate where coherence breaks down • Can be used to measure how semantic content changes across a text (discourse segmentation?)
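The sentence-to-sentence cosine can be sketched as follows. This is only an illustration under stated assumptions: the training matrix, the three "sentences" (given as term-count vectors over the same five made-up terms), and the choice to project each sentence with U_k^T are all hypothetical, and the cited work may differ in its details.

```python
import numpy as np

# Made-up training matrix: rows are the terms
# car, automobile, engine, banana, fruit; columns are documents.
A = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk = U[:, :2]

def coherence(sentences, Uk):
    """Cosine of each sentence with the following one, in the reduced space."""
    projected = [Uk.T @ v for v in sentences]
    out = []
    for a, b in zip(projected, projected[1:]):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        out.append(float(a @ b / denom) if denom else 0.0)
    return out

sentences = [
    np.array([1, 0, 0, 0, 0], dtype=float),  # "car"
    np.array([0, 1, 1, 0, 0], dtype=float),  # "automobile engine"
    np.array([0, 0, 0, 1, 1], dtype=float),  # "banana fruit"
]
scores = coherence(sentences, Uk)
print(np.round(scores, 3))
```

The first two sentences share no words but land on the same direction in the reduced space, while the jump to the third sentence shows up as a drop in the score, which is the kind of break a discourse segmenter would look for.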
Least Squares Approximation: Why does it work? 1st Attempt • Â is best least-squares approximation to A using just k dimensions
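The least-squares claim is the Eckart-Young theorem: among all rank-k matrices, the truncated SVD has the smallest Frobenius-norm error, and that error equals the root of the sum of the squared discarded singular values. A small numerical check, using an arbitrary random matrix and an arbitrary competing rank-k matrix (both made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 5))                      # arbitrary made-up matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
err = np.linalg.norm(A - A_hat, "fro")

# Error predicted from the singular values that were thrown away.
predicted = np.sqrt(np.sum(s[k:] ** 2))

# A competing rank-k matrix: the least-squares fit of A using only
# the span of its first k columns.
basis = A[:, :k]
B = basis @ np.linalg.lstsq(basis, A, rcond=None)[0]
err_B = np.linalg.norm(A - B, "fro")

print(err, predicted, err_B)
```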
Least Squares cont. • Why does this work? • Are these the regularities we want to capture? • Why approximate at all? (hint: overfitting) • Not very convincing
Neural Network Explanation: Why does it work? 2nd Attempt • Consider fully connected 3-layer network • First layer is terms • Middle layer has k units • Last layer is documents • Weights on hidden layer will adjust to group terms that appear in similar documents and documents containing similar terms • This is analogous to the SVD matrices
Spectral Analysis: Why does it work? 3rd Attempt • Kleinberg’s “Authoritative Sources” • A link provides evidence of authority • Authoritative sources are pointed to by hubs • Hubs point to authoritative sources • Give every page some “weight” • Move weight back and forth across links • Stabilizes with authority and hubs • Equivalent to spectral analysis - eigenstuff
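The "move weight back and forth" procedure can be sketched as a power iteration. This is a minimal version under assumptions: the four-page link graph is made up, the iteration count is arbitrary, and normalising by the Euclidean norm at each step is one common choice. The final lines check the "eigenstuff" claim by comparing the result with the first singular vectors of the link matrix.

```python
import numpy as np

def hits(L, iters=50):
    """L[i, j] = 1 if page i links to page j. Returns (hub, authority) weights."""
    n = L.shape[0]
    hub = np.ones(n)
    auth = np.ones(n)
    for _ in range(iters):
        auth = L.T @ hub              # authorities collect weight from hubs
        auth /= np.linalg.norm(auth)
        hub = L @ auth                # hubs collect weight from authorities
        hub /= np.linalg.norm(hub)
    return hub, auth

# Made-up link graph: pages 0 and 1 link to pages 2 and 3; page 2 links to page 3.
L = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
], dtype=float)

hub, auth = hits(L)
U, s, Vt = np.linalg.svd(L)
print(np.round(hub, 3), np.round(auth, 3))
print(np.round(np.abs(U[:, 0]), 3), np.round(np.abs(Vt[0]), 3))
```

The absolute values are taken only because singular vectors are defined up to sign.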
Spectral Analysis cont. • Co-occurrence instead of authority • Links are documents with the same word • Similar documents have many similar words • Similar words occur in similar documents • Turn Kleinberg crank and get: • Authoritative sources = similar documents • Hubs = words that occur in similar documents • Doesn’t exactly fit (asymmetric)
More Eigenexplanation: Why does it work? 4th Attempt • Rank of a matrix is a measure of how much information it contains • Rows which are linear combinations of each other can be removed • In this case, some singular values will be 0
Eigenvalues cont. • Consider vectors of terms X, Y and Z • X = [1 1 0 0 1 0 ... ] • Y = [0 0 1 1 0 0 ... ] • Z = [1 1 2 2 0 1 ... ] • Z ≈ X + 2Y • Some singular value of A is low • By forcing that singular value to 0, we merge X, Y and Z
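A quick numerical look at this example, assuming that X, Y and Z are stacked as the rows of A and that only the six entries shown on the slide are used (the trailing "..." is dropped):

```python
import numpy as np

# The six entries shown on the slide.
X = np.array([1, 1, 0, 0, 1, 0], dtype=float)
Y = np.array([0, 0, 1, 1, 0, 0], dtype=float)
Z = np.array([1, 1, 2, 2, 0, 1], dtype=float)
A = np.vstack([X, Y, Z])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(s, 2))                 # singular values, largest first

# Force the smallest singular value to 0 and rebuild the matrix.
s_trunc = s.copy()
s_trunc[-1] = 0.0
A_hat = U @ np.diag(s_trunc) @ Vt
rank_after = np.linalg.matrix_rank(A_hat)
print(rank_after)
print(np.round(A_hat, 2))
```

Because Z is only approximately X + 2Y, A has full rank 3 but its smallest singular value is small compared with the other two. After zeroing it, the three rows are exactly linearly dependent, which is the sense in which X, Y and Z get merged.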
LSI Theory • Under certain assumptions • Corpus has k topics • Each topic has n>l unique terms • Documents can cover multiple topics • 95% of content words in document are on-topic • LSI is guaranteed to separate documents into proper topics • Speedup with random projection [Papadimitriou, Raghavan, Tamaki & Vempala 1998]
Related Techniques • PCA/Factor analysis/Multi-dimensional scaling • Neural nets • Kohonen Maps
Dimensionality Reduction • Dimensionality reduction takes high-dimensional data and re-expresses it in a lower dimension • PCA • If you were only allowed 1 line to represent all the data, what would it be? • The one that explains the greatest variance • Recur
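The "one line that explains the greatest variance" can be found with an SVD of the centred data. A minimal sketch, assuming made-up two-dimensional points scattered around the line y = 2x; the helper name `principal_components` is hypothetical. The SVD returns all directions at once, so the "recur" step amounts to reading off the next row.

```python
import numpy as np

def principal_components(X, k):
    """Rows of X are data points. Returns the top-k directions and their variances."""
    Xc = X - X.mean(axis=0)                       # centre the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    variances = s ** 2 / (len(X) - 1)             # variance explained per direction
    return Vt[:k], variances[:k]

# Made-up data: points close to the line y = 2x, plus a little noise.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=200)])

dirs, var = principal_components(X, 2)
print(np.round(dirs, 3), np.round(var, 3))
```

The first direction should come out close to (1, 2)/√5, up to sign, and carry almost all of the variance.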
PCA cont.
Wacky ideas • Hierarchical concept clustering • Measure spatial deviations • Communication barriers • Language drift • Statistical/Symbolic Hybrids
Hierarchical Concept Clustering • LSI doesn’t handle polysemy well • Find subspaces which separate polysemous words into different clusters • Hopefully those subspaces correspond to topics • Lather, rinse, repeat
Finding Communication Barriers • Want to find terms which have different meanings in different corpora • Judge words by the company they keep • Look for words which are in cohesive clusters in both corpora but the terms in those clusters are different
Communication Barriers cont. • Tried with pro-choice/pro-life corpora • Poor results • Didn’t use cohesive clusters • Not enough data • Highly variable data • Possible fix - start with baseline corpus and measure drift as other corpora are merged in
Tracking Language Drift • Follow changes in clusters as a corpus grows • Hierarchical Agglomerative Clustering may have discontinuities • Use these to mark significant changes
Hybrid Approach • Merge statistical analysis (LSI) with symbolic analysis (MindNet) • Use LSI term similarity metric to assign strengths to MindNet relations • Incorporate syntactic information • Preprocess documents, adding POS or attachment information to words • Time-N Flies-V Like-AVP An-Det Arrow-N