
LING / C SC 439/539 Statistical Natural Language Processing


Presentation Transcript


  1. LING / C SC 439/539 Statistical Natural Language Processing • Lecture 27 • 4/24/2013

  2. Recommended reading • Jia Lu’s slides on topic models • http://en.wikipedia.org/wiki/Latent_semantic_analysis • Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman. 1990. “Indexing by Latent Semantic Analysis.” Journal of the American Society for Information Science 41 (6): 391–407. • Thomas Hofmann. 1999. Probabilistic Latent Semantic Analysis. • http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation • David Blei. 2011. Introduction to Probabilistic Topic Models.

  3. Outline • Models of semantic representation • WordNet • Boolean queries in information retrieval • Vector space model of meaning • Latent Semantic Analysis • Probabilistic Latent Semantic Analysis

  4. Semantic representation 1: word-based • Semantic network • Nodes = words • Links = different semantic relations

  5. Semantic representation 2: concept-based, indicated by words • Semantic vector space • Vectors correspond to concepts • Distance = degree of expression of concept • Distance between words = degree of semantic similarity

  6. Semantic representation 3: topic model • Left: probabilistic assignment of words to topics • Right: words by topic, sorted by decreasing probability

  7. Systems for different semantic representations • Semantic network • Lexical database: WordNet • Vector space model • Unsupervised algorithm: Latent Semantic Analysis • Topic model • Unsupervised algorithms: • Probabilistic Latent Semantic Analysis • Latent Dirichlet Allocation (popular, but too advanced for this course)

  8. Outline • Models of semantic representation • WordNet • Boolean queries in information retrieval • Vector space model of meaning • Latent Semantic Analysis • Probabilistic Latent Semantic Analysis

  9. WordNet • http://wordnet.princeton.edu/ • Most widely used hierarchically organized lexical database for English (Fellbaum, 1998) • Other languages: Global WordNet Association • http://www.globalwordnet.org/

  10. Synsets in WordNet • Example synset • { chump, fish, fool, gull, mark, patsy, fall guy, sucker, schlemiel, shlemiel, soft touch, mug } • Definition: “a person who is gullible and easy to take advantage of”.   • A synset defines one sense for each of the words listed in the synset • A word may occur in multiple synsets • Example: “give” has 45 senses
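The same synsets can be inspected programmatically; a minimal sketch, assuming NLTK is installed and its WordNet corpus has been downloaded (exact lemma lists and sense counts depend on the WordNet version):

```python
# Minimal sketch: browsing WordNet synsets with NLTK
from nltk.corpus import wordnet as wn

# Each synset a word participates in corresponds to one sense of that word
print(len(wn.synsets('give')))   # "give" has dozens of senses (45 in the lecture's count)

# A single synset groups the words that share one sense
syn = wn.synsets('chump')[0]
print(syn.lemma_names())         # e.g. ['chump', 'fool', 'gull', 'mark', 'patsy', ...]
print(syn.definition())          # "a person who is gullible and easy to take advantage of"
```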

  11. Format of WordNet entries

  12. Distribution of senses among WordNet verbs

  13. Lexical Relations in WordNet

  14. Hypernymy in WordNet

  15. Outline • Models of semantic representation • WordNet • Boolean queries in information retrieval • Vector space model of meaning • Latent Semantic Analysis • Probabilistic Latent Semantic Analysis

  16. Information retrieval • Web searching, library catalog lookup • Given a query and a collection of documents, retrieve documents that are relevant to the query • Problem: determining the relevant documents

  17. Boolean queries • Example: “paris” AND “hotel” AND NOT “hilton” • Match terms: return documents with same words as in query

  18. Example Boolean query • Q: “Light waves” • D1: “Particle and wave models of light” • D2: “Surfing on the waves under star lights” • D3: “Electro-magnetic models for photons” • x: document contains the word • x’: document contains the word, and the word appears in the query • REL = relevant document • MATCH = document is matched by the query (document contains all words in the query)
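Boolean AND matching amounts to checking that every query term appears in a document; a minimal sketch using the example documents above, where tokenization is just lowercasing and splitting (which is itself part of why strict matching fails here):

```python
# Minimal sketch of Boolean AND matching: a document matches
# only if it contains every word in the query.
docs = {
    'D1': "Particle and wave models of light",
    'D2': "Surfing on the waves under star lights",
    'D3': "Electro-magnetic models for photons",
}

def tokens(text):
    return set(text.lower().split())

def boolean_and_match(query, documents):
    q = tokens(query)
    return [name for name, text in documents.items() if q <= tokens(text)]

print(boolean_and_match("light waves", docs))   # []
# D1 has "light" but not "waves"; D2 has "waves" but "lights", not "light";
# strict term matching returns nothing, and D3 (relevant) shares no query words at all.
```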

  19. Problems with Boolean queries • Too strict, only finds documents with specific words searched for • Doesn’t consider word ambiguity • Retrieval of documents isn’t probabilistic • Assigns equal importance to all words in query • Syntax is hard for non-technical users

  20. Precision and recall in information retrieval • Retrieve documents relevant to a query • Figure: http://nltk.googlecode.com/svn/trunk/doc/images/precision-recall.png
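Precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that are retrieved. A minimal sketch with hypothetical document IDs:

```python
# Precision = |retrieved ∩ relevant| / |retrieved|
# Recall    = |retrieved ∩ relevant| / |relevant|
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 documents retrieved, 5 actually relevant, 3 in common
print(precision_recall({'d1', 'd2', 'd3', 'd7'}, {'d1', 'd2', 'd3', 'd4', 'd5'}))
# (0.75, 0.6)
```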

  21. Synonymy and polysemy • Polysemy • Words with different meanings: model, python, chip • Term matching returns irrelevant documents • Lowers precision • Synonymy • Many ways to refer to the same object: car, automobile • Lowers recall: many relevant documents are not matched by the words in the query

  22. Synonymy, polysemy, and document similarity • Synonymy: two documents can share few words but still be related (one uses auto, engine, bonnet, tyres, lorry, boot; the other uses car, emissions, hood, make, model, trunk) • Polysemy: two documents can share many words (make, model, emissions) yet not be related (one is about cars, the other about hidden Markov models and normalizing emission probabilities)

  23. Questions for information retrieval • How do we identify documents that are relevant, but don’t contain words in our query? • Given a query, how do we rule out a document that has matching words, but that is irrelevant?

  24. Topics • Documents are about topics, rather than just specific words • e.g. sports, computers, cars, politics • Many words can belong to a given topic • Variability in word use: an author chooses a particular subset of them in writing a document

  25. Document retrieval based on topics • Perform document retrieval according to topics, rather than just the words in queries • Input: words in a query • Use query words to determine the “topic” that the user wants to look up • Figure out “topic” of a query, then return documents on that topic • How do we get topics? • Computational problem: induction of semantic topics

  26. Outline • Models of semantic representation • WordNet • Boolean queries in information retrieval • Vector space model of meaning • Latent Semantic Analysis • Probabilistic Latent Semantic Analysis

  27. Word vector space model of meaning • Meaning is represented in a high-dimensional space • Each dimension is associated with a single word • Huge number of dimensions • Bag of words model: • Count the frequency of each word in a document • Ignore the location of words in the document • Ignore syntactic structure • Remove stopwords (high-frequency function words): the, of, to, around, by, at, … • Represent a document as a point in this space
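A minimal bag-of-words sketch, with a toy stopword list standing in for a real one:

```python
from collections import Counter

STOPWORDS = {'the', 'of', 'to', 'around', 'by', 'at', 'a', 'and', 'in', 'on'}  # toy list

def bag_of_words(text):
    # Lowercase, strip punctuation, split on whitespace, drop stopwords;
    # word order and syntax are ignored.
    words = [w.strip('.,:;!?').lower() for w in text.split()]
    return Counter(w for w in words if w and w not in STOPWORDS)

print(bag_of_words("The intersection graph of paths in trees"))
# Counter({'intersection': 1, 'graph': 1, 'paths': 1, 'trees': 1})
```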

  28. Term-document co-occurrence matrix • N documents, vocabulary size M • Generate a word-document co-occurrence matrix W of size M x N • W_ij = number of times word w_i occurs in document d_j • Rows are indexed by words w_1 … w_M, columns by documents d_1 … d_N
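A minimal sketch of building W from raw documents (tokenization is just lowercasing and splitting; no stopword removal here):

```python
import numpy as np
from collections import Counter

def term_document_matrix(docs):
    # Rows = words (M), columns = documents (N); W[i, j] = count of word i in document j
    counts = [Counter(d.lower().split()) for d in docs]
    vocab = sorted(set(w for c in counts for w in c))
    W = np.zeros((len(vocab), len(docs)), dtype=int)
    for j, c in enumerate(counts):
        for i, w in enumerate(vocab):
            W[i, j] = c[w]
    return vocab, W

vocab, W = term_document_matrix(["human machine interface", "human system system"])
print(vocab)   # ['human', 'interface', 'machine', 'system']
print(W)       # columns are the two documents
```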

  29. Term-document matrix structure • Create a corpus from newsgroups • Build a term × document matrix • Example: • 100 documents each from 3 different newsgroups • 300 total documents, 12,418 distinct terms • Remove words on a standard stopword list • Matrix is sparse: only 8% of the 12,418 x 300 cells are filled

  30. Terms sorted by rank within topic

  31. Cosine similarity between each pair of documents
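Cosine similarity compares the directions of two document vectors, ignoring their lengths. A minimal sketch over the columns of a term-document matrix W (assumes no all-zero document columns):

```python
import numpy as np

def cosine(u, v):
    # cos(u, v) = (u · v) / (||u|| ||v||); 1 = same direction, 0 = no shared terms
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pairwise_cosine(W):
    # d x d matrix of cosine similarities between all pairs of document columns
    norms = np.linalg.norm(W, axis=0)
    return (W.T @ W) / np.outer(norms, norms)
```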

  32. Topic discovery • Topic of document can be distinguished by words in document • However, this assumes that topics of documents are already known • Can we use frequencies of words in documents to discover the topics? • Individual documents tend to be about one particular topic

  33. Want to learn a vector space model of concepts • N-dimensional space, each dimension corresponding to an abstract concept, rather than a single word • Location of a word indicates strength of association with concept dimensions

  34. Reduce dimension of matrix • Want a vector space model where dimensions correspond to topics instead of individual words • Dimensionality reduction • Reduce the size of the representation from O(10,000) words to O(100) topics • Algorithms: • Latent Semantic Analysis (LSA) / Singular Value Decomposition (SVD) • Probabilistic Latent Semantic Analysis (PLSA) • Latent Dirichlet Allocation (LDA)

  35. Outline • Models of semantic representation • WordNet • Boolean queries in information retrieval • Vector space model of meaning • Latent Semantic Analysis • Probabilistic Latent Semantic Analysis

  36. Latent Semantic Analysis (LSA) • http://en.wikipedia.org/wiki/Latent_semantic_analysis • http://lsa.colorado.edu/ • LSA is an unsupervised algorithm for discovering topics in a corpus of documents • Idea (Deerwester et al. 1990): “We would like a representation in which a set of terms, which by itself is incomplete and unreliable evidence of the relevance of a given document, is replaced by some other set of entities which are more reliable indicants. We take advantage of the implicit higher-order (or latent) structure in the association of terms and documents to reveal such relationships.”

  37. Basic steps in LSA • Perform singular value decomposition (SVD) on the term-document co-occurrence matrix • http://en.wikipedia.org/wiki/Singular_value_decomposition • (a full understanding requires some linear algebra) • SVD produces 3 matrices that reproduce the original exactly when multiplied together • Each of the inner dimensions corresponds to a topic • Set all but the k highest singular values to 0 • This produces the best rank-k approximation of the original matrix • These are the k most relevant topics • Documents can then be represented in topic space, instead of word space
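A minimal sketch of these steps using NumPy's SVD (numpy.linalg.svd returns the right singular vectors already transposed); the variable names are chosen here for readability, not taken from the original slides:

```python
import numpy as np

def lsa(X, k):
    """Minimal LSA sketch: SVD of the term-document matrix X (t x d),
    keeping only the top k singular values/vectors."""
    T, s, Dt = np.linalg.svd(X, full_matrices=False)  # X = T @ diag(s) @ Dt
    T_k = T[:, :k]                # term-topic matrix      (t x k)
    S_k = np.diag(s[:k])          # top singular values    (k x k)
    D_k = Dt[:k, :].T             # document-topic matrix  (d x k)
    doc_topic_space = D_k @ S_k   # documents as points in k-dimensional topic space
    return T_k, S_k, D_k, doc_topic_space
```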

  38. Example: technical memo titles • c1: Human machine interface for ABC computer applications • c2: A survey of user opinion of computer system response time • c3: The EPS user interface management system • c4: System and human system engineering testing of EPS • c5: Relation of user perceived response time to error measurement • m1: The generation of random, binary, ordered trees • m2: The intersection graph of paths in trees • m3: Graph minors IV: Widths of trees and well-quasi-ordering • m4: Graph minors: A survey

  39. Term-document co-occurrence matrix

  40. Negative correlations between both related terms and unrelated terms • r(human, user) = -.38 • r(human, minors) = -.29

  41. Matrix multiplication • http://en.wikipedia.org/wiki/Matrix_multiplication

  42. Variables involved • t = number of terms • d = number of documents • Term-document matrix: size t x d • m = rank of the term-document matrix, at most min(t, d) • The maximum number of topics there could possibly be • k = number of singular values kept • The number of top topics that you select

  43. Singular Value Decomposition (SVD) • Input: • X: term-document matrix, of size t x d • Output of SVD: • T0: term-topic matrix, t x m • S0: singular value matrix, m x m • D0’: topic-document matrix, m x d • The original matrix can be exactly recovered from the factored matrices through matrix multiplication • X = T0 S0 D0’ • Size (t x m) * (m x m) * (m x d) = size t x d
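A quick check of the shapes and the exact reconstruction on a random toy matrix, assuming NumPy:

```python
import numpy as np

X = np.random.rand(12, 9)                           # toy t x d matrix
T0, s, D0t = np.linalg.svd(X, full_matrices=False)  # D0t is D0', i.e. D0 transposed
S0 = np.diag(s)
print(np.allclose(X, T0 @ S0 @ D0t))                # True: X = T0 S0 D0'
print(T0.shape, S0.shape, D0t.shape)                # (t, m), (m, m), (m, d), m = min(t, d)
```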

  44. Singular Value Decomposition • http://en.wikipedia.org/wiki/Singular_value_decomposition

  45. SVD approximation • The original matrix can be exactly recovered from the factored matrices through multiplication • Approximation of the original matrix: • Select the top k singular values (top k dimensions), remove all other dimensions • Multiplication of the reduced matrices approximates the original • Since the top singular values were selected, this is the best rank-k approximation of X
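A sketch of the truncation step: keep only the top k singular values and measure how far the rank-k product is from the original (the Frobenius error shrinks as k grows and vanishes at full rank):

```python
import numpy as np

def rank_k_approx(X, k):
    # Keep the top k singular values/vectors and multiply the reduced matrices back together
    T, s, Dt = np.linalg.svd(X, full_matrices=False)
    return T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]

X = np.random.rand(12, 9)
for k in (1, 3, 9):
    err = np.linalg.norm(X - rank_k_approx(X, k), 'fro')
    print(k, round(err, 4))   # error decreases with k; k = min(t, d) recovers X (up to rounding)
```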

  46. Choose top k dimensions

  47. Another picture of reduced SVD • SVD: A = U Σ V^T • After dimensionality reduction: A ≈ Ũ Σ̃ Ṽ^T, where Ũ, Σ̃, Ṽ keep only the top k columns/values of U, Σ, V

  48. SVD approximation leads to smaller representation • t * d > ( t * k ) + ( k * k ) + ( k * d ) • If k << m, substantial savings in the size of the representation (m is the rank of the original matrix) • Example: t = 5,000, d = 1,000, k = 50 • t * d = 5,000,000 • But ( t * k ) + ( k * k ) + ( k * d ) = 250,000 + 2,500 + 50,000 = 302,500
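The arithmetic from the example, checked directly:

```python
t, d, k = 5000, 1000, 50
full = t * d                      # size of the original term-document matrix
reduced = t * k + k * k + k * d   # size of the three truncated SVD factors
print(full, reduced)              # 5000000 vs 302500
print(round(reduced / full, 3))   # ~0.06, roughly a 16x reduction
```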
