
CS 479, section 1: Natural Language Processing


Presentation Transcript


1. This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License. CS 479, section 1: Natural Language Processing. Lecture #32: Text Clustering with Expectation Maximization. Thanks to Dan Klein of UC Berkeley for some of the materials used in this lecture.

2. Announcements
  • Homework 0.3
    • Key has been posted (requires authentication)
    • If you haven't submitted this assignment but you're doing Project #4, perhaps it's time to submit what you have and move on.
  • Project #4
    • Note updates to the requirements involving horizontal markovization
  • Reading Report #13: M&S ch. 13 on alignment and MT
    • Due Monday, online again
    • See directions sent to the announcements list
  • Project #5
    • Help session: Tuesday

3. Final Project
  • Proposal prerequisites:
    • Discuss your idea with me before submitting the proposal
    • Identify and possess data
  • Proposal due: Today
  • Project report:
    • Early: Wednesday after Thanksgiving
    • Due: Friday after Thanksgiving

4. Where are we going?
  • Earlier in the semester: classification
    • "Supervised" learning paradigm
    • Joint / generative model: Naïve Bayes
  • Moving into "unsupervised" territory …

5. Objectives
  • Understand the text clustering problem
  • Understand the Expectation Maximization (EM) algorithm for one model
  • See unsupervised learning at work on a real application

6. Clustering Task
  • Given a set of documents, put them into groups so that:
    • Documents in the same group are alike
    • Documents in different groups are different
  • "Unsupervised" learning: no class labels
  • Goal: discover hidden patterns in the data
  • Useful in a range of NLP tasks:
    • Information retrieval (IR)
    • Smoothing
    • Data mining and text mining
    • Exploratory data analysis

7. Clustering Task: Example
  • A toy corpus of short documents over a small "dog"/"cat" vocabulary (one document per item in the original figure):
    • Dog dog dog cat • Canine dog woof • Cat cat cat cat • Dog dog dog dog • Feline cat meow cat • Dog dog cat cat • Dog dog dog dog • Feline cat • Cat dog cat • Dog collar • Dog dog cat • Raining cats and dogs • Dog dog • Puppy dog litter • Cat kitten cub • Kennel dog pound • Kitten cat litter • Dog cat cat cat cat • Year of the dog • Dog cat mouse kitten

8. k-Means Clustering
  • The simplest model-based technique
  • Procedure: pick k initial centroids; assign each document to its nearest centroid; recompute each centroid as the mean of its assigned documents; repeat until the assignments stop changing (see the sketch after this slide)
  • Failure cases: hard assignments and poor initializations can leave k-means stuck in bad local optima
  • (In the slide's figures, each document is plotted as a point by Count(dog) vs. Count(cat).)
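To make the procedure concrete, here is a minimal k-means sketch in Python (with NumPy), assuming the slide's two-dimensional representation in which each document is reduced to its (count of "dog", count of "cat") vector; the toy documents and function names are illustrative, not from the lecture.

    # Minimal k-means sketch on dog/cat count features (not the course's reference code).
    import numpy as np

    def kmeans(points, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # 1. Guess: pick k random points as the initial centroids.
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(iters):
            # 2. Hard assignment: each point goes to its nearest centroid.
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 3. Update: move each centroid to the mean of its assigned points.
            new_centroids = np.array([
                points[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
                for c in range(k)
            ])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids

    docs = ["dog dog dog cat", "canine dog woof", "cat cat cat cat",
            "feline cat meow cat", "dog dog cat cat", "kennel dog pound"]
    features = np.array([[d.split().count("dog"), d.split().count("cat")] for d in docs],
                        dtype=float)
    labels, centroids = kmeans(features, k=2)
    print(labels, centroids)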

9. Demo
  • http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

10. High-Level: Expectation-Maximization (EM)
  • Iterative procedure (a code skeleton follows this slide):
    1. Guess some initial parameters for the model
    2. Use the model to estimate partial label counts for all documents (E-step)
    3. Use the new "complete" data to learn a better model (M-step)
    4. Repeat steps 2 and 3 until convergence
  • k-Means is "hard" EM
    • Or: EM is "soft" k-Means
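A minimal skeleton of this loop, assuming hypothetical e_step, m_step, and log_likelihood helpers (concrete versions for the multinomial mixture appear with slides 21-25); the stopping threshold is an illustrative choice.

    # Skeleton of the iterative procedure above (hypothetical helper functions).
    def run_em(docs, init_params, e_step, m_step, log_likelihood,
               tol=1e-6, max_iters=200):
        params = init_params
        prev_ll = float("-inf")
        for _ in range(max_iters):
            partial_counts = e_step(docs, params)   # soft labels for every document
            params = m_step(partial_counts)         # re-estimate from the "complete" data
            ll = log_likelihood(docs, params)
            if ll - prev_ll < tol:                  # stop when the data likelihood
                break                               # has essentially stopped improving
            prev_ll = ll
        return params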

11. EM
  • The posterior distribution over clusters for each document, computed under the current parameters
  • The total partial (fractional) label counts for each document, based on those posteriors
  • The model parameter estimates, re-computed from the partial counts
  • When to stop?

12. Model
  • Word-token perspective (figure): a class node c_i with arrows to the document's word tokens t_{i,1}, t_{i,2}, …, t_{i,M_i}
  • This is what we called "Naïve Bayes" when we discussed classification.

13. Multinomial Mixture Model
  • Word-token perspective (figure): c_i generating the tokens t_{i,1}, t_{i,2}, …, t_{i,M_i}
  • Word-type perspective (figure): c_i generating a count vector x_i, i.e., the per-type counts x_{i,1}, x_{i,2}, …, x_{i,V}

14. Multinomial Mixture Model
  • (Same two figures as the previous slide, shown again.)

15. Computing Data Likelihood
  • Derivation steps (the slide shows one equation per step): marginalization; factorization; conditional independence; make word types explicit (+ conditional independence); categorical parameters
  • A reconstruction of the derivation follows this slide.
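The equation images themselves are not in the transcript; the following LaTeX is a standard reconstruction of the derivation for the multinomial (Naïve Bayes) mixture, using mixture weights π_c and word probabilities θ_{w|c} as assumed notation.

    % Reconstructed derivation: document d_i has tokens t_{i,1..M_i} and
    % word-type counts x_{i,w}; vocabulary size V; clusters c.
    \begin{align*}
    P(d_i \mid \theta)
      &= \sum_{c} P(c, d_i \mid \theta) && \text{marginalization}\\
      &= \sum_{c} P(c \mid \theta)\, P(d_i \mid c, \theta) && \text{factorization}\\
      &= \sum_{c} P(c \mid \theta) \prod_{j=1}^{M_i} P(t_{i,j} \mid c, \theta) && \text{conditional independence}\\
      &= \sum_{c} P(c \mid \theta) \prod_{w=1}^{V} P(w \mid c, \theta)^{x_{i,w}} && \text{word types made explicit}\\
      &= \sum_{c} \pi_c \prod_{w=1}^{V} \theta_{w \mid c}^{\,x_{i,w}} && \text{categorical parameters}
    \end{align*}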

16. Log Likelihood of the Data
  • We want the probability of the unlabeled data according to a model with parameters θ
  • Derivation steps (the slide shows one equation per step): independence of data; logarithm; log of product; log of sum; log of product
  • A reconstruction follows this slide.
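Again, the equations are not in the transcript; a reconstruction consistent with the previous sketch (data D = {d_1, …, d_N}) is:

    \begin{align*}
    \ell(\theta) = \log P(D \mid \theta)
      &= \log \prod_{i=1}^{N} P(d_i \mid \theta) && \text{independence of the data}\\
      &= \sum_{i=1}^{N} \log P(d_i \mid \theta) && \text{log of product}\\
      &= \sum_{i=1}^{N} \log \sum_{c} \pi_c \prod_{w=1}^{V} \theta_{w \mid c}^{\,x_{i,w}} && \text{log of a sum of products}
    \end{align*}

The inner log-of-a-sum term is what makes the "logsum" computation on the next slide necessary in practice.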

17. logsum
  • For the computation involved in the logsum and its justification, see: https://facwiki.cs.byu.edu/nlp/index.php/Log_Domain_Computations
  • A sketch of the trick follows this slide.
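The linked page is the course's reference; as a quick sketch, the log-sum-exp ("logsum") trick can be written as follows in Python (function name and example values are illustrative):

    # Compute log(sum_i exp(a_i)) without underflow by factoring out the max.
    import math

    def logsum(log_values):
        """log(sum_i exp(log_values[i])), computed stably in the log domain."""
        m = max(log_values)
        if m == float("-inf"):          # all terms have zero probability
            return float("-inf")
        return m + math.log(sum(math.exp(v - m) for v in log_values))

    # e.g., combining two very small probabilities stored as logs:
    print(logsum([-1000.0, -1001.0]))   # ~ -999.69, where a naive exp() would underflow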

18. Likelihood Surface
  • (Figure: the data likelihood plotted as a surface over model parameterizations.)

19. EM Properties
  • Each step of EM is guaranteed to increase the data likelihood: it is a hill-climbing procedure
  • Not guaranteed to find the global maximum of the data likelihood
    • The data likelihood typically has many local maxima for a general model class and rich feature set
    • Many "patterns" in the data that we can fit our model to …
  • Idea: random restarts!
  • (Figure: the data likelihood surface over model parameterizations, as on the previous slide.)

20. EM for Naïve Bayes: E Step
  • (Figure: the toy dog/cat corpus from slide 7, to be soft-labeled in the E-step.)

21. EM for Naïve Bayes: E Step
  • Example: document j = "Dog cat cat cat cat"
  1. Compute the posterior over clusters for document j under the current parameters
  2. Use the posteriors as partial counts, and total them up across documents
  • A sketch of this step follows this slide.
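A sketch of the E-step for a single document under a two-cluster multinomial mixture; the parameter values are made-up numbers for illustration, not values from the slides.

    # E-step for one document: posterior over clusters under the current parameters.
    import math

    def e_step_posteriors(doc_tokens, mix_weights, word_probs):
        """Return P(c | doc) for each cluster c."""
        log_joint = []
        for c, pi_c in enumerate(mix_weights):
            lp = math.log(pi_c)
            for t in doc_tokens:
                lp += math.log(word_probs[c][t])
            log_joint.append(lp)
        # Normalize in the log domain (the "logsum" trick from slide 17).
        m = max(log_joint)
        z = m + math.log(sum(math.exp(lp - m) for lp in log_joint))
        return [math.exp(lp - z) for lp in log_joint]

    doc = ["dog", "cat", "cat", "cat", "cat"]
    mix_weights = [0.5, 0.5]                                # current P(c) guesses
    word_probs = [{"dog": 0.7, "cat": 0.3},                  # cluster 0: "dog-ish"
                  {"dog": 0.2, "cat": 0.8}]                  # cluster 1: "cat-ish"
    print(e_step_posteriors(doc, mix_weights, word_probs))   # most mass goes to the "cat-ish" cluster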

22. EM for Naïve Bayes: M Step
  • Once you have your partial counts, re-estimate the parameters just as you would with regular counts (sketched after this slide)
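A matching sketch of the M-step, re-estimating the mixture weights and per-cluster word distributions from fractional counts; the data structures mirror the E-step sketch above, and the add-α smoothing constant is an assumption, not a value given in the lecture.

    # M-step: treat the posteriors as fractional counts and re-estimate, with smoothing.
    from collections import defaultdict

    def m_step(docs, doc_posteriors, num_clusters, vocab, alpha=1.0):
        cluster_mass = [0.0] * num_clusters
        word_mass = [defaultdict(float) for _ in range(num_clusters)]
        for tokens, post in zip(docs, doc_posteriors):
            for c in range(num_clusters):
                cluster_mass[c] += post[c]
                for t in tokens:
                    word_mass[c][t] += post[c]          # fractional word counts
        total = sum(cluster_mass)
        mix_weights = [m / total for m in cluster_mass]
        word_probs = []
        for c in range(num_clusters):
            denom = sum(word_mass[c].values()) + alpha * len(vocab)
            word_probs.append({w: (word_mass[c][w] + alpha) / denom for w in vocab})
        return mix_weights, word_probs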

23. Etc.
  • In the next E-step, the partial counts will be different, because the parameters have changed (the slide recomputes an example posterior with the new parameters)
  • And the likelihood will increase!

24. EM for Multinomial Mixture Model
  • Initialize the model parameters, e.g., randomly
  • E-step:
    1. Calculate the posterior distribution over clusters for each document x_i
    2. Use the posteriors as fractional labels; total up your new counts!

25. M-Step
  • Re-estimate the mixture weights (the prior over clusters) from the fractional cluster counts
  • Re-estimate the per-cluster word distributions from the fractionally labeled data, with smoothing
  • A complete EM loop combining the E- and M-steps follows this slide.
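Putting the two steps together, here is a compact, self-contained sketch of EM for the multinomial mixture on a toy dog/cat corpus; the cluster count, smoothing value, and iteration budget are illustrative choices, not the course's reference implementation.

    # EM for the multinomial mixture model: alternate soft assignment and re-estimation.
    import math, random
    from collections import defaultdict

    def em_cluster(docs, k=2, alpha=0.1, iters=50, seed=0):
        rng = random.Random(seed)
        vocab = sorted({t for d in docs for t in d})
        # Random initialization of mixture weights and word distributions.
        pi = [1.0 / k] * k
        theta = [{w: rng.random() + 0.5 for w in vocab} for _ in range(k)]
        for c in range(k):
            z = sum(theta[c].values())
            theta[c] = {w: v / z for w, v in theta[c].items()}
        for _ in range(iters):
            # E-step: posterior over clusters for every document (log domain).
            posts = []
            for d in docs:
                logj = [math.log(pi[c]) + sum(math.log(theta[c][t]) for t in d)
                        for c in range(k)]
                m = max(logj)
                z = m + math.log(sum(math.exp(l - m) for l in logj))
                posts.append([math.exp(l - z) for l in logj])
            # M-step: re-estimate pi and theta from fractional counts (smoothed).
            cmass = [sum(p[c] for p in posts) for c in range(k)]
            pi = [max(m / len(docs), 1e-12) for m in cmass]   # tiny floor for safety
            theta = []
            for c in range(k):
                wmass = defaultdict(float)
                for d, p in zip(docs, posts):
                    for t in d:
                        wmass[t] += p[c]
                denom = sum(wmass.values()) + alpha * len(vocab)
                theta.append({w: (wmass[w] + alpha) / denom for w in vocab})
        return posts, pi, theta

    docs = [d.split() for d in
            ["dog dog dog cat", "canine dog woof", "cat cat cat cat",
             "feline cat meow cat", "dog dog dog dog", "kitten cat litter"]]
    posts, pi, theta = em_cluster(docs, k=2)
    for d, p in zip(docs, posts):
        print(" ".join(d), "->", [round(x, 2) for x in p])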

26. EM in General
  • EM is a technique for learning any time we have incomplete data: observed data x with a hidden part c missing
  • Induction scenario (ours): we actually want to know the hidden variable c (e.g., clustering)
  • Convenience scenario: we want the marginal P(x), and including c just makes the model simpler (e.g., mixing weights)

27. General Approach
  • Learn the parameters θ and the hidden completions c
  • E-step: make a guess at the posteriors P(c | x, θ)
    • This means scoring all completions with the current parameters
    • Treat the completions as (fractional) complete data, and count
  • M-step: re-estimate θ to maximize the expected complete-data log likelihood
    • Then compute (smoothed) ML estimates of the model parameters

28. Expectation Maximization (EM) for Mixing Parameters
  • How do we estimate mixing parameters (the weights on the different histories)?
  • Sometimes you can just do line search, as we have discussed
    • … or the "try a few orders of magnitude" approach
  • Alternative: use EM
    • Think of mixing as a hidden choice between histories
    • Given a guess at PH, we can calculate expectations of which generation route a given token took (over held-out data)
    • Use these expectations to update PH; rinse and repeat (a sketch follows this slide)
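A sketch of this idea for a single interpolation weight between two generation routes; the held-out component probabilities below are made-up numbers standing in for, e.g., the longer-history and shorter-history probabilities of each held-out token.

    # EM for one mixing weight lam between two generation routes.
    def em_mixing_weight(p_long, p_short, lam=0.5, iters=100):
        for _ in range(iters):
            # E-step: for each held-out token, the posterior probability that it
            # was generated via the longer-history route.
            posts = [lam * pl / (lam * pl + (1 - lam) * ps)
                     for pl, ps in zip(p_long, p_short)]
            # M-step: the new weight is the average posterior.
            lam = sum(posts) / len(posts)
        return lam

    p_long  = [0.30, 0.001, 0.20, 0.05, 0.0005]   # route-1 probs for 5 held-out tokens
    p_short = [0.05, 0.020, 0.04, 0.03, 0.0100]   # route-2 probs for the same tokens
    print(em_mixing_weight(p_long, p_short))       # learned weight on route 1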

29. Advertisement
  • Learn more about text mining, including text classification, text clustering, topic models, text summarization, and visualization for text mining: register for CS 679, Fall 2012

30. Next
  • Next topics:
    • Machine Translation
    • If time permits, Co-reference Resolution
  • We can use the EM algorithm to attack both of these problems!
