
Generative Topic Models for Community Analysis


Presentation Transcript


  1. Generative Topic Models for Community Analysis Pilfered from: Ramesh Nallapati http://www.cs.cmu.edu/~wcohen/10-802/lda-sep-18.ppt

  2. Objectives • Cultural literacy for ML: • Q: What are “topic models”? • A1: popular indoor sport for machine learning researchers • A2: a particular way of applying unsupervised learning of Bayes nets to text • Quick historical survey of some sample papers in the area

  3. Outline • Part I: Introduction to Topic Models • Naive Bayes model • Mixture Models • Expectation Maximization • PLSA • LDA • Variational EM • Gibbs Sampling • Part II: Topic Models for Community Analysis • Citation modeling with PLSA • Citation Modeling with LDA • Author Topic Model • Author Topic Recipient Model • Modeling influence of Citations • Mixed membership Stochastic Block Model

  4. Introduction to Topic Models • Multinomial Naïve Bayes • For each document d = 1,…,M: generate class C_d ~ Mult(· | π) • For each position n = 1,…,N_d: generate w_n ~ Mult(· | β, C_d) [Plate diagram: class node C governing word nodes W1 … WN, repeated over M documents, with word parameters β]
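
As a concrete illustration, here is a minimal NumPy sketch of this generative story; the class prior pi, the word distributions beta, and all sizes are toy values of my own, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

V, C, M, N_d = 5, 2, 3, 8                 # vocab size, classes, documents, doc length (toy)
pi = np.array([0.6, 0.4])                 # class prior (made up)
beta = rng.dirichlet(np.ones(V), size=C)  # one word distribution per class

docs = []
for d in range(M):
    c = rng.choice(C, p=pi)                     # C_d ~ Mult(. | pi)
    words = rng.choice(V, size=N_d, p=beta[c])  # each w_n ~ Mult(. | beta, C_d)
    docs.append((c, words))
```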

  5. Introduction to Topic Models • Naïve Bayes model: compact representation [Plate diagrams: the expanded model with word nodes W1 … WN collapses to a single W node inside an N-plate, nested in the M-document plate, with parameters β]

  6. Introduction to Topic Models • Mixture model: unsupervised naïve Bayes model • Joint probability of words and classes: P(d, c) = π_c ∏_{n=1..N_d} β_{c,w_n} • But classes are not visible, so learning maximizes the marginal P(d) = Σ_c π_c ∏_{n=1..N_d} β_{c,w_n} [Plate diagram: latent class z generating words w, plates N and M, parameter β]
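
Because the class is latent, learning (by EM) maximizes this marginal likelihood. A small sketch of the marginal for one document, computed stably in log space, with toy numbers (all names are mine):

```python
import numpy as np

def doc_log_likelihood(words, pi, beta):
    """log p(w_1..w_N) with the class marginalized out:
    log sum_c pi_c * prod_n beta[c, w_n]."""
    log_p_given_c = np.log(beta[:, words]).sum(axis=1)  # log p(doc | c) for each class
    return np.logaddexp.reduce(np.log(pi) + log_p_given_c)

# toy check: 2 classes over a 3-word vocabulary
pi = np.array([0.5, 0.5])
beta = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.2, 0.7]])
print(doc_log_likelihood([0, 0, 2], pi, beta))
```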

  7. Introduction to Topic Models

  8. Introduction to Topic Models • Probabilistic Latent Semantic Analysis (PLSA) model • Select document d ~ Mult(π) • For each position n = 1,…,N_d: generate z_n ~ Mult(· | θ_d), then generate w_n ~ Mult(· | φ_{z_n}) [Plate diagram: d → z → w inside the N and M plates; θ_d is the per-document topic distribution]
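
A matching sketch of PLSA's generative story; note that `theta` is a lookup table indexed by training document, which is exactly why PLSA cannot generate a new document (toy sizes, assumed names):

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, M, N_d = 5, 3, 4, 6                  # toy sizes
pi = np.ones(M) / M                        # document prior
theta = rng.dirichlet(np.ones(K), size=M)  # topic mixture per *training* document
phi = rng.dirichlet(np.ones(V), size=K)    # word distribution per topic

d = rng.choice(M, p=pi)                    # select document d ~ Mult(pi)
words = []
for n in range(N_d):
    z = rng.choice(K, p=theta[d])          # z_n ~ Mult(. | theta_d)
    words.append(rng.choice(V, p=phi[z]))  # w_n ~ Mult(. | phi_{z_n})
```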

  9. Introduction to Topic Models • Probabilistic Latent Semantic Analysis model • Learning using EM • Not a complete generative model • Has a distribution π over the training set of documents: no new document can be generated! • Nevertheless, more realistic than the mixture model • Documents can discuss multiple topics!

  10. Introduction to Topic Models • PLSA topics (TDT-1 corpus)

  11. Introduction to Topic Models

  12. Introduction to Topic Models • Latent Dirichlet Allocation (LDA) • For each document d = 1,…,M: generate θ_d ~ Dir(· | α) • For each position n = 1,…,N_d: generate z_n ~ Mult(· | θ_d), then generate w_n ~ Mult(· | φ_{z_n}) [Plate diagram: α → θ → z → w inside the N and M plates]
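
The same sketch for LDA; the only change from the PLSA sketch above is that θ_d is now drawn from a Dirichlet prior instead of looked up per training document, so arbitrary new documents can be generated (toy values, assumed names):

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, M, N_d, alpha = 5, 3, 4, 6, 0.1
phi = rng.dirichlet(np.ones(V), size=K)          # word distribution per topic

corpus = []
for d in range(M):
    theta_d = rng.dirichlet(alpha * np.ones(K))  # theta_d ~ Dir(alpha): per-document draw
    words = []
    for n in range(N_d):
        z = rng.choice(K, p=theta_d)             # z_n ~ Mult(. | theta_d)
        words.append(rng.choice(V, p=phi[z]))    # w_n ~ Mult(. | phi_{z_n})
    corpus.append(words)
```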

  13. Introduction to Topic Models • Latent Dirichlet Allocation • Overcomes the issues with PLSA • Can generate new, unseen documents • Parameter learning: • Variational EM • Numerical approximation using lower bounds • Results in biased solutions • Convergence has numerical guarantees • Gibbs sampling • Stochastic simulation • Unbiased solutions • Stochastic convergence

  14. Introduction to Topic Models • Variational EM for LDA • Approximate the true posterior by a simpler, factorized distribution q • Maximize the resulting lower bound on the likelihood, a convex function in each parameter!
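
The equations on this slide were lost in extraction; the standard mean-field bound from Blei et al. (2003), presumably what the slide shows, is:

```latex
\log p(\mathbf{w} \mid \alpha, \beta)
  \;\ge\;
  \mathbb{E}_q\bigl[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)\bigr]
  - \mathbb{E}_q\bigl[\log q(\theta, \mathbf{z})\bigr],
\qquad
q(\theta, \mathbf{z}) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)
```

EM then alternates: maximize the bound over the per-document variational parameters (γ, φ_n) in the E-step, then over the model parameters (α, β) in the M-step.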

  15. Introduction to Topic Models • Gibbs sampling • Applicable when the joint distribution is hard to sample from directly but each variable's conditional distribution is known • The sequence of samples comprises a Markov chain • The stationary distribution of the chain is the joint distribution
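
For LDA in particular, the conditionals have a closed form, giving the well-known collapsed Gibbs sampler. A minimal sketch, assuming count tables already initialized from a random assignment (all variable names are mine, not from the slides):

```python
import numpy as np

def gibbs_sweep(docs, z, n_dk, n_kw, n_k, alpha, beta, rng):
    """One sweep of collapsed Gibbs sampling for LDA.
    docs: list of documents, each a list of word ids.
    z:    parallel list of current topic assignments.
    n_dk, n_kw, n_k: doc-topic, topic-word, and per-topic count tables."""
    K, V = n_kw.shape
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            k = z[d][i]
            # remove token i from all count tables
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            # p(z_i = k | z_-i, w) up to a constant:
            # (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k = rng.choice(K, p=p / p.sum())
            # add the token back under its new topic
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
            z[d][i] = k
```

Each full sweep is one step of the Markov chain; after burn-in, the samples of z approximate the posterior over topic assignments.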

  16. Introduction to Topic Models • LDA topics

  17. Introduction to Topic Models • LDA’s view of a document

  18. Introduction to Topic Models • Perplexity comparison of various models (lower is better) [Plot: perplexity curves for the unigram model, mixture model, PLSA, and LDA]

  19. Outline • Part I: Introduction to Topic Models • Naive Bayes model • Mixture Models • Expectation Maximization • PLSA • LDA • Variational EM • Gibbs Sampling • Part II: Topic Models for Community Analysis • Citation modeling with PLSA • Citation Modeling with LDA • Author Topic Model • Author Topic Recipient Model • Modeling influence of Citations • Mixed membership Stochastic Block Model

  20. Hyperlink modeling using PLSA

  21. Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001] • Select document d ~ Mult(π) • For each position n = 1,…,N_d: generate z_n ~ Mult(· | θ_d), then generate w_n ~ Mult(· | φ_{z_n}) • For each citation j = 1,…,L_d: generate z_j ~ Mult(· | θ_d), then generate c_j ~ Mult(· | γ_{z_j}) [Plate diagram: the document's topic distribution θ_d feeds both a word plate of size N and a citation plate of size L, with parameters φ and γ]
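
A hedged sketch of this extended generative story; `gamma` maps topics to distributions over citable documents (all names and sizes are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, M, N_d, L_d = 5, 3, 4, 6, 2
theta = rng.dirichlet(np.ones(K), size=M)  # per-document topic mixtures
phi = rng.dirichlet(np.ones(V), size=K)    # topic -> word distributions
gamma = rng.dirichlet(np.ones(M), size=K)  # topic -> cited-document distributions

d = rng.choice(M)                          # d ~ Mult(pi), uniform here for simplicity
words = [rng.choice(V, p=phi[rng.choice(K, p=theta[d])]) for _ in range(N_d)]    # z_n then w_n
cites = [rng.choice(M, p=gamma[rng.choice(K, p=theta[d])]) for _ in range(L_d)]  # z_j then c_j
```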

  22. Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001] • PLSA likelihood: L = ∏_d ∏_{n=1..N_d} Σ_z P(z | θ_d) P(w_n | z) • New likelihood adds a matching citation term: ∏_d ∏_{j=1..L_d} Σ_z P(z | θ_d) P(c_j | z) • Learning using EM

  23. Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001] • Heuristic: weight the content likelihood by α and the hyperlink likelihood by (1 − α), where 0 ≤ α ≤ 1 determines the relative importance of content and hyperlinks
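
In code the heuristic is just a convex combination of the two log-likelihood terms; a trivial sketch (function and argument names are mine):

```python
def joint_log_likelihood(ll_content, ll_links, alpha=0.5):
    """Convex combination of content and hyperlink log-likelihoods;
    alpha = 1 ignores links, alpha = 0 ignores words."""
    assert 0.0 <= alpha <= 1.0
    return alpha * ll_content + (1.0 - alpha) * ll_links
```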

  24. Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001] • Classification performance [Plots: performance using content alone vs. hyperlinks alone]

  25. Hyperlink modeling using LDA

  26. Hyperlink modeling using LDA [Erosheva, Fienberg, Lafferty, PNAS, 2004] • For each document d = 1,…,M: generate θ_d ~ Dir(· | α) • For each position n = 1,…,N_d: generate z_n ~ Mult(· | θ_d), then generate w_n ~ Mult(· | φ_{z_n}) • For each citation j = 1,…,L_d: generate z_j ~ Mult(· | θ_d), then generate c_j ~ Mult(· | γ_{z_j}) • Learning using variational EM [Plate diagram: α → θ_d, with a word plate of size N and a citation plate of size L per document]
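
Relative to the Cohn-Hofmann sketch above, the only generative change is the Dirichlet draw for θ_d; a minimal illustration (assumed names and toy sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, M, N_d, L_d, alpha = 5, 3, 4, 6, 2, 0.1
phi = rng.dirichlet(np.ones(V), size=K)      # topic -> word distributions
gamma = rng.dirichlet(np.ones(M), size=K)    # topic -> cited-document distributions

theta_d = rng.dirichlet(alpha * np.ones(K))  # theta_d ~ Dir(alpha): now a proper prior
words = [rng.choice(V, p=phi[rng.choice(K, p=theta_d)]) for _ in range(N_d)]
cites = [rng.choice(M, p=gamma[rng.choice(K, p=theta_d)]) for _ in range(L_d)]
```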

  27. Hyperlink modeling using LDA[Erosheva, Fienberg, Lafferty, PNAS, 2004]

  28. Author-Topic Model for Scientific Literature

  29. Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004] • For each author a = 1,…,A: generate θ_a ~ Dir(· | α) • For each topic k = 1,…,K: generate φ_k ~ Dir(· | β) • For each document d = 1,…,M, for each position n = 1,…,N_d: generate an author x ~ Unif(· | a_d) from the document's author list, generate z_n ~ Mult(· | θ_x), generate w_n ~ Mult(· | φ_{z_n}) [Plate diagram: author plate A, topic plate K, word plate N inside document plate M; hyperparameters α, β]
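
A minimal sketch of the Author-Topic generative story for one document; the author list, sizes, and names are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, A, N_d = 5, 3, 2, 6
theta = rng.dirichlet(np.ones(K), size=A)  # theta_a ~ Dir(alpha), one per author
phi = rng.dirichlet(np.ones(V), size=K)    # phi_k ~ Dir(beta), one per topic

a_d = [0, 1]                               # the document's author list (toy)
words = []
for n in range(N_d):
    x = rng.choice(a_d)                    # x ~ Unif(a_d): pick a responsible author
    z = rng.choice(K, p=theta[x])          # z_n ~ Mult(. | theta_x)
    words.append(rng.choice(V, p=phi[z]))  # w_n ~ Mult(. | phi_{z_n})
```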

  30. Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004] • Learning: Gibbs sampling [Same plate diagram as the previous slide]

  31. Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004] • Topic-Author visualization

  32. Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]

  33. Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05] • Gibbs sampling

  34. Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05] • Datasets • Enron email data: 23,488 messages between 147 users • McCallum’s personal email: 23,488(?) messages with 128 authors

  35. Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05] • Topic Visualization: Enron set

  36. Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05] • Topic Visualization: McCallum’s data

  37. Modeling Citation Influences

  38. Modeling Citation Influences[Dietz, Bickel, Scheffer, ICML 2007] • Citation influence model

  39. Modeling Citation Influences[Dietz, Bickel, Scheffer, ICML 2007] • Citation influence graph for LDA paper

  40. Modeling Citation Influences[Dietz, Bickel, Scheffer, ICML 2007] • Words in LDA paper assigned to citations

  41. Link-PLSA-LDA: Topic Influence in Blogs (ICWSM 2008) Ramesh Nallapati, Amr Ahmed, Eric Xing
