
Text Mining and Topic Modeling


Presentation Transcript


  1. Text Mining and Topic Modeling Padhraic Smyth, Department of Computer Science, University of California, Irvine

  2. Progress Report New deadline • In class, Thursday February 18th (not Tuesday) Outline • 3 to 5 pages maximum • Suggested content • Brief restatement of the problem you are addressing (no need to repeat everything in your original proposal), e.g., ½ a page • Summary of progress so far • Background papers read • Data preprocessing, exploratory data analysis • Algorithms/Software reviewed, implemented, tested • Initial results (if any) • Challenges and difficulties encountered • Brief outline of plans between now and end of quarter • Use diagrams, figures, tables, where possible • Write clearly: check what you write

  3. Road Map Topics covered • Exploratory data analysis and visualization • Regression • Classification • Text classification Yet to come…. • Unsupervised learning with text (topic models) • Social networks • Recommender systems (including Netflix) • Mining of Web data

  4. Text Mining • Document classification • Information extraction • Named-entity extraction: recognize names of people, places, genes, etc. • Document summarization • Google news, Newsblaster (http://www1.cs.columbia.edu/nlp/newsblaster/) • Document clustering • Topic modeling • Representing documents as mixtures of topics • And many more…

  5. Named-Entity Extraction • Often a combination of • Knowledge-based approach (rules, parsers) • Machine learning (e.g., hidden Markov model) • Dictionary • Non-trivial since entity names can be confused with ordinary words or other abbreviations • E.g., the gene name ABS and the abbreviation ABS • Also can look for co-references • E.g., “IBM today…… Later, the company announced…..” • Useful as a preprocessing step for data mining, e.g., use entity names to train a classifier to predict the category of an article

  6. Example: GATE/ANNIE extractor • GATE: free software infrastructure for text analysis (University of Sheffield, UK) • ANNIE: widely used entity-recognizer, part of GATE http://www.gate.ac.uk/annie/

  7. Information Extraction From Seymore, McCallum, Rosenfeld, Learning Hidden Markov Model Structure for Information Extraction, AAAI 1999

  8. Topic Models • Background on graphical models • Unsupervised learning from text documents • Motivation • Topic model and learning algorithm • Results • Extensions • Topics over time, author topic models, etc

  9. Pennsylvania Gazette 1728-1800 80,000 articles 25 million words www.accessible.com

  10. Enron email data 250,000 emails 28,000 authors 1999-2002

  11. Other Examples of Large Corpora • CiteSeer digital collection: • 700,000 papers, 700,000 authors, 1986-2005 • MEDLINE collection • 16 million abstracts in medicine/biology • US Patent collection • and many more....

  12. Unsupervised Learning from Text • Large collections of unlabeled documents.. • Web • Digital libraries • Email archives, etc • Often wish to organize/summarize/index/tag these documents automatically • We will look at probabilistic techniques for clustering and topic extraction from sets of documents

  13. Problems of Interest • What topics do these documents “span”? • Which documents are about a particular topic? • How have topics changed over time? • What does author X write about? • Who is likely to write about topic Y? • Who wrote this specific document? • and so on…..

  14. Review Slides on Graphical Models

  15. Multinomial Models for Documents • Example: 50,000 possible words in our vocabulary • Simple memoryless model • 50,000-sided die • a non-uniform die: each side/word has its own probability • to generate N words we toss the die N times • This is a simple probability model: • p(document | φ) = ∏ p(word_i | φ) • to "learn" the model we just count frequencies • p(word i) = number of occurrences of i / total number • Typically interested in conditional multinomials, e.g., • p(words | spam) versus p(words | non-spam)
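Not part of the original slides: a minimal Python sketch of this memoryless multinomial model. The function names and the toy spam/non-spam training strings are illustrative assumptions, not the lecture's code.

```python
# Minimal sketch of the memoryless multinomial document model:
# estimate p(word) by counting, then score a document as the product of
# its word probabilities (computed in log space for numerical stability).
import math
from collections import Counter

def estimate_multinomial(training_words):
    """p(word) = count(word) / total number of word tokens."""
    counts = Counter(training_words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def log_prob_document(document_words, phi):
    """log p(document | phi) = sum_i log p(word_i | phi)."""
    return sum(math.log(phi[w]) for w in document_words)

# Toy usage: a conditional multinomial p(words | spam)
spam_phi = estimate_multinomial("free money free offer money".split())
print(log_prob_document("free money".split(), spam_phi))
# Note: a word never seen in training would get probability 0 here,
# which is what the Dirichlet smoothing on the next slides addresses.
```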

  16. Real examples of Word Multinomials

  17. A Graphical Model for Multinomials p(doc | φ) = ∏ p(w_i | φ) • φ = "parameter vector" = set of probabilities, one per word • [Diagram: node φ with arrows to word nodes w1, w2, …, wn, each generated with p(w | φ)]

  18. Another view.... p(doc | φ) = ∏ p(w_i | φ) • This is “plate notation” • Items inside the plate are conditionally independent given the variable outside the plate • There are “n” conditionally independent replicates represented by the plate • [Diagram: φ → w_i, inside a plate labeled i = 1:n]

  19. Being Bayesian.... This is a prior on our multinomial parameters, e.g., a simple Dirichlet smoothing prior with symmetric parameter α, to avoid estimates of probabilities that are 0 • [Diagram: α → φ → w_i, plate i = 1:n]

  20. Being Bayesian.... Learning: infer p(φ | words, α) ∝ p(words | φ) p(φ | α) • [Diagram: α → φ → w_i, plate i = 1:n]
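Not part of the original slides: a short sketch of what the symmetric Dirichlet prior buys us in practice, using the posterior-mean (add-α) estimate. The vocabulary, training string, and α value are illustrative assumptions.

```python
# Dirichlet (add-alpha) smoothing for the multinomial: the posterior mean
# under a symmetric Dirichlet(alpha) prior is
#   p(word | data, alpha) = (count(word) + alpha) / (total + alpha * V)
# so no word in the vocabulary gets probability exactly zero.
from collections import Counter

def smoothed_multinomial(training_words, vocabulary, alpha=0.1):
    counts = Counter(training_words)
    total = sum(counts.values())
    V = len(vocabulary)
    return {w: (counts[w] + alpha) / (total + alpha * V) for w in vocabulary}

vocab = ["money", "bank", "loan", "river", "stream"]
phi = smoothed_multinomial("money bank bank loan".split(), vocab, alpha=0.1)
print(phi["river"])  # small but nonzero, thanks to the prior
```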

  21. Multiple Documents p(corpus | φ) = ∏ p(doc | φ) • [Diagram: α → φ → w_i, plates i = 1:n and d = 1:D]

  22. Different Document Types p(w | φ) is a multinomial over words • [Diagram: α → φ → w_i, plate i = 1:n]

  23. Different Document Types p(w | φ) is a multinomial over words • [Diagram: α → φ → w_i, plates i = 1:n and d = 1:D]

  24. Different Document Types p(w | φ, z_d) is a multinomial over words • z_d is the "label" for each doc • [Diagram: α → z_d → w_i ← φ, plates i = 1:n and d = 1:D]

  25. Different Document Types p(w | φ, z_d) is a multinomial over words • z_d is the "label" for each doc • Different multinomials, depending on the value of z_d (discrete) • φ now represents |z| different multinomials • [Diagram: α → z_d → w_i ← φ, plates i = 1:n and d = 1:D]

  26. Unknown Document Types Now the values of z for each document are unknown - hopeless? • [Diagram: β → z_d → w_i ← φ ← α, plates i = 1:n and d = 1:D]

  27. Unknown Document Types Now the values of z for each document are unknown - hopeless? • Not hopeless :) • Can learn about both z and the model parameters, e.g., with the EM algorithm • This gives probabilistic clustering • p(w | z = k, φ) is the kth multinomial over words • [Diagram: β → z_d → w_i ← φ ← α, plates i = 1:n and d = 1:D]
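Not part of the original slides: a compact sketch of EM for this mixture-of-multinomials clustering model, under illustrative toy data. Function and variable names are assumptions made for the example.

```python
# EM for a mixture of multinomials (probabilistic document clustering):
# each document has an unknown cluster label z_d, and each cluster k has
# its own word multinomial phi[k].
import numpy as np

def em_multinomial_mixture(X, K, n_iters=50, alpha=0.01, seed=0):
    """X: (D, W) matrix of word counts; K: number of clusters."""
    rng = np.random.default_rng(seed)
    D, W = X.shape
    pi = np.full(K, 1.0 / K)                     # mixture weights
    phi = rng.dirichlet(np.ones(W), size=K)      # K word multinomials
    for _ in range(n_iters):
        # E-step: responsibilities p(z_d = k | doc d), computed in log space
        log_r = np.log(pi) + X @ np.log(phi).T   # (D, K)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and (lightly smoothed) multinomials
        pi = r.sum(axis=0) / D
        counts = r.T @ X + alpha                 # (K, W)
        phi = counts / counts.sum(axis=1, keepdims=True)
    return pi, phi, r

# Toy usage: 4 tiny documents over a 4-word vocabulary, 2 clusters
X = np.array([[5, 3, 0, 0], [4, 4, 1, 0], [0, 1, 6, 3], [0, 0, 4, 5]])
pi, phi, r = em_multinomial_mixture(X, K=2)
print(r.round(2))   # soft cluster assignments for each document
```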

  28. Topic Model • z_i is a "label" for each word • θ_d: p(z_i | θ_d) = distribution over topics that is document-specific • φ: p(w | φ, z_i = k) = multinomial over words = a "topic" • [Diagram: α → θ_d → z_i → w_i ← φ ← β, plates i = 1:n and d = 1:D]

  29. Example of generating words • [Figure: two toy topics (topic 1: MONEY, BANK, LOAN; topic 2: RIVER, STREAM, BANK), document-specific topic mixtures θ (e.g., 1.0/0.0, 0.6/0.4, 0.0/1.0), and three generated documents with each word tagged by the topic that produced it]

  30. Learning • [Figure: the same three documents, but now the topics, the mixtures θ, and the per-word topic assignments are all unknown (shown as "?") and must be inferred from the words alone]

  31. Key Features of Topic Models • Model allows a document to be composed of multiple topics • More powerful than 1 doc -> 1 cluster • Completely unsupervised • Topics learned directly from data • Leverages strong dependencies at word level • Learning algorithm • Gibbs sampling is the method of choice • Scalable • Linear in number of word tokens • Can be run on millions of documents

  32. Document generation as a probabilistic process • Each topic is a distribution over words, with parameters φ^(j) • Each document is a mixture of topics, with parameters θ^(d) • Each word is chosen from a single topic
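Not part of the original slides: a short Python sketch of this generative process. The vocabulary, the two hand-set topics, and the Dirichlet hyperparameters are illustrative assumptions.

```python
# Generative process: each topic phi[j] is a distribution over words,
# each document draws its own topic mixture theta, and every word is
# drawn by first picking a single topic, then a word from that topic.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["money", "bank", "loan", "river", "stream"]
phi = np.array([[0.4, 0.4, 0.2, 0.0, 0.0],     # "finance" topic
                [0.0, 0.3, 0.0, 0.4, 0.3]])    # "rivers" topic
alpha = np.array([0.5, 0.5])

def generate_document(n_words=10):
    theta = rng.dirichlet(alpha)                 # document's topic mixture
    words, assignments = [], []
    for _ in range(n_words):
        z = rng.choice(len(phi), p=theta)        # pick a single topic
        w = rng.choice(len(vocab), p=phi[z])     # pick a word from that topic
        words.append(vocab[w])
        assignments.append(z)
    return theta, words, assignments

theta, words, z = generate_document()
print(theta.round(2), list(zip(words, z)))
```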

  33. Learning the Model • Three sets of latent variables we can learn • topic-word distributions φ • document-topic distributions θ • topic assignments for each word z • Options: • EM algorithm to find point estimates of φ and θ • e.g., Chien and Wu, IEEE Trans ASLP, 2008 • Gibbs sampling • Find p(φ | data), p(θ | data), p(z | data) • Can be slow to converge • Collapsed Gibbs sampling • Most widely used method [See also Asuncion, Welling, Smyth, Teh, UAI 2009 for additional discussion]

  34. Gibbs Sampling • Say we have 3 parameters x,y,z, and some data • Bayesian learning: • We want to compute p(x, y, z | data) • But frequently it is impossible to compute this exactly • However, often we can compute conditionals for individual variables, e.g., p(x | y, z, data) • Not clear how this is useful yet, since it assumes y and z are known (i.e., we condition on them).

  35. Gibbs Sampling 2 • Example of Gibbs sampling: • Initialize with x’, y’, z’ (e.g., randomly) • Iterate: • Sample new x’ ~ P(x | y’, z’, data) • Sample new y’ ~ P(y | x’, z’, data) • Sample new z’ ~ P(z | x’, y’, data) • Continue for some (large) number of iterations • Each iteration consists of a sweep through the hidden variables or parameters (here, x, y, and z) • Gibbs = a Markov Chain Monte Carlo (MCMC) method • In the limit, the samples x’, y’, z’ will be samples from the true joint distribution P(x, y, z | data) • This gives us an empirical estimate of P(x, y, z | data)
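Not part of the original slides: a minimal sketch of this Gibbs recipe for a case where the full conditionals are known in closed form: a bivariate Gaussian with correlation ρ, where x | y ~ N(ρy, 1−ρ²) and symmetrically for y | x. There is no "data" term here; x and y simply play the role of the unknowns.

```python
# Generic Gibbs sampling: repeatedly resample each variable from its
# conditional given the current values of all the others.
import numpy as np

rng = np.random.default_rng(0)
rho, n_iters, burn_in = 0.9, 5000, 500
x, y = 0.0, 0.0                      # arbitrary initialization
samples = []
for it in range(n_iters):
    # one sweep: resample every variable from its full conditional
    x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))
    y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))
    if it >= burn_in:                # discard early "burn-in" iterations
        samples.append((x, y))

samples = np.array(samples)
print(np.corrcoef(samples.T)[0, 1])  # should be close to rho
```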

  36. Example of Gibbs Sampling in 2d From online MCMC tutorial notes by Frank Dellaert, Georgia Tech

  37. Computation • Convergence • In the limit, samples x’, y’, z’ are from P(x, y, z | data) • How many iterations are needed? • Cannot be computed ahead of time • Early iterations are discarded (“burn-in”) • Typically some quantities of interest are tracked to assess convergence • Convergence in Gibbs/MCMC is a tricky issue! • Complexity per iteration • Linear in number of hidden variables and parameters • Times the complexity of generating a sample each time

  38. Gibbs Sampling for the Topic Model • Recall: 3 sets of latent variables we can learn • topic-word distributions φ • document-topic distributions θ • topic assignments for each word z • Gibbs sampling algorithm • Initialize all the z’s randomly to a topic, z1, ….. zN • Iteration • For i = 1,…. N • Sample zi ~ p(zi | all other z’s, data) • Continue for a fixed number of iterations or until convergence • Note that this is collapsed Gibbs sampling • Sample from p(z1, ….. zN | data), “collapsing” over φ and θ
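Not part of the original slides: a compact sketch of collapsed Gibbs sampling for the topic model, using the count-based conditional described on the next slide. Variable names, the toy documents, and the hyperparameters are illustrative assumptions, not the lecture's code.

```python
# Collapsed Gibbs sampling for LDA: phi and theta are integrated out,
# and only the per-token topic assignments z are resampled.
import numpy as np

def collapsed_gibbs_lda(docs, W, T, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    """docs: list of lists of word ids in [0, W); T: number of topics."""
    rng = np.random.default_rng(seed)
    ndt = np.zeros((len(docs), T))          # doc-topic counts
    nwt = np.zeros((W, T))                  # word-topic counts
    nt = np.zeros(T)                        # tokens per topic
    z = [[rng.integers(T) for _ in doc] for doc in docs]   # random init
    for d, doc in enumerate(docs):          # build initial counts
        for i, w in enumerate(doc):
            ndt[d, z[d][i]] += 1; nwt[w, z[d][i]] += 1; nt[z[d][i]] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                 # remove this token's assignment
                ndt[d, t] -= 1; nwt[w, t] -= 1; nt[t] -= 1
                # conditional p(z_i = t | all other z's, data)
                p = (ndt[d] + alpha) * (nwt[w] + beta) / (nt + W * beta)
                t = rng.choice(T, p=p / p.sum())
                z[d][i] = t                 # add it back under the new topic
                ndt[d, t] += 1; nwt[w, t] += 1; nt[t] += 1
    return z, ndt, nwt

# Toy usage: word ids 0-2 are "finance" words, 3-5 are "river" words
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 1, 3, 4, 0, 5]]
z, ndt, nwt = collapsed_gibbs_lda(docs, W=6, T=2)
print(ndt)   # each document should concentrate on one or two topics
```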

  39. Topic Model • [Diagram: the topic-model graphical model again: θ_d → z_i → w_i ← φ, plates i = 1:n and d = 1:D]

  40. Sampling each Topic-Word Assignment • Probability that word i (of word type w, in document d) is assigned to topic t: p(z_i = t | all other z’s, words) ∝ (count of topic t assigned to doc d + α) × (count of word w assigned to topic t + β) / (count of all words assigned to topic t + Wβ)

  41. Convergence Example (from Newman et al, JMLR, 2009)

  42. Complexity Time • O(N T) per iteration, where N is the number of “tokens”, T the number of topics • For fast sampling, see “Fast-LDA”, Porteous et al, ACM SIGKDD, 2008; also Yao, Mimno, McCallum, ACM SIGKDD 2009. • For distributed algorithms, see Newman et al., Journal of Machine Learning Research, 2009, e.g., T = 1000, N = 100 million Space • O(D T + T W + N), where D is the number of documents and W is the number of unique words (size of vocabulary) • Can reduce this size by using sparse matrices • Store non-zero counts for doc-topic and topic-word • Only apply smoothing at prediction time
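Not part of the original slides: a small sketch of the memory-saving idea in the last bullets, keeping only the non-zero counts (here with dictionaries) and adding the smoothing terms only when a probability is actually needed. All names and values are illustrative assumptions.

```python
# Store only non-zero doc-topic and topic-word counts; apply the
# alpha/beta smoothing only at prediction time.
from collections import defaultdict

doc_topic = defaultdict(lambda: defaultdict(int))    # doc_topic[d][t] = count
topic_word = defaultdict(lambda: defaultdict(int))   # topic_word[t][w] = count
topic_total = defaultdict(int)

def add_token(d, w, t):
    doc_topic[d][t] += 1
    topic_word[t][w] += 1
    topic_total[t] += 1

def p_word_given_topic(w, t, W, beta=0.01):
    """Smoothing applied only here, at prediction time."""
    return (topic_word[t].get(w, 0) + beta) / (topic_total[t] + W * beta)

add_token(0, "bank", 1)
add_token(0, "money", 1)
print(p_word_given_topic("river", 1, W=5))   # nonzero, yet nothing is stored for it
```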

  43. 16 Artificial Documents • [Figure: the word counts for the 16 artificial documents] • Can we recover the original topics and topic mixtures from this data?

  44. Starting the Gibbs Sampling • Assign word tokens randomly to topics (● = topic 1; ● = topic 2, distinguished by color in the original figure)

  45. After 1 iteration

  46. After 4 iterations

  47. After 32 iterations

  48. Software for Topic Modeling • Mark Steyvers’ public-domain MATLAB toolbox for topic modeling on the Web psiexp.ss.uci.edu/research/programs_data/toolbox.htm
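Not part of the original slides: the lecture points to the MATLAB toolbox above; as one possible Python alternative, a minimal sketch using the gensim library (assumed to be installed), which provides an LDA implementation. The toy documents and parameter values are illustrative.

```python
# Fitting a small LDA model with gensim (illustrative alternative to the
# MATLAB toolbox referenced on the slide).
from gensim import corpora, models

texts = [["money", "bank", "loan", "bank"],
         ["river", "stream", "bank", "river"],
         ["money", "loan", "bank", "stream"]]
dictionary = corpora.Dictionary(texts)                   # word <-> id mapping
corpus = [dictionary.doc2bow(text) for text in texts]    # bag-of-words counts
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)
for topic_id, words in lda.show_topics(num_topics=2, num_words=4):
    print(topic_id, words)
```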
