Lecture 16: Unsupervised Learning from Text


  1. Lecture 16: Unsupervised Learning from Text. Padhraic Smyth, Department of Computer Science, University of California, Irvine

  2. Outline • General aspects of text mining • Named-entity extraction, question-answering systems, etc • Unsupervised learning from text documents • Motivation • Topic model and learning algorithm • Results • Extensions • Author-topic models • Applications • Demo of topic browser • Future directions

  3. Different Aspects of Text Mining • Named-entity extraction: • Parsers to recognize names of people, places, genes, etc • E.g., GATE system • Question-answering systems • News summarization • Google news, Newsblaster (http://www1.cs.columbia.edu/nlp/newsblaster/) • Document clustering • Standard algorithms: k-means, hierarchical • Probabilistic approaches • Topic modeling • Representing documents as mixtures of topics • And many more…

  4. Named-Entity Extraction • Often a combination of • Knowledge-based approach (rules, parsers) • Machine learning (e.g., hidden Markov model) • Dictionary • Non-trivial since entity names can be confused with other words and names • E.g., the gene name ABS vs. the abbreviation ABS • Also can look for co-references • E.g., “IBM today…… Later, the company announced…..” • Very useful as a preprocessing step for data mining, e.g., use entity names to train a classifier to predict the category of an article
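As a rough illustration of the dictionary-plus-rules component described above, here is a minimal sketch in Python; the gazetteer entries and the capitalization rule are made up for illustration and are not the GATE/ANNIE implementation:

```python
import re

# Hypothetical mini-gazetteer; a real system such as GATE/ANNIE uses large
# gazetteers combined with grammars and/or learned models (e.g., HMMs).
GAZETTEER = {
    "IBM": "Organization",
    "University of California": "Organization",
    "Irvine": "Location",
}

def tag_entities(text):
    """Return (entity, type, offset) tuples from dictionary lookup plus one
    simple rule: adjacent capitalized words are candidate person names."""
    hits = []
    for name, etype in GAZETTEER.items():
        for m in re.finditer(re.escape(name), text):
            hits.append((name, etype, m.start()))
    for m in re.finditer(r"\b([A-Z][a-z]+ [A-Z][a-z]+)\b", text):
        hits.append((m.group(1), "Person?", m.start()))
    return hits

print(tag_entities("Padhraic Smyth visited IBM in Irvine."))
```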

  5. Example: GATE/ANNIE extractor • GATE: free software infrastructure for text analysis (University of Sheffield, UK) • ANNIE: widely used entity-recognizer, part of GATE http://www.gate.ac.uk/annie/

  6. Question-Answering Systems • See additional slides on Dumais et al AskMSR system

  7. Unsupervised Learning from Text • Large collections of unlabeled documents.. • Web • Digital libraries • Email archives, etc • Often wish to organize/summarize/index/tag these documents automatically • We will look at probabilistic techniques for clustering and topic extraction from sets of documents

  8. Outline • Background on statistical text modeling • Unsupervised learning from text documents • Motivation • Topic model and learning algorithm • Results • Extensions • Author-topic models • Applications • Demo of topic browser • Future directions

  9. Pennsylvania Gazette, 1728-1800: 80,000 articles, 25 million words (www.accessible.com)

  10. Enron email data: 250,000 emails, 28,000 authors, 1999-2002

  11. Other Examples of Data Sets • CiteSeer digital collection: • 700,000 papers, 700,000 authors, 1986-2005 • MEDLINE collection • 16 million abstracts in medicine/biology • US Patent collection • and many more....

  12. Problems of Interest • What topics do these documents “span”? • Which documents are about a particular topic? • How have topics changed over time? • What does author X write about? • Who is likely to write about topic Y? • Who wrote this specific document? • and so on…..

  13. Probability Models for Documents • Example: 50,000 possible words in our vocabulary • Simple memoryless model, aka "bag of words" • 50,000-sided die • each side of the die represents 1 word • a non-uniform die: each side/word has its own probability • to generate N words we toss the die N times • gives a "bag of words" (no sequence information) • This is a simple probability model: • p(document | φ) = ∏_i p(word_i | φ) • to "learn" the model we just count frequencies • p(word_i) = number of occurrences of word i / total number of words
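To make the "count frequencies" step concrete, a minimal sketch (illustrative Python with a made-up toy corpus and vocabulary, not code from the lecture):

```python
from collections import Counter
import math

def fit_unigram(docs, vocab):
    """Maximum-likelihood multinomial: p(word) = count(word) / total count."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts[w] for w in vocab)
    return {w: counts[w] / total for w in vocab}

def log_prob(doc, phi):
    """log p(document | phi) = sum_i log p(word_i | phi), the bag-of-words model."""
    return sum(math.log(phi[w]) for w in doc)

docs = [["bank", "money", "loan", "bank"], ["river", "bank", "stream"]]
vocab = {"bank", "money", "loan", "river", "stream"}
phi = fit_unigram(docs, vocab)
print(phi["bank"], log_prob(["money", "bank"], phi))
```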

  14. The Multinomial Model • Example: tossing a 6-sided die • P = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6] • Multinomial model for documents: • V-sided “die” = probability distribution over possible words • Some words have higher probability than others • Document with N words generated by N memoryless “draws” • Typically interested in conditional multinomials, e.g., • p(words | spam) versus p(words | non-spam)
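As a sketch of how the class-conditional multinomials at the end of this slide would be used, with entirely made-up word probabilities (illustrative only):

```python
import math

# Hypothetical class-conditional word probabilities (values are invented).
phi_spam    = {"free": 0.05,  "money": 0.04,  "meeting": 0.001, "report": 0.002}
phi_nonspam = {"free": 0.002, "money": 0.005, "meeting": 0.03,  "report": 0.02}

def log_likelihood(words, phi):
    # log p(words | class) under the memoryless multinomial model
    return sum(math.log(phi[w]) for w in words)

doc = ["free", "money", "money"]
print("spam:    ", log_likelihood(doc, phi_spam))
print("non-spam:", log_likelihood(doc, phi_nonspam))
```

Combined with class priors, this comparison is the basis of the familiar naive Bayes classifier for text.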

  15. Real examples of Word Multinomials

  16. Probabilistic Model. [Figure: the model gives P(Data | Parameters), mapping parameters to real-world data; statistical inference maps observed data back to P(Parameters | Data).]

  17. A Graphical Model: p(doc | φ) = ∏_i p(w_i | φ), where φ is the "parameter vector", i.e., a set of probabilities, one per word. [Figure: graphical model with a parameter node φ pointing to word nodes w_1, w_2, ..., w_n, each distributed as p(w | φ).]

  18. Another view.... p(doc | φ) = ∏_i p(w_i | φ). This is "plate notation": items inside the plate are conditionally independent given the variable outside the plate, and there are n conditionally independent replicates represented by the plate. [Figure: φ outside a plate containing w_i, i = 1:n.]

  19. Being Bayesian.... This is a prior on our multinomial parameters, e.g., a simple Dirichlet smoothing prior with symmetric parameter α, to avoid estimates of probabilities that are 0. [Figure: plate model α → φ → w_i, i = 1:n.]
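Concretely, the smoothing this prior provides is the standard posterior-mean estimate for a symmetric Dirichlet(α) prior over a vocabulary of size V, where n_w is the count of word w:

```latex
\hat{p}(w) \;=\; \frac{n_w + \alpha}{\sum_{w'} n_{w'} + V\alpha}
```

so no word receives probability 0, even if it never occurs in the training documents.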

  20. Being Bayesian.... Learning: infer p(φ | words, α), which is proportional to p(words | φ) p(φ | α). [Figure: plate model α → φ → w_i, i = 1:n.]

  21. Multiple Documents: p(corpus | φ) = ∏_d p(doc_d | φ). [Figure: plate model α → φ → w_i, with an inner plate i = 1:n and an outer plate over documents 1:D.]

  22. Different Document Types: p(w | φ) is a multinomial over words. [Figure: plate model α → φ → w_i, i = 1:n.]

  23. Different Document Types: p(w | φ) is a multinomial over words. [Figure: plate model α → φ → w_i, with plates i = 1:n and documents 1:D.]

  24. Different Document Types: p(w | φ, z_d) is a multinomial over words, where z_d is the "label" for each document. [Figure: plate model with α → φ and z_d both pointing to w_i, plates i = 1:n and documents 1:D.]

  25. Different Document Types: p(w | φ, z_d) is a multinomial over words, where z_d is the "label" for each document. Different multinomials are used depending on the (discrete) value of z_d, so φ now represents |z| different multinomials. [Figure: same plate model as the previous slide.]

  26. Unknown Document Types: now the values of z for each document are unknown. Hopeless? [Figure: same plate model, with z_d now unobserved.]

  27. Unknown Document Types: now the values of z for each document are unknown. Hopeless? Not hopeless :) We can learn about both z and the model parameters, e.g., with the EM algorithm, and this gives probabilistic clustering: p(w | z = k, φ) is the kth multinomial over words. A sketch of the EM approach follows below. [Figure: same plate model as the previous slide.]
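A minimal sketch of the EM idea for this mixture-of-multinomials clustering model (my own illustrative code, not the lecture's implementation; assumes documents are given as rows of word counts over a fixed vocabulary):

```python
import numpy as np

def em_multinomial_mixture(X, K, iters=50, seed=0):
    """X: (D, V) array of word counts; K: number of clusters.
    Returns cluster weights pi, word multinomials phi (K, V), and
    responsibilities r (D, K), i.e., p(z_d = k | document d)."""
    rng = np.random.default_rng(seed)
    D, V = X.shape
    phi = rng.dirichlet(np.ones(V), size=K)      # random initial multinomials
    pi = np.full(K, 1.0 / K)                     # uniform initial cluster weights
    for _ in range(iters):
        # E-step: unnormalized log p(z = k | doc), then normalize per document
        log_r = np.log(pi) + X @ np.log(phi).T   # shape (D, K)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi and phi from the soft assignments
        pi = r.mean(axis=0)
        phi = r.T @ X + 1e-6                     # tiny smoothing to avoid zeros
        phi /= phi.sum(axis=1, keepdims=True)
    return pi, phi, r
```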

  28. Topic Model: z_i is a "label" for each word; p(w | φ, z_i = k) = multinomial over words = a "topic"; p(z_i | θ_d) = distribution over topics that is document-specific. [Figure: plate model with θ_d → z_i → w_i ← φ, with a Dirichlet prior α, plates i = 1:n and documents 1:D.]

  29. Key Features of Topic Models • Generative model for documents in form of bags of words • Allows a document to be composed of multiple topics • Much more powerful than 1 doc -> 1 cluster • Completely unsupervised • Topics learned directly from data • Leverages strong dependencies at word level AND large data sets • Learning algorithm • Gibbs sampling is the method of choice • Scalable • Linear in number of word tokens • Can be run on millions of documents

  30. Document generation as a probabilistic process • Each topic is a distribution over words (with parameters φ^(j)) • Each document is a mixture of topics (with parameters θ^(d)) • Each word is chosen from a single topic
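A minimal sketch of this generative process (illustrative Python; the vocabulary, topic distributions φ, and document mixture θ are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["money", "bank", "loan", "river", "stream"]

# Two topics (rows of phi are distributions over the vocabulary) and one
# document's mixture over topics (theta). Values are illustrative only.
phi = np.array([[0.4, 0.4, 0.2, 0.0, 0.0],    # topic 1: finance
                [0.0, 0.4, 0.0, 0.3, 0.3]])   # topic 2: rivers
theta = np.array([0.6, 0.4])                  # this document: 60% topic 1

def generate_document(n_words):
    words = []
    for _ in range(n_words):
        z = rng.choice(len(theta), p=theta)    # choose a topic for this word
        w = rng.choice(len(vocab), p=phi[z])   # choose a word from that topic
        words.append(vocab[w])
    return words

print(generate_document(10))
```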

  31. Example of generating words. [Figure: two topics, one over MONEY/BANK/LOAN and one over RIVER/STREAM/BANK. Document 1 (mixture weight 1.0 on the money topic): MONEY BANK BANK LOAN BANK MONEY BANK MONEY BANK LOAN LOAN BANK MONEY ... Document 2 (mixture weights 0.6 and 0.4): RIVER MONEY BANK STREAM BANK BANK MONEY RIVER MONEY BANK LOAN MONEY ... Document 3 (mixture weight 1.0 on the river topic): RIVER BANK STREAM BANK RIVER BANK ... Subscripts in the original figure indicate which topic generated each word token.]

  32. Inference. [Figure: the same documents, but now the topic assignment of each word token and the mixture weights θ are unknown (shown as question marks); the goal is to infer them from the observed words.]

  33. Bayesian Inference • Three sets of latent variables • topic mixtures θ • word distributions φ • topic assignments z • Integrate out θ and φ and estimate the topic assignments z from p(z | w) (the exact normalization requires a sum over an exponential number of terms) • Use MCMC with Gibbs sampling for approximate inference
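For reference, integrating out θ and φ gives the standard collapsed Gibbs update of Griffiths and Steyvers (2004), with symmetric Dirichlet hyperparameters α and β; the counts n exclude the current token i:

```latex
P(z_i = k \mid z_{-i}, \mathbf{w}) \;\propto\;
\frac{n^{(w_i)}_{k,-i} + \beta}{n^{(\cdot)}_{k,-i} + V\beta}
\;\cdot\;
\frac{n^{(d_i)}_{k,-i} + \alpha}{n^{(d_i)}_{\cdot,-i} + T\alpha}
```

where n^{(w_i)}_{k,-i} is the number of other tokens of word w_i assigned to topic k, n^{(d_i)}_{k,-i} is the number of other tokens in document d_i assigned to topic k, V is the vocabulary size, and T is the number of topics.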

  34. Gibbs Sampling • Start with random assignments of words to topics • Repeat M iterations • Repeat for all words i • Sample a new topic assignment for word i conditioned on all other topic assignments • Each sample is simple: draw from a multinomial represented as a ratio of appropriate counts
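A minimal collapsed Gibbs sampler following these steps (illustrative code, not the lecture's implementation; documents are assumed to be lists of word ids into a vocabulary of size V):

```python
import numpy as np

def gibbs_lda(docs, V, T, alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of lists of word ids; T: number of topics."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(T, size=len(doc)) for doc in docs]   # random initialization
    n_wt = np.zeros((V, T)) + beta              # word-topic counts (smoothed)
    n_dt = np.zeros((len(docs), T)) + alpha     # document-topic counts (smoothed)
    n_t = np.zeros(T) + V * beta                # tokens per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_wt[w, k] += 1; n_dt[d, k] += 1; n_t[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                n_wt[w, k] -= 1; n_dt[d, k] -= 1; n_t[k] -= 1   # remove token
                p = (n_wt[w] / n_t) * n_dt[d]   # conditional = ratio of counts
                k = rng.choice(T, p=p / p.sum())
                z[d][i] = k
                n_wt[w, k] += 1; n_dt[d, k] += 1; n_t[k] += 1   # add back
    phi = (n_wt / n_wt.sum(axis=0)).T           # topic-word point estimates
    theta = n_dt / n_dt.sum(axis=1, keepdims=True)   # doc-topic point estimates
    return phi, theta, z
```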

  35. 16 Artificial Documents. Can we recover the original topics and topic mixtures from this data?

  36. Starting the Gibbs Sampling • Assign word tokens randomly to topics (in the original figure, tokens are colored by topic: topic 1 vs. topic 2)

  37. After 1 iteration

  38. After 4 iterations

  39. After 32 iterations

  40. More Details on Learning • Gibbs sampling for x and z • Typically run several hundred Gibbs iterations • 1 iteration = full pass through all words in all documents • Estimating θ and φ • x and z samples -> point estimates • non-informative Dirichlet priors for θ and φ • Computational Efficiency • Learning is linear in the number of word tokens • Memory requirements can be a limitation for large corpora • Predictions on new documents • can average over θ and φ (from different samples, different runs)
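The point estimates referred to above, computed from a single sample of z, are the usual smoothed count ratios (same counts and hyperparameters as in the Gibbs update earlier):

```latex
\hat{\phi}^{(k)}_{w} \;=\; \frac{n^{(w)}_{k} + \beta}{n^{(\cdot)}_{k} + V\beta},
\qquad
\hat{\theta}^{(d)}_{k} \;=\; \frac{n^{(d)}_{k} + \alpha}{n^{(d)}_{\cdot} + T\alpha}
```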

  41. History of topic models • origins in statistics: • latent class models in social science • admixture models in statistical genetics • applications in computer science • Hofmann, SIGIR, 1999 • Blei, Jordan, Ng, JMLR 2003 • Griffiths and Steyvers, PNAS, 2004 • more recent work • author-topic models: Steyvers et al, Rosen-Zvi et al, 2004 • Hierarchical topics: McCallum et al, 2006 • Correlated topic models: Blei and Lafferty, 2005 • Dirichlet process models: Teh, Jordan, et al • large-scale web applications: Buntine et al, 2004, 2005 • undirected models: Welling et al, 2004

  42. Topic = probability distribution over words Important point: these distributions are learned in a completely automated “unsupervised” fashion from the data

  43. Examples of Topics from CiteSeer

  44. Four example topics from NIPS

  45. Example Topics from the New York Times
• Terrorism: SEPT_11 WAR SECURITY IRAQ TERRORISM NATION KILLED AFGHANISTAN ATTACKS OSAMA_BIN_LADEN AMERICAN ATTACK NEW_YORK_REGION NEW MILITARY NEW_YORK WORLD NATIONAL QAEDA TERRORIST_ATTACKS
• Wall Street Firms: WALL_STREET ANALYSTS INVESTORS FIRM GOLDMAN_SACHS FIRMS INVESTMENT MERRILL_LYNCH COMPANIES SECURITIES RESEARCH STOCK BUSINESS ANALYST WALL_STREET_FIRMS SALOMON_SMITH_BARNEY CLIENTS INVESTMENT_BANKING INVESTMENT_BANKERS INVESTMENT_BANKS
• Stock Market: WEEK DOW_JONES POINTS 10_YR_TREASURY_YIELD PERCENT CLOSE NASDAQ_COMPOSITE STANDARD_POOR CHANGE FRIDAY DOW_INDUSTRIALS GRAPH_TRACKS EXPECTED BILLION NASDAQ_COMPOSITE_INDEX EST_02 PHOTO_YESTERDAY YEN 10 500_STOCK_INDEX
• Bankruptcy: BANKRUPTCY CREDITORS BANKRUPTCY_PROTECTION ASSETS COMPANY FILED BANKRUPTCY_FILING ENRON BANKRUPTCY_COURT KMART CHAPTER_11 FILING COOPER BILLIONS COMPANIES BANKRUPTCY_PROCEEDINGS DEBTS RESTRUCTURING CASE GROUP

  46. History of topic models • Latent class models in statistics (late 60’s) • “Aspect model”, Hofmann (1999) • Original application to documents • LDA Model: Blei, Ng, and Jordan (2001, 2003) • Variational methods • Topics Model: Griffiths and Steyvers (2003, 2004) • Gibbs sampling approach (very efficient) • More recent work on alternative (but similar) models, e.g., by Max Welling (ICS), Buntine, McCallum, and others

  47. Comparing Topics and Other Approaches • Clustering documents • Computationally simpler… • But a less accurate and less flexible model • LSI/LSA/SVD • Linear projection of V-dimensional word vectors into lower dimensions • Less interpretable • Not easily generalizable • E.g., to authors or other side-information • Not as accurate • E.g., precision-recall: Hofmann, Blei et al, Buntine, etc • Probabilistic models such as topic models • “next-generation” text modeling, after LSI • provide a modular, extensible framework

  48. Clusters v. Topics
