Author-Topic Models for Large Text Corpora

Presentation Transcript

  1. Author-Topic Models for Large Text Corpora. Padhraic Smyth, Department of Computer Science, University of California, Irvine. In collaboration with Mark Steyvers (UCI), Michal Rosen-Zvi (UCI), and Tom Griffiths (Stanford).

  2. Outline
  • Problem motivation: modeling large sets of documents
  • Probabilistic approaches: topic models -> author-topic models
  • Results: author-topic results from CiteSeer, NIPS, and Enron data
  • Applications of the model (demo of author-topic query tool)
  • Future directions

  3. Data Sets of Interest
  • Data = a set of documents
  • Large collections of documents: 10k, 100k, etc.
  • Authors of the documents are known
  • Years/dates of the documents are known
  • (We will typically assume a bag-of-words representation)

  4. Examples of Data Sets
  • CiteSeer: 160k abstracts, 80k authors, 1986-2002
  • NIPS papers: 2k papers, 1k authors, 1987-1999
  • Reuters: 20k newspaper articles, 114 authors

  5. Pennsylvania Gazette, 1728-1800: 80,000 articles, 25 million words (www.accessible.com)

  6. Enron email data: 500,000 emails, 5,000 authors, 1999-2002

  7. Problems of Interest
  • What topics do these documents “span”?
  • Which documents are about a particular topic?
  • How have topics changed over time?
  • What does author X write about?
  • Who is likely to write about topic Y?
  • Who wrote this specific document?
  • and so on…

  8. A topic is represented as a (multinomial) distribution over words
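
Concretely, a topic is just a probability vector over the vocabulary, from which words can be sampled. A minimal sketch (the words and probabilities are illustrative, echoing the toy example on the following slides):

```python
import random

# A topic is a multinomial (categorical) distribution over the vocabulary.
# These numbers are purely illustrative.
topic = {"probabilistic": 0.25, "learning": 0.50, "bayesian": 0.25}

def sample_word(topic, rng=random):
    """Draw one word from a topic's word distribution."""
    words = list(topic)
    weights = [topic[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

word = sample_word(topic)
```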

  9. Cluster Models. Document 1: probabilistic, learning, learning, Bayesian. Document 2: information, retrieval, information, retrieval.

  10. Cluster Models. Document 1: probabilistic, learning, learning, Bayesian. Document 2: information, retrieval, information, retrieval. First cluster: P(probabilistic | topic) = 0.25, P(learning | topic) = 0.50, P(Bayesian | topic) = 0.25, P(other words | topic) = 0.00. Second cluster: P(information | topic) = 0.5, P(retrieval | topic) = 0.5, P(other words | topic) = 0.0.

  11. Graphical Model: cluster variable z → word w, with a plate over the n words in a document.

  12. Graphical Model: cluster variable z → word w, with plates over n words and D documents.

  13. Graphical Model: cluster weights α → cluster variable z → word w, with cluster-word distributions φ; plates over n words and D documents.

  14. Cluster Models. Document 1: probabilistic, learning, learning, Bayesian. Document 2: information, retrieval, information, retrieval. Document 3: probabilistic, learning, information, retrieval.

  15. Topic Models. Document 1: probabilistic, learning, learning, Bayesian. Document 2: information, retrieval, information, retrieval.

  16. Topic Models. Document 1: probabilistic, learning, learning, Bayesian. Document 2: information, retrieval, information, retrieval. Document 3: probabilistic, learning, information, retrieval.

  17. History of topic models
  • Latent class models in statistics (late 1960s)
  • Hofmann (1999): original application to documents
  • Blei, Ng, and Jordan (2001, 2003): variational methods
  • Griffiths and Steyvers (2003, 2004): Gibbs sampling approach (very efficient)

  18. Word/document counts for 16 artificial documents. Can we recover the original topics and topic mixtures from this data?

  19. Example of Gibbs Sampling • Assign word tokens randomly to topics (in the figure, color distinguishes topic 1 from topic 2)

  20. After 1 iteration • Apply sampling equation to each word token

  21. After 4 iterations

  22. After 32 iterations
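
The randomly-initialize-then-resample loop of slides 19-22 can be sketched as a collapsed Gibbs sampler for the plain topic model. The toy corpus, hyperparameters, and iteration count below are my own choices; the update is the standard counts-based full conditional, proportional to (smoothed topic-word count) times (smoothed document-topic count):

```python
import random

random.seed(0)

# Toy corpus in the spirit of the slides: two latent themes.
docs = [
    ["probabilistic", "learning", "learning", "bayesian"],
    ["information", "retrieval", "information", "retrieval"],
    ["probabilistic", "learning", "information", "retrieval"],
]
T = 2                    # number of topics
alpha, beta = 0.5, 0.5   # symmetric Dirichlet hyperparameters (illustrative)
vocab = sorted({w for d in docs for w in d})
V = len(vocab)
widx = {w: i for i, w in enumerate(vocab)}

# Random initialization: assign every word token to a random topic.
z = [[random.randrange(T) for _ in d] for d in docs]

# Count matrices: word-topic (CWT) and document-topic (CDT).
CWT = [[0] * T for _ in range(V)]
CDT = [[0] * T for _ in docs]
ntopic = [0] * T
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]
        CWT[widx[w]][t] += 1
        CDT[d][t] += 1
        ntopic[t] += 1

for it in range(200):  # Gibbs sweeps over all word tokens
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t, m = z[d][i], widx[w]
            # Remove the current token from the counts...
            CWT[m][t] -= 1; CDT[d][t] -= 1; ntopic[t] -= 1
            # ...then resample its topic from the full conditional.
            weights = [
                (CWT[m][j] + beta) / (ntopic[j] + V * beta)
                * (CDT[d][j] + alpha)
                for j in range(T)
            ]
            t = random.choices(range(T), weights=weights, k=1)[0]
            z[d][i] = t
            CWT[m][t] += 1; CDT[d][t] += 1; ntopic[t] += 1
```

On a corpus this small and separable, the sampler typically converges to the two intended themes within a few dozen sweeps.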

  23. Topic Models. Document 1: probabilistic, learning, learning, Bayesian. Document 2: information, retrieval, information, retrieval. Document 3: probabilistic, learning, information, retrieval.

  24. Author-Topic Models. Document 1: probabilistic, learning, learning, Bayesian. Document 2: information, retrieval, information, retrieval.

  25. Author-Topic Models. Document 1: probabilistic, learning, learning, Bayesian. Document 2: information, retrieval, information, retrieval. Document 3: probabilistic, learning, information, retrieval.

  26. Approach
  • The author-topic model: a probabilistic model linking authors and topics
  • authors -> topics -> words
  • Learned from data: completely unsupervised, no labels; a generative model
  • Different questions or queries can be answered by the appropriate probability calculus
  • e.g., p(author | words in document)
  • e.g., p(topic | author)
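
Such queries reduce to sums and products over the learned matrices θ (author-topic) and φ (topic-word). A sketch of the "who wrote this?" query under a uniform author prior; all parameter values here are invented for illustration, not from any real run:

```python
import math

# Illustrative learned parameters (made up):
# theta[a][t] = p(topic t | author a); phi[t][w] = p(word w | topic t).
theta = {"author_A": [0.9, 0.1], "author_B": [0.2, 0.8]}
phi = [
    {"learning": 0.5, "probabilistic": 0.3, "retrieval": 0.2},
    {"retrieval": 0.5, "information": 0.4, "learning": 0.1},
]

def log_p_doc_given_author(words, author):
    """log p(words | author) = sum_i log sum_t theta[a][t] * phi[t][w_i]."""
    return sum(
        math.log(sum(theta[author][t] * phi[t].get(w, 1e-12)
                     for t in range(len(phi))))
        for w in words
    )

def most_likely_author(words, authors=("author_A", "author_B")):
    # Uniform prior over authors, so comparing likelihoods suffices.
    return max(authors, key=lambda a: log_p_doc_given_author(words, a))

doc = ["learning", "probabilistic", "learning"]
```

Here `most_likely_author(doc)` prefers author_A, whose topic mix puts most mass on the "learning" topic.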

  27. Graphical Model: author x → topic z.

  28. Graphical Model: author x → topic z → word w.

  29. Graphical Model: author x → topic z → word w, with a plate over n words.

  30. Graphical Model: document authors a → author x → topic z → word w; plates over n words and D documents.

  31. Graphical Model: document authors a → author x → topic z → word w, with author-topic distributions θ and topic-word distributions φ; plates over n words and D documents.

  32. Generative Process
  • Assume authors A1 and A2 collaborate and produce a paper
  • A1 has multinomial topic distribution θ1
  • A2 has multinomial topic distribution θ2
  • For each word in the paper:
  • Sample an author x (uniformly) from {A1, A2}
  • Sample a topic z from θx
  • Sample a word w from the topic-word distribution φz
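
These steps transcribe directly into code. A sketch with toy θ and φ (all numbers and vocabulary are illustrative):

```python
import random

random.seed(1)

# Illustrative author-topic distributions theta and topic-word distributions phi.
theta = {"A1": [0.8, 0.2], "A2": [0.1, 0.9]}
phi = [
    {"learning": 0.6, "bayesian": 0.4},       # topic 0
    {"information": 0.5, "retrieval": 0.5},   # topic 1
]

def generate_document(coauthors, n_words, rng=random):
    """Generate one document under the author-topic generative process."""
    doc = []
    for _ in range(n_words):
        x = rng.choice(coauthors)                          # author, uniform
        z = rng.choices([0, 1], weights=theta[x], k=1)[0]  # topic from theta_x
        words = list(phi[z])
        w = rng.choices(words, weights=[phi[z][t] for t in words], k=1)[0]
        doc.append(w)                                      # word from phi_z
    return doc

paper = generate_document(["A1", "A2"], n_words=10)
```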

  33. Graphical Model (repeated): document authors a → author x → topic z → word w, with author-topic distributions θ and topic-word distributions φ; plates over n words and D documents.

  34. Learning
  • Observed: W = observed words, A = sets of known authors
  • Unknown: x, z (hidden variables); θ, φ (unknown parameters)
  • Interested in: p(x, z | W, A) and p(θ, φ | W, A)
  • But exact inference is not tractable

  35. Step 1: Gibbs sampling of x and z, marginalizing over the unknown parameters θ and φ.

  36. Step 2: MAP estimates of θ and φ, conditioning on particular samples of x and z.

  37. Step 2: MAP estimates of θ and φ: point estimates of the unknown parameters.

  38. More Details on Learning
  • Gibbs sampling for x and z: typically run 2000 Gibbs iterations (1 iteration = a full pass through all documents)
  • Estimating θ and φ: an x, z sample -> point estimates; non-informative Dirichlet priors for θ and φ
  • Computational efficiency: learning is linear in the number of word tokens
  • Predictions on new documents: can average over θ and φ (from different samples and different runs)
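
Given one sample of x and z, the point estimates are just Dirichlet-smoothed normalized counts. A sketch; the count matrices and hyperparameter values below are invented for illustration:

```python
# theta_hat[a][t] ∝ C_AT[a][t] + alpha ;  phi_hat[t][w] ∝ C_WT[w][t] + beta
# Toy count matrices from a hypothetical Gibbs sample.
C_AT = [[8, 2], [1, 9]]          # author-topic counts: 2 authors x 2 topics
C_WT = [[5, 0], [3, 1], [0, 6]]  # word-topic counts: 3 words x 2 topics
alpha, beta = 0.5, 0.5           # symmetric Dirichlet hyperparameters
A, T, V = 2, 2, 3

theta_hat = [
    [(C_AT[a][t] + alpha) / (sum(C_AT[a]) + T * alpha) for t in range(T)]
    for a in range(A)
]
phi_hat = [
    [(C_WT[w][t] + beta) / (sum(C_WT[v][t] for v in range(V)) + V * beta)
     for w in range(V)]
    for t in range(T)
]
```

Each row of theta_hat and phi_hat is a proper probability distribution, which is why smoothed counts can be plugged straight into the queries of slide 26.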

  39. Gibbs Sampling
  • Need the full conditional distributions for the variables
  • The probability of assigning the current word i to topic j and author k, given everything else, is proportional to (the number of times word w is assigned to topic j) times (the number of times topic j is assigned to author k), each suitably smoothed and normalized
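
The count annotations on this slide describe the standard collapsed update for the author-topic model, which can be reconstructed as follows (here \(C^{WT}_{mj}\) is the number of times word \(m\) is assigned to topic \(j\), \(C^{AT}_{kj}\) the number of times topic \(j\) is assigned to author \(k\), with the current token \(i\) excluded from all counts; \(V\) is the vocabulary size and \(T\) the number of topics):

```latex
P(z_i = j, x_i = k \mid w_i = m, \mathbf{z}_{-i}, \mathbf{x}_{-i}, \mathbf{w}_{-i}, \mathbf{a}_d)
\;\propto\;
\frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + V\beta}
\cdot
\frac{C^{AT}_{kj} + \alpha}{\sum_{j'} C^{AT}_{kj'} + T\alpha}
```

The first factor is the smoothed probability of word m under topic j; the second is the smoothed probability of topic j for author k.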

  40. Experiments on Real Data
  • Corpora: CiteSeer (160K abstracts, 85K authors), NIPS (1.7K papers, 2K authors), Enron (115K emails, 5K authors/senders), Pubmed (27K abstracts, 50K authors)
  • Removed stop words; no stemming
  • Ignore word order, just use word counts
  • Processing time: NIPS, 2000 Gibbs iterations, about 8 hours; CiteSeer, 2000 Gibbs iterations, about 4 days

  41. Four example topics from CiteSeer (T=300)

  42. More CiteSeer Topics

  43. Some topics relate to generic word usage

  44. What can the Model be used for?
  • We can analyze our document set through the “topic lens”
  • Queries: who writes on this topic? (e.g., finding experts or reviewers in a particular area); what topics does this person do research on?
  • Discovering trends over time
  • Detecting unusual papers and authors
  • Interactive browsing of a digital library via topics
  • Parsing documents (and parts of documents) by topic
  • and more…

  45. Some likely topics per author (CiteSeer)
  • Author = Andrew McCallum, U Mass:
  - Topic 1: classification, training, generalization, decision, data, …
  - Topic 2: learning, machine, examples, reinforcement, inductive, …
  - Topic 3: retrieval, text, document, information, content, …
  • Author = Hector Garcia-Molina, Stanford:
  - Topic 1: query, index, data, join, processing, aggregate, …
  - Topic 2: transaction, concurrency, copy, permission, distributed, …
  - Topic 3: source, separation, paper, heterogeneous, merging, …
  • Author = Paul Cohen, USC/ISI:
  - Topic 1: agent, multi, coordination, autonomous, intelligent, …
  - Topic 2: planning, action, goal, world, execution, situation, …
  - Topic 3: human, interaction, people, cognitive, social, natural, …

  46. Temporal patterns in topics: hot and cold topics
  • We have CiteSeer papers from 1986-2002
  • For each year, calculate the fraction of words assigned to each topic -> a time series for each topic
  • Hot topics become more prevalent; cold topics become less prevalent
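
The trend computation described here is a per-year normalization of topic assignments. A sketch with fabricated token-topic assignments (the real computation runs over the sampled z values for the whole corpus):

```python
from collections import defaultdict

# Each record: (year of the document, topic assigned to one word token).
# Fabricated data for illustration only.
assignments = [
    (1990, 0), (1990, 0), (1990, 1),
    (2000, 0), (2000, 1), (2000, 1), (2000, 1),
]

def topic_fraction_by_year(assignments):
    """Fraction of word tokens assigned to each topic, per year."""
    totals = defaultdict(int)   # tokens per year
    counts = defaultdict(int)   # tokens per (year, topic)
    for year, topic in assignments:
        totals[year] += 1
        counts[(year, topic)] += 1
    return {
        (year, topic): counts[(year, topic)] / totals[year]
        for (year, topic) in counts
    }

series = topic_fraction_by_year(assignments)
# In this toy data, topic 1's share rises from 1/3 (1990) to 3/4 (2000):
# a "hot" topic; topic 0 is correspondingly "cold".
```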