
Models for Authors and Text Documents

Models for Authors and Text Documents. Mark Steyvers, UCI. In collaboration with: Padhraic Smyth (UCI), Michal Rosen-Zvi (UCI), Thomas Griffiths (Stanford).






  1. Models for Authors and Text Documents Mark Steyvers, UCI In collaboration with: Padhraic Smyth (UCI) Michal Rosen-Zvi (UCI) Thomas Griffiths (Stanford)

  2. These viewgraphs were developed by Professor Mark Steyvers and are intended for review by ICS 278 students. If you wish to use them for any other purposes please contact Professor Smyth (smyth@ics.uci.edu) or Professor Steyvers (msteyver@uci.edu)

  3. Goal • Automatically extract topical content of documents • Learn the association of topics to authors of documents • Propose a new, efficient probabilistic topic model: the author-topic model • Some queries the model should be able to answer: • What topics does author X work on? • Which authors work on topic X? • What are interesting temporal patterns in topics?

  4. A topic is represented as a (multinomial) distribution over words
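As a concrete sketch of this representation, here is a toy topic as a multinomial over words; the words and probabilities are made up for illustration, not taken from a fitted model:

```python
import random

# A toy "topic": a multinomial distribution over a tiny vocabulary.
# Probabilities must sum to 1 (illustrative values, not learned).
topic = {
    "network": 0.30,
    "neural": 0.25,
    "learning": 0.20,
    "weights": 0.15,
    "training": 0.10,
}

def sample_words(topic, n, rng=random.Random(0)):
    """Draw n word tokens i.i.d. from the topic's multinomial."""
    words = list(topic)
    probs = [topic[w] for w in words]
    return rng.choices(words, weights=probs, k=n)

tokens = sample_words(topic, 1000)
```

With many topics, each topic is one such distribution over the whole vocabulary, and a document mixes several of them.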

  5. Documents as Topic Mixtures: a Geometric Interpretation [Figure: documents and topics plotted as points on the word simplex, with axes P(word1), P(word2), P(word3) and the constraint P(word1) + P(word2) + P(word3) = 1; each document lies between topic 1 and topic 2]

  6. Previous topic-based models • Hofmann (1999): Probabilistic Latent Semantic Indexing (pLSI) • EM implementation • Problem of overfitting • Blei, Ng, & Jordan (2003): Latent Dirichlet Allocation (LDA) • Clarified the pLSI model • Variational EM • Griffiths & Steyvers (PNAS, 2004) • Same generative model as LDA • Gibbs sampling technique for inference • Computationally simple • Efficient (linear in the size of the data) • Can be applied to >100K documents

  7. Approach with Author-Topic Models • Combine author models with topic models • Ignore style, focus on content of documents • Learn the topics that authors write about • Learn two matrices: an Authors × Topics matrix and a Topics × Words matrix

  8. Assumptions of Generative Model • Each author is associated with a topic mixture • Each document contains a mixture of topics • With multiple authors, the document will express a mixture of the co-authors' topic mixtures • Each word in a text is generated from one topic and one author (potentially different for each word)

  9. Generative Process • Let's assume authors A1 and A2 collaborate and produce a paper • A1 has multinomial topic distribution θ1 • A2 has multinomial topic distribution θ2 • For each word in the paper: • Sample an author x (uniformly) from A1, A2 • Sample a topic z from θx • Sample a word w from topic z's multinomial word distribution
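The three sampling steps above can be sketched directly; the author names A1/A2, the two-topic θ values, and the tiny vocabularies in φ below are all hypothetical, chosen only to make the process concrete:

```python
import random

rng = random.Random(1)

# theta[a][t] = P(topic t | author a); phi[t][w] = P(word w | topic t).
theta = {
    "A1": [0.8, 0.2],                   # A1 mostly writes about topic 0
    "A2": [0.1, 0.9],                   # A2 mostly writes about topic 1
}
phi = [
    {"kernel": 0.5, "svm": 0.5},        # topic 0
    {"diagnosis": 0.5, "graph": 0.5},   # topic 1
]

def generate_word(coauthors):
    """One step of the generative process for a single word token."""
    x = rng.choice(coauthors)                            # 1. author, uniform
    z = rng.choices(range(len(theta[x])), weights=theta[x])[0]  # 2. topic from theta_x
    words = list(phi[z])
    w = rng.choices(words, weights=[phi[z][v] for v in words])[0]  # 3. word from phi_z
    return x, z, w

doc = [generate_word(["A1", "A2"]) for _ in range(20)]
```

Each token thus carries a latent author and a latent topic; inference reverses this process.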

  10. Graphical Model • Matrix of author-topic distributions (Θ) • Matrix of topic-word distributions (Φ) • From the set of co-authors: 1. Choose an author 2. Choose a topic 3. Choose a word

  11. Model Estimation • Estimate x and z by Gibbs sampling (assignments of each word to an author and topic) • Integrate out Φ and Θ • Estimation is efficient: linear in data size • Infer: • Author-topic distributions (Θ) • Topic-word distributions (Φ)

  12. Gibbs sampling in Author-Topics • Need full conditional distributions for variables • The probability of assigning the current word i to topic j and author k, given everything else, depends on two counts: the number of times word w is assigned to topic j, and the number of times topic j is assigned to author k
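For reference, the full conditional can be written as follows; this is a sketch in the notation commonly used for this model, where $C^{WT}$ and $C^{AT}$ are the word-topic and author-topic count matrices with the current token $i$ excluded, $\alpha$ and $\beta$ are symmetric Dirichlet hyperparameters, $V$ is the vocabulary size, and $T$ the number of topics:

```latex
P(z_i = j,\, x_i = k \mid w_i = m,\, \mathbf{z}_{-i},\, \mathbf{x}_{-i},\, \mathbf{a}_d)
\;\propto\;
\frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + V\beta}
\cdot
\frac{C^{AT}_{kj} + \alpha}{\sum_{j'} C^{AT}_{kj'} + T\alpha}
```

The first factor is the smoothed fraction of topic $j$'s tokens that are word $m$; the second is the smoothed fraction of author $k$'s tokens assigned to topic $j$, matching the two counts named on the slide.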

  13. Gibbs sampling procedure

  14. Start with random assignments to topics/authors

  15. Use all previous assignments, except for current word-token

  16. Sample topic and author, and move to next word-token

  17. Sample topic and author, and move to next word-token

  18. Sample topic and author, and move to next word-token

  19. Collect samples after >1000 iterations
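Putting slides 13-19 together, a minimal collapsed Gibbs sampler for the author-topic model might look like this; a stdlib-only sketch, where the demo corpus, variable names, and hyperparameter values are illustrative rather than taken from the paper's implementation:

```python
import random
from collections import defaultdict

def gibbs_author_topic(docs, authors, T, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampler sketch for the author-topic model.
    docs: list of token lists; authors: parallel list of co-author lists.
    Returns per-token (author, topic) assignments."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})   # vocabulary size
    cwt = defaultdict(int)                  # (word, topic) counts
    ct = defaultdict(int)                   # tokens per topic
    cat = defaultdict(int)                  # (author, topic) counts
    ca = defaultdict(int)                   # tokens per author
    assign = []

    # Start with random assignments to topics/authors (slide 14)
    for d, words in enumerate(docs):
        row = []
        for w in words:
            x, z = rng.choice(authors[d]), rng.randrange(T)
            cwt[w, z] += 1; ct[z] += 1; cat[x, z] += 1; ca[x] += 1
            row.append((x, z))
        assign.append(row)

    for _ in range(iters):
        for d, words in enumerate(docs):
            for i, w in enumerate(words):
                x, z = assign[d][i]
                # Use all previous assignments except the current token (slide 15)
                cwt[w, z] -= 1; ct[z] -= 1; cat[x, z] -= 1; ca[x] -= 1
                # Full conditional over (author, topic) pairs
                pairs, weights = [], []
                for a in authors[d]:
                    for t in range(T):
                        p = ((cwt[w, t] + beta) / (ct[t] + V * beta)
                             * (cat[a, t] + alpha) / (ca[a] + T * alpha))
                        pairs.append((a, t)); weights.append(p)
                # Sample topic and author, move to next token (slides 16-18)
                x, z = rng.choices(pairs, weights=weights)[0]
                cwt[w, z] += 1; ct[z] += 1; cat[x, z] += 1; ca[x] += 1
                assign[d][i] = (x, z)
    return assign

# tiny demo corpus (illustrative)
docs = [["kernel", "svm", "kernel"], ["graph", "graph", "diagnosis"]]
authors = [["A1"], ["A2"]]
assignments = gibbs_author_topic(docs, authors, T=2, iters=50)
```

In practice one would discard an initial burn-in and collect samples only after many iterations, as slide 19 notes.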

  20. Data • Corpora • CiteSeer: 160K abstracts, 85K authors • NIPS: 1.7K papers, 2K authors • Enron: 115K emails, 5K authors (senders) • Removed stop words; no stemming • Word order is irrelevant, just use word counts • Processing time: NIPS: 2000 Gibbs iterations → 12 hours on a PC workstation; CiteSeer: 700 Gibbs iterations → 111 hours

  21. Four example topics from CiteSeer (T=300)

  22. Four example topics from CiteSeer (T=300)

  23. Four more topics

  24. Some topics relate to generic word usage

  25. Some likely topics per author (CiteSeer) • Author = Andrew McCallum, U Mass: • Topic 1: classification, training, generalization, decision, data,… • Topic 2: learning, machine, examples, reinforcement, inductive,….. • Topic 3: retrieval, text, document, information, content,… • Author = Hector Garcia-Molina, Stanford: - Topic 1: query, index, data, join, processing, aggregate…. - Topic 2: transaction, concurrency, copy, permission, distributed…. - Topic 3: source, separation, paper, heterogeneous, merging….. • Author = Paul Cohen, USC/ISI: - Topic 1: agent, multi, coordination, autonomous, intelligent…. - Topic 2: planning, action, goal, world, execution, situation… - Topic 3: human, interaction, people, cognitive, social, natural….

  26. Four example topics from NIPS (T=100)

  27. ENRON Email: two example topics (T=100)

  28. ENRON Email: two topics not about Enron

  29. Stability of Topics • The content of topics is arbitrary across runs of the model (e.g., topic #1 is not the same across runs) • However: • The majority of topics are stable over processing time • The majority of topics can be aligned across runs • Topics represent genuine structure in the data

  30. Comparing NIPS topics from the same Markov chain [Figure: KL distance between topics at t1 = 1000 and re-ordered topics at t2 = 2000; best-matched pair KL = 0.54, worst KL = 4.78]

  31. Comparing NIPS topics from two different Markov chains [Figure: KL distance between topics from chain 1 and re-ordered topics from chain 2; best-matched pair KL = 1.03, worst KL = 9.49]
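One way to implement the alignment behind these comparisons is to greedily match each topic in one run to its nearest unused topic in the other under symmetrised KL distance; a sketch, where the greedy strategy and the toy two-topic distributions are assumptions, not necessarily the paper's exact procedure:

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence between two word distributions (lists of probabilities)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def align_topics(run1, run2):
    """Greedily match each topic in run1 to its closest unused topic in run2,
    using symmetrised KL; returns (index1, index2, distance) triples."""
    unused = set(range(len(run2)))
    matches = []
    for i, p in enumerate(run1):
        j = min(unused, key=lambda k: kl(p, run2[k]) + kl(run2[k], p))
        unused.remove(j)
        matches.append((i, j, kl(p, run2[j]) + kl(run2[j], p)))
    return matches

# toy topics from two runs: run1's topic 0 matches run2's topic 1, and vice versa
run1 = [[0.9, 0.1], [0.1, 0.9]]
run2 = [[0.15, 0.85], [0.85, 0.15]]
matches = align_topics(run1, run2)
```

Stable topics show small best-match distances (like the 0.54 within-chain figure), while unstable ones inflate the worst-case distance.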

  32. Detecting Papers on Unusual Topics for Authors • We can calculate the perplexity (unusualness) of the words in a document given an author • [Figure: papers ranked by perplexity for M. Jordan]
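A minimal version of this perplexity score, assuming a learned topic mixture for one author (theta_a) and topic-word distributions (phi); the toy inputs below are illustrative:

```python
import math

def doc_perplexity(words, theta_a, phi):
    """Perplexity of a document's words under one author's topic mixture.
    theta_a[t] = P(topic t | author); phi[t][w] = P(word w | topic t).
    High perplexity = unusual document for this author."""
    ll = 0.0
    for w in words:
        pw = sum(theta_a[t] * phi[t].get(w, 1e-12) for t in range(len(theta_a)))
        ll += math.log(pw)
    return math.exp(-ll / len(words))

# toy single-topic author over a two-word vocabulary (illustrative)
phi = [{"a": 0.5, "b": 0.5}]
theta_a = [1.0]
pp = doc_perplexity(["a", "b", "a", "b"], theta_a, phi)
```

Ranking an author's papers by this score surfaces the documents least typical of that author's topics.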

  33. Author Separation • Can the model attribute words to authors correctly within a document? • Test of model: 1) artificially combine abstracts from different authors 2) check whether each word is assigned to its correct original author • (The digit after each word below marks the author the model assigned it to: 1 = Scholkopf_B, 2 = Darwiche_A) A method1 is described which, like the kernel1 trick1 in support1 vector1 machines1 SVMs1, lets us generalize distance1-based2 algorithms to operate in feature1 spaces usually nonlinearly related to the input1 space. This is done by identifying a class of kernels1 which can be represented as norm1-based2 distances1 in Hilbert spaces. It turns1 out that common kernel1 algorithms, such as SVMs1 and kernel1 PCA1, are actually really distance1-based2 algorithms and can be run2 with that class of kernels1 too. As well as providing1 a useful new insight1 into how these algorithms work, the present2 work can form the basis1 for conceiving new algorithms. This paper presents2 a comprehensive approach for model2-based2 diagnosis2 which includes proposals for characterizing and computing2 preferred2 diagnoses2, assuming that the system2 description2 is augmented with a system2 structure2, a directed2 graph2 explicating the interconnections between system2 components2. Specifically, we first introduce the notion of a consequence2, which is a syntactically2 unconstrained propositional2 sentence2 that characterizes all consistency2-based2 diagnoses2, and show2 that standard2 characterizations of diagnoses2, such as minimal conflicts1, correspond to syntactic2 variations1 on a consequence2. Second, we propose a new syntactic2 variation on the consequence2 known as negation2 normal form (NNF) and discuss its merits compared to standard variations. Third, we introduce a basic algorithm2 for computing consequences in NNF given a structured system2 description. We show that if the system2 structure2 does not contain cycles2, then there is always a linear size2 consequence2 in NNF which can be computed in linear time2. For arbitrary1 system2 structures2 we show a precise connection between the complexity2 of computing2 consequences and the topology of the underlying system2 structure2. Finally, we present2 an algorithm2 that enumerates2 the preferred2 diagnoses2 characterized by a consequence2. The algorithm2 is shown1 to take linear time2 in the size2 of the consequence2 if the preference criterion1 satisfies some general conditions. Written by (1) Scholkopf_B, (2) Darwiche_A

  34. Temporal patterns in topics: hot and cold topics • We have CiteSeer papers from 1986-2001 • We can calculate time-series for topics • Hot topics become more prevalent • Cold topics become less prevalent • Do time-series correspond with known trends in computer science?

  35. Hot Topic: machine learning, data mining

  36. The inevitability of Bayes…

  37. Rise in Web/Mobile topics

  38. (Not so) Hot Topics

  39. Decline in programming languages, OS, ….

  40. Security research reborn….

  41. Decrease in use of Greek letters

  42. Burst of French writing in the mid-90s?

  43. Comparison to models that use less information: the topics model (topics, no authors) and the author model (authors, no topics)

  44. Matrix Factorization Interpretation • Author-topic model: the Words × Documents matrix factors as (Words × Topics) × (Topics × Authors) × (Authors × Documents), where the last factor A indicates which authors wrote each document • Topic model: (Words × Documents) = (Words × Topics) × (Topics × Documents) • Author model: (Words × Documents) = (Words × Authors) × (Authors × Documents, A)
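Written as mixtures over words, the three factorizations are as follows (a sketch, where $\mathcal{A}_d$ denotes the set of co-authors of document $d$ and $P(a \mid d)$ is uniform over $\mathcal{A}_d$):

```latex
% Author-topic model: words come from topics, topics from authors
P(w \mid d) = \sum_{a \in \mathcal{A}_d} P(a \mid d) \sum_{t=1}^{T} P(w \mid t)\, P(t \mid a)

% Topic model: words come from topics, topics from the document
P(w \mid d) = \sum_{t=1}^{T} P(w \mid t)\, P(t \mid d)

% Author model: words come directly from authors
P(w \mid d) = \sum_{a \in \mathcal{A}_d} P(w \mid a)\, P(a \mid d)
```

The author-topic model thus interposes the topic layer between authors and words, which the author model lacks, while tying documents to authors, which the topic model lacks.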

  45. Comparison Results • Train models on part of a new document and predict remaining words • Without having seen any words from new document, author-topic information helps in predicting words from that document • Topics model is more flexible in adapting to new document after observing a number of words

  46. Author prediction with CiteSeer • Task: predict (single) author of new CiteSeer abstracts • Results: • For 33% of documents, author guessed correctly • Median rank of true author = 26 (out of 85,000)
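One plausible way to implement this prediction task is to rank candidate authors by the perplexity of the document under each author's topic mixture, lowest first; a sketch, where the toy theta and phi below are illustrative, not learned from CiteSeer:

```python
import math

def rank_authors(words, theta, phi):
    """Rank candidate authors for a document by perplexity under each
    author's topic mixture (lower perplexity = more plausible author).
    theta[a][t] = P(topic t | author a); phi[t][w] = P(word w | topic t)."""
    scores = {}
    for a, mix in theta.items():
        ll = 0.0
        for w in words:
            pw = sum(mix[t] * phi[t].get(w, 1e-12) for t in range(len(mix)))
            ll += math.log(pw)
        scores[a] = math.exp(-ll / len(words))
    return sorted(scores, key=scores.get)

# toy distributions: A1 writes only topic 0, A2 only topic 1 (illustrative)
theta = {"A1": [1.0, 0.0], "A2": [0.0, 1.0]}
phi = [{"kernel": 1.0}, {"graph": 1.0}]
ranking = rank_authors(["kernel", "kernel", "kernel"], theta, phi)
```

The reported median rank of 26 out of 85,000 corresponds to the position of the true author in such a ranked list.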

  47. Perplexities for the true author and any random author [Figure: perplexity distributions for A = true author vs. A = any author]

  48. The Author-Topic Browser: (a) querying on author Pazzani_M; (b) querying on a topic relevant to that author; (c) querying on a document written by the author. http://www.ics.uci.edu/~michal/KDD/ATM.htm

  49. New Applications / Future Work • Finding relevant email: • "Find emails similar to this email based on content" • "Find people who wrote emails similar in content to this one" • Reviewer Recommendation • "Find reviewers for this set of NSF proposals who are active in relevant topics and have no conflicts of interest" • Change Detection/Monitoring • Which authors are on the leading edge of new topics? • Characterize the "topic trajectory" of an author over time • Author Identification • Who wrote this document? Incorporation of stylistic information
