
ICS 278: Data Mining
Lecture 14: Document Clustering and Topic Extraction
Padhraic Smyth, Department of Information and Computer Science, University of California, Irvine

Note: many of the slides on topic models were adapted from the presentation by Griffiths and Steyvers at the Beckman National Academy of Sciences Symposium on “Mapping Knowledge Domains”, Beckman Center, UC Irvine, May 2003.


Presentation Transcript


  1. ICS 278: Data Mining. Lecture 14: Document Clustering and Topic Extraction. Note: many of the slides on topic models were adapted from the presentation by Griffiths and Steyvers at the Beckman National Academy of Sciences Symposium on “Mapping Knowledge Domains”, Beckman Center, UC Irvine, May 2003. Padhraic Smyth, Department of Information and Computer Science, University of California, Irvine.

  2. Text Mining
  • Information Retrieval
  • Text Classification
  • Text Clustering
  • Information Extraction

  3. Document Clustering
  • Set of documents D in term-vector form
    • no class labels this time
    • want to group the documents into K groups or into a taxonomy
  • Each cluster hypothetically corresponds to a “topic”
  • Methods:
    • Any of the well-known clustering methods
    • K-means, e.g., “spherical k-means”: normalize the document vectors to unit length and cluster by cosine similarity (see the sketch below)
    • Hierarchical clustering
    • Probabilistic model-based clustering methods, e.g., mixtures of multinomials
    • Single-topic versus multiple-topic models
    • Extensions to author-topic models
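
A minimal sketch of spherical k-means on document vectors (e.g., TF-IDF rows); the function name and defaults are illustrative, and it assumes no empty documents:

```python
import numpy as np

def spherical_kmeans(X, k, n_iters=50, seed=0):
    """Spherical k-means: rows of X are L2-normalized document vectors,
    similarity is cosine (a dot product), and centroids live on the unit sphere."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-length documents
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        labels = np.argmax(X @ centroids.T, axis=1)     # nearest centroid by cosine
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                c = members.sum(axis=0)
                centroids[j] = c / np.linalg.norm(c)    # project back to the sphere
    return labels, centroids
```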

  4. Mixture Model Clustering

  5. Mixture Model Clustering

  6. Mixture Model Clustering
  • Conditional independence model for each component (often quite useful to first order); the likelihood is written out below
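
The equations on this slide were images in the original; a standard writing of the conditional-independence mixture likelihood it refers to (notation assumed: π_k are the mixture weights, n_dw the count of word w in document d):

```latex
% Mixture of multinomials: conditional independence of words within each component.
p(d) \;=\; \sum_{k=1}^{K} \pi_k \, p(d \mid k)
      \;=\; \sum_{k=1}^{K} \pi_k \prod_{w} p(w \mid k)^{\,n_{dw}}
```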

  7. Mixtures of Documents (figure: a binary terms-by-documents matrix; documents generated by Component 1 use one block of terms, documents generated by Component 2 use a different block)

  8. (figure: the same terms-by-documents matrix with the component labels hidden, i.e., the unlabeled data the clustering algorithm actually sees)

  9. Treat as Missing (figure: the terms-by-documents matrix with an appended cluster-label column; each document’s label, C1 or C2, is treated as missing data)

  10. Treat as Missing (figure: the matrix augmented with columns P(C1|x) and P(C2|x), the membership probabilities computed for each document). E-Step: estimate component membership probabilities given current parameter estimates.

  11. Treat as Missing (same figure). M-Step: use “fractional” weighted data to get new estimates of the parameters. (The full EM loop is sketched below.)
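
A minimal EM sketch for a mixture of multinomials over word-count vectors, matching the E-step and M-step above; variable names and the smoothing constant are illustrative:

```python
import numpy as np

def em_multinomial_mixture(X, K, n_iters=100, seed=0, eps=1e-2):
    """X: (D, W) matrix of word counts; K: number of components."""
    rng = np.random.default_rng(seed)
    D, W = X.shape
    pi = np.full(K, 1.0 / K)                       # mixture weights p(k)
    theta = rng.dirichlet(np.ones(W), size=K)      # p(w|k), one row per component
    for _ in range(n_iters):
        # E-step: log p(k|d) up to a constant, using conditional independence
        log_r = np.log(pi) + X @ np.log(theta).T   # (D, K)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)          # membership probabilities
        # M-step: re-estimate parameters from fractionally weighted data
        pi = r.mean(axis=0)
        counts = r.T @ X + eps                     # (K, W), smoothed counts
        theta = counts / counts.sum(axis=1, keepdims=True)
    return pi, theta, r
```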

  12. A Document Cluster
  Most likely terms in Component 5 (weight = 0.08):
  TERM     p(t|k)
  write    0.571
  drive    0.465
  problem  0.369
  mail     0.364
  articl   0.332
  hard     0.323
  work     0.319
  system   0.303
  good     0.296
  time     0.273
  Highest-lift terms in Component 5 (weight = 0.08):
  TERM     LIFT  p(t|k)  p(t)
  scsi     7.7   0.13    0.02
  drive    5.7   0.47    0.08
  hard     4.9   0.32    0.07
  card     4.2   0.23    0.06
  format   4.0   0.12    0.03
  softwar  3.8   0.21    0.05
  memori   3.6   0.14    0.04
  install  3.6   0.14    0.04
  disk     3.5   0.12    0.03
  engin    3.3   0.21    0.06

  13. Another Document Cluster
  Most likely terms in Component 1 (weight = 0.11):
  TERM       p(t|k)
  articl     0.684
  good       0.368
  dai        0.363
  fact       0.322
  god        0.320
  claim      0.294
  apr        0.279
  fbi        0.256
  christian  0.256
  group      0.239
  Highest-lift terms in Component 1 (weight = 0.11):
  TERM       LIFT  p(t|k)  p(t)
  fbi        8.3   0.26    0.03
  jesu       5.5   0.16    0.03
  fire       5.2   0.20    0.04
  christian  4.9   0.26    0.05
  evid       4.8   0.24    0.05
  god        4.6   0.32    0.07
  gun        4.2   0.17    0.04
  faith      4.2   0.12    0.03
  kill       3.8   0.22    0.06
  bibl       3.7   0.11    0.03

  14. A topic is represented as a (multinomial) distribution over words
  Example topic #1          Example topic #2
  SPEECH          .0691     WORDS        .0671
  RECOGNITION     .0412     WORD         .0557
  SPEAKER         .0288     USER         .0230
  PHONEME         .0224     DOCUMENTS    .0205
  CLASSIFICATION  .0154     TEXT         .0195
  SPEAKERS        .0140     RETRIEVAL    .0152
  FRAME           .0135     INFORMATION  .0144
  PHONETIC        .0119     DOCUMENT     .0144
  PERFORMANCE     .0111     LARGE        .0102
  ACOUSTIC        .0099     COLLECTION   .0098
  BASED           .0098     KNOWLEDGE    .0087
  PHONEMES        .0091     MACHINE      .0080
  UTTERANCES      .0091     RELEVANT     .0077
  SET             .0089     SEMANTIC     .0076
  LETTER          .0088     SIMILARITY   .0071
  …                         …

  15. The basic model… (figure: a naive Bayes graphical model with a single class node C as parent of word nodes X1, X2, …, Xd)

  16. A better model… (figure: the same word nodes X1, X2, …, Xd, now with multiple parent nodes A, B, C, so several hidden components can influence a document at once)

  17. A better model… (same figure). Inference can be intractable due to undirected loops!

  18. A better model for documents…
  • Multi-topic model
    • A document is generated from multiple components
    • Multiple components can be active at once
    • Each component = multinomial distribution
  • Parameter estimation is tricky
  • Very useful: “parses” documents into high-level semantic components

  19. History of multi-topic models
  • Latent class models in statistics
  • Hofmann 1999
    • Original application to documents
  • Blei, Ng, and Jordan (2001, 2003)
    • Variational methods
  • Griffiths and Steyvers (2003)
    • Gibbs sampling approach (very efficient)

  20. Example topics (term, p(term|topic)):
  Topic 1                Topic 2                  Topic 3                    Topic 4
  GROUP      0.057185    DYNAMIC      0.152141    DISTRIBUTED    0.192926    RESEARCH    0.066798
  MULTICAST  0.051620    STRUCTURE    0.137964    COMPUTING      0.044376    SUPPORTED   0.043233
  INTERNET   0.049499    STRUCTURES   0.088040    SYSTEMS        0.038601    PART        0.035590
  PROTOCOL   0.041615    STATIC       0.043452    SYSTEM         0.031797    GRANT       0.034476
  RELIABLE   0.020877    PAPER        0.032706    HETEROGENEOUS  0.030996    SCIENCE     0.023250
  GROUPS     0.019552    DYNAMICALLY  0.023940    ENVIRONMENT    0.023163    FOUNDATION  0.022653
  PROTOCOLS  0.019088    PRESENT      0.015328    PAPER          0.017960    FL          0.021220
  IP         0.014980    META         0.015175    SUPPORT        0.016587    WORK        0.021061
  TRANSPORT  0.012529    CALLED       0.011669    ARCHITECTURE   0.016416    NATIONAL    0.019947
  DRAFT      0.009945    RECURSIVE    0.010145    ENVIRONMENTS   0.013271    NSF         0.018116
  Labels from the slide: “content” components versus “boilerplate” components (e.g., Topic 4, with GRANT, FOUNDATION, NSF, is boilerplate).

  21. Example topics, continued:
  Topic 5                  Topic 6                     Topic 7                Topic 8
  DIMENSIONAL  0.038901    RULES           0.090569    ORDER      0.192759    GRAPH      0.095687
  POINTS       0.037263    CLASSIFICATION  0.062699    TERMS      0.048688    PATH       0.061784
  SURFACE      0.031438    RULE            0.062174    PARTIAL    0.044907    GRAPHS     0.061217
  GEOMETRIC    0.025006    ACCURACY        0.028926    HIGHER     0.041284    PATHS      0.030151
  SURFACES     0.020152    ATTRIBUTES      0.023090    REDUCTION  0.035061    EDGE       0.028590
  MESH         0.016875    INDUCTION       0.021909    PAPER      0.028602    NUMBER     0.022775
  PLANE        0.013902    CLASSIFIER      0.019418    TERM       0.018204    CONNECTED  0.016817
  POINT        0.013780    SET             0.018303    ORDERING   0.017652    DIRECTED   0.014405
  GEOMETRY     0.013780    ATTRIBUTE       0.016204    SHOW       0.017022    NODES      0.013625
  PLANAR       0.012385    CLASSIFIERS     0.015417    MAGNITUDE  0.015526    VERTICES   0.013554

  Topic 9                    Topic 10                 Topic 11                 Topic 12
  INFORMATION    0.281237    SYSTEM      0.143873     PAPER       0.077870     LANGUAGE     0.158786
  TEXT           0.048675    FILE        0.054076     CONDITIONS  0.041187     PROGRAMMING  0.097186
  RETRIEVAL      0.044046    OPERATING   0.053963     CONCEPT     0.036268     LANGUAGES    0.082410
  SOURCES        0.029548    STORAGE     0.039072     CONCEPTS    0.033457     FUNCTIONAL   0.032815
  DOCUMENT       0.029000    DISK        0.029957     DISCUSSED   0.027414     SEMANTICS    0.027003
  DOCUMENTS      0.026503    SYSTEMS     0.029221     DEFINITION  0.024673     SEMANTIC     0.024341
  RELEVANT       0.018523    KERNEL      0.028655     ISSUES      0.024603     NATURAL      0.016410
  CONTENT        0.016574    ACCESS      0.018293     PROPERTIES  0.021511     CONSTRUCTS   0.014129
  AUTOMATICALLY  0.009326    MANAGEMENT  0.017218     IMPORTANT   0.021370     GRAMMAR      0.013640
  DIGITAL        0.008777    UNIX        0.016878     EXAMPLES    0.019754     LISP         0.010326

  22. Example topics, continued:
  Topic 13                  Topic 14                  Topic 15                    Topic 16
  MODEL         0.429185    PAPER        0.050411    TYPE            0.088650    KNOWLEDGE    0.212603
  MODELS        0.201810    APPROACHES   0.045245    SPECIFICATION   0.051469    SYSTEM       0.090852
  MODELING      0.066311    PROPOSED     0.043132    TYPES           0.046571    SYSTEMS      0.051978
  QUALITATIVE   0.018417    CHANGE       0.040393    FORMAL          0.036892    BASE         0.042277
  COMPLEX       0.009272    BELIEF       0.025835    VERIFICATION    0.029987    EXPERT       0.020172
  QUANTITATIVE  0.005662    ALTERNATIVE  0.022470    SPECIFICATIONS  0.024439    ACQUISITION  0.017816
  CAPTURE       0.005301    APPROACH     0.020905    CHECKING        0.024439    DOMAIN       0.016638
  MODELED       0.005301    ORIGINAL     0.019026    SYSTEM          0.023259    INTELLIGENT  0.015737
  ACCURATELY    0.004639    SHOW         0.017852    PROPERTIES      0.018242    BASES        0.015390
  REALISTIC     0.004278    PROPOSE      0.016991    ABSTRACT        0.016826    BASED        0.014004
  Label from the slide: “style” components (e.g., Topic 14: PAPER, APPROACHES, PROPOSED, SHOW, PROPOSE).

  23. A generative model for documents (Blei, Ng, & Jordan, 2003)
  • Each document is a mixture of topics: the topic for each word is drawn from document-specific mixture weights θ(d)
  • Each word is chosen from a single topic: the word is drawn from that topic’s distribution φ(z)
  (the distributions are written out below)
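
The per-word sampling distributions on this slide were rendered as images; a reconstruction in the θ/φ notation of Griffiths and Steyvers:

```latex
% For each word token i in document d: draw a topic, then a word from that topic.
z_i \sim \mathrm{Multinomial}\big(\theta^{(d)}\big), \qquad
w_i \mid z_i \sim \mathrm{Multinomial}\big(\phi^{(z_i)}\big)
```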

  24. A generative model for documents
  • Called Latent Dirichlet Allocation (LDA)
  • Introduced by Blei, Ng, and Jordan (2003); a reinterpretation of PLSI (Hofmann, 2001)
  (figure: the graphical model, with per-document mixture weights θ generating a topic z for each word w)

  25. LDA versus SVD
  • LDA: the words-by-documents probability matrix P(w) factors into a words-by-topics matrix P(w|z) times a topics-by-documents matrix P(z)
  • SVD (Dumais, Landauer): the words-by-documents count matrix C factors as C = U D V^T, with words-by-dims vectors in U and dims-by-documents loadings in V^T

  26. A generative model for documents
  topic 1: P(w|z = 1) = φ(1)    topic 2: P(w|z = 2) = φ(2)
  HEART        0.2              HEART        0.0
  LOVE         0.2              LOVE         0.0
  SOUL         0.2              SOUL         0.0
  TEARS        0.2              TEARS        0.0
  JOY          0.2              JOY          0.0
  SCIENTIFIC   0.0              SCIENTIFIC   0.2
  KNOWLEDGE    0.0              KNOWLEDGE    0.2
  WORK         0.0              WORK         0.2
  RESEARCH     0.0              RESEARCH     0.2
  MATHEMATICS  0.0              MATHEMATICS  0.2

  27. Choose mixture weights for each document, generate a “bag of words”
  θ = {P(z = 1), P(z = 2)}, e.g., {0, 1}, {0.25, 0.75}, {0.5, 0.5}, {0.75, 0.25}, {1, 0}
  (figure: one sampled document per weight setting; the {0, 1} document contains only topic-2 words such as MATHEMATICS KNOWLEDGE RESEARCH WORK …, while the {1, 0} document contains only topic-1 words such as LOVE SOUL TEARS JOY …; a sampling sketch follows)
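
A minimal sketch of this generative step, using the two topics from the previous slide; the vocabulary ordering, document length, and seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["HEART", "LOVE", "SOUL", "TEARS", "JOY",
         "SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"]
# phi[z][w] = P(w|z): topic 1 puts mass on the first five words, topic 2 on the rest
phi = np.array([[0.2] * 5 + [0.0] * 5,
                [0.0] * 5 + [0.2] * 5])

def generate_doc(theta, n_words=12):
    """theta = [P(z=1), P(z=2)]: pick a topic per token, then a word from it."""
    z = rng.choice(2, size=n_words, p=theta)              # topic per token
    return [vocab[rng.choice(10, p=phi[zi])] for zi in z]  # word per token

for theta in ([0, 1], [0.25, 0.75], [0.5, 0.5], [0.75, 0.25], [1, 0]):
    print(theta, " ".join(generate_doc(np.array(theta, dtype=float))))
```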

  28. Bayesian inference
  • By Bayes’ rule, P(z|w) ∝ P(w|z) P(z); the normalizing sum in the denominator runs over T^n terms (T topics, n word tokens)
  • The full posterior is therefore only tractable up to a constant

  29. Bayesian sampling
  • Sample from a Markov chain which converges to the target distribution of interest
  • Known in general as Markov chain Monte Carlo (MCMC); the simplest version is Gibbs sampling
  • Say we are interested in estimating p(x, y | D): we can approximate it by sampling from p(x|y, D) and p(y|x, D) in an iterative fashion
  • Useful when the conditionals are known, but the joint distribution is not easy to work with
  • Converges to the true distribution under fairly broad assumptions
  • Lets us compute approximate statistics from intractable distributions
  (a generic two-variable Gibbs sampler is sketched below)
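
A minimal sketch of two-variable Gibbs sampling on a target whose conditionals are known in closed form; the bivariate normal target and its correlation value are illustrative choices, not from the slides:

```python
import numpy as np

def gibbs_bivariate_normal(rho=0.8, n_samples=5000, seed=0):
    """Sample a standard bivariate normal with correlation rho by alternating
    draws from p(x|y) and p(y|x), each a univariate normal."""
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0
    samples = np.empty((n_samples, 2))
    sd = np.sqrt(1.0 - rho**2)           # conditional standard deviation
    for t in range(n_samples):
        x = rng.normal(rho * y, sd)      # x | y ~ N(rho*y, 1 - rho^2)
        y = rng.normal(rho * x, sd)      # y | x ~ N(rho*x, 1 - rho^2)
        samples[t] = (x, y)
    return samples

s = gibbs_bivariate_normal()
print(np.corrcoef(s[1000:].T))           # discard burn-in; correlation ~ 0.8
```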

  30. Gibbs sampling
  • Need the full conditional distributions for the variables
  • Since we only sample z, we need P(z_i = j | z_{-i}, w), which (with θ and φ integrated out) is proportional to
      (n(w_i)_{-i,j} + β) / (n(·)_{-i,j} + Wβ)  ×  (n(d_i)_{-i,j} + α) / (n(d_i)_{-i,·} + Tα)
  where n(w)_{-i,j} is the number of times word w is assigned to topic j, and n(d)_{-i,j} is the number of times topic j is used in document d, both counted excluding the current token i
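
A minimal collapsed Gibbs sampler implementing this conditional; the hyperparameter values and the toy corpus format (a list of token-id lists) are assumptions:

```python
import numpy as np

def lda_gibbs(docs, T, W, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """docs: list of lists of word ids in [0, W). Returns topic assignments
    plus the count matrices used by the full conditional."""
    rng = np.random.default_rng(seed)
    nwt = np.zeros((W, T)) + beta                 # word-topic counts (smoothed)
    ndt = [np.zeros(T) + alpha for _ in docs]     # document-topic counts (smoothed)
    nt = nwt.sum(axis=0)                          # tokens per topic (incl. W*beta)
    z = [rng.integers(T, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):                # initialize counts
        for i, w in enumerate(doc):
            nwt[w, z[d][i]] += 1; ndt[d][z[d][i]] += 1; nt[z[d][i]] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                j = z[d][i]                       # remove token i from the counts
                nwt[w, j] -= 1; ndt[d][j] -= 1; nt[j] -= 1
                # full conditional up to a constant; the per-document denominator
                # (n_d + T*alpha) is the same for all j, so it drops out
                p = (nwt[w] / nt) * ndt[d]
                j = rng.choice(T, p=p / p.sum())
                z[d][i] = j                       # re-add the token under its new topic
                nwt[w, j] += 1; ndt[d][j] += 1; nt[j] += 1
    return z, nwt, ndt
```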

  31.–39. Gibbs sampling (figure sequence: the topic assignment of every word token, shown after iteration 1, iteration 2, and so on through iteration 1000)

  40. A visual example: bars
  • Sample each pixel from a mixture of topics
  • pixel = word, image = document
  (a data-generation sketch follows)
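
A minimal sketch of generating a bars dataset, where each “topic” is a uniform distribution over one row or one column of a small image grid; the grid size, document length, and Dirichlet parameter are illustrative:

```python
import numpy as np

def make_bars_data(n_images=500, size=5, n_tokens=100, alpha=1.0, seed=0):
    """Each topic is one horizontal or vertical bar of a size x size grid of
    'pixels' (words); each image (document) mixes topics."""
    rng = np.random.default_rng(seed)
    topics = []
    for r in range(size):                  # horizontal bars
        t = np.zeros((size, size)); t[r, :] = 1.0; topics.append(t.ravel())
    for c in range(size):                  # vertical bars
        t = np.zeros((size, size)); t[:, c] = 1.0; topics.append(t.ravel())
    phi = np.array([t / t.sum() for t in topics])     # (2*size, size*size)
    images = np.zeros((n_images, size * size), dtype=int)
    for d in range(n_images):
        theta = rng.dirichlet(np.full(len(phi), alpha))  # per-image topic weights
        z = rng.choice(len(phi), size=n_tokens, p=theta)
        for zi in z:
            w = rng.choice(size * size, p=phi[zi])
            images[d, w] += 1              # pixel intensity = word count
    return images, phi
```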

  41.–42. (figure-only slides: results on the bars data)

  43. Interpretable decomposition
  • SVD gives a basis for the data, but not an interpretable one
  • The true basis is not orthogonal, so rotating the SVD basis does no good
  (a small demonstration follows)
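
A small demonstration of the point, assuming the make_bars_data sketch above: the SVD factors of the bars data contain negative entries and do not sum to one, so unlike the rows of phi they cannot be read as distributions over pixels.

```python
import numpy as np

images, phi = make_bars_data()             # from the sketch after slide 40
U, S, Vt = np.linalg.svd(images.astype(float), full_matrices=False)

# The leading right singular vectors play the role of "topics" (pixel patterns),
# but they are not non-negative and are not normalized distributions.
print("min entry of first 10 singular vectors:", Vt[:10].min())  # negative
print("row sums of first 3:", Vt[:3].sum(axis=1))                # not equal to 1
```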

  44. Bayesian model selection
  • How many topics T do we need?
  • A Bayesian would consider the posterior: P(T|w) ∝ P(w|T) P(T)
  • P(w|T) involves summing over all possible assignments z, but it can be approximated by sampling (a sketch follows)
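
One standard sampling approximation (the one used in Griffiths and Steyvers’ PNAS work) is the harmonic mean of P(w|z) over retained Gibbs samples; a sketch, assuming raw word-topic count matrices saved from a sampler like the one above, and noting that the harmonic-mean estimator is known to be high-variance:

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def log_p_w_given_z(nwt_raw, beta, W):
    """log P(w|z) with phi integrated out; nwt_raw: raw (W, T) word-topic counts."""
    nt = nwt_raw.sum(axis=0)
    T = nwt_raw.shape[1]
    return (T * (gammaln(W * beta) - W * gammaln(beta))
            + gammaln(nwt_raw + beta).sum()
            - gammaln(nt + W * beta).sum())

def harmonic_mean_log_evidence(logliks):
    """log P(w|T) ~ log S - logsumexp(-loglik_s), the harmonic mean over S samples."""
    logliks = np.asarray(logliks)
    return np.log(len(logliks)) - logsumexp(-logliks)
```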

  45.–47. Bayesian model selection (figure sequence: the estimated P(w|T) for the observed corpus w under models with different numbers of topics, e.g., T = 10 and T = 100)

  48. Back to the bars data set

  49. PNAS corpus preprocessing
  • Used all D = 28,154 abstracts from 1991–2001
  • Used any word occurring in at least five abstracts and not on a “stop” list (W = 20,551)
  • Segmentation by any delimiting character, for a total of n = 3,026,970 word tokens in the corpus
  • Also used the PNAS class designations for 2001
  (a preprocessing sketch follows)
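
A minimal sketch of this preprocessing pipeline; the tokenization regex, the stop list, and the input format are assumptions:

```python
import re
from collections import Counter

def preprocess(abstracts, stop_words, min_docs=5):
    """abstracts: list of raw strings. Split on any delimiting (non-alphanumeric)
    character, then keep words in >= min_docs abstracts and not on the stop list."""
    docs = [re.split(r"[^a-z0-9]+", a.lower()) for a in abstracts]
    docs = [[w for w in d if w] for d in docs]
    doc_freq = Counter(w for d in docs for w in set(d))   # document frequency
    vocab = sorted(w for w, df in doc_freq.items()
                   if df >= min_docs and w not in stop_words)
    word_id = {w: i for i, w in enumerate(vocab)}
    token_docs = [[word_id[w] for w in d if w in word_id] for d in docs]
    return token_docs, vocab
```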

  50. Running the algorithm
  • Memory requirements are linear in T(W + D); runtime is proportional to nT
  • T = 50, 100, 200, 300, 400, 500, 600, (1000)
  • Ran 8 chains for each T, with a burn-in of 1000 iterations and 10 samples per chain at a lag of 100
  • All runs completed in under 30 hours on the Blue Horizon supercomputer at San Diego
  (a back-of-the-envelope check follows)
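
A back-of-the-envelope check of the T(W + D) memory claim for the largest standard run; the 4-bytes-per-count assumption is mine, not from the slides:

```python
W, D, n = 20_551, 28_154, 3_026_970
T = 600
counts = T * (W + D)                       # word-topic plus document-topic counts
print(f"{counts:,} count entries ~ {counts * 4 / 1e6:.0f} MB at 4 bytes each")
# per-iteration work is proportional to n*T: one conditional over T per token
print(f"{n * T:,} topic evaluations per Gibbs sweep")
```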
