
Automatic Labeling of Multinomial Topic Models


Presentation Transcript


(Poster overview figure.) A statistical topic model extracts latent topics from a text collection; each topic is a multinomial distribution over terms, p(w|θ). The running example topic has top words such as clustering, dimensional, algorithm, partition, birch, hash, join, shape, body. From the collection (the context, here the SIGMOD Proceedings), an NLP chunker and n-gram statistics build a candidate label pool, e.g.: database system, clustering algorithm, r tree, functional dependency, iceberg cube, concurrency control, index structure. Each candidate label l is represented as a unigram language model P(w|l) and given a relevance score against the topic: a good label l1 = "clustering algorithm" is closer to the topic than bad labels l2 = "hash join" or "body shape", i.e. D(θ||l1) < D(θ||l2). Re-ranking for coverage and discrimination then yields the final labels, e.g. "clustering algorithm; distance measure; …".
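The selection criterion D(θ||l1) < D(θ||l2) compares the KL divergence between the topic distribution and each candidate label's unigram language model. A minimal sketch of that comparison, with invented distributions (not the poster's actual estimates):

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL divergence D(p || q) over a shared vocabulary, smoothed with eps
    so that words missing from q do not produce division by zero."""
    vocab = set(p) | set(q)
    return sum(p.get(w, 0.0) * math.log((p.get(w, 0.0) + eps) / (q.get(w, eps) + eps))
               for w in vocab if p.get(w, 0.0) > 0)

# Toy topic distribution p(w|theta), loosely following the poster's example topic.
theta = {"clustering": 0.30, "algorithm": 0.25, "dimensional": 0.15,
         "partition": 0.10, "birch": 0.08, "hash": 0.07, "join": 0.05}

# Invented unigram language models p(w|l) for two candidate labels.
label_good = {"clustering": 0.35, "algorithm": 0.30, "partition": 0.15,
              "dimensional": 0.10, "birch": 0.10}                        # l1 = "clustering algorithm"
label_bad = {"hash": 0.40, "join": 0.35, "index": 0.15, "bucket": 0.10}  # l2 = "hash join"

# The good label's distribution is closer to the topic's.
assert kl_divergence(theta, label_good) < kl_divergence(theta, label_bad)
```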
DAIS, the Database and Information Systems Laboratory at the University of Illinois at Urbana-Champaign: Large Scale Information Management

Automatic Labeling of Multinomial Topic Models
Qiaozhu Mei, Xuehua Shen, and ChengXiang Zhai

1. Multinomial Topic Models: Hard to Interpret (Label)
Topic models (multinomial mixture, PLSA, LDA, and lots of extensions) represent each topic as a multinomial distribution of terms, for example:
• term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …
• data 0.0358, university 0.0132, new 0.0119, results 0.0119, end 0.0116, high 0.0098, research 0.0096, figure 0.0089, analysis 0.0076, number 0.0073, institute 0.0072, …
• insulin, foraging, foragers, collected, grains, loads, collection, nectar, … (Blei et al., http://www.cs.cmu.edu/~lemur/science/topics.html)
Such models support many applications (topic extraction, IR, contextual text mining, opinion analysis, …), but the raw distributions are hard to interpret. Question: can we automatically generate meaningful labels for topics? Good labels = understandable + relevant + high coverage + discriminative.

2. Our Method
Topic (multinomial distribution of terms) → candidate label pool (built with an NLP chunker and n-gram statistics; each label = a unigram language model) → relevance scoring → re-ranking → ranked list of labels.
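The candidate-label-pool step can be approximated with plain bigram counting; a real system would also run a noun-phrase chunker, and the mini-corpus below is invented for illustration:

```python
from collections import Counter
from itertools import islice

def candidate_labels(docs, min_count=2):
    """Collect frequent bigrams from tokenized documents as candidate topic
    labels: a crude stand-in for the chunker + n-gram statistics step."""
    bigrams = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        bigrams.update(zip(tokens, islice(tokens, 1, None)))
    return [" ".join(bg) for bg, c in bigrams.most_common() if c >= min_count]

docs = [
    "clustering algorithm for high dimensional data",
    "a scalable clustering algorithm using birch",
    "hash join processing in database systems",
    "hash join and index structures",
]
print(candidate_labels(docs))  # → ['clustering algorithm', 'hash join']
```

Singleton bigrams are dropped so that only phrases with real corpus support enter the pool.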
3. Scoring and Re-ranking Labels
Modeling the relevance of a candidate label l to a topic θ:
• Zero-order relevance: prefer phrases that well cover the top words of the topic (this carries a bias from using the collection C to estimate l).
• First-order relevance: prefer phrases with a similar context (distribution), computed from pointwise mutual information based on C, pre-computed.
Re-ranking then enforces high coverage inside the topic (via MMR) and discrimination across topics.

4. Results: Sample Topic Labels (top topic words → generated labels)
• sampling 0.06, estimation 0.04, approximate 0.04, histograms 0.03, selectivity 0.03, histogram 0.02, answers 0.02, accurate 0.02 → selectivity estimation; random sampling; approximate answers
• tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01 → indexing methods; r tree; b tree
• clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, quality 0.005 → clustering algorithms; large data
• dependencies, functional, cube, multivalued, iceberg, buc, … → multivalue dependency; functional dependency; iceberg cube
• term, relevance, weight, feedback, independence, … → term dependency; independence assumption
• north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh 0.009, reagan 0.009, charges 0.007 → iran contra; iran contra trial
• plane 0.02, air 0.02, flight 0.02, pilot 0.01, crew 0.01, force 0.01, accident 0.01, crash 0.005 → air plane crashes; air force
• the, of, a, and, to, data, … (each > 0.02)

5. Context-sensitive Labeling
The same topic (sampling, estimation, approximation, histogram, selectivity, histograms, …) receives different labels depending on the collection used as context:
• Context: Database (SIGMOD Proceedings) → selectivity estimation; random sampling; approximate answers
• Context: IR (SIGIR Proceedings) → distributed retrieval; parameter estimation; mixture models

6. The method is applicable to any task with a unigram language model, such as labeling document clusters.
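The scoring-and-re-ranking step can be sketched as follows. The PMI table, topic distribution, similarity measure, and the trade-off weight lam are all invented for illustration; the poster does not specify these values:

```python
def first_order_relevance(label, theta, pmi):
    """First-order relevance: sum over words of p(w|theta) * PMI(w, label | C),
    where the PMI values are pre-computed from the collection C."""
    return sum(p * pmi.get((w, label), 0.0) for w, p in theta.items())

def mmr_rerank(labels, scores, lam=0.7, k=2):
    """Re-rank with Maximal Marginal Relevance so the selected labels cover
    different aspects of the topic (lam is an assumed trade-off weight)."""
    def overlap(a, b):  # crude similarity: fraction of shared words
        wa, wb = set(a.split()), set(b.split())
        return len(wa & wb) / max(len(wa | wb), 1)
    selected, pool = [], list(labels)
    while pool and len(selected) < k:
        best = max(pool, key=lambda l: lam * scores[l]
                   - (1 - lam) * max((overlap(l, s) for s in selected), default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected

# Toy topic and invented PMI table (illustrative values only).
theta = {"sampling": 0.4, "estimation": 0.3, "histogram": 0.3}
pmi = {("sampling", "random sampling"): 2.0,
       ("estimation", "selectivity estimation"): 2.5,
       ("sampling", "selectivity estimation"): 1.2,
       ("histogram", "approximate answers"): 1.5}
labels = ["random sampling", "selectivity estimation", "approximate answers"]
scores = {l: first_order_relevance(l, theta, pmi) for l in labels}
print(mmr_rerank(labels, scores))  # → ['selectivity estimation', 'random sampling']
```

MMR penalizes a candidate by its similarity to labels already chosen, so the second pick covers a different aspect of the topic rather than restating the first.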
