The IBP Compound Dirichlet Process and its Application to Focused Topic Modeling

The IBP Compound Dirichlet Process and its Application to Focused Topic Modeling. Sinead Williamson, Chong Wang, Katherine A. Heller, David M. Blei. Presented by Eric Wang, 9/16/2011

Introduction • Latent Dirichlet Allocation (LDA) is a powerful and ubiquitous topic modeling framework. • Incorporating the hierarchical Dirichlet process (HDP) into LDA allows for more flexible topic modeling by estimating the global topic proportions. • A drawback of HDP-LDA is that a topic that is rare globally will also have a low expected proportion within each document. • The authors propose a model that allows a globally rare topic to still have large mass within individual documents.

Hierarchical Dirichlet Process • The hierarchical Dirichlet process (HDP) is a prior for Bayesian nonparametric mixed membership modeling of data groups. • Hierarchically, a global random measure is drawn from a Dirichlet process, and each data group m then draws its own measure from a Dirichlet process whose base measure is the global one. • In the HDP, the expectation of each group's mixing weights is the global mixing weights. In practice, the global mixing weights are the average of the mixture membership across groups.
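The two-level definition described above is usually written as follows. The symbols are assumptions made here for illustration: $H$ is the base measure, $\zeta$ and $\tau$ are concentration parameters, $G_0$ is the global measure, and $G_m$ is the measure for data group m. The second line states the expectation property that the bullet refers to.

```latex
\begin{align*}
G_0 &\sim \mathrm{DP}(\zeta, H), &
G_m \mid G_0 &\sim \mathrm{DP}(\tau, G_0), \quad m = 1, \dots, M, \\
\mathbb{E}[G_m \mid G_0] &= G_0 .
\end{align*}
```

Because every $G_m$ is centered on $G_0$, an atom with small weight in $G_0$ has small expected weight in every group, which is the limitation the focused topic model targets.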

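Indian Buffet Process • The Indian Buffet Process (IBP) defines a distribution over binary matrices with an infinite number of columns and a finite number of non-zero entries. • Hierarchically, each column k has a probability $\pi_k$, and each entry $b_{mk}$ is a Bernoulli draw with that probability, where m and k index the rows and columns of the binary matrix b. • It can also be represented via a stick-breaking construction, in which $\pi_k$ is a product of k independent Beta-distributed variables and therefore decreases with k.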
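A minimal sketch of the stick-breaking representation, assuming the standard construction $\mu_k \sim \mathrm{Beta}(\alpha, 1)$, $\pi_k = \prod_{j \le k} \mu_j$, $b_{mk} \sim \mathrm{Bernoulli}(\pi_k)$. The function name, the finite `truncation` level, and the parameter values are illustrative assumptions, not part of the presentation.

```python
import random


def sample_ibp_stick_breaking(num_rows, alpha, truncation, rng):
    """Draw a binary matrix from a truncated stick-breaking IBP (illustrative)."""
    sticks = []
    length = 1.0
    for _ in range(truncation):
        # mu_k ~ Beta(alpha, 1); pi_k is the running product, so it shrinks with k.
        length *= rng.betavariate(alpha, 1.0)
        sticks.append(length)
    # b_mk ~ Bernoulli(pi_k), drawn independently for every row m.
    matrix = [[1 if rng.random() < p else 0 for p in sticks]
              for _ in range(num_rows)]
    return sticks, matrix


rng = random.Random(0)
sticks, matrix = sample_ibp_stick_breaking(num_rows=5, alpha=2.0,
                                           truncation=20, rng=rng)
```

Truncating at a fixed number of columns is a convenience for the sketch; since the stick lengths decay geometrically in expectation, late columns are almost always empty.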

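IBP Compound Dirichlet Process • Combining the HDP and the IBP into a single prior yields an infinite “spike-slab” prior, the IBP compound Dirichlet process (ICD). • A spike distribution (the IBP) determines which variables are drawn from the slab (the DP). • The generative process first draws the global quantities shared by all data groups; then each data group m draws a binary vector from the IBP and a group-level DP restricted to the atoms that vector selects.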

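IBP Compound Dirichlet Process • The atom masses of data group m are Dirichlet distributed, with parameters given by the global relative masses switched on or off by the group's binary vector. • In this construction, the atom masses are the topic proportions for document m, and the binary vector $b_m$ (row m of b) indicates which dictionary elements the document uses.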
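The masked Dirichlet draw can be sketched as below. It assumes the form $\theta_m \sim \mathrm{Dirichlet}(b_m \cdot \phi)$, where $\phi$ holds the global relative masses; the symbol names, the function name, and the numeric values are hypothetical and chosen only for illustration.

```python
import random


def sample_topic_proportions(b_m, phi, rng):
    """theta_m ~ Dirichlet(b_m * phi); switched-off topics get exactly zero mass."""
    # A Dirichlet draw is a vector of independent gamma draws, normalized.
    draws = [rng.gammavariate(p, 1.0) if b else 0.0 for b, p in zip(b_m, phi)]
    total = sum(draws)
    if total == 0.0:
        raise ValueError("at least one topic must be active")
    return [g / total for g in draws]


rng = random.Random(1)
phi = [5.0, 0.5, 3.0, 0.3]  # hypothetical relative masses; topics 1 and 3 are small
b_m = [0, 1, 0, 1]          # this document selects only the two small topics
theta_m = sample_topic_proportions(b_m, phi, rng)
```

Here topics 1 and 3 have small relative mass, yet they share all of the document's mass because the topics with large relative mass are switched off.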

Focused Topic Models • The authors use the ICD to develop the Focused Topic Model (FTM). • In this framework, a global distribution over topics is drawn and shared across all documents, as in HDP-LDA. • Each document infers a subset of topics from the global menu. The subset is determined by the document's binary vector. Since the binary vector is independent of the global topic proportions, topics that are rare globally can still make up a large proportion of individual documents.

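Focused Topic Models • The generative process for the FTM draws the global topics and their relative masses; then each document draws a binary topic-selection vector, topic proportions over the selected topics, and finally a topic and a word for each token.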
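The generative process can be sketched end to end as follows. This is a simplified illustration under several assumptions that are not stated on the slide: a finite `truncation` of the topic set, gamma-distributed relative masses, symmetric Dirichlet topics with parameter `eta`, and a fixed `doc_length` rather than a sampled one. All names are hypothetical.

```python
import random


def dirichlet(params, rng):
    """Normalize gamma draws; entries with parameter 0 receive zero mass."""
    draws = [rng.gammavariate(a, 1.0) if a > 0 else 0.0 for a in params]
    total = sum(draws)
    return [d / total for d in draws]


def generate_ftm_corpus(num_docs, doc_length, vocab_size, truncation,
                        alpha, gamma, eta, rng):
    # Global level: stick lengths pi_k, relative masses phi_k, topics beta_k.
    pi, length = [], 1.0
    for _ in range(truncation):
        length *= rng.betavariate(alpha, 1.0)
        pi.append(length)
    phi = [rng.gammavariate(gamma, 1.0) for _ in range(truncation)]
    beta = [dirichlet([eta] * vocab_size, rng) for _ in range(truncation)]

    corpus = []
    for _ in range(num_docs):
        # Document level: choose which topics are switched on.
        b_m = [1 if rng.random() < p else 0 for p in pi]
        if sum(b_m) == 0:
            corpus.append(([], b_m))  # no active topic: empty document
            continue
        theta_m = dirichlet([b * p for b, p in zip(b_m, phi)], rng)
        words = []
        for _ in range(doc_length):
            z = rng.choices(range(truncation), weights=theta_m)[0]  # topic
            w = rng.choices(range(vocab_size), weights=beta[z])[0]  # word
            words.append((z, w))
        corpus.append((words, b_m))
    return corpus


rng = random.Random(2)
corpus = generate_ftm_corpus(num_docs=10, doc_length=30, vocab_size=50,
                             truncation=15, alpha=3.0, gamma=1.0, eta=0.5,
                             rng=rng)
```

Each corpus entry pairs the (topic, word) tokens of a document with its binary vector, which makes it easy to check that a document only ever uses topics it has switched on.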

Posterior Inference • The topic indicator for word i in document m is sampled from its conditional posterior, in which the integral over the document's topic proportions has an analytical form. • This is an important point because it suggests a general framework that can be adapted to other applications.

Posterior Inference • The joint probability of the topic-level parameter and the total number of words assigned to topic k can be written explicitly and is log-differentiable with respect to the model's continuous parameters. • A hybrid Monte Carlo algorithm is used to sample these parameters from their posteriors.

Posterior Inference • The topic weights are sampled from their conditional posterior, as are the binary topic indicators. • Note that if a topic is used in a document, it is automatically considered “active” there, and additional (unused) topics can also be activated.

Empirical Results • The authors considered three different text datasets. • All models were run for 1000 iterations, with the first 500 iterations discarded as burn-in.

Empirical Results • Model Perplexity • Topic Correlation
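For reference, perplexity is conventionally the exponential of the negative mean per-word log-likelihood on held-out text, so lower values are better. The helper below is a generic sketch of that definition; the function name and input format are assumptions, and it is not the authors' evaluation code.

```python
import math


def perplexity(word_log_probs):
    """Exponential of the negative mean per-word log-likelihood (natural log)."""
    return math.exp(-sum(word_log_probs) / len(word_log_probs))


# A model that assigns probability 0.1 to every held-out word is as uncertain
# as a uniform choice among 10 words.
uniform_ten = perplexity([math.log(0.1)] * 4)
```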

Empirical Results • In (a), the authors compare the number of topics each word appears in. The FTM has more concentrated topics. • In (b), the authors show the number of documents each topic appears in. The plot illustrates that the HDP has many topics that appear in only a few documents, while a significant portion of the FTM topics appear in many documents.

Discussion • The authors have proposed a novel model, the IBP compound Dirichlet process (ICD), that decouples across-data topic prevalence from intra-data topic proportions. • The Focused Topic Model (FTM) was developed from the ICD and addresses several key shortcomings of HDP-LDA. • In HDP-LDA, the global prevalence of a topic limits the proportion it can take within a document, but in the FTM, globally rare topics can still make up a large proportion of a document. • The FTM shows improved perplexity relative to the HDP.
