iTopicModel: Information Network-Integrated Topic Modeling Yizhou Sun, Jiawei Han, Jing Gao and Yintao Yu Department of Computer Science University of Illinois at Urbana-Champaign 12/9/2009
Outline • Background and motivation • Related work • Modeling • iTopicModel building and parameter estimation • Discussions of MRF modeling on network structure • Practical Issues • Decide topic number • Build topic hierarchies • Correlations of text and network • Experiments • Conclusion
Background • Topic Modeling • Traditional topic models assume documents are independent of each other
Background (Cont.) • Document networks in the real world • Documents are integrated with information networks • Papers are linked via citations • Webpages are linked via hyperlinks • Blog users are linked via friendship relations
Motivation • Goal: use document networks to improve the quality of topic models • Why and how can links in a document network help build better topic models? • Text information from neighbors is utilized • Extends the co-occurrences among words • Extremely useful for short text documents • Derives topic models consistent with the document network • Neighbors should have similar topic distributions • Determines the number of topics • The network reveals the structure
Outline • Background and motivation • Related work • Modeling • iTopicModel building and parameter estimation • Discussions of MRF modeling on network structure • Practical Issues • Decide topic number • Build topic hierarchies • Correlations of text and network • Experiments • Conclusion
How Do Existing Topic Models Deal with Links? • Traditional topic models: PLSA and LDA • Do not consider the links between documents • Author-Topic Model • Considers links between authors and papers, but the model depends on a particular domain • NetPLSA • Considers links between documents, treating the network as a regularization constraint • Only works for undirected networks • Relational Topic Model (RTM) • Models how each link is generated based on topic distributions • Only works for binary networks; tries to predict links based purely on topic information
Why Not Pure Network Clustering? • Network clustering or graph partitioning algorithms do not use text information at all • The clusters are difficult to interpret • The quality of clusters is not as good as with topic models, since less information is used • The network itself may not be connected, which tends to produce clusters of outliers • E.g., a co-author network
Our Method • Builds a unified generative model for links (structure) and text (content) • Works for directed/undirected and weighted/unweighted document networks
Outline • Background and motivation • Related work • Modeling • iTopicModel building and parameter estimation • Discussions of MRF modeling on network structure • Practical Issues • Decide topic number • Build topic hierarchies • Correlations of text and network • Experiments • Conclusion
Model Set-Up • Graphical model for iTopicModel • ϴi = (ϴi1, ϴi2, …, ϴiT): topic distribution for document xi • Structural layer: follows the same topology as the document network • Text layer: follows PLSA, i.e., for each word, pick a topic z ~ Multi(ϴi), then pick a word w ~ Multi(βz)
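The text layer's generative process can be sketched as follows; this is a minimal illustration with toy values for ϴi and β, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(theta_i, beta, n_words):
    """Text layer of iTopicModel: for each word position, draw a topic
    z ~ Multi(theta_i), then a word w ~ Multi(beta[z])."""
    T, V = beta.shape
    words = []
    for _ in range(n_words):
        z = rng.choice(T, p=theta_i)   # pick a topic from the document's topic distribution
        w = rng.choice(V, p=beta[z])   # pick a word from that topic's word distribution
        words.append(int(w))
    return words

# Toy example: 2 topics over a 4-word vocabulary (illustrative values).
theta_i = np.array([0.7, 0.3])
beta = np.array([[0.4, 0.4, 0.1, 0.1],
                 [0.1, 0.1, 0.4, 0.4]])
doc = generate_document(theta_i, beta, n_words=20)
```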
Objective Function • Objective function: the joint probability • X: observed text information • G: document network • Parameters • ϴ: topic distributions; β: word distributions • ϴ is the most critical: it must be consistent with the text as well as with the network structure • The structure part and the text part can be modeled separately
I. How to Model the Structure Part? • Joint distribution P(ϴ|G) • Needs a global definition • The dilemma of global vs. local definitions • Computationally, a global definition is needed: what is the probability of a structure configuration ϴ given the network G? • Semantically, heuristics can only give local definitions P(ϴi|ϴN(i)): if we know the neighbor configurations of a node, we can reasonably specify P(ϴi|ϴN(i))
The Bridge: MRF • Markov Random Field (MRF) can connect the two definitions • What is an MRF? Given a graph G, with each node i associated with a random variable Fi, F is said to be an MRF if • p(f) > 0 • p(fi|f−i) = p(fi|fN(i)) (the local Markovianity property) • An MRF can be factorized into a global form • p(f) = 1/Z exp{−∑c Vc(f)} • Z: partition function, a normalization constant for the MRF • c: a clique in the graph • Vc(f): potential function for clique c
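The factorization and the local Markovianity property can be checked numerically on a toy MRF; the pairwise potential below is an arbitrary illustrative choice, not one from the paper:

```python
import itertools
import math

# Tiny MRF on a 3-node chain 0-1-2 with binary variables.
# Pairwise potential V(f_i, f_j) = 0 if equal, else 1 (illustrative choice).
edges = [(0, 1), (1, 2)]

def energy(f):
    return sum(0 if f[i] == f[j] else 1 for i, j in edges)

configs = list(itertools.product([0, 1], repeat=3))
Z = sum(math.exp(-energy(f)) for f in configs)        # partition function
p = {f: math.exp(-energy(f)) / Z for f in configs}    # p(f) = 1/Z exp{-sum_c V_c(f)}

def cond_p0(f0, f1, f2):
    """Conditional p(f_0 | f_1, f_2) computed from the joint."""
    return p[(f0, f1, f2)] / (p[(0, f1, f2)] + p[(1, f1, f2)])

# Local Markovianity: given f_1, node 2 is irrelevant to node 0.
assert abs(cond_p0(0, 0, 0) - cond_p0(0, 0, 1)) < 1e-12
```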
Local Probability Definition: Heuristics • In our case, we build a multivariate MRF for ϴ • Heuristics for the local definition • A document's topic distribution ϴi should be closely related to its neighbors', especially its out-neighbors' • The expected value of ϴi should be close to the weighted mean of its neighbors' • The stronger the links from a document to its neighbors, the more we can trust the neighbors (their mean), i.e., a higher probability around the mean
Local Probability Definition: Formula • Model the heuristics using a Dirichlet distribution • We use the neighbors to construct a Dirichlet parameter for each document xi • Define the local probability as a Dirichlet distribution with this parameter:
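A hedged sketch of the heuristic construction: the scale `lam` and the exact parameterization below are assumptions for illustration only; the paper gives the precise formula.

```python
import numpy as np

def dirichlet_param(theta_neighbors, weights, lam=1.0):
    """Sketch: build a Dirichlet parameter for document x_i from the
    weighted mean of its neighbors' topic distributions. `lam` is a
    hypothetical confidence scale; the +1 term adds a uniform prior."""
    weights = np.asarray(weights, dtype=float)
    mean = weights @ theta_neighbors / weights.sum()   # weighted mean of neighbor topics
    return 1.0 + lam * weights.sum() * mean            # precision grows with total link strength

# Two neighbors with topic distributions, linked with weights 2 and 1.
theta_neighbors = np.array([[0.8, 0.2],
                            [0.6, 0.4]])
alpha = dirichlet_param(theta_neighbors, weights=[2.0, 1.0])
# E[theta_i] under Dir(alpha) is alpha / alpha.sum(), near the weighted mean.
```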
Check Heuristics (1) • A document's topic distribution ϴi should be closely related to its neighbors', especially its out-neighbors' • Done by construction
Check Heuristics (2) • The expected value of ϴi should be close to the weighted mean of its neighbors' • If all weights are set to 1, a uniform prior is added
Check Heuristics (3) • The stronger the links from a document to its neighbors, the more we can trust the neighbors (their mean), i.e., a higher probability around the mean • The precision of the Dirichlet distribution tells how confident a configuration is around the mean
Example of Precision • Beta(2,2) vs. Beta(50,50) • The Beta distribution is the two-dimensional Dirichlet distribution: p = (p1, p2) ~ Beta(α, β), with p1 + p2 = 1
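The contrast between Beta(2,2) and Beta(50,50) can be quantified by variance: both have mean 0.5, but the higher-precision Beta(50,50) concentrates far more mass around it.

```python
def beta_variance(a, b):
    """Variance of p1 ~ Beta(a, b): a*b / ((a+b)^2 (a+b+1)).
    Higher precision a+b means smaller variance, i.e. mass
    concentrates around the mean a/(a+b)."""
    return a * b / ((a + b) ** 2 * (a + b + 1))

v_low = beta_variance(2, 2)     # precision 4  -> variance 0.05
v_high = beta_variance(50, 50)  # precision 100 -> variance ~0.0025
assert v_high < v_low
```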
Global Probability Definition • Give the global definition corresponding to the local definition • Cliques only involve single nodes and links • Potential functions for larger cliques are set to 0 • Potential function: • Joint distribution:
Local and Global Definition Equivalence • For a local structure:
II. How to Model the Text Part? • P(X|ϴ, β) = • Each document is conditionally independent given the current structure • Each document is modeled as in PLSA
Parameter Estimation • Objective function: find parameters ϴ, β that maximize the log-likelihood of the joint probability (a text part plus a structure part) • Approximate inference using EM
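One EM-style iteration can be sketched as below: standard PLSA E/M updates, with the network-derived Dirichlet parameter folded into the ϴ update as pseudo-counts. This is an illustrative approximation under stated assumptions, not the paper's exact update.

```python
import numpy as np

def em_step(counts, theta, beta, alpha):
    """One hedged EM-style update for a PLSA model with a Dirichlet prior.
    counts: (D, V) word counts; theta: (D, T); beta: (T, V);
    alpha: (D, T) network-derived Dirichlet parameters (assumed > 1)."""
    # E-step: posterior p(z | d, w) proportional to theta[d, z] * beta[z, w]
    post = theta[:, :, None] * beta[None, :, :]      # (D, T, V)
    post /= post.sum(axis=1, keepdims=True)
    # M-step: expected topic-word counts
    expected = counts[:, None, :] * post             # (D, T, V)
    # theta update with (alpha - 1) acting as pseudo-counts from the network
    theta_new = expected.sum(axis=2) + (alpha - 1.0)
    theta_new /= theta_new.sum(axis=1, keepdims=True)
    beta_new = expected.sum(axis=0)
    beta_new /= beta_new.sum(axis=1, keepdims=True)
    return theta_new, beta_new

# Toy run: 3 documents, 6-word vocabulary, 2 topics.
rng = np.random.default_rng(1)
counts = rng.integers(0, 5, size=(3, 6)).astype(float)
theta = np.full((3, 2), 0.5)
beta = np.full((2, 6), 1 / 6)
alpha = np.full((3, 2), 1.5)
theta, beta = em_step(counts, theta, beta, alpha)
```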
Discussions of MRF Modeling on Network Structure • Can we define other MRFs on the network structure? • Yes; NetPLSA is a special case • Local definition • Global definition
Outline • Background and motivation • Related work • Modeling • iTopicModel building and parameter estimation • Discussions of MRF modeling on network structure • Practical Issues • Decide topic number • Build topic hierarchies • Correlations of text and network • Experiments • Conclusion
Decide Topic Number • Q-function • Evaluates the modularity of a network partition • Best topic number T • Maximizes the Q-function as T varies
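Newman's modularity Q for a hard partition can be computed directly. On a toy network of two triangles joined by one edge, the natural two-cluster split scores higher than putting everything in one cluster:

```python
import numpy as np

def modularity(adj, labels):
    """Newman's modularity Q for an undirected network: fraction of edges
    inside clusters minus the expected fraction under random rewiring."""
    adj = np.asarray(adj, dtype=float)
    m = adj.sum() / 2.0            # number of edges
    degrees = adj.sum(axis=1)
    n = len(labels)
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += adj[i, j] - degrees[i] * degrees[j] / (2 * m)
    return q / (2 * m)

# Two triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
adj = np.zeros((6, 6))
for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    adj[a, b] = adj[b, a] = 1
q_split = modularity(adj, [0, 0, 0, 1, 1, 1])   # natural 2-way split
q_one = modularity(adj, [0, 0, 0, 0, 0, 0])     # everything in one cluster
```

Choosing T then amounts to recomputing Q for the partitions induced by T topics and keeping the T with the largest Q.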
Build Topic Hierarchies • Different document networks have inherent granularities of topics • E.g., compare the conference network, co-author network, and co-citation network • Use the Q-function to decide the number of branches
Correlations of Text and Network • Consider two extreme cases of network and text • The links among documents are formed randomly: the network structure will not help topic modeling, and may even deteriorate the results • The links among documents are derived exactly from the text information: the network structure will not improve topic modeling much, since it adds no new information • Correlation measures where a network falls between these extremes
Outline • Background and motivation • Related work • Modeling • iTopicModel building and parameter estimation • Discussions of MRF modeling on network structure • Practical Issues • Decide topic number • Build topic hierarchies • Correlations of text and network • Experiments • Conclusion
Datasets • DBLP • Conference network, author network • Cora • Paper networks via citation, co-authorship, and text
Case Study: Topic Hierarchy Building • First level topic number
Performance Study • Document clustering accuracy • Improves most for short text documents
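Clustering accuracy for topic models is typically computed under the best one-to-one matching of cluster ids to class labels. A small sketch (brute-force over permutations, fine for a handful of topics; use the Hungarian algorithm for many):

```python
import itertools
import numpy as np

def clustering_accuracy(true_labels, pred_labels):
    """Best agreement over all one-to-one mappings from predicted
    cluster ids to true class ids."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    cluster_ids = sorted(set(pred_labels))
    best = 0.0
    for perm in itertools.permutations(sorted(set(true_labels.tolist()))):
        mapping = dict(zip(cluster_ids, perm))
        mapped = np.array([mapping[p] for p in pred_labels])
        best = max(best, float((mapped == true_labels).mean()))
    return best

# Label-swapped but otherwise perfect clustering still scores 1.0.
acc = clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0])
```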
Correlation Study • If the network is built from the text itself, performance will not improve much
Outline • Background and motivation • Related work • Modeling • iTopicModel building and parameter estimation • Discussions of MRF modeling on network structure • Practical Issues • Decide topic number • Build topic hierarchies • Correlations of text and network • Experiments • Conclusions
Conclusions • iTopicModel: a unified model for document networks, with an efficient approximate inference method • Studied practical issues in applying iTopicModel • Experiments show that iTopicModel achieves good performance • Future work • How to combine different networks • Whether other priors provide better results • Topic prediction based on links
Q & A Thanks!