
iTopicModel: Information Network-Integrated Topic Modeling


Presentation Transcript


  1. iTopicModel: Information Network-Integrated Topic Modeling Yizhou Sun, Jiawei Han, Jing Gao and Yintao Yu Department of Computer Science University of Illinois at Urbana-Champaign 12/9/2009

  2. Outline • Background and motivation • Related work • Modeling • iTopicModel building and parameter estimation • Discussions of MRF modeling on network structure • Practical Issues • Decide topic number • Build topic hierarchies • Correlations of text and network • Experiments • Conclusion

  3. Background • Topic modeling • Traditional topic models assume documents are independent of each other

  4. Background (Cont.) • Document networks in the real world • Documents are integrated with information networks • Papers are linked via citations • Webpages are linked via hyperlinks • Blog users are linked via friendship relations • ……

  5. Motivation • Goal: use document networks to improve the quality of topic models • Why and how can links in a document network help build better topic models? • Text information from neighbors is utilized • Extends the co-occurrences among words • Extremely useful for short text documents • Topic models are derived consistently with the current document network • Neighbors should have similar topic distributions • The number of topics can be determined • The network structure suggests how many topics there are

  6. Outline • Background and motivation • Related work • Modeling • iTopicModel building and parameter estimation • Discussions of MRF modeling on network structure • Practical Issues • Decide topic number • Build topic hierarchies • Correlations of text and network • Experiments • Conclusion

  7. How Do Existing Topic Models Deal with Links? • Traditional topic models: PLSA and LDA • Do not consider the links between documents • Author-Topic Model • Considers links between authors and papers, but the model depends on a particular domain • NetPLSA • Considers links between documents, treating the network as a regularization constraint • Only works for undirected networks • Relational Topic Model (RTM) • Models how each link is generated based on topic distributions • Only works for binary networks; tries to predict links based purely on topic information

  8. Why Not Pure Network Clustering? • Network clustering and graph partitioning algorithms do not use text information at all • The resulting clusters are difficult to interpret • Cluster quality is not as good as that of topic models, since less information is used • The network itself may not be connected, which tends to generate clusters of outliers • E.g., a co-author network

  9. Our Method • Builds a unified generative model for links (structure) and text (content) • Works for directed / undirected, weighted / unweighted document networks

  10. Outline • Background and motivation • Related work • Modeling • iTopicModel building and parameter estimation • Discussions of MRF modeling on network structure • Practical Issues • Decide topic number • Build topic hierarchies • Correlations of text and network • Experiments • Conclusion

  11. Model Set-Up • Graphical model for iTopicModel • θi = (θi1, θi2, …, θiT): topic distribution for document xi • Structural layer: follows the same topology as the document network • Text layer: follows PLSA, i.e., for each word, pick a topic z ~ Multi(θi), then pick a word w ~ Multi(βz)
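A minimal sketch of the text layer's generative process in numpy, assuming θi and β are given as arrays (the function name and interface are mine, for illustration):

```python
import numpy as np

def generate_document(theta_i, beta, n_words, rng=None):
    """Sample one document from the text layer (PLSA-style).

    theta_i : (T,) topic distribution for document x_i
    beta    : (T, V) per-topic word distributions
    Returns a list of word indices in [0, V).
    """
    rng = rng or np.random.default_rng()
    words = []
    for _ in range(n_words):
        z = rng.choice(len(theta_i), p=theta_i)    # pick a topic z ~ Multi(theta_i)
        w = rng.choice(beta.shape[1], p=beta[z])   # pick a word  w ~ Multi(beta_z)
        words.append(w)
    return words
```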

  12. Objective Function • Objective function: the joint probability P(X, Θ | G, β) • X: observed text information • G: document network • Parameters • Θ: topic distributions; β: word distributions • Θ is the most critical: it needs to be consistent with the text as well as the network structure • The joint probability factorizes into a structure part P(Θ | G) and a text part P(X | Θ, β), so the two parts can be modeled separately

  13. I. How to Model the Structure Part? • Joint distribution P(Θ | G) • A global definition is needed: what is the probability of a configuration Θ given the current network G? • The dilemma of global vs. local definitions • Computationally, a global definition must be given • Semantically, heuristics can only give local definitions P(θi | θN(i)): knowing the configurations of a node's neighbors, it is natural to specify the conditional probability P(θi | θN(i))

  14. The Bridge: MRF • Markov Random Fields (MRFs) can connect the two definitions! • What is an MRF? Given a graph G with each node i associated with a random variable Fi, F is said to be an MRF if • p(f) > 0 (positivity) • p(fi | f−i) = p(fi | fN(i)) (the local Markov property: the conditional at a node depends only on its neighbors) • An MRF can be factorized into the global form p(f) = (1/Z) exp{−∑c Vc(f)} • Z: the partition function, which can be viewed as a normalization constant for the MRF • c: a clique in the graph • Vc(f): the potential function for clique c

  15. Local Probability Definition: Heuristics • In our case, we build a multivariate MRF over Θ • Heuristics for the local definition • A document's topic distribution θi should be closely related to its neighbors', especially its out-neighbors' • The expected value of θi should be close to the weighted mean of its neighbors' distributions • The larger the strength of the links from a document to its neighbors, the more we can trust the neighbors' mean, i.e., the higher the probability mass around that mean

  16. Local Probability Definition: Formula • Model the heuristics using the Dirichlet distribution • We use the neighbors to construct a Dirichlet parameter αi for each document xi • Define the local probability as a Dirichlet distribution with this parameter (see the reconstruction below)
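The slide's formula image was not preserved. A plausible reconstruction, consistent with slide 18's remark that the all-ones component adds a uniform prior and with slide 19's precision argument: each Dirichlet parameter pools the neighbors' topic distributions weighted by link strength, plus one.

```latex
% Hypothetical reconstruction of slide 16's formula: alpha_i pools the
% neighbors' topic distributions, weighted by link strength w_{ij}.
\alpha_{ik} = \sum_{j \in N(i)} w_{ij}\,\theta_{jk} + 1,
\qquad
p\bigl(\theta_i \mid \theta_{N(i)}\bigr)
  = \mathrm{Dir}\bigl(\theta_i \mid \alpha_i\bigr)
  = \frac{\Gamma\bigl(\sum_{k}\alpha_{ik}\bigr)}{\prod_{k}\Gamma(\alpha_{ik})}
    \prod_{k=1}^{T} \theta_{ik}^{\,\alpha_{ik}-1}.
```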

  17. Check Heuristics (1) • A document's topic distribution θi should be closely related to its neighbors', especially its out-neighbors' • Done: αi is constructed directly from the neighbors' topic distributions

  18. Check Heuristics (2) • The expected value of θi should be close to the weighted mean of its neighbors' distributions • The Dirichlet mean E[θik] = αik / ∑k′ αik′ approaches this weighted mean; if all components are set to 1, a uniform prior is added

  19. Check Heuristics (3) • The larger the strength of the links from a document to its neighbors, the more we can trust the neighbors' mean, i.e., the higher the probability mass around that mean • The precision of a Dirichlet distribution tells how concentrated a configuration is around the mean; it is the sum of the parameters, ∑k αik, which grows with the total link strength ∑j wij

  20. Example of Precision • Beta(2,2) vs. Beta(50,50) • The Beta distribution is the two-dimensional Dirichlet distribution: p = (p1, p2) ~ Beta(α, β), with p1 + p2 = 1 • Both have mean 0.5, but the density f(p1, p2) of Beta(50,50) is far more concentrated around the mean
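A quick numerical check of the slide's point using scipy (an illustration added here, not part of the original deck):

```python
from scipy import stats

# Beta(2,2) and Beta(50,50) share the same mean but differ in precision
# (for a Beta distribution, precision = alpha + beta).
for a, b in [(2, 2), (50, 50)]:
    dist = stats.beta(a, b)
    print(f"Beta({a},{b}): mean={dist.mean():.2f}, "
          f"std={dist.std():.3f}, precision={a + b}")

# Both means are 0.50, but Beta(50,50) has a much smaller standard
# deviation (about 0.050 vs. 0.224): its mass concentrates at the mean.
```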

  21. Global Probability Definition • Give the global definition corresponding to the local definition • Cliques use only single nodes and links; potential functions for larger cliques are set to 0 • The potential function and joint distribution are reconstructed below
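The slide's formulas are missing from the transcript. A plausible reconstruction, with the edge potential chosen so that the resulting Gibbs exponent matches the Dirichlet log-density reconstructed above:

```latex
% Hypothetical reconstruction of slide 21: nonzero potentials only on edges.
V_{(i,j)}(\Theta) = -\,w_{ij}\sum_{k=1}^{T}\theta_{jk}\,\log\theta_{ik},
\qquad
p(\Theta \mid G) = \frac{1}{Z}
  \exp\Bigl\{\sum_{i}\sum_{j \in N(i)} w_{ij}
             \sum_{k=1}^{T}\theta_{jk}\log\theta_{ik}\Bigr\}.
```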

  22. Local and Global Definition Equivalence • For a local structure, conditioning the global joint distribution on a node's neighbors recovers the local Dirichlet definition (a sketch follows)
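The derivation image is not preserved; a sketch under the reconstruction above, keeping only the exponent terms in which θi appears through its out-links:

```latex
% Sketch: collect the global exponent's terms involving theta_i via i -> j.
p\bigl(\theta_i \mid \theta_{-i}\bigr)
  \propto \exp\Bigl\{\sum_{j \in N(i)} w_{ij}\sum_{k}\theta_{jk}\log\theta_{ik}\Bigr\}
  = \prod_{k=1}^{T}\theta_{ik}^{\,\sum_{j \in N(i)} w_{ij}\theta_{jk}}
  \propto \mathrm{Dir}\bigl(\theta_i \mid \alpha_i\bigr),
\qquad \alpha_{ik} = \sum_{j \in N(i)} w_{ij}\theta_{jk} + 1.
```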

  23. II. How to Model the Text Part? • P(X | Θ, β) = ∏i ∏w [∑z θiz βzw]^c(w, xi), the standard PLSA likelihood, where c(w, xi) is the count of word w in document xi • Each document is conditionally independent given the current structure • Each document is modeled as in PLSA
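A direct numpy transcription of this likelihood (function and variable names are mine, for illustration):

```python
import numpy as np

def text_log_likelihood(counts, theta, beta):
    """log P(X | Theta, beta) under PLSA.

    counts : (D, V) word counts, counts[i, w] = c(w, x_i)
    theta  : (D, T) per-document topic distributions
    beta   : (T, V) per-topic word distributions
    """
    mix = theta @ beta                      # (D, V): sum_z theta_iz * beta_zw
    return float((counts * np.log(mix + 1e-12)).sum())
```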

  24. Parameter Estimation • Objective function: find the parameters Θ, β that maximize the log-likelihood of the joint probability, i.e., the text part log P(X | Θ, β) plus the structure part log P(Θ | G) • Approximate inference using EM (a sketch follows)
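A minimal EM sketch in numpy, illustrating the general scheme rather than the paper's exact update rules: the E-step is standard PLSA, and the M-step for θ adds the neighbor-derived Dirichlet parameters (under the reconstruction above) as MAP pseudo-counts.

```python
import numpy as np

def fit_itopic(counts, W, T, n_iter=100, seed=0):
    """EM sketch for network-regularized PLSA.

    counts : (D, V) word counts, counts[i, w] = c(w, x_i)
    W      : (D, D) link weights, W[i, j] = strength of link i -> j
    T      : number of topics
    """
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    theta = rng.dirichlet(np.ones(T), size=D)    # (D, T) topic distributions
    beta = rng.dirichlet(np.ones(V), size=T)     # (T, V) word distributions

    for _ in range(n_iter):
        # E-step: responsibilities p(z|d,w) proportional to theta[d,z]*beta[z,w].
        post = theta[:, None, :] * beta.T[None, :, :]      # (D, V, T)
        post /= post.sum(axis=2, keepdims=True) + 1e-12

        # Expected counts, weighted by the observed word counts.
        n_dz = (counts[:, :, None] * post).sum(axis=1)     # (D, T)
        n_zw = np.einsum('dw,dwt->tw', counts, post)       # (T, V)

        # M-step for beta: standard PLSA update.
        beta = (n_zw + 1e-12) / (n_zw + 1e-12).sum(axis=1, keepdims=True)

        # M-step for theta: MAP update that adds the neighbor-derived
        # Dirichlet parameters alpha_i = W @ theta + 1 as pseudo-counts.
        alpha = W @ theta + 1.0                            # (D, T)
        theta = n_dz + (alpha - 1.0)
        theta /= theta.sum(axis=1, keepdims=True)

    return theta, beta
```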

  25. Discussions of MRF Modeling on Network Structure • Can we define other MRFs on the network structure? • Yes. NetPLSA is a special case, with its own local and global definitions

  26. Outline • Background and motivation • Related work • Modeling • iTopicModel building and parameter estimation • Discussions of MRF modeling on network structure • Practical Issues • Decide topic number • Build topic hierarchies • Correlations of text and network • Experiments • Conclusion

  27. Decide Topic Number • Q-function • Evaluates the modularity of a network partition • Best topic number T • Maximize the Q-function by varying T (see the sketch below)
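A sketch of this procedure using Newman's standard modularity Q (the deck does not show its exact formula, so this is an assumption; the function name is mine):

```python
import numpy as np

def q_function(A, labels):
    """Newman modularity Q of a hard partition of an undirected network.

    A      : (n, n) symmetric adjacency / weight matrix
    labels : (n,) cluster id per node, e.g. each document's top topic
    """
    m = A.sum() / 2.0                 # total edge weight
    degrees = A.sum(axis=1)
    Q = 0.0
    for c in np.unique(labels):
        idx = labels == c
        e_c = A[np.ix_(idx, idx)].sum() / (2.0 * m)  # edge fraction inside c
        a_c = degrees[idx].sum() / (2.0 * m)         # endpoint fraction in c
        Q += e_c - a_c ** 2
    return Q

# Fit the model for each candidate T and keep the T whose induced
# partition (labels[i] = argmax_k theta[i, k]) maximizes Q.
```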

  28. Build Topic Hierarchies • Different document networks have inherent granularities of topics • E.g., compare a conference network, a co-author network, and a co-citation network • Use the Q-function to decide the number of branches at each level

  29. Correlations of Text and Network • Consider two extreme cases of network and text • The links among documents are formed randomly: in this case the network structure will not help topic modeling and may even deteriorate the results • The links among documents are built exactly from the text information: in this case the network structure will not improve topic modeling much, since it carries no new information • The correlation between text and network determines how much the network can help

  30. Outline • Background and motivation • Related work • Modeling • iTopicModel building and parameter estimation • Discussions of MRF modeling on network structure • Practical Issues • Decide topic number • Build topic hierarchies • Correlations of text and network • Experiments • Conclusion

  31. Datasets • DBLP • Conference network, author network • Cora • Paper network via citations, co-authorship, and text

  32. Case Study: Topic Hierarchy Building

  33. Case Study: Topic Hierarchy Building • First level topic number

  34. Performance Study • Document clustering accuracy • The improvement is largest for short text documents

  35. Correlation Study • If the network is built from the text itself, performance will not improve much

  36. Outline • Background and motivation • Related work • Modeling • iTopicModel building and parameter estimation • Discussions of MRF modeling on network structure • Practical Issues • Decide topic number • Build topic hierarchies • Correlations of text and network • Experiments • Conclusions

  37. Conclusions • iTopicModel: a unified model for document networks, with an efficient approximate inference method • Studied several practical issues in using iTopicModel • Experiments show that iTopicModel achieves good performance • Future work • How to combine different networks • Whether other priors provide better results • Topic prediction based on links

  38. Q & A Thanks!
