
Probabilistic Models for Discovering E-Communities

Ding Zhou, Eren Manavoglu, Jia Li, C. Lee Giles, Hongyuan Zha. The Pennsylvania State University. WWW 2006.


Presentation Transcript


  1. Probabilistic Models for Discovering E-Communities Ding Zhou, Eren Manavoglu, Jia Li, C. Lee Giles, Hongyuan Zha The Pennsylvania State University WWW 2006

  2. Outline • Introduction • Related Work • Community-User-Topic Models • Semantic Community Discovery • Experiments • Conclusion

  3. Outline • Introduction • Related Work • Community-User-Topic Models • Semantic Community Discovery • Experiments • Conclusion

  4. Social Network Analysis (SNA) • SNA is an established field in sociology • The goal of SNA • Discovering interpersonal relationships based on various modes of information carriers, such as emails and the Web • The community graph structure • How social actors gather into groups that are tightly connected internally and loosely connected to each other • An important characteristic of all SNs

  5. Discovering Community from Email Corpora • Typically the SN is constructed by measuring the intensity of contacts between email users • An edge indicates that the communication frequency between two users exceeds a certain threshold • Problematic in some scenarios • A spammer in the email system sends out a lot of messages • The lack of semantic interpretation
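
The threshold-based construction described on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `build_contact_graph`, the `(sender, recipient)` message format, and the sample data are all assumptions made for the example.

```python
from collections import Counter

def build_contact_graph(messages, threshold):
    """Return undirected edges whose message count meets the threshold.

    `messages` is an iterable of (sender, recipient) pairs (assumed format).
    """
    counts = Counter()
    for sender, recipient in messages:
        if sender != recipient:
            # frozenset makes the pair order-independent (undirected edge)
            counts[frozenset((sender, recipient))] += 1
    return {pair for pair, n in counts.items() if n >= threshold}

# Hypothetical toy traffic: a real working relationship and a spammer
messages = [
    ("alice", "bob"), ("bob", "alice"), ("alice", "bob"),
    ("spammer", "alice"), ("spammer", "alice"), ("spammer", "alice"),
    ("carol", "bob"),
]
edges = build_contact_graph(messages, threshold=3)
```

In this toy data the spammer ends up with an edge to alice that looks exactly like the alice/bob edge, which is the weakness the slide points out: frequency alone carries no semantic information.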

  6. Proposed Method • The inner community property within SNs is examined by analyzing semantic information such as emails • A generative Bayesian network is used to model the generation of communication in an SN • Similarity among social actors is modeled as a hidden layer in the proposed probabilistic model

  7. Outline • Introduction • Related Work • Community-User-Topic Models • Semantic Community Discovery • Experiments • Conclusion

  8. Related Work: Document Content Characterization • Several factors, either observable or latent, are modeled as variables in the generative Bayesian network • Topic-Word model • Documents are considered as a mixture of topics • Each topic corresponds to a multinomial distribution over words • Latent Dirichlet Allocation (LDA) [D. Blei et al., 2003]
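
As a rough illustration of the Topic-Word idea (a document as a mixture of topics, each topic a multinomial over words), the generative process can be sketched as follows. The vocabulary, the two topic distributions, and the helper names are hypothetical; only the overall structure follows the slide.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, n_words, rng):
    """Generate words from a per-document mixture of topics."""
    theta = sample_dirichlet(alpha, rng)       # topic mixture for this document
    topic_ids = list(range(len(topic_word)))
    words = []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta)[0]      # pick a topic
        w = rng.choices(vocab, weights=topic_word[z])[0]  # pick a word from it
        words.append(w)
    return theta, words

# Hypothetical vocabulary and topics, for illustration only
vocab = ["gas", "price", "meeting", "lunch"]
topic_word = [
    [0.5, 0.5, 0.0, 0.0],   # a "trading" topic
    [0.0, 0.0, 0.5, 0.5],   # a "scheduling" topic
]
rng = random.Random(0)
theta, words = generate_document(topic_word, vocab, [1.0, 1.0], 10, rng)
```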

  9. Related Work (2) • Author-Word model • The author x is chosen randomly from a_d, the set of authors of document d [A. McCallum, 1999] • Author-Topic model • Involves both the author and the topic • Performs well for document content characterization [M. Steyvers et al., 2004]

  10. Outline • Introduction • Related Work • Community-User-Topic Models • Semantic Community Discovery • Experiments • Conclusion

  11. Community-User-Topic Models (CUT) • Communication document • A document carrier of communication • Basic idea • Issuing a communication document reflects the activities of, and is conditioned on, the community structure within an SN • The community is treated as an extra latent variable in the Bayesian network, in addition to the author and topic variables

  12. CUT1: Modeling Community with Users (1) • Assume an SN community is no more than a group of users • Similar to that assumed in a topology-based method • Treat each community as a multinomial distribution over users

  13. CUT1: Modeling Community with Users (2) • Compute the posterior probability P(c, u, z | w) from the joint probability P(c, u, z, w) • A possible side-effect of CUT1 is that it relaxes the community's impact on the generated topics
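
The step from joint to posterior is just normalization over the latent variables. The sketch below assumes, for illustration only, a chain factorization community → user → topic → word, i.e. P(c, u, z, w) = P(c) P(u|c) P(z|u) P(w|z); the slide's own formula is not in the transcript, and all probability tables here are made-up toy values.

```python
from itertools import product

# Toy tables (hypothetical values); each row sums to 1
p_c = [0.5, 0.5]                            # P(c)
p_u_given_c = [[0.9, 0.1], [0.2, 0.8]]      # P(u | c), rows indexed by c
p_z_given_u = [[0.7, 0.3], [0.1, 0.9]]      # P(z | u), rows indexed by u
p_w_given_z = [[0.6, 0.4], [0.25, 0.75]]    # P(w | z), rows indexed by z

def joint(c, u, z, w):
    """Joint probability under the assumed chain c -> u -> z -> w."""
    return p_c[c] * p_u_given_c[c][u] * p_z_given_u[u][z] * p_w_given_z[z][w]

def posterior(w):
    """P(c, u, z | w): normalize the joint over all latent assignments."""
    scores = {
        (c, u, z): joint(c, u, z, w)
        for c, u, z in product(range(2), repeat=3)
    }
    evidence = sum(scores.values())          # P(w)
    return {k: v / evidence for k, v in scores.items()}

post = posterior(w=0)
best = max(post, key=post.get)               # most probable (c, u, z) for word 0
```

Because the community only reaches the topic through the user in this assumed chain, its influence on topics is indirect, which matches the side-effect noted on the slide.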

  14. CUT2: Modeling Community with Topics (1) • An SN community consists of a set of topics • CUT2 differs from CUT1 in strengthening the relation between community and topic

  15. CUT2: Modeling Community with Topics (2) • Similarly, compute P(c, u, z | w) from the joint probability P(c, u, z, w) • A possible side-effect of CUT2 is that it might lead to loose ties between community and users

  16. Outline • Introduction • Related Work • Community-User-Topic Models • Semantic Community Discovery • Experiments • Conclusion

  17. Practical Algorithm: Gibbs Sampling • Gibbs sampling is an algorithm to approximate the joint distribution of multiple variables by drawing a sequence of samples • Gibbs sampling is a Markov chain Monte Carlo algorithm and usually applies when the conditional probability distribution of each variable can be evaluated
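
A minimal sketch of the idea, on a toy problem unrelated to the CUT models: two binary variables with a known joint distribution, sampled by alternately drawing each variable from its conditional given the other. The joint table and function names are assumptions chosen for the example.

```python
import random

# Known joint over two binary variables (toy example)
JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def sample_x_given_y(y, rng):
    """Draw x from P(x | y), computed from the joint table."""
    p0 = JOINT[(0, y)] / (JOINT[(0, y)] + JOINT[(1, y)])
    return 0 if rng.random() < p0 else 1

def sample_y_given_x(x, rng):
    """Draw y from P(y | x), computed from the joint table."""
    p0 = JOINT[(x, 0)] / (JOINT[(x, 0)] + JOINT[(x, 1)])
    return 0 if rng.random() < p0 else 1

def gibbs(n_samples, burn_in=500, seed=0):
    """Estimate the joint by counting states visited after burn-in."""
    rng = random.Random(seed)
    x, y = 0, 0
    counts = {state: 0 for state in JOINT}
    for step in range(burn_in + n_samples):
        x = sample_x_given_y(y, rng)   # update x holding y fixed
        y = sample_y_given_x(x, rng)   # update y holding x fixed
        if step >= burn_in:
            counts[(x, y)] += 1
    return {state: n / n_samples for state, n in counts.items()}

estimate = gibbs(20000)
```

The sampler only ever evaluates conditionals, yet the visit frequencies in `estimate` approach the entries of `JOINT` as the number of samples grows.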

  18. Gibbs Sampling for CUT

  19. Estimation of the Conditional Probability • Estimating P(c_i, u_i, z_i | w_i) for CUT1 and CUT2 [equations for CUT1 and CUT2 not captured in the transcript]

  20. EnF-Gibbs: Gibbs Sampling with Entropy Filtering • Non-informative words are ignored after A iterations
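
One plausible reading of entropy filtering is sketched below. The slide does not give the criterion, so everything here is an assumption: a word is treated as non-informative when its counts are spread almost evenly across topics (high entropy), and the threshold `max_entropy`, the function names, and the counts are hypothetical.

```python
import math

def topic_entropy(counts):
    """Shannon entropy (in nats) of a word's distribution over topics."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

def informative_words(word_topic_counts, max_entropy):
    """Keep words whose topic distribution is peaked enough."""
    return {
        word
        for word, counts in word_topic_counts.items()
        if topic_entropy(counts) <= max_entropy
    }

# Hypothetical word-topic counts after some sampling iterations
word_topic_counts = {
    "the": [25, 25, 25, 25],      # spread evenly: carries no topic signal
    "pipeline": [97, 1, 1, 1],    # concentrated in one topic
    "meeting": [5, 85, 5, 5],     # concentrated in another topic
}
kept = informative_words(word_topic_counts, max_entropy=1.0)
```

Words dropped this way would no longer be resampled in later iterations, which is where a speed-up over plain Gibbs sampling would come from.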

  21. Outline • Introduction • Related Work • Community-User-Topic Models • Semantic Community Discovery • Experiments • Conclusion

  22. Experiment Setup • Data: Enron email dataset • Made public by the Federal Energy Regulatory Commission • The number of communities C is fixed at 6 and the number of topics T at 20 • The smoothing hyper-parameters α, β and γ were set at 5/T, 0.01 and 0.1 respectively
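
For reference, the settings on this slide written out as a small configuration fragment. The variable names are illustrative; the slide does not say which distribution each hyper-parameter smooths, so no such mapping is assumed here.

```python
C = 6            # number of communities
T = 20           # number of topics
alpha = 5 / T    # evaluates to 0.25 for T = 20
beta = 0.01
gamma = 0.1
```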

  23. Experiment Result-1 Table 1: Topics discovered by CUT1 Table 2: Abbreviations

  24. Experiment Result-2 Fig: Communities/topics of an employee

  25. Experiment Result-3 Fig: A community discovered by CUT2

  26. Experiment Result-4 D..steffes = vice president of Enron in charge of government affairs Cara.semperger = a senior analyst Mike.grigsby = a marketing manager Rick.buy = chief risk management officer

  27. Experiment Result-5 • Similarity between two clustering results: Fig: Community similarity comparisons

  28. Experiment Result-6 Fig: Efficiency of EnF-Gibbs

  29. Outline • Introduction • Related Work • Community-User-Topic Models • Semantic Community Discovery • Experiments • Conclusion

  30. Conclusion and Future Work • Two versions of Community-User-Topic models are presented for community discovery in SNs • EnF-Gibbs sampling is introduced by extending Gibbs sampling with entropy filtering • Experiments show that the proposed method effectively tags communities with topic semantics • It would be interesting to explore the predictive performance of these models on new communications between social actors who are strangers to each other in SNs

  31. Illustration of Dirichlet Distribution Several images of the probability density of the Dirichlet distribution when K=3 for various parameter vectors α. Clockwise from top left: α=(6, 2, 2), (3, 7, 5), (6, 2, 6), (2, 3, 4).
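
To get a feel for how the parameter vector α shapes the distribution, one can draw samples and compare their average with the known mean α_i / Σα. The sketch below uses the standard construction of a Dirichlet sample from normalized gamma draws; the function names and sample size are choices made for this example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def empirical_mean(alpha, n_samples=5000, seed=0):
    """Average many Dirichlet samples component by component."""
    rng = random.Random(seed)
    sums = [0.0] * len(alpha)
    for _ in range(n_samples):
        for i, value in enumerate(sample_dirichlet(alpha, rng)):
            sums[i] += value
    return [s / n_samples for s in sums]

alpha = (6, 2, 2)                      # first parameter vector from the slide
mean = empirical_mean(alpha)
expected = [a / sum(alpha) for a in alpha]
```

For α = (6, 2, 2) the mass concentrates toward the first component, so the sample mean should sit near the theoretical mean held in `expected`.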
