This paper presents a statistical model designed to track popular events in online social communities. With the rise of platforms like Facebook, Twitter, and Blogger, monitoring the diffusion and evolution of interests related to trending topics has become essential. The model integrates a network of users and a stream of documents to analyze popularity trends and content evolution over time. It provides insights into users' interests, allowing researchers to understand how topics emerge and fade within digital communities.
PET: A Statistical Model for Popular Events Tracking in Social Communities Cindy Xide Lin1, Bo Zhao1, Qiaozhu Mei2, Jiawei Han1 1University of Illinois at Urbana-Champaign, 2University of Michigan KDD 2010 2010. 09. 16. Summarized and Presented by Sang-il Song, IDS Lab., Seoul National University
Contents • Introduction • Concept Definition • Problem Definition • Model • Interest model • Topic model • Experiment • Data Collection • Baseline and Gold standard • Analysis on Popularity Trend • Analysis on Content Evolution • Conclusions & Discussions
Introduction • Boom of online communities • e.g., Facebook, Blogger, Twitter, … • They facilitate information creation, sharing, and diffusion. • A popular topic or event can spread much faster. • Need to track the diffusion and evolution of a popular event • Hot topics emerge, prevail, and die • It is desirable to monitor whether people like it, what they like, and how their interests change over time • e.g., Who is still interested in watching Avatar 50 days after its release date?
Introduction • Tracking the evolution of a popular topic is challenging • Diffusion of an event is vague • e.g., You don't know whether I am interested in an event • e.g., and even if you do, you don't know from whom I got this interest. • Fortunately, a large volume of text data is generated by social communities. • Besides communicating with friends, a web user also constantly generates text content such as blog posts. • A network structure and a text collection evolve simultaneously and interrelatedly.
Goal • Tracking Popular Events in a time-variant social community • A stream of text information • A stream of network structures • Modeling the interest of users • Modeling the change of topics
Concept Definition: Network Stream • [Figure: example network snapshot with nodes v1 to v6] • Gk: the snapshot of the network at time tk • G = { G1, G2, …, Gn }
Concept Definition: Document Stream • [Figure: each node vi in the network snapshot is paired with its document dk,i, shown as a sequence of words such as w1, w2, w3, …] • Document collection stream: D = {D1, D2, …, DT} • Document collection: Dk = {dk,1, dk,2, …, dk,N}
Concept Definition: Topic and Event • Topic • A topic θ is a multinomial distribution over words {p(w|θ)}w∈W • A topic has different versions over time; the version at time tk is denoted θk • Event • A stream of topics ΘE = {θ0E, θ1E, θ2E, …, θTE} • θ0E is the primitive topic of the event • θkE corresponds to the version of θ0E at time tk • Indicates the major aspects of the event in network Gk
Concept Definition: Interest • Interest • hk(i): node vi in Gk has a certain level of interest in the particular event at time tk • Real value between 0 and 1 • Hk = {hk(1), hk(2), …, hk(N)}
Problem: Popular Event Tracking • Inputs • Network Stream G • Document Stream D • Primitive topic of an event θ0 • Task1: Popularity Tracking • Inferring the latent stream of interests. (Hk) • providing much richer information about how the interest e • Task2: Topic Tracking • Inferring the latent stream of topics about the event ΘE • Keeping track of the new development about the event, • Understanding event evolution
Intuitions • Observation 1. Interest and Connections • The behavior of each individual is usually influenced by its friends. • Observation 2. Interest and History • The behavior of each individual should be generally consistent over time. • Events should not change dramatically. • Observation 3. Content and Interest • When an individual has a higher level of interest in an event, the content she generates should be more likely to be related to the event
The General Model • Current interest and topic depend on • Current network • Current documents • Previous history (Markovian simplification) • Formal representation • P(Hk, Θk | Gk, Dk, Hk-1)
Assumption • How to model P(Hk, Θk | Gk, Dk, Hk-1)? • Assumption 1. • Given the current network structure Gk and the previous interest status Hk-1, • the current interest status Hk is independent of the document collection Dk • Hk ⊥ Dk | Gk, Hk-1 • People first become interested in the event and therefore generate discussion about it • Assumption 2. • Given the current interest status Hk and the document collection Dk, • the current topic model Θk is independent of Gk and Hk-1 • Θk ⊥ Gk, Hk-1 | Hk, Dk • Once the author has developed an interest in the event, the content she writes will depend only on the event itself and the level of interest • P(Hk, Θk | Gk, Dk, Hk-1) = P(Hk | Gk, Hk-1) P(Θk | Hk, Dk)
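The factorization on the Assumption slide follows from the chain rule plus the two independence assumptions. A short derivation using the slide's own symbols is sketched below; it only spells out the intermediate steps and adds no new modeling choices.

```latex
\begin{align*}
P(H_k, \Theta_k \mid G_k, D_k, H_{k-1})
  &= P(H_k \mid G_k, D_k, H_{k-1}) \, P(\Theta_k \mid H_k, G_k, D_k, H_{k-1})
     && \text{(chain rule)} \\
  &= P(H_k \mid G_k, H_{k-1}) \, P(\Theta_k \mid H_k, G_k, D_k, H_{k-1})
     && \text{(Assumption 1)} \\
  &= P(H_k \mid G_k, H_{k-1}) \, P(\Theta_k \mid H_k, D_k)
     && \text{(Assumption 2)}
\end{align*}
```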
Interest Model • [Figure: worked example of the weighted sum over a node's friends: h' = 1×0.2 + 0.3×0.8 + 0.2×0.1 = 0.46] • Gibbs random field • Widely used in studying natural processes • (Gibbs distribution) • cf. the Gaussian distribution is a special member of the Gibbs distribution family • P(Hk | Gk, Hk-1) • h'(k) is the weighted sum of friends' interest • The first part is the transition energy of node i • The last part represents the neighbors' expectation
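The weighted-sum example on the Interest Model slide can be checked with a short Python sketch. The function name `neighbor_expectation` and the pairing of the numbers into (edge weight, friend's interest) are assumptions made for illustration; the slide shows only the three products and their total.

```python
def neighbor_expectation(weights, interests):
    """Weighted sum of friends' interest levels (h' on the slide)."""
    if len(weights) != len(interests):
        raise ValueError("weights and interests must have the same length")
    return sum(w * h for w, h in zip(weights, interests))


# Assumed pairing of the slide's numbers: (edge weight, friend's interest).
weights = [1.0, 0.3, 0.2]
interests = [0.2, 0.8, 0.1]

h_prime = neighbor_expectation(weights, interests)
print(round(h_prime, 2))  # prints 0.46
```

Because each term is a plain product, swapping which number is the weight and which is the interest within a pair leaves the result unchanged.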
Topic Model • Each document is considered to be generated by a model with two multinomial components • Background model: θkB • Models common words • Latent event topic model: θkE • Models discriminative and meaningful words • The probability of generating a word • P(Θk | Hk, Dk)
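A two-component word model of this kind can be sketched in Python as follows. The slide does not show the mixing formula, so the fixed weight `event_weight` is a hypothetical parameter (Observation 3 suggests it should grow with the author's interest level, but the exact form is not given here), and the toy word distributions are invented for illustration.

```python
def word_probability(word, background, event, event_weight):
    """Probability of a word under a two-component multinomial mixture.

    background, event: dicts mapping word -> probability, each summing to 1.
    event_weight: hypothetical mixing weight in [0, 1] for the event topic.
    """
    if not 0.0 <= event_weight <= 1.0:
        raise ValueError("event_weight must lie in [0, 1]")
    p_background = background.get(word, 0.0)
    p_event = event.get(word, 0.0)
    return (1.0 - event_weight) * p_background + event_weight * p_event


# Toy distributions (invented): common words vs. event-specific words.
background = {"the": 0.5, "a": 0.25, "avatar": 0.25}
event = {"avatar": 0.75, "pandora": 0.25}

print(word_probability("avatar", background, event, event_weight=0.5))  # prints 0.5
print(word_probability("the", background, event, event_weight=0.5))     # prints 0.25
```

With `event_weight` at 0 the document is pure background; at 1 it is drawn entirely from the event topic.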
Twitter Data Collection • Selecting 5000 users with follower-followee relationships • Each day is considered a time point (tk: the kth day) • Document dk,i is obtained by concatenating the tweets displayed by user i on day k • The weight of the relationship between users i and j equals the number of tweets displayed by user i through following user j during the period from tk-30 to tk.
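The construction of the daily documents dk,i can be sketched in Python as below. The `(day, user_id, text)` record layout, the function name, and the sample tweets are assumptions for illustration, not the format used in the original experiments.

```python
from collections import defaultdict


def build_daily_documents(tweets):
    """Concatenate each user's tweets per day into one document.

    tweets: iterable of (day, user_id, text) tuples (assumed layout).
    Returns a dict mapping day k -> {user i -> document d_{k,i}}.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for day, user_id, text in tweets:
        grouped[day][user_id].append(text)
    # Join the tweets of each (day, user) pair in their original order.
    return {
        day: {user: " ".join(texts) for user, texts in users.items()}
        for day, users in grouped.items()
    }


# Invented sample data.
tweets = [
    (1, "u1", "watching avatar tonight"),
    (1, "u1", "avatar was great"),
    (1, "u2", "lunch time"),
    (2, "u1", "still thinking about avatar"),
]
docs = build_daily_documents(tweets)
print(docs[1]["u1"])  # prints: watching avatar tonight avatar was great
```

A user with no tweets on a given day simply has no entry for that day, which a downstream model would need to treat as an empty document.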
Baseline and Gold standard • BOM: daily box office earnings extracted from Box Office Mojo • Box office earnings are a trustworthy criterion for a movie's popularity • GInt: Google Insight • PET • PET-: a special version of PET with the network structure removed • JonK / Cont
Analysis on Popularity Trend • PET always has the best performance • Historical, textual, and structural information is reflected well • PET- cannot respond sufficiently to sudden changes
Conclusion & Discussion • Proposes the novel problem of Popular Event Tracking • Proposes a popular event tracking model, PET • A unified probabilistic framework to model different factors • Covers classical models • Experimental studies show that PET outperforms existing models • PET is not a good framework for tracking interest • More accurate data exist, such as Google Insight. • Tracking topic change is a novel problem. • PET detects and tracks topic evolution well.