On the Bursty Evolution of Blogspace • Ravi Kumar, Jasmine Novak, Prabhakar Raghavan and Andrew Tomkins • IBM Almaden Research Center, Verity Inc. • WWW 2003
Main contributions • Time graph and blog graph • Communities in Blogspace • Temporal bursts: from a sequence of documents to sets of blogs • Links between blogs that are topically and temporally focused • Blogspace evolution
Community Extraction of Blogspace • Communities are collections of pages that provide information on a similar topic or share a point of view. • Kleinberg (2000): co-citation, dense bipartite subgraph (signature) • Flake (2000): network flow
Bursts • Events: model bursts in a stream of events • Trade-off: a large number of short spurious bursts vs. fragmenting long bursts into many smaller bursts • E.g. email: NSF grant (Kleinberg 2002) • Relevant events and irrelevant events • Bursty: the fraction of relevant events alternates between periods where it is large and periods where it is small
Bursty communities of blogs • A community focuses on a given topic within a time interval • One member of the blog poets community posts a series of daily poems about other bloggers • A blogger, Dawn, hosts a poll to determine the funniest and sexiest blogger
Approach • Community Extraction • Burst Analysis
Time Graph • A set V of nodes, where each node v ∈ V has an associated interval D(v) on the time axis (called the duration of v) • A set E of edges, where each e ∈ E is a triple (u, v, t), u and v are nodes in V, and t is a point in time in the interval D(u) ∩ D(v) • G_t = (V_t, E_t): the prefix of the time graph at time t
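The time graph definition above can be sketched as a small data structure. This is a minimal illustration, not the authors' code: the class name `TimeGraph`, its method names, and the numeric time values are hypothetical, and the prefix is assumed to keep every node whose duration has started by time t and every edge created at or before t.

```python
from dataclasses import dataclass, field


@dataclass
class TimeGraph:
    """Hypothetical container for a time graph (names are illustrative)."""
    durations: dict = field(default_factory=dict)  # node -> (start, end) = D(v)
    edges: list = field(default_factory=list)      # triples (u, v, t)

    def add_node(self, v, start, end):
        if start > end:
            raise ValueError("duration must satisfy start <= end")
        self.durations[v] = (start, end)

    def add_edge(self, u, v, t):
        # t must lie in the intersection of D(u) and D(v)
        lo = max(self.durations[u][0], self.durations[v][0])
        hi = min(self.durations[u][1], self.durations[v][1])
        if not lo <= t <= hi:
            raise ValueError("edge time lies outside both durations")
        self.edges.append((u, v, t))

    def prefix(self, t):
        """Return (V_t, E_t) under the assumption stated in the lead-in."""
        nodes = {v for v, (start, _end) in self.durations.items() if start <= t}
        edges = [(u, v, s) for (u, v, s) in self.edges if s <= t]
        return nodes, edges


g = TimeGraph()
g.add_node("a", 0, 10)
g.add_node("b", 2, 8)
g.add_node("c", 5, 9)
g.add_edge("a", "b", 3)
g.add_edge("b", "c", 6)
nodes_at_4, edges_at_4 = g.prefix(4)
```

Storing the timestamp on every edge means any prefix G_t can be recomputed later without keeping a separate snapshot per time step.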
Community Extraction • Finding the densest subgraph is NP-hard, so a heuristic is used • 1. Preprocessing: remove all pages that have more than a certain number of in-links (too famous) • 2. Pruning: vertices of degree 1 and 2 are removed; vertices of degree 3 are checked (K3); the survivors serve as seeds • 3. Expansion: find the vertex with the most links to the current community and add it if its link count meets the threshold t_k
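The expansion step can be sketched as a greedy loop. This is a rough sketch under stated assumptions, not the authors' implementation: the function name `expand_community`, the undirected adjacency-set representation, the alphabetical tie-breaking, and passing t_k as a function of the current community size are all illustrative choices.

```python
def expand_community(adj, seed, threshold):
    """Greedily grow a seed community.

    adj       -- dict mapping each vertex to the set of its neighbours
                 (undirected view of the blog graph; an assumption)
    seed      -- iterable of vertices forming the initial community
    threshold -- hypothetical stand-in for t_k: a function from the current
                 community size to the minimum number of links required
    """
    community = set(seed)
    while True:
        # Count links from each outside vertex into the community.
        counts = {}
        for member in community:
            for nbr in adj.get(member, ()):
                if nbr not in community:
                    counts[nbr] = counts.get(nbr, 0) + 1
        if not counts:
            break
        # Pick the best-connected outsider; ties broken alphabetically.
        best = max(sorted(counts), key=counts.get)
        if counts[best] < threshold(len(community)):
            break
        community.add(best)
    return community


adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c", "e"},
    "e": {"d"},
}
grown = expand_community(adj, {"a", "b", "c"}, threshold=lambda k: 2)
```

In this toy graph, `d` links to all three seed vertices and is absorbed, while `e` has a single link into the community and stays out.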
Burst analysis • Treat the arrival of edges in the blog graph as an event stream • Run Kleinberg's algorithm to obtain the weight of every burst in a community C • Apply this to each extracted community in the graph
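To make the burst-weight idea concrete, here is a simplified two-state version of Kleinberg-style burst detection over batched events. It is a sketch under assumptions, not necessarily the configuration used in the paper: the function names, the scaling parameter `s`, the transition parameter `gamma`, and the toy counts are all illustrative. Each batch has `total[t]` events of which `relevant[t]` are relevant (for example, new links inside one community); the input must contain at least one relevant and one irrelevant event overall.

```python
import math


def _cost(p, r, d):
    """Negative log-likelihood of r relevant events out of d at rate p."""
    log_binom = math.lgamma(d + 1) - math.lgamma(r + 1) - math.lgamma(d - r + 1)
    return -(log_binom + r * math.log(p) + (d - r) * math.log(1 - p))


def two_state_bursts(relevant, total, s=2.0, gamma=1.0):
    """Return a list of (start, end, weight) for maximal burst intervals."""
    n = len(relevant)
    p0 = sum(relevant) / sum(total)   # baseline fraction of relevant events
    p1 = min(s * p0, 0.9999)          # elevated fraction in the burst state
    rates = (p0, p1)
    up_cost = gamma * math.log(n)     # price of entering the burst state

    # Viterbi search for the cheapest state sequence, starting in state 0.
    best = [0.0, float("inf")]
    back = []
    for r, d in zip(relevant, total):
        new_best, ptr = [], []
        for j in (0, 1):
            cands = [best[i] + (up_cost if (i == 0 and j == 1) else 0.0)
                     for i in (0, 1)]
            i_min = 0 if cands[0] <= cands[1] else 1
            new_best.append(cands[i_min] + _cost(rates[j], r, d))
            ptr.append(i_min)
        best = new_best
        back.append(ptr)

    # Trace back the optimal states from the last batch to the first.
    state = 0 if best[0] <= best[1] else 1
    states = [0] * n
    for t in range(n - 1, -1, -1):
        states[t] = state
        state = back[t][state]

    # Burst weight: total cost saved by being in the burst state.
    bursts, t = [], 0
    while t < n:
        if states[t] == 1:
            start, weight = t, 0.0
            while t < n and states[t] == 1:
                weight += (_cost(p0, relevant[t], total[t])
                           - _cost(p1, relevant[t], total[t]))
                t += 1
            bursts.append((start, t - 1, weight))
        else:
            t += 1
    return bursts


bursts = two_state_bursts([1, 1, 1, 9, 9, 9, 1, 1, 1], [10] * 9)
```

Raising `gamma` makes the automaton more reluctant to enter the burst state, which suppresses short spurious bursts at the risk of missing genuine ones.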
Data acquisition • From 7 blog sites: • http://www.blogger.com • http://www.memepool.com • http://www.globeofblogs.com • http://www.metafilter.com • http://blogs.salon.com • http://www.blogtree.com • Web_Logs subtree of Yahoo
Resulting blog graph • About 750K links among about 25K blogs • 22,299 nodes, 70,472 unique edges, 777,653 edges counting multiplicity, i.e. about 11 links per unique edge on average • The time graph is generated from these links
Results - Connectivity • Strongly connected components
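As a rough illustration of how connectivity could be tracked over time, the sketch below computes the size of the largest strongly connected component of the graph formed by all edges created up to a given time. The function names and the toy edge list are hypothetical, and Kosaraju's two-pass algorithm is used here only as one standard way to find strongly connected components; the source does not say which method the authors used.

```python
def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm with iterative depth-first search."""
    fwd = {v: [] for v in nodes}
    rev = {v: [] for v in nodes}
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)

    # Pass 1: record vertices in order of DFS completion on the graph.
    seen, order = set(), []
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(fwd[root]))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing completion order.
    assigned, components = set(), []
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        comp, stack = [], [root]
        while stack:
            node = stack.pop()
            comp.append(node)
            for prev in rev[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(comp)
    return components


def largest_scc_size(timed_edges, t):
    """Size of the largest SCC using only edges (u, v, s) with s <= t."""
    edges = [(u, v) for (u, v, s) in timed_edges if s <= t]
    nodes = sorted({x for e in edges for x in e})
    if not nodes:
        return 0
    return max(len(c) for c in strongly_connected_components(nodes, edges))


timed_edges = [("a", "b", 1), ("b", "c", 2), ("c", "a", 3), ("c", "d", 4)]
sizes = [largest_scc_size(timed_edges, t) for t in (2, 3, 4)]
```

In the toy data the cycle a → b → c → a only closes at time 3, so the largest component jumps from a single vertex to three vertices at that point.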
Conclusion • Presents a detailed picture of a web publishing phenomenon • Around the end of 2001, Blogspace began a dramatic increase in connectedness and in local-scale community structure • Dramatic increase in bursty link-creation behavior • The tools are applicable to other evolving hyperlinked corpora.