The Political Blogosphere and the 2004 U.S. Election: “Divided They Blog”. By Lada Adamic, HP Labs, & Natalie Glance, Intelliseek Applied Research Center. Agenda: General background and terms; Study goals; Methodology: creating 2 data sets; Analysis; Summing up.
US Presidential Election, November 2, 2004
Name: George W. Bush; Party: Republican (“American conservatism”); Home State: Texas; Electoral Vote: 286
Name: John Kerry; Party: Democratic (“Modern American liberalism”); Home State: Massachusetts; Electoral Vote: 251
Political blogs What is a blog? ~35 million blogs worldwide by the end of 2006, and ~173 million in 2011. 2004: 32 million US citizens read blogs. 2004: 63 million used the internet to get informed about politics.
“Blogosphere” as a social network Various ways of drawing the blogosphere graph:
• Each blog/post is a vertex, and a directed edge from post A to post B is added if A contains a link to B.
• Each blog/post is a vertex, and an undirected weighted edge is added between two posts based on their similarity (which can be calculated in various ways).
• And more.
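The first (directed) construction can be sketched in a few lines of Python. This is only an illustration: the function name `build_link_graph` and the blog names are hypothetical, and dropping self-links is an assumption, not something the study states.

```python
def build_link_graph(links):
    """Build a directed graph as {vertex: set of vertices it links to}.

    `links` is an iterable of (source, target) pairs, one per hyperlink
    found in a blog or post.
    """
    graph = {}
    for source, target in links:
        # Make sure both endpoints exist as vertices, even with no out-links.
        graph.setdefault(source, set())
        graph.setdefault(target, set())
        if source != target:  # assumption: self-links are ignored
            graph[source].add(target)
    return graph


# Hypothetical blog names, for illustration only.
links = [
    ("blog_a", "blog_b"),
    ("blog_a", "blog_c"),
    ("blog_c", "blog_a"),
    ("blog_b", "blog_b"),
]
graph = build_link_graph(links)
```

Storing the out-links as sets collapses repeated links between the same pair into one edge; a weighted variant would count them instead.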
“Blogosphere” as a social network Links between blogs may appear in two different ways: 1. Post citations 2. Blogroll links
“The Political Blogosphere and the 2004 US Election: Divided They Blog” Study goals: Identify differences between sub-communities of political blogs (focusing on conservative vs. liberal blogs), in both linking patterns and discussion topics. (Why is this interesting? “cyber-balkanization”)
Dataset #1: Wide Snapshot
• Gather a list of labeled blogs from online blog directories (“BlogCatalog”, “eTalkingHead”, etc.)
• Collect snapshots of the front page of each blog (February 2005)
• Extract links to additional political blogs, and keep only those cited by others at least ~20 times
• Label the new list manually or automatically
• Collect snapshots of the new list and join the 2 lists together
Dataset #1: Wide Snapshot The final dataset contained 1494 listed blogs in total: 759 liberal, 735 conservative. A snapshot of the front page was collected for 676 liberal blogs and 659 conservative ones. No distinction was made between blogroll links and links in specific posts (post citations): all links are referred to as “page links”.
Dataset #1: Wide Snapshot 91% of links stay within their community Conservative blogs show a greater tendency to link: 84% of conservative blogs link to at least one other blog, as opposed to 74% of liberal blogs Conservative blogs link to 15.1 other blogs on average, liberal to 13.6 on average
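Statistics of this kind can be computed directly from a labeled edge list. The sketch below is a toy illustration, not the study's code: the function name `link_stats`, the blog names, and the choice to average out-links over all blogs in a community (rather than only those that link) are assumptions.

```python
def link_stats(edges, label):
    """Return (fraction of within-community links, per-community stats).

    `edges` is an iterable of (source, target) blog pairs;
    `label` maps each blog to its community.
    """
    edges = list(edges)
    within = sum(1 for s, t in edges if label[s] == label[t])
    fraction_within = within / len(edges) if edges else 0.0

    # Distinct blogs each blog links to.
    out_links = {blog: set() for blog in label}
    for s, t in edges:
        out_links[s].add(t)

    stats = {}
    for community in set(label.values()):
        members = [b for b in label if label[b] == community]
        linking = sum(1 for b in members if out_links[b])
        stats[community] = {
            # Share of blogs linking to at least one other blog.
            "share_linking": linking / len(members),
            # Assumption: the mean is taken over all blogs in the community.
            "mean_out_links": sum(len(out_links[b]) for b in members) / len(members),
        }
    return fraction_within, stats


# Hypothetical toy data.
label = {"l1": "liberal", "l2": "liberal", "c1": "conservative", "c2": "conservative"}
edges = [("l1", "l2"), ("c1", "c2"), ("c2", "c1"), ("c1", "l1")]
fraction_within, stats = link_stats(edges, label)
```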
Dataset #1: Wide Snapshot “…as common in almost every large subset of sites on web, the distribution of inlinks is highly uneven, with a few blogs of either persuasion having over a hundred incoming links, while hundreds of blogs have just one or two.”
Dataset #2: Corpus of Posts from Selected Blogs
• From each community, take the 100 blogs with the most page links
• Use “blogPulse” to retrieve the number of post citations pointing at each blog during October and November 2004 (indicating current popularity)
• Choose the top 20 from each list based on the post-citation ranking, omitting a few websites with unusual formats or a primary function other than blogging
• Create a corpus of blog posts from the 40 selected blogs, covering August 2004 to November 2004 (“blogPulse” provides tools to crawl weblog pages and segment them into individual posts)
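The two-stage selection can be sketched as follows. The function `select_blogs`, its parameter names, and the toy counts are hypothetical; in particular, applying the exclusions before the final cut is an assumption, since the slides do not say at which point the unusual sites were dropped.

```python
def select_blogs(page_links, post_citations, first_k=100, final_k=20, excluded=()):
    """Pick blogs for one community in two stages.

    `page_links` maps blog -> number of page links (Dataset #1);
    `post_citations` maps blog -> number of post citations in the window.
    """
    # Stage 1: the first_k blogs with the most page links.
    by_links = sorted(page_links, key=page_links.get, reverse=True)[:first_k]
    # Assumption: excluded sites are removed before the final ranking.
    candidates = [b for b in by_links if b not in excluded]
    # Stage 2: rank the remaining blogs by post citations.
    ranked = sorted(candidates, key=lambda b: post_citations.get(b, 0), reverse=True)
    return ranked[:final_k]


# Hypothetical toy counts, with small k values to keep the example readable.
page_links = {"a": 50, "b": 40, "c": 30, "d": 5}
post_citations = {"a": 3, "b": 9, "c": 7, "d": 100}
chosen = select_blogs(page_links, post_citations, first_k=3, final_k=2)
chosen_without_b = select_blogs(
    page_links, post_citations, first_k=3, final_k=2, excluded={"b"}
)
```

Note that blog "d" never reaches the second stage despite its many post citations, because it falls outside the page-link cut.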
Dataset #2: Corpus of Posts from Selected Blogs 12,470 posts from left-leaning blogs, 10,414 posts from right-leaning blogs. [Table: examples of selected blogs, in columns “Liberal” and “Conservative”]
Analyses:
1. Strength of each community
2. Varied conversations
• Using citations
• Using textual similarity
3. Interaction with mainstream media
4. Occurrences of names of political figures
Analysis 2: Varied conversations The 1st method focused on similarity between blogs based on common links (any URL, not necessarily a blog). Cosine similarity: $\cos(A, B) = \frac{X_A \cdot X_B}{\|X_A\| \, \|X_B\|}$, where $X_A$ is a binary vector whose entry $i$ is set to 1 or 0 according to whether blog A cited URL $i$ or not. Pairwise cosine similarity was computed for all 40 blogs.
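For binary vectors, the cosine reduces to the number of shared URLs divided by the square root of the product of the two URL counts, so sets are enough. A minimal sketch, with the function name `binary_cosine` and the URL placeholders being hypothetical:

```python
import math


def binary_cosine(urls_a, urls_b):
    """Cosine similarity of two blogs represented by the sets of URLs they cite.

    With 0/1 vectors, the dot product is the size of the intersection and
    each norm is the square root of the set size.
    """
    a, b = set(urls_a), set(urls_b)
    if not a or not b:
        return 0.0  # a blog that cites nothing is similar to nothing
    return len(a & b) / math.sqrt(len(a) * len(b))


# Hypothetical cited URLs.
blog_a = {"u1", "u2", "u3", "u4"}
blog_b = {"u3", "u4", "u5", "u6"}
similarity = binary_cosine(blog_a, blog_b)  # 2 shared / sqrt(4 * 4)
```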
Analysis 2: Varied conversations Average similarity between liberal blogs and conservative blogs: 0.03. Average similarity amongst liberal blogs: 0.09. Average similarity amongst conservative blogs: 0.11. The difference is statistically significant (p-value of ~0.004, based on ANOVA). When political blogs were removed from the URLs, the difference was no longer significant (we already saw that conservative blogs tend to link to one another more actively).
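Group averages like these come from splitting all blog pairs into within-community and cross-community pairs. The sketch below shows only that averaging step on made-up numbers; it does not perform the ANOVA test, and the names `group_means`, `table`, and the blog labels are hypothetical.

```python
from itertools import combinations


def group_means(blogs, label, sim):
    """Average pairwise similarity per community, plus a "cross" bucket.

    `sim(a, b)` returns the similarity of two blogs; `label` maps each
    blog to its community.
    """
    totals, counts = {}, {}
    for a, b in combinations(blogs, 2):
        # Same-community pairs go under that community, the rest under "cross".
        key = label[a] if label[a] == label[b] else "cross"
        totals[key] = totals.get(key, 0.0) + sim(a, b)
        counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}


# Made-up similarities, for illustration only.
label = {"l1": "liberal", "l2": "liberal", "c1": "conservative", "c2": "conservative"}
table = {
    frozenset(("l1", "l2")): 0.2,
    frozenset(("c1", "c2")): 0.4,
    frozenset(("l1", "c1")): 0.0,
    frozenset(("l1", "c2")): 0.1,
    frozenset(("l2", "c1")): 0.1,
    frozenset(("l2", "c2")): 0.0,
}
means = group_means(list(label), label, lambda a, b: table[frozenset((a, b))])
```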
Analysis 2: Varied conversations The 2nd method focused on similarity between blogs based on textual content, particularly “informative phrases”. A phrase-finding algorithm was used to identify 498 phrases in the 40 blogs. Similarity was again based on cosine similarity, this time using the TF-IDF metric. TF-IDF (“Term Frequency - Inverse Document Frequency”) is a statistical measure used to evaluate how important a word is to a document in a collection or corpus.
Analysis 2: Varied conversations Reminder: cosine similarity is defined as $\cos(A, B) = \frac{X_A \cdot X_B}{\|X_A\| \, \|X_B\|}$. This time $X_A$ is a vector of TF-IDF weights, where the entry corresponding to phrase $p$ is given, in the standard TF-IDF form, by $\mathrm{tf}_{p,A} \cdot \log(N / n_p)$. $\mathrm{tf}_{p,A}$ is the number of times phrase $p$ appears in blog A. $N = 1{,}768{,}887$ is the number of all blogs in the “blogPulse” dataset, found in Oct-Nov 2004. $n_p$ is the number of blogs containing the phrase $p$ out of all $N$ blogs from “blogPulse”. Results: the average similarity between blogs of opposite persuasions (0.1) was smaller than that of liberal (0.57) and conservative (0.54) pairs.
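A small sketch of the phrase-based similarity, under a stated assumption: each phrase is weighted by the standard TF-IDF form, term count times log(N / number of blogs containing the phrase), which may differ in detail from the weighting the authors used. All names and counts below are hypothetical.

```python
import math


def tfidf_vector(phrase_counts, doc_freq, n_docs):
    """Map each phrase in a blog to tf * log(N / df) (assumed standard form)."""
    return {
        phrase: tf * math.log(n_docs / doc_freq[phrase])
        for phrase, tf in phrase_counts.items()
        if tf > 0
    }


def cosine(x, y):
    """Cosine similarity of two sparse vectors stored as {phrase: weight}."""
    dot = sum(weight * y[phrase] for phrase, weight in x.items() if phrase in y)
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


# Hypothetical counts: how often each phrase appears in each blog, and in
# how many of the n_docs background blogs each phrase occurs at all.
n_docs = 1000
doc_freq = {"phrase_x": 10, "phrase_y": 100, "phrase_z": 100}
vec_a = tfidf_vector({"phrase_x": 2, "phrase_y": 1}, doc_freq, n_docs)
vec_b = tfidf_vector({"phrase_y": 3, "phrase_z": 1}, doc_freq, n_docs)
vec_c = tfidf_vector({"phrase_z": 4}, doc_freq, n_docs)
similarity_ab = cosine(vec_a, vec_b)
```

Rare phrases (small document frequency) get larger weights, so two blogs sharing an unusual phrase score as more similar than two blogs sharing a common one.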
Analysis 3: Interaction with mainstream media Focusing on links to formal news articles, some online news sites (e.g. National Review, Washington Times) were found to receive the majority of their links from conservative blogs, while others (e.g. LA Times, Wall Street Journal) received most of theirs from liberal blogs.
Analysis 3: Interaction with mainstream media [Figures: Dataset #1 and Dataset #2]
Analysis 3: Interaction with mainstream media [Figure: time-series graph of mentions of the “CBS forged documents” article]
Analysis 4: Mentioning names of political figures Overall pattern: Democrats are more often mentioned by right-leaning bloggers, and vice versa...
Summing up
The political blogosphere is, in some ways, divided between liberals and conservatives:
• Links are mostly within each community
• Discussion topics and political figures mentioned differ
• Conservative blogs are more tightly linked
Future research directions:
• Divide posts by author instead of by blog
• How do news and ideas spread in both communities?
• Blogs that count as neither “liberal” nor “conservative”: do they form a bridge in between, or rather a separate community?