Mining Social Media Communities and Content Akshay Java Ph.D. Dissertation Defense

Mining Social Media Communities and Content Akshay Java Ph.D. Dissertation Defense October 16th 2008

Thesis Statement • “It is possible to develop effective algorithms to detect Web-scale communities using their inherent properties, structure, content.”

Key Observations • Understanding communication in social media requires identifying and modeling communities • Communities are a result of collective, social interactions and usage.

Contributions • Developed and evaluated innovative approaches for community detection: a new algorithm for finding communities in social datasets; SimCUT, a novel algorithm for combining structural and semantic information • First to comprehensively analyze two important, new social media forms: feed readership; microblogging usage and communities • Built systems, infrastructure and datasets for the social media research community

Outline • Introduction • Detecting Communities in Social Media • Combining Semantic Information • Case Studies • Feed Usage and Distillation • Microblogging Communities • Future Work • Conclusions

Social Media • Describes the online technologies and practices that people use to share opinions, insights, experiences, and perspectives and engage with each other (~Wikipedia) • User-generated content (UGC) + social network

What you… • Think: blogs • Say: podcasts • See: Flickr, YouTube • Hear: Pandora, Last.fm • Do: Twitter, Jaiku, Pownce • It’s about YOU!

Who are our... • Friends: Facebook • Colleagues: LinkedIn • Virtual avatars: Second Life • Also about US

What we share • Knowledge: Wikipedia • Links: del.icio.us, StumbleUpon • Love/Hate: Yelp, Upcoming • Location: FireEagle, BrightKite • Spaces: Ustream, Qik • How We Share

Communities • Social interactions build communities • Shared Interests • Common Beliefs • Events • Organization/Location

What is a Community? • A community in the real world is represented in a graph as a set of nodes that have more links within the set than outside it. • Graph • Citation Network • Affiliation Network • Sentiment Information • Shared Resource (tags, videos, ...) • Figures: Political Blogs, Twitter Network, Facebook Network

Existing Approaches • Clustering Approach • Agglomerative/Hierarchical: incrementally group similar nodes to form clusters • Figure: Communities in Football League (Hierarchical Clustering), Football Teams

Existing Approaches • Clustering Approach • Agglomerative/Hierarchical • Topological Overlap: similarity is measured in terms of the number of nodes that both i and j link to (Ravasz et al.)
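
The topological-overlap idea on this slide can be sketched as follows. This is a minimal illustration, not the exact measure used in the talk: normalizing by the smaller of the two degrees and counting a direct i-j link are assumptions, and the small graph is made up.

```python
def topological_overlap(adj, i, j):
    """Overlap between nodes i and j in an undirected graph.

    adj maps each node to the set of its neighbours.
    Assumed form: (common neighbours + direct link) / min(degree_i, degree_j).
    """
    common = len(adj[i] & adj[j])        # nodes that both i and j link to
    linked = 1 if j in adj[i] else 0     # 1 if i and j are directly linked
    smaller_degree = min(len(adj[i]), len(adj[j]))
    if smaller_degree == 0:
        return 0.0
    return (common + linked) / smaller_degree


# Made-up example graph.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"d", "f"},
    "f": {"e"},
}
overlap_ab = topological_overlap(adj, "a", "b")  # tightly connected pair
overlap_ae = topological_overlap(adj, "a", "e")  # share only node d
overlap_af = topological_overlap(adj, "a", "f")  # nothing in common
```

An agglomerative method would repeatedly merge the pair of nodes (or clusters) with the highest overlap.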

Existing Approaches • Clustering Approach • Agglomerative/Hierarchical • Divisive/Partition-based (Girvan and Newman) • Normalized Cut (NCut) (Shi and Malik) • Figure: Political Books

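Existing Approaches • Graph Laplacian • Normalized Cuts: $\mathrm{NCut}(A,B) = \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(A,V)} + \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(B,V)}$, where $\mathrm{cut}(A,B)$ is the cost of edges deleted to disconnect the graph and $\mathrm{assoc}(B,V)$ is the total cost of all edges that start from B • The graph is partitioned using the eigenspectrum of the Laplacian (Shi and Malik). • The eigenvector for the second smallest eigenvalue of the graph Laplacian is the Fiedler vector. • The graph can be recursively partitioned using the sign of the values in its Fiedler vector.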
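
A single bisection step of this kind can be sketched with NumPy. This is an illustrative sketch, not the implementation evaluated in the talk: it assumes a symmetric weight matrix with no isolated nodes, uses the symmetric normalized Laplacian, and splits at zero; the function names and the two-triangle graph are made up.

```python
import numpy as np


def fiedler_bisect(W):
    """Split a graph in two using the sign of the Fiedler vector.

    W: symmetric, non-negative weight matrix with no isolated nodes.
    Returns a boolean mask marking one side of the cut.
    """
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)       # eigenvalues in ascending order
    # Map back to the generalized problem (D - W) y = lambda * D * y
    fiedler = d_inv_sqrt * eigvecs[:, 1]
    return fiedler >= 0


def ncut_value(W, mask):
    """NCut(A, B) = cut/assoc(A, V) + cut/assoc(B, V) for the split in mask."""
    W = np.asarray(W, dtype=float)
    cut = W[mask][:, ~mask].sum()
    return cut / W[mask].sum() + cut / W[~mask].sum()


# Two triangles (0,1,2) and (3,4,5) joined by the single edge 2-3.
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[u, v] = W[v, u] = 1.0

mask = fiedler_bisect(W)
score = ncut_value(W, mask)
```

Applying `fiedler_bisect` again to each side gives the recursive partitioning the slide describes.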

Existing Approaches • Modularity Score (Newman et al.) • Measure of quality of clustering: $Q = \sum_i (e_{ii} - a_i^2)$ • $e_{ii}$: fraction of intra-community edges • $a_i^2$: expected value of $e_{ii}$ disregarding communities • Q = 0: communities are random • Q > 0: higher values are better • Optimizing modularity is NP-hard (Brandes et al.) • Spectral Methods • Heuristics
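
As a worked example of the modularity score, the sketch below computes Q for an undirected, unweighted graph given as an edge list. The function name and the toy graph are illustrative assumptions; a_i is taken to be the fraction of edge ends attached to community i.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman's Q = sum_i (e_ii - a_i^2) for an undirected, unweighted graph.

    edges: list of (u, v) pairs; community: dict mapping node -> label.
    """
    m = len(edges)
    intra = defaultdict(int)    # edges with both ends inside community c
    ends = defaultdict(int)     # edge ends attached to community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        ends[cu] += 1
        ends[cv] += 1
        if cu == cv:
            intra[cu] += 1
    # e_ii = intra / m, a_i = ends / (2m)
    return sum(intra[c] / m - (ends[c] / (2 * m)) ** 2 for c in ends)


# Two triangles joined by one edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {n: "A" for n in range(6)}

q_split = modularity(edges, two_groups)   # 2 * (3/7 - 1/4) = 5/14
q_single = modularity(edges, one_group)   # no structure captured
```

Placing every node in one community gives Q = 0, while the natural two-triangle split scores higher.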

Limitations • Existing methods: • Do not scale well for Web graphs • Fail to exploit the underlying graph’s distributions • Are unable to use available metadata and semantic features

Thesis Statement • “It is possible to develop effective algorithms to detect Web-scale communities using their inherent properties, structure, content.”

Special Properties of Social Datasets • The Long Tail • 80/20 Rule or Pareto distribution • Few blogs get most attention/links • Most are sparsely connected • Motivation • Web graphs are large, but sparse • Expensive to compute community structure over the entire graph • Goal • Approximate the membership of the nodes using only a small portion of the entire graph.

Special Properties of Social Datasets • Intuition • communities are defined by the core (A) and the membership of the rest of the network (B) can be approximated by how they link to the core. • Direct Method • NCut (Baseline) • Approximation • Singular Value Decomposition (SVD) • Sampling • Heuristic

Approximating Communities (ICWSM ’08) • Nodes ordered by degree • SVD (low rank) • Sampling-based approach: communities can be extracted by sampling only r columns from the head (Drineas et al.) • Heuristic: cluster the head to find initial communities, then assign each tail node to the cluster that it most frequently links to
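
The heuristic on this slide can be sketched as below. This is a simplified illustration under stated assumptions: the graph is undirected, the head is simply the r highest-degree nodes, and the head's community labels are supplied by hand here (in practice they would come from a clustering step such as NCut). All names and data are hypothetical.

```python
from collections import Counter


def split_head_tail(adj, r):
    """Order nodes by degree and return (head, tail), head = top r nodes."""
    ranked = sorted(adj, key=lambda n: len(adj[n]), reverse=True)
    return ranked[:r], ranked[r:]


def assign_tail(adj, head_labels, tail):
    """Give each tail node the community its head neighbours belong to most."""
    labels = dict(head_labels)
    for node in tail:
        votes = Counter(head_labels[n] for n in adj[node] if n in head_labels)
        labels[node] = votes.most_common(1)[0][0] if votes else None
    return labels


# Made-up graph: four well-connected head nodes and six tail nodes.
edge_list = [
    ("A1", "A2"), ("B1", "B2"), ("A1", "B1"),
    ("t1", "A1"), ("t1", "A2"), ("t1", "B1"),
    ("t2", "B1"), ("t2", "B2"),
    ("t3", "A1"), ("t3", "A2"),
    ("t4", "B1"), ("t4", "B2"),
    ("t5", "A1"), ("t5", "A2"), ("t5", "B2"),
    ("t6", "t2"),
]
adj = {}
for u, v in edge_list:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

head, tail = split_head_tail(adj, r=4)
head_labels = {"A1": 0, "A2": 0, "B1": 1, "B2": 1}  # assumed output of clustering the head
labels = assign_tail(adj, head_labels, tail)
```

Tail nodes with no links into the head (here `t6`) stay unassigned, which is one reason the result is only an approximation.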

Approximating Communities (ICWSM ’08) • Dataset: a blog dataset of 6000 blogs • Figure labels: Heuristic Approximation, Original Adjacency; Modularity = 0.51

Approximating Communities (ICWSM ’08) • Advantage: faster detection using a small portion of the graph, and less memory • Complexity: SVD O(n^3), NCut O(nk), Sampling O(r^3), Heuristic O(rk), where n = number of blogs, k = number of clusters, r = number of columns • Figure annotations: Similar Modularity, Lower Time; More Time, Low Modularity

Approximating Communities (ICWSM ’08) • Blog dataset • Social network datasets • Additional evaluations using Variation of Information score

Tags are free meta-data! • Other semantic features: • Sentiments • Named Entities • Readership information • Geolocation information • etc. • How to combine this for detecting communities?

Social Media Graphs • Links Between Nodes and Tags • Links Between Nodes • Simultaneous Cuts

Communities in Social Media A community in the real world is identified in a graph as a set of nodes that have more links within the set than outside it and share similar tags.

SimCUT: Simultaneously Clustering Tags and Graphs (WebKDD ’08) • Figure labels: Nodes, Tags; Fiedler Vector Polarity • β = 0: entirely ignore link information • β = 1: equal importance to blog-blog and blog-tag links • β >> 1: NCut
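
One way to read the role of β is sketched below. The block layout `[[β·G, C], [Cᵀ, 0]]` (G for node-node links, C for node-tag links) is an assumption chosen to match the three β cases on the slide; it is not confirmed by the slides as the exact SimCUT construction, and all names and data are made up.

```python
import numpy as np


def joint_matrix(G, C, beta):
    """Assumed joint weight matrix over nodes and tags: [[beta*G, C], [C.T, 0]]."""
    G = np.asarray(G, dtype=float)
    C = np.asarray(C, dtype=float)
    n_tags = C.shape[1]
    return np.block([[beta * G, C], [C.T, np.zeros((n_tags, n_tags))]])


def sign_cut(W):
    """Bisect by the sign of the Fiedler vector of the normalized Laplacian."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)       # eigenvalues in ascending order
    return d_inv_sqrt * eigvecs[:, 1] >= 0


# Four blogs linked in a ring: the links alone do not favour any one split.
G = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])
# Blogs 0 and 1 use tag 0; blogs 2 and 3 use tag 1.
C = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]])

W_joint = joint_matrix(G, C, beta=1.0)
mask = sign_cut(W_joint)   # entries 0-3 are blogs, entries 4-5 are tags
```

With β = 1 the tags break the tie: blogs 0 and 1 land with tag 0, and blogs 2 and 3 with tag 1. Under this layout, a large β lets the link block dominate, consistent with the slide's note that β >> 1 behaves like NCut on links.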

SimCUT: Simultaneously Clustering Tags and Graphs (WebKDD ’08) • Figure panels: Clustering Only Links; Clustering Links + Tags • β = 0: entirely ignore link information • β = 1: equal importance to blog-blog and blog-tag links • β >> 1: NCut

Datasets • Citeseer (Getoor et al.): Agents, AI, DB, HCI, IR, ML; words used in place of tags • Blog data: derived from the WWE/Buzzmetrics dataset; tags associated with blogs derived from del.icio.us; for dimensionality reduction, 100 topics derived from blog homepages using LDA (Latent Dirichlet Allocation) • Pairwise similarity computed: RBF kernel for Citeseer, cosine for blogs
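
The two pairwise similarity measures named on this slide can be sketched in plain Python. The kernel width `sigma` and the example vectors are assumptions; the slides do not give the parameters that were used.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rbf_kernel(u, v, sigma=1.0):
    """RBF (Gaussian) kernel exp(-||u - v||^2 / (2 * sigma^2)); sigma is assumed."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2 * sigma ** 2))


# Hypothetical topic-weight vectors for three documents.
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]   # same direction as doc_a
doc_c = [0.0, 0.0, 3.0]   # orthogonal to doc_a

cos_ab = cosine_similarity(doc_a, doc_b)
cos_ac = cosine_similarity(doc_a, doc_c)
rbf_aa = rbf_kernel(doc_a, doc_a)
rbf_unit = rbf_kernel([0.0, 0.0], [1.0, 1.0])   # squared distance 2, so exp(-1)
```

Cosine ignores vector length, which suits topic-weight vectors of differing magnitude, while the RBF kernel depends on absolute distance.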

Clustering Tags and Graphs • Figure panels: Clustering Only Links; Clustering Links + Tags

Clustering Tags and Graphs • Clustering only links: accuracy = 36% • Clustering links + tags: accuracy = 62% • Higher accuracy by adding ‘tag’ information

Varying Scaling Parameter β • β >> 1 (only graph): accuracy = 36% • β = 0 (only tags): accuracy = 39% • β = 1 (graphs & tags): accuracy = 62% • Higher accuracy by adding ‘tag’ information • Baselines: simple k-means ~23% (content only, binary); content only ~52% (Getoor et al. 2004)

Effect of Number of Tags, Clusters • Mutual Information: measures the dependence between two random variables; compares results with ground truth • Link-only clustering has lower MI • More semantics helps • Dataset: Citeseer • Similar results for real blog datasets
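
Mutual information between a clustering and the ground-truth labels can be computed as in this sketch. It uses natural logarithms (nats) and no normalization, which may differ from the variant used in the evaluation; the label lists are made up.

```python
import math
from collections import Counter


def mutual_information(labels_a, labels_b):
    """I(A; B) in nats for two labelings of the same items."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # p(a,b) * log( p(a,b) / (p(a) * p(b)) ), written with raw counts
    return sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )


truth = [0, 0, 1, 1]
perfect = [1, 1, 0, 0]      # same grouping, different label names
unrelated = [0, 1, 0, 1]    # independent of the truth

mi_perfect = mutual_information(truth, perfect)
mi_unrelated = mutual_information(truth, unrelated)
```

A clustering that matches the ground truth up to renaming scores log 2 here, and an independent one scores 0, so the measure does not depend on how clusters are labeled.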

Tags are one type of meta-data! • Other semantic information: • Sentiments • Named Entities • Readership information • Geolocation information • etc. • How do we get additional semantics?

Additional Semantics • BlogVox: sentiments and opinions (TREC 06, IJCAI/AND 07) • SemNews: named entities, beliefs, facts (AAAI SS 05, HICS 06, IJSWIS) • Link Polarity: sentiment from anchor text (ICWSM 07) • Readership: feed subscriptions and usage (ICWSM 07)

Key Observations • Understanding communication in social media requires identifying and modeling communities • Communities are a result of collective, social interactions and usage.

Feeds Readership (http://ftm.umbc.edu, ICWSM ’07) • Folders: use folder labels as topics/tags • Group similar folders together • Rank feeds under a “topic”

Feed Subscription Statistics (ICWSM ’07) • 83K publicly listed subscribers • 2.8M feeds, 500K are unique • 26K users (35%) use folders to organize subscriptions • Data collected in May 2006 • Although there may be ~50M+ blogs, only a small fraction get continued user attention in the form of subscriptions

Feeds That Matter (http://ftm.umbc.edu, ICWSM ’07) • Communities from feed subscriptions • A common vocabulary emerges from folder names • Folder names are used as topics. Lower-ranked folders are merged into a higher-ranked folder if there is an overlap and a high cosine similarity • Figure: Folder Usage; axes: # of Users Using a Folder vs. Rank of a Folder (by number of feeds in it)

Tag Cloud After Merging • Folder names are used as topics. Lower-ranked folders are merged into a higher-ranked folder if there is an overlap and a high cosine similarity.
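
The folder-merging rule can be sketched as follows. Several details are assumptions, since the slides do not spell them out: "overlap" is read as the folder names sharing a word, similarity is the cosine between per-folder feed-count vectors, and the threshold of 0.5 and all data are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def merge_folders(folders, threshold=0.5):
    """Map each merged folder name to the higher-ranked folder absorbing it.

    folders: dict of folder name -> {feed: number of users filing it there}.
    Folders are ranked by the number of feeds they contain.
    """
    ranked = sorted(folders, key=lambda f: len(folders[f]), reverse=True)
    kept, merged_into = [], {}
    for name in ranked:
        target = None
        for top in kept:
            shares_word = set(name.split()) & set(top.split())
            if shares_word and cosine(folders[name], folders[top]) >= threshold:
                target = top
                break
        if target is None:
            kept.append(name)
        else:
            merged_into[name] = target
    return merged_into


# Made-up subscription counts.
folders = {
    "political blogs": {"feed_a": 5, "feed_b": 3, "feed_c": 2, "feed_d": 1},
    "political": {"feed_a": 4, "feed_b": 2, "feed_c": 1},
    "tech": {"feed_x": 6, "feed_y": 2},
}
merged = merge_folders(folders)
```

Here "political" is folded into "political blogs" because the names share a word and the feed vectors are nearly parallel, while "tech" stays a separate topic.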

Feed Recommendations (http://ftm.umbc.edu, ICWSM ’07) • Two feeds are similar if they are categorized under similar folders: if you like X you will like... • Feed Distillation for “Politics” • Merged folders: “political”, “political blogs” • Talking Points Memo: by Joshua Micah Marshall • Daily Kos: State of the Nation • Eschaton • The Washington Monthly • Wonkette, Politics for People with Dirty Minds • http://instapundit.com/ • Informed Comment • Power Line • AMERICAblog: Because a great nation deserves the truth • Crooks and Liars • Tech • Knitting

Wikipedia is our collective wisdom Twitter is our collective consciousness
